# Recommending Similar Dresses - Milestone Report

## Problem statement

Often, when searching online for a new dress, I don’t have a clear idea of exactly what I am looking for. I might find a dress that is almost, but not quite, what I want. It would be helpful to see similar items and so have a wider choice of related dresses.

A solution to this problem benefits both sides: large retailers with many dresses for sale (such as Macy’s, Bloomingdale’s, Nordstrom, etc.) can add or improve this feature on their websites and mobile applications, and end users can spend less time browsing and more time buying.

## Data Acquisition

To solve this problem I need a large number of dress images labeled with multiple attributes. I intend to build a convolutional neural network, train it to predict these labels, and then use the predicted labels to identify similar items. After searching for suitable data, I chose the DeepFashion dataset, compiled by The Chinese University of Hong Kong. The dataset can be found here: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
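As a rough illustration of the last step, similar items can be ranked by comparing attribute vectors. The sketch below uses Jaccard similarity, which is an assumption made for illustration and not necessarily the measure this project will end up using; the dress names and label vectors are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two 0/1 attribute vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def most_similar(query, catalog, k=3):
    """Return the k catalog keys ranked by Jaccard similarity to query."""
    ranked = sorted(catalog, key=lambda name: jaccard(query, catalog[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical predicted labels over (maxi, floral, sleeveless, lace).
catalog = {
    "dress_a": [1, 1, 0, 0],
    "dress_b": [1, 1, 1, 0],
    "dress_c": [0, 0, 1, 1],
}
query = [1, 1, 1, 0]
top = most_similar(query, catalog, k=2)
```

Jaccard ignores attributes that both dresses lack, which suits sparse label vectors where most entries are 0.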

The DeepFashion dataset contains over 200,000 images across different clothing categories. Images are provided in two quality levels: low resolution, which uses less memory, and high resolution, which is better suited to neural networks. There are 1000 attributes spread across all categories. The data is organized in several .txt files:

1. list_category_cloth contains the list of clothing categories, each labeled with a unique number
2. list_category_img lists the image urls, each with the number of the clothing category the item belongs to
3. list_attr_cloth contains all attribute names, each numbered according to the group the attribute belongs to (like upper body etc.)
4. list_attr_img is the same url list with vectors of -1, 0 and 1 marking the absence (-1) or presence (1) of each attribute (0 means uncertain)
5. list_bbox contains the coordinates of the bounding box drawn around the object in each image
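A file like list_attr_img can be read with a few lines of plain Python. The sketch below assumes a layout of one count line, one header line, and then one row per image holding the image path followed by the attribute flags. That layout and the sample paths are assumptions for illustration and should be checked against the downloaded files.

```python
import io

# Toy stand-in for list_attr_img; the layout (count line, header line,
# then "path flag flag ...") is an assumption, and the paths are made up.
SAMPLE = """\
3
image_name attribute_labels
img/Dress_A/img_00000001.jpg -1 1 0
img/Dress_A/img_00000002.jpg 1 -1 -1
img/Dress_B/img_00000001.jpg -1 -1 1
"""

def read_attr_file(handle):
    """Return {image path: list of -1/0/1 flags} from an open text file."""
    expected = int(handle.readline())   # first line: number of rows
    handle.readline()                   # second line: header, skipped
    rows = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue                    # ignore blank lines
        rows[parts[0]] = [int(v) for v in parts[1:]]
    if len(rows) != expected:
        raise ValueError(f"expected {expected} rows, found {len(rows)}")
    return rows

attr_rows = read_attr_file(io.StringIO(SAMPLE))
```

Checking the row count against the first line is a cheap way to catch a truncated download.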

## Data Preparation

By combining the first two files above, I obtained the urls of only those images that show dresses. I merged this new dataframe with the fourth file, which holds the attribute information for each image, and used the attribute names from the third file as column names. The result is a dataframe that shows the presence of every attribute for each dress image. I assumed that uncertainty should count as a negative (treating uncertainty as presence of the attribute is an alternative worth considering) and changed -1 into 0, so that 0 covers both absent and uncertain attributes.
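The recoding of the flags can be sketched in plain Python, independent of the dataframe. Both function names and the sample row are made up for illustration; the second function shows the alternative reading in which uncertainty counts as presence.

```python
def binarize(flags):
    """Map -1/0/1 flags to 0/1 so that only 1 counts as present.

    Both -1 (absent) and 0 (uncertain) become 0, which matches treating
    uncertainty as a negative.
    """
    return [1 if v == 1 else 0 for v in flags]

def binarize_optimistic(flags):
    """Alternative reading: treat uncertain (0) as present."""
    return [1 if v >= 0 else 0 for v in flags]

row = [-1, 0, 1, -1, 1]   # made-up flags for one dress
strict = binarize(row)
optimistic = binarize_optimistic(row)
```

The two readings differ only on the uncertain entries, so comparing them on the real data would show how much the choice matters.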

The histogram below shows the number of attributes per dress.

![title](inserts/attributes_per_dress.png)  

Note: Out of 1000 total attributes, each dress image has at most 18.



Because the attributes describe all clothing categories, it makes sense that many of them are irrelevant for dresses. The histogram below shows how the dress count is distributed among attributes.

![title](inserts/dress_count_per_attr.png) 

Note: A large portion of the attributes is not relevant for dresses.

So which attributes should we look at when describing dresses? Below is the list of attributes found in at least 1000 dresses. They can be divided into categories depending on what they describe:

 - Length: 
     - maxi
     - midi
     - mini
     
 - Sleeve:
     - sleeve
     - sleeveless
     - long sleeve
    
 - Full body shape:
     - bodycon
     - fit 
     - flare
     - skater
     - shift
     - sheath
     - belted
     - shirt
     - babydoll
     
 - Top shape:
     - v-neck
     - shoulder
     - sweetheart
     
 - Bottom shape:
     - a-line
     - slit
     
 - Print:
     - printed
     - floral
     - striped
     - abstract
     - tribal
     - paisley
     - rose
     
 - Material:
     - chiffon
     - lace
     - floral lace
     - cotton
     - denim
     
 - Look:
     - mesh
     - beaded
     - textured
     - trim
     - pleated
     - sheer

After filtering the dataframe down to the attributes above and removing rows of all zeroes (dresses with none of these attributes), I obtained 61,414 dresses and 39 attributes to work with.
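The two filtering steps (keep frequent attributes, then drop empty rows) can be sketched on a miniature example. This is a plain-Python illustration under the assumption that the data is a list of 0/1 rows; the function name, the toy rows and the 'hooded' column are made up, and the threshold is lowered from 1000 to 2 to fit the toy data.

```python
def filter_attributes(rows, names, min_count):
    """Keep attributes present in at least min_count rows, then drop empty rows.

    rows  : list of 0/1 lists, one per dress
    names : attribute names, aligned with the columns of rows
    """
    counts = [sum(col) for col in zip(*rows)]           # dresses per attribute
    keep = [i for i, c in enumerate(counts) if c >= min_count]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    kept_rows = [row for row in kept_rows if any(row)]  # remove all-zero rows
    return kept_rows, kept_names

# Made-up miniature example; the report uses a threshold of 1000 on the real data.
names = ["maxi", "floral", "hooded"]
rows = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]
kept_rows, kept_names = filter_attributes(rows, names, min_count=2)
```

The order of the steps matters: a row can become all zeroes only after rare attributes are removed, so empty rows are dropped last. On the real data each row would also need to keep its url so the image can still be found.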

## Data Exploration

With the help of the functions below, I will peek into several of these categories to get a better idea of the actual images.


```python
import random

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

# cloth_attr is the filtered dataframe from the Data Preparation step:
# one row per dress, a 'url' column, and one 0/1 column per attribute.

def get_attr_index(attr):
    """Return the index labels of rows that have the given attribute."""
    return cloth_attr[cloth_attr[attr] == 1].index

def print_image(ind):
    """Show the image stored at the given row index."""
    img = mpimg.imread('data/img/' + cloth_attr.loc[ind, 'url'])
    plt.imshow(img)

def print_5_images(attr_list):
    """Show 5 random images for each attribute name in attr_list."""
    n_rows = len(attr_list)
    plt.figure(figsize=[15, 5 * n_rows])
    for row, attr in enumerate(attr_list):
        # sample without replacement from all matching rows, so the position
        # never goes out of range and no image repeats within a row
        for col, ind in enumerate(random.sample(list(get_attr_index(attr)), 5)):
            plt.subplot(n_rows, 5, row * 5 + col + 1)
            print_image(ind)
            plt.title(attr)
```

One of the first things that defines a dress is its length. Here are some examples:

![title](inserts/length.png)

Another interesting category represents different styles of the dress:

![title](inserts/body_shape.png)

Note that 'fit', 'flare' and 'skater' dresses look very similar, as do 'shift' and 'sheath' dresses. I will keep them as separate attributes for now, but leave open the option of combining them.

One more category to mention is the one reflecting print patterns:

![title](inserts/print_pattern.png)



## Image sizes

As the images above show, sizes vary a lot. Some images are square, others more rectangular (a natural shape for a box around a dress). To be used in a neural network, they all need to be the same size. I took a closer look at the sizes of the low and high resolution dress images; the density plots below show the corresponding height and width distributions:

![title](inserts/lr_height_width.png)

Note that for low resolution images the larger values are found more often.

![title](inserts/hr_height_width.png)

Note that for high resolution images the small values are found more often.

To start, I will use low resolution images, all resized to 300x200, and then move on to high resolution ones, starting with 450x300.
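Resizing every image to one fixed shape stretches those whose aspect ratio differs from the target. One alternative, shown here only as a sketch, is to scale each image to fit inside the target and pad the remainder. The function name is hypothetical, and reading 300x200 as height x width is an assumption (dresses are usually taller than wide).

```python
def fit_within(width, height, target_w=200, target_h=300):
    """Scale (width, height) to fit inside the target, keeping aspect ratio.

    Returns the scaled size and the total padding needed on each axis to
    reach exactly target_w x target_h.
    """
    scale = min(target_w / width, target_h / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return (new_w, new_h), (target_w - new_w, target_h - new_h)

size, pad = fit_within(300, 300)            # a square crop
same_size, no_pad = fit_within(200, 300)    # already the target shape
```

For the square crop the scaled image is 200 wide and 200 high, so 100 rows of padding are needed to reach the full height. Whether padding or plain stretching works better for the network is something to test, not something this sketch decides.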