# Similar Dress Selection

![title](inserts/banner.png)

## Problem statement

Online shopping has entered our lives and is a true lifesaver for busy people, for those who do not like crowds, and at times like now, when a new virus is spreading. The reasons to shop online are endless. However, in an online store you don't have the luxury of just walking around and picking what you like. In most cases you need to enter specific search parameters, and often (I am certainly a strong representative of this behaviour) we don't have a clear idea of what we want. We may find something interesting that is not quite right in terms of look, price or other parameters. In this case a trained model that suggests similar looking items can solve the issue.

In this project I have created a model that selects similar looking dresses. It can be a great tool for online websites and mobile retail applications and can benefit both consumers (they will see similar items and may discover things they would not find otherwise) and retailers (they will be able to display different merchandise, including lesser-known brands that visually resemble more well-known ones).

## Solution outline

The problem is solved in two stages:

1. Feature extraction from computer vision models.
2. Using the extracted features in an unsupervised grouping model that picks the closest images based on the distance between their feature vectors.

For the first stage I follow two approaches: building a custom convolutional neural network and using a pre-trained Inception V3 model (ImageNet weights). The former requires a large collection of pre-labeled images to train the model; I discuss data acquisition in the next section. The pre-trained Inception V3 model has multiple convolutional and mixed inception layers. It was trained on the very large ImageNet dataset (over 1 million labeled images) and is a great tool for feature extraction.

In the second stage I use the predicted features as training data for an unsupervised k-Nearest Neighbors model. The model takes a feature vector as input and selects the specified number of closest vectors from the training data. Each feature vector corresponds to an image, so the model selects the closest images based on their features.

## Data acquisition

As outlined above, a vast collection of labeled images is required to train a custom CNN model. After searching for data I chose the Deep Fashion dataset, carefully crafted by The Chinese University of Hong Kong. The dataset can be found here: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html

This dataset contains over 200,000 images of different clothing categories. Images are provided in two quality types: low resolution for lower memory consumption and high resolution for better use in neural networks. There are 1,000 attributes spread among all categories. The data is organized in several .txt files:

1. list_category_cloth contains the list of clothing categories, each labeled with a unique number
2. list_category_img is a list of image urls with the number of the clothing category to which each item belongs
3. list_attr_cloth contains all attribute names, each numbered according to the category the attribute belongs to (like upper body etc.)
4. list_attr_img is the same url list with a vector of -1, 0 and 1 values for each image: -1 means the attribute is absent, 1 means it is present and 0 means uncertainty
5. list_bbox contains the coordinates of the bounding box built around the object in each image

I am using this dataset mainly to train my own CNN models. For the evaluation stage, however, I am additionally using two other datasets:

1. A relatively small dataset of 101 images scraped from a Google search for the term 'dress'
2. A set of over 8,000 images from an image attribute collection: https://data.world/crowdflower/categorization-dress-patterns

Neither of these sets has any labels, so both are great generalization cases.

## Data Preparation

By combining the first two dataframes from the Deep Fashion dataset I obtained the urls of only those images that represent dresses. By merging this new dataframe with the fourth one, which contains attribute information for each image, and taking the column names from the attribute names in the third one, I constructed a dataframe that shows the presence of every attribute for each dress image. I treated uncertainty as absence of the attribute (viewing uncertainty as presence is a different approach that can also be considered), so I changed -1 into 0.
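
The merge steps above can be sketched with pandas roughly as follows. The tiny in-memory frames and the column names (`category_id`, `category_name`, `url`) are assumptions made for illustration; the real annotation files are much larger and may use different headers.

```python
import pandas as pd

# Toy stand-ins for the annotation files; all names and values are made up.
categories = pd.DataFrame({"category_id": [1, 2],
                           "category_name": ["Blouse", "Dress"]})
images = pd.DataFrame({"url": ["img/a.jpg", "img/b.jpg", "img/c.jpg"],
                       "category_id": [2, 1, 2]})
attr_names = ["floral", "maxi", "lace"]
attr_values = pd.DataFrame([[1, -1, 0], [0, 1, -1], [-1, 1, 1]],
                           columns=attr_names)
attributes = pd.concat([images[["url"]], attr_values], axis=1)

# Keep only the images whose category is "Dress".
dress_urls = images.merge(categories, on="category_id")
dress_urls = dress_urls.loc[dress_urls["category_name"] == "Dress", ["url"]]

# Attach the attribute vectors and collapse -1 (absent) into 0.
dresses = dress_urls.merge(attributes, on="url").reset_index(drop=True)
dresses[attr_names] = dresses[attr_names].replace(-1, 0)
```

After this step every attribute column holds only 0 and 1, so column sums directly give the number of dresses that carry each attribute.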

The histogram below shows the number of attributes per dress.

![title](inserts/attributes_per_dress.png)  

Note: out of 1,000 total attributes, each dress image has at most 18.

As the attributes describe all clothing categories, it makes sense that many of them are irrelevant for dresses.

So what attributes should we be looking for when describing dresses? Below is the list of the attributes that are found in at least 1000 dresses. They can be divided into categories depending on what they describe:

 - Length: 
     - maxi
     - midi
     - mini
     
 - Sleeve:
     - sleeve
     - sleeveless
     - long sleeve
    
 - Full body shape:
     - bodycon
     - fit 
     - flare
     - skater
     - shift
     - sheath
     - belted
     - shirt
     - babydoll
     
 - Top shape:
     - v-neck
     - shoulder
     - sweetheart
     
 - Bottom shape:
     - a-line
     - slit
     
 - Print:
     - printed
     - floral
     - striped
     - abstract
     - tribal
     - paisley
     - rose
     
 - Material:
     - chiffon
     - lace
     - floral lace
     - cotton
     - denim
     
 - Look:
     - mesh
     - beaded
     - textured
     - trim
     - pleated
     - sheer

After filtering the dataframe for only the above attributes and removing rows of all zeroes (dresses without any of these attributes), I obtained 61,414 dresses and 38 attributes to work with. Let's take a quick look at the images.
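
For reference, the filtering step just described could look roughly like this in pandas. The toy table, the attribute names and the threshold of 2 are assumptions for illustration; the project itself uses a threshold of 1000 dresses.

```python
import pandas as pd

# Toy 0/1 attribute table; the real one has 1,000 attribute columns.
attrs = pd.DataFrame({
    "maxi":   [1, 0, 1, 0, 0],
    "floral": [1, 1, 0, 0, 0],
    "zipper": [0, 0, 0, 1, 0],   # rare attribute, will be dropped
})
min_count = 2  # the project uses 1000; 2 suits this toy table

# Keep only the attributes present in at least `min_count` dresses.
counts = attrs.sum(axis=0)
frequent = counts[counts >= min_count].index
filtered = attrs[frequent]

# Drop dresses that have none of the remaining attributes.
filtered = filtered.loc[filtered.sum(axis=1) > 0]
```

Here the last two rows are removed: one only had the rare attribute and the other had no attributes at all.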

One of the first things we define a dress by is its length and here are the examples:

![title](inserts/length.png)

Another interesting category represents different styles of the dress:

![title](inserts/body_shape.png)

Note that 'fit', 'flare' and 'skater' dresses look very similar, as do 'shift' and 'sheath' dresses. I will keep them as separate attributes at first, but leave the option of combining them open.

One more category to mention is the one reflecting print patterns:

![title](inserts/print_pattern.png)

As we can see from the above images, they vary a lot in size. Some are square, some are more rectangular (which is a logical shape for a box around a dress). In order to be used in a neural network they all need to be the same size. I took a deeper look at the sizes of low and high resolution dress images; the density plots below show the corresponding height and width distributions:

![title](inserts/lr_height_width.png)

Note that for low resolution images the larger values are found more often.

![title](inserts/hr_height_width.png)

Note that in case of high resolution images the small values are found more often.

I will first use low resolution images, all resized to 300x200 (height x width), and then move on to high resolution images resized to 450x300.

With the help of the two functions below I resize the low and high resolution images to the above sizes. I also crop the two other test sets to a 3:2 height/width ratio and, for use with the Inception model, to a 1:1 ratio.

    from PIL import Image

    def adjust_ratio(img, target_ratio):
        """Center-crops the image so that height / width equals target_ratio"""
        width, height = img.size
        img_ratio = height / width
        if img_ratio > target_ratio:
            # image is too tall: trim equal strips from the top and bottom
            extra = int((height - target_ratio * width) // 2)
            img = img.crop((0, extra, width, height - extra))
        elif img_ratio < target_ratio:
            # image is too wide: trim equal strips from the left and right
            extra = int((width - height / target_ratio) // 2)
            img = img.crop((extra, 0, width - extra, height))
        return img

    def crop_images(df, target_width, target_height, target_directory, column='url'):
        """Crops and resizes images to the target size and saves them to the target directory"""
        target_ratio = target_height / target_width
        for i in df.index:
            img = Image.open(df.loc[i, column])
            img = adjust_ratio(img, target_ratio)
            img = img.convert('RGB')
            img = img.resize((target_width, target_height))
            # target_directory is expected to end with a path separator
            img.save(target_directory + 'img' + str(i) + '.png')
            del img
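
As a sanity check on the cropping arithmetic, the crop box can be computed for a made-up image size without opening any file. `center_crop_box` is a hypothetical helper written only for this illustration, and the 400x300 size is an arbitrary example.

```python
def center_crop_box(width, height, target_ratio):
    """Return (left, upper, right, lower) for a centered crop whose
    height / width ratio matches target_ratio."""
    if height / width > target_ratio:
        # Too tall: trim rows from the top and bottom.
        extra = (height - target_ratio * width) // 2
        return (0, extra, width, height - extra)
    # Too wide (or already right): trim columns from the left and right.
    extra = (width - height / target_ratio) // 2
    return (extra, 0, width - extra, height)

# A 400x300 (width x height) landscape image cropped to a 3:2 height/width ratio.
box = center_crop_box(400, 300, 1.5)
new_width, new_height = box[2] - box[0], box[3] - box[1]
```

The landscape image loses 100 pixels on each side, which leaves a 200x300 region that can then be resized to 200x300 or 300x450 without distortion.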

As a result I have the following folders to work with:

1. cropped_images_300x200 with cropped low resolution images
2. cropped_images_450x300 with cropped high resolution images
3. cropped_images_300x300 with cropped images for feature extraction from Inception model
4. test_images_large with cropped (3:2 ratio) images from larger test set
5. test_images_small with cropped (3:2 ratio) images from smaller test set
6. test_images_large_squared with cropped (1:1 ratio) images from larger test set
7. test_images_small_squared with cropped (1:1 ratio) images from smaller test set

For training purposes I split the main data into train, test and validation sets. Since the label representation was highly imbalanced (some labels had over 12,000 images and some slightly over 1,000), I ensured that an appropriate percentage of each label was represented in each set. The function below takes 10% of the images belonging to a specified label.


    import pandas as pd

    # function to select test data for a specified column
    def select_test_data(col_name, df, test_split=0.1):
        """Returns indexes of a randomly selected share (10% by default)
        of the rows that have the specified attribute"""
        n = df[col_name].sum()
        test_size = int(round(n * test_split))
        # sample only among the rows that actually have the attribute
        index = df.loc[df[col_name] == 1, col_name].sample(test_size).index
        return pd.Series(index)

The images selected for each label are then combined into one set, with duplicates removed. First a test set is created; then a validation set is created by applying the same function to the remaining train set. As a result there are three sets:

- train set containing 40,277 images
- test set with 11,651 images
- validation set of 9,486 images
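
The way the per-label selections are combined can be sketched as follows. `sample_per_label`, the fixed seed and the toy table are assumptions for illustration, not the exact code used in the project.

```python
import pandas as pd

def sample_per_label(df, labels, frac=0.1, seed=0):
    """Union of row indexes obtained by sampling `frac` of the positive
    rows of every label; the set removes duplicates across labels."""
    chosen = set()
    for label in labels:
        positives = df.index[df[label] == 1]
        n = max(1, round(len(positives) * frac))
        chosen.update(pd.Series(positives).sample(n, random_state=seed))
    return sorted(chosen)

# Toy table: 40 dresses with two imbalanced labels.
df = pd.DataFrame({"maxi": [1] * 30 + [0] * 10,
                   "lace": [0] * 30 + [1] * 10})
labels = ["maxi", "lace"]

test_idx = sample_per_label(df, labels)         # test set first
remaining = df.drop(index=test_idx)
val_idx = sample_per_label(remaining, labels)   # validation from what is left
train = remaining.drop(index=val_idx)
```

Because every label contributes its own share, rare labels still appear in the test and validation sets even when a few frequent labels dominate the data.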

I am also randomly selecting 5 images from each test set (the test set from above and the small and large unlabeled sets) to later test the results of the final models.

Now everything is ready to build a CNN model for feature extraction.

## Feature Extraction

As mentioned in the Solution outline section, the first step in building a selection model is to construct feature vectors for unsupervised learning. For this purpose I explore both building custom convolutional neural network (CNN) models and using a pre-trained computer vision model.
 
### Custom Convolutional Neural Network

The CNN models were built on my local machine, so I started by exploring different model structures and training them on low resolution images for speed and capacity reasons. I then visually checked the results on the test set to choose the structure to train on high resolution images. Before feeding the training and validation data to the models I augmented the training data with random rotations and flips. All data was scaled by dividing pixel values by 255.

I then evaluated the outcome myself and summarized it in the table below, which shows, for each trained model, the number of similar looking (in my view) dresses selected for each test image.

![title](inserts/model_comparison.png)

Note that only the model with 3 hidden layers (without further modifications) has at least two similar predictions for each test dress.

None of the models produced great results, but in my opinion the model with three hidden layers performed better than the others: for every test image it selected at least 2 similar looking dresses. I then used the same structure to train on high resolution images. The model structure is shown in the image below:

![](inserts/CNN_model.png)

The CNN takes its name from the special layers it consists of: convolutional layers. The neurons in the first convolutional layer are not connected to every single pixel of the input. Instead they are connected to pixels in small squares whose size is defined by the kernel size. I am using a 3x3 kernel and the ReLU activation function in each convolutional layer. Each of these layers is followed by a pooling layer that specifies how to transform a 3x3 field into a single neuron. I am using max-pooling layers: they take the maximum value from each field, thus keeping the strongest response in it. The last layer is a fully connected Dense layer. It is the output layer, producing a vector of 38 values that correspond to the selected attribute labels. As the problem is a multi-label classification (each sample can have multiple labels assigned to it), I am using sigmoid as the activation function for the output layer, which provides a probability for each label.

There are two more layers to pay attention to. After the input layer I add a batch normalization layer that standardizes the inputs to the network. It acts as a regularization technique that helps to reduce generalization error and accelerates training. Right before the output layer, a flatten layer is added. All of the images I am dealing with are in color and therefore have one layer for each RGB channel, so the data is multi-dimensional. The flatten layer turns it into a one-dimensional vector to be fed to the Dense output layer.
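
To make the role of each layer concrete, here is a minimal NumPy sketch of one forward pass through a single convolution, pooling, flatten and sigmoid output step. It is not the trained model: the toy image size, the random weights, the single filter, the non-overlapping 3x3 pooling windows and the whole-image standardization used in place of batch normalization are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3_relu(image, kernel):
    """Valid 3x3 convolution of an (H, W, C) image with one filter, then ReLU."""
    h, w, _ = image.shape
    out = np.empty((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            out[r, c] = np.sum(image[r:r + 3, c:c + 3, :] * kernel)
    return np.maximum(out, 0.0)

def max_pool(fmap, size=3):
    """Keep the largest value in every non-overlapping size x size window."""
    h = fmap.shape[0] // size * size
    w = fmap.shape[1] // size * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

image = rng.random((11, 8, 3))                 # toy RGB image, values in [0, 1)
image = (image - image.mean()) / image.std()   # crude stand-in for batch normalization
kernel = rng.normal(size=(3, 3, 3))            # one random 3x3 filter over 3 channels
fmap = max_pool(conv3x3_relu(image, kernel))   # (9, 6) feature map pooled to (3, 2)
flat = fmap.reshape(-1)                        # flatten layer: 1-D vector of 6 values
weights = rng.normal(scale=0.1, size=(flat.size, 38))
probs = sigmoid(flat @ weights)                # one probability per attribute label
```

Because sigmoid is applied to each of the 38 outputs separately, the probabilities do not have to sum to one, which is what allows several labels to be active for the same dress.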

After training the two models (on low and high resolution images), I use both of them for predictions on the hold-out test set and the two additional unlabeled datasets. These predictions are then used by the unsupervised model to recommend similar looking dresses.

### Feature extraction from Inception V3 model

As an alternative to training CNN models from scratch, I use a pre-trained computer vision model, Inception V3 trained on the ImageNet dataset, for feature extraction. I load the model with the code below, which removes the last several layers and uses an internal layer as the output.

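    # assumed imports (standalone Keras; use tensorflow.keras if that is what is installed)
    from keras.applications import inception_v3
    from keras.models import Model

    # build the model that uses input from Inception V3 model and output from its hidden layer
    base_model = inception_v3.InceptionV3(weights='imagenet', include_top=False)
    # auto-generated layer names may differ between sessions; check base_model.summary() if the lookup fails
    model = Model(inputs=base_model.input, outputs=base_model.get_layer('average_pooling2d_9').output)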

Then features are extracted for the three test sets. Each vector has 16384 entries, which should describe an image in far more detail than the 38 features predicted by the small models.
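
An intermediate convolutional layer returns a multi-dimensional feature map per image, so it has to be turned into a single vector before nearest neighbor search. A minimal sketch of that reshaping step, using a placeholder array with a made-up shape instead of real model output:

```python
import numpy as np

# Stand-in for the output of model.predict(images): a batch of feature maps
# with shape (n_images, height, width, channels). The shape is made up.
feature_maps = np.zeros((4, 2, 2, 8), dtype=np.float32)

# One row per image, ready to be used as input for nearest neighbor search.
feature_vectors = feature_maps.reshape(len(feature_maps), -1)
```

The row order is preserved, so row `i` of `feature_vectors` still corresponds to image `i` of the input batch.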

## Unsupervised Nearest Neighbors

After feature extraction with the three models, it is time to make some similar dress selections. For this purpose I am building an unsupervised nearest neighbors model. At the fit stage it indexes the input feature vectors, and at prediction time it returns the indexes of a specified number (5 in my case) of the closest vectors. The model has several parameters (the algorithm used to compute the nearest neighbors, the leaf size, and the metric used to measure distance). I am starting with a default model: auto algorithm (the model itself selects the most appropriate one), a leaf size of 30 and the Euclidean metric (Minkowski with p=2).
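
The default setup described above can be sketched as follows, assuming scikit-learn's `NearestNeighbors` as the implementation (its parameter names match the ones mentioned). The random `features` array is a placeholder for real feature vectors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.random((50, 16))   # placeholder: 50 images, 16 features each

# Default settings described above, written out explicitly.
knn = NearestNeighbors(n_neighbors=5, algorithm="auto", leaf_size=30,
                       metric="minkowski", p=2)
knn.fit(features)

# The query image is part of the fitted data, so ask for one extra
# neighbor and drop the first hit, which is the query image itself.
query = features[7].reshape(1, -1)
dist, ind = knn.kneighbors(query, n_neighbors=6)
similar = ind[0][1:]
```

`similar` holds the row numbers of the 5 closest images, ordered from most to least similar, which can then be mapped back to image files.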

As mentioned earlier, this project does not have a numeric metric to rely on. The results are visual and can be very subjective. In order to compare different models I define the function below, which plots 5 similar images for each model.

    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg

    # define function to compare models
    def compare_models(i, trees, predictions, labels, directory, df, column, k_neighbors=6):
        """Plots the original image and its k_neighbors - 1 most similar dresses for each model,
        expects lists for trees, predictions and labels"""
        l = min(len(trees), len(predictions), len(labels))
        plt.figure(figsize=[k_neighbors*2+2, l*4])
        for j in range(l):
            plt.subplot(l, k_neighbors, k_neighbors*j+1)
            print_image(i, directory=directory, df=df, column=column)
            plt.axis('off')
            plt.title('Original')
            dist, ind = trees[j].kneighbors(predictions[j][i].reshape(1, -1), n_neighbors=k_neighbors)
            # the query is part of the fitted data, so skip the first neighbor (the image itself)
            ind = ind[0][1:]
            for k in range(len(ind)):
                plt.subplot(l, k_neighbors, k_neighbors*j+k+2)
                print_image(ind[k], directory=directory, df=df, column=column)
                plt.axis('off')
                plt.title(labels[j])

    # define function that prints specified image
    def print_image(ind, directory, df, column):
        """Shows the image with a given index"""
        img = mpimg.imread(directory + df.loc[ind, column], format='jpeg')
        plt.imshow(img)
        del img

The testing is done on all pre-selected images from the three test sets (15 images in total). For each image I use the features extracted by both custom CNN models as well as by the Inception V3 model. As a result there are a lot of printed images: 225 images just to see the results for each test image with each feature extraction approach. Therefore, I will only show a few examples here. All of the images can be found in the 'Unsupervised_kNN.ipynb' file.

First let's take a look at the results for the hold-out test set.

#### Image 1 from test set

![title](inserts/test_image_auto.png)

#### Image 2 from test set

![title](inserts/test_image_auto2.png)

The results for the model built on features from the pre-trained model are a lot more solid than those from the other two. Here are the results for the small unlabeled dataset.

#### Image 1 from small unlabeled set
![title](inserts/small_image_auto.png)

#### Image 2 from small unlabeled set

![title](inserts/small_image_auto2.png)

In this case the outcome does not differ much between the models. While each model selects completely different dresses, in each case only some of the pieces are in fact similar. This may be due to the really small size of the dress set (only 101 images) and the lack of enough truly similar looking items. In this case the smaller models can perform well too. Let's see how the models behave on the larger unlabeled set.

#### Image 1 from large unlabeled set

![title](inserts/large_image_auto.png)

#### Image 2 from large unlabeled set

![title](inserts/large_image_auto2.png)

Again, the model trained on features extracted from Inception V3 provides a lot better results. Please note how, in the case of the second dress, the Inception features capture not only the dress but also the background: all of the selected images show models posing in front of brick walls.

As the model using Inception V3 features showed the best results, I explored different parameter values to see if there was any improvement. Changing the algorithm and the leaf size did not change the outcome. Then I tested several metrics in addition to the default Euclidean one (defined as Minkowski with p=2):

- Minkowski with p=3
- Manhattan=sum(|x - y|)
- Chebyshev=max(|x - y|)
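
The difference between these metrics is easy to see on a pair of small vectors. The sketch below uses plain Python and made-up values; `minkowski` and `chebyshev` are illustrative helpers, not the library implementations.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    """Largest coordinate-wise difference between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

x = [0.0, 0.0, 0.0]
y = [3.0, 4.0, 0.0]

manhattan = minkowski(x, y, 1)   # 3 + 4 + 0
euclidean = minkowski(x, y, 2)   # sqrt(9 + 16)
cubic = minkowski(x, y, 3)       # (27 + 64) ** (1/3)
largest = chebyshev(x, y)        # max(3, 4, 0)
```

As p grows, the largest coordinate difference dominates more and more; Chebyshev is the extreme case in which every other coordinate is ignored.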

The results for the test set are shown below:

#### Image 1 from test data set for each metric

![title](inserts/test_image_metric.png)

#### Image 2 from test data set for each metric

![title](inserts/test_image_metric2.png)

Here are the results for the large unlabeled dataset:

#### Image 1 from large test data set for each metric

![title](inserts/large_image_metric.png)

#### Image 2 from large test data set for each metric

![title](inserts/large_image_metric2.png)

Evaluating the performance is again a very subjective decision. However, from my point of view both Minkowski metrics (p=2 and p=3) outperformed the others. In some cases the Manhattan metric showed very similar results. The Chebyshev metric was not as effective, which can be explained by the way it is computed: it takes only the maximum coordinate difference and may not account for all the other details.

## Conclusion

An efficient tool for selecting similar looking dresses can be developed by extracting features with a pre-trained computer vision model and feeding them to an unsupervised k-Nearest Neighbors algorithm. The performance was tested on three different test sets with various numbers of images and different image settings. Unfortunately, there is no ground truth and no numeric metric to measure the model's performance. In the evaluation process I used my own expertise in the dress domain and visual judgement. The default settings for unsupervised kNN produced good results for all test sets.

In a real-life setting the quality of the model can be tested over time. A good way to determine the model's usefulness would be an A/B test with two versions of the website: one without similar dress suggestions and one with them. The sales results from both versions would be an interesting comparison point.

Another way of testing the model is to measure click rates on the suggested dresses, and maybe even their purchase history.
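
If such an experiment were run, the two versions could be compared with a standard two-proportion z-test. The sketch below uses entirely hypothetical visitor and purchase counts, and `two_proportion_z` is an illustrative helper, not part of this project.

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for the difference of two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: version A has no suggestions, version B shows similar dresses.
z, p_value = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
```

A small p-value would suggest that the difference in conversion between the two versions is unlikely to be due to chance alone; the same calculation applies to click rates on suggested dresses.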

In any case it can be a great tool for online websites and mobile retail applications and can benefit both consumers (they will see similar items and may discover things they would not find otherwise) and retailers (they will be able to display different merchandise, including lesser-known brands that visually resemble more well-known ones).

