## Question 1
Different brands interpret the size scale differently. A small t-shirt from brand_A may well be the same size as a medium t-shirt from brand_B. I would say this problem is fairly easy to overcome, as it is only a matter of measurement. You could, for instance, map the sizes of the different brands to a kind of “true scale” whose values are not open to interpretation, i.e. the physical measurements of the t-shirts. However, this doesn’t solve the problem mentioned in my answer to question 3.
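As a minimal sketch of what such a “true scale” could look like: each (brand, size) pair maps to a measurement, and two shirts count as the same size when their measurements agree. The brand names, the chest widths in cm, the `TRUE_SCALE` table and the `same_true_size` helper are all made up for illustration; none of them come from the dataset.

```python
# Hypothetical measurements (chest width in cm). These numbers are invented
# for illustration and are not taken from the dataset.
TRUE_SCALE = {
    ("brand_A", "Small"): 50.0,
    ("brand_A", "Medium"): 53.0,
    ("brand_B", "Medium"): 50.0,
    ("brand_B", "Large"): 53.0,
}


def same_true_size(item_a, item_b, tolerance_cm=1.0):
    """Return True if two (brand, size) pairs measure the same within a tolerance."""
    return abs(TRUE_SCALE[item_a] - TRUE_SCALE[item_b]) <= tolerance_cm


print(same_true_size(("brand_A", "Small"), ("brand_B", "Medium")))  # True
print(same_true_size(("brand_A", "Small"), ("brand_B", "Large")))   # False
```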

## Question 3
We have to distinguish between fit and the preferences of the individual user. A medium t-shirt might fit user X, but she prefers them baggy, so a large t-shirt would actually be the right size for her. It is very hard - if not impossible - to assign a return-risk score to a new user if we don’t know anything about their preferences. Requesting information from users about their size preferences for different brands is therefore absolutely crucial - as EasySize already knows and does.

Let's briefly look at the data:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [108]:
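# the file is tab-separated; pass the separator by keyword (the positional form is rejected by pandas 2.x)
data = pd.read_csv("dataset.csv", sep="\t")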

In [3]:
data.groupby("order_status").size()

order_status
R    14407
S    89626
dtype: int64

In [109]:
# assign returned orders to 1, otherwise 0
data.loc[data['order_status']=='R', 'returned'] = 1
data.loc[data['order_status']=='S', 'returned'] = 0

In [110]:
# maps the size names to a number from 1 to 15

size_list = ['XXSmall', 'XSmall', 'XSmall/Small', 'Small',
             'Small/Medium', 'Medium', 'Medium/Large', 'Large',
             'Large/XLarge', 'XLarge', 'XLarge/XXLarge', 'XXLarge',
             'XXXLarge', 'XXXXLarge', 'XXXXXLarge']
# position in size_list (starting at 1) is the numeric size
size_to_num = {size: n for n, size in enumerate(size_list, start=1)}
data['num_size'] = data['size'].map(size_to_num)

In [112]:
# converts the values of the two columns to np.int64

data['num_size'] = data['num_size'].astype(np.int64)
data['returned'] = data['returned'].astype(np.int64)

## Question 2

We can start by looking at the distribution of return reasons for the returned t-shirts:

In [85]:
data[data['returned']==1]['return_reason'].value_counts()

Too small/short    8066
Too large/long     6341
Name: return_reason, dtype: int64

About 14% of the t-shirts were returned (14,407 of 104,033 orders), and there seems to be a tendency for people to order them too small: 8,066 of the returns were "Too small/short" against 6,341 "Too large/long". It would be nice if we could show that some brands run small while others run large, i.e. that for one group of brands most returns are "too small" and for the other group most returns are "too large". However, this is not the case if we look at the return distribution for the 15 most popular brands:

In [104]:
brands_ordered = data['brand_id'].value_counts().index.tolist() # list of brand ordered with regards to popularity

for brand_id in brands_ordered[:15]:
    brand_data = data[data['brand_id']==brand_id]
    print ("BRAND_ID:", brand_id, "\n", brand_data[brand_data['returned']==1]['return_reason'].value_counts(), "\n")

BRAND_ID: 204 
 Too small/short    915
Too large/long     582
Name: return_reason, dtype: int64 

BRAND_ID: 59 
 Too small/short    390
Too large/long     238
Name: return_reason, dtype: int64 

BRAND_ID: 2339 
 Too small/short    403
Too large/long     244
Name: return_reason, dtype: int64 

BRAND_ID: 75 
 Too small/short    529
Too large/long     225
Name: return_reason, dtype: int64 

BRAND_ID: 66 
 Too small/short    266
Too large/long     220
Name: return_reason, dtype: int64 

BRAND_ID: 144 
 Too large/long     234
Too small/short    213
Name: return_reason, dtype: int64 

BRAND_ID: 216 
 Too small/short    317
Too large/long     153
Name: return_reason, dtype: int64 

BRAND_ID: 2350 
 Too small/short    260
Too large/long     192
Name: return_reason, dtype: int64 

BRAND_ID: 115 
 Too small/short    395
Too large/long     168
Name: return_reason, dtype: int64 

BRAND_ID: 359 
 Too small/short    195
Too large/long     167
Name: return_reason, dtype: int64 

BRAND_ID: 181 
 Too s

For nearly all of the brands shown above, the main reason for returning t-shirts is that they're too small; brand 144 is the only exception, and there the two reasons are almost evenly split. So we can't just adjust the size up for some brands and down for others.
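To compare brands on a single number, the two counts can be turned into a share of "too small" returns per brand. The sketch below uses the counts printed above for three of the brands (204, 75 and 144); the column names `too_small`, `too_large` and `share_too_small` are my own.

```python
import pandas as pd

# Return counts copied from the printed output above for three of the brands.
counts = pd.DataFrame(
    {
        "brand_id": [204, 75, 144],
        "too_small": [915, 529, 213],
        "too_large": [582, 225, 234],
    }
).set_index("brand_id")

# Share of returns with reason "Too small/short".
# A value above 0.5 means "too small" is the dominant return reason.
counts["share_too_small"] = counts["too_small"] / (
    counts["too_small"] + counts["too_large"]
)
print(counts["share_too_small"].round(3))
```

A share close to 0.5, as for brand 144, gives no clear direction to adjust in, whereas a share well above 0.5 suggests the brand runs small.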

In this proposed solution I won’t take into account the preferences users might have for different brands, i.e. a user who prefers size X of brand A but size Y of brand B even though the two are identical in size. I will only look at whether the t-shirt fit them or not. I know this is a simplified version of what goes on in reality, but it is what I can manage within this more or less time-limited assessment.

Let’s convert the XXSmall, … , XXXXXLarge scale into a scale ranging from 1 to 15. The goal is to map the sizes of the shirts to the values of the “true scale” mentioned above in question 1.

We can start by looking at the individual sizes of the different brands: is there a tendency for size 8 shirts of brand X to be returned a lot? Then maybe that size of that brand should be adjusted by some factor and, according to the adjusted score, be placed in a size bin containing all the t-shirts (across all brands) that score similarly. This way we’d end up with 15 bins containing all the t-shirts. Notice that after adjusting the sizes, a t-shirt labelled size 6 (for instance) might end up in the bin for size 7.

One way of determining the adjustment factors could be to look at the relationship between differently sized t-shirts of different brands. If there's a tendency for users to return brand A t-shirts of size 8 but keep brand B t-shirts of size 7, then maybe size 8 of brand A should be adjusted downwards, or vice versa for size 7 of brand B.
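A rough sketch of how such an adjustment could be estimated, assuming a much simpler rule than a fitted factor: for each (brand, size) combination, take the share of orders returned as too small minus the share returned as too large, and shift the size by one bin when that net share passes a threshold. The `adjusted_sizes` function, the `threshold` value and the toy rows are all assumptions for illustration; only the column names and return reasons match the dataset.

```python
import numpy as np
import pandas as pd

# Toy orders using the notebook's column names; the rows themselves are made up.
orders = pd.DataFrame(
    {
        "brand_id": [1] * 10 + [2] * 10,
        "num_size": [8] * 10 + [7] * 10,
        "return_reason": ["Too small/short"] * 6 + [None] * 4
        + ["Too large/long"] * 1 + [None] * 9,
    }
)


def adjusted_sizes(df, threshold=0.5):
    """Shift a (brand, size) by one bin when net returns in one direction exceed `threshold`."""
    flags = df.assign(
        too_small=df["return_reason"] == "Too small/short",
        too_large=df["return_reason"] == "Too large/long",
    )
    # Mean of a boolean column = share of orders with that return reason.
    stats = (
        flags.groupby(["brand_id", "num_size"])[["too_small", "too_large"]]
        .mean()
        .reset_index()
    )
    stats["net"] = stats["too_small"] - stats["too_large"]
    # Runs small -> the shirt belongs in a lower bin; runs large -> a higher bin.
    shift = np.where(stats["net"] > threshold, -1,
                     np.where(stats["net"] < -threshold, 1, 0))
    stats["adjusted_size"] = (stats["num_size"] + shift).clip(1, 15)
    return stats


print(adjusted_sizes(orders)[["brand_id", "num_size", "net", "adjusted_size"]])
```

In the toy data, brand 1 size 8 is mostly returned as too small and moves down to bin 7, where it joins brand 2 size 7, which is mostly kept. A fixed one-bin shift is a deliberate simplification; the threshold would have to be tuned on the real data.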

### One account - multiple users
If a family places an order, there might be shirts of size 2 as well as shirts of size 13. This could be addressed by assigning every account a list of approved sizes, i.e. t-shirt sizes that weren't returned. That way, a new order of a size 3 t-shirt wouldn't necessarily be adjusted towards 13 but towards the correct size 2.
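A minimal sketch of the approved-sizes idea, assuming each order can be tied to an account identifier (I don't know whether the dataset has such a column, so `account_id` and the two helper functions are hypothetical). The ordered size is resolved to the closest size the account has previously kept.

```python
from collections import defaultdict


def build_approved_sizes(orders):
    """Collect, per account, the sizes that were kept.

    `orders` is an iterable of (account_id, num_size, returned) tuples.
    """
    approved = defaultdict(set)
    for account_id, num_size, returned in orders:
        if not returned:
            approved[account_id].add(num_size)
    return approved


def resolve_size(account_id, ordered_size, approved):
    """Pick the approved size closest to the ordered one.

    Falls back to the ordered size if the account has no approved sizes yet.
    Ties are broken in favour of the smaller size.
    """
    sizes = approved.get(account_id)
    if not sizes:
        return ordered_size
    return min(sizes, key=lambda s: (abs(s - ordered_size), s))


# Made-up history: the family kept sizes 2 and 13 and returned a size 3.
history = [("family_1", 2, False), ("family_1", 13, False), ("family_1", 3, True)]
approved = build_approved_sizes(history)
print(resolve_size("family_1", 3, approved))  # 2
```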

## Question 4
If we had a mapping from the brand sizes to the "true sizes", this could be done on the fly. The mapping would of course be a bit rigid unless it were recalculated periodically.
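One way this could look in practice, as a sketch only: keep the mapping in memory as a dictionary so that each lookup is a constant-time operation, and rebuild it once it is older than some maximum age. The `SizeMapping` class, the `rebuild_from_orders` placeholder and the single mapping entry are hypothetical and not derived from the dataset.

```python
import time


class SizeMapping:
    """In-memory (brand_id, num_size) -> true size lookup, rebuilt when stale."""

    def __init__(self, rebuild, max_age_seconds=24 * 3600, clock=time.monotonic):
        self._rebuild = rebuild          # callable returning a fresh dict
        self._max_age = max_age_seconds
        self._clock = clock
        self._mapping = rebuild()
        self._built_at = clock()

    def true_size(self, brand_id, num_size):
        # Recalculate the mapping if it is older than the allowed age.
        if self._clock() - self._built_at > self._max_age:
            self._mapping = self._rebuild()
            self._built_at = self._clock()
        # Unknown combinations fall back to the size on the label.
        return self._mapping.get((brand_id, num_size), num_size)


def rebuild_from_orders():
    # Placeholder: a real version would recompute the mapping from recent orders.
    # The single entry below is made up.
    return {(204, 8): 7}


mapping = SizeMapping(rebuild_from_orders)
print(mapping.true_size(204, 8))  # 7
print(mapping.true_size(59, 6))   # 6
```

Rebuilding lazily on lookup keeps the sketch short; a scheduled batch job that swaps in a new dictionary would work just as well.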