## Data Cleaning

Before we can start with our analysis, we have to check and prepare our data. This is a crucial task and is necessary to prevent data-driven mistakes.

In [38]:
# Basic Python Libs
import json
import numpy as np
import pandas as pd

First off, we should get to know the data. So let's start with some simple questions.

* How many orders were recorded?
* Which features were considered?
* Are all orders labeled as kept/returned?
* Do we have categorical and/or numerical features?
* Does every feature have a value for every order?
* Are the values for the specific features consistent?
* ... ?

Always keep asking yourself questions like that. You can never run enough checks to make sure you have a clean dataset to work with! So let's dig into the data and see how we can somewhat efficiently go about these questions. 

In [39]:
data_path = "Data\\"
file_name = "orders.csv"

df = pd.read_csv(data_path+file_name, delimiter=',')
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order,78834,,,,39416.5,22757.6,0.0,19708.2,39416.5,59124.8,78833.0
age_in_years,78834,,,,35.8873,10.3745,15.0,28.0,35.0,42.0,79.0
category,78834,2.0,T-Shirts,41135.0,,,,,,,
product_group,78834,8.0,Slim fit,20504.0,,,,,,,
height_in_cm,78803,,,,181.033,6.7853,149.4,176.3,180.7,185.3,219.5
date_shipped,78834,253.0,2017-06-28 00:00:00,758.0,,,,,,,
size,78834,5.0,M,26691.0,,,,,,,
color,78834,17.0,Blue,40238.0,,,,,,,
item_returned,78834,,,,0.506241,0.499964,0.0,0.0,1.0,1.0,1.0
weight_in_kg,78797,,,,83.0411,11.5758,48.3,75.1,82.4,89.9,179.8


In [40]:
print(df.dtypes)

order              int64
age_in_years       int64
category          object
product_group     object
height_in_cm     float64
date_shipped      object
size              object
color             object
item_returned      int64
weight_in_kg     float64
price_in_euro    float64
dtype: object


Our starting point is the describe method of a Pandas DataFrame.

Here we have a summary of our features.

Numerical:
* order
* age_in_years
* height_in_cm
* item_returned
* weight_in_kg
* price_in_euro

Categorical:
* category
* product_group
* size
* color

Time:
* date_shipped

Excellent choice for the names of the features, since it's clear what was recorded.  
We can also see that we have __$78834$ orders__ in our data.  
For all features but __height_in_cm__ and __weight_in_kg__, we have a value for all orders. Seems like some people were shy about their height and/or weight. Since __item_returned__ is complete, every order also has a label.  
Furthermore we have both numerical and categorical features plus one date feature. We can also grasp that the features __age_in_years__, __height_in_cm__ and __weight_in_kg__ belong to the __customer__, with the other features belonging to the product/shipping.  
As for our categorical features, we can see that there seem to be $2$ __categories__, $8$ __products__, $5$ __sizes__ and $17$ __colors__. Additionally, the __most frequent value__ for each of these features is given.  
Lastly we already get a preview of the __details__ of our numerical features, like the __mean age, height and weight of our customers, the return ratio and the average price paid per order.__  
But at the moment, we are not interested in exploring our data in detail. We just want to understand what we have to work with and run some sanity checks.
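
The questions from the beginning of this section can also be answered with a few one-liners. The following is only a minimal sketch on a made-up toy DataFrame (the values and the name `toy` are assumptions for illustration, not part of the order data); the same calls should work on `df`.

```python
import pandas as pd

# Toy stand-in for the order data (values are made up for illustration).
toy = pd.DataFrame({
    "age_in_years": [25, 41, 33],
    "height_in_cm": [180.2, None, 175.0],
    "size": ["M", "L", "M"],
    "item_returned": [0, 1, 1],
})

# How many orders and features do we have?
n_orders, n_features = toy.shape

# Which features are numerical, which are not?
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Is every order labeled as kept (0) or returned (1)?
labels_ok = toy["item_returned"].isin([0, 1]).all()

# Per feature: does every order have a value?
complete = toy.notna().all()

print(n_orders, n_features)
print(numeric_cols, other_cols)
print(labels_ok)
print(complete)
```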

We will start with the __missing values__ in __height_in_cm__ and __weight_in_kg__. There are several options on how to deal with missing values. At most $68$ orders are affected ($31$ without a height plus $37$ without a weight, if no order lacks both), which corresponds to less than $0.1\,\%$ of our data. So simply removing the affected orders is a reasonable option.

* Comment: Another useful way to check for missing values:
```python
df.isnull().sum()
```
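
The per-column counts only give an upper bound on the number of affected orders, because one order can miss several values at once. A small sketch on made-up toy data (the name `toy` and its values are assumptions) shows how the rows could be counted directly:

```python
import numpy as np
import pandas as pd

# Toy data: one row misses both values, two rows miss one value each.
toy = pd.DataFrame({
    "height_in_cm": [180.2, np.nan, 175.0, np.nan],
    "weight_in_kg": [80.1, np.nan, np.nan, 90.0],
})

missing_per_column = toy.isnull().sum()          # missing values per feature
rows_affected = toy.isnull().any(axis=1).sum()   # rows with at least one gap
share_affected = rows_affected / len(toy)        # fraction of rows dropna() would remove

print(missing_per_column)
print(rows_affected, share_affected)
```

Here the column counts add up to 4, while only 3 rows are actually affected.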

In [41]:
df = df.dropna()
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order,78797,,,,39418.6,22757.8,0.0,19711.0,39422.0,59126.0,78833.0
age_in_years,78797,,,,35.8898,10.3755,15.0,28.0,35.0,42.0,79.0
category,78797,2.0,T-Shirts,41108.0,,,,,,,
product_group,78797,8.0,Slim fit,20499.0,,,,,,,
height_in_cm,78797,,,,181.033,6.78546,149.4,176.3,180.7,185.3,219.5
date_shipped,78797,253.0,2017-06-28 00:00:00,758.0,,,,,,,
size,78797,5.0,M,26689.0,,,,,,,
color,78797,17.0,Blue,40218.0,,,,,,,
item_returned,78797,,,,0.506123,0.499966,0.0,0.0,1.0,1.0,1.0
weight_in_kg,78797,,,,83.0411,11.5758,48.3,75.1,82.4,89.9,179.8


We got rid of our missing values and are left with $78797$ orders. Next up we should explore the values of our features, starting with the categorical ones.

In [42]:
print(np.unique(df.category))
print(np.unique(df.product_group))
print(np.unique(df.color))
print(np.unique(df['size'])) # bad naming! df.size is already a DataFrame attribute

['Shirts casual' 'T-Shirts']
['Polo shirt longsleeves' 'Polo shirt shortsleeves' 'Regular fit'
 'Slim fit' 'T-shirt Basic' 'T-shirt Print' 'T-shirt long sleeves'
 'T-shirt striped / patterned']
['Beige' 'Black' 'Blue' 'Bluec' 'Brown' 'Green' 'Grey' 'Grey  ' 'Metal'
 'Multicolor' 'Orange' 'Pink' 'Red' 'Turquoise' 'Violet' 'White' 'Yellow']
['L' 'M' 'S' 'XL' 'XXL']


The $2$ __categories__ are __Shirts casual__ and __T-Shirts__. They are divided into $8$ groups from __Polo shirt longsleeves__ to __T-shirt striped / patterned__. There seem to be no typos here and the values seem to be fitting.

However, if we take a look at the __colors__, we can spot the colors __Blue__ and __Bluec__ as well as __Grey__ and __Grey(space)__. Seems like something went wrong in the data collection. We should investigate how this happened and determine if this is a simple typo or another kind of error. Depending on the answer we should correct the typo or dismiss the affected orders. Luckily it is just a typo! Bluec should be Blue and Grey(space) should be Grey. We can easily fix this. The remaining color values seem fine.  

Lastly we have the $5$ known __sizes__ from __S__ up to __XXL__. Nothing to worry about.
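
Labels with trailing whitespace are easy to miss by eye. As a hedged sketch on a made-up Series (the names `colors` and `typo_map` are assumptions, not from the notebook), padded labels can be detected by comparing each value with its stripped version, and known typos can be mapped to the intended label:

```python
import pandas as pd

# Made-up sample of color labels, including the two problem cases.
colors = pd.Series(["Blue", "Bluec", "Grey", "Grey  ", "Red"])

# Labels that change when surrounding whitespace is stripped.
has_padding = colors[colors != colors.str.strip()].unique().tolist()

# Strip whitespace first, then map known typos to the intended label.
typo_map = {"Bluec": "Blue"}  # hypothetical mapping, extend as needed
cleaned = colors.str.strip().replace(typo_map)

print(has_padding)
print(sorted(cleaned.unique()))
```

Stripping catches every padded label at once, while real typos such as Bluec still need an explicit mapping.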

In [43]:
# Use .loc with a boolean mask so the assignment hits df itself, not a temporary copy
df.loc[df['color'] == 'Grey  ', 'color'] = 'Grey'
df.loc[df['color'] == 'Bluec', 'color'] = 'Blue'
  


Now we can take a look at our numerical values. We should check if the ranges make sense and if there are any outliers.

In [44]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
order,78797.0,39418.60713,22757.840711,0.0,19711.0,39422.0,59126.0,78833.0
age_in_years,78797.0,35.889844,10.37549,15.0,28.0,35.0,42.0,79.0
height_in_cm,78797.0,181.032505,6.785464,149.4,176.3,180.7,185.3,219.5
item_returned,78797.0,0.506123,0.499966,0.0,0.0,1.0,1.0,1.0
weight_in_kg,78797.0,83.041101,11.575792,48.3,75.1,82.4,89.9,179.8
price_in_euro,78797.0,7.460359,1.363198,2.24,6.3,7.39,8.52,11.66


The labeling for the returned orders is given by the integers $0$ and $1$, with the latter marking the returned orders.  
As for the order feature, it seems to be a simple running index. Hence this feature won't be useful for us.   
The values of __age_in_years__ are just integers from $15$ to $79$. So we have both very young and also older customers!

The features __height_in_cm__, __weight_in_kg__ and __price_in_euro__ are all floats. For both height and weight we have $1$ decimal, while we have $2$ decimals for the price. For height, the range of the values makes sense. We have short and tall customers. The range for the weight is pretty wide with a minimum of less than $50\,$kg and a maximum of almost $180\,$kg. Nothing humanly impossible, but we should take a deeper look here. A useful feature for that would be the BMI, which relates weight to height.

As for the price, we have a bargain bottom of just $2.24\,$€ and a maximum of just $11.66\,$€. Most prices, though, are in the range of $6\,$€ to $8.50\,$€. We may have some outliers, which could potentially be typos. Unfortunately, we can't tell whether the low prices are outliers or real. So we have to accept them.
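
One common rule of thumb for flagging candidate outliers is the 1.5 × IQR rule. This is not what the notebook does; it is only a sketch of how suspicious prices could be listed for manual inspection, using a made-up sample (the Series `prices` is an assumption).

```python
import pandas as pd

# Made-up price sample for illustration only.
prices = pd.Series([2.24, 6.30, 7.39, 8.52, 11.66, 7.10, 6.90, 7.80])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look, not proven errors.
flagged = prices[(prices < lower) | (prices > upper)]

print(lower, upper)
print(flagged.tolist())
```

A flagged value only tells us where to look. Whether a cheap item is a typo or a real bargain still has to be decided with domain knowledge.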

In [45]:
df['BMI'] = df['weight_in_kg'] / (df['height_in_cm']/100)**2
df['BMI'].describe(include='all')

count    78797.000000
mean        25.322604
std          3.204001
min         10.465093
25%         23.265629
50%         24.961760
75%         27.077605
max         80.330246
Name: BMI, dtype: float64

As we can see, we have a humanly impossible minimum BMI and an extremely high maximum BMI, while most values lie within the healthy to overweight range.

| BMI [$kg/m^2$] | Description |
|----------|:-------------:|
| < 15    |  Very severely underweight |
| 15-16   | Severely underweight  |
| 16-18.5 |  Underweight |
| 18.5-25 |  Healthy weight |
| 25-30   | Overweight  |
| 30-35   | Moderately obese |
| 35-40   | Severely obese  |
|  >40    | Very severely or morbidly obese  |

As a result we should set a lower limit and an upper limit on our BMI to get rid of the outliers. I'll base my thresholds on this table and exclude a BMI lower than 15 $\frac{kg}{m^2}$ and a BMI higher than 35 $\frac{kg}{m^2}$.
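
The table can also be expressed as bins. The following sketch uses `pd.cut` on made-up BMI values; it assumes left-closed intervals (a BMI of exactly 25 counts as overweight), which the table itself does not specify, and the names `bmi`, `edges`, `labels` and `keep` are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up BMI values for illustration.
bmi = pd.Series([14.2, 15.0, 22.4, 25.0, 34.9, 41.3])

edges = [0, 15, 16, 18.5, 25, 30, 35, 40, np.inf]
labels = ["Very severely underweight",
          "Severely underweight",
          "Underweight",
          "Healthy weight",
          "Overweight",
          "Moderately obese",
          "Severely obese",
          "Very severely or morbidly obese"]

# right=False makes every bin left-closed: [15, 16), [16, 18.5), ...
category = pd.cut(bmi, bins=edges, labels=labels, right=False)

# Keep 15 <= BMI <= 35, i.e. exclude values below 15 and above 35.
keep = bmi.between(15, 35)

print(category.tolist())
print(keep.tolist())
```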

In [46]:
# BMI description: one condition per row of the table above
cuts = [df['BMI'].between(0, 15),
        df['BMI'].between(15, 16),
        df['BMI'].between(16, 18.5),
        df['BMI'].between(18.5, 25),
        df['BMI'].between(25, 30),
        df['BMI'].between(30, 35),
        df['BMI'].between(35, 40),
        df['BMI'].between(40, 500)
       ]
groups = ['Very severely underweight',
          'Severely underweight',
          'Underweight',
          'Healthy weight',
          'Overweight',
          'Moderately obese',
          'Severely obese',
          'Very severely or morbidly obese']

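# np.select takes the first matching condition; between() includes both bounds,
# so a value exactly on a boundary falls into the lower group.
# The default must be a string to match the dtype of the group labels.
df['BMI_describtion'] = np.select(cuts, groups, default='Unknown')
# BMI cuts: drop orders with a BMI below 15 or above 35
indexNames = df[(df['BMI'] < 15) | (df['BMI'] > 35)].index
df.drop(indexNames, inplace=True)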

In [47]:
df.to_csv(r'order_cleaned.csv')