## Data Cleaning

Before we can start our analysis, we have to check and prepare our data. This step is crucial to prevent mistakes caused by faulty data.

In [2]:
# Basic Python Libs
import json
import numpy as np
import pandas as pd

In [50]:
data_path = "Data\\"
file_name = "orders.csv"

df = pd.read_csv(data_path+file_name, delimiter=',')
print(df.columns)
print(df[0:5])
print(df.isnull().sum())
df.describe(include='all').T

Index(['order', 'age_in_years', 'category', 'product_group', 'height_in_cm',
       'date_shipped', 'size', 'color', 'item_returned', 'weight_in_kg',
       'price_in_euro'],
      dtype='object')
   order  age_in_years       category            product_group  height_in_cm  \
0      0            30  Shirts casual                 Slim fit         183.7   
1      1            47       T-Shirts  Polo shirt shortsleeves         178.9   
2      2            36  Shirts casual              Regular fit         193.0   
3      3            43       T-Shirts  Polo shirt shortsleeves         179.9   
4      4            27  Shirts casual              Regular fit         187.0   

          date_shipped size color  item_returned  weight_in_kg  price_in_euro  
0  2017-01-27 00:00:00    S  Blue              0          66.0           9.12  
1  2017-09-07 00:00:00    L  Blue              0          86.8           6.35  
2  2017-09-12 00:00:00   XL  Blue              1          85.3           8.27  
3 

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order,78834,,,,39416.5,22757.6,0.0,19708.2,39416.5,59124.8,78833.0
age_in_years,78834,,,,35.8873,10.3745,15.0,28.0,35.0,42.0,79.0
category,78834,2.0,T-Shirts,41135.0,,,,,,,
product_group,78834,8.0,Slim fit,20504.0,,,,,,,
height_in_cm,78803,,,,181.033,6.7853,149.4,176.3,180.7,185.3,219.5
date_shipped,78834,253.0,2017-06-28 00:00:00,758.0,,,,,,,
size,78834,5.0,M,26691.0,,,,,,,
color,78834,17.0,Blue,40238.0,,,,,,,
item_returned,78834,,,,0.506241,0.499964,0.0,0.0,1.0,1.0,1.0
weight_in_kg,78797,,,,83.0411,11.5758,48.3,75.1,82.4,89.9,179.8


What we can already see here is that there are categorical and numerical variables. Furthermore, we already have a binary-encoded variable (item_returned) and a date variable (date_shipped).
item_returned is the dependent variable for our problem.

As for the categorical variables, the summary above shows 2 categories, 8 product groups, 5 sizes and 17 colors.
For size it makes sense to use label encoding, since there is an intrinsic order from smallest to largest.
For the other variables, we should create one-hot encoded variables.
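
The two encodings described above can be sketched on a few made-up rows. This is only an illustration under assumptions: the toy values, the `size_label` column name and the `color_` prefix are not part of the original data.

```python
import pandas as pd

# Toy rows standing in for the real order data (values are made up).
toy = pd.DataFrame({
    "size": ["S", "L", "XL", "M", "XXL"],
    "color": ["Blue", "Grey", "Blue", "Red", "Grey"],
})

# Ordinal labels for size: the position in this list becomes the label.
size_order = ["S", "M", "L", "XL", "XXL"]
toy["size_label"] = pd.Categorical(
    toy["size"], categories=size_order, ordered=True
).codes

# One-hot columns for color, which has no natural order.
encoded = pd.get_dummies(toy, columns=["color"], prefix="color")

print(encoded)
```

Spelling out the size order explicitly matters here, because an alphabetical sort would place 'L' before 'M' and 'S'.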

Judging by the counts above, there are also 31 missing values for height_in_cm and 37 for weight_in_kg. The easiest way to deal with that is to remove these rows from the dataframe. One could fill in the missing values through one of various imputation methods, but with a data size of about 79k rows one probably doesn't have to rely on that and can safely remove these rows.

In [51]:
df = df.dropna()
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
order,78797,,,,39418.6,22757.8,0.0,19711.0,39422.0,59126.0,78833.0
age_in_years,78797,,,,35.8898,10.3755,15.0,28.0,35.0,42.0,79.0
category,78797,2.0,T-Shirts,41108.0,,,,,,,
product_group,78797,8.0,Slim fit,20499.0,,,,,,,
height_in_cm,78797,,,,181.033,6.78546,149.4,176.3,180.7,185.3,219.5
date_shipped,78797,253.0,2017-06-28 00:00:00,758.0,,,,,,,
size,78797,5.0,M,26689.0,,,,,,,
color,78797,17.0,Blue,40218.0,,,,,,,
item_returned,78797,,,,0.506123,0.499966,0.0,0.0,1.0,1.0,1.0
weight_in_kg,78797,,,,83.0411,11.5758,48.3,75.1,82.4,89.9,179.8
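
For comparison, here is a minimal sketch of the alternative mentioned earlier, filling the gaps instead of dropping rows. It runs on made-up toy values, and median imputation is just one assumed choice among several possible methods; it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data with one gap per column (values are made up).
toy = pd.DataFrame({
    "height_in_cm": [183.7, np.nan, 193.0, 179.9],
    "weight_in_kg": [66.0, 86.8, np.nan, 80.0],
})

# Count missing values per column before deciding how to handle them.
missing_before = toy.isnull().sum()

# Option used in this notebook: drop every row with a missing value.
dropped = toy.dropna()

# Alternative: fill each numeric column with its own median.
imputed = toy.fillna(toy.median(numeric_only=True))

print(missing_before)
print(len(dropped), len(imputed))
```

Dropping keeps only fully observed rows, while imputation keeps every row at the cost of inventing values, which is why dropping is attractive when only a tiny fraction of rows is affected.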


Before I one-hot encode the categorical variables, I want to check for typos.

In [52]:
print(np.unique(df.category))
print(np.unique(df.product_group))
print(np.unique(df.color))
print(np.unique(df['size']))  # df.size is a DataFrame attribute, so the column needs bracket access

['Shirts casual' 'T-Shirts']
['Polo shirt longsleeves' 'Polo shirt shortsleeves' 'Regular fit'
 'Slim fit' 'T-shirt Basic' 'T-shirt Print' 'T-shirt long sleeves'
 'T-shirt striped / patterned']
['Beige' 'Black' 'Blue' 'Bluec' 'Brown' 'Green' 'Grey' 'Grey  ' 'Metal'
 'Multicolor' 'Orange' 'Pink' 'Red' 'Turquoise' 'Violet' 'White' 'Yellow']
['L' 'M' 'S' 'XL' 'XXL']


It's not easy to spot, but the color 'Grey' appears as both 'Grey' and 'Grey  ' (with trailing spaces), and there is both 'Blue' and 'Bluec'. I don't know what kind of color 'Bluec' is meant to be. I assume this is a typo and I hope to get an answer by email. I also checked the original jsonlines file and the same typo can be found there.

/EDIT: No response so far, so I'm going to assume that 'Grey  ' equals 'Grey' and 'Bluec' equals 'Blue'.

In [53]:
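# Use .loc with a row mask and a column label instead of chained assignment
df.loc[df['color'] == 'Grey  ', 'color'] = 'Grey'
df.loc[df['color'] == 'Bluec', 'color'] = 'Blue'
print(np.unique(df.color))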

['Beige' 'Black' 'Blue' 'Brown' 'Green' 'Grey' 'Metal' 'Multicolor'
 'Orange' 'Pink' 'Red' 'Turquoise' 'Violet' 'White' 'Yellow']


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
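
The message above is pandas' SettingWithCopyWarning, which is raised when pandas cannot tell whether an assignment targets a view or a copy. A minimal sketch of one way to normalize such labels by assigning the whole column in a single step, on made-up toy data; the `typo_map` dictionary is a hypothetical name and encodes the assumption that 'Bluec' means 'Blue':

```python
import pandas as pd

# Toy column reproducing the two kinds of label problems (values are made up).
toy = pd.DataFrame({"color": ["Grey", "Grey  ", "Blue", "Bluec", "Red"]})

# Assumed mapping from typos to the intended labels.
typo_map = {"Bluec": "Blue"}

# Strip surrounding whitespace, then map known typos; assigning the
# whole column in one step avoids chained indexing.
toy["color"] = toy["color"].str.strip().replace(typo_map)

print(sorted(toy["color"].unique()))
```

Stripping whitespace first catches every padded variant at once, so only genuine misspellings need an entry in the mapping.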
  


I can't spot any more typos, so I'm going to move on (time!). Lastly, I'll save the dataframe as a CSV file so I can use it in Tableau and attack my first objective there.

/EDIT: typo. Forgot to specify the column to edit.

In [54]:
print(np.unique(df.category))
print(np.unique(df.product_group))
print(np.unique(df.color))
print(np.unique(df['size']))  # df.size is a DataFrame attribute, so the column needs bracket access

['Shirts casual' 'T-Shirts']
['Polo shirt longsleeves' 'Polo shirt shortsleeves' 'Regular fit'
 'Slim fit' 'T-shirt Basic' 'T-shirt Print' 'T-shirt long sleeves'
 'T-shirt striped / patterned']
['Beige' 'Black' 'Blue' 'Brown' 'Green' 'Grey' 'Metal' 'Multicolor'
 'Orange' 'Pink' 'Red' 'Turquoise' 'Violet' 'White' 'Yellow']
['L' 'M' 'S' 'XL' 'XXL']


In [51]:
df.to_csv('order_cleaned.csv', index=False)  # the 'order' column already identifies each row