# Pre-processing examples on modcloth dataset

### Attribution: Soujanya G, Kaggle

The notebook was released under the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) open source license.

# **About the dataset**

In this notebook, we use modcloth_final_data.json as the input dataset.

In [None]:
# import necessary libraries

# File read and EDA(Data Cleansing & Transformations)
import numpy as np  
import pandas as pd 

# EDA Visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Mounting gDrive

In [None]:
#Mounting gDrive in Colaboratory
try:
    from google.colab import drive
    drive.mount("/content/drive/", force_remount=True)
    google_drive_prefix = "/content/drive/My Drive"
    data_prefix = "{}/mnist/".format(google_drive_prefix)
except ModuleNotFoundError: 
    data_prefix = "data/"

In [None]:
#Change directory to my folder for analytics labs where I have cloned my gitHub repositories with magic command.

%cd drive/My Drive/Data_analytics_lab

# Read input json data

In [None]:
#Read file and view the first five rows
mc_data= pd.read_json("data_code_along/modcloth_final_data.json", lines=True)
mc_data.head() # displays first 5 records in the dataframe

## EDA - Exploratory Data Analysis

# Column names are inconsistent
Some column names contain spaces while others use underscores. To be consistent, replace the spaces with underscores.

> `size` is an attribute of every pandas DataFrame, so `mc_data.size` returns the number of elements rather than the column. Rename the feature "size" to a user-defined name such as "mc_size".
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html

In [None]:
mc_data.columns = ['item_id', 'waist', 'mc_size', 'quality', 'cup_size', 'hips', 'bra_size', 'category', 'bust', 'height', 'user_name', 'length', 'fit', 'user_id', 'shoe_size', 'shoe_width', 'review_summary', 'review_test']

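As a minimal sketch of the same idea on a toy frame (the sample values and the three column names below are made up for illustration), the spaces can also be replaced programmatically instead of typing the full list of names:

```python
import pandas as pd

# Toy frame with inconsistent names (hypothetical sample values).
df = pd.DataFrame({"item_id": [1, 2], "size": [4, 8], "cup size": ["b", "c"]})

# As an attribute, `size` is the element count (rows * columns), not the column.
print(df.size)               # 6
print(df["size"].tolist())   # [4, 8]

# Replace spaces with underscores, then rename the clashing column.
df.columns = df.columns.str.replace(" ", "_")
df = df.rename(columns={"size": "mc_size"})
print(df.columns.tolist())   # ['item_id', 'mc_size', 'cup_size']
```

Bracket access (`df["size"]`) always returns the column, but renaming avoids surprises when attribute access is used later in the notebook.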
> see the total number of observations, column names and datatypes info 

In [None]:
mc_data.info()

# Sparse Data

The data has a lot of missing values; look, for example, at the columns shoe_size and shoe_width.

> Let's check the percentage of missing values for each feature

In [None]:
missing_data_sum = mc_data.isnull().sum()
missing_data = pd.DataFrame({'total_missing_values': missing_data_sum,'percentage_of_missing_values': (missing_data_sum/mc_data.shape[0])*100})
missing_data

*Out of 18 columns, only 6 have complete data. Columns such as waist, bust, shoe_size, shoe_width and hips are highly sparse.*

> Check which columns pandas treats as numerical and which as categorical (object) data

In [None]:
mc_data.dtypes

*pandas identifies item_id, waist, mc_size, quality, hips, bra_size, user_id and shoe_size as numeric, and cup_size, category, bust, height, user_name, length, fit, shoe_width, review_summary and review_test as object type. The takeaway is that some columns holding numeric data have been given the object dtype, so we need to correct these misclassified data types. For example, bust contains numeric values but its dtype is object.*

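One way to inspect such a column is `pd.to_numeric` with `errors="coerce"`, which turns anything that cannot be parsed into NaN. The sketch below uses a made-up sample of bust-like strings, not the real column:

```python
import pandas as pd

# Hypothetical sample: numeric strings, one malformed range, one missing value.
bust = pd.Series(["36", "38", "37-39", None], dtype="object")

# Values that cannot be parsed as numbers become NaN instead of raising.
bust_numeric = pd.to_numeric(bust, errors="coerce")

print(bust_numeric.dtype)            # float64
print(bust_numeric.isna().sum())     # 2 (the range string and the missing value)

# Rows that were present but failed to parse point at formatting problems.
print(bust[bust_numeric.isna() & bust.notna()].tolist())   # ['37-39']
```

Coercion silently discards the malformed entries, so it is worth listing them first and deciding how to repair them.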
# Unique number of observations for each feature

> With a small dataset we could inspect every unique value in each feature directly. This one has 82790 observations, so we count the unique values per feature instead.

In [None]:
mc_data.nunique()

*This shows that no column holds a unique value for every row. Even item_id and user_id contain repeated values.*

> Let's look at the features that have few unique values

In [None]:
def countplot(independent_features):
  plt.figure(figsize=(25, 25))
  for loc, feature in enumerate(independent_features):
    ax = plt.subplot(3, 4, loc+1)
    ax.set_xlabel('{}'.format(feature), fontsize=10)
    chart = sns.countplot(x=mc_data[feature])
    chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
  return None

In [None]:
uniques_data = ['quality', 'cup_size', 'bra_size', 'category', 'length', 'fit',  'shoe_size', 'shoe_width', 'height', 'bust', 'mc_size']
countplot(uniques_data)

*A few observations*
* cup_size uses a letter code that appears to represent a measurement
* shoe_size 38 is an outlier, and the feature shows a lot of variance
* the height column also has a few outliers (these should be easier to see after converting it to numeric values)
* shoe_width, category, length, fit and height hold categorical data
* bust has one observation in a different format, "37-39", so we need to reformat it. We replace it with the midpoint of the two values, 38


In [None]:
# replace the malformed bust value '37-39' with 38, the midpoint of 37 and 39
mc_data.at[mc_data[mc_data.bust == '37-39'].index[0],'bust'] = '38'

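If more than one value had this range format, the same repair could be written as a small helper. This is only a sketch; `range_to_midpoint` is a hypothetical name and not part of the original notebook:

```python
def range_to_midpoint(value):
    """Return the midpoint of a 'low-high' string; leave other values unchanged."""
    if isinstance(value, str) and "-" in value:
        low, high = (float(part) for part in value.split("-"))
        return (low + high) / 2
    return value

print(range_to_midpoint("37-39"))   # 38.0
print(range_to_midpoint("36"))      # 36 (unchanged, still a string)
```

It could be applied with `mc_data.bust.apply(range_to_midpoint)`, assuming no other values in the column contain a hyphen.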
# Height feature - Convert US units to Metric units (ft & in to cm).

In [None]:
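def height_in_cms(ht):
  # missing heights arrive as the string 'nan' after astype(str)
  if ht.lower() == 'nan':
    return np.nan
  # strip the unit labels, e.g. "5ft 6in" -> ["5", "6"]
  parts = ht.replace('ft','').replace('in', '').split()
  h_ft = int(parts[0])
  h_inch = int(parts[1]) if len(parts) > 1 else 0
  # total inches, then convert to centimetres (1 in = 2.54 cm)
  h_inch += h_ft * 12
  return round(h_inch * 2.54, 1)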

mc_data.height = mc_data.height.astype(str).apply(height_in_cms)
mc_data.head()

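As a worked check of the conversion: 5ft 6in is 5 × 12 + 6 = 66 inches, and 66 × 2.54 = 167.64 cm, which rounds to 167.6. The sketch below does the same conversion in a vectorized way with `str.extract`, assuming the strings look like "5ft 6in" or "5ft" (the sample values are made up):

```python
import pandas as pd

# Hypothetical sample heights in the assumed "<feet>ft <inches>in" format.
heights = pd.Series(["5ft 6in", "5ft", None])

# Capture feet and (optional) inches as separate columns.
parts = heights.str.extract(r"(?P<ft>\d+)ft(?:\s*(?P<inch>\d+)in)?").astype(float)

# Missing inches count as 0; a missing height stays NaN because ft is NaN.
total_inches = parts["ft"] * 12 + parts["inch"].fillna(0)
height_cm = (total_inches * 2.54).round(1)

print(height_cm.tolist())   # [167.6, 152.4, nan]
```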
> We successfully converted the heights to centimetres. Now let's handle the missing values with mean imputation and then look into the outliers for the height feature. Use a box or scatter plot to visualize outliers.

In [None]:
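# assign the result back; an in-place fillna on the attribute may not update the dataframe in recent pandas versions
mc_data['height'] = mc_data['height'].fillna(mc_data['height'].mean())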
mc_data.height.isnull().sum()

In [None]:
def plot_outlier(feature):
  plt.figure(figsize=(25, 6))
  ax = sns.boxplot(x=feature, linewidth=2.5)
plot_outlier(mc_data.height)

> Check the lower and upper cutoff range values for the outliers

In [None]:
def get_outliers_range(datacolumn):
  # np.percentile does not need sorted input, so no sorting is required
  Q1,Q3 = np.percentile(datacolumn , [25,75])
  IQR = Q3 - Q1
  lower_range = Q1 - (1.5 * IQR)
  upper_range = Q3 + (1.5 * IQR)
  return lower_range,upper_range

In [None]:

ht_lower_range,ht_upper_range = get_outliers_range(mc_data.height)
ht_lower_range,ht_upper_range

> Takeaway: there are many outliers. Here I have used the interquartile range (IQR) to find the lower and upper cutoffs.
Anything below the lower cutoff (144.7) or above the upper cutoff (185.5) is treated as an outlier.

Note: there are different techniques to identify outliers. Outliers can also be bi- and multivariate outliers. 
Please check out this link for more details on this
https://statisticsbyjim.com/basics/outliers/

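To make the IQR rule concrete, here is a small worked example on made-up heights. `iqr_bounds` is a hypothetical helper name used only for this illustration:

```python
import numpy as np

def iqr_bounds(values):
    """Return the (lower, upper) cutoffs at 1.5 * IQR outside the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical heights in cm; 250 is an obvious data-entry error.
sample = np.array([150, 160, 165, 170, 175, 180, 250])

lower, upper = iqr_bounds(sample)
print(lower, upper)   # 140.0 200.0  (Q1 = 162.5, Q3 = 177.5, IQR = 15)

outliers = sample[(sample < lower) | (sample > upper)]
print(outliers.tolist())   # [250]
```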

> Let's count how many outliers exist for the height feature


In [None]:
mc_data[(mc_data.height < ht_lower_range) | (mc_data.height > ht_upper_range)]

> There are 199 outliers, which is about 0.2% of the total number of observations. Hence we can drop these outliers.

In [None]:
mc_df = mc_data.drop(mc_data[(mc_data.height < ht_lower_range) | (mc_data.height > ht_upper_range)].index)

mc_df.reset_index(drop=True, inplace=True)
mc_df.shape

> Let's look at the height feature again with a box plot to see the effect of removing the outliers

In [None]:
plot_outlier(mc_df.height)

# Numeric features distributions visualization 

In [None]:
def plot_dist(df, independent_features):
  plt.figure(figsize=(25, 20))
  for loc, feature in enumerate(independent_features):
    ax = plt.subplot(3, 3, loc+1)
    sns.histplot(df[feature]) # histogram of the feature; sns.kdeplot is an alternative
  return None

In [None]:
plot_dist(mc_data, ['height', 'waist', 'mc_size', 'quality', 'hips', 'bra_size', 'shoe_size'])

**What do the figures tell us?**

* Height is normally distributed (good!)

* Waist should be normally distributed, but perhaps there are many null values? Do you remember?

* For size, it is unclear what type of data this is: quantitative or qualitative? Perhaps it depends on which analysis we want to do?

* Quality is a typical survey response (why?). How should we treat it?

* Hips: why is it not a perfect normal distribution?

* Bra size: these seem rather to be sizes, which are qualitative?

* shoe_size looks normally distributed, but why such a wide range on the x-axis?

# Missing Values Handling for numeric features

> As we saw, there are a lot of missing values in this dataset. Since the data is highly sparse, I use the KNN algorithm to impute the relevant features.

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=10)

# finding imputation using other features (it will take couple of minutes to complete the execution)
mc_data_knn_ind_features = mc_df[['waist', 'hips', 'bra_size', 'bust', 'height', 'shoe_size']]

df_filled = imputer.fit_transform(mc_data_knn_ind_features)


knn_numeric_imputations = pd.DataFrame(data=df_filled, columns=['waist', 'hips', 'bra_size', 'bust', 'height', 'shoe_size'])


# remove the existing numeric columns (waist, height, hips, bra_size, bust, shoe_size ) from the main dataframe and concatenate  with knn imputed data
#mc_df = mc_data
mc_new_df = mc_df.drop(['waist', 'hips', 'bra_size', 'bust', 'height', 'shoe_size'], axis=1)




In [None]:
# concat the imputations data with mc data frame
mc = pd.concat([mc_new_df, knn_numeric_imputations], axis=1)
mc.isnull().sum()

> We have successfully imputed some of the numeric features

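To see what the imputer does, here is a minimal sketch on a tiny made-up matrix with `n_neighbors=2` (the notebook itself uses 10 neighbours on the real features). Each missing entry is replaced by the mean of that column over the nearest rows, where distance is measured on the columns both rows have in common:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: 4 rows, 3 features, two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X)

# Row 0, column 2: the two nearest rows have 3 and 5 there, so the fill is 4.
# Row 2, column 0: the two nearest rows have 3 and 8 there, so the fill is 5.5.
print(X_filled)
```

Because the fill depends on distances between rows, features on very different scales may need scaling before imputation; that step is not part of the original notebook.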
# Handling shoe-size outliers

In [None]:
plot_outlier(mc.shoe_size)

Clearly, there are a few outliers. Use the IQR cutoff values to remove these observations.

In [None]:
ss_lower_range,ss_upper_range = get_outliers_range(mc.shoe_size)
#print(ss_lower_range,ss_upper_range)

mc.drop(mc[(mc.shoe_size < ss_lower_range) | (mc.shoe_size > ss_upper_range)].index, axis=0, inplace=True) # found 390 observations 
plot_outlier(mc.shoe_size)

# Different solutions to transform categorical variables to numeric ones.

In real-world datasets, variables (features) are often categorical, and such variables are most often represented by strings. Most machine learning models, however, cannot process strings; they can only handle numerical values. The categorical features therefore need to be transformed to numerical values, but at the same time it is important not to change the meaning and interpretation of the values.

To read more about the different ways of transforming categorical features (there are several, each with different strengths and weaknesses depending on the data), see for instance [here](https://pbpython.com/categorical-encoding.html) and [here](https://towardsdatascience.com/beyond-one-hot-17-ways-of-transforming-categorical-features-into-numeric-features-57f54f199ea4). pandas also has a dtype called category which can be helpful, see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).


### Example 1: 

Applied to feature *cup_size*: used open source data to convert the cup size codes into numerical data.

Source: https://www.blitzresults.com/en/bra-size/

We add two new columns for the feature "cup_size" that hold the measurement range in centimetres, then impute the missing values with the mean.

In [None]:
# cup size code -> (start, end) of the measurement range in centimetres
CUP_SIZE_IN_CMS = {
  'aa': (10, 11),
  'a': (12, 13),
  'b': (14, 15),
  'c': (16, 17),
  'd': (18, 19),
  'dd/e': (20, 21),
  'ddd/f': (22, 23),
  'dddd/g': (24, 25),
  'h': (26, 27),
  'i': (28, 29),
  'j': (30, 31),
  'k': (32, 33),
}

def convert_cup_size_to_cms(cup_size_code):
  # unrecognized codes and NaN fall back to 'unknown'
  return CUP_SIZE_IN_CMS.get(cup_size_code, 'unknown')

In [None]:
mc['cup_size_in_cms'] = mc.cup_size.apply(convert_cup_size_to_cms)
mc.head()

In [None]:
def split_cup_size_data(data, index):
  if data.lower() == 'unknown':
    return 0
  value = data.replace('(','').replace(')','').replace(',','')
  return value.split()[index]

mc['cup_size_start_in_cms'] =  mc.cup_size_in_cms.astype(str).apply(lambda x : split_cup_size_data(x, 0))
mc['cup_size_end_in_cms'] =  mc.cup_size_in_cms.astype(str).apply(lambda x : split_cup_size_data(x, 1))
mc.head()

In [None]:
mc['cup_size_start_in_cms'] = mc.cup_size_start_in_cms.astype('int')
mc['cup_size_end_in_cms'] = mc.cup_size_end_in_cms.astype('int')


# missing values imputation with mean
# mask the 0 placeholders first so that they do not pull the mean down
cup_size_start = mc.cup_size_start_in_cms.mask(mc.cup_size_start_in_cms==0)
cup_size_end = mc.cup_size_end_in_cms.mask(mc.cup_size_end_in_cms==0)
mc['cup_size_start_in_cms']  = cup_size_start.fillna(value=cup_size_start.mean())
mc['cup_size_end_in_cms']  = cup_size_end.fillna(value=cup_size_end.mean())

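When 0 stands in for "unknown", the order of operations matters: a mean taken before masking still includes the zeros. A small sketch with made-up values shows the difference:

```python
import pandas as pd

# Hypothetical cup size starts in cm; 0 marks an unknown value.
start = pd.Series([0, 12, 14, 16])

# Mean including the placeholder is pulled toward zero.
print(start.mean())   # 10.5

# Mask the placeholder first, then impute with the mean of the known values.
masked = start.mask(start == 0)
filled = masked.fillna(masked.mean())
print(filled.tolist())   # [14.0, 12.0, 14.0, 16.0]
```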
> Let's double-check the NaN imputation for the newly added features

In [None]:
mc[mc.cup_size.isnull()]

In [None]:
# drop the columns which are used for reference
mc = mc.drop(['cup_size', 'cup_size_in_cms'], axis = 1)
mc.reset_index(drop=True,  inplace=True)

# Categorical features against the target

> Let's visualize the categorical data against the dependent feature fit


In [None]:
def countplot_wrt_target(independent_features, df):
  plt.figure(figsize=(28, 10))
  for loc, feature in enumerate(independent_features):
    ax = plt.subplot(1, 3, loc+1)
    ax.set_xlabel('{}'.format(feature), fontsize=10)
    chart = sns.countplot(x=df[feature], hue=df.fit)
    chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
  return None


In [None]:
countplot_wrt_target(['category', 'length', 'quality'], mc)

### Example 2:
Applied to feature *shoe_width*: used open source data to identify the shoe width based on the shoe size

Reference link : https://images-na.ssl-images-amazon.com/images/I/71u90X9oX3S.pdf

In [None]:
# fill NaN with average shoe width category (this is just an assumption)
mc.shoe_width = mc.shoe_width.fillna('average')

In [None]:
# Use the chart referenced above to convert shoe width categories ('narrow', 'average', 'wide') to inches.
# Each row: (shoe_size from, shoe_size to (exclusive), narrow, average, wide)
shoe_width_chart = [
  (5, 5.5, 2.81, 3.19, 3.56),
  (5.5, 6, 2.87, 3.25, 3.62),
  (6, 6.5, 2.94, 3.31, 3.69),
  (6.5, 7, 3, 3.37, 3.75),
  (7, 7.5, 3.06, 3.44, 3.81),
  (7.5, 8, 3.12, 3.5, 3.87),
  (8, 8.5, 3.19, 3.56, 3.94),
  (8.5, 9, 3.25, 3.62, 4),
  (9, 9.5, 3.37, 3.69, 4.06),
  (9.5, 10, 3.37, 3.75, 4.12),
  (10, 10.5, 3.44, 3.75, 4.19),
  (10.5, 11, 3.5, 3.87, 4.19),
  (11, 12, 3.56, 3.94, 4.19),
]

# shoe sizes outside the chart (below 5, or 12 and above) stay NaN
mc['shoe_width_in_inches'] = np.nan
for size_from, size_to, narrow, average, wide in shoe_width_chart:
  in_range = (mc.shoe_size >= size_from) & (mc.shoe_size < size_to)
  for width, inches in [('narrow', narrow), ('average', average), ('wide', wide)]:
    mc.loc[in_range & (mc.shoe_width == width), 'shoe_width_in_inches'] = inches

In [None]:
# drop the reference column shoe_width
mc.drop(['shoe_width'], axis=1, inplace=True)

### Example 3: 

Applied to features *length & category*: using one-hot encoding to change categorical data to numeric 

> One-hot encoding is a common way to change categorical (often string) data to numeric data without changing the scale of the feature. This type of transformation is suitable when the data is on a nominal scale, i.e. the values have no order. 

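A minimal sketch of one-hot encoding on a made-up length-like column (the sample values are illustrative, not taken from the dataset). Note that `pd.get_dummies` names each new column after the category value and orders the columns by sorted value, not by order of appearance:

```python
import pandas as pd

# Hypothetical nominal feature with three distinct values.
length = pd.Series(["just right", "very short", "slightly long", "just right"])

# One indicator column per distinct value, prefixed for readability.
dummies = pd.get_dummies(length, prefix="length")
dummies.columns = dummies.columns.str.replace(" ", "_")

print(dummies.columns.tolist())
# ['length_just_right', 'length_slightly_long', 'length_very_short']

# Each row has exactly one active indicator.
print(dummies.sum(axis=1).tolist())   # [1, 1, 1, 1]
```

Deriving the new names from the generated columns keeps every label attached to the right indicator.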


In [None]:
# let's replace NaN values with 'unknown' for the feature length
mc.length = mc.length.fillna('unknown')

In [None]:
# apply one hot encoding using dummies
# get_dummies orders the new columns by sorted category value, so derive the
# names from the generated columns instead of assigning a positional list

length_dummies  = pd.get_dummies(mc['length'])
length_dummies.columns = ['length_unknown' if col == 'unknown' else col.replace(' ', '_') for col in length_dummies.columns]

category_dummies  = pd.get_dummies(mc['category'])

model_input_df = pd.concat([mc, length_dummies,category_dummies], axis = 1)
model_input_df.drop(['length'], axis=1, inplace=True)
model_input_df.drop(['category'], axis=1, inplace=True)

# target variable 
fit = {'small':0, 'fit':1, 'large':2}
model_input_df['fit'] = model_input_df['fit'].map(fit)


In [None]:
# drop item_id, user_id and user_name since they add no value as features

model_input_df.drop(['item_id'], axis=1, inplace=True)

model_input_df.drop(['user_id'], axis=1, inplace=True)

model_input_df.drop(['user_name'], axis=1, inplace=True)
model_input_df.head()

### Example 4: 

Applied to feature *fit*: change it into an ordinal feature (dtype category). 


In [None]:
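# convert mc_df's own column; mc has fewer rows, so assigning mc.fit would misalign on the index
mc_df['fit'] = mc_df['fit'].astype('category').cat.as_ordered()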

In [None]:
mc_df.head(10)

In [None]:
mc_df.dtypes

In [None]:
mc_df.fit.info()

In [None]:
mc_df.fit.min()

In [None]:
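# as_ordered() keeps the alphabetical order (fit < large < small); reorder_categories
# returns a new Series, so assign it back to get small < fit < large
mc_df['fit'] = mc_df.fit.cat.reorder_categories(['small', 'fit', 'large'], ordered=True)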
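The effect of an explicit category order can be seen on a small made-up series. This sketch uses `pd.CategoricalDtype` to set the order in one step, which is an alternative to calling `as_ordered()` followed by `reorder_categories`:

```python
import pandas as pd

# Hypothetical fit labels.
fit = pd.Series(["fit", "large", "small", "fit"])

# Declare the intended order explicitly: small < fit < large.
fit_dtype = pd.CategoricalDtype(categories=["small", "fit", "large"], ordered=True)
fit_ordered = fit.astype(fit_dtype)

print(fit_ordered.min())                  # small
print((fit_ordered < "large").tolist())   # [True, False, True, True]
print(fit_ordered.cat.codes.tolist())     # [1, 2, 0, 1]
```

The integer codes follow the declared order, so they can serve as an ordinal encoding of the feature.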