In [1]:
# Importing the necessary packages
import numpy as np                                  # Scientific computing
import pandas as pd                                 # Data frames
import matplotlib.pyplot as plt                     # Basic visualisation

from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

In [2]:
# Read the dataset
df = pd.read_csv('https://raw.githubusercontent.com/HOGENT-ML/course/main/datasets//clothes_size_prediction.csv')
df.head()

   weight   age  height size
0      62  28.0  172.72   XL
1      59  36.0  167.64    L
2      61  34.0  165.10    M
3      65  27.0  175.26    L
4      62  45.0  172.72    M


## Take a look at the dataset

We'll try to predict the size based on the weight, age and height.

Show some general info about the dataset.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119734 entries, 0 to 119733
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119734 non-null  int64  
 1   age     119477 non-null  float64
 2   height  119404 non-null  float64
 3   size    119734 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 3.7+ MB


How many records are there for each size?  

M: 29575  
S: 21829  
XXXL: 21259  
XL: 19033  
L: 17481  
XXS: 9907  
XXL: 69

In [4]:
df['size'].value_counts()

size
M       29712
S       21924
XXXL    21359
XL      19119
L       17587
XXS      9964
XXL        69
Name: count, dtype: int64

Because there are very few records for XXL, remove those records from the dataset.

In [5]:
# remove all XXL
df = df[df['size'] != 'XXL']

Train a transformer to fill in the median value of the corresponding attribute for all missing values.

In [6]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df_num = df.drop('size', axis=1)
imputer.fit(df_num)
imputer.statistics_

array([ 61. ,  32. , 165.1])

Apply the imputer to the dataset and check the results.

In [8]:
df_num_tr = imputer.transform(df_num)
print(type(df_num_tr))
df_num_tr_df = pd.DataFrame(df_num_tr, columns=df_num.columns,index=df_num.index)
df_num_tr_df.info()

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Index: 119665 entries, 0 to 119733
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119665 non-null  float64
 1   age     119665 non-null  float64
 2   height  119665 non-null  float64
dtypes: float64(3)
memory usage: 3.7 MB


In [9]:
df_num_tr_df = pd.concat([df_num_tr_df, df['size']], axis=1)
df_num_tr_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119665 entries, 0 to 119733
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119665 non-null  float64
 1   age     119665 non-null  float64
 2   height  119665 non-null  float64
 3   size    119665 non-null  object 
dtypes: float64(3), object(1)
memory usage: 4.6+ MB


At first sight this seems quite a large dataset, but is this actually true?  
First we are going to change the datatype of height from float to integer.


In [10]:
# change the datatype of height from float to integer.
df_num_tr_df['height'] = df_num_tr_df['height'].astype('int')


It seems reasonable to round the ages to the nearest multiple of five.

In [None]:
# Round the ages to the nearest multiple of five
df_num_tr_df['age'] = np.round(df_num_tr_df['age']/5)*5

Change the datatype of age from float to integer.

In [None]:
# Change the datatype of age from float to integer.
df_num_tr_df['age'] = df_num_tr_df['age'].astype('int')

We drop duplicate rows in the dataset.

In [15]:
df_num_tr_df = df_num_tr_df.drop_duplicates()

In [16]:
df_num_tr_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26974 entries, 0 to 119721
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  26974 non-null  float64
 1   age     26974 non-null  float64
 2   height  26974 non-null  int64  
 3   size    26974 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 1.0+ MB


How many records are left?

In [17]:
len(df_num_tr_df)

26974

We want to know if there are any 'wrong duplicates' in the dataset, i.e. records with the same values for weight, age and height but a different size. So we count the number of unique sizes for each combination of weight, age and height.

   weight  age  height  size
0    22.0   30     167     2
1    22.0   45     152     1
2    26.0   45     172     1
3    31.0   35     175     1
4    35.0   20     182     1


We want to know how many combinations of weight, age and height occur with more than one size.

weight    2726
age       2726
height    2726
size      2726
dtype: int64

We decide to remove those conflicting records, keeping only the first record for each combination of weight, age and height.

How many records are left?

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5159 entries, 0 to 119682
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  5159 non-null   float64
 1   age     5159 non-null   int32  
 2   height  5159 non-null   int32  
 3   size    5159 non-null   object 
dtypes: float64(1), int32(2), object(1)
memory usage: 161.2+ KB
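
One way to keep only the first record per combination is `drop_duplicates` with a `subset`. The sketch below runs on hypothetical toy data (`toy` and `deduped` are illustrative names); it is an assumption that the notebook used this exact call.

```python
import pandas as pd

# Hypothetical toy data standing in for df_num_tr_df
toy = pd.DataFrame({
    "weight": [60.0, 60.0, 70.0],
    "age":    [30, 30, 40],
    "height": [170, 170, 180],
    "size":   ["M", "L", "XL"],
})

# Keep the first row for each (weight, age, height) combination;
# later rows with the same combination are dropped, whatever their size.
deduped = toy.drop_duplicates(subset=["weight", "age", "height"], keep="first")
deduped.info()
```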


Check if the dataset is heavily skewed.

XXXL    2326
M        689
XL       659
S        624
L        473
XXS      388
Name: size, dtype: int64

Because we want to apply regression first, map the sizes to numbers as follows:  
'XXS' : 0, 'S' : 1, 'M': 2, 'L': 3,'XL':4,'XXXL': 5

   weight  age  height  size
0    62.0   30     172     4
1    59.0   35     167     3
2    61.0   35     165     2
3    65.0   25     175     3
4    62.0   45     172     2
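
The mapping above can be applied with a dictionary and `Series.map`. A minimal sketch, using the sizes of the first five records shown at the top of the notebook (`size_to_number` and `toy` are illustrative names). XXL is absent from the mapping because those records were removed earlier.

```python
import pandas as pd

# Ordinal encoding of the sizes, as given in the text
size_to_number = {"XXS": 0, "S": 1, "M": 2, "L": 3, "XL": 4, "XXXL": 5}

# Sizes of the first five records of the dataset
toy = pd.DataFrame({"size": ["XL", "L", "M", "L", "M"]})

# Replace each label by its number; unmapped labels would become NaN
toy["size"] = toy["size"].map(size_to_number)
print(toy["size"].tolist())  # [4, 3, 2, 3, 2]
```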


What is X and what is y?

What is X_train, y_train, X_test, y_test?

What is the shape of X_train, y_train, X_test and y_test?

(3869, 3) (1290, 3) (3869,) (1290,)
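
A sketch of the split on hypothetical random data (`toy` is illustrative; `random_state=42` is an arbitrary assumption). `X` holds the features weight, age and height, and `y` holds the target size. The shapes printed above are consistent with the default `test_size` of 25% applied to 5159 records: 1290 test rows and 3869 training rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the same columns as the cleaned dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "weight": rng.integers(45, 100, size=100).astype(float),
    "age": rng.integers(4, 14, size=100) * 5,
    "height": rng.integers(150, 195, size=100),
    "size": rng.integers(0, 6, size=100),
})

X = toy.drop("size", axis=1)   # features
y = toy["size"]                # target

# Default split: 75% training, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```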


Which columns of X contain text?

Index([], dtype='object')


Which columns of X contain numbers?

Index(['age', 'height'], dtype='object')
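
Columns can be selected by dtype with `select_dtypes`. The sketch below uses a hypothetical two-row `X`; note that `include="number"` also picks up the float column `weight`, whereas the output above lists only `age` and `height`, so the notebook presumably selected on a narrower dtype.

```python
import pandas as pd

# Hypothetical feature frame with the notebook's column names
X = pd.DataFrame({
    "weight": [62.0, 59.0],
    "age": [30, 35],
    "height": [172, 167],
})

# Columns holding text (object dtype) and columns holding numbers
text_cols = X.select_dtypes(include="object").columns
num_cols = X.select_dtypes(include="number").columns
print(text_cols)
print(num_cols)
```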


Define the ColumnTransformer for applying standard scaling to all numeric columns.

## Regression

Define the model LinearRegression  

Define the data preparation (= ColumnTransformer for standard scaling) and modeling pipeline

Train the model

Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('std_scaler',
                                                  StandardScaler(),
                                                  Index(['weight', 'age', 'height'], dtype='object'))])),
                ('lin_reg', LinearRegression())])
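
A pipeline with the same step names as in the output above could be built as follows. This is a sketch on hypothetical random data, not the notebook's own cell; the synthetic target is an assumption made only so that the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train / y_train (not the course data)
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "weight": rng.uniform(45, 100, size=200),
    "age": rng.integers(4, 14, size=200) * 5,
    "height": rng.integers(150, 195, size=200),
})
y_train = (X_train["weight"] - 45) / 11 + rng.normal(0, 0.5, size=200)

num_cols = list(X_train.columns)

# Data preparation: standard-scale every numeric column
prep = ColumnTransformer(transformers=[("std_scaler", StandardScaler(), num_cols)])

# Preparation and model chained in one pipeline
pipeline = Pipeline(steps=[("prep", prep), ("lin_reg", LinearRegression())])
pipeline.fit(X_train, y_train)
```

Putting the scaler inside the pipeline means it is refitted on the training folds only during cross-validation, which avoids leaking information from the validation fold.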

What is the accuracy of the model?  
Use K-fold cross-validation with k = 3.  
Find an appropriate value for the `scoring` parameter in the [metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html) documentation.

0.9103762789910393
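
A possible way to obtain such a score, assuming `neg_mean_absolute_error` as the scoring value (an assumption; the notebook does not show which one was used). The data, the simplified pipeline and the `KFold` settings below are hypothetical.

```python
import numpy as np
from numpy import absolute, mean
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical regression data with three features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 3))
y_train = X_train @ np.array([1.3, 0.25, 0.0]) + 3.4 + rng.normal(0, 0.5, size=150)

pipeline = Pipeline(steps=[("std_scaler", StandardScaler()),
                           ("lin_reg", LinearRegression())])

# 3-fold cross-validation; scikit-learn reports errors as negative scores
cv = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X_train, y_train,
                         scoring="neg_mean_absolute_error", cv=cv)

# Flip the sign and average over the folds
mae = mean(absolute(scores))
print(mae)
```

For a regression model, an error measure such as the mean absolute error is more meaningful than classification accuracy: it states by how many sizes the prediction is off on average.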

What are the values of the intercept and the coefficients?  
Why are there 3 coefficients?  
What is the most important coefficient?

(3.4166451279400363, array([1.33419981, 0.25076335, 0.0072536 ]))
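
The fitted values can be read from the final pipeline step. There is one coefficient per feature (weight, age, height), and because the features are standardized their magnitudes are directly comparable: in the output above the first coefficient (weight) is by far the largest. The sketch below uses hypothetical data and, as a simplification, a plain `StandardScaler` step instead of the `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data in which weight drives the target
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "weight": rng.uniform(45, 100, size=200),
    "age": rng.uniform(20, 65, size=200),
    "height": rng.uniform(150, 195, size=200),
})
y_train = 0.08 * X_train["weight"] + 0.01 * X_train["age"] + rng.normal(0, 0.3, size=200)

pipeline = Pipeline(steps=[("std_scaler", StandardScaler()),
                           ("lin_reg", LinearRegression())])
pipeline.fit(X_train, y_train)

# Intercept and one coefficient per feature
lin_reg = pipeline.named_steps["lin_reg"]
intercept, coefs = lin_reg.intercept_, lin_reg.coef_
print(intercept, coefs)

# Rank the features by absolute coefficient size
ranking = pd.Series(coefs, index=X_train.columns).abs().sort_values(ascending=False)
print(ranking)
```

With standardized features the intercept equals the mean of `y_train`, since every scaled feature has mean zero on the training set.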

Apply the model to the test set.  

Calculate the Mean Absolute Error and the Root Mean Squared Error

The mean squared error is 1.1079904464895924
The mean absolute error is 0.9251567530335617
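
Both metrics can be computed with `sklearn.metrics`. A minimal sketch on hypothetical values of `y_test` and `y_pred`; the root is taken explicitly with `np.sqrt`, because the printed label above reads "mean squared error" while the task asks for the root.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true sizes and regression predictions
y_test = np.array([4, 3, 2, 3, 2])
y_pred = np.array([3.5, 3.0, 2.5, 4.0, 2.0])

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("The root mean squared error is", rmse)
print("The mean absolute error is", mae)
```

Both errors are expressed in sizes. The RMSE penalizes large misses more heavily than the MAE, so it is never smaller than the MAE.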


Interpret the results.

## Classification

Use the softmax classifier to try to predict the class (0, 1, 2, 3, 4, 5).  
What is the accuracy score?

accuracy score is 0.650301196969788
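
In scikit-learn, softmax regression is available as `LogisticRegression`, which fits a multinomial model for multiclass targets with its default solver. The sketch below uses hypothetical data; scoring by 3-fold cross-validation is an assumption about how the accuracy above was obtained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: six classes (0..5) driven mostly by weight
rng = np.random.default_rng(0)
weight = rng.uniform(45, 100, size=600)
X_train = np.column_stack([weight,
                           rng.uniform(20, 65, size=600),
                           rng.uniform(150, 195, size=600)])
y_train = np.digitize(weight + rng.normal(0, 5, size=600), [55, 65, 75, 85, 95])

softmax_clf = Pipeline(steps=[
    ("std_scaler", StandardScaler()),
    ("softmax", LogisticRegression(max_iter=1000)),
])

# Cross-validated accuracy on the training data
scores = cross_val_score(softmax_clf, X_train, y_train, cv=3, scoring="accuracy")
print("accuracy score is", scores.mean())

# Softmax output: one probability per class, summing to 1 for each record
softmax_clf.fit(X_train, y_train)
proba = softmax_clf.predict_proba(X_train[:5])
print(proba.round(2))
```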


Create and show the confusion matrix for the test set.

[[ 44  41   1   0   0   1]
 [ 22 110  28   1   1   5]
 [  0  42 101   3  11  10]
 [  1   9  55   3  26  17]
 [  0   4  34   1  30  87]
 [  1   2   6   3  18 572]]
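
The matrix can be produced with `confusion_matrix`. In scikit-learn's convention the rows are the actual classes and the columns the predicted classes. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted sizes (0 = XXS ... 5 = XXXL)
y_test = [0, 1, 2, 2, 5, 4]
y_pred = [0, 2, 2, 1, 5, 5]

# Passing labels fixes the row/column order, even for classes that are absent
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2, 3, 4, 5])
print(cm)
```

Here `cm[4, 5]` counts the records with actual size 4 that were predicted as size 5.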


The accuracy of the classifier is low, but we see that it often predicts only one size too high or too small.  
Calculate how many times
* the classifier was correct
* the classifier predicted the size to be one size higher than the actual size
* the classifier predicted the size to be one size smaller than the actual size


correct = 860
oneSizeTooHigh = 185
oneSizeTooSmall = 138
Total number of predictions = 1290
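
These counts can be read from the diagonals of the confusion matrix shown earlier, assuming rows are actual sizes and columns are predicted sizes (scikit-learn's convention). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the test set, copied from the output above
cm = np.array([
    [44,  41,   1, 0,  0,   1],
    [22, 110,  28, 1,  1,   5],
    [ 0,  42, 101, 3, 11,  10],
    [ 1,   9,  55, 3, 26,  17],
    [ 0,   4,  34, 1, 30,  87],
    [ 1,   2,   6, 3, 18, 572],
])

correct = np.trace(cm)                       # main diagonal: predicted == actual
one_size_too_high = np.diag(cm, k=1).sum()   # predicted == actual + 1
one_size_too_small = np.diag(cm, k=-1).sum() # predicted == actual - 1
total = cm.sum()

print("correct =", correct)
print("oneSizeTooHigh =", one_size_too_high)
print("oneSizeTooSmall =", one_size_too_small)
print("Total number of predictions =", total)
```

Taken together, 1183 of the 1290 predictions (about 91.7%) are correct or off by exactly one size.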
