In [53]:
# Import the necessary packages
import numpy as np                                  # Scientific computing
import pandas as pd                                 # Data frames
import matplotlib.pyplot as plt                     # Basic visualisation

from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

In [54]:
# Read the dataset
df = pd.read_csv('https://raw.githubusercontent.com/HOGENT-ML/course/main/datasets//clothes_size_prediction.csv')
df.head()

| | weight | age | height | size |
|---|---|---|---|---|
| 0 | 62 | 28.0 | 172.72 | XL |
| 1 | 59 | 36.0 | 167.64 | L |
| 2 | 61 | 34.0 | 165.1 | M |
| 3 | 65 | 27.0 | 175.26 | L |
| 4 | 62 | 45.0 | 172.72 | M |


## Take a look at the dataset

We'll try to predict the size based on the weight, age and height.   
  
Show some general information about the dataset.

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119734 entries, 0 to 119733
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119734 non-null  int64  
 1   age     119477 non-null  float64
 2   height  119404 non-null  float64
 3   size    119734 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 3.7+ MB


What is the number of records for each size?  

M: 29712  
S: 21924  
XXXL: 21359  
XL: 19119  
L: 17587  
XXS: 9964  
XXL: 69

size
M       29712
S       21924
XXXL    21359
XL      19119
L       17587
XXS      9964
XXL        69
Name: count, dtype: int64

Because there are only 69 records for XXL, remove those records from the dataset.
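A minimal sketch of this filtering step, using a small made-up frame (`toy`) instead of the real dataset. The column names follow the notebook; the values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "weight": [62, 59, 61, 65, 70],
    "age": [28.0, 36.0, 34.0, 27.0, 41.0],
    "height": [172.72, 167.64, 165.1, 175.26, 180.34],
    "size": ["XL", "L", "M", "XXL", "M"],
})

# Keep every row whose size is not 'XXL'. The original index is preserved.
toy_without_xxl = toy[toy["size"] != "XXL"]

print(toy_without_xxl["size"].value_counts())
```

Because the index is not reset, gaps remain where rows were removed, which matches the `Index: 119665 entries, 0 to 119733` line in the later output.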

Train a transformer to fill in the median value of the corresponding attribute for all missing values.

array([ 61. ,  32. , 165.1])
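One way to train such a transformer is scikit-learn's `SimpleImputer` with `strategy="median"`. The sketch below fits it on made-up numbers; the `toy` frame is an assumption, not the notebook's data. The learned medians are stored in `statistics_`, which is presumably what the array above shows for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric columns with a few missing values (illustrative only).
toy = pd.DataFrame({
    "weight": [62, 59, 61, 65],
    "age": [28.0, np.nan, 34.0, 27.0],
    "height": [172.72, 167.64, np.nan, 175.26],
})

# Learn the median of every column; missing values are ignored while fitting.
imputer = SimpleImputer(strategy="median")
imputer.fit(toy)

print(imputer.statistics_)
```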

Apply the imputer to the dataset and check the results.

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Index: 119665 entries, 0 to 119733
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119665 non-null  float64
 1   age     119665 non-null  float64
 2   height  119665 non-null  float64
 3   size    119665 non-null  object 
dtypes: float64(3), object(1)
memory usage: 4.6+ MB
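The output above shows that the imputer returns a NumPy array, so the result has to be put back into a DataFrame. A possible sketch on made-up data follows; `toy`, `numeric_cols` and `toy_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values in age and height (illustrative only).
toy = pd.DataFrame({
    "weight": [62, 59, 61],
    "age": [28.0, np.nan, 34.0],
    "height": [172.72, 167.64, np.nan],
    "size": ["XL", "L", "M"],
})
numeric_cols = ["weight", "age", "height"]

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy[numeric_cols])
print(type(filled))  # a plain NumPy array, column names are lost

# Rebuild a DataFrame with the original column names and index,
# then re-attach the untouched 'size' column.
toy_filled = pd.DataFrame(filled, columns=numeric_cols, index=toy.index)
toy_filled["size"] = toy["size"]
toy_filled.info()
```

All numeric columns come back as `float64`, which is why `weight` changes from `int64` to `float64` in the output above.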


At first sight this seems to be quite a large dataset, but is that really the case?  
First we are going to change the datatype of height from float to integer.


It seems reasonable to round the ages to the nearest multiple of five.

Change the datatype of age from float to integer.
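The height and age conversions could look like the sketch below. The data is made up. `astype(int)` truncates the decimals (167.64 becomes 167), which is an assumption about how the notebook converts height. Note that pandas rounds exact halves to the nearest even number, so ages such as 22.5 or 27.5 need extra care.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "age": [28.0, 36.0, 34.0, 27.0, 45.0],
    "height": [172.72, 167.64, 165.1, 175.26, 172.72],
})

# Height: float -> integer (decimals are truncated, not rounded).
toy["height"] = toy["height"].astype(int)

# Age: round to the nearest multiple of five, then cast to integer.
toy["age"] = ((toy["age"] / 5).round() * 5).astype(int)

print(toy)
```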

We drop duplicate rows in the dataset.

<class 'pandas.core.frame.DataFrame'>
Index: 11330 entries, 0 to 119721
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  11330 non-null  float64
 1   age     11330 non-null  int64  
 2   height  11330 non-null  int64  
 3   size    11330 non-null  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 442.6+ KB
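Dropping exact duplicates can be sketched with `DataFrame.drop_duplicates`, shown here on a made-up frame. Coarsening height and age first is what makes many rows identical.

```python
import pandas as pd

# Toy data in which rows 0 and 2 are identical (illustrative only).
toy = pd.DataFrame({
    "weight": [62.0, 59.0, 62.0, 65.0],
    "age": [30, 35, 30, 25],
    "height": [172, 167, 172, 175],
    "size": ["XL", "L", "XL", "L"],
})

# Keep the first occurrence of every fully identical row.
deduplicated = toy.drop_duplicates()

print(len(toy), "->", len(deduplicated))
```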


How many records are left?

We want to know if there are any 'wrong duplicates' in the dataset, i.e. records with the same values for weight, age and height but a different size. So we count the number of unique sizes (`nunique`) for each combination of weight, age and height.

<class 'pandas.core.frame.DataFrame'>
Index: 5159 entries, 0 to 119682
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  5159 non-null   float64
 1   age     5159 non-null   int64  
 2   height  5159 non-null   int64  
 3   size    5159 non-null   object 
dtypes: float64(1), int64(2), object(1)
memory usage: 201.5+ KB
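One possible way to detect such conflicting records is a `groupby` on the three features combined with `nunique`. The sketch below then drops every combination that maps to more than one size. That removal rule is an assumption, since the notebook does not show how the conflicts are resolved.

```python
import pandas as pd

features = ["weight", "age", "height"]

# Toy data: rows 0 and 1 share the same features but have different sizes.
toy = pd.DataFrame({
    "weight": [62.0, 62.0, 59.0, 70.0],
    "age": [30, 30, 35, 40],
    "height": [172, 172, 167, 180],
    "size": ["M", "L", "L", "XL"],
})

# Number of distinct sizes observed for each (weight, age, height) combination,
# broadcast back to every row of that combination.
sizes_per_combination = toy.groupby(features)["size"].transform("nunique")

# Keep only the combinations that always map to the same size.
consistent = toy[sizes_per_combination == 1]

print(len(toy), "->", len(consistent))
```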


How many records are left?

size
XXXL    2326
M        689
XL       659
S        624
L        473
XXS      388
Name: count, dtype: int64

Check if the dataset is heavily skewed.

size
XXXL    2326
M        689
XL       659
S        624
L        473
XXS      388
Name: count, dtype: int64

Because we want to apply regression first, map the sizes to numbers as follows:  
'XXS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4, 'XXXL': 5

| | weight | age | height | size |
|---|---|---|---|---|
| 72950 | 48.0 | 35 | 170 | 1 |
| 2697 | 97.0 | 40 | 167 | 5 |
| 30546 | 43.0 | 45 | 152 | 1 |
| 203 | 90.0 | 30 | 165 | 5 |
| 2500 | 89.0 | 30 | 170 | 5 |
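The mapping from size labels to numbers can be sketched with `Series.map` and a dictionary. The dictionary is taken from the text above; the toy frame is made up.

```python
import pandas as pd

# Mapping as given in the notebook.
size_to_number = {"XXS": 0, "S": 1, "M": 2, "L": 3, "XL": 4, "XXXL": 5}

# Toy data (illustrative only).
toy = pd.DataFrame({
    "weight": [48.0, 97.0, 60.0],
    "size": ["S", "XXXL", "M"],
})

# Replace every label by its number; labels missing from the dict would become NaN.
toy["size"] = toy["size"].map(size_to_number)

print(toy)
```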


What is X and what is y?

What is X_train, y_train, X_test, y_test?

What is the shape of X_train, y_train, X_test and y_test?

(4127, 3)
(1032, 3)
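The shapes above are consistent with an 80/20 split of the 5159 remaining records. The sketch below uses placeholder arrays of the same size. `test_size=0.2` and `random_state=42` are assumptions, because the notebook does not show the actual call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and features as the cleaned dataset.
X_placeholder = np.zeros((5159, 3))   # weight, age, height
y_placeholder = np.zeros(5159)        # numeric size

X_train, X_test, y_train, y_test = train_test_split(
    X_placeholder, y_placeholder, test_size=0.2, random_state=42
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```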


Which columns of X contain text?

Which columns of X contain numbers?

Define a ColumnTransformer that applies standard scaling to all numeric columns.  
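A possible sketch of these three steps on a made-up feature frame: select the text and numeric columns with `select_dtypes`, then scale the numeric ones with `StandardScaler` inside a `ColumnTransformer`. The names `X_toy`, `numeric_cols`, `text_cols` and `preprocessor` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values only).
X_toy = pd.DataFrame({
    "weight": [48.0, 97.0, 43.0, 90.0],
    "age": [35, 40, 45, 30],
    "height": [170, 167, 152, 165],
})

# Split the columns by dtype; here there are no text columns left in X.
text_cols = X_toy.select_dtypes(include="object").columns.tolist()
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
print(text_cols, numeric_cols)

# Standardise every numeric column to zero mean and unit variance.
preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)]
)
X_scaled = preprocessor.fit_transform(X_toy)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```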

## Regression

Define the model LinearRegression  

Define the data preparation (= ColumnTransformer for standard scaling) and modeling pipeline

Train the model

What is the accuracy of the model?  
Use K-fold cross-validation with k = 3.  
Find an appropriate value for the `scoring` parameter in the scikit-learn documentation on [metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html).

np.float64(0.9134307117763821)
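The reported value is close to the mean absolute error on the test set further down, which suggests `scoring="neg_mean_absolute_error"`; that is an assumption. The sketch below runs 3-fold cross-validation on synthetic data with a scaling plus `LinearRegression` pipeline. For brevity it uses `StandardScaler` directly instead of a `ColumnTransformer`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (not the clothes dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(90, 3))
y_toy = 2.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=90)

# Data preparation and model in one pipeline.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# scikit-learn maximises scores, so the MAE is reported as a negative number.
cv = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(
    pipeline, X_toy, y_toy, scoring="neg_mean_absolute_error", cv=cv
)

# Flip the sign and average over the three folds.
mae = np.mean(np.absolute(scores))
print(mae)
```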

What are the values of the intercept and the coefficients?  
Why do we have 3 coefficients?  
What is the most important coefficient?

np.float64(3.433971407802278)

array([1.33094381, 0.2484815 , 0.00512916])
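The fitted values can be read from the regression step of the pipeline, as sketched below on synthetic data. The step names `"scale"` and `"model"` are hypothetical. There is one coefficient per input feature. Because the features are standardised, the coefficient magnitudes are directly comparable, and the intercept equals the mean of the training target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on roughly the scale of weight, age and height (made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[70.0, 35.0, 170.0], scale=[15.0, 8.0, 7.0], size=(200, 3))
y_toy = 0.1 * X_toy[:, 0] + 0.02 * X_toy[:, 1] + rng.normal(scale=0.2, size=200)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X_toy, y_toy)

# Access the fitted LinearRegression inside the pipeline.
model = pipeline.named_steps["model"]
print(model.intercept_)   # equals y_toy.mean() because the features are centred
print(model.coef_)        # one coefficient per feature
```

Assuming the columns of X are ordered weight, age, height, the notebook's output would point to weight as the most influential feature.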

Apply the model to the test set.  

Calculate the Mean Absolute Error and the Root Mean Squared Error

root mean squared error: 1.103009248760387
mean absolute error: 0.9203436604478167
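Both metrics can be computed with `sklearn.metrics`, as in the sketch below on a few made-up predictions. The RMSE is taken as the square root of the mean squared error so that the code does not depend on a recent scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true sizes and regression predictions.
y_true = np.array([0, 1, 2, 3, 4, 5])
y_pred = np.array([0.5, 1.0, 2.5, 3.0, 3.0, 5.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

print("root mean squared error:", rmse)
print("mean absolute error:", mae)
```

Since the sizes are encoded as consecutive integers, a mean absolute error of about 0.92 means the regression is off by almost one full size on average.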


Interpret the results.

## Classification

Use the softmax classifier to try to predict the class (0, 1, 2, 3, 4, 5).  
What is the accuracy score?

np.float64(0.6520489781536293)
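In recent scikit-learn versions, `LogisticRegression` with its default solver fits a multinomial (softmax) model when the target has more than two classes. The sketch below uses synthetic three-class data from `make_blobs`. The pipeline layout and the 3-fold cross-validation are assumptions about how the score above was obtained.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 3 features (not the clothes dataset).
X_toy, y_toy = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

softmax = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Mean accuracy over 3 cross-validation folds.
scores = cross_val_score(softmax, X_toy, y_toy, scoring="accuracy", cv=3)
print(scores.mean())

# The classifier outputs one probability per class; each row sums to 1.
softmax.fit(X_toy, y_toy)
probabilities = softmax.predict_proba(X_toy[:5])
print(probabilities.sum(axis=1))
```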

Create and show the confusion matrix for the test set.

array([[ 36,  31,   4,   0,   0,   0],
       [ 22,  81,  32,   0,   1,   6],
       [  0,  22, 101,   0,  10,   5],
       [  1,   4,  44,   0,  28,  15],
       [  0,   3,  27,   0,  38,  56],
       [  1,   1,   7,   0,  21, 435]])

The accuracy of the classifier is low, but we see that the prediction is often only one size too large or too small.  
Calculate how many times:
* the classifier was correct
* the classifier predicted the size to be one size larger than the actual size
* the classifier predicted the size to be one size smaller than the actual size


np.int64(691)

np.int64(147)

np.int64(109)
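These three counts can be read straight from the confusion matrix, assuming the scikit-learn convention that rows are actual classes and columns are predicted classes. Correct predictions lie on the main diagonal, predictions one size too large on the diagonal just above it, and predictions one size too small on the diagonal just below it. The matrix below is copied from the output above.

```python
import numpy as np

# Confusion matrix from the notebook (rows: actual size, columns: predicted size).
cm = np.array([
    [36, 31,   4, 0,  0,   0],
    [22, 81,  32, 0,  1,   6],
    [ 0, 22, 101, 0, 10,   5],
    [ 1,  4,  44, 0, 28,  15],
    [ 0,  3,  27, 0, 38,  56],
    [ 1,  1,   7, 0, 21, 435],
])

correct = np.trace(cm)                    # predicted == actual
one_too_large = np.trace(cm, offset=1)    # predicted == actual + 1
one_too_small = np.trace(cm, offset=-1)   # predicted == actual - 1

print(correct, one_too_large, one_too_small)

# Share of test records that are at most one size off.
within_one_size = (correct + one_too_large + one_too_small) / cm.sum()
print(within_one_size)
```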