# **Data Cleaning**

First I'm going to look at the data we are working with:

In [1]:
import numpy as np
import pandas as pd

url_train = "data/train.csv"
df_train = pd.read_csv(url_train)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 82 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   id                                      3960 non-null   object 
 1   Basic_Demos-Enroll_Season               3960 non-null   object 
 2   Basic_Demos-Age                         3960 non-null   int64  
 3   Basic_Demos-Sex                         3960 non-null   int64  
 4   CGAS-Season                             2555 non-null   object 
 5   CGAS-CGAS_Score                         2421 non-null   float64
 6   Physical-Season                         3310 non-null   object 
 7   Physical-BMI                            3022 non-null   float64
 8   Physical-Height                         3027 non-null   float64
 9   Physical-Weight                         3076 non-null   float64
 10  Physical-Waist_Circumference            898 non-null    floa

In [2]:
dataset2 = df_train.astype('object')
dataset2.describe().T

Unnamed: 0,count,unique,top,freq
id,3960,3960,00008ff9,1
Basic_Demos-Enroll_Season,3960,4,Spring,1127
Basic_Demos-Age,3960,18,8,490
Basic_Demos-Sex,3960,2,0,2484
CGAS-Season,2555,4,Spring,697
...,...,...,...,...
SDS-SDS_Total_Raw,2609.0,62.0,35.0,132.0
SDS-SDS_Total_T,2606.0,49.0,50.0,132.0
PreInt_EduHx-Season,3540,4,Spring,985
PreInt_EduHx-computerinternet_hoursday,3301.0,4.0,0.0,1524.0


We can see that the data come in three types: int, float and string.
Apart from `id`, the `string` columns take only 4 possible values (the seasons), so they are good candidates for one-hot encoding.

In [3]:
num_inst, num_features = df_train.shape

for f in range(num_features):
    col = df_train.iloc[:, f].astype(str)
    print(f, np.unique(col))

0 ['00008ff9' '000fd460' '00105258' ... 'ffcd4dbd' 'ffed1dd5' 'ffef538e']
1 ['Fall' 'Spring' 'Summer' 'Winter']
2 ['10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22' '5' '6'
 '7' '8' '9']
3 ['0' '1']
4 ['Fall' 'Spring' 'Summer' 'Winter' 'nan']
5 ['25.0' '30.0' '31.0' '33.0' '35.0' '38.0' '39.0' '40.0' '41.0' '42.0'
 '44.0' '45.0' '46.0' '47.0' '48.0' '49.0' '50.0' '51.0' '52.0' '53.0'
 '54.0' '55.0' '56.0' '57.0' '58.0' '59.0' '60.0' '61.0' '62.0' '63.0'
 '64.0' '65.0' '66.0' '67.0' '68.0' '69.0' '70.0' '71.0' '72.0' '73.0'
 '74.0' '75.0' '76.0' '77.0' '78.0' '79.0' '80.0' '81.0' '82.0' '83.0'
 '85.0' '87.0' '88.0' '90.0' '91.0' '92.0' '93.0' '95.0' '999.0' 'nan']
6 ['Fall' 'Spring' 'Summer' 'Winter' 'nan']
7 ['0.0' '10.28168847' '10.67543945' ... '9.693766159' '9.959166667' 'nan']
8 ['33.0' '36.0' '37.5' '39.0' '39.5' '40.0' '40.5' '41.0' '41.5' '42.0'
 '42.25' '42.5' '42.75' '42.9' '43.0' '43.15' '43.2' '43.25' '43.4' '43.5'
 '43.75' '44.0' '44.2' '44.25' '44.3' '44.4' 

Another thing we can see is that most of the features contain `nan` values, so we also have to deal with missing values.
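To quantify how much is missing instead of reading it off the unique values, one option is to compute the share of `NaN` per column. This is only a sketch on a toy frame: the values are made up and `toy` stands in for `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "Basic_Demos-Age": [9, 8, 10, 5],
    "CGAS-CGAS_Score": [61.0, np.nan, 70.0, np.nan],
    "Physical-Waist_Circumference": [np.nan, np.nan, np.nan, 20.0],
})

# Fraction of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same two calls applied to `df_train` would rank the real columns by how incomplete they are.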

We can get additional clues by looking at `data_dictionary.csv`:

In [4]:
url = "data/data_dictionary.csv"
description = pd.read_csv(url)
description

Unnamed: 0,Instrument,Field,Description,Type,Values,Value Labels
0,Identifier,id,Participant's ID,str,,
1,Demographics,Basic_Demos-Enroll_Season,Season of enrollment,str,"Spring, Summer, Fall, Winter",
2,Demographics,Basic_Demos-Age,Age of participant,float,,
3,Demographics,Basic_Demos-Sex,Sex of participant,categorical int,01,"0=Male, 1=Female"
4,Children's Global Assessment Scale,CGAS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
...,...,...,...,...,...,...
76,Sleep Disturbance Scale,SDS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
77,Sleep Disturbance Scale,SDS-SDS_Total_Raw,Total Raw Score,int,,
78,Sleep Disturbance Scale,SDS-SDS_Total_T,Total T-Score,int,,
79,Internet Use,PreInt_EduHx-Season,Season of participation,str,"Spring, Summer, Fall, Winter",


Now that we have additional information about the dataset, we can start the data cleaning process. We will start by removing the rows where the value of `sii` is `NaN`, since it is the target of our supervised learning task. <br>
We will then remove the `id` column, since it only serves as a "primary key" to distinguish the rows and is not relevant for a classification task.

In [5]:
# remove rows where the value of sii is NaN
df_train.dropna(subset=['sii'], inplace=True)

# remove the column id
del df_train['id'] 

We can now start dealing with the missing values.
I noticed that the dataset contains a group of physical measures, such as weight and height, that for children depend mostly on age and sex.<br>
Luckily, age and sex are never null in our dataset, so I thought we could use them to fill the rows where the physical measures are missing with the average value for a child of that age and sex. <br>
I did this only for the rows where all the physical measures are missing; when some are missing and some are present, I thought that a KNN imputer was a better idea. <br>
For the averages I used an external source, since I thought it would be more reliable than estimating the values from the dataset itself with the mean or other methods. <br>
The external source is a CSV file that I created from the NHANES data, which assess the health and nutritional status of children in the United States. <br>
I chose this approach because it distorts the measures less than a global mean would: a 7-year-old girl isn't going to be as tall as a 15-year-old boy. <br>

In [6]:
# load external file
physical_measures_df = pd.read_csv('data/physical_measures.csv')

# add the columns of the average physical measures (given an age and a sex) to each row in the dataframe
df_train = df_train.merge( physical_measures_df, on=['Basic_Demos-Age', 'Basic_Demos-Sex'], suffixes=('', '_avg'))
cols = ['Physical-BMI','Physical-Height','Physical-Weight','Physical-Waist_Circumference','Physical-Diastolic_BP','Physical-HeartRate','Physical-Systolic_BP']
# a list of boolean corresponding to each row -> true if the physical measures are all nan
tot_nan_phys = df_train[cols].isna().all(axis=1)

for col in cols:
    # first fill the rows with only nan values for the physical measure with the average
    df_train.loc[tot_nan_phys, col] = df_train.loc[tot_nan_phys, f"{col}_avg"]
    # then remove the average columns
    del df_train[f"{col}_avg"]
    #print(np.unique(df_train[col]))
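One detail of the merge may be worth checking. With pandas' default `how='inner'`, any row whose age/sex pair is absent from the lookup table would be dropped silently. The sketch below uses made-up toy frames (not the real `physical_measures.csv`) to show a left merge that keeps every row, with `validate` guarding against duplicate keys in the lookup.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values); the real lookup table is data/physical_measures.csv
left = pd.DataFrame({
    "Basic_Demos-Age": [7, 15, 30],
    "Basic_Demos-Sex": [1, 0, 0],
    "Physical-Height": [np.nan, 65.0, np.nan],
})
lookup = pd.DataFrame({
    "Basic_Demos-Age": [7, 15],
    "Basic_Demos-Sex": [1, 0],
    "Physical-Height": [48.0, 67.0],
})

merged = left.merge(
    lookup,
    on=["Basic_Demos-Age", "Basic_Demos-Sex"],
    how="left",              # keep rows that have no match in the lookup
    validate="many_to_one",  # raise if the lookup contains duplicate keys
    suffixes=("", "_avg"),
)
# Fill only the missing heights, then drop the helper column
merged["Physical-Height"] = merged["Physical-Height"].fillna(merged["Physical-Height_avg"])
merged = merged.drop(columns="Physical-Height_avg")
print(merged)
```

Here the row with age 30 has no match, so it survives the merge with its height still missing and can be handled by a later imputation step.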


Now that we have enriched the rows that carried very little information, we can work on the rest of the dataset to handle the other missing values. <br>
What we want to do is:
- replace the missing values of the numerical columns with the KNN imputer
- replace the missing values of the categorical columns with the mode

But first it's better to separate the features of the classification task from the target to be predicted: `sii`. <br>
It's also important to separate the test set from the training set before imputing. Otherwise the statistics used to fill the missing values would be computed on both sets together, and information from the test set would leak into the training data.
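A minimal sketch of what this separation means in practice, on made-up arrays: the imputer learns its statistics from the training rows only (`fit_transform`) and is then merely applied to the test rows (`transform`).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up one-column arrays, for illustration only
X_tr = np.array([[1.0], [3.0], [np.nan]])
X_te = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy="mean")
X_tr_filled = imp.fit_transform(X_tr)  # learns mean = 2.0 from the train rows only
X_te_filled = imp.transform(X_te)      # reuses the train mean, ignores the test value 100.0
print(X_te_filled)
```

The missing test value is filled with 2.0, the train mean, so the large test value has no influence on it.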

In [7]:
from sklearn.model_selection import train_test_split
X = df_train.iloc[:, :-1]
y = df_train.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

We can now start replacing the missing values, splitting the features into numerical and categorical and operating on each group separately:

In [8]:
# array of boolean saying if a specific column is numeric or not
is_numerical = np.array([np.issubdtype(dtype, np.number) for dtype in X.dtypes])  
numerical_idx = np.flatnonzero(is_numerical) 
# takes only the column that are numerical
new_X_train = X_train.iloc[:, numerical_idx]
new_X_test = X_test.iloc[:, numerical_idx]
new_X_train.head(10)

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,...,PCIAT-PCIAT_15,PCIAT-PCIAT_16,PCIAT-PCIAT_17,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-computerinternet_hoursday
247,9,0,61.0,22.96,53.5,93.5,26.2,65.4,90.5,100.8,...,3.0,1.0,3.0,1.0,0.0,0.0,26.0,33.0,47.0,0.0
2488,8,0,,15.866065,46.5,48.8,22.0,79.0,72.0,124.0,...,1.0,0.0,1.0,0.0,1.0,1.0,29.0,62.0,85.0,2.0
2318,9,1,70.0,19.424006,53.2,78.2,,63.0,99.0,115.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.0,47.0,0.0
347,10,1,,20.902811,52.0,80.4,27.0,72.0,92.0,127.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.0,34.0,49.0,2.0
1090,5,1,60.0,15.143991,42.0,38.0,20.0,62.0,92.0,109.0,...,0.0,0.0,0.0,1.0,0.0,0.0,7.0,,,2.0
929,7,0,60.0,23.67704,50.0,84.2,,76.0,105.0,126.0,...,5.0,4.0,4.0,5.0,4.0,4.0,75.0,55.0,76.0,2.0
2225,9,1,70.0,18.023942,51.5,68.0,,64.0,82.0,113.0,...,2.0,2.0,1.0,2.0,1.0,1.0,26.0,37.0,53.0,2.0
2678,18,0,70.0,33.658295,72.0,248.2,,80.0,89.0,145.0,...,4.0,1.0,4.0,0.0,3.0,0.0,49.0,41.0,58.0,3.0
2397,8,0,82.0,16.067086,52.0,61.8,25.0,57.0,91.0,106.0,...,0.0,1.0,2.0,1.0,2.0,0.0,14.0,41.0,58.0,0.0
1744,16,1,65.0,39.337793,64.0,229.2,,70.0,83.0,125.0,...,3.0,1.0,2.0,2.0,1.0,1.0,38.0,59.0,81.0,0.0


Now that we have all the numerical columns, we can replace the `NaN` values of each column with values taken from the nearest neighbors. <br>
To do this I used `KNNImputer`, which fills each missing value with the mean of that feature over the k nearest rows, so that correlations between features are taken into account. <br>
It's important to scale the values before using the k-nearest neighbors method; otherwise features with a large numeric range would dominate the distance and features with a small range would barely count. Naturally, I will rescale the values back to their original units once the imputation is done. <br>
I decided to use it because in this case it's much better than the column mean: as already mentioned, with children of 6 and 19 years old in the same dataset, the overall mean doesn't represent any specific kid well.
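To make the mechanism concrete, here is a small sketch on made-up data (two columns, age and height, with one height missing). It only illustrates the scale, impute, unscale sequence and is not the notebook's actual data.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data (made up): columns are age and height, one height is missing
X_toy = np.array([
    [6.0, 45.0],
    [7.0, 47.0],
    [8.0, np.nan],
    [17.0, 68.0],
    [18.0, 70.0],
])

scaler = StandardScaler()            # NaNs are ignored when fitting and kept when transforming
imputer = KNNImputer(n_neighbors=2)  # uniform weights: plain mean of the neighbors

scaled = scaler.fit_transform(X_toy)
filled = scaler.inverse_transform(imputer.fit_transform(scaled))
print(filled[2])
```

The missing height of the 8-year-old is filled with approximately 46, the mean height of the two children closest in age (6 and 7), rather than with the overall column mean of 57.5.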

In [9]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
imputer = KNNImputer(n_neighbors=3)

# train set
scaled_train = scaler.fit_transform(new_X_train)
X_array = imputer.fit_transform(scaled_train)
X_array = scaler.inverse_transform(X_array)
new_X_train = pd.DataFrame(X_array, columns=new_X_train.columns, index=new_X_train.index) # convert into a dataframe since X_array is of type ndarray

# test set: reuse the scaler and the imputer fitted on the train set (no refitting on test data)
scaled_test = scaler.transform(new_X_test)
X_array = imputer.transform(scaled_test)
X_array = scaler.inverse_transform(X_array)
new_X_test = pd.DataFrame(X_array, columns=new_X_test.columns, index=new_X_test.index)

new_X_train.head(10)

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,...,PCIAT-PCIAT_15,PCIAT-PCIAT_16,PCIAT-PCIAT_17,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-computerinternet_hoursday
247,9.0,0.0,61.0,22.96,53.5,93.5,26.2,65.4,90.5,100.8,...,3.0,1.0,3.0,1.0,0.0,1.110223e-16,26.0,33.0,47.0,0.0
2488,8.0,0.0,62.333333,15.866065,46.5,48.8,22.0,79.0,72.0,124.0,...,1.0,0.0,1.0,2.220446e-16,1.0,1.0,29.0,62.0,85.0,2.0
2318,9.0,1.0,70.0,19.424006,53.2,78.2,24.8,63.0,99.0,115.0,...,0.0,0.0,0.0,2.220446e-16,0.0,1.110223e-16,0.0,33.0,47.0,0.0
347,10.0,1.0,60.333333,20.902811,52.0,80.4,27.0,72.0,92.0,127.0,...,0.0,0.0,0.0,2.220446e-16,0.0,1.110223e-16,2.0,34.0,49.0,2.0
1090,5.0,1.0,60.0,15.143991,42.0,38.0,20.0,62.0,92.0,109.0,...,0.0,0.0,0.0,1.0,0.0,1.110223e-16,7.0,32.666667,47.0,2.0
929,7.0,0.0,60.0,23.67704,50.0,84.2,27.733333,76.0,105.0,126.0,...,5.0,4.0,4.0,5.0,4.0,4.0,75.0,55.0,76.0,2.0
2225,9.0,1.0,70.0,18.023942,51.5,68.0,23.6,64.0,82.0,113.0,...,2.0,2.0,1.0,2.0,1.0,1.0,26.0,37.0,53.0,2.0
2678,18.0,0.0,70.0,33.658295,72.0,248.2,37.0,80.0,89.0,145.0,...,4.0,1.0,4.0,2.220446e-16,3.0,1.110223e-16,49.0,41.0,58.0,3.0
2397,8.0,0.0,82.0,16.067086,52.0,61.8,25.0,57.0,91.0,106.0,...,0.0,1.0,2.0,1.0,2.0,1.110223e-16,14.0,41.0,58.0,0.0
1744,16.0,1.0,65.0,39.337793,64.0,229.2,37.0,70.0,83.0,125.0,...,3.0,1.0,2.0,2.0,1.0,1.0,38.0,59.0,81.0,0.0


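As we can see, each missing value has been replaced with the mean of that feature over the 3 nearest neighbors (for example, the missing `CGAS-CGAS_Score` of row 2488 became 62.333333). The tiny values such as `1.110223e-16` are floating-point residue from scaling and unscaling, and stand for 0. <br>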
Let's now check if there are still `NaN` values in the numerical features:

In [10]:
num_inst, num_features = new_X_train.shape
for f in range(num_features):
    col = new_X_train.iloc[:, f].astype(str)
    print(f, np.unique(col))

0 ['10.0' '11.0' '12.0' '13.0' '14.0' '15.0' '16.0' '17.0' '18.0' '19.0'
 '20.0' '21.0' '22.0' '5.0' '6.0' '7.0' '8.0' '9.0']
1 ['0.0' '1.0']
2 ['25.0' '31.0' '33.0' '35.0' '38.0' '39.0' '40.0' '41.0' '42.0' '44.0'
 '45.0' '47.0' '48.0' '48.33333333333333' '48.666666666666664'
 '48.66666666666667' '49.0' '50.0' '50.333333333333336' '51.0'
 '51.666666666666664' '52.0' '53.0' '53.33333333333333'
 '53.666666666666664' '53.66666666666667' '54.0' '54.33333333333333'
 '54.333333333333336' '54.666666666666664' '55.0' '55.333333333333336'
 '55.666666666666664' '56.0' '56.666666666666664' '57.0'
 '57.333333333333336' '57.666666666666664' '57.66666666666667' '58.0'
 '58.33333333333333' '58.666666666666664' '59.0' '59.333333333333336'
 '59.666666666666664' '60.0' '60.333333333333336' '60.666666666666664'
 '61.0' '61.666666666666664' '62.0' '62.333333333333336'
 '62.666666666666664' '63.0' '63.333333333333336' '63.666666666666664'
 '64.0' '64.33333333333333' '64.66666666666667' '65.0' '65.33333333

We can be satisfied with this first part of the data processing.
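Scanning the unique values works, but a direct count is easier to read. A minimal sketch on a made-up frame (`toy` is a stand-in for `new_X_train`):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for new_X_train
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0]})

n_missing = int(toy.isna().sum().sum())                  # total number of NaN cells
cols_with_nan = toy.columns[toy.isna().any()].tolist()   # columns that still contain NaN
print(n_missing, cols_with_nan)
```

On a fully imputed frame the count would be 0 and the list empty.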

Now we have to handle the categorical features. We have 2 things to do:
- replace missing values with the mode
- transform them with One-Hot Encoding <br>

I decided to use the mode to replace missing values because the number of categories is small (the four seasons), so a simple strategy is enough and no complex modelling is needed.<br>
The mode imputer fills the missing values with the most common value of the selected feature.

In [12]:
from sklearn.impute import SimpleImputer

categorical_idx = np.flatnonzero(is_numerical==False)
categorical_X_train = X_train.iloc[:, categorical_idx]
categorical_X_test = X_test.iloc[:, categorical_idx]

imputer = SimpleImputer(strategy='most_frequent')
X_array = imputer.fit_transform(categorical_X_train)
categorical_X_train = pd.DataFrame(X_array, columns=categorical_X_train.columns, index=categorical_X_train.index)

# test set: reuse the modes learned on the train set
X_array = imputer.transform(categorical_X_test)
categorical_X_test = pd.DataFrame(X_array, columns=categorical_X_test.columns, index=categorical_X_test.index)

categorical_X_train.head(10)

Unnamed: 0,Basic_Demos-Enroll_Season,CGAS-Season,Physical-Season,Fitness_Endurance-Season,FGC-Season,BIA-Season,PAQ_A-Season,PAQ_C-Season,PCIAT-Season,SDS-Season,PreInt_EduHx-Season
247,Spring,Fall,Spring,Spring,Spring,Summer,Winter,Summer,Summer,Summer,Spring
2488,Summer,Spring,Summer,Spring,Summer,Summer,Winter,Spring,Summer,Summer,Summer
2318,Spring,Winter,Summer,Summer,Summer,Summer,Winter,Spring,Summer,Summer,Summer
347,Fall,Spring,Fall,Spring,Fall,Fall,Winter,Fall,Fall,Fall,Fall
1090,Winter,Summer,Winter,Spring,Winter,Winter,Winter,Spring,Winter,Spring,Winter
929,Fall,Spring,Winter,Winter,Winter,Winter,Winter,Spring,Winter,Winter,Fall
2225,Spring,Fall,Summer,Summer,Summer,Fall,Winter,Summer,Summer,Summer,Spring
2678,Summer,Fall,Summer,Spring,Spring,Summer,Winter,Spring,Fall,Fall,Summer
2397,Winter,Spring,Winter,Spring,Winter,Winter,Winter,Spring,Winter,Winter,Winter
1744,Fall,Winter,Winter,Spring,Winter,Winter,Winter,Spring,Fall,Fall,Fall


In [None]:
# now that we have no more missing values, we can handle categorical labels using one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes a category unseen during fit as all zeros instead of raising
oh = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# fit the encoder on the train set only, so train and test get the same columns
encoded = oh.fit_transform(categorical_X_train)
encoded_cols = oh.get_feature_names_out()
print(encoded_cols)
# we now add the encoded string features to the new data frame
new_X_train = pd.concat(
    [new_X_train, pd.DataFrame(encoded, columns=encoded_cols, index=new_X_train.index)], axis=1)

# test set: only transform, reusing the categories learned on the train set
encoded = oh.transform(categorical_X_test)
new_X_test = pd.concat(
    [new_X_test, pd.DataFrame(encoded, columns=encoded_cols, index=new_X_test.index)], axis=1)

['Basic_Demos-Enroll_Season_Fall' 'Basic_Demos-Enroll_Season_Spring'
 'Basic_Demos-Enroll_Season_Summer' 'Basic_Demos-Enroll_Season_Winter'
 'CGAS-Season_Fall' 'CGAS-Season_Spring' 'CGAS-Season_Summer'
 'CGAS-Season_Winter' 'Physical-Season_Fall' 'Physical-Season_Spring'
 'Physical-Season_Summer' 'Physical-Season_Winter'
 'Fitness_Endurance-Season_Fall' 'Fitness_Endurance-Season_Spring'
 'Fitness_Endurance-Season_Summer' 'Fitness_Endurance-Season_Winter'
 'FGC-Season_Fall' 'FGC-Season_Spring' 'FGC-Season_Summer'
 'FGC-Season_Winter' 'BIA-Season_Fall' 'BIA-Season_Spring'
 'BIA-Season_Summer' 'BIA-Season_Winter' 'PAQ_A-Season_Fall'
 'PAQ_A-Season_Spring' 'PAQ_A-Season_Summer' 'PAQ_A-Season_Winter'
 'PAQ_C-Season_Fall' 'PAQ_C-Season_Spring' 'PAQ_C-Season_Summer'
 'PAQ_C-Season_Winter' 'PCIAT-Season_Fall' 'PCIAT-Season_Spring'
 'PCIAT-Season_Summer' 'PCIAT-Season_Winter' 'SDS-Season_Fall'
 'SDS-Season_Spring' 'SDS-Season_Summer' 'SDS-Season_Winter'
 'PreInt_EduHx-Season_Fall' 'PreInt_EduHx

In [14]:
# we now have a good dataset to train our model with
new_X_train.head(10)

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,...,PCIAT-Season_Summer,PCIAT-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter
247,9.0,0.0,61.0,22.96,53.5,93.5,26.2,65.4,90.5,100.8,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2488,8.0,0.0,62.333333,15.866065,46.5,48.8,22.0,79.0,72.0,124.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2318,9.0,1.0,70.0,19.424006,53.2,78.2,24.8,63.0,99.0,115.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
347,10.0,1.0,60.333333,20.902811,52.0,80.4,27.0,72.0,92.0,127.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1090,5.0,1.0,60.0,15.143991,42.0,38.0,20.0,62.0,92.0,109.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
929,7.0,0.0,60.0,23.67704,50.0,84.2,27.733333,76.0,105.0,126.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
2225,9.0,1.0,70.0,18.023942,51.5,68.0,23.6,64.0,82.0,113.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2678,18.0,0.0,70.0,33.658295,72.0,248.2,37.0,80.0,89.0,145.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2397,8.0,0.0,82.0,16.067086,52.0,61.8,25.0,57.0,91.0,106.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1744,16.0,1.0,65.0,39.337793,64.0,229.2,37.0,70.0,83.0,125.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


Let's now try a first approach: <br>
First I will calculate the baseline accuracy, that is, the accuracy of a naive classifier that assigns every instance to the most frequent class in the train set. <br>
Then I will train a basic Random Forest model and check its accuracy. <br>
Our goal is to at least predict better than the naive classifier.
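As a side note, scikit-learn ships this naive classifier as `DummyClassifier`, which makes it possible to score the baseline on the test labels with the same metric as the real model. A sketch on made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
y_tr = np.array([0, 0, 0, 1, 2])
y_te = np.array([0, 1, 0, 0])
X_tr = np.zeros((len(y_tr), 1))  # the dummy model ignores the features
X_te = np.zeros((len(y_te), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_test_acc = accuracy_score(y_te, dummy.predict(X_te))
print(baseline_test_acc)
```

The dummy model always predicts class 0 (the most frequent train label), which is right for 3 of the 4 test labels.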

In [20]:
baseline_accuracy = y_train.value_counts().max() / y_train.value_counts().sum()
print (f"Majority class accuracy: {baseline_accuracy:.3f}")
# our goal is to have a model that can predict better than the naive classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
model = RandomForestClassifier(max_leaf_nodes=20)
model.fit(new_X_train, y_train)

test_acc = accuracy_score(y_true = y_test, y_pred = model.predict(new_X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

print(model.feature_importances_)

Majority class accuracy: 0.589
Test Accuracy: 0.985
[2.73354280e-03 0.00000000e+00 6.01553925e-04 1.25561853e-03
 1.05323993e-03 2.67558369e-03 6.28642517e-03 5.37663440e-04
 6.48891977e-04 3.37574718e-04 9.03134237e-05 7.30962458e-04
 6.83962795e-04 1.26627077e-03 0.00000000e+00 1.77174164e-03
 3.66208962e-04 1.46235117e-03 6.39883015e-04 1.21997955e-03
 2.28387628e-04 3.94744938e-04 0.00000000e+00 2.93052182e-04
 0.00000000e+00 1.50358706e-04 1.29260229e-04 0.00000000e+00
 1.01412991e-04 3.74277195e-04 5.14702348e-03 1.44440515e-03
 2.07593942e-03 1.17563842e-03 1.04047062e-03 1.49173591e-03
 2.28916089e-03 2.27104872e-04 3.83213637e-03 5.57183171e-03
 2.43167990e-03 3.32926413e-03 4.66496544e-03 2.88243922e-04
 6.05562857e-04 2.44581661e-02 6.20409749e-02 4.97349984e-02
 1.42571098e-02 8.53256139e-02 1.62984571e-02 5.33073011e-03
 2.96711343e-02 2.03700806e-02 3.38523279e-02 2.21287809e-02
 4.87448133e-03 3.50726376e-02 2.77617286e-02 5.93422324e-02
 3.86523989e-02 4.27283719e-02 4.

We can see that even a basic Random Forest gives an almost perfect classifier (test accuracy 0.985 against a baseline of 0.589). <br>
But there is a problem: the prediction relies heavily on a handful of features, namely:

In [23]:
importances = model.feature_importances_
feature_names = new_X_train.columns
important_features = [name for name, importance in zip(feature_names, importances) if importance > 0.05]
print(important_features)

['PCIAT-PCIAT_02', 'PCIAT-PCIAT_05', 'PCIAT-PCIAT_15', 'PCIAT-PCIAT_Total']


We can see that the most significant features all belong to PCIAT, the Parent-Child Internet Addiction Test. <br>
These features measure characteristics and behaviors associated with compulsive use of the Internet. <br>
From the description it is easy to understand why they are so useful for the prediction, but unfortunately they are not present in the `test.csv` file. To get a coherent predictor I removed those features, since in practice a model cannot rely on features that are unavailable at prediction time.
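When inspecting importances, a fixed threshold such as 0.05 can hide how the rest of the features compare. A small sketch of ranking all importances by name, on synthetic data from `make_classification` (the `feat_i` names are placeholders, not the real columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names, for illustration only
X_arr, y_toy = make_classification(n_samples=200, n_features=6, n_informative=2, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, max_leaf_nodes=20, random_state=0)
forest.fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_, index=X_toy.columns).sort_values(ascending=False)
print(ranked.head(3))
```

Applied to `model` and `new_X_train.columns`, the same two lines would list the PCIAT features at the top together with their scores.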

In [24]:
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")
print(df_train.shape[1])
print(df_test.shape[1])

82
59


In [25]:
train_features = df_train.columns.tolist()
test_features = df_test.columns.tolist()
features_toremove =  list(set(train_features) - set(test_features) - {'sii'})
print(features_toremove)

['PCIAT-PCIAT_08', 'PCIAT-PCIAT_10', 'PCIAT-PCIAT_Total', 'PCIAT-PCIAT_20', 'PCIAT-PCIAT_17', 'PCIAT-PCIAT_13', 'PCIAT-PCIAT_16', 'PCIAT-PCIAT_01', 'PCIAT-PCIAT_14', 'PCIAT-PCIAT_04', 'PCIAT-PCIAT_02', 'PCIAT-PCIAT_18', 'PCIAT-PCIAT_06', 'PCIAT-PCIAT_11', 'PCIAT-PCIAT_03', 'PCIAT-PCIAT_19', 'PCIAT-PCIAT_05', 'PCIAT-PCIAT_09', 'PCIAT-PCIAT_12', 'PCIAT-PCIAT_15', 'PCIAT-PCIAT_07', 'PCIAT-Season']


As we can see, the features missing from the test set are exactly the PCIAT features.<br>
Let's now repeat the cleaning without the PCIAT features:

In [26]:
del df_train['id']
for col in features_toremove: # this time we remove the PCIAT features from the train set
    del df_train[col]
df_train.dropna(subset=['sii'], inplace=True)
df_train.columns.to_list()

['Basic_Demos-Enroll_Season',
 'Basic_Demos-Age',
 'Basic_Demos-Sex',
 'CGAS-Season',
 'CGAS-CGAS_Score',
 'Physical-Season',
 'Physical-BMI',
 'Physical-Height',
 'Physical-Weight',
 'Physical-Waist_Circumference',
 'Physical-Diastolic_BP',
 'Physical-HeartRate',
 'Physical-Systolic_BP',
 'Fitness_Endurance-Season',
 'Fitness_Endurance-Max_Stage',
 'Fitness_Endurance-Time_Mins',
 'Fitness_Endurance-Time_Sec',
 'FGC-Season',
 'FGC-FGC_CU',
 'FGC-FGC_CU_Zone',
 'FGC-FGC_GSND',
 'FGC-FGC_GSND_Zone',
 'FGC-FGC_GSD',
 'FGC-FGC_GSD_Zone',
 'FGC-FGC_PU',
 'FGC-FGC_PU_Zone',
 'FGC-FGC_SRL',
 'FGC-FGC_SRL_Zone',
 'FGC-FGC_SRR',
 'FGC-FGC_SRR_Zone',
 'FGC-FGC_TL',
 'FGC-FGC_TL_Zone',
 'BIA-Season',
 'BIA-BIA_Activity_Level_num',
 'BIA-BIA_BMC',
 'BIA-BIA_BMI',
 'BIA-BIA_BMR',
 'BIA-BIA_DEE',
 'BIA-BIA_ECW',
 'BIA-BIA_FFM',
 'BIA-BIA_FFMI',
 'BIA-BIA_FMI',
 'BIA-BIA_Fat',
 'BIA-BIA_Frame_num',
 'BIA-BIA_ICW',
 'BIA-BIA_LDM',
 'BIA-BIA_LST',
 'BIA-BIA_SMM',
 'BIA-BIA_TBW',
 'PAQ_A-Season',
 'PAQ_

In [None]:
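# same external averages as before, but here every missing physical measure is filled with
# the age/sex average, not only the rows where all the measures are missing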
physical_measures_df = pd.read_csv('data/physical_measures.csv')
df_train = df_train.merge( physical_measures_df, on=['Basic_Demos-Age', 'Basic_Demos-Sex'], suffixes=('', '_avg')) # add the column of the average physical measures to each row in the dataframe

cols = ['Physical-BMI','Physical-Height','Physical-Weight','Physical-Waist_Circumference','Physical-Diastolic_BP','Physical-HeartRate','Physical-Systolic_BP']
for col in cols:
    df_train[col] = df_train[col].fillna(df_train[f"{col}_avg"])
    del df_train[f"{col}_avg"]


X = df_train.iloc[:, :-1]
y = df_train.iloc[:, -1]

is_numerical = np.array([np.issubdtype(dtype, np.number) for dtype in X.dtypes])
numerical_idx = np.flatnonzero(is_numerical)
new_X = X.iloc[:, numerical_idx]

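from sklearn.impute import SimpleImputer

# this time the numerical features are imputed with the column mean instead of the KNN imputer;
# note that the imputers in this cell are fitted on the whole dataset, before the train/test split
imputer = SimpleImputer(strategy='mean')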
X_array = imputer.fit_transform(new_X)
new_X = pd.DataFrame(X_array, columns=new_X.columns, index=new_X.index)

categorical_idx = np.flatnonzero(is_numerical==False)
categorical_X = X.iloc[:, categorical_idx]
imputer = SimpleImputer(strategy='most_frequent')
X_array = imputer.fit_transform(categorical_X)
categorical_X = pd.DataFrame(X_array, columns=categorical_X.columns, index=categorical_X.index)


from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder(sparse_output=False)
oh.fit(categorical_X)

encoded = oh.transform(categorical_X)
for i, col in enumerate(oh.get_feature_names_out()):
    new_X = new_X.copy()
    new_X[col] = encoded[:, i]
feature_names = new_X.columns.tolist()
print(feature_names)

['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI', 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM', 'BIA-BIA_TBW', 'PAQ_A-PAQ_A_Total', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday', 'Basic_Demos-Enroll_Season_Fall', 'Basic_Demos-Enroll_Season_Sprin

Let's now repeat the experiment without the PCIAT features, this time using a Decision Tree:

In [None]:
X_train, X_test, y_train, y_test = train_test_split( new_X, y, test_size=0.20, random_state=42)

from sklearn.tree import DecisionTreeClassifier
# this time we also tune the hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier()
parameters = {'max_leaf_nodes': [2, 5, 10, 30],
    'max_depth': [3, 5, 10, None],
    'criterion': ['gini', 'entropy']
    }
# tune the hyperparameters with 5-fold cross-validation on the train set -> automatic parameter tuning using Grid Search:
tuned_model = GridSearchCV(model, parameters, cv=5, verbose=0)
tuned_model.fit(X_train, y_train)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)

test_acc = accuracy_score(y_true = y_test, y_pred = tuned_model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

Best Score: 0.596
Best Params:  {'criterion': 'gini', 'max_depth': 5, 'max_leaf_nodes': 5}
Test Accuracy: 0.597


As we can see, once the PCIAT features are left out we get a classifier that is very similar to the naive classifier (test accuracy 0.597 against the majority-class accuracy of 0.589 computed earlier), so not a good one. <br>
In the next notebook we will try to get a better classifier using the Random Forest method.
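One thing that could be tried there is to move the imputation inside the cross-validation, so that each fold fits its imputer only on its own training part. The sketch below uses synthetic data with artificially removed values; the `pipe` and `grid` names and the parameter values are assumptions for illustration, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 10% of the values knocked out, for illustration only
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # refitted inside every CV fold
    ("tree", DecisionTreeClassifier(random_state=0)),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {"tree__max_depth": [3, 5, None], "tree__max_leaf_nodes": [5, 10, 30]}

search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)
test_score = search.score(X_te, y_te)
print(search.best_params_, round(test_score, 3))
```

With this layout the test set stays untouched until the final `score` call, and the cross-validated score is not helped by statistics computed on the validation folds.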