# **Data Cleaning**

First I'm going to look at the data we are working with:

In [57]:
import numpy as np
import pandas as pd

url_train = "data/train.csv"
df_train = pd.read_csv(url_train)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 82 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   id                                      3960 non-null   object 
 1   Basic_Demos-Enroll_Season               3960 non-null   object 
 2   Basic_Demos-Age                         3960 non-null   int64  
 3   Basic_Demos-Sex                         3960 non-null   int64  
 4   CGAS-Season                             2555 non-null   object 
 5   CGAS-CGAS_Score                         2421 non-null   float64
 6   Physical-Season                         3310 non-null   object 
 7   Physical-BMI                            3022 non-null   float64
 8   Physical-Height                         3027 non-null   float64
 9   Physical-Weight                         3076 non-null   float64
 10  Physical-Waist_Circumference            898 non-null    floa

In [58]:
dataset2 = df_train.astype('object')
dataset2.describe().T

Unnamed: 0,count,unique,top,freq
id,3960,3960,00008ff9,1
Basic_Demos-Enroll_Season,3960,4,Spring,1127
Basic_Demos-Age,3960,18,8,490
Basic_Demos-Sex,3960,2,0,2484
CGAS-Season,2555,4,Spring,697
...,...,...,...,...
SDS-SDS_Total_Raw,2609.0,62.0,35.0,132.0
SDS-SDS_Total_T,2606.0,49.0,50.0,132.0
PreInt_EduHx-Season,3540,4,Spring,985
PreInt_EduHx-computerinternet_hoursday,3301.0,4.0,0.0,1524.0


We can see that there are three types of data: int, float and string.
Apart from `id`, the `string` columns shown take only 4 possible values (the seasons), so they are good candidates for one-hot encoding.

In [59]:
num_inst, num_features = df_train.shape

for f in range(num_features):
    col = df_train.iloc[:, f].astype(str)
    print(f, np.unique(col))

0 ['00008ff9' '000fd460' '00105258' ... 'ffcd4dbd' 'ffed1dd5' 'ffef538e']
1 ['Fall' 'Spring' 'Summer' 'Winter']
2 ['10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22' '5' '6'
 '7' '8' '9']
3 ['0' '1']
4 ['Fall' 'Spring' 'Summer' 'Winter' 'nan']
5 ['25.0' '30.0' '31.0' '33.0' '35.0' '38.0' '39.0' '40.0' '41.0' '42.0'
 '44.0' '45.0' '46.0' '47.0' '48.0' '49.0' '50.0' '51.0' '52.0' '53.0'
 '54.0' '55.0' '56.0' '57.0' '58.0' '59.0' '60.0' '61.0' '62.0' '63.0'
 '64.0' '65.0' '66.0' '67.0' '68.0' '69.0' '70.0' '71.0' '72.0' '73.0'
 '74.0' '75.0' '76.0' '77.0' '78.0' '79.0' '80.0' '81.0' '82.0' '83.0'
 '85.0' '87.0' '88.0' '90.0' '91.0' '92.0' '93.0' '95.0' '999.0' 'nan']
6 ['Fall' 'Spring' 'Summer' 'Winter' 'nan']
7 ['0.0' '10.28168847' '10.67543945' ... '9.693766159' '9.959166667' 'nan']
8 ['33.0' '36.0' '37.5' '39.0' '39.5' '40.0' '40.5' '41.0' '41.5' '42.0'
 '42.25' '42.5' '42.75' '42.9' '43.0' '43.15' '43.2' '43.25' '43.4' '43.5'
 '43.75' '44.0' '44.2' '44.25' '44.3' '44.4' 

Another thing we can see is that most of the features contain `nan` values, so we also have to deal with missing values.
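
Printing every unique value works, but a per-column summary is easier to scan. The following is a minimal sketch on a made-up toy frame (the column names `age`, `season` and `bmi` are placeholders, not the real dataset) showing how the share of missing entries per column could be computed:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration).
toy = pd.DataFrame({
    "age": [5, 6, 7, 8],
    "season": ["Fall", None, "Spring", None],
    "bmi": [16.8, np.nan, np.nan, np.nan],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `df_train` would rank the real columns by how incomplete they are.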

We can get additional clues by looking at `data_dictionary.csv`

In [60]:
url = "data/data_dictionary.csv"
description = pd.read_csv(url)
description

Unnamed: 0,Instrument,Field,Description,Type,Values,Value Labels
0,Identifier,id,Participant's ID,str,,
1,Demographics,Basic_Demos-Enroll_Season,Season of enrollment,str,"Spring, Summer, Fall, Winter",
2,Demographics,Basic_Demos-Age,Age of participant,float,,
3,Demographics,Basic_Demos-Sex,Sex of participant,categorical int,01,"0=Male, 1=Female"
4,Children's Global Assessment Scale,CGAS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
...,...,...,...,...,...,...
76,Sleep Disturbance Scale,SDS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
77,Sleep Disturbance Scale,SDS-SDS_Total_Raw,Total Raw Score,int,,
78,Sleep Disturbance Scale,SDS-SDS_Total_T,Total T-Score,int,,
79,Internet Use,PreInt_EduHx-Season,Season of participation,str,"Spring, Summer, Fall, Winter",


Now that we have additional information about the dataset, we can start the data cleaning process. We will start by removing the rows where the value of `sii` is `NaN`, since it is the target label of our supervised learning task. <br>
We will then remove the `id` column: it only serves as a "primary key" to distinguish the rows and is not relevant for a classification task.

In [None]:
# remove rows where the value of sii is NaN
df_train.dropna(subset=['sii'], inplace=True)

# remove the column id
del df_train['id'] 

We can now start dealing with the missing values.
I've noticed that the dataset contains a group of physical measures, like "weight" and "height", that for children depend mostly on age and sex.<br>
Luckily, age and sex are never null in our dataset, so where a physical measure is missing we can fill in the average value for the child's age and sex. <br>
To do this I used an external source, since I thought it would be more reliable than imputing the values with the column mean or similar methods. <br>
The external source is a CSV file that I created from NHANES data, which assess the health and nutritional status of children in the United States.
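
As a minimal sketch of this idea, assuming a reference table with one row per (age, sex) pair, the fill can be written as a merge followed by `fillna`. The frames `kids` and `reference` and all their values are made up for illustration. Note that `merge` defaults to `how='inner'`, which drops rows whose (age, sex) pair is absent from the reference table; the sketch uses `how='left'` so that such rows are kept with their `NaN` intact.

```python
import numpy as np
import pandas as pd

# Toy participants; column names mirror the dataset, values are made up.
kids = pd.DataFrame({
    "Basic_Demos-Age": [5, 5, 6, 30],
    "Basic_Demos-Sex": [0, 1, 0, 0],
    "Physical-Height": [43.0, np.nan, np.nan, np.nan],
})

# Hypothetical reference table: one average per (age, sex) pair.
reference = pd.DataFrame({
    "Basic_Demos-Age": [5, 5, 6],
    "Basic_Demos-Sex": [0, 1, 0],
    "Physical-Height": [43.5, 42.9, 45.8],
})

merged = kids.merge(
    reference,
    on=["Basic_Demos-Age", "Basic_Demos-Sex"],
    how="left",                # keep every participant, matched or not
    suffixes=("", "_avg"),
    validate="many_to_one",    # fail loudly if the reference has duplicate keys
)
# Fill gaps with the group average, then drop the helper column.
merged["Physical-Height"] = merged["Physical-Height"].fillna(merged["Physical-Height_avg"])
merged = merged.drop(columns="Physical-Height_avg")
print(merged)
```

The observed height in the first row is untouched, the two matched gaps receive the group average, and the unmatched row (age 30) stays missing instead of disappearing.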

In [62]:
physical_measures_df = pd.read_csv('data/physical_measures.csv')

# add the columns of the average physical measures (given an age and a sex) to each row in the dataframe
df_train = df_train.merge( physical_measures_df, on=['Basic_Demos-Age', 'Basic_Demos-Sex'], suffixes=('', '_avg'))

cols = ['Physical-BMI','Physical-Height','Physical-Weight','Physical-Waist_Circumference','Physical-Diastolic_BP','Physical-HeartRate','Physical-Systolic_BP']
for col in cols:
    # first fill the nan values of the physical measure columns with the one having the average
    df_train[col] = df_train[col].fillna(df_train[f"{col}_avg"]) 
    # then remove the average columns
    del df_train[f"{col}_avg"]
    print(np.unique(df_train[col]))

[ 0.          8.52243608  9.69376616 ... 44.83554809 45.30602589
 46.10291358]
[36.   37.5  39.   39.5  40.   40.5  41.   41.5  42.   42.25 42.5  42.75
 42.9  43.   43.15 43.2  43.25 43.4  43.5  43.75 44.   44.2  44.25 44.3
 44.4  44.5  44.75 45.   45.2  45.25 45.3  45.5  45.75 46.   46.1  46.25
 46.5  46.6  46.75 47.   47.02 47.1  47.2  47.25 47.3  47.4  47.5  47.7
 47.75 47.8  48.   48.03 48.05 48.13 48.2  48.25 48.44 48.5  48.63 48.75
 48.8  48.88 49.   49.1  49.13 49.2  49.25 49.3  49.5  49.6  49.75 49.9
 50.   50.2  50.25 50.3  50.48 50.5  50.63 50.75 51.   51.2  51.25 51.38
 51.4  51.5  51.6  51.63 51.7  51.75 51.8  51.9  52.   52.1  52.13 52.2
 52.25 52.3  52.38 52.5  52.63 52.7  52.75 52.8  52.88 53.   53.1  53.2
 53.25 53.3  53.4  53.5  53.57 53.6  53.7  53.74 53.75 53.8  53.88 53.9
 54.   54.1  54.13 54.25 54.3  54.4  54.5  54.6  54.63 54.65 54.75 54.8
 54.88 55.   55.13 55.2  55.25 55.3  55.4  55.5  55.68 55.75 55.8  55.88
 56.   56.13 56.2  56.25 56.3  56.5  56.63 56.75 56.

Now that the physical measures are complete, we can handle the other missing values. <br>
What we want to do is:
- replace the missing values of numerical columns with the column mean
- replace the missing values of categorical columns with the column mode

so we have to separate these two types of features.
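
The two rules above can be sketched in plain pandas on a toy frame. This is only an illustration under made-up data: it uses `select_dtypes` and `fillna` rather than the `SimpleImputer` used in this notebook, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame with one complete column, one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "age": [5, 6, 7],
    "score": [51.0, np.nan, 60.0],
    "season": ["Fall", None, "Fall"],
})

# Split the columns by dtype.
numeric_part = toy.select_dtypes(include="number")
categorical_part = toy.select_dtypes(exclude="number")

# Mean for numeric columns, mode (most frequent value) for categorical ones.
filled_numeric = numeric_part.fillna(numeric_part.mean())
filled_categorical = categorical_part.fillna(categorical_part.mode().iloc[0])

print(filled_numeric)
print(filled_categorical)
```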

In [63]:
# First, separate the features (X) from the target to be predicted (sii, assumed to be the last column)
X = df_train.iloc[:, :-1]
y = df_train.iloc[:, -1]

# boolean array telling whether each column is numeric
is_numerical = np.array([np.issubdtype(dtype, np.number) for dtype in X.dtypes])
numerical_idx = np.flatnonzero(is_numerical)
# keep only the numerical columns
new_X = X.iloc[:, numerical_idx]
new_X.head(10)

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,...,PCIAT-PCIAT_15,PCIAT-PCIAT_16,PCIAT-PCIAT_17,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-computerinternet_hoursday
0,5,0,51.0,16.877316,46.0,50.8,23.0,63.0,92.5,96.0,...,4.0,4.0,4.0,4.0,2.0,4.0,55.0,,,3.0
1,5,0,60.0,15.968632,43.0,42.0,22.0,66.0,83.0,107.0,...,0.0,4.0,1.0,3.0,0.0,0.0,18.0,,,2.0
2,5,0,50.0,12.926988,43.0,34.0,23.0,93.0,86.0,157.0,...,0.0,2.0,1.0,1.0,0.0,0.0,8.0,,,2.0
3,5,0,51.0,16.113052,42.5,41.4,21.0,66.0,82.0,92.0,...,1.0,1.0,1.0,1.0,1.0,1.0,21.0,,,2.0
4,5,0,,24.023506,45.0,69.2,29.0,78.0,116.0,131.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0
5,5,0,60.0,17.274885,49.0,59.0,23.0,129.0,94.0,162.0,...,3.0,4.0,2.0,4.0,3.0,1.0,45.0,,,1.0
6,5,0,79.0,26.512004,46.0,79.8,28.0,72.0,83.0,120.0,...,0.0,0.0,0.0,0.0,0.0,0.0,12.0,,,0.0
7,5,0,55.0,14.485255,46.0,43.6,23.0,68.0,69.0,126.0,...,2.0,2.0,1.0,1.0,0.0,0.0,28.0,,,0.0
8,5,0,,15.336245,44.5,43.2,21.0,67.0,82.0,105.0,...,0.0,0.0,3.0,1.0,0.0,0.0,11.0,,,0.0
9,5,0,70.0,14.073912,44.25,39.2,23.0,53.0,93.0,99.0,...,0.0,3.0,0.0,2.0,0.0,0.0,8.0,47.0,66.0,0.0


Now that we have all the numerical columns, we can replace the NaN values in each column with that column's mean. <br>
To do this I used `SimpleImputer`.

In [64]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_array = imputer.fit_transform(new_X)
new_X = pd.DataFrame(X_array, columns=new_X.columns, index=new_X.index)
new_X.head(10)

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,...,PCIAT-PCIAT_15,PCIAT-PCIAT_16,PCIAT-PCIAT_17,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-computerinternet_hoursday
0,5.0,0.0,51.0,16.877316,46.0,50.8,23.0,63.0,92.5,96.0,...,4.0,4.0,4.0,4.0,2.0,4.0,55.0,40.977839,57.647525,3.0
1,5.0,0.0,60.0,15.968632,43.0,42.0,22.0,66.0,83.0,107.0,...,0.0,4.0,1.0,3.0,0.0,0.0,18.0,40.977839,57.647525,2.0
2,5.0,0.0,50.0,12.926988,43.0,34.0,23.0,93.0,86.0,157.0,...,0.0,2.0,1.0,1.0,0.0,0.0,8.0,40.977839,57.647525,2.0
3,5.0,0.0,51.0,16.113052,42.5,41.4,21.0,66.0,82.0,92.0,...,1.0,1.0,1.0,1.0,1.0,1.0,21.0,40.977839,57.647525,2.0
4,5.0,0.0,65.159266,24.023506,45.0,69.2,29.0,78.0,116.0,131.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.977839,57.647525,0.0
5,5.0,0.0,60.0,17.274885,49.0,59.0,23.0,129.0,94.0,162.0,...,3.0,4.0,2.0,4.0,3.0,1.0,45.0,40.977839,57.647525,1.0
6,5.0,0.0,79.0,26.512004,46.0,79.8,28.0,72.0,83.0,120.0,...,0.0,0.0,0.0,0.0,0.0,0.0,12.0,40.977839,57.647525,0.0
7,5.0,0.0,55.0,14.485255,46.0,43.6,23.0,68.0,69.0,126.0,...,2.0,2.0,1.0,1.0,0.0,0.0,28.0,40.977839,57.647525,0.0
8,5.0,0.0,65.159266,15.336245,44.5,43.2,21.0,67.0,82.0,105.0,...,0.0,0.0,3.0,1.0,0.0,0.0,11.0,40.977839,57.647525,0.0
9,5.0,0.0,70.0,14.073912,44.25,39.2,23.0,53.0,93.0,99.0,...,0.0,3.0,0.0,2.0,0.0,0.0,8.0,47.0,66.0,0.0


Let's now check if there are still `NaN` values in the numerical features:

In [65]:
num_inst, num_features = new_X.shape
for f in range(num_features):
    col = new_X.iloc[:, f].astype(str)
    print(f, np.unique(col))
# no 'nan' string shows up in the output, so this first part of the data processing is done

0 ['10.0' '11.0' '12.0' '13.0' '14.0' '15.0' '16.0' '17.0' '18.0' '19.0'
 '20.0' '21.0' '22.0' '5.0' '6.0' '7.0' '8.0' '9.0']
1 ['0.0' '1.0']
2 ['25.0' '30.0' '31.0' '33.0' '35.0' '38.0' '39.0' '40.0' '41.0' '42.0'
 '44.0' '45.0' '46.0' '47.0' '48.0' '49.0' '50.0' '51.0' '52.0' '53.0'
 '54.0' '55.0' '56.0' '57.0' '58.0' '59.0' '60.0' '61.0' '62.0' '63.0'
 '64.0' '65.0' '65.15926558497011' '66.0' '67.0' '68.0' '69.0' '70.0'
 '71.0' '72.0' '73.0' '74.0' '75.0' '76.0' '77.0' '78.0' '79.0' '80.0'
 '81.0' '82.0' '83.0' '85.0' '87.0' '88.0' '90.0' '91.0' '92.0' '93.0'
 '95.0']
3 ['0.0' '10.28168847' '10.67543945' ... '8.522436082' '9.693766159'
 '9.959166667']
4 ['36.0' '37.5' '39.0' '39.5' '40.0' '40.5' '41.0' '41.5' '42.0' '42.25'
 '42.5' '42.75' '42.9' '43.0' '43.15' '43.2' '43.25' '43.4' '43.5' '43.75'
 '44.0' '44.2' '44.25' '44.3' '44.4' '44.5' '44.75' '45.0' '45.2' '45.25'
 '45.3' '45.5' '45.75' '46.0' '46.1' '46.25' '46.5' '46.6' '46.75' '47.0'
 '47.02' '47.1' '47.2' '47.25' '47.3' '4
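
Scanning the printed values for `'nan'` is easy to get wrong on long outputs. A more direct check, sketched here with a hypothetical helper `count_missing` and made-up toy data, counts the missing cells explicitly:

```python
import numpy as np
import pandas as pd

def count_missing(frame: pd.DataFrame) -> int:
    """Return the total number of NaN cells in the frame (illustrative helper)."""
    return int(frame.isna().sum().sum())

# Toy data: one gap per column before imputation, none after.
toy_before = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
toy_after = toy_before.fillna(toy_before.mean())

print(count_missing(toy_before), count_missing(toy_after))
```

Calling `count_missing(new_X)` after the imputation step should return 0 if no gaps remain.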

Now we have to handle the categorical values. There are two things to do:
- replace missing values with the mode
- transform them with One-Hot Encoding

In [66]:
categorical_idx = np.flatnonzero(~is_numerical)
categorical_X = X.iloc[:, categorical_idx]

imputer = SimpleImputer(strategy='most_frequent')
X_array = imputer.fit_transform(categorical_X)
categorical_X = pd.DataFrame(X_array, columns=categorical_X.columns, index=categorical_X.index)
categorical_X.head(10)

Unnamed: 0,Basic_Demos-Enroll_Season,CGAS-Season,Physical-Season,Fitness_Endurance-Season,FGC-Season,BIA-Season,PAQ_A-Season,PAQ_C-Season,PCIAT-Season,SDS-Season,PreInt_EduHx-Season
0,Fall,Winter,Fall,Spring,Fall,Fall,Winter,Spring,Fall,Spring,Fall
1,Spring,Fall,Spring,Spring,Spring,Spring,Winter,Spring,Summer,Spring,Spring
2,Summer,Fall,Fall,Fall,Fall,Summer,Winter,Spring,Fall,Spring,Summer
3,Spring,Summer,Spring,Spring,Spring,Spring,Winter,Spring,Spring,Spring,Spring
4,Summer,Spring,Summer,Spring,Summer,Summer,Winter,Spring,Summer,Spring,Summer
5,Summer,Winter,Fall,Fall,Fall,Fall,Winter,Spring,Fall,Spring,Fall
6,Spring,Summer,Spring,Spring,Spring,Spring,Winter,Spring,Summer,Spring,Spring
7,Spring,Fall,Spring,Spring,Spring,Spring,Winter,Spring,Summer,Spring,Spring
8,Fall,Spring,Fall,Spring,Fall,Fall,Winter,Spring,Fall,Spring,Fall
9,Spring,Fall,Spring,Spring,Spring,Summer,Winter,Spring,Summer,Spring,Spring


In [67]:
# now that we have no more missing values, we can handle categorical labels using one-hot encoding
from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder(sparse_output=False)
oh.fit(categorical_X)

encoded = oh.transform(categorical_X)
print(oh.get_feature_names_out())
# we now add the encoded string features to the new data frame in a single concat
encoded_df = pd.DataFrame(encoded, columns=oh.get_feature_names_out(), index=new_X.index)
new_X = pd.concat([new_X, encoded_df], axis=1)

['Basic_Demos-Enroll_Season_Fall' 'Basic_Demos-Enroll_Season_Spring'
 'Basic_Demos-Enroll_Season_Summer' 'Basic_Demos-Enroll_Season_Winter'
 'CGAS-Season_Fall' 'CGAS-Season_Spring' 'CGAS-Season_Summer'
 'CGAS-Season_Winter' 'Physical-Season_Fall' 'Physical-Season_Spring'
 'Physical-Season_Summer' 'Physical-Season_Winter'
 'Fitness_Endurance-Season_Fall' 'Fitness_Endurance-Season_Spring'
 'Fitness_Endurance-Season_Summer' 'Fitness_Endurance-Season_Winter'
 'FGC-Season_Fall' 'FGC-Season_Spring' 'FGC-Season_Summer'
 'FGC-Season_Winter' 'BIA-Season_Fall' 'BIA-Season_Spring'
 'BIA-Season_Summer' 'BIA-Season_Winter' 'PAQ_A-Season_Fall'
 'PAQ_A-Season_Spring' 'PAQ_A-Season_Summer' 'PAQ_A-Season_Winter'
 'PAQ_C-Season_Fall' 'PAQ_C-Season_Spring' 'PAQ_C-Season_Summer'
 'PAQ_C-Season_Winter' 'PCIAT-Season_Fall' 'PCIAT-Season_Spring'
 'PCIAT-Season_Summer' 'PCIAT-Season_Winter' 'SDS-Season_Fall'
 'SDS-Season_Spring' 'SDS-Season_Summer' 'SDS-Season_Winter'
 'PreInt_EduHx-Season_Fall' 'PreInt_EduHx

In [68]:
# we now have a good dataset to train our model with
new_X.head(10)

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,...,PCIAT-Season_Summer,PCIAT-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter
0,5.0,0.0,51.0,16.877316,46.0,50.8,23.0,63.0,92.5,96.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,5.0,0.0,60.0,15.968632,43.0,42.0,22.0,66.0,83.0,107.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,5.0,0.0,50.0,12.926988,43.0,34.0,23.0,93.0,86.0,157.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,5.0,0.0,51.0,16.113052,42.5,41.4,21.0,66.0,82.0,92.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
4,5.0,0.0,65.159266,24.023506,45.0,69.2,29.0,78.0,116.0,131.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
5,5.0,0.0,60.0,17.274885,49.0,59.0,23.0,129.0,94.0,162.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
6,5.0,0.0,79.0,26.512004,46.0,79.8,28.0,72.0,83.0,120.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
7,5.0,0.0,55.0,14.485255,46.0,43.6,23.0,68.0,69.0,126.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
8,5.0,0.0,65.159266,15.336245,44.5,43.2,21.0,67.0,82.0,105.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
9,5.0,0.0,70.0,14.073912,44.25,39.2,23.0,53.0,93.0,99.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


Let's try a first approach to the classification task with a simple Decision Tree:

In [69]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( new_X, y, test_size=0.20, random_state=42)

# the baseline accuracy is that of a naive classifier that always predicts the most frequent class
baseline_accuracy = y_train.value_counts().max() / y_train.value_counts().sum()
print(f"Majority class accuracy: {baseline_accuracy:.3f}")
# our goal is a model that predicts better than the naive classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
model = DecisionTreeClassifier(max_leaf_nodes=20)
model.fit(X_train, y_train)

test_acc = accuracy_score(y_true = y_test, y_pred = model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

print(model.feature_importances_)

Majority class accuracy: 0.589
Test Accuracy: 1.000
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


We can see that even a simple decision tree classifier reaches perfect test accuracy. <br>
But a perfect score is suspicious, and indeed there is a problem:

In [70]:
# only one feature is used to classify the instances; let's see which one it is
importances = model.feature_importances_
feature_names = new_X.columns
important_features = [name for name, importance in zip(feature_names, importances) if importance == 1.0]
print(important_features)

['PCIAT-PCIAT_Total']
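
A perfect score driven by a single feature is the usual signature of target leakage: the feature encodes the label almost directly. The sketch below reproduces the effect on synthetic data, assuming a label obtained by binning a total score. The column names and the cut points are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: a "total" score plus two unrelated noise features.
toy = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "total": rng.integers(0, 101, size=n),
    "noise_b": rng.normal(size=n),
})
# Hypothetical cut points: the label is just a binned version of "total".
label = pd.cut(toy["total"], bins=[-1, 30, 49, 79, 100], labels=False)

X_tr, X_te, y_tr, y_te = train_test_split(toy, label, test_size=0.2, random_state=42)
tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0).fit(X_tr, y_tr)

leak_acc = tree.score(X_te, y_te)
top_feature = toy.columns[tree.feature_importances_.argmax()]
print(leak_acc, top_feature)
```

Because the label is a deterministic function of `total`, the tree only needs a few thresholds on that column and ignores the noise features, which mirrors the importance vector printed above.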


In [93]:
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")
print(df_train.shape[1])
print(df_test.shape[1])

82
59


Unfortunately the test set has fewer features than the training set (59 columns versus 82). To have a coherent predictor I've decided to remove from the training set the features that are missing from the test set since, in practice, they cannot help in classifying it.

In [94]:
train_features = df_train.columns.tolist()
test_features = df_test.columns.tolist()
features_toremove =  list(set(train_features) - set(test_features) - {'sii'})
print(features_toremove)

['PCIAT-PCIAT_17', 'PCIAT-PCIAT_09', 'PCIAT-PCIAT_07', 'PCIAT-PCIAT_13', 'PCIAT-PCIAT_03', 'PCIAT-PCIAT_11', 'PCIAT-PCIAT_02', 'PCIAT-PCIAT_05', 'PCIAT-PCIAT_08', 'PCIAT-PCIAT_15', 'PCIAT-PCIAT_14', 'PCIAT-PCIAT_01', 'PCIAT-PCIAT_04', 'PCIAT-PCIAT_20', 'PCIAT-PCIAT_16', 'PCIAT-PCIAT_12', 'PCIAT-PCIAT_Total', 'PCIAT-PCIAT_18', 'PCIAT-Season', 'PCIAT-PCIAT_06', 'PCIAT-PCIAT_19', 'PCIAT-PCIAT_10']


As we can see, the features missing from the test set are the PCIAT features, including `PCIAT-PCIAT_Total`, the one our decision tree used to classify the instances.<br>
Let's now repeat the cleaning process without the PCIAT features:

In [95]:
del df_train['id']
for col in features_toremove: # this time we remove the PCIAT features from the train set
    del df_train[col]
df_train.dropna(subset=['sii'], inplace=True)
df_train.columns.to_list()

['Basic_Demos-Enroll_Season',
 'Basic_Demos-Age',
 'Basic_Demos-Sex',
 'CGAS-Season',
 'CGAS-CGAS_Score',
 'Physical-Season',
 'Physical-BMI',
 'Physical-Height',
 'Physical-Weight',
 'Physical-Waist_Circumference',
 'Physical-Diastolic_BP',
 'Physical-HeartRate',
 'Physical-Systolic_BP',
 'Fitness_Endurance-Season',
 'Fitness_Endurance-Max_Stage',
 'Fitness_Endurance-Time_Mins',
 'Fitness_Endurance-Time_Sec',
 'FGC-Season',
 'FGC-FGC_CU',
 'FGC-FGC_CU_Zone',
 'FGC-FGC_GSND',
 'FGC-FGC_GSND_Zone',
 'FGC-FGC_GSD',
 'FGC-FGC_GSD_Zone',
 'FGC-FGC_PU',
 'FGC-FGC_PU_Zone',
 'FGC-FGC_SRL',
 'FGC-FGC_SRL_Zone',
 'FGC-FGC_SRR',
 'FGC-FGC_SRR_Zone',
 'FGC-FGC_TL',
 'FGC-FGC_TL_Zone',
 'BIA-Season',
 'BIA-BIA_Activity_Level_num',
 'BIA-BIA_BMC',
 'BIA-BIA_BMI',
 'BIA-BIA_BMR',
 'BIA-BIA_DEE',
 'BIA-BIA_ECW',
 'BIA-BIA_FFM',
 'BIA-BIA_FFMI',
 'BIA-BIA_FMI',
 'BIA-BIA_Fat',
 'BIA-BIA_Frame_num',
 'BIA-BIA_ICW',
 'BIA-BIA_LDM',
 'BIA-BIA_LST',
 'BIA-BIA_SMM',
 'BIA-BIA_TBW',
 'PAQ_A-Season',
 'PAQ_

In [96]:
# this part is the same as before
physical_measures_df = pd.read_csv('data/physical_measures.csv')
df_train = df_train.merge( physical_measures_df, on=['Basic_Demos-Age', 'Basic_Demos-Sex'], suffixes=('', '_avg')) # add the column of the average physical measures to each row in the dataframe

cols = ['Physical-BMI','Physical-Height','Physical-Weight','Physical-Waist_Circumference','Physical-Diastolic_BP','Physical-HeartRate','Physical-Systolic_BP']
for col in cols:
    df_train[col] = df_train[col].fillna(df_train[f"{col}_avg"])
    del df_train[f"{col}_avg"]


X = df_train.iloc[:, :-1]
y = df_train.iloc[:, -1]

is_numerical = np.array([np.issubdtype(dtype, np.number) for dtype in X.dtypes])
numerical_idx = np.flatnonzero(is_numerical)
new_X = X.iloc[:, numerical_idx]

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_array = imputer.fit_transform(new_X)
new_X = pd.DataFrame(X_array, columns=new_X.columns, index=new_X.index)

categorical_idx = np.flatnonzero(~is_numerical)
categorical_X = X.iloc[:, categorical_idx]
imputer = SimpleImputer(strategy='most_frequent')
X_array = imputer.fit_transform(categorical_X)
categorical_X = pd.DataFrame(X_array, columns=categorical_X.columns, index=categorical_X.index)


from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder(sparse_output=False)
oh.fit(categorical_X)

encoded = oh.transform(categorical_X)
encoded_df = pd.DataFrame(encoded, columns=oh.get_feature_names_out(), index=new_X.index)
new_X = pd.concat([new_X, encoded_df], axis=1)
feature_names = new_X.columns.tolist()
print(feature_names)

['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI', 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM', 'BIA-BIA_TBW', 'PAQ_A-PAQ_A_Total', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday', 'Basic_Demos-Enroll_Season_Fall', 'Basic_Demos-Enroll_Season_Sprin

Let's now retry our first approach with the prediction using the Decision Tree:

In [97]:
X_train, X_test, y_train, y_test = train_test_split( new_X, y, test_size=0.20, random_state=42)

from sklearn.tree import DecisionTreeClassifier
# this time we also tune the hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier()
parameters = {'max_leaf_nodes': [2, 5, 10, 30],
    'max_depth': [3, 5, 10, None],
    'criterion': ['gini', 'entropy']
    }
# tune the hyperparameters with Grid Search, scoring each combination with 5-fold cross-validation
tuned_model = GridSearchCV(model, parameters, cv=5, verbose=0)
tuned_model.fit(X_train, y_train)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)

test_acc = accuracy_score(y_true = y_test, y_pred = tuned_model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

Best Score: 0.596
Best Params:  {'criterion': 'gini', 'max_depth': 5, 'max_leaf_nodes': 5}
Test Accuracy: 0.597


As we can see, this time, without the PCIAT features, we get a classifier that is very similar to the naive one (test accuracy 0.597 versus a majority-class accuracy of 0.589), so not a good one. <br>
In the next notebook we will try to get a better classifier using the Random Forest method.
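
When comparing later models against the naive classifier, it can help to score both in the same way. The following is a minimal sketch on synthetic data (random features and made-up class proportions, not the real dataset) that uses scikit-learn's `DummyClassifier` as the majority-class baseline and cross-validates it alongside a decision tree:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 rows, 5 noise features, imbalanced labels.
X_toy = rng.normal(size=(500, 5))
y_toy = rng.choice([0, 1, 2], size=500, p=[0.6, 0.3, 0.1])

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
tree = DecisionTreeClassifier(max_depth=5, random_state=0)

# Same 5-fold protocol for both models, so the scores are directly comparable.
baseline_cv = cross_val_score(baseline, X_toy, y_toy, cv=5).mean()
tree_cv = cross_val_score(tree, X_toy, y_toy, cv=5).mean()
print(f"baseline: {baseline_cv:.3f}  tree: {tree_cv:.3f}")
```

On features that carry no signal, as here, the tree is not expected to beat the baseline by any meaningful margin, which is the situation the last result above resembles.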