# **Data Cleaning**

First I'm going to look at the data we are working with:

In [25]:
import numpy as np
import pandas as pd

url_train = "data/train.csv"
df_train = pd.read_csv(url_train)
#df_train.info()

In [26]:
# dataset2 = df_train.astype('object')
# dataset2.describe().T

We can see that the data come in 3 types: int, float and string.
The `string` columns take only 4 possible values (the seasons), so they are good candidates for one-hot encoding.

In [27]:
num_inst, num_features = df_train.shape

for f in range(num_features):
    col = df_train.iloc[:, f].astype(str)
    #print(f, np.unique(col))

Another thing we can see is that most of the features contain `nan` values, so we also have to deal with missing values.

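A quick way to quantify how much is missing is the share of `NaN` per column. The following is only a minimal sketch on a toy frame: the column names are borrowed from the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are invented).
toy = pd.DataFrame({
    "Basic_Demos-Age": [7, 9, 12, 15],
    "Physical-Height": [48.0, np.nan, np.nan, 65.0],
    "Physical-Weight": [np.nan, np.nan, np.nan, 120.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```
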
We can get additional clues by looking at `data_dictionary.csv`:

In [28]:
url = "data/data_dictionary.csv"
description = pd.read_csv(url)
#description

Now that we have additional information about the dataset, we can start the data cleaning process. We start by removing the rows where the value of `sii` is `NaN`, since `sii` is the target of our supervised learning task and rows without it cannot be used. <br>
We then remove the `id` column: it only serves as a "primary key" to distinguish the rows and is not relevant for a classification task.

In [29]:
# remove rows where the value of sii is NaN
df_train.dropna(subset=['sii'], inplace=True)

# remove the column id
del df_train['id'] 

We can now start dealing with the missing values.
I've noticed that the dataset contains a group of physical measures, such as weight and height, that in children depend mostly on age and sex.<br>
Luckily, age and sex are never null in our dataset, so we can use them: in the rows where the physical measures are missing, we insert the average value for the child's age and sex. <br>
I did this only in the rows where all the physical measures are missing; when only some of them are missing, I thought that a KNN imputer was a better idea. <br>
For the averages I used an external source, since I thought that it would be more reliable than trying to predict the values with the mean or other methods. <br>
The external source is a csv file that I created from the NHANES data, which assess the health and nutritional status of children in the United States. <br>
I chose this approach because it distorts the measures less than a global mean would: a 7-year-old girl isn't going to be as tall as a 15-year-old boy. <br>

In [30]:
# load external file
physical_measures_df = pd.read_csv('data/physical_measures.csv')

# add the columns of the average physical measures (given an age and a sex) to each row in the dataframe
# note: merge defaults to an inner join, so rows whose (age, sex) pair is absent from the external file are dropped
df_train = df_train.merge(physical_measures_df, on=['Basic_Demos-Age', 'Basic_Demos-Sex'], suffixes=('', '_avg'))
cols = ['Physical-BMI','Physical-Height','Physical-Weight','Physical-Waist_Circumference','Physical-Diastolic_BP','Physical-HeartRate','Physical-Systolic_BP']
# boolean Series with one entry per row -> True if the physical measures are all NaN
tot_nan_phys = df_train[cols].isna().all(axis=1)

for col in cols:
    # first fill the rows with only nan values for the physical measure with the average
    df_train.loc[tot_nan_phys, col] = df_train.loc[tot_nan_phys, f"{col}_avg"]
    # then remove the average columns
    del df_train[f"{col}_avg"]
    #print(np.unique(df_train[col]))

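Because the merge above uses the default inner join, it can be worth checking that no child is silently dropped. The sketch below is a variant, not the cell above: it uses a left join with `indicator=True` on toy data, reduced to a single measure. The reference values are invented and do not come from NHANES.

```python
import numpy as np
import pandas as pd

# Toy data; the age-30 row has no counterpart in the reference table.
children = pd.DataFrame({
    "Basic_Demos-Age": [7, 7, 15, 30],
    "Basic_Demos-Sex": [1, 0, 0, 1],
    "Physical-Height": [np.nan, 47.0, np.nan, np.nan],
})
reference = pd.DataFrame({
    "Basic_Demos-Age": [7, 7, 15],
    "Basic_Demos-Sex": [0, 1, 0],
    "Physical-Height": [48.0, 47.5, 67.0],
})

merged = children.merge(
    reference,
    on=["Basic_Demos-Age", "Basic_Demos-Sex"],
    how="left",             # keep every child, even without a reference row
    suffixes=("", "_avg"),
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)
unmatched = int((merged["_merge"] == "left_only").sum())

# Fill only where the measure is missing, then drop the helper columns.
merged["Physical-Height"] = merged["Physical-Height"].fillna(merged["Physical-Height_avg"])
merged = merged.drop(columns=["Physical-Height_avg", "_merge"])
print(unmatched, merged["Physical-Height"].tolist())
```
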
Now that we have enriched the rows that carried very little information, we can work on the rest of the dataset to handle the other missing values. <br>
What we want to do is:
- replace the missing values of the numerical columns with a KNN imputer
- replace the missing values of the categorical columns with the mode

But first it's better to separate the features of the classification task from the target to be predicted: `sii`. <br>
It's also important to separate the test set from the training set before imputing, so that the values used to fill the gaps in one set are not influenced by the other.

In [31]:
from sklearn.model_selection import train_test_split
X = df_train.iloc[:, :-1]
y = df_train.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

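The cell above relies on `sii` being the last column. A slightly more defensive variant selects the target by name and, optionally, stratifies the split so both sets keep the same class proportions. This is a sketch on an invented toy frame; `stratify=y` is my addition and is not used in the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 15 rows of class 0 and 5 rows of class 1.
toy = pd.DataFrame({
    "feature_a": range(20),
    "feature_b": [x % 3 for x in range(20)],
    "sii": [0] * 15 + [1] * 5,
})

# Select the target by name instead of by position.
X = toy.drop(columns="sii")
y = toy["sii"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(y_test.value_counts().to_dict())
```
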
We can now start replacing the missing values by splitting the features into numerical and categorical and operating on them separately:

In [32]:
# array of boolean saying if a specific column is numeric or not
is_numerical = np.array([np.issubdtype(dtype, np.number) for dtype in X.dtypes])  
numerical_idx = np.flatnonzero(is_numerical) 
# takes only the column that are numerical
new_X_train = X_train.iloc[:, numerical_idx]
new_X_test = X_test.iloc[:, numerical_idx]
#new_X_train.head(10)

Now that we have all the numerical columns, we can replace the `NaN` values of each column with values taken from the closest neighbors. <br>
To do this I used `KNNImputer`, which fills each missing value with the mean of that feature over the nearest neighbors; in this way we take the correlation between features into account. <br>
It's important to scale the values before using the k-nearest neighbors method: otherwise features with a large numeric range dominate the distance and features with a small range are almost ignored. Naturally, I rescale the values back to their original units once the imputation is done. <br>
I decided to use it because in this case it is a better fit than the mean: as already mentioned, with children from 6 to 19 years old in the same dataset, a global mean doesn't represent the specific kid well.

In [33]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
imputer = KNNImputer(n_neighbors=3)

# train set
scaled_train = scaler.fit_transform(new_X_train)
X_array = imputer.fit_transform(scaled_train)
X_array = scaler.inverse_transform(X_array)
new_X_train = pd.DataFrame(X_array, columns=new_X_train.columns, index=new_X_train.index) # convert into a dataframe since X_array is of type ndarray

# test set
scaled_test = scaler.fit_transform(new_X_test)
X_array = imputer.fit_transform(scaled_test)
X_array = scaler.inverse_transform(X_array)
new_X_test = pd.DataFrame(X_array, columns=new_X_test.columns, index=new_X_test.index)

#new_X_train.head(10)

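The cell above fits the scaler and the imputer separately on each split. A common alternative is to fit both on the training split only and reuse them on the test split, so that test rows are filled from training neighbors. The sketch below shows that variant on invented toy data, with `n_neighbors=1` to keep the result easy to follow; it is not the cell above and would change the reported numbers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"age": [6.0, 7.0, 15.0, 16.0],
                      "height": [45.0, np.nan, 66.0, 67.0]})
test = pd.DataFrame({"age": [6.5, 15.5],
                     "height": [np.nan, 66.5]})

scaler = StandardScaler()
imputer = KNNImputer(n_neighbors=1)

# Fit on the training split only, then map back to the original units.
train_filled = pd.DataFrame(
    scaler.inverse_transform(imputer.fit_transform(scaler.fit_transform(train))),
    columns=train.columns, index=train.index,
)
# Reuse the fitted scaler and imputer on the test split (transform, not fit_transform).
test_filled = pd.DataFrame(
    scaler.inverse_transform(imputer.transform(scaler.transform(test))),
    columns=test.columns, index=test.index,
)
print(train_filled)
print(test_filled)
```
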
In the imputed training frame, each field that had a missing value now holds the mean of that feature over its 3 nearest neighbors. <br>
Let's now check if there are still `NaN` values in the numerical features:

In [34]:
num_inst, num_features = new_X_train.shape
for f in range(num_features):
    col = new_X_train.iloc[:, f].astype(str)
    #print(f, np.unique(col))

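A more compact way to run the same check is to count the remaining `NaN` values directly. This is a small sketch on a toy frame; `count_missing` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def count_missing(frame: pd.DataFrame) -> int:
    """Return the total number of NaN cells in the frame."""
    return int(frame.isna().sum().sum())

clean = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
dirty = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

print(count_missing(clean), count_missing(dirty))
```
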
We can be satisfied with this first part of the data processing.

Now we have to handle the categorical values. We have 2 things to do:
- replace missing values with the mode
- transform them with One-Hot Encoding <br>

I've decided to use the mode to replace missing values because the number of categories is small (the seasons), so a simple strategy is enough and no complex modelling is needed.<br>
The mode imputer fills the missing values with the most common value of the selected feature.

In [35]:
from sklearn.impute import SimpleImputer

categorical_idx = np.flatnonzero(~is_numerical)
categorical_X_train = X_train.iloc[:, categorical_idx]
categorical_X_test = X_test.iloc[:, categorical_idx]

imputer = SimpleImputer(strategy='most_frequent')
X_array = imputer.fit_transform(categorical_X_train)
categorical_X_train = pd.DataFrame(X_array, columns=categorical_X_train.columns, index=categorical_X_train.index)

X_array = imputer.fit_transform(categorical_X_test)
categorical_X_test = pd.DataFrame(X_array, columns=categorical_X_test.columns, index=categorical_X_test.index)

#categorical_X_train.head(10)

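To make the behaviour of the mode imputer concrete, here is a minimal sketch on an invented season column. In this variant the mode is learned on the training split and reused on the test split, which differs from the cell above, where the imputer is refitted on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"season": pd.Series(["Fall", "Fall", "Winter", np.nan], dtype=object)})
test = pd.DataFrame({"season": pd.Series([np.nan, "Spring"], dtype=object)})

imputer = SimpleImputer(strategy="most_frequent")

# The most frequent training value ("Fall") fills the gaps in both splits.
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)
print(train_filled["season"].tolist(), test_filled["season"].tolist())
```
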
In [36]:
# now that we have no more missing values, we can handle categorical labels using one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes a category never seen in the training set as all zeros
oh = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# fit the encoder on the training set only, so that both sets get the same columns in the same order
oh.fit(categorical_X_train)
encoded_cols = oh.get_feature_names_out()
#print(encoded_cols)

# we now add the encoded string features to the new data frames
encoded = pd.DataFrame(oh.transform(categorical_X_train), columns=encoded_cols, index=new_X_train.index)
new_X_train = pd.concat([new_X_train, encoded], axis=1)

encoded = pd.DataFrame(oh.transform(categorical_X_test), columns=encoded_cols, index=new_X_test.index)
new_X_test = pd.concat([new_X_test, encoded], axis=1)

In [37]:
# we now have a good dataset to train our model with
#new_X_train.head(10)

Let's now try a first approach: <br>
First I will calculate the baseline accuracy, i.e. the accuracy of a naive classifier that assigns every instance to the most frequent class in the training set. <br>
Then I will use a basic Random Forest model and check its accuracy. <br>
Our goal is to at least predict better than the naive classifier.

In [38]:
baseline_accuracy = y_train.value_counts().max() / y_train.value_counts().sum()
print (f"Majority class accuracy: {baseline_accuracy:.3f}")
# our goal is to have a model that can predict better than the naive classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
model = RandomForestClassifier(max_leaf_nodes=20)
model.fit(new_X_train, y_train)

test_acc = accuracy_score(y_true = y_test, y_pred = model.predict(new_X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

#print(model.feature_importances_)

Majority class accuracy: 0.589
Test Accuracy: 0.989

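The naive baseline can also be expressed with scikit-learn's `DummyClassifier`, which learns the majority class on the training labels and is then scored on the test labels like any other model. This is a sketch with invented labels, not the numbers reported above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented labels: class 0 is the majority in the training split.
y_train = np.array([0, 0, 0, 1, 1, 2])
y_test = np.array([0, 1, 0, 0])

# The features are ignored by the dummy model, so placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"Majority class accuracy on the test split: {baseline_acc:.3f}")
```

Scoring the baseline on the test split keeps it directly comparable with the test accuracy of the real model.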

We can see that even with a basic Random Forest we get an almost perfect classifier. <br>
But there is a problem: a few features stand out as quite important for the prediction (importance above 0.05). They are:

In [39]:
importances = model.feature_importances_
feature_names = new_X_train.columns
important_features = [name for name, importance in zip(feature_names, importances) if importance > 0.05]
print(important_features)

['PCIAT-PCIAT_02', 'PCIAT-PCIAT_03', 'PCIAT-PCIAT_05', 'PCIAT-PCIAT_15', 'PCIAT-PCIAT_17', 'PCIAT-PCIAT_Total']

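Instead of a fixed threshold, the importances can be ranked to see how quickly they fall off. The sketch below uses synthetic data from `make_classification` and hypothetical feature names `f0` to `f5`; it only illustrates the pattern.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 2 are informative.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

model = RandomForestClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```
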

We can see that the most significant features are the ones related to PCIAT, the Parent-Child Internet Addiction Test. <br>
These features measure characteristics and behaviors associated with compulsive use of the Internet. <br>
From the description we can understand why they are so useful for the prediction. Unfortunately they are not present in the `test.csv` file, so to have a coherent predictor I removed them: in practice they cannot help when classifying the test data.

In [40]:
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")
print(df_train.shape[1])
print(df_test.shape[1])

82
59


In [41]:
train_features = df_train.columns.tolist()
test_features = df_test.columns.tolist()
features_toremove =  list(set(train_features) - set(test_features) - {'sii'})
print(features_toremove)

['PCIAT-PCIAT_03', 'PCIAT-PCIAT_08', 'PCIAT-PCIAT_19', 'PCIAT-PCIAT_10', 'PCIAT-PCIAT_13', 'PCIAT-PCIAT_15', 'PCIAT-PCIAT_18', 'PCIAT-Season', 'PCIAT-PCIAT_11', 'PCIAT-PCIAT_14', 'PCIAT-PCIAT_01', 'PCIAT-PCIAT_17', 'PCIAT-PCIAT_02', 'PCIAT-PCIAT_07', 'PCIAT-PCIAT_05', 'PCIAT-PCIAT_16', 'PCIAT-PCIAT_09', 'PCIAT-PCIAT_06', 'PCIAT-PCIAT_04', 'PCIAT-PCIAT_Total', 'PCIAT-PCIAT_20', 'PCIAT-PCIAT_12']

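A set difference returns the columns in arbitrary order, as the output above shows. If a stable, readable order is preferred, a list comprehension over the training columns keeps them in their original position. This is a sketch on two empty toy frames with a handful of column names.

```python
import pandas as pd

# Empty toy frames: only the column names matter here.
train = pd.DataFrame(columns=["id", "Basic_Demos-Age", "PCIAT-PCIAT_01", "PCIAT-PCIAT_Total", "sii"])
test = pd.DataFrame(columns=["id", "Basic_Demos-Age"])

# Columns present in train but not in test, keeping the train order and the target.
features_toremove = [c for c in train.columns if c not in test.columns and c != "sii"]
print(features_toremove)
```
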

As we can see, the features that are not included in the test set are exactly the PCIAT features.<br>
Let's now repeat the cleaning without the PCIAT features:

In [42]:
del df_train['id']
for col in features_toremove: # this time we remove the PCIAT features from the train set
    del df_train[col]
df_train.dropna(subset=['sii'], inplace=True)
#df_train.columns.to_list()

In [43]:
# this part is the same as before

physical_measures_df = pd.read_csv('data/physical_measures.csv')

df_train = df_train.merge( physical_measures_df, on=['Basic_Demos-Age', 'Basic_Demos-Sex'], suffixes=('', '_avg'))
cols = ['Physical-BMI','Physical-Height','Physical-Weight','Physical-Waist_Circumference','Physical-Diastolic_BP','Physical-HeartRate','Physical-Systolic_BP']
tot_nan_phys = df_train[cols].isna().all(axis=1)

for col in cols:
    df_train.loc[tot_nan_phys, col] = df_train.loc[tot_nan_phys, f"{col}_avg"]
    del df_train[f"{col}_avg"]


X = df_train.iloc[:, :-1]
y = df_train.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

is_numerical = np.array([np.issubdtype(dtype, np.number) for dtype in X.dtypes])  
numerical_idx = np.flatnonzero(is_numerical) 
new_X_train = X_train.iloc[:, numerical_idx]
new_X_test = X_test.iloc[:, numerical_idx]


scaler = StandardScaler()
imputer = KNNImputer(n_neighbors=3)

scaled_train = scaler.fit_transform(new_X_train)
X_array = imputer.fit_transform(scaled_train)
X_array = scaler.inverse_transform(X_array)
new_X_train = pd.DataFrame(X_array, columns=new_X_train.columns, index=new_X_train.index) # convert into a dataframe since X_array is of type ndarray

scaled_test = scaler.fit_transform(new_X_test)
X_array = imputer.fit_transform(scaled_test)
X_array = scaler.inverse_transform(X_array)
new_X_test = pd.DataFrame(X_array, columns=new_X_test.columns, index=new_X_test.index)

categorical_idx = np.flatnonzero(~is_numerical)
categorical_X_train = X_train.iloc[:, categorical_idx]
categorical_X_test = X_test.iloc[:, categorical_idx]

imputer = SimpleImputer(strategy='most_frequent')
X_array = imputer.fit_transform(categorical_X_train)
categorical_X_train = pd.DataFrame(X_array, columns=categorical_X_train.columns, index=categorical_X_train.index)

X_array = imputer.fit_transform(categorical_X_test)
categorical_X_test = pd.DataFrame(X_array, columns=categorical_X_test.columns, index=categorical_X_test.index)


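# handle_unknown='ignore' encodes a category never seen in the training set as all zeros
oh = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# fit the encoder on the training set only, so that both sets get the same columns in the same order
oh.fit(categorical_X_train)
encoded_cols = oh.get_feature_names_out()

encoded = pd.DataFrame(oh.transform(categorical_X_train), columns=encoded_cols, index=new_X_train.index)
new_X_train = pd.concat([new_X_train, encoded], axis=1)

encoded = pd.DataFrame(oh.transform(categorical_X_test), columns=encoded_cols, index=new_X_test.index)
new_X_test = pd.concat([new_X_test, encoded], axis=1)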


Let's now see how our Random Forest classifier behaves:

In [44]:
# this time the baseline is the share of the most frequent class in the test split
baseline_accuracy = y_test.value_counts().max() / y_test.value_counts().sum()
print (f"Majority class accuracy: {baseline_accuracy:.3f}")
model = RandomForestClassifier()
model.fit(new_X_train, y_train)

#print ("Best Score: {:.3f}".format(model.score(new_X_train, y_train)) )

test_acc = accuracy_score(y_true = y_test, y_pred = model.predict(new_X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

Majority class accuracy: 0.558
Test Accuracy: 0.584

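When the model and the baseline are this close, a single train/test split is a noisy way to compare them. One option is to compare cross-validated accuracies instead. This is a sketch on synthetic data from `make_classification`, so its numbers say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned features and target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# 5-fold cross-validated accuracy for the model and for the majority-class baseline.
rf_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring="accuracy")
dummy_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5, scoring="accuracy")

print(f"Random Forest: {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
print(f"Baseline:      {dummy_scores.mean():.3f} +/- {dummy_scores.std():.3f}")
```

The spread across folds gives a rough idea of whether a gap like the one above is larger than the split-to-split noise.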

This time, without the PCIAT features, we get a classifier that performs very similarly to the naive classifier (0.584 vs 0.558), so we are not really satisfied with it.<br>
In the next notebook we will try to get a better Random Forest classifier.