# *Data Cleaning*

First I'm going to look at the data we are working with:

In [25]:
import numpy as np
import pandas as pd

url = "data/train.csv"
dataset = pd.read_csv(url)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 82 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   id                                      3960 non-null   object 
 1   Basic_Demos-Enroll_Season               3960 non-null   object 
 2   Basic_Demos-Age                         3960 non-null   int64  
 3   Basic_Demos-Sex                         3960 non-null   int64  
 4   CGAS-Season                             2555 non-null   object 
 5   CGAS-CGAS_Score                         2421 non-null   float64
 6   Physical-Season                         3310 non-null   object 
 7   Physical-BMI                            3022 non-null   float64
 8   Physical-Height                         3027 non-null   float64
 9   Physical-Weight                         3076 non-null   float64
 10  Physical-Waist_Circumference            898 non-null    floa

In [29]:
dataset2 = dataset.astype('object')
dataset2.describe().T

Unnamed: 0,count,unique,top,freq
id,3960,3960,00008ff9,1
Basic_Demos-Enroll_Season,3960,4,Spring,1127
Basic_Demos-Age,3960,18,8,490
Basic_Demos-Sex,3960,2,0,2484
CGAS-Season,2555,4,Spring,697
...,...,...,...,...
SDS-SDS_Total_Raw,2609.0,62.0,35.0,132.0
SDS-SDS_Total_T,2606.0,49.0,50.0,132.0
PreInt_EduHx-Season,3540,4,Spring,985
PreInt_EduHx-computerinternet_hoursday,3301.0,4.0,0.0,1524.0


We can see that the data has mixed types: string, float and int. Later we will one-hot encode the string columns and cast the rest, so that every feature is a float and the dataset is consistent.
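As a rough sketch of what that conversion could look like, the snippet below one-hot encodes a season column with `pd.get_dummies` and casts the remaining integer column to float. The `toy` frame and its values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for two columns of train.csv (values are made up).
toy = pd.DataFrame({
    "Basic_Demos-Enroll_Season": ["Spring", "Fall", "Winter", "Spring"],
    "Basic_Demos-Age": [8, 10, 12, 9],
})

# One-hot encode the season column; dtype=float gives 0.0/1.0 indicators.
encoded = pd.get_dummies(
    toy, columns=["Basic_Demos-Enroll_Season"], dtype=float
)

# Cast the remaining integer column so the whole frame is float.
encoded = encoded.astype(float)

print(encoded.dtypes)
```

Note that by default `get_dummies` leaves all indicator columns at 0 for a row whose season is missing, so the `nan` seasons need to be handled separately.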

In [30]:
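num_inst, num_features = dataset.shape

# Print the distinct values of every column, cast to str so that
# missing entries show up as 'nan' alongside the real values.
for f in range(num_features):
    col = dataset.iloc[:, f].astype(str)
    print(f, np.unique(col))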

0 ['00008ff9' '000fd460' '00105258' ... 'ffcd4dbd' 'ffed1dd5' 'ffef538e']
1 ['Fall' 'Spring' 'Summer' 'Winter']
2 ['10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22' '5' '6'
 '7' '8' '9']
3 ['0' '1']
4 ['Fall' 'Spring' 'Summer' 'Winter' 'nan']
5 ['25.0' '30.0' '31.0' '33.0' '35.0' '38.0' '39.0' '40.0' '41.0' '42.0'
 '44.0' '45.0' '46.0' '47.0' '48.0' '49.0' '50.0' '51.0' '52.0' '53.0'
 '54.0' '55.0' '56.0' '57.0' '58.0' '59.0' '60.0' '61.0' '62.0' '63.0'
 '64.0' '65.0' '66.0' '67.0' '68.0' '69.0' '70.0' '71.0' '72.0' '73.0'
 '74.0' '75.0' '76.0' '77.0' '78.0' '79.0' '80.0' '81.0' '82.0' '83.0'
 '85.0' '87.0' '88.0' '90.0' '91.0' '92.0' '93.0' '95.0' '999.0' 'nan']
6 ['Fall' 'Spring' 'Summer' 'Winter' 'nan']
7 ['0.0' '10.28168847' '10.67543945' ... '9.693766159' '9.959166667' 'nan']
8 ['33.0' '36.0' '37.5' '39.0' '39.5' '40.0' '40.5' '41.0' '41.5' '42.0'
 '42.25' '42.5' '42.75' '42.9' '43.0' '43.15' '43.2' '43.25' '43.4' '43.5'
 '43.75' '44.0' '44.2' '44.25' '44.3' '44.4' 

We can also see that some features contain `nan` values. Later we will fill these in with values estimated by a similarity-based estimator.
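One possible reading of "similarity estimator" is nearest-neighbour imputation, where a missing entry is filled from the most similar rows. The sketch below uses scikit-learn's `KNNImputer` for this; the choice of estimator, the `n_neighbors=1` setting, and the toy values are all assumptions for illustration, not necessarily what is used later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing weight (values are made up).
toy = pd.DataFrame({
    "Physical-Height": [50.0, 51.0, 60.0, 61.0],
    "Physical-Weight": [60.0, np.nan, 100.0, 102.0],
})

# How many values are missing in each column.
print(toy.isna().sum())

# Fill each missing entry from the single most similar row,
# where similarity is measured on the features that are present.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Here the row with the missing weight is closest in height to the first row, so it receives that row's weight.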

We can get additional clues by looking at `data_dictionary.csv`:

In [31]:
url = "data/data_dictionary.csv"
description = pd.read_csv(url)
description

Unnamed: 0,Instrument,Field,Description,Type,Values,Value Labels
0,Identifier,id,Participant's ID,str,,
1,Demographics,Basic_Demos-Enroll_Season,Season of enrollment,str,"Spring, Summer, Fall, Winter",
2,Demographics,Basic_Demos-Age,Age of participant,float,,
3,Demographics,Basic_Demos-Sex,Sex of participant,categorical int,01,"0=Male, 1=Female"
4,Children's Global Assessment Scale,CGAS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
...,...,...,...,...,...,...
76,Sleep Disturbance Scale,SDS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
77,Sleep Disturbance Scale,SDS-SDS_Total_Raw,Total Raw Score,int,,
78,Sleep Disturbance Scale,SDS-SDS_Total_T,Total T-Score,int,,
79,Internet Use,PreInt_EduHx-Season,Season of participation,str,"Spring, Summer, Fall, Winter",


From this additional file we can see that the features are of type float, int, categorical int or string. <br>
Where possible values are listed for a string feature, they are always the 4 seasons, so we can easily convert these features to float with one-hot encoding, while the int features can be cast to float.
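The data dictionary can also drive the conversion directly. The following is a minimal sketch, assuming a hand-written excerpt of the dictionary with only its `Field` and `Type` columns; the names `fields_by_type`, `season_cols` and `int_cols` are hypothetical helpers introduced here.

```python
import pandas as pd

# Hand-written excerpt of data_dictionary.csv (Field and Type columns only).
dictionary = pd.DataFrame({
    "Field": [
        "id",
        "Basic_Demos-Enroll_Season",
        "Basic_Demos-Sex",
        "SDS-SDS_Total_Raw",
    ],
    "Type": ["str", "str", "categorical int", "int"],
})

# Map each declared type to the list of fields that have it.
fields_by_type = {
    type_name: group["Field"].tolist()
    for type_name, group in dictionary.groupby("Type")
}

# String fields that hold a season are candidates for one-hot encoding
# (this skips `id`, which is a string but not a season).
season_cols = [f for f in fields_by_type.get("str", []) if f.endswith("Season")]

# Integer-like fields can simply be cast to float.
int_cols = fields_by_type.get("int", []) + fields_by_type.get("categorical int", [])

print("one-hot:", season_cols)
print("cast to float:", int_cols)
```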

Now that we have information about the dataset, we can start the data cleaning process.

We start by eliminating the rows that don't have a `sii` value, since they are not useful for supervised learning.
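A minimal sketch of this step, assuming the target column is named `sii` and using a made-up toy frame in place of `dataset`:

```python
import numpy as np
import pandas as pd

TARGET = "sii"  # assumed name of the target column

# Toy frame standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    TARGET: [0.0, np.nan, 2.0, np.nan],
})

# Keep only the rows that have a target value, then renumber the index.
labeled = toy.dropna(subset=[TARGET]).reset_index(drop=True)

print(f"kept {len(labeled)} of {len(toy)} rows")
```

Using `subset` restricts the check to the target column, so rows with missing features are kept for later imputation.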