# Child Mind Institute — Problematic Internet Use

## Team notes & log

🔔 <b>REMINDER:</b> Run git pull before starting work and git push after finishing. Both options are built into Kaggle once you've linked your GitHub account.

<u><b>To Do:</b></u>

- CP: Explore each category of data to see how it should be handled.
- CP: Handle missing values. (Idea: Use KNN)
- CP: Drop unnecessary columns.
- CP: One-hot encoding, where possible.
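As a starting point for the missing-value and one-hot items above, here is a minimal sketch on a made-up toy frame. The column names mirror `train.csv`, but the values, the choice of `n_neighbors=2`, and the decision to keep a separate "missing season" column are all assumptions for illustration, not decisions we have made.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (made-up values) that mimics the layout of train.csv:
# one categorical season column and two numeric columns with gaps.
toy = pd.DataFrame({
    "Physical-Season": ["Fall", "Summer", None, "Winter", "Fall"],
    "Physical-BMI": [16.9, 14.0, np.nan, 18.3, 22.3],
    "Physical-Height": [46.0, 48.0, 56.5, np.nan, 59.5],
})

numeric_cols = ["Physical-BMI", "Physical-Height"]

# KNN imputation: each missing value is replaced by the mean of that
# feature over the k nearest rows (distances use the observed features).
imputer = KNNImputer(n_neighbors=2)  # k=2 is an arbitrary choice for the toy data
toy[numeric_cols] = imputer.fit_transform(toy[numeric_cols])

# One-hot encode the season; dummy_na=True keeps "missing" as its own column.
encoded = pd.get_dummies(toy, columns=["Physical-Season"], dummy_na=True)
print(encoded)
```

Because KNN relies on distances, features on very different scales (height in inches versus a 1-100 score, for example) may need scaling first. Whether that helps here is something we would have to check on the real data.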

<u>Notes from Célie</u>

- 11/2: Hi Anusha! I will label my work with a comment and my initials (CP).
- 11/2: You can link the Kaggle notebook to your GitHub from within Kaggle.
- 11/3: I have set up this notebook so it works either locally or via Kaggle.
- 11/3: Cleaned up some repetitive sections.


## Preprocessing


In [19]:
# Import libraries

import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

####AG###

# CP
import numpy as np
import pandas as pd

In [3]:
import os

# CP: Load data

# CP: Check if you are running in Kaggle or locally

# CP: Running locally
if os.path.exists('kaggle_data'):
    train = pd.read_csv('kaggle_data/train.csv')
    test = pd.read_csv('kaggle_data/test.csv')
    data_dict = pd.read_csv('kaggle_data/data_dictionary.csv')

# CP: Running in Kaggle
else:
    train = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/train.csv')
    test = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/test.csv')
    data_dict = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/data_dictionary.csv')

# CP: Show all rows when displaying data
pd.set_option('display.max_rows', None)

# CP: Display data dictionary
data_dict

Unnamed: 0,Instrument,Field,Description,Type,Values,Value Labels
0,Identifier,id,Participant's ID,str,,
1,Demographics,Basic_Demos-Enroll_Season,Season of enrollment,str,"Spring, Summer, Fall, Winter",
2,Demographics,Basic_Demos-Age,Age of participant,float,,
3,Demographics,Basic_Demos-Sex,Sex of participant,categorical int,"0, 1","0=Male, 1=Female"
4,Children's Global Assessment Scale,CGAS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
5,Children's Global Assessment Scale,CGAS-CGAS_Score,Children's Global Assessment Scale Score,int,,
6,Physical Measures,Physical-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
7,Physical Measures,Physical-BMI,Body Mass Index (kg/m^2),float,,
8,Physical Measures,Physical-Height,Height (in),float,,
9,Physical Measures,Physical-Weight,Weight (lbs),float,,
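The local-versus-Kaggle check in the loading cell repeats the same folder path for every file. One possible way to reduce that repetition is sketched below. The helper names `resolve_data_dir` and `data_path` are hypothetical (they are not in the notebook); the two folder paths are the ones the loading cell already uses.

```python
import os

# Both folders are the ones used in the loading cell.
LOCAL_DIR = "kaggle_data"
KAGGLE_DIR = "/kaggle/input/child-mind-institute-problematic-internet-use"


def resolve_data_dir(local_dir=LOCAL_DIR, kaggle_dir=KAGGLE_DIR):
    """Return the local data folder if it exists, otherwise the Kaggle one."""
    return local_dir if os.path.exists(local_dir) else kaggle_dir


def data_path(filename, data_dir=None):
    """Build the full path to one of the competition CSV files."""
    return os.path.join(data_dir or resolve_data_dir(), filename)


# Prints the path that would be handed to pd.read_csv for the training data.
print(data_path("train.csv"))
```

With something like this, each load becomes `pd.read_csv(data_path("train.csv"))`, so the folder is written once and only the file name changes per call.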


In [4]:
train.head(20)

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,4.0,2.0,4.0,55.0,,,,Fall,3.0,2.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,0.0,0.0,0.0,0.0,Fall,46.0,64.0,Summer,0.0,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,2.0,1.0,1.0,28.0,Fall,38.0,54.0,Summer,2.0,0.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,3.0,4.0,1.0,44.0,Summer,31.0,45.0,Winter,0.0,1.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,,,,,,,,,
5,001f3379,Spring,13,1,Winter,50.0,Summer,22.279952,59.5,112.2,...,1.0,2.0,1.0,34.0,Summer,40.0,56.0,Spring,0.0,1.0
6,0038ba98,Fall,10,0,,,Fall,19.66076,55.0,84.6,...,4.0,1.0,0.0,20.0,Winter,27.0,40.0,Fall,3.0,0.0
7,0068a485,Fall,10,1,,,Fall,16.861286,59.25,84.2,...,,,,,,,,Fall,2.0,
8,0069fbed,Summer,15,0,,,Spring,,,,...,,,,,,,,Summer,2.0,
9,0083e397,Summer,19,1,Summer,,,,,,...,,,,,,,,,,


In [5]:
####AG###

# The Children's Global Assessment Scale (CGAS) score rates a child's general
# level of functioning, typically on a scale of 1-100.

# In the first 20 rows, most columns contain null values;
# id, enrollment season, age, and sex do not.

# CP: Check for missing/null values
train.isnull().sum()

id                                           0
Basic_Demos-Enroll_Season                    0
Basic_Demos-Age                              0
Basic_Demos-Sex                              0
CGAS-Season                               1405
CGAS-CGAS_Score                           1539
Physical-Season                            650
Physical-BMI                               938
Physical-Height                            933
Physical-Weight                            884
Physical-Waist_Circumference              3062
Physical-Diastolic_BP                     1006
Physical-HeartRate                         993
Physical-Systolic_BP                      1006
Fitness_Endurance-Season                  2652
Fitness_Endurance-Max_Stage               3217
Fitness_Endurance-Time_Mins               3220
Fitness_Endurance-Time_Sec                3220
FGC-Season                                 614
FGC-FGC_CU                                1638
FGC-FGC_CU_Zone                           1678
FGC-FGC_GSND 
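Raw null counts are easier to compare as fractions of the row count. For instance, `Physical-Waist_Circumference` is missing in 3062 of 3960 rows, roughly 77%. The sketch below shows one way to rank columns by missing fraction and flag the sparsest ones. It runs on a made-up toy frame, and the 0.5 cutoff is a hypothetical threshold, not one we have agreed on.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; in the notebook this would be `train`.
toy = pd.DataFrame({
    "Basic_Demos-Age": [5, 9, 10, 9],
    "CGAS-CGAS_Score": [51.0, np.nan, 71.0, 71.0],
    "Physical-Waist_Circumference": [np.nan, np.nan, np.nan, 22.0],
})

# Fraction of missing values per column, largest first.
missing_frac = toy.isnull().mean().sort_values(ascending=False)
print(missing_frac)

# Hypothetical cutoff: flag columns that are more than half empty.
THRESHOLD = 0.5
mostly_missing = missing_frac[missing_frac > THRESHOLD].index.tolist()
print(mostly_missing)
```

Flagged columns are candidates for the "drop unnecessary columns" item, although a sparse column can still carry signal, so flagging is only a first pass.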

In [6]:
# CP: Explore data
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 82 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   id                                      3960 non-null   object 
 1   Basic_Demos-Enroll_Season               3960 non-null   object 
 2   Basic_Demos-Age                         3960 non-null   int64  
 3   Basic_Demos-Sex                         3960 non-null   int64  
 4   CGAS-Season                             2555 non-null   object 
 5   CGAS-CGAS_Score                         2421 non-null   float64
 6   Physical-Season                         3310 non-null   object 
 7   Physical-BMI                            3022 non-null   float64
 8   Physical-Height                         3027 non-null   float64
 9   Physical-Weight                         3076 non-null   float64
 10  Physical-Waist_Circumference            898 non-null    floa

In [7]:
# CP: Explore data
train.shape

(3960, 82)

In [8]:
# CP: Check target values
train['sii'].value_counts()

sii
0.0    1594
1.0     730
2.0     378
3.0      34
Name: count, dtype: int64
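The counts above are imbalanced: class 3 has only 34 of the 2736 labeled rows, a little over 1%. A plain random split could leave very few of those rows in the validation set, so a stratified split may be worth considering. The sketch below rebuilds a label vector from the printed counts and uses a placeholder feature column; `test_size=0.2` and `random_state=42` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild a label vector with the class counts printed by value_counts().
counts = {0.0: 1594, 1.0: 730, 2.0: 378, 3.0: 34}
y = pd.Series(
    [label for label, n in counts.items() for _ in range(n)], name="sii"
)
X = pd.DataFrame({"row": range(len(y))})  # placeholder feature, not real data

# stratify=y keeps the class proportions roughly equal in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Compare class proportions in the full label vector and the validation split.
print(y.value_counts(normalize=True).round(3))
print(y_val.value_counts(normalize=True).round(3))
```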

In [9]:
# CP: Check missing target values
train['sii'].isnull().sum()

np.int64(1224)

In [10]:
# CP: Drop any rows where target value is missing
# since they cannot be used for training.
train.dropna(subset=['sii'], inplace=True)

In [11]:
# CP: Recheck missing target values
train['sii'].isnull().sum()

np.int64(0)

In [13]:
# CP: Check for duplicates
train.duplicated().sum()

np.int64(0)

In [15]:
# CP: Drop ID, which should not be used for training the model
train.drop(columns=['id'], inplace=True)

# CP: Confirm the remaining columns and non-null counts
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 81 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               2736 non-null   object 
 1   Basic_Demos-Age                         2736 non-null   int64  
 2   Basic_Demos-Sex                         2736 non-null   int64  
 3   CGAS-Season                             2342 non-null   object 
 4   CGAS-CGAS_Score                         2342 non-null   float64
 5   Physical-Season                         2595 non-null   object 
 6   Physical-BMI                            2527 non-null   float64
 7   Physical-Height                         2530 non-null   float64
 8   Physical-Weight                         2572 non-null   float64
 9   Physical-Waist_Circumference            483 non-null    float64
 10  Physical-Diastolic_BP                   2478 non-null   float64
 