# **STUDENT AI** - DATA CLEANING

## Objectives

Inspect the dataset and resolve any issues arising from incorrect data types or from missing or invalid values.

## Inputs

Continues assessing the dataset loaded in the previous notebook.

## Outputs

Saves the cleaned dataset back to the inputs/dataset folder for further use.


---

# Import required libraries

In [1]:
import os
import pandas as pd
from pandas_profiling import ProfileReport

print('All Libraries Loaded')

All Libraries Loaded




# Change working directory

### Set the working directory to notebook parent folder
If the output does not match, click **'clear all outputs'** and then **'restart'** the notebook. 
Then run cells from top to bottom.

In [2]:
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('If correct, Active Directory should read: /workspace/student-AI')
print(f"Active Directory: {current_dir}")

If correct, Active Directory should read: /workspace/student-AI
Active Directory: /workspace/student-AI


### Load saved dataset

In [3]:
df = pd.read_csv("inputs/dataset/Expanded_data_with_more_features.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


### Summarize Dataset Again

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           30641 non-null  int64  
 1   Gender               30641 non-null  object 
 2   EthnicGroup          28801 non-null  object 
 3   ParentEduc           28796 non-null  object 
 4   LunchType            30641 non-null  object 
 5   TestPrep             28811 non-null  object 
 6   ParentMaritalStatus  29451 non-null  object 
 7   PracticeSport        30010 non-null  object 
 8   IsFirstChild         29737 non-null  object 
 9   NrSiblings           29069 non-null  float64
 10  TransportMeans       27507 non-null  object 
 11  WklyStudyHours       29686 non-null  object 
 12  MathScore            30641 non-null  int64  
 13  ReadingScore         30641 non-null  int64  
 14  WritingScore         30641 non-null  int64  
dtypes: float64(1), int64(4), object(10)


At first glance, 30,000+ rows should be enough data to train an ML model. <br>
There seems to be no reason for NrSiblings to be a float data type, since a student cannot have 0.25 of a brother or sister, so the column can be converted to an integer type. However, the conversion throws an error because the column seems to contain NaN or Inf values, so the missing data is dealt with first.
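The conversion error described above can be reproduced on a toy column. This is a minimal sketch rather than the notebook's own code: the `toy` Series is made up for illustration, and the nullable `Int64` dtype is shown only as one possible alternative to dropping rows before casting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NrSiblings column, with one missing value
toy = pd.Series([3.0, 0.0, np.nan, 1.0], name="NrSiblings")

# A plain int cast fails because NaN has no integer representation
try:
    toy.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Option A: drop the missing rows first, then cast to a regular integer
as_int = toy.dropna().astype(int)

# Option B: keep the missing rows by using pandas' nullable integer dtype
as_nullable = toy.astype("Int64")

print(f"Plain cast failed: {cast_failed}")
print(f"After dropna + astype(int): {as_int.tolist()}")
print(f"Nullable dtype: {as_nullable.dtype}, missing kept: {as_nullable.isna().sum()}")
```

Option A matches the order used in this notebook (handle missing data, then convert). Option B may suit cases where the incomplete rows need to be kept.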

### Check for missing values

In [6]:
df.isnull().sum()

Unnamed: 0                0
Gender                    0
EthnicGroup            1840
ParentEduc             1845
LunchType                 0
TestPrep               1830
ParentMaritalStatus    1190
PracticeSport           631
IsFirstChild            904
NrSiblings             1572
TransportMeans         3134
WklyStudyHours          955
MathScore                 0
ReadingScore              0
WritingScore              0
dtype: int64

Checking with `.isnull()` shows quite a few missing values. The detailed pandas-profiling report below sheds more light on them.
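Raw counts are easier to judge as a share of all rows. For example, the 3134 missing `TransportMeans` values are roughly 10% of the 30641 rows. The sketch below shows one way to build such a summary; the `toy` DataFrame and the `summary` name are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, with a few missing entries
toy = pd.DataFrame({
    "Gender": ["female", "male", "female", "male"],
    "TransportMeans": ["school_bus", np.nan, np.nan, "private"],
    "NrSiblings": [3.0, np.nan, 4.0, 1.0],
})

# Count of missing values and their share of all rows, per column
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

Applied to the real data, the same pattern would use `df` in place of `toy`.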

In [5]:
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()


### Drop rows that contain missing data, then repeat the info and report to see how many remain

In [8]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19243 entries, 2 to 30640
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           19243 non-null  int64  
 1   Gender               19243 non-null  object 
 2   EthnicGroup          19243 non-null  object 
 3   ParentEduc           19243 non-null  object 
 4   LunchType            19243 non-null  object 
 5   TestPrep             19243 non-null  object 
 6   ParentMaritalStatus  19243 non-null  object 
 7   PracticeSport        19243 non-null  object 
 8   IsFirstChild         19243 non-null  object 
 9   NrSiblings           19243 non-null  float64
 10  TransportMeans       19243 non-null  object 
 11  WklyStudyHours       19243 non-null  object 
 12  MathScore            19243 non-null  int64  
 13  ReadingScore         19243 non-null  int64  
 14  WritingScore         19243 non-null  int64  
dtypes: float64(1), int64(4), object(10)
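Comparing the two `df.info()` outputs, 19243 of the original 30641 rows remain (about 63%), and the index still carries the original labels (2 to 30640). The sketch below shows how the retained share could be computed and the index renumbered. The `toy` DataFrame is made up, and resetting the index is an optional step that this notebook does not perform.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, with missing values in two rows
toy = pd.DataFrame({
    "TestPrep": ["none", np.nan, "completed", "none", "none"],
    "NrSiblings": [3.0, 0.0, np.nan, 1.0, 2.0],
})

rows_before = len(toy)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
rows_after = len(cleaned)

# Share of rows that survive the drop
retained_pct = round(rows_after / rows_before * 100, 1)
print(f"Kept {rows_after} of {rows_before} rows ({retained_pct}%)")

# dropna keeps the original index labels; renumber them if a clean 0..n-1 index is wanted
renumbered = cleaned.reset_index(drop=True)
print(cleaned.index.tolist(), "->", renumbered.index.tolist())
```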


In [9]:
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()
