# **STUDENT AI** - DATA CLEANING

## Objectives

Inspect the dataset and solve any issues that might arise from wrong data types, or missing / wrong values

## Inputs

Continues to assess the dataset loaded in the previous notebook.

## Outputs

Saves the cleaned dataset to the outputs/dataset folder for further use


---

# Import required libraries

In [1]:
import os
import pandas as pd

print('All Libraries Loaded')

All Libraries Loaded


# Change working directory

### Set the working directory to notebook parent folder
If the output does not match, click **'clear all outputs'** and then **'restart'** the notebook. 
Then run cells from top to bottom.

In [2]:
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('If correct, Active Directory should read: /workspace/student-AI')
print(f"Active Directory: {current_dir}")

If correct, Active Directory should read: /workspace/student-AI
Active Directory: /workspace/student-AI


### Load saved dataset

In [6]:
df = pd.read_csv("inputs/dataset/Expanded_data_with_more_features.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


### Drop Unnamed column that pandas created on import

In [7]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75
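As a side note, the `Unnamed: 0` column usually appears when a CSV was saved together with its index. A minimal sketch, assuming a hypothetical two-row CSV (`csv_text` is made up for illustration), shows how `index_col=0` avoids the extra column at read time:

```python
import io
import pandas as pd

# Hypothetical CSV text saved with its index (first header cell is empty)
csv_text = ",Gender,MathScore\n0,female,71\n1,male,45\n"

# Default read: the unlabeled first column is named 'Unnamed: 0'
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())

# index_col=0 uses that first column as the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_read.columns.tolist())
```

Either approach gives the same feature columns; dropping the column after loading, as done above, works just as well.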


### Summarize Dataset Again
`df.info()` reports the non-null count and data type of each column. A later cell lists the unique values for each column; if there are more than 10, it assumes a numerical variable and shows only the first 10.

In [8]:
print("Pandas dataset summary:\n")
df.info()

Pandas dataset summary:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Gender               30641 non-null  object 
 1   EthnicGroup          28801 non-null  object 
 2   ParentEduc           28796 non-null  object 
 3   LunchType            30641 non-null  object 
 4   TestPrep             28811 non-null  object 
 5   ParentMaritalStatus  29451 non-null  object 
 6   PracticeSport        30010 non-null  object 
 7   IsFirstChild         29737 non-null  object 
 8   NrSiblings           29069 non-null  float64
 9   TransportMeans       27507 non-null  object 
 10  WklyStudyHours       29686 non-null  object 
 11  MathScore            30641 non-null  int64  
 12  ReadingScore         30641 non-null  int64  
 13  WritingScore         30641 non-null  int64  
dtypes: float64(1), int64(3), object(10)
memory usage: 3.3+ MB


### Create random test list for DASHBOARD report testing
Running this cell will produce a student list without grades or missing values for use as an 'upload' file in the dashboard.<br>
To create an 'imperfect list', i.e. WITH missing values, comment out the second code line.<br>
Change n to the number of records you want to create.<br>
The csv file will be saved to the inputs/dataset folder.

In [9]:
# Work on a copy without the score columns (df itself is left unchanged)
df_list = df.drop(columns=['MathScore','ReadingScore','WritingScore'])
df_list = df_list.dropna()
df_sample = df_list.sample(n=300)

file_path = 'inputs/dataset/student_random_list.csv'

# Remove previous file if it exists
if os.path.exists(file_path):
    os.remove(file_path)

# Create the directory if it doesn't exist
os.makedirs(name='inputs/dataset', exist_ok=True)

# Save the sampled student list to the file path
df_sample.to_csv(file_path, index=True)

df_sample

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours
2568,male,group A,some college,standard,none,married,regularly,yes,3.0,private,5 - 10
3594,female,group B,high school,free/reduced,none,married,regularly,no,3.0,school_bus,5 - 10
4749,female,group D,associate's degree,standard,none,married,sometimes,yes,0.0,school_bus,< 5
26170,male,group A,bachelor's degree,standard,none,married,regularly,yes,2.0,school_bus,5 - 10
30186,female,group B,high school,standard,none,married,regularly,no,2.0,private,< 5
...,...,...,...,...,...,...,...,...,...,...,...
4977,female,group D,some high school,standard,none,married,sometimes,no,1.0,school_bus,5 - 10
24946,male,group D,high school,standard,completed,single,never,no,1.0,school_bus,< 5
29462,female,group B,associate's degree,standard,none,single,regularly,no,1.0,school_bus,< 5
4893,male,group D,bachelor's degree,standard,completed,married,regularly,no,5.0,private,> 10


### ANALYSIS
At first glance, 30000+ data rows is a very robust dataset for training an ML model. <br>
There seems to be no reason why NrSiblings should be a float data type, as you cannot have 0.25 of a brother/sister. <br>
I will treat NrSiblings as a **categorical variable** and convert it to a categorical type before using it in any modelling phases.
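A likely reason for the float type is the missing values: pandas stores an integer column as float64 as soon as it contains NaN. A minimal sketch on made-up sibling counts (the series names are hypothetical) illustrates this, along with the nullable `Int64` dtype as one possible alternative:

```python
import numpy as np
import pandas as pd

# Made-up sibling counts, once complete and once with a gap
siblings_complete = pd.Series([3, 0, 4])
siblings_with_gap = pd.Series([3, np.nan, 4])

print(siblings_complete.dtype)
print(siblings_with_gap.dtype)

# The nullable integer dtype keeps whole numbers and still allows missing entries
nullable = siblings_with_gap.astype("Int64")
print(nullable.dtype, nullable.isna().sum())
```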

In [None]:
print("\n** Detailed list of unique values **\n")
for column in df.columns:
    unique_values = df[column].unique()
    num_distinct_values = len(unique_values)
    if num_distinct_values > 10:
        print(
            f"{column}: many values - {unique_values[:10]}...")
    else:
        print(
            f"{column}: {num_distinct_values} distinct values - {unique_values[:10]}")

### ANALYSIS
The individual values make sense but will need interpreting for future categorical encoders (which convert strings to numbers).<br>
Another issue visible at this stage is the 'nan' ("not a number") entries in most variables, which indicate missing values.<br>
A more detailed check for missing values is needed ...

### Check for missing values

In [None]:
missing_values = df.isnull().sum()
percentage_missing = (missing_values / len(df)) * 100
percentage_missing = percentage_missing.round(1)
missing_data = pd.DataFrame({'Missing_Values': missing_values, 'Percentage_Missing': percentage_missing})
missing_data['Percentage_Missing'] = missing_data['Percentage_Missing'].astype(str) + '%'
missing_data

### Analysis
Considering the size of the dataset (30000+ rows), some missing values can be imputed without significantly affecting the data relationships in the dataset.
The options are either to drop (= delete) rows with missing values, which loses the data still contained in those rows, or to 'fill in the blanks' with a logical value.
I will assess the best option next.
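The two options can be contrasted on a small made-up frame. This is only a sketch; `toy` and its values are hypothetical and not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with one missing TransportMeans entry
toy = pd.DataFrame({
    "TransportMeans": ["school_bus", np.nan, "private", "school_bus"],
    "MathScore": [71, 69, 87, 45],
})

# Option 1: drop the incomplete row (its MathScore is lost as well)
dropped = toy.dropna()

# Option 2: fill the gap with the most common value and keep the row
most_common = toy["TransportMeans"].mode()[0]
filled = toy.fillna({"TransportMeans": most_common})

print(len(dropped), len(filled))
```

Dropping shortens the frame, while filling keeps every row at the cost of one guessed value.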

### Assess how many rows would need to be dropped...

In [None]:
total = len(df)
dropped_data = df.dropna()
dropped_data.info()
deleted_rows = total - len(dropped_data)
percent = (1 - len(dropped_data) / total) * 100
percent_rounded = round(percent)
print(f"\n** Dropping missing data will delete {deleted_rows} rows. ({percent_rounded}%) **")

## Initial Assumptions

* Dropping null values leaves us with 19243 data rows. I hypothesize that while this seems like enough data to still achieve the business requirement, dropping this many rows (37%) will likely introduce an imbalance and bias the dataset. Imputing logical values to fill the gaps is more advisable.
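One way to probe this hypothesis would be to compare category proportions before and after dropping rows. The sketch below uses a made-up frame (`toy` is hypothetical) in which the missing values happen to sit in one category only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: both gaps fall on 'free/reduced' rows
toy = pd.DataFrame({
    "LunchType": ["standard", "standard", "free/reduced",
                  "free/reduced", "standard", "free/reduced"],
    "TransportMeans": ["school_bus", "private", np.nan,
                       np.nan, "school_bus", "private"],
})

# Share of each LunchType before and after dropping incomplete rows
before = toy["LunchType"].value_counts(normalize=True)
after = toy.dropna()["LunchType"].value_counts(normalize=True)

# Positive values mean the category is over-represented after dropping
shift = (after - before).round(2)
print(shift)
```

Running the same comparison on `df` for each categorical column would show whether dropping rows actually shifts the real distributions.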

## Imputing Missing Values
Let's review the missing_data from the full dataset

In [None]:
missing_data

### Analysis
The highest missing data rate is TransportMeans at 10.2%. The variable is binary (its unique values are school_bus and private), so imputing the most common value is an option that does not introduce too much ambiguity.<br>
If the imputed values introduce more errors than desired, another option is to drop the column entirely. LunchType has no missing values, and I hypothesize that LunchType and TransportMeans indicate a similar socio-economic status of a given student's family: e.g. if they need to rely on a school bus, they will most likely also rely on free school lunches. This in turn might affect other extracurricular support the student receives, which could eventually influence the student's performance.
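The assumed link between LunchType and TransportMeans could be checked with a cross-tabulation. A minimal sketch on made-up rows (`toy` is hypothetical; the real check would use `df` before imputation):

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy = pd.DataFrame({
    "LunchType": ["standard", "standard", "free/reduced",
                  "free/reduced", "standard", "free/reduced"],
    "TransportMeans": ["private", "school_bus", "school_bus",
                       "school_bus", "private", "private"],
})

# Row-normalized shares: how each LunchType group travels to school
share = pd.crosstab(toy["LunchType"], toy["TransportMeans"], normalize="index")
print(share)
```

If the rows of the real table look nearly identical, the two variables carry little shared information and the hypothesis would not hold.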

### Imputing Categorical Variables

Categorical variables with missing values are :
* 'EthnicGroup'
* 'TestPrep'
* 'ParentEduc'
* 'ParentMaritalStatus'
* 'IsFirstChild'
* 'PracticeSport'
* 'TransportMeans'
* 'WklyStudyHours'
* ('LunchType' and 'Gender' have no missing values and do not need to be adjusted)

For these categorical variables I will insert the most common value from the dataset (the mode), as that is probabilistically the closest to the actual value.

In [None]:
for column in ['EthnicGroup', 'TestPrep', 'ParentEduc', 'TransportMeans', 'ParentMaritalStatus', 'IsFirstChild', 'PracticeSport', 'WklyStudyHours']:
    mode_value = df[column].mode()[0]
    # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
    df[column] = df[column].fillna(mode_value)
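One detail worth knowing about `mode()[0]`: `mode()` returns all tied values in sorted order, so on a tie the first one alphabetically is used. A small sketch with a made-up series (`tied` is hypothetical):

```python
import pandas as pd

# Hypothetical column where both values occur equally often
tied = pd.Series(["private", "school_bus", "school_bus", "private", None])

# mode() ignores missing entries and returns every tied value, sorted
modes = tied.mode()
print(modes.tolist())

# Taking element 0 therefore picks the alphabetically first value on a tie
chosen = modes[0]
print(chosen)
```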

### Imputing Numerical Variables

The only numerical variable in the feature set (not counting the scores, which have no missing values) is NrSiblings.
Once the missing values have been imputed, I can also change the data type to a more sensible integer rather than float.
The imputed values will be based on the **median** instead of the **mean**, as the median is less sensitive to outliers and the variable does contain some 'extreme' values of 6 or more siblings.

In [None]:
median_value = df['NrSiblings'].median()
df['NrSiblings'] = df['NrSiblings'].fillna(median_value).astype(int)
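The effect of a single extreme value on mean and median can be seen on a few made-up sibling counts (`siblings` is hypothetical):

```python
import pandas as pd

# Hypothetical sibling counts with one large family
siblings = pd.Series([0, 1, 1, 2, 2, 3, 7])

# The mean is pulled upwards by the value 7 and is not a whole number
print(siblings.mean())

# The median stays at a value that actually occurs in the data
print(siblings.median())
```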

### Quick check for remaining missing values and check datatype change and possible duplicate values:

In [None]:
df.isnull().sum()

### Analysis
Good, there are no more missing values in the feature variables. The only remaining point is NrSiblings, which the original dataset lists as a numerical float variable. To check its data type again:

In [None]:
df['NrSiblings'].dtype

### Convert NrSiblings variable to categorical by changing dtype to category

In [None]:
df['NrSiblings'] = df['NrSiblings'].astype('category')
df['NrSiblings'].dtype
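Note that the `category` dtype lives only in memory: a CSV file stores plain text, so the column is read back as an integer. A minimal sketch with an in-memory buffer (the `toy` frame is hypothetical) shows the round trip and one way to restore the dtype after loading:

```python
import io
import pandas as pd

# Hypothetical frame with a categorical sibling count
toy = pd.DataFrame({"NrSiblings": pd.Series([3, 0, 4]).astype("category")})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(toy["NrSiblings"].dtype, reloaded["NrSiblings"].dtype)

# Re-apply the categorical dtype after loading
reloaded["NrSiblings"] = reloaded["NrSiblings"].astype("category")
print(reloaded["NrSiblings"].dtype)
```

Follow-on notebooks that read the saved file would need to repeat this conversion.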

In [None]:
# Remap values to unproblematic strings or logical values
study_mapping = {
    '< 5': 'Less than 5 hours',
    '5 - 10': 'Between 5-10 hours',
    '> 10': 'More than 10 hours'
}
test_mapping = {
    'none': "not completed",
    'completed': "completed"
}
bus_mapping = {
    'private': 'private',
    'school_bus': 'schoolbus'
}
# 'some high school' is merged into 'highschool'
parentEduc_mapping = {
    "bachelor's degree": 'bachelor',
    "some college": 'college',
    "master's degree": 'masters',
    "associate's degree": 'associates',
    "high school": 'highschool',
    "some high school": 'highschool',
}
lunch_mapping = {
    "free/reduced": 'free',
    "standard": 'standard',
}
# Remove 'group ' from EthnicGroup Column
df['EthnicGroup'] = df['EthnicGroup'].str.replace('group ', '', case=False)

# Adjust values in the column
df['WklyStudyHours'] = df['WklyStudyHours'].map(study_mapping)
df['TestPrep'] = df['TestPrep'].map(test_mapping)
df['LunchType'] = df['LunchType'].map(lunch_mapping)
df['TransportMeans'] = df['TransportMeans'].map(bus_mapping)
df['ParentEduc'] = df['ParentEduc'].map(parentEduc_mapping)

df
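A caveat with `Series.map` and a dictionary: any value that is not a key in the mapping becomes NaN, which would silently reintroduce missing values. A short sketch, using a made-up series with one hypothetical label (`'10 - 15'`) that the mapping does not cover:

```python
import pandas as pd

study_mapping = {
    '< 5': 'Less than 5 hours',
    '5 - 10': 'Between 5-10 hours',
    '> 10': 'More than 10 hours',
}

# Hypothetical input; the last label is not a key in study_mapping
hours = pd.Series(['< 5', '5 - 10', '> 10', '10 - 15'])
mapped = hours.map(study_mapping)

# List the original labels that the mapping turned into NaN
unmapped = hours[mapped.isna()].unique()
print(unmapped)
```

Running `df.isnull().sum()` again after the remapping cell would confirm that no values were lost this way.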

## Manually Adjust Categorical Variables
Adjust values to more sensible categories. For instance, binary categories can already be set to 0 or 1 instead of male/female or yes/no.
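A minimal sketch of such a binary encoding on a made-up frame. Which label becomes 0 and which becomes 1 is an arbitrary assumption here, and `toy` is hypothetical:

```python
import pandas as pd

# Hypothetical rows using the dataset's binary labels
toy = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "IsFirstChild": ["yes", "no", "yes"],
})

# Assumed encoding: male/no -> 0, female/yes -> 1
encoded = toy.assign(
    Gender=toy["Gender"].map({"male": 0, "female": 1}),
    IsFirstChild=toy["IsFirstChild"].map({"no": 0, "yes": 1}),
)
print(encoded)
```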

### The data is now clean and logical values have been added. In the next notebook I will conduct an EDA to go into detail about the feature set and data distribution / balance.

## Save file to repository for follow on notebooks

In [None]:
file_path = 'outputs/dataset/Expanded_data_with_more_features_clean.csv'

# Remove previous file if it exists
if os.path.exists(file_path):
    os.remove(file_path)

# Create the directory if it doesn't exist
os.makedirs(name='outputs/dataset', exist_ok=True)

# Save cleaned DataFrame to the file path
df.to_csv(file_path, index=False)