# **STUDENT AI** - DATA CLEANING

## Objectives

Inspect the dataset and resolve any issues arising from wrong data types or from missing / incorrect values.

## Inputs

Continues to assess dataset loaded in previous notebook.

## Outputs

Saves the cleaned dataset to the outputs/dataset folder for further use, and a random sample of complete rows to the inputs/dataset folder for report testing.


---

# Import required libraries

In [1]:
import os
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

print('All Libraries Loaded')

All Libraries Loaded




# Change working directory

### Set the working directory to notebook parent folder
If the output does not match, click **'clear all outputs'** and then **'restart'** the notebook. 
Then run cells from top to bottom.

In [2]:
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('If correct, Active Directory should read: /workspace/student-AI')
print(f"Active Directory: {current_dir}")

If correct, Active Directory should read: /workspace/student-AI
Active Directory: /workspace/student-AI


### Load saved dataset

In [31]:
df = pd.read_csv("inputs/dataset/Expanded_data_with_more_features.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


### Drop Unnamed column that pandas created on import

In [32]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


In [33]:
df.drop(columns=['MathScore','ReadingScore','WritingScore'], inplace=True)
df.head()

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours
0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5
1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10
2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5
3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10
4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10


### Create random test list for report testing

In [36]:
df_clean = df.dropna()
df_sample = df_clean.sample(n=300)

file_path = 'inputs/dataset/student_random_list.csv'

# Remove previous file if it exists
if os.path.exists(file_path):
    os.remove(file_path)

# Create the directory if it doesn't exist
os.makedirs(name='inputs/dataset', exist_ok=True)

# Save the random sample, keeping the original row index
df_sample.to_csv(file_path, index=True)

In [37]:
df_sample

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours
20030,male,group B,some college,standard,completed,married,sometimes,yes,,private,< 5
17149,female,group B,bachelor's degree,standard,none,single,sometimes,yes,1.0,private,5 - 10
8737,male,,,free/reduced,none,married,never,yes,4.0,private,< 5
27922,male,group C,some college,standard,none,single,sometimes,no,5.0,private,5 - 10
22336,female,group C,some high school,standard,none,married,regularly,yes,5.0,,5 - 10
...,...,...,...,...,...,...,...,...,...,...,...
26493,female,group E,associate's degree,free/reduced,completed,divorced,sometimes,yes,3.0,private,5 - 10
18985,male,group C,some high school,standard,none,married,sometimes,yes,1.0,private,5 - 10
15177,male,group C,some high school,free/reduced,none,married,sometimes,no,2.0,private,5 - 10
27155,female,group C,master's degree,standard,none,single,sometimes,yes,2.0,school_bus,5 - 10


### Summarize Dataset Again

In [None]:

df.info()

At first glance, 30,000+ rows is a robust dataset for training an ML model. <br>
There seems to be no reason for NrSiblings to be a float data type, as you cannot have 0.25 of a brother or sister.
I will convert NrSiblings to a string type, as it needs to be treated as a categorical variable rather than a numerical one. However, this throws an error because the column contains NaN or Inf values, so the missing data has to be dealt with first.
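
The float dtype is a side effect of the missing values: a NumPy integer column cannot hold NaN, so pandas upcasts it to float64. A minimal sketch on a made-up Series (the values are illustrative and not taken from the dataset) reproduces the casting error and shows pandas' nullable `Int64` dtype as one possible alternative:

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the student dataset
siblings = pd.Series([3, 0, np.nan, 1])

# A single missing value forces the whole column to float64
print(siblings.dtype)

# Casting directly to int fails while NaN is present
try:
    siblings.astype(int)
    cast_failed = False
except (ValueError, TypeError) as err:
    cast_failed = True
    print(f"astype(int) failed: {err}")

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
siblings_nullable = siblings.astype("Int64")
print(siblings_nullable)
```

The notebook instead imputes the gaps first and then casts, which avoids the error without needing the nullable dtype.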

### Check for missing values

In [None]:
missing_values = df.isnull().sum()
percentage_missing = (missing_values / len(df)) * 100
percentage_missing = percentage_missing.round(1)
missing_data = pd.DataFrame({'Missing_Values': missing_values, 'Percentage_Missing': percentage_missing})
missing_data['Percentage_Missing'] = missing_data['Percentage_Missing'].astype(str) + '%'
missing_data

### Assess how many rows would need to be dropped...

In [None]:
dropped_data = df.dropna()
dropped_data.info()

## Initial Assumptions

* Dropping null values leaves us with 19243 data rows. I hypothesize that, while this still seems like enough data to achieve the business requirement, dropping this many rows will likely introduce an imbalance and bias the dataset. If confirmed, then imputing logical values to fill the gaps is more advisable.
* The Pandas report shows **Alerts** under the NrSiblings variable, as it assumes that a value of zero (0) could be a problem. In this case the alert can be ignored, as a value of zero (0) is valid and indicates an only child.
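
The first hypothesis can also be checked numerically by comparing category shares before and after the rows are dropped. The helper below is a rough sketch; `proportion_shift` and the toy frame are hypothetical names and values introduced only for illustration:

```python
import numpy as np
import pandas as pd

def proportion_shift(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Compare category shares in `column` before and after dropping incomplete rows."""
    before = frame[column].value_counts(normalize=True)
    after = frame.dropna()[column].value_counts(normalize=True)
    shift = pd.DataFrame({"before": before, "after": after}).fillna(0.0)
    shift["difference"] = shift["after"] - shift["before"]
    return shift.round(3)

# Toy frame (hypothetical values) where the gaps occur only in 'standard' rows
toy = pd.DataFrame({
    "LunchType": ["standard"] * 6 + ["free/reduced"] * 4,
    "NrSiblings": [1, np.nan, 2, np.nan, 0, 3, 1, 2, 0, 4],
})
print(proportion_shift(toy, "LunchType"))
```

Calling `proportion_shift(df, 'LunchType')` on the real data would show how far each share moves; large differences would support imputing rather than dropping.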


## Assessing imbalance on 'dropped' dataset

In [None]:
pandas_report = ProfileReport(df=dropped_data, minimal=True)
pandas_report.to_notebook_iframe()

## Pandas Report on 'dropped' dataset

Click on each variable to view its specific report:

| Variable | Obvious Insights |
|---|---|
|Gender|shows a near-equal distribution of 50.8% to 49.2% and no missing values|
|EthnicGroup|the report is of little value as it gives the word 'group' the most emphasis - **the string 'group ' needs to be removed from the data**|
|ParentEduc|again, the emphasis on the word 'some' is not informative. This categorical variable should be changed to an ordinal numerical one running from 0 (none) to n (highest level of education), with the 'some' levels as intermediate values|
|LunchType|this feature is imbalanced with 64.8% standard and 35.2% free/reduced. Will require some feature engineering|
|TestPrep|this feature is imbalanced with 65.4% completed and 34.6% none. Will require some feature engineering|
|ParentMaritalStatus|this feature is imbalanced, with the majority of rows (57.2%) labelled married|
|PracticeSport|this feature is imbalanced, with the majority of rows (50.5%) labelled sometimes. It will need to be engineered or dropped|
|IsFirstChild|this feature is imbalanced with 64.5% yes and 35.6% no. Will require some feature engineering|
|NrSiblings|this numerical feature looks skewed and has a few outliers that could possibly be removed - further analysis required|
|TransportMeans|this feature is slightly imbalanced: 58.6% private vs 41.4% school_bus|
|WklyStudyHours|this is a categorical feature with options <5, 5-10, and >10 hours. The distribution is reasonably balanced at 39.3% / 32.4% / 28.3%|
|MathScore|at first glance looks like a normal distribution, but with obvious missing values that were dropped|
|ReadingScore|at first glance looks like a normal distribution, but with more obvious missing values that were dropped|
|WritingScore|at first glance looks like a normal distribution, but with some missing values that were dropped|
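
The two transformations proposed for EthnicGroup and ParentEduc could look roughly like the sketch below. The ranking in `PARENT_EDUC_ORDER` is an assumption made for illustration (including the 'high school' level, which does not appear in the rows shown earlier), and `ParentEducRank` is a hypothetical column name:

```python
import pandas as pd

# Assumed ranking from lowest to highest level; adjust to the levels actually present
PARENT_EDUC_ORDER = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# Toy frame reusing labels that appear in the dataset preview
toy = pd.DataFrame({
    "EthnicGroup": ["group C", "group B", "group A"],
    "ParentEduc": ["some college", "master's degree", "associate's degree"],
})

# Remove the repeated 'group ' prefix so only the letter remains
toy["EthnicGroup"] = toy["EthnicGroup"].str.replace("group ", "", regex=False)

# Map each education level to its position in the assumed ranking
educ_rank = {level: rank for rank, level in enumerate(PARENT_EDUC_ORDER)}
toy["ParentEducRank"] = toy["ParentEduc"].map(educ_rank)
print(toy)
```

Any label missing from the mapping becomes NaN after `map`, which makes unexpected categories easy to spot.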


### Dropping the rows with missing values has indeed introduced an imbalance into the dataset. I will therefore assess the viability of imputing data to fill the gaps.

## Imputing Missing Values
Let's review the missing_data from the full dataset.

In [None]:
missing_data

The highest missing data rate is TransportMeans at 10.2%. Thinking logically about how this feature might affect a student's overall school performance, it seems to be an indicator of the family's economic status.
Another indicator of the family's economic status is LunchType, which indicates whether the family has the means to pay for its own school lunches. This feature has no missing values.

I will therefore **drop the TransportMeans** feature, as it has many missing values and a single indicator of family economic status should be sufficient to assess its effect on school performance.
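
This reasoning assumes that TransportMeans and LunchType carry overlapping information. One way the assumption could be checked before the column is dropped is a row-normalised cross-tabulation. The sketch below runs on a hypothetical toy frame, so its numbers say nothing about the actual dataset:

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "LunchType": ["standard", "standard", "free/reduced",
                  "free/reduced", "standard", "free/reduced"],
    "TransportMeans": ["private", "private", "school_bus",
                       "school_bus", "school_bus", None],
})

# Row-normalised shares: each LunchType row sums to 1 across transport modes.
# Rows with a missing TransportMeans are left out of the table.
shares = pd.crosstab(toy["LunchType"], toy["TransportMeans"], normalize="index")
print(shares.round(2))
```

If the transport shares came out nearly identical for both lunch types on the real data, the two features would not be closely related, and the case for dropping would rest on the missing-value rate alone.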

In [None]:
df.drop(columns=['TransportMeans'], inplace=True)

### Imputing Categorical Variables

Categorical variables with missing values are :
* 'EthnicGroup'
* 'TestPrep'
* 'ParentEduc'
* 'ParentMaritalStatus'
* 'IsFirstChild'
* 'PracticeSport'
* 'WklyStudyHours'
* ('LunchType' and 'Gender' have no missing values and do not need to be adjusted)

For these categorical variables I will insert the most common value in the dataset (the mode), as that is probabilistically the closest to the actual value.

In [None]:
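for column in ['EthnicGroup', 'TestPrep', 'ParentEduc', 'ParentMaritalStatus', 'IsFirstChild', 'PracticeSport', 'WklyStudyHours']:
    mode_value = df[column].mode()[0]
    # Assign the result back; fillna(inplace=True) on a column selection triggers warnings in newer pandas versions
    df[column] = df[column].fillna(mode_value)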

In [None]:
# Function to perform weighted random imputation for a single column
def weighted_random_imputation(series):
    # Drop missing values and get the distribution of the remaining values
    counts = series.value_counts(normalize=True)
    
    # Generate random values for missing entries, based on the distribution of existing values
    random_values = np.random.choice(counts.index, size=series.isnull().sum(), p=counts.values)
    
    # Create a Series with the random values and the same index as the missing entries
    random_series = pd.Series(random_values, index=series[series.isnull()].index)
    
    # Fill the missing values with the random values
    return series.fillna(random_series)

# List of categorical features to impute
# ('TransportMeans' is excluded because it was dropped above; including it raises a KeyError)
categorical_features = ['EthnicGroup', 'TestPrep', 'ParentEduc', 'ParentMaritalStatus', 'IsFirstChild', 'PracticeSport', 'WklyStudyHours']

# Apply the weighted random imputation to each categorical feature
for feature in categorical_features:
    df[feature] = weighted_random_imputation(df[feature])

# Columns that were already filled by the mode imputation have no gaps left and pass through unchanged
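
The two imputation cells take different approaches, and the difference shows up in the resulting category shares. The sketch below compares them on a synthetic column; the 60/40 split, the 25% missing rate, and the seed are arbitrary choices for illustration, not properties of the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability

# Synthetic column: 60% 'none', 40% 'completed', then 250 entries blanked at random
values = pd.Series(["none"] * 600 + ["completed"] * 400, dtype=object)
missing_idx = rng.choice(values.index.to_numpy(), size=250, replace=False)
values.loc[missing_idx] = np.nan

# Mode imputation: every gap receives the most common label
mode_filled = values.fillna(values.mode()[0])

# Weighted random imputation: gaps are drawn according to the observed shares
shares = values.value_counts(normalize=True)
n_missing = int(values.isna().sum())
draws = rng.choice(shares.index.to_numpy(), size=n_missing, p=shares.to_numpy())
random_filled = values.fillna(pd.Series(draws, index=values.index[values.isna()]))

print(values.value_counts(normalize=True).round(3))        # observed shares
print(mode_filled.value_counts(normalize=True).round(3))   # majority share inflated
print(random_filled.value_counts(normalize=True).round(3)) # close to observed shares
```

Mode imputation pushes every gap into the majority class, which increases any existing imbalance, while the weighted random draw roughly preserves the observed distribution at the cost of adding noise to individual rows.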

### Imputing Numerical Variables

The only numerical variable in the feature set (not counting the scores, which have no missing values) is NrSiblings.
Once the missing values have been imputed, I can also change the data type from float to the more sensible integer.
The imputed values will be based on the **median** instead of the **mean**, as the median is less sensitive to outliers and the variable does contain some 'extreme' values of 6 or more siblings.

In [None]:
median_value = df['NrSiblings'].median()
df['NrSiblings'] = df['NrSiblings'].fillna(median_value).astype(int)
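
A small worked example shows why the median suits this column better than the mean. The sibling counts below are made up for illustration:

```python
import pandas as pd

# Illustrative sibling counts with one extreme value and two gaps (not real data)
siblings = pd.Series([0, 1, 1, 2, 2, 3, 7, None, None], dtype="float64")

mean_value = siblings.mean()      # pulled upwards by the 7 and not a whole number
median_value = siblings.median()  # middle observed value, unaffected by the 7

# Fill the gaps with the median, then cast to whole numbers
filled = siblings.fillna(median_value).astype(int)
print(mean_value, median_value)
print(filled.tolist())
```

Here the mean is 16/7 (about 2.29) while the median is 2, a value a student can actually have. Note that with an even number of observations the median can land on a half value, in which case `astype(int)` would truncate it.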

### Quick check for remaining missing values and check datatype change and possible duplicate values:

In [None]:
df.isnull().sum()

In [None]:
df['NrSiblings'].dtype
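
The cells above cover missing values and the dtype, but a duplicate check is not shown. A minimal sketch of one, using a hypothetical toy frame:

```python
import pandas as pd

# Toy frame (hypothetical values) in which the third row repeats the first
toy = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "LunchType": ["standard", "free/reduced", "standard"],
    "NrSiblings": [3, 1, 3],
})

# duplicated() flags rows that repeat an earlier row across all columns
duplicate_count = int(toy.duplicated().sum())
print(f"Fully duplicated rows: {duplicate_count}")
```

Because the feature set consists of a few low-cardinality categories and has no student identifier, identical rows may well describe different students, so a non-zero count on the real data is not necessarily an error.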

### Convert NrSiblings variable to categorical by changing dtype to string

In [None]:
df['NrSiblings'] = df['NrSiblings'].astype(str)
df['NrSiblings'].dtype

### The data is now clean and logical values have been added. In the next notebook I will conduct an EDA to go into detail about the feature set and data distribution / balance.

## Save file to repository for follow on notebooks

In [None]:
file_path = 'outputs/dataset/Expanded_data_with_more_features_clean2.csv'

# Remove previous file if it exists
if os.path.exists(file_path):
    os.remove(file_path)

# Create the directory if it doesn't exist
os.makedirs(name='outputs/dataset', exist_ok=True)

# Save cleaned DataFrame to the file path
df.to_csv(file_path, index=False)
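
One caveat for the follow-on notebooks: a CSV file stores no dtype information, so the string conversion of NrSiblings does not survive the save by itself. The sketch below illustrates this with an in-memory buffer and a toy frame (both are stand-ins for the real file and data):

```python
import io
import pandas as pd

# Toy frame with NrSiblings stored as strings, as in the cleaned dataset
toy = pd.DataFrame({"NrSiblings": ["3", "0", "4"],
                    "Gender": ["female", "male", "female"]})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: pandas infers an integer dtype again
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Explicit dtype: the column is read back as strings
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"NrSiblings": str})

print(inferred["NrSiblings"].dtype, as_text["NrSiblings"].dtype)
```

Passing `dtype={'NrSiblings': str}` when loading the cleaned file is one way to keep the variable categorical downstream.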