# **STUDENT AI** - EDA - SmartCorrelation Assessment

## Objectives

Use the SmartCorrelatedSelection function to assess whether any features need to be dropped because they correlate highly with each other, which could contribute to overfitting the model. The result decides whether a smart correlation step needs to be performed in the pipeline.

## Inputs

Continues to assess the dataset loaded in the previous notebook.

## Outputs

None


---

# Import required libraries

In [1]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.model_selection import train_test_split

print('All Libraries Loaded')

All Libraries Loaded


# Change working directory

### Set the working directory to notebook parent folder
If the output does not match, click **'clear all outputs'** and then **'restart'** the notebook. 
Then run cells from top to bottom.

In [2]:
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('If correct, Active Directory should read: /workspace/student-AI')
print(f"Active Directory: {current_dir}")

If correct, Active Directory should read: /workspace/student-AI
Active Directory: /workspace/student-AI


### Load cleaned dataset

In [3]:
df = pd.read_csv("outputs/dataset/Expanded_data_with_more_features_clean.csv")
df.head()

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,female,C,bachelor,standard,not completed,married,regularly,yes,3,schoolbus,Less than 5 hours,71,71,74
1,female,C,college,standard,not completed,married,sometimes,yes,0,schoolbus,Between 5-10 hours,69,90,88
2,female,B,masters,standard,not completed,single,sometimes,yes,4,schoolbus,Less than 5 hours,87,93,91
3,male,A,associates,free,not completed,married,never,no,1,schoolbus,Between 5-10 hours,45,56,42
4,male,C,college,standard,not completed,married,sometimes,yes,0,schoolbus,Between 5-10 hours,76,78,75


#### Ensure NrSiblings is categorical

In [4]:
df['NrSiblings'] = df['NrSiblings'].astype('object')

## Create split dataset for Math smart correlation test

In [5]:
df_math = df.drop(['ReadingScore', 'WritingScore'], axis=1)
df_math.head()

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore
0,female,C,bachelor,standard,not completed,married,regularly,yes,3,schoolbus,Less than 5 hours,71
1,female,C,college,standard,not completed,married,sometimes,yes,0,schoolbus,Between 5-10 hours,69
2,female,B,masters,standard,not completed,single,sometimes,yes,4,schoolbus,Less than 5 hours,87
3,male,A,associates,free,not completed,married,never,no,1,schoolbus,Between 5-10 hours,45
4,male,C,college,standard,not completed,married,sometimes,yes,0,schoolbus,Between 5-10 hours,76


### Split dataset into train and test sets
Split the dataset into 80% training data and 20% test data.

In [8]:
math_train_features, math_test_features, math_train_scores, math_test_scores = train_test_split(
    df_math.drop(['MathScore'], axis=1),
    df_math['MathScore'],
    test_size = 0.2,
    random_state = 101
)

print("New Data Set Shapes")
print(f"math_train_features: {math_train_features.shape}, with {math_train_scores.shape} math scores")
print(f"math_test_features: {math_test_features.shape} with {math_test_scores.shape} math scores")


New Data Set Shapes
math_train_features: (24512, 11), with (24512,) math scores
math_test_features: (6129, 11) with (6129,) math scores
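
The printed shapes can be checked by hand. The sketch below assumes that scikit-learn rounds the test count up and assigns the remaining rows to training when only `test_size` is given; the row total is taken from the shapes printed above.

```python
import math

# Total rows, reconstructed from the two shapes printed above
n_rows = 24512 + 6129
test_size = 0.2

# Assumption: the test count is rounded up, the remainder goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_rows, n_train, n_test)  # 30641 24512 6129
```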


### Encode categorical variables

In [9]:
variables_to_encode = [
    'Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep',
    'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'NrSiblings',
    'TransportMeans', 'WklyStudyHours',
]
# 'arbitrary' assigns integer codes without using the target
encoder = OrdinalEncoder(encoding_method='arbitrary', variables=variables_to_encode)
df_math_score_encoded = encoder.fit_transform(df_math)
df_math_score_encoded.head()


Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore
0,0,0,0,0,0,0,0,0,0,0,0,71
1,0,0,1,0,0,0,1,0,1,0,1,69
2,0,1,2,0,0,1,1,0,2,0,0,87
3,1,2,3,1,0,0,2,1,3,0,1,45
4,1,0,1,0,0,0,1,0,1,0,1,76
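
To make the encoded values easier to read, here is a minimal sketch of how an arbitrary ordinal encoding can map categories to integers in order of first appearance. It uses `pd.factorize` as a stand-in, not the feature_engine implementation, and the toy column simply repeats the first five `ParentEduc` values from `df.head()`.

```python
import pandas as pd

# Toy column: the first five ParentEduc values shown in df.head()
parent_educ = pd.Series(["bachelor", "college", "masters", "associates", "college"])

# Stand-in for the arbitrary encoding: codes follow order of first appearance
codes, categories = pd.factorize(parent_educ)
mapping = {category: code for code, category in enumerate(categories)}

print(mapping)         # {'bachelor': 0, 'college': 1, 'masters': 2, 'associates': 3}
print(codes.tolist())  # [0, 1, 2, 3, 1]
```

The integers carry no meaning beyond identifying a category. For nominal features such as `EthnicGroup`, a rank-based correlation computed on these codes therefore depends on the order in which the categories happened to be encoded.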


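### Analyse with SmartCorrelatedSelection
The threshold of 0.6 checks for highly correlated variables. The expectation is that no pair of variables correlates that strongly, so the function should return an empty list. Because `variables=None`, every numerical column in the encoded frame is examined, including `MathScore`.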

In [24]:
correlated_selection_maths = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

correlated_selection_maths.fit_transform(df_math_score_encoded)
correlated_selection_maths.correlated_feature_sets_

[]

#### Double check that no features are recommended to drop

In [25]:
correlated_selection_maths.features_to_drop_

[]
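
For intuition, the kind of check performed above can be approximated with pandas alone. This is a rough sketch on synthetic data, not the student dataset and not the feature_engine implementation; the helper name `correlated_pairs` and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float) -> list:
    """Return (col_a, col_b, rho) for pairs whose absolute Spearman rho exceeds threshold."""
    corr = frame.corr(method="spearman").abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair is visited once
            rho = corr.loc[a, b]
            if rho > threshold:
                pairs.append((a, b, round(float(rho), 3)))
    return pairs


# Synthetic demo data with one strongly related pair
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "x": x,
    "x_copy": x + rng.normal(scale=0.1, size=500),  # nearly identical to x
    "noise": rng.normal(size=500),                  # unrelated to x
})

print(correlated_pairs(demo, threshold=0.6))
```

An empty list from a helper like this corresponds to the empty `correlated_feature_sets_` seen above: no pair crosses the threshold.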

## Repeat check for Reading and Writing
Since all 3 scores for a student are typically very close together, I am not expecting any differences.

In [26]:
df_reading = df.drop(['MathScore', 'WritingScore'], axis=1)
df_writing = df.drop(['ReadingScore', 'MathScore'], axis=1)

# reading dataset
reading_train_features, reading_test_features, reading_train_scores, reading_test_scores = train_test_split(
    df_reading.drop(['ReadingScore'], axis=1),
    df_reading['ReadingScore'],
    test_size = 0.2,
    random_state = 101
)

# writing dataset
writing_train_features, writing_test_features, writing_train_scores, writing_test_scores = train_test_split(
    df_writing.drop(['WritingScore'], axis=1),
    df_writing['WritingScore'],
    test_size = 0.2,
    random_state = 101
)

print("Reading Data Set Shapes")
print(f"reading_train_features: {reading_train_features.shape}, with {reading_train_scores.shape} reading scores")
print(f"reading_test_features: {reading_test_features.shape} with {reading_test_scores.shape} reading scores")
print('')
print("Writing Data Set Shapes")
print(f"writing_train_features: {writing_train_features.shape}, with {writing_train_scores.shape} writing scores")
print(f"writing_test_features: {writing_test_features.shape} with {writing_test_scores.shape} writing scores")

df_reading_score_encoded = encoder.fit_transform(df_reading)
df_writing_score_encoded = encoder.fit_transform(df_writing)

print('')
print("Datasets successfully encoded!")


Reading Data Set Shapes
reading_train_features: (24512, 11), with (24512,) reading scores
reading_test_features: (6129, 11) with (6129,) reading scores

Writing Data Set Shapes
writing_train_features: (24512, 11), with (24512,) writing scores
writing_test_features: (6129, 11) with (6129,) writing scores

Datasets successfully encoded!


### Run Smart Correlation Test on Reading and Writing datasets

In [31]:
threshold = 0.6

correlated_selection_reading = SmartCorrelatedSelection(variables=None, method="spearman", threshold=threshold, selection_method="variance")
correlated_selection_reading.fit_transform(df_reading_score_encoded)
print('Reading Dataset')
print(f"features correlating above {threshold} threshold: {correlated_selection_reading.correlated_feature_sets_}")
print(f"features that should be dropped:  {correlated_selection_reading.features_to_drop_}")
print('')
correlated_selection_writing = SmartCorrelatedSelection(variables=None, method="spearman", threshold=threshold, selection_method="variance")
correlated_selection_writing.fit_transform(df_writing_score_encoded)
print('Writing Dataset')
print(f"features correlating above {threshold} threshold: {correlated_selection_writing.correlated_feature_sets_}")
print(f"features that should be dropped:  {correlated_selection_writing.features_to_drop_}")

Reading Dataset
features correlating above 0.6 threshold: []
features that should be dropped:  []

Writing Dataset
features correlating above 0.6 threshold: []
features that should be dropped:  []


### Conclusion:
No features are correlated highly enough to risk overfitting the model, so this step can be omitted from the pipeline. As a check, I lowered the threshold to 0.2 (a correlation level confirmed in the previous step), and the function correctly identified the features correlating at that level.
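
The threshold check described in the conclusion can be summarised by looking at the strongest pairwise correlation: any threshold above it returns an empty list, and any threshold below it flags at least one pair. The sketch below illustrates this on synthetic data with pandas only. The helper `max_offdiag_spearman` and the demo columns are hypothetical, and the numbers are not results from the student dataset.

```python
import numpy as np
import pandas as pd


def max_offdiag_spearman(frame: pd.DataFrame) -> float:
    """Largest absolute Spearman correlation between two different columns."""
    corr = frame.corr(method="spearman").abs().to_numpy()
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)  # ignore self-correlations
    return float(corr[off_diagonal].max())


# Synthetic demo data with one moderately related pair
rng = np.random.default_rng(1)
a = rng.normal(size=2000)
demo = pd.DataFrame({
    "a": a,
    "b": 0.4 * a + rng.normal(size=2000),  # moderately related to a
    "c": rng.normal(size=2000),            # unrelated to a
})

strongest = max_offdiag_spearman(demo)
for threshold in (0.2, 0.6):
    verdict = "flags at least one pair" if strongest > threshold else "flags nothing"
    print(f"threshold {threshold}: {verdict}")
```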