# **STUDENT AI** - CORRELATION ASSESSMENT

## Objectives

Complete a correlation study to assess if any features of the dataset strongly influence the numerical target (Math, Reading, Writing) scores.

## Inputs

Cleaned data set from previous notebook with Mean scores added

## Outputs

Determine which, if any, features to use for model training and prediction


---

# Import required libraries

In [1]:
import os
import pandas as pd
from feature_engine.encoding import OneHotEncoder
import ppscore as pps

print('All Libraries Loaded')

All Libraries Loaded


# Change working directory

### Set the working directory to notebook parent folder
If the output does not match, click **'clear all outputs'** and then **'restart'** the notebook. 
Then run cells from top to bottom.

In [2]:
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('If correct, Active Directory should read: /workspace/student-AI')
print(f"Active Directory: {current_dir}")

If correct, Active Directory should read: /workspace/student-AI
Active Directory: /workspace/student-AI
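
The cell above moves one level up every time it runs, which is why a restart is needed if it is executed twice. A minimal sketch of an idempotent alternative follows. `find_project_root` is a hypothetical helper, and the folder names `student-AI` and `jupyter_notebooks` are assumptions based on the expected output above, not confirmed project paths.

```python
from pathlib import Path


def find_project_root(start: Path, project_name: str = "student-AI") -> Path:
    """Return `start` or its nearest ancestor whose folder name is `project_name`."""
    for candidate in (start, *start.parents):
        if candidate.name == project_name:
            return candidate
    raise FileNotFoundError(f"No folder named {project_name!r} at or above {start}")


# Pure path logic, so it can be checked without touching the file system.
# 'jupyter_notebooks' is an assumed sub-folder name, used only for illustration.
example = Path("/workspace/student-AI/jupyter_notebooks")
print(find_project_root(example))

# In the notebook this could replace the chdir call (left commented out so the
# sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the helper returns the same folder whether it starts in the project root or in a sub-folder, re-running the cell would not change the result.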


### Load cleaned dataset and add mean score

In [3]:
df = pd.read_csv("outputs/dataset/Expanded_data_with_more_features_clean.csv")
df['MeanScore'] = df[['MathScore', 'ReadingScore', 'WritingScore']].mean(axis=1).round().astype(int)
df.head()

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,WklyStudyHours,MathScore,ReadingScore,WritingScore,MeanScore
0,female,group C,bachelor's degree,standard,none,married,regularly,yes,3,< 5,71,71,74,72
1,female,group C,some college,standard,none,married,sometimes,yes,0,5 - 10,69,90,88,82
2,female,group B,master's degree,standard,none,single,sometimes,yes,4,< 5,87,93,91,90
3,male,group A,associate's degree,free/reduced,none,married,never,no,1,5 - 10,45,56,42,48
4,male,group C,some college,standard,none,married,sometimes,yes,0,5 - 10,76,78,75,76


# Assess Feature Correlation

I will use the Pearson and Spearman tests, as well as the Predictive Power Score (PPS) library, to assess the dataset, see where correlations can be found, and check whether they match my hypothesis from the previous notebook. 
Pearson and Spearman require numerical data only, so every categorical variable has to be encoded as numbers.

One-hot encoding creates a binary (0/1) column for each category value, e.g. `Gender_female` and `Gender_male`.

This can be achieved automatically using the `OneHotEncoder` class from `feature_engine`.

In [6]:
one_hot_encoder = OneHotEncoder(variables=df.columns[df.dtypes == 'object'].to_list(), drop_last=False)
encoded_data = one_hot_encoder.fit_transform(df)
encoded_data.head()

Unnamed: 0,NrSiblings,MathScore,ReadingScore,WritingScore,MeanScore,Gender_female,Gender_male,EthnicGroup_group C,EthnicGroup_group B,EthnicGroup_group A,...,ParentMaritalStatus_widowed,ParentMaritalStatus_divorced,PracticeSport_regularly,PracticeSport_sometimes,PracticeSport_never,IsFirstChild_yes,IsFirstChild_no,WklyStudyHours_< 5,WklyStudyHours_5 - 10,WklyStudyHours_> 10
0,3,71,71,74,72,1,0,1,0,0,...,0,0,1,0,0,1,0,1,0,0
1,0,69,90,88,82,1,0,1,0,0,...,0,0,0,1,0,1,0,0,1,0
2,4,87,93,91,90,1,0,0,1,0,...,0,0,0,1,0,1,0,1,0,0
3,1,45,56,42,48,0,1,0,0,1,...,0,0,0,0,1,0,1,0,1,0
4,0,76,78,75,76,0,1,1,0,0,...,0,0,0,1,0,1,0,0,1,0


Each value of every categorical variable has been converted to its own column using the `fit_transform` method, creating a simple binary indicator for each value. There are now 34 columns. The Pearson method can now be applied to calculate the level of correlation with the selected target variable. I will use the mean score as the target, as it best represents the overall performance of a student.
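
As a quick sanity check on the column count, the number of encoded columns should equal the number of numerical columns plus the total number of distinct category values. The sketch below illustrates this on a small made-up frame. It uses `pd.get_dummies` as a stand-in for the `feature_engine` encoder so that it only depends on pandas, and the toy values are assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Gender": ["female", "male", "female", "male"],
    "LunchType": ["standard", "free/reduced", "standard", "standard"],
    "WklyStudyHours": ["< 5", "5 - 10", "> 10", "< 5"],
    "MeanScore": [72, 48, 90, 76],
})

# Split the columns into numerical and categorical ones.
numerical = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
categorical = [c for c in toy.columns if c not in numerical]

# One column per numerical variable plus one per distinct category value.
expected_columns = len(numerical) + int(toy[categorical].nunique().sum())

# Stand-in for OneHotEncoder(drop_last=False): one 0/1 column per category value.
toy_encoded = pd.get_dummies(toy, columns=categorical, dtype=int)
print(expected_columns, toy_encoded.shape[1])
```

The same comparison on `df` and `encoded_data` would confirm the column count reported above.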

## Calculate Pearson Correlation Table

In [18]:
correlation_pearson = encoded_data.corr(method='pearson')['MeanScore'].sort_values(key=abs, ascending=False).head(10)
correlation_pearson

MeanScore                      1.000000
ReadingScore                   0.969200
WritingScore                   0.966114
MathScore                      0.919876
LunchType_free/reduced        -0.315839
LunchType_standard             0.315839
TestPrep_none                 -0.218126
TestPrep_completed             0.218126
EthnicGroup_group E            0.160630
ParentEduc_some high school   -0.136453
Name: MeanScore, dtype: float64

The top 10 correlated features are listed. The Pearson test also includes the other numerical variables: each individual score correlates strongly with the mean score, and the mean score correlates perfectly with itself. 
**Therefore I only consider the 5th value onwards when investigating correlation...**

In [19]:
correlation_pearson = encoded_data.corr(method='pearson')['MeanScore'].sort_values(key=abs, ascending=False)[4:].head(10)
correlation_pearson

LunchType_free/reduced         -0.315839
LunchType_standard              0.315839
TestPrep_none                  -0.218126
TestPrep_completed              0.218126
EthnicGroup_group E             0.160630
ParentEduc_some high school    -0.136453
Gender_female                   0.126188
Gender_male                    -0.126188
ParentEduc_master's degree      0.123639
ParentEduc_bachelor's degree    0.101549
Name: MeanScore, dtype: float64
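
The mirrored pairs in the table (for example `LunchType_standard` at 0.315839 and `LunchType_free/reduced` at -0.315839) follow from the encoding. For a two-valued feature the two one-hot columns always sum to 1, so their correlations with any target are equal in size and opposite in sign. A small sketch with made-up values, assuming only pandas:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "LunchType_standard": [1, 0, 1, 1, 0, 1],
    "MeanScore": [72, 48, 90, 76, 55, 81],
})
# The second one-hot column of a two-valued feature is the complement of the first.
toy["LunchType_free/reduced"] = 1 - toy["LunchType_standard"]

corr = toy.corr(method="pearson")["MeanScore"]
print(corr)
```

Each such pair therefore carries a single piece of information, which is worth keeping in mind when counting how many distinct features appear in a top-10 list.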

## Calculate Spearman correlation table
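I also calculate the Spearman correlation and list the top features without the numerical score variables. Note that this cell slices with `[5:]` rather than `[4:]`, so `LunchType_free/reduced` (the mirror image of `LunchType_standard`, at -0.308989) is not shown in the output below.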

In [20]:
correlation_spearman = encoded_data.corr(method='spearman')['MeanScore'].sort_values(key=abs, ascending=False)[5:].head(10)
correlation_spearman

LunchType_standard              0.308989
TestPrep_completed              0.213853
TestPrep_none                  -0.213853
EthnicGroup_group E             0.158113
ParentEduc_some high school    -0.132208
Gender_male                    -0.124251
Gender_female                   0.124251
ParentEduc_master's degree      0.121891
ParentEduc_bachelor's degree    0.098894
EthnicGroup_group B            -0.082000
Name: MeanScore, dtype: float64
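
Spearman correlation is the Pearson correlation computed on the ranks of the values, which is why the two tables are so similar when most features are 0/1 indicators. The sketch below checks this equivalence on made-up values, assuming only pandas.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "LunchType_standard": [1, 0, 1, 1, 0, 1, 0, 1],
    "MeanScore": [72, 48, 90, 76, 55, 81, 60, 99],
})

# Spearman as computed by pandas.
spearman = toy.corr(method="spearman").loc["LunchType_standard", "MeanScore"]

# Pearson applied to the ranks (ties receive their average rank).
pearson_on_ranks = toy.rank().corr(method="pearson").loc["LunchType_standard", "MeanScore"]

print(spearman, pearson_on_ranks)
```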

### Individual Score analysis.

The results show that Pearson finds slightly higher correlations than Spearman with the mean target variable. Applying the same method to the individual scores might reveal finer details in the overall correlation. For instance, it is possible that gender will have a higher correlation specifically with the maths scores...

In [23]:
correlation_pearson_maths = encoded_data.corr(method='pearson')['MathScore'].sort_values(key=abs, ascending=False)[4:].head(10)
correlation_pearson_maths

LunchType_standard             0.367942
LunchType_free/reduced        -0.367942
EthnicGroup_group E            0.221775
Gender_male                    0.162391
Gender_female                 -0.162391
TestPrep_none                 -0.134949
TestPrep_completed             0.134949
ParentEduc_some high school   -0.121240
ParentEduc_master's degree     0.100001
EthnicGroup_group B           -0.096779
Name: MathScore, dtype: float64

In [24]:
correlation_pearson_reading = encoded_data.corr(method='pearson')['ReadingScore'].sort_values(key=abs, ascending=False)[4:].head(10)
correlation_pearson_reading

LunchType_standard              0.258152
LunchType_free/reduced         -0.258152
Gender_male                    -0.237097
Gender_female                   0.237097
TestPrep_none                  -0.204724
TestPrep_completed              0.204724
EthnicGroup_group E             0.128716
ParentEduc_some high school    -0.122773
ParentEduc_master's degree      0.116293
ParentEduc_bachelor's degree    0.087993
Name: ReadingScore, dtype: float64

In [25]:
correlation_pearson_writing = encoded_data.corr(method='pearson')['WritingScore'].sort_values(key=abs, ascending=False)[4:].head(10)
correlation_pearson_writing

(Output omitted: the output recorded for this cell was computed on `MathScore` and duplicated the In [23] table above. Re-run the cell to obtain the WritingScore correlations.)
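
Because the individual-score cells differ only in the target column name, they are easy to get out of sync when copied. A minimal sketch of one way to avoid this is to loop over the targets with a small helper. `top_feature_correlations` and `SCORE_COLUMNS` are hypothetical names introduced here, and the toy rows are for illustration only. The helper also drops the score columns by name rather than by position, so it does not depend on a slice such as `[4:]`.

```python
import pandas as pd

SCORE_COLUMNS = ["MathScore", "ReadingScore", "WritingScore", "MeanScore"]


def top_feature_correlations(encoded: pd.DataFrame, target: str, n: int = 10,
                             method: str = "pearson") -> pd.Series:
    """Correlate every non-score column with `target` and return the n strongest."""
    corr = encoded.corr(method=method)[target]
    # Drop the score columns by name instead of relying on their sort position.
    corr = corr.drop(labels=[c for c in SCORE_COLUMNS if c in corr.index])
    return corr.sort_values(key=abs, ascending=False).head(n)


# Illustrative rows only; in the notebook `encoded_data` would be passed instead.
toy = pd.DataFrame({
    "MathScore": [71, 69, 87, 45, 76, 60],
    "ReadingScore": [71, 90, 93, 56, 78, 65],
    "WritingScore": [74, 88, 91, 42, 75, 66],
    "Gender_female": [1, 1, 1, 0, 0, 0],
    "LunchType_standard": [1, 1, 1, 0, 1, 0],
})
toy["MeanScore"] = toy[["MathScore", "ReadingScore", "WritingScore"]].mean(axis=1).round().astype(int)

# One table per target, keyed by the target name.
results = {target: top_feature_correlations(toy, target) for target in SCORE_COLUMNS}
for target, table in results.items():
    print(table, end="\n\n")
```

Each returned Series keeps the target as its `name`, so a table computed on the wrong column is easy to spot.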

## Correlation Conclusions

Since a perfect correlation in both tests is represented by a value of ±1.0, we can see that the categorical features only have a relatively weak correlation with the mean score, with a best absolute value of 0.316 for the LunchType feature in the Pearson test.

However, the test does align with my previous hypothesis that LunchType, EthnicGroup, Parental Education, TestPrep and Gender are most likely to influence the numerical target variables.

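The Maths and Reading score evaluations both rank LunchType as the most significant indicator. Interestingly, gender correlates less strongly with the Maths score (0.162 in absolute value) than with the Reading score (0.237), against my initial hypothesis. The direction also reverses: `Gender_male` is positively correlated with Maths but negatively correlated with Reading.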

## PPS Correlation

To prevent the numerical score variables from being assessed against each other, I will drop the individual scores so that only one target variable (MeanScore) remains before calculating the PPS matrix.

In [26]:
df_mean = df.drop(['MathScore', 'ReadingScore', 'WritingScore'], axis=1)

pps_correlation = pps.matrix(df=df_mean)
pps_correlation

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,Gender,Gender,1.0,predict_itself,True,,0.0000,1.000000,
1,Gender,EthnicGroup,0.0,classification,True,weighted F1,0.2468,0.195202,DecisionTreeClassifier()
2,Gender,ParentEduc,0.0,classification,True,weighted F1,0.2008,0.128795,DecisionTreeClassifier()
3,Gender,LunchType,0.0,classification,True,weighted F1,0.5392,0.513640,DecisionTreeClassifier()
4,Gender,TestPrep,0.0,classification,True,weighted F1,0.5564,0.543773,DecisionTreeClassifier()
...,...,...,...,...,...,...,...,...,...
116,MeanScore,PracticeSport,0.0,classification,True,weighted F1,0.3992,0.393912,DecisionTreeClassifier()
117,MeanScore,IsFirstChild,0.0,classification,True,weighted F1,0.5412,0.521804,DecisionTreeClassifier()
118,MeanScore,NrSiblings,0.0,regression,True,mean absolute error,1.0820,1.136556,DecisionTreeRegressor()
119,MeanScore,WklyStudyHours,0.0,classification,True,weighted F1,0.4142,0.403903,DecisionTreeClassifier()


### Filter pps results for mean score correlation...

In [28]:
df_mean_pps_filtered = pps_correlation.filter(['x', 'y', 'ppscore']).query('x == "MeanScore"')
df_mean_pps_filtered

Unnamed: 0,x,y,ppscore
110,MeanScore,Gender,0.047392
111,MeanScore,EthnicGroup,0.011235
112,MeanScore,ParentEduc,0.0
113,MeanScore,LunchType,0.220754
114,MeanScore,TestPrep,0.13331
115,MeanScore,ParentMaritalStatus,0.002409
116,MeanScore,PracticeSport,0.0
117,MeanScore,IsFirstChild,0.0
118,MeanScore,NrSiblings,0.0
119,MeanScore,WklyStudyHours,0.0
