# **Correlation Analysis Notebook**

## Objectives
- Determine how the variables of the dataset correlate

## Inputs
- student-exam-results.csv file

## Outputs
- Revealed correlations within the dataset

## Additional comments

# Change working directory

Since this notebook exists in the jupyter_notebooks directory, we need to change the current working directory from jupyter_notebooks to its parent, the project root, so that any directories created in later code cells are added in the correct place.

We access the current directory with the `os` module's `getcwd()` function.

In [1]:
import os
current_directory = os.getcwd()
current_directory


'/workspace/Exam-Scores-Analysis/jupyter_notebooks'

We now want to set the working directory to the parent of the current working directory (jupyter_notebooks).

- The `os.path.dirname()` function gets the parent directory
- The `os.chdir()` function sets the new current directory
- We do this to access all of the project's files and directories, rather than only those in the jupyter_notebooks directory

In [2]:
os.chdir(os.path.dirname(current_directory))
print("You set a new current directory")

You set a new current directory


To be certain, we use a code cell to confirm that the current working directory has been set properly.

In [3]:
current_directory = os.getcwd()
current_directory

'/workspace/Exam-Scores-Analysis'
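
Re-running the `os.chdir()` cell above would move one level further up each time it is executed. The following is a minimal sketch of a guard that makes the step repeatable, assuming the notebook folder is named `jupyter_notebooks` as it is here; the helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebook folder."""
    current = os.getcwd()
    # Only step up if the current folder is the notebook folder itself
    if os.path.basename(current) == notebook_dir_name:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling the helper a second time leaves the working directory unchanged, because the current folder is then no longer named `jupyter_notebooks`.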

## Correlation Study

Now that we have become more familiar with the dataset, we can perform a correlation study. As noted in the Churnometer walkthrough project, we could use the Predictive Power Score library. However, the feature variables are categorical, so we first use One Hot Encoding to turn them into numeric columns for which correlation scores can be calculated.

However, our task is larger in scope, since we have four score variables (math_score, reading_score, writing_score and average_score) rather than a single target. We need to see how each feature variable correlates with each score variable. First, as this is a new notebook, we need to re-import the dataset.

In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/student-exam-results.csv')
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score
0,male,group A,high school,standard,completed,67,67,63,65
1,female,group D,some high school,free/reduced,none,40,59,55,51
2,male,group E,some college,free/reduced,none,59,60,50,56
3,male,group B,high school,standard,none,77,78,68,74
4,male,group E,associate's degree,standard,completed,78,73,68,73


Now we can apply the One Hot Encoder

In [5]:
from feature_engine.encoding import OneHotEncoder
one_hot_encoder = OneHotEncoder(variables = df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_one_hot_encoder = one_hot_encoder.fit_transform(df)
print(df_one_hot_encoder.shape)
df_one_hot_encoder.head()

(1000, 21)


Unnamed: 0,math_score,reading_score,writing_score,average_score,gender_male,gender_female,ethnicity_group A,ethnicity_group D,ethnicity_group E,ethnicity_group B,...,parental_education_high school,parental_education_some high school,parental_education_some college,parental_education_associate's degree,parental_education_bachelor's degree,parental_education_master's degree,lunch_program_standard,lunch_program_free/reduced,test_preparation_course_completed,test_preparation_course_none
0,67,67,63,65,1,0,1,0,0,0,...,1,0,0,0,0,0,1,0,1,0
1,40,59,55,51,0,1,0,1,0,0,...,0,1,0,0,0,0,0,1,0,1
2,59,60,50,56,1,0,0,0,1,0,...,0,0,1,0,0,0,0,1,0,1
3,77,78,68,74,1,0,0,0,0,1,...,1,0,0,0,0,0,1,0,0,1
4,78,73,68,73,1,0,0,0,1,0,...,0,0,0,1,0,0,1,0,1,0


We note that the encoded dataset now has 21 columns, and that the four score variables have been preserved.
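
To make the column count concrete: one-hot encoding creates one 0/1 column per category, so the 4 score columns plus 2 + 5 + 6 + 2 + 2 = 17 category columns give the 21 columns reported above. The sketch below illustrates the idea on a toy frame with hypothetical values. It uses `pandas.get_dummies` as a stand-in for feature_engine's `OneHotEncoder`, so the column order may differ from the notebook's output.

```python
import pandas as pd

# Toy data (hypothetical values) mirroring three of the dataset's columns
toy = pd.DataFrame({
    "lunch_program": ["standard", "free/reduced", "standard"],
    "test_preparation_course": ["completed", "none", "none"],
    "math_score": [67, 40, 77],
})

# Pick the non-numeric columns, then create one 0/1 column per category;
# numeric columns pass through unchanged
categorical = toy.select_dtypes(exclude="number").columns.to_list()
toy_encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

print(toy_encoded.shape)
print(toy_encoded.columns.to_list())
```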

### Maths score correlation

#### Pearson

In [6]:
math_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['math_score'].sort_values(key=abs, ascending=False)[1:].head(10)
math_correlation_pearson

average_score                          0.919267
reading_score                          0.819398
writing_score                          0.805944
lunch_program_free/reduced            -0.374431
lunch_program_standard                 0.374431
ethnicity_group E                      0.203515
gender_male                            0.200863
gender_female                         -0.200863
parental_education_some high school   -0.179725
test_preparation_course_none          -0.151704
Name: math_score, dtype: float64

We can see that the other three score variables correlate highly with math_score. The `[1:]` slice excludes the first entry, which is math_score's correlation with itself. Since, in any prediction situation, we will not know the other test scores, we cannot use them as features, so we adjust the correlation code cell to skip the first 5 rows instead. This removes math_score, the three other scores and the first lunch_program row. Because the two lunch_program columns are mirror images of each other (-0.374431 and 0.374431), no information is lost, and the output better reveals how the feature variables correlate with the math_score variable.

In [7]:
math_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['math_score'].sort_values(key=abs, ascending=False)[5:].head(10)
math_correlation_pearson

lunch_program_standard                  0.374431
ethnicity_group E                       0.203515
gender_male                             0.200863
gender_female                          -0.200863
parental_education_some high school    -0.179725
test_preparation_course_none           -0.151704
test_preparation_course_completed       0.151704
ethnicity_group C                      -0.146533
parental_education_bachelor's degree    0.117535
ethnicity_group D                       0.111121
Name: math_score, dtype: float64

This reveals that only the lunch_program feature variable correlates notably with the math_score variable (about 0.37), and then only weakly. All other coefficients are at or below roughly 0.2 in magnitude. Therefore, prediction of the math_score target variable could be problematic.
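
The positional slice `[5:]` relies on the score columns sorting to the top of the list. As a hedged alternative, the score columns can be removed by label before sorting. This is a minimal sketch, not the notebook's method: the function name `feature_correlations` and the toy values are hypothetical.

```python
import pandas as pd


def feature_correlations(df_encoded, target, score_columns, method="pearson", n=10):
    """Correlations of the non-score columns with `target`, largest magnitude first."""
    correlations = df_encoded.corr(method=method)[target]
    # Remove every score column by label, including the target itself
    correlations = correlations.drop(labels=score_columns)
    return correlations.sort_values(key=abs, ascending=False).head(n)


# Toy encoded data (hypothetical values)
toy = pd.DataFrame({
    "math_score": [67, 40, 59, 77, 78, 52],
    "reading_score": [67, 59, 60, 78, 73, 55],
    "lunch_program_standard": [1, 0, 0, 1, 1, 0],
    "lunch_program_free/reduced": [0, 1, 1, 0, 0, 1],
})
result = feature_correlations(toy, "math_score", ["math_score", "reading_score"])
print(result)
```

With this approach both halves of each one-hot pair stay in the output, whatever order tied values happen to sort in.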

#### Spearman

We anticipate a similar outcome to the Pearson correlation, so we will discard the first 5 rows of the Spearman correlation test output.

In [8]:
math_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['math_score'].sort_values(key=abs, ascending=False)[5:].head(10)
math_correlation_spearman

lunch_program_standard                 0.363140
gender_male                            0.193047
gender_female                         -0.193047
ethnicity_group E                      0.192825
parental_education_some high school   -0.176459
test_preparation_course_completed      0.145819
test_preparation_course_none          -0.145819
ethnicity_group C                     -0.141661
ethnicity_group D                      0.118324
ethnicity_group B                     -0.110060
Name: math_score, dtype: float64

We get similar results to the Pearson correlation: the lunch_program variable correlates weakly with the math_score variable.
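
For context on why the two methods can differ, Spearman correlation is Pearson correlation computed on the ranks of the values, so it measures monotonic rather than strictly linear association. The sketch below is a plain-Python illustration with hypothetical data; the notebook itself relies on pandas' `corr()`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


hours = [1, 2, 3, 4, 5]          # hypothetical values
score = [h ** 3 for h in hours]  # monotonic but non-linear

print(round(pearson(hours, score), 3))
print(round(spearman(hours, score), 3))
```

On this toy data the Spearman coefficient is 1 because the relationship is perfectly monotonic, while the Pearson coefficient is below 1 because it is not linear.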

### Reading score correlation

#### Pearson

In [9]:
reading_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['reading_score'].sort_values(key=abs, ascending=False)[5:].head(10)
reading_correlation_pearson

lunch_program_free/reduced             -0.288282
test_preparation_course_none           -0.245144
test_preparation_course_completed       0.245144
gender_female                           0.189389
gender_male                            -0.189389
parental_education_some high school    -0.151530
ethnicity_group D                       0.124821
ethnicity_group C                      -0.122770
parental_education_bachelor's degree    0.120719
parental_education_master's degree      0.119698
Name: reading_score, dtype: float64

We see that both the lunch_program and test_preparation_course feature variables correlate weakly with reading_score.

#### Spearman

In [10]:
reading_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['reading_score'].sort_values(key=abs, ascending=False)[5:].head(10)
reading_correlation_spearman

lunch_program_standard                  0.273246
test_preparation_course_completed       0.244122
test_preparation_course_none           -0.244122
gender_female                           0.181827
gender_male                            -0.181827
parental_education_some high school    -0.149800
ethnicity_group C                      -0.123938
ethnicity_group D                       0.121751
parental_education_bachelor's degree    0.118116
parental_education_master's degree      0.109558
Name: reading_score, dtype: float64

As with Pearson correlation, we see that both the lunch_program and test_preparation_course feature variables correlate weakly with reading_score.

### Writing score correlation

#### Pearson

In [11]:
writing_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['writing_score'].sort_values(key=abs, ascending=False)[5:].head(10)
writing_correlation_pearson

lunch_program_standard                  0.319191
test_preparation_course_completed       0.315601
test_preparation_course_none           -0.315601
gender_female                           0.246089
gender_male                            -0.246089
ethnicity_group D                       0.172772
parental_education_some high school    -0.161996
parental_education_bachelor's degree    0.151974
parental_education_master's degree      0.143354
ethnicity_group C                      -0.142404
Name: writing_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with writing_score

#### Spearman

In [12]:
writing_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['writing_score'].sort_values(key=abs, ascending=False)[5:].head(10)
writing_correlation_spearman

test_preparation_course_none           -0.312719
lunch_program_standard                  0.308331
lunch_program_free/reduced             -0.308331
gender_female                           0.240425
gender_male                            -0.240425
ethnicity_group D                       0.165223
parental_education_some high school    -0.162533
parental_education_bachelor's degree    0.146431
ethnicity_group C                      -0.139161
parental_education_master's degree      0.132708
Name: writing_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with writing_score

## Discussion

We see that the Spearman and Pearson correlation methods produce similar correlation scores, with the Pearson coefficients slightly larger in magnitude for the top-ranked variables. Both methods rank the variables in broadly the same order. The lunch_program variable generally correlates most strongly with the test score variables, followed by the test_preparation_course variable, which ranks second for reading_score and writing_score but lower for math_score.
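
A comparison like this can also be made in a single table rather than by reading two outputs side by side. The following is a minimal sketch under the same assumptions as the cells above (an encoded dataframe and a list of score columns); the function name `compare_methods` and the toy values are hypothetical.

```python
import pandas as pd


def compare_methods(df_encoded, target, score_columns):
    """Pearson and Spearman correlations with `target`, side by side."""
    columns = {}
    for method in ("pearson", "spearman"):
        correlations = df_encoded.corr(method=method)[target]
        # Keep only the feature rows by dropping the score columns by label
        columns[method] = correlations.drop(labels=score_columns)
    table = pd.DataFrame(columns)
    # Gap between the two coefficient magnitudes for each feature
    table["abs_difference"] = (table["pearson"].abs() - table["spearman"].abs()).abs()
    return table.sort_values("pearson", key=abs, ascending=False)


# Toy encoded data (hypothetical values)
toy = pd.DataFrame({
    "math_score": [67, 40, 59, 77, 78, 52],
    "average_score": [65, 51, 56, 74, 73, 50],
    "lunch_program_standard": [1, 0, 0, 1, 1, 0],
    "gender_male": [1, 0, 1, 1, 1, 0],
})
comparison = compare_methods(toy, "math_score", ["math_score", "average_score"])
print(comparison)
```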

## Predictive Power Score

As noted above, we chose to use One Hot Encoding and the correlation method to determine correlation scores. However, at best, we only saw weak correlations. So, we will also analyse the dataset with the Predictive Power Score module, to see if any further insights can be gathered.

In [15]:
import ppscore as pps

Since we have seen in the correlation studies that the score variables tend to correlate well with each other, it is likely that the Predictive Power Score module will also assign strong predictive power to the score variables for the other score variables, which may confuse matters. Hence, we will run 3 predictive power score code cells - one for each of the score variables. To do this, we will generate dataframes that only contain one score variable.

First, let's recap the dataframe:

In [16]:
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score
0,male,group A,high school,standard,completed,67,67,63,65
1,female,group D,some high school,free/reduced,none,40,59,55,51
2,male,group E,some college,free/reduced,none,59,60,50,56
3,male,group B,high school,standard,none,77,78,68,74
4,male,group E,associate's degree,standard,completed,78,73,68,73


The dataframe looks to be intact.

### Maths

Let's drop the reading_score, writing_score and average_score columns

In [17]:
df_maths = df.drop(['reading_score', 'writing_score', 'average_score'], axis=1)
df_maths.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score
0,male,group A,high school,standard,completed,67
1,female,group D,some high school,free/reduced,none,40
2,male,group E,some college,free/reduced,none,59
3,male,group B,high school,standard,none,77
4,male,group E,associate's degree,standard,completed,78


Now let's generate the Predictive Power Score dataframe

In [18]:
df_maths_pps = pps.matrix(df=df_maths)
df_maths_pps

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,gender,gender,1.0,predict_itself,True,,0.0,1.0,
1,gender,ethnicity,0.0,classification,True,weighted F1,0.254,0.157718,DecisionTreeClassifier()
2,gender,parental_education,0.0,classification,True,weighted F1,0.188,0.122846,DecisionTreeClassifier()
3,gender,lunch_program,0.0,classification,True,weighted F1,0.534,0.514654,DecisionTreeClassifier()
4,gender,test_preparation_course,0.0,classification,True,weighted F1,0.566,0.531202,DecisionTreeClassifier()
5,gender,math_score,0.021409,regression,True,mean absolute error,12.49,12.222604,DecisionTreeRegressor()
6,ethnicity,gender,0.0,classification,True,weighted F1,0.51,0.491056,DecisionTreeClassifier()
7,ethnicity,ethnicity,1.0,predict_itself,True,,0.0,1.0,
8,ethnicity,parental_education,0.0,classification,True,weighted F1,0.188,0.163448,DecisionTreeClassifier()
9,ethnicity,lunch_program,0.0,classification,True,weighted F1,0.534,0.514654,DecisionTreeClassifier()


This dataframe is unwieldy, and we are only interested in a few items, so let's filter and query it. We are not interested in the rows that do not deal with the math_score variable, nor are we interested in the columns after ppscore. 

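The only other thing we note from the full dataframe is the model that PPS chose: DecisionTreeClassifier where the target column `y` is categorical, and DecisionTreeRegressor where it is numeric, as with math_score. This may prove relevant when we come to fit a prediction model.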

In [19]:
df_maths_pps_filtered = df_maths_pps.filter(['x', 'y', 'ppscore']).query('x == "math_score"')
df_maths_pps_filtered

Unnamed: 0,x,y,ppscore
30,math_score,gender,0.024337
31,math_score,ethnicity,0.039444
32,math_score,parental_education,0.00999
33,math_score,lunch_program,0.258262
34,math_score,test_preparation_course,0.024844
35,math_score,math_score,1.0


As noted in [this forum discussion](https://github.com/8080labs/ppscore/issues/39) on the PPS GitHub page, a score above 0.2 indicates strong predictive power, and a score between 0 and 0.2 indicates some weak yet relevant predictive power.
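
The `ppscore` values in the full matrix output above can be reproduced from the `baseline_score` and `model_score` columns. The sketch below assumes the normalisation used by the ppscore library: the relative reduction in mean absolute error for regression, and the F1 gain over baseline scaled by the maximum possible gain for classification, both floored at 0. The function names are hypothetical.

```python
def pps_from_mae(baseline_mae, model_mae):
    """Regression case: share of the baseline error removed by the model."""
    if model_mae >= baseline_mae:
        return 0.0
    return 1 - model_mae / baseline_mae


def pps_from_f1(baseline_f1, model_f1):
    """Classification case: F1 gain over baseline, scaled to the maximum gain."""
    if model_f1 <= baseline_f1:
        return 0.0
    return (model_f1 - baseline_f1) / (1 - baseline_f1)


# Values copied from the matrix output above
print(round(pps_from_mae(12.49, 12.222604), 6))  # row 5: gender -> math_score
print(pps_from_f1(0.534, 0.514654))              # row 3: gender -> lunch_program
```

For row 5 this gives the 0.021409 shown in the matrix, and for row 3 it gives 0.0, because the model scores below the baseline there.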

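As we can see, the lunch_program row has by far the highest score (0.258), which is above the 0.2 threshold. Note that in the PPS matrix `x` is the predictor and `y` is the target, so this row measures how well math_score predicts lunch_program rather than the reverse; the scores for the reverse direction are in the rows where `y == "math_score"`. The result still agrees with the correlation study in singling out lunch_program as the variable most strongly associated with a student's math score. The other variables have very low scores (all below 0.04), and we can safely discard them. This includes test_preparation_course (0.025), which is also in line with our previous correlation study.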
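
To read the scores with math_score as the target, the matrix can be filtered on `y` instead of `x` (in a PPS matrix, `x` is the feature column and `y` is the target column). This is a minimal sketch on a hand-written stand-in for the matrix; the scores are hypothetical and are not results from this dataset.

```python
import pandas as pd

# Hand-written stand-in for a PPS matrix (hypothetical scores)
toy_pps = pd.DataFrame({
    "x": ["gender", "lunch_program", "math_score", "math_score", "math_score"],
    "y": ["math_score", "math_score", "gender", "lunch_program", "math_score"],
    "ppscore": [0.02, 0.10, 0.03, 0.25, 1.0],
})

# Rows where math_score is the target (y), excluding its score against itself
predictors_of_math = (
    toy_pps.query('y == "math_score" and x != "math_score"')
    .sort_values("ppscore", ascending=False)
)
print(predictors_of_math)
```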

### Reading

As before, we'll filter the initial dataframe so that the only numerical variable is reading_score.

In [20]:
df_reading = df.drop(['math_score', 'writing_score', 'average_score'], axis=1)
df_reading.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,reading_score
0,male,group A,high school,standard,completed,67
1,female,group D,some high school,free/reduced,none,59
2,male,group E,some college,free/reduced,none,60
3,male,group B,high school,standard,none,78
4,male,group E,associate's degree,standard,completed,73


Now the predictive power score dataframe:

In [21]:
df_reading_pps = pps.matrix(df=df_reading)
df_reading_pps

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,gender,gender,1.0,predict_itself,True,,0.0,1.0,
1,gender,ethnicity,0.0,classification,True,weighted F1,0.254,0.157718,DecisionTreeClassifier()
2,gender,parental_education,0.0,classification,True,weighted F1,0.188,0.122846,DecisionTreeClassifier()
3,gender,lunch_program,0.0,classification,True,weighted F1,0.534,0.514654,DecisionTreeClassifier()
4,gender,test_preparation_course,0.0,classification,True,weighted F1,0.566,0.531202,DecisionTreeClassifier()
5,gender,reading_score,0.010793,regression,True,mean absolute error,11.88,11.751776,DecisionTreeRegressor()
6,ethnicity,gender,0.0,classification,True,weighted F1,0.51,0.491056,DecisionTreeClassifier()
7,ethnicity,ethnicity,1.0,predict_itself,True,,0.0,1.0,
8,ethnicity,parental_education,0.0,classification,True,weighted F1,0.188,0.163448,DecisionTreeClassifier()
9,ethnicity,lunch_program,0.0,classification,True,weighted F1,0.534,0.514654,DecisionTreeClassifier()


As before, this is cluttered, so we'll filter and query. We also note the use of the DecisionTreeClassifier algorithm for the categorical targets.

In [22]:
df_reading_pps_filtered = df_reading_pps.filter(['x', 'y', 'ppscore']).query('x == "reading_score"')
df_reading_pps_filtered

Unnamed: 0,x,y,ppscore
30,reading_score,gender,0.032292
31,reading_score,ethnicity,0.049479
32,reading_score,parental_education,0.008174
33,reading_score,lunch_program,0.189989
34,reading_score,test_preparation_course,0.065538
35,reading_score,reading_score,1.0


Interestingly, lunch_program scores slightly lower with reading_score (0.190) than with math_score (0.258), while test_preparation_course scores slightly higher (0.066 vs 0.025). Given that interpretation of predictive power score is subjective and contextual, we can probably take lunch_program as having strong predictive power, given that it is the highest score by far. It is interesting that test_preparation_course has such a low predictive power score when its Pearson correlation with reading_score is only a little weaker than that of lunch_program (0.245 vs 0.288).
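
The threshold rule quoted earlier in this notebook (above 0.2 is strong, between 0 and 0.2 is weak) can be written as a small helper, which makes clear that the reading_score result sits just below the cut-off. This is only a sketch: the function name and labels are hypothetical, and the threshold is a rule of thumb rather than part of the ppscore library.

```python
def describe_pps(score, strong_threshold=0.2):
    """Label a PPS value using the rule of thumb quoted in this notebook."""
    if score > strong_threshold:
        return "strong"
    if score > 0:
        return "weak"
    return "none"


# lunch_program scores from the filtered outputs above
for target, score in [("math_score", 0.258262), ("reading_score", 0.189989)]:
    print(target, describe_pps(score))
```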

### Writing

As above, we will first filter the initial dataframe:

In [23]:
df_writing = df.drop(['math_score', 'reading_score', 'average_score'], axis=1)
df_writing.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,writing_score
0,male,group A,high school,standard,completed,63
1,female,group D,some high school,free/reduced,none,55
2,male,group E,some college,free/reduced,none,50
3,male,group B,high school,standard,none,68
4,male,group E,associate's degree,standard,completed,68


Now the predictive power score dataframe:

In [24]:
df_writing_pps = pps.matrix(df=df_writing)
df_writing_pps

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,gender,gender,1.0,predict_itself,True,,0.0,1.0,
1,gender,ethnicity,0.0,classification,True,weighted F1,0.254,0.157718,DecisionTreeClassifier()
2,gender,parental_education,0.0,classification,True,weighted F1,0.188,0.122846,DecisionTreeClassifier()
3,gender,lunch_program,0.0,classification,True,weighted F1,0.534,0.514654,DecisionTreeClassifier()
4,gender,test_preparation_course,0.0,classification,True,weighted F1,0.566,0.531202,DecisionTreeClassifier()
5,gender,writing_score,0.026406,regression,True,mean absolute error,12.572,12.240025,DecisionTreeRegressor()
6,ethnicity,gender,0.0,classification,True,weighted F1,0.51,0.491056,DecisionTreeClassifier()
7,ethnicity,ethnicity,1.0,predict_itself,True,,0.0,1.0,
8,ethnicity,parental_education,0.0,classification,True,weighted F1,0.188,0.163448,DecisionTreeClassifier()
9,ethnicity,lunch_program,0.0,classification,True,weighted F1,0.534,0.514654,DecisionTreeClassifier()


As above, we need to filter and query. Again, we note the use of the DecisionTreeClassifier algorithm for the categorical targets.

In [25]:
df_writing_pps_filtered = df_writing_pps.filter(['x', 'y', 'ppscore']).query('x == "writing_score"')
df_writing_pps_filtered

Unnamed: 0,x,y,ppscore
30,writing_score,gender,0.042184
31,writing_score,ethnicity,0.064535
32,writing_score,parental_education,0.004927
33,writing_score,lunch_program,0.199992
34,writing_score,test_preparation_course,0.149159
35,writing_score,writing_score,1.0


We note that lunch_program again has the highest score (0.200, marginally below the 0.2 threshold), and that test_preparation_course also has a comparatively high score (0.149). As with reading_score, we will use the contextual nature of PPS interpretation to our advantage and classify both as strong predictive power. This may be justified by the very close correlation scores of the two variables with writing_score (about 0.32 each under Pearson).
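
The same drop-and-score steps were repeated above for each of the three score variables. As a possible refactoring, the single-score dataframes could be built in one loop. This is a minimal sketch: the function name `single_score_frames` is hypothetical, and the toy frame reuses the first two rows of the dataset shown earlier.

```python
import pandas as pd

SCORE_COLUMNS = ["math_score", "reading_score", "writing_score", "average_score"]


def single_score_frames(df, targets):
    """Map each target to a copy of `df` without the other score columns."""
    frames = {}
    for target in targets:
        others = [column for column in SCORE_COLUMNS if column != target]
        frames[target] = df.drop(columns=others)
    return frames


# First two rows of the dataset, with a reduced set of feature columns
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "lunch_program": ["standard", "free/reduced"],
    "math_score": [67, 40],
    "reading_score": [67, 59],
    "writing_score": [63, 55],
    "average_score": [65, 51],
})
frames = single_score_frames(toy, ["math_score", "reading_score", "writing_score"])
print(frames["math_score"].columns.to_list())
```

Each resulting frame could then be passed to `pps.matrix()` in the same way as `df_maths`, `df_reading` and `df_writing` above.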

### Discussion

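We will gather our insights from the predictive power score analyses here. In each case the score variable was the predictor `x` in the filtered rows, so we read these scores as a measure of association between the test score and each categorical variable: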

- For math_score, lunch_program has the strongest predictive power by far

- For reading_score, lunch_program has the strongest predictive power

- For writing_score, lunch_program has the strongest predictive power, closely followed by test_preparation_course

Therefore, going forward, we will focus on the lunch_program variable and, to a lesser extent, the test_preparation_course variable.