# **Correlation Analysis Notebook**

## Objectives
- Determine how the variables of the dataset correlate

## Inputs
- student-exam-results.csv file

## Outputs
- Revealed correlations within the dataset

## Additional comments

# Change working directory

Since this notebook lives in the jupyter_notebooks directory, we need to change the current working directory from jupyter_notebooks to its parent, the workspace, so that any directories created in later code cells are added in the correct place.

We access the current directory with the `os` module's `getcwd()` function

In [1]:
import os
current_directory = os.getcwd()
current_directory


'/workspace/Exam-Scores-Analysis/jupyter_notebooks'

We now want to set the working directory as the parent of the current working directory, jupyter_notebooks

- The `os.path.dirname()` function gets the parent directory
- The `os.chdir()` function sets the new current directory
- We do this to access all of the project's files and directories, rather than those in the jupyter_notebooks directory

In [2]:
os.chdir(os.path.dirname(current_directory))
print("You set a new current directory")

You set a new current directory


To be sure, we now use a code cell to confirm that the current working directory has been set properly

In [3]:
current_directory = os.getcwd()
current_directory

'/workspace/Exam-Scores-Analysis'
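One caveat with the cells above: running the `os.chdir()` cell a second time would move up another level. A minimal sketch of a guard follows, assuming the notebook folder is always named `jupyter_notebooks`. The helper name `project_root` is hypothetical and not part of the original notebook.

```python
import os

def project_root(current_directory, notebook_dir="jupyter_notebooks"):
    """Return the parent directory only while we are still inside the notebook folder."""
    # normpath strips any trailing separator so basename sees the last folder name
    normalized = os.path.normpath(current_directory)
    if os.path.basename(normalized) == notebook_dir:
        return os.path.dirname(normalized)
    # Already outside the notebook folder: leave the path unchanged
    return normalized
```

It could be used as `os.chdir(project_root(os.getcwd()))`, which leaves the working directory alone if the cell is run again.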

## Correlation Study

Now that we have become more familiar with the dataset, we can perform a correlation study. As noted in the Churnometer walkthrough project, we could use the Predictive Power Score library. However, the feature variables are categorical, so we use One Hot Encoding to encode them as numeric indicator columns, which lets us compute correlation scores.

However, our task is larger in scope, since we have 5 target variables. We need to see how each feature variable correlates with each target variable. First, as we have a new notebook, we need to re-import the dataset.

In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/student-exam-results.csv')
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score,literacy_score
0,male,group A,high school,standard,completed,67,67,63,65,65
1,female,group D,some high school,free/reduced,none,40,59,55,51,57
2,male,group E,some college,free/reduced,none,59,60,50,56,55
3,male,group B,high school,standard,none,77,78,68,74,73
4,male,group E,associate's degree,standard,completed,78,73,68,73,70


Now we can apply the One Hot Encoder

In [5]:
from feature_engine.encoding import OneHotEncoder
one_hot_encoder = OneHotEncoder(variables = df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_one_hot_encoder = one_hot_encoder.fit_transform(df)
print(df_one_hot_encoder.shape)
df_one_hot_encoder.head()

(1000, 22)


Unnamed: 0,math_score,reading_score,writing_score,average_score,literacy_score,gender_male,gender_female,ethnicity_group A,ethnicity_group D,ethnicity_group E,...,parental_education_high school,parental_education_some high school,parental_education_some college,parental_education_associate's degree,parental_education_bachelor's degree,parental_education_master's degree,lunch_program_standard,lunch_program_free/reduced,test_preparation_course_completed,test_preparation_course_none
0,67,67,63,65,65,1,0,1,0,0,...,1,0,0,0,0,0,1,0,1,0
1,40,59,55,51,57,0,1,0,1,0,...,0,1,0,0,0,0,0,1,0,1
2,59,60,50,56,55,1,0,0,0,1,...,0,0,1,0,0,0,0,1,0,1
3,77,78,68,74,73,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,1
4,78,73,68,73,70,1,0,0,0,1,...,0,0,0,1,0,0,1,0,1,0


We note that the encoded dataset now has 22 columns: the 5 target variables, which have been preserved, plus 17 indicator columns, one for each category of the 5 feature variables.
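The column count can be checked by hand: each categorical variable contributes one indicator column per distinct category, and numeric columns pass through unchanged. The sketch below illustrates this on a small made-up dataframe. It uses `pd.get_dummies` as a stand-in for feature_engine's `OneHotEncoder` (an assumption made to keep the example dependency-free; column order may differ between the two), and it selects categorical columns with `select_dtypes` rather than the notebook's `dtypes == 'object'` test.

```python
import pandas as pd

# Hypothetical toy data, not taken from student-exam-results.csv
toy = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "lunch_program": ["standard", "free/reduced", "free/reduced", "standard"],
    "math_score": [67, 40, 59, 77],
})

# Every non-numeric column is treated as categorical
categorical = toy.select_dtypes(exclude="number").columns.to_list()

# One indicator column per category; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

# Expected width: numeric columns + total number of distinct categories
expected_columns = (toy.shape[1] - len(categorical)) + sum(
    toy[col].nunique() for col in categorical
)
print(encoded.shape[1], expected_columns)
```

Applying the same arithmetic to the real dataset gives 5 score columns plus 2 + 5 + 6 + 2 + 2 indicator columns, which matches the 22 reported above.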

### Maths score correlation

#### Pearson

In [9]:
math_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['math_score'].sort_values(key=abs, ascending=False)[1:].head(10)
math_correlation_pearson

average_score                          0.919267
literacy_score                         0.821858
reading_score                          0.819398
writing_score                          0.805944
lunch_program_free/reduced            -0.374431
lunch_program_standard                 0.374431
ethnicity_group E                      0.203515
gender_male                            0.200863
gender_female                         -0.200863
parental_education_some high school   -0.179725
Name: math_score, dtype: float64

We can see that the other 4 target variables correlate highly with math_score. The `[1:]` slice excludes the first entry, which is math_score's correlation with itself. Since, in any prediction situation, we will not know the other test scores, we cannot use them as predictors. We therefore adjust the slice to `[5:]`, which skips the first 5 rows (math_score itself and the other 4 target variables) and better reveals how the feature variables correlate with math_score.
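Skipping rows by position works only while the target variables occupy the top of the sorted list. A more defensive variant drops them by name before sorting. The following is a sketch under that assumption; the function name `feature_correlations` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def feature_correlations(encoded_df, target, target_vars, method="pearson", top=10):
    """Correlations of every non-target column with `target`, strongest first."""
    correlations = encoded_df.corr(method=method)[target]
    # Drop all target variables by label; this also removes the self-correlation
    correlations = correlations.drop(labels=target_vars)
    return correlations.sort_values(key=abs, ascending=False).head(top)

# Hypothetical toy data: one binary feature (two indicator columns), two scores
rng = np.random.default_rng(0)
lunch = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "math_score": 60 + 10 * lunch + rng.normal(0, 10, size=200),
    "lunch_program_standard": lunch,
    "lunch_program_free/reduced": 1 - lunch,
})
toy["reading_score"] = toy["math_score"] + rng.normal(0, 5, size=200)

result = feature_correlations(toy, "math_score", ["math_score", "reading_score"])
print(result)
```

With the real data, `target_vars` would be the five score columns, and the result should match the `[5:]` slice as long as the scores remain the five strongest correlates.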

In [11]:
math_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['math_score'].sort_values(key=abs, ascending=False)[5:].head(10)
math_correlation_pearson

lunch_program_free/reduced             -0.374431
lunch_program_standard                  0.374431
ethnicity_group E                       0.203515
gender_male                             0.200863
gender_female                          -0.200863
parental_education_some high school    -0.179725
test_preparation_course_none           -0.151704
test_preparation_course_completed       0.151704
ethnicity_group C                      -0.146533
parental_education_bachelor's degree    0.117535
Name: math_score, dtype: float64

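This reveals that only the lunch_program feature variable shows a notable correlation with the math_score variable, at about 0.37 in absolute value, and even that is weak. Every other feature has an absolute correlation of about 0.2 or less. Therefore, prediction of the math_score target variable could be problematic.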
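Note that the two lunch_program rows carry the same information: for a two-category variable the indicator columns always sum to 1, so their correlations with any other column are equal in magnitude and opposite in sign. A small sketch on made-up data (the column names and numbers here are illustrative assumptions only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a binary feature encoded as two complementary indicators
rng = np.random.default_rng(1)
standard = rng.integers(0, 2, size=100)
toy = pd.DataFrame({
    "score": 55 + 8 * standard + rng.normal(0, 12, size=100),
    "standard": standard,
    "free_reduced": 1 - standard,  # complement of the first indicator
})

corr = toy.corr(method="pearson")["score"]
# The two values mirror each other
print(corr["standard"], corr["free_reduced"])
```

This is why gender, lunch_program and test_preparation_course each appear as a mirrored pair in the outputs, taking up two of the ten rows shown.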

#### Spearman

We anticipate a similar outcome to the Pearson correlation, so we will discard the first 5 rows of the Spearman correlation test output.

In [12]:
math_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['math_score'].sort_values(key=abs, ascending=False)[5:].head(10)
math_correlation_spearman

lunch_program_free/reduced            -0.363140
lunch_program_standard                 0.363140
gender_male                            0.193047
gender_female                         -0.193047
ethnicity_group E                      0.192825
parental_education_some high school   -0.176459
test_preparation_course_completed      0.145819
test_preparation_course_none          -0.145819
ethnicity_group C                     -0.141661
ethnicity_group D                      0.118324
Name: math_score, dtype: float64

We get similar results to the Pearson correlation: the lunch_program variable correlates weakly with the math_score variable.
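The two methods are closely related: Spearman's coefficient is Pearson's coefficient computed on the ranks of the values, so it measures monotonic rather than strictly linear association. The sketch below checks this equivalence on made-up data (the variables `x` and `y` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data with a monotonic but non-linear relationship
rng = np.random.default_rng(2)
x = rng.normal(size=50)
toy = pd.DataFrame({"x": x, "y": np.exp(x) + rng.normal(0, 0.1, size=50)})

# Spearman as computed by pandas
spearman = toy.corr(method="spearman").loc["x", "y"]

# Pearson applied to the ranked data gives the same number
pearson_on_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

print(spearman, pearson_on_ranks)
```

When the Pearson and Spearman values are as close as they are for math_score, it suggests that outliers or strongly non-linear effects are not driving the result.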

### Reading score correlation

#### Pearson

In [14]:
reading_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['reading_score'].sort_values(key=abs, ascending=False)[5:].head(10)
reading_correlation_pearson

lunch_program_standard                  0.288282
lunch_program_free/reduced             -0.288282
test_preparation_course_none           -0.245144
test_preparation_course_completed       0.245144
gender_female                           0.189389
gender_male                            -0.189389
parental_education_some high school    -0.151530
ethnicity_group D                       0.124821
ethnicity_group C                      -0.122770
parental_education_bachelor's degree    0.120719
Name: reading_score, dtype: float64

We see that both the lunch_program and test_preparation_course feature variables correlate weakly with reading_score

#### Spearman

In [15]:
reading_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['reading_score'].sort_values(key=abs, ascending=False)[5:].head(10)
reading_correlation_spearman

lunch_program_free/reduced             -0.273246
lunch_program_standard                  0.273246
test_preparation_course_completed       0.244122
test_preparation_course_none           -0.244122
gender_female                           0.181827
gender_male                            -0.181827
parental_education_some high school    -0.149800
ethnicity_group C                      -0.123938
ethnicity_group D                       0.121751
parental_education_bachelor's degree    0.118116
Name: reading_score, dtype: float64

As with Pearson correlation, we see that both the lunch_program and test_preparation_course feature variables correlate weakly with reading_score

### Writing score correlation

#### Pearson

In [16]:
writing_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['writing_score'].sort_values(key=abs, ascending=False)[5:].head(10)
writing_correlation_pearson

lunch_program_free/reduced             -0.319191
lunch_program_standard                  0.319191
test_preparation_course_completed       0.315601
test_preparation_course_none           -0.315601
gender_female                           0.246089
gender_male                            -0.246089
ethnicity_group D                       0.172772
parental_education_some high school    -0.161996
parental_education_bachelor's degree    0.151974
parental_education_master's degree      0.143354
Name: writing_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with writing_score

#### Spearman

In [17]:
writing_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['writing_score'].sort_values(key=abs, ascending=False)[5:].head(10)
writing_correlation_spearman

test_preparation_course_completed       0.312719
test_preparation_course_none           -0.312719
lunch_program_standard                  0.308331
lunch_program_free/reduced             -0.308331
gender_female                           0.240425
gender_male                            -0.240425
ethnicity_group D                       0.165223
parental_education_some high school    -0.162533
parental_education_bachelor's degree    0.146431
ethnicity_group C                      -0.139161
Name: writing_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with writing_score

### Average score correlation

#### Pearson

In [18]:
average_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['average_score'].sort_values(key=abs, ascending=False)[5:].head(10)
average_correlation_pearson

lunch_program_free/reduced             -0.343801
lunch_program_standard                  0.343801
test_preparation_course_none           -0.250024
test_preparation_course_completed       0.250024
parental_education_some high school    -0.171873
ethnicity_group C                      -0.144197
ethnicity_group D                       0.143886
parental_education_bachelor's degree    0.135769
ethnicity_group E                       0.124586
parental_education_master's degree      0.124421
Name: average_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with average_score, though lunch_program correlates more strongly (about 0.34 versus 0.25 in absolute value)

#### Spearman

In [19]:
average_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['average_score'].sort_values(key=abs, ascending=False)[5:].head(10)
average_correlation_spearman

lunch_program_free/reduced             -0.328517
lunch_program_standard                  0.328517
test_preparation_course_completed       0.244359
test_preparation_course_none           -0.244359
parental_education_some high school    -0.170254
ethnicity_group D                       0.144023
ethnicity_group C                      -0.142902
parental_education_bachelor's degree    0.127152
ethnicity_group E                       0.121390
parental_education_master's degree      0.116028
Name: average_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with average_score, though lunch_program correlates more strongly (about 0.33 versus 0.24 in absolute value)

### Literacy score correlation

#### Pearson

In [20]:
literacy_correlation_pearson = df_one_hot_encoder.corr(method='pearson')['literacy_score'].sort_values(key=abs, ascending=False)[5:].head(10)
literacy_correlation_pearson

lunch_program_standard                  0.308416
lunch_program_free/reduced             -0.308416
test_preparation_course_completed       0.284684
test_preparation_course_none           -0.284684
gender_female                           0.220989
gender_male                            -0.220989
parental_education_some high school    -0.159035
ethnicity_group D                       0.150611
parental_education_bachelor's degree    0.139335
ethnicity_group C                      -0.134234
Name: literacy_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with literacy_score

#### Spearman

In [21]:
literacy_correlation_spearman = df_one_hot_encoder.corr(method='spearman')['literacy_score'].sort_values(key=abs, ascending=False)[5:].head(10)
literacy_correlation_spearman

lunch_program_free/reduced             -0.296744
lunch_program_standard                  0.296744
test_preparation_course_completed       0.281946
test_preparation_course_none           -0.281946
gender_female                           0.213681
gender_male                            -0.213681
parental_education_some high school    -0.158121
ethnicity_group D                       0.145532
parental_education_bachelor's degree    0.134937
ethnicity_group C                      -0.131162
Name: literacy_score, dtype: float64

We see that lunch_program and test_preparation_course correlate weakly with literacy_score
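The ten cells above repeat the same computation for each target and method. As a possible consolidation, the sketch below builds a single table with one row per feature column and one column per target. The function name `correlation_table` and the toy data are hypothetical; this is one way to organise the results, not the notebook's own method.

```python
import numpy as np
import pandas as pd

def correlation_table(encoded_df, target_vars, method="pearson"):
    """Rows are feature columns, columns are targets, values are correlations."""
    feature_cols = [col for col in encoded_df.columns if col not in target_vars]
    # Compute the full matrix once, then keep only the feature-by-target block
    return encoded_df.corr(method=method).loc[feature_cols, target_vars]

# Hypothetical toy data standing in for the one-hot encoded dataframe
rng = np.random.default_rng(3)
prep = rng.integers(0, 2, size=300)
toy = pd.DataFrame({
    "test_preparation_course_completed": prep,
    "test_preparation_course_none": 1 - prep,
    "reading_score": 65 + 7 * prep + rng.normal(0, 13, size=300),
    "writing_score": 63 + 10 * prep + rng.normal(0, 13, size=300),
})

table = correlation_table(toy, ["reading_score", "writing_score"])
print(table.round(3))
```

On the real data this would be called with `df_one_hot_encoder` and the five score columns, once with `method="pearson"` and once with `method="spearman"`, so the features can be compared across all targets side by side.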