# **Feature Engineering**

## Objectives
- Split the dataset into train and test sets for each of the maths, writing and reading exam scores, so that models can be trained to predict each exam score
- Conduct a SmartCorrelatedSelection step to check whether any of the feature variables are correlated strongly enough that they should be removed to prevent overfitting

## Inputs
- outputs/datasets/collection/student-exam-results.csv

## Outputs
- train and test sets for each of the math_score, reading_score and writing_score variables

## Additional comments

# Change working directory

Since this notebook lives in the jupyter_notebooks directory, we need to change the current working directory from jupyter_notebooks to the workspace root, so that any directories created in later code cells are added in the correct place.

We access the current directory with the `os` module's `getcwd()` function

In [1]:
import os
current_directory = os.getcwd()
current_directory


'/workspace/Exam-Scores-Analysis/jupyter_notebooks'

We now want to set the working directory to the parent of the current working directory (jupyter_notebooks)

- The `os.path.dirname()` function returns the parent directory of a path
- The `os.chdir()` function sets the new current directory
- We do this to access all of the project's files and directories, rather than those in the jupyter_notebooks directory
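As a minimal illustration of the first bullet, the snippet below applies `os.path.dirname()` to the notebook path shown in the output above. It only computes the parent path and does not change the working directory; the variable names `notebook_dir` and `project_root` are chosen here for illustration.

```python
import os

# Path reported by os.getcwd() in the cell above
notebook_dir = "/workspace/Exam-Scores-Analysis/jupyter_notebooks"

# dirname() strips the last path component, giving the parent directory
project_root = os.path.dirname(notebook_dir)

# basename() returns the component that was stripped
last_component = os.path.basename(notebook_dir)

print(project_root)
print(last_component)
```

Passing `project_root` to `os.chdir()` is what the next cell does in a single expression.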

In [2]:
os.chdir(os.path.dirname(current_directory))
print("You set a new current directory")

You set a new current directory


We now use a code cell to confirm that the current working directory has been set properly

In [3]:
current_directory = os.getcwd()
current_directory

'/workspace/Exam-Scores-Analysis'

## Import packages

In [4]:
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.model_selection import train_test_split

## Load data

In [5]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/student-exam-results.csv')
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score
0,male,group A,high school,standard,completed,67,67,63,65
1,female,group D,some high school,free/reduced,none,40,59,55,51
2,male,group E,some college,free/reduced,none,59,60,50,56
3,male,group B,high school,standard,none,77,78,68,74
4,male,group E,associate's degree,standard,completed,78,73,68,73


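Now we create three datasets, one per exam score. Each keeps the feature variables and a single target score, and drops the other two scores and average_score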

In [6]:
df_math_score = df.drop(['reading_score', 'writing_score', 'average_score'], axis=1)
df_math_score.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score
0,male,group A,high school,standard,completed,67
1,female,group D,some high school,free/reduced,none,40
2,male,group E,some college,free/reduced,none,59
3,male,group B,high school,standard,none,77
4,male,group E,associate's degree,standard,completed,78


In [7]:
df_reading_score = df.drop(['math_score', 'writing_score', 'average_score'], axis=1)
df_reading_score.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,reading_score
0,male,group A,high school,standard,completed,67
1,female,group D,some high school,free/reduced,none,59
2,male,group E,some college,free/reduced,none,60
3,male,group B,high school,standard,none,78
4,male,group E,associate's degree,standard,completed,73


In [8]:
df_writing_score = df.drop(['math_score', 'reading_score', 'average_score'], axis=1)
df_writing_score.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,writing_score
0,male,group A,high school,standard,completed,63
1,female,group D,some high school,free/reduced,none,55
2,male,group E,some college,free/reduced,none,50
3,male,group B,high school,standard,none,68
4,male,group E,associate's degree,standard,completed,68


### Math Score

Now we can split these datasets to form train and test sets, using the `train_test_split` function we imported earlier.

In [9]:
math_train_vars, math_test_vars, math_train_score, math_test_score = train_test_split(
    df_math_score.drop(['math_score'], axis=1),
    df_math_score['math_score'],
    test_size = 0.2,
    random_state = 7
)

Now we check to see if this has worked:

In [10]:
print(f"math_train_vars: {math_train_vars.shape}")
print(f"math_test_vars: {math_test_vars.shape}")
print(f"math_train_score: {math_train_score.shape}")
print(f"math_test_score: {math_test_score.shape}")


math_train_vars: (800, 5)
math_test_vars: (200, 5)
math_train_score: (800,)
math_test_score: (200,)


Excellent - we see that the math_train_vars and math_train_score datasets both have 800 entries, equivalent to 80% of the original 1000 record dataset. We also see that the math_test_vars and math_test_score datasets both have 200 entries, equivalent to 20% of the original 1000 record dataset. `train_test_split` shuffles the records by default before splitting, and because we have fixed `random_state`, the same split will be produced every time the cell is run
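The effect of fixing `random_state` can be checked with a small sketch. The ten-element list below is stand-in data, not the exam dataset; the point is only that two calls with the same seed return the same partition.

```python
from sklearn.model_selection import train_test_split

# Stand-in data (assumption for illustration): ten record ids
records = list(range(10))

# Two independent calls with the same random_state
train_a, test_a = train_test_split(records, test_size=0.2, random_state=7)
train_b, test_b = train_test_split(records, test_size=0.2, random_state=7)

# The same seed gives the same shuffled partition both times
same_split = (train_a == train_b) and (test_a == test_b)

print(same_split)
print(len(train_a), len(test_a))
```

Omitting `random_state` would still shuffle the data, but the partition could differ between runs.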

### Reading Score

We will now do the same with the reading_score variable

In [11]:
reading_train_vars, reading_test_vars, reading_train_score, reading_test_score = train_test_split(
    df_reading_score.drop(['reading_score'], axis=1),
    df_reading_score['reading_score'],
    test_size = 0.2,
    random_state = 7
)

Now we check to see if these have been created properly

In [12]:
print(f"reading_train_vars: {reading_train_vars.shape}")
print(f"reading_test_vars: {reading_test_vars.shape}")
print(f"reading_train_score: {reading_train_score.shape}")
print(f"reading_test_score: {reading_test_score.shape}")

reading_train_vars: (800, 5)
reading_test_vars: (200, 5)
reading_train_score: (800,)
reading_test_score: (200,)


Excellent - 800 records for the train set and 200 records for the test set. Since we have used the same random state as before, the split is reproducible and selects the same rows as the maths split

### Writing Score

Finally, we split the writing_score dataset

In [13]:
writing_train_vars, writing_test_vars, writing_train_score, writing_test_score = train_test_split(
    df_writing_score.drop(['writing_score'], axis=1),
    df_writing_score['writing_score'],
    test_size = 0.2,
    random_state = 7
)

In [14]:
print(f"writing_train_vars: {writing_train_vars.shape}")
print(f"writing_test_vars: {writing_test_vars.shape}")
print(f"writing_train_score: {writing_train_score.shape}")
print(f"writing_test_score: {writing_test_score.shape}")

writing_train_vars: (800, 5)
writing_test_vars: (200, 5)
writing_train_score: (800,)
writing_test_score: (200,)


As before, we have 800 records for the train set and 200 for the test set
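The three splits above repeat the same call with a different target each time. One possible way to express that pattern is a small helper, sketched below. The function name `split_by_target` and the toy DataFrame are hypothetical and not part of the notebook; the helper assumes every column that is not a target is a feature, so a column such as average_score would need to be dropped before calling it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_target(df, targets, test_size=0.2, random_state=7):
    """Return {target: [train_vars, test_vars, train_score, test_score]}."""
    # Every non-target column is treated as a feature (assumption)
    features = df.drop(columns=targets)
    return {
        target: train_test_split(
            features, df[target], test_size=test_size, random_state=random_state
        )
        for target in targets
    }


# Toy data standing in for the exam dataset
toy = pd.DataFrame({
    "gender": ["male", "female"] * 5,
    "lunch_program": ["standard", "free/reduced"] * 5,
    "math_score": range(60, 70),
    "reading_score": range(70, 80),
})

splits = split_by_target(toy, ["math_score", "reading_score"])
math_train_vars, math_test_vars, math_train_score, math_test_score = splits["math_score"]
print(math_train_vars.shape, math_test_vars.shape)
```

Because each call uses the same `random_state` on the same number of rows, every target ends up with the same rows in its train set.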

## Smart Correlated Selection

Though we have already identified some correlations between the feature variables, we should conduct a smart correlated selection analysis to see if we need to include a smart correlated selection step in the pipeline. We will need to encode our categorical variables.

In [15]:
variables_to_encode = ['gender', 'ethnicity', 'parental_education', 'lunch_program', 'test_preparation_course']

encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_to_encode)

df_math_score_encoded = encoder.fit_transform(df_math_score)
df_reading_score_encoded = encoder.fit_transform(df_reading_score)
df_writing_score_encoded = encoder.fit_transform(df_writing_score)

Now we check to see if the datasets have been encoded properly

In [16]:
df_math_score_encoded.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score
0,0,0,0,0,0,67
1,1,1,1,1,1,40
2,0,2,2,1,1,59
3,0,3,0,0,1,77
4,0,2,3,0,0,78


In [17]:
df_reading_score_encoded.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,reading_score
0,0,0,0,0,0,67
1,1,1,1,1,1,59
2,0,2,2,1,1,60
3,0,3,0,0,1,78
4,0,2,3,0,0,73


In [18]:
df_writing_score_encoded.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,writing_score
0,0,0,0,0,0,63
1,1,1,1,1,1,55
2,0,2,2,1,1,50
3,0,3,0,0,1,68
4,0,2,3,0,0,68


Excellent - looks like the datasets have been encoded identically
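The encoded tables above are consistent with categories being numbered in order of first appearance (for example, male becomes 0 and female becomes 1). The sketch below reproduces that numbering for the first five gender values using `pd.factorize` as a stand-in; it is not feature_engine's implementation, only an illustration of the mapping.

```python
import pandas as pd

# First five gender values from df.head() above
gender = pd.Series(["male", "female", "male", "male", "male"])

# factorize assigns integer codes in order of first appearance
codes, categories = pd.factorize(gender)

print(list(codes))
print(list(categories))
```

With this kind of encoding the integers carry no ordering information about the categories; they are only labels.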

### Smart correlated selection - maths

In [19]:
correlated_selection_maths = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

correlated_selection_maths.fit_transform(df_math_score_encoded)
correlated_selection_maths.correlated_feature_sets_

[]

In [20]:
correlated_selection_maths.features_to_drop_

[]

Interesting - both `correlated_feature_sets_` and `features_to_drop_` are empty, so no pair of variables exceeds the 0.6 Spearman correlation threshold and none should be dropped. This should hold true for the reading and writing datasets, but we should check just to be safe.

### Smart correlated selection - reading


In [21]:
correlated_selection_reading = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

correlated_selection_reading.fit_transform(df_reading_score_encoded)
correlated_selection_reading.correlated_feature_sets_

[]

In [22]:
correlated_selection_reading.features_to_drop_

[]

### Smart correlated selection - writing


In [23]:
correlated_selection_writing = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

correlated_selection_writing.fit_transform(df_writing_score_encoded)
correlated_selection_writing.correlated_feature_sets_

[]

In [24]:
correlated_selection_writing.features_to_drop_

[]

As expected - in all three cases, none of the categorical variables are correlated strongly enough to exceed the 0.6 threshold, so keeping all of them should not add overfitting risk from redundant features. Therefore, we can safely forgo a SmartCorrelatedSelection step in our pipeline.
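To see the kind of pairwise check that sits behind these empty results, the sketch below computes a Spearman correlation matrix with pandas and lists the column pairs whose absolute correlation exceeds a threshold. The helper `correlated_pairs` and the toy data are hypothetical; this is a simplified illustration and does not reproduce how SmartCorrelatedSelection groups features or chooses which one to keep.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.6, method="spearman"):
    """List column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(method=method).abs()
    cols = list(corr.columns)
    # Only look above the diagonal so each pair is reported once
    return [
        (cols[i], cols[j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold
    ]


# Toy data: b is a monotonic function of a, c is only weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 1, 4, 1, 5],
})

pairs = correlated_pairs(toy)
print(pairs)
```

Running the same helper on one of the encoded datasets would be one way to inspect how far below the threshold the actual correlations are.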