# Feature Engineering Code-along

In this notebook, we use the following feature engineering strategies:
1. scaling
2. binning
3. log transformation
4. PCA

The data describes students: features about them and their scores on an exam. The original data can be found [here on Kaggle](https://www.kaggle.com/datasets/uciml/student-alcohol-consumption).

We will use some feature engineering on our data and then try to predict whether students will pass the exam.

The minimum passing score is 12.

# Data Dictionary

1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)


In [2]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import timeit

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, ConfusionMatrixDisplay, \
classification_report

from sklearn import set_config
set_config(transform_output='pandas')

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

# Useful Functions

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
def classification_metrics(y_true, y_pred, label="",
                           output_dict=False, figsize=(8,4),
                           normalize='true', cmap='Blues',
                           colorbar=False):
  # Get the classification report
  report = classification_report(y_true, y_pred)
  ## Print header and report
  header = "-"*70
  print(header, f" Classification Metrics: {label}", header, sep='\n')
  print(report)
  ## CONFUSION MATRICES SUBPLOTS
  fig, axes = plt.subplots(ncols=2, figsize=figsize)
  # create a confusion matrix  of raw counts
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=None, cmap='gist_gray', colorbar=colorbar,
                ax = axes[0],);
  axes[0].set_title("Raw Counts")
  # create a normalized confusion matrix (normalization set by `normalize`)
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=normalize, cmap=cmap, colorbar=colorbar,
                ax = axes[1]);
  axes[1].set_title("Normalized Confusion Matrix")
  # Adjust layout and show figure
  fig.tight_layout()
  plt.show()
  # Return dictionary of classification_report
  if output_dict==True:
    report_dict = classification_report(y_true, y_pred, output_dict=True)
    return report_dict
    
    
    
def evaluate_classification(model, X_train, y_train, X_test, y_test,
                         figsize=(6,4), normalize='true', output_dict = False,
                            cmap_train='Blues', cmap_test="Reds",colorbar=False):
  # Get predictions for training data
  y_train_pred = model.predict(X_train)
  # Call the helper function to obtain classification metrics for training data
  results_train = classification_metrics(y_train, y_train_pred, #verbose = verbose,
                                     output_dict=True, figsize=figsize,
                                         colorbar=colorbar, cmap=cmap_train,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = model.predict(X_test)
  # Call the helper function to obtain classification metrics for test data
  results_test = classification_metrics(y_test, y_test_pred, #verbose = verbose,
                                  output_dict=True,figsize=figsize,
                                         colorbar=colorbar, cmap=cmap_test,
                                    label='Test Data' )
  if output_dict == True:
    # Return both reports in a dictionary if output_dict is True
    results_dict = {'train':results_train,
                    'test': results_test}
    return results_dict



# Load Data


In [None]:
# load data
import glob
math_files = sorted(glob.glob('Data/Students/*.csv'))
math_files

In [None]:
port_files = sorted(glob.glob('Data/Students/Portugese/*.csv'))
port_files

In [None]:
math = pd.concat([pd.read_csv(file) for file in math_files])
port = pd.concat([pd.read_csv(file) for file in port_files])

df = pd.concat([math, port])

df_backup = df.copy()

df.head()

In [None]:
df.info()

## Explore and clean the data

In [None]:
# Check for duplicates

df.duplicated().sum()

In [None]:
# Check for missing values

df.isna().sum()

In [None]:
# Check summary statistics

df.describe()

In [None]:
df.describe(exclude='number')

# Converting Regression to Classification: Binning the Target

What our stakeholders really want to know is which students will pass the exam and which will fail. We also know that a passing score is 12 or higher. Using this knowledge, we can bin the target into two classes: passing and failing.

## Applying a function

In [None]:
## Define a function


## Apply the Function

## Check Value counts
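
One way to fill in this cell is sketched below on a toy DataFrame rather than the notebook's data. The score column name `G3` is an assumption (it is the final-grade column in the Kaggle dataset), and `passed` is a hypothetical helper name. The target name `passed_exam` matches the column used later when the data is split.

```python
import pandas as pd

PASSING_SCORE = 12  # minimum passing score given at the top of this notebook


def passed(score, threshold=PASSING_SCORE):
    """Return 1 if the score is at or above the threshold, else 0."""
    return int(score >= threshold)


# Toy stand-in for df; 'G3' is an ASSUMED name for the exam-score column
toy = pd.DataFrame({'G3': [5, 11, 12, 18]})

# Apply the function to every score to create the binary target
toy['passed_exam'] = toy['G3'].apply(passed)

# Drop the raw score so the model cannot read the answer from it
toy = toy.drop(columns='G3')

# Check the class balance
print(toy['passed_exam'].value_counts())
```

On the real data, the same three steps would run against `df` instead of `toy`.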



# Feature Engineering

## Combining Features

Walc is weekend alcohol consumption and Dalc is weekday alcohol consumption.  We can combine these into one column, overall alcohol consumption.

In [3]:
# Add together the different alcohol consumption columns
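
A minimal sketch of this step, using a toy DataFrame in place of `df`. The name `total_alc` for the new column is hypothetical; `Dalc` and `Walc` come from the data dictionary.

```python
import pandas as pd

# Toy stand-in for df with the two alcohol columns from the data dictionary
toy = pd.DataFrame({'Dalc': [1, 2, 5], 'Walc': [1, 4, 5]})

# 'total_alc' is a hypothetical name for the combined feature
toy['total_alc'] = toy['Dalc'] + toy['Walc']

print(toy)
```

Because each input ranges from 1 to 5, the combined feature ranges from 2 to 10. Whether to keep or drop the two original columns afterwards is a modeling choice this sketch leaves open.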



## Binary Encoding

In [None]:
## Replace 'yes' and 'no' with 1 and 0
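
One possible approach, shown on toy data: detect the columns that contain only 'yes' and 'no', then map them to integers. The variable name `yes_no_cols` is hypothetical, and the toy columns are a small subset of the binary columns listed in the data dictionary.

```python
import pandas as pd

# Toy stand-in for df
toy = pd.DataFrame({
    'schoolsup': ['yes', 'no', 'no'],
    'internet': ['yes', 'yes', 'no'],
    'guardian': ['mother', 'father', 'other'],  # not yes/no: left untouched
})

# Find the columns whose only values are 'yes' and 'no'
yes_no_cols = [col for col in toy.columns
               if set(toy[col].dropna().unique()) <= {'yes', 'no'}]

# Map the strings to integers, one column at a time
for col in yes_no_cols:
    toy[col] = toy[col].map({'yes': 1, 'no': 0})

print(toy.dtypes)
```

Encoding these columns as 0/1 up front means they are later treated as numeric features and scaled, instead of being one-hot encoded into two redundant columns each.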


## Combining Columns

In [None]:
## Combine the school and subject


## drop original columns
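
A sketch of this step on toy data. It assumes the combined DataFrame has a `subject` column identifying the course, which is not shown in the cells above; the new column name `school_subject` is also hypothetical.

```python
import pandas as pd

# Toy stand-in for df; 'subject' is an ASSUMED column name
toy = pd.DataFrame({
    'school': ['GP', 'GP', 'MS'],
    'subject': ['math', 'portuguese', 'math'],
    'age': [15, 17, 16],
})

# 'school_subject' is a hypothetical name for the combined feature
toy['school_subject'] = toy['school'] + '_' + toy['subject']

# Drop the original columns now that their information lives in the new one
toy = toy.drop(columns=['school', 'subject'])

print(toy)
```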


## Reducing Outliers: Log transformation

In [None]:
## Check distribution of absences
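
A numeric way to check the distribution, sketched on made-up values that imitate a right-skewed count column (these are not the real absences).

```python
import pandas as pd

# Toy stand-in for df['absences']: mostly small counts plus one extreme value
absences = pd.Series([0, 0, 1, 2, 2, 3, 4, 6, 10, 75], name='absences')

# Summary statistics: a mean well above the median hints at a right skew
print(absences.describe())

# Sample skewness: positive values indicate a long right tail
print('skew:', absences.skew())
```

On the real data, a histogram such as `sns.histplot(df['absences'])` shows the same thing visually.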


We could just drop the outliers, or we can apply a log transform to squeeze them closer to the other values. This makes the distribution closer to normal without discarding any rows.

We can't take the natural log of 0, so we add 1 to each value first to make sure there are no 0s.

In [None]:
## Log Transform Absences
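
A minimal sketch of the transform on the same kind of made-up values. `np.log1p` is used here because it computes log(1 + x) in one call, which is the "add one, then take the log" step described above. Writing the result to a new column named `absences_log` is an assumption; overwriting `absences` would work equally well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: mostly small counts plus one extreme value
toy = pd.DataFrame({'absences': [0, 0, 1, 2, 2, 3, 4, 6, 10, 75]})

# np.log1p(x) computes log(1 + x), so a value of 0 maps to 0
toy['absences_log'] = np.log1p(toy['absences'])  # hypothetical column name

# Compare skewness before and after the transform
print('skew before:', toy['absences'].skew())
print('skew after: ', toy['absences_log'].skew())
```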



## PCA

PCA will cause data leakage if we fit it on all rows, because it learns its transformation from every row it sees, including rows that later end up in the test set.

<font color='red'> We must do the PCA transformation AFTER the split </font>

In [None]:
## Split the data
X = df.drop('passed_exam', axis=1)
y = df['passed_exam'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

### Preprocessor

We will one-hot encode the data, scale it, and then PCA transform it.

In [None]:
## define transformers
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

## Select Columns

num_cols = X_train.select_dtypes(include='number').columns
cat_cols = X_train.select_dtypes(include='object').columns

## define tuples

num_tuple = ('Numeric', scaler, num_cols)
cat_tuple = ('Nominal', ohe, cat_cols)

#### preprocessing pipeline

In [None]:
## Define the column transformer


## Combine the column transformer with a PCA model to transform the data.


## Transform the data

Let's compare the number of columns with and without PCA

In [None]:
col_trans.fit_transform(X_train).shape

In [None]:
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)
X_train_proc.shape

## Examine the explained variance of each principal component

In [None]:
## Plot the explained variance ratio of the pca components
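
A sketch of such a plot on randomly generated data. The pipeline `toy_pipe` is a hypothetical stand-in for the notebook's `preprocessor`; it keeps the same scale-then-PCA order but skips the one-hot encoding because the toy data is all numeric.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data: 100 rows, 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
X_toy[:, 1] = X_toy[:, 0] + 0.1 * rng.normal(size=100)

# Scale, then fit a PCA that keeps every component
toy_pipe = make_pipeline(StandardScaler(), PCA())
toy_pipe.fit(X_toy)

# make_pipeline names each step after its lowercased class name, hence 'pca'
ratios = toy_pipe.named_steps['pca'].explained_variance_ratio_
components = np.arange(1, len(ratios) + 1)

# Bars show each component's share; the line shows the running total
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(components, ratios, label='Per component')
ax.plot(components, np.cumsum(ratios), marker='o', color='black',
        label='Cumulative')
ax.set_xlabel('Principal component')
ax.set_ylabel('Explained variance ratio')
ax.set_title('Explained variance by principal component')
ax.legend()
fig.tight_layout()
```

If the notebook's `preprocessor` was built with `make_pipeline`, the same attribute should be reachable as `preprocessor.named_steps['pca'].explained_variance_ratio_` after fitting.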



# Model the engineered data

In [None]:
## Define the model
knn_eng = KNeighborsClassifier()

## Fit the model
knn_eng.fit(X_train_proc, y_train)

## Evaluate the model
evaluate_classification(knn_eng, X_train_proc, y_train, 
                 X_test_proc, y_test)