# Feature Engineering Code-along Solution

In this notebook, we use the following feature engineering strategies:
1. scaling
2. binning
3. log transformation
4. PCA

The data describes students, features about them, and their scores on an exam.  The original data can be found [here on Kaggle](https://www.kaggle.com/code/ramontanoeiro/student-performance).

We will apply some feature engineering to our data and then try to predict whether students will pass the exam.

The minimum passing score is 12.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import timeit

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (precision_score, recall_score, accuracy_score,
                             f1_score, ConfusionMatrixDisplay,
                             classification_report)

from sklearn import set_config
set_config(transform_output='pandas')

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

# Useful Functions

In [None]:
def eval_classification(true, pred, name='Model'):
    
    """ Shows classification_report and confusion matrix
    for classification model predictions.  Outputs a dataframe of metrics"""
  
    print(name, '\n')
    print(classification_report(true, pred))
    ConfusionMatrixDisplay.from_predictions(true, pred)
    plt.show()

    scores = pd.DataFrame()
    scores['Model Name'] = [name]
    scores['Precision'] = [precision_score(true, pred)]
    scores['Recall'] = [recall_score(true, pred)]
    scores['F1 Score'] = [f1_score(true, pred)]
    scores['Accuracy'] = [accuracy_score(true, pred)]
    scores.set_index('Model Name', inplace=True)

    return scores

def evaluate(model, X_train, y_train, X_test, y_test, name='model'):
    """ Evaluate and time a fitted model.
    Returns a dataframe of training and testing metrics and predict times"""

    ## Time the predictions
    ## timeit.default_timer() returns the current clock reading;
    ## timeit.timeit() would instead time an empty statement
    start = timeit.default_timer()
    train_pred = model.predict(X_train)
    end = timeit.default_timer()

    train_time = end - start

    start = timeit.default_timer()
    test_pred = model.predict(X_test)
    end = timeit.default_timer()

    test_time = end - start

    ## Evaluate the predictions (true labels first, then predictions)
    train_scores = eval_classification(y_train, train_pred, name=name + ' train')
    test_scores = eval_classification(y_test, test_pred, name=name + ' test')

    train_scores['predict_time'] = train_time
    test_scores['predict_time'] = test_time

    return pd.concat([train_scores, test_scores])
    

## Data
<br>
Today we will use data about student exam performance from Kaggle.  

[Here is the source](https://www.kaggle.com/code/ramontanoeiro/student-performance)

In [None]:
# load data
import glob
math_files = sorted(glob.glob('Data/Students/*.csv'))
math_files

In [None]:
port_files = sorted(glob.glob('Data/Students/Portugese/*.csv'))
port_files

In [None]:
math = pd.concat([pd.read_csv(file) for file in math_files])
port = pd.concat([pd.read_csv(file) for file in port_files])

df = pd.concat([math, port])

df_backup = df.copy()

df.head()

In [None]:
df.info()

## Explore and clean the data

In [None]:
# Check for duplicates

df.duplicated().sum()

In [None]:
# Check for missing values

df.isna().sum()

In [None]:
# Check summary statistics

df.describe()

In [None]:
df.describe(exclude='number')

# Converting Regression to Classification: Binning the Target

What our stakeholders really want to know is which students will pass and which students will fail the exam.  We also know that a passing score is 12 or higher.  Using this knowledge we can bin the target into passing and failing scores.

## Applying a function

In [None]:
## Define a function


## Apply the Function

## Check Value counts
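
One possible way to fill in this cell, as a minimal sketch. It assumes the exam score lives in a column named `G3` (a hypothetical name; substitute your own score column) and uses a tiny made-up DataFrame in place of `df`. The pass mark of 12 and the `passed_exam` column name come from this notebook.

```python
import pandas as pd

# Toy stand-in for df; 'G3' is an assumed name for the exam score column
demo = pd.DataFrame({'G3': [8, 12, 15, 11, 19]})

PASSING_SCORE = 12  # minimum passing score stated at the top of the notebook


def bin_score(score, threshold=PASSING_SCORE):
    """Return 1 if the score is a pass (>= threshold), otherwise 0."""
    return int(score >= threshold)


# Apply the function to create the binary target
demo['passed_exam'] = demo['G3'].apply(bin_score)

# Check value counts to see how balanced the classes are
counts = demo['passed_exam'].value_counts()
print(counts)
```

If the raw score column stays in the features, the model can read the answer directly, so it would normally be dropped before modeling.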



# Feature Engineering


## Binary Encoding

In [None]:
## Replace 'yes' and 'no' with 1 and 0
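
A minimal sketch of the binary encoding step, using a made-up DataFrame in place of `df`. The column names below are illustrative assumptions, not necessarily the ones in the dataset.

```python
import pandas as pd

# Toy stand-in for df; column names are only illustrative
demo = pd.DataFrame({
    'internet': ['yes', 'no', 'yes'],
    'activities': ['no', 'no', 'yes'],
    'age': [16, 17, 18],
})

# Find the columns whose only values are 'yes' and 'no'
yes_no_cols = [col for col in demo.columns
               if set(demo[col].dropna().unique()) <= {'yes', 'no'}]

# Map 'yes' to 1 and 'no' to 0 in each of those columns
for col in yes_no_cols:
    demo[col] = demo[col].map({'yes': 1, 'no': 0})

print(demo.dtypes)
```

Detecting the yes/no columns programmatically avoids typing out a long column list, but an explicit list works just as well when the columns are known.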


## Combining Columns

In [None]:
## Combine the school and subject


## drop original columns
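
A minimal sketch of combining two columns into one feature. It assumes the columns are named `school` and `subject`; the new column name `school_subject` and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for df; 'school' and 'subject' are assumed column names
demo = pd.DataFrame({
    'school': ['GP', 'GP', 'MS'],
    'subject': ['math', 'portuguese', 'math'],
    'age': [16, 17, 18],
})

# Combine the school and subject into a single categorical feature
demo['school_subject'] = demo['school'] + '_' + demo['subject']

# Drop the original columns so the information is not duplicated
demo = demo.drop(columns=['school', 'subject'])

print(demo)
```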


## Reducing Outliers: Log transformation

In [None]:
## Check distribution of absences
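
A minimal sketch of checking the distribution, assuming the column is named `absences`. The values below are made up for illustration; in the notebook you would use `df['absences']` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for df['absences']; values are made up
absences = pd.Series([0, 0, 1, 2, 2, 3, 4, 6, 10, 75], name='absences')

# Histogram to see the shape of the distribution
fig, ax = plt.subplots()
ax.hist(absences, bins=20)
ax.set_xlabel('absences')
ax.set_ylabel('count')

# A positive skew value indicates a long right tail
skew = absences.skew()
print(f'skew: {skew:.2f}')
```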


We could just drop the outliers, or we can apply a log transform to squeeze them closer to the other values.  This pulls the distribution closer to normal without dropping any rows.

The natural log of 0 is undefined, so we add one to each value before taking the log.

In [None]:
## Log Transform Absences
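
A minimal sketch of the log transform, assuming the column is named `absences` and using made-up values. `np.log1p(x)` computes `log(1 + x)`, which is the "add one, then take the log" step described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are made up
demo = pd.DataFrame({'absences': [0, 0, 1, 2, 2, 3, 4, 6, 10, 75]})

# log(1 + x): zeros map to 0 instead of negative infinity
demo['absences_log'] = np.log1p(demo['absences'])

# Compare skew before and after the transform
skew_before = demo['absences'].skew()
skew_after = demo['absences_log'].skew()
print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```

In the notebook itself you might overwrite the original column rather than add a new one; keeping both here just makes the comparison easy.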



## PCA

PCA learns its components from the rows it is fit on, so fitting it on all rows would leak information from the test set into training.

<font color='red'> We must fit the PCA transformation AFTER the split, on the training data only </font>

In [None]:
## Split the data
X = df.drop('passed_exam', axis=1)
y = df['passed_exam'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

### Preprocessor

We will scale the numeric columns, one-hot encode the categorical columns, and then apply PCA to the result.

In [None]:
## define transformers
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

## Select Columns

num_cols = X_train.select_dtypes(include='number').columns
cat_cols = X_train.select_dtypes(include='object').columns

## define tuples

num_tuple = ('Numeric', scaler, num_cols)
cat_tuple = ('Nominal', ohe, cat_cols)

#### preprocessing pipeline

In [None]:
## Define the column transformer


## Combine the column transformer with a PCA model to transform the data.


## Transform the data

Let's compare the number of columns with and without PCA.

In [None]:
col_trans.fit_transform(X_train).shape

In [None]:
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)
X_train_proc.shape

## Examine the explained variance of each principal component

In [None]:
## Plot the explained variance ratio of the pca components
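
A minimal sketch of an explained variance plot. It fits a standalone PCA on random toy data so the block runs on its own; in the notebook you would read `explained_variance_ratio_` from the fitted PCA step instead (for a pipeline built with `make_pipeline`, that step is available as `preprocessor.named_steps['pca']`).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Toy data standing in for the scaled, encoded training set
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
# Make one column strongly correlated with another so PCA has structure to find
X_demo[:, 1] = X_demo[:, 0] * 2 + rng.normal(scale=0.1, size=100)

pca = PCA()
pca.fit(X_demo)

# Share of total variance captured by each component, and the running total
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

fig, ax = plt.subplots()
components = np.arange(1, len(ratios) + 1)
ax.bar(components, ratios, label='per component')
ax.plot(components, cumulative, marker='o', color='black', label='cumulative')
ax.set_xlabel('principal component')
ax.set_ylabel('explained variance ratio')
ax.legend()
```

The cumulative line makes it easy to read off how many components are needed to keep a chosen share of the variance.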



# Model the engineered data

In [None]:
## Define the model
knn_eng = KNeighborsClassifier()

## Fit the model
knn_eng.fit(X_train_proc, y_train)

## Evaluate the model
scores = evaluate(knn_eng, X_train_proc, y_train, 
                 X_test_proc, y_test,
                 name='KNN Engineered Data')
scores