# **Feature Engineering**

## Objectives

* Engineer features for classification and regression.

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate a list of variables to engineer

## Conclusions

* Feature Engineering Transformers
    * Ordinal categorical encoding: `['Parental_Involvement', 'Access_to_Resources', 'Extracurricular_Activities', 'Motivation_Level', 'Internet_Access', 'Family_Income', 'Teacher_Quality', 'School_Type', 'Peer_Influence', 'Learning_Disabilities', 'Parental_Education_Level', 'Distance_from_Home', 'Gender']`
    * Yeo-Johnson: `['Attendance', 'Tutoring_Sessions', 'Exam_Score']`

* A transformer has been developed to create a new column, `Improved_Score`, derived from the difference between `Exam_Score` and `Previous_Scores`. The column holds 1 when the exam score is higher than the previous score (the student improved) and 0 otherwise. Once the column is added, `Previous_Scores` and `Exam_Score` are removed from the DataFrame.
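
The rule described above can be sanity-checked on a few rows. The sketch below is a vectorized equivalent of that logic, not the notebook's transformer itself, and the three rows are invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Previous_Scores": [65, 60, 80],
    "Exam_Score": [70, 60, 55],
})

# 1 when the exam score is strictly higher than the previous score, else 0
toy["Improved_Score"] = (toy["Exam_Score"] > toy["Previous_Scores"]).astype(int)

# Drop the two source columns, as described above
result = toy.drop(columns=["Previous_Scores", "Exam_Score"])
print(result["Improved_Score"].tolist())
```

Note that a tie (second row) counts as no improvement, because only a strictly positive difference maps to 1.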


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Cleaned Data

## Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

## Test Set

In [None]:
test_set_path = "outputs/datasets/cleaned/TestSetCleaned.csv"
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

---

# Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet)
pandas_report.to_notebook_iframe()

## Summary of Data Exploration

The PPS heatmap and its strongest values do not differ from the correlation map in the previous notebook, so no further correlation or PPS analysis is made in this notebook.

* There is a strong correlation between Exam_Score and Attendance.
* Internet_Access is unbalanced and, according to the heatmap, has little to no correlation with Exam_Score.
* Learning_Disabilities is unbalanced and, according to the heatmap, has little to no correlation with Exam_Score.
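
One quick way to quantify how unbalanced a categorical column is, outside the profiling report, is to look at the share of each category. The sketch below uses an invented toy column (not the real distribution), and `category_shares` is a hypothetical helper name.

```python
import pandas as pd


def category_shares(series):
    """Return the relative frequency of each category, largest first."""
    return series.value_counts(normalize=True)


# Toy column, invented for illustration (not the real distribution)
toy = pd.Series(["Yes"] * 9 + ["No"], name="Internet_Access")
shares = category_shares(toy)
print(shares)
```

On the real data the same call would be `category_shares(TrainSet['Internet_Access'])`, assuming the column has not been encoded yet.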

### Potential Feature Engineering Spreadsheet

| Feature                    | Type   | Categorical <br> Encoding | Numerical <br> Transformation | Smart Correlation <br> Selection |
|----------------------------|--------|:---------------------:|:-------------------------:|:----------------------------:|
| Parental_Involvement       | object |           X           |                           |               X              |
| Access_to_Resources        | object |           X           |                           |               X              |
| Extracurricular_Activities | object |           X           |                           |               X              |
| Motivation_Level           | object |           X           |                           |               X              |
| Internet_Access            | object |           X           |                           |               X              |
| Family_Income              | object |           X           |                           |               X              |
| Teacher_Quality            | object |           X           |                           |               X              |
| School_Type                | object |           X           |                           |               X              |
| Peer_Influence             | object |           X           |                           |               X              |
| Learning_Disabilities      | object |           X           |                           |               X              |
| Parental_Education_Level   | object |           X           |                           |               X              |
| Distance_from_Home         | object |           X           |                           |               X              |
| Gender                     | object |           X           |                           |               X              |
|                            |        |                       |                           |                              |
| Hours_Studied              | int64  |                       |             X             |               X              |
| Attendance                 | int64  |                       |             X             |               X              |
| Sleep_Hours                | int64  |                       |             X             |               X              |
| Previous_Scores            | int64  |                       |             X             |               X              |
| Tutoring_Sessions          | int64  |                       |             X             |               X              |
| Physical_Activity          | int64  |                       |             X             |               X              |
| Exam_Score                 | int64  |                       |             X             |               X              |
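
The split between categorical and numerical features in the table can also be derived from the column dtypes rather than typed by hand. This is a minimal sketch on a toy frame with invented values; only the two column names come from the dataset.

```python
import pandas as pd

# Toy frame: two of the dataset's column names, invented values
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Hours_Studied": [23, 19, 24],
})

# Numerical columns are candidates for numerical transformation,
# everything else is a candidate for categorical encoding
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols, numerical_cols)
```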



---

# Feature Engineering

In [7]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, color='#432371',
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", rvalue=True, plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


In [8]:
variables_engineering_object = ['Parental_Involvement', 'Access_to_Resources',
                                'Extracurricular_Activities', 'Motivation_Level',
                                'Internet_Access', 'Family_Income', 'Teacher_Quality',
                                'School_Type', 'Peer_Influence', 'Learning_Disabilities',
                                'Parental_Education_Level', 'Distance_from_Home', 'Gender']

variables_engineering_num = ['Hours_Studied', 'Attendance', 'Sleep_Hours',
                             'Previous_Scores', 'Tutoring_Sessions', 
                             'Physical_Activity', 'Exam_Score']


In [None]:
df_engineering_object = TrainSet[variables_engineering_object].copy()
df_engineering_object.head(3)

In [None]:
df_engineering_num = TrainSet[variables_engineering_num].copy()
df_engineering_num.head(3)

In [None]:
df_engineering_object = FeatureEngineeringAnalysis(df=df_engineering_object, analysis_type='ordinal_encoder')

In [None]:
df_engineering_num = FeatureEngineeringAnalysis(df=df_engineering_num, analysis_type='numerical')

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_engineering_object)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
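
To make the encoding step less of a black box, here is a plain-pandas sketch of the idea behind an arbitrary ordinal encoding: learn a category-to-integer mapping on the train set, then reuse it on the test set. It assumes integers are assigned in order of first appearance, which is an assumption of this sketch and not a statement about feature_engine's internals. The data and the helper `fit_ordinal_mapping` are hypothetical.

```python
import pandas as pd


def fit_ordinal_mapping(series):
    """Map each category to an integer in order of first appearance."""
    return {category: code for code, category in enumerate(series.unique())}


# Toy values, invented for illustration
train = pd.Series(["Low", "High", "Medium", "Low"], name="Motivation_Level")
test = pd.Series(["Medium", "Low"], name="Motivation_Level")

# Learn the mapping on the train set only, then apply it to both sets
mapping = fit_ordinal_mapping(train)
train_encoded = train.map(mapping)
test_encoded = test.map(mapping)

print(mapping)
```

The resulting integers carry no ranking ("High" lands between "Low" and "Medium" here), and a category that appears only in the test set would become NaN in this sketch.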

Make a copy of the training dataframe

In [None]:
df_engineering = TrainSet.copy()
df_engineering.head(3)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel6 = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")
corr_sel6.fit_transform(df_engineering)

corr_sel7 = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.7, selection_method="variance")
corr_sel7.fit_transform(df_engineering)

print(f'Correlated feature sets with threshold 0.6: {corr_sel6.correlated_feature_sets_}')
print(f'Features to drop: {corr_sel6.features_to_drop_}')
print('--------------------------')
print(f'Correlated feature sets with threshold 0.7: {corr_sel7.correlated_feature_sets_}')
print(f'Features to drop: {corr_sel7.features_to_drop_}')


With `threshold=0.6`, SmartCorrelatedSelection finds a correlated pair: Attendance and Exam_Score. With `selection_method="variance"` the transformer keeps the feature with the highest variance in each correlated group and drops the others, so here it would drop Exam_Score, which is the target of the business case. Raising the threshold to `0.7` avoids this, but then no features are dropped at all. SmartCorrelatedSelection is therefore not applied.
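
The effect of the threshold can be illustrated without the transformer. The sketch below lists the column pairs whose absolute Spearman correlation exceeds a threshold. It is a simplified stand-in for the grouping step, not feature_engine's implementation. The toy values and the names `correlated_pairs`, `feat_a`, `feat_b` and `target` are invented, chosen so that one pair has a correlation between 0.6 and 0.7.

```python
import pandas as pd


def correlated_pairs(df, threshold):
    """Return column pairs whose absolute Spearman correlation exceeds threshold."""
    corr = df.corr(method="spearman").abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if corr.loc[left, right] > threshold:
                pairs.append((left, right))
    return pairs


# Toy data, invented for illustration
toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5, 6],
    "target": [3, 1, 2, 6, 4, 5],
    "feat_b": [3, 1, 5, 2, 6, 4],
})

print(correlated_pairs(toy, threshold=0.6))
print(correlated_pairs(toy, threshold=0.7))
```

In this toy frame the pair `feat_a`/`target` is flagged at 0.6 but not at 0.7, which mirrors the behaviour described above.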

## Feature Engineering Spreadsheet Summary

| Feature                    | Type   | Categorical <br> Encoding | Numerical <br> Transformation | Smart Correlation <br> Selection |
|----------------------------|--------|:---------------------:|:-------------------------:|:----------------------------:|
| Parental_Involvement       | object |           X           |                           |                             |
| Access_to_Resources        | object |           X           |                           |                             |
| Extracurricular_Activities | object |           X           |                           |                             |
| Motivation_Level           | object |           X           |                           |                             |
| Internet_Access            | object |           X           |                           |                             |
| Family_Income              | object |           X           |                           |                             |
| Teacher_Quality            | object |           X           |                           |                             |
| School_Type                | object |           X           |                           |                             |
| Peer_Influence             | object |           X           |                           |                             |
| Learning_Disabilities      | object |           X           |                           |                             |
| Parental_Education_Level   | object |           X           |                           |                             |
| Distance_from_Home         | object |           X           |                           |                             |
| Gender                     | object |           X           |                           |                             |
|                            |        |                       |                           |                             |
| Hours_Studied              | int64  |                       |                           |                             |
| Attendance                 | int64  |                       |             X             |                             |
| Sleep_Hours                | int64  |                       |                           |                             |
| Previous_Scores            | int64  |                       |                           |                             |
| Tutoring_Sessions          | int64  |                       |             X             |                             |
| Physical_Activity          | int64  |                       |                           |                             |
| Exam_Score                 | int64  |                       |             X             |                             |
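
For reference, the Yeo-Johnson transform marked in the table can be written out directly. Unlike the log and Box-Cox transforms, which need strictly positive values, it is defined for zeros and negative values as well. The sketch below applies the formula with a hand-picked lambda; a fitted transformer would estimate lambda from the data instead. The function name and the toy values are hypothetical.

```python
import numpy as np


def yeo_johnson(values, lmbda):
    """Apply the Yeo-Johnson transform with a fixed lambda."""
    x = np.asarray(values, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Branch for non-negative values
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Branch for negative values
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


toy_values = [0, 1, 3]  # invented values; note the zero
print(yeo_johnson(toy_values, lmbda=0.5))
```

With `lmbda=1` the transform leaves the data unchanged, and with `lmbda=0` it reduces to `log(x + 1)` for non-negative values.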



---

# Create Transformer

## Create improved column

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin

class CreateImprovedScoreColumn(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # To avoid modifying the original data
        new_value = X['Exam_Score'] - X['Previous_Scores']
        X['Improved_Score'] = new_value

        # Replace negative and zero values with 0, positive values with 1
        X['Improved_Score'] = X['Improved_Score'].apply(lambda x: 1 if x > 0 else 0)
        
        # Drop the original columns Previous_Scores and Exam_Score
        X.drop(['Previous_Scores', 'Exam_Score'], axis=1, inplace=True)

        return X

In [None]:
Transformer_improve = CreateImprovedScoreColumn()
df_ex = Transformer_improve.fit_transform(TrainSet)
df_ex.head(3)


In [None]:
pandas_report = ProfileReport(df=df_ex)
pandas_report.to_notebook_iframe()