# **Feature Engineering**

## Objectives

**Perform Business requirement 2 user story task: Feature engineering ML tasks**
* Perform categorical encoding on categorical features.
* Perform feature selection to distill the most significant features, and also remove redundant features.
* Carry out feature scaling/transformations to normalise the distributions of remaining features.
* Create the data cleaning and feature engineering pipeline.

## Inputs
* cleaned train set: outputs/datasets/ml/cleaned/train_set.csv
* cleaned test set: outputs/datasets/ml/cleaned/test_set.csv

## Outputs


---

## Change working directory

Change the working directory from the notebook's folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load cleaned train and test datasets

In [None]:
import pandas as pd

train_set_df = pd.read_csv(filepath_or_buffer='outputs/datasets/ml/cleaned/train_set.csv')
test_set_df = pd.read_csv(filepath_or_buffer='outputs/datasets/ml/cleaned/test_set.csv')

---

## Categorical feature encoding

In the sale price correlation study notebook, the categorical features were encoded using an ordinal encoder. This was deemed most suitable because all the categorical features are ordinal, with an obvious ordering based on a rating.

The exact same encoding will be used for the cleaned train and test sets. 
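As a minimal illustration of how an explicit category order maps to integer codes, the sketch below encodes a toy column. The column name `KitchenQual` and the toy values are assumptions made for the example and are not taken from the datasets.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: the column name and values are illustrative assumptions
toy_df = pd.DataFrame({'KitchenQual': ['TA', 'Ex', 'Po', 'Gd', 'Fa']})

# Categories listed from lowest to highest rating; the position in this
# list becomes the integer code
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

toy_encoder = OrdinalEncoder(categories=[quality_order], dtype=np.int64)
toy_codes = toy_encoder.fit_transform(toy_df).ravel().tolist()
print(toy_codes)  # [2, 4, 0, 3, 1]
```

Because the order is given explicitly, the codes do not depend on which values happen to appear in the data, so the train and test sets receive identical mappings.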

In [None]:
train_set_categorical_df = train_set_df.select_dtypes(include='object')
print(train_set_categorical_df.columns)
test_set_categorical_df = test_set_df.select_dtypes(include='object')
test_set_categorical_df.columns

In [None]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Designating the ordered categories, from lowest to highest rating
bsmt_fin_type1_cat = np.array(['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'])
bsmt_exposure_cat = np.array(['None', 'No', 'Mn', 'Av', 'Gd'])
garage_finish_cat = np.array(['None', 'Unf', 'RFn', 'Fin'])
kitchen_quality_cat = np.array(['Po', 'Fa', 'TA', 'Gd', 'Ex'])

# The order of this list must match the column order of train_set_categorical_df
categories = [bsmt_exposure_cat, bsmt_fin_type1_cat, garage_finish_cat, kitchen_quality_cat]
encoder = OrdinalEncoder(categories=categories, dtype='int64')
encoder.set_output(transform='pandas')

# fitting and transforming each set
train_set_df[train_set_categorical_df.columns] = encoder.fit_transform(X=train_set_categorical_df)
print(train_set_df[train_set_categorical_df.columns].head())
test_set_df[test_set_categorical_df.columns] = encoder.transform(X=test_set_categorical_df)
test_set_df[test_set_categorical_df.columns].head()


---

## Initial feature selection

An initial feature selection is performed before feature scaling, for two main reasons. First, there are 23 features, so scaling all of them would be time consuming, and pointless for the several features that have little significance with regard to the target. This was already discovered in the sale price correlation study notebook, where a group of significant features was generated using correlation tests and predictive power scores (PPS) applied to the whole dataset. Second, this initial selection does not use an ML model to establish the significant features, so scaling is not important at this stage.

Instead, an approach similar to that of the sale price correlation study is used: correlation tests, this time applied through scikit-learn's `SelectKBest` transformer. Again, this is not affected by scaling.
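The claim that a rank-based correlation ignores scaling can be checked on synthetic data. The sketch below is only an illustration: it uses `scipy.stats.spearmanr` rather than the pingouin call used in this notebook, and all names and data are made up. Spearman's coefficient depends only on ranks, so any strictly increasing transformation of a feature leaves it unchanged.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic, skewed feature and a noisy target (illustrative data only)
rng = np.random.default_rng(42)
x = rng.lognormal(size=300)
y = 3 * x + rng.normal(size=300)

# Spearman correlation on the raw feature
rho_raw, _ = spearmanr(x, y)

# Standardising and log-transforming are both strictly increasing,
# so the ranks of x, and therefore the coefficient, stay the same
x_standardised = (x - x.mean()) / x.std()
rho_standardised, _ = spearmanr(x_standardised, y)
rho_log, _ = spearmanr(np.log(x), y)

print(rho_raw, rho_standardised, rho_log)
```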

After this is performed, feature-feature correlations will be assessed, and only a single feature will be kept from each strongly correlated group in order to reduce redundancy, which was seen to exist in the significant feature EDA notebook. A decision tree algorithm will be used to choose which feature to keep, as this is unaffected by scaling.

Feature scaling will then be performed on the most significant features, before a further feature selection that relies on scale-sensitive ML models. Since scaling will already have been performed by then, this should pose no issues.

**SelectKBest feature selection**

The SelectKBest feature selection will be performed twice, once using the Spearman correlation test, then using the phi_k correlation test.
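`SelectKBest` accepts any callable that takes the feature matrix and the target and returns either an array of scores or a `(scores, p_values)` pair; it then keeps the `k` highest-scoring features. A minimal sketch on synthetic data follows, assuming `scipy.stats.spearmanr` as the correlation test and absolute coefficients as scores. The function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest

def abs_spearman_score(X, y):
    """Return (scores, p_values) with one entry per column of X."""
    X = np.asarray(X)
    scores, p_values = [], []
    for column_index in range(X.shape[1]):
        rho, p_value = spearmanr(X[:, column_index], y)
        # Absolute value, so strong negative correlations also rank highly
        scores.append(abs(rho))
        p_values.append(p_value)
    return np.array(scores), np.array(p_values)

# Synthetic data: two informative columns and one noise column
rng = np.random.default_rng(0)
y_demo = rng.normal(size=200)
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.1, size=200),   # strongly positive
    -y_demo + rng.normal(scale=0.1, size=200),  # strongly negative
    rng.normal(size=200),                       # unrelated noise
])

selector = SelectKBest(score_func=abs_spearman_score, k=2).fit(X_demo, y_demo)
demo_support = selector.get_support()
print(demo_support)  # boolean mask of the kept columns
```

Taking the absolute value matters here: with signed coefficients, the negatively correlated column would score lowest and be discarded despite being as informative as the positive one.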

Creating the score functions

In [None]:
import pingouin as pg
from sklearn.feature_selection import SelectKBest

def spearman_score_function(feature_array, target_array):
    """
    Calculates Spearman correlation coefficients for the train or test set.

    To be used as the score function in the sklearn.feature_selection.SelectKBest function.

    Args:
        feature_array: array-like, containing train or test set feature data.
        target_array: array-like, containing train or test set target data.

    Returns a tuple of the absolute Spearman r values series, and the Spearman p values series for the dataset.
    """
    # SelectKBest passes plain arrays, so the column names are recovered from
    # train_set_df; this assumes 'SalePrice' is its last column.
    column_names = train_set_df.columns.to_list()
    df = pd.DataFrame(data=feature_array, columns=column_names[0:-1])
    df['SalePrice'] = target_array
    spearman_df = df.pairwise_corr(columns=['SalePrice'], method='spearman')
    print('Spearman coefficients')
    print(spearman_df.sort_values(by='r')[['X', 'Y', 'r', 'p-unc']])
    # SelectKBest keeps the highest scores, so the absolute value is returned to
    # rank strong negative correlations alongside strong positive ones.
    return (spearman_df['r'].abs(), spearman_df['p-unc'])

In [None]:
import phik
from phik.phik import phik_matrix
from phik.report import plot_correlation_matrix
from phik.significance import significance_matrix

def phik_score_function(feature_array, target_array):
    """
    Calculates phi_k correlation coefficients for the train or test set.

    To be used as the score function in the sklearn.feature_selection.SelectKBest function.

    Args:
        feature_array = array-like, containing train or test set feature data.
        target_array = array-like, containing train or test set target data

    Returns tuple of the phi_k correlation values series, and the significance values series for the dataset.
    """
    column_names = train_set_df.columns.to_list()
    df = pd.DataFrame(data=feature_array, columns=column_names[0:-1])
    df['SalePrice'] = target_array
   
    phik_matrix_df = phik_matrix(df)[['SalePrice']].drop('SalePrice', axis=0)
    significance_matrix_df = significance_matrix(df)[['SalePrice']].drop('SalePrice', axis=0)
    print('phi_k coefficients')
    print(phik_matrix_df)

    return (phik_matrix_df['SalePrice'], significance_matrix_df['SalePrice'])

The final selection is the union of the features selected using each score function.

In [None]:
def filter_out_best_features(df):
    """
    Applies a SelectKBest transformer, once with a spearman score function, and then with a phi_k score function.

    Args:
        df: train or test set dataframe containing features and a target.

    Returns a list of the union of the k-best-selected features generated using each score function.
    """
    features_df = df.drop('SalePrice', axis=1)
    target = df['SalePrice']

    # SelectKBest keeps k=10 features by default
    select_k_best = SelectKBest(score_func=spearman_score_function)
    select_k_best.set_output(transform='pandas')
    spearman_selected_df = select_k_best.fit_transform(features_df, target)

    select_k_best = SelectKBest(score_func=phik_score_function)
    select_k_best.set_output(transform='pandas')
    phik_selected_df = select_k_best.fit_transform(features_df, target)

    # dict.fromkeys removes duplicates while keeping a reproducible column order
    selected = spearman_selected_df.columns.to_list() + phik_selected_df.columns.to_list()
    return list(dict.fromkeys(selected))
        

In [None]:
selected_features = filter_out_best_features(train_set_df)

**Best features list**

In [None]:
print('Number of features:', len(selected_features))
selected_features

Most of the selected features match those selected in the sale price correlation study notebook, which used the whole dataset.

The train set is now manually transformed by keeping only the selected features. The same features are also kept in the test set; selecting the best features separately using the test set would instead cause data leakage.

In [None]:
train_set_df = train_set_df[selected_features + ['SalePrice']]
test_set_df = test_set_df[selected_features + ['SalePrice']]

### Handling redundant features

A single feature will be kept from each group of highly correlated features, with the groups identified using the Spearman correlation test.
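To see which features such a selection would treat as redundant, the pairwise Spearman correlations can be inspected directly. The helper below is a hypothetical sketch on synthetic data, not the feature_engine implementation; its default threshold of 0.8 mirrors the one used in this section.

```python
import numpy as np
import pandas as pd

def find_correlated_pairs(features_df, threshold=0.8):
    """Return (feature_a, feature_b, |rho|) for each pair above the threshold."""
    corr_df = features_df.corr(method='spearman').abs()
    names = corr_df.columns.to_list()
    pairs = []
    for position, first in enumerate(names):
        # Only look at the columns after `first`, so each pair appears once
        for second in names[position + 1:]:
            rho = corr_df.loc[first, second]
            if rho > threshold:
                pairs.append((first, second, round(float(rho), 3)))
    return pairs

# Synthetic data: two near-duplicate columns and one unrelated column
rng = np.random.default_rng(1)
area = rng.uniform(50, 300, size=200)
demo_df = pd.DataFrame({
    'area': area,
    'area_with_noise': area + rng.normal(scale=5, size=200),
    'unrelated': rng.normal(size=200),
})

demo_pairs = find_correlated_pairs(demo_df)
print(demo_pairs)
```

This only lists the redundant pairs; deciding which member of each group to keep is a separate step, done here with a decision tree estimator.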

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.tree import DecisionTreeRegressor

estimator = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=30)
feature_names = train_set_df.drop('SalePrice', axis=1).columns.to_list()
smart_correlated_transformer = SmartCorrelatedSelection(
    variables=feature_names, method='spearman', threshold=0.8,
    selection_method='model_performance', estimator=estimator, scoring='r2', cv=5)

Fitting to the train set

In [None]:
smart_correlated_transformer.fit(X=train_set_df, y=train_set_df['SalePrice'])

In [None]:
for statement in [smart_correlated_transformer.variables,
                  smart_correlated_transformer.correlated_feature_sets_,
                  smart_correlated_transformer.features_to_drop_]:
    print(statement)

Transforming the train set

In [None]:
train_set_df = smart_correlated_transformer.transform(train_set_df)


Transforming the test set

In [None]:
test_set_df = smart_correlated_transformer.transform(test_set_df)

---