# Feature Engineering Notebook

## Objectives  
In this notebook, we explore the cleaned data to understand how the house attributes relate to the sale price. The goal is to identify the features most useful for predicting house prices and to flag data issues, such as missing values or outliers, that must be addressed during data preprocessing.

## Inputs  
- `outputs/datasets/cleaned/TrainSetCleaned.csv`  
- `outputs/datasets/cleaned/TestSetCleaned.csv`  

## Outputs  
- Identified features correlated with house sale prices  
- Visualized relationships between features and the target variable (sale price)  
- Insights on the most relevant features to predict sale prices  

## Conclusions  
During this exploration phase, we will:  
- Perform correlation analysis to find the most influential features related to house sale price.  
- Visualize these relationships to better understand how each feature contributes to the price.  
- Identify and handle potential data issues (e.g., missing values or outliers) that could affect the model's performance.  
- Prepare the data for the next steps in feature engineering and model training.  
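
The correlation step listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual analysis: the values are made up, and only the column names (`GrLivArea`, `OverallQual`, `LotArea`, `SalePrice`) are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
gr_liv_area = rng.uniform(500, 3000, n)
overall_qual = rng.integers(1, 11, n)
lot_area = rng.uniform(2000, 20000, n)  # deliberately unrelated to the price
sale_price = 50 * gr_liv_area + 15000 * overall_qual + rng.normal(0, 10000, n)

df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "OverallQual": overall_qual,
    "LotArea": lot_area,
    "SalePrice": sale_price,
})

# Spearman correlation of each feature with the target, strongest first.
corr_with_target = (
    df.corr(method="spearman")["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Ranking by absolute value keeps strong negative relationships near the top of the list, where a plain descending sort would push them to the bottom.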


In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/housingprices/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/housingprices'

In [4]:
import pandas as pd
TrainSetCleaned = pd.read_csv('outputs/datasets/cleaned/TrainSetCleaned.csv')  


In [5]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
    
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    """ Stop early if the dataset contains any missing values """
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=[
                  '#432371'], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
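
The diagnostic plots produced by the functions above are judged by eye. As a complement, the effect of each numerical transformation can be summarized with a single skewness value per candidate. The sketch below assumes a hypothetical, strictly positive, right-skewed variable and uses plain NumPy and pandas in place of the `feature_engine` transformers. The exponent 0.5 for the power transformation is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable (log-normal), standing in for a feature such as LotArea.
rng = np.random.default_rng(42)
x = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=1000), name="feature")

# Candidate transformations; all of these require strictly positive values.
candidates = {
    "original": x,
    "log_e": np.log(x),
    "log_10": np.log10(x),
    "reciprocal": 1 / x,
    "power_0.5": np.sqrt(x),
}

# Skewness close to 0 indicates a more symmetric distribution.
skewness = pd.Series({name: values.skew() for name, values in candidates.items()})
print(skewness.round(3))
```

The natural and base-10 logarithms differ only by a constant factor, so they yield the same skewness. Comparing both mainly confirms that either can be used.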

In [6]:
variables_engineering= [
    '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinSF1',
    'BsmtFinType1', 'BsmtUnfSF', 'GarageArea', 'GarageFinish', 'GarageYrBlt',
    'GrLivArea', 'KitchenQual', 'LotArea', 'LotFrontage', 'MasVnrArea',
    'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF',
    'YearBuilt', 'YearRemodAdd'
]

variables_engineering

['1stFlrSF',
 '2ndFlrSF',
 'BedroomAbvGr',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinType1',
 'BsmtUnfSF',
 'GarageArea',
 'GarageFinish',
 'GarageYrBlt',
 'GrLivArea',
 'KitchenQual',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt',
 'YearRemodAdd']

In [7]:
testset = pd.read_csv('outputs/datasets/cleaned/TestSetCleaned.csv')

df_engineering = testset[variables_engineering].copy()

df_engineering.head(3)

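|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | KitchenQual | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2515 | 0.0 | 4.0 | No | 1219 | Rec | 816 | 484 | NaN | 1975.0 | ... | TA | 32668 | NaN | NaN | 0 | 3 | 6 | 2035 | 1957 | 1975 |
| 1 | 958 | 620.0 | 3.0 | No | 403 | BLQ | 238 | 240 | Unf | 1941.0 | ... | Fa | 9490 | 79.0 | 0.0 | 0 | 7 | 6 | 806 | 1941 | 1950 |
| 2 | 979 | 224.0 | 3.0 | No | 185 | LwQ | 524 | 352 | Unf | 1950.0 | ... | Gd | 7015 | NaN | 161.0 | 0 | 4 | 5 | 709 | 1950 | 1950 |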


In [8]:
import pandas as pd

TrainSet = pd.read_csv('outputs/datasets/cleaned/TrainSetCleaned.csv') 
TestSet = pd.read_csv('outputs/datasets/cleaned/TestSetCleaned.csv') 


In [9]:
from sklearn.preprocessing import StandardScaler

variables_engineering = [
    '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'LotArea', 'MasVnrArea',
    'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF',
    'YearBuilt', 'YearRemodAdd'
]

# Preview the scaling on a copy of the selected training columns
df_engineering = TrainSetCleaned[variables_engineering].copy()

scaler = StandardScaler()
df_engineering_scaled = scaler.fit_transform(df_engineering)

print(df_engineering_scaled[:3])

# Fit the scaler on the train set only, then reuse the fitted parameters on the test set
TrainSet[variables_engineering] = scaler.fit_transform(TrainSet[variables_engineering])
TestSet[variables_engineering] = scaler.transform(TestSet[variables_engineering])

print("* Numerical transformation - scaling done!")





[[ 1.78757018 -0.79992561  0.60188649  0.10321202  1.908672    0.87411633
  -0.51304058  2.13150648  1.86572881  1.18803167  1.07891405]
 [-0.71540986 -0.79992561 -1.21671763 -0.37288066 -0.56671646 -0.70046141
  -0.51304058 -0.79485211 -0.38726187 -0.29250097 -1.09754814]
 [-0.52782035 -0.79992561 -1.08041967  0.25891881 -0.56671646 -0.70046141
   1.27838363 -0.79485211 -0.43096212 -1.64143071  1.03054822]]
* Numerical transformation - scaling done!
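
The cell above fits the scaler on the train set and only transforms the test set. The arithmetic behind that split can be sketched with pandas alone. The two small frames below are hypothetical, and the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import pandas as pd

# Hypothetical miniature train/test split with a single feature.
train = pd.DataFrame({"GrLivArea": [900.0, 1200.0, 1500.0, 1800.0]})
test = pd.DataFrame({"GrLivArea": [1000.0, 2400.0]})

# The statistics come from the training set only.
mean = train.mean()
std = train.std(ddof=0)  # population standard deviation, as in StandardScaler

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the train statistics; do not refit on test

print(train_scaled["GrLivArea"].tolist())
print(test_scaled["GrLivArea"].tolist())
```

The scaled train column has mean 0 and unit variance by construction. The scaled test column generally does not, which is expected: refitting on the test set would leak information about it into preprocessing.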


In [10]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.selection import SmartCorrelatedSelection

TrainSetCleaned = pd.read_csv('outputs/datasets/cleaned/TrainSetCleaned.csv')
TestSet = pd.read_csv('outputs/datasets/cleaned/TestSetCleaned.csv')

variables_engineering = [
    '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'LotArea', 'MasVnrArea',
    'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF',
    'YearBuilt', 'YearRemodAdd'
]


df_engineering = TrainSetCleaned[variables_engineering].copy()
print("Original data (train) - first rows:")
print(df_engineering.head(3))

scaler = StandardScaler()
df_engineering_scaled = scaler.fit_transform(df_engineering)

print("\nScaled values (train) - first rows:")
print(df_engineering_scaled[:3])

TrainSetCleaned[variables_engineering] = df_engineering_scaled
TestSet[variables_engineering] = scaler.transform(TestSet[variables_engineering])

print("\n* Numerical transformation - scaling done!")



Original data (train) - first rows:
   1stFlrSF  2ndFlrSF  GrLivArea  LotArea  MasVnrArea  OpenPorchSF  \
0      1828       0.0       1828    11694       452.0          108   
1       894       0.0        894     6600         0.0            0   
2       964       0.0        964    13360         0.0            0   

   OverallCond  OverallQual  TotalBsmtSF  YearBuilt  YearRemodAdd  
0            5            9         1822       2007          2007  
1            5            5          894       1962          1962  
2            7            5          876       1921          2006  

Scaled values (train) - first rows:
[[ 1.78757018 -0.79992561  0.60188649  0.10321202  1.908672    0.87411633
  -0.51304058  2.13150648  1.86572881  1.18803167  1.07891405]
 [-0.71540986 -0.79992561 -1.21671763 -0.37288066 -0.56671646 -0.70046141
  -0.51304058 -0.79485211 -0.38726187 -0.29250097 -1.09754814]
 [-0.52782035 -0.79992561 -1.08041967  0.25891881 -0.56671646 -0.70046141
   1.27838363 -0.

In [11]:
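from feature_engine.selection import SmartCorrelatedSelection

# TrainSet was standardized in cell [9] and still contains the SalePrice target,
# so the target can appear in the correlated feature sets reported below.
df_engineering = TrainSet.copy()

# Group features whose Spearman correlation exceeds the 0.6 threshold and
# keep the highest-variance feature from each group.
corr_sel = SmartCorrelatedSelection(
    variables=None,
    method="spearman",
    threshold=0.6,
    selection_method="variance"
)

corr_sel.fit_transform(df_engineering)

print("Correlated feature sets:", corr_sel.correlated_feature_sets_)

print("Features to drop:", corr_sel.features_to_drop_)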


Correlated feature sets: [{'GrLivArea', 'YearBuilt', 'OverallQual', 'SalePrice', 'GarageArea'}, {'GarageYrBlt', 'YearRemodAdd'}, {'LotArea', 'LotFrontage'}, {'TotalBsmtSF', '1stFlrSF'}]
Features to drop: ['GarageArea', 'GrLivArea', 'YearBuilt', 'OverallQual', 'YearRemodAdd', 'LotArea', 'TotalBsmtSF']
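
In the output above, `SalePrice` appears inside the first correlated set, which means the target took part in the selection. One possible adjustment, sketched here with plain pandas on hypothetical data (this is not the `SmartCorrelatedSelection` implementation), is to drop the target before measuring correlations between predictors.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two nearly redundant area features, one independent feature, and a target.
rng = np.random.default_rng(1)
n = 300
first_flr = rng.uniform(500, 2000, n)
df = pd.DataFrame({
    "1stFlrSF": first_flr,
    "TotalBsmtSF": first_flr + rng.normal(0, 50, n),
    "OverallCond": rng.integers(1, 10, n),
    "SalePrice": 100 * first_flr + rng.normal(0, 20000, n),
})

# Remove the target so it cannot be grouped with, or cause the removal of, a predictor.
features = df.drop(columns=["SalePrice"])

# Absolute Spearman correlations between the remaining features.
corr = features.corr(method="spearman").abs()

threshold = 0.6
cols = corr.columns
correlated_pairs = [
    (cols[i], cols[j], round(corr.iloc[i, j], 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(correlated_pairs)
```

With the target excluded, only predictor-to-predictor redundancy is reported, and the relationship between each feature and `SalePrice` can be assessed separately.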
