# Feature Engineering

### Objectives

1. Load the cleaned dataset in preparation for feature engineering.
2. Review the pandas profiling report on the cleaned train set to evaluate potential engineering steps.
3. Define the FeatureEngineeringAnalysis function demonstrated in the prior feature engineering lessons.
4. Utilise different feature engineering methods that suit the data type.
5. Use the FeatureEngineeringAnalysis function to help define optimum feature engineering steps for a pipeline.

### Inputs

* outputs/datasets/cleaned/TrainSet.csv
* outputs/datasets/cleaned/TestSet.csv

### Outputs

* A pipeline with defined feature engineering steps. 

### Additional Comments

* This notebook follows the principles set out by Code Institute in the Predictive Analytics lessons and walkthrough projects. The code in this notebook is influenced by those lessons and projects but has been modified, and in some cases (such as the graphical design) heavily modified, by myself to suit the needs of this project.

___

## Change working Directory

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir


We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Load Cleaned Data and Import Necessary Packages

* First, import all necessary packages and load both datasets.

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
from pandas_profiling import ProfileReport
from sklearn.pipeline import Pipeline
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [None]:
TrainSet = (pd.read_csv("outputs/datasets/cleaned/TrainSet.csv"))
TrainSet.head()

In [None]:
TestSet = (pd.read_csv("outputs/datasets/cleaned/TestSet.csv"))
TestSet.head()

___

### Data Exploration

We will generate a pandas profiling report for the train set.

In [None]:
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

___

### Feature Engineering

* We will use the function defined in the Code Institute Feature Engineering Lessons to help in the Feature Engineering process.
* You can see the Function below.

In [None]:
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')
%matplotlib inline


def FeatureEngineeringAnalysis(df,analysis_type=None):


  """
  - used for quick feature engineering on numerical and categorical variables
  to decide which transformation can better transform the distribution shape 
  - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions

  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  # Loop over each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply transformers in respectives column_transformers
    df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  ### Check analysis type
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  ### Stop early if any value in the dataframe is missing
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle them before getting into feature engineering.")



def define_list_column_transformers(analysis_type):
  ### Set suffix colummns acording to analysis_type
  if analysis_type=='numerical':
    list_column_transformers = ["log_e","log_10","reciprocal", "power","box_cox","yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers



def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers



def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")



def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(20, 5))
  sns.countplot(data=df_feat_eng, x=col,palette=['#432371'],order = df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show()
  print("\n")



def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(20, 6))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except: 
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked




def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  ### LogTransformer base e
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_e"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_e")
  except: 
    df_feat_eng.drop([f"{column}_log_e"],axis=1,inplace=True)

  ### LogTransformer base 10
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_10"],base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_10")
  except: 
    df_feat_eng.drop([f"{column}_log_10"],axis=1,inplace=True)

  ### ReciprocalTransformer
  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_reciprocal")
  except:
    df_feat_eng.drop([f"{column}_reciprocal"],axis=1,inplace=True)

  ### PowerTransformer
  try:
    pt = vt.PowerTransformer(variables = [f"{column}_power"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_power")
  except:
    df_feat_eng.drop([f"{column}_power"],axis=1,inplace=True)

  ### BoxCoxTransformer
  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_box_cox"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_box_cox")
  except:
    df_feat_eng.drop([f"{column}_box_cox"],axis=1,inplace=True)


  ### YeoJohnsonTransformer
  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yeo_johnson")
  except:
    df_feat_eng.drop([f"{column}_yeo_johnson"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked

___

### Transformers to test

* The transformers I will use are as follows:
    * Categorical Encoding
    * Numerical Transformation
    * Smart Correlation Selection

#### Categorical Encoding

* The aim of categorical encoding is to replace the categorical data with numerical data.
    * The first thing we shall do is split out our categorical data into a separate dataframe.
    * We shall then assess the effect of using the ordinal_encoder on these variables by utilising the function above.
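
To make the idea concrete, here is a minimal sketch of arbitrary ordinal encoding written in plain pandas. It is not how feature_engine implements `OrdinalEncoder`; the toy `KitchenQual` values and the variable names are assumptions made purely for illustration. It shows the mapping being learned on the train data only and then reused on the test data.

```python
import pandas as pd

# Toy data standing in for one categorical column; the values are illustrative only.
train = pd.DataFrame({"KitchenQual": ["TA", "Gd", "TA", "Ex", "Gd"]})
test = pd.DataFrame({"KitchenQual": ["Gd", "Fa", "TA"]})

# "Fit": give each category an integer in order of first appearance in the train data.
categories = train["KitchenQual"].unique()
mapping = {category: code for code, category in enumerate(categories)}

# "Transform": apply the mapping learned on train to both sets.
train_encoded = train["KitchenQual"].map(mapping)
test_encoded = test["KitchenQual"].map(mapping)

print(mapping)
print(test_encoded.tolist())
```

In this sketch a category that only appears in the test data ("Fa") has no entry in the mapping and becomes NaN, which is one reason to compare the categories in both sets before encoding.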

In [None]:
categorical_variables = list(TrainSet.select_dtypes(['object','category']).columns)
df_categorical = TrainSet[categorical_variables].copy()
df_categorical.head()

In [None]:

df_categorical_engineered = FeatureEngineeringAnalysis(df=df_categorical, analysis_type='ordinal_encoder')

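* Having applied the ordinal encoder to our categorical variables we can evaluate the effect it has had. For each variable, the FeatureEngineeringAnalysis function defined above plots a count plot of the original categories, followed by a histogram, QQ plot and boxplot of the encoded version, which help us make sense of the data and its distribution.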
* We can then apply the encoder to the train and test sets.

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_variables)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

___

### Numerical Transformation

* The aim of numerical transformation is to improve the distribution with the goal of making the data normally distributed.
    * The first thing we shall do is split out our numerical data into a separate dataframe.
    * We shall then assess the effect of using the various different transformers on these variables by utilising the function above.
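
Alongside the plots, skewness gives a quick numerical read on how far a distribution is from symmetric. The sketch below uses a synthetic right-skewed sample (an assumption, not the housing data) and compares `Series.skew()` before and after a natural log transform; values closer to 0 indicate a more symmetric shape.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); it only stands in for a variable such as LotArea.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=2000), name="toy_area")

# Sample skewness before and after the natural log transform.
skew_before = values.skew()
skew_after = np.log(values).skew()

print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```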

In [None]:
numerical_variables = list(TrainSet.select_dtypes(['int64','float64']).columns)
df_numerical = TrainSet[numerical_variables].copy()
df_numerical.head()

In [None]:
df_numerical_engineered = FeatureEngineeringAnalysis(df=df_numerical, analysis_type='numerical')

#### Analysis of results

* As we can see, the function has provided us with several plots for each variable, showing the effects of the different numerical transformations applied to the data. Again, our goal here is to identify whether any of the transformations improved the distribution of the data, with a normal distribution being the optimal result.

* After an initial look at the results we can see that a different number of transformers has been applied to different variables.
    * 1stFlrSF, GrLivArea, LotArea, LotFrontage, OverallCond, OverallQual, YearBuilt, YearRemodAdd and SalePrice all had log_e, log_10, reciprocal, power, box_cox and yeo_johnson applied to them, and we can see the effect each had in the plots above.
    * 2ndFlrSF, BedroomAbvGr, BsmtExposure, BsmtFinSF1, BsmtFinType1, BsmtUnfSF, GarageArea, GarageFinish, GarageYrBlt, KitchenQual, MasVnrArea, OpenPorchSF and TotalBsmtSF only had the power and yeo_johnson transformers applied to them.
    * We shall first look at each of the 9 variables that had all transformations applied, going through each one individually to review the effect each transformer had, if any, and conclude which is the best transformer for that variable.
    * After that we shall go through the remaining variables that only had power and yeo_johnson applied, again drawing conclusions on the best transformer.
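
A likely reason for this split is that the log, reciprocal and Box-Cox transformers raise an error on zero or negative values, and the `try`/`except` blocks in the function above then silently drop those columns. The sketch below, using a made-up toy frame whose column names merely mirror the dataset, shows one way to check in advance which variables contain non-positive values.

```python
import pandas as pd

# Toy frame; the column names mirror the dataset but the values are made up.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "2ndFlrSF": [854, 0, 866, 0],
})

# Flag every column that contains at least one value <= 0.
has_non_positive = (df <= 0).any()

# Strictly positive columns can be passed to every numerical transformer.
positive_only = has_non_positive[~has_non_positive].index.tolist()
# Columns with zeros or negatives are limited to zero-safe transformers.
needs_zero_safe = has_non_positive[has_non_positive].index.tolist()

print("all transformers:", positive_only)
print("zero-safe transformers only:", needs_zero_safe)
```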

___

#### 1stFlrSF, GrLivArea, LotArea, LotFrontage, OverallCond, OverallQual, YearBuilt, YearRemodAdd and SalePrice
#### Transformations Applied: log_e, log_10, reciprocal, power, box_cox and yeo_johnson

##### 1stFlrSF
* Analysis of transformation:
    * log_e - Positive shift towards normal distribution.
    * log_10 - Positive shift towards normal distribution.
    * reciprocal - A small change in the distribution but not enough to warrant any kind of action, no impactful shift towards normal distribution.
    * power - Positive shift towards normal distribution but slight skew to the left.
    * box_cox - Positive shift towards normal distribution.
    * yeo_johnson - Positive shift towards normal distribution.
    * Conclusion - Upon inspection Log_e, Log_10, box_cox and yeo_johnson show the most positive results and therefore will be considered.

##### GrLivArea
* Analysis of transformation:
    * log_e - Positive shift left towards normal distribution.
    * log_10 - Positive shift left towards normal distribution.
    * reciprocal - A large shift to the left away from a normal distribution, not what we are looking for. 
    * power - Showed no significant change in the distribution.
    * box_cox - A small positive change towards normal distribution.
    * yeo_johnson - A small positive change towards normal distribution.
    * Conclusion - Log_e and Log_10 showed the most positive shift towards a normal distribution and will be considered later.

##### LotArea
* Analysis of transformation:
    * log_e - Very positive shift towards a normal distribution.
    * log_10 - Very positive shift towards a normal distribution.
    * reciprocal - A broader distribution but still heavily skewed to the left. 
    * power - A broader distribution but still heavily skewed to the left.
    * box_cox - Very positive shift towards a normal distribution.
    * yeo_johnson - Very positive shift towards a normal distribution.
    * Conclusion - The reciprocal and power transformations don't show enough of a positive change to be viable. However, both log transformations as well as box_cox and yeo_johnson show enough of a positive transformation to be considered.

##### LotFrontage
* Analysis of transformation:
    * log_e - A positive shift towards normal distribution but no noticeable improvement across the qq plot.
    * log_10 - A positive shift towards normal distribution but no noticeable improvement across the qq plot.
    * reciprocal - A positive shift towards normal distribution but no noticeable improvement across the qq plot.
    * power - A positive shift towards normal distribution but no noticeable improvement across the qq plot.
    * box_cox - A positive shift towards normal distribution but no noticeable improvement across the qq plot.
    * yeo_johnson - A positive shift towards normal distribution but no noticeable improvement across the qq plot.
    * Conclusion - All transformations showed a positive shift towards a normal distribution, however the qq plot showed a degradation across the board. For this reason none will be carried forward.
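
When the histogram and the qq plot seem to disagree, as they do here, a single number can help. `scipy.stats.probplot` also returns the correlation coefficient `r` of the straight-line fit through the qq points, and values closer to 1 mean the points sit closer to the line. The sketch below is run on synthetic data, and the helper name `qq_r` is an assumption introduced for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample; it does not come from the housing data.
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=4.0, sigma=0.6, size=1000)


def qq_r(values):
    """Return the correlation coefficient of the normal probability plot fit."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r


r_raw = qq_r(sample)
r_log = qq_r(np.log(sample))

print(f"qq-plot r, raw: {r_raw:.4f}, after log: {r_log:.4f}")
```

Comparing `r` before and after each transformation gives a consistent basis for statements such as "no noticeable improvement across the qq plot".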

##### OverallCond
* Analysis of transformation:
    * log_e - Distribution skewed over to the right with no trend towards a normal distribution and no positive improvement to the qq plot.
    * log_10 - Distribution skewed over to the right with no trend towards a normal distribution and no positive improvement to the qq plot.
    * reciprocal - Distribution skewed over to the left with no trend towards a normal distribution and no positive improvement to the qq plot.
    * power - Distribution skewed over to the right with no trend towards a normal distribution and no positive improvement to the qq plot.
    * box_cox - Distribution skewed over to the right with no trend towards a normal distribution and no positive improvement to the qq plot.
    * yeo_johnson - Distribution skewed over to the right with no trend towards a normal distribution and no positive improvement to the qq plot.
    * Conclusion - None of the above transformations affected the distribution in any particularly positive manner and therefore none of them will be taken forward.

##### OverallQual
* Analysis of transformation:
    * log_e - Distribution shifted over to the right but no significant improvement in the distribution.
    * log_10 - Distribution shifted over to the right but no significant improvement in the distribution.
    * reciprocal - Distribution heavily shifted over to the left but no significant improvement in the distribution.
    * power - Very little change detected between the distribution produced by this transformation and the original, not enough to be considered a success.
    * box_cox - Very little change detected between the distribution produced by this transformation and the original, not enough to be considered a success.
    * yeo_johnson - Very little change detected between the distribution produced by this transformation and the original, not enough to be considered a success.
    * Conclusion - Both log transformations skewed to the right whilst reciprocal skewed to the left. The rest failed to produce any significant change and therefore none will be considered moving forward.

##### YearBuilt
* Analysis of transformation:
    * log_e - This transformation shows no significant change from the distribution prior to transformation.
    * log_10 - This transformation shows no significant change from the distribution prior to transformation.
    * reciprocal - This transformation has flipped the distribution but shows nothing that can be considered an improvement.
    * power - This transformation shows no significant change from the distribution prior to transformation.
    * box_cox - This transformation has exaggerated the spread of the distribution, making it broader, but shows nothing that can be considered an improvement.
    * yeo_johnson - This transformation has exaggerated the spread of the distribution, making it broader, but shows nothing that can be considered an improvement.
    * Conclusion - None of the transformations produced anything that could be considered a positive shift, as such none will be considered further.

##### YearRemodAdd
* Analysis of transformation:
    * log_e - This transformation shows no significant change from the distribution prior to transformation.
    * log_10 - This transformation shows no significant change from the distribution prior to transformation.
    * reciprocal - This transformation has flipped the distribution but shows nothing that can be considered an improvement.
    * power - This transformation shows no significant change from the distribution prior to transformation.
    * box_cox - This transformation shows no significant change from the distribution prior to transformation.
    * yeo_johnson - This transformation shows no significant change from the distribution prior to transformation.
    * Conclusion - None of the above transformations made any kind of impact on the distribution and therefore none will be considered further.

##### SalePrice
* Analysis of transformation:
    * log_e - Positive trend in all regards, the distribution is heading towards a normal distribution and the qq plot shows positive improvement.
    * log_10 - Positive trend in all regards, the distribution is heading towards a normal distribution and the qq plot shows positive improvement.
    * reciprocal - Small improvement across the distribution but not enough to be considered when we compare it to the other transformations.
    * power - A greater improvement than the reciprocal transformation, but again when compared to the rest it is clearly not as much of an improvement.
    * box_cox - Positive trend in all regards, the distribution is heading towards a normal distribution and the qq plot shows positive improvement.
    * yeo_johnson - Positive trend in all regards, the distribution is heading towards a normal distribution and the qq plot shows positive improvement.
    * Conclusion - Both log transformations, box_cox and yeo_johnson show very positive transformations. Reciprocal and power showed positive change, but in comparison to the rest it was not a large improvement. As such the log, box_cox and yeo_johnson transformations will be considered.


___

#### 2ndFlrSF, BedroomAbvGr, BsmtExposure, BsmtFinSF1, BsmtFinType1, BsmtUnfSF, GarageArea, GarageFinish, GarageYrBlt, KitchenQual, MasVnrArea, OpenPorchSF and TotalBsmtSF
#### Transformations Applied: power and yeo_johnson

##### 2ndFlrSF
* Analysis of transformation:
    * power - No shift towards a normal distribution.  
    * yeo_johnson - No shift towards a normal distribution. 
    * Conclusion - Neither transformer had a noticeable positive effect on the distribution and therefore neither will be considered.

##### BedroomAbvGr
* Analysis of transformation:
    * power - Movement away from a normal distribution.
    * yeo_johnson - Movement away from a normal distribution.
    * Conclusion - Both transformations seemingly negatively impacted the distribution, or at least provided no improvement, therefore neither will be considered.

##### BsmtExposure
* Analysis of transformation:
    * power - Shift of the mean towards the center, but overall not following a normal distribution.
    * yeo_johnson - Shift of the mean towards the center, but overall not following a normal distribution.
    * Conclusion - Both transformers show a very similar transformation, the mean shifted but outliers did not move towards a normal distribution. I conclude that this does not show a positive shift and therefore neither transformation will be considered.

##### BsmtFinSF1
* Analysis of transformation:
    * power - Shift of data to the right and towards the center but nothing akin to a normal distribution.
    * yeo_johnson - Majority of the data shifted too far right, away from the center. A large change but not a positive one.
    * Conclusion - Neither transformation shows any positive change in the distribution, therefore neither will be considered further.

##### BsmtFinType1
* Analysis of transformation:
    * power - No significant change at all detected in the distribution.
    * yeo_johnson - No significant change at all detected in the distribution.
    * Conclusion - Both transformations showed little to no change in the distribution, neither will be considered.

##### BsmtUnfSF
* Analysis of transformation:
    * power - A large draw to the center. The 0 values remain but the majority of the data shows a shift towards normal distribution. Not perfect but a positive change.
    * yeo_johnson - A large draw to the center. The 0 values remain but the majority of the data shows a shift towards normal distribution. Not perfect but a positive change.
    * Conclusion - Neither transformation is perfect but they certainly show a positive change and therefore both will be considered.
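
The remaining 0 values are expected, because the Yeo-Johnson transformation maps 0 to 0 while reshaping the positive values. A minimal sketch with `scipy.stats.yeojohnson` on synthetic data (an assumption, standing in for an area variable where some houses have a value of 0) illustrates this behaviour.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for an area variable where some houses have a value of 0.
rng = np.random.default_rng(2)
values = np.concatenate([np.zeros(50), rng.gamma(shape=2.0, scale=300.0, size=950)])

# With lmbda omitted, scipy estimates lambda by maximum likelihood and returns it.
transformed, fitted_lambda = stats.yeojohnson(values)

skew_before = stats.skew(values)
skew_after = stats.skew(transformed)

print(f"lambda: {fitted_lambda:.3f}")
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```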

##### GarageArea
* Analysis of transformation:
    * power - A change is observed, however it does not warrant any further action. The change does not make any substantial impact, be that positive or negative.
    * yeo_johnson - A change is observed, however it does not warrant any further action. The change does not make any substantial impact, be that positive or negative.
    * Conclusion - Due to the changes having very little impact neither will be carried forward.

##### GarageFinish
* Analysis of transformation:
    * power - No positive change at all. 
    * yeo_johnson - No positive change at all.
    * Conclusion - Both transformations that were applied failed to move the distribution in any constructive way and therefore will not be considered. 

##### GarageYrBlt
* Analysis of transformation:
    * power - No significant change to the distribution. 
    * yeo_johnson - The only noticeable change is a lift towards the back of the plot, where we find the majority of the data, but this is not what we are looking for and shows no positive change towards a normal distribution.
    * Conclusion - Both transformations made no significant impact and therefore will not be taken forward.

##### KitchenQual
* Analysis of transformation:
    * power - No significant change to the distribution. 
    * yeo_johnson - No significant change to the distribution. 
    * Conclusion - Both transformations made no significant impact and therefore will not be taken forward.

##### MasVnrArea
* Analysis of transformation:
    * power - Failed to impact the distribution in any positive or significant way.
    * yeo_johnson - Failed to impact the distribution in any positive or significant way.
    * Conclusion - Due to both transformations failing to shift the distribution in any positive manner neither will be considered when moving forward.

##### OpenPorchSF
* Analysis of transformation:
    * power - No positive and impactful shift can be seen after this transformation. 
    * yeo_johnson - No positive and impactful shift can be seen after this transformation.
    * Conclusion - Both transformations failed to produce a positive shift in the data, for this reason neither will be considered further.

##### TotalBsmtSF
* Analysis of transformation:
    * power - The distribution can be seen shifted slightly to the right, but no real difference in the qq plot indicates that the transformation didn't affect the distribution in a significantly positive way.
    * yeo_johnson - The distribution can be seen shifted slightly to the right, but no real difference in the qq plot indicates that the transformation didn't affect the distribution in a significantly positive way.
    * Conclusion - We can see small changes in a potentially positive direction with both transformations but it is not enough to consider taking these transformations further.

___

#### Conclusions

##### Variables that will not have a transformer applied
* ['LotFrontage', 'OverallCond', 'OverallQual', 'YearBuilt', 'YearRemodAdd', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinType1', 'GarageArea', 'GarageFinish', 'GarageYrBlt', 'KitchenQual', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF']

##### Variables that will have a transformer applied
* ['1stFlrSF', 'GrLivArea', 'LotArea', 'BsmtUnfSF']
   * ##### Transformers that will be implemented.
       * 1stFlrSF - log_e
       * GrLivArea - log_e
       * LotArea - log_e
       * BsmtUnfSF - yeo_johnson


___

#### Apply transformations to the train and test sets

In [None]:
pipeline = Pipeline([
    ("NumericLogTransform",vt.LogTransformer(variables=['1stFlrSF', 'GrLivArea', 'LotArea',])),
    ("NumericYeoJohnsonTransform",vt.YeoJohnsonTransformer(variables=['BsmtUnfSF']))
    ])
TrainSet = pipeline.fit_transform(TrainSet)
TestSet = pipeline.transform(TestSet)

___

### Smart Correlated Selection

* First we want to drop SalePrice as this is the target variable we are trying to predict.
* We will do so by creating a new dataframe to perform our Smart Correlated Selection on.

In [None]:
df_smart_correlated_selection = TrainSet.drop(['SalePrice'], axis=1)
df_smart_correlated_selection.head()

* Second, we will configure the SmartCorrelatedSelection transformer (Spearman correlation with a threshold of 0.60) and fit it to find the sets of correlated features.

In [None]:
selection_method = "cardinality"
corr_method = 'spearman'
smart_correlation_selection = SmartCorrelatedSelection(variables=None, method=corr_method, threshold=0.60, selection_method=selection_method)

smart_correlation_selection.fit_transform(df_smart_correlated_selection)
smart_correlation_selection.correlated_feature_sets_

In [None]:
smart_correlation_selection.features_to_drop_
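
To illustrate what the correlation threshold means, the sketch below lists every pair of features whose absolute Spearman correlation exceeds 0.60, using plain pandas on a synthetic frame. It is not the library's implementation, and the column names `a`, `b` and `c` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic frame: "b" is nearly a copy of "a", while "c" is independent of both.
rng = np.random.default_rng(3)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

threshold = 0.60
corr = toy.corr(method="spearman").abs()

# Keep the upper triangle only so each pair is reported once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(row, col) for row in upper.index for col in upper.columns
         if upper.loc[row, col] > threshold]

print(pairs)
```

Within each correlated group only one feature needs to be kept. In the cell above that choice is made by the `selection_method` argument.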

## Final Conclusion

* By working through this notebook we have reached conclusions on several key steps.
    * After an initial analysis of the various numerical transformations we methodically went through each variable and analysed the results of each transformation.
    * We then established which transformations showed the most promising results, and these were the results:
       * 1stFlrSF - log_e
       * GrLivArea - log_e
       * LotArea - log_e
       * BsmtUnfSF - yeo_johnson
    * As such we shall be implementing these feature engineering steps in our pipeline when training the model.
    * We then ran a smart correlated selection in order to identify highly correlated, redundant features that may be dropped. These were the results:
        * 2ndFlrSF
        * GarageYrBlt
        * OverallQual
        * TotalBsmtSF
      * All these features will be dropped.
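
As a minimal sketch of that final step, the features can be removed with `DataFrame.drop`. The toy frame below is an assumption standing in for the train set, and its values are made up.

```python
import pandas as pd

# Features flagged by the smart correlated selection above.
features_to_drop = ["2ndFlrSF", "GarageYrBlt", "OverallQual", "TotalBsmtSF"]

# Toy frame standing in for the train set; the values are made up.
toy = pd.DataFrame({
    "2ndFlrSF": [0, 854],
    "GarageYrBlt": [1976, 2003],
    "OverallQual": [6, 7],
    "TotalBsmtSF": [1262, 856],
    "GrLivArea": [1262, 1710],
})

# Drop the flagged columns and keep everything else.
reduced = toy.drop(columns=features_to_drop)

print(list(reduced.columns))
```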
