# **(ADD HERE THE NOTEBOOK NAME)**

## Objectives

* Write here your notebook objective, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write here which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artifacts you generate by the end of the notebook 

## CRISP-DM

* Modelling


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues'
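
Re-running the `os.chdir()` cell above moves one more level up each time. A minimal sketch of a guarded version follows; the helper name `move_to_project_root` and the default folder name `jupyter_notebooks` are assumptions taken from the path shown above, not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(subfolder_name: str = "jupyter_notebooks") -> Path:
    """Move one level up only when the current folder is the notebooks subfolder.

    `subfolder_name` is an assumption based on the path printed above.
    """
    current = Path.cwd()
    if current.name == subfolder_name:
        # os.chdir() accepts path-like objects, so the parent Path can be passed directly
        os.chdir(current.parent)
    return Path.cwd()


# Safe to run repeatedly: the second call leaves the directory unchanged
print(move_to_project_root())
```

The guard makes the cell idempotent, which fits the rule that cells should run top-down without side effects on re-execution.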

# Load Data

In [4]:
import pandas as pd
df = (pd.read_csv("inputs/datasets/unzipped/house_prices_records.csv")
        .drop(labels=['EnclosedPorch', 'WoodDeckSF'], axis=1))
print(df.shape)
df.head()

(1460, 22)


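|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 2003.0 | ... | 8450 | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1976.0 | ... | 9600 | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 2001.0 | ... | 11250 | 68.0 | 162.0 | 42 | 5 | 7 | 920 | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | 642 | Unf | 1998.0 | ... | 9550 | 60.0 | 0.0 | 35 | 5 | 7 | 756 | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 836 | RFn | 2000.0 | ... | 14260 | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | 2000 | 2000 | 250000 |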


### Load variables

In [5]:
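# NOTE: 'EnclosedPorch' and 'WoodDeckSF' are dropped when df is loaded above,
# so the imputer will not find them in df when the pipeline is fitted.
arbitrary_imputation_vars = ['2ndFlrSF', 'EnclosedPorch', 'WoodDeckSF']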
median_imputation_vars = ['BedroomAbvGr', 'LotFrontage','GarageYrBlt','MasVnrArea']
most_frequent_vars = ['BsmtFinType1']

In [7]:
categorical_encoding_vars = df.select_dtypes(include=['object']).columns.to_list()
categorical_encoding_vars

['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

In [8]:
log_transformation_vars = ['1stFlrSF', 'LotArea']
yeojohnson_vars = ['GrLivArea']
boxcox_vars = ['SalePrice']
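
The three lists above assign variables to log, Yeo-Johnson and Box-Cox transformations. As a rough illustration of what those transformations do to a right-skewed variable, the sketch below applies NumPy's natural log and SciPy's `yeojohnson` and `boxcox` functions as stand-ins for the feature-engine transformers. The input values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Toy, strictly positive, right-skewed values (made up for illustration)
values = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])

# Natural log: only valid for strictly positive values
log_values = np.log(values)

# Yeo-Johnson: also accepts zero and negative values; lambda is estimated from the data
yeo_values, yeo_lambda = stats.yeojohnson(values)

# Box-Cox: requires strictly positive values; lambda is estimated from the data
boxcox_values, boxcox_lambda = stats.boxcox(values)

for name, transformed in [
    ("raw", values),
    ("log", log_values),
    ("yeo-johnson", yeo_values),
    ("box-cox", boxcox_values),
]:
    print(f"{name:12s} skew = {stats.skew(transformed):.3f}")
```

Comparing the skewness before and after each transformation is one simple way to decide which list a variable belongs in.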

In [9]:
smart_correlation_features = df.columns.to_list() 
smart_correlation_features.pop()
smart_correlation_features

['1stFlrSF',
 '2ndFlrSF',
 'BedroomAbvGr',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinType1',
 'BsmtUnfSF',
 'GarageArea',
 'GarageFinish',
 'GarageYrBlt',
 'GrLivArea',
 'KitchenQual',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'TotalBsmtSF',
 'YearBuilt',
 'YearRemodAdd']

* We will handle data cleaning for 'GarageFinish' outside of the pipeline

In [10]:
nan_GarageFinish_vals=df[df['GarageFinish'].isna()]
nan_GarageFinish_index = list(nan_GarageFinish_vals.index.values)

df['GarageFinish'] = df['GarageFinish'].fillna(0)

for x in nan_GarageFinish_index:
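    # Look up by column label: after dropping 'EnclosedPorch', position 8 is 'GarageFinish', not 'GarageArea'
    garage_area_value = df.at[x, 'GarageArea']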
    if garage_area_value == 0:
        df.at[x,'GarageFinish'] = 'None'
    else:
        df.at[x,'GarageFinish'] = 'Unf'

df.filter(['GarageFinish']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   GarageFinish  1460 non-null   object
dtypes: object(1)
memory usage: 11.5+ KB
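
The rule applied above (a missing 'GarageFinish' becomes 'None' when 'GarageArea' is 0, otherwise 'Unf') can also be written without a loop. The following is a minimal sketch on a made-up four-row frame, using boolean masks with `.loc`; the frame name `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows: two missing GarageFinish values, one with no garage and one with a garage
toy = pd.DataFrame({
    "GarageArea": [548, 0, 300, 420],
    "GarageFinish": ["RFn", np.nan, np.nan, "Fin"],
})

missing = toy["GarageFinish"].isna()   # rows that need a value
no_garage = toy["GarageArea"] == 0     # rows where the house has no garage

# Missing and no garage -> 'None'; missing but a garage exists -> 'Unf'
toy.loc[missing & no_garage, "GarageFinish"] = "None"
toy.loc[missing & ~no_garage, "GarageFinish"] = "Unf"

print(toy)
```

Selecting columns by label rather than by position keeps the rule correct even if the column order changes.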


# Create ML Pipelines

1. Data Cleaning and Feature Engineering

In [12]:
from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import ArbitraryNumberImputer
from feature_engine.imputation import CategoricalImputer

### Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt

def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline([
    ("ArbitraryImputer", ArbitraryNumberImputer(arbitrary_number=0,
                                            variables=arbitrary_imputation_vars)),
                                
    ("MedianImputation", MeanMedianImputer(imputation_method='median',
                                            variables=median_imputation_vars)),
    
    ("CategoricalImputer", CategoricalImputer(imputation_method='frequent', 
                                        variables=most_frequent_vars)),

    ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary', 
                                                variables = categorical_encoding_vars) ),
    
    ("LogTransformer", vt.LogTransformer(variables = log_transformation_vars)),

    ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables = yeojohnson_vars)),

    ("BoxCoxTransformer", vt.BoxCoxTransformer(variables = boxcox_vars)),
      
    ("SmartCorrelatedSelection",SmartCorrelatedSelection(variables=smart_correlation_features, 
                                                          method="pearson", threshold=0.6, 
                                                          selection_method="variance") ),
       
  ])

  return pipeline_base

PipelineDataCleaningAndFeatureEngineering()

Pipeline(steps=[('ArbitraryImputer',
                 ArbitraryNumberImputer(arbitrary_number=0,
                                        variables=['2ndFlrSF', 'EnclosedPorch',
                                                   'WoodDeckSF'])),
                ('MedianImputation',
                 MeanMedianImputer(variables=['BedroomAbvGr', 'LotFrontage',
                                              'GarageYrBlt', 'MasVnrArea'])),
                ('CategoricalImputer',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['BsmtFinType1'])),
                ('OrdinalCa...
                 SmartCorrelatedSelection(selection_method='variance',
                                          threshold=0.6,
                                          variables=['1stFlrSF', '2ndFlrSF',
                                                     'BedroomAbvGr',
                                                     'BsmtExposure',
                  

# ML Pipeline for Modelling and Hyperparameter Optimization

* Our next step is to choose the best algorithm for our ML model and the most effective hyperparameters for that algorithm.
* We will do so with the function and custom class below.
    * The code below is taken from the CI lesson: *Scikit-Learn Unit 6: Cross Validation Search Part 2*

In [13]:
import warnings
warnings.filterwarnings('ignore')
### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

def PipelineClf(model):
  pipeline_base = Pipeline([
       ("scaler",StandardScaler() ),
       ("feat_selection",SelectFromModel(model) ),
       ("model",model ),
  ])

  return pipeline_base

In [14]:
import numpy as np  # used by score_summary (np.mean, np.std, np.hstack)
from sklearn.model_selection import GridSearchCV

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model =  PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, )
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
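
The class above expects two dictionaries with matching keys: one of estimators and one of parameter grids. Because the pipeline step is named `"model"`, each hyperparameter key needs the `model__` prefix. The sketch below shows that convention with a plain loop over `GridSearchCV`. The data, estimators and grids are made up for illustration, regressors are used only because the toy target is continuous, and the feature selection step is left out to keep it short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only to make the sketch runnable
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Both dictionaries share the same keys
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
}
params = {
    "LinearRegression": {},  # empty grid: a single run with default hyperparameters
    "DecisionTreeRegressor": {"model__max_depth": [2, 4]},  # "model__" targets the "model" step
}

best_scores = {}
for name, estimator in models.items():
    pipeline = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipeline, params[name], cv=3, scoring="r2")
    search.fit(X, y)
    best_scores[name] = search.best_score_
    print(name, len(search.cv_results_["params"]), round(search.best_score_, 3))
```

An empty grid is a convenient first pass: it compares algorithms with default settings before a wider hyperparameter search is run on the most promising ones.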

# Split Train & Test Set

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                    df.drop(['SalePrice'],axis=1),
                                    df['SalePrice'],
                                    test_size = 0.2,
                                    random_state = 0,
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1168, 21) (1168,) (292, 21) (292,)
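
The shapes above follow from `test_size = 0.2` applied to 1460 rows: 292 rows go to the test set and the remaining 1168 to the train set. A minimal sketch on a made-up frame of the same length shows this, and also that a fixed `random_state` yields the same split on every run. The frame `toy` and the helper `split` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with the same number of rows as the dataset
toy = pd.DataFrame({
    "feature": np.arange(1460),
    "SalePrice": np.arange(1460) * 100,
})


def split(frame):
    """Split features and target with the same settings as the cell above."""
    return train_test_split(
        frame.drop(["SalePrice"], axis=1),
        frame["SalePrice"],
        test_size=0.2,
        random_state=0,
    )


X_tr, X_te, y_tr, y_te = split(toy)
X_tr_again, X_te_again, _, _ = split(toy)

print(X_tr.shape, X_te.shape)
# Same random_state, same rows in the test set
print(X_te.index.equals(X_te_again.index))
```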


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down (you can't create a workflow in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* If you don't need to push files to the repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder is created
except Exception as e:
  print(e)
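
As a minimal sketch of the folder-creation step, `os.makedirs` with `exist_ok=True` creates nested folders and does not fail when the cell is run again. The folder names below are hypothetical, and a temporary directory stands in for the project root so the example leaves no files behind.

```python
import os
import tempfile

# A temporary directory stands in for the project root (assumption for illustration)
with tempfile.TemporaryDirectory() as project_root:
    # Hypothetical output folder; replace with the folder your notebook needs
    output_dir = os.path.join(project_root, "outputs", "example_folder")

    # exist_ok=True makes the call safe to repeat
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)

    created = os.path.isdir(output_dir)

print(created)
```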
