# Introduction
I had previously written a short kernel to practice Python. My "native language" is R, but the benefits of using Python are too great to ignore. In this notebook, I revisit the same dataset but attempt several additional techniques, including:
1. Pre-process with object-oriented programming
2. Create a pipeline
    * using custom transformers
    * using decorators and a "general pipeline class"
3. Use partial dependency plots
4. Run multiple algorithms through a pipeline

Each of these elements could be the subject of a kernel all to itself, but I am choosing to give each one a section within this kernel. I hope this helps you on your Python journey!

# Import Libraries

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from subprocess import check_output
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer 
from sklearn.impute import MissingIndicator
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectPercentile, mutual_info_regression
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import reciprocal, uniform
from pdpbox import pdp, info_plots
import eli5
from eli5.sklearn import PermutationImportance
#import functools

# Dictionaries

Below is a nested dictionary: each top-level key is a column name, and its value is a dictionary that maps the column's original values to their replacements. The individual dictionaries were initially written by another user on Kaggle. If someone happens to know who made them, I will edit this notebook and provide proper attribution. My contribution was to coalesce all of the individual dictionaries into one.

In [2]:
replacement_dict = {"MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"},
            "SaleCondition" : {"Abnorml" : 0, "Alloca" : 0, "AdjLand" : 0, "Family" : 0, "Normal" : 0, "Partial" : 1},
            "ExterCond" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
            "ExterQual" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
            "Functional" : {"Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod": 5, "Min2" : 6, "Min1" : 7, "Typ" : 8},
            "HeatingQC" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
            "KitchenQual" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
            "LandSlope" : {"Sev" : 1, "Mod" : 2, "Gtl" : 3},
            "LotShape" : {"IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4},
            "PavedDrive" : {"N" : 0, "P" : 1, "Y" : 2},
            "Street" : {"Grvl" : 1, "Pave" : 2},
            "Utilities" : {"ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4},
            "OverallQual" : {1 : 1, 2 : 1, 3 : 1, # bad
                           4 : 2, 5 : 2, 6 : 2, # average
                           7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                          },
            "OverallCond" : {1 : 1, 2 : 1, 3 : 1, # bad
                                      4 : 2, 5 : 2, 6 : 2, # average
                                      7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                     },
    "ModExterCond" : {1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                  },
    "ModExterQual" : {1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                  },
    "ModFunctional" : {1 : 1, 2 : 1, # bad
                                                     3 : 2, 4 : 2, # major
                                                     5 : 3, 6 : 3, 7 : 3, # minor
                                                     8 : 4 # typical
                                                    },
    "ModKitchenQual" : {1 : 1, # bad
                                                       2 : 1, 3 : 1, # average
                                                       4 : 2, 5 : 2 # good
                                                      },
    "ModHeatingQC" : {1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                  }   
}

# 1. Pre-process with Object Oriented Programming

I have learned that there are many benefits to using OOP with Python: it makes code easier to read, reproduce, and test. For this section, I create a Python class specifically for this competition to handle the "pre-processing" of the data. What I mean by "pre-processing" is making the training and testing sets equivalent in form. This class contains methods to:
1. Load the training and testing data
2. Generate a "submission" dataframe
3. Use the above dictionary to replace values
4. Create new columns by:
    * multiplication
    * addition
    * hardcoding
5. Log-transform the response
6. Remove categories that do not show up in test (to allow for one-hot encoding)

In [3]:
# Define Class
class HousingPricesRegression():
    # class initialization
    def __init__(self, train_path, test_path):
        self.train = pd.read_csv(train_path)
        self.test = pd.read_csv(test_path)
        self.submission = pd.DataFrame()
    def gen_submission_file(self, id_col):
        self.submission[id_col] = self.test[id_col]
        self.test.drop([id_col], axis = 1, inplace=True)
    # replace information within columns
    def replace_values(self, replace_dict):
        # Find name (top level key)
        key_name = str(*replace_dict)
        true_index = np.where(self.train.columns.isin([key_name]))
        col_name = list(self.train.columns[true_index])
        if len(col_name) == 0:
            # Option 1: Dictionary refers to a column that doesn't exist yet. Create one!
            # Take 'ModExterCond' and find the original ("ExterCond") by stripping the "Mod" prefix
            orig_name = key_name[3:]
            self.train[key_name] = self.train[orig_name].map(replace_dict[key_name])
            self.test[key_name] = self.test[orig_name].map(replace_dict[key_name])
        else:
            # Option 2: Dictionary refers to a column that does exist. Modify it in place!
            self.train[key_name] = self.train[key_name].map(replace_dict[key_name])
            self.test[key_name] = self.test[key_name].map(replace_dict[key_name])
    # Take Logarithm of Response
    def create_response(self, response):
        self.price_labels = np.log(self.train[response].copy() + 1)
        # drop original to avoid confusion
        self.train.drop([response], axis=1, inplace = True)  
    # Remove categories 
    def remove_categories(self, cat_cols):
        for i in cat_cols:
            trn = self.train[i].astype('category')
            trn = trn.cat.categories
            tst = self.test[i].astype('category')
            tst = tst.cat.categories
            # Find the levels that are in train but not in test
            unique_values_train = list(set(trn) - set(tst))
            # Check 
            # If no difference, print
            if len(unique_values_train)==0:
                print("No Differnence for {}".format(i))
            else:
                # If there is a difference, change to NaN in train for later imputation
                for j in unique_values_train:
                    to_replace_index = self.train.loc[self.train[i]==j,i].index
                    self.train.loc[to_replace_index,i] = float('NaN')
                    self.train[i] = self.train[i].cat.remove_categories(j)
                    print("{} removed from {} in train".format(j,i))
    # Set the category type 
    def make_category(self, cat_cols):
        self.train[cat_cols] = self.train[cat_cols].astype('category')
        self.test[cat_cols] = self.test[cat_cols].astype('category')

#### Load Data

In [5]:
# Define Paths
file_path = '/home/gopherguy14/PYTHON_PROJECTS/Housing_Prices/'
train_path = file_path + 'train.csv'
test_path = file_path + 'test.csv'
submission_path = str(file_path) + str()

# Instantiate model object
hprObject = HousingPricesRegression(train_path, test_path)

# Generate submission file
hprObject.gen_submission_file(id_col='Id')

#### Call Replacement Method

Unpacking a dictionary with `*` yields its keys, so `str(*d)` returns the key of a single-key dictionary. The loop below walks through `replacement_dict` and wraps each entry in its own single-key dictionary. That dictionary is passed to the `replace_values()` method, which reads the top-level key (the column name) and remaps the column's levels with `.map()`. The `Mod*` entries work because they come after the entries for the columns they are derived from, so those columns already hold integers by the time they are mapped. Note that the train and test data are pandas dataframes at this point.

In [6]:
# Replace features - multiple columns
for key, values in replacement_dict.items():
    temp_dict = {}
    temp_dict[key] = replacement_dict[key] 
    hprObject.replace_values(temp_dict)
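
To make the mechanics concrete, here is a minimal sketch of the same idea on a toy dataframe. The three-row `df` is made up for illustration and the class is left out, but the key lookup with `str(*single)` and the `.map()` call mirror what `replace_values()` does.

```python
import pandas as pd

# Two entries copied from replacement_dict; the order matters
toy_dict = {
    "ExterCond": {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "ModExterCond": {1: 1, 2: 1, 3: 1, 4: 2, 5: 2},
}

# Toy data (hypothetical rows, not taken from the competition files)
df = pd.DataFrame({"ExterCond": ["TA", "Gd", "Po"]})

for key, mapping in toy_dict.items():
    single = {key: mapping}      # single-key dictionary, as in the loop above
    key_name = str(*single)      # unpacking yields the one key
    # Existing column: remap in place. New column: read from the name without "Mod".
    source = key_name if key_name in df.columns else key_name[3:]
    df[key_name] = df[source].map(mapping)

print(df)
```

Any value that has no entry in the mapping comes back from `.map()` as NaN, which is worth keeping in mind when a dictionary does not cover every level.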

#### Create Response
This creates a pandas Series called `price_labels` that holds `log(SalePrice + 1)` for the training data. The original `SalePrice` column is then dropped from the train dataframe.

In [7]:
# This takes the log of the sales price
hprObject.create_response(response="SalePrice")
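
As a quick check on the transform, the sketch below uses three made-up prices (not values from the dataset) to show that `np.log(x + 1)` is the same as `np.log1p(x)`, and that `np.expm1` undoes it when predictions need to go back to dollars.

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices, for illustration only
prices = pd.Series([100000.0, 250000.0, 755000.0], name="SalePrice")

labels = np.log(prices + 1)        # what create_response() computes
same_labels = np.log1p(prices)     # equivalent NumPy helper
recovered = np.expm1(labels)       # inverse transform back to the price scale

print(np.allclose(labels, same_labels), np.allclose(recovered, prices))
```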

#### Reconcile Training and Test Sets
I had trouble with this for a while. Some columns in train contain 'levels' that never appear in test, and one-hot encoding gave me problems when I did *not* reconcile the two. What I do now is simply blank out that particular 'cell' (a specific row in a specific column) by setting it to NaN. *I do not delete the row*. Later in the kernel I impute the blanks.

In [8]:
## Create list of numerical and categorical attributes
num_cols = hprObject.test._get_numeric_data().columns
cat_cols = list(set(hprObject.test.columns) - set(num_cols))
num_cols = list(num_cols)

## Cast the categorical columns to the pandas 'category' dtype
hprObject.make_category(cat_cols)

# Replace levels that appear only in train with NaN to be imputed later
hprObject.remove_categories(cat_cols)

## Make sure that the training data only reflects the columns in the testing data 
names_in_test = list(hprObject.test)
hprObject.train = hprObject.train[names_in_test]

No Differnence for LandContour
No Differnence for BsmtFinType1
Other removed from Exterior2nd in train
No Differnence for MoSold
No Differnence for Neighborhood
No Differnence for SaleType
No Differnence for GarageType
No Differnence for GarageCond
No Differnence for Fence
No Differnence for BsmtQual
No Differnence for BsmtCond
Membran removed from RoofMatl in train
ClyTile removed from RoofMatl in train
Roll removed from RoofMatl in train
Metal removed from RoofMatl in train
No Differnence for LotConfig
Fa removed from PoolQC in train
No Differnence for BsmtExposure
No Differnence for Foundation
TenC removed from MiscFeature in train
Mix removed from Electrical in train
2.5Fin removed from HouseStyle in train
No Differnence for Alley
No Differnence for GarageFinish
No Differnence for RoofStyle
OthW removed from Heating in train
Floor removed from Heating in train
No Differnence for MasVnrType
No Differnence for CentralAir
No Differnence for Condition1
Ex removed from GarageQual in tra
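
For comparison, here is a small sketch of how scikit-learn's `OneHotEncoder` treats mismatched levels. The two toy dataframes are made up. The set difference is the same check `remove_categories()` performs, and the second half shows the opposite case (a level seen only at transform time), which raises an error unless `handle_unknown='ignore'` is set, as it is in the pipeline further down.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical rows): "Metal" and "Roll" appear only in train,
# "WdShake" appears only in test
train = pd.DataFrame({"RoofMatl": ["CompShg", "Metal", "CompShg", "Roll"]})
test = pd.DataFrame({"RoofMatl": ["CompShg", "WdShake"]})

# Levels present in train but missing from test
train_only = set(train["RoofMatl"]) - set(test["RoofMatl"])

# Default encoder: a level unseen during fit raises a ValueError
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore': the unseen level becomes a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = lenient.transform(test).toarray()

print(train_only, strict_failed)
print(encoded_test)
```

Levels that exist only in train do not raise an error. They simply produce columns that are all zero for the test set.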

# Pipeline

In [9]:
## Column Names ##
kitchen_col = ['KitchenAbvGr', 'KitchenQual']
grade_col = ['OverallQual', 'OverallCond']
exter_cond = ['ExterQual','ExterCond']
add_area = ['GrLivArea','TotalBsmtSF']
add_floor = ['1stFlrSF', '2ndFlrSF']
add_porch = ['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch']
count_bath = ['BsmtFullBath','FullBath','BsmtHalfBath','HalfBath']

feature_num = kitchen_col + grade_col + exter_cond + add_area + add_floor + add_porch + count_bath 
other_num_cols = list(set(num_cols) - set(feature_num))

## Column Transformers ##
feature_engineering = ColumnTransformer(
    transformers = [
        ('multi_kitchen', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), kitchen_col),
        ('multi_grade', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), grade_col),
        ('multi_exter', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), exter_cond),
        ('add_total_SF', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), add_area),
        ('add_floor_SF', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), add_floor),
        ('add_porch_SF', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), add_porch),
        ('count_baths', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), count_bath)
    ],
    remainder='drop'
)

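# Note: ColumnTransformer applies its transformers side by side, not in sequence.
# The columns below therefore come out twice: once imputed and once scaled
# (with NaNs kept). The step2 pipeline further down imputes and scales the result again.
other_numerical = ColumnTransformer(
    transformers = [
        ('impute_other', SimpleImputer(strategy="median"), other_num_cols),
        ('scale_other', StandardScaler(), other_num_cols)
    ],
    remainder='drop'
)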

categorical = ColumnTransformer(
    transformers = [
        ('impute_cats', SimpleImputer(strategy="most_frequent"), cat_cols),
        
    ],
    remainder='drop'
)

other_missing = ColumnTransformer(
    transformers = [
        ('other_missing', MissingIndicator(features='all'), other_num_cols)
    ],
    remainder='drop'
)

category_missing = ColumnTransformer(
    transformers = [
        ('cat_missing', MissingIndicator(features='all'), cat_cols)
    ],
    remainder='drop'
)

## Pipelines ## 
step1 = Pipeline([
    ('feat_eng', feature_engineering),
    ('impute_feats', SimpleImputer(strategy="median")),
    ('scale_feats', StandardScaler())
])

step2 = Pipeline([
    ('other_nums', other_numerical),
    ('impute_nums', SimpleImputer(strategy="median")),
    ('scale_nums', StandardScaler())
])

step3 = Pipeline([
    ('categ', categorical),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])
        
unify = FeatureUnion([
    ('new_features', step1 ),
    ('numeric_scale', step2), 
    ('categoric_encode', step3),
    ('num_missing_indicator', other_missing),
    ('cat_missing_indicator', category_missing)
])

final_pipeline = Pipeline([
    ('finally', unify)
])

## Just checking
#f = final_pipeline.fit_transform(hprObject.train)
#f.shape
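
One thing that took me a while to see is that a `ColumnTransformer` runs its transformers in parallel on the original columns, while a `Pipeline` runs its steps one after another. The sketch below uses a one-column toy dataframe (made up for illustration) to compare the two. Wrapping the imputer and scaler in a `Pipeline` inside the `ColumnTransformer` is one possible way to get "impute, then scale"; it is not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with one missing entry
toy = pd.DataFrame({"LotArea": [1.0, 2.0, np.nan, 3.0]})
cols = ["LotArea"]

# Parallel: each transformer sees the raw column, outputs are stacked side by side
parallel = ColumnTransformer([
    ("impute", SimpleImputer(strategy="median"), cols),
    ("scale", StandardScaler(), cols),
])
parallel_out = parallel.fit_transform(toy)

# Sequential: the scaler sees the imputed column
sequential = ColumnTransformer([
    ("impute_then_scale",
     Pipeline([("impute", SimpleImputer(strategy="median")),
               ("scale", StandardScaler())]),
     cols),
])
sequential_out = sequential.fit_transform(toy)

print(parallel_out.shape, sequential_out.shape)
print(np.isnan(parallel_out).sum(), np.isnan(sequential_out).sum())
```

In the parallel version the scaled copy still contains the NaN, because the scaler never saw the imputed values.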

In [10]:
# split train into train and validation set
X_train, X_val, y_train, y_val = train_test_split(hprObject.train, hprObject.price_labels, test_size = 0.2)

# list names of regressors
names = [
         "Support-Vector Regressor",
         "Random Forest Regressor"
        ]

regressors = [
    SVR(),
    RandomForestRegressor()
]

parameters = [
                  {
                      'reg__C': (reciprocal(0.001, 0.1)), 
                      'reg__gamma': (uniform(1, 10))
                  },
                  {
                      'reg__max_depth': (5,10,15)
                  }
             ]
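
The `reg__` prefix in the parameter names is how scikit-learn routes a setting to a named step of a pipeline: `<step name>__<parameter>`. A minimal sketch with a two-step pipeline (the `scale` step is only there for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reg", SVR()),
])

# "reg__C" means: parameter C of the step named "reg"
pipe.set_params(reg__C=0.05, reg__gamma=3.0)
reg_params = pipe.named_steps["reg"].get_params()

print(reg_params["C"], reg_params["gamma"])
```

`RandomizedSearchCV` uses the same naming when it samples candidates, which is why the search below needs the final step to be called `reg`.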

In [11]:
model_list = {}
for name, regressor, params in zip(names, regressors, parameters):
    reg_pipe = Pipeline([
        ('pipeline', final_pipeline),
        ('reg', regressor),
    ])
    rs_reg = RandomizedSearchCV(reg_pipe, params, n_iter=10, verbose=2, random_state=42, scoring='neg_mean_squared_error')
    regres = rs_reg.fit(X_train, y_train)
    score = regres.score(X_val, y_val)
    print("{} score: {}".format(name, score))
    model_list[name] = regres

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] reg__C=0.005611516415334503, reg__gamma=10.50714306409916 .......


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  reg__C=0.005611516415334503, reg__gamma=10.50714306409916, total=   0.5s
[CV] reg__C=0.005611516415334503, reg__gamma=10.50714306409916 .......


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV]  reg__C=0.005611516415334503, reg__gamma=10.50714306409916, total=   0.4s
[CV] reg__C=0.005611516415334503, reg__gamma=10.50714306409916 .......
[CV]  reg__C=0.005611516415334503, reg__gamma=10.50714306409916, total=   0.5s
[CV] reg__C=0.029106359131330688, reg__gamma=6.986584841970366 .......
[CV]  reg__C=0.029106359131330688, reg__gamma=6.986584841970366, total=   0.4s
[CV] reg__C=0.029106359131330688, reg__gamma=6.986584841970366 .......
[CV]  reg__C=0.029106359131330688, reg__gamma=6.986584841970366, total=   0.5s
[CV] reg__C=0.029106359131330688, reg__gamma=6.986584841970366 .......
[CV]  reg__C=0.029106359131330688, reg__gamma=6.986584841970366, total=   0.5s
[CV] reg__C=0.0020513382630874496, reg__gamma=2.5599452033620267 .....
[CV]  reg__C=0.0020513382630874496, reg__gamma=2.5599452033620267, total=   0.4s
[CV] reg__C=0.0020513382630874496, reg__gamma=2.5599452033620267 .....
[CV]  reg__C=0.0020513382630874496, reg__gamma=2.5599452033620267, total=   0.5s
[CV] reg__C=0.002

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   21.4s finished


Support-Vector Regressor score: -0.1555474469376055
Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] reg__max_depth=5 ................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................................. reg__max_depth=5, total=   0.3s
[CV] reg__max_depth=5 ................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s


[CV] ................................. reg__max_depth=5, total=   0.3s
[CV] reg__max_depth=5 ................................................
[CV] ................................. reg__max_depth=5, total=   0.3s
[CV] reg__max_depth=10 ...............................................
[CV] ................................ reg__max_depth=10, total=   0.7s
[CV] reg__max_depth=10 ...............................................
[CV] ................................ reg__max_depth=10, total=   0.7s
[CV] reg__max_depth=10 ...............................................
[CV] ................................ reg__max_depth=10, total=   0.7s
[CV] reg__max_depth=15 ...............................................
[CV] ................................ reg__max_depth=15, total=   0.9s
[CV] reg__max_depth=15 ...............................................
[CV] ................................ reg__max_depth=15, total=   0.9s
[CV] reg__max_depth=15 ...............................................
[CV] .

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    6.5s finished


Random Forest Regressor score: -0.02450339621788411
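
Because the search uses `scoring='neg_mean_squared_error'`, the scores printed above are negated mean squared errors on the log-transformed price, so values closer to zero are better. A small sketch that converts the two recorded scores into RMSE on the log scale:

```python
import numpy as np

# Validation scores copied from the output above (negated MSE in log space)
neg_mse_scores = {
    "Support-Vector Regressor": -0.1555474469376055,
    "Random Forest Regressor": -0.02450339621788411,
}

# Flip the sign, then take the square root to get RMSE in log space
rmse_scores = {name: np.sqrt(-score) for name, score in neg_mse_scores.items()}

for name, rmse in rmse_scores.items():
    print("{} RMSE (log scale): {:.4f}".format(name, rmse))
```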


In [12]:
model_list['Support-Vector Regressor'].best_params_

{'reg__C': 0.046225890010208284, 'reg__gamma': 3.1233911067827616}

In [13]:
model_list['Random Forest Regressor'].best_params_

{'reg__max_depth': 10}

In [14]:
# Fit with optimized parameters
svr_pipe = Pipeline([
        ('pipeline', final_pipeline),
        ('svr', SVR(
            gamma = model_list['Support-Vector Regressor'].best_params_["reg__gamma"], 
            C = model_list['Support-Vector Regressor'].best_params_["reg__C"]
        )),
    ])

rf_pipe = Pipeline([
    ('pipeline', final_pipeline),
    ('rf', RandomForestRegressor(max_depth = model_list['Random Forest Regressor'].best_params_["reg__max_depth"]))
])

svr_model = svr_pipe.fit(X_train, y_train)
rf_model = rf_pipe.fit(X_train, y_train)

In [15]:
# SVR Names
# Find the one-hot-encodings
svr_ohe = svr_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[2][1].named_steps['encode'].get_feature_names().tolist()
# Engineered features
svr_eng_feat = svr_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[0][1].named_steps['feat_eng'].get_feature_names()
# Other numeric? 
svr_other_num = svr_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[3][1].get_params()['transformers'][0][2]
# but how to get the missing ... stuff? 

svr_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[1][1].get_params()

#a = svr_ohe + svr_eng_feat + svr_other_num
#len(a)

{'memory': None,
 'steps': [('other_nums',
   ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
            transformer_weights=None,
            transformers=[('impute_other', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
          strategy='median', verbose=0), ['LotShape', 'PoolArea', 'MiscVal', 'YrSold', 'WoodDeckSF', 'PavedDrive', 'Functional', 'TotRmsAbvGrd', 'MSSubClass', 'GarageArea', 'LandSlope', 'GarageYrBlt', 'LotArea'...replaces', 'HeatingQC', 'BedroomAbvGr', 'MasVnrArea', 'LotFrontage', 'ModFunctional', 'Utilities'])])),
  ('impute_nums', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
          strategy='median', verbose=0)),
  ('scale_nums', StandardScaler(copy=True, with_mean=True, with_std=True))],
 'other_nums': ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
          transformer_weights=None,
          transformers=[('impute_other', SimpleImputer(copy=True, fill_value=None, missing_values=nan,

In [None]:
# RF Names
# One hot encodings
rf_ohe = rf_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[2][1].named_steps['encode'].get_feature_names().tolist()
# Engineered features
rf_eng_feat = rf_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[0][1].named_steps['feat_eng'].get_feature_names()
# Other numeric?
rf_other_num = rf_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[3][1].get_params()['transformers'][0][2]

In [21]:
# missing numeric
#svr_missing_num = 
svr_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[1][1].named_steps['other_nums'].get_params()
#rf_missing_num = rf_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[3][1].get_params()['transformers'][0][2]

{'n_jobs': None,
 'remainder': 'drop',
 'sparse_threshold': 0.3,
 'transformer_weights': None,
 'transformers': [('impute_other',
   SimpleImputer(copy=True, fill_value=None, missing_values=nan,
          strategy='median', verbose=0),
   ['LotShape',
    'PoolArea',
    'MiscVal',
    'YrSold',
    'WoodDeckSF',
    'PavedDrive',
    'Functional',
    'TotRmsAbvGrd',
    'MSSubClass',
    'GarageArea',
    'LandSlope',
    'GarageYrBlt',
    'LotArea',
    'BsmtFinSF1',
    'YearBuilt',
    'SaleCondition',
    'BsmtUnfSF',
    'BsmtFinSF2',
    'ModKitchenQual',
    'Street',
    'YearRemodAdd',
    'LowQualFinSF',
    'ModExterCond',
    'ModExterQual',
    'GarageCars',
    'ModHeatingQC',
    'Fireplaces',
    'HeatingQC',
    'BedroomAbvGr',
    'MasVnrArea',
    'LotFrontage',
    'ModFunctional',
    'Utilities']),
  ('scale_other',
   StandardScaler(copy=True, with_mean=True, with_std=True),
   ['LotShape',
    'PoolArea',
    'MiscVal',
    'YrSold',
    'WoodDeckSF',
    'Pa

In [108]:
# missing categorical
svr_missing_cat = svr_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[4][1].get_params()['transformers'][0][2]
rf_missing_cat = rf_pipe.named_steps['pipeline'].named_steps['finally'].transformer_list[4][1].get_params()['transformers'][0][2]

['BsmtFinType1',
 'Condition1',
 'GarageQual',
 'MiscFeature',
 'SaleType',
 'Fence',
 'BsmtCond',
 'Heating',
 'MoSold',
 'Condition2',
 'Neighborhood',
 'BsmtExposure',
 'CentralAir',
 'Foundation',
 'BsmtFinType2',
 'BldgType',
 'Exterior2nd',
 'MSZoning',
 'MasVnrType',
 'RoofMatl',
 'RoofStyle',
 'Exterior1st',
 'HouseStyle',
 'PoolQC',
 'BsmtQual',
 'Electrical',
 'GarageCond',
 'GarageType',
 'LotConfig',
 'LandContour',
 'Alley',
 'FireplaceQu',
 'GarageFinish']

In [None]:
# Final column names
svr_feature_names = feature_num + other_num_cols + svr_ohe
rf_feature_names = feature_num + other_num_cols + rf_ohe

In [None]:
## Transform the Validation Set
X_train_prepared = final_pipeline.fit_transform(X_train)
X_val_transformed = final_pipeline.transform(X_val)
#X_train_prepared.shape
X_val_transformed.shape

## Convert this into a pandas dataframe
# X_val_df = pd.DataFrame(X_val_transformed)
# X_val_df.columns = all_feature_names

## Feature Importance

In [None]:
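# Feature Importance through Permutation
# svr_model is the full pipeline, so pass its final SVR step the already-transformed data
perm = PermutationImportance(svr_model.named_steps['svr'], random_state=42).fit(X_val_transformed, y_val)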
#eli5.show_weights(perm, feature_names = all_feature_names, top = 58)

## Partial Dependency Plots

In [None]:
name = 'ModHeatingQC'
pdp_housing = pdp.pdp_isolate(
    model=svr_model, 
    dataset=X_val_df, 
    model_features=all_feature_names, 
    feature=name
)

# plot it
pdp.pdp_plot(pdp_housing, name)
plt.show()

In [None]:
# Names of Numeric features
num_cols = hprObject.train._get_numeric_data().columns
feature_name_list = list(num_cols)

# Select numerical features
num_train_df = hprObject.train[feature_name_list]

# Numeric Pipeline
numerical_pipeline = Pipeline([
   # ('drop_columns', DropMissing(threshold_percent = 0.07)),
    ('select_numeric', TypeSelector(np.number)),
    ('impute_missing', SimpleImputer(strategy = "median")),
    ('std_scaler', StandardScaler()),
    ('select_features', SelectPercentile(mutual_info_regression, percentile=95))
])

# Run through pipeline
num_train_np = numerical_pipeline.fit_transform(X = num_train_df, y = hprObject.price_labels)

# Split 
X_train, X_val, y_train, y_val = train_test_split(num_train_np, hprObject.price_labels, test_size = 0.2)
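
The pipeline above refers to `TypeSelector`, which is not defined anywhere in this notebook, so it has to exist before that cell can run. Below is a minimal sketch of what such a custom transformer could look like, assuming its job is to keep only the dataframe columns of a given dtype. The class body is my guess at the intent, and the toy dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TypeSelector(BaseEstimator, TransformerMixin):
    """Keep only the columns of a pandas DataFrame that match a dtype (assumed behavior)."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Nothing to learn; the selection depends only on the dtypes
        return self

    def transform(self, X):
        return X.select_dtypes(include=[self.dtype])


# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "LotFrontage": [65.0, 80.0],
})
numeric_only = TypeSelector(np.number).fit_transform(toy)

print(list(numeric_only.columns))
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params()` and `fit_transform()`, which is what lets the class sit inside a `Pipeline`.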

In [None]:
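# Make dataframe copy of X_val
# SelectPercentile drops some columns, so keep only the names of the features that survived
selected_mask = numerical_pipeline.named_steps['select_features'].get_support()
feature_name_list = [name for name, keep in zip(feature_name_list, selected_mask) if keep]
X_val_df = pd.DataFrame(X_val, columns=feature_name_list)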

In [None]:
# Perform cross-validation
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(SVR(), param_distributions, n_iter=10, verbose=2, random_state=42)
rnd_search_cv.fit(X_train, y_train)

# Fit with optimal Parameters
svr_model = SVR(gamma = rnd_search_cv.best_params_["gamma"], C = rnd_search_cv.best_params_["C"])
svr_model.fit(X_train, y_train)

## Feature Importance

In [None]:
# Feature importance
perm = PermutationImportance(svr_model, random_state=1).fit(X_val, y_val)
eli5.show_weights(perm, feature_names = feature_name_list, top = 58)
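
For readers without `eli5` installed, here is a sketch of the same idea using `sklearn.inspection.permutation_importance`, which is a different implementation from the one used above and is only available in newer scikit-learn releases. The data is synthetic: the target depends only on the first feature, so shuffling that feature should hurt the score far more than shuffling the second one.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: only feature 0 drives the target
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)

# Shuffle each column several times and record the drop in score
result = permutation_importance(
    model, X, y,
    scoring="neg_mean_squared_error",
    n_repeats=5,
    random_state=42,
)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```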

## Partial Dependency Plots

In [None]:
# Create a series of partial dependency plots
name = 'ModHeatingQC'
pdp_housing = pdp.pdp_isolate(model=svr_model, dataset=X_val_df, model_features=feature_name_list, feature=name)

# plot it
pdp.pdp_plot(pdp_housing, name)
plt.show()

In [None]:
# Create a series of partial dependency plots
name='HeatingQC'
pdp_housing = pdp.pdp_isolate(model=svr_model, dataset=X_val_df, model_features=feature_name_list, feature=name)

# plot it
pdp.pdp_plot(pdp_housing, name)
plt.show()
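
To show what these plots compute, the sketch below works out a partial dependence curve by hand instead of calling `pdpbox`: for each grid value, overwrite the feature in every row, predict, and average the predictions. The data and the linear model are synthetic stand-ins, and `partial_dependence_curve` is a helper name I made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is exactly 2 * x0 + x1
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + X[:, 1]

model = LinearRegression().fit(X, y)


def partial_dependence_curve(model, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value   # same value for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


grid = np.linspace(-1.0, 1.0, 5)
curve = partial_dependence_curve(model, X, 0, grid)

print(curve)
```

For this linear model the curve is a straight line whose slope equals the coefficient on the first feature. A model such as SVR can produce a curved line instead.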

## Fit Data with Model(s) & Parameters

In [None]:
# split train into train and validation sets
TrainX, ValX, TrainY, ValY = train_test_split(hprObject.train, hprObject.price_labels, test_size = 0.2)

# run full pipeline
train_data = final_pipeline.fit_transform(TrainX, TrainY)

# Run SVR()
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
svr_cv = RandomizedSearchCV(SVR(), param_distributions, n_iter=10, verbose=2, random_state=42)
svr_cv.fit(train_data, TrainY)

In [None]:
svr_cv.best_params_

In [None]:
# Run pipeline for validation data (transform takes only X)
test_data = final_pipeline.transform(ValX)

# Generate predictions to compare between models

## Try Running Multiple Models

In [None]:
final_pipeline.get_params()

## Trying something different

In [None]:
reg.best_params_

## Generate Submission File

In [None]:
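# create submission
# Prepare the full training set and the test set with the preprocessing pipeline
train_prepared = final_pipeline.fit_transform(hprObject.train)
test_prepared = final_pipeline.transform(hprObject.test)

# Fit on all training data and report the training RMSE (in log space)
fitted_model = SVR().fit(X=train_prepared, y=hprObject.price_labels)
mse = mean_squared_error(hprObject.price_labels, fitted_model.predict(train_prepared))
print(np.sqrt(mse))

# Predict on the test set and undo the log(x + 1) transform
predictions = fitted_model.predict(test_prepared)
hprObject.submission['SalePrice'] = np.expm1(predictions)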