## Foreword

Hi Kagglers!
The score of the House Price predictions project (https://www.kaggle.com/godzill22/house-price-predictions/comments) that I did a year ago is nothing to be proud of. Since then I have learnt more techniques that I would like to try, to see whether I can improve my model. This notebook is an attempt at a systematic approach to a high-dimensional dataset with different feature types. I think the data from the House Price competition is ideal for practicing those skills. Data preparation is often said to take anywhere from 60–80% of the total time spent on a data science project.

## Load the dataset

In [None]:
import numpy as np
import pandas as pd

import missingno
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

import scipy as sp
from scipy.stats import skew

import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load data description
# with open("/kaggle/input/home-data-for-ml-course/data_description.txt", 'r') as f:
    # print(f.read())

In [None]:
# Load the training dataset
train_df = pd.read_csv("/kaggle/input/home-data-for-ml-course/train.csv")
train_df.head()

In [None]:
# train_df.info()

In [None]:
len(train_df)

In [None]:
train_df[train_df.duplicated()]

In [None]:
train_missing = train_df.isna().sum()

train_missing = 100 * (train_missing[train_missing > 0] / len(train_df))
train_missing

In [None]:
# Load the test dataset
test_df = pd.read_csv("/kaggle/input/home-data-for-ml-course/test.csv")
test_df.head()

In [None]:
#test_df.info()

In [None]:
test_df[test_df.duplicated()]

In [None]:
test_missing = test_df.isna().sum()

test_missing = 100 * (test_missing[test_missing > 0] / len(test_df))
test_missing

### Check target column first

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
sns.histplot(train_df['SalePrice'], bins=30, kde=True, ax=ax)

In [None]:
# Plot the log-transformed target
fig, ax = plt.subplots(figsize=(10,4))
sns.histplot(np.log1p(train_df['SalePrice']), bins=30, kde=True, ax=ax);
# Apply the log transformation to the target
train_df['SalePrice'] = np.log1p(train_df['SalePrice'])

In [None]:
train_df['SalePrice'].isna().sum()
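
A minimal sketch of what the log transformation does, using made-up prices rather than the competition data: `np.log1p` pulls in the long right tail, and `np.expm1` is the inverse needed to turn predictions back into prices.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (not taken from the competition data)
prices = pd.Series([50_000, 120_000, 135_000, 160_000, 185_000, 240_000, 755_000],
                   dtype="float64")

log_prices = np.log1p(prices)    # the transform applied to SalePrice
restored = np.expm1(log_prices)  # inverse transform, needed before submitting

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The skewness should drop noticeably after the transform, and `restored` should match the original prices up to floating point error.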

First of all, I will concatenate the two dataframes for feature engineering. This avoids the problem of the train and test datasets ending up with different categories (and therefore different dummy columns) for the discrete features.

In [None]:
# Concatenate train/test datasets
df = pd.concat([train_df, test_df], axis=0)
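
To illustrate the problem that concatenation avoids, here is a small sketch with two toy frames (the values are made up for the example). Encoding each frame separately produces different columns, while encoding the concatenated frame gives both parts the same columns.

```python
import pandas as pd

# Hypothetical toy frames: "Membran" appears only in train, "WdShake" only in test
toy_train = pd.DataFrame({"RoofMatl": ["CompShg", "Membran", "CompShg"]})
toy_test = pd.DataFrame({"RoofMatl": ["CompShg", "WdShake"]})

# Encoding separately gives two different sets of columns
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)

# Encoding after concatenation gives one shared set of columns
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined)
enc_train = encoded.iloc[:len(toy_train)]
enc_test = encoded.iloc[len(toy_train):]

print(list(sep_train.columns), list(sep_test.columns))
print(list(enc_train.columns))
```

Note that the sketch passes `ignore_index=True` so the combined frame has a unique index; that is a choice made for the example, not something the notebook cell above does.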

## Numerical Features

In [None]:
# Change these features into object type
change_type = ['MSSubClass','OverallQual','OverallCond','YearBuilt','YearRemodAdd','GarageYrBlt','MoSold','YrSold']

for col in change_type:
    df[col] = df[col].astype("object")

In [None]:
# Describe numeric columns
df.drop("Id", axis=1).describe(include=['number']).T

In [None]:
num_feat = [x for x in df.columns if df[x].dtype !="object"]

num_feat.remove("Id")

In [None]:
# Correlation between numerical variables
corr_matrix = df[num_feat].corr()
plt.figure(figsize=(16,12))
sns.heatmap(corr_matrix.T, annot=True, cbar=False, cmap='coolwarm');

In [None]:
# Correlated variables greater than 0.8
corr_matrix = df[num_feat].corr()
plt.figure(figsize=(12,12))
sns.heatmap(corr_matrix.T, annot=True, mask= corr_matrix < 0.8 ,cbar=False, cmap='coolwarm');

Let's check how these mutually correlated variables correlate with the target column, so I can decide which of them to remove from further analysis.

In [None]:
price_corr_ser = df[num_feat].corr()['SalePrice']
price_corr_ser = price_corr_ser.sort_values(ascending=False)
price_corr_ser = price_corr_ser.drop("SalePrice")

fig, ax = plt.subplots(figsize=(10,12))
sns.barplot(x=price_corr_ser.values, y=price_corr_ser.index, palette="rocket_r")
plt.title("Numeric Feature Correlation with Target Column");

In [None]:
# Remove one of the highly correlated variables
high_correlated_var = ["GarageArea",'1stFlrSF','TotRmsAbvGrd']
df = df.drop(high_correlated_var, axis=1)

# Remove it from list of numeric columns
for c in high_correlated_var:
    num_feat.remove(c)
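
The selection above was done by eye from the two plots. As a rough sketch of the same rule in code (for each pair correlated above 0.8, drop the feature that is less correlated with the target), assuming synthetic data and hypothetical column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Hypothetical data: "b" is nearly a copy of "a", "c" is independent
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
})
toy["target"] = 2 * toy["a"] + toy["c"] + rng.normal(scale=0.5, size=n)

features = ["a", "b", "c"]
corr = toy[features].corr().abs()
target_corr = toy[features].corrwith(toy["target"]).abs()

to_drop = set()
for i, first in enumerate(features):
    for second in features[i + 1:]:
        if corr.loc[first, second] > 0.8:
            # Keep whichever feature is more correlated with the target
            weaker = first if target_corr[first] < target_corr[second] else second
            to_drop.add(weaker)

print(to_drop)
```

This is only one possible rule; domain knowledge or the number of missing values in each column can be just as good a reason to prefer one feature over the other.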

### Distribution of numeric features

In [None]:
# Plot distribution of numeric variables
fig = plt.figure(figsize=(20,20))

for i, col in enumerate(num_feat):
    plt.subplot(14,5, i+1)
    sns.kdeplot(x=df[col], bw_method=0.1)
    sns.rugplot(x=df[col])
    plt.title(col)
    plt.xlabel("Value")
    plt.ylabel("Density")
plt.tight_layout()
fig.show()

In [None]:
# Visualize relation between numeric features and target column
fig = plt.figure(figsize=(20,20))
# numeric_df = num_df.drop('SalePrice', axis=1)

for i, col in enumerate(df[num_feat].columns):
    plt.subplot(12,5, i+1)
    sns.scatterplot(x=df[col], y=df['SalePrice'])
    plt.tight_layout()
    
fig.show()

### Numerical outliers

In [None]:
fig = plt.figure(figsize=(24,15))

plt.subplot(4,3,1)
sns.histplot(df["LotArea"], kde=True)

plt.subplot(4,3,2)
sns.scatterplot(x="LotArea", y="SalePrice", data=df)

In [None]:
df["LotArea"].describe()

I will remove outliers from this continuous numeric column later on, because doing it now would also affect the test rows that I need for the submission.

**PoolArea**

In [None]:
# Create binary column 1 if the house has a pool, 0 if not
df['isPool'] = df['PoolArea'].apply(lambda x: 0 if x == 0 else 1)
df['isPool'] = df['isPool'].astype("object")
df = df.drop('PoolArea',axis=1)
num_feat.remove("PoolArea")

**totalPorch**

In [None]:
# Create a new column that sums all porch columns
porch_col = ['OpenPorchSF','EnclosedPorch', '3SsnPorch', 'ScreenPorch']

df['totalPorch'] = 0.0

for col in porch_col:
    df['totalPorch'] += df[col]
    
# Remove porch columns from the dataset and from the list of numerical columns
df = df.drop(porch_col, axis=1)
for c in porch_col:
    num_feat.remove(c)

# Add the new column to the list of numeric columns
num_feat.append("totalPorch")

**Bathroom columns**

In [None]:
# Create new columns and drop the ones they replace
df["TotBathAbvGrade"] = df["FullBath"] + (0.5 * df["HalfBath"])
df["TotBsmtBath"] = df["BsmtFullBath"] + (0.5 * df["BsmtHalfBath"])

# Remove columns
to_remove = ["FullBath","HalfBath","BsmtFullBath", "BsmtHalfBath"]

for col in to_remove:
    df.drop(col, axis=1, inplace=True)
    num_feat.remove(col)

# Append new ones to the numeric columns
num_feat.append("TotBathAbvGrade")
num_feat.append("TotBsmtBath")

The columns LotFrontage (linear feet of street connected to property) and LotArea (lot size in square feet) are highly correlated, so I will drop the one with missing values, LotFrontage.

In [None]:
# Remove useless numerical column
df.drop("LotFrontage", axis=1, inplace=True)
num_feat.remove("LotFrontage")

In [None]:
# Create a plot again
fig = plt.figure(figsize=(15,15))

for i, col in enumerate(num_feat):
    plt.subplot(12,5, i+1)
    sns.scatterplot(x=df[col], y='SalePrice', data=df)
    plt.tight_layout()

fig.show()

### Missing values in numeric features

The numerical columns that are left have only a few missing values (less than 1%), so I will fill them with the median of each column. I don't need to fill missing values in the SalePrice column.

In [None]:
# Show missing values
missingno.matrix(df[num_feat], figsize=(20,4))

In [None]:
# Remove SalePrice temporarily
num_feat.remove("SalePrice")

In [None]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(df[num_feat])
df[num_feat] = imp.transform(df[num_feat])
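
The imputer above uses `strategy='median'`. A small sketch of why the median can be a safer fill value than the mean when a column has a large outlier, using a made-up column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value and one very large value
toy = pd.DataFrame({"BsmtFinSF1": [0.0, 200.0, 400.0, np.nan, 5000.0]})

median_imp = SimpleImputer(strategy="median")
mean_imp = SimpleImputer(strategy="mean")

# Row 3 holds the missing value; compare what each strategy puts there
filled_median = median_imp.fit_transform(toy)[3, 0]
filled_mean = mean_imp.fit_transform(toy)[3, 0]

print(filled_median, filled_mean)
```

The mean is pulled towards the outlier, while the median stays close to the typical values.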

In [None]:
# Log-transform every numeric feature
df[num_feat] = np.log1p(df[num_feat])
    
# Append SalePrice back to numeric columns
num_feat.append("SalePrice")

## Categorical features

#### Missing Values in Categorical

In [None]:
# List of categorical columns
cat_feat = [x for x in df.columns if df[x].dtype == "object"]

# Create a multi plot with categorical features
fig = plt.figure(figsize=(18, 30))

for i , col in enumerate(cat_feat):
    plt.subplot(12,5, i+1)
    sns.boxplot(x=col, y='SalePrice', data=df)
    plt.ylabel("log1p(SalePrice)")
    plt.tight_layout()
    
fig.show()

Some features are not useful as they are (too many levels, or the same information as another column). First, I will try to create new features from them.

In [None]:
cat_missing = df[cat_feat].isna().sum()

cat_missing = 100 * (cat_missing[cat_missing > 0] / len(df[cat_feat]))

plt.figure(figsize=(10,5))
sns.barplot(x= cat_missing.sort_values(ascending=False).values, y= cat_missing.sort_values(ascending=False).index)
plt.title("Missing Categorical Values in %");

#### Missing data in categorical columns

There are many methods to impute data, some of them very sophisticated, but they share one flaw: we impute artificially created values. For categorical variables, imputing the mode of a column would be one option, but I will fill missing values with the string "NA" so that I retain the original information.

In [None]:
# Fill missing values in categorical columns with the string "NA"
for col in cat_feat:
    df[col] = df[col].fillna(value="NA")
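
A short sketch of the difference between the two options mentioned above, on a made-up series: filling with "NA" keeps "missing" as its own level, while filling with the mode hides it inside the most frequent level.

```python
import numpy as np
import pandas as pd

# Hypothetical garage types; NaN means the house has no garage
toy = pd.Series(["Attchd", "Detchd", np.nan, "Attchd", np.nan], name="GarageType")

as_category = toy.fillna("NA")            # keeps "no garage" as its own level
as_mode = toy.fillna(toy.mode().iloc[0])  # overwrites it with the most frequent level

print(as_category.value_counts())
print(as_mode.value_counts())
```

In this dataset many of the missing categorical values mean "the house does not have this feature", which is why keeping them as a separate level retains information.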

In [None]:
missingno.matrix(df[cat_feat], figsize=(20,4))

In [None]:
# for col in cat_missing.index:
    # print(f" Column '{col}' has unique values {df[col].unique()}")

In [None]:
cat_missing = df[cat_feat].isna().sum()

cat_missing = 100 * (cat_missing[cat_missing > 0] / len(df[cat_feat]))

cat_missing

### Feature engineering for categorical variables

**GarageYrBlt column**

In [None]:
# Create a new series: 1 if the house has a garage, 0 if not
is_garage = df['GarageYrBlt'].apply(lambda x: 1 if x != "NA" else 0)

# Plot new series
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))

sns.countplot(x=is_garage, ax=axes[0])
sns.boxplot(x=is_garage.values, y='SalePrice', data=df, ax=axes[1])

axes[0].set_xlabel("Is Garage")
axes[0].set_ylabel("Count")

axes[1].set_xlabel("Is Garage")
axes[1].set_ylabel("SalePrice");

In [None]:
to_remove = []
# Add to the list of columns to remove
to_remove.append("GarageYrBlt")

# Create new column from GarageYrBlt
df['isGarage'] = is_garage.astype('object')
cat_feat.append("isGarage")

**YearRemodAdd column**

In [None]:
df['YearRemodAdd'].unique()
to_remove.append("YearRemodAdd")

I don't see any value in this column, so I will drop it later on.

**YearBuilt & YrSold columns**

In [None]:
# Create a series of how old a house was when sold
how_old = (df['YrSold'].astype(int) - df['YearBuilt'].astype(int))

# New column with the age of the house at the time of sale
df['Old_in_Years'] = how_old
# Add it to the list of numeric columns
num_feat.append("Old_in_Years")

# Add columns to the removal list
to_remove.append('YrSold')
to_remove.append('YearBuilt')

In [None]:
fig = plt.figure(figsize=(18,10))

# Distribution of new column
plt.subplot(4,2, 1)
sns.histplot(df['Old_in_Years'], kde=True)
plt.ylabel("count")

# Scatterplot of new column
plt.subplot(4,2, 2)
sns.scatterplot(x=df['Old_in_Years'].values, y='SalePrice', data=df)

# Labels
plt.xlabel("Old in Years")
plt.ylabel("SalePrice");

I'm not sure whether creating another column from the two old ones would improve my model, or whether it would just carry the same information as the newly created numerical one. If someone can clear that up for me, that would be great. For now I won't create it.

In [None]:
to_remove

**Condition1 & Condition2 columns**

In [None]:
df['Condition1'].value_counts()

In [None]:
df['Condition2'].value_counts()

In [None]:
fig = plt.figure(figsize=(18,10))

condition1 = df['Condition1'].apply(lambda x: x if x == "Norm" else "Other")

plt.subplot(3,2, 1)
sns.countplot(x=condition1)

plt.subplot(3,2, 2)
sns.boxplot(x=condition1.values, y='SalePrice', data=df);

In [None]:
fig = plt.figure(figsize=(18,10))

condition2 = df['Condition2'].apply(lambda x: x if x == "Norm" else "Other")

plt.subplot(4,2, 1)
sns.countplot(x=condition2)

plt.subplot(4,2,2)
sns.boxplot(x=condition2.values, y='SalePrice', data=df);

I think the only reason for doing this would be to reduce the dimensionality of our dataframe. I am going to keep these columns unchanged for now.

### Nominal and Ordinal Columns

Some of the categorical columns hold nominal values and others hold ordinal values, and that needs to be addressed. As a reminder, nominal data can only be classified, while ordinal data can be classified and ordered.
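
A minimal sketch of the distinction, with made-up values: an ordinal column can be mapped to integers because the order carries meaning, while a nominal column gets one indicator column per level.

```python
import pandas as pd

# Ordinal: the quality levels have a natural order
quality = pd.Series(["TA", "Gd", "Ex", "Fa", "TA"])
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
quality_codes = quality.map(quality_order)

# Nominal: no order, so each level gets its own indicator column
street = pd.Series(["Pave", "Grvl", "Pave"], name="Street")
street_dummies = pd.get_dummies(street, prefix="Street")

print(quality_codes.tolist())
print(list(street_dummies.columns))
```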

**Ordinal values**

In [None]:
ordinal_feat = ['OverallQual','OverallCond','ExterQual','ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure','BsmtFinType1',
               'BsmtFinType2','HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual','GarageCond','PoolQC']

# Treat 'No' (no exposure) the same as 'NA' (no basement)
df['BsmtExposure'] = df['BsmtExposure'].apply(lambda x: x if x !='No' else "NA")

for col in ordinal_feat:
    # Remove ordinal from list  
    cat_feat.remove(col)
        
    print(f" Column '{col}' has unique values {df[col].unique()}")

In [None]:
# Map ordinal columns and change their type
ord_map = {"NA":0, "Po":1, "Fa":2, "TA":3, "Gd":4,"Ex":5}
ord_map1 = {"NA":0, "Unf":1, "LwQ":2, "Rec":3, "BLQ":4, "ALQ":5, "GLQ":6}
ord_map2 = {"NA":0, "No":1, "Mn":2, "Av":3, "Gd":4}

for col in ordinal_feat:
        
    if len(df[col].unique()) <= 6 and col !="BsmtExposure":
        df[col] = df[col].map(ord_map)
        df[col] = df[col].astype(int)
        
    elif col in ['OverallQual', 'OverallCond']:
        df[col] = df[col].astype(int)
        
    elif df[col].name in ['BsmtFinType1', 'BsmtFinType2']:
        df[col] = df[col].map(ord_map1)
        df[col] = df[col].astype(int)
        
    else:
        df[col] = df[col].map(ord_map2) 
        df[col] = df[col].astype(int)
        
for col in ordinal_feat:
    # Append ordinal features to the numerical features
    num_feat.append(col)
    print(f" Column '{col}' has unique values :{df[col].unique()}, dtype: {df[col].dtypes}")
    

In [None]:
# Let's remove some categorical columns we do not need anymore
for col in to_remove:
    df.drop(col, axis=1, inplace=True)
    cat_feat.remove(col)

####  Check numerical features correlation again

In [None]:
plt.figure(figsize=(14,12))

sns.heatmap(df[num_feat].corr(), mask= df[num_feat].corr()  < 0.8 , cbar=False, cmap='coolwarm', annot=True);

I introduced correlation between features when I converted some ordinal features to numbers, so I need to remove one feature from each correlated pair. There is also a high correlation between "SalePrice" and "OverallQual" (0.817185), but this is all right because SalePrice is the target.

In [None]:
corr_to_remove = ['GarageCond', 'FireplaceQu', 'BsmtFinType1','BsmtFinType2','BsmtCond',]

for col in corr_to_remove:
    df = df.drop(col, axis=1)
    num_feat.remove(col)
    ordinal_feat.remove(col)

**Nominal values**

Different methods can be applied to convert nominal variables into numbers so that a future algorithm can work with them. All of them have pros and cons, but I am not going to write about them here. One of the simplest and easiest to understand is the pandas "get_dummies" method; however, you need to remember not to introduce multicollinearity, which is also called the dummy variable trap.

In [None]:
# Create dummy variables 
dummy_df = pd.get_dummies(df[cat_feat], drop_first=True)

for col in cat_feat:
    df.drop(col, axis=1, inplace=True)
    
df_with_dummies = pd.concat([df, dummy_df], axis=1)
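
To see why `drop_first=True` matters, here is a small sketch with a made-up two-level column. With an intercept, the full set of dummy columns is linearly dependent (the dummies sum to the intercept column), which shows up as a design matrix whose rank is lower than its number of columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Grvl"]})

full = pd.get_dummies(toy, dtype=float)
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)

# Add an intercept column, as a linear model would
full_design = np.column_stack([np.ones(len(toy)), full.to_numpy()])
reduced_design = np.column_stack([np.ones(len(toy)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)        # more columns than rank
print(reduced_design.shape[1], rank_reduced)  # columns equal rank
```

This mostly matters for unregularized linear models; tree-based models and regularized regressions are less sensitive to it.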

A second method for converting nominal categorical variables is scikit-learn's OneHotEncoder, but I am not going to use it in this notebook. Another common method used by practitioners is label encoding, which is better suited to variables that have some sort of order.

**Now, it is very important that we split the dataset back into train and test sets before we scale the data.**

In [None]:
# Split dataframe into test/train dataset
clean_train_df = df_with_dummies[df_with_dummies["SalePrice"].notna()].copy()
clean_test_df = df_with_dummies[df_with_dummies["SalePrice"].isna()].copy()

# Drop SalePrice column from test set
clean_test_df.drop("SalePrice", axis=1, inplace=True)

In [None]:
clean_test_df.shape, clean_train_df.shape

**Skewness**

In [None]:
skewed_features = clean_train_df[num_feat].skew().sort_values(ascending=False)
skewed_features = skewed_features[skewed_features > 0.5]
skewed_index = skewed_features.index

In [None]:
fig = plt.figure(figsize=(12,4), dpi=120)
skewed_features.sort_values(ascending=False).plot(kind='bar')
plt.xticks(rotation=45, horizontalalignment="right")
plt.title("Features with Skewness above the 0.5 Limit")
plt.tight_layout();

First of all, I will remove the most highly skewed features from the dataset, and then trim the outliers of the remaining features whose skewness is above the 0.5 limit. As a common rule of thumb, acceptable values for skewness are between -0.5 and 0.5, and between -2 and 2 for kurtosis.

In [None]:
right_skewness_col = ['PoolQC', 'LowQualFinSF']
for col in right_skewness_col:
    clean_train_df.drop(col, axis=1, inplace=True)
    clean_test_df.drop(col, axis=1, inplace= True)
    num_feat.remove(col)

In [None]:
skewed_index = skewed_index.drop(['PoolQC','LowQualFinSF'])

In [None]:
for col in skewed_index:
    q3 = np.quantile(clean_train_df[col], 0.75)
    q1 = np.quantile(clean_train_df[col], 0.25)
    iqr = q3 - q1
    # Upper limit for outliers
    upper_limit = q3 + (1.5*iqr)
    # Keep only the rows at or below the upper limit
    clean_train_df = clean_train_df[clean_train_df[col] <= upper_limit]
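
The loop above applies the usual IQR rule, upper limit = Q3 + 1.5 * IQR, one column at a time. A worked example on a made-up column:

```python
import pandas as pd

# Hypothetical living areas with one very large house
toy = pd.DataFrame({"GrLivArea": [900.0, 1100.0, 1300.0, 1500.0, 1700.0, 5600.0]})

q1, q3 = toy["GrLivArea"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr

# Keep only the rows at or below the upper limit
trimmed = toy[toy["GrLivArea"] <= upper_limit]

print(q1, q3, upper_limit, len(trimmed))
```

One thing to watch with this rule: when a column is mostly a single value (for example mostly zeros), the IQR is 0 and every row with a different value is removed, so it is worth checking the shape of the dataframe after trimming.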

In [None]:
clean_train_df.shape

In [None]:
# Save id column for submission
row_id = pd.Series(clean_test_df["Id"])
clean_train_df = clean_train_df.drop("Id", axis=1).astype("float64")
clean_test_df = clean_test_df.drop("Id", axis=1).astype("float64")

## Splitting and standardization of the data

In [None]:
from sklearn.model_selection import train_test_split

# Split training dataset into X/y first
X = clean_train_df.drop(["SalePrice"], axis=1)
y = clean_train_df['SalePrice']

# Then, split it into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

We all know how important it is to standardize/normalize a dataset for our algorithms, although there are exceptions depending on which algorithm we are going to use. To do it right we need to remember a few things. Some suggest that only the numeric variables, not the dummy columns, should be standardized, and we definitely have to fit and transform the train set and then only transform the test set. The reason is that otherwise we would create what is known as data leakage.

In [None]:
num_feat.remove('SalePrice')

In [None]:
from sklearn.preprocessing import StandardScaler
scaled_Xtrain = X_train.copy()
scaled_Xtest = X_test.copy()

scaler = StandardScaler()

scaled_Xtrain[num_feat] = scaler.fit_transform(scaled_Xtrain[num_feat])
scaled_Xtest[num_feat] = scaler.transform(scaled_Xtest[num_feat])
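
A tiny sketch of the fit-on-train, transform-on-validation pattern with made-up numbers. The scaler learns its mean and standard deviation from the training rows only, and the validation rows are scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
valid = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics learned from train only
valid_scaled = scaler.transform(valid)      # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(valid_scaled)
```

If the scaler were fitted on train and validation together, the value 10.0 would shift the mean and the scale, and information about the validation set would leak into training.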

In [None]:
scaled_Xtrain.head()

## Creating and testing our models


### Models score baseline

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score

# Create function for model evaluation
def model_evaluation(algo,algoname):
    """
    This function fits and evaluates a
    given algorithm. It takes 2 arguments:
    
    First: an instantiated estimator, e.g. SGDRegressor().
    Second: the name of the algorithm as a string.
    """

    # Fit given model
    algo.fit(scaled_Xtrain, y_train)
    y_pred = algo.predict(scaled_Xtest)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    # R-squared 
    r2score = r2_score(y_test, y_pred)
    
    print(f"**{algoname} Metrics**")
    print(f"**MAE: {mae}")
    print(f"**RMSE: {rmse}")
    print(f"**R-squared: {r2score:.2f}")
    
    return mae, rmse, r2score, y_pred, algo
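
For reference, the three metrics used in the function can be written out by hand. A small sketch with made-up values on the log scale (the numbers are for illustration only):

```python
import numpy as np

# Hypothetical true and predicted log prices
y_true = np.array([12.0, 11.5, 12.5, 13.0])
y_hat = np.array([12.1, 11.4, 12.8, 12.7])

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))
rmse_manual = np.sqrt(np.mean(errors ** 2))
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, rmse_manual, r2_manual)
```

Because SalePrice was transformed with `np.log1p`, MAE and RMSE here are in log units, not in dollars. R-squared is a fraction (1.0 is a perfect fit), not a percentage.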

A great place to start for someone asking which algorithm to use is the scikit-learn algorithm cheat-sheet https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html. Following its recommendation, I am going to start with Stochastic Gradient Descent.

**SGDRegressor**

In [None]:
# Create a base model
from sklearn.linear_model import SGDRegressor

sgd_base_model = SGDRegressor(random_state=101)

sgd_base_mae, sgd_base_rmse, sgd_base_r2score, sgd_y_pred, _ = model_evaluation(sgd_base_model, 
                                                                                "SGDRegressor")

In [None]:
def plot_residuals(y_pred, algoname):
    """
    Function plots probability and residuals plots
    for the validation set.
    """
    residuals = pd.Series(y_test - y_pred, 
                          name="residuals")
    
    fig, axes = plt.subplots(ncols=2, 
                             nrows=2, 
                             figsize=(14,4), 
                             dpi=120)
    fig.suptitle(algoname)
    # Plot probability
    sp.stats.probplot(residuals, plot=axes[0,0])
    # Plot kde
    sns.kdeplot(x=residuals, ax=axes[0,1])
    # Plot residuals
    sns.scatterplot(x=y_test, y=residuals, ax=axes[1,0])
    axes[1,0].axhline(y=0, c='red',ls='--')
    # Plot distribution
    sns.boxplot(x=residuals, ax=axes[1,1])
    plt.tight_layout()

In [None]:
plot_residuals(sgd_y_pred, "SGDRegressor")

**Gradient Boosting Regressor**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbr_model = GradientBoostingRegressor()
gbr_base_mae, gbr_base_rmse, gbr_base_r2score, gbr_y_pred, gbr_model = model_evaluation(gbr_model, 
                                                                                        "GradientBoostingRegressor")

In [None]:
plot_residuals(gbr_y_pred, "Gradient Boosting Regressor")

**Random Forest Regressor**

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr_model = RandomForestRegressor()
rfr_base_mae, rfr_base_rmse, rfr_base_r2score, rfr_y_pred, rfr_model = model_evaluation(rfr_model, 
                                                                                        "RandomForestRegressor")

In [None]:
plot_residuals(rfr_y_pred, "Random Forest Regressor")

**Extreme Gradient Boosting**

In [None]:
from xgboost import XGBRegressor

xgboost_model = XGBRegressor()
xgboost_base_mae, xgboost_base_rmse, xgboost_base_r2score, xgboost_y_pred, xgboost_model = model_evaluation(xgboost_model, 
                                                                                                            "Extreme Gradient Boosting")

In [None]:
plot_residuals(xgboost_y_pred, "Extreme Gradient Boosting")

**KNeighbors**

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor()
knn_base_mae, knn_base_rmse, knn_base_r2score, knn_y_pred, knn_model = model_evaluation(knn_model, 
                                                                                        "KNeighborsRegressor")

In [None]:
plot_residuals(knn_y_pred, "KNeighborsRegressor")

**Submit the best-scoring model to the Kaggle competition**

In [None]:
# Instantiate StandardScaler and copy dataset
sc = StandardScaler()
scaled_X = X.copy()
scaled_test = clean_test_df.copy()

# Scale the data
scaled_X[num_feat] = sc.fit_transform(X[num_feat])
scaled_test[num_feat] = sc.transform(clean_test_df[num_feat])

# Instantiate the final model
# final_base_model = GradientBoostingRegressor()

# Fit the model
# final_base_model.fit(scaled_X, y)

# final_predictions = final_base_model.predict(scaled_test)


# Make predictions and save them to the dataframe
# final_base_model_df = pd.DataFrame({"Id":row_id,
                                    # "SalePrice": np.expm1(final_predictions)})

In [None]:
# final_base_model_df.to_csv("house_price_final_base_sub.csv", index=False)

**ElasticNetCV**

I am going to add ElasticNetCV to the baseline models, as it lets the data choose where to sit between Ridge (L2 regularization) and Lasso (L1 regularization). The benefit is that elastic net allows a balance of both penalties, which on some problems can result in better performance than a model with only one of the penalties.

In [None]:
from sklearn.linear_model import ElasticNetCV

elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1])

el_base_mae, el_base_rmse, el_base_r2score, el_base_y_pred, elastic_model = model_evaluation(elastic_model,
                                                                                             "ElasticNetCV")

In [None]:
plot_residuals(el_base_y_pred, "ElasticNetCV")

In [None]:
elastic_model.l1_ratio_
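
The fitted `l1_ratio_` tells where the model ended up between the two penalties. In scikit-learn, `l1_ratio=1` is a pure L1 (Lasso) penalty and `l1_ratio=0` is a pure L2 penalty. A quick sketch on synthetic data (the data and the alpha value are made up) showing that ElasticNet with `l1_ratio=1.0` gives the same coefficients as Lasso:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
true_coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_toy, y_toy)

print(lasso.coef_)
print(enet_as_lasso.coef_)
```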

**LassoCV**

In [None]:
from sklearn.linear_model import LassoCV

lasso_cv_model = LassoCV(eps=0.01, n_alphas=200, cv=10, max_iter=1000000)


lassoCV_mae, lassoCV_rmse, lassoCV_r2score, lassoCV_y_pred, lasso_cv_model = model_evaluation(lasso_cv_model, "LassoCV")

In [None]:
plot_residuals(lassoCV_y_pred, "LassoCV")

**RidgeCV**

In [None]:
from sklearn.linear_model import RidgeCV

ridge_model = RidgeCV(alphas=[0.1, 1.0, 10.0])
ridge_cv_mae, ridge_cv_rmse, ridge_cv_r2, ridge_cv_y_pred, ridge_model = model_evaluation(ridge_model,
                                                                                          "RidgeCV")

In [None]:
ridge_model.alpha_

In [None]:
plot_residuals(ridge_cv_y_pred, "RidgeCV")

**SVR**

In [None]:
from sklearn.svm import SVR

svr_base_model = SVR()

svr_base_mae, svr_base_rmse, svr_base_r2score, svr_base_y_red, svr_base_model = model_evaluation(svr_base_model, 
                                                                                                 "Support Vector Regressor")

In [None]:
plot_residuals(svr_base_y_red, "SVR")

**CatBoostRegressor**

In [None]:
from catboost import CatBoostRegressor

cat_base = CatBoostRegressor(verbose=0, random_state=101)

cat_base_mae, cat_base_rmse, cat_base_r2, cat_base_y_pred, cat_base_model = model_evaluation(cat_base,
                                                                                             "CatBoostRegressor")

In [None]:
feat_imp = cat_base.get_feature_importance(prettified=True)

# Plotting top 20 features' importance

plt.figure(figsize = (12,8))
sns.barplot(x=feat_imp['Importances'][:20], y=feat_imp['Feature Id'][:20], orient = 'h', palette="coolwarm_r")
plt.title("Feature Importance")
plt.show()

In [None]:
import shap
from catboost import Pool

# Feature importance Interactive Plot 

train_pool = Pool(scaled_Xtrain)
val_pool = Pool(scaled_Xtest)

explainer = shap.TreeExplainer(cat_base_model) # insert your model
shap_values = explainer.shap_values(train_pool) # insert your train Pool object

shap.summary_plot(shap_values, scaled_Xtrain)

### Base model scores metrics

In [None]:
base_score_df = pd.DataFrame({"Model":["SGDRegressor", "GradientBoostingRegressor",
                                       "RandomForestRegressor", "Extreme Gradient Boosting",
                                       "KNeighborsRegressor" , "LassoCV", "SVR", "RidgeCV",
                                       "CatBoost"],
                              
                              "R-square":[sgd_base_r2score, gbr_base_r2score, rfr_base_r2score,
                                         xgboost_base_r2score, knn_base_r2score, lassoCV_r2score,
                                         svr_base_r2score, ridge_cv_r2, cat_base_r2],
                              
                              "RMSE":[sgd_base_rmse, gbr_base_rmse, rfr_base_rmse, xgboost_base_rmse,
                                      knn_base_rmse, lassoCV_rmse, svr_base_rmse, ridge_cv_rmse,
                                      cat_base_rmse],
                              
                              "MAE": [sgd_base_mae, gbr_base_mae, rfr_base_mae, xgboost_base_mae,
                                      knn_base_mae, lassoCV_mae, svr_base_mae, ridge_cv_mae,
                                      cat_base_mae]})

base_score_df = base_score_df.sort_values(by=["R-square"], 
                                          ascending=False).reset_index(drop=True)

In [None]:
print("**Base Models Metrics**")
base_score_df

In [None]:
# Visualize the table above
fig, ax = plt.subplots(figsize=(8,5))

sns.barplot(x="Model", y="R-square", data=base_score_df, ax=ax, palette="magma")
sns.lineplot(x="Model", y="RMSE", data=base_score_df, color="red", ax=ax,legend='brief', label="rmse")
sns.lineplot(x="Model", y="MAE", data=base_score_df, color='green', ax=ax, legend='brief', label="mae")

plt.xticks(rotation=45, horizontalalignment="right")
plt.title("Regression Model Performance Metrics")
plt.ylabel("R_squared")
plt.legend();

#### Submit Voting Ensemble Model with base models

In [None]:
from sklearn.ensemble import VotingRegressor

ensemble1_model = VotingRegressor(estimators=[("ridgecv", ridge_model),
                                             ("catboost", cat_base_model),
                                             ("gbr", gbr_model),
                                             ("lassocv", lasso_cv_model),
                                             ("svr", svr_base_model),
                                             ("forest", rfr_model)])
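
As a rough illustration of what VotingRegressor does, here is a sketch on synthetic data with two simple estimators (chosen only to keep the example fast). With no weights given, the ensemble prediction is the plain average of the individual predictions.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

voter = VotingRegressor(estimators=[("ols", LinearRegression()),
                                    ("ridge", Ridge(alpha=10.0))])
voter.fit(X_toy, y_toy)

# The fitted copies of the base estimators live in voter.estimators_
individual = np.column_stack([est.predict(X_toy) for est in voter.estimators_])
manual_average = individual.mean(axis=1)

print(np.allclose(voter.predict(X_toy), manual_average))
```

Note that VotingRegressor fits clones of the estimators it is given, so passing models that were already fitted does not reuse those fits; they are trained again when the ensemble is fitted.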

In [None]:
ensemble1_mae, ensemble1_rmse, ensemble1_r2, _ , ensemble1_model = model_evaluation(ensemble1_model,
                                                                                    "Voting Regressor")

**Submit ensemble model to the competition**

In [None]:
# Fit the model
# ensemble1_model.fit(scaled_X, y)

# final_ensemble = ensemble1_model.predict(scaled_test)


# Make predictions and save them to the dataframe
# final_base_ensemble_df = pd.DataFrame({"Id":row_id,
                                       # "SalePrice": np.expm1(final_ensemble)})

In [None]:
# final_base_ensemble_df.to_csv("house_price_final_ensemble_base_sub.csv", index=False)

### GridSearchCV for the best hyperparameters

GridSearchCV is an exhaustive search over specified hyperparameter values for an estimator. It allows us to find the best combination of parameters for a chosen model. I will split the data again with a test size of 0.1.
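
"Exhaustive" means every combination in the grid is tried, so the cost grows quickly. A small sketch with a hypothetical grid (not one of the grids used below) that counts the candidates with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, only for counting combinations
toy_grid = {"learning_rate": [0.01, 0.1],
            "max_depth": [3, 6, 8],
            "n_estimators": [500, 1000]}

n_candidates = len(ParameterGrid(toy_grid))
n_fits = n_candidates * 3  # cv=3 folds per candidate, excluding the final refit

print(n_candidates, n_fits)
```

Adding one more parameter with four values would multiply both numbers by four, which is why large grids on boosted models can take a long time.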

In [None]:
from sklearn.model_selection import GridSearchCV

def model_gridsearchCV(algo,param,name):
    """
    Function will perform GridSearchCV for a given algorithm class
    and parameter grid. Prints out mean absolute error, root mean
    squared error and R-squared score on the validation set.
    Returns mae, rmse, r2score, y_pred and the fitted grid model.
    """
    # Instantiate base model
    model = algo()
    
    # Instantiate grid for a model
    model_grid = GridSearchCV(model, 
                             param,
                             scoring="r2",
                             # verbose=2,
                             n_jobs=-1,
                             cv=3)
    # Fit the grid model
    model_grid.fit(scaled_Xtrain, y_train)
    
    # Make prediction
    y_pred = model_grid.predict(scaled_Xtest)
    
    # Evaluate model
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2score = r2_score(y_test, y_pred)
    
    # Print metrics
    print(f"**{name} with GridSearchCV**")
    print(f"MAE: {mae}")
    print(f"RMSE: {rmse}")
    print(f"R-squared: {r2score:.2f}")
    
    return mae, rmse, r2score, y_pred, model_grid

**GradientBoostingRegressor**

In [None]:
param_grid = {#"loss":["ls","lad","huber","quantile"],
              "learning_rate": [ 0.01, 0.1, 0.3, 1],
              "subsample": [0.5, 0.2, 0.1],
              "n_estimators": [500, 1000],
              "max_depth": [3,6,8]}

gbr_grid_mae, gbr_grid_rmse, gbr_grid_r2, _ , gbr_grid = model_gridsearchCV(GradientBoostingRegressor, 
                                                                            param_grid,
                                                                            "GradientBoostingRegressor")

In [None]:
gbr_grid.best_params_

**Random Forest Regressor**

In [None]:
param_grid = {"n_estimators": [500,1000, 1500],
              # 1.0 means all features; the old 'auto' option was removed in scikit-learn 1.3
              "max_features": [1.0,'sqrt'],
              "max_depth": [None,5,10,],
              "min_samples_split": [2,5,10],
              "min_samples_leaf": [1,2,5,10]}

rfr_grid_mae, rfr_grid_rmse, rfr_grid_r2, _ , rfr_grid_model = model_gridsearchCV(RandomForestRegressor,
                                                                                  param_grid,
                                                                                  "RandomForestRegressor")

In [None]:
rfr_grid_model.best_params_

**SVR**

In [None]:
param_grid = {"kernel":["linear","rbf",],
              "gamma": ["scale","auto"],
              "C": [0.1, 0.5, 1, 10],
              "epsilon": [0.1, 0.01, 0.001]}

svr_grid_mae, svr_grid_rmse, svr_grid_r2, svr_grid_y_pred, svr_grid_model = model_gridsearchCV(SVR,
                                                                                param_grid,
                                                                               "SVR")

In [None]:
svr_grid_model.best_params_

In [None]:
svr_grid_model.best_score_
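
It may help to spell out what `best_score_` measures compared with the R-squared printed by the helper. The sketch below uses synthetic data from `make_regression` and a tiny grid (both are my assumptions, not the House Prices features or the grid above). It shows that `best_score_` is the mean cross-validated score of the winning setting on the training folds, while the hold-out score comes from data the search never saw.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(SVR(kernel="linear"), {"C": [0.1, 1, 10]}, scoring="r2", cv=3)
grid.fit(X_train, y_train)

# Mean of the per-fold validation scores for the best parameter setting
mean_cv_r2 = grid.cv_results_["mean_test_score"][grid.best_index_]

# Score on the hold-out split, computed separately from the search
holdout_r2 = r2_score(y_test, grid.predict(X_test))

print(f"best_score_ (mean CV R2): {grid.best_score_:.4f}")
print(f"hold-out R2:              {holdout_r2:.4f}")
```

The two numbers usually differ a little, so `best_score_` is best read as a model-selection signal rather than as the final test metric.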

**Ridge**

RidgeCV had the best metrics and I want to see if GridSearchCV can improve the model.

In [None]:
from sklearn.linear_model import Ridge

param_grid = {"solver": ["auto","svd","lsqr","saga"],
              "max_iter": [1000, 10000],
              "tol": [1e-3,1e-2],
              "alpha": [0.1, 1.0, 10.0, 30.0]}

ridge_gr_mae, ridge_gr_rmse, ridge_gr_r2,_ , ridge_gr_model = model_gridsearchCV(Ridge,
                                                                                 param_grid,
                                                                                 "Ridge")

In [None]:
ridge_gr_model.best_params_

**Extreme Gradient Boosting**

There is a warning when running this algorithm, but it should not prevent the code from running, nor should it change the results.

In [None]:
param_grid = {"learning_rate":[0.05, 0.10, 0.15, 0.20, 0.30],
              "max_depth":[3,4,5,6,8,15],
              "min_child_weight":[1,3,5,7],
              "gamma":[0.0, 0.1, 0.2, 0.3, 0.4],
              "colsample_bytree":[0.3, 0.4, 0.5, 0.7]}

xboost_gr_mae, xboost_gr_rmse, xboost_gr_r2, _ , xboost_gr_model = model_gridsearchCV(XGBRegressor,
                                                                                      param_grid,
                                                                                      "XGBoost")

In [None]:
xboost_gr_model.best_params_
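
A quick way to see why this search takes a while is to count the candidates. The sketch below copies the XGBoost grid from the cell above and uses scikit-learn's `ParameterGrid` to count the combinations; the `cv_folds` name is mine, and its value of 3 mirrors the `cv=3` used in the helper.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the XGBoost cell
param_grid = {"learning_rate": [0.05, 0.10, 0.15, 0.20, 0.30],
              "max_depth": [3, 4, 5, 6, 8, 15],
              "min_child_weight": [1, 3, 5, 7],
              "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
              "colsample_bytree": [0.3, 0.4, 0.5, 0.7]}

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
cv_folds = 3
n_fits = n_candidates * cv_folds

print(f"candidates: {n_candidates}, cross-validation fits: {n_fits}")
```

On top of the cross-validation fits, `GridSearchCV` refits the best candidate once on the whole training set by default.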

**CatBoostRegressor**

In [None]:
# This search is slow, but later cells need cat_grid_mae, cat_grid_rmse,
# cat_grid_r2 and cat_grid_model.
param_grid = {'iterations': [250,100,500,1000],
              'learning_rate': [0.01,0.1,0.2,0.3],
              'depth': [4, 6],
              'l2_leaf_reg': [3,1,5,10,100]}


cat_grid_mae, cat_grid_rmse, cat_grid_r2, _ , cat_grid_model = model_gridsearchCV(CatBoostRegressor,
                                                                                  param_grid,
                                                                                  "CatBoost")

**CatBoost with GridSearchCV**
1. MAE: 0.08686650731538992
2. RMSE: 0.12366782337837257
3. R-squared: 0.92

In [None]:
cat_grid_model.best_params_

In [None]:
grCV_metrics_df = pd.DataFrame({"Model":["GradientBoostingRegressor", "RandomForestRegressor", 
                                         "SVR", "Ridge", "XGBRegressor", "CatBoost"],
                                        
                                "R-square":[gbr_grid_r2, rfr_grid_r2, svr_grid_r2, 
                                            ridge_gr_r2, xboost_gr_r2, cat_grid_r2],
                                        
                                "RMSE":[gbr_grid_rmse, rfr_grid_rmse, svr_grid_rmse, 
                                        ridge_gr_rmse, xboost_gr_rmse, cat_grid_rmse],
                                        
                                "MAE":[gbr_grid_mae, rfr_grid_mae, svr_grid_mae, 
                                      ridge_gr_mae, xboost_gr_mae, cat_grid_mae]})

# Sort the models from best to worst R-squared
grCV_metrics_df = grCV_metrics_df.sort_values(by=["R-square"],
                                              ascending=False).reset_index(drop=True)

print("**GridSearchCV Models Metrics**")
grCV_metrics_df

In [None]:
# Visualize the table above
fig, ax = plt.subplots(figsize=(8,5))

# Keep the bars in the same (sorted) order as the table
list_order = list(grCV_metrics_df['Model'].values)
# R-squared
sns.barplot(x="Model", y="R-square", 
            data=grCV_metrics_df, ax=ax, 
            palette="magma", order=list_order)
# Root Mean Squared Error (sort=False keeps the points aligned with the bars)
sns.lineplot(x="Model", y="RMSE", data=grCV_metrics_df, 
             color="red", ax=ax, legend='brief', label="rmse", sort=False)
# Mean Absolute Error
sns.lineplot(x="Model", y="MAE", data=grCV_metrics_df, 
             color='green', ax=ax, legend='brief', label="mae", sort=False)

plt.xticks(rotation=45, horizontalalignment="right")
plt.title("Regression Models with GridSearchCV Metrics")
# The axis is shared by R-squared (bars), RMSE and MAE (lines)
plt.ylabel("Metric value")
plt.legend();

### Ensemble model with best parameters

In [None]:
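# best_estimator_ is the model refit with the best parameters found by the search;
# .estimator would only return the untuned template passed to GridSearchCV
ensemble2_model = VotingRegressor(estimators=[("ridgecv", ridge_gr_model.best_estimator_),
                                             ("catboost", cat_grid_model.best_estimator_),
                                             ("gbr", gbr_grid.best_estimator_),
                                             ("lassocv", lasso_cv_model),
                                             ("svr", svr_base_model),
                                             ("forest", rfr_grid_model.best_estimator_)])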

In [None]:
# Fit the model (required before calling predict)
ensemble2_model.fit(scaled_Xtrain, y_train)

In [None]:
# Evaluate ensemble model
ensemble2_y_pred = ensemble2_model.predict(scaled_Xtest)

ensemble2_mae = mean_absolute_error(y_test, ensemble2_y_pred)
ensemble2_rmse = np.sqrt(mean_squared_error(y_test, ensemble2_y_pred))
    
# R-squared 
ensemble2_r2 = r2_score(y_test, ensemble2_y_pred)
    
print("**VotingRegressor Metrics**")
print(f"MAE: {ensemble2_mae}")
print(f"RMSE: {ensemble2_rmse}")
print(f"R-squared: {ensemble2_r2:.2f}")

### Make predictions and submit to Kaggle

In [None]:
# Use best_estimator_ (tuned models); .estimator is only the untuned template
best_ensemble = VotingRegressor(estimators=[("gbr", gbr_grid.best_estimator_),
                                            ("forest", rfr_grid_model.best_estimator_),
                                            ("svr", svr_grid_model.best_estimator_),
                                            ("ridge", ridge_gr_model.best_estimator_),
                                            ("xgboost", xboost_gr_model.best_estimator_),
                                            ("catboost", cat_grid_model.best_estimator_)])

In [None]:
# Fit the model on the full training set
best_ensemble.fit(scaled_X, y)

# Predict on the Kaggle test set
final_ensemble2 = best_ensemble.predict(scaled_test)


# Undo the log transform with np.expm1 and save the predictions to a dataframe
final_ensemble_df = pd.DataFrame({"id":row_id,"SalePrice": np.expm1(final_ensemble2)})

In [None]:
final_ensemble_df.to_csv("house_price_grid_ensemble_sub.csv", index=False)
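
The `np.expm1` call in the submission cell suggests the target was trained on a `log1p` scale. As a small sanity check of that round trip, here is a sketch with made-up prices (the values and variable names are assumptions for illustration only):

```python
import numpy as np

# Made-up sale prices, for illustration only
prices = np.array([100_000.0, 250_000.0, 755_000.0])

# Forward transform applied to the target before training: log(1 + x)
log_prices = np.log1p(prices)

# Inverse transform applied to predictions before submission: exp(x) - 1
restored = np.expm1(log_prices)

print(log_prices.round(3))
print(restored.round(2))
```

Predictions made on the log scale have to go through `np.expm1` before they are written to the submission file, otherwise the `SalePrice` column would hold log values instead of prices.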

I have to admit that all the time spent on testing didn't help improve the model in this case. After submitting to the Kaggle competition, I ended up in the top 8%. Now I need to figure out what to do next, or which techniques I could implement to make this model even more robust. I already have some ideas in mind; anyway, back to reading and searching. So, if you Kagglers have some ideas, let me know. Don't forget to leave feedback.