<a href="https://colab.research.google.com/github/DLPY/Regression-Session-2/blob/master/Regression_Session_2_ElasticNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# House Price Prediction based on Postal Code, Number of Bathrooms, Car Parking and Property Type

Details on the data: https://www.kaggle.com/mihirhalai/sydney-house-prices

# **1. Import necessary packages for performing EDA and Multiple Regression**

In [None]:
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 
import seaborn as sns
from sklearn.linear_model import (Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV)
from sklearn.metrics import (r2_score, mean_squared_error)
from sklearn.model_selection import (RepeatedKFold, train_test_split)
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, StandardScaler)

%matplotlib inline

pd.set_option('display.max_colwidth', None)

## i) Read data from CSV file into pandas DataFrame

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/Processed_SydneyHousePrices.csv
df = pd.read_csv('Processed_SydneyHousePrices.csv')

## ii) Isolate Target and Predictor Variables to Different Dataframes

In [None]:
X = df[['postalCode', 'bed', 'bath', 'car', 'propType', 'diffDate', 'Year', 'Month', 'Day', 'Quarter', 'medSellPrice']]
y = df[['sellPrice']]

# Save this list of column values for later
columns_list = list(X.columns.values)

In [None]:
X.head(5)

In [None]:
y.head(5)

# **2. Standardise Features**

In [None]:
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [None]:
# The scaled values are now stored as an array.
X_std[: 5]
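To make the scaling step concrete, the sketch below reproduces standardisation by hand on a small made-up array. The array `toy` and its values are illustrative only and are not taken from the data set. It assumes `StandardScaler`'s default behaviour: subtract each column's mean and divide by its population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical toy data: 4 rows, 2 features on very different scales
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0],
                [4.0, 800.0]])

col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)            # population standard deviation (ddof=0)
toy_std = (toy - col_mean) / col_std

# After scaling, every column should have mean ~0 and standard deviation ~1
print(toy_std.mean(axis=0))
print(toy_std.std(axis=0))
```

Putting all features on the same scale is what makes the coefficient magnitudes examined later comparable across features.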

In [None]:
# X_std is already a NumPy array, so convert y to an array as well (the model expects arrays as inputs).
y = y.values

# **3. Split dataset into training and test sets using train_test_split**

90% - train

10% - test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.1, random_state=23)

In [None]:
print('Training Data:', X_train.shape, y_train.shape)
print('Testing Data:', X_test.shape, y_test.shape)
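As a rough check on the shapes printed above, the sketch below works out how many rows a 90/10 split produces. The row count `n_samples` is a made-up number, not the size of the real data set, and the rounding rule (test rows rounded up, the remainder used for training) is how `train_test_split` is understood to behave when only a fractional `test_size` is given.

```python
import math

n_samples = 1005      # hypothetical row count, not the real data set size
test_size = 0.1       # same fraction as in the split above

n_test = math.ceil(test_size * n_samples)   # test rows are rounded up
n_train = n_samples - n_test                # everything else is used for training

print('train rows:', n_train)
print('test rows:', n_test)
```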

# **4. Train, Test and Predict using Elastic Net regression**

## ElasticNet Model

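Set the cross-validation parameters separately. The `l1_ratio` candidates run from 0 to 0.99 in steps of 0.01 (`np.arange` excludes the stop value 1), and the `alpha` candidates are powers of ten from 1e-5 to 100, plus 0.0 (no regularisation).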

In [None]:
%%capture

# define model evaluation method (the value for n_splits is typically 3, 5, or 10)
cross_val = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

# define model parameters
ratios = np.arange(0, 1, 0.01)
alpha_list = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
elasticnet_cv_regr = ElasticNetCV(l1_ratio=ratios, alphas=alpha_list, cv=cross_val, n_jobs=-1)

# fit model
elasticnet_cv_regr_model = elasticnet_cv_regr.fit(X_train, y_train.ravel())
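To give a feel for the size of this search, the sketch below counts the candidate settings and the number of train/validation splits. It reuses the same grid values as the cell above; the variable names `n_candidates` and `n_folds` are introduced here for illustration only.

```python
import numpy as np

ratios = np.arange(0, 1, 0.01)   # same grid as above; the stop value 1 is excluded
alpha_list = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
n_splits, n_repeats = 5, 3       # same values as passed to RepeatedKFold

n_candidates = len(ratios) * len(alpha_list)   # every (l1_ratio, alpha) pair
n_folds = n_splits * n_repeats                 # splits produced by RepeatedKFold

print('candidate (l1_ratio, alpha) pairs:', n_candidates)
print('train/validation splits per candidate:', n_folds)
print('largest l1_ratio tried:', ratios.max())
```

Because each candidate is scored on every split, the fit can take a while, which is one reason `n_jobs=-1` is used to spread the work across all available cores.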

Investigate the importance of each feature based on the absolute value of their coefficients.

In [None]:
feature_importance = pd.Series(index=columns_list, data=np.abs(elasticnet_cv_regr_model.coef_))

n_selected_features = (feature_importance > 0).sum()
print('{0:d} features, reduction of {1:2.2f}%'.format(
    n_selected_features,(1 - n_selected_features / len(feature_importance)) * 100))

feature_importance.sort_values().tail(30).plot(kind='barh', figsize=(18, 6))

In [None]:
alpha_ = elasticnet_cv_regr_model.alpha_
l1_ratio_ = elasticnet_cv_regr_model.l1_ratio_
n_iter_ = elasticnet_cv_regr_model.n_iter_

# summarize chosen configuration
print('alpha: {}'.format(alpha_))
print('L1_ratio_: {}'.format(l1_ratio_))
print('Number of iterations {}'.format(n_iter_))
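To show how `alpha` and `l1_ratio` enter the model, the sketch below evaluates the elastic net objective in the form given in the scikit-learn `ElasticNet` documentation: a squared-error term plus an L1 penalty weighted by `alpha * l1_ratio` and an L2 penalty weighted by `0.5 * alpha * (1 - l1_ratio)`. The function name `elastic_net_objective` and the toy inputs are illustrative assumptions, and the intercept is omitted for brevity.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Elastic net loss for coefficient vector w (intercept omitted)."""
    n_samples = X.shape[0]
    residual = y - X @ w
    data_fit = (residual @ residual) / (2 * n_samples)
    l1_penalty = alpha * l1_ratio * np.abs(w).sum()
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * (w @ w)
    return data_fit + l1_penalty + l2_penalty

# Hypothetical toy inputs chosen so the fit is perfect and only the penalty remains
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_toy = np.array([2.0, -1.0])
y_toy = X_toy @ w_toy

lasso_like = elastic_net_objective(X_toy, y_toy, w_toy, alpha=0.1, l1_ratio=1.0)  # pure L1
ridge_like = elastic_net_objective(X_toy, y_toy, w_toy, alpha=0.1, l1_ratio=0.0)  # pure L2
print(lasso_like, ridge_like)
```

An `l1_ratio` close to 1 makes the model behave like Lasso, while a value close to 0 makes it behave like Ridge; `alpha` scales the overall strength of the penalty.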

In [None]:
elasticnet_coef = pd.DataFrame(abs(elasticnet_cv_regr_model.coef_)).T
elasticnet_coef.columns = columns_list
elasticnet_coef

From the above, notice that none of the coefficients are zero at alpha = 0.001.

This means that we can keep all of the features in the data set.

Use the alpha and L1 ratio selected above for the ElasticNet model. The number of iterations reported above is a guide for choosing `max_iter`.

In [None]:
# If a 'model did not converge' warning appears, increase the value of max_iter.
elasticnet_regr = ElasticNet(alpha=alpha_, l1_ratio=l1_ratio_, max_iter=7000)

In [None]:
# fit model
elasticnet_regr_model = elasticnet_regr.fit(X_train, y_train.ravel())

In [None]:
elasticnet_train_score = elasticnet_regr_model.score(X_train, y_train)
elasticnet_train_score

# **5. Evaluation metrics - How to Calculate R-Squared and RMSE**

In [None]:
print('ElasticNet Test data results:')
y_pred = elasticnet_regr_model.predict(X_test)

coef_of_determination_elasticnet = r2_score(y_test, y_pred)
print('R-squared: {}'.format(coef_of_determination_elasticnet))

rmse_elasticnet = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse_elasticnet))
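For reference, the sketch below computes both metrics by hand with NumPy on a few made-up numbers, using the standard definitions: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The arrays `y_true_toy` and `y_pred_toy` are illustrative and unrelated to the house price data.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

# R-squared = 1 - SS_res / SS_tot
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))

print('R-squared:', r2_manual)
print('RMSE:', rmse_manual)
```

RMSE is expressed in the same units as the target (here, the sell price), whereas R-squared is unitless.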

In [None]:
# Display actual prices, predictions, and their difference in a table
res = pd.DataFrame({'Price': y_test.ravel(), 'Prediction': y_pred})
res['Prediction'] = res['Prediction'].round(0)
res['Difference'] = res['Prediction'] - res['Price']
res.head(5)

In [None]:
# Get the median of the differences (predicted price minus actual price)
elasticnet_med_diff = res['Difference'].median()

print('The median difference of actual prices and predicted prices using Elastic Net: {}'.format(elasticnet_med_diff))

# **6. Comparison of Model Outputs**

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/df_coef.csv
df_coef = pd.read_csv('df_coef.csv')

In [None]:
# Review coefficients of each model 0=Ridge, 1=Lasso, 2=ElasticNet
df_coef

From the above, notice the zero values for the Ridge model (row 0) and the Lasso model (row 1). These appear because those features were not used in these models.

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/model_result.csv
model_results = pd.read_csv('model_result.csv', header=None, index_col=0, names=['Model', 'R-Squared', 'RMSE'])

In [None]:
model_results

In [None]:
model_results.sort_values(ascending=True, by='RMSE')

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/Median_Error.csv
median_error = pd.read_csv('Median_Error.csv')

In [None]:
median_error

From the above, notice that the overall results are similar. Lasso is the champion model due to the lowest RMSE and highest R-Squared values, even though the predicted median sell price difference of the Ridge model is the lowest of the three models.
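The two rankings can disagree because RMSE is dominated by large errors while the median ignores them. The sketch below illustrates this with two sets of made-up absolute errors; the lists `errors_a` and `errors_b` and the helper `rmse` are hypothetical and are not the results of the models above.

```python
import math
import statistics

# Hypothetical absolute prediction errors for two models
errors_a = [1, 1, 1, 1, 10]   # mostly small errors, one large miss
errors_b = [3, 3, 3, 3, 3]    # consistently moderate errors

def rmse(errors):
    """Root mean squared error of a list of errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

median_a, median_b = statistics.median(errors_a), statistics.median(errors_b)
rmse_a, rmse_b = rmse(errors_a), rmse(errors_b)

print('Model A: median error', median_a, 'RMSE', rmse_a)
print('Model B: median error', median_b, 'RMSE', rmse_b)
```

In this toy case model A has the lower median error but the higher RMSE, the same kind of split seen between Ridge and Lasso above.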

Plotting the coefficients shows that their values are mostly quite similar across the three models.

In [None]:
df_coef = df_coef.iloc[:, :-1]
plt.plot(df_coef.loc[0], alpha=0.7, linestyle='none', marker='*', markersize=4,
         color='red', label=r'Ridge', zorder=7) 
plt.plot(df_coef.loc[1], alpha=0.5, linestyle='none', marker='d', markersize=6,
         color='blue', label=r'Lasso') 
plt.plot(df_coef.loc[2], alpha=0.4, linestyle='none', marker='o', markersize=8,
         color='green', label='ElasticNet')
plt.xlabel('Coefficient Index', fontsize=10)
plt.ylabel('Coefficient Magnitude', fontsize=10)
plt.legend(title='Model', title_fontsize=15,
           fontsize=13, loc='center left', bbox_to_anchor=(1, 0.5))
plt.xticks(range(0, len(columns_list)), columns_list, rotation=45)
plt.show()

# **7. Further reading on lasso, ridge, and elastic net regression**
https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b

https://medium.com/mlearning-ai/elasticnet-regression-fundamentals-and-modeling-in-python-8668f3c2e39e