<a href="https://colab.research.google.com/github/DLPY/Regression-Session-2/blob/master/Regression_Session_2_Ridge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# House Price Prediction based on Postal Code, Number of Bathrooms, Car Parking and Property Type

Detail on Data: https://www.kaggle.com/mihirhalai/sydney-house-prices

# **1.Import necessary packages for performing EDA and Regression**

In [None]:
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 
import seaborn as sns
from sklearn.linear_model import (Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV)
from sklearn.metrics import (r2_score, mean_squared_error)
from sklearn.model_selection import (RepeatedKFold, train_test_split)
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, StandardScaler)

%matplotlib inline

pd.set_option('display.max_colwidth', None)

## i) Read data from csv file into Pandas dataframe

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/Processed_SydneyHousePrices.csv
df = pd.read_csv('Processed_SydneyHousePrices.csv')

## ii) Isolate Target and Predictor Variables to Different Dataframes

In [None]:
X = df[['postalCode', 'bed', 'bath', 'car', 'propType', 'diffDate', 'Year', 'Month', 'Day', 'Quarter', 'medSellPrice']]
y = df[['sellPrice']]

# Save this list of column values for later
columns_list = list(X.columns.values)

In [None]:
X.head(5)

In [None]:
y.head(5)

# **2.Standardise Features**

Because the size of a linear regression coefficient depends partly on the scale of its feature, and regularised models penalise the combined size of all coefficients (Ridge penalises the sum of their squares), the features must be standardised prior to training.

Features are standardised by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

    z = (x - u) / s

where _u_ is the mean of the training samples and _s_ is the standard deviation of the training samples.
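
As a quick check of the formula, the sketch below standardises a small made-up matrix by hand and compares the result with `StandardScaler`. The values in `X_toy` are invented for illustration; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on very different scales.
X_toy = np.array([[2000.0, 1.0],
                  [2010.0, 2.0],
                  [2020.0, 3.0],
                  [2030.0, 6.0]])

u = X_toy.mean(axis=0)      # per-column mean
s = X_toy.std(axis=0)       # per-column population standard deviation (ddof=0)
z_manual = (X_toy - u) / s  # z = (x - u) / s

z_sklearn = StandardScaler().fit_transform(X_toy)

# Both approaches should give the same standardised values.
print('Manual and StandardScaler agree:', np.allclose(z_manual, z_sklearn))
```

After scaling, each column has mean 0 and standard deviation 1, so no feature dominates the penalty simply because of its units.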

In [None]:
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [None]:
# The scaled values are now stored as an array.
X_std[: 5]

In [None]:
# X_std is already a NumPy array, so y is converted to an array as well (the model expects these as inputs).
y = y.values

# **3.Split dataset into the training and test using train_test_split**

90% - train

10% - test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.1, random_state=23)

In [None]:
print('Training Data:', X_train.shape, y_train.shape)
print('Testing Data:', X_test.shape, y_test.shape)
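
The scaler in this notebook is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The names `X_demo`, `y_demo` and `demo_scaler` are placeholders for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Sydney house price data).
rng = np.random.default_rng(23)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=23)

demo_scaler = StandardScaler()
X_tr_std = demo_scaler.fit_transform(X_tr)  # learn u and s from the training rows only
X_te_std = demo_scaler.transform(X_te)      # reuse the training u and s on the test rows

print('Train shape:', X_tr_std.shape, 'Test shape:', X_te_std.shape)
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centred, which is the expected behaviour.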

# **4.Train, Test and Predict using Ridge regression**

## i) Ridge Regression Model
Create a ridge regression with six candidate alpha values (powers of ten from 1e-3 to 100), evaluated with 5-fold cross-validation.

In [None]:
alpha_list = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]

In [None]:
# Run RidgeCV to get the optimal alpha value.
ridge_cv_regr = RidgeCV(alphas=alpha_list, cv=5)

Fit the ridge regression model to the training set using the fit method, passing the training features and the training target as arguments.

In [None]:
ridge_cv_regr_model = ridge_cv_regr.fit(X_train, y_train)

# View best alpha value
alpha_ = ridge_cv_regr_model.alpha_
print('The optimal alpha value is: {}'.format(alpha_))
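
To make the alpha search less of a black box, the sketch below scores each candidate alpha with 5-fold cross-validation by hand and picks the one with the highest mean R-squared, then compares the choice with `RidgeCV`. It runs on synthetic data from `make_regression`; `X_demo`, `y_demo` and `demo_alphas` are illustrative names, and the selected alpha here says nothing about the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Sydney house price data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=23)
demo_alphas = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]

# Mean 5-fold R-squared for each candidate alpha.
mean_scores = {}
for alpha in demo_alphas:
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo, cv=5)
    mean_scores[alpha] = fold_scores.mean()

manual_best_alpha = max(mean_scores, key=mean_scores.get)
ridge_cv_best_alpha = RidgeCV(alphas=demo_alphas, cv=5).fit(X_demo, y_demo).alpha_

print('Manual search picked alpha =', manual_best_alpha)
print('RidgeCV picked alpha =', ridge_cv_best_alpha)
```

The two choices would normally be expected to agree, since both rank the candidates by their average score across the same five folds.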

In [None]:
feature_importance = pd.Series(index=columns_list, data=np.abs(ridge_cv_regr_model.coef_.ravel()))

n_selected_features = (feature_importance > 0).sum()
print('{0:d} features, reduction of {1:2.2f}%'.format(
    n_selected_features,(1 - n_selected_features / len(feature_importance)) * 100))

feature_importance.sort_values().tail(30).plot(kind='barh', figsize=(18, 6))

In [None]:
# Review the absolute values of coefficients
ridge_coef = pd.DataFrame(abs(ridge_cv_regr_model.coef_))
ridge_coef.columns = columns_list
ridge_coef

From the above, notice that none of the coefficients is zero at alpha=10. Ridge shrinks coefficients towards zero but does not set them exactly to zero.

This means that all of the features are kept in the model.
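
As a rough illustration of why no coefficient reaches zero, the sketch below fits Ridge at three alpha values on synthetic data and records the overall coefficient norm and the smallest absolute coefficient. The data and the alpha values are assumptions chosen for the demonstration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data: only 3 of the 6 features actually drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=10.0, random_state=23)

coef_norms = []
smallest_abs_coefs = []
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(model.coef_))        # overall coefficient size
    smallest_abs_coefs.append(np.abs(model.coef_).min())  # weakest single coefficient
    print('alpha={:>7}: norm={:.3f}, smallest |coef|={:.5f}'.format(
        alpha, coef_norms[-1], smallest_abs_coefs[-1]))
```

The norm falls as alpha grows, yet even the uninformative features keep small non-zero coefficients. Dropping features outright would need an L1 penalty such as Lasso.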

---

In [None]:
# Create the Ridge model using the optimal alpha value.
ridge_regr_model = Ridge(alpha=alpha_)

In [None]:
ridge_regr_model.fit(X_train, y_train)

Regression Coefficients

## ii) Predicting the results

Training set prediction score

In [None]:
y_pred = ridge_regr_model.predict(X_train)

In [None]:
ridge_train_score = ridge_regr_model.score(X_train, y_train)
ridge_train_score

Test set prediction score

In [None]:
y_pred = ridge_regr_model.predict(X_test)

In [None]:
ridge_test_score = ridge_regr_model.score(X_test, y_test)
ridge_test_score

From the above, notice that the test score is slightly higher than the training score (a higher score is better).
This suggests that the model is not overfitting and generalises well to previously unseen data.

# **5.Evaluation metrics - How to Calculate R-Square and RMSE**

In [None]:
print('Ridge Test Data Results:')
y_pred = ridge_regr_model.predict(X_test)

coef_of_determination_ridge = r2_score(y_test, y_pred)
print('R-squared: {}'.format(coef_of_determination_ridge))

rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse_ridge))

From the above, notice that r2_score returns the same value as the model's score method, because score computes R-squared for regression models.
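
For reference, both metrics can be computed directly from their definitions: R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The sketch below checks this against scikit-learn using a handful of made-up prices and predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true_demo = np.array([300.0, 450.0, 500.0, 650.0, 800.0])
y_pred_demo = np.array([320.0, 430.0, 540.0, 600.0, 790.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

r2_sklearn = r2_score(y_true_demo, y_pred_demo)
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print('R-squared matches:', np.isclose(r2_manual, r2_sklearn))
print('RMSE matches:', np.isclose(rmse_manual, rmse_sklearn))
```

RMSE is expressed in the same units as the target, so for house prices it reads as a typical error in currency terms, whereas R-squared is unitless.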

In [None]:
# Review the absolute values of coefficients
ridge_coef = pd.DataFrame(abs(ridge_regr_model.coef_))
ridge_coef.columns = columns_list
ridge_coef

In [None]:
# Display actual prices, predictions and their difference in a table
res = pd.DataFrame({'Price': y_test.ravel(), 'Prediction': y_pred.ravel()})
res['Prediction'] = res['Prediction'].round(0)
res['Difference'] = res['Prediction'] - res['Price']
res.head(5)

In [None]:
# Get the median difference of actual prices and predicted prices
ridge_med_diff = res['Difference'].median()

print('The median difference of actual prices and predicted prices using Ridge: {}'.format(ridge_med_diff))