# Regression Regularization: Real estate price

In this notebook, I apply regularized regression to the [Real estate price prediction](https://www.kaggle.com/quantbruce/real-estate-price-prediction) dataset.

This dataset has 8 columns: a row number column (`No`), six feature columns (X1 to X6), and the target column "Y house price of unit area". The features X1 to X6 are used to predict the house price per unit area.

## Introduction
A major concern when training a machine learning model is avoiding overfitting. An overfit model fits the training data well but has low accuracy on unseen data. This happens because the model tries too hard to capture the noise in the training dataset. By noise we mean data points that reflect random chance rather than the true properties of the data. Fitting such points requires a more flexible model, which increases the risk of overfitting.

The concept of balancing bias and variance is helpful in understanding overfitting.
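
To make the bias and variance trade-off concrete, here is a minimal sketch on synthetic data (the data, the `_demo` names, and the chosen degrees are assumptions for illustration, not part of the real estate dataset). It fits polynomial models of increasing degree and records training and test error; the training error can only go down as the model gets more flexible, while the test error often starts to rise again.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a smooth curve plus random noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=(60, 1))
y_demo = np.sin(x_demo).ravel() + rng.normal(scale=0.3, size=60)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.5, random_state=0)

train_errors = {}
test_errors = {}
for degree in (1, 3, 9):
    # Higher degree = more flexible model
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    model.fit(x_tr, y_tr)
    train_errors[degree] = mean_squared_error(y_tr, model.predict(x_tr))
    test_errors[degree] = mean_squared_error(y_te, model.predict(x_te))
    print(f"degree={degree}  train MSE={train_errors[degree]:.3f}  test MSE={test_errors[degree]:.3f}")
```

A large gap between training and test error at high degree is the usual sign of overfitting.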

Cross-validation is one way to guard against overfitting: it estimates the error on unseen data and helps decide which hyperparameters work best for your model.
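
As a rough sketch of the idea, the snippet below uses scikit-learn's `cross_val_score` to compare a few candidate regularization strengths on synthetic data. The data from `make_regression`, the candidate `alpha` values, and the `_demo` names are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumption, not the real estate data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

cv_mae = {}
for alpha in (0.1, 1.0, 10.0):
    # 5-fold CV; scikit-learn returns negated errors so that higher is better
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             cv=5, scoring="neg_mean_absolute_error")
    cv_mae[alpha] = -scores.mean()

# Pick the alpha with the lowest average validation error
best_alpha = min(cv_mae, key=cv_mae.get)
print(cv_mae, best_alpha)
```

Each candidate is scored only on folds it was not trained on, so the comparison reflects performance on unseen data rather than fit to the training set.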

## Regularization
Regularization is a form of regression that constrains (regularizes) the coefficient estimates by shrinking them towards zero. In other words, it discourages learning a more complex or flexible model in order to reduce the risk of overfitting.

For a fuller treatment of regularization, read [this article](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a).
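
The shrinking effect can be seen in a small sketch. scikit-learn's `Ridge` minimizes the squared error plus `alpha` times the sum of squared coefficients, so a larger `alpha` pulls the coefficients closer to zero. The synthetic data and the `alpha` grid below are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), standardized so the penalty treats all features alike
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=1)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for alpha in (0.01, 1.0, 100.0, 10000.0):
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Overall size of the coefficient vector for this penalty strength
    coef_norms[alpha] = np.linalg.norm(ridge_demo.coef_)
    print(f"alpha={alpha:>8}  ||coef||={coef_norms[alpha]:.2f}")
```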

#### Import all Necessary Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Import the Data

In [None]:
df = pd.read_csv('../input/real-estate-price-prediction/Real estate.csv')

#### Take a Look at the Dataset

In [None]:
df.head()

In [None]:
df.info()

#### EDA:

In [None]:
# Columns to plot against each other: the six features plus the target
plot_columns = ["X1 transaction date",
                "X2 house age",
                "X3 distance to the nearest MRT station",
                "X4 number of convenience stores",
                "X5 latitude",
                "X6 longitude",
                "Y house price of unit area"]
sns.pairplot(data=df, x_vars=plot_columns, y_vars=plot_columns)

#### Determine the Features & Target Variable (Label)

In [None]:
# Features:
X = df.drop(['Y house price of unit area'  , 'No'] , axis = 1)
# Label:
y = df['Y house price of unit area']

#### Preprocessing (Polynomial Conversion)

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Expand the 6 features into all polynomial and interaction terms up to degree 3
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)
poly_features = polynomial_converter.fit_transform(X)
poly_features.shape

#### Split the Data into Train & Test

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

#### Scaling the Data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Fit the scaler on the training set only, so no test-set statistics leak into training
scaler = StandardScaler()
scaler.fit(X_train)

In [None]:
X_train= scaler.transform(X_train)
X_test= scaler.transform(X_test)
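
The sketch below illustrates what fitting the scaler on the training set alone means in practice. It uses random data and `_demo` names that are assumptions, not the notebook's variables. The training columns come out with mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized because they are transformed with the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random features (assumption) with a non-zero mean and non-unit scale
rng = np.random.default_rng(42)
features_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

train_demo, test_demo = train_test_split(features_demo, test_size=0.3, random_state=101)

# Mean and standard deviation are learned from the training rows only
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```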

### Regularization

#### 1- Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge_model= Ridge(alpha=10)
ridge_model.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = ridge_model.predict(X_test)

In [None]:
# Evaluate the model on the test set
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE = mean_absolute_error(y_test, y_pred)   # average absolute error
MSE = mean_squared_error(y_test, y_pred)    # average squared error
RMSE = np.sqrt(MSE)                         # same units as the target

In [None]:
pd.DataFrame([MAE, MSE, RMSE], index=['MAE', 'MSE', 'RMSE'], columns=['metrics'])
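
For reference, the three metrics can be computed by hand on a tiny example. The numbers below are made up for illustration and are not results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted prices
y_true_demo = np.array([30.0, 40.0, 50.0])
y_pred_demo = np.array([28.0, 43.0, 50.0])

errors = y_true_demo - y_pred_demo      # [ 2., -3.,  0.]
mae_demo = np.mean(np.abs(errors))      # (2 + 3 + 0) / 3 = 1.667
mse_demo = np.mean(errors ** 2)         # (4 + 9 + 0) / 3 = 4.333
rmse_demo = np.sqrt(mse_demo)           # sqrt(4.333) = 2.082

# The manual values agree with scikit-learn's functions
print(mae_demo, mean_absolute_error(y_true_demo, y_pred_demo))
print(mse_demo, mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_demo)
```

MSE and RMSE penalize the single large miss more heavily than MAE does, which is why RMSE is larger than MAE here.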

#### Ridge Regression (Choosing an alpha value with Cross-Validation)

In [None]:
# RidgeCV selects alpha by cross-validation
from sklearn.linear_model import RidgeCV

In [None]:
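# Try three candidate alphas; with the default cv=None, RidgeCV uses efficient leave-one-out cross-validation
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')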

In [None]:
ridge_cv_model.fit(X_train, y_train)

In [None]:
ridge_cv_model.alpha_

In [None]:
# Predict on the test data
y_pred_ridge = ridge_cv_model.predict(X_test)

In [None]:
MAE_ridge= mean_absolute_error(y_test, y_pred_ridge)
MSE_ridge= mean_squared_error(y_test, y_pred_ridge)
RMSE_ridge= np.sqrt(MSE_ridge)

In [None]:
pd.DataFrame([MAE_ridge, MSE_ridge, RMSE_ridge], index=['MAE', 'MSE', 'RMSE'], columns=['Ridge Metrics'])

In [None]:
ridge_cv_model.coef_

#### 2- Lasso Regression

In [None]:
from sklearn.linear_model import LassoCV

In [None]:
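# eps is the ratio alpha_min / alpha_max of the searched path; 100 alphas are tried with 5-fold CV
lasso_cv_model = LassoCV(eps=0.01, n_alphas=100, cv=5)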
lasso_cv_model.fit(X_train, y_train)

In [None]:
lasso_cv_model.alpha_

In [None]:
y_pred_lasso= lasso_cv_model.predict(X_test)

In [None]:
MAE_Lasso= mean_absolute_error(y_test, y_pred_lasso)
MSE_Lasso= mean_squared_error(y_test, y_pred_lasso)
RMSE_Lasso= np.sqrt(MSE_Lasso)
pd.DataFrame([MAE_Lasso, MSE_Lasso, RMSE_Lasso], index=['MAE', 'MSE', 'RMSE'], columns=['Lasso Metrics'])

In [None]:
lasso_cv_model.coef_
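
A characteristic of Lasso worth checking in the coefficients is sparsity: the L1 penalty can set some coefficients to exactly zero, whereas Ridge only shrinks them. The sketch below shows this on synthetic data where only 5 of 30 features carry signal. The data, the `alpha` value, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 30 features, only 5 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
print(f"Lasso zero coefficients: {lasso_zeros} of {lasso_demo.coef_.size}")
print(f"Ridge zero coefficients: {ridge_zeros} of {ridge_demo.coef_.size}")
```

The same count, `np.sum(lasso_cv_model.coef_ == 0)`, can be applied to the fitted model above to see how many of the polynomial features Lasso dropped.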

#### 3- Elastic Net

In [None]:
from sklearn.linear_model import ElasticNetCV

In [None]:
# l1_ratio mixes the two penalties: values near 0 lean towards Ridge (L2), 1 is pure Lasso (L1)
elastic_model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], cv=5, max_iter=100000)
elastic_model.fit(X_train, y_train)

In [None]:
elastic_model.l1_ratio_

In [None]:
y_pred_elastic=elastic_model.predict(X_test)

In [None]:
MAE_Elastic= mean_absolute_error(y_test, y_pred_elastic)
MSE_Elastic= mean_squared_error(y_test, y_pred_elastic)
RMSE_Elastic= np.sqrt(MSE_Elastic)
pd.DataFrame([MAE_Elastic, MSE_Elastic, RMSE_Elastic], index=['MAE', 'MSE', 'RMSE'], columns=['Elastic Metrics'])

In [None]:
elastic_model.coef_
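
To help interpret the selected `l1_ratio_`, here is a short sketch of how Elastic Net relates to Lasso. In scikit-learn, the Elastic Net penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, so `l1_ratio=1` reduces to the Lasso penalty. The synthetic data and the `alpha` value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumption, not the real estate data)
X_demo, y_demo = make_regression(n_samples=150, n_features=12, n_informative=4,
                                 noise=5.0, random_state=3)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)
enet_mixed = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)

# With l1_ratio=1 the Elastic Net solution matches Lasso
same_as_lasso = np.allclose(lasso_demo.coef_, enet_as_lasso.coef_)
print("ElasticNet(l1_ratio=1) equals Lasso:", same_as_lasso)

# A mixed penalty generally gives different coefficients
print("Non-zero coefficients, Lasso:", int(np.sum(lasso_demo.coef_ != 0)))
print("Non-zero coefficients, mixed:", int(np.sum(enet_mixed.coef_ != 0)))
```

So if `elastic_model.l1_ratio_` comes out as 1, the cross-validated Elastic Net has effectively chosen a Lasso model.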