# Modelling.

**Objectives**:

1. **Selecting Suitable models**: Selecting 2-3 Models to cross validate.

2. **Evaluating**: Cross validating models to check which one works best.

3. **Fine tuning**: Fine tuning hyperparameters with a cross-validated search (ElasticNetCV).

4. **Performing Test on test data**: Checking the finalized model on the test data to see whether it generalizes well enough.

5. **Preparing final product**: Training the final model on the whole dataset and saving it for use in the website.

## Dependencies.

In [93]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNet, ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
import pickle

## Selecting Suitable Model and Evaluating.

Given the nature of the prediction task and the dataset, three regression models are worth trying:

1. Multiple linear regression.
2. ElasticNet (by configuring parameters using ElasticNetCV).
3. Random Forest Regressor.

In [94]:
# Preparing training data.
training_data = pd.read_csv('train_dataset', index_col=False)

# Removing outliers. (Values of area above 10500)
training_data = training_data[training_data['area'] <= 10500]

# Splitting Features and Target Variables.
X_train = training_data.drop(columns=['price'])
y_train = training_data['price']

# Converting DataFrames and series to NumPy array.
X_train = np.array(X_train)
y_train = np.array(y_train)

In [95]:
# Cross Validation on Linear Regression Model.

linear_regressor = LinearRegression()

linear_regressor_score = cross_val_score(
    estimator=linear_regressor,
    X=X_train,
    y=y_train,
    cv=7,
    scoring='r2'
)

linear_regressor_score.mean()

0.5954465173529986

ElasticNet is worth trying because, given a good set of hyperparameters, its regularization automatically reduces the influence of weakly correlated features that would otherwise add noise.
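As a rough illustration of that shrinkage effect, the sketch below fits `LinearRegression` and `ElasticNet` on synthetic data (not the housing dataset) in which only the first two of six features drive the target. The data, the true coefficients, and the `alpha` value are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic data (an assumption for illustration): 6 features, only 2 informative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
enet = ElasticNet(alpha=0.5, l1_ratio=0.9).fit(X_demo, y_demo)

# The penalty shrinks coefficients toward zero; its L1 part can zero some out entirely.
ols_norm = np.abs(ols.coef_).sum()
enet_norm = np.abs(enet.coef_).sum()
n_zeroed = int(np.sum(enet.coef_ == 0.0))

print("OLS coefficients:       ", np.round(ols.coef_, 3))
print("ElasticNet coefficients:", np.round(enet.coef_, 3))
print("Coefficients set exactly to zero by ElasticNet:", n_zeroed)
```

The informative features keep sizeable coefficients while the uninformative ones are pushed to zero, which is the behaviour hoped for on the housing features.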

In [96]:
# Cross Validation on ElasticNet.

elasticnet = ElasticNet(alpha=1, l1_ratio=0.5)

elasticnet_score = cross_val_score(
    estimator=elasticnet,
    X=X_train,
    y=y_train,
    cv=7,
    scoring='r2'
)

elasticnet_score.mean()

0.5164486046232961

Cross validating Random Forest Regressor.

In [97]:
random_forest = RandomForestRegressor() # We are using default settings of Random Forest.

random_forest_score = cross_val_score(
    estimator=random_forest,
    X=X_train,
    y=y_train,
    cv=7,
    scoring='r2'
)

random_forest_score.mean()

0.5504484130425923

Multiple linear regression scored highest, with a mean R² of 0.5954.

Random Forest came in between, at 0.5504.

ElasticNet scored lowest, at 0.5164, but it was run with untuned hyperparameters (alpha=1, l1_ratio=0.5). Tuning hyperparameters can increase a model's predictive power, so the next step is to fine tune ElasticNet and check whether it beats multiple linear regression.
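The three cross-validation runs above follow the same pattern, so they could be collapsed into one loop. The sketch below does that on synthetic data from `make_regression`, used here as a stand-in for the housing data; the `candidates` dictionary, the data shape, and the reduced `n_estimators` are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=12, noise=20.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "elasticnet": ElasticNet(alpha=1, l1_ratio=0.5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Same folds and same metric for every candidate.
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=7, scoring="r2")
    mean_scores[name] = fold_scores.mean()
    print(f"{name:>14}: mean R^2 = {fold_scores.mean():.4f} (std {fold_scores.std():.4f})")

best_name = max(mean_scores, key=mean_scores.get)
print("Best by mean R^2:", best_name)
```

Printing the standard deviation next to the mean gives a sense of how much the ranking depends on the particular folds.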

## Fine tuning models.

Using ElasticNetCV to get the best combination of hyperparameters.

In [98]:
elastic_net_cv = ElasticNetCV(
    alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10],
    l1_ratio=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    cv=5,
    random_state=42,
    max_iter=100000
)

elastic_net_cv.fit(X=X_train, y=y_train)

elastic_net_cv.alpha_, elastic_net_cv.l1_ratio_


(0.1, 0.9)

So the best combination of hyperparameters is:

alpha = 0.1

l1_ratio = 0.9
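`ElasticNetCV` is specific to this estimator. The same kind of search can also be written with the general-purpose `GridSearchCV`, as sketched below on synthetic data. The reduced grid and the data are assumptions made to keep the example fast, and the two approaches may not pick identical values, since `ElasticNetCV` selects by mean squared error while this search scores by R².

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# A smaller grid than the one used with ElasticNetCV, to keep the sketch quick.
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1, 10],
    "l1_ratio": [0.1, 0.5, 0.9, 1.0],
}

search = GridSearchCV(
    estimator=ElasticNet(max_iter=100000, random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV R^2:", round(search.best_score_, 4))
```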

In [99]:
# Using best fit hyperparameters and cross validating the model.

elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.9)

elasticnet_score = cross_val_score(
    estimator=elasticnet,
    X=X_train,
    y=y_train,
    cv=7,
    scoring='r2'
)

elasticnet_score.mean()

0.5981126171318107

The difference is small (0.5981 versus 0.5954 for multiple linear regression), but it is in ElasticNet's favour, so ElasticNet may be the better choice.
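Because the two mean scores are so close, it can help to compare the models fold by fold rather than by the mean alone. The sketch below scores both models on identical splits and reports the per-fold difference. It uses synthetic data, and the explicit `KFold` object with shuffling is an assumption rather than the setup used in the cells above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=12, noise=25.0, random_state=1)

# A fixed KFold object makes both models see exactly the same splits.
folds = KFold(n_splits=7, shuffle=True, random_state=42)

linear_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")
enet_scores = cross_val_score(
    ElasticNet(alpha=0.1, l1_ratio=0.9), X_demo, y_demo, cv=folds, scoring="r2"
)

diff = enet_scores - linear_scores
print("Per-fold difference (ElasticNet - linear):", np.round(diff, 4))
print(f"Mean difference: {diff.mean():.4f}, std across folds: {diff.std():.4f}")
```

If the mean difference is small relative to its spread across folds, the two models are probably not meaningfully different on this data.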

## Evaluating model on test data.

In [100]:
# Preparing test data.

test_data = pd.read_csv('test_dataset')

# Performing necessary transformations on dataset.

# mainroad, guestroom, basement, hotwaterheating, airconditioning and prefarea
# are binary categorical variables ('yes'/'no') and can be label encoded as 1/0.
categorical_variables = ['mainroad','guestroom','basement','hotwaterheating','airconditioning','prefarea']

for column in categorical_variables:
    test_data[column] = test_data[column].apply(lambda x: 1 if x == 'yes' else 0)

# furnishingstatus needs one-hot encoding (done here with pd.get_dummies).
test_data = pd.get_dummies(test_data, columns=['furnishingstatus'], drop_first=False, dtype=int)

# Removing outliers. (Values of area above 10500)
test_data = test_data[test_data['area'] <= 10500]

# Splitting Features and Target Variables.
X_testing = test_data.drop(columns=['price'])
y_testing = test_data['price']

# Converting DataFrames and series to NumPy array.
X_testing = np.array(X_testing)
y_testing = np.array(y_testing)
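The same transformations are needed again for the full dataset later, so they could be wrapped in one function. The sketch below is one way to do that. `preprocess_housing`, `BINARY_COLUMNS`, and `expected_columns` are hypothetical names, the furnishing category names are assumed, and the tiny DataFrame is made-up data that only mimics part of the column layout. Reindexing against a fixed column list guards against a furnishing category being absent from a small sample, which would otherwise change the number of dummy columns.

```python
import pandas as pd

BINARY_COLUMNS = ["mainroad", "guestroom", "basement",
                  "hotwaterheating", "airconditioning", "prefarea"]

def preprocess_housing(df, expected_columns=None, max_area=10500):
    """Encode yes/no columns, one-hot encode furnishingstatus, drop area outliers."""
    out = df.copy()
    for column in BINARY_COLUMNS:
        out[column] = (out[column] == "yes").astype(int)
    out = pd.get_dummies(out, columns=["furnishingstatus"], dtype=int)
    out = out[out["area"] <= max_area]
    if expected_columns is not None:
        # Add any missing dummy columns as zeros and enforce a fixed column order.
        out = out.reindex(columns=expected_columns, fill_value=0)
    return out

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "price": [4200000, 9100000, 3500000],
    "area": [3600, 12000, 4000],
    "mainroad": ["yes", "yes", "no"],
    "guestroom": ["no", "yes", "no"],
    "basement": ["no", "no", "yes"],
    "hotwaterheating": ["no", "no", "no"],
    "airconditioning": ["yes", "yes", "no"],
    "prefarea": ["no", "yes", "no"],
    "furnishingstatus": ["furnished", "furnished", "unfurnished"],
})

expected = ["price", "area"] + BINARY_COLUMNS + [
    "furnishingstatus_furnished",
    "furnishingstatus_semi-furnished",   # assumed category name
    "furnishingstatus_unfurnished",
]
clean = preprocess_housing(sample, expected_columns=expected)
print(clean)
```

Using one function for the training, test, and full datasets also keeps the column order identical, which matters once the data is converted to plain NumPy arrays.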

In [101]:
# Training Model.

model = ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=100000, random_state=42)

model.fit(X=X_train, y=y_train)

ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=100000, random_state=42)


In [102]:
# Checking final score.

model.score(X=X_testing, y=y_testing)

0.6433188700493281

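The test R² of 0.6433 is slightly higher than the cross-validated mean of 0.5981, so the model has generalized to unseen data without signs of overfitting.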
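R² is unitless, so it does not say how far off a typical prediction is in price terms. A complementary check is to report errors in the target's own units. The sketch below computes MAE and RMSE alongside R² on synthetic data; on the real data the same metric calls would presumably take `y_testing` and `model.predict(X_testing)`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/test split (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=12, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=100000, random_state=42)
demo_model.fit(X_tr, y_tr)
predictions = demo_model.predict(X_te)

r2 = r2_score(y_te, predictions)                        # same value as demo_model.score
mae = mean_absolute_error(y_te, predictions)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_te, predictions))   # penalizes large errors more

print(f"R^2: {r2:.4f}  MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```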

## Conclusion.

The ElasticNet model with alpha=0.1 and l1_ratio=0.9 is best suited to be integrated in the model section of the Housing Price Prediction website: it had the highest cross-validated R² of the models tried (0.5981) and scored 0.6433 on the test data.

## Final training on whole data set and saving model as .pickle file.

In [103]:
# Importing whole dataset.
dataset = pd.read_csv('Housing.csv')

# Transforming whole dataset for model fitting.

# mainroad, guestroom, basement, hotwaterheating, airconditioning and prefarea
# are binary categorical variables ('yes'/'no') and can be label encoded as 1/0.
categorical_variables = ['mainroad','guestroom','basement','hotwaterheating','airconditioning','prefarea']

for column in categorical_variables:
    dataset[column] = dataset[column].apply(lambda x: 1 if x == 'yes' else 0)

# furnishingstatus needs one-hot encoding (done here with pd.get_dummies).
dataset = pd.get_dummies(dataset, columns=['furnishingstatus'], drop_first=False, dtype=int)

# Removing outliers. (Values of area above 10500)
dataset = dataset[dataset['area'] <= 10500]

# Splitting Features and Target Variables.
X_final = dataset.drop(columns=['price'])
y_final = dataset['price']

# Converting DataFrames and series to NumPy array.
X_final = np.array(X_final)
y_final = np.array(y_final)

In [104]:
# Training model.
final_model = ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=10000, random_state=42)

final_model.fit(X=X_final, y=y_final)

ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=10000, random_state=42)


In [105]:
# Saving model as .pickle file.
with open('model.pkl', 'wb') as f:
    pickle.dump(final_model, f)
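On the website side, the saved file can be loaded back with `pickle.load` and used for prediction. The sketch below round-trips a small model through a temporary file to show the pattern. The synthetic data and the temporary path are assumptions; in the real application the path would be `model.pkl`, and the input row would need the same column order used during training. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic stand-in for the full dataset (assumption).
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
trained = ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=10000, random_state=42)
trained.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Save the fitted model.
    with open(path, "wb") as f:
        pickle.dump(trained, f)
    # Load it back, as the website would at startup.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# predict expects a 2-D array: one row per house, one column per feature.
new_row = X_demo[:1]
prediction = loaded.predict(new_row)
print("Predicted value:", prediction[0])
```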