# Model Training

In this phase, we develop and train a machine learning model on our preprocessed data. This is a critical step in the data science workflow, and our objectives are as follows:
1. Defining Independent & Dependent Variables
2. GridSearchCV (doesn't work efficiently for such large datasets)
3. Linear Regression
4. Random Forest Regression (with Hyperparameter Tuning)
5. Model Testing
6. Saving model to pickle file

**Excel File Utilized: Processed-Delhi-Prices.xlsx**

**Imports**

In [1]:
import pandas as pd
import os

**Creating dataframe for Imported file**

In [2]:
# Load the preprocessed dataset from the current working directory
df = pd.read_excel(os.path.join(os.getcwd(), "Processed-Delhi-Prices.xlsx"))
df

Unnamed: 0,Area,BHK,Bathroom,Price (in Lakhs),Aali Village,Ali Vihar Sarita Vihar,Amrita Shergill Marg,Anand Niketan,Anand Vihar,Ashok Nagar,...,Tuglak Road,Uday Park,Unknown,Uttam Nagar,Vasant Kunj,Vasant Vihar,Vasundhara Enclave,Vikas Puri,Vikaspuri,West End
0,1900,3,2,178.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1500,2,2,175.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1900,2,2,175.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1900,3,3,175.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1600,2,2,174.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21078,700,3,3,32.01,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21079,800,3,3,29.65,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21080,800,3,3,29.00,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21081,600,3,3,25.00,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


**Defining Independent & Dependent Variables**

In [3]:
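# Features: every column except the target. Note that 'Unnamed: 0'
# (the row index saved with the Excel file) is still part of X here.
X = df.drop('Price (in Lakhs)', axis='columns')
# Target: the property price in lakhs
y = df['Price (in Lakhs)']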

**GridSearchCV**

The GridSearchCV function from scikit-learn performs a grid search with cross-validation: it systematically evaluates candidate algorithms and their hyperparameter settings to find the best combination for a machine learning model. It does not work efficiently on a dataset of this size, however, so the models below are trained directly.
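To illustrate the idea, here is a minimal sketch of a grid search over two candidate regressors. It runs on a small synthetic dataset rather than the Delhi data, and the `candidates` dictionary, the parameter grids, and the names `X_demo` and `y_demo` are assumptions made for this example, not settings taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing features (assumption: not the real data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Hypothetical candidates: each entry pairs an estimator with a parameter grid.
candidates = {
    "linear_regression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [10, 25], "max_depth": [3, None]},
    ),
}

results = []
for name, (estimator, param_grid) in candidates.items():
    # 3-fold cross-validation; with no `scoring` argument, regressors are
    # ranked by their own score() method, which is R^2.
    search = GridSearchCV(estimator, param_grid, cv=3)
    search.fit(X_demo, y_demo)
    results.append((name, search.best_score_, search.best_params_))

for name, best_score, best_params in results:
    print(f"{name}: best CV R^2 = {best_score:.3f} with {best_params}")
```

The cost grows with the number of parameter combinations times the number of folds, which is why a search like this becomes slow on large datasets with expensive estimators.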

**Train Test Split**

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Linear Regression**

In [5]:
from sklearn.linear_model import LinearRegression

model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

# score() returns the R^2 (coefficient of determination) on the test set
score = model_lr.score(X_test, y_test)
print("Score:", score)

Score: 0.830310878548488
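For reference, the value returned by `score` on a regressor is the R^2 statistic, defined as 1 - SS_res / SS_tot. The sketch below checks this by hand on synthetic data; the data and the names `X_demo`, `y_demo`, and `r2_manual` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption): one feature resembling an area in sq ft.
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(100, 1))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 10, size=100)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same value via the estimator's score() and via r2_score()
r2_from_score = model.score(X_demo, y_demo)
r2_from_metric = r2_score(y_demo, y_pred)

# RMSE is in the same unit as the target, which makes it easier to read
rmse = np.sqrt(mean_squared_error(y_demo, y_pred))

print("Manual R^2:", r2_manual)
print("score() R^2:", r2_from_score)
print("RMSE:", rmse)
```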


**Random Forest Regression**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

model_rf = RandomForestRegressor(n_estimators=750)
model_rf.fit(X_train, y_train)

y_pred = model_rf.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

r2 = r2_score(y_test, y_pred)
print("R-squared (R^2) Score:", r2)

While our objective is to predict property prices, it's crucial to use the model that provides the most accurate predictions. Based on the scores alone, the Random Forest Regression model, with an R-squared (R^2) score of 0.8383, may look like a better choice than Linear Regression (0.8303). However, testing shows that the Linear Regression model predicts prices that are closer to real-life prices, so it is the model used in the steps that follow.
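One way to make such a comparison more direct is to compute the same metrics for both models on the same test split. The following is a minimal sketch on synthetic data; the `evaluate` helper, the reduced `n_estimators=50`, and the demo dataset are assumptions made for this example and do not reproduce the scores reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Delhi dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=15.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)


def evaluate(model, X_eval, y_eval):
    """Hypothetical helper: return R^2, MSE, and MAE for a fitted model."""
    y_hat = model.predict(X_eval)
    return {
        "r2": r2_score(y_eval, y_hat),
        "mse": mean_squared_error(y_eval, y_hat),
        "mae": mean_absolute_error(y_eval, y_hat),
    }


models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
report = {
    name: evaluate(model.fit(X_tr, y_tr), X_te, y_te)
    for name, model in models.items()
}

for name, metrics in report.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Mean absolute error is included because it is expressed in the unit of the target (lakhs, in this notebook), which can be easier to relate to real-life prices than R^2.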

**Test the model with a few properties**

In [None]:
import numpy as np

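def predict_price(location, sqft, bath, bhk):
    # Build one feature row in the same column order as X. Columns are
    # looked up by name, so the row stays correct whatever their position.
    x = np.zeros(len(X.columns))
    x[X.columns.get_loc('Area')] = sqft
    x[X.columns.get_loc('Bathroom')] = bath
    x[X.columns.get_loc('BHK')] = bhk
    # Set the one-hot location column only if the location is known
    if location in X.columns:
        x[X.columns.get_loc(location)] = 1

    # Pass a DataFrame so the feature names match those seen during fit
    return model_lr.predict(pd.DataFrame([x], columns=X.columns))[0]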

In [None]:
predict_price('Greater Kailash', 1800, 4, 4)

In [None]:
predict_price('Kalkaji', 1800, 4, 4)

In [None]:
predict_price('Kalkaji', 900, 2, 2)

In [None]:
predict_price('Kalkaji', 900, 2, 3)
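The input row passed to the model has to follow the column order used during training: the numeric features first, then one column per location with a 1 in the matching column. As a small illustration, the sketch below builds such a row from a toy column list. The `build_feature_row` helper and `toy_columns` are hypothetical names for this example, and raising an error for an unknown location is a design choice made here, not behaviour taken from the notebook.

```python
def build_feature_row(columns, location, sqft, bath, bhk):
    """Return feature values aligned with `columns` (hypothetical helper)."""
    if location not in columns:
        # Fail loudly instead of silently predicting without a location
        raise ValueError(f"Unknown location: {location!r}")
    values = {"Area": sqft, "BHK": bhk, "Bathroom": bath, location: 1}
    # Any column not mentioned above (all other locations) gets 0
    return [values.get(col, 0) for col in columns]


# Toy column list (assumption): three numeric features and two locations.
toy_columns = ["Area", "BHK", "Bathroom", "Kalkaji", "Vasant Kunj"]

row = build_feature_row(toy_columns, "Kalkaji", 900, 2, 3)
print(row)  # [900, 3, 2, 1, 0]
```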

**Export the tested model to a Pickle File**

In [None]:
import pickle
filename = 'Linear-Regression-Model-Delhi-Prices.pkl'

with open(filename, 'wb') as file:
    pickle.dump(model_lr, file)

print(f"Linear Regression model saved to {filename}")
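Before relying on the exported file, it can be worth checking that a model loaded back from pickle gives the same predictions as the original. The sketch below does this round trip with a tiny model and a temporary file; the toy data and the file name `demo-model.pkl` are assumptions made for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data (assumption): [area, bedrooms] -> price
X_demo = np.array([[600.0, 1], [900.0, 2], [1200.0, 2], [1800.0, 4]])
y_demo = np.array([25.0, 40.0, 55.0, 90.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")

    # Save the fitted model
    with open(path, "wb") as file:
        pickle.dump(model, file)

    # Load it back, as a separate application would
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and they are best loaded with the same scikit-learn version that was used to save them.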

**Export the column names (including the locations) to a file that will be used later in our Prediction App**

In [None]:
import json

columns = {
    'data_columns': list(X.columns)
}

with open("columns.json", "w") as f:
    json.dump(columns, f)
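On the application side, the saved file can be read back to recover the column order and the list of known locations. The sketch below shows one possible way to do that with a toy column list written to a temporary file; the `numeric_columns` set and the toy contents are assumptions made for this example, and the real file contains whatever columns `X` holds.

```python
import json
import os
import tempfile

# Toy stand-in for the exported file (assumption: the real list is longer).
toy_columns = {
    "data_columns": ["Area", "BHK", "Bathroom", "Kalkaji", "Vasant Kunj"]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "columns.json")

    with open(path, "w") as f:
        json.dump(toy_columns, f)

    # Read the column order back, as the prediction app would
    with open(path) as f:
        data_columns = json.load(f)["data_columns"]

# Everything that is not a numeric feature is treated as a location column
numeric_columns = {"Area", "BHK", "Bathroom"}
locations = [col for col in data_columns if col not in numeric_columns]

print(locations)  # ['Kalkaji', 'Vasant Kunj']
```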