# Model Construction and Evaluation

## Objectives

- Build different regression models
- Train and test regression models

In [110]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [4]:
# Import dataset
df = pd.read_csv('ModelCarData.csv')
df.drop(['Unnamed: 0'],axis=1,inplace=True)
df.head()

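| | selling_price | km_driven | max_power(bhp) | fuel | owner |
| --- | --- | --- | --- | --- | --- |
| 0 | 450000 | 145500 | 74.0 | 1 | 1 |
| 1 | 370000 | 120000 | 103.52 | 1 | 2 |
| 2 | 158000 | 140000 | 78.0 | 2 | 3 |
| 3 | 225000 | 127000 | 90.0 | 1 | 1 |
| 4 | 130000 | 120000 | 88.2 | 2 | 1 |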


For fuel variable:
- 1: Diesel
- 2: Petrol
- 3: CNG
- 4: LPG

For owner variable:
- 1: First Owner
- 2: Second Owner
- 3: Third Owner
- 4: Fourth & Above Owner
- 5: Test Drive Car
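
As a small illustration of the two code lists above, the sketch below turns the integer codes back into their labels. The dictionary and function names (`FUEL_LABELS`, `OWNER_LABELS`, `decode_row`) are hypothetical and not part of the original notebook; the sample codes are the fuel and owner values of the first three preview rows.

```python
# Lookup tables copied from the code lists above
FUEL_LABELS = {1: "Diesel", 2: "Petrol", 3: "CNG", 4: "LPG"}
OWNER_LABELS = {
    1: "First Owner",
    2: "Second Owner",
    3: "Third Owner",
    4: "Fourth & Above Owner",
    5: "Test Drive Car",
}


def decode_row(fuel_code, owner_code):
    """Return the (fuel, owner) labels for a pair of integer codes."""
    return FUEL_LABELS[fuel_code], OWNER_LABELS[owner_code]


# (fuel, owner) codes of the first three rows in the preview
sample_codes = [(1, 1), (1, 2), (2, 3)]
decoded = [decode_row(fuel, owner) for fuel, owner in sample_codes]

for codes, labels in zip(sample_codes, decoded):
    print(codes, "->", labels)
```

On the DataFrame itself, the same lookup could be applied with `df['fuel'].map(FUEL_LABELS)`, since pandas `Series.map` accepts a dictionary.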

## Model Construction

Defining the target Y (selling price) and the feature matrix X used by all models.

In [77]:
Y = df['selling_price']
X = df[['km_driven','max_power(bhp)','fuel','owner']]
X = np.asarray(X)

### 1. Linear Regression Model

Building training and testing sets, with 70% and 30% of data respectively.

In [84]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.3,random_state=43)
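
To make the 70/30 split concrete, here is a minimal sketch of a shuffled hold-out split written with NumPy only. It is an illustration of the idea, not scikit-learn's implementation: the helper name `shuffle_split` is hypothetical, and the rows it selects will differ from those chosen by `train_test_split` even with the same seed.

```python
import math

import numpy as np


def shuffle_split(n_samples, test_size=0.3, seed=43):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)         # shuffled row positions
    n_test = math.ceil(test_size * n_samples)  # rows held out for testing
    return order[n_test:], order[:n_test]


train_idx, test_idx = shuffle_split(10)
print("train rows:", len(train_idx), "test rows:", len(test_idx))
```

Every row lands in exactly one of the two sets, so the test rows are never seen during training.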

Building and training a multiple linear regression model.

In [85]:
lr = LinearRegression()
lrm = lr.fit(x_train,y_train)

Predicting selling prices for the test set.

In [86]:
lr_preds = lrm.predict(x_test)

In [87]:
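lr_r2 = r2_score(y_test,lr_preds)
# Root mean squared error (RMSE), in the same units as selling_price.
# np.sqrt replaces the squared=False argument, which is deprecated since scikit-learn 1.4.
lr_rmse = np.sqrt(mean_squared_error(y_test,lr_preds))

print("R2 score: ",lr_r2)
print("Root mean squared error: ",lr_rmse)

R2 score:  0.4830084805447066
Root mean squared error:  386231.54882058897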


The R2 score is about 0.48, so the model explains less than half of the variance in selling price on the test set. Its predictions are therefore not reliable.
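
For reference, the two metrics printed above can be computed by hand. The sketch below uses made-up toy values (not the car data) and a hypothetical helper `r2_and_rmse`; it follows the usual definitions, R2 = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared residual).

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Compute R2 and root mean squared error from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return r2, rmse


# Toy values: every prediction is off by exactly 10
r2, rmse = r2_and_rmse([100, 200, 300, 400], [110, 190, 310, 390])
print("R2:", r2, "RMSE:", rmse)
```

R2 is unitless, while the RMSE is expressed in the units of the target, so the error values in this notebook are in the same currency as selling_price.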

### 2. Polynomial regression

Expanding the features into polynomial terms of degree 2, then defining training and testing sets.

In [106]:
# Expand the 4 features into all polynomial terms up to degree 2
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(X)

poly_x_train, poly_x_test, y_train, y_test = train_test_split(x_poly,Y,test_size=0.3,random_state=43)
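
To see what the degree-2 expansion adds, the sketch below enumerates the terms by hand: a constant, the 4 original features, and every pairwise product including squares. This assumes the `PolynomialFeatures` defaults (bias column included, not interaction-only); the order of terms here is illustrative and is not guaranteed to match scikit-learn's column order.

```python
from itertools import combinations_with_replacement
from math import comb

features = ["km_driven", "max_power(bhp)", "fuel", "owner"]
degree = 2

# Collect every product of features with total degree 0, 1 and 2
terms = []
for d in range(degree + 1):
    for combo in combinations_with_replacement(features, d):
        terms.append(" * ".join(combo) if combo else "1")

print(len(terms), "terms")
print(terms[:6])
```

The count matches the closed form C(n + d, d) = C(6, 2) = 15, so the linear model is fitted on 15 columns instead of 4.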

Training model with polynomial transformed data and predicting values of test set.

In [107]:
lr = LinearRegression()
prm = lr.fit(poly_x_train,y_train)

pr_preds = prm.predict(poly_x_test)

Comparing predicted values with actual values.

In [108]:
pr_r2 = r2_score(y_test,pr_preds)
# Root mean squared error (RMSE); np.sqrt replaces the deprecated squared=False argument
pr_rmse = np.sqrt(mean_squared_error(y_test,pr_preds))

print("R2 score: ",pr_r2)
print("Root mean squared error: ",pr_rmse)

R2 score:  -4.816167989258717
Root mean squared error:  1295461.4891352865


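In this case, the polynomial regression has a negative R2 score (about -4.82), which means it predicts the test set worse than a constant model that always predicts the mean selling price. The degree-2 model does not generalize to this data.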
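
A negative R2 is easiest to read against the constant baseline. The sketch below uses made-up toy values and a hypothetical helper `r2`: predicting the mean everywhere gives exactly 0, and predictions that are further from the truth than the mean give a value below 0.

```python
import numpy as np


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]

baseline_r2 = r2(y_true, [3, 3, 3, 3, 3])  # always predict the mean
reversed_r2 = r2(y_true, [5, 4, 3, 2, 1])  # trend predicted backwards

print("baseline:", baseline_r2)
print("reversed:", reversed_r2)
```

There is no lower bound on R2, which is why a poorly generalizing model can score far below zero.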

### 3. Decision Tree regression

Building decision tree regressor model.

In [147]:
dtree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.05, random_state=2)

x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.3,random_state=43)

Training model with training sets and predicting selling price with test set.

In [148]:
dtm = dtree.fit(x_train,y_train)

dt_preds = dtm.predict(x_test)

Showing results of decision tree regression model.

In [149]:
dt_r2 = r2_score(y_test,dt_preds)
# Root mean squared error (RMSE); np.sqrt replaces the deprecated squared=False argument
dt_rmse = np.sqrt(mean_squared_error(y_test,dt_preds))

print("R2 score: ",dt_r2)
print("Root mean squared error: ",dt_rmse)

R2 score:  0.5146255973648961
Root mean squared error:  374235.0590739678


### 4. Random Forest Regressor

Building random forest regressor model.

In [164]:
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=100)

x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.3,random_state=43)

In [165]:
rfm = rf.fit(x_train,y_train)

rf_preds = rfm.predict(x_test)

In [166]:
rf_r2 = r2_score(y_test,rf_preds)
# Root mean squared error (RMSE); np.sqrt replaces the deprecated squared=False argument
rf_rmse = np.sqrt(mean_squared_error(y_test,rf_preds))

print("R2 score: ",rf_r2)
print("Root mean squared error: ",rf_rmse)

R2 score:  0.8053482971740353
Root mean squared error:  236992.79328863331


## Results

The results for each model are:

| Model | R2 score | RMSE |
| --- | --- | --- |
| Linear Model | 0.483 | 386,232 |
| Polynomial Model | -4.816 | 1,295,461 |
| Decision Tree Model | 0.515 | 374,235 |
| Random Forest Model | 0.805 | 236,993 |


Based on the results shown above, the best model is the Random Forest model, with an R2 score of 0.805 on the test set (about 80.5% of the variance in selling price explained) and the lowest RMSE. The worst model is the polynomial model, which has a negative R2 score and by far the largest RMSE.
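
The comparison can also be made programmatically. The sketch below is one possible way to rank the four models; the `results` dictionary is a hypothetical structure, filled with the R2 scores and error values printed by the cells above (rounded). Those error values are root mean squared errors, because the metric was computed with `squared=False`.

```python
# Scores copied (rounded) from the outputs of the four models above
results = {
    "Linear": {"r2": 0.4830, "rmse": 386231.55},
    "Polynomial": {"r2": -4.8162, "rmse": 1295461.49},
    "Decision Tree": {"r2": 0.5146, "rmse": 374235.06},
    "Random Forest": {"r2": 0.8053, "rmse": 236992.79},
}

# Higher R2 is better; lower RMSE is better
by_r2 = sorted(results, key=lambda name: results[name]["r2"], reverse=True)
best = min(results, key=lambda name: results[name]["rmse"])
worst = max(results, key=lambda name: results[name]["rmse"])

for name in by_r2:
    scores = results[name]
    print(f"{name:<14} R2={scores['r2']:>8.4f}  RMSE={scores['rmse']:>12,.2f}")
print("best:", best, "| worst:", worst)
```

Both metrics produce the same ordering here, so the choice of metric does not change the conclusion.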