<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/UsedCarPricePredictionSystem-Files/blob/master/Random_Forest_Using_Grid_Search_CV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd    
import joblib
import time                                          # Used to measure the time taken to load the data as a dataframe

# Loading the dataset into a dataframe and performing the desired operations

In [0]:
# Storing the dataset (CSV file) as a pandas dataframe

df = pd.read_csv("/content/drive/My Drive/Dataset/CleanedData.csv")   # Storing the CSV file into a dataframe
df.head(5)

Unnamed: 0,price,yearOfRegistration,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,postalCode,vehicleType_0,vehicleType_1,vehicleType_2,vehicleType_3,vehicleType_4,vehicleType_5,vehicleType_6,vehicleType_7,gearbox_0,gearbox_1
0,650,1995,102,11,150000,10,1,2,33775,0,0,0,0,0,0,1,0,0,1
1,2000,2004,105,10,150000,12,1,19,96224,0,0,0,0,0,0,1,0,0,1
2,2799,2005,140,160,150000,12,3,37,57290,0,0,0,0,0,1,0,0,0,1
3,999,1995,115,160,150000,11,1,37,37269,0,0,0,0,0,1,0,0,0,1
4,2500,2004,131,160,150000,2,1,37,90762,0,0,0,0,0,1,0,0,0,1


In [0]:
df.shape

(94157, 19)

# **Splitting the dataset**


The dataset will be split into three distinct sets of labeled examples, as shown below:
1. Training set
2. Validation set
3. Testing set

The holdout sets in this case are the validation and testing sets. Since this is not a big dataset, 60% of the examples go to the training set, 20% to the validation set, and 20% to the testing set; these are the proportions produced by the `test_size` values used in the code below. In this case,
1. The training set is used only for building (training) the model.
2. The validation set is used to choose the learning algorithm and to find the best hyperparameter values.
3. The testing set is used to assess the final model before it is deployed in production.

In [0]:
selectedFeatures = ['yearOfRegistration','powerPS','model','kilometer','monthOfRegistration','fuelType','brand','postalCode','vehicleType_0','vehicleType_1','vehicleType_2','vehicleType_3','vehicleType_4','vehicleType_5','vehicleType_6','vehicleType_7','gearbox_0','gearbox_1']

In [0]:
X = df[selectedFeatures]
y = df['price']

We can use `sklearn.model_selection.train_test_split` twice: first to split the data into training and testing sets, and then to split the training set again into training and validation sets.

In [0]:
from sklearn.model_selection import train_test_split

# Training set = 60%, Validation set = 20%, Testing set = 20% (0.25 x 0.8 = 0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)         
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) 

The learning algorithm cannot use examples from the two holdout subsets (the validation and test sets) to build the model.
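
As a quick check of the proportions, the sketch below repeats the two `train_test_split` calls on a synthetic array of 1,000 rows and prints the share that lands in each subset. The array, its size, and the `*_demo` names are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (illustration only)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# First split: hold out 20% of all rows as the testing set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
# Second split: hold out 25% of the remaining 80% (0.25 x 0.8 = 0.2) as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.25, random_state=1)

split_sizes = {"train": len(X_tr), "validation": len(X_va), "test": len(X_te)}
for name, size in split_sizes.items():
    print(f"{name}: {size} rows ({size / len(X_demo):.0%})")
```

With `test_size=0.2` followed by `test_size=0.25`, the rows divide 60% / 20% / 20% between training, validation, and testing.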

# **Choosing the best learning algorithm by using validation set**

**Pipelining**


With Pipeline

Different regression algorithms will be trained on the training set and compared on the validation set. The validation score is used here specifically to choose the best algorithm.
1. Linear Regression
2. Random Forest Regressor
3. Decision Tree Regressor
4. Support Vector Regressor

In [0]:
# Importing necessary libraries
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score  

# Creating the pipelines
pipeline_rf=Pipeline([('scalar1', StandardScaler()),
                     ('rf_regressor',RandomForestRegressor())])

pipeline_dt=Pipeline([('scalar2', StandardScaler()),
                     ('dt_regressor',DecisionTreeRegressor())])

pipeline_lr=Pipeline([('scalar3', StandardScaler()),
                     ('lr_regressor',LinearRegression())])

pipeline_svm=Pipeline([('scalar4', StandardScaler()),
                     ('svr_regressor',svm.SVR())])

# List to store all the model results
pipelines = [pipeline_rf, pipeline_dt, pipeline_lr, pipeline_svm]

In [0]:
start_time = time.time()                                  

# Dictionary of pipelines and Regression types for ease of reference
pipe_dict = {0: 'Random Forest Regression', 1: 'Decision Tree Regressor', 2: 'Linear Regression', 3: 'Support Vector Regression'}
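# Fit each pipeline on the training set
for pipe in pipelines:
    pipe.fit(X_train, y_train)

# Score each fitted pipeline on the validation set.
# The printed "accuracy" is the R^2 score multiplied by 100.
for i, model in enumerate(pipelines):
    pred = model.predict(X_val)
    print("{} Model Accuracy: {}".format(pipe_dict[i], r2_score(y_val, pred) * 100))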

print("--- %s seconds ---" % (time.time() - start_time))   # Displaying the time in seconds

Random Forest Regression Model Accuracy: 84.81584796937892
Decision Tree Regressor Model Accuracy: 73.02462727685325
Linear Regression Model Accuracy: 56.095065680128066
Support Vector Regression Model Accuracy: 23.623095422425923
--- 379.20802187919617 seconds ---
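
The value printed as "Model Accuracy" is the R² score (coefficient of determination) multiplied by 100, not a classification accuracy. As a minimal sketch, R² can be computed by hand as 1 - SS_res / SS_tot and compared with `r2_score`. The five true prices are the ones shown by `df.head(5)`; the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five prices from the dataframe preview, with made-up predictions
y_true = np.array([650.0, 2000.0, 2799.0, 999.0, 2500.0])
y_pred = np.array([800.0, 1900.0, 2600.0, 1200.0, 2400.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # Residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # Total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual * 100)
print(r2_score(y_true, y_pred) * 100)            # Should agree with the manual value
```

A score of 100 would mean perfect predictions, and a score of 0 means the model does no better than always predicting the mean price.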




---



Without Pipeline

In [0]:
start_time = time.time()                                                                                  # Start timing the model fit and the validation scoring

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor().fit(X_train, y_train)
pred = rfr.predict(X_val)
print(r2_score(y_val, pred))

print("--- %s seconds ---" % (time.time() - start_time))                                                  # Displaying the time in seconds

In [0]:
start_time = time.time()                                                                                  # Start timing the model fit and the validation scoring

dtr = DecisionTreeRegressor().fit(X_train, y_train)
pred = dtr.predict(X_val)
print(r2_score(y_val, pred))

print("--- %s seconds ---" % (time.time() - start_time))                                                  # Displaying the time in seconds

In [0]:
start_time = time.time()                                                                                  # Start timing the model fit and the validation scoring

lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_val)
print(r2_score(y_val, pred))

print("--- %s seconds ---" % (time.time() - start_time))                                                  # Displaying the time in seconds

In [0]:
start_time = time.time()                                                                                  # Start timing the model fit and the validation scoring

svr = svm.SVR().fit(X_train, y_train)
pred = svr.predict(X_val)
print(r2_score(y_val, pred))

print("--- %s seconds ---" % (time.time() - start_time))                                                  # Displaying the time in seconds



---



# **Underfitting**

A model has low bias if it predicts the labels of the training data well. If the model makes many mistakes on the training data, we say that it has high bias, or that it underfits. Underfitting, then, is the inability of the model to predict well the labels of the data it was trained on.
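
To make the definition concrete, here is a minimal sketch on synthetic data (the data, the seed, and the `*_demo` names are assumptions for illustration only). A straight line is fitted to a quadratic target, so it scores poorly even on the data it was trained on, which is the signature of underfitting. An unrestricted decision tree is shown for contrast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))             # One synthetic feature
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.1, 500)   # Quadratic target with small noise

# A straight line cannot follow a parabola, so the training score stays low
linear_model = LinearRegression().fit(X_demo, y_demo)
linear_train_r2 = r2_score(y_demo, linear_model.predict(X_demo))

# An unrestricted tree can reproduce the training labels almost exactly
tree_model = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
tree_train_r2 = r2_score(y_demo, tree_model.predict(X_demo))

print(f"Linear regression training R^2: {linear_train_r2:.3f}")
print(f"Decision tree training R^2:     {tree_train_r2:.3f}")
```

A high training score only rules out underfitting; it says nothing about overfitting, which is why the holdout sets are still needed.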

**Let's check for underfitting**

**Random forest regression - Predicting the labels of training data**

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2

In [0]:
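# Feature Scaling
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)   # Fit the scaler on the training set only
X_val = sc.transform(X_val)           # Reuse the training-set scaling on the holdout sets
X_test = sc.transform(X_test)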

In [0]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor().fit(X_train, y_train)
pred = rfr.predict(X_train)
print(r2_score(y_train, pred)* 100)

In [0]:
df = pd.DataFrame({'Actual': y_train, 'Predicted': pred})
df

In [0]:
import matplotlib.pyplot as plt
df1 = df.head(35)
df1.plot(kind='bar',figsize=(10,5.5))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()



---



In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

start_time = time.time()                                                                                
rfr = RandomForestRegressor()
# Parameter of Random Forest Regression
parameters = {
                    "n_estimators":[5,50,150,250,350],
                    "max_depth":[2,4,8,16,None],
                    "max_features":[10, 11, 12, 13, 14],
                    "min_samples_leaf": [2, 3, 4, 5, 6]
             }
cv = GridSearchCV(rfr,parameters,cv=2)
cv.fit(X_train, y_train.values.ravel())
print("--- %s seconds ---" % (time.time() - start_time))  

def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {param}')
display(cv)                                               

--- 6988.722146034241 seconds ---
Best parameters are: {'max_depth': 16, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 350}


0.55 + or -0.009 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 5}
0.572 + or -0.001 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 50}
0.572 + or -0.0 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 150}
0.572 + or -0.001 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 250}
0.57 + or -0.0 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 350}
0.56 + or -0.003 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 3, 'n_estimators': 5}
0.569 + or -0.003 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 3, 'n_estimators': 50}
0.571 + or -0.001 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 3, 'n_estimators': 150}
0.571 + or -0.002 
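
The long run time reported above follows from the size of the grid. As a rough sketch, `ParameterGrid` can count the candidate combinations before the search is launched; the variable names `n_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell above
parameters = {
    "n_estimators": [5, 50, 150, 250, 350],
    "max_depth": [2, 4, 8, 16, None],
    "max_features": [10, 11, 12, 13, 14],
    "min_samples_leaf": [2, 3, 4, 5, 6],
}
n_folds = 2                                     # Matches cv=2 in the GridSearchCV call

n_candidates = len(ParameterGrid(parameters))   # 5 * 5 * 5 * 5 combinations
n_fits = n_candidates * n_folds                 # One fit per combination per fold

print(n_candidates, n_fits)
```

Four hyperparameters with five values each give 625 combinations, so two-fold cross-validation trains 1,250 forests, plus one final refit on the full training set when `refit` is left at its default.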



---



**Trying decision tree**

In [0]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth=23)
dtr.fit(X_train, y_train)
pred = dtr.predict(X_train)
print(r2_score(y_train, pred)* 100)

In [0]:
df = pd.DataFrame({'Actual': y_train, 'Predicted': pred})
df

In [0]:
import matplotlib.pyplot as plt
df1 = df.head(35)
df1.plot(kind='bar',figsize=(10,5.5))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()



---



In [0]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor().fit(X_train, y_train)
pred = rfr.predict(X_test)
print(r2_score(y_test, pred)* 100)

In [0]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
pred = dtr.predict(X_test)
print(r2_score(y_test, pred)* 100)

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

start_time = time.time()                                                                                  # Start timing the grid search

rfr = RandomForestRegressor()
parameters = {
                    "n_estimators":[5,50,100,150,200,250,300,350],
                    "max_depth":[2,4,6,8,10,12,14,16],
                    "max_features":[10, 11, 12, 13, 14, 16,17,18],
                    "min_samples_leaf": [2, 3, 4, 5, 6, 7, 8, 9]
             }
cv = GridSearchCV(rfr,parameters,cv=2)
cv.fit(X_train, y_train.values.ravel())
print("--- %s seconds ---" % (time.time() - start_time))  





def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {param}')
display(cv)                                                # Displaying the grid search results
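
With eight values for each of four hyperparameters, this grid has 8 x 8 x 8 x 8 = 4,096 combinations, or 8,192 fits at `cv=2`. One alternative, which the notebook does not use, is `RandomizedSearchCV`, which samples a fixed number of combinations instead of trying them all. The sketch below is only an illustration under stated assumptions: it runs on a small synthetic dataset from `make_regression`, and the candidate lists are shortened so that it finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem with 18 features (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=18, noise=10.0, random_state=0)

# Deliberately short candidate lists so the sketch runs in seconds
param_distributions = {
    "n_estimators": [5, 10, 20],
    "max_depth": [2, 4, 8],
    "max_features": [10, 14, 18],
    "min_samples_leaf": [2, 4, 6],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,           # Evaluate only 5 of the 81 possible combinations
    cv=2,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_["params"]))   # Number of combinations actually tried
```

The trade-off is that a random search may miss the best combination in the grid, in exchange for a run time that is set directly by `n_iter`.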