<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/UsedCarPricePredictionSystem-Files/blob/master/Random_Forest_Using_Grid_Search_CV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Random Forest Using Grid Search CV**
Hyperparameter tuning is performed using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which runs an exhaustive search over specified parameter values for an estimator. With grid search we can find the best combination of hyperparameters for the model. Because random forest gave better results earlier when predicting the labels of the holdout sets, we tune the hyperparameters of random forest here.
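To make "exhaustive search" concrete, here is a minimal sketch that enumerates every combination of a small grid using only the standard library. The grid `toy_grid` is hypothetical and chosen only to keep the output short; the order in which scikit-learn visits candidates may differ from the order printed here.

```python
from itertools import product

# Hypothetical two-parameter grid, kept small for illustration only.
toy_grid = {
    "n_estimators": [5, 50],
    "max_depth": [2, 4, None],
}

# An exhaustive search visits the Cartesian product of all value lists.
names = list(toy_grid)
candidates = [dict(zip(names, values)) for values in product(*toy_grid.values())]

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates in total")
```

Every candidate is then fitted and scored once per cross-validation fold, so the cost grows with the product of the list lengths.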

In [0]:
import pandas as pd    
import joblib
import time                                          

In [0]:
# Load the cleaned dataset (CSV file) into a pandas DataFrame

df = pd.read_csv("/content/drive/My Drive/Dataset/CleanedData.csv")
df.head(5)

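| Unnamed: 0 | price | yearOfRegistration | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | postalCode | vehicleType_0 | vehicleType_1 | vehicleType_2 | vehicleType_3 | vehicleType_4 | vehicleType_5 | vehicleType_6 | vehicleType_7 | gearbox_0 | gearbox_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 650 | 1995 | 102 | 11 | 150000 | 10 | 1 | 2 | 33775 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2000 | 2004 | 105 | 10 | 150000 | 12 | 1 | 19 | 96224 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 2799 | 2005 | 140 | 160 | 150000 | 12 | 3 | 37 | 57290 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 999 | 1995 | 115 | 160 | 150000 | 11 | 1 | 37 | 37269 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 2500 | 2004 | 131 | 160 | 150000 | 2 | 1 | 37 | 90762 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |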


In [0]:
selectedFeatures = ['yearOfRegistration','powerPS','model','kilometer','monthOfRegistration','fuelType','brand','postalCode','vehicleType_0','vehicleType_1','vehicleType_2','vehicleType_3','vehicleType_4','vehicleType_5','vehicleType_6','vehicleType_7','gearbox_0','gearbox_1']

In [0]:
X = df[selectedFeatures]
y = df['price']

We can use `sklearn.model_selection.train_test_split` twice: first to split the data into training and test sets, and then to split the training set again into training and validation sets.

In [0]:
from sklearn.model_selection import train_test_split

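# Training set = 60%, Validation set = 20%, Testing set = 20%
# First split: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Second split: 25% of the remaining 80% (20% of the total) becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)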
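The arithmetic behind the two-step split can be checked with a few lines. This is a small sketch; `split_fractions` is a hypothetical helper, not part of scikit-learn, and it only reproduces the proportions implied by `test_size=0.2` followed by `test_size=0.25`.

```python
def split_fractions(test_size, val_size_of_remainder):
    """Return (train, validation, test) fractions of the full dataset."""
    test = test_size
    remainder = 1.0 - test_size               # share left after the first split
    val = remainder * val_size_of_remainder   # second split acts on the remainder
    train = remainder - val
    return train, val, test

train, val, test = split_fractions(test_size=0.2, val_size_of_remainder=0.25)
print(f"train={train:.2f}, validation={val:.2f}, test={test:.2f}")
# prints: train=0.60, validation=0.20, test=0.20
```

The second `test_size` is a fraction of what remains after the first split, which is why 0.25 yields a 20% validation set.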

## **Performing grid search CV**

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

start_time = time.time()                                  # Start timing the grid search
rfr = RandomForestRegressor()
# Hyperparameter grid for the random forest regressor (5 x 5 x 5 x 5 = 625 combinations)
parameters = {
                    "n_estimators":[5,50,150,250,350],
                    "max_depth":[2,4,8,16,None],
                    "max_features":[10, 11, 12, 13, 14],
                    "min_samples_leaf": [2, 3, 4, 5, 6]
             }
cv = GridSearchCV(rfr,parameters,cv=2)                    # 2-fold cross-validation
cv.fit(X_train, y_train.values.ravel())
print("--- %s seconds ---" % (time.time() - start_time))  # Time taken by the grid search

def display(results):
    # Print the best combination, then the mean and std of the CV score for every combination
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean,std,param in zip(mean_score,std_score,params):   # `param` avoids shadowing `params`
        print(f'{round(mean,3)} + or -{round(std,3)} for the {param}')
display(cv)

--- 6988.722146034241 seconds ---
Best parameters are: {'max_depth': 16, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 350}


0.55 + or -0.009 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 5}
0.572 + or -0.001 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 50}
0.572 + or -0.0 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 150}
0.572 + or -0.001 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 250}
0.57 + or -0.0 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 350}
0.56 + or -0.003 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 3, 'n_estimators': 5}
0.569 + or -0.003 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 3, 'n_estimators': 50}
0.571 + or -0.001 for the {'max_depth': 2, 'max_features': 10, 'min_samples_leaf': 3, 'n_estimators': 150}
0.571 + or -0.002 

So, after tuning the hyperparameters with grid search, the best parameters are `n_estimators = 350, max_depth = 16, max_features = 10, min_samples_leaf = 2`.
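The validation split created earlier is not used by the search itself, so one natural follow-up is to score the tuned model on it. The sketch below shows the pattern on synthetic data, which is an assumption made so the block runs on its own in seconds; the grid `small_grid` and the generated `X`, `y` are hypothetical stand-ins for the car dataset. With the default `scoring=None`, both the cross-validation scores and `score()` report R² for a regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption: 5 numeric features).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Deliberately tiny grid so the sketch finishes quickly.
small_grid = {"n_estimators": [10, 30], "max_depth": [2, 8]}
search = GridSearchCV(RandomForestRegressor(random_state=1), small_grid, cv=2)
search.fit(X_train, y_train)

# refit=True (the default) retrains the best combination on all of X_train,
# so best_estimator_ can be scored on data the search never saw.
val_r2 = search.best_estimator_.score(X_val, y_val)
print(search.best_params_, round(val_r2, 3))
```

In the notebook, the same two lines after `fit` would apply to `cv.best_estimator_` with `X_val` and `y_val`.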

## **Performing grid search cv by feeding more parameters**

Unfortunately, this took far too long to run. I waited almost 8 to 10 hours and, due to lack of time, could not run it again. Also, most of the times I ran this code, my notebook crashed.

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

start_time = time.time()                                                                                  # Start timing the grid search

rfr = RandomForestRegressor()
parameters = {
                    "n_estimators":[5,50,100,150,200,250,300,350],
                    "max_depth":[2,4,6,8,10,12,14,16],
                    "max_features":[10, 11, 12, 13, 14, 16,17,18],
                    "min_samples_leaf": [2, 3, 4, 5, 6, 7, 8, 9]
             }
cv = GridSearchCV(rfr,parameters,cv=2)
cv.fit(X_train, y_train.values.ravel())
print("--- %s seconds ---" % (time.time() - start_time))  

def display(results):
    # Print the best combination, then the mean and std of the CV score for every combination
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean,std,param in zip(mean_score,std_score,params):   # `param` avoids shadowing `params`
        print(f'{round(mean,3)} + or -{round(std,3)} for the {param}')
display(cv)                                                # Displaying the best parameters and the score of every combination
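A rough way to see why the larger grid is so much slower is to count the model fits each search requires. The sketch below uses only the standard library; `count_fits` is a hypothetical helper, and the two dictionaries copy the grids from the cells above. The count is only a proxy for runtime, since the cost of a single fit also depends on the parameter values (for example, more trees or deeper trees take longer).

```python
from math import prod

def count_fits(grid, n_folds):
    """Return (candidates, fits) for a grid search; the final refit is extra."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * n_folds

first_grid = {
    "n_estimators": [5, 50, 150, 250, 350],
    "max_depth": [2, 4, 8, 16, None],
    "max_features": [10, 11, 12, 13, 14],
    "min_samples_leaf": [2, 3, 4, 5, 6],
}
second_grid = {
    "n_estimators": [5, 50, 100, 150, 200, 250, 300, 350],
    "max_depth": [2, 4, 6, 8, 10, 12, 14, 16],
    "max_features": [10, 11, 12, 13, 14, 16, 17, 18],
    "min_samples_leaf": [2, 3, 4, 5, 6, 7, 8, 9],
}

first = count_fits(first_grid, n_folds=2)
second = count_fits(second_grid, n_folds=2)
print("first grid :", first)    # (625, 1250)
print("second grid:", second)   # (4096, 8192)
print("ratio of fits:", second[1] / first[1])
```

With `cv=2`, the second grid needs about 6.5 times as many fits as the first, which is consistent with a run that lasts many hours when the first took almost two.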



---

