
# Random Forest Implementation

---


> Random Forest is an ensemble learning method built on bagging (bootstrap aggregating). 
- It first draws a random sample of the training data (with replacement). 
- It fits a decision tree to that sample and records its prediction. 
- It repeats these steps for other random samples, producing one tree per sample. 
- The final prediction is decided by `majority voting` for classification problems and by the `average` of all tree outputs for regression. 
- The number of decision trees in the forest can be specified when the model is created. 
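
The steps above can be sketched by hand with plain decision trees. This is a minimal illustration on synthetic data, not the implementation scikit-learn uses: the data, `n_trees` and all variable names are assumptions made for the example, and a real random forest additionally samples a random subset of features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees = 25  # number of trees in the ensemble
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Regression: average the per-tree predictions
X_new = np.array([[0.0], [1.5]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 2)
bagged_prediction = per_tree.mean(axis=0)
print(bagged_prediction)
```

For classification, the averaging step would be replaced by a majority vote over the predicted labels.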






In [None]:
import numpy as np
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
train_data = pd.read_csv(r'/content/drive/MyDrive/Colab Notebooks/train-data.csv')
test_data = pd.read_csv(r'/content/drive/MyDrive/Colab Notebooks/test-data.csv')


In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5911 entries, 0 to 5910
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         5911 non-null   int64  
 1   Name               5911 non-null   object 
 2   Location           5911 non-null   object 
 3   Year               5911 non-null   int64  
 4   Kilometers_Driven  5911 non-null   int64  
 5   Fuel_Type          5911 non-null   object 
 6   Transmission       5911 non-null   object 
 7   Owner_Type         5911 non-null   object 
 8   Mileage            5909 non-null   object 
 9   Engine             5876 non-null   object 
 10  Power              5876 non-null   object 
 11  Seats              5874 non-null   float64
 12  Price              5911 non-null   float64
dtypes: float64(2), int64(3), object(8)
memory usage: 600.5+ KB


In [None]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
1,1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
2,2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
3,3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
4,4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74


### The dataset needs some preprocessing first (data cleaning, feature engineering, feature selection) 
- If you are already familiar with these steps, skip to the **Model training and prediction** section.

# Data Preprocessing

In [None]:
train_data = train_data.iloc[:,1:]
train_data.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74


In [None]:
train_data.shape

(5911, 12)



> - Identifying the columns whose values are categorical labels. 



In [None]:

print(train_data['Location'].unique())
print(train_data['Fuel_Type'].unique())
print(train_data['Transmission'].unique())
print(train_data['Owner_Type'].unique())

['Mumbai' 'Pune' 'Chennai' 'Coimbatore' 'Hyderabad' 'Jaipur' 'Kochi'
 'Kolkata' 'Delhi' 'Bangalore' 'Ahmedabad']
['CNG' 'Diesel' 'Petrol' 'LPG' 'Electric']
['Manual' 'Automatic']
['First' 'Second' 'Fourth & Above' 'Third']




> - Getting the number of missing values using isnull() function



In [None]:
train_data.isnull().sum()

Name                  0
Location              0
Year                  0
Kilometers_Driven     0
Fuel_Type             0
Transmission          0
Owner_Type            0
Mileage               2
Engine               35
Power                35
Seats                37
Price                 0
dtype: int64



> - The code below deletes the rows that contain missing values. 
- There are various other ways to handle missing values, for example `fillna()`, which fills the empty cells with values you specify. 




In [None]:
train_data = train_data[train_data['Mileage'].notna()]
train_data = train_data[train_data['Engine'].notna()]
train_data = train_data[train_data['Power'].notna()]
train_data = train_data[train_data['Seats'].notna()]
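
As a small sketch of the two options mentioned above, the toy frame below (made-up rows, not the real dataset) is cleaned once with `dropna()` and once with `fillna()`. Using the median as the fill value is an assumption chosen for the example, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data with one missing Seats value (hypothetical rows)
demo = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

# Option 1: remove every row that has a missing Seats value
dropped = demo.dropna(subset=["Seats"])

# Option 2: keep all rows and fill the gap with the column median
filled = demo.fillna({"Seats": demo["Seats"].median()})

print(len(dropped), len(filled))
print(filled["Seats"].tolist())
```

Dropping is simplest when only a few rows are affected, as here; filling keeps the rows at the cost of inventing a value.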



> - Since we deleted rows, the index now has gaps, so we reset it to get a continuous 0..n-1 index again. 



In [None]:
train_data = train_data.reset_index(drop=True)

In [None]:
train_data.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74




> - Columns like Mileage, Engine and Power contain the unit in every row, so we split each value and keep only the first part (the number), which removes the unit. 
- scikit-learn models expect numeric (`float/int`) features; object (string) columns cannot be used for training as they are. 
- The **Name** column is split so that we extract only the company name (Honda, Maruti, etc.), since full model names are too specific to be useful for prediction. 



In [None]:
for i in range(train_data.shape[0]):
    # split() breaks the string into a list wherever there is whitespace,
    # so index [0] keeps the number and drops the unit.
    train_data.at[i, 'Company'] = train_data['Name'][i].split()[0] # Only company name extracted
    train_data.at[i, 'Mileage(km/kg)'] = train_data['Mileage'][i].split()[0]
    train_data.at[i, 'Engine(CC)'] = train_data['Engine'][i].split()[0]
    train_data.at[i, 'Power(bhp)'] = train_data['Power'][i].split()[0]
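
The same extraction can also be written without an explicit row loop by using the pandas `.str` accessor. The sketch below runs on a two-row toy frame copied from the `head()` output above; it is an alternative formulation, not the code the notebook ran, and it also does the float conversion in the same step.

```python
import pandas as pd

# Two sample rows in the same format as the dataset
demo = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI CNG", "Honda Jazz V"],
    "Mileage": ["26.6 km/kg", "18.2 kmpl"],
    "Engine": ["998 CC", "1199 CC"],
    "Power": ["58.16 bhp", "88.7 bhp"],
})

# First word of the name is the company
demo["Company"] = demo["Name"].str.split().str[0]

# First token of each measurement is the number; cast it to float
for src, dst in [("Mileage", "Mileage(km/kg)"),
                 ("Engine", "Engine(CC)"),
                 ("Power", "Power(bhp)")]:
    demo[dst] = demo[src].str.split().str[0].astype(float)

print(demo[["Company", "Mileage(km/kg)", "Engine(CC)", "Power(bhp)"]])
```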



> - Since we have extracted the company names, we can check how many of such values exist in our dataset.



In [None]:
train_data['Company'].value_counts()

Maruti           1175
Hyundai          1058
Honda             600
Toyota            394
Mercedes-Benz     316
Volkswagen        314
Ford              294
Mahindra          268
BMW               262
Audi              235
Tata              183
Skoda             172
Renault           145
Chevrolet         120
Nissan             89
Land               57
Jaguar             40
Mitsubishi         27
Mini               26
Fiat               23
Volvo              21
Porsche            16
Jeep               15
Datsun             13
Force               3
ISUZU               2
Ambassador          1
Isuzu               1
Bentley             1
Lamborghini         1
Name: Company, dtype: int64

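> - `ISUZU` and `Isuzu` are counted as two different companies above, so we normalise the capitalisation with `str.title()`.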

In [None]:
train_data.Company=train_data.Company.str.title() 
train_data['Company'].unique()

array(['Maruti', 'Hyundai', 'Honda', 'Audi', 'Nissan', 'Toyota',
       'Volkswagen', 'Tata', 'Land', 'Mitsubishi', 'Renault',
       'Mercedes-Benz', 'Bmw', 'Mahindra', 'Ford', 'Porsche', 'Datsun',
       'Jaguar', 'Volvo', 'Chevrolet', 'Skoda', 'Mini', 'Fiat', 'Jeep',
       'Ambassador', 'Isuzu', 'Force', 'Bentley', 'Lamborghini'],
      dtype=object)



> - Since the values we extracted were object type, we will convert mileage, engine and power to float values



In [None]:
train_data['Mileage(km/kg)'] = train_data['Mileage(km/kg)'].astype(float)
train_data['Engine(CC)'] = train_data['Engine(CC)'].astype(float)
train_data['Power(bhp)'] = train_data['Power(bhp)'].astype(float)

In [None]:
train_data.shape

(5872, 16)

In [None]:
train_data.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price,Company,Mileage(km/kg),Engine(CC),Power(bhp)
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75,Maruti,26.6,998.0,58.16
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5,Hyundai,19.67,1582.0,126.2
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5,Honda,18.2,1199.0,88.7
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0,Maruti,20.77,1248.0,88.76
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74,Audi,15.2,1968.0,140.8




> - Since we created new columns `Mileage(km/kg)`, `Engine(CC)`,`Power(bhp)` and `Company` the original columns can be dropped (**Mileage, Engine, Power and Name**)



In [None]:
train_data.drop(["Name", "Mileage", "Engine", "Power"], axis=1, inplace=True)



> Now we start `feature engineering` by converting the categorical columns to numbers, which the models need as input.
- For example, for the **Transmission** column we give 0 for automatic cars and 1 for manual. 
- We will apply this to all the columns that hold categorical values: **Company, Location, Fuel_Type, Transmission and Owner_Type**
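
A minimal sketch of what this encoding produces, on a three-row toy column. `dtype=int` is passed here only so that the output is 0/1 on every pandas version (newer versions return booleans by default); the notebook's own calls do not pass it.

```python
import pandas as pd

demo = pd.DataFrame({"Transmission": ["Manual", "Automatic", "Manual"]})

# One indicator column per category
full = pd.get_dummies(demo, dtype=int)

# drop_first=True removes the first category (alphabetically), which is
# then represented implicitly by all remaining indicators being 0
reduced = pd.get_dummies(demo, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced["Transmission_Manual"].tolist())
```

With `drop_first=True` a two-valued column collapses to a single 0/1 column, which matches the automatic = 0, manual = 1 convention described above.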



In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5872 entries, 0 to 5871
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Location           5872 non-null   object 
 1   Year               5872 non-null   int64  
 2   Kilometers_Driven  5872 non-null   int64  
 3   Fuel_Type          5872 non-null   object 
 4   Transmission       5872 non-null   object 
 5   Owner_Type         5872 non-null   object 
 6   Seats              5872 non-null   float64
 7   Price              5872 non-null   float64
 8   Company            5872 non-null   object 
 9   Mileage(km/kg)     5872 non-null   float64
 10  Engine(CC)         5872 non-null   float64
 11  Power(bhp)         5872 non-null   float64
dtypes: float64(5), int64(2), object(5)
memory usage: 550.6+ KB


In [None]:
var = 'Location'
var1= 'Company'
train_data[var].value_counts()  # Number of times each value appears in the column

Mumbai        775
Hyderabad     718
Kochi         645
Coimbatore    629
Pune          594
Delhi         545
Kolkata       521
Chennai       476
Jaipur        402
Bangalore     347
Ahmedabad     220
Name: Location, dtype: int64

In [None]:
Location = train_data[[var]]
Location = pd.get_dummies(Location,drop_first=True)
Location.head()

In [None]:
Company= train_data[[var1]]
Company = pd.get_dummies(Company,drop_first=True)
Company.head()

Unnamed: 0,Company_Audi,Company_Bentley,Company_Bmw,Company_Chevrolet,Company_Datsun,Company_Fiat,Company_Force,Company_Ford,Company_Honda,Company_Hyundai,...,Company_Mini,Company_Mitsubishi,Company_Nissan,Company_Porsche,Company_Renault,Company_Skoda,Company_Tata,Company_Toyota,Company_Volkswagen,Company_Volvo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- Repeating the same steps for `Fuel_Type`

In [None]:
var = 'Fuel_Type'
train_data[var].value_counts()

Diesel    3152
Petrol    2655
CNG         55
LPG         10
Name: Fuel_Type, dtype: int64

In [None]:
Fuel_t = train_data[[var]]
Fuel_t = pd.get_dummies(Fuel_t,drop_first=True)
Fuel_t.head()

Unnamed: 0,Fuel_Type_Diesel,Fuel_Type_LPG,Fuel_Type_Petrol
0,0,0,0
1,1,0,0
2,0,0,1
3,1,0,0
4,1,0,0


In [None]:
var = 'Transmission'
train_data[var].value_counts()

Manual       4170
Automatic    1702
Name: Transmission, dtype: int64

In [None]:
Transmission = train_data[[var]]
Transmission = pd.get_dummies(Transmission,drop_first=True)
Transmission.head()

Unnamed: 0,Transmission_Manual
0,1
1,1
2,1
3,1
4,0


In [None]:
var = 'Owner_Type'
train_data[var].value_counts()

First             4839
Second             925
Third              101
Fourth & Above       7
Name: Owner_Type, dtype: int64

In [None]:
var = 'Owner_Type'
Owner = train_data[[var]]
Owner = pd.get_dummies(Owner)
Owner.head()

Unnamed: 0,Owner_Type_First,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,1,0,0,0
4,0,0,1,0




> - Using the `get_dummies()` function, we converted the columns above into numerical indicator columns of 0 and 1. 



In [None]:
final_train= pd.concat([train_data,Location,Fuel_t,Transmission,Company,Owner],axis=1)
final_train.head()


Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Seats,Price,Company,Mileage(km/kg),...,Company_Renault,Company_Skoda,Company_Tata,Company_Toyota,Company_Volkswagen,Company_Volvo,Owner_Type_First,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third
0,Mumbai,2010,72000,CNG,Manual,First,5.0,1.75,Maruti,26.6,...,0,0,0,0,0,0,1,0,0,0
1,Pune,2015,41000,Diesel,Manual,First,5.0,12.5,Hyundai,19.67,...,0,0,0,0,0,0,1,0,0,0
2,Chennai,2011,46000,Petrol,Manual,First,5.0,4.5,Honda,18.2,...,0,0,0,0,0,0,1,0,0,0
3,Chennai,2012,87000,Diesel,Manual,First,7.0,6.0,Maruti,20.77,...,0,0,0,0,0,0,1,0,0,0
4,Coimbatore,2013,40670,Diesel,Automatic,Second,5.0,17.74,Audi,15.2,...,0,0,0,0,0,0,0,0,1,0


In [None]:
final_train.drop(["Location","Fuel_Type","Transmission","Owner_Type","Company"],axis=1,inplace=True)
final_train.head()

Unnamed: 0,Year,Kilometers_Driven,Seats,Price,Mileage(km/kg),Engine(CC),Power(bhp),Location_Bangalore,Location_Chennai,Location_Coimbatore,...,Company_Renault,Company_Skoda,Company_Tata,Company_Toyota,Company_Volkswagen,Company_Volvo,Owner_Type_First,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third
0,2010,72000,5.0,1.75,26.6,998.0,58.16,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,2015,41000,5.0,12.5,19.67,1582.0,126.2,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,2011,46000,5.0,4.5,18.2,1199.0,88.7,0,1,0,...,0,0,0,0,0,0,1,0,0,0
3,2012,87000,7.0,6.0,20.77,1248.0,88.76,0,1,0,...,0,0,0,0,0,0,1,0,0,0
4,2013,40670,5.0,17.74,15.2,1968.0,140.8,0,0,1,...,0,0,0,0,0,0,0,0,1,0


In [None]:
final_train.shape

(5872, 53)



> For the `Year` column, we subtract each car's year from the current year (2022) so that the model learns how old the car is.



In [None]:
final_train.loc[:, "Year"] = final_train["Year"].apply(lambda x: 2022-x)
final_train.head()

Unnamed: 0,Year,Kilometers_Driven,Seats,Price,Mileage(km/kg),Engine(CC),Power(bhp),Location_Bangalore,Location_Chennai,Location_Coimbatore,...,Company_Renault,Company_Skoda,Company_Tata,Company_Toyota,Company_Volkswagen,Company_Volvo,Owner_Type_First,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third
0,12,72000,5.0,1.75,26.6,998.0,58.16,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,7,41000,5.0,12.5,19.67,1582.0,126.2,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,11,46000,5.0,4.5,18.2,1199.0,88.7,0,1,0,...,0,0,0,0,0,0,1,0,0,0
3,10,87000,7.0,6.0,20.77,1248.0,88.76,0,1,0,...,0,0,0,0,0,0,1,0,0,0
4,9,40670,5.0,17.74,15.2,1968.0,140.8,0,0,1,...,0,0,0,0,0,0,0,0,1,0


In [None]:
test_data.dtypes

Unnamed: 0             int64
Name                  object
Location              object
Year                   int64
Kilometers_Driven      int64
Fuel_Type             object
Transmission          object
Owner_Type            object
Mileage               object
Engine                object
Power                 object
Seats                float64
New_Price             object
dtype: object



> **Now we apply the same preprocessing code to our test data.**



In [None]:
test_data = test_data.iloc[:,1:]


test_data = test_data.reset_index(drop=True)

for i in range(test_data.shape[0]):
    test_data.at[i, 'Mileage(km/kg)'] = test_data['Mileage'][i].split()[0]
    test_data.at[i, 'Engine'] = str(test_data['Engine'][i])
    test_data.at[i, 'Engine(CC)'] = test_data['Engine'][i].split()[0]
    test_data.at[i, 'Power'] = str(test_data['Power'][i])
    test_data.at[i, 'Power(bhp)'] = test_data['Power'][i].split()[0]
    test_data.at[i, 'Company'] = test_data['Name'][i].split()[0]
test_data.Company=test_data.Company.str.title()
print('Split Done') 
test_data['Mileage(km/kg)'] = test_data['Mileage(km/kg)'].astype(float)
test_data['Engine(CC)'] = test_data['Engine(CC)'].astype(float)
print('casting 1 Done') 

position = []
for i in range(test_data.shape[0]):
    if test_data['Power(bhp)'][i]=='null':
        position.append(i)
        
test_data = test_data.drop(test_data.index[position])
test_data = test_data.reset_index(drop=True) 

test_data['Power(bhp)'] = test_data['Power(bhp)'].astype(float)
print('casting 2 Done') 


test_data.drop(["Name"],axis=1,inplace=True)
test_data.drop(["Mileage"],axis=1,inplace=True)
test_data.drop(["Engine"],axis=1,inplace=True)
test_data.drop(["Power"],axis=1,inplace=True)

var = 'Location'
Location = test_data[[var]]
Location = pd.get_dummies(Location,drop_first=True)
Location.head()

var1='Company'
Company= test_data[[var1]]
Company = pd.get_dummies(Company,drop_first=True)

var = 'Fuel_Type'
Fuel_t = test_data[[var]]
Fuel_t = pd.get_dummies(Fuel_t,drop_first=True)
Fuel_t.head()

var = 'Transmission'
Transmission = test_data[[var]]
Transmission = pd.get_dummies(Transmission,drop_first=True)
Transmission.head()

test_data.loc[:, "Year"] = test_data["Year"].apply(lambda x: 2022-x)

var = 'Owner_Type'
Owner = test_data[[var]]
oOwner = pd.get_dummies(Owner)


final_test= pd.concat([test_data,Location,Fuel_t,Transmission,Owner],axis=1)
final_test.head()

final_test.drop(["Location","Fuel_Type","Transmission","Company","Owner_Type"],axis=1,inplace=True)
final_test.head()

print("Final Test Size: ",final_test.shape)


Split Done
casting 1 Done
casting 2 Done
Final Test Size:  (1212, 21)


In [None]:
final_test.head()

Unnamed: 0,Year,Kilometers_Driven,Seats,New_Price,Mileage(km/kg),Engine(CC),Power(bhp),Location_Bangalore,Location_Chennai,Location_Coimbatore,...,Location_Hyderabad,Location_Jaipur,Location_Kochi,Location_Kolkata,Location_Mumbai,Location_Pune,Fuel_Type_Diesel,Fuel_Type_LPG,Fuel_Type_Petrol,Transmission_Manual
0,8,40929,4.0,,32.26,998.0,58.2,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,9,54493,5.0,,24.7,796.0,47.3,0,0,1,...,0,0,0,0,0,0,0,0,1,1
2,5,34000,7.0,25.27 Lakh,13.68,2393.0,147.8,0,0,0,...,0,0,0,0,1,0,1,0,0,1
3,8,29000,5.0,,18.5,1197.0,82.85,0,0,0,...,0,0,0,0,1,0,0,0,1,1
4,6,85609,7.0,,16.0,2179.0,140.0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
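
Note that `final_test` ends up with 21 columns, while `final_train` has 53, so the two frames do not share the same feature set. If the test data were to be passed to a model trained on the training features, the columns would have to match first. One possible way, shown here on toy frames with hypothetical values, is `DataFrame.reindex`, which adds missing indicator columns filled with 0 and drops columns the training data does not have.

```python
import pandas as pd

# Toy stand-ins for the training features and the preprocessed test data
train_X = pd.DataFrame({
    "Year": [12, 7],
    "Company_Audi": [0, 1],
    "Owner_Type_First": [1, 0],
})
test_X = pd.DataFrame({"Year": [8], "New_Price": [None]})

# Keep exactly the training columns, in the training order;
# indicator columns absent from the test data are filled with 0
aligned = test_X.reindex(columns=train_X.columns, fill_value=0)

print(list(aligned.columns))
print(aligned.iloc[0].tolist())
```

Filling with 0 is only appropriate for one-hot indicator columns; a missing numeric feature would need a different treatment.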




> These are the final columns



In [None]:
final_train.columns

Index(['Year', 'Kilometers_Driven', 'Seats', 'Price', 'Mileage(km/kg)',
       'Engine(CC)', 'Power(bhp)', 'Location_Bangalore', 'Location_Chennai',
       'Location_Coimbatore', 'Location_Delhi', 'Location_Hyderabad',
       'Location_Jaipur', 'Location_Kochi', 'Location_Kolkata',
       'Location_Mumbai', 'Location_Pune', 'Fuel_Type_Diesel', 'Fuel_Type_LPG',
       'Fuel_Type_Petrol', 'Transmission_Manual', 'Company_Audi',
       'Company_Bentley', 'Company_Bmw', 'Company_Chevrolet', 'Company_Datsun',
       'Company_Fiat', 'Company_Force', 'Company_Ford', 'Company_Honda',
       'Company_Hyundai', 'Company_Isuzu', 'Company_Jaguar', 'Company_Jeep',
       'Company_Lamborghini', 'Company_Land', 'Company_Mahindra',
       'Company_Maruti', 'Company_Mercedes-Benz', 'Company_Mini',
       'Company_Mitsubishi', 'Company_Nissan', 'Company_Porsche',
       'Company_Renault', 'Company_Skoda', 'Company_Tata', 'Company_Toyota',
       'Company_Volkswagen', 'Company_Volvo', 'Owner_Type_First',


# Model training and prediction 


> - Here we will be implementing Random Forest to predict the `price` of the used car



In [None]:
X = final_train.loc[:,['Year', 'Kilometers_Driven', 'Seats',
       'Mileage(km/kg)', 'Engine(CC)', 'Power(bhp)', 
       'Location_Bangalore', 'Location_Chennai', 'Location_Coimbatore',
       'Location_Delhi', 'Location_Hyderabad', 'Location_Jaipur',
       'Location_Kochi', 'Location_Kolkata', 'Location_Mumbai',
       'Location_Pune', 'Fuel_Type_Diesel', 'Fuel_Type_LPG',
       'Fuel_Type_Petrol', 'Transmission_Manual','Company_Audi', 'Company_Bmw', 'Company_Bentley',
       'Company_Chevrolet', 'Company_Datsun', 'Company_Fiat', 'Company_Force',
       'Company_Ford', 'Company_Honda', 'Company_Hyundai',
       'Company_Isuzu', 'Company_Jaguar', 'Company_Jeep',
       'Company_Lamborghini', 'Company_Land', 'Company_Mahindra',
       'Company_Maruti', 'Company_Mercedes-Benz', 'Company_Mini',
       'Company_Mitsubishi', 'Company_Nissan', 'Company_Porsche',
       'Company_Renault', 'Company_Skoda', 'Company_Tata', 'Company_Toyota',
       'Company_Volkswagen', 'Company_Volvo','Owner_Type_First','Owner_Type_Fourth & Above',
       'Owner_Type_Second', 'Owner_Type_Third']]
X.shape

(5872, 52)



> We define our target variable `y` as the `Price` column



In [None]:
y = final_train.loc[:,['Price']]
y.head()

Unnamed: 0,Price
0,1.75
1,12.5
2,4.5
3,6.0
4,17.74




> We will split the training data into 80% training and 20% validation data to check how well our model generalises.
- `random_state` is a parameter that seeds the shuffling done before the split. If we don't specify a number, the split is different each time we run the code, and so is the resulting score. 
- Changing `random_state` shifts the score slightly from split to split; fix one value so that results are reproducible and both models are compared on the same split. 



In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
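
A quick sketch of the effect of `random_state`, using a tiny made-up array instead of the car data: two calls with the same seed return exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each (hypothetical data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

print(a_train.shape, a_test.shape)      # 8 rows for training, 2 held out
print(np.array_equal(a_test, b_test))   # same seed, same split
```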



> First we will use a `linear regression` model by importing it from sklearn.
- For **regression** models, the `score()` method returns the coefficient of determination (R²), not a classification accuracy. A value of 1.0 means a perfect fit.



In [None]:
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred= linear_reg.predict(X_test)
linear_reg.score(X_train,y_train)
linear_reg.score(X_test,y_test)

0.7707912573023203
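
To make the number above concrete, the sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with what `score()` returns. The data is synthetic and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic linear data with noise (hypothetical)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```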

> As shown above, the R² score on the validation data is about 0.77, which isn't up to the mark. Hence we now try the Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor  # Used for regression
# You can use RandomForestClassifier for classification problems.
rf_reg = RandomForestRegressor()
# y_train is a single-column DataFrame; fit() expects a 1-D target array
rf_reg.fit(X_train, y_train.values.ravel())
y_pred = rf_reg.predict(X_test)
print(rf_reg.score(X_test, y_test))



0.9334928477034067


array([13.7373,  1.5188,  4.0458, ...,  3.4311,  6.5816, 41.1118])



> We are getting an R² score of about 0.93, which is much better than the 0.77 of the linear regression model. 
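
The model above uses the default settings. As a sketch of the two parameters most relevant here, `n_estimators` sets the number of trees and `random_state` makes the forest reproducible. The example uses synthetic data from `make_regression`; the helper `fit_forest` and the value of 50 trees are assumptions for illustration, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

def fit_forest(n_trees):
    """Fit a forest with a fixed seed so repeated runs give the same model."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    return forest.fit(X_tr, y_tr)

forest_a = fit_forest(50)
forest_b = fit_forest(50)

print(len(forest_a.estimators_))  # number of fitted trees
print(np.allclose(forest_a.predict(X_va), forest_b.predict(X_va)))
print(forest_a.score(X_va, y_va))  # R² on the held-out data
```

Without a fixed `random_state`, the score printed by the notebook can differ slightly between runs.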



In [None]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error
rf_reg.score(X_test,y_test)


0.9334928477034067
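
The `metrics` import above can also be used to report errors in the same unit as `Price`, which is often easier to interpret than R². A minimal sketch follows: the true values are the first four prices from the table above, while the predictions are made-up numbers, not output of the trained model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.75, 12.5, 4.5, 6.0])   # first four prices from the dataset
y_hat = np.array([2.25, 11.5, 4.5, 7.0])    # hypothetical predictions

mae = mean_absolute_error(y_true, y_hat)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # root of mean squared error

print(mae, rmse)
```

In the notebook itself, the same two calls would take `y_test` and `y_pred` as arguments.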