## Task

**Task:** Implement a boosted decision tree for the auto_mpg data employing [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier). Implement a random forest classifier for the same data employing [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier). Compare the prediction results of both models for different training data subsets and for different parameters of the learning methods.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn import tree

import seaborn as sns

In [2]:
raw_df = pd.read_csv('auto-mpg.csv')                   #Importing CSV file

#### Cleaning Data 

Dropping the last column (`car name`): it has the object data type and holds no numerical values.

In [3]:
raw_df=raw_df.drop(['car name'],axis=1)

In [4]:
#Replacing the '?' placeholders in horsepower with NaN
subs = raw_df['horsepower'] == '?'
raw_df.loc[subs, 'horsepower'] = np.nan

In [5]:
df=raw_df.astype(float)
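The two cleaning steps above (replacing `'?'` with NaN, then casting to float) can also be written with `pd.to_numeric(..., errors="coerce")`. The following is a minimal sketch on a made-up three-row frame; `toy_raw` and `toy_df` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data: horsepower is read as strings
# because some entries are the placeholder '?'.
toy_raw = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],
})

# errors="coerce" turns every entry that cannot be parsed as a number into NaN.
toy_raw["horsepower"] = pd.to_numeric(toy_raw["horsepower"], errors="coerce")
toy_df = toy_raw.astype(float)

print(toy_df.dtypes)
print(toy_df["horsepower"].isna().sum())  # number of coerced placeholders
```

Unlike a targeted replacement of `'?'`, coercion silently maps any unparsable entry to NaN, so it is worth counting the resulting missing values as done in the last line.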

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    float64
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    float64
 7   origin        398 non-null    float64
dtypes: float64(8)
memory usage: 25.0 KB


The horsepower column has fewer non-null values than the other columns (392 of 398).

#### Filling Missing Value

In [7]:
df['horsepower'].mean()

104.46938775510205

In [8]:
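#Replacing the missing values with the mean of the column
#(assigning the result back avoids an in-place fill on a column selection, which newer pandas versions warn about)
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())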
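The fill above uses the mean of the whole column, which is computed before any train/test split. If one prefers the fill value to be learnt from the training rows only, scikit-learn's `SimpleImputer` can be fitted on the training part and reused on the test part. This is a sketch on made-up arrays (`toy_train` and `toy_test` are hypothetical), not what the notebook does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column feature matrices, each with one missing value.
toy_train = np.array([[100.0], [np.nan], [140.0]])
toy_test = np.array([[np.nan], [90.0]])

imputer = SimpleImputer(strategy="mean")
toy_train_filled = imputer.fit_transform(toy_train)  # learns the training mean
toy_test_filled = imputer.transform(toy_test)        # reuses that same mean

print(imputer.statistics_)
print(toy_test_filled)
```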

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    float64
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    float64
 7   origin        398 non-null    float64
dtypes: float64(8)
memory usage: 25.0 KB


Now all columns have the same number of non-null values.

#### Applying AdaBoost and Random Forest Classifier to different subsets

In [10]:
ada_Boost= AdaBoostClassifier(learning_rate =0.02, n_estimators =1000)                     #[1]Citation

In [11]:
random_Forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1 ,random_state=2)       #[2]Citation

In [12]:
#AdaBoost on the full data: features are all columns except origin, target is origin
x = df.drop(['origin'], axis = 1)
y = df['origin']

x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.2,random_state=2)

ada_Boost.fit(x_train, y_train)
accuracy_score(y_test, ada_Boost.predict(x_test))           #accuracy of AdaBoost          #[3]Citation

0.75

In [13]:
#RandomForestClassifier
random_Forest.fit(x_train, y_train)
accuracy_score(y_test, random_Forest.predict(x_test))       #accuracy of RandomForest      #[3]Citation

0.8

On the full dataset, RandomForestClassifier reaches a higher test accuracy (0.80) than AdaBoost (0.75).
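A single train/test split gives one accuracy figure per model. As a complementary check, k-fold cross-validation averages over several splits. The sketch below uses synthetic three-class data from `make_classification` as a stand-in for the auto-mpg features, and smaller ensembles than the notebook so that it runs quickly; all names and parameter values in it are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 7 features and 3 classes, like the origin target.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)

# Smaller ensembles than in the notebook, purely to keep the sketch fast.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy value per fold for each model.
ada_scores = cross_val_score(ada, X_syn, y_syn, cv=5, scoring="accuracy")
rf_scores = cross_val_score(forest, X_syn, y_syn, cv=5, scoring="accuracy")

print("AdaBoost:     mean %.3f, std %.3f" % (ada_scores.mean(), ada_scores.std()))
print("RandomForest: mean %.3f, std %.3f" % (rf_scores.mean(), rf_scores.std()))
```

Reporting the standard deviation next to the mean helps judge whether a gap between the two models is larger than the fold-to-fold variation.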

##### For SUBSET 1

In [14]:
s_set1 = df.sample(n=200,replace=True,random_state=2)
x1 = s_set1.drop(['origin'], axis = 1)
y1 = s_set1['origin']
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1,test_size=0.3,random_state=2)
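Because the subset is drawn with `replace=True`, the same original row can be picked more than once and may then end up in both the training and the test part. The following sketch, on a made-up frame (`toy` is hypothetical), shows one way to count such shared rows through the index labels that `sample` preserves.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame with a three-valued origin-like target.
toy = pd.DataFrame({
    "feature": np.arange(50, dtype=float),
    "origin": np.arange(50) % 3 + 1,
})

# Sampling with replacement keeps the original index labels, duplicates included.
resampled = toy.sample(n=40, replace=True, random_state=2)
train_part, test_part = train_test_split(resampled, test_size=0.3, random_state=2)

# An index label present on both sides means the same original row was
# used for training and for testing.
shared = set(train_part.index) & set(test_part.index)
print("rows present in both train and test:", len(shared))
```

If the count is not zero, the test accuracy can be optimistic. One alternative, under the assumption that resampling is wanted at all, is to split first and then resample only the training part.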

In [15]:
ada_Boost.fit(x1_train, y1_train)

AdaBoostClassifier(learning_rate=0.02, n_estimators=1000)

In [16]:
random_Forest.fit(x1_train, y1_train)

RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=2)

In [17]:
accuracy_score(y1_test, ada_Boost.predict(x1_test))           #accuracy of AdaBoost for 1         #[3]Citation

0.85

In [18]:
accuracy_score(y1_test, random_Forest.predict(x1_test))       #accuracy of RandomForest for 1     #[3]Citation

0.9166666666666666

For subset 1, RandomForestClassifier is again more accurate (about 0.92) than AdaBoost (0.85).

##### For SUBSET 2

In [19]:
s_set2 = df.sample(n=100,replace=True,random_state=2)
x2 = s_set2.drop(['origin'], axis = 1)
y2 = s_set2['origin']
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2,test_size=0.15,random_state=2)

In [20]:
ada_Boost.fit(x2_train, y2_train)

AdaBoostClassifier(learning_rate=0.02, n_estimators=1000)

In [21]:
random_Forest.fit(x2_train, y2_train)

RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=2)

In [22]:
accuracy_score(y2_test, ada_Boost.predict(x2_test))              #accuracy of AdaBoost for 2        #[3]Citation

0.7333333333333333

In [23]:
accuracy_score(y2_test, random_Forest.predict(x2_test))          #accuracy of RandomForest for 2    #[3]Citation

0.8666666666666667

For subset 2, RandomForestClassifier is more accurate (about 0.87) than AdaBoost (about 0.73).

##### For SUBSET 3

In [24]:
s_set3 = df.sample(n=150,replace=True,random_state=2)
x3 = s_set3.drop(['origin'], axis = 1)
y3 = s_set3['origin']
x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y3,test_size=0.2,random_state=2)

In [25]:
ada_Boost.fit(x3_train, y3_train)

AdaBoostClassifier(learning_rate=0.02, n_estimators=1000)

In [26]:
random_Forest.fit(x3_train, y3_train)

RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=2)

In [27]:
accuracy_score(y3_test, ada_Boost.predict(x3_test))              #accuracy of AdaBoost for 3          #[3]Citation

0.8

In [28]:
accuracy_score(y3_test, random_Forest.predict(x3_test))          #accuracy of RandomForest for 3      #[3]Citation

0.8333333333333334

For subset 3, the accuracy of RandomForestClassifier (about 0.83) is almost the same as that of AdaBoost (0.80); on the 30-row test split this is a difference of a single prediction.

Comparing the accuracies across the three subsets, RandomForest scores higher than AdaBoost in every case, although the margin varies from about 0.03 (subset 3) to about 0.13 (subset 2).
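The three subset experiments repeat the same steps with different values of `n` and `test_size`. They could be collected in a small helper, sketched here on synthetic data. `evaluate_on_subset` is a hypothetical function, the data come from `make_classification`, and the ensembles are smaller than in the notebook to keep the run short.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_on_subset(data, target, n, test_size, models, seed=2):
    """Sample n rows with replacement, split, and score each model once."""
    subset = data.sample(n=n, replace=True, random_state=seed)
    X = subset.drop(columns=[target])
    y = subset[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    scores = {}
    for name, model in models.items():
        # clone() gives a fresh, unfitted copy so runs do not affect each other.
        fitted = clone(model).fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, fitted.predict(X_te))
    return scores


# Synthetic stand-in for the cleaned auto-mpg frame (7 features, 3 classes).
X_syn, y_syn = make_classification(
    n_samples=398, n_features=7, n_informative=4, n_classes=3, random_state=0
)
toy_df = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(7)])
toy_df["origin"] = y_syn + 1

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=2),
}

# Same (n, test_size) pairs as subsets 1 to 3 in the notebook.
rows = []
for n, test_size in [(200, 0.3), (100, 0.15), (150, 0.2)]:
    scores = evaluate_on_subset(toy_df, "origin", n, test_size, models)
    rows.append({"n": n, "test_size": test_size, **scores})

summary = pd.DataFrame(rows)
print(summary)
```

Collecting the results in one table makes the per-subset comparison easier to read than separate output cells.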

#### Grid Search on full data for Random Forest Classifier

In [29]:
x = df.drop(['origin'], axis = 1)
y = df['origin']

x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.2,random_state=2)

In [30]:
params_grid = {"max_features": [4, 5, 6],
               "min_samples_split": [2, 3, 10],
               "n_estimators": [200, 1000, 2000]
              }

In [31]:
grid_searchR = GridSearchCV(random_Forest, params_grid,                                #[4]Citation
                           n_jobs=-1, cv=5, scoring='accuracy')

In [32]:
grid_searchR.fit(x_train, y_train)

GridSearchCV(cv=5,
             estimator=RandomForestClassifier(n_estimators=1000, n_jobs=-1,
                                              random_state=2),
             n_jobs=-1,
             param_grid={'max_features': [4, 5, 6],
                         'min_samples_split': [2, 3, 10],
                         'n_estimators': [200, 1000, 2000]},
             scoring='accuracy')

In [33]:
grid_searchR.best_params_ 

{'max_features': 5, 'min_samples_split': 2, 'n_estimators': 1000}
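`best_params_` reports only the winning combination. The fitted search object also stores the cross-validated score of every combination in `cv_results_`, which shows how close the runners-up are. The sketch below uses a deliberately small grid on synthetic data; `toy_grid` and `search` are hypothetical names, and the values are chosen for speed rather than taken from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 7 features and 3 classes.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=7, n_informative=4, n_classes=3, random_state=0
)

# A small illustrative grid: 2 x 2 = 4 combinations.
toy_grid = {"max_features": [2, 4], "min_samples_split": [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=2),
    toy_grid, cv=3, scoring="accuracy",
)
search.fit(X_syn, y_syn)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_features", "param_min_samples_split",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
print("best cross-validated accuracy:", search.best_score_)
```

Note that `best_score_` is the mean cross-validated accuracy on the training data, which is a different quantity from the accuracy on a held-out test split.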

In [34]:
predict_bestR = grid_searchR.best_estimator_

In [35]:
accuracy_score(y_test, predict_bestR.predict(x_test))

0.825

#### Grid Search on full data for AdaBoost

In [36]:
params_grid = {"learning_rate" : [0.01, 0.02, 0.03],
               "n_estimators" : [200, 1000, 2000]
              }

In [37]:
grid_searchA = GridSearchCV(ada_Boost, params_grid,                                   #[4]Citation
                           n_jobs=-1, cv=5, scoring='accuracy')

In [38]:
grid_searchA.fit(x_train, y_train)

GridSearchCV(cv=5,
             estimator=AdaBoostClassifier(learning_rate=0.02,
                                          n_estimators=1000),
             n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.02, 0.03],
                         'n_estimators': [200, 1000, 2000]},
             scoring='accuracy')

In [39]:
grid_searchA.best_params_ 

{'learning_rate': 0.03, 'n_estimators': 1000}

In [40]:
predict_bestA = grid_searchA.best_estimator_

In [41]:
accuracy_score(y_test, predict_bestA.predict(x_test))

0.7625
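Accuracy is a single number over all three origin classes. A confusion matrix and a per-class report show which classes a model confuses. This is a sketch on synthetic, deliberately imbalanced data; the class weights are an assumption made for illustration and are not measured from auto-mpg.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data with one dominant class (assumed weights).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=7, n_informative=4, n_classes=3,
    weights=[0.6, 0.2, 0.2], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=2, stratify=y_syn
)

model = RandomForestClassifier(n_estimators=50, random_state=2).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
print(cm)
print(classification_report(y_te, y_pred, zero_division=0))
```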

After a grid search over both models, the best RandomForestClassifier configuration reaches a test accuracy of 0.825, against 0.7625 for the best AdaBoost configuration. With the best parameters found, RandomForestClassifier is therefore still more accurate than AdaBoost on the auto-mpg dataset.

## REFERENCES

[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

[3] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn-metrics-accuracy-score

[4] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html