Tree-Based Models - Q16 - 25/July
===================================

We want to predict the price range of a mobile phone from its characteristics, such as memory, battery power, and camera specification. Data for about 2000 phones is provided in 09_mobile_price.csv in the Google Drive folder:
https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU

1) Train a decision tree to predict the price category.
     
    a) What is the best score we get? Use 10-fold CV.
    b) What are the best tree parameters?
    c) Which variables come out as important?

2) Now train a Random Forest classifier. How does its score compare with the decision tree's?

Note that this is multi-class classification.

# Answers
1)
    
    a) What is the best score we get? Use 10-fold CV.
       Accuracy (in %) of the tree refit with the parameters selected by 10-fold CV:
        - Tree score on train set is 94.69
        - Tree score on test set is  84.75
    b) What are the best tree parameters?
        - 'ccp_alpha': 0.0,
        - 'max_depth': 7,
        - 'min_samples_leaf': 2,
        - 'min_samples_split': 5
        - all other parameters at their DecisionTreeClassifier defaults
    c) Which variables come out as important?
       The following variables have non-zero importance; ram dominates, followed by battery_power, px_height, and px_width.
    
| feature_name     | importance |
|------------------|------------|
| ram              | 0.636802   |
| battery_power    | 0.164511   |
| px_height        | 0.085881   |
| px_width         | 0.082579   |
| mobile_wt        | 0.008470   |
| frontcamera      | 0.005183   |
| screen_height    | 0.005104   |
| memory           | 0.003965   |
| primarycamera_mp | 0.003514   |
| screen_width     | 0.001658   |
| mobile_thickness | 0.001191   |
| talk_time        | 0.000588   |
| n_cores          | 0.000556   |
    
    
2) Now train a Random Forest classifier. How does its score compare with the decision tree's?

|Model        | Train Score | Test Score|
|-----------  | ----------- |---------- |
|Decision Tree|94.69        |84.75      |
|Random Forest|93.88        |87.0       |

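Both the decision tree and the random forest score noticeably higher on the train set than on the test set, which indicates over-fitting. The random forest has the smaller train-test gap and the better test score.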
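To put a number on the over-fitting comparison, the short sketch below computes the train-test gap for each model from the scores in the table. The dictionary layout and variable names are illustrative only.

```python
# Scores copied from the comparison table (accuracy, in percent).
scores = {
    "Decision Tree": {"train": 94.69, "test": 84.75},
    "Random Forest": {"train": 93.88, "test": 87.0},
}

# Gap between train and test accuracy; a larger gap suggests more over-fitting.
gaps = {name: round(s["train"] - s["test"], 2) for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap} points")
```

The decision tree loses about 9.94 points between train and test, the random forest about 6.88.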

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("09_mobile_price.csv")
df.head(2)

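|   | battery_power | blue | clock_speed | dual_sim | frontcamera | Has4G | memory | mobile_thickness | mobile_wt | n_cores | ... | px_height | px_width | ram  | screen_height | screen_width | talk_time | Has3G | touch_screen | wifi | price_range |
|---|---------------|------|-------------|----------|-------------|-------|--------|------------------|-----------|---------|-----|-----------|----------|------|---------------|--------------|-----------|-------|--------------|------|-------------|
| 0 | 842           | 0    | 2.2         | 0        | 1           | 0     | 7      | 0.6              | 188       | 2       | ... | 20        | 756      | 2549 | 9             | 7            | 19        | 0     | 0            | 1    | 1           |
| 1 | 1021          | 1    | 0.5         | 1        | 0           | 1     | 53     | 0.7              | 136       | 3       | ... | 905       | 1988     | 2631 | 17            | 3            | 7         | 1     | 1            | 0    | 2           |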


In [20]:
#df.columns

In [4]:
#df.describe().transpose()

In [5]:
x_vars = ['battery_power', 'blue', 'clock_speed', 'dual_sim', 'frontcamera',
          'Has4G', 'memory', 'mobile_thickness', 'mobile_wt', 'n_cores',
          'primarycamera_mp', 'px_height', 'px_width', 'ram', 'screen_height',
          'screen_width', 'talk_time', 'Has3G', 'touch_screen', 'wifi'
         ]
y_var = 'price_range'

In [6]:
x_train, x_test, y_train, y_test = train_test_split(df[x_vars], df[y_var], test_size=0.2, random_state=0, stratify=df[y_var])
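The `stratify` argument keeps the class proportions of `price_range` the same in the train and test parts. The sketch below illustrates this on a synthetic stand-in for the phone data; the `demo` frame, its two columns, and the four balanced classes are assumptions made for the example, not properties read from `09_mobile_price.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phone data (assumption): 2000 rows, 4 balanced classes.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ram": rng.integers(256, 4000, size=2000),
    "battery_power": rng.integers(500, 2000, size=2000),
    "price_range": np.repeat([0, 1, 2, 3], 500),
})

x_tr, x_te, y_tr, y_te = train_test_split(
    demo[["ram", "battery_power"]], demo["price_range"],
    test_size=0.2, random_state=0, stratify=demo["price_range"],
)

# With stratify, each class keeps the same share in both parts.
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Running the same `value_counts(normalize=True)` check on `y_train` and `y_test` from the cell above would confirm the split on the real data.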

# Decision Tree

In [7]:
# min_samples_split must be at least 2 when given as an integer, so the grid starts at 2
tune_parm_space = {'min_samples_split':[2, 5, 10, 15],
                   'max_depth':range(1, 8),
                   'min_samples_leaf':[1, 2, 5, 10, 15],
                   'ccp_alpha':[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
                  }

# Exhaustive search over the grid, scored by accuracy with 10-fold CV
tree_model = DecisionTreeClassifier(random_state=1)
tree_model = GridSearchCV(tree_model, tune_parm_space, cv=10)
tree_model.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=1),
             param_grid={'ccp_alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                         'max_depth': range(1, 8),
                         'min_samples_leaf': [1, 2, 5, 10, 15],
                         'min_samples_split': [2, 5, 10, 15]})

In [8]:
#DecisionTreeClassifier().get_params()

In [9]:
score_train = tree_model.score(x_train, y_train)
score_test = tree_model.score(x_test, y_test)

print(f"Tree score on train set is {np.round(score_train * 100, 2)}")
print(f"Tree score on test set is  {np.round(score_test * 100, 2)}")

Tree score on train set is 94.69
Tree score on test set is  84.75
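The scores printed here come from the best estimator refit on the whole training set. The 10-fold CV score that the question asks about is stored separately on the fitted search object as `best_score_`. The sketch below shows the difference on synthetic four-class data; the data and the small grid are assumptions for illustration, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the phone features (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

# Small illustrative grid, not the grid used in the notebook.
grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), grid, cv=10)
search.fit(X_tr, y_tr)

# best_score_ is the mean accuracy over the 10 validation folds
# for the best parameter combination.
cv_score = search.best_score_
print(f"Mean 10-fold CV accuracy: {cv_score * 100:.2f}")
print(f"Held-out test accuracy:   {search.score(X_te, y_te) * 100:.2f}")
```

In the notebook, the equivalent value would be `tree_model.best_score_`.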


In [10]:
tree_model.best_params_

{'ccp_alpha': 0.0,
 'max_depth': 7,
 'min_samples_leaf': 2,
 'min_samples_split': 5}
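The search settles on `ccp_alpha` 0.0, which may partly reflect how coarse the grid is: with Gini impurity and four classes the root impurity is at most 0.75, so alphas of 0.1 and above likely prune most of the tree away. One way to get candidate values on a more useful scale is `cost_complexity_pruning_path`, sketched here on synthetic data. The data and the choice of six quantiles are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the phone features (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Effective alphas at which the fully grown tree loses a subtree when pruned.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

print(f"{len(ccp_alphas)} candidate alphas, "
      f"from {ccp_alphas.min():.5f} to {ccp_alphas.max():.5f}")

# Thin the path to a handful of values usable as a 'ccp_alpha' grid.
candidates = np.unique(np.quantile(ccp_alphas, np.linspace(0, 1, 6)))
print(candidates)
```

The resulting `candidates` array could replace the `'ccp_alpha'` list in `tune_parm_space` after computing the path on `x_train` and `y_train`.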

In [11]:
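# Impurity-based importances from the tree refit with the best parameters
feature_importance = tree_model.best_estimator_.feature_importances_
df_feature_imprt = pd.DataFrame({'feature_name': x_vars, 'importance': feature_importance})
# Filter first, then sort, so the boolean mask lines up with the frame's index
df_feature_imprt[df_feature_imprt['importance'] > 0].sort_values(by='importance', ascending=False)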

|    | feature_name     | importance |
|----|------------------|------------|
| 13 | ram              | 0.636802   |
| 0  | battery_power    | 0.164511   |
| 11 | px_height        | 0.085881   |
| 12 | px_width         | 0.082579   |
| 8  | mobile_wt        | 0.008470   |
| 4  | frontcamera      | 0.005183   |
| 14 | screen_height    | 0.005104   |
| 6  | memory           | 0.003965   |
| 10 | primarycamera_mp | 0.003514   |
| 15 | screen_width     | 0.001658   |
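The `feature_importances_` values are impurity-based and normalized to sum to 1, so each number reads as a share of the tree's total impurity reduction. A compact alternative to the DataFrame approach is a `pd.Series` indexed by feature name, sketched below on synthetic data. The data and the `feature_i` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data with hypothetical feature names (assumption).
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=7, random_state=1).fit(X, y)

# Keep only features the tree actually split on, largest share first.
importance = pd.Series(tree.feature_importances_, index=names)
importance = importance[importance > 0].sort_values(ascending=False)
print(importance)
print(f"Sum of importances: {tree.feature_importances_.sum():.3f}")
```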


# Random Forest

In [12]:
#RandomForestClassifier().get_params()

In [13]:
# min_samples_split must be at least 2 when given as an integer, so the grid starts at 2
tune_parm_space = {'min_samples_split':[2, 5, 10, 15],
                   'max_depth':range(1, 7),
                   'min_samples_leaf':[1, 2, 5, 10, 15],
                   'ccp_alpha':[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
                  }

# Same search setup as for the decision tree: accuracy with 10-fold CV
rf_model = RandomForestClassifier(random_state=1)
rf_model = GridSearchCV(rf_model, tune_parm_space, cv=10)
rf_model.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=1),
             param_grid={'ccp_alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                         'max_depth': range(1, 7),
                         'min_samples_leaf': [1, 2, 5, 10, 15],
                         'min_samples_split': [2, 5, 10, 15]})

In [14]:
rf_model.best_params_

{'ccp_alpha': 0.0,
 'max_depth': 6,
 'min_samples_leaf': 5,
 'min_samples_split': 15}

In [16]:
score_train = rf_model.score(x_train, y_train)
score_test = rf_model.score(x_test, y_test)

print(f"Random Forest score on train set is {np.round(score_train * 100, 2)}")
print(f"Random Forest score on test set is  {np.round(score_test * 100, 2)}")

Random Forest score on train set is 93.88
Random Forest score on test set is  87.0
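A single train/test split gives one number per model, so a difference of a couple of points can depend on the split. A possible cross-check is to score both models on the same 10 stratified folds and compare means and spreads. The sketch below does this on synthetic data; the data, the fixed hyperparameters, and `n_estimators=50` are assumptions for illustration rather than the tuned models above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the phone features (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Same 10 stratified folds for both models so the scores are comparable.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=7, random_state=1),
    "Random Forest": RandomForestClassifier(max_depth=6, n_estimators=50,
                                            random_state=1),
}

results = {}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=folds)  # accuracy per fold
    results[name] = (acc.mean(), acc.std())
    print(f"{name}: {acc.mean() * 100:.2f} +/- {acc.std() * 100:.2f}")
```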
