# Predicting mobile prices using decision trees
The purpose of this notebook is to predict mobile prices using a decision tree model. 
## Data:
Data obtained from kaggle.com:
"Mobile Price Classification" submitted by Abhishek Sharma
https://www.kaggle.com/iabhishekofficial/mobile-price-classification

## Gameplan:
* Overview of the data
* Creating a basic decision tree model
* Creating a random forest model
* Visualising and comparing results

In [1]:
# Importing
import pandas as pd
import numpy as np
import sklearn as skl
import seaborn as sns
import matplotlib.pyplot as plt

# Reading in the data
data = pd.read_csv('/users/kristiandampedersen/documents/mobile_ml_proj/data/train.csv')
test = pd.read_csv('/users/kristiandampedersen/documents/mobile_ml_proj/data/test.csv')

## Overview of the data
First things first, let's get an overview of the data.

In [2]:
data.head()

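| | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
| 2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |
| 3 | 615 | 1 | 2.5 | 0 | 0 | 0 | 10 | 0.8 | 131 | 6 | ... | 1216 | 1786 | 2769 | 16 | 8 | 11 | 1 | 0 | 0 | 2 |
| 4 | 1821 | 1 | 1.2 | 0 | 13 | 1 | 44 | 0.6 | 141 | 2 | ... | 1208 | 1212 | 1411 | 8 | 2 | 15 | 1 | 1 | 0 | 1 |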


In [3]:
data.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [4]:
data.describe()

| | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | ... | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 1238.5185 | 0.495 | 1.52225 | 0.5095 | 4.3095 | 0.5215 | 32.0465 | 0.50175 | 140.249 | 4.5205 | ... | 645.108 | 1251.5155 | 2124.213 | 12.3065 | 5.767 | 11.011 | 0.7615 | 0.503 | 0.507 | 1.5 |
| std | 439.418206 | 0.5001 | 0.816004 | 0.500035 | 4.341444 | 0.499662 | 18.145715 | 0.288416 | 35.399655 | 2.287837 | ... | 443.780811 | 432.199447 | 1084.732044 | 4.213245 | 4.356398 | 5.463955 | 0.426273 | 0.500116 | 0.500076 | 1.118314 |
| min | 501.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 2.0 | 0.1 | 80.0 | 1.0 | ... | 0.0 | 500.0 | 256.0 | 5.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 851.75 | 0.0 | 0.7 | 0.0 | 1.0 | 0.0 | 16.0 | 0.2 | 109.0 | 3.0 | ... | 282.75 | 874.75 | 1207.5 | 9.0 | 2.0 | 6.0 | 1.0 | 0.0 | 0.0 | 0.75 |
| 50% | 1226.0 | 0.0 | 1.5 | 1.0 | 3.0 | 1.0 | 32.0 | 0.5 | 141.0 | 4.0 | ... | 564.0 | 1247.0 | 2146.5 | 12.0 | 5.0 | 11.0 | 1.0 | 1.0 | 1.0 | 1.5 |
| 75% | 1615.25 | 1.0 | 2.2 | 1.0 | 7.0 | 1.0 | 48.0 | 0.8 | 170.0 | 7.0 | ... | 947.25 | 1633.0 | 3064.5 | 16.0 | 9.0 | 16.0 | 1.0 | 1.0 | 1.0 | 2.25 |
| max | 1998.0 | 1.0 | 3.0 | 1.0 | 19.0 | 1.0 | 64.0 | 1.0 | 200.0 | 8.0 | ... | 1960.0 | 1998.0 | 3998.0 | 19.0 | 18.0 | 20.0 | 1.0 | 1.0 | 1.0 | 3.0 |


In [5]:
data.isnull()

| | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1996 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1998 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


From the above we gather that our data consists of the following variables (we gather descriptions from Kaggle):

* **id**: Index (listed in the Kaggle description, but not among the columns returned by `data.columns` above)

* **battery_power**: Total energy a battery can store at one time, measured in mAh

* **blue**: Has bluetooth or not

* **clock_speed**: Speed at which microprocessor executes instructions

* **dual_sim**: Has dual sim support or not

* **fc**: Front camera megapixels

* **four_g**: Has 4G or not

* **int_memory**: Internal memory in gigabytes

* **m_dep**: Mobile depth in cm

* **mobile_wt**: Weight of mobile phone

* **n_cores**: Number of cores of processor

* **pc**: Primary camera megapixels

* **px_height**: Pixel resolution height

* **px_width**: Pixel resolution width

* **ram**: Random Access Memory in megabytes

* **sc_h**: Screen height of mobile in cm

* **sc_w**: Screen width of mobile in cm

* **talk_time**: The longest a battery can last whilst talking

* **three_g**: Has 3G or not

* **touch_screen**: Has touch screen or not

* **wifi**: Has wifi or not
* **price_range**: Prices from 0 to 3
   
   0) low cost
   
   1) medium cost
   
   2) high cost
   
   3) very high cost
   
Furthermore, we gather that there is no missing data in our dataset (every displayed column has a count of 2000 in `data.describe()`), meaning we can comfortably go ahead.
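
Because `data.isnull()` returns one boolean per cell and its display is truncated, a per-column count is an easier way to confirm that nothing is missing. The sketch below shows the pattern on a tiny made-up frame (`toy` is hypothetical and is not the Kaggle data); on the real data the equivalent call would be `data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; one value is missing on purpose.
toy = pd.DataFrame({
    "battery_power": [842, 1021, 563],
    "ram": [2549, np.nan, 2603],
    "price_range": [1, 2, 2],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

A column total of zero everywhere would support the "no missing data" conclusion without having to scan the boolean table by eye.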

### Results from overview
For this analysis, our goal is to create a model that predicts `price_range` from the other variables.
Usually an exploratory analysis would be desirable to gain insight into the data, but since this is a practice project it will be omitted here.

## Creating a basic decision tree model
In this section I'll be creating a basic decision tree model using sklearn's DecisionTreeClassifier.
First, however, I need to organise and split the data.

#### Creating our training and validation datasets.

In [6]:
# Defining targets
from sklearn.model_selection import train_test_split
X = data.drop('price_range', axis=1)
y = data['price_range']
X_train, x_val, y_train, y_val = train_test_split(X, y, random_state=1)
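
`train_test_split` is called here without a `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows for validation. The sketch below illustrates the resulting sizes using placeholder arrays with the same number of rows as the training file (2000, per `data.describe()`); the array contents and the names `X_demo`, `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 2000 rows, matching the row count of the training file; values are arbitrary.
X_demo = np.arange(2000 * 3).reshape(2000, 3)
y_demo = np.arange(2000) % 4  # four classes, like price_range

# No test_size given, so the default 75/25 split applies.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

print(X_tr.shape, X_va.shape)
```

If equal class proportions in both parts matter, `train_test_split` also accepts a `stratify` argument (for example `stratify=y`); the notebook does not use it.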

####  Creating our model

In [7]:
# Defining the model
from sklearn.tree import DecisionTreeClassifier
simple_model = DecisionTreeClassifier(random_state=1)

#model fit
simple_model.fit(X_train, y_train)

DecisionTreeClassifier(random_state=1)

Let's predict some results, shall we?

In [8]:
# Measuring fit
from sklearn.metrics import accuracy_score
y_true = y_val
y_pred = simple_model.predict(x_val)
a_score = accuracy_score(y_true, y_pred)
print(a_score)

0.834


Running this model immediately gets us an accuracy of 83.4%, which is definitely a good start. However, let's try experimenting a bit with different tree sizes by limiting the maximum number of leaf nodes.
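
It helps to keep in mind that `max_leaf_nodes` is an upper bound rather than a target: a tree cannot have more leaves than it has training samples, and it stops growing earlier once its leaves are pure. The sketch below checks this with `get_n_leaves()` on synthetic data from `make_classification`, which is only an assumed stand-in for the phone data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the phone data: 300 rows, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=4, random_state=1
)

# Unconstrained tree: grows until it cannot split any further.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Tree capped at 20 leaves.
capped_tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=1).fit(X_demo, y_demo)

print("leaves without a cap:", full_tree.get_n_leaves())
print("leaves with max_leaf_nodes=20:", capped_tree.get_n_leaves())
```

For a sweep over many values, this suggests that caps far above the unconstrained leaf count are unlikely to add information.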

In [9]:
# Examining tree size by varying max_leaf_nodes
list_a_scores = []
nr_leafs = list(range(2, 5000))

def getascore(max_leaf_nodes, X_train, x_val, y_train, y_val):
    # Fit a tree capped at max_leaf_nodes leaves and return its validation accuracy
    simple_model2 = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    simple_model2.fit(X_train, y_train)
    preds_val = simple_model2.predict(x_val)
    a_score = accuracy_score(y_val, preds_val)
    return a_score

# Score every candidate value
for max_leaf_nodes in nr_leafs:
    my_a_score = getascore(max_leaf_nodes, X_train, x_val, y_train, y_val)
    list_a_scores.append(my_a_score)

a_score_table = {'Accuracy scores': list_a_scores, 'Nr of leafs': nr_leafs}

In [10]:
df_accuracy = pd.DataFrame(data=a_score_table)
df2 = df_accuracy.sort_values(by='Accuracy scores', ascending=False)
df2.head(10)

| | Accuracy scores | Nr of leafs |
|---|---|---|
| 93 | 0.844 | 95 |
| 35 | 0.844 | 37 |
| 34 | 0.844 | 36 |
| 94 | 0.844 | 96 |
| 36 | 0.842 | 38 |
| 101 | 0.84 | 103 |
| 116 | 0.84 | 118 |
| 37 | 0.84 | 39 |
| 124 | 0.84 | 126 |
| 123 | 0.84 | 125 |


In [11]:
# Finally putting together our final model
final_tree_model = DecisionTreeClassifier(max_leaf_nodes=38, random_state=1)
final_tree_model.fit(X_train, y_train)
pred_val = final_tree_model.predict(x_val)
a_score = accuracy_score(y_val, pred_val)
print(a_score)

0.842


### Results
After experimenting a bit with the maximum number of leaf nodes, a value of 38 was chosen for the final tree. This model classifies correctly 84.2% of the time, a small improvement over the 83.4% of our initial model. Note that the table above shows a marginally higher 84.4% for 36, 37, 95 and 96 leaf nodes.
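
Picking `max_leaf_nodes` by its score on the same validation set that is then used to report accuracy can make the final number slightly optimistic. One common alternative, which is not what this notebook does, is to choose the value by cross-validation on the training data. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the data and the candidate values in `param_grid` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data: 400 rows, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=1
)

# Candidate leaf caps (illustrative values only).
param_grid = {"max_leaf_nodes": [5, 10, 20, 40, 80]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With this approach the held-out validation set is only touched once, to score the model that cross-validation selected.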

## Creating a random forest model
Another model that's closely related to the decision tree above is the random forest. The basic gist of a random forest is that it trains many different trees, each on a random sample of the rows and with a random subset of features considered at each split, and then combines their predictions. To build this I'll use the RandomForestClassifier from scikit-learn.
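
To make the "many trees, combined" idea concrete, the sketch below fits a small forest on synthetic data (an assumed stand-in for the phone data) and checks that averaging the class probabilities of the individual trees in `estimators_` gives the same result as the forest's own `predict_proba`. The choice of 25 trees is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the phone data: 300 rows, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=4, random_state=1
)

forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Each fitted tree is available in estimators_; collect their class probabilities.
per_tree_proba = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own probabilities.
mean_proba = per_tree_proba.mean(axis=0)
matches_forest = bool(np.allclose(mean_proba, forest.predict_proba(X_demo)))

print(len(forest.estimators_), "trees")
print("average of trees matches forest:", matches_forest)
```

The forest's predicted class is then the one with the highest averaged probability, which is why a single unusual tree has limited influence on the result.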

### Creating the model

In [14]:
# Same as above.
from sklearn.ensemble import RandomForestClassifier
simple_forest = RandomForestClassifier(random_state=1)
simple_forest.fit(X_train,y_train)

# Evaluating our model
y_pred = simple_forest.predict(x_val)
a_score = accuracy_score(y_pred, y_val)
print(a_score)

0.86


Even at this early stage we're already seeing improved results, with the random forest reaching 86% accuracy compared to the 84.2% we achieved with the decision tree.

### Experimenting with depth
However, as above, it might be useful to experiment with the maximum number of leaf nodes. For this we'll build a function much like the one above.

In [21]:
# Creating our "buckets"
rf_list_scores = []
rf_nr_leaves = [i for i in range(2,5000)]

def getascorerf(max_leaf_nodes, X_train, x_val, y_train, y_val):
    rf_model = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
    rf_model.fit(X_train, y_train)
    pred_value = rf_model.predict(x_val)
    score = accuracy_score(pred_value, y_val)
    return score

for max_leaf_nodes in range(2,5000):
    myscore = getascorerf(max_leaf_nodes, X_train, x_val, y_train, y_val)
    rf_list_scores.append(myscore)

In [20]:
# Creating our dataframe
rf_data = {'Accuracy': rf_list_scores, 'Nr of leaves': rf_nr_leaves}
df_rfresults = pd.DataFrame(data=rf_data)
df_randomresults = df_rfresults.sort_values(by='Accuracy', ascending=False)
df_randomresults.head()

| | Accuracy | Nr of leaves |
|---|---|---|
| 22 | 0.848 | 24 |
| 38 | 0.848 | 40 |
| 40 | 0.846 | 42 |
| 46 | 0.846 | 48 |
| 34 | 0.846 | 36 |


In [22]:
#Final model
best_model = RandomForestClassifier(random_state=1)
best_model.fit(X_train,y_train)

# Evaluating our model
y_pred = best_model.predict(x_val)
a_score = accuracy_score(y_pred, y_val)
print(a_score)

0.86


In the end, a bog-standard `RandomForestClassifier` with default settings ended up being the best with **86% accuracy**, ahead of the best leaf-limited forest at 84.8%.

## Visualising and comparing results
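
One possible starting point for this comparison is to collect the validation accuracies printed earlier in the notebook into a single table. The sketch below does only that; the model labels are descriptive names chosen here, not identifiers from the notebook.

```python
import pandas as pd

# Validation accuracies as printed earlier in this notebook.
results = pd.DataFrame({
    "Model": [
        "Decision tree (default)",
        "Decision tree (max_leaf_nodes=38)",
        "Random forest (default)",
    ],
    "Accuracy": [0.834, 0.842, 0.86],
})

# Best model first.
results = results.sort_values(by="Accuracy", ascending=False).reset_index(drop=True)
print(results)
```

With seaborn already imported as `sns`, a call such as `sns.barplot(data=results, x="Accuracy", y="Model")` would be one way to turn this table into a chart.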