# Mobile price classification case study

Lexcorp is a new startup venturing into the mobile manufacturing industry. They want to compete with the leading mobile manufacturers and plan to target every audience segment, launching devices that range from basic mobiles to higher-end models with various features.

As a marketing strategy, they plan to fix only 4 different price points for their mobiles. They have approached Wayne Enterprises, where you work as a Data Scientist. You have been assigned the task of building a machine learning model that predicts which of these 4 price categories a mobile falls into, which makes this a multi-class classification problem.

In [1]:
# importing required libraries

import pandas as pd


In [4]:
# Loading the data
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
mobile = pd.read_csv(r'F:\Skillenable\Data frames\Mobile Classification.csv')
mobile

,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,794,1,0.5,1,0,1,2,0.8,106,6,...,1222,1890,668,13,4,19,1,1,0,0
1996,1965,1,2.6,1,0,0,39,0.2,187,4,...,915,1965,2032,11,10,16,1,1,1,2
1997,1911,0,0.9,1,1,1,36,0.7,108,8,...,868,1632,3057,9,1,5,1,1,0,3
1998,1512,0,0.9,0,4,1,46,0.1,145,5,...,336,670,869,18,10,19,1,1,1,0


In [6]:
# information about dataset
mobile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

In [9]:
# Check whether there are any missing values in the data
mobile.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

## Data Analysis 

Group the data by price category and compute the average RAM for each category.

In [14]:
avg_RAM = mobile.groupby('price_range')[['ram']].mean().reset_index()
avg_RAM

,price_range,ram
0,0,785.314
1,1,1679.49
2,2,2582.816
3,3,3449.232


The average RAM rises steadily from category 0 to category 3: it is lowest for category 0 and highest for category 3. This suggests that 0 is the lowest price category and 3 is the highest.


In [15]:
avg_int_memory = mobile.groupby('price_range')[['int_memory']].mean().reset_index()
avg_int_memory

,price_range,int_memory
0,0,31.174
1,1,32.116
2,2,30.92
3,3,33.976
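
The two groupings above can also be computed in a single call, which makes it easier to compare how strongly each feature separates the price categories. The sketch below uses a tiny made-up frame (`toy`, a hypothetical stand-in and not the real dataset) that reuses the notebook's column names; the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data that mimics the notebook's column names (not the real dataset).
toy = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "ram":         [500, 900, 1500, 1900, 2400, 2800, 3300, 3700],
    "int_memory":  [30, 34, 28, 36, 31, 33, 29, 35],
})

# Mean of several features per price category in one groupby call.
avg_by_price = toy.groupby("price_range")[["ram", "int_memory"]].mean()
print(avg_by_price)

# Range of the group means relative to the overall mean: a rough indicator of
# how much a feature varies across the price categories.
relative_spread = (avg_by_price.max() - avg_by_price.min()) / toy[["ram", "int_memory"]].mean()
print(relative_spread)
```

In the real data, the RAM means climb steeply from category 0 to category 3, while the internal memory means stay between roughly 31 and 34, so RAM separates the categories far more clearly than internal memory does.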


## Model Building - Random Forest


In [16]:
# Identify the input (X) and output (Y) variables

Y = mobile[['price_range']]
X = mobile.drop(columns=['price_range'])

In [17]:
Y

,price_range
0,1
1,2
2,2
3,2
4,1
...,...
1995,0
1996,2
1997,3
1998,0


In [18]:
X

,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,794,1,0.5,1,0,1,2,0.8,106,6,14,1222,1890,668,13,4,19,1,1,0
1996,1965,1,2.6,1,0,0,39,0.2,187,4,3,915,1965,2032,11,10,16,1,1,1
1997,1911,0,0.9,1,1,1,36,0.7,108,8,3,868,1632,3057,9,1,5,1,1,0
1998,1512,0,0.9,0,4,1,46,0.1,145,5,5,336,670,869,18,10,19,1,1,1


In [19]:
# Splitting the data into train and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)



In [20]:
len(X_train), len(X_test), len(Y_train), len(Y_test)

(1600, 400, 1600, 400)
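
A plain random split does not guarantee that the four price categories appear in the same proportions in the train and test sets. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on made-up data, where `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `Y`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 4 balanced classes (not the real dataset).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"ram": rng.integers(256, 4000, size=200)})
y_toy = pd.Series(np.repeat([0, 1, 2, 3], 50), name="price_range")

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().sort_index())  # 40 rows per class
print(y_te.value_counts().sort_index())  # 10 rows per class
```

Stratifying would change which rows land in the test set, so the metrics reported later in this notebook would not be reproduced exactly with it.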

### Building the Random Forest model


In [21]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=1000, random_state=42)
# Pass the target as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
rf = model.fit(X_train, Y_train.values.ravel())
print("The model has been built successfully!")


The model has been built successfully!


In [25]:
# Predicting on the test data

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
Y_test = Y_test.copy()
Y_test['Prediction'] = model.predict(X_test)
Y_test


,price_range,Prediction
1860,0,0
353,2,2
1333,1,1
905,3,3
1289,1,1
...,...,...
965,3,3
1284,2,2
1739,1,1
261,1,1


In [26]:
# Checking the accuracy

from sklearn.metrics import confusion_matrix, accuracy_score


In [27]:
print(confusion_matrix(Y_test['price_range'], Y_test['Prediction']))


[[101   4   0   0]
 [  7  77   7   0]
 [  0   8  77   7]
 [  0   0  13  99]]


In [28]:
print(accuracy_score(Y_test['price_range'], Y_test['Prediction']))


0.885
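
The accuracy above can be recovered directly from the confusion matrix, and the same matrix also gives per-class precision and recall. The sketch below copies the matrix printed earlier and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = np.array([
    [101,   4,   0,   0],
    [  7,  77,   7,   0],
    [  0,   8,  77,   7],
    [  0,   0,  13,  99],
])

# Accuracy = correctly classified samples (diagonal) / all test samples.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.885

# Recall per class: correct predictions / actual samples of that class (row sums).
recall = np.diag(cm) / cm.sum(axis=1)
# Precision per class: correct predictions / samples predicted as that class (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

for label, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {label}: precision={p:.3f}, recall={r:.3f}")
```

Every non-zero error sits in a cell next to the diagonal, so the model only ever confuses a price category with its immediate neighbour.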


In [30]:
# Feature importance
model.feature_importances_

array([0.0748839 , 0.00668696, 0.02833914, 0.00668839, 0.02474776,
       0.00639886, 0.03779868, 0.02529849, 0.0399934 , 0.02328558,
       0.02970823, 0.05717705, 0.05587343, 0.47689485, 0.02780285,
       0.02811955, 0.0307812 , 0.00593997, 0.00682932, 0.00675237])

In [31]:
forest_importance = pd.Series(model.feature_importances_, index=X.columns)
forest_importance.sort_values(ascending=False).head(10)

ram              0.476895
battery_power    0.074884
px_height        0.057177
px_width         0.055873
mobile_wt        0.039993
int_memory       0.037799
talk_time        0.030781
pc               0.029708
clock_speed      0.028339
sc_w             0.028120
dtype: float64
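
The values in `feature_importances_` are impurity-based and computed from the training data, so a common cross-check is permutation importance measured on held-out data. The sketch below is not part of the original analysis: it runs on made-up data (`X_toy`, `y_toy` and `toy_model` are hypothetical names) in which the label depends only on `ram`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: the label depends on `ram` only; `noise` is irrelevant.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "ram": rng.integers(256, 4000, size=400),
    "noise": rng.normal(size=400),
})
# Bin RAM into four integer classes (0 to 3) to act as a made-up price_range.
y_toy = pd.cut(X_toy["ram"], bins=[0, 1200, 2100, 3000, 4000], labels=False)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
toy_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in accuracy.
result = permutation_importance(toy_model, X_te, y_te, n_repeats=10, random_state=0)
perm_importance = pd.Series(result.importances_mean, index=X_toy.columns)
print(perm_importance.sort_values(ascending=False))
```

Applied to the notebook's own `model`, `X_test` and `Y_test['price_range']`, the same call would show whether `ram` still dominates when importance is measured on unseen data.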