# Mobile Price Classification Data Set

Bob has started his own mobile phone company and wants to determine prices for the phones he sells. To do so, he has collected data on the features of different mobile phones and their prices, in order to analyse it and predict prices. The data set we are going to analyse consists of mobile phone features and a price range for each phone (the `price_range` column, with classes 0 to 3). Our task is to build machine learning models that predict this price range.
The `train.csv` data set is used to train the models, and the `test.csv` file is used to predict price ranges for unseen phones. The data set is available on Kaggle [here](https://www.kaggle.com/iabhishekofficial/mobile-price-classification).

In [1]:
#import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [3]:
#read the train data set
df = pd.read_csv('train.csv')

In [4]:
#display first 3 rows
df.head(3)

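| | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
| 2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |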


In [5]:
#find the rows and columns
df.shape

(2000, 21)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

## Data Cleaning

In [7]:
#check for null values
df.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

## Data Exploration

In [8]:
price = df.price_range.value_counts()
price

3    500
2    500
1    500
0    500
Name: price_range, dtype: int64

In [9]:
df.blue.value_counts()

0    1010
1     990
Name: blue, dtype: int64

In [10]:
df.int_memory.value_counts()

27    47
14    45
16    45
2     42
57    42
      ..
25    24
38    23
62    21
4     20
59    18
Name: int_memory, Length: 63, dtype: int64

## Feature Engineering

In [11]:
Y = df['price_range']
X = df.drop('price_range', axis = 1)


## Model Development

In [12]:
#read the test data set
test_df = pd.read_csv('test.csv')
test_df.head(3)

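| | id | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | ... | pc | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1043 | 1 | 1.8 | 1 | 14 | 0 | 5 | 0.1 | 193 | ... | 16 | 226 | 1412 | 3476 | 12 | 7 | 2 | 0 | 1 | 0 |
| 1 | 2 | 841 | 1 | 0.5 | 1 | 4 | 1 | 61 | 0.8 | 191 | ... | 12 | 746 | 857 | 3895 | 6 | 0 | 7 | 1 | 0 | 0 |
| 2 | 3 | 1807 | 1 | 2.8 | 0 | 1 | 0 | 27 | 0.9 | 186 | ... | 4 | 1270 | 1366 | 2396 | 17 | 10 | 10 | 0 | 1 | 1 |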


In [13]:
#display the rows and columns in test dataset
test_df.shape

(1000, 21)

In [14]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1000 non-null   int64  
 1   battery_power  1000 non-null   int64  
 2   blue           1000 non-null   int64  
 3   clock_speed    1000 non-null   float64
 4   dual_sim       1000 non-null   int64  
 5   fc             1000 non-null   int64  
 6   four_g         1000 non-null   int64  
 7   int_memory     1000 non-null   int64  
 8   m_dep          1000 non-null   float64
 9   mobile_wt      1000 non-null   int64  
 10  n_cores        1000 non-null   int64  
 11  pc             1000 non-null   int64  
 12  px_height      1000 non-null   int64  
 13  px_width       1000 non-null   int64  
 14  ram            1000 non-null   int64  
 15  sc_h           1000 non-null   int64  
 16  sc_w           1000 non-null   int64  
 17  talk_time      1000 non-null   int64  
 18  three_g  

In [15]:
test_df = test_df.drop('id', axis = 1)
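
The `id` column exists only in `test.csv`, so it has to be removed before the test rows can be passed to a model fitted on the training features. A minimal sketch of that alignment check follows, using two tiny hypothetical frames that only mimic the column layout of the real files (the names `train_demo`, `test_demo`, `test_ids` and `columns_match` are illustrative). Keeping the ids aside is an optional extra that makes it possible to match predictions back to rows later.

```python
import pandas as pd

# Tiny hypothetical frames that mimic the column layout of train.csv and test.csv.
train_demo = pd.DataFrame(
    {"battery_power": [842, 1021], "ram": [2549, 2631], "price_range": [1, 2]}
)
test_demo = pd.DataFrame(
    {"id": [1, 2], "battery_power": [1043, 841], "ram": [3476, 3895]}
)

# Keep the ids aside so predictions can be matched back to rows later.
test_ids = test_demo["id"]
test_features = test_demo.drop(columns="id")
train_features = train_demo.drop(columns="price_range")

# The model expects the same feature columns, in the same order, at predict time.
columns_match = list(test_features.columns) == list(train_features.columns)
print(columns_match)
```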

In [16]:
#split the training data into a training subset and a hold-out subset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, shuffle = True, random_state = 1)

In [17]:
print(len(X_train))
print(len(X_test))

1600
400
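
The call above shuffles the rows but does not stratify them. Because the four price ranges are perfectly balanced (500 rows each), passing `stratify` to `train_test_split` would keep that balance in both subsets. A small sketch on made-up labels with the same class counts follows; the arrays `X_demo` and `y_demo` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in for the data: 2000 rows, 4 balanced classes (500 each).
y_demo = np.repeat([0, 1, 2, 3], 500)
X_demo = np.arange(2000).reshape(-1, 1)  # placeholder feature column

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=1, stratify=y_demo
)

# Count rows per class in each subset.
train_counts = np.bincount(y_tr)
valid_counts = np.bincount(y_va)
print(train_counts, valid_counts)
```

With these proportions each class should appear 400 times in the training subset and 100 times in the hold-out subset.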


### Random Forest Classifier

In [18]:
rn = RandomForestClassifier()
rn.fit(X_train, Y_train)

RandomForestClassifier()

In [19]:
predictions = rn.predict(X_test)

In [20]:
rn_score = accuracy_score(Y_test, predictions)
rn_score

0.855
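
For reference, `accuracy_score` reports the fraction of predictions that match the true labels. The sketch below recomputes that fraction by hand on a small made-up label list; the values in `y_true` and `y_pred` are illustrative and not taken from the data set, and the helper name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative labels only (not from train.csv): 4 of 5 predictions match.
y_true = [0, 1, 2, 3, 3]
y_pred = [0, 1, 2, 3, 1]
example_accuracy = accuracy(y_true, y_pred)
print(example_accuracy)
```

On the 400-row hold-out subset used here, a score of 0.855 corresponds to 342 correct predictions.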

### Decision Tree Classifier

In [21]:
dt = DecisionTreeClassifier()
dt.fit(X_train,Y_train)

DecisionTreeClassifier()

In [22]:
predictions = dt.predict(X_test)

In [23]:
dt_score = accuracy_score(Y_test, predictions)
dt_score

0.8475

### Logistic Regression

In [24]:
lr = LogisticRegression()
lr.fit(X_train,Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
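
The convergence warning above suggests increasing `max_iter` or scaling the features. A minimal sketch of the scaling route follows, assuming a `StandardScaler` placed in front of the classifier. It uses synthetic data from `make_classification` as a stand-in for `train.csv`, so the score it prints says nothing about the result on the mobile price data. The name `scaled_lr` and the value `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train.csv: 20 features, 4 classes (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=4, random_state=1
)
# Put one feature on a much larger scale, as happens in the phone data (e.g. ram vs. m_dep).
X_demo[:, 0] *= 1000.0

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# The scaler is fitted on the training subset only, then applied to the hold-out subset.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

scaled_lr_score = accuracy_score(y_va, scaled_lr.predict(X_va))
print(scaled_lr_score)
```

The same pipeline pattern may also be worth trying with the other linear models in this notebook (`LinearSVC`, `SGDClassifier`, `Perceptron`), which are likewise sensitive to feature scale.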

In [25]:
predictions = lr.predict(X_test)

In [26]:
lr_score = accuracy_score(Y_test, predictions)
lr_score

0.615

### KNeighbors Classifier

In [27]:
kn = KNeighborsClassifier(n_neighbors = 3)
kn.fit(X_train, Y_train)

KNeighborsClassifier(n_neighbors=3)

In [28]:
predictions = kn.predict(X_test)

In [29]:
kn_score = accuracy_score(Y_test, predictions)
kn_score

0.9075

### Support Vector Machine

In [30]:
svc = SVC()
svc.fit(X_train,Y_train)

SVC()

In [31]:
predictions = svc.predict(X_test)

In [32]:
svc_score = accuracy_score(Y_test, predictions)
svc_score

0.9425

### Linear SVC

In [33]:
linear_svc = LinearSVC()
linear_svc.fit(X_train,Y_train)



LinearSVC()

In [34]:
predictions = linear_svc.predict(X_test)

In [35]:
linearsvc_score = accuracy_score(Y_test, predictions)
linearsvc_score

0.53

### GaussianNB

In [36]:
gaussian = GaussianNB()
gaussian.fit(X_train,Y_train)

GaussianNB()

In [37]:
predictions = gaussian.predict(X_test)

In [38]:
gaussian_score = accuracy_score(Y_test, predictions)
gaussian_score

0.7575

### SGDClassifier

In [39]:
sgd = SGDClassifier()
sgd.fit(X_train,Y_train)

SGDClassifier()

In [40]:
predictions = sgd.predict(X_test)

In [41]:
sgd_score = accuracy_score(Y_test, predictions)
sgd_score

0.5775

### Perceptron

In [42]:
perceptron = Perceptron()
perceptron.fit(X_train,Y_train)

Perceptron()

In [43]:
predictions = perceptron.predict(X_test)

In [44]:
percep_score = accuracy_score(Y_test,predictions)
percep_score

0.4875

In [45]:
score = perceptron.score(X_test,Y_test)
score

0.4875

In [46]:
model = ['Random Forest Classifier', 'Decision Tree Classifier', 'Logistic Regression', 'kNeighbors Classifier', 'Support Vector Machine', 'Linear SVC',
        'GaussianNB', 'SGDClassifier', 'Perceptron']
score = [rn_score, dt_score, lr_score, kn_score, svc_score, linearsvc_score, gaussian_score, sgd_score, percep_score ]

summary = pd.DataFrame({'Model': model, 'Score': score})
summary.sort_values(by = 'Score',ascending = False)

| | Model | Score |
| --- | --- | --- |
| 4 | Support Vector Machine | 0.9425 |
| 3 | kNeighbors Classifier | 0.9075 |
| 0 | Random Forest Classifier | 0.855 |
| 1 | Decision Tree Classifier | 0.8475 |
| 6 | GaussianNB | 0.7575 |
| 2 | Logistic Regression | 0.615 |
| 7 | SGDClassifier | 0.5775 |
| 5 | Linear SVC | 0.53 |
| 8 | Perceptron | 0.4875 |
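
The scores above come from a single 80/20 split, so small differences between models may partly reflect which rows landed in the hold-out subset. One way to get a steadier comparison is k-fold cross-validation. A minimal sketch with `cross_val_score` follows, again on synthetic data rather than `train.csv`; the `candidates` dictionary and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the training features and labels (not the real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=4, random_state=1
)

candidates = {
    "Support Vector Machine": SVC(),
    "kNeighbors Classifier": KNeighborsClassifier(n_neighbors=3),
}

# Mean and standard deviation of the accuracy over 5 folds, per model.
cv_summary = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean_score, std_score) in cv_summary.items():
    print(f"{name}: {mean_score:.3f} +/- {std_score:.3f}")
```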


## Model Testing

Since the Support Vector Machine (`SVC`) achieved the highest accuracy score on the hold-out subset, we use this model to predict the price ranges for the `test.csv` data set.

In [47]:
svc = SVC()
svc.fit(X_train, Y_train)

SVC()

In [48]:
predictions = svc.predict(test_df)
predictions

array([3, 3, 2, 3, 1, 3, 3, 1, 3, 0, 3, 3, 0, 0, 2, 0, 2, 1, 3, 2, 1, 3,
       1, 1, 3, 0, 2, 0, 3, 0, 2, 0, 3, 0, 0, 1, 3, 1, 2, 1, 1, 2, 0, 0,
       0, 1, 0, 3, 1, 2, 1, 0, 3, 0, 3, 1, 3, 1, 1, 3, 3, 3, 0, 1, 0, 1,
       1, 3, 1, 2, 1, 2, 2, 3, 3, 0, 2, 0, 2, 3, 0, 3, 3, 0, 3, 0, 3, 1,
       3, 0, 1, 2, 2, 1, 2, 2, 0, 2, 1, 2, 1, 0, 0, 3, 0, 2, 0, 1, 2, 3,
       3, 3, 1, 3, 3, 3, 3, 2, 3, 0, 0, 3, 2, 1, 2, 0, 3, 2, 3, 1, 0, 2,
       1, 1, 3, 1, 1, 0, 3, 2, 1, 2, 1, 2, 2, 3, 3, 3, 2, 3, 2, 3, 1, 0,
       3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 1, 0, 3, 0, 0, 0, 2, 1, 0, 1,
       0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 2, 1, 0, 0, 0, 1, 0, 3, 1, 0, 2, 2,
       2, 3, 1, 2, 3, 2, 3, 2, 2, 1, 0, 0, 1, 2, 0, 2, 3, 3, 0, 2, 0, 3,
       2, 3, 3, 1, 0, 1, 0, 3, 0, 1, 0, 2, 2, 1, 2, 0, 3, 0, 3, 1, 2, 0,
       0, 2, 1, 3, 3, 3, 1, 1, 3, 0, 0, 2, 3, 3, 1, 3, 1, 1, 3, 2, 1, 2,
       3, 3, 3, 1, 0, 1, 2, 3, 1, 1, 3, 2, 0, 3, 0, 0, 2, 0, 0, 3, 2, 3,
       3, 2, 1, 3, 3, 2, 3, 1, 2, 1, 2, 0, 2, 3, 1,

## Conclusion

The accuracy score was calculated for nine different machine learning models. The SVC model had the best accuracy score, 0.9425, so it was used to predict the price ranges for the `test.csv` data set.