# Portfolio Part 3 - Analysis of Mobile Price Data (2024 S1)

In this Portfolio task, you will work on a new dataset named 'Mobile Price Data', it contains numerous details about mobile phone hardware, specifications, and prices. Your main task is to train classification models to predict **mobile phone prices** ('price range' in the dataset)and evaluate the strengths and weaknesses of these models.

Here's the explanation of each column:

|Column|Meaning|
|:-----:|:-----:|
|battery power|Total energy a battery can store in one time measured in mAh|
|blue|Has bluetooth or not|
|clock speed|speed at which microprocessor executes instructions|
|dual sim|Has dual sim support or not|
|fc|Front Camera mega pixels|
|four g|Has 4G or not|
|int memory|Internal Memory in Gigabytes|
|m dep|Mobile Depth in cm|
|mobile wt|Weight of mobile phone|
|n cores|Number of cores of processor|
|pc|Primary Camera mega pixels|
|px height|Pixel Resolution Height|
|px width|Pixel Resolution Width|
|ram|Random Access Memory in Mega Bytes|
|sc h|Screen Height of mobile in cm|
|sc w|Screen Width of mobile in cm|
|talk time|longest time that a single battery charge will last when you are|
|three g|Has 3G or not|
|touch screen|Has touch screen or not|
|wifi|Has wifi or not|
|price range|This is the target variable with value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost)|

Blue, dual sim, four g, three g, touch screen, and wifi are all binary attributes, 0 for not and 1 for yes.

Your high level goal in this notebook is to build and evaluate predictive models for 'price range' from other available features. More specifically, you need to **complete the following major steps**:

1. ***Explore the data*** and ***clean the data if necessary***. For example, remove abnormal instanaces and replace missing values.

2. ***Study the correlation*** between 'price range' with other features. And ***select the variables*** that you think are helpful for predicting the price range. We do not limit the number of variables.

3. ***Split the dataset*** (Trainging set : Test set = 8 : 2)

4. ***Train a logistic regression model*** to predict 'price range' based on the selected features (from the second step). ***Calculate the accuracy*** of your model. (You are required to report the accuracy from both training set and test set.) ***Explain your model and evaluate its performance*** (Is the model performing well? If yes, what factors might be contributing to the good performance of your model? If not, how can improvements be made?).

5. ***Train a KNN model*** to predict 'price range' based on the selected features (you can use the features selected from the second step and set K with an ad-hoc manner in this step. ***Calculate the accuracy*** of your model. (You are required to report the accuracy from both training set and test set.)

6. ***Tune the hyper-parameter K*** in KNN (Hints: GridsearchCV), ***visualize the results***, and ***explain*** how K influences the prediction performance.

  Hints for visualization: You can use line chart to visualize K and mean accuracy scores on test set.

Note 1: In this assignment, we no longer provide specific guidance and templates for each sub task. You should learn how to properly comment your notebook by yourself to make your notebook file readable.

Note 2: You will not being evaluated on the accuracy of the model but on the process that you use to generate it and your explanation.

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score

import seaborn as sns
import matplotlib.pylab as plt
%matplotlib inline

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree
from scipy.spatial.distance import pdist 
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
%matplotlib inline

Showing all the coloumns for Mobile_Price_Data.csv file

In [2]:
df=pd.read_csv("C:/Users/48189111/Downloads/Mobile_Price_Data.csv")
df

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7.0,0.6,188,2,...,20,756.0,2549.0,9,7,19,0.0,0,1,1
1,1021,1,0.5,1,0,1,53.0,0.7,136,3,...,905,1988.0,2631.0,17,3,7,1.0,1,0,2
2,563,1,0.5,1,2,1,41.0,0.9,145,5,...,1263,1716.0,2603.0,11,2,9,1.0,1,0,2
3,615,1,2.5,0,0,0,10.0,0.8,131,6,...,1216,1786.0,2769.0,16,8,11,1.0,0,0,2
4,1821,1,1.2,0,13,1,44.0,0.6,141,2,...,1208,1212.0,1411.0,8,2,15,1.0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,794,1,0.5,1,0,1,2.0,0.8,106,6,...,1222,1890.0,668.0,13,4,19,1.0,1,0,0
1996,1965,1,2.6,1,0,0,39.0,0.2,187,4,...,915,1965.0,2032.0,11,10,16,1.0,1,1,2
1997,1911,0,0.9,1,1,1,36.0,0.7,108,8,...,868,1632.0,3057.0,9,1,5,1.0,1,0,3
1998,1512,0,0.9,0,4,1,46.0,0.1,145,5,...,336,670.0,869.0,18,10,19,1.0,1,1,0


This is the summarize version like only showing first 10 rows.

In [26]:
df.head(10)

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7.0,0.6,188,2,...,20,756.0,2549.0,9,7,19,0.0,0,1,1
1,1021,1,0.5,1,0,1,53.0,0.7,136,3,...,905,1988.0,2631.0,17,3,7,1.0,1,0,2
2,563,1,0.5,1,2,1,41.0,0.9,145,5,...,1263,1716.0,2603.0,11,2,9,1.0,1,0,2
3,615,1,2.5,0,0,0,10.0,0.8,131,6,...,1216,1786.0,2769.0,16,8,11,1.0,0,0,2
4,1821,1,1.2,0,13,1,44.0,0.6,141,2,...,1208,1212.0,1411.0,8,2,15,1.0,1,0,1
5,1859,0,0.5,1,3,0,22.0,0.7,164,1,...,1004,1654.0,1067.0,17,1,10,1.0,0,0,1
6,1821,0,1.7,0,4,1,10.0,0.8,139,8,...,381,1018.0,3220.0,13,8,18,1.0,0,1,3
7,1954,0,0.5,1,0,0,24.0,0.8,187,4,...,512,,700.0,16,3,5,1.0,1,1,0
8,1445,1,0.5,0,0,0,53.0,0.7,174,7,...,386,836.0,1099.0,17,1,20,1.0,0,0,0
9,509,1,0.6,1,2,1,9.0,0.1,93,5,...,1137,1224.0,513.0,19,10,12,1.0,0,0,0


It is showing all the statistical values calculated by using Mobile_Price_Data.csv file

In [4]:
df.describe()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,1999.0,1999.0,2000.0,2000.0,...,2000.0,1999.0,1999.0,2000.0,2000.0,2000.0,1999.0,2000.0,2000.0,2000.0
mean,1238.5185,0.495,1.52225,0.5095,4.3095,0.5215,32.035018,0.501601,140.249,4.5205,...,645.108,1251.566783,2124.218609,12.3065,5.767,11.011,0.761381,0.503,0.507,1.5
std,439.418206,0.5001,0.816004,0.500035,4.341444,0.499662,18.142986,0.288411,35.399655,2.287837,...,443.780811,432.301505,1085.003435,4.213245,4.356398,5.463955,0.426346,0.500116,0.500076,1.118314
min,501.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,1.0,...,0.0,500.0,256.0,5.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,851.75,0.0,0.7,0.0,1.0,0.0,16.0,0.2,109.0,3.0,...,282.75,874.5,1207.0,9.0,2.0,6.0,1.0,0.0,0.0,0.75
50%,1226.0,0.0,1.5,1.0,3.0,1.0,32.0,0.5,141.0,4.0,...,564.0,1247.0,2147.0,12.0,5.0,11.0,1.0,1.0,1.0,1.5
75%,1615.25,1.0,2.2,1.0,7.0,1.0,48.0,0.8,170.0,7.0,...,947.25,1633.0,3065.0,16.0,9.0,16.0,1.0,1.0,1.0,2.25
max,1998.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,8.0,...,1960.0,1998.0,3998.0,19.0,18.0,20.0,1.0,1.0,1.0,3.0


In [28]:
#Explore the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     1999 non-null   float64
 7   m_dep          1999 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       1999 non-null   float64
 13  ram            1999 non-null   float64
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        1999 non-null   float64
 18  touch_sc

In [59]:
#Remove the missing dataset
print('Sum of null data in each coloumn')
print(df.isnull().sum())

print('\n\nLength before cleaning data set is',len(df))
df.dropna()
df.loc[df['blue']!= 0,'blue'] = None
df.loc[df['dual_sim']!= 0,'dual_sim'] = None
df.loc[df['fc']!= 0,'fc'] = None
df.loc[df['px_height	']!= 0,'px_height	'] = None
df.loc[df['sc_w']!= 0,'sc_w'] = None
df.loc[df['three_g	']!= 0,'three_g	'] = None
df.loc[df['touch_screen	']!= 0,'touch_screen	'] = None
df.loc[df['wifi	']!= 0,'wifi	'] = None
df.loc[df['price_range	']!= 0,'price_range	'] = None
# df.drop(columns=[])

print('Sum of null data in each coloumn')
print(df.isnull().sum())

print("Length after cleaning data set is",len(df))

Sum of null data in each coloumn
battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       1
m_dep            1
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         1
ram              1
sc_h             0
sc_w             0
talk_time        0
three_g          1
touch_screen     0
wifi             0
price_range      0
dtype: int64


Length before cleaning data set is 2000
Sum of null data in each coloumn
battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       1
m_dep            1
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         1
ram              1
sc_h             0
sc_w             0
talk_time        0
three_g          1
touch_screen     0
wifi             0
price_range      0
dtype: int64
Length after cleaning data set is 2000


In [39]:
#Data explored after cleaning dataset

df.describe()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,1999.0,1999.0,2000.0,2000.0,...,2000.0,1999.0,1999.0,2000.0,2000.0,2000.0,1999.0,2000.0,2000.0,2000.0
mean,1238.5185,0.495,1.52225,0.5095,4.3095,0.5215,32.035018,0.501601,140.249,4.5205,...,645.108,1251.566783,2124.218609,12.3065,5.767,11.011,0.761381,0.503,0.507,1.5
std,439.418206,0.5001,0.816004,0.500035,4.341444,0.499662,18.142986,0.288411,35.399655,2.287837,...,443.780811,432.301505,1085.003435,4.213245,4.356398,5.463955,0.426346,0.500116,0.500076,1.118314
min,501.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,1.0,...,0.0,500.0,256.0,5.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,851.75,0.0,0.7,0.0,1.0,0.0,16.0,0.2,109.0,3.0,...,282.75,874.5,1207.0,9.0,2.0,6.0,1.0,0.0,0.0,0.75
50%,1226.0,0.0,1.5,1.0,3.0,1.0,32.0,0.5,141.0,4.0,...,564.0,1247.0,2147.0,12.0,5.0,11.0,1.0,1.0,1.0,1.5
75%,1615.25,1.0,2.2,1.0,7.0,1.0,48.0,0.8,170.0,7.0,...,947.25,1633.0,3065.0,16.0,9.0,16.0,1.0,1.0,1.0,2.25
max,1998.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,8.0,...,1960.0,1998.0,3998.0,19.0,18.0,20.0,1.0,1.0,1.0,3.0


In [41]:
#Correlation PArt
print(df.corr())

               battery_power      blue  clock_speed  dual_sim        fc  \
battery_power       1.000000  0.011252     0.011482 -0.041847  0.033334   
blue                0.011252  1.000000     0.021419  0.035198  0.003593   
clock_speed         0.011482  0.021419     1.000000 -0.001315 -0.000434   
dual_sim           -0.041847  0.035198    -0.001315  1.000000 -0.029123   
fc                  0.033334  0.003593    -0.000434 -0.029123  1.000000   
four_g              0.015665  0.013443    -0.043073  0.003187 -0.016560   
int_memory         -0.004701  0.041831     0.006177 -0.015044 -0.030751   
m_dep               0.034653  0.004564    -0.014101 -0.022661 -0.002711   
mobile_wt           0.001844 -0.008605     0.012350 -0.008979  0.023618   
n_cores            -0.029727  0.036161    -0.005724 -0.024658 -0.013356   
pc                  0.031441 -0.009952    -0.005245 -0.017143  0.644595   
px_height           0.014901 -0.006872    -0.014523 -0.020875 -0.009990   
px_width           -0.008

A quick search will reveal many different ways to do linear regression in Python. We will use the sklearn LinearRegression function. The sklearn module has many standard machine learning methods so it is a good one to get used to working with.

Linear Regression involves fitting a model of the form:


Where 
 is the (numerical) variable we're trying to predict, 
 is the vector of input variables, 
 is the array of model coefficients and 
 is the intercept. In the simple case when X is one-dimensional (one input variable) then this is the forumula for a straight line with gradient 
.

We will first try to predict price_range from battery_power in the df data. You should look at the plot of these two variables to see that they are roughly correlated. Here is the code using slkearn to do this. We first create a linear model, then select the data we will use to train it - note that X (the input) is a one-column pandas dataframe while y (the output) is a Series. The fit method is used to train the model. The result is a set of coefficients (in this case just one) and an intercept.

In [10]:
reg = linear_model.LinearRegression()
X = df[['price_range']]
y = df['battery_power']
reg.fit(X, y)
print("y = x *", reg.coef_, "+", reg.intercept_)

y = x * [78.8698] + 1120.2138


In [11]:
reg.predict(X[:3])

array([1199.0836, 1277.9534, 1277.9534])

In [12]:
reg.coef_

array([78.8698])

In [13]:
reg.intercept_

1120.2138

In [14]:
reg = linear_model.Ridge(alpha=.5)
X = df[['price_range']]
y = df['battery_power']
reg.fit(X, y)

What we have done so far is to train and test the model on the same data. This is not good practice as we have no idea how good the model would be on new data. Better practice is to split the data into two sets - training and testing data. We build a model on the training data and test it on the test data.

Sklearn provides a function train_test_split to do this common task. It returns two arrays of data. Here we ask for 20% of the data in the test set.

In [42]:
train, test = train_test_split(df, test_size=0.2, random_state=142)
print('Train Shape: ',train.shape)
print('Test Shape: ',test.shape)

Train Shape:  (1600, 21)
Test Shape:  (400, 21)


In [18]:
train.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
740,1004,1,2.9,1,0,0,35.0,0.2,141,6,...,901,1162.0,3772.0,17,8,18,0.0,1,1,3
1624,555,1,3.0,1,5,1,38.0,0.8,193,2,...,214,1970.0,1686.0,8,1,8,1.0,0,1,1
56,823,1,2.7,1,13,0,60.0,0.5,148,8,...,822,1449.0,905.0,14,11,17,1.0,1,1,0
1593,1864,0,2.2,0,0,1,7.0,0.1,142,1,...,225,1545.0,2258.0,10,1,10,1.0,0,0,2
94,1322,0,1.7,1,6,0,7.0,0.8,140,3,...,177,1990.0,1418.0,19,17,12,0.0,1,0,1


In [19]:
train.price_range

740     3
1624    1
56      0
1593    2
94      1
       ..
1292    0
511     3
411     3
1221    2
277     1
Name: price_range, Length: 1600, dtype: int64

 We can measure the mean squared error which is based on the difference between the real and predicted values of price_range (mean of the squared differences). Another measure is 
 which measures the amount of variance in the data that is explained by the model. Smaller MSE is better. 
 close to 1 is better.

In [60]:
reg = linear_model.LinearRegression()
X_train = train[['ram', 'battery_power','px_height','int_memory']]
y_train = train['price_range']

X_test = test[['ram', 'battery_power','px_height','int_memory']]
y_test = test['price_range']

reg.fit(X_train, y_train)
print("y = x *", reg.coef_, "+", reg.intercept_)

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr=LogisticRegression().fit(X_train,y_train)

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [21]:
predicted = reg.predict(X_test)
mse = ((np.array(y_test)-predicted)**2).sum()/len(y_test)
r2 = r2_score(y_test, predicted)
print("MSE:", mse)
print("Root MSE:",np.sqrt(mse))
print("R Squared:", r2)

MSE: 0.6941193550966651
Root MSE: 0.8331382568917749
R Squared: -0.007026728185320685


In [None]:
print("Accuracy of training set",accuracy_score())
print("Accuracy of testing set",accuracy_score())

In [54]:
np.all(np.isfinite(df))
np.any(np.isnan(df))

True

the value for mse is Smaller so it makes MSE better and R squared value is close to 1 is better, so overall it makes the graph overall performance higher
.

The workshop task this week involves unsupervised learning - an exercise in clustering. We'll use a the Mobile Price Data dataset to walk through the process of kmeans and hierarchical clustering. We'll then introduce a text dataset for you to experiment with text analysis.

In [55]:
km=KMeans(n_clusters=2)
km.fit(df)

ValueError: Input X contains NaN.
KMeans does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values