# TCS RIO 125 Internship Project

This notebook is part of my TCS iON Internship Project. The aim of the project is "Rank Features of a Smartphone - Build a Python Application to Classify and Rank Dataset". In the previous notebook we trained a model to predict the `price_range` from the features of the mobile phones. In this notebook we create an algorithm for ranking the mobiles.

## Importing Necessary Libraries

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

## Reading the Dataset

This dataset is the output of the tasks performed in the previous notebook.

In [2]:
mobile_df = pd.read_csv("mobile.csv")
mobile_df.head()

Unnamed: 0,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,ram,talk_time,three_g,touch_screen,wifi,px_area,sc_area,price_range
0,1,842,0,2.2,0,1,0,7,0.6,188,2,2,2549,19,0,0,1,15120,63,1
1,2,1021,1,0.5,1,0,1,53,0.7,136,3,6,2631,7,1,1,0,1799140,51,2
2,3,563,1,0.5,1,2,1,41,0.9,145,5,6,2603,9,1,1,0,2167308,22,2
3,4,615,1,2.5,0,0,0,10,0.8,131,6,9,2769,11,1,0,0,2171776,128,2
4,5,1821,1,1.2,0,13,1,44,0.6,141,2,14,1411,15,1,1,0,1464096,16,1


In [3]:
mobile_df.shape

(3000, 20)

In [4]:
mobile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             3000 non-null   int64  
 1   battery_power  3000 non-null   int64  
 2   blue           3000 non-null   int64  
 3   clock_speed    3000 non-null   float64
 4   dual_sim       3000 non-null   int64  
 5   fc             3000 non-null   int64  
 6   four_g         3000 non-null   int64  
 7   int_memory     3000 non-null   int64  
 8   m_dep          3000 non-null   float64
 9   mobile_wt      3000 non-null   int64  
 10  n_cores        3000 non-null   int64  
 11  pc             3000 non-null   int64  
 12  ram            3000 non-null   int64  
 13  talk_time      3000 non-null   int64  
 14  three_g        3000 non-null   int64  
 15  touch_screen   3000 non-null   int64  
 16  wifi           3000 non-null   int64  
 17  px_area        3000 non-null   int64  
 18  sc_area 

In [5]:
mobile_df.columns

Index(['id', 'battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc',
       'four_g', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'ram',
       'talk_time', 'three_g', 'touch_screen', 'wifi', 'px_area', 'sc_area',
       'price_range'],
      dtype='object')

## Bifurcating Dataset into Software and Hardware Features

In [6]:
hardware_features = ['id', 'battery_power', 'clock_speed', 'dual_sim', 'fc', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores',
                     'pc', 'ram', 'talk_time', 'px_area', 'sc_area']
software_features = ['id', 'blue', 'four_g', 'three_g', 'touch_screen', 'wifi']

In [7]:
hardware_df = mobile_df[hardware_features]
hardware_df.head()

Unnamed: 0,id,battery_power,clock_speed,dual_sim,fc,int_memory,m_dep,mobile_wt,n_cores,pc,ram,talk_time,px_area,sc_area
0,1,842,2.2,0,1,7,0.6,188,2,2,2549,19,15120,63
1,2,1021,0.5,1,0,53,0.7,136,3,6,2631,7,1799140,51
2,3,563,0.5,1,2,41,0.9,145,5,6,2603,9,2167308,22
3,4,615,2.5,0,0,10,0.8,131,6,9,2769,11,2171776,128
4,5,1821,1.2,0,13,44,0.6,141,2,14,1411,15,1464096,16


In [8]:
software_df = mobile_df[software_features]
software_df.head()

Unnamed: 0,id,blue,four_g,three_g,touch_screen,wifi
0,1,0,0,0,0,1
1,2,1,1,1,1,0
2,3,1,1,1,1,0
3,4,1,0,1,0,0
4,5,1,1,1,1,0


## Merging the Dataset

In [9]:
merged_df = pd.merge(hardware_df, software_df, on='id', how='inner')
merged_df.head()

Unnamed: 0,id,battery_power,clock_speed,dual_sim,fc,int_memory,m_dep,mobile_wt,n_cores,pc,ram,talk_time,px_area,sc_area,blue,four_g,three_g,touch_screen,wifi
0,1,842,2.2,0,1,7,0.6,188,2,2,2549,19,15120,63,0,0,0,0,1
1,2,1021,0.5,1,0,53,0.7,136,3,6,2631,7,1799140,51,1,1,1,1,0
2,3,563,0.5,1,2,41,0.9,145,5,6,2603,9,2167308,22,1,1,1,1,0
3,4,615,2.5,0,0,10,0.8,131,6,9,2769,11,2171776,128,1,0,1,0,0
4,5,1821,1.2,0,13,44,0.6,141,2,14,1411,15,1464096,16,1,1,1,1,0


## TOPSIS method to Rank the Dataset

In [10]:
mobile_df.head()

Unnamed: 0,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,ram,talk_time,three_g,touch_screen,wifi,px_area,sc_area,price_range
0,1,842,0,2.2,0,1,0,7,0.6,188,2,2,2549,19,0,0,1,15120,63,1
1,2,1021,1,0.5,1,0,1,53,0.7,136,3,6,2631,7,1,1,0,1799140,51,2
2,3,563,1,0.5,1,2,1,41,0.9,145,5,6,2603,9,1,1,0,2167308,22,2
3,4,615,1,2.5,0,0,0,10,0.8,131,6,9,2769,11,1,0,0,2171776,128,2
4,5,1821,1,1.2,0,13,1,44,0.6,141,2,14,1411,15,1,1,0,1464096,16,1


Some rows have `price_range` equal to 0. A zero stays zero after normalization and weighting, so we shift the range from `0-3` to `1-4` before calculating the rankings.

In [11]:
print("Before:")
print(mobile_df['price_range'].value_counts())

mobile_df['price_range'] = mobile_df['price_range'] + 1

print("\nAfter:")
print(mobile_df['price_range'].value_counts())

Before:
3    759
0    756
2    750
1    735
Name: price_range, dtype: int64

After:
4    759
1    756
3    750
2    735
Name: price_range, dtype: int64


Now we divide our features into `beneficial` and `non-beneficial`. `Beneficial` features are features where higher values are preferred and `Non-beneficial` features are features where lower values are preferred.

In [12]:
benf = ['battery_power', 'clock_speed', 'fc', 'int_memory',
        'n_cores', 'pc', 'ram', 'talk_time', 'px_area', 'sc_area']
non_benf = ['m_dep', 'mobile_wt', 'price_range']

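Before we begin, we normalize the data by dividing each column by its Euclidean norm, so that features measured on different scales become comparable.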

In [13]:
normalized_df = pd.DataFrame()

normalized_benf = mobile_df[benf].apply(
    lambda x: x / np.linalg.norm(x), axis=0)
normalized_df[benf] = normalized_benf

normalized_non_benf = mobile_df[non_benf].apply(
    lambda x: x / np.linalg.norm(x), axis=0)
normalized_df[non_benf] = normalized_non_benf

normalized_df.head()

Unnamed: 0,battery_power,clock_speed,fc,int_memory,n_cores,pc,ram,talk_time,px_area,sc_area,m_dep,mobile_wt,price_range
0,0.011677,0.023155,0.002938,0.003427,0.007289,0.00313,0.019473,0.02816,0.000228,0.010569,0.01882,0.023776,0.013307
1,0.014159,0.005263,0.0,0.025945,0.010933,0.009389,0.020099,0.010375,0.027075,0.008556,0.021957,0.0172,0.01996
2,0.007808,0.005263,0.005877,0.020071,0.018222,0.009389,0.019885,0.013339,0.032616,0.003691,0.02823,0.018338,0.01996
3,0.008529,0.026313,0.0,0.004895,0.021866,0.014083,0.021154,0.016303,0.032683,0.021474,0.025093,0.016568,0.01996
4,0.025254,0.01263,0.038199,0.021539,0.007289,0.021908,0.010779,0.022232,0.022033,0.002684,0.01882,0.017832,0.013307
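
To make the normalization step concrete, here is a minimal sketch on a toy frame. The values and the names `toy_df`, `toy_normalized` and `column_norms` are made up for illustration and are not taken from `mobile.csv`. After dividing by the Euclidean norm, every column should have length 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from mobile.csv
toy_df = pd.DataFrame({
    "ram": [3.0, 4.0],
    "mobile_wt": [6.0, 8.0],
})

# Same operation as in the notebook cell: divide each column by its Euclidean norm
toy_normalized = toy_df.apply(lambda x: x / np.linalg.norm(x), axis=0)

# The Euclidean norm of every normalized column is 1
column_norms = np.sqrt((toy_normalized ** 2).sum(axis=0))

print(toy_normalized)
print(column_norms)
```

Both toy columns end up with the same normalized values even though `mobile_wt` is twice as large as `ram`, which is the point of the step: the original unit of a feature no longer influences the ranking.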


Let us rank the dataset with RAM and price as the most important criteria, and battery power as the next most important.

In [14]:
weights = {
    'battery_power': 7,
    'clock_speed': 1,
    'fc': 1,
    'int_memory': 1,
    'n_cores': 1,
    'pc': 1,
    'ram': 10,
    'talk_time': 1,
    'px_area': 1,
    'sc_area': 1,
    'm_dep': 1,
    'mobile_wt': 1,
    'price_range': 10
}

total_weight = sum(weights.values())
normalized_weights = {col: weight /
                      total_weight for col, weight in weights.items()}
print(normalized_weights)

{'battery_power': 0.1891891891891892, 'clock_speed': 0.02702702702702703, 'fc': 0.02702702702702703, 'int_memory': 0.02702702702702703, 'n_cores': 0.02702702702702703, 'pc': 0.02702702702702703, 'ram': 0.2702702702702703, 'talk_time': 0.02702702702702703, 'px_area': 0.02702702702702703, 'sc_area': 0.02702702702702703, 'm_dep': 0.02702702702702703, 'mobile_wt': 0.02702702702702703, 'price_range': 0.2702702702702703}
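
The weight scaling can also be wrapped in a small helper that checks its input. This is only a sketch: `normalize_weights` and its `criteria` argument are hypothetical names, not part of the notebook, and the three-criterion example uses the raw weights 10, 10 and 7 from above purely for illustration.

```python
import math


def normalize_weights(weights, criteria):
    """Scale raw importance weights so that they sum to 1.

    Hypothetical helper for illustration; raises if a criterion has no weight.
    """
    missing = [c for c in criteria if c not in weights]
    if missing:
        raise ValueError(f"missing weights for: {missing}")
    total = sum(weights[c] for c in criteria)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {c: weights[c] / total for c in criteria}


raw = {"ram": 10, "price_range": 10, "battery_power": 7}
scaled = normalize_weights(raw, ["ram", "price_range", "battery_power"])
print(scaled)
print(math.isclose(sum(scaled.values()), 1.0))
```

Only the ratios between the raw weights matter: multiplying every raw weight by the same factor gives the same scaled weights.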


We multiply the weights with the normalized values in this step.

In [15]:
weighted_df = normalized_df.copy()

for col in normalized_weights:
    weighted_df[col] = weighted_df[col] * normalized_weights[col]

weighted_df.head()

Unnamed: 0,battery_power,clock_speed,fc,int_memory,n_cores,pc,ram,talk_time,px_area,sc_area,m_dep,mobile_wt,price_range
0,0.002209,0.000626,7.9e-05,9.3e-05,0.000197,8.5e-05,0.005263,0.000761,6e-06,0.000286,0.000509,0.000643,0.003596
1,0.002679,0.000142,0.0,0.000701,0.000295,0.000254,0.005432,0.00028,0.000732,0.000231,0.000593,0.000465,0.005395
2,0.001477,0.000142,0.000159,0.000542,0.000492,0.000254,0.005374,0.000361,0.000882,0.0001,0.000763,0.000496,0.005395
3,0.001614,0.000711,0.0,0.000132,0.000591,0.000381,0.005717,0.000441,0.000883,0.00058,0.000678,0.000448,0.005395
4,0.004778,0.000341,0.001032,0.000582,0.000197,0.000592,0.002913,0.000601,0.000595,7.3e-05,0.000509,0.000482,0.003596


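Now we calculate, for each feature, the squared difference between the datapoint and the best possible value of that feature. The square root of the sum of these squared differences is the Euclidean distance of the datapoint from the best possible solution.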

In [16]:
best_df = pd.DataFrame()

for col in weighted_df.columns:
    if col in benf:
        best_df[col] = (weighted_df[col] - max(weighted_df[col]))**2
    else:
        best_df[col] = (weighted_df[col] - min(weighted_df[col]))**2

best_df['score'] = np.sqrt(best_df.sum(axis=1))
best_df.head()

Unnamed: 0,battery_power,clock_speed,fc,int_memory,n_cores,pc,ram,talk_time,px_area,sc_area,m_dep,mobile_wt,price_range,score
0,9.21514e-06,5.178893e-08,2.043457e-06,5.687275e-07,3.492472e-07,5.795245e-07,9e-06,1.604578e-09,2.479177e-06,1.600326e-06,1.796685e-07,1.362749e-07,3e-06,0.005421
1,6.58435e-06,5.057513e-07,2.276815e-06,2.118068e-08,2.425328e-07,3.505765e-07,8e-06,2.711737e-07,7.206649e-07,1.740949e-06,2.587226e-07,3.663907e-08,1.3e-05,0.005823
2,1.419528e-05,5.057513e-07,1.822713e-06,9.259983e-08,8.73118e-08,3.505765e-07,8e-06,1.94154e-07,4.888438e-07,2.105233e-06,4.599513e-07,4.936227e-08,1.3e-05,0.006448
3,1.318583e-05,2.023005e-08,2.276815e-06,5.104369e-07,3.880524e-08,2.164273e-07,6e-06,1.299708e-07,4.863059e-07,9.415159e-07,3.521502e-07,3.038847e-08,1.3e-05,0.006129
4,2.181098e-07,2.621815e-07,2.270508e-07,7.001877e-08,3.492472e-07,6.439161e-08,2.9e-05,4.011446e-08,9.70606e-07,2.184919e-06,1.796685e-07,4.347385e-08,3e-06,0.006031


Here we do the same but for the worst possible solution.

In [17]:
worst_df = pd.DataFrame()

for col in weighted_df.columns:
    if col in benf:
        worst_df[col] = (weighted_df[col] - min(weighted_df[col]))**2
    else:
        worst_df[col] = (weighted_df[col] - max(weighted_df[col]))**2

worst_df['score'] = np.sqrt(worst_df.sum(axis=1))
worst_df.head()

Unnamed: 0,battery_power,clock_speed,fc,int_memory,n_cores,pc,ram,talk_time,px_area,sc_area,m_dep,mobile_wt,price_range,score
0,8.051697e-07,2.338594e-07,6.306966e-09,4.376173e-09,9.701311e-09,7.154623e-09,2.2e-05,4.637231e-07,3.782013e-11,8.159832e-08,1.149878e-07,1.682406e-09,1.3e-05,0.006089
1,1.868576e-06,0.0,0.0,4.552971e-07,3.880524e-08,6.439161e-08,2.4e-05,4.011446e-08,5.354875e-07,5.347373e-08,6.468065e-08,4.785512e-08,3e-06,0.005518
2,2.732224e-08,0.0,2.522786e-08,2.662464e-07,1.55221e-07,6.439161e-08,2.3e-05,7.862434e-08,7.770709e-07,9.950513e-09,7.186739e-09,3.534222e-08,3e-06,0.005307
3,9.103972e-08,3.236808e-07,0.0,1.1203e-08,2.425328e-07,1.448811e-07,2.7e-05,1.299708e-07,7.802782e-07,3.368372e-07,2.874696e-08,5.562456e-08,3e-06,0.005683
4,1.201271e-05,3.96509e-08,1.065877e-06,3.087828e-07,9.701311e-09,3.505765e-07,6e-06,2.711737e-07,3.546162e-07,5.263081e-09,1.149878e-07,4.066984e-08,1.3e-05,0.005762
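
The two distance steps can be read as measuring how far each phone is from two imaginary phones: one that has the best value of every feature and one that has the worst. The sketch below makes those two reference points explicit on made-up weighted values. The function name `ideal_distances` and the toy numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def ideal_distances(weighted, beneficial):
    """Return the ideal points and each row's Euclidean distance to them."""
    is_benf = weighted.columns.isin(beneficial)
    col_max = weighted.max(axis=0)
    col_min = weighted.min(axis=0)

    # Best value: column maximum for beneficial features, minimum otherwise
    ideal_best = col_max.where(is_benf, col_min)
    # Worst value: the opposite choice
    ideal_worst = col_min.where(is_benf, col_max)

    d_best = np.sqrt(((weighted - ideal_best) ** 2).sum(axis=1))
    d_worst = np.sqrt(((weighted - ideal_worst) ** 2).sum(axis=1))
    return ideal_best, ideal_worst, d_best, d_worst


# Hypothetical weighted values for two phones
toy_weighted = pd.DataFrame({
    "ram": [0.3, 0.0],          # beneficial: higher is better
    "price_range": [0.0, 0.4],  # non-beneficial: lower is better
})

ideal_best, ideal_worst, d_best, d_worst = ideal_distances(toy_weighted, ["ram"])
print(ideal_best)
print(ideal_worst)
print(d_best)
print(d_worst)
```

In this toy case the first phone coincides with the ideal best point, so its distance to the best solution is 0 and its distance to the worst solution is 0.5.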


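The TOPSIS method combines the distance from the best solution and the distance from the worst solution into the final score: the distance from the worst solution divided by the sum of both distances. The score lies between 0 and 1, and a higher score means a better rank.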

In [18]:
weighted_df['perf_score'] = worst_df['score'] / \
    (best_df['score'] + worst_df['score'])
weighted_df.sort_values(by='perf_score', ascending=False)

Unnamed: 0,battery_power,clock_speed,fc,int_memory,n_cores,pc,ram,talk_time,px_area,sc_area,m_dep,mobile_wt,price_range,perf_score
2819,0.005226,0.000370,0.000635,0.000569,0.000492,0.000634,0.004499,0.000801,0.000000,0.000190,0.000848,0.000523,0.001798,0.636168
2968,0.004295,0.000171,0.000953,0.000754,0.000788,0.000719,0.006873,0.000240,0.000003,0.000449,0.000424,0.000379,0.005395,0.620523
2156,0.005072,0.000142,0.000476,0.000304,0.000689,0.000296,0.008203,0.000761,0.000966,0.000589,0.000085,0.000349,0.007193,0.604505
2214,0.003615,0.000711,0.000318,0.000688,0.000295,0.000296,0.007284,0.000280,0.000003,0.000023,0.000424,0.000629,0.005395,0.603674
770,0.005014,0.000569,0.000556,0.000847,0.000788,0.000338,0.007995,0.000441,0.001430,0.000326,0.000509,0.000557,0.007193,0.601233
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1267,0.002351,0.000284,0.000238,0.000053,0.000098,0.000169,0.002189,0.000160,0.000824,0.000544,0.000254,0.000379,0.003596,0.368076
1706,0.002440,0.000597,0.000000,0.000714,0.000098,0.000550,0.003599,0.000361,0.001125,0.000032,0.000763,0.000273,0.005395,0.367778
908,0.003290,0.000484,0.000000,0.000595,0.000295,0.000127,0.001251,0.000080,0.001250,0.000063,0.000339,0.000499,0.003596,0.360077
345,0.001716,0.000370,0.000000,0.000688,0.000394,0.000169,0.002139,0.000521,0.001324,0.000308,0.000593,0.000670,0.003596,0.359649
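
The behaviour of the score at its extremes is easy to check on hand-picked distances. The following sketch uses three hypothetical phones (`phone_a`, `phone_b`, `phone_c`) with made-up distances, and adds an explicit rank column with `Series.rank`, which the notebook does not do.

```python
import pandas as pd

phones = ["phone_a", "phone_b", "phone_c"]

# Made-up distances to the best and worst solutions
d_best = pd.Series([0.0, 0.3, 0.5], index=phones)
d_worst = pd.Series([0.5, 0.3, 0.0], index=phones)

# Same formula as in the notebook cell
perf_score = d_worst / (d_best + d_worst)

# Rank 1 is the phone closest to the best solution
rank = perf_score.rank(ascending=False, method="min").astype(int)

result = pd.DataFrame({"perf_score": perf_score, "rank": rank}).sort_values("rank")
print(result)
```

A phone that sits exactly on the best solution scores 1, a phone that sits on the worst solution scores 0, and a phone equally far from both scores 0.5. If both distances are 0, which can only happen when all rows are identical, the division yields `NaN`.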


In [19]:
mobile_df.iloc[weighted_df.sort_values(
    by='perf_score', ascending=False).index, :].head(10).set_index('id')

id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,ram,talk_time,three_g,touch_screen,wifi,px_area,sc_area,price_range
2820,1992,1,1.3,1,8,1,43,1.0,153,5,15,2179,20,1,1,0,0,42,1
2969,1637,1,0.6,1,12,1,57,0.5,111,8,17,3329,6,1,1,0,6580,99,3
2157,1933,1,0.5,1,6,0,23,0.1,102,7,7,3973,19,0,0,1,2373956,130,4
2215,1378,1,2.5,1,4,1,52,0.5,184,3,7,3528,7,1,0,0,8333,5,3
771,1911,1,2.0,0,7,1,64,0.6,163,8,8,3872,11,1,1,0,3514610,72,4
2983,1998,1,0.5,0,6,0,47,0.2,182,4,7,3918,9,1,0,1,2944711,36,4
659,1926,1,1.1,0,13,1,50,0.2,179,6,17,3809,17,1,1,0,371000,204,4
393,1860,1,2.3,0,15,0,23,0.6,162,4,16,3918,8,0,1,0,2051172,18,4
2910,1053,0,2.1,0,4,0,12,0.1,95,7,7,3701,17,0,1,0,6534,13,3
2343,1850,0,2.2,1,9,0,41,0.8,133,8,13,3032,10,1,1,1,13986,8,3


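Most of the top-ranked phones have high RAM and high battery power, which matches the weights we chose. Several of them still fall in the highest price range, because the score trades price off against RAM and battery power instead of minimizing price alone. Looking at the data, the algorithm seems to work well.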

## Creating a function of the above algorithm to rank dataset

In this step we create a function to calculate the rankings so that we don't have to write the entire code every time we want to rank the dataset.

In [20]:
def rank_mobiles(weights, df=normalized_df):
    # Scale the weights so that they sum to 1
    total_weight = sum(weights.values())
    normalized_weights = {col: weight / total_weight for col, weight in weights.items()}

    # Apply the weights to the normalized data passed in as `df`
    weighted_df = df.copy()
    for col in normalized_weights:
        weighted_df[col] = weighted_df[col] * normalized_weights[col]

    # Euclidean distance of every phone from the best possible solution
    best_df = pd.DataFrame()
    for col in weighted_df.columns:
        if col in benf:
            best_df[col] = (weighted_df[col] - max(weighted_df[col]))**2
        else:
            best_df[col] = (weighted_df[col] - min(weighted_df[col]))**2
    best_df['score'] = np.sqrt(best_df.sum(axis=1))

    # Euclidean distance of every phone from the worst possible solution
    worst_df = pd.DataFrame()
    for col in weighted_df.columns:
        if col in benf:
            worst_df[col] = (weighted_df[col] - min(weighted_df[col]))**2
        else:
            worst_df[col] = (weighted_df[col] - max(weighted_df[col]))**2
    worst_df['score'] = np.sqrt(worst_df.sum(axis=1))

    # Final TOPSIS score; higher is better
    weighted_df['perf_score'] = worst_df['score'] / (best_df['score'] + worst_df['score'])

    # Look up the ten best-scoring phones in the original dataset
    top_index = weighted_df.sort_values(by='perf_score', ascending=False).index
    return mobile_df.loc[top_index].head(10).set_index('id')
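
The function above reads `benf` and `mobile_df` from the notebook's global scope. For reuse outside the notebook, for example in the app, a variant that receives everything as arguments may be easier to test. The following is a sketch under that assumption, not the notebook's function: `topsis_rank`, its parameters, and the three toy phones are hypothetical.

```python
import numpy as np
import pandas as pd


def topsis_rank(data, weights, beneficial, top_n=10):
    """Rank the rows of `data` with TOPSIS (illustrative sketch).

    data: DataFrame with one numeric column per key in `weights`
    weights: dict mapping column name to raw importance
    beneficial: columns where higher is better; all other weighted
        columns are treated as lower-is-better
    """
    cols = list(weights)
    w = pd.Series(weights, dtype=float)
    w = w / w.sum()

    # Vector-normalize each criterion, then apply the scaled weights
    values = data[cols].astype(float)
    weighted = values / np.sqrt((values ** 2).sum(axis=0)) * w

    # Best and worst value of every criterion
    is_benf = weighted.columns.isin(beneficial)
    best = weighted.max().where(is_benf, weighted.min())
    worst = weighted.min().where(is_benf, weighted.max())

    d_best = np.sqrt(((weighted - best) ** 2).sum(axis=1))
    d_worst = np.sqrt(((weighted - worst) ** 2).sum(axis=1))

    ranked = data.copy()
    ranked["perf_score"] = d_worst / (d_best + d_worst)
    return ranked.sort_values("perf_score", ascending=False).head(top_n)


# Toy check: phone 1 is best on every criterion, phone 3 is worst on every one
toy_phones = pd.DataFrame({
    "id": [1, 2, 3],
    "ram": [4000, 2000, 1000],
    "battery_power": [1800, 1500, 900],
    "price_range": [1, 2, 4],
})
toy_weights = {"ram": 10, "battery_power": 7, "price_range": 10}

top = topsis_rank(toy_phones, toy_weights, ["ram", "battery_power"], top_n=3)
print(top)
```

Because phone 1 dominates and phone 3 is dominated, any sensible weights must rank them first and last, which makes this a convenient sanity check. A criterion whose values are all 0 would have a norm of 0 and produce `NaN` scores, so such columns would need to be handled before calling the function.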

Now let us rank the phones with battery power and camera quality (`fc` and `pc`) as the top priorities, and price as a secondary one.

In [21]:
weights = {
    'battery_power': 10,
    'clock_speed': 1,
    'fc': 10,
    'int_memory': 1,
    'n_cores': 1,
    'pc': 10,
    'ram': 1,
    'talk_time': 1,
    'px_area': 1,
    'sc_area': 1,
    'm_dep': 1,
    'mobile_wt': 1,
    'price_range': 5
}

rank_mobiles(weights)

id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,ram,talk_time,three_g,touch_screen,wifi,px_area,sc_area,price_range
1555,1957,0,1.2,1,18,1,36,0.8,151,2,19,1115,18,1,0,1,2062038,32,2
2751,1836,0,0.6,0,18,1,57,0.8,111,5,20,1039,8,1,1,1,148044,15,1
1407,1731,1,2.3,1,18,0,60,0.5,171,4,20,1220,20,0,1,0,147538,27,2
2206,1811,0,1.2,0,18,1,25,0.8,185,6,19,1677,2,1,1,0,2036073,60,3
306,1348,0,2.0,0,18,0,52,0.3,98,3,20,955,7,1,1,1,3629598,198,2
2949,1659,0,2.7,0,18,0,64,0.7,83,2,19,1778,13,1,0,1,1438733,90,3
1889,1544,0,2.4,0,18,1,12,0.1,186,7,20,489,2,1,0,1,396680,36,1
2581,1506,1,2.9,1,19,1,47,0.6,130,7,20,1522,3,1,0,0,144900,5,2
2509,1667,1,2.5,0,17,0,5,0.8,119,3,20,1496,12,0,0,0,379167,117,2
1417,1448,0,0.5,1,18,0,2,0.2,100,5,19,593,18,1,1,1,967824,36,1


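Again the rankings match the weights: the top ten phones have front camera values of 17 to 19 and primary camera values of 19 to 20, and none of them is in the highest price range. Now we will use this code to create the app.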