<a href="https://www.kaggle.com/code/cosmicpegasis/phone-price?scriptVersionId=183605229" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Basic Popularity and Price Prediction of Phone Models

In [1]:
import numpy as np 
import pandas as pd

In [2]:
df = pd.read_csv("/kaggle/input/ukrainian-market-mobile-phones-data/phones_data.csv")

# Preprocessing

All prices were in UAH, but I'm more familiar with American pricing, so I converted them to US dollars using a 2019 UAH-to-USD rate that also accounts for a premium buyers may have to pay in Ukraine.

In [3]:
def uah_to_dollar(uah):
    return uah * 0.038

df.best_price = df.best_price.apply(uah_to_dollar)
df.lowest_price = df.lowest_price.apply(uah_to_dollar)
df.highest_price = df.highest_price.apply(uah_to_dollar)

In [4]:
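# memory_size mixes RAM and ROM values: treat values below 32 as RAM, above 32 as ROM.
# A value of exactly 32 matches neither condition and is mean-imputed in both columns.
df['ram'] = df[df["memory_size"] < 32]['memory_size']
df['rom'] = df[df['memory_size'] > 32]['memory_size']
# Fill the gaps in each new column with that column's mean.
df['ram'] = df['ram'].fillna(df['ram'].mean())
df['rom'] = df['rom'].fillna(df['rom'].mean())
# Drop columns that are not used as features, then any rows that still have missing values.
df.drop(["Unnamed: 0", "os", "model_name", "memory_size"], inplace=True, axis=1)
df.dropna(inplace=True)
df.reset_index(inplace=True, drop=True)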

The `memory_size` column sometimes refers to RAM and other times to ROM, so the cell above splits it into separate `ram` and `rom` columns using 32 as the threshold.
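
To make the split concrete, here is a minimal plain-Python sketch of the same thresholding and mean imputation. The values are made up for illustration and are not taken from the dataset, and `fill_with_mean` is a hypothetical helper standing in for `fillna(...mean())`.

```python
from statistics import mean

# Made-up memory_size values; None stands for a missing entry.
memory_size = [4, 128, 32, 8, 64, None]

# Same thresholds as the notebook cell: below 32 counts as RAM, above 32 as ROM.
ram = [m if m is not None and m < 32 else None for m in memory_size]
rom = [m if m is not None and m > 32 else None for m in memory_size]

def fill_with_mean(values):
    """Replace None with the mean of the known values (hypothetical helper)."""
    known = [v for v in values if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in values]

ram_filled = fill_with_mean(ram)
rom_filled = fill_with_mean(rom)

print(ram_filled)  # the entry for 32 is imputed here...
print(rom_filled)  # ...and here as well
```

In this toy data most of each column is imputed, which is worth keeping in mind when judging how much signal `ram` and `rom` can carry.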

In [5]:
from sklearn.preprocessing import OneHotEncoder

Brand names play a huge role in the pricing of a phone. Some brands are associated with more competitive, more popular phones.
Similarly, release date has a large impact on phone prices, as older lower-spec models were priced similarly to newer higher-spec models.

In [6]:
brand_ohe = OneHotEncoder(sparse_output=False)
brand_ohe.fit(df[['brand_name']])
ohe_labels = brand_ohe.transform(df[['brand_name']])
new_df = pd.DataFrame(ohe_labels, columns=brand_ohe.get_feature_names_out())
df = df.merge(new_df, left_index=True, right_index=True)
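
As a rough picture of what the encoder produces, the sketch below builds one indicator column per brand in plain Python. The brand list is made up, and the `brand_name_<category>` naming mirrors what `get_feature_names_out()` returns when the encoder is fitted on a DataFrame column called `brand_name`.

```python
# Hypothetical brand values, for illustration only.
brands = ["Apple", "OnePlus", "Samsung", "Apple"]

# One output column per distinct brand, in sorted order.
categories = sorted(set(brands))
columns = [f"brand_name_{c}" for c in categories]

# Each row gets 1.0 in the column of its own brand and 0.0 everywhere else.
encoded = [[1.0 if b == c else 0.0 for c in categories] for b in brands]

print(columns)
for row in encoded:
    print(row)
```

Because every row has exactly one non-zero entry, the encoded columns carry the same information as the original `brand_name` column, which is why that column is dropped from the features later.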

In [7]:
def transform_date(date: str) -> int:
    return int(date.split('-')[1])
df.release_date = df.release_date.transform(transform_date)

In [8]:
X = df.drop(['brand_name', 'popularity', 'best_price', 'lowest_price', 'highest_price', 'sellers_amount'], axis=1)
Y = df[['best_price']]

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Model Selection

I decided to start testing with K-Nearest Neighbors, Random Forests and Decision Trees.

When a company makes a phone, it first chooses a bracket e.g. Budget, Mid-range, Flagship-Killer, Flagship.

According to this bracket, the best possible specs are then chosen. My model tries to make the same decisions given the specs. It is intuitive to think of this as a series of decisions, hence my choice of decision-tree-based models.
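
The "series of decisions" idea can be sketched as a hand-written tree. The feature names echo the columns used above, but every threshold and dollar value below is invented for illustration; a fitted `DecisionTreeRegressor` learns its own splits from the data.

```python
def toy_price_tree(ram: float, rom: float, is_premium_brand: bool) -> float:
    """Hand-written stand-in for a learned regression tree.

    All thresholds and prices are hypothetical.
    """
    if ram < 4:
        # First decision: budget bracket, refined by storage.
        return 120.0 if rom < 64 else 180.0
    if ram < 8:
        # Second decision: mid-range bracket, refined by brand.
        return 450.0 if is_premium_brand else 300.0
    # Everything else: flagship bracket, again refined by brand.
    return 1000.0 if is_premium_brand else 700.0

print(toy_price_tree(3, 32, False))    # 120.0
print(toy_price_tree(6, 128, True))    # 450.0
print(toy_price_tree(12, 256, False))  # 700.0
```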

The above can also be modelled as a similarity problem, where a hypothetical phone is priced similarly to other phones already on the market. This simulates an actual market, hence my choice of KNN. A possible shortcoming of this approach is the gift of foresight: the predicted price of a phone that came out in 2015 might be influenced by the price of a phone released in 2016. Pricing in 2015 and 2016 can differ a lot if there were surprising technical advances between the two years.
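
A minimal sketch of the similarity idea, assuming two numeric features per phone and made-up prices: each feature is standardised (the same reason `StandardScaler` is paired with KNN later on), and the prediction is the mean price of the `k` closest phones. `knn_predict` is a hypothetical helper, not the scikit-learn implementation.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a price as the mean price of the k nearest training phones."""
    n = len(train_X)
    n_features = len(query)
    # Per-feature mean and standard deviation of the training data.
    means = [sum(row[j] for row in train_X) / n for j in range(n_features)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_X) / n) or 1.0
        for j in range(n_features)
    ]

    def scale(row):
        # Standardise so that large-valued features do not dominate the distance.
        return [(row[j] - means[j]) / stds[j] for j in range(n_features)]

    scaled_query = scale(query)
    # Euclidean distance from the query to every training phone.
    distances = [
        (math.dist(scale(row), scaled_query), price)
        for row, price in zip(train_X, train_y)
    ]
    nearest = sorted(distances)[:k]
    return sum(price for _, price in nearest) / k

# Made-up phones: [ram, rom] and a price in dollars.
train_X = [[2, 16], [3, 32], [4, 64], [8, 128], [12, 256]]
train_y = [100, 150, 250, 600, 900]

print(knn_predict(train_X, train_y, [6, 128], k=1))  # 600.0
print(knn_predict(train_X, train_y, [6, 128], k=2))  # 425.0
```

Nothing in this lookup knows about release order, which is exactly the foresight problem described above: a 2016 phone can be a neighbour of a 2015 query.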

# Observations

The model performs well on the simple "bigger numbers, higher price" relationship. It has also managed to capture the brand premium of a device: a phone from Apple will cost more than a phone with the same specs from OnePlus.

In [10]:
X_pop = df.drop(['brand_name', 'popularity', 'best_price', 'lowest_price', 'highest_price', 'sellers_amount'], axis=1)
Y_pop = df[['popularity']]
X_pop_train, X_pop_test, Y_pop_train, Y_pop_test = train_test_split(X_pop, Y_pop, test_size=0.2)

# GridSearchCV to search for best parameters

In [11]:
from sklearn.model_selection import GridSearchCV

In [12]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor

models = {
    "DecisionTree": make_pipeline(DecisionTreeRegressor()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "RFR": make_pipeline(RandomForestRegressor()),
    "AdaBoost": make_pipeline(AdaBoostRegressor(DecisionTreeRegressor()))
}
params = {
    "RFR": {
         "randomforestregressor__n_estimators": [300, 500],
    },
    "DecisionTree": {
         "decisiontreeregressor__splitter": ["best", "random"],
         "decisiontreeregressor__max_depth": [5, 10, 15, None],
    },
    "KNN" : {
         "kneighborsregressor__n_neighbors": [3, 5, 7, 10]
    },
    "AdaBoost" :{
        "adaboostregressor__n_estimators" : [50, 100]
    }
}

In [13]:
results = {}
for model_name, model in models.items():
    print(model_name)
    grid = GridSearchCV(model, params[model_name], cv=5, n_jobs=-1, verbose=0)
    grid.fit(X_train.values, Y_train.values.ravel())
    results[model_name] = {
        'best_score': grid.best_score_,
        'best_params': grid.best_params_,
        'model': grid
    }

DecisionTree
KNN
RFR
AdaBoost
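
For readers who have not used `GridSearchCV`, the loop it runs can be sketched in plain Python: for each candidate parameter value, score the model on each held-out fold, average the scores, and keep the value with the best average. The sketch below is a simplification under stated assumptions: a single feature, a toy 1-D KNN regressor, contiguous unshuffled folds, a sample count divisible by the number of folds, and made-up data. All function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def knn_1d(train_x, train_y, x, k):
    """Mean target of the k training points closest to x (single feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_cv_r2(xs, ys, k, n_folds=5):
    """Average R^2 over contiguous folds; each fold is held out once."""
    fold_size = len(xs) // n_folds
    scores = []
    for f in range(n_folds):
        lo, hi = f * fold_size, (f + 1) * fold_size
        test_x, test_y = xs[lo:hi], ys[lo:hi]
        train_x, train_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        preds = [knn_1d(train_x, train_y, x, k) for x in test_x]
        scores.append(r_squared(test_y, preds))
    return sum(scores) / n_folds

def grid_search(xs, ys, candidates, n_folds=5):
    """Return (best_k, best_score, all_scores) over candidate neighbour counts."""
    all_scores = {k: mean_cv_r2(xs, ys, k, n_folds) for k in candidates}
    best_k = max(all_scores, key=all_scores.get)
    return best_k, all_scores[best_k], all_scores

# Made-up data: 20 distinct feature values in scrambled order, roughly linear prices.
xs = [(i * 7) % 20 for i in range(20)]
ys = [50 * x + 100 for x in xs]

best_k, best_score, all_scores = grid_search(xs, ys, candidates=[1, 3, 5])
print(best_k, best_score)
print(all_scores)
```

`best_score` here plays the role of `grid.best_score_` in the notebook: it is a mean cross-validated score on the training data, not a test-set score.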


In [14]:
results

{'DecisionTree': {'best_score': 0.856434969647186,
  'best_params': {'decisiontreeregressor__max_depth': None,
   'decisiontreeregressor__splitter': 'best'},
  'model': GridSearchCV(cv=5,
               estimator=Pipeline(steps=[('decisiontreeregressor',
                                          DecisionTreeRegressor())]),
               n_jobs=-1,
               param_grid={'decisiontreeregressor__max_depth': [5, 10, 15, None],
                           'decisiontreeregressor__splitter': ['best', 'random']})},
 'KNN': {'best_score': 0.7905240984485908,
  'best_params': {'kneighborsregressor__n_neighbors': 3},
  'model': GridSearchCV(cv=5,
               estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                         ('kneighborsregressor',
                                          KNeighborsRegressor())]),
               n_jobs=-1,
               param_grid={'kneighborsregressor__n_neighbors': [3, 5, 7, 10]})},
 'RFR': {'best_score': 0.8767730

Based on the above results, I decided to choose RFR. My guess is that KNN performs worse than RFR because of the extra noise introduced by filling in the missing RAM and ROM values.

In [15]:
for model_name, result in results.items():
    print(f"{model_name}:", result['model'].score(X_test.values, Y_test.values.ravel()))

DecisionTree: 0.8812114404517977
KNN: 0.7584249620828392
RFR: 0.8834985898358616
AdaBoost: 0.8957971162842188
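
Since no `scoring` argument was passed to `GridSearchCV`, `score` falls back to the regressor's default metric, the coefficient of determination R², so the numbers above are R² values on the held-out test set. A minimal sketch of that formula on made-up prices (`r_squared` is a hypothetical helper, not the scikit-learn function):

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean price.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up true and predicted prices in dollars.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

score = r_squared(y_true, y_pred)
baseline = r_squared(y_true, [250.0] * 4)  # always predict the mean

print(score)
print(baseline)
```

A score of 1.0 means perfect predictions and 0.0 means no better than always predicting the mean price, so values around 0.88 to 0.90 indicate that most of the price variance in the test set is explained.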
