# Which linear model is best suited for predicting phones' prices?

## Introduction (150)

Scholars argue that smartphones evolved from luxurious items into necessities (Tanveer et al.). We use them for "calling and sending messages," "capturing pictures," "socializing," etc. They turned from communication tools into daily "multimedia machines" (Tanveer et al.).

Nevertheless, buying new phones can be challenging and frustrating due to the flood of features they offer (Kobie). To escape this tough choice, consumers often consider only a device's advertised characteristics without inquiring whether the phone price corresponds to them (K. Srujan Raju et al. 773). Hence, they are likely to make an uninformed decision and overpay.

Therefore, it is essential to create an effective **model that would predict a phone's market price given the device's set of characteristics**.

Thus, **we want to create and analyze several linear regression models and decide which one, if any, can help consumers evaluate whether the phone's proposed price aligns with the competition and is worth paying**.

## Data set and Methods (87)

We use a data set containing phone specifications and prices, which scholars scraped from [gadgets360.com](https://www.gadgets360.com/mobiles/best-phones), an Indian tech news website, and [published](https://www.kaggle.com/datasets/pratikgarai/mobile-phone-specifications-and-prices) in **2022** (Garai). The data set is therefore recent, and it contains **1321** observations.

We use the variables presented below, adjusting them as follows to ease the investigation:
- Rename them from the original data set
- Derive `resolution` from another two variables
- Convert `price` from Indian Rupee to US Dollar
- Convert `ram` from MB to GB

| Variable Name  | Description                                                 |
| -------------- | ----------------------------------------------------------- |
| `price`        | Phone price in USD                                          |
| `battery`      | Battery capacity in mAh                                     |
| `screen_size`  | Screen Size in Inches across opposite corners               |
| `resolution`   | The resolution of the phone: (width $\times$ height) / 1000 |
| `processor`    | Number of processor cores                                   |
| `ram`          | RAM available in phone in GB                                |
| `storage`      | Internal Storage of phone in GB                             |
| `rear_camera`  | Resolution of rear camera in MP (0 if unavailable)          |
| `front_camera` | Resolution of front camera in MP (0 if unavailable)         |
| `num_of_sims`  | Number of SIM card slots in phone                           |

To create linear regression models and assess their performance, we perform these steps:

#### 1. [Preliminary Analysis](#preliminary_analysis) <a name="methods"></a>
- [Reading and wrangling the data](#data_reading)
- [Calculating summary statistics](#summary_statistics_calculation)
- [Visualizing the correlation between `price` and other variables](#correlation_visualization)

#### 2. [Data Preparation](#data_preparation)
- [**Splitting the data** into training and testing, **70% being the training set**](#data_splitting)
- [Dealing with **multicollinearity** by calculating Variance Inflation Factor (VIF)](#multicollinearity_check)

#### 3. [Model creation](#model_creation)
- [Building **Full MLR model**](#mlr_full)
- [Building **Reduced MLR model**](#mlr_reduced)
- [Building **LASSO regression model**](#lasso)

#### 4. [Model evaluation](#evaluation)
- Assessing the models' performance using **Root Mean Squared Error (RMSE)**

## [⧋](#methods) Preliminary Analysis (95) <a name="preliminary_analysis"></a>

Here we do the first part from the list above.

#### [⧋](#methods) Reading and wrangling the data <a name="data_reading"></a>

In [None]:
# Installing missing packages
# https://stackoverflow.com/a/4090208/18184038
package_list <- "psych"
to_install <- package_list[!(package_list %in% installed.packages()[, "Package"])]
if (length(to_install)) install.packages(to_install)


In [None]:
options(jupyter.plot_mimetypes = "image/png")

library(tidyverse)
library(psych)
library(GGally)
library(broom)
library(car)
library(leaps)
library(mltools)
library(glmnet)
library(grid)
library(gridExtra)

set.seed(7)

# Font size for the plots
font_size <- 22

# Reading the data set from the web
url <- "https://raw.githubusercontent.com/Ihor16/stat-301-project/main/data/specs.csv"
data_raw <- read.csv(url) %>%
  as_tibble()

# Previewing the raw data set
data_raw %>%
  head(3)

In [None]:
# Conversion rate from INR to USD
# https://www.forbes.com/advisor/money-transfer/currency-converter/inr-usd/
rate <- 0.012282

# Renaming the variables, converting `price` to USD and `ram` to GB, and deriving `resolution`
phone_data <- data_raw %>%
  select_if(is.numeric) %>%
  select(-X) %>%
  rename(
    battery = "Battery.capacity..mAh.",
    screen_size = "Screen.size..inches.",
    resolution_x = "Resolution.x",
    resolution_y = "Resolution.y",
    processor = "Processor",
    ram = "RAM..MB.",
    storage = "Internal.storage..GB.",
    rear_camera = "Rear.camera",
    front_camera = "Front.camera",
    num_of_sims = "Number.of.SIMs",
    price = "Price"
  ) %>%
  mutate(
    price = price * rate,
    resolution = (resolution_x * resolution_y) / 1000,
    ram = ram / 1000
  ) %>%
  relocate(resolution, .before = resolution_x) %>%
  select(-c(resolution_x, resolution_y)) %>%
  drop_na() %>%
  select(price, everything())

# Previewing the wrangled data set
phone_data %>%
  head()

In [None]:
# Calculating summary statistics for the data set
phone_data %>%
  describe() %>%
  select(min, mean, median, max, sd)

Scholars specify `ram`, `storage`, and `resolution` as the most significant variables influencing a phone's price (Listianingrum et al.). We check if our data reflects this by comparing the correlations of each variable with `price`.

In [None]:
# Calculating correlation between `price` and every other variable
# https://stackoverflow.com/a/45892364/18184038
phone_data_corr <- cor(phone_data[-1], phone_data$price) %>%
  as.data.frame() %>%
  rename(correlation = V1)

# Printing the correlations for each variable in descending order
phone_data_corr %>%
  arrange(desc(correlation))

In [None]:
options(repr.plot.width = 15, repr.plot.height = 7)

# Plotting the distribution of calculated correlation coefficients
tibble(
  correlation = phone_data_corr$correlation,
  name = phone_data[-1] %>% colnames()
) %>%
  ggplot(aes(x = reorder(name, correlation), y = correlation, fill = name)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Correlations of Input Variables with Phone Price",
    x = "Input Variable",
    y = "Correlation with Phone Price"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = font_size)
  ) +
  coord_flip()


The data corresponds to the scholarly claim, so we expect these variables to be present in our best predictive model.

However, the correlation coefficients for these variables are below 0.7, so none of them is strongly correlated with `price` on its own. Therefore, we need to build linear prediction models to see if this data is suitable for predicting `price` via linear regression.

## [⧋](#methods) Data Preparation (277) <a name="data_preparation"></a>

#### [⧋](#methods) Splitting the Data Set <a name="data_splitting"></a>

First, we apply the **holdout method** by splitting the wrangled data into:
- Training (70% of the observations)
- Testing (30%)

We treat the training set as our sample and the testing one as the new data used to evaluate the models.
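The holdout idea can also be sketched with base R indexing. This is only an illustration on the built-in `mtcars` data, not the notebook's own method; the names `training_rows`, `toy_training`, and `toy_testing` are hypothetical.

```r
# A minimal holdout sketch on the built-in mtcars data (illustrative only)
set.seed(7)

n_obs <- nrow(mtcars)

# Drawing 70% of the row indices, rounded down, without replacement
training_rows <- sample(seq_len(n_obs), size = floor(0.7 * n_obs))

# Rows that were drawn form the training set; the rest form the testing set
toy_training <- mtcars[training_rows, ]
toy_testing <- mtcars[-training_rows, ]

# Counting the observations in each set
c(training = nrow(toy_training), testing = nrow(toy_testing))
```

Because the testing rows are defined as everything not drawn for training, the two sets cannot overlap, which is the property the holdout method relies on.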

In [None]:
# Adding `ID` column to the full data set
phone_data$ID <- rownames(phone_data)

# (1) Shuffling the full data set
# (2) Selecting 70% of the observations
# (3) Assigning those to the training data set
phone_training <- phone_data %>%
  sample_n(size = nrow(phone_data) * 0.7)

# Assigning the rest 30% of the observations to the testing set
phone_testing <- anti_join(
  x = phone_data,
  y = phone_training,
  by = "ID"
)

# Dropping the `ID` column from the created data sets
phone_training <- phone_training %>%
  select(-ID)
phone_testing <- phone_testing %>%
  select(-ID)

# Calculating the number of observations in training and testing sets
tibble(
  data_set = c("training_data", "testing_data"),
  num_of_obs = c(phone_training %>% nrow(), phone_testing %>% nrow())
)

#### [⧋](#methods) Multicollinearity Check <a name="multicollinearity_check"></a>

Multicollinearity occurs when there's a high correlation between the model's input variables. A model fitted on such data tends to produce unreliable results because a change in one input variable is accompanied by a significant change in another, which makes their individual effects hard to separate (Wu).

Thus, we estimate its presence in our data by creating a **correlation heat map** visualizing "the strength of relationships between numerical variables" (Kumar).

In [None]:
# Creating a correlation matrix for all numerical variables
phone_data_corr_matrix <- phone_training %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "corr")

In [None]:
options(repr.plot.width = 20, repr.plot.height = 12)

# Creating the heat map plot
phone_data_corr_matrix %>%
  ggplot(aes(var1, var2)) +
  geom_tile(aes(fill = corr), color = "white") +
  scale_fill_distiller(
    "Correlation coefficient \n",
    palette = "Greens",
    direction = 1,
    limits = c(-1, 1)
  ) +
  theme_minimal() +
  labs(
    title = "Correlation matrix for all continuous input variables",
    x = "", y = "") +
  theme(
    axis.text.x = element_text(
      angle = 45, vjust = 1,
      size = font_size, hjust = 1
    ),
    axis.text.y = element_text(
      vjust = 1,
      size = font_size, hjust = 1
    ),
    legend.title = element_text(size = font_size, face = "bold"),
    title = element_text(size = font_size)
  ) +
  coord_fixed() +
  geom_text(aes(x = var1,
                y = var2,
                label = round(corr, 2)),
            color = "black",
            size = 9
  )

We observe these variable pairs with correlation $\geq$ 0.7:

| Variable pair               | Correlation value |
| --------------------------- | ----------------- |
| `storage` and `ram`         | 0.85              |
| `screen_size` and `battery` | 0.75              |
| `resolution` and `ram`      | 0.72              |
| `screen_size` and `ram`     | 0.71              |
| `ram` and `front_camera`    | 0.7               |

We now assess the multicollinearity rigorously and find whether the correlations presented above are problematic by creating a function that calculates **VIF**, a measure of "how much the variance of an independent variable is influenced by its correlation with the other independent variables" (Potters).
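For reference, the standard definition of VIF for the $j$-th input variable can be written as below. The symbol $R_j^2$ is introduced here for illustration and does not appear in the notebook: it is the coefficient of determination obtained when that input variable is regressed on all the other input variables.

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

Under this definition, a VIF of 5 corresponds to $R_j^2 = 0.8$ and a VIF of 10 to $R_j^2 = 0.9$, so higher values mean that the other input variables explain more of the variable in question.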

In [None]:
# (1) Creates MLR for the `data_set` using `price` as a dependent variable and all the rest as input variables
# (2) Calculates the model's VIFs for each input variable and prints the highest 5 of them
calculate_vif <- function(data_set) {
  # Creating a multiple linear regression for `price` using all input variables from the given data set
  mlr <- data_set %>%
    lm(formula = price ~ .)

  # Using Variance Inflation Factor (VIF) to quantify the possible multicollinearity
  mlr %>%
    vif() %>%
    as.data.frame() %>%
    rename(VIF = ".") %>%
    round(3) %>%
    arrange(desc(VIF)) %>%
    head(5)
}

Now we input the training set into this function to see the VIF for each input variable.

In [None]:
# Calculating VIF values using the whole training set
calculate_vif(phone_training)

> `ram` and `storage` have the highest VIFs, which we expected because their correlation in the heat map was the highest.

Some scholars argue that a VIF above 10 denotes multicollinearity, while others say values above 5 are also problematic (Bock).

Therefore, we try to remedy our **6.18** VIF for `ram` by:
- Removing highly correlated variables
- Adjusting the `ram` variable and changing its interpretation

In [None]:
# Calculating VIFs after dropping `storage`
calculate_vif(phone_training %>% select(-storage))

In [None]:
# Calculating VIFs after dropping `front_camera`
calculate_vif(phone_training %>% select(-front_camera))

In [None]:
# Adjusting `ram` column to indicate how much a phone's ram capacity is above average
phone_data_adjusted_ram <- phone_training %>%
  mutate(ram = ram - 4)

# Calculating VIFs after adjusting `ram`
calculate_vif(phone_data_adjusted_ram)

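VIFs become significantly smaller after we drop `storage`. However, scholars consider this variable significant for predicting a phone's price (Listianingrum et al.), so we keep it because it may give our models more information. Subtracting a constant from `ram` cannot change the VIFs, because a VIF depends only on how well each input variable is explained by the others.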

Thus, we continue our investigation without changing the data.
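To illustrate why shifting a variable by a constant does not affect its VIF, the sketch below computes VIFs by hand through auxiliary regressions. It uses the built-in `mtcars` data rather than the phone data, and `manual_vif` is a hypothetical helper, not a function from `car`.

```r
# Hypothetical helper: VIF of each predictor via auxiliary regressions,
# using VIF = 1 / (1 - R^2) of the predictor regressed on the others
manual_vif <- function(predictors) {
  sapply(names(predictors), function(name) {
    others <- setdiff(names(predictors), name)
    aux_fit <- lm(reformulate(others, response = name), data = predictors)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

# Toy predictors from the built-in mtcars data (illustrative only)
predictors <- mtcars[, c("disp", "hp", "wt")]

# Shifting one predictor by a constant, similar to `ram - 4` above
shifted <- transform(predictors, wt = wt - mean(wt))

vif_original <- manual_vif(predictors)
vif_shifted <- manual_vif(shifted)

# Comparing the two sets of VIFs
all.equal(vif_original, vif_shifted)
```

Each auxiliary regression includes an intercept, which absorbs any constant shift, so the $R^2$ values and therefore the VIFs stay the same.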

## [⧋](#methods) Model Creation (553) <a name="model_creation"></a>

### [⧋](#methods) MLR Full Model <a name="mlr_full"></a>

Multiple Linear Regression (MLR) is a linear regression that uses multiple input variables to predict a dependent variable. We create an MLR using all input variables to predict `price`.

In [None]:
# Building predictive Ordinary Least Squares LR model with all input variables
phone_model_full <- phone_training %>%
  lm(formula = price ~ .)

# Displaying the summary statistics of the full MLR model
phone_model_full %>%
  glance()

From the `r.squared` column, which indicates the **coefficient of determination**, we see that the full MLR explains **52.8%** of the variation of the dependent variable `price`. However, this value does not indicate how well our model would predict out-of-sample observations, so we need to evaluate the model using the testing set.

In [None]:
# Obtaining out-of-sample predictions for `price` from the testing set using the full model
phone_model_full_test <- predict(
  object = phone_model_full,
  newdata = phone_testing
)

# Displaying a confidence interval for prediction for the full model
phone_training %>%
  select(ram, storage, screen_size) %>%
  cbind(predict(phone_model_full, interval = "confidence")) %>%
  head(1)

The **95% Confidence Interval for Prediction (CIP)** shown above means that, using the full MLR model, we are 95% confident that the range of **45.01 to 72.94 USD** contains the average price of phones with the same characteristics as the first training observation, which include `ram` of 1GB, `storage` of 8GB, and `screen_size` of 4.5 inches.
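The interval above describes the average price, not the price of one particular phone. As a hedged illustration of that distinction, the sketch below contrasts `interval = "confidence"` with `interval = "prediction"` in `predict()` on the built-in `mtcars` data; `toy_model` and `new_car` are hypothetical names used only here.

```r
# Fitting a small MLR on the built-in mtcars data (illustrative only)
toy_model <- lm(mpg ~ wt + hp, data = mtcars)

# One hypothetical new observation
new_car <- data.frame(wt = 3, hp = 120)

# Interval for the average response at these input values
mean_interval <- predict(toy_model, newdata = new_car,
                         interval = "confidence", level = 0.95)

# Interval for a single new observation at these input values
single_interval <- predict(toy_model, newdata = new_car,
                           interval = "prediction", level = 0.95)

# Showing both intervals side by side
rbind(confidence = mean_interval[1, ], prediction = single_interval[1, ])
```

The prediction interval is wider because it also accounts for the variability of an individual observation around the average.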

Now, we use the testing set to evaluate the model by calculating its **RMSE**, the square root of the average squared difference between the prices in the testing set and the predicted prices stored in `phone_model_full_test`.
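The code cells in this notebook call `rmse()` from `mltools`, which returns the square root of the mean squared error. The sketch below shows the same calculation by hand on made-up numbers; `manual_rmse`, `toy_preds`, and `toy_actuals` are hypothetical names, and the values are not taken from the phone data.

```r
# Hypothetical helper: root mean squared error computed by hand
manual_rmse <- function(preds, actuals) {
  sqrt(mean((actuals - preds)^2))
}

# Made-up predicted and actual prices in USD (illustrative only)
toy_preds <- c(100, 150, 200)
toy_actuals <- c(110, 140, 230)

# Errors are 10, -10, and 30, so the mean squared error is 1100 / 3
toy_rmse <- manual_rmse(preds = toy_preds, actuals = toy_actuals)
toy_rmse
```

Because of the square root, the result is expressed in the same unit as `price`, which makes it easier to interpret than the squared error.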

In [None]:
# Calculating the RMSE value for the full model
phone_model_full_rmse <- rmse(
  preds = phone_model_full_test,
  actuals = phone_testing$price
)

# Storing full model's RMSE in the results tibble
phone_rmses <- tibble(
  model = "OLS Full Regression",
  rmse = phone_model_full_rmse
)
phone_rmses

We aim to use RMSE values to compare the models' performance, so our value of **114.19** is only helpful once we create another model, calculate its RMSE, and compare the two.

### [⧋](#methods) MLR Stepwise Selection Reduced Model <a name="mlr_reduced"></a>

Next, we select variables significantly associated with `price` and fit another MLR using those. Using a model with fewer input variables can be beneficial because of these reasons (Rajpurohit):
- The model would contain only relevant variables, so it's less likely to have hidden relationships between its predictors
- Models with fewer input variables tend to perform well on both training and testing sets because such models avoid overfitting

In our analysis, we perform a stepwise variable selection using a **forward selection** algorithm. This algorithm starts with a model containing one input variable. Then it builds other models by sequentially adding variables so that each model is the best for its number of input variables. However, once a variable is selected, it is present in all subsequent models.

The last built model contains `nvmax` input variables. We set this value to the maximum possible number of predictors, so the last model is the full MLR from above.
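The forward selection idea can be sketched in a few lines of base R. This is a simplified illustration on the built-in `mtcars` data, not the implementation used by `regsubsets()`; `forward_select` is a hypothetical helper that adds, at each step, the variable that lowers the training RSS the most.

```r
# Hypothetical helper: greedy forward selection by training RSS
forward_select <- function(data, response) {
  remaining <- setdiff(names(data), response)
  selected <- character(0)

  while (length(remaining) > 0) {
    # RSS of the current model extended by each remaining candidate
    rss <- sapply(remaining, function(candidate) {
      fit <- lm(reformulate(c(selected, candidate), response = response),
                data = data)
      sum(residuals(fit)^2)
    })

    # Keeping the candidate with the lowest RSS in all later models
    best <- names(which.min(rss))
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }

  selected
}

# Order in which the variables enter the model (illustrative data only)
toy_data <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
selection_order <- forward_select(toy_data, response = "mpg")
selection_order
```

The first `k` entries of the returned vector are the variables of the `k`-variable model, which reflects the property that a selected variable stays in all subsequent models.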

In [None]:
# Performing forward variable selection on the training set
phone_forward_sel <- regsubsets(
  x = price ~ .,
  nvmax = ncol(phone_training) - 1,
  data = phone_training,
  method = "forward"
)
phone_forward_sel

# Storing the summary statistics of the variable selection
phone_forward_sel_summary <- summary(phone_forward_sel)
phone_forward_sel_summary

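We compare the candidate models using Mallows' Cp, which penalizes the training RSS for the number of input variables and therefore serves as an estimate of out-of-sample error. Thus, the most precise model would have the smallest Cp.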
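For reference, one classical form of Mallows' Cp for a candidate model with $p$ estimated coefficients (including the intercept) is sketched below. The symbols are introduced here for illustration: $\mathrm{RSS}_p$ is the candidate model's residual sum of squares on the training set, $\hat{\sigma}^2$ is the error variance estimated from the full model, and $n$ is the number of training observations. Other sources scale the statistic differently, so only the ranking of the models should be read from it.

$$C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p$$

The $2p$ term penalizes additional variables, so a larger model is preferred only if it lowers the RSS enough to offset the penalty.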

In [None]:
# Storing variable selection metrics
phone_forward_sel_summary_df <- tibble(
  n_input_vars = 1:9,
  rss = phone_forward_sel_summary$rss,
  bic = phone_forward_sel_summary$bic,
  cp = phone_forward_sel_summary$cp
)
phone_forward_sel_summary_df

# Saving the row with the smallest Cp value
min_cp <- phone_forward_sel_summary_df %>%
  filter(cp == min(phone_forward_sel_summary_df$cp))
min_cp

options(repr.plot.width = 15, repr.plot.height = 7)

# Plotting Cp values from the variable selection summary table
phone_forward_sel_summary_df %>%
  ggplot(aes(x = n_input_vars, y = cp)) +
  geom_line() +
  geom_point(aes(x = n_input_vars, y = cp), size = 3) +
  scale_x_discrete(limits = factor(1:9)) +
  geom_point(
    data = min_cp,
    color = "red",
    size = 5
  ) +
  labs(
    title = "Mallows' Cp Statistic vs. Num. of Variables",
    x = "Num. of variables",
    y = "Mallows' Cp Statistic"
  ) +
  theme(text = element_text(size = font_size))

The minimum Cp value corresponds to the model with **7** variables, so now we use `phone_forward_sel_summary` to see which variables were selected and build a reduced MLR using them.

In [None]:
phone_forward_sel_summary

In [None]:
# Building a model using the variables selected by the stepwise forward selection
phone_model_stepwise <- phone_training %>%
  lm(formula = price ~ battery +
    resolution +
    processor +
    ram +
    storage +
    front_camera +
    num_of_sims
  )

phone_model_stepwise %>%
  glance()

From the `r.squared` column, we obtain the coefficient of determination of **52.8%**, the same as for the full MLR. Now, we can build and interpret a CIP from this model.

In [None]:
# Obtaining out-of-sample predictions for `price` from the testing set using the stepwise reduced model
phone_model_stepwise_test <- predict(
  object = phone_model_stepwise,
  newdata = phone_testing
)

# Displaying a confidence interval for prediction for the stepwise reduced model
phone_training %>%
  select(ram, storage, screen_size) %>%
  cbind(predict(phone_model_stepwise,
                interval = "confidence",
                level = 0.95)) %>%
  head(1)

The CIP shows that, using this model, we are 95% confident that the range of **45.13 to 72.79 USD** contains the average phone price with the presented characteristics. This CIP is similar to the one from the full MLR, which hints that the variable selection did not yield significantly different results for our data.

Next, we evaluate the reduced model's performance on the testing set.

In [None]:
# Calculating the RMSE value for the stepwise reduced model
phone_model_stepwise_rmse <- rmse(
  preds = phone_model_stepwise_test,
  actuals = phone_testing$price
)

# Adding stepwise reduced model's RMSE to the results tibble
phone_rmses <- phone_rmses %>%
  rbind(tibble(
    model = "OLS Stepwise Reduced Regression",
    rmse = phone_model_stepwise_rmse
  )) %>%
  unique()

phone_rmses

The RMSE for the reduced model is smaller, which means this model predicts better. We expected this, given that fewer variables are likely to reduce the collinearity among the input variables we studied above.

Next, we build the LASSO model and evaluate whether it performs even better.

### [⧋](#methods) LASSO Model <a name="lasso"></a>

We use LASSO regression because we expect it to:
- **Prevent overfitting**
- **Shrink some of the coefficients to 0, which helps with variable selection**

To prepare for this model, we first create four matrices. Two of them come from the training set: one holds all input variables, and the other holds only the response variable. We then build the same two matrices from the testing set.
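For reference, with `alpha = 1` and a continuous response, the objective that `glmnet` minimizes can be sketched as below. The symbols are introduced here for illustration: $y_i$ is the price and $x_i$ the vector of input variables of the $i$-th training observation, $n$ is the number of training observations, $p$ is the number of input variables, and $\lambda$ is the penalty parameter chosen by cross-validation.

$$\min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

A larger $\lambda$ pushes more coefficients to exactly zero, while $\lambda = 0$ gives back the ordinary least squares fit.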

In [None]:
# Creating a matrix of all input variables from the training set
phone_training_matrix_x <- phone_training %>%
  select(-price) %>%
  as.matrix()

# Creating a matrix of responses from the training set
phone_training_matrix_y <- phone_training %>%
  select(price) %>%
  as.matrix()

# Creating a matrix of all input variables from the testing set
phone_testing_matrix_x <- phone_testing %>%
  select(-price) %>%
  as.matrix()

# Creating a matrix of responses from the testing set
phone_testing_matrix_y <- phone_testing %>%
  select(price) %>%
  as.matrix()

# Finding the optimal value of lambda, the penalty parameter
phone_cv_lambda <- cv.glmnet(
  x = phone_training_matrix_x,
  y = phone_training_matrix_y,
  alpha = 1
)

# Plotting the range of lambda values vs. MSE
phone_cv_lambda %>%
  plot(
    main = "Lambda selection by Cross Validation with LASSO\n\n"
  )

# Storing the lambda value that minimizes the cross-validated MSE
phone_lambda_min <- round(phone_cv_lambda$lambda.min, 3)

# Displaying the natural log of the selected lambda
round(log(phone_lambda_min), 2)



In [None]:
# Building LASSO model with the min lambda
phone_model_lasso <- glmnet(
  x = phone_training_matrix_x,
  y = phone_training_matrix_y,
  alpha = 1,
  lambda = phone_lambda_min
)

# Comparing coefficients from the full MLR model and LASSO with min lambda
data.frame(
  full_model = coef(phone_model_full),
  lasso = c(phone_model_lasso$a0, as.vector(phone_model_lasso$beta))
) %>%
  round(3)

# Obtaining out-of-sample predictions for `price` from the testing set using the lasso model
phone_model_lasso_test <- predict(
  object = phone_model_lasso,
  newx = phone_testing_matrix_x
)

# Calculating the RMSE value for the lasso model
phone_model_lasso_rmse <- rmse(
  preds = phone_model_lasso_test,
  actuals = phone_testing$price
)

# Adding LASSO model's RMSE to the results tibble
phone_rmses <- phone_rmses %>%
  rbind(tibble(
    model = "LASSO Regression with min MSE",
    rmse = phone_model_lasso_rmse
  )) %>%
  unique()

phone_rmses %>%
  as.data.frame()

## [⧋](#methods) Evaluation (395) <a name="evaluation"></a>

Our models explain 52.8% of the training set's variation, and the RMSE for all of them is approximately the same. This modest performance suggests that our analysis contained several crucial mistakes that prevented us from creating effective regression models.

**(1) First**, we did not check for the linearity assumption yet still assumed that our variables are linearly related. However, the scatter plots of `price` vs. each input variable make it evident that most variables are not linearly related to `price`.

In [None]:
options(repr.plot.width = 20, repr.plot.height = 10)

# Creating scatter plots between all variables
plots <- phone_training %>%
  ggpairs()

# Building the scatter plot between price and each input variable
price_scatter_plots <- lapply(2:10, function(i) {
  plots[i, 1] +
    coord_flip() +
    geom_point(size = 3) +
    theme(text = element_text(size = font_size))
})

# Printing the scatter plots in a 3 x 3 grid
grid.arrange(
  grobs = price_scatter_plots,
  nrow = 3, ncol = 3,
  top = textGrob("Price vs. Each Input Variable",
                 gp = gpar(fontsize = font_size + 10))
)

Thus, linear regression was not appropriate for modeling their relationship. To fix this mistake, future studies may use other methods to fit a predictive model that better captures the relationship between the input variables and `price` in this data.

**(2) Second**, as the plot above shows, it could be better to treat `processor`, `ram`, `storage`, and `num_of_sims` as categorical rather than continuous variables. This change would require adjusting how we select significant variables: it would be necessary to use F-tests instead of the forward selection algorithm we applied. Future studies can try this method and compare their findings with ours.
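As a hedged sketch of what such an F-test could look like, the example below treats the number of cylinders in the built-in `mtcars` data as a categorical variable and compares two nested models with `anova()`. The data and the names `reduced_fit`, `full_fit`, and `f_test` are illustrative assumptions, not part of our analysis.

```r
# Model without the categorical variable (illustrative data only)
reduced_fit <- lm(mpg ~ wt, data = mtcars)

# Model that adds the number of cylinders as a categorical variable
full_fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# F-test of whether all dummy variables for `cyl` are jointly zero
f_test <- anova(reduced_fit, full_fit)
f_test
```

A categorical variable with three levels enters the model as two dummy variables, so the F-test evaluates them together instead of adding or dropping them one at a time.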

**(3) Third**, we assumed the normality of the residuals for our fitted models. However, their distribution plots below indicate right skewness, which means our models don't explain all trends in the data. One potential remedy that future research can employ is building a model using interaction terms. This addition would cover the cases when, for example, an independent variable `A` influences the `price` differently depending on the values of another independent variable `B` (Interactions in Multiple Linear Regression Basic Ideas).
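A minimal sketch of a model with an interaction term is shown here, assuming the built-in `mtcars` data as a stand-in for the phone data; `interaction_fit` is a hypothetical name. In an R formula, `wt * hp` expands to both main effects plus their interaction `wt:hp`.

```r
# Fitting an MLR with an interaction between two input variables
# (built-in mtcars data, illustrative only)
interaction_fit <- lm(mpg ~ wt * hp, data = mtcars)

# The coefficient named `wt:hp` describes how the effect of `wt`
# changes with the value of `hp`
names(coef(interaction_fit))
```

If the interaction coefficient is clearly different from zero, the effect of one variable on the response depends on the level of the other, which a model with main effects alone cannot express.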

In [None]:
options(repr.plot.width = 20, repr.plot.height = 10)

# Plotting the distributions of residuals for full and reduced MLR models
# https://stackoverflow.com/a/10907452/18184038
par(mfrow = c(1, 2), cex = 2)
hist(residuals(phone_model_full),
     main = "Full MLR Residuals Distribution")
hist(residuals(phone_model_stepwise),
     main = "Reduced MLR Residuals Distribution")

Additionally, there are several limitations of the data set, which prevented our analysis from being successful:

- Even though the data set was published in 2022, it missed variables describing critical features of modern smartphones. For example, the data did not have variables describing screen refresh rate, availability of an NFC module, or 5G compatibility, which may influence smartphone prices today. This lack of input variables can be corrected by re-scraping the website and retrieving these new features.

- The data contained smartphone observations from the Indian market, so our study assumed that its prices reflected the global situation. However, this is not always the case. For example, iPhone prices in the US are significantly lower (Hilsenteger). Future studies can fix this limitation by scraping websites from different regions and randomly selecting phones from this international data set.


## [⧋](#methods) Conclusion (200)

## References

“Best Mobile Phones in India | Latest & New Smartphones Price.” *Gadgets 360*, 2020, www.gadgets360.com/mobiles/best-phones. Accessed 3 Dec. 2022.

Bock, Tim. “What Are Variance Inflation Factors (VIFs)? | Displayr.com.” *Displayr*, 6 Apr. 2018, www.displayr.com/variance-inflation-factors-vifs/.

Garai, Pratik. “Mobile Phone Specifications and Prices.” *Www.kaggle.com*, 14 Aug. 2022, www.kaggle.com/datasets/pratikgarai/mobile-phone-specifications-and-prices. Accessed 4 Dec. 2022.

Hilsenteger, Lewis. “IPhone 14 ESIM Controversy Explained.” *Www.youtube.com*, 12 Sept. 2022, www.youtube.com/watch?t=543&v=DLILlKdELEk&feature=youtu.be. Accessed 3 Dec. 2022.

K. Srujan Raju, et al. *Data Engineering and Communication Technology*. Springer, 9 Jan. 2020, p. 773.

Kobie, Nicole. “Why Does Buying a New Phone Have to Be so - ProQuest.” *Www.proquest.com*, Apr. 2017, www.proquest.com/docview/1985885659?accountid=14656&forcedol=true&pq-origsite=summon. Accessed 3 Dec. 2022.

Kumar, Ajitesh. “Correlation Concepts, Matrix & Heatmap Using Seaborn.” *Data Analytics*, 16 Apr. 2022, vitalflux.com/correlation-heatmap-with-seaborn-pandas/#:~:text=with%20each%20other.-.

Listianingrum, T, et al. “Smartphone Hedonic Price Study Based on Online Retail Price in Indonesia.” *Journal of Physics: Conference Series*, vol. 1863, no. 1, 1 Mar. 2021, p. 012032, 10.1088/1742-6596/1863/1/012032. Accessed 1 May 2022.

Potters, Charles. “Variance Inflation Factor (VIF).” *Investopedia*, 26 July 2022, www.investopedia.com/terms/v/variance-inflation-factor.asp#:~:text=Variance%20inflation%20factor%20measures%20how.

Tanveer, Muhammad, et al. “Mobile Phone Buying Decisions among Young Adults: An Empirical Study of Influencing Factors.” *Sustainability*, vol. 13, no. 19, 27 Sept. 2021, p. 10705, 10.3390/su131910705. Accessed 8 Oct. 2021.

Wu, Songhao. “Multi-Collinearity in Regression.” *Medium*, 23 May 2020, towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea.