# Laptop Price Prediction

This project focuses on a dataset that describes a range of laptops in detail, including their model, brand and hardware components. The purpose of this project is to develop a predictive model that can accurately predict the prices of different laptops based on their features.

The dataset was published by a Kaggle user (https://www.kaggle.com/datasets/owm4096/laptop-prices). We will use Python together with machine learning algorithms, preprocessing tools and pipelines to build our model.

## Import Libraries

In [100]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
from matplotlib import pyplot as plt 
import seaborn as sns
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score 
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

## Load and Inspect Data 

In [101]:
laptops = pd.read_csv('laptop_prices.csv')
laptops.info()
laptops.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Company               1275 non-null   object 
 1   Product               1275 non-null   object 
 2   TypeName              1275 non-null   object 
 3   Inches                1275 non-null   float64
 4   Ram                   1275 non-null   int64  
 5   OS                    1275 non-null   object 
 6   Weight                1275 non-null   float64
 7   Price_euros           1275 non-null   float64
 8   Screen                1275 non-null   object 
 9   ScreenW               1275 non-null   int64  
 10  ScreenH               1275 non-null   int64  
 11  Touchscreen           1275 non-null   object 
 12  IPSpanel              1275 non-null   object 
 13  RetinaDisplay         1275 non-null   object 
 14  CPU_company           1275 non-null   object 
 15  CPU_freq             

Unnamed: 0,Company,Product,TypeName,Inches,Ram,OS,Weight,Price_euros,Screen,ScreenW,...,RetinaDisplay,CPU_company,CPU_freq,CPU_model,PrimaryStorage,SecondaryStorage,PrimaryStorageType,SecondaryStorageType,GPU_company,GPU_model
0,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1339.69,Standard,2560,...,Yes,Intel,2.3,Core i5,128,0,SSD,No,Intel,Iris Plus Graphics 640
1,Apple,Macbook Air,Ultrabook,13.3,8,macOS,1.34,898.94,Standard,1440,...,No,Intel,1.8,Core i5,128,0,Flash Storage,No,Intel,HD Graphics 6000
2,HP,250 G6,Notebook,15.6,8,No OS,1.86,575.0,Full HD,1920,...,No,Intel,2.5,Core i5 7200U,256,0,SSD,No,Intel,HD Graphics 620
3,Apple,MacBook Pro,Ultrabook,15.4,16,macOS,1.83,2537.45,Standard,2880,...,Yes,Intel,2.7,Core i7,512,0,SSD,No,AMD,Radeon Pro 455
4,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1803.6,Standard,2560,...,Yes,Intel,3.1,Core i5,256,0,SSD,No,Intel,Iris Plus Graphics 650


The dataset contains 1275 rows and 23 columns of information on laptops. There do not seem to be any missing values in any of the columns.
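The non-null counts reported by `info()` can also be checked explicitly. The snippet below is a minimal sketch on a hypothetical three-row frame (`sample` is an illustrative stand-in, not the real dataset); the same two calls could be applied to `laptops`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the laptops DataFrame
sample = pd.DataFrame({
    "Company": ["Apple", "HP", None],
    "Ram": [8, 8, 16],
    "Weight": [1.37, np.nan, 1.83],
})

# Count missing values per column; all zeros would mean nothing to impute
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

On the real data, every entry of `laptops.isna().sum()` should be zero if the reading of `info()` is right.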

Here's a summary of all the columns:

- **Company**: Laptop Manufacturer.
- **Product**: Brand and Model.
- **TypeName**: Laptop Type (Notebook, Ultrabook, Gaming, etc.).
- **Inches**: Screen Size.
- **Ram**: Total amount of RAM in laptop (GBs).
- **OS**: Operating System installed.
- **Weight**: Laptop Weight in kilograms.
- **Price_euros**: Price of Laptop in Euros. (Target)
- **Screen**: Screen definition (Standard, Full HD, 4K Ultra HD, Quad HD+).
- **ScreenW**: Screen width (pixels).
- **ScreenH**: Screen height (pixels).
- **Touchscreen**: Whether or not the laptop has a touchscreen.
- **IPSpanel**: Whether or not the laptop has an IPS panel.
- **RetinaDisplay**: Whether or not the laptop has a Retina display.
- **CPU_company**: Company that manufactured the CPU.
- **CPU_freq**: Frequency of the laptop CPU (GHz).
- **CPU_model**: Model of the CPU.
- **PrimaryStorage**: Primary storage space (GB).
- **PrimaryStorageType**: Primary storage type (HDD, SSD, Flash Storage, Hybrid).
- **SecondaryStorage**: Secondary storage space if any (GB).
- **SecondaryStorageType**: Secondary storage type (HDD, SSD, Hybrid, None).
- **GPU_company**: Company that manufactured the GPU.
- **GPU_model**: Model of the GPU.

## Data Cleaning and Preparation 

We will be converting the data type of the 'RetinaDisplay' column from an object into a boolean column.

In [102]:
# Value count for RetinaDisplay
laptops.RetinaDisplay.value_counts()

RetinaDisplay
No     1258
Yes      17
Name: count, dtype: int64

In [103]:
# Converting RetinaDisplay column to boolean 
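# Map 'Yes'/'No' to booleans and assign the result back to the column,
# rather than relying on an in-place replace on the column itself
laptops['RetinaDisplay'] = laptops['RetinaDisplay'].map({'Yes': True, 'No': False}).astype('bool')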
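The same conversion can be written so that unexpected values are caught instead of being silently coerced. This is a sketch on a hypothetical frame (`sample` and `yes_no_map` are illustrative names); the notebook itself only converts 'RetinaDisplay'.

```python
import pandas as pd

# Hypothetical sample using the same Yes/No encoding as the dataset
sample = pd.DataFrame({
    "Touchscreen": ["No", "Yes", "No"],
    "RetinaDisplay": ["Yes", "No", "No"],
})

yes_no_map = {"Yes": True, "No": False}
for col in ["Touchscreen", "RetinaDisplay"]:
    mapped = sample[col].map(yes_no_map)
    # Unmapped values become NaN, which astype(bool) would turn into True
    if mapped.isna().any():
        raise ValueError(f"Unexpected value in column {col}")
    sample[col] = mapped.astype(bool)

print(sample.dtypes)
```

The explicit check matters because `astype(bool)` treats NaN as truthy, so a misspelt 'yes' would otherwise end up as `True` without any warning.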

We are also going to check for duplicate rows in this dataset.

In [104]:
# Checking for duplicate rows
laptops[laptops.duplicated()]

Unnamed: 0,Company,Product,TypeName,Inches,Ram,OS,Weight,Price_euros,Screen,ScreenW,...,RetinaDisplay,CPU_company,CPU_freq,CPU_model,PrimaryStorage,SecondaryStorage,PrimaryStorageType,SecondaryStorageType,GPU_company,GPU_model


Fortunately, there are no duplicate rows to remove.
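If duplicates had been present, they could be counted and dropped as in the sketch below. The frame `sample` is hypothetical and contains one repeated row purely for illustration.

```python
import pandas as pd

# Hypothetical frame in which the second row repeats the first
sample = pd.DataFrame({
    "Company": ["HP", "HP", "Dell"],
    "Ram": [8, 8, 16],
    "Price_euros": [575.0, 575.0, 899.0],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```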

## Feature Engineering 

### Frequency Encoding

In [105]:
# Frequency encoding 
freq_cols = ['Product', 'GPU_model', 'CPU_model']

for feature in freq_cols:
    freq_encoding = laptops[feature].value_counts().to_dict()
    col_name = f'encoded_{feature}'
    laptops[col_name] = laptops[feature].map(freq_encoding)
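Frequency encoding replaces each category with the number of rows in which it appears. The loop above computes those counts on the full dataset. A common variant, sketched here on hypothetical `train` and `test` frames, learns the counts from the training rows only and maps categories unseen during training to 0, so that no information from the test rows enters the encoding.

```python
import pandas as pd

# Hypothetical train/test frames; 'Celeron' never appears in the training rows
train = pd.DataFrame({"CPU_model": ["Core i5", "Core i5", "Core i7", "Core i5"]})
test = pd.DataFrame({"CPU_model": ["Core i7", "Celeron"]})

# Learn category counts from the training rows only
counts = train["CPU_model"].value_counts()

train["encoded_CPU_model"] = train["CPU_model"].map(counts)
# Categories missing from `counts` map to NaN; treat them as frequency 0
test["encoded_CPU_model"] = test["CPU_model"].map(counts).fillna(0).astype(int)

print(train)
print(test)
```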

In [106]:
laptops.head()

Unnamed: 0,Company,Product,TypeName,Inches,Ram,OS,Weight,Price_euros,Screen,ScreenW,...,CPU_model,PrimaryStorage,SecondaryStorage,PrimaryStorageType,SecondaryStorageType,GPU_company,GPU_model,encoded_Product,encoded_GPU_model,encoded_CPU_model
0,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1339.69,Standard,2560,...,Core i5,128,0,SSD,No,Intel,Iris Plus Graphics 640,10,8,12
1,Apple,Macbook Air,Ultrabook,13.3,8,macOS,1.34,898.94,Standard,1440,...,Core i5,128,0,Flash Storage,No,Intel,HD Graphics 6000,2,5,12
2,HP,250 G6,Notebook,15.6,8,No OS,1.86,575.0,Full HD,1920,...,Core i5 7200U,256,0,SSD,No,Intel,HD Graphics 620,21,279,193
3,Apple,MacBook Pro,Ultrabook,15.4,16,macOS,1.83,2537.45,Standard,2880,...,Core i7,512,0,SSD,No,AMD,Radeon Pro 455,10,1,4
4,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1803.6,Standard,2560,...,Core i5,256,0,SSD,No,Intel,Iris Plus Graphics 650,10,2,12


## Feature Importances

In [107]:
# Separating feature and target columns
features = laptops.drop(columns = ['Price_euros', 'Product', 'GPU_model', 'CPU_model', 'Company',
                                  'GPU_company', 'PrimaryStorageType', 'CPU_company', 'SecondaryStorageType', 'IPSpanel', 'OS']).columns
target = ['Price_euros']

X = laptops[features]
y = laptops[target]

# Variables for different data type columns
num_cols = X.select_dtypes(include = ['int64', 'float64']).columns
cat_cols = X.select_dtypes(include = 'object').columns
bool_cols = X.select_dtypes(include = 'bool').columns

# Initialising ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers = [
        ('num_vals', StandardScaler(), num_cols),
        # `sparse_output` replaces the `sparse` argument from scikit-learn 1.2 onwards
        ('cat_vals', OneHotEncoder(sparse_output = False, drop = 'first'), cat_cols),
        ('bool_vals', 'passthrough', bool_cols)
    ]
)

In [108]:
# Apply the transformations to the full feature set (before splitting)
X_preprocessed = preprocessor.fit_transform(X)
X_preprocessed = pd.DataFrame(X_preprocessed, columns=preprocessor.get_feature_names_out())

# Split the data into train and test sets
x_train_processed, x_test_processed, y_train_processed, y_test_processed = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

In [109]:
# Initialize the GradientBoostingRegressor
dtr = GradientBoostingRegressor(random_state = 0)

# Fit the model to the training data
dtr.fit(x_train_processed, y_train_processed.values.ravel())

In [110]:
# Get feature importances
importances = dtr.feature_importances_

# Create a DataFrame to view feature importances
feature_importances = pd.DataFrame({'feature': x_train_processed.columns, 'importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

# Print all features ranked by importance
print(feature_importances)

                           feature  importance
1                    num_vals__Ram    0.514293
13     cat_vals__TypeName_Notebook    0.100380
2                 num_vals__Weight    0.083947
5               num_vals__CPU_freq    0.078700
3                num_vals__ScreenW    0.040085
0                 num_vals__Inches    0.029175
6         num_vals__PrimaryStorage    0.028524
15  cat_vals__TypeName_Workstation    0.027135
9      num_vals__encoded_GPU_model    0.026714
4                num_vals__ScreenH    0.018061
8        num_vals__encoded_Product    0.014517
10     num_vals__encoded_CPU_model    0.010914
16        cat_vals__Screen_Full HD    0.009090
7       num_vals__SecondaryStorage    0.004782
18       cat_vals__Screen_Standard    0.003508
11       cat_vals__TypeName_Gaming    0.002987
17       cat_vals__Screen_Quad HD+    0.002912
19       cat_vals__Touchscreen_Yes    0.002364
14    cat_vals__TypeName_Ultrabook    0.001682
20        bool_vals__RetinaDisplay    0.000229
12      cat_v

The amount of RAM is by far the most important feature, accounting for about 51% of the total importance. It is followed by whether the laptop is a Notebook (about 10%), the weight of the laptop (about 8%) and CPU frequency (about 8%). Columns such as 'GPU_company', 'PrimaryStorageType' and 'CPU_company' are excluded from the feature set used for modelling.
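The values in `feature_importances_` are impurity-based and computed from the training data. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data (`X_demo`, `y_demo` and the coefficients are assumptions made for illustration) rather than the laptop dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Synthetic target: strong dependence on column 0, weak on column 1, none on column 2
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out data and record the drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
print(result.importances_mean)
```

If both methods agree on the ranking for the real features, that gives more confidence in the conclusion drawn from the table above.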

## Model Selection and Evaluation

For this project, we will use GradientBoostingRegressor as our predictive model. We will first test its performance without tuning any parameters, using the R² score, cross-validation and RMSE (Root Mean Squared Error).
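For a regressor, the `score` method of a scikit-learn estimator or pipeline returns R² rather than a classification accuracy. As a small worked example on made-up prices (`y_true` and `y_hat` are hypothetical), both metrics can be computed by hand and compared with the library functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true prices and predictions, each off by exactly 100 euros
y_true = np.array([500.0, 1000.0, 1500.0, 2000.0])
y_hat = np.array([600.0, 900.0, 1600.0, 1900.0])

# RMSE: square root of the mean squared error, in the units of the target
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

Here the RMSE is 100 and R² is 0.968. RMSE is expressed in euros, while R² is unitless and measures the share of variance explained.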

In [111]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Initialising Gradient Boosting Regressor
gbr = GradientBoostingRegressor(random_state = 0)
pipeline_gbr = Pipeline([('preprocessing', preprocessor), ('regressor', gbr)])

# Fitting pipeline to training data
pipeline_gbr.fit(x_train, y_train.values.ravel())
y_pred = pipeline_gbr.predict(x_test)

In [112]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline_gbr, x_train, y_train.values.ravel(), cv=5, scoring='neg_mean_squared_error')

# Convert the negative MSE to positive and take the square root
rmse_scores = (-cv_scores) ** 0.5

# Pipeline R² score on the training and test sets
train_score = pipeline_gbr.score(x_train, y_train)
test_score = pipeline_gbr.score(x_test, y_test)
print(f'Gradient Boosting Regressor Train Score: {train_score}')
print(f'Gradient Boosting Regressor Test Score: {test_score}')

# Display the RMSE for each fold and the average RMSE
print("RMSE for each fold: ", rmse_scores)
print("Average RMSE: ", rmse_scores.mean())

Gradient Boosting Regressor Train Score: 0.9045556095542848
Gradient Boosting Regressor Test Score: 0.782755017216914
RMSE for each fold:  [314.03113244 254.5806023  250.06493636 341.64101535 327.88276715]
Average RMSE:  297.6400907199877


In [113]:
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse}")

# Baseline RMSE: predicting the mean of y_test for all observations
baseline_pred = np.full(len(y_test), y_test.values.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(f"Baseline RMSE: {baseline_rmse}")

# Improvement over baseline
improvement = (baseline_rmse - rmse) / baseline_rmse * 100
print(f"Improvement over baseline: {improvement:.2f}%")

RMSE: 357.89341156454367
Baseline RMSE: 767.8543092584338
Improvement over baseline: 53.39%
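The baseline above predicts the mean of the test targets. A stricter variant, sketched here with scikit-learn's `DummyRegressor` on made-up numbers, learns the mean from the training targets only. The arrays and the helper `improvement_pct` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def improvement_pct(baseline_rmse, model_rmse):
    """Relative RMSE reduction over the baseline, in percent."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100

# Hypothetical targets; features are irrelevant to a mean-only baseline
y_train_demo = np.array([400.0, 800.0, 1200.0, 1600.0])
y_test_demo = np.array([500.0, 1500.0])
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# The dummy model always predicts the training mean (1000 here)
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_rmse_demo = np.sqrt(
    mean_squared_error(y_test_demo, baseline.predict(X_test_demo))
)
print(baseline_rmse_demo)

# Reproducing the percentage printed above from the two reported RMSE values
print(round(improvement_pct(767.8543092584338, 357.89341156454367), 2))
```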


### Hyperparameter Tuning

Next, we will be tuning some parameters such as n_estimators and max_depth to optimise the predictive model and attempt to achieve a better evaluation score. 

In [114]:
# Initialising RandomizedSearchCV and creating parameters for tuning
param_grid = {
    'regressor__n_estimators': [200, 300, 400],
    'regressor__max_depth': [5, 8, 10],
    'regressor__min_samples_split': [3, 5, 8],
    'regressor__subsample': [0.5, 0.7, 1.0]
}

rsv = RandomizedSearchCV(estimator = pipeline_gbr, param_distributions = param_grid, cv=2, scoring='neg_mean_squared_error', 
                         verbose = 1, n_iter =10, n_jobs = -1)
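`RandomizedSearchCV` samples parameter combinations at random, so the selected parameters can differ from run to run unless `random_state` is set. The sketch below shows this on synthetic data with a deliberately small search space. All names, data and values are assumptions for illustration and not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.2, size=120)

demo_pipeline = Pipeline([
    ("preprocessing", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=0)),
])

# The 'regressor__' prefix routes each parameter to the pipeline step of that name
demo_params = {
    "regressor__n_estimators": [20, 40],
    "regressor__max_depth": [2, 3],
    "regressor__subsample": [0.7, 1.0],
}

def run_search(seed):
    """Run a small randomized search with a fixed seed."""
    search = RandomizedSearchCV(
        estimator=demo_pipeline,
        param_distributions=demo_params,
        n_iter=4,
        cv=3,
        scoring="neg_mean_squared_error",
        random_state=seed,
    )
    search.fit(X_demo, y_demo)
    return search

first = run_search(42)
second = run_search(42)
# With the same seed, the same candidates are sampled and the same winner is chosen
print(first.best_params_ == second.best_params_)
```

A larger `cv` value than 2 also gives a less noisy estimate for each candidate, at the cost of more fits.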

In [115]:
# Fit rsv using training data 
rsv.fit(x_train, y_train.values.ravel())
best_rf = rsv.best_estimator_

# Evaluate the best estimator model 
y_pred = best_rf.predict(x_test)
print(rsv.best_params_)

Fitting 2 folds for each of 10 candidates, totalling 20 fits
{'regressor__subsample': 0.7, 'regressor__n_estimators': 300, 'regressor__min_samples_split': 5, 'regressor__max_depth': 5}


In [116]:
# Calculating the R² score for the model
train_score = best_rf.score(x_train, y_train)
test_score = best_rf.score(x_test, y_test)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(best_rf, x_train, y_train.values.ravel(), cv=5, scoring='neg_mean_squared_error')

# Convert the negative MSE to positive and take the square root
rmse_scores = (-cv_scores) ** 0.5

In [117]:
# Displaying evaluation metric scores
print(f'Train Score: {train_score}')
print(f'Test Score: {test_score}')
print("RMSE for each fold: ", rmse_scores)
print("Average RMSE: ", rmse_scores.mean())

Train Score: 0.9908798169629444
Test Score: 0.815234550550946
RMSE for each fold:  [290.50384583 245.89660135 269.15674172 345.10344202 338.16858158]
Average RMSE:  297.7658424985268


In [119]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse}")

# Baseline RMSE: predicting the mean of y_test for all observations
baseline_pred = np.full(len(y_test), y_test.values.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(f"Baseline RMSE: {baseline_rmse}")

# Improvement over baseline
improvement = (baseline_rmse - rmse) / baseline_rmse * 100
print(f"Improvement over baseline: {improvement:.2f}%")

RMSE: 330.05719714082943
Baseline RMSE: 767.8543092584338
Improvement over baseline: 57.02%


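After hyperparameter tuning, the R² score rose from 0.90 to 0.99 on the training set and from 0.78 to 0.82 on the test set. The improvement over the baseline RMSE also increased slightly, from 53.4% to 57.0%. The average cross-validated RMSE is essentially unchanged (297.6 versus 297.8), and the gap between the training and test scores has widened, so tuning gave a modest improvement on the held-out test set while fitting the training data much more closely.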
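The gap between training and test R² gives a rough indication of overfitting. The short calculation below uses the scores printed earlier in this notebook; the dictionary layout is just an illustrative way to organise them.

```python
# R² values copied from the printed outputs of the untuned and tuned models
scores = {
    "untuned": {"train": 0.9045556095542848, "test": 0.782755017216914},
    "tuned": {"train": 0.9908798169629444, "test": 0.815234550550946},
}

# Train-minus-test gap per model; a larger gap suggests more overfitting
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train-test R² gap = {gap:.3f}")
```

The gap is about 0.122 before tuning and about 0.176 after tuning.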

## Conclusion

This project explores the development of a predictive model using a dataset that contains information about different laptops and their features. The goal of the model was to accurately predict the price of these laptops. After performing feature engineering, model selection and hyperparameter tuning, we created a Gradient Boosting Regressor model with an R² score of about 0.99 on the training data and 0.82 on the test data.