# Laptop Price Prediction

- **Dataset:** [Laptop Price Dataset on Kaggle](https://www.kaggle.com/datasets/muhammetvarl/laptop-price)
- **Group:** Group **D**
- **Task:** Develop a robust linear regression model that predicts the price of a laptop from its attributes
- **Random State:** 5
- **Test Size:** 25%

## Importing Dependencies and Initializations


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Initializing constants to be used in model building
RANDOM_STATE = 5
TEST_SIZE = 25 / 100  # i.e. 25%

## Loading the Dataset



In [31]:
df = pd.read_csv("dataset/laptop_price.csv", encoding="latin-1")

# display first 10 rows
df.head(10)
# Display the number of rows and columns
df.shape

(1303, 13)

## Data Cleaning and Preprocessing

1. Identify and handle missing values appropriately.

In [4]:

# Identify missing values per column
df.isnull().sum()
# The result shows that there are no missing values

laptop_ID           0
Company             0
Product             0
TypeName            0
Inches              0
ScreenResolution    0
Cpu                 0
Ram                 0
Memory              0
Gpu                 0
OpSys               0
Weight              0
Price_euros         0
dtype: int64

2. Address outliers in numerical features.

In [39]:

# Remove the "GB" and "kg" suffixes from Ram and Weight, then convert both columns to numeric
df['Ram'] = df['Ram'].str.replace("GB", "")
df['Weight'] = df['Weight'].str.replace("kg", "")
df['Ram'] = df['Ram'].astype('int32')
df['Weight'] = df['Weight'].astype('float32')


# Identify numerical columns (this also selects laptop_ID and the target Price_euros)
numerical_columns = df.select_dtypes(include=np.number).columns

# Find and remove outliers column by column using the 1.5 * IQR rule
for col in numerical_columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Address outliers by keeping only the rows that fall within the bounds
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
df

| | laptop_ID | Company | Product | TypeName | Inches | ScreenResolution | Cpu | Ram | Memory | Gpu | OpSys | Weight | Price_euros |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Apple | MacBook Pro | Ultrabook | 13.3 | 2560x1600 | Intel Core i5 2.3GHz | 8 | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | 1.37 | 1339.69 |
| 1 | 2 | Apple | Macbook Air | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8 | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | 1.34 | 898.94 |
| 2 | 3 | HP | 250 G6 | Notebook | 15.6 | 1920x1080 | Intel Core i5 7200U 2.5GHz | 8 | 256GB SSD | Intel HD Graphics 620 | No OS | 1.86 | 575.00 |
| 4 | 5 | Apple | MacBook Pro | Ultrabook | 13.3 | 2560x1600 | Intel Core i5 3.1GHz | 8 | 256GB SSD | Intel Iris Plus Graphics 650 | macOS | 1.37 | 1803.60 |
| 5 | 6 | Acer | Aspire 3 | Notebook | 15.6 | 1366x768 | AMD A9-Series 9420 3GHz | 4 | 500GB HDD | AMD Radeon R5 | Windows 10 | 2.10 | 400.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1297 | 1315 | Asus | X556UJ-XO044T (i7-6500U/4GB/500GB/GeForce | Notebook | 15.6 | 1366x768 | Intel Core i7 6500U 2.5GHz | 4 | 500GB HDD | Nvidia GeForce 920M | Windows 10 | 2.20 | 720.32 |
| 1298 | 1316 | Lenovo | Yoga 500-14ISK | 2 in 1 Convertible | 14.0 | 1920x1080 | Intel Core i7 6500U 2.5GHz | 4 | 128GB SSD | Intel HD Graphics 520 | Windows 10 | 1.80 | 638.00 |
| 1300 | 1318 | Lenovo | IdeaPad 100S-14IBR | Notebook | 14.0 | 1366x768 | Intel Celeron Dual Core N3050 1.6GHz | 2 | 64GB Flash Storage | Intel HD Graphics | Windows 10 | 1.50 | 229.00 |
| 1301 | 1319 | HP | 15-AC110nv (i7-6500U/6GB/1TB/Radeon | Notebook | 15.6 | 1366x768 | Intel Core i7 6500U 2.5GHz | 6 | 1TB HDD | AMD Radeon R5 M330 | Windows 10 | 2.19 | 764.00 |


3. Convert categorical features into numerical representations suitable for machine learning algorithms. Clearly document your chosen encoding strategies.

In [None]:
# Encoding Categorical Features
from sklearn.preprocessing import LabelEncoder

# Create a copy of the original dataframe
df_encoded = df.copy()

# Determine categorical features
categorical_columns = df.select_dtypes(include="object").columns
print("Categorical columns:", list(categorical_columns))

"""
One-Hot Encoding Strategy:
Columns: Company, TypeName, OpSys
Reason: Low number of unique values, nominal (no natural order), important for model
"""
one_hot_columns = ["Company", "TypeName", "OpSys"]

# Apply one-hot encoding
for col in one_hot_columns:
    # Create dummy variables
    dummies = pd.get_dummies(df_encoded[col], prefix=col, dtype=int)
    # Concatenate with the main dataframe
    df_encoded = pd.concat([df_encoded, dummies], axis=1)
    # Drop the original column
    df_encoded.drop(col, axis=1, inplace=True)

"""
Label Encoding Strategy:
Columns: Product, ScreenResolution, Cpu, Memory, Gpu
Reason: High number of unique values - one-hot encoding would create too many columns
"""
label_encoded_columns = ["Product", "ScreenResolution", "Cpu", "Memory", "Gpu"]

for col in label_encoded_columns:
    le = LabelEncoder()  # Create new encoder for each column
    df_encoded[col + "_encoded"] = le.fit_transform(df_encoded[col])
    # Drop the original column
    df_encoded.drop(col, axis=1, inplace=True)

print(f"Original shape: {df.shape}")
print(f"Encoded shape: {df_encoded.shape}")
print(f"Columns after encoding: {df_encoded.shape[1]}")


Categorical columns: ['Company', 'Product', 'TypeName', 'ScreenResolution', 'Cpu', 'Memory', 'Gpu', 'OpSys']
Original shape: (1008, 13)
Encoded shape: (1008, 43)
Columns after encoding: 43


## Exploratory Data Analysis (EDA)

1. Conduct a comprehensive EDA to understand the distribution of features and their relationships with the price.

2. Generate relevant visualizations (e.g., histograms, scatter plots, box plots, and correlation heatmaps) to illustrate key insights.

3. Identify and discuss the most influential factors affecting laptop prices based on your analysis.
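
As a starting point for steps 1 and 3, the sketch below ranks the numeric features by their correlation with `Price_euros` and compares median prices across one categorical column. The helper names `price_correlations` and `median_price_by` are hypothetical and not part of the notebook, and the sketch assumes `Ram` and `Weight` have already been converted to numeric as in the cleaning step.

```python
import pandas as pd


def price_correlations(frame: pd.DataFrame, target: str = "Price_euros") -> pd.Series:
    """Rank numeric features by the absolute value of their correlation with the target."""
    numeric = frame.select_dtypes(include="number")
    # Pearson correlation of every numeric column with the target, target itself excluded
    corr = numeric.corr()[target].drop(target)
    # Order by strength of the relationship while keeping the sign of each value
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def median_price_by(frame: pd.DataFrame, column: str, target: str = "Price_euros") -> pd.Series:
    """Median target value per category, highest first."""
    return frame.groupby(column)[target].median().sort_values(ascending=False)
```

In the notebook this could be called as `price_correlations(df)` and `median_price_by(df, "Company")`. The full correlation matrix can then be drawn with `sns.heatmap(df.corr(numeric_only=True), annot=True)`, which fits the heatmap mentioned in step 2.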

## Feature Engineering

1. Create new and meaningful features from the existing dataset where possible.

2. Justify the creation of any new features and explain how they might improve model performance.
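
One possible set of engineered features is sketched below: pixel density derived from `ScreenResolution` and `Inches`, and CPU clock speed parsed from `Cpu`. The function name `add_engineered_features` and the new column names (`ScreenWidth`, `ScreenHeight`, `PPI`, `CpuGHz`) are hypothetical. The sketch assumes the string formats match the rows shown earlier (a `WIDTHxHEIGHT` pair and a trailing `...GHz` value), and it would have to run on `df` before the encoding step, since that step drops the original `ScreenResolution` and `Cpu` columns.

```python
import numpy as np
import pandas as pd


def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with screen and CPU features added."""
    out = frame.copy()

    # Extract the pixel dimensions, e.g. "2560x1600" -> 2560.0 and 1600.0
    resolution = out["ScreenResolution"].str.extract(r"(\d+)x(\d+)").astype(float)
    out["ScreenWidth"] = resolution[0]
    out["ScreenHeight"] = resolution[1]

    # Pixels per inch: diagonal length in pixels divided by diagonal length in inches
    out["PPI"] = np.sqrt(resolution[0] ** 2 + resolution[1] ** 2) / out["Inches"]

    # Clock speed in GHz, e.g. "Intel Core i5 2.3GHz" -> 2.3
    out["CpuGHz"] = (
        out["Cpu"].str.extract(r"(\d+(?:\.\d+)?)GHz", expand=False).astype(float)
    )
    return out
```

The justification for a linear model is that these columns are true numeric quantities: a higher `PPI` or `CpuGHz` has a consistent direction of effect, whereas the integer codes produced by label encoding carry no meaningful order.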

## Model Building

1. Split your data into training and testing sets.

2. Build a linear regression model for laptop price prediction.

3. Train the regression model on the training data.
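
The three steps above can be sketched as follows, using the random state of 5 and the 25% test size stated at the top of the notebook. The function name `split_and_train` is hypothetical. The sketch assumes the input frame is fully numeric (for example `df_encoded`), and dropping `laptop_ID` is an assumption on the grounds that it is an identifier rather than a laptop attribute.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

RANDOM_STATE = 5
TEST_SIZE = 0.25  # 25% of the rows are held out for testing


def split_and_train(encoded: pd.DataFrame, target: str = "Price_euros",
                    id_column: str = "laptop_ID"):
    """Split the encoded data and fit a linear regression on the training part."""
    # Features are every column except the target and the identifier (if present)
    X = encoded.drop(columns=[target, id_column], errors="ignore")
    y = encoded[target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
    )

    model = LinearRegression()
    model.fit(X_train, y_train)
    return model, X_train, X_test, y_train, y_test
```

In the notebook this could be called as `model, X_train, X_test, y_train, y_test = split_and_train(df_encoded)`.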


## Model Evaluation

1. Evaluate the performance of your trained model on the test set using appropriate regression metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²).
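
A minimal sketch of the three metrics named above, built on scikit-learn's metric functions. The helper name `regression_report` is hypothetical. RMSE is computed as the square root of the mean squared error, so it is expressed in the same unit as the price (euros).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred) -> dict:
    """Return RMSE, MAE, and R² for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "RMSE": float(np.sqrt(mse)),  # penalizes large errors more strongly
        "MAE": float(mean_absolute_error(y_true, y_pred)),  # average absolute error
        "R2": float(r2_score(y_true, y_pred)),  # share of variance explained
    }
```

Given a fitted model and a held-out split, this could be called as `regression_report(y_test, model.predict(X_test))`.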

## Save Model and Preprocessing Objects
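
One way to persist the trained model together with its preprocessing objects is `joblib`, sketched below. The function names, the file names, and the `artifacts` directory are hypothetical. Note that the encoding cell reuses a single `le` variable, so the fitted `LabelEncoder` objects would first have to be collected, for example in a dictionary keyed by column name, before they can be saved; that dictionary is an assumption here.

```python
from pathlib import Path

import joblib


def save_artifacts(model, preprocessing, directory="artifacts"):
    """Write the model and the preprocessing objects to two joblib files."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)

    model_path = target / "linear_regression.joblib"
    preprocessing_path = target / "preprocessing.joblib"

    joblib.dump(model, model_path)
    # `preprocessing` can be any picklable object, e.g. a dict of fitted encoders
    joblib.dump(preprocessing, preprocessing_path)
    return model_path, preprocessing_path


def load_artifacts(directory="artifacts"):
    """Load the model and preprocessing objects written by save_artifacts."""
    target = Path(directory)
    model = joblib.load(target / "linear_regression.joblib")
    preprocessing = joblib.load(target / "preprocessing.joblib")
    return model, preprocessing
```

Saving the encoders and the final column order alongside the model matters because new laptops must be transformed in exactly the same way as the training data before `model.predict` is called.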