## Laptop Price Prediction - Data Cleaning
This notebook covers all steps to clean and prepare the dataset for modeling.
We will:
1. Drop unnecessary columns.
2. Convert text columns to numeric.
3. Parse complex columns (Memory, ScreenResolution).
4. Encode categorical features.
5. Prepare the target variable.

### Step 0: Import, Load & Preview
- Import `pandas` as `pd` and `numpy` as `np`.
- Load the dataset.
- Display the column names for an overview of the available features.
- Display the first few rows of the dataset.

In [80]:
# 01_data_cleaning.ipynb
import pandas as pd
import numpy as np

In [81]:
# Load dataset
data = pd.read_csv("../data/raw/laptop_price.csv", encoding="ISO-8859-1")

In [82]:
# Show the column names
print("Columns in dataset:")
print(data.columns.tolist())

Columns in dataset:
['laptop_ID', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price_euros']


In [83]:
data.head()  # Display the first few rows of the dataset

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros
0,1,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69
1,2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94
2,3,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,575.0
3,4,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,2537.45
4,5,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6


### Step 1: Drop Unnecessary Column(s)
We drop the `laptop_ID` column because it is only a row identifier that mirrors the index, so it carries no useful information for predicting the price.

In [84]:
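# Optional checks, kept for reference:
# data.notnull().sum()
# data.head().T
# Count the distinct product names (this matters for the encoding choice in Step 7)
print(len(data['Product'].unique()))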

618


In [85]:
# Drop columns
data = data.drop(['laptop_ID'], axis=1)
# Dropping laptop_ID because it's similar to the index.
print(data.columns.tolist())
# data.head(1)

['Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price_euros']


### Step 2: Convert RAM To Numeric
The `Ram` column contains text like "8GB". We remove "GB" and convert it to integer for modeling.

In [86]:
# regex=False treats 'GB' as a literal string rather than a pattern
data['Ram'] = data['Ram'].str.replace('GB', '', regex=False).astype(int)
data['Ram'].head()

0     8
1     8
2     8
3    16
4     8
Name: Ram, dtype: int64

### Step 3: Convert Weight To Numeric
The `Weight` column contains text like "1.37kg". We remove "kg" and convert it to float.

In [87]:
# regex=False treats 'kg' as a literal string rather than a pattern
data['Weight'] = data['Weight'].str.replace('kg', '', regex=False).astype(float)
data['Weight'].head()

0    1.37
1    1.34
2    1.86
3    1.83
4    1.37
Name: Weight, dtype: float64
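
Steps 2 and 3 both strip a unit suffix and convert what remains to a number. As a minimal sketch of the same idea in plain Python, the hypothetical helper `strip_unit` below (not part of the notebook) raises an error when a value does not end in the expected unit, rather than silently producing a wrong number:

```python
def strip_unit(value: str, unit: str) -> float:
    """Return the numeric part of a string such as '8GB' or '1.37kg'.

    Hypothetical helper for illustration; the notebook itself uses
    pandas string methods on whole columns.
    """
    text = value.strip()
    if not text.endswith(unit):
        raise ValueError(f"expected a value ending in {unit!r}, got {value!r}")
    return float(text[: -len(unit)])


print(strip_unit("8GB", "GB"))     # 8.0
print(strip_unit("1.37kg", "kg"))  # 1.37
```

Failing loudly on an unexpected value makes data-entry problems visible at cleaning time instead of at modeling time.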

### Step 4: Parse Memory Column
The `Memory` column contains text like "256GB SSD + 1TB HDD".
We will split it into separate columns for SSD, HDD, Hybrid, and Flash storage, and convert every size to GB.

In [88]:
# Create new columns with default 0
data['SSD'] = 0
data['HDD'] = 0
data['Hybrid'] = 0
data['Flash_Storage'] = 0

import re

# Function to convert memory strings to numbers.
def convert_memory(mem):
    mem = str(mem)
    ssd = hdd = hybrid = flash = 0  # Start with 0 for all storage types

    # Split by '+'
    parts = mem.split('+')
    for part in parts:
        part = part.strip()

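        # Extract numeric size (allowing decimals such as "1.0TB")
        size_match = re.search(r'(\d+(?:\.\d+)?)', part)
        size = float(size_match.group(1)) if size_match else 0.0

        # Convert TB → GB and store as a whole number of GB
        if "TB" in part:
            size *= 1024
        size = int(size)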

        # Assign to storage type
        if "SSD" in part:
            ssd += size
        elif "HDD" in part:
            hdd += size
        elif "Hybrid" in part:
            hybrid += size
        elif "Flash" in part:  # also matches "Flash Storage"
            flash += size

    return pd.Series([ssd, hdd, hybrid, flash])

# Apply function
data[['SSD', 'HDD', 'Hybrid', 'Flash_Storage']] = data['Memory'].apply(convert_memory)
data = data.drop('Memory', axis=1)
data.head()

Unnamed: 0,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Gpu,OpSys,Weight,Price_euros,SSD,HDD,Hybrid,Flash_Storage
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,Intel Iris Plus Graphics 640,macOS,1.37,1339.69,128,0,0,0
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,Intel HD Graphics 6000,macOS,1.34,898.94,0,0,0,128
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,Intel HD Graphics 620,No OS,1.86,575.0,256,0,0,0
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,AMD Radeon Pro 455,macOS,1.83,2537.45,512,0,0,0
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,Intel Iris Plus Graphics 650,macOS,1.37,1803.6,256,0,0,0
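
The parsing rules can be checked in isolation, without a DataFrame. The block below is a minimal standalone sketch, assuming the same conventions as the notebook (parts separated by `+`, 1 TB counted as 1024 GB). The function name `parse_memory` and the sample string `"1.0TB Hybrid"` are illustrative assumptions, not taken from the notebook.

```python
import re

# (output column, keyword looked for in the text), checked in this order
STORAGE_TYPES = (
    ("SSD", "SSD"),
    ("HDD", "HDD"),
    ("Hybrid", "Hybrid"),
    ("Flash_Storage", "Flash"),
)


def parse_memory(mem: str) -> dict:
    """Split a Memory string into per-type capacities in GB (hypothetical helper)."""
    totals = {column: 0 for column, _ in STORAGE_TYPES}
    for part in mem.split("+"):
        match = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)", part)
        if match is None:
            continue  # no recognisable size in this part
        size = float(match.group(1))
        if match.group(2) == "TB":
            size *= 1024  # same TB -> GB factor as the notebook
        for column, keyword in STORAGE_TYPES:
            if keyword in part:
                totals[column] += int(size)
                break
    return totals


print(parse_memory("256GB SSD + 1TB HDD"))
# {'SSD': 256, 'HDD': 1024, 'Hybrid': 0, 'Flash_Storage': 0}
print(parse_memory("1.0TB Hybrid"))
# {'SSD': 0, 'HDD': 0, 'Hybrid': 1024, 'Flash_Storage': 0}
```

Reading the unit from the same match as the number ties the TB conversion to the size it belongs to.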


### Step 5: Parse ScreenResolution
We will extract:
1. X_resolution
2. Y_resolution
3. Touchscreen (if mentioned)

In [89]:
# data['ScreenResolution'].unique()

In [90]:
# Touchscreen column
data['Touchscreen'] = data['ScreenResolution'].apply(lambda x: 1 if 'Touchscreen' in x else 0)

# Extract X and Y resolution with one raw-string pattern anchored on the "x",
# so digits elsewhere in the label (e.g. the "4" in "4K") are not picked up.
# The raw string also avoids the invalid-escape warning that "(\d+)" triggers.
resolution = data['ScreenResolution'].str.extract(r'(\d+)x(\d+)')
data['X_res'] = resolution[0].astype(int)
data['Y_res'] = resolution[1].astype(int)

# Drop original ScreenResolution column
data = data.drop('ScreenResolution', axis=1)
data.head()


Unnamed: 0,Company,Product,TypeName,Inches,Cpu,Ram,Gpu,OpSys,Weight,Price_euros,SSD,HDD,Hybrid,Flash_Storage,Touchscreen,X_res,Y_res
0,Apple,MacBook Pro,Ultrabook,13.3,Intel Core i5 2.3GHz,8,Intel Iris Plus Graphics 640,macOS,1.37,1339.69,128,0,0,0,0,2560,1600
1,Apple,Macbook Air,Ultrabook,13.3,Intel Core i5 1.8GHz,8,Intel HD Graphics 6000,macOS,1.34,898.94,0,0,0,128,0,1440,900
2,HP,250 G6,Notebook,15.6,Intel Core i5 7200U 2.5GHz,8,Intel HD Graphics 620,No OS,1.86,575.0,256,0,0,0,0,1920,1080
3,Apple,MacBook Pro,Ultrabook,15.4,Intel Core i7 2.7GHz,16,AMD Radeon Pro 455,macOS,1.83,2537.45,512,0,0,0,0,2880,1800
4,Apple,MacBook Pro,Ultrabook,13.3,Intel Core i5 3.1GHz,8,Intel Iris Plus Graphics 650,macOS,1.37,1803.6,256,0,0,0,0,2560,1600
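
The three extracted fields can be tried on individual strings with the standard library alone. This is a minimal sketch, not the notebook's code: the function name `parse_resolution` and the second sample string are assumptions chosen to show a label that contains extra digits.

```python
import re

# WIDTHxHEIGHT, e.g. "1920x1080"
RESOLUTION = re.compile(r"(\d+)x(\d+)")


def parse_resolution(text: str) -> tuple:
    """Return (x_res, y_res, touchscreen) for a ScreenResolution string."""
    match = RESOLUTION.search(text)
    if match is None:
        raise ValueError(f"no WIDTHxHEIGHT pattern in {text!r}")
    touchscreen = int("Touchscreen" in text)
    return int(match.group(1)), int(match.group(2)), touchscreen


print(parse_resolution("IPS Panel Retina Display 2560x1600"))   # (2560, 1600, 0)
print(parse_resolution("4K Ultra HD / Touchscreen 3840x2160"))  # (3840, 2160, 1)
```

Matching width and height together keeps digits in the descriptive label, such as the "4" in "4K", out of the result.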


### Step 6: Simplify CPU & GPU
We keep only the brand name, i.e. the first word of each `Cpu` and `Gpu` string, for modeling.

In [91]:
# CPU brand
data['Cpu_brand'] = data['Cpu'].apply(lambda x: x.split()[0])
data = data.drop('Cpu', axis=1)

# GPU brand
data['Gpu_brand'] = data['Gpu'].apply(lambda x: x.split()[0])
data = data.drop('Gpu', axis=1)

data.head()

Unnamed: 0,Company,Product,TypeName,Inches,Ram,OpSys,Weight,Price_euros,SSD,HDD,Hybrid,Flash_Storage,Touchscreen,X_res,Y_res,Cpu_brand,Gpu_brand
0,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1339.69,128,0,0,0,0,2560,1600,Intel,Intel
1,Apple,Macbook Air,Ultrabook,13.3,8,macOS,1.34,898.94,0,0,0,128,0,1440,900,Intel,Intel
2,HP,250 G6,Notebook,15.6,8,No OS,1.86,575.0,256,0,0,0,0,1920,1080,Intel,Intel
3,Apple,MacBook Pro,Ultrabook,15.4,16,macOS,1.83,2537.45,512,0,0,0,0,2880,1800,Intel,AMD
4,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1803.6,256,0,0,0,0,2560,1600,Intel,Intel


### Step 7: Encode Product Column
The `Product` column contains 618 unique values, which makes one-hot encoding impractical because it would create hundreds of new columns.
To handle this efficiently, we will use **Hash Encoding** (also known as feature hashing), which maps each product into a fixed number of numeric columns (hash buckets).

This reduces dimensionality while still capturing useful patterns from the `Product` names. We will start with 10 hash components, but this number can be tuned (e.g., 5, 10, 20) to balance performance and complexity.
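
To make the idea concrete, here is a minimal sketch of feature hashing using only the standard library. It illustrates the general technique and is not necessarily how `category_encoders` implements it: the helper names `hash_bucket` and `hash_encode` and the choice of MD5 are assumptions made for this example.

```python
import hashlib


def hash_bucket(value: str, n_components: int = 10) -> int:
    """Map a string to one of n_components buckets, deterministically."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(value: str, n_components: int = 10) -> list:
    """Return a fixed-length row with a 1 in the bucket chosen by the hash."""
    row = [0] * n_components
    row[hash_bucket(value, n_components)] = 1
    return row


for product in ["MacBook Pro", "Macbook Air", "250 G6"]:
    print(product, hash_encode(product))
```

With 618 products and only 10 buckets, many different products necessarily land in the same bucket. These collisions are the price of the fixed width, which is why the number of components is worth tuning.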

In [92]:
# data = pd.get_dummies(data, columns=['TypeName', 'Cpu_brand', 'Gpu_brand', 'OpSys'], drop_first=False)
from category_encoders import HashingEncoder

# initialize encoder with 10 hash components
encoder = HashingEncoder(cols=['Product'], n_components=10)

# Fit and transform the dataset
data = encoder.fit_transform(data)

# The `Product` column is automatically replaced with hashed numeric features.
data.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,Price_euros,SSD,HDD,Hybrid,Flash_Storage,Touchscreen,X_res,Y_res,Cpu_brand,Gpu_brand
0,0,0,0,0,0,1,0,0,0,0,...,1339.69,128,0,0,0,0,2560,1600,Intel,Intel
1,0,0,0,0,1,0,0,0,0,0,...,898.94,0,0,0,128,0,1440,900,Intel,Intel
2,0,0,0,0,0,0,0,1,0,0,...,575.0,256,0,0,0,0,1920,1080,Intel,Intel
3,0,0,0,0,0,1,0,0,0,0,...,2537.45,512,0,0,0,0,2880,1800,Intel,AMD
4,0,0,0,0,0,1,0,0,0,0,...,1803.6,256,0,0,0,0,2560,1600,Intel,Intel


### Step 8: Prepare Target Variable
Ensure `Price_euros` is numeric (values that cannot be parsed become `NaN`), then drop any rows with missing values.

In [93]:
data['Price_euros'] = pd.to_numeric(data['Price_euros'], errors='coerce')
data = data.dropna()  # Drop rows with missing values
data.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,Price_euros,SSD,HDD,Hybrid,Flash_Storage,Touchscreen,X_res,Y_res,Cpu_brand,Gpu_brand
0,0,0,0,0,0,1,0,0,0,0,...,1339.69,128,0,0,0,0,2560,1600,Intel,Intel
1,0,0,0,0,1,0,0,0,0,0,...,898.94,0,0,0,128,0,1440,900,Intel,Intel
2,0,0,0,0,0,0,0,1,0,0,...,575.0,256,0,0,0,0,1920,1080,Intel,Intel
3,0,0,0,0,0,1,0,0,0,0,...,2537.45,512,0,0,0,0,2880,1800,Intel,AMD
4,0,0,0,0,0,1,0,0,0,0,...,1803.6,256,0,0,0,0,2560,1600,Intel,Intel


### Step 9: Save Cleaned Dataset
We save the cleaned dataset for modeling.

In [94]:
data.to_csv("../data/processed/laptops_clean.csv", index=False)