## Laptop Price Prediction - Data Cleaning
This notebook covers all steps to clean and prepare the dataset for modeling.
We will:
1. Drop unnecessary columns.
2. Convert text columns (`Ram`, `Weight`) to numeric.
3. Parse complex columns (`Memory`, `ScreenResolution`).
4. Simplify `Cpu` and `Gpu` to brand names.
5. Encode categorical features.
6. Prepare the target variable.
7. Save the cleaned dataset.

### Step 0: Import, Load & Preview
- Import `pandas` as `pd` and `numpy` as `np`.
- Load the dataset from the raw CSV, read with `ISO-8859-1` encoding.
- Display the column names for an overview of the data.
- Display the first few rows of the dataset.

In [None]:
# 01_data_cleaning.ipynb
import pandas as pd
import numpy as np

In [None]:
# Load dataset
data = pd.read_csv("../data/raw/laptop_price.csv", encoding="ISO-8859-1")

In [None]:
# Show the column names
print("Columns in dataset:")
print(data.columns.tolist())

In [None]:
data.head()  # Display the first few rows of the dataset
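
Beyond the column names, a quick look at dtypes and missing-value counts shows which columns still need cleaning. The following is a minimal sketch on a made-up three-row frame; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the raw data; values are invented for illustration.
toy = pd.DataFrame({
    "Company": ["Apple", "HP", "Dell"],
    "Ram": ["8GB", "16GB", "4GB"],
    "Weight": ["1.37kg", None, "2.2kg"],
    "Price_euros": [1000.5, 750.0, 420.0],
})

# One row per column: its dtype and how many values are missing.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
})
print(overview)
```

Columns that show a text dtype here (such as `Ram` and `Weight`) are the ones the later steps convert.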

### Step 1: Drop Unnecessary Columns
We drop `laptop_ID` and `Product`. `laptop_ID` is only a row identifier, and `Product` is not used as a predictor of price in this notebook.

In [None]:
# Drop columns
data = data.drop(['laptop_ID', 'Product'], axis=1)
# print(data.columns.tolist())
# data.head(1)

### Step 2: Convert RAM To Numeric
The `Ram` column contains text like "8GB". We remove "GB" and convert the column to integers for modeling.

In [None]:
data['Ram'] = data['Ram'].str.replace('GB', '').astype(int)
data['Ram'].head()
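
A direct `astype(int)` raises an error as soon as one value does not fit the expected pattern. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` so that malformed entries become `NaN` and can be inspected before casting. The sample values, including the malformed one, are hypothetical.

```python
import pandas as pd

# Hypothetical sample; "unknown" stands in for a malformed entry.
ram = pd.Series(["8GB", "16GB", " 4GB ", "unknown"])

# Strip whitespace and the unit, then coerce anything non-numeric to NaN.
ram_gb = pd.to_numeric(
    ram.str.strip().str.replace("GB", "", regex=False),
    errors="coerce",
)

# Rows that failed to parse, for manual inspection.
bad_rows = ram[ram_gb.isna()]
print(bad_rows.tolist())
```

If `bad_rows` is empty, the column can safely be cast with `astype(int)`.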

### Step 3: Convert Weight To Numeric
The `Weight` column contains text like "1.37kg". We remove "kg" and convert it to float.

In [None]:
data['Weight'] = data['Weight'].str.replace('kg', '').astype(float)
data['Weight'].head()
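
Removing the literal "kg" works when every value follows exactly that pattern. A slightly more tolerant sketch, assuming variants such as a trailing "s" or a space before the unit might occur (these sample values are hypothetical), extracts the number with a regular expression instead:

```python
import pandas as pd

# Hypothetical sample with small formatting variations.
weight = pd.Series(["1.37kg", "2kg", "4kgs", "0.92 kg"])

# Capture an integer or decimal number that is followed by "kg".
weight_kg = weight.str.extract(r"(\d+(?:\.\d+)?)\s*kg", expand=False).astype(float)
print(weight_kg.tolist())
```

Values that do not match the pattern come back as `NaN` rather than raising, so they can be reviewed separately.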

### Step 4: Parse Memory Column
The `Memory` column contains text like "256GB SSD + 1TB HDD".
We will split it into separate columns for SSD, HDD, Hybrid, and Flash Storage, and convert every size to GB (1 TB = 1024 GB).

In [None]:
# Create new columns with default 0
data['SSD'] = 0
data['HDD'] = 0
data['Hybrid'] = 0
data['Flash_Storage'] = 0

# Function to convert memory strings to sizes in GB.
# Sizes go through float() first so that decimal values such as "1.0TB" also parse.
def convert_memory(mem):
    mem = str(mem)
    ssd = hdd = hybrid = flash = 0
    parts = mem.split('+')
    for part in parts:
        part = part.strip()
        if 'SSD' in part and 'GB' in part:
            ssd += int(float(part.replace('SSD', '').replace('GB', '').strip()))
        elif 'SSD' in part and 'TB' in part:
            ssd += int(float(part.replace('SSD', '').replace('TB', '').strip()) * 1024)
        elif 'HDD' in part and 'GB' in part:
            hdd += int(float(part.replace('HDD', '').replace('GB', '').strip()))
        elif 'HDD' in part and 'TB' in part:
            hdd += int(float(part.replace('HDD', '').replace('TB', '').strip()) * 1024)
        elif 'Hybrid' in part and 'GB' in part:
            hybrid += int(float(part.replace('Hybrid', '').replace('GB', '').strip()))
        elif 'Hybrid' in part and 'TB' in part:
            hybrid += int(float(part.replace('Hybrid', '').replace('TB', '').strip()) * 1024)
        elif 'Flash Storage' in part and 'GB' in part:
            flash += int(float(part.replace('Flash Storage', '').replace('GB', '').strip()))
        elif 'Flash Storage' in part and 'TB' in part:
            flash += int(float(part.replace('Flash Storage', '').replace('TB', '').strip()) * 1024)
    # Order matches the target columns: SSD, HDD, Hybrid, Flash_Storage
    return pd.Series([ssd, hdd, hybrid, flash])

data[['SSD', 'HDD', 'Hybrid', 'Flash_Storage']] = data['Memory'].apply(convert_memory)
data = data.drop('Memory', axis=1)
data.head()
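
The same parsing idea can be expressed more compactly with a single regular expression. This is a sketch, not the notebook's method: `parse_memory` and `MEMORY_PART` are hypothetical names, and the pattern assumes each part looks like "<number><GB|TB> <type>".

```python
import re

# A number (optionally decimal), a unit, then the storage type.
MEMORY_PART = re.compile(r"(\d+(?:\.\d+)?)\s*(GB|TB)\s+(SSD|HDD|Hybrid|Flash Storage)")

def parse_memory(mem):
    """Return the capacity in GB per storage type for one Memory string."""
    totals = {"SSD": 0, "HDD": 0, "Hybrid": 0, "Flash Storage": 0}
    for size, unit, kind in MEMORY_PART.findall(str(mem)):
        # Convert TB to GB with the same factor used in this step (1024).
        gb = float(size) * (1024 if unit == "TB" else 1)
        totals[kind] += int(gb)
    return totals

print(parse_memory("256GB SSD + 1TB HDD"))
```

Checking a handful of representative strings this way is a cheap test that every storage type and unit is handled before the function is applied to the whole column.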

### Step 5: Parse ScreenResolution
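We will extract:
1. `X_res` (horizontal resolution in pixels)
2. `Y_res` (vertical resolution in pixels)
3. `Touchscreen` (1 if the description mentions a touchscreen, otherwise 0)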

In [None]:
# Touchscreen column
data['Touchscreen'] = data['ScreenResolution'].apply(lambda x: 1 if 'Touchscreen' in x else 0)

# Extract X and Y resolution from the "<width>x<height>" pattern.
# Matching both numbers at once avoids picking up the "4" in "4K Ultra HD".
resolution = data['ScreenResolution'].str.extract(r'(\d+)x(\d+)').astype(int)
data['X_res'] = resolution[0]
data['Y_res'] = resolution[1]

# Drop original ScreenResolution column
data = data.drop('ScreenResolution', axis=1)
data.head()
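
Resolution strings can contain other digits before the "<width>x<height>" pair. The sketch below uses hypothetical sample strings to compare two extraction strategies. Splitting on "x" and taking the first run of digits returns 4 for a "4K" description, while matching both numbers around the "x" returns the actual width.

```python
import pandas as pd

# Hypothetical sample descriptions in the format this step parses.
screens = pd.Series([
    "Full HD 1920x1080",
    "IPS Panel Retina Display 2560x1600",
    "4K Ultra HD / Touchscreen 3840x2160",
])

# Strategy A: split on "x", then take the first digits of the left part.
naive_x = (
    screens.str.split("x").str[0]
    .str.extract(r"(\d+)", expand=False)
    .astype(int)
)

# Strategy B: match "<digits>x<digits>" in one pattern.
res = screens.str.extract(r"(\d+)x(\d+)").astype(int)
res.columns = ["X_res", "Y_res"]

# Touchscreen flag as 0/1.
touch = screens.str.contains("Touchscreen", regex=False).astype(int)

print(naive_x.tolist())
print(res["X_res"].tolist())
```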

### Step 6: Simplify CPU & GPU
We will keep only the CPU and GPU brand names (the first word of each description) for modeling.

In [None]:
# CPU brand
data['Cpu_brand'] = data['Cpu'].apply(lambda x: x.split()[0])
data = data.drop('Cpu', axis=1)

# GPU brand
data['Gpu_brand'] = data['Gpu'].apply(lambda x: x.split()[0])
data = data.drop('Gpu', axis=1)

data.head()
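
Taking the first word can also be done with vectorized string methods, and counting the resulting brands is a quick way to spot rare categories before encoding. A minimal sketch with hypothetical sample descriptions:

```python
import pandas as pd

# Hypothetical sample descriptions; the brand is assumed to be the first word.
cpu = pd.Series(["Intel Core i5 2.3GHz", "AMD A9-Series 9420 3GHz", "Intel Core i7 2.7GHz"])

# Split once on whitespace and keep the first token.
cpu_brand = cpu.str.split(n=1).str[0]

# Frequency of each brand; very rare brands produce sparse dummy columns.
brand_counts = cpu_brand.value_counts()
print(brand_counts)
```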

### Step 7: Encode Categorical Features
We will one-hot encode `Company`, `TypeName`, `Cpu_brand`, `Gpu_brand`, and `OpSys`.

In [None]:
# Company is included so that no text column is left in the feature set
data = pd.get_dummies(data, columns=['Company', 'TypeName', 'Cpu_brand', 'Gpu_brand', 'OpSys'], drop_first=True)
data.head()
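
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. With three categories, only two indicator columns are produced; the first category in sorted order becomes the baseline, represented by zeros in both columns. `dtype=int` is an optional choice made here so the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "OpSys": ["Windows 10", "macOS", "Linux", "macOS"],
    "Ram": [8, 16, 4, 8],
})

# "Linux" sorts first, so it is dropped and becomes the baseline category.
encoded = pd.get_dummies(toy, columns=["OpSys"], drop_first=True, dtype=int)
print(encoded)
```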

### Step 8: Prepare Target Variable
Ensure `Price_euros` is numeric (non-numeric entries become `NaN`), then drop any rows with missing values.

In [None]:
data['Price_euros'] = pd.to_numeric(data['Price_euros'], errors='coerce')
data = data.dropna()  # Drop rows with missing values
data.head()
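
Because `errors='coerce'` silently turns unparseable prices into `NaN`, it can help to count how many rows are affected before dropping them. A minimal sketch with invented values, where `subset` restricts the drop to the target column:

```python
import pandas as pd

# Hypothetical toy frame; "n/a" and None stand in for unusable prices.
toy = pd.DataFrame({
    "Price_euros": ["1000.50", "750.00", "n/a", None],
    "Ram": [8, 8, 16, 4],
})

toy["Price_euros"] = pd.to_numeric(toy["Price_euros"], errors="coerce")

# How many target values could not be parsed or were missing.
n_missing = int(toy["Price_euros"].isna().sum())

# Drop only the rows where the target itself is missing.
clean = toy.dropna(subset=["Price_euros"])
print(n_missing, len(clean))
```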

### Step 9: Save Cleaned Dataset
We save the cleaned dataset for modeling.

In [None]:
data.to_csv("../data/processed/clean_laptop_price.csv", index=False)
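
A quick round trip confirms that the saved file reloads with the expected shape and purely numeric columns. This sketch writes a made-up frame to a temporary directory rather than the project path, so the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame; values are invented for illustration.
toy = pd.DataFrame({"Ram": [8, 16], "Weight": [1.37, 2.1], "Price_euros": [1000.5, 750.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean_toy.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded data should match in shape and contain only numeric columns.
same_shape = reloaded.shape == toy.shape
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in reloaded.dtypes)
print(same_shape, all_numeric)
```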