# 🧹 Laptop Features Cleaning & Preprocessing

## 🎯 Objective
This notebook focuses on cleaning and standardizing the critical features for laptop price prediction:
- **RAM**: `RAM_SIZE` (Standardizing to GB), `RAM_TYPE` (Handling missings).
- **Storage**: `SSD_SIZE`, `HDD_SIZE`, `STORAGE_SIZE` (Parsing fuzzy strings like 'GO'/'TO', resolving missings).
- **Feature Engineering**: Creating a clean `STORAGE_TYPE` feature.

We will handle French units (`1 TO`, `500 GO`), impute missing values using logic ("Rescue Mission"), and drop only the rows that are completely unusable (~10%).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Setting visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Define file paths
FILE_PATH = 'full_merged_dataset_copy.csv'
OUTPUT_PATH = 'cleaned_dataset_leena.csv'

## 1. Load Data
We start by loading the dataset. We'll treat the 'NeedToBeFilled' placeholder as NaN to make our life easier.
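
As an alternative sketch, `pd.read_csv` accepts an `na_values` argument, so the placeholder can be turned into NaN while the file is parsed. The inline CSV and the name `toy_df` below are invented for illustration and are not part of the real dataset.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (hypothetical rows).
toy_csv = io.StringIO(
    "RAM_SIZE,SSD_SIZE\n"
    "16 GB,512 GO\n"
    "NeedToBeFilled,1 TO\n"
)

# na_values marks the placeholder as missing during parsing,
# so no separate replace step is needed afterwards.
toy_df = pd.read_csv(toy_csv, na_values=['NeedToBeFilled'])
print(toy_df.isna().sum())
```

Either approach should give the same result; doing it at load time simply keeps the placeholder from ever appearing in the frame.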

In [None]:
df = pd.read_csv(FILE_PATH)

# Replace 'NeedToBeFilled' with actual NaN values for easier analysis
df.replace('NeedToBeFilled', np.nan, inplace=True)

print(f"Initial Dataset Shape: {df.shape}")
# Display some rows with mixed units to verify the problem
print(df[['RAM_SIZE', 'SSD_SIZE', 'HDD_SIZE', 'STORAGE_SIZE']].sample(5))

## 2. Helper Function: The 'Standardizer'

### Why this step?
Our data contains messy strings like `500 GO`, `1 TO`, `512 MB`. Computers can't do math on text. 
We need a function that:
1. Finds the number.
2. Identifies the unit (`TB`/`TO` -> *1024, `MB` -> /1024).
3. Returns a clean float in **GB**.
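
The three steps above can also be sketched in vectorized form with `Series.str.extract`, which may be faster than a row-by-row `apply` on a large frame. The sample values and the names `sizes`, `parts`, `factor` and `size_gb` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including French units and a missing entry.
sizes = pd.Series(['16 GB', '512 MO', '1 TO', '500GO', None])

# Step 1 and 2: extract the first number and the unit that directly follows it.
parts = sizes.str.upper().str.extract(r'(\d+(?:\.\d+)?)\s*(TB|TO|GB|GO|MB|MO)?')
number = parts[0].astype(float)

# Multiplier to reach GB; GB/GO or a missing unit falls back to a factor of 1.
factor = parts[1].map({'TB': 1024, 'TO': 1024, 'MB': 1 / 1024, 'MO': 1 / 1024}).fillna(1)

# Step 3: a clean float in GB (NaN where no number was found).
size_gb = number * factor
print(size_gb.tolist())
```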

In [None]:
def parse_size_to_gb(value):
    """Convert a size string such as '500 GO', '1 TO' or '512 MB' to a float in GB."""
    if pd.isna(value) or str(value).strip().lower() in ['nan', 'none', '']:
        return np.nan
    
    text = str(value).upper().strip()
    
    # Find the first number (int or float) and the unit attached to it.
    # French units (TO/GO/MO) are matched directly: a blanket str.replace
    # would also rewrite the 'TO' inside words such as 'STORAGE'.
    match = re.search(r'(\d+(?:\.\d+)?)\s*(?:(TB|TO|GB|GO|MB|MO)\b)?', text)
    if not match:
        return np.nan
        
    number = float(match.group(1))
    unit = match.group(2)
    
    # Logic for unit conversion
    if unit in ('TB', 'TO'):
        return number * 1024
    elif unit in ('MB', 'MO'):
        return number / 1024
    else:
        # Default to GB if no unit is specified or 'GB'/'GO' is found
        return number

# Test the function on edge cases
test_values = ['16 GB', '512 MO', '1 TO', '500GO', 'NeedToBeFilled']
print("Testing Parser:")
for v in test_values:
    print(f"'{v}' -> {parse_size_to_gb(v)} GB")

## 3. Cleaning RAM_SIZE

### What are we doing?
We apply the standardizer to `RAM_SIZE`. 
Then, for the ~7.6% of rows where the value is missing, we impute with the **median**. We choose the median over the mean because it is not skewed by ultra-high-end machines (e.g., 128 GB RAM).
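
A small sketch of why the median is the more robust choice here. The RAM values in `ram` are made up for illustration; only the 128 GB outlier mirrors the example above.

```python
import pandas as pd

# Hypothetical RAM sizes in GB with one extreme machine.
ram = pd.Series([8, 8, 16, 16, 16, 128])

# The single 128 GB value pulls the mean well above a typical laptop,
# while the median stays at a common configuration.
print(f"mean:   {ram.mean():.1f} GB")    # 32.0 GB
print(f"median: {ram.median():.1f} GB")  # 16.0 GB
```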

In [None]:
# 1. Clean
df['RAM_GB'] = df['RAM_SIZE'].apply(parse_size_to_gb)

# 2. Visualize before imputation
plt.figure(figsize=(10,4))
sns.histplot(df['RAM_GB'].dropna(), bins=20, kde=False, color='skyblue')
plt.title('Distribution of RAM (GB) Before Imputation')
plt.xlabel('GB')
plt.show()

# 3. Impute
median_ram = df['RAM_GB'].median()
missing_ram_count = df['RAM_GB'].isna().sum()
print(f"Missing RAM rows: {missing_ram_count}. Imputing with median: {median_ram} GB")

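# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update the DataFrame.
df['RAM_GB'] = df['RAM_GB'].fillna(median_ram)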

## 4. Handling RAM_TYPE

### Decision
We found that **80%** of `RAM_TYPE` values are missing. 
Any attempt to guess these (e.g., by checking CPU gen) is complex and risky. A wrong guess biases the model.
**Verdict**: Fill with 'Unknown'.

In [None]:
# Normalize the known values first, then fill, so the placeholder stays 'Unknown' (not 'UNKNOWN')
df['RAM_TYPE_CLEAN'] = df['RAM_TYPE'].str.upper().str.strip().fillna('Unknown')

## 5. Storage Rescue Mission 🚑

### The Problem
- `SSD_SIZE` missing: ~46%
- `HDD_SIZE` missing: ~96%
- **Opportunity**: `STORAGE_SIZE` column has textual info for 17,000+ rows where specific columns are empty!
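
Figures like the ones above can be obtained with `isna().mean()`. The sketch below runs on a tiny invented frame (`toy`), so its percentages are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows imitating the three storage columns.
toy = pd.DataFrame({
    'SSD_SIZE': ['512 GO', np.nan, np.nan, '256 GB'],
    'HDD_SIZE': [np.nan, np.nan, np.nan, '1 TO'],
    'STORAGE_SIZE': [np.nan, '1 TO HDD', '512 GB', np.nan],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))

# Rows where both specific columns are empty but the fallback text exists.
rescuable = toy['SSD_SIZE'].isna() & toy['HDD_SIZE'].isna() & toy['STORAGE_SIZE'].notna()
print(f"Rescuable rows: {rescuable.sum()}")
```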

### The Strategy
1. **Clean** existing `SSD` and `HDD` columns to GB.
2. **Rescue**: If both are NaN, parse `STORAGE_SIZE`. If it says "HDD", assume HDD. Else, assume SSD (Safest bet for modern laptops).
3. **Zero-fill**: If a row has SSD but missing HDD, assume HDD is 0 (and vice-versa).
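
The rescue and zero-fill steps can be sketched with boolean masks on a toy frame. This is an illustrative alternative, not the notebook's implementation: the rows are invented, and `STORAGE_GB` is a hypothetical column assumed to hold the already-parsed fallback size.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'SSD_GB': [512.0, np.nan, np.nan, np.nan],
    'HDD_GB': [np.nan, np.nan, np.nan, np.nan],
    'STORAGE_SIZE': [None, '1 TB HDD', '256 GB', None],
    'STORAGE_GB': [np.nan, 1024.0, 256.0, np.nan],  # hypothetical pre-parsed fallback
})

# Only rows where both specific columns are missing are candidates.
both_missing = toy['SSD_GB'].isna() & toy['HDD_GB'].isna()
says_hdd = toy['STORAGE_SIZE'].str.upper().str.contains('HDD|HARD DISK', na=False)

# Rescue: text mentioning HDD goes to HDD_GB, everything else defaults to SSD_GB.
toy.loc[both_missing & says_hdd, 'HDD_GB'] = toy['STORAGE_GB']
toy.loc[both_missing & ~says_hdd, 'SSD_GB'] = toy['STORAGE_GB']

# Zero-fill: whatever is still missing is treated as 0 GB.
toy[['SSD_GB', 'HDD_GB']] = toy[['SSD_GB', 'HDD_GB']].fillna(0)
print(toy[['SSD_GB', 'HDD_GB']])
```

The last toy row ends with 0 GB in both columns, which is the "no storage info" case.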

In [None]:
# Step 1: Parse existing columns
df['SSD_GB'] = df['SSD_SIZE'].apply(parse_size_to_gb)
df['HDD_GB'] = df['HDD_SIZE'].apply(parse_size_to_gb)

# Step 2: Rescue Logic function
def rescue_storage(row):
    ssd = row['SSD_GB']
    hdd = row['HDD_GB']
    storage_text = str(row['STORAGE_SIZE']).upper()
    
    # Only act if both specific columns are missing
    if pd.isna(ssd) and pd.isna(hdd):
        # Try to parse the fallback text
        val = parse_size_to_gb(storage_text)
        if pd.notna(val):
            if 'HDD' in storage_text or 'HARD DISK' in storage_text:
                return pd.Series([np.nan, val]) # SSD is NaN, HDD is Found Val
            else:
                return pd.Series([val, np.nan]) # SSD is Found Val (Default), HDD NaN
                
    return pd.Series([ssd, hdd])

print("Applying Rescue Mission... this may take a moment...")
df[['SSD_GB', 'HDD_GB']] = df.apply(rescue_storage, axis=1)

# Step 3: Finalize (fill remaining NaNs with 0; rows left with 0 in both columns have no storage info)
df['SSD_GB'] = df['SSD_GB'].fillna(0)
df['HDD_GB'] = df['HDD_GB'].fillna(0)

# Check how many are still totally empty (0 SSD and 0 HDD)
empty_storage_count = len(df[(df['SSD_GB'] == 0) & (df['HDD_GB'] == 0)])
print(f"Rows with absolutely NO storage info remaining: {empty_storage_count}")

## 6. Re-Engineering STORAGE_TYPE

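Since we now have clean numeric values for SSD and HDD, we can derive a `STORAGE_TYPE` column directly from them. It is always consistent with the numeric columns and replaces the old one, which had many missing values. Note that rescued rows inherit the SSD-by-default assumption from Section 5.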
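
The same labelling rule can be sketched without a row-wise `apply` by using `np.select`, where the first matching condition wins. The toy values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned sizes: SSD only, HDD only, both, neither.
toy = pd.DataFrame({
    'SSD_GB': [512.0, 0.0, 256.0, 0.0],
    'HDD_GB': [0.0, 1024.0, 1024.0, 0.0],
})

has_ssd = toy['SSD_GB'] > 0
has_hdd = toy['HDD_GB'] > 0

# Conditions are checked in order, so 'Hybrid' must come before 'SSD' and 'HDD'.
toy['STORAGE_TYPE_CLEAN'] = np.select(
    [has_ssd & has_hdd, has_ssd, has_hdd],
    ['Hybrid', 'SSD', 'HDD'],
    default='Unknown',
)
print(toy)
```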

In [None]:
def get_storage_type(row):
    if row['SSD_GB'] > 0 and row['HDD_GB'] > 0:
        return 'Hybrid'
    elif row['SSD_GB'] > 0:
        return 'SSD'
    elif row['HDD_GB'] > 0:
        return 'HDD'
    else:
        return 'Unknown'

df['STORAGE_TYPE_CLEAN'] = df.apply(get_storage_type, axis=1)

# Visualizing the types
sns.countplot(x='STORAGE_TYPE_CLEAN', data=df, palette='viridis')
plt.title('Distribution of Storage Types After Cleaning')
plt.show()

## 7. Dropping the Unsalvageable

### Why Drop?
Rows with `Unknown` storage type have **0 GB** of storage. A laptop with no storage cannot exist or be priced correctly. Keeping them introduces noise.
We drop these ~5,800 rows (approx 10%) to ensure high data quality.

In [None]:
initial_len = len(df)
df_clean = df[df['STORAGE_TYPE_CLEAN'] != 'Unknown'].copy()
final_len = len(df_clean)

print(f"Dropped {initial_len - final_len} rows ({((initial_len - final_len)/initial_len)*100:.2f}%).")
print(f"Final Clean Dataset Shape: {df_clean.shape}")

## 8. Export Data
Saving the clean data to CSV.

In [None]:
df_clean.to_csv(OUTPUT_PATH, index=False)
print(f"File saved successfully to: {OUTPUT_PATH}")