## 📊 Executive Summary

This project focuses on predicting laptop prices using machine learning regression models.

We started by cleaning and preprocessing the dataset — converting data types, extracting useful features (e.g., touchscreen, IPS, PPI, CPU/GPU/Storage categories), and simplifying the operating system labels. Feature engineering played a crucial role in improving model performance.

After preparing the data, we trained and compared multiple regression models including XGBoost, Random Forest, LightGBM, and Ridge Regression using R² Score and MAE as evaluation metrics.

🔍 **Best Model**: XGBoost  
📈 **R² Score**: 0.9028  
📉 **MAE**: 0.15

The results show that ensemble models such as XGBoost and Random Forest were the most accurate of the models compared, with XGBoost reaching an R² of 0.9028.


## 💻 Laptop Price Prediction  
### Meta-Data (About Dataset)

---

### 📌 Context

This is a multivariate dataset involving a variety of features that affect laptop pricing. It contains information about specifications such as RAM, storage type, GPU, CPU, screen resolution, weight, operating system, and more. The objective is to predict the price of a laptop based on these features using regression models.

This dataset is ideal for machine learning-based price prediction tasks and feature engineering practice, especially in structured tabular data.

---

### 🎯 Main Objectives

- Predict the **price** of laptops using machine learning regression models.  
- Perform data preprocessing and feature extraction to prepare the dataset for modeling.  
- Compare multiple regression algorithms to identify the most accurate model.  

---

### 🧾 Content

Below is a description of the main columns/features used in the analysis:

| Column          | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| Company          | Brand of the laptop (e.g., Dell, HP, Apple)                                |
| TypeName         | Type of the laptop (e.g., Ultrabook, Gaming, Notebook)                     |
| Inches           | Screen size in inches                                                      |
| ScreenResolution | Screen resolution string (used to derive the touchscreen, IPS, and PPI features) |
| Cpu              | Full CPU name (used to extract CPU brand/type)                             |
| Ram              | RAM size in GB                                                             |
| Memory           | Storage configuration (e.g., 256GB SSD + 1TB HDD)                         |
| Gpu              | GPU name (used to categorize into GPU family)                              |
| OpSys            | Operating System                                                           |
| Weight           | Weight in kg                                                               |
| Price            | Target variable — laptop price in currency units                          |

Additional engineered features include:

- `IsTouchScreen`, `IPS`, `ppi`, `CpuName`, `SSD_GB`, `HDD_GB`, `Flash_Storage_GB`, `Hybrid_GB`, `GpuName`, `OS_Category`.

---

### 🧰 Tools and Libraries Used

- **Language**: Python  
- **Environment**: Jupyter Notebook  
- **Libraries**: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, XGBoost, LightGBM

---

### 🏷️ Project Type

Personal ML Project (for Portfolio & Skill Development)

---

### 👤 Author

Ibrahim Yousuf


# Importing the libraries

In [135]:
import pandas as pd
import numpy as np

In [136]:
df  = pd.read_csv('laptop_data.csv')
df.head()

,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


# Checking the size of the data

In [137]:
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns')

The dataset has 1303 rows and 12 columns


# Checking information of the data

In [138]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1303 non-null   int64  
 1   Company           1303 non-null   object 
 2   TypeName          1303 non-null   object 
 3   Inches            1303 non-null   float64
 4   ScreenResolution  1303 non-null   object 
 5   Cpu               1303 non-null   object 
 6   Ram               1303 non-null   object 
 7   Memory            1303 non-null   object 
 8   Gpu               1303 non-null   object 
 9   OpSys             1303 non-null   object 
 10  Weight            1303 non-null   object 
 11  Price             1303 non-null   float64
dtypes: float64(2), int64(1), object(9)
memory usage: 122.3+ KB


## This shows the dataset does not contain any missing values
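The same conclusion can be checked directly instead of reading it off `df.info()`. This is a minimal sketch on a small stand-in frame; `sample` is a hypothetical name, and its three rows are copied from the `df.head()` output above rather than loaded from the CSV.

```python
import pandas as pd

# Stand-in frame built from the first three rows shown by df.head()
sample = pd.DataFrame({
    "Company": ["Apple", "Apple", "HP"],
    "Ram": ["8GB", "8GB", "8GB"],
    "Price": [71378.6832, 47895.5232, 30636.0],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Collapse to a single yes/no answer for the whole frame
has_missing = bool(sample.isna().any().any())
print(f"Any missing values: {has_missing}")
```

On the real data the equivalent call would be `df.isna().sum()`.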

# Checking duplicate rows

In [139]:
df.duplicated().sum()

0

In [140]:
df['Ram'].unique()

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

## The smallest RAM size in this dataset is 2GB and the largest is 64GB

## Dropping the Unnamed column because it is of no use in the analysis or modeling

In [141]:
df.drop(columns =['Unnamed: 0'] , inplace = True)

In [142]:
df.head()

,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


## Unnamed column removed successfully

# 💾 Data Type Conversion Guide

## Converting 'Ram' from Object to Integer
The `Ram` column is stored as **object** dtype because every value carries a "GB" suffix. To convert it to integer:

1. **Remove "GB"**  
   Strip the "GB" suffix from each entry

2. **Convert to Integer**  
   Cast the cleaned string values to integers

## ⚖️ Converting 'Weight' from Object to Float
The `Weight` column is stored as **object** dtype because every value carries a "kg" suffix. Conversion steps:

1. **Remove "kg"**  
   Strip the "kg" suffix from each entry

2. **Convert to Float**  
   Cast the resulting string values to floats

Check `dtypes` afterwards!

# Remove 'GB' from RAM and 'kg' from Weight, then convert the types

In [143]:
# Remove 'GB' from RAM and convert to integer
df['Ram'] = df['Ram'].str.replace('GB', '').astype(int)

# Remove 'kg' from Weight and convert to float
df['Weight'] = df['Weight'].str.replace('kg', '').astype(float)


# Verifying the changes 

In [144]:
df.dtypes

Company              object
TypeName             object
Inches              float64
ScreenResolution     object
Cpu                  object
Ram                   int32
Memory               object
Gpu                  object
OpSys                object
Weight              float64
Price               float64
dtype: object

In [145]:
df['Ram'].unique()

array([ 8, 16,  4,  2, 12,  6, 32, 24, 64])

In [146]:
df['Weight'].unique()

array([1.37 , 1.34 , 1.86 , 1.83 , 2.1  , 2.04 , 1.3  , 1.6  , 2.2  ,
       0.92 , 1.22 , 0.98 , 2.5  , 1.62 , 1.91 , 2.3  , 1.35 , 1.88 ,
       1.89 , 1.65 , 2.71 , 1.2  , 1.44 , 2.8  , 2.   , 2.65 , 2.77 ,
       3.2  , 0.69 , 1.49 , 2.4  , 2.13 , 2.43 , 1.7  , 1.4  , 1.8  ,
       1.9  , 3.   , 1.252, 2.7  , 2.02 , 1.63 , 1.96 , 1.21 , 2.45 ,
       1.25 , 1.5  , 2.62 , 1.38 , 1.58 , 1.85 , 1.23 , 1.26 , 2.16 ,
       2.36 , 2.05 , 1.32 , 1.75 , 0.97 , 2.9  , 2.56 , 1.48 , 1.74 ,
       1.1  , 1.56 , 2.03 , 1.05 , 4.4  , 1.29 , 1.95 , 2.06 , 1.12 ,
       1.42 , 3.49 , 3.35 , 2.23 , 4.42 , 2.69 , 2.37 , 4.7  , 3.6  ,
       2.08 , 4.3  , 1.68 , 1.41 , 4.14 , 2.18 , 2.24 , 2.67 , 2.14 ,
       1.36 , 2.25 , 2.15 , 2.19 , 2.54 , 3.42 , 1.28 , 2.33 , 1.45 ,
       2.79 , 1.84 , 2.6  , 2.26 , 3.25 , 1.59 , 1.13 , 1.78 , 1.15 ,
       1.27 , 1.43 , 2.31 , 1.16 , 1.64 , 2.17 , 1.47 , 3.78 , 1.79 ,
       0.91 , 1.99 , 4.33 , 1.93 , 1.87 , 2.63 , 3.4  , 3.14 , 1.94 ,
       1.24 , 4.6  ,

# Changes done Successfully

In [147]:
df['ScreenResolution'].unique()

array(['IPS Panel Retina Display 2560x1600', '1440x900',
       'Full HD 1920x1080', 'IPS Panel Retina Display 2880x1800',
       '1366x768', 'IPS Panel Full HD 1920x1080',
       'IPS Panel Retina Display 2304x1440',
       'IPS Panel Full HD / Touchscreen 1920x1080',
       'Full HD / Touchscreen 1920x1080',
       'Touchscreen / Quad HD+ 3200x1800',
       'IPS Panel Touchscreen 1920x1200', 'Touchscreen 2256x1504',
       'Quad HD+ / Touchscreen 3200x1800', 'IPS Panel 1366x768',
       'IPS Panel 4K Ultra HD / Touchscreen 3840x2160',
       'IPS Panel Full HD 2160x1440',
       '4K Ultra HD / Touchscreen 3840x2160', 'Touchscreen 2560x1440',
       '1600x900', 'IPS Panel 4K Ultra HD 3840x2160',
       '4K Ultra HD 3840x2160', 'Touchscreen 1366x768',
       'IPS Panel Full HD 1366x768', 'IPS Panel 2560x1440',
       'IPS Panel Full HD 2560x1440',
       'IPS Panel Retina Display 2736x1824', 'Touchscreen 2400x1600',
       '2560x1440', 'IPS Panel Quad HD+ 2560x1440',
       'IPS Panel 

# 🖥️ Laptop Screen Information Standardization

## Current Data Situation
📊 **Screen Resolution Data Issues:**
- Inconsistent formatting across different laptops
- Multiple representation styles (e.g., "1920x1080", "Full HD", "4K UHD")

## ✅ Identifiable Patterns
### Touchscreen Classification:
- 💻👉 Touchscreen-enabled devices
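- 💻✋ Non-touchscreen devices

**Touchscreen Identification:**
   - Extract touchscreen capability
   - Represent as boolean values:
     - `True` for touchscreen devices
     - `False` for non-touchscreen devices

A touchscreen record would carry, for example, `"is_touchscreen": true`. The code below stores the same information as `'Yes'` / `'No'` labels.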

# Performing Feature Engineering
Creating a new column `IsTouchScreen` using a lambda function

In [148]:
# Create new column using lambda function
df['IsTouchScreen'] = df['ScreenResolution'].apply(
    lambda x: 'Yes' if 'Touchscreen' in str(x) else 'No'
)
df.head()

,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,IsTouchScreen
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232,No
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0,No
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336,No
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No


In [149]:
df['IsTouchScreen'].value_counts()

IsTouchScreen
No     1111
Yes     192
Name: count, dtype: int64

💻 Approximately **85%** of the laptops are **non-touchscreen**, while only **15%** are **touchscreen** 🖱️📱.
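The percentages quoted above follow from the counts. A minimal sketch, assuming only the two counts printed by `value_counts()`; `counts` and `shares` are illustrative names.

```python
import pandas as pd

# Counts reported by df['IsTouchScreen'].value_counts()
counts = pd.Series({"No": 1111, "Yes": 192}, name="IsTouchScreen")

# Convert the counts to percentages of the total number of laptops
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

On the full frame, `df['IsTouchScreen'].value_counts(normalize=True)` returns the same proportions as fractions.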


🧠 We extracted whether the screen uses **IPS technology** or not based on the `ScreenResolution` column.

✅ If the term **'IPS'** is found in the screen resolution text, we label it as **'Yes'**, otherwise **'No'**.

🔽 Here's the code used for this transformation:


In [150]:
df['IPS'] = df['ScreenResolution'].apply(
    lambda x: 'Yes' if 'IPS' in str(x) else 'No'
)
df.head()

,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,IsTouchScreen,IPS
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No,Yes
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232,No,No
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0,No,No
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336,No,Yes
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No,Yes


In [151]:
df['IPS'].value_counts()

IPS
No     938
Yes    365
Name: count, dtype: int64

📊 Approximately **28%** of the laptops have an **IPS display**, while the remaining **72%** do not.

# Splitting the display resolution into its X and Y components

In [152]:
new = df['ScreenResolution'].str.split('x',n=1,expand=True)

In [153]:
df['X_res'] = new[0]
df['Y_res'] = new[1]

In [154]:
df['X_res'] = df['X_res'].str.replace(',','').str.findall(r'(\d+\.?\d+)').apply(lambda x:x[0])

In [155]:
df.head()

,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,IsTouchScreen,IPS,X_res,Y_res
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No,Yes,2560,1600
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232,No,No,1440,900
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0,No,No,1920,1080
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336,No,Yes,2880,1800
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No,Yes,2560,1600


In [156]:
df['X_res'] = df['X_res'].astype('int')
df['Y_res'] = df['Y_res'].astype('int')
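The split, `findall`, and cast steps above can also be written as a single regular-expression capture. This is an alternative sketch, not the method used in the notebook; the sample strings are copied from the `ScreenResolution` values listed earlier, and `resolution` and `parts` are illustrative names.

```python
import pandas as pd

# Sample strings taken from the ScreenResolution values shown earlier
resolution = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "1440x900",
    "IPS Panel 4K Ultra HD / Touchscreen 3840x2160",
])

# Capture the digits on either side of the 'x' separator.
# The "4" in "4K" is skipped because it is not followed by 'x' and more digits.
parts = resolution.str.extract(r"(\d+)x(\d+)").astype(int)
parts.columns = ["X_res", "Y_res"]
print(parts)
```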

📐 We calculated the **PPI (Pixels Per Inch)** of each laptop display using the screen resolution and size.

🧮 The formula used is:  
**PPI = √(X_res² + Y_res²) / Inches**

🔽 Here's the code used to perform this calculation:

In [157]:
df['ppi'] = (((df['X_res']**2) + (df['Y_res']**2))**0.5/df['Inches']).astype('float')
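As a worked check of the formula, the first row of the dataset (a 2560x1600 panel on a 13.3-inch screen) can be computed by hand. The variable names below are illustrative.

```python
import math

# Row 0 of the dataset: 2560x1600 pixels on a 13.3-inch diagonal
x_res, y_res, inches = 2560, 1600, 13.3

# Diagonal length in pixels, divided by the diagonal length in inches
diagonal_px = math.hypot(x_res, y_res)
ppi = diagonal_px / inches
print(round(ppi, 2))  # about 226.98, the ppi value shown for row 0
```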

# Successfully performed Feature Engineering on ScreenResolution, now dropping it

In [158]:
df.drop(columns=['ScreenResolution'],inplace=True)

In [159]:
df.drop(columns=['Inches','X_res','Y_res'],inplace=True)

In [160]:
df.head()

,Company,TypeName,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price,IsTouchScreen,IPS,ppi
0,Apple,Ultrabook,Intel Core i5 2.3GHz,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No,Yes,226.983005
1,Apple,Ultrabook,Intel Core i5 1.8GHz,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232,No,No,127.67794
2,HP,Notebook,Intel Core i5 7200U 2.5GHz,8,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0,No,No,141.211998
3,Apple,Ultrabook,Intel Core i7 2.7GHz,16,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336,No,Yes,220.534624
4,Apple,Ultrabook,Intel Core i5 3.1GHz,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No,Yes,226.983005


In [161]:
df['Cpu'].unique()

array(['Intel Core i5 2.3GHz', 'Intel Core i5 1.8GHz',
       'Intel Core i5 7200U 2.5GHz', 'Intel Core i7 2.7GHz',
       'Intel Core i5 3.1GHz', 'AMD A9-Series 9420 3GHz',
       'Intel Core i7 2.2GHz', 'Intel Core i7 8550U 1.8GHz',
       'Intel Core i5 8250U 1.6GHz', 'Intel Core i3 6006U 2GHz',
       'Intel Core i7 2.8GHz', 'Intel Core M m3 1.2GHz',
       'Intel Core i7 7500U 2.7GHz', 'Intel Core i7 2.9GHz',
       'Intel Core i3 7100U 2.4GHz', 'Intel Atom x5-Z8350 1.44GHz',
       'Intel Core i5 7300HQ 2.5GHz', 'AMD E-Series E2-9000e 1.5GHz',
       'Intel Core i5 1.6GHz', 'Intel Core i7 8650U 1.9GHz',
       'Intel Atom x5-Z8300 1.44GHz', 'AMD E-Series E2-6110 1.5GHz',
       'AMD A6-Series 9220 2.5GHz',
       'Intel Celeron Dual Core N3350 1.1GHz',
       'Intel Core i3 7130U 2.7GHz', 'Intel Core i7 7700HQ 2.8GHz',
       'Intel Core i5 2.0GHz', 'AMD Ryzen 1700 3GHz',
       'Intel Pentium Quad Core N4200 1.1GHz',
       'Intel Atom x5-Z8550 1.44GHz',
       'Intel Celeron Du

🧠 We extracted the **CPU brand/type** from the `Cpu` column using a regular expression.

🔍 The goal was to identify common CPU categories such as:
- Intel Core i3 / i5 / i7
- Intel Atom
- Intel Celeron
- Intel Pentium
- Intel Core M
- AMD
- Samsung

🧪 Any CPU names that didn't match the above (e.g., **Xeon**) were labeled as **'Other'**.

In [162]:
# Extract processor family using regex
df['CpuName'] = df['Cpu'].str.extract(
    r'(Intel Core i[357]|Intel Atom|Intel Celeron|Intel Pentium|Intel Core M|AMD|Samsung)',
    expand=False
)

# For any remaining null values (like Xeon), fill with 'Other'
df['CpuName'] = df['CpuName'].fillna('Other')

# Verify the results
print(df['CpuName'].value_counts())

CpuName
Intel Core i7    527
Intel Core i5    423
Intel Core i3    136
Intel Celeron     88
AMD               62
Intel Pentium     30
Intel Core M      19
Intel Atom        13
Other              4
Samsung            1
Name: count, dtype: int64
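To see how individual names land in each category, the same pattern can be run on a few sample strings. A minimal sketch: the first three names come from the `Cpu` values listed earlier, while the Xeon-style name is an assumed example added for illustration.

```python
import pandas as pd

cpu = pd.Series([
    "Intel Core i5 7200U 2.5GHz",   # from the Cpu values shown earlier
    "Intel Core M m3 1.2GHz",       # from the Cpu values shown earlier
    "AMD Ryzen 1700 3GHz",          # from the Cpu values shown earlier
    "Intel Xeon E3-1505M V6 3GHz",  # assumed Xeon-style name, for illustration
])

pattern = r"(Intel Core i[357]|Intel Atom|Intel Celeron|Intel Pentium|Intel Core M|AMD|Samsung)"

# Names that match none of the alternatives come back as NaN and become 'Other'
cpu_name = cpu.str.extract(pattern, expand=False).fillna("Other")
print(cpu_name.tolist())
```

Every AMD processor, whatever its series, collapses into the single `AMD` category.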


# Successfully performed Feature Engineering on the Cpu column, now dropping it

In [163]:
df.drop(columns = ['Cpu'] , inplace = True)
df.head()

,Company,TypeName,Ram,Memory,Gpu,OpSys,Weight,Price,IsTouchScreen,IPS,ppi,CpuName
0,Apple,Ultrabook,8,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No,Yes,226.983005,Intel Core i5
1,Apple,Ultrabook,8,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232,No,No,127.67794,Intel Core i5
2,HP,Notebook,8,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0,No,No,141.211998,Intel Core i5
3,Apple,Ultrabook,16,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336,No,Yes,220.534624,Intel Core i7
4,Apple,Ultrabook,8,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No,Yes,226.983005,Intel Core i5


# Exploring Memory Column

In [164]:
df['Memory'].unique()

array(['128GB SSD', '128GB Flash Storage', '256GB SSD', '512GB SSD',
       '500GB HDD', '256GB Flash Storage', '1TB HDD',
       '32GB Flash Storage', '128GB SSD +  1TB HDD',
       '256GB SSD +  256GB SSD', '64GB Flash Storage',
       '256GB SSD +  1TB HDD', '256GB SSD +  2TB HDD', '32GB SSD',
       '2TB HDD', '64GB SSD', '1.0TB Hybrid', '512GB SSD +  1TB HDD',
       '1TB SSD', '256GB SSD +  500GB HDD', '128GB SSD +  2TB HDD',
       '512GB SSD +  512GB SSD', '16GB SSD', '16GB Flash Storage',
       '512GB SSD +  256GB SSD', '512GB SSD +  2TB HDD',
       '64GB Flash Storage +  1TB HDD', '180GB SSD', '1TB HDD +  1TB HDD',
       '32GB HDD', '1TB SSD +  1TB HDD', '512GB Flash Storage',
       '128GB HDD', '240GB SSD', '8GB SSD', '508GB Hybrid', '1.0TB HDD',
       '512GB SSD +  1.0TB Hybrid', '256GB SSD +  1.0TB Hybrid'],
      dtype=object)

💾 We extracted the **storage capacity details** from the `Memory` column, which includes combinations like:
- `256GB SSD + 1TB HDD`
- `128GB Flash Storage`
- `512GB SSD + 1.0TB Hybrid`
  
### 🛠️ Key Steps:

1. **Initialize columns** for each storage type with 0:
   - `SSD_GB`, `HDD_GB`, `Flash_Storage_GB`, `Hybrid_GB`

In [165]:
# Initialize all storage type columns with 0
storage_types = ['SSD', 'HDD', 'Flash_Storage', 'Hybrid']
for stype in storage_types:
    df[stype+'_GB'] = 0

# Function to extract numeric value from memory string
def extract_gb(value):
    if pd.isna(value):
        return 0
    num = ''.join(filter(str.isdigit, value.split('GB')[0]))
    return float(num) if num else 0

# Function to extract TB values and convert to GB
def extract_tb(value):
    if pd.isna(value):
        return 0
    if 'TB' in value:
        num = value.split('TB')[0].strip()
        try:
            return float(num) * 1024
        except ValueError:
            return 0
    return 0

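# Process each memory configuration.
# Note: when the same storage type appears twice (e.g. '256GB SSD +  256GB SSD'),
# the later part overwrites the earlier one instead of being added to it.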
for i, row in df.iterrows():
    parts = [part.strip() for part in row['Memory'].split('+')]
    
    for part in parts:
        if 'SSD' in part:
            if 'TB' in part:
                df.at[i, 'SSD_GB'] = extract_tb(part)
            else:
                df.at[i, 'SSD_GB'] = extract_gb(part)
                
        elif 'HDD' in part:
            if 'TB' in part:
                df.at[i, 'HDD_GB'] = extract_tb(part)
            else:
                df.at[i, 'HDD_GB'] = extract_gb(part)
                
        elif 'Flash Storage' in part:
            if 'TB' in part:
                df.at[i, 'Flash_Storage_GB'] = extract_tb(part)
            else:
                df.at[i, 'Flash_Storage_GB'] = extract_gb(part)
                
        elif 'Hybrid' in part:
            if 'TB' in part:
                df.at[i, 'Hybrid_GB'] = extract_tb(part)
            else:
                df.at[i, 'Hybrid_GB'] = extract_gb(part)

# Display results
df[['Memory'] + [stype+'_GB' for stype in storage_types]]

,Memory,SSD_GB,HDD_GB,Flash_Storage_GB,Hybrid_GB
0,128GB SSD,128,0,0,0
1,128GB Flash Storage,0,0,128,0
2,256GB SSD,256,0,0,0
3,512GB SSD,512,0,0,0
4,256GB SSD,256,0,0,0
...,...,...,...,...,...
1298,128GB SSD,128,0,0,0
1299,512GB SSD,512,0,0,0
1300,64GB Flash Storage,0,0,64,0
1301,1TB HDD,0,1024,0,0
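The loop above assigns a capacity for each part, so a configuration such as `256GB SSD +  256GB SSD` ends up with 256 rather than 512 in `SSD_GB`. The following is a hedged sketch of an alternative that sums repeated types. `parse_memory`, `STORAGE_COLUMNS`, and `PART_PATTERN` are hypothetical names, and this is not the code that produced the results reported in this notebook.

```python
import re

# Map the storage keywords found in Memory strings to feature names
STORAGE_COLUMNS = {
    "SSD": "SSD_GB",
    "HDD": "HDD_GB",
    "Flash Storage": "Flash_Storage_GB",
    "Hybrid": "Hybrid_GB",
}
# One part looks like "<number><GB|TB> <storage type>"
PART_PATTERN = re.compile(r"([\d.]+)\s*(GB|TB)\s+(SSD|HDD|Flash Storage|Hybrid)")


def parse_memory(memory: str) -> dict:
    """Return the total capacity in GB per storage type for one Memory string."""
    totals = {column: 0.0 for column in STORAGE_COLUMNS.values()}
    for size, unit, kind in PART_PATTERN.findall(memory):
        gigabytes = float(size) * (1024 if unit == "TB" else 1)
        totals[STORAGE_COLUMNS[kind]] += gigabytes  # add, do not overwrite
    return totals


print(parse_memory("256GB SSD +  256GB SSD"))
print(parse_memory("512GB SSD +  1.0TB Hybrid"))
```

Applied to the full column, something like `pd.DataFrame(df['Memory'].map(parse_memory).tolist(), index=df.index)` would yield the four feature columns. Switching to this version changes the features for the affected rows, so the model metrics could shift slightly.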


# Successfully performed Feature Engineering on the Memory column, now dropping it

In [167]:
df.drop(columns = ['Memory'] , inplace = True)


In [168]:
df.head()

,Company,TypeName,Ram,Gpu,OpSys,Weight,Price,IsTouchScreen,IPS,ppi,CpuName,SSD_GB,HDD_GB,Flash_Storage_GB,Hybrid_GB
0,Apple,Ultrabook,8,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No,Yes,226.983005,Intel Core i5,128,0,0,0
1,Apple,Ultrabook,8,Intel HD Graphics 6000,macOS,1.34,47895.5232,No,No,127.67794,Intel Core i5,0,0,128,0
2,HP,Notebook,8,Intel HD Graphics 620,No OS,1.86,30636.0,No,No,141.211998,Intel Core i5,256,0,0,0
3,Apple,Ultrabook,16,AMD Radeon Pro 455,macOS,1.83,135195.336,No,Yes,220.534624,Intel Core i7,512,0,0,0
4,Apple,Ultrabook,8,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No,Yes,226.983005,Intel Core i5,256,0,0,0


# Exploring GPU Column

In [169]:
df['Gpu'].unique()

array(['Intel Iris Plus Graphics 640', 'Intel HD Graphics 6000',
       'Intel HD Graphics 620', 'AMD Radeon Pro 455',
       'Intel Iris Plus Graphics 650', 'AMD Radeon R5',
       'Intel Iris Pro Graphics', 'Nvidia GeForce MX150',
       'Intel UHD Graphics 620', 'Intel HD Graphics 520',
       'AMD Radeon Pro 555', 'AMD Radeon R5 M430',
       'Intel HD Graphics 615', 'AMD Radeon Pro 560',
       'Nvidia GeForce 940MX', 'Intel HD Graphics 400',
       'Nvidia GeForce GTX 1050', 'AMD Radeon R2', 'AMD Radeon 530',
       'Nvidia GeForce 930MX', 'Intel HD Graphics',
       'Intel HD Graphics 500', 'Nvidia GeForce 930MX ',
       'Nvidia GeForce GTX 1060', 'Nvidia GeForce 150MX',
       'Intel Iris Graphics 540', 'AMD Radeon RX 580',
       'Nvidia GeForce 920MX', 'AMD Radeon R4 Graphics', 'AMD Radeon 520',
       'Nvidia GeForce GTX 1070', 'Nvidia GeForce GTX 1050 Ti',
       'Nvidia GeForce MX130', 'AMD R4 Graphics',
       'Nvidia GeForce GTX 940MX', 'AMD Radeon RX 560',
       'Nvid

🎮 We extracted and categorized the **GPU (Graphics Processing Unit)** information from the `Gpu` column.

🧠 The goal was to group similar GPUs under broader, meaningful categories for analysis.

### 🛠️ Steps Involved:

1. ✅ A new column `GpuName` was created and initialized with the default value **'Other'**.

2. 📋 We defined a list of common GPU prefixes and their corresponding categories:
   - `Intel Iris`
   - `Intel HD`
   - `Intel UHD`
   - `Nvidia GeForce`
   - `Nvidia GTX`
   - `Nvidia Quadro`
   - `AMD`

3. 🔁 For each GPU in the dataset:
   - If the `Gpu` string starts with a known prefix, we assigned it the corresponding category in the `GpuName` column.
   - ✅ The prefixes are checked in list order, and a later match would overwrite an earlier one. Since no `Gpu` string can start with more than one of these prefixes, the order does not change the result here.


In [170]:
# Initialize GpuName as string column
df['GpuName'] = 'Other'  # Default value

# Define GPU categories as (prefix, category) pairs.
# Later entries overwrite earlier ones, but these prefixes never overlap.
gpu_categories = [
    ('Intel Iris', 'Intel Iris'),
    ('Intel HD', 'Intel HD'),
    ('Intel UHD', 'Intel UHD'),
    ('Nvidia GeForce', 'Nvidia GeForce'),
    ('Nvidia GTX', 'Nvidia GTX'),
    ('Nvidia Quadro', 'Nvidia Quadro'),
    ('AMD', 'AMD')
]

# Categorize GPUs
for prefix, category in gpu_categories:
    df.loc[df['Gpu'].str.startswith(prefix, na=False), 'GpuName'] = category

In [171]:
df.head()

,Company,TypeName,Ram,Gpu,OpSys,Weight,Price,IsTouchScreen,IPS,ppi,CpuName,SSD_GB,HDD_GB,Flash_Storage_GB,Hybrid_GB,GpuName
0,Apple,Ultrabook,8,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832,No,Yes,226.983005,Intel Core i5,128,0,0,0,Intel Iris
1,Apple,Ultrabook,8,Intel HD Graphics 6000,macOS,1.34,47895.5232,No,No,127.67794,Intel Core i5,0,0,128,0,Intel HD
2,HP,Notebook,8,Intel HD Graphics 620,No OS,1.86,30636.0,No,No,141.211998,Intel Core i5,256,0,0,0,Intel HD
3,Apple,Ultrabook,16,AMD Radeon Pro 455,macOS,1.83,135195.336,No,Yes,220.534624,Intel Core i7,512,0,0,0,AMD
4,Apple,Ultrabook,8,Intel Iris Plus Graphics 650,macOS,1.37,96095.808,No,Yes,226.983005,Intel Core i5,256,0,0,0,Intel Iris


In [172]:
df['GpuName'].value_counts()

GpuName
Intel HD          639
Nvidia GeForce    368
AMD               180
Intel UHD          68
Nvidia Quadro      31
Intel Iris         14
Other               2
Nvidia GTX          1
Name: count, dtype: int64

📊 Among all the GPUs:

- 🔝 The most common GPU type is **Intel HD**.
- 🔽 The least common is **Nvidia GTX**.
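The very low `Nvidia GTX` count is a consequence of prefix matching: names such as `Nvidia GeForce GTX 1050` start with `Nvidia GeForce`, so they never reach the `Nvidia GTX` category. A minimal sketch on sample strings; the last name is a hypothetical example of a string that begins with `Nvidia GTX`, and `gpu`, `prefixes`, and `gpu_name` are illustrative names.

```python
import pandas as pd

gpu = pd.Series([
    "Nvidia GeForce GTX 1050",       # from the Gpu values shown earlier
    "Intel Iris Plus Graphics 640",  # from the Gpu values shown earlier
    "AMD Radeon Pro 455",            # from the Gpu values shown earlier
    "Nvidia GTX 980",                # hypothetical name, for illustration only
])

prefixes = ["Intel Iris", "Intel HD", "Intel UHD",
            "Nvidia GeForce", "Nvidia GTX", "Nvidia Quadro", "AMD"]

# Start from 'Other' and overwrite wherever a prefix matches
gpu_name = pd.Series("Other", index=gpu.index)
for prefix in prefixes:
    gpu_name.loc[gpu.str.startswith(prefix)] = prefix
print(gpu_name.tolist())
```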


# Dropping GPU column as Feature Engineering Successfully performed

In [173]:
df.drop(columns = ['Gpu'] , inplace = True)

In [174]:
df.head()

,Company,TypeName,Ram,OpSys,Weight,Price,IsTouchScreen,IPS,ppi,CpuName,SSD_GB,HDD_GB,Flash_Storage_GB,Hybrid_GB,GpuName
0,Apple,Ultrabook,8,macOS,1.37,71378.6832,No,Yes,226.983005,Intel Core i5,128,0,0,0,Intel Iris
1,Apple,Ultrabook,8,macOS,1.34,47895.5232,No,No,127.67794,Intel Core i5,0,0,128,0,Intel HD
2,HP,Notebook,8,No OS,1.86,30636.0,No,No,141.211998,Intel Core i5,256,0,0,0,Intel HD
3,Apple,Ultrabook,16,macOS,1.83,135195.336,No,Yes,220.534624,Intel Core i7,512,0,0,0,AMD
4,Apple,Ultrabook,8,macOS,1.37,96095.808,No,Yes,226.983005,Intel Core i5,256,0,0,0,Intel Iris


# Exploring OpSys column

In [175]:
df['OpSys'].unique()

array(['macOS', 'No OS', 'Windows 10', 'Mac OS X', 'Linux', 'Android',
       'Windows 10 S', 'Chrome OS', 'Windows 7'], dtype=object)

In [176]:
df['OpSys'].value_counts()

OpSys
Windows 10      1072
No OS             66
Linux             62
Windows 7         45
Chrome OS         27
macOS             13
Mac OS X           8
Windows 10 S       8
Android            2
Name: count, dtype: int64

 As we can see, **Windows 10** is by far the most commonly installed OS in this dataset.

🧠 We simplified the operating systems into broader categories using a new column called **`OS_Category`**.

### 🛠️ Steps Involved:

1. 🆕 A new column `OS_Category` was initialized with a default value of **'Others'**.

2. 💻 **Windows-based systems**:
   - Any `OpSys` value containing `'Windows'` (case-insensitive) was categorized as **'Windows'**.

3. 🍏 **Mac-based systems**:
   - Any `OpSys` value containing `'mac'` or `'Mac'` was categorized as **'Mac'**.


In [None]:
# Create the new 'OS_Category' column
df['OS_Category'] = 'Others'  # Default value

# Categorize Windows systems (contains 'Windows' in name)
df.loc[df['OpSys'].str.contains('Windows', case=False, na=False), 'OS_Category'] = 'Windows'

# Categorize Mac systems (contains 'mac' in any letter case)
df.loc[df['OpSys'].str.contains('mac', case=False, na=False), 'OS_Category'] = 'Mac'

In [178]:
df['OS_Category'].value_counts()

OS_Category
Windows    1125
Others      157
Mac          21
Name: count, dtype: int64
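These totals can be cross-checked against the per-OS counts printed earlier. A minimal sketch, assuming only those counts; `os_category` is a hypothetical helper that mirrors the two `str.contains` rules.

```python
import pandas as pd

# OpSys counts reported by df['OpSys'].value_counts()
opsys_counts = pd.Series({
    "Windows 10": 1072, "No OS": 66, "Linux": 62, "Windows 7": 45,
    "Chrome OS": 27, "macOS": 13, "Mac OS X": 8, "Windows 10 S": 8,
    "Android": 2,
})


def os_category(name: str) -> str:
    """Mirror the notebook's rules: Windows first, then Mac, else Others."""
    lowered = name.lower()
    if "windows" in lowered:
        return "Windows"
    if "mac" in lowered:
        return "Mac"
    return "Others"


# Group the counts by category (the function is applied to each index label)
category_counts = opsys_counts.groupby(os_category).sum()
print(category_counts)
```

The grouped sums come out to 1125 Windows, 157 Others, and 21 Mac, the same as the `value_counts()` output above.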

# Performed Feature Engineering successfully on OpSys column now dropping it

In [179]:
df.drop(columns = ['OpSys'] , inplace = True)

In [180]:
df.head()

,Company,TypeName,Ram,Weight,Price,IsTouchScreen,IPS,ppi,CpuName,SSD_GB,HDD_GB,Flash_Storage_GB,Hybrid_GB,GpuName,OS_Category
0,Apple,Ultrabook,8,1.37,71378.6832,No,Yes,226.983005,Intel Core i5,128,0,0,0,Intel Iris,Mac
1,Apple,Ultrabook,8,1.34,47895.5232,No,No,127.67794,Intel Core i5,0,0,128,0,Intel HD,Mac
2,HP,Notebook,8,1.86,30636.0,No,No,141.211998,Intel Core i5,256,0,0,0,Intel HD,Others
3,Apple,Ultrabook,16,1.83,135195.336,No,Yes,220.534624,Intel Core i7,512,0,0,0,AMD,Mac
4,Apple,Ultrabook,8,1.37,96095.808,No,Yes,226.983005,Intel Core i5,256,0,0,0,Intel Iris,Mac


In [194]:
df.to_pickle("Laptop_Preprocessed.pkl")