<h1 align="center"> 💻 Laptop Dataset EDA</h1>

#### 📌 Problem  
Consumers face a wide range of laptop options, making it difficult to **compare features, prices, and performance** effectively. Without proper analysis, both buyers and sellers may **struggle to make informed decisions** about laptops based on their needs or market trends.


In [13]:
# import some necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

In [14]:
# read the data
df = pd.read_csv("laptops.csv")

In [38]:
# show the first 5 rows of the data
df.head()

,Laptop,Status,Brand,Model,CPU,RAM,Storage,Storage type,GPU,Screen,Touch,Final Price
0,ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core...,New,Asus,ExpertBook,Intel Core i5,8,512,SSD,,15.6,No,1009.0
1,Alurin Go Start Intel Celeron N4020/8GB/256GB ...,New,Alurin,Go,Intel Celeron,8,256,SSD,,15.6,No,299.0
2,ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core...,New,Asus,ExpertBook,Intel Core i3,8,256,SSD,,15.6,No,789.0
3,MSI Katana GF66 12UC-082XES Intel Core i7-1270...,New,MSI,Katana,Intel Core i7,16,1000,SSD,RTX 3050,15.6,No,1199.0
4,HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB...,New,HP,15S,Intel Core i5,16,512,SSD,,15.6,No,669.01


In [46]:
# drop duplicate rows
data = df.drop_duplicates()

In [39]:
#What are the dimensions of the dataset (rows, columns)?
data.shape

(2160, 11)

In [68]:
# check the percentage of missing values per column
percent_of_miss_value = data.isnull().sum() / data.shape[0] * 100
percent_of_miss_value

Laptop           0.000000
Status           0.000000
Brand            0.000000
Model            0.000000
CPU              0.000000
RAM              0.000000
Storage          0.000000
Storage type     1.944444
GPU             63.472222
Screen           0.185185
Touch            0.000000
Final Price      0.000000
dtype: float64

In [40]:
#What are the column names and their data types?
data.dtypes

Laptop           object
Status           object
Brand            object
Model            object
CPU              object
RAM               int64
Storage           int64
Storage type     object
Screen          float64
Touch            object
Final Price     float64
dtype: object

In [16]:
# check the data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160 entries, 0 to 2159
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Laptop        2160 non-null   object 
 1   Status        2160 non-null   object 
 2   Brand         2160 non-null   object 
 3   Model         2160 non-null   object 
 4   CPU           2160 non-null   object 
 5   RAM           2160 non-null   int64  
 6   Storage       2160 non-null   int64  
 7   Storage type  2118 non-null   object 
 8   GPU           789 non-null    object 
 9   Screen        2156 non-null   float64
 10  Touch         2160 non-null   object 
 11  Final Price   2160 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 202.6+ KB


In [None]:
#What are the most common laptop brands?
data["Brand"].value_counts()

Brand
Asus                415
HP                  368
Lenovo              366
MSI                 308
Acer                137
Apple               116
Dell                 84
Microsoft            77
Gigabyte             48
Razer                37
Medion               32
LG                   32
Alurin               29
PcCom                24
Samsung              22
Dynabook Toshiba     19
Vant                 11
Deep Gaming           8
Primux                8
Innjoo                6
Thomson               4
Prixton               3
Millenium             2
Denver                1
Jetwing               1
Realme                1
Toshiba               1
Name: count, dtype: int64

In [42]:
#What are the most common laptop models?
data["Model"].value_counts()

Model
15S         115
IdeaPad     104
ROG         101
VivoBook     99
ThinkPad     99
           ... 
A7            1
V330          1
Delta         1
GL65          1
GL75          1
Name: count, Length: 121, dtype: int64

In [45]:
#What are the unique processor types and their distributions?
data["CPU"].value_counts()

CPU
Intel Core i7            710
Intel Core i5            535
AMD Ryzen 7              156
Intel Core i3            130
AMD Ryzen 5              127
Intel Celeron             94
Intel Core i9             94
Intel Evo Core i7         82
AMD Ryzen 9               44
AMD Ryzen 3               44
Intel Evo Core i5         30
Apple M2                  28
AMD 3020e                 13
Apple M2 Pro              13
Apple M1                  11
AMD Athlon                10
Intel Pentium             10
Apple M1 Pro               7
Intel Core M3              5
Qualcomm Snapdragon 7      3
AMD 3015e                  3
Microsoft SQ1              3
AMD Radeon 9               2
Qualcomm Snapdragon 8      2
Intel Evo Core i9          1
AMD Radeon 5               1
AMD 3015Ce                 1
Mediatek MT8183            1
Name: count, dtype: int64

In [48]:
# How is RAM size (GB) distributed across laptops?
ram_distribution = data['RAM'].value_counts().sort_index()
ram_distribution

RAM
4       68
6        3
8      817
12      15
16     928
32     301
40       2
64      25
128      1
Name: count, dtype: int64

In [54]:
# Describe the Final Price column to get the range and basic statistics
price_stats = data['Final Price'].describe()

In [56]:
# Flag unusually high prices using the upper IQR fence (Q3 + 1.5 * IQR)
iqr = price_stats['75%'] - price_stats['25%']
upper_fence = price_stats['75%'] + 1.5 * iqr
price_outliers = data[data['Final Price'] > upper_fence]

price_outliers[['Brand', 'Model', 'Final Price']].head()

,Brand,Model,Final Price
100,Razer,Blade,3299.99
292,Asus,ROG,3699.01
307,Asus,ROG,3699.01
351,Asus,ROG,3599.0
361,Asus,ROG,3399.0
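
The filter in this cell checks only the upper fence. As a minimal sketch, assuming the same 1.5 × IQR rule, a small helper could return both fences so that unusually low prices are flagged too. The helper name `iqr_fences` and the toy prices are hypothetical and not part of the notebook.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices (made up for illustration, not taken from laptops.csv)
toy_prices = pd.Series(
    [300.0, 500.0, 700.0, 900.0, 1100.0, 5000.0], name="Final Price"
)

lower, upper = iqr_fences(toy_prices)

# Keep only the values that fall outside either fence
outlier_mask = (toy_prices < lower) | (toy_prices > upper)
toy_outliers = toy_prices[outlier_mask]
print(toy_outliers)
```

On the real data, the same helper could be called as `iqr_fences(data['Final Price'])`. Since prices cannot be negative, the lower fence may well flag nothing.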


In [67]:
# Which numeric feature is most correlated with price?
correlation_matrix = data.select_dtypes(include="number").corr()
# Correlations with Final Price, excluding its correlation with itself
correlation_with_price = correlation_matrix['Final Price'].drop("Final Price").sort_values(ascending=False)
correlation_with_price.idxmax()

'RAM'
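
`idxmax()` reports only the single strongest feature. As a rough sketch, the full ranking can be inspected the same way. The toy frame below is made up for illustration: its column names mirror the notebook's numeric columns, but the values are not from `laptops.csv`.

```python
import pandas as pd

# Toy data (made up for illustration); price is built as 100 + 50 * RAM
toy = pd.DataFrame(
    {
        "RAM": [4, 8, 16, 32, 8],
        "Storage": [256, 512, 256, 1000, 512],
        "Screen": [15.6, 14.0, 15.6, 17.3, 13.3],
        "Final Price": [300.0, 500.0, 900.0, 1700.0, 500.0],
    }
)

# Pearson correlation of every numeric column with the price, strongest first
price_corr = (
    toy.select_dtypes(include="number")
    .corr()["Final Price"]
    .drop("Final Price")
    .sort_values(ascending=False)
)
print(price_corr)
```

Pearson correlation only covers numeric columns, so categorical features such as `Brand` or `CPU` are not ranked by this approach.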

In [69]:
# GPU is missing in about 63% of rows, so drop the column
data = data.drop(columns=["GPU"])

In [70]:
data.describe()

,RAM,Storage,Screen,Final Price
count,2160.0,2160.0,2156.0,2160.0
mean,15.413889,596.294444,15.168112,1312.638509
std,9.867815,361.220506,1.203329,911.475417
min,4.0,0.0,10.1,201.05
25%,8.0,256.0,14.0,661.0825
50%,16.0,512.0,15.6,1031.945
75%,16.0,1000.0,15.6,1708.97
max,128.0,4000.0,18.0,7150.47


### Exploratory data visualization

In [71]:
# count how many laptops use each storage type
Storage_type = data["Storage type"].value_counts()

In [72]:
# bar chart of laptops per storage type

px.bar(
    Storage_type,
    x=Storage_type.index,
    y=Storage_type.values,
    color=Storage_type.index,
    title="Number of Laptops per Storage Type",
    labels={"x": "Storage Type", "y": "Count"},
    color_discrete_sequence=px.colors.qualitative.Plotly,
)

In [89]:
# Share of laptops in each combination of Status (New, Refurbished) and Storage type (SSD, eMMC), as a fraction of all rows
table1 = pd.crosstab(data["Status"], data["Storage type"], margins=True, normalize=True)

In [90]:
table1

Status \ Storage type,SSD,eMMC,All
New,0.672805,0.023135,0.69594
Refurbished,0.300755,0.003305,0.30406
All,0.97356,0.02644,1.0
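
With `normalize=True`, each cell is a fraction of all rows. As a small sketch on made-up data, the difference from row-wise normalization (`normalize="index"`), which gives the storage mix within each status, looks like this. The toy frame is hypothetical and only reuses the notebook's column names.

```python
import pandas as pd

# Toy rows (made up for illustration, not taken from laptops.csv)
toy = pd.DataFrame(
    {
        "Status": ["New", "New", "New", "Refurbished"],
        "Storage type": ["SSD", "SSD", "eMMC", "SSD"],
    }
)

# normalize=True: every cell is a fraction of ALL rows (cells sum to 1)
share_of_total = pd.crosstab(toy["Status"], toy["Storage type"], normalize=True)

# normalize="index": every cell is a fraction of its own row (each row sums to 1)
share_within_status = pd.crosstab(
    toy["Status"], toy["Storage type"], normalize="index"
)

print(share_of_total)
print(share_within_status)
```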


In [91]:

# Average RAM (GB) for each combination of laptop Status and Storage type
table2 = pd.crosstab(data["Status"], data["Storage type"], values=data['RAM'], aggfunc='mean')

In [None]:
table2

Status \ Storage type,SSD,eMMC
New,15.09193,5.632653
Refurbished,17.44427,7.428571

In [95]:
# Each storage type's share (%) of the summed average RAM within each Status row
proportion_table = (table2.div(table2.sum(axis=1), axis=0) * 100).round()

proportion_table

fig = px.bar(
    proportion_table,
    x=proportion_table.index,
    y=proportion_table.columns,
    title="Share of Average RAM by Status and Storage Type",
    labels={"value": "Percentage"},
    text_auto='.1f',
    barmode='stack',
    color_discrete_sequence=px.colors.diverging.Portland
)

fig.update_layout(
    font=dict(size=15),
    xaxis_title='Status',
    legend_title_text='Storage type'
)

fig.update_traces(textposition='inside')
fig.show()

In [96]:
# Five most common CPU families
cpu = data["CPU"].value_counts().head()

In [97]:
# Bar chart of the five most common CPU families
px.bar(cpu)

In [98]:
# Average final price per brand
mean_price = data.groupby(by=["Brand"])["Final Price"].mean()

In [99]:
px.line(mean_price)


#### ✅ Solution  
This project performs an **Exploratory Data Analysis (EDA)** on a laptop dataset to uncover **key patterns** in pricing and specifications (RAM, processor, brand, storage, etc.). Using **Python (Pandas, Seaborn, Matplotlib, Plotly)**, the analysis reveals **trends, correlations, and outliers** that support a better understanding of the laptop market and help guide **purchasing or marketing decisions**.


### 🔍 Methodologies

- **Data Collection & Cleaning**  
  Imported the laptop dataset and handled missing values, duplicates, and inconsistent entries.

- **Data Preprocessing**  
  Converted data types where necessary, parsed categorical variables, and standardized numerical features.

- **Exploratory Data Analysis (EDA)**  
  Used summary statistics and visualizations (histograms, boxplots, bar charts) to understand distributions of key features like price, RAM, storage, and brand.

- **Correlation Analysis**  
  Computed correlation matrices and heatmaps to identify relationships between features such as price and specifications.

- **Feature Engineering**  
  Created new variables, such as price per GB of RAM or storage, to gain deeper insights.

- **Outlier Detection**  
  Identified and analyzed outliers using boxplots and scatter plots to understand pricing anomalies.

- **Visualization**  
  Employed Python libraries like Matplotlib, Seaborn, and Plotly for interactive and static visualizations to communicate findings.
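
The feature-engineering step can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's own code. The derived column names `price_per_gb_ram` and `price_per_gb_storage` are hypothetical. The zero-storage guard is an assumption, prompted by the minimum `Storage` value of 0 in the summary statistics.

```python
import pandas as pd

# Toy rows (made up for illustration); column names follow laptops.csv
toy = pd.DataFrame(
    {
        "Brand": ["A", "B", "C"],
        "RAM": [8, 16, 32],
        "Storage": [256, 512, 0],
        "Final Price": [400.0, 800.0, 1600.0],
    }
)

# Price per GB of RAM
toy["price_per_gb_ram"] = toy["Final Price"] / toy["RAM"]

# Price per GB of storage; a Storage of 0 would divide by zero,
# so those rows are treated as missing instead
storage_gb = toy["Storage"].where(toy["Storage"] > 0)
toy["price_per_gb_storage"] = toy["Final Price"] / storage_gb

print(toy)
```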

