# 01 — EDA: Laptop Price Dataset 🔍💻

<p align="left">
  <img alt="Python" src="https://img.shields.io/badge/Python-3.10+-3776AB?logo=python&logoColor=white">
  <img alt="Pandas" src="https://img.shields.io/badge/Pandas-Data%20Wrangling-150458?logo=pandas&logoColor=white">
  <img alt="Plotly" src="https://img.shields.io/badge/Plotly-Interactive%20Charts-3F4F75?logo=plotly&logoColor=white">
  <img alt="Status" src="https://img.shields.io/badge/Notebook-EDA-6aa84f">
</p>

> <strong>Purpose</strong>: Explore, validate, and visualize the laptop price dataset.  
> <strong>Data Source</strong>: Google Drive (Kaggle optional).  
> <strong>Author</strong>: <span style="color:#6C63FF"><b>Noëlla Buti</b></span>

---

### 📋 What you’ll do
- ✅ Load data safely (encoding fallback)  
- ✅ Quick sanity checks (dtypes, NAs)  
- ✅ Descriptive stats & distributions  
- ✅ Outliers (box plots) & correlations (heatmap)

<details>
  <summary><b>📁 Paths & Setup (click to expand)</b></summary>

- Drive directory: <code>/content/drive/MyDrive</code>  
- CSV filename: <code>laptop_prices.csv</code>  
- If you see a “file not found” error, confirm both the folder and filename.
</details>
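
If the path check fails, listing the CSV files that actually sit in the Drive folder usually reveals a misspelled or differently cased filename. The following is a minimal sketch of that idea; `find_csv_candidates` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path


def find_csv_candidates(directory, expected_name):
    """Return CSV files in `directory`, with case-insensitive name matches first.

    Hypothetical helper for debugging a "file not found" error.
    """
    root = Path(directory)
    if not root.is_dir():
        # The folder itself is missing (for example, Drive is not mounted)
        return []
    csvs = [p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".csv"]
    target = expected_name.lower()
    # False sorts before True, so files whose name matches come first
    return sorted(csvs, key=lambda p: (p.name.lower() != target, p.name.lower()))
```

Calling `find_csv_candidates("/content/drive/MyDrive", "laptop_prices.csv")` after mounting Drive returns an empty list when the folder is wrong, and otherwise shows which CSV names are available.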

---

### 🎯 Quick Links
- ▶️ <a href="#scrollTo=load-data">Jump to Load Data</a>  
- 📈 <a href="#scrollTo=plots">Jump to Visuals</a>  
- 🧪 <a href="#scrollTo=checks">Jump to Checks</a>

---

> 💡 **Tip:** Keep plots interactive (no file writes) to save space in Colab.

## 1. Setup and Load Cell

In [18]:
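# === Setup ===
import os
import random
import warnings

import numpy as np
import pandas as pd

# Note: this hides every warning, including pandas deprecation notices
warnings.filterwarnings("ignore")
SEED = 42  # fixed seed so any random sampling is reproducible
random.seed(SEED)
np.random.seed(SEED)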

# Drive path (edit if your CSV lives elsewhere)
DRIVE_DIR = "/content/drive/MyDrive"
CSV_NAME = "laptop_prices.csv"
CSV_PATH = f"{DRIVE_DIR}/{CSV_NAME}"

from google.colab import drive
drive.mount('/content/drive')
assert os.path.exists(CSV_PATH), f"CSV not found at {CSV_PATH}"

# Robust read (UTF-8 then ISO-8859-1)
try:
    df = pd.read_csv(CSV_PATH)
except UnicodeDecodeError:
    df = pd.read_csv(CSV_PATH, encoding="ISO-8859-1")

print(f"Rows: {len(df):,} | Cols: {len(df.columns)}")
df.head(3)

Rows: 1,275 | Cols: 23


|  | Company | Product | TypeName | Inches | Ram | OS | Weight | Price_euros | Screen | ScreenW | ... | RetinaDisplay | CPU_company | CPU_freq | CPU_model | PrimaryStorage | SecondaryStorage | PrimaryStorageType | SecondaryStorageType | GPU_company | GPU_model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Apple | MacBook Pro | Ultrabook | 13.3 | 8 | macOS | 1.37 | 1339.69 | Standard | 2560 | ... | Yes | Intel | 2.3 | Core i5 | 128 | 0 | SSD | No | Intel | Iris Plus Graphics 640 |
| 1 | Apple | Macbook Air | Ultrabook | 13.3 | 8 | macOS | 1.34 | 898.94 | Standard | 1440 | ... | No | Intel | 1.8 | Core i5 | 128 | 0 | Flash Storage | No | Intel | HD Graphics 6000 |
| 2 | HP | 250 G6 | Notebook | 15.6 | 8 | No OS | 1.86 | 575.0 | Full HD | 1920 | ... | No | Intel | 2.5 | Core i5 7200U | 256 | 0 | SSD | No | Intel | HD Graphics 620 |
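
The encoding fallback in the load cell can also be written as a small reusable function that reports which encoding succeeded. This is a sketch under stated assumptions: `read_csv_with_fallback` is a hypothetical name, and the encoding list simply mirrors the two encodings the notebook tries.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order and return (DataFrame, encoding_used).

    Hypothetical helper; the notebook itself uses an inline try/except.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding
            last_error = err
    raise ValueError(f"None of {encodings} could decode {path}") from last_error
```

A call such as `df, used = read_csv_with_fallback(CSV_PATH)` makes the fallback visible in the output. Because ISO-8859-1 maps every byte to a character, it never raises a decode error, so it is worth glancing at text columns for garbled accents when that encoding is the one reported.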


## 2. Data Checks Cell

In [19]:
display(df.dtypes)
print("\nMissing values per column:")
print(df.isnull().sum().sort_values(ascending=False).to_string())

# Ensure numeric columns are numeric
num_like = ["Inches","Ram","Weight","Price_euros","ScreenW","ScreenH","CPU_freq",
            "PrimaryStorage","SecondaryStorage"]
for c in num_like:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

# Cast common categoricals
cat_cols = ["Company","Product","TypeName","OS","Screen","Touchscreen","IPSpanel",
            "RetinaDisplay","CPU_company","CPU_model","PrimaryStorageType",
            "SecondaryStorageType","GPU_company","GPU_model"]
for c in cat_cols:
    if c in df.columns:
        df[c] = df[c].astype("category")

| Column | dtype |
| --- | --- |
| Company | object |
| Product | object |
| TypeName | object |
| Inches | float64 |
| Ram | int64 |
| OS | object |
| Weight | float64 |
| Price_euros | float64 |
| Screen | object |
| ScreenW | int64 |



Missing values per column:
Company                 0
Product                 0
TypeName                0
Inches                  0
Ram                     0
OS                      0
Weight                  0
Price_euros             0
Screen                  0
ScreenW                 0
ScreenH                 0
Touchscreen             0
IPSpanel                0
RetinaDisplay           0
CPU_company             0
CPU_freq                0
CPU_model               0
PrimaryStorage          0
SecondaryStorage        0
PrimaryStorageType      0
SecondaryStorageType    0
GPU_company             0
GPU_model               0
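
The missing-value counts are printed before `pd.to_numeric(..., errors="coerce")` runs, so any value that coercion would turn into `NaN` does not show up in them. One way to check for this is to count such values before converting. The sketch below assumes a hypothetical helper, `coercion_report`, and uses a small made-up frame purely for illustration.

```python
import pandas as pd


def coercion_report(frame, columns):
    """Count values per column that are present now but would become NaN
    under pd.to_numeric(errors="coerce"). Columns not in the frame are skipped.
    """
    report = {}
    for col in columns:
        if col not in frame.columns:
            continue
        converted = pd.to_numeric(frame[col], errors="coerce")
        # Present before conversion, missing after: this value was coerced away
        report[col] = int((converted.isna() & frame[col].notna()).sum())
    return pd.Series(report, name="coerced_to_nan", dtype="int64")


# Illustrative data only (not taken from the laptop dataset)
demo = pd.DataFrame({"Ram": ["8", "16GB", "4"], "Weight": [1.37, None, 1.86]})
print(coercion_report(demo, ["Ram", "Weight", "Inches"]))
```

In the notebook this would be called as `coercion_report(df, num_like)` ahead of the conversion loop; a nonzero count points to a column that needs cleaning rather than silent coercion.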


## 3. EDA

In [20]:
import plotly.express as px
import plotly.graph_objects as go

# Descriptives
num_cols = ["Ram","Weight","CPU_freq","Price_euros"]
desc = pd.DataFrame({
    "Mean": df[num_cols].mean(),
    "Median": df[num_cols].median(),
    "Mode": df[num_cols].mode().iloc[0],
}).round(3)
display(desc)  # a bare `desc` is only rendered when it is the last line of a cell

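# Histograms: one panel per variable, since the four columns are on very
# different scales and would be unreadable on a single shared axis
from plotly.subplots import make_subplots

fig = make_subplots(rows=2, cols=2, subplot_titles=num_cols)
for i, col in enumerate(num_cols):
    fig.add_trace(go.Histogram(x=df[col], name=col, nbinsx=20),
                  row=i // 2 + 1, col=i % 2 + 1)
fig.update_layout(title="Distributions", showlegend=False)
fig.show()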

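# Boxplots: reshape to long form and facet so each variable gets its own y-axis
long_df = df[["Ram","Weight","CPU_freq","Price_euros"]].melt(var_name="variable", value_name="value")
fig = px.box(long_df, y="value", facet_col="variable", title="Outliers via Boxplots")
fig.update_yaxes(matches=None, showticklabels=True)
fig.show()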

# Correlations (Pearson); a diverging scale pinned to [-1, 1] keeps zero at the midpoint
corr = df[["Ram","Weight","CPU_freq","PrimaryStorage","Price_euros"]].corr()
px.imshow(corr, color_continuous_scale="RdBu_r", zmin=-1, zmax=1,
          title="Correlation Heatmap").show()
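
Box plots show outliers visually, and a numeric count can complement them. The sketch below applies the common 1.5 × IQR rule. `iqr_outlier_counts` is a hypothetical helper, the demo values are made up, and because pandas and Plotly may compute quartiles slightly differently, the counts may not match the plotted points exactly.

```python
import pandas as pd


def iqr_outlier_counts(frame, columns, k=1.5):
    """For each column, return the IQR fences and the number of values outside them."""
    rows = {}
    for col in columns:
        s = frame[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        rows[col] = {
            "lower": lower,
            "upper": upper,
            "n_outliers": int(((s < lower) | (s > upper)).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Illustrative data only (not taken from the laptop dataset)
demo_prices = pd.DataFrame({"Price_euros": [500, 600, 650, 700, 800, 6000]})
print(iqr_outlier_counts(demo_prices, ["Price_euros"]))
```

In the notebook, `iqr_outlier_counts(df, num_cols)` would give one row per numeric column, which makes it easier to compare how heavy the tails are across `Ram`, `Weight`, `CPU_freq`, and `Price_euros`.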