# Day 03: Making the Dataset Ready for ML

In this notebook, I'm **learning step by step** how to clean the `vehicles.csv` dataset. I'll make notes along the way so I understand *why* each step is necessary.

Goals:
1. Load and explore the raw dataset.
2. Check for missing values, duplicates, and inconsistent data.
3. Handle outliers and incorrect entries.
4. Create new features (if useful).
5. Save the cleaned dataset.

---

## Step 1: Load the dataset

🔹 First, I will import the libraries and load the `vehicles.csv` file.

👉 Note: In Colab, I may need to upload the file manually.

In [None]:
import pandas as pd
import numpy as np

# Load dataset (update path if needed)
df = pd.read_csv('vehicles.csv')

# Show shape and first rows
print("Shape:", df.shape)
df.head()
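
If the file has not been uploaded yet, the load step fails. A minimal sketch of a loader that checks for the file first; the helper name `load_vehicles` is my own (hypothetical), not part of pandas:

```python
from pathlib import Path

import pandas as pd


def load_vehicles(path="vehicles.csv"):
    """Read the CSV at `path`, failing early with a clear message if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path} not found. Upload it first (e.g. through the Colab file panel) "
            "or pass the correct path."
        )
    return pd.read_csv(csv_path)


# Usage: a missing file produces a short message that says what to do next.
try:
    load_vehicles("no_such_file.csv")
except FileNotFoundError as err:
    message = str(err)
    print(message)
```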

## Step 2: Quick dataset overview

🔹 I want to understand which columns exist, what their types are, and what kind of data they contain.

In [None]:
# Info and summary statistics
df.info()
df.describe(include='all').T
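
The goals also mention inconsistent data. One common case is the same category written in several ways. A small sketch on toy data; the column name `manufacturer` is an assumption about what `vehicles.csv` contains:

```python
import pandas as pd

# Toy data; `manufacturer` is an assumed column name, used for illustration only.
toy = pd.DataFrame({"manufacturer": ["Ford", "ford ", "FORD", "toyota", None]})

# Strip whitespace and lowercase so variants of the same label collapse together.
# Missing values stay missing.
toy["manufacturer"] = toy["manufacturer"].str.strip().str.lower()

# Count each label, including missing values, to see what is left.
counts = toy["manufacturer"].value_counts(dropna=False)
print(counts)
```

Running `value_counts` before and after the normalization is a quick way to see which labels were merged.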

## Step 3: Missing values check

🔹 Missing values can cause problems for ML models, so I'll check which columns have them.

In [None]:
# Count missing values
df.isnull().sum().sort_values(ascending=False).head(20)

👉 Notes:
- Some columns might be missing a large share of their values (like `size`, `drive`, etc.).
- I should decide later whether to **drop those columns** or **fill them**.
- The strategy depends on how important the column is and how much of it is missing.
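
One way to turn these notes into code, as a sketch on toy data. The 50% cutoff, the median fill, and the `"unknown"` placeholder are assumptions I picked for illustration, not rules:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [5000, 12000, 8000, 15000],
    "size": [np.nan, np.nan, np.nan, "mid-size"],     # 75% missing
    "drive": ["fwd", np.nan, "4wd", "fwd"],           # 25% missing
    "odometer": [90000.0, np.nan, 60000.0, 30000.0],  # 25% missing
})

# Share of missing values per column (0.0 to 1.0).
missing_share = toy.isnull().mean()

# Drop columns that are missing more than the cutoff (hypothetical threshold).
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index
cleaned = toy.drop(columns=to_drop)

# Fill what remains: median for numeric columns, a placeholder for categorical ones.
cleaned["odometer"] = cleaned["odometer"].fillna(cleaned["odometer"].median())
cleaned["drive"] = cleaned["drive"].fillna("unknown")

print(cleaned)
```

The median is less sensitive to extreme values than the mean, which matters for columns like `odometer` before outliers are handled.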

## Step 4: Remove duplicates

🔹 Duplicate rows add no new information and can bias the analysis.

In [None]:
before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"Removed {before - after} duplicates")
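
`drop_duplicates()` only removes rows that match in every column. If the data has a unique identifier per row, two copies of the same listing will not count as duplicates. A sketch on toy data, assuming a hypothetical `id` column:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],                    # hypothetical unique listing id
    "price": [5000, 5000, 7000],
    "year": [2012, 2012, 2015],
    "odometer": [90000, 90000, 40000],
})

# Exact-row check finds nothing, because `id` differs on every row.
exact_dupes = toy.duplicated().sum()

# Ignoring `id`, the first two rows describe the same car.
content_cols = [c for c in toy.columns if c != "id"]
content_dupes = toy.duplicated(subset=content_cols).sum()

# Keep the first occurrence of each repeated listing.
deduped = toy.drop_duplicates(subset=content_cols, keep="first")

print(exact_dupes, content_dupes, len(deduped))
```

Which columns define "the same listing" is a judgment call that depends on the dataset.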

## Step 5: Handle outliers

🔹 Some prices or odometer values might be unrealistic (e.g., a car priced at $1 or one with 5 million miles).

👉 Common technique: use the **IQR method** (interquartile range) and remove values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

In [None]:
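# Compute the IQR fences for price
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
low, high = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Keep rows inside the fences (rows with a missing price are dropped too)
df = df[(df['price'] >= low) & (df['price'] <= high)]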

print("Shape after removing price outliers:", df.shape)

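👉 I can repeat this method for other numeric columns like `odometer` if needed. One caveat: for a right-skewed column like `price`, the lower fence is often zero or negative, so the IQR filter alone will not remove a $1 listing. That case needs an explicit minimum price.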
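
To avoid copying the same five lines per column, the IQR filter can be wrapped in a function. A sketch on toy data; the helper names `iqr_bounds` and `drop_iqr_outliers` and the $500 price floor are my own assumptions:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies inside the fences (NaN rows are dropped)."""
    low, high = iqr_bounds(df[column], k)
    return df[df[column].between(low, high)]


toy = pd.DataFrame({
    "price": [1, 4000, 5000, 6000, 7000, 8000, 250000],
    "odometer": [50000, 60000, 70000, 5000000, 90000, 100000, 110000],
})

low, high = iqr_bounds(toy["price"])
print("price fences:", low, high)

# Apply the same filter to both columns, one after the other.
after_iqr = drop_iqr_outliers(toy, "price")
after_iqr = drop_iqr_outliers(after_iqr, "odometer")

# The $1 listing is still there, so add a domain-based floor (hypothetical value).
min_price = 500
after_floor = after_iqr[after_iqr["price"] >= min_price]

print(after_floor)
```

Here the lower price fence comes out at 0, so the $1 row passes the IQR filter and is only removed by the explicit floor.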

## Step 6: Feature engineering (learning practice)

🔹 Create some new columns to help models later:
- **Car age** = current year − year of the car.
- **Price per mile** = price / (odometer + 1). The +1 avoids dividing by zero when the odometer reads 0.
- **Log price** = log(1 + price), a log transformation that reduces skewness.

In [None]:
from datetime import datetime

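# Age in years; rows with a missing year get NaN
df['car_age'] = datetime.now().year - df['year']
# Add 1 to the odometer so a reading of 0 does not cause division by zero
df['price_per_mile'] = df['price'] / (df['odometer'] + 1)
# log1p computes log(1 + price), which is safe when price is 0
df['log_price'] = np.log1p(df['price'])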

df[['year', 'car_age', 'price', 'price_per_mile', 'log_price']].head()
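
A worked example on three toy rows shows how these features behave at the edges. I fix the reference year to 2024 here (an assumption) so the numbers are reproducible:

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2024  # assumption: fixed instead of datetime.now().year

toy = pd.DataFrame({
    "year": [2014, 2020, np.nan],      # last row has no year
    "price": [12000, 30000, 8000],
    "odometer": [60000, 0, 100000],    # second row has a zero reading
})

toy["car_age"] = REFERENCE_YEAR - toy["year"]
toy["price_per_mile"] = toy["price"] / (toy["odometer"] + 1)
toy["log_price"] = np.log1p(toy["price"])

print(toy)
```

The first car is 10 years old. The row with a missing year gets `NaN` for `car_age`. The row with a zero odometer gets a `price_per_mile` equal to its full price (30000), so such rows may need separate handling. `np.expm1` reverses `np.log1p` when the original price is needed again.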

## Step 7: Save the cleaned dataset

🔹 Finally, I'll save the cleaned version so I can use it later in modeling.

In [None]:
df.to_csv('vehicles_cleaned.csv', index=False)
print("✅ Cleaned dataset saved as vehicles_cleaned.csv")
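
A quick sanity check after saving is to read the file back and compare it with what was written. A sketch on toy data, writing to a temporary folder so nothing real is overwritten:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"price": [5000, 12000], "car_age": [10.0, 4.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "vehicles_cleaned.csv"
    toy.to_csv(out_path, index=False)   # index=False keeps the row index out of the file
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same shape and column order.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Without `index=False`, the reloaded file would gain an extra `Unnamed: 0` column and the shape check would fail.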

# ✅ Summary of what I learned
- How to load and explore a dataset in pandas.
- How to check for missing values and decide whether to drop or fill them.
- How to remove duplicates.
- How to detect and remove outliers using IQR.
- How to create new features (`car_age`, `price_per_mile`, `log_price`).
- How to save the cleaned dataset for later use.