# 01 — Data Cleaning

Purpose: produce a clean, analysis-ready CSV from the raw sales records. This notebook:

- Loads raw CSV(s) from `data/raw/`
- Handles missing values and duplicates
- Inspects column names and data types
- Exports cleaned CSV to `data/processed/BMW_Worldwide_Sales_Cleaned.csv`

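Run instructions: execute the cells in order. The only external input is the raw file `BMW_Worldwide_Sales_Records_2010_2024.csv`, which must be placed in `data/raw/`. The paths in the code are relative (`../data/...`), so the working directory must sit one level below the project root.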

In [None]:
# Import libraries
import pandas as pd
import os

# Define file paths
RAW_DATA_PATH = "../data/raw/BMW_Worldwide_Sales_Records_2010_2024.csv"
PROCESSED_PATH = "../data/processed/BMW_Worldwide_Sales_Cleaned.csv"

In [None]:
df = pd.read_csv(RAW_DATA_PATH)

print("Shape:", df.shape)
df.head()
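
The purpose statement mentions raw "CSV(s)", while the cell above reads a single file. If several raw files ever need to be combined, one possible approach is sketched below. The helper name `load_raw_csvs` is hypothetical, and the sketch assumes all files share the same columns.

```python
import glob
import os

import pandas as pd


def load_raw_csvs(raw_dir):
    """Read every *.csv in raw_dir and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(raw_dir, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"No CSV files found in {raw_dir}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

In this notebook it could be called as `df = load_raw_csvs("../data/raw")`. Sorting the paths keeps the row order reproducible between runs.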


In [None]:
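# Basic info (df.info() prints its report directly and returns None,
# so wrapping it in print() would add a stray "None" line)
print("Data info:")
df.info()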

# Check column names
print("\nColumns:")
print(df.columns.tolist())

# Quick overview of numeric stats
df.describe()
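
The cell above only inspects the column names; nothing in this notebook renames them. If consistent snake_case names are wanted before export, a minimal sketch could look like the following. The helper `normalize_columns` and the example column names are hypothetical, not taken from the dataset.

```python
import re

import pandas as pd


def normalize_columns(frame):
    """Return a copy of frame with snake_case column names."""
    def to_snake(name):
        name = str(name).strip().lower()
        # Collapse every run of non-alphanumeric characters into one underscore
        name = re.sub(r"[^0-9a-z]+", "_", name)
        return name.strip("_")

    return frame.rename(columns=to_snake)


# Hypothetical column names, for illustration only
example = pd.DataFrame({" Sales Volume ": [1], "Price (USD)": [2.0]})
print(normalize_columns(example).columns.tolist())
# ['sales_volume', 'price_usd']
```

`rename` returns a new DataFrame, so the result has to be assigned back (`df = normalize_columns(df)`) for the change to take effect.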


In [None]:
missing = df.isna().sum().sort_values(ascending=False)
missing_percent = (missing / len(df)) * 100

print("Missing values per column:")
print(pd.DataFrame({"Missing_Count": missing, "Missing_Percent": missing_percent.round(2)}))
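
On a wide table the report above lists every column, including those with nothing missing. A small variation that keeps only the affected columns is sketched here on toy data. The function name `missing_report` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(frame):
    """Count and percentage of missing values, only for columns that have any."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "Missing_Count": counts,
        "Missing_Percent": (counts / len(frame) * 100).round(2),
    })
    affected = report[report["Missing_Count"] > 0]
    return affected.sort_values("Missing_Count", ascending=False)


# Toy data: "a" is missing 2 of 4 values, "b" 1 of 4, "c" none
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_report(toy))
```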


In [None]:
duplicates = df.duplicated().sum()
print("Duplicate rows found:", duplicates)

if duplicates > 0:
    df = df.drop_duplicates()
    print("Removed duplicates. New shape:", df.shape)
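
`df.duplicated()` with no arguments flags a row only when every column matches an earlier row. If records should instead be unique on a set of key columns, the `subset` argument can be used. The sketch below assumes hypothetical key columns `Year` and `Region`; the real keys depend on the dataset's schema.

```python
import pandas as pd


def drop_key_duplicates(frame, key_cols, keep="first"):
    """Drop rows that repeat an earlier combination of key_cols.

    Returns the de-duplicated frame and the number of rows removed.
    """
    dup_mask = frame.duplicated(subset=key_cols, keep=keep)
    return frame.loc[~dup_mask].reset_index(drop=True), int(dup_mask.sum())


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Region": ["Europe", "Europe", "Asia"],
    "Units": [100, 105, 80],
})
deduped, removed = drop_key_duplicates(toy, ["Year", "Region"])
print(removed)                    # 1
print(deduped["Units"].tolist())  # [100, 80]
```

With `keep="first"` the earliest row of each key combination survives, so the choice matters whenever the duplicates disagree on other columns, as the two `Units` values do here.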


In [None]:
# Separate numeric and object columns
num_cols = df.select_dtypes(include=['number']).columns
obj_cols = df.select_dtypes(include=['object']).columns

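# Fill numeric columns with median. Assign the result back: calling
# fillna(..., inplace=True) on df[col] is chained assignment, which newer
# pandas versions warn about and which may leave df unchanged.
for col in num_cols:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)

# Fill object columns with mode
for col in obj_cols:
    modes = df[col].mode()
    if not modes.empty:  # mode() is empty when the whole column is missing
        df[col] = df[col].fillna(modes.iloc[0])

print("Missing values after filling:")
print(df.isna().sum().sum())  # 0 unless a column was entirely missing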
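
To make the effect of median and mode imputation concrete, here is a small worked example on toy data. The function `impute_median_mode` and the toy column names are illustrative assumptions. The toy numeric column contains an outlier to show why the median can be a safer fill value than the mean.

```python
import numpy as np
import pandas as pd


def impute_median_mode(frame):
    """Return a copy with numeric NaNs filled by median and the rest by mode."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        out[col] = out[col].fillna(out[col].median())
    for col in out.select_dtypes(exclude="number").columns:
        modes = out[col].mode()
        if not modes.empty:  # skip columns with no observed values
            out[col] = out[col].fillna(modes.iloc[0])
    return out


toy = pd.DataFrame({
    "units": [10.0, np.nan, 30.0, 1000.0],   # 1000.0 is an outlier
    "region": ["EU", "EU", None, "Asia"],
})
filled = impute_median_mode(toy)
print(filled["units"].tolist())   # [10.0, 30.0, 30.0, 1000.0]
print(filled["region"].tolist())  # ['EU', 'EU', 'EU', 'Asia']
```

The median of the observed values (10, 30, 1000) is 30, whereas the mean would be about 346.7, so the outlier barely influences the filled value.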


In [None]:
print("Data shape after cleaning:", df.shape)
df.info()


In [None]:
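# Save cleaned data for analysis in EDA notebook
os.makedirs(os.path.dirname(PROCESSED_PATH), exist_ok=True)  # create the output folder if it is missing
df.to_csv(PROCESSED_PATH, index=False)
print(f"Cleaned dataset saved to {PROCESSED_PATH}")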
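
A quick way to confirm the export is to read the file back and compare it with the in-memory frame. The helper below is a hedged sketch; `verify_saved_csv` is a hypothetical name and the checks are deliberately basic.

```python
import pandas as pd


def verify_saved_csv(frame, path):
    """Reload a saved CSV and compare its basic structure with frame."""
    reloaded = pd.read_csv(path)
    return {
        "same_shape": reloaded.shape == frame.shape,
        "same_columns": reloaded.columns.tolist() == frame.columns.tolist(),
        "missing_cells": int(reloaded.isna().sum().sum()),
    }
```

In this notebook it could be called as `verify_saved_csv(df, PROCESSED_PATH)`. Note that `read_csv` treats empty fields and strings such as `"NA"` as missing by default, so `missing_cells` can be nonzero after reloading even when the in-memory frame had no missing values.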


## Summary

This notebook cleans the raw BMW sales dataset, with the following outcomes:

**Key Actions:**
- Loaded 15 years of worldwide sales data (2010-2024)
- Identified and removed duplicate records
- Handled missing values using median (numeric) and mode (categorical) imputation
- Inspected data types and column names (`df.info()`, `df.columns`)

**Output:**
- Clean dataset saved to `data/processed/BMW_Worldwide_Sales_Cleaned.csv`
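- No missing values in the final dataset, provided the check after imputation prints 0 (a column with no observed values cannot be imputed)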
- Data ready for exploratory analysis and feature engineering

**Next Steps:** Proceed to `02_eda.ipynb` for exploratory analysis and visualization of trends.