# EDA: Car Price Analysis

## Overview
This notebook explores relationships in the cleaned car price dataset to surface key pricing drivers and patterns.

**Charts included:**
- **Plotly (interactive):** Price vs Engine Size
- **Seaborn:** Violin plot of Price by Engine Size Band (Small/Medium/Large)
- **Matplotlib:** Top 10 Car Makes by Average Price

We assume the dataset was cleaned during ETL and saved as `../data/car_price_clean.csv`.

## Inputs

- **Cleaned dataset**: `data/car_price_clean.csv` (read from this notebook as `../data/car_price_clean.csv`)
  - Standardised columns and corrected brand names.
  - Engineered features (e.g., `price_per_cc`, `power_to_weight`, `mpg_ratio`).
  - One-hot encoded categorical variables.
- **Python libraries**: `pandas`, `numpy`, `matplotlib`, `seaborn`, `plotly`.

## Outputs

- 1 interactive Plotly chart.
- 1 Seaborn violin plot.
- 1 Matplotlib bar chart.
- Observations and insights for inclusion in the README.

## Import Libraries

We will use the following libraries for data analysis and visualisation:

- **pandas**: Data manipulation and analysis
- **numpy**: Numerical operations
- **matplotlib**: Basic plotting
- **seaborn**: Statistical visualisation
- **plotly**: Interactive visualisation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Optional: set plot styles for consistency
sns.set_theme(style="whitegrid")
# Default figure size, matching the (10, 6) used by the charts below
plt.rcParams["figure.figsize"] = (10, 6)

## 1. Load Cleaned Data

We will load the cleaned dataset `car_price_clean.csv` generated from the ETL process.  
This file contains:
- Standardised column names.
- Corrected brand names.
- Engineered features (e.g., `price_per_cc`, `power_to_weight`, `mpg_ratio`).
- One-hot encoded categorical variables.

The dataset is stored in the `data/` folder.

In [None]:
import pandas as pd
from pathlib import Path

# Path to cleaned dataset
data_file = Path("../data/car_price_clean.csv")

# Load the cleaned dataset
try:
    df = pd.read_csv(data_file)
    print(f"Cleaned dataset loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")
except FileNotFoundError as exc:
    # Re-raise with a more helpful message while keeping the original traceback
    raise FileNotFoundError(
        f"Cleaned dataset not found at {data_file}. Please run the ETL notebook first."
    ) from exc

# Preview the first few rows
df.head()

## 2. Quick Data Checks

We’ll sanity-check basic structure and key statistics before plotting.

In [None]:
# Basic info and quick stats
display(df.head())
print(df.shape)
df.describe(include='all').T.head(20)
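
Beyond shape and summary statistics, a per-column view of dtypes, missing values, and cardinality can catch problems before plotting. The following is a minimal sketch, not part of the original notebook: `summarise_quality` is a hypothetical helper, and the toy frame stands in for `df` with made-up values.

```python
import pandas as pd

def summarise_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, missing-value count, and number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a unique value
    })

# Toy stand-in for the cleaned dataset (values are invented for illustration)
toy = pd.DataFrame({
    "enginesize": [97, 130, 130, 209],
    "price": [7000.0, 13500.0, None, 30000.0],
})

quality = summarise_quality(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows

print(quality)
print(f"Duplicate rows: {n_duplicates}")
```

In the notebook itself, the same helper could be called as `summarise_quality(df)`; a cleaned dataset would be expected to show zero missing values, though that depends on what the ETL step actually did.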

## 3. Plotly: Price vs Engine Size (Interactive)

An interactive scatter plot shows how price changes with engine size and helps spot potential outliers.  
Hover over a point to reveal extra details where available (e.g., horsepower, mpg).

In [None]:
import plotly.express as px
from pathlib import Path

# Ensure required columns exist
required_cols = ["enginesize", "price"]
for col in required_cols:
    if col not in df.columns:
        raise ValueError(f"Column '{col}' not found in DataFrame.")

# Add extra hover columns only if they exist in the DataFrame
hover_cols = [c for c in ["model", "horsepower", "citympg", "highwaympg"] if c in df.columns]

# Create the Plotly figure
fig = px.scatter(
    df,
    x="enginesize",
    y="price",
    hover_data=hover_cols,
    title="Price vs Engine Size (Interactive)"
)
fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.update_layout(xaxis_title="Engine Size (cc)", yaxis_title="Price (USD)")

# Create the images directory (and any missing parents) if needed
Path("../images").mkdir(parents=True, exist_ok=True)

# Save as HTML file instead of showing (avoids renderer issues)
fig.write_html("../images/price_vs_enginesize.html")
print("Interactive chart saved to ../images/price_vs_enginesize.html")

# Also create a simple matplotlib version
plt.figure(figsize=(10, 6))
plt.scatter(df['enginesize'], df['price'], alpha=0.7)
plt.xlabel('Engine Size (cc)')
plt.ylabel('Price (USD)')
plt.title('Price vs Engine Size')
plt.grid(True, alpha=0.3)
plt.savefig('../images/price_vs_enginesize_static.png', dpi=300, bbox_inches='tight')
plt.show()
print("Static chart saved to ../images/price_vs_enginesize_static.png")
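
Spotting outliers by eye can be complemented with a simple numeric check. The sketch below fits a straight line of price on engine size and flags points whose residual is far from the rest. It is one possible approach under stated assumptions, not the notebook's method: `flag_price_outliers` and the `z_thresh=2.0` cut-off are hypothetical, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def flag_price_outliers(frame, x="enginesize", y="price", z_thresh=2.0):
    """Flag rows whose price deviates strongly from a straight-line fit on x."""
    # Least-squares line: price ≈ slope * enginesize + intercept
    slope, intercept = np.polyfit(frame[x], frame[y], deg=1)
    residuals = frame[y] - (slope * frame[x] + intercept)
    # Standardise residuals so the threshold is scale-free
    z = (residuals - residuals.mean()) / residuals.std(ddof=0)
    return frame.assign(residual=residuals, is_outlier=z.abs() > z_thresh)

# Toy data: price follows 100 * enginesize except for one inflated row (index 4)
toy = pd.DataFrame({
    "enginesize": [90, 100, 110, 120, 130, 140, 150, 160],
    "price": [9000, 10000, 11000, 12000, 30000, 14000, 15000, 16000],
})

flagged = flag_price_outliers(toy)
print(flagged[flagged["is_outlier"]])
```

A straight-line fit is a deliberate simplification; if the real relationship is curved, large residuals may reflect model misfit rather than unusual cars, so flagged rows are best treated as candidates for inspection.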

## 4. Seaborn: Price Distribution by Engine Size Band (Violin)

The original categorical columns may have been one-hot encoded during ETL.  
To keep a robust categorical comparison, we create **engine size bands**:

- **Small:** at or below the 33rd percentile  
- **Medium:** between the 33rd and 66th percentiles  
- **Large:** above the 66th percentile

This shows how price distributions vary across engine size groups.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create engine size bands (tertiles)
q = df["enginesize"].quantile([0.33, 0.66])
def band_engine_size(v):
    if v <= q.iloc[0]:
        return "Small"
    elif v <= q.iloc[1]:
        return "Medium"
    else:
        return "Large"

df["_engine_band"] = df["enginesize"].apply(band_engine_size)

plt.figure(figsize=(10,6))
sns.violinplot(
    data=df,
    x="_engine_band",
    y="price",
    order=["Small", "Medium", "Large"],  # fix the band order on the x-axis
    inner="quartile"
)
plt.title("Price Distribution by Engine Size Band")
plt.xlabel("Engine Size Band")
plt.ylabel("Price (USD)")
plt.tight_layout()
plt.show()

# (Optional) clean up helper column if you prefer
# df.drop(columns=["_engine_band"], inplace=True)
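
The banding step can also be written with `pandas.qcut`, which computes the quantile edges and returns an ordered categorical in one call. This is a sketch of an alternative, not the code used above: `add_engine_band` and the `engine_band` column name are hypothetical, the toy values are invented, and `qcut` splits at exact thirds rather than at 0.33 and 0.66, so band membership can differ slightly at the edges.

```python
import pandas as pd

def add_engine_band(frame, col="enginesize"):
    """Add an ordered Small/Medium/Large band based on tertiles of `col`."""
    bands = pd.qcut(frame[col], q=3, labels=["Small", "Medium", "Large"])
    return frame.assign(engine_band=bands)

# Toy engine sizes (invented), nine distinct values so each band gets three rows
toy = pd.DataFrame({"enginesize": [90, 97, 108, 120, 130, 146, 181, 209, 304]})

banded = add_engine_band(toy)
print(banded["engine_band"].value_counts(sort=False))
```

Because the result is an ordered categorical, plotting libraries will generally respect the Small, Medium, Large order without extra arguments. Note that `qcut` can raise an error about non-unique bin edges when many rows share the same engine size.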

## 5. Matplotlib: Top 10 Makes by Average Price

If the ETL kept a `make` column, we’ll use it directly.  
If `make` was one-hot encoded (e.g., `make_audi`, `make_bmw`, …), we **reconstruct** the make label by taking the
one-hot column with the highest value per row. Rows that belong to the dropped baseline category (due to `drop_first=True`)
are labelled **"other/ baseline"**.

In [None]:
import re
import numpy as np
import matplotlib.pyplot as plt

def reconstruct_make_from_dummies(df_in, prefix="make_"):
    """Return a Series with make labels reconstructed from one-hot columns.
       Baseline (all zeros) becomes 'other/ baseline'."""
    make_cols = [c for c in df_in.columns if c.startswith(prefix)]
    if not make_cols:
        return None  # Nothing to reconstruct
    # Argmax across available make_* columns
    make_matrix = df_in[make_cols].to_numpy(dtype=float)
    idx = make_matrix.argmax(axis=1)
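    # Rows with all zeros belong to the dropped baseline category
    all_zero = (make_matrix.sum(axis=1) == 0)
    # Strip the prefix (escaped, so regex metacharacters in it are treated literally)
    labels = np.array([re.sub(f"^{re.escape(prefix)}", "", make_cols[i]) for i in idx], dtype=object)
    labels[all_zero] = "other/ baseline"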
    return pd.Series(labels, index=df_in.index, name="make_reconstructed")

# Choose a make label to use
if "make" in df.columns:
    make_label = df["make"]
else:
    rec = reconstruct_make_from_dummies(df, prefix="make_")
    if rec is None:
        # Fallback: no make info available, so label every row "unknown"
        make_label = pd.Series(["unknown"]*len(df), index=df.index, name="make_reconstructed")
    else:
        make_label = rec

# Compute average price by make and plot top 10
avg_price_by_make = (
    pd.DataFrame({"make_label": make_label, "price": df["price"]})
    .groupby("make_label")["price"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

plt.figure(figsize=(10,6))
avg_price_by_make.plot(kind="bar", edgecolor="black")
plt.title("Top 10 Makes by Average Price")
plt.xlabel("Make")
plt.ylabel("Average Price (USD)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
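
To make the reconstruction step concrete, here is a small worked example on a toy frame. It is a sketch using `DataFrame.idxmax` instead of the NumPy `argmax` route above; the column names and prices are invented, and the last row is all zeros to mimic a car from the dropped baseline make.

```python
import pandas as pd

prefix = "make_"

# Toy one-hot frame (invented values); row 3 has no make_* flag set
toy = pd.DataFrame({
    "make_audi":   [1, 0, 0, 0],
    "make_bmw":    [0, 1, 0, 0],
    "make_toyota": [0, 0, 1, 0],
    "price":       [17000.0, 21000.0, 8000.0, 15000.0],
})

make_cols = [c for c in toy.columns if c.startswith(prefix)]
dummies = toy[make_cols]

# Column with the highest value per row, with the prefix sliced off
labels = dummies.idxmax(axis=1).str[len(prefix):]
# idxmax returns the first column for an all-zero row, so overwrite those rows
labels = labels.where(dummies.sum(axis=1) > 0, "other/ baseline")

print(labels.tolist())  # ['audi', 'bmw', 'toyota', 'other/ baseline']
```

The true make of a baseline row cannot be recovered from the dummies alone, which is why it gets a placeholder label rather than a brand name.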

## EDA Notes & Next Steps

**What we saw**
- Price generally increases with engine size, but there are notable high-price points at moderate engine sizes (likely brand/trim effects).
- Price distributions differ across engine size bands; “Large” engines unsurprisingly skew higher.
- The highest average prices cluster among a small set of makes (or the reconstructed baseline), indicating brand effects.

**What to do next**
- Explore relationships with engineered features (`price_per_cc`, `power_to_weight`, `mpg_ratio`).
- Examine correlations and multicollinearity among numeric variables.
- Add interactive breakdowns (Plotly) by reconstructed make or by other meaningful groupings.
- Feed insights into the README and outline potential features for modelling.
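
As a starting point for the correlation and multicollinearity step, the sketch below lists pairs of numeric columns whose absolute Pearson correlation exceeds a threshold. `high_correlation_pairs` and the 0.8 cut-off are hypothetical choices for illustration, and the toy values are invented; on the real data it would be called with `df`.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame, threshold=0.8):
    """Return column pairs with |Pearson r| above `threshold`, strongest first."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy numeric features (invented values)
toy = pd.DataFrame({
    "enginesize": [90, 100, 110, 120, 130],
    "horsepower": [60, 70, 80, 95, 110],
    "citympg":    [40, 35, 33, 28, 22],
})

pairs = high_correlation_pairs(toy)
print(pairs)
```

Pairwise correlation only detects two-variable redundancy; one-hot columns and engineered ratios such as `price_per_cc` may be related to their inputs in ways this check does not fully capture.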