# Real-World Use Case: Used Car Price Prediction

## 1. The Problem
A used car dealership wants to price its cars competitively. If a price is too high, the car doesn't sell; if it is too low, the dealership loses money.
*   **Goal**: Predict the market price of a used car based on its specs.

## 2. Why Linear Regression?
*   **Relationship**: We assume depreciation is roughly linear in age and mileage (for example, the price drops by $X per mile). The simulated data in Section 3 is built on this assumption.
*   **Interpretability**: We need to tell the customer "Your car is worth $500 less because of the high mileage". Coefficients allow this.
*   **Output**: Continuous value (Price).
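To make the interpretability point concrete, here is a minimal sketch on made-up numbers. The toy prices and the $0.05-per-mile slope are assumptions chosen for illustration, not real market data. It fits a one-feature model and reads the effect of extra mileage directly off the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): price falls exactly $0.05 per mile.
mileage = np.array([10_000, 30_000, 50_000, 70_000, 90_000], dtype=float)
price = 20_000 - 0.05 * mileage

toy_model = LinearRegression().fit(mileage.reshape(-1, 1), price)
per_mile = toy_model.coef_[0]

# The coefficient is the price change per additional mile, so the effect of
# 10,000 extra miles is the coefficient times 10,000.
extra_miles = 10_000
price_change = per_mile * extra_miles
print(f"Each extra mile changes the price by ${per_mile:.2f}")
print(f"{extra_miles:,} extra miles change the price by ${price_change:,.0f}")
```

With these toy numbers, 10,000 extra miles correspond to a $500 drop, which is the kind of statement a dealer can pass on to a customer.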

## 3. Data Simulation
We will generate a dataset with:
*   **Age**: Years since manufacture.
*   **Mileage**: Total miles driven.
*   **Brand**: Toyota, BMW, or Ford (Categorical).
*   **Fuel**: Petrol or Diesel (Categorical).
*   **Price**: Target variable.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score

# 1. Generate synthetic data
np.random.seed(42)  # fixed seed so the data is reproducible
n = 500

age = np.random.randint(1, 15, n)  # 1 to 14 (upper bound is exclusive)
mileage = np.random.randint(5000, 200000, n)
brand = np.random.choice(['Toyota', 'BMW', 'Ford'], n)
fuel = np.random.choice(['Petrol', 'Diesel'], n)

# Base price logic: linear depreciation in age and mileage, plus Gaussian noise
base_price = 30000
price = base_price - (age * 1200) - (mileage * 0.05) + np.random.normal(0, 1500, n)

# Brand and fuel adjustments (vectorized; Toyota and Petrol are unchanged)
price += np.where(brand == 'BMW', 10000, 0)
price -= np.where(brand == 'Ford', 2000, 0)
price += np.where(fuel == 'Diesel', 1000, 0)

# Ensure no price falls below a floor of 500
price = np.maximum(price, 500)

df = pd.DataFrame({'Age': age, 'Mileage': mileage, 'Brand': brand, 'Fuel': fuel, 'Price': price})
print("Sample Data:")
display(df.head())  # display() is available in Jupyter/IPython
```
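As a quick reference for later checks, the price rules in the simulation can be written as a small function that returns the noise-free price for one car. This is only a sketch: `expected_price` is a hypothetical helper that is not part of the notebook, and it applies the 500 floor to the noise-free value rather than to the noisy one.

```python
def expected_price(age, mileage, brand, fuel):
    """Noise-free price implied by the simulation rules (hypothetical helper)."""
    brand_adjustment = {'Toyota': 0, 'BMW': 10000, 'Ford': -2000}
    fuel_adjustment = {'Petrol': 0, 'Diesel': 1000}
    price = 30000 - 1200 * age - 0.05 * mileage
    price += brand_adjustment[brand] + fuel_adjustment[fuel]
    return max(price, 500)  # same price floor as the simulation


example = expected_price(age=3, mileage=25000, brand='BMW', fuel='Diesel')
print(f"Expected price before noise: ${example:,.2f}")
```

For this car the arithmetic is 30000 - 3600 - 1250 + 10000 + 1000 = 36150, so a well-fitted model should predict a price close to that figure.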

## 4. Feature Engineering Pipeline
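We fit a standard Linear Regression model, but the data needs preprocessing first:
*   **Scaling**: Age ranges from 1 to 14, while Mileage ranges from 5,000 to 200,000. Standardizing the numeric features does not change the predictions of ordinary least squares, but it puts the coefficients on a comparable scale (effect per standard deviation). The target, Price, is left unscaled.
*   **Encoding**: Brand and Fuel are strings, so we One-Hot Encode them. We drop the first level of each feature so that the dummy columns are not collinear with the intercept.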

```python
# Split Data
X = df.drop('Price', axis=1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Preprocessing
numeric_features = ['Age', 'Mileage']
categorical_features = ['Brand', 'Fuel']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

# Create Pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', LinearRegression())])

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"R2 Score: {r2_score(y_test, y_pred):.2f} (Variance Explained)")
print(f"MAE: ${mean_absolute_error(y_test, y_pred):,.2f} (Average Error)")

# Example Prediction
new_car = pd.DataFrame({'Age': [3], 'Mileage': [25000], 'Brand': ['BMW'], 'Fuel': ['Diesel']})
pred_price = model.predict(new_car)[0]
print(f"\nPredicted Price for 3-yr old BMW with 25k miles: ${pred_price:,.2f}")
```