# Car Breakdown Prediction Challenge
**Course:** Artificial Intelligence  
**Objective:** Predict whether a vehicle will experience a breakdown within the next 30 days.

This notebook includes:
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Preprocessing Pipeline
- Random Forest Baseline
- Logistic Regression Comparison
- Threshold Tuning
- Feature Importance
- Final Model Selection
- Kaggle Submission
- GenAI Usage Statement

## 1. Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve
)

sns.set_theme(style="whitegrid")

## 2. Load the data

In [None]:
train_df = pd.read_csv("train_CarBreakDown.csv")
test_df = pd.read_csv("test_CarBreakDown.csv")

train_df.head()

## 3. Basic info

In [None]:
train_df.info()
train_df.describe()

## Target Distribution

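The dataset is imbalanced: about 17% of vehicles are breakdown cases.
Accuracy alone is therefore misleading, since a model that always predicts "no breakdown" would already score about 83%.
We will prioritize: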
- Recall (class 1)
- F1-score
- ROC-AUC
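
To make the point about accuracy concrete, here is a minimal sketch on synthetic labels with the same 17% positive rate. The labels are made up for illustration and are not the challenge data. A trivial predictor that never flags a breakdown scores well on accuracy but finds none of the breakdowns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels for illustration only: 17 breakdowns out of 100 vehicles.
y_true = np.array([1] * 17 + [0] * 83)

# A trivial "model" that never predicts a breakdown.
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_zero)
baseline_recall = recall_score(y_true, y_always_zero, zero_division=0)
baseline_f1 = f1_score(y_true, y_always_zero, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Recall:   {baseline_recall:.2f}")
print(f"F1:       {baseline_f1:.2f}")
```

Recall and F1 on class 1 expose the failure that accuracy hides, which is why they are preferred here.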

## 4. Plot Target

In [None]:
sns.countplot(x="breakdown_next_30_days", data=train_df)
plt.title("Target Distribution")
plt.show()

## 5. Data Cleaning

In [None]:
def clean_data(df):
    df = df.copy()
    
    df["mileage_km"] = df["mileage_km"].clip(lower=0)
    df["engine_hours"] = df["engine_hours"].clip(lower=0)
    df["vehicle_age_years"] = df["vehicle_age_years"].clip(lower=0)
    df["oil_quality_pct"] = df["oil_quality_pct"].clip(0, 100)
    df["cleanliness_score"] = df["cleanliness_score"].clip(0, 100)
    
    return df

train_df = clean_data(train_df)
test_df = clean_data(test_df)
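
Clipping corrects out-of-range values silently, so it can be useful to count how many values are affected first. The sketch below uses a toy frame with two of the columns above; the values are invented for illustration.

```python
import pandas as pd

# Toy values for illustration; not taken from the challenge data.
demo = pd.DataFrame({
    "mileage_km": [-120.0, 45000.0, 98000.0],
    "oil_quality_pct": [35.0, 104.0, -3.0],
})

# Count out-of-range values before clipping so the correction is not silent.
n_negative_mileage = int((demo["mileage_km"] < 0).sum())
n_oil_out_of_range = int((~demo["oil_quality_pct"].between(0, 100)).sum())

# Apply the same clipping rules as clean_data.
cleaned = demo.copy()
cleaned["mileage_km"] = cleaned["mileage_km"].clip(lower=0)
cleaned["oil_quality_pct"] = cleaned["oil_quality_pct"].clip(0, 100)

print("Negative mileage values:", n_negative_mileage)
print("Oil quality values outside [0, 100]:", n_oil_out_of_range)
print(cleaned)
```

If the counts are large, the out-of-range values may indicate a data-entry problem worth investigating rather than clipping.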

## 6. Split data

In [None]:
X = train_df.drop(["breakdown_next_30_days", "id"], axis=1)
y = train_df["breakdown_next_30_days"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
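
A quick way to confirm that `stratify=y` preserved the class ratio is to compare the positive rate in both splits. This sketch uses a synthetic target with 17% positives as a stand-in for the real data; the `demo_` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (170 positives out of 1000 rows).
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({"feature": rng.normal(size=1000)})
demo_y = pd.Series([1] * 170 + [0] * 830, name="breakdown_next_30_days")

demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)

# With stratification, both splits should keep roughly the same positive rate.
train_rate = demo_y_train.mean()
val_rate = demo_y_val.mean()
print(f"Positive rate (train): {train_rate:.3f}")
print(f"Positive rate (val):   {val_rate:.3f}")
```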

## 7. Preprocessing pipeline

In [None]:
categorical_cols = X.select_dtypes(include=["object", "string", "category"]).columns
# "number" covers every numeric dtype (e.g. int32, float32), not only int64/float64
numerical_cols = X.select_dtypes(include="number").columns

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols)
])

## 8. Random Forest

In [None]:
rf_model = RandomForestClassifier(
    n_estimators=300,
    class_weight="balanced",
    random_state=42
)

rf_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("model", rf_model)
])

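# Fit on the training split only. Fitting on the full X would leak X_val
# into training and inflate the validation metrics computed below.
rf_pipeline.fit(X_train, y_train)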

## 9. Evaluation with Threshold Tuning

In [None]:
y_probs = rf_pipeline.predict_proba(X_val)[:,1]

# Lower than the default 0.5 cut-off to favor recall on the breakdown class
threshold = 0.4
y_pred = (y_probs > threshold).astype(int)

print("Accuracy:", accuracy_score(y_val, y_pred))
print("ROC-AUC:", roc_auc_score(y_val, y_probs))
print(classification_report(y_val, y_pred))

cm = confusion_matrix(y_val, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix - Random Forest")
plt.show()
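
The cell above evaluates a single threshold. One way to choose the threshold from data is to sweep a grid of candidates on the validation split and compare precision, recall, and F1. The sketch below does this on synthetic data generated with `make_classification`; the data, the grid, and the choice of F1 as the selection criterion are assumptions for illustration, not the team's procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 17% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.83], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_va)[:, 1]

# Evaluate each candidate threshold on the validation split.
results = []
for t in np.linspace(0.1, 0.9, 9):
    preds = (probs > t).astype(int)
    results.append({
        "threshold": round(float(t), 2),
        "precision": precision_score(y_va, preds, zero_division=0),
        "recall": recall_score(y_va, preds, zero_division=0),
        "f1": f1_score(y_va, preds, zero_division=0),
    })

# Pick the threshold with the best F1; a recall-focused rule could be used instead.
best = max(results, key=lambda r: r["f1"])
print(best)
```

Raising the threshold can only lower or hold recall, so a recall target (for example, "at least 0.8") can also be used to pick the highest threshold that still meets it.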

## 10. ROC Curve

In [None]:
fpr, tpr, _ = roc_curve(y_val, y_probs)

plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

## 11. Feature Importance

In [None]:
feature_names = rf_pipeline.named_steps["preprocessing"].get_feature_names_out()
importances = rf_pipeline.named_steps["model"].feature_importances_

feat_imp = pd.Series(importances, index=feature_names)
top_features = feat_imp.sort_values(ascending=False).head(15)

plt.figure(figsize=(8,6))
top_features.sort_values().plot(kind="barh")
plt.title("Top 15 Feature Importances")
plt.show()
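
Impurity-based importances from `feature_importances_` are computed on the training data and can favor high-cardinality features. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and `sklearn.inspection.permutation_importance`; it is an optional complement, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not the challenge dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.83], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the ROC-AUC drop.
perm = permutation_importance(
    model, X_va, y_va, scoring="roc_auc", n_repeats=5, random_state=0
)

# Feature indices ordered from most to least important.
ranking = perm.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = perm.importances_mean[idx]
    std = perm.importances_std[idx]
    print(f"feature_{idx}: {mean:.4f} +/- {std:.4f}")
```

With the notebook's pipeline, the same call could be made on `rf_pipeline` with `X_val` and `y_val`, in which case the importances refer to the original input columns rather than the one-hot encoded ones.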

## 12. Logistic Regression Comparison

In [None]:
log_model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced"
)

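from sklearn.base import clone

# Clone the preprocessor so that fitting this pipeline does not refit the
# transformer instance already used inside rf_pipeline.
log_pipeline = Pipeline([
    ("preprocessing", clone(preprocessor)),
    ("model", log_model)
])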

log_pipeline.fit(X_train, y_train)

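y_probs_log = log_pipeline.predict_proba(X_val)[:, 1]

# Apply the same threshold as the Random Forest so recall and F1 are comparable
y_pred_log = (y_probs_log > threshold).astype(int)
print(classification_report(y_val, y_pred_log))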

print("Logistic ROC-AUC:", roc_auc_score(y_val, y_probs_log))
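
Logistic regression is sensitive to feature scale, and the numeric transformer above only imputes missing values. A possible variant adds `StandardScaler` to the numeric branch for the logistic pipeline. The sketch below runs on a toy frame: `mileage_km` and `oil_quality_pct` come from the cleaning step, while `fuel_type`, the generated values, and the target rule are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only; values and target rule are invented.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "mileage_km": rng.uniform(0, 300_000, n),
    "oil_quality_pct": rng.uniform(0, 100, n),
    "fuel_type": rng.choice(["petrol", "diesel"], n),  # hypothetical column
})
demo_target = (demo["mileage_km"] > 200_000).astype(int)

# Numeric branch: impute, then standardize to zero mean and unit variance.
scaled_numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

scaled_preprocessor = ColumnTransformer([
    ("num", scaled_numeric, ["mileage_km", "oil_quality_pct"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
])

scaled_log_pipeline = Pipeline([
    ("preprocessing", scaled_preprocessor),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

scaled_log_pipeline.fit(demo, demo_target)
demo_probs = scaled_log_pipeline.predict_proba(demo)[:, 1]
print(demo_probs[:5])
```

Scaling typically helps the solver converge and makes the regularization penalty act evenly across features. Tree-based models such as Random Forest do not need it.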

## Final Model Selection

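On the validation split, Random Forest achieved:
- Higher ROC-AUC
- Better recall for the breakdown class

It can also capture nonlinear relationships and feature interactions that logistic regression cannot model without engineered features.

Given the business objective (reduce unexpected breakdowns),
we prioritize recall and select the Random Forest model with the 0.4 decision threshold.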

## 13. Kaggle Submission

In [None]:
test_ids = test_df["id"]
test_features = test_df.drop("id", axis=1)

test_probs = rf_pipeline.predict_proba(test_features)[:,1]
threshold = 0.4
test_preds = (test_probs > threshold).astype(int)

submission = pd.DataFrame({
    "id": test_ids,
    "breakdown_next_30_days": test_preds
})

submission.to_csv("submission.csv", index=False)
submission.head()

## GenAI Usage Statement

GenAI tools were used to explore potential modeling strategies and 
improve notebook structure.

All preprocessing decisions, model choices, threshold tuning, and 
interpretations were made by the team and are fully understood.

We ensured:
- Every preprocessing step is explainable
- Every metric choice is justified
- Model behavior is interpretable

The team retains full ownership and responsibility for all decisions.