# Car Breakdown Prediction Challenge

**Course:** Artificial Intelligence  
**Goal:** Predict whether a car will experience a mechanical breakdown within the next 30 days.

This notebook covers:
- Exploratory Data Analysis (EDA)
- Preprocessing
- Modeling with Random Forest (baseline)
- An additional model (Gradient Boosting)
- Evaluation
- Kaggle submission

Stef

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## 1. Load the data

In [None]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

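from IPython.display import display

# A notebook cell only renders its last expression, so display each result explicitly.
display(train_df.head())
train_df.info()
display(train_df.describe())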

## 2. Exploratory Data Analysis (EDA)

### 2.1 Target variable distribution

In [None]:
sns.countplot(x="breakdown_next_30_days", data=train_df)
plt.title("Target Variable Distribution")
plt.show()
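
The count plot shows the class balance visually; printing the class shares gives the exact numbers. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in for `train_df`, and its values are invented) to show how the share of each class also gives the accuracy of a model that always predicts the majority class.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({"breakdown_next_30_days": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]})

# Relative frequency of each class.
class_share = toy_df["breakdown_next_30_days"].value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If the real target is imbalanced, any accuracy reported later should be read against this baseline.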

### 2.2 Numerical feature analysis

In [None]:
num_features = [
    "vehicle_age_years",
    "mileage_km",
    "engine_hours",
    "avg_trip_length_km",
    "last_service_km_ago",
    "oil_quality_pct",
    "cleanliness_score",
    "driver_satisfaction_score"
]

train_df[num_features].hist(figsize=(15, 10))
plt.tight_layout()
plt.show()
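
Histograms describe each feature on its own. One possible next step is to compare feature means between the two target classes. The sketch below assumes a toy frame with two of the columns listed above; `toy_df`, `toy_num_features`, and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: two numerical features plus the target.
toy_df = pd.DataFrame({
    "mileage_km": [20000, 150000, 60000, 210000, 35000, 180000],
    "oil_quality_pct": [90, 35, 80, 20, 85, 40],
    "breakdown_next_30_days": [0, 1, 0, 1, 0, 1],
})
toy_num_features = ["mileage_km", "oil_quality_pct"]

# Mean of each numerical feature per target class.
group_means = toy_df.groupby("breakdown_next_30_days")[toy_num_features].mean()
print(group_means)
```

A large gap between the two rows for a feature suggests it may carry signal, although this ignores interactions between features.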

### 2.3 Categorical feature analysis

In [None]:
cat_features = [
    "vehicle_brand",
    "weather_exposure",
    "fuel_type",
    "tyre_type"
]

for col in cat_features:
    print(f"\n{col}")
    print(train_df[col].value_counts())
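
Value counts show how often each category occurs, but not how it relates to the target. A minimal sketch of a per-category breakdown rate follows; the category labels (`diesel`, `petrol`, `electric`) and the frame `toy_df` are assumptions, since the actual values in `train.csv` are not shown here.

```python
import pandas as pd

# Hypothetical data; the real category labels may differ.
toy_df = pd.DataFrame({
    "fuel_type": ["diesel", "diesel", "petrol", "petrol", "electric", "diesel"],
    "breakdown_next_30_days": [1, 0, 0, 0, 0, 1],
})

# Breakdown rate and sample size per category, highest rate first.
rate_by_fuel = (
    toy_df.groupby("fuel_type")["breakdown_next_30_days"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "breakdown_rate", "count": "n_cars"})
    .sort_values("breakdown_rate", ascending=False)
)
print(rate_by_fuel)
```

Keeping `n_cars` next to the rate helps avoid over-reading categories with very few rows.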

## 3. Preprocessing

In [None]:
X = train_df.drop(["breakdown_next_30_days", "id"], axis=1)
y = train_df["breakdown_next_30_days"]

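# Treat every non-numeric column as categorical so that string, object, and
# category dtypes are all one-hot encoded, regardless of the pandas version.
categorical_cols = X.select_dtypes(exclude="number").columns
numerical_cols = X.select_dtypes(include="number").columns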

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)
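
The preprocessor above does not handle missing values. Whether `train.csv` contains any is not shown in this notebook, so the following is only a sketch of how imputation could be added if `train_df.info()` reveals gaps. The names `toy_X`, `categorical_steps`, `numerical_steps`, and `imputing_preprocessor` are hypothetical, and the imputation strategies are one reasonable choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one missing value per column.
toy_X = pd.DataFrame({
    "fuel_type": pd.Series(["diesel", "petrol", np.nan, "diesel"], dtype=object),
    "mileage_km": [20000.0, np.nan, 60000.0, 100000.0],
})

# Categorical columns: fill with the most frequent label, then one-hot encode.
categorical_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
# Numerical columns: fill with the median.
numerical_steps = SimpleImputer(strategy="median")

imputing_preprocessor = ColumnTransformer(transformers=[
    ("cat", categorical_steps, ["fuel_type"]),
    ("num", numerical_steps, ["mileage_km"]),
])

transformed = imputing_preprocessor.fit_transform(toy_X)
# The output may be sparse or dense depending on its density.
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed
print(dense)
```

Because the imputers sit inside the transformer, their statistics are learned from the training split only when the full pipeline is fitted.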

## 4. Train-validation split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
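
Passing `stratify=y` keeps the class ratio the same in both splits. The sketch below checks this on invented labels (`toy_X` and `toy_y` are hypothetical, with an assumed 80/20 class ratio).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 negatives and 20 positives.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, the positive share should match in both splits.
print("Positive share in train:", toy_y_train.mean())
print("Positive share in validation:", toy_y_val.mean())
```

The same check can be run on `y_train` and `y_val` from the cell above.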

## 5. Random Forest baseline model

In [None]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

rf_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", rf_model)
])

rf_pipeline.fit(X_train, y_train)

## 6. Model evaluation (Random Forest)

In [None]:
y_pred_rf = rf_pipeline.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred_rf))
print(classification_report(y_val, y_pred_rf))

In [None]:
cm = confusion_matrix(y_val, y_pred_rf)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Random Forest")
plt.show()
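
The four cells of a binary confusion matrix can be unpacked to compute recall and precision for the breakdown class by hand. This sketch uses invented labels (`toy_y_true` and `toy_y_pred` are hypothetical) and assumes the positive class is encoded as 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical validation labels and predictions.
toy_y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, scikit-learn orders the cells as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_y_true, toy_y_pred).ravel()

recall = tp / (tp + fn)      # share of real breakdowns that were caught
precision = tp / (tp + fp)   # share of predicted breakdowns that were real
print(f"Recall: {recall:.2f}, precision: {precision:.2f}")
```

For breakdown prediction, missed breakdowns (false negatives) are arguably the costlier error, so recall deserves attention alongside accuracy.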

## 7. Extra model: Gradient Boosting

In [None]:
gb_model = GradientBoostingClassifier(random_state=42)

gb_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", gb_model)
])

gb_pipeline.fit(X_train, y_train)

In [None]:
y_pred_gb = gb_pipeline.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred_gb))
print(classification_report(y_val, y_pred_gb))

In [None]:
cm_gb = confusion_matrix(y_val, y_pred_gb)

sns.heatmap(cm_gb, annot=True, fmt="d", cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Gradient Boosting")
plt.show()
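
A single validation split can favor one model by chance. As a sketch, the two classifiers could be compared with stratified cross-validation instead. The example uses synthetic data from `make_classification` rather than the competition data, and the choice of F1 as the metric is an assumption; `candidates` and `cv_results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data.
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Score each model on the same folds so the comparison is like for like.
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, toy_X, toy_y, cv=cv, scoring="f1")
    cv_results[name] = scores
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On the real data, `rf_pipeline` and `gb_pipeline` can be passed to `cross_val_score` directly with `X` and `y`, so preprocessing is refitted inside each fold.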

## 8. Kaggle submission

In [None]:
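# Keep the ids for the submission file; the model was trained without the "id" column.
test_ids = test_df["id"]
test_features = test_df.drop("id", axis=1)

# The submission uses the Random Forest baseline; use gb_pipeline instead
# if Gradient Boosting scored better on the validation set.
test_predictions = rf_pipeline.predict(test_features)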

submission = pd.DataFrame({
    "id": test_ids,
    "breakdown_next_30_days": test_predictions
})

submission.to_csv("submission.csv", index=False)
submission.head()
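
Before uploading, a few sanity checks on the submission frame can catch formatting mistakes. The helper `check_submission` below is hypothetical, and the expected format (two columns, 0/1 labels, one row per test id) is an assumption based on the cell above, not on the competition rules.

```python
import pandas as pd

def check_submission(submission, test_df, target="breakdown_next_30_days"):
    """Raise AssertionError if the submission does not match the assumed format."""
    assert list(submission.columns) == ["id", target], "unexpected columns"
    assert len(submission) == len(test_df), "row count differs from test set"
    assert submission["id"].is_unique, "duplicate ids"
    assert submission["id"].tolist() == test_df["id"].tolist(), "ids out of order"
    assert submission[target].isin([0, 1]).all(), "labels must be 0 or 1"
    return True

# Hypothetical stand-ins for test_df and submission.
toy_test_df = pd.DataFrame({"id": [101, 102, 103]})
toy_submission = pd.DataFrame({
    "id": [101, 102, 103],
    "breakdown_next_30_days": [0, 1, 0],
})

submission_ok = check_submission(toy_submission, toy_test_df)
print("Submission format OK:", submission_ok)
```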

## 9. GenAI usage statement

GenAI tools were used to help understand the structure of a standard
machine learning pipeline and to explore possible modeling approaches.

All preprocessing steps, model choices, evaluations, and interpretations
were made by the team, which fully understands them and can defend them.