# Supervised Learning for Road Accident Severity Prediction
Because the dataset is labeled, with *Accident Severity* as the target variable, we use **supervised learning models** for classification.

## Train and Evaluate Learning Models
We train several supervised classification models to predict *Accident Severity* (Minor, Moderate, or Severe).

### Data Preprocessing
1. Convert categorical variables (e.g., Weather Conditions, Road Type, Driver Age Group) into numerical values using one-hot encoding.
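2. Scale numerical features (e.g., Speed Limit, Visibility Level, Traffic Volume) to a common range. The tree-based models used here are largely insensitive to feature scale, but scaling keeps the preprocessing reusable for scale-sensitive models.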
3. Handle missing values (if any) by imputation or removal.
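
The three steps above can be sketched with scikit-learn's `ColumnTransformer`. This is only a minimal illustration on a made-up toy frame: the column names, values, and imputation strategies are assumptions standing in for the real dataset, not the notebook's actual preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the real column names and values may differ.
toy = pd.DataFrame({
    "Weather Conditions": ["Clear", "Rain", np.nan, "Clear"],
    "Road Type": ["Highway", "Urban", "Urban", "Rural"],
    "Speed Limit": [100.0, 50.0, np.nan, 80.0],
    "Traffic Volume": [1200.0, 3400.0, 2900.0, 400.0],
})
categorical_cols = ["Weather Conditions", "Road Type"]
numeric_cols = ["Speed Limit", "Traffic Volume"]

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer([
    ("cat", categorical_steps, categorical_cols),
    ("num", numeric_steps, numeric_cols),
])

toy_encoded = preprocessor.fit_transform(toy)
print(toy_encoded.shape)
```

In practice the preprocessor would be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into training.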

### Splitting the Data
- 80% for training, 20% for testing

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Accident Severity"])  # Features
y = df["Accident Severity"]  # Target

# stratify=y keeps the proportion of each severity class the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### Train Supervised Learning Models

We will train and compare multiple classification models:

1. **Decision Trees**
    - Simple and interpretable.
    - Works with numerical features and with categorical features once they are encoded.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

2. **Random Forest (Ensemble Model)**
    - Combines many decision trees trained on bootstrap samples, which reduces overfitting.
    - Usually more accurate than a single tree, at the cost of some interpretability.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

3. **XGBoost (Boosting Model)**
    - More powerful for structured data.
    - Uses gradient boosting to minimize error.

In [None]:
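import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# XGBoost expects integer class labels, so encode the severity strings first
le = LabelEncoder()
xgb_model = xgb.XGBClassifier(eval_metric="mlogloss", random_state=42)
xgb_model.fit(X_train, le.fit_transform(y_train))
# Map predictions back to the original labels so they are comparable with y_test
y_pred_xgb = le.inverse_transform(xgb_model.predict(X_test))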
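
To make "uses gradient boosting to minimize error" more concrete, the sketch below hand-rolls the core boosting loop for squared-error regression: each small tree is fit to the residuals left by the ensemble so far. It is a simplified illustration and not XGBoost's actual implementation, which adds regularization, second-order gradient information, and a multi-class log-loss objective. The data and hyperparameters are invented for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is the residual (up to a constant factor).
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Each round corrects part of the remaining error, which is why boosted ensembles can fit structured data closely and why the learning rate and number of rounds need tuning to avoid overfitting.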

## Use Appropriate Evaluation Metrics
For classification models, we will evaluate using:

* **Accuracy** – Measures the overall correctness of predictions.
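* **Precision** – Of the accidents predicted as a given severity (e.g., Severe), the fraction that actually had that severity.
* **Recall** – Of the accidents that actually had a given severity, the fraction the model identified correctly.
* **F1-score** – The harmonic mean of precision and recall.

Because there are three classes, the `evaluate_model` helper in this section reports the support-weighted average of each per-class metric.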
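
As a small worked example of these definitions, the sketch below computes precision and recall for the Severe class alone and compares them with the weighted averages across all classes. The eight labels are invented purely for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and predictions for eight accidents.
y_true_demo = ["Minor", "Minor", "Minor", "Moderate", "Moderate", "Severe", "Severe", "Severe"]
y_pred_demo = ["Minor", "Minor", "Moderate", "Moderate", "Severe", "Severe", "Severe", "Minor"]

# Per-class metrics for "Severe" only: average=None returns one value per requested label.
severe_precision = precision_score(y_true_demo, y_pred_demo, labels=["Severe"], average=None)[0]
severe_recall = recall_score(y_true_demo, y_pred_demo, labels=["Severe"], average=None)[0]

# Weighted averages: each class's metric is weighted by how often that class occurs.
weighted_precision = precision_score(y_true_demo, y_pred_demo, average="weighted")
weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")

print(f"Severe precision:   {severe_precision:.3f}")
print(f"Severe recall:      {severe_recall:.3f}")
print(f"Weighted precision: {weighted_precision:.3f}")
print(f"Weighted recall:    {weighted_recall:.3f}")
```

When severe accidents are the main concern, the per-class numbers are worth reporting alongside the weighted averages, since a weighted average can hide poor performance on a rare class.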

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(y_test, y_pred, model_name):
    print(f"Model: {model_name}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.2f}")
    print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.2f}")
    print(f"F1-score: {f1_score(y_test, y_pred, average='weighted'):.2f}")
    print("-" * 50)

evaluate_model(y_test, y_pred_dt, "Decision Tree")
evaluate_model(y_test, y_pred_rf, "Random Forest")
evaluate_model(y_test, y_pred_xgb, "XGBoost")

## Provide a Clear Interpretation of Model Performance
To visualize performance, we can generate:

1. **Classification Report & Confusion Matrix**
    - A confusion matrix shows where the model makes mistakes.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

print("Classification Report for Random Forest")
print(classification_report(y_test, y_pred_rf))

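# Confusion Matrix (rows: actual classes, columns: predicted classes)
labels = rf_model.classes_
cm = confusion_matrix(y_test, y_pred_rf, labels=labels)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)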
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix for Random Forest")
plt.show()

2. **Feature Importance (Understanding What Factors Influence Severity)**
    - Shows which features the Random Forest relies on most when predicting accident severity.

In [None]:
import pandas as pd

feature_importances = pd.DataFrame({'Feature': X_train.columns, 'Importance': rf_model.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Plot Feature Importance
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feature_importances)
plt.title("Feature Importance in Accident Severity Prediction")
plt.show()
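
Because one-hot encoding splits each categorical variable into several columns, its importance is spread across those columns in the chart. One way to read the result at the level of the original variables is to sum the importances per source column. The sketch below assumes the `<column>_<category>` naming that `pandas.get_dummies` produces and original column names without underscores; the importance values are invented.

```python
import pandas as pd

# Hypothetical importances for one-hot encoded columns (values are made up).
importances_demo = pd.DataFrame({
    "Feature": ["Weather Conditions_Rain", "Weather Conditions_Fog", "Road Type_Urban", "Speed Limit"],
    "Importance": [0.20, 0.15, 0.25, 0.40],
})

# Recover the source column: everything before the first underscore.
importances_demo["Source"] = importances_demo["Feature"].str.split("_").str[0]

# Sum the importance of all encoded columns that came from the same variable.
grouped = (
    importances_demo.groupby("Source")["Importance"]
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```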

3. **ROC Curve (Performance Across Different Thresholds)**
    - Shows how well the model separates Severe accidents from all other accidents (one-vs-rest) across decision thresholds.

In [None]:
from sklearn.metrics import roc_curve, auc
import numpy as np

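y_probs = rf_model.predict_proba(X_test)  # Class probabilities; columns follow rf_model.classes_
y_test_bin = np.where(y_test == "Severe", 1, 0)  # One-vs-rest: Severe (1) vs. all other classes (0)

# Look up the "Severe" column instead of assuming it sits at index 1
severe_idx = list(rf_model.classes_).index("Severe")
fpr, tpr, _ = roc_curve(y_test_bin, y_probs[:, severe_idx])
roc_auc = auc(fpr, tpr)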

plt.plot(fpr, tpr, color="blue", lw=2, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="grey", linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Accident Severity Prediction")
plt.legend(loc="lower right")
plt.show()
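
The curve above treats the task as Severe versus everything else. To summarize all three severity levels in a single number, scikit-learn's `roc_auc_score` supports one-vs-rest averaging. The sketch below uses synthetic three-class data from `make_classification` as a stand-in for the accident dataset, so the variable names and the resulting score are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for Minor / Moderate / Severe.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)  # shape: (n_samples, 3)

# One-vs-rest AUC per class, averaged with equal weight per class.
macro_auc = roc_auc_score(y_te, probs, multi_class="ovr", average="macro")
print(f"Macro-averaged one-vs-rest ROC AUC: {macro_auc:.2f}")
```

Macro averaging gives each class equal weight, which suits imbalanced severity data; `average="weighted"` would instead weight each class by its frequency.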

## Final Summary
- **Train Models**: Decision Trees, Random Forest, XGBoost

- **Evaluate**: Accuracy, Precision, Recall, F1-score

- **Visualize & Interpret**: Confusion Matrix, Feature Importance, ROC Curve

*This approach will help **identify risk factors for severe accidents** and **improve road safety interventions**. 🚦*