# Predictions of Severity

This notebook demonstrates the prediction of accident severity using machine learning models on French road accident data from 2005 to 2016.

**Dataset**: Accidents in France from 2005 to 2016

**Objective**: Build classification models to predict the severity of road accidents based on various features such as weather conditions, time, location, and road characteristics.

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## 2. Load and Explore Data

Note: This notebook expects the dataset to be available locally. It can be downloaded from Kaggle with `kagglehub`:
```python
import kagglehub
path = kagglehub.dataset_download("ahmedlahlou/accidents-in-france-from-2005-to-2016")
print("Path to dataset files:", path)
```

In [None]:
# Load data - adjust path as needed
# For demonstration, we'll create sample code structure
# In practice, replace with actual data loading

print("Data loading section")
print("This section would load the accident data from CSV files")
print("Expected files: caracteristiques.csv, lieux.csv, usagers.csv, vehicules.csv")
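
Once the four files are on disk, they need to be combined into a single table before modeling. The following is a minimal sketch, not the notebook's confirmed loading code. It assumes the files share an accident identifier column named `Num_Acc`, that users and vehicles are linked by `num_veh`, and that the CSVs are `latin-1` encoded; all three are assumptions to check against the downloaded files. The helper names `load_accident_tables` and `merge_accident_tables` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_accident_tables(data_dir, encoding="latin-1"):
    """Read the four expected CSV files from data_dir into a dict of DataFrames.

    The encoding default is an assumption; adjust it if read_csv fails.
    """
    names = ["caracteristiques", "lieux", "usagers", "vehicules"]
    return {
        name: pd.read_csv(Path(data_dir) / f"{name}.csv",
                          encoding=encoding, low_memory=False)
        for name in names
    }


def merge_accident_tables(tables, accident_key="Num_Acc", vehicle_key="num_veh"):
    """Build one row per user, enriched with vehicle, accident, and road attributes.

    The key column names are assumptions about the dataset schema.
    """
    df = tables["usagers"].merge(
        tables["vehicules"], on=[accident_key, vehicle_key],
        how="left", suffixes=("", "_veh"),
    )
    df = df.merge(tables["caracteristiques"], on=accident_key, how="left")
    df = df.merge(tables["lieux"], on=accident_key, how="left")
    return df
```

With real data, the call would look like `df = merge_accident_tables(load_accident_tables(path))`, where `path` is the directory returned by `kagglehub.dataset_download`. Left joins starting from the user table keep one row per person, which matches a per-user severity target such as `grav`.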

## 3. Data Preprocessing and Feature Engineering

In [None]:
def preprocess_data(df):
    """
    Preprocess the accident data by imputing missing values:
    - Categorical (object) columns: most frequent value, or 'Unknown' if none exists
    - Numerical columns: median

    Encoding and scaling are handled later in prepare_features.
    """
    df_processed = df.copy()

    for col in df_processed.columns:
        if df_processed[col].dtype == 'object':
            mode = df_processed[col].mode()
            fill_value = mode.iloc[0] if not mode.empty else 'Unknown'
        else:
            fill_value = df_processed[col].median()
        # Assign the result back instead of calling fillna(inplace=True) on a
        # column selection, which recent pandas versions deprecate
        df_processed[col] = df_processed[col].fillna(fill_value)

    return df_processed

## 4. Exploratory Data Analysis (EDA)

In [None]:
def plot_severity_distribution(df, severity_col='grav'):
    """
    Plot the distribution of accident severity levels
    """
    plt.figure(figsize=(10, 6))
    severity_counts = df[severity_col].value_counts().sort_index()
    
    plt.bar(severity_counts.index, severity_counts.values)
    plt.xlabel('Severity Level')
    plt.ylabel('Count')
    plt.title('Distribution of Accident Severity')
    plt.xticks(severity_counts.index)
    
    # Add value labels on bars
    for i, v in enumerate(severity_counts.values):
        plt.text(severity_counts.index[i], v, str(v), ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()

print("EDA functions defined")

## 5. Feature Selection and Preparation

In [None]:
def prepare_features(df, target_col='grav'):
    """
    Prepare features for modeling:
    - Separate features and target
    - Encode categorical variables
    - Scale features
    """
    # Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    # Encode categorical variables
    label_encoders = {}
    for col in X.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
        label_encoders[col] = le
    
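    # Scale all features (numeric and label-encoded).
    # Note: the scaler is fit on every row, so splitting afterwards lets
    # test-set statistics influence the scaling.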
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
    
    return X_scaled, y, label_encoders, scaler

print("Feature preparation functions defined")
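
Because `prepare_features` fits the scaler on the full dataset, a later train/test split means the test rows have already influenced the scaling. One common alternative is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The sketch below illustrates that idea on synthetic data from `make_classification`; it is an assumption-laden illustration, not a drop-in replacement for the function above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded accident features (hypothetical data)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Split before any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on X_train only, then reuses it at predict time
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

The fitted scaler's `mean_` equals the per-column mean of `X_train`, which confirms that no test rows were used when scaling.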

## 6. Model Training and Evaluation

### 6.1 Random Forest Classifier

Random Forest is an ensemble learning method that trains many decision trees on bootstrap samples of the data and predicts the class chosen by the most trees (the mode of the individual trees' predictions).
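
The voting idea can be made concrete with a small sketch on synthetic data. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by a strict hard vote, so the example computes the majority vote by hand from `forest.estimators_` and measures how often it matches `forest.predict`. The dataset and the helper name `majority_vote` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class problem standing in for severity levels (hypothetical data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)


def majority_vote(votes, n_classes):
    """Return, per sample, the class index predicted by the most trees.

    votes has shape (n_trees, n_samples) and holds class indices.
    Ties go to the lowest class index.
    """
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


# Each fitted tree predicts an index into forest.classes_
tree_votes = np.stack([tree.predict(X) for tree in forest.estimators_]).astype(int)
hard_vote = forest.classes_[majority_vote(tree_votes, len(forest.classes_))]

# Fraction of samples where the hand-computed vote matches the forest's output
agreement = (hard_vote == forest.predict(X)).mean()
print(f"Agreement between hard vote and forest.predict: {agreement:.3f}")
```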

In [None]:
def train_random_forest(X_train, y_train, n_estimators=100, max_depth=None, random_state=42):
    """
    Train a Random Forest classifier
    """
    rf_model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=random_state,
        n_jobs=-1
    )
    
    rf_model.fit(X_train, y_train)
    
    return rf_model

print("Random Forest training function defined")

### 6.2 Logistic Regression

In [None]:
def train_logistic_regression(X_train, y_train, max_iter=1000, random_state=42):
    """
    Train a Logistic Regression classifier
    """
    lr_model = LogisticRegression(
        max_iter=max_iter,
        random_state=random_state,
        n_jobs=-1
    )
    
    lr_model.fit(X_train, y_train)
    
    return lr_model

print("Logistic Regression training function defined")

### 6.3 Decision Tree Classifier

In [None]:
def train_decision_tree(X_train, y_train, max_depth=None, random_state=42):
    """
    Train a Decision Tree classifier
    """
    dt_model = DecisionTreeClassifier(
        max_depth=max_depth,
        random_state=random_state
    )
    
    dt_model.fit(X_train, y_train)
    
    return dt_model

print("Decision Tree training function defined")

## 7. Model Evaluation

In [None]:
def evaluate_model(model, X_test, y_test, model_name="Model"):
    """
    Evaluate model performance
    """
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    print(f"\n{'='*50}")
    print(f"{model_name} Performance")
    print(f"{'='*50}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score (weighted): {f1:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion Matrix
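    # Pass explicit labels so the heatmap axes show the actual severity classes
    labels = np.unique(np.concatenate([np.asarray(y_test), np.asarray(y_pred)]))
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)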
    plt.title(f'{model_name} - Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
    
    return accuracy, f1, y_pred

print("Evaluation functions defined")

## 8. Feature Importance Analysis

In [None]:
def plot_feature_importance(model, feature_names, top_n=15):
    """
    Plot feature importance for tree-based models
    """
    if hasattr(model, 'feature_importances_'):
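        importances = model.feature_importances_
        indices = np.argsort(importances)[::-1][:top_n]
        # The model may have fewer than top_n features, so size the plot
        # from the number of indices actually selected
        n_shown = len(indices)
        
        plt.figure(figsize=(12, 8))
        plt.title(f'Top {n_shown} Feature Importances')
        plt.barh(range(n_shown), importances[indices])
        plt.yticks(range(n_shown), [feature_names[i] for i in indices])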
        plt.xlabel('Importance')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
    else:
        print("Model does not have feature_importances_ attribute")

print("Feature importance plotting function defined")

## 9. Main Execution Pipeline

This section ties the previous steps into a single function that runs the complete workflow once a DataFrame has been loaded.

In [None]:
def run_severity_prediction_pipeline(df, target_col='grav', test_size=0.2, random_state=42):
    """
    Complete pipeline for severity prediction
    """
    print("Starting Severity Prediction Pipeline...")
    print(f"Dataset shape: {df.shape}")
    
    # Step 1: Preprocess data
    print("\n[1/6] Preprocessing data...")
    df_processed = preprocess_data(df)
    
    # Step 2: Prepare features
    print("[2/6] Preparing features...")
    X, y, label_encoders, scaler = prepare_features(df_processed, target_col)
    
    # Step 3: Split data
    print("[3/6] Splitting data...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    print(f"Training set size: {X_train.shape[0]}")
    print(f"Test set size: {X_test.shape[0]}")
    
    # Step 4: Train models
    print("\n[4/6] Training models...")
    
    print("  - Training Random Forest...")
    rf_model = train_random_forest(X_train, y_train)
    
    print("  - Training Logistic Regression...")
    lr_model = train_logistic_regression(X_train, y_train)
    
    print("  - Training Decision Tree...")
    dt_model = train_decision_tree(X_train, y_train)
    
    # Step 5: Evaluate models
    print("\n[5/6] Evaluating models...")
    
    rf_acc, rf_f1, _ = evaluate_model(rf_model, X_test, y_test, "Random Forest")
    lr_acc, lr_f1, _ = evaluate_model(lr_model, X_test, y_test, "Logistic Regression")
    dt_acc, dt_f1, _ = evaluate_model(dt_model, X_test, y_test, "Decision Tree")
    
    # Step 6: Compare models
    print("\n[6/6] Model Comparison")
    print(f"{'='*50}")
    
    results = pd.DataFrame({
        'Model': ['Random Forest', 'Logistic Regression', 'Decision Tree'],
        'Accuracy': [rf_acc, lr_acc, dt_acc],
        'F1 Score': [rf_f1, lr_f1, dt_f1]
    })
    
    results = results.sort_values('F1 Score', ascending=False)
    print(results.to_string(index=False))
    
    # Plot feature importance for the Random Forest
    # (it is not necessarily the top-ranked model in the table above)
    print("\nPlotting feature importance for Random Forest...")
    plot_feature_importance(rf_model, X.columns.tolist())
    
    return {
        'models': {'rf': rf_model, 'lr': lr_model, 'dt': dt_model},
        'results': results,
        'preprocessors': {'label_encoders': label_encoders, 'scaler': scaler}
    }

print("Pipeline function defined")

## 10. Example Usage

To use this notebook with actual data:

```python
# Load your data
df = pd.read_csv('path/to/your/data.csv')

# Run the pipeline
results = run_severity_prediction_pipeline(df, target_col='grav')

# Access trained models
best_model = results['models']['rf']

# Make predictions on new data
# predictions = best_model.predict(X_new)
```

## Conclusion

This notebook provides an end-to-end framework for predicting accident severity with machine learning. When it is run on the data, the outcomes to look for typically include:

1. **Random Forest** often performs best on this kind of classification task because it captures non-linear relationships and feature interactions. Confirm this against the model comparison table rather than assuming it.

2. **Important Features** for severity prediction usually include:
   - Weather conditions
   - Time of day and day of week
   - Road type and surface conditions
   - Number of vehicles involved
   - Speed limit
   - Location characteristics (urban vs rural)

3. **Recommendations**:
   - Focus safety interventions on high-risk conditions identified by the model
   - Consider ensemble methods for production deployment
   - Regularly retrain models with new data to maintain accuracy
   - Implement proper data validation and monitoring in production