## TO DO

### Steps on how to tackle the project

#### What is the Project?

#### **Predict a bicycle accident.**


- [x] **Understand the Datasets**
- [ ] **Define the problem**
   - Objective: Predict whether a given situation (defined by features such as time, location, weather, etc.) is likely to result in a bicycle accident.
   - Output: A binary classification where 1 indicates a high likelihood of an accident and 0 indicates a low likelihood (which could include near misses or safe conditions).
&nbsp;

- [ ] **Data Preparation**
   - Combine the Datasets: merge the datasets and add an extra column to indicate whether the incident was an accident (1) or a near miss (0).
     - The column could be called 'Accident Likelihood' or similar.
     - **Maybe rename some columns**, e.g.:
       - **The i1 - i9 columns in near accidents could be renamed to match the actual-accident columns, e.g. i7 = ist PKW.**
     - **Columns to maybe get rid of**
         - Actual accidents: 'OBJECTID', 'LINREFX', 'LINREFY'
         - Near accidents: 'desc'
   - Handling Imbalance:
        - If the datasets are imbalanced (more near misses than accidents), consider using techniques like:
        - Oversampling: Duplicate accident instances.
        - Undersampling: Reduce the number of near-miss instances.
        - Synthetic Data Generation: Use techniques like SMOTE to create synthetic samples of the minority class (accidents).
   - Handle Missing Values:
   - Data cleaning
&nbsp;
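
The merge-and-label step above could be sketched as follows. This is a minimal sketch on toy data: the two small DataFrames stand in for the real files, `is_accident` is a hypothetical name for the new label column, and `IstPKW` is an assumed spelling of the car flag that `i7` is mapped to.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the real files have many more columns.
accidents = pd.DataFrame({
    "OBJECTID": [1, 2],
    "LINREFX": [10.0, 11.0],
    "LINREFY": [50.0, 51.0],
    "IstPKW": [1, 0],          # assumed spelling of the car flag
})
near_misses = pd.DataFrame({
    "desc": ["close pass", "door opened", "cut off"],
    "i7": [1, 0, 1],           # per the notes, i7 marks a car being involved
})

# Drop the columns listed above as candidates for removal.
accidents = accidents.drop(columns=["OBJECTID", "LINREFX", "LINREFY"])
near_misses = near_misses.drop(columns=["desc"])

# Align column names so both sources share one schema.
near_misses = near_misses.rename(columns={"i7": "IstPKW"})

# Label each source (1 = accident, 0 = near miss), then stack the rows.
accidents["is_accident"] = 1       # hypothetical label column name
near_misses["is_accident"] = 0
combined = pd.concat([accidents, near_misses], ignore_index=True)

print(combined)
```

Columns that exist in only one source would come out of `pd.concat` as NaN for the other source, which feeds directly into the missing-values step.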

- [ ] **Feature Engineering**
    - Time-Based Features: Extract new features from the timestamp (e.g. hour, day, month, holiday).
       - [ ] **Will need to do this for the near accidents**
         - **Currently we only have ts (timestamp). We need to extract hour, day, month, etc. so that the columns match the actual accidents.**

     - Spatial Features: Convert locations to coordinates, or use spatial clustering to identify high-risk zones.
        - [ ] **Did something similar in the merge file**
          - **We converted lat and lon to distance and classified the near accidents that happened next to actual accidents.**
          - **We also identified the locations with the most accidents.**
          - *Results are in the near accidents CSV file. See the merge file for more info on the functions used.*

    - Weather Features: Include detailed weather data such as precipitation, temperature, visibility, and wind speed.
       - Actual accidents columns: 'ULICHTVERH', 'IstStrassenzustand'
       - Near accidents: no info on weather. However, after deducing the day and time from 'ts', we could match it to the weather in actual accidents.
         
    - Traffic and Road Features: Aggregate traffic volume data, road type, presence of bike lanes, speed limits, etc.
       - Actual accidents: 
       - Near accidents:
      
    - Driver and Cyclist Behavior: Include features that might capture risky behaviours (speed, sudden stops, proximity to other vehicles).
       - Actual accidents: 
       - Near accidents:
         
    - Interaction Features: Create interaction terms between relevant features (e.g., traffic volume during rush hour).

 &nbsp;
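
The time-based and spatial features above could be derived roughly like this. It is a sketch under assumptions: `ts` is treated as Unix epoch seconds (adjust `unit` if the real column is in milliseconds or is a string), and the haversine formula is one common way to turn lat/lon pairs into a distance, not necessarily what the merge file does. The function names are illustrative.

```python
import math

import pandas as pd


def add_time_features(df, ts_col="ts", unit="s"):
    """Return a copy of df with hour, weekday and month derived from ts_col."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col], unit=unit)  # assumes epoch timestamps
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["month"] = ts.dt.month
    return out


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Toy near-miss rows; values are made up for illustration.
near_misses = pd.DataFrame({"ts": [1700000000, 1700050000]})
print(add_time_features(near_misses))
print(haversine_km(52.52, 13.40, 52.50, 13.42))
```

A holiday flag would need an external calendar, so it is left out of this sketch.
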
- [ ] **Feature Selection**
  - Correlation Analysis: Identify which features are most correlated with accidents and near misses.
     - Actual accidents: istPKW, 'ULICHTVERH' = 0 (daylight), 'IstStrassenzustand' = 0 (dry)
     - Near accidents: i7 (car), i4 (delivery/van)
  - Feature Importance: Use techniques like random forest importance or LASSO to determine the most important features.
  - Dimensionality Reduction: Consider PCA or other techniques if you have a large number of features.

  - **Example Target Variables: 'Scary', 'Accident Likelihood' (new column still to be created after combining the datasets)**


&nbsp;
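
Correlation analysis and random forest importance could be compared side by side as below. The data is synthetic and the feature names (`IstPKW`, `hour`, `noise`) are only placeholders; the target is deliberately built to depend on `IstPKW` so that both rankings have something to find. LASSO and PCA are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic features standing in for the combined dataset.
X = pd.DataFrame({
    "IstPKW": rng.integers(0, 2, n),
    "hour": rng.integers(0, 24, n),
    "noise": rng.normal(size=n),
})
# Synthetic target: mostly 1 when a car is involved, otherwise 0.
y = ((X["IstPKW"] == 1) & (rng.random(n) < 0.9)).astype(int)

# Absolute Pearson correlation of each feature with the target.
correlations = X.corrwith(y).abs().sort_values(ascending=False)

# Impurity-based importances from a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(correlations)
print(importances)
```

Pearson correlation only captures linear, one-feature-at-a-time relationships, so a feature such as `hour` can rank low here and still matter to a tree-based model.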

- [ ] **Model Selection**
   - **Model Types**
       - Logistic Regression: Simple and interpretable, useful as a baseline model.
       - Random Forest/Gradient Boosting: These models handle non-linear relationships well and often perform strongly in classification tasks.
       - Support Vector Machines (SVM): Effective in high-dimensional spaces and with clear margins of separation.
       - Ensemble Methods: Combining several models (e.g., stacking or boosting) might improve prediction accuracy.
       - Neural Networks: If the dataset is large and complex, neural networks might capture complex patterns. (Ruled out: too complex for the project scope, and we don't have enough data.)
 
&nbsp;
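
One way to compare candidate models is to score each with the same cross-validation setup. The sketch below does this for a logistic regression baseline and a random forest on synthetic data from `make_classification`; the choice of F1 as the metric and of 5 folds are assumptions, not project decisions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the combined accident / near-miss table.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

candidates = {
    # Scaling helps logistic regression converge; trees do not need it.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean F1 over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```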

- [ ] **Model Training**
   - Split the Data: Divide the data into training, validation, and test sets (e.g., 70% training, 15% validation, 15% test).
   - Cross-Validation: Use k-fold cross-validation to ensure the model generalizes well.
   - Hyperparameter Tuning: Use Grid Search, Random Search, or Bayesian optimization to find the best hyperparameters.
   - Class Imbalance Handling: If the classes are imbalanced (one class far outnumbers the other), consider techniques like oversampling, undersampling, or using class weights.
     
&nbsp;
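
The training steps above (70/15/15 split, cross-validated grid search, class weights) could be combined as in this sketch. The data is synthetic with a 80/20 class ratio, and the `C` grid and F1 scoring are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% class 0, 20% class 1).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)

# 70% training; the remaining 30% is split evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,            # 5-fold cross-validation on the training set only
    scoring="f1",
)
search.fit(X_train, y_train)

val_score = search.score(X_val, y_val)  # F1 on the held-out validation set
print(search.best_params_, round(val_score, 3))
```

The test set stays untouched here so that it can serve as the final check on unseen data.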
    
- [ ] **Model Evaluation**
   - Performance Metrics: Use metrics like accuracy, precision, recall, F1-score, and ROC-AUC to evaluate the model’s performance.
   - Confusion Matrix: Analyze the confusion matrix to understand where the model makes errors (e.g., false positives vs. false negatives).
   - Precision-Recall Curve: Particularly useful when dealing with imbalanced datasets.

&nbsp;
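
The metrics listed above are all available in `sklearn.metrics`. The following sketch computes them for a simple logistic regression on synthetic, imbalanced data; the model and data are placeholders for whatever the project ends up training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
# Precision, recall and F1 per class, plus accuracy.
report = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
# Points of the precision-recall curve and its summary score.
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
avg_precision = average_precision_score(y_test, y_score)

print(cm)
print(report)
print(f"ROC-AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")
```

With imbalanced classes, accuracy alone can look good for a model that rarely predicts the minority class, which is why the precision-recall view is included.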

- [ ] **Model Interpretation and Testing**
   - Feature Importance Analysis: Identify which features the model considers most important in distinguishing between accidents and near misses.
   - Model Explanation: Use techniques like SHAP (SHapley Additive exPlanations) to understand how individual predictions are made.
   - Testing on Unseen Data: Validate the model on a separate test set to ensure that it generalizes well to new data.

&nbsp;
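
SHAP requires the separate `shap` package, so this sketch swaps in a different, simpler technique that ships with scikit-learn: permutation importance. It measures how much the test score drops when one feature is shuffled, which gives a global ranking rather than SHAP's per-prediction explanations. Data and model are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out set and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```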

- [ ] **Deployment and Monitoring**
   - Deploy the Model: If the model meets the performance criteria, deploy it in a real-world scenario where it can be used to predict the likelihood of an accident or near miss.
   - Monitor Performance: Continuously monitor the model's performance over time and retrain it as new data becomes available.

&nbsp;

- [ ] **Consider Ethical implications**
   - Bias: Ensure the model doesn't inadvertently introduce or exacerbate biases (e.g., based on location or time).
   - Transparency: Make the model interpretable to ensure stakeholders can understand and trust its predictions.

&nbsp;

- [ ] **Real-Time Prediction (Optional)**
  - Data Pipeline: If predicting accidents in real-time, set up a pipeline that ingests real-time data (e.g., weather, traffic) and feeds it to the model.
  - Alerts: Implement a system that triggers alerts when the model predicts a high likelihood of an accident.
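
An alert rule can start out as a plain probability threshold. In this sketch, `alerts_for` is a hypothetical helper and the default threshold of 0.8 is an arbitrary assumption that would need tuning against the precision and recall the project can tolerate.

```python
def alerts_for(probabilities, threshold=0.8):
    """Return indices of situations whose predicted accident
    probability is at or above the threshold."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]


# Made-up model outputs for four situations.
predicted = [0.10, 0.85, 0.80, 0.79]
print(alerts_for(predicted))
```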


This approach aims to create a robust model that can predict bicycle accidents based on historical data and patterns identified in both accident and near-miss incidents.



#### Differentiate between an actual accident and a near miss
- [ ] Understand the Datasets
- [ ] Data Preparation
   - Combine the Datasets: merge the datasets and add an extra column to indicate whether the incident was an accident (1) or near miss (0)
   - Handle Missing Values:
   - Data cleaning
     
- [ ] Feature Engineering
    - Time-Based Features: Extract new features from the timestamp (e.g. hour, day, month, holiday).
    - Spatial Features: Convert locations to coordinates, or use spatial clustering to identify high-risk zones.
    - Interaction Terms: Create features that capture interactions between other features (e.g., the interaction between traffic volume and road type).
    - Environmental Conditions: Enhance weather data by including features like visibility, temperature, and wind speed.
    - Driver and Cyclist Behavior: If available, use data on speed, abrupt braking, or other behavioural indicators.
      
- [ ] Model Selection
   - Model Types
       - Logistic Regression: Simple and interpretable, useful as a baseline model.
       - Decision Trees/Random Forest: Good for handling non-linear relationships and interactions between features.
       - Support Vector Machines (SVM): Effective in high-dimensional spaces and with clear margins of separation.
       - Gradient Boosting Machines (XGBoost, LightGBM): Powerful for tabular data and often perform well in classification tasks.
       - Neural Networks: If the dataset is large and complex, neural networks might capture complex patterns. (Ruled out: too complex for the project scope, and we don't have enough data.)
- [ ] Model Training
   - Split the Data: Divide the data into training, validation, and test sets (e.g., 70% training, 15% validation, 15% test).
   - Hyperparameter Tuning: Use techniques like Grid Search or Random Search combined with cross-validation on the training set to tune hyperparameters.
   - Class Imbalance Handling: If the classes are imbalanced (one class far outnumbers the other), consider techniques like oversampling, undersampling, or using class weights.
- [ ] Model Evaluation
   - Performance Metrics: Use metrics like accuracy, precision, recall, F1-score, and ROC-AUC to evaluate the model’s performance.
   - Confusion Matrix: Analyze the confusion matrix to understand where the model makes errors (e.g., false positives vs. false negatives).
   - Cross-Validation: Ensure robustness by performing k-fold cross-validation and evaluating the consistency of the results.
- [ ] Model Interpretation and Testing
   - Feature Importance Analysis: Identify which features the model considers most important in distinguishing between accidents and near misses.
   - Model Explanation: Use techniques like SHAP (SHapley Additive exPlanations) to understand how individual predictions are made.
   - Testing on Unseen Data: Validate the model on a separate test set to ensure that it generalizes well to new data.
- [ ] Deployment and Monitoring
   - Deploy the Model: If the model meets the performance criteria, deploy it in a real-world scenario where it can be used to predict the likelihood of an accident or near miss.
   - Monitor Performance: Continuously monitor the model's performance over time and retrain it as new data becomes available.
- [ ] Consider Ethical implications
   - Bias and Fairness: Ensure that the model does not introduce or exacerbate biases, especially if the features used could unfairly impact certain groups (e.g., specific locations or times of day).
   - Explainability: Make sure that the model's decisions can be explained to stakeholders, especially in safety-critical applications.
 

This approach will help build a machine-learning model that can effectively differentiate between bicycle accidents and near misses, providing valuable insights for improving cyclist safety.

