# Project: Road Accidents in France based on Annual Road Traffic Accident Injury Database (2005 - 2023)

## Step 1: Exploration + Data Visualization

---

### Description

For each accident involving injury (i.e., an accident occurring on a road open to public traffic, involving at least one vehicle and resulting in at least one victim requiring medical treatment), information describing the accident is recorded by the law enforcement unit (police, gendarmerie, etc.) that responded to the scene. These entries are compiled into a form called the *Injury Accident Analysis Bulletin*. Together, these forms constitute the national road traffic accident file, known as the **BAAC File**, administered by the **National Interministerial Road Safety Observatory (ONISR)**.

The databases, extracted from the BAAC file, list all road traffic accidents involving injuries that occurred during a given year in:

- Mainland France
- Overseas departments (Guadeloupe, French Guiana, Martinique, Réunion, and Mayotte since 2012)
- Overseas territories (Saint-Pierre and Miquelon, Saint-Barthélemy, Saint-Martin, Wallis and Futuna, French Polynesia, and New Caledonia; available since 2019)

The databases from 2005 to 2023 are structured annually and consist of four CSV files:

- **Caractéristiques** (Accident details)
- **Lieux** (Locations)
- **Véhicules** (Vehicles)
- **Usagers** (Users)

---

## Important Notes

- **Users Who Fled the Scene (Hit-and-Run):**  
  Since 2021, users who fled the scene of the accident have been included in the data.  
  For these users, information such as sex, age, and injury severity (unharmed, slightly injured, hospitalized) is missing.

- **Missing Data:**  
  Most variables across the four main files may contain:
  - Empty cells
  - Zeros
  - Periods (`.`)
  In these cases, the field was either not populated by law enforcement or the information was deemed irrelevant.

- **Hospitalized Injured Persons:**  
  Data regarding the classification of hospitalized injured persons since 2018 cannot be compared with previous years due to changes in the law enforcement data entry process. The "hospitalized injured person" indicator has not been certified by the Public Statistics Authority since 2019.
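
The missing-data markers listed above can be normalized before any analysis. The following is a minimal sketch, not the notebook's actual cleaning step: `normalize_missing` is a hypothetical helper that turns empty, whitespace-only, and `.` cells into `NaN`. Zeros are deliberately left untouched, on the assumption that in some columns 0 is a valid code and has to be judged column by column.

```python
import numpy as np
import pandas as pd

def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where empty, whitespace-only, and '.' text cells become NaN.

    Hypothetical helper. Zeros are kept because 0 may be a legitimate code.
    """
    # The regex only applies to text cells; numeric cells are left as they are.
    return df.replace(r"^\s*\.?\s*$", np.nan, regex=True)

# Toy data standing in for a raw BAAC table.
toy = pd.DataFrame({"atm": ["1", ".", " ", "2"], "vma": [50, 0, 80, 90]})
clean = normalize_missing(toy)
print(clean["atm"].isna().sum())  # 2
```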

---

## Data Handling

I downloaded the data from:  
[Data Source - data.gouv.fr](https://www.data.gouv.fr/en/datasets/bases-de-donnees-annuelles-des-accidents-corporels-de-la-circulation-routiere-annees-de-2005-a-2023/)  
and saved it into the `data/raw/` folder.

Each year includes four main tables:

- **Caractéristiques** (Characteristics, or Accident details)
- **Lieux** (Locations)
- **Véhicules** (Vehicles)
- **Usagers** (Users)

Additionally, data on **registered vehicles (vehicules-immatricules-baac)** for 2009–2022 is available.  
However, I chose not to use this dataset for now, as it may introduce problems in machine learning algorithms due to missing years.

To streamline data processing, I renamed some of the downloaded files so that they all follow the format:  
`category_year` (e.g., `caracteristiques_2020.csv`, `vehicules_2018.csv`).
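
The renaming could be scripted along these lines. This is only a sketch: `normalized_name` is a hypothetical helper, and the raw file names in the usage lines are invented examples, not the actual names on data.gouv.fr. It assumes each raw name contains the category and a four-digit year, and it skips the registered-vehicles files, which are not used here.

```python
import re
import unicodedata
from typing import Optional

CATEGORIES = ("caracteristiques", "lieux", "vehicules", "usagers")

def normalized_name(filename: str) -> Optional[str]:
    """Map a raw file name to 'category_year.csv', or None if it cannot be parsed."""
    # Strip accents and lowercase so 'Véhicules' matches 'vehicules'.
    decomposed = unicodedata.normalize("NFKD", filename)
    ascii_name = decomposed.encode("ascii", "ignore").decode().lower()
    if "immatricules" in ascii_name:
        return None  # registered-vehicles dataset, deliberately left out
    year = re.search(r"20\d{2}", ascii_name)
    category = next((c for c in CATEGORIES if c in ascii_name), None)
    if year is None or category is None:
        return None
    return f"{category}_{year.group(0)}.csv"

# Invented example names, for illustration only.
print(normalized_name("caracteristiques-2019.csv"))  # caracteristiques_2019.csv
print(normalized_name("Véhicules 2021.csv"))         # vehicules_2021.csv
```

The result can then be passed to `os.rename` for every file in `data/raw/`.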

The file columns contain encoded information about various accident characteristics.  
The descriptions are provided in the latest version of the encoding documentation:

- **description-des-bases-de-donnees-annuelles.pdf** (description of annual databases)
- **Caracteristics.docx** (translated characteristics from French to English)

Both documents are stored in the `references/` folder.

---

To prepare the data for machine learning, all the individual files must be merged into a single dataset. The `Num_Acc` column (normalized to `num_acc` when the files are read) is the join key that links the Caractéristiques, Lieux, Véhicules, and Usagers tables for each accident.

In [2]:
import pandas as pd
import os

raw_path = "../data/raw"
processed_path = "../data/processed"
output_file = os.path.join(processed_path, "accidents_merged_2005_2023.csv")

# Some files use a different delimiter; ISO-8859-1 (Latin-1) is a common encoding for French files
def robust_read_csv(fpath, encoding="ISO-8859-1"): 
    for delim in [",", ";", "\t"]:
        df = pd.read_csv(fpath, delimiter=delim, encoding=encoding, on_bad_lines="skip", low_memory=False)
        df.columns = df.columns.str.strip().str.lower().str.replace('"', '')
        # Rename for accident_id
        if 'accident_id' in df.columns:
            df = df.rename(columns={'accident_id': 'num_acc'}) #some datasets had different name for the num_acc key column
        if 'num_acc' in df.columns:
            return df
    print(f"!! Could not properly split columns for {fpath}, got: {df.columns.tolist()}")
    return df

all_years = []
for year in range(2005, 2024):
    files = {
        "carac": os.path.join(raw_path, f"caracteristiques_{year}.csv"),
        "lieux": os.path.join(raw_path, f"lieux_{year}.csv"),
        "vehic": os.path.join(raw_path, f"vehicules_{year}.csv"),
        "usag": os.path.join(raw_path, f"usagers_{year}.csv"),
    }
    dfs = {k: robust_read_csv(fpath) for k, fpath in files.items()}

    if all('num_acc' in df.columns for df in dfs.values()):
        df = dfs["carac"].merge(dfs["lieux"], on="num_acc", how="inner", suffixes=('', '_lieux')) #Keep only rows that have a match in both DataFrames
        df = df.merge(dfs["vehic"], on="num_acc", how="inner", suffixes=('', '_vehic'))
        df = df.merge(dfs["usag"], on="num_acc", how="inner", suffixes=('', '_usag'))
        df["an"] = year  # Replace 'an' column with the current year for unification, for some years it was 5 instead of 2005
        all_years.append(df)
    else:
        print(f"Skipping year {year}: 'num_acc' not found in all files.")

# Combine all years and save
merged_df = pd.concat(all_years, ignore_index=True)
os.makedirs(processed_path, exist_ok=True)
merged_df.to_csv(output_file, index=False)
print(f"Merged file saved as: {output_file}")


Merged file saved as: ../data/processed\accidents_merged_2005_2023.csv
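
One point worth checking after this merge is the row count. Joining the vehicles and users tables on `num_acc` alone pairs every vehicle with every user of the same accident, which may explain repeated user rows in the merged table. The toy example below (invented data, not taken from the BAAC files) compares that join with one that also uses `num_veh`. It is a sketch for diagnosis, not a replacement for the cell above.

```python
import pandas as pd

# One accident with two vehicles and three users (toy data).
vehicles = pd.DataFrame({
    "num_acc": [1, 1],
    "num_veh": ["A01", "B02"],
    "catv": [7, 1],
})
users = pd.DataFrame({
    "num_acc": [1, 1, 1],
    "num_veh": ["A01", "B02", "B02"],
    "grav": [1, 4, 3],
})

# Join on the accident only: every vehicle is paired with every user.
by_accident = vehicles.merge(users, on="num_acc", suffixes=("", "_usag"))

# Join on accident and vehicle: each user is attached to their own vehicle.
# validate raises an error if a vehicle appears more than once on the left side.
by_vehicle = vehicles.merge(users, on=["num_acc", "num_veh"], validate="one_to_many")

print(len(by_accident))  # 6
print(len(by_vehicle))   # 3
```

Whether the per-vehicle join is appropriate for every year depends on `num_veh` being filled consistently in both files, which should be verified first.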


After a big merge like this, we need a quick data health check.

In [5]:
merged_df.head(10)

Unnamed: 0,num_acc,an,mois,jour,hrmn,lum,agg,int,atm,col,...,an_nais,num_veh_usag,vma,id_vehicule,motor,id_vehicule_usag,secu1,secu2,secu3,id_usager
0,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1976.0,A01,,,,,,,,
1,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1968.0,B02,,,,,,,,
2,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1964.0,B02,,,,,,,,
3,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,2004.0,B02,,,,,,,,
4,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1998.0,B02,,,,,,,,
5,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1991.0,B02,,,,,,,,
6,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1976.0,A01,,,,,,,,
7,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1968.0,B02,,,,,,,,
8,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,1964.0,B02,,,,,,,,
9,200500000001,2005,1,12,1900,3,2,1,1.0,3.0,...,2004.0,B02,,,,,,,,


Description of the columns:

| Column                | Description |
|-----------------------|-----------------------------------|
| **num_acc**           | Unique identifier of the accident, assigned by law enforcement. Join key between the characteristics, locations, vehicles, and users files. |
| **an**                | Year of the accident. |
| **mois**              | Month of the accident. |
| **jour**              | Day of the accident. |
| **hrmn**              | Hour and minutes of the accident (format hhmm). |
| **lum**               | Lighting conditions at the time of the accident: 1 = Full daylight, 2 = Twilight or dawn, 3 = Night without public lighting, 4 = Night with public lighting not lit, 5 = Night with public lighting lit. |
| **agg**               | Urban area indicator: 1 = Outside urban area, 2 = Inside urban area. |
| **int**               | Type of intersection: 1 = Outside intersection, 2 = Intersection in X, 3 = Intersection in T, 4 = Intersection in Y, 5 = More than 4 branches, 6 = Roundabout, 7 = Square, 8 = Level crossing, 9 = Other intersection. |
| **atm**               | Atmospheric conditions at the time of the accident: -1 = Not specified, 1 = Normal, 2 = Light rain, 3 = Heavy rain, 4 = Snow–hail, 5 = Fog–smoke, 6 = Strong wind–storm, 7 = Dazzling, 8 = Overcast, 9 = Other. |
| **col**               | Type of collision: -1 = Not specified, 1 = Two vehicles – head-on, 2 = Two vehicles – rear-end, 3 = Two vehicles – side, 4 = Three vehicles or more – in chain, 5 = Three vehicles or more – multiple collisions, 6 = Other collision, 7 = Without collision. |
| **com**               | INSEE code of the municipality of the accident. |
| **adr**               | Postal address of the accident, when the accident is located inside urban area. |
| **gps**               | GPS coordinates (raw text). |
| **lat**               | Latitude (decimal degrees). |
| **long**              | Longitude (decimal degrees). |
| **dep**               | INSEE department code of the accident. |
| **catr**              | Road category: 1 = Highway, 2 = National road, 3 = Departmental road, 4 = Communal road, 5 = Outside public network, 6 = Parking lot, 7 = Urban metropolitan road, 9 = Other. |
| **voie**              | Name or number of the road at the place of the accident. |
| **v1**                | Numeric sub-identifier for the road (e.g., "2 bis" expressed as number). |
| **v2**                | Alphanumeric sub-identifier for the road. |
| **circ**              | Traffic regime: -1 = Not specified, 1 = One way, 2 = Two way, 3 = Separated carriageways, 4 = Variable. |
| **nbv**               | Number of lanes for vehicles at the section of the road where the accident happened. |
| **pr**                | PR number (reference point, upstream terminal) — -1 if not filled. |
| **pr1**               | Distance to PR in meters (relative to the upstream terminal) — -1 if not filled. |
| **vosp**              | Reserved lane: -1 = Not specified, 0 = Not applicable, 1 = Cycle track, 2 = Cycle lane, 3 = Reserved lane. |
| **prof**              | Longitudinal profile of the road: -1 = Not specified, 1 = Flat, 2 = Slope, 3 = Top of hill, 4 = Bottom of hill. |
| **plan**              | Road plan: -1 = Not specified, 1 = Straight, 2 = Left curve, 3 = Right curve, 4 = "S" curve. |
| **lartpc**            | Width of the central reservation (in meters). |
| **larrout**           | Width of the carriageway assigned to vehicles (in meters). |
| **surf**              | Surface condition: -1 = Not specified, 1 = Normal, 2 = Wet, 3 = Puddles, 4 = Flooded, 5 = Snow, 6 = Mud, 7 = Icy, 8 = Fatty/oily, 9 = Other. |
| **infra**             | Infrastructure: -1 = Not specified, 0 = None, 1 = Tunnel, 2 = Bridge, 3 = Interchange or ramp, 4 = Railway, 5 = Crossing, 6 = Pedestrian area, 7 = Toll zone, 8 = Construction, 9 = Other. |
| **situ**              | Situation: -1 = Not specified, 0 = None, 1 = On roadway, 2 = On emergency lane, 3 = On shoulder, 4 = On sidewalk, 5 = On cycle path, 6 = On special route, 8 = Other. |
| **env1**              | Environment at the site of the accident (additional codings, see documentation if available). |
| **senc**              | Direction of movement: -1 = Not specified, 0 = Unknown, 1 = Ascending, 2 = Descending, 3 = No reference. |
| **catv**              | Category of vehicle involved: 01 = Bicycle, 07 = Passenger car, 13 = Heavy truck, 31 = Motorcycle, 37 = Bus, etc. (see full coding in documentation). |
| **occutc**            | Number of people present in the vehicle (for public transport vehicles). |
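| **obs**               | Fixed obstacle hit: -1 = Not specified, 0 = Not applicable, 1 = Parked vehicle, 2 = Tree, 3 = Metal guardrail, 4 = Concrete guardrail, 5 = Other guardrail, 6 = Building, wall, or bridge pier, 7 = Vertical sign support or emergency call box, 8 = Pole, 9 = Street furniture, 10 = Parapet, 11 = Traffic island, refuge, or bollard, 12 = Curb, 13 = Ditch, embankment, or rock face, 14 = Other fixed obstacle on the roadway, 15 = Other fixed obstacle on the sidewalk or shoulder, 16 = Run-off-road without obstacle, 17 = Culvert head. |
| **obsm**              | Mobile obstacle hit: -1 = Not specified, 0 = None, 1 = Pedestrian, 2 = Vehicle, 4 = Rail vehicle, 5 = Domestic animal, 6 = Wild animal, 9 = Other. |
| **choc**              | Initial point of impact: -1 = Not specified, 0 = None, 1 = Front, 2 = Front right, 3 = Front left, 4 = Rear, 5 = Rear right, 6 = Rear left, 7 = Side right, 8 = Side left, 9 = Multiple impacts (rollover). |
| **manv**              | Main maneuver before the accident: -1 = Not specified, 0 = Unknown, 1 = No change of direction, 2 = Same direction, same lane, 3 = Between two lanes, 4 = Reversing, 5 = Wrong way, 6 = Crossing the central reservation, 7 = In the bus lane, same direction, 8 = In the bus lane, opposite direction, 9 = Merging, 10 = U-turn on the roadway, 11 = Changing lane to the left, 12 = Changing lane to the right, 13 = Drifting left, 14 = Drifting right, 15 = Turning left, 16 = Turning right, 17 = Overtaking on the left, 18 = Overtaking on the right, 19 = Crossing the roadway, 20 = Parking maneuver, 21 = Avoidance maneuver, 22 = Opening a door, 23 = Stopped (not parked), 24 = Parked (with occupants), 25 = Driving on the sidewalk, 26 = Other maneuvers. |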
| **num_veh**           | Vehicle identifier in the accident, allows linking with occupants and users. Alphanumeric code. |
| **place**             | Position occupied by the user in the vehicle or accident: 10 = Pedestrian (see documentation for other positions/seats). |
| **catu**              | Category of user: 1 = Driver, 2 = Passenger, 3 = Pedestrian. |
| **grav**              | Severity of injury for the user: 1 = Unharmed, 2 = Killed, 3 = Hospitalized, 4 = Light injury. |
| **sexe**              | Gender of the user: 1 = Male, 2 = Female. |
| **trajet**            | Reason for the journey: -1 or 0 = Not specified, 1 = Home-work, 2 = Home-school, 3 = Shopping, 4 = Professional, 5 = Leisure, 9 = Other. |
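| **secu**              | Safety equipment of the user, up to 2018, as a two-character code. The first character is the equipment (1 = Seat belt, 2 = Helmet, 3 = Child restraint, 4 = Reflective equipment, 9 = Other) and the second is whether it was used (1 = Yes, 2 = No, 3 = Undeterminable). For 2019 onwards, see secu1/secu2/secu3. |
| **locp**              | Location of the pedestrian: -1 = Not specified, 0 = Not applicable, 1 = On roadway, more than 50 m from a crosswalk, 2 = On roadway, less than 50 m from a crosswalk, 3 = On crosswalk without traffic lights, 4 = On crosswalk with traffic lights, 5 = On sidewalk, 6 = On shoulder, 7 = On refuge or emergency lane, 8 = On service road, 9 = Unknown. |
| **actp**              | Action of the pedestrian: -1 = Not specified, 0 = Not specified or not applicable, 1 = Moving in the direction of the striking vehicle, 2 = Moving in the opposite direction, 3 = Crossing, 4 = Hidden from view, 5 = Playing or running, 6 = With an animal, 9 = Other, A = Getting in or out of a vehicle, B = Unknown. |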
| **etatp**             | Pedestrian's situation: -1 = Not specified, 1 = Alone, 2 = Accompanied, 3 = In group. |
| **an_nais**           | Year of birth of the user involved in the accident. |
| **num_veh_usag**      | Vehicle identifier (user table), alphanumeric code. |
| **vma**               | Maximum speed limit at the location and time of the accident (in km/h). |
| **id_vehicule**       | Unique identifier of the vehicle (in users and vehicles files). |
| **motor**             | Engine type: -1 = Not specified, 0 = Unknown, 1 = Hydrocarbons, 2 = Electric hybrid, 3 = Electric, 4 = Hydrogen, 5 = Human-powered, 6 = Other. |
| **id_vehicule_usag**  | Unique vehicle identifier from the users file. |
| **secu1**             | 1st safety equipment used (from 2019 onwards). |
| **secu2**             | 2nd safety equipment used (from 2019 onwards). |
| **secu3**             | 3rd safety equipment used (from 2019 onwards). |
| **id_usager**         | Unique identifier of the user (in the users file). |
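
For exploration and plots, the numeric codes in this table are easier to read once decoded into labels. Below is a minimal sketch for the `grav` column, using the labels from the table above. `GRAV_LABELS` and `decode` are illustrative names, and codes missing from the mapping (such as -1) are left as `NaN`.

```python
import pandas as pd

# Labels copied from the column description table.
GRAV_LABELS = {1: "Unharmed", 2: "Killed", 3: "Hospitalized", 4: "Light injury"}

def decode(codes: pd.Series, labels: dict) -> pd.Series:
    """Replace numeric codes with text labels; unknown codes become NaN."""
    return codes.map(labels)

toy_grav = pd.Series([1, 4, 2, -1])
decoded = decode(toy_grav, GRAV_LABELS)
print(decoded.tolist()[:3])  # ['Unharmed', 'Light injury', 'Killed']
```

The same pattern extends to the other coded columns by adding one dictionary per column.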


In [12]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5314184 entries, 0 to 5314183
Data columns (total 60 columns):
 #   Column            Dtype  
---  ------            -----  
 0   num_acc           int64  
 1   an                int64  
 2   mois              int64  
 3   jour              int64  
 4   hrmn              object 
 5   lum               int64  
 6   agg               int64  
 7   int               int64  
 8   atm               float64
 9   col               float64
 10  com               object 
 11  adr               object 
 12  gps               object 
 13  lat               object 
 14  long              object 
 15  dep               object 
 16  catr              float64
 17  voie              object 
 18  v1                float64
 19  v2                object 
 20  circ              float64
 21  nbv               object 
 22  pr                object 
 23  pr1               object 
 24  vosp              float64
 25  prof              float64
 26  plan          

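Several columns need more appropriate types: coded variables become categories, identifiers and free-text fields become strings, and the date parts become nullable integers.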

In [19]:
categorical_columns = [
    "lum", "agg", "int", "atm", "col", "catr", "circ", "prof", "plan", "surf",
    "infra", "situ", "env1", "senc", "catv", "catu", "grav", "sexe", "trajet",
    "secu", "locp", "etatp", "vosp", "obs", "obsm", "choc", "manv", "motor",
    "secu1", "secu2", "secu3", "place"
]
string_columns = [
    "com", "dep", "voie", "v2", "adr", "lartpc", "larrout", "nbv", "pr", "pr1",
    "lat", "long", "gps", "num_veh", "num_veh_usag", "id_vehicule",
    "id_vehicule_usag", "id_usager"
]
int_columns = ["an", "mois", "jour", "an_nais", "vma", "num_acc"]
time_columns = ["hrmn"]

for col in categorical_columns:
    if col in merged_df.columns:
        merged_df[col] = merged_df[col].astype("category")
for col in string_columns:
    if col in merged_df.columns:
        merged_df[col] = merged_df[col].astype(str)
for col in int_columns:
    if col in merged_df.columns:
        merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce').astype("Int64")
for col in time_columns:
    if col in merged_df.columns:
        merged_df[col] = merged_df[col].astype(str)
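
One caveat with `astype(str)`: under pandas 2.x it turns missing values into the literal text `nan`, so a later `isnull()` check reports zero missing values for those columns. The sketch below, on toy data, contrasts it with the nullable `"string"` dtype, which keeps missing values as missing. It is offered as an alternative to consider, not as the conversion used in this notebook.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["75056", np.nan, "13055"], dtype=object)

as_str = raw.astype(str)          # pandas 2.x: NaN becomes the text "nan"
as_string = raw.astype("string")  # nullable string dtype: NaN stays missing

# Under pandas 2.x the first count is 0; newer versions may behave differently.
print(as_str.isna().sum())
print(as_string.isna().sum())  # 1
```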

In [71]:
merged_df.describe()

Unnamed: 0,num_acc,an,mois,jour,lum,agg,int,atm,col,catr,...,trajet,secu,locp,etatp,an_nais,vma,motor,secu1,secu2,secu3
count,5314184.0,5314184.0,5314184.0,5314184.0,5314184.0,5314184.0,5314184.0,5313921.0,5314089.0,5314182.0,...,5313106.0,4002731.0,5194985.0,5194893.0,5292557.0,1251410.0,1251410.0,1251410.0,1251410.0,1251410.0
mean,201332800000.0,2013.328,6.701324,15.58823,1.860729,1.616256,1.794296,1.577097,3.598634,3.223352,...,3.090887,16.90227,0.01232997,-0.1731289,1975.504,61.81612,1.219787,1.901712,1.102819,-0.9200246
std,565902700.0,5.65908,3.374824,8.759571,1.470012,0.4862969,1.641354,1.622932,1.711043,1.261386,...,2.700352,17.04492,0.7889048,0.5367348,18.55589,26.97713,1.038167,2.239605,3.092796,0.8557038
min,200500000000.0,2005.0,1.0,1.0,-1.0,1.0,-1.0,-1.0,-1.0,1.0,...,-1.0,0.0,-1.0,-1.0,1896.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,200800100000.0,2008.0,4.0,8.0,1.0,1.0,1.0,1.0,2.0,3.0,...,0.0,11.0,0.0,0.0,1963.0,50.0,1.0,1.0,-1.0,-1.0
50%,201300000000.0,2013.0,7.0,16.0,1.0,2.0,1.0,1.0,3.0,3.0,...,4.0,11.0,0.0,0.0,1978.0,50.0,1.0,1.0,0.0,-1.0
75%,201800000000.0,2018.0,10.0,23.0,2.0,2.0,2.0,1.0,5.0,4.0,...,5.0,21.0,0.0,0.0,1989.0,80.0,1.0,2.0,0.0,-1.0
max,202300100000.0,2023.0,12.0,31.0,5.0,2.0,9.0,9.0,7.0,9.0,...,9.0,93.0,9.0,3.0,2023.0,901.0,6.0,9.0,9.0,9.0


Number of rows per year (sorted by count, not by year):

In [23]:
print(merged_df['an'].value_counts())

an
2005    374561
2006    362507
2007    356228
2008    322196
2009    311706
2023    309341
2010    288112
2011    281675
2012    263194
2017    260392
2016    257286
2019    253488
2014    248642
2018    248406
2021    248187
2015    245706
2013    242163
2022    241487
2020    198907
Name: count, dtype: Int64
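
These counts are rows of the merged table, not accidents, because one accident spans several rows. A short sketch for counting distinct accidents per year in chronological order follows. `accidents_per_year` is an illustrative helper, shown on toy data.

```python
import pandas as pd

def accidents_per_year(df: pd.DataFrame) -> pd.Series:
    """Count distinct accident identifiers per year, ordered by year."""
    return df.groupby("an")["num_acc"].nunique().sort_index()

# Toy data: accident 1 appears twice because it involves two users.
toy = pd.DataFrame({"an": [2006, 2005, 2005, 2005], "num_acc": [3, 1, 1, 2]})
counts = accidents_per_year(toy)
print(counts.to_dict())  # {2005: 2, 2006: 1}
```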


Count and remove rows that are exact duplicates across all columns.

In [26]:
print(merged_df.duplicated().sum())
merged_df = merged_df.drop_duplicates()

4644


Count the NaN values per column, then drop every column in which more than 80% of the values are missing.

In [33]:
display(merged_df.isnull().sum())
merged_df = merged_df.loc[:, merged_df.isnull().mean() <= 0.8]

num_acc                   0
an                        0
mois                      0
jour                      0
hrmn                      0
lum                       0
agg                       0
int                       0
atm                     263
col                      95
com                       0
adr                       0
gps                       0
lat                       0
long                      0
dep                       0
catr                      2
voie                      0
v1                  2678639
v2                        0
circ                   6506
nbv                       0
pr                        0
pr1                       0
vosp                  11927
prof                   8157
plan                  10181
lartpc                    0
larrout                   0
surf                   8124
infra                 24171
situ                  22184
env1                1275534
senc                    583
catv                      0
occutc              

---

Next steps: decide how to handle missing data. Many columns contain a large share of -1, 0, or empty values, so "not specified" needs a consistent treatment.
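
As a starting point, the share of sentinel values per column can be measured before choosing a strategy. This is a sketch on toy data. `sentinel_share` is a hypothetical helper, and treating only -1 as "not specified" is an assumption to adjust per column, since 0 is sometimes a valid code.

```python
import pandas as pd

def sentinel_share(df: pd.DataFrame, sentinels=(-1,)) -> pd.Series:
    """Fraction of values equal to a sentinel, for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return numeric.isin(list(sentinels)).mean().sort_values(ascending=False)

toy = pd.DataFrame({
    "atm": [1, -1, 2, -1],
    "vma": [50, 80, -1, 90],
    "adr": ["a", "b", "c", "d"],  # non-numeric, ignored
})
share = sentinel_share(toy)
print(share.to_dict())  # {'atm': 0.5, 'vma': 0.25}
```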
