# Introduction

This notebook analyses, concatenates and merges all the data from 2005 to 2020, and generates a 'raw' CSV file containing all the available data.  
This 'raw' CSV file then serves as the input to our preprocessing (removing unused columns or rows, keeping the rows we need...).

***
# Import libraries & packages

In [101]:
import pandas as pd
import numpy as np

***
# Loading files from 2018 and 2019, for comparison

In [102]:
carac_2018 = pd.read_csv('data/2018/caracteristiques-2018.csv', sep=',', encoding = "ANSI")
vehic_2018 = pd.read_csv('data/2018/vehicules-2018.csv',sep=',', encoding = "ANSI")
lieux_2018 = pd.read_csv('data/2018/lieux-2018.csv', sep=',', encoding = "ANSI")
usage_2018 = pd.read_csv('data/2018/usagers-2018.csv', sep=',', encoding = "ANSI")

carac_2019 = pd.read_csv('data/2019/caracteristiques-2019.csv', sep=';')
vehic_2019 = pd.read_csv('data/2019/vehicules-2019.csv', sep=';')
lieux_2019 = pd.read_csv('data/2019/lieux-2019.csv', sep=';')
usage_2019 = pd.read_csv('data/2019/usagers-2019.csv', sep=';')



***
# Analysing columns before/after 2019

In [103]:
print("Carac 2018 columns     : ", list(carac_2018.columns))
print("Carac 2019 columns     : ", list(carac_2019.columns))
print('\n')
print("vehic 2018 columns     : ", list(vehic_2018.columns))
print("vehic 2019 columns     : ", list(vehic_2019.columns))
print('\n')
print("lieux 2018 columns     : ", list(lieux_2018.columns))
print("lieux 2019 columns     : ", list(lieux_2019.columns))
print('\n')
print("usage 2018 columns     : ", list(usage_2018.columns))
print("usage 2019 columns     : ", list(usage_2019.columns))

Carac 2018 columns     :  ['Num_Acc', 'an', 'mois', 'jour', 'hrmn', 'lum', 'agg', 'int', 'atm', 'col', 'com', 'adr', 'gps', 'lat', 'long', 'dep']
Carac 2019 columns     :  ['Num_Acc', 'jour', 'mois', 'an', 'hrmn', 'lum', 'dep', 'com', 'agg', 'int', 'atm', 'col', 'adr', 'lat', 'long']


vehic 2018 columns     :  ['Num_Acc', 'senc', 'catv', 'occutc', 'obs', 'obsm', 'choc', 'manv', 'num_veh']
vehic 2019 columns     :  ['Num_Acc', 'id_vehicule', 'num_veh', 'senc', 'catv', 'obs', 'obsm', 'choc', 'manv', 'motor', 'occutc']


lieux 2018 columns     :  ['Num_Acc', 'catr', 'voie', 'v1', 'v2', 'circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof', 'plan', 'lartpc', 'larrout', 'surf', 'infra', 'situ', 'env1']
lieux 2019 columns     :  ['Num_Acc', 'catr', 'voie', 'v1', 'v2', 'circ', 'nbv', 'vosp', 'prof', 'pr', 'pr1', 'plan', 'lartpc', 'larrout', 'surf', 'infra', 'situ', 'vma']


usage 2018 columns     :  ['Num_Acc', 'place', 'catu', 'grav', 'sexe', 'trajet', 'secu', 'locp', 'actp', 'etatp', 'an_nais', 'num_
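The printed lists can also be compared programmatically. The sketch below is only an illustration: `column_diff` is a hypothetical helper, and the lists are copied from the output above ('usagers' is left out because its output is truncated).

```python
# Column lists copied from the printed output above
carac_2018_cols = ['Num_Acc', 'an', 'mois', 'jour', 'hrmn', 'lum', 'agg', 'int',
                   'atm', 'col', 'com', 'adr', 'gps', 'lat', 'long', 'dep']
carac_2019_cols = ['Num_Acc', 'jour', 'mois', 'an', 'hrmn', 'lum', 'dep', 'com',
                   'agg', 'int', 'atm', 'col', 'adr', 'lat', 'long']
vehic_2018_cols = ['Num_Acc', 'senc', 'catv', 'occutc', 'obs', 'obsm', 'choc',
                   'manv', 'num_veh']
vehic_2019_cols = ['Num_Acc', 'id_vehicule', 'num_veh', 'senc', 'catv', 'obs',
                   'obsm', 'choc', 'manv', 'motor', 'occutc']
lieux_2018_cols = ['Num_Acc', 'catr', 'voie', 'v1', 'v2', 'circ', 'nbv', 'pr',
                   'pr1', 'vosp', 'prof', 'plan', 'lartpc', 'larrout', 'surf',
                   'infra', 'situ', 'env1']
lieux_2019_cols = ['Num_Acc', 'catr', 'voie', 'v1', 'v2', 'circ', 'nbv', 'vosp',
                   'prof', 'pr', 'pr1', 'plan', 'lartpc', 'larrout', 'surf',
                   'infra', 'situ', 'vma']


def column_diff(old_cols, new_cols):
    """Hypothetical helper: return (columns only in old, columns only in new)."""
    old, new = set(old_cols), set(new_cols)
    return sorted(old - new), sorted(new - old)


for name, old_cols, new_cols in [
    ('carac', carac_2018_cols, carac_2019_cols),
    ('vehic', vehic_2018_cols, vehic_2019_cols),
    ('lieux', lieux_2018_cols, lieux_2019_cols),
]:
    only_old, only_new = column_diff(old_cols, new_cols)
    print(f'{name}: only in 2018 = {only_old}, only in 2019 = {only_new}')
```

The differences it reports ('gps', 'id_vehicule', 'motor', 'env1', 'vma') are the columns handled by the transformation rules later in this notebook.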

***
# Concat old CSV files: 2005 - 2018

Remarks:
- Files before 2019 (2005-2018) are separated with commas ','
- Files before 2019 (2005-2018) are encoded in ANSI
- File 'caracteristiques' in 2009 is separated with tabulations
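These remarks can be gathered in one place. The sketch below is a possible way to do it, not the code used in this notebook: `read_params` and its `legacy_encoding` parameter are hypothetical names. Note that 'ANSI' appears to be accepted as a codec name only on Windows; on other systems a named code page such as 'cp1252' may be needed, which is an assumption about the source files.

```python
def read_params(year, table, legacy_encoding='ANSI'):
    """Hypothetical helper: keyword arguments for pd.read_csv for one yearly file.

    `table` is one of 'caracteristiques', 'vehicules', 'lieux', 'usagers'.
    """
    # 2019 and later: semicolon-separated, UTF-8
    if year >= 2019:
        return {'sep': ';', 'encoding': 'UTF-8'}
    # Single exception before 2019: 'caracteristiques' 2009 is tab-separated
    if year == 2009 and table == 'caracteristiques':
        return {'sep': '\t', 'encoding': 'UTF-8'}
    # All other files before 2019: comma-separated, legacy encoding
    return {'sep': ',', 'encoding': legacy_encoding}


print(read_params(2009, 'caracteristiques'))
print(read_params(2018, 'lieux'))
print(read_params(2020, 'usagers'))
```

A call such as `pd.read_csv(path, **read_params(year, 'lieux'))` would then pick the right settings for each year.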

In [104]:
# We initialize our 4 dataframes
concat_carac_old = pd.DataFrame()
concat_vehic_old = pd.DataFrame()
concat_lieux_old = pd.DataFrame()
concat_usage_old = pd.DataFrame()

# We use the old separator and encoding (before 2019)
separator = ','
encoding = 'ANSI'

# Loading and concatenation of files for each year (the four files are not merged yet)
for year in range(2005, 2019):

    # Only one file is different from the others in 2009 : carac (separator is a tabulation and encoding is changed)
    if year == 2009:
        concat_carac_old = pd.concat([concat_carac_old, pd.read_csv(f'data/{year}/caracteristiques-{year}.csv', sep='\t', encoding = 'UTF-8')])
    # Otherwise it's the same as the 3 other files below
    else:
        concat_carac_old = pd.concat([concat_carac_old, pd.read_csv(f'data/{year}/caracteristiques-{year}.csv', sep=separator, encoding = encoding)])

    concat_vehic_old = pd.concat([concat_vehic_old, pd.read_csv(f'data/{year}/vehicules-{year}.csv', sep=separator, encoding = encoding)])
    concat_lieux_old = pd.concat([concat_lieux_old, pd.read_csv(f'data/{year}/lieux-{year}.csv', sep=separator, encoding = encoding)])
    concat_usage_old = pd.concat([concat_usage_old, pd.read_csv(f'data/{year}/usagers-{year}.csv', sep=separator, encoding = encoding)])

In [105]:
# Displaying some information on our concatenated files
print('concat_carac_old:', concat_carac_old.shape[0], 'lines')
print('concat_vehic_old:', concat_vehic_old.shape[0], 'lines')
print('concat_lieux_old:', concat_lieux_old.shape[0], 'lines')
print('concat_usage_old:', concat_usage_old.shape[0], 'lines')

concat_carac_old: 958469 lines
concat_vehic_old: 1635811 lines
concat_lieux_old: 958469 lines
concat_usage_old: 2142195 lines
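As a side note, the yearly frames can also be collected in a list and concatenated once, which avoids re-copying the accumulated frame at every iteration. This is a minimal sketch on made-up in-memory data, not a change to the results above; `fake_files` is a hypothetical stand-in for the real CSV files, and `ignore_index=True` is an optional difference from the loop above (it renumbers the rows from 0).

```python
import io

import pandas as pd

# Two tiny in-memory "yearly files" stand in for the real CSVs (made-up values)
fake_files = {
    2005: "Num_Acc,an\n200500000001,5\n200500000002,5\n",
    2006: "Num_Acc,an\n200600000001,6\n",
}

# Read every year into a list, then concatenate once
frames = [
    pd.read_csv(io.StringIO(text), sep=',')
    for year, text in sorted(fake_files.items())
]
combined = pd.concat(frames, ignore_index=True)

print('combined:', combined.shape[0], 'lines')
```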


***
# Apply transformation rules to the old files to obtain 'raw' dataframes before the concat with newer files

**caracteristiques** :
- Remove column 'gps'
- 'an' : add '20' before the date (we only have '18') - we can add 2000 to the int value
- 'hrmn' : split to get 'HH:mm' format. Be careful with zeros (40 -> 00:40)
- 'atm', 'col' : replace NaN with '-1' and change to int
- 'dep' : remove the last digit *if it is equal to 0*, change it to string
- 'com' : the code is just the last digits, add the department (and zeros) to get the complete city (5 and 590 -> 59005), change it to string
- 'lat' and 'long' : divide the float by 100 000 (5055737 -> 50.55737)

**vehicules**:
- The id of the vehicle is missing: it is not needed, we will remove it from the new files
- 'motor' is missing: create the column with '-1'
- 'senc', 'obs', 'obsm', 'choc', 'manv' : replace NaN with '-1' and change to int

**lieux** :
- Compare 'env1' (old file) with 'vma' in the new files
  - Nothing in common, remove both columns (env1 in old files, vma in newer files)
- 'circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof', 'plan', 'surf', 'infra', 'situ', 'catr' : replace NaN with '-1' and change to int

**usagers** :
- The id of the vehicle is missing: it is not needed, we will remove it from the new files
- 'secu' has become 'secu1', 'secu2' and 'secu3'
  - The values are partially consistent, use the function made by Houssam to align on 2019
  - Rename secu to secu1
  - On newer files, we will keep only secu1, and remove secu2 and secu3 (almost empty anyway)
- 'place', 'trajet', 'locp', 'actp', 'etatp', 'an_nais': replace NaN with '-1' and change to int
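Some of the 'caracteristiques' rules above are easier to follow on a worked example. The sketch below applies them to a toy frame with made-up values; it uses `str.zfill` for the zero padding, which is a swapped-in technique (unlike slicing, it never truncates longer strings).

```python
import pandas as pd

# Toy rows in the old (pre-2019) format, values made up for illustration
toy = pd.DataFrame({
    'an': [18, 5],
    'hrmn': [40, 1930],
    'dep': [590, 10],
    'com': [5, 4],
    'lat': [5055737.0, 4620000.0],
})

# 'an': 18 -> 2018
toy['an'] = toy['an'] + 2000

# 'hrmn': 40 -> '0040' -> '00:40'
toy['hrmn'] = toy['hrmn'].astype(str).str.zfill(4)
toy['hrmn'] = toy['hrmn'].str[:2] + ':' + toy['hrmn'].str[2:]

# 'dep': drop the last digit only when it is 0 (590 -> 59, 10 -> 1), then string
ends_with_zero = toy['dep'] % 10 == 0
toy['dep'] = toy['dep'].where(~ends_with_zero, toy['dep'] // 10).astype(str)

# 'com': department (first 2 characters) + 3-digit commune, padded to 5 characters
toy['com'] = toy['dep'].str[:2] + toy['com'].astype(str).str.zfill(3)
toy['com'] = toy['com'].str.zfill(5)

# 'lat': 5055737 -> 50.55737
toy['lat'] = toy['lat'] / 100000

print(toy)
```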

### caracteristiques

In [106]:
# Remove column 'gps'
concat_carac_old = concat_carac_old.drop('gps', axis = 1)

# 'an' : add '20' before the date (we only have '18') - we can add 2000 to the int value
concat_carac_old.an += 2000

# 'hrmn' : split to get 'HH:mm' format. Be careful with zeros (40 -> 00:40)
concat_carac_old.hrmn = concat_carac_old.hrmn.astype(str)
concat_carac_old.hrmn = concat_carac_old.hrmn.apply(lambda x : ('0000'+x)[-4:])
concat_carac_old.hrmn = concat_carac_old.hrmn.apply(lambda x : x[:2] + ':' + x[2:])

# 'atm', 'col' : replace NaN with '-1' and change to int
concat_carac_old.fillna({x:-1 for x in ['atm','col']}, inplace= True)
concat_carac_old[['atm', 'col']] = concat_carac_old[['atm', 'col']].astype(int)

# 'dep' : remove the last digit *if it is equal to 0*, change it to string
concat_carac_old.loc[concat_carac_old.dep % 10 == 0, 'dep'] = concat_carac_old.loc[concat_carac_old.dep % 10 == 0, 'dep'] / 10
concat_carac_old.dep = concat_carac_old.dep.astype(int)
concat_carac_old.dep = concat_carac_old.dep.astype(str)

# 'com' : the code is just the last digits, add the department (and zeros) to get the complete city (5 and 590 -> 59005), change it to string
concat_carac_old.fillna({'com':'0'}, inplace= True)
concat_carac_old.com = concat_carac_old.com.astype(int)
concat_carac_old.com = concat_carac_old.com.astype(str)
# We make sure all communes are encoded with 3 digits (starts with 0 if necessary)
concat_carac_old.com = concat_carac_old.com.apply(lambda x : ('000'+x)[-3:])
# We concat the dep (only first 2 digits) + com
concat_carac_old.com = concat_carac_old.dep.str[:2] + concat_carac_old.com
# We make sure all communes are now encoded with 5 digits (starts with 0 if necessary)
concat_carac_old.com = concat_carac_old.com.apply(lambda x : ('0'+x)[-5:])

# 'lat' and 'long' : divide the float by 100 000 (5055737 -> 50.55737)
concat_carac_old.lat = concat_carac_old.lat.astype(float)
concat_carac_old.lat = concat_carac_old.lat / 100000
concat_carac_old.long = concat_carac_old.long.replace({'-':'0'}) # Some long values are '-' instead of NaN
concat_carac_old.long = concat_carac_old.long.astype(float)
concat_carac_old.long = concat_carac_old.long / 100000

### vehicules

In [107]:
# 'motor' is missing: create the column with '-1'
concat_vehic_old['motor'] = -1

# 'senc', 'obs', 'obsm', 'choc', 'manv' : replace NaN with '-1' and change to int
concat_vehic_old.fillna({x:-1 for x in ['senc', 'obs', 'obsm', 'choc', 'manv']}, inplace= True)
concat_vehic_old[['senc', 'obs', 'obsm', 'choc', 'manv']] = concat_vehic_old[['senc', 'obs', 'obsm', 'choc', 'manv']].astype(int)

### lieux

In [108]:
# Remove env1 column
concat_lieux_old.drop('env1', axis = 1, inplace= True)

# 'catr', 'circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof', 'plan', 'surf', 'infra', 'situ' : replace NaN with '-1' and change to int
concat_lieux_old.fillna({x:-1 for x in ['catr','circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof', 'plan', 'surf', 'infra', 'situ']}, inplace= True)
concat_lieux_old[['catr','circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof', 'plan', 'surf', 'infra', 'situ']] = concat_lieux_old[['catr','circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof', 'plan', 'surf', 'infra', 'situ']].astype(int)

### usagers

In [109]:
# Define the secu adjusting function, based on the documentation
# Old values are encoded on 2 digits
def adjust_secu_with_2019(x):
    # Values that do not conform to the doc are changed to '-1'
    if len(str(x))!=2 or str(x)[1]=='3' or x==-1 :
        y=-1
    # Values whose second digit equals '2' are changed to '0' (equipment not worn)
    elif str(x)[1]=='2':
        y=0
    # Else, the first digit is used
    else:
        y= int(str(x)[0])
    return y

# The secu values are partially consistent, use the function made by Houssam to align on 2019

concat_usage_old['secu'] = concat_usage_old['secu'].fillna(-1)
concat_usage_old['secu'] = concat_usage_old['secu'].astype(int)
concat_usage_old.secu = concat_usage_old.secu.apply(adjust_secu_with_2019)


# Rename secu to secu1
concat_usage_old.rename({'secu':'secu1'}, axis = 'columns', inplace= True)

# 'place', 'trajet', 'locp', 'actp', 'etatp', 'an_nais': replace NaN with '-1' and change to int
concat_usage_old.fillna({x:-1 for x in ['place', 'trajet', 'locp', 'actp', 'etatp', 'an_nais']}, inplace= True)
concat_usage_old[['place', 'trajet', 'locp', 'actp', 'etatp', 'an_nais']] = concat_usage_old[['place', 'trajet', 'locp', 'actp', 'etatp', 'an_nais']].astype(int)
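To make the 'secu' alignment concrete, here is how a few old values are mapped. The sketch restates the rules of the function above under a hypothetical name, `adjust_secu_sketch`, so that it runs on its own; the input values are made up.

```python
import pandas as pd


def adjust_secu_sketch(x):
    """Restatement of the secu mapping above (hypothetical name), for illustration."""
    text = str(x)
    # Missing values, values not on 2 digits, or second digit '3' -> -1
    if x == -1 or len(text) != 2 or text[1] == '3':
        return -1
    # Second digit '2' -> 0 (equipment not worn)
    if text[1] == '2':
        return 0
    # Otherwise the first digit is kept
    return int(text[0])


old_secu = pd.Series([11, 12, 13, 21, 93, 1, -1])
new_secu1 = old_secu.map(adjust_secu_sketch)

print(pd.DataFrame({'secu (old)': old_secu, 'secu1 (aligned)': new_secu1}))
```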

***
# Concat new CSV files: 2019 - 2020

Remarks:
- Files in 2019 and 2020 are separated with semicolons ';'
- Files in 2019 and 2020 are encoded in UTF-8

In [110]:
# We initialize our 4 dataframes
concat_carac_new = pd.DataFrame()
concat_vehic_new = pd.DataFrame()
concat_lieux_new = pd.DataFrame()
concat_usage_new = pd.DataFrame()

# Since separators and encoding change in 2019, we now use the new ones (2019 and after)
separator = ';'
encoding = 'UTF-8'

# Loading and concatenation of files for each year (the four files are not merged yet)
for year in range(2019, 2021):
    concat_carac_new = pd.concat([concat_carac_new, pd.read_csv(f'data/{year}/caracteristiques-{year}.csv', sep=separator, encoding = encoding)])
    concat_vehic_new = pd.concat([concat_vehic_new, pd.read_csv(f'data/{year}/vehicules-{year}.csv', sep=separator, encoding = encoding)])
    concat_lieux_new = pd.concat([concat_lieux_new, pd.read_csv(f'data/{year}/lieux-{year}.csv', sep=separator, encoding = encoding)])
    concat_usage_new = pd.concat([concat_usage_new, pd.read_csv(f'data/{year}/usagers-{year}.csv', sep=separator, encoding = encoding)])

In [111]:
# Displaying some information on our concatenated files
print('concat_carac_new:', concat_carac_new.shape[0], 'lines')
print('concat_vehic_new:', concat_vehic_new.shape[0], 'lines')
print('concat_lieux_new:', concat_lieux_new.shape[0], 'lines')
print('concat_usage_new:', concat_usage_new.shape[0], 'lines')

concat_carac_new: 106584 lines
concat_vehic_new: 181776 lines
concat_lieux_new: 106584 lines
concat_usage_new: 238272 lines


***
# Apply transformation rules to the new files to obtain 'raw' dataframes before the concat with old files

**vehicules**:
- Remove 'id_vehicule'

**lieux** :
- Remove column vma
- 'pr', 'pr1' : replace '(1)' with '1' and change to 'int'

**usagers** :
- Remove 'id_vehicule'
- Keep only secu1, and remove secu2 and secu3 (almost empty anyway)
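The 'lieux' rules above can be sketched on a toy frame (made-up values). The replacement is an exact match on the whole cell value, not a substring or regex replacement.

```python
import pandas as pd

# Toy rows in the new (2019+) 'lieux' format, values made up for illustration
toy_lieux = pd.DataFrame({
    'Num_Acc': [201900000001, 201900000002],
    'pr': ['(1)', '12'],
    'pr1': ['0', '(1)'],
    'vma': [50, 80],
})

# Remove column vma
toy_lieux = toy_lieux.drop(columns=['vma'])

# 'pr', 'pr1': replace the exact value '(1)' with '1', then convert to int
toy_lieux[['pr', 'pr1']] = toy_lieux[['pr', 'pr1']].replace({'(1)': '1'}).astype(int)

print(toy_lieux)
```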

### vehicules

In [112]:
# Remove 'id_vehicule'
concat_vehic_new.drop('id_vehicule', axis = 1, inplace= True)

### lieux

In [113]:
# Remove column vma
concat_lieux_new.drop('vma', axis = 1, inplace= True)

# 'pr', 'pr1' : replace '(1)' with '1' and change to 'int'
concat_lieux_new.replace({'(1)':'1'} , inplace=True)
concat_lieux_new[['pr', 'pr1']] = concat_lieux_new[['pr', 'pr1']].astype(int)

### usagers

In [114]:
# Remove 'id_vehicule'
concat_usage_new.drop('id_vehicule', axis = 1, inplace= True)

# Keep only secu1, and remove secu2 and secu3 (almost empty anyway)
concat_usage_new.drop(['secu2', 'secu3'], axis = 1, inplace= True)

***
# Concat old and new files together

In [115]:
concat_carac = pd.concat([concat_carac_old, concat_carac_new])
concat_vehic = pd.concat([concat_vehic_old, concat_vehic_new])
concat_lieux = pd.concat([concat_lieux_old, concat_lieux_new])
concat_usage = pd.concat([concat_usage_old, concat_usage_new])

In [116]:
# Displaying some information on our concatenated files
print('concat_carac:', concat_carac.shape[0], 'lines')
print('concat_vehic:', concat_vehic.shape[0], 'lines')
print('concat_lieux:', concat_lieux.shape[0], 'lines')
print('concat_usage:', concat_usage.shape[0], 'lines')

concat_carac: 1065053 lines
concat_vehic: 1817587 lines
concat_lieux: 1065053 lines
concat_usage: 2380467 lines


***
# Merge the 4 CSV files

We need to start from 'usagers' and:
- Merge with 'caracteristiques' on 'Num_Acc'
- Merge with 'lieux' on 'Num_Acc'
- Merge with 'vehicules' on 'Num_Acc' and 'num_veh'

In [117]:
# This will be our merged data at the end (all years * 4 files)
concat_data = pd.DataFrame()

# Merging usagers and caracteristiques on Num_Acc
concat_data = pd.merge(concat_usage, concat_carac, on='Num_Acc')

# Merging data and lieux on Num_Acc
concat_data = pd.merge(concat_data, concat_lieux, on='Num_Acc')

# Merging data and vehicules on Num_Acc and num_veh
concat_data = pd.merge(concat_data, concat_vehic, on=['Num_Acc', 'num_veh'])

In [118]:
# Checking the resulting columns
concat_data.columns

Index(['Num_Acc', 'place', 'catu', 'grav', 'sexe', 'trajet', 'secu1', 'locp',
       'actp', 'etatp', 'an_nais', 'num_veh', 'an', 'mois', 'jour', 'hrmn',
       'lum', 'agg', 'int', 'atm', 'col', 'com', 'adr', 'lat', 'long', 'dep',
       'catr', 'voie', 'v1', 'v2', 'circ', 'nbv', 'pr', 'pr1', 'vosp', 'prof',
       'plan', 'lartpc', 'larrout', 'surf', 'infra', 'situ', 'senc', 'catv',
       'occutc', 'obs', 'obsm', 'choc', 'manv', 'motor'],
      dtype='object')

In [119]:
# Checking the resulting shape
print('concat_data:', concat_data.shape[0], 'rows')
print('concat_data:', concat_data.shape[1], 'columns')

concat_data: 2380573 rows
concat_data: 50 columns
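The merged frame has 2380573 rows, which is 106 more than `concat_usage` (2380467 lines). With a default inner merge, extra rows usually mean that some key is repeated in a right-hand table; which table is affected has not been checked here. A minimal sketch of how such duplicates could be located, on made-up data:

```python
import pandas as pd

# Toy 'vehicules' table where the key (Num_Acc, num_veh) appears twice
vehic = pd.DataFrame({'Num_Acc': [1, 1, 1],
                      'num_veh': ['A01', 'A01', 'B01'],
                      'catv': [7, 7, 2]})
usagers = pd.DataFrame({'Num_Acc': [1, 1],
                        'num_veh': ['A01', 'B01'],
                        'grav': [1, 3]})

# Flag every row whose key is repeated
dup_mask = vehic.duplicated(subset=['Num_Acc', 'num_veh'], keep=False)
print(vehic[dup_mask])

# The repeated key produces one extra row in the merge
merged = usagers.merge(vehic, on=['Num_Acc', 'num_veh'])
print(len(usagers), 'users ->', len(merged), 'merged rows')
```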


In [120]:
# Checking the head of the result
concat_data.head(10)

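| | Num_Acc | place | catu | grav | sexe | trajet | secu1 | locp | actp | etatp | ... | infra | situ | senc | catv | occutc | obs | obsm | choc | manv | motor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200500000001 | 1 | 1 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 7 | 0.0 | 0 | 2 | 1 | 1 | -1 |
| 1 | 200500000001 | 1 | 1 | 3 | 2 | 3 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 7 | 0.0 | 0 | 2 | 8 | 10 | -1 |
| 2 | 200500000001 | 2 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 7 | 0.0 | 0 | 2 | 8 | 10 | -1 |
| 3 | 200500000001 | 4 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 7 | 0.0 | 0 | 2 | 8 | 10 | -1 |
| 4 | 200500000001 | 5 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 7 | 0.0 | 0 | 2 | 8 | 10 | -1 |
| 5 | 200500000001 | 3 | 2 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 7 | 0.0 | 0 | 2 | 8 | 10 | -1 |
| 6 | 200500000002 | 1 | 1 | 1 | 1 | 5 | 1 | 0 | 0 | 0 | ... | 0 | 5 | 0 | 7 | 0.0 | 0 | 2 | 7 | 16 | -1 |
| 7 | 200500000002 | 1 | 1 | 3 | 1 | 5 | 2 | 0 | 0 | 0 | ... | 0 | 5 | 0 | 2 | 0.0 | 0 | 2 | 1 | 1 | -1 |
| 8 | 200500000003 | 1 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 5 | 0 | 2 | 0.0 | 0 | 2 | 1 | 1 | -1 |
| 9 | 200500000003 | 1 | 1 | 3 | 1 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 5 | 0 | 2 | 0.0 | 0 | 2 | 1 | 1 | -1 |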


***
# Save the generated dataset to a new file

- The generated dataset is saved to *'data/merged_data_2005_2020.csv'*, in the ***data/*** folder:

In [121]:
concat_data.to_csv(path_or_buf= 'data/merged_data_2005_2020.csv',sep=';')
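When the saved file is read back, two details of the call above matter: the index is written as a first unnamed column, and string codes such as 'dep' and 'com' lose their leading zeros unless they are read as strings. A minimal sketch on a toy frame, using an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Toy frame with string codes, values made up for illustration
toy = pd.DataFrame({'Num_Acc': [200500000001], 'dep': ['1'], 'com': ['01005']})

# Same call shape as above, written to an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, sep=';')

# Naive reload: 'com' is parsed as an integer and the leading zero is lost
buffer.seek(0)
naive = pd.read_csv(buffer, sep=';', index_col=0)

# Typed reload: 'dep' and 'com' are kept as strings
buffer.seek(0)
typed = pd.read_csv(buffer, sep=';', index_col=0, dtype={'dep': str, 'com': str})

print(naive['com'].tolist(), typed['com'].tolist())
```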