# Stay safe
Week 1 (1st nov - 8th Nov) : Understand the data

- Missing values
- Change the data formats
- Understand the categorical data
- Find distributions
- Correlations between the different features
- Find patterns
- Check for errors/dubious data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
import folium
from pyproj import Proj, transform
from datetime import datetime
from datetime import date, time
from dateutil.parser import parse

In [2]:
# Read the dataset
acc_data = 'Data/OTC_ACCIDENTS.csv'
compt_trafic_data = 'Data/OTC_COMPTAGE_TRAFIC.csv'
acc_df = pd.read_csv(acc_data, sep=';', encoding='latin-1')
compt_trafic_df = pd.read_csv(compt_trafic_data, sep=';', encoding='latin-1')

In [3]:
# Show the df to have a better idea
acc_df.head(3)

Unnamed: 0,ID_ACCIDENT,DATE_,GROUPE_ACCIDENT,CAUSE,COMMUNE,CONDITIONS_LUMINEUSES,CONDITIONS_METEO,CONSEQUENCES,COOR_X,COOR_Y,...,NB_MOTOS_50,NB_MOTOS_125,NB_MOTOS_11KW,NB_VOITURES_TOURISME,NB_VOITURES_LIVRAISON,NB_CAMIONS,NB_BUS,NB_TRAM,E,N
0,876245.0,2010-11-30 00:00:00,Dérapage ou perte de maîtrise,Inattention et distraction - Manque d'attentio...,Genève,Nuit,Chute de neige,Avec blessés légers,2500774.0,1117364.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2500774.0,1117364.0
1,879408.0,2010-12-08 00:00:00,Autres,Utilisation inadéquate du véhicule - Stationne...,Genève,Jour,Beau,Autres,2498974.0,1118100.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2498974.0,1118100.0
2,877254.0,2010-12-02 00:00:00,Dérapage ou perte de maîtrise,Inobservation de signaux ou de la signalisatio...,Vandoeuvres,Jour,Couvert,Avec blessés légers,2504618.0,1119635.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2504618.0,1119635.0


In [4]:
# Shape of the dataset
acc_df.shape

(19231, 35)

In [5]:
# Information of the dataset
acc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19231 entries, 0 to 19230
Data columns (total 35 columns):
ID_ACCIDENT              19231 non-null float64
DATE_                    19231 non-null object
GROUPE_ACCIDENT          19231 non-null object
CAUSE                    19231 non-null object
COMMUNE                  19231 non-null object
CONDITIONS_LUMINEUSES    19231 non-null object
CONDITIONS_METEO         19231 non-null object
CONSEQUENCES             19231 non-null object
COOR_X                   19231 non-null float64
COOR_Y                   19231 non-null float64
ETAT_ROUTE               19231 non-null object
GENRE_ROUTE              19231 non-null object
HEURE                    19230 non-null object
JOUR                     19231 non-null object
LOCALITE                 19231 non-null object
NB_ENFANTS_IMPLIQUES     19231 non-null float64
NB_ENFANTS_ECOLE         19231 non-null float64
NB_BLESSES_LEGERS        19231 non-null float64
NB_BLESSES_GRAVES        19231 non-null

The Geneva accidents dataset has 19'231 datapoints with 35 different features. They can be grouped in the following subgroups:
* **Accident type**: GROUPE_ACCIDENT, CAUSE 

* **Localisation**: COMMUNE, COOR_X, COOR_Y, LOCALITE, E, N 

* **Time**: DATE_, HEURE, JOUR

* **Conditions**: CONDITIONS_LUMINEUSES, CONDITIONS_METEO, ETAT_ROUTE, GENRE_ROUTE

* **Number and type of people**: NB_ENFANTS_IMPLIQUES, NB_ENFANTS_ECOLE, NB_BLESSES_LEGERS, NB_BLESSES_GRAVES, NB_TUES, NB_PIETONS  

* **Number of vehicles involved**: NB_BICYCLETTES, NB_VAE_25, NB_VAE_45, NB_CYCLOMOTEURS, NB_MOTOS_50, NB_MOTOS_125, NB_MOTOS_11KW, NB_VOITURES_TOURISME, NB_VOITURES_LIVRAISON, NB_CAMIONS, NB_BUS, NB_TRAM


### TIME

In [6]:
print(acc_df.DATE_.head(2))
print(acc_df.HEURE.head(2))
print(acc_df.JOUR.head(2))

0    2010-11-30 00:00:00
1    2010-12-08 00:00:00
Name: DATE_, dtype: object
0    1899-12-30 21:00:00
1    1899-12-30 14:00:00
Name: HEURE, dtype: object
0       Mardi
1    Mercredi
Name: JOUR, dtype: object


In [7]:
# Datetime format
acc_df['DATE_'] = acc_df['DATE_'].apply(lambda d: pd.to_datetime(d))
acc_df['HEURE'] = acc_df['HEURE'].apply(lambda d: pd.to_datetime(d))

# Create new time features
acc_df['YEAR'] = [date.year for date in acc_df['DATE_']]
acc_df['MONTH'] = [date.month for date in acc_df['DATE_']]
acc_df['DAY'] = [date.day for date in acc_df['DATE_']]
acc_df['HEURE'] = acc_df['HEURE'].fillna(acc_df['HEURE'].iloc[0])  # Fillna with first value of the df (Error neglible)
#acc_df.info()

### COORDINATES

In [8]:
# projection definition
p1 = Proj(init='epsg:2056')
p2 = Proj(init='epsg:4326')

# Helper functions
def coord_proj(acc_df,i, p1, p2):
    x1 = acc_df['COOR_X'].loc[i]
    y1 = acc_df['COOR_Y'].loc[i]
    x2, y2 = transform(p1,p2,x1,y1)
    acc_df['COOR_X'].set_value(i, x2)
    acc_df['COOR_Y'].set_value(i, y2)
    return acc_df


In [9]:
# Project data
for i in range(0, len(acc_df['COOR_X'])-1):
    acc_df = coord_proj(acc_df,i, p1, p2)
acc_df

Unnamed: 0,ID_ACCIDENT,DATE_,GROUPE_ACCIDENT,CAUSE,COMMUNE,CONDITIONS_LUMINEUSES,CONDITIONS_METEO,CONSEQUENCES,COOR_X,COOR_Y,...,NB_VOITURES_TOURISME,NB_VOITURES_LIVRAISON,NB_CAMIONS,NB_BUS,NB_TRAM,E,N,YEAR,MONTH,DAY
0,876245.0,2010-11-30,Dérapage ou perte de maîtrise,Inattention et distraction - Manque d'attentio...,Genève,Nuit,Chute de neige,Avec blessés légers,6.153116e+00,4.620040e+01,...,1.0,0.0,0.0,0.0,0.0,2500774.0,1117364.0,2010,11,30
1,879408.0,2010-12-08,Autres,Utilisation inadéquate du véhicule - Stationne...,Genève,Jour,Beau,Autres,6.129641e+00,4.620675e+01,...,1.0,0.0,0.0,0.0,0.0,2498974.0,1118100.0,2010,12,8
2,877254.0,2010-12-02,Dérapage ou perte de maîtrise,Inobservation de signaux ou de la signalisatio...,Vandoeuvres,Jour,Couvert,Avec blessés légers,6.202445e+00,4.622138e+01,...,1.0,0.0,0.0,0.0,0.0,2504618.0,1119635.0,2010,12,2
3,857129.0,2010-10-16,Collision frontale,Vitesse - Inadaptée au tracé de la route (vira...,Cologny,Nuit,Couvert,Avec blessés graves,6.180094e+00,4.620817e+01,...,2.0,0.0,0.0,0.0,0.0,2502870.0,1118194.0,2010,10,16
4,843463.0,2010-09-17,Accident par tamponnement,Inattention et distraction - Manque d'attentio...,Conches,Nuit,Pluie,Autres,6.174721e+00,4.619958e+01,...,2.0,0.0,0.0,0.0,0.0,2502440.0,1117246.0,2010,9,17
5,856554.0,2010-10-15,Dérapage ou perte de maîtrise,Inattention et distraction - Manque d'attentio...,Cologny,Jour,Beau,Avec blessés graves,6.190697e+00,4.623289e+01,...,0.0,0.0,0.0,0.0,0.0,2503732.0,1120928.0,2010,10,15
6,879064.0,2010-12-07,Accident impliquant des piétons,Influence / comportement du piéton - Non-respe...,Genève,Nuit,Couvert,Avec blessés graves,6.131960e+00,4.619438e+01,...,0.0,0.0,1.0,0.0,0.0,2499130.0,1116722.0,2010,12,7
7,886468.0,2010-12-24,Dérapage ou perte de maîtrise,Etat / intention du conducteur - Influence de ...,Genève,Nuit,Beau,Autres,6.152773e+00,4.621164e+01,...,1.0,0.0,0.0,0.0,0.0,2500768.0,1118614.0,2010,12,24
8,110995.0,2010-02-21,Dérapage ou perte de maîtrise,Cause inconnue - Cause inconnue,Conches,Jour,Beau,Autres,6.182250e+00,4.619843e+01,...,1.0,0.0,0.0,0.0,0.0,2503019.0,1117108.0,2010,2,21
9,883769.0,2010-12-17,Dérapage ou perte de maîtrise,Etat / intention du conducteur - Influence de ...,Genève,Nuit,Chute de neige,Autres,6.128960e+00,4.621999e+01,...,1.0,1.0,0.0,0.0,0.0,2498946.0,1119572.0,2010,12,17


In [10]:
# Delete unuseful columns
del acc_df['N']
del acc_df['E']

In [11]:
acc_df.COMMUNE.value_counts()

Genève               8913
Vernier              1486
Petit-Lancy          1054
Meyrin                971
Carouge GE            858
Plan-les-Ouates       501
Le Grand-Saconnex     491
Thônex                435
Conches               412
Versoix               404
Bernex                367
Cologny               345
Bellevue              321
Onex                  315
Chêne-Bourg           273
Vésenaz               266
Vessy                 200
Confignon             167
Perly                 149
Satigny               145
Chambésy              113
Vandoeuvres           111
Genthod                83
Jussy                  80
Meinier                76
La Croix-de-Rozon      74
Corsier GE             66
Anières                61
Puplinge               52
Collex                 51
Troinex                50
Choulex                37
Avully                 35
La Plaine              32
Cartigny               30
Chancy                 29
Russin                 29
Aire-la-Ville          28
Soral       

In [12]:
acc_df.LOCALITE.value_counts()

A l'intérieur d'une localité    10789
Non renseigné                    6101
A l'extérieur d'une localité     2341
Name: LOCALITE, dtype: int64

### TYPE OF ACCIDENT

In [13]:
acc_df.GROUPE_ACCIDENT.value_counts()

Dérapage ou perte de maîtrise                                                    5294
Accident par tamponnement                                                        3615
Accident en parquant                                                             3033
Accident lors d'un dépassement ou lors d'un changement de voie de circulation    1934
Accident en traversant une route                                                 1354
Accident en quittant une route                                                   1251
Accident impliquant des piétons                                                  1247
Accident en s'engageant sur une route                                             886
Collision frontale                                                                453
Autres                                                                            124
Accident impliquant des animaux                                                    40
Name: GROUPE_ACCIDENT, dtype: int64

In [23]:
# Create a new df to treat the causes column
causes = acc_df[['CAUSE', 'GROUPE_ACCIDENT']]
causes['count_'] = causes.CAUSE.apply(lambda x: x.count(' - '))
del causes['GROUPE_ACCIDENT']

# Goal: To have only one "-" in order to split the causes into two columns, group and sugroup of causes
conflictive_txt = "(confusion frein - accélérateur, glisser de la pédale de frein dans les véhicules à transmission automatique, etc.)"
causes.CAUSE = causes.CAUSE.str.replace(conflictive_txt, "")
causes['count_'] = causes.CAUSE.apply(lambda x: x.count(' - '))
causes = causes.CAUSE.str.split(' - ', expand=True)
causes.rename(columns ={0: 'CAUSE_g', 1: 'CAUSE_sg'}, inplace=True)
acc_new = pd.concat([acc_df, causes], axis=1)
del acc_new['CAUSE']
acc_new

# acc_df['CAUSE_g']
# acc_df['CAUSE_sg']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Manque d'attention momentané                                                                                                      3485
Véhicule stationné (omission de la déclaration obligatoire)                                                                       2141
Influence de l'alcool                                                                                                             1799
Marche arrière imprudente                                                                                                         1182
Cause inconnue                                                                                                                    1182
Manque d'égards en changeant de voie                                                                                              1153
Suivre de trop près le véhicule qui précède                                                                                       1041
Priorité en obliquant à gauche devant des véhicules ven

Utilisation inadéquate du véhicule
    - Manoeuvres incorrectes dans le véhicule ():
        -(confusion frein - accélérateur, glisser de la pédale de frein dans les véhicules à transmission automatique, etc.)

In [15]:
# Create the map visualization
Geneve_coord = [46.2004013, 6.1531163]
m = folium.Map(location=lausanne_coord, zoom_start=13)

NameError: name 'lausanne_coord' is not defined

In [None]:
i=1
coord(acc_df,i, p1, p2)
for i in range(0, len(acc_df['COOR_X'].head(10))):
    folium.Marker(coord(acc_df,i, p1, p2) , popup="1").add_to(m)
m

Ideas:
* For each accident zone personalize the information panels