# Capstone Project (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Business Problem](#business-problem)
* [Data](#data)

## Business Problem

According to the World Health Organization (https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries), approximately 1.35 million people die each year in road traffic accidents, and road traffic injuries are the leading cause of death for children and young adults aged 5-29 years. Beyond being a social problem, vehicle accidents have an economic impact: they cost most countries 3% of their gross domestic product. Reducing the number of injuries and deaths from traffic accidents would therefore mitigate suffering and free up resources for more productive uses.

Specifically, this report is aimed at the security and health bodies of Barcelona. It will give them a tool to identify **some of the factors that influence the severity of injuries, and to predict that severity so that they can act appropriately.**

## Data

Based on the definition of our problem and on existing reports on this topic, the factors that will influence our decision are:

* Districts of Barcelona city.
* Day of the week: Monday to Sunday.
* Type of day: We separate between Pre-public holiday, Public holiday and Post-public holiday.
* Month: January to December.
* Day: 1 to 31.
* Moment of the day: Morning, Afternoon and Night.

This choice of factors is inspired by the following project, which came up with the idea of adding the type-of-day division manually, albeit in a simpler way. The remaining factors are already present in our dataset.

**"Predicción de la gravedad de los heridos en accidentes de tráfico en Barcelona", David Vila.**
That project was developed in RStudio and RMarkdown.

The following data sources will be needed to extract or generate the required information:

* Information about traffic accidents in Barcelona will be obtained from the open data service of the Barcelona City Council (https://opendata-ajuntament.barcelona.cat./data/es/dataset/accidents-gu-bcn).

* A calendar of public holidays in the city of Barcelona (https://www.elperiodico.com/es/economia/20180913/calendario-laboral-festivos-barcelona-2019-7032364).



### Data cleaning

We will start by cleaning our data; once the data is ready, we will explain how we intend to use it.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]:
d = pd.read_csv("/home/leonpb/Descargas/2019_accidents_gu_bcn.csv")

In [3]:
d.head(10)

Unnamed: 0,Numero_expedient,Codi_districte,Nom_districte,Codi_barri,Nom_barri,Codi_carrer,Nom_carrer,Num_postal_caption,Descripcio_dia_setmana,Dia_setmana,...,Descripcio_causa_vianant,Numero_morts,Numero_lesionats_lleus,Numero_lesionats_greus,Numero_victimes,Numero_vehicles_implicats,Coordenada_UTM_X,Coordenada_UTM_Y,Longitud,Latitud
0,2019S000001,5,Sarrià-Sant Gervasi,26,Sant Gervasi - Galvany,144601,Diagonal / Augusta ...,0482 0482,Dimarts,Dm,...,No és causa del vianant,0,0,1,1,1,42948203,458323602,2.155236,41.395744
1,2019S000002,9,Sant Andreu,62,el Congrés i els Indians,102907,Felip II / Congrés Eucarístic ...,9999 9999,Dimarts,Dm,...,No és causa del vianant,0,1,0,1,2,4315826,458650413,2.17999,41.425361
2,2019S000003,10,Sant Martí,66,el Parc i la Llacuna del Poblenou,242906,Pallars ...,0111 0113,Dimarts,Dm,...,Creuar per fora pas de vianants,0,1,0,1,1,43253256,458333382,2.191712,41.396887
3,2019S000004,6,Gràcia,32,el Camp d'en Grassot i Gràcia Nova,228803,Taxdirt / Nogués ...,0054 0054,Dimarts,Dm,...,No és causa del vianant,0,1,0,1,2,43034814,458467408,2.165429,41.408772
4,2019S000005,2,Eixample,7,la Dreta de l'Eixample,89004,Consell de Cent / Girona ...,0395 0397,Dimarts,Dm,...,No és causa del vianant,0,2,0,2,1,43075223,458312825,2.17044,41.394884
5,2019S000006,6,Gràcia,32,el Camp d'en Grassot i Gràcia Nova,90502,Còrsega ...,0461 0463,Dimarts,Dm,...,No és causa del vianant,0,1,0,1,2,43047169,458397884,2.166987,41.402521
6,2019S000007,2,Eixample,7,la Dreta de l'Eixample,18505,Aragó / Balmes ...,0225 0225,Dimarts,Dm,...,No és causa del vianant,0,1,0,1,6,43002371,458261251,2.161787,41.390176
7,2019S000008,2,Eixample,7,la Dreta de l'Eixample,222002,Mossèn Jacint Verdaguer ...,9999 9999,Dimarts,Dm,...,No és causa del vianant,0,1,0,1,2,43070656,458359499,2.169841,41.399084
8,2019S000009,10,Sant Martí,67,la Vila Olímpica del Poblenou,700661,Joan Miró / Doctor Trueta ...,0025 0025,Dimarts,Dm,...,No és causa del vianant,0,0,0,0,1,43267327,458271336,2.193464,41.391311
9,2019S000010,2,Eixample,5,el Fort Pienc,9209,Lepant / Alí Bei ...,0130 0130,Dimarts,Dm,...,No és causa del vianant,0,1,0,1,2,43203923,45833768,2.185806,41.397233


In [4]:
d.columns

Index(['Numero_expedient', 'Codi_districte', 'Nom_districte', 'Codi_barri',
       'Nom_barri', 'Codi_carrer', 'Nom_carrer', 'Num_postal_caption',
       'Descripcio_dia_setmana', 'Dia_setmana', 'Descripcio_tipus_dia',
       'NK_Any', 'Mes_any', 'Nom_mes', 'Dia_mes', 'Hora_dia',
       'Descripcio_torn', 'Descripcio_causa_vianant', 'Numero_morts',
       'Numero_lesionats_lleus', 'Numero_lesionats_greus', 'Numero_victimes',
       'Numero_vehicles_implicats', 'Coordenada_UTM_X', 'Coordenada_UTM_Y',
       'Longitud', 'Latitud'],
      dtype='object')

#### We drop the columns that we won't consider for our project.

In [5]:
d.drop(['Numero_expedient', 'Codi_barri', 'Nom_barri', 'Num_postal_caption','Dia_setmana', 'Coordenada_UTM_X','Coordenada_UTM_Y', 'NK_Any', 'Numero_victimes', 'Descripcio_causa_vianant', 'Nom_mes', 'Codi_carrer', 'Nom_carrer', 'Numero_vehicles_implicats', 'Numero_morts'], axis = 'columns', inplace=True)

#### We create a new column with the severity of the injuries, divided into two categories: Mild and Serious. We also drop all the rows without injuries.

In [6]:
d.loc[d['Numero_lesionats_lleus'] > 0, 'Numero_lesionats_lleus'] = 'Mild'
d.loc[d['Numero_lesionats_greus'] > 0, 'Numero_lesionats_greus'] = 'Serious'
d['Numero_lesionats_lleus'] = d['Numero_lesionats_lleus'].astype(str)
d['Numero_lesionats_greus'] = d['Numero_lesionats_greus'].astype(str)
d['Severity_of_the_injuries'] = d['Numero_lesionats_lleus'].str.cat(d['Numero_lesionats_greus'], sep=" ")
d.drop(['Numero_lesionats_lleus', 'Numero_lesionats_greus'], axis = 'columns', inplace=True)
d.loc[d['Severity_of_the_injuries'] == '0 Serious', 'Severity_of_the_injuries'] = 'Serious'
d.loc[d['Severity_of_the_injuries'] == 'Mild 0', 'Severity_of_the_injuries'] = 'Mild'
d.loc[d['Severity_of_the_injuries'] == '0 0', 'Severity_of_the_injuries'] = '0'
d.loc[d['Severity_of_the_injuries'] == 'Mild Serious', 'Severity_of_the_injuries'] = 'Serious'
# Accidents without injured people are labelled '0': mark them as missing and drop those rows.
d.replace('0', np.nan, inplace = True)
d = d.dropna()
cols = list(d.columns.values) 
cols.pop(cols.index('Longitud')) 
cols.pop(cols.index('Latitud')) 
cols.pop(cols.index('Severity_of_the_injuries'))
d= d[cols+['Severity_of_the_injuries','Latitud', 'Longitud']]
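
The same severity rule (any serious injury gives "Serious", otherwise any mild injury gives "Mild", otherwise the row is dropped) can also be written more compactly. The following is only a sketch on a made-up toy frame, not the code used in this notebook; the helper name `add_severity` and the placeholder label `"None"` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with the two injury-count columns used above (values are made up).
toy = pd.DataFrame({
    "Numero_lesionats_lleus": [1, 0, 2, 0],
    "Numero_lesionats_greus": [0, 1, 1, 0],
})

def add_severity(df):
    """Return a copy with a Severity_of_the_injuries column, without uninjured rows."""
    conditions = [
        df["Numero_lesionats_greus"] > 0,  # any serious injury takes precedence
        df["Numero_lesionats_lleus"] > 0,  # otherwise at least one mild injury
    ]
    out = df.copy()
    # Rows matching neither condition get the placeholder label "None".
    out["Severity_of_the_injuries"] = np.select(
        conditions, ["Serious", "Mild"], default="None"
    )
    # Keep only accidents with at least one injured person.
    return out[out["Severity_of_the_injuries"] != "None"]

result = add_severity(toy)
print(result["Severity_of_the_injuries"].tolist())
```

Because `np.select` evaluates the conditions in order, an accident with both mild and serious injuries is labelled "Serious", which matches the rule applied in the cell above.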

#### Now we have to add the pre-public holiday, public holiday and post-public holiday labels manually.

In [7]:
d['Mes_any'] = d['Mes_any'].astype(str)
d['Dia_mes'] = d['Dia_mes'].astype(str)
d['Date'] = d['Mes_any'].str.cat(d['Dia_mes'], sep="-")
d['Mes_any'] = d['Mes_any'].astype(int)
d['Dia_mes'] = d['Dia_mes'].astype(int) 

d.loc[d['Date'] == '1-1', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '1-2', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '4-18', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '4-19', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '4-20', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '4-21', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '4-22', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '4-23', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '4-30', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '5-1', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '5-2', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '6-9', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '6-10', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '6-11', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '6-23', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '6-24', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '6-25', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '8-14', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '8-15', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '8-16', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '9-10', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '9-11', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '9-12', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '9-23', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '9-24', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '9-25', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '10-11', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '10-12', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '10-13', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '10-31', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '11-1', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '11-2', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '12-5', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '12-6', 'Descripcio_tipus_dia'] = 'Public holiday'
d.loc[d['Date'] == '12-7', 'Descripcio_tipus_dia'] = 'Post_public holiday'

d.loc[d['Date'] == '12-24', 'Descripcio_tipus_dia'] = 'Pre_public holiday'
d.loc[d['Date'] == '12-25', 'Descripcio_tipus_dia'] = 'Public holiday'
# 26 December is itself a public holiday, so it is not labelled as a post-public holiday.
d.loc[d['Date'] == '12-26', 'Descripcio_tipus_dia'] = 'Public holiday'
d.drop(['Date'], axis = 'columns', inplace=True)
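
The labelling above can also be derived from a list of holiday dates instead of one assignment per day. The sketch below is an illustration under assumptions: `holidays` contains only a small subset of the 2019 dates used above, the helper `holiday_label` is hypothetical, and a day that is both the day after one holiday and the day before another is labelled as pre-public holiday.

```python
import pandas as pd

# Illustrative subset of the 2019 holidays labelled above; extend as needed.
holidays = set(pd.to_datetime(["2019-01-01", "2019-05-01", "2019-12-25", "2019-12-26"]))
one_day = pd.Timedelta(days=1)

def holiday_label(date):
    """Return a holiday-related label for `date`, or None for an ordinary day."""
    if date in holidays:
        return "Public holiday"
    if date + one_day in holidays:
        return "Pre_public holiday"
    if date - one_day in holidays:
        return "Post_public holiday"
    return None

# Toy rows with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "Mes_any": [1, 1, 4, 12, 12],
    "Dia_mes": [1, 2, 30, 24, 26],
    "Descripcio_tipus_dia": ["Laborable"] * 5,
})

# Build real dates from the month and day columns, assuming the year 2019.
dates = pd.to_datetime(
    pd.DataFrame({"year": 2019, "month": toy["Mes_any"], "day": toy["Dia_mes"]})
)
labels = dates.map(holiday_label)

# Keep the existing value wherever no holiday-related label applies.
toy["Descripcio_tipus_dia"] = labels.fillna(toy["Descripcio_tipus_dia"])
print(toy["Descripcio_tipus_dia"].tolist())
```

Working with real dates rather than "month-day" strings also avoids converting `Mes_any` and `Dia_mes` to strings and back to integers.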

#### Since all the information is in Catalan, we have to translate it into English.

In [8]:
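# The last two columns of d are Latitud and Longitud, in that order, so the English names follow the same order.
d.columns = ['District_code', 'District_name','Day_of_the_week', 'Type_of_day', 'Month', 'Day','Hour', 'Moment_of_day','Severity_of_the_accident', 'Latitude','Longitude']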

In [9]:
d["Moment_of_day"].unique()

array(['Nit', 'Matí', 'Tarda'], dtype=object)

In [10]:
d["Day_of_the_week"].unique()

array(['Dimarts', 'Dimecres', 'Dijous', 'Divendres', 'Dissabte',
       'Diumenge', 'Dilluns'], dtype=object)

In [11]:
d.loc[d['Type_of_day'] == 'Laborable', 'Type_of_day'] = 'Working day'
d.loc[d['Moment_of_day'] == 'Nit', 'Moment_of_day'] = 'Night'
d.loc[d['Moment_of_day'] == 'Matí', 'Moment_of_day'] = 'Morning'
d.loc[d['Moment_of_day'] == 'Tarda', 'Moment_of_day'] = 'Afternoon'
d.loc[d['Day_of_the_week'] == 'Dilluns', 'Day_of_the_week'] = 'Monday'
d.loc[d['Day_of_the_week'] == 'Dimarts', 'Day_of_the_week'] = 'Tuesday'
d.loc[d['Day_of_the_week'] == 'Dimecres', 'Day_of_the_week'] = 'Wednesday'
d.loc[d['Day_of_the_week'] == 'Dijous', 'Day_of_the_week'] = 'Thursday'
d.loc[d['Day_of_the_week'] == 'Divendres', 'Day_of_the_week'] = 'Friday'
d.loc[d['Day_of_the_week'] == 'Dissabte', 'Day_of_the_week'] = 'Saturday'
d.loc[d['Day_of_the_week'] == 'Diumenge', 'Day_of_the_week'] = 'Sunday'
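
The value-by-value assignments above can be expressed as a single lookup per column. This is a minimal sketch on a toy frame, assuming the same Catalan/English pairs as in the cell above; the dictionary name `TRANSLATIONS` is introduced here only for illustration.

```python
import pandas as pd

# Catalan-to-English lookups, one per column (same pairs as the assignments above).
TRANSLATIONS = {
    "Type_of_day": {"Laborable": "Working day"},
    "Moment_of_day": {"Nit": "Night", "Matí": "Morning", "Tarda": "Afternoon"},
    "Day_of_the_week": {
        "Dilluns": "Monday", "Dimarts": "Tuesday", "Dimecres": "Wednesday",
        "Dijous": "Thursday", "Divendres": "Friday", "Dissabte": "Saturday",
        "Diumenge": "Sunday",
    },
}

# Toy rows with the renamed columns (values are made up).
toy = pd.DataFrame({
    "Type_of_day": ["Laborable", "Public holiday"],
    "Moment_of_day": ["Nit", "Matí"],
    "Day_of_the_week": ["Dimarts", "Diumenge"],
})

# A nested dict tells DataFrame.replace which values to swap in which column;
# values without an entry (e.g. "Public holiday") are left unchanged.
translated = toy.replace(TRANSLATIONS)
print(translated)
```

Keeping the pairs in one dictionary makes it easier to check that every category returned by `unique()` has a translation.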

In [12]:
d.head(10)

Unnamed: 0,District_code,District_name,Day_of_the_week,Type_of_day,Month,Day,Hour,Moment_of_day,Severity_of_the_accident,Latitude,Longitude
0,5,Sarrià-Sant Gervasi,Tuesday,Public holiday,1,1,1,Night,Serious,41.395744,2.155236
1,9,Sant Andreu,Tuesday,Public holiday,1,1,4,Night,Mild,41.425361,2.17999
2,10,Sant Martí,Tuesday,Public holiday,1,1,5,Night,Mild,41.396887,2.191712
3,6,Gràcia,Tuesday,Public holiday,1,1,8,Morning,Mild,41.408772,2.165429
4,2,Eixample,Tuesday,Public holiday,1,1,12,Morning,Mild,41.394884,2.17044
5,6,Gràcia,Tuesday,Public holiday,1,1,7,Morning,Mild,41.402521,2.166987
6,2,Eixample,Tuesday,Public holiday,1,1,13,Morning,Mild,41.390176,2.161787
7,2,Eixample,Tuesday,Public holiday,1,1,13,Morning,Mild,41.399084,2.169841
9,2,Eixample,Tuesday,Public holiday,1,1,18,Afternoon,Mild,41.397233,2.185806
10,7,Horta-Guinardó,Tuesday,Public holiday,1,1,19,Afternoon,Mild,41.426641,2.171439


**Our data is now ready to be analysed, in order to determine whether all the variables we chose are useful for our prediction model.**

**The Longitude, Latitude and District_code variables will be used only to visualize our data. With the rest of the variables, we will try to determine whether there is any relationship between the severity of the injuries and those categories. To do so, we will first visualize that relationship to get an idea of it. After that, we will change the format of our data in order to try different predictive models.**
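
As a rough illustration of the two steps described above, the sketch below first tabulates the share of each severity level per category and then converts the categorical variables into a numeric format that predictive models can consume. It uses a made-up toy frame; the choice of `pd.crosstab` and one-hot encoding with `pd.get_dummies` is an assumption about how this could be done, not necessarily the method the rest of the project follows.

```python
import pandas as pd

# Toy rows with the cleaned column names (values are made up).
toy = pd.DataFrame({
    "Moment_of_day": ["Night", "Morning", "Morning", "Afternoon", "Night", "Afternoon"],
    "Day_of_the_week": ["Monday", "Monday", "Friday", "Sunday", "Sunday", "Friday"],
    "Severity_of_the_accident": ["Serious", "Mild", "Mild", "Mild", "Mild", "Serious"],
})

# Share of each severity level within every moment of the day (each row sums to 1).
shares = pd.crosstab(
    toy["Moment_of_day"], toy["Severity_of_the_accident"], normalize="index"
)
print(shares)

# One-hot encode the categorical predictors and build a binary target.
X = pd.get_dummies(toy[["Moment_of_day", "Day_of_the_week"]])
y = (toy["Severity_of_the_accident"] == "Serious").astype(int)
print(X.shape, y.tolist())
```

A table like `shares` can be passed directly to a bar plot to compare categories, while `X` and `y` are in the shape expected by most classification libraries.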

