# City of Calgary Traffic Incidents Exploratory Data Analysis

## 1. Introduction

### 1.1 Project Overview

The City of Calgary is consistently rated as one of the best places to live in the world according to the [EIU](https://moving2canada.com/news-and-features/features/planning/destination-guides/calgary/2022-eiu-liveability-index-three-canadian-cities-top-ten/). With its proximity to the Rocky Mountains and its large business sector, Calgary offers all demographics an excellent place to call home. With so many places to visit within and around the city, Calgary has an avid commuter culture. I happen to be one such commuter.

During my commute, I stick to roads in the southwest quadrant of the city and have never been in a traffic incident. However, many traffic incidents are reported on the radio at high-volume times. Combining this with my recent exploration of the northwest and northeast quadrants of the city, I began to ask myself whether I am at a higher risk of being in an accident. Naively, I would assume that smaller roads and more cars mean more incidents. In particular, downtown streets or busy highways with small merges seem like the most likely places to have an incident.

The most dangerous roads in Calgary can be determined using the traffic incident dataset provided on the City of Calgary's [open data](https://data.calgary.ca/Transportation-Transit/Traffic-Incidents/35ra-9556) website. Exploratory data analysis of this dataset, and later techniques such as predictive modeling, can give insight into the traffic incident patterns of this burgeoning metropolitan city.

### 1.2 Literature Review and background

Traffic incidents are a heavily researched area. The [World Health Organization](https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries) has an overview of traffic incidents within the global picture. Some key points are below:
- Approximately 1.3 million individuals die each year as a result of a traffic incident.
- 93% of the world's fatalities on the roads occur in low- and middle-income countries, even though these countries have approximately 60% of the world's vehicles.

The global statistics help guide the data analysis. Based on them, the Calgary data would be expected to show a higher number of traffic incidents in lower-income areas, such as Forest Lawn.

Research has also been done at a more focused level within the City of Calgary, and the dataset this project focuses on has been analyzed before. A comparison of 2020 traffic incidents with 2019 was completed [here](https://pub-calgary.escribemeetings.com/filestream.ashx?DocumentId=189649). It shows a sharp decline in overall incidents, which is to be expected during the height of the lockdown in Alberta, Canada. While the linked paper focuses on 2019 compared to 2020, this project is meant to paint a larger overall picture using the complete dataset. With the increase in population and the return to regular road use, analysing data from "normal" times is crucial to understanding incident-heavy areas today.

Another, more informal analysis was conducted by Siavash Fard, M.Sc., P.Eng., PMP, and is posted on [LinkedIn](https://www.linkedin.com/pulse/prediction-traffic-incidents-calgary-siavash-fard-/). It attempts to predict future incidents using the traffic incident dataset used in this study, in addition to data regarding traffic control devices. The analysis focuses on prediction techniques, as opposed to visualizing and finding insights in the dataset.

The previous studies point to a solid area for this project to focus on. Building on them, the areas of focus are defined below.

### 1.3 Aims and Objectives

To examine the nature of the traffic incidents in Calgary, the following questions will need to be answered:
1. Which areas of the city have the most incidents?
    - Hypothesis: Downtown roads and highways are expected to be the most dangerous roads.
2. What time of day has the highest number of incidents?
    - Hypothesis: Rush hours (8-9 a.m. and 5-6 p.m.) are the most dangerous hours of the day.
3. Does the day, week, month, or year cause variance in the frequency of incidents?
    - Hypothesis: Winter months and weekdays are the most dangerous times to be on the road.
4. What kind of incidents are happening?
    - Does the type of incident affect the time between the start and end of the incident?
    - Does the type of incident change based on location in the city?
    
Answering these questions will allow us to pinpoint unsafe areas during a commute within the city. The data source noted that the dataset is updated every 10 minutes with new traffic incidents. Therefore, as the dataset grows, we will be able to bring in more data to more thoroughly examine the nature of traffic incidents in Calgary.
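As a rough illustration of how questions 2 and 3 could be approached, the sketch below counts incidents by hour, weekday and month. It assumes the timestamps live in a `start_dt` column of ISO-formatted strings (as the dataset turns out to provide); the `toy` frame and its values are made up for illustration and are not real incidents.

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "start_dt": [
        "2023-05-29T08:15:00.000",  # Monday morning
        "2023-05-29T08:40:00.000",  # Monday morning
        "2023-05-29T17:22:13.000",  # Monday evening
        "2023-06-03T13:05:00.000",  # Saturday afternoon
    ]
})

start = pd.to_datetime(toy["start_dt"])

# Question 2: incidents per hour of day.
per_hour = start.dt.hour.value_counts().sort_index()

# Question 3: incidents per weekday and per month.
per_weekday = start.dt.day_name().value_counts()
per_month = start.dt.month.value_counts().sort_index()

print(per_hour)
print(per_weekday)
print(per_month)
```

Raw counts like these do not account for how many vehicles are on the road at each hour, so they indicate when incidents are reported most often rather than the risk per trip.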

### 1.4 Introduction Summary

Understanding the dangers of commuting within Calgary will shed some light on areas to avoid during certain hours. Furthermore, it can be used to help city planners develop solid plans to mitigate problem areas. Let us move on to importing the data we will use for the project.

In [1]:
# Import Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Library to get most up to date csv from open Calgary
from sodapy import Socrata

## 2. Data Acquisition and Justification 

### 2.1 Data Source and Review

As mentioned previously, we will be focusing on traffic incident data provided by the City of Calgary. This data is an archive of reported traffic incidents within the city, ranging from stalled vehicles to multi-vehicle collisions. According to the data source, the dataset is updated every ten minutes and has been collected since December 4th, 2017. It is relevant to the research questions and large enough to provide meaningful insights into traffic incidents in Calgary.

Traffic incidents are collected via an advanced traveler information system, or ATIS, which collects information from a wide range of inputs. Inputs include commuter-reported incidents via the WAZE application and traffic cameras. More information about Calgary's ATIS can be found [here](https://www.calgary.ca/roads/conditions/advanced-traveller-information-system.html).

This dataset provides the most accurate picture of Calgary traffic incidents available for public use. It does, however, have its flaws. The website indicates, "please note there may be gaps in the data due to system or script malfunction." This is a good indicator of problem areas to look for during data cleansing. The data is also limited by the methods of reporting: individuals may decide not to report an incident, and there may be false positives or errors in the collection methodology. Careful data wrangling will help ensure the data is reasonably scrutinized and as accurate as possible.

### 2.2 Data Comparison

This dataset is well suited to answer our research questions. This can be confirmed by comparing it to other sources of data about traffic in Calgary.

The [Calgary Traffic Counts System](https://trafficcounts.calgary.ca/) is another open data source provided by the City of Calgary. It provides data related to traffic around major intersections. This data has been collected for over 40 years, which would paint a much better historical picture of the traffic situation in Calgary. However, the data does not explicitly show incidents, only the overall use of the roads within Calgary. It would be better suited to understanding the growth and use of the roads in general than to identifying the roads and times which are the most dangerous.

Because sources of accurate data are quite slim, we are limited to data provided by either the Alberta government or the City of Calgary. An example of data from the Alberta government can be found [here](https://open.alberta.ca/opendata/traffic-collision-casualties-alberta). This dataset is a high-level overview of the number of incidents, deaths and injuries by year related to traffic incidents, covering 2001-2014. It paints an overall picture; however, individual incidents have no associated information, there is no precise date, and the dataset is too simple to support detailed analysis. It could be useful as supplementary data for this project, for example to compare Calgary's total incident rate with the province's. Overall, it is not ideal for our research area.

### 2.3 Scope of work

To answer the research questions, patterns, trends and insights will need to be found within our dataset. To ensure these goals are met, the project will follow the scope of work below.

- Import the dataset and conduct initial data exploration and cleaning. 
    - i.e. check for missing values, boundary cases or possible inaccuracies and validate data types.
- Modify the data by adjusting any problem areas found in the prior step.
- Conduct exploratory data analysis to bring the patterns, insights and trends to the surface which will answer our research questions.
- Evaluate and summarize the findings brought to light by the EDA, ensuring these are linked back to the research questions.
- Reflect on the process and outcomes of the EDA, from start to finish, observing any potential missteps or areas that could be improved.

This scope of work will ensure our research questions are answered. It will also allow the data to be further processed and used in future predictive modeling or machine learning algorithms to help gain further insights into traffic incidents in the City of Calgary.
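The first scope item (missing values, data types, boundary cases) could look roughly like the sketch below. It assumes the API delivers every field as a string, which is common for JSON payloads but not confirmed here; the `toy` rows are invented, and the latitude/longitude bounding box is a rough assumption for the Calgary area rather than an official city boundary.

```python
import pandas as pd

# Invented rows; the last latitude is deliberately out of range.
toy = pd.DataFrame({
    "start_dt": ["2023-05-29T20:38:21.000", "2019-11-23T12:31:00.000", "2018-02-03T15:55:00.000"],
    "modified_dt": ["2023-05-29T20:41:07.000", None, "2018-02-03T17:01:00.000"],
    "latitude": ["51.0506", "51.0454", "61.1600"],
    "longitude": ["-114.0503", "-114.0137", "-114.1629"],
})

# Missing values per column.
missing = toy.isna().sum()

# Validate data types by converting strings to datetimes and floats.
clean = toy.assign(
    start_dt=pd.to_datetime(toy["start_dt"]),
    modified_dt=pd.to_datetime(toy["modified_dt"]),
    latitude=pd.to_numeric(toy["latitude"]),
    longitude=pd.to_numeric(toy["longitude"]),
)

# Boundary cases: flag coordinates outside a rough (assumed) Calgary box.
LAT_RANGE = (50.8, 51.3)
LON_RANGE = (-114.4, -113.8)
in_bounds = (
    clean["latitude"].between(*LAT_RANGE)
    & clean["longitude"].between(*LON_RANGE)
)
outliers = clean[~in_bounds]

print(missing)
print(clean.dtypes)
print(outliers)
```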

### 2.4 Loading the Data

#### Marker Note
The code below was taken directly from [this](https://stackoverflow.com/questions/46572365/import-data-to-dataframe-using-soda-api) Stack Overflow post.

If the below code does not execute, a back-up file is provided. The data was saved on May 28th, 2023.

In [2]:
# Load updated data from https://data.calgary.ca/Transportation-Transit/Traffic-Incidents/35ra-9556

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.calgary.ca", None)

# All results (get_all pages through the full dataset), returned as JSON
# from the API / converted to Python dictionaries by sodapy.
results = client.get_all("35ra-9556")

# Convert to pandas DataFrame
df = pd.DataFrame.from_records(results)



In [3]:
# BACK UP CODE IF API FAILS.

#df = pd.read_csv("Traffic_Incidents_05_28_2023.csv")
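One way to fold the API call and the back-up file into a single step is sketched below. `load_incidents` is a hypothetical helper, not part of sodapy or the original notebook; it takes the fetch step as a callable so that any failure while downloading falls through to the saved CSV.

```python
import pandas as pd

def load_incidents(fetch_records, backup_csv):
    """Return live records as a DataFrame, or the saved CSV if the fetch fails.

    `fetch_records` is any zero-argument callable returning an iterable of
    dictionaries (hypothetical helper, written for this sketch only).
    """
    try:
        # list() forces the download here, so errors are caught below.
        return pd.DataFrame.from_records(list(fetch_records()))
    except Exception as exc:  # e.g. network errors, timeouts, throttling
        print(f"Live load failed ({exc!r}); falling back to {backup_csv}")
        return pd.read_csv(backup_csv)

# Intended use in this notebook (not executed in this sketch):
# df = load_incidents(lambda: client.get_all("35ra-9556"),
#                     "Traffic_Incidents_05_28_2023.csv")
```

Note that the two paths may not yield identical dtypes: `read_csv` infers numeric types, whereas values from the API may arrive as strings, so type validation is still needed afterwards.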

### 2.5 Data Acquisition and Justification Summary

The dataset from open Calgary, combined with our review of similar research papers and analyses, will allow us to ask important questions and provide accurate insights that have yet to be explored.

## 3. Data Exploration and Cleaning

### 3.1 Explore the data

In [4]:
df.shape

(39796, 13)

The shape attribute shows that approximately 40,000 traffic incidents have been recorded since collection began. The dataset also has 13 fields per incident to analyze and compare. An idea of the columns and their respective contents can be gathered by using the columns and dtypes attributes and the head(), tail(), sample(), info() and describe() methods on the dataframe.

In [5]:
# get first 5 rows of dataset
df.head()

Unnamed: 0,incident_info,description,start_dt,modified_dt,quadrant,longitude,latitude,count,id,point,:@computed_region_kxmf_bzkv,:@computed_region_4a3i_ccfj,:@computed_region_4b54_tmc4
0,Memorial Drive at Edmonton Trail NE,Ongoing incident. EB Memorial Dr is closed at ...,2023-05-29T20:38:21.000,2023-05-29T20:41:07.000,NW,-114.05033553745902,51.05060185177274,1,2023-05-29T20:38:2151.05060185177274-114.05033...,"{'type': 'Point', 'coordinates': [-114.0503355...",137,4,7
1,Seton Ri and Seton Ci SE,Traffic incident.,2023-05-29T18:12:14.000,2023-05-29T18:13:09.000,SE,-113.96008727085788,50.867204871573264,1,2023-05-29T18:12:1450.867204871573264-113.9600...,"{'type': 'Point', 'coordinates': [-113.9600872...",229,3,4
2,Millrise Drive and Millbank Co SW,Traffic incident.,2023-05-29T17:51:27.000,2023-05-29T18:13:09.000,SW,-114.07836521612283,50.91669145289532,1,2023-05-29T17:51:2750.91669145289532-114.07836...,"{'type': 'Point', 'coordinates': [-114.0783652...",156,1,5
3,Eastbound 64 Avenue and 11 Street NE,Traffic incident. Blocking the left turn lane,2023-05-29T17:47:13.000,2023-05-29T17:52:17.000,NE,-114.03694968994456,51.11046111522696,1,2023-05-29T17:47:1351.110461115226954-114.0369...,"{'type': 'Point', 'coordinates': [-114.0369496...",111,4,11
4,22 Street and 24 Avenue NW,Traffic incident.,2023-05-29T17:22:13.000,2023-05-29T17:52:17.000,NW,-114.11294630640889,51.074277108159365,1,2023-05-29T17:22:1351.074277108159365-114.1129...,"{'type': 'Point', 'coordinates': [-114.1129463...",153,2,7


From df.head() a few observations can be made.
- ***incident_info*** contains the street(s) the incident took place on.
- ***description*** is a brief sentence regarding the type of incident.
    - "Traffic incident." is not descriptive and accounts for 4 of the 5 rows shown. How many incidents are labelled with this information?
- ***start_dt and modified_dt*** contain the start and end of the reporting of said incident.
    - Potential to use these columns to get total time of clean-up and get an inference on the severity of the accident based on new total_time column.
- ***quadrant*** is what part of the city the incident took place in.
- ***longitude and latitude*** store the longitude and latitude respectively.
- ***count*** has an unknown meaning. The open Calgary website does not have any information on this column.
    - All rows appear to have the value 1? Will need to confirm and, if so, remove the column from the dataframe.
- ***id*** is the identifier of the incident. It contains a concatenation of the start_dt, latitude and longitude columns.
    - This column appears to be what is currently sorting the dataframe.
    - This column is another candidate to potentially remove from the dataframe.
- ***point*** stores a JSON object which contains information for a point containing the longitude and latitude.
    - Appears to be an object used within sodapy. Could be useful for a map visualization.
    - If the point column makes it easy to visualise data on a map, the longitude and latitude columns could be removed, or vice versa.
- The last three columns appear to be a byproduct of the SODA API. The open Calgary website does not list these and [this](https://hub.safe.com/publishers/cdesisto/templates/socrata_computed_columns) website confirms this.
    - When cleaning data these three columns are to be dropped as they provide no usable data.
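Several of the open questions in this list can be checked with a few lines of pandas. The sketch below runs on an invented `toy` frame that mimics the column names above; `generic_share`, `count_is_constant` and `total_time` are names chosen for this illustration, and the real results will of course depend on the full dataset.

```python
import pandas as pd

# Invented rows mimicking the real columns (illustrative values only).
toy = pd.DataFrame({
    "description": ["Traffic incident.", "Traffic incident.",
                    "2 vehicle incident.", "Stalled vehicle."],
    "start_dt": ["2023-05-29T18:12:14.000", "2023-05-29T17:51:27.000",
                 "2016-12-06T17:05:00.000", "2018-02-06T17:51:00.000"],
    "modified_dt": ["2023-05-29T18:13:09.000", "2023-05-29T18:13:09.000",
                    "2016-12-06T17:10:00.000", None],
    "count": ["1", "1", "1", "1"],
})

# What share of incidents carries only the generic label?
generic_share = (toy["description"].str.strip() == "Traffic incident.").mean()

# Does `count` hold a single value? If so, it carries no information.
count_is_constant = toy["count"].nunique(dropna=False) == 1

# Time between first report and last modification; a missing
# modified_dt becomes NaT and propagates into total_time.
toy["total_time"] = (
    pd.to_datetime(toy["modified_dt"]) - pd.to_datetime(toy["start_dt"])
)

print(generic_share, count_is_constant)
print(toy["total_time"])
```

Whether `total_time` is a fair proxy for severity is an assumption to test, since `modified_dt` records the last update to the report rather than a confirmed clearance time.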

In [6]:
# Get list of current columns before drop function.
df.columns

Index(['incident_info', 'description', 'start_dt', 'modified_dt', 'quadrant',
       'longitude', 'latitude', 'count', 'id', 'point',
       ':@computed_region_kxmf_bzkv', ':@computed_region_4a3i_ccfj',
       ':@computed_region_4b54_tmc4'],
      dtype='object')

In [7]:
# Remove the last three columns as they are not part of original dataset and don't add any value.
df.drop(columns=[':@computed_region_kxmf_bzkv', ':@computed_region_4a3i_ccfj', ':@computed_region_4b54_tmc4'], inplace=True)

# Ensure last three columns were removed.
df.columns

Index(['incident_info', 'description', 'start_dt', 'modified_dt', 'quadrant',
       'longitude', 'latitude', 'count', 'id', 'point'],
      dtype='object')

In [8]:
# Display the last 5 elements to get an idea of the end of the dataframe.
df.tail()

Unnamed: 0,incident_info,description,start_dt,modified_dt,quadrant,longitude,latitude,count,id,point
39791,Southbound University Drive at Crowchild Trail NW,2 vehicle incident.,2016-12-06T17:05:00.000,2016-12-06T17:10:00.000,NW,-114.1195835,51.06639113,1,2016-12-06T17:05:0051.06639113-114.1195835,"{'type': 'Point', 'coordinates': [-114.1195835..."
39792,Ogden Road at Bonnybrook Road SE,2 vehicle incident.,2016-12-06T16:26:00.000,2016-12-06T16:38:00.000,SE,-114.0308717,51.02839263,1,2016-12-06T16:26:0051.02839263-114.0308717,"{'type': 'Point', 'coordinates': [-114.0308717..."
39793,Macleod Trail at 9 Avenue SE,2 vehicle incident.,2016-12-06T16:25:00.000,2016-12-06T16:26:00.000,SE,-114.0581785,51.04447099,1,2016-12-06T16:25:0051.04447099-114.0581785,"{'type': 'Point', 'coordinates': [-114.0581785..."
39794,Eastbound Memorial Drive approaching Deerfoot ...,2 vehicle incident blocking the middle lane.,2016-12-06T14:36:00.000,2016-12-06T14:42:00.000,NE,-114.0205479,51.0476343,1,2016-12-06T14:36:0051.0476343-114.0205479,"{'type': 'Point', 'coordinates': [-114.0205479..."
39795,Eastbound McKnight Boulevard at 2 Street NW,Multi vehicle incident.,2016-12-06T10:00:00.000,2016-12-06T10:01:00.000,NW,-114.0649874,51.09611149,1,2016-12-06T10:00:0051.09611149-114.0649874,"{'type': 'Point', 'coordinates': [-114.0649874..."


We can see that the data goes back to 2016 instead of our initial understanding of 2017. This appears to have been an error on open Calgary.

In [10]:
# Get a random selection of 10 data points
df.sample(10)

Unnamed: 0,incident_info,description,start_dt,modified_dt,quadrant,longitude,latitude,count,id,point
5435,32 Avenue and Collegiate Boulevard NW,Traffic incident.,2022-09-14T14:25:47.000,2022-09-14T14:26:44.000,NW,-114.13761307310502,51.08154095114506,1,2022-09-14T14:25:4751.08154095114506-114.13761...,"{'type': 'Point', 'coordinates': [-114.1376130..."
23833,Southbound Deerfoot Trail after Memorial Drive SE,Single vehicle incident.,2019-11-23T12:31:00.000,,,-114.0137037,51.0454027,1,2019-11-23T12:31:2551.04540269898794-114.01370...,"{'type': 'Point', 'coordinates': [-114.0137037..."
39647,17 Avenue at 52 Street SE,2 vehicle incident.,2016-12-11T17:06:00.000,2016-12-11T17:20:00.000,SE,-113.9585022,51.03785545,1,2016-12-11T17:06:1551.0378554535525-113.958502...,"{'type': 'Point', 'coordinates': [-113.9585022..."
23163,Ogden Road and 80 Avenue SE,Two vehicle incident.,2019-12-20T17:41:00.000,,,-113.9987731,50.98139553,1,2019-12-20T17:41:4650.98139553230935-113.99877...,"{'type': 'Point', 'coordinates': [-113.9987731..."
33410,Northbound Deerfoot Trail and Peigan Trail SE,Stalled vehicle. Off Peigan Trail exit ramp,2018-02-06T17:51:00.000,2018-02-06T18:19:00.000,SE,-114.0052562,51.01500138,1,2018-02-06T17:51:2451.01500137989-114.00525624...,"{'type': 'Point', 'coordinates': [-114.0052562..."
13884,Northbound Deerfoot Trail and Southland Drive SE,Two vehicle incident. Blocking the right shou...,2021-07-25T10:00:00.000,,,-114.036222,50.97058953,1,2021-07-25T10:00:5950.97058952980319-114.03622...,"{'type': 'Point', 'coordinates': [-114.036222,..."
23012,Northbound Deerfoot Trail at 16 Avenue NE,Single vehicle incident. Blocking the right lane,2020-01-01T04:48:00.000,,,-114.0267599,51.06719295,1,2020-01-01T04:48:3551.067192947356034-114.0267...,"{'type': 'Point', 'coordinates': [-114.0267599..."
9187,Spruce Meadows Trail and 6 Street SW,Traffic incident.,2022-03-01T10:01:00.000,2022-03-01T10:04:00.000,SW,-114.0864422,50.89354166,1,2022-03-01T10:01:3650.893541662874-114.0864422...,"{'type': 'Point', 'coordinates': [-114.0864422..."
508,Canyon Meadows Drive and Bonaventure Drive SE,Traffic signals are blank. Crew has been dispa...,2023-05-01T18:51:30.000,2023-05-01T19:35:07.000,SE,-114.0626154565802,50.93031538815888,1,2023-05-01T18:51:3050.930315388158874-114.0626...,"{'type': 'Point', 'coordinates': [-114.0626154..."
33526,Sarcee Trail at 112 Avenue NW,Two vehicle incident.,2018-02-03T15:55:00.000,2018-02-03T17:01:00.000,NW,-114.1629421,51.16000129,1,2018-02-03T15:55:5751.1600012884133-114.162942...,"{'type': 'Point', 'coordinates': [-114.1629421..."


More observations can be made with the sample method.
- The description field is confirmed to contain simple descriptions of the incident.
    - Simple NLP and word mapping is a good starting point for processing this column.
- NaN appears in both modified_dt and quadrant.
    - When comparing the total time of an incident if NaN is present in modified_dt it would be best to exclude the value of that datapoint.
    - The quadrant NaN can be inferred from the incident_info column.
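The two NaN-handling ideas above might be sketched as follows. The approach assumes the quadrant appears as a trailing NE/NW/SE/SW token in incident_info, which holds for the rows displayed so far but is not guaranteed for every record; the `toy` rows are partly invented (the last street name is hypothetical and has no suffix on purpose).

```python
import pandas as pd

toy = pd.DataFrame({
    "incident_info": [
        "Southbound Deerfoot Trail after Memorial Drive SE",
        "Ogden Road and 80 Avenue SE",
        "Sarcee Trail at 112 Avenue NW",
        "Glenmore Trail at Crowchild Trail",  # hypothetical, no suffix
    ],
    "modified_dt": [None, None, "2018-02-03T17:01:00.000", None],
    "quadrant": [None, None, "NW", None],
})

# Pull a trailing NE/NW/SE/SW token out of the street description.
inferred = toy["incident_info"].str.extract(
    r"\b(NE|NW|SE|SW)\s*$", expand=False
)

# Keep reported quadrants and fill only the gaps.
toy["quadrant"] = toy["quadrant"].fillna(inferred)

# For duration analysis, keep only rows with a modified_dt.
timed = toy.dropna(subset=["modified_dt"])

print(toy[["incident_info", "quadrant"]])
print(len(timed))
```

Rows with no recognizable suffix stay missing, so they would still need a separate fallback, for example one based on the coordinates.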