# Toronto Traffic Collisions – Data Overview

This notebook provides an initial overview of the Toronto Traffic Collisions dataset.  
The goal is to understand the dataset structure, size, column types, missing values,  
and record early observations before performing any data cleaning or analysis.


In [1]:
import pandas as pd
import numpy as np


## Load Dataset

The raw collision dataset is loaded from the `data/raw` directory.


In [2]:
df = pd.read_csv("../data/raw/Traffic_Collisions_Data.csv")


## Dataset Size

We examine the number of rows and columns in the dataset.


In [3]:
df.shape


(772516, 23)

## Preview of Dataset

The first few rows provide a high-level view of the dataset and its columns.


In [4]:
df.head()


|   | OBJECTID | EVENT_UNIQUE_ID | OCC_DATE | OCC_MONTH | OCC_DOW | OCC_YEAR | OCC_HOUR | DIVISION | FATALITIES | INJURY_COLLISIONS | ... | NEIGHBOURHOOD_158 | LONG_WGS84 | LAT_WGS84 | AUTOMOBILE | MOTORCYCLE | PASSENGER | BICYCLE | PEDESTRIAN | x | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | GO-20148000028 | 1/1/2014 5:00:00 AM | January | Wednesday | 2014 | 17 | D53 | 0 | NO | ... | Mount Pleasant East (99) | -79.377616 | 43.701225 | YES | NO | NO | NO | NO | -8836276.0 | 5419322.0 |
| 1 | 2 | GO-20148004875 | 1/1/2014 5:00:00 AM | January | Wednesday | 2014 | 14 | D32 | 0 | NO | ... | Lawrence Park North (105) | -79.397589 | 43.726091 | YES | NO | NO | NO | NO | -8838499.0 | 5423152.0 |
| 2 | 3 | GO-20141260499 | 1/1/2014 5:00:00 AM | January | Wednesday | 2014 | 2 | NSA | 0 | YES | ... | NSA | 0.0 | 0.0 | YES | NO | NO | NO | NO | 6.32778e-09 | 5.664924e-09 |
| 3 | 4 | GO-20141260663 | 1/1/2014 5:00:00 AM | January | Wednesday | 2014 | 3 | NSA | 0 | NO | ... | NSA | 0.0 | 0.0 | YES | NO | NO | NO | NO | 6.32778e-09 | 5.664924e-09 |
| 4 | 5 | GO-20141261162 | 1/1/2014 5:00:00 AM | January | Wednesday | 2014 | 5 | NSA | 0 | YES | ... | NSA | 0.0 | 0.0 | YES | NO | NO | NO | NO | 6.32778e-09 | 5.664924e-09 |
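
The preview above elides the middle columns (shown as `...`). One way to see every column at once is to transpose the first few rows so that each column becomes a row. The sketch below is a minimal illustration: `preview_all_columns` is a hypothetical helper, and `toy` is a made-up frame standing in for the real `df`.

```python
import pandas as pd


def preview_all_columns(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the first n rows transposed, so each column becomes a row."""
    return frame.head(n).T


# Toy frame with made-up values and 23 columns, standing in for the real `df`.
toy = pd.DataFrame({f"col_{i}": range(10) for i in range(23)})

preview = preview_all_columns(toy)
print(preview.shape)  # (23, 5): one row per original column
```

In the notebook this would be called as `preview_all_columns(df)`. Wrapping `df.head()` in `pd.option_context("display.max_columns", None)` is an alternative that keeps the original orientation.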


## Column Structure and Data Types

This section examines column names, data types, and non-null counts.


In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 772516 entries, 0 to 772515
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   OBJECTID           772516 non-null  int64  
 1   EVENT_UNIQUE_ID    772516 non-null  object 
 2   OCC_DATE           772516 non-null  object 
 3   OCC_MONTH          772516 non-null  object 
 4   OCC_DOW            772516 non-null  object 
 5   OCC_YEAR           772516 non-null  int64  
 6   OCC_HOUR           772516 non-null  int64  
 7   DIVISION           772516 non-null  object 
 8   FATALITIES         772516 non-null  int64  
 9   INJURY_COLLISIONS  772512 non-null  object 
 10  FTR_COLLISIONS     772512 non-null  object 
 11  PD_COLLISIONS      772512 non-null  object 
 12  HOOD_158           772516 non-null  object 
 13  NEIGHBOURHOOD_158  772516 non-null  object 
 14  LONG_WGS84         772516 non-null  float64
 15  LAT_WGS84          772516 non-null  float64
 16  AU
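
Because many columns are stored as `object`, it can help to check how many distinct values each one holds and whether `OCC_DATE` parses cleanly as a datetime. The sketch below is illustrative only: `summarize_non_numeric` is a hypothetical helper, the `toy` rows are made up, and the month/day/year order in the format string is an assumption that the preview rows alone cannot confirm. The parse result is kept in a separate variable, so the frame itself is not modified.

```python
import pandas as pd


def summarize_non_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Count distinct and missing values in each non-numeric column."""
    cols = [c for c in frame.columns if not pd.api.types.is_numeric_dtype(frame[c])]
    return pd.DataFrame({
        "n_unique": frame[cols].nunique(),    # distinct non-null values
        "n_missing": frame[cols].isna().sum(),
    })


# Toy rows (illustrative only) standing in for the real `df`.
toy = pd.DataFrame({
    "OCC_DATE": ["1/1/2014 5:00:00 AM", "1/1/2014 5:00:00 AM", "12/31/2014 5:00:00 AM"],
    "DIVISION": ["D53", "D32", "NSA"],
    "INJURY_COLLISIONS": ["NO", "YES", None],
    "OCC_YEAR": [2014, 2014, 2014],
})

summary = summarize_non_numeric(toy)
print(summary)

# Assumption: dates are month/day/year with a 12-hour clock.
parsed = pd.to_datetime(toy["OCC_DATE"], format="%m/%d/%Y %I:%M:%S %p")
print(parsed.dtype)
```

Columns with only a handful of distinct values (such as YES/NO flags) are natural candidates for a categorical or boolean type in a later cleaning step.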

## Missing Values

We calculate the number of missing values for each column.


In [6]:
df.isnull().sum()


OBJECTID             0
EVENT_UNIQUE_ID      0
OCC_DATE             0
OCC_MONTH            0
OCC_DOW              0
OCC_YEAR             0
OCC_HOUR             0
DIVISION             0
FATALITIES           0
INJURY_COLLISIONS    4
FTR_COLLISIONS       4
PD_COLLISIONS        4
HOOD_158             0
NEIGHBOURHOOD_158    0
LONG_WGS84           0
LAT_WGS84            0
AUTOMOBILE           4
MOTORCYCLE           4
PASSENGER            4
BICYCLE              4
PEDESTRIAN           4
x                    0
y                    0
dtype: int64

## Missing Values Percentage

The percentage of missing values helps assess data completeness.


In [7]:
(df.isnull().sum() / len(df)) * 100


OBJECTID             0.000000
EVENT_UNIQUE_ID      0.000000
OCC_DATE             0.000000
OCC_MONTH            0.000000
OCC_DOW              0.000000
OCC_YEAR             0.000000
OCC_HOUR             0.000000
DIVISION             0.000000
FATALITIES           0.000000
INJURY_COLLISIONS    0.000518
FTR_COLLISIONS       0.000518
PD_COLLISIONS        0.000518
HOOD_158             0.000000
NEIGHBOURHOOD_158    0.000000
LONG_WGS84           0.000000
LAT_WGS84            0.000000
AUTOMOBILE           0.000518
MOTORCYCLE           0.000518
PASSENGER            0.000518
BICYCLE              0.000518
PEDESTRIAN           0.000518
x                    0.000000
y                    0.000000
dtype: float64
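
The counts and percentages above can also be combined into one table that lists only the columns with missing values. This is a minimal sketch: `missing_summary` is a hypothetical helper, and the `toy` frame uses made-up values rather than the real data.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": counts / len(frame) * 100,
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame (illustrative only): one missing value in `a`, none in `b`.
toy = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": ["x", "y", "z", "w"]})

result = missing_summary(toy)
print(result)
```

On the real frame, `missing_summary(df)` should list the eight columns that each have 4 missing values, which is 4 / 772516 × 100 ≈ 0.000518%.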

## Summary Statistics

Basic descriptive statistics for numerical columns.


In [8]:
df.describe()


|   | OBJECTID | OCC_YEAR | OCC_HOUR | FATALITIES | LONG_WGS84 | LAT_WGS84 | x | y |
|---|---|---|---|---|---|---|---|---|
| count | 772516.0 | 772516.0 | 772516.0 | 772516.0 | 772516.0 | 772516.0 | 772516.0 | 772516.0 |
| mean | 386258.5 | 2019.256124 | 13.493745 | 0.00086 | -66.435517 | 36.579449 | -7395568.0 | 4536463.0 |
| std | 223006.304615 | 3.414332 | 4.978853 | 0.030219 | 29.33867 | 16.153893 | 3265966.0 | 2003358.0 |
| min | 1.0 | 2014.0 | 0.0 | 0.0 | -79.639247 | 0.0 | -8865400.0 | 5.664924e-09 |
| 25% | 193129.75 | 2016.0 | 10.0 | 0.0 | -79.444902 | 43.644548 | -8843766.0 | 5410599.0 |
| 50% | 386258.5 | 2019.0 | 14.0 | 0.0 | -79.370469 | 43.693022 | -8835480.0 | 5418059.0 |
| 75% | 579387.25 | 2022.0 | 17.0 | 0.0 | -79.258814 | 43.75165 | -8823051.0 | 5427089.0 |
| max | 772516.0 | 2025.0 | 23.0 | 4.0 | 0.0 | 43.853164 | 6.32778e-09 | 5442747.0 |
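
The summary shows a minimum `LAT_WGS84` of 0.0 and a maximum `LONG_WGS84` of 0.0, values that lie far from Toronto and suggest placeholder coordinates. A quick way to measure how common they are is to count rows where both coordinates equal zero. In the sketch below, `count_zero_coordinates` is a hypothetical helper, and the `toy` frame reuses the coordinates of the five preview rows shown earlier.

```python
import pandas as pd


def count_zero_coordinates(
    frame: pd.DataFrame,
    lat_col: str = "LAT_WGS84",
    lon_col: str = "LONG_WGS84",
) -> tuple[int, float]:
    """Return the count and percentage of rows whose latitude and longitude are both 0."""
    mask = (frame[lat_col] == 0) & (frame[lon_col] == 0)
    return int(mask.sum()), float(mask.mean() * 100)


# Coordinates of the five preview rows, standing in for the real `df`.
toy = pd.DataFrame({
    "LAT_WGS84": [43.701225, 43.726091, 0.0, 0.0, 0.0],
    "LONG_WGS84": [-79.377616, -79.397589, 0.0, 0.0, 0.0],
})

n_zero, pct_zero = count_zero_coordinates(toy)
print(n_zero, round(pct_zero, 2))
```

Running `count_zero_coordinates(df)` on the full frame would quantify how many records lack usable coordinates. This only inspects the data and changes nothing.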


## Initial Observations

Based on the initial exploration:

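- The dataset contains 772,516 collision records.
- There are 23 columns covering temporal, geographic, and collision outcome variables.
- Several categorical columns, such as `DIVISION` and the YES/NO flag `INJURY_COLLISIONS`, are stored as `object`; `OCC_DATE` is also stored as `object` rather than as a datetime.
- Eight columns (`INJURY_COLLISIONS`, `FTR_COLLISIONS`, `PD_COLLISIONS`, `AUTOMOBILE`, `MOTORCYCLE`, `PASSENGER`, `BICYCLE`, `PEDESTRIAN`) each have 4 missing values, about 0.0005% of records.
- `LONG_WGS84` and `LAT_WGS84` have no null values, but some records carry placeholder coordinates of 0.0 (for example, the previewed rows with `DIVISION` set to `NSA`). These zeros pull the mean longitude to about -66.4 and the mean latitude to about 36.6.
- `OCC_YEAR` ranges from 2014 to 2025, and records span multiple divisions across Toronto.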

No data cleaning is performed at this stage.
