<a href="https://colab.research.google.com/github/DavidToca/ML-Toronto-AutoTheft/blob/main/ML_AutoTheft_Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset of Major Crime Indicators Open Data

## Source:
https://data.torontopolice.on.ca/datasets/TorontoPS::major-crime-indicators-open-data/about

## Description:

This dataset includes all Major Crime Indicators (MCI) occurrences by reported date and related
offences. The MCI categories include Assault, Break and Enter, Auto Theft, Robbery and Theft
Over. This data is provided at the offence and/or victim level, therefore one occurrence number
may have several records associated to the various MCIs used to categorize the occurrence. This
data does not include occurrences that have been deemed unfounded. The definition of
unfounded according to Statistics Canada is: “It has been determined through police
investigation that the offence reported did not occur, nor was it attempted” (Statistics Canada,
2020).

# Documentation

Toronto Police provides documentation describing what each field of this dataset represents here: https://torontops.maps.arcgis.com/sharing/rest/content/items/c0b17f1888544078bf650f3b8b04d35d/data. As we will see shortly, this documentation is incomplete.

I've extracted the field documentation into the table below:

| Field | FieldName | Description |
|----------|----------|----------|
| 1 | EVENT_UNIQUE_ID | Offence Number |
| 2 | REPORT_DATE | Date Offence was Reported (time is displayed in UTC format when downloaded as a CSV) |
| 3 | OCC_DATE | Date Offence Occurred (time is displayed in UTC format when downloaded as a CSV) |
| 4 | REPORT_YEAR | Year Offence was Reported |
| 5 | REPORT_MONTH | Month Offence was Reported |
| 6 | REPORT_DAY | Day of the Month Offence was Reported |
| 7 | REPORT_DOY | Day of the Year Offence was Reported |
| 8 | REPORT_DOW | Day of the Week Offence was Reported |
| 9 | REPORT_HOUR | Hour Offence was Reported |
| 10 | OCC_YEAR | Year Offence Occurred |
| 11 | OCC_MONTH | Month Offence Occurred |
| 12 | OCC_DAY | Day of the Month Offence Occurred |
| 13 | OCC_DOY | Day of the Year Offence Occurred |
| 14 | OCC_DOW | Day of the Week Offence Occurred |
| 15 | OCC_HOUR | Hour Offence Occurred |
| 16 | DIVISION | Police Division where Offence Occurred |
| 17 | LOCATION_TYPE | Location Type of Offence |
| 18 | PREMISES_TYPE | Premises Type of Offence |
| 19 | UCR_CODE | UCR Code for Offence |
| 20 | UCR_EXT | UCR Extension for Offence |
| 21 | OFFENCE | Title of Offence |
| 22 | MCI_CATEGORY | MCI Category of Occurrence |
| 23 | HOOD_158 | Identifier of Neighbourhood using City of Toronto's new 158 neighbourhood structure |
| 24 | NEIGHBOURHOOD_158 | Name of Neighbourhood using City of Toronto's new 158 neighbourhood structure |
| 25 | HOOD_140 | Identifier of Neighbourhood using City of Toronto's old 140 neighbourhood structure |
| 26 | NEIGHBOURHOOD_140 | Name of Neighbourhood using City of Toronto's old 140 neighbourhood structure |
| 27 | LONG_WGS84 | Longitude Coordinates (Offset to nearest intersection) |
| 28 | LAT_WGS84 | Latitude Coordinates (Offset to nearest intersection) |
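
Since REPORT_DATE and OCC_DATE are documented as UTC timestamps, it may help to parse them explicitly before deriving any time features. The sketch below uses two sample strings in the format the CSV appears to use (`YYYY/MM/DD HH:MM:SS+00`); converting to `America/Toronto` is an assumption about the intended local time zone, not something the documentation states.

```python
import pandas as pd

# Sample values in the format seen in the REPORT_DATE / OCC_DATE columns.
raw = pd.Series(["2014/01/01 05:00:00+00", "2013/12/31 05:00:00+00"])

# The trailing "+00" is matched literally; utc=True marks the result as UTC.
as_utc = pd.to_datetime(raw, format="%Y/%m/%d %H:%M:%S+00", utc=True)

# Assumption: Toronto local time is the reference we care about.
as_local = as_utc.dt.tz_convert("America/Toronto")
print(as_local)
```

On the real data the same two calls could be applied to `df["REPORT_DATE"]` and `df["OCC_DATE"]`.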

# Objective
Develop a model that predicts auto theft crimes in Toronto.

# EDA

Let's start by loading the data and noting anything that stands out.
(Download the dataset and place it at `/content/Major_Crime_Indicators_Open_Data.csv`.)

In [2]:
import pandas as pd
FILE_PATH = "/content/Major_Crime_Indicators_Open_Data.csv"
df = pd.read_csv(FILE_PATH)

print(df.shape)

(372899, 31)


The data has 31 columns, but the documentation only accounts for 28. Let's explore the data further.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372899 entries, 0 to 372898
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   X                  372899 non-null  float64
 1   Y                  372899 non-null  float64
 2   OBJECTID           372899 non-null  int64  
 3   EVENT_UNIQUE_ID    372899 non-null  object 
 4   REPORT_DATE        372899 non-null  object 
 5   OCC_DATE           372899 non-null  object 
 6   REPORT_YEAR        372899 non-null  int64  
 7   REPORT_MONTH       372899 non-null  object 
 8   REPORT_DAY         372899 non-null  int64  
 9   REPORT_DOY         372899 non-null  int64  
 10  REPORT_DOW         372899 non-null  object 
 11  REPORT_HOUR        372899 non-null  int64  
 12  OCC_YEAR           372788 non-null  float64
 13  OCC_MONTH          372788 non-null  object 
 14  OCC_DAY            372788 non-null  float64
 15  OCC_DOY            372788 non-null  float64
 16  OC

# Analysis

💡 The 3 extra columns are X, Y and OBJECTID. X and Y are coordinates, and OBJECTID looks like a sequential row identifier.
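
One way to find the undocumented columns without reading the `df.info()` output by eye is a simple membership check against the documented field names. This is a minimal sketch: `DOCUMENTED` is transcribed from the table above, `undocumented_columns` is a hypothetical helper, and `sample_columns` is a stand-in for `list(df.columns)` so the snippet runs without the CSV.

```python
# Field names listed in the Toronto Police documentation table (28 entries).
DOCUMENTED = [
    "EVENT_UNIQUE_ID", "REPORT_DATE", "OCC_DATE", "REPORT_YEAR",
    "REPORT_MONTH", "REPORT_DAY", "REPORT_DOY", "REPORT_DOW", "REPORT_HOUR",
    "OCC_YEAR", "OCC_MONTH", "OCC_DAY", "OCC_DOY", "OCC_DOW", "OCC_HOUR",
    "DIVISION", "LOCATION_TYPE", "PREMISES_TYPE", "UCR_CODE", "UCR_EXT",
    "OFFENCE", "MCI_CATEGORY", "HOOD_158", "NEIGHBOURHOOD_158",
    "HOOD_140", "NEIGHBOURHOOD_140", "LONG_WGS84", "LAT_WGS84",
]


def undocumented_columns(actual_columns, documented=DOCUMENTED):
    """Return columns present in the data but missing from the docs, in file order."""
    known = set(documented)
    return [col for col in actual_columns if col not in known]


# Stand-in for list(df.columns); in the notebook, pass df.columns instead.
sample_columns = ["X", "Y", "OBJECTID"] + DOCUMENTED
print(undocumented_columns(sample_columns))
```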

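🚨 Some columns do contain nulls: OCC_YEAR, OCC_MONTH, OCC_DAY and OCC_DOY each show 372788 non-null values out of 372899 rows, so 111 records are missing these occurrence date fields.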
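
To read the null counts directly rather than inferring them from `df.info()`, the missing values can be tallied per column. The helper name `null_report` and the toy frame are illustrative assumptions; in the notebook the function would be called on `df`.

```python
import pandas as pd


def null_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame standing in for df: one complete column, one with a gap.
toy = pd.DataFrame({
    "REPORT_YEAR": [2014, 2014, 2014],
    "OCC_YEAR": [2014.0, None, 2014.0],
})
print(null_report(toy))
```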

We don't have a free-text description of each crime, but two columns, OFFENCE and MCI_CATEGORY, can help us identify whether a record is an auto theft.
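
As a rough sketch of how these two columns could be turned into a label, the snippet below flags rows by MCI_CATEGORY. It assumes the category value is spelled `"Auto Theft"` (as in the dataset description) and that the toy OFFENCE/MCI_CATEGORY pairings resemble the real ones; `IS_AUTO_THEFT` is a hypothetical column name. Both assumptions should be checked against the actual values in `df`.

```python
import pandas as pd

# Toy rows for illustration only; the pairings are assumed, not taken from the CSV.
toy = pd.DataFrame({
    "OFFENCE": ["Assault", "Theft Of Motor Vehicle", "B&E"],
    "MCI_CATEGORY": ["Assault", "Auto Theft", "Break and Enter"],
})

AUTO_THEFT_CATEGORY = "Auto Theft"  # assumed spelling of the category label

# Boolean flag: True when the record falls under the auto theft category.
toy["IS_AUTO_THEFT"] = toy["MCI_CATEGORY"].eq(AUTO_THEFT_CATEGORY)

# Cross-tabulating against OFFENCE shows which offence titles map to the flag.
print(pd.crosstab(toy["OFFENCE"], toy["IS_AUTO_THEFT"]))
```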

In [5]:
display(df[:3])

Unnamed: 0,X,Y,OBJECTID,EVENT_UNIQUE_ID,REPORT_DATE,OCC_DATE,REPORT_YEAR,REPORT_MONTH,REPORT_DAY,REPORT_DOY,...,UCR_CODE,UCR_EXT,OFFENCE,MCI_CATEGORY,HOOD_158,NEIGHBOURHOOD_158,HOOD_140,NEIGHBOURHOOD_140,LONG_WGS84,LAT_WGS84
0,-8809036.0,5431523.0,1,GO-20141260264,2014/01/01 05:00:00+00,2014/01/01 05:00:00+00,2014,January,1,1,...,1430,100,Assault,Assault,143,West Rouge,131,Rouge (131),-79.132915,43.780413
1,-8814320.0,5435514.0,2,GO-20141260033,2014/01/01 05:00:00+00,2013/12/31 05:00:00+00,2014,January,1,1,...,1430,100,Assault,Assault,144,Morningside Heights,131,Rouge (131),-79.180387,43.806289
2,-8832825.0,5419631.0,3,GO-20141259834,2014/01/01 05:00:00+00,2014/01/01 05:00:00+00,2014,January,1,1,...,1420,100,Assault With Weapon,Assault,55,Thorncliffe Park,55,Thorncliffe Park (55),-79.346615,43.703234
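
The X and Y values in this preview look like a projected version of LONG_WGS84 and LAT_WGS84. A quick check, assuming the projection is Web Mercator (EPSG:3857), which the documentation does not state, is to project the first row's longitude and latitude and compare against the X and Y shown for that row. The function name `to_web_mercator` is illustrative.

```python
import math

# Sphere radius used by Web Mercator (equals the WGS84 semi-major axis), in metres.
EARTH_RADIUS_M = 6378137.0


def to_web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Project WGS84 longitude/latitude (degrees) to spherical Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# First row of the preview: LONG_WGS84, LAT_WGS84 and the X, Y values displayed.
x, y = to_web_mercator(-79.132915, 43.780413)
print("projected:", round(x), round(y))
print("difference from X, Y in metres:", abs(x - (-8809036.0)), abs(y - 5431523.0))
```

If the differences are only a few metres, X and Y carry the same information as the longitude and latitude columns and one pair can likely be dropped for modelling.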


We can see that multiple types of crime are logged, not just auto theft. We also notice that the data includes records from 2014. Let's explore this in more detail.

In [9]:
# prompt: group by OFFENCE, then show me which values it has, order by the one with the most count

# value_counts() already returns counts sorted in descending order
df['OFFENCE'].value_counts()


Assault                           135392
B&E                                59033
Theft Of Motor Vehicle             58441
Assault With Weapon                33764
Robbery - Mugging                   9053
B&E W'Intent                        8754
Assault Bodily Harm                 8555
Theft Over                          6961
Assault Peace Officer               6278
Robbery With Weapon                 6134
Robbery - Other                     5634
Robbery - Business                  5222
Assault - Resist/ Prevent Seiz      3381
Theft From Motor Vehicle Over       3075
Aggravated Assault                  2884
Robbery - Swarming                  2524
Discharge Firearm With Intent       2300
Unlawfully In Dwelling-House        2219
Discharge Firearm - Recklessly      1664
Theft From Mail / Bag / Key         1464
Pointing A Firearm                  1350
Robbery - Home Invasion             1290
Robbery - Vehicle Jacking           1211
Robbery - Purse Snatch              1122
Robbery - Financ