# **Crime Analytics - Visualising Incident Reports**

In this notebook, we analyse San Francisco criminal incident data taken from the police department incident reports published on San Francisco's open data web portal: [https://data.sfgov.org](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783). Note that whilst the original instructions were to use incident data from the summer of 2014, that data is now around a decade old, so we instead use a full year of data starting in the summer of 2022 (June 2022 to May 2023).

The raw data contains 136,677 rows across 35 columns:

| Field Name | Data Type |
| ----------- | --------- |
| Incident Datetime | Object |
| Incident Date | Object |
| Incident Time | Object |
| Incident Year | Integer |
| Incident Day of Week | Object |
| Report Datetime | Object |
| Row ID | Float |
| Incident ID | Integer |
| Incident Number | Integer |
| CAD Number | Object |
| Report Type Code | Object |
| Report Type Description | Object |
| Filed Online | Object |
| Incident Code | Integer |
| Incident Category | Object |
| Incident Subcategory | Object |
| Incident Description | Object |
| Resolution | Object |
| Intersection | Object |
| CNN | Float |
| Police District | Object |
| Analysis Neighborhood | Object |
| Supervisor District | Float |
| Supervisor District 2012 | Float |
| Latitude | Float |
| Longitude | Float |
| Point | Object |
| Neighborhoods | Float |
| ESNCAG - Boundary File | Float |
| Central Market/Tenderloin Boundary Polygon - Updated | Float |
| Civic Center Harm Reduction Project Boundary | Float |
| HSOC Zones as of 2018-06-05 | Float |
| Invest In Neighborhoods (IIN) Areas | Float |
| Current Supervisor Districts | Float |
| Current Police Districts | Float |

For field descriptions see the [Field Definitions](https://datasf.gitbook.io/datasf-dataset-explainers/sfpd-incident-report-2018-to-present#field-definitions).

In [175]:
# import all necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import plotly.io as pio
pio.renderers.default = 'notebook'

In [310]:
incidents = pd.read_csv('sf_incidents_2223.csv')
# clean up the column names
incidents.columns = incidents.columns.str.lower().str.replace(' ', '_')
# select the 'useful' columns for the analysis
extra_cols = ['resolution', 'police_district', 'analysis_neighborhood', 'supervisor_district',
              'latitude', 'longitude', 'point']
useful_cols = [col for col in incidents.columns if col.startswith('incident') or col in extra_cols]
incidents = incidents.loc[:, useful_cols]
# convert incident_datetime to a datetime column
# pandas infers the datetime format automatically, so no extra arguments are needed
incidents['incident_datetime'] = pd.to_datetime(incidents['incident_datetime'])
incidents['incident_date'] = pd.to_datetime(incidents['incident_date'])
incidents.info()

# add hour and month information
incidents['incident_hour'] = incidents.incident_datetime.dt.hour
incidents['incident_month_num'] = incidents.incident_datetime.dt.month
incidents['incident_month'] = incidents.incident_datetime.dt.month_name()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136677 entries, 0 to 136676
Data columns (total 18 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   incident_datetime      136677 non-null  datetime64[ns]
 1   incident_date          136677 non-null  datetime64[ns]
 2   incident_time          136677 non-null  object        
 3   incident_year          136677 non-null  int64         
 4   incident_day_of_week   136677 non-null  object        
 5   incident_id            136677 non-null  int64         
 6   incident_number        136677 non-null  int64         
 7   incident_code          136677 non-null  int64         
 8   incident_category      136545 non-null  object        
 9   incident_subcategory   136545 non-null  object        
 10  incident_description   136677 non-null  object        
 11  resolution             136677 non-null  object        
 12  police_district        136677 non-null  obje

In [311]:
incidents.iloc[:5, :12]

|   | incident_datetime | incident_date | incident_time | incident_year | incident_day_of_week | incident_id | incident_number | incident_code | incident_category | incident_subcategory | incident_description | resolution |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-05-31 23:49:00 | 2023-05-31 | 23:49 | 2023 | Wednesday | 1281250 | 230377142 | 19057 | Disorderly Conduct | Intimidation | Terrorist Threats | Open or Active |
| 1 | 2023-05-31 23:45:00 | 2023-05-31 | 23:45 | 2023 | Wednesday | 1281464 | 230377948 | 75000 | Missing Person | Missing Person | Found Person | Open or Active |
| 2 | 2023-05-31 23:45:00 | 2023-05-31 | 23:45 | 2023 | Wednesday | 1281464 | 230377948 | 74000 | Missing Person | Missing Adult | Missing Adult | Open or Active |
| 3 | 2023-05-31 23:44:00 | 2023-05-31 | 23:44 | 2023 | Wednesday | 1281275 | 230377136 | 16210 | Drug Offense | Drug Violation | Opiates Offense | Cite or Arrest Adult |
| 4 | 2023-05-31 23:44:00 | 2023-05-31 | 23:44 | 2023 | Wednesday | 1281275 | 230377136 | 62050 | Warrant | Warrant | Warrant Arrest, Enroute To Outside Jurisdiction | Cite or Arrest Adult |


In [343]:
cats = pd.DataFrame(incidents.groupby(['incident_category', 'incident_subcategory'])['incident_id'].count())
cats.reset_index(inplace=True)
cats.sort_values(by='incident_id', inplace=True, ascending=False)
cats

|   | incident_category | incident_subcategory | incident_id |
| --- | --- | --- | --- |
| 32 | Larceny Theft | Larceny - From Vehicle | 24294 |
| 35 | Larceny Theft | Larceny Theft - Other | 9312 |
| 43 | Malicious Mischief | Vandalism | 8765 |
| 48 | Motor Vehicle Theft | Motor Vehicle Theft | 8292 |
| 65 | Other Miscellaneous | Other | 6277 |
| ... | ... | ... | ... |
| 53 | Offences Against The Family And Children | Intimidation | 2 |
| 57 | Other Miscellaneous | Bribery | 2 |
| 16 | Disorderly Conduct | Trespass | 1 |
| 27 | Homicide | Homicide - Excusable | 1 |


In [312]:
incidents.iloc[:5, 12:]

|   | police_district | analysis_neighborhood | supervisor_district | latitude | longitude | point | incident_hour | incident_month_num | incident_month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Tenderloin | Tenderloin | 5.0 | 37.78579 | -122.41297 | POINT (-122.41296966814406 37.78578958358186) | 23 | 5 | May |
| 1 | Taraval | Lakeshore | 7.0 | 37.719056 | -122.481424 | POINT (-122.48142396348878 37.71905643663192) | 23 | 5 | May |
| 2 | Taraval | Lakeshore | 7.0 | 37.719056 | -122.481424 | POINT (-122.48142396348878 37.71905643663192) | 23 | 5 | May |
| 3 | Northern | Tenderloin | 5.0 | 37.783101 | -122.419182 | POINT (-122.41918170505187 37.78310139923345) | 23 | 5 | May |
| 4 | Northern | Tenderloin | 5.0 | 37.783101 | -122.419182 | POINT (-122.41918170505187 37.78310139923345) | 23 | 5 | May |


There can be multiple records for each unique incident id/number. The plot below shows that most incidents have only a single record attached to them, but about 14% (16,617 of 114,797) have between 2 and 4 records. To ensure that our conclusions aren't biased by the presence of multiple records for a single incident, we kept a single record for each incident: the first one listed in the data, which appears to be sorted with the most recent records first. For example, for incident 1281464, we kept the record with the description `Found Person`.

In [313]:
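# number of records attached to each incident id
records_per_incident = incidents.groupby('incident_id').size()
# number of incidents that have 1, 2, 3, ... records
counts = (records_per_incident.value_counts()
          .sort_index()
          .rename_axis('id_count')
          .reset_index(name='count'))
# expressed as a percentage of the total number of records (rows)
counts['prop'] = counts['count'] / incidents.shape[0] * 100
counts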

|   | id_count | count | prop |
| --- | --- | --- | --- |
| 0 | 1 | 98180 | 71.833593 |
| 1 | 2 | 11373 | 8.321078 |
| 2 | 3 | 5225 | 3.822882 |
| 3 | 4 | 19 | 0.013901 |
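
As a quick check on the table above, the counts can be tied back to the raw row count and turned into a share of incidents. This is a minimal sketch that hard-codes the four counts from the output; the variable names are illustrative only and are not part of the notebook.

```python
# Counts copied from the table above: {records per incident: number of incidents}
records_per_incident = {1: 98180, 2: 11373, 3: 5225, 4: 19}

# Every incident is counted once, whatever its number of records
total_incidents = sum(records_per_incident.values())
# Every record is counted once: n records for each of the k incidents
total_records = sum(n * k for n, k in records_per_incident.items())
# Incidents with two or more records
multi_record_incidents = total_incidents - records_per_incident[1]
share_of_incidents = multi_record_incidents / total_incidents * 100

print(total_incidents)                # 114797
print(total_records)                  # 136677, the raw row count
print(multi_record_incidents)         # 16617
print(f"{share_of_incidents:.1f}%")   # 14.5%
```

Note that the `prop` column above divides by the number of records (136,677) rather than the number of incidents, which is why its values do not sum to 100.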


In [314]:
fig = px.bar(data_frame=counts, x="id_count", y="prop",
             hover_data={'prop':':.2f',
                         'count':':,'},
             labels={'prop':'Percentage (%)', 'id_count':'Occurrence'},
             title='Proportion of records by incident id occurrence',
             template='simple_white')
fig

In [315]:
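# keep one record per incident: first() takes the first non-null value
# of each column within each incident_id group
incidents2 = incidents.groupby('incident_id').first()
# order the de-duplicated incidents from most recent to oldest
incidents2.sort_values(by='incident_datetime', ascending=False, inplace=True)
incidents2.reset_index(inplace=True)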
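
One subtlety of the de-duplication step is that `groupby(...).first()` returns the first non-null value of each column, which is not always the same as the first row of each group. The toy frame below is a hedged illustration of that difference; the missing description is invented for the example and does not come from the real data, and `drop_duplicates` is shown only as an alternative, not as the method used in this notebook.

```python
import pandas as pd

# Toy data: the first record of incident 1281464 has a missing description
# (this gap is made up purely for illustration)
toy = pd.DataFrame({
    "incident_id": [1281464, 1281464, 1281275],
    "incident_code": [75000, 74000, 16210],
    "incident_description": [None, "Missing Adult", "Opiates Offense"],
})

# first(): first non-null value per column, so values can come from different rows
first_non_null = toy.groupby("incident_id", sort=False).first().reset_index()

# drop_duplicates(): keeps the whole first row of each incident, nulls included
first_row = toy.drop_duplicates(subset="incident_id", keep="first").reset_index(drop=True)

print(first_non_null)
print(first_row)
```

In the `first()` result, incident 1281464 combines code 75000 from its first record with the description from its second record, whereas `drop_duplicates` keeps the first record unchanged. When the relevant columns have no missing values, the two approaches give the same rows.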

In [316]:
by_cat = pd.DataFrame(incidents2.incident_category.value_counts())
by_cat.reset_index(inplace=True)
by_cat.columns = ['category', 'count']
by_cat['prop'] = by_cat['count']/incidents2.shape[0]*100
by_cat.sort_values(by='prop', inplace=True, ascending=False)
by_cat

|   | category | count | prop |
| --- | --- | --- | --- |
| 0 | Larceny Theft | 40391 | 35.184717 |
| 1 | Motor Vehicle Theft | 8264 | 7.198794 |
| 2 | Malicious Mischief | 7606 | 6.625609 |
| 3 | Non-Criminal | 6512 | 5.672622 |
| 4 | Assault | 6375 | 5.553281 |
| 5 | Burglary | 6278 | 5.468784 |
| 6 | Recovered Vehicle | 5974 | 5.203969 |
| 7 | Other Miscellaneous | 4872 | 4.244013 |
| 8 | Fraud | 4265 | 3.715254 |
| 9 | Lost Property | 3456 | 3.010532 |


In [None]:
# let's view the distribution by category


In [335]:
# let's see if there is any pattern in the number of incidents by month of the year
by_month_cat = pd.DataFrame(incidents2.groupby(['incident_month', 'incident_month_num', 'incident_category'])['incident_id'].count())
by_month_cat.reset_index(inplace=True)
by_month_cat.columns = ['month', 'month_num', 'category', 'count']
by_month_cat['month_sum'] = by_month_cat.groupby('month')['count'].transform('sum')
by_month_cat['total_prop'] = by_month_cat['count']/incidents2.shape[0]*100
by_month_cat['month_prop'] = by_month_cat['count']/by_month_cat.month_sum*100

# pivot the data so each month is a column, each category is a row and each proportion is a value
month_cat_wide = by_month_cat.pivot(columns='month_num', index='category', values='month_prop')
# parallel coordinates needs 'category' in a numeric format
month_cat_wide['category_num'] = range(month_cat_wide.shape[0])
month_cat_wide.fillna(0, inplace=True)
# month_cat_wide
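
The per-month share and the long-to-wide pivot used above can be checked on a small example. This is a minimal sketch with invented counts for two months and three categories; none of the numbers come from the incident data.

```python
import pandas as pd

# Invented counts in the same long format as by_month_cat
toy = pd.DataFrame({
    "month_num": [6, 6, 7, 7],
    "category": ["Assault", "Burglary", "Assault", "Fraud"],
    "count": [30, 10, 20, 20],
})

# transform('sum') broadcasts each month's total back onto its rows
toy["month_sum"] = toy.groupby("month_num")["count"].transform("sum")
toy["month_prop"] = toy["count"] / toy["month_sum"] * 100

# One row per category, one column per month; missing pairs become 0
wide = toy.pivot(index="category", columns="month_num", values="month_prop").fillna(0)
print(wide)
```

Each month column of `wide` sums to 100, and a category with no incidents in a month (here, Burglary in month 7 and Fraud in month 6) gets 0 rather than a missing value, which is what the parallel coordinates plot needs.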

In [338]:
# let's visualise each category over the months of the year
fig = px.parallel_coordinates(month_cat_wide,
                              color="category_num",
                              # months in data order: June to December, then January to May
                              dimensions=[6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5])
fig.show()