# Seattle City Crime Data Analysis

## 1: Motivation

As an international student at the University of Washington (UW), I am curious about the City of Seattle and would love to explore its different neighborhoods. Through HCDE 410 and other data science classes at the UW, I have gained the skills to analyze datasets and draw insights from them. Because I am concerned about the city's public safety, I want to analyze its crime data to understand which crime types are most frequent, so that I am better informed when visiting different places. Research Question 2 is inspired by my experience walking on the street, especially in unfamiliar areas, where I wonder whether a person is more likely to attack me or just wants my property (with no personal harm if I comply).

Since I primarily reside in the University District while studying at the UW, I want to learn about the crime distribution in this area through Research Question 3. Before the COVID-19 pandemic moved all UW classes online, I occasionally studied in UW libraries, where I noticed posters from the UW Police Department urging students to protect their valuable items and never leave them unattended. In addition, I occasionally saw bikes on bike racks with missing parts, implying that someone had stolen usable parts from students' bikes.

## 2: Background and/or Related Work

While researching an overview of crime in Seattle, I found [this US News article](https://realestate.usnews.com/places/washington/seattle/crime) stating that Seattle has "a \[relatively\] lower crime rate than similarly sized metro areas"; compared to the national average, it has a higher rate of crimes against property and a lower rate of violent crimes (against people). I also learned that Seattle was on pace for a record-high number of murders in 2020, according to this [King5 news article](https://www.king5.com/article/news/local/seattle-police-reports-49-murders-setting-pace-for-record-homicides/281-c32aa4ae-ef9c-485f-a9d8-1113b491fc9d). Within the University of Washington, UW Police websites allow students to register [bicycles](https://bikeindex.org/uw) and [electronics](http://police.uw.edu/community-engagement/loveyourstuff/ereg/), and the content of these web pages describes how prevalent thefts of these items are. Hence I assume that robbery/theft is the most frequent offense in the University District. As for how the COVID-19 pandemic affected the crime distribution, I recall reports that attacks against people increased, partly due to the economic downturn and discrimination against East Asian people (linked to the alleged origin of COVID-19).

I have adjusted my research questions 2, 3, and 4 according to my background research, as I want to use the actual crime data from the Seattle Police Department to verify the claims made in these articles. In addition, I can learn whether robbery/theft is the most frequent offense in the University District, and whether it is noticeably more frequent than the second most frequent offense in this area. All in all, I hope my final project relates to HCDE 410's lecture on open data science, where readers can use raw data to audit claims made in analytical reports.

## 3: Research questions

1. Have certain types of crime offenses become more frequent since the onset of the COVID-19 pandemic?
2. In general, do crimes against a person happen more frequently than crimes against property in Seattle?
3. Within the University District, what are the most frequent crime offenses?
4. Did Seattle have a record-high homicide count in 2020?

### 3-1: Hypotheses

1. Personal assault crimes have become more frequent during the COVID-19 pandemic.
2. Crimes against property happen more frequently than crimes against a person.
3. Robbery/theft is the most frequent offense in the University District.
4. Seattle did have a record-high homicide count in 2020.


## 4: Data Source

**Seattle City Crime Data**

This dataset, provided by the City of Seattle Police Department, documents city-wide crime data since 2008. It is comprehensive, including the offense time, type of crime, and neighborhood district, which gives me many ways to analyze the crime distribution in Seattle.

[Dataset source](https://data.seattle.gov/Public-Safety/SPD-Crime-Data-2008-Present/tazs-3rd5) (Time Range: from 2008 to present)

License: Public Domain

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
from datetime import datetime

In [4]:
#the URL we're retrieving the data from. Copy/paste it into your browser to view it!
api_endpoint = "https://data.seattle.gov/resource/tazs-3rd5.csv"

#the parameters we're passing to the API, to specify what subset of data we want.
api_parameters = "?$limit=1000000&" # Request up to 1,000,000 rows; raise this limit if the dataset grows beyond that
api_parameters = api_parameters + "$select=offense_start_datetime,crime_against_category,offense_parent_group,offense,mcpp&" # Only select columns that I need for my data analysis
# Read the app token from a local file; strip the trailing newline so it does not end up in the URL
with open("data/token.txt", "r") as token_file:
    api_parameters = api_parameters + "$$app_token=" + token_file.readline().strip()

print(api_endpoint + api_parameters)
police_data = pd.read_csv(api_endpoint + api_parameters)

https://data.seattle.gov/resource/tazs-3rd5.csv?$limit=1000000&$select=offense_start_datetime,crime_against_category,offense_parent_group,offense,mcpp&$$app_token=SGxXejSEoCpKVTLb0IzwU4X8w
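
The query string above can also be assembled from a dictionary instead of by hand-concatenating `&`-separated pieces. The following is only a sketch: the helper name `build_query_url` and the token value `PLACEHOLDER_TOKEN` are hypothetical, and it uses nothing beyond the standard library's `urllib.parse.urlencode`.

```python
from urllib.parse import urlencode

def build_query_url(endpoint, columns, limit, app_token):
    """Build a query URL with $limit, $select and $$app_token parameters."""
    params = {
        "$limit": limit,
        "$select": ",".join(columns),
        "$$app_token": app_token,
    }
    # safe="$," keeps "$" and "," readable instead of percent-encoding them.
    return endpoint + "?" + urlencode(params, safe="$,")

# Hypothetical call: the token is a placeholder, not a real credential.
url = build_query_url(
    "https://data.seattle.gov/resource/tazs-3rd5.csv",
    ["offense", "mcpp"],
    10,
    "PLACEHOLDER_TOKEN",
)
print(url)
```

Building the URL this way avoids stray `&` characters and makes it easy to print the URL without the token by leaving that key out of the dictionary.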


In [5]:
print(police_data)

         offense_start_datetime crime_against_category  \
0       2020-02-05T10:10:00.000                SOCIETY   
1       2020-02-03T08:00:00.000               PROPERTY   
2       2020-02-02T20:30:00.000               PROPERTY   
3       2020-02-05T01:17:00.000               PROPERTY   
4       2020-02-05T00:51:21.000                SOCIETY   
...                         ...                    ...   
904309  2013-07-13T01:00:00.000               PROPERTY   
904310  2013-06-26T11:00:00.000               PROPERTY   
904311  2012-02-14T15:04:00.000               PROPERTY   
904312  2010-09-19T16:59:00.000               PROPERTY   
904313  2010-02-25T18:00:00.000               PROPERTY   

                            offense_parent_group  \
0                         DRUG/NARCOTIC OFFENSES   
1                                  LARCENY-THEFT   
2                                        ROBBERY   
3       DESTRUCTION/DAMAGE/VANDALISM OF PROPERTY   
4                    DRIVING UNDER THE INFL

## 5: Methodology

### 5-1: Pre-analysis data cleaning

Convert the date strings in the dataset to datetime objects so that dates can be compared and filtered directly.

I also remove rows whose `offense_start_datetime` is before 2008-01-01, because the website states that the dataset only covers data recorded from 2008 onward. Although some crimes that happened before 2008 were entered into this dataset, their number is likely to be noticeably lower than the actual number of incidents, which would bias any analysis of crimes before 2008. In fact, this only removes 2644 cases (904314 - 901670) dated before 2008-01-01 (calculated on 2021-05-27).
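
As a small illustration of this cleaning step, here is a sketch on three timestamps in the same string format as the dataset. The first value appears in the printed data above; the other two are made up to sit on either side of the cutoff.

```python
import pandas as pd

# One timestamp from the printed data plus two made-up boundary cases.
raw = pd.Series([
    "2020-02-05T10:10:00.000",
    "2007-12-31T23:59:59.000",  # made up: just before the cutoff
    "2008-01-01T00:00:00.000",  # made up: exactly on the cutoff
])

# %f accepts 1 to 6 fractional digits, so it matches the ".000" suffix.
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S.%f")

cutoff = pd.Timestamp("2008-01-01")
kept = parsed[parsed >= cutoff]
removed = len(parsed) - len(kept)
print(f"kept {len(kept)} rows, removed {removed}")
```

The same subtraction of row counts before and after filtering is how the number of removed cases can be checked on the full dataset.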

In [13]:
# Parse the ISO-style timestamp strings; %f matches the 3-digit millisecond suffix (it accepts 1 to 6 fractional digits)
police_data['offense_start_date_cleaned'] = pd.to_datetime(police_data['offense_start_datetime'], format='%Y-%m-%dT%H:%M:%S.%f')

# Drop offense_start_datetime before 2008-01-01
police_data_cleaned = police_data[police_data['offense_start_date_cleaned'] >= datetime.strptime('2008-01-01', '%Y-%m-%d')]

# Drop the original (unparsed) date-time string column
police_data_cleaned = police_data_cleaned.drop(columns=['offense_start_datetime'])

#print(police_data)
print(police_data_cleaned)

       crime_against_category                      offense_parent_group  \
0                     SOCIETY                    DRUG/NARCOTIC OFFENSES   
1                    PROPERTY                             LARCENY-THEFT   
2                    PROPERTY                                   ROBBERY   
3                    PROPERTY  DESTRUCTION/DAMAGE/VANDALISM OF PROPERTY   
4                     SOCIETY               DRIVING UNDER THE INFLUENCE   
...                       ...                                       ...   
904309               PROPERTY                       MOTOR VEHICLE THEFT   
904310               PROPERTY                       MOTOR VEHICLE THEFT   
904311               PROPERTY                             LARCENY-THEFT   
904312               PROPERTY                             LARCENY-THEFT   
904313               PROPERTY                       MOTOR VEHICLE THEFT   

                                            offense                 mcpp  \
0                      

### 5-2: Crime types that increased after the onset of the COVID-19 pandemic

To determine whether certain types of crime became more frequent after the onset of the COVID-19 pandemic, I will first tally the monthly total of every crime type from 2019-01-01 to 2020-12-31. Then, I will make a time series graph of all crime types to visually inspect whether certain crimes trend upward after 2020-01-01.
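
A minimal sketch of this monthly tally, run on made-up rows rather than the real SPD data. The helper name `monthly_counts` is hypothetical; the column names follow the cleaned data frame from section 5-1.

```python
import pandas as pd

# Made-up rows that mimic two columns of the cleaned data (not real SPD records).
toy = pd.DataFrame({
    "offense_start_date_cleaned": pd.to_datetime([
        "2019-03-02", "2019-03-15", "2020-03-01",
        "2020-03-20", "2020-03-25", "2020-04-02",
    ]),
    "offense_parent_group": [
        "LARCENY-THEFT", "ASSAULT OFFENSES", "LARCENY-THEFT",
        "ASSAULT OFFENSES", "ASSAULT OFFENSES", "LARCENY-THEFT",
    ],
})

def monthly_counts(df, start="2019-01-01", end_exclusive="2021-01-01"):
    """Count offenses per calendar month and crime type within [start, end_exclusive)."""
    when = df["offense_start_date_cleaned"]
    in_range = df[(when >= pd.Timestamp(start)) & (when < pd.Timestamp(end_exclusive))]
    month = in_range["offense_start_date_cleaned"].dt.to_period("M").rename("month")
    # Rows: months; columns: crime types; missing combinations become 0.
    return in_range.groupby([month, "offense_parent_group"]).size().unstack(fill_value=0)

counts = monthly_counts(toy)
print(counts)
```

The resulting table has one row per month and one column per crime type, which is a convenient shape for drawing one line per crime type in a time series graph.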



**Research Question 2:**

For this question, I will tally the total number of crimes against persons and against property on every calendar day. By conducting a Student's t-test on the two sets of daily counts, I can determine whether the difference in frequency between crimes against persons and crimes against property is statistically significant.
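
The daily tally and test could look roughly like the sketch below, which runs on made-up counts. It assumes `scipy` is installed and uses Welch's variant of the t-test (`equal_var=False`), which does not assume equal variances; that choice and the helper name `daily_counts` are my assumptions, not part of the plan above.

```python
import pandas as pd
from scipy import stats

# Made-up daily totals: day -> (property incidents, person incidents).
counts_by_day = {
    "2020-06-01": (3, 1),
    "2020-06-02": (2, 0),
    "2020-06-03": (4, 1),
    "2020-06-04": (3, 2),
}
rows = []
for day, (n_property, n_person) in counts_by_day.items():
    rows += [(day, "PROPERTY")] * n_property + [(day, "PERSON")] * n_person
toy = pd.DataFrame(rows, columns=["offense_start_date_cleaned", "crime_against_category"])
toy["offense_start_date_cleaned"] = pd.to_datetime(toy["offense_start_date_cleaned"])

def daily_counts(df, category, all_days):
    """Number of offenses of one category on each day; days with none count as 0."""
    subset = df[df["crime_against_category"] == category]
    day = subset["offense_start_date_cleaned"].dt.floor("D")
    return subset.groupby(day).size().reindex(all_days, fill_value=0)

all_days = pd.date_range("2020-06-01", "2020-06-04", freq="D")
property_daily = daily_counts(toy, "PROPERTY", all_days)
person_daily = daily_counts(toy, "PERSON", all_days)

# Welch's t-test on the two sets of daily counts.
result = stats.ttest_ind(property_daily, person_daily, equal_var=False)
print(result.statistic, result.pvalue)
```

Reindexing over every calendar day matters here: without it, days with zero crimes in a category would silently drop out and inflate that category's daily mean.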

**Research Question 3:**

After filtering the data to the University District, I will tally the number of crime occurrences by type for each day and record the three (or five) most frequent types. Then, I will compile the daily top lists and determine which crime types appear in the daily top list on the greatest number of days. To make the results easier for readers to understand, I will visualize them with a bar graph and a table.

I decided not to simply add up all crime incidents across all days to find the most frequent crime types, because that analysis does not focus on the type of crime that people are most likely to encounter on a given day. In addition, the total number of crimes can change noticeably during periods when the University District is less active (such as summer quarter), so determining the most frequent crime types within each day can answer my research question more accurately.
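
One way to compute the "daily top list" counts is sketched below on made-up rows. The helper name `days_in_daily_top` is hypothetical, and the district label `"UNIVERSITY"` is an assumption about how the `mcpp` column names the University District; it should be checked against the actual values in the data.

```python
import pandas as pd

# Made-up rows (not real SPD records); the last row belongs to another district.
toy = pd.DataFrame({
    "offense_start_date_cleaned": pd.to_datetime([
        "2020-05-01", "2020-05-01", "2020-05-01",
        "2020-05-02", "2020-05-02", "2020-05-02",
        "2020-05-03", "2020-05-03", "2020-05-03",
        "2020-05-01",
    ]),
    "offense_parent_group": [
        "LARCENY-THEFT", "LARCENY-THEFT", "BURGLARY/BREAKING&ENTERING",
        "LARCENY-THEFT", "ASSAULT OFFENSES", "ASSAULT OFFENSES",
        "LARCENY-THEFT", "LARCENY-THEFT", "LARCENY-THEFT",
        "ROBBERY",
    ],
    "mcpp": ["UNIVERSITY"] * 9 + ["CAPITOL HILL"],
})

def days_in_daily_top(df, district="UNIVERSITY", top_n=3):
    """For each crime type, count the days on which it ranks in that day's top_n."""
    local = df[df["mcpp"] == district]
    day = local["offense_start_date_cleaned"].dt.floor("D").rename("day")
    per_day = (
        local.groupby([day, "offense_parent_group"]).size().rename("n").reset_index()
    )
    # Rank crime types within each day; ties share the better rank.
    per_day["rank"] = per_day.groupby("day")["n"].rank(method="min", ascending=False)
    top = per_day[per_day["rank"] <= top_n]
    return top["offense_parent_group"].value_counts()

top_days = days_in_daily_top(toy, top_n=1)
print(top_days)
```

With `method="min"`, tied crime types all make the daily list, so a day can contribute more than `top_n` types; whether ties should be kept or broken is a design choice left open here.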

**Research Question 4:**

First, I will filter the entire dataset to keep only homicide offenses, and then tally the annual number of homicides. Finally, I will make a time series graph of the annual homicide counts, so that readers can see whether Seattle had a record-high homicide count in 2020.
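
The annual tally can be sketched as follows on made-up rows. The helper name `annual_homicides` is hypothetical; the `"HOMICIDE OFFENSES"` label is the `offense_parent_group` value used for filtering later in this notebook, and the sketch counts dataset rows, which may not equal the number of victims.

```python
import pandas as pd

# Made-up rows (not real SPD records).
toy = pd.DataFrame({
    "offense_start_date_cleaned": pd.to_datetime([
        "2018-04-10", "2019-01-05", "2019-08-21",
        "2020-02-14", "2020-06-30", "2020-11-02",
        "2020-03-03",
    ]),
    "offense_parent_group": ["HOMICIDE OFFENSES"] * 6 + ["LARCENY-THEFT"],
})

def annual_homicides(df):
    """Count rows labeled as homicide offenses per calendar year."""
    homicides = df[df["offense_parent_group"] == "HOMICIDE OFFENSES"]
    years = homicides["offense_start_date_cleaned"].dt.year
    return years.value_counts().sort_index()

per_year = annual_homicides(toy)
record_year = per_year.idxmax()
print(per_year)
print("Year with the highest count:", record_year)
```

On the real data, comparing `record_year` with 2020 answers the question directly, while the sorted yearly counts feed the time series graph.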

## Ethical considerations

* Crime classification is subjective, and one person may categorize a crime differently than another person.
* Offenses for which the accused is subsequently acquitted may not be removed from this dataset, resulting in a higher number of crime entries than actual prosecutable offenses.

## Unknowns and dependencies

* The Seattle Crime dataset may not contain all crimes that actually happened in the city. The website stated that only finalized (UCR approved) reports are published in this dataset. Those in draft, awaiting approval, or completed after the update may be published at a later date.
* As crime investigations proceed, crime entries may be retroactively added, updated, or removed, which may change historical data when readers re-run my Jupyter notebook at a later date.
* This dataset does not include crimes that were committed but have not yet been discovered by any police department.
* Many crimes are reported some time after they occur, so the recorded offense date and time may not be accurate.
* University District covers a larger area than the UW itself (even though this district is very related to the UW community), and some readers may disagree with this geographical scope.

In [None]:
#ggplot2 in R has a very intuitive and easy-to-understand plotting syntax (and there are a lot of articles on ggplot due to its popularity)
#One noticeable limitation of plotnine is that it cannot draw pie charts (it lacks a coord_polar implementation)

#!conda install plotnine # if plotnine is not already installed
#!pip install plotnine # if plotnine is not already installed
from plotnine import * #plotnine contains ggplot function
# from plotnine.data import mpg #import ggplot's sample data (such as mpg), useful for debugging

In [None]:
# Use the cleaned frame so that rows dated before 2008-01-01 are excluded
personal_crimes = police_data_cleaned[police_data_cleaned['crime_against_category'] == "PERSON"]

In [None]:
print(personal_crimes)

In [None]:
print(personal_crimes['offense_parent_group'].unique())

In [None]:
# Use the cleaned frame so that rows dated before 2008-01-01 are excluded
homicide_offense = police_data_cleaned[police_data_cleaned['offense_parent_group'] == "HOMICIDE OFFENSES"]

In [None]:
print(homicide_offense)