# The effects of COVID-19 on crime rates in Vancouver

Group project proposal:

Sean Lee, Neil Li, Tracy Wang, Wendi Zhong

## Introduction

Before the pandemic, our teammate Neil had experienced no crime more serious than perhaps public drunkenness. Since the pandemic started, he has been subjected to two separate attempts at grand theft auto and one shooting. This made us wonder: is this simply a streak of bad luck, or is it a result of the pandemic?

The answer is not so simple. Research (Nivette et al., 2021) has shown that crime rates decreased because lockdowns forced people to stay in their homes, but there are also arguments that the economic downturn (Munywoki, 2020) could lead more people into committing crimes. So the pandemic could plausibly have led to both fewer and more crimes.

### Research Question:

<b>Has COVID-19 affected the number of crimes in Vancouver?</b>

In [4]:
library(tidyverse)
library(datateachr)
library(repr)
library(digest)
library(infer)
library(grid)
library(RCurl)


Attaching package: ‘RCurl’


The following object is masked from ‘package:tidyr’:

    complete




## Dataset Info:

The dataset is downloaded from "[Vancouver Crime Data](https://geodash.vpd.ca/opendata/)", an open dataset provided by the Vancouver Police Department. We selected the version that lists all crimes recorded in every neighbourhood in Vancouver since 2003. Because the data come directly from the Vancouver Police Department, we are confident that the dataset is trustworthy and representative, although crimes that were never reported are necessarily missing from it.

In [7]:
crimes_url <- "https://raw.githubusercontent.com/NeilLi26/STAT201-project/main/crimedata_csv_AllNeighbourhoods_AllYears.csv"
crime_data <- read.csv(crimes_url)
head(crime_data)

| | TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | Break and Enter Commercial | 2012 | 12 | 14 | 8 | 52 | | Oakridge | 491285.0 | 5453433 |
| 2 | Break and Enter Commercial | 2019 | 3 | 7 | 2 | 6 | 10XX SITKA SQ | Fairview | 490613.0 | 5457110 |
| 3 | Break and Enter Commercial | 2019 | 8 | 27 | 4 | 12 | 10XX ALBERNI ST | West End | 491007.8 | 5459174 |
| 4 | Break and Enter Commercial | 2021 | 4 | 26 | 4 | 44 | 10XX ALBERNI ST | West End | 491007.8 | 5459174 |
| 5 | Break and Enter Commercial | 2014 | 8 | 8 | 5 | 13 | 10XX ALBERNI ST | West End | 491015.9 | 5459166 |
| 6 | Break and Enter Commercial | 2020 | 7 | 28 | 19 | 12 | 10XX ALBERNI ST | West End | 491015.9 | 5459166 |


Because we want the crime data to reflect the difference between the years leading up to the pandemic and the years during and after it, we filter the data to include only the years from 2017 onwards. We also keep only the months before November (January to October), since the 2022 data did not yet include November at the time of writing. Finally, we need only the columns containing the type of crime and the year in which it was committed.

In [None]:
crime_data_processed <- crime_data %>%
    filter(YEAR >= 2017, MONTH <= 10) %>%
    select(TYPE, YEAR)

head(crime_data_processed)

In [None]:
# take a single sample of size 2000 from the population

set.seed(2190)

crime_sample <- crime_data_processed %>%
    rep_sample_n(size = 2000, replace = FALSE) %>%
    mutate(Pandemic = ifelse(YEAR < 2020, "Before", "After")) %>%
    ungroup() %>%
    select(-replicate)  # drop the replicate column so the sample can be resampled later
head(crime_sample)

We first visualize the overall spread of crime over the six years by taking a single sample of size 2000 and drawing 1000 bootstrap resamples from it.

In [None]:
# create 1000 bootstrap resamples of size 2000 and compute, for each one, the difference between
# the number of crimes committed before (YEAR < 2020) and after the start of the pandemic

set.seed(2190)
bootstrap_sample <- crime_sample %>%
    rep_sample_n(size = 2000, reps = 1000, replace = TRUE)%>%
    group_by(replicate,Pandemic)%>%
    summarize(n = n())%>%
    pivot_wider(names_from = Pandemic, values_from = n) %>%
    mutate(diff = Before - After) 
    
head(bootstrap_sample)

# calculate the mean of the difference in crimes committed
sample_mean <- mean(bootstrap_sample$diff)
sample_mean

In [None]:
#Visualize the bootstrap distribution
bootstrap_sampling_distribution <- bootstrap_sample%>%
    ggplot(aes(x = diff)) +
    geom_histogram(binwidth = 10) +
    xlab("Difference in Crimes Committed Before and After Pandemic") +
    ggtitle("Bootstrap Sampling Distribution") 
    

bootstrap_sampling_distribution



In [None]:
# obtain 95% confidence interval
ci <- bootstrap_sample %>%
    ungroup() %>%
    select(stat = diff) %>%  # get_ci() expects the statistic in a column named `stat`
    get_ci(level = 0.95, type = "percentile")
ci

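Because this is a large dataset, we also have the luxury of drawing many large samples directly from the population. With samples this large, we can apply the central limit theorem to describe the sampling distribution of the difference and obtain a second estimate of its mean and spread.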
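As a rough sketch of how the central limit theorem could be applied here: in a sample of $n$ crimes, the number committed before the pandemic is approximately a binomial count $B$ (treating the population as much larger than the sample), so the difference `Before - After` equals $2B - n$ and has standard error $2\sqrt{n\hat{p}(1-\hat{p})}$, where $\hat{p} = B/n$. The helper `clt_ci_diff` below is our own hypothetical function, not part of `infer`, and the count of 1000 is a made-up value used only for illustration.

```r
# Normal-approximation (CLT) confidence interval for diff = Before - After
# in a sample of n crimes. `clt_ci_diff` is a hypothetical helper.
clt_ci_diff <- function(before_count, n, level = 0.95) {
    p_hat <- before_count / n                 # sample proportion of "Before" crimes
    diff_hat <- 2 * before_count - n          # Before - After = B - (n - B)
    se <- 2 * sqrt(n * p_hat * (1 - p_hat))   # standard error of 2B - n
    z <- qnorm(1 - (1 - level) / 2)           # normal critical value
    c(lower = diff_hat - z * se, estimate = diff_hat, upper = diff_hat + z * se)
}

# Made-up example: 1000 of 2000 sampled crimes occurred before 2020
example_ci <- clt_ci_diff(before_count = 1000, n = 2000)
print(example_ci)
```

With the real data, `before_count` would be the number of "Before" rows in `crime_sample`, and the resulting interval could be compared against the bootstrap percentile interval.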

In [None]:
#Visualize the bootstrap distribution with 95% confidence interval

ci_plot <- bootstrap_sample%>%
    ggplot(aes(x = diff)) +
    geom_histogram (binwidth = 10, colour = "white", fill = "grey") +
    annotate("rect", xmin = ci$lower_ci, xmax = ci$upper_ci, ymin = 0, ymax = Inf,
             fill = "deepskyblue",
             alpha = 0.3) +
    xlab("Difference in Crimes Committed Before and After Pandemic")+
    ggtitle("Bootstrap Distribution with 95% Confidence Interval") +
    geom_vline(aes(xintercept= sample_mean), colour = "red")
ci_plot

In [None]:
# calculate mean and standard deviation on the difference between the total amount of 
# crime before and after the pandemic using the central limit theorem and obtain a 95% 
# confidence interval from this

In [None]:
# take 1000 samples of size 2000 from the population

set.seed(2190)

samples_CLT <- crime_data_processed %>%
    rep_sample_n(size = 2000, reps = 1000, replace = FALSE) %>%
    mutate(Pandemic = ifelse(YEAR < 2020, "Before", "After"))

head(samples_CLT)

In [None]:
# calculate difference of crimes before and after 2020

diff_sampling_dist_CLT <- 
samples_CLT %>%
    group_by(replicate , Pandemic)%>%
    summarize(n = n()) %>%
    pivot_wider(names_from = Pandemic, values_from = n) %>%
    mutate(diff = Before - After) 

head(diff_sampling_dist_CLT)

In [None]:
# Visualize size 2000 sampling distribution

diff_sampling_dist_plot_CLT <- 
diff_sampling_dist_CLT %>%
   ggplot(aes(x = diff)) +
   geom_histogram(bins = 20, color = 'white') +
   ggtitle("Sampling Distribution of Difference of Crimes Before and After Covid") +
   xlab("Difference of Crimes Before and After Covid") +
   theme(text = element_text(size = 14))

diff_sampling_dist_plot_CLT

In [None]:
# Get mean of sampling distribution
mean_CLT <- mean(diff_sampling_dist_CLT$diff)
mean_CLT

In [None]:
# Get 95% confidence interval
ci_CLT <- diff_sampling_dist_CLT %>%
            ungroup() %>%
            select(stat = diff) %>%  # get_confidence_interval() expects a column named `stat`
            get_confidence_interval(type = "percentile", level = 0.95)
ci_CLT

In [None]:
# label the mean on the sampling distribution
# shade 95% confidence interval on the sampling distribution

sample_quantile_plot <- 
    diff_sampling_dist_CLT %>% 
      ggplot(aes(x = diff)) +
      geom_histogram(bins = 25, color = 'white') +
      geom_vline(xintercept = mean_CLT, colour = "red", size = 1) +
      annotate("rect", 
              xmin = ci_CLT$lower_ci,
              xmax = ci_CLT$upper_ci,
              ymin = 0,
              ymax = Inf,
              fill = "deepskyblue",
              alpha = 0.3) +
      ggtitle("Sampling Distribution of Difference of Crimes Before and After Covid") +
      xlab("Difference of Crimes Before and After Covid") +
      theme(text = element_text(size = 14))

sample_quantile_plot

## Methods: Plan

To investigate the difference in crime rates before and after the pandemic statistically, we first state our hypotheses.
The null hypothesis is that there is no difference in the number of crimes before and after the pandemic; the alternative hypothesis is that there is a difference.

### Hypothesis:

$H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 \neq 0$
    
$\mu_1$: the average total crimes committed before the outbreak

$\mu_2$: the average total crimes committed after the outbreak
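A minimal sketch of how these hypotheses could be tested, assuming we average over monthly crime totals (the proposal does not fix the unit of averaging, so this is our assumption). The monthly counts below are simulated purely so that the example runs; they are not values from the VPD dataset, and Welch's two-sample t-test is one possible choice of test rather than a settled part of our method.

```r
# Hypothetical monthly crime totals, simulated only to make the example runnable;
# these are NOT values from the VPD dataset.
set.seed(2190)
monthly_before <- rpois(30, lambda = 3000)   # 30 months (Jan-Oct of 3 years) before 2020
monthly_after  <- rpois(30, lambda = 2800)   # 30 months (Jan-Oct of 3 years) from 2020 onwards

# Welch two-sample t-test of H0: mu_1 - mu_2 = 0 against a two-sided alternative
test_result <- t.test(monthly_before, monthly_after,
                      alternative = "two.sided", mu = 0)
test_result$statistic   # observed t statistic
test_result$p.value     # reject H0 at the 5% level if this is below 0.05
```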

The distribution plots from bootstrapping and from the CLT approach are both unimodal and symmetric. In both, we shaded the 95% confidence interval and marked the mean with a red vertical line. A 95% confidence level means that, if we repeated the sampling procedure many times, we would expect about 95% of the resulting intervals to contain the true difference. We have high confidence in the samples we generated, as their large sample sizes give us narrower distributions.

Given the two preliminary distributions we have created, we expect to reject the null hypothesis, since the difference between $\mu_1$ and $\mu_2$ does not appear to be 0. We hope that the full analysis provides further statistical evidence for rejecting the null hypothesis in favour of the alternative. This would be valuable information, as it provides some insight into the social consequences of the pandemic. Further research could examine how different kinds of crimes were affected by the pandemic.
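As a starting point for that follow-up, crime types could be cross-tabulated against the period. The sketch below uses a small invented data frame with the same `TYPE` and `YEAR` columns as `crime_data_processed`; the rows are illustrative only and are not taken from the dataset.

```r
# Toy data frame with the same TYPE and YEAR columns as crime_data_processed;
# the rows are invented for illustration only.
toy_crimes <- data.frame(
    TYPE = c("Mischief", "Mischief", "Theft of Vehicle",
             "Theft of Vehicle", "Mischief", "Theft of Vehicle"),
    YEAR = c(2018, 2021, 2019, 2021, 2022, 2017)
)
toy_crimes$Pandemic <- ifelse(toy_crimes$YEAR < 2020, "Before", "After")

# Cross-tabulate crime type against period
type_counts <- table(toy_crimes$TYPE, toy_crimes$Pandemic)
type_counts
```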

# References:

Ferguson, E. (2015). Crime and punishment vocabulary with pronunciation. IELTS Liz. Retrieved October 31, 2022, from https://ieltsliz.com/crime-and-punishment-vocabulary/ 

Munywoki, G. (2020). Economic effects of novel coronavirus (COVID – 19) on the global economy. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3719130 

Nivette, A. E., Zahnow, R., Aguilar, R., et al. (2021). A global analysis of the impact of COVID-19 stay-at-home restrictions on crime. Nature Human Behaviour, 5, 868–877. https://doi.org/10.1038/s41562-021-01139-z

Vancouver Police Department. (n.d.). Crime data download. VPD Open Data. Retrieved October 31, 2022, from https://geodash.vpd.ca/opendata/