# The effects of COVID-19 on crime rates in Vancouver

Group project proposal:

Sean Lee, Neil Li, Tracy Wang, Wendi Zhong

## Introduction

Before the pandemic, our teammate Neil had experienced no crime more serious than perhaps public drunkenness, but since the pandemic started, he has been subjected to two attempts at grand theft auto and one shooting. This made us wonder: is this simply a streak of bad luck, or is it a result of the pandemic?

The answer is not so simple. Research (Nivette et al., 2021) has shown that crime rates decreased as lockdowns forced people to stay in their homes, but there are also arguments that the economic downturn (Munywoki, 2020) could lead more people to commit crimes for financial reasons.

The answer could have practical ramifications: for example, it could inform governments about the potential social benefits of providing stimulus cheques.

### Research Question:

<b>Has COVID-19 affected the proportion of financially motivated crimes in Vancouver?</b>
    
### Hypothesis:

$H_0: p_1 - p_2 = 0$ vs $H_1: p_1 - p_2 \neq 0$
    
$p_1$: proportion of financially motivated crimes before the pandemic

$p_2$: proportion of financially motivated crimes after the start of the pandemic

In [None]:
library(tidyverse)
library(datateachr)
library(repr)
library(digest)
library(infer)
library(grid)

## Dataset Info:

The dataset is downloaded from "[Vancouver Crime Data](https://geodash.vpd.ca/opendata/)", an open dataset provided by the Vancouver Police Department. We selected the option that lists all the crimes recorded in every neighbourhood in Vancouver since 2003. Because the data come directly from the Vancouver Police Department, we are confident that the dataset is trustworthy and representative of reported crime, although crimes that were never reported are necessarily missing.

In [None]:
# reading data
crimes_url <- "https://raw.githubusercontent.com/NeilLi26/STAT201-project/main/crimedata_csv_AllNeighbourhoods_AllYears.csv"
crime_data <- read.csv(crimes_url)
head(crime_data)

Because we want the crime data to reflect the difference between the years leading up to the pandemic and the years during and after it, we filter the data to include only the years from 2017 onwards. We also keep only the months before November, since the 2022 data do not yet include November. Finally, we need only the columns containing the type of crime and the year it was committed.

In [None]:
# selecting data from 2017 onwards and from months before November
crime_data_processed <- crime_data %>%
    filter(YEAR >= 2017, MONTH <= 10) %>%
    select(TYPE, YEAR)
head(crime_data_processed)

We then look at what kinds of crimes are recorded. From the list produced below, and some further research of our own, we deem the "Break and Enter" and "Theft" crime types to be financially motivated:

In [None]:
crime_types <- crime_data_processed %>%
    select(TYPE) %>%
    group_by(TYPE) %>%
    summarise(n = n())
crime_types
financial_crimes <- c("Break and Enter Commercial", "Break and Enter Residential/Other", "Other Theft", "Theft from Vehicle",
                     "Theft of Bicycle", "Theft of Vehicle")
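
As a minimal sketch of how this classification turns into a proportion, the base R example below flags each row of a small, made-up data frame and computes the share of financially motivated crimes in each period. The `toy_crimes` rows and the `financial` and `prop_by_period` names are hypothetical and only for illustration; the real analysis uses `crime_data_processed`.

```r
# Hypothetical toy data; the TYPE values mimic category names from the VPD dataset
toy_crimes <- data.frame(
  TYPE = c("Theft of Bicycle", "Mischief", "Other Theft", "Theft of Vehicle", "Mischief"),
  YEAR = c(2018, 2019, 2020, 2021, 2021)
)

# Crime types we treat as financially motivated (same vector as in the notebook)
financial_crimes <- c("Break and Enter Commercial", "Break and Enter Residential/Other",
                      "Other Theft", "Theft from Vehicle",
                      "Theft of Bicycle", "Theft of Vehicle")

# TRUE when the crime type is in the financially motivated list
toy_crimes$financial <- toy_crimes$TYPE %in% financial_crimes

# Label each row by period, using the same YEAR < 2020 cut-off as the notebook
toy_crimes$Pandemic <- ifelse(toy_crimes$YEAR < 2020, "Before", "After")

# The mean of a logical vector is the proportion of TRUE values in each period
prop_by_period <- tapply(toy_crimes$financial, toy_crimes$Pandemic, mean)
prop_by_period
```

The difference between the two entries of `prop_by_period` is the quantity that the rest of the analysis estimates.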

We first decided to visualize the overall spread of the difference in the proportion of financially motivated crimes over the six years by taking a sample of size 2000 and drawing 1000 bootstrap samples from it.

In [None]:
# take a single sample of size 2000 from the population
set.seed(2190)
sample_size <- 2000

crime_sample <- crime_data_processed %>%
    rep_sample_n(size = sample_size, replace = FALSE) %>%
    ungroup() %>%
    select(-replicate) %>% # drop the replicate column added by rep_sample_n before resampling
    mutate(Pandemic = ifelse(YEAR < 2020, "Before", "After"))

# create 1000 bootstrap samples of size 2000 and, for each one, compute the difference in the
# proportion of financially motivated crimes before (YEAR < 2020) and after the pandemic began
set.seed(2190)
bootstrap_sample <- crime_sample %>%
    rep_sample_n(size = sample_size, reps = 1000, replace = TRUE)%>%
    group_by(replicate,Pandemic)%>%
    summarize(prop = sum(TYPE %in% financial_crimes)/n())%>%
    pivot_wider(names_from = Pandemic, values_from = prop) %>%
    mutate(diff = Before - After) 
    
head(bootstrap_sample)
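
The same resampling idea can be sketched in base R on simulated data, which may make the individual steps easier to follow. Everything here is an assumption for illustration: the `toy` data frame, the made-up rates 0.60 and 0.55, the seed, and the helper `one_boot_diff` are hypothetical and are not part of our analysis.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 2000

# Hypothetical sample: each row is a crime labelled by period
toy <- data.frame(Pandemic = sample(c("Before", "After"), n, replace = TRUE))

# Simulate whether each crime is financially motivated (made-up rates)
toy$financial <- rbinom(n, 1, ifelse(toy$Pandemic == "Before", 0.60, 0.55)) == 1

# One bootstrap replicate: resample rows with replacement, then take the
# difference in proportions (Before - After) within the resample
one_boot_diff <- function(d) {
  resampled <- d[sample(nrow(d), replace = TRUE), ]
  props <- tapply(resampled$financial, resampled$Pandemic, mean)
  as.numeric(props["Before"] - props["After"])
}

# 1000 bootstrap differences and a 95% percentile interval
boot_diffs <- replicate(1000, one_boot_diff(toy))
ci <- quantile(boot_diffs, c(0.025, 0.975))
ci
```

As in the cell above, the whole sample is resampled and then split by period, so the number of "Before" and "After" rows varies slightly from one replicate to the next.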



In [None]:
#Visualize the bootstrap distribution
bootstrap_sampling_distribution <- bootstrap_sample%>%
    ggplot(aes(x = diff)) +
    geom_histogram(binwidth = 0.01) +
    xlab("Difference in Proportion of Financially Motivated Crimes (Before - After)") +
    ggtitle("Bootstrap Sampling Distribution") 
    
bootstrap_sampling_distribution

In [None]:
# calculate the mean and variance of the bootstrap differences in proportions
sample_mean <- mean(bootstrap_sample$diff)
sample_var <- var(bootstrap_sample$diff)

#obtain 95% confidence interval 
ci <- bootstrap_sample %>%
    get_ci(level = 0.95, type = "percentile")
bootstrap_Table <- data.frame(sample_mean, sample_var) %>%
    merge(ci)
bootstrap_Table

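Because this is a large dataset, we have the luxury of drawing many large samples from it to approximate the sampling distribution of the difference in proportions directly. With samples this large, we can also apply the central limit theorem, which says that this sampling distribution should be approximately normal.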
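
As a sketch of what the central limit theorem gives in this setting, assume independent samples of sizes $n_1$ and $n_2$ from the periods before and after the start of the pandemic ($n_1$, $n_2$, $\hat{p}_1$ and $\hat{p}_2$ are symbols introduced here for illustration). The difference in sample proportions is then approximately normal, with the second argument below denoting the variance:

$$\hat{p}_1 - \hat{p}_2 \;\overset{\text{approx.}}{\sim}\; N\!\left(p_1 - p_2,\; \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right)$$

In our sampling scheme the group sizes are not fixed in advance, because each sample of 2000 crimes is split by period afterwards, so this should be read as an approximation rather than an exact statement.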

In [None]:
#Visualize the bootstrap distribution with 95% confidence interval
ci_plot <- bootstrap_sample%>%
    ggplot(aes(x = diff)) +
    geom_histogram (binwidth = 0.01, colour = "white", fill = "grey") +
    annotate("rect", xmin = ci$lower_ci, xmax = ci$upper_ci, ymin = 0, ymax = Inf,
             fill = "deepskyblue",
             alpha = 0.3) +
    xlab("Difference in Proportion of Financially Motivated Crimes (Before - After)")+
    ggtitle("Bootstrap Distribution with 95% Confidence Interval") +
    geom_vline(aes(xintercept= sample_mean), colour = "red")
ci_plot


In [None]:
# hypothesis test for the difference in proportions based on bootstrapping;
# see Tutorial 6, question 3 for examples

# calculate the observed test statistic (approximated here by the mean of the bootstrap differences)
obs_test_stat <- mean(bootstrap_sample$diff)
    
obs_test_stat

# construct the null model by shifting the bootstrap distribution so it is centred at the null value 0

null_model <- 
    bootstrap_sample %>% 
    mutate(stat = diff - (obs_test_stat - 0 ) )
head(null_model)
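
In symbols, the shift above can be sketched as follows. The notation is introduced here for illustration: $\hat{\delta}^{*}_b$ is the difference in proportions in bootstrap replicate $b$ (the `diff` column), $\hat{\delta}_{\text{obs}}$ is the observed test statistic, and $B = 1000$ is the number of replicates.

$$\hat{\delta}^{0}_b = \hat{\delta}^{*}_b - \left(\hat{\delta}_{\text{obs}} - 0\right), \qquad b = 1, \dots, B$$

For a two-sided alternative such as $H_1: p_1 - p_2 \neq 0$, one common choice of p-value is the share of shifted replicates that are at least as far from zero as the observed statistic:

$$\text{p-value} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left\{ \left|\hat{\delta}^{0}_b\right| \geq \left|\hat{\delta}_{\text{obs}}\right| \right\}$$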



In [None]:
# visualize the null model
null_model_plot <-
    null_model %>% 
    ggplot(aes(x = stat)) +
    geom_histogram() +
    geom_vline(xintercept = obs_test_stat, color = "red", alpha=.3, lwd=2)  +
    xlab("Difference in Proportion of Financially Motivated Crimes (Before - After)") +
    ggtitle("Null Distribution") 


null_model_plot


In [None]:
# obtain the two-sided p-value, matching the two-sided alternative H_1
p_value <- mean(abs(null_model$stat) >= abs(obs_test_stat))

p_value

In [None]:
# take 1000 samples of size 2000 from the population

set.seed(2190)

samples_CLT <- crime_data_processed %>%
    rep_sample_n(size = 2000, reps = 1000, replace = FALSE) %>%
    mutate(Pandemic = ifelse(YEAR < 2020, "Before", "After"))

head(samples_CLT)

In [None]:
# calculate the difference in the proportion of financially motivated crimes before and after 2020

diff_sampling_dist_CLT <- 
samples_CLT %>%
    group_by(replicate , Pandemic)%>%
    summarize(prop = sum(TYPE %in% financial_crimes)/n()) %>%
    pivot_wider(names_from = Pandemic, values_from = prop) %>%
    mutate(diff = Before - After) 

# Visualize size 2000 sampling distribution
diff_sampling_dist_plot_CLT <- 
diff_sampling_dist_CLT %>%
   ggplot(aes(x = diff)) +
   geom_histogram(bins = 20, color = 'white') +
   ggtitle("Sampling Distribution of the Difference in Proportions Before and After COVID-19") +
   xlab("Difference in Proportion of Financially Motivated Crimes (Before - After)") +
   theme(text = element_text(size = 14))
diff_sampling_dist_plot_CLT

In [None]:
# Get mean and var of sampling distribution
mean_CLT <- mean(diff_sampling_dist_CLT$diff)
var_CLT <- var(diff_sampling_dist_CLT$diff)

# Get 95% confidence interval
ci_CLT <- diff_sampling_dist_CLT %>%
            get_confidence_interval(type = "percentile", level = 0.95)

CLT_table <- data.frame(mean_CLT, var_CLT) %>%
    merge(ci_CLT)
CLT_table

In [None]:
# label the mean on the sampling distribution
# shade 95% confidence interval on the sampling distribution

sample_quantile_plot <- 
    diff_sampling_dist_CLT %>% 
      ggplot(aes(x = diff)) +
      geom_histogram(bins = 25, color = 'white') +
      geom_vline(xintercept = mean_CLT, colour = "red", size = 1) +
      annotate("rect", 
              xmin = ci_CLT$lower_ci,
              xmax = ci_CLT$upper_ci,
              ymin = 0,
              ymax = Inf,
              fill = "deepskyblue",
              alpha = 0.3) +
      ggtitle("Sampling Distribution of the Difference in Proportions Before and After COVID-19") +
      xlab("Difference in Proportion of Financially Motivated Crimes (Before - After)") +
      theme(text = element_text(size = 14))

sample_quantile_plot


In [None]:
# hypothesis test for the difference in proportions based on the central limit theorem;
# see Worksheet 8, question 3.4 for examples
set.seed(2190)
sample_size <- 2000

samples_hypothesis <- crime_data_processed %>%
    rep_sample_n(size = sample_size, reps = 1, replace = FALSE) %>%
    mutate(Pandemic = ifelse(YEAR < 2020, "Before", "After")) %>%
    group_by(replicate,Pandemic) %>%
    summarize(n = n(),
              prop = sum(TYPE %in% financial_crimes)/n())

samples_hypothesis

In [None]:
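# index each period by name rather than by row position (rows are sorted "After", then "Before")
is_before <- samples_hypothesis$Pandemic == "Before"
n1 <- samples_hypothesis$n[is_before]
n2 <- samples_hypothesis$n[!is_before]
p1 <- samples_hypothesis$prop[is_before]
p2 <- samples_hypothesis$prop[!is_before]
# pooled proportion under H0
p_hat <- (n1*p1 + n2*p2)/(n1 + n2)

# z statistic for the difference in proportions (Before - After)
test_statistic_theoretical <- (p1 - p2)/sqrt(p_hat * (1 - p_hat) * (1/n1 + 1/n2))

# two-sided p-value; abs() keeps it valid when the statistic is negative
p_score_theoretical <- 2 * pnorm(abs(test_statistic_theoretical), lower.tail = FALSE)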
p_score_theoretical
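
For reference, the calculation above corresponds to the standard pooled two-proportion $z$-test. The notation is introduced here for illustration: subscript 1 is the period before the pandemic and subscript 2 the period after it started, as in the hypothesis; $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions, $n_1$ and $n_2$ are the group sizes, and $\Phi$ is the standard normal CDF.

$$\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}, \qquad Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Under $H_0$, $Z$ is approximately standard normal, so a two-sided p-value is

$$\text{p-value} = 2\left(1 - \Phi\left(|Z|\right)\right)$$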

## Discussion

# Sean's next step goes here

# References:

Ferguson, E. (2015). Crime and punishment vocabulary with pronunciation. IELTS Liz. Retrieved October 31, 2022, from https://ieltsliz.com/crime-and-punishment-vocabulary/

Munywoki, G. (2020). Economic effects of novel coronavirus (COVID – 19) on the global economy. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3719130

Nivette, A. E., Zahnow, R., Aguilar, R., et al. (2021). A global analysis of the impact of COVID-19 stay-at-home restrictions on crime. Nature Human Behaviour, 5, 868–877. https://doi.org/10.1038/s41562-021-01139-z

Vancouver Police Department. (n.d.). Crime data download. VPD Open Data. Retrieved October 31, 2022, from https://geodash.vpd.ca/opendata/