# Crime Rate Econometric Analysis

# **Introduction**

Understanding the factors that influence crime rates is crucial for effective crime prevention and the development of evidence-based policies. One key area of research focuses on investigating the relationship between the probabilities of apprehension and punishment and crime participation rates. By studying the deterrence effect of the criminal justice system, we can gain valuable insights into the mechanisms behind criminal decision-making and inform policy interventions aimed at reducing crime. Researching the impact of these probabilities on crime participation rates has significant implications for policy and the allocation of resources within the criminal justice system. By understanding how variations in these probabilities influence criminal behaviour, policymakers can develop targeted strategies to enhance deterrence and reduce crime rates.

**RESEARCH QUESTION: Do the probabilities of apprehension and punishment affect the crime participation rate?**

# **About the Data**
The dataset describes crime in 90 counties in North Carolina for the years 1981 through 1987. Each county is observed once per year, giving a balanced panel of 90 counties over 7 years.

Total Number of Observations: 630

# Features:

1. county => county identifier
1. year => 81 to 87
1. crmrte => crimes committed per person
1. prbarr => 'probability' of arrest
1. prbconv => 'probability' of conviction
1. prbpris => 'probability' of prison sentence
1. avgsen => avg. sentence, days
1. polpc => police per capita
1. density => people per sq. mile.
1. taxpc => tax revenue per capita
1. west => =1 if in western N.C.
1. central => =1 if in central N.C.
1. urban => =1 if in SMSA
1. pctmin80 => perc. minority, 1980
1. wcon => weekly wage, construction
1. wtuc => wkly wge, trns, util, commun
1. wtrd => wkly wge, whlesle, retail trade
1. wfir => wkly wge, fin, ins, real est
1. wser => wkly wge, service industry
1. wmfg => wkly wge, manufacturing
1. wfed => wkly wge, fed employees
1. wsta => wkly wge, state employees
1. wloc => wkly wge, local gov emps
1. mix => offense mix: face-to-face/other
1. pctymle => percent young male
1. d82 => =1 if year == 82
1. d83 => =1 if year == 83
1. d84 => =1 if year == 84
1. d85 => =1 if year == 85
1. d86 => =1 if year == 86
1. d87 => =1 if year == 87
1. lcrmrte => log(crmrte)
1. lprbarr => log(prbarr)
1. lprbconv => log(prbconv)
1. lprbpris => log(prbpris)
1. lavgsen => log(avgsen)
1. lpolpc => log(polpc)
1. ldensity => log(density)
1. ltaxpc => log(taxpc)
1. lwcon => log(wcon)
1. lwtuc => log(wtuc)

**Methodology**

The dataset provided to us is panel data: it contains information on crime in 90 counties in North Carolina for the years 1981 through 1987. We start from the following OLS model.

**Model:** Ln(crmrte) = β_0 + β_1 Ln(prbarr) + β_2 Ln(prbconv) + β_3 prbpris + µ

**crmrte** is crimes committed per person. This is our dependent variable, and it enters the model in logs.

**prbarr** is the 'probability' of arrest: the chance that an individual will be apprehended by law enforcement for an offense. Its distribution is right-skewed (approximately log-normal), with most values concentrated in one part of the range and a long tail, so we take the log to bring it closer to a normal distribution.

**prbconv** is the 'probability' of conviction: the chance that someone who has been arrested will be found guilty in court and held legally responsible for the offense. Its distribution is also right-skewed (approximately log-normal), so we take the log for the same reason.

**prbpris** is the 'probability' of a prison sentence: the chance that someone who has been convicted is punished with incarceration rather than an alternative such as a fine or probation. It enters the model in levels.

**We expect these independent variables to be negatively related to the crime rate. In other words, we expect the coefficient of each independent variable to be less than zero.**

Because this is panel data, a simple OLS model is very likely to suffer from endogeneity caused by unobserved heterogeneity. “The unobserved dependency of other independent variable(s) is called unobserved heterogeneity and the correlation between the independent variable(s) and the error term (i.e. the unobserved independent variables) is called endogeneity.”

There are five assumptions in simple linear regression:

1. Linearity
2. Exogeneity
3. (a) Homoskedasticity, and (b) Non-autocorrelation
4. Independent variables are not stochastic
5. No multicollinearity

Two of these assumptions help us decide whether to use Pooled OLS or a Fixed Effect or Random Effect model. If assumption 2 or 3, or both, are violated, then the Fixed Effect and Random Effect models are more suitable than Pooled OLS.
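The role of unobserved heterogeneity can be made explicit by splitting the error term of the model above. This is a sketch in standard unobserved-effects notation; the symbols a_i (a time-constant county effect) and u_it (an idiosyncratic error) are introduced here for illustration and do not appear in the original notebook.

```latex
\ln(crmrte_{it}) = \beta_0 + \beta_1 \ln(prbarr_{it}) + \beta_2 \ln(prbconv_{it})
                 + \beta_3\, prbpris_{it} + \mu_{it},
\qquad \mu_{it} = a_i + u_{it}

% Pooled OLS is consistent only if the county effect is uncorrelated
% with every regressor x_{it}:
\operatorname{Cov}(x_{it}, a_i) = 0

% Because a_i appears in every period for county i, the composite errors
% of the same county are correlated over time even when u_{it} is not:
\operatorname{Cov}(\mu_{it}, \mu_{is}) = \operatorname{Var}(a_i) \neq 0, \quad t \neq s
```

The second line corresponds to assumption 2 (exogeneity) and the third to assumption 3(b) (non-autocorrelation).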

**1. Pooled OLS**
We start by importing the required libraries and our dataset. Then we fit the Pooled OLS model and validate the required assumptions.

In [None]:
library(lmtest)
library(readxl)
library(estimatr)
library(plm)

In [None]:
# loading data

df <- read_excel("/kaggle/input/crimerate/Crime_NorthCarolina.xlsx")
head(df,n=3)

In [None]:
# pooled OLS model

model <- lm_robust(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, data = df)
summary(model)

All the coefficients are negative except the one on the probability of a prison sentence. According to the Pooled OLS model it is positively related to the crime rate: as the probability of a prison sentence increases, the crime rate also increases. This goes against our initial expectation that all three independent variables should be negatively related to the crime rate. A likely cause is that the pooled model is inconsistent due to unobserved heterogeneity.

# 2. Validating LR Assumptions
# Assumption 3(a)
We check assumption (3a), homoskedasticity, using the Breusch-Pagan test (also known as the Breusch-Pagan-Godfrey test), a statistical test for heteroscedasticity in a regression model. Heteroscedasticity means that the variance of the error terms (residuals) is not constant across levels of the independent variables. In other words, the spread of the residuals is not the same throughout the range of the predictors.

We run the test on our model and observe that our p-value is less than 0.05, so we reject H0 (homoskedasticity) and conclude that **heteroscedasticity is present in the regression model and assumption 3(a) is violated.**

In [None]:
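# Perform Breusch-Pagan test on the pooled OLS specification.
# bptest() is documented for a formula or an lm fit, so the formula is
# passed directly instead of the lm_robust object.
bptest(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, data = df)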
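To show what the test measures, the sketch below rebuilds the default (studentized) Breusch-Pagan statistic by hand: regress the squared OLS residuals on the regressors and multiply the R-squared of that auxiliary regression by the number of observations. It assumes the `Crime` data set bundled with the plm package is an acceptable stand-in for the notebook's Excel file; `ols`, `aux`, `lm_stat` and the `e2` column are names introduced here for illustration.

```r
library(lmtest)

# Assumption: plm's bundled Crime data stands in for the Excel file.
data("Crime", package = "plm")

f   <- log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris
ols <- lm(f, data = Crime)

# Auxiliary regression: squared residuals on the original regressors.
Crime$e2 <- resid(ols)^2
aux <- lm(e2 ~ log(prbarr) + log(prbconv) + prbpris, data = Crime)

# LM statistic = n * R^2; chi-squared with 3 degrees of freedom under H0.
lm_stat <- nobs(ols) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 3, lower.tail = FALSE)

# Compare with the packaged test (studentize = TRUE is the default).
bp <- bptest(f, data = Crime)
c(manual = lm_stat, bptest = unname(bp$statistic))
```

A small p-value means the regressors help predict the squared residuals, which is exactly what constant error variance rules out.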

# Assumption 3(b)
Now we check assumption (3b), non-autocorrelation, using the Durbin-Watson test, a statistical test for autocorrelation in the residuals of a regression model. Autocorrelation is correlation between the error terms (residuals) at different time points or observations.

We run dwtest in R and observe that our p-value is less than 0.05, so we reject H0 and accept H1: true autocorrelation is greater than 0.

The test statistic ranges from 0 to 4. Values near 2 indicate no autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation. Our DW = 0.5298, which suggests positive autocorrelation. We conclude that assumption 3(b) is violated.
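The Durbin-Watson call itself can be sketched as follows. This is a minimal example, again assuming plm's bundled `Crime` data as a stand-in for the Excel file; `crime_sorted` and `dw` are names introduced here. The rows are sorted by county and year first, because the statistic compares each residual with the one in the row before it.

```r
library(lmtest)

# Assumption: plm's bundled Crime data stands in for the Excel file.
data("Crime", package = "plm")

# Order rows so that consecutive rows are consecutive years of one county.
crime_sorted <- Crime[order(Crime$county, Crime$year), ]

# H0: no first-order autocorrelation; default alternative is "greater".
dw <- dwtest(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris,
             data = crime_sorted)
dw
```

Note that `dwtest()` treats the stacked panel as one long series, so the last year of one county is compared with the first year of the next. The plm package provides `pdwtest()` for panel models, which may be the better choice when that boundary effect matters.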

Hence assumption 3 as a whole is violated, and we conclude that **we should be using a Fixed Effect or Random Effect model for our regression.**

In [None]:
# fixed effect model and random effect model using plm
library(plm)

fixed <- plm(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris,
             data = df, index = c("county", "year"), model = "within")
random <- plm(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris,
              data = df, index = c("county", "year"), model = "random")
# fixed effect model summary
summary(fixed)

In [None]:
# random effect model summary
summary(random)
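The "within" model fitted above removes each county's time-constant effect by subtracting county means from every variable. The sketch below reproduces the slope estimates that way, assuming plm's bundled `Crime` data as a stand-in for the Excel file; `demean`, `fixed_chk`, `d` and `manual` are hypothetical names used only here.

```r
library(plm)

# Assumption: plm's bundled Crime data stands in for the Excel file.
data("Crime", package = "plm")

fixed_chk <- plm(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris,
                 data = Crime, index = c("county", "year"), model = "within")

# Subtract the county mean from each observation of x.
demean <- function(x, g) x - ave(x, g)

d <- data.frame(
  y  = demean(log(Crime$crmrte),  Crime$county),
  x1 = demean(log(Crime$prbarr),  Crime$county),
  x2 = demean(log(Crime$prbconv), Crime$county),
  x3 = demean(Crime$prbpris,      Crime$county)
)

# OLS on demeaned data, no intercept: same slopes as the within estimator.
manual <- lm(y ~ 0 + x1 + x2 + x3, data = d)
cbind(plm = coef(fixed_chk), demeaned = coef(manual))
```

The slopes agree, but the standard errors reported by `lm()` on the demeaned data do not account for the 90 estimated county means, which is one reason to rely on `plm()` for inference.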

# **4. Selecting Final Model with Hausman test**
To select one model we perform the Hausman test, a statistical test that compares two estimators, typically the fixed effects (FE) and random effects (RE) estimators, in panel data analysis. It helps determine whether the fixed effects model or the random effects model is more appropriate for a given dataset.

The test rests on the following logic. Under the null hypothesis the county-specific effects are uncorrelated with the regressors, so both estimators are consistent and the random effects estimator is efficient. Under the alternative the effects are correlated with the regressors, so only the fixed effects estimator remains consistent. A large gap between the two sets of estimates is therefore evidence against the null. In simple words, if the null hypothesis is rejected (the test statistic exceeds the critical value), the fixed effects estimator is more appropriate, because endogeneity violates the random effects assumptions.

**We observe that in our results the p-value is less than 0.05, hence we select the fixed effect model. We will discuss the results of our fixed effect model in the next section.**

In [None]:
hausman_test <- phtest(fixed, random)
hausman_test
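The statistic behind `phtest()` is a quadratic form in the difference between the FE and RE slope estimates, weighted by the difference of their covariance matrices. The sketch below computes it by hand under the assumption that plm's bundled `Crime` data is an acceptable stand-in for the Excel file; `fe`, `re`, `d_beta`, `d_vcov`, `H` and `ht` are names introduced here.

```r
library(plm)

# Assumption: plm's bundled Crime data stands in for the Excel file.
data("Crime", package = "plm")

f  <- log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris
fe <- plm(f, data = Crime, index = c("county", "year"), model = "within")
re <- plm(f, data = Crime, index = c("county", "year"), model = "random")

# Only the slopes are shared: the within model has no overall intercept.
common <- names(coef(fe))
d_beta <- coef(fe) - coef(re)[common]
d_vcov <- vcov(fe) - vcov(re)[common, common]

# H = d' [Var(FE) - Var(RE)]^{-1} d, chi-squared with k = 3 df under H0.
H       <- as.numeric(t(d_beta) %*% solve(d_vcov) %*% d_beta)
p_value <- pchisq(abs(H), df = length(d_beta), lower.tail = FALSE)

ht <- phtest(fe, re)
c(manual = H, phtest = as.numeric(ht$statistic))
```

A large H means the two estimators disagree by more than sampling noise would explain, which points to correlation between the county effects and the regressors.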

**5. Results: Fixed effect regression model**

**Ln(crmrte) = -0.2209 Ln(prbarr) - 0.1351 Ln(prbconv) - 0.3024 prbpris + µ**

In [None]:
# fixed model regression summary

summary(fixed)

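To check the significance of our coefficients we compare each t-statistic with the critical t at alpha = 0.01. The within model has 630 - 90 - 3 = 537 residual degrees of freedom (observations minus county intercepts minus slope coefficients), which gives a two-sided critical value of about ±2.585. All our coefficients are significant, as their t-statistics lie outside the acceptance region.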
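The critical value can be reproduced with `qt()`. The sketch below shows two degrees-of-freedom conventions side by side: 626 treats the model as a pooled regression with one intercept, while 537 also subtracts the 90 county intercepts absorbed by the fixed effect model. The variable names are introduced here for illustration.

```r
n <- 630   # observations (90 counties x 7 years)
N <- 90    # counties
k <- 3     # slope coefficients

df_pooled <- n - k - 1   # one common intercept
df_within <- n - N - k   # one intercept per county

# Two-sided test at alpha = 0.01 puts 0.005 in each tail.
alpha  <- 0.01
t_crit <- qt(1 - alpha / 2, df = c(pooled = df_pooled, within = df_within))
round(t_crit, 3)
```

With this many observations the two critical values differ only in the third decimal place, so the significance conclusion does not depend on the convention.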

1. R-squared: 0.0811, which indicates that our independent variables explain 8.11% of the variation in our dependent variable.

1. The coefficient β_1 is -0.2209, which implies that a 1% increase in the probability of arrest is associated with a 0.2209% decrease in the crime rate, holding all other variables constant.

1. The coefficient β_2 is -0.1351, which implies that a 1% increase in the probability of conviction is associated with a 0.1351% decrease in the crime rate, holding all other variables constant.

1. The coefficient β_3 is -0.30241. Because prbpris enters in levels, a 0.01 (one percentage point) increase in the probability of a prison sentence is associated with roughly a 0.30% decrease in the crime rate, holding all other variables constant. A full 1-unit increase would span the variable's entire 0-to-1 range. For a change that large the 100·β approximation (-30.24%) overstates the effect, and the exact figure is 100·(exp(-0.30241) - 1) ≈ -26.1%. β_3 is a semi-elasticity, so its size is not directly comparable with the two elasticities above.

1. Our initial expectation of a negative coefficient for each independent variable is confirmed. We can conclude that all our independent variables are negatively related to the crime rate.

1. The coefficient of the probability of a prison sentence, which was positive in our Pooled OLS model, is now negative in our fixed effect model, which aligns with our expectation.
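The percentage effects in the list above follow from two standard conversions: a log-log coefficient is an elasticity, and a log-level coefficient b implies an exact change of 100·(exp(b·Δ) - 1) percent for a change Δ in the regressor. The sketch below applies both to the reported coefficients; the helper names `pct_loglog` and `pct_loglevel` are hypothetical.

```r
# Coefficients as reported in the fixed effect results.
b_arr  <- -0.2209    # log(prbarr)
b_conv <- -0.1351    # log(prbconv)
b_pris <- -0.30241   # prbpris (levels)

# Exact % change in y when x rises by `pct` percent (log-log term).
pct_loglog <- function(b, pct) 100 * ((1 + pct / 100)^b - 1)

# Exact % change in y when x rises by `delta` units (log-level term).
pct_loglevel <- function(b, delta) 100 * (exp(b * delta) - 1)

arr_1pct   <- pct_loglog(b_arr, 1)        # close to the coefficient itself
conv_1pct  <- pct_loglog(b_conv, 1)
pris_1pt   <- pct_loglevel(b_pris, 0.01)  # one percentage point
pris_1unit <- pct_loglevel(b_pris, 1)     # full 0-to-1 range

round(c(arr_1pct = arr_1pct, conv_1pct = conv_1pct,
        pris_1pt = pris_1pt, pris_1unit = pris_1unit), 2)
```

For small changes the exact figures match the coefficients closely. The gap only becomes material for the 1-unit change in prbpris, where the approximation and the exact value differ by about four percentage points.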



# 6. Conclusion
For this project we were provided with a dataset on crime in 90 counties in North Carolina for the years 1981 through 1987. The research question selected was “Do the probabilities of apprehension and punishment affect crime participation rate?”. Because the data is panel data, we started by building a simple Pooled OLS model and observed that the coefficient of the probability of a prison sentence is positive, which does not align with our expectation. With panel data, a simple OLS model can be inconsistent and biased because of endogeneity caused by unobserved heterogeneity. To show that Pooled OLS is not appropriate, we checked assumption (3a), homoskedasticity, using the Breusch-Pagan test and assumption (3b), non-autocorrelation, using the Durbin-Watson test. Both were violated, so assumption 3 as a whole is violated and we concluded that a Fixed Effect or Random Effect model should be used. To choose between the two we used the Hausman test, which compares the fixed effect and random effect estimators, and selected the Fixed Effect estimator. Its R-squared of 0.0811 indicates that the model explains 8.11% of the variation in our dependent variable.

The coefficients from our fixed effect model are all negative, confirming our initial expectation that each independent variable is negatively related to the crime rate: if any of them increases, the crime rate decreases. The coefficient β_3 is -0.30241, which could be of use to policy makers: a one percentage point (0.01) increase in the probability of a prison sentence is associated with roughly a 0.30% decrease in the crime rate. We tested the significance of our coefficients and found that all the independent variables are significant, which supports the conclusion that the probabilities of apprehension and punishment do affect the crime participation rate.

Although our results are statistically significant, we acknowledge that the R-squared value is on the small end. The model could describe the crime rate better with data on unemployment and education levels, and a comparison with recent years could also use COVID case counts, since all of these factors help explain crime. The fixed effect model also has a limitation: it is not suitable for analyzing the effects of variables that do not change over time within entities (individuals, groups, etc.). These time-invariant variables are absorbed by the fixed effects, and their impact cannot be estimated in the model.