# Effect of Marijuana legalization on crime rates

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*

Evaluate whether marijuana legalization affects other categories of crime, using comparative analysis to establish or refute a correlation.

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*

Does marijuana legalization lead to an increase in other areas of criminality?

## What would an answer look like?
*What is your hypothesized answer to your question?*

I suspect there will be some increases in other areas, such as impaired driving. I do not have a strong expectation about the size of the effect.

## Data Sources
*What 3 data sources have you identified for this project?*

- [Seattle Crime stats](https://www.kaggle.com/datasets/city-of-seattle/seattle-crime-stats)
- [Denver Crime stats](https://www.kaggle.com/datasets/paultimothymooney/denver-crime-data/data)
- [Kansas City Crime](https://www.kaggle.com/datasets/riteshkadakoti/crime-dataset-kansas)

*How are you going to relate these datasets?*
My working plan is to compare two large cities where marijuana was first legalized (2012) for recreational use. I plan to compare pre- and post-legalization crime trends, then compare and contrast those trends with a large city that had not (until this year) legalized recreational use. I am still working out the mechanics of how best to do that.
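One possible way to set up the pre/post comparison is sketched below. It is only an illustration under stated assumptions: the `counts` frame, its values, and the `CUTOFF` date are all made up, and the real effective date of legalization would need to be verified for each city.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly incident counts; real values would come from the datasets.
counts = pd.DataFrame({
    "city": ["legalized"] * 4 + ["comparison"] * 4,
    "month": pd.to_datetime(
        ["2012-06-01", "2012-09-01", "2013-03-01", "2013-06-01"] * 2
    ),
    "incidents": [100, 110, 120, 130, 200, 210, 205, 215],
})

# Assumed cutoff date, for illustration only.
CUTOFF = pd.Timestamp("2012-12-01")

# Label each row as falling before or after the cutoff.
counts["period"] = np.where(counts["month"] >= CUTOFF, "post", "pre")

# Mean monthly incidents per city and period, one column per period.
means = counts.groupby(["city", "period"])["incidents"].mean().unstack("period")

# Percent change from the pre period to the post period within each city.
means["pct_change"] = (means["post"] - means["pre"]) / means["pre"] * 100
print(means)
```

Comparing the percent change in the legalized city against the comparison city gives a rough sense of whether a shift is specific to legalization or part of a broader trend. It does not establish causation.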

## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*

In [1]:
# Import libraries, configure display and plotting, and load the datasets
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Configure pandas to display 500 rows; otherwise it will truncate the output
pd.set_option('display.max_rows', 500)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
plt.style.use("bmh")

denver_crime_df = pd.read_csv('data/denver_crime.csv')
kc_crime_df = pd.read_csv('data/KCcrime2010To2018.csv', low_memory=False)
seattle01_crime_df = pd.read_csv('data/seattle-crime-stats-by-1990-census-tract-1996-2007.csv')
seattle02_crime_df = pd.read_csv('data/seattle-crime-stats-by-police-precinct-2008-present.csv')

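# Preview each dataset. In a notebook only the last expression of a cell is
# rendered, so only the seattle02_crime_df preview appears in the output.
denver_crime_df.head()
kc_crime_df.head()
seattle01_crime_df.head()
seattle02_crime_df.head()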


|   | Police Beat | CRIME_TYPE | CRIME_DESCRIPTION | STAT_VALUE | REPORT_DATE | Sector | Precinct | Row_Value_ID |
|---|---|---|---|---|---|---|---|---|
| 0 | R2 | Rape | Rape | 1 | 2014-04-30T00:00:00.000 | R | SE | 27092 |
| 1 | K2 | Assault | Assault | 5 | 2014-04-30T00:00:00.000 | K | W | 26506 |
| 2 | M2 | Homicide | Homicide | 1 | 2014-04-30T00:00:00.000 | M | W | 27567 |
| 3 | C3 | Robbery | Robbery | 2 | 2014-04-30T00:00:00.000 | C | E | 26225 |
| 4 | E2 | Motor Vehicle Theft | Vehicle Theft is theft of a car, truck, motorc... | 7 | 2014-04-30T00:00:00.000 | E | E | 26368 |


### Exploratory Data Analysis

This section explores the data to assess whether marijuana legalization had a tangential impact on other categories of crime.
I am paying particular attention to the 2-3 year period before and after legalization.
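Restricting the data to a window around legalization could look roughly like the sketch below. It uses the Seattle column names (`CRIME_TYPE`, `STAT_VALUE`, `REPORT_DATE`), but the rows in `sample` are made up, and both `LEGALIZATION` and the three-year `WINDOW` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in frame using the Seattle column names; the values are made up.
sample = pd.DataFrame({
    "CRIME_TYPE": ["Assault", "Assault", "Robbery", "Robbery", "Assault"],
    "STAT_VALUE": [5, 3, 2, 4, 7],
    "REPORT_DATE": [
        "2008-01-31T00:00:00.000",
        "2011-06-30T00:00:00.000",
        "2012-11-30T00:00:00.000",
        "2014-04-30T00:00:00.000",
        "2016-05-31T00:00:00.000",
    ],
})

# REPORT_DATE is stored as text, so convert it before filtering by date.
sample["REPORT_DATE"] = pd.to_datetime(sample["REPORT_DATE"])

# Assumed legalization date and window width, for illustration only.
LEGALIZATION = pd.Timestamp("2012-12-01")
WINDOW = pd.DateOffset(years=3)

# Keep only rows within the window on either side of the assumed date.
in_window = sample["REPORT_DATE"].between(LEGALIZATION - WINDOW, LEGALIZATION + WINDOW)
windowed = sample.loc[in_window].copy()
windowed["period"] = np.where(windowed["REPORT_DATE"] >= LEGALIZATION, "post", "pre")

# Total reported incidents per crime type and period.
totals = windowed.groupby(["CRIME_TYPE", "period"])["STAT_VALUE"].sum()
print(totals)
```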

In [3]:
# Get the shapes of the data
display(denver_crime_df.shape)
display(seattle01_crime_df.shape)
display(seattle02_crime_df.shape)
display(kc_crime_df.shape)

(386865, 20)

(14268, 4)

(27125, 8)

(621281, 27)

In [6]:
# Look at the columns and compare the presentation of the data
display(denver_crime_df.info())
display(seattle01_crime_df.info())
display(seattle02_crime_df.info())
display(kc_crime_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386865 entries, 0 to 386864
Data columns (total 20 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   incident_id             386865 non-null  int64  
 1   offense_id              386865 non-null  int64  
 2   offense_code            386865 non-null  int64  
 3   offense_code_extension  386865 non-null  int64  
 4   offense_type_id         386865 non-null  object 
 5   offense_category_id     386865 non-null  object 
 6   first_occurrence_date   386865 non-null  object 
 7   last_occurrence_date    211309 non-null  object 
 8   reported_date           386865 non-null  object 
 9   incident_address        371362 non-null  object 
 10  geo_x                   371362 non-null  float64
 11  geo_y                   371362 non-null  float64
 12  geo_lon                 371096 non-null  float64
 13  geo_lat                 371096 non-null  float64
 14  district_id         

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14268 entries, 0 to 14267
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Report_Year        14268 non-null  int64  
 1   Census_Tract_1990  14236 non-null  float64
 2   Crime_Type         14268 non-null  object 
 3   Report_Year_Total  14268 non-null  int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 446.0+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27125 entries, 0 to 27124
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Police Beat        27125 non-null  object
 1   CRIME_TYPE         27125 non-null  object
 2   CRIME_DESCRIPTION  27125 non-null  object
 3   STAT_VALUE         27125 non-null  int64 
 4   REPORT_DATE        27125 non-null  object
 5   Sector             27125 non-null  object
 6   Precinct           27125 non-null  object
 7   Row_Value_ID       27125 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 1.7+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 621281 entries, 0 to 621280
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   140092859             621281 non-null  int64 
 1   2014                  621281 non-null  int64 
 2   12                    621281 non-null  int64 
 3   28                    621281 non-null  int64 
 4   19                    621281 non-null  int64 
 5   36                    621281 non-null  int64 
 6   2014.1                621281 non-null  int64 
 7   12.1                  621281 non-null  int64 
 8   22                    621281 non-null  int64 
 9   1                     621281 non-null  int64 
 10  0                     621281 non-null  int64 
 11  670                   621281 non-null  int64 
 12  23D                   621281 non-null  object
 13  Stealing from Buildi  621281 non-null  object
 14  331                   621281 non-null  object
 15  3300  E 30 ST    

None

### Initial thoughts
The Seattle 01 and Kansas City datasets do not appear to provide much usable data: Seattle 01 holds only yearly totals by census tract, and the Kansas City file loaded without meaningful column names. Denver and Seattle 02 show some promise. I will need to try to find another comparison city to evaluate against the cities with legalized marijuana.
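The Kansas City column names in the `info()` output (`140092859`, `2014`, `12`, ...) look like a data row, which would be consistent with a CSV file that has no header row. If that is the case, a minimal sketch of loading such a file is shown below. The `raw` text and the column names passed to `names` are hypothetical placeholders, not the dataset's real schema.

```python
import io

import pandas as pd

# Two made-up rows standing in for a CSV file that has no header row.
raw = "140092859,2014,12,Stealing\n140092860,2014,12,Burglary\n"

# Default behavior: the first data row is consumed as the column names.
default_df = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; `names` supplies the labels.
# These names are hypothetical and would need to come from the data dictionary.
named_df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["report_no", "year", "month", "description"],
)

print(default_df.shape)
print(named_df.shape)
```

With real column names in hand, the Kansas City data might still be usable as the comparison city.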

In [11]:
# Check for duplicated records for Denver
denver_crime_df.duplicated().sum()


0

In [12]:
# Check for duplicated records for Seattle
seattle02_crime_df.duplicated().sum()

4

In [14]:
# Drop duplicates from Seattle and rename
display(seattle02_crime_df.shape)
seattle_crime_df = seattle02_crime_df.drop_duplicates()
display(seattle_crime_df.shape)

(27125, 8)

(27121, 8)

In [15]:
# Check for missing values in the Denver and Seattle data
display(denver_crime_df.isna().sum())
display(seattle_crime_df.isna().sum())

incident_id                    0
offense_id                     0
offense_code                   0
offense_code_extension         0
offense_type_id                0
offense_category_id            0
first_occurrence_date          0
last_occurrence_date      175556
reported_date                  0
incident_address           15503
geo_x                      15503
geo_y                      15503
geo_lon                    15769
geo_lat                    15769
district_id                   57
precinct_id                    0
neighborhood_id              689
is_crime                       0
is_traffic                     0
victim_count                   0
dtype: int64

Police Beat          0
CRIME_TYPE           0
CRIME_DESCRIPTION    0
STAT_VALUE           0
REPORT_DATE          0
Sector               0
Precinct             0
Row_Value_ID         0
dtype: int64

### Data cleaned and ready for use
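The missing values are confined to Denver's `last_occurrence_date`, address and coordinate fields, `district_id`, and `neighborhood_id`. These are immaterial to the scope of analysis and can be safely ignored. The four duplicate rows from Seattle have been dropped.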
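A quick way to back up that judgment is to confirm that the columns the analysis depends on are complete. The sketch below assumes the analysis needs only `offense_category_id` and `reported_date`. That choice is an assumption, and the rows in `sample` are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows using Denver column names; the values are illustrative only.
sample = pd.DataFrame({
    "offense_category_id": ["larceny", "burglary", "larceny"],
    "reported_date": ["2016-06-15", "2018-01-29", "2016-04-26"],
    "last_occurrence_date": [np.nan, np.nan, "2016-04-27"],
    "geo_lon": [np.nan, -104.9, -104.8],
})

# Columns the analysis is assumed to rely on (not stated in the notebook).
needed = ["offense_category_id", "reported_date"]

# Share of missing values per column, from 0.0 (none) to 1.0 (all).
missing_share = sample.isna().mean()

# True only if every needed column has no missing values.
needed_complete = bool(sample[needed].notna().all().all())

print(missing_share)
print(needed_complete)
```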

## Resources and References
*What resources and references have you used for this project?*

I used Kaggle to source the datasets.

In [6]:
# ⚠️ Make sure you run this cell at the end of your notebook before every submission!
!jupyter nbconvert --to python source.ipynb

[NbConvertApp] Converting notebook source.ipynb to python
[NbConvertApp] Writing 2701 bytes to source.py
