# Crime stat analysis



## Business understanding
The Denver police chief wants insight into crime in Denver on Labor Day, so that he can decide whether he can or should deploy his officers differently.

### Determine business objectives
Determine whether crime on Labor Day differs from crime on an average day.
If so, identify what is different.

### Assess situation
Labor Day is a public holiday that honors American workers.
It falls on the first Monday of September.
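
The first-Monday rule can be turned into a small date calculation. The sketch below uses only the Python standard library; the function name `labor_day` is illustrative and does not appear elsewhere in this notebook.

```python
from datetime import date, timedelta

MONDAY = 0  # date.weekday() numbers Monday as 0


def labor_day(year: int) -> date:
    """Return the first Monday of September for the given year."""
    first_of_september = date(year, 9, 1)
    # Number of days to add to reach a Monday (0 if September 1 is a Monday)
    days_until_monday = (MONDAY - first_of_september.weekday()) % 7
    return first_of_september + timedelta(days=days_until_monday)


# Labor Day dates for 2017 through 2021
labor_days = [labor_day(year) for year in range(2017, 2022)]
print([d.isoformat() for d in labor_days])
```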

### Determine data mining goals
Extract all Labor Day records from the Denver crime dataset and compare them with the data for an average day.
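
One possible way to frame this comparison is to count incidents per day and compare the mean on Labor Day with the mean on all other days. The sketch below is an assumption about how that could look, not the analysis itself: the sample rows are invented, the helper name `compare_to_average_day` is hypothetical, and only the column name `REPORTED_DATE_NORMALIZED` is taken from this notebook.

```python
import pandas as pd


def compare_to_average_day(crimes: pd.DataFrame, holidays: pd.DatetimeIndex) -> pd.Series:
    """Compare mean incidents per day on the given holidays with all other days."""
    # Number of incidents per calendar day
    per_day = crimes.groupby("REPORTED_DATE_NORMALIZED").size()
    is_holiday = per_day.index.isin(holidays)
    return pd.Series({
        "holiday_mean": per_day[is_holiday].mean(),
        "other_days_mean": per_day[~is_holiday].mean(),
    })


# Invented sample: 3 incidents on Labor Day 2021, then 1 and 2 on the next two days
sample = pd.DataFrame({
    "REPORTED_DATE_NORMALIZED": pd.to_datetime(
        ["2021-09-06"] * 3 + ["2021-09-07"] + ["2021-09-08"] * 2
    )
})
result = compare_to_average_day(sample, pd.to_datetime(["2021-09-06"]))
print(result)
```

Days without any reported incident do not appear in `per_day`, so on real data the mean for the other days would be slightly overestimated unless such days are added back with a count of zero.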

### Project plan

# Data understanding for Crime Stat
This notebook serves as the basis for the data understanding of the datasets we received for the second challenge of the Proof of Concept.


In [138]:
import platform
import os

# Get working directory
working_directory = os.getcwd()

# check if macos
isMac = "macos" in platform.platform().lower()

## Format
The data was delivered as two separate CSV files: crime.csv and offense_codes.csv. The files are approximately 120 MB and 35 KB, respectively.

NOTE: place the files in a folder named "data", using the file names given above.
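
Before loading anything, it can help to verify that both files are where the notebook expects them. The following is a minimal sketch using `pathlib`; the helper name `missing_data_files` is hypothetical, and the folder layout is assumed to be exactly the one described in the note above.

```python
from pathlib import Path

EXPECTED_FILES = ("crime.csv", "offense_codes.csv")


def missing_data_files(base_dir: Path) -> list[str]:
    """Return the expected CSV names that are absent from base_dir/data."""
    data_dir = base_dir / "data"
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


# Check the current working directory, as the notebook does
missing = missing_data_files(Path.cwd())
if missing:
    print(f"Missing from the data folder: {', '.join(missing)}")
```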

In [139]:
# File paths (os.path.join picks the correct separator on every operating system, including Linux)
crime_file_path = os.path.join(working_directory, "data", "crime.csv")
offense_file_path = os.path.join(working_directory, "data", "offense_codes.csv")

## Accessing the data
The data can be accessed through the paths defined in the cell above. We load each file into a pandas DataFrame, as shown below. The DataFrames are then available throughout the rest of this Jupyter Notebook.

In [140]:
import pandas as pd
import numpy as np
from datetime import datetime

In [141]:
# Load the data into DataFrames
crime_df = pd.read_csv(crime_file_path)
offense_df = pd.read_csv(offense_file_path)
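
As an alternative to converting the date columns after loading, `pd.read_csv` can parse them while reading through its `parse_dates` parameter. The sketch below demonstrates this on a small in-memory CSV. The two rows are invented and use ISO-style timestamps, which is an assumption: the timestamp format in the real crime.csv may differ and may need extra handling.

```python
import io

import pandas as pd

DATE_COLUMNS = ["FIRST_OCCURRENCE_DATE", "LAST_OCCURRENCE_DATE", "REPORTED_DATE"]

# Two invented rows that mimic the date columns of crime.csv
sample_csv = io.StringIO(
    "incident_id,FIRST_OCCURRENCE_DATE,LAST_OCCURRENCE_DATE,REPORTED_DATE\n"
    "1,2022-01-04 11:30:00,2022-01-04 12:00:00,2022-01-04 20:36:00\n"
    "2,2022-01-03 06:45:00,,2022-01-03 11:01:00\n"
)

# parse_dates converts the listed columns to datetime; empty cells become NaT
sample_df = pd.read_csv(sample_csv, parse_dates=DATE_COLUMNS)
print(sample_df.dtypes)
```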

## Data preparation

In [142]:
# DataFrames that will hold all Labor Day data
labourday_crime_df = crime_df.copy()
labourday_offense_df = offense_df.copy()

# Convert the date columns to datetime so that specific dates can be selected
labourday_crime_df["FIRST_OCCURRENCE_DATE"] = pd.to_datetime(labourday_crime_df["FIRST_OCCURRENCE_DATE"])
labourday_crime_df["LAST_OCCURRENCE_DATE"] = pd.to_datetime(labourday_crime_df["LAST_OCCURRENCE_DATE"])
labourday_crime_df["REPORTED_DATE"] = pd.to_datetime(labourday_crime_df["REPORTED_DATE"])

# Add a normalized reported date (time set to midnight) so rows can be merged with the Labor Day dates
labourday_crime_df["REPORTED_DATE_NORMALIZED"] = labourday_crime_df["REPORTED_DATE"].dt.normalize()

labourday_crime_df.head()

| | incident_id | offense_id | OFFENSE_CODE | OFFENSE_CODE_EXTENSION | OFFENSE_TYPE_ID | OFFENSE_CATEGORY_ID | FIRST_OCCURRENCE_DATE | LAST_OCCURRENCE_DATE | REPORTED_DATE | INCIDENT_ADDRESS | GEO_X | GEO_Y | GEO_LON | GEO_LAT | DISTRICT_ID | PRECINCT_ID | NEIGHBORHOOD_ID | IS_CRIME | IS_TRAFFIC | REPORTED_DATE_NORMALIZED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20226000193 | 20226000193299900 | 2999 | 0 | criminal-mischief-other | public-disorder | 2022-01-04 11:30:00 | 2022-01-04 12:00:00 | 2022-01-04 20:36:00 | 128 S CANOSA CT | 3135366.0 | 1685410.0 | -105.018825 | 39.714268 | 4.0 | 411.0 | valverde | 1 | 0 | 2022-01-04 |
| 1 | 20223319 | 20223319299900 | 2999 | 0 | criminal-mischief-other | public-disorder | 2022-01-03 06:45:00 | NaT | 2022-01-03 11:01:00 | 650 15TH ST | 3142454.0 | 1696151.0 | -104.993418 | 39.743649 | 6.0 | 611.0 | cbd | 1 | 0 | 2022-01-03 |
| 2 | 20223093 | 20223093299900 | 2999 | 0 | criminal-mischief-other | public-disorder | 2022-01-03 01:00:00 | NaT | 2022-01-03 06:11:00 | 919 E COLFAX AVE | 3147484.0 | 1694898.0 | -104.975557 | 39.74013 | 6.0 | 621.0 | north-capitol-hill | 1 | 0 | 2022-01-03 |
| 3 | 20224000 | 20224000299900 | 2999 | 0 | criminal-mischief-other | public-disorder | 2022-01-03 19:47:00 | NaT | 2022-01-03 21:12:00 | 2345 W ALAMEDA AVE | 3136478.0 | 1684414.0 | -105.014892 | 39.711518 | 4.0 | 411.0 | valverde | 1 | 0 | 2022-01-03 |
| 4 | 20223956 | 20223956299900 | 2999 | 0 | criminal-mischief-other | public-disorder | 2022-01-03 17:06:00 | NaT | 2022-01-03 20:31:00 | 7800 E SMITH RD | 3169237.0 | 1705800.0 | -104.89795 | 39.769688 | 5.0 | 512.0 | central-park | 1 | 0 | 2022-01-03 |


The data goes back to January 2017, so we can analyze five years. The earliest reported date is shown below:

In [143]:
start_date = labourday_crime_df["REPORTED_DATE"].min()
start_date

Timestamp('2017-01-02 00:19:00')

### Select data

To retrieve the Labor Day data, we first build a list of the Labor Day dates of the past five years. We then intersect that list with the crime set: we select every crime whose REPORTED_DATE falls on one of these dates.

In [144]:
# Build a pandas DatetimeIndex with one entry per day, from the first reported date until today
dates = pd.date_range(start_date, datetime.now())
dates

DatetimeIndex(['2017-01-02 00:19:00', '2017-01-03 00:19:00',
               '2017-01-04 00:19:00', '2017-01-05 00:19:00',
               '2017-01-06 00:19:00', '2017-01-07 00:19:00',
               '2017-01-08 00:19:00', '2017-01-09 00:19:00',
               '2017-01-10 00:19:00', '2017-01-11 00:19:00',
               ...
               '2022-03-01 00:19:00', '2022-03-02 00:19:00',
               '2022-03-03 00:19:00', '2022-03-04 00:19:00',
               '2022-03-05 00:19:00', '2022-03-06 00:19:00',
               '2022-03-07 00:19:00', '2022-03-08 00:19:00',
               '2022-03-09 00:19:00', '2022-03-10 00:19:00'],
              dtype='datetime64[ns]', length=1894, freq='D')

In [145]:
# Named constants for readability
monday = 0
september = 9

# Shift every date forward to the next first Monday of a month, then drop the duplicates
dates = dates.shift(1, freq='WOM-1MON')
dates = dates.unique()

# Keep only the first Mondays that fall in September
mask = (dates.weekday == monday) & (dates.month == september)
first_mondays = dates[mask]

# Convert the Mondays to a DataFrame and set the time part to midnight
first_mondays_df = pd.DataFrame({"REPORTED_DATE_NORMALIZED": first_mondays.normalize()})
first_mondays_df



| | REPORTED_DATE_NORMALIZED |
|---|---|
| 0 | 2017-09-04 |
| 1 | 2018-09-03 |
| 2 | 2019-09-02 |
| 3 | 2020-09-07 |
| 4 | 2021-09-06 |
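
The same five dates can also be generated directly, without first building a daily range and shifting it. The sketch below passes the `WOM-1MON` offset alias (first Monday of each month) straight to `pd.date_range`. The end date is hard-coded here to the last date in the output above so that the example is reproducible; that choice is an assumption, since the notebook itself uses `datetime.now()`.

```python
import pandas as pd

# "WOM-1MON" is the pandas offset alias for the first Monday of each month
first_mondays = pd.date_range("2017-01-01", "2022-03-10", freq="WOM-1MON")

# Keep only the September entries, which are the Labor Day dates
labor_days = first_mondays[first_mondays.month == 9]
print(labor_days.strftime("%Y-%m-%d").tolist())
```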


In [149]:
# Inner join with the Labor Day dates: keeps only the crimes reported on a Labor Day
labourday_crime_df = pd.merge(labourday_crime_df, first_mondays_df, on="REPORTED_DATE_NORMALIZED", how="inner")
labourday_crime_df.sort_values("REPORTED_DATE_NORMALIZED")

| | incident_id | offense_id | OFFENSE_CODE | OFFENSE_CODE_EXTENSION | OFFENSE_TYPE_ID | OFFENSE_CATEGORY_ID | FIRST_OCCURRENCE_DATE | LAST_OCCURRENCE_DATE | REPORTED_DATE | INCIDENT_ADDRESS | GEO_X | GEO_Y | GEO_LON | GEO_LAT | DISTRICT_ID | PRECINCT_ID | NEIGHBORHOOD_ID | IS_CRIME | IS_TRAFFIC | REPORTED_DATE_NORMALIZED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 481 | 2017592783 | 2017592783269903 | 2699 | 3 | theft-of-services | larceny | 2017-09-04 09:45:00 | NaT | 2017-09-04 10:29:00 | 360 S DAHLIA ST | 3159945.0 | 1684194.0 | -104.931485 | 39.710540 | 3.0 | 322.0 | washington-virginia-vale | 1 | 0 | 2017-09-04 |
| 311 | 2017593168 | 2017593168540100 | 5401 | 0 | traffic-accident-hit-and-run | traffic-accident | 2017-09-04 13:03:00 | NaT | 2017-09-04 13:55:00 | W 8TH AVE / N KALAMATH ST | 3140583.0 | 1690777.0 | -105.000176 | 39.728924 | 1.0 | 123.0 | lincoln-park | 0 | 1 | 2017-09-04 |
| 310 | 2017593271 | 2017593271540100 | 5401 | 0 | traffic-accident-hit-and-run | traffic-accident | 2017-09-02 23:30:00 | NaT | 2017-09-04 14:00:00 | 1221 N CHEROKEE ST | 3142941.0 | 1693251.0 | -104.991744 | 39.735680 | 6.0 | 611.0 | civic-center | 0 | 1 | 2017-09-04 |
| 309 | 2017594161 | 2017594161540100 | 5401 | 0 | traffic-accident-hit-and-run | traffic-accident | 2017-09-04 13:34:00 | NaT | 2017-09-04 21:55:00 | 1500 BLOCK N LOGAN ST | 3145602.0 | 1695149.0 | -104.982244 | 39.740849 | 6.0 | 621.0 | north-capitol-hill | 0 | 1 | 2017-09-04 |
| 308 | 2017592720 | 2017592720539900 | 5399 | 0 | public-peace-other | public-disorder | 2017-09-04 08:30:00 | NaT | 2017-09-04 08:56:00 | 500 BLOCK 16TH ST | 3143046.0 | 1696397.0 | -104.991308 | 39.744315 | 6.0 | 611.0 | cbd | 1 | 0 | 2017-09-04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 630 | 2021513775 | 2021513775230300 | 2303 | 0 | theft-shoplift | larceny | 2021-09-06 11:00:00 | 2021-09-06 11:30:00 | 2021-09-06 12:22:00 | 500 S SANTA FE DR | 3142345.0 | 1682796.0 | -104.994070 | 39.706988 | 3.0 | 312.0 | baker | 1 | 0 | 2021-09-06 |
| 629 | 2021513998 | 2021513998230300 | 2303 | 0 | theft-shoplift | larceny | 2021-09-06 12:32:00 | NaT | 2021-09-06 13:39:00 | 7757 E 36TH AVE | 3169319.0 | 1705282.0 | -104.897670 | 39.768265 | 5.0 | 512.0 | central-park | 1 | 0 | 2021-09-06 |
| 628 | 2021514557 | 2021514557230300 | 2303 | 0 | theft-shoplift | larceny | 2021-09-06 19:00:00 | 2021-09-06 19:28:00 | 2021-09-06 19:28:00 | 1416 PLATTE ST | 3137893.0 | 1700399.0 | -105.009555 | 39.755379 | 1.0 | 113.0 | highland | 1 | 0 | 2021-09-06 |
| 640 | 2021514060 | 2021514060230400 | 2304 | 0 | theft-parts-from-vehicle | theft-from-motor-vehicle | 2021-08-04 12:00:00 | NaT | 2021-09-06 14:13:00 | 2330 S KEARNEY ST | 3164016.0 | 1670986.0 | -104.917317 | 39.674211 | 3.0 | 323.0 | goldsmith | 1 | 0 | 2021-09-06 |
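
The merge above acts as an inner join on the normalized date. A boolean filter with `Series.isin` selects the same rows and may be easier to read. The sketch below compares both approaches on a few invented rows; only the column name `REPORTED_DATE_NORMALIZED` is taken from this notebook.

```python
import pandas as pd

# Invented sample rows; incident 2 and 3 were reported on Labor Day 2021
crimes = pd.DataFrame({
    "incident_id": [1, 2, 3, 4],
    "REPORTED_DATE_NORMALIZED": pd.to_datetime(
        ["2021-09-05", "2021-09-06", "2021-09-06", "2021-09-07"]
    ),
})
labor_days = pd.DataFrame({
    "REPORTED_DATE_NORMALIZED": pd.to_datetime(["2020-09-07", "2021-09-06"])
})

# Inner join: keeps only rows whose date occurs in both frames
merged = pd.merge(crimes, labor_days, on="REPORTED_DATE_NORMALIZED", how="inner")

# Boolean filter: selects the same rows and preserves the original index labels
on_labor_day = crimes["REPORTED_DATE_NORMALIZED"].isin(labor_days["REPORTED_DATE_NORMALIZED"])
filtered = crimes[on_labor_day]

print(merged["incident_id"].tolist())
print(filtered["incident_id"].tolist())
```

Both approaches return the same incidents here. The merge builds a new index, while the filter keeps the row labels of the original DataFrame, which can make it easier to trace rows back to the full crime set.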


## Data preparation

### Select data

### Clean data

### Construct data

### Integrate data

### Format data

### Dataset description

## Modelling

### Select modelling techniques

### Generate test design

### Build model

### Assess Model

## Evaluation

### Evaluate results

### Review process

### Determine next steps

## Deployment

### Plan deployment

### Plan monitoring and maintenance

### Produce final report

### Review project