# Analysis and Prediction of Crimes in Chicago city

### Introduction

- The data we are analysing comes from the Chicago Data Portal (https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/data), which provides information about the crimes that took place in the city of Chicago from 2001 to the present.

- The questions we investigate and try to predict are:
    1. The type of crime that is likely to happen
    2. The place where a crime is likely to happen
    3. Whether or not the crime ends in an arrest


### Any changes?

We initially planned to work with the entire dataset, which covers 2001 to the present. That is 7662271 rows (as of Nov 2, 2022).
For now, we work on a subset of the dataset, from 2019 to 2021, which has 680425 rows (as of Nov 2, 2022).

This is being done to fit the time frame of our project. Later, we plan to incorporate the entire dataset to fine-tune our model.
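
A minimal sketch of how such a year-based subset could be extracted without loading the full export at once. The helper name `load_year_range`, the chunk size, and the tiny in-memory CSV are assumptions made for illustration; only the `Year` column name comes from the dataset.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the full export (hypothetical rows).
sample_csv = io.StringIO(
    "ID,Primary Type,Year\n"
    "1,THEFT,2018\n"
    "2,BATTERY,2019\n"
    "3,ASSAULT,2020\n"
    "4,THEFT,2021\n"
    "5,NARCOTICS,2022\n"
)

def load_year_range(source, first_year, last_year, chunksize=2):
    """Read a CSV in chunks and keep rows with first_year <= Year <= last_year."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        mask = chunk["Year"].between(first_year, last_year)
        if mask.any():  # skip chunks with no matching rows
            kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

subset_df = load_year_range(sample_csv, 2019, 2021)
print(subset_df)
```

With the real file, `source` would be the path to the downloaded CSV and `chunksize` would be much larger.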

### Data initialisation

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [25]:
crimes_df = pd.read_csv('Crimes-2021_to_2022.csv')
crimes_df.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,12345411,JE205618,1/1/21 0:00,036XX S ASHLAND AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,...,11.0,59,14,1166266.0,1880505.0,2021,4/23/21 16:49,41.827682,-87.665496,"(41.827681913, -87.665496311)"
1,12449065,JE319016,1/1/21 0:00,100XX S AVENUE L,1750,OFFENSE INVOLVING CHILDREN,CHILD ABUSE,RESIDENCE,False,True,...,10.0,52,08B,1201814.0,1838991.0,2021,8/12/21 16:59,41.712933,-87.536489,"(41.712932999, -87.53648903)"
2,12349639,JE210703,1/1/21 0:00,037XX N PITTSBURGH AVE,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,APARTMENT,False,False,...,38.0,17,11,1120403.0,1923742.0,2021,4/28/21 16:51,41.947186,-87.83284,"(41.94718614, -87.832840321)"
3,12354069,JE216275,1/1/21 0:00,016XX W HOWARD ST,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,True,...,49.0,1,11,1163887.0,1950346.0,2021,5/4/21 16:47,42.01938,-87.672249,"(42.019380398, -87.672249127)"
4,12355943,JE218564,1/1/21 0:00,051XX N SHERIDAN RD,2826,OTHER OFFENSE,HARASSMENT BY ELECTRONIC MEANS,APARTMENT,False,False,...,48.0,3,26,1168704.0,1934414.0,2021,5/5/21 16:49,41.975559,-87.654988,"(41.975559264, -87.654987667)"


In [26]:
crimes_df.shape

(207351, 22)

In [27]:
crimes_df.dtypes

ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                  int64
Ward                    float64
Community Area            int64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
dtype: object

##### Meanings of columns

1. ID: unique identifier of the record
2. Case Number: unique case number of the crime
3. Date: listed date and time of the crime
4. Block: block where the crime occurred
5. IUCR: four-character Illinois Uniform Crime Reporting (IUCR) code
6. Description: short description of the type of crime
7. Location Description: description of where the crime occurred
8. Arrest: boolean value (True/False) of whether or not an arrest was made
9. Domestic: boolean value (True/False) of whether or not the crime was domestic
10. Community Area: numeric code of the community area where the crime occurred
11. FBI Code: code indicating the FBI crime categorization (not always numeric, e.g. 08B)
12. X & Y Coordinate: coordinates of the location where the crime occurred
13. Year: year the crime occurred
14. Updated On: date and time the record was last updated
15. Latitude & Longitude: latitude and longitude of the crime location
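
Since `Date` and `Updated On` are stored as `object` (plain strings), one possible way to make them usable is to parse them into timestamps. The sketch below assumes the month/day/two-digit-year layout seen in the preview rows; the sample strings are copied from that preview and the variable names are illustrative.

```python
import pandas as pd

# Sample strings in the same layout as the Date / Updated On columns.
raw_dates = pd.Series(["1/1/21 0:00", "4/23/21 16:49"])

# Assumed layout: month/day/two-digit year, then hour:minute.
parsed_dates = pd.to_datetime(raw_dates, format="%m/%d/%y %H:%M")

print(parsed_dates.dt.year.tolist())
print(parsed_dates.dt.month.tolist())
print(parsed_dates.dt.hour.tolist())
```

Once parsed, components such as hour, weekday, or month can be derived through the `.dt` accessor.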

In [28]:
crimes_df.describe()

Unnamed: 0,ID,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude
count,207351.0,207351.0,207351.0,207340.0,207351.0,202386.0,202386.0,207351.0,202386.0,202386.0
mean,12377570.0,1151.009973,11.281084,23.14277,37.098442,1165116.0,1885762.0,2021.0,41.842116,-87.66961
std,775615.3,699.192416,6.988373,13.920121,21.646216,16554.49,32011.93,0.0,0.088039,0.060254
min,25699.0,111.0,1.0,1.0,1.0,1091242.0,1813909.0,2021.0,41.644608,-87.939733
25%,12342730.0,611.0,6.0,10.0,23.0,1153359.0,1858110.0,2021.0,41.765975,-87.712323
50%,12424980.0,1031.0,10.0,24.0,32.0,1166968.0,1891278.0,2021.0,41.857211,-87.662955
75%,12505900.0,1722.0,17.0,34.0,55.0,1176822.0,1909231.0,2021.0,41.906736,-87.626743
max,12878020.0,2535.0,31.0,50.0,77.0,1205119.0,1951499.0,2021.0,42.022548,-87.524529


### Data cleaning

Converting the column names to a standard form and handling inconsistencies

In [29]:
crimes_df.columns = crimes_df.columns.str.strip()
crimes_df.columns = crimes_df.columns.str.replace(' ', '_')
crimes_df.columns = crimes_df.columns.str.lower()

- ID and Case Number are unique identifiers for each record and carry no predictive value, so these columns can be dropped.
- Location is simply the pair (Latitude, Longitude) stored as a string. Since Latitude and Longitude are already separate columns, Location is redundant and can be dropped.
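
The redundancy of the Location column can be checked before dropping it. The following is a rough sketch on two rows copied from the preview above; the regular expression and the tolerance of `1e-5` are assumptions chosen to absorb the rounding visible in the Latitude and Longitude columns.

```python
import pandas as pd

# Two rows copied from the preview, using the original column names.
sample = pd.DataFrame({
    "Latitude": [41.827682, 41.712933],
    "Longitude": [-87.665496, -87.536489],
    "Location": ["(41.827681913, -87.665496311)", "(41.712932999, -87.53648903)"],
})

# Pull the two numbers out of the "(lat, lon)" string.
parsed = sample["Location"].str.extract(r"\(\s*([-\d.]+),\s*([-\d.]+)\)").astype(float)
parsed.columns = ["lat", "lon"]

# Compare with the numeric columns, allowing for display rounding.
lat_matches = (parsed["lat"] - sample["Latitude"]).abs() < 1e-5
lon_matches = (parsed["lon"] - sample["Longitude"]).abs() < 1e-5
is_redundant = bool((lat_matches & lon_matches).all())
print(is_redundant)
```

Running the same comparison on the full frame (ignoring rows where the coordinates are missing) would confirm the claim for the whole dataset.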

In [30]:
crimes_df.drop(['id','case_number','location'], axis = 1, inplace = True)

Checking if there are any null values

In [31]:
crimes_df.isna().sum()

date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description     843
arrest                     0
domestic                   0
beat                       0
district                   0
ward                      11
community_area             0
fbi_code                   0
x_coordinate            4965
y_coordinate            4965
year                       0
updated_on                 0
latitude                4965
longitude               4965
dtype: int64

- Latitude, longitude and ward are crucial for locating a crime, so rows where any of them is missing contribute little to the dataset. We can drop these rows.
- Location description is not a mandatory column and a missing value would not affect our model, so we don't have to delete those rows. We can replace the missing values with 'Unavailable'.

In [32]:
crimes_df.dropna(subset = ['latitude','longitude','ward'], inplace = True)
crimes_df.reset_index(drop = True, inplace = True)
crimes_df['location_description'] = crimes_df['location_description'].fillna('Unavailable')
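
For illustration, the same cleaning step can also be written as a function that returns a new frame instead of modifying it in place. This is only a sketch on made-up toy data; the function name `clean_missing` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up toy rows with the same column names as the cleaned frame.
toy_df = pd.DataFrame({
    "latitude": [41.8, np.nan, 41.9, 42.0],
    "longitude": [-87.6, np.nan, -87.7, -87.8],
    "ward": [11.0, 10.0, np.nan, 49.0],
    "location_description": ["STREET", "APARTMENT", "RESIDENCE", None],
})

def clean_missing(df):
    """Drop rows lacking location data, then label missing descriptions."""
    cleaned = (
        df.dropna(subset=["latitude", "longitude", "ward"])
          .reset_index(drop=True)
    )
    cleaned["location_description"] = (
        cleaned["location_description"].fillna("Unavailable")
    )
    return cleaned

cleaned_df = clean_missing(toy_df)
print(cleaned_df)
```

Returning a new frame leaves the input untouched, which makes it easier to rerun the cell or compare row counts before and after.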

In [33]:
crimes_df.isna().sum()

date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
district                0
ward                    0
community_area          0
fbi_code                0
x_coordinate            0
y_coordinate            0
year                    0
updated_on              0
latitude                0
longitude               0
dtype: int64

In [34]:
crimes_df.shape

(202375, 19)

We removed 3 columns and 4976 rows (207351 - 202375 = 4976). This amounts to about 2.4% of the initial dataset. Since this share is very low, we are unlikely to have lost important insights.
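
The share of removed rows can be recomputed directly from the two shapes reported above. The variable names are illustrative.

```python
# Row counts taken from the two crimes_df.shape outputs.
rows_before = 207351
rows_after = 202375

rows_removed = rows_before - rows_after
pct_removed = 100 * rows_removed / rows_before

print(f"{rows_removed} rows removed ({pct_removed:.1f}% of the initial dataset)")
```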

### Data enhancements