# **Chicago crime analysis**

In [23]:
## Importing required libraries
import pandas as pd

Dataset source: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data

## Importing the Dataset

In [24]:
## Reading the data
crimes = pd.read_csv('Crimes_-_2001_to_Present_20241029.csv')

In [27]:
crimes.head()

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11037294 | JA371270 | 03/18/2015 12:00:00 PM | 0000X W WACKER DR | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | BANK | False | False | ... | 42.0 | 32.0 | 11 | | | 2015 | 08/01/2017 03:52:26 PM | | | |
| 1 | 11646293 | JC213749 | 12/20/2018 03:00:00 PM | 023XX N LOCKWOOD AVE | 1154 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT $300 AND UNDER | APARTMENT | False | False | ... | 36.0 | 19.0 | 11 | | | 2018 | 04/06/2019 04:04:43 PM | | | |
| 2 | 11645836 | JC212333 | 05/01/2016 12:25:00 AM | 055XX S ROCKWELL ST | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | | False | False | ... | 15.0 | 63.0 | 11 | | | 2016 | 04/06/2019 04:04:43 PM | | | |
| 3 | 11645959 | JC211511 | 12/20/2018 04:00:00 PM | 045XX N ALBANY AVE | 2820 | OTHER OFFENSE | TELEPHONE THREAT | RESIDENCE | False | False | ... | 33.0 | 14.0 | 08A | | | 2018 | 04/06/2019 04:04:43 PM | | | |
| 4 | 11645601 | JC212935 | 06/01/2014 12:01:00 AM | 087XX S SANGAMON ST | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | RESIDENCE | False | False | ... | 21.0 | 71.0 | 11 | | | 2014 | 04/06/2019 04:04:43 PM | | | |


In [25]:
crimes.shape

(8187014, 22)

The crimes dataset contains 22 columns and 8,187,014 rows.

In [26]:
## checking columns and their datatypes
crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8187014 entries, 0 to 8187013
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(1

The following are short descriptions of each column, according to the data source:

- **ID**: Unique identifier for the record.
- **Case Number**: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
- **Date**: Date when the incident occurred. This is sometimes a best estimate.
- **Block**: The partially redacted address where the incident occurred, placing it on the same block as the actual address.
- **IUCR**: The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.
- **Primary Type**: The primary description of the IUCR code.
- **Description**: The secondary description of the IUCR code, a subcategory of the primary description.
- **Location Description**: Description of the location where the incident occurred.
- **Arrest**: Indicates whether an arrest was made.
- **Domestic**: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- **Beat**: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.
- **District**: Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.
- **Ward**: The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.
- **Community Area**: Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.
- **FBI Code**: Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at https://gis.chicagopolice.org/pages/crime_details.
- **X Coordinate**: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
- **Y Coordinate**: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
- **Year**: Year the incident occurred.
- **Updated On**: Date and time the record was last updated.
- **Latitude**: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- **Longitude**: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- **Location**: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.


## Removing unnecessary columns

Based on the descriptions above, we plan to remove the following columns:
- ID: not required for visualizations
- Case Number: we will check it for duplicates before removing it
- Block: masked address of the incident, so we can remove it
- IUCR: this is the code behind the 'Primary Type' and 'Description' columns. Since that information is already present there, we can remove this column
- Beat: this adds little to the visualizations, as it only gives the beat number for each latitude and longitude
- X Coordinate: since we have the latitude information, this can be removed
- Y Coordinate: since we have the longitude information, this can be removed
- Year: since the year is already contained in the Date column, this can also be removed
- Updated On: record update time; not required for visualization or analysis
- Location: this is just the combination of Latitude and Longitude, so we can remove it
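
Before dropping Year, the claim that it is redundant with Date can be checked. The following is a minimal sketch on a toy frame, using two Date strings copied from the `crimes.head()` output; the `sample` frame and the explicit format string are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame: two Date strings taken from the crimes.head() output above.
sample = pd.DataFrame({
    "Date": ["03/18/2015 12:00:00 PM", "12/20/2018 03:00:00 PM"],
    "Year": [2015, 2018],
})

# Explicit format: month/day/year with a 12-hour clock and an AM/PM marker.
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")

# The year derived from Date should equal the Year column on every row.
year_matches = bool((parsed.dt.year == sample["Year"]).all())
print(year_matches)
```

On the full dataset the same comparison could be run against `crimes["Date"]` and `crimes["Year"]`; passing an explicit `format` generally avoids slow per-row format inference on a column of this size.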

In [30]:
## duplicate check for Case number column
if len(crimes['Case Number'].unique()) == len(crimes['Case Number']):
    print("No duplicates in Case number Column")
else:
    print("There are duplicates in Case Number column")    

There are duplicates in Case Number column


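The check reports duplicates in the Case Number column, so some records share a case number and the cases in the data are not all unique. We still remove the Case Number column in the next step because it is not needed for the visualizations, but the repeated case numbers are worth keeping in mind when counting incidents.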
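
The check above only reports whether duplicates exist. To see which records share a case number, a sketch like the following could be run before the column is dropped. The `toy` frame and its case numbers are hypothetical and only illustrate the `duplicated(keep=False)` pattern.

```python
import pandas as pd

# Hypothetical toy data; the case numbers are made up for illustration.
toy = pd.DataFrame({
    "Case Number": ["A1", "B2", "A1", "C3"],
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "ASSAULT"],
})

# keep=False flags every row of a duplicated group, not only the repeats.
dup_mask = toy["Case Number"].duplicated(keep=False)
dup_rows = toy[dup_mask].sort_values("Case Number")

# Number of rows whose case number appears more than once.
n_dup_rows = int(dup_mask.sum())
print(n_dup_rows)
print(dup_rows)
```

Applied to `crimes`, the resulting rows would show whether the repeated case numbers are exact duplicate records or distinct records that happen to share a case number.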

In [None]:
## removing the above mentioned columns
crimes.drop(columns=['ID', 'Case Number', 'Block', 'IUCR', 'Beat', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Location'], inplace=True)
crimes.head()

| | Date | Primary Type | Description | Location Description | Arrest | Domestic | District | Ward | Community Area | FBI Code | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 03/18/2015 12:00:00 PM | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | BANK | False | False | 1.0 | 42.0 | 32.0 | 11 | | |
| 1 | 12/20/2018 03:00:00 PM | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT $300 AND UNDER | APARTMENT | False | False | 25.0 | 36.0 | 19.0 | 11 | | |
| 2 | 05/01/2016 12:25:00 AM | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | | False | False | 8.0 | 15.0 | 63.0 | 11 | | |
| 3 | 12/20/2018 04:00:00 PM | OTHER OFFENSE | TELEPHONE THREAT | RESIDENCE | False | False | 17.0 | 33.0 | 14.0 | 08A | | |
| 4 | 06/01/2014 12:01:00 AM | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | RESIDENCE | False | False | 22.0 | 21.0 | 71.0 | 11 | | |


In [None]:
## checking for unique values in each column
for column in crimes.columns:
    print(f"unique values in {column} are: {crimes[column].unique()}")

unique values in Date are: ['03/18/2015 12:00:00 PM' '12/20/2018 03:00:00 PM'
 '05/01/2016 12:25:00 AM' ... '05/06/2020 09:51:00 AM'
 '09/14/2020 06:13:00 PM' '07/27/2020 03:02:00 PM']
unique values in Primary Type are: ['DECEPTIVE PRACTICE' 'OTHER OFFENSE' 'THEFT' 'BATTERY' 'ASSAULT'
 'WEAPONS VIOLATION' 'INTERFERENCE WITH PUBLIC OFFICER' 'SEX OFFENSE'
 'BURGLARY' 'NARCOTICS' 'LIQUOR LAW VIOLATION' 'CRIM SEXUAL ASSAULT'
 'MOTOR VEHICLE THEFT' 'CRIMINAL DAMAGE' 'OFFENSE INVOLVING CHILDREN'
 'CRIMINAL TRESPASS' 'ROBBERY' 'PUBLIC PEACE VIOLATION'
 'CRIMINAL SEXUAL ASSAULT' 'PROSTITUTION' 'STALKING' 'HOMICIDE'
 'KIDNAPPING' 'ARSON' 'CONCEALED CARRY LICENSE VIOLATION' 'GAMBLING'
 'OBSCENITY' 'INTIMIDATION' 'OTHER NARCOTIC VIOLATION' 'PUBLIC INDECENCY'
 'NON-CRIMINAL' 'HUMAN TRAFFICKING' 'RITUALISM' 'DOMESTIC VIOLENCE'
 'NON-CRIMINAL (SUBJECT SPECIFIED)' 'NON - CRIMINAL']
unique values in Description are: ['FINANCIAL IDENTITY THEFT OVER $ 300'
 'FINANCIAL IDENTITY THEFT $300 AND UNDER' 'TEL
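
The Primary Type values listed above include labels that look like spelling variants of one another, for example 'CRIM SEXUAL ASSAULT' and 'CRIMINAL SEXUAL ASSAULT', or 'NON-CRIMINAL' and 'NON - CRIMINAL'. One possible cleanup is sketched below. The mapping in `variant_map` is an assumption that these variants denote the same category, which should be confirmed against the data source before it is applied to `crimes`.

```python
import pandas as pd

# Labels copied from the unique-value output above.
primary = pd.Series(
    [
        "CRIM SEXUAL ASSAULT",
        "CRIMINAL SEXUAL ASSAULT",
        "NON-CRIMINAL",
        "NON - CRIMINAL",
        "THEFT",
    ],
    name="Primary Type",
)

# Assumed mapping: each key is treated as a variant of its value.
variant_map = {
    "CRIM SEXUAL ASSAULT": "CRIMINAL SEXUAL ASSAULT",
    "NON - CRIMINAL": "NON-CRIMINAL",
}

# replace() with a dict swaps exact matches and leaves other labels untouched.
cleaned = primary.replace(variant_map)
print(cleaned.value_counts())
```

The label 'NON-CRIMINAL (SUBJECT SPECIFIED)' is deliberately left out of the mapping here, since it may be a distinct category.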

In [34]:
## checking for null values in each column
for column in crimes.columns:
    print(f'Sum of null values in {column} are: {crimes[column].isnull().sum()}')

Sum of null values in Date are: 0
Sum of null values in Primary Type are: 0
Sum of null values in Description are: 0
Sum of null values in Location Description are: 13674
Sum of null values in Arrest are: 0
Sum of null values in Domestic are: 0
Sum of null values in District are: 47
Sum of null values in Ward are: 614830
Sum of null values in Community Area are: 613454
Sum of null values in FBI Code are: 0
Sum of null values in Latitude are: 90049
Sum of null values in Longitude are: 90049
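
Raw counts are easier to judge relative to the 8,187,014 rows: Ward is missing in about 7.5% of records (614830 / 8187014), while Latitude and Longitude are missing in about 1.1% (90049 / 8187014). A compact way to get counts and percentages side by side is sketched below on a hypothetical toy frame; `toy` and `null_summary` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `crimes`.
toy = pd.DataFrame({
    "Ward": [42.0, np.nan, 15.0, np.nan],
    "Latitude": [41.88, 41.92, np.nan, 41.75],
    "Arrest": [False, False, True, False],
})

# isnull().sum() counts missing values per column;
# isnull().mean() gives the missing fraction, scaled here to a percentage.
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)
print(null_summary)
```

Replacing `toy` with `crimes` would produce the same kind of table for the full dataset, sorted so that the columns with the most missing values appear first.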
