### EDA on the Chicago Crime Data

#### What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA), also known as data exploration, is the step in the data analysis process where a number of techniques are used to better understand the dataset at hand.

"Understanding the dataset" can mean several things, including but not limited to:

- Extracting important variables and leaving behind useless ones
- Identifying outliers, missing values, or human error
- Understanding the relationships, or lack of them, between variables
- Ultimately, maximizing your insight into the dataset and minimizing errors that may occur later in the process

EDA is like the Sherlock Holmes of data science. It involves delving deep into datasets, using visual methods as its magnifying glass, to highlight significant attributes. EDA helps us unearth hidden patterns, relationships, and trends within the data before we even start modeling.
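
As a rough illustration of the checks listed above, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are hypothetical and do not come from the Chicago data. It counts missing values, flags outliers with the common 1.5 × IQR rule, and measures the relationship between two variables with a correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "score": [10.0, 20.0, np.nan, 40.0, 50.0, 600.0],
})

# Missing values per column
missing = toy.isnull().sum()

# Outliers: values more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = toy["score"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["score"] < q1 - 1.5 * iqr) | (toy["score"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

# Relationship between two variables (Pearson correlation, NaN pairs are skipped)
corr = toy["hours"].corr(toy["score"])

print(missing)
print(outliers)
print(corr)
```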

Let's dive into EDA using Chicago crime data as our case study. We'll use Pandas to explore and summarize the dataset, starting with a basic analysis built on these four commands:

- df.head()
- df.shape
- df.info()
- df.describe()

First, we'll import the libraries we will need, followed by the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv(r"C:\Users\admin\Downloads\crime_data_chicago.csv").drop('Unnamed: 0',axis=1)
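
An `Unnamed: 0` column usually appears when a DataFrame was saved to CSV together with its index. As an alternative to dropping it after loading, `read_csv` can be told to treat the first column as the index. The sketch below uses a small in-memory CSV as a stand-in for the real file; the two sample rows are made up.

```python
import io
import pandas as pd

# Made-up CSV text whose first header field is empty, as in a file saved with its index
csv_text = ",ID,Primary Type\n0,101,THEFT\n1,102,BATTERY\n"

# Option 1: load everything, then drop the stray index column
dropped = pd.read_csv(io.StringIO(csv_text)).drop("Unnamed: 0", axis=1)

# Option 2: use the first column as the index while loading
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dropped.columns.tolist())
print(indexed.columns.tolist())
```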

In [3]:
df.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,6407111,HP485721,07/26/2008 02:30:00 PM,085XX S MUSKEGON AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,10.0,46.0,14,1196638.0,1848800.0,2008,02/28/2018 03:56:25 PM,41.73998,-87.55512,"(41.739979622, -87.555120042)"
1,11398199,JB372830,07/31/2018 10:57:00 AM,092XX S ELLIS AVE,143C,WEAPONS VIOLATION,UNLAWFUL POSS AMMUNITION,POOL ROOM,True,False,...,8.0,47.0,15,1184499.0,1843935.0,2018,08/07/2018 04:02:59 PM,41.726922,-87.599747,"(41.726922145, -87.599746995)"
2,5488785,HN308568,04/27/2007 10:30:00 AM,062XX N TRIPP AVE,0610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,True,False,...,39.0,12.0,05,1146911.0,1941022.0,2007,02/28/2018 03:56:25 PM,41.994138,-87.734959,"(41.994137622, -87.734959049)"
3,11389116,JB361368,07/23/2018 08:55:00 AM,0000X N KEELER AVE,0560,ASSAULT,SIMPLE,NURSING HOME/RETIREMENT HOME,False,False,...,28.0,26.0,08A,1148388.0,1899882.0,2018,07/30/2018 03:52:24 PM,41.881217,-87.73059,"(41.881217483, -87.730589961)"
4,12420431,JE297624,07/11/2021 06:40:00 AM,016XX W HARRISON ST,051A,ASSAULT,AGGRAVATED - HANDGUN,PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,...,27.0,28.0,04A,1165430.0,1897441.0,2021,07/18/2021 04:56:02 PM,41.874174,-87.668082,"(41.874173691, -87.668082118)"


In [4]:
df.shape

(2278726, 22)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2278726 entries, 0 to 2278725
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(10)

In [6]:
df.describe()

| | ID | Beat | District | Ward | Community Area | X Coordinate | Y Coordinate | Year | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2278726.0 | 2278726.0 | 2278714.0 | 2094031.0 | 2094459.0 | 2254741.0 | 2254741.0 | 2278726.0 | 2254741.0 | 2254741.0 |
| mean | 6882068.0 | 1186.442 | 11.29072 | 22.72764 | 37.5214 | 1164569.0 | 1885747.0 | 2009.638 | 41.84209 | -87.67161 |
| std | 3419168.0 | 702.6836 | 6.946692 | 13.83464 | 21.53282 | 16739.55 | 32098.55 | 6.019724 | 0.08830434 | 0.06073538 |
| min | 637.0 | 111.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2001.0 | 36.61945 | -91.68657 |
| 25% | 3716076.0 | 621.0 | 6.0 | 10.0 | 23.0 | 1152948.0 | 1859053.0 | 2004.0 | 41.76866 | -87.71379 |
| 50% | 6885990.0 | 1034.0 | 10.0 | 23.0 | 32.0 | 1166060.0 | 1890673.0 | 2009.0 | 41.85578 | -87.66597 |
| 75% | 9887568.0 | 1731.0 | 17.0 | 34.0 | 57.0 | 1176365.0 | 1909219.0 | 2014.0 | 41.90668 | -87.62823 |
| max | 12781990.0 | 2535.0 | 31.0 | 50.0 | 77.0 | 1205119.0 | 1951622.0 | 2022.0 | 42.02291 | -87.52453 |
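
The summary hints at a few suspicious records: the minimum `Latitude` (36.61945) and `Longitude` (-91.68657) sit far from the quartiles, and the minimum `X Coordinate` and `Y Coordinate` are 0.0. One way to flag such rows is a simple bounding-box check. The sketch below runs on three toy rows, and the bounds are assumptions chosen for illustration, not official city limits. The second row pairs the two minimum values from the summary purely as an example; they may not belong to the same record.

```python
import pandas as pd

# Toy rows for illustration; the second row is deliberately out of range
toy = pd.DataFrame({
    "Latitude": [41.73998, 36.61945, 41.994138],
    "Longitude": [-87.55512, -91.68657, -87.734959],
})

# Assumed, illustrative bounds (not an authoritative definition of the city)
LAT_MIN, LAT_MAX = 41.6, 42.1
LON_MIN, LON_MAX = -88.0, -87.5

# True where both coordinates fall inside the box
in_box = (
    toy["Latitude"].between(LAT_MIN, LAT_MAX)
    & toy["Longitude"].between(LON_MIN, LON_MAX)
)

# Rows that fall outside the box deserve a closer look
suspicious = toy[~in_box]
print(suspicious)
```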


##### Null values for each feature can also be checked by using the following command:

In [7]:
#checking for null values
df.isnull().sum() 

ID                           0
Case Number                  1
Date                         0
Block                        0
IUCR                         0
Primary Type                 0
Description                  0
Location Description      2877
Arrest                       0
Domestic                     0
Beat                         0
District                    12
Ward                    184695
Community Area          184267
FBI Code                     0
X Coordinate             23985
Y Coordinate             23985
Year                         0
Updated On                   0
Latitude                 23985
Longitude                23985
Location                 23985
dtype: int64
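
Raw counts are easier to judge as a share of all rows. For example, the 184,695 missing `Ward` values are roughly 8% of the 2,278,726 rows. A minimal sketch of computing that share for every column, shown on a made-up four-row frame:

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "Ward": [10.0, np.nan, 39.0, np.nan],
    "District": [4.0, 4.0, 17.0, 11.0],
})

# isnull() gives booleans; their mean is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same expression works on the full DataFrame: `df.isnull().mean() * 100`.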

In [8]:
# Drop columns
df.drop(columns=['Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude'], inplace=True)

In [9]:
df.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Year,Updated On,Location
0,6407111,HP485721,07/26/2008 02:30:00 PM,085XX S MUSKEGON AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,423,4.0,2008,02/28/2018 03:56:25 PM,"(41.739979622, -87.555120042)"
1,11398199,JB372830,07/31/2018 10:57:00 AM,092XX S ELLIS AVE,143C,WEAPONS VIOLATION,UNLAWFUL POSS AMMUNITION,POOL ROOM,True,False,413,4.0,2018,08/07/2018 04:02:59 PM,"(41.726922145, -87.599746995)"
2,5488785,HN308568,04/27/2007 10:30:00 AM,062XX N TRIPP AVE,0610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,True,False,1711,17.0,2007,02/28/2018 03:56:25 PM,"(41.994137622, -87.734959049)"
3,11389116,JB361368,07/23/2018 08:55:00 AM,0000X N KEELER AVE,0560,ASSAULT,SIMPLE,NURSING HOME/RETIREMENT HOME,False,False,1115,11.0,2018,07/30/2018 03:52:24 PM,"(41.881217483, -87.730589961)"
4,12420431,JE297624,07/11/2021 06:40:00 AM,016XX W HARRISON ST,051A,ASSAULT,AGGRAVATED - HANDGUN,PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,1231,12.0,2021,07/18/2021 04:56:02 PM,"(41.874173691, -87.668082118)"


In [10]:
df.isnull().sum()

ID                          0
Case Number                 1
Date                        0
Block                       0
IUCR                        0
Primary Type                0
Description                 0
Location Description     2877
Arrest                      0
Domestic                    0
Beat                        0
District                   12
Year                        0
Updated On                  0
Location                23985
dtype: int64

In [11]:
df['Location']

0          (41.739979622, -87.555120042)
1          (41.726922145, -87.599746995)
2          (41.994137622, -87.734959049)
3          (41.881217483, -87.730589961)
4          (41.874173691, -87.668082118)
                       ...              
2278721    (41.893646656, -87.631177143)
2278722    (41.887188151, -87.757163155)
2278723     (41.82272748, -87.607863136)
2278724    (41.893983593, -87.634677382)
2278725     (41.91109424, -87.692122762)
Name: Location, Length: 2278726, dtype: object

In [12]:
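# Fill missing 'Location' values by carrying the previous row's value forward.
# ffill() returns a new Series, so the result has to be assigned back to the column.
df['Location'] = df['Location'].ffill()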

In [13]:
df['Location'].isnull()

0          False
1          False
2          False
3          False
4          False
           ...  
2278721    False
2278722    False
2278723    False
2278724    False
2278725    False
Name: Location, Length: 2278726, dtype: bool
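
Printing the full boolean Series only shows the first and last few rows, so a missing value in the middle would go unnoticed. A more compact check is to sum it, which counts the `True` entries. The sketch below, on made-up values, also shows that `ffill()` returns a new Series and leaves the original column untouched until the result is assigned back.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({"Location": ["(41.7, -87.5)", np.nan, "(41.9, -87.7)", np.nan]})

# ffill() does not modify the column in place; it returns a filled copy
filled = toy["Location"].ffill()
before = toy["Location"].isnull().sum()   # the original column still has gaps

# Assign the filled copy back, then re-check
toy["Location"] = filled
after = toy["Location"].isnull().sum()

print(before, after)
```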

In [14]:
# Check the columns and dtypes after cleaning.
# info is a method and must be called with parentheses; a bare `df.info` only echoes the bound method and a dump of the frame.
df.info()

In [15]:
# Inspect the Date column
df['Date'].head()

0    07/26/2008 02:30:00 PM
1    07/31/2018 10:57:00 AM
2    04/27/2007 10:30:00 AM
3    07/23/2018 08:55:00 AM
4    07/11/2021 06:40:00 AM
Name: Date, dtype: object

In [16]:
# Check the dtype of the Date column ('O' means object, i.e. the dates are stored as strings)
df['Date'].dtype

dtype('O')

In [17]:
# Function to extract and map months
def extract_and_map_month(date_column):
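    # Convert 'Date' column to datetime, with an explicit format matching values like '07/26/2008 02:30:00 PM'
    date_column = pd.to_datetime(date_column, format='%m/%d/%Y %I:%M:%S %p')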
    
    # Extract month
    month = date_column.dt.month
    
    # use a dictionary to map month numbers to month names
    month_map = {
        1: 'January',
        2: 'February',
        3: 'March',
        4: 'April',
        5: 'May',
        6: 'June',
        7: 'July',
        8: 'August',
        9: 'September',
        10: 'October',
        11: 'November',
        12: 'December',
    }
    
    return month.map(month_map)

# Add a new column 'Month' with mapped month values
df['Month'] = extract_and_map_month(df['Date']) 


In [18]:
# Display the first rows, now including the new 'Month' column
df.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Year,Updated On,Location,Month
0,6407111,HP485721,07/26/2008 02:30:00 PM,085XX S MUSKEGON AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,423,4.0,2008,02/28/2018 03:56:25 PM,"(41.739979622, -87.555120042)",July
1,11398199,JB372830,07/31/2018 10:57:00 AM,092XX S ELLIS AVE,143C,WEAPONS VIOLATION,UNLAWFUL POSS AMMUNITION,POOL ROOM,True,False,413,4.0,2018,08/07/2018 04:02:59 PM,"(41.726922145, -87.599746995)",July
2,5488785,HN308568,04/27/2007 10:30:00 AM,062XX N TRIPP AVE,0610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,True,False,1711,17.0,2007,02/28/2018 03:56:25 PM,"(41.994137622, -87.734959049)",April
3,11389116,JB361368,07/23/2018 08:55:00 AM,0000X N KEELER AVE,0560,ASSAULT,SIMPLE,NURSING HOME/RETIREMENT HOME,False,False,1115,11.0,2018,07/30/2018 03:52:24 PM,"(41.881217483, -87.730589961)",July
4,12420431,JE297624,07/11/2021 06:40:00 AM,016XX W HARRISON ST,051A,ASSAULT,AGGRAVATED - HANDGUN,PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,1231,12.0,2021,07/18/2021 04:56:02 PM,"(41.874173691, -87.668082118)",July


In [19]:
# Function to extract and map days of the week
def extract_and_map_days(date_column):
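    # Convert 'Date' column to datetime, with an explicit format matching values like '07/26/2008 02:30:00 PM'
    date_column = pd.to_datetime(date_column, format='%m/%d/%Y %I:%M:%S %p')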
    
    # Extract day of the week (0=Monday, 6=Sunday)
    days = date_column.dt.dayofweek
    
    # Map day of the week to day names
    day_map = {
        0: 'Monday',
        1: 'Tuesday',
        2: 'Wednesday',
        3: 'Thursday',
        4: 'Friday',
        5: 'Saturday',
        6: 'Sunday'
    }
    
    return days.map(day_map)

# Add a new column 'Days' with mapped day values
df['Days'] = extract_and_map_days(df['Date'])
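
The two helper functions spell out the mapping by hand, which makes the logic easy to follow. pandas also provides `dt.month_name()` and `dt.day_name()`, which return the same English names directly. A minimal sketch on two date strings taken from the rows shown earlier; it parses the column once and reuses the result:

```python
import pandas as pd

# Two date strings copied from the first rows of the dataset
dates = pd.Series(["07/26/2008 02:30:00 PM", "04/27/2007 10:30:00 AM"])

# Parse once, with an explicit format matching the strings above
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")

# Built-in accessors for month and weekday names
months = parsed.dt.month_name()
days = parsed.dt.day_name()

print(months.tolist())
print(days.tolist())
```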

In [20]:
# Display the first rows, now including the new 'Days' (day of the week) column
df.head(5) 

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Year,Updated On,Location,Month,Days
0,6407111,HP485721,07/26/2008 02:30:00 PM,085XX S MUSKEGON AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,423,4.0,2008,02/28/2018 03:56:25 PM,"(41.739979622, -87.555120042)",July,Saturday
1,11398199,JB372830,07/31/2018 10:57:00 AM,092XX S ELLIS AVE,143C,WEAPONS VIOLATION,UNLAWFUL POSS AMMUNITION,POOL ROOM,True,False,413,4.0,2018,08/07/2018 04:02:59 PM,"(41.726922145, -87.599746995)",July,Tuesday
2,5488785,HN308568,04/27/2007 10:30:00 AM,062XX N TRIPP AVE,0610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,True,False,1711,17.0,2007,02/28/2018 03:56:25 PM,"(41.994137622, -87.734959049)",April,Friday
3,11389116,JB361368,07/23/2018 08:55:00 AM,0000X N KEELER AVE,0560,ASSAULT,SIMPLE,NURSING HOME/RETIREMENT HOME,False,False,1115,11.0,2018,07/30/2018 03:52:24 PM,"(41.881217483, -87.730589961)",July,Monday
4,12420431,JE297624,07/11/2021 06:40:00 AM,016XX W HARRISON ST,051A,ASSAULT,AGGRAVATED - HANDGUN,PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,1231,12.0,2021,07/18/2021 04:56:02 PM,"(41.874173691, -87.668082118)",July,Sunday


### Crime counts by months and weeks

In [21]:
crime_counts_by_month=df.groupby('Month')['Primary Type'].count()

crime_counts_by_month

Month
April        187474
August       206307
December     168105
February     158580
January      180174
July         213727
June         205061
March        188763
May          204682
November     176586
October      195735
September    193532
Name: Primary Type, dtype: int64
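
Because the month names are strings, `groupby` sorts them alphabetically, which makes seasonal patterns harder to read. One option is to reindex the result into calendar order. The sketch below uses three of the counts from the output above to keep it short; `month_order` is a helper name introduced here for illustration.

```python
import calendar
import pandas as pd

# Three counts copied from the output above, in alphabetical order
counts = pd.Series(
    {"April": 187474, "February": 158580, "January": 180174},
    name="Primary Type",
)

# calendar.month_name starts with an empty string, so skip the first entry
month_order = list(calendar.month_name)[1:]

# Keep only the months that are present, in calendar order
ordered = counts.reindex([m for m in month_order if m in counts.index])
print(ordered)
```

On the full result, `crime_counts_by_month.reindex(month_order)` would apply the same ordering to all twelve months.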

In [22]:
# Find the month with the highest crime count
highest_crime_month = crime_counts_by_month.idxmax()

print("Month with the highest crime count:", highest_crime_month)

Month with the highest crime count: July


In [23]:
lowest_crime_month = crime_counts_by_month.idxmin()

print("Month with the lowest crime count:", lowest_crime_month)

Month with the lowest crime count: February


In [24]:
highest_crime_location=df.groupby('Location Description')['Primary Type'].count()

highest_crime_location

Location Description
ABANDONED BUILDING                                 3483
AIRCRAFT                                            244
AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA     370
AIRPORT BUILDING NON-TERMINAL - SECURE AREA         229
AIRPORT EXTERIOR - NON-SECURE AREA                  292
                                                   ... 
VEHICLE-COMMERCIAL - TROLLEY BUS                      1
VESTIBULE                                             7
WAREHOUSE                                          3066
WOODED AREA                                           4
YARD                                                 79
Name: Primary Type, Length: 198, dtype: int64
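
With 198 location types, the alphabetical listing hides which locations matter most. Sorting by count, or asking for the largest few, makes that visible. The sketch below uses five of the counts from the output above, so its result describes only this subset, not the busiest locations in the full dataset.

```python
import pandas as pd

# A few counts copied from the output above (a subset, for illustration only)
location_counts = pd.Series(
    {
        "ABANDONED BUILDING": 3483,
        "AIRCRAFT": 244,
        "VESTIBULE": 7,
        "WAREHOUSE": 3066,
        "YARD": 79,
    },
    name="Primary Type",
)

# The three largest counts, in descending order
top3 = location_counts.nlargest(3)

# Label of the single largest count within this subset
busiest = location_counts.idxmax()

print(top3)
print(busiest)
```

On the full result, the same calls would be `highest_crime_location.nlargest(10)` and `highest_crime_location.idxmax()`.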