## Circle 13 Chicago Dataset Exploratory Data Analysis
#### A Circle 13 project: exploratory data analysis of the Chicago crime dataset
#### Members
* Otim William Gerison
* Roddiyyat Nasirudeen Taiwo
* Okafor Brian

### 1. Data Preparation 

In [41]:
#Importing the necessary libraries for EDA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [42]:
#Raw string so the backslashes in the Windows path are not read as escape sequences
file_path = r"e:\AltSchool\Circle-13\Chicago Data\crime_data_chicago.csv"

In [43]:
#Reading the csv file into a dataframe
df = pd.read_csv(file_path)

In [44]:
#Previewing the Dataset
df.head()

| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6407111 | HP485721 | 07/26/2008 02:30:00 PM | 085XX S MUSKEGON AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | False | ... | 10.0 | 46.0 | 14 | 1196638.0 | 1848800.0 | 2008 | 02/28/2018 03:56:25 PM | 41.73998 | -87.55512 | (41.739979622, -87.555120042) |
| 1 | 1 | 11398199 | JB372830 | 07/31/2018 10:57:00 AM | 092XX S ELLIS AVE | 143C | WEAPONS VIOLATION | UNLAWFUL POSS AMMUNITION | POOL ROOM | True | ... | 8.0 | 47.0 | 15 | 1184499.0 | 1843935.0 | 2018 | 08/07/2018 04:02:59 PM | 41.726922 | -87.599747 | (41.726922145, -87.599746995) |
| 2 | 2 | 5488785 | HN308568 | 04/27/2007 10:30:00 AM | 062XX N TRIPP AVE | 0610 | BURGLARY | FORCIBLE ENTRY | RESIDENCE | True | ... | 39.0 | 12.0 | 05 | 1146911.0 | 1941022.0 | 2007 | 02/28/2018 03:56:25 PM | 41.994138 | -87.734959 | (41.994137622, -87.734959049) |
| 3 | 3 | 11389116 | JB361368 | 07/23/2018 08:55:00 AM | 0000X N KEELER AVE | 0560 | ASSAULT | SIMPLE | NURSING HOME/RETIREMENT HOME | False | ... | 28.0 | 26.0 | 08A | 1148388.0 | 1899882.0 | 2018 | 07/30/2018 03:52:24 PM | 41.881217 | -87.73059 | (41.881217483, -87.730589961) |
| 4 | 4 | 12420431 | JE297624 | 07/11/2021 06:40:00 AM | 016XX W HARRISON ST | 051A | ASSAULT | AGGRAVATED - HANDGUN | PARKING LOT / GARAGE (NON RESIDENTIAL) | False | ... | 27.0 | 28.0 | 04A | 1165430.0 | 1897441.0 | 2021 | 07/18/2021 04:56:02 PM | 41.874174 | -87.668082 | (41.874173691, -87.668082118) |


In [45]:
df.shape

(2278726, 23)

In [46]:
#Making a copy of the dataset
df_copy = df.copy()

In [47]:
#Checking the data types of our columns
df_copy.dtypes

Unnamed: 0                int64
ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
dtype: object

Converting the Date from object to datetime

In [48]:
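#Parsing with an explicit format that matches the MM/DD/YYYY hh:mm:ss AM/PM strings in the preview
df_copy["Date"] = pd.to_datetime(df_copy["Date"], format="%m/%d/%Y %I:%M:%S %p")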
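
To confirm that every timestamp was parsed, one option is to pass `errors="coerce"` and count the resulting `NaT` values. The sketch below is illustrative only: `sample` is a hypothetical three-row series (two timestamps copied from the preview plus one deliberately malformed string), not the real data.

```python
import pandas as pd

# Two timestamps from the preview plus one deliberately malformed value
sample = pd.Series([
    "07/26/2008 02:30:00 PM",
    "07/31/2018 10:57:00 AM",
    "not a date",
])

# errors="coerce" turns unparsable strings into NaT instead of raising
parsed = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Count the rows that failed to parse
n_failed = parsed.isna().sum()
print(parsed)
print("Unparsed rows:", n_failed)
```

On the full dataset the same count could be taken on `df_copy["Date"]` before deciding how to handle any rows that fail.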


Checking for missing values

In [49]:
def missing_values(df_copy):
    #Counting the missing values in each column
    return df_copy.isna().sum()

In [50]:
missing_values(df_copy)

Unnamed: 0                   0
ID                           0
Case Number                  1
Date                         0
Block                        0
IUCR                         0
Primary Type                 0
Description                  0
Location Description      2877
Arrest                       0
Domestic                     0
Beat                         0
District                    12
Ward                    184695
Community Area          184267
FBI Code                     0
X Coordinate             23985
Y Coordinate             23985
Year                         0
Updated On                   0
Latitude                 23985
Longitude                23985
Location                 23985
dtype: int64
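
Raw counts are easier to judge as a share of all rows. For example, the 184695 missing Ward values are about 8.1% of the 2278726 rows. The helper below is a minimal sketch of that calculation; `missing_summary` and the `toy` frame are hypothetical names used for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return the count and percentage of missing values per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Small made-up frame standing in for the real dataset
toy = pd.DataFrame({
    "Ward": [10.0, np.nan, 39.0, np.nan],
    "District": [4.0, 4.0, np.nan, 17.0],
    "Arrest": [False, True, True, False],
})

summary = missing_summary(toy)
print(summary)
```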

Since Ward, Community Area, Location, Location Description and District are categorical in nature, we fill their missing values with the mode (the most frequent value).

In [51]:
#Deriving their modes
ward = df_copy["Ward"].mode()[0]
community_area = df_copy["Community Area"].mode()[0]
location = df_copy["Location"].mode()[0]
loc_description = df_copy["Location Description"].mode()[0]
district = df_copy["District"].mode()[0]

In [52]:
#Filling in the missing values with the mode
df_copy["Ward"] = df_copy["Ward"].fillna(ward)
df_copy["Community Area"] = df_copy["Community Area"].fillna(community_area)
df_copy["Location"] = df_copy["Location"].fillna(location)
df_copy["District"] = df_copy["District"].fillna(district)
df_copy["Location Description"] = df_copy["Location Description"].fillna(loc_description)
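
The same mode imputation can be written as a loop over the column names, which avoids one variable per column. This is a sketch on a made-up `toy` frame, not the notebook's data. Note that `mode()` ignores missing values and returns its results sorted, so `[0]` picks the smallest value when several values tie.

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps in two categorical columns
toy = pd.DataFrame({
    "Ward": [10.0, 10.0, np.nan, 39.0],
    "Location Description": ["STREET", None, "STREET", "RESIDENCE"],
})

categorical_cols = ["Ward", "Location Description"]
for col in categorical_cols:
    # Most frequent non-missing value in the column
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```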

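There is one missing value in the Case Number column. An identifier cannot be imputed with the mean, median or mode, so we drop that row. Note that `dropna()` without arguments removes every row that still contains a missing value, so the rows missing X Coordinate, Y Coordinate, Latitude or Longitude (23985 each in the count above) are dropped as well.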

In [53]:
df_copy.dropna(inplace = True)
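
Whether to drop every incomplete row or only the row without a Case Number is a judgment call. The sketch below contrasts the two on a made-up `toy` frame: `dropna()` with no arguments versus `dropna(subset=...)` restricted to one column.

```python
import numpy as np
import pandas as pd

# Made-up frame: one row lacks a Case Number, another lacks a coordinate
toy = pd.DataFrame({
    "Case Number": ["HP485721", None, "HN308568", "JB361368"],
    "X Coordinate": [1196638.0, 1184499.0, np.nan, 1148388.0],
})

# Drops any row with a missing value in any column
dropped_any = toy.dropna()

# Drops only rows where Case Number is missing
dropped_case_only = toy.dropna(subset=["Case Number"])

print(len(toy), len(dropped_any), len(dropped_case_only))
```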

In [54]:
df_copy.isna().sum()

Unnamed: 0              0
ID                      0
Case Number             0
Date                    0
Block                   0
IUCR                    0
Primary Type            0
Description             0
Location Description    0
Arrest                  0
Domestic                0
Beat                    0
District                0
Ward                    0
Community Area          0
FBI Code                0
X Coordinate            0
Y Coordinate            0
Year                    0
Updated On              0
Latitude                0
Longitude               0
Location                0
dtype: int64

Checking for duplicates

In [55]:
print(df_copy.duplicated().sum())

0
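
`duplicated()` with no arguments flags a row only when every column matches an earlier row. If the concern is the same case being recorded twice with a different update time, a check restricted to Case Number may be more informative. A minimal sketch on made-up rows (the `toy` frame and its shortened dates are illustrative):

```python
import pandas as pd

# Made-up rows: the first two share a Case Number but differ elsewhere
toy = pd.DataFrame({
    "Case Number": ["HP485721", "HP485721", "JB372830"],
    "Updated On": ["02/28/2018", "03/01/2018", "08/07/2018"],
})

# Full-row comparison finds no duplicates
full_row_dupes = toy.duplicated().sum()

# Comparison on Case Number alone finds one repeat
case_dupes = toy.duplicated(subset=["Case Number"]).sum()

print(full_row_dupes, case_dupes)
```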


In [56]:
#Dropping unnecessary columns
def drop_column(df_copy):
    df_copy.drop(columns = ["Unnamed: 0", "ID", "Longitude", "Latitude"], inplace=True)

In [57]:
drop_column(df_copy)

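Feature engineering: we derive Month and Day columns from the Date column (already converted to datetime above) and define a helper, get_season, that maps a month name to its season. The helper is defined here but not yet applied, so no Season column appears in the preview below.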

In [58]:
def feature_engineering(df_copy):
  #Creating new month and day columns
  df_copy['Month'] = df_copy['Date'].dt.month_name()
  df_copy['Day'] = df_copy['Date'].dt.day_name()

def get_season(month):
  #Mapping month to its corresponding season
  if month in ["December", "January", "February"]:
    return 'Winter'
  elif month in ["March", "April", "May"]:
    return 'Spring'
  elif month in ["June", "July", "August"]:
    return 'Summer'
  else:
    return 'Autumn'

In [59]:
feature_engineering(df_copy)
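
One way to turn the `get_season` helper into a Season column is to map it over the month names. The sketch below repeats the helper so that it runs on its own and uses a made-up `toy` frame; the first two dates come from the preview and the third is invented to cover winter.

```python
import pandas as pd

def get_season(month):
    # Mapping a month name to its corresponding season (same logic as the cell above)
    if month in ["December", "January", "February"]:
        return "Winter"
    elif month in ["March", "April", "May"]:
        return "Spring"
    elif month in ["June", "July", "August"]:
        return "Summer"
    else:
        return "Autumn"

toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2008-07-26 14:30:00",
        "2007-04-27 10:30:00",
        "2021-12-05 08:00:00",  # invented date for illustration
    ])
})

toy["Month"] = toy["Date"].dt.month_name()
# Apply the helper to every month name
toy["Season"] = toy["Month"].map(get_season)

print(toy[["Month", "Season"]])
```

On the real data the equivalent line would be `df_copy["Season"] = df_copy["Month"].map(get_season)`, run after `feature_engineering(df_copy)`.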

In [60]:
df_copy.head()

| | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | Beat | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Location | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HP485721 | 2008-07-26 14:30:00 | 085XX S MUSKEGON AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | False | False | 423 | ... | 10.0 | 46.0 | 14 | 1196638.0 | 1848800.0 | 2008 | 02/28/2018 03:56:25 PM | (41.739979622, -87.555120042) | July | Saturday |
| 1 | JB372830 | 2018-07-31 10:57:00 | 092XX S ELLIS AVE | 143C | WEAPONS VIOLATION | UNLAWFUL POSS AMMUNITION | POOL ROOM | True | False | 413 | ... | 8.0 | 47.0 | 15 | 1184499.0 | 1843935.0 | 2018 | 08/07/2018 04:02:59 PM | (41.726922145, -87.599746995) | July | Tuesday |
| 2 | HN308568 | 2007-04-27 10:30:00 | 062XX N TRIPP AVE | 0610 | BURGLARY | FORCIBLE ENTRY | RESIDENCE | True | False | 1711 | ... | 39.0 | 12.0 | 05 | 1146911.0 | 1941022.0 | 2007 | 02/28/2018 03:56:25 PM | (41.994137622, -87.734959049) | April | Friday |
| 3 | JB361368 | 2018-07-23 08:55:00 | 0000X N KEELER AVE | 0560 | ASSAULT | SIMPLE | NURSING HOME/RETIREMENT HOME | False | False | 1115 | ... | 28.0 | 26.0 | 08A | 1148388.0 | 1899882.0 | 2018 | 07/30/2018 03:52:24 PM | (41.881217483, -87.730589961) | July | Monday |
| 4 | JE297624 | 2021-07-11 06:40:00 | 016XX W HARRISON ST | 051A | ASSAULT | AGGRAVATED - HANDGUN | PARKING LOT / GARAGE (NON RESIDENTIAL) | False | False | 1231 | ... | 27.0 | 28.0 | 04A | 1165430.0 | 1897441.0 | 2021 | 07/18/2021 04:56:02 PM | (41.874173691, -87.668082118) | July | Sunday |


In [61]:
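#Saving the cleaned dataset; index=False keeps the row index from being written as an extra "Unnamed: 0" column
df_copy.to_csv("dfchicago_clean.csv", index=False)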
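
A likely origin of the "Unnamed: 0" column in the raw file is a CSV that was saved with its row index and later read back. The sketch below shows the effect with an in-memory buffer instead of a file; the `toy` frame is made up for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Case Number": ["HP485721", "JB372830"], "Year": [2008, 2018]})

# Writing with the default index=True adds an unnamed leading column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# Writing with index=False round-trips the original columns only
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```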

### 2. Statistical Exploration 

### 3. Visual Exploration 