## Exploratory Data Analysis of Crime in Chicago
### <span style="color:#95B">*Understanding the relationship between time of day, location, and type of crime committed between 2012 and 2017 in Chicago, IL.*</span>

Often referred to in pop culture as "Chiraq", a play on words comparing its levels of violence to those of the Iraq War (the Second Persian Gulf War), Chicago has some of the highest crime rates in the United States. Because crimes span a wide array of categories, the frequency of certain crimes is likely affected by factors such as location and time of day. A better understanding of high-risk locations and times of day would help civilians, law enforcement, and governments make better decisions aimed at improving the safety of residents.

In this exploratory analysis I analyze a crime dataset obtained from Kaggle.com to better understand how the occurrence of different types of crime varies with time and place. The conclusions drawn from this analysis may inform better day-to-day decision making for the inhabitants of Chicago.

## Methods
### Data Collection

This dataset was acquired from [Kaggle](https://www.kaggle.com/code/fahd09/eda-of-crime-in-chicago-2005-2016), a subsidiary of Google LLC that serves as an online community for data scientists and machine learning practitioners as well as a repository of published datasets [1]. I analyze only the CSV file containing data from 2012 to 2017.

In [None]:
# Importing necessary libraries and loading data
import datetime as dt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data_filename = "./Chicago_Crimes_2012_to_2017.csv"
crime_data = pd.read_csv(data_filename)
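
As a minimal sketch of an alternative loading step, the columns of interest can be selected and the dates parsed at read time. The two-row CSV text below is made up for illustration and only stands in for the real file; the column names and date format are the ones used elsewhere in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for Chicago_Crimes_2012_to_2017.csv
sample_csv = io.StringIO(
    "ID,Case Number,Date,Primary Type,Location Description,Ward\n"
    "1,HX100,01/15/2014 11:45:00 PM,THEFT,STREET,12\n"
    "2,HX101,06/02/2015 08:10:00 AM,BATTERY,APARTMENT,7\n"
)

# Read only the columns the analysis needs; 'Ward' is skipped
wanted = ["ID", "Case Number", "Date", "Primary Type", "Location Description"]
sample = pd.read_csv(sample_csv, usecols=wanted)

# Parse the date strings with the same format string used in the cleaning step
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")
print(sample.dtypes)
```

Reading fewer columns mainly reduces memory use on a file of this size; the rest of the notebook does not depend on it.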

### Data Cleaning

This dataset contains numerous columns that are irrelevant for the purposes of this exploratory analysis. I remove the unnecessary columns and rename the relevant ones so that they are easier to understand. Rows containing missing data and duplicate crime entries are also removed. Finally, I convert the 'Date' column from strings to datetime values so that it is easier to work with.

In [None]:
# Dropping all rows with missing data
crime_data = crime_data.dropna(axis=0)

# Removing all duplicate row entries
crime_data.drop_duplicates(subset=['ID',"Case Number"], keep="first", inplace=True)

# Removing unneeded columns
crime_data = crime_data.drop(['Unnamed: 0', 'ID', 'Case Number', 'Block', 'IUCR', 'Description', 
                     'Ward', 'Community Area', 'FBI Code' , 'X Coordinate', 'Y Coordinate',
                     'Year', 'Updated On', 'Latitude', 'Longitude', 'Location','Beat',
                     'District', 'Arrest', 'Domestic'], axis=1)

# Editing Date column to convert datetime strings to pandas datetime objects
crime_data['Date'] = pd.to_datetime(crime_data.Date, format='%m/%d/%Y %I:%M:%S %p')

# Renaming columns
crime_data.rename(columns = {'Primary Type':'Crime Type', 'Location Description':'Location'}, inplace = True)

In [None]:
crime_data
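
To make the effect of the two row-removal steps concrete, here is a small sketch on a made-up four-row frame (the values are hypothetical, not taken from the dataset). It applies `dropna` and `drop_duplicates` in the same order as above and reports how many rows survive each step.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are duplicates, row 3 has a missing value
toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Case Number": ["A", "A", "B", "C"],
    "Primary Type": ["THEFT", "THEFT", "BATTERY", None],
})

# Step 1: drop rows with any missing value
no_missing = toy.dropna(axis=0)

# Step 2: drop repeated (ID, Case Number) pairs, keeping the first occurrence
deduped = no_missing.drop_duplicates(subset=["ID", "Case Number"], keep="first")

print(len(toy), len(no_missing), len(deduped))  # 4 3 2
```

Printing the row counts before and after each step on the real data is a cheap way to check how much of the dataset the cleaning discards.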

## Exploratory Analysis and Visualization
### Distribution of Crimes by Type

Let us first explore how frequently each type of crime occurred between 2012 and 2017.

In [None]:
# Finding the count of all crime types between the years of 2012 and 2017
crime_counts = pd.Series(crime_data.groupby(['Crime Type']).size())
crime_counts = crime_counts.sort_values(ascending=False, kind='quicksort')

# Creating a bar chart of the distribution (the y-axis shows raw counts)
crime_counts.plot(kind='bar', figsize=(13,5), ylabel="Count", title="Number of Crimes by Type")

We can see from the above distribution that the most common crime types are theft and battery.

Next, let us express the crime type totals as percentages. Viewing the percentages in table format gives a better understanding of the figures.

In [None]:
# Percentage of crimes committed by type
type_counts = crime_data.groupby('Crime Type').size()
crimePercentages = (100 * type_counts / type_counts.sum()).sort_values(ascending=False)
crimePercentages = crimePercentages.rename('Percentage').reset_index()
crimePercentages
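
The same percentage table can also be sketched with `value_counts(normalize=True)`, which computes the shares directly. The four-row frame below is made up for illustration; on the real data the column would be `crime_data['Crime Type']`.

```python
import pandas as pd

# Hypothetical toy data standing in for crime_data
toy = pd.DataFrame({"Crime Type": ["THEFT", "THEFT", "BATTERY", "ASSAULT"]})

percentages = (
    toy["Crime Type"]
    .value_counts(normalize=True)   # share of rows per crime type, sorted descending
    .mul(100)                       # convert shares to percentages
    .rename("Percentage")           # name the value column
    .rename_axis("Crime Type")      # name the index so reset_index labels it
    .reset_index()
)
print(percentages)
```

The explicit `rename` and `rename_axis` calls are there because the default names that `value_counts` assigns differ between pandas versions.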

Let us consider only the top 10 types of crime by percentage.

In [None]:
crimePercentages = crimePercentages.iloc[0:10]
print('-------------------------------\nThese crime types account for ',
      round(crimePercentages['Percentage'].sum(), 2), '% of all crimes committed.\n-------------------------------', sep="" )

crimePercentages

We can see again that theft and battery account for the largest shares of committed crimes, at ~22.7% and ~18.26% respectively.

Here is a graphical comparison of the crime types that together make up the top 92.01% of all crimes committed:

In [None]:
crimePercentages.plot(kind="bar", x='Crime Type', figsize=(13,5), ylabel="Percentage of Crimes", title="Percentage of Crimes by Type")


We can now further analyze the top ~92% of crime types by determining the percentage breakdown for the locations where these crimes occur.

In [None]:
# Keep only the rows whose crime type is among the top 10 (~92% of crimes)
crime_locations = crime_data.loc[crime_data['Crime Type'].isin(crimePercentages['Crime Type']), :]
# drop() returns a new DataFrame, so the result must be assigned back
crime_locations = crime_locations.drop('Date', axis=1)

# Avoid naming a variable `sum`, which would shadow the Python built-in
location_counts = crime_locations.groupby('Location').size()

locationPercentages = (100 * location_counts / location_counts.sum()).round(2).sort_values(ascending=False)
locationPercentages = locationPercentages.rename('Percentage').reset_index()
print("Here we can see the percentage of occurrence of the top ~92% of crime types at each given location:")
locationPercentages


It appears that these crimes most often occur on the street, in residences, in apartments, and on the sidewalk.

We will now consider the top 10 most at-risk locations:

In [None]:
locationPercentages = locationPercentages.iloc[0:10]
total = round(locationPercentages['Percentage'].sum(), 2)
print(f"The following locations in the city account for {total}% of the crimes belonging to the top ~92% of crime types committed in Chicago:")
locationPercentages

A graphical representation of the most at-risk locations for the top ~92% of crime types:

In [None]:
locationPercentages.plot(kind="bar", x = 'Location', figsize=(13,5), ylabel="Percentage of Crimes", title="Percentage of Most Common Crimes Committed by Location")

We will now round the time of occurrence of each of the top ten crime types to the nearest hour and create a heatmap showing the most at-risk time of day for a given crime type.

In [None]:
#rounding time of crime committed to the nearest hour 
def rounder(t):
    if t.minute >= 30:
        if t.hour == 23:
            return t.replace(second=0, microsecond=0, minute=0, hour=0)
        return t.replace(second=0, microsecond=0, minute=0, hour=t.hour+1)
    else:
        return t.replace(second=0, microsecond=0, minute=0)

crime_data['Date'] = [rounder(dt.datetime.time(i)).hour for i in crime_data.Date]
crime_data
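
The rounding rule above (30 minutes or more rounds up, and 23:30 onward wraps to hour 0) can also be sketched without a Python-level loop: adding 30 minutes to each timestamp and reading off the hour gives the same result. The timestamps below are made up for illustration, and `rounder_hour` is a hypothetical helper that only mirrors the rule so the two approaches can be compared.

```python
import pandas as pd

def rounder_hour(ts):
    # Mirror of the rule above: round up at minute >= 30, wrapping 23 -> 0
    return (ts.hour + 1) % 24 if ts.minute >= 30 else ts.hour

# Hypothetical timestamps covering the edge cases
stamps = pd.Series(pd.to_datetime([
    "2014-01-15 23:45:00",  # rounds up past midnight -> 0
    "2014-01-15 12:29:59",  # stays at 12
    "2014-01-15 12:30:00",  # rounds up to 13
    "2014-01-15 00:05:00",  # stays at 0
]))

# Vectorized version: shift by half an hour, then take the hour component
hours = (stamps + pd.Timedelta(minutes=30)).dt.hour
print(hours.tolist())  # [0, 12, 13, 0]
```

On a column with millions of rows, the vectorized form is typically much faster than calling a function per row.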

Removing all but the top ten previously determined crime types:

In [None]:
crimeTime = crime_data.loc[crime_data['Crime Type'].isin(crimePercentages['Crime Type'])]
crimeTime


Using the top 10 crime types and their approximate hours of occurrence, we generate a heatmap that shows the most at-risk times of day for each crime type.

In [None]:
from matplotlib import colormaps, colors

top_types = crimeTime['Crime Type'].unique()
hours = list(range(24))

# Count crimes per (crime type, hour), then pivot the hours into columns.
# (type, hour) combinations with no recorded crimes are filled with 0.
vals = crimeTime.groupby(['Crime Type', 'Date']).size()
df = vals.unstack(fill_value=0).reindex(index=top_types, columns=hours, fill_value=0)


def background_gradient(s, m=None, M=None, cmap='Greys', low=0, high=0):
    # Normalize the counts to [0, 1] over the whole table
    if m is None:
        m = s.min().min()
    if M is None:
        M = s.max().max()
    rng = M - m
    norm = colors.Normalize(m - (rng * low),
                            M + (rng * high))
    cm = colormaps[cmap]

    def colorit(value):
        # Background colour from the colormap, as '#rrggbb'
        back = colors.rgb2hex(cm(norm(value)))
        r, g, b = (int(back[i:i + 2], 16) for i in (1, 3, 5))
        # Black text on light cells, white text on dark cells
        if (r*0.299 + g*0.587 + b*0.114) > 186:
            fore = "#000000"
        else:
            fore = "#ffffff"
        return 'color: {}; background-color: {}'.format(fore, back)

    # Build a frame of CSS strings with the same shape and labels as the input
    styles = [[colorit(v) for v in row] for row in s.to_numpy()]
    return pd.DataFrame(styles, index=s.index, columns=s.columns)


df.style.apply(background_gradient, axis=None)

In the above heatmap, darker regions represent a higher crime rate at that time of day for the given crime type. Theft appears to be most common from 12pm to around 8pm, while battery is more evenly distributed but still occurs more often in the afternoon and evening hours.
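
Because the shading is scaled over the whole table, high-volume crime types can visually dominate lower-volume ones. One possible complement, sketched here on made-up counts with only three hour columns, is to convert each row to percentages of that crime type's total so that each type's peak hours are comparable. The crime labels and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical hourly counts: rows are crime types, columns are hours of the day
counts = pd.DataFrame(
    {0: [10, 1], 12: [60, 2], 18: [30, 7]},
    index=["THEFT", "ARSON"],
)

# Divide each row by its own total to get the share of that crime type per hour
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)

# Hour with the largest share for each crime type
peak_hour = shares.idxmax(axis=1)

print(shares)
print(peak_hour)
```

In this toy example the low-volume type peaks at hour 18 even though its raw counts are far smaller than those of the high-volume type, which is the kind of pattern a table-wide colour scale can hide.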

## Discussion

In this exploratory analysis, I attempted to gain a better understanding of the crimes committed in the city of Chicago between 2012 and 2017. I wanted to determine the relationships between the type of crime, the time of day, and the location. From the analyzed data and visualizations we can see that theft and battery are the most common crime types. It also seems that the most at-risk locations for the common crime types are in the open, on streets and sidewalks, as well as in more congested areas such as residential apartments. Finally, the top 10 most common crimes tend to occur in the afternoon and evening, from about 12pm to 10pm. Because theft and robbery are among the more common crime types, people may be safer if they travel light and avoid being alone in public during those hours.

## References

1. Source data - https://www.kaggle.com/datasets/currie32/crime_data-in-chicago?resource=downloadselect=Chicago_crime_data_2001_to_2004.csv
2. Pandas for data manipulation
3. Seaborn for data viz
4. Matplotlib for data viz
5. datetime for formatting