# Chicago Crime Data Analysis
This project is based on Chicago crime data downloaded from the [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2). The dataset includes all reported crimes in Chicago since 2001.

The data was downloaded from the portal and saved in Parquet format. This analysis focuses on crimes between 2012 and 2022.

Download the dataset files (ChicagoCrime2012_2022.parquet, PoliceStation.csv) from Canvas and store them in the folder /FileStore/tables/ChicagoCrime

In [0]:
%fs ls /FileStore/tables/ChicagoCrime

### Import crime data

In [0]:
df=spark.read.parquet("/FileStore/tables/ChicagoCrime/ChicagoCrimes2012_2022.parquet")


In [0]:
df.count()

In [0]:
display(df)

In [0]:
df.printSchema()

## ETL

### Column name reformat
Replace spaces in column names with underscores and convert the names to lowercase.

In [0]:
from pyspark.sql.functions import col

columns=df.columns

# use `name` as the loop variable so it does not shadow the imported col() function
for name in columns:
  new_col=name.replace(" ", "_").lower()
  df=df.withColumnRenamed(name, new_col)
  
display(df)

Modify the format of columns

### Change date from string type to timestamp type

In [0]:
from pyspark.sql.functions import to_timestamp

help(to_timestamp)

In [0]:
from pyspark.sql.functions import to_timestamp, col
# the column is named 'date' after the lowercase rename above; parse it in place as a timestamp
crime_df = df.withColumn('date',to_timestamp(col('date'),'MM/dd/yyyy hh:mm:ss a'))
crime_df.printSchema()
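
The pattern passed to `to_timestamp` reads as month, day, four-digit year, then a 12-hour clock time (`hh`) followed by an AM/PM marker (`a`). As a rough cross-check outside Spark, the same layout can be parsed with Python's standard `datetime.strptime`. The sample string below is made up for illustration and is not taken from the dataset.

```python
from datetime import datetime

# Hypothetical sample value in the same layout as the raw date column
sample = "01/05/2021 03:30:00 PM"

# Python counterpart of the Spark pattern 'MM/dd/yyyy hh:mm:ss a':
# %I is the 12-hour clock hour and %p is the AM/PM marker
parsed = datetime.strptime(sample, "%m/%d/%Y %I:%M:%S %p")

print(parsed)  # 2021-01-05 15:30:00
```

The PM marker is what turns hour 03 into 15, which is why the pattern uses `hh` together with `a`.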

In [0]:
display(crime_df)

# Basic Analysis

### Number of crimes by year

### Number of crimes by year and by month

### What are the top 10 reported crimes by primary type, in descending order of occurrence?

### What are the top 5 reported crimes by primary type for each year?
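
This question is a "top N per group" problem: count per (year, primary type), then keep the N largest counts within each year. The plain-Python sketch below only illustrates that logic on a few hypothetical rows; `top_n_per_year` is an illustrative helper, not part of the notebook. In Spark, the same ranking is commonly expressed with a window partitioned by year and `row_number()`.

```python
from collections import Counter, defaultdict

# Hypothetical (year, primary_type) pairs standing in for rows of the crime table
rows = [
    (2020, "THEFT"), (2020, "THEFT"), (2020, "BATTERY"),
    (2021, "THEFT"),
    (2021, "ASSAULT"), (2021, "ASSAULT"), (2021, "ASSAULT"),
    (2021, "BATTERY"), (2021, "BATTERY"),
]

def top_n_per_year(rows, n):
    """Return {year: [(primary_type, count), ...]} with the n most frequent types per year."""
    counts = defaultdict(Counter)
    for year, primary_type in rows:
        counts[year][primary_type] += 1
    # most_common(n) returns the n largest counts in descending order
    return {year: counter.most_common(n) for year, counter in counts.items()}

top2 = top_n_per_year(rows, 2)
print(top2[2021])  # [('ASSAULT', 3), ('BATTERY', 2)]
```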

### Find the percentage of reported crimes that result in an arrest

### Find the percentage of reported crimes that result in an arrest, by year

### What are the top 10 words appearing in the description of the crime?
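
Counting words means splitting each description into tokens, pooling the tokens, and ranking them by frequency. A minimal plain-Python sketch of that logic follows, using hypothetical description strings; `top_words` is an illustrative helper. In Spark, a common approach is to `split` the column on whitespace, `explode` the resulting array, and then group and count.

```python
from collections import Counter

# Hypothetical description strings; the real values come from the description column
descriptions = [
    "SIMPLE",
    "DOMESTIC BATTERY SIMPLE",
    "TO VEHICLE",
    "TO PROPERTY",
    "TO LAND",
    "RETAIL THEFT",
]

def top_words(texts, n):
    """Count whitespace-separated words across all texts and return the n most common."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    return counter.most_common(n)

print(top_words(descriptions, 2))  # [('to', 3), ('simple', 2)]
```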

# Working with joins

In [0]:
# The reported crimes dataset has only the district number. Add the district name by joining with the police station dataset.

# load the police station dataset, only keep District and District Name

station_df = spark.read.csv('/FileStore/tables/ChicagoCrime/PoliceStation.csv',header=True, inferSchema=True)
station_df=station_df.select('District', "District Name")
station_df.printSchema()

In [0]:
# Join police station data with crime data.

#new_df=crime_df.join(station_df, crime_df['district']==station_df['district'], 'inner')

# if both fields have the same name, the following join can be used:

new_df=crime_df.join(station_df, 'district', 'inner')

In [0]:
display(new_df)

In [0]:
# rename District Name to be consistent with other columns

new_df=new_df.withColumnRenamed('District Name', 'district_name')

display(new_df)

### Which district has the highest arrest rate?

## Crime Map for 2021
We want to show crimes of a selected type on a map for a particular day in 2021.

In [0]:
# remove existing widgets

dbutils.widgets.removeAll()

In [0]:
new_df.createOrReplaceTempView('crimeTable')

### Create a list to store all dates in 2021

In [0]:
from pyspark.sql.functions import col, date_format

pd_date=new_df.filter(col('year')==2021).select(date_format("Date", "yyyy-MM-dd").alias('date')).distinct().orderBy('date').toPandas()
pd_date=pd_date.sort_values(by=['date'])
date=list(pd_date['date'])
date

### Create a list to store the top 10 types of crime in 2021

In [0]:
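# import only the functions used here and in the cells below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, date_format, desc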
pd_type=new_df.filter(col('year')==2021).groupBy("primary_type").count().orderBy(desc('count')).limit(10).toPandas()
pd_type=pd_type.sort_values(by=['primary_type'])
type=list(pd_type['primary_type'])
type

### create dropdown list for date and primary type

In [0]:
dbutils.widgets.dropdown("Date", "2021-01-01", [str(x) for x in date])

dbutils.widgets.dropdown("Type", "ASSAULT", [str(x) for x in type])
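
A dropdown's default value is expected to be one of its choices, so the hard-coded defaults above assume that `2021-01-01` and `ASSAULT` actually appear in the two lists. A small guard can fall back to the first choice when that assumption does not hold. `pick_default` is a hypothetical helper written for this sketch and is not part of `dbutils`.

```python
def pick_default(preferred, choices):
    """Return preferred if it is one of the choices, otherwise the first choice."""
    options = [str(x) for x in choices]
    if not options:
        raise ValueError("choices must not be empty")
    return preferred if preferred in options else options[0]

# Hypothetical choice lists for illustration
print(pick_default("ASSAULT", ["ASSAULT", "BATTERY", "THEFT"]))  # ASSAULT
print(pick_default("ASSAULT", ["BATTERY", "THEFT"]))             # BATTERY
```

In the notebook, the result could be passed as the second argument, for example `dbutils.widgets.dropdown("Type", pick_default("ASSAULT", type), [str(x) for x in type])`.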

### Display number of crimes by week day for the selected crime type

In [0]:
display(new_df.filter(col('primary_type')==getArgument("Type")).groupBy(date_format('date','E').alias('week_day')).count().orderBy('week_day'))
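
Ordering by the `week_day` string sorts the abbreviations alphabetically (Fri, Mon, Sat, ...) rather than in calendar order. The sketch below shows one way to reorder collected rows Monday-first. It assumes the English three-letter abbreviations that the `E` pattern produces, and the counts are made up. Inside Spark, a numeric key such as `dayofweek` could serve the same purpose.

```python
# Calendar order of the three-letter English weekday abbreviations
WEEK_ORDER = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def sort_by_weekday(rows):
    """Sort (week_day, count) pairs Monday-first instead of alphabetically."""
    rank = {day: i for i, day in enumerate(WEEK_ORDER)}
    return sorted(rows, key=lambda row: rank[row[0]])

# Hypothetical counts, listed in the alphabetical order that orderBy('week_day') yields
rows = [("Fri", 12), ("Mon", 9), ("Sat", 15), ("Sun", 14),
        ("Thu", 10), ("Tue", 8), ("Wed", 11)]

ordered = sort_by_weekday(rows)
print([day for day, _ in ordered])
# ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
```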

## Display data on a map using folium

[see this link for more detail](https://python-visualization.github.io/folium/quickstart.html#Getting-Started)

In [0]:
display(new_df)

In [0]:
%sh pip install folium

In [0]:
# create a dataframe for crime data in 2021

df_2021=new_df.where('year=2021')

df_2021.count()

In [0]:
import folium
from pyspark.sql.functions import col, date_format

# collect the rows for the selected date and type to the driver;
# named pdf so it does not shadow the conventional pandas alias `pd`
pdf=df_2021.filter(col('latitude').isNotNull()).filter(date_format("Date", "yyyy-MM-dd")==getArgument("Date")).filter(col("primary_type")==getArgument("Type")).select("description", "latitude", 'longitude').toPandas()

description=pdf['description']
latitude=pdf['latitude']
longitude=pdf['longitude']

# base map centered on Chicago
m = folium.Map(location=[41.815, -87.669], zoom_start=12)

# add one marker per crime, with the description as the popup text
for i in range(0, len(description)):
     folium.Marker([latitude[i], longitude[i]], popup=description[i],
                   icon=folium.Icon(color="blue")).add_to(m)

display(m)