## Chicago Crime Data Analysis
This project is based on Chicago crime data downloaded from the [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2). The data set includes crimes reported in Chicago from 2001 to the present.

In [0]:
%fs ls /mnt/isa460/data/chicago_crime

In [0]:
# import csv data from S3 storage folder

df=spark.read.csv("/mnt/isa460/data/chicago_crime/Crimes2001_2021.csv", header=True, inferSchema=True)


In [0]:
df.printSchema()

In [0]:
display(df)

In [0]:
df.count()

## ETL

### Replace spaces in column names with underscores and convert the names to lowercase

In [0]:
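from pyspark.sql.functions import *

# Rename every column: spaces become underscores, letters become lowercase.
# The loop variable is named old_name so that it does not shadow the col()
# function brought in by the wildcard import above.
for old_name in df.columns:
  new_name = old_name.replace(" ", "_").lower()
  df = df.withColumnRenamed(old_name, new_name)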


### Change date from string type to timestamp type

In [0]:
from pyspark.sql.functions import to_timestamp
help(to_timestamp)

In [0]:
from pyspark.sql.functions import to_timestamp, col
# Column names were lowercased in the previous step, so the column is now 'date'.
crime_df = df.withColumn('date', to_timestamp(col('date'), 'MM/dd/yyyy hh:mm:ss a'))
crime_df.printSchema()

In [0]:
display(crime_df)

### Store the transformed crime data in Parquet format

In [0]:
crime_df.write.parquet("/mnt/isa460/data/chicago_crime/parquet")

Check the stored Parquet files.

In [0]:
%fs ls /mnt/isa460/data/chicago_crime/parquet

# Basic Analysis

In [0]:
# load crime data (in parquet format)

crime_df=spark.read.parquet("/mnt/isa460/data/chicago_crime/parquet")

## Number of crimes by year

## Number of crimes by year and by month

### What are the top 10 primary types by number of reported crimes, in descending order of occurrence?

### What are the top 5 reported crimes by primary type for each year?

### Find the percentage of reported crimes that result in an arrest

### Find the percentage of reported crimes that result in an arrest, by year

### What are the top 10 words appearing in the description of the crime?

# Working with joins

In [0]:
# The reported crimes dataset has only the district number. Add the district name by joining with the police station dataset.

# Load the police station dataset, keeping only District and District Name

station_df = spark.read.csv('/mnt/isa460/data/chicago_crime/policestation.csv',header=True, inferSchema=True)
station_df=station_df.select('District', "District Name")
station_df.printSchema()

### join police station data with crime data

### Which district has the highest arrest rate?

### Create widget based on primary type

In [0]:
dbutils.widgets.removeAll()

In [0]:
new_df.createOrReplaceTempView('crime')

In [0]:
primary_type = spark.sql("select distinct primary_type from crime").rdd.map(lambda row : row[0]).collect()
primary_type.sort()

In [0]:
dbutils.widgets.dropdown("Type", "THEFT", [str(x) for x in primary_type])

Find the day of the week with the most reported crimes for the selected primary type.

In [0]:
display(crime_df.filter(col('primary_type')==getArgument("Type")).groupBy(date_format('date','E').alias('week day')).count().orderBy('count', ascending=False))