#### Lecture 5: Coding Exercise
• Baltimore City Crime Data: https://opendata.arcgis.com/api/v3/datasets/3eeb0a2cbae94b3e8549a8193717a9e1_0/downloads/data?format=csv&spatialRefId=2248

##### Directions
1. Specify the schema for the crime data set.
2. Read the file using the schema definition.
3. Cache the DataFrame.
4. Show the row count.
5. Print the schema.
6. Display the first 5 rows.
7. Answer the following questions.
 
##### Questions
1. What are the distinct crime codes?
2. Count the number of crimes by crime code and order the results by count in descending order.
3. Which neighborhood had the most crimes?
4. Which month of the year had the most crimes?
5. What weapons were used?
6. Which weapon was used the most?

#### Prework: Get the Data
Inspect the location where the crime data file was stored.

In [0]:
%fs ls /FileStore/tables/Part1_Crime_data.csv

path,name,size
dbfs:/FileStore/tables/Part1_Crime_data.csv,Part1_Crime_data.csv,62763718


##### Define the File Location

In [0]:
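# Explicit imports avoid shadowing Python builtins such as sum, min, and max.
from pyspark.sql.types import (StructType, StructField, DoubleType,
                               FloatType, IntegerType, StringType)
from pyspark.sql.functions import col, month, to_timestamp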

crime_data_file = "/FileStore/tables/Part1_Crime_data.csv"

##### Inspect the data locally

In [0]:
%fs head /FileStore/tables/Part1_Crime_data.csv

#### Directions

##### 1. Specify Schema

In [0]:
# Define the schema: one StructField(name, type, nullable) per CSV column
crime_schema = StructType([
    StructField('X', DoubleType(), True),
    StructField('Y', DoubleType(), True),
    StructField('RowID', IntegerType(), True),
    StructField('CrimeDateTime', StringType(), True),
    StructField('CrimeCode', StringType(), True),
    StructField('Location', StringType(), True),
    StructField('Description', StringType(), True),
    StructField('Inside_Outside', StringType(), True),
    StructField('Weapon', StringType(), True),
    StructField('Post', StringType(), True),
    StructField('District', StringType(), True),
    StructField('Neighborhood', StringType(), True),
    StructField('Latitude', FloatType(), True),
    StructField('Longitude', FloatType(), True),
    StructField('GeoLocation', StringType(), True),
    StructField('Premise', StringType(), True),
    StructField('VRIName', StringType(), True),
    StructField('Total_Incidents', IntegerType(), True),
])
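
Spark's CSV reader also accepts a DDL-formatted string for its `schema` argument, which can be easier to scan for a wide file like this one. The following is a minimal sketch, assuming the same column names and types as the schema above. The `crime_columns` list, the `to_ddl` helper, and the `crime_schema_ddl` variable are hypothetical names introduced for illustration, and the string is built in plain Python so it can be checked without a cluster.

```python
# Hypothetical (name, DDL type) pairs mirroring the StructType defined above.
crime_columns = [
    ("X", "DOUBLE"), ("Y", "DOUBLE"), ("RowID", "INT"),
    ("CrimeDateTime", "STRING"), ("CrimeCode", "STRING"),
    ("Location", "STRING"), ("Description", "STRING"),
    ("Inside_Outside", "STRING"), ("Weapon", "STRING"),
    ("Post", "STRING"), ("District", "STRING"),
    ("Neighborhood", "STRING"), ("Latitude", "FLOAT"),
    ("Longitude", "FLOAT"), ("GeoLocation", "STRING"),
    ("Premise", "STRING"), ("VRIName", "STRING"),
    ("Total_Incidents", "INT"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


crime_schema_ddl = to_ddl(crime_columns)
print(crime_schema_ddl)

# In a notebook with a Spark session, the string could be passed as:
# spark.read.csv(crime_data_file, header=True, schema=crime_schema_ddl)
```

Either form should describe the same columns; the `StructType` version is more explicit about nullability, while the DDL string is shorter to write.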

##### 2. Read file to DataFrame 
Using the schema definition

In [0]:
crime_df = spark.read.csv(crime_data_file, header=True, schema=crime_schema)

##### 3. Cache the DataFrame

In [0]:
crime_df.cache()

##### 4. Show the count of the rows

In [0]:
# Note: the count may differ from what you see here.
crime_df.count()

##### 5. Print the schema

In [0]:
crime_df.printSchema()

##### 6. Display first 5 rows

In [0]:
display(crime_df.limit(5))

X,Y,RowID,CrimeDateTime,CrimeCode,Location,Description,Inside_Outside,Weapon,Post,District,Neighborhood,Latitude,Longitude,GeoLocation,Premise,VRIName,Total_Incidents
1444028.69115949,614013.478635031,1,2021/04/24 17:00:00+00,7A,6400 ROSEMONT AVE,AUTO THEFT,O,,425,NORTHEAST,ROSEMONT EAST,39.3517,-76.5343,"(39.3517,-76.5343)",STREET,,1
1404650.49199384,609908.822236821,2,2021/04/24 14:22:00+00,4E,4700 REISTERSTOWN RD,COMMON ASSAULT,O,,613,NORTHWEST,LUCILLE PARK,39.3409,-76.6736,"(39.3409,-76.6736)",STREET,,1
1402390.7637532,609063.088658197,3,2021/04/24 20:20:00+00,4E,3900 PENHURST AVE,COMMON ASSAULT,I,,625,NORTHWEST,DOLFIELD,39.3386,-76.6816,"(39.3386,-76.6816)",APT/CONDO - OCCUPIED,,1
1411763.33074933,599008.054752978,4,2021/04/24 10:35:00+00,4E,1900 WALBROOK AVE,COMMON ASSAULT,I,,731,WESTERN,MONDAWMIN,39.3109,-76.6486,"(39.3109,-76.6486)",ROW/TOWNHOUSE-OCC,Western,1
1407601.80571192,599575.150884168,5,2021/04/24 18:28:00+00,4E,2200 KOKO LN,COMMON ASSAULT,I,,731,WESTERN,PANWAY/BRADDISH AVENUE,39.3125,-76.6633,"(39.3125,-76.6633)",ROW/TOWNHOUSE-VAC,,1


##### 7. Answer the following questions using PySpark and DataFrames

#### Questions

##### 1. What are the distinct crime codes?

In [0]:
# Your code here. Note: The actual output may be different from what you see below.
crime_df.select("CrimeCode").where(col("CrimeCode").isNotNull()).distinct().show()

##### 2. Count the number of crimes by crime code and order the results by count in descending order

In [0]:
(crime_df
 .select("CrimeCode").where(col("CrimeCode").isNotNull())
 .groupBy("CrimeCode")
 .count()
 .orderBy("count", ascending=False)
 .withColumnRenamed('count', 'Sum_Crimes')
 .show(truncate=False))
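
To see what the `groupBy` / `count` / `orderBy` chain computes, it can help to reproduce it by hand on a tiny sample. The sketch below is plain Python, not Spark, and uses only the five rows shown in the display output earlier in this notebook, reduced to three columns. The names `sample_csv`, `rows`, and `code_counts` are illustrative.

```python
import csv
import io
from collections import Counter

# First five data rows copied from the display output earlier in this
# notebook, keeping only RowID, CrimeCode, and Neighborhood.
sample_csv = """RowID,CrimeCode,Neighborhood
1,7A,ROSEMONT EAST
2,4E,LUCILLE PARK
3,4E,DOLFIELD
4,4E,MONDAWMIN
5,4E,PANWAY/BRADDISH AVENUE
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Plain-Python analogue of
# groupBy("CrimeCode").count().orderBy("count", ascending=False):
# skip empty codes, tally the rest, then list them from most to least common.
code_counts = Counter(row["CrimeCode"] for row in rows if row["CrimeCode"])
for code, n in code_counts.most_common():
    print(code, n)
# prints:
# 4E 4
# 7A 1
```

On the full data set the counts will of course differ; the sample only illustrates the shape of the result.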


##### 3. Which neighborhood had the most crimes?

In [0]:
# Your code here. Note: The actual output may be different from what you see below.
# Filter on Neighborhood itself so rows without a neighborhood are not counted.
(crime_df
 .select("Neighborhood").where(col("Neighborhood").isNotNull())
 .groupBy("Neighborhood")
 .count()
 .orderBy("count", ascending=False)
 .withColumnRenamed('count', 'Sum_Of_Crimes')
 .show(1))

##### 4. Which month of the year had the most crimes?

In [0]:
# The trailing '+00' is a UTC offset; the 'x' pattern letter parses an hour-only offset.
crime_df2 = crime_df.withColumn("timestamp", to_timestamp(col("CrimeDateTime"), "yyyy/MM/dd HH:mm:ssx"))
            
(crime_df2
 .select("timestamp").where(col("timestamp").isNotNull())
 .groupBy(month("timestamp").alias("CrimeMonth"))
 .count()
 .orderBy("count", ascending=False)
 .withColumnRenamed('count', 'Sum_Of_Crimes')
 .show(1))
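
The `CrimeDateTime` strings end in an hour-only UTC offset such as `+00`, which is easy to misread when choosing a format pattern. As a quick sanity check outside Spark, the sketch below parses one value from the sample rows with Python's standard library. The helper name `parse_crime_datetime` is hypothetical, and the code assumes every value carries an hour-only offset like the ones shown in this notebook.

```python
from datetime import datetime


def parse_crime_datetime(value):
    """Parse strings such as '2021/04/24 17:00:00+00' into aware datetimes."""
    # strptime's %z expects hours and minutes, so pad the hour-only
    # offset ('+00') with '00' minutes before parsing.
    # Assumption: the offset in the data is always hour-only.
    return datetime.strptime(value + "00", "%Y/%m/%d %H:%M:%S%z")


ts = parse_crime_datetime("2021/04/24 17:00:00+00")
print(ts.month, ts.hour, ts.utcoffset())
```

If the month extracted here matches what Spark's `month()` returns for the same row, the format pattern is reading the date part as intended.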

##### 5. What weapons were used?

In [0]:
# Your code here. Note: The actual output may be different from what you see below.
# The != comparison also drops rows where Weapon is null.
(crime_df
 .select("Weapon")
 .where(col("Weapon") != "NA")
 .distinct()
 .show())

##### 6. Which weapon was used the most?

In [0]:
# Your code here. Note: The actual output may be different from what you see below.
(crime_df
 .select("Weapon")
 .where(col("Weapon") != "NA")
 .groupBy("Weapon")
 .count()
 .orderBy("count", ascending=False)
 .show(1))