### Chicago Crime Data Dictionary

| **Column Name**           | **Description**                                                                                                                                                                      
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| **ID**                    | Unique identifier for the record.                                                                                                                                                   | 
| **Case Number**           | The Chicago Police Department RD Number (Records Division Number), unique to the incident.                                                                                           | 
| **Date**                  | Date when the incident occurred. This may be a best estimate.                                                                                                                        | 
| **Block**                 | The partially redacted address where the incident occurred, placing it on the same block as the actual address.                                                                      | 
| **IUCR**                  | Illinois Uniform Crime Reporting code, linked to Primary Type and Description. [IUCR Codes](https://data.cityofchicago.org/d/c7ck-438e)                                              | 
| **Primary Type**          | The primary description of the IUCR code.                                                                                                                                            |
| **Description**           | The secondary (subcategory) description of the IUCR code.                                                                                                                            | 
| **Location Description**  | Description of where the incident occurred.                                                                                                                                          | 
| **Arrest**                | Whether an arrest was made (`true`/`false`).                                                                                                                                         | 
| **Domestic**              | Whether the incident was domestic-related as per the Illinois Domestic Violence Act.                                                                                                |
| **Beat**                  | Police beat where the incident occurred. Beats are the smallest police geographic units. [Beats Info](https://data.cityofchicago.org/d/aerh-rz74)                                   | 
| **District**              | Police district where the incident occurred. [Districts Info](https://data.cityofchicago.org/d/fthy-xz3r)                                                                            |
| **Ward**                  | City Council ward where the incident occurred. [Wards Info](https://data.cityofchicago.org/d/sp34-6z76)                                                                             |            
| **Community Area**        | The community area (1 of 77) where the incident occurred. [Community Areas](https://data.cityofchicago.org/d/cauq-8yn6)                                                              |
| **FBI Code**              | Crime classification per the FBI’s NIBRS system. [FBI Classifications](https://gis.chicagopolice.org/pages/crime_details)                                                           | 
| **X Coordinate**          | 
The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same bloc.               |
| **Y Coordinate**          |  The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.               |
| **Year**                  | Year the incident occurred.                                                     |
| **Updated On**            | Date and time the record was last updated.                                      |
| **Latitude**              | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.                                                                |
| **Longitude**             | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.                                                                |
| **Location**              | The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.            |
























block.
location
Location


In [1]:
import pyspark

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import types as t

In [3]:
spark = SparkSession\
.builder\
.appName('Spark_LABs')\
.getOrCreate()

In [4]:
crimes= spark\
.read\
.format('csv')\
.option('header',True)\
.option('inferSchema', True)\
.load('Crimes_-_2001_to_Present_20250513.csv')

In [5]:
crimes.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- Beat: integer (nullable = true)
 |-- District: integer (nullable = true)
 |-- Ward: integer (nullable = true)
 |-- Community Area: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: integer (nullable = true)
 |-- Y Coordinate: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Location: string (nullable = true)



In [6]:
crimes.show(1,truncate=False)

+--------+-----------+----------------------+---------------+----+--------------------------+-----------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+----------------------+--------+---------+--------+
|ID      |Case Number|Date                  |Block          |IUCR|Primary Type              |Description      |Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|Updated On            |Latitude|Longitude|Location|
+--------+-----------+----------------------+---------------+----+--------------------------+-----------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+----------------------+--------+---------+--------+
|13311263|JG503434   |07/29/2022 03:39:00 AM|023XX S TROY ST|1582|OFFENSE INVOLVING CHILDREN|CHILD PORNOGRAPHY|RESIDENCE           |true  |false   |1033|10      |25  |30         

In [7]:
crimes.columns

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

In [8]:
schema = t.StructType(
    [

t.StructField( 'ID',t.IntegerType(),True),
t.StructField( 'Case Number',t.StringType(),True),
 t.StructField('Date',t.StringType(),True),
 t.StructField('Block',t.StringType(),True),
 t.StructField('IUCR',t.StringType(),True),
 t.StructField('Primary Type',t.StringType(),True),
 t.StructField('Description',t.StringType(),True),
 t.StructField('Location Description',t.StringType(),True),
t.StructField( 'Arrest',t.BooleanType(),True),
 t.StructField('Domestic',t.BooleanType(),True),
 t.StructField( 'Beat',t.IntegerType(),True),
 t.StructField('District',t.IntegerType(),True),
 t.StructField('Ward',t.IntegerType(),True),
 t.StructField('Community Area',t.IntegerType(),True),
 t.StructField('FBI Code',t.StringType(),True),
 t.StructField('X Coordinate',t.IntegerType(),True),
 t.StructField('Y Coordinate',t.IntegerType(),True),
 t.StructField('Year',t.IntegerType(),True),
 t.StructField('Updated On',t.TimestampType(),True),
 t.StructField('Latitude',t.DoubleType(),True),
 t.StructField('Longitude',t.DoubleType(),True),
t.StructField( 'Location',t.StringType(),True)]

)

In [9]:
crimes= spark\
.read\
.option('header',True)\
.schema(schema)\
.csv('Crimes_-_2001_to_Present_20250513.csv')

In [10]:
crimes.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- Beat: integer (nullable = true)
 |-- District: integer (nullable = true)
 |-- Ward: integer (nullable = true)
 |-- Community Area: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: integer (nullable = true)
 |-- Y Coordinate: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated On: timestamp (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Location: string (nullable = true)



In [11]:
crimes.count()

8312355

In [12]:
crimes.show(2)

+--------+-----------+--------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+----------+--------+---------+--------+
|      ID|Case Number|                Date|               Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|Updated On|Latitude|Longitude|Location|
+--------+-----------+--------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+----------+--------+---------+--------+
|13311263|   JG503434|07/29/2022 03:39:...|     023XX S TROY ST|1582|OFFENSE INVOLVING...|   CHILD PORNOGRAPHY|           RESIDENCE|  true|   false|1033|      10|  25|            30|      17|        NULL|        NU

In [13]:
['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

In [14]:
crimes.where(f.col('ID').isNull()).count()
 

0

In [15]:
crimes.where(f.col('Case Number').isNull()).count()

0

In [16]:
 'Date',
 'Block',
'Year'
 'Primary Type',

 'Arrest',
 'Domestic',
 'Beat',
 

('Beat',)

In [47]:
#Dates Transformation
crimes = crimes.withColumn(
    "date_new",
    f.to_timestamp(f.col("Date"), "MM/dd/yyyy hh:mm:ss a")
)



In [48]:
crimes=crimes.withColumn('period',f.date_format(f.col('date_new'),'a'))


In [49]:
crimes=crimes.withColumn('Months',f.date_format(f.col('date_new'),'MM'))
crimes.select (f.col('Months')).show(2)

+------+
|Months|
+------+
|    07|
|    01|
+------+
only showing top 2 rows



In [109]:
crimes=crimes.withColumn('hours',f.date_format(f.col('date_new'),'hh'))
crimes.select (f.col('hours')).show(2)

+-----+
|hours|
+-----+
|   03|
|   04|
+-----+
only showing top 2 rows



In [20]:
# address Transformation

In [21]:
crimes.select (f.col('Block')).show(2)

+--------------------+
|               Block|
+--------------------+
|     023XX S TROY ST|
|039XX W WASHINGTO...|
+--------------------+
only showing top 2 rows



In [22]:
crimes=crimes.withColumn("new_block",f.regexp_replace(f.col("Block"),  f.col('Block')[0:7]," "))
crimes.select (f.col('new_block')).show(2)

+-----------------+
|        new_block|
+-----------------+
|          TROY ST|
|  WASHINGTON BLVD|
+-----------------+
only showing top 2 rows



In [23]:
crimes.select (f.col('Location Description')).distinct().show(3)
crimes=crimes.withColumn("short_location",f.split(f.col('Location Description')," ")[0])
crimes.select (f.col('short_location')).show(3)

+--------------------+
|Location Description|
+--------------------+
|   RAILROAD PROPERTY|
|SCHOOL - PRIVATE ...|
|AIRPORT TERMINAL ...|
+--------------------+
only showing top 3 rows

+--------------+
|short_location|
+--------------+
|     RESIDENCE|
|      SIDEWALK|
|   HOTEL/MOTEL|
+--------------+
only showing top 3 rows



In [107]:
crimes.where( f.col('Community Area').isNull()).count()
mode_v = crimes.where(f.col("Community Area").isNotNull()) \
                   .groupBy("Community Area") \
                   .count() \
                   .orderBy("count", ascending=False) \
                   .take(1)
mode_value = mode_v[0][0]

crimes=crimes.fillna({'Community Area':mode_value})

In [24]:
crimes.select (f.col('Primary Type')).distinct().show()

+--------------------+
|        Primary Type|
+--------------------+
|OFFENSE INVOLVING...|
|CRIMINAL SEXUAL A...|
|            STALKING|
|PUBLIC PEACE VIOL...|
|           OBSCENITY|
|               ARSON|
|   DOMESTIC VIOLENCE|
|            GAMBLING|
|   CRIMINAL TRESPASS|
|             ASSAULT|
|LIQUOR LAW VIOLATION|
| MOTOR VEHICLE THEFT|
|               THEFT|
|             BATTERY|
|             ROBBERY|
|            HOMICIDE|
|           RITUALISM|
|    PUBLIC INDECENCY|
| CRIM SEXUAL ASSAULT|
|   HUMAN TRAFFICKING|
+--------------------+
only showing top 20 rows



In [25]:
crimes.select (f.col('Community Area')).distinct().show()

+--------------+
|Community Area|
+--------------+
|            31|
|            65|
|            53|
|            34|
|            28|
|            76|
|            26|
|            27|
|            44|
|            12|
|            22|
|            47|
|             1|
|            52|
|            13|
|            16|
|             6|
|             3|
|            40|
|            20|
+--------------+
only showing top 20 rows



In [26]:
crimes.select (f.col('Domestic')).distinct().show()

+--------+
|Domestic|
+--------+
|    true|
|   false|
+--------+



In [27]:
crimes.select (f.col('Arrest')).distinct().show()

+------+
|Arrest|
+------+
|  true|
| false|
+------+



### Are crimes increasing or decreasing over time?

In [57]:
crimes.groupBy('Months').count().orderBy('Months').show()

+------+------+
|Months| count|
+------+------+
|    01|660994|
|    02|584261|
|    03|691012|
|    04|687643|
|    05|730191|
|    06|727834|
|    07|765515|
|    08|757815|
|    09|714008|
|    10|721979|
|    11|650259|
|    12|620844|
+------+------+



In [51]:
crimes.groupBy('period').count().show()

+------+-------+
|period|  count|
+------+-------+
|    PM|5217166|
|    AM|3095189|
+------+-------+



In [127]:
crimes.groupBy('hours','period').count().orderBy("count", ascending=False).show()

+-----+------+------+
|hours|period| count|
+-----+------+------+
|   12|    AM|482880|
|   12|    PM|477115|
|   07|    PM|465451|
|   08|    PM|462955|
|   06|    PM|453411|
|   09|    PM|447889|
|   03|    PM|442906|
|   10|    PM|439901|
|   05|    PM|428148|
|   04|    PM|420998|
|   02|    PM|417454|
|   01|    PM|393128|
|   11|    AM|368315|
|   11|    PM|367810|
|   09|    AM|358534|
|   10|    AM|352438|
|   08|    AM|281595|
|   01|    AM|264365|
|   02|    AM|223727|
|   07|    AM|191205|
+-----+------+------+
only showing top 20 rows



In [74]:
crimes.groupBy('Year').count().orderBy("Year", ascending=False).show()

+----+------+
|Year| count|
+----+------+
|2025| 74805|
|2024|258140|
|2023|262960|
|2022|239791|
|2021|209492|
|2020|212586|
|2019|261617|
|2018|269107|
|2017|269267|
|2016|269961|
|2015|264877|
|2014|275880|
|2013|307600|
|2012|336364|
|2011|352031|
|2010|370547|
|2009|392853|
|2008|427208|
|2007|437102|
|2006|448195|
+----+------+
only showing top 20 rows



### Are domestic crimes more likely to result in arrests?

In [77]:
crimes.groupBy(f.col('Domestic') , f.col('Arrest') ).count().show()

+--------+------+-------+
|Domestic|Arrest|  count|
+--------+------+-------+
|    true| false|1156449|
|    true|  true| 276455|
|   false| false|5047889|
|   false|  true|1831562|
+--------+------+-------+



In [61]:

crimes.groupBy(f.col('Domestic')  ).count().show()

+--------+-------+
|Domestic|  count|
+--------+-------+
|    true|1432904|
|   false|6879451|
+--------+-------+



In [62]:
crimes.groupBy(f.col('Arrest') ).count().show()

+------+-------+
|Arrest|  count|
+------+-------+
|  true|2108017|
| false|6204338|
+------+-------+



### What are the most common types of crime?

In [75]:
crimes.groupBy(f.col('Primary Type')).count().orderBy('count',ascending=False).show()

+--------------------+-------+
|        Primary Type|  count|
+--------------------+-------+
|               THEFT|1762465|
|             BATTERY|1514461|
|     CRIMINAL DAMAGE| 945462|
|           NARCOTICS| 760647|
|             ASSAULT| 553999|
|       OTHER OFFENSE| 517900|
|            BURGLARY| 440646|
| MOTOR VEHICLE THEFT| 422754|
|  DECEPTIVE PRACTICE| 380714|
|             ROBBERY| 311730|
|   CRIMINAL TRESPASS| 224262|
|   WEAPONS VIOLATION| 122249|
|        PROSTITUTION|  70317|
|OFFENSE INVOLVING...|  59644|
|PUBLIC PEACE VIOL...|  54250|
|         SEX OFFENSE|  33559|
| CRIM SEXUAL ASSAULT|  27324|
|INTERFERENCE WITH...|  19802|
|LIQUOR LAW VIOLATION|  15300|
|            GAMBLING|  14656|
+--------------------+-------+
only showing top 20 rows



In [63]:
crimes.groupBy (f.col('short_location')).count().show()

+------------------+------+
|    short_location| count|
+------------------+------+
|              BOAT|   140|
|             MOTEL|     7|
|          SIDEWALK|756151|
|               CAR|  3576|
|            SPORTS|  5972|
|           BANQUET|     2|
|        GOVERNMENT| 17312|
|            VACANT| 28489|
|  AIRPORT/AIRCRAFT| 16296|
|          HOSPITAL| 29187|
|              JAIL|  1298|
|           TRAILER|     4|
|           ROOMING|     2|
|          CEMETARY|   434|
|             HOUSE|   701|
|               ATM|  8683|
|          DRIVEWAY| 24448|
|VEHICLE-COMMERCIAL|  5635|
|         WAREHOUSE| 10641|
|          ATHLETIC| 10152|
+------------------+------+
only showing top 20 rows



In [64]:
crimes.groupBy (f.col('new_block')).count().show()

+--------------------+-----+
|           new_block|count|
+--------------------+-----+
|       FULLERTON AVE|33211|
|          MONTANA ST| 2996|
|          HOBART AVE|  142|
|        Blackhawk St|   11|
|         Brodman Ave|    1|
|       Sunnyside Ave|   23|
|           Barry ave|    5|
|         Aberdeen st|    4|
|        Clarence Ave|    1|
|  LAWRENCE AV JFK ER|    2|
|               44 PL|  118|
|        HOLLYWOOD AV|  235|
|         FULTON AV `|    1|
|           UNION AVE|27589|
|             40TH ST| 3343|
|            AVENUE H| 2702|
|            122ND ST|  829|
|           EVANS AVE|14524|
|           MEADE AVE| 4891|
|          Fremont St|   22|
+--------------------+-----+
only showing top 20 rows



### Which community areas have the most crimes?

In [108]:
crimes.groupBy (f.col('Community Area')).count().orderBy("count", ascending=False).show()

+--------------+-------+
|Community Area|  count|
+--------------+-------+
|            25|1088077|
|             8| 275690|
|            43| 253776|
|            28| 238624|
|            23| 238470|
|            24| 226161|
|            29| 223882|
|            71| 216717|
|            67| 214612|
|            49| 202402|
|            68| 197640|
|            32| 196094|
|            69| 191469|
|            66| 185352|
|            44| 170206|
|            22| 159068|
|             6| 158398|
|            61| 152894|
|            26| 143289|
|            27| 142484|
+--------------+-------+
only showing top 20 rows



### Compare crime count by Community Area over Year

In [129]:
crimes.groupBy("Community Area", "Year") \
      .count() \
      .orderBy("count",ascending=False) \
      .show()


+--------------+----+------+
|Community Area|Year| count|
+--------------+----+------+
|            25|2001|479889|
|            25|2002|153736|
|            25|2003| 30888|
|            25|2004| 29558|
|            25|2006| 28947|
|            25|2007| 28702|
|            25|2005| 28434|
|            25|2008| 27515|
|            25|2009| 26266|
|            25|2010| 24492|
|            25|2011| 22918|
|            25|2012| 21405|
|            25|2013| 20277|
|            25|2014| 18826|
|            25|2015| 17435|
|            25|2016| 16813|
|            28|2003| 15807|
|             8|2004| 15760|
|            23|2003| 15498|
|            25|2017| 15474|
+--------------+----+------+
only showing top 20 rows



### the percentage of cases that resulted in an arrest for each type of crime

In [132]:
crimes.groupBy("Primary Type") \
      .agg(f.avg(f.col("Arrest").cast("int")).alias("arrest_rate")) \
      .orderBy("arrest_rate", ascending=False) \
      .show(truncate=False)

+---------------------------------+-------------------+
|Primary Type                     |arrest_rate        |
+---------------------------------+-------------------+
|DOMESTIC VIOLENCE                |1.0                |
|PROSTITUTION                     |0.9957336063825248 |
|NARCOTICS                        |0.9936448838948947 |
|GAMBLING                         |0.9927674672489083 |
|LIQUOR LAW VIOLATION             |0.9905882352941177 |
|PUBLIC INDECENCY                 |0.9813084112149533 |
|CONCEALED CARRY LICENSE VIOLATION|0.966818477553676  |
|INTERFERENCE WITH PUBLIC OFFICER |0.9166750833249167 |
|OBSCENITY                        |0.7454153182308522 |
|WEAPONS VIOLATION                |0.724750304705969  |
|CRIMINAL TRESPASS                |0.6829734863686224 |
|OTHER NARCOTIC VIOLATION         |0.6624203821656051 |
|PUBLIC PEACE VIOLATION           |0.6249585253456221 |
|HOMICIDE                         |0.47854545454545455|
|NON-CRIMINAL (SUBJECT SPECIFIED) |0.33333333333