# Chicago Crimes Data Model
### Data Engineering Capstone Project

#### Project Summary
The aim of this project is to build a single source of truth for further analysis of Chicago crimes reported from January 2001 through June 2022.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [28]:
# Do all imports and installs here
import pandas as pd
import json
import pyspark
from pyspark import SparkContext 
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
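# NOTE: this `max` is the Spark aggregate function; it shadows Python's built-in max() in this notebook
from pyspark.sql.functions import col, max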

In [2]:
from pyspark import SparkConf
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

### Step 1: Scope the Project and Gather Data

#### Scope 
In this project, we establish a data model that prepares the data for further analysis.
This is done by extracting and joining the source data to form the corresponding tables: the Time, Districts and Primary Type dimensions and the Crimes fact table.

Tools used in this project:
- Jupyter Notebook.
- Python.
- PySpark library.

#### Describe and Gather Data 
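##### Chicago crimes data, split into 3 tables:
- CrimeDate:
crime data grouped by date and primary_type, with crime, arrest and false counts (false_count appears to be the number of crimes without an arrest).
- CrimeDesc:
crime descriptions, with the date and primary_type of each crime.
- CrimeLocation:
crime counts grouped by year, district, latitude/longitude and primary_type.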

##### Chicago districts data:
JSON file from https://data.cityofchicago.org/api/views/zidz-sdfj/rows.json?accessType=DOWNLOAD
containing the districts' names.
Describe the data sets you're using. Where did it come from? What type of information is included? 

In [3]:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# sc = pyspark.SparkContext(appName='SparkByExamples.com')

In [4]:
# Read in the data here
# df_columns = ["date", "primary_type", "crime_count", "arrest_count", "false_count"]
dates_df = spark.read.csv("Datasets/CrimeDate.csv",
                          header='true', 
                          inferSchema='true')
dates_df.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- crime_count: integer (nullable = true)
 |-- arrest_count: integer (nullable = true)
 |-- false_count: integer (nullable = true)



In [5]:
dates_df.count()

123789

In [6]:
crimes_df = spark.read.csv("Datasets/CrimeDesc.csv",
                          header='true', 
                          inferSchema='true')
crimes_df.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- description: string (nullable = true)



In [7]:
crimes_df.count()

7479177

In [8]:
locations_df = spark.read.csv("Datasets/CrimeLocation.csv",
                              header='true',
                              inferSchema='true')
locations_df.printSchema()

root
 |-- year: integer (nullable = true)
 |-- district: integer (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- crime_count: integer (nullable = true)



In [9]:
locations_df.count()

5417807

In [10]:
# Chicago districts data from:
# https://data.cityofchicago.org/api/views/zidz-sdfj/rows.json?accessType=DOWNLOAD

# sqlContext = SQLContext(spark.sparkContext)
# districts_df = sqlContext.read.json('Datasets/chicago_districts.json')
with open('Datasets/chicago_districts.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(
    data['data'],
    columns=['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at',
             'updated_meta', 'meta', 'district_name', 'designation_date'])
df.head(5)

| | sid | id | position | created_at | created_meta | updated_at | updated_meta | meta | district_name | designation_date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | row-kyf5_yb7r.vrwc | 00000000-0000-0000-E6B2-7767EA2D2877 | 0 | 1566178027 | | 1566178027 | | { } | Old Town Triangle | 244278000 |
| 1 | row-g4ak.p5ja.8pv8 | 00000000-0000-0000-2D36-D11E47447804 | 0 | 1566178027 | | 1566178027 | | { } | Milwaukee Avenue | 1207724400 |
| 2 | row-shrt~gdnn~3kcf | 00000000-0000-0000-9B11-357FA5616303 | 0 | 1566178027 | | 1566178027 | | { } | Astor Street | 188208000 |
| 3 | row-vapw~nh6d_sywy | 00000000-0000-0000-7C6C-ACED956B631A | 0 | 1566178027 | | 1566178027 | | { } | Beverly/Morgan Park Railroad Stations | 797929200 |
| 4 | row-hmi5~yp2s.6ttw | 00000000-0000-0000-EC91-F9519EBE1B46 | 0 | 1566178027 | | 1566178027 | | { } | Black Metropolis-Bronzeville | 905324400 |


In [11]:
# Explicit schema for the districts data; line continuations are not needed inside brackets
schema = StructType([
    StructField("sid", StringType(), True),
    StructField("id", StringType(), True),
    StructField("position", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("created_meta", StringType(), True),
    StructField("updated_at", StringType(), True),
    StructField("updated_meta", StringType(), True),
    StructField("meta", StringType(), True),
    StructField("district_name", StringType(), True),
    StructField("designation_date", IntegerType(), True),
])
districts_df = spark.createDataFrame(df, schema) 
districts_df.printSchema()

root
 |-- sid: string (nullable = true)
 |-- id: string (nullable = true)
 |-- position: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- created_meta: string (nullable = true)
 |-- updated_at: string (nullable = true)
 |-- updated_meta: string (nullable = true)
 |-- meta: string (nullable = true)
 |-- district_name: string (nullable = true)
 |-- designation_date: integer (nullable = true)



In [12]:
districts_df.count()

59
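
The `designation_date` values shown above (for example `244278000`) look like Unix epoch seconds. That is an assumption, since the source file does not document the unit. Under that assumption, a minimal sketch of decoding them with the standard library could look like this (the helper name `epoch_to_date` is hypothetical):

```python
from datetime import date, datetime, timezone

def epoch_to_date(seconds: int) -> date:
    """Interpret an integer as Unix epoch seconds and return the UTC calendar date."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()

# Sample values copied from the df.head(5) output above.
samples = {
    "Old Town Triangle": 244278000,
    "Milwaukee Avenue": 1207724400,
    "Astor Street": 188208000,
}

# Assumption: the integers are epoch seconds, decoded here in UTC.
decoded = {name: epoch_to_date(value) for name, value in samples.items()}
for name, day in decoded.items():
    print(name, day.isoformat())
```

If the decoded dates look plausible for the districts, the same conversion could be applied to the whole column before loading it into the Districts dimension.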

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

In [13]:
dates_df.show(10)

+-------------------+--------------------+-----------+------------+-----------+
|               date|        primary_type|crime_count|arrest_count|false_count|
+-------------------+--------------------+-----------+------------+-----------+
|2001-01-01 00:00:00| MOTOR VEHICLE THEFT|         59|           9|         50|
|2001-01-01 00:00:00|   WEAPONS VIOLATION|         32|          26|          6|
|2001-01-01 00:00:00|  DECEPTIVE PRACTICE|         78|          16|         62|
|2001-01-01 00:00:00|   CRIMINAL TRESPASS|         29|          17|         12|
|2001-01-01 00:00:00|            GAMBLING|          2|           2|          0|
|2001-01-01 00:00:00|             ROBBERY|         40|           5|         35|
|2001-01-01 00:00:00|            BURGLARY|         65|           5|         60|
|2001-01-01 00:00:00|INTERFERENCE WITH...|          1|           1|          0|
|2001-01-01 00:00:00|PUBLIC PEACE VIOL...|          5|           2|          3|
|2001-01-01 00:00:00|LIQUOR LAW VIOLATIO

In [14]:
dates_df.describe().show()

+-------+-----------------+-----------------+------------------+-----------------+
|summary|     primary_type|      crime_count|      arrest_count|      false_count|
+-------+-----------------+-----------------+------------------+-----------------+
|  count|           123789|           123788|            123788|           123788|
|   mean|             null|53.59252108443468|15.081179112676512|38.51134197175817|
| stddev|             null|66.06540926173342|27.042152198323155|56.28782463503632|
|    min|            ARSON|                1|                 1|                0|
|    max|WEAPONS VIOLATION|              559|               559|              437|
+-------+-----------------+-----------------+------------------+-----------------+
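
In the summary above, `primary_type` has 123789 non-null values while the three count columns have 123788, which suggests that one row is missing its counts. A minimal sketch of a per-column null count follows, written in plain Python over a few made-up rows so that it runs without a Spark session (the sample rows and the helper `null_counts` are hypothetical):

```python
from typing import Any, Dict, Iterable, List

def null_counts(rows: Iterable[Dict[str, Any]], columns: List[str]) -> Dict[str, int]:
    """Return how many rows have None in each of the given columns."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if row.get(column) is None:
                counts[column] += 1
    return counts

# Made-up rows that mimic the CrimeDate layout; the last one has no counts.
sample_rows = [
    {"primary_type": "ROBBERY", "crime_count": 40, "arrest_count": 5, "false_count": 35},
    {"primary_type": "BURGLARY", "crime_count": 65, "arrest_count": 5, "false_count": 60},
    {"primary_type": "O", "crime_count": None, "arrest_count": None, "false_count": None},
]
columns = ["primary_type", "crime_count", "arrest_count", "false_count"]

missing = null_counts(sample_rows, columns)
print(missing)
```

Running the same kind of count on the real table would show which columns need attention in the cleaning step.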



In [15]:
crimes_df.show(10)

+-------------------+--------------------+--------------------+
|               date|        primary_type|         description|
+-------------------+--------------------+--------------------+
|2001-01-01 00:00:00|OFFENSE INVOLVING...|AGG CRIM SEX ABUS...|
|2001-01-01 00:00:00|     CRIMINAL DAMAGE|          TO VEHICLE|
|2001-01-01 00:00:00|     CRIMINAL DAMAGE|          TO VEHICLE|
|2001-01-01 00:00:00|           NARCOTICS|POSS: CANNABIS 30...|
|2001-01-01 00:00:00|               THEFT|           OVER $500|
|2001-01-01 00:00:00|       OTHER OFFENSE|    TELEPHONE THREAT|
|2001-01-01 00:00:00|          KIDNAPPING|UNLAWFUL INTERFER...|
|2001-01-01 00:00:00|           NARCOTICS|FOUND SUSPECT NAR...|
|2001-01-01 00:00:00|               THEFT|           OVER $500|
|2001-01-01 00:00:00|               THEFT|      $500 AND UNDER|
+-------------------+--------------------+--------------------+
only showing top 10 rows



In [16]:
crimes_df.describe().show()

+-------+-----------------+---------------+
|summary|     primary_type|    description|
+-------+-----------------+---------------+
|  count|          7479177|        7479177|
|   mean|             null|           null|
| stddev|             null|           null|
|    min|            ARSON| $300 AND UNDER|
|    max|WEAPONS VIOLATION|WIREROOM/SPORTS|
+-------+-----------------+---------------+



In [17]:
locations_df.show(10)

+----+--------+------------+-------------+-------------+-----------+
|year|district|    latitude|    longitude| primary_type|crime_count|
+----+--------+------------+-------------+-------------+-----------+
|2001|       6|41.726587107|-87.628650815|OTHER OFFENSE|          1|
|2001|       6|41.743292263|-87.601517412|      ROBBERY|          1|
|2001|       7|41.784650088|-87.665708402|      ASSAULT|          2|
|2001|      24|42.016181819|-87.666751351|      ROBBERY|          1|
|2001|      10|41.861351884|-87.694550926|      BATTERY|          5|
|2001|      14|41.917669402|-87.679950307|      BATTERY|          1|
|2001|      10|41.856919058|-87.726270093|      BATTERY|          3|
|2001|      20| 41.98537745|-87.657686188|      BATTERY|          1|
|2001|      11|41.879549718|-87.705125189|      BATTERY|          2|
|2001|       3|41.759980814| -87.56624795|      BATTERY|          2|
+----+--------+------------+-------------+-------------+-----------+
only showing top 10 rows



In [18]:
locations_df.describe().show()

+-------+-----------------+------------------+-------------------+-------------------+-----------------+------------------+
|summary|             year|          district|           latitude|          longitude|     primary_type|       crime_count|
+-------+-----------------+------------------+-------------------+-------------------+-----------------+------------------+
|  count|          5417807|           5417760|            5417807|            5417807|          5417807|           5417807|
|   mean| 2009.78657859167|11.543600491716134|  41.84087468860387| -87.67337706996014|             null|1.3804805154557924|
| stddev|5.989131793791821| 6.918199651874521|0.08832914173955167|0.05830104426599391|             null| 2.726444299721679|
|    min|             2001|                 1|       41.644585429|      -87.939732936|            ARSON|                 1|
|    max|             2022|                31|       42.022910333|      -87.524529378|WEAPONS VIOLATION|               746|
+-------

In [19]:
districts_df.show(10)

+------------------+--------------------+--------+----------+------------+----------+------------+----+--------------------+----------------+
|               sid|                  id|position|created_at|created_meta|updated_at|updated_meta|meta|       district_name|designation_date|
+------------------+--------------------+--------+----------+------------+----------+------------+----+--------------------+----------------+
|row-kyf5_yb7r.vrwc|00000000-0000-000...|       0|1566178027|        null|1566178027|        null| { }|   Old Town Triangle|       244278000|
|row-g4ak.p5ja.8pv8|00000000-0000-000...|       0|1566178027|        null|1566178027|        null| { }|    Milwaukee Avenue|      1207724400|
|row-shrt~gdnn~3kcf|00000000-0000-000...|       0|1566178027|        null|1566178027|        null| { }|        Astor Street|       188208000|
|row-vapw~nh6d_sywy|00000000-0000-000...|       0|1566178027|        null|1566178027|        null| { }|Beverly/Morgan Pa...|       797929200|
|row-h

In [38]:
# Approach for finding the max date adapted from:
# https://stackoverflow.com/questions/38377894/how-to-get-maxdate-from-given-set-of-data-grouped-by-some-fields-using-pyspark
# Show the latest date per primary_type in the CrimeDate dataset
(dates_df.withColumn("date", col("date").cast("timestamp"))
    .groupBy("primary_type")
    .agg(max("date"))).show()

+--------------------+-------------------+
|        primary_type|          max(date)|
+--------------------+-------------------+
|OFFENSE INVOLVING...|2019-01-09 00:00:00|
|CRIMINAL SEXUAL A...|2019-01-12 00:00:00|
|            STALKING|2019-01-12 00:00:00|
|PUBLIC PEACE VIOL...|2019-01-12 00:00:00|
|           OBSCENITY|2019-01-07 00:00:00|
|NON-CRIMINAL (SUB...|2018-04-16 00:00:00|
|               ARSON|2018-12-26 00:00:00|
|   DOMESTIC VIOLENCE|2001-01-11 00:00:00|
|            GAMBLING|2019-01-07 00:00:00|
|   CRIMINAL TRESPASS|2019-01-13 00:00:00|
|             ASSAULT|2019-01-13 00:00:00|
|LIQUOR LAW VIOLATION|2019-01-12 00:00:00|
|                   O|2019-01-13 00:00:00|
| MOTOR VEHICLE THEFT|2019-01-12 00:00:00|
|               THEFT|2019-01-13 00:00:00|
|             BATTERY|2019-01-13 00:00:00|
|             ROBBERY|2019-01-13 00:00:00|
|            HOMICIDE|2019-01-12 00:00:00|
|           RITUALISM|2006-02-28 00:00:00|
|    PUBLIC INDECENCY|2018-12-19 00:00:00|
+----------

In [32]:
# Show the latest date per primary_type in the CrimeDesc dataset
(crimes_df.withColumn("date", col("date").cast("timestamp"))
    .groupBy("primary_type")
    .agg(max("date"))).show()

+--------------------+-------------------+
|        primary_type|          max(date)|
+--------------------+-------------------+
|OFFENSE INVOLVING...|2022-06-14 00:00:00|
|CRIMINAL SEXUAL A...|2022-06-14 00:00:00|
|            STALKING|2022-06-09 00:00:00|
|PUBLIC PEACE VIOL...|2022-06-14 00:00:00|
|           OBSCENITY|2022-06-05 00:00:00|
|NON-CRIMINAL (SUB...|2018-08-23 00:00:00|
|               ARSON|2022-06-14 00:00:00|
|   DOMESTIC VIOLENCE|2001-01-11 00:00:00|
|            GAMBLING|2022-06-10 00:00:00|
|   CRIMINAL TRESPASS|2022-06-14 00:00:00|
|             ASSAULT|2022-06-14 00:00:00|
|LIQUOR LAW VIOLATION|2022-06-14 00:00:00|
| MOTOR VEHICLE THEFT|2022-06-14 00:00:00|
|               THEFT|2022-06-14 00:00:00|
|             BATTERY|2022-06-14 00:00:00|
|             ROBBERY|2022-06-14 00:00:00|
|            HOMICIDE|2022-06-12 00:00:00|
|           RITUALISM|2020-11-17 00:00:00|
|    PUBLIC INDECENCY|2022-05-31 00:00:00|
|   HUMAN TRAFFICKING|2022-06-08 00:00:00|
+----------

In [36]:
# Show the latest year per primary_type in the CrimeLocation dataset
(locations_df.groupBy("primary_type")
    .agg(max("year"))).show()

+--------------------+---------+
|        primary_type|max(year)|
+--------------------+---------+
|OFFENSE INVOLVING...|     2022|
|CRIMINAL SEXUAL A...|     2022|
|            STALKING|     2022|
|PUBLIC PEACE VIOL...|     2022|
|           OBSCENITY|     2022|
|NON-CRIMINAL (SUB...|     2018|
|               ARSON|     2022|
|   DOMESTIC VIOLENCE|     2001|
|            GAMBLING|     2022|
|   CRIMINAL TRESPASS|     2022|
|             ASSAULT|     2022|
|LIQUOR LAW VIOLATION|     2022|
| MOTOR VEHICLE THEFT|     2022|
|               THEFT|     2022|
|             BATTERY|     2022|
|             ROBBERY|     2022|
|            HOMICIDE|     2022|
|           RITUALISM|     2020|
|    PUBLIC INDECENCY|     2022|
|   HUMAN TRAFFICKING|     2022|
+--------------------+---------+
only showing top 20 rows
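
The outputs above show the displayed CrimeDate rows ending in January 2019, while CrimeDesc and CrimeLocation reach 2022, so the datasets may not cover the same period. One way to make that explicit is to compute a single overall maximum per dataset and take the earliest one as the common cut-off. The sketch below does this in plain Python with hard-coded inputs taken from the outputs above. Those outputs list only the first 20 groups, so the true maxima may differ, and the end-of-year date used for CrimeLocation is an assumption because only the year is available.

```python
from datetime import date
from typing import Dict

# Latest values visible in the (truncated) outputs above; illustrative only.
latest_by_dataset: Dict[str, date] = {
    "CrimeDate": date(2019, 1, 13),
    "CrimeDesc": date(2022, 6, 14),
    "CrimeLocation": date(2022, 12, 31),  # assumption: only the year 2022 is known
}

def common_cutoff(latest: Dict[str, date]) -> date:
    """Return the last date that every dataset can cover."""
    return min(latest.values())

cutoff = common_cutoff(latest_by_dataset)

# Datasets whose coverage ends at the cut-off limit the joined model.
lagging = sorted(name for name, day in latest_by_dataset.items() if day == cutoff)
print(cutoff.isoformat(), lagging)
```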



#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here
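
The exploration above points at two likely cleaning tasks: rows with missing values (for example, `district` has 5417760 non-null values out of 5417807 rows in CrimeLocation) and possible duplicate rows. A minimal sketch of both steps in plain Python follows; the helper names and sample rows are hypothetical. Deduplication would only make sense for the aggregated tables, because CrimeDesc legitimately repeats identical rows (two crimes of the same type and description on the same day).

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def drop_incomplete(rows: List[Row], required: List[str]) -> List[Row]:
    """Keep only rows where every required column has a non-null value."""
    return [row for row in rows if all(row.get(name) is not None for name in required)]

def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Remove exact duplicate rows while preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # column names are unique, so sorting is safe
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows in the CrimeLocation layout: one null district, one exact duplicate.
raw = [
    {"year": 2001, "district": 6, "primary_type": "ROBBERY", "crime_count": 1},
    {"year": 2001, "district": None, "primary_type": "BATTERY", "crime_count": 2},
    {"year": 2001, "district": 6, "primary_type": "ROBBERY", "crime_count": 1},
]

cleaned = drop_duplicates(drop_incomplete(raw, ["district", "crime_count"]))
print(len(raw), "->", len(cleaned))
```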





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model
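
One pipeline step implied by the scope is deriving the Time dimension from the `date` column. The sketch below assumes the dimension holds year, month, day and weekday attributes keyed by an integer `date_key` in `YYYYMMDD` form. These column names are assumptions and are not defined anywhere in the source.

```python
from datetime import datetime
from typing import Any, Dict

def time_dimension_row(ts: datetime) -> Dict[str, Any]:
    """Build one Time dimension row from a timestamp (hypothetical column names)."""
    d = ts.date()
    return {
        "date_key": int(d.strftime("%Y%m%d")),
        "date": d,
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.isoweekday(),  # 1 = Monday ... 7 = Sunday
    }

# Timestamps as they appear in the crime tables; repeated dates collapse to one row.
timestamps = [datetime(2001, 1, 1), datetime(2001, 1, 1), datetime(2022, 6, 14)]
time_dim = {row["date_key"]: row for row in map(time_dimension_row, timestamps)}

for key in sorted(time_dim):
    print(key, time_dim[key]["weekday"])
```

Keying the dictionary by `date_key` is what removes duplicates here, so each calendar date yields exactly one dimension row.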

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here
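
As a starting point, the following sketch shows how the Primary Type dimension and the Crimes fact table could be derived from CrimeDate rows. It is plain Python rather than PySpark so that it runs on its own. The surrogate key `primary_type_id`, the helper names and the choice of fact columns are assumptions, and the sample rows are copied from the `dates_df.show(10)` output.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def build_primary_type_dim(rows: List[Row]) -> Dict[str, int]:
    """Assign a surrogate key to each distinct primary_type, in sorted order."""
    names = sorted({row["primary_type"] for row in rows})
    return {name: key for key, name in enumerate(names, start=1)}

def build_fact(rows: List[Row], type_keys: Dict[str, int]) -> List[Row]:
    """Replace the primary_type text with its surrogate key in each fact row."""
    return [
        {
            "date": row["date"],
            "primary_type_id": type_keys[row["primary_type"]],
            "crime_count": row["crime_count"],
            "arrest_count": row["arrest_count"],
        }
        for row in rows
    ]

source_rows = [
    {"date": "2001-01-01", "primary_type": "MOTOR VEHICLE THEFT", "crime_count": 59, "arrest_count": 9},
    {"date": "2001-01-01", "primary_type": "ROBBERY", "crime_count": 40, "arrest_count": 5},
    {"date": "2001-01-01", "primary_type": "BURGLARY", "crime_count": 65, "arrest_count": 5},
]

type_dim = build_primary_type_dim(source_rows)
fact_rows = build_fact(source_rows, type_dim)
print(type_dim)
print(fact_rows[0])
```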

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
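
The checks listed above could be sketched as a single function that returns a list of failures. This version is plain Python over in-memory rows and covers completeness, key uniqueness and a source/count comparison. The function name, the key columns and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def run_quality_checks(source: List[Row], fact: List[Row], key_columns: List[str]) -> List[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    failures = []

    # Completeness: the fact table must not be empty.
    if not fact:
        failures.append("fact table is empty")

    # Uniqueness: no two fact rows may share the same key.
    keys = [tuple(row[name] for name in key_columns) for row in fact]
    if len(keys) != len(set(keys)):
        failures.append("duplicate keys in fact table")

    # Source/count check: the crime total must survive the pipeline unchanged.
    source_total = sum(row["crime_count"] for row in source)
    fact_total = sum(row["crime_count"] for row in fact)
    if source_total != fact_total:
        failures.append(f"crime_count total changed: {source_total} -> {fact_total}")

    return failures

source = [
    {"date": "2001-01-01", "primary_type": "ROBBERY", "crime_count": 40},
    {"date": "2001-01-01", "primary_type": "BURGLARY", "crime_count": 65},
]
good_fact = [
    {"date": "2001-01-01", "primary_type_id": 3, "crime_count": 40},
    {"date": "2001-01-01", "primary_type_id": 1, "crime_count": 65},
]
bad_fact = good_fact + [good_fact[0]]  # simulate a row loaded twice

key = ["date", "primary_type_id"]
good_result = run_quality_checks(source, good_fact, key)
bad_result = run_quality_checks(source, bad_fact, key)
print(good_result)
print(bad_result)
```

Returning every failure, instead of stopping at the first one, makes a single run report all problems at once.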

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.