# Generate the Chicago Crime Dataset

This notebook generates the Chicago Crime dataset for 2023 through May 2025. The dataset is generated by backfilling the data between the start and end dates defined in the settings below.
The dataset is saved as a CSV file in the "data" folder.

We will use the following steps:
1. Import the necessary libraries.
2. Define the endpoint for the Chicago Crime dataset.
3. Set the start and end dates for the data to be generated.
4. Call the `backfill_chicago_data` function with the appropriate parameters.
5. Save the resulting dataframe as a CSV file in the "data" folder.

**Note:** To avoid committing the generated data to GitHub, we will update the `.gitignore` file to exclude the "csv" files from being tracked by Git.


## Data Source
We will use the Chicago Data Portal to download the Chicago Crime dataset. The dataset is available at https://data.cityofchicago.org/.

<figure>
 <img src="images/chicago_data_portal.png" width="100%" align="center"/>
<figcaption> The Chicago Data Portal</figcaption>
</figure>

<br>
<br />


The Chicago Crime dataset reflects reported incidents of crime that occurred in the City of Chicago from 2001 to present. The dataset is updated daily and includes information about the type of crime, location, date, time, and other details. It has more than 8 million records and 22 columns. Due to the size of the dataset, we will pull only the data from 2023 to May 2025. We will use a custom function from the `chicago_data` module (path: `functions/data/chicago_data.py`) to pull the data from the API.


More information about the dataset can be found at https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data



## Loading the Libraries

In [13]:
import pandas as pd
import pointblank as pb
import functions.data.chicago_data as cd
import datetime


## Settings

In [5]:
endpoint = "https://data.cityofchicago.org/resource/ijzp-q8t2"

start = datetime.datetime(2023, 1, 1, 0, 0, 0)
end = datetime.datetime(2025, 5, 1, 0, 0, 0)

## Pull the Data from the API

In [6]:
df = cd.backfill_chicago_data(
    endpoint=endpoint,
    start=start,
    end=end,
    offset=24 * 30,
    limit=100000,
)
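
The internals of `backfill_chicago_data` are not shown in this notebook. One plausible reading of its parameters is that `offset` is a window length in hours (24 * 30, or about 30 days) and `limit` caps the rows returned per request. The sketch below illustrates that reading only: `date_windows` and `build_query_url` are hypothetical helpers, and the `date` field name and the `$where`, `$limit`, and `$order` parameters are assumptions based on the Socrata SODA query conventions. No request is sent.

```python
import datetime
from urllib.parse import urlencode


def date_windows(start, end, hours):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    step = datetime.timedelta(hours=hours)
    current = start
    while current < end:
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


def build_query_url(endpoint, window_start, window_end, limit):
    """Build a SODA-style query URL for one window (hypothetical helper)."""
    # Assumption: the API exposes the incident timestamp in a field named "date".
    where = (
        f"date >= '{window_start.isoformat()}' "
        f"AND date < '{window_end.isoformat()}'"
    )
    params = {"$where": where, "$limit": limit, "$order": "date"}
    return f"{endpoint}.json?{urlencode(params)}"


endpoint = "https://data.cityofchicago.org/resource/ijzp-q8t2"
start = datetime.datetime(2023, 1, 1)
end = datetime.datetime(2025, 5, 1)

windows = list(date_windows(start, end, hours=24 * 30))
urls = [build_query_url(endpoint, s, e, limit=100000) for s, e in windows]
```

Splitting the range into windows keeps each request well below the row cap and makes it possible to retry a single window if a request fails.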

In [16]:
df["updated_on"] = pd.to_datetime(df["updated_on"])
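
As a small illustration of this conversion, the sketch below parses two ISO 8601 strings in the raw `updated_on` format into pandas timestamps. The sample values are taken from the rows displayed in this notebook; the `raw` and `parsed` names are only for illustration.

```python
import pandas as pd

# Raw values as they arrive from the API: ISO 8601 strings
raw = pd.Series(
    ["2023-09-24T15:41:26.000", "2023-08-20T15:40:56.000"],
    name="updated_on",
)

# pd.to_datetime converts the strings into a datetime64 column
parsed = pd.to_datetime(raw)
```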

In [17]:
df.head()

| | id | case_number | datetime | block | iucr | primary_type | description | location_description | arrest | domestic | ... | district | ward | community_area | fbi_code | x_coordinate | y_coordinate | year | updated_on | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13140855 | JG341458 | 2023-01-01 | 082XX S JEFFERY BLVD | 1754 | OFFENSE INVOLVING CHILDREN | AGGRAVATED SEXUAL ASSAULT OF CHILD BY FAMILY M... | APARTMENT | False | True | ... | 4 | 8 | 46 | 02 | 1190953 | 1850848 | 2023 | 2023-09-24 15:41:26 | 41.745738706 | -87.57588269 |
| 1 | 13180096 | JG387858 | 2023-01-01 | 075XX S WOLCOTT AVE | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER \$ 300 | RESIDENCE | False | False | ... | 6 | 17 | 71 | 11 | 1164996 | 1854651 | 2023 | 2023-08-20 15:40:56 | 41.756762329 | -87.670886691 |
| 2 | 13168471 | JG374193 | 2023-01-01 | 013XX W HARRISON ST | 460 | BATTERY | SIMPLE | SCHOOL - PUBLIC GROUNDS | False | False | ... | 12 | 34 | 28 | 08B | 1167465 | 1897475 | 2023 | 2023-08-19 15:40:26 | 41.874223466 | -87.660609583 |
| 3 | 13078152 | JG267031 | 2023-01-01 | 101XX S BEVERLY AVE | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER \$ 300 | RESIDENCE | False | False | ... | 22 | 21 | 73 | 11 | 1168800 | 1837525 | 2023 | 2023-08-19 15:40:26 | 41.709684879 | -87.657438571 |
| 4 | 13120699 | JG314178 | 2023-01-01 | 063XX N FAIRFIELD AVE | 1544 | SEX OFFENSE | SEXUAL EXPLOITATION OF A CHILD | OTHER (SPECIFY) | False | False | ... | 24 | 50 | 2 | 17 | 1156847 | 1941985 | 2023 | 2023-08-19 15:40:26 | 41.9965838 | -87.698383738 |


## Data Validation

In [18]:
df.dtypes

id                              object
case_number                     object
datetime                datetime64[ns]
block                           object
iucr                            object
primary_type                    object
description                     object
location_description            object
arrest                            bool
domestic                          bool
beat                            object
district                        object
ward                            object
community_area                  object
fbi_code                        object
x_coordinate                    object
y_coordinate                    object
year                            object
updated_on              datetime64[ns]
latitude                        object
longitude                       object
dtype: object
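
Several numeric-looking columns (`x_coordinate`, `y_coordinate`, `year`, `latitude`, `longitude`) are stored as `object`. If numeric types are wanted, one option is to cast them with `pd.to_numeric`. The sketch below shows this on a small hand-made sample. `cast_numeric` is a hypothetical helper, not part of the `chicago_data` module, and the missing values in the sample are assumed for illustration.

```python
import pandas as pd

# Small stand-in for the real dataframe; values are strings, as in df.dtypes above
sample = pd.DataFrame(
    {
        "x_coordinate": ["1190953", "1164996", None],
        "latitude": ["41.745738706", "41.756762329", None],
        "year": ["2023", "2023", "2023"],
    }
)


def cast_numeric(df, float_cols, int_cols):
    """Return a copy with the given columns cast to float64 / nullable Int64."""
    out = df.copy()
    for col in float_cols:
        # errors="coerce" turns unparseable values into NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    for col in int_cols:
        # Nullable Int64 keeps integer semantics even if values are missing
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    return out


typed = cast_numeric(sample, ["x_coordinate", "latitude"], ["year"])
```

Note that a schema check would then need to expect the cast types rather than `object`.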

In [None]:
schema = pb.Schema(
    columns=[
        ("id", "object"),
        ("case_number", "object"),
        ("datetime", "datetime64[ns]"),   
        ("block", "object"),
        ("iucr", "object"),
        ("primary_type", "object"),
        ("description", "object"),
        ("location_description", "object"),
        ("arrest", "bool"),
        ("domestic", "bool"),
        ("beat", "object"),
        ("district", "object"),
        ("ward", "object"),
        ("community_area", "object"),
        ("fbi_code", "object"),
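        # The dtypes below mirror the df.dtypes output above, where these
        # columns are still stored as object (no numeric cast was applied)
        ("x_coordinate", "object"),
        ("y_coordinate", "object"),
        ("year", "object"),
        ("updated_on", "datetime64[ns]"),
        ("latitude", "object"),
        ("longitude", "object")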
    ]
)

In [None]:
validation = (
    pb.Validate(
        data=df,
        tbl_name="Chicago Crime Data",
        label="Chicago Crime Data",
        thresholds=pb.Thresholds(warning=0, error=0, critical=0),
    )
    # Column names and dtypes must match the schema defined above
    .col_schema_match(schema=schema)
    # The table must have as many columns as the schema
    .col_count_match(count=len(schema.columns))
    # Key columns must not contain missing values
    .col_vals_not_null(columns=["id", "case_number", "datetime"])
    # No fully duplicated rows
    .rows_distinct()
    .interrogate()
)
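
To make the intent of the not-null and distinct-rows checks concrete, here is a rough pandas-only equivalent on a small sample. This is an illustrative sketch, not how pointblank implements its checks. `basic_checks` is a hypothetical helper, and the uniqueness check on `id` is an extra assumption about the data.

```python
import pandas as pd

# Three rows taken from the preview above
sample = pd.DataFrame(
    {
        "id": ["13140855", "13180096", "13168471"],
        "case_number": ["JG341458", "JG387858", "JG374193"],
        "datetime": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-01"]),
    }
)


def basic_checks(df, key_cols):
    """Return a dict of simple pass/fail data-quality checks."""
    return {
        # No missing values in any of the key columns
        "no_null_keys": bool(df[key_cols].notna().all().all()),
        # No row appears more than once
        "rows_distinct": not bool(df.duplicated().any()),
        # Assumption: each incident id should appear only once
        "unique_id": bool(df["id"].is_unique),
    }


results = basic_checks(sample, ["id", "case_number", "datetime"])
```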

In [8]:
df.describe()

| | datetime |
|---|---|
| count | 593448 |
| mean | 2024-02-22 12:39:29.957792768 |
| min | 2023-01-01 00:00:00 |
| 25% | 2023-07-29 16:49:15 |
| 50% | 2024-02-21 22:09:00 |
| 75% | 2024-09-12 11:43:30 |
| max | 2025-04-30 23:00:00 |
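
The minimum and maximum timestamps in this summary fall between the `start` and `end` values set earlier. The sketch below shows how such a range check could be written. `within_range` is a hypothetical helper, and treating `end` as exclusive is an assumption.

```python
import datetime

import pandas as pd


def within_range(timestamps, start, end):
    """Return True if every timestamp lies in the half-open range [start, end)."""
    return bool(((timestamps >= start) & (timestamps < end)).all())


# The min, median, and max values reported by df.describe()
sample = pd.Series(
    pd.to_datetime(
        ["2023-01-01 00:00:00", "2024-02-21 22:09:00", "2025-04-30 23:00:00"]
    )
)

ok = within_range(
    sample,
    datetime.datetime(2023, 1, 1),
    datetime.datetime(2025, 5, 1),
)
```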


In [10]:
df.head()

| | id | case_number | datetime | block | iucr | primary_type | description | location_description | arrest | domestic | ... | district | ward | community_area | fbi_code | x_coordinate | y_coordinate | year | updated_on | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13140855 | JG341458 | 2023-01-01 | 082XX S JEFFERY BLVD | 1754 | OFFENSE INVOLVING CHILDREN | AGGRAVATED SEXUAL ASSAULT OF CHILD BY FAMILY M... | APARTMENT | False | True | ... | 4 | 8 | 46 | 02 | 1190953 | 1850848 | 2023 | 2023-09-24T15:41:26.000 | 41.745738706 | -87.57588269 |
| 1 | 13180096 | JG387858 | 2023-01-01 | 075XX S WOLCOTT AVE | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER \$ 300 | RESIDENCE | False | False | ... | 6 | 17 | 71 | 11 | 1164996 | 1854651 | 2023 | 2023-08-20T15:40:56.000 | 41.756762329 | -87.670886691 |
| 2 | 13168471 | JG374193 | 2023-01-01 | 013XX W HARRISON ST | 460 | BATTERY | SIMPLE | SCHOOL - PUBLIC GROUNDS | False | False | ... | 12 | 34 | 28 | 08B | 1167465 | 1897475 | 2023 | 2023-08-19T15:40:26.000 | 41.874223466 | -87.660609583 |
| 3 | 13078152 | JG267031 | 2023-01-01 | 101XX S BEVERLY AVE | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER \$ 300 | RESIDENCE | False | False | ... | 22 | 21 | 73 | 11 | 1168800 | 1837525 | 2023 | 2023-08-19T15:40:26.000 | 41.709684879 | -87.657438571 |
| 4 | 13120699 | JG314178 | 2023-01-01 | 063XX N FAIRFIELD AVE | 1544 | SEX OFFENSE | SEXUAL EXPLOITATION OF A CHILD | OTHER (SPECIFY) | False | False | ... | 24 | 50 | 2 | 17 | 1156847 | 1941985 | 2023 | 2023-08-19T15:40:26.000 | 41.9965838 | -87.698383738 |


## Save the Data

In [11]:
df.to_csv("data/chicago_crime_2023-2025.csv", index = False)
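
CSV files do not store column types, so a later reader may need to restore them. The sketch below writes a small sample to a temporary folder and reads it back. It assumes that identifier-like columns such as `id` and `fbi_code` should stay strings (to keep leading zeros such as `02`). The file name and the sample frame are for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame(
    {
        "id": ["13140855", "13180096"],
        "fbi_code": ["02", "11"],
        "datetime": pd.to_datetime(["2023-01-01", "2023-01-01"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder if it does not exist yet
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / "chicago_crime_sample.csv"
    sample.to_csv(path, index=False)

    # Restore string and datetime columns explicitly when reading back
    restored = pd.read_csv(
        path,
        dtype={"id": str, "fbi_code": str},
        parse_dates=["datetime"],
    )
```

Without the explicit `dtype` mapping, pandas would infer `fbi_code` as an integer column and drop the leading zero.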