# Analyzing Chicago Crime - Extracting data

link: https://dev.socrata.com/foundry/data.cityofchicago.org/x2n5-8w5q

## To do

- [x] sign up for app token [here](https://data.cityofchicago.org/login)
- [x] create env file in main dir + Update jupyter notebook to use env file
- [x] Push to github
- [x] Recheck deduplicate for-loop (confirm we can dedup along 'case_' column)
- [ ] Create logic for extraction, using the while loop and pagination in tandem with **order by `:id`**
- [ ] Transformations
- [ ] AWS
- [ ] Database
- [ ] Logic for UPSERT
- [ ] Docker

## Questions

- [ ] How will we manage the env file in the docker container?
- [ ] Whose app token are we going to use? Or do we assume that whoever runs our container has to get their own app token?

## Imports & configurations

In [2]:
# imports
import pandas as pd
import requests
import os
from dotenv import load_dotenv 
import json

# pandas configurations
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# .env
dotenv_path = os.path.expanduser("~/Documents/DEC/dec-proj1-chicago-crime/.env")  #Enter path to env file here
load_dotenv(dotenv_path)

# variables
APP_TOKEN = os.environ.get("APP_TOKEN")

## Extracting Logic (WIP)

- Retrieve the MAX `date_of_occurrence` from the database
- Round up to the next day (2024-01-01T23:50:00.000 --> 2024-01-02T00:00:00.000); this becomes the `start_time` variable
- Add 23 hours, 59 mins, 59 seconds, 999 milliseconds to the time above using the python datetime module. This becomes the `end_time` variable
- Query the API using the `BETWEEN ... AND ...` operator with the `start_time` and `end_time` variables
- Do transformations in python
- upsert into database
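
The window calculation in the steps above could look something like the sketch below. The function name `next_day_window` is made up for illustration, and it assumes the max date arrives as an ISO string in the same format the API returns.

```python
from datetime import datetime, timedelta


def _to_api_format(moment):
    # Format like the API timestamps, e.g. 2024-01-02T00:00:00.000
    return moment.isoformat(timespec="milliseconds")


def next_day_window(max_date_str):
    """Return (start_time, end_time) strings for the day after max_date_str."""
    max_date = datetime.fromisoformat(max_date_str)
    # Round up to midnight of the following day.
    start = (max_date + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    # Add 23 h, 59 min, 59 s, 999 ms to reach the last millisecond of that day.
    end = start + timedelta(hours=23, minutes=59, seconds=59, milliseconds=999)
    return _to_api_format(start), _to_api_format(end)


start_time, end_time = next_day_window("2024-01-01T23:50:00.000")
print(start_time, end_time)
```

The two strings can then be dropped into the `$where=date_of_occurrence between '...' and '...'` clause.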

### Problems

- we really won't be upserting, since we're only going to be uploading a day's worth of data at a time.

### Thoughts from a balcony

- If we upload only a day's worth of data at a time, every hour, that is going to be:
  - 365 - 7 (1 year minus most recent 7 days) = **358** days in dataset
  - 358 uploads / 24 uploads per day = Approx **15 days** to get all the data.
- If we upload 7 days worth of data at a time, every hour...
  - 358 / 7 ≈ 51.1, rounded up to 52 uploads
  - 52 / 24 = approx. **2.2 days** to get all the data
  - 7 days worth of data is ~5000 records per upload
- So... we could upload 7 days' worth of data at a time. Once we start getting no records, meaning we have reached the end of the dataset, we can fetch the min date from the dataset and start fetching 7 days' worth of data again from there. At that point we will be upserting.
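
To compare window sizes quickly, the back-of-the-envelope arithmetic above can be wrapped in a small helper. This is only a sketch: `backfill_estimate` is a hypothetical name, and the 358-day span and 24 runs per day are the figures assumed in the notes.

```python
import math


def backfill_estimate(total_days, window_days, runs_per_day=24):
    """Return (number of uploads, days needed) to load the whole dataset."""
    # A partial final window still costs one upload, so round up.
    uploads = math.ceil(total_days / window_days)
    days_needed = uploads / runs_per_day
    return uploads, days_needed


for window in (1, 7):
    uploads, days_needed = backfill_estimate(358, window)
    print(f"{window}-day window: {uploads} uploads, about {days_needed:.1f} days")
```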

## Extracting Data (WIP)

To be updated once the extraction logic using the `requests` library is complete.

## Transformations (WIP)

### Drop 'location' column 

Drop the 'location' column, since it repeats data we already have in the longitude and latitude columns.

In [None]:
df.drop('location', axis=1, inplace=True)

### Deduplicate

In [None]:
df.drop_duplicates(subset='case_', inplace=True)

## Checking for Duplicates

In [5]:
file_path = "~/Documents/DEC/dec-proj1-chicago-crime/data_2024_01_12.csv"

df = pd.read_csv(file_path)

df = df.drop(columns=['location','Unnamed: 0'])

In [9]:
df

Unnamed: 0,case_,date_of_occurrence,block,_iucr,_primary_decsription,_secondary_description,_location_description,arrest,domestic,beat,ward,fbi_cd,x_coordinate,y_coordinate,latitude,longitude,:@computed_region_awaf_s7ux,:@computed_region_6mkv_f3dw,:@computed_region_vrxf_vc4k,:@computed_region_bdys_3d7i,:@computed_region_43wa_7qmu,:@computed_region_rpca_8um6
0,JG105027,2023-01-05T05:10:00.000,052XX N OAKVIEW AVE,0910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,N,N,1614,41.0,07,1117315.0,1933582.0,41.974237,-87.843985,34.0,4448.0,75.0,8.0,29.0,31.0
1,JG104971,2023-01-05T05:14:00.000,017XX W 87TH ST,0610,BURGLARY,FORCIBLE ENTRY,RESTAURANT,Y,N,2221,21.0,05,1166065.0,1846996.0,41.735733,-87.667186,18.0,21554.0,70.0,488.0,13.0,59.0
2,JG104986,2023-01-05T05:20:00.000,023XX W CERMAK RD,031A,ROBBERY,ARMED - HANDGUN,PARKING LOT / GARAGE (NON RESIDENTIAL),N,N,1234,25.0,03,1161101.0,1889344.0,41.852046,-87.684201,8.0,14920.0,33.0,4.0,26.0,43.0
3,JG110497,2023-01-05T05:20:00.000,003XX N LARAMIE AVE,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,OTHER (SPECIFY),N,N,1523,37.0,11,1141651.0,1901688.0,41.886301,-87.755283,11.0,22216.0,26.0,696.0,23.0,32.0
4,JG104975,2023-01-05T05:25:00.000,093XX S LUELLA AVE,0910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,N,Y,413,7.0,07,1192807.0,1843427.0,41.725330,-87.569331,43.0,21202.0,44.0,492.0,37.0,25.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258718,JH108580,2024-01-04T00:00:00.000,025XX E 83RD ST,0910,MOTOR VEHICLE THEFT,AUTOMOBILE,PARKING LOT / GARAGE (NON RESIDENTIAL),N,N,423,7.0,07,1194757.0,1850436.0,41.744515,-87.561958,43.0,21202.0,42.0,478.0,37.0,25.0
258719,JH104216,2024-01-04T00:00:00.000,019XX N DAYTON ST,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,N,N,1813,43.0,14,1170314.0,1913138.0,41.917142,-87.649691,51.0,21190.0,68.0,168.0,34.0,16.0
258720,JH104042,2024-01-04T00:00:00.000,045XX N MALDEN ST,0820,THEFT,$500 AND UNDER,STREET,N,N,1913,46.0,06,1166728.0,1930315.0,41.964354,-87.662372,37.0,22616.0,31.0,611.0,39.0,15.0
258721,JH105047,2024-01-04T00:00:00.000,054XX S LARAMIE AVE,0610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,N,N,814,23.0,05,1142626.0,1868066.0,41.794019,-87.752538,35.0,22268.0,53.0,607.0,6.0,7.0


In [10]:
duplicates_all_cols = df[df.duplicated()]

len(duplicates_all_cols)

# 17 rows that are duplicated along all cols

17

In [11]:
# drop 17 duplicates

df.drop_duplicates(inplace=True)

In [12]:
# confirming drop

len(df)

258706

In [16]:
duplicates = df[df.duplicated(['case_'])]

len(duplicates)

# 20 duplicates along case_ column

20

There are still duplicated records when filtering by case number, indicating that these cases have multiple records that differ in at least one other column.

**So Questions:**

1. What columns have multiple unique values for cases with multiple records?
2. And can we drop duplicates based on just the case number?

In [17]:
# view cases that have duplicate records

duplicates

Unnamed: 0,case_,date_of_occurrence,block,_iucr,_primary_decsription,_secondary_description,_location_description,arrest,domestic,beat,ward,fbi_cd,x_coordinate,y_coordinate,latitude,longitude,:@computed_region_awaf_s7ux,:@computed_region_6mkv_f3dw,:@computed_region_vrxf_vc4k,:@computed_region_bdys_3d7i,:@computed_region_43wa_7qmu,:@computed_region_rpca_8um6
61500,JG171194,2023-04-07T15:45:00.000,042XX W VAN BUREN ST,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,1132,28.0,01A,1148292.0,1897688.0,41.875199,-87.730999,36.0,21572.0,27.0,717.0,23.0,30.0
48674,JG191739,2023-03-20T00:44:00.000,004XX E 71ST ST,110,HOMICIDE,FIRST DEGREE MURDER,STREET,Y,N,322,6.0,01A,1180240.0,1858037.0,41.765718,-87.614917,31.0,22260.0,67.0,480.0,32.0,61.0
61564,JG214225,2023-04-07T17:23:00.000,044XX W WEST END AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,1113,28.0,01A,1146602.0,1900602.0,41.883227,-87.73713,11.0,21572.0,27.0,732.0,23.0,30.0
67798,JG225210,2023-04-16T06:19:00.000,083XX S LUELLA AVE,110,HOMICIDE,FIRST DEGREE MURDER,HOUSE,Y,N,412,7.0,01A,1192630.0,1850099.0,41.743643,-87.569762,9.0,21202.0,42.0,507.0,37.0,25.0
85029,JG256987,2023-05-11T18:06:00.000,006XX W 61ST PL,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,711,16.0,01A,1172987.0,1864107.0,41.782538,-87.641322,19.0,21559.0,66.0,22.0,2.0,11.0
99557,JG282670,2023-05-31T19:05:00.000,007XX N HOMAN AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,1121,27.0,01A,1153579.0,1904680.0,41.894282,-87.711401,41.0,21572.0,24.0,584.0,46.0,30.0
114047,JG304885,2023-06-20T12:10:00.000,001XX S HOMAN AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,1123,28.0,01A,1153747.0,1898983.0,41.878646,-87.710936,11.0,21572.0,28.0,783.0,23.0,30.0
112828,JG306277,2023-06-18T20:31:00.000,099XX S PRINCETON AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,511,9.0,01A,1176038.0,1839095.0,41.713834,-87.630885,30.0,21861.0,45.0,569.0,43.0,19.0
119906,JG318738,2023-06-27T20:20:00.000,015XX W 59TH ST,110,HOMICIDE,FIRST DEGREE MURDER,STREET,N,N,713,16.0,01A,1166867.0,1865596.0,41.786757,-87.663717,44.0,22257.0,65.0,384.0,2.0,23.0
123871,JG325333,2023-07-03T02:03:00.000,026XX S MILLARD AVE,110,HOMICIDE,FIRST DEGREE MURDER,APARTMENT,Y,Y,1032,22.0,01A,1152462.0,1886281.0,41.843815,-87.715989,14.0,21569.0,32.0,227.0,28.0,57.0


Below, checking which columns have multiple unique values for each case number with duplicate records:

In [30]:
# create list of cols
col_list = list(duplicates.columns)


# initialize empty list to capture case numbers and what columns are different
duplicates_breakdown = list()


# Check what is duplicated
for case in duplicates['case_']:  # cycle through cases that have duplicated records
    duplicates_cols = list()  # initialize empty list to capture columns with multiple unique values
    for col in col_list:  # cycle through columns
        if len((df[df['case_'] == case][col]).unique()) > 1:
            duplicates_cols.append(col)
    duplicates_breakdown.append(f'{case}, {duplicates_cols}')

duplicates_breakdown

["JG171194, ['date_of_occurrence']",
 "JG191739, ['date_of_occurrence']",
 "JG214225, ['date_of_occurrence']",
 "JG225210, ['date_of_occurrence']",
 "JG256987, ['date_of_occurrence']",
 "JG282670, ['date_of_occurrence']",
 "JG304885, ['date_of_occurrence']",
 "JG306277, ['date_of_occurrence']",
 "JG318738, ['date_of_occurrence']",
 "JG325333, ['date_of_occurrence']",
 "JG446325, ['date_of_occurrence']",
 "JG453003, ['date_of_occurrence']",
 "JG456963, ['date_of_occurrence']",
 "JG486600, ['date_of_occurrence']",
 "JG490649, ['date_of_occurrence']",
 "JG490649, ['date_of_occurrence']",
 "JG499426, ['date_of_occurrence']",
 "JG499426, ['date_of_occurrence']",
 "JG545814, ['date_of_occurrence']",
 "JH100028, ['date_of_occurrence']"]

In [31]:
len(duplicates_breakdown)

20

**Answering questions from above:**

1. It looks like the duplicated records for each case differ only in the `date_of_occurrence` column
2. Based on above, we can drop duplicates based on 'case_' column
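
Since the duplicates differ only by date, dropping on 'case_' alone keeps whichever record happens to come first. A minimal sketch of making that choice explicit is below. Keeping the latest `date_of_occurrence` is an assumption (the notes do not say which record should win), and the case numbers and dates in `sample` are made-up illustrative values, not rows from the dataset.

```python
import pandas as pd

# Made-up rows: case X001 appears twice with different dates.
sample = pd.DataFrame(
    {
        "case_": ["X001", "X001", "X002"],
        "date_of_occurrence": [
            "2023-11-05T10:00:00.000",
            "2023-11-06T09:30:00.000",
            "2024-01-01T00:10:00.000",
        ],
    }
)

# ISO timestamps sort correctly as strings, so sort first and keep the last
# (latest) record per case. Assumption: the latest record is the one to keep.
deduped = (
    sample.sort_values("date_of_occurrence")
    .drop_duplicates(subset="case_", keep="last")
    .reset_index(drop=True)
)
print(deduped)
```

Switching to `keep="first"` would retain the earliest record instead.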

## Using Requests to extract

Notes:

- For pagination, **order by `:id`** (since there are multiple crime cases with the same date_of_occurrence), and use the `offset` and `limit` param in tandem. Increase offset by the value of the limit to grab the next page. Ex:

  ```md
  limit = 10
  
  pg 1:
  offset = 0

  pg 2:
  offset = 0 + 10 = 10

  pg 3:
  offset = 10 + 10 = 20
  ```

- We can use a while loop that queries the API as long as the number of records returned (`len(data)`) equals the limit value, increasing the offset by the limit value on each pass to get the next page, until we reach the end of the dataset
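
A rough sketch of that while loop is below. It is not the final extraction logic: `fetch_all_pages` and `socrata_page` are hypothetical names, and the fetcher is passed in as a callable so the loop can be tried without hitting the API.

```python
def fetch_all_pages(fetch_page, limit=1000):
    """Collect records page by page until a page comes back shorter than limit."""
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        records.extend(page)
        if len(page) < limit:  # short (or empty) page: end of the dataset
            break
        offset += limit  # move to the next page
    return records


def socrata_page(limit, offset, app_token=None, where=None):
    """Hypothetical fetcher: one page of the crime dataset, ordered by :id."""
    import requests  # imported here so fetch_all_pages can be tried offline

    params = {"$order": ":id", "$limit": limit, "$offset": offset}
    if where:
        params["$where"] = where
    if app_token:
        params["$$app_token"] = app_token
    response = requests.get(
        "https://data.cityofchicago.org/resource/x2n5-8w5q.json",
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

One way to wire the two together would be `fetch_all_pages(functools.partial(socrata_page, app_token=APP_TOKEN, where="date_of_occurrence between '...' and '...'"))`. If the dataset size is an exact multiple of the limit, the loop makes one extra request that returns an empty page and then stops.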

Testing Pagination below:

In [None]:
# df1 (pg1)

import requests

start_date = '2023-11-06T00:00:00.000'

end_date = '2023-11-19T23:59:59.999'

soql_date = f"$where=date_of_occurrence between '{start_date}' and '{end_date}'"

limit = 1000

offset = 0

response = requests.get(f"https://data.cityofchicago.org/resource/x2n5-8w5q.json?"
                        f"$$app_token={APP_TOKEN}&"
                        f"$order=:id"  #a date of occurrence can have multiple cases, so better to use :id since that will lock the sequence of records.
                        f"&{soql_date}"
                        f"&$limit={limit}"
                        f"&$offset={offset}")

# print the message
data1 = response.json()
print(data1)
assert response.status_code == 200

In [None]:
df1 = pd.json_normalize(data1)

print(len(df1))
df1

In [None]:
print(df1['case_'])

In [None]:
# df2 (pg2)

offset = 0+1000  #pg2

response = requests.get(f"https://data.cityofchicago.org/resource/x2n5-8w5q.json?"
                        f"$$app_token={APP_TOKEN}&"
                        f"$order=:id"
                        f"&{soql_date}"
                        f"&$limit={limit}"
                        f"&$offset={offset}")

# print the message
data2 = response.json()
print(data2)
assert response.status_code == 200

In [None]:
df2 = pd.json_normalize(data2)

print(len(df2))
df2

In [None]:
print(df2['case_'])

In [None]:
# df3 (pg1 + pg2)

full_limit = 1000 + 1000

response = requests.get(f"https://data.cityofchicago.org/resource/x2n5-8w5q.json?"
                        f"$$app_token={APP_TOKEN}&"
                        f"$order=:id"
                        f"&{soql_date}"
                        f"&$limit={full_limit}"
                        )

# print the message
data3 = response.json()
print(data3)
assert response.status_code == 200

In [None]:
df3 = pd.json_normalize(data3)

print(len(df3))
df3

In [None]:
print(df3['case_'])

Testing querying api for MINIMUM date_of_occurrence from dataset - **Success**

In [None]:
# min date_of_occurrence in dataset

response = requests.get(f"https://data.cityofchicago.org/resource/x2n5-8w5q.json?"
                        f"$$app_token={APP_TOKEN}"
                        f"&$select=min(date_of_occurrence)"
                        )

# print the message (stored separately so the pg1 + pg2 data in data3 is not overwritten)
min_date_data = response.json()
print(min_date_data)
assert response.status_code == 200
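
To feed the minimum date back into the windowing logic, the response would need to be turned into a `datetime`. The sketch below assumes the API returns a one-element list of objects, and it names the aggregate explicitly with a SoQL `as` alias so the key is predictable. The alias `min_date` and the helper `parse_min_date` are names chosen here for illustration.

```python
from datetime import datetime

# Assumed query parameters: "min_date" is an alias picked for this sketch.
MIN_DATE_PARAMS = {"$select": "min(date_of_occurrence) as min_date"}


def parse_min_date(payload, key="min_date"):
    """Return the minimum date as a datetime, or None if it is missing."""
    if not payload or key not in payload[0]:
        return None
    return datetime.fromisoformat(payload[0][key])
```

With `requests`, the dictionary could be passed as `params=MIN_DATE_PARAMS` and the result of `response.json()` handed to `parse_min_date`.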