*This notebook will create a file with the following transformations/filters:*

- Filtered to include only the following columns: ('SR_NUMBER', 'SR_TYPE', 'OWNER_DEPARTMENT',
  'STATUS', 'CREATED_DATE', 'CLOSED_DATE', 'DUPLICATE', 'PARENT_SR_NUMBER', 'COMMUNITY_AREA',
  'WARD', 'CREATED_HOUR', 'CREATED_DAY_OF_WEEK', 'CREATED_MONTH')
- 311 Information-Only calls removed
- Legacy records removed
- Removed rows with no Ward or Community Area
- Added a 'time_to_close_sec' column holding the time difference in seconds between record
  creation and closure

**Transformations/filters remaining:**

- Add 'number of children' column denoting how many duplicates a request has
- Filter out duplicates once first bullet point completed (this code is already in the notebook, we'll just need to move it and uncomment it)
- Create dummy columns for request type, department, community area, ward, and hour/day/month of request creation
- Decide what our 'time-to-close' threshold should be and apply that filter. Should we filter out all requests closed in under X minutes? Or should we filter out a request type entirely if more than Y% of its requests are closed in under X minutes?

**Notes/resources**:

- pandas to_pickle function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html
- pandas read_pickle function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html#pandas.read_pickle
- I downloaded the Chicago 311 CSV file locally from the data portal: https://data.cityofchicago.org/Service-Requests/311-Service-Requests/v6vf-nfxy. It's ~1.4 GB. I keep it in a folder called 'raw_data' in my local repo (I didn't push it because the file is too big). You'll need to do the same on your end: download the file and put it in a 'raw_data' folder in your local repo. Let me know if you have any trouble with this (e.g., you don't have space to save a 1.4 GB file or it takes too long to download).
- I don't think we should push another pickle file to GitHub until the final version, but open to what you think. I think we can add the final filters/transformations to this notebook.
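
Since the raw file is large, one option is to load only the columns we keep and to cache the result as a pickle. The sketch below is illustrative only: it writes a tiny made-up CSV to a temporary folder (the file names `toy_311.csv` and `toy_311.pkl` and the rows are hypothetical), reads it back with `usecols`, and round-trips it through `to_pickle` / `read_pickle`.

```python
import os
import tempfile

import pandas as pd

# Columns to keep (a short subset of the notebook's list, for brevity).
KEEP_COLS = ['SR_NUMBER', 'SR_TYPE', 'WARD']

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for the real 311 CSV.
    csv_path = os.path.join(tmp_dir, 'toy_311.csv')
    pd.DataFrame({
        'SR_NUMBER': ['SR19-0001', 'SR19-0002'],
        'SR_TYPE': ['Pothole in Street Complaint', 'Weed Removal Request'],
        'CITY': ['Chicago', 'Chicago'],
        'WARD': [29.0, 1.0],
    }).to_csv(csv_path, index=False)

    # usecols skips the unneeded columns while parsing, which saves memory.
    toy = pd.read_csv(csv_path, usecols=KEEP_COLS)

    # Round-trip through a pickle file, as the last cell of the notebook would.
    pkl_path = os.path.join(tmp_dir, 'toy_311.pkl')
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

print(restored.equals(toy))
```

With the real file, the same `usecols` argument could be passed to the `read_csv` call below so the column filter happens at load time.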

In [1]:
%load_ext autoreload

In [2]:
import pandas as pd
import datetime as dt

Read in downloaded CSV file:

In [84]:
chi_311 = pd.read_csv('../raw_data/chicago_311_requests.csv', nrows=5000)

In [85]:
chi_311.shape

(5000, 37)

In [86]:
chi_311.columns

Index(['SR_NUMBER', 'SR_TYPE', 'SR_SHORT_CODE', 'OWNER_DEPARTMENT', 'STATUS',
       'CREATED_DATE', 'LAST_MODIFIED_DATE', 'CLOSED_DATE', 'STREET_ADDRESS',
       'CITY', 'STATE', 'ZIP_CODE', 'STREET_NUMBER', 'STREET_DIRECTION',
       'STREET_NAME', 'STREET_TYPE', 'DUPLICATE', 'LEGACY_RECORD',
       'LEGACY_SR_NUMBER', 'PARENT_SR_NUMBER', 'COMMUNITY_AREA', 'WARD',
       'ELECTRICAL_DISTRICT', 'ELECTRICITY_GRID', 'POLICE_SECTOR',
       'POLICE_DISTRICT', 'POLICE_BEAT', 'PRECINCT',
       'SANITATION_DIVISION_DAYS', 'CREATED_HOUR', 'CREATED_DAY_OF_WEEK',
       'CREATED_MONTH', 'X_COORDINATE', 'Y_COORDINATE', 'LATITUDE',
       'LONGITUDE', 'LOCATION'],
      dtype='object')

Filter dataframe to exclude info-only calls:

In [87]:
chi_311_filtered = chi_311[chi_311['SR_TYPE'] != '311 INFORMATION ONLY CALL']

In [71]:
chi_311_filtered.shape

(4870, 37)

Filter for columns needed:

In [72]:
chi_311_filtered.columns

Index(['SR_NUMBER', 'SR_TYPE', 'SR_SHORT_CODE', 'OWNER_DEPARTMENT', 'STATUS',
       'CREATED_DATE', 'LAST_MODIFIED_DATE', 'CLOSED_DATE', 'STREET_ADDRESS',
       'CITY', 'STATE', 'ZIP_CODE', 'STREET_NUMBER', 'STREET_DIRECTION',
       'STREET_NAME', 'STREET_TYPE', 'DUPLICATE', 'LEGACY_RECORD',
       'LEGACY_SR_NUMBER', 'PARENT_SR_NUMBER', 'COMMUNITY_AREA', 'WARD',
       'ELECTRICAL_DISTRICT', 'ELECTRICITY_GRID', 'POLICE_SECTOR',
       'POLICE_DISTRICT', 'POLICE_BEAT', 'PRECINCT',
       'SANITATION_DIVISION_DAYS', 'CREATED_HOUR', 'CREATED_DAY_OF_WEEK',
       'CREATED_MONTH', 'X_COORDINATE', 'Y_COORDINATE', 'LATITUDE',
       'LONGITUDE', 'LOCATION'],
      dtype='object')

In [73]:
chi_311_filtered = chi_311_filtered[['SR_NUMBER', 'SR_TYPE', 'OWNER_DEPARTMENT',
                                     'STATUS', 'CREATED_DATE', 'CLOSED_DATE', 'DUPLICATE',
                                     'LEGACY_RECORD', 'LEGACY_SR_NUMBER', 'PARENT_SR_NUMBER',
                                     'COMMUNITY_AREA', 'WARD', 'CREATED_HOUR', 'CREATED_DAY_OF_WEEK',
                                     'CREATED_MONTH']]

In [74]:
chi_311_filtered.shape

(4870, 15)

Filter out legacy records:

In [75]:
legacy = chi_311_filtered[chi_311_filtered['LEGACY_RECORD'] == True]

In [76]:
max(legacy['CREATED_DATE'])

'12/12/2018 03:25:54 PM'

In [77]:
legacy.shape

(4124, 15)

In [17]:
chi_311_filtered = chi_311_filtered[chi_311_filtered['LEGACY_RECORD'] == False]
chi_311_filtered = chi_311_filtered.drop(columns=['LEGACY_RECORD', 'LEGACY_SR_NUMBER'])

In [18]:
chi_311_filtered.shape

(746, 13)

Filter out rows missing a community area or ward:

In [19]:
chi_311_filtered[chi_311_filtered['COMMUNITY_AREA'].isna()].shape

(510, 13)

In [20]:
chi_311_filtered[chi_311_filtered['WARD'].isna()].shape

(510, 13)

In [21]:
chi_311_filtered = chi_311_filtered[chi_311_filtered['COMMUNITY_AREA'].notna() &
                                    chi_311_filtered['WARD'].notna()]

In [22]:
chi_311_filtered.shape

(236, 13)
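
As a possible shorthand for the boolean mask above, `dropna` with a `subset` keeps only rows where both columns are present. This is a sketch on made-up rows, not a change to the notebook's logic:

```python
import numpy as np
import pandas as pd

# Toy rows: only 'A' has both a community area and a ward.
toy = pd.DataFrame({
    'SR_NUMBER': ['A', 'B', 'C', 'D'],
    'COMMUNITY_AREA': [25.0, np.nan, 3.0, np.nan],
    'WARD': [29.0, 1.0, np.nan, np.nan],
})

# The mask used in the notebook: keep rows where both values are present.
mask_version = toy[toy['COMMUNITY_AREA'].notna() & toy['WARD'].notna()]

# Equivalent: drop rows where either column is missing.
dropna_version = toy.dropna(subset=['COMMUNITY_AREA', 'WARD'])

print(dropna_version['SR_NUMBER'].tolist())
```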

(COMMENTED OUT THIS CODE) Check and remove duplicates:

**Do we want to add a column with the total number of children that a request has?**

In [28]:
parent_groups = pd.DataFrame(chi_311_filtered['PARENT_SR_NUMBER'].value_counts())

In [33]:
chi_311_filtered['NUM_CHILDREN'] = 0

In [54]:
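# Copy each parent's duplicate count onto the parent's own row.
# .iloc[0] reads the single count column regardless of how pandas names it.
for parent_sr in parent_groups.index:
    chi_311_filtered.loc[chi_311_filtered['SR_NUMBER'] == parent_sr,
                         'NUM_CHILDREN'] = parent_groups.loc[parent_sr].iloc[0]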

In [55]:
chi_311_filtered['NUM_CHILDREN'].value_counts()

0    233
1      3
Name: NUM_CHILDREN, dtype: int64

In [65]:
len(chi_311_filtered[chi_311_filtered['NUM_CHILDREN'] != 0])

3
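
The loop above could probably be replaced by a single `map`, which should be much faster on the full dataset. A minimal sketch on made-up SR numbers (`P1`, `C1`, etc. are hypothetical), assuming a child is any row whose `PARENT_SR_NUMBER` names another request:

```python
import pandas as pd

# Toy data: P1 has two children, P2 has one, X1 has none.
toy = pd.DataFrame({
    'SR_NUMBER': ['P1', 'C1', 'C2', 'P2', 'C3', 'X1'],
    'PARENT_SR_NUMBER': [None, 'P1', 'P1', None, 'P2', None],
})

# Count how many rows name each SR as their parent (missing parents are skipped).
child_counts = toy['PARENT_SR_NUMBER'].value_counts()

# Look up each request's own SR_NUMBER in those counts; no match means 0 children.
toy['NUM_CHILDREN'] = toy['SR_NUMBER'].map(child_counts).fillna(0).astype(int)

print(toy['NUM_CHILDREN'].tolist())
```

Either way, this needs to run before the duplicates are filtered out, and children whose parent was already removed by an earlier filter will not be counted anywhere.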

In [23]:
dupes = chi_311_filtered[chi_311_filtered['DUPLICATE'] == True]

In [24]:
dupes[dupes['PARENT_SR_NUMBER'] == 'SR19-02206459']

Unnamed: 0,SR_NUMBER,SR_TYPE,OWNER_DEPARTMENT,STATUS,CREATED_DATE,CLOSED_DATE,DUPLICATE,PARENT_SR_NUMBER,COMMUNITY_AREA,WARD,CREATED_HOUR,CREATED_DAY_OF_WEEK,CREATED_MONTH
412,SR19-02557557,Pothole in Street Complaint,CDOT - Department of Transportation,Completed,09/23/2019 06:13:16 PM,12/20/2019 03:19:51 PM,True,SR19-02206459,25.0,29.0,18,2,9


In [None]:
# want intermediate dataframe with all duplicates and parents
# column with all parent sr numbers, return unique list of all parents


In [66]:
chi_311_filtered = chi_311_filtered[chi_311_filtered['DUPLICATE'] == False]
chi_311_filtered = chi_311_filtered.drop(columns=['DUPLICATE', 'PARENT_SR_NUMBER'])

In [67]:
chi_311_filtered.shape

(212, 12)

Make dummy columns:

In [88]:
chi_311_filtered.columns

Index(['SR_NUMBER', 'SR_TYPE', 'SR_SHORT_CODE', 'OWNER_DEPARTMENT', 'STATUS',
       'CREATED_DATE', 'LAST_MODIFIED_DATE', 'CLOSED_DATE', 'STREET_ADDRESS',
       'CITY', 'STATE', 'ZIP_CODE', 'STREET_NUMBER', 'STREET_DIRECTION',
       'STREET_NAME', 'STREET_TYPE', 'DUPLICATE', 'LEGACY_RECORD',
       'LEGACY_SR_NUMBER', 'PARENT_SR_NUMBER', 'COMMUNITY_AREA', 'WARD',
       'ELECTRICAL_DISTRICT', 'ELECTRICITY_GRID', 'POLICE_SECTOR',
       'POLICE_DISTRICT', 'POLICE_BEAT', 'PRECINCT',
       'SANITATION_DIVISION_DAYS', 'CREATED_HOUR', 'CREATED_DAY_OF_WEEK',
       'CREATED_MONTH', 'X_COORDINATE', 'Y_COORDINATE', 'LATITUDE',
       'LONGITUDE', 'LOCATION'],
      dtype='object')

In [90]:
chi_311_filtered = pd.get_dummies(chi_311_filtered, columns=['COMMUNITY_AREA', 'WARD', 'CREATED_HOUR',
                                          'CREATED_DAY_OF_WEEK', 'CREATED_MONTH', 'SR_TYPE',
                                         'OWNER_DEPARTMENT'])
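
For reference, here is a small sketch of what `get_dummies` does to the columns it is given, using made-up rows. Each listed column is replaced by one indicator column per distinct value:

```python
import pandas as pd

# Toy rows with two categorical-style columns.
toy = pd.DataFrame({
    'SR_NUMBER': ['A', 'B', 'C'],
    'WARD': [29.0, 1.0, 29.0],
    'CREATED_HOUR': [18, 7, 18],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=['WARD', 'CREATED_HOUR'])

print(encoded.columns.tolist())
print(encoded.shape)
```

Because the original columns are dropped, any step that still needs `SR_TYPE` as a single column (such as the `value_counts` checks in the time-to-close section) has to run before this encoding.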

Filter out complaints resolved in a very short period of time.
NEED TO DECIDE WHICH COMPLAINTS TO FILTER OUT: do we want to filter out request types with more than a certain % fulfilled in less than X minutes?

In [27]:
chi_311_filtered['CREATED_DATE'] = pd.to_datetime(chi_311_filtered['CREATED_DATE'],
                                                  format='%m/%d/%Y %I:%M:%S %p')
chi_311_filtered['CLOSED_DATE'] = pd.to_datetime(chi_311_filtered['CLOSED_DATE'],
                                                format='%m/%d/%Y %I:%M:%S %p')

In [28]:
chi_311_filtered['time_to_close'] = (chi_311_filtered['CLOSED_DATE']
                                     - chi_311_filtered['CREATED_DATE'])

In [29]:
chi_311_filtered['time_to_close_sec'] = chi_311_filtered['time_to_close'].dt.total_seconds()

In [46]:
chi_311_filtered = chi_311_filtered.drop(columns=['time_to_close'])
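
To sanity-check the date parsing and the seconds calculation, here is a small sketch on made-up timestamps in the same format as the CSV. The second row has no closed date, standing in for a request that is still open:

```python
import pandas as pd

DATE_FORMAT = '%m/%d/%Y %I:%M:%S %p'

# Toy rows: one closed after 90 seconds, one with no closed date.
toy = pd.DataFrame({
    'CREATED_DATE': ['01/01/2020 11:59:00 AM', '01/01/2020 11:00:00 PM'],
    'CLOSED_DATE': ['01/01/2020 12:00:30 PM', None],
})

# Parse both date columns with the explicit 12-hour format.
for col in ['CREATED_DATE', 'CLOSED_DATE']:
    toy[col] = pd.to_datetime(toy[col], format=DATE_FORMAT)

# Subtracting datetimes gives a timedelta; convert it to seconds.
toy['time_to_close_sec'] = (toy['CLOSED_DATE'] - toy['CREATED_DATE']).dt.total_seconds()

print(toy['time_to_close_sec'].tolist())
```

Rows without a closed date end up with a missing `time_to_close_sec`. Comparisons against a missing value are False, so a filter such as `time_to_close_sec >= 60` would also drop open requests.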

In [30]:
# Requests resolved in 0 seconds
chi_311_filtered[chi_311_filtered['time_to_close_sec'] == 0].shape

(402913, 15)

In [31]:
# Requests resolved in less than 1 minute
chi_311_filtered[chi_311_filtered['time_to_close_sec'] < 60].shape

(770925, 15)

In [32]:
# Requests resolved in less than 10 minutes
chi_311_filtered[chi_311_filtered['time_to_close_sec'] < 600].shape

(846269, 15)

In [33]:
# Requests resolved in less than 1 day
chi_311_filtered[chi_311_filtered['time_to_close_sec'] < 86400].shape

(1097323, 15)

In [34]:
# Types of requests resolved in less than 10 minutes
chi_311_filtered[chi_311_filtered['time_to_close_sec'] < 600]['SR_TYPE'].value_counts()

Aircraft Noise Complaint                     584578
Weed Removal Request                          96013
Graffiti Removal Request                      81066
Sign Repair Request - All Other Signs         40096
Pothole in Street Complaint                    8264
                                              ...  
Home Buyer Program Info Request                   1
Clean and Green Program Request                   1
Protected Bike Lane - Debris Removal              1
Pavement Cave-In Inspection Request               1
Bungalow/Vintage Home Information Request         1
Name: SR_TYPE, Length: 78, dtype: int64

In [35]:
# Filter out requests resolved in less than a minute
# chi_311_filtered = chi_311_filtered[chi_311_filtered['time_to_close_sec'] >= 60]

In [36]:
chi_311_filtered[chi_311_filtered['time_to_close_sec'] < 600]['SR_TYPE'].value_counts()

Aircraft Noise Complaint                     584578
Weed Removal Request                          96013
Graffiti Removal Request                      81066
Sign Repair Request - All Other Signs         40096
Pothole in Street Complaint                    8264
                                              ...  
Home Buyer Program Info Request                   1
Clean and Green Program Request                   1
Protected Bike Lane - Debris Removal              1
Pavement Cave-In Inspection Request               1
Bungalow/Vintage Home Information Request         1
Name: SR_TYPE, Length: 78, dtype: int64
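
To help with the open question of whether to drop whole request types, one possible approach is to compute, per type, the share of requests closed under the threshold. The sketch below uses made-up rows, and both `THRESHOLD_SEC` and `MAX_FAST_SHARE` are placeholder values, not decisions:

```python
import pandas as pd

THRESHOLD_SEC = 600    # hypothetical cutoff: 10 minutes
MAX_FAST_SHARE = 0.5   # hypothetical: drop types where over 50% close that fast

# Toy data: 3 of 4 noise complaints close quickly, 1 of 4 pothole requests does.
toy = pd.DataFrame({
    'SR_TYPE': ['Aircraft Noise Complaint'] * 4 + ['Pothole in Street Complaint'] * 4,
    'time_to_close_sec': [0, 5, 30, 7200, 400, 90000, 86400, 172800],
})

# Flag fast closures, then take the mean of the flag within each request type.
is_fast = toy['time_to_close_sec'] < THRESHOLD_SEC
fast_share = is_fast.groupby(toy['SR_TYPE']).mean()

# Drop every row belonging to a type whose fast share exceeds the limit.
types_to_drop = fast_share[fast_share > MAX_FAST_SHARE].index
kept = toy[~toy['SR_TYPE'].isin(types_to_drop)]

print(fast_share)
print(kept['SR_TYPE'].unique().tolist())
```

Looking at `fast_share` for the real data before picking the two values should show whether types like Aircraft Noise Complaint separate cleanly from the rest.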

Pickle filtered file:

In [47]:
# chi_311_filtered.to_pickle("../pickle_files/chi_311.pkl")