# Project 3 – NYC 311 calls

NYC 311 is a service that gives New York residents access to non-emergency City services and information about City government programs. Each year, the service receives millions of requests reporting problems with city services and other issues.

The data on the types of calls received and their ultimate resolution is made available through the NYC Open Data portal at https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9. The data is updated daily, and the same page provides the data dictionary.

To ensure that we are all using the same data and arrive at the same results, the data has been downloaded and includes information up to 2023-08-04 12:00:00.  Several columns not required for this project have been removed from the original data.  (As an additional exercise to showcase your skills, you should feel free to download the entire dataset from the URL above for investigation.)

Please use the file Project-3_NYC_311_Calls.pkl for this analysis. It is available as a pickle file on Jupyterhub and has information on over 32 million calls to 311. You should be able to read the file into a dataframe using pd.read_pickle('Project-3_NYC_311_Calls.pkl').

You may choose to move the ‘Created Date’ column to the dataframe’s index, for example using the code below.

# Make the index a proper DatetimeIndex, then delete the Created Date column
df = df.set_index(pd.DatetimeIndex(df['Created Date']))
del df['Created Date']

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_pickle('shared/Project-3_NYC_311_Calls.pkl')


In [3]:
# Make the index a proper DatetimeIndex, then delete the Created Date column
df = df.set_index(pd.DatetimeIndex(df['Created Date']))
del df['Created Date']
df

Created Date,Unique Key,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,City,Resolution Description,Borough,Open Data Channel Type
2011-04-06 00:00:00,20184537,HPD,Department of Housing Preservation and Develop...,HEATING,HEAT,RESIDENTIAL BUILDING,10002.0,NEW YORK,More than one complaint was received for this ...,MANHATTAN,UNKNOWN
2011-04-06 00:00:00,20184538,HPD,Department of Housing Preservation and Develop...,GENERAL CONSTRUCTION,WINDOWS,RESIDENTIAL BUILDING,11236.0,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,UNKNOWN
2011-04-06 00:00:00,20184539,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,WALLS,RESIDENTIAL BUILDING,10460.0,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
2022-07-08 11:14:43,54732265,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10467.0,BRONX,The Department of Sanitation investigated this...,BRONX,PHONE
2011-04-06 00:00:00,20184540,HPD,Department of Housing Preservation and Develop...,NONCONST,VERMIN,RESIDENTIAL BUILDING,10460.0,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
...,...,...,...,...,...,...,...,...,...,...,...
2011-04-06 00:00:00,20184532,HPD,Department of Housing Preservation and Develop...,HEATING,HEAT,RESIDENTIAL BUILDING,10468,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
2011-04-06 00:00:00,20184533,HPD,Department of Housing Preservation and Develop...,HEATING,HEAT,RESIDENTIAL BUILDING,10018,NEW YORK,More than one complaint was received for this ...,MANHATTAN,UNKNOWN
2011-04-06 00:00:00,20184534,HPD,Department of Housing Preservation and Develop...,GENERAL CONSTRUCTION,STAIRS,RESIDENTIAL BUILDING,10460,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
2011-04-06 00:00:00,20184535,HPD,Department of Housing Preservation and Develop...,GENERAL CONSTRUCTION,GAS,RESIDENTIAL BUILDING,11236,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,UNKNOWN


## Data Exploration

Spend some time querying the data and familiarizing yourself with it using the EDA techniques we learned in class.  Show your EDA in the Jupyter notebook you will upload to Github.  Look at the types of complaints, read some descriptions, and find the earliest and latest dates.  Read the data dictionary to understand what the columns mean.  Be aware that many columns have NaN values, which may not be counted in your analysis.
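To make the NaN caveat concrete, here is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the 311 file). It shows that `count()` skips missing values, `size()` counts rows, and `groupby` drops missing keys unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 311 data; the values are made up for illustration.
toy = pd.DataFrame({
    "Borough": ["BRONX", "BRONX", "QUEENS", None],
    "Incident Zip": ["10467", np.nan, "11691", "10002"],
})

# Rows per borough; the row with a missing Borough is silently dropped.
rows_per_borough = toy.groupby("Borough").size()

# Non-null zip codes per borough; the NaN zip is not counted.
zips_per_borough = toy.groupby("Borough")["Incident Zip"].count()

# Keep the missing-Borough group so that every row is accounted for.
rows_incl_missing = toy.groupby("Borough", dropna=False).size()

print(rows_per_borough)
print(zips_per_borough)
print(rows_incl_missing)
```

Comparing the three results is a quick way to check whether a total you report covers every row or only the rows with a known value.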

In [4]:
# Basic Information about the Dataset
print("Dataset Info:")
print(df.info())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 33780977 entries, 2011-04-06 00:00:00 to 2011-04-06 00:00:00
Data columns (total 11 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   Unique Key              int64 
 1   Agency                  object
 2   Agency Name             object
 3   Complaint Type          object
 4   Descriptor              object
 5   Location Type           object
 6   Incident Zip            object
 7   City                    object
 8   Resolution Description  object
 9   Borough                 object
 10  Open Data Channel Type  object
dtypes: int64(1), object(10)
memory usage: 3.0+ GB
None


In [5]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

Missing Values:
Unique Key                      0
Agency                          0
Agency Name                     0
Complaint Type                  0
Descriptor                 586677
Location Type             7140574
Incident Zip              1507958
City                      1981664
Resolution Description    1254890
Borough                     47074
Open Data Channel Type          0
dtype: int64


In [6]:
df_unique_key = df['Unique Key'].resample(rule='D').count()
df_unique_key

Created Date
2010-01-01    2942
2010-01-02    3958
2010-01-03    5676
2010-01-04    9763
2010-01-05    8735
              ... 
2023-07-31    9921
2023-08-01    9813
2023-08-02    9245
2023-08-03    9128
2023-08-04     384
Freq: D, Name: Unique Key, Length: 4964, dtype: int64
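The daily count above relies on `resample`, which bins rows by calendar day even when the index is not sorted. The sketch below uses made-up timestamps to show the behaviour on a small scale, including the fact that a day with no rows still appears with a count of 0.

```python
import pandas as pd

# Made-up timestamps standing in for 'Created Date'; deliberately unsorted.
stamps = pd.to_datetime([
    "2023-01-03 09:15", "2023-01-01 23:59", "2023-01-03 18:40",
    "2023-01-01 00:00", "2023-01-03 07:05",
])
toy = pd.DataFrame({"Unique Key": range(len(stamps))}, index=stamps)

# One bin per calendar day; 2023-01-02 has no rows and gets a count of 0.
daily = toy["Unique Key"].resample("D").count()
print(daily)

# idxmax returns the index label (the date) of the largest count.
busiest_day = daily.idxmax()
print(busiest_day, daily.max())
```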

In [7]:
# On which single date were the maximum number of calls received?
print('Date with the maximum number of calls:', df_unique_key.idxmax())
print('Maximum number of calls:', df_unique_key.max())

Date with the maximum number of calls: 2020-08-04 00:00:00
Maximum number of calls: 24415


In [8]:
# On the date the maximum number of calls were received, what was the most common complaint type?
max_calls_date_data = df.loc['2020-08-04']
max_calls_date_data

Created Date,Unique Key,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,City,Resolution Description,Borough,Open Data Channel Type
2020-08-04 20:26:03,47103795,DPR,Department of Parks and Recreation,Damaged Tree,Entire Tree Has Fallen Down,Street,,,The Department of Parks and Recreation visited...,Unspecified,UNKNOWN
2020-08-04 20:02:44,47107936,DPR,Department of Parks and Recreation,Damaged Tree,Entire Tree Has Fallen Down,Street,,,The Department of Parks and Recreation has ins...,Unspecified,UNKNOWN
2020-08-04 15:35:00,47098354,DOT,Department of Transportation,Traffic Signal Condition,Vehicle Signal,,,,Service Request status for this request is ava...,,UNKNOWN
2020-08-04 11:52:00,47105987,DOT,Department of Transportation,Street Light Condition,Street Light Out,,,,Service Request status for this request is ava...,QUEENS,UNKNOWN
2020-08-04 12:00:00,47118929,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,,,The Department of Sanitation Investigated and ...,,PHONE
...,...,...,...,...,...,...,...,...,...,...,...
2020-08-04 18:22:59,47107565,DPR,Department of Parks and Recreation,Damaged Tree,Branch or Limb Has Fallen Down,Street,,,The Department of Parks and Recreation perform...,Unspecified,UNKNOWN
2020-08-04 13:05:56,47145515,DOT,Department of Transportation,Street Condition,Pothole,,10467,BRONX,The Department of Transportation inspected thi...,BRONX,UNKNOWN
2020-08-04 12:06:58,47089692,DPR,Department of Parks and Recreation,Damaged Tree,Branch or Limb Has Fallen Down,Street,,,The Department of Parks and Recreation perform...,Unspecified,UNKNOWN
2020-08-04 15:10:35,47096612,DPR,Department of Parks and Recreation,Damaged Tree,Entire Tree Has Fallen Down,Street,,,The Department of Parks and Recreation visited...,Unspecified,UNKNOWN


In [9]:
# Count the occurrences of each complaint type on that date
complaint_type_counts = max_calls_date_data['Complaint Type'].value_counts()
complaint_type_counts

Complaint Type
Damaged Tree                           14863
Noise - Residential                      982
Request Large Bulky Item Collection      909
Street Light Condition                   617
Overgrown Tree/Branches                  609
                                       ...  
Bus Stop Shelter Placement                 1
Unsanitary Pigeon Condition                1
Public Payphone Complaint                  1
For Hire Vehicle Report                    1
Bridge Condition                           1
Name: count, Length: 125, dtype: int64
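In the output above, Damaged Tree accounts for 14863 of the 24415 calls that day, or roughly 61%. To pull the top label and its share out programmatically rather than reading them off the table, something like the following could be used. The labels here are hypothetical stand-ins for one day's complaints.

```python
import pandas as pd

# Hypothetical complaint labels for a single day.
complaints = pd.Series(
    ["Damaged Tree", "Noise - Residential", "Damaged Tree", "Street Light Condition"],
    name="Complaint Type",
)

counts = complaints.value_counts()
top_type = counts.idxmax()                # label with the highest count
top_share = counts.max() / counts.sum()   # its share of the day's calls
print(top_type, round(top_share, 2))
```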

In [10]:
# Quietest month: Group the data by months, and identify the month that historically has the fewest number of calls.
df_quiet_month = df.groupby(df.index.month).size()
df_quiet_month

Created Date
1     2994507
2     2621845
3     2868044
4     2672760
5     2956598
6     3093442
7     3111026
8     2779862
9     2684271
10    2766887
11    2634749
12    2596986
dtype: int64

In [11]:
print('Quietest month:', df_quiet_month.idxmin())

Quietest month: 12
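One caveat on the totals above: the extract runs from 2010-01-01 to 2023-08-04, so January through July are pooled over 14 years while September through December are pooled over only 13. A possible cross-check is to average the per-month totals across years after dropping the partial final month. The sketch below uses a synthetic series of constant daily counts (made-up numbers) to show that the two approaches can disagree.

```python
import pandas as pd

# Synthetic daily call counts (made-up), ending in early August like the real extract.
days = pd.date_range("2021-01-01", "2023-08-04", freq="D")
calls = pd.Series(100, index=days)

# Raw totals per calendar month, pooled over all years (the approach used above).
totals_by_month = calls.groupby(calls.index.month).sum()

# Alternative: one total per year-month, drop the partial last month, then average.
monthly = calls.resample("MS").sum()   # 'MS' labels each bin by month start
monthly = monthly.iloc[:-1]            # August 2023 is incomplete
avg_by_month = monthly.groupby(monthly.index.month).mean()

print(totals_by_month.idxmin(), avg_by_month.idxmin())
```

With identical traffic every day, the pooled totals pick September (fewer years of data), while the per-year average picks February (fewest days), so it is worth checking both before naming the quietest month.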


In [12]:
df_2022 = df[(df.index > pd.to_datetime('1/1/2022')) & (df.index < pd.to_datetime('1/1/2023'))]
df_2022

Created Date,Unique Key,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,City,Resolution Description,Borough,Open Data Channel Type
2022-07-08 11:14:43,54732265,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10467.0,BRONX,The Department of Sanitation investigated this...,BRONX,PHONE
2022-07-08 12:07:45,54732266,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10009.0,NEW YORK,The Department of Sanitation investigated this...,MANHATTAN,MOBILE
2022-07-08 06:06:47,54732267,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,11204.0,BROOKLYN,The Department of Sanitation investigated this...,BROOKLYN,MOBILE
2022-07-08 14:12:01,54732268,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10455.0,BRONX,The Department of Sanitation investigated this...,BRONX,MOBILE
2022-07-08 11:25:58,54732269,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10025.0,NEW YORK,The Department of Sanitation investigated this...,MANHATTAN,PHONE
...,...,...,...,...,...,...,...,...,...,...,...
2022-08-28 14:10:00,55237319,DEP,Department of Environmental Protection,Sewer,Defective/Missing Curb Piece (SC4),,11691,FAR ROCKAWAY,The Department of Environmental Protection inv...,QUEENS,PHONE
2022-11-21 16:06:35,56051500,HPD,Department of Housing Preservation and Develop...,APPLIANCE,REFRIGERATOR,RESIDENTIAL BUILDING,11238,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,PHONE
2022-07-08 18:18:10,54732262,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10468,BRONX,The Department of Sanitation investigated this...,BRONX,MOBILE
2022-07-08 07:51:59,54732263,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10027,NEW YORK,The Department of Sanitation investigated this...,MANHATTAN,MOBILE
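The filter above uses strict inequalities, so any row stamped exactly 2022-01-01 00:00:00 would be left out, and the data does contain midnight timestamps. As a hedged alternative, selecting on the year of the index includes that boundary. The toy rows below are made up to illustrate the difference.

```python
import pandas as pd

# Made-up rows; the first sits exactly at midnight on New Year's Day.
stamps = pd.to_datetime([
    "2022-01-01 00:00:00", "2022-01-01 08:30:00",
    "2022-12-31 23:59:59", "2023-01-01 00:00:00",
])
toy = pd.DataFrame({"Unique Key": [1, 2, 3, 4]}, index=stamps)

# Strict comparison: drops the midnight row on 2022-01-01.
strict = toy[(toy.index > pd.Timestamp("2022-01-01")) & (toy.index < pd.Timestamp("2023-01-01"))]

# Year-based mask: keeps every row whose timestamp falls in 2022.
by_year = toy[toy.index.year == 2022]

print(len(strict), len(by_year))
```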


In [13]:
# What is the average number of daily complaints received in 2022?
df_2022_daily = df_2022.resample(rule='D')['Unique Key'].count()
df_2022_daily

Created Date
2022-01-01     5865
2022-01-02     6710
2022-01-03     9163
2022-01-04     9415
2022-01-05     8385
              ...  
2022-12-27    10250
2022-12-28     9036
2022-12-29     8279
2022-12-30     7582
2022-12-31     5742
Freq: D, Name: Unique Key, Length: 365, dtype: int64

In [14]:
df_2022_daily.mean()

8684.317808219179

In [4]:
# Now we decompose our time series

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df2 = df.iloc[:,:1]
df2

Created Date,Unique Key
2011-04-06 00:00:00,20184537
2011-04-06 00:00:00,20184538
2011-04-06 00:00:00,20184539
2022-07-08 11:14:43,54732265
2011-04-06 00:00:00,20184540
...,...
2011-04-06 00:00:00,20184532
2011-04-06 00:00:00,20184533
2011-04-06 00:00:00,20184534
2011-04-06 00:00:00,20184535


In [11]:
# Daily call counts as a one-column DataFrame indexed by Created Date
daily_call = df2.resample(rule='D')['Unique Key'].count().to_frame(name='Total Calls')
daily_call

Created Date,Total Calls
2010-01-01,2942
2010-01-02,3958
2010-01-03,5676
2010-01-04,9763
2010-01-05,8735
...,...
2023-07-31,9921
2023-08-01,9813
2023-08-02,9245
2023-08-03,9128


In [12]:
# Additive decomposition of the daily series into trend, seasonality and error (ETS)
result = seasonal_decompose(daily_call['Total Calls'], model = 'additive')

In [13]:
ets = pd.DataFrame({'Total Calls':daily_call['Total Calls'],
                    'trend': result.trend,
                    'seasonality': result.seasonal,
                    'error': result.resid})
ets.loc['2020-12-25','seasonality']

182.69763790386236
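For a series with daily frequency, `seasonal_decompose` infers a period of 7, so the seasonal value read off above is a day-of-week effect (2020-12-25 was a Friday). The sketch below reproduces the idea behind an additive decomposition by hand on synthetic data. It is an illustration of the approach, not statsmodels' exact code, and the weekday effects are made-up numbers.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend plus a fixed weekday effect (made-up numbers).
days = pd.date_range("2023-01-02", periods=70, freq="D")   # starts on a Monday
dow = days.dayofweek.to_numpy()
weekday_effect = np.array([300, 200, 100, 0, -100, -250, -250])
y = pd.Series(5000 + 2 * np.arange(70) + weekday_effect[dow], index=days)

# Trend: centered 7-day moving average (NaN for the first and last three days).
trend = y.rolling(window=7, center=True).mean()

# Seasonality: average detrended value per weekday, shifted to have mean zero.
detrended = y - trend
seasonal_by_day = detrended.groupby(detrended.index.dayofweek).mean()
seasonal_by_day -= seasonal_by_day.mean()
seasonal = pd.Series(seasonal_by_day.to_numpy()[dow], index=days)

# Error: whatever is left after removing trend and seasonality.
resid = y - trend - seasonal

print(seasonal_by_day.round(2))
```

Because the synthetic series is built from exactly a trend plus a weekly pattern, the recovered seasonal values match the weekday effects and the residual is zero wherever the trend is defined.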

In [14]:
# Calculate the autocorrelation of the number of daily calls with the number of calls the day prior, i.e., a lag of 1.  (Use the daily series.)
daily_call['Total Calls'].autocorr(lag = 1)

0.7517059728398577
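A lag-1 autocorrelation is the Pearson correlation between the series and a copy of itself shifted by one day. The short check below, on made-up counts, shows that `autocorr(lag=1)` agrees with computing that correlation directly.

```python
import numpy as np
import pandas as pd

# Made-up daily counts, for illustration only.
s = pd.Series([2942, 3958, 5676, 9763, 8735, 9100, 6200, 4100], dtype="float64")

lag1 = s.autocorr(lag=1)

# Same quantity by hand: correlate each day with the previous day.
manual = s.corr(s.shift(1))
with_numpy = np.corrcoef(s.iloc[1:], s.iloc[:-1])[0, 1]

print(lag1, manual, with_numpy)
```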

In [16]:
# Forecast the daily series with a test set of 90 days using the Prophet library.  What is your RMSE on your test set?
from prophet import Prophet
daily_call = pd.DataFrame(daily_call)
daily_call = daily_call.reset_index()
daily_call

,Created Date,Total Calls
0,2010-01-01,2942
1,2010-01-02,3958
2,2010-01-03,5676
3,2010-01-04,9763
4,2010-01-05,8735
...,...,...
4959,2023-07-31,9921
4960,2023-08-01,9813
4961,2023-08-02,9245
4962,2023-08-03,9128


In [17]:
# Train-test split

train_set = daily_call.iloc[:-90]
test_set = daily_call.iloc[-90:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])

Training set:  4874
Test set:  90


In [18]:
# Create the ds and y columns for Prophet
train_set_prophet = train_set.reset_index()
train_set_prophet = train_set_prophet[['Created Date', 'Total Calls']]
train_set_prophet.columns = ['ds', 'y']
train_set_prophet.head()

,ds,y
0,2010-01-01,2942
1,2010-01-02,3958
2,2010-01-03,5676
3,2010-01-04,9763
4,2010-01-05,8735


In [19]:
model = Prophet()
model.fit(train_set_prophet)

21:28:44 - cmdstanpy - INFO - Chain [1] start processing
21:28:46 - cmdstanpy - INFO - Chain [1] done processing


<prophet.forecaster.Prophet at 0x7f0a5e3c0fa0>

In [21]:
future = model.make_future_dataframe(periods=90,freq = 'd')
future.tail()

,ds
4959,2023-07-31
4960,2023-08-01
4961,2023-08-02
4962,2023-08-03
4963,2023-08-04


In [22]:
future.shape

(4964, 1)

In [23]:
# Predict over the training period plus the 90-day test horizon
forecast = model.predict(future)
forecast.columns

Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'weekly', 'weekly_lower', 'weekly_upper', 'yearly', 'yearly_lower',
       'yearly_upper', 'multiplicative_terms', 'multiplicative_terms_lower',
       'multiplicative_terms_upper', 'yhat'],
      dtype='object')

In [24]:
preds = pd.DataFrame({'Prediction': forecast.yhat[-90:]})
preds.index = pd.to_datetime(forecast.ds[-90:])
preds.index.names = ['Date']
preds

Date,Prediction
2023-05-07,7242.556928
2023-05-08,9229.604309
2023-05-09,9291.220112
2023-05-10,9131.293129
2023-05-11,8981.995534
...,...
2023-07-31,9588.156716
2023-08-01,9636.185749
2023-08-02,9462.665714
2023-08-03,9299.547111


In [26]:
# Line up actuals and predictions before calculating evaluation metrics

# Index the actuals by date so they align with the date-indexed predictions
y_test = test_set.set_index('Created Date')['Total Calls']
y_pred = preds['Prediction']
pd.DataFrame({'y_test': y_test, 'y_pred': y_pred, 'diff': y_test - y_pred})

,y_test,y_pred,diff
2023-05-07,9102,7242.556928,1859.443072
2023-05-08,9709,9229.604309,479.395691
2023-05-09,9309,9291.220112,17.779888
2023-05-10,9110,9131.293129,-21.293129
2023-05-11,9155,8981.995534,173.004466
...,...,...,...
2023-07-31,9921,9588.156716,332.843284
2023-08-01,9813,9636.185749,176.814251
2023-08-02,9245,9462.665714,-217.665714
2023-08-03,9128,9299.547111,-171.547111


In [27]:
# Model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))

MSE =  1516626.1429373787
RMSE =  1231.513760758433
MAE =  692.2150947529236
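The RMSE above is considerably larger than the MAE, which usually points to a small number of large errors rather than uniformly poor forecasts. One candidate is the final test day, 2023-08-04, which is only partially recorded (384 calls in the daily series). The sketch below uses made-up actuals and forecasts to show how a single such day can dominate RMSE, which squares each error, while moving MAE much less.

```python
import numpy as np

# Made-up actuals and forecasts; the last actual mimics a partially recorded day.
y_true = np.array([9100.0, 9700.0, 9300.0, 9100.0, 400.0])
y_hat = np.array([9000.0, 9500.0, 9400.0, 9200.0, 9300.0])

def rmse(a, b):
    """Root mean squared error: square root of the mean squared difference."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def mae(a, b):
    """Mean absolute error: mean of the absolute differences."""
    return float(np.mean(np.abs(a - b)))

# All five days, including the partial one.
print(rmse(y_true, y_hat), mae(y_true, y_hat))

# Same metrics with the partial day dropped.
print(rmse(y_true[:-1], y_hat[:-1]), mae(y_true[:-1], y_hat[:-1]))
```

Whether to exclude the partial day from the test set is a judgment call. If you do, say so when reporting the RMSE.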
