### NYC 311 calls

NYC 311 is a service that gives New York residents access to non-emergency City services and information about City government programs. Each year, it receives millions of requests reporting problems with city services and other issues.

Data on the types of calls received and their ultimate resolution is available through the NYC Open Data portal at https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9. The data is updated daily, and the same page provides the data dictionary.

To ensure that we all use the same data and arrive at the same results, the data has already been downloaded; it includes information up to 2023-08-04 12:00:00. Several columns not required for this project have been removed from the original data. (As an additional exercise to showcase your skills, feel free to download the entire dataset from the URL above for investigation.)


In [1]:
import pandas as pd

# Load the data from the shared pickle file. Adjust the path if your copy lives elsewhere.
df = pd.read_pickle('shared/Project-3_NYC_311_Calls.pkl')

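# Downcast numeric columns to the smallest integer dtype that can hold their values.
# With downcast='integer', float columns are converted only when every value is a whole number.
for col in df.select_dtypes(include=['int', 'float']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')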

# Convert object columns to category if they have a limited set of values
for col in df.select_dtypes(include=['object']).columns:
    if df[col].nunique() / len(df[col]) < 0.5:
        df[col] = df[col].astype('category')
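
To make the effect of this kind of dtype optimization visible, here is a minimal sketch on a small made-up frame (the column values and the frame name `toy` are assumptions for illustration, not the real 311 data). It downcasts integers and floats separately and compares memory use before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the 311 data (not the real file).
toy = pd.DataFrame({
    "Unique Key": np.arange(1_000, dtype="int64"),
    "Latitude": np.linspace(40.5, 40.9, 1_000),      # float64 by default
    "Borough": np.tile(["BRONX", "QUEENS"], 500),    # low-cardinality text
})

before = toy.memory_usage(deep=True).sum()

# Downcast integers and floats separately so each column keeps its own kind.
toy["Unique Key"] = pd.to_numeric(toy["Unique Key"], downcast="integer")
toy["Latitude"] = pd.to_numeric(toy["Latitude"], downcast="float")
# Repeated strings are stored once as categories plus small integer codes.
toy["Borough"] = toy["Borough"].astype("category")

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"memory: {before:,} -> {after:,} bytes")
```

Downcasting floats trades precision for memory, so it may not be appropriate for coordinates that need full precision.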

In [2]:
# Set the 'Created Date' as the index and convert it to a proper DatetimeIndex
df = df.set_index(pd.DatetimeIndex(df['Created Date']))
del df['Created Date']

# Filtering the data for the year 2022
df_2022 = df[df.index.year == 2022]

# Calculating the average number of daily complaints in 2022
# Assuming that each row represents a unique complaint
average_daily_complaints_2022 = df_2022.resample('D')['Unique Key'].count().mean()

# Print the result
print(average_daily_complaints_2022)

8684.320547945206
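
One detail behind this average: `resample('D')` creates a bin for every calendar day in the range, including days with no rows, while a plain `groupby` on the date only sees days that have data. The sketch below uses four made-up timestamps (an assumption for illustration) to show how the two averages can differ.

```python
import pandas as pd

# Hypothetical timestamps: 3 requests on Jan 1, none on Jan 2, 1 on Jan 3.
idx = pd.DatetimeIndex([
    "2022-01-01 08:00", "2022-01-01 09:30", "2022-01-01 23:59",
    "2022-01-03 12:00",
])
toy = pd.DataFrame({"Unique Key": [1, 2, 3, 4]}, index=idx)

# resample('D') includes the empty Jan 2 as a day with a count of 0.
per_day_resample = toy.resample("D")["Unique Key"].count()

# groupby on the normalized date skips days without any rows.
per_day_groupby = toy.groupby(toy.index.normalize())["Unique Key"].count()

print(per_day_resample.mean())  # 4 requests spread over 3 days
print(per_day_groupby.mean())   # 4 requests spread over 2 days
```

For the real data the difference is likely negligible, since 311 receives requests every day, but the distinction matters for sparser series.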


In [3]:
# Finding the date with the maximum number of calls
max_calls_date = df.resample('D')['Unique Key'].count().idxmax()

# Print the date
print(max_calls_date)

2020-08-04 00:00:00


In [4]:
# First, find the date with the maximum number of calls
max_calls_date = df.resample('D')['Unique Key'].count().idxmax()

# Filter the dataframe for only the entries on that date
df_max_calls_date = df[df.index.date == max_calls_date.date()]

# Find the most common complaint type on that date
most_common_complaint = df_max_calls_date['Complaint Type'].value_counts().idxmax()

# Print the most common complaint type and the date
print(f"Most common complaint on {max_calls_date.date()}: {most_common_complaint}")

Most common complaint on 2020-08-04: Damaged Tree
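
Looking only at the top category hides how dominant it is. A small sketch of the same steps on made-up rows (the timestamps and the frame name `toy` are assumptions) shows how `value_counts()` can return the leading few categories with their counts.

```python
import pandas as pd

# Hypothetical requests: four on 2020-08-04, two on 2020-08-05.
idx = pd.to_datetime([
    "2020-08-04 07:00", "2020-08-04 09:15", "2020-08-04 13:40", "2020-08-04 18:05",
    "2020-08-05 10:00", "2020-08-05 11:30",
])
toy = pd.DataFrame(
    {"Complaint Type": ["Damaged Tree", "Damaged Tree", "Noise",
                        "Damaged Tree", "Noise", "Noise"]},
    index=idx,
)

# Busiest day by number of rows.
busiest = toy.resample("D")["Complaint Type"].count().idxmax()

# Keep only rows whose date (time stripped by normalize) equals the busiest day.
on_busiest = toy[toy.index.normalize() == busiest]

# Top complaint types on that day, sorted by count in descending order.
top = on_busiest["Complaint Type"].value_counts().head(3)
print(busiest.date())
print(top)
```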


In [5]:
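# Count the number of calls in each calendar month of each year (one bin per year-month).
# Note: newer pandas versions (2.2+) prefer the alias 'ME' over 'M' for month-end bins.
monthly_calls = df.resample('M')['Unique Key'].count()

# Identify the single year-month with the fewest calls and report its month name.
# Caveat: the data ends on 2023-08-04, so the final bin (August 2023) covers only a few days
# and may be the minimum simply because it is incomplete.
quietest_month = monthly_calls.idxmin().month_name()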

# Print the quietest month
print(f"The quietest month historically: {quietest_month}")

The quietest month historically: August
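
"Quietest month historically" can also be read as the calendar month with the lowest typical volume across all years. The sketch below illustrates that alternative reading on made-up monthly totals (all numbers are assumptions): it drops a partial final month and averages by month of year.

```python
import calendar

import pandas as pd

# Hypothetical monthly totals: February is genuinely quiet (80 vs. 100),
# and the final month (March 2023) is only partially observed (10).
months = pd.date_range("2021-01-01", "2023-03-01", freq="MS")
counts = pd.Series(100, index=months)
counts[counts.index.month == 2] = 80
counts.iloc[-1] = 10

# Minimum over individual year-month bins: picks the partial month.
naive = counts.idxmin().month_name()

# Drop the partial final month, then average by calendar month (1..12).
complete = counts.iloc[:-1]
by_month = complete.groupby(complete.index.month).mean()
quietest = calendar.month_name[int(by_month.idxmin())]

print(naive)     # month name of the single smallest bin
print(quietest)  # calendar month with the lowest average
```

Which reading is intended depends on the question being asked; the two can give different answers when the last month is incomplete.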


In [6]:
import statsmodels.api as sm

# Resample the time series to a daily frequency
daily_calls = df['Unique Key'].resample('D').count()

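# Decompose the series into trend, seasonal, and residual parts with an additive model.
# No period is passed, so statsmodels infers it from the daily index (a weekly cycle of 7).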
decomposition = sm.tsa.seasonal_decompose(daily_calls, model='additive')

# Extract the seasonal component
seasonal_component = decomposition.seasonal

# Find the value of the seasonal component on 2020-12-25, rounded to the nearest integer
seasonal_value = round(seasonal_component['2020-12-25'])

# Print the seasonal value
print(f"Seasonal component value on 2020-12-25: {seasonal_value}")

Seasonal component value on 2020-12-25: 183
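
To give some intuition for what a weekly seasonal component represents, here is a rough pandas-only sketch of an additive decomposition on a synthetic series. It assumes a period of 7 days and is not statsmodels' exact implementation: the trend is a centered 7-day moving average, and the seasonal value for each weekday is the mean of the detrended series on that weekday, shifted so the seven values sum to zero.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: constant level of 1000 plus a fixed weekday offset.
days = pd.date_range("2020-01-06", periods=70, freq="D")   # starts on a Monday
weekly = np.array([30, 20, 10, 0, -10, -20, -30])          # Mon..Sun, sums to 0
y = pd.Series(1000 + np.tile(weekly, 10), index=days, dtype=float)

# Trend: centered moving average over one full week.
trend = y.rolling(window=7, center=True).mean()

# Seasonal: average detrended value per weekday, centered to sum to zero.
detrended = y - trend
dow_means = detrended.groupby(detrended.index.dayofweek).mean()
dow_means = dow_means - dow_means.mean()

# Broadcast the seven weekday values back onto every date.
seasonal = pd.Series(dow_means.loc[y.index.dayofweek].to_numpy(), index=y.index)

# Value on a Friday (weekday number 4).
print(seasonal.loc[pd.Timestamp("2020-01-10")])
```

Under this reading, a value reported for a specific date reflects that date's day of the week rather than anything specific to the date itself, such as a holiday.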


In [7]:
# Calculate the autocorrelation with lag 1
autocorrelation_lag_1 = daily_calls.autocorr(lag=1)

# Print the autocorrelation
print(f"Autocorrelation with a lag of 1 day: {autocorrelation_lag_1}")

Autocorrelation with a lag of 1 day: 0.7517059728398577
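
As a check on what `autocorr(lag=1)` measures, the sketch below reproduces it on a synthetic random-walk series (an assumption for illustration) as the Pearson correlation between the series and a copy of itself shifted by one step.

```python
import numpy as np
import pandas as pd

# Synthetic series: a random walk, which is strongly autocorrelated.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=200)).cumsum()

# Built-in lag-1 autocorrelation.
lag1 = s.autocorr(lag=1)

# Same quantity computed by hand: correlate s[t] with s[t-1].
manual_pandas = s.corr(s.shift(1))
manual_numpy = np.corrcoef(s.iloc[1:].to_numpy(), s.iloc[:-1].to_numpy())[0, 1]

# For daily data, lag 7 compares each day with the same weekday one week earlier.
lag7 = s.autocorr(lag=7)

print(lag1, manual_pandas, manual_numpy, lag7)
```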


In [15]:
from math import sqrt

from prophet import Prophet
from sklearn.metrics import mean_squared_error

# Prepare the data for Prophet, which expects a DataFrame with
# a 'ds' column (dates) and a 'y' column (values to forecast)
prophet_df = pd.DataFrame({'ds': daily_calls.index, 'y': daily_calls.values})

# Verify the structure of prophet_df
print(prophet_df.head())  # Check the first few rows to ensure the structure is correct

# Hold out the last 90 days as the test set and train on everything before them
train_df = prophet_df.iloc[:-90]
test_df = prophet_df.iloc[-90:]

# Initialize and fit the Prophet model
model = Prophet()
model.fit(train_df)

# Make predictions
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# Extract the predicted values for the test set
predictions = forecast['yhat'].iloc[-90:]

# Calculate RMSE
rmse = sqrt(mean_squared_error(test_df['y'], predictions))

print(rmse)

          ds     y
0 2010-01-01  2942
1 2010-01-02  3958
2 2010-01-03  5676
3 2010-01-04  9763
4 2010-01-05  8735


19:07:39 - cmdstanpy - INFO - Chain [1] start processing
19:07:40 - cmdstanpy - INFO - Chain [1] done processing


1231.513760758433
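
An RMSE value is easier to interpret next to a simple baseline. The sketch below is an illustration on made-up numbers, not part of the original analysis: it computes RMSE by hand and builds a seasonal-naive forecast that repeats the last observed cycle. The helper names `rmse` and `seasonal_naive` are hypothetical.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def seasonal_naive(train, horizon, period=7):
    """Forecast by repeating the last `period` training values over the horizon."""
    last_cycle = np.asarray(train, dtype=float)[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]


# Made-up data with a cycle of length 3.
train = [10, 20, 30, 10, 20, 30]
test = [12, 18, 33]

baseline = seasonal_naive(train, horizon=3, period=3)
print(baseline)
print(rmse(test, baseline))
```

Applied to the real series, `seasonal_naive(train_df['y'], horizon=90, period=7)` would give a weekly baseline whose RMSE could be compared with the Prophet result above.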
