# **Predicting Seattle Residents' Customer Requests**

The City of Seattle has collected extensive data on resident service requests through its customer service portals over many years. These requests cover a wide range of issues, from pothole repairs to unauthorized encampments, providing a valuable opportunity to understand public service demand patterns over time.

In this project, we focus on forecasting the number of service requests expected over the next three months (May, June, and July 2025). Our goal is to build predictive models that can accurately estimate future service volumes for each Service Request Type, as well as extend to predictions at the Department level and for specific ZIP Code and Service Request Type combinations.

To achieve this, we preprocess the data by aggregating historical service requests at a monthly level. We engineer time series features such as Lag (previous month's request counts) and Rolling Statistics (rolling mean and standard deviation) to give our models "memory" of past trends and fluctuations. These features help capture seasonality, stability, and short-term changes in service request volumes.

We evaluate four models: Linear Regression, Random Forest, LightGBM, and XGBoost. The three tree-based models are tuned globally; all four are then compared for each service type using 5-fold cross-validation on chronologically ordered, unshuffled data. Final model selection for each service type is based on minimizing the Mean Absolute Percentage Error (MAPE), a scale-free metric that keeps the comparison meaningful across service categories with very different request volumes.

Our final predictions aim to equip city agencies with actionable insights, allowing them to allocate resources more effectively, anticipate resident needs, and plan operational efforts during the upcoming months.

### Install All Relevant Dependencies

In [1]:
!pip install scikit-learn
!pip install lightgbm xgboost

Collecting lightgbm
  Using cached lightgbm-4.6.0-py3-none-win_amd64.whl.metadata (17 kB)
Collecting xgboost
  Using cached xgboost-3.0.0-py3-none-win_amd64.whl.metadata (2.1 kB)
Using cached lightgbm-4.6.0-py3-none-win_amd64.whl (1.5 MB)
Using cached xgboost-3.0.0-py3-none-win_amd64.whl (150.0 MB)
Installing collected packages: xgboost, lightgbm
Successfully installed lightgbm-4.6.0 xgboost-3.0.0


### Import All Relevant Libraries

In [2]:
# General
import numpy as np
import pandas as pd

# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Cross-Validation
from sklearn.model_selection import KFold
import time

# Model Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform


Data is sourced from https://drive.google.com/drive/folders/1NHK30kuyNR7wvtXLiDieKK4IQSEaNfYD 

Because the dataset is large, please download it manually from the link above.

Make sure the CSV file is stored in the 'Data' directory of this repository!

In [3]:
df = pd.read_csv("Data/Customer_Service_Requests_20250426.csv", low_memory=False)
df.sample(5)

Unnamed: 0,Service Request Number,Service Request Type,City Department,Created Date,Method Received,Status,Location,X_Value,Y_Value,Latitude,Longitude,Latitude/Longitude,ZIP Code,Council District,Police Precinct,Neighborhood
396866,22-00295099,Clogged Storm Drain,SPU-Seattle Public Utilities,11/27/2022 11:45:25 AM,Citizen Web Intake App,Closed,"702 N 66TH ST, SEATTLE, WA 98103",1266875.0,250660.375685,47.677086,-122.349733,POINT (-122.34973266 47.67708581),98103,6.0,NORTH,PHINNEY RIDGE
843626,24-00213625,Unauthorized Encampment,SEA-City of Seattle,07/29/2024 11:04:23 AM,Voice Mail,Closed,"1590 NW 90TH ST, SEATTLE, WA 98117",1259405.0,257314.976659,47.694918,-122.380595,POINT (-122.38059468 47.69491803),98117,5.0,NORTH,NORTH BEACH/BLUE RIDGE
168243,22-00007218,Pothole,SDOT-Seattle Department of Transportation,01/09/2022 09:05:37 AM,Find It Fix It Apps,Reported,"5700 1ST AVE S, SEATTLE, WA 98108",1269887.0,204862.588367,47.551718,-122.333885,POINT (-122.33388459 47.55171764),98108,2.0,SOUTH,GEORGETOWN
578314,23-00199294,Overgrown Vegetation,FAS-Finance and Administrative Services,08/11/2023 05:01:53 PM,Find It Fix It Apps,Closed,"100 BROADWAY E, SEATTLE, WA 98102",1273682.0,229368.720776,47.61909,-122.320438,POINT (-122.32043832 47.61909025),98102,3.0,EAST,BROADWAY
748437,24-00085148,Unauthorized Encampment,SEA-City of Seattle,04/06/2024 11:39:36 AM,Find It Fix It Apps,Closed,"1028 15TH AVE E, SEATTLE, WA 98112",1275765.0,232778.753358,47.628547,-122.312257,POINT (-122.31225722 47.62854715),98112,3.0,EAST,STEVENS


### Pinpoint Null Values and Remove Them

In [4]:
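# max_na is the largest per-column count of missing values, which is a lower
# bound on the number of rows that have at least one missing value.
max_na = df.isna().sum().max()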
total_count = df.shape[0]
percent_missing = (max_na / total_count) * 100

print(f"There are a total of {total_count} observations with {max_na} observations that has at least one feature with missing data.", end="\n")
print(f"{percent_missing:.2f}% of the data would be removed if we were to account for all features with missing data.")

There are a total of 1077316 observations with 43129 observations that has at least one feature with missing data.
4.00% of the data would be removed if we were to account for all features with missing data.


In [5]:
df.isna().sum()

Service Request Number        0
Service Request Type          0
City Department               0
Created Date                  0
Method Received               0
Status                        0
Location                  17216
X_Value                       0
Y_Value                       0
Latitude                  24161
Longitude                 24161
Latitude/Longitude        24161
ZIP Code                  43129
Council District          34316
Police Precinct           32785
Neighborhood              34373
dtype: int64

In [6]:
df = df.dropna()
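
Calling `dropna()` with no arguments removes every row that is missing a value in any column, including location columns that the service-type forecast does not use. As a hedged alternative (not what this notebook does), rows could be dropped only when a column needed for a given forecast is missing. The toy frame and the `needed` list below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "Service Request Number": ["a", "b", "c", "d"],
    "Service Request Type": ["Pothole", "Graffiti", "Pothole", "Graffiti"],
    "Created Date": ["01/09/2022 09:05:37 AM"] * 4,
    "ZIP Code": [98108, np.nan, 98103, np.nan],
})

# Dropping on every column loses the two rows that have no ZIP Code.
dropped_all = toy.dropna()

# Dropping only on the columns this forecast needs keeps those rows.
needed = ["Service Request Number", "Service Request Type", "Created Date"]
dropped_subset = toy.dropna(subset=needed)

print(len(dropped_all), len(dropped_subset))
```

The ZIP Code-level forecast would still need `ZIP Code` in its own `subset`, so the right column list depends on which prediction is being built.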

### Source and Create Relevant Variables

In [7]:
df['Created Date'] = pd.to_datetime(df['Created Date'])

df['Year'] = df['Created Date'].dt.year
df['Month'] = df['Created Date'].dt.month

df.sample(5)

Unnamed: 0,Service Request Number,Service Request Type,City Department,Created Date,Method Received,Status,Location,X_Value,Y_Value,Latitude,Longitude,Latitude/Longitude,ZIP Code,Council District,Police Precinct,Neighborhood,Year,Month
493177,23-00096060,Graffiti,SPU-Seattle Public Utilities,2023-04-24 12:14:59,Find It Fix It Apps,Reported,"5501 SEAVIEW AVE NW, SEATTLE, WA 98107",1253494.0,247821.735689,47.66857,-122.403806,POINT (-122.40380563 47.66856957),98107,6.0,NORTH,SUNSET HILL,2023,4
373575,22-00266553,Parking Enforcement,SPD-Seattle Police Department,2022-10-20 08:19:19,Find It Fix It Apps,Closed,"3658 PHINNEY AVE N, SEATTLE, WA 98103",1265620.0,242265.720517,47.654009,-122.354153,POINT (-122.35415294 47.65400857),98103,6.0,NORTH,FREMONT,2022,10
1018978,25-00050096,Illegal Dumping / Needles,SPU-Seattle Public Utilities,2025-02-22 12:38:42,Find It Fix It Apps,Closed,"2306 S HILL ST, SEATTLE, WA 98144",1277866.0,216732.979402,47.584676,-122.302499,POINT (-122.30249946 47.58467616),98144,2.0,SOUTH,ATLANTIC,2025,2
1051378,25-00088381,Parking Enforcement,SPD-Seattle Police Department,2025-03-28 15:27:51,Citizen Web Intake App,Closed,"125 E LYNN ST, SEATTLE, WA 98102",1272379.0,236884.751992,47.639622,-122.32631,POINT (-122.32630965 47.63962236),98102,3.0,WEST,EASTLAKE,2025,3
698829,24-00023317,Parking Enforcement,SPD-Seattle Police Department,2024-01-29 14:02:52,Citizen Web Intake App,Closed,"7732 MARY AVE NW, SEATTLE, WA 98117",1260675.0,254034.940256,47.685998,-122.37517,POINT (-122.37516988 47.68599758),98117,6.0,NORTH,WHITTIER HEIGHTS,2024,1


## Predicting Total Service Requests in the Next 3 Months: Service Request Type

### Preprocessing

In [8]:
Service_Type = df.groupby(['Service Request Type', 'Year', 'Month'])['Service Request Number'].count().reset_index()
Service_Type.rename(columns={'Service Request Number': 'Request Count'}, inplace=True)

Service_Type.sample(10)

Unnamed: 0,Service Request Type,Year,Month,Request Count
1327,Public Litter and Recycling Cans,2025,1,270
1150,Pollution Report Form,2023,2,6
1218,Pothole,2024,6,680
984,Overgrown Vegetation,2022,2,116
291,Damaged Sidewalk,2022,10,137
717,General Inquiry - Transportation,2021,8,635
1406,Scooter or Bike Share Issue,2024,4,291
1254,Public Garage or Parking Lot Complaint,2023,4,2
12,ADA Request (Transportation),2022,1,2
434,Found a Pet,2024,1,25


### Lag and Rolling

In time series forecasting, what happened in the past often shapes what happens next.

**LAG**: The value from a previous time step (here, the previous row's request count), which lets the model see the most recent level.

**ROLLING**: Statistics computed over a sliding window of recent observations (here, a 3-row mean and standard deviation), which smooth out sudden bursts of noise in order to reveal the general trend.
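
As a minimal sketch of these two ideas, the toy example below computes a lag and a rolling mean separately for each service type. The frame and its values are made up; grouping by service type and shifting before rolling are assumptions of this sketch, chosen so that each feature only uses earlier months of the same service type.

```python
import pandas as pd

# Toy monthly counts for two made-up service types (illustrative values only).
toy = pd.DataFrame({
    "Service Request Type": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2024] * 8,
    "Month": [1, 2, 3, 4, 1, 2, 3, 4],
    "Request Count": [10, 20, 30, 40, 100, 200, 300, 400],
})

toy = toy.sort_values(["Service Request Type", "Year", "Month"]).reset_index(drop=True)
counts = toy.groupby("Service Request Type")["Request Count"]

# Lag: last month's count for the same service type (NaN for the first month).
toy["lag"] = counts.shift(1)

# Rolling mean of up to three months *before* the current one. Shifting first
# keeps the current month's count (the prediction target) out of its own feature.
toy["Rolling_Mean"] = counts.transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(toy)
```

The first month of each service type has no history, so its features are `NaN`; whether to fill, drop, or keep those rows is a separate modeling choice.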

In [9]:
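# Sort chronologically, then by service type within each month.
Service_Type = Service_Type.sort_values(['Year', 'Month', 'Service Request Type']).reset_index(drop=True)

# Caution: shift() and rolling() run over the whole frame in the order above, not
# per service type, so 'lag' is the previous row's count (often another service type).
Service_Type['lag'] = Service_Type['Request Count'].shift(1)

# 3-row rolling window; each window includes the current row's own Request Count.
Service_Type['Rolling_Mean'] = Service_Type['Request Count'].rolling(window=3, min_periods=1).mean().reset_index(0, drop=True)
Service_Type['Rolling_Std'] = Service_Type['Request Count'].rolling(window=3, min_periods=1).std().reset_index(0, drop=True)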

Service_Type = Service_Type.fillna(0)
Service_Type['lag'] = Service_Type['lag'].astype(int)

Service_Type.sample(10)

Unnamed: 0,Service Request Type,Year,Month,Request Count,lag,Rolling_Mean,Rolling_Std
1350,Damaged Sidewalk,2024,6,89,146,82.666667,66.725807
359,Towing Impound Complaint,2021,12,3,3,119.0,200.917894
903,Street Sign and Traffic Signal Maintenance,2023,5,868,47,454.333333,410.536641
1574,Feedback about the Customer Service Requests P...,2024,12,2,153,72.666667,75.96271
212,Animal Noise,2021,8,48,1,17.0,26.851443
546,Clogged Storm Drain,2022,7,68,20,47.333333,24.684678
1563,Towing Impound Complaint,2024,11,3,449,352.0,312.020833
18,Parking Enforcement,2021,1,1263,125,486.666667,672.846441
478,Towing Complaint - Public Impound,2022,4,8,243,369.666667,438.92862
650,Graffiti,2022,10,1984,618,913.333333,957.781464


### Global Tuning

We will tune hyperparameters on the following candidate models:
- Random Forest
- LightGBM
- XGBoost

Tuning lets us run each of these models with its best-found settings during cross-validation. Linear Regression is used as-is, since it has no hyperparameters to tune here.



In [10]:
def tune_model(model, param_dist, X_sample, y_sample, n_iter=10):
    """Tune a model with RandomizedSearchCV and return the best hyperparameters as a dict."""
    random_search = RandomizedSearchCV(
        model,
        param_distributions=param_dist,
        n_iter=n_iter,
        cv=3,
        scoring='neg_mean_absolute_error',
        random_state=42,
        n_jobs=-1
    )
    random_search.fit(X_sample, y_sample)
    return random_search.best_params_

In [11]:
rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

lgbm_params = {
    'num_leaves': [20, 31, 40],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300]
}

xgb_params = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300]
}

In [12]:
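# Pool all service types together and keep only the numeric columns.
service_data = Service_Type.drop(columns=['Service Request Type'])
X = service_data.drop(columns=['Request Count'])
y = service_data['Request Count']

# Tune on a 10% random sample of the pooled rows.
X_sample = X.sample(frac=0.1, random_state=42)
y_sample = y.loc[X_sample.index]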

In [13]:
best_rf_params = tune_model(RandomForestRegressor(), rf_params, X_sample, y_sample)

best_lgbm_params = tune_model(LGBMRegressor(verbosity=-1), lgbm_params, X_sample, y_sample)

best_xgb_params = tune_model(XGBRegressor(verbosity=0), xgb_params, X_sample, y_sample)

In [14]:
best_params = {
    'Linear Regression': {},
    'Random Forest': best_rf_params,
    'LightGBM': best_lgbm_params,
    'XGBoost': best_xgb_params
}

print("Best Random Forest Parameters:", best_rf_params)
print("Best LightGBM Parameters:", best_lgbm_params)
print("Best XGBoost Parameters:", best_xgb_params)

Best Random Forest Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'max_depth': 20}
Best LightGBM Parameters: {'num_leaves': 20, 'n_estimators': 300, 'learning_rate': 0.1}
Best XGBoost Parameters: {'n_estimators': 100, 'max_depth': 7, 'learning_rate': 0.2}


### Model Candidate Competition through Cross-Validation
The four candidate models:
- Linear Regression
- Random Forest
- LightGBM
- XGBoost

We evaluate each model separately for every service request type and record:
- Mean Absolute Error (MAE): the average size of the prediction error, in requests per month.
- Mean Absolute Percentage Error (MAPE): the average size of the prediction error as a percentage of the actual count.
- Total Requests: the total number of requests reported for that service type.
- Latency: the average time in seconds to fit the model and predict on one fold (not as important, but still a useful insight).

For each service request type, we assign the model with the lowest MAPE.
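
To make the two error metrics concrete, here is a minimal sketch of both, written with NumPy. The helper names `mae` and `mape` and the example numbers are assumptions made for illustration; the zero check is an extra safeguard added in this sketch.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the miss, in requests."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: average miss as a percentage of the actual count."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when an actual value is 0")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

# The same absolute miss of 2 requests per month gives very different percentages.
busy_actual, busy_pred = [1000, 1200], [1002, 1198]
quiet_actual, quiet_pred = [2, 4], [4, 2]

print(mae(busy_actual, busy_pred), mape(busy_actual, busy_pred))
print(mae(quiet_actual, quiet_pred), mape(quiet_actual, quiet_pred))
```

Both toy series have an MAE of 2, but the low-volume one has a MAPE of 75% while the high-volume one stays below 1%, which is why MAPE tends to look worse for rarely used service types.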

In [15]:
service_types = Service_Type['Service Request Type'].unique()

model_df = pd.DataFrame(service_types, columns=['Service Request Type'])
model_df['Model'] = None
model_df['Mean Absolute Error'] = None
model_df['Mean Absolute Percentage Error'] = None
model_df['Total Requests'] = None
model_df['Latency'] = None

model_df.sample(10)

Unnamed: 0,Service Request Type,Model,Mean Absolute Error,Mean Absolute Percentage Error,Total Requests,Latency
21,Pothole,,,,,
27,Streetlight Maintenance,,,,,
19,Parks and Recreation Maintenance,,,,,
12,General Inquiry - Public Utilities,,,,,
36,Feedback about the Customer Service Requests P...,,,,,
33,Nightlife Noise Complaint,,,,,
7,Dead Animal,,,,,
26,Street Sign and Traffic Signal Maintenance,,,,,
10,General Inquiry - Customer Service Bureau,,,,,
17,Overgrown Vegetation,,,,,


In [16]:
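# Unshuffled folds are contiguous in time, but the earlier folds are still
# validated by models trained on later months.
cv_split = KFold(n_splits=5, shuffle=False)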

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(**best_params['Random Forest']),
    'LightGBM': LGBMRegressor(**best_params['LightGBM']),
    'XGBoost': XGBRegressor(**best_params['XGBoost'])
}

for service in service_types:
    # Per-model cross-validation results for this service type.
    # (Named 'scores' to avoid shadowing Python's built-in eval.)
    scores = {name: {'mae': None, 'mape': None, 'latency': None} for name in models}

    service_data = Service_Type[Service_Type['Service Request Type'] == service].drop(columns=['Service Request Type'])
    X = service_data.drop(columns=['Request Count'])
    y = service_data['Request Count']

    # 5-fold cross-validation needs at least one row per fold.
    if len(X) < 5:
        print(f"Skipping {service}: only {len(X)} samples.")
        continue

    for model_name, model in models.items():

        mean_absolute_errors, mean_absolute_percentage_errors, latencies = [], [], []

        for idx_train, idx_val in cv_split.split(X, y):
            X_train = X.iloc[idx_train]
            X_val = X.iloc[idx_val]
            y_train = y.iloc[idx_train]
            y_val = y.iloc[idx_val]

            # Latency covers fitting plus prediction on this fold.
            start_time = time.time()
            model.fit(X_train, y_train)

            y_pred = model.predict(X_val)
            end_time = time.time()

            latency = end_time - start_time
            mae = np.mean(np.abs(y_val - y_pred))
            mape = np.mean(np.abs(100.0 * (y_val - y_pred) / y_val))
            mean_absolute_errors.append(mae)
            mean_absolute_percentage_errors.append(mape)
            latencies.append(latency)

        # Average each metric across the folds.
        scores[model_name]['mae'] = np.mean(mean_absolute_errors)
        scores[model_name]['mape'] = np.mean(mean_absolute_percentage_errors)
        scores[model_name]['latency'] = np.mean(latencies)

    # Keep the model with the lowest average MAPE for this service type.
    best_model = min(scores, key=lambda name: scores[name]['mape'])

    model_df.loc[model_df['Service Request Type'] == service, 'Model'] = best_model
    model_df.loc[model_df['Service Request Type'] == service, 'Mean Absolute Error'] = scores[best_model]['mae']
    model_df.loc[model_df['Service Request Type'] == service, 'Mean Absolute Percentage Error'] = scores[best_model]['mape']
    model_df.loc[model_df['Service Request Type'] == service, 'Total Requests'] = service_data['Request Count'].sum()
    model_df.loc[model_df['Service Request Type'] == service, 'Latency'] = scores[best_model]['latency']

Skipping Taxi, TNC, or Limousine Complaint or Compliment: only 4 samples.
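
The loop above uses `KFold(n_splits=5, shuffle=False)`, which keeps folds contiguous but still lets early months be validated by a model trained on later months. A stricter time-based scheme, shown here only as a sketch and not as what this notebook ran, is scikit-learn's `TimeSeriesSplit`, where every training window ends before its validation window begins. The twelve-row array is a stand-in for twelve consecutive months of one service type.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Twelve rows standing in for twelve consecutive months of one service type.
X = np.arange(12).reshape(-1, 1)

kfold = KFold(n_splits=3, shuffle=False)
tss = TimeSeriesSplit(n_splits=3)

print("KFold(shuffle=False)")
for train_idx, val_idx in kfold.split(X):
    print("  train:", train_idx, "validate:", val_idx)

print("TimeSeriesSplit")
for train_idx, val_idx in tss.split(X):
    print("  train:", train_idx, "validate:", val_idx)

# True only when every training index comes before every validation index.
ts_ok = all(tr.max() < va.min() for tr, va in tss.split(X))
kf_ok = all(tr.max() < va.min() for tr, va in kfold.split(X))
print(ts_ok, kf_ok)
```

`TimeSeriesSplit` trains on fewer rows in its early folds, so service types with short histories may need fewer splits.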


Notice that some service types, such as "Snow and Ice", have a very large MAPE. This is usually because very few requests are available for training and validating the model, and because a small actual count in the denominator inflates percentage errors.
Popular service types tend to have a smaller MAPE, although a few outliers do not follow this trend. Time series often contain noise that has no identifiable cause. In general, though, more data tends to help our models perform better.

In [17]:
model_df = model_df.loc[model_df['Model'].notna(), :]
model_df.sample(10)

Unnamed: 0,Service Request Type,Model,Mean Absolute Error,Mean Absolute Percentage Error,Total Requests,Latency
14,Graffiti,Linear Regression,6.033708,0.299039,103381,0.001902
1,Abandoned Vehicle,Random Forest,1.344067,70.48123,21,0.077905
5,Clogged Storm Drain,Linear Regression,4.743802,4.485917,10412,0.001402
3,Business Related Complaint,Linear Regression,5.252364,34.664436,861,0.00143
28,Towing Impound Complaint,Linear Regression,2.516171,55.755072,379,0.001305
20,Pollution Report Form,LightGBM,1.985,57.307031,281,0.019374
8,General Inquiry - Animal Shelter,XGBoost,17.87028,28.832266,3754,0.140531
30,Wireless or Small Cell Issue,LightGBM,1.001133,65.302042,55,0.012272
42,Snow and Ice,XGBoost,28.738208,153.742786,203,0.094919
7,Dead Animal,Linear Regression,18.159706,17.068158,5798,0.001602


### Predicting the Future: 3 Months from Today (4/27/2025)

In [18]:
test_rows = []

for service in model_df['Service Request Type'].unique():
    for month in [5, 6, 7]: 
        row = {
            'Service Request Type': service,
            'Year': 2025,
            'Month': month,
            'Predicted Request Count': None
        }
        test_rows.append(row)

test = pd.DataFrame(test_rows)
test.sample(10)

Unnamed: 0,Service Request Type,Year,Month,Predicted Request Count
89,Traffic Calming,2025,7,
90,Wireless or Small Cell Issue,2025,5,
27,General Inquiry - City Light,2025,5,
108,"Feedback about the Find It, Fix It mobile app",2025,5,
84,Towing Impound Complaint,2025,5,
13,Business Violation of Public Health Requirements,2025,6,
62,Pollution Report Form,2025,7,
86,Towing Impound Complaint,2025,7,
28,General Inquiry - City Light,2025,6,
104,Unauthorized Encampment,2025,7,
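
The future rows above carry only `Year` and `Month`, because lag values for June and July depend on counts that have not been observed yet. One common way to keep a lag feature in a multi-month forecast is recursive forecasting: predict the next month from the latest known count, then feed that prediction back in as the following month's lag. The sketch below illustrates the idea on a made-up series with a single lag feature; it is not the approach used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up monthly counts for one service type, oldest first.
history = [100.0, 110.0, 121.0, 133.0, 146.0, 161.0]

# One-step model: predict this month's count from last month's count.
X_train = np.array(history[:-1]).reshape(-1, 1)  # lag values
y_train = np.array(history[1:])                  # the month that followed
model = LinearRegression().fit(X_train, y_train)

# Roll forward three months, feeding each prediction back in as the next lag.
forecasts = []
last_value = history[-1]
for _ in range(3):
    next_value = float(model.predict(np.array([[last_value]]))[0])
    forecasts.append(next_value)
    last_value = next_value

print([round(v, 1) for v in forecasts])
```

Errors compound in a recursive forecast, since each step builds on the previous prediction, so the third month is generally less reliable than the first.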


In [19]:
test_set = test.merge(model_df[['Service Request Type', 'Model']], on='Service Request Type', how='left')

trained_models = {}
trained_features = {}

for service in test_set['Service Request Type'].unique():
    model_name = test_set.loc[test_set['Service Request Type'] == service, 'Model'].values[0]

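    # The final model is fit on Year and Month only, the two features that are
    # also present in the future rows of test_set.
    service_data = Service_Type[Service_Type['Service Request Type'] == service].drop(columns=['Service Request Type', 'lag', 'Rolling_Mean', 'Rolling_Std'])
    X = service_data.drop(columns=['Request Count'])
    y = service_data['Request Count']

    # Caution: models[model_name] is one shared estimator object, so each fit()
    # replaces the previous one. Services mapped to the same model name all end up
    # predicting with the last fit; sklearn.base.clone() would give each its own copy.
    model = models[model_name]
    model.fit(X, y)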

    trained_models[service] = model
    trained_features[service] = X.columns.tolist()

predicted_counts = []

for service in test_set['Service Request Type'].unique():
    model = trained_models[service]
    features = trained_features[service]

    service_rows = test_set[test_set['Service Request Type'] == service]
    X_test = service_rows[features].astype(float)
    
    preds = model.predict(X_test)
    test_set.loc[test_set['Service Request Type'] == service, 'Predicted Request Count'] = preds

test_set.sample(10)


Unnamed: 0,Service Request Type,Year,Month,Predicted Request Count,Model
55,Parking Enforcement,2025,6,127.568182,Linear Regression
121,Traffic Signal Maintenance,2025,6,127.568182,Linear Regression
110,"Feedback about the Find It, Fix It mobile app",2025,7,140.659091,Linear Regression
86,Towing Impound Complaint,2025,7,140.659091,Linear Regression
109,"Feedback about the Find It, Fix It mobile app",2025,6,127.568182,Linear Regression
71,Public Litter and Recycling Cans,2025,7,140.659091,Linear Regression
103,Unauthorized Encampment,2025,6,127.568182,Linear Regression
76,Scooter or Bike Share Issue,2025,6,683.010496,Random Forest
43,Graffiti,2025,6,127.568182,Linear Regression
129,Bicycle Facility Maintenance,2025,5,114.477273,Linear Regression
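
Many rows in the table above share exactly the same prediction even though they belong to different service types. That is consistent with `trained_models` holding the same estimator object under several keys, since `models[model_name]` is refit in place on every iteration. A minimal sketch of the difference, using `sklearn.base.clone` to create an independent unfitted copy per service, follows. The two services and their counts are made up for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# Two made-up services with very different volumes; X is just a month index.
X = np.array([[1], [2], [3], [4]], dtype=float)
y_by_service = {
    "Quiet service": np.array([1.0, 2.0, 3.0, 4.0]),
    "Busy service": np.array([1000.0, 2000.0, 3000.0, 4000.0]),
}

template = LinearRegression()

# Shared instance: both dictionary entries point at the same object, so the
# second fit overwrites the first.
shared = {}
for name, y in y_by_service.items():
    shared[name] = template.fit(X, y)

# Cloned instances: each service gets its own unfitted copy before fitting.
independent = {}
for name, y in y_by_service.items():
    independent[name] = clone(template).fit(X, y)

X_next = np.array([[5.0]])
print("shared:     ", {k: round(float(m.predict(X_next)[0]), 1) for k, m in shared.items()})
print("independent:", {k: round(float(m.predict(X_next)[0]), 1) for k, m in independent.items()})
```

With the shared instance both services predict the busy service's value for month 5, while the cloned models predict 5 and 5000 respectively.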


In [20]:
# Sum the three monthly predictions into one total per service type.
final = test_set.groupby(['Service Request Type'])['Predicted Request Count'].sum().reset_index()
final.sample(10)

Unnamed: 0,Service Request Type,Predicted Request Count
2,Abandoned Vehicle/72hr Parking Ordinance,129.239525
0,ADA Request (Transportation),382.704545
10,Feedback about the Customer Service Requests P...,382.704545
9,Dead Animal,382.704545
22,Lost a Pet,382.704545
23,Nightlife Noise Complaint,129.239525
24,Nuisance dogs in a park,382.704545
13,General Inquiry - Animal Shelter,129.239525
17,General Inquiry - Public Utilities,382.704545
26,Parking Enforcement,382.704545


### Exports for Data Analysis & Visualization

In [None]:
# df.to_csv('Cleaned_Service.csv', index=False)
# Service_Type.to_csv('Service_Type.csv', index=False)
# model_df.to_csv('model_summary.csv', index=False)
# test_set.to_csv('predicted_monthly.csv', index=False)
# final.to_csv('predicted_quarterly.csv', index=False)

### Conclusion

Through time series modeling and predictive analysis of Seattle’s historical service request data, we forecasted the expected number of service requests for each Service Request Type over the next three months. By engineering features such as lag values and rolling statistics, and by tuning and evaluating multiple machine learning models (Linear Regression, Random Forest, LightGBM, and XGBoost), we identified the best-performing model for each service category.

Our results showed strong predictive performance on high-volume and stable service types, while naturally encountering greater variability in smaller, more irregular service types. This outcome reflects common patterns in real-world forecasting, where rare or seasonal services present higher prediction challenges.

Overall, the models and forecasts developed in this project provide a data-driven foundation for Seattle’s departments to anticipate resident needs, allocate resources more efficiently, and plan operational strategies for the coming months. Future work could extend this approach by incorporating external factors, such as weather or event calendars, to further refine forecasts for highly seasonal or irregular service types.