**Objective:**  
The goal of this project is to predict daily ambulance call demand in New York City, with a focus on borough-specific patterns.  
This analysis will also incorporate weather and holiday data to explore external factors influencing call volume.

**Dataset Source:**  
NYC Open Data API – [Link to API](https://data.cityofnewyork.us/resource/76xm-jjuj.json)  

**Notebook Outline:**  
1. Introduction  
2. Data Wrangling  
3. Feature Engineering (adding weather & holiday data)   
4. Exploratory Data Analysis (EDA)   
5. Model Selection & Training  
6. Evaluation & Conclusion


**Introduction**

Emergency response systems are a critical part of city infrastructure, and understanding patterns in emergency call demand can help allocate resources more efficiently. In this project, I analyze 911 call data for New York City to explore patterns by time, location, and external factors such as weather and holidays.
The primary goals of this project are:

- **Data Wrangling:** Collect and clean raw data from the NYC Open Data API.
- **Feature Engineering:** Create meaningful variables such as hour, day_of_week, is_holiday, and integrate external data sources like weather conditions.
- **Exploratory Data Analysis (EDA):** Visualize trends in call volume across boroughs and identify patterns related to time, weather, and holidays.
- **Modeling:** Build and evaluate predictive models (Linear Regression and Random Forest) to forecast daily call volume.
- **Insights:** Summarize findings and highlight factors most strongly associated with variations in call volume.

In [1]:
# Loading the libraries

import requests 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data Collection

The following code was used to collect the NYC EMS call data from the NYC Open Data API. It is shown as preformatted text rather than as a code cell to avoid accidental reruns, since the full request can take hours to complete.

<pre>
import requests
import pandas as pd

# API endpoint
url = "https://data.cityofnewyork.us/resource/76xm-jjuj.json"

# Initialize variables
all_data = []
limit = 1000            # rows per request
total_needed = 500000   # row cap: the loop stops here even if the date window holds more rows

# Loop to fetch data in batches
for offset in range(0, total_needed, limit):
    params = {
        "$where": "incident_datetime between '2025-01-01T00:00:00' and '2025-06-30T23:59:59'",
        "$order": "incident_datetime ASC",
        "$limit": limit,
        "$offset": offset
    }
    response = requests.get(url, params=params)
    response.raise_for_status()  # stop early on an HTTP error instead of parsing an error body
    batch = response.json()
    if not batch:
        break
    all_data.extend(batch)

# Convert to DataFrame
df = pd.DataFrame(all_data)
print(f"Fetched {len(df)} rows")

# Save to CSV to avoid re-fetching
df.to_csv('nyc_ambulance_call.csv', index=False)

# Load the saved CSV
df = pd.read_csv('nyc_ambulance_call.csv')
</pre>
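The paging logic above can be exercised offline with a small helper. This is a minimal sketch, not the code that built the CSV: `fetch_all` and `get_page` are hypothetical names, and `get_page` stands in for the `requests.get(...)` call so the loop can run without touching the API. The helper also reports whether the row cap was reached, which would mean the requested date window may not be fully covered.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def fetch_all(get_page: Callable[[int, int], List[Row]],
              limit: int = 1000,
              max_rows: int = 500_000) -> Tuple[List[Row], bool]:
    """Collect pages until the source is exhausted or max_rows is reached.

    get_page(offset, limit) is a hypothetical callable returning one batch.
    Returns (rows, cap_reached).
    """
    rows: List[Row] = []
    for offset in range(0, max_rows, limit):
        batch = get_page(offset, limit)
        if not batch:
            return rows, False  # source exhausted before the cap
        rows.extend(batch)
        if len(batch) < limit:
            return rows, False  # short page: nothing left to fetch
    return rows, True  # loop ended because of the cap


# Stand-in for the API: 2,500 fake records served in pages.
fake_source = [{"cad_incident_id": str(i)} for i in range(2500)]


def fake_get_page(offset: int, limit: int) -> List[Row]:
    return fake_source[offset:offset + limit]


rows, cap_reached = fetch_all(fake_get_page, limit=1000, max_rows=10_000)
print(len(rows), cap_reached)

capped_rows, capped = fetch_all(fake_get_page, limit=1000, max_rows=2000)
print(len(capped_rows), capped)
```

When `cap_reached` is true, the saved file should be treated as a prefix of the requested window rather than the whole window.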

In [2]:
# Load the csv file into dataframe
df = pd.read_csv('nyc_ambulance_call.csv')

In [3]:
print(f'df - number of rows : {df.shape[0]}')
print(f'df - number of columns : {df.shape[1]}')
df.head()

df - number of rows : 500000
df - number of columns : 31


Unnamed: 0,cad_incident_id,incident_datetime,initial_call_type,initial_severity_level_code,final_call_type,final_severity_level_code,first_assignment_datetime,valid_dispatch_rspns_time_indc,dispatch_response_seconds_qy,first_activation_datetime,...,communitydistrict,communityschooldistrict,congressionaldistrict,reopen_indicator,special_event_indicator,standby_indicator,transfer_indicator,incident_response_seconds_qy,first_to_hosp_datetime,first_hosp_arrival_datetime
0,250010001,2025-01-01T00:00:12.000,STNDBY,8,STNDBY,8,2025-01-01T09:53:41.000,N,0,2025-01-01T09:57:09.000,...,313.0,21.0,8.0,Y,N,Y,N,,,
1,250010003,2025-01-01T00:01:53.000,UNC,2,UNC,2,2025-01-01T00:02:06.000,Y,13,2025-01-01T00:02:43.000,...,104.0,2.0,10.0,N,N,N,N,454.0,2025-01-01T00:21:01.000,2025-01-01T00:39:03.000
2,250010004,2025-01-01T00:01:58.000,CARD,3,CARD,3,2025-01-01T00:01:58.000,Y,0,2025-01-01T00:01:58.000,...,104.0,2.0,10.0,N,N,N,N,0.0,,
3,250010007,2025-01-01T00:03:42.000,ABDPN,5,ABDPN,5,2025-01-01T00:03:59.000,Y,17,2025-01-01T00:04:12.000,...,112.0,6.0,13.0,N,N,N,N,654.0,2025-01-01T00:28:30.000,2025-01-01T00:37:31.000
4,250010008,2025-01-01T00:04:36.000,STATEP,2,STATEP,2,2025-01-01T00:05:06.000,Y,30,2025-01-01T00:05:18.000,...,212.0,11.0,16.0,N,N,N,N,284.0,2025-01-01T00:40:27.000,2025-01-01T00:50:50.000


In [4]:
# Dataframes columns
df.columns

Index(['cad_incident_id', 'incident_datetime', 'initial_call_type',
       'initial_severity_level_code', 'final_call_type',
       'final_severity_level_code', 'first_assignment_datetime',
       'valid_dispatch_rspns_time_indc', 'dispatch_response_seconds_qy',
       'first_activation_datetime', 'first_on_scene_datetime',
       'valid_incident_rspns_time_indc', 'incident_travel_tm_seconds_qy',
       'incident_close_datetime', 'held_indicator',
       'incident_disposition_code', 'borough', 'incident_dispatch_area',
       'zipcode', 'policeprecinct', 'citycouncildistrict', 'communitydistrict',
       'communityschooldistrict', 'congressionaldistrict', 'reopen_indicator',
       'special_event_indicator', 'standby_indicator', 'transfer_indicator',
       'incident_response_seconds_qy', 'first_to_hosp_datetime',
       'first_hosp_arrival_datetime'],
      dtype='object')

In [5]:
# df data types 
df.dtypes

cad_incident_id                     int64
incident_datetime                  object
initial_call_type                  object
initial_severity_level_code         int64
final_call_type                    object
final_severity_level_code           int64
first_assignment_datetime          object
valid_dispatch_rspns_time_indc     object
dispatch_response_seconds_qy        int64
first_activation_datetime          object
first_on_scene_datetime            object
valid_incident_rspns_time_indc     object
incident_travel_tm_seconds_qy     float64
incident_close_datetime            object
held_indicator                     object
incident_disposition_code          object
borough                            object
incident_dispatch_area             object
zipcode                           float64
policeprecinct                    float64
citycouncildistrict               float64
communitydistrict                 float64
communityschooldistrict           float64
congressionaldistrict             

In [6]:
# DataFrame info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 31 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   cad_incident_id                 500000 non-null  int64  
 1   incident_datetime               500000 non-null  object 
 2   initial_call_type               500000 non-null  object 
 3   initial_severity_level_code     500000 non-null  int64  
 4   final_call_type                 500000 non-null  object 
 5   final_severity_level_code       500000 non-null  int64  
 6   first_assignment_datetime       490831 non-null  object 
 7   valid_dispatch_rspns_time_indc  500000 non-null  object 
 8   dispatch_response_seconds_qy    500000 non-null  int64  
 9   first_activation_datetime       489951 non-null  object 
 10  first_on_scene_datetime         472388 non-null  object 
 11  valid_incident_rspns_time_indc  500000 non-null  object 
 12  incident_travel_

In [7]:
# max and min date

df.incident_datetime.min(), df.incident_datetime.max()

('2025-01-01T00:00:12.000', '2025-04-25T21:03:38.000')
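The collection request asked for calls through 2025-06-30, but the latest timestamp here is 2025-04-25 and the row count equals the 500,000-row cap in the collection loop, which suggests the cap was reached before the end of the window. A quick check along these lines can flag that situation. It is a sketch on toy data; `window_is_truncated`, `requested_end`, and `row_cap` are hypothetical names, not part of the original notebook.

```python
import pandas as pd


def window_is_truncated(timestamps: pd.Series, requested_end: str, row_cap: int) -> bool:
    """Return True when the row cap was hit before the requested end date."""
    latest = pd.to_datetime(timestamps).max()
    return bool(len(timestamps) >= row_cap and latest < pd.Timestamp(requested_end))


# Toy data: three rows, a cap of three, and data that stops before the requested end.
sample = pd.Series(["2025-01-01T00:00:12.000",
                    "2025-03-10T12:30:00.000",
                    "2025-04-25T21:03:38.000"])
truncated = window_is_truncated(sample, "2025-06-30T23:59:59", row_cap=3)
print(truncated)
```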

**Select only the columns relevant to predicting ambulance call demand by time and location. The incident_datetime column allows time-based analysis and forecasting. Borough and zipcode provide geographic context to understand spatial demand patterns. Other columns like initial_call_type and initial_severity_level_code help capture the nature and urgency of calls, which may affect demand. Removing unnecessary columns reduces data size, simplifies the analysis, and focuses the model on meaningful features.**

In [8]:
# Selecting the required columns 
cols=['incident_datetime','borough','zipcode','initial_call_type','initial_severity_level_code']

# Subsetting the dataframe 
df = df[cols]
df.iloc[0:100:10]

Unnamed: 0,incident_datetime,borough,zipcode,initial_call_type,initial_severity_level_code
0,2025-01-01T00:00:12.000,BROOKLYN,11224.0,STNDBY,8
10,2025-01-01T00:07:12.000,BRONX,10459.0,UNKNOW,4
20,2025-01-01T00:09:57.000,BRONX,10475.0,CARD,3
30,2025-01-01T00:14:16.000,QUEENS,11432.0,ABDPN,5
40,2025-01-01T00:16:49.000,MANHATTAN,10036.0,DRUG,5
50,2025-01-01T00:20:13.000,MANHATTAN,10038.0,SICK,6
60,2025-01-01T00:24:07.000,QUEENS,11433.0,SICK,6
70,2025-01-01T00:27:36.000,BROOKLYN,11234.0,STATEP,2
80,2025-01-01T00:31:50.000,QUEENS,11415.0,INJMAJ,3
90,2025-01-01T00:35:46.000,MANHATTAN,10036.0,DRUG,5


In [9]:
# Converting incident_datetime to pandas date time format
df['incident_datetime'] = pd.to_datetime(df['incident_datetime'], errors='coerce')

In [10]:
# Check datatypes 
df.dtypes

incident_datetime              datetime64[ns]
borough                                object
zipcode                               float64
initial_call_type                      object
initial_severity_level_code             int64
dtype: object

__**Feature Engineering:**  
Create new columns from the incident_datetime because the original timestamp is hard to use directly.  
'date' helps us group calls by day.  
'hour' shows what time of day calls happen.  
'day_of_week' is an integer code (0 for Monday, 1 for Tuesday, through 6 for Sunday) that identifies the day of the week.  
'month' lets us see if calls change by season.  
'is_weekend' marks weekends, since weekends can have different call patterns than weekdays.  
These features make it easier to find patterns and improve our predictions.__

In [11]:
# Feature engineering - creating new columns from incident_datetime 
df['date'] = df['incident_datetime'].dt.date
df['hour'] = df['incident_datetime'].dt.hour
df['day_of_week'] = df['incident_datetime'].dt.dayofweek
df['month'] = df['incident_datetime'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5,6]).astype(int)

In [12]:
df.head()

Unnamed: 0,incident_datetime,borough,zipcode,initial_call_type,initial_severity_level_code,date,hour,day_of_week,month,is_weekend
0,2025-01-01 00:00:12,BROOKLYN,11224.0,STNDBY,8,2025-01-01,0,2,1,0
1,2025-01-01 00:01:53,MANHATTAN,10001.0,UNC,2,2025-01-01,0,2,1,0
2,2025-01-01 00:01:58,MANHATTAN,10036.0,CARD,3,2025-01-01,0,2,1,0
3,2025-01-01 00:03:42,MANHATTAN,10040.0,ABDPN,5,2025-01-01,0,2,1,0
4,2025-01-01 00:04:36,BRONX,10466.0,STATEP,2,2025-01-01,0,2,1,0


In [13]:
# Rearranging the DataFrame columns by subsetting, excluding the original incident_datetime
cols = ['date', 'hour', 'day_of_week', 'is_weekend', 'month', 'borough', 'zipcode', 'initial_call_type', 'initial_severity_level_code']
df = df[cols]

In [14]:
df.head()

Unnamed: 0,date,hour,day_of_week,is_weekend,month,borough,zipcode,initial_call_type,initial_severity_level_code
0,2025-01-01,0,2,0,1,BROOKLYN,11224.0,STNDBY,8
1,2025-01-01,0,2,0,1,MANHATTAN,10001.0,UNC,2
2,2025-01-01,0,2,0,1,MANHATTAN,10036.0,CARD,3
3,2025-01-01,0,2,0,1,MANHATTAN,10040.0,ABDPN,5
4,2025-01-01,0,2,0,1,BRONX,10466.0,STATEP,2


__Checking for Missing Values__

Before adding features like holidays or weather, it's important to handle missing values in key columns:

- **`date`**: Needed to match holidays and weather data by time. Missing dates make time-based features unreliable.
- **`zipcode`** and **`borough`**: Needed to match the location for weather data. Missing values mean we can’t add weather accurately.


In [15]:
# Check for missing date
df.date.isna().sum()

np.int64(0)

In [16]:
# Check for missing values in zipcode
df.zipcode.isna().sum()

np.int64(12400)

__Analyzing missing zipcode rows__

In [17]:
missing_zip = df[df.zipcode.isna()]
missing_zip.tail()

Unnamed: 0,date,hour,day_of_week,is_weekend,month,borough,zipcode,initial_call_type,initial_severity_level_code
499789,2025-04-25,20,4,0,4,QUEENS,,SICMIN,7
499810,2025-04-25,20,4,0,4,BROOKLYN,,UNKNOW,4
499844,2025-04-25,20,4,0,4,BRONX,,MVAINJ,5
499923,2025-04-25,20,4,0,4,RICHMOND / STATEN ISLAND,,INJURY,5
499937,2025-04-25,20,4,0,4,QUEENS,,STNDBY,8


In [18]:
# check missing zipcode by borough
missing_zip.borough.value_counts()

borough
QUEENS                      4450
BROOKLYN                    2931
BRONX                       2260
MANHATTAN                   1383
RICHMOND / STATEN ISLAND    1375
UNKNOWN                        1
Name: count, dtype: int64

In [19]:
# Percentage of missing zipcodes by borough
total_counts = df.groupby('borough').size()

# Missing zipcode per borough
missing_zip = df[df.zipcode.isna()].groupby('borough').size()

# Missing percentage
missing_percentage = (missing_zip/total_counts)*100

print(missing_percentage)

borough
BRONX                         1.889680
BROOKLYN                      2.109847
MANHATTAN                     1.173775
QUEENS                        4.363986
RICHMOND / STATEN ISLAND      6.340496
UNKNOWN                     100.000000
dtype: float64


__Drop Rows with Missing `zipcode`__

The dataset has about **500,000 rows**, and roughly **1–6% of zipcodes are missing**, depending on the borough (the single `UNKNOWN` borough row is also missing its zipcode).

Since `zipcode` is important for adding **location-based features** like weather, missing values in this column reduce the accuracy of those features.

Because the share of affected rows is small, we choose to **drop rows with missing `zipcode`** rather than risk adding incorrect or incomplete location-based data.

This results in only a small data loss (~2.5%), which is acceptable for improving data quality.
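The per-borough and overall missing shares quoted above can also be computed in one pass from a boolean mask. The following is a minimal sketch on a toy frame (the values are made up for illustration), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "borough": ["QUEENS", "QUEENS", "BRONX", "BRONX", "BRONX", "UNKNOWN"],
    "zipcode": [11432.0, np.nan, 10459.0, 10475.0, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values, so this gives
# the share of missing zipcodes per borough directly.
missing_share = toy["zipcode"].isna().groupby(toy["borough"]).mean() * 100
overall_share = toy["zipcode"].isna().mean() * 100

print(missing_share.round(1))
print(round(overall_share, 1))
```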

In [20]:
# Drop missing zipcode values
df = df.dropna(subset=['zipcode'])

df.zipcode.isna().sum()

np.int64(0)

In [21]:
# Check the df for any other missing values
df.isna().sum()

date                           0
hour                           0
day_of_week                    0
is_weekend                     0
month                          0
borough                        0
zipcode                        0
initial_call_type              0
initial_severity_level_code    0
dtype: int64

__Zipcodes are identifiers, not quantities, so storing them as float (e.g., 10001.0) is misleading and can cause confusion.__
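Besides a plain integer cast, two other representations are sometimes used for ZIP codes. The sketch below shows them on toy values as an alternative; it is not what this notebook does. The nullable `Int64` dtype keeps missing entries, and the string form makes it explicit that the value is a label.

```python
import numpy as np
import pandas as pd

zips = pd.Series([10001.0, 11224.0, np.nan], name="zipcode")

# Nullable integer dtype keeps missing values as <NA> instead of forcing a drop.
zips_int = zips.astype("Int64")

# String form treats the ZIP code as a label; zfill pads to five characters.
zips_str = zips_int.astype("string").str.zfill(5)

print(zips_int.dtype)
print(zips_str.tolist())
```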

In [22]:
# check zipcode type
df.zipcode.dtype

dtype('float64')

In [23]:
# Converting zipcode to int
df['zipcode'] = df['zipcode'].astype(int)


In [24]:
df.dtypes

date                           object
hour                            int32
day_of_week                     int32
is_weekend                      int64
month                           int32
borough                        object
zipcode                         int64
initial_call_type              object
initial_severity_level_code     int64
dtype: object

In [25]:
# Converting date back to datetime 
df['date']=pd.to_datetime(df['date'])

In [26]:
df.date.dtype

dtype('<M8[ns]')

In [27]:
df.shape

(487600, 9)

__Add Holidays to the Dataset__

Holidays can affect ambulance call patterns because people behave differently on those days. There may be more accidents, health emergencies, or delayed care. 

We add a holiday column using the `USFederalHolidayCalendar` class from `pandas.tseries.holiday` to help the model understand and predict these changes more accurately.

In [28]:
# Importing USFederalHolidayCalendar
from pandas.tseries.holiday import USFederalHolidayCalendar   

# Generate calendar
calendar = USFederalHolidayCalendar()

# Create US holidays using the df date range
us_holidays = calendar.holidays(start='2025-01-01', end='2025-04-25')

In [29]:
# Create new column is_holiday. True=1, False=0
df['is_holiday']=df['date'].isin(us_holidays).astype(int)
df.head()

Unnamed: 0,date,hour,day_of_week,is_weekend,month,borough,zipcode,initial_call_type,initial_severity_level_code,is_holiday
0,2025-01-01,0,2,0,1,BROOKLYN,11224,STNDBY,8,1
1,2025-01-01,0,2,0,1,MANHATTAN,10001,UNC,2,1
2,2025-01-01,0,2,0,1,MANHATTAN,10036,CARD,3,1
3,2025-01-01,0,2,0,1,MANHATTAN,10040,ABDPN,5,1
4,2025-01-01,0,2,0,1,BRONX,10466,STATEP,2,1


In [30]:
df.is_holiday.sum()

np.int64(12224)
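As a sanity check on the holiday flag, it can help to list which dates the calendar returns for the window and to confirm how `isin` matches them. The sketch below uses toy timestamps. It illustrates that `isin` compares full timestamps, so values carrying a time of day need to be normalized to midnight first (the `date` column in this notebook already is).

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

holidays = USFederalHolidayCalendar().holidays(start="2025-01-01", end="2025-04-25")
holiday_dates = holidays.strftime("%Y-%m-%d").tolist()
print(holiday_dates)

# Toy timestamps: one on a federal holiday afternoon, one on the following day.
stamps = pd.Series(pd.to_datetime(["2025-01-20 14:05:00", "2025-01-21 09:00:00"]))

# Without normalize(), the 14:05 timestamp would not equal the midnight holiday entry.
flags = stamps.dt.normalize().isin(holidays).astype(int)
print(flags.tolist())
```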

In [31]:
df.date.min(), df.date.max()

(Timestamp('2025-01-01 00:00:00'), Timestamp('2025-04-25 00:00:00'))

**Feature Engineering: Rationale** 
  
**1. Daily Aggregation of Ambulance Calls**   
Why: The goal is to predict daily call volumes, not hourly.  
Reason: Aggregating to daily level simplifies modeling, reduces noise from short-term fluctuations, and aligns with how trends like holidays, weather, or weekdays typically influence behavior.  
How: Count total calls per day per borough or ZIP.

**2. Mean, Min, and Max Severity Levels**   
Why: Severity level ranges from **1 (most severe) to 8 (least severe)**.      
Reason: These statistical summaries help capture the nature of the calls on a given day:   
mean_severity_level: shows the general level of urgency (lower = more severe).  
max_severity_level: the lowest code of the day, i.e. the most urgent call; it shows whether very serious cases occurred.  
min_severity: the highest code of the day, i.e. the least urgent call; together with max_severity_level it captures the full range of severity on that day.  
Insight: A day dominated by high severity codes (7–8) likely reflects routine calls, while a lower mean suggests a more stressful or crisis-heavy day.  

**3. Non-Urgent Call Volume**  
Why: To distinguish between emergency and non-emergency workload.  
Reason: High volumes of non-urgent calls (the least severe codes) can still strain resources and impact emergency response.  
How: Count the number of calls with severity level ≥ 6 per day, which is the threshold used in the aggregation code.  
Insight: Helps understand the composition of daily workload, not just the total volume.  

**4. Lag Features (Previous Day’s Call Volume)**     
Why: To capture temporal trends and patterns.  
Reason: Ambulance call volume often shows autocorrelation: a high call volume one day might be followed by a similar volume the next.  
How: Create lag features like:  
lag_1: number of calls on the previous day  
lag_7: calls from the same weekday last week  
Insight: Helps the model learn short-term patterns and anomalies (e.g., ongoing incidents, seasonal waves, or weekends).  
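To illustrate the lag idea on a small scale, the sketch below builds `lag_1` per borough on a toy frame (some values are made up for illustration). The point is that shifting inside a `groupby` keeps each borough's history separate.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"] * 2),
    "borough": ["BRONX"] * 3 + ["QUEENS"] * 3,
    "daily_call_vol": [954, 1116, 1116, 890, 1001, 987],  # toy values
})

toy = toy.sort_values(["borough", "date"])

# shift() inside groupby keeps each borough's history separate, so the first
# QUEENS row gets NaN rather than the last BRONX value.
toy["lag_1"] = toy.groupby("borough")["daily_call_vol"].shift(1)
print(toy)
```

A plain `toy["daily_call_vol"].shift(1)` would instead carry the last BRONX value into the first QUEENS row.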



In [32]:
# Confirming date is in datetime format
df.date.dtype

dtype('<M8[ns]')

In [33]:
# Aggregate daily ambulance call volume per borough

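daily_calls = df.groupby(['date','borough']).agg(
    daily_call_vol=('hour','count'),
    mean_severity_level=('initial_severity_level_code','mean'),
    # Severity codes run from 1 (most severe) to 8 (least severe), so the
    # highest severity of the day is the minimum code, and vice versa.
    max_severity_level=('initial_severity_level_code', 'min'),
    min_severity=('initial_severity_level_code','max'),
    # Calls with severity code 6 or higher are counted as non-urgent
    non_urgent_call=('initial_severity_level_code', lambda x: (x>=6).sum()),
    # Calendar flags are constant within a day, so the first value is enough
    day_of_week=('day_of_week','first'),
    is_weekend=('is_weekend','first'),
    is_holiday=('is_holiday','first')
).reset_index()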

In [34]:
daily_calls.head()

Unnamed: 0,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,non_urgent_call,day_of_week,is_weekend,is_holiday
0,2025-01-01,BRONX,954,4.091195,1,7,205,2,0,1
1,2025-01-01,BROOKLYN,1140,4.160526,1,8,233,2,0,1
2,2025-01-01,MANHATTAN,1049,4.173499,1,8,205,2,0,1
3,2025-01-01,QUEENS,890,4.196629,1,8,187,2,0,1
4,2025-01-01,RICHMOND / STATEN ISLAND,157,4.401274,1,8,40,2,0,1


In [35]:
# Adding lag features by sorting borough and date
 
daily_calls = daily_calls.sort_values(['borough','date'])
daily_calls['lag_1'] = daily_calls.groupby('borough')['daily_call_vol'].shift(1)
daily_calls['lag_7'] = daily_calls.groupby('borough')['daily_call_vol'].shift(7)

In [36]:
daily_calls.iloc[0:56:7]

Unnamed: 0,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,non_urgent_call,day_of_week,is_weekend,is_holiday,lag_1,lag_7
0,2025-01-01,BRONX,954,4.091195,1,7,205,2,0,1,,
35,2025-01-08,BRONX,998,4.113226,1,8,256,2,0,0,1092.0,954.0
70,2025-01-15,BRONX,1114,4.192101,1,8,274,2,0,0,1095.0,998.0
105,2025-01-22,BRONX,1042,4.216891,1,8,283,2,0,0,1127.0,1114.0
140,2025-01-29,BRONX,1051,4.139867,1,8,273,2,0,0,1092.0,1042.0
175,2025-02-05,BRONX,1044,4.249042,1,8,273,2,0,0,1083.0,1051.0
210,2025-02-12,BRONX,1014,4.21499,1,8,267,2,0,0,1072.0,1044.0
245,2025-02-19,BRONX,995,4.18191,1,8,253,2,0,0,1018.0,1014.0


In [37]:
daily_calls.shape

(575, 12)

In [38]:
daily_calls.dtypes

date                   datetime64[ns]
borough                        object
daily_call_vol                  int64
mean_severity_level           float64
max_severity_level              int64
min_severity                    int64
non_urgent_call                 int64
day_of_week                     int32
is_weekend                      int64
is_holiday                      int64
lag_1                         float64
lag_7                         float64
dtype: object

In [39]:
# Checking for missing values
daily_calls.isna().sum()

date                    0
borough                 0
daily_call_vol          0
mean_severity_level     0
max_severity_level      0
min_severity            0
non_urgent_call         0
day_of_week             0
is_weekend              0
is_holiday              0
lag_1                   5
lag_7                  35
dtype: int64

**Standardizing Borough Names**

Before merging the daily calls and weather datasets, we need to ensure that the borough names are consistent.  

- The `daily_calls` DataFrame has borough names in **all caps**, while `weather_daily` uses **title case**.  
- To avoid merge mismatches, we **normalize the strings** by converting them to title case and trimming any whitespace.

In [40]:
# Checking for string consistency: 
daily_calls.borough.unique()

array(['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS',
       'RICHMOND / STATEN ISLAND'], dtype=object)

In [41]:
# Standardize borough names in daily_calls
daily_calls['borough'] =daily_calls['borough'].str.strip().str.title()

In [42]:
daily_calls.borough.unique()

array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens',
       'Richmond / Staten Island'], dtype=object)

**In the `daily_calls` dataset, Staten Island is labeled as `'Richmond / Staten Island'`.  
To ensure consistency with the `weather_daily` dataset, we replace it with the standard name `'Staten Island'`.**

In [43]:
daily_calls['borough'] = daily_calls['borough'].replace({'Richmond / Staten Island':'Staten Island'})

In [44]:
daily_calls.borough.unique()

array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
      dtype=object)

In [45]:
# Saving it as csv file before joining weather data
daily_calls.to_csv('daily_ambulance_calls.csv')

## Adding Weather Data for Ambulance Call Prediction
Weather conditions can have a direct and indirect impact on ambulance call volumes. Adding weather data enriches the feature space with external, real-world influences that the model can learn from. While it’s not always the dominant factor, weather adds important context, especially during extreme or seasonal changes.

## Weather Data Summary  
To add environmental context to the ambulance call data, we collected weather data in a separate notebook. There we used the Open-Meteo **archive API** to get hourly weather data for all five NYC boroughs from **2025-01-01 to 2025-04-30**.  
The variables we collect are:

- `temperature_2m` — hourly temperature (°C)  
- `precipitation` — hourly precipitation (mm)  
- `snowfall` — new snow per hour (cm)  
- `snow_depth` — total snow on ground (m, the Open-Meteo default unit)  

We store each borough’s data in a DataFrame and combine them into a single CSV(**'nyc_boroughs_weather.csv'**) for reproducibility.
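The collection step lives in another notebook, so the following is only a sketch of how such a request could be assembled and parsed, not the code that produced the CSV. `build_params` and `hourly_to_frame` are hypothetical helper names, the coordinates are approximate and chosen for illustration, and the payload is hand-written in the shape of the API's `hourly` block so that nothing is fetched here.

```python
import pandas as pd

# Endpoint of the Open-Meteo historical (archive) API.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
HOURLY_VARS = ["temperature_2m", "precipitation", "snowfall", "snow_depth"]


def build_params(lat: float, lon: float, start: str, end: str) -> dict:
    """Query parameters for one location (coordinates supplied by the caller)."""
    return {
        "latitude": lat,
        "longitude": lon,
        "start_date": start,
        "end_date": end,
        "hourly": ",".join(HOURLY_VARS),
    }


def hourly_to_frame(payload: dict, borough: str) -> pd.DataFrame:
    """Turn the 'hourly' block of a response into a tidy DataFrame."""
    frame = pd.DataFrame(payload["hourly"])
    frame["time"] = pd.to_datetime(frame["time"])
    frame["borough"] = borough
    return frame


# Hand-written payload used instead of a live call (temperatures in °C).
fake_payload = {"hourly": {
    "time": ["2025-01-01T00:00", "2025-01-01T01:00"],
    "temperature_2m": [7.4, 8.1],
    "precipitation": [3.7, 0.9],
    "snowfall": [0.0, 0.0],
    "snow_depth": [0.0, 0.0],
}}

# Approximate Manhattan coordinates, for illustration only.
params = build_params(40.78, -73.97, "2025-01-01", "2025-04-30")
weather_sample = hourly_to_frame(fake_payload, "Manhattan")
print(params["hourly"])
print(weather_sample)
```

A live request would pass `params` to `requests.get(ARCHIVE_URL, params=params)` and feed the decoded JSON to `hourly_to_frame`, once per borough, before concatenating the frames.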


In [46]:
# Load weather data 
weather = pd.read_csv('nyc_boroughs_weather.csv')

In [47]:
# Randomly selecting 10 rows
weather.sample(n=10, random_state=42)

Unnamed: 0,time,temperature_2m,precipitation,snowfall,snow_depth,borough
3025,2025-01-07 01:00:00,-6.7,0.0,0.0,0.02,Brooklyn
7182,2025-03-01 06:00:00,5.8,0.0,0.0,0.0,Queens
3492,2025-01-26 12:00:00,2.4,0.0,0.0,0.04,Brooklyn
6685,2025-02-08 13:00:00,0.3,0.0,0.0,0.01,Queens
9099,2025-01-20 03:00:00,-4.9,0.0,0.0,0.05,Bronx
11672,2025-01-07 08:00:00,-6.4,0.0,0.0,0.02,Staten Island
10297,2025-03-11 01:00:00,7.9,0.0,0.0,0.0,Bronx
3860,2025-02-10 20:00:00,-3.6,0.0,0.0,0.06,Brooklyn
6178,2025-01-18 10:00:00,5.5,0.0,0.0,0.01,Queens
2861,2025-04-30 05:00:00,19.8,0.0,0.0,0.0,Manhattan


In [48]:
# Convert the temperature column from Celsius to Fahrenheit (F = C * 9/5 + 32)

weather['temperature_2m'] = weather['temperature_2m']* 9/5 + 32

In [49]:
weather.head()

Unnamed: 0,time,temperature_2m,precipitation,snowfall,snow_depth,borough
0,2025-01-01 00:00:00,45.32,3.7,0.0,0.0,Manhattan
1,2025-01-01 01:00:00,46.58,0.9,0.0,0.0,Manhattan
2,2025-01-01 02:00:00,47.3,0.0,0.0,0.0,Manhattan
3,2025-01-01 03:00:00,47.84,0.0,0.0,0.0,Manhattan
4,2025-01-01 04:00:00,47.84,0.0,0.0,0.0,Manhattan


In [50]:
weather.shape

(14400, 6)

In [51]:
# Weather data type
weather.dtypes

time               object
temperature_2m    float64
precipitation     float64
snowfall          float64
snow_depth        float64
borough            object
dtype: object

**Convert the weather 'time' column to datetime and rename it to 'date'**

In [52]:
# Convert 'time' to datetime
weather['time'] = pd.to_datetime(weather['time'])

# Rename column to 'date'
weather.rename(columns={'time':'date'}, inplace=True)

# Set 'date' as the index so the data can be resampled
weather.set_index('date', inplace = True)

**Resample Hourly Weather Data to Daily**

The ambulance call data is **daily**, so we aggregate hourly weather to daily values (**mean**) to match the target frequency.  
Daily aggregation reduces noise from hourly fluctuations and makes features like average temperature, mean hourly precipitation, or snow depth easier for the model to use.
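For precipitation, a daily total is a common alternative to the daily mean of hourly values. The sketch below computes both side by side on toy hourly data so the difference is visible. It uses `pd.Grouper` rather than `resample`, and the column names `precip_mean` and `precip_total` are illustrative, not the ones used in this notebook.

```python
import pandas as pd

hours = pd.date_range("2025-01-01", periods=48, freq="h")
toy_weather = pd.DataFrame({
    "time": hours,
    "borough": "Bronx",
    "temperature_2m": 40.0,
    "precipitation": [0.5] * 24 + [0.0] * 24,  # mm per hour, toy values
})

daily = (toy_weather
         .groupby(["borough", pd.Grouper(key="time", freq="D")])
         .agg(temp_mean=("temperature_2m", "mean"),
              precip_mean=("precipitation", "mean"),
              precip_total=("precipitation", "sum"))
         .reset_index())
print(daily)
```

For a full day of hourly readings the mean is the total divided by 24, so the two carry the same information on a different scale.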

In [53]:
# Resampling data to daily
weather_daily = (weather.groupby('borough')
                .resample('D')
                .agg(
                temp =('temperature_2m','mean'),
                precp = ('precipitation', 'mean'),
                snowfall = ('snowfall','mean'),
                snow_depth =('snow_depth', 'mean')
                )
                 .reset_index()
                )
weather_daily.head()    


Unnamed: 0,borough,date,temp,precp,snowfall,snow_depth
0,Bronx,2025-01-01,45.32,0.254167,0.0,0.0
1,Bronx,2025-01-02,37.13,0.0,0.0,0.0
2,Bronx,2025-01-03,32.9825,0.0,0.0,0.0
3,Bronx,2025-01-04,29.6975,0.0,0.0,0.0
4,Bronx,2025-01-05,28.565,0.0,0.0,0.0


In [54]:
weather_daily.shape

(600, 6)

In [55]:
# Check weather_daily for missing values 
weather_daily.isna().sum()

borough       0
date          0
temp          0
precp         0
snowfall      0
snow_depth    0
dtype: int64

In [56]:
# Checking for string consistency
weather_daily.borough.unique()

array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
      dtype=object)

**Merge Weather with Daily Ambulance Calls**

The `weather_daily` dataset has **no missing values** and is now ready to be merged with `daily_calls` on `date` and `borough` using a **left join** to preserve all ambulance records.
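A left join silently leaves weather columns empty when a key does not match, for example because of a borough-name mismatch. One way to guard against that, sketched here on toy frames, is to use the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

calls = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-01"]),
    "borough": ["Bronx", "Staten Island"],
    "daily_call_vol": [954, 157],
})
wx = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01"]),
    "borough": ["Bronx"],
    "temp": [45.32],
})

# validate raises if either side has duplicate (date, borough) keys;
# indicator adds a _merge column describing where each row came from.
merged = calls.merge(wx, on=["date", "borough"], how="left",
                     validate="one_to_one", indicator=True)

# Rows tagged 'left_only' had no weather match.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["date", "borough"]])
```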

In [57]:
# Merge weather_daily into daily_calls to create daily_amb_calls

daily_amb_calls = daily_calls.merge(weather_daily, on=['date','borough'], how='left')

In [58]:
daily_amb_calls.head()

Unnamed: 0,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,non_urgent_call,day_of_week,is_weekend,is_holiday,lag_1,lag_7,temp,precp,snowfall,snow_depth
0,2025-01-01,Bronx,954,4.091195,1,7,205,2,0,1,,,45.32,0.254167,0.0,0.0
1,2025-01-02,Bronx,1116,4.144265,1,8,278,3,0,0,954.0,,37.13,0.0,0.0,0.0
2,2025-01-03,Bronx,1116,4.229391,1,8,276,4,0,0,1116.0,,32.9825,0.0,0.0,0.0
3,2025-01-04,Bronx,1035,4.173913,1,8,231,5,1,0,1116.0,,29.6975,0.0,0.0,0.0
4,2025-01-05,Bronx,964,4.11722,1,8,237,6,1,0,1035.0,,28.565,0.0,0.0,0.0


In [59]:
daily_amb_calls.isna().sum()

date                    0
borough                 0
daily_call_vol          0
mean_severity_level     0
max_severity_level      0
min_severity            0
non_urgent_call         0
day_of_week             0
is_weekend              0
is_holiday              0
lag_1                   5
lag_7                  35
temp                    0
precp                   0
snowfall                0
snow_depth              0
dtype: int64

**Handling Missing Lag Features**

The `lag_1` and `lag_7` features represent the number of calls from 1 day and 7 days prior.  
- By definition, the first day of each borough has no `lag_1`, and the first 7 days have no `lag_7`.  
- These missing values are **not errors** but a natural consequence of computing lags.

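Filling the missing `lag_1` and `lag_7` values with the mean of that lag column within the same borough gives each row a reasonable historical value without introducing large distortions. Note that this mean is computed over the whole period, so the filled rows are approximations rather than true history.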
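An alternative to mean-filling is to drop the warm-up rows, so that no filled value depends on later days. The sketch below contrasts the two options on a toy frame. Which one is preferable depends on how much data can be spared, and it is offered here only as an illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "borough": ["Bronx"] * 4,
    "daily_call_vol": [954, 1116, 1116, 1035],
    "lag_1": [np.nan, 954.0, 1116.0, 1116.0],
})

# Option used in this notebook: fill with the borough's mean lag value.
filled = toy.assign(
    lag_1=toy.groupby("borough")["lag_1"].transform(lambda s: s.fillna(s.mean()))
)

# Alternative: drop the warm-up rows that have no true lag value.
dropped = toy.dropna(subset=["lag_1"]).reset_index(drop=True)

print(filled["lag_1"].tolist())
print(len(dropped))
```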

In [64]:
# Fill missing lag_1 values with the mean lag_1 value of the same borough
daily_amb_calls['lag_1'] = daily_amb_calls.groupby('borough')['lag_1'].transform(lambda x: x.fillna(x.mean()))

# Fill missing lag_7 values with the mean lag_7 value of the same borough
daily_amb_calls['lag_7'] = daily_amb_calls.groupby('borough')['lag_7'] \
                                          .transform(lambda x: x.fillna(x.mean()))

In [65]:
# Check for missing values 
daily_amb_calls.isna().sum()

date                   0
borough                0
daily_call_vol         0
mean_severity_level    0
max_severity_level     0
min_severity           0
non_urgent_call        0
day_of_week            0
is_weekend             0
is_holiday             0
lag_1                  0
lag_7                  0
temp                   0
precp                  0
snowfall               0
snow_depth             0
dtype: int64

**Save `daily_amb_calls` as CSV**

In [66]:
daily_amb_calls.to_csv('daily_ambulance_calls.csv')

## The merged dataset is now clean, consistent, and ready for exploratory data analysis (EDA).

In [67]:
daily_amb_calls.head()

Unnamed: 0,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,non_urgent_call,day_of_week,is_weekend,is_holiday,lag_1,lag_7,temp,precp,snowfall,snow_depth
0,2025-01-01,Bronx,954,4.091195,1,7,205,2,0,1,1020.859649,1018.12963,45.32,0.254167,0.0,0.0
1,2025-01-02,Bronx,1116,4.144265,1,8,278,3,0,0,954.0,1018.12963,37.13,0.0,0.0,0.0
2,2025-01-03,Bronx,1116,4.229391,1,8,276,4,0,0,1116.0,1018.12963,32.9825,0.0,0.0,0.0
3,2025-01-04,Bronx,1035,4.173913,1,8,231,5,1,0,1116.0,1018.12963,29.6975,0.0,0.0,0.0
4,2025-01-05,Bronx,964,4.11722,1,8,237,6,1,0,1035.0,1018.12963,28.565,0.0,0.0,0.0
