**Objective:**  
The goal of this project is to predict daily ambulance call demand in New York City, with a focus on borough-specific patterns.  
This analysis will also incorporate weather and holiday data to explore external factors influencing call volume.

**Dataset Source:**  
NYC Open Data API – [Link to API](https://data.cityofnewyork.us/resource/76xm-jjuj.json)  

**Notebook Outline:**  
1. Introduction  
2. Data Wrangling  
3. Feature Engineering (adding weather & holiday data)   
4. Exploratory Data Analysis (EDA)   
5. Model Selection & Training  
6. Evaluation & Conclusion


**Introduction**

Emergency response systems are a critical part of city infrastructure, and understanding patterns in emergency call demand can help allocate resources more efficiently. In this project, I analyze EMS call data for New York City to explore patterns by time, location, and external factors such as weather and holidays.

The primary goals of this project are:

- **Data Wrangling**: Collect and clean raw data from the NYC Open Data API.
- **Feature Engineering**: Create meaningful variables such as `hour`, `day_of_week`, and `is_holiday`, and integrate external data sources like weather conditions.
- **Exploratory Data Analysis (EDA)**: Visualize trends in call volume across boroughs and identify patterns related to time, weather, and holidays.
- **Modeling**: Build and evaluate predictive models (Linear Regression and Random Forest) to forecast daily call volume.
- **Insights**: Summarize findings and highlight the factors most strongly associated with variations in call volume.

In [1]:
# Loading the libraries

import requests 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## NYC EMS Call Data Collection


We collected NYC EMS call data from the **NYC Open Data API**.  

- **Date Range:** July 1, 2024 → April 25, 2025  
- **Method:** Multiple smaller requests over different date ranges, then merged into a single dataset.  
- **Storage:** Saved as `nyc_ambulance_call_data.csv` (the file loaded in the next cell) to avoid re-fetching.  

> **Note:** The dataset is large; running the full code may take hours. This is provided for reference.  
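
The method above mentions splitting the download into smaller date ranges. A minimal sketch of how such ranges could be generated, assuming one window per calendar month; the helper name `month_windows` and the monthly granularity are illustrative assumptions, not the ranges actually used for this dataset:

```python
import pandas as pd


def month_windows(start, end):
    """Split [start, end] into consecutive calendar-month windows."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    windows = []
    cursor = start
    while cursor <= end:
        # Midnight on the first day of the following month
        next_month = (cursor + pd.offsets.MonthBegin(1)).normalize()
        # A window ends one second before the next month, or at `end`
        window_end = min(next_month - pd.Timedelta(seconds=1), end)
        windows.append((cursor.strftime("%Y-%m-%dT%H:%M:%S"),
                        window_end.strftime("%Y-%m-%dT%H:%M:%S")))
        cursor = next_month
    return windows


windows = month_windows("2024-07-01T00:00:00", "2025-04-25T23:59:59")

# Each window can be substituted into the `$where` clause of a request
where_clauses = [f"incident_datetime between '{a}' and '{b}'" for a, b in windows]
print(len(windows), where_clauses[0])
```

Each window can then be paged through separately and the resulting frames combined with `pd.concat`.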

```python
import requests
import pandas as pd

url = "https://data.cityofnewyork.us/resource/76xm-jjuj.json"
all_data = []
limit = 1000   # rows per request
offset = 0

# Page through the results until the API returns an empty batch
while True:
    params = {
        "$where": "incident_datetime between '2024-07-01T00:00:00' and '2025-04-25T23:59:59'",
        "$order": "incident_datetime ASC",
        "$limit": limit,
        "$offset": offset
    }
    response = requests.get(url, params=params)
    response.raise_for_status()   # stop on HTTP errors instead of parsing an error body
    batch = response.json()
    if not batch:
        break
    all_data.extend(batch)
    offset += limit

df = pd.DataFrame(all_data)
# Save under the file name that the next cell loads
df.to_csv('nyc_ambulance_call_data.csv', index=False)
```


In [2]:
# Load the csv file into dataframe
df = pd.read_csv('nyc_ambulance_call_data.csv')

In [3]:
print(f'df - number of rows : {df.shape[0]}')
print(f'df - number of columns : {df.shape[1]}')
df.head()      

df - number of rows : 1308614
df - number of columns : 32


,Unnamed: 0,cad_incident_id,incident_datetime,initial_call_type,initial_severity_level_code,final_call_type,final_severity_level_code,first_assignment_datetime,valid_dispatch_rspns_time_indc,dispatch_response_seconds_qy,...,citycouncildistrict,communitydistrict,communityschooldistrict,congressionaldistrict,reopen_indicator,special_event_indicator,standby_indicator,transfer_indicator,first_to_hosp_datetime,first_hosp_arrival_datetime
0,0,241830001,2024-07-01T00:00:02.000,STNDBY,8,STNDBY,8,2024-07-01T06:19:55.000,Y,22793,...,26.0,402.0,30.0,12.0,Y,N,Y,N,,
1,1,241830002,2024-07-01T00:00:04.000,STNDBY,8,STNDBY,8,2024-07-01T14:38:23.000,N,0,...,8.0,111.0,4.0,12.0,Y,N,Y,N,,
2,2,241830003,2024-07-01T00:00:06.000,EDP,7,EDP,7,,N,0,...,5.0,108.0,2.0,12.0,N,N,N,N,,
3,3,241830004,2024-07-01T00:00:12.000,STNDBY,8,STNDBY,8,2024-07-01T06:28:45.000,Y,23313,...,8.0,111.0,4.0,12.0,Y,N,Y,N,,
4,4,241830007,2024-07-01T00:00:22.000,SEIZR,3,SEIZR,3,2024-07-01T00:00:41.000,Y,19,...,17.0,201.0,7.0,15.0,N,N,N,N,2024-07-01T00:23:56.000,2024-07-01T00:34:20.000


In [4]:
# DataFrame columns
df.columns

Index(['Unnamed: 0', 'cad_incident_id', 'incident_datetime',
       'initial_call_type', 'initial_severity_level_code', 'final_call_type',
       'final_severity_level_code', 'first_assignment_datetime',
       'valid_dispatch_rspns_time_indc', 'dispatch_response_seconds_qy',
       'first_activation_datetime', 'first_on_scene_datetime',
       'valid_incident_rspns_time_indc', 'incident_response_seconds_qy',
       'incident_travel_tm_seconds_qy', 'incident_close_datetime',
       'held_indicator', 'incident_disposition_code', 'borough',
       'incident_dispatch_area', 'zipcode', 'policeprecinct',
       'citycouncildistrict', 'communitydistrict', 'communityschooldistrict',
       'congressionaldistrict', 'reopen_indicator', 'special_event_indicator',
       'standby_indicator', 'transfer_indicator', 'first_to_hosp_datetime',
       'first_hosp_arrival_datetime'],
      dtype='object')
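
The `Unnamed: 0` column in the listing above is most likely a row index written by an earlier `to_csv` call rather than a field of the EMS dataset. A small sketch of how such a column appears and one way to drop it, using a toy frame with illustrative values:

```python
import io
import pandas as pd

# Toy frame standing in for the downloaded data (values are illustrative only)
raw = pd.DataFrame({"cad_incident_id": [1, 2], "borough": ["BRONX", "QUEENS"]})

# With the default index=True, to_csv writes the index as an unnamed first column
buffer = io.StringIO()
raw.to_csv(buffer)
buffer.seek(0)

# read_csv labels that unnamed column "Unnamed: 0"
reloaded = pd.read_csv(buffer)

# Keep only the columns whose name does not start with "Unnamed"
cleaned = reloaded.loc[:, ~reloaded.columns.str.startswith("Unnamed")]

print(list(reloaded.columns))
print(list(cleaned.columns))
```

In this notebook the column is discarded anyway when the relevant columns are selected later, so this is only a convenience.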

In [5]:
# df data types 
df.dtypes

Unnamed: 0                          int64
cad_incident_id                     int64
incident_datetime                  object
initial_call_type                  object
initial_severity_level_code         int64
final_call_type                    object
final_severity_level_code           int64
first_assignment_datetime          object
valid_dispatch_rspns_time_indc     object
dispatch_response_seconds_qy        int64
first_activation_datetime          object
first_on_scene_datetime            object
valid_incident_rspns_time_indc     object
incident_response_seconds_qy      float64
incident_travel_tm_seconds_qy     float64
incident_close_datetime            object
held_indicator                     object
incident_disposition_code          object
borough                            object
incident_dispatch_area             object
zipcode                           float64
policeprecinct                    float64
citycouncildistrict               float64
communitydistrict                 

In [6]:
# DataFrame info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308614 entries, 0 to 1308613
Data columns (total 32 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   Unnamed: 0                      1308614 non-null  int64  
 1   cad_incident_id                 1308614 non-null  int64  
 2   incident_datetime               1308614 non-null  object 
 3   initial_call_type               1308614 non-null  object 
 4   initial_severity_level_code     1308614 non-null  int64  
 5   final_call_type                 1308614 non-null  object 
 6   final_severity_level_code       1308614 non-null  int64  
 7   first_assignment_datetime       1283939 non-null  object 
 8   valid_dispatch_rspns_time_indc  1308614 non-null  object 
 9   dispatch_response_seconds_qy    1308614 non-null  int64  
 10  first_activation_datetime       1281483 non-null  object 
 11  first_on_scene_datetime         1233989 non-null  object 
 12  

In [7]:
# max and min date

df.incident_datetime.min(), df.incident_datetime.max()

('2024-07-01T00:00:02.000', '2025-04-25T21:03:38.000')

**Select only the columns relevant to predicting ambulance call demand by time and location. The `incident_datetime` column allows time-based analysis and forecasting. `borough` and `zipcode` provide geographic context for spatial demand patterns. `initial_call_type` and `initial_severity_level_code` capture the nature and urgency of calls, which may affect demand. Removing the remaining columns reduces data size, simplifies the analysis, and focuses the model on meaningful features.**

In [8]:
# Selecting the required columns 
cols=['incident_datetime','borough','zipcode','initial_call_type','initial_severity_level_code']

# Subsetting the dataframe (copy so later column assignments do not raise SettingWithCopyWarning)
df = df[cols].copy()
df.iloc[0:100:10]

,incident_datetime,borough,zipcode,initial_call_type,initial_severity_level_code
0,2024-07-01T00:00:02.000,QUEENS,11101.0,STNDBY,8
10,2024-07-01T00:00:59.000,BROOKLYN,11214.0,ARREST,1
20,2024-07-01T00:03:31.000,BROOKLYN,11220.0,SICK,6
30,2024-07-01T00:06:33.000,BROOKLYN,11212.0,SICK,6
40,2024-07-01T00:09:01.000,BROOKLYN,11204.0,INJURY,5
50,2024-07-01T00:13:16.000,BROOKLYN,11224.0,CARD,3
60,2024-07-01T00:18:35.000,BRONX,10453.0,INJURY,5
70,2024-07-01T00:21:59.000,MANHATTAN,10014.0,INJURY,5
80,2024-07-01T00:25:11.000,MANHATTAN,10035.0,RESPIR,4
90,2024-07-01T00:29:02.000,MANHATTAN,10031.0,UNKNOW,4


In [9]:
# Converting incident_datetime to pandas date time format
df['incident_datetime'] = pd.to_datetime(df['incident_datetime'], errors='coerce')

In [10]:
# Check datatypes 
df.dtypes

incident_datetime              datetime64[ns]
borough                                object
zipcode                               float64
initial_call_type                      object
initial_severity_level_code             int64
dtype: object

**Feature Engineering:**  
Create new columns from `incident_datetime`, because the raw timestamp is hard to use directly.

- `date` lets us group calls by day.
- `hour` shows what time of day calls happen.
- `day_of_week` is an integer code for the weekday (0 for Monday through 6 for Sunday).
- `month` lets us see whether call volume changes by season.
- `is_weekend` flags Saturdays and Sundays, since weekends can have different call patterns than weekdays.

These features make it easier to find patterns and improve our predictions.

In [11]:
# Feature engineering - creating new columns from incident_datetime 
df['date'] = df['incident_datetime'].dt.date
df['hour'] = df['incident_datetime'].dt.hour
df['day_of_week'] = df['incident_datetime'].dt.dayofweek
df['month'] = df['incident_datetime'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5,6]).astype(int)

In [12]:
df.head()

,incident_datetime,borough,zipcode,initial_call_type,initial_severity_level_code,date,hour,day_of_week,month,is_weekend
0,2024-07-01 00:00:02,QUEENS,11101.0,STNDBY,8,2024-07-01,0,0,7,0
1,2024-07-01 00:00:04,MANHATTAN,10035.0,STNDBY,8,2024-07-01,0,0,7,0
2,2024-07-01 00:00:06,MANHATTAN,10065.0,EDP,7,2024-07-01,0,0,7,0
3,2024-07-01 00:00:12,MANHATTAN,10035.0,STNDBY,8,2024-07-01,0,0,7,0
4,2024-07-01 00:00:22,BRONX,10451.0,SEIZR,3,2024-07-01,0,0,7,0


In [13]:
# Rearranging the DataFrame columns and excluding the original incident_datetime
cols = ['date', 'hour', 'day_of_week', 'is_weekend', 'month', 'borough', 'zipcode', 'initial_call_type', 'initial_severity_level_code']
df = df[cols]

In [14]:
df.head()

,date,hour,day_of_week,is_weekend,month,borough,zipcode,initial_call_type,initial_severity_level_code
0,2024-07-01,0,0,0,7,QUEENS,11101.0,STNDBY,8
1,2024-07-01,0,0,0,7,MANHATTAN,10035.0,STNDBY,8
2,2024-07-01,0,0,0,7,MANHATTAN,10065.0,EDP,7
3,2024-07-01,0,0,0,7,MANHATTAN,10035.0,STNDBY,8
4,2024-07-01,0,0,0,7,BRONX,10451.0,SEIZR,3


__Checking for Missing Values__

Before adding features like holidays or weather, it's important to handle missing values in key columns:

- **`date`**: Needed to match holidays and weather data by time. Missing dates make time-based features unreliable.
- **`zipcode`** and **`borough`**: Needed to match the location for weather data. Missing values mean we can’t add weather accurately.


In [15]:
# Check for missing date
df.date.isna().sum()

np.int64(0)

In [16]:
# Check for missing values in zipcode
df.zipcode.isna().sum()

np.int64(21123)

__Analyzing missing zipcode rows__

In [17]:
missing_zip = df[df.zipcode.isna()]
missing_zip.tail()

,date,hour,day_of_week,is_weekend,month,borough,zipcode,initial_call_type,initial_severity_level_code
1308403,2025-04-25,20,4,0,4,QUEENS,,SICMIN,7
1308424,2025-04-25,20,4,0,4,BROOKLYN,,UNKNOW,4
1308458,2025-04-25,20,4,0,4,BRONX,,MVAINJ,5
1308537,2025-04-25,20,4,0,4,RICHMOND / STATEN ISLAND,,INJURY,5
1308551,2025-04-25,20,4,0,4,QUEENS,,STNDBY,8


In [18]:
# check missing zipcode by borough
missing_zip.borough.value_counts()

borough
QUEENS                      8103
BROOKLYN                    4416
BRONX                       4036
MANHATTAN                   2846
RICHMOND / STATEN ISLAND    1721
UNKNOWN                        1
Name: count, dtype: int64

In [19]:
# Total rows by borough
total_counts = df.groupby('borough').size()

# Rows with a missing zipcode per borough
missing_zip = df[df.zipcode.isna()].groupby('borough').size()

# Missing percentage
missing_percentage = (missing_zip/total_counts)*100

print(missing_percentage)

borough
BRONX                         1.304380
BROOKLYN                      1.223814
MANHATTAN                     0.901726
QUEENS                        3.032560
RICHMOND / STATEN ISLAND      3.098779
UNKNOWN                     100.000000
dtype: float64
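
The same per-borough percentage can be computed in one step, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a toy frame (the rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: a few rows with and without a zipcode
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS", "UNKNOWN"],
    "zipcode": [10451.0, np.nan, 11101.0, np.nan, np.nan, np.nan],
})

# isna() yields booleans; the group mean of booleans is the missing fraction
missing_pct = toy["zipcode"].isna().groupby(toy["borough"]).mean() * 100
print(missing_pct.round(2))
```

Unlike dividing two separately grouped counts, this form also reports 0 for a borough with no missing values instead of `NaN`.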


__Drop Rows with Missing `zipcode`__

The dataset has **1,308,614 rows**, and roughly **0.9–3.1% of zipcodes are missing**, depending on the borough (the single `UNKNOWN` borough row has no zipcode at all).

Since `zipcode` is important for adding **location-based features** like weather, missing values in this column reduce the accuracy of those features.

Because the dataset is large, we can afford to prioritize clean and reliable data. 
So, we choose to **drop rows with missing `zipcode`** to avoid adding incorrect or incomplete location-based data.

This results in only a small data loss (21,123 rows, about 1.6%), which is acceptable for improving data quality.
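
The overall share of rows removed follows directly from the two counts reported earlier in the notebook. A quick arithmetic check using those printed values:

```python
total_rows = 1308614      # rows before dropping (from df.shape above)
missing_zip_rows = 21123  # rows with a missing zipcode (from isna().sum() above)

loss_pct = missing_zip_rows / total_rows * 100
rows_left = total_rows - missing_zip_rows

print(f"{loss_pct:.2f}% of rows dropped, {rows_left} rows kept")
```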

In [20]:
# Drop missing zipcode values
df = df.dropna(subset=['zipcode'])

df.zipcode.isna().sum()

np.int64(0)

In [21]:
# Check the df for any other missing values
df.isna().sum()

date                           0
hour                           0
day_of_week                    0
is_weekend                     0
month                          0
borough                        0
zipcode                        0
initial_call_type              0
initial_severity_level_code    0
dtype: int64

__ZIP codes are identifiers, not quantities, so storing them as float (e.g., 10001.0) is misleading. We convert them to integers.__

In [22]:
# check zipcode type
df.zipcode.dtype

dtype('float64')

In [23]:
# Converting zipcode to int
df['zipcode'] = df['zipcode'].astype(int)


In [24]:
df.dtypes

date                           object
hour                            int32
day_of_week                     int32
is_weekend                      int64
month                           int32
borough                        object
zipcode                         int64
initial_call_type              object
initial_severity_level_code     int64
dtype: object

In [25]:
# Converting date back to datetime 
df['date']=pd.to_datetime(df['date'])

In [26]:
df.date.dtype

dtype('<M8[ns]')

In [27]:
df.shape

(1287491, 9)

__Add Holidays to the Dataset__

Holidays can affect ambulance call patterns because people behave differently on those days. There may be more accidents, health emergencies, or delayed care. 

We add a holiday column using the `USFederalHolidayCalendar` class from `pandas.tseries.holiday` to help the model account for these changes.

In [28]:
# Importing USFederalHolidayCalendar
from pandas.tseries.holiday import USFederalHolidayCalendar   

# Generate calendar
calendar = USFederalHolidayCalendar()

# Create US holidays using df date range
us_holidays = calendar.holidays(start='2024-07-01', end='2025-04-25')

In [29]:
# Create new column is_holiday. True=1, False=0
df['is_holiday']=df['date'].isin(us_holidays).astype(int)
df.head()

,date,hour,day_of_week,is_weekend,month,borough,zipcode,initial_call_type,initial_severity_level_code,is_holiday
0,2024-07-01,0,0,0,7,QUEENS,11101,STNDBY,8,0
1,2024-07-01,0,0,0,7,MANHATTAN,10035,STNDBY,8,0
2,2024-07-01,0,0,0,7,MANHATTAN,10065,EDP,7,0
3,2024-07-01,0,0,0,7,MANHATTAN,10035,STNDBY,8,0
4,2024-07-01,0,0,0,7,BRONX,10451,SEIZR,3,0


In [30]:
df.is_holiday.sum()

np.int64(36182)
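
Note that the sum above counts call records that fall on a holiday, not the number of holiday dates. A small sketch of the difference, using a toy frame with illustrative rows:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

holiday_dates = USFederalHolidayCalendar().holidays(start="2024-07-01", end="2025-04-25")

# Toy call records: two on Independence Day, one on an ordinary day
toy = pd.DataFrame({"date": pd.to_datetime(["2024-07-04", "2024-07-04", "2024-07-05"])})
toy["is_holiday"] = toy["date"].isin(holiday_dates).astype(int)

holiday_rows = int(toy["is_holiday"].sum())                       # counts records
holiday_days = toy.loc[toy["is_holiday"] == 1, "date"].nunique()  # counts distinct dates

print(holiday_rows, holiday_days)
```

Printing `holiday_dates` is also a quick way to confirm which dates the calendar treats as holidays within the study period.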

In [31]:
df.date.min(), df.date.max()

(Timestamp('2024-07-01 00:00:00'), Timestamp('2025-04-25 00:00:00'))

**Feature Engineering: Rationale** 
  
**1. Daily Aggregation of Ambulance Calls**   
Why: The goal is to predict daily call volumes, not hourly.  
Reason: Aggregating to the daily level simplifies modeling, reduces noise from short-term fluctuations, and matches the daily granularity of the holiday, weather, and weekday features.  
How: Count total calls per day per borough.

**2. Mean, Max, and Min Severity Levels**   
Why: The severity code ranges from **1 (most severe) to 8 (least severe)**.      
Reason: These statistical summaries help capture the nature of the calls on a given day:   
`mean_severity_level`: shows the general level of urgency (lower = more severe).  
`max_severity_level`: the most severe call of the day, i.e. the lowest code; a value of 1 means very serious cases occurred.  
`min_severity`: the least severe call of the day, i.e. the highest code; together with `max_severity_level` it captures the range of severity on that day.  
Insight: A day with a higher mean code consists mostly of low-urgency, routine calls, while a lower mean might indicate a stressful or crisis-heavy day.  

**3. Lag Features (Previous Call Volume)**     
Why: To capture temporal trends and patterns.  
Reason: Ambulance call volume often shows autocorrelation: a high call volume one day might be followed by a similar volume the next.  
How: Create lag features like:  
`lag_1`: number of calls on the previous day  
`lag_7`: number of calls on the same weekday one week earlier  
Insight: Helps the model learn short-term patterns and anomalies (e.g., ongoing incidents, seasonal waves, or weekends).  
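
To illustrate the lag idea, here is a minimal sketch on a toy frame. The Bronx volumes are the first three days shown later in this notebook; the Queens values after the first day are illustrative. Grouping by borough before shifting keeps one borough's history from leaking into another's:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-01", "2024-07-02", "2024-07-03"] * 2),
    "borough": ["BRONX"] * 3 + ["QUEENS"] * 3,
    "daily_call_vol": [1039, 950, 1002, 914, 880, 901],
})

# Sort so that shift(1) really means "the previous day within the borough"
toy = toy.sort_values(["borough", "date"])
toy["lag_1"] = toy.groupby("borough")["daily_call_vol"].shift(1)

print(toy)
```

The first row of each borough has no previous day, so its `lag_1` is `NaN`; the same happens for the first seven rows of a 7-day lag.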



In [32]:
# Confirming date is in datetime format
df.date.dtype

dtype('<M8[ns]')

In [33]:
# Aggregate daily ambulance call volume per borough
# Note: severity code 1 is the most severe, so the maximum severity is the minimum code

daily_calls = df.groupby(['date','borough']).agg(
            daily_call_vol=('hour','count'),
            mean_severity_level=('initial_severity_level_code','mean'),
            max_severity_level=('initial_severity_level_code', 'min'),   # most severe call (lowest code)
            min_severity=('initial_severity_level_code','max'),          # least severe call (highest code)
            day_of_week=('day_of_week','first'),
            is_weekend=('is_weekend','first'),
            is_holiday=('is_holiday','first')
).reset_index()

In [34]:
daily_calls.head()

,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,day_of_week,is_weekend,is_holiday
0,2024-07-01,BRONX,1039,4.216554,1,8,0,0,0
1,2024-07-01,BROOKLYN,1266,4.274882,1,8,0,0,0
2,2024-07-01,MANHATTAN,1200,4.276667,1,8,0,0,0
3,2024-07-01,QUEENS,914,4.250547,1,8,0,0,0
4,2024-07-01,RICHMOND / STATEN ISLAND,167,4.233533,1,7,0,0,0


In [35]:
# Adding lag features by sorting borough and date
 
daily_calls = daily_calls.sort_values(['borough','date'])
daily_calls['lag_1'] = daily_calls.groupby('borough')['daily_call_vol'].shift(1)
daily_calls['lag_7'] = daily_calls.groupby('borough')['daily_call_vol'].shift(7)

In [36]:
daily_calls.iloc[0:56:7]

,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,day_of_week,is_weekend,is_holiday,lag_1,lag_7
0,2024-07-01,BRONX,1039,4.216554,1,8,0,0,0,,
35,2024-07-08,BRONX,1134,4.194885,1,7,0,0,0,1009.0,1039.0
70,2024-07-15,BRONX,1230,4.162602,1,8,0,0,0,1070.0,1134.0
105,2024-07-22,BRONX,1137,4.186456,1,8,0,0,0,1091.0,1230.0
140,2024-07-29,BRONX,1077,4.319406,1,7,0,0,0,1002.0,1137.0
175,2024-08-05,BRONX,1089,4.23416,1,7,0,0,0,941.0,1077.0
210,2024-08-12,BRONX,1075,4.301395,1,8,0,0,0,1020.0,1089.0
245,2024-08-19,BRONX,1011,4.281899,1,8,0,0,0,937.0,1075.0


In [37]:
daily_calls.shape

(1495, 11)

In [38]:
daily_calls.dtypes

date                   datetime64[ns]
borough                        object
daily_call_vol                  int64
mean_severity_level           float64
max_severity_level              int64
min_severity                    int64
day_of_week                     int32
is_weekend                      int64
is_holiday                      int64
lag_1                         float64
lag_7                         float64
dtype: object

In [39]:
# Checking for missing values
daily_calls.isna().sum()

date                    0
borough                 0
daily_call_vol          0
mean_severity_level     0
max_severity_level      0
min_severity            0
day_of_week             0
is_weekend              0
is_holiday              0
lag_1                   5
lag_7                  35
dtype: int64

**Standardizing Borough Names**

Before merging the daily calls and weather datasets, we need to ensure that the borough names are consistent.  

- The `daily_calls` DataFrame has borough names in **all caps**, while the weather dataset (`nyc_weather_daily.csv`) uses **title case**.  
- To avoid merge mismatches, we **normalize the strings** by converting them to title case and trimming any whitespace.

In [40]:
# Checking for string consistency: 
daily_calls.borough.unique()

array(['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS',
       'RICHMOND / STATEN ISLAND'], dtype=object)

In [41]:
# Standardize borough names in daily_calls
daily_calls['borough'] =daily_calls['borough'].str.strip().str.title()

In [42]:
daily_calls.borough.unique()

array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens',
       'Richmond / Staten Island'], dtype=object)

**In the `daily_calls` dataset, Staten Island is labeled as `'Richmond / Staten Island'`.  
To ensure consistency with the weather dataset, we replace it with the standard name `'Staten Island'`.**

In [43]:
daily_calls['borough'] = daily_calls['borough'].replace({'Richmond / Staten Island':'Staten Island'})

In [44]:
daily_calls.borough.unique()

array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
      dtype=object)

In [45]:
# Saving it as csv file before joining weather data
daily_calls.to_csv('daily_ambulance_calls.csv')

## Adding Weather Data to Ambulance Call Prediction
Weather conditions can have a direct and indirect impact on ambulance call volumes. Adding weather data enriches the feature space with external, real-world influences that the model can learn from. While it’s not always the dominant factor, weather adds important context, especially during extreme or seasonal changes.

## Weather Data Summary  
To add environmental context to the ambulance call data, we collected daily weather data in a separate notebook, using the Open-Meteo **archive API**, for all five NYC boroughs from **July 1, 2024 to April 25, 2025**, the same period as the call data.  
The daily variables we collect are:

- `temperature_2m_max` and `temperature_2m_min`: daily maximum and minimum temperature at 2 m (°C)  
- `precipitation_sum`: daily total precipitation (mm)  
- `snowfall_sum`: daily total snowfall (cm)    

We store each borough’s data in a DataFrame and combine them into a single CSV (`nyc_weather_daily.csv`) for reproducibility.
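
The collection notebook is not shown here, so the following is only a sketch of how such a download could look. It assumes the Open-Meteo historical archive endpoint and its `daily` query parameter; the helper names are hypothetical, and any coordinates passed in (one point per borough) would be an approximation chosen by the user:

```python
import pandas as pd

# Assumed endpoint of the Open-Meteo historical archive API
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
DAILY_VARS = ["temperature_2m_max", "temperature_2m_min",
              "precipitation_sum", "snowfall_sum"]


def build_params(latitude, longitude, start_date, end_date):
    """Query parameters for one location and date range."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "daily": ",".join(DAILY_VARS),
        "timezone": "America/New_York",
    }


def daily_payload_to_frame(payload, borough):
    """Turn the 'daily' block of a JSON response into a DataFrame."""
    daily = payload["daily"]
    frame = pd.DataFrame({name: daily[name] for name in DAILY_VARS})
    frame["borough"] = borough
    frame["date"] = pd.to_datetime(daily["time"])
    return frame


def fetch_borough(borough, latitude, longitude, start_date, end_date):
    """Download one borough (needs network access; not called here)."""
    import requests  # imported here so the helpers above also work offline
    response = requests.get(
        ARCHIVE_URL,
        params=build_params(latitude, longitude, start_date, end_date),
        timeout=60,
    )
    response.raise_for_status()
    return daily_payload_to_frame(response.json(), borough)


# Offline check with a hand-written payload (values are illustrative)
sample_payload = {"daily": {
    "time": ["2024-07-01"],
    "temperature_2m_max": [25.0],
    "temperature_2m_min": [17.0],
    "precipitation_sum": [0.2],
    "snowfall_sum": [0.0],
}}
sample = daily_payload_to_frame(sample_payload, "Manhattan")
print(sample)
```

Calling `fetch_borough` once per borough and combining the results with `pd.concat` would give a frame with the same columns as `nyc_weather_daily.csv`.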


In [46]:
# Load weather data 
weather = pd.read_csv('nyc_weather_daily.csv')

In [47]:
# Randomly selecting 10 rows
weather.sample(n=10, random_state=42)

,temperature_2m_max,temperature_2m_min,precipitation_sum,snowfall_sum,borough,date
999,20.4,7.3,0.0,0.0,The Bronx,2024-10-11
1444,11.9,2.1,3.0,0.0,Staten Island,2025-03-06
1187,15.1,3.7,0.0,0.0,The Bronx,2025-04-17
886,19.2,9.1,1.6,0.0,Queens,2025-04-15
1166,9.5,-0.3,0.0,0.0,The Bronx,2025-03-27
561,8.8,5.6,9.0,0.0,Brooklyn,2025-03-20
481,12.8,8.3,8.3,0.0,Brooklyn,2024-12-30
303,29.2,21.3,0.7,0.0,Brooklyn,2024-07-05
342,27.4,16.3,0.0,0.0,Brooklyn,2024-08-13
244,0.1,-7.6,0.0,0.0,Manhattan,2025-03-02


In [48]:
# Create a daily mean temperature column, approximated as the midpoint of the daily max and min
weather['temperature_2m'] = (weather.temperature_2m_max+weather.temperature_2m_min)/2

# Convert the temperature from Celsius to Fahrenheit (F = C * 9/5 + 32)

weather['temperature_2m'] = weather['temperature_2m']* 9/5 + 32

# Drop temperature min and max column
weather = weather.drop(columns=['temperature_2m_max', 'temperature_2m_min'])

In [49]:
weather.head()

,precipitation_sum,snowfall_sum,borough,date,temperature_2m
0,0.2,0.0,Manhattan,2024-07-01,70.43
1,0.0,0.0,Manhattan,2024-07-02,71.96
2,0.0,0.0,Manhattan,2024-07-03,72.32
3,2.2,0.0,Manhattan,2024-07-04,76.46
4,5.7,0.0,Manhattan,2024-07-05,80.33
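
As a worked example of the conversion, take the Brooklyn row for 2024-07-05 from the sample above (max 29.2 °C, min 21.3 °C). The function name below is a hypothetical helper for illustration:

```python
def midrange_fahrenheit(t_max_c, t_min_c):
    """Midpoint of the daily max and min in Celsius, converted to Fahrenheit."""
    midrange_c = (t_max_c + t_min_c) / 2
    return midrange_c * 9 / 5 + 32


# (29.2 + 21.3) / 2 = 25.25 °C, and 25.25 * 9/5 + 32 = 77.45 °F
example_f = midrange_fahrenheit(29.2, 21.3)
print(round(example_f, 2))
```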


In [50]:
weather.shape

(1495, 5)

In [51]:
# Weather data type
weather.dtypes

precipitation_sum    float64
snowfall_sum         float64
borough               object
date                  object
temperature_2m       float64
dtype: object

**Convert the weather `date` column to datetime, shorten the column names, and set `date` as the index**

In [52]:
# Convert 'date' to datetime
weather['date'] = pd.to_datetime(weather['date'])

# Rename columns to shorter names
weather.rename(columns={'precipitation_sum':'precp', 'snowfall_sum':'snowfall', 'temperature_2m': 'temp'}, inplace=True)

# Set 'date' as the index
weather.set_index('date', inplace = True)

In [53]:
weather.head()

date,precp,snowfall,borough,temp
2024-07-01,0.2,0.0,Manhattan,70.43
2024-07-02,0.0,0.0,Manhattan,71.96
2024-07-03,0.0,0.0,Manhattan,72.32
2024-07-04,2.2,0.0,Manhattan,76.46
2024-07-05,5.7,0.0,Manhattan,80.33


In [54]:
weather.shape

(1495, 4)

In [55]:
# Check weather for missing values 
weather.isna().sum()

precp       0
snowfall    0
borough     0
temp        0
dtype: int64

In [56]:
# Checking for string consistency
weather.borough.unique()

array(['Manhattan', 'Brooklyn', 'Queens', 'The Bronx', 'Staten Island'],
      dtype=object)

In [61]:
# Changing borough The Bronx to Bronx 
weather['borough']=weather['borough'].replace({'The Bronx' : 'Bronx'})
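
Before merging, it can help to confirm that both tables use exactly the same borough labels. A minimal sketch using the label sets printed in this notebook:

```python
import pandas as pd

# Borough labels as they appear in the two sources
calls_boroughs = pd.Series(["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS",
                            "RICHMOND / STATEN ISLAND"])
weather_boroughs = pd.Series(["Manhattan", "Brooklyn", "Queens", "The Bronx",
                              "Staten Island"])

# Same normalization steps as in the cells above
calls_clean = (calls_boroughs.str.strip().str.title()
               .replace({"Richmond / Staten Island": "Staten Island"}))
weather_clean = weather_boroughs.replace({"The Bronx": "Bronx"})

# Any label present on one side only would fail to match in the merge
only_in_calls = set(calls_clean) - set(weather_clean)
only_in_weather = set(weather_clean) - set(calls_clean)
print(only_in_calls, only_in_weather)
```

Two empty sets mean every borough label has a counterpart on the other side.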

**Merge Weather with Daily Ambulance Calls**

The `weather` DataFrame has **no missing values** and is now ready to be merged with `daily_calls` on `date` and `borough` using a **left join** to preserve all ambulance records.

In [62]:
# Merge weather with daily_calls into daily_amb_calls

daily_amb_calls = daily_calls.merge(weather, on=['date','borough'], how='left')

In [63]:
daily_amb_calls.head()

,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,day_of_week,is_weekend,is_holiday,lag_1,lag_7,precp,snowfall,temp
0,2024-07-01,Bronx,1039,4.216554,1,8,0,0,0,,,0.4,0.0,70.34
1,2024-07-02,Bronx,950,4.195789,1,8,1,0,0,1039.0,,0.0,0.0,71.87
2,2024-07-03,Bronx,1002,4.229541,1,8,2,0,0,950.0,,0.0,0.0,71.87
3,2024-07-04,Bronx,907,4.060639,1,8,3,0,1,1002.0,,1.3,0.0,75.47
4,2024-07-05,Bronx,1099,4.270246,1,8,4,0,0,907.0,,15.6,0.0,78.71


In [64]:
daily_amb_calls.isna().sum()

date                    0
borough                 0
daily_call_vol          0
mean_severity_level     0
max_severity_level      0
min_severity            0
day_of_week             0
is_weekend              0
is_holiday              0
lag_1                   5
lag_7                  35
precp                   0
snowfall                0
temp                    0
dtype: int64
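
The zero missing weather values above suggest that every call row found a weather match. A more explicit check is possible with the `validate` and `indicator` options of `DataFrame.merge`; the sketch below uses toy frames (the Queens temperature is illustrative):

```python
import pandas as pd

calls = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-01", "2024-07-01"]),
    "borough": ["Bronx", "Queens"],
    "daily_call_vol": [1039, 914],
})
wx = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-01", "2024-07-01"]),
    "borough": ["Bronx", "Queens"],
    "temp": [70.34, 70.0],
})

# validate raises if a (date, borough) key is duplicated on either side;
# indicator adds a _merge column saying where each row came from
merged = calls.merge(wx, on=["date", "borough"], how="left",
                     validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))
```

An empty `unmatched` frame confirms that no call row was left without weather data.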

**Handling Missing Lag Features**

The `lag_1` and `lag_7` features represent the number of calls from 1 day and 7 days prior.  
- By definition, the first day of each borough has no `lag_1`, and the first 7 days have no `lag_7`.  
- These missing values are **not errors** but a natural consequence of computing lags.

We fill the missing `lag_1` and `lag_7` values with the mean of that lag column within the same borough, so that each row has a reasonable historical value without introducing large distortions.

In [65]:
# Fill missing lag_1 values with the mean lag_1 of the same borough
daily_amb_calls['lag_1'] = daily_amb_calls.groupby('borough')['lag_1'].transform(lambda x: x.fillna(x.mean()))

# Fill missing lag_7 values with the mean lag_7 of the same borough
daily_amb_calls['lag_7'] = daily_amb_calls.groupby('borough')['lag_7'] \
                                          .transform(lambda x: x.fillna(x.mean()))

In [66]:
# Check for missing values 
daily_amb_calls.isna().sum()

date                   0
borough                0
daily_call_vol         0
mean_severity_level    0
max_severity_level     0
min_severity           0
day_of_week            0
is_weekend             0
is_holiday             0
lag_1                  0
lag_7                  0
precp                  0
snowfall               0
temp                   0
dtype: int64
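
A toy illustration of the group-wise mean fill used above (the Queens values after the first are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "borough": ["Bronx"] * 3 + ["Queens"] * 3,
    "lag_1": [np.nan, 1039.0, 950.0, np.nan, 914.0, 880.0],
})

# transform returns a Series aligned with the original rows,
# so each missing value is replaced by its own borough's mean
toy["lag_1_filled"] = (toy.groupby("borough")["lag_1"]
                          .transform(lambda s: s.fillna(s.mean())))
print(toy)
```

Because the mean is taken over the whole period, the filled values contain information from later days. One alternative, not used in this notebook, would be to drop the first rows of each borough with `dropna(subset=['lag_1', 'lag_7'])`.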

**Save `daily_amb_calls` as CSV**

In [69]:
daily_amb_calls.to_csv('daily_ambulance_calls.csv')

## The merged dataset is now clean, consistent, and ready for exploratory data analysis (EDA).

In [68]:
daily_amb_calls.head()

,date,borough,daily_call_vol,mean_severity_level,max_severity_level,min_severity,day_of_week,is_weekend,is_holiday,lag_1,lag_7,precp,snowfall,temp
0,2024-07-01,Bronx,1039,4.216554,1,8,0,0,0,1021.557047,1020.561644,0.4,0.0,70.34
1,2024-07-02,Bronx,950,4.195789,1,8,1,0,0,1039.0,1020.561644,0.0,0.0,71.87
2,2024-07-03,Bronx,1002,4.229541,1,8,2,0,0,950.0,1020.561644,0.0,0.0,71.87
3,2024-07-04,Bronx,907,4.060639,1,8,3,0,1,1002.0,1020.561644,1.3,0.0,75.47
4,2024-07-05,Bronx,1099,4.270246,1,8,4,0,0,907.0,1020.561644,15.6,0.0,78.71
