In [1]:
import pandas as pd 
import geopandas as gpd
import numpy as np
import cemo_module as cemo

In [2]:
heat_gdf = gpd.read_file('data/Final_Data/District_Heat.gpkg', layer='heat')

In [3]:
heat_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 186354 entries, 0 to 186353
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   District         186354 non-null  object  
 1   ObjectID         186354 non-null  int64   
 2   TOOLTIP          186354 non-null  object  
 3   Battalion        186354 non-null  int64   
 4   Division         186354 non-null  object  
 5   Battalion_       186354 non-null  object  
 6   District_N       186354 non-null  object  
 7   heat_index_high  186354 non-null  float64 
 8   date_time        186354 non-null  object  
 9   heat_index_low   186354 non-null  float64 
 10  geometry         186354 non-null  geometry
dtypes: float64(2), geometry(1), int64(2), object(6)
memory usage: 15.6+ MB


In [4]:
columns_to_keep = [
    'District',
    'date_time',
    'heat_index_high',
    'heat_index_low',
    'geometry'
]
heat_gdf = heat_gdf[columns_to_keep]  # Select only the columns we need

# Convert date_time to datetime and District to int
heat_gdf['date_time'] = pd.to_datetime(heat_gdf['date_time'])
heat_gdf['District'] = heat_gdf['District'].astype(int)

# Sort by district and date; needed to calculate streaks accurately later on.
heat_gdf.sort_values(
    by=['District', 'date_time'],
    ascending=[True,True], 
    inplace=True
)

In [5]:
heat_gdf.head()

Unnamed: 0,District,date_time,heat_index_high,heat_index_low,geometry
0,1,2018-01-01,67.96405,44.975379,"POLYGON ((-118.20065 34.09533, -118.20060 34.0..."
102,1,2018-01-02,75.183769,51.330571,"POLYGON ((-118.20065 34.09533, -118.20060 34.0..."
204,1,2018-01-03,72.576,51.70338,"POLYGON ((-118.20065 34.09533, -118.20060 34.0..."
306,1,2018-01-04,72.90138,51.761742,"POLYGON ((-118.20065 34.09533, -118.20060 34.0..."
408,1,2018-01-05,71.472715,53.012019,"POLYGON ((-118.20065 34.09533, -118.20060 34.0..."


### Heat Day Definitions
Heat day definitions are tricky. At their simplest, a heat day is any day above 90°F, but some definitions use 95°F, or even 100°F in the Valley. Other definitions require over 90°F during the day and over 70°F at night. We can explore these definitions later; for now we use the simplest one: two or more days over 90°F.
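
The competing definitions above can be written as simple boolean comparisons. The sketch below is illustrative only: `is_heat_day` and `is_hot_day_and_night` are hypothetical helpers, not the `cemo.heat_threshold` function used in this notebook, and treating each threshold as exclusive (strictly greater than) is an assumption. The values are toy data, not taken from the dataset.

```python
import pandas as pd

def is_heat_day(daily_high, high_thresh=90):
    """Flag days whose high is strictly above the threshold (assumed exclusive)."""
    return daily_high > high_thresh

def is_hot_day_and_night(daily_high, daily_low, high_thresh=90, low_thresh=70):
    """Flag days that are hot during the day and stay warm overnight."""
    return (daily_high > high_thresh) & (daily_low > low_thresh)

# Toy values for illustration only
toy = pd.DataFrame({
    'heat_index_high': [85.0, 92.0, 101.0, 95.0],
    'heat_index_low':  [60.0, 68.0, 75.0, 71.0],
})
toy['heat_day_90'] = is_heat_day(toy['heat_index_high'], 90)
toy['heat_day_100'] = is_heat_day(toy['heat_index_high'], 100)
toy['hot_day_and_night'] = is_hot_day_and_night(
    toy['heat_index_high'], toy['heat_index_low']
)
print(toy)
```

Comparing the flag columns side by side shows how sensitive the count of heat days is to the chosen threshold.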

In [6]:
heat_gdf['heat_day'] = np.vectorize(cemo.heat_threshold)(daily_high=heat_gdf['heat_index_high'], high_thresh=90)

In [7]:
heat_gdf.head()


Unnamed: 0,District,date_time,heat_index_high,heat_index_low,geometry,heat_day
0,1,2018-01-01,67.96405,44.975379,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False
102,1,2018-01-02,75.183769,51.330571,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False
204,1,2018-01-03,72.576,51.70338,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False
306,1,2018-01-04,72.90138,51.761742,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False
408,1,2018-01-05,71.472715,53.012019,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False


In [8]:
ems_volumes = pd.read_csv('data/FireStatLA/ems_call_counts.csv', 
                          names=['District', 'date_time', 'calls'],
                          parse_dates=['date_time'], header=0)
ems_volumes.head()


Unnamed: 0,District,date_time,calls
0,1,2018-01-01,9
1,1,2018-01-02,7
2,1,2018-01-03,5
3,1,2018-01-04,1
4,1,2018-01-05,6


In [9]:
ems_volumes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178052 entries, 0 to 178051
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   District   178052 non-null  int64         
 1   date_time  178052 non-null  datetime64[ns]
 2   calls      178052 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.1 MB


## Mismatched District List Length Sleuthing

When we pulled the heat data into the fire district shapefile in the heat modeling notebook, we identified a discrepancy between the number of unique district numbers (102) and the number of records in the district shapefile (107), which caused issues in the eventual joins and in the expected length of files. This was solved.

In exploring the call volumes provided by FireStats LA, however, we found that the EMS data had 104 unique fire districts. In trying to identify which districts these were and what geography they correspond to, we also looked at the [First In Districts](https://geohub.lacity.org/datasets/e9bfde4b3cc04ef48a0fc4d8ec0d16bd_0/explore?location=33.772674%2C-118.252538%2C12.56) dataset and found that, while not obviously different from the Fire Station District file, it had 106 rows and none of the duplicates present in the fire station districts file.

Which districts are present in the EMS file but not in the fire station district shapefile, and what do they correspond to?

In [11]:
# Find the districts that are missing from the fire district shp file
districts = heat_gdf['District'].unique()
ems_dist = ems_volumes['District'].unique()

extra = [x for x in ems_dist if x not in districts]
extra

[110, 111]

Districts 110 and 111 are the extra districts. How many calls do they have?

In [12]:
ems_volumes[ems_volumes['District'].isin(extra)].describe()

Unnamed: 0,District,calls
count,319.0,319.0
mean,110.758621,1.097179
std,0.428592,0.345626
min,110.0,1.0
25%,111.0,1.0
50%,111.0,1.0
75%,111.0,1.0
max,111.0,3.0


Not many. Over the five-year period, the two districts together received calls on only 319 district-days, and on those days they almost always received a single call.

Further inspection of the First In Districts versus the fire station districts shows that these two districts lie exclusively in the Los Angeles Harbor.
<center><img src="data/FirstInDistrict/Screenshot%202023-04-22%20at%2011.38.25%20PM.png" alt="Screenshot of missing districts" width="600"/></center>

Because of this, for now we simply drop these call rows by using a left join on the data.




In [13]:
#Joins heat data with call volumes
joined_gdf = heat_gdf.merge(ems_volumes, how="left", on=['District', 'date_time'])
joined_gdf.head()

Unnamed: 0,District,date_time,heat_index_high,heat_index_low,geometry,heat_day,calls
0,1,2018-01-01,67.96405,44.975379,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,9.0
1,1,2018-01-02,75.183769,51.330571,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,7.0
2,1,2018-01-03,72.576,51.70338,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,5.0
3,1,2018-01-04,72.90138,51.761742,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,1.0
4,1,2018-01-05,71.472715,53.012019,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,6.0


Call volumes record observations, so days on which a district received no calls have no row and are therefore missing. Because the temperature data is complete for every district and every day, we use a left join to ensure there is a record for each district and each day. This leaves some rows of the "calls" column as `NaN`. We fill these with zero to represent no calls in that district on that day.
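
The join-then-fill step described above can be checked on a small example. This is a minimal sketch on made-up frames (`temps`, `calls`, and `toy_joined` are hypothetical names, and the values are not from the dataset). It assumes each district and date appears at most once in each frame, which `validate='one_to_one'` enforces.

```python
import pandas as pd

# Toy frames: temperature is complete, calls only exist for observed days
temps = pd.DataFrame({
    'District': [1, 1, 2, 2],
    'date_time': pd.to_datetime(
        ['2018-01-01', '2018-01-02', '2018-01-01', '2018-01-02']
    ),
    'heat_index_high': [68.0, 75.2, 70.1, 91.3],
})
calls = pd.DataFrame({
    'District': [1, 2, 110],
    'date_time': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-01']),
    'calls': [9, 4, 1],
})

# Left join keeps every temperature row; District 110 has no match and is dropped
toy_joined = temps.merge(
    calls, how='left', on=['District', 'date_time'], validate='one_to_one'
)

# Fill only the calls column so no other column is touched
toy_joined['calls'] = toy_joined['calls'].fillna(0).astype(int)
print(toy_joined)
```

Filling only the `calls` column, rather than the whole frame, avoids silently writing zeros into any other column that might contain missing values.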

In [14]:
# Fill days with no observation with zero; only the calls column is touched
joined_gdf['calls'] = joined_gdf['calls'].fillna(0).astype(int)

In [15]:
s = pd.Series([True, True, False, False, True, False, True, True, True, False])

In [30]:
def streak(s):
    return np.multiply(s, s.cumsum()).diff().where(lambda x:x<0).ffill().add(s.cumsum(), fill_value=0)
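
The one-line `streak` function is compact but hard to read, so the sketch below runs it on the sample series and compares it with a more explicit variant. `streak_by_run` is a hypothetical alternative, not part of the notebook or of `cemo`. The per-district example uses toy data and assumes rows are already sorted by district and date, as done earlier in the notebook.

```python
import numpy as np
import pandas as pd

def streak(s):
    # Same expression as the notebook's streak(), repeated so this sketch runs alone
    return (np.multiply(s, s.cumsum()).diff()
              .where(lambda x: x < 0).ffill()
              .add(s.cumsum(), fill_value=0))

def streak_by_run(s):
    """Hypothetical alternative: count days within each run of True values."""
    x = s.astype(int)
    run_id = x.ne(x.shift()).cumsum()   # new id whenever the value changes
    return x.groupby(run_id).cumsum()   # running count; stays 0 on False runs

s = pd.Series([True, True, False, False, True, False, True, True, True, False])
print(streak(s).tolist())
print(streak_by_run(s).tolist())

# Per-district streaks on toy data, so a run never carries over between districts
toy = pd.DataFrame({
    'District': [1, 1, 1, 2, 2, 2],
    'heat_day': [False, True, True, True, True, False],
})
toy['streak'] = toy.groupby('District')['heat_day'].transform(streak_by_run)
print(toy)
```

Grouping by district matters once the function is applied to the full frame: without it, a hot final day in one district would be counted as continuing into the first day of the next.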

In [26]:
test_df = joined_gdf.loc[:100]
test_df

Unnamed: 0,District,date_time,heat_index_high,heat_index_low,geometry,heat_day,calls
0,1,2018-01-01,67.964050,44.975379,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,9
1,1,2018-01-02,75.183769,51.330571,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,7
2,1,2018-01-03,72.576000,51.703380,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,5
3,1,2018-01-04,72.901380,51.761742,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,1
4,1,2018-01-05,71.472715,53.012019,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,6
...,...,...,...,...,...,...,...
96,1,2018-04-07,78.758207,58.121902,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,10
97,1,2018-04-08,73.649877,55.027232,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,7
98,1,2018-04-09,87.817333,55.030820,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,9
99,1,2018-04-10,84.504581,60.628256,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,7


0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
96     0.0
97     0.0
98     0.0
99     0.0
100    0.0
Name: heat_day, Length: 101, dtype: float64

In [33]:
# Take an explicit copy so the column assignment below does not act on a view
test_df = joined_gdf[joined_gdf['District']==1].copy()

In [35]:
test_df['streak'] = streak(test_df['heat_day'])
test_df.head()


Unnamed: 0,District,date_time,heat_index_high,heat_index_low,geometry,heat_day,calls,streak
0,1,2018-01-01,67.96405,44.975379,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,9,0.0
1,1,2018-01-02,75.183769,51.330571,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,7,0.0
2,1,2018-01-03,72.576,51.70338,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,5,0.0
3,1,2018-01-04,72.90138,51.761742,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,1,0.0
4,1,2018-01-05,71.472715,53.012019,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",False,6,0.0


In [39]:
test_df[test_df['streak']>=2]

Unnamed: 0,District,date_time,heat_index_high,heat_index_low,geometry,heat_day,calls,streak
187,1,2018-07-07,102.418791,79.189977,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,7,2.0
188,1,2018-07-08,95.028806,79.314201,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,3,3.0
189,1,2018-07-09,93.606799,75.5146,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,4,4.0
204,1,2018-07-24,97.699099,72.414137,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,5,2.0
205,1,2018-07-25,93.160701,73.246259,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,10,3.0
218,1,2018-08-07,95.850763,72.012365,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,5,2.0
219,1,2018-08-08,94.52946,72.882148,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,11,3.0
220,1,2018-08-09,94.693334,71.867557,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,4,4.0
221,1,2018-08-10,91.524448,73.960443,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,3,5.0
222,1,2018-08-11,90.385058,73.16779,"POLYGON ((-118.20065 34.09533, -118.20060 34.0...",True,5,6.0


In [42]:
test_df.groupby('heat_day')['calls'].describe()

heat_day,count,mean,std,min,25%,50%,75%,max
False,1757.0,7.24132,2.744603,0.0,5.0,7.0,9.0,17.0
True,70.0,7.914286,3.151916,1.0,6.0,8.0,9.75,15.0

