In [60]:
import geopandas as gpd
import pandas as pd
from datetime import datetime
from shapely.wkt import loads
from warnings import filterwarnings
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import cKDTree

filterwarnings('ignore')

In this notebook, we'll combine the fire data and the weather data into a single dataframe. The goals of this notebook are as follows:

- Find the 5 stations closest to each fire and build a dataframe of those pairings. We chose 5 stations so that their readings can be averaged when modelling.
- Connect the weather data to their respective stations based on Station ID and Date. 

Let's first load the fire data below and check that everything is ready for combining:

In [51]:
fire_data = gpd.read_file('Data/Fire_Data/fire_date_geo.shp',crs='esri:102009')


Let's also get the weather data:

In [10]:
monthly_weather= pd.read_csv('Data/Monthly_Weather_Data/monthly_weather.csv',index_col=0)

Since the weather data is in csv format, let's convert it to a geodataframe so that we can accurately determine distance between fire and weather station:

In [226]:
weather_shp=gpd.GeoDataFrame(monthly_weather,geometry=gpd.points_from_xy(monthly_weather['Longitude (x)'],monthly_weather['Latitude (y)']),crs=4326)

Let's now check the CRS for both the fire and weather data. Remember, a coordinate reference system (CRS) defines, with the help of coordinates, how the two-dimensional, projected map relates to real locations on the earth:

In [52]:
fire_data['geometry'].crs

<Projected CRS: PROJCS["NAD_1983_Lambert_Conformal_Conic",GEOGCS[" ...>
Name: NAD_1983_Lambert_Conformal_Conic
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

Although the `shp` file was read with CRS 102009, the CRS is reported under a generic name and its area of use (including the bounds) is undefined. The bounds describe the geographic region in which the projection is valid. Let's re-assign the CRS so that it is recognised as `ESRI:102009` and the bounds are defined. Note that `set_crs` only replaces the CRS metadata; it does not change the coordinates themselves:

In [53]:
fire_data=fire_data.set_crs('esri:102009',allow_override=True)

In [54]:
fire_data.crs

<Projected CRS: ESRI:102009>
Name: North_America_Lambert_Conformal_Conic
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. United States (USA) - Alabama; Alaska (mainland); Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming.
- bounds: (-172.54, 23.81, -47.74, 86.46)
Coordin

Now that the bounds are set, let's look at the weather data:

In [227]:
weather_shp.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

We notice that the CRS codes are different. We'll have to match them before we merge the tables. For now let's look at a sample of the data we have: 

In [7]:
fire_data.head()

Unnamed: 0,YEAR,MONTH,SRC_AGY2,geometry
0,2004,6,BC,"POLYGON Z ((-1886926.467 898021.006 0.000, -18..."
1,2004,6,BC,"POLYGON Z ((-1880308.251 892344.865 0.000, -18..."
2,2004,6,BC,"POLYGON Z ((-1965048.293 820512.199 0.000, -19..."
3,2004,6,BC,"POLYGON Z ((-1995073.527 854615.146 0.000, -19..."
4,2004,6,BC,"POLYGON Z ((-1988211.829 940418.674 0.000, -19..."


Let's now make a list of stations that we'll use to merge the stations and fire data by distance:

In [230]:
list_stations= weather_shp[['Climate ID','geometry']]

We now have a list of all the stations in our dataframe. Since the weather information is listed on a monthly basis, each station appears many times. Let's drop the duplicates so we only have a list of unique `Climate ID`s.

In [23]:
# Re-assign rather than modify in place, since list_stations was sliced from weather_shp
list_stations = list_stations.drop_duplicates(ignore_index=True)

In [24]:
weather_shp.drop_duplicates(inplace=True)

In [25]:
list_stations=gpd.GeoDataFrame(list_stations)
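
As a small illustration of the de-duplication step, here is a sketch on made-up monthly records (the station IDs `A1` and `B2` and the frame name `monthly` are hypothetical, not from our data). It shows why `ignore_index=True` is useful: the row numbers are reset to 0..n-1, which is the same numbering `cKDTree` will later return as positional indices.

```python
import pandas as pd

# Toy monthly records: each station repeats once per month (values are made up).
monthly = pd.DataFrame({
    'Climate ID': ['A1', 'A1', 'A1', 'B2', 'B2'],
    'Longitude (x)': [-123.74, -123.74, -123.74, -78.28, -78.28],
    'Latitude (y)': [48.94, 48.94, 48.94, 48.80, 48.80],
})

# Keep one row per station; ignore_index=True renumbers the rows 0..n-1.
stations = (
    monthly[['Climate ID', 'Longitude (x)', 'Latitude (y)']]
    .drop_duplicates(ignore_index=True)
)
print(stations)
```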

Let's now confirm that the CRS is still the same for our new list:

In [231]:
list_stations.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [32]:
list_stations.shape

(430, 2)

In [232]:
list_stations

Unnamed: 0,Climate ID,geometry
0,1011500,POINT (-123.74000 48.94000)
1,1011500,POINT (-123.74000 48.94000)
2,1011500,POINT (-123.74000 48.94000)
3,1011500,POINT (-123.74000 48.94000)
4,1011500,POINT (-123.74000 48.94000)
...,...,...
161498,709CEE9,POINT (-78.28000 48.80000)
161499,709CEE9,POINT (-78.28000 48.80000)
161500,709CEE9,POINT (-78.28000 48.80000)
161501,709CEE9,POINT (-78.28000 48.80000)


Let's change the `fire_data` to match the coordinate system of our `list_stations` data before we start merging. 

In [125]:
fire_data=fire_data.to_crs(crs=4326)
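
To make the difference between the two CRS methods used in this notebook concrete, here is a minimal sketch on a single point (the coordinates are those of station `1011500`; the variable names are only for illustration). `set_crs` relabels the data without touching the coordinates, while `to_crs` actually transforms them.

```python
import geopandas as gpd
from shapely.geometry import Point

# One station in longitude/latitude (EPSG:4326).
pts = gpd.GeoDataFrame(
    {'Climate ID': ['1011500']},
    geometry=[Point(-123.74, 48.94)],
    crs='EPSG:4326',
)

# set_crs only swaps the CRS label: the numbers stay in degrees.
relabelled = pts.set_crs('ESRI:102009', allow_override=True)

# to_crs reprojects: the numbers become eastings/northings in metres.
reprojected = pts.to_crs('ESRI:102009')

print(relabelled.geometry.x.iloc[0])
print(reprojected.geometry.x.iloc[0])
```

This is why `set_crs(..., allow_override=True)` should only be used when the existing label is wrong or incomplete, and `to_crs` whenever two layers need to share a coordinate system.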

### Using cKDTree to find the nearest stations to each fire

The cKDTree class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point. This is an efficient way of joining data based on distance.

Let's start with first extracting the coordinates.

In [143]:

# Extract coordinates
fire_coords = fire_data.geometry.apply(lambda geom: (geom.centroid.x, geom.centroid.y)).tolist()
station_coords = list_stations.geometry.apply(lambda geom: (geom.x, geom.y)).tolist()



Now we'll build the tree. We're going to use the station coordinates as the reference points, so that each fire coordinate we query is measured against the stations.

In [144]:
# Build KDTree
tree = cKDTree(station_coords)


`.query()` is the method that returns the distance between coordinates. We collect both the distances and the indices so that we can build a new dataframe linking each fire to its stations. `k=5` means that we look up the 5 nearest neighbours, while `p=2` selects the Euclidean distance. Because both layers are in EPSG:4326 at this point, the returned distances are in degrees rather than metres.

In [145]:

# Query for nearest stations
distances, indices = tree.query(fire_coords, k=5,p=2)
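
To see what `query` returns, here is a minimal sketch on made-up 2D points (the arrays `stations_xy` and `fires_xy` are hypothetical). Each row of the output corresponds to one query point, and each column to one of its `k` nearest neighbours, ordered from closest to farthest.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical station and fire coordinates (x, y).
stations_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
fires_xy = np.array([[0.1, 0.0], [4.0, 4.0]])

# Index the stations, then ask for the 2 nearest stations to each fire.
tree = cKDTree(stations_xy)
dist, idx = tree.query(fires_xy, k=2, p=2)

print(dist.shape, idx.shape)  # one row per fire, one column per neighbour
print(idx)                    # positions into stations_xy
```

The values in `idx` are row positions in the array the tree was built from, which is why the station list needs a clean 0..n-1 index before we look the stations up with `.iloc`.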


In [233]:
distances

array([[2.80887921, 2.87810565, 2.88043013, 3.02735922, 3.04142161],
       [2.88839054, 2.96130313, 2.9684149 , 3.10751746, 3.12095583],
       [3.51319008, 3.5134173 , 3.55066469, 3.70396395, 3.71699966],
       ...,
       [1.34498919, 1.6041456 , 2.02475476, 2.27689879, 2.39921112],
       [1.08442195, 1.88584185, 2.23497577, 2.95078277, 3.01156395],
       [1.14448434, 1.9500969 , 2.30103576, 3.0323819 , 3.05382225]])

In [235]:
indices

array([[ 12,   2,   4, 400,   8],
       [ 12,   2,   4, 400,   8],
       [  4,  12,   2,   3, 400],
       ...,
       [145, 141, 144, 101, 108],
       [109,  64,  62,  60, 144],
       [109,  64,  62,  60, 144]])
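
The distances above are in degrees, because the query ran on longitude/latitude values. As a rough sketch of what a degree means on the ground, the haversine formula below gives great-circle distances in kilometres. The function name `haversine_km` and the mean Earth radius of 6371.0088 km are assumptions for illustration, and this is not part of the pipeline in this notebook.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0088):
    """Great-circle distance between two lon/lat points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of latitude vs one degree of longitude at 50 degrees north.
one_deg_lat = haversine_km(-120.0, 50.0, -120.0, 51.0)
one_deg_lon = haversine_km(-120.0, 50.0, -119.0, 50.0)
print(round(one_deg_lat, 1), round(one_deg_lon, 1))
```

At this latitude a degree of longitude is noticeably shorter than a degree of latitude (roughly 71 km against roughly 111 km), so Euclidean distance in degrees weights east-west separation less than north-south separation. Running the nearest-neighbour search on a projected CRS in metres would be one way to avoid that.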

Let's create the dataframe with the stations and fires merged by distance:

In [218]:

# Construct the result dataframe
results = []
# indices[i] is positional, so this assumes fire_data has the default 0..n-1 index
for i, fire in fire_data.iterrows():
    nearest_stations = list_stations.iloc[indices[i]].copy()
    nearest_stations['YEAR']=fire['YEAR']
    nearest_stations['MONTH']=fire['MONTH']
    nearest_stations['fire_index'] = i
    nearest_stations['distance'] = distances[i]
    results.append(nearest_stations)

result_df = pd.concat(results).reset_index(drop=True)

#merge with forest fires data
result_df = result_df.merge(fire_data[['geometry']], left_on='fire_index', right_index=True)
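
The row-by-row loop above is easy to follow but can be slow for tens of thousands of fires. As an alternative sketch (not what was run for the outputs in this notebook), the same fire-station table can be built in one pass by flattening the `indices` and `distances` arrays. The toy frames and arrays below are made up, and the approach assumes the fire table has a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Toy inputs standing in for list_stations, fire_data and the query output.
stations = pd.DataFrame({'Climate ID': ['A1', 'B2', 'C3']})
fires = pd.DataFrame({'YEAR': [2004, 1992], 'MONTH': [6, 7]})
indices = np.array([[0, 2], [1, 0]])          # k=2 nearest stations per fire
distances = np.array([[0.5, 1.5], [0.2, 0.9]])

k = indices.shape[1]

# Flatten row by row: fire 0's neighbours first, then fire 1's, and so on.
pairs = stations.iloc[indices.ravel()].reset_index(drop=True)
pairs['fire_index'] = np.repeat(np.arange(len(fires)), k)
pairs['distance'] = distances.ravel()
pairs['YEAR'] = np.repeat(fires['YEAR'].to_numpy(), k)
pairs['MONTH'] = np.repeat(fires['MONTH'].to_numpy(), k)
print(pairs)
```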

Let's sanity-check the result. We can do that by confirming that a majority of our stations are assigned to at least 1 fire occurrence (384 of the 430 stations appear below):

In [198]:
result_df['Climate ID'].value_counts()

Climate ID
5022125    4916
5021220    4815
5010140    4194
5022575    3755
6020559    3575
           ... 
7028200       1
7012071       1
7014332       1
7016960       1
704FEG0       1
Name: count, Length: 384, dtype: int64

Let's take a sample look at the full table:

In [219]:
result_df

Unnamed: 0,Climate ID,geometry_x,YEAR,MONTH,fire_index,distance,geometry_y
0,1018611,POINT (-123.320 48.410),2004,6,0,2.808879,"POLYGON Z ((-122.16984 45.86643 0.00000, -122...."
1,1012710,POINT (-123.440 48.430),2004,6,0,2.878106,"POLYGON Z ((-122.16984 45.86643 0.00000, -122...."
2,1015105,POINT (-123.560 48.370),2004,6,0,2.880430,"POLYGON Z ((-122.16984 45.86643 0.00000, -122...."
3,1016RM0,POINT (-123.430 48.600),2004,6,0,3.027359,"POLYGON Z ((-122.16984 45.86643 0.00000, -122...."
4,1016940,POINT (-123.420 48.620),2004,6,0,3.041422,"POLYGON Z ((-122.16984 45.86643 0.00000, -122...."
...,...,...,...,...,...,...,...
134370,1181508,POINT (-121.630 55.690),1992,7,26874,1.144484,"MULTIPOLYGON Z (((-122.77133 55.84941 0.00000,..."
134371,1096468,POINT (-122.770 53.880),1992,7,26874,1.950097,"MULTIPOLYGON Z (((-122.77133 55.84941 0.00000,..."
134372,1093474,POINT (-122.700 53.530),1992,7,26874,2.301036,"MULTIPOLYGON Z (((-122.77133 55.84941 0.00000,..."
134373,1090660,POINT (-121.510 53.070),1992,7,26874,3.032382,"MULTIPOLYGON Z (((-122.77133 55.84941 0.00000,..."


In [201]:
#confirming there's no missing data
weather_shp.Date.isna().sum()

0

Let's change the `Date` column to datetime so that we can extract the month and year. 

In [202]:
weather_shp['Date']=pd.to_datetime(weather_shp['Date'])

In [183]:
#extract month and year by creating new columns

weather_shp['MONTH'] = weather_shp['Date'].dt.month
weather_shp['YEAR'] = weather_shp['Date'].dt.year

Finally, we will merge the weather data based on `MONTH`, `YEAR`, and `Climate ID`. We are using an outer merge to also keep the station records that don't have any fires associated with them (e.g. stations with data from winter months).

In [220]:
merged_df = pd.merge(result_df, weather_shp, how='outer', on=['MONTH','YEAR','Climate ID'])
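
To see what the outer merge keeps, here is a minimal sketch on made-up rows (the frames `pairs` and `weather` and their values are hypothetical). Passing `indicator=True` adds a `_merge` column that records whether each row matched on both sides or came from only one of them.

```python
import pandas as pd

# Fire-station pairs (left) and monthly weather records (right), toy values.
pairs = pd.DataFrame({
    'Climate ID': ['A1', 'B2'],
    'YEAR': [2004, 2004],
    'MONTH': [6, 6],
    'fire_index': [0, 0],
})
weather = pd.DataFrame({
    'Climate ID': ['A1', 'A1', 'C3'],
    'YEAR': [2004, 2004, 2004],
    'MONTH': [6, 1, 6],
    'Mean Temp (°C)': [15.2, -3.1, 14.0],
})

merged = pd.merge(
    pairs, weather,
    how='outer',
    on=['MONTH', 'YEAR', 'Climate ID'],
    indicator=True,
)
counts = merged['_merge'].value_counts()
print(counts)
```

In this toy case, `right_only` rows are station-months with weather but no fire, and `left_only` rows are fire-station pairs with no weather record for that month. Both kinds of row show up as missing values after the merge.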

In [221]:
merged_df.sample(20)

Unnamed: 0,Climate ID,geometry_x,YEAR,MONTH,fire_index,distance,geometry_y,Longitude (x),Latitude (y),Date,Mean Temp (°C),Total Snow (cm),Total Precip (mm),geometry
123427,8100468,,2012,6,,,,-64.87,47.8,2012-06-30,,0.0,3.7,POINT (-64.870 47.800)
27634,2402684,,1991,3,,,,-75.14,68.9,1991-03-31,-24.157692,0.232258,0.232258,POINT (-75.140 68.900)
45423,1135126,POINT (-118.770 49.000),1999,4,2876.0,0.640317,"POLYGON Z ((-119.03770 48.41615 0.00000, -119....",-118.77,49.0,1999-04-30,7.5,0.0,1.62,POINT (-118.770 49.000)
23345,5013117,,2013,2,,,,-99.94,50.66,2013-02-28,-13.757143,,0.407143,POINT (-99.940 50.660)
93608,7024250,,2020,5,,,,-71.67,46.33,2020-05-31,,0.016667,1.411111,POINT (-71.670 46.330)
122930,5022791,POINT (-97.170 50.120),2012,6,14085.0,2.498818,"POLYGON Z ((-96.97558 47.62888 0.00000, -96.97...",-97.17,50.12,2012-06-30,18.056667,0.0,2.986667,POINT (-97.170 50.120)
244214,6124127,,2010,10,,,,-81.62,44.17,2010-10-31,10.72381,0.0,2.2,POINT (-81.620 44.170)
36165,1140876,POINT (-118.230 49.020),2011,3,1928.0,7.421061,"POLYGON Z ((-118.54536 41.60680 0.00000, -118....",-118.23,49.02,2011-03-31,3.603448,0.37931,1.868966,POINT (-118.230 49.020)
60921,3053600,POINT (-115.030 51.030),2018,4,6206.0,0.696673,"POLYGON Z ((-115.72213 51.08556 0.00000, -115....",-115.03,51.03,2018-04-30,1.123333,0.806667,1.913333,POINT (-115.030 51.030)
222081,4014040,POINT (-102.730 50.200),1998,9,10062.0,3.967948,"POLYGON Z ((-103.14863 46.25383 0.00000, -103....",-102.73,50.2,1998-09-30,13.723333,0.0,1.06,POINT (-102.730 50.200)


In [208]:
merged_df.shape

(277003, 15)

In [207]:
merged_df.duplicated().sum()

0

In [189]:
merged_df.drop_duplicates(inplace=True)

Let's take a look at the missing data in our new completely merged dataframe:

In [222]:
# Share of missing values per column, as a percentage
na_pct = merged_df.isna().mean() * 100
for column, pct in na_pct.items():
    print(f'{column} has {round(pct, 2)}% missing data')


Climate ID has 0.0% missing data
geometry_x has 51.48% missing data
YEAR has 0.0% missing data
MONTH has 0.0% missing data
fire_index has 51.48% missing data
distance has 51.48% missing data
geometry_y has 51.48% missing data
Longitude (x) has 1.61% missing data
Latitude (y) has 1.61% missing data
Date has 1.61% missing data
Mean Temp (°C) has 15.27% missing data
Total Snow (cm) has 21.93% missing data
Total Precip (mm) has 8.92% missing data
geometry has 1.61% missing data


We can see that all the fire-related columns have about 51% missing data, which is expected as we have weather information for dates that didn't have any fires associated with them. These columns will eventually be dropped, as we only need to know whether or not a fire occurred at that point.

In [191]:
merged_df.dtypes

Climate ID                   object
geometry_x                 geometry
YEAR                          int64
MONTH                         int64
fire_index                  float64
distance                    float64
geometry_y                 geometry
Longitude (x)               float64
Latitude (y)                float64
Date                 datetime64[ns]
Mean Temp (°C)              float64
Total Snow (cm)             float64
Total Precip (mm)           float64
geometry                   geometry
dtype: object

Below, I'll be creating a new column with a `1` label for cases where a fire did occur and a `0` for when no fire occurred. I'll be using `geometry_y` as the column to look at; if the value is missing, no fire occurred, otherwise a fire did occur. This can be done with any column that came from the original fire dataframe.

In [223]:
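# notna() is an explicit missing-value test and does not rely on how `!= None` is evaluated element-wise
merged_df['Fire'] = np.where(merged_df['geometry_y'].notna(),1,0)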
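
The merged table has one row per fire-station pair, so a station can appear several times for the same month if it is close to several fires. As a sketch of the labelling step on made-up rows, and of one possible way to reduce the label to a single value per station-month, consider the following. The `groupby` step is an assumption about later clean-up, not something done in this notebook.

```python
import numpy as np
import pandas as pd

# Toy merged rows: station A1 is linked to two fires in June and none in January.
merged = pd.DataFrame({
    'Climate ID': ['A1', 'A1', 'A1', 'C3'],
    'YEAR': [2004, 2004, 2004, 2004],
    'MONTH': [6, 6, 1, 6],
    'fire_index': [0.0, 3.0, np.nan, np.nan],
})

# 1 where a fire column is populated, 0 where it is missing.
merged['Fire'] = merged['fire_index'].notna().astype(int)

# Optional: one label per station-month (1 if any linked fire exists).
labels = merged.groupby(['Climate ID', 'YEAR', 'MONTH'], as_index=False)['Fire'].max()
print(labels)
```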

In [224]:
merged_df.sample(10)

Unnamed: 0,Climate ID,geometry_x,YEAR,MONTH,fire_index,distance,geometry_y,Longitude (x),Latitude (y),Date,Mean Temp (°C),Total Snow (cm),Total Precip (mm),geometry,Fire
59136,2400660,,2016,4,,,,-66.8,68.47,2016-04-30,,,,POINT (-66.800 68.470),0
192102,3033890,POINT (-112.770 49.700),1998,8,8890.0,0.249442,"POLYGON Z ((-112.84004 49.47974 0.00000, -112....",-112.77,49.7,1998-08-31,20.13871,0.0,0.877419,POINT (-112.770 49.700),1
121629,5031320,POINT (-95.200 49.620),2011,6,13806.0,3.525343,"POLYGON Z ((-96.22184 46.24514 0.00000, -96.22...",-95.2,49.62,2011-06-30,15.996667,0.0,1.823333,POINT (-95.200 49.620),1
183170,8101792,POINT (-66.430 45.830),2018,7,18974.0,1.65227,"POLYGON Z ((-67.28631 44.41708 0.00000, -67.28...",-66.43,45.83,2018-07-31,,,,POINT (-66.430 45.830),1
179481,5022575,POINT (-96.040 49.350),2017,7,16027.0,1.965696,"POLYGON Z ((-96.97515 47.62085 0.00000, -96.97...",-96.04,49.35,2017-07-31,,0.0,1.309677,POINT (-96.040 49.350),1
197421,7025745,,2005,8,,,,-74.05,45.12,2005-08-31,21.187097,0.0,4.029032,POINT (-74.050 45.120),0
237208,7038975,,1994,10,,,,-76.05,46.07,1994-10-31,7.490323,0.0,1.067742,POINT (-76.050 46.070),0
44298,1063298,,1997,4,,,,-130.71,54.57,1997-04-30,7.97,0.033333,6.766667,POINT (-130.710 54.570),0
232711,1140876,POINT (-118.230 49.020),2017,9,4581.0,7.38705,"POLYGON Z ((-118.44802 41.63640 0.00000, -118....",-118.23,49.02,2017-09-30,16.284,0.0,0.552,POINT (-118.230 49.020),1
114309,6103367,POINT (-76.690 44.430),2007,6,19456.0,2.136325,"MULTIPOLYGON Z (((-75.06659 42.97201 0.00000, ...",-76.69,44.43,2007-06-30,19.343333,0.0,2.26,POINT (-76.690 44.430),1


Now we have a table showing whether a fire occurred or not!

Let's export the table as it is above, and we'll clean it up for modelling in the `Modelling.ipynb` notebook!

In [225]:
#exporting for modelling and analysis
merged_df.to_csv('Data/modelling_df.csv')
