# **BDM-3035 Big Data Capstone Project**
Instructor: Meysam Effati

Members:

*   Ann Margaret Silva (C0903604)
*   Antonio Carlos De Mello Mendes (C0866063)
*   Maria Jessa Cruz (C0910329)
*   Prescila Mora (C0896891)
*   Rewant Sharma (C0894265)

Datasets:

*https://cwfis.cfs.nrcan.gc.ca/background/summary/fwi*

*https://cwfis.cfs.nrcan.gc.ca/background/summary/fbp*


# **Wildfire Prediction Data**

The following list describes each column in the GeoDataFrame:

**_id:** Unique identifier for each record in the dataset.

**lat:** Latitude of the fire location.

**lon:** Longitude of the fire location.

**rep_date:** The reported date of the fire.

**source:** Source of the data for the fire event.

**sensor:** The type of sensor used to detect the fire.

**satellite:** The name of the satellite that detected the fire.

**agency:** The agency reporting the fire, such as provincial, territorial, or Parks Canada.

**temp:** Temperature at the fire location.

**rh:** Relative humidity at the fire location.

**ws:** Wind speed at the fire location.

**wd:** Wind direction at the fire location.

**pcp:** Precipitation at the fire location.

**ffmc:** Fine Fuel Moisture Code, part of the Canadian Forest Fire Weather Index (FWI) System, indicating the moisture content of surface litter and other cured fine fuels.

**dmc:** Duff Moisture Code, part of the FWI System, representing the average moisture content of loosely compacted organic layers of moderate depth.

**dc:** Drought Code, part of the FWI System, indicating the moisture content of deep, compact organic layers.

**isi:** Initial Spread Index, part of the FWI System, indicating the rate of spread based on the FFMC and wind speed.

**bui:** Buildup Index, part of the FWI System, combining the DMC and DC to indicate the total amount of fuel available for combustion.

**fwi**: Fire Weather Index, a comprehensive rating of fire intensity.

**fuel:** Type of fuel present at the fire location.

**ros:** Rate of spread of the fire.

**sfc:** Surface fuel consumption, representing the amount of fuel consumed at the surface level.

**tfc**: Total fuel consumption, representing the total amount of fuel consumed during the fire.

**bfc**: Below-ground fuel consumption, representing the amount of fuel consumed below ground level.

**hfi**: Head fire intensity, indicating the intensity of the leading edge of the fire.

**cfb**: Crown fraction burned, indicating the proportion of the crown area burned.

**pcuring**: Percent curing, indicating the proportion of dead material in grass fuels.

**greenup**: The state of vegetation green-up, indicating how much the vegetation has recovered or greened up.

**elev**: Elevation of the fire location.

**sfl**: Surface fireline intensity, indicating the intensity of the fire along the surface.

**cfl**: Crown fireline intensity, indicating the intensity of the fire in the crown of trees.

**tfc0**: Initial total fuel consumption, representing the amount of fuel initially consumed during the fire.

**ecozone**: Ecozone of the fire location, providing information about the ecological zone.

**sfc0**: Initial surface fuel consumption, representing the initial amount of surface fuel consumed.

**cbh**: Canopy base height, indicating the height above ground level where the canopy begins.

These columns represent various attributes related to fire incidents, fire weather, fuel conditions, and fire behavior as collected by Canadian fire management agencies and other sources.
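
As an orientation aid, the columns described above can be grouped by theme. The grouping below is only a sketch: the group names and the helper `missing_documented_columns` are hypothetical and are not part of the dataset or of any CWFIS tooling.

```python
# Hypothetical thematic grouping of the columns described above.
COLUMN_GROUPS = {
    "location": ["lat", "lon", "elev", "ecozone"],
    "detection": ["rep_date", "source", "sensor", "satellite", "agency"],
    "weather": ["temp", "rh", "ws", "wd", "pcp"],
    "fwi_system": ["ffmc", "dmc", "dc", "isi", "bui", "fwi"],
    "fire_behavior": ["fuel", "ros", "sfc", "tfc", "bfc", "hfi", "cfb",
                      "pcuring", "greenup", "sfl", "cfl", "tfc0", "sfc0", "cbh"],
}


def missing_documented_columns(available_columns):
    """Return, per group, the documented columns absent from `available_columns`."""
    available = set(available_columns)
    missing = {}
    for group, names in COLUMN_GROUPS.items():
        absent = [name for name in names if name not in available]
        if absent:
            missing[group] = absent
    return missing


# Example with a frame that only carries a handful of columns.
example = missing_documented_columns(["lat", "lon", "elev", "ecozone", "temp", "rh"])
print(example["weather"])
```

Calling the helper with `geo_wfp.columns` after loading would show which documented fields a given year's collection lacks.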

In [None]:
!pip install pymongo
# Import Libraries
import os
from pymongo import MongoClient
import pandas as pd
import geopandas as gpd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Define the database name and the MongoDB connection string
db_name = 'wildfire_db_2020_2023'
mongo_uri = 'mongodb+srv://wildfire:F1reCanada@wildfirecluster.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000'
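
A connection string written directly into a notebook is visible to anyone who can read the file. A minimal sketch of reading the same two settings from environment variables instead is shown below. The variable names `WILDFIRE_MONGO_URI` and `WILDFIRE_DB_NAME` and the helper `read_mongo_settings` are assumptions made for this example.

```python
import os


def read_mongo_settings(env=os.environ):
    """Read connection settings from a mapping of environment variables.

    WILDFIRE_MONGO_URI and WILDFIRE_DB_NAME are hypothetical variable names.
    """
    uri = env.get("WILDFIRE_MONGO_URI")
    if not uri:
        raise RuntimeError("Set WILDFIRE_MONGO_URI before connecting")
    # Fall back to the database name used in this notebook.
    name = env.get("WILDFIRE_DB_NAME", "wildfire_db_2020_2023")
    return uri, name


# Usage with an explicit mapping, so the example runs without real credentials.
demo_env = {"WILDFIRE_MONGO_URI": "mongodb://localhost:27017"}
demo_uri, demo_db = read_mongo_settings(demo_env)
print(demo_db)
```

Passing the mapping as a parameter keeps the helper easy to exercise without touching the real environment.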

In [None]:
# List of years to read GeoJSON files from 2020 to 2022
years = range(2020, 2023)

In [None]:
# Connect to MongoDB
client = MongoClient(mongo_uri)
db = client[db_name]

In [None]:
# Create a function to load data from MongoDB into a GeoDataFrame
def load_data_from_mongodb(collection_name):
    collection = db[collection_name]
    data = list(collection.find())
    df = pd.DataFrame(data)
    return gpd.GeoDataFrame(df)

In [None]:
# Create an empty list to store GeoDataFrames
geojson_final_data = []

In [None]:
# Load the data for each year and append it to the list
for year in years:
    collection_name = f"wildfire_collection_{year}"
    gdf = load_data_from_mongodb(collection_name)
    geojson_final_data.append(gdf)

In [None]:
# Concatenate the list of GeoDataFrames into one GeoDataFrame
geo_wfp = gpd.GeoDataFrame(pd.concat(geojson_final_data, ignore_index=True))

In [None]:
geo_wfp

# **Data Cleaning**

In [None]:
# Data Inspection
print("\nDescribe the GeoDataFrame:")
geo_wfp.describe()

In [None]:
# Check for missing values
print("\nMissing values in the GeoDataFrame:")
geo_wfp.isnull().sum()

In [None]:
# Check for columns with more than 50% null values and drop them
threshold = 0.5 * len(geo_wfp)
columns_to_drop = geo_wfp.columns[geo_wfp.isnull().sum() > threshold]
print(f"\nColumns with more than 50% null values, which will be dropped: {list(columns_to_drop)}")

In [None]:
geo_wfp.drop(columns=columns_to_drop, inplace=True)
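
The 50% rule used above can be checked on a toy frame. This is a sketch under stated assumptions: `mostly_null_columns` is a hypothetical helper and the toy values are invented. As in the notebook, the comparison is strict, so a column that is exactly half empty is kept.

```python
import numpy as np
import pandas as pd


def mostly_null_columns(df, max_null_fraction=0.5):
    """Return the columns whose share of missing values exceeds the limit."""
    null_fraction = df.isnull().mean()  # mean of booleans = share of nulls
    return list(null_fraction[null_fraction > max_null_fraction].index)


toy = pd.DataFrame({
    "temp": [21.0, 25.5, 19.0, 30.1],
    "cbh": [np.nan, np.nan, np.nan, 2.0],  # 75% missing
    "rh": [40.0, np.nan, np.nan, 35.0],    # exactly 50% missing
})
to_drop = mostly_null_columns(toy)
print(to_drop)
```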

In [None]:
geo_wfp

In [None]:
# Check again the columns with null values
geo_wfp.isnull().sum()

In [None]:
# Data Description
print("\nData Types and Missing Data:")
geo_wfp.info()

In [None]:
# Convert the agency, fuel, and ecozone columns to strings so they are treated as categorical values
geo_wfp[["agency", "fuel", "ecozone"]] = geo_wfp[["agency", "fuel", "ecozone"]].astype("str")
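
Depending on the pandas version, converting a column with `astype("str")` can turn missing values into literal text such as `nan`, which then looks like a real category. One possible alternative is the `category` dtype, which keeps missing entries as missing. The toy values below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "agency": ["AB", "BC", np.nan, "AB"],
    "fuel": ["C2", "C2", "D1", np.nan],
})

# `category` stores each label once and leaves missing entries as missing.
as_category = toy.astype("category")

print(as_category["agency"].cat.categories.tolist())
print(int(as_category["agency"].isna().sum()))
```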

In [None]:
geo_wfp.info()

In [None]:
# Check for duplicate entries
geo_wfp.duplicated().sum()

In [None]:
# Drop the duplicates
geo_wfp.drop_duplicates(inplace=True)
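
Because every MongoDB document carries its own `_id`, a duplicate check across all columns may never find a match even when the fire attributes repeat. The sketch below compares rows on every column except `_id`. The toy frame and its string identifiers are stand-ins, and whether repeated detections should count as duplicates is a judgment call for this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "_id": ["a1", "a2", "a3"],  # stand-ins for MongoDB ObjectIds
    "lat": [55.1, 55.1, 60.4],
    "lon": [-112.3, -112.3, -120.9],
    "rep_date": ["2021-07-01", "2021-07-01", "2021-07-02"],
})

# Compare every column except the storage key.
compare_on = [col for col in toy.columns if col != "_id"]

with_id = int(toy.duplicated().sum())                      # `_id` included
without_id = int(toy.duplicated(subset=compare_on).sum())  # `_id` ignored
deduplicated = toy.drop_duplicates(subset=compare_on)
print(with_id, without_id, len(deduplicated))
```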

In [None]:
# Display cleaned GeoDataFrame info
print("\nCleaned GeoDataFrame Info:")
geo_wfp.info()

In [None]:
geo_wfp

# **Exploratory Data Analysis (EDA)**

In [None]:
# Select columns for visualization
columns_to_visualize = ['ws', 'pcp', 'dmc', 'dc', 'ros', 'hfi', 'cfl', 'tfc0']

# Calculate the number of rows and columns for the subplot grid
num_columns = len(columns_to_visualize)
num_rows = int(np.ceil(num_columns / 2))

# Create a subplot grid
fig, axes = plt.subplots(num_rows, 2, figsize=(12, 10))

# Flatten the 2D array of axes into a 1D array
axes = axes.flatten()

# Iterate over columns and create boxplots
for i, column in enumerate(columns_to_visualize):
    sns.boxplot(x=geo_wfp[column], ax=axes[i], color='skyblue')
    axes[i].set_title(f'Boxplot of {column}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

plt.tight_layout()
plt.show()
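
The points drawn beyond the whiskers in these boxplots follow the default rule of 1.5 times the interquartile range. A small sketch of that rule, with a hypothetical helper `iqr_outlier_bounds` and invented wind-speed values, makes the cut-offs explicit:

```python
import pandas as pd


def iqr_outlier_bounds(values, whisker=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


toy_ws = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0, 60.0])
low, high = iqr_outlier_bounds(toy_ws)
outliers = toy_ws[(toy_ws < low) | (toy_ws > high)]
print(outliers.tolist())
```

The same helper could be applied to each entry of `columns_to_visualize` to count how many records fall outside the whiskers.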

In [None]:
!pip install plotly
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import numpy as np

# Select columns for visualization
columns_to_visualize = ['ws', 'pcp', 'dmc', 'dc', 'ros', 'hfi', 'cfl', 'tfc0']

# Calculate the number of rows and columns for the subplot grid
num_columns = 2
num_rows = int(np.ceil(len(columns_to_visualize) / num_columns))

# Create a subplot grid
fig = make_subplots(rows=num_rows, cols=num_columns, subplot_titles=columns_to_visualize)

# Iterate over columns and create boxplots
for i, column in enumerate(columns_to_visualize):
    row = i // num_columns + 1
    col = i % num_columns + 1
    fig.add_trace(
        go.Box(y=geo_wfp[column], name=column, marker_color=px.colors.qualitative.Plotly[i % len(px.colors.qualitative.Plotly)]),
        row=row, col=col
    )

# Update layout
fig.update_layout(
    title_text='Boxplots of Selected Variables',
    height=1200,
    showlegend=False,
    title_x=0.5
)

fig.show()

In [None]:
# Analyze the 'source' column
geo_wfp['source']

In [None]:
print("\nSource column analysis:")
source_counts = geo_wfp['source'].value_counts()
source_counts

In [None]:
sns.countplot(y=geo_wfp['source'])
plt.title('Source Column Distribution')
plt.show()

**USFS (United States Forest Service):** The USFS is the most common source in the dataset, with over 120,000 records. The data is collected from a variety of sensors, including IBAND, MODIS, and VIIRS-I, and from multiple satellites such as JPSS1, Terra, Aqua, S-NPP, and NOAA-20. This indicates a broad and diverse data collection effort by the USFS.

**NASA3:** The second-largest contributor, with 52,940 records. The data comes from the NOAA-20 satellite using IBAND and VIIRS-I sensors, which points to a focused data source tied to a single satellite.

**NASA2:** This source contributes a substantial number of records (40,285). The data is primarily collected using IBAND and VIIRS-I sensors from the S-NPP satellite. This suggests a focused source of data from a specific NASA satellite mission.

**NASA6:** This source contributes 33,535 records, collected using IBAND and VIIRS-I sensors from the S-NPP satellite. Like NASA2, this suggests a focused data source from a specific NASA satellite mission.

**NASA7:** This source has 29,541 records. The data is collected from the NOAA-20 satellite using IBAND and VIIRS-I sensors, indicating a similar data collection effort to NASA3.

**NOAA (National Oceanic and Atmospheric Administration):** NOAA contributes 19,932 records, with data collected using various sensors including AVHRR, VIIRS, VIIRS-M, and MODIS from multiple satellites such as METOP-A, S-NPP, NOAA-19, NOAA-15, METOP-B, and NOAA-18. This indicates a diverse data collection effort from multiple satellites.
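
A per-source summary like the one described above can be produced in a single table. The sketch below uses an invented toy frame. The column names match the dataset, but the rows and the output column names (`records`, `sensors`, `satellites`) are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "source": ["USFS", "USFS", "USFS", "NOAA", "NOAA", "NASA2"],
    "sensor": ["MODIS", "VIIRS-I", "MODIS", "AVHRR", "AVHRR", "VIIRS-I"],
    "satellite": ["Terra", "S-NPP", "Aqua", "NOAA-19", "METOP-B", "S-NPP"],
})

# One row per source: record count plus the distinct sensors and satellites.
summary = (
    toy.groupby("source")
       .agg(records=("sensor", "size"),
            sensors=("sensor", lambda s: ", ".join(sorted(s.unique()))),
            satellites=("satellite", lambda s: ", ".join(sorted(s.unique()))))
       .sort_values("records", ascending=False)
)
print(summary)
```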

In [None]:
import plotly.express as px
import pandas as pd

# Count the occurrences of each unique value in the 'source' column
source_counts = geo_wfp['source'].value_counts().reset_index()
source_counts.columns = ['source', 'count']

# Create a bar plot using Plotly
fig = px.bar(
    source_counts,
    x='count',
    y='source',
    color='source',
    orientation='h',
    labels={'count': 'Count', 'source': 'Source'},
    title='Source Column Distribution',
    color_discrete_sequence=px.colors.qualitative.Plotly
)

# Update layout to order the bars by total count
fig.update_layout(
    yaxis={'categoryorder':'total ascending'},
    showlegend=False
)

fig.show()

In [None]:
# Define the sources to investigate
sources_to_investigate = ['USFS', 'NASA2', 'NASA3', 'NASA6', 'NASA7', 'NOAA']

# Function to investigate sensor and satellite information for a specific source
def investigate_source(source, df):
    source_data = df[df['source'] == source]
    unique_sensors = source_data['sensor'].unique()
    unique_satellites = source_data['satellite'].unique()

    print(f"\nSource: {source}")
    print(f"Number of records: {len(source_data)}")
    print("Unique sensors:")
    print(unique_sensors)
    print("Unique satellites:")
    print(unique_satellites)
    print("First 5 entries:")
    print(source_data.head())
    print("-" * 50)

# Iterate over each source and investigate
for source in sources_to_investigate:
    investigate_source(source, geo_wfp)

In [None]:
'''# Save the cleaned data to wildfire_cleandata.csv
cleaned_csv_path = 'wildfire_cleandata.csv'
geo_wfp.to_csv(cleaned_csv_path, index=False)
print(f"Cleaned data saved to {cleaned_csv_path}")'''

In [None]:
import plotly.express as px
import geopandas as gpd
import pandas as pd
# Convert to a GeoDataFrame
gdf = gpd.GeoDataFrame(geo_wfp, geometry=gpd.points_from_xy(geo_wfp.lon, geo_wfp.lat))

# Set CRS to WGS84 (EPSG:4326) if not already set
gdf.set_crs(epsg=4326, inplace=True)

# Filter data for Canadian coordinates (approximate bounds)
gdf_canada = gdf.cx[-141:-52, 41:84]  # Longitude range for Canada, Latitude range for Canada

# Plot fire locations colored by temperature in Canada
fig_temp = px.scatter_mapbox(
    gdf_canada,
    lat=gdf_canada.geometry.y,
    lon=gdf_canada.geometry.x,
    color='temp',
    color_continuous_scale='OrRd',
    size_max=5,
    zoom=3,
    mapbox_style="carto-positron",
    title='Fire Locations Colored by Temperature in Canada'
)
fig_temp.update_layout(
    xaxis_title='Longitude',
    yaxis_title='Latitude'
)
fig_temp.show()

# Plot fire locations colored by Fire Weather Index (FWI) in Canada
fig_fwi = px.scatter_mapbox(
    gdf_canada,
    lat=gdf_canada.geometry.y,
    lon=gdf_canada.geometry.x,
    color='fwi',
    color_continuous_scale='YlGnBu',
    size_max=5,
    zoom=3,
    mapbox_style="carto-positron",
    title='Fire Locations Colored by Fire Weather Index (FWI) in Canada'
)
fig_fwi.update_layout(
    xaxis_title='Longitude',
    yaxis_title='Latitude'
)
fig_fwi.show()



In [None]:
# Plot histograms for key numerical variables
variables = ['temp', 'rh', 'ws', 'pcp', 'ffmc', 'dmc', 'dc', 'isi', 'bui', 'fwi']
gdf[variables].hist(bins=30, figsize=(15, 10))
plt.suptitle('Histograms of Key Variables')
plt.show()

In [None]:
# Scatter plot of temperature vs. Fire Weather Index (FWI)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=gdf, x='temp', y='fwi')
plt.title('Temperature vs. Fire Weather Index (FWI)')
plt.xlabel('Temperature (°C)')
plt.ylabel('Fire Weather Index (FWI)')
plt.show()


In [None]:
# Scatter plot of wind speed vs. Initial Spread Index (ISI)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=gdf, x='ws', y='isi')
plt.title('Wind Speed vs. Initial Spread Index (ISI)')
plt.xlabel('Wind Speed (km/h)')
plt.ylabel('Initial Spread Index (ISI)')
plt.show()


In [None]:
# Compute the correlation matrix
corr_matrix = gdf[variables].corr()

# Generate a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Key Variables')
plt.show()
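
A heatmap shows every pair twice and can be hard to scan. As a complement, the strongest relationships can be listed directly from the correlation matrix. The helper `strongest_pairs` and the toy values below are hypothetical. With the notebook's data the input would be `corr_matrix`.

```python
import numpy as np
import pandas as pd


def strongest_pairs(corr, top=3):
    """List the variable pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


toy = pd.DataFrame({
    "temp": [10.0, 15.0, 20.0, 25.0, 30.0],
    "rh": [80.0, 70.0, 55.0, 45.0, 30.0],
    "ws": [12.0, 5.0, 20.0, 8.0, 15.0],
})
top_pairs = strongest_pairs(toy.corr(), top=1)
print(top_pairs)
```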

In [None]:
#Boxplot of FWI by Fuel Type
plt.figure(figsize=(12, 6))
geo_wfp.boxplot(column='fwi', by='fuel', grid=False, patch_artist=True, boxprops=dict(facecolor='skyblue'))
plt.title('Boxplot of FWI by Fuel Type')
plt.suptitle('')
plt.xlabel('Fuel Type')
plt.ylabel('Fire Weather Index (FWI)')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Line plot of FWI over time
plt.figure(figsize=(14, 7))
geo_wfp_sorted = geo_wfp.sort_values('rep_date')
plt.plot(geo_wfp_sorted['rep_date'], geo_wfp_sorted['fwi'], color='skyblue')
plt.title('Line Plot of FWI over Time')
plt.xlabel('Date')
plt.ylabel('Fire Weather Index (FWI)')
plt.xticks(rotation=45)
plt.show()
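
A line through every individual detection tends to be noisy, since many detections share a date. One option is to average FWI per day before plotting. The sketch below assumes `rep_date` can be parsed as a timestamp, which the notebook does not verify, and uses invented values.

```python
import pandas as pd

toy = pd.DataFrame({
    "rep_date": ["2021-07-01 10:00:00", "2021-07-01 18:30:00", "2021-07-02 09:15:00"],
    "fwi": [10.0, 20.0, 30.0],
})

# Assumes rep_date parses as a timestamp; unparseable values become NaT.
toy["rep_date"] = pd.to_datetime(toy["rep_date"], errors="coerce")

# Average FWI per calendar day.
daily_fwi = (
    toy.dropna(subset=["rep_date", "fwi"])
       .set_index("rep_date")["fwi"]
       .resample("D")
       .mean()
)
print(daily_fwi)
```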

In [None]:
#violin plot for fwi and fuel type
plt.figure(figsize=(12, 6))
sns.violinplot(x='fuel', y='fwi', data=geo_wfp, palette='coolwarm')
plt.title('Violin Plot of FWI by Fuel Type')
plt.xlabel('Fuel Type')
plt.ylabel('Fire Weather Index (FWI)')
plt.xticks(rotation=90)
plt.show()

In [None]:
#Density Plot of Fire Weather Index (FWI)
plt.figure(figsize=(10, 6))
sns.kdeplot(geo_wfp['fwi'], fill=True, color='skyblue')
plt.title('Density Plot of Fire Weather Index (FWI)')
plt.xlabel('Fire Weather Index (FWI)')
plt.ylabel('Density')
plt.show()

In [None]:
!pip install cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from scipy.interpolate import griddata

# Define the geographical boundaries of Canada
canada_bounds = {
    'min_lat': 41.0,
    'max_lat': 83.0,
    'min_lon': -141.0,
    'max_lon': -52.0
}

# Filter the data for Canada
geo_wfp_canada = geo_wfp[
    (geo_wfp['lat'] >= canada_bounds['min_lat']) &
    (geo_wfp['lat'] <= canada_bounds['max_lat']) &
    (geo_wfp['lon'] >= canada_bounds['min_lon']) &
    (geo_wfp['lon'] <= canada_bounds['max_lon'])
]

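# Extract the latitude, longitude, and FWI values for Canada,
# skipping rows with a missing FWI so the interpolation is not fed NaN values
fwi_points = geo_wfp_canada.dropna(subset=['lat', 'lon', 'fwi'])
lat = fwi_points['lat'].values
lon = fwi_points['lon'].values
fwi = fwi_points['fwi'].values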

# Define the grid size
grid_size = 100

# Create grid coordinates
grid_lon, grid_lat = np.meshgrid(
    np.linspace(lon.min(), lon.max(), grid_size),
    np.linspace(lat.min(), lat.max(), grid_size)
)

# Interpolate FWI values on the grid
grid_fwi = griddata((lon, lat), fwi, (grid_lon, grid_lat), method='cubic')

# Plot the contour map with world outline
fig = plt.figure(figsize=(12, 8))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent([canada_bounds['min_lon'], canada_bounds['max_lon'], canada_bounds['min_lat'], canada_bounds['max_lat']], crs=ccrs.PlateCarree())

# Add contour plot
contour = ax.contourf(grid_lon, grid_lat, grid_fwi, cmap='YlGnBu', levels=15, transform=ccrs.PlateCarree())
plt.colorbar(contour, label='Fire Weather Index (FWI)')

# Add scatter plot of data points
ax.scatter(lon, lat, c=fwi, cmap='YlGnBu', edgecolor='k', s=20, transform=ccrs.PlateCarree())

# Add coastlines and borders
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')

# Add title and labels
ax.set_title('Contour Map of Fire Weather Index (FWI) in Canada')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')

plt.show()
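
The bounding-box filter used in the cell above can be wrapped in a small function. This is a sketch with a hypothetical helper name and invented points. It also shows a limitation of the approach: a rectangle from 41°N to 83°N covers parts of the northern United States, so a point near Seattle passes the filter.

```python
import pandas as pd


def filter_bounding_box(df, min_lat, max_lat, min_lon, max_lon):
    """Keep rows whose lat/lon fall inside the inclusive bounding box."""
    inside = df["lat"].between(min_lat, max_lat) & df["lon"].between(min_lon, max_lon)
    return df[inside]


toy = pd.DataFrame({
    "place": ["Edmonton", "Los Angeles", "Seattle"],
    "lat": [53.5, 34.0, 47.6],
    "lon": [-113.5, -118.2, -122.3],
})
in_box = filter_bounding_box(toy, min_lat=41.0, max_lat=83.0, min_lon=-141.0, max_lon=-52.0)
print(in_box["place"].tolist())
```

A stricter filter would need the actual national boundary, for example a polygon from a boundary dataset, rather than a rectangle.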


In [None]:
# Attach point geometry explicitly so the plots do not depend on an earlier cell
points_canada = gpd.GeoDataFrame(
    geo_wfp_canada.copy(),
    geometry=gpd.points_from_xy(geo_wfp_canada['lon'], geo_wfp_canada['lat']),
    crs='EPSG:4326'
)

# Plot fire locations colored by temperature
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
points_canada.plot(column='temp', ax=ax, legend=True, cmap='OrRd', markersize=5)
ax.set_title('Fire Locations Colored by Temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

# Plot fire locations colored by Fire Weather Index (FWI)
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
points_canada.plot(column='fwi', ax=ax, legend=True, cmap='YlGnBu', markersize=5)
ax.set_title('Fire Locations Colored by Fire Weather Index (FWI)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

In [None]:
# resolution='i' needs the separate high-resolution data package
!pip install basemap basemap-data-hires
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Plot heatmap
fig, ax = plt.subplots(figsize=(12, 8))
m = Basemap(projection='merc', llcrnrlat=41.68, urcrnrlat=83.11, llcrnrlon=-141.00, urcrnrlon=-52.62, resolution='i', ax=ax)
m.drawcoastlines()
m.drawcountries()
m.drawstates()

# Count fires on a regular longitude/latitude grid
heatmap, xedges, yedges = np.histogram2d(geo_wfp_canada['lon'], geo_wfp_canada['lat'], bins=50)

# Project the bin edges into map coordinates so the heatmap lines up with the basemap
lon_edges, lat_edges = np.meshgrid(xedges, yedges)
x_edges, y_edges = m(lon_edges, lat_edges)
mesh = m.pcolormesh(x_edges, y_edges, heatmap.T, cmap='hot', alpha=0.6)
plt.colorbar(mesh, label='Number of Fires')

plt.title('Heatmap of Fire Locations in Canada')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
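
The array returned by `np.histogram2d` has the first variable along its first axis, which is why the cell above transposes it before drawing. A small sketch with invented coordinates shows the shapes involved:

```python
import numpy as np

lon = np.array([-120.0, -119.5, -80.0, -79.0, -60.0])
lat = np.array([50.0, 50.2, 45.0, 45.5, 47.0])

# Three longitude bins and two latitude bins.
counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=[3, 2])

# counts has shape (lon bins, lat bins); transposing gives an image-style
# array whose rows follow latitude and whose columns follow longitude.
image = counts.T
print(counts.shape, image.shape, int(counts.sum()))
```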


In [None]:
import folium

# Create a map centered at the mean latitude and longitude of Canada
m = folium.Map(location=[geo_wfp_canada['lat'].mean(), geo_wfp_canada['lon'].mean()], zoom_start=4)

# Add fire locations
for _, row in geo_wfp_canada.iterrows():
    folium.CircleMarker(location=[row['lat'], row['lon']], radius=3, color='red', fill=True).add_to(m)

# Save the map to an HTML file
m.save('canada_fire_locations.html')
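
Adding one marker per row can make the saved HTML file very large when the frame holds hundreds of thousands of detections. One possible mitigation is to plot a reproducible random sample. The helper `sample_for_markers` and its default limit are hypothetical choices for this sketch, not values taken from the notebook.

```python
import pandas as pd


def sample_for_markers(df, max_points=2000, seed=42):
    """Return at most `max_points` rows, sampled reproducibly."""
    if len(df) <= max_points:
        return df
    return df.sample(n=max_points, random_state=seed)


toy = pd.DataFrame({
    "lat": [50.0 + i * 0.1 for i in range(10)],
    "lon": [-120.0 + i * 0.1 for i in range(10)],
})
subset = sample_for_markers(toy, max_points=4)
print(len(subset))
```

The loop over `geo_wfp_canada.iterrows()` could then run over the sampled frame instead of the full one.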

In [None]:
# Density plot
plt.figure(figsize=(12, 8))
sns.kdeplot(data=geo_wfp_canada, x='lon', y='lat', fill=True, cmap='magma', levels=50, thresh=0.1)
plt.title('Density Plot of Fire Locations in Canada')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()