# Data Science Final Project 


**College/University Name**: _CICCC - Cornerstone International Community College of Canada_  
**Course**: _Final Project_  
**Instructor**: _Derrick Park_  
**Student Name**: _Amir Lima Oliveira_  
**Submission Date**: _2025-09-26_  

---

### Project Title
_Wildfire Restoration Priority Classification in Canada_

---

#### Objective
Find, structure, and analyse NASA satellite datasets of wildfire detection points, link them to satellite imagery, and engineer area-level parameters to identify which wildfire-affected areas need priority restoration.

### Problem Statement or Research Question
This project aims to help direct restoration resources efficiently by using a data-driven machine learning model to identify the most critical areas.

---

#### Dataset Overview
- **Source:** [Dataset URL or name]
- **Description:** Short explanation of the dataset (e.g., features, size, context)
- **Credits:** Cite source or dataset author if required

---

## Table of Contents


1. [Import Libraries](#import-libraries)  


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio as rio
import fiona
from rasterio.plot import show
import shapely.geometry as geom
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import urllib.request # to download the watershed gdb file
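
The `urllib.request` import above is noted as being used to download the watershed `.gdb` file. A minimal sketch of such a download step is shown here, assuming a helper named `download_if_missing` (hypothetical, not part of the original notebook) that skips the download when the file is already on disk. The URL in the usage comment is a placeholder, not the real data source.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless a file already exists at `dest`."""
    dest_path = Path(dest)
    if dest_path.exists():
        # Reuse the local copy instead of downloading again.
        return dest_path
    # Create the target folder if needed, then fetch the file.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest_path))
    return dest_path


# Usage (placeholder URL, replace with the actual download link):
# download_if_missing(
#     "https://example.com/FWA_WATERSHEDS_POLY.gdb.zip",
#     "../data_raw/watershed/FWA_WATERSHEDS_POLY.gdb.zip",
# )
```

Checking for an existing file first keeps repeated notebook runs from fetching a large archive every time.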

---

2. [Load & Inspect Data](#load--inspect-data)  


In [None]:
# Load the watershed GeoPackage into a GeoDataFrame
watershed = gpd.read_file('../data_raw/watershed/watersheds_bc.gpkg')



The code below was needed to convert the `.gdb` file into a GeoPackage (`.gpkg`) so that the geographic data could be loaded into GeoPandas.

In [5]:
# import fiona
# import geopandas as gpd
# import pandas as pd
#
# def load_and_merge_watersheds(gdb_path, save_path=None):
#     """
#     Load and merge all watershed layers from a GDB into a single GeoDataFrame.
#
#     Parameters:
#         gdb_path (str): Path to the .gdb file.
#         save_path (str, optional): Path to save the merged GeoDataFrame as GeoPackage.
#
#     Returns:
#         gpd.GeoDataFrame: Merged watersheds
#     """
#     # List all layers
#     layers = fiona.listlayers(gdb_path)
#     print("Layers found:", layers)
#
#     # Filter out backup and helper layers
#     layers = [name for name in layers if not name.startswith("_")]
#     print("Using layers:", layers)
#
#     # Read and concatenate
#     gdfs = []
#     for layer in layers:
#         print(f"Reading {layer}...")
#         gdf = gpd.read_file(gdb_path, layer=layer)
#         gdfs.append(gdf)
#
#     watersheds = pd.concat(gdfs, ignore_index=True)
#     print(f"Merged {len(layers)} layers into {len(watersheds)} features.")
#
#     # Save if requested
#     if save_path:
#         watersheds.to_file(save_path, driver="GPKG")
#         print(f"Saved merged watersheds to {save_path}")
#
#     return watersheds
#
#
# # Example usage:
# gdb_file = r"../data_raw/watershed/FWA_WATERSHEDS_POLY.gdb"
# save_file = r"../data_raw/watershed/watersheds_bc.gpkg"
#
# watersheds_bc = load_and_merge_watersheds(gdb_file, save_file)
# print(watersheds_bc.crs)


In [6]:
print(watershed.head())


   WATERSHED_FEATURE_ID  WATERSHED_GROUP_ID WATERSHED_TYPE  GNIS_ID_1  \
0             7959831.0               170.0           None        NaN   
1             8891591.0               170.0           None        NaN   
2            10662283.0               170.0           None        NaN   
3             7945915.0               170.0           None        NaN   
4            10143024.0               170.0           None        NaN   

  GNIS_NAME_1  GNIS_ID_2 GNIS_NAME_2  GNIS_ID_3 GNIS_NAME_3  WATERBODY_ID  \
0        None        NaN        None        NaN        None           NaN   
1        None        NaN        None        NaN        None           NaN   
2        None        NaN        None        NaN        None           NaN   
3        None        NaN        None        NaN        None           NaN   
4        None        NaN        None        NaN        None           NaN   

   ...  ASPECT_NORTH  ASPECT_SOUTH ASPECT_WEST ASPECT_EAST ASPECT_FLAT  \
0  ...           NaN    

   - [Shape](#shape)  

In [7]:
watershed.shape


(3243400, 38)

   - [Missing Values](#missing-values)  


In [8]:
watershed.isnull().sum()

WATERSHED_FEATURE_ID               0
WATERSHED_GROUP_ID                 0
WATERSHED_TYPE               3243400
GNIS_ID_1                    3243400
GNIS_NAME_1                  3243400
GNIS_ID_2                    3243400
GNIS_NAME_2                  3243400
GNIS_ID_3                    3243400
GNIS_NAME_3                  3243400
WATERBODY_ID                 3243400
WATERBODY_KEY                      1
WATERSHED_KEY                      0
FWA_WATERSHED_CODE                 0
LOCAL_WATERSHED_CODE               0
WATERSHED_GROUP_CODE               0
LEFT_RIGHT_TRIBUTARY         3243356
WATERSHED_ORDER                    0
WATERSHED_MAGNITUDE                0
LOCAL_WATERSHED_ORDER              1
LOCAL_WATERSHED_MAGNITUDE          1
AREA_HA                            0
RIVER_AREA                   3243400
LAKE_AREA                    3243400
WETLAND_AREA                 3243400
MANMADE_AREA                 3243400
GLACIER_AREA                 3243400
AVERAGE_ELEVATION            3243400
A
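
The null counts above show that many columns are missing in every one of the 3,243,400 rows. Rather than reading them off by eye, the fully empty columns can be listed programmatically. The sketch below uses a small made-up frame in place of `watershed`; the helper name `fully_null_columns` is an assumption for illustration.

```python
import numpy as np
import pandas as pd


def fully_null_columns(df: pd.DataFrame) -> list:
    """Return the names of columns in which every value is missing."""
    return [col for col in df.columns if df[col].isna().all()]


# Toy frame that mimics the pattern in the output above (values are made up).
toy = pd.DataFrame(
    {
        "WATERSHED_FEATURE_ID": [1.0, 2.0, 3.0],
        "WATERSHED_TYPE": [None, None, None],
        "RIVER_AREA": [np.nan, np.nan, np.nan],
        "AREA_HA": [0.5, 1.2, 3.4],
    }
)

empty_cols = fully_null_columns(toy)      # columns with no data at all
trimmed = toy.drop(columns=empty_cols)    # keep only columns that carry data
print(empty_cols)
print(list(trimmed.columns))
```

The same call on the real `watershed` GeoDataFrame should work unchanged, since `isna()` is also defined for geometry columns.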

   - [Data Types](#data-types)  


In [9]:
watershed.describe()

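|   | WATERSHED_FEATURE_ID | WATERSHED_GROUP_ID | GNIS_ID_1 | GNIS_ID_2 | GNIS_ID_3 | WATERBODY_ID | WATERBODY_KEY | WATERSHED_KEY | WATERSHED_ORDER | WATERSHED_MAGNITUDE | ... | GLACIER_AREA | AVERAGE_ELEVATION | AVERAGE_SLOPE | ASPECT_NORTH | ASPECT_SOUTH | ASPECT_WEST | ASPECT_EAST | ASPECT_FLAT | GEOMETRY_Length | GEOMETRY_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3243400.0 | 3243400.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3243399.0 | 3243400.0 | 3243400.0 | 3243400.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3243400.0 | 3243400.0 |
| mean | 9136079.0 | 126.8362 | NaN | NaN | NaN | NaN | 21354860.0 | 359115800.0 | 2.44836 | 1406.644 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2281.145 | 292307.7 |
| std | 936410.9 | 69.44889 | NaN | NaN | NaN | NaN | 81107230.0 | 3740147.0 | 1.687149 | 15761.8 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2172.088 | 1088910.0 |
| min | 7513908.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | -1.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06553548 | 0.000155 |
| 25% | 8325166.0 | 69.0 | NaN | NaN | NaN | NaN | 0.0 | 356423300.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1007.841 | 40309.12 |
| 50% | 9136086.0 | 129.0 | NaN | NaN | NaN | NaN | 0.0 | 359497900.0 | 2.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1842.984 | 127333.6 |
| 75% | 9947026.0 | 187.0 | NaN | NaN | NaN | NaN | 0.0 | 360642500.0 | 3.0 | 22.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2973.524 | 312288.4 |
| max | 10757980.0 | 246.0 | NaN | NaN | NaN | NaN | 708021500.0 | 380961800.0 | 10.0 | 296885.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 552756.8 | 811000500.0 |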
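
The `GEOMETRY_Area` statistics in the summary above are easier to read next to `AREA_HA` once both are in the same unit. The sketch below converts a few of the reported values to hectares, assuming `GEOMETRY_Area` is in square metres. That is an assumption, based on the projected coordinates looking like metres, and it is not confirmed by the dataset documentation here. One hectare is 10,000 square metres.

```python
# Assumption: GEOMETRY_Area is expressed in square metres.
SQ_M_PER_HECTARE = 10_000.0


def sq_m_to_ha(area_sq_m: float) -> float:
    """Convert an area from square metres to hectares."""
    return area_sq_m / SQ_M_PER_HECTARE


# Values copied from the describe() output above.
summary_sq_m = {"mean": 292307.7, "50%": 127333.6, "max": 811000500.0}

for stat, value in summary_sq_m.items():
    print(f"{stat}: {sq_m_to_ha(value):,.2f} ha")
```

If the converted values disagree strongly with the `AREA_HA` column for the same rows, the unit assumption should be revisited.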


In [10]:
important_cols = [
    "WATERSHED_FEATURE_ID",
    "FWA_WATERSHED_CODE",
    "WATERSHED_ORDER",
    "AREA_HA",
    "WATERSHED_GROUP_CODE",
    "geometry"
]

watersheds_clean = watershed[important_cols].copy()

In [11]:
watersheds_clean.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 3243400 entries, 0 to 3243399
Data columns (total 6 columns):
 #   Column                Dtype   
---  ------                -----   
 0   WATERSHED_FEATURE_ID  float64 
 1   FWA_WATERSHED_CODE    object  
 2   WATERSHED_ORDER       int32   
 3   AREA_HA               float64 
 4   WATERSHED_GROUP_CODE  object  
 5   geometry              geometry
dtypes: float64(2), geometry(1), int32(1), object(2)
memory usage: 136.1+ MB
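
The `info()` output above shows `WATERSHED_FEATURE_ID` stored as `float64`, although it is used as an identifier. A minimal sketch of a safe cast to `int64` follows, assuming a hypothetical helper `float_ids_to_int` that refuses to convert when any value is missing or has a fractional part. The sample IDs are taken from the first rows of the preview.

```python
import pandas as pd


def float_ids_to_int(series: pd.Series) -> pd.Series:
    """Cast a float ID column to int64, refusing if information would be lost."""
    if series.isna().any():
        raise ValueError("column contains missing values")
    if not (series % 1 == 0).all():
        raise ValueError("column contains non-integer values")
    return series.astype("int64")


ids = pd.Series([7959831.0, 8891591.0, 10662283.0], name="WATERSHED_FEATURE_ID")
ids_int = float_ids_to_int(ids)
print(ids_int.dtype)
```

Integer IDs avoid float formatting such as `7959831.0` and make joins against other tables less error-prone.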


In [12]:
watersheds_clean.head()

|   | WATERSHED_FEATURE_ID | FWA_WATERSHED_CODE | WATERSHED_ORDER | AREA_HA | WATERSHED_GROUP_CODE | geometry |
|---|---|---|---|---|---|---|
| 0 | 7959831.0 | 915-740516-000000-000000-000000-000000-000000-... | 0 | 0.11303 | PORI | MULTIPOLYGON (((695621.045 1010144.888, 695604... |
| 1 | 8891591.0 | 915-740604-000000-000000-000000-000000-000000-... | 0 | 5.862183 | PORI | MULTIPOLYGON (((694759.358 1012303.554, 694758... |
| 2 | 10662283.0 | 915-724877-642115-000000-000000-000000-000000-... | 1 | 47.778471 | PORI | MULTIPOLYGON (((698726.465 1008130.326, 698705... |
| 3 | 7945915.0 | 915-726325-000000-000000-000000-000000-000000-... | 0 | 0.355021 | PORI | MULTIPOLYGON (((706831.113 990786.749, 706823.... |
| 4 | 10143024.0 | 915-724877-437883-000000-000000-000000-000000-... | 3 | 1.186657 | PORI | MULTIPOLYGON (((691835.013 1000057.645, 691854... |


In [13]:
watersheds_clean.shape

(3243400, 6)

In [14]:
# watersheds_clean.to_file('../data_raw/watershed/watersheds_clean.gpkg', driver="GPKG")

In [3]:
watershed = gpd.read_file('../data_raw/watershed/watersheds_clean.gpkg')

In [4]:
watershed.isnull().sum()

WATERSHED_FEATURE_ID    0
FWA_WATERSHED_CODE      0
WATERSHED_ORDER         0
AREA_HA                 0
WATERSHED_GROUP_CODE    0
geometry                0
dtype: int64

In [5]:
watershed.duplicated().sum()

0
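
`duplicated()` with no arguments only flags rows that are identical in every column, so a result of 0 does not by itself show that each watershed ID appears once. A sketch of an additional key-level check is given below on made-up data; the helper name `duplicated_keys` is an assumption for illustration.

```python
import pandas as pd


def duplicated_keys(df: pd.DataFrame, key: str) -> pd.Series:
    """Return the key values that occur more than once, with their counts."""
    counts = df[key].value_counts()
    return counts[counts > 1]


# Toy frame: no row is an exact copy of another, but ID 2 appears twice.
toy = pd.DataFrame(
    {
        "WATERSHED_FEATURE_ID": [1, 2, 2, 3],
        "AREA_HA": [0.5, 1.0, 1.1, 2.0],
    }
)

print(toy.duplicated().sum())                        # whole-row duplicates
print(duplicated_keys(toy, "WATERSHED_FEATURE_ID"))  # repeated IDs
```

On the real data, an empty result from `duplicated_keys(watershed, "WATERSHED_FEATURE_ID")` would confirm that the ID can serve as a unique key.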

In [6]:
watersheds_EPSG = watershed.to_crs(3005)

In [8]:
watersheds_EPSG.to_file('../data_raw/watershed/watersheds_EPSG.gpkg', driver="GPKG")

In [7]:
watersheds_EPSG.head()

|   | WATERSHED_FEATURE_ID | FWA_WATERSHED_CODE | WATERSHED_ORDER | AREA_HA | WATERSHED_GROUP_CODE | geometry |
|---|---|---|---|---|---|---|
| 0 | 7959831.0 | 915-740516-000000-000000-000000-000000-000000-... | 0 | 0.11303 | PORI | MULTIPOLYGON (((695621.045 1010144.888, 695604... |
| 1 | 8891591.0 | 915-740604-000000-000000-000000-000000-000000-... | 0 | 5.862183 | PORI | MULTIPOLYGON (((694759.358 1012303.554, 694758... |
| 2 | 10662283.0 | 915-724877-642115-000000-000000-000000-000000-... | 1 | 47.778471 | PORI | MULTIPOLYGON (((698726.465 1008130.326, 698705... |
| 3 | 7945915.0 | 915-726325-000000-000000-000000-000000-000000-... | 0 | 0.355021 | PORI | MULTIPOLYGON (((706831.113 990786.749, 706823.... |
| 4 | 10143024.0 | 915-724877-437883-000000-000000-000000-000000-... | 3 | 1.186657 | PORI | MULTIPOLYGON (((691835.013 1000057.645, 691854... |


   - [Preview Data](#preview-data)


---

3. [Data Cleaning](#data-cleaning)  

   - [Drop Duplicates](#drop-duplicates)  

   - [Standardize Text and Formats](#standardize-text-and-formats)  

   - [Convert Data Types](#convert-data-types)  

   - [Filter Irrelevant Records](#filter-irrelevant-records)  

   - [Handle Inconsistent Values](#handle-inconsistent-values)  

---

4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)  


   - [Univariate Analysis](#univariate-analysis)  

   - [Bivariate & Multivariate Analysis](#bivariate--multivariate-analysis)  

   - [Distribution of Variables](#distribution-of-variables)  

   - [Correlation Analysis](#correlation-analysis)  

   - [Outlier Detection](#outlier-detection)  

   - [Initial Insights](#initial-insights)  


---

5. [Feature Engineering](#feature-engineering)


   - [Feature Selection](#feature-selection)  

   - [Handling Missing Data](#handling-missing-data)  

   - [Encoding Categorical Variables](#encoding-categorical-variables)  

   - [Creating New Features](#creating-new-features)  

   - [Feature Transformation (Scaling, Normalization)](#feature-transformation-scaling-normalization)  

---