# Data Science Final Project 


**College/University Name**: _CICCC - Cornerstone International Community College of Canada_  
**Course**: _Final Project_  
**Instructor**: _Derrick Park_  
**Student Name**: _Amir Lima Oliveira_  
**Submission Date**: _2025-09-26_  

---

### Project Title
_Wildfire Restoration Priority Classification in Canada_

---

#### Objective
Find, structure, and analyse NASA satellite datasets of wildfire detection points, link them with satellite imagery, and engineer per-area parameters to determine which wildfire-affected areas need priority restoration.

### Problem Statement or Research Question
This project aims to help direct restoration resources efficiently by using a data-driven machine learning model to identify the most critical areas.

---

#### Dataset Overview
- **Source:** [Dataset URL or name]
- **Description:** Short explanation of the dataset (e.g., features, size, context)
- **Credits:** Cite source or dataset author if required

---

## Table of Contents


1. [Import Libraries](#import-libraries)  


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio as rio
import fiona
from rasterio.plot import show
import shapely.geometry as geom
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import urllib.request # to download the watershed gdb file

---

2. [Load & Inspect Data](#load--inspect-data)  


In [5]:
elevation = rio.open('../data_raw/elevation/mrdem-30-source.vrt')


In [14]:
import rasterio
import rioxarray
import matplotlib.pyplot as plt

# --- Inspect the dataset already opened with rasterio (the `elevation` object) ---
print("CRS:", elevation.crs)
print("Bounds:", elevation.bounds)
print("Width x Height:", elevation.width, "x", elevation.height)
print("Number of bands:", elevation.count)
print("Data type:", elevation.dtypes)
print("Transform (resolution & origin):", elevation.transform)

CRS: EPSG:3979
Bounds: BoundingBox(left=-2453970.0, bottom=-902220.0, right=3056580.0, top=3887370.0)
Width x Height: 183685 x 159653
Number of bands: 1
Data type: ('uint8',)
Transform (resolution & origin): | 30.00, 0.00,-2453970.00|
| 0.00,-30.00, 3887370.00|
| 0.00, 0.00, 1.00|
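
The transform printed above maps pixel indices to map coordinates in EPSG:3979. As a quick sanity check, the sketch below applies the six affine coefficients by hand and confirms that the raster corners reproduce the printed bounds. The helper name `pixel_to_map` is hypothetical and is not part of rasterio.

```python
# Affine coefficients copied from the printed transform (metres, EPSG:3979).
A, B, C = 30.0, 0.0, -2453970.0   # x = A*col + B*row + C
D, E, F = 0.0, -30.0, 3887370.0   # y = D*col + E*row + F

WIDTH, HEIGHT = 183685, 159653    # from "Width x Height" above


def pixel_to_map(col, row):
    """Return the map (x, y) of the upper-left corner of pixel (col, row)."""
    x = A * col + B * row + C
    y = D * col + E * row + F
    return x, y


top_left = pixel_to_map(0, 0)
bottom_right = pixel_to_map(WIDTH, HEIGHT)

print("Top-left corner:", top_left)
print("Bottom-right corner:", bottom_right)
```

The bottom-right corner should agree with `right=3056580.0` and `bottom=-902220.0` in the printed bounds, which confirms the 30 m pixel size.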


In [16]:
import numpy as np

with rasterio.open('../data_raw/elevation/mrdem-30-source.vrt') as src:
    window = rasterio.windows.Window(0, 0, 500, 500)  # top-left 500x500 pixels
    sample = src.read(1, window=window, masked=True)
    print("Sample shape:", sample.shape)
    print("Sample stats:", np.nanmin(sample), np.nanmax(sample), np.nanmean(sample))


Sample shape: (500, 500)
Sample stats: -- -- --
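
The `-- -- --` above suggests that every pixel in the top-left window is masked, which is plausible for a corner of a national grid that falls outside the data footprint. Below is a minimal sketch of a summary helper that reports how many valid pixels a masked window holds before computing statistics. The name `masked_stats` is hypothetical and the demo arrays are synthetic, not read from the DEM.

```python
import numpy as np


def masked_stats(arr):
    """Summarise a NumPy masked array, tolerating fully masked input."""
    valid = int(arr.count())  # number of unmasked pixels
    if valid == 0:
        return {"valid": 0, "min": None, "max": None, "mean": None}
    return {
        "valid": valid,
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
    }


# Synthetic stand-ins for windows read with masked=True
all_nodata = np.ma.masked_all((500, 500), dtype="uint8")
partly_valid = np.ma.array([[10, 20], [30, 0]],
                           mask=[[False, False], [False, True]])

print(masked_stats(all_nodata))
print(masked_stats(partly_valid))
```

Checking the valid-pixel count first makes it obvious when a window should be moved to a location that actually contains elevation data.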


   - [Shape](#shape)  

In [8]:
watershed.shape


(3243400, 38)

   - [Missing Values](#missing-values)  


In [9]:
watershed.isnull().sum()

WATERSHED_FEATURE_ID               0
WATERSHED_GROUP_ID                 0
WATERSHED_TYPE               3243400
GNIS_ID_1                    3243400
GNIS_NAME_1                  3243400
GNIS_ID_2                    3243400
GNIS_NAME_2                  3243400
GNIS_ID_3                    3243400
GNIS_NAME_3                  3243400
WATERBODY_ID                 3243400
WATERBODY_KEY                      1
WATERSHED_KEY                      0
FWA_WATERSHED_CODE                 0
LOCAL_WATERSHED_CODE               0
WATERSHED_GROUP_CODE               0
LEFT_RIGHT_TRIBUTARY         3243356
WATERSHED_ORDER                    0
WATERSHED_MAGNITUDE                0
LOCAL_WATERSHED_ORDER              1
LOCAL_WATERSHED_MAGNITUDE          1
AREA_HA                            0
RIVER_AREA                   3243400
LAKE_AREA                    3243400
WETLAND_AREA                 3243400
MANMADE_AREA                 3243400
GLACIER_AREA                 3243400
AVERAGE_ELEVATION            3243400
A
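
Many of the columns listed above are null in every one of the 3,243,400 rows, so they carry no information. One possible way to drop them is sketched below on a small synthetic frame. The helper name `drop_empty_columns` and the toy values are assumptions, not the project's data.

```python
import numpy as np
import pandas as pd


def drop_empty_columns(df):
    """Return (df without all-null columns, list of dropped column names)."""
    empty = df.columns[df.isnull().all()]
    return df.drop(columns=empty), list(empty)


# Toy frame reusing a few column names from the watershed layer
toy = pd.DataFrame({
    "WATERSHED_FEATURE_ID": [1, 2, 3],
    "AREA_HA": [10.5, 3.2, 7.9],
    "WATERSHED_TYPE": [np.nan, np.nan, np.nan],   # entirely null, dropped
    "WATERBODY_KEY": [0.0, np.nan, 5.0],          # partly null, kept
})

cleaned, dropped = drop_empty_columns(toy)
print("Dropped:", dropped)
print("Remaining:", list(cleaned.columns))
```

`df.dropna(axis=1, how="all")` gives the same result in one call; the helper is only used here so that the dropped column names can be reported.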

   - [Data Types](#data-types)  


In [10]:
watershed.describe()

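| | WATERSHED_FEATURE_ID | WATERSHED_GROUP_ID | GNIS_ID_1 | GNIS_ID_2 | GNIS_ID_3 | WATERBODY_ID | WATERBODY_KEY | WATERSHED_KEY | WATERSHED_ORDER | WATERSHED_MAGNITUDE | ... | GLACIER_AREA | AVERAGE_ELEVATION | AVERAGE_SLOPE | ASPECT_NORTH | ASPECT_SOUTH | ASPECT_WEST | ASPECT_EAST | ASPECT_FLAT | GEOMETRY_Length | GEOMETRY_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3243400.0 | 3243400.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3243399.0 | 3243400.0 | 3243400.0 | 3243400.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3243400.0 | 3243400.0 |
| mean | 9136079.0 | 126.8362 | NaN | NaN | NaN | NaN | 21354860.0 | 359115800.0 | 2.44836 | 1406.644 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2281.145 | 292307.7 |
| std | 936410.9 | 69.44889 | NaN | NaN | NaN | NaN | 81107230.0 | 3740147.0 | 1.687149 | 15761.8 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2172.088 | 1088910.0 |
| min | 7513908.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | -1.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06553548 | 0.000155 |
| 25% | 8325166.0 | 69.0 | NaN | NaN | NaN | NaN | 0.0 | 356423300.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1007.841 | 40309.12 |
| 50% | 9136086.0 | 129.0 | NaN | NaN | NaN | NaN | 0.0 | 359497900.0 | 2.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1842.984 | 127333.6 |
| 75% | 9947026.0 | 187.0 | NaN | NaN | NaN | NaN | 0.0 | 360642500.0 | 3.0 | 22.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2973.524 | 312288.4 |
| max | 10757980.0 | 246.0 | NaN | NaN | NaN | NaN | 708021500.0 | 380961800.0 | 10.0 | 296885.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 552756.8 | 811000500.0 |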


In [None]:
important_cols = [
    "WATERSHED_FEATURE_ID",
    "FWA_WATERSHED_CODE",
    "WATERSHED_ORDER",
    "AREA_HA",
    "WATERSHED_GROUP_CODE",
    "geometry"
]

watersheds_clean = watersheds_bc[important_cols].copy()

   - [Preview Data](#preview-data)


---

3. [Data Cleaning](#data-cleaning)  

   - [Drop Duplicates](#drop-duplicates)  

   - [Standardize Text and Formats](#standardize-text-and-formats)  

   - [Convert Data Types](#convert-data-types)  

   - [Filter Irrelevant Records](#filter-irrelevant-records)  

   - [Handle Inconsistent Values](#handle-inconsistent-values)  

---

4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)  


   - [Univariate Analysis](#univariate-analysis)  

   - [Bivariate & Multivariate Analysis](#bivariate--multivariate-analysis)  

   - [Distribution of Variables](#distribution-of-variables)  

   - [Correlation Analysis](#correlation-analysis)  

   - [Outlier Detection](#outlier-detection)  

   - [Initial Insights](#initial-insights)  


---

5. [Feature Engineering](#feature-engineering)


   - [Feature Selection](#feature-selection)  

   - [Handling Missing Data](#handling-missing-data)  

   - [Encoding Categorical Variables](#encoding-categorical-variables)  

   - [Creating New Features](#creating-new-features)  

   - [Feature Transformation (Scaling, Normalization)](#feature-transformation-scaling-normalization)  

---

Elevation data

Create a raster file (GeoTIFF) containing only the BC elevation data by clipping the source DEM to the fire perimeters.

In [None]:
# import rasterio
# from rasterio.mask import mask
# import geopandas as gpd
# import numpy as np

# # Paths
# vrt_path = "../data_raw/elevation/mrdem-30-source.vrt"
# fire_shapefile = "../data_raw/fire_perimeters/fire_perimeters.gpkg"
# clipped_fp = "../data_raw/elevation/dem_bc_clipped.tif"

# # Load fire perimeters
# fires = gpd.read_file(fire_shapefile)

# with rasterio.open(vrt_path) as src:
#     # Ensure CRS matches
#     fires = fires.to_crs(src.crs)
    
#     # Clip DEM to fire geometries
#     dem_clipped, dem_transform = mask(src, fires.geometry, crop=True)
    
#     # Convert to standard numpy array
#     dem_clipped = np.array(dem_clipped, dtype=src.dtypes[0])
    
#     # Update metadata
#     out_meta = src.meta.copy()
#     out_meta.update({
#         "driver": "GTiff",
#         "height": dem_clipped.shape[1],
#         "width": dem_clipped.shape[2],
#         "transform": dem_transform,
#         "count": src.count
#     })

# # Save clipped DEM
# with rasterio.open(clipped_fp, "w", **out_meta) as dest:
#     dest.write(dem_clipped)


In [40]:
elevation_path = "../data_raw/elevation/dem_bc_clipped.tif"

with rasterio.open(elevation_path) as src:
    print("CRS:", src.crs)
    print("Bounds:", src.bounds)
    print("Width, Height:", src.width, src.height)
    print("Count (bands):", src.count)

CRS: EPSG:3979
Bounds: BoundingBox(left=-2327010.0, bottom=209400.0, right=-1325460.0, top=1911630.0)
Width, Height: 33385 56741
Count (bands): 1


Reprojecting the clipped DEM from EPSG:3979 to EPSG:3005 (NAD83 / BC Albers)

In [None]:
# import rasterio
# from rasterio.warp import calculate_default_transform, reproject, Resampling
# from rasterio.mask import mask
# import geopandas as gpd
# import os

# # Paths
# vrt_path = "../data_raw/elevation/mrdem-30-source.vrt"
# fire_shapefile = "../data_raw/fire_perimeters/fire_perimeters.gpkg"
# clipped_fp = "../data_raw/elevation/dem_bc_clipped.tif"
# output_fp = "../data_raw/elevation/dem_bc_clipped_epsg3005.tif"

# # Load fire perimeters
# fires = gpd.read_file(fire_shapefile)

# # Open VRT (source DEM)
# with rasterio.open(vrt_path) as src:
#     # Reproject fire geometries to match DEM
#     fires = fires.to_crs(src.crs)
    
#     # Clip DEM to fire perimeters using rasterio.mask
#     # crop=True ensures we get minimal bounds
#     clipped_dem, clipped_transform = mask(src, fires.geometry, crop=True)
    
#     # Convert to numpy array
#     clipped_dem = clipped_dem.astype(src.dtypes[0])
    
#     # Metadata for clipped DEM
#     clipped_meta = src.meta.copy()
#     clipped_meta.update({
#         "driver": "GTiff",
#         "height": clipped_dem.shape[1],
#         "width": clipped_dem.shape[2],
#         "transform": clipped_transform,
#         "count": src.count
#     })

# # Reproject the clipped array and write the output file once
# from rasterio.transform import array_bounds

# dst_crs = "EPSG:3005"
# src_crs = clipped_meta["crs"]  # CRS copied from the source metadata

# # Bounds of the clipped array (not of the full source VRT)
# clipped_bounds = array_bounds(
#     clipped_meta["height"], clipped_meta["width"], clipped_transform
# )

# # Calculate the transform, width and height for the target CRS
# transform, width, height = calculate_default_transform(
#     src_crs, dst_crs, clipped_meta["width"], clipped_meta["height"], *clipped_bounds
# )
# dst_meta = clipped_meta.copy()
# dst_meta.update({
#     "crs": dst_crs,
#     "transform": transform,
#     "width": width,
#     "height": height
# })

# # Open the destination file and reproject band 1 into it
# with rasterio.open(output_fp, "w", **dst_meta) as dst_file:
#     reproject(
#         source=clipped_dem[0],
#         destination=rasterio.band(dst_file, 1),
#         src_transform=clipped_transform,
#         src_crs=src_crs,
#         dst_transform=transform,
#         dst_crs=dst_crs,
#         resampling=Resampling.bilinear
#     )

# print("Clipped and reprojected DEM saved at:", output_fp)


Clipped and reprojected DEM saved at: ../data_raw/elevation/dem_bc_clipped_epsg3005.tif


In [1]:
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling
import os

# Paths
input_fp = "../data_raw/elevation/dem_bc_clipped.tif"   # your current DEM
output_fp = "../data_raw/elevation/dem_bc_clipped_epsg3005.tif"  # reprojected DEM

# Make sure the output folder exists
os.makedirs(os.path.dirname(output_fp), exist_ok=True)

# Target CRS
dst_crs = "EPSG:3005"

with rasterio.open(input_fp) as src:
    # Calculate the transform and dimensions for the new CRS
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds
    )
    
    # Update metadata
    kwargs = src.meta.copy()
    kwargs.update({
        'crs': dst_crs,
        'transform': transform,
        'width': width,
        'height': height
    })
    
    # Reproject and write directly to a new file
    with rasterio.open(output_fp, 'w', **kwargs) as dst:
        for i in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, i),
                destination=rasterio.band(dst, i),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=dst_crs,
                resampling=Resampling.bilinear
            )

print("Reprojection complete! Saved as:", output_fp)


Reprojection complete! Saved as: ../data_raw/elevation/dem_bc_clipped_epsg3005.tif


In [2]:
import rasterio as rio
with rio.open("../data_raw/elevation/dem_bc_clipped_epsg3005.tif") as src:
    print("CRS:", src.crs)
    print("Bounds:", src.bounds)
    print("Width, Height:", src.width, src.height)
    print("Count (bands):", src.count)


CRS: EPSG:3005
Bounds: BoundingBox(left=206198.3144124857, bottom=83611.93372010277, right=1913020.3413456595, top=2083393.448122778)
Width, Height: 55995 65606
Count (bands): 1
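
Reprojection resamples the grid, so the pixel size is no longer exactly 30 m. The short arithmetic check below uses only the bounds and dimensions printed for the two files. The helper `pixel_size` is a hypothetical name introduced for illustration.

```python
def pixel_size(left, bottom, right, top, width, height):
    """Return (x, y) pixel size in CRS units from bounds and grid dimensions."""
    return (right - left) / width, (top - bottom) / height


# Values printed for dem_bc_clipped.tif (EPSG:3979)
src_res = pixel_size(-2327010.0, 209400.0, -1325460.0, 1911630.0, 33385, 56741)

# Values printed for dem_bc_clipped_epsg3005.tif (EPSG:3005)
dst_res = pixel_size(
    206198.3144124857, 83611.93372010277,
    1913020.3413456595, 2083393.448122778,
    55995, 65606,
)

print("Clipped DEM pixel size (m):", src_res)
print("Reprojected DEM pixel size (m):", dst_res)
```

The clipped DEM comes out at exactly 30 m, while the reprojected grid is roughly 30.48 m in both directions, which is the resolution `calculate_default_transform` chose for the EPSG:3005 output.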


10. [References](#references)  


https://open.canada.ca/data/en/dataset/055919c2-101e-4329-bfd7-1d0c333c0e62/resource/de8a365d-6326-4013-a661-7647e5996c55