# Data Science Final Project 


**College/University Name**: _CICCC - Cornerstone International Community College of Canada_  
**Course**: _Final Project_  
**Instructor**: _Derrick Park_  
**Student Name**: _Amir Lima Oliveira_  
**Submission Date**: _2025-09-26_  

---

### Project Title
    _Wildfire Restoration Priority Classification in Canada_
---

#### Objective
    Find, structure, and analyse NASA satellite datasets of wildfire detections, link them with satellite imagery, and engineer area-level features to identify which wildfire-affected areas need priority restoration.
### Problem Statement or Research Question
    This project aims to help direct restoration resources efficiently by using a data-driven machine learning model to identify the most critical wildfire-affected areas.
---

#### Dataset Overview
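- **Source:** B.C. Data Catalogue, Other Land Cover 1:250,000 (GeoBase Land Cover), loaded from `NRCTHRLNDC_polygon.shp` (URL listed under References)
- **Description:** Land cover polygons with 259,810 features and 6 columns (`THRLNDCVR2`, `LNDCVRCLSS`, `OBJECTID`, `AREA_SQM`, `FEAT_LEN`, `geometry`) and no missing values
- **Credits:** B.C. Data Catalogue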

---

## Table of Contents


1. [Import Libraries](#import-libraries)  


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio as rio
import fiona
from rasterio.plot import show
import shapely.geometry as geom
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import urllib.request # to download the watershed gdb file

---

2. [Load & Inspect Data](#load--inspect-data)  


In [3]:
land_cover = gpd.read_file('../data_raw/land_cover/NRCTHRLNDC_polygon.shp')

   - [Shape](#shape)  

In [13]:
land_cover.shape

(259810, 6)

   - [Missing Values](#missing-values)  


In [4]:
land_cover.isnull().sum()

THRLNDCVR2    0
LNDCVRCLSS    0
OBJECTID      0
AREA_SQM      0
FEAT_LEN      0
geometry      0
dtype: int64
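
For a spatial layer, `isnull()` only reports missing values; a geometry can be present and still be empty or invalid. The following is a minimal sketch of that extra check on a toy GeoDataFrame (the `toy` layer and its two polygons are made up for illustration and are not part of the project data). The same two properties could be applied to `land_cover.geometry`.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Toy layer: one valid square and one self-intersecting "bowtie" polygon.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
toy = gpd.GeoDataFrame({"name": ["square", "bowtie"]}, geometry=[square, bowtie])

# Count geometries that are empty or invalid; isnull() does not catch these.
n_empty = int(toy.geometry.is_empty.sum())
n_invalid = int((~toy.geometry.is_valid).sum())
print(f"empty: {n_empty}, invalid: {n_invalid}")
```
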

   - [Data Types](#data-types)  


In [5]:
land_cover.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 259810 entries, 0 to 259809
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   THRLNDCVR2  259810 non-null  int64   
 1   LNDCVRCLSS  259810 non-null  object  
 2   OBJECTID    259810 non-null  float64 
 3   AREA_SQM    259810 non-null  float64 
 4   FEAT_LEN    259810 non-null  float64 
 5   geometry    259810 non-null  geometry
dtypes: float64(3), geometry(1), int64(1), object(1)
memory usage: 11.9+ MB


   - [Preview Data](#preview-data)


In [6]:
land_cover.head()

|   | THRLNDCVR2 | LNDCVRCLSS | OBJECTID | AREA_SQM | FEAT_LEN | geometry |
|---|------------|-------------|----------|----------|----------|----------|
| 0 | 133756 | Rock/Rubble | 133756.0 | 8748.4316 | 384.9799 | POLYGON ((817520.193 1104684.096, 817509.984 1... |
| 1 | 20889 | Developed | 20889.0 | 5036.0589 | 555.3003 | POLYGON ((816296.673 1105430.754, 816261.271 1... |
| 2 | 376876 | Snow/Ice | 376876.0 | 9302.7318 | 382.0501 | POLYGON ((807831.295 1105262.792, 807795.598 1... |
| 3 | 376924 | Snow/Ice | 376924.0 | 98121.9638 | 2032.8296 | POLYGON ((807408.934 1105503.389, 807422.693 1... |
| 4 | 376888 | Snow/Ice | 376888.0 | 7348.9996 | 347.0931 | POLYGON ((808561.307 1105416.467, 808493.391 1... |


In [10]:
# Reproject to EPSG:3005
land_cover = land_cover.to_crs(epsg=3005)

# Save processed dataset
land_cover.to_file("../data_raw/land_cover/land_cover_BC.gpkg", driver="GPKG")

print(land_cover.head())
print(land_cover.crs)

   THRLNDCVR2   LNDCVRCLSS  OBJECTID    AREA_SQM   FEAT_LEN  \
0      133756  Rock/Rubble  133756.0   8748.4316   384.9799   
1       20889    Developed   20889.0   5036.0589   555.3003   
2      376876     Snow/Ice  376876.0   9302.7318   382.0501   
3      376924     Snow/Ice  376924.0  98121.9638  2032.8296   
4      376888     Snow/Ice  376888.0   7348.9996   347.0931   

                                            geometry  
0  POLYGON ((817520.193 1104684.096, 817509.984 1...  
1  POLYGON ((816296.673 1105430.754, 816261.271 1...  
2  POLYGON ((807831.295 1105262.792, 807795.598 1...  
3  POLYGON ((807408.934 1105503.389, 807422.693 1...  
4  POLYGON ((808561.307 1105416.467, 808493.391 1...  
EPSG:3005
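
The coordinates printed after `to_crs` match those in the earlier preview, which suggests the shapefile may already have been in EPSG:3005. A minimal sketch of a guard that reprojects only when needed is shown here; `ensure_crs` is a hypothetical helper name, and the toy point (approximate Vancouver longitude and latitude) is illustrative only.

```python
import geopandas as gpd
from shapely.geometry import Point

def ensure_crs(gdf, epsg=3005):
    """Return gdf in the target EPSG, reprojecting only when needed."""
    if gdf.crs is None:
        raise ValueError("Layer has no CRS; assign one with set_crs() first.")
    if gdf.crs.to_epsg() == epsg:
        return gdf  # already in the target CRS, nothing to do
    return gdf.to_crs(epsg=epsg)

# Toy point in geographic coordinates (lon/lat, EPSG:4326).
toy = gpd.GeoDataFrame({"name": ["pt"]}, geometry=[Point(-123.1, 49.25)], crs="EPSG:4326")
toy_bc = ensure_crs(toy)
print(toy.crs.to_epsg(), "->", toy_bc.crs.to_epsg())
```

Calling `ensure_crs(land_cover)` in place of the unconditional `to_crs` would leave a layer that is already in EPSG:3005 untouched.
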


In [11]:
land_cover_BC = gpd.read_file("../data_raw/land_cover/land_cover_BC.gpkg")
print(land_cover_BC.crs)
print(land_cover_BC.shape)


EPSG:3005
(259810, 6)


In [12]:
land_cover_BC.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 259810 entries, 0 to 259809
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   THRLNDCVR2  259810 non-null  int64   
 1   LNDCVRCLSS  259810 non-null  object  
 2   OBJECTID    259810 non-null  float64 
 3   AREA_SQM    259810 non-null  float64 
 4   FEAT_LEN    259810 non-null  float64 
 5   geometry    259810 non-null  geometry
dtypes: float64(3), geometry(1), int64(1), object(1)
memory usage: 11.9+ MB


---

3. [Data Cleaning](#data-cleaning)  

   - [Standardize Text and Formats](#standardize-text-and-formats)  

   - [Convert Data Types](#convert-data-types)  

   - [Filter Irrelevant Records](#filter-irrelevant-records)  
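
The three cleaning steps listed above could look roughly like the sketch below. It runs on a small pandas frame whose first three rows are copied from the `land_cover` preview; the fourth row's messy label and the `EXCLUDED_CLASSES` list are assumptions made for illustration, not decisions taken in this project.

```python
import pandas as pd

# Toy rows based on the land_cover preview; the last label is an assumed messy variant.
df = pd.DataFrame({
    "LNDCVRCLSS": ["Rock/Rubble", "Developed", "Snow/Ice", " snow/ice "],
    "OBJECTID": [133756.0, 20889.0, 376876.0, 376924.0],
    "AREA_SQM": [8748.4316, 5036.0589, 9302.7318, 98121.9638],
})

# Standardize text: trim whitespace and use one casing convention.
df["LNDCVRCLSS"] = df["LNDCVRCLSS"].str.strip().str.title()

# Convert data types: OBJECTID holds whole numbers stored as float64.
df["OBJECTID"] = df["OBJECTID"].astype("int64")

# Filter irrelevant records: the excluded classes are a hypothetical choice.
EXCLUDED_CLASSES = ["Snow/Ice", "Developed"]
cleaned = df[~df["LNDCVRCLSS"].isin(EXCLUDED_CLASSES)].reset_index(drop=True)
print(cleaned)
```
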

---

4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)  


   - [Univariate Analysis](#univariate-analysis)  

   - [Bivariate & Multivariate Analysis](#bivariate--multivariate-analysis)  

   - [Distribution of Variables](#distribution-of-variables)  

   - [Correlation Analysis](#correlation-analysis)  

   - [Initial Insights](#initial-insights)  
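
As a starting point for the univariate analysis, a minimal sketch is given below. It uses only the five rows shown in the `land_cover` preview, so the numbers it prints describe that sample and not the full 259,810-row layer; the variable names are illustrative.

```python
import pandas as pd

# The five preview rows of land_cover (class label and polygon area in square metres).
df = pd.DataFrame({
    "LNDCVRCLSS": ["Rock/Rubble", "Developed", "Snow/Ice", "Snow/Ice", "Snow/Ice"],
    "AREA_SQM": [8748.4316, 5036.0589, 9302.7318, 98121.9638, 7348.9996],
})

# Number of polygons per land cover class.
class_counts = df["LNDCVRCLSS"].value_counts()

# Total area per class in hectares (1 ha = 10,000 square metres), largest first.
area_ha = (
    df.groupby("LNDCVRCLSS")["AREA_SQM"].sum().div(10_000).sort_values(ascending=False)
)

# Summary statistics of polygon area.
area_summary = df["AREA_SQM"].describe()

print(class_counts, area_ha, area_summary, sep="\n\n")
```
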


---

5. [Feature Engineering](#feature-engineering)


   - [Feature Selection](#feature-selection)  

   - [Creating New Features](#creating-new-features)  
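
One possible direction for new features is shape descriptors derived from the existing columns. The sketch below assumes that `FEAT_LEN` is the polygon perimeter in metres (the dataset documentation should confirm this) and computes the Polsby-Popper compactness, 4πA / P², which equals 1 for a circle and approaches 0 for elongated or ragged shapes. The `compactness` and `log_area` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Area and length values from the five preview rows of land_cover.
df = pd.DataFrame({
    "AREA_SQM": [8748.4316, 5036.0589, 9302.7318, 98121.9638, 7348.9996],
    "FEAT_LEN": [384.9799, 555.3003, 382.0501, 2032.8296, 347.0931],
})

# Polsby-Popper compactness, assuming FEAT_LEN is the perimeter in metres.
df["compactness"] = 4 * np.pi * df["AREA_SQM"] / df["FEAT_LEN"] ** 2

# Log-scaled area, since polygon areas span several orders of magnitude.
df["log_area"] = np.log10(df["AREA_SQM"])

print(df.round(3))
```
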


---

10. [References](#references)  


- B.C. Data Catalogue, Other Land Cover 1:250,000 (GeoBase Land Cover): https://catalogue.data.gov.bc.ca/dataset/other-land-cover-1-250-000-geobase-land-cover/resource/4e1ccbf3-63bb-4bbe-b91c-d5ac0d3ab36c