
# Weather Classification Analysis for Wildfire Avoidance

As an Environmental Engineer deeply invested in leveraging data science for environmental recovery, especially in high-risk areas like those prone to wildfires, this project resonates with my core mission. My analytical approach drives me to break down complex environmental challenges into manageable data problems. Here, I apply machine learning to predict dry weather conditions, a critical step in proactive wildfire prevention and resource management. This work is a step-by-step exploration, connecting technical skills with the vital goal of protecting our ecosystems and communities.

This notebook performs data preparation, including basic inspection, cleaning, exploratory data analysis (EDA), and feature engineering, for a weather classification dataset. The goal is to predict dryness trends for wildfire avoidance, with 'Dry' conditions defined based on parameters relevant to the Canadian Forest Fire Weather Index (FWI) system. It also includes statistical tests to understand the relationships between features and dryness.


In [53]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats # For statistical tests
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv(r"../datasets/weather_classification_data.csv")



## 1. Basic Inspection

My first step in any data project is to get a foundational understanding of the dataset. This involves a quick look at the first few rows, checking the overall size, and examining the data types to ensure everything is as expected. It's like surveying the landscape before starting an environmental assessment.


In [54]:

print("---Checking Head Rows---")
df.head()


---Checking Head Rows---


Unnamed: 0,Temperature,Humidity,Wind Speed,Precipitation (%),Cloud Cover,Atmospheric Pressure,UV Index,Season,Visibility (km),Location,Weather Type
0,14.0,73,9.5,82.0,partly cloudy,1010.82,2,Winter,3.5,inland,Rainy
1,39.0,96,8.5,71.0,partly cloudy,1011.43,7,Spring,10.0,inland,Cloudy
2,30.0,64,7.0,16.0,clear,1018.72,5,Spring,5.5,mountain,Sunny
3,38.0,83,1.5,82.0,clear,1026.25,7,Spring,1.0,coastal,Sunny
4,27.0,74,17.0,66.0,overcast,990.67,1,Winter,2.5,mountain,Rainy


In [55]:
print("\n---Checking Rows and Columns---")
print(df.shape)
print("\n---Checking Data Types---")
print(df.dtypes)


---Checking Rows and Columns---
(13200, 11)

---Checking Data Types---
Temperature             float64
Humidity                  int64
Wind Speed              float64
Precipitation (%)       float64
Cloud Cover              object
Atmospheric Pressure    float64
UV Index                  int64
Season                   object
Visibility (km)         float64
Location                 object
Weather Type             object
dtype: object


In [56]:
print("\n---Checking Dataset Info---")
df.info()


---Checking Dataset Info---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature           13200 non-null  float64
 1   Humidity              13200 non-null  int64  
 2   Wind Speed            13200 non-null  float64
 3   Precipitation (%)     13200 non-null  float64
 4   Cloud Cover           13200 non-null  object 
 5   Atmospheric Pressure  13200 non-null  float64
 6   UV Index              13200 non-null  int64  
 7   Season                13200 non-null  object 
 8   Visibility (km)       13200 non-null  float64
 9   Location              13200 non-null  object 
 10  Weather Type          13200 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB



## 2. Data Cleaning

Just as environmental remediation starts with removing contaminants, data cleaning is about ensuring the integrity of our dataset. Here, I focus on identifying and addressing duplicate entries and missing values, which can skew our analysis and model performance.


In [57]:

print("---Checking for duplicates---")
print(f"{df.duplicated().sum()} duplicates found.")
df.drop_duplicates(inplace=True)
print(f"Shape after dropping duplicates: {df.shape}")

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())


---Checking for duplicates---
0 duplicates found.
Shape after dropping duplicates: (13200, 11)

Missing values per column:
Temperature             0
Humidity                0
Wind Speed              0
Precipitation (%)       0
Cloud Cover             0
Atmospheric Pressure    0
UV Index                0
Season                  0
Visibility (km)         0
Location                0
Weather Type            0
dtype: int64



## 3. Feature Engineering: Define Dryness Target Variable for Wildfire Avoidance

This is where I connect the raw weather data to a real-world environmental problem: wildfire risk. By defining 'Dry' conditions based on principles from the Canadian Forest Fire Weather Index (FWI) system, I'm creating a target variable that directly supports our goal of wildfire avoidance. It's about translating environmental indicators into actionable data points.



#### Inspired by the FWI system inputs: temperature, relative humidity, wind speed, and 24-hour precipitation.

*   A simplified definition of 'Dry' for wildfire avoidance:
    -   Low Precipitation
    -   Low Humidity
    -   High Temperature
    -   Potentially High Wind Speed (though not directly used in this simple binary definition)

*   The thresholds for 'Dry' conditions below are based on FWI inputs; they are illustrative and can be refined.
*   A record is labeled 'Dry' only if all three conditions hold:
    -   Precipitation (%) below the dataset's 10th percentile (very low precipitation)
    -   Humidity < 40 (low humidity)
    -   Temperature > 25 (high temperature)



__More details on the FWI system are available in the PDF report.__


In [58]:

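# Compute the precipitation threshold once instead of recomputing it for every row
precip_threshold = df["Precipitation (%)"].quantile(0.10)

def define_dryness_for_wildfire(row):
    # A row is "Dry" only when all three conditions hold
    if row["Precipitation (%)"] < precip_threshold and row["Humidity"] < 40 and row["Temperature"] > 25: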
        return "Dry"
    else:
        return "Not Dry"

df["Dryness"] = df.apply(define_dryness_for_wildfire, axis=1)
print("\nValue counts for Dryness_Label_Wildfire:")
print(df["Dryness"].value_counts())

# Target variable for classification (1 for Dry, 0 for Not Dry)
y = (df["Dryness"] == "Dry").astype(int)

# Features (excluding the original Weather Type and the new Dryness column)
x = df.drop(["Weather Type", "Dryness"], axis=1)



Value counts for Dryness_Label_Wildfire:
Dryness
Not Dry    12822
Dry          378
Name: count, dtype: int64
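
The same three-condition rule can also be written without a row-wise `apply`. The sketch below is a minimal illustration on made-up data, not the notebook's dataset: the column names mirror the notebook, while `toy`, `precip_threshold`, and `is_dry` are names I introduce here for illustration.

```python
import pandas as pd

# Toy data; column names mirror the notebook, values are made up for illustration.
toy = pd.DataFrame({
    "Temperature":       [30.0, 35.0, 10.0, 28.0, 40.0, 22.0, 31.0, 5.0, 27.0, 33.0],
    "Humidity":          [30, 35, 80, 50, 25, 30, 38, 90, 39, 20],
    "Precipitation (%)": [0.0, 5.0, 90.0, 12.0, 1.0, 3.0, 60.0, 85.0, 40.0, 20.0],
})

# Compute the precipitation threshold once for the whole column.
precip_threshold = toy["Precipitation (%)"].quantile(0.10)

# Element-wise boolean mask: all three conditions must hold for a row to be "Dry".
is_dry = (
    (toy["Precipitation (%)"] < precip_threshold)
    & (toy["Humidity"] < 40)
    & (toy["Temperature"] > 25)
)
toy["Dryness"] = is_dry.map({True: "Dry", False: "Not Dry"})
print(toy["Dryness"].value_counts())
```

Because the mask is evaluated column by column, this form should scale better than `apply(axis=1)` on larger frames, and `is_dry.astype(int)` gives the 0/1 target directly.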


In [59]:
df.head()

Unnamed: 0,Temperature,Humidity,Wind Speed,Precipitation (%),Cloud Cover,Atmospheric Pressure,UV Index,Season,Visibility (km),Location,Weather Type,Dryness
0,14.0,73,9.5,82.0,partly cloudy,1010.82,2,Winter,3.5,inland,Rainy,Not Dry
1,39.0,96,8.5,71.0,partly cloudy,1011.43,7,Spring,10.0,inland,Cloudy,Not Dry
2,30.0,64,7.0,16.0,clear,1018.72,5,Spring,5.5,mountain,Sunny,Not Dry
3,38.0,83,1.5,82.0,clear,1026.25,7,Spring,1.0,coastal,Sunny,Not Dry
4,27.0,74,17.0,66.0,overcast,990.67,1,Winter,2.5,mountain,Rainy,Not Dry


In [60]:
print(df["Dryness"].unique())

print(y.unique())

['Not Dry' 'Dry']
[0 1]


In [61]:
df.describe()

Unnamed: 0,Temperature,Humidity,Wind Speed,Precipitation (%),Atmospheric Pressure,UV Index,Visibility (km)
count,13200.0,13200.0,13200.0,13200.0,13200.0,13200.0,13200.0
mean,19.127576,68.710833,9.832197,53.644394,1005.827896,4.005758,5.462917
std,17.386327,20.194248,6.908704,31.946541,37.199589,3.8566,3.371499
min,-25.0,20.0,0.0,0.0,800.12,0.0,0.0
25%,4.0,57.0,5.0,19.0,994.8,1.0,3.0
50%,21.0,70.0,9.0,58.0,1007.65,3.0,5.0
75%,31.0,84.0,13.5,82.0,1016.7725,7.0,7.5
max,109.0,109.0,48.5,109.0,1199.21,14.0,20.0
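
The summary above reports a maximum of 109 for both Humidity and Precipitation (%). If these columns are percentages, which I am assuming from their names and value ranges, anything above 100 is physically implausible and worth counting before modeling. A minimal sketch of such a range check follows; `PERCENT_BOUNDS` and `count_out_of_range` are hypothetical helpers, and the data is made up.

```python
import pandas as pd

# Hypothetical plausibility bounds; the 0-100 range assumes both columns are percentages.
PERCENT_BOUNDS = {"Humidity": (0, 100), "Precipitation (%)": (0, 100)}

def count_out_of_range(frame, bounds):
    """Return, per column, how many values fall outside the inclusive [low, high] range."""
    counts = {}
    for col, (low, high) in bounds.items():
        # between() is inclusive on both ends by default.
        counts[col] = int((~frame[col].between(low, high)).sum())
    return counts

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Humidity": [20, 73, 109, 101],
    "Precipitation (%)": [0.0, 82.0, 109.0, 16.0],
})
out_of_range = count_out_of_range(toy, PERCENT_BOUNDS)
print(out_of_range)
```

Running the same helper on `df` would show how many records are affected; whether to cap, drop, or keep them is a separate decision.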



## 4. Exploratory Data Analysis (EDA)

This section is my deep dive into the data, much like a field investigation. I use visualizations and statistical tests to uncover patterns and relationships, especially those that shed light on wildfire risk factors. Understanding these connections is crucial for building a robust predictive model and ensuring our insights are grounded in evidence.


In [62]:
x.head()

Unnamed: 0,Temperature,Humidity,Wind Speed,Precipitation (%),Cloud Cover,Atmospheric Pressure,UV Index,Season,Visibility (km),Location
0,14.0,73,9.5,82.0,partly cloudy,1010.82,2,Winter,3.5,inland
1,39.0,96,8.5,71.0,partly cloudy,1011.43,7,Spring,10.0,inland
2,30.0,64,7.0,16.0,clear,1018.72,5,Spring,5.5,mountain
3,38.0,83,1.5,82.0,clear,1026.25,7,Spring,1.0,coastal
4,27.0,74,17.0,66.0,overcast,990.67,1,Winter,2.5,mountain


In [63]:

print("\n--- Exploratory Data Analysis (Visualizations and Statistical Tests) ---")

# Identify numerical and categorical features
numerical_features = x.select_dtypes(include=np.number).columns
categorical_features = x.select_dtypes(include="object").columns

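# Outlier Handling (capping with an IQR-style rule)
# Note: the fences below use the 20th and 80th percentiles rather than the classic 25th/75th quartiles.
# This approach helps to mitigate the impact of extreme values without removing data points,
# which is often preferred in environmental data where outliers might represent real, albeit rare, events.
# Capping modifies df only; x was created earlier as a separate DataFrame and keeps the original values.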
for col in numerical_features:
    Q1 = df[col].quantile(0.20)
    Q3 = df[col].quantile(0.80)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
    df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])
    print(f"Outliers in {col} handled by capping.")

# Histograms for numerical features (after outlier handling)
# for col in numerical_features:
#     plt.figure(figsize=(8, 5))
#     sns.histplot(df[col], kde=True)
#     plt.title(f"Distribution of {col} (After Outlier Capping)")
#     #plt.show()

# # Box plots for numerical features compared against the Dryness label
# for col in numerical_features:
#     plt.figure(figsize=(8, 5))
#     sns.boxplot(x=df["Dryness"], y=df[col])
#     plt.title(f"{col} by Dryness Label for Wildfire (After Outlier Capping)")
#     #plt.show()

# # Plots for categorical features
# for col in categorical_features:
#     plt.figure(figsize=(10, 6))
#     sns.countplot(data=df, y=col, hue="Dryness", order = df[col].value_counts().index)
#     plt.title(f"Count of {col} by Dryness Label for Wildfire")
#     plt.tight_layout()
#     #plt.show()

# # Correlation heatmap
# plt.figure(figsize=(12, 10))
# correlation_matrix = df[numerical_features].corr()
# sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
# plt.title("Correlation Heatmap of Numerical Features")
# #plt.show()

# --- Statistical Tests for Dryness --- 
print("\n--- Statistical Tests for Dryness ---")

dry_group = df[df["Dryness"] == "Dry"]
not_dry_group = df[df["Dryness"] == "Not Dry"]

# Welch's t-tests (equal_var=False) for key numerical features (Temperature, Humidity, Precipitation (%), Wind Speed)
# to see if their means are significantly different between Dry and Not Dry groups.
# The data are not assumed to be strictly normal, but the t-test is robust for large samples.
# For clearly non-normal data, the Mann-Whitney U test could be used instead.
key_fwi_features = ["Temperature", "Humidity", "Precipitation (%)", "Wind Speed"]

for feature in key_fwi_features:
    stat, p = stats.ttest_ind(dry_group[feature], not_dry_group[feature], equal_var=False) # t-test
    print(f"\n{feature} - T-test (Dry vs. Not Dry):")
    print(f"  Statistic: {stat:.3f}")
    print(f"  P-value: {p:.3f}")
    if p < 0.05:
        print(f"  Conclusion: Significant difference in {feature} between Dry and Not Dry groups (p < 0.05)")
    else:
        print(f"  Conclusion: No significant difference in {feature} between Dry and Not Dry groups (p >= 0.05)")

# Chi-squared test for categorical features (Season, Location, Cloud Cover) vs. Dryness
print("\n--- Chi-squared Tests for Categorical Features vs. Dryness ---")
for feature in categorical_features:
    contingency_table = pd.crosstab(df[feature], df["Dryness"])
    chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
    print(f"\n{feature} - Chi-squared test (vs. Dryness):")
    print(f"  Chi2 Statistic: {chi2:.3f}")
    print(f"  P-value: {p:.3f}")
    if p < 0.05:
        print(f"  Conclusion: Significant association between {feature} and Dryness (p < 0.05)")
    else:
        print(f"  Conclusion: No significant association between {feature} and Dryness (p >= 0.05)")



--- Exploratory Data Analysis (Visualizations and Statistical Tests) ---
Outliers in Temperature handled by capping.
Outliers in Humidity handled by capping.
Outliers in Wind Speed handled by capping.
Outliers in Precipitation (%) handled by capping.
Outliers in Atmospheric Pressure handled by capping.
Outliers in UV Index handled by capping.
Outliers in Visibility (km) handled by capping.

--- Statistical Tests for Dryness ---

Temperature - T-test (Dry vs. Not Dry):
  Statistic: 52.882
  P-value: 0.000
  Conclusion: Significant difference in Temperature between Dry and Not Dry groups (p < 0.05)

Humidity - T-test (Dry vs. Not Dry):
  Statistic: -115.072
  P-value: 0.000
  Conclusion: Significant difference in Humidity between Dry and Not Dry groups (p < 0.05)

Precipitation (%) - T-test (Dry vs. Not Dry):
  Statistic: -162.945
  P-value: 0.000
  Conclusion: Significant difference in Precipitation (%) between Dry and Not Dry groups (p < 0.05)

Wind Speed - T-test (Dry vs. Not Dry):
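
The call `stats.ttest_ind(..., equal_var=False)` used above is Welch's t-test, which does not assume equal variances in the two groups. To make the statistic concrete, here is a minimal NumPy sketch of the formula; `welch_t` is a helper I introduce for illustration and the temperatures are made up. Since `Dryness` is defined from Temperature, Humidity, and Precipitation (%) thresholds, large statistics for those three features are expected by construction, so I read them as a sanity check rather than independent evidence.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each group mean, using the sample variance (ddof=1).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof

# Made-up temperatures for illustration only.
dry_temps = [31.0, 34.0, 29.0, 36.0]
not_dry_temps = [12.0, 20.0, 8.0, 25.0, 15.0, 18.0]
t_stat, dof = welch_t(dry_temps, not_dry_temps)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}")
```

When p-values are extremely small, formatting them with `{p:.3e}` instead of `{p:.3f}` avoids printing them as `0.000`.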
 

In [64]:
df.describe()

Unnamed: 0,Temperature,Humidity,Wind Speed,Precipitation (%),Atmospheric Pressure,UV Index,Visibility (km)
count,13200.0,13200.0,13200.0,13200.0,13200.0,13200.0,13200.0
mean,19.086705,68.710833,9.721136,53.644394,1006.017468,4.005758,5.429867
std,17.225411,20.194248,6.49295,31.946541,18.74581,3.8566,3.251377
min,-25.0,20.0,0.0,0.0,952.735,0.0,0.0
25%,4.0,57.0,5.0,19.0,994.8,1.0,3.0
50%,21.0,70.0,9.0,58.0,1007.65,3.0,5.0
75%,31.0,84.0,13.5,82.0,1016.7725,7.0,7.5
max,78.5,109.0,30.25,109.0,1057.863,14.0,16.25
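
Comparing this summary with the earlier one shows the effect of capping: the Temperature maximum drops from 109.0 to 78.5 and the Atmospheric Pressure minimum rises from 800.12 to 952.735. The same rule can be expressed with `Series.clip`. The sketch below is one possible formulation under the notebook's settings (20th and 80th percentiles, multiplier 1.5); `cap_by_quantile_fences` is a hypothetical helper and the wind speeds are made up.

```python
import pandas as pd

def cap_by_quantile_fences(series, lower_q=0.20, upper_q=0.80, k=1.5):
    """Clip values to [Q_low - k*spread, Q_high + k*spread], where spread = Q_high - Q_low."""
    q_low = series.quantile(lower_q)
    q_high = series.quantile(upper_q)
    spread = q_high - q_low
    return series.clip(lower=q_low - k * spread, upper=q_high + k * spread)

# Made-up wind speeds with one extreme value.
wind = pd.Series([0.0, 5.0, 7.0, 9.0, 10.0, 12.0, 13.5, 15.0, 17.0, 48.5, 9.5])
capped = cap_by_quantile_fences(wind)
print(capped.max())
```

Passing `lower_q=0.25, upper_q=0.75` would give the classic Tukey fences, which are narrower and therefore cap more values.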


---


## 5. Preprocessing

This final data preparation step transforms our raw and engineered features into a format suitable for machine learning models. It involves two key processes: scaling numerical data and encoding categorical data. Think of it as preparing different types of environmental samples for laboratory analysis—each needs specific handling to yield comparable and accurate results.


In [65]:

print("\n--- Preprocessing ---")

# Scaling Numerical Features: StandardScaler
# Numerical features like Temperature or Humidity have different ranges. 
numerical_transformer = StandardScaler()

# Encoding Categorical Features: OneHotEncoder
categorical_transformer = OneHotEncoder(handle_unknown="ignore", drop="first")
        
# Combining Transformations with ColumnTransformer
# Apply different transformations for different columns of the dataset simultaneously.
# It ensures that numerical features are scaled and categorical features are encoded.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ],
    remainder="passthrough" # Keep any other columns that weren't specified
)

# Apply preprocessing to the feature matrix x
x_processed = preprocessor.fit_transform(x)

# Convert the processed array back to a DataFrame for easier viewing and further steps
# This also helps in understanding the new feature names created by OneHotEncoder.
feature_names_out = preprocessor.get_feature_names_out()
x_processed_df = pd.DataFrame(x_processed, columns=feature_names_out)

print("\nShape of processed X:")
print(x_processed_df.shape)



--- Preprocessing ---

Shape of processed X:
(13200, 15)
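
The 15 columns are consistent with 7 scaled numerical features plus 8 one-hot columns: with the first category of each feature dropped, 4 cloud-cover levels give 3 columns, 4 seasons give 3, and 3 locations give 2. The sketch below reproduces that arithmetic with `pd.get_dummies(drop_first=True)`, which I use here as a lightweight stand-in for `OneHotEncoder(drop="first")`, not as the notebook's method. The category lists are inferred from the encoded column names, and "Autumn" is my assumption for the season that does not appear in the rows shown.

```python
import pandas as pd

# Toy categorical data; "Autumn" is an assumed fourth season, not confirmed by the rows shown.
toy = pd.DataFrame({
    "Cloud Cover": ["clear", "cloudy", "overcast", "partly cloudy"],
    "Season": ["Autumn", "Spring", "Summer", "Winter"],
    "Location": ["coastal", "inland", "mountain", "inland"],
})

# drop_first=True removes the first (alphabetically sorted) category of each column,
# which then acts as the implicit baseline.
encoded = pd.get_dummies(toy, drop_first=True)

n_numeric = 7  # numerical columns listed by df.dtypes in the notebook
total_columns = n_numeric + encoded.shape[1]
print(list(encoded.columns))
print(total_columns)
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters mainly for linear models.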


##### Encoded and Scaled Dataset 

In [66]:
x_processed_df.head()

Unnamed: 0,num__Temperature,num__Humidity,num__Wind Speed,num__Precipitation (%),num__Atmospheric Pressure,num__UV Index,num__Visibility (km),cat__Cloud Cover_cloudy,cat__Cloud Cover_overcast,cat__Cloud Cover_partly cloudy,cat__Season_Spring,cat__Season_Summer,cat__Season_Winter,cat__Location_inland,cat__Location_mountain
0,-0.294931,0.212404,-0.048086,0.887629,0.134203,-0.520104,-0.582231,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
1,1.143035,1.351385,-0.192836,0.543291,0.150602,0.776424,1.345768,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
2,0.625367,-0.233285,-0.409962,-1.178401,0.346579,0.257813,0.010999,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,1.085516,0.707613,-1.206089,0.887629,0.549008,0.776424,-1.323769,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.452811,0.261924,1.037543,0.386773,-0.40749,-0.77941,-0.878846,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
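
As a sanity check on the scaling, the first `num__Temperature` value can be reproduced by hand from the first `df.describe()` output. One detail to keep in mind: `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0). The sketch below uses only numbers printed in this notebook; the variable names are mine.

```python
import math

# Values copied from the first df.describe() output and the first data row.
n = 13200
mean_temp = 19.127576
sample_std_temp = 17.386327  # describe() uses ddof=1
first_temp = 14.0

# Convert the sample std (ddof=1) to the population std (ddof=0) used by StandardScaler.
population_std_temp = sample_std_temp * math.sqrt((n - 1) / n)

z = (first_temp - mean_temp) / population_std_temp
print(round(z, 4))
```

The result is about -0.2949, in line with the -0.294931 shown in the first row. It also indicates that the scaler was fitted on the uncapped values in `x`, since the mean and standard deviation used here come from the summary taken before capping.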


---

## __Conclusion:__ 

This notebook provided a complete data preparation pipeline for weather-based dryness classification, with the end goal of supporting proactive wildfire avoidance strategies. 

Key accomplishments include:

- ✅ **Feature Engineering**: Created a meaningful binary target (`Dryness`) rooted in environmental science, __inspired__ by the Canadian FWI system.

    * The FWI System does not use hard “base values” of precipitation, humidity or temperature. These are inputs to a series of moisture codes and indices. The thresholds that matter are applied to the resulting FWI, and must be calibrated locally using historical data (e.g., using the 95th percentile or regional benchmarks like FWI > 30). Fire weather danger is thus dynamically derived, not triggered by preset weather values.
    
    * __In other words, although the FWI System does not work with hard base values, it inspired my search for the threshold values I used to build a dryness feature for prediction.__

- ✅ **Data Cleaning & Integrity**: Checked for duplicates and missing values; none were found, so the input is clean for modeling.
- ✅ **Exploratory Analysis & Statistical Validation**:
    - Ran statistical tests (Welch's t-tests and chi-squared tests), alongside visualization code for distributions and correlations, to confirm the relevance of features like temperature, humidity, and precipitation.
    - Identified significant relationships between environmental conditions and dryness patterns.
- ✅ **Preprocessing for ML**: Applied feature scaling and one-hot encoding with a `ColumnTransformer`, preparing the dataset for model ingestion.

This work sets the stage for the next phase: **model training and evaluation**, where I will test multiple machine learning classifiers to predict dry conditions. These models can later be calibrated and deployed in early warning systems or decision-support tools for environmental management.



### 📚 References

1. **Forest Fire Weather Index (FWI)**: Defines high-risk days as temperature > 25 °C, Relative Humidity < 40–45%, and 24-h precipitation < 1 mm (Environment Canada / Météo France); see the Forest Fire Weather Index documentation.

2. **Wildfire modeling factors**: Temperature and relative humidity are principal weather drivers in fire spread modeling.

3. **Drylands classification (UNEP)**: Drylands have aridity index P/PET < 0.65, indicating insufficient moisture availability.

4. **Aridity index methodology**: Definitions by Köppen and Thornthwaite further support precipitation-based climate classification.

5. **Recent trends in fire weather**: A 2025 study shows rising noon temperatures are the dominant factor in increased fire danger.
