# 🧪 Work Overview

In this work, we will:

🧼 Clean and preprocess multiple datasets (elevation, soil, climate, etc.)

🔗 Merge them into a single unified dataset

🔍 Run tests to check whether feature reduction is possible

In [4]:
from scripts.dataMerging.combineDatasets import (
    extract_features_elevation,
    extract_features_landcover,
    extract_features_monthly_clim,
    extract_features_soil,
    organize_monthly_climat_files,
)
from scripts.dataMerging.mergeDataSources import progressive_merge
from scripts.dataMerging.generateGrid import generate_grid_in_shape
from scripts.dataPreprocessing.dataCleaning import process_fire_data, treat_sensor_errors_soil, impute_with_geo_zones
from scripts.dataPreprocessing.scalingEncoding import one_hot_encode, target_encode, scale_dataset
from scripts.statistics.firePerSeason import calculate_seasonal_fire_percentage
from scripts.dataPreprocessing.featureReduction import analyze_correlation_variance, reduce_features
from scripts.statistics.firePourcentage import fired_pourcentage

## 🗺️ Reference Grid

📐 Create a reference grid with consistent latitude and longitude

🔗 Ensures all datasets align and can be merged correctly

In [21]:

# Step 1: Generate grid (only once)
grid_df = generate_grid_in_shape(
    "../data/shapefiles/combined/north/alg_tun_north.shp",
    resolution=0.01,  # grid spacing in degrees (roughly 1 km)
    output_csv="../data/features/grid_points.csv",
    min_latitude=34,
    max_latitude=37.5,
)



📂 Loading shapefile and reprojecting to EPSG:4326...
🗺️ Bounding box (lon/lat): [-8.67386818 18.96023083 11.98736715 37.55986   ]
📏 Grid candidate size: 2067 × 1860 = 3,844,620 points
⬆️ Applied min_latitude=34: 3,844,620 -> 735,852
⬇️ Applied max_latitude=37.5: 735,852 -> 723,450
🔍 Filtering points inside region using spatial join...
✅ 330,281 points inside shapefile after spatial join
💾 Saved grid to ../data/features/grid_points.csv
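
The candidate-grid and latitude-filter steps in the log above can be sketched with NumPy alone. This is only an illustration, not the code of `generate_grid_in_shape`: `make_grid` and `cell_size_km` are hypothetical helpers, the shapefile spatial join is left out, and 111.32 km per degree is the usual approximation for one degree of latitude.

```python
import math

import numpy as np

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def make_grid(lon_min, lat_min, lon_max, lat_max, resolution,
              min_latitude=None, max_latitude=None):
    """Return an (n, 2) array of [latitude, longitude] candidate points."""
    lons = np.arange(lon_min, lon_max, resolution)
    lats = np.arange(lat_min, lat_max, resolution)
    lon_grid, lat_grid = np.meshgrid(lons, lats)
    points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])
    # Optional latitude bounds, applied before any (costlier) spatial join
    if min_latitude is not None:
        points = points[points[:, 0] >= min_latitude]
    if max_latitude is not None:
        points = points[points[:, 0] <= max_latitude]
    return points


def cell_size_km(resolution_deg, latitude_deg):
    """Approximate (north-south, east-west) size of one grid cell in km."""
    north_south = resolution_deg * KM_PER_DEGREE
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


demo = make_grid(0.0, 0.0, 1.0, 1.0, 0.25, min_latitude=0.5)
print(len(demo))
print(cell_size_km(0.01, 36.0))
```

Under these assumptions, a 0.01° cell at 36°N is about 1.1 km north to south and about 0.9 km east to west, which is why the resolution is described as roughly 1 km.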


## 🔥 Extract Nearest Points (cKDTree)

🌳 Use cKDTree to find the nearest grid point for each fire record

📍 Matches fire locations to the reference grid efficiently

⚡ Fast nearest-neighbor search for large datasets

In [4]:

# Define the paths and parameters
GRID_FILE = "../data/features/grid_points.csv"
FIRE_FILE = "../data/fire_dataset/viirs-jpss1_alg_Tun.csv"
TARGET_FIRE_TYPE = 0  # in FIRMS data, type 0 = presumed vegetation fire (assuming target_type maps to that attribute)

process_fire_data(
    grid_path=GRID_FILE,
    fire_path=FIRE_FILE,
    target_type=TARGET_FIRE_TYPE,
    output_file="../data/features_cleaned/grid_fire_clean.csv"
)


✅ Saved (330281, 3) grid points with binary fire data (1/0) to ../data/features_cleaned/grid_fire_clean.csv
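
The nearest-point matching described in this section can be sketched as follows. This is a minimal illustration with `scipy.spatial.cKDTree`, not the body of `process_fire_data`: the helper name, the `max_distance_deg` cut-off and the toy data are assumptions, and distances are plain Euclidean distances in degrees.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

COORDS = ["latitude", "longitude"]


def label_grid_with_fires(grid, fires, max_distance_deg=0.01):
    """Mark each grid point that is the nearest neighbour of at least one fire."""
    tree = cKDTree(grid[COORDS].to_numpy())
    # For fires farther than the bound, query() returns an infinite distance
    distances, indices = tree.query(
        fires[COORDS].to_numpy(), k=1, distance_upper_bound=max_distance_deg
    )
    matched = indices[np.isfinite(distances)]
    fire = np.zeros(len(grid), dtype=int)
    fire[matched] = 1
    labelled = grid[COORDS].copy()
    labelled["fire"] = fire
    return labelled


grid = pd.DataFrame({"latitude": [0.0, 0.0, 1.0, 1.0],
                     "longitude": [0.0, 1.0, 0.0, 1.0]})
fires = pd.DataFrame({"latitude": [0.001, 5.0],
                      "longitude": [0.002, 5.0]})
labelled = label_grid_with_fires(grid, fires)
print(labelled)
```

The result has the same three-column shape as the saved file (latitude, longitude, fire). The second toy fire lies far from every grid point and is ignored because of the distance bound.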


## ☁️ Climate Dataset

❄️ Extract seasonal data (winter, spring, summer, autumn)

🛠️ Preprocess by filling missing values with the median

🌍 Apply regional resolution

📏 Scale features using a Robust Scaler

In [None]:

# Organize the files
monthly_tmax_data = organize_monthly_climat_files(
    "../data/climate_dataset/5min/max/*.tif"
)
monthly_tmin_data = organize_monthly_climat_files(
    "../data/climate_dataset/5min/min/*.tif"
)
monthly_tprec_data = organize_monthly_climat_files(
    "../data/climate_dataset/5min/prec/*.tif"
)


extract_features_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    raster_dict=monthly_tmax_data,
    output_path="../data/features/grid_tmax.csv",
    col_name="tmax",
)


extract_features_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    raster_dict=monthly_tmin_data,
    output_path="../data/features/grid_tmin.csv",
    col_name="tmin",
)


extract_features_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    raster_dict=monthly_tprec_data,
    output_path="../data/features/grid_tprec.csv",
    col_name="prec",
)

Month 01: 100%|██████████| 330281/330281 [00:46<00:00, 7045.44it/s]
Month 02: 100%|██████████| 330281/330281 [00:45<00:00, 7255.54it/s]
Month 03: 100%|██████████| 330281/330281 [00:44<00:00, 7386.56it/s]
Month 04: 100%|██████████| 330281/330281 [00:45<00:00, 7217.29it/s]
Month 05: 100%|██████████| 330281/330281 [00:47<00:00, 6922.87it/s]
Month 06: 100%|██████████| 330281/330281 [00:46<00:00, 7064.88it/s]
Month 07: 100%|██████████| 330281/330281 [00:46<00:00, 7041.51it/s]
Month 08: 100%|██████████| 330281/330281 [00:45<00:00, 7203.25it/s]
Month 09: 100%|██████████| 330281/330281 [00:47<00:00, 6980.90it/s]
Month 10: 100%|██████████| 330281/330281 [00:45<00:00, 7268.81it/s]
Month 11: 100%|██████████| 330281/330281 [00:46<00:00, 7121.76it/s]
Month 12: 100%|█████

✅ Finished sampling all monthly rasters.
💾 Saved seasonal climatology to ../data/features/grid_tmax.csv


Month 01: 100%|██████████| 330281/330281 [01:15<00:00, 4368.84it/s]
Month 02: 100%|██████████| 330281/330281 [01:14<00:00, 4438.66it/s]
Month 03: 100%|██████████| 330281/330281 [01:00<00:00, 5487.07it/s]
Month 04: 100%|██████████| 330281/330281 [01:13<00:00, 4472.39it/s]
Month 05: 100%|██████████| 330281/330281 [01:29<00:00, 3689.96it/s]
Month 06: 100%|██████████| 330281/330281 [01:24<00:00, 3892.00it/s]
Month 07: 100%|██████████| 330281/330281 [01:31<00:00, 3610.69it/s]
Month 08: 100%|██████████| 330281/330281 [01:26<00:00, 3829.47it/s]
Month 09: 100%|██████████| 330281/330281 [01:29<00:00, 3702.28it/s]
Month 10: 100%|██████████| 330281/330281 [01:18<00:00, 4221.63it/s]
Month 11: 100%|██████████| 330281/330281 [01:20<00:00, 4120.02it/s]
Month 12: 100%|█████

✅ Finished sampling all monthly rasters.
💾 Saved seasonal climatology to ../data/features/grid_tmin.csv


Month 01: 100%|██████████| 330281/330281 [01:43<00:00, 3193.66it/s]
Month 02: 100%|██████████| 330281/330281 [01:22<00:00, 3988.86it/s]
Month 03: 100%|██████████| 330281/330281 [01:33<00:00, 3536.19it/s]
Month 04: 100%|██████████| 330281/330281 [01:37<00:00, 3386.08it/s]
Month 05: 100%|██████████| 330281/330281 [01:41<00:00, 3240.33it/s]
Month 06: 100%|██████████| 330281/330281 [01:29<00:00, 3686.93it/s]
Month 07: 100%|██████████| 330281/330281 [01:22<00:00, 4016.68it/s]
Month 08: 100%|██████████| 330281/330281 [01:38<00:00, 3340.82it/s]
Month 09: 100%|██████████| 330281/330281 [01:49<00:00, 3017.00it/s]
Month 10: 100%|██████████| 330281/330281 [01:36<00:00, 3426.03it/s]
Month 11: 100%|██████████| 330281/330281 [01:13<00:00, 4498.50it/s]
Month 12: 100%|█████

✅ Finished sampling all monthly rasters.
💾 Saved seasonal climatology to ../data/features/grid_tprec.csv
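
The step from twelve monthly samples to four seasonal columns (such as `winter_tmax`) can be sketched as below. The month-to-season mapping (December to February as winter, and so on), the monthly column names like `tmax_01`, and the use of a mean are all assumptions made for illustration. The actual aggregation inside `extract_features_monthly_clim` may differ, for instance by summing precipitation.

```python
import pandas as pd

# Assumed meteorological seasons for the northern hemisphere
SEASON_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "autumn": [9, 10, 11],
}


def monthly_to_seasonal(df, col_name, agg="mean"):
    """Collapse <col_name>_01..12 columns into <season>_<col_name> columns."""
    out = df[["latitude", "longitude"]].copy()
    for season, months in SEASON_MONTHS.items():
        monthly_cols = [f"{col_name}_{m:02d}" for m in months]
        out[f"{season}_{col_name}"] = df[monthly_cols].agg(agg, axis=1)
    return out


# Toy row in which the value for month m is simply m
row = {"latitude": [36.0], "longitude": [3.0]}
row.update({f"tmax_{m:02d}": [float(m)] for m in range(1, 13)})
seasonal = monthly_to_seasonal(pd.DataFrame(row), "tmax")
print(seasonal)
```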


In [24]:
calculate_seasonal_fire_percentage('../data/fire_dataset/viirs-jpss1_2024_alg_Tun.csv')


| Season | Count | Percentage |
|---|---|---|
| Winter | 18609 | 20.62% |
| Spring | 23093 | 25.59% |
| Summer | 24667 | 27.33% |
| Autumn/Fall | 23881 | 26.46% |


### 📊 Seasonal Fire Distribution
🔥 Fires are spread fairly evenly across the seasons: each one accounts for between 20.62% and 27.33% of the records

⚠️ Therefore, dropping any season’s climate data is not advisable
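
A seasonal breakdown like the one above can be computed in a few lines of pandas. This is a sketch rather than the code of `calculate_seasonal_fire_percentage`: it assumes the fire CSV has an `acq_date` column (as FIRMS exports do) and uses the same assumed month-to-season mapping as a meteorological calendar.

```python
import pandas as pd

SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn/Fall", 10: "Autumn/Fall", 11: "Autumn/Fall",
}


def seasonal_fire_percentage(fires, date_col="acq_date"):
    """Count fire records per season and express each count as a percentage."""
    months = pd.to_datetime(fires[date_col]).dt.month
    counts = months.map(SEASON_BY_MONTH).value_counts()
    result = counts.rename("Count").to_frame()
    result["Percentage"] = (100 * result["Count"] / result["Count"].sum()).round(2)
    return result


toy_fires = pd.DataFrame(
    {"acq_date": ["2024-01-15", "2024-07-01", "2024-07-20", "2024-04-03"]}
)
season_table = seasonal_fire_percentage(toy_fires)
print(season_table)
```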

In [None]:

impute_with_geo_zones(
    "../data/features/grid_tmax.csv",
    base_res=0.05, min_points=10, max_res=0.5,
    output_path="../data/features_cleaned/grid_tmax_clean.csv",
)

impute_with_geo_zones(
    "../data/features/grid_tmin.csv",
    base_res=0.05, min_points=10, max_res=0.5,
    output_path="../data/features_cleaned/grid_tmin_clean.csv",
)

impute_with_geo_zones(
    "../data/features/grid_tprec.csv",
    base_res=0.05, min_points=10, max_res=0.5,
    output_path="../data/features_cleaned/grid_prec_clean.csv",
)


Missing values (percent) per column :
winter_tmax    0.687596
spring_tmax    0.687596
summer_tmax    0.687596
autumn_tmax    0.687596
dtype: float64

=== Imputing column: winter_tmax ===
winter_tmax: imputation done using geo-zones.

=== Imputing column: spring_tmax ===
spring_tmax: imputation done using geo-zones.

=== Imputing column: summer_tmax ===
summer_tmax: imputation done using geo-zones.

=== Imputing column: autumn_tmax ===
autumn_tmax: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_tmax_clean.csv
Saved preprocessed dataset → ../data/preprocessed/tmax_preprocessed.csv
Missing values (percent) per column :
winter_tmin    0.687596
spring_tmin    0.687596
summer_tmin    0.687596
autumn_tmin    0.687596
dtype: float64

=== Imputing column: winter_tmin ===
winter_tmin: imputation done using geo-zones.

=== Imputing column: spring_tmin ===
spring_tmin: imputation done using geo-zones.

=== Imputing column: summer_tmin ===
summer_tmin: imp
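
One plausible reading of the `base_res`, `min_points` and `max_res` parameters is sketched below: points are binned into square lat/lon zones, a missing value takes the median of its zone when the zone holds at least `min_points` observed values, and the zone size doubles until `max_res` is exceeded. This is an assumption about how geo-zone imputation could work, not the actual logic of `impute_with_geo_zones`. The final global-median fallback is likewise hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_zone(df, col, base_res=0.05, min_points=10, max_res=0.5):
    """Fill NaNs in `col` with the median of progressively larger geo-zones."""
    out = df.copy()
    res = base_res
    while res <= max_res and out[col].isna().any():
        # Assign every point to a square zone of side `res` degrees
        zone_lat = np.floor(out["latitude"] / res).rename("zone_lat")
        zone_lon = np.floor(out["longitude"] / res).rename("zone_lon")
        grouped = out.groupby([zone_lat, zone_lon])[col]
        zone_median = grouped.transform("median")  # NaNs are skipped
        zone_count = grouped.transform("count")    # observed values only
        fill = out[col].isna() & (zone_count >= min_points)
        out.loc[fill, col] = zone_median[fill]
        res *= 2
    # Hypothetical fallback for points no zone could fill
    out[col] = out[col].fillna(out[col].median())
    return out


toy = pd.DataFrame({
    "latitude": [0.01, 0.02, 0.03, 5.0],
    "longitude": [0.01, 0.02, 0.03, 5.0],
    "winter_tmax": [1.0, 3.0, np.nan, 100.0],
})
imputed = impute_by_zone(toy, "winter_tmax", min_points=2)
print(imputed)
```

In the toy data the missing value is filled from its two close neighbours, and the distant point with value 100 has no influence on it.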

## 🌳 Landcover Dataset

🌱 Extract landcover values from the reference grid

🛠️ Preprocess by handling missing values using the median

🌍 Apply regional resolution

📏 Scale features using a Robust Scaler

✅ Only the `GRIDCODE` feature is kept

In [26]:
extract_features_landcover(
    csv_path="../data/features/grid_points.csv",
    shapefile_path="../data/land_dataset/combined/alg_tun_landcvr.shp",
    lat_col="latitude",
    lon_col="longitude",
    keep_cols=["GRIDCODE"],  # can be ["GRIDCODE", "CLASS", "AREA", ...]
    output_path="../data/features/grid_landcover.csv",
)


In [3]:
impute_with_geo_zones(
    "../data/features/grid_landcover.csv",
    cat_cols=["GRIDCODE"],
    base_res=0.05, min_points=10, max_res=0.2,
    output_path="../data/features_cleaned/grid_landcover_clean.csv",
)

Missing values (percent) per column :
GRIDCODE    0.057224
dtype: float64

=== Imputing column: GRIDCODE ===
GRIDCODE: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_landcover_clean.csv


## 🌱 Soil Dataset

🧱 Extract soil features from the reference grid

🛠️ Preprocess missing and invalid data

Negative values (likely sensor errors) are treated as missing

Apply regional resolution

🎨 Feature selection & encoding

`TEXTURE_SOTER` and `TEXTURE_USDA` describe the same property (soil texture)

Keep only `TEXTURE_USDA` (more detailed)

Apply One-Hot Encoding to `TEXTURE_USDA`

📏 Scale features using a Robust Scaler

In [28]:
extract_features_soil(
    csv_path="../data/features/grid_points.csv",
    raster_path="../data/soil_dataset/original/HWSD2_RASTER/HWSD2.bil",
    soil_attributes_csv="../data/soil_dataset/simplified/D1_soil_features_alg_tun.csv",
    output_soil_ids="../data/features/fire_soil_ids.csv",
    output_soil_feature="../data/features/grid_soil.csv",
)

In [29]:
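import pandas as pd
df = pd.read_csv("../data/features/grid_soil.csv")
# Drop the redundant TEXTURE_SOTER column and write the file back,
# because the next step re-reads this CSV from disk
if "TEXTURE_SOTER" in df.columns:
    df.drop(columns=["TEXTURE_SOTER"], inplace=True)
    df.to_csv("../data/features/grid_soil.csv", index=False)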

In [None]:
treat_sensor_errors_soil("../data/features/grid_soil.csv", output_path="../data/features/grid_soil_treated.csv")

✔ Cleaning complete!
  Deleted rows : 0
  Fixed rows   : 2936
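
Treating negative readings as missing can be sketched as a simple mask. This is an illustration rather than the code of `treat_sensor_errors_soil`: the helper name is hypothetical, and "fixed rows" is taken here to mean rows with at least one negative value.

```python
import pandas as pd


def mask_negative_values(df, numeric_cols):
    """Replace negative entries in numeric_cols with NaN and count affected rows."""
    out = df.copy()
    negative = out[numeric_cols] < 0
    out[numeric_cols] = out[numeric_cols].mask(negative)
    fixed_rows = int(negative.any(axis=1).sum())
    return out, fixed_rows


toy_soil = pd.DataFrame({
    "SAND": [10.0, -9.0, 30.0],
    "CLAY": [5.0, 6.0, -1.0],
})
cleaned, fixed_rows = mask_negative_values(toy_soil, ["SAND", "CLAY"])
print(cleaned)
print("Fixed rows:", fixed_rows)
```

No row is deleted in this approach. The masked values are left for the imputation step to fill.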


In [31]:


CATEGORICAL_COLS_SOIL = ["TEXTURE_USDA"]  # categorical columns
NUMERIC_COLS_SOIL = [
    "COARSE", "SAND", "SILT", "CLAY", "BULK", "REF_BULK", "ORG_CARBON", "PH_WATER",
    "TOTAL_N", "CN_RATIO", "CEC_SOIL", "CEC_CLAY", "CEC_EFF", "TEB", "BSAT",
    "ALUM_SAT", "ESP", "TCARBON_EQ", "GYPSUM", "ELEC_COND"
]  # numeric columns

# Impute numeric and categorical soil columns within geo-zones
soil_cleaned = impute_with_geo_zones(
    "../data/features/grid_soil_treated.csv",
    num_cols=NUMERIC_COLS_SOIL,
    cat_cols=CATEGORICAL_COLS_SOIL,
    base_res=0.05, min_points=10, max_res=0.5,
    output_path="../data/features_cleaned/grid_soil_clean.csv",
)


Missing values (percent) per column :
COARSE          0.349436
SAND            0.349436
SILT            0.349436
CLAY            0.349436
TEXTURE_USDA    0.349436
BULK            0.349436
REF_BULK        0.349436
ORG_CARBON      0.349436
PH_WATER        0.349436
TOTAL_N         0.349436
CN_RATIO        0.349436
CEC_SOIL        0.349436
CEC_CLAY        0.349436
CEC_EFF         0.349436
TEB             0.349436
BSAT            0.349436
ALUM_SAT        0.349436
ESP             0.349436
TCARBON_EQ      0.349436
GYPSUM          0.349436
ELEC_COND       0.349436
dtype: float64

=== Imputing column: COARSE ===
COARSE: imputation done using geo-zones.

=== Imputing column: SAND ===
SAND: imputation done using geo-zones.

=== Imputing column: SILT ===
SILT: imputation done using geo-zones.

=== Imputing column: CLAY ===
CLAY: imputation done using geo-zones.

=== Imputing column: TEXTURE_USDA ===
TEXTURE_USDA: imputation done using geo-zones.

=== Imputing column: BULK ===
BULK: imputation done

## 🏔️ Elevation Dataset

🗻 Extract elevation values from the reference grid

🛠️ Preprocess by handling missing values using the median

🌍 Apply regional resolution if needed

📏 Scale features using a Robust Scaler

In [33]:
fires_with_elevation = extract_features_elevation(
    raster_path="../data/elevation_dataset/simplified/elevation_clipped.tif",
    fire_csv_path="../data/features/grid_points.csv",
    output_csv="../data/features/grid_elevation.csv",
    value_name="elevation",
)


Loaded 330281 points from ../data/features/grid_points.csv


Extracting elevation: 100%|██████████| 330281/330281 [01:51<00:00, 2964.18it/s]


✅ Saved extracted elevation to ../data/features/grid_elevation.csv


In [None]:
impute_with_geo_zones("../data/features/grid_elevation.csv", base_res=0.05, min_points=10, max_res=0.5, output_path="../data/features_cleaned/grid_elevation_clean.csv")


Missing values (percent) per column :
Series([], dtype: float64)
💾 Saved imputation to ../data/features_cleaned/grid_elevation_clean.csv
Saved preprocessed dataset → ../data/preprocessed/elevation_preprocessed.csv


## 📊 Merging Preprocessed Datasets
All preprocessed datasets are merged on their shared key columns, `latitude` and `longitude`, to obtain one final, unified dataset for analysis.

In [5]:
csv_list = [
    "../data/features_cleaned/grid_tmax_clean.csv",
    "../data/features_cleaned/grid_tmin_clean.csv",
    "../data/features_cleaned/grid_prec_clean.csv",
    "../data/features_cleaned/grid_landcover_clean.csv",
    "../data/features_cleaned/grid_elevation_clean.csv",
    "../data/features_cleaned/grid_soil_clean.csv",
    "../data/features_cleaned/grid_fire_clean.csv",
]
temp_df = progressive_merge(
    csv_list,
    on=["latitude", "longitude"],
    how="inner",
    output_path="../data/Merged/merged.csv"
)


Loading first CSV: ../data/features_cleaned/grid_tmax_clean.csv
🔁 Merging file 2/7: ../data/features_cleaned/grid_tmin_clean.csv
✅ Intermediate merged size: (330281, 10)
🔁 Merging file 3/7: ../data/features_cleaned/grid_prec_clean.csv
✅ Intermediate merged size: (330281, 14)
🔁 Merging file 4/7: ../data/features_cleaned/grid_landcover_clean.csv
✅ Intermediate merged size: (330281, 15)
🔁 Merging file 5/7: ../data/features_cleaned/grid_elevation_clean.csv
✅ Intermediate merged size: (330281, 16)
🔁 Merging file 6/7: ../data/features_cleaned/grid_soil_clean.csv
✅ Intermediate merged size: (845075, 37)
🔁 Merging file 7/7: ../data/features_cleaned/grid_fire_clean.csv
✅ Intermediate merged size: (845075, 38)
✅ All files merged successfully.
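
In the log above the row count grows from 330,281 to 845,075 when the soil file is merged. That is consistent with an inner join in which one side holds several rows per `(latitude, longitude)` key. The toy example below shows the effect and two ways to detect it with pandas. The frames and the helper `duplicate_key_rows` are hypothetical and only illustrate the check.

```python
import pandas as pd

KEYS = ["latitude", "longitude"]


def duplicate_key_rows(df, keys=KEYS):
    """Number of rows whose key combination appears more than once."""
    return int(df.duplicated(subset=keys, keep=False).sum())


grid = pd.DataFrame({"latitude": [34.0, 34.0],
                     "longitude": [-1.66, -1.65],
                     "elevation": [900, 910]})
# Two soil records share the first grid point
soil = pd.DataFrame({"latitude": [34.0, 34.0, 34.0],
                     "longitude": [-1.66, -1.66, -1.65],
                     "SAND": [40, 55, 30]})

merged = grid.merge(soil, on=KEYS, how="inner")
print(len(grid), "->", len(merged))  # the merge adds a row
print("Soil rows with repeated keys:", duplicate_key_rows(soil))

# validate= makes pandas raise instead of silently multiplying rows
try:
    grid.merge(soil, on=KEYS, how="inner", validate="one_to_one")
    keys_are_unique = True
except pd.errors.MergeError:
    keys_are_unique = False
print("Keys unique on both sides:", keys_are_unique)
```

Whether repeated soil rows should be kept, aggregated or reduced to one record per point depends on the modelling goal. The check only makes the choice explicit.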


## Encoding and Scaling


In [6]:
target_encode(
    csv_path="../data/Merged/merged.csv",
    categorical_cols=["GRIDCODE"],
    target_col="fire",
    output_path="../data/Merged/merged_scaled_land.csv",
)
one_hot_encode(
    csv_path="../data/Merged/merged_scaled_land.csv",
    categorical_cols=["TEXTURE_USDA"],
    label_col="fire",
    output_path="../data/Merged/merged_scaled_land_soil.csv",
)
scale_dataset(
    csv_path="../data/Merged/merged_scaled_land_soil.csv",
    output_path="../data/preprocessed/preprocessed_data.csv",
    exclude_cols=[
        "latitude", "longitude", "GRIDCODE",
        "TEXTURE_USDA_3.0", "TEXTURE_USDA_5.0", "TEXTURE_USDA_7.0", "TEXTURE_USDA_9.0",
        "TEXTURE_USDA_10.0", "TEXTURE_USDA_11.0", "TEXTURE_USDA_12.0", "fire",
    ],
)

Unnamed: 0,latitude,longitude,winter_tmax,spring_tmax,summer_tmax,autumn_tmax,winter_tmin,spring_tmin,summer_tmin,autumn_tmin,...,GYPSUM,ELEC_COND,TEXTURE_USDA_3.0,TEXTURE_USDA_5.0,TEXTURE_USDA_7.0,TEXTURE_USDA_9.0,TEXTURE_USDA_10.0,TEXTURE_USDA_11.0,TEXTURE_USDA_12.0,fire
0,34.000231,-1.663868,-0.272457,-0.045082,-0.320568,-0.708392,-1.404679,-1.074900,-1.117193,-1.353832,...,-0.301656,0.053444,False,False,False,True,False,False,False,1
1,34.000231,-1.663868,-0.272457,-0.045082,-0.320568,-0.708392,-1.404679,-1.074900,-1.117193,-1.353832,...,-0.337078,-0.252867,False,False,False,False,False,True,False,1
2,34.000231,-1.663868,-0.272457,-0.045082,-0.320568,-0.708392,-1.404679,-1.074900,-1.117193,-1.353832,...,0.099796,-0.252867,False,False,False,False,False,True,False,1
3,34.000231,-1.663868,-0.272457,-0.045082,-0.320568,-0.708392,-1.404679,-1.074900,-1.117193,-1.353832,...,0.430404,3.729170,False,False,False,True,False,False,False,1
4,34.000231,-1.653868,-0.272457,-0.045082,-0.320568,-0.708392,-1.404679,-1.074900,-1.117193,-1.353832,...,-0.301656,0.053444,False,False,False,True,False,False,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
845070,37.340231,9.736132,0.714187,-0.379340,-1.194352,0.380527,1.470538,0.407299,-0.063879,0.888018,...,-0.337078,-0.559177,False,False,False,True,False,False,False,0
845071,37.340231,9.746132,0.714187,-0.379340,-1.194352,0.380527,1.470538,0.407299,-0.063879,0.888018,...,-0.360693,-0.559177,False,False,False,False,False,False,True,0
845072,37.340231,9.746132,0.714187,-0.379340,-1.194352,0.380527,1.470538,0.407299,-0.063879,0.888018,...,-0.337078,-0.559177,False,False,False,True,False,False,False,0
845073,37.340231,9.756132,0.731199,-0.379340,-1.113447,0.416825,1.489706,0.407299,-0.063879,0.906098,...,-0.360693,-0.559177,False,False,False,False,False,False,True,0
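
The three transformations applied in this section can be sketched with pandas alone. The helpers below are hypothetical stand-ins, not the project's `target_encode`, `one_hot_encode` or `scale_dataset`: target encoding is shown as the mean of `fire` per category, and robust scaling as centring on the median and dividing by the interquartile range.

```python
import pandas as pd


def target_encode_mean(df, col, target):
    """Replace each category by the mean of the target within that category."""
    means = df.groupby(col)[target].mean()
    return df[col].map(means)


def robust_scale(df, cols):
    """Scale columns as (x - median) / IQR, leaving other columns untouched."""
    out = df.copy()
    for c in cols:
        median = out[c].median()
        iqr = out[c].quantile(0.75) - out[c].quantile(0.25)
        if iqr == 0:
            out[c] = out[c] - median  # constant-like column: centre only
        else:
            out[c] = (out[c] - median) / iqr
    return out


toy = pd.DataFrame({
    "GRIDCODE": [210, 210, 310, 310],
    "TEXTURE_USDA": [9.0, 11.0, 9.0, 11.0],
    "elevation": [100.0, 200.0, 300.0, 400.0],
    "fire": [1, 0, 0, 0],
})
toy["GRIDCODE"] = target_encode_mean(toy, "GRIDCODE", "fire")
toy = pd.get_dummies(toy, columns=["TEXTURE_USDA"])
toy = robust_scale(toy, ["elevation"])
print(toy)
```

One caveat with this form of target encoding: if the category means are computed on the full dataset, label information leaks into the feature, so the means are usually fitted on the training split only.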


## 📉 Feature Reduction Analysis

📊 Visualize feature variance

Sort features by variance to see how much each one varies (the output below lists them in ascending order)

🔎 Very low-variance features carry little information and can be removed. The scaled numeric features all have a variance of about 1, so this filter only discriminates among the unscaled columns (`GRIDCODE` and the `TEXTURE_USDA_*` indicators)

🔗 Check highly correlated features

Detect pairs with high correlation (above a chosen threshold, 0.95 here)

🔁 Highly correlated features carry redundant information, so one feature of each pair can usually be dropped

🌲 Random Forest Feature Selection

Apply a Random Forest model to rank feature importance

Keep only the top 35 features (`top_k=35`) for a cleaner and more efficient dataset

In [7]:
res = analyze_correlation_variance("../data/preprocessed/preprocessed_data.csv", target_col="fire", corr_threshold=0.95)

print("Correlated pairs:")
for p in res["correlated_pairs"]:
    print(p)

print("\nFeature variances:")
print(res["variances"])


Correlated pairs:
('autumn_tmin', 'spring_tmin', np.float64(0.9652335310265426))
('CLAY', 'REF_BULK', np.float64(0.9674397679430973))
('CEC_EFF', 'TEB', np.float64(0.9754403797976567))

Feature variances:
TEXTURE_USDA_7.0     0.001167
TEXTURE_USDA_12.0    0.001483
GRIDCODE             0.005641
TEXTURE_USDA_10.0    0.025143
TEXTURE_USDA_3.0     0.032848
TEXTURE_USDA_11.0    0.182933
TEXTURE_USDA_5.0     0.193670
TEXTURE_USDA_9.0     0.245624
TEB                  1.000001
GYPSUM               1.000001
summer_tmin          1.000001
winter_tmax          1.000001
CEC_EFF              1.000001
winter_tmin          1.000001
spring_tmin          1.000001
summer_tmax          1.000001
CN_RATIO             1.000001
SILT                 1.000001
ESP                  1.000001
PH_WATER             1.000001
CLAY                 1.000001
ORG_CARBON           1.000001
autumn_prec          1.000001
BSAT                 1.000001
spring_tmax          1.000001
ALUM_SAT             1.000001
TCARBON_EQ     
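
The two diagnostics printed above, per-feature variance and pairs whose absolute correlation exceeds a threshold, can be reproduced with plain pandas. This is a sketch with hypothetical helper names and toy data. The real `analyze_correlation_variance` may compute or report them differently.

```python
import pandas as pd


def feature_variances(df, target_col):
    """Variance of every feature column, smallest first."""
    return df.drop(columns=[target_col]).var().sort_values()


def correlated_pairs(df, target_col, threshold=0.95):
    """List (feature_a, feature_b, |corr|) for pairs above the threshold."""
    corr = df.drop(columns=[target_col]).corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Visit only the upper triangle so that each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],   # almost a multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],
    "fire": [0, 1, 0, 1],
})
variances = feature_variances(toy, "fire")
pairs = correlated_pairs(toy, "fire", threshold=0.95)
print(variances)
print(pairs)
```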

In [1]:
# reduce_features must be imported in the running kernel
# (re-run the import cell after a kernel restart)
from scripts.dataPreprocessing.featureReduction import reduce_features

reduced = reduce_features(
    "../data/preprocessed/preprocessed_data.csv",
    output_path="../data/preprocessed/preprocessed_reduced_data.csv",
    target_col="fire",
    var_threshold=0.01,
    corr_threshold=0.95,
    importance_method="RF",
    top_k=35
)

print("Selected features:", reduced["selected_features"])

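
Ranking features with a Random Forest and keeping the `top_k` most important ones can be sketched with scikit-learn as follows. The helper `top_k_features`, the synthetic data and the hyperparameters are assumptions for illustration. Nothing here is taken from the implementation of `reduce_features`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def top_k_features(df, target_col, top_k, random_state=0):
    """Return the top_k feature names ranked by Random Forest importance."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(top_k).index.tolist()


# Synthetic data: the label depends on "signal" only
rng = np.random.default_rng(0)
signal = rng.normal(size=300)
toy = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=300),
    "fire": (signal > 0).astype(int),
})
selected = top_k_features(toy, "fire", top_k=1)
print("Selected features:", selected)
```

Impurity-based importances tend to favour features with many distinct values, so a ranking like this is best read as a rough guide rather than a definitive ordering.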

In [5]:
fired_pourcentage('../data/preprocessed/preprocessed_reduced_data.csv', label_column_name='fire')


## 📊 Dataset Statistics

**Total Lines (Rows):** **845075**

**Total Features:** **35**

**Label Column Name:** `fire`

### Label Distribution (Classes 0 and 1)

**Label 0 Count:** **782695** (Non-Fire)

**Label 0 Percentage:** **92.62%**

**Label 1 Count:** **62380** (Fire)

**Label 1 Percentage:** **7.38%**
