# 🧪 Work Overview

In this work, we will:

🧼 Clean and preprocess multiple datasets (elevation, soil, climate, etc.)

🔗 Merge them into a single unified dataset

🔍 Run tests to check whether feature reduction is possible

In [2]:
from scripts.dataMerging.combineDatasets import extract_features_elevation , extract_features_landcover , extract_features_yearly_monthly_clim , extract_features_soil , organize_climat_files
from scripts.dataMerging.mergeDataSources import progressive_merge
from scripts.dataMerging.generateGrid import generate_grid_in_shape
from scripts.dataPreprocessing.dataCleaning import process_fire_data , treat_sensor_errors_soil , impute_with_geo_zones , duplicate_analysis
from scripts.dataPreprocessing.scalingEncoding import one_hot_encode , target_encode , scale_dataset
from scripts.statistics.firePerSeason import calculate_seasonal_fire_percentage
from scripts.dataPreprocessing.featureReduction import analyze_correlation_variance , supervised_feature_reduction , unsupervised_feature_reduction
from scripts.statistics.firePourcentage import fired_pourcentage

## 🗺️ Reference Grid

📐 Creates a reference grid with consistent latitude and longitude

🔗 Ensures all datasets align and can be merged correctly

In [21]:

# Step 1: Generate the grid (only needs to run once)
grid_df = generate_grid_in_shape(
    "../data/shapefiles/combined/north/alg_tun_north.shp",
    resolution=0.01,  # grid step in degrees, roughly 1 km
    output_csv="../data/features/grid_points.csv",
    min_latitude=34,
    max_latitude=37.5,
)



📂 Loading shapefile and reprojecting to EPSG:4326...
🗺️ Bounding box (lon/lat): [-8.67386818 18.96023083 11.98736715 37.55986   ]
📏 Grid candidate size: 2067 × 1860 = 3,844,620 points
⬆️ Applied min_latitude=34: 3,844,620 -> 735,852
⬇️ Applied max_latitude=37.5: 735,852 -> 723,450
🔍 Filtering points inside region using spatial join...
✅ 330,281 points inside shapefile after spatial join
💾 Saved grid to ../data/features/grid_points.csv
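
A regular grid like the one above can be sketched with NumPy. This is only an illustration and not the implementation of `generate_grid_in_shape`: the helper `make_grid` and its parameters are hypothetical, and the shapefile filtering (spatial join) is left out. Coordinates are rounded so that later merges on `latitude` and `longitude` compare identical float values.

```python
import numpy as np
import pandas as pd


def make_grid(min_lon, max_lon, min_lat, max_lat, resolution):
    """Return regularly spaced (latitude, longitude) points covering a bounding box.

    Hypothetical helper for illustration; it does not clip to a shapefile.
    """
    # Number of grid steps that fit in each direction (epsilon guards float error).
    n_lon = int(np.floor((max_lon - min_lon) / resolution + 1e-9)) + 1
    n_lat = int(np.floor((max_lat - min_lat) / resolution + 1e-9)) + 1
    # Round so every dataset built on this grid shares identical key values.
    lons = np.round(min_lon + resolution * np.arange(n_lon), 6)
    lats = np.round(min_lat + resolution * np.arange(n_lat), 6)
    lon_grid, lat_grid = np.meshgrid(lons, lats)
    return pd.DataFrame(
        {"latitude": lat_grid.ravel(), "longitude": lon_grid.ravel()}
    )


# Toy bounding box: 6 longitudes x 3 latitudes.
grid = make_grid(min_lon=0.0, max_lon=0.5, min_lat=34.0, max_lat=34.2, resolution=0.1)
print(len(grid))
```

Rounding the coordinates matters because every later merge relies on exact equality of the two key columns.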


## 🔥 Extract Nearest Points (cKDTree)

🌳 Use cKDTree to find the nearest grid point for each fire record

📍 Matches fire locations to the reference grid efficiently

⚡ Fast nearest-neighbor search for large datasets

In [9]:

# Define the paths and parameters
GRID_FILE = "../data/features/grid_points.csv"
FIRE_FILE = "../data/fire_dataset/viirs-jpss1_alg_Tun.csv"
TARGET_FIRE_TYPE = 0 

process_fire_data(
    grid_path=GRID_FILE,
    fire_path=FIRE_FILE,
    target_type=TARGET_FIRE_TYPE,
    output_file="../data/features_cleaned/grid_fire_clean.csv"
)


✅ Saved 330281 grid points with fire + year info to ../data/features_cleaned/grid_fire_clean.csv
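
The matching step can be sketched with `scipy.spatial.cKDTree`. This is a minimal example under stated assumptions, not the body of `process_fire_data`: the coordinates are invented, distances are plain Euclidean distances in degrees (an approximation at this scale), and marking a grid point as burned when any fire maps to it is an assumed labeling rule.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy reference grid and fire detections as (latitude, longitude) pairs.
grid = np.array([[34.00, -1.66], [34.00, -1.65], [34.01, -1.66]])
fires = np.array([[34.001, -1.651], [34.009, -1.662]])

# Build the tree once on the grid, then query all fire points in one call.
tree = cKDTree(grid)
dist, idx = tree.query(fires, k=1)  # idx[i] = index of the grid point nearest to fire i

# Assumed labeling rule: a grid point is "fire" if at least one detection maps to it.
has_fire = np.zeros(len(grid), dtype=int)
has_fire[idx] = 1
print(idx, has_fire)
```

Building the tree on the grid rather than on the fire records keeps a single index that can be reused for any other point dataset that has to be snapped to the same grid.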


## ☁️ Climate Dataset

❄️ Extract seasonal data (winter, spring, summer, autumn)

🛠️ Preprocess by filling missing values with the median, applied at regional resolution

📏 Scale features using a Robust Scaler

In [10]:

# Organize the files
tmax_data = organize_climat_files(
    "../data/climate_dataset/5min/max/*.tif"
)
tmin_data = organize_climat_files(
    "../data/climate_dataset/5min/min/*.tif"
)
prec_data = organize_climat_files(
    "../data/climate_dataset/5min/prec/*.tif"
)
print(tmax_data)

extract_features_yearly_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    fire_csv = "../data/features_cleaned/grid_fire_clean.csv",
    raster_dict=tmax_data,
    output_path="../data/features/grid_tmax.csv",
    col_name="tmax",
)


extract_features_yearly_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    fire_csv = "../data/features_cleaned/grid_fire_clean.csv",
    raster_dict=tmin_data,
    output_path="../data/features/grid_tmin.csv",
    col_name="tmin",
)


extract_features_yearly_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    fire_csv = "../data/features_cleaned/grid_fire_clean.csv",
    raster_dict=prec_data,
    output_path="../data/features/grid_tprec.csv",
    col_name="prec",
)

{'2018-01': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-01.tif', '2018-02': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-02.tif', '2018-03': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-03.tif', '2018-04': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-04.tif', '2018-05': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-05.tif', '2018-06': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-06.tif', '2018-07': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-07.tif', '2018-08': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-08.tif', '2018-09': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-09.tif', '2018-10': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-10.tif', '2018-11': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-11.tif', '2018-12': '../data/climate_dataset/5min/max\\wc2.1_cruts4.09_5m_tmax_2018-

| | latitude | longitude | year | winter_prec | spring_prec | summer_prec | autumn_prec |
|---|---|---|---|---|---|---|---|
| 0 | 34.000231 | -1.663868 | 2018 | 36.250000 | 61.550003 | 7.525 | 45.674999 |
| 1 | 34.000231 | -1.653868 | 2024 | 21.650002 | 13.600000 | 10.100 | 16.500000 |
| 2 | 34.000231 | -1.643868 | 2024 | 21.650002 | 13.600000 | 10.100 | 16.500000 |
| 3 | 34.000231 | -1.633868 | 2024 | 21.650002 | 13.600000 | 10.100 | 16.500000 |
| 4 | 34.000231 | -1.623868 | 2024 | 21.650002 | 13.600000 | 10.100 | 16.500000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 330276 | 37.330231 | 9.846132 | 2024 | | | | |
| 330277 | 37.330231 | 9.856132 | 2024 | | | | |
| 330278 | 37.340231 | 9.736132 | 2024 | | | | |
| 330279 | 37.340231 | 9.746132 | 2024 | | | | |
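
Seasonal columns such as `winter_tmax` can be derived from monthly values by mapping each month to a season and aggregating. The sketch below assumes meteorological seasons (December to February as winter) and a mean over the three months; neither choice is confirmed by the source, and the monthly values are invented.

```python
import pandas as pd

# Assumed month-to-season mapping (meteorological seasons).
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Invented monthly maximum temperatures for a single grid point.
monthly = pd.DataFrame({
    "month": range(1, 13),
    "tmax": [10, 12, 15, 18, 22, 27, 31, 32, 28, 22, 16, 11],
})

# Label every month with its season, then average within each season.
monthly["season"] = monthly["month"].map(SEASON_OF_MONTH)
seasonal = monthly.groupby("season")["tmax"].mean()

# Flatten to the column layout used in the notebook, e.g. "winter_tmax".
row = {f"{season}_tmax": value for season, value in seasonal.items()}
print(row)
```

For precipitation a sum may be more natural than a mean; which one the pipeline uses is not stated here.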


In [12]:
calculate_seasonal_fire_percentage('../data/fire_dataset/viirs-jpss1_alg_Tun.csv')


| | Season | Count | Percentage |
|---|---|---|---|
| 3 | Winter | 120557 | 18.57% |
| 2 | Spring | 151495 | 23.33% |
| 0 | Summer | 203227 | 31.3% |
| 1 | Autumn/Fall | 174029 | 26.8% |


### 📊 Seasonal Fire Distribution
🔥 Fires occur in every season: the shares range from 18.57% in winter to 31.3% in summer, so no season is negligible

⚠️ Therefore, dropping any season's climate data is not advisable
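
A seasonal breakdown of this kind can be computed from the detection dates. The sketch below is not the code of `calculate_seasonal_fire_percentage`: the column name `acq_date`, the month-to-season mapping, and the four sample dates are assumptions made for illustration.

```python
import pandas as pd

# Assumed month-to-season mapping (meteorological seasons).
SEASON_OF_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn/Fall", 10: "Autumn/Fall", 11: "Autumn/Fall",
}

# Invented detections; `acq_date` is an assumed column name.
fires = pd.DataFrame(
    {"acq_date": ["2020-01-15", "2020-07-01", "2020-07-20", "2020-10-05"]}
)

# Parse the dates, map each month to a season, and compute the share per season.
seasons = pd.to_datetime(fires["acq_date"]).dt.month.map(SEASON_OF_MONTH)
counts = seasons.value_counts()
percentage = (100 * counts / counts.sum()).round(2)
print(pd.DataFrame({"Count": counts, "Percentage": percentage}))
```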

In [13]:

impute_with_geo_zones("../data/features/grid_tmax.csv", base_res=0.05, min_points=10, max_res=0.5, output_path="../data/features_cleaned/grid_tmax_clean.csv")

impute_with_geo_zones("../data/features/grid_tmin.csv", base_res=0.05, min_points=10, max_res=0.5, output_path="../data/features_cleaned/grid_tmin_clean.csv")

impute_with_geo_zones("../data/features/grid_tprec.csv", base_res=0.05, min_points=10, max_res=0.5, output_path="../data/features_cleaned/grid_prec_clean.csv")


Missing values (percent) per column :
winter_tmax    0.687596
spring_tmax    0.687596
summer_tmax    0.687596
autumn_tmax    0.687596
dtype: float64

=== Imputing column: winter_tmax ===
winter_tmax: imputation done using geo-zones.

=== Imputing column: spring_tmax ===
spring_tmax: imputation done using geo-zones.

=== Imputing column: summer_tmax ===
summer_tmax: imputation done using geo-zones.

=== Imputing column: autumn_tmax ===
autumn_tmax: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_tmax_clean.csv
Missing values (percent) per column :
winter_tmin    0.687596
spring_tmin    0.687596
summer_tmin    0.687596
autumn_tmin    0.687596
dtype: float64

=== Imputing column: winter_tmin ===
winter_tmin: imputation done using geo-zones.

=== Imputing column: spring_tmin ===
spring_tmin: imputation done using geo-zones.

=== Imputing column: summer_tmin ===
summer_tmin: imputation done using geo-zones.

=== Imputing column: autumn_tmin ===
autumn
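
The idea behind zone-based imputation can be sketched as follows. This is an assumed reading of the `base_res`, `min_points`, and `max_res` parameters, not the actual code of `impute_with_geo_zones`: points are binned into square lat/lon zones, a missing value takes the median of its zone when the zone holds enough observed values, and the zone size is doubled for anything still missing. The helper name `impute_by_zone` is hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_zone(df, col, base_res=0.05, max_res=0.5, min_points=10):
    """Fill NaNs in `col` with the median of the surrounding lat/lon zone (sketch)."""
    out = df.copy()
    res = base_res
    while out[col].isna().any() and res <= max_res:
        # Assign every point to a square zone with a side of `res` degrees.
        zone_lat = np.floor(out["latitude"] / res).rename("zone_lat")
        zone_lon = np.floor(out["longitude"] / res).rename("zone_lon")
        grouped = out.groupby([zone_lat, zone_lon])[col]
        zone_median = grouped.transform("median")
        zone_count = grouped.transform("count")  # observed (non-missing) values per zone
        # Only use zones that contain enough observed values.
        out[col] = out[col].fillna(zone_median.where(zone_count >= min_points))
        res *= 2  # widen the zones for values that are still missing
    return out


# Toy data: four neighbouring points, one of them missing.
points = pd.DataFrame({
    "latitude": [34.01, 34.02, 34.03, 34.04],
    "longitude": [1.01, 1.01, 1.01, 1.01],
    "winter_tmax": [10.0, 20.0, 30.0, np.nan],
})
filled = impute_by_zone(points, "winter_tmax", min_points=3)
print(filled)
```

In this sketch, values that cannot be filled before `max_res` is reached simply stay missing; a categorical column such as `GRIDCODE` would need the most frequent value instead of the median.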

## 🌳 Landcover Dataset

🌱 Extract landcover values from the reference grid

🛠️ Preprocess by handling missing values using the median, applied at regional resolution

📏 Scale features using a Robust Scaler

✅ Only the GRIDCODE feature is kept

In [26]:
extract_features_landcover(
    csv_path="../data/features/grid_points.csv",
    shapefile_path="../data/land_dataset/combined/alg_tun_landcvr.shp",
    lat_col="latitude",
    lon_col="longitude",
    keep_cols=["GRIDCODE"],  # can be ["GRIDCODE", "CLASS", "AREA", ...]
    output_path="../data/features/grid_landcover.csv",
)


In [3]:
impute_with_geo_zones("../data/features/grid_landcover.csv", cat_cols=["GRIDCODE"], base_res=0.05, min_points=10 ,max_res=0.2, output_path="../data/features_cleaned/grid_landcover_clean.csv")

Missing values (percent) per column :
GRIDCODE    0.057224
dtype: float64

=== Imputing column: GRIDCODE ===
GRIDCODE: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_landcover_clean.csv


## 🌱 Soil Dataset

🧱 Extract soil features from the reference grid

🛠️ Preprocess missing and invalid data

Rows with negative values (likely sensor errors) are treated as missing

Apply regional resolution

🎨 Feature selection & encoding

TEXTURE_SOTER and TEXTURE_USDA have the same meaning

Keep only TEXTURE_USDA (more detailed)

Apply One-Hot Encoding to TEXTURE_USDA

📏 Scale features using a Robust Scaler

In [28]:
extract_features_soil(
    csv_path="../data/features/grid_points.csv",
    raster_path="../data/soil_dataset/original/HWSD2_RASTER/HWSD2.bil",
    soil_attributes_csv="../data/soil_dataset/simplified/D1_soil_features_alg_tun.csv",
    output_soil_ids="../data/features/fire_soil_ids.csv",
    output_soil_feature="../data/features/grid_soil.csv",
)

In [29]:
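import pandas as pd

# Drop TEXTURE_SOTER (redundant with TEXTURE_USDA) and write the file back,
# because the next cells re-read grid_soil.csv from disk.
soil_path = "../data/features/grid_soil.csv"
df = pd.read_csv(soil_path)
if "TEXTURE_SOTER" in df.columns:
    df.drop(columns=["TEXTURE_SOTER"]).to_csv(soil_path, index=False)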

In [None]:
treat_sensor_errors_soil("../data/features/grid_soil.csv",output_path="../data/features/grid_soil_treated.csv")

✔ Cleaning complete!
  Deleted rows : 0
  Fixed rows   : 2936
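
Treating negative readings as missing can be sketched with a boolean mask. The example below is an assumption about what `treat_sensor_errors_soil` does, based only on the description above; the values are invented.

```python
import pandas as pd

# Invented soil rows; negative numbers stand in for invalid readings.
soil = pd.DataFrame({
    "SAND": [40.0, -9.0, 55.0],
    "CLAY": [20.0, 30.0, -9.0],
    "TEXTURE_USDA": [9.0, 11.0, 12.0],
})
numeric_cols = ["SAND", "CLAY"]

# Flag negative entries, count the affected rows, then blank the entries out.
negative = soil[numeric_cols] < 0
fixed_rows = int(negative.any(axis=1).sum())
soil[numeric_cols] = soil[numeric_cols].mask(negative)  # negatives become NaN

print(f"Fixed rows: {fixed_rows}")
print(soil)
```

Keeping the rows and blanking only the invalid cells lets the later zone-based imputation fill them in, which matches the "Deleted rows : 0" line in the output above.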


In [31]:


CATEGORICAL_COLS_SOIL = ["TEXTURE_USDA"]  # categorical columns
NUMERIC_COLS_SOIL = [
    "COARSE", "SAND", "SILT", "CLAY", "BULK", "REF_BULK", "ORG_CARBON", "PH_WATER",
    "TOTAL_N", "CN_RATIO", "CEC_SOIL", "CEC_CLAY", "CEC_EFF", "TEB", "BSAT",
    "ALUM_SAT", "ESP", "TCARBON_EQ", "GYPSUM", "ELEC_COND"
]  # numeric columns

# Impute missing soil values (numeric and categorical) within geo-zones
soil_cleaned = impute_with_geo_zones("../data/features/grid_soil_treated.csv", num_cols=NUMERIC_COLS_SOIL, cat_cols=CATEGORICAL_COLS_SOIL, base_res=0.05, min_points=10, max_res=0.5, output_path="../data/features_cleaned/grid_soil_clean.csv")


Missing values (percent) per column :
COARSE          0.349436
SAND            0.349436
SILT            0.349436
CLAY            0.349436
TEXTURE_USDA    0.349436
BULK            0.349436
REF_BULK        0.349436
ORG_CARBON      0.349436
PH_WATER        0.349436
TOTAL_N         0.349436
CN_RATIO        0.349436
CEC_SOIL        0.349436
CEC_CLAY        0.349436
CEC_EFF         0.349436
TEB             0.349436
BSAT            0.349436
ALUM_SAT        0.349436
ESP             0.349436
TCARBON_EQ      0.349436
GYPSUM          0.349436
ELEC_COND       0.349436
dtype: float64

=== Imputing column: COARSE ===
COARSE: imputation done using geo-zones.

=== Imputing column: SAND ===
SAND: imputation done using geo-zones.

=== Imputing column: SILT ===
SILT: imputation done using geo-zones.

=== Imputing column: CLAY ===
CLAY: imputation done using geo-zones.

=== Imputing column: TEXTURE_USDA ===
TEXTURE_USDA: imputation done using geo-zones.

=== Imputing column: BULK ===
BULK: imputation done

## 🏔️ Elevation Dataset

🗻 Extract elevation values from the reference grid

🛠️ Preprocess by handling missing values using the median

🌍 Apply regional resolution if needed

📏 Scale features using a Robust Scaler

In [33]:
fires_with_elevation = extract_features_elevation(
    raster_path="../data/elevation_dataset/simplified/elevation_clipped.tif",
    fire_csv_path="../data/features/grid_points.csv",
    output_csv="../data/features/grid_elevation.csv",
    value_name="elevation",
)


Loaded 330281 points from ../data/features/grid_points.csv


Extracting elevation: 100%|██████████| 330281/330281 [01:51<00:00, 2964.18it/s]


✅ Saved extracted elevation to ../data/features/grid_elevation.csv


In [None]:
impute_with_geo_zones("../data/features/grid_elevation.csv", base_res=0.05, min_points=10 ,max_res=0.5, output_path="../data/features_cleaned/grid_elevation_clean.csv")


Missing values (percent) per column :
Series([], dtype: float64)
💾 Saved imputation to ../data/features_cleaned/grid_elevation_clean.csv
Saved preprocessed dataset → ../data/preprocessed/elevation_preprocessed.csv


## 📊 Merging Preprocessed Datasets
Merge all preprocessed datasets on the shared key columns `latitude` and `longitude` to obtain a single, unified dataset for analysis.

In [2]:
csv_list= ["../data/features_cleaned/grid_tmax_clean.csv", "../data/features_cleaned/grid_tmin_clean.csv","../data/features_cleaned/grid_prec_clean.csv",  "../data/features_cleaned/grid_landcover_clean.csv" , "../data/features_cleaned/grid_elevation_clean.csv" , "../data/features_cleaned/grid_soil_clean.csv","../data/features_cleaned/grid_fire_clean.csv"]
temp_df = progressive_merge(
    csv_list,
    on=["latitude", "longitude"],
    how="inner",
    output_path="../data/Merged/merged.csv"
)


Loading first CSV: ../data/features_cleaned/grid_tmax_clean.csv
🗑️ Dropped column 'year' from first CSV
🔁 Merging file 2/7: ../data/features_cleaned/grid_tmin_clean.csv
🗑️ Dropped column 'year' from ../data/features_cleaned/grid_tmin_clean.csv
✅ Intermediate merged size: (330281, 10)
🔁 Merging file 3/7: ../data/features_cleaned/grid_prec_clean.csv
🗑️ Dropped column 'year' from ../data/features_cleaned/grid_prec_clean.csv
✅ Intermediate merged size: (330281, 14)
🔁 Merging file 4/7: ../data/features_cleaned/grid_landcover_clean.csv
✅ Intermediate merged size: (330281, 15)
🔁 Merging file 5/7: ../data/features_cleaned/grid_elevation_clean.csv
✅ Intermediate merged size: (330281, 16)
🔁 Merging file 6/7: ../data/features_cleaned/grid_soil_clean.csv
✅ Intermediate merged size: (845075, 37)
🔁 Merging file 7/7: ../data/features_cleaned/grid_fire_clean.csv
🗑️ Dropped column 'year' from ../data/features_cleaned/grid_fire_clean.csv
✅ Intermediate m
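
In the log above, the row count jumps from 330281 to 845075 when the soil file is merged. With an inner join this can only happen when the file being merged contains several rows per (`latitude`, `longitude`) key. The sketch below reproduces the effect on invented data and shows how pandas' `validate` argument can flag it; it is an illustration, not the code of `progressive_merge`.

```python
from functools import reduce

import pandas as pd

KEYS = ["latitude", "longitude"]

# Invented inputs: `soil` holds two rows for the first grid point.
climate = pd.DataFrame({
    "latitude": [34.0, 34.0],
    "longitude": [-1.66, -1.65],
    "winter_tmax": [14.0, 14.5],
})
soil = pd.DataFrame({
    "latitude": [34.0, 34.0, 34.0],
    "longitude": [-1.66, -1.66, -1.65],
    "SAND": [40.0, 55.0, 38.0],
})

# Progressive inner merge over a list of frames.
merged = reduce(lambda left, right: left.merge(right, on=KEYS, how="inner"),
                [climate, soil])

# Rows whose key already appeared earlier in the same file.
duplicate_keys = int(soil.duplicated(subset=KEYS).sum())

# `validate` raises MergeError when the keys are not unique on both sides.
try:
    climate.merge(soil, on=KEYS, how="inner", validate="one_to_one")
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False

print(len(merged), duplicate_keys, one_to_one)
```

Checking key uniqueness before merging makes the row inflation visible at the point where it is introduced, rather than leaving it to the duplicate analysis that follows.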

## 🗑️ Delete Duplicated Rows


In [2]:
# --- Analysis 1: Initial Check and Cleaning ---
print("--- 🔬 Initial Duplicate Analysis (merged.csv) ---")
stats = duplicate_analysis(
    "../data/Merged/merged.csv",
    ignore_columns=["latitude", "longitude"],
    delete_duplicates=True,
    output_clean_path="../data/Merged/merged_unique.csv"
)

print(f"❌ Duplicate Rows Found: {stats['duplicate_rows']}")
print(f"➗ Duplicate Percentage: {stats['duplicate_percentage']:.2f}%")
print("📋 Sample of Duplicated Rows:")
print(stats["duplicated_sample"])

print("\n" + "="*50 + "\n")

# --- Analysis 2: Verification on Cleaned Data ---
print("--- ✅ Verification Analysis (merged_unique.csv) ---")
stats = duplicate_analysis(
    "../data/Merged/merged_unique.csv",
    ignore_columns=["latitude", "longitude"]
)

print(f"❌ Duplicate Rows Found: {stats['duplicate_rows']}")
print(f"➗ Duplicate Percentage: {stats['duplicate_percentage']:.2f}%")
print("📋 Sample of Duplicated Rows:")
print(stats["duplicated_sample"])


--- 🔬 Initial Duplicate Analysis (merged.csv) ---
❌ Duplicate Rows Found: 181681
➗ Duplicate Percentage: 21.50%
📋 Sample of Duplicated Rows:
     latitude  longitude  winter_tmax  spring_tmax  summer_tmax  autumn_tmax  \
28  34.000231  -1.593868         14.0         23.0         35.5        23.50   
29  34.000231  -1.593868         14.0         23.0         35.5        23.50   
30  34.000231  -1.593868         14.0         23.0         35.5        23.50   
31  34.000231  -1.593868         14.0         23.0         35.5        23.50   
44  34.000231  -1.553868         14.0         23.0         35.5        23.25   
45  34.000231  -1.553868         14.0         23.0         35.5        23.25   
46  34.000231  -1.553868         14.0         23.0         35.5        23.25   
47  34.000231  -1.553868         14.0         23.0         35.5        23.25   
52  34.000231  -1.533868         14.0         23.0         35.5        23.25   
53  34.000231  -1.533868         14.0         23.
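
Duplicate detection that ignores some columns can be sketched with `DataFrame.duplicated`. How `duplicate_analysis` counts duplicates is not shown in this notebook; the version below is an assumption that compares rows on every column not listed in `ignore_columns` and keeps the first occurrence. The data are invented.

```python
import pandas as pd

# Invented rows: the first two differ only in columns that are ignored or identical.
df = pd.DataFrame({
    "latitude": [34.0, 34.0, 34.01],
    "longitude": [-1.59, -1.59, -1.59],
    "winter_tmax": [14.0, 14.0, 14.0],
    "fire": [1, 1, 0],
})
ignore_columns = ["latitude", "longitude"]

# Compare rows only on the remaining columns.
compare_cols = [c for c in df.columns if c not in ignore_columns]
is_duplicate = df.duplicated(subset=compare_cols, keep="first")

duplicate_rows = int(is_duplicate.sum())
duplicate_percentage = 100 * duplicate_rows / len(df)
unique_df = df[~is_duplicate].reset_index(drop=True)

print(duplicate_rows, f"{duplicate_percentage:.2f}%")
```

Because the coordinates are ignored, two different grid points with identical feature values also count as duplicates under this rule.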

## Encoding and Scaling


In [5]:
one_hot_encode(csv_path="../data/Merged/merged_unique.csv", categorical_cols=["GRIDCODE"],label_col ="fire", output_path="../data/Merged/merged_scaled_land.csv")
one_hot_encode(csv_path ="../data/Merged/merged_scaled_land.csv", categorical_cols=["TEXTURE_USDA"],label_col ="fire",output_path ="../data/Merged/merged_scaled_land_soil.csv")
scale_dataset(csv_path="../data/Merged/merged_scaled_land_soil.csv", output_path="../data/preprocessed/preprocessed_data.csv", exclude_cols=["latitude","longitude","GRIDCODE","TEXTURE_USDA_3.0","TEXTURE_USDA_5.0","TEXTURE_USDA_7.0","TEXTURE_USDA_9.0","TEXTURE_USDA_10.0","TEXTURE_USDA_11.0","TEXTURE_USDA_12.0","fire"])

| | latitude | longitude | winter_tmax | spring_tmax | summer_tmax | autumn_tmax | winter_tmin | spring_tmin | summer_tmin | autumn_tmin | ... | GRIDCODE_203.0 | GRIDCODE_210.0 | TEXTURE_USDA_3.0 | TEXTURE_USDA_5.0 | TEXTURE_USDA_7.0 | TEXTURE_USDA_9.0 | TEXTURE_USDA_10.0 | TEXTURE_USDA_11.0 | TEXTURE_USDA_12.0 | fire |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.000231 | -1.663868 | -1.224973 | -0.810936 | -0.599399 | -1.487380 | -2.377942 | -1.346584 | -1.913587 | -2.258266 | ... | False | False | False | False | False | True | False | False | False | 1 |
| 1 | 34.000231 | -1.663868 | -1.224973 | -0.810936 | -0.599399 | -1.487380 | -2.377942 | -1.346584 | -1.913587 | -2.258266 | ... | False | False | False | False | False | False | False | True | False | 1 |
| 2 | 34.000231 | -1.663868 | -1.224973 | -0.810936 | -0.599399 | -1.487380 | -2.377942 | -1.346584 | -1.913587 | -2.258266 | ... | False | False | False | False | False | False | False | True | False | 1 |
| 3 | 34.000231 | -1.663868 | -1.224973 | -0.810936 | -0.599399 | -1.487380 | -2.377942 | -1.346584 | -1.913587 | -2.258266 | ... | False | False | False | False | False | True | False | False | False | 1 |
| 4 | 34.000231 | -1.653868 | -0.361393 | 0.450169 | 0.050300 | -0.856067 | -1.419762 | -0.880163 | -1.061501 | -1.422711 | ... | False | False | False | False | False | True | False | False | False | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 735478 | 37.340231 | 9.736132 | 0.826029 | -0.390567 | -1.249098 | 0.406561 | 1.574548 | 0.519102 | 0.216628 | 1.083953 | ... | False | False | False | False | False | True | False | False | False | 0 |
| 735479 | 37.340231 | 9.746132 | 0.826029 | -0.390567 | -1.249098 | 0.406561 | 1.574548 | 0.519102 | 0.216628 | 1.083953 | ... | False | False | False | False | False | False | False | False | True | 0 |
| 735480 | 37.340231 | 9.746132 | 0.826029 | -0.390567 | -1.249098 | 0.406561 | 1.574548 | 0.519102 | 0.216628 | 1.083953 | ... | False | False | False | False | False | True | False | False | False | 0 |
| 735481 | 37.340231 | 9.756132 | 0.880002 | -0.390567 | -0.924249 | 0.511780 | 1.574548 | 0.519102 | 0.216628 | 1.083953 | ... | False | True | False | False | False | False | False | False | True | 0 |
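
One-hot encoding followed by robust scaling can be sketched with pandas alone. This is not the code of `one_hot_encode` or `scale_dataset`: it assumes that "robust" means centering on the median and dividing by the interquartile range, and the data are invented.

```python
import pandas as pd

# Invented rows with one numeric feature, one categorical code, and the label.
df = pd.DataFrame({
    "elevation": [100.0, 200.0, 300.0, 400.0, 1000.0],
    "TEXTURE_USDA": [9.0, 11.0, 9.0, 12.0, 9.0],
    "fire": [1, 0, 0, 1, 0],
})

# One indicator column per category, e.g. "TEXTURE_USDA_9.0".
encoded = pd.get_dummies(df, columns=["TEXTURE_USDA"])

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = encoded["elevation"].quantile([0.25, 0.5, 0.75])
encoded["elevation"] = (encoded["elevation"] - median) / (q3 - q1)

print(encoded)
```

The indicator columns and the label are left unscaled, which mirrors the `exclude_cols` argument passed to `scale_dataset` above. Using the median and interquartile range limits the influence of the outlier at 1000 on the scale of the other values.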


## 📉 Feature Reduction Analysis

📊 Visualize feature variance

Sort features by descending variance to identify those that contribute most

🔎 Very low-variance features carry little information and can be removed

🔗 Check highly correlated features

Detect pairs with high correlation (above a chosen threshold)

🔁 Highly correlated features carry redundant information, so one feature of each pair can be dropped

🌲 Random Forest Feature Selection

Apply a Random Forest model to rank feature importance

Keep only the top 40 features for a cleaner and more efficient dataset

In [7]:
res = analyze_correlation_variance("../data/preprocessed/preprocessed_data.csv", target_col="fire", corr_threshold=0.95)

print("Correlated pairs:")
for p in res["correlated_pairs"]:
    print(p)

print("\nFeature variances:")
print(res["variances"])


Correlated pairs:
('CLAY', 'REF_BULK', np.float64(0.9676779157918374))
('CEC_EFF', 'TEB', np.float64(0.9739908779021232))

Feature variances:
GRIDCODE_41.0     0.000007
GRIDCODE_170.0    0.000010
GRIDCODE_203.0    0.000019
GRIDCODE_100.0    0.000065
GRIDCODE_16.0     0.000451
                    ...   
ELEC_COND         1.000001
TOTAL_N           1.000001
SAND              1.000001
PH_WATER          1.000001
spring_tmin       1.000001
Length: 61, dtype: float64
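
The variance and correlation checks can be sketched with pandas. The thresholds mirror the values used in this notebook (0.01 and 0.95), but the code is an illustration on synthetic data and not the body of `analyze_correlation_variance`.

```python
import numpy as np
import pandas as pd

# Synthetic features: `b` is almost a multiple of `a`, `const` never changes.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
    "const": 0.0,
})

# 1) Variance filter: list features whose variance is below the threshold.
variances = df.var().sort_values()
low_variance = variances[variances < 0.01].index.tolist()

# 2) Correlation filter: keep the upper triangle so each pair is reported once.
corr = df.drop(columns=low_variance).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated_pairs = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.95
]

print(low_variance)
print(correlated_pairs)
```

Note that the one-hot columns in the output above have tiny variances simply because their categories are rare, so a variance threshold removes rare categories as well as uninformative features.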


In [None]:
reduced =  supervised_feature_reduction(
    "../data/preprocessed/preprocessed_data.csv",
    output_path="../data/preprocessed/preprocessed_reduced_data.csv",
    target_col="fire",
    var_threshold=0.01,
    corr_threshold=0.95,
    importance_method="RF",
    top_k=40
)

print("Selected features:", reduced["selected_features"])


Selected features: ['spring_prec', 'autumn_prec', 'summer_prec', 'winter_prec', 'autumn_tmax', 'elevation', 'autumn_tmin', 'summer_tmax', 'spring_tmax', 'winter_tmin', 'spring_tmin', 'summer_tmin', 'winter_tmax', 'ORG_CARBON', 'TOTAL_N', 'TCARBON_EQ', 'PH_WATER', 'CEC_EFF', 'SAND', 'REF_BULK', 'SILT', 'GRIDCODE_130.0', 'CEC_CLAY', 'GRIDCODE_30.0', 'CN_RATIO', 'GRIDCODE_20.0', 'CEC_SOIL', 'BULK', 'COARSE', 'GRIDCODE_14.0', 'ESP', 'BSAT', 'GYPSUM', 'ALUM_SAT', 'GRIDCODE_150.0', 'GRIDCODE_50.0', 'GRIDCODE_201.0', 'ELEC_COND', 'GRIDCODE_151.0', 'TEXTURE_USDA_5.0']
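
Ranking features by Random Forest importance and keeping the top `k` can be sketched with scikit-learn. The example below uses synthetic data and arbitrary hyperparameters; it is not the implementation of `supervised_feature_reduction`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the "informative" column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=500),
    "noise_1": rng.normal(size=500),
    "noise_2": rng.normal(size=500),
})
y = (X["informative"] > 0).astype(int)

# Fit the forest and rank the features by impurity-based importance.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

# Keep the top_k highest-ranked features.
top_k = 2
selected_features = importance.head(top_k).index.tolist()
print(selected_features)
```

Impurity-based importances are computed on the training data, so the ranking is best read as a rough ordering rather than an exact measure of each feature's contribution.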


In [None]:
nb_features = 15
unsupervised_feature_reduction( csv_path = "../data/preprocessed/preprocessed_data.csv",
                                output_path=f"../data/preprocessed/preprocessed_reduced_data_unsupervised_{nb_features}.csv", 
                                var_threshold=0.01, 
                                cluster_distance=0.3,
                                use_autoencoder=True,
                                n_ae_features=nb_features,
                                ae_epochs=50)

Initial features: 63
After Variance Filter: 51
After Clustering: 32
Starting Autoencoder compression...
[1m22984/22984[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m40s[0m 2ms/step
After Autoencoder: 15 synthetic features
Saved to ../data/preprocessed/preprocessed_reduced_data_unsupervised.csv


In [3]:
unsupervised_feature_reduction( csv_path = "../data/preprocessed/preprocessed_data.csv",
                                output_path="../data/preprocessed/preprocessed_reduced_unsupervised_32.csv", 
                                var_threshold=0.01, 
                                cluster_distance=0.3,
                                use_autoencoder=False,
                                ae_epochs=50,
                                percentage_data = 1)

Initial features: 63
After Variance Filter: 51
After Clustering: 32
Saved to ../data/preprocessed/preprocessed_reduced_unsupervised_32.csv
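
The clustering step reported in the logs above ("After Clustering: 32") can be sketched as hierarchical clustering of the features. The sketch assumes that `cluster_distance` is a cut-off on the distance `1 - |correlation|` and that one feature is kept per cluster; both are guesses about `unsupervised_feature_reduction`, and the autoencoder stage is not covered.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic features: `b` is a noisy copy of `a`, `c` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=300),
    "c": rng.normal(size=300),
})

# Distance between features: 0 when perfectly correlated, 1 when uncorrelated.
distance = 1 - df.corr().abs()
condensed = squareform(distance.to_numpy(), checks=False)

# Average-linkage clustering, cut at the assumed distance threshold of 0.3.
labels = fcluster(linkage(condensed, method="average"), t=0.3, criterion="distance")
clusters = pd.Series(labels, index=df.columns)

# Keep the first feature of every cluster.
kept_features = clusters[~clusters.duplicated()].index.tolist()
print(kept_features)
```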


In [12]:
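# --- Analysis 1: Duplicate check on the reduced dataset ---
# Note: the printed label still says "merged.csv" (carried over from the earlier cell);
# the file analyzed here is preprocessed_reduced_data.csv.
print("--- 🔬 Initial Duplicate Analysis (merged.csv) ---")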
stats = duplicate_analysis(
    "../data/preprocessed/preprocessed_reduced_data.csv",
    ignore_columns=["latitude", "longitude"],
    delete_duplicates=False,
)

print(f"❌ Duplicate Rows Found: {stats['duplicate_rows']}")
print(f"➗ Duplicate Percentage: {stats['duplicate_percentage']:.2f}%")
print("📋 Sample of Duplicated Rows:")
print(stats["duplicated_sample"])

print("\n" + "="*50 + "\n")

# --- Analysis 2: Verification on Cleaned Data ---
print("--- ✅ Verification Analysis (merged_unique.csv) ---")
stats = duplicate_analysis(
    "../data/Merged/merged_unique.csv",
    ignore_columns=["latitude", "longitude"]
)

print(f"❌ Duplicate Rows Found: {stats['duplicate_rows']}")
print(f"➗ Duplicate Percentage: {stats['duplicate_percentage']:.2f}%")
print("📋 Sample of Duplicated Rows:")
print(stats["duplicated_sample"])


--- 🔬 Initial Duplicate Analysis (merged.csv) ---
❌ Duplicate Rows Found: 5748
➗ Duplicate Percentage: 0.78%
📋 Sample of Duplicated Rows:
      spring_prec  autumn_prec  summer_prec  winter_prec  autumn_tmax  \
2036    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2037    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2038    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2039    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2040    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2041    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2042    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2043    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2044    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   
2045    -0.863995    -1.431071    -1.385759    -1.077314      1.45875   

      elevation  autumn_tmin  summer_tmax  sprin

In [13]:
fired_pourcentage('../data/preprocessed/preprocessed_reduced_data.csv', label_column_name='fire')


## 📊 Dataset Statistics
---------------------------------
**Total Lines (Rows):** **735483**
**Total Features:** **40**
**Label Column Name:** `fire`
---------------------------------
### Label Distribution (Classes 0 and 1)
**Label 0 Count:** **673534** (Non-Fire)
**Label 0 Percentage:** **91.58%**
---
**Label 1 Count:** **61949** (Fire)
**Label 1 Percentage:** **8.42%**
---------------------------------
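
The percentages above follow directly from the two label counts. The short check below recomputes them from the reported numbers; it is a sanity check and not the code of `fired_pourcentage`.

```python
# Label counts as reported in the statistics above.
counts = {0: 673534, 1: 61949}

total = sum(counts.values())
percent = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Number of non-fire rows per fire row.
imbalance_ratio = counts[0] / counts[1]

print(total, percent, round(imbalance_ratio, 1))
```

With roughly eleven non-fire rows for every fire row, the dataset is clearly imbalanced, which is worth keeping in mind when choosing evaluation metrics for any model trained on it.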
