# 🔄 Merged Dataset: Minimal EDA & Preprocessing

This notebook performs a focused quality pass on the merged dataset:
- Quantify and treat missing values with targeted strategies.
- Detect and mitigate outliers using a single, consistent method.
- Apply minimal encoding/scaling and persist a clean artifact for modeling.

Non-essential analyses (Q-Q plots, correlation matrices, excessive visuals) are intentionally excluded to keep the pipeline lean and reproducible.


# Merging Data

In [1]:
from scripts.dataMerging.combineDatasets import extract_features_elevation, extract_features_landcover, extract_features_monthly_clim, extract_features_soil, organize_monthly_climat_files
from scripts.dataMerging.mergeDataSources import progressive_merge
from scripts.dataMerging.generateGrid import generate_grid_in_shape
from scripts.dataPreprocessing.dataCleaning import process_fire_data, treat_sensor_errors_soil, impute_with_geo_zones
from scripts.dataPreprocessing.scalingEncoding import scalingEncodingDataset
from scripts.statistics.firePerSeason import calculate_seasonal_fire_percentage

## Create a reference grid


In [2]:

# Step 1: Generate grid (only once)
grid_df = generate_grid_in_shape(
    "../data/shapefiles/combined/alg_tun.shp",
    resolution=0.05,  # grid spacing in degrees (roughly 5 km)
    output_csv="../data/features/grid_points.csv",
)



📂 Loading shapefile...
🗺️ Bounding box: [-8.67386818 18.96023083 11.98736715 37.55986   ]
📏 Grid: 414 × 372 = 154,008 total points
🔍 Filtering points inside region...


KeyboardInterrupt: 
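As a rough check on the grid size reported in the log, the sketch below rebuilds the candidate points from the printed bounding box and the 0.05 spacing. It is not the code inside `generate_grid_in_shape`: the helper `bbox_grid`, the use of `np.arange`, and the reading of the bounding box as `[min_lon, min_lat, max_lon, max_lat]` are all assumptions made for illustration.

```python
import numpy as np

# Bounding box copied from the log, assumed to be [min_lon, min_lat, max_lon, max_lat]
MIN_LON, MIN_LAT, MAX_LON, MAX_LAT = -8.67386818, 18.96023083, 11.98736715, 37.55986
RESOLUTION = 0.05  # grid spacing in degrees


def bbox_grid(min_lon, min_lat, max_lon, max_lat, step):
    """Hypothetical helper: axis coordinates of a regular grid covering the box."""
    lons = np.arange(min_lon, max_lon, step)
    lats = np.arange(min_lat, max_lat, step)
    return lons, lats


lons, lats = bbox_grid(MIN_LON, MIN_LAT, MAX_LON, MAX_LAT, RESOLUTION)

# Every (lon, lat) combination is a candidate point before the shapefile filter
lon_mesh, lat_mesh = np.meshgrid(lons, lats)
candidates = np.column_stack([lon_mesh.ravel(), lat_mesh.ravel()])

print(len(lons), len(lats), len(candidates))  # prints: 414 372 154008
```

The inside-region filter then runs on these candidates. The later cells work with 91,102 points, which is presumably what survives that filter.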

## Extract Nearest Reference-Grid Points for Fire Data

In [19]:

# Define the paths and parameters
GRID_FILE = "../data/features/grid_points.csv"
FIRE_FILE = "../data/fire_dataset/viirs-jpss1_2024_alg_Tun.csv"
TARGET_FIRE_TYPE = 2 

process_fire_data(
    grid_path=GRID_FILE,
    fire_path=FIRE_FILE,
    target_type=TARGET_FIRE_TYPE,
    output_file="../data/preprocessed/fire_preprocessed.csv"
)


✅ Saved (91102, 3) grid points with binary fire data (1/0) to ../data/preprocessed/fire_preprocessed.csv
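The internals of `process_fire_data` are not shown here, so the following is only a minimal sketch of one way to turn fire detections into a binary label per grid point: snap each detection to its nearest grid point and mark that point with 1. The function name `label_grid_with_fire`, the `fire` column name, and the distance cutoff `max_dist_deg` are hypothetical. The `target_type` filter is left out, since the notebook does not show how it is applied.

```python
import numpy as np
import pandas as pd


def label_grid_with_fire(grid, fires, max_dist_deg):
    """Mark a grid point 1 if at least one fire detection snaps to it, else 0."""
    grid_xy = grid[["latitude", "longitude"]].to_numpy()
    fire_xy = fires[["latitude", "longitude"]].to_numpy()

    # Pairwise squared distances in degrees. Brute force is fine for a toy
    # example; a spatial index would be needed at the notebook's scale.
    d2 = ((fire_xy[:, None, :] - grid_xy[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    close_enough = d2[np.arange(len(fire_xy)), nearest] <= max_dist_deg ** 2

    out = grid.copy()
    out["fire"] = 0
    hit_positions = np.unique(nearest[close_enough])
    out.loc[out.index[hit_positions], "fire"] = 1
    return out


grid = pd.DataFrame({
    "latitude": [36.00, 36.00, 36.05],
    "longitude": [3.00, 3.05, 3.00],
})
fires = pd.DataFrame({
    "latitude": [36.01, 36.049, 40.0],   # the last detection is far from any grid point
    "longitude": [3.04, 3.001, 3.0],
})

labeled = label_grid_with_fire(grid, fires, max_dist_deg=0.05)
print(labeled)
```

The result keeps one row per grid point with three columns, which matches the `(91102, 3)` shape reported above.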


## General function for the preprocessing of all the data

## ☁️ Climate Dataset ☁️



In [6]:

# Organize the files
monthly_tmax_data = organize_monthly_climat_files(
    "../data/climate_dataset/5min/max/*.tif"
)
monthly_tmin_data = organize_monthly_climat_files(
    "../data/climate_dataset/5min/min/*.tif"
)
monthly_tprec_data = organize_monthly_climat_files(
    "../data/climate_dataset/5min/prec/*.tif"
)


fires_tmax = extract_features_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    raster_dict=monthly_tmax_data,
    output_path="../data/features/grid_tmax.csv",
    col_name="tmax",
)


fires_tmin = extract_features_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    raster_dict=monthly_tmin_data,
    output_path="../data/features/grid_tmin.csv",
    col_name="tmin",
)


fires_tprec = extract_features_monthly_clim(
    point_csv="../data/features/grid_points.csv",
    raster_dict=monthly_tprec_data,
    output_path="../data/features/grid_tprec.csv",
    col_name="prec",
)

Month 01: 100%|██████████| 91102/91102 [00:17<00:00, 5259.51it/s]
Month 02: 100%|██████████| 91102/91102 [00:17<00:00, 5275.49it/s]
Month 03: 100%|██████████| 91102/91102 [00:17<00:00, 5081.78it/s]
Month 04: 100%|██████████| 91102/91102 [00:18<00:00, 5003.54it/s]
Month 05: 100%|██████████| 91102/91102 [00:14<00:00, 6093.41it/s]
Month 06: 100%|██████████| 91102/91102 [00:17<00:00, 5080.64it/s]
Month 07: 100%|██████████| 91102/91102 [00:19<00:00, 4560.28it/s]
Month 08: 100%|██████████| 91102/91102 [00:17<00:00, 5170.49it/s]
Month 09: 100%|██████████| 91102/91102 [00:14<00:00, 6103.04it/s]
Month 10: 100%|██████████| 91102/91102 [00:13<00:00, 6823.80it/s]
Month 11: 100%|██████████| 91102/91102 [00:12<00:00, 7031.24it/s]
Month 12: 100%|██████████| 91102/9

✅ Finished sampling all monthly rasters.
💾 Saved seasonal climatology to ../data/features/grid_tmax.csv


Month 01: 100%|██████████| 91102/91102 [00:11<00:00, 7638.57it/s]
Month 02: 100%|██████████| 91102/91102 [00:12<00:00, 7527.26it/s]
Month 03: 100%|██████████| 91102/91102 [00:12<00:00, 7357.95it/s]
Month 04: 100%|██████████| 91102/91102 [00:11<00:00, 7792.98it/s]
Month 05: 100%|██████████| 91102/91102 [00:13<00:00, 6810.94it/s]
Month 06: 100%|██████████| 91102/91102 [00:14<00:00, 6140.11it/s]
Month 07: 100%|██████████| 91102/91102 [00:12<00:00, 7299.74it/s]
Month 08: 100%|██████████| 91102/91102 [00:11<00:00, 7730.85it/s]
Month 09: 100%|██████████| 91102/91102 [00:11<00:00, 7969.92it/s]
Month 10: 100%|██████████| 91102/91102 [00:11<00:00, 7789.68it/s]
Month 11: 100%|██████████| 91102/91102 [00:12<00:00, 7281.98it/s]
Month 12: 100%|██████████| 91102/9

✅ Finished sampling all monthly rasters.
💾 Saved seasonal climatology to ../data/features/grid_tmin.csv


Month 01: 100%|██████████| 91102/91102 [00:13<00:00, 6657.79it/s]
Month 02: 100%|██████████| 91102/91102 [00:13<00:00, 6899.63it/s]
Month 03: 100%|██████████| 91102/91102 [00:12<00:00, 7542.38it/s]
Month 04: 100%|██████████| 91102/91102 [00:13<00:00, 6917.20it/s]
Month 05: 100%|██████████| 91102/91102 [00:12<00:00, 7320.27it/s]
Month 06: 100%|██████████| 91102/91102 [00:13<00:00, 6773.21it/s]
Month 07: 100%|██████████| 91102/91102 [00:14<00:00, 6307.43it/s]
Month 08: 100%|██████████| 91102/91102 [00:12<00:00, 7105.57it/s]
Month 09: 100%|██████████| 91102/91102 [00:12<00:00, 7192.15it/s]
Month 10: 100%|██████████| 91102/91102 [00:11<00:00, 7789.78it/s]
Month 11: 100%|██████████| 91102/91102 [00:12<00:00, 7475.49it/s]
Month 12: 100%|██████████| 91102/9

✅ Finished sampling all monthly rasters.
💾 Saved seasonal climatology to ../data/features/grid_tprec.csv
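The logs sample twelve monthly rasters and then save a "seasonal climatology", and the resulting files carry columns such as `winter_tmax`. A minimal sketch of that reduction might look like the following. The monthly column names (`tmax_01` to `tmax_12`), the meteorological season mapping, and the use of a mean are assumptions, because `extract_features_monthly_clim` is not shown.

```python
import pandas as pd

# Assumed Northern Hemisphere meteorological seasons; the real mapping is not shown
SEASON_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "autumn": [9, 10, 11],
}


def monthly_to_seasonal(monthly, col_name):
    """Average hypothetical '<col_name>_01'..'<col_name>_12' columns per season."""
    seasonal = monthly[["latitude", "longitude"]].copy()
    for season, months in SEASON_MONTHS.items():
        month_cols = [f"{col_name}_{m:02d}" for m in months]
        seasonal[f"{season}_{col_name}"] = monthly[month_cols].mean(axis=1)
    return seasonal


# One grid point whose monthly value equals the month number, for easy checking
monthly = pd.DataFrame({
    "latitude": [36.0],
    "longitude": [3.0],
    **{f"tmax_{m:02d}": [float(m)] for m in range(1, 13)},
})

seasonal = monthly_to_seasonal(monthly, "tmax")
print(seasonal)
```

For precipitation a seasonal sum may be more natural than a mean. Which of the two the pipeline uses cannot be told from the output.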


In [2]:
calculate_seasonal_fire_percentage('../data/fire_dataset/viirs-jpss1_2024_alg_Tun.csv')


|   | Season | Count | Percentage |
|---|--------|-------|------------|
| 3 | Winter | 18609 | 20.62% |
| 2 | Spring | 23093 | 25.59% |
| 0 | Summer | 24667 | 27.33% |
| 1 | Autumn/Fall | 23881 | 26.46% |
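The seasonal breakdown above can be reproduced in spirit with a few lines of pandas. This is a sketch rather than the body of `calculate_seasonal_fire_percentage`: the date column name `acq_date` and the month-to-season mapping are assumptions, and the toy rows are invented for the example.

```python
import pandas as pd

# Assumed month-to-season mapping, using the season labels from the table
SEASONS = {
    "Winter": (12, 1, 2),
    "Spring": (3, 4, 5),
    "Summer": (6, 7, 8),
    "Autumn/Fall": (9, 10, 11),
}
MONTH_TO_SEASON = {m: season for season, months in SEASONS.items() for m in months}


def seasonal_percentage(fires, date_col="acq_date"):
    """Count detections per season and express each count as a share of the total."""
    months = pd.to_datetime(fires[date_col]).dt.month
    counts = months.map(MONTH_TO_SEASON).value_counts()
    table = counts.rename_axis("Season").reset_index(name="Count")
    share = 100 * table["Count"] / table["Count"].sum()
    table["Percentage"] = share.map("{:.2f}%".format)
    return table


# Toy detections: one in winter, two in summer, one in autumn
fires = pd.DataFrame({"acq_date": ["2024-01-15", "2024-07-01", "2024-07-20", "2024-10-05"]})
table = seasonal_percentage(fires)
print(table)
```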


## General Preprocessing for Climate CSV

In [7]:

impute_with_geo_zones("../data/features/grid_tmax.csv", base_res=0.05, min_points=10, max_res=0.2, output_path="../data/features_cleaned/grid_tmax_clean.csv")
scalingEncodingDataset("../data/features_cleaned/grid_tmax_clean.csv", "../data/preprocessed/tmax_preprocessed.csv")

impute_with_geo_zones("../data/features/grid_tmin.csv", base_res=0.05, min_points=10, max_res=0.2, output_path="../data/features_cleaned/grid_tmin_clean.csv")
scalingEncodingDataset("../data/features_cleaned/grid_tmin_clean.csv", "../data/preprocessed/tmin_preprocessed.csv")


# Written as grid_tprec_clean.csv so the file name matches the path read by the merge step
impute_with_geo_zones("../data/features/grid_tprec.csv", base_res=0.05, min_points=10, max_res=0.2, output_path="../data/features_cleaned/grid_tprec_clean.csv")
scalingEncodingDataset("../data/features_cleaned/grid_tprec_clean.csv", "../data/preprocessed/prec_preprocessed.csv")



Missing values (percent) per column :
winter_tmax    0.356743
spring_tmax    0.356743
summer_tmax    0.356743
autumn_tmax    0.356743
dtype: float64

=== Imputing column: winter_tmax ===
winter_tmax: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_tmax_clean.csv

=== Imputing column: spring_tmax ===
spring_tmax: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_tmax_clean.csv

=== Imputing column: summer_tmax ===
summer_tmax: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_tmax_clean.csv

=== Imputing column: autumn_tmax ===
autumn_tmax: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_tmax_clean.csv
Saved preprocessed dataset → ../data/preprocessed/tmax_preprocessed.csv
Missing values (percent) per column :
winter_tmin    0.356743
spring_tmin    0.356743
summer_tmin    0.356743
autumn_tmin    0.356743
dtype: float64

=== Imputin
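The `impute_with_geo_zones` calls take `base_res`, `min_points`, and `max_res`, which suggests a fill based on spatial zones that widen until enough neighbours are available. The sketch below shows one possible reading of those parameters. It is an assumption, not the function's actual logic: it bins points into latitude/longitude cells, fills a missing value with the cell mean when the cell holds at least `min_points` observed values, and doubles the cell size up to `max_res` otherwise.

```python
import numpy as np
import pandas as pd


def impute_by_zone(df, col, base_res=0.05, min_points=10, max_res=0.2):
    """Fill NaNs in `col` from the mean of observed values in the enclosing cell."""
    observed = df[col]          # statistics always come from the original values
    out = df.copy()
    res = base_res
    while out[col].isna().any() and res <= max_res + 1e-9:
        # Cell indices at the current resolution
        keys = [np.floor(df["latitude"] / res), np.floor(df["longitude"] / res)]
        grouped = observed.groupby(keys)
        zone_mean = grouped.transform("mean")
        zone_count = grouped.transform("count")   # counts non-missing values only
        fill = out[col].isna() & (zone_count >= min_points)
        out.loc[fill, col] = zone_mean[fill]
        res *= 2                                  # widen the zone and retry
    return out


toy = pd.DataFrame({
    "latitude": [36.01, 36.02, 36.03, 36.16],
    "longitude": [3.01, 3.02, 3.03, 3.16],
    "val": [10.0, 20.0, np.nan, np.nan],
})

filled = impute_by_zone(toy, "val", base_res=0.05, min_points=2, max_res=0.2)
still_missing = impute_by_zone(toy, "val", base_res=0.05, min_points=2, max_res=0.1)
print(filled["val"].tolist())
```

In the toy data the third point shares a 0.05° cell with the two observed points, while the fourth only joins them once the cell grows to 0.2°, so it stays missing when `max_res=0.1`. A categorical column such as `GRIDCODE` would need the most frequent value rather than the mean.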

## 🟩 Landcover 🟩

In [8]:

fires_with_landcover = extract_features_landcover(
    csv_path="../data/features/grid_points.csv",
    shapefile_path="../data/land_dataset/combined/alg_tun_landcvr.shp",
    lat_col="latitude",
    lon_col="longitude",
    keep_cols=["GRIDCODE"],  # can be ["GRIDCODE", "CLASS", "AREA", ...]
    output_path="../data/features/grid_landcover.csv",
)


## General Preprocessing for Landcover CSV

In [8]:
impute_with_geo_zones("../data/features/grid_landcover.csv", base_res=0.05, min_points=10, max_res=0.2, output_path="../data/features_cleaned/grid_landcover_clean.csv")
scalingEncodingDataset("../data/features_cleaned/grid_landcover_clean.csv", "../data/preprocessed/landcover_preprocessed.csv")


Missing values (percent) per column :
GRIDCODE    0.051591
dtype: float64

=== Imputing column: GRIDCODE ===
GRIDCODE: imputation done using geo-zones.
💾 Saved imputation to ../data/features_cleaned/grid_landcover_clean.csv
Saved preprocessed dataset → ../data/preprocessed/landcover_preprocessed.csv


## 🌱 Soil 🌱

In [10]:
_, fires_with_soil = extract_features_soil(
    csv_path="../data/features/grid_points.csv",
    raster_path="../data/soil_dataset/original/HWSD2_RASTER/HWSD2.bil",
    soil_attributes_csv="../data/soil_dataset/simplified/D1_soil_features_alg_tun.csv",
    output_soil_ids="../data/features/fire_soil_ids.csv",
    output_soil_feature="../data/features/grid_soil.csv",
)

In [9]:
treat_sensor_errors_soil("../data/features/grid_soil.csv", output_path="../data/features/grid_soil_treated.csv")

✔ Cleaning complete!
  Deleted rows : 15998
  Fixed rows   : 21237


## General Preprocessing for Soil CSV

In [10]:


CATEGORICAL_COLS_SOIL = ["TEXTURE_USDA", "TEXTURE_SOTER"]  # categorical columns
NUMERIC_COLS_SOIL = [
    "COARSE", "SAND", "SILT", "CLAY", "BULK", "REF_BULK", "ORG_CARBON", "PH_WATER",
    "TOTAL_N", "CN_RATIO", "CEC_SOIL", "CEC_CLAY", "CEC_EFF", "TEB", "BSAT",
    "ALUM_SAT", "ESP", "TCARBON_EQ", "GYPSUM", "ELEC_COND"
]  # numeric columns

# Impute missing soil values zone by zone
soil_cleaned = impute_with_geo_zones(
    "../data/features/grid_soil_treated.csv",
    num_cols=NUMERIC_COLS_SOIL,
    cat_cols=CATEGORICAL_COLS_SOIL,
    base_res=0.05,
    min_points=10,
    max_res=0.5,
    output_path="../data/features_cleaned/grid_soil_clean.csv",
)


Missing values (percent) per column :
COARSE          11.772825
SAND            11.772825
SILT            11.772825
CLAY            11.772825
TEXTURE_USDA    11.772825
BULK            11.772825
REF_BULK        11.772825
ORG_CARBON      11.772825
PH_WATER        11.772825
TOTAL_N         11.772825
CN_RATIO        11.772825
CEC_SOIL        11.772825
CEC_CLAY        11.772825
CEC_EFF         11.772825
TEB             11.772825
BSAT            11.772825
ALUM_SAT        11.772825
ESP             11.772825
TCARBON_EQ      11.772825
GYPSUM          11.772825
ELEC_COND       11.772825
dtype: float64

=== Imputing column: COARSE ===


KeyboardInterrupt: 

In [15]:
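import pandas as pd

SOIL_CLEAN = "../data/features_cleaned/grid_soil_clean.csv"
df = pd.read_csv(SOIL_CLEAN)
if "TEXTURE_SOTER" in df.columns:
    # Write the result back: scalingEncodingDataset reads the CSV from disk,
    # so dropping the column only in memory would have no effect.
    df.drop(columns=["TEXTURE_SOTER"]).to_csv(SOIL_CLEAN, index=False)

scalingEncodingDataset(SOIL_CLEAN, "../data/preprocessed/soil_preprocessed.csv", categorical_col=["TEXTURE_USDA"])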


TEXTURE_USDA classes found: [np.int64(3), np.int64(5), np.int64(7), np.int64(9), np.int64(10), np.int64(11), np.int64(12)]
Saved preprocessed dataset → ../data/preprocessed/soil_preprocessed.csv
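`TEXTURE_USDA` holds integer class codes, so it needs categorical encoding rather than numeric scaling. What `scalingEncodingDataset` does internally is not visible in this notebook. The sketch below shows one common combination, z-score scaling for numeric features and one-hot encoding for the listed categorical columns, with the coordinate keys left untouched. The function name `scale_and_encode` and the choice of scaler are assumptions.

```python
import pandas as pd


def scale_and_encode(df, categorical_cols=(), key_cols=("latitude", "longitude")):
    """Z-score numeric features and one-hot encode categorical ones."""
    out = df.copy()
    numeric_cols = [
        c for c in out.select_dtypes("number").columns
        if c not in key_cols and c not in categorical_cols
    ]
    for c in numeric_cols:
        std = out[c].std(ddof=0)
        # Constant columns carry no information; map them to 0 instead of dividing by 0
        out[c] = (out[c] - out[c].mean()) / std if std > 0 else 0.0
    if not categorical_cols:
        return out
    return pd.get_dummies(out, columns=list(categorical_cols), dtype=int)


soil = pd.DataFrame({
    "latitude": [36.0, 36.05, 36.1],
    "longitude": [3.0, 3.0, 3.0],
    "CLAY": [10.0, 20.0, 30.0],
    "TEXTURE_USDA": [3, 5, 3],   # integer class codes, treated as categories
})

encoded = scale_and_encode(soil, categorical_cols=["TEXTURE_USDA"])
print(encoded)
```

Keeping `latitude` and `longitude` unscaled matters here because the later merge joins on those two columns.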


## 🏔️ Elevation 🏔️

In [12]:
fires_with_elevation = extract_features_elevation(
    raster_path="../data/elevation_dataset/simplified/elevation_clipped.tif",
    fire_csv_path="../data/features/grid_points.csv",
    output_csv="../data/features/grid_elevation.csv",
    value_name="elevation",
)


Loaded 91102 points from ../data/features/grid_points.csv


Extracting elevation: 100%|██████████| 91102/91102 [00:13<00:00, 6933.98it/s]


✅ Saved extracted elevation to ../data/features/grid_elevation.csv


## General Preprocessing for Elevation CSV

In [17]:
impute_with_geo_zones("../data/features/grid_elevation.csv", base_res=0.05, min_points=10, max_res=0.1, output_path="../data/features_cleaned/grid_elevation_clean.csv")
scalingEncodingDataset("../data/features_cleaned/grid_elevation_clean.csv", "../data/preprocessed/elevation_preprocessed.csv")


Missing values (percent) per column :
Series([], dtype: float64)
Saved preprocessed dataset → ../data/preprocessed/elevation_preprocessed.csv


# Merge

## 🔥 Merging with Fire Data 🔥

In [17]:
clean_dir = "../data/features_cleaned"
feature_names = ["tmax", "tmin", "tprec", "landcover", "elevation", "soil", "fire"]
csv_list = [f"{clean_dir}/grid_{name}_clean.csv" for name in feature_names]
temp_df = progressive_merge(
    csv_list,
    on=["latitude", "longitude"],
    how="inner",
    output_path="../data/Merged/merged.csv"
)


Loading first CSV: ../data/features_cleaned/grid_tmax_clean.csv
🔁 Merging file 2/7: ../data/features_cleaned/grid_tmin_clean.csv
✅ Intermediate merged size: (91102, 10)
🔁 Merging file 3/7: ../data/features_cleaned/grid_tprec_clean.csv
✅ Intermediate merged size: (91102, 11)
🔁 Merging file 4/7: ../data/features_cleaned/grid_landcover_clean.csv
✅ Intermediate merged size: (91102, 12)
🔁 Merging file 5/7: ../data/features_cleaned/grid_elevation_clean.csv
✅ Intermediate merged size: (91102, 13)
🔁 Merging file 6/7: ../data/features_cleaned/grid_soil_clean.csv
✅ Intermediate merged size: (196405, 35)
🔁 Merging file 7/7: ../data/features_cleaned/grid_fire_clean.csv
✅ Intermediate merged size: (196405, 36)
✅ All files merged successfully.
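The row count in the log rises from 91,102 to 196,405 when the soil file is merged. An inner join can only add rows when a key pair repeats on at least one side, so duplicated `(latitude, longitude)` pairs in the soil file are a likely cause that is worth checking. The sketch below uses toy frames, not the project's data, and shows how to count duplicate keys and how pandas' `validate` argument turns silent row growth into an error. Averaging the duplicates is just one possible way to collapse them, and `report_duplicate_keys` is a hypothetical helper.

```python
import pandas as pd

KEYS = ["latitude", "longitude"]


def report_duplicate_keys(df, name, keys=KEYS):
    """Print and return how many rows repeat a key already seen in an earlier row."""
    n_dup = int(df.duplicated(subset=keys).sum())
    print(f"{name}: {len(df)} rows, {n_dup} duplicate key rows")
    return n_dup


left = pd.DataFrame({"latitude": [36.0, 36.05], "longitude": [3.0, 3.0], "elevation": [120, 340]})
# Toy soil table with two records for the first grid point
soil = pd.DataFrame({"latitude": [36.0, 36.0, 36.05], "longitude": [3.0, 3.0, 3.0], "CLAY": [22.0, 31.0, 18.0]})

n_dup = report_duplicate_keys(soil, "soil")

# A plain inner merge duplicates the matching left row: 2 rows become 3
merged = left.merge(soil, on=KEYS, how="inner")

# Collapse duplicates first (here by averaging), then let pandas enforce uniqueness
deduped = soil.groupby(KEYS, as_index=False).mean(numeric_only=True)
safe = left.merge(deduped, on=KEYS, how="inner", validate="one_to_one")

print(len(merged), len(safe))
```

With `validate="one_to_one"`, merging the original `soil` frame would raise `pandas.errors.MergeError` instead of inflating the table.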
