# GISTDA Wildfire Machine Learning Training

### This project develops and trains machine learning models to predict and monitor wildfires. It uses Sentinel-2 imagery in raster GeoTIFF format in which burned areas have already been extracted and labeled, analyzes environmental factors, and applies machine learning algorithms to improve prediction performance.

## Import Library Package

##### 1. **Numpy (`numpy`)**
   - **Description**: A powerful library for numerical computations in Python. It provides support for arrays, matrices, and a large collection of mathematical functions to operate on these data structures, making it essential for scientific and statistical analysis.

##### 2. **Pandas (`pandas`)**
   - **Description**: A widely-used library for data manipulation and analysis. It provides data structures such as DataFrames to store and manipulate large datasets, making data cleaning, transformation, and exploration more efficient and intuitive.

##### 3. **Matplotlib (`matplotlib.pyplot`)**
   - **Description**: A plotting library that provides tools for creating a wide range of static, animated, and interactive visualizations in Python. It is commonly used for generating graphs, charts, and other visual data representations.

##### 4. **Seaborn (`seaborn`)**
   - **Description**: A statistical data visualization library built on top of Matplotlib. It offers high-level functions for creating informative and attractive visualizations, especially useful for exploring and understanding data trends and distributions.

##### 5. **OS (`os`)**
   - **Description**: A standard library in Python that provides functions to interact with the operating system, allowing you to work with directories, files, and system paths. It's useful for handling file operations, environment variables, and system commands.

##### 6. **Pickle (`pickle`)**
   - **Description**: A Python module used to serialize and deserialize Python objects, allowing you to save complex data structures to files and load them back into your program. It’s commonly used for saving trained models or intermediate data states.

##### 7. **Dask (`dask.dataframe` and `dask`)**
   - **Description**: A parallel computing library that scales up computations on larger datasets. `dask.dataframe` provides similar functionality to Pandas DataFrames but can handle larger-than-memory data by performing parallel, chunked computations.

##### 8. **Rasterio (`rasterio`)**
   - **Description**: A library for reading and writing geospatial raster data. It’s widely used for working with geospatial data in formats like GeoTIFF, allowing for operations on large image files commonly used in remote sensing and GIS applications.

##### 9. **Scikit-Learn (`sklearn`)**
   - **Description**: A robust library for machine learning that provides simple and efficient tools for data analysis and modeling. It includes modules like `MinMaxScaler` for scaling data, `cross_val_score` and `cross_val_predict` for evaluating models, and various classifiers.

##### 10. **LightGBM (`lightgbm`)**
   - **Description**: A high-performance, gradient-boosting framework developed by Microsoft. It's optimized for speed and efficiency on large datasets and is particularly well-suited for structured data and classification problems.

##### 11. **IPython (`IPython.display`)**
   - **Description**: A library for creating interactive elements in Jupyter Notebooks, such as displaying Markdown, HTML, and other rich content. It’s often used to improve the readability and interactivity of notebook outputs.

##### 12. **Delayed and Compute (`dask.delayed` and `dask.compute`)**
   - **Description**: Functions in the Dask library that allow you to parallelize and execute tasks asynchronously. `delayed` is used to mark a function for lazy evaluation, and `compute` is used to execute the delayed functions, making computations efficient and scalable.

##### 13. **Rasterio Windows (`rasterio.windows`)**
   - **Description**: A submodule in Rasterio that allows for windowed or tiled reading of raster data. This is useful for reading and processing only portions of large raster datasets, improving efficiency when working with large geospatial files.

##### 14. **XGBoost (`xgboost`)**
   - **Description**: An optimized gradient-boosting framework that’s highly effective for predictive modeling tasks, especially classification and regression. It’s known for its speed and performance and is widely used in machine learning competitions for its accuracy.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
import pickle
import dask.dataframe as dd
import rasterio
import dask
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.metrics import classification_report, confusion_matrix
from IPython.display import display, Markdown
from dask import delayed, compute
from rasterio.windows import Window
from xgboost import XGBClassifier

## Import Raster GeoTIFF files

This code reads every raster file in a folder, flattens each one into a table of pixels, and stores the data in a single consolidated DataFrame.

1. **Folder Path and File Discovery**: Sets the path to the folder containing raster files (`FOLDER_PATH`) and collects every `.tif` file in it with `glob`.

2. **Read and Reshape**:
   - Opens each file with `rasterio` and reads all bands into an array of shape `(bands, height, width)`.
   - Reshapes the array to `(pixels, bands)` so that each row is one pixel and each column is one band.

3. **Column Naming**:
   - Uses the band descriptions stored in the file's metadata as column names.
   - Falls back to generic names (`band_1`, `band_2`, ...) when any description is missing.

4. **Combine**:
   - Appends one DataFrame per file to a list, then concatenates the list into a final DataFrame (`df_all`).

5. **Debug Outputs**:
   - Prints the shape and displays a sample of the final DataFrame for inspection.
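
As a rough illustration of the reshape and column-naming logic used in the cell below, the following sketch works on a small synthetic array instead of a real raster. The array values, the `descriptions` tuples, and the helper name `pixels_to_frame` are made up for this example; only the `(bands, height, width)` to `(pixels, bands)` layout and the fallback to generic names follow the notebook.

```python
import numpy as np
import pandas as pd


def pixels_to_frame(bands, descriptions):
    """Flatten a (bands, height, width) array to one row per pixel.

    Hypothetical helper for illustration; `descriptions` stands in for
    the tuple that rasterio exposes as `src.descriptions`.
    """
    n_bands = bands.shape[0]
    # Each band becomes one column; pixels are ordered row by row.
    table = bands.reshape(n_bands, -1).T

    names = list(descriptions)
    # Fall back to generic names if any description is missing.
    if len(names) != n_bands or not all(names):
        names = [f"band_{i + 1}" for i in range(n_bands)]
    return pd.DataFrame(table, columns=names)


# Synthetic 2-band, 2x3 "raster" (values are arbitrary).
demo = np.arange(12, dtype="float32").reshape(2, 2, 3)

named = pixels_to_frame(demo, ("B02", "B03"))
generic = pixels_to_frame(demo, ("B02", None))

print(named.shape)
print(list(generic.columns))
```

Because every file is read in full before reshaping, memory use grows with the total number of pixels across all rasters.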

In [11]:
# Parameters
FOLDER_PATH = 'Raster_Train'

# Find all .tif files in the folder
tif_files = glob.glob(os.path.join(FOLDER_PATH, '*.tif'))

# Load and concatenate all data
all_data = []

for file in tif_files:
    with rasterio.open(file) as src:
        bands = src.read()  # (bands, height, width)
        bands_reshaped = bands.reshape(bands.shape[0], -1).T  # (pixels, bands)

        # --- Dynamic Column Naming Logic ---
        # Attempt to use band descriptions from the file's metadata
        column_names = list(src.descriptions)

        # Fallback if descriptions are not available, empty, or don't match the band count
        if not all(column_names) or len(column_names) != src.count:
            generic_names = [f'band_{i+1}' for i in range(src.count)]
            # If there were some descriptions, but not a complete set, warn the user.
            if any(column_names):
                print(f"Warning: Incomplete band descriptions in {file}. Falling back to generic names.")
            column_names = generic_names

        df = pd.DataFrame(bands_reshaped, columns=column_names)
        all_data.append(df)

# Combine into a single DataFrame
df_all = pd.concat(all_data, ignore_index=True)

# Check df_all shape before normalization
print("Shape of df_all before normalization:", df_all.shape)

# Display the first rows of the DataFrame
display(df_all.head())

Shape of df_all before normalization: (266534912, 16)


| | B02 | B03 | B04 | B05 | B06 | B07 | B08 | B8A | B11 | B12 | dNBR | BAIS2 | NDVI | NDWI | SAVI | Burn_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1544.0 | 1614.0 | 1737.0 | 2066.0 | 2474.0 | 2656.0 | 2675.0 | 2919.0 | 3043.0 | 2457.0 | 0.228026 | 17610.087891 | 0.212602 | -0.247377 | 0.425108 | 0.0 |
| 1 | 1492.0 | 1596.0 | 1690.0 | 2066.0 | 2474.0 | 2656.0 | 2716.0 | 2919.0 | 3043.0 | 2457.0 | 0.228026 | 17853.355469 | 0.232864 | -0.25974 | 0.465623 | 0.0 |
| 2 | 1494.0 | 1620.0 | 1716.0 | 2068.0 | 2410.0 | 2682.0 | 2708.0 | 2878.0 | 3055.0 | 2531.0 | 0.249016 | 12238.231445 | 0.224231 | -0.251386 | 0.448362 | 0.0 |
| 3 | 1512.0 | 1642.0 | 1768.0 | 2068.0 | 2410.0 | 2682.0 | 2655.0 | 2878.0 | 3055.0 | 2531.0 | 0.249016 | 12056.858398 | 0.200543 | -0.235746 | 0.400995 | 0.0 |
| 4 | 1566.0 | 1672.0 | 1791.0 | 2115.0 | 2491.0 | 2673.0 | 2586.0 | 2935.0 | 3187.0 | 2637.0 | 0.25877 | 9880.970703 | 0.181631 | -0.214655 | 0.36318 | 0.0 |


## Exploratory Data Analysis (EDA) & Feature Engineering

This code removes an unnecessary column from the combined DataFrame (`df_all`):

1. **Drop Unneeded Columns**:
   - Drops the `dNBR` column, keeping the Sentinel-2 bands, the remaining spectral indices, and `Burn_Label` in a new DataFrame (`df`).

2. **Display DataFrame**:
   - Prints the first rows of `df` to confirm that the column was removed.

In [12]:
# Drop unnecessary columns
columns_to_drop = ['dNBR']
df = df_all.drop(columns=columns_to_drop)

# Print the first rows of the DataFrame
print(df.head())

      B02     B03     B04     B05     B06     B07     B08     B8A     B11  \
0  1544.0  1614.0  1737.0  2066.0  2474.0  2656.0  2675.0  2919.0  3043.0   
1  1492.0  1596.0  1690.0  2066.0  2474.0  2656.0  2716.0  2919.0  3043.0   
2  1494.0  1620.0  1716.0  2068.0  2410.0  2682.0  2708.0  2878.0  3055.0   
3  1512.0  1642.0  1768.0  2068.0  2410.0  2682.0  2655.0  2878.0  3055.0   
4  1566.0  1672.0  1791.0  2115.0  2491.0  2673.0  2586.0  2935.0  3187.0   

      B12         BAIS2      NDVI      NDWI      SAVI  Burn_Label  
0  2457.0  17610.087891  0.212602 -0.247377  0.425108         0.0  
1  2457.0  17853.355469  0.232864 -0.259740  0.465623         0.0  
2  2531.0  12238.231445  0.224231 -0.251386  0.448362         0.0  
3  2531.0  12056.858398  0.200543 -0.235746  0.400995         0.0  
4  2637.0   9880.970703  0.181631 -0.214655  0.363180         0.0  


### Check Burn Class

This code checks and displays the counts of burn records in the DataFrame:

1. **Count Burn Labels**:
   - Counts occurrences in the `Burn_Label` column to determine the number of "Burn" and "Unburn" records.
   - Renames the labels: `1` to "Burn" and `0` to "Unburn" for readability.

2. **Display Counts**:
   - Prints the resulting counts to show the distribution of burned and unburned records.
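
To make the idea of class balance concrete, here is a minimal sketch on a tiny synthetic label column (not the project's data). It applies the same `value_counts` and `rename` pattern as the cell below and adds two quantities that the notebook does not compute: the share of each class and a majority-to-minority ratio. The names `labels`, `shares`, and `imbalance_ratio` are assumptions for illustration.

```python
import pandas as pd

# Tiny synthetic label column (not the project's data).
labels = pd.Series([0.0, 0.0, 0.0, 1.0], name="Burn_Label")

label_names = {1: "Burn", 0: "Unburn"}
counts = labels.value_counts().rename(index=label_names)
shares = labels.value_counts(normalize=True).rename(index=label_names)

# Ratio of majority to minority class; a value far above 1 signals imbalance.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```

A ratio well above 1, as in the counts printed for this notebook's data, is the motivation for the downsampling step that follows.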

In [13]:
# Check Burn Records
burn_counts = df['Burn_Label'].value_counts().rename(index={1: 'Burn', 0: 'Unburn'})

# Display the counts with labels
print(burn_counts)

Burn_Label
Unburn    233896598
Burn       32638314
Name: count, dtype: int64


### Downsampling

This code performs downsampling to balance the dataset by reducing the number of "Unburn" records to match the count of "Burn" records:

1. **Get Burn Count**:
   - Retrieves the count of "Burn" records from `burn_counts`.

2. **Sample Unburned Records**:
   - Selects a random sample of "Unburn" records, equal in size to the number of "Burn" records, using a fixed `random_state` for reproducibility.

3. **Combine Burn and Downsampled Unburn Records**:
   - Combines all "Burn" records with the downsampled "Unburn" sample into a new DataFrame (`downsampled_df`).

4. **Check New Burn Record Counts**:
   - Counts and displays the "Burn" and "Unburn" records in `downsampled_df` to verify balance.
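
The steps above can also be wrapped in a small reusable function. The sketch below is one possible way to do that, assuming a pandas DataFrame with a label column; the helper name `downsample_to_smallest` and the toy data are hypothetical. Unlike the inline code in the notebook, it handles any number of classes by shrinking each one to the size of the smallest.

```python
import pandas as pd


def downsample_to_smallest(frame, label_col, seed=42):
    """Randomly shrink every class to the size of the smallest one.

    Hypothetical helper; the notebook does the equivalent inline for
    the two classes of `Burn_Label`.
    """
    smallest = frame[label_col].value_counts().min()
    # Sampling n rows from a group of exactly n rows keeps all of them,
    # so the minority class is preserved in full.
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    return pd.concat(parts)


# Synthetic example: 6 unburned rows, 2 burned rows.
toy = pd.DataFrame({
    "NDVI": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "Burn_Label": [0, 0, 0, 0, 0, 0, 1, 1],
})

balanced = downsample_to_smallest(toy, "Burn_Label")
print(balanced["Burn_Label"].value_counts())
```

Fixing `seed` plays the same role as `random_state=42` in the notebook: repeated runs select the same rows.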

In [14]:
burn_count = burn_counts['Burn']
unburn_sample = df[df['Burn_Label'] == 0].sample(n=burn_count, random_state=42)

downsampled_df = pd.concat([df[df['Burn_Label'] == 1], unburn_sample])

# Check Burn Records
burn_counts = downsampled_df['Burn_Label'].value_counts().rename(index={1: 'Burn', 0: 'Unburn'})

# Display the counts with labels
print(burn_counts)

Burn_Label
Burn      32638314
Unburn    32638314
Name: count, dtype: int64


## Pre-Processing

### Remove infinite values

This code handles the presence of infinite values and missing data in the `downsampled_df` DataFrame:

1. **Replace Infinite Values**:
   - Replaces both positive and negative infinite values (`np.inf` and `-np.inf`) with `NaN` using `replace()`. This ensures that infinite values do not interfere with further processing.

2. **Drop Rows with Missing Values**:
   - Removes any rows containing `NaN` values using `dropna()`, ensuring the DataFrame only contains valid data.

3. **Display the DataFrame**:
   - Displays the cleaned DataFrame (`downsampled_df`) for inspection.
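
Infinite and missing values typically come from ratio-style features whose denominator is zero. The sketch below reproduces that situation with made-up numbers and then applies the same two cleaning steps as the notebook, written without `inplace=True`. The column name `ratio` and the input arrays are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

numerator = np.array([1.0, 0.0, 2.0, -3.0])
denominator = np.array([2.0, 0.0, 0.0, 0.0])

# Division by zero gives NaN (0/0) or +/-inf (x/0); silence the warnings here.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = numerator / denominator

toy = pd.DataFrame({"ratio": ratio, "Burn_Label": [0, 1, 0, 1]})

# Step 1: turn +/-inf into NaN. Step 2: drop every row that contains NaN.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(toy)
print(cleaned)
```

Only the first row survives, because it is the only one with a finite ratio. Note that `dropna()` removes a whole row even when just one column is invalid.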

In [15]:
# Replacing infinite with nan 
downsampled_df.replace([np.inf, -np.inf], np.nan, inplace=True) 
  
# Dropping all the rows with nan values 
downsampled_df.dropna(inplace=True)

# Printing df 
display(downsampled_df)

| | B02 | B03 | B04 | B05 | B06 | B07 | B08 | B8A | B11 | B12 | BAIS2 | NDVI | NDWI | SAVI | Burn_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 1570.0 | 1638.0 | 1789.0 | 2036.0 | 2199.0 | 2358.0 | 2348.0 | 2588.0 | 3156.0 | 2839.0 | -12065.983398 | 0.135122 | -0.178123 | 0.270179 | 1.0 |
| 13 | 1587.0 | 1668.0 | 1828.0 | 2036.0 | 2199.0 | 2358.0 | 2345.0 | 2588.0 | 3156.0 | 2839.0 | -11936.529297 | 0.123892 | -0.168702 | 0.247724 | 1.0 |
| 15 | 1598.0 | 1678.0 | 1812.0 | 2050.0 | 2219.0 | 2332.0 | 2400.0 | 2568.0 | 3203.0 | 2905.0 | -15038.642578 | 0.139601 | -0.177048 | 0.279136 | 1.0 |
| 18 | 1546.0 | 1630.0 | 1734.0 | 1973.0 | 2164.0 | 2327.0 | 2285.0 | 2482.0 | 2861.0 | 2461.0 | -1882.131836 | 0.137099 | -0.167305 | 0.274129 | 1.0 |
| 232 | 1662.0 | 1766.0 | 1902.0 | 2113.0 | 2255.0 | 2358.0 | 2468.0 | 2532.0 | 2984.0 | 2631.0 | -6323.865723 | 0.129519 | -0.165801 | 0.258980 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 199616788 | 1836.0 | 1979.0 | 2174.0 | 2470.0 | 2698.0 | 2894.0 | 2963.0 | 3065.0 | 3939.0 | 3423.0 | -18058.707031 | 0.153592 | -0.199110 | 0.307123 | 0.0 |
| 244231951 | 1487.0 | 1682.0 | 1795.0 | 2469.0 | 3049.0 | 3202.0 | 3402.0 | 3540.0 | 3413.0 | 2404.0 | 60252.687500 | 0.309217 | -0.338316 | 0.618315 | 0.0 |
| 3559136 | 1701.0 | 1790.0 | 1998.0 | 2171.0 | 2288.0 | 2368.0 | 2428.0 | 2556.0 | 3405.0 | 3115.0 | -22166.927734 | 0.097153 | -0.151257 | 0.194262 | 0.0 |
| 103705560 | 1502.0 | 1808.0 | 1686.0 | 2189.0 | 2956.0 | 3246.0 | 3176.0 | 3454.0 | 3498.0 | 2567.0 | 46237.296875 | 0.306458 | -0.274478 | 0.612790 | 0.0 |


### Separate Burn_Label from DataFrame

This code separates the `Burn_Label` from the main DataFrame and ensures the label is in the correct format:

1. **Separate Burn Label**:
   - Extracts the `Burn_Label` column from `downsampled_df` into a new DataFrame (`burn_label`).

2. **Remove Burn Label from Main DataFrame**:
   - Drops the `Burn_Label` column from `downsampled_df` to ensure only the feature data remains.

3. **Convert Burn Label to Integer**:
   - Changes the data type of the `burn_label` DataFrame to `int32` to ensure consistent and efficient processing.

4. **Display Burn Label**:
   - Displays the modified `burn_label` DataFrame to verify the changes.

In [16]:
# Separate Burn_Label from DataFrame
burn_label = downsampled_df[['Burn_Label']]

# Drop Label from DataFrame
downsampled_df = downsampled_df.drop(columns=['Burn_Label'])

# Change type of Label to Integer Format
burn_label = burn_label.astype('int32')
display(burn_label)

| | Burn_Label |
|---|---|
| 12 | 1 |
| 13 | 1 |
| 15 | 1 |
| 18 | 1 |
| 232 | 1 |
| ... | ... |
| 199616788 | 0 |
| 244231951 | 0 |
| 3559136 | 0 |
| 103705560 | 0 |


### Data Normalization with MinMax Scaler

This code normalizes the feature data and saves the normalization model, while also combining it with the `Burn_Label`:

1. **List Columns**:
   - Creates a list of column names from `downsampled_df` to keep track of the column order after normalization (`cols_norm`).

2. **Normalize the Data**:
   - Creates a `MinMaxScaler` from `sklearn` and fits it to the data in `downsampled_df`, learning the per-column minimum and maximum used to scale values to the range 0 to 1.

3. **Save the Scaler**:
   - Saves the fitted `MinMaxScaler` model to a specified path (`MinMax_Scaler.pkl`) for later use.

4. **Apply Normalization**:
   - Normalizes the data by applying the `scaler` to `downsampled_df`, then converts the result back into a DataFrame (`df_norm`) with the original column names.

5. **Check Shape**:
   - Prints the shape of `df_norm` to confirm the normalization was applied correctly.

6. **Concatenate with Burn Label**:
   - Combines the normalized feature data (`df_norm`) with the `burn_label` DataFrame. Both indices are reset first so that rows are matched by position, which yields a complete dataset.

7. **Display the DataFrame**:
   - Displays the final DataFrame (`df_norm`), which now includes both the normalized features and the `Burn_Label`.
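
With its default settings, `MinMaxScaler` maps each column value `x` to `(x - min) / (max - min)`, where `min` and `max` are the column statistics learned during fitting. The sketch below reproduces that arithmetic with plain NumPy on made-up numbers, mainly to show what happens when the saved scaler is later applied to data outside the training range. The arrays and the helper `min_max` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

# Synthetic "training" features: two columns with different ranges.
train = np.array([[1000.0, -0.2],
                  [2000.0,  0.0],
                  [3000.0,  0.6]])

# What fitting learns: one minimum and one maximum per column.
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def min_max(values):
    """Scale with the stored training statistics (default 0..1 range)."""
    return (values - col_min) / (col_max - col_min)


train_scaled = min_max(train)

# New data is scaled with the *training* statistics, so values outside
# the training range land outside [0, 1].
new = np.array([[3500.0, 0.2]])
new_scaled = min_max(new)

print(train_scaled)
print(new_scaled)
```

This is why the fitted scaler is saved alongside the model: predictions on new rasters should reuse the same minimum and maximum values rather than refitting on the new data.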


In [17]:
# Keep the column names so they can be restored after scaling
cols_norm = downsampled_df.columns.tolist()

# Create the scaler
scaler = MinMaxScaler()

# Fit the scaler to the feature data
scaler.fit(downsampled_df)

# Save the scaler
scaler_save_path = r'Export_Model'
save_path = os.path.join(scaler_save_path, 'MinMax_Scaler.pkl')
os.makedirs(os.path.dirname(save_path), exist_ok=True)
with open(save_path, 'wb') as f:
    pickle.dump(scaler, f)

# Normalize Data
df_norm = scaler.transform(downsampled_df)
df_norm = pd.DataFrame(df_norm, columns=cols_norm)

# Check df_norm shape after normalization
print("Shape of df_norm after normalization:", df_norm.shape)

# Concatenate df_norm with burn_label
df_norm = pd.concat([df_norm.reset_index(drop=True), burn_label.reset_index(drop=True)], axis=1, sort=False)
display(df_norm)

Shape of df_norm after normalization: (59674126, 14)


| | B02 | B03 | B04 | B05 | B06 | B07 | B08 | B8A | B11 | B12 | BAIS2 | NDVI | NDWI | SAVI | Burn_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.111442 | 0.052246 | 0.065184 | 0.093000 | 0.118952 | 0.142207 | 0.181734 | 0.166144 | 0.136634 | 0.120112 | 0.281439 | 0.656207 | 0.292231 | 0.656066 | 1 |
| 1 | 0.112649 | 0.054799 | 0.068292 | 0.093000 | 0.118952 | 0.142207 | 0.181502 | 0.166144 | 0.136634 | 0.120112 | 0.281524 | 0.649715 | 0.297891 | 0.649573 | 1 |
| 2 | 0.113430 | 0.055650 | 0.067017 | 0.094183 | 0.120741 | 0.139792 | 0.185759 | 0.164300 | 0.139576 | 0.124315 | 0.279468 | 0.658797 | 0.292877 | 0.658656 | 1 |
| 3 | 0.109739 | 0.051566 | 0.060802 | 0.087673 | 0.115821 | 0.139328 | 0.176858 | 0.156371 | 0.118170 | 0.096039 | 0.288188 | 0.657350 | 0.298730 | 0.657209 | 1 |
| 4 | 0.117973 | 0.063138 | 0.074189 | 0.099510 | 0.123960 | 0.142207 | 0.191022 | 0.160981 | 0.125868 | 0.106865 | 0.285244 | 0.652969 | 0.299634 | 0.652827 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59674121 | 0.130324 | 0.081263 | 0.095864 | 0.129692 | 0.163581 | 0.191993 | 0.229334 | 0.210124 | 0.185642 | 0.157305 | 0.277467 | 0.666885 | 0.279623 | 0.666750 | 0 |
| 59674122 | 0.105551 | 0.055990 | 0.065663 | 0.129608 | 0.194974 | 0.220602 | 0.263313 | 0.253918 | 0.152720 | 0.092409 | 0.329368 | 0.756851 | 0.195993 | 0.756741 | 0 |
| 59674123 | 0.120741 | 0.065180 | 0.081839 | 0.104413 | 0.126912 | 0.143136 | 0.187926 | 0.163194 | 0.152219 | 0.137689 | 0.274744 | 0.634258 | 0.308371 | 0.634112 | 0 |
| 59674124 | 0.106616 | 0.066712 | 0.056977 | 0.105935 | 0.186656 | 0.224689 | 0.245820 | 0.245989 | 0.158040 | 0.102789 | 0.320079 | 0.755256 | 0.234344 | 0.755143 | 0 |


## Machine Learning Model Training

### XGBoost Model Training

The `xgboost_model` function performs hyperparameter tuning for the XGBoost model to find the optimal `max_depth` and evaluates its performance using cross-validation. GPU processing is used when the function detects it, which can speed up training. Below is a breakdown of the steps involved:

1. **Input Parameters**:
   - `df_norm`: The DataFrame containing the normalized features and the target variable `Burn_Label`.
   - `xgb_estimator_num`: The number of estimators (trees) for the XGBoost model, default value is 100.
   - `objective`: The objective function for the model, default value is `"binary:logistic"` for binary classification.

2. **Data Preparation**:
   - The target variable (`Burn_Label`) is separated from the features, and the feature matrix (`X`) and target vector (`Y`) are defined.

3. **Cross-Validation Setup**:
   - A 10-fold cross-validation strategy (`KFold`) is used to split the data into training and validation sets for robust evaluation of the model.

4. **Hyperparameter Tuning**:
   - The function iterates through different values of `max_depth` (ranging from 2 to 20), training an XGBoost model with each `max_depth` value.
   - For each value of `max_depth`, the model is evaluated using cross-validation, and the mean cross-validation accuracy is stored.

5. **GPU Acceleration**:
   - The function checks whether a GPU can be used and passes `device="cuda"` if so, otherwise `device="cpu"`. Both cases use `tree_method="hist"`. Running on a GPU can significantly reduce training and prediction times.

6. **Plotting Results**:
   - A plot is generated to visualize the relationship between `max_depth` and mean cross-validation accuracy. This helps in identifying the optimal `max_depth` for the best model.

7. **Optimal Model Training**:
   - Once the best `max_depth` is identified, the final model is trained with the optimal hyperparameters, and cross-validation is performed again to calculate the mean and standard deviation of the accuracy.

8. **Model Evaluation**:
   - A classification report and confusion matrix are generated to evaluate the model's performance in terms of precision, recall, F1-score, and overall accuracy.
   - A confusion matrix heatmap is plotted to provide a clear visualization of the model's predictions versus the actual outcomes.

9. **Output**:
   - The function returns the best-trained XGBoost model (`xgb_model_best`), along with the mean and standard deviation of the cross-validation scores.

10. **Example Usage**:
   - The function is called with a dataset (`df_norm`), and it performs hyperparameter tuning and evaluation of the XGBoost model using GPU acceleration.

This approach leverages GPU resources for faster training while optimizing model performance based on cross-validation results, making it suitable for large datasets and efficient model selection.
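
The evaluation metrics in step 8 all derive from the four cells of the confusion matrix. As a minimal sketch, the code below computes them by hand for a made-up 2x2 matrix laid out the way scikit-learn's `confusion_matrix` returns it (rows are true classes, columns are predicted classes). The numbers are invented for illustration and are not results from this project.

```python
import numpy as np

# Made-up 2x2 confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes, order [0, 1].
cm = np.array([[90, 10],
               [5, 95]])

tn, fp, fn, tp = cm.ravel()

# Of the records predicted as burned, the fraction that truly are.
precision_burn = tp / (tp + fp)
# Of the truly burned records, the fraction that were found.
recall_burn = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1_burn = 2 * precision_burn * recall_burn / (precision_burn + recall_burn)
# Fraction of all records classified correctly.
accuracy = (tp + tn) / cm.sum()

print(precision_burn, recall_burn, f1_burn, accuracy)
```

On a balanced dataset such as the downsampled one, accuracy is a reasonable headline number, while per-class precision and recall show which kind of error dominates.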

In [None]:
def xgboost_model(df_norm, xgb_estimator_num=100, objective="binary:logistic"):
    """
    Performs hyperparameter tuning for XGBoost to find optimal max_depth
    and plots the results with GPU or CPU processing.
    
    Parameters:
    -----------
    df_norm : pandas.DataFrame
        Input dataframe containing features and target variable
    xgb_estimator_num : int, optional (default=100)
        Number of estimators for XGBoost
    objective : str, optional (default="binary:logistic")
        Objective function for XGBoost
    
    Returns:
    --------
    tuple: (best_xgb_model, mean_cross_val_score, std_cross_val_score)
    """
    # Check if GPU is available
    # Note: this only confirms that XGBoost accepts the "gpu_hist" setting.
    # It does not prove that a CUDA device is actually present.
    def is_gpu_available():
        try:
            test_model = XGBClassifier(tree_method="gpu_hist")
            return "gpu_hist" in test_model.get_xgb_params().get("tree_method", "")
        except Exception:
            return False

    gpu_available = is_gpu_available()
    if gpu_available:
        print(f"GPU available: {gpu_available}")
    else:
        print("GPU unavailable, CPU Enabled.")

    display(Markdown("### Hyperparameter Tuning for XGBoost"))
    print()  # Add Blank Line

    # Define the features and target
    X = df_norm.drop(columns=['Burn_Label'])  # Features: all columns except Burn_Label
    Y = df_norm['Burn_Label']                 # Target: Burn_Label column

    # Set up cross-validation
    kf = KFold(n_splits=10, shuffle=True, random_state=42)  # 10-fold cross-validation
    
    max_depths = list(range(2, 21))  # Range of max_depth to test
    mean_cv_scores = []
    cv_std_devs = []

    # Perform hyperparameter tuning
    for max_depth in max_depths:
        xgb_model = XGBClassifier(
            n_estimators=xgb_estimator_num,
            max_depth=max_depth,
            objective=objective,
            eval_metric="logloss",
            random_state=42,
            tree_method="hist",  # Use "hist" for both CPU and GPU
            device="cuda" if gpu_available else "cpu"  # Explicitly set device
        )
        try:
            scores_cv = cross_val_score(xgb_model, X, Y, cv=kf, scoring='accuracy')
            mean_cv_scores.append(scores_cv.mean())
            cv_std_devs.append(scores_cv.std())
            print(f"Max Depth: {max_depth}, CV Score: {round(scores_cv.mean() * 100, 2)}%, Std Dev: {round(scores_cv.std() * 100, 2)}%")
        except Exception as e:
            print(f"Error processing max_depth {max_depth}: {e}")
            # Record placeholders in both lists so they stay aligned for plotting
            mean_cv_scores.append(0)
            cv_std_devs.append(0)
            
    # Find the best max_depth
    best_max_depth = max_depths[np.argmax(mean_cv_scores)]
    best_accuracy = max(mean_cv_scores)
    print(f"Best Max Depth: {best_max_depth} with Mean CV Accuracy: {round(best_accuracy * 100, 2)}%")

    # Plotting the mean cross-validation accuracy vs. max_depth
    plt.figure(figsize=(10, 6))
    plt.plot(max_depths, mean_cv_scores, marker='o', linestyle='-', color='blue')
    plt.fill_between(
        max_depths, 
        np.array(mean_cv_scores) - np.array(cv_std_devs), 
        np.array(mean_cv_scores) + np.array(cv_std_devs), 
        color='lightblue', alpha=0.5, label='Std Dev'
    )
    plt.xlabel('Max Depth')
    plt.ylabel('Mean CV Accuracy')
    plt.title('XGBoost - Max Depth vs. Mean CV Accuracy')
    plt.xticks(max_depths)
    plt.grid(True)
    plt.show()

    # Train and evaluate XGBoost with the best max_depth
    xgb_model_best = XGBClassifier(
        n_estimators=xgb_estimator_num,
        max_depth=best_max_depth,
        objective=objective,
        eval_metric="logloss",
        random_state=42,
        tree_method="hist",  # Use "hist" for both CPU and GPU
        device="cuda" if gpu_available else "cpu"  # Explicitly set device
    )
    xgb_model_best.fit(X, Y)
    scores_cv = cross_val_score(xgb_model_best, X, Y, cv=kf, scoring='accuracy')
    mean_cv = scores_cv.mean()
    std_cv = scores_cv.std()

    # Display Cross-validation results
    print(f"XGBoost (Best Max Depth = {best_max_depth}) Cross-validation scores: {round(mean_cv * 100, 2)}%")
    print(f"XGBoost (Best Max Depth = {best_max_depth}) Standard deviation: {round(std_cv * 100, 2)}%")
    
    # Generate classification report
    y_pred = cross_val_predict(xgb_model_best, X, Y, cv=kf)
    report = classification_report(Y, y_pred, output_dict=True)
    cm = confusion_matrix(Y, y_pred)
    
    xgb_result = [{
        'Classifier': 'XGBoost',
        'Model Definition': xgb_model_best,
        'Class 0 - Precision': report['0']['precision'],
        'Class 0 - Recall': report['0']['recall'],
        'Class 0 - F1-Score': report['0']['f1-score'],
        'Class 1 - Precision': report['1']['precision'],
        'Class 1 - Recall': report['1']['recall'],
        'Class 1 - F1-Score': report['1']['f1-score'],
        'Average - Precision': report['macro avg']['precision'],
        'Average - Recall': report['macro avg']['recall'],
        'Average - F1-Score': report['macro avg']['f1-score'],
        'Accuracy': report['accuracy'],
        'Confusion Matrix': cm
    }]
    
    xgb_result_df = pd.DataFrame(xgb_result)
    
    display(Markdown("### Classification Report of XGBoost (Best Max Depth)"))
    display(xgb_result_df)
    
    # Plot Confusion Matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title('Confusion Matrix - XGBoost (Best Max Depth)')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

    print()  # Add Blank Line
    display(Markdown("<span style='color: green; font-weight: bold;'>XGBoost Model Run Complete (GPU/CPU Enabled)</span>"))
    print()  # Add Blank Line

    return xgb_model_best, mean_cv, std_cv

# Call the xgboost_model function with the desired parameters and your DataFrame
xgb_model_best, mean_cv_best, std_cv_best = xgboost_model(df_norm)

GPU available: True


### Hyperparameter Tuning for XGBoost






Max Depth: 2, CV Score: 98.37%, Std Dev: 0.01%
Max Depth: 3, CV Score: 98.38%, Std Dev: 0.0%
Max Depth: 4, CV Score: 98.4%, Std Dev: 0.0%
Max Depth: 5, CV Score: 98.42%, Std Dev: 0.0%
Max Depth: 6, CV Score: 98.44%, Std Dev: 0.01%
Max Depth: 7, CV Score: 98.46%, Std Dev: 0.01%


### Export XGBoost as pickle

After training the XGBoost model with the optimal hyperparameters, the model is saved to a file for later use. The following steps are involved in saving the trained model:

1. **Define Save Path**:
   - The variable `savepath` is set to a directory called `'Export_Model'`, where the model will be saved. The directory must exist before the file is opened; the scaler export step earlier in this notebook already creates it with `os.makedirs`.

2. **Create the File Path**:
   - `xgb_filename_model` constructs the full file path for saving the model. The model file will be named `Model_XGB.sav` and will be stored in the `Export_Model` directory.

3. **Save the Model**:
   - The `pickle.dump` function is used to serialize and save the trained XGBoost model (`xgb_model_best`) to the specified file path. The file is saved in binary format using the `wb` mode.

4. **Usage**:
   - This allows the trained model to be loaded later for predictions or further analysis without needing to retrain it.

By saving the model, you can easily load it in the future for inference or to integrate it into a production environment.
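
A full save-and-load round trip might look like the sketch below. It uses a plain dictionary as a stand-in for the trained model and a temporary directory instead of the real `Export_Model` folder, so it runs without XGBoost; `fake_model`, `model_path`, and `restored` are hypothetical names.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable object works the same way.
fake_model = {"name": "Model_XGB", "max_depth": 6}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "Export_Model", "Model_XGB.sav")
    os.makedirs(os.path.dirname(model_path), exist_ok=True)

    # Save: the context manager closes the file even if dump() fails.
    with open(model_path, "wb") as f:
        pickle.dump(fake_model, f)

    # Load: only unpickle files that come from a trusted source.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

When the real model is loaded for inference, new data would presumably need the same feature columns in the same order, scaled with the saved `MinMax_Scaler.pkl`, before calling `predict`.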

In [None]:
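savepath = r'Export_Model'
xgb_filename_model = os.path.join(savepath, 'Model_XGB.sav')
os.makedirs(savepath, exist_ok=True)  # create the export folder if needed
# Use a context manager so the file is closed after writing
with open(xgb_filename_model, 'wb') as f:
    pickle.dump(xgb_model_best, f)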