# Drought Prediction

This Jupyter notebook presents the final model, which was tuned and trained in the notebook `droughtPrediction_PyTorch_HP.ipynb`.  
Below, the model is evaluated on the test dataset, and a data pipeline is built for predicting on data in its original format.

In [9]:
# General purpose libraries
import pandas as pd
import pickle

# PyTorch libraries for deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Sklearn libraries for metrics
from sklearn.metrics import f1_score, mean_absolute_error

# torchviz for visualizing the model's computation graph
from torchviz import make_dot

In [10]:
# Check device, use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


The `DroughtClassifier` class is a neural network designed in `droughtPrediction_PyTorch_HP.ipynb` for drought prediction. It consists of multiple linear layers with ReLU activation and dropout for regularization.

In [11]:
class DroughtClassifier(nn.Module):
    """
    A neural network classifier for drought prediction.

    Args:
        input_size (int): The number of input features.
        hidden_sizes (list of int): A list containing the sizes of the hidden layers.
        output_size (int): The number of output classes.
        dropout_prob (float, optional): The probability of an element to be zeroed in dropout. Default is 0.5.

    Attributes:
        layers (nn.ModuleList): A list of linear layers.
        dropout (nn.Dropout): Dropout layer for regularization.
    """
    def __init__(self, input_size, hidden_sizes, output_size, dropout_prob=0.5):
        super(DroughtClassifier, self).__init__()
        self.layers = nn.ModuleList()
        
        # Input layer
        self.layers.append(nn.Linear(input_size, hidden_sizes[0]))
        
        # Hidden layers
        for i in range(len(hidden_sizes) - 1):
            self.layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i+1]))
        
        # Output layer
        self.layers.append(nn.Linear(hidden_sizes[-1], output_size))
        
        # Dropout layer
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, x):
        """
        Defines the forward pass of the neural network.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The output tensor after passing through the network.
        """
        # Apply each layer followed by ReLU activation and dropout, except the last layer
        for layer in self.layers[:-1]:
            x = self.dropout(F.relu(layer(x)))
        # Apply the last layer without activation or dropout
        x = self.layers[-1](x)
        return x

The `evaluate_model` function evaluates the model on the test dataset and prints metrics including loss, accuracy, Macro F1 Mean, and MAE Mean.

In [12]:
# Evaluation function
def evaluate_model(model, test_loader, criterion):
    """
    Evaluates the model on the test dataset and prints the test loss, accuracy, Macro F1 Mean, and MAE Mean.

    Args:
        model (nn.Module): The trained neural network model to be evaluated.
        test_loader (DataLoader): DataLoader for the test dataset.
        criterion (nn.Module): The loss function.

    Returns:
        None
    """
    model.eval() # Set model to evaluation mode
    running_loss = 0.0
    correct_predictions = 0
    all_labels = []
    all_preds = []

    with torch.no_grad():                                           # Disable gradient computation
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)   # Move inputs and labels to the selected device

            outputs = model(inputs)             # Forward pass
            loss = criterion(outputs, labels)   # Compute loss

            running_loss += loss.item() * inputs.size(0)    # Accumulate loss
            
            _, preds = torch.max(outputs, 1)                            # Get predictions
            correct_predictions += torch.sum(preds == labels).item()    # Count correct predictions

            all_labels.extend(labels.cpu().numpy())                     # Collect all labels
            all_preds.extend(preds.cpu().numpy())                       # Collect all predictions

    test_loss = running_loss / len(test_loader.dataset)         # Calculate average test loss
    accuracy = correct_predictions / len(test_loader.dataset)   # Calculate test accuracy
    macro_f1 = f1_score(all_labels, all_preds, average='macro') # Calculate Macro F1 Mean
    mae = mean_absolute_error(all_labels, all_preds)            # Calculate MAE Mean
    
    # Print test metrics
    print(f'Test Loss: {test_loss:.4f}, Accuracy: {accuracy:.4f}, Macro F1 Mean: {macro_f1:.4f}, MAE Mean: {mae:.4f}')
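
To make the last two metrics concrete, the sketch below computes macro F1 and MAE by hand on a small made-up set of labels. The values are illustrative only and do not come from the drought data. It assumes the class indices are ordered by severity, so MAE can be read as the average number of class levels by which a prediction misses.

```python
# Hypothetical labels and predictions for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

def f1_for_class(c, y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for a single class c."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

classes = sorted(set(y_true) | set(y_pred))
per_class_f1 = [f1_for_class(c, y_true, y_pred) for c in classes]

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(per_class_f1) / len(per_class_f1)

# MAE: average distance, in class levels, between prediction and truth
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(f'Macro F1: {macro_f1:.4f}, MAE: {mae:.4f}')  # Macro F1: 0.6556, MAE: 0.5000
```

Macro F1 weights every class equally, so a rarely occurring class affects the score as much as a common one.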

The `format_input_for_model` function formats data for prediction by repeating the preprocessing steps from `droughtPrediction_DataEng.ipynb`.  
It loads the saved `StandardScaler` and `PCA` objects that were fitted on the training dataset and applies them to the input.

In [13]:
def format_input_for_model(input_series, scaler_file='data/scaler.pkl', pca_model_file='data/pca_model.pkl'):
    """
    Format input data series for model prediction.

    Args:
    - input_series (pd.Series): Input series containing data to be formatted.
    - scaler_file (str): File path to the saved StandardScaler object.
    - pca_model_file (str): File path to the saved PCA model object.

    Returns:
    - numpy.ndarray: Transformed and scaled input data ready for model prediction.
    """
    # Step 1: Merge input_series with soil_df based on 'fips'
    soil_df = pd.read_csv('data/soil_data.csv')
    input_data = pd.DataFrame(input_series).T.merge(soil_df, on='fips', how='left')
    
    # Step 2: Drop unnecessary columns 'date' and 'fips' if they exist
    input_data.drop(columns=['date', 'fips'], inplace=True, errors='ignore')
    
    # Step 3: Load the saved StandardScaler object
    with open(scaler_file, 'rb') as file:
        scaler = pickle.load(file)
    
    # Step 4: Scale the input data
    scaled_data = scaler.transform(input_data)
    
    # Step 5: Load the saved PCA object and apply transformation
    with open(pca_model_file, 'rb') as file:
        pca_model = pickle.load(file)
    
    pca_transformed = pca_model.transform(scaled_data)
    
    return pca_transformed
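
The reason for reloading the fitted objects rather than refitting them is that new rows must go through exactly the same transformation as the training data. The sketch below illustrates this with random stand-in data. The array shapes and the number of PCA components are assumptions for the example, and the round trip is done in memory instead of through the `.pkl` files used above.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(100, 5))   # hypothetical training features
x_new = rng.normal(size=(1, 5))            # hypothetical single new row

# Fit the preprocessing objects on the training data only
scaler = StandardScaler().fit(X_train_demo)
pca = PCA(n_components=3).fit(scaler.transform(X_train_demo))

# Serialize and restore them (in memory here; the notebook uses files)
scaler_loaded = pickle.loads(pickle.dumps(scaler))
pca_loaded = pickle.loads(pickle.dumps(pca))

# The restored objects reproduce the training-time transformation
expected = pca.transform(scaler.transform(x_new))
restored = pca_loaded.transform(scaler_loaded.transform(x_new))
print(np.allclose(expected, restored))
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them.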

The `predict_formatted_input` function formats an input row and returns the class index predicted by the PyTorch model.

In [14]:
# Function to format input and make predictions
def predict_formatted_input(model, input_series):
    """
    Format input data series, make predictions using a PyTorch model.

    Args:
    - model (torch.nn.Module): PyTorch model to use for predictions.
    - input_series (pd.Series): Input series containing data to be formatted and predicted.

    Returns:
    - int: Predicted class index.
    """
    # Step 1: Format input for model
    formatted_input = format_input_for_model(input_series)
    
    # Step 2: Convert formatted_input to torch tensor
    input_tensor = torch.tensor(formatted_input, dtype=torch.float32)
    
    # Step 3: Ensure model is in evaluation mode and on CPU
    model.eval()
    model.cpu()  # Move model to CPU explicitly
    
    # Step 4: Move input_tensor to CPU if it's not already
    input_tensor = input_tensor.cpu()
    
    # Step 5: Perform prediction
    with torch.no_grad():
        output = model(input_tensor)
        _, predicted = torch.max(output, 1)  # Assuming classification task, get predicted class index
    
    return predicted.item()  # Return the predicted class as an integer
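
`predict_formatted_input` returns only the class index. Because the last layer of `DroughtClassifier` has no activation, the model outputs raw scores (logits), and a softmax is needed if per-class probabilities are wanted. The sketch below uses made-up logits to show that softmax changes the scale but not the predicted class.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw model outputs (logits) for one sample and 6 classes
logits = torch.tensor([[0.2, 1.5, 3.1, 0.4, -1.0, -2.3]])

# Softmax turns the logits into probabilities that sum to 1
probs = F.softmax(logits, dim=1)

# The arg max is the same before and after softmax
pred_from_logits = torch.argmax(logits, dim=1).item()
pred_from_probs = torch.argmax(probs, dim=1).item()

print(pred_from_logits, pred_from_probs)
print(round(probs.sum().item(), 4))
```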

Here we load the retrained model from disk.

In [15]:
# Load the retrained model
with open('saved_models/retrained_model_stepLR2.pkl', 'rb') as f:
    model = pickle.load(f)

Here we load the original data as well as the test dataset.  
We also convert the test dataset to PyTorch tensors and create DataLoaders to evaluate the model on.

In [16]:
drought_df = pd.read_csv('data/all_timeseries.csv')

# Load training and testing data from a pickle file
with open('data/Xy_trainTest.pkl', 'rb') as f:
    # Unpickle the data into training and testing datasets
    X_train, X_test, y_train, y_test = pickle.load(f)

# Convert data to PyTorch tensors
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.long)

# Create DataLoaders
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader  = DataLoader(test_dataset, batch_size=512, shuffle=False, num_workers=4, pin_memory=True)

#### Evaluating Model Statistics on Test Dataset

In [17]:
criterion = nn.CrossEntropyLoss()

print('Model Metrics on test dataset:')
evaluate_model(model, test_loader, criterion)

Model Metrics on test dataset:
Test Loss: 0.6352, Accuracy: 0.7337, Macro F1 Mean: 0.6895, MAE Mean: 0.3255


#### Use the Model Pipeline to Predict on the Original Data

In [18]:
def get_pred(row):
    input_series = drought_df.drop(columns=['score']).iloc[row]
    true_score = drought_df['score'].iloc[row]

    # Predict with the model
    predicted_class = predict_formatted_input(model, input_series)

    # Display input_series and prediction
    print(f"Input Series:\n{input_series}")
    print(f"\nPredicted Class: {predicted_class}")
    print(f"True Class     : {true_score}")

In [19]:
get_pred(row=1)

Input Series:
fips                 1001
date           2000-01-11
PRECTOT              1.33
PS                  100.4
QV2M                 6.63
T2M                 11.48
T2MDEW               7.84
T2MWET               7.84
T2M_MAX             18.88
T2M_MIN              5.72
T2M_RANGE           13.16
TS                  10.43
WS10M                1.76
WS10M_MAX            2.48
WS10M_MIN            1.05
WS10M_RANGE          1.43
WS50M                3.55
WS50M_MAX            6.38
WS50M_MIN            1.71
WS50M_RANGE          4.67
year                 2000
month                   1
day                    11
Name: 1, dtype: object

Predicted Class: 2
True Class     : 2


#### torchviz Visualization

In [23]:
# Visualize the model using torchviz
def visualize_model(model, input_size):
    model.eval()
    x = torch.randn(1, input_size)  # Create a dummy input tensor
    y = model(x)  # Perform a forward pass through the model
    make_dot(y, params=dict(model.named_parameters())).render("drought_model_simp", format="png")  
    make_dot(y, params=dict(model.named_parameters()), show_attrs=True, show_saved=True).render("drought_model_comp", format="png")  

# Assuming input_size is known (e.g., from X_test.shape[1])
input_size = X_test.shape[1]
visualize_model(model, input_size)

The rendered graph `drought_model_simp` shows the structure of the `DroughtClassifier` model:

1. **Input Layer:**
   - The input tensor has a shape of \((1, 52)\), indicating a batch size of 1 and 52 input features.

2. **First Linear Layer (layers.0):**
   - **weights:** `layers.0.weight (1024, 52)`
   - **bias:** `layers.0.bias (1024)`
   - This layer maps the 52 input features to 1024 features using a linear transformation.

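3. **First Activation and Dropout:**
   - The output from the first linear layer passes through a ReLU activation function, followed by a dropout layer for regularization.
   - ReLU appears in the graph as `ReluBackward0`. Because `visualize_model` calls `model.eval()`, dropout is inactive and adds no node of its own; the `TBackward0` nodes are the weight transposes feeding each linear layer.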

4. **Second Linear Layer (layers.1):**
   - **weights:** `layers.1.weight (512, 1024)`
   - **bias:** `layers.1.bias (512)`
   - This layer takes the 1024 features from the previous layer and maps them to 512 features.

5. **Second Activation and Dropout:**
   - Similar to the first layer, the output from the second linear layer passes through ReLU activation and dropout.
   - ReLU appears as `ReluBackward0`; dropout adds no node in evaluation mode.

6. **Third Linear Layer (layers.2):**
   - **weights:** `layers.2.weight (256, 512)`
   - **bias:** `layers.2.bias (256)`
   - This layer reduces the 512 features to 256 features.

7. **Third Activation and Dropout:**
   - Again, the output goes through ReLU activation and dropout.
   - ReLU appears as `ReluBackward0`; dropout adds no node in evaluation mode.

8. **Fourth Linear Layer (layers.3):**
   - **weights:** `layers.3.weight (128, 256)`
   - **bias:** `layers.3.bias (128)`
   - This layer further reduces the features from 256 to 128.

9. **Fourth Activation and Dropout:**
   - The output undergoes ReLU activation and dropout.
   - ReLU appears as `ReluBackward0`; dropout adds no node in evaluation mode.

10. **Fifth (Output) Linear Layer (layers.4):**
    - **weights:** `layers.4.weight (6, 128)`
    - **bias:** `layers.4.bias (6)`
    - This final layer maps the 128 features to 6 output classes (assuming a 6-class classification problem).

11. **Output:**
    - The final output tensor has a shape of \((1, 6)\) and holds one raw score (logit) for each of the 6 classes. The scores are not probabilities, because the last layer has no softmax.
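
The layer shapes listed above also determine the model's size. As a quick check, the sketch below counts the trainable parameters implied by those shapes, using only the sizes stated in the list.

```python
# Layer sizes as listed above: 52 inputs, four hidden layers, 6 outputs
layer_sizes = [52, 1024, 512, 256, 128, 6]

def linear_param_count(n_in, n_out):
    """Weights (n_out x n_in) plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [
    linear_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [54272, 524800, 131328, 32896, 774]
print(total)      # 744070
```

Most of the parameters sit in the second linear layer, which maps 1024 features to 512.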

**AccumulateGrad Nodes:**
- These nodes represent where the gradients are accumulated during the backward pass for updating the model parameters.

**AddmmBackward0 Nodes:**
- These nodes represent the fused matrix multiplication and bias addition (`addmm`) performed during the forward pass through each linear layer.
- In `drought_model_comp`, which is rendered with `show_attrs=True` and `show_saved=True`, each node also lists details of the operation, including the sizes and strides of the matrices involved (`mat1`, `mat2`) and the saved tensors.

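**ReluBackward0 and TBackward0 Nodes:**
- `ReluBackward0` nodes represent the ReLU activation applied to the output of each hidden linear layer.
- `TBackward0` nodes represent the transpose of each weight matrix before it enters the matrix multiplication. Dropout has no node here because the graph was built in evaluation mode.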

**mat1, mat2, and result Tensors:**
- `mat1` and `mat2` are the two operands of each matrix multiplication: the layer input and the transposed weight matrix.
- `result` is the output tensor saved by a ReLU node, which `ReluBackward0` uses to compute its gradient.
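
The node types can also be inspected without rendering an image. The sketch below walks the autograd graph of a tiny stand-in network and collects the node type names. The network and its layer sizes are hypothetical, and the exact names depend on the PyTorch version.

```python
import torch
import torch.nn as nn

def collect_node_names(fn):
    """Recursively gather the type names of all autograd nodes reachable from fn."""
    if fn is None:
        return set()
    names = {type(fn).__name__}
    for next_fn, _ in fn.next_functions:
        names |= collect_node_names(next_fn)
    return names

# Tiny stand-in network (hypothetical sizes), set to evaluation mode
tiny = nn.Sequential(
    nn.Linear(4, 3), nn.ReLU(), nn.Dropout(0.5), nn.Linear(3, 2)
).eval()
y = tiny(torch.randn(1, 4))

node_names = collect_node_names(y.grad_fn)
print(sorted(node_names))
```

On recent PyTorch releases this typically lists `AccumulateGrad`, `AddmmBackward0`, `ReluBackward0`, and `TBackward0`, with no dropout-related node.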