# Model Experiments for Financial Tweet Sentiment Analysis

This notebook demonstrates model training and evaluation for a sentiment analysis model focused on financial tweets. It leverages a BERT-based model to classify tweets as positive, negative, or neutral, providing valuable sentiment insights for financial analysis.

The notebook covers the following steps:

1. **Load and Prepare Data**: Loads the latest data loaders generated by the preprocessing pipeline.
2. **Define and Train the Model**: Uses a BERT-based architecture for training.
3. **Evaluate and Log Results**: Measures validation loss and accuracy after each epoch and logs them with MLflow, then registers the model and saves its weights.

---

**Objective**: To fine-tune a BERT-based model on financial tweet data and log performance metrics to track progress.

## Step 1: Load and Prepare Data

In this step, we load the latest data loaders generated from the preprocessing pipeline, which contain preprocessed
training and validation data for sentiment analysis.

The data loaders (`train_loader` and `val_loader`) yield batches of tokenized tweet text plus an additional `has_source` feature that flags the presence of a source link. These loaders ensure that data is efficiently batched for training and evaluation.

Using data loaders allows for streamlined processing and batching, making model training faster and memory-efficient.
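
As a rough illustration of what such a loader yields, the sketch below groups per-tweet records into batches keyed by field name. The field names mirror the keys used later in this notebook (`input_ids`, `attention_mask`, `has_source`, `labels`). The `collate` and `iter_batches` helpers and the toy token IDs are hypothetical, and the real loaders return PyTorch tensors rather than Python lists.

```python
from typing import Dict, Iterator, List

Sample = Dict[str, object]


def collate(samples: List[Sample]) -> Dict[str, list]:
    """Turn a list of per-tweet dicts into one dict of per-field lists."""
    return {key: [s[key] for s in samples] for key in samples[0]}


def iter_batches(samples: List[Sample], batch_size: int) -> Iterator[Dict[str, list]]:
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield collate(samples[start:start + batch_size])


# Toy records: token IDs are made up and padded to length 4 with zeros.
samples = [
    {"input_ids": [101, 2023, 102, 0], "attention_mask": [1, 1, 1, 0], "has_source": 1.0, "labels": 2},
    {"input_ids": [101, 4518, 2039, 102], "attention_mask": [1, 1, 1, 1], "has_source": 0.0, "labels": 0},
    {"input_ids": [101, 102, 0, 0], "attention_mask": [1, 1, 0, 0], "has_source": 0.0, "labels": 1},
]

batches = list(iter_batches(samples, batch_size=2))
print(len(batches))               # number of batches
print(batches[0]["has_source"])   # per-field values for the first batch
```

Each batch is a dictionary, which is why the training code below indexes it with `batch['input_ids']` and similar keys.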


In [1]:
import os

def get_project_root() -> str:
    return os.path.abspath(os.path.join(os.getcwd(), "../"))

In [2]:
print(get_project_root())

/Users/maxmartyshov/Desktop/IU/year3/PMDL/Sentiment_Analysis_for_Financial_News


In [3]:
os.chdir(get_project_root())

In [4]:
import sys

# Add `src` to the import path before importing project modules
src_path = os.path.join(os.getcwd(), 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

from pipelines.extract_training_data import extract_latest_loaders

dataloaders = extract_latest_loaders()
train_loader = dataloaders['train']
val_loader = dataloaders['validation']

The ZenML global configuration version (0.68.1) is higher than the version of ZenML currently being used (0.67.0). Read more about this issue and how to solve it here: https://docs.zenml.io/reference/global-settings#version-mismatch-downgrading


  return torch.load(f)  # nosec


Pipeline artifact: e2b98d81-fb4d-45c6-bc00-b422e73bdacc loaded successfully


## Step 2: Initialize the Sentiment Analysis Model

In this section, we initialize the sentiment analysis model, which is based on a BERT architecture with an additional layer to incorporate the `has_source` feature.

**Model Structure**:
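- The model uses a pre-trained BERT model (`bert-base-uncased`) as its base, with the embedding layers frozen.
- BERT's pooled output is concatenated with the `has_source` flag and passed through two fully connected layers (128 hidden units with ReLU, then three outputs) to classify sentiment as positive, neutral, or negative.
- A dropout layer (p = 0.4) is applied before the fully connected layers for regularization to help prevent overfitting.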

**Purpose**: Initializing the model here enables us to fine-tune it specifically for the task of sentiment analysis on financial tweets, leveraging BERT's language understanding along with the custom feature representing source presence.
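
To make the tensor shapes concrete, here is a minimal sketch of the classification head on its own. It assumes a hidden size of 768 (the value for `bert-base-uncased`) and uses a random tensor as a stand-in for BERT's pooled output, so no pretrained weights are downloaded. The batch size and flag values are made up for illustration.

```python
import torch
import torch.nn as nn

hidden_size = 768   # hidden size of bert-base-uncased
batch_size = 4

# Stand-in for BERT's pooled output; random values, no pretrained weights involved.
pooled = torch.randn(batch_size, hidden_size)
# One scalar flag per tweet: 1.0 if the tweet contains a source link, else 0.0.
has_source = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Append the flag as one extra feature column: (4, 768) + (4, 1) -> (4, 769).
combined = torch.cat((pooled, has_source.unsqueeze(1)), dim=1)

# Same layer order as the model in this notebook: dropout, FC, ReLU, FC.
head = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(hidden_size + 1, 128),
    nn.ReLU(),
    nn.Linear(128, 3),
)
head.eval()  # disable dropout for a deterministic forward pass

with torch.no_grad():
    logits = head(combined)

print(combined.shape, logits.shape)
```

The extra `+ 1` in the first linear layer's input size accounts for the single `has_source` column appended to the pooled output.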


In [5]:
import torch.nn as nn
import torch
from transformers import BertModel


class SentimentAnalysisModel(nn.Module):
    def __init__(self, bert_model_name='bert-base-uncased', num_labels=3):
        super(SentimentAnalysisModel, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        
        # Freeze the BERT embedding layers to retain base language understanding
        for param in self.bert.embeddings.parameters():
            param.requires_grad = False
        
        # Dropout and two fully connected layers
        self.dropout = nn.Dropout(0.4)  # Increased dropout to 0.4
        self.fc1 = nn.Linear(self.bert.config.hidden_size + 1, 128)
        self.fc2 = nn.Linear(128, num_labels)

    def forward(self, input_ids, attention_mask, has_source):
        embeddings = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
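        # (batch,) -> (batch, 1), cast to the embeddings' dtype so the two can be concatenated
        has_source = has_source.unsqueeze(1).to(embeddings.dtype)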
        combined_input = torch.cat((embeddings, has_source), dim=1)
        
        # Passing through dropout and two FC layers with ReLU
        x = self.dropout(combined_input)
        x = torch.relu(self.fc1(x))
        logits = self.fc2(x)
        return logits


## Step 3: Training and Validation - One Epoch

In this section, we define functions to train and validate the model within a single epoch.

### Training Workflow
- The model is set to training mode, enabling dropout and allowing gradient updates.
- For each batch in the training data loader:
  - We perform a forward pass to calculate predictions.
  - We calculate the loss between predictions and true labels.
  - We backpropagate the loss to compute gradients, and the optimizer updates the model parameters.
- Training loss is recorded to track the model’s learning progress.

### Validation Workflow
- The model is switched to evaluation mode, which disables dropout layers for stable predictions.
- For each batch in the validation data loader:
  - We perform a forward pass to calculate predictions without computing gradients.
  - We calculate the loss and accuracy to measure model performance on unseen data.
- Validation loss and accuracy are tracked to assess the model’s generalization capabilities.

**Purpose**: Training and validating one epoch at a time allows us to monitor model performance and detect potential issues like overfitting, guiding further training adjustments as needed.
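
Both functions below accumulate `loss.item() * batch_size` and divide by the total number of samples, rather than averaging the per-batch means. The following sketch, with made-up batch losses and sizes, shows why the two differ when the last batch is smaller.

```python
# Hypothetical per-batch mean losses and batch sizes; the last batch is smaller.
batch_losses = [0.50, 0.40, 0.10]
batch_sizes = [32, 32, 8]

# Naive average of batch means: every batch counts equally.
naive_avg = sum(batch_losses) / len(batch_losses)

# Sample-weighted average: every sample counts equally.
weighted_sum = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
total = sum(batch_sizes)
weighted_avg = weighted_sum / total

print(f"naive={naive_avg:.4f} weighted={weighted_avg:.4f}")
```

The weighted form gives the true per-sample mean for the epoch, which keeps the logged loss comparable across runs with different batch sizes.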


In [6]:
from tqdm import tqdm

import mlflow
import mlflow.pytorch

def train_one_epoch(model, dataloader, optimizer, criterion, device, epoch):
    model.train()
    train_loss = 0.0
    total = 0.

    loop = tqdm(
        enumerate(dataloader, 1),
        total=len(dataloader),
        desc=f"Epoch {epoch}: train",
        leave=True,
    )

    for _, batch in loop:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        has_source = batch['has_source'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        logits = model(input_ids = input_ids, attention_mask=attention_mask, has_source=has_source)

        loss = criterion(logits, labels)

        loss.backward()
        optimizer.step()

        train_loss += loss.item() * input_ids.size(0)
        total += labels.size(0)

        loop.set_postfix({"loss": train_loss/total})

    avg_train_loss = train_loss / total
    mlflow.log_metric('train_loss', avg_train_loss, step=epoch)


def val_one_epoch(model, dataloader, criterion, device, epoch, best_so_far, ckpt_name='model'):
    model.eval()
    val_loss = 0.
    correct = 0.
    total = 0.
    with torch.no_grad():
        loop = tqdm(
            enumerate(dataloader, 1),
            total=len(dataloader),
            desc=f"Epoch {epoch}: val",
            leave=True,
        )
        for i, batch in loop:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            has_source = batch['has_source'].to(device)
            labels = batch['labels'].to(device)

            logits = model(input_ids=input_ids, attention_mask=attention_mask, has_source=has_source)

            loss = criterion(logits, labels)
            val_loss += loss.item() * input_ids.size(0)

            _, preds = torch.max(logits, dim=1)
            correct += (preds == labels).sum().item()

            total += labels.size(0)

            loop.set_postfix({"loss": val_loss/total, "acc": correct / total})
        current_acc = correct / total

        avg_val_loss = val_loss / total
        mlflow.log_metric('validation_loss', avg_val_loss, step=epoch)
        mlflow.log_metric('validation_accuracy', current_acc, step=epoch)


        if current_acc > best_so_far:
            print(f"Validation accuracy improved from {best_so_far:.4f} to {current_acc:.4f}. Saving model...")
            mlflow.pytorch.log_model(model, ckpt_name)

            best_so_far = current_acc
    return best_so_far

## Step 4: Model Registration and Updating the Champion Model

In this step, we register the trained model in the MLflow Model Registry and point the "champion" alias at the best-performing version. Registering models and assigning aliases helps us manage multiple model versions and makes it easy to deploy the current best model.

### Model Registration
- **Purpose**: Registers the trained model under a specified name in the MLflow Model Registry, linking it to a unique run ID.
- **Signature**: An input and output schema (model signature) can be attached when the model is logged, to verify the input format at deployment; the `log_model` call in this notebook does not pass one.
- **Result**: This registered model can then be versioned, tracked, and deployed as needed.

### Updating the Champion Model
- **Champion Alias**: The "champion" alias points to the best-performing model version, making it easy to retrieve and use in production.
- **Updating the Alias**: After registering the model, we compare the logged validation accuracy of every registered version and assign the "champion" alias to the version with the highest value.

By registering the model and updating the champion alias, we maintain organized model versions and ensure easy access to the top-performing model for deployment.
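
The selection rule behind the champion alias amounts to taking the maximum over the versions that have the chosen metric. Here is a minimal sketch of that rule, with plain dictionaries standing in for MLflow's model-version and run objects. The `pick_champion` helper, the version numbers, and the metric values are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_champion(versions: List[Dict], metric_name: str) -> Optional[str]:
    """Return the version whose run has the highest value of metric_name."""
    scored = [v for v in versions if metric_name in v["metrics"]]
    if not scored:
        return None
    best = max(scored, key=lambda v: v["metrics"][metric_name])
    return best["version"]


# Hypothetical registry contents: version 2 lacks the metric and is skipped.
versions = [
    {"version": "1", "metrics": {"validation_accuracy": 0.88}},
    {"version": "2", "metrics": {"train_loss": 0.20}},
    {"version": "3", "metrics": {"validation_accuracy": 0.90}},
]

champion = pick_champion(versions, "validation_accuracy")
print(champion)
```

Versions whose runs never logged the metric are ignored rather than treated as zero, which matches the `if metric_name in run.data.metrics` check in the function below.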


In [7]:
from mlflow.tracking import MlflowClient

def register_model(run_id, model_name, description):
    """
    Registers the incoming model from the specified run without modifying the Champion tag.
    
    Parameters:
    - run_id: str, the ID of the run where the model is logged.
    - model_name: str, the name of the model in the registry.
    - description: str, a description for the model version.
    
    Returns:
    - version: int, the version of the registered model.
    """
    client = MlflowClient()
    model_uri = f"runs:/{run_id}/{model_name}"
    result = mlflow.register_model(model_uri, model_name)
    print(f"Model registered with name '{model_name}' and version '{result.version}'")
    client.update_model_version(
        name=model_name,
        version=result.version,
        description=description,
    )
    return result.version


In [8]:
def update_champion_alias(model_name, metric_name="validation_accuracy"):
    """
    Goes through all model versions, checks their metrics, and assigns the "Champion" alias to the best-performing model.
    
    Parameters:
    - model_name: str, the name of the model in the MLflow Model Registry.
    - metric_name: str, the metric to base the Champion selection on (default is 'validation_accuracy').
    
    Returns:
    - champion_version: int, the version of the model that is now assigned the "Champion" alias.
    """
    client = MlflowClient()
    
    # Search for all registered versions of the model
    versions = client.search_model_versions(f"name='{model_name}'")
    
    # Initialize variables to track the best version based on the metric
    best_version = None
    best_metric_value = -float('inf')  # Assume we're maximizing the metric (e.g., accuracy)
    best_run_id = None

    # Go through all versions to find the one with the best metric
    for version in versions:
        run_id = version.run_id
        # Get the run's metrics
        run = client.get_run(run_id)
        
        if metric_name in run.data.metrics:
            metric_value = run.data.metrics[metric_name]
            if metric_value > best_metric_value:
                best_metric_value = metric_value
                best_version = version.version
                best_run_id = run_id
    
    # Check if a best version was found
    if best_version is None:
        raise ValueError(f"No models found with metric '{metric_name}'")

    # Reassign the "champion" alias to the best version
    client.set_registered_model_alias(
        name=model_name,
        alias="champion",
        version=best_version
    )
    print(f"Model version {best_version} from run {best_run_id} assigned as 'champion' with {metric_name}: {best_metric_value}")

    return best_version


## Step 5: Main Training Loop

In this section, we define the main training loop, where the model is trained and validated across multiple epochs. During each epoch, the model undergoes a cycle of training on the training dataset and then validation on the validation dataset to assess its performance.

### Workflow
1. **Training Phase**: 
   - For each epoch, the model performs forward and backward passes on batches from the training set, adjusting weights based on the calculated gradients.
   - Training loss is recorded and logged for each epoch to monitor the model’s learning progress.

2. **Validation Phase**: 
   - After each training phase, the model is evaluated on the validation set without updating weights.
   - Validation metrics, including loss and accuracy, are tracked to assess the model’s generalization ability on unseen data.

3. **Logging with MLflow**:
   - For each epoch, training and validation metrics (loss and accuracy) are logged in MLflow, enabling us to track performance over time and compare results across epochs.

**Purpose**: The main training loop facilitates iterative learning, helping to evaluate model performance and detect issues like overfitting. Tracking metrics across epochs allows us to determine the point of optimal performance and informs further tuning efforts.


In [9]:
import torch.optim as optim
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
project_root = os.path.abspath(os.getcwd())
mlflow.set_tracking_uri(f"file://{project_root}/mlruns")

epochs = 10
device = 'mps'
model_name = 'simple_sentiment_analysis_model'
model_folder = "simple_sentiment_analysis_model"
lr = 1e-5

model_description = "BERT with two FC layers (128, num_labels), 0.4 dropout, embeddings frozen, with ReLU activation."

model = SentimentAnalysisModel(bert_model_name='bert-base-uncased', num_labels=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

best_so_far = 0.
mlflow.set_experiment("SentimentAnalysis")

with mlflow.start_run():
    mlflow.log_param("learning_rate", lr)
    mlflow.log_param("epochs", epochs)
    run = mlflow.active_run()
    run_id = run.info.run_id
    for epoch in range(epochs):
        train_one_epoch(model, train_loader, optimizer, criterion, device, epoch)
        best_so_far = val_one_epoch(model, val_loader, criterion, device, epoch, best_so_far, model_name)
    register_model(run_id, model_name, model_description)
    update_champion_alias(model_name)

Epoch 0: train: 100%|██████████| 344/344 [02:00<00:00,  2.85it/s, loss=0.768]
Epoch 0: val: 100%|██████████| 170/170 [00:20<00:00,  8.26it/s, loss=0.431, acc=0.852]


Validation accuracy improved from 0.0000 to 0.8518. Saving model...


Epoch 1: train: 100%|██████████| 344/344 [03:39<00:00,  1.56it/s, loss=0.36] 
Epoch 1: val: 100%|██████████| 170/170 [00:43<00:00,  3.94it/s, loss=0.329, acc=0.89] 


Validation accuracy improved from 0.8518 to 0.8895. Saving model...


Epoch 2: train: 100%|██████████| 344/344 [02:56<00:00,  1.95it/s, loss=0.227]
Epoch 2: val: 100%|██████████| 170/170 [00:25<00:00,  6.78it/s, loss=0.307, acc=0.897]


Validation accuracy improved from 0.8895 to 0.8973. Saving model...


Epoch 3: train: 100%|██████████| 344/344 [02:30<00:00,  2.28it/s, loss=0.149]
Epoch 3: val: 100%|██████████| 170/170 [00:26<00:00,  6.30it/s, loss=0.331, acc=0.901]


Validation accuracy improved from 0.8973 to 0.9006. Saving model...


Epoch 4: train:   8%|▊         | 28/344 [00:14<02:43,  1.93it/s, loss=0.102]
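
The per-epoch values in the output above can be compared directly to see where performance peaks. The sketch below copies the reported numbers for epochs 0 to 3 (epoch 4 is incomplete in the captured output) and is only a quick reading of this one run, not a general stopping rule.

```python
# Per-epoch values copied from the progress output above (epochs 0-3).
train_loss = [0.768, 0.360, 0.227, 0.149]
val_loss = [0.431, 0.329, 0.307, 0.331]
val_acc = [0.8518, 0.8895, 0.8973, 0.9006]

best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__)
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss; a widening gap hints at overfitting.
gaps = [round(v - t, 3) for v, t in zip(val_loss, train_loss)]

print(best_acc_epoch, best_loss_epoch, gaps)
```

In this run, validation accuracy is highest at epoch 3 while validation loss is lowest at epoch 2, and the gap between validation and training loss grows with each epoch.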


## Step 6: Saving the Model to the Models Folder

After training, we load the champion model from the registry and save its weights to the `models` folder. This allows for easy retrieval of the trained model for further experimentation, deployment, or fine-tuning.

### Saving Workflow
- **Model Weights**: The learned weights are saved as a `state_dict` in `model_weights.pth`; they capture the model's optimized parameters for making predictions.
- **Model Architecture**: A `state_dict` does not store the architecture, so the `SentimentAnalysisModel` class must be instantiated from code before the weights are loaded back.

Saving the model in the `models` folder ensures that we have a stored version of the trained model readily available, enabling future use or deployment without re-training from scratch.
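
A `state_dict` contains only parameter tensors, so reloading means instantiating the model class first and then loading the weights into it. The round trip can be sketched with a small hypothetical module, `TinyHead`, standing in for the full BERT model. The temporary directory is also an assumption made so the example runs anywhere.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyHead(nn.Module):
    """Hypothetical stand-in for the full model, small enough to run anywhere."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(x)


model = TinyHead()

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "model_weights.pth")
    # Save only the parameters, not the class definition.
    torch.save(model.state_dict(), weights_path)

    # Reloading: build the architecture from code, then load the weights into it.
    restored = TinyHead()
    restored.load_state_dict(torch.load(weights_path, map_location="cpu"))

restored.eval()
x = torch.ones(2, 4)
same_output = torch.equal(model(x), restored(x))
print(same_output)
```

For the model in this notebook, the same pattern would use `SentimentAnalysisModel(...)` in place of `TinyHead()` and the saved `model_weights.pth` path.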


In [34]:
mlflow.get_tracking_uri()

'file:///Users/maxmartyshov/Desktop/IU/year3/PMDL/Sentiment_Analysis_for_Financial_News/mlruns'

In [35]:
import os
from src.inference import load_model_from_registry

# Load the model from the registry
model = load_model_from_registry(model_name, 'champion')

# Define the path to save model weights
model_folder_path = os.path.join(os.getcwd(), 'models', model_folder)
model_weights_pth_path = os.path.join(model_folder_path, 'model_weights.pth')

# Create the folder if it does not exist
os.makedirs(model_folder_path, exist_ok=True)

# Save the model weights to a .pth file
torch.save(model.state_dict(), model_weights_pth_path)

print(f"Model weights saved to {model_weights_pth_path}")


Model bert_lstm_sentiment_analysis_model loaded successfully
Model weights saved to /Users/maxmartyshov/Desktop/IU/year3/PMDL/Sentiment_Analysis_for_Financial_News/models/bert_lstm_sentiment_analysis_model/model_weights.pth
