# **COMGRA NLP TASK**

This notebook trains a simple sentiment analysis model on the IMDB dataset using PyTorch. It performs the following tasks:

1. Data Preparation: Defines functions to load and preprocess the IMDB data and a custom Dataset class that builds a vocabulary and converts text reviews into vector representations.
2. Model Definition: Implements a simple feed-forward neural network for sentiment classification.
3. Training Loop: Runs the training (and validation) loop, using a global step counter for monotonic training steps. In each batch, key tensors and gradients are recorded using Comgra.
4. Testing & Finalization: Defines a prediction function to test the model on sample reviews and finalizes the Comgra recording.
5. Comgra Server Launch: Starts a Comgra server to view the recorded training data. The iframe embedding used here is specific to Google Colab.



In [1]:
# Setup
!pip install comgra==0.11.5 --quiet


# **Imports and Device Setup**

This cell imports the required libraries and selects the computing device (CUDA if available).

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import comgra
from comgra.objects import DecisionMakerForRecordingsFrequencyPerType
from comgra.recorder import ComgraRecorder
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import re
import subprocess
import time
from google.colab import output
from IPython.display import display, HTML

# Device configuration: use CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


Using device: cuda


# **Data Preparation Functions and Dataset Definition**
This cell defines functions to load and preprocess the IMDB dataset. It also defines a custom Dataset class (MovieReviewDataset) that builds a vocabulary and converts text to vector representations.
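
The vocabulary and vector steps can be sketched without PyTorch. This is a minimal illustration, not the notebook's class: it uses plain lists instead of tensors, and the toy corpus and the helper names `build_vocabulary` and `text_to_counts` are assumptions made for the example. Indices 0 and 1 are reserved for `<PAD>` and `<UNK>`, as in `MovieReviewDataset`.

```python
from collections import Counter

PAD_IDX, UNK_IDX = 0, 1  # reserved indices, mirroring the notebook


def build_vocabulary(texts, vocab_size):
    """Map the (vocab_size - 2) most frequent words to indices 2, 3, ..."""
    counts = Counter(word for text in texts for word in text.split())
    word_to_idx = {"<PAD>": PAD_IDX, "<UNK>": UNK_IDX}
    for idx, (word, _) in enumerate(counts.most_common(vocab_size - 2), start=2):
        word_to_idx[word] = idx
    return word_to_idx


def text_to_counts(text, word_to_idx, vocab_size, max_length=200):
    """Return a bag-of-words count vector; unknown words are counted at UNK_IDX."""
    vector = [0] * vocab_size
    for word in text.split()[:max_length]:
        vector[word_to_idx.get(word, UNK_IDX)] += 1
    return vector


# Hypothetical toy corpus: "good" appears 3 times, "movie" twice
toy_texts = ["good movie good plot good", "bad movie"]
vocab = build_vocabulary(toy_texts, vocab_size=4)  # room for 2 real words
vector = text_to_counts("good good movie awful", vocab, vocab_size=4)

print(vocab)   # {'<PAD>': 0, '<UNK>': 1, 'good': 2, 'movie': 3}
print(vector)  # [0, 1, 2, 1]
```

The word "awful" is not in the vocabulary, so it is counted at index 1. Word order is discarded, which is why the downstream model can be a plain feed-forward network.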

In [3]:
def load_imdb_data(num_samples=5000):
    url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
    df = pd.read_csv(url)
    df = df.sample(n=min(num_samples, len(df)), random_state=42)
    df['sentiment'] = (df['sentiment'] == 'positive').astype(int)
    return df['review'].tolist(), df['sentiment'].tolist()

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = ' '.join(text.split())
    return text

class MovieReviewDataset(Dataset):
    def __init__(self, texts, labels, vocab_size=5000, max_length=200, word_to_idx=None):
        self.texts = [preprocess_text(text) for text in texts]
        self.labels = labels
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.word_to_idx = word_to_idx if word_to_idx is not None else self._build_vocabulary()

    def _build_vocabulary(self):
        words = ' '.join(self.texts).split()
        word_counts = Counter(words)
        # Reserve two indices for <PAD> and <UNK>
        common_words = dict(word_counts.most_common(self.vocab_size - 2))
        word_to_idx = {word: idx + 2 for idx, word in enumerate(common_words.keys())}
        word_to_idx['<PAD>'] = 0
        word_to_idx['<UNK>'] = 1
        return word_to_idx

    def text_to_vector(self, text):
        vector = torch.zeros(self.vocab_size)
        for word in text.split()[:self.max_length]:
            idx = self.word_to_idx.get(word, 1)  # Use 1 for unknown words
            vector[idx] += 1
        return vector

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        vector = self.text_to_vector(text)
        return vector, torch.tensor(label, dtype=torch.float32)


# **Model Definition**

This cell defines the neural network model (SentimentModel), which is a simple feed-forward network with dropout, ReLU activations, and a sigmoid output.
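
To get a feel for the model's size, the trainable parameters can be counted by hand. The sketch below assumes the default `vocab_size=5000` from `MovieReviewDataset` and the layer sizes 128, 64, and 1 used in `SentimentModel`. The helper `linear_param_count` is a name introduced here for illustration. Dropout, ReLU, and sigmoid have no trainable parameters.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


vocab_size = 5000  # default vocab_size in MovieReviewDataset
layer_shapes = [(vocab_size, 128), (128, 64), (64, 1)]

per_layer = [linear_param_count(i, o) for i, o in layer_shapes]
total = sum(per_layer)

print(per_layer)  # [640128, 8256, 65]
print(total)      # 648449
```

Almost all of the parameters sit in the first layer, because its input width equals the vocabulary size.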

In [4]:
class SentimentModel(nn.Module):
    def __init__(self, vocab_size):
        super(SentimentModel, self).__init__()
        self.layer1 = nn.Linear(vocab_size, 128)
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 1)
        self.dropout = nn.Dropout(0.3)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.dropout(self.relu(self.layer1(x)))
        x = self.dropout(self.relu(self.layer2(x)))
        x = self.sigmoid(self.layer3(x))
        return x


# **Initialize Comgra Recorder**

This cell initializes the Comgra recorder. The recorder (accessed as comgra.my_recorder) records tensors (inputs, outputs, targets, loss), gradients, and KPIs during training.

In [5]:
comgra.my_recorder = ComgraRecorder(
    comgra_root_path="/content",
    group="movie_sentiment_analysis",
    trial_id="imdb_trial",  # Use a unique trial_id if you run multiple sessions
    decision_maker_for_recordings=DecisionMakerForRecordingsFrequencyPerType(min_training_steps_difference=10),
)

**comgra.my_recorder:** Used throughout the training process to record batches, iterations, tensors, gradients, and KPIs.

# **Data Loading, Splitting, and DataLoader Setup**

This cell loads the IMDB data, splits it into training and validation sets, and creates PyTorch DataLoader objects. Notice that the training dataset builds the vocabulary which is then reused by the validation dataset.
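
Why the validation set must reuse the training vocabulary can be shown with a small sketch. The corpora and the helper `build_vocabulary` are assumptions made for this example. Building a second vocabulary from the validation texts would assign different indices to the same words, so the model's input columns would no longer mean the same thing.

```python
from collections import Counter


def build_vocabulary(texts, vocab_size):
    """Most frequent words get indices 2, 3, ...; 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in text.split())
    word_to_idx = {"<PAD>": 0, "<UNK>": 1}
    for idx, (word, _) in enumerate(counts.most_common(vocab_size - 2), start=2):
        word_to_idx[word] = idx
    return word_to_idx


# Hypothetical toy splits with clearly different word frequencies
train_texts = ["great film great film great", "great film dull dull cast"]
val_texts = ["dull dull dull script script", "great"]

train_vocab = build_vocabulary(train_texts, vocab_size=5)
val_vocab = build_vocabulary(val_texts, vocab_size=5)

# Separate vocabularies disagree on the index of the same word
print(train_vocab["great"], val_vocab["great"])  # 2 4

# With the shared training vocabulary, unseen validation words fall back to <UNK>
print(train_vocab.get("script", 1))  # 1
```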

In [6]:
print("Loading IMDB dataset...")
texts, labels = load_imdb_data()

# Split data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Create dataset instances (validation uses the training vocabulary)
train_dataset = MovieReviewDataset(train_texts, train_labels)
val_dataset = MovieReviewDataset(val_texts, val_labels,
                                 vocab_size=train_dataset.vocab_size,
                                 word_to_idx=train_dataset.word_to_idx)

# Create DataLoaders for batch processing
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)


Loading IMDB dataset...


# **Model, Loss, Optimizer Setup, and Comgra Module Tracking**
In this cell, the model is instantiated and moved to the correct device (GPU if available). It also defines the loss function (Binary Cross Entropy) and optimizer (Adam). The model is then registered with the Comgra recorder using comgra.my_recorder.track_module.
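
Binary cross entropy can be worked through by hand to see what the loss values in the training log mean. The sketch below is a plain-Python version of the formula, averaged over the batch as `nn.BCELoss` does with its default `reduction="mean"`. The function name and the example probabilities are assumptions chosen for illustration.

```python
import math


def binary_cross_entropy(probabilities, targets):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over the batch."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)


targets = [1.0, 0.0]  # one positive and one negative review

confident_right = binary_cross_entropy([0.9, 0.1], targets)
confident_wrong = binary_cross_entropy([0.1, 0.9], targets)

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.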

In [7]:
# Create the model and move it to the selected device
model = SentimentModel(train_dataset.vocab_size).to(device)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Register the model with Comgra for tracking
comgra.my_recorder.track_module("movie_sentiment_model", model)


**comgra.my_recorder.track_module:** Registers the model so that its weights and architecture are tracked during training.

# **Training Loop with Global Step and CUDA Support**
This cell contains the main training loop. For each batch, the Comgra recorder records inputs, outputs, loss, and gradients. A global step counter ensures that the training step increases monotonically.
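
The difference between the per-epoch batch index and the global step can be sketched as follows. The numbers assume the full 5000 reviews are sampled, so the 80/20 split leaves 4000 training reviews, and a batch size of 32 as in the DataLoader setup. The loop only counts steps and trains nothing.

```python
import math

num_epochs = 3
num_train_samples = 4000  # 80% of the 5000 sampled reviews (assumed)
batch_size = 32
batches_per_epoch = math.ceil(num_train_samples / batch_size)

global_step = 0
steps_seen = []
for epoch in range(num_epochs):
    for batch_idx in range(batches_per_epoch):
        # batch_idx restarts at 0 every epoch; global_step never does
        steps_seen.append((epoch, batch_idx, global_step))
        global_step += 1

print(batches_per_epoch)              # 125
print(steps_seen[batches_per_epoch])  # (1, 0, 125): first batch of epoch 2
print(global_step)                    # 375
```

Passing `batch_idx` to `start_batch` instead would repeat the step numbers 0 to 124 in every epoch, which is what the global counter avoids.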

In [8]:
num_epochs = 3
best_val_accuracy = 0
global_step = 0  # Global counter for monotonic training steps

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    correct_train = 0
    total_train = 0

    for batch_idx, (inputs, targets) in enumerate(train_loader):
        # Move data to device
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Start a new batch in Comgra with the current global step
        comgra.my_recorder.start_batch(global_step, inputs.shape[0])
        comgra.my_recorder.start_iteration()

        # Forward pass and record tensors
        comgra.my_recorder.register_tensor("inputs", inputs, is_input=True)
        outputs = model(inputs)
        comgra.my_recorder.register_tensor("outputs", outputs)
        comgra.my_recorder.register_tensor("targets", targets.unsqueeze(1), is_target=True)

        loss = criterion(outputs, targets.unsqueeze(1))
        total_loss += loss.item()

        # Calculate training accuracy for the batch
        predictions = (outputs >= 0.5).float()
        correct_train += (predictions == targets.unsqueeze(1)).sum().item()
        total_train += targets.size(0)

        # Record the loss in Comgra
        comgra.my_recorder.register_tensor("loss", loss, is_loss=True)
        comgra.my_recorder.record_kpi_in_graph("loss", "", loss)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Record gradients and finish the iteration/batch
        comgra.my_recorder.record_current_gradients("gradients")
        comgra.my_recorder.finish_iteration()
        comgra.my_recorder.finish_batch()

        global_step += 1  # Ensure the training step is monotonically increasing

    # -------------------------
    # Validation Phase
    # -------------------------
    model.eval()
    val_loss = 0
    correct_val = 0
    total_val = 0

    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs, targets.unsqueeze(1)).item()
            predictions = (outputs >= 0.5).float()
            correct_val += (predictions == targets.unsqueeze(1)).sum().item()
            total_val += targets.size(0)

    train_accuracy = 100 * correct_train / total_train
    val_accuracy = 100 * correct_val / total_val
    avg_loss = total_loss / len(train_loader)
    avg_val_loss = val_loss / len(val_loader)

    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"Training Loss: {avg_loss:.4f}, Training Accuracy: {train_accuracy:.2f}%")
    print(f"Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.2f}%")
    print("-" * 60)

    # Record KPIs using Comgra
    comgra.my_recorder.record_kpi_in_graph("train_accuracy", "", train_accuracy)
    comgra.my_recorder.record_kpi_in_graph("val_accuracy", "", val_accuracy)

    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy

# Finalize the Comgra recording session
comgra.my_recorder.finalize()




Epoch 1/3
Training Loss: 0.5330, Training Accuracy: 74.30%
Validation Loss: 0.4033, Validation Accuracy: 81.90%
------------------------------------------------------------
Epoch 2/3
Training Loss: 0.2669, Training Accuracy: 89.78%
Validation Loss: 0.4727, Validation Accuracy: 80.20%
------------------------------------------------------------
Epoch 3/3
Training Loss: 0.1530, Training Accuracy: 94.28%
Validation Loss: 0.4574, Validation Accuracy: 81.80%
------------------------------------------------------------


Important Comgra Objects and Methods:

1. comgra.my_recorder.start_batch / start_iteration: Begin tracking a new batch and iteration.
2. register_tensor: Logs important tensors (inputs, outputs, targets, loss).
3. record_current_gradients: Captures the current gradients.
4. record_kpi_in_graph: Logs key performance indicators (e.g., loss, accuracy).
5. finalize: Ends the recording session.
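
The order in which these calls appear in the training loop can be made explicit with a stand-in object. `CallOrderRecorder` below is a hypothetical stub written for this sketch: it is not part of Comgra, it records nothing, and it only logs the names of the methods called on it. The method names and arguments are copied from the training cell.

```python
class CallOrderRecorder:
    """Hypothetical stand-in that only logs call names; NOT comgra's recorder."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown method name becomes a function that logs the name
        def log_call(*args, **kwargs):
            self.calls.append(name)
        return log_call


recorder = CallOrderRecorder()
for training_step in range(2):
    recorder.start_batch(training_step, 32)
    recorder.start_iteration()
    recorder.register_tensor("inputs", None, is_input=True)
    recorder.register_tensor("outputs", None)
    recorder.register_tensor("targets", None, is_target=True)
    recorder.register_tensor("loss", None, is_loss=True)
    recorder.record_kpi_in_graph("loss", "", None)
    # In the notebook, loss.backward() and optimizer.step() run here
    recorder.record_current_gradients("gradients")
    recorder.finish_iteration()
    recorder.finish_batch()
recorder.finalize()  # once, after all training steps

print(recorder.calls[:2])   # ['start_batch', 'start_iteration']
print(recorder.calls[-1])   # finalize
```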

# **Model Prediction and Testing**

This cell defines a function to predict sentiment for new text reviews and then tests the model on a few examples.
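
The model's output is the probability of the positive class, so a low value means a confident negative prediction. A small sketch of the thresholding step follows. The helper `label_from_probability` is a name introduced here, and the two probabilities are the ones printed in this notebook's test output.

```python
def label_from_probability(p_positive, threshold=0.5):
    """Turn P(positive) into a label and the confidence in that label."""
    label = "Positive" if p_positive >= threshold else "Negative"
    confidence = p_positive if label == "Positive" else 1.0 - p_positive
    return label, confidence


positive_label, positive_conf = label_from_probability(0.8271)
negative_label, negative_conf = label_from_probability(0.1139)

print(positive_label, f"{positive_conf:.2%}")  # Positive 82.71%
print(negative_label, f"{negative_conf:.2%}")  # Negative 88.61%
```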

In [9]:
def predict_sentiment(model, text, dataset, device):
    model.eval()
    processed_text = preprocess_text(text)
    vector = dataset.text_to_vector(processed_text)
    vector = vector.to(device)
    with torch.no_grad():
        output = model(vector.unsqueeze(0))
    probability = output.item()
    return probability, "Positive" if probability >= 0.5 else "Negative"

# Test the model on example reviews
test_reviews = [
    "This movie was absolutely fantastic! Great acting and amazing plot.",
    "I couldn't finish watching it. The plot was boring and the acting was terrible.",
]

print("\nTesting model on example reviews:")
for review in test_reviews:
    prob, sentiment = predict_sentiment(model, review, train_dataset, device)
    print(f"\nReview: {review}")
    print(f"Sentiment: {sentiment} (P(positive): {prob:.2%})")



Testing model on example reviews:

Review: This movie was absolutely fantastic! Great acting and amazing plot.
Sentiment: Positive (P(positive): 82.71%)

Review: I couldn't finish watching it. The plot was boring and the acting was terrible.
Sentiment: Negative (P(positive): 11.39%)


# **Launch the Comgra Server**

This final cell launches the Comgra server in the background and embeds it in an iframe. The embedding relies on google.colab.kernel.proxyPort, so it only works inside Google Colab.

In [10]:
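server_command = ["comgra", "--path", "/content/movie_sentiment_analysis"]
# Discard the server's output: an unread subprocess.PIPE can fill up and block the server
server_process = subprocess.Popen(server_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
time.sleep(2)  # Give the server a moment to start before embedding it
print("\nServer started in the background")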

def display_server(port):
    script = f"""
    <script>
    (async () => {{
        const url = await google.colab.kernel.proxyPort({port});
        const iframe = document.createElement('iframe');
        iframe.src = url;
        iframe.width = '1400px';
        iframe.height = '1200px';
        iframe.frameBorder = 0;
        document.body.appendChild(iframe);
    }})();
    </script>
    """
    display(HTML(script))

display_server(8055)



Server started in the background


**Comgra Server:** Although not directly part of comgra.my_recorder, launching the server allows you to visualize all the recorded metrics and KPIs via the Comgra dashboard.
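
A fixed two-second sleep may be too short on a slow machine. One possible alternative, sketched here with the standard library only, is to poll the port until it accepts a TCP connection. The helper `wait_for_port` is a hypothetical name, and the demo uses a throwaway listener on a free port as a stand-in for the Comgra server.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=10.0, interval=0.2):
    """Return True once a TCP connection succeeds, or False after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; try again shortly
    return False


# Self-contained demo: a throwaway listener stands in for the server
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

ready = wait_for_port(demo_port, timeout=2.0)
listener.close()
print(ready)
```

In this notebook, a call such as `wait_for_port(8055)` could take the place of `time.sleep(2)` before `display_server(8055)`.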