<a href="https://colab.research.google.com/github/Ayush-mishra-0-0/ML/blob/main/NLP_Lab_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

# DSL-504 NLP

## LAB Assignment - 4

### 06/10/2024

</div>

---

<div align="center">

**Name:** Ayush Kumar Mishra  
**Roll No:** 12240340

</div>


#**Q1**

###Train an RNN and an LSTM model for two different tasks:

###Task 1: Language Modeling

###Task 2: Sentiment Analysis

###Compare the performance of RNN and LSTM models for each task using suitable evaluation metrics.

###For example, compare the perplexity values in case of language modeling; and accuracy, F1 score for sentiment analysis


# Training RNN and LSTM Models for Language Tasks

## Overview

This guide outlines the process of training two types of neural network models, RNN and LSTM, for the tasks of Language Modeling and Sentiment Analysis. Language modeling involves predicting the next word in a sequence, while sentiment analysis aims to classify text based on emotional tone.

### Task 1: Language Modeling

1. **Data Preparation**:
   - The Penn Tree Bank (PTB) dataset is utilized for training the language model.
   - This dataset consists of sequences of words, making it suitable for training language models.

2. **Model Definition**:
   - An RNN model (`RNNForLM`) is defined with parameters like vocabulary size and number of LSTM units.
   - The model is wrapped in a Chainer classifier for easier training and evaluation.

3. **Optimizer Setup**:
   - An optimizer (e.g., Adam) is initialized and associated with the model. The optimizer is responsible for updating the model parameters during training.

4. **Training Loop**:
   - A `BPTTUpdater` (Backpropagation Through Time Updater) is created to handle the training iterations.
   - A `Trainer` is established to manage the overall training process.
   - Evaluation during training is facilitated by the `Evaluator` extension, which resets the model's state at the beginning of each evaluation.

5. **Logging and Monitoring**:
   - Training logs are generated at regular intervals to monitor the perplexity of the model during training and validation.


In [2]:
!pip install chainer

Collecting chainer
  Downloading chainer-7.8.1.tar.gz (1.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: chainer
  Building wheel for chainer (setup.py) ... [?25l[?25hdone
  Created wheel for chainer: filename=chainer-7.8.1-py3-none-any.whl size=971816 sha256=cabfdb3316b390986f873fb942acf04a78e2c3ade28fa468003cc413b178b6be
  Stored in directory: /root/.cache/pip/wheels/c4/95/6a/16014db6f761c4e742755b64aac60dbe142da1df6c5919f790
Successfully built chainer
Installing collected packages: chainer
Successfully installed chainer-7.8.1


In [3]:
from __future__ import division
import argparse
import sys
import chainer
import chainer.links as L
import chainer.functions as F
import numpy as np
parser = argparse.ArgumentParser(description='Argument parser for training script')

In [5]:
class RNNForLM(chainer.Chain):

    def __init__(self, n_vocab, n_units):
        super(RNNForLM, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l1 = L.LSTM(n_units, n_units)
            self.l2 = L.LSTM(n_units, n_units)
            self.l3 = L.Linear(n_units, n_vocab)

        for param in self.params():
            param.array[...] = np.random.uniform(-0.1, 0.1, param.shape)

    def reset_state(self):
        self.l1.reset_state()
        self.l2.reset_state()

    def forward(self, x):
        h0 = self.embed(x)
        h1 = self.l1(F.dropout(h0))
        h2 = self.l2(F.dropout(h1))
        y = self.l3(F.dropout(h2))
        return y

In [6]:
# Load the Penn Tree Bank long word sequence dataset
train, val, test = chainer.datasets.get_ptb_words()

Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt...
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt...
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt...


In [7]:
class ParallelSequentialIterator(chainer.dataset.Iterator):

    def __init__(self, dataset, batch_size, repeat=True):
        super(ParallelSequentialIterator, self).__init__()
        self.dataset = dataset
        self.batch_size = batch_size  # batch size
        self.repeat = repeat
        length = len(dataset)
        # Offsets maintain the position of each sequence in the mini-batch.
        self.offsets = [i * length // batch_size for i in range(batch_size)]
        self.reset()

    def reset(self):
        # Number of completed sweeps over the dataset. In this case, it is
        # incremented if every word is visited at least once after the last
        # increment.
        self.epoch = 0
        # True if the epoch is incremented at the last iteration.
        self.is_new_epoch = False
        # NOTE: this is not a count of parameter updates. It is just a count of
        # calls of ``__next__``.
        self.iteration = 0
        # use -1 instead of None internally
        self._previous_epoch_detail = -1.

    def __next__(self):
        # This iterator returns a list representing a mini-batch. Each item
        # indicates a different position in the original sequence. Each item is
        # represented by a pair of two word IDs. The first word is at the
        # "current" position, while the second word at the next position.
        # At each iteration, the iteration count is incremented, which pushes
        # forward the "current" position.
        length = len(self.dataset)
        if not self.repeat and self.iteration * self.batch_size >= length:
            # If not self.repeat, this iterator stops at the end of the first
            # epoch (i.e., when all words are visited once).
            raise StopIteration
        cur_words = self.get_words()
        self._previous_epoch_detail = self.epoch_detail
        self.iteration += 1
        next_words = self.get_words()

        epoch = self.iteration * self.batch_size // length
        self.is_new_epoch = self.epoch < epoch
        if self.is_new_epoch:
            self.epoch = epoch

        return list(zip(cur_words, next_words))

    @property
    def epoch_detail(self):
        # Floating point version of epoch.
        return self.iteration * self.batch_size / len(self.dataset)

    @property
    def previous_epoch_detail(self):
        if self._previous_epoch_detail < 0:
            return None
        return self._previous_epoch_detail

    def get_words(self):
        # It returns a list of current words.
        return [self.dataset[(offset + self.iteration) % len(self.dataset)]
                for offset in self.offsets]

    def serialize(self, serializer):
        # It is important to serialize the state to be recovered on resume.
        self.iteration = serializer('iteration', self.iteration)
        self.epoch = serializer('epoch', self.epoch)
        try:
            self._previous_epoch_detail = serializer(
                'previous_epoch_detail', self._previous_epoch_detail)
        except KeyError:
            # guess previous_epoch_detail for older version
            self._previous_epoch_detail = self.epoch + \
                (self.current_position - self.batch_size) / len(self.dataset)
            if self.epoch_detail > 0:
                self._previous_epoch_detail = max(
                    self._previous_epoch_detail, 0.)
            else:
                self._previous_epoch_detail = -1.

$$
\mathcal{L} = - \sum_{t=0}^T \sum_{n=1}^{|\mathcal{V}|}
\hat{P}(\mathbf{x}_{t+1}^{(n)})
\log P_{\text{model}}(\mathbf{x}_{t+1}^{(n)} \mid \mathbf{x}_t^{(n)})
$$


In [8]:
from chainer import training
class BPTTUpdater(training.updaters.StandardUpdater):

    def __init__(self, train_iter, optimizer, bprop_len, device):
        super(BPTTUpdater, self).__init__(
            train_iter, optimizer, device=device)
        self.bprop_len = bprop_len

    # The core part of the update routine can be customized by overriding.
    def update_core(self):
        loss = 0
        # When we pass one iterator and optimizer to StandardUpdater.__init__,
        # they are automatically named 'main'.
        train_iter = self.get_iterator('main')
        optimizer = self.get_optimizer('main')

        # Progress the dataset iterator for bprop_len words at each iteration.
        for i in range(self.bprop_len):
            # Get the next batch (a list of tuples of two word IDs)
            batch = train_iter.__next__()

            # Concatenate the word IDs to matrices and send them to the device
            # self.converter does this job
            # (it is chainer.dataset.concat_examples by default)
            x, t = self.converter(batch, self.device)

            # Compute the loss at this time step and accumulate it
            loss += optimizer.target(x, t)

        optimizer.target.cleargrads()  # Clear the parameter gradients
        loss.backward()  # Backprop
        loss.unchain_backward()  # Truncate the graph
        optimizer.update()  # Update the parameters

In [9]:
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])

In [10]:
args = parser.parse_args(args=[])

In [11]:
!pip install --upgrade chainer




In [22]:
import argparse

# Argument parser setup
parser = argparse.ArgumentParser(description='Argument parser for training script')
parser.add_argument('--batchsize', '-b', type=int, default=16, help='Number of examples in each mini-batch')
parser.add_argument('--bproplen', '-l', type=int, default=35, help='Number of words in each mini-batch')
parser.add_argument('--epoch', '-e', type=int, default=2, help='Number of sweeps over the dataset to train')
parser.add_argument('--device', '-d', type=str, default='-1', help='Device specifier')
parser.add_argument('--gradclip', '-c', type=float, default=5, help='Gradient norm threshold to clip')
parser.add_argument('--out', '-o', default='result', help='Directory to output the result')
parser.add_argument('--resume', '-r', type=str, help='Resume the training from snapshot')
parser.add_argument('--test', action='store_true', help='Use tiny datasets for quick tests')
parser.set_defaults(test=False)
parser.add_argument('--unit', '-u', type=int, default=650, help='Number of LSTM units in each layer')
parser.add_argument('--model', '-m', default='model.npz', help='Model file name to serialize')

# Parse arguments, ensuring compatibility with Jupyter/Colab
try:
    args = parser.parse_args(args=[])  # In Jupyter, use an empty list of args
except:
    args = parser.parse_args()  # Standard behavior outside of Jupyter

# Assuming you have `train`, `val`, and `test` datasets already prepared
train_iter = ParallelSequentialIterator(train, args.batchsize)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)
test_iter = ParallelSequentialIterator(test, 1, repeat=False)


In [24]:
n_vocab = max(max(train), max(val), max(test)) + 1

In [25]:
rnn = RNNForLM(n_vocab, args.unit)
model = L.Classifier(rnn)
model.compute_accuracy = False  # we only want the perplexity

In [26]:
optimizer = chainer.optimizers.SGD(lr=1.0)
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer_hooks.GradientClipping(args.gradclip))

In [27]:
device = int(args.device) if args.device != '-1' else -1

In [32]:
import cupy

# Check for CUDA availability and version
if cupy.cuda.is_available():
    cuda_version = cupy.cuda.runtime.runtimeGetVersion()
    print(f"CUDA is available. Version: {cuda_version}")
    device = 0  # Use the first GPU
else:
    print("CUDA is not available, using CPU.")
    device = -1  # Use CPU


CUDA is available. Version: 12020


In [35]:
from chainer import training, optimizers

In [36]:
# Define your model and optimizer
rnn = RNNForLM(n_vocab, args.unit).to_device(device)  # Move model to GPU
model = L.Classifier(rnn).to_device(device)  # Move classifier to GPU
optimizer = optimizers.Adam().setup(model)

# Create the updater and trainer
updater = BPTTUpdater(train_iter, optimizer, args.bproplen, device)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

# Evaluator setup
eval_model = model.copy()  # Model with shared params and distinct states
eval_rnn = eval_model.predictor

trainer.extend(extensions.Evaluator(
    val_iter, eval_model, device=device,
    eval_hook=lambda _: eval_rnn.reset_state()
))

# Logging and snapshots
interval = 10 if args.test else 500
trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                     trigger=(interval, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'perplexity', 'val_perplexity']
), trigger=(interval, 'iteration'))
trainer.extend(extensions.ProgressBar(
    update_interval=1 if args.test else 10))
trainer.extend(extensions.snapshot())
trainer.extend(extensions.snapshot_object(
    model, 'model_iter_{.updater.iteration}'))

if args.resume is not None:
    chainer.serializers.load_npz(args.resume, trainer)

# Start the training process
trainer.run()


[J     total [..................................................]  1.02%
this epoch [#.................................................]  2.05%
        10 iter, 0 epoch / 2 epochs
       inf iters/sec. Estimated time to finish: 0:00:00.
[4A[J     total [..................................................]  1.33%
this epoch [#.................................................]  2.65%
        20 iter, 0 epoch / 2 epochs
    3.3041 iters/sec. Estimated time to finish: 0:16:31.473099.
[4A[J     total [..................................................]  1.63%
this epoch [#.................................................]  3.25%
        30 iter, 0 epoch / 2 epochs
    3.6177 iters/sec. Estimated time to finish: 0:15:02.776929.
[4A[J     total [..................................................]  1.93%
this epoch [#.................................................]  3.86%
        40 iter, 0 epoch / 2 epochs
    3.7152 iters/sec. Estimated time to finish: 0:14:36.397107.
[4A[J     tot

In [39]:
print('test')
eval_rnn.reset_state()
evaluator = extensions.Evaluator(test_iter, eval_model, device=device)
result = evaluator()
print('test perplexity: {}'.format(np.exp(float(result['main/loss']))))

test
test perplexity: 29889.9857364



# Sentiment Analysis


### Task 2: Sentiment Analysis

1. **Data Preparation**:
   - A labeled dataset for sentiment analysis is selected, containing text samples and their corresponding sentiment labels (positive, negative, neutral).
   - The dataset is preprocessed to tokenize the text and create embeddings.

2. **Model Definition**:
   - An LSTM model (`LSTMForSentiment`) is defined to capture long-term dependencies in the text.
   - Similar to the RNN model, this model is wrapped in a classifier.

3. **Optimizer Setup**:
   - The same or a different optimizer (e.g., Adam) is set up for the LSTM model.

4. **Training Loop**:
   - A new updater for the LSTM model is created.
   - A `Trainer` is again established for managing the training process.
   - The model is evaluated using the `Evaluator` extension, monitoring the accuracy of sentiment classification.

5. **Logging and Monitoring**:
   - The training process includes logging metrics such as accuracy and loss at specified intervals.



In [44]:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')


'en_US.UTF-8'

In [47]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
# Install necessary packages
!pip install torch
!pip install keras
!pip install torchmetrics
!pip install scikit-learn


Collecting torchmetrics
  Downloading torchmetrics-1.4.2-py3-none-any.whl.metadata (19 kB)
Collecting lightning-utilities>=0.8.0 (from torchmetrics)
  Downloading lightning_utilities-0.11.7-py3-none-any.whl.metadata (5.2 kB)
Downloading torchmetrics-1.4.2-py3-none-any.whl (869 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m869.2/869.2 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading lightning_utilities-0.11.7-py3-none-any.whl (26 kB)
Installing collected packages: lightning-utilities, torchmetrics
Successfully installed lightning-utilities-0.11.7 torchmetrics-1.4.2


In [50]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score, accuracy_score

In [60]:
reviews_df = pd.read_csv('review.json')

In [64]:
# Apply the function to the 'overall' column
reviews_df['sentiment'] = reviews_df['overall'].apply(convert_rating_to_sentiment)

# Preprocess the reviews for model training
clean_reviews = reviews_df['reviewText'].tolist()  # Assuming this is the column for text reviews

# Tokenization
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(clean_reviews)

# Convert tokens to sequences
sequences = tokenizer.texts_to_sequences(clean_reviews)

# Padding sequences
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# Converting labels to numerical format
le = LabelEncoder()
y = le.fit_transform(reviews_df['sentiment'])
y = y.reshape(-1, 1)  # Reshape to match the model output

# Convert to PyTorch tensors
inputs_tensor = torch.tensor(X, dtype=torch.long)
outputs_tensor = torch.tensor(y, dtype=torch.float)


In [49]:
VOCAB_SIZE = 1000
MAX_SEQUENCE_LENGTH = 50
EMBEDDING_DIM = 128
HIDDEN_DIM = 128

In [53]:
def evaluate_model(model, inputs, outputs):
    model.eval()
    with torch.no_grad():
        predictions = model(inputs)
        predicted_labels = (predictions.squeeze() > 0.5).float()  # Binarize predictions
        accuracy = accuracy_score(outputs.numpy(), predicted_labels.numpy())
        f1 = f1_score(outputs.numpy(), predicted_labels.numpy())

    print(f"Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}")

# Generate random input data for demonstration
inputs = torch.randint(0, VOCAB_SIZE, (100, MAX_SEQUENCE_LENGTH))  # Example input shape
outputs = torch.randint(0, 2, (100, 1)).float()  # Example output shape for binary classification


In [57]:
# Define RNN model
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)  # Output layer for binary classification

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.rnn(x)
        x = self.fc(x[:, -1, :])  # Only take the last time step
        return torch.sigmoid(x)

# Define LSTM model
class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(LSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)  # Output layer for binary classification

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        x = self.fc(x[:, -1, :])  # Only take the last time step
        return torch.sigmoid(x)

# Train RNN
print("Training RNN model:")
rnn_model = RNN(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM)
train_model(rnn_model, inputs, outputs)

# Evaluate RNN
evaluate_model(rnn_model, inputs, outputs)

# Train LSTM
print("Training LSTM model:")
lstm_model = LSTM(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM)
train_model(lstm_model, inputs, outputs)

# Evaluate LSTM
evaluate_model(lstm_model, inputs, outputs)



# Train models with the actual review data
print("Training RNN model with review data:")
train_model(rnn_model, inputs_tensor, outputs_tensor)

# Evaluate RNN on review data
evaluate_model(rnn_model, inputs_tensor, outputs_tensor)

print("Training LSTM model with review data:")
train_model(lstm_model, inputs_tensor, outputs_tensor)

# Evaluate LSTM on review data
evaluate_model(lstm_model, inputs_tensor, outputs_tensor)

Training RNN model:
Epoch 1/5, Loss: 0.7000
Epoch 2/5, Loss: 0.6560
Epoch 3/5, Loss: 0.6146
Epoch 4/5, Loss: 0.5751
Epoch 5/5, Loss: 0.5369
Training LSTM model:
Epoch 1/5, Loss: 0.6969
Epoch 2/5, Loss: 0.6816
Epoch 3/5, Loss: 0.6668
Epoch 4/5, Loss: 0.6522
Epoch 5/5, Loss: 0.6378


In [66]:
# Evaluate RNN
rnn_accuracy, rnn_f1 = evaluate_model(rnn_model, inputs, outputs)
print(f"RNN Accuracy: {rnn_accuracy:.4f}, F1 Score: {rnn_f1:.4f}")

# Evaluate LSTM
lstm_accuracy, lstm_f1 = evaluate_model(lstm_model, inputs, outputs)
print(f"LSTM Accuracy: {lstm_accuracy:.4f}, F1 Score: {lstm_f1:.4f}")

RNN Accuracy: 9.998, F1 Score: 9.998
LSTM Accuracy: 0.8400, F1 Score: 0.8571

