**SHARMI DAS.   N01639206**

**FINAL OUTCOME FROM THIS NOTEBOOK**

**Best Hyperparameters: {'lr': 0.001, 'wd': 1e-06, 'dropout': 0.3}**

**Best Validation Accuracy: 84.52667124542124**

# LSTM Sentiment Analysis Model Definition

This cell defines an LSTM (Long Short-Term Memory) network in PyTorch for binary sentiment classification. The model consists of an embedding layer, a two-layer LSTM with dropout of 0.4 between the layers, and a fully connected layer that maps the LSTM output at the last time step to a single logit. The forward pass returns both the loss, computed with binary cross-entropy on logits (`nn.BCEWithLogitsLoss`), and the raw logits.
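
As a rough illustration of the loss used here, the sketch below computes binary cross-entropy on logits in plain Python, using the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))`. It also shows why the training cells later threshold the raw output at 0: a logit of 0 corresponds to a probability of 0.5. The helper names and the sample logits are made up for the example and are not part of the notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce_with_logits(z: float, y: float) -> float:
    """Stable per-example loss: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def predict_label(z: float) -> int:
    """Thresholding the logit at 0 equals thresholding the probability at 0.5."""
    return 1 if z >= 0.0 else 0

# Hypothetical logits and labels for a batch of three reviews
logits = [2.0, -1.0, 0.0]
labels = [1.0, 0.0, 1.0]

# Mean over the batch, which is the default reduction of nn.BCEWithLogitsLoss
batch_loss = sum(bce_with_logits(z, y) for z, y in zip(logits, labels)) / len(logits)
print(f"mean loss: {batch_loss:.4f}")
print([predict_label(z) for z in logits])   # [1, 0, 1]
```

Working on logits rather than probabilities avoids taking the logarithm of a value that has already been rounded to 0 or 1 by the sigmoid.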


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import torch.distributed as dist

import time
import os
import sys
import io

class LSTM_model(nn.Module):
    def __init__(self,vocab_size,n_hidden):
        super(LSTM_model, self).__init__()

        self.embedding = nn.Embedding(vocab_size,n_hidden)#,padding_idx=0)

        #self.lstm = nn.LSTM(n_hidden,n_hidden)
        self.lstm = nn.LSTM(n_hidden,n_hidden,num_layers=2,dropout=0.4)
        self.fc_output = nn.Linear(n_hidden, 1)

        #self.loss = nn.CrossEntropyLoss()
        self.loss = nn.BCEWithLogitsLoss()

    def forward(self, X, t, train=True):

        embed = self.embedding(X) # batch_size, time_steps, features
        no_of_timesteps = embed.shape[1]
        n_hidden = embed.shape[2]
        input = embed.permute(1, 0, 2) # input : [len_seq, batch_size, embedding_dim]
        # Zero initial states, created on the same device as the input
        # shape: [num_layers(=2) * num_directions(=1), batch_size, n_hidden]
        hidden_state = torch.zeros(2*1, len(X), n_hidden, device=X.device)
        cell_state = torch.zeros(2*1, len(X), n_hidden, device=X.device)
        # final_hidden_state, final_cell_state : [num_layers(=2) * num_directions(=1), batch_size, n_hidden]
        output, (final_hidden_state, final_cell_state) = self.lstm(input, (hidden_state, cell_state))
        #output = output.permute(1, 2, 0) # output : [batch_size, n_hidden, len_seq]
        h = output[-1]
        #pool = nn.MaxPool1d(no_of_timesteps)
        #h = pool(output)
        h = h.view(h.size(0),-1)
        h = self.fc_output(h)
        return self.loss(h[:,0],t), h[:,0]#F.softmax(h, dim=1)



# Data Preprocessing for Sentiment Analysis

This cell prepares the data for sentiment analysis. It reads the training and testing CSV files with pandas, takes the `Reviews` column as features (x) and the `Sentiment` column as labels (y), and splits the training file with scikit-learn's `train_test_split`: 20% is held out for testing, and 20% of the remainder for validation, giving roughly a 64/16/20 train/validation/test split. The training and test splits are saved as NumPy arrays for later use. The separate testing CSV is loaded but not used in the cells shown here.
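
To make the resulting proportions concrete, here is a minimal pure-Python sketch of a two-stage hold-out split. It is not scikit-learn's implementation; it only mirrors the rule that the held-out part has `ceil(test_size * n)` items. The helper `split_indices`, the seed value, and the dataset size are assumptions chosen for the example.

```python
import math
import random

def split_indices(n, test_size, seed):
    """Shuffle range(n) with a fixed seed and hold out ceil(test_size * n) items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Round first to guard against floating-point noise before taking the ceiling
    n_held_out = math.ceil(round(test_size * n, 8))
    return idx[n_held_out:], idx[:n_held_out]   # (kept, held out)

SEED = 42          # assumed value; the notebook does not show its SEED
n_samples = 1000   # hypothetical dataset size

# Stage 1: hold out 20% of everything for testing
train_all, test = split_indices(n_samples, 0.2, SEED)

# Stage 2: hold out 20% of the remainder for validation
train_pos, val_pos = split_indices(len(train_all), 0.2, SEED)
train = [train_all[i] for i in train_pos]
val = [train_all[i] for i in val_pos]

print(len(train), len(val), len(test))   # 640 160 200
```

Because the second split is taken from the remaining 80%, the validation set is 16% of the full data rather than 20%.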

In [None]:
import pandas as pd
import numpy as np
import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Read train and test data from CSV files
train_df = pd.read_csv('/content/sample_data/training_data (1).csv')
test_df = pd.read_csv('/content/sample_data/testing_data.csv')

# 'Reviews' holds the review text and 'Sentiment' the binary label
x_data = train_df['Reviews'].values
y_data = train_df['Sentiment'].values

# Seed for reproducible splits (assumed value; the original cell does not show where SEED is defined)
SEED = 42

# Split data into training, validation, and testing sets (about 64% / 16% / 20%)
x_train_all, x_test, y_train_all, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=SEED)
x_train, x_val, y_train, y_val = train_test_split(x_train_all, y_train_all, test_size=0.2, random_state=SEED)

# Create the 'preprocessed_data' directory if it doesn't exist
directory = '../preprocessed_data/'
if not os.path.exists(directory):
    os.makedirs(directory)

# Save train and test data as numpy arrays
np.save(directory + 'x_train.npy', x_train)
np.save(directory + 'y_train.npy', y_train)
np.save(directory + 'x_test.npy', x_test)
np.save(directory + 'y_test.npy', y_test)

In [None]:
# Load train and test data as numpy arrays
x_train = np.load('../preprocessed_data/x_train.npy', allow_pickle=True)
y_train = np.load('../preprocessed_data/y_train.npy')
x_test = np.load('../preprocessed_data/x_test.npy', allow_pickle=True)
y_test = np.load('../preprocessed_data/y_test.npy')

# Tokenization and Padding
vocab_size = 8000
sequence_len = 150
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(x_train)

x_train_seq = tokenizer.texts_to_sequences(x_train)
x_train_pad = pad_sequences(x_train_seq, maxlen=sequence_len, padding='post')

x_test_seq = tokenizer.texts_to_sequences(x_test)
x_test_pad = pad_sequences(x_test_seq, maxlen=sequence_len, padding='post')

# Convert to numpy arrays
x_train_pad = np.array(x_train_pad)
x_test_pad = np.array(x_test_pad)

vocab_size += 1  # Increment vocab size for padding token

# Preprocess validation set
x_val = x_val.astype(str)

x_val_seq = tokenizer.texts_to_sequences(x_val)
x_val_pad = pad_sequences(x_val_seq, maxlen=sequence_len, padding='post')
x_val_pad = np.array(x_val_pad)
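
The tokenization and padding steps above can be sketched in plain Python as follows. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their actual implementation: it splits on whitespace only, keeps the `num_words - 1` most frequent words, reserves index 0 for padding, drops the beginning of over-long sequences (the Keras default truncation), and appends zeros at the end (`padding='post'`). The function names and sample texts are hypothetical.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: i + 1 for i, word in enumerate(most_common)}

def texts_to_sequences(texts, vocab):
    """Replace each known word by its index and drop out-of-vocabulary words."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

def pad_post(seqs, maxlen):
    """Keep the last maxlen tokens of long sequences, then append zeros to short ones."""
    padded = []
    for seq in seqs:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded

texts = ["good movie", "bad movie", "good good plot"]   # hypothetical reviews
vocab = fit_vocab(texts, num_words=3)                   # keeps the 2 most frequent words
sequences = texts_to_sequences(texts, vocab)
print(pad_post(sequences, maxlen=3))   # [[1, 2, 0], [2, 0, 0], [1, 1, 0]]
```

One thing that may be worth checking: with zeros appended at the end, the last time step of a short review is padding, and the model defined earlier classifies from the LSTM output at the last time step.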

## Before Hyperparameter Tuning, with Learning Rate Scheduler


This cell trains the LSTM model with an embedding and hidden dimension of 800, using the Adam optimizer with a learning rate of 0.001 and a batch size of 200. A `ReduceLROnPlateau` scheduler (factor 0.1, patience 5) is set up to lower the learning rate when the validation loss plateaus, and training stops early if the validation loss fails to decrease for 2 consecutive epochs. The model is trained for up to 5 epochs, with training and validation loss and accuracy reported after each one. With only 5 epochs, a scheduler patience of 5 means the learning rate is never actually reduced in this run. After training, the model is evaluated on the test split, and accuracy, loss, precision, recall, and F1 score are printed. In this run, **the model achieves a testing accuracy of 85.09% with a precision of 0.8974, recall of 0.8380, and F1 score of 0.8667.**
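
The early-stopping rule in the training cell for this section can be isolated as a small function. The sketch below follows the same logic: the counter grows when the validation loss is not lower than the previous epoch's loss (not the best loss so far) and resets otherwise. The function name `early_stop_epoch` is hypothetical; the first list holds the validation losses printed for this run.

```python
def early_stop_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None if it never stops."""
    prev_val_loss = float("inf")
    no_improvement_counter = 0
    for epoch, loss in enumerate(val_losses):
        if loss >= prev_val_loss:
            no_improvement_counter += 1
        else:
            no_improvement_counter = 0
        prev_val_loss = loss
        if no_improvement_counter >= patience:
            return epoch
    return None

# Validation losses recorded in this run: one rise at epoch 1, then steady improvement
recorded = [0.6707, 0.6796, 0.4682, 0.4327, 0.3053]
print(early_stop_epoch(recorded, patience=2))                   # None

# Hypothetical run in which the loss rises twice in a row
print(early_stop_epoch([0.50, 0.55, 0.60, 0.40], patience=2))   # 2
```

Comparing against the previous epoch rather than the best loss seen so far means a slow upward drift interrupted by small dips would not trigger a stop; tracking the best loss is a common alternative.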

In [None]:
model = LSTM_model(vocab_size, 800)
model.cuda()

# Define optimizer
LR = 0.001
optimizer = optim.Adam(model.parameters(), lr=LR)

# Define learning rate scheduler
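# ReduceLROnPlateau is referenced through torch.optim, which is already imported as `optim`
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.1, verbose=True)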

batch_size = 200
no_of_epochs = 5
L_Y_train = len(y_train)
L_Y_val = len(y_val)
L_Y_test = len(y_test)

model.train()

train_loss = []
val_loss = []
train_accu = []
val_accu = []

prev_val_loss = float('inf')
patience = 2  # Number of epochs to wait before stopping if the model stops learning
no_improvement_counter = 0

for epoch in range(no_of_epochs):
    # Training
    model.train()

    epoch_acc = 0.0
    epoch_loss = 0.0
    epoch_counter = 0

    time1 = time.time()

    I_permutation = np.random.permutation(L_Y_train)

    for i in range(0, L_Y_train, batch_size):

        x_input = x_train_pad[I_permutation[i:i+batch_size]]
        y_input = np.asarray(y_train[I_permutation[i:i+batch_size]], dtype=float)
        data = Variable(torch.LongTensor(x_input)).cuda()
        target = Variable(torch.FloatTensor(y_input)).cuda()

        optimizer.zero_grad()
        loss, pred = model(data, target)
        loss.backward()

        optimizer.step()   # Update weights

        prediction = pred >= 0.0
        truth = target >= 0.5
        acc = prediction.eq(truth).sum().cpu().data.numpy()
        epoch_acc += acc
        epoch_loss += loss.data.item()
        epoch_counter += batch_size

    epoch_acc /= L_Y_train  # true sample count; epoch_counter overcounts when the last batch is partial
    epoch_loss /= (epoch_counter / batch_size)

    train_loss.append(epoch_loss)
    train_accu.append(epoch_acc)

    print(f"Epoch {epoch}: - Training Accuracy: {epoch_acc * 100:.2f}% - Training Loss: {epoch_loss:.4f} - Time Taken: {time.time() - time1:.4f} seconds")
    # Validation
    model.eval()

    val_loss_epoch = 0.0
    val_acc_epoch = 0.0
    val_counter = 0

    for i in range(0, L_Y_val, batch_size):

        x_input = x_val_pad[i:i+batch_size]
        y_input = np.asarray(y_val[i:i+batch_size], dtype=float)
        target = Variable(torch.FloatTensor(y_input)).cuda()
        data = Variable(torch.LongTensor(x_input)).cuda()
        with torch.no_grad():
            loss, pred = model(data, target)

        prediction = pred >= 0.0
        truth = target >= 0.5
        acc = prediction.eq(truth).sum().cpu().data.numpy()
        val_acc_epoch += acc
        val_loss_epoch += loss.data.item()
        val_counter += batch_size

    val_acc_epoch /= L_Y_val  # true sample count; val_counter overcounts when the last batch is partial
    val_loss_epoch /= (val_counter / batch_size)

    val_loss.append(val_loss_epoch)
    val_accu.append(val_acc_epoch)

    print(f"  - Validation Accuracy: {val_acc_epoch * 100:.2f}% - Validation Loss: {val_loss_epoch:.4f}")

    # Adjust learning rate based on validation loss
    scheduler.step(val_loss_epoch)

    # Check for early stopping
    if val_loss_epoch >= prev_val_loss:
        no_improvement_counter += 1
    else:
        no_improvement_counter = 0
    prev_val_loss = val_loss_epoch

    if no_improvement_counter >= patience:
        print("Early stopping: Validation loss has not decreased for", patience, "epochs.")
        break
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Testing
model.eval()

test_loss = 0.0
test_acc = 0.0
test_precision = 0.0
test_recall = 0.0
test_f1 = 0.0
test_counter = 0

predictions = []
true_labels = []

for i in range(0, L_Y_test, batch_size):

    x_input = x_test_pad[i:i+batch_size]
    y_input = np.asarray(y_test[i:i+batch_size], dtype=float)
    target = Variable(torch.FloatTensor(y_input)).cuda()
    data = Variable(torch.LongTensor(x_input)).cuda()
    with torch.no_grad():
        loss, pred = model(data, target)

    prediction = pred >= 0.0
    truth = target >= 0.5
    acc = prediction.eq(truth).sum().cpu().data.numpy()
    test_acc += acc
    test_loss += loss.data.item()
    test_counter += batch_size

    # Convert predictions and true labels to numpy arrays
    predictions.extend(prediction.cpu().numpy())
    true_labels.extend(truth.cpu().numpy())

test_acc /= L_Y_test  # true sample count; test_counter overcounts when the last batch is partial
test_loss /= (test_counter / batch_size)

# Convert lists to numpy arrays
predictions = np.array(predictions)
true_labels = np.array(true_labels)

# Calculate additional evaluation metrics
test_precision = precision_score(true_labels, predictions)
test_recall = recall_score(true_labels, predictions)
test_f1 = f1_score(true_labels, predictions)

print(f"Testing Accuracy: {test_acc * 100:.2f}%")
print(f"Testing Loss: {test_loss:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1 Score: {test_f1:.4f}")






Epoch 0: - Training Accuracy: 54.23% - Training Loss: 0.6822 - Time Taken: 82.8267 seconds
  - Validation Accuracy: 51.50% - Validation Loss: 0.6707
Epoch 1: - Training Accuracy: 61.07% - Training Loss: 0.6535 - Time Taken: 82.0316 seconds
  - Validation Accuracy: 58.84% - Validation Loss: 0.6796
Epoch 2: - Training Accuracy: 67.42% - Training Loss: 0.5978 - Time Taken: 82.1675 seconds
  - Validation Accuracy: 79.80% - Validation Loss: 0.4682
Epoch 3: - Training Accuracy: 84.05% - Training Loss: 0.3726 - Time Taken: 82.5664 seconds
  - Validation Accuracy: 80.70% - Validation Loss: 0.4327
Epoch 4: - Training Accuracy: 89.24% - Training Loss: 0.2624 - Time Taken: 82.6324 seconds
  - Validation Accuracy: 86.96% - Validation Loss: 0.3053
Testing Accuracy: 85.09%
Testing Loss: 0.2893
Precision: 0.8974
Recall: 0.8380
F1 Score: 0.8667
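
As a quick consistency check on the numbers above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall, and also shows how all three metrics follow from confusion-matrix counts. The function names and the counts in the second example are made up for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Reported precision and recall from the test run above
print(round(f1_from(0.8974, 0.8380), 4))   # 0.8667

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

A precision higher than recall, as in this run, indicates that the model misses more positive reviews than it mislabels negative ones as positive.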


## Hyperparameter Tuning and K-Fold Cross-Validation


This cell tunes hyperparameters for the LSTM model with 3-fold cross-validation. It runs a grid search over 2 learning rates, 3 weight decays, and 3 dropout probabilities (18 combinations) with a hidden dimension of 100. For each combination, a new model is trained on each fold for up to 10 epochs and its validation accuracy is recorded. Early stopping and learning rate adjustment based on validation loss are used to limit overfitting and improve convergence. The combination with the highest average validation accuracy is reported at the end. In this run, **the best configuration consists of a learning rate of 0.001, weight decay of 1e-06, and a dropout probability of 0.3, achieving a validation accuracy of 84.53%.**
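
The overall shape of a grid search with k-fold cross-validation can be sketched without any model at all. In the example below, `kfold_indices` mimics unshuffled contiguous folds (the first `n % k` folds get one extra sample), and `grid_search` averages one score per fold for each combination. Both helpers and `dummy_score` are hypothetical: the scoring function is a stand-in for "train on the training indices, return accuracy on the validation indices", and its output is not a result from this notebook.

```python
from itertools import product
from statistics import mean

def kfold_indices(n_samples, n_splits):
    """Yield (train, val) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def grid_search(grid, n_samples, n_splits, score_fn):
    """Return (best_params, best_mean_score) over every combination in grid."""
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # One score per fold, averaged to rate this combination
        fold_scores = [score_fn(params, train, val)
                       for train, val in kfold_indices(n_samples, n_splits)]
        avg = mean(fold_scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score

def dummy_score(params, train_idx, val_idx):
    """Hypothetical scorer that simply prefers a dropout close to 0.5."""
    return -abs(params["dropout"] - 0.5)

grid = {"lr": [0.001, 0.0001], "wd": [1e-4, 1e-5, 1e-6], "dropout": [0.3, 0.5, 0.7]}
n_combinations = len(list(product(*grid.values())))
best, score = grid_search(grid, n_samples=10, n_splits=3, score_fn=dummy_score)
print(n_combinations)      # 18
print(best["dropout"])     # 0.5
```

Recording exactly one score per fold keeps the average comparable across combinations, whatever number of epochs each fold ends up running.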

In [None]:
from sklearn.model_selection import KFold
from tqdm import tqdm

# Define k-fold cross-validation
num_folds = 3
kf = KFold(n_splits=num_folds)

# Define hyperparameters to tune
learning_rates = [0.001, 0.0001]
weight_decays = [1e-4, 1e-5, 1e-6]
dropout_probs = [0.3, 0.5, 0.7]

best_accuracy = 0.0
best_hyperparameters = {}
n_hidden = 100
num_epochs = 10  # matches the "Epoch n/10" lines in the recorded output

# Perform k-fold cross-validation for hyperparameter tuning
for lr in learning_rates:
    for wd in weight_decays:
        for dp in dropout_probs:
            print(f'Training with LR={lr}, WD={wd}, Dropout={dp}')

            fold_accuracies = []

            for fold, (train_indices, val_indices) in enumerate(kf.split(x_train)):
                print(f'Fold {fold + 1}/{num_folds}')

                # Split the padded sequences and labels into training and validation sets for this fold
                x_train_fold = x_train_pad[train_indices]
                y_train_fold = y_train[train_indices]
                x_val_fold = x_train_pad[val_indices]
                y_val_fold = y_train[val_indices]
                n_train_fold = len(y_train_fold)
                n_val_fold = len(y_val_fold)

                # Initialize a fresh model for this fold.
                # NOTE: `dp` is never passed to the model. LSTM_model hard-codes dropout=0.4,
                # so the dropout grid has no effect unless the class accepts a dropout argument.
                model = LSTM_model(vocab_size, n_hidden)
                model.cuda()

                # Define optimizer
                optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=wd)

                # Bind a new scheduler to this optimizer and reset the early-stopping state
                # so that a fold does not inherit counters from the previous one
                scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.1, verbose=True)
                prev_val_loss = float('inf')
                no_improvement_counter = 0

                # Training loop
                for epoch in range(num_epochs):  # Loop over the fold's training data multiple times
                    model.train()

                    epoch_acc = 0.0
                    epoch_loss = 0.0
                    epoch_counter = 0

                    time1 = time.time()

                    # Shuffle only the samples that belong to this fold's training part
                    I_permutation = np.random.permutation(n_train_fold)

                    for i in range(0, n_train_fold, batch_size):

                        x_input = x_train_fold[I_permutation[i:i+batch_size]]
                        y_input = np.asarray(y_train_fold[I_permutation[i:i+batch_size]], dtype=float)
                        data = Variable(torch.LongTensor(x_input)).cuda()
                        target = Variable(torch.FloatTensor(y_input)).cuda()

                        optimizer.zero_grad()
                        loss, pred = model(data, target)
                        loss.backward()

                        optimizer.step()   # Update weights

                        prediction = pred >= 0.0
                        truth = target >= 0.5
                        acc = prediction.eq(truth).sum().cpu().data.numpy()
                        epoch_acc += acc
                        epoch_loss += loss.data.item()
                        epoch_counter += batch_size

                    epoch_acc /= n_train_fold  # true sample count of this fold's training part
                    epoch_loss /= (epoch_counter / batch_size)  # mean over batches

                    train_loss.append(epoch_loss)
                    train_accu.append(epoch_acc)

                    print(f"Epoch {epoch + 1}/{num_epochs}: - Training Accuracy: {epoch_acc * 100:.2f}% - Training Loss: {epoch_loss:.4f} - Time Taken: {time.time() - time1:.4f} seconds")

                    # Validation on the held-out part of this fold
                    model.eval()

                    val_loss_epoch = 0.0
                    val_acc_epoch = 0.0
                    val_counter = 0

                    for i in range(0, n_val_fold, batch_size):

                        x_input = x_val_fold[i:i+batch_size]
                        y_input = np.asarray(y_val_fold[i:i+batch_size], dtype=float)
                        target = Variable(torch.FloatTensor(y_input)).cuda()
                        data = Variable(torch.LongTensor(x_input)).cuda()
                        with torch.no_grad():
                            loss, pred = model(data, target)

                        prediction = pred >= 0.0
                        truth = target >= 0.5
                        acc = prediction.eq(truth).sum().cpu().data.numpy()
                        val_acc_epoch += acc
                        val_loss_epoch += loss.data.item()
                        val_counter += batch_size

                    val_acc_epoch /= n_val_fold  # true sample count of this fold's validation part
                    val_loss_epoch /= (val_counter / batch_size)  # mean over batches

                    val_loss.append(val_loss_epoch)
                    val_accu.append(val_acc_epoch)

                    print(f"  - Validation Accuracy: {val_acc_epoch * 100:.2f}% - Validation Loss: {val_loss_epoch:.4f}")

                    # Adjust learning rate based on validation loss
                    scheduler.step(val_loss_epoch)

                    # Check for early stopping
                    if val_loss_epoch >= prev_val_loss:
                        no_improvement_counter += 1
                    else:
                        no_improvement_counter = 0
                    prev_val_loss = val_loss_epoch

                    if no_improvement_counter >= patience:
                        print("Early stopping: Validation loss has not decreased for", patience, "epochs.")
                        break

                # Record the validation accuracy of the last completed epoch, once per fold
                # (placed after the epoch loop so that early stopping does not skip it)
                fold_accuracy = val_acc_epoch * 100
                fold_accuracies.append(fold_accuracy)

            # Calculate average validation accuracy across all folds
            avg_accuracy = np.mean(fold_accuracies)
            print(f'Average Validation Accuracy: {avg_accuracy}')

            # Update best hyperparameters if this combination is the best so far
            if avg_accuracy > best_accuracy:
                best_accuracy = avg_accuracy
                best_hyperparameters = {'lr': lr, 'wd': wd, 'dropout': dp}

print(f'Best Hyperparameters: {best_hyperparameters}')
print(f'Best Validation Accuracy: {best_accuracy}')


Training with LR=0.001, WD=0.0001, Dropout=0.3
Fold 1/3
Epoch 1/10: - Training Accuracy: 54.26% - Training Loss: 0.6783 - Time Taken: 3.1895 seconds
  - Validation Accuracy: 54.77% - Validation Loss: 0.6676
Epoch 2/10: - Training Accuracy: 57.20% - Training Loss: 0.6626 - Time Taken: 3.0991 seconds
  - Validation Accuracy: 58.31% - Validation Loss: 0.6707
Epoch 3/10: - Training Accuracy: 67.25% - Training Loss: 0.5882 - Time Taken: 3.1174 seconds
  - Validation Accuracy: 80.57% - Validation Loss: 0.4235
Epoch 4/10: - Training Accuracy: 85.96% - Training Loss: 0.3377 - Time Taken: 3.2336 seconds
  - Validation Accuracy: 88.70% - Validation Loss: 0.2830
Epoch 5/10: - Training Accuracy: 90.93% - Training Loss: 0.2307 - Time Taken: 3.1390 seconds
  - Validation Accuracy: 89.88% - Validation Loss: 0.2526
Epoch 6/10: - Training Accuracy: 92.74% - Training Loss: 0.1894 - Time Taken: 3.1177 seconds
  - Validation Accuracy: 90.38% - Validation Loss: 0.2436
Epoch 7/10: - Training Accuracy: 94.30

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


Epoch 1/10: - Training Accuracy: 54.43% - Training Loss: 0.6785 - Time Taken: 3.0697 seconds
  - Validation Accuracy: 64.98% - Validation Loss: 0.6481
Epoch 2/10: - Training Accuracy: 62.28% - Training Loss: 0.6483 - Time Taken: 3.0190 seconds
  - Validation Accuracy: 67.19% - Validation Loss: 0.6196
Epoch 3/10: - Training Accuracy: 66.54% - Training Loss: 0.6219 - Time Taken: 3.0372 seconds
  - Validation Accuracy: 70.51% - Validation Loss: 0.5904
Epoch 4/10: - Training Accuracy: 77.32% - Training Loss: 0.4949 - Time Taken: 3.1563 seconds
  - Validation Accuracy: 67.89% - Validation Loss: 0.5836
Epoch 5/10: - Training Accuracy: 85.72% - Training Loss: 0.3495 - Time Taken: 3.0299 seconds
  - Validation Accuracy: 84.90% - Validation Loss: 0.3733
Epoch 6/10: - Training Accuracy: 89.46% - Training Loss: 0.2728 - Time Taken: 3.0376 seconds
  - Validation Accuracy: 86.90% - Validation Loss: 0.3027
Epoch 7/10: - Training Accuracy: 91.91% - Training Loss: 0.2155 - Time Taken: 3.2330 seconds
 

### Best Hyperparameters: {'lr': 0.001, 'wd': 1e-06, 'dropout': 0.3}
### Best Validation Accuracy: 84.52667124542124