<a href="https://colab.research.google.com/github/SproutCoder/text_mining_23/blob/main/project_3_nn_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3: Text Classification using BOW and Neural Networks

For this project, imagine you are a group of data scientists who want to train a neural network to predict the sentiment of a movie review. You are given the set of IMDb reviews you used for Project 1, split into training data (*train.jsonl*) and test data (*test.jsonl*).


For this project, you might need the following Python packages:
- sklearn
- pandas
- PyTorch

For PyTorch installation, please refer to [PyTorch](https://pytorch.org/get-started/locally/).

### Enter names and mat. numbers:
- Group Name PiKa

- Sebastian Pirozhkov, 421892
- Christopher Kaschny, 447930

In [14]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import string
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cpu


## Task 1: Numerical Representation of Texts
As presented in the lecture, we want to represent texts by a "bag of words". That is, each text is represented by the bag (or multiset) of its words, ignoring word order.

a) Proceed as follows:
1. Remove all stop words.
2. Remove punctuation.
3. Convert all words to lowercase.
4. Create the term-document matrix.

Hint: You may use your code from Project 1 here.
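
To make the four steps concrete, here is a minimal pure-Python sketch on two made-up sentences. The stop word set `TOY_STOP_WORDS` and the helper `tokenize` are assumptions for illustration only; they are not NLTK's list or the project's code, and the steps are applied in a slightly different order than listed.

```python
from collections import Counter
import string

# Toy stop word list for illustration only (an assumption, not NLTK's list).
TOY_STOP_WORDS = {"the", "a", "was", "is", "and"}

def tokenize(text):
    """Lowercase, strip punctuation characters, split on whitespace, drop stop words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in TOY_STOP_WORDS]

docs = ["The movie was great, great fun!", "A boring movie."]
tokenized = [tokenize(d) for d in docs]

# The vocabulary is the sorted set of all remaining tokens.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term; entries are counts.
matrix = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)   # ['boring', 'fun', 'great', 'movie']
print(matrix)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row is the bag of words of one review: the word "great" appears twice in the first document, so its count is 2, while word order is lost.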

In [2]:
# Function to preprocess text

nltk.download('stopwords')

# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Remove punctuation characters (as listed in string.punctuation)
    text = "".join([char for char in text if char not in string.punctuation])
    # Convert the text to lowercase
    text = text.lower()
    # Remove stop words using NLTK's English stop word list
    text = " ".join([word for word in text.split() if word not in STOP_WORDS])
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# CREATE TERM-DOC MATRIX

# Load the train and test data using pandas
train_data = pd.read_json("train.jsonl", lines=True)
test_data = pd.read_json("test.jsonl", lines=True)

# Combine train and test data for preprocessing
combined_data = pd.concat([train_data, test_data], ignore_index=True)

# Apply text preprocessing
combined_data['processed_text'] = combined_data['text'].apply(preprocess_text)

vectorizer = CountVectorizer()
term_doc_matrix = vectorizer.fit_transform(combined_data['processed_text'])
term_doc_matrix = torch.Tensor(term_doc_matrix.toarray())

print(term_doc_matrix)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


In [21]:
#btw:
train_data.head()

|   | text | label | data-id | id |
|---|------|-------|---------|----|
| 0 | "The Garden of Allah" was one of the first fea... | pos | 0 | 0 |
| 1 | Here's how you do it: Believe in God and repen... | neg | 1 | 1 |
| 2 | I thought the whole movie played out beautiful... | pos | 2 | 2 |
| 3 | The best Modesty Blaise movie I have seen so f... | pos | 3 | 3 |
| 4 | Every movie critic and metal head hated this m... | pos | 4 | 4 |


b) Write a data class for the BOW dataset and the labels. Transform the categorical labels to numerical labels.

In [4]:
class BOW_data(Dataset):
    def __init__(self, data_points: torch.FloatTensor, class_labels: torch.LongTensor):
        self.data_points = data_points
        self.class_labels = class_labels

    def __len__(self):
        return self.data_points.shape[0] # Return the length of self.data_points

    def __getitem__(self, index):
        data = self.data_points[index]
        label = self.class_labels[index]
        return data, label

c) Instantiate the train and test data objects.

In [5]:
# Convert categorical labels to numerical labels
label_encoder = LabelEncoder()
numerical_labels = label_encoder.fit_transform(combined_data['label'])

# Split the term-document matrix into train and test sets
train_data_points = term_doc_matrix[:len(train_data)]
test_data_points = term_doc_matrix[len(train_data):]

# Split the numerical labels into train and test sets
train_labels = numerical_labels[:len(train_data)]
test_labels = numerical_labels[len(train_data):]

# Convert the label arrays to LongTensors (class indices for the loss function)
train_labels = torch.as_tensor(train_labels, dtype=torch.long)
test_labels = torch.as_tensor(test_labels, dtype=torch.long)

# Instantiate the train and test data objects
train_dataset = BOW_data(train_data_points, train_labels)
test_dataset = BOW_data(test_data_points, test_labels)
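
The label encoding step above maps each class name to an integer. As a rough illustration of the idea, the following pure-Python sketch assigns codes by the position of each class in the sorted list of unique labels, which is also how scikit-learn's `LabelEncoder` orders its `classes_`. The label list is made up for the example.

```python
labels = ["pos", "neg", "pos", "pos", "neg"]  # made-up example labels

# Sorted unique class names define the integer codes (index in the sorted list).
classes = sorted(set(labels))
class_to_index = {name: idx for idx, name in enumerate(classes)}

numerical = [class_to_index[name] for name in labels]

print(class_to_index)  # {'neg': 0, 'pos': 1}
print(numerical)       # [1, 0, 1, 1, 0]
```

Under this ordering, `neg` becomes 0 and `pos` becomes 1, so output neuron 1 of the classifier corresponds to positive reviews.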

## Task 2: Design the Network
a) Design a neural network with 4 fully connected layers for the task of classifying the sentiment of movie reviews.
- Use the ReLU activation function after the first three layers.
- Use one of the weight initializers from the lecture to initialize the network's weights.

In [6]:
class Classifier(nn.Module):
    def __init__(self, input_size: int, hidden_1_size: int, hidden_2_size: int, hidden_3_size: int, output_size: int):
        super(Classifier, self).__init__()

        self.fc1 = nn.Linear(input_size, hidden_1_size)
        self.fc2 = nn.Linear(hidden_1_size, hidden_2_size)
        self.fc3 = nn.Linear(hidden_2_size, hidden_3_size)
        self.fc4 = nn.Linear(hidden_3_size, output_size)

        self.relu = nn.ReLU()

        # Initialize weights using He/Kaiming initializer (from the lecture)
        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc2.weight, nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc3.weight, nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc4.weight, nonlinearity='relu')

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x
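
To give a feeling for what the He/Kaiming uniform initializer does, the sketch below computes the range it samples from. It assumes PyTorch's default `fan_in` mode and the ReLU gain of sqrt(2), where weights are drawn from U(-bound, bound) with bound = gain * sqrt(3 / fan_in). The helper name `kaiming_uniform_bound` and the vocabulary size of 50,000 are made up for illustration.

```python
import math

def kaiming_uniform_bound(fan_in, gain=math.sqrt(2.0)):
    """Half-width of the range U(-bound, bound) used by He/Kaiming uniform init."""
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std

# fan_in is the number of inputs of a layer; 50_000 is a made-up vocabulary size.
for fan_in in (8, 128, 50_000):
    print(fan_in, round(kaiming_uniform_bound(fan_in), 4))
```

The wider a layer's input, the smaller its initial weights, which keeps the variance of the activations roughly constant from layer to layer. For a BOW input with tens of thousands of features, the first layer therefore starts with very small weights.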

b) Write a function to train the model on batches for a given number of epochs.

In [7]:
import torch.optim as optim

def train(clf, train_data, batch_size, epochs, learning_rate=0.0001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(clf.parameters(), lr=learning_rate)

    num_batches = len(train_data) // batch_size

    for epoch in range(epochs):
        running_loss = 0.0

        for i in range(num_batches):
            inputs, labels = train_data[i * batch_size: (i + 1) * batch_size]

            optimizer.zero_grad()

            outputs = clf(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs} | Loss: {running_loss / num_batches:.4f}")

    print("Training complete.")
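
The function above slices the dataset in a fixed order and drops the last incomplete batch. A possible variant, sketched here under the assumption that reshuffling every epoch is desired, uses `DataLoader` with `shuffle=True`. The name `train_with_loader` and the synthetic toy data are hypothetical and only serve to make the example runnable; this is not the code that produced the results in this notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_with_loader(clf, dataset, batch_size, epochs, learning_rate=1e-4):
    """Train clf with shuffled mini-batches; returns the mean loss per epoch."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(clf.parameters(), lr=learning_rate)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    epoch_losses = []
    clf.train()
    for _ in range(epochs):
        total_loss, seen = 0.0, 0
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = criterion(clf(inputs), labels)
            loss.backward()
            optimizer.step()
            # Weight each batch by its size so a smaller last batch is averaged correctly.
            total_loss += loss.item() * inputs.size(0)
            seen += inputs.size(0)
        epoch_losses.append(total_loss / seen)
    return epoch_losses

# Tiny synthetic stand-in for the BOW data (random counts, random labels).
torch.manual_seed(0)
toy_x = torch.randint(0, 3, (20, 10)).float()
toy_y = torch.randint(0, 2, (20,))
toy_model = nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 2))

losses = train_with_loader(toy_model, TensorDataset(toy_x, toy_y), batch_size=8, epochs=3)
print(losses)
```

Because the function works with any `Dataset` that returns `(data, label)` pairs, it could in principle be called with the `BOW_data` objects as well.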

c) Instantiate the neural network classifier.
- Given the term-document-matrix, what is your input size?
- For the later tasks, try different hidden layer sizes and compare the results.

We use two output neurons (`output_size = 2`), one logit per sentiment class, because this is the form `nn.CrossEntropyLoss` expects: the predicted class is the one with the larger logit. For a binary task, a single output neuron with a sigmoid and a binary cross-entropy loss would be an equivalent formulation. The two-output version has the practical advantage of extending directly to more than two classes.
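
For reference, with a softmax over two logits the probability of the positive class depends only on the difference of the two logits, which is exactly what a single sigmoid output would compute. The following small numeric check uses arbitrary example logits; the function names are made up for this sketch.

```python
import math

def softmax_positive(z_neg, z_pos):
    """Probability of the positive class from two logits via softmax."""
    m = max(z_neg, z_pos)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(z_neg - m), math.exp(z_pos - m)
    return e_pos / (e_neg + e_pos)

def sigmoid(z):
    """Logistic function, as used with a single output neuron."""
    return 1.0 / (1.0 + math.exp(-z))

z_neg, z_pos = -0.3, 1.2  # arbitrary example logits
p_two_outputs = softmax_positive(z_neg, z_pos)
p_one_output = sigmoid(z_pos - z_neg)

# Both formulations give the same probability (up to floating point error).
print(p_two_outputs, p_one_output)
```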

In [16]:
input_size = term_doc_matrix.shape[1]
output_size = 2 # number of sentiment classes (neg / pos); a single output neuron would also work
#print(input_size)

#hidden_1_size = 8
#hidden_2_size = 4
#hidden_3_size = 2
hidden_sizes_init = [8,4,2]

# Instantiate the neural network classifier
#classifier = Classifier(input_size, hidden_1_size, hidden_2_size, hidden_3_size, output_size)
classifier = Classifier(input_size, hidden_sizes_init[0], hidden_sizes_init[1], hidden_sizes_init[2], output_size)

d) Train the model on a batch size of 8 and 3 epochs.
If you run out of memory, further reduce the batch size and the hidden layers' sizes.

In [17]:
batch_size = 8
epochs = 3

# Train the model
train(classifier, train_dataset, batch_size, epochs)

Epoch 1/3 | Loss: 0.8023
Epoch 2/3 | Loss: 0.7004
Epoch 3/3 | Loss: 0.5302
Training complete.


## Task 3: Evaluate the Neural Network

a) Write a function that returns the accuracy of your trained model.

In [18]:
from sklearn.metrics import accuracy_score

def evaluate(clf, test_data):
    # Set the classifier to evaluation mode
    clf.eval()

    # Create a data loader that yields the whole test set as a single batch
    test_loader = DataLoader(test_data, batch_size=len(test_data))
    data, labels = next(iter(test_loader))

    # Forward pass without gradient tracking (not needed for evaluation)
    with torch.no_grad():
        outputs = clf(data)

    # Convert the predicted and true labels to numpy arrays
    predicted_labels = outputs.argmax(dim=1).numpy()
    true_labels = labels.numpy()

    # Compute and return the accuracy using scikit-learn's accuracy_score function
    return accuracy_score(true_labels, predicted_labels)
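
If the test set is too large to pass through the network as a single batch, the accuracy can also be accumulated batch by batch. The sketch below is one possible way to do this; `evaluate_batched` and the toy model are hypothetical and chosen so that the expected result is easy to verify by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_batched(clf, dataset, batch_size=256):
    """Accuracy of clf on dataset, computed batch by batch without gradients."""
    clf.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in DataLoader(dataset, batch_size=batch_size):
            predictions = clf(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Toy model whose logits equal its inputs (identity weights, no bias).
toy_model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(2))

toy_x = torch.tensor([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]])
toy_y = torch.tensor([0, 1, 1, 1])  # the third label disagrees with the prediction

accuracy = evaluate_batched(toy_model, TensorDataset(toy_x, toy_y), batch_size=3)
print(accuracy)  # 0.75
```

The predictions are `[0, 1, 0, 1]`, so three of the four labels match and the accuracy is 0.75, independent of the batch size.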

b) Evaluate the model.
- Test at least three different sets of parameters for the neural network (hidden sizes)
- Use higher values for the number of training epochs. What changes do you expect?

In [19]:
# FIRST EVALUATION

# Evaluate the model on the test data
test_accuracy = evaluate(classifier, test_dataset)

print("Test Accuracy (hidden layers sizes are [8,4,2]):", test_accuracy)

Test Accuracy (hidden layers sizes are [8,4,2]): 0.6668




---



Let's evaluate more systematically with `hidden_sizes = [(128, 64, 32), (64, 32, 16), (16, 8, 4)]` and `epochs = [3, 5, 7]`.
We expect more training epochs to increase the accuracy up to the point where the model starts to overfit the training data. More epochs also mean longer training times with diminishing returns, so the number of epochs should balance accuracy against training cost.

In [20]:
from sklearn.metrics import accuracy_score

hidden_sizes = [(128, 64, 32),(64, 32, 16),(16, 8, 4)]
epochs = [3, 5, 7]

best_accuracy = 0
best_params = None
best_epoch = 0

for hidden_size in hidden_sizes:
    for num_epochs in epochs:
        classifier = Classifier(input_size, hidden_size[0], hidden_size[1], hidden_size[2], output_size)

        # Train the classifier on the training dataset for the specified number of epochs
        train(classifier, train_dataset, batch_size, num_epochs)

        # Evaluate the classifier on the test dataset
        accuracy = evaluate(classifier, test_dataset)

        # Check if this parameter combination and epoch achieves the best accuracy so far
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = hidden_size
            best_epoch = num_epochs

# Print the best parameter combination, epoch, and the corresponding accuracy
print("Best Parameters:", best_params)
print("Best Epoch:", best_epoch)
print("Accuracy:", best_accuracy)

Epoch 1/3 | Loss: 0.4125
Epoch 2/3 | Loss: 0.0938
Epoch 3/3 | Loss: 0.0198
Training complete.
Epoch 1/5 | Loss: 0.4281
Epoch 2/5 | Loss: 0.1085
Epoch 3/5 | Loss: 0.0231
Epoch 4/5 | Loss: 0.0040
Epoch 5/5 | Loss: 0.0010
Training complete.
Epoch 1/7 | Loss: 0.4253
Epoch 2/7 | Loss: 0.1080
Epoch 3/7 | Loss: 0.0237
Epoch 4/7 | Loss: 0.0050
Epoch 5/7 | Loss: 0.0014
Epoch 6/7 | Loss: 0.0006
Epoch 7/7 | Loss: 0.0003
Training complete.
Epoch 1/3 | Loss: 0.4577
Epoch 2/3 | Loss: 0.1369
Epoch 3/3 | Loss: 0.0387
Training complete.
Epoch 1/5 | Loss: 0.4503
Epoch 2/5 | Loss: 0.1358
Epoch 3/5 | Loss: 0.0373
Epoch 4/5 | Loss: 0.0106
Epoch 5/5 | Loss: 0.0036
Training complete.
Epoch 1/7 | Loss: 0.4662
Epoch 2/7 | Loss: 0.1461
Epoch 3/7 | Loss: 0.0454
Epoch 4/7 | Loss: 0.0168
Epoch 5/7 | Loss: 0.0064
Epoch 6/7 | Loss: 0.0021
Epoch 7/7 | Loss: 0.0009
Training complete.
Epoch 1/3 | Loss: 0.4667
Epoch 2/3 | Loss: 0.1846
Epoch 3/3 | Loss: 0.0790
Training complete.
Epoch 1/5 | Loss: 0.5677
Epoch 2/5 | Loss:

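So the evaluation confirms that more epochs have diminishing returns: in the runs logged above, the training loss is already below 0.1 after three epochs, so further epochs mostly fit the training data more closely. Our best results were with a smaller neural network and only three epochs.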