## Sentiment Analysis

In this exercise we use the IMDb-dataset, which we will use to perform a sentiment analysis. The code below assumes that the data is placed in the same folder as this notebook. We see that the reviews are loaded as a pandas dataframe, and print the beginning of the first few reviews.

In [38]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


In [39]:
Y.head()

Unnamed: 0,0
0,1
1,0
2,1
3,0
4,1


**(a)** Split the reviews and labels in test, train and validation sets. The train and validation sets will be used to train your model and tune hyperparameters, the test set will be saved for testing. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features`-parameter of `CountVectorizer`).

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

**(d)** Test your sentiment-classifier on the test set.

**(e)** Use the classifier to classify a few sentences you write yourselves. 

#### (a) Split the reviews and labels in test, train and validation sets. The train and validation sets will be used to train your model and tune hyperparameters, the test set will be saved for testing. Use the CountVectorizer from sklearn.feature_extraction.text to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the max_features-parameter of CountVectorizer).

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load the data
reviews = pd.read_csv('reviews.txt', header=None, names=['review'])
labels = pd.read_csv('labels.txt', header=None, names=['label'])

# Split the data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(
    reviews['review'], labels['label'], test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Create the Bag-of-Words representation using CountVectorizer
vectorizer = CountVectorizer(max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)
X_test_bow = vectorizer.transform(X_test)

# Print the shapes of the datasets
print("Train set shape:", X_train_bow.shape)
print("Validation set shape:", X_val_bow.shape)
print("Test set shape:", X_test_bow.shape)

Train set shape: (17500, 10000)
Validation set shape: (3750, 10000)
Test set shape: (3750, 10000)


### (b) Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [50]:
# Get the feature names (words in the vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Single word representation
word_index = feature_names.tolist().index('good')  # Example: word "good"
print(f"The word 'good' is represented as index: {word_index}")

# Representation of a single review
sample_review = X_train.iloc[0]  # First review in the training set
sample_review_bow = vectorizer.transform([sample_review])

print(f"Original review: {sample_review}")
print(f"Bag-of-Words representation (non-zero entries): {sample_review_bow}")
print(f"Feature indices with non-zero values: {sample_review_bow.indices}")
print(f"Counts of the corresponding words: {sample_review_bow.data}")

The word 'good' is represented as index: 3841
Original review: this movie is a joke and must be one of the worst movies stallone ever made . this is a typical   s movie where you have one man destroying the whole army by himself .  first blood pt .   is very similar to schwarzenegger  s  commando   but there you have arnold killing the terrorist while here you have a specific nation showed as the bad guys . this movie is a typical american anti  soviet propaganda . true  this was the peak of the cold war  but i  m sick of having communists or the nazis always being shown as the enemy . there are so many american movies that have this one thing in common . why can  t there a movie that show americans as the enemy  who  s going to believe that one lone soldier will destroy the whole army  do you really think that something like this would have really happened  by the looks of it  an average  brain washed american viewer certainly would .  
Bag-of-Words representation (non-zero entries): 

### (c) Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy.

In [69]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [70]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train_bow.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor((y_train == "positive").astype(int).values, dtype=torch.long)
X_val_tensor = torch.tensor(X_val_bow.toarray(), dtype=torch.float32)
y_val_tensor = torch.tensor((y_val == "positive").astype(int).values, dtype=torch.long)

# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)

# Define the neural network
class SentimentClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SentimentClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return self.softmax(x)

# Initialize the model, loss function, and optimizer
input_size = 10000
hidden_size = 128
output_size = 2  
model = SentimentClassifier(input_size, hidden_size, output_size)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    # Validation accuracy
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
    val_accuracy = correct / total
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

Epoch 1, Loss: 128.3367, Validation Accuracy: 0.8672
Epoch 2, Loss: 106.8789, Validation Accuracy: 0.8827
Epoch 3, Loss: 101.7261, Validation Accuracy: 0.8779
Epoch 4, Loss: 98.4121, Validation Accuracy: 0.8747
Epoch 5, Loss: 96.7635, Validation Accuracy: 0.8709
Epoch 6, Loss: 94.5402, Validation Accuracy: 0.8765
Epoch 7, Loss: 94.5927, Validation Accuracy: 0.8749
Epoch 8, Loss: 93.7861, Validation Accuracy: 0.8712
Epoch 9, Loss: 93.9275, Validation Accuracy: 0.8715
Epoch 10, Loss: 93.1081, Validation Accuracy: 0.8736


### (d) Test your sentiment-classifier on the test set.

In [55]:
# Convert test data to PyTorch tensors
X_test_tensor = torch.tensor(X_test_bow.toarray(), dtype=torch.float32)
y_test_tensor = torch.tensor((y_test == "positive").astype(int).values, dtype=torch.long)

# Test accuracy
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=64)

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)
        _, predicted = torch.max(outputs, 1)
        total += y_batch.size(0)
        correct += (predicted == y_batch).sum().item()
test_accuracy = correct / total
print(f"Test Accuracy: {test_accuracy:.4f}")

Test Accuracy: 0.8744


### (e) Use the classifier to classify a few sentences you write yourselves.

In [57]:
# Custom sentences
custom_sentences = [
    "This movie was fantastic! I loved it.",
    "Absolutely terrible. I hated every second of it.",
    "It was okay, not great but not bad either."
]

# Transform sentences to Bag-of-Words representation
custom_bow = vectorizer.transform(custom_sentences)
custom_tensor = torch.tensor(custom_bow.toarray(), dtype=torch.float32)

# Predict sentiment
model.eval()
with torch.no_grad():
    outputs = model(custom_tensor)
    _, predicted = torch.max(outputs, 1)

# Map predictions to labels
predicted_labels = ["positive" if label == 1 else "negative" for label in predicted]
for sentence, label in zip(custom_sentences, predicted_labels):
    print(f"Sentence: {sentence}\nPredicted Sentiment: {label}\n")

Sentence: This movie was fantastic! I loved it.
Predicted Sentiment: positive

Sentence: Absolutely terrible. I hated every second of it.
Predicted Sentiment: negative

Sentence: It was okay, not great but not bad either.
Predicted Sentiment: negative

