## Sentiment Analysis

In this exercise we use the IMDb-dataset, which we will use to perform a sentiment analysis. The code below assumes that the data is placed in the same folder as this notebook. We see that the reviews are loaded as a pandas dataframe, and print the beginning of the first few reviews.

In [3]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


**(a)** Split the reviews and labels in test, train and validation sets. The train and validation sets will be used to train your model and tune hyperparameters, the test set will be saved for testing. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features`-parameter of `CountVectorizer`).

In [5]:
# Prepare the data (split and vectorize)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split the dataset into train, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(reviews[0], Y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

# Convert text data into count vectors using Bag-of-Words with 10,000 features
vectorizer = CountVectorizer(max_features=10000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

# Show data shapes and sample
print("\nStep (a) Output:")
print("Sample labels:")
print(Y.head())
print("Training set shape:", X_train_vec.shape)
print("Validation set shape:", X_val_vec.shape)
print("Test set shape:", X_test_vec.shape)
print("Vocabulary size:", len(vectorizer.vocabulary_))



Step (a) Output:
Sample labels:
   0
0  1
1  0
2  1
3  0
4  1
Training set shape: (15000, 10000)
Validation set shape: (5000, 10000)
Test set shape: (5000, 10000)
Vocabulary size: 10000


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [7]:
# Explore review representation
# Show how a single review is represented
example_review = X_train.iloc[0]
example_vector = vectorizer.transform([example_review])
print("\nStep (b) Output:")
print("Original review:")
print(example_review)
print("Non-zero indices in vectorized review:")
print(example_vector.nonzero()[1])
print("Vector shape:", example_vector.shape)



Step (b) Output:
Original review:
  birth of the beatles   for being a us television movie  released in the fall of     has actually been  so far the best movie which tells the tale of the the four lads from liverpool that revolutionized the music industry and the world . as told by the point of view of former beatle pete best . the performance from the entire cast is excellent but  most especially the performance by stephen mackenna as john lennon and rod culbertson as paul mccartney . the film was produced by a legend of the rock and roll era  mr dick clark . who a year earlier in     had produced another tv movie  that has stood the test of time starring  kurt rusell  in the lead role about another musical legend  elvis  . that movie was directed by an unknown director named  john carpenter  who went on to direct other successful movies such as  halloween    escape from new york   and  the thing  . the same can be said for the director of the  birth of the beatles   mr richard marq

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [9]:
# Train the neural network
from sklearn.neural_network import MLPClassifier

# Create and train a simple neural network
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=100, random_state=42)
model.fit(X_train_vec, y_train.values.ravel())

# Output confirmation
print("\nStep (c) Output:")
print("Model training completed")


Step (c) Output:
Model training completed


**(d)** Test your sentiment-classifier on the test set.

In [11]:
# Step (d): Validate and test the model
from sklearn.metrics import accuracy_score

# Predict on validation set and calculate accuracy
val_preds = model.predict(X_val_vec)
val_acc = accuracy_score(y_val.values.ravel(), val_preds)
print("\nStep (d) Output:")
print("Validation accuracy:", round(val_acc * 100, 2), "%")

# Predict on test set and calculate accuracy
test_preds = model.predict(X_test_vec)
test_acc = accuracy_score(y_test.values.ravel(), test_preds)
print("Test accuracy:", round(test_acc * 100, 2), "%")


Step (d) Output:
Validation accuracy: 88.36 %
Test accuracy: 86.8 %


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [22]:
# Classify custom sentences
# Create a few example reviews
sample_sentences = [
    "I loved this movie, it was amazing!",
    "I hated the movie. It was boring and too long.",
    "It was just okay. Not great, not terrible.",
    "The acting was superb and the story was touching.",
    "Worst movie ever. I regret watching it."
    "not bad"
]

# Convert the custom sentences to vectors and predict
sample_vec = vectorizer.transform(sample_sentences)
sample_preds = model.predict(sample_vec)

print("\nStep (e) Output:")
for sent, pred in zip(sample_sentences, sample_preds):
    sentiment = "Positive" if pred else "Negative"
    print(f"'{sent}' => {sentiment}")


Step (e) Output:
'I loved this movie, it was amazing!' => Positive
'I hated the movie. It was boring and too long.' => Negative
'It was just okay. Not great, not terrible.' => Negative
'The acting was superb and the story was touching.' => Positive
'Worst movie ever. I regret watching it.not bad' => Negative
