## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. It loads the reviews into a pandas DataFrame and prints the beginning of the first few reviews.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (set the `max_features` parameter of `CountVectorizer`).

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to maximize accuracy on the validation set.

**(d)** Evaluate your sentiment classifier on the test set.

**(e)** Use the classifier to classify a few sentences you write yourselves. 
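
As a rough illustration for part (b), the sketch below builds a binary Bag-of-Words representation by hand on a toy corpus. It is not how `CountVectorizer` is implemented: the helper names `tokenize`, `build_vocabulary` and `encode` and the toy documents are made up for this example. It only mimics the behaviour relevant to this exercise, namely lowercasing, tokens of at least two word characters, keeping the most frequent words, and marking presence with 0 or 1.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs, max_features):
    # Count every token over the whole corpus and keep the most frequent ones
    counts = Counter(token for doc in docs for token in tokenize(doc))
    kept = [token for token, _ in counts.most_common(max_features)]
    # Assign column indices in alphabetical order
    return {token: index for index, token in enumerate(sorted(kept))}


def encode(text, vocabulary):
    # Binary vector: 1 if the vocabulary word occurs in the text, else 0
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


# Toy corpus (hypothetical, for illustration only)
docs = ["Goodnight moon, goodnight moon", "Hello world", "Goodnight world"]
vocabulary = build_vocabulary(docs, max_features=3)

print(vocabulary)                                                 # {'goodnight': 0, 'moon': 1, 'world': 2}
print(encode("moon", vocabulary))                                 # [0, 1, 0]
print(encode("Goodnight cow jumping over the moon", vocabulary))  # [1, 1, 0]
```

In this toy run "hello" is dropped because it is the least frequent token. A single word maps to a one-hot vector, and a whole text maps to a vector with a 1 for every vocabulary word it contains, however often that word is repeated.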

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow as tf
from tensorflow import keras

In [None]:
# Split the data into train (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(reviews, Y, test_size=0.3, random_state=43)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=43)

vectorizer = CountVectorizer(max_features=10000, binary=True)

# Transform the text data into a Bag-of-Words representation
X_train_bow = vectorizer.fit_transform(X_train[0])
X_val_bow = vectorizer.transform(X_val[0])
X_test_bow = vectorizer.transform(X_test[0])



In [None]:
# A single word corresponds to one position in the vocabulary: as a vector it is one-hot,
# with a 1 at the word's index in the vocabulary and 0 everywhere else.

# A whole review is represented as a binary vector with a 1 at the index of every vocabulary word present in the review
# and 0 at the positions of words that are not in the review. Because binary=True, repeated words still give a 1,
# and words outside the vocabulary are ignored. Most entries are 0, so the reviews are stored as a sparse matrix.


# Example (toy vocabulary for illustration; CountVectorizer lowercases the text by default)
vocabulary = ["hello", "world", "goodnight", "moon"]
word_vector_for_moon = [0, 0, 0, 1]
sentence = "Goodnight moon, goodnight moon, goodnight cow jumping over the moon"
sentence_vector = [0, 0, 1, 1]


In [None]:
# Define a simple neural network model
model = keras.Sequential([
    # Input size equals the vocabulary size (at most 10,000 features)
    keras.layers.Dense(64, activation='relu', input_shape=(X_train_bow.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_bow.toarray(), y_train, validation_data=(X_val_bow.toarray(), y_val), epochs=5, batch_size=32)


In [None]:
test_loss, test_accuracy = model.evaluate(X_test_bow.toarray(), y_test)
print("Test Accuracy: {:.4f}".format(test_accuracy))


In [None]:
sentences = [ "I love this movie, it's awfully great!",
              "This is the worst film I've ever seen.", 
              "My reaction to this movie was not neutral in a not unpleasant way",
              "Not good, but great!",
              "Hello, world.", 
              "Goodbye, world.",
              "pokemon the movie was a terrible film . unlike the first one  this is not a good film at all . the graphics were decent but the story was flat and no real drama was built up in it . in the first one the interaction between the characters were decent . the subtraction of brock and addition of tracey was bad . tracey really doesn  t have much to say or do  and unlike brock offers no comic relief . the only good points is you get to see misty actually get jelous over ash  and her early brooding over being called his girlfriend was entertaining . overall this film isn  t worth renting and the short movie before didn  t do anything for me or my wife . and we do consider ourselves pokemon fans . oh well  maybe the next one will be better . cant ge t much worse  "]

# Vectorize the sentences using the same CountVectorizer
sentences_bow = vectorizer.transform(sentences)

# Predict sentiment for the sentences
predictions = model.predict(sentences_bow.toarray())

# Print the predictions
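for sentence, prediction in zip(sentences, predictions):
    # Each prediction is a length-1 array holding the predicted probability of a positive review
    probability = prediction[0]
    sentiment = "positive" if probability > 0.5 else "negative"
    print(f"Sentence: '{sentence}' - Predicted Sentiment: {sentiment} ({probability:.2f})")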
