# Natural Language Processing 2024 – Ex. 2

**Add the names and IDs of the submitting students here:**

1.

2.

3.


In this exercise we will perform sentiment analysis on the IMDB movie review dataset.

The dataset has around 50K movie reviews, each labeled as "positive" or "negative".

Given a review, our goal is to classify it as positive or negative. This task is called "Sentiment Analysis".

Below is a suggested order of implementation; you can follow it or do it your own way.

The exercise has several stages:

1. Downloading and cleaning the data
2. Running some basic analysis
3. Training a Feed Forward network to perform the task using classification
4. Training a Bi-Dir LSTM to perform the task
5. Playing with parameters to see if we get better results

Please submit the notebook after it has been run. Grades will be given for clean code with comments and explanations.

In [2]:
import nltk
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yaniv\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yaniv\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yaniv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Data download and cleaning

1. Download the IMDB dataset.

2. Clean the data:
* Remove URLs, HTML tags and non-alphanumeric characters
* Remove stop-words (use NLTK)
* Lowercase the dataset
* (Optional) Anything else you think can help...

Show one example of a review before and after this cleaning (find a review which has at least one URL, HTML tag, or non-alphanumeric character).



In [3]:
# Your code here
import re


def clean_text(text):
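    # Remove HTML tags (replace with a space so "end.<br />Start" does not fuse words)
    text = re.sub(r'<.*?>', ' ', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove non-alphanumeric characters ([\W_] also covers the underscore, which \W keeps)
    text = re.sub(r'[\W_]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # TODO: maybe delete apostrophes instead of replacing them with a space,
    # so that "she's" becomes "shes" rather than "she s".
    # We still need to check how the next step handles this case.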
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    text = ' '.join([word for word in words if word not in stop_words])
    # TODO: removing every stop word is probably too aggressive.
    # For example, "not good" turns into "good".
    # One option is to take sentiment-bearing words such as "not"
    # out of the stop-word list before filtering.
    return text


def load_text(filename):
    # filename: path to the CSV file with the reviews and their labels.
    # header=None tells pandas that the file has no header row.
    data = pd.read_csv(filename, header=None, names=['review', 'sentiment'])

    # Inspect the data
    print(data.head())
    print(data.info())

    # Check for missing values
    print(data.isnull().sum())

    # Apply the cleaning function to the review column
    data['review'] = data['review'].apply(clean_text)
    return data


# Example usage
example_review = "This is an example review with HTML <b>bold</b> tags and a URL: https://example.com"
cleaned_review = clean_text(example_review)
print("Original:", example_review)
print("Cleaned:", cleaned_review)

# Load the data
data = load_text('IMDB_Dataset.csv')
print("cleaned text printing:\n")
print(data.head())


Original: This is an example review with HTML <b>bold</b> tags and a URL: https://example.com
Cleaned: example review html bold tags url


FileNotFoundError: [Errno 2] No such file or directory: 'IMDB_Dataset.csv'
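
The comments in `clean_text` point out that removing every stop word turns "not good" into "good". One possible workaround, sketched here under assumptions, is to subtract a small set of negation words from the stop-word list before filtering. `BASE_STOP_WORDS` is a tiny illustrative stand-in for `set(stopwords.words('english'))`, and `NEGATIONS` and `remove_stop_words` are hypothetical names that are not part of the notebook.

```python
# Illustrative stand-in for set(stopwords.words('english')); an assumption for this sketch.
BASE_STOP_WORDS = {"this", "is", "a", "the", "was", "not", "no", "and", "it", "very"}

# Hypothetical set of negation words to keep because they flip the sentiment.
NEGATIONS = {"not", "no", "nor", "never"}

# Stop words that are actually removed: everything except the negations.
STOP_WORDS = BASE_STOP_WORDS - NEGATIONS


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from lowercased, whitespace-separated text."""
    return " ".join(word for word in text.split() if word not in stop_words)


print(remove_stop_words("this movie was not good"))                   # movie not good
print(remove_stop_words("this movie was not good", BASE_STOP_WORDS))  # movie good
```

Whether keeping negations actually helps the classifier is something to verify experimentally, for example by training once with each stop-word list.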

# Tokenization

1. Tokenize the dataset (you can tokenize using spaces or use more robust methods from NLTK)
2. (Optional) Lemmatize the text (you can use NLTK); this can improve results
3. Lemmatization should be done carefully, so that we don't lose too much information.
4. Show an example of 3 sentences before and after this process

In [12]:
from nltk.corpus import wordnet
from nltk.tag import pos_tag

# We decided to use POS tagging.
# By default the lemmatizer treats every word as a noun,
# so a word like "was" would be converted into "wa".

auxiliary_verbs = {"am", "is", "are", "was", "were", "be", "being", "been", "will", "shall", "would", "should", "can", "could", "may", "might", "must", "do", "does", "did", "have", "has", "had"}

def get_wordnet_pos(treebank_tag):
    """Map POS tag to the format accepted by WordNetLemmatizer."""
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    return tag_dict.get(treebank_tag[0].upper(), wordnet.NOUN)


def tokenize_and_lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    print("POS Tags:", pos_tags)
    lemmatized_tokens = []
    for token, pos in pos_tags:
        if token.lower() in auxiliary_verbs:  # lower() is redundant once the text was lowercased in preprocessing
            # Skip lemmatization for auxiliary verbs
            lemmatized_tokens.append(token)
        else:
            lemma = lemmatizer.lemmatize(token, get_wordnet_pos(pos))
            lemmatized_tokens.append(lemma)
            
    return ' '.join(lemmatized_tokens)

# Example sentences
example_sentences = [
    "The cats are chasing the mice.",
    "He was running late for the meeting.",
    "She's enjoying the sunny weather."
]

# Process and display the examples
for sentence in example_sentences:
    processed = tokenize_and_lemmatize(sentence)
    print("Original:", sentence)
    print("Processed:", processed, "\n")

POS Tags: [('The', 'DT'), ('cats', 'NNS'), ('are', 'VBP'), ('chasing', 'VBG'), ('the', 'DT'), ('mice', 'NN'), ('.', '.')]
Original: The cats are chasing the mice.
Processed: The cat are chase the mouse . 

POS Tags: [('He', 'PRP'), ('was', 'VBD'), ('running', 'VBG'), ('late', 'RB'), ('for', 'IN'), ('the', 'DT'), ('meeting', 'NN'), ('.', '.')]
Original: He was running late for the meeting.
Processed: He was run late for the meeting . 

POS Tags: [('She', 'PRP'), ("'s", 'VBZ'), ('enjoying', 'VBG'), ('the', 'DT'), ('sunny', 'NN'), ('weather', 'NN'), ('.', '.')]
Original: She's enjoying the sunny weather.
Processed: She 's enjoy the sunny weather . 



# Basic analysis

Perform some analysis on the data:
1. Show the percentage of negative/positive reviews (label balance)
2. Plot some statistics on the review lengths (after our cleaning process)
3. (Optional) Show anything else you think is important
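
As a rough illustration of the first two items, the sketch below computes the label percentages and a few length statistics with the standard library only. The `reviews` and `labels` lists are a tiny made-up sample, not the real dataset.

```python
from collections import Counter
from statistics import mean, median

# Tiny hypothetical sample standing in for the cleaned reviews and their labels.
reviews = [
    "great film loved every minute",
    "boring plot bad acting",
    "wonderful cast",
    "terrible waste time money",
]
labels = ["positive", "negative", "positive", "negative"]

# 1. Label balance: share of each label as a percentage.
label_counts = Counter(labels)
label_percentages = {
    label: 100 * count / len(labels) for label, count in label_counts.items()
}
print("Label percentages:", label_percentages)

# 2. Review length statistics, measured in whitespace-separated tokens.
lengths = [len(review.split()) for review in reviews]
print("min:", min(lengths), "max:", max(lengths))
print("mean:", mean(lengths), "median:", median(lengths))
```

With the pandas DataFrame from the cleaning step, `data['sentiment'].value_counts(normalize=True)` should give the same proportions, and the `lengths` list can be passed to a histogram plot.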

# Preparing the dataset for training
We can also use GloVe or previously used models as the first layer.
1. Choose your vocabulary size K (should be between 1000 and 3000)
2. Find the top K frequent words in your database
3. Create word indexes like we did in class. Replace any word that is not in your top K words with \<UNK\>. Remember to add an index for the \<PAD\> token.
4. Create a new dataset with indexes instead of words later to be used for training
5. Convert your labels to numeric representation (that your network can deal with).
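
The steps above could look roughly like the following sketch. The helper names `build_vocab` and `encode`, the choice of index 0 for `<PAD>` and 1 for `<UNK>`, and the toy corpus are assumptions made for illustration; the indexing scheme from class may differ.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(tokenized_reviews, k):
    """Map <PAD> to 0, <UNK> to 1, and the k most frequent words to 2..k+1."""
    counts = Counter(token for review in tokenized_reviews for token in review)
    word_to_index = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(k):
        word_to_index[word] = len(word_to_index)
    return word_to_index


def encode(review, word_to_index):
    """Replace every token with its index, falling back to the <UNK> index."""
    unk_index = word_to_index[UNK]
    return [word_to_index.get(token, unk_index) for token in review]


# Hypothetical toy corpus; K is tiny here only so that <UNK> shows up.
corpus = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab = build_vocab(corpus, k=2)
label_to_index = {"negative": 0, "positive": 1}

print(vocab)
print(encode(["good", "plot", "movie"], vocab))   # [2, 1, 3]
print([label_to_index[label] for label in ["positive", "negative"]])
```

In the exercise itself `k` would be between 1000 and 3000, so the vocabulary has `k + 2` entries once the two special tokens are counted.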

Split the dataset into 80% train and 20% test, and remember to keep the balance between labels!
We need to make sure both splits still contain enough examples of each label.
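
One way to keep the label balance is to split each label group separately, as in the sketch below. `stratified_split` is a hypothetical helper written with the standard library; scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split per label so both parts keep the original label proportions."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test += [(sample, label) for sample in group[:n_test]]
        train += [(sample, label) for sample in group[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical toy data: 10 positive and 10 negative encoded reviews.
samples = [[i] for i in range(20)]
labels = [1] * 10 + [0] * 10
train, test = stratified_split(samples, labels)
print(len(train), len(test))
print(sum(label for _, label in test), "positive reviews in the test set")
```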



In [None]:
# Your code here

# Training a feed forward neural network

For simplicity we will take only reviews with 500 words (after tokenization) or fewer.
For this part we will train a neural network that gets the full review as one input (like we had in our NER example in class) and outputs the label (positive or negative).
Remember that you need to PAD the words so all reviews will have the same length.
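
A minimal sketch of the length filter and the padding step is given below. It assumes the reviews are already encoded as lists of indexes and that index 0 is reserved for the `<PAD>` token; `filter_and_pad` is a hypothetical helper name.

```python
MAX_LEN = 500   # maximum review length from the exercise text
PAD_INDEX = 0   # assumed index of the <PAD> token


def filter_and_pad(encoded_reviews, labels, max_len=MAX_LEN, pad_index=PAD_INDEX):
    """Keep reviews with at most max_len tokens and right-pad them to max_len."""
    padded, kept_labels = [], []
    for review, label in zip(encoded_reviews, labels):
        if len(review) > max_len:
            continue  # drop reviews that are too long
        padded.append(review + [pad_index] * (max_len - len(review)))
        kept_labels.append(label)
    return padded, kept_labels


# Hypothetical encoded reviews; max_len=4 keeps the example readable.
reviews = [[5, 8], [3, 9, 4, 7, 2], [6]]
padded, kept = filter_and_pad(reviews, [1, 0, 1], max_len=4)
print(padded)  # [[5, 8, 0, 0], [6, 0, 0, 0]]
print(kept)    # [1, 1]
```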

For this section, please try at least 3 different network variants and show whether the results change. You can choose from the following:
1. Adding hidden layers to the network
2. Running with and without Dropout
3. Trying different optimizers
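
The sketch below shows one way to set up such variants in PyTorch. The `FeedForwardClassifier` class, its layer sizes, and the three configurations are assumptions chosen for illustration, not the architecture from class. It embeds each token, flattens the padded review into one vector, and passes it through a configurable stack of hidden layers.

```python
import torch
from torch import nn


class FeedForwardClassifier(nn.Module):
    """Embeds a padded review, flattens it, and classifies it with an MLP."""

    def __init__(self, vocab_size, embed_dim, seq_len, hidden_sizes, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        layers = []
        in_features = seq_len * embed_dim
        for hidden in hidden_sizes:
            layers += [nn.Linear(in_features, hidden), nn.ReLU(), nn.Dropout(dropout)]
            in_features = hidden
        layers.append(nn.Linear(in_features, 2))  # two classes: negative, positive
        self.classifier = nn.Sequential(*layers)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        return self.classifier(embedded.flatten(1))   # (batch, 2) logits


# Three hypothetical variants: baseline, extra hidden layer with dropout, other optimizer.
variants = {
    "one_hidden_adam": (dict(hidden_sizes=[64], dropout=0.0), torch.optim.Adam),
    "two_hidden_dropout_adam": (dict(hidden_sizes=[64, 32], dropout=0.5), torch.optim.Adam),
    "one_hidden_sgd": (dict(hidden_sizes=[64], dropout=0.0), torch.optim.SGD),
}

batch = torch.randint(0, 2002, (4, 500))  # 4 fake reviews of 500 token ids
for name, (kwargs, optimizer_cls) in variants.items():
    model = FeedForwardClassifier(vocab_size=2002, embed_dim=16, seq_len=500, **kwargs)
    optimizer = optimizer_cls(model.parameters(), lr=1e-3)
    print(name, model(batch).shape)
```

The vocabulary size of 2002 here assumes K = 2000 plus the `<PAD>` and `<UNK>` tokens.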

(Optional) Try to use the GloVe embeddings: create an embedding layer in your PyTorch model and initialize its weights with the loaded GloVe embeddings.
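
A possible way to initialize the embedding layer is `nn.Embedding.from_pretrained`, sketched below. The `glove` dictionary holds made-up 4-dimensional vectors and stands in for vectors parsed from a real GloVe file, and `word_to_index` is a toy vocabulary; both are assumptions for illustration.

```python
import torch
from torch import nn

# Stand-in for real GloVe vectors: in practice this dict would be parsed from a
# GloVe text file. Here it holds made-up 4-dimensional vectors (an assumption).
glove = {"good": [0.1, 0.2, 0.3, 0.4], "bad": [-0.1, -0.2, -0.3, -0.4]}
word_to_index = {"<PAD>": 0, "<UNK>": 1, "good": 2, "bad": 3, "plot": 4}
embed_dim = 4

# Start from small random vectors so words missing from GloVe still get one.
weights = torch.randn(len(word_to_index), embed_dim) * 0.1
weights[word_to_index["<PAD>"]] = 0.0  # keep the padding vector at zero
for word, index in word_to_index.items():
    if word in glove:
        weights[index] = torch.tensor(glove[word])

# freeze=False lets training fine-tune the pretrained vectors.
embedding = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)
print(embedding(torch.tensor([2, 0])))
```

Setting `freeze=True` instead keeps the pretrained vectors fixed, which may be worth comparing when looking at overfitting.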

For each option:

* Plot the train and test error during training, does your network overfit?

* Plot the final results of the network, including accuracy and confusion matrix
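
To plot train and test error, the values first have to be recorded once per epoch. The sketch below shows one way to do that; the random data, the tiny model, and the full-batch updates are simplifying assumptions, and `run_epoch` is a hypothetical helper.

```python
import torch
from torch import nn


def run_epoch(model, inputs, targets, loss_fn, optimizer=None):
    """Run one pass over the data; update weights only if an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    with torch.set_grad_enabled(training):
        logits = model(inputs)
        loss = loss_fn(logits, targets)
        if training:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Error rate: fraction of samples whose predicted class is wrong.
    error = (logits.argmax(dim=1) != targets).float().mean().item()
    return loss.item(), error


# Hypothetical stand-in data and model.
torch.manual_seed(0)
x_train, y_train = torch.randn(64, 10), torch.randint(0, 2, (64,))
x_test, y_test = torch.randn(16, 10), torch.randint(0, 2, (16,))
model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

history = {"train_error": [], "test_error": []}
for epoch in range(5):
    _, train_error = run_epoch(model, x_train, y_train, loss_fn, optimizer)
    _, test_error = run_epoch(model, x_test, y_test, loss_fn)
    history["train_error"].append(train_error)
    history["test_error"].append(test_error)
print(history)
```

The two lists in `history` can then be plotted against the epoch number. A train error that keeps falling while the test error rises is the usual sign of overfitting.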

In [None]:
# Your code here

# Training a BiDir LSTM neural network

Now do the same as in the previous section with a bi-directional LSTM.

Remember that the output of the LSTM should be connected to a small feed forward network to perform the actual classification.
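
One possible shape for such a model is sketched below. The class name `BiLSTMClassifier`, the layer sizes, and the use of the final hidden states as the review summary are assumptions for illustration; other designs, such as pooling over all time steps, are equally valid.

```python
import torch
from torch import nn


class BiLSTMClassifier(nn.Module):
    """Bi-directional LSTM followed by a small feed-forward classifier."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, 32), nn.ReLU(), nn.Linear(32, 2)
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # hidden has shape (num_layers * 2, batch, hidden_dim); the last two
        # entries are the final forward and backward states of the top layer.
        summary = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.classifier(summary)


model = BiLSTMClassifier(vocab_size=2002)
logits = model(torch.randint(0, 2002, (4, 50)))
print(logits.shape)  # torch.Size([4, 2])
```

With right-padded inputs the forward direction also reads the padding tokens. `torch.nn.utils.rnn.pack_padded_sequence` is one way to avoid that if it turns out to hurt the results.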

Here again you can play with the number of layers in the LSTM or in the small output network. Show only the best result you got.

* Plot the train and test error during training, does your network overfit?

* Plot the final results of the network, including accuracy and confusion matrix
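
Accuracy and the confusion matrix can be computed directly from the true and predicted labels, as in this small sketch. The label lists are made up, and the helper functions are hypothetical; scikit-learn offers equivalent functions.

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """Return matrix[t][p] = number of samples with true label t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels (0 = negative, 1 = positive) and predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))  # [[2, 1], [1, 2]]
print(accuracy(y_true, y_pred))
```

Rows correspond to the true label and columns to the predicted label, so the off-diagonal cells count the two kinds of mistakes.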

Are the results better than the previous section?




Finally, show 3 reviews from the test data that were labeled correctly and 3 that were not. Why do you think the network did not succeed on these examples?
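
Collecting these examples could be done along the lines of the sketch below. `pick_examples` is a hypothetical helper, and the reviews, labels, and predictions are made up for illustration.

```python
def pick_examples(reviews, y_true, y_pred, n=3):
    """Return up to n correctly and n incorrectly classified reviews."""
    correct, wrong = [], []
    for review, t, p in zip(reviews, y_true, y_pred):
        bucket = correct if t == p else wrong
        if len(bucket) < n:
            bucket.append((review, t, p))
    return correct, wrong


# Hypothetical test reviews with true and predicted labels (1 = positive).
reviews = ["loved it", "not bad at all", "awful", "great cast", "so boring", "fine"]
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0]
correct, wrong = pick_examples(reviews, y_true, y_pred)
for review, t, p in wrong:
    print(f"true={t} predicted={p}: {review}")
```

Printing the original (uncleaned) text next to each example may make it easier to see what the network missed, such as negation or sarcasm.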