## Sentiment Analysis using RNN

Text Classification Task: Sentiment Analysis

- Data Preparation: We use the IMDb dataset of 50,000 movie reviews. Each review is labeled as either positive or negative.

- Preprocessing: Tokenize the text and convert words to integers. Pad sequences to ensure they have the same length.

- Model Definition: Use a recurrent (RNN) layer to capture the sequential nature of the reviews, then add a Dense layer with a sigmoid activation for binary classification.

- Training: Train the model on the training dataset.

- Evaluation: Test on a separate held-out set and evaluate performance using metrics such as accuracy or F1-score.
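
The preprocessing step above (words to integers, then padding) can be sketched in plain Python. This is only an illustration that assumes whitespace tokenization; `build_vocab`, `encode`, and `pad` are hypothetical helpers, not the Keras API used later in the notebook.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer index, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab):
    """Convert a text to a list of word indices, skipping words not in the vocabulary."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad(seq, max_length):
    """Keep the first max_length items and right-pad with zeros up to max_length."""
    return seq[:max_length] + [0] * max(0, max_length - len(seq))

reviews = ["great movie great cast", "boring movie"]
vocab = build_vocab(reviews)
padded = [pad(encode(r, vocab), max_length=5) for r in reviews]
print(vocab)
print(padded)
```

Every row of `padded` has the same length, which is what a batched recurrent layer needs. Note that this sketch truncates long sequences at the end, whereas Keras `pad_sequences` truncates from the start by default.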

**Problem Statement:**

The task is to classify each movie review as positive or negative using an RNN architecture. This is a worked example of a many-to-one model: it takes a sentence (a sequence of words) as input and outputs a single label, negative or positive.
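
To make the many-to-one idea concrete, here is a toy sketch of a one-unit tanh RNN whose final hidden state feeds a sigmoid output. The function name, the scalar inputs, and the weight values are assumptions chosen for illustration; a real model works on embedding vectors and learns its weights.

```python
import math

def rnn_many_to_one(sequence, w_in, w_rec, w_out):
    """Run a one-unit tanh RNN over `sequence` and return one sigmoid probability."""
    h = 0.0  # hidden state before the first token
    for x in sequence:
        # Each step mixes the current input with the previous hidden state
        h = math.tanh(w_in * x + w_rec * h)
    # Many-to-one: only the final hidden state feeds the output unit
    return 1.0 / (1.0 + math.exp(-w_out * h))

short_review = [0.2, -0.5, 0.9]
long_review = [0.2, -0.5, 0.9, 0.1, -0.3, 0.7]
p_short = rnn_many_to_one(short_review, w_in=0.8, w_rec=0.5, w_out=1.5)
p_long = rnn_many_to_one(long_review, w_in=0.8, w_rec=0.5, w_out=1.5)
print(p_short, p_long)
```

Whatever the input length, the function returns a single probability, which is the shape of output a binary sentiment classifier needs.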

In [21]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import metrics
from tensorflow.keras import optimizers
from tensorflow.keras.utils import plot_model

from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, Bidirectional

from numpy import asarray
from numpy import zeros

from tensorflow.keras.layers import GRU

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Embedding
from tensorflow.keras.layers import Flatten


import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from collections import Counter
from pathlib import Path
import os
import re
import string
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.corpus import wordnet
import unicodedata
import html
# A set makes the membership test in remove_stopwords fast
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yostina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yostina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yostina\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [22]:
!pip install kaggle

Defaulting to user installation because normal site-packages is not writeable


In [23]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [24]:
# `unzip` is not available in the Windows shell, so extract with the standard library instead
import shutil
shutil.unpack_archive("imdb-dataset-of-50k-movie-reviews.zip", ".")


In [25]:
raw_data = pd.read_csv(r"C:\Users\yostina\Desktop\archive (2).zip")
print(raw_data.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [26]:
X = raw_data['review']  # Features: reviews
raw_data['label'] = raw_data['sentiment'].map({'positive': 1, 'negative': 0})
y = raw_data['label']  # Labels: 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Training set size: 40000
Testing set size: 10000


### Data Preprocessing Pipeline

- remove_special_chars(text)

Purpose: Clean the input text by removing special characters and HTML entities.

Steps:

Compile a regex to match multiple spaces.

Convert text to lowercase.

Replace specific HTML character codes with their corresponding characters.

Replace newline characters and HTML tags with appropriate representations.

Use html.unescape to convert any remaining HTML entities.

Replace multiple spaces with a single space.

- remove_non_ascii(text)

Purpose: Eliminate non-ASCII characters from the text.

Steps:

Normalize the text to a compatible Unicode format.

Encode to ASCII, ignoring non-ASCII characters.

Decode back to UTF-8 format.

- to_lowercase(text)

Purpose: Convert all characters in the text to lowercase.

Steps:

Simply return the text converted to lowercase.

- remove_punctuation(text)

Purpose: Strip punctuation from the text.

Steps:

Create a translation table that maps punctuation characters to None.

Use the translation table to translate the text.

- replace_numbers(text)

Purpose: Remove all integer occurrences from the text.

Steps:

Use a regex to find and replace all digits with an empty string.

- remove_whitespaces(text)

Purpose: Trim leading and trailing whitespace from the text.

Steps:

Return the text after applying the strip() method.

- remove_stopwords(words, stop_words)

Purpose: Filter out common stopwords from a list of words.

Steps:

Return a list of words that are not present in the provided stop_words set.

- stem_words(words)

Purpose: Apply stemming to a list of words.

Steps:

Create an instance of a stemmer.

Return a list of stemmed words using the stemmer.

- lemmatize_words(words)

Purpose: Lemmatize words in the text to their base form.

Steps:

Create an instance of a lemmatizer.

Return a list of lemmatized words.

- lemmatize_verbs(words)

Purpose: Specifically lemmatize verbs in the text.

Steps:

Create an instance of a lemmatizer.

Return a string of lemmatized verbs, maintaining space between words.

- text2words(text)

Purpose: Tokenize the input text into a list of words.

Steps:

Use a word tokenizer to split the text into individual words and return the list.
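
The character-level cleaning steps described above can be tried on a single string using only the standard library. This is a simplified sketch: the sample sentence is made up, and the manual entity replacements in `remove_special_chars` are left out in favor of `html.unescape` alone.

```python
import html
import re
import string
import unicodedata

sample = "  Caf\u00e9 Society was GREAT &amp; fun!<br />10/10  "

# 1. Lowercase, turn <br /> into a newline, decode HTML entities, collapse repeated spaces
text = html.unescape(sample.lower().replace("<br />", "\n"))
text = re.sub(r"  +", " ", text)

# 2. Decompose accented characters (NFKD) and drop anything that is not ASCII
text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

# 3. Delete punctuation with a translation table
text = text.translate(str.maketrans("", "", string.punctuation))

# 4. Delete digits, then trim surrounding whitespace
text = re.sub(r"\d+", "", text).strip()

print(text.split())  # ['cafe', 'society', 'was', 'great', 'fun']
```

Stopword removal and lemmatization then operate on the resulting token list.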

In [27]:

def remove_special_chars(text):
    re1 = re.compile(r'  +')
    x1 = text.lower().replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x1))


def remove_non_ascii(text):
    """Remove non-ASCII characters from a string"""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')


def to_lowercase(text):
    return text.lower()



def remove_punctuation(text):
    """Remove punctuation characters from a string"""
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


def replace_numbers(text):
    """Remove all runs of digits from a string"""
    return re.sub(r'\d+', '', text)


def remove_whitespaces(text):
    return text.strip()


def remove_stopwords(words, stop_words):
    """Return the words that are not in stop_words.

    :param words: list of tokens
    :param stop_words: collection of stopwords to drop, e.g. NLTK's
        stopwords.words('english'),
        sklearn.feature_extraction.text.ENGLISH_STOP_WORDS or
        spacy.lang.en.stop_words.STOP_WORDS
    :return: filtered list of tokens
    """
    return [word for word in words if word not in stop_words]


def stem_words(words):
    """Stem words in text"""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]

def lemmatize_words(words):
    """Lemmatize words in text"""

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

def lemmatize_verbs(words):
    """Lemmatize verbs in text"""

    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word, pos='v') for word in words])

def text2words(text):
    return word_tokenize(text)

def normalize_text(text):
    text = remove_special_chars(text)
    text = remove_non_ascii(text)
    text = remove_punctuation(text)
    text = to_lowercase(text)
    text = replace_numbers(text)
    words = text2words(text)
    words = remove_stopwords(words, stop_words)
    # words = stem_words(words)  # Either stem or lemmatize, not both
    words = lemmatize_words(words)
    # lemmatize_verbs returns the tokens already joined into one space-separated string
    return lemmatize_verbs(words)

In [28]:
def normalize_corpus(corpus):
    return [normalize_text(t) for t in corpus]

In [29]:
proc_X_train = normalize_corpus(X_train)
proc_X_test = normalize_corpus(X_test)

In [30]:
# `wget` is not available in the Windows shell, so download with the standard library instead
from urllib.request import urlretrieve
urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", "glove.6B.zip")


In [31]:
# `unzip` is not available in the Windows shell, so extract with the standard library instead
import shutil
shutil.unpack_archive("glove.6B.zip", "glove")


### Using Pre-trained GloVe 6B Word Embeddings with a Recurrent Layer (Bi-Directional LSTM or GRU)

Overview of Layers

- Embedding Layer:

This layer initializes with the pre-trained GloVe embeddings.

You will need to load the GloVe vectors and create an embedding matrix where each word index corresponds to its GloVe vector.

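- Bidirectional LSTM Layer:

This layer processes sequences in both forward and backward directions, capturing context from both sides.
It wraps two LSTM layers: one reads the sequence forward and the other reads it backward. Note that the code in this section uses a single unidirectional `GRU(100)` layer followed by Dropout rather than a bidirectional LSTM.

- Dense Layer(s):

Typically, you'll have one or more fully connected layers to produce the final predictions.
The last dense layer uses a softmax activation for multi-class tasks; for binary classification, as here, a single unit with a sigmoid activation is used.

- Output Layer:

This layer generates the final predictions. Here it is the single-unit sigmoid Dense layer, which outputs the probability that a review is positive.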
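
The bidirectional idea from the list above can be illustrated with a toy recurrence in plain Python: run the same sequence forward and reversed, then concatenate the two final states. This is a sketch and not an LSTM; the function names and weight values are assumptions, and both directions share weights only for brevity, whereas a real bidirectional layer learns separate weights per direction.

```python
import math

def final_state(sequence, w_in, w_rec):
    """Return the last hidden state of a one-unit tanh recurrence."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def bidirectional_state(sequence, w_in=0.8, w_rec=0.5):
    """Concatenate the final states of a forward and a backward pass."""
    forward = final_state(sequence, w_in, w_rec)         # reads left to right
    backward = final_state(sequence[::-1], w_in, w_rec)  # reads right to left
    return [forward, backward]

state = bidirectional_state([0.1, 0.9, -0.4])
print(state)
```

In Keras the equivalent construction is the `Bidirectional` wrapper around a recurrent layer, for example `Bidirectional(LSTM(100))`, whose default merge mode concatenates the two directions.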

In [32]:
import zipfile

zip_path = r"C:\Users\yostina\Desktop\glove.6B.zip"  # Path to the ZIP file
extract_to = r"C:\Users\yostina\Desktop\glove"       # Folder to extract into

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print("Extraction completed successfully!")


Extraction completed successfully!


In [34]:
# Prepare tokenizer
t = Tokenizer()
t.fit_on_texts(proc_X_train)  # Fit on training data only
vocab_size = len(t.word_index) + 1

# Integer encode the training documents
encoded_train_docs = t.texts_to_sequences(proc_X_train)
# Integer encode the testing documents
encoded_test_docs = t.texts_to_sequences(proc_X_test)

# Pad documents to a max length of 100 words (adjust as necessary)
max_length = 100
padded_train_docs = pad_sequences(encoded_train_docs, maxlen=max_length, padding='post')
padded_test_docs = pad_sequences(encoded_test_docs, maxlen=max_length, padding='post')

# Load the whole embedding into memory (make sure to have the GloVe file)
embeddings_index = dict()
with open(r"C:\Users\yostina\Desktop\glove\glove.6B.100d.txt", mode='rt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# Create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Define model using GRU
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False))
model.add(GRU(100))  # GRU layer
model.add(Dropout(0.5))  # Dropout layer to prevent overfitting
model.add(Dense(1, activation='sigmoid'))  # Output layer

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# Summarize the model
model.summary()

# Fit the model
model.fit(padded_train_docs, y_train, epochs=5, batch_size = 32, verbose=1)  # Reduced epochs for quicker training

# Evaluate the model on the test set
loss, accuracy = model.evaluate(padded_test_docs, y_test, verbose=0)
print('Test Accuracy: %f' % (accuracy * 100))


Loaded 400000 word vectors.




Epoch 1/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 91s 70ms/step - acc: 0.7037 - loss: 0.5498
Epoch 2/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 84s 67ms/step - acc: 0.8435 - loss: 0.3599
Epoch 3/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 84s 67ms/step - acc: 0.8679 - loss: 0.3146
Epoch 4/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 133s 59ms/step - acc: 0.8798 - loss: 0.2923
Epoch 5/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 72s 57ms/step - acc: 0.8897 - loss: 0.2707
Test Accuracy: 87.040001
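
The overview lists F1-score as an evaluation metric, but the run above reports only accuracy. Below is a minimal sketch of computing precision, recall, and F1 from thresholded sigmoid outputs. The `labels` and `probs` values are made up for illustration; in the notebook they would presumably come from `y_test` and `model.predict(padded_test_docs)`.

```python
def binary_metrics(labels, probs, threshold=0.5):
    """Compute accuracy, precision, recall and F1 for binary predictions."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

labels = [1, 0, 1, 1, 0, 0]              # hypothetical true labels
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]   # hypothetical sigmoid outputs
scores = binary_metrics(labels, probs)
print(scores)
```

On a roughly balanced dataset such as this one, accuracy and F1 tend to be close, but F1 becomes more informative when one class dominates.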
