# Natural Language Processing (NLP) for Sentiment Analysis

### Problem Statement:
Build a sentiment analysis model using NLP techniques. The goal is to classify text reviews
by sentiment (positive, negative, or neutral). The IMDb Movie Reviews dataset used here is
labeled only positive or negative, so the model in this notebook is a binary classifier.


1. **Objective:**
   - The primary objective is to develop a model for sentiment analysis, a task within NLP. Sentiment analysis involves determining the sentiment expressed in a piece of text, such as positive, negative, or neutral.

2. **NLP Techniques:**
   - Natural Language Processing involves the use of computational methods to understand and process human language. In this context, NLP techniques are employed to analyze and interpret text data, particularly for sentiment classification.

3. **Dataset:**
   - The dataset used for training is the IMDb Movie Reviews dataset. The copy loaded in this notebook contains 50,000 movie reviews, each labeled positive or negative; it has no neutral class.

4. **Sentiment Analysis:**
   - Sentiment analysis, also known as opinion mining, is a task in NLP that involves determining the sentiment expressed in a piece of text. It is often used to automatically classify text as positive, negative, or neutral based on the emotions or opinions conveyed.

5. **Input Data:**
   - The input data for this problem consists of text reviews, specifically movie reviews from the IMDb dataset. Each review is associated with a sentiment label indicating whether it is positive or negative.

6. **Model Development:**
   - The model needs to be designed to analyze the text and predict the sentiment. Techniques such as word embeddings (e.g., Word2Vec, GloVe) and recurrent neural networks (RNNs) or transformer models (e.g., BERT) are commonly used for sentiment analysis tasks.

7. **Training:**
   - The model is trained on the IMDb Movie Reviews dataset. During training, the model learns to map the input text reviews to their corresponding sentiment labels. The model's parameters are adjusted to minimize a loss function that measures the gap between predicted and actual labels.

8. **Evaluation:**
   - The performance of the sentiment analysis model is evaluated on held-out reviews; in this notebook, 30% of the IMDb data is reserved as a test set. Common evaluation metrics include accuracy, precision, recall, and F1 score.

9. **Application:**
   - Once the model is trained and validated, it can be used to classify the sentiment of new text reviews. This has practical applications in industries such as e-commerce, social media monitoring, and customer feedback analysis, where automated sentiment analysis can provide valuable insights.

# Generating Data

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
import re

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# Load dataset
data = pd.read_csv('/content/drive/MyDrive/dataset/IMDB-Dataset.csv')

In [3]:
print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [4]:
english_stop_words = set(stopwords.words('english'))

In [5]:
def load_dataset():
    df = pd.read_csv('/content/drive/MyDrive/dataset/IMDB-Dataset.csv')
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output

    # Pre-process review
    x_data = x_data.replace({'<.*?>': ''}, regex=True)          # remove HTML tags
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex=True)     # replace non-letters with spaces
    # NOTE: stop words are filtered before lowercasing, so capitalized ones
    # ('I', 'The', 'A') are not removed, as the printed reviews below show.
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stop_words])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case

    # Encode sentiment: positive -> 1, negative -> 0
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)

    return x_data, y_data
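
The cleaning steps in `load_dataset()` can be tried on a single string. The sketch below is an illustration, not the notebook's code: it lowercases before filtering so that capitalized stop words such as "The" are removed too, and it uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of NLTK's list so that it runs without any download. The helper name `clean_review` is likewise an assumption.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
DEMO_STOP_WORDS = {"a", "the", "i", "is", "was", "this", "it"}

def clean_review(text, stop_words=DEMO_STOP_WORDS):
    """Strip HTML tags and non-letters, lowercase, then drop stop words."""
    text = re.sub(r"<.*?>", "", text)        # remove HTML tags such as <br />
    text = re.sub(r"[^A-Za-z]", " ", text)   # replace non-letters with spaces
    tokens = text.lower().split()            # lowercase BEFORE filtering
    return [w for w in tokens if w not in stop_words]

sample = "A wonderful little production. <br /><br />The filming was great!"
print(clean_review(sample))
# ['wonderful', 'little', 'production', 'filming', 'great']
```

Switching the notebook itself to this order would change the vocabulary, so the encoded arrays and parameter counts recorded further down would no longer match.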

In [6]:
x_data, y_data = load_dataset()

In [7]:
print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64


In [8]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3)
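
The split above is unseeded, so each run produces a different partition; scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments to control this. The idea behind a seeded, stratified split can be sketched in plain Python. `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import random

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)                 # fixed seed -> reproducible split
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))          # 14 6
print(sum(labels[i] for i in test_idx))       # 3 positives among the 6 test items
```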

In [9]:
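# Note: 'Train Set' is printed before both feature splits (x_train, x_test),
# and 'Test Set' before both label splits (y_train, y_test).
print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)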

Train Set
28704    [if, like, occasionally, enjoy, watching, terr...
9562     [less, thriller, colorful, adventure, suspense...
27943    [for, big, thinkers, among, us, the, intruder,...
35643    [a, wounded, tonto, standing, alone, protect, ...
42738    [it, utterly, pointless, rate, film, it, would...
                               ...                        
8108     [blake, edwards, legendary, fiasco, begins, se...
29517    [a, young, solicitor, london, arthur, kidd, se...
9914     [the, german, regional, broadcast, station, wd...
28110    [yeesh, talk, craptastic, thing, brutal, horri...
21269    [the, book, movie, based, excellent, took, com...
Name: review, Length: 35000, dtype: object 

47555    [a, film, little, positive, say, firstly, zero...
39692    [i, read, almost, books, seen, musical, produc...
5475     [naach, a, detailed, review, obtained, anywher...
27621    [when, watching, show, quite, sure, whether, s...
44154    [this, film, worst, film, i, ever, seen, it, c...
 

In [10]:
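def get_max_length():
    # Despite the name, this returns the mean length (in tokens) of the training
    # reviews, rounded up. Longer reviews are truncated to it, shorter ones padded.
    review_length = [len(review) for review in x_train]
    return int(np.ceil(np.mean(review_length)))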

In [11]:
# Encode review
tokenizer = Tokenizer(lower=False)    # lowercasing is unnecessary: load_dataset() already lowercased the text
tokenizer.fit_on_texts(x_train)
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(tokenizer.word_index) + 1   # add 1 because index 0 is reserved for padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
# The value printed here is the mean review length from get_max_length(), not the maximum
print('Maximum review length: ', max_length)

Encoded X Train
 [[   55     6  1798 ...     0     0     0]
 [  244   615  3169 ...   308  1701   163]
 [  202   103 18170 ...     0     0     0]
 ...
 [    2   925 14836 ...     0     0     0]
 [41658   600 29090 ...     0     0     0]
 [    2   174     3 ...     0     0     0]] 

Encoded X Test
 [[   40     4    48 ...  2011  4148   165]
 [    1   245   120 ...  1598  2818 18028]
 [46426    40  3637 ...   107    83     8]
 ...
 [  692 19491 22209 ...     0     0     0]
 [    8    17   344 ...   132    16   317]
 [    2  2148  1991 ... 10555    50  1362]] 

Maximum review length:  130
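
What the encoding cell does can be mimicked on a toy corpus. The sketch below is a simplified stand-in for `Tokenizer` and `pad_sequences`, written under two assumptions: words are indexed from 1 in order of decreasing frequency, and words not seen during fitting are dropped (no out-of-vocabulary token is configured). The helper names `build_word_index`, `encode` and `pad_post` are hypothetical.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def encode(tokens, word_index):
    """Replace words by their index; words missing from the index are dropped."""
    return [word_index[w] for w in tokens if w in word_index]

def pad_post(seq, maxlen):
    """Cut off the end if too long, then fill the end with zeros if too short."""
    return seq[:maxlen] + [0] * max(0, maxlen - len(seq))

train = [["good", "movie"], ["bad", "movie", "bad", "plot"]]
word_index = build_word_index(train)
print(word_index)
# {'movie': 1, 'bad': 2, 'good': 3, 'plot': 4}
print([pad_post(encode(t, word_index), 3) for t in train])
# [[3, 1, 0], [2, 1, 2]]
print(pad_post(encode(["great", "movie"], word_index), 3))
# [1, 0, 0]   ('great' was never seen, so it is dropped)
```

The zeros at the end of most rows in the arrays above are this kind of post-padding; rows without zeros were at least `max_length` tokens long and were cut.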


In [49]:
# Architecture: embedding -> LSTM -> one sigmoid unit, trained with binary cross-entropy
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length=max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

print(model.summary())

# Save the model each time training accuracy improves (no validation metric is monitored)
checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 130, 32)           2804480   
                                                                 
 lstm_2 (LSTM)               (None, 64)                24832     
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2829377 (10.79 MB)
Trainable params: 2829377 (10.79 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
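
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the vocabulary size is not printed anywhere in the notebook, so it is inferred here from the embedding row (2804480 / 32), which is an assumption about `total_words`.

```python
def embedding_params(vocab_size, embed_dim):
    # one vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

EMBED_DIM, LSTM_OUT = 32, 64
vocab_size = 2804480 // EMBED_DIM     # inferred from the summary: 87640

emb = embedding_params(vocab_size, EMBED_DIM)
lstm = lstm_params(EMBED_DIM, LSTM_OUT)
dense = dense_params(LSTM_OUT, 1)
print(emb, lstm, dense, emb + lstm + dense)
# 2804480 24832 65 2829377
```

Almost all of the weights sit in the embedding table, so vocabulary size dominates the model size.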


In [64]:
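# Architecture variant: same layers, but a ReLU output unit trained with mean squared error.
# This is the model that is fitted and saved as models/RAB.h5 below.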
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length=max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

print(model.summary())

checkpoint = ModelCheckpoint(
    'models/RAB.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 130, 32)           2804480   
                                                                 
 lstm_5 (LSTM)               (None, 64)                24832     
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2829377 (10.79 MB)
Trainable params: 2829377 (10.79 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
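
The two architecture cells differ only in the output activation and the loss. The plain-Python sketch below shows how those functions behave on single values; it is an illustration of the definitions, not a claim about which variant trains better on this data. All function names are local to the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))      # always strictly between 0 and 1

def relu(z):
    return max(0.0, z)                     # 0 for negative inputs, unbounded above

def binary_cross_entropy(y_true, p, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)        # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def squared_error(y_true, y_hat):
    return (y_true - y_hat) ** 2

for z in (-2.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 3), relu(z))
# -2.0 0.119 0.0
# 0.0 0.5 0.0
# 3.0 0.953 3.0
```

A sigmoid output can be read as a probability of the positive class. A ReLU output can exceed 1 and is exactly 0 for every negative pre-activation, so thresholding still produces a class but the raw value is not a probability.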


In [65]:
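# Train for 5 epochs; no validation data is passed, so the accuracy logged below is training accuracy
model.fit(x_train, y_train, batch_size=128, epochs=5, callbacks=[checkpoint])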

Epoch 1/5
Epoch 1: accuracy improved from -inf to 0.66166, saving model to models/RAB.h5
Epoch 2/5
Epoch 2: accuracy improved from 0.66166 to 0.91154, saving model to models/RAB.h5
Epoch 3/5
Epoch 3: accuracy improved from 0.91154 to 0.95606, saving model to models/RAB.h5
Epoch 4/5
Epoch 4: accuracy improved from 0.95606 to 0.97366, saving model to models/RAB.h5
Epoch 5/5
Epoch 5: accuracy improved from 0.97366 to 0.97740, saving model to models/RAB.h5


<keras.src.callbacks.History at 0x7c2aa52cceb0>

In [66]:
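y_pred_probs = model.predict(x_test, batch_size=128)
y_pred_classes = (y_pred_probs > 0.5).astype(int)   # outputs above 0.5 count as positive

# Count the test reviews whose predicted class matches the true label
true = sum(1 for i, y in enumerate(y_test) if y == y_pred_classes[i])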

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred_classes) - true))
print('Accuracy: {:.2%}'.format(true / len(y_pred_classes)))

Correct Prediction: 13054
Wrong Prediction: 1946
Accuracy: 87.03%
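
The cell above reports accuracy only. Precision, recall and F1 follow from the same predictions by counting the four outcome types. The sketch below works on plain lists of 0/1 labels; `binary_metrics` is a hypothetical helper and the toy labels are made up for illustration, they are not the notebook's predictions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true_demo, y_pred_demo))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

With the notebook's arrays, the call would presumably look like `binary_metrics(list(y_test), y_pred_classes.flatten().tolist())`.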


In [67]:
loaded_model = load_model('models/RAB.h5')

In [86]:
user_review = input('Movie Review: ')   # input() already returns a string

Movie Review: 7.5


In [87]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
user_review = regex.sub('', user_review)
print('Cleaned: ', user_review)

Cleaned:  


In [88]:
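# Drop stop words, then lowercase (the same order as in load_dataset())
words = user_review.split(' ')
filtered = [w for w in words if w not in english_stop_words]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]   # one-element list: texts_to_sequences expects a list of texts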

print('Filtered: ', filtered)

Filtered:  ['']


In [89]:
tokenized_words = tokenizer.texts_to_sequences(filtered)
tokenized_words = pad_sequences(tokenized_words, maxlen=max_length, padding='post', truncating='post')
print(tokenized_words)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [90]:
result = loaded_model.predict(tokenized_words)
print(result)

[[0.9254229]]


In [91]:
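# predict() returns an array of shape (1, 1), so take the single score out of it.
# Note that the 0.7 cut-off is stricter than the 0.5 used for the test-set accuracy.
if result[0][0] >= 0.7: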
    print('positive')
else:
    print('negative')

positive
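
The input `7.5` contains no letters, so cleaning leaves an empty string and the model scores a sequence made up entirely of padding zeros; the "positive" result therefore says nothing about the input. One way to catch this is to refuse to predict when no known word remains. The sketch below is a minimal illustration with a hypothetical vocabulary (`demo_index`) and a hypothetical helper (`prepare_review`); in the notebook, `tokenizer.word_index` and `english_stop_words` would take their place.

```python
import re

def prepare_review(text, word_index, stop_words=frozenset()):
    """Return the token ids for text, or None when no known word remains."""
    cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()   # keep letters and whitespace
    tokens = [w for w in cleaned.split() if w not in stop_words]
    ids = [word_index[w] for w in tokens if w in word_index]
    return ids or None

demo_index = {"movie": 1, "bad": 2, "good": 3}   # hypothetical vocabulary

for review in ("7.5", "Good movie!"):
    ids = prepare_review(review, demo_index)
    if ids is None:
        print(repr(review), "-> nothing to classify")
    else:
        print(repr(review), "->", ids)
# '7.5' -> nothing to classify
# 'Good movie!' -> [3, 1]
```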
