# Project in NLP

We will build a text classification model using NLP. The dataset is the IMDB dataset, which is bundled with Keras and can be loaded with a single function call.

The dataset contains movie reviews, each labeled with its sentiment (positive or negative).

The task is to classify each review as positive or negative.

For the sake of simplicity, we keep only the 10,000 most frequent words (num_words=10000). You are free to explore with a larger vocabulary. The execution time increases with vocabulary size, because every review is encoded as a vector with one entry per word.

In [13]:
!pip install netron



In [29]:
import numpy as np
import netron
import nltk
from tensorflow.keras.preprocessing import sequence
from nltk.corpus import stopwords
from keras.utils import to_categorical
from keras import models
from keras import layers
from keras.datasets import imdb
from sklearn.metrics import confusion_matrix

This is a dataset of movie reviews from IMDB, labeled by sentiment (positive/negative): 25,000 reviews for training and 25,000 for testing. Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
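
To make the frequency-based filtering concrete, here is a minimal sketch on a toy review. The helper name filter_by_rank is hypothetical and is not part of Keras; it only mimics what the num_words and skip_top arguments of imdb.load_data do, assuming the Keras default of replacing filtered-out words with the out-of-vocabulary index 2.

```python
def filter_by_rank(review, num_words, skip_top=0, oov_char=2):
    """Keep indexes in [skip_top, num_words); replace the rest with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in review]

# Toy review: small integers stand for frequent words, large ones for rare words
toy_review = [5, 14, 25000, 22, 9999, 10000, 30]

# "Only consider the top 10,000 most common words, but eliminate the top 20"
filtered = filter_by_rank(toy_review, num_words=10000, skip_top=20)
print(filtered)
```

Words that fall outside the kept range are not dropped; they are all mapped to the same placeholder index, so the review keeps its length.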

In [30]:
(train_data, train_target), (test_data, test_target) = imdb.load_data(num_words=10000)
dt = np.concatenate((train_data, test_data), axis=0)
tar = np.concatenate((train_target, test_target), axis=0)

In [31]:
#print(train_data)
print(test_target)
print(tar)

[0 1 1 ... 0 0 0]
[1 0 0 ... 0 0 0]


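The function convert() turns each review (a list of word indexes) into a fixed-length multi-hot vector: position j is 1 if the word with index j occurs in the review and 0 otherwise.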

In [32]:
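def convert(sequences, dimension=10000):
    # One row per review, one column per word index, initially all zeros
    results = np.zeros((len(sequences), dimension))
    print(results.shape)
    print(results)
    for i, seq in enumerate(sequences):
        # Set the column of every word index that occurs in review i to 1
        results[i, seq] = 1
    return results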
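
As an illustration of this encoding, the following sketch applies the same idea to two toy reviews with a vocabulary of only 5 words. The name multi_hot and the toy data are assumptions made for this example.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Return a (len(sequences), dimension) array with 1s at the listed indexes."""
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1
    return results

# Two toy reviews; the second one repeats word index 2
toy_reviews = [[1, 3], [0, 2, 2, 4]]
encoded = multi_hot(toy_reviews, dimension=5)
print(encoded)
```

Note that a repeated word still produces a single 1, and the order of the words is lost: the vector only records which words are present.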

In [33]:
dt = convert(dt)
tar = np.array(tar).astype("float32")
# Hold out the first 9,000 reviews for evaluation; train on the remaining 41,000
test_x = dt[:9000]
test_y = tar[:9000]
train_x = dt[9000:]
train_y = tar[9000:]

(50000, 10000)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [34]:
model = models.Sequential()
# Input - Layer
model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))
# Hidden - Layers
model.add(layers.Dropout(0.4, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
# Output- Layer
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()

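# Save the (still untrained) model so that Netron can render its architecture
model.save(r'C:\Users\Shwetha V\Desktop\nlp_model.h5')

# Start a local Netron server to view the saved model in the browser
netron.start(r'C:\Users\Shwetha V\Desktop\nlp_model.h5')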

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_16 (Dense)            (None, 50)                500050    
                                                                 
 dropout_8 (Dropout)         (None, 50)                0         
                                                                 
 dense_17 (Dense)            (None, 50)                2550      
                                                                 
 dropout_9 (Dropout)         (None, 50)                0         
                                                                 
 dense_18 (Dense)            (None, 50)                2550      
                                                                 
 dense_19 (Dense)            (None, 1)                 51        
                                                                 
Total params: 505,201
Trainable params: 505,201
Non-tr

('localhost', 8080)
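
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers have no parameters. A short sketch of that arithmetic (the helper name dense_params is hypothetical):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the four Dense layers of the model above
layer_sizes = [(10000, 50), (50, 50), (50, 50), (50, 1)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because it connects each of the 10,000 input positions to 50 units.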

In [35]:
# Compile the model
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
# Train for 2 epochs, validating on the held-out reviews after each epoch
results = model.fit(
    train_x, train_y,
    epochs=2,
    batch_size=500,
    validation_data=(test_x, test_y)
)

# Note: this is the mean validation accuracy over both epochs
print("Test-Accuracy:", np.mean(results.history["val_accuracy"]))

Epoch 1/2
Epoch 2/2
Test-Accuracy: 0.8917222023010254


In [36]:
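preds = model.predict(test_x)

# Index of the review with the highest predicted probability of being positive
best_index = np.argmax(preds, axis=0)[0]
print(best_index)
# Look up the label in test_y, which is aligned with test_x
# (test_target is the original IMDB test split and is not)
print(int(test_y[best_index]))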

2689
1


In [37]:
# evaluate() returns [loss, accuracy] for the compiled loss and metric
accuracy = model.evaluate(test_x, test_y)
print('Accuracy', accuracy)

Accuracy [0.2708359360694885, 0.8899999856948853]
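
The accuracy figure comes from thresholding the sigmoid output at 0.5 and comparing with the true labels. The sketch below shows that step, plus a hand-built confusion matrix, on hypothetical probabilities and labels that are made up for illustration and are not results from this model.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, for illustration only
probs = np.array([0.92, 0.15, 0.60, 0.40, 0.85, 0.30])
labels = np.array([1, 0, 0, 1, 1, 0])

# Threshold at 0.5 to turn probabilities into class predictions
predicted = (probs > 0.5).astype(int)

# Fraction of predictions that match the true labels
accuracy = float(np.mean(predicted == labels))

# 2x2 confusion matrix: rows are true labels, columns are predicted labels
matrix = np.zeros((2, 2), dtype=int)
for t, p in zip(labels, predicted):
    matrix[t, p] += 1

print(accuracy)
print(matrix)
```

With the real data, the same matrix could presumably be obtained from the confusion_matrix function imported at the top, called as confusion_matrix(test_y, predicted) after thresholding preds in the same way.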


In [39]:
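# Download the NLTK resources used below (needed once per environment;
# newer NLTK versions may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')

# Define a function to preprocess user input
def preprocess_input(text, num_words=10000):
    # Tokenize the text
    tokens = nltk.word_tokenize(text.lower())

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Convert the tokens to integers using the IMDb dataset's word index.
    # load_data() shifts every index by 3 by default (index_from=3),
    # so the same offset is applied here to match the training data.
    word_index = imdb.get_word_index()
    int_tokens = []
    for token in filtered_tokens:
        if token in word_index and word_index[token] + 3 < num_words:
            int_tokens.append(word_index[token] + 3)

    # Multi-hot encode the review the same way as the training data:
    # the model expects a 0/1 vector of length num_words, not a padded index list
    encoded = np.zeros((1, num_words))
    encoded[0, np.array(int_tokens, dtype=int)] = 1

    return encoded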

# Take user input and preprocess it
user_input = input('Enter some text: ')
preprocessed_input = preprocess_input(user_input)

# Predict the sentiment of the input
prediction = model.predict(preprocessed_input)[0][0]

if prediction > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')


Enter some text: This a fantastic movie of three prisoners who become famous. One of the actors is george clooney and I'm not a fan but this roll is not bad. Another good thing about the movie is the soundtrack (The man of constant sorrow). I recommand this movie to everybody. Greetings Bart
Positive sentiment
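
A model trained on multi-hot vectors needs new text in the same encoding. One detail that is easy to miss: imdb.get_word_index() returns raw frequency ranks, while imdb.load_data() shifts every index by 3 by default (index_from=3), reserving 0, 1 and 2 for padding, start-of-sequence and unknown words. The sketch below illustrates the offset and the encoding on a hypothetical four-word vocabulary; toy_word_index and encode are names made up for this example.

```python
import numpy as np

INDEX_FROM = 3  # Keras default: 0 = padding, 1 = start, 2 = unknown

# Hypothetical miniature word index (word -> frequency rank), illustration only
toy_word_index = {"the": 1, "movie": 2, "great": 3, "boring": 4}

def encode(text, word_index, num_words):
    """Map known words to shifted indexes, then multi-hot encode them."""
    indices = [word_index[w] + INDEX_FROM
               for w in text.lower().split()
               if w in word_index and word_index[w] + INDEX_FROM < num_words]
    vector = np.zeros((1, num_words))
    vector[0, np.array(indices, dtype=int)] = 1
    return indices, vector

indices, vector = encode("The movie was great", toy_word_index, num_words=8)
print(indices)
print(vector)
```

The word "was" is not in the toy vocabulary, so it is skipped; the remaining three words set positions 4, 5 and 6 of the vector.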


In [40]:
# Take user input and preprocess it
user_input = input('Enter some text: ')
preprocessed_input = preprocess_input(user_input)

# Predict the sentiment of the input
prediction = model.predict(preprocessed_input)[0][0]

if prediction > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')

Enter some text:  An awful film! It must have been up against some real stinkers to be nominated for the Golden Globe. They've taken the story of the first famous female Renaissance painter and mangled it beyond recognition. My complaint is not that they've taken liberties with the facts; if the story were good, that would perfectly fine. But it's simply bizarre -- by all accounts the true story of this artist would have made for a far better film, so why did they come up with this dishwater-dull script? I suppose there weren't enough naked people in the factual version. It's hurriedly capped off in the end with a summary of the artist's life -- we could have saved ourselves a couple of hours if they'd favored the rest of the film with same brevity.
Negative sentiment
