# Natural Language Processing
Natural Language Processing (NLP) is a field of AI that deals with communication between humans and computers. NLP has shown the world part of what AI can do through the popularity of autocomplete, chatbots, and other forms of human-machine interaction. In short, NLP focuses on how computers can understand and interact with natural (human) language.

# Recurrent Neural Networks
A recurrent neural network (RNN) is a type of neural network designed to process sequential data by maintaining a memory of previous inputs. RNNs have loops in their architecture, allowing information to persist across the steps of a sequence. This imitates human behaviour: when we read, we take in one word at a time, and every new word contributes to the developing context. In other words, an RNN can interpret a word based on the words that came before it.
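
To make the idea of a loop concrete, here is a minimal sketch of a recurrent step using a single number as the hidden state. The weights `w_x`, `w_h` and `b` are arbitrary values chosen for illustration; a real RNN learns them during training and works with vectors rather than scalars.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs):
    h = 0.0  # the hidden state (the "memory") starts empty
    states = []
    for x in inputs:
        h = rnn_step(x, h)  # the new state depends on the input AND the old state
        states.append(h)
    return states

# The two sequences differ only in their first element.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

Even though the last two inputs are identical in both sequences, the final hidden states differ, because the first input is still carried along in the state. The values in `states_a` also shrink at every step, which hints at why a plain RNN gradually loses track of early inputs.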

# Encoding Text
Since machine learning models cannot take raw text as input, we need to convert the text into something numerical. There are several methods for doing this, and one of them is generally considered the best.

Starting with Bag of Words: this method gives a number to each word of a sentence and keeps track only of which words appear and how often, not of their order. The text "Hello I am markdown I am" would become [0,1,2,3,1,2] (but unordered), so the model can learn only from the frequency of each word. This method is not considered good because it cannot 'understand' the context of the text; it would treat the following sentences as equal even though they are not: "She only loves him" and "Only she loves him".
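
The limitation can be sketched in a few lines of plain Python. The helper `bag_of_words` is a hypothetical name used only for this illustration; it simply counts words after lowercasing and splitting on whitespace.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word appears, ignoring word order."""
    return Counter(text.lower().split())

print(bag_of_words("Hello I am markdown I am"))

bag_a = bag_of_words("She only loves him")
bag_b = bag_of_words("Only she loves him")
print(bag_a == bag_b)  # True: both sentences produce the same bag
```

Because both sentences contain exactly the same words, their bags are identical, and a model that sees only the bag has no way to tell them apart.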

The next method is Integer Encoding. It works the same way as Bag of Words, but it also keeps track of the position of the words. The results are better; however, there is still no way to capture the context of the text, since we cannot relate words to each other: their labels (numbers) carry no meaning and only reflect the order in which the words were encountered in the text.
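
As a rough sketch, integer encoding can be written as follows. The helpers `build_vocab` and `integer_encode` are hypothetical names for this example, and reserving 0 for padding is an assumption made here to match common practice.

```python
def build_vocab(texts):
    """Assign each new word the next free integer, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 is kept free for padding
    return vocab

def integer_encode(text, vocab):
    """Replace every word by its integer label, keeping the word order."""
    return [vocab[word] for word in text.lower().split()]

sentences = ["She only loves him", "Only she loves him"]
vocab = build_vocab(sentences)
enc_a = integer_encode(sentences[0], vocab)
enc_b = integer_encode(sentences[1], vocab)
print(enc_a)  # [1, 2, 3, 4]
print(enc_b)  # [2, 1, 3, 4]
```

The two sentences are now distinguishable because the order is preserved, but the labels themselves are arbitrary: nothing about the number 3 says that "loves" is closer in meaning to "likes" than to "him".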

The last method (and the best one) is Word Embedding. This method tracks both the sequence of the words and their meaning, allowing the model to identify the context of the text. It represents each word as a vector in a multi-dimensional space and learns those vectors so that similar words end up close together. As a result, the model is able to understand relationships between words, such as synonyms or semantic associations, which improves its performance on tasks like sentiment analysis, machine translation, and text classification.
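
To illustrate what "close together" means, the sketch below compares word vectors with cosine similarity. The three vectors are hypothetical and hand-picked for this example; real embeddings are learned during training and usually have many more dimensions (32 in the model used in the Practice section).

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.9, 0.1, 0.0],
}

def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["awful"])
print(sim_close)  # close to 1: the vectors point in almost the same direction
print(sim_far)    # negative: the vectors point in opposite directions
```

With vectors like these, the model can treat "good" and "great" as nearly interchangeable while keeping "awful" far away, something that plain integer labels cannot express.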

# Long Short-Term Memory
Once the words are embedded, a recurrent layer processes the sequence. Two common choices are the SimpleRNN and the Long Short-Term Memory (LSTM). The main difference between them is a mechanism that lets the LSTM handle longer texts. The SimpleRNN applies a mathematical operation (not detailed here) to its input (one word at a time) to produce an output for that word. This output is then fed into the next step together with the next word, which lets us interpret each word in light of the previous ones. The main problem with this algorithm is that the longer the text is, the less it remembers about the beginning of the sentence. Although it works well for short texts, the LSTM is generally considered better. In both algorithms, by the time the third word is analysed, the first word is no longer directly visible. What the LSTM adds is an internal memory (the cell state), controlled by gates that decide what to keep and what to forget, so information from early words can still influence how much later words are interpreted.
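
The following is a minimal sketch of a single LSTM step with a scalar state, meant only to show how the memory is carried forward. All parameter values are hypothetical and hand-picked (a trained LSTM learns them), and a real layer operates on vectors and matrices rather than single numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters per gate: (input weight, hidden weight, bias).
PARAMS = {
    "forget": (0.0, 0.0, 4.0),     # large bias: keep almost all of the old memory
    "input": (4.0, 0.0, -2.0),     # opens mainly when the input is large
    "output": (0.0, 0.0, 0.0),
    "candidate": (2.0, 0.0, 0.0),
}

def pre_activation(name, x, h_prev):
    w, u, b = PARAMS[name]
    return w * x + u * h_prev + b

def lstm_step(x, h_prev, c_prev):
    f = sigmoid(pre_activation("forget", x, h_prev))       # how much old memory to keep
    i = sigmoid(pre_activation("input", x, h_prev))        # how much new information to write
    o = sigmoid(pre_activation("output", x, h_prev))       # how much memory to expose
    g = math.tanh(pre_activation("candidate", x, h_prev))  # the new information itself
    c = f * c_prev + i * g   # updated cell state (the long-term memory)
    h = o * math.tanh(c)     # updated hidden state (the step's output)
    return h, c

h, c = 0.0, 0.0
cell_states = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c)
    cell_states.append(c)
print(cell_states)
```

With these values the cell state written at the first step fades only slightly over the following four steps, because the forget gate stays close to 1. This additive memory path is what allows an LSTM to retain information over longer sequences than a plain recurrent step.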

# Practice
Goal: To identify the sentiment of a movie review.

About the data: I am importing the IMDB dataset from Keras. It has 25,000 training reviews (plus 25,000 for testing), each already labelled as positive or negative, which means we are doing supervised learning. The vocabulary size is 88,584 words, and each word is encoded by its frequency rank across the reviews; note that the encoding does not group words by meaning.

In [38]:
# Importing
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras.preprocessing
import tensorflow as tf
import os
import numpy as np

In [39]:
# Defining variables
VOCAB_SIZE = 88584
MAXLEN = 250
BATCH_SIZE = 60

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE) 

In [40]:
# Let's look at one review!
print(test_data[0])

[1, 591, 202, 14, 31, 6, 717, 10, 10, 18142, 10698, 5, 4, 360, 7, 4, 177, 5760, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 7944, 100, 28, 1668, 14, 31, 23, 27, 7479, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 15387, 38, 32, 25, 7944, 451, 202, 14, 6, 717]


In [41]:
# Notice in the previous output that each review has a different length. This is an issue, because the model needs inputs of the same length.
# Keras already has a function that fixes this.
train_data = sequence.pad_sequences(train_data, maxlen=MAXLEN)
test_data = sequence.pad_sequences(test_data, maxlen=MAXLEN)

# Reviews longer than MAXLEN are trimmed (from the start, by default); shorter ones are padded with 0's at the start.
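
For intuition, the default behaviour of `pad_sequences` on a single review can be approximated in plain Python. The helper `pad_pre` is a hypothetical name for this sketch and is not part of Keras; it assumes the default settings, where both padding and truncation happen at the start of the sequence.

```python
def pad_pre(seq, maxlen, value=0):
    """Approximate pad_sequences' defaults for one sequence: pad or trim at the start."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # fill the front with the padding value

print(pad_pre([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Padding at the start keeps the actual words next to the end of the sequence, which is the last thing a recurrent layer reads before producing its output.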

In [42]:
print(test_data[0]) # Let's see what the same review looks like now

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

In [24]:
# Creating the model
model = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(VOCAB_SIZE, 32), # Each word index is mapped to a 32-dimensional vector
        tf.keras.layers.LSTM(32), # LSTM layer with 32 units
        tf.keras.layers.Dense(1, activation='sigmoid')
    ]
)

print(model.summary())

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 32)          2834688   
                                                                 
 lstm_3 (LSTM)               (None, 32)                8320      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
Total params: 2843041 (10.85 MB)
Trainable params: 2843041 (10.85 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
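
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard layouts: one vector per vocabulary entry for the embedding, four gates for the LSTM (each with input weights, recurrent weights and a bias), and one weight per input plus a bias for the dense layer.

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32
LSTM_UNITS = 32

# Embedding: one EMBED_DIM-dimensional vector per word index.
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)

# Dense: one weight per LSTM unit plus one bias.
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding layer, which is typical when the vocabulary is large compared with the rest of the network.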


In [None]:
# Compiling the model
model.compile(
    loss='binary_crossentropy', # Standard loss function for binary classification
    optimizer='rmsprop', # A commonly used optimizer for RNNs
    metrics=['acc'] # Measuring the accuracy
)
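
To show what the loss measures, here is a minimal sketch of binary cross-entropy written in plain Python. The function below is a simplified illustration and not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly between 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right)  # small loss
print(confident_wrong)  # much larger loss
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is why it pairs naturally with a sigmoid output.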




In [30]:
# Training the model
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
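
The `validation_split=0.2` argument holds back the last 20% of the training samples for validation, so the model trains on the remaining 80%. A rough sketch of that split is shown below; `split_validation` is a hypothetical helper written for this illustration, not a Keras function.

```python
import math

def split_validation(samples, validation_split):
    """Split off the last fraction of the samples for validation."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

samples = list(range(10))
train_part, val_part = split_validation(samples, 0.2)
print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```

The validation samples are never used to update the weights, so the validation metrics reported after each epoch give an early indication of how well the model generalises.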


In [35]:
# Getting the results:
results = model.evaluate(test_data, test_labels)
print('\033[31mLoss: ', results[0],'\033[0m')
print('\033[32mAccuracy: ', results[1]*100,'\033[0m')

Loss:  0.4565478563308716
Accuracy:  85.971999168396


# Making predictions

In [55]:
# Encoding the text:
word_index = imdb.get_word_index() # Maps each word of the IMDB vocabulary to its frequency rank
INDEX_FROM = 3 # By default, imdb.load_data() shifts every word index by 3 (0 = padding, 1 = start, 2 = unknown)
OOV_INDEX = 2 # Index used for unknown words in the training data

def encode_text(text):
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    tokens = [word_index.get(word, -1) + INDEX_FROM for word in tokens] # Apply the same shift as the training data
    tokens = [t if INDEX_FROM <= t < VOCAB_SIZE else OOV_INDEX for t in tokens] # Unknown or out-of-range words become OOV_INDEX
    return sequence.pad_sequences([tokens], maxlen=MAXLEN) # Pad (or trim) the tokens to MAXLEN

print(encode_text('The movie is perfect!'))

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0 

In [83]:
# Defining the prediction function

def predict(text):
    encoded = encode_text(text) # Shape (1, MAXLEN): a batch containing one padded review
    result = model.predict(encoded) # Shape (1, 1): the predicted probability that the review is positive
    return 0 if result[0][0] < 0.5 else 1 # 1 = positive, 0 = negative

In [85]:
text = "The movie is interesting and easy to watch, with good acting and a story that keeps you wanting more."

print('\033[32mPositive\033[0m' if predict(text) == 1 else '\033[31mNegative\033[0m')

Positive
