# TensorFlow used for
## Natural Language Processing 

Code source: [Magnus Erik Hvass Pedersen](http://www.hvass-labs.org/)
/ [GitHub](https://github.com/Hvass-Labs/TensorFlow-Tutorials) / [Videos on YouTube](https://www.youtube.com/playlist?list=PL9Hr9sNUjfsmEu1ZniY0XpHSzl5uihcXZ)  
with modifications from the [TensorFlow tutorials](https://www.tensorflow.org/tutorials) at [TensorFlow.org](https://www.tensorflow.org/tutorials)

In [453]:
import pandas as pd

In [454]:
# Load the tweets from CSV and drop the leftover index column.
politics_tweet_df = pd.read_csv('../data/politics_tweet.csv')
politics_tweet_df = politics_tweet_df.drop(['Unnamed: 0'], axis=1)
politics_tweet_df.head(3)

Unnamed: 0,date,name,tweet
0,2020-11-15,<JoeBiden>,congratulations to nasa and spacex on today's...
1,2020-11-14,<JoeBiden>,"to the millions of hindus, jains, sikhs, and ..."
2,2020-11-13,<JoeBiden>,"i am the president-elect, but will not be pre..."


In [455]:
#count data for JoeBiden
politics_tweet_df[politics_tweet_df['name']=='<JoeBiden>'].count()

date     3032
name     3032
tweet    3032
dtype: int64

In [456]:
#count data for realDonaldTrump
politics_tweet_df[politics_tweet_df['name']=='<realDonaldTrump>'].count()

date     4019
name     4019
tweet    4019
dtype: int64

In [457]:
politics_tweet_df[politics_tweet_df['name']=='<senatemajldr>'].count()

date     515
name     515
tweet    515
dtype: int64

In [458]:
politics_tweet_df[politics_tweet_df['name']=='<SpeakerPelosi>'].count()

date     1005
name     1005
tweet    1005
dtype: int64

In [459]:
# Label the data: 1 for <JoeBiden> and <SpeakerPelosi>,
# 0 for <realDonaldTrump> and every other account (here <senatemajldr>).
def label_for(name):
    if name in ('<JoeBiden>', '<SpeakerPelosi>'):
        return 1
    return 0

politics_tweet_df['label'] = politics_tweet_df['name'].apply(label_for)
politics_tweet_df

Unnamed: 0,date,name,tweet,label
0,2020-11-15,<JoeBiden>,congratulations to nasa and spacex on today's...,1
1,2020-11-14,<JoeBiden>,"to the millions of hindus, jains, sikhs, and ...",1
2,2020-11-13,<JoeBiden>,"i am the president-elect, but will not be pre...",1
3,2020-11-13,<JoeBiden>,i am alarmed by the surge in reported covid-1...,1
4,2020-11-13,<JoeBiden>,as the remnants of tropical storm eta continu...,1
...,...,...,...,...
8566,2020-01-03,<senatemajldr>,"for too long, this evil man operated without ...",0
8567,2020-01-03,<senatemajldr>,soleimani made it his life’s work to take the...,0
8568,2020-01-03,<senatemajldr>,"this morning, iran’s master terrorist is dead...",0
8569,2020-01-03,<senatemajldr>,senators do not cease to be senators just bec...,0


In [460]:
y = politics_tweet_df["label"]
X = politics_tweet_df["tweet"]
print(X.shape, y.shape)

(8571,) (8571,)
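Combining the per-account counts shown above with the labelling rule gives the class balance. The sketch below only re-uses the numbers printed earlier; the variable names are illustrative and not part of the original notebook.

```python
# Tweet counts per account, copied from the outputs above.
counts = {
    '<JoeBiden>': 3032,
    '<realDonaldTrump>': 4019,
    '<senatemajldr>': 515,
    '<SpeakerPelosi>': 1005,
}

# Same rule as the labelling function: these two accounts get label 1.
label_one_accounts = {'<JoeBiden>', '<SpeakerPelosi>'}

n_label_1 = sum(n for name, n in counts.items() if name in label_one_accounts)
n_label_0 = sum(n for name, n in counts.items() if name not in label_one_accounts)
total = n_label_0 + n_label_1

# Accuracy of always predicting the larger class, a floor for any model.
majority_baseline = max(n_label_0, n_label_1) / total

print(n_label_1, n_label_0, total, round(majority_baseline, 3))
```

The two classes are roughly balanced (about 47% with label 1 and 53% with label 0), so a classifier that always predicts label 0 would reach an accuracy of about 53%. That is a useful reference point when reading the test accuracy later on.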


In [461]:
# Use train_test_split to create training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
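`train_test_split` is called without `test_size`, so scikit-learn falls back to its default of 25% for the test-set, and `random_state=42` makes the shuffle reproducible. The arithmetic below is a sketch of how the split sizes come about, assuming the test-set size is rounded up and the training-set gets the remainder. It reproduces the 6428 and 2143 rows reported for the padded arrays further down.

```python
import math

n_samples = 8571      # number of labelled tweets, from the shape printed above
test_size = 0.25      # scikit-learn's default when test_size is not given

n_test = math.ceil(test_size * n_samples)   # assumed rounding: up
n_train = n_samples - n_test

print(n_train, n_test)
```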

In [462]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

In [463]:
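# `from tf.keras.models import Sequential` does not work because `tf` is only
# an alias, not an importable package name. This notebook imports from the
# internal `tensorflow.python.keras` package; the public counterpart is
# `tensorflow.keras`, whose layer and optimizer defaults may differ.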
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

## Tokenizer

A neural network cannot work directly on text-strings, so we must first convert them to numbers. The conversion has two steps. The first step is the "tokenizer", which converts words to integers and is applied to the data-set before it is input to the neural network. The second step is an integrated part of the neural network itself and is called the "embedding"-layer, which is described further below.

We can instruct the tokenizer to use only the most frequent words in the data-set, e.g. the 10000 most frequent ones.

In [464]:
num_words = 10000
tokenizer = Tokenizer(num_words=num_words)

The tokenizer can then be "fitted" to the data-set. This scans through all the text, strips unwanted characters such as punctuation, and converts it to lower-case. The tokenizer then builds a vocabulary of all unique words along with various data-structures for accessing the data.

Note that we fit the tokenizer on the entire data-set so it gathers words from both the training- and test-data. This is OK as we are merely building a vocabulary and want it to be as complete as possible. The actual neural network will of course only be trained on the training-set.

In [466]:
%%time
tokenizer.fit_on_texts(X.tolist())

Wall time: 430 ms


We can then inspect the vocabulary gathered by the tokenizer. It maps each word to an integer and is ordered by the number of occurrences of the words in the data-set, so the most frequent word gets index 1. These integers are called word indices or "tokens" because they uniquely identify each word in the vocabulary.

In [467]:
tokenizer.word_index

{'the': 1,
 'to': 2,
 'and': 3,
 'of': 4,
 'a': 5,
 'in': 6,
 'is': 7,
 'for': 8,
 'we': 9,
 'our': 10,
 'i': 11,
 'this': 12,
 'will': 13,
 'on': 14,
 'that': 15,
 'are': 16,
 'it': 17,
 'you': 18,
 'be': 19,
 'have': 20,
 'with': 21,
 'amp': 22,
 'president': 23,
 'as': 24,
 '—': 25,
 'trump': 26,
 'they': 27,
 'he': 28,
 'has': 29,
 'all': 30,
 'not': 31,
 'my': 32,
 'great': 33,
 'people': 34,
 'your': 35,
 'at': 36,
 'by': 37,
 'who': 38,
 'from': 39,
 'but': 40,
 'was': 41,
 'more': 42,
 'their': 43,
 'american': 44,
 'do': 45,
 'just': 46,
 'now': 47,
 'his': 48,
 'vote': 49,
 'country': 50,
 'up': 51,
 'need': 52,
 'donald': 53,
 'if': 54,
 'get': 55,
 'out': 56,
 'today': 57,
 'no': 58,
 'us': 59,
 'one': 60,
 'an': 61,
 'than': 62,
 'can': 63,
 'so': 64,
 'america': 65,
 'what': 66,
 'thank': 67,
 'make': 68,
 'house': 69,
 'time': 70,
 'or': 71,
 'been': 72,
 'americans': 73,
 'about': 74,
 'biden': 75,
 'day': 76,
 'nation': 77,
 'me': 78,
 'democrats': 79,
 'big': 80,
 ...}
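To make the fitting step concrete, here is a minimal pure-Python sketch of the same idea: count the words, rank them by frequency, and replace each word by its rank. It is not the Keras implementation. The helper names (`split_words`, `fit_word_index`, `to_sequences`) are hypothetical, the word-splitting rule is simplified, and the cut-off that keeps only indices below `num_words` is an assumption about how the limit is applied.

```python
import re
from collections import Counter

def split_words(text):
    # Lower-case and keep runs of letters, digits and apostrophes;
    # a simplified stand-in for the real tokenizer's filtering.
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Most frequent word gets index 1; index 0 is left free for padding.
    ranked = counts.most_common()
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    sequences = []
    for text in texts:
        tokens = [word_index[w] for w in split_words(text) if w in word_index]
        if num_words is not None:
            # Assumed cut-off: keep only indices below num_words.
            tokens = [t for t in tokens if t < num_words]
        sequences.append(tokens)
    return sequences

toy_texts = ["The vote, the people!", "the people vote today"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

In this toy example the rarest word, "today", falls outside the limit and is silently dropped from the second sequence. Words outside the vocabulary limit disappear from the tweets in the same way.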

We can then use the tokenizer to convert all texts in the training-set and test-set to lists of these tokens.

In [468]:
x_train_tokens = tokenizer.texts_to_sequences(X_train)

In [469]:
x_test_tokens = tokenizer.texts_to_sequences(X_test)

## Padding and Truncating Data

The Recurrent Neural Network can take sequences of arbitrary length as input, but in order to use a whole batch of data, the sequences need to have the same length. There are two ways of achieving this: (A) Either we ensure that all sequences in the entire data-set have the same length, or (B) we write a custom data-generator that ensures the sequences have the same length within each batch.

Solution (A) is simpler but if we use the length of the longest sequence in the data-set, then we are wasting a lot of memory. This is particularly important for larger data-sets.

So in order to make a compromise, we will use a sequence-length that covers most sequences in the data-set, and we will then truncate longer sequences and pad shorter sequences.

First we count the number of tokens in all the sequences in the data-set.

In [470]:
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

In [471]:
num_tokens

array([19, 31, 45, ..., 41, 42, 21])

In [472]:
np.mean(num_tokens)

30.133006650332515

In [473]:
np.max(num_tokens)

60

In [474]:
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

57

In [475]:
np.sum(num_tokens < max_tokens) / len(num_tokens)

0.9994166374985416
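The rule used above (mean plus two standard deviations, rounded down) and the coverage check can be sketched without NumPy. The example runs on a small made-up list of lengths, not on the tweets, and the function names are hypothetical.

```python
import statistics

def choose_max_tokens(lengths, num_std=2):
    # Mean plus num_std population standard deviations, rounded down.
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)   # population std, like np.std's default
    return int(mean + num_std * std)

def coverage(lengths, max_tokens):
    # Fraction of sequences strictly shorter than max_tokens.
    return sum(n < max_tokens for n in lengths) / len(lengths)

toy_lengths = [10] * 9 + [100]   # made-up data: nine short texts, one long one
toy_max = choose_max_tokens(toy_lengths)
print(toy_max, coverage(toy_lengths, toy_max))
```

For the tweets the effect is small: the longest sequence has 60 tokens and the limit is 57, so about 99.9% of the sequences are shorter than the limit and very few are truncated.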

### This is where data is padded.

When padding or truncating the sequences that have a different length, we need to determine if we want to do this padding or truncating 'pre' or 'post'. If a sequence is truncated, it means that a part of the sequence is simply thrown away. If a sequence is padded, it means that zeros are added to the sequence.

So the choice of 'pre' or 'post' can be important because it determines whether we throw away the first or last part of a sequence when truncating, and whether we add zeros to the beginning or end of the sequence when padding. A poor choice may confuse the Recurrent Neural Network. Here we use 'pre' for both, so the zeros come first and the network reads the actual words last, just before it produces its output.

In [477]:
pad = 'pre'
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)


In [478]:
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

In [567]:
x_test_pad

array([[   0,    0,    0, ..., 1195,  605,  110],
       [   0,    0,    0, ..., 1141,  520, 2136],
       [   0,    0,    0, ...,  432,    4,  118],
       ...,
       [   0,    0,    0, ...,    1,   44,   34],
       [   0,    0,    0, ...,   10, 3564,  711],
       [   0,    0,    0, ...,  593,   80,  135]])
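The leading zeros in the output above are the result of 'pre' padding. The following pure-Python sketch shows what the 'pre' and 'post' options do to a single sequence. `pad_sequence` is a hypothetical helper written for illustration, not the Keras function, and it assumes `maxlen` is positive.

```python
def pad_sequence(tokens, maxlen, padding="pre", truncating="pre"):
    tokens = list(tokens)
    if len(tokens) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        tokens = tokens[-maxlen:] if truncating == "pre" else tokens[:maxlen]
    zeros = [0] * (maxlen - len(tokens))
    # 'pre' puts the zeros in front, 'post' appends them.
    return zeros + tokens if padding == "pre" else tokens + zeros

short = [5, 6, 7]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_sequence(short, 5))                       # zeros in front
print(pad_sequence(short, 5, padding="post"))       # zeros at the end
print(pad_sequence(long, 5))                        # first tokens dropped
print(pad_sequence(long, 5, truncating="post"))     # last tokens dropped
```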

In [479]:
x_train_pad.shape

(6428, 57)

In [480]:
x_test_pad.shape

(2143, 57)

In [481]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

In [482]:
def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text

In [483]:
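# Caution: X_train[1] selects by index label (row 1 of the original DataFrame),
# not by position, so this is not the text behind x_train_tokens[1] below.
# Use X_train.iloc[1] to get the tweet at position 1 of the shuffled training-set.
X_train[1]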

' to the millions of hindus, jains, sikhs, and buddhists celebrating the festival of lights,  and i send our best wishes for a #happydiwali. may your new year be filled with hope, happiness, and prosperity. sal mubarak.\n'

In [484]:
tokens_to_string(x_train_tokens[1])

'i issued the following statement with on the urgent need to replenish the paycheck protection program there is no excuse for a lack of urgency american jobs are literally at stake'

## Create the Recurrent Neural Network

We are now ready to create the Recurrent Neural Network (RNN). We will use the Keras API for this because of its simplicity. See Tutorial #03-C of the Hvass-Labs series for an introduction to Keras.

In [485]:
model = Sequential()

In [486]:
embedding_size = 8

The first layer in the network is the embedding-layer, which converts each integer-token into a vector of `embedding_size` values. It needs to know the number of words in the vocabulary (`num_words`) and the length of the padded token-sequences (`max_tokens`). We also give this layer a name because we need to retrieve its weights further below.

In [487]:
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
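Conceptually, an embedding-layer is a lookup table with one row of `embedding_size` numbers per token. The sketch below imitates that lookup in plain Python with random values standing in for the trainable weights. It illustrates the idea only and is not how Keras stores or initialises the weights.

```python
import random

num_words = 10000      # vocabulary size, as above
embedding_size = 8     # length of each word vector, as above
max_tokens = 57        # padded sequence length, as computed above

rng = random.Random(0)
# One row of embedding_size numbers per token; these play the role of the weights.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_size)]
                   for _ in range(num_words)]

def embed(token_sequence):
    # Look up the row for every token; token 0 (padding) has a row as well.
    return [embedding_table[token] for token in token_sequence]

# A padded sequence ending in three real tokens taken from the output above.
padded = [0] * (max_tokens - 3) + [1, 44, 34]
embedded = embed(padded)

print(len(embedded), len(embedded[0]))   # sequence length, vector length
print(num_words * embedding_size)        # number of values in the table
```

Each padded tweet of 57 tokens therefore becomes a 57 x 8 array, and the table itself holds 10000 x 8 = 80000 values.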

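We can now add the Gated Recurrent Units (GRU) to the network. The first GRU has 16 outputs. Because another GRU follows it, it must return the full sequence of outputs rather than only the final one, since the next GRU expects a sequence as its input. The same holds for the second GRU with 8 outputs. The third GRU has 4 outputs and returns only its final output, which is fed to a fully-connected layer with a sigmoid activation that produces a value between 0.0 and 1.0, interpreted as the predicted label.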

In [488]:
model.add(GRU(units=16, return_sequences=True))
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
model.add(Dense(1, activation='sigmoid'))

optimizer = Adam(lr=1e-3)

In [489]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])


In [490]:
model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 57, 8)             80000     
_________________________________________________________________
gru_22 (GRU)                 (None, 57, 16)            1200      
_________________________________________________________________
gru_23 (GRU)                 (None, 57, 8)             600       
_________________________________________________________________
gru_24 (GRU)                 (None, 4)                 156       
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
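The parameter counts in the summary can be reproduced by hand. The sketch below assumes that each GRU has three gates, each with an input kernel, a recurrent kernel and a single bias vector. That assumption matches the numbers shown above; GRU variants with two bias vectors per gate would give slightly larger counts. The function names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One vector of length dim per word in the vocabulary.
    return vocab_size * dim

def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel and
    # one bias vector (assumption; it matches the summary above).
    return 3 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

layer_params = [
    embedding_params(10000, 8),  # layer_embedding
    gru_params(8, 16),           # first GRU
    gru_params(16, 8),           # second GRU
    gru_params(8, 4),            # third GRU
    dense_params(4, 1),          # output layer
]
print(layer_params, sum(layer_params))
```

Almost all of the 81,961 parameters sit in the embedding-layer; the three GRU layers and the output layer together account for fewer than 2,000.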


## Train the Recurrent Neural Network

We can now train the model. Note that we are using the data-set with the padded sequences. We use 5% of the training-set as a small validation-set, so we have a rough idea whether the model is generalizing well or if it is perhaps over-fitting to the training-set.

In [540]:
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)


Train on 6106 samples, validate on 322 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Wall time: 42.9 s


<tensorflow.python.keras.callbacks.History at 0x1d525a4d488>
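The sample counts in the training log follow from `validation_split=0.05`. The arithmetic below is a sketch under the assumption that the split point is `int(n * (1 - validation_split))` and that the validation samples are taken from the end of the arrays; it reproduces the 6106 and 322 samples reported above.

```python
import math

n_train = 6428            # rows in x_train_pad
validation_split = 0.05
batch_size = 64

# Assumed rule: the first int(n * (1 - split)) samples are used for fitting
# and the remaining samples at the end are held out for validation.
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```

With a batch size of 64, each epoch therefore consists of 96 weight updates, the last one on a smaller batch.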

## Performance on Test-Set

Now that the model has been trained we can calculate its classification accuracy on the test-set.

In [541]:
%%time
# result = model.evaluate(x_test_pad, y_test)
model_loss, model_accuracy = model.evaluate(
    x_test_pad, y_test, verbose=2)
print(
    f"Normal Neural Network - Loss: {model_loss}, Accuracy: {model_accuracy}")

2143/2143 - 5s - loss: 0.3051 - accuracy: 0.9169
Normal Neural Network - Loss: 0.30507836667486177, Accuracy: 0.916938841342926
Wall time: 4.88 s


## Save the model for prediction

In [429]:
model.save("model1119-2.h5")

## Test the model for prediction

In [542]:
%%time
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]

Wall time: 4.62 s


In [543]:
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])

In [544]:
cls_true = np.array(y_test[0:1000])

In [545]:
incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]

In [546]:
len(incorrect)

87
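The steps above threshold the sigmoid outputs at 0.5 and collect the positions where the prediction disagrees with the true label. The same logic in plain Python, on made-up numbers and with hypothetical helper names, looks like this:

```python
def classify(probabilities, threshold=0.5):
    # Sigmoid outputs above the threshold count as label 1, the rest as label 0.
    return [1 if p > threshold else 0 for p in probabilities]

def misclassified(pred_labels, true_labels):
    # Positions where the predicted and true labels disagree.
    return [i for i, (p, t) in enumerate(zip(pred_labels, true_labels)) if p != t]

toy_pred = [0.9, 0.2, 0.6, 0.4]   # made-up model outputs
toy_true = [1, 0, 0, 1]           # made-up true labels
toy_wrong = misclassified(classify(toy_pred), toy_true)

print(toy_wrong, len(toy_wrong) / len(toy_true))
```

For the notebook's numbers, 87 mistakes among the first 1000 test tweets is an error rate of 8.7%, i.e. an accuracy of 91.3% on this subset, close to the 91.7% measured on the full test-set.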

In [547]:
idx = incorrect[0]
idx

4

In [548]:
text = X_test.to_list()[idx]
text

' our country can’t afford a second epidemic of frivolous lawsuits while we fight the covid-19 pandemic. the next relief package should focus on four things: jobs, healthcare, kids in school, and liability protections for those helping us fight the coronavirus.  '

In [549]:
y_pred[idx]

0.94515264

In [550]:
cls_true[idx]

0

In [555]:
# Biden tweets (label 1)
text1 = 'It’s not enough to praise our essential workers — we have to protect and pay them.'.lower()
text2 = 'The workers on the frontlines of this pandemic are making extraordinary sacrifices every single day. They deserve leaders who will listen and work as hard for them as they are for their communities. As president, that’s exactly what I’ll do.'
# Trump tweets (label 0)
text3 = 'Hope that all House Republicans will vote against Crazy Nancy Pelosiï War Powers Resolution'
text4 = 'PRESIDENTIAL HARASSMENT!'
text5 = 'IRAN WILL NEVER HAVE A NUCLEAR WEAPON!'
text6 = 'The Impeachment Hoax'
text7 = 'Congress &amp; the President should not be wasting their time and energy on a continuation of the totally partisan Impeachment Hoax when we have so many important matters pending. 196 to ZERO was the Republican House vote'
text8 = 'These Media Posts will serve as notification to the United States Congress that should Iran strike any U.S. person or target'
texts = [text1, text2, text3, text4, text5, text6, text7, text8]

In [556]:
for i in range(len(texts)):
    texts[i] = texts[i].lower()
print(texts)

['it’s not enough to praise our essential workers — we have to protect and pay them.', 'the workers on the frontlines of this pandemic are making extraordinary sacrifices every single day. they deserve leaders who will listen and work as hard for them as they are for their communities. as president, that’s exactly what i’ll do.', 'hope that all house republicans will vote against crazy nancy pelosiï war powers resolution', 'presidential harassment!', 'iran will never have a nuclear weapon!', 'the impeachment hoax', 'congress &amp; the president should not be wasting their time and energy on a continuation of the totally partisan impeachment hoax when we have so many important matters pending. 196 to zero was the republican house vote', 'these media posts will serve as notification to the united states congress that should iran strike any u.s. person or target']


In [557]:
tokens = tokenizer.texts_to_sequences(texts)

tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

(8, 57)

In [558]:
model.predict(tokens_pad)

array([[0.9928597 ],
       [0.9928119 ],
       [0.00724876],
       [0.01900795],
       [0.03351384],
       [0.03952959],
       [0.00930074],
       [0.00776094]], dtype=float32)
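To read these outputs as labels, apply the same 0.5 threshold as before. The snippet below copies the eight values from the array above and compares the resulting labels with the ones implied by the comments in the cell that defines the texts (first two from Biden, the rest from Trump).

```python
# Model outputs copied from the array above, in the order text1 ... text8.
outputs = [0.9928597, 0.9928119, 0.00724876, 0.01900795,
           0.03351384, 0.03952959, 0.00930074, 0.00776094]

# Labels implied by the comments in the cell that defines the texts:
# the first two are Biden tweets (label 1), the rest Trump tweets (label 0).
expected = [1, 1, 0, 0, 0, 0, 0, 0]

predicted = [1 if p > 0.5 else 0 for p in outputs]
print(predicted, predicted == expected)
```

All eight outputs lie far from 0.5, so the model assigns every one of these tweets to the expected label with high confidence.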