# Introduction to NLP Fundamentals in TensorFlow

Natural language processing (NLP) aims to derive information from natural language, which could be sequences of text or speech.

Many NLP problems can also be framed as sequence-to-sequence (seq2seq) problems.

In [1]:
## Get helper functions
from helper_functions import *

## Get a text dataset

The dataset we're using is Kaggle's intro to NLP dataset: text samples of Tweets labelled as a disaster or not a disaster.

In [2]:
train_dir = 'nlp_getting_started/train.csv'
test_dir = 'nlp_getting_started/test.csv'

In [3]:
import pandas as pd
train_data = pd.read_csv(train_dir)
test_data = pd.read_csv(test_dir)

In [4]:
train_data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
test_data.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [6]:
#Shuffle training data
train_data_shuffle = train_data.sample(frac=1, random_state=42)
train_data_shuffle.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


## Visualize and become one with the data

In [7]:
# How many examples of each class are there?
train_data['target'].value_counts()

target
0    4342
1    3271
Name: count, dtype: int64

In [8]:
#How many total samples?
len(train_data), len(test_data)

(7613, 3263)

In [12]:
#Visualize random training examples
import random

random_index = random.randint(0, len(train_data)-5)
for row in train_data_shuffle[['text','target']][random_index:random_index+5].itertuples():
    _, text, target = row
    print(f'Target: {target}', '(real disaster)' if target > 0 else '(not real disaster)')
    print(f'Text:\n{text}\n')
    print('---\n')

Target: 0 (not real disaster)
Text:
China's Stock Market Crash: Are There Gems In The Rubble? http://t.co/Ox3qb15LWQ | https://t.co/8u07FoqjzW http://t.co/tg5fQc8zEY

---

Target: 0 (not real disaster)
Text:
Hw18 going 90-100. Dude was keeping up with me. Took the same exit. Pulled to the side and told me he blew his motor. Lolol #2fast2furious

---

Target: 1 (real disaster)
Text:
@CochiseCollege For the people who died in Human Experiments by Unit 731 of Japanese military http://t.co/vVPLFQv58P http://t.co/ldx9uKNGsk

---

Target: 1 (real disaster)
Text:
@TANSTAAFL23 It's not an 'impulse' and it doesn't end in mass murder. Correlation does not imply causation.

---

Target: 1 (real disaster)
Text:
#RoddyPiperAutos Fears over missing migrants in Med: Rescuers search for survivors after a boat carrying as ma...  http://t.co/97B8AVgEWU

---



### Split data into training and validation sets

In [248]:
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_data_shuffle['text'].to_numpy(),
                                                                           train_data_shuffle['target'].to_numpy(),
                                                                           test_size=.1, random_state=42)

In [15]:
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)

In [18]:
#Check the first 10 samples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting Text into numbers

When dealing with a text problem, one of the first things you'll have to do before you can build a model is convert your text to numbers.

There are a few ways to do this:
* Tokenization - a direct mapping from each token (a token could be a word or a character) to a number
* Embedding - a matrix with one feature vector per token (the size of the feature vector can be chosen, and the embedding can be learned during training)
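
To make the difference concrete, here is a minimal pure-Python sketch of both ideas. The toy corpus, the `tokenize` and `embed` helpers and the 4-dimensional random vectors are all assumptions made for illustration; they are not what Keras does internally.

```python
import random

# Toy corpus; these sentences are illustrative, not taken from the dataset
corpus = ["forest fire near the town", "fire crews near the forest"]

# Tokenization: map each distinct word to an integer
# Index 0 is reserved for padding and index 1 for unknown words
vocab = {"": 0, "[UNK]": 1}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def tokenize(sentence):
    """Turn a sentence into a list of integer token ids."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]

# Embedding: one small vector per vocabulary entry (random here, learned in practice)
rng = random.Random(42)
embedding_dim = 4
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(len(vocab))]

def embed(token_ids):
    """Look up the embedding vector for each token id."""
    return [embedding_matrix[i] for i in token_ids]

token_ids = tokenize("Forest fire near the lake")
vectors = embed(token_ids)
print(token_ids)                      # "lake" is unseen, so it maps to the [UNK] id
print(len(vectors), len(vectors[0]))  # one vector of length 4 per token
```

The key point is that tokenization gives one integer per token, while the embedding gives one vector per token.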

In [19]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization #Stable import path in recent TensorFlow 2.x releases (the experimental path is deprecated)

In [22]:
#Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, #How many words in the vocabulary (None = no cap; the padding and OOV tokens are added automatically)
                                   standardize='lower_and_strip_punctuation', #Lowercase the text and remove punctuation
                                   split='whitespace', #Split tokens on whitespace
                                   ngrams=None, #Create groups of n words (None = no n-grams)
                                   output_mode='int', #Map each token to an integer index
                                   output_sequence_length=None, #Pad/truncate all sequences to this length; with None, sequences are padded with 0s to match the longest one
                                   pad_to_max_tokens=False)

In [23]:
len(train_sentences[0].split()) #Detects there are 7 words in the tweet

7

In [26]:
#Find the average number of tokens (words) per tweet in the training set
total = sum(len(i.split()) for i in train_sentences) #Total number of words across all training sentences
total

102087

In [28]:
#To get the average number of words per tweet, divide the total word count by the number of tweets
round(total / len(train_sentences))

15

In [30]:
#Setup Text vectorization variables
max_vocab_length = 10000 #Max number of words to have in our vocabulary
max_length = 15 #Max length our sequences will be (how many words the model sees)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                   output_mode='int',
                                   output_sequence_length=max_length)

In [31]:
#Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [33]:
#Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])
#"There's" is mapped to 264, "a" to 3, "flood" to 232, "in" to 4, etc.; the trailing 0s pad the sequence to the max length of 15

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
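
The fixed output length can be pictured with a small pure-Python sketch. The `pad_or_truncate` helper is hypothetical, and we're assuming padding and truncation both happen at the end of the sequence, which is consistent with the output above.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut long sequences, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))

short = [264, 3, 232, 4, 13, 698]  # ids from the output above
long = list(range(1, 21))          # 20 made-up ids, longer than the limit

print(pad_or_truncate(short, 15))  # six ids followed by nine 0s
print(pad_or_truncate(long, 15))   # only the first 15 ids survive
```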

In [36]:
#Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
        \n\nVectorized Version:')
text_vectorizer([random_sentence])

Original text:
 Ready to get annihilated for the BUCS game        

Vectorized Version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[924,   5,  52, 558,  10,   2,   1, 397,   0,   0,   0,   0,   0,
          0,   0]])>

In [37]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() #Get all the unique words in the training data
top_5_words = words_in_vocab[:5] # Get the most common 5 words
bottom_5_words = words_in_vocab[-5:] # Get the least common 5 words

print(f'Number of words in vocab: {len(words_in_vocab)}')
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


* The `''` entry is the padding token (index 0) and `[UNK]` is the out-of-vocabulary token (index 1) used for any word outside the 10,000-word vocabulary
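
A rough sketch of how a capped, frequency-ranked vocabulary like this could be built by hand is shown below. The `build_vocabulary` helper and the alphabetical tie-breaking are assumptions for illustration; the ordering Keras uses for equally frequent words may differ.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens):
    """Frequency-ranked vocabulary with '' (padding) and '[UNK]' reserved first."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent words first; ties broken alphabetically (an assumption)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

toy_sentences = ["the fire is near the town", "the flood hit the town"]
vocab = build_vocabulary(toy_sentences, max_tokens=5)
word_to_id = {word: i for i, word in enumerate(vocab)}

# Words that did not make the cut (or were never seen) fall back to id 1
encoded = [word_to_id.get(w, 1) for w in "the fire reached the town".split()]
print(vocab)
print(encoded)
```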

### Creating an Embedding using an Embedding Layer

To make our embedding, we're going to use TensorFlow's `Embedding` layer.

The parameters we care most about for our embedding layer:
* `input_dim` = the size of our vocabulary
* `output_dim` = size of the output embedding vector, for example a value of 100 would mean each token gets represented by a vector 100 long
* `input_length` = length of the sequences being passed to the embedding layer
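
As a quick sanity check on what these parameters imply, the arithmetic below uses the values chosen in this notebook (a 10,000-word vocabulary, 128-dimensional vectors, sequences of 15 tokens). The `embedded_shape` helper is hypothetical and only illustrates the resulting shapes.

```python
max_vocab_length = 10000  # input_dim
embedding_dim = 128       # output_dim
max_length = 15           # input_length (tokens per sequence)

# One trainable vector of length output_dim per vocabulary entry
embedding_params = max_vocab_length * embedding_dim

def embedded_shape(batch_size):
    """Shape of the embedding output for a batch of tokenized sequences."""
    return (batch_size, max_length, embedding_dim)

print(embedding_params)   # number of trainable weights in the embedding matrix
print(embedded_shape(1))  # (batch, tokens, embedding dimension)
```

The parameter count agrees with the 1,280,000 reported for the embedding layer in the model summaries.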

In [39]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = max_vocab_length, 
                            output_dim=128, #Sizes divisible by 8 tend to compute efficiently on modern hardware
                            input_length = max_length)

embedding

<keras.src.layers.core.embedding.Embedding at 0x2da987c50>

In [41]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
# The embedding layer takes non-negative INTEGERS and turns them into embeddings,
# which is why we must tokenize first: it cannot embed raw text directly
print(f'Original text:\n {random_sentence}\
        \n\nEmbedded version:')

sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
 Monkeys Abused by Notorious Laboratory Dealer | A PETA Eyewitness Invest... https://t.co/QGqlpmRfJd via @YouTube        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.00446198, -0.01731829, -0.03523778, ...,  0.03687832,
          0.0447212 , -0.04217243],
        [ 0.04077685,  0.04956594, -0.04109965, ..., -0.02833414,
          0.0154579 , -0.0461543 ],
        [-0.02807315, -0.01057363, -0.03195492, ...,  0.02459791,
          0.01830187,  0.00262801],
        ...,
        [-0.02387155, -0.00121753, -0.00781842, ...,  0.0155483 ,
         -0.0181623 ,  0.03316467],
        [ 0.03234914,  0.01103782,  0.03994464, ..., -0.031964  ,
          0.01055467, -0.02541409],
        [ 0.03234914,  0.01103782,  0.03994464, ..., -0.031964  ,
          0.01055467, -0.02541409]]], dtype=float32)>

In [45]:
# Check out a single token's embedding
print(random_sentence) #The sentence we are looking at the first word of
print(sample_embed[0][0]) #Embedding for a single word
print(sample_embed[0][0].shape) #Shape of the embedding for the single word

Monkeys Abused by Notorious Laboratory Dealer | A PETA Eyewitness Invest... https://t.co/QGqlpmRfJd via @YouTube
tf.Tensor(
[-0.00446198 -0.01731829 -0.03523778 -0.02021638 -0.04750841 -0.03411106
  0.00134579  0.0235271  -0.03880454  0.00289973 -0.02729475 -0.04992742
 -0.03465927 -0.03005945  0.03673089  0.02647002  0.04005947 -0.04114307
 -0.03851108 -0.01781851 -0.04636509 -0.03519504 -0.01553327  0.03706494
 -0.03756167  0.01794079  0.04475294 -0.02158301  0.03270828  0.00667913
  0.02125063  0.01107174 -0.02021428  0.00571175 -0.0206475   0.0088285
 -0.04571632 -0.00081696 -0.00645012 -0.02865743  0.02638391 -0.0391278
 -0.04644766 -0.00393521  0.03841479 -0.03591105 -0.04872149 -0.04291904
 -0.02825388 -0.00960202 -0.00940876 -0.00591218  0.00291022 -0.04019234
  0.04041827 -0.02335994  0.0107852  -0.02897387 -0.02619526  0.02604939
  0.00832865  0.01832208  0.03374655 -0.01407516 -0.01128565 -0.02998059
  0.03127444 -0.00212105 -0.00816075 -0.00718827 -0.03738066  0.04628411
 -

## Modelling a text dataset (Running a series of experiments)
### Experiments we're running:

* 0: Naive Bayes with TF-IDF encoder (baseline)
* 1: Feed-forward Neural Net (Dense Model)
* 2: LSTM (RNN)
* 3: GRU (RNN)
* 4: Bidirectional-LSTM (RNN)
* 5: 1D Convolutional Neural Network
* 6: TensorFlow Hub Pretrained Feature Extractor
* 7: TensorFlow Hub Pretrained Feature Extractor (10% of the data)

### Model 0: Getting a baseline

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

#Create tokenization and modelling pipeline
model_0 = Pipeline([
    ('tfidf', TfidfVectorizer()), #Convert words to numbers
    ('clf', MultinomialNB()) #Model the text
])

#Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)
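
To give a feel for what the TF-IDF step computes, here is a small pure-Python sketch on made-up documents. As far as we understand, the smoothed IDF and the L2 normalisation below correspond to scikit-learn's defaults, but treat that as an assumption; the real `TfidfVectorizer` also tokenizes differently from a plain `split()`.

```python
import math
from collections import Counter

# Made-up documents, not from the dataset
docs = ["forest fire near town", "fire fire everywhere", "sunny day in town"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency: rarer words get larger values."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    """L2-normalised TF-IDF weights for one document."""
    counts = Counter(tokens)
    weights = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

weights = tfidf(tokenized[0])
print(weights)  # "forest" (1 document) outweighs "fire" (2 documents)
```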

In [48]:
#Evaluate our baseline Model
baseline_score = model_0.score(val_sentences, val_labels)
print(f'Baseline Model Achieves an Accuracy of: {baseline_score*100: .2f}%')

Baseline Model Achieves an Accuracy of:  79.27%


In [50]:
train_data.target.value_counts() #Always predicting the majority class (0) would score about 57% (4342/7613), so the model clearly beats that naive baseline

target
0    4342
1    3271
Name: count, dtype: int64

In [56]:
#Make predictions
baseline_preds = model_0.predict(val_sentences)
print(f'Predicted Labels: {baseline_preds[:20]}')
print(f'Actual Labels: {val_labels[:20]}')

Predicted Labels: [1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1]
Actual Labels: [0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0]


In [53]:
val_sentences[1]

'FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps http://t.co/qZQc8WWwcN via @usatoday'

In [54]:
from sklearn.metrics import classification_report
print(classification_report(val_labels, baseline_preds))

              precision    recall  f1-score   support

           0       0.75      0.93      0.83       414
           1       0.89      0.63      0.73       348

    accuracy                           0.79       762
   macro avg       0.82      0.78      0.78       762
weighted avg       0.81      0.79      0.79       762
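
To see where the numbers in a classification report come from, we can compute the same metrics by hand for class 1. The sketch below only uses the 20 predicted and actual labels printed earlier, so its values differ from the full 762-sample report.

```python
# First 20 predicted and actual labels from the baseline output above
predicted = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
actual    = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

pairs = list(zip(predicted, actual))
tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # disasters correctly flagged
fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false alarms
fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # missed disasters
tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # non-disasters correctly ignored

precision = tp / (tp + fp)  # of everything flagged, how much was real?
recall = tp / (tp + fn)     # of everything real, how much was flagged?
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```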



### Model 1: Feed Forward Neural Network (Dense Model)

In [61]:
# Build model with functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string) #Inputs are 1 dimensional strings
x = text_vectorizer(inputs) #Tokenize our inputs
x = embedding(x) #Turn our tokenized words into embeddings
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x) #Create the output layer for binary outputs

model_1 = tf.keras.Model(inputs, outputs)

model_1.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_2 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (  (None, 128)               0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1280129 (4.88 MB)
Trainable params: 1280129 (
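
The shape change from `(None, 15, 128)` to `(None, 128)` comes from averaging over the token axis. Below is a small pure-Python sketch of that pooling step plus the Dense parameter count; the helper names and the toy numbers are assumptions for illustration.

```python
def global_average_pooling_1d(sequence):
    """Average a (timesteps, features) list of vectors over the timestep axis."""
    timesteps = len(sequence)
    features = len(sequence[0])
    return [sum(step[f] for step in sequence) / timesteps for f in range(features)]

def dense_params(n_inputs, units):
    """One weight per input feature per unit, plus one bias per unit."""
    return n_inputs * units + units

# Three timesteps with two features each (made-up numbers)
sequence = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
pooled = global_average_pooling_1d(sequence)

print(pooled)                # one averaged value per feature
print(dense_params(128, 1))  # parameter count of the output layer in model_1
```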

In [62]:
#Compile the Model
model_1.compile(loss='binary_crossentropy',
               optimizer = 'adam',
               metrics=['accuracy'])

In [63]:
#Fit the model
model_1_history = model_1.fit(train_sentences, train_labels, epochs=5, validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [64]:
model_1.evaluate(val_sentences, val_labels)



[0.48018375039100647, 0.7808399200439453]

In [66]:
model_1_pred = model_1.predict(val_sentences)
model_1_pred.shape



(762, 1)

In [73]:
#Look at the first 10 predictions
model_1_pred[:10], val_labels[:10]

(array([[0.41930446],
        [0.7766302 ],
        [0.9977771 ],
        [0.13018925],
        [0.13156842],
        [0.93778   ],
        [0.9175396 ],
        [0.9933553 ],
        [0.9711619 ],
        [0.3107521 ]], dtype=float32),
 array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0]))

In [75]:
model_1_pred_converted = tf.squeeze(tf.round(model_1_pred)) #This will turn predictions into 0 or 1

In [76]:
model_1_pred_converted[:10], val_labels[:10] #Confirm they are in the same format and peek at the first 10 results

(<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>,
 array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0]))
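
Rounding a sigmoid output is the same as applying a 0.5 decision threshold. The sketch below makes that threshold explicit using the ten probabilities shown above; the `to_labels` helper and the stricter 0.8 threshold are assumptions for illustration.

```python
# First 10 prediction probabilities from model_1 (copied from the output above)
pred_probs = [0.41930446, 0.7766302, 0.9977771, 0.13018925, 0.13156842,
              0.93778, 0.9175396, 0.9933553, 0.9711619, 0.3107521]

def to_labels(probs, threshold=0.5):
    """Label a sample 1 (disaster) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

labels = to_labels(pred_probs)
strict_labels = to_labels(pred_probs, threshold=0.8)

print(labels)         # same labels as rounding
print(strict_labels)  # a stricter threshold flags fewer tweets as disasters
```

Raising the threshold trades recall for precision, which can matter when false alarms are costly.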

In [78]:
print(classification_report(val_labels, model_1_pred_converted))
print('Our baseline appears to be outperforming the first deep learning model')

              precision    recall  f1-score   support

           0       0.77      0.86      0.81       414
           1       0.80      0.69      0.74       348

    accuracy                           0.78       762
   macro avg       0.79      0.77      0.78       762
weighted avg       0.78      0.78      0.78       762

Our baseline appears to be outperforming the first deep learning model


## Recurrent Neural Networks (RNNs)

RNNs are useful for sequence data.

The premise of a recurrent neural network is to use the representation of a previous input to aid the representation of a later input.

For an overview of the internals of a recurrent neural network, see:

* MIT's Sequence Modelling lecture: https://www.youtube.com/watch?v=ySEx_Bqxvvo&t=18s
* Chris Olah's intro to LSTM's: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks": https://karpathy.github.io/2015/05/21/rnn-effectiveness/
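
The idea of carrying a representation from one input to the next can be sketched with a scalar recurrent cell. This is a toy illustration with made-up weights (`w_x`, `w_h`), not the Keras implementation.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias)."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states

with_signal = rnn_forward([1.0, 0.0, 0.0])
without_signal = rnn_forward([0.0, 0.0, 0.0])

print(with_signal)     # the first input still influences the later states
print(without_signal)  # all zeros: nothing to remember
```

Even though the last two inputs are identical in both runs, the final states differ because the first input is carried forward through the hidden state.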

### Model 2: LSTM

LSTM = long short-term memory (one of the most popular RNN cells)

The structure of an RNN model typically looks like this:
```
Input (text) -> Tokenize -> Embedding -> Layers (RNNs/Dense) -> Output (label probability)
```

In [88]:
# Create an LSTM Model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = embedding(x)
# print(x.shape)
x = layers.LSTM(64, return_sequences=True)(x) #When stacking RNN cells together, need to set return_sequences = True
# print(x.shape)
x = layers.LSTM(64)(x)
# print(x.shape)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model_2 = tf.keras.Model(inputs, outputs)

In [89]:
# Get a summary
model_2.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_2 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 lstm_3 (LSTM)               (None, 15, 64)            49408     
                                                                 
 lstm_4 (LSTM)               (None, 64)                33024     
                                                                 
 dense_5 (Dense)             (None, 64)                4160      
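
The LSTM parameter counts in this summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

first_lstm = lstm_params(128, 64)  # input is the 128-dim embedding
second_lstm = lstm_params(64, 64)  # input is the first LSTM's 64-dim output

print(first_lstm)
print(second_lstm)
```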
                                                           

In [90]:
#Compile the model
model_2.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer = tf.keras.optimizers.legacy.Adam())

In [91]:
#Fit the model
model_2_history = model_2.fit(train_sentences, train_labels, epochs=5, validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [92]:
model_2_pred = model_2.predict(val_sentences)
model_2_pred[:10]



array([[9.4498703e-03],
       [5.0837511e-01],
       [9.9993849e-01],
       [1.7495912e-01],
       [2.1392272e-05],
       [9.9975646e-01],
       [9.8994821e-01],
       [9.9995428e-01],
       [9.9993575e-01],
       [7.2909915e-01]], dtype=float32)

In [93]:
#Convert to predictions
model_2_predictions = tf.squeeze(tf.round(model_2_pred))
model_2_predictions[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [94]:
print(classification_report(val_labels, model_2_predictions))

              precision    recall  f1-score   support

           0       0.76      0.84      0.80       414
           1       0.78      0.68      0.73       348

    accuracy                           0.77       762
   macro avg       0.77      0.76      0.76       762
weighted avg       0.77      0.77      0.77       762



### Model 3: GRU powered RNN

Another popular and effective RNN component is the GRU or Gated Recurrent Unit.

The GRU cell has similar features to an LSTM cell but has fewer parameters.
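
The difference in size can be checked with a little arithmetic. The sketch below assumes the usual layouts: four weight sets for an LSTM and three for a GRU, with the GRU carrying two bias vectors per set (which we believe is the Keras default, `reset_after=True`). Both helpers are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def gru_params(input_dim, units):
    """Three weight sets, each with input weights, recurrent weights and two biases."""
    return 3 * ((input_dim + units) * units + 2 * units)

lstm_count = lstm_params(128, 64)
gru_count = gru_params(128, 64)

print(lstm_count, gru_count)
print(f"GRU uses {gru_count / lstm_count:.0%} of the LSTM's parameters")
```

For a 128-dimensional input and 64 units, the GRU count agrees with the first GRU layer in the `model_3` summary.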

In [106]:
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(64, return_sequences=True)(x)
x = layers.LSTM(64, return_sequences=True)(x)
x = layers.GRU(64, return_sequences=True)(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model_3 = tf.keras.Model(inputs, outputs)

In [108]:
model_3.summary()

Model: "model_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_11 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_2 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 gru_10 (GRU)                (None, 15, 64)            37248     
                                                                 
 lstm_8 (LSTM)               (None, 15, 64)            33024     
                                                                 
 gru_11 (GRU)                (None, 15, 64)            24960     
                                                           

In [109]:
model_3.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer = tf.keras.optimizers.legacy.Adam())

In [110]:
model_3_history = model_3.fit(train_sentences, train_labels, epochs=5, validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [111]:
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]



array([[1.9681400e-02],
       [4.9318841e-01],
       [9.9999249e-01],
       [1.2864037e-01],
       [4.5325450e-08],
       [9.9994177e-01],
       [9.9937451e-01],
       [9.9999994e-01],
       [9.9999958e-01],
       [9.9739516e-01]], dtype=float32)

In [112]:
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [113]:
print(classification_report(val_labels, model_3_preds))

              precision    recall  f1-score   support

           0       0.76      0.84      0.79       414
           1       0.78      0.68      0.73       348

    accuracy                           0.77       762
   macro avg       0.77      0.76      0.76       762
weighted avg       0.77      0.77      0.76       762



### Model 4: Bidirectional LSTM Layer

In [124]:
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences = True))(x)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model_4 = tf.keras.Model(inputs, outputs)


In [125]:
model_4.summary()

Model: "model_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_15 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_2 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 bidirectional_4 (Bidirecti  (None, 15, 128)           98816     
 onal)                                                           
                                                                 
 bidirectional_5 (Bidirecti  (None, 15, 128)           74496     
 onal)                                                           
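
A bidirectional wrapper runs one recurrent layer left to right and another right to left, then joins the two outputs at each timestep. That is why 64 units produce 128 output features and roughly double the parameters (2 x 49,408 = 98,816 for the LSTM). The toy sketch below shares one set of made-up weights between both directions for brevity, whereas a real bidirectional layer learns separate weights per direction.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Scalar RNN returning the hidden state at every timestep."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(inputs):
    """Concatenate forward and backward states for each timestep."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]  # reverse again to re-align with the input order
    return [[f, b] for f, b in zip(forward, backward)]

out = bidirectional([1.0, 0.0, 0.0])
print(out)  # each timestep now has two features instead of one
```

At the last timestep the backward feature is zero, because reading from the right it has not yet seen the first token.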
                                                          

In [126]:
model_4.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])

In [127]:
model_4_history = model_4.fit(train_sentences, train_labels, epochs=5, validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [128]:
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]



array([[6.9248557e-01],
       [4.9983552e-01],
       [9.9999583e-01],
       [3.7908199e-01],
       [2.0135833e-07],
       [9.9998653e-01],
       [9.9950999e-01],
       [9.9999958e-01],
       [9.9999827e-01],
       [9.9134916e-01]], dtype=float32)

In [130]:
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 0., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [131]:
print(classification_report(val_labels, model_4_preds))

              precision    recall  f1-score   support

           0       0.75      0.82      0.79       414
           1       0.76      0.68      0.72       348

    accuracy                           0.76       762
   macro avg       0.76      0.75      0.75       762
weighted avg       0.76      0.76      0.76       762



### Model 5: Conv1D layer model

In [192]:
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Conv1D(64, 5, activation='relu')(x)
x = layers.GlobalMaxPool1D()(x)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model_5 = tf.keras.Model(inputs, outputs)

In [193]:
model_5.summary()

Model: "model_25"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_35 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_2 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 conv1d_47 (Conv1D)          (None, 11, 64)            41024     
                                                                 
 global_max_pooling1d_23 (G  (None, 64)                0         
 lobalMaxPooling1D)                                              
                                                                 
 dense_47 (Dense)            (None, 128)               832
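
The Conv1D output length of 11 and its parameter count follow from simple arithmetic. The sketch below assumes no padding and a stride of 1, which we believe are the Keras defaults; all three helpers are hypothetical.

```python
def conv1d_output_length(sequence_length, kernel_size, stride=1):
    """Number of window positions when the kernel must fit fully inside the sequence."""
    return (sequence_length - kernel_size) // stride + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return kernel_size * in_channels * filters + filters

def global_max_pool_1d(sequence):
    """Keep the largest value of each feature across all timesteps."""
    features = len(sequence[0])
    return [max(step[f] for step in sequence) for f in range(features)]

print(conv1d_output_length(15, 5))  # 15 tokens, windows of 5 tokens
print(conv1d_params(5, 128, 64))    # 64 filters over 128-dim embeddings
print(global_max_pool_1d([[1, 9], [4, 2], [3, 5]]))  # toy (timesteps, features) input
```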

In [194]:
model_5.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [195]:
model_5_history = model_5.fit(train_sentences, train_labels, epochs=5, validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [196]:
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]



array([[8.6750293e-01],
       [9.9913502e-01],
       [1.0000000e+00],
       [3.9535664e-02],
       [1.4018751e-11],
       [9.9999905e-01],
       [9.9986297e-01],
       [1.0000000e+00],
       [1.0000000e+00],
       [9.5057124e-01]], dtype=float32)

In [197]:
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [198]:
print(classification_report(val_labels, model_5_preds))

              precision    recall  f1-score   support

           0       0.75      0.77      0.76       414
           1       0.71      0.69      0.70       348

    accuracy                           0.73       762
   macro avg       0.73      0.73      0.73       762
weighted avg       0.73      0.73      0.73       762



### Model 6: TensorFlow Hub Pretrained Feature Extractor (Sentence Encoder)

In [200]:
import tensorflow_hub as hub
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
embed_samples = embed([sample_sentence, "When you call the universal sentence encoder on a sentence, it turns it into numbers."])
print(embed_samples[0][:50])

tf.Tensor(
[-0.01157025  0.02485908  0.0287805  -0.01271502  0.03971539  0.0882776
  0.02680984  0.05589836 -0.01068733 -0.00597292  0.00639322 -0.01819517
  0.00030816  0.09105889  0.05874644 -0.03180627  0.01512474 -0.05162928
  0.00991366 -0.06865346 -0.04209306  0.02678981  0.03011009  0.00321063
 -0.00337969 -0.04787361  0.02266721 -0.00985927 -0.04063616 -0.01292092
 -0.04666384  0.05630298 -0.03949255  0.00517684  0.02495823 -0.07014441
  0.02871507  0.0494768  -0.00633974 -0.08960193  0.0280712  -0.00808362
 -0.01360601  0.0599865  -0.10361787 -0.05195376  0.00232959 -0.02332532
 -0.03758109  0.03327732], shape=(50,), dtype=float32)


In [199]:
sample_sentence

"There's a flood in my street!"

In [201]:
embed_samples[0].shape

TensorShape([512])

In [204]:
#Create a Keras layer using the pretrained USE layers
sentence_encoder_layer = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder/4',
                                        input_shape=[],
                                        dtype = tf.string,
                                        trainable=False,
                                       name='USE')

In [219]:
#Create the model using Sequential API
model_6 = tf.keras.Sequential([
    sentence_encoder_layer,
    # layers.GlobalMaxPool1D(),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

In [220]:
model_6.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model_6.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_56 (Dense)            (None, 128)               65664     
                                                                 
 dense_57 (Dense)            (None, 1)                 129       
                                                                 
Total params: 256863617 (979.86 MB)
Trainable params: 65793 (257.00 KB)
Non-trainable params: 256797824 (979.61 MB)
_________________________________________________________________
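
Because the encoder is frozen (`trainable=False`), only the two Dense layers on top are trained. The arithmetic below reproduces the trainable parameter count; the encoder size is copied from the summary above, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, units):
    """One weight per input feature per unit, plus one bias per unit."""
    return n_inputs * units + units

use_params = 256_797_824  # frozen encoder weights, value taken from the summary above

trainable = dense_params(512, 128) + dense_params(128, 1)
total = use_params + trainable

print(trainable)
print(total)
print(f"{trainable / total:.4%} of the weights are updated during training")
```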


In [221]:
model_6_history = model_6.fit(train_sentences, train_labels, epochs=5, validation_data=(val_sentences, val_labels))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [222]:
model_6_pred_probs = model_6.predict(val_sentences)
model_6_pred_probs[:10]



array([[0.14202861],
       [0.7695492 ],
       [0.9922468 ],
       [0.22077411],
       [0.61421853],
       [0.75251126],
       [0.98547614],
       [0.9786936 ],
       [0.95257086],
       [0.09724431]], dtype=float32)

In [223]:
model_6_preds = tf.squeeze(tf.round(model_6_pred_probs))
model_6_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 1., 1., 1., 1., 1., 0.], dtype=float32)>

In [224]:
print(classification_report(val_labels, model_6_preds))

              precision    recall  f1-score   support

           0       0.80      0.89      0.84       414
           1       0.85      0.74      0.79       348

    accuracy                           0.82       762
   macro avg       0.83      0.81      0.82       762
weighted avg       0.82      0.82      0.82       762
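As a quick aid to reading the report above: the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall printed for Model 6. The helper name `f1_from` is made up for illustration, and because the inputs are already rounded, the result should only be expected to agree to two decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the Model 6 classification report
f1_not_disaster = f1_from(0.80, 0.89)  # class 0
f1_disaster = f1_from(0.85, 0.74)      # class 1

print(round(f1_not_disaster, 2), round(f1_disaster, 2))  # 0.84 0.79
```

Both values match the f1-score column, which shows the lower recall on class 1 (0.74) is what pulls its F1 below that of class 0.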



### Model 7: TF Hub Pretrained USE but with only 10% of the training data

In [230]:
train_len = int(0.1 * len(train_sentences))
train_sentences_10_percent = train_data_shuffle['text'][:train_len].to_list()
train_sentences_10_percent[0]

'So you have a new weapon that can cause un-imaginable destruction.'

In [231]:
train_label_10_percent = train_data_shuffle['target'][:train_len].to_list()
train_label_10_percent[0]

1

In [232]:
len(train_label_10_percent), len(train_sentences_10_percent)

(685, 685)

In [233]:
len(train_sentences), len(train_labels)

(6851, 6851)

In [235]:
model_7 = tf.keras.Sequential([
    sentence_encoder_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

In [236]:
model_7.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')

model_7.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_58 (Dense)            (None, 128)               65664     
                                                                 
 dense_59 (Dense)            (None, 1)                 129       
                                                                 
Total params: 256863617 (979.86 MB)
Trainable params: 65793 (257.00 KB)
Non-trainable params: 256797824 (979.61 MB)
_________________________________________________________________


In [242]:
# The 10% training subset is sliced from the shuffled train DataFrame, but so was the original validation split,
# so the two could overlap. Let's rebuild val_sentences and val_labels from the rows that come after the subset.
val_sentences = train_data_shuffle['text'][train_len:].to_list()
val_sentences[0]

'Excited for Cyclone football https://t.co/Xqv6gzZMmN'

In [243]:
val_labels = train_data_shuffle['target'][train_len:].to_list()
val_labels[0]

0
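The overlap concern above can be checked explicitly rather than assumed away. The following is a minimal sketch, using a list of integer row ids as a stand-in for the shuffled DataFrame; `split_at` and `shared_items` are hypothetical helper names, not part of the notebook.

```python
def split_at(items, n_train):
    """Slice an already-shuffled sequence into a training part and the remainder."""
    return items[:n_train], items[n_train:]

def shared_items(train_items, val_items):
    """Return the set of items that appear in both splits."""
    return set(train_items) & set(val_items)

# Stand-in for the 7613 shuffled rows (685 training rows + 6928 validation rows)
shuffled_ids = list(range(7613))
train_ids, val_ids = split_at(shuffled_ids, 685)

print(len(train_ids), len(val_ids), len(shared_items(train_ids, val_ids)))  # 685 6928 0
```

Slicing by position guarantees that no row lands in both splits. Passing the tweet texts to `shared_items` instead of row ids would additionally reveal duplicated tweets that sit in different rows.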

In [244]:
model_7_history = model_7.fit(train_sentences_10_percent, train_label_10_percent, epochs=10, validation_data=(val_sentences, val_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [245]:
model_7_pred_probs = model_7.predict(val_sentences)
model_7_pred_probs[:10]



array([[0.09672087],
       [0.997689  ],
       [0.7209684 ],
       [0.77340657],
       [0.9174134 ],
       [0.34599334],
       [0.9923149 ],
       [0.12435134],
       [0.9493512 ],
       [0.8939466 ]], dtype=float32)

In [246]:
model_7_preds = tf.squeeze(tf.round(model_7_pred_probs))
model_7_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 1., 1., 0., 1., 0., 1., 1.], dtype=float32)>

In [247]:
print(classification_report(val_labels, model_7_preds))

              precision    recall  f1-score   support

           0       0.81      0.78      0.80      3960
           1       0.72      0.75      0.74      2968

    accuracy                           0.77      6928
   macro avg       0.76      0.77      0.77      6928
weighted avg       0.77      0.77      0.77      6928



### Model 8: What if we make the pretrained model's layers trainable?

In [250]:
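# Unfreeze the USE layer so its weights get fine-tuned.
# Note: this is the same layer object used in model_6 and model_7, so fine-tuning it here also changes their encoder weights.
sentence_encoder_layer.trainable = True

model_8 = tf.keras.Sequential([
    sentence_encoder_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# A lower learning rate (1e-4) is commonly used when fine-tuning pretrained weights
model_8.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=tf.keras.optimizers.legacy.Adam(1e-4))

# initial_epoch only offsets the epoch counter: with initial_epoch=4 and epochs=10, training runs for 6 epochs (5 through 10)
model_8_history = model_8.fit(train_sentences, train_labels, epochs=10, initial_epoch=model_6_history.epoch[-1], validation_data=(val_sentences, val_labels))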

Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [251]:
model_8_pred_probs = model_8.predict(val_sentences)
model_8_pred_probs[:10]



array([[0.03602282],
       [0.9423729 ],
       [0.99671674],
       [0.08216799],
       [0.9304532 ],
       [0.99275005],
       [0.9961126 ],
       [0.99662423],
       [0.9954866 ],
       [0.04435186]], dtype=float32)

In [252]:
model_8_preds = tf.squeeze(tf.round(model_8_pred_probs))
model_8_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 1., 1., 1., 1., 1., 0.], dtype=float32)>

In [253]:
print(classification_report(val_labels, model_8_preds))

              precision    recall  f1-score   support

           0       0.82      0.76      0.79       414
           1       0.74      0.81      0.77       348

    accuracy                           0.78       762
   macro avg       0.78      0.78      0.78       762
weighted avg       0.79      0.78      0.78       762



## The best model was Model 6

The results suggest there is a ceiling on how well any model can do here because of noise in the tweet labels. We already saw that some tweets were mislabeled when visualizing the data earlier.

In [264]:
len(model_6_pred_probs)

762

In [279]:
confidence = []
model_6_pred_squeeze = tf.squeeze(model_6_pred_probs)
for i in range(len(model_6_pred_probs)):
    pred = model_6_pred_squeeze[i].numpy()
    if pred < 0.5:
        pred = 1 - pred
    pred = round(pred * 100, 2)
    confidence.append(pred)

In [280]:
confidence[0]

85.8
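The loop above turns each sigmoid output into a confidence in the predicted class: `p` when the prediction is 1 and `1 - p` when it is 0. A vectorized sketch of the same idea with NumPy is shown below; the function name `confidence_percent` is an assumption for illustration, and the sample values are the first three probabilities from `model_6_pred_probs`.

```python
import numpy as np

def confidence_percent(pred_probs):
    """Confidence (in %) of the predicted class for sigmoid outputs."""
    probs = np.asarray(pred_probs, dtype=np.float64).reshape(-1)
    # max(p, 1 - p) is equivalent to flipping probabilities below 0.5
    conf = np.maximum(probs, 1.0 - probs)
    return np.round(conf * 100, 2)

sample_probs = [[0.14202861], [0.7695492], [0.9922468]]
print(confidence_percent(sample_probs))
```

This avoids the per-element `.numpy()` calls and works on an array of shape `(n, 1)` or `(n,)`.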

In [282]:
df = pd.DataFrame({'sentence': val_sentences.tolist(), 'label': val_labels.tolist(), 'pred': model_6_preds, 'confidence': confidence})
df

Unnamed: 0,sentence,label,pred,confidence
0,DFR EP016 Monthly Meltdown - On Dnbheaven 2015...,0,0.0,85.80
1,FedEx no longer to transport bioterror germs i...,0,1.0,76.95
2,Gunmen kill four in El Salvador bus attack: Su...,1,1.0,99.22
3,@camilacabello97 Internally and externally scr...,1,0.0,77.92
4,Radiation emergency #preparedness starts with ...,1,1.0,61.42
...,...,...,...,...
757,That's the ultimate road to destruction,0,0.0,87.80
758,@SetZorah dad why dont you claim me that mean ...,0,0.0,88.24
759,FedEx will no longer transport bioterror patho...,0,1.0,86.99
760,Crack in the path where I wiped out this morni...,0,1.0,63.87


In [283]:
df['correct'] = df['label'] == df['pred']
df.head()

Unnamed: 0,sentence,label,pred,confidence,correct
0,DFR EP016 Monthly Meltdown - On Dnbheaven 2015...,0,0.0,85.8,True
1,FedEx no longer to transport bioterror germs i...,0,1.0,76.95,False
2,Gunmen kill four in El Salvador bus attack: Su...,1,1.0,99.22,True
3,@camilacabello97 Internally and externally scr...,1,0.0,77.92,False
4,Radiation emergency #preparedness starts with ...,1,1.0,61.42,True


In [288]:
pd.set_option('display.max_colwidth', None)

top_wrong = df[df['correct']==False].sort_values('confidence', ascending=False)
top_wrong.head(20)

Unnamed: 0,sentence,label,pred,confidence,correct
38,Why are you deluged with low self-image? Take the quiz: http://t.co/XsPqdOrIqj http://t.co/CQYvFR4UCy,1,0.0,97.26,False
411,@SoonerMagic_ I mean I'm a fan but I don't need a girl sounding off like a damn siren,1,0.0,96.1,False
233,I get to smoke my shit in peace,1,0.0,95.96,False
23,Ron &amp; Fez - Dave's High School Crush https://t.co/aN3W16c8F6 via @YouTube,1,0.0,95.92,False
244,Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP,1,0.0,95.01,False
59,You can never escape me. Bullets don't harm me. Nothing harms me. But I know pain. I know pain. Sometimes I share it. With someone like you.,1,0.0,94.58,False
681,'The way you move is like a full on rainstorm and I'm a house of cards',1,0.0,94.05,False
294,Lucas Duda is Ghost Rider. Not the Nic Cage version but an actual 'engulfed in flames' badass. #Mets,1,0.0,93.97,False
536,@DavidVonderhaar At least you were sincere ??,1,0.0,93.75,False
486,VICTORINOX SWISS ARMY DATE WOMEN'S RUBBER MOP WATCH 241487 http://t.co/yFy3nkkcoH http://t.co/KNEhVvOHVK,1,0.0,93.71,False


### Many of the most confident errors are tweets labeled as disasters that are not actually disasters: the labels are wrong, not the model
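The rows shown above are all false negatives (label 1, prediction 0). To inspect both error directions separately, a small helper along these lines could be used. This is a sketch on a toy DataFrame with the same column names as `df`; the helper name `most_confident_errors` and the toy rows are made up for illustration.

```python
import pandas as pd

def most_confident_errors(frame, label, n=5):
    """Wrong predictions for rows with the given true label, most confident first."""
    wrong = frame[(frame["label"] != frame["pred"]) & (frame["label"] == label)]
    return wrong.sort_values("confidence", ascending=False).head(n)

# Toy stand-in for the validation results DataFrame
toy = pd.DataFrame({
    "sentence": ["a", "b", "c", "d"],
    "label": [1, 0, 1, 0],
    "pred": [0.0, 1.0, 0.0, 0.0],
    "confidence": [96.10, 76.95, 97.26, 85.80],
})

false_negatives = most_confident_errors(toy, label=1)  # labeled disaster, predicted not
false_positives = most_confident_errors(toy, label=0)  # labeled not disaster, predicted disaster
print(false_negatives["sentence"].tolist(), false_positives["sentence"].tolist())
```

Reading the two groups side by side makes it easier to tell whether label noise is concentrated in one class.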

In [290]:
# Let's run our best model on the test DataFrame
test_data

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, stay safe everyone."
2,3,,,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all"
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTENERS XrWn
3259,10865,,,Storm in RI worse than last hurricane. My city&amp;3others hardest hit. My yard looks like it was bombed. Around 20000K still without power
3260,10868,,,Green Line derailment in Chicago http://t.co/UtbXLcBIuY
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) http://t.co/3X6RBQJHn3


In [291]:
test_sentences = test_data['text'].to_numpy()

In [295]:
model_6_test_pred_prob = model_6.predict(test_sentences)
model_6_test_pred_prob[:10]



array([[0.9731597 ],
       [0.9933057 ],
       [0.9819086 ],
       [0.91518265],
       [0.9266316 ],
       [0.93765   ],
       [0.03278901],
       [0.02341782],
       [0.01728554],
       [0.02358675]], dtype=float32)

In [299]:
confidence_test = []
model_6_pred_squeeze = tf.squeeze(model_6_test_pred_prob)
for i in range(len(model_6_test_pred_prob)):
    pred = model_6_pred_squeeze[i].numpy()
    if pred < 0.5:
        pred = 1 - pred
    pred = round(pred * 100, 2)
    confidence_test.append(pred)

In [296]:
model_6_pred = tf.squeeze(tf.round(model_6_test_pred_prob))
model_6_pred

<tf.Tensor: shape=(3263,), dtype=float32, numpy=array([1., 1., 1., ..., 1., 0., 1.], dtype=float32)>

In [300]:
test_df = pd.DataFrame({'Text': test_sentences.tolist(), 'Pred Label': model_6_pred, 'Confidence': confidence_test})
test_df.head(50)

Unnamed: 0,Text,Pred Label,Confidence
0,Just happened a terrible car crash,1.0,97.32
1,"Heard about #earthquake is different cities, stay safe everyone.",1.0,99.33
2,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",1.0,98.19
3,Apocalypse lighting. #Spokane #wildfires,1.0,91.52
4,Typhoon Soudelor kills 28 in China and Taiwan,1.0,92.66
5,We're shaking...It's an earthquake,1.0,93.77
6,"They'd probably still show more life than Arsenal did yesterday, eh? EH?",0.0,96.72
7,Hey! How are you?,0.0,97.66
8,What a nice hat?,0.0,98.27
9,Fuck off!,0.0,97.64


## Let's think about the speed/score tradeoff

Our best-performing model is about 2% more accurate than the baseline, but at what cost? Say we work for Twitter and see 1 million tweets per day. What if the deep model can only get through half of those tweets because of how slowly it predicts, while the baseline model can get through all of them? Is the extra 2% accuracy worth that tradeoff?

Let's run some speed tests.

In [301]:
import time
def pred_timer(model, samples):
    start_time = time.perf_counter() 
    model.predict(samples)
    end_time = time.perf_counter()
    total_time = end_time - start_time #calculate how long predictions took to make
    time_per_pred = total_time/len(samples)
    return total_time, time_per_pred
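A single timed call like the one in `pred_timer` can be noisy, and the first call to a prediction function may include one-off setup cost. A slightly more robust variant, sketched below, does an untimed warm-up call and keeps the fastest of several repeats. The names `timed_predict` and `dummy_predict` are hypothetical, and the dummy function only stands in for a real model's `predict` method.

```python
import time

def timed_predict(predict_fn, samples, repeats=3):
    """Return (best total time, best time per sample) over several timed runs."""
    predict_fn(samples)  # warm-up call, deliberately not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(samples)
        durations.append(time.perf_counter() - start)
    best = min(durations)
    return best, best / len(samples)

def dummy_predict(samples):
    """Placeholder for a model's predict method."""
    return [len(s) for s in samples]

total, per_sample = timed_predict(dummy_predict, ["a tweet", "another tweet"])
print(total, per_sample)
```

With the models in this notebook, the call would look like `timed_predict(model_6.predict, val_sentences)`.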

In [303]:
# Calculate our best model's time per prediction
model_6_total_time, model_6_time_per_pred = pred_timer(model_6, samples = val_sentences)
model_6_total_time, model_6_time_per_pred



(0.1453703340375796, 0.00019077471658475012)

In [304]:
# Calculate the baseline model's time per prediction
baseline_total_time, baseline_time_per_pred = pred_timer(model_0, val_sentences)
baseline_total_time, baseline_time_per_pred

(0.025670750066637993, 3.368864838141469e-05)

In [305]:
# These are ratios rather than differences: how many times slower model_6 is than the baseline
total_time_diff = model_6_total_time/baseline_total_time
per_pred_diff = model_6_time_per_pred/baseline_time_per_pred

total_time_diff, per_pred_diff

(5.66287832105477, 5.66287832105477)

The baseline model is about 5.7x faster than the deep model while being only about 2% less accurate. Which model is the better choice depends on the client's priorities.
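To put the measured times in terms of the 1-million-tweets-per-day scenario, the arithmetic can be sketched as follows. It assumes, purely for illustration, a single machine and a per-prediction time that stays constant at that volume, which a real deployment may not satisfy.

```python
# Per-prediction times measured above (seconds per tweet)
deep_time_per_pred = 0.00019077471658475012
baseline_time_per_pred = 3.368864838141469e-05

# Assumption for illustration: 1 million tweets per day on one machine
tweets_per_day = 1_000_000

deep_seconds = tweets_per_day * deep_time_per_pred
baseline_seconds = tweets_per_day * baseline_time_per_pred
slowdown = deep_time_per_pred / baseline_time_per_pred

print(f"deep model: {deep_seconds:.1f} s for a day of tweets")
print(f"baseline:   {baseline_seconds:.1f} s for a day of tweets")
print(f"deep model is {slowdown:.2f}x slower")
```

On these numbers alone, either model would get through a day of tweets in a few minutes, so the speed gap would matter mainly at much higher volumes or under tight per-request latency budgets.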