<a href="https://colab.research.google.com/github/MANOJ-S-NEGI/Classification_NLP_DISASTER/blob/main/disaster_or_not_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What we're going to cover


- Downloading a text dataset
- Visualizing text data
- Converting text into numbers using tokenization
- Turning our tokenized text into an embedding
- Modelling a text dataset
- Starting with a baseline (TF-IDF)
- Building several deep learning text models
- Dense, LSTM, GRU, Conv1D, Transfer learning
- Comparing the performance of each of our models
- Combining our models into an ensemble
- Saving and loading a trained model
- Finding the most wrong predictions

In [158]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [159]:
# Turn .csv files into pandas DataFrames
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.sample(3)

Unnamed: 0,id,keyword,location,text,target
4119,5854,hailstorm,Massachusetts,Twin Storms Blow Through Calgary ~ 1 http://t....,1
5810,8292,rubble,"Columbus, Georgia",'Refuse to let my life be reduced to rubble. W...,0
4816,6855,mass%20murder,"Victoria, Australia, Earth",@samanthaturne19 IIt may logically have been t...,1


In [55]:
# The test data doesn't have a target (that's what we'd try to predict)
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [56]:
# Class distribution of the target column
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [57]:
# How many samples total?
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


In [58]:
## Balance the classes: undersample target 0 so both targets have equal counts

target_0_frame = train_df[train_df.target==0]
target_1_frame = train_df[train_df.target==1]
target_0_frame = target_0_frame.sample(len(target_1_frame))


# Concatenate the two DataFrames
balanced_df = pd.concat([target_0_frame, target_1_frame])

# Class distribution after balancing
balanced_df.target.value_counts()

0    3271
1    3271
Name: target, dtype: int64
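The balancing step above is random undersampling of the majority class. A minimal sketch of the same idea in plain Python, so it runs without the CSV files: the toy rows, the `undersample` helper and the fixed seed are our own assumptions (the notebook's `sample` call does not set a seed, so its draw changes between runs).

```python
import random
from collections import Counter

def undersample(rows, label_of, seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed (an assumption) makes the draw repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy rows standing in for train_df: 6 negatives, 3 positives
toy_rows = [(f"tweet {i}", 0) for i in range(6)] + [(f"alert {i}", 1) for i in range(3)]
balanced_rows = undersample(toy_rows, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced_rows)
print(class_counts)
```

Undersampling throws away data from the larger class, which is the price paid for the even split.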

In [59]:
## Shuffle the balanced training data
train_df_shuffled = balanced_df.sample(frac= 1,random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
7217,10336,weapons,"California, United States",#Kick Hendrixonfire @'=BLACKCATS= | BIG NOOB ...,0
2436,3499,derailed,Toronto,So derailed_benchmark is cool for paths. i won...,0
2580,3700,destroyed,,@justicemalala @nkeajresq Nkea destroyed lives...,0
2760,3964,devastation,,#HungerArticles: Nepal: Rebuilding Lives and L...,1
3427,4900,explode,my deli,what if i want to fuck the duck until explode....,0


In [60]:
import random
random_index = random.randint(0, len(train_df_shuffled)-5) # random start index that leaves room for 5 samples in the balanced data
random_index

5494

In [61]:
## visualizing the data
for i in train_df_shuffled[["text","target"]][random_index:random_index+5].itertuples():
    _, text, target = i

    if target > 0:
        print(f"Target:{target}-real disaster")

    else:
        print(f"Target:{target}-not real disaster")

    print(f"Text:\n{text}\n")
    print("\n")


Target:0-not real disaster
Text:
The Art World's Seismic Shift Back to the Oddball - Observer http://t.co/W0xR5gP8cW



Target:1-real disaster
Text:
MaFireEMS: RT WMUR9: Two buildings involved in fire on 2nd Street in #Manchester. WMUR9  http://t.co/QUFwXRJIql via KCarosaWMUR



Target:1-real disaster
Text:
After a suicide bombing in SuruÌ¤ that killed 32 people Turkey launches airstrikes against ISIL and Kurdistan Workers' Party camps in Iraq.



Target:0-not real disaster
Text:
Hellfire! We donÛªt even want to think about it or mention it so letÛªs not do anything that leads to it!



Target:0-not real disaster
Text:
0-day bug in fully patched OS X comes under active exploit to bypass password ... - Ars Technica http://t.co/F7OgzrNPfv





**Split data into training and validation sets**


In [62]:
from sklearn.model_selection import train_test_split

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,) # dedicate 10% of samples to validation set


In [63]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)


(5887, 5887, 655, 655)
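The lengths above follow from the 6,542 balanced rows: with `test_size=0.1` the validation count is rounded up to 655, leaving 5,887 for training. A rough sketch of that arithmetic in plain Python; the `split_indices` helper and its seed are illustrative assumptions, not scikit-learn's implementation.

```python
import math
import random

def split_indices(n_samples, test_size=0.1, seed=42):
    """Shuffle row indices and hold out ceil(test_size * n_samples) of them."""
    n_val = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(6542)
print(len(train_idx), len(val_idx))
```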

In [64]:

# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]


(array(['Love skiing', 'who makes these? http://t.co/28t3NWHdKy',
        'Russia stood down cold war nuke ban or face ocean superiority \nUnconditional surrender next putin\nGame set match\nRelease the hostages',
        'The Danger and Excitement of Underwater Cave Diving http://t.co/8c3fPloxcr http://t.co/cBGZ9xuN2k',
        "'the third generation atomic bombed survivor' Photo exhibition 11:00 to 18:00 8/6. \n#?? #Hiroshima http://t.co/gVAipmLSl0",
        "Jeff Locke. Train wreck. F'in disaster. Fortunately the Pirates acquired a top quality starter in J.A... Oh wait. #Blowltan",
        "Dad bought a DVD that looks like a science doc on the front but I read the back and it's actually about the impending biblical apocalypse",
        'Why must I have a meltdown every few days? ??',
        '@themagickidraps not upset with a rally upset with burning buildings businesses executing cops that have nothing to do with it etc',
        'Finnish ministers: Fennovoima nuclear reactor will 

**Converting text into numbers**

In [65]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization # the experimental.preprocessing import path is deprecated in recent TensorFlow releases
"""

# Use the default TextVectorization variables
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None) # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True) # Not valid if using max_tokens=None

"""



**Find average number of tokens (words) in training Tweets**

In [66]:
split_words = []
for i in train_sentences:
    split_text_length = len(i.split())
    split_words.append(split_text_length)


round((sum(split_words))/len(train_sentences))

15
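Using the mean as the sequence length means longer Tweets get truncated, so it can help to check how many texts actually fit. A small sketch, assuming a handful of made-up Tweets in place of `train_sentences`; the `coverage` helper is hypothetical.

```python
def coverage(token_counts, max_length):
    """Fraction of texts whose whitespace-token count fits within max_length."""
    fits = sum(1 for n in token_counts if n <= max_length)
    return fits / len(token_counts)

# Hypothetical Tweets standing in for train_sentences
sample_texts = [
    "Love skiing",
    "who makes these?",
    "Why must I have a meltdown every few days?",
    "Typhoon Soudelor kills 28 in China and Taiwan",
]
token_counts = [len(text.split()) for text in sample_texts]
mean_length = round(sum(token_counts) / len(token_counts))
print(mean_length, coverage(token_counts, max_length=mean_length))
```

If the coverage looks too low on the real data, a higher percentile of the token counts is a common alternative to the mean.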

In [67]:
# Setup text vectorization with custom variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)



In [68]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)


In [69]:
# Create sample sentence and tokenize it
sample_sentence = "There's a cyclone in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[337,   3, 585,   4,  13, 563,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

In [70]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
Guns are for protection.. 
That shit really shouldn't be used unless your life in danger

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[1678,   22,   10, 8540,   17,  215,  188, 1785,   21,  598, 1758,
          36,  133,    4,  486]])>

In [71]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()


In [72]:
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}")
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
Top 5 most common words: ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words: ['lynch', 'lyme', 'lyf', 'lwilliams13', 'lwb']
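To make the vectorization steps concrete, here is a simplified imitation in plain Python: lowercase, strip punctuation, split on whitespace, map tokens to integers (0 for padding, 1 for `[UNK]`), then truncate or pad. This is a sketch under our own assumptions (ASCII punctuation only, toy corpus, hypothetical helper names), not how the Keras layer is implemented, and details such as tie-breaking between equally frequent words may differ.

```python
import re
import string
from collections import Counter

PAD, UNK = "", "[UNK]"

def standardize(text):
    """Lowercase and strip ASCII punctuation (approximates lower_and_strip_punctuation)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is [UNK], then tokens by descending frequency."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + most_common

def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, right-pad with zeros."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy corpus standing in for train_sentences
corpus = ["A cyclone hit my street", "a flood in my street", "a quiet street"]
vocab = build_vocab(corpus, max_tokens=10)
vector = vectorize("There's a cyclone in my street!", vocab, sequence_length=8)
print(vocab[:4], vector)
```

"There's" becomes "theres" after punctuation stripping, which is not in the toy vocabulary, so it maps to the `[UNK]` index.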


**Creating an Embedding using an Embedding Layer**

In [73]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # size of the input vocabulary
                             output_dim=128, # size of each embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input
                             name="embedding_1")

embedding



<keras.src.layers.core.embedding.Embedding at 0x794dd88e0cd0>

In [74]:

# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
I liked a @YouTube video http://t.co/z8Cp77lVza Boeing 737 takeoff in snowstorm. HD cockpit view + ATC audio - Episode 18      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.02273828,  0.04144235, -0.02585566, ...,  0.01463411,
         -0.00014108,  0.02110289],
        [ 0.02224833,  0.04639992, -0.02492276, ...,  0.01341997,
          0.00885509, -0.0493438 ],
        [ 0.01160518,  0.04206905, -0.00753371, ..., -0.04967964,
          0.02973107,  0.01135657],
        ...,
        [ 0.02728396,  0.0049751 ,  0.03278517, ...,  0.03508178,
          0.049391  , -0.03323895],
        [-0.02456476, -0.0407282 ,  0.02884166, ..., -0.01318163,
          0.04505003,  0.008218  ],
        [ 0.00676936,  0.0210516 , -0.00805281, ...,  0.00774992,
          0.01094258, -0.00974723]]], dtype=float32)>

In [75]:
# Check out a single token's embedding
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.02273828,  0.04144235, -0.02585566,  0.00675478, -0.03799262,
        0.04176367, -0.00383574,  0.03577802,  0.04097796, -0.02759098,
        0.03246203, -0.01256744,  0.03210486,  0.0175028 ,  0.01184662,
       -0.02493802,  0.04878395, -0.0295491 , -0.04367533, -0.017664  ,
        0.0266109 ,  0.04998339, -0.03300886,  0.04162829, -0.00056447,
       -0.03968178, -0.01472308,  0.01155072,  0.02384028,  0.02931991,
       -0.01088241,  0.04341146, -0.00442127, -0.01447533, -0.00939434,
        0.02445364,  0.02526972, -0.04624981,  0.02307569,  0.00033665,
       -0.02428859,  0.01540739, -0.03604871,  0.04374227,  0.03015487,
        0.03558059,  0.0334487 ,  0.00769513, -0.04723771, -0.02090166,
        0.04887236, -0.01975104,  0.02275128, -0.03448569,  0.00442863,
       -0.00663497, -0.03542257,  0.0496541 ,  0.02257947,  0.00783515,
       -0.04855515,  0.02478392,  0.02178121,  0.02605383, -0.01574005,
       -0.039696
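An embedding layer is, at heart, a lookup table: each token id selects one row of a weight matrix. The sketch below shows that lookup in plain Python with a tiny table; the sizes, the seed and the helper names are assumptions for illustration, and the ±0.05 range mirrors the small uniform values seen in the output above.

```python
import random

def make_embedding_table(vocab_size, embed_dim, seed=42):
    """One row of embed_dim small random floats per token id (uniform init)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)] for _ in range(vocab_size)]

def embed_tokens(token_ids, table):
    """An embedding layer's forward pass is a row lookup per token id."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=20, embed_dim=4)
token_ids = [3, 7, 3, 0, 0]  # hypothetical vectorized sentence, zero-padded
embedded = embed_tokens(token_ids, table)
print(len(embedded), len(embedded[0]))
```

The same token id always returns the same row, and during training it is those rows that get updated.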

**Model 0: Getting a baseline**
- As with any machine learning experiment, it's important to create a baseline model first, so later experiments have a benchmark to beat.

- To create our baseline, we'll build a Scikit-Learn Pipeline that uses TF-IDF (term frequency-inverse document frequency) to convert our words to numbers and then models them with the Multinomial Naive Bayes algorithm. We picked this model by following the Scikit-Learn machine learning map.
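To give a feel for what TF-IDF does, here is a minimal sketch in plain Python: a term's weight grows with how often it appears in a document and shrinks with how many documents contain it. We assume a smoothed inverse document frequency and L2 normalization, which we believe matches the `TfidfVectorizer` defaults, but the tokenization here is a plain whitespace split and the documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed idf and L2 normalization, on whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical documents for illustration
docs = ["forest fire near town", "fire drill at school", "town festival at school"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first document, "forest" appears in only one document and so ends up with a higher weight than "fire", which appears in two.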

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


In [77]:
# Create tokenization and modelling pipeline

model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text

])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

The benefit of using a shallow model like Multinomial Naive Bayes is that training is very fast.

Let's evaluate our model and find our baseline metric.

In [None]:
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")


Our baseline model achieves an accuracy of: 78.47%


In [None]:

# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1])

In [None]:

# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results

In [None]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 78.47328244274809,
 'precision': 0.7852196788318341,
 'recall': 0.7847328244274809,
 'f1': 0.7847549026489754}
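The "weighted" average computes each metric per class and then weights it by how many true samples that class has. A hand-rolled sketch with made-up labels (the `weighted_scores` helper is our own, not part of Scikit-Learn) shows why the recall above equals the accuracy divided by 100: support-weighted recall reduces to the fraction of correct predictions.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        support = sum(1 for t in y_true if t == cls)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / support
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = support / total
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
scores = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, accuracy)
```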

**Model 1: A simple dense model**

In [None]:
# Build model with the Functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional strings

x = text_vectorizer(inputs) # turn the input text into numbers

x = embedding(x) # create an embedding of the tokenized text

x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding (try running the model without this layer and see what happens)

outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer, want binary outputs so use sigmoid activation

model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model

model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVe  (None, 15)                0         
 ctorization)                                                    
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (  (None, 128)               0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1280129 (4.88 MB)
Trainable params: 128
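The pooling layer is what turns the `(None, 15, 128)` embedding output into `(None, 128)`: it averages over the 15 token positions, leaving one value per feature. A minimal sketch in plain Python with a made-up 3-token, 4-feature input; as far as we can tell, padded positions are averaged in too, because our embedding layer does not mask zeros.

```python
def global_average_pool(sequence):
    """Average a (timesteps, features) list of vectors over the timestep axis."""
    timesteps = len(sequence)
    features = len(sequence[0])
    return [sum(step[f] for step in sequence) / timesteps for f in range(features)]

# Hypothetical embedded sentence: 3 tokens, 4 features each
embedded_sentence = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
pooled = global_average_pool(embedded_sentence)
print(pooled)
```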

In [None]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',  # Monitor validation accuracy
    patience=5,               # Number of epochs with no improvement after which training will be stopped
    verbose=1,               # Display log messages
    restore_best_weights=True # Restore model weights from the epoch with the best value of the monitored quantity
)


# Fit the model
model_1_history = model_1.fit(train_sentences, # input sentences can be a list of strings due to text preprocessing layer built-in model
                              train_labels,
                              epochs=10,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[early_stop])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 9: early stopping
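With `patience=5`, training stops once five epochs in a row fail to beat the best validation accuracy so far, so stopping at epoch 9 implies the best epoch was epoch 4. The sketch below is a simplified model of that counter with invented accuracy values; it is not the Keras implementation and ignores options such as `min_delta`.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:  # only a strict improvement resets the counter
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation accuracies, one per epoch
val_accuracy = [0.70, 0.75, 0.78, 0.80, 0.79, 0.80, 0.78, 0.79, 0.80, 0.81]
stop_epoch, best_epoch = early_stopping_epoch(val_accuracy, patience=5)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model ends up with the weights from the best epoch rather than the last one.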


In [None]:
# Check the results
model_1.evaluate(val_sentences, val_labels)



[0.47296831011772156, 0.7725191116333008]

In [None]:

embedding.weights

[<tf.Variable 'embedding_1/embeddings:0' shape=(10000, 128) dtype=float32, numpy=
 array([[ 0.03357914,  0.03828212,  0.02976629, ..., -0.00687335,
         -0.01128913, -0.00267786],
        [ 0.05080666, -0.03503846,  0.03293104, ..., -0.04100879,
         -0.00449625, -0.02019813],
        [-0.02277319,  0.05271037,  0.01285423, ...,  0.06001385,
         -0.04213574,  0.00950958],
        ...,
        [-0.0067892 , -0.07327034, -0.03786656, ..., -0.00715279,
          0.09196931, -0.03333502],
        [-0.0029752 ,  0.07715705,  0.00049048, ...,  0.02546285,
         -0.04773935,  0.03244756],
        [-0.04115189,  0.00360898,  0.01812258, ...,  0.04433915,
          0.04883465,  0.03293625]], dtype=float32)>]

In [None]:
embed_weights = model_1.get_layer("embedding_1").get_weights()[0]
print(embed_weights.shape)

(10000, 128)


In [None]:
# model_1_history is a Keras History object; the per-epoch metrics live in its .history dictionary
loss = model_1_history.history['loss']
accuracy = model_1_history.history['accuracy']

# Validation metrics are recorded because validation_data was passed to fit()
val_loss = model_1_history.history['val_loss']
val_accuracy = model_1_history.history['val_accuracy']

# Plotting the training loss and validation loss
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

# Plotting the training accuracy and validation accuracy
plt.subplot(1, 2, 2)
plt.plot(accuracy, label='Training Accuracy')
plt.plot(val_accuracy, label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()


In [None]:

# Make predictions (these come back in the form of probabilities)
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs[:10] # only print out the first 10 prediction probabilities




array([[0.05764266],
       [0.8897773 ],
       [0.01770181],
       [0.06314652],
       [0.98546815],
       [0.1925734 ],
       [0.2545943 ],
       [0.8501932 ],
       [0.5553487 ],
       [0.36082584]], dtype=float32)

In [None]:
# Turn prediction probabilities into single-dimension tensor of floats
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs)) # squeeze removes single dimensions
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 1., 0., 0.,
       0., 0., 1.], dtype=float32)>
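Rounding the sigmoid outputs amounts to applying a 0.5 decision threshold. A small sketch in plain Python using the first ten `model_1` probabilities shown above (rounded to four decimals); the `to_labels` helper and the stricter 0.6 threshold are our own illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Label 1 when the predicted probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# First ten validation probabilities from model_1, rounded to 4 decimals
pred_probs = [0.0576, 0.8898, 0.0177, 0.0631, 0.9855, 0.1926, 0.2546, 0.8502, 0.5553, 0.3608]
labels = to_labels(pred_probs)
stricter = to_labels(pred_probs, threshold=0.6)
print(labels, stricter)
```

Raising the threshold flips borderline predictions such as 0.5553 to "not a disaster", trading recall for precision on the disaster class.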

In [None]:

# Calculate model_1 metrics
model_1_results = calculate_results(y_true=val_labels, y_pred=model_1_preds)
model_1_results

{'accuracy': 77.25190839694656,
 'precision': 0.7733297966275492,
 'recall': 0.7725190839694657,
 'f1': 0.7725190839694657}

**Visualizing learned embeddings**

In [None]:
# Get the vocabulary from the text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

**Recurrent Neural Networks (RNNs)**

In [96]:
# Set random seed and create embedding layer (new embedding layer for each model)
tf.random.set_seed(42)
from tensorflow.keras import layers
model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_2")

In [97]:

# Create LSTM model
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_2_embedding(x)
print(x.shape)

x = layers.LSTM(128, return_sequences=True)(x)
x = layers.Dropout(0.2)(x)
x = layers.LSTM(64)(x)
print(x.shape)

x = layers.Dense(64, activation="relu")(x) # optional dense layer on top of output of LSTM cell
outputs = layers.Dense(1, activation="sigmoid")(x)

model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

model_2.summary()


(None, 15, 128)
(None, 64)
Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding_2 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 lstm_10 (LSTM)              (None, 15, 128)           131584    
                                                                 
 dropout_4 (Dropout)         (None, 15, 128)           0         
                                                                 
 lstm_11 (LSTM)              (None, 64)                49408     
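The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the helper names are ours. A bidirectional wrapper runs two such LSTMs, so it has twice the parameters.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized row per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """A weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

counts = {
    "embedding_2": embedding_params(10000, 128),
    "lstm_10": lstm_params(input_dim=128, units=128),
    "lstm_11": lstm_params(input_dim=128, units=64),
    "bidirectional LSTM(64)": 2 * lstm_params(input_dim=128, units=64),
    "dense (model_1 output)": dense_params(128, 1),
}
print(counts)
```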
                           

In [98]:
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd =  tf.keras.optimizers.SGD(learning_rate=0.01,  momentum=0.9, nesterov=True)

model_2.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])


In [99]:
# Fit model
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',  # Monitor validation accuracy
    patience=3,               # Number of epochs with no improvement after which training will be stopped
    verbose=1,               # Display log messages
    restore_best_weights=True # Restore model weights from the epoch with the best value of the monitored quantity
)


model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=15, # matches the "Epoch n/15" training log
                              validation_data=(val_sentences, val_labels),
                              callbacks=[early_stop])



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 13: early stopping


In [100]:
# Make predictions on the validation dataset
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs.shape, model_2_pred_probs[:10] # view the first 10



((655, 1),
 array([[0.19379182],
        [0.8485478 ],
        [0.19389717],
        [0.14243396],
        [0.19167963],
        [0.3942209 ],
        [0.48218733],
        [0.7780898 ],
        [0.19685441],
        [0.2736581 ]], dtype=float32))

In [102]:
# Round out predictions and reduce to 1-dimensional array
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 0., 0., 0., 0., 0., 1., 0., 0.], dtype=float32)>

In [103]:

# Calculate LSTM model results
model_2_results = calculate_results(y_true=val_labels, y_pred=model_2_preds)
model_2_results

{'accuracy': 77.09923664122137,
 'precision': 0.771145630713938,
 'recall': 0.7709923664122137,
 'f1': 0.7709539265468274}

- **Model 3: GRU**
- **Model 4: Bidirectional RNN model**

In [111]:

# Set random seed and create embedding layer (new embedding layer for each model)
tf.random.set_seed(42)
from tensorflow.keras import layers
model_4_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_4")

# Build a Bidirectional RNN in TensorFlow
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_4_embedding(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # stacking RNN layers requires return_sequences=True

x = layers.Dropout(0.2)(x)
x = layers.Bidirectional(layers.LSTM(64))(x) # bidirectional goes both ways so has double the parameters of a regular LSTM layer
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_Bidirectional")






In [115]:
# Get a summary of our bidirectional model
model_4.summary()

Model: "model_4_Bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_11 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding_4 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 bidirectional_1 (Bidirecti  (None, 15, 128)           98816     
 onal)                                                           
                                                                 
 dropout_5 (Dropout)         (None, 15, 128)           0         
                                                                 
 bidirectional_2 (Bidirecti  (None, 128)     

In [114]:
# Compile model

# Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd =  tf.keras.optimizers.SGD(learning_rate=0.01,  momentum=0.9, nesterov=True)

model_4.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"]) # compile model_4 (not model_2) before fitting it


In [116]:
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=10,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[early_stop])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 4: early stopping


In [117]:

# Make predictions with bidirectional RNN on the validation data
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]




array([[0.22869554],
       [0.49174142],
       [0.25223005],
       [0.16676591],
       [0.15490757],
       [0.44104293],
       [0.30376655],
       [0.5333912 ],
       [0.16971526],
       [0.1866804 ]], dtype=float32)

In [118]:

# Convert prediction probabilities to labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]


<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], dtype=float32)>

In [119]:
# Calculate bidirectional RNN model results
model_4_results = calculate_results(val_labels, model_4_preds)
model_4_results


{'accuracy': 79.08396946564885,
 'precision': 0.7945646835684949,
 'recall': 0.7908396946564885,
 'f1': 0.7901519199674518}

## Using Pretrained Embeddings (transfer learning for NLP)

We can load a TensorFlow Hub module by passing its URL to the hub.load() method. In our case, the URL is "https://tfhub.dev/google/universal-sentence-encoder/4".

In [120]:
# Example of pretrained embedding with universal sentence encoder - https://tfhub.dev/google/universal-sentence-encoder/4
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4") # load Universal Sentence Encoder (the cells below call embed)


In [121]:
sample_sentence

"There's a cyclone in my street!"

In [122]:
# Calling the Universal Sentence Encoder on a list of sentences turns each one into a vector of numbers

embed_samples = embed([sample_sentence])

In [129]:
embed_samples[0][:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([-0.02916172,  0.01692663, -0.00798343,  0.02437593,  0.00955493,
        0.08220604,  0.03376466,  0.06680672, -0.01045315, -0.00750477,
        0.00795671, -0.04481372,  0.04022907,  0.09037361,  0.04888926,
        0.01342431, -0.04328211, -0.06109739,  0.01985499, -0.05867866],
      dtype=float32)>

In [132]:
# Each sentence has been encoded into a 512 dimension vector
embed_samples[0].shape

TensorShape([512])

Passing our sentences to the Universal Sentence Encoder (USE) encodes them from strings into 512-dimensional vectors.

In [134]:
# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # shape of inputs coming to our model
                                        dtype=tf.string, # data type of inputs coming to the USE layer
                                        trainable=False, # keep the pretrained weights (we'll create a feature extractor)
                                        name="USE")

In [135]:

# Create model using the Sequential API
model_6 = tf.keras.Sequential([sentence_encoder_layer, # take in sentences and then encode them into an embedding
  layers.Dense(64, activation="relu"),
  layers.Dense(1, activation="sigmoid")
], name="model_6_USE")

# Compile model
model_6.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])



In [137]:
# Train a classifier on top of pretrained embeddings
model_6_history = model_6.fit(train_sentences,
                              train_labels,
                              epochs=8,
                              validation_data=(val_sentences, val_labels),
                              callbacks = [early_stop])

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 5: early stopping


In [138]:

# Make predictions with USE TF Hub model
model_6_pred_probs = model_6.predict(val_sentences)
model_6_pred_probs[:10]



array([[0.2696747 ],
       [0.07366128],
       [0.10768273],
       [0.07747962],
       [0.20215209],
       [0.61959004],
       [0.32175237],
       [0.6049156 ],
       [0.0809449 ],
       [0.06968644]], dtype=float32)

In [139]:

# Convert prediction probabilities to labels
model_6_preds = tf.squeeze(tf.round(model_6_pred_probs))
model_6_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], dtype=float32)>
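
Rounding a sigmoid output is the same as applying a 0.5 decision threshold. Below is a minimal NumPy sketch of that step, reusing the ten probabilities printed above. The variable names are illustrative and do not come from the notebook.

```python
import numpy as np

# First ten validation probabilities, copied from the output above
pred_probs = np.array([[0.2696747], [0.07366128], [0.10768273], [0.07747962],
                       [0.20215209], [0.61959004], [0.32175237], [0.6049156],
                       [0.0809449], [0.06968644]], dtype=np.float32)

# Drop the trailing axis of size 1, then round to the nearest integer (0 or 1)
preds = np.round(pred_probs.squeeze(axis=-1))

# Equivalent explicit threshold: anything above 0.5 is labelled as a disaster (1)
preds_threshold = (pred_probs.squeeze(axis=-1) > 0.5).astype(np.float32)

print(preds)
```

Writing the threshold explicitly makes it easier to try a cut-off other than 0.5 later, for example to trade precision against recall.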

In [140]:

# Calculate model 6 performance metrics
model_6_results = calculate_results(val_labels, model_6_preds)
model_6_results


{'accuracy': 82.59541984732824,
 'precision': 0.8267211534235268,
 'recall': 0.8259541984732824,
 'f1': 0.8258437798053486}

In [141]:
# Clone model_6 but reset weights
model_7 = tf.keras.models.clone_model(model_6)

# Compile model
model_7.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Get a summary (will be same as model_6)
model_7.summary()

Model: "model_6_USE"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_15 (Dense)            (None, 64)                32832     
                                                                 
 dense_16 (Dense)            (None, 1)                 65        
                                                                 
Total params: 256830721 (979.73 MB)
Trainable params: 32897 (128.50 KB)
Non-trainable params: 256797824 (979.61 MB)
_________________________________________________________________


**Saving and loading a trained model**

There are two main formats for saving a model in TensorFlow:

- The HDF5 format.
- The SavedModel format (default).

In [229]:
# Save TF Hub Sentence Encoder model to HDF5 format
model_6.save("/content/drive/MyDrive/colab_model/disaster_or_not_model_6.h5")



In [143]:
# Load model with custom Hub Layer (required with HDF5 format)
loaded_model_6 = tf.keras.models.load_model("/content/drive/MyDrive/colab_model/disaster_or_not_model_6.h5",
                                            custom_objects={"KerasLayer": hub.KerasLayer})

In [144]:
# How does our loaded model perform?
loaded_model_6.evaluate(val_sentences, val_labels)



[0.410685658454895, 0.8259541988372803]

In [145]:
# Create dataframe with validation sentences and best performing model predictions
val_df = pd.DataFrame({"text": val_sentences,
                       "target": val_labels,
                       "pred": model_6_preds,
                       "pred_prob": tf.squeeze(model_6_pred_probs)})
val_df.head()

Unnamed: 0,text,target,pred,pred_prob
0,sleeping with sirens vai vir pra sp,0,0.0,0.269675
1,games that I really hope to see in AGDQ: Traum...,0,0.0,0.073661
2,Wait What??? http://t.co/uAVFRtlfs4 http://t.c...,0,0.0,0.107683
3,@MistressPip I'm amazed you have not been inun...,0,0.0,0.07748
4,what if i want to fuck the duck until explode....,0,0.0,0.202152
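
A DataFrame with these columns can be used to find the "most wrong" predictions, meaning the rows where the model was confident but incorrect. The sketch below uses a small made-up DataFrame (`example_df`, a hypothetical stand-in with the same columns as `val_df`), not the notebook's real validation data.

```python
import pandas as pd

# Hypothetical stand-in for val_df (made-up rows, same columns as above)
example_df = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "target": [0, 1, 1, 0],
    "pred": [1.0, 0.0, 1.0, 0.0],
    "pred_prob": [0.91, 0.04, 0.88, 0.12],
})

# Keep only the rows where the prediction disagrees with the label
wrong = example_df[example_df["target"] != example_df["pred"]]

# Confident false positives: highest probability first
false_positives = wrong[wrong["pred"] == 1].sort_values("pred_prob", ascending=False)

# Confident false negatives: lowest probability first
false_negatives = wrong[wrong["pred"] == 0].sort_values("pred_prob", ascending=True)

print(false_positives)
print(false_negatives)
```

Reading the top rows of each group is often a quick way to spot mislabelled samples or patterns the model struggles with.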


**Making predictions on the test dataset**

In [None]:
# Making predictions on the test dataset
prediction = []
prediction_probability = []
test_sentences = test_df["text"].to_list()
for test_sample in test_sentences:
  pred_prob = tf.squeeze(model_6.predict([test_sample])) # predict expects a batch, so wrap the single sentence in a list
  pred = tf.round(pred_prob)
  prediction.append(int(pred))
  prediction_probability.append(pred_prob)
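
The loop above calls `predict` once per tweet. Keras's `predict` already batches its input internally, so a single call over the whole list usually does the same job with much less overhead. The sketch below shows the idea with a hypothetical helper, `predict_labels`, and a fake prediction function standing in for `model_6.predict`. Passing `model_6.predict` itself assumes it accepts the full list of strings, as it does for the single-item list above.

```python
import numpy as np

def predict_labels(predict_fn, sentences, threshold=0.5):
    """Return (labels, probabilities) for a list of sentences in one call."""
    # Flatten the (n, 1) sigmoid output to shape (n,)
    probs = np.asarray(predict_fn(sentences), dtype=np.float32).reshape(-1)
    # Label 1 when the probability is above the threshold, else 0
    labels = (probs > threshold).astype(int)
    return labels.tolist(), probs.tolist()

# Hypothetical stand-in for model_6.predict: returns shape (n, 1) like a sigmoid head
def fake_predict(sentences):
    return np.array([[0.9 if "fire" in s else 0.1] for s in sentences])

labels, probs = predict_labels(fake_predict, ["forest fire near the pond", "nice day"])
print(labels)
```

With the real model, the call would look like `predict_labels(model_6.predict, test_sentences)`. The helper returns plain Python lists, so no per-tensor `.numpy()` conversion is needed afterwards.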




In [173]:
# Extract numpy values
numpy_probs = [tensor.numpy() for tensor in prediction_probability]
numpy_probs[:10]

[0.7308816,
 0.90077615,
 0.8141924,
 0.93867433,
 0.96010053,
 0.697001,
 0.073662914,
 0.06838173,
 0.09514087,
 0.08482724]

In [208]:
predicted_dataframe = pd.DataFrame({"text": test_sentences, "target": prediction, "prediction_probability": numpy_probs})

# Use the row position as an 'id' index (note: this is not the 'id' column from test.csv)
predicted_dataframe['id'] = range(0, len(predicted_dataframe))
predicted_dataframe.set_index('id', inplace=True)




In [209]:
predicted_dataframe

id,text,target,prediction_probability
0,Just happened a terrible car crash,1,0.730882
1,"Heard about #earthquake is different cities, s...",1,0.900776
2,"there is a forest fire at spot pond, geese are...",1,0.814192
3,Apocalypse lighting. #Spokane #wildfires,1,0.938674
4,Typhoon Soudelor kills 28 in China and Taiwan,1,0.960101
...,...,...,...
3258,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,1,0.841084
3259,Storm in RI worse than last hurricane. My city...,1,0.950700
3260,Green Line derailment in Chicago http://t.co/U...,1,0.915232
3261,MEG issues Hazardous Weather Outlook (HWO) htt...,1,0.605705


## Test data file prediction for Kaggle submission


In [224]:
test_file_k = pd.read_csv("/content/test_file.csv")

len(test_file_k)

3263

In [225]:
predicted_dataframe_submission = pd.DataFrame({"id": test_file_k["id"], "target": prediction})


In [226]:
len(predicted_dataframe_submission)

3263

In [228]:
import pandas as pd

# Specify the file path
file_path = 'prediction_test_file.csv'

# Save the DataFrame to a CSV file
predicted_dataframe_submission.to_csv(file_path, index=False)
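
The submission pairs ids and predictions purely by row order, so a length check before writing the file catches the most obvious mismatch. The sketch below uses a hypothetical helper, `build_submission`, with the first five ids from `test_df.head()` and the first five predicted labels shown above. It writes to an in-memory buffer instead of `prediction_test_file.csv`.

```python
import io
import pandas as pd

def build_submission(ids, predictions):
    """Pair each test id with its predicted label, checking lengths first."""
    if len(ids) != len(predictions):
        raise ValueError(f"{len(ids)} ids but {len(predictions)} predictions")
    return pd.DataFrame({"id": list(ids), "target": [int(p) for p in predictions]})

# First five ids and predicted labels, copied from the outputs above
submission = build_submission([0, 2, 3, 9, 11], [1, 1, 1, 1, 1])

# Write to an in-memory buffer instead of disk to inspect the CSV text
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The check only compares lengths. It does not verify that the rows of the id file are in the same order as the sentences that were predicted.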

---

In [None]:
from google.colab import drive
drive.mount('/content/drive')