<a id="section-zero"></a>

# TABLE OF CONTENTS


* [Library Importations](#section-one)
* [Loading Datasets](#section-two)
* [Exploratory Data Analysis](#section-three)
* [Data Preprocessing](#section-four)
    - [Text Normalization](#subsection-four-one)
    - [Stemming](#subsection-four-two)
    - [Lemmatization](#subsection-four-three)
* [Vector Transformation](#section-five)
    - [Bag Of Words](#subsection-five-one)
    - [TF-IDF](#subsection-five-two)
    - [Word Embedding](#subsection-five-three)
* [Building Model](#section-six)
    - [Support Vector Machine](#subsection-six-one)
    - [XGBoost](#subsection-six-two)
    - [Naive Bayes Classifier](#subsection-six-three)
    - [Logistic Regression](#subsection-six-four)
    - [Neural Network](#subsection-six-five)
* [BERT](#section-seven)
* [Model Comparison](#section-eight)
* [Submission](#section-nine)

<a id="section-one"></a>
# Import all the required libraries

In [None]:
import numpy as np 
import pandas as pd
import os
import time

import string
import emoji
import re

from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers.experimental import preprocessing

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub


# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud



# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

#For Stemming, NLTK is needed
import nltk
from nltk.stem.snowball import SnowballStemmer

import spacy
nlp = spacy.load('en_core_web_lg')


<a id="section-two"></a>
# Load datasets

Loading the train dataset to df_train and test dataset to df_test

In [None]:
df_train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
df_test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

<a id="section-three"></a>
# Exploratory Data Analysis (EDA) 

Exploratory data analysis is the process of running initial investigations on the data to discover patterns, spot anomalies, test hypotheses, and check assumptions, using summary statistics and graphical representations.

Let's check various aspects of the dataset. Not every check will turn out to be useful.

Check which columns contain NaN (missing) values. 'location' is missing in many rows of both the train and test datasets.

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

Count the number of words in each tweet and print the maximum count

In [None]:
count = df_train['text'].str.split().str.len()
print(count.max())

Let's see the rows where 'keyword' is present

In [None]:
df_train[df_train.keyword.notnull()]

Let's see the rows where 'keyword' is absent

In [None]:
df_train[df_train.keyword.isnull()]

Checking shape of train and test datasets. Note that the test dataset does not have 'target' column.

In [None]:
print("Train dataset shape : ",df_train.shape)
print("Test dataset shape : ",df_test.shape)

Let's check how many tweets are related to disasters

In [None]:
df_train['target'].value_counts()

In [None]:
target_counts = df_train['target'].value_counts()
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.barplot(x=target_counts.index, y=target_counts.values, palette='rocket')

Let's explore the keyword column and see if it's useful

In [None]:
df_train['keyword'].value_counts()

Let's display only the first 25 keywords, as there are too many to show them all. We use a horizontal orientation and the viridis palette.

In [None]:
sns.barplot(y=df_train['keyword'].value_counts()[:25].index,x=df_train['keyword'].value_counts()[:25], orient='horizontal', palette='viridis')

**Let's explore the target column**

Let's see 5 tweets about a disaster.

In [None]:
# A disaster tweet
disaster_tweets = df_train[df_train['target']==1]['text']
disaster_tweets.values[:5]

Let's see 5 non-disaster tweets

In [None]:
non_disaster_tweets = df_train[df_train['target']==0]['text']
non_disaster_tweets.values[:5]

Generate a WordCloud for disaster tweets and non-disaster tweets

In [None]:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[26, 8])
wordcloud1 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(disaster_tweets))
ax1.imshow(wordcloud1)
ax1.axis('off')
ax1.set_title('Disaster Tweets',fontsize=40);

wordcloud2 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(non_disaster_tweets))
ax2.imshow(wordcloud2)
ax2.axis('off')
ax2.set_title('Non Disaster Tweets',fontsize=40);

<a id="section-four"></a>
# Preprocessing the data

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

We will create a 'clean' function that chains several cleaning steps, such as removing emojis and punctuation.

In [None]:
def clean(text):
    text = text.lower() # Let's make it lowercase
    # Links go first: punctuation removal would strip the '://' and '.' that the link pattern relies on
    text = removeLinks(text)
    text = removeStopwords(text)
    text = removePunctuations(text)
    text = removeEmojis(text)
    text = removeNumbers(text)
    return text
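
The order of the cleaning steps matters. Here is a minimal standard-library sketch of why; `strip_links` and `strip_punct` are hypothetical stand-ins that mirror `removeLinks` and `removePunctuations`, and the sample tweet is made up. Once punctuation is stripped, a URL no longer contains `://`, so the link pattern cannot match it.

```python
import re
import string

def strip_links(text):
    # Same pattern as removeLinks: http(s) URLs or anything starting with www.
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def strip_punct(text):
    # Same approach as removePunctuations: delete every ASCII punctuation character
    return text.translate(str.maketrans('', '', string.punctuation))

tweet = "fire downtown http://t.co/abc123 stay safe"  # made-up example

links_first = strip_punct(strip_links(tweet))
punct_first = strip_links(strip_punct(tweet))

print(links_first)   # the URL is gone
print(punct_first)   # URL debris survives as an ordinary-looking token
```

Running link removal before punctuation removal avoids leaving tokens such as `httptcoabc123` in the cleaned text.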

In [None]:
def removeStopwords(text):
    doc = nlp(text)
    # Keep only the tokens that spaCy does not flag as stop words
    return ' '.join(str(txt) for txt in doc if not txt.is_stop)

print("Text before removeStopwords function: " + df_train['text'][1])
print("Text after removeStopwords function: " + removeStopwords(df_train['text'][1]))

In [None]:
def removePunctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))

print("Text before removePunctuations function: " + df_train['text'][1])
print("Text after removePunctuations function: " + removePunctuations(df_train['text'][1]))

In [None]:
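def removeEmojis(text):
    # Assumption: emoji.UNICODE_EMOJI is only present in older releases of the emoji package (before 2.0)
    emoji_list = [c for c in text if c in emoji.UNICODE_EMOJI["en"]]
    # Drop every whitespace-separated word that contains at least one emoji character
    clean_text = ' '.join([word for word in text.split() if not any(e in word for e in emoji_list)])
    return clean_text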

test_string = "Hi' 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙".lower()
(test_string,removeEmojis(test_string))

In [None]:
def removeNumbers(text):
    clean_text = re.sub(r'\d+', '', text)
    return clean_text

test_string = "Hi 🙈 99 girls are running"
(test_string,removeNumbers(test_string))

In [None]:
def removeLinks(text):
    clean_text = re.sub(r'https?://\S+|www\.\S+', '', text)
    #https? will match both http and https
    #A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
    #\S Matches any character which is not a whitespace character.
    #+ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
    return clean_text

test_string = "http://www.youtube.com/ and https://www.youtube.com/ should be removed "
(test_string,removeLinks(test_string))

We will work with pre-processed data for both training and test data sets. Running this will take a while.

In [None]:
df_train['text']=df_train.text.apply(clean)
df_test['text']=df_test.text.apply(clean)

Checking if the transformation is applied as expected, in the text column

In [None]:
df_train.head()

Let's see the word cloud for df_train['text'] now

In [None]:
tweets = df_train['text']
fig, ax1 = plt.subplots(1, figsize=[26, 8])
wordcloud1 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(tweets))
ax1.imshow(wordcloud1)
ax1.axis('on')
ax1.set_title('Tweets',fontsize=40);

Thanks to https://www.kaggle.com/rftexas/text-only-bert-keras?scriptVersionId=31186559
Some labels are wrong. For example, the training rows with ids 328, 443, 513, 2619, 3640, 3900, 4342, 5781, 6552, 6554, 6570, 6701, 6702, 6729, 6861 and 7226 have target 1, although they are clearly not related to any disaster.

We change their target to 0.

In [None]:
ids_with_target_error = [328,443,513,2619,3640,3900,4342,5781,6552,6554,6570,6701,6702,6729,6861,7226]
# .at only accepts a single row/column label; a boolean mask needs .loc
df_train.loc[df_train['id'].isin(ids_with_target_error),'target'] = 0
df_train[df_train['id'].isin(ids_with_target_error)]

<a id="subsection-four-one"></a>
# Text Normalization
Text normalization is the process of transforming text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.
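
One simple normalization heuristic can be sketched with the standard library. The helper name `collapse_repeats` is hypothetical and this is not what the notebook does later: it only shrinks runs of three or more identical characters, so it handles elongated words like “gooood” but not respellings like “gud”.

```python
import re

def collapse_repeats(word):
    # Shrink any run of 3+ identical characters down to exactly 2
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

for w in ["gooood", "cooool", "good"]:
    print(w, "->", collapse_repeats(w))
```

The heuristic is deliberately crude: “sooo” becomes “soo” rather than “so”, so a dictionary check would still be needed for a real canonical form.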

Let's convert all the abbreviations to their full forms.
Thanks to https://www.kaggle.com/rftexas/text-only-bert-keras?scriptVersionId=31186559

In [None]:
abbreviations = {
    "$" : " dollar ",
    "€" : " euro ",
    "4ao" : "for adults only",
    "a.m" : "before midday",
    "a3" : "anytime anywhere anyplace",
    "aamof" : "as a matter of fact",
    "acct" : "account",
    "adih" : "another day in hell",
    "afaic" : "as far as i am concerned",
    "afaict" : "as far as i can tell",
    "afaik" : "as far as i know",
    "afair" : "as far as i remember",
    "afk" : "away from keyboard",
    "app" : "application",
    "approx" : "approximately",
    "apps" : "applications",
    "asap" : "as soon as possible",
    "asl" : "age, sex, location",
    "atk" : "at the keyboard",
    "ave." : "avenue",
    "aymm" : "are you my mother",
    "ayor" : "at your own risk", 
    "b&b" : "bed and breakfast",
    "b+b" : "bed and breakfast",
    "b.c" : "before christ",
    "b2b" : "business to business",
    "b2c" : "business to customer",
    "b4" : "before",
    "b4n" : "bye for now",
    "b@u" : "back at you",
    "bae" : "before anyone else",
    "bak" : "back at keyboard",
    "bbbg" : "bye bye be good",
    "bbc" : "british broadcasting corporation",
    "bbias" : "be back in a second",
    "bbl" : "be back later",
    "bbs" : "be back soon",
    "be4" : "before",
    "bfn" : "bye for now",
    "blvd" : "boulevard",
    "bout" : "about",
    "brb" : "be right back",
    "bros" : "brothers",
    "brt" : "be right there",
    "bsaaw" : "big smile and a wink",
    "btw" : "by the way",
    "bwl" : "bursting with laughter",
    "c/o" : "care of",
    "cet" : "central european time",
    "cf" : "compare",
    "cia" : "central intelligence agency",
    "csl" : "can not stop laughing",
    "cu" : "see you",
    "cul8r" : "see you later",
    "cv" : "curriculum vitae",
    "cwot" : "complete waste of time",
    "cya" : "see you",
    "cyt" : "see you tomorrow",
    "dae" : "does anyone else",
    "dbmib" : "do not bother me i am busy",
    "diy" : "do it yourself",
    "dm" : "direct message",
    "dwh" : "during work hours",
    "e123" : "easy as one two three",
    "eet" : "eastern european time",
    "eg" : "example",
    "embm" : "early morning business meeting",
    "encl" : "enclosed",
    "encl." : "enclosed",
    "etc" : "and so on",
    "faq" : "frequently asked questions",
    "fawc" : "for anyone who cares",
    "fb" : "facebook",
    "fc" : "fingers crossed",
    "fig" : "figure",
    "fimh" : "forever in my heart", 
    "ft." : "feet",
    "ft" : "featuring",
    "ftl" : "for the loss",
    "ftw" : "for the win",
    "fwiw" : "for what it is worth",
    "fyi" : "for your information",
    "g9" : "genius",
    "gahoy" : "get a hold of yourself",
    "gal" : "get a life",
    "gcse" : "general certificate of secondary education",
    "gfn" : "gone for now",
    "gg" : "good game",
    "gl" : "good luck",
    "glhf" : "good luck have fun",
    "gmt" : "greenwich mean time",
    "gmta" : "great minds think alike",
    "gn" : "good night",
    "g.o.a.t" : "greatest of all time",
    "goat" : "greatest of all time",
    "goi" : "get over it",
    "gps" : "global positioning system",
    "gr8" : "great",
    "gratz" : "congratulations",
    "gyal" : "girl",
    "h&c" : "hot and cold",
    "hp" : "horsepower",
    "hr" : "hour",
    "hrh" : "his royal highness",
    "ht" : "height",
    "ibrb" : "i will be right back",
    "ic" : "i see",
    "icq" : "i seek you",
    "icymi" : "in case you missed it",
    "idc" : "i do not care",
    "idgadf" : "i do not give a damn fuck",
    "idgaf" : "i do not give a fuck",
    "idk" : "i do not know",
    "ie" : "that is",
    "i.e" : "that is",
    "ifyp" : "i feel your pain",
    "IG" : "instagram",
    "iirc" : "if i remember correctly",
    "ilu" : "i love you",
    "ily" : "i love you",
    "imho" : "in my humble opinion",
    "imo" : "in my opinion",
    "imu" : "i miss you",
    "iow" : "in other words",
    "irl" : "in real life",
    "j4f" : "just for fun",
    "jic" : "just in case",
    "jk" : "just kidding",
    "jsyk" : "just so you know",
    "l8r" : "later",
    "lb" : "pound",
    "lbs" : "pounds",
    "ldr" : "long distance relationship",
    "lmao" : "laugh my ass off",
    "lmfao" : "laugh my fucking ass off",
    "lol" : "laughing out loud",
    "ltd" : "limited",
    "ltns" : "long time no see",
    "m8" : "mate",
    "mf" : "motherfucker",
    "mfs" : "motherfuckers",
    "mfw" : "my face when",
    "mofo" : "motherfucker",
    "mph" : "miles per hour",
    "mr" : "mister",
    "mrw" : "my reaction when",
    "ms" : "miss",
    "mte" : "my thoughts exactly",
    "nagi" : "not a good idea",
    "nbc" : "national broadcasting company",
    "nbd" : "not big deal",
    "nfs" : "not for sale",
    "ngl" : "not going to lie",
    "nhs" : "national health service",
    "nrn" : "no reply necessary",
    "nsfl" : "not safe for life",
    "nsfw" : "not safe for work",
    "nth" : "nice to have",
    "nvr" : "never",
    "nyc" : "new york city",
    "oc" : "original content",
    "og" : "original",
    "ohp" : "overhead projector",
    "oic" : "oh i see",
    "omdb" : "over my dead body",
    "omg" : "oh my god",
    "omw" : "on my way",
    "p.a" : "per annum",
    "p.m" : "after midday",
    "pm" : "prime minister",
    "poc" : "people of color",
    "pov" : "point of view",
    "pp" : "pages",
    "ppl" : "people",
    "prw" : "parents are watching",
    "ps" : "postscript",
    "pt" : "point",
    "ptb" : "please text back",
    "pto" : "please turn over",
    "qpsa" : "what happens", #"que pasa",
    "ratchet" : "rude",
    "rbtl" : "read between the lines",
    "rlrt" : "real life retweet", 
    "rofl" : "rolling on the floor laughing",
    "roflol" : "rolling on the floor laughing out loud",
    "rotflmao" : "rolling on the floor laughing my ass off",
    "rt" : "retweet",
    "ruok" : "are you ok",
    "sfw" : "safe for work",
    "sk8" : "skate",
    "smh" : "shake my head",
    "sq" : "square",
    "srsly" : "seriously", 
    "ssdd" : "same stuff different day",
    "tbh" : "to be honest",
    "tbs" : "tablespooful",
    "tbsp" : "tablespooful",
    "tfw" : "that feeling when",
    "thks" : "thank you",
    "tho" : "though",
    "thx" : "thank you",
    "tia" : "thanks in advance",
    "til" : "today i learned",
    "tl;dr" : "too long i did not read",
    "tldr" : "too long i did not read",
    "tmb" : "tweet me back",
    "tntl" : "trying not to laugh",
    "ttyl" : "talk to you later",
    "u" : "you",
    "u2" : "you too",
    "u4e" : "yours for ever",
    "utc" : "coordinated universal time",
    "w/" : "with",
    "w/o" : "without",
    "w8" : "wait",
    "wassup" : "what is up",
    "wb" : "welcome back",
    "wtf" : "what the fuck",
    "wtg" : "way to go",
    "wtpa" : "where the party at",
    "wuf" : "where are you from",
    "wuzup" : "what is up",
    "wywh" : "wish you were here",
    "yd" : "yard",
    "ygtr" : "you got that right",
    "ynk" : "you never know",
    "zzz" : "sleeping bored and tired"
}

In [None]:
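def convert_abbrev(word):
    # Look up a single token; return it unchanged if it is not a known abbreviation
    return abbreviations.get(word.lower(), word)

In [None]:
def convert_abbrev_in_text(text):
    # convert_abbrev expects one word, so split the tweet and convert it word by word
    return ' '.join(convert_abbrev(word) for word in text.split())

df_train['text']=df_train.text.apply(convert_abbrev_in_text)
df_test['text']=df_test.text.apply(convert_abbrev_in_text)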

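Let's check the number of times elongated spellings such as cooool occur, compared with cool.

In [None]:
text = df_train['text']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(text)
# vocabulary_ maps each term to a column index, so sum that column to get the term's total count
for word in ['cooool', 'cool']:
    idx = vectorizer.vocabulary_.get(word)
    print(word, 0 if idx is None else counts[:, idx].sum())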


<a id="subsection-four-two"></a>
# Stemming

We will use NLTK for stemming, since spaCy relies on lemmatization only and does not provide a stemmer.
NLTK ships several stemmers, including the Porter and Snowball stemmers.
The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it, so we will use that.

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Stemming applies a fixed sequence of rule-based steps to each word, which makes it faster.
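
To make the difference concrete, here is a toy sketch using only the standard library. `toy_stem` is a hypothetical, deliberately crude suffix stripper (it is not the Snowball algorithm), and `toy_lemmas` is a tiny hand-written table standing in for a real dictionary lookup.

```python
def toy_stem(word):
    # Strip the first matching suffix, as long as at least 3 characters remain
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lemmas; a real lemmatizer derives these from vocabulary and grammar
toy_lemmas = {"running": "run", "studies": "study", "evacuated": "evacuate"}

for w in ["running", "studies", "evacuated"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmas[w])
```

The stems come out as fragments that are not English words, while the lemmas are dictionary forms.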

In [None]:
stemmer = SnowballStemmer(language='english')

tokens = df_train['text'][1].split()

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

In [None]:
def stemWord(text):
    stemmer = SnowballStemmer(language='english')
    # Stem each whitespace-separated token and rejoin with single spaces
    return ' '.join(stemmer.stem(token) for token in text.split())

print("Text before stemWord function: " + df_train['text'][1])
print("Text after stemWord function: " + stemWord(df_train['text'][1]))

In [None]:
df_train['text']=df_train.text.apply(stemWord)
df_test['text']=df_test.text.apply(stemWord)

In [None]:
df_train.text

for txt in df_train.text[:40]:
    print(txt)
                        

<a id="subsection-four-three"></a>
# Lemmatization

Although spaCy cannot perform stemming, it can perform lemmatization.
This is a time-consuming process.

Unlike stemming, lemmatization outputs an actual English word.
(In spaCy, word.lemma_ holds the word's lemma.)

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
#for token in doc:
   # print(token.lemma_)
for noun in doc.noun_chunks:
    print(noun.text)

In [None]:
for word in doc:
    print(word.text,  word.lemma_)

In [None]:
def lemmatizeWord(text):
    tokens=nlp(text)
    # Replace each token with its lemma and rejoin with single spaces
    return ' '.join(token.lemma_ for token in tokens)

print("Text before lemmatizeWord function: " + df_train['text'][1])
print("Text after lemmatizeWord function: " + lemmatizeWord(df_train['text'][1]))

doc = "Apple is looking at buying U.K. startup for $1 billion"
lemmatizeWord(doc)

lemmatizeWord converts each word into its lemma. (This will take a while to run.)

In [None]:
df_train['text']=df_train.text.apply(lemmatizeWord)
df_test['text']=df_test.text.apply(lemmatizeWord)

In [None]:
df_train['text']

# Data Augmentation

Data augmentation refers to techniques that increase the amount of data by adding slightly modified copies of existing data, or synthetic data newly created from it.

Let's generate text with an RNN

In [None]:
# Join all tweets into one newline-separated string for the character-level model
text = "\n".join(df_train.text)


In [None]:
print(f'Length of text: {len(text)} characters')

In [None]:
# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

In [None]:
vocab

In [None]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

In [None]:
ids_from_chars = preprocessing.StringLookup(
    vocabulary=list(vocab))

In [None]:
ids = ids_from_chars(chars)
ids

In [None]:
chars_from_ids = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True)

In [None]:
chars2 = chars_from_ids(ids)
chars2

In [None]:
tf.strings.reduce_join(chars, axis=-1).numpy()

In [None]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

In [None]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids


In [None]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [None]:
for ids in ids_dataset.take(10):
    #print(chars_from_ids(ids).numpy().decode('utf-8'))
    print(chars_from_ids(ids).numpy())

In [None]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

In [None]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq).numpy())

In [None]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

In [None]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

In [None]:
split_input_target(list("Earthquake"))

In [None]:
dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

In [None]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset
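
The buffer comment above can be illustrated with a simplified pure-Python model. This is only a sketch of the idea and not TensorFlow's implementation; the function name `buffered_shuffle` and its `seed` parameter are assumptions made for the example.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    # Hold at most `buffer_size` items in memory; once the buffer is full,
    # emit a randomly chosen item and put the next stream item in its slot.
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
print(list(buffered_shuffle(range(5), buffer_size=1)))  # a buffer of 1 cannot reorder anything
```

An element can only move as far as the buffer allows, which is why the buffer size should be large relative to the dataset if a thorough shuffle is wanted.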

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [None]:
model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

In [None]:
model.summary()

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [None]:
sampled_indices

In [None]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [None]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

In [None]:
tf.exp(mean_loss).numpy()

In [None]:
model.compile(optimizer='adam', loss=loss)

In [None]:
## Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
EPOCHS = 40

In [None]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

In [None]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "" or "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['', '[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "" or "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
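
The effect of dividing the logits by `temperature` can be seen in a small standard-library sketch. The logits below are made up, and `softmax_with_temperature` is a hypothetical helper: it only shows the probabilities that sampling would draw from, not the sampling itself.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Scale the logits, then apply a numerically stable softmax
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top-scoring character (more predictable text), while a temperature above 1 flattens the distribution (more surprising text).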

In [None]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

In [None]:
start = time.time()
states = None
next_char = tf.constant(['ablaze'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

TO DO

<a id="section-five"></a>
# Transforming tokens to a vector

<a id="subsection-five-one"></a>
**Bag of Words model**

A bag-of-words (B.o.w) is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.
* A measure of the presence of known words.

It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
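
The two ingredients can be sketched by hand with the standard library. The three documents are made up, and `toy_vocab` and `bag_of_words` are hypothetical names; `CountVectorizer` does the same job with extra tokenization rules.

```python
from collections import Counter

toy_docs = ["forest fire near town", "fire fire everywhere", "nice sunny town"]

# Vocabulary: every distinct word, sorted so that column positions are stable
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

def bag_of_words(doc):
    # One count per vocabulary word; word order in the document plays no role
    counts = Counter(doc.split())
    return [counts[word] for word in toy_vocab]

print(toy_vocab)
for doc in toy_docs:
    print(bag_of_words(doc), "<-", doc)
```

Two documents containing the same words in a different order get identical vectors, which is exactly the information the "bag" discards.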

In [None]:
count_vectorizer = CountVectorizer()
train_bag = count_vectorizer.fit_transform(df_train['text'])
test_bag = count_vectorizer.transform(df_test["text"])

<a id="subsection-five-two"></a>
**TFIDF Features**

Another common representation is TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF is similar to bag of words, except that each term's count is scaled down according to how many documents in the corpus contain the term, so words that appear everywhere carry less weight. Using TF-IDF can potentially improve your models.

> Term Frequency: a score of how frequent the word is in the current document.

TF = (Number of times term t appears in a document)/(Number of terms in the document)

> Inverse Document Frequency: a score of how rare the word is across documents.

IDF = 1+log(N/n), where N is the number of documents and n is the number of documents in which term t appears.
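
The two formulas can be worked through on a made-up three-document corpus. This is a sketch of the formulas exactly as written here, using the natural logarithm; the helper names are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math

toy_corpus = [
    "fire in the forest".split(),
    "fire fire downtown".split(),
    "sunny day in the park".split(),
]

def term_frequency(term, doc):
    # Occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, corpus):
    # 1 + log(N / n); assumes the term occurs in at least one document
    n = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / n)

for term in ("fire", "downtown"):
    tf = term_frequency(term, toy_corpus[1])
    idf = inverse_document_frequency(term, toy_corpus)
    print(term, "tf:", round(tf, 3), "idf:", round(idf, 3), "tf-idf:", round(tf * idf, 3))
```

"downtown" appears in only one document, so its IDF is higher than that of "fire", which appears in two.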

In [None]:
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
train_tfidf = tfidf.fit_transform(df_train['text'])
test_tfidf = tfidf.transform(df_test["text"])

<a id="subsection-five-three"></a>
**Word Vectors/Word Embeddings**

A word embedding is a learned representation for text in which words with similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

TIP: Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings.

In [None]:
with nlp.disable_pipes():
    train_vectors = np.array([nlp(text).vector for text in df_train.text])
    test_vectors = np.array([nlp(text).vector for text in df_test.text])
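
By default, spaCy's document vector is the average of its token vectors. The sketch below mimics that with made-up 3-dimensional vectors (real spaCy vectors are much larger) and compares words by cosine similarity; every name and number in it is illustrative.

```python
import math

# Made-up word vectors, chosen so that "fire" and "blaze" point in similar directions
toy_vectors = {
    "fire":   [0.9, 0.1, 0.0],
    "blaze":  [0.8, 0.2, 0.1],
    "picnic": [0.0, 0.2, 0.9],
}

def doc_vector(words):
    # Average the word vectors component by component
    rows = [toy_vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(toy_vectors["fire"], toy_vectors["blaze"]))
print(cosine(toy_vectors["fire"], toy_vectors["picnic"]))
print(doc_vector(["fire", "blaze"]))
```

Words with related meanings score close to 1, unrelated ones close to 0, and the averaged vector gives every tweet a fixed-length representation regardless of how many words it has.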

<a id="section-six"></a>
# Building a Text Classification Model

Let's try out different classifiers on the word-embedding representation (train_vectors and test_vectors).

<a id="subsection-six-one"></a>
# 1. **Support Vector Machines**

In [None]:
# Set dual=False to speed up training, and it's not needed
svc_wordEmbed = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc_wordEmbed.fit(train_vectors, df_train.target)

Evaluate the F1 score using scikit-learn's model_selection.cross_val_score

In [None]:

scores = model_selection.cross_val_score(svc_wordEmbed, train_vectors, df_train["target"], cv=3, scoring="f1")
scores
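
What `cv=3` means can be sketched in plain Python: the data is cut into three folds, and each fold serves once as the validation set while the other two are used for training. The helper `k_fold_indices` is hypothetical and unstratified; as far as I know, scikit-learn stratifies the folds by class when the estimator is a classifier, which this sketch does not do.

```python
def k_fold_indices(n_samples, k):
    # Split range(n_samples) into k contiguous folds of near-equal size
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

for train_idx, val_idx in k_fold_indices(n_samples=7, k=3):
    print("train:", train_idx, "validate:", val_idx)
```

Each call to `cross_val_score` therefore fits the model three times and returns one F1 score per fold.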

We get decent F1 scores

<a id="subsection-six-two"></a>
# 2. **XGBoost**

Let's try XGBoost now on the word embeddings

In [None]:

xgb_wordEmbed = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)


In [None]:
scores = model_selection.cross_val_score(xgb_wordEmbed, train_vectors, df_train["target"], cv=3, scoring="f1")
scores

In [None]:
#clf_xgb_TFIDF = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
#                     subsample=0.8, nthread=10, learning_rate=0.1)
#scores = model_selection.cross_val_score(clf_xgb_TFIDF, train_tfidf, df_train["target"], cv=3, scoring="f1")
#scores

<a id="subsection-six-three"></a>
# 3. **Naive Bayes Classifier**

Let's try the Naive Bayes classifier on the bag-of-words model

In [None]:
# Fitting a simple Naive Bayes on Counts
clf_NB = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB, train_bag, df_train["target"], cv=3, scoring="f1")
scores

In [None]:
clf_NB.fit(train_bag, df_train["target"])

Let's try the Naive Bayes classifier on TF-IDF

In [None]:
# Fitting a simple Naive Bayes on TFIDF
clf_NB_TFIDF = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB_TFIDF, train_tfidf, df_train["target"], cv=3, scoring="f1")
scores

In [None]:
clf_NB_TFIDF.fit(train_tfidf, df_train["target"])

In [None]:
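# Fitting Naive Bayes on word embeddings.
# MultinomialNB rejects negative feature values, which word vectors contain,
# so GaussianNB is swapped in here instead.
from sklearn.naive_bayes import GaussianNB
clf_NB_wEmbed = GaussianNB()
scores = model_selection.cross_val_score(clf_NB_wEmbed, train_vectors, df_train["target"], cv=3, scoring="f1")
scores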

<a id="subsection-six-four"></a>
# 4. **Logistic Regression Classifier**

**Bag of words model**

In [None]:
# Fitting a simple Logistic Regression on Counts
clf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(clf, train_bag, df_train["target"], cv=3, scoring="f1")
scores

In [None]:
clf.fit(train_bag, df_train["target"])

**TF-IDF**

In [None]:
# Fitting a simple Logistic Regression on TFIDF
clf_tfidf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(clf_tfidf, train_tfidf, df_train["target"], cv=3, scoring="f1")
scores

In [None]:
clf_tfidf.fit(train_tfidf, df_train["target"])

<a id="section-six-five"></a>
#  **5. Neural Network**

What should be the input_shape to the neural network?


In [None]:
train_vectors.shape
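
Keras layers expect `input_shape` to describe a single sample, so the batch (row) dimension is left out. A minimal sketch of that rule, assuming `train_vectors` is a 2-D array of shape `(n_samples, n_features)`; the helper name `per_sample_shape` and the example shape are hypothetical:

```python
def per_sample_shape(array_shape):
    """Drop the leading batch dimension from an array shape."""
    return tuple(array_shape[1:])

# Hypothetical shape: 7613 rows, each a 300-dimensional vector.
example_shape = (7613, 300)
input_shape = per_sample_shape(example_shape)
print(input_shape)  # (300,)
```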

Let's define recall, precision, and F1 score as Keras metrics.

In [None]:


def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
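
To make the arithmetic concrete, here is a plain-Python sketch of the same three quantities on a small hypothetical batch. The helper name `f1_from_labels` and the sample values are assumptions for illustration. Keras evaluates a metric function like `f1_m` batch by batch and averages the results, so its reported value can differ from an F1 score computed once over the whole dataset.

```python
def f1_from_labels(y_true, y_prob, threshold=0.5):
    """Compute precision, recall and F1 for binary labels (hypothetical helper)."""
    # Turn probabilities into hard 0/1 predictions.
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_positives = sum(y_pred)
    possible_positives = sum(y_true)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / possible_positives if possible_positives else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
precision, recall, f1 = f1_from_labels(y_true, y_prob)
print(precision, recall, f1)
```

In this batch two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall, and F1 all come out to 2/3.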

Let's set up early stopping and learning rate reduction.

In [None]:


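# val_f1_m should be maximized, so set mode='max' explicitly;
# the default 'auto' mode would treat it as a quantity to minimize
learning_rate_reduction = ReduceLROnPlateau(monitor='val_f1_m', 
                                            mode='max',
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)


# Monitors val_loss by default
early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=5, # how many epochs to wait before stopping
    restore_best_weights=True,
)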
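
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified model of the idea, not Keras's actual implementation; the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.001):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    stop_epoch is None if the patience limit is never reached.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, one per epoch.
val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.452, 0.48, 0.49, 0.50]
print(early_stopping_epoch(val_losses))  # (7, 2)
```

The loss stops improving after epoch 2, so five epochs later (epoch 7) training would stop, and with `restore_best_weights=True` the weights from epoch 2 would be restored.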

In [None]:
#  Neural Network
nn = keras.Sequential([
    # input_shape is the per-sample shape (no batch dimension); assumes 300-dimensional vectors
    layers.Dense(256, activation='relu', input_shape=[300]),
    layers.Dropout(0.4),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(1,activation='sigmoid')
])

nn.compile(loss='binary_crossentropy',optimizer='adam',metrics=[f1_m])
history=nn.fit(
    train_vectors,df_train["target"],
    validation_split=0.1,
    batch_size=128,
    epochs=25,
    callbacks=[early_stopping,learning_rate_reduction])

In [None]:
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['f1_m','val_f1_m']].plot()
history_frame.loc[:, ['loss','val_loss']].plot();

In [None]:
pred = nn.predict(test_vectors)

pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0

In [None]:
#sample_submission = pd.read_csv(submission_file_path)
#sample_submission["target"] = pred.ravel().astype('int64')
#sample_submission.to_csv("submission.csv", index=False)

<a id="section-seven"></a>
# BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language model introduced in a 2018 paper by researchers at Google AI Language. It caused a stir in the machine learning community by presenting state-of-the-art results on a wide variety of NLP tasks, including question answering (SQuAD v1.1), natural language inference (MNLI), and others.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

Run this section with a GPU.

In [None]:
# Get the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
import tokenization

Helper functions for BERT

In [None]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
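
To see what this encoding produces, here is a small self-contained sketch that applies the same steps to one text. `ToyTokenizer` is a hypothetical stand-in for the real BERT tokenizer (whitespace splitting and a made-up vocabulary), and `encode_one` mirrors the loop body of `bert_encode` with plain lists instead of NumPy arrays.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the BERT tokenizer: whitespace split, tiny vocabulary."""

    def __init__(self):
        # Made-up ids for illustration only; the real vocabulary uses different ids.
        self.vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
                      "forest": 4, "fire": 5, "near": 6, "town": 7}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(tok, self.vocab["[UNK]"]) for tok in tokens]


def encode_one(text, tokenizer, max_len):
    """Encode a single text with the same steps as the loop body of bert_encode."""
    tokens = tokenizer.tokenize(text)[:max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    token_ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len  # 1 = real token, 0 = padding
    segment_ids = [0] * max_len                 # single-sentence input: all zeros
    return token_ids, mask, segment_ids


token_ids, mask, segment_ids = encode_one("Forest fire near town", ToyTokenizer(), max_len=8)
print(token_ids)    # [2, 4, 5, 6, 7, 3, 0, 0]
print(mask)         # [1, 1, 1, 1, 1, 1, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Every output list has length `max_len`: longer texts are truncated to `max_len - 2` tokens before the special tokens are added, and shorter ones are padded with zeros.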

We use the F1 score as the metric instead of accuracy here, because this competition is evaluated on the F1 score.

In [None]:
def build_model(bert_layer, max_len = 128, lr = 1e-5):
    input_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32,name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32,name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32,name="segment_ids")
        
    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Classify directly from the pooled output with a single sigmoid unit
    out = Dense(1,activation="sigmoid")(pooled_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    adam = tf.keras.optimizers.Adam(lr)
    model.compile(optimizer=adam, loss='binary_crossentropy', metrics=[f1_m])
        
    return model

In [None]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
train_input = bert_encode(df_train.text.values, tokenizer, max_len=160)
test_input = bert_encode(df_test.text.values, tokenizer, max_len=160)
train_labels = df_train.target.values

In [None]:
train_input

In [None]:
train_labels

In [None]:
model = build_model(bert_layer, max_len=160)
model.summary()

This will take a long time to run.

In [None]:
# Thanks to https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
# Train BERT model with my tuning
#checkpoint = ModelCheckpoint('model_BERT.h5', monitor='val_loss', save_best_only=True)
valid = 0.2
epochs_num = 3
batch_size_num = 16
train_history = model.fit(
    train_input, train_labels,
    validation_split = valid,
    epochs = epochs_num, # recommended 3-5 epochs
    #callbacks=[checkpoint],
    batch_size = batch_size_num
)
#model.save('model.h5')

In [None]:
history_frame = pd.DataFrame(train_history.history)
# The model was compiled with metrics=[f1_m], so the history holds f1_m, not accuracy
history_frame.loc[:, ['f1_m','val_f1_m']].plot()
history_frame.loc[:, ['loss','val_loss']].plot();

Validation loss increases while training loss keeps decreasing, which indicates overfitting.
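
One way to act on this observation is to look up the epoch with the lowest validation loss. The sketch below uses made-up numbers in the same dictionary layout as `train_history.history`; the helper name `best_epoch` is hypothetical.

```python
# Hypothetical values shaped like `train_history.history` (one entry per epoch).
history = {
    "loss": [0.45, 0.33, 0.24],
    "val_loss": [0.40, 0.42, 0.47],
}

def best_epoch(history, key="val_loss"):
    """Return the 0-based epoch index with the lowest value for `key`."""
    values = history[key]
    return min(range(len(values)), key=values.__getitem__)

epoch = best_epoch(history)
# Gap between validation and training loss at the last epoch.
gap = history["val_loss"][-1] - history["loss"][-1]
print(epoch)          # 0
print(round(gap, 2))  # 0.23
```

If the best epoch comes early, as in this made-up example, training for fewer epochs or enabling the commented-out `ModelCheckpoint` with `save_best_only=True` would keep the weights from before the model started to overfit.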

In [None]:
#model.load_weights('model.h5')
test_pred = model.predict(test_input)

In [None]:
test_pred

In [None]:
# submit
#submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
#submission['target'] = np.round(test_pred).astype('int')
#submission.to_csv('submission.csv', index=False)
#submission.groupby('target').count()

<a id="section-eight"></a>
# Compare Models

Let's write a function to get scores and compare the different models.

The models are defined below, copied from the sections above.

In [None]:
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
clf_NB = MultinomialNB()
clf = LogisticRegression(C=1.0)
clf_xgb = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)

The getScore function returns the cross-validated F1 scores of a given model on a given representation.

In [None]:
def getScore(model, vector):
    scores = model_selection.cross_val_score(model, vector, df_train["target"], cv=3, scoring="f1")
    return scores

#print(getScore(clf_xgb,train_vectors).mean())
    

In [None]:
train_vectors

In [None]:
clf_xgb

In [None]:
model_name = type(clf_xgb).__name__
model_name

The appendToModelReport function appends the model name, the representation, and the corresponding mean F1 score to a dataframe.

In [None]:
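def appendToModelReport(model_report, model,representation,vector):
    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
    new_row = pd.DataFrame([{"Model": type(model).__name__,
                             "Representation": representation,
                             "F1 Score": getScore(model, vector).mean()}])
    model_report = pd.concat([model_report, new_row], ignore_index=True)
    return model_report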

In [None]:
model_report = pd.DataFrame(columns=['Model','Representation','F1 Score'])
#model_report.append({"Model" : type(clf_xgb).__name__, "Representation":"Word Embedding", "F1 Score": getScore(clf_xgb,train_vectors).mean() },ignore_index = True)
#XGBoost model reports
model_report = appendToModelReport(model_report, clf_xgb, "Word Embedding", train_vectors)
model_report = appendToModelReport(model_report, clf_xgb, "Bag of Words", train_bag)
model_report = appendToModelReport(model_report, clf_xgb, "TF IDF", train_tfidf)

#Support Vector Machines
model_report = appendToModelReport(model_report, svc, "Word Embedding", train_vectors)
model_report = appendToModelReport(model_report, svc, "Bag of Words", train_bag)
model_report = appendToModelReport(model_report, svc, "TF IDF", train_tfidf)

#Naive Bayes Classifier
model_report = appendToModelReport(model_report, clf_NB, "Word Embedding", train_vectors)
model_report = appendToModelReport(model_report, clf_NB, "Bag of Words", train_bag)
model_report = appendToModelReport(model_report, clf_NB, "TF IDF", train_tfidf)

#Logistic Regression Classifier
model_report = appendToModelReport(model_report, clf, "Word Embedding", train_vectors)
model_report = appendToModelReport(model_report, clf, "Bag of Words", train_bag)
model_report = appendToModelReport(model_report, clf, "TF IDF", train_tfidf)

model_report

Find out the best model and representation

In [None]:
model_report[model_report['F1 Score'] == max(model_report["F1 Score"])]

<a id="section-nine"></a>
# Making the submission

In [None]:
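def submission(submission_file_path,model,test_vectors):
    sample_submission = pd.read_csv(submission_file_path)
    # Flatten the predictions and threshold at 0.5 so that probability outputs
    # (e.g. from the Keras model) become integer 0/1 labels
    preds = np.asarray(model.predict(test_vectors)).ravel()
    sample_submission["target"] = (preds > 0.5).astype("int64")
    sample_submission.to_csv("submission.csv", index=False)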
    

In [None]:
submission_file_path = "/kaggle/input/nlp-getting-started/sample_submission.csv"

model=nn
submission(submission_file_path,model,test_vectors)

[Back to Top](#section-zero)