<a href="https://colab.research.google.com/github/SaketMunda/introduction-to-nlp/blob/master/nlp_with_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with TensorFlow

The goal of NLP is to derive information from natural language, which may come as sequences of text or speech.

NLP problems are also commonly referred to as sequence-to-sequence (seq2seq) problems.

In [1]:
# We'll be experimenting with deep learning models, so check that a GPU is enabled
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-fe32b411-19d4-fd24-d6b0-6eb0c37e0f5b)


## Get Helper functions

In [2]:
# Download the helper_functions.py script from GitHub
!wget https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py

from helper_functions import unzip_data, create_tensorboard_callback

--2023-01-28 03:15:20--  https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2904 (2.8K) [text/plain]
Saving to: ‘helper_functions.py’


2023-01-28 03:15:21 (32.0 MB/s) - ‘helper_functions.py’ saved [2904/2904]



## Get a Text Dataset
The dataset we'll be using is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

See the original source here: https://www.kaggle.com/c/nlp-getting-started

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# unzip the data
unzip_data('nlp_getting_started.zip')

--2023-01-28 03:15:27--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.194.128, 74.125.68.128, 74.125.24.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.194.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-01-28 03:15:27 (122 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing Text Dataset

To visualize our text samples, we first have to read them in. We'll use pandas for that.

In [4]:
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# check the shapes
train_df.shape, test_df.shape

((7613, 5), (3263, 4))

In [None]:
# view some samples
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


Here, `text` is the tweet and `target` indicates whether the tweet is about a disaster: `1` means disaster and `0` means not a disaster.

Let's visualize some random training samples. Before that, it's good practice to shuffle the training samples.

In [5]:
train_df_shuffled = train_df.sample(frac=1, random_state=17)
# frac=1 returns 100% of the rows, in random order
train_df_shuffled

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7027 | 10072 | typhoon | | Typhoon Soudelor: When will it hit Taiwan ÛÒ ... | 1 |
| 318 | 463 | armageddon | | RT @RTRRTcoach: #Love #TrueLove #romance lith ... | 0 |
| 1681 | 2425 | collide | www.youtube.com?Malkavius2 | I liked a @YouTube video from @gassymexican ht... | 0 |
| 5131 | 7318 | nuclear%20reactor | New York, New York | Japan's Restart of Nuclear Reactor Fleet Fast ... | 1 |
| 2967 | 4262 | drowning | Hendersonville, NC | #ICYMI #Annoucement from Al Jackson... http://... | 0 |
| ... | ... | ... | ... | ... | ... |
| 406 | 584 | arson | Jerusalem, Israel | Mourning notices for stabbing arson victims st... | 1 |
| 5510 | 7863 | quarantined | Livonia, MI | Reddit's new content policy goes into effect m... | 0 |
| 2191 | 3139 | debris | | Plane debris discovered on Reunion Island belo... | 1 |
| 7409 | 10600 | wounded | santo domingo | Police Officer Wounded Suspect Dead After Exch... | 1 |


In [None]:
# What does the test set look like?
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [6]:
# How many examples of each class are there?
train_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
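
The raw counts are easier to judge as proportions. Here is a small sketch that uses only the two numbers printed above; with the DataFrame loaded, `train_df['target'].value_counts(normalize=True)` should report the same fractions directly.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

for label, share in proportions.items():
    name = "disaster" if label == 1 else "not a disaster"
    print(f"Target {label} ({name}): {share:.1%}")
```

That works out to roughly 57% not-disaster and 43% disaster tweets, so the classes are somewhat uneven but not heavily imbalanced.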

In [7]:
# Let's visualize some random samples

import random
random_index = random.randint(0, len(train_df)-5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(disaster)" if target > 0 else "(not a disaster)")
  print(f"Text:\n{text}\n")
  print("----------\n")

Target: 0 (not a disaster)
Text:
i blaze jays fuck the dutch slave trade.

----------

Target: 0 (not a disaster)
Text:
Aw man. 'Apollo Crews' just screams 'we can't think of a name for this black guy quick name some and mash them together'

----------

Target: 1 (disaster)
Text:
Refugio oil spill may have been costlier bigger than projected http://t.co/7L6bHeXIXv | https://t.co/eMOSrMUvQa

----------

Target: 0 (not a disaster)
Text:
New crime: knowing your rights. Punishable by death

----------

Target: 0 (not a disaster)
Text:
Evacuation drill at work. The fire doors wouldn't open so i got to smash the emergency release glass #feelingmanly

----------



## Split dataset into Train and Validation sets

Since the test set doesn't contain the target variable, we need some other unseen data to validate the model after training. To get it, we'll hold out a portion of the training set (10% here) as a validation set.


In [8]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=17)
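
To make the idea of a hold-out split concrete, here is a minimal NumPy sketch of what a 90/10 split does conceptually. It is an illustration only: it does not reproduce the exact rows that `train_test_split` selects, and the variable names are made up for this example.

```python
import math
import numpy as np

n_samples = 7613      # number of rows in train.csv, from the shape printed earlier
val_fraction = 0.1    # same fraction as test_size in the cell above

rng = np.random.default_rng(17)             # seeded so the split is repeatable
shuffled_idx = rng.permutation(n_samples)   # row positions in random order

# Take the first 10% of the shuffled positions for validation, the rest for training
n_val = math.ceil(val_fraction * n_samples)
val_idx, train_idx = shuffled_idx[:n_val], shuffled_idx[n_val:]

print(len(train_idx), len(val_idx))

# The two index sets must not overlap, otherwise validation data leaks into training
assert set(train_idx).isdisjoint(val_idx)
```

The key property is that every row lands in exactly one of the two sets, so the validation tweets stay unseen during training.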

## Converting Text into Numbers

Our labels are in numerical form (0 and 1) but our tweets are in string form.

Machine learning algorithms only work with numbers, so we have to convert the tweets into numbers.

In NLP, there are two main concepts for turning text into numbers:
- **Tokenization**: A direct mapping from a **word** (*word-level tokenization*), a character (*character-level tokenization*) or a sub-word (*sub-word tokenization*) to a numerical value. For example, with word-level tokenization of the sentence "My name is Alpha", "My" could map to `0`, "name" to `1`, "is" to `2` and "Alpha" to `3`. These integer indices can then be one-hot encoded if needed.
- **Embeddings**: An embedding is a representation of natural language that can be learned. The representation comes in the form of a **feature vector**. For example, the word "Alpha" could be represented by the 5-dimensional vector `[0.564, 0.897, 0.456, -0.987, 0.15]`. The size of the feature vector is tunable. There are two ways to use embeddings:

    - **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as `tf.keras.layers.Embedding`) and an embedding representation will be learned during model training.
    - **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpuses of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
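
To make the tokenization levels concrete, here is a minimal pure-Python sketch of word-level and character-level mapping for the example sentence. The `build_index` helper is hypothetical and written only for this illustration; it numbers tokens in order of first appearance, which is simpler than what a real tokenizer does.

```python
def build_index(tokens):
    """Assign each distinct token the next free integer, in order of first appearance."""
    index = {}
    for token in tokens:
        if token not in index:
            index[token] = len(index)
    return index

sentence = "My name is Alpha"

# Word-level tokenization: split on whitespace, one integer per word
word_tokens = sentence.split()
word_index = build_index(word_tokens)
print(word_index)                             # {'My': 0, 'name': 1, 'is': 2, 'Alpha': 3}
print([word_index[t] for t in word_tokens])   # [0, 1, 2, 3]

# Character-level tokenization: every character (spaces included) is a token
char_tokens = list(sentence)
char_index = build_index(char_tokens)
print([char_index[c] for c in char_tokens])
```

Character-level tokenization yields a much smaller vocabulary but longer sequences; word-level tokenization does the opposite.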


In short:

**Tokenization** : Straight mapping from word to number.

**Embedding** : Richer representation of relationships between tokens.

Which level of tokenization or embedding to use depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which performs best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using [tf.keras.layers.concatenate](https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate)).

If you're looking for pre-trained word embeddings, [Word2vec embeddings](https://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on TensorFlow Hub are great places to start.

Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".

### Text Vectorization

Mapping words to numbers.

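To tokenize our words, we'll use the `TextVectorization` preprocessing layer, available as `tf.keras.layers.TextVectorization` (older TensorFlow releases expose it under `tf.keras.layers.experimental.preprocessing`).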

In [11]:
import tensorflow as tf
# In TensorFlow versions before 2.6, import from tensorflow.keras.layers.experimental.preprocessing instead
from tensorflow.keras.layers import TextVectorization

# Using the default TextVectorization variables
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process the text
                                    split="whitespace", # how to split the text
                                    ngrams=None, # create groups of n-words
                                    output_mode='int', # how to map tokens to numbers
                                    output_sequence_length=None) # how long should the output sequence of tokens be?

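About the parameters above:

- `max_tokens`: The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text). This count includes the slots reserved for padding and OOV (out-of-vocabulary) tokens.
- `standardize`: How to standardize the text before splitting. The default lowercases it and strips punctuation.
- `split`: How to split the text into tokens. The default splits on whitespace.
- `ngrams`: Whether to also create tokens from groups of consecutive words. For example, `ngrams=2` adds continuous two-word sequences on top of the single words.
- `output_mode`: How to output tokens. Can be `int` (integer mapping), `multi_hot` (named `binary` in older releases; one 0/1 vector per sample marking which tokens are present), `count` or `tf_idf` (named `tf-idf` in older releases).
- `output_sequence_length`: Length of the tokenized sequence to output. For example, if set to 150, every sequence is padded or truncated to exactly 150 tokens.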
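
To get a feel for what `standardize`, `split` and `ngrams` do to a sentence, here is a rough pure-Python approximation. It is a sketch under stated assumptions, not the layer's actual implementation: the three helper functions are hypothetical, only ASCII punctuation is stripped, and the order in which the real layer emits n-grams may differ.

```python
import string

def standardize(text):
    """Approximate "lower_and_strip_punctuation": lowercase, then drop ASCII punctuation."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

def split_whitespace(text):
    """Approximate split="whitespace"."""
    return text.split()

def add_ngrams(tokens, n):
    """Return the tokens plus every run of 2..n consecutive tokens joined by a space."""
    result = list(tokens)
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result

tokens = split_whitespace(standardize("There's a flood in my village!"))
print(tokens)   # ['theres', 'a', 'flood', 'in', 'my', 'village']

# With n=2, two-word sequences such as 'a flood' are added after the single words
print(add_ngrams(tokens, 2))
```

Note how the apostrophe and the exclamation mark disappear, which is why "There's" ends up as the single token `theres`.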

In the above cell, we have initialized the object with the default settings but let's customize it a little bit for our own use case.

In particular, let's set values for `max_tokens` and `output_sequence_length`.

For `max_tokens` (the number of words in the vocabulary), multiples of 10,000 (`10,000`, `20,000`, `30,000`) or the exact number of unique words in your text (e.g. `32,179`) are common values.

For our use case, we'll use `10,000`.

And for the `output_sequence_length` we'll use the average number of tokens per Tweet in the training set. But first, we'll need to find it.

In [14]:
# Find average number of tokens (words) in training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [16]:
# Setup text vectorization with custom variables
max_vocab_length = 10000
output_sequence_length = 15

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=output_sequence_length)

To map our `TextVectorization` instance `text_vectorizer` to our data, we can call the `adapt()` method on it whilst passing it our training set.

In [18]:
# fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

Training data mapped! Let's try our `text_vectorizer` on a custom sentence.

In [19]:
# create a sample sentence
sample_sentence = "There's a flood in my village!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[281,   3, 214,   4,  13, 881,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
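
The trailing zeros in the output come from `output_sequence_length=15`: shorter sequences are padded with `0`, and longer ones are cut off. A minimal sketch of that behaviour in plain Python, where `pad_or_truncate` is a hypothetical helper written for this illustration:

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return exactly `length` ids: cut off extras or right-pad with `pad_value`."""
    trimmed = list(token_ids[:length])
    return trimmed + [pad_value] * (length - len(trimmed))

# Non-zero ids taken from the tensor printed above
flood_ids = [281, 3, 214, 4, 13, 881]
print(pad_or_truncate(flood_ids, 15))
# [281, 3, 214, 4, 13, 881, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A sequence longer than the target length loses its trailing tokens
print(pad_or_truncate(list(range(1, 21)), 15))
```

Since 15 is only the average tweet length, any tweet with more than 15 tokens is truncated, so a larger value trades extra padding for less lost text.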

Let's try our `text_vectorizer` on a random sentence from the training set.

In [21]:
random_sentence = random.choice(train_sentences)
print(f'Original Text: \n{random_sentence}\
        \n\n Vectorized Text:')
text_vectorizer([random_sentence])

Original Text: 
#Newswatch: 2 vehicles collided at Lock and Lansdowne Sts in #Ptbo. Emerg crews on their way        

 Vectorized Text:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[5037,   71, 2225,  330,   17, 3681,    7, 5240, 4523,    4, 4839,
        5738,  684,   11,  116]])>

We can also check the unique tokens in our vocabulary using the `get_vocabulary()` method.

In [23]:
# get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens

print(f"Number of words in Vocab:{len(words_in_vocab)}")
print(f"Top 5 most common words:{top_5_words}")
print(f"Bottom 5 least common words:{bottom_5_words}")

Number of words in Vocab:10000
Top 5 most common words:['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words:['ovo', 'overåÊhostages', 'overzero', 'overwatch', 'overturns']
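
The output shows two reserved entries (`''` for padding at index 0 and `'[UNK]'` at index 1) followed by words ordered from most to least frequent. Here is a simplified sketch of how such a vocabulary could be built and used. The functions `build_vocabulary` and `encode` and the toy corpus are hypothetical, and the real layer's handling of ties and standardization is not reproduced here.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens):
    """Frequency-ordered vocabulary with '' (padding) at 0 and '[UNK]' at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Two slots are reserved, so only max_tokens - 2 real words fit
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(sentence, vocabulary):
    """Map each word to its vocabulary index; unseen words map to 1 ('[UNK]')."""
    lookup = {word: i for i, word in enumerate(vocabulary)}
    return [lookup.get(word, 1) for word in sentence.lower().split()]

# Hypothetical toy corpus, only for illustration
toy_corpus = ["the flood hit the town", "the fire spread", "a flood warning"]
vocab = build_vocabulary(toy_corpus, max_tokens=5)
print(vocab)

# "reached" and "coast" are not in the vocabulary, so both become 1
print(encode("the flood reached the coast", vocab))   # [2, 3, 1, 2, 1]
```

Capping the vocabulary this way means rare words all collapse into the single `[UNK]` index, which keeps the model's input space small.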


### Embedding

**Create an Embedding using Embedding Layer**

We've got a way to map our text to numbers. How about we go a step further and turn those numbers into an embedding?

The powerful thing about an embedding is that it can be learned during training. This means that rather than being static, a word's numeric representation can improve as the model goes through data samples.

We can see what an embedding of a word looks like by using the `tf.keras.layers.Embedding` layer.



In [26]:
tf.random.set_seed(17)
from tensorflow.keras import layers

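embedding = layers.Embedding(input_dim=max_vocab_length, # size of the input vocabulary
                             output_dim=28, # size of each token's embedding vector
                             embeddings_initializer="uniform", # how the embedding weights start out
                             input_length=output_sequence_length, # how many tokens each input sequence has
                             name="embedding_1")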
embedding

<keras.layers.core.embedding.Embedding at 0x7f94ea279d00>
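
Conceptually, an embedding layer is a lookup table: a matrix with one row per token id, where each row is that token's vector. The NumPy sketch below mimics the lookup with the same sizes as the layer above. The random matrix is only a stand-in for the layer's weights and its value range is an arbitrary assumption; a real layer would update these values during training.

```python
import numpy as np

vocab_size = 10000      # matches max_vocab_length above
embedding_dim = 28      # matches output_dim above
sequence_length = 15    # matches output_sequence_length above

rng = np.random.default_rng(17)
# Stand-in for the layer's weights: one row of `embedding_dim` numbers per token id
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# Token ids from the vectorized "There's a flood in my village!" example
token_ids = np.array([[281, 3, 214, 4, 13, 881, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Looking up an embedding is just row indexing into the matrix
embedded = embedding_matrix[token_ids]
print(embedded.shape)   # (1, 15, 28)

# Every padding position (id 0) receives the same vector
print(np.array_equal(embedded[0, 6], embedded[0, 7]))   # True
```

Each of the 15 token positions is replaced by a 28-number vector, so a batch of one tweet turns from shape `(1, 15)` into `(1, 15, 28)`.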