<a href="https://colab.research.google.com/github/SilahicAmil/Intro-NLP/blob/main/Into_To_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP fundamentals in TensorFlow

The goal of NLP is to derive information from natural language (which could be sequences, text or speech).

Another common term for NLP problems is sequence-to-sequence (seq2seq) problems.

## Check for GPU


In [1]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-f0ce945a-f0d7-40c1-1c92-e8d614d29b29)


## Helper functions

In [2]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# Import a series of helper functions
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys


--2021-06-30 03:43:55--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-06-30 03:43:55 (63.5 MB/s) - ‘helper_functions.py’ saved [10246/10246]



## Get our text data set

The data we're going to use is Kaggle's introduction to NLP dataset: text samples of Tweets labelled as disaster or not disaster.

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

#Unzip data
unzip_data("nlp_getting_started.zip")

--2021-06-30 03:43:57--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.142.128, 74.125.20.128, 74.125.195.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.142.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-06-30 03:43:57 (148 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualize a text dataset

In [4]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
#Shuffle training data

train_df_shuffled = train_df.sample(frac=1,
                                    random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [6]:
# What does the test data look like?

test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
# How many examples of each class are there?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
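
The counts above point to a mild class imbalance. As a quick sanity check, the class proportions can be worked out by hand. The sketch below hard-codes the two counts from the output instead of reading the DataFrame, so the `class_counts` dictionary is only an illustrative stand-in.

```python
# Counts copied from the value_counts() output above (0 = not a disaster, 1 = disaster)
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"Class {label}: {share:.1%} of {total} training samples")
```

Roughly 57% of the Tweets are labelled 0 and 43% are labelled 1, so the classes are not far from balanced.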

In [8]:
len(train_df), len(test_df)

(7613, 3263)

In [9]:
# Visualize random training examples
import random
random_index = random.randint(0, len(train_df)-5) # Create a random start index (leaving room for 5 examples)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row

  print(f"Target: {target}", "(Real Disaster)" if target > 0 else "(Not a real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (Real Disaster)
Text:
#BBSNews latest 4 #Palestine &amp; #Israel -  Six Palestinians Kidnapped in West Bank Hebron Home Demolished http://t.co/gne1fW0XHE

---

Target: 0 (Not a real disaster)
Text:
@VictoriaGittins what do you take me for I'm not a mass murderer! Just the one...

---

Target: 1 (Real Disaster)
Text:
Trafford Centre film fans angry after Odeon cinema evacuated following false fire alarm http://t.co/pFMn63VnAm http://t.co/vKwqbOJFJc

---

Target: 0 (Not a real disaster)
Text:
Sweater Stretcher http://t.co/naTz5iPV1x http://t.co/leaEBy6cR2

---

Target: 1 (Real Disaster)
Text:
Baltimore City : I-95 NORTH AT MP 54.8 (FORT MCHENRY TUNNEL BORE 3: Collision: I-95 NORTH AT MP 54.8 (FORT MCHENRY TUNNEL BORE 3 Nort...

---



## Split data into training and validation sets

In [10]:
from sklearn.model_selection import train_test_split


In [11]:
# Use train test split to split data into training and validation sets 
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)

In [12]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
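
These lengths follow from `test_size=0.1`. As far as I understand it, scikit-learn rounds the validation count up and gives the remainder to training. Here is a small sketch of that arithmetic, with `n_samples` copied from the earlier output and the rounding rule stated as an assumption about `train_test_split` rather than a guarantee:

```python
import math

n_samples = 7613   # len(train_df) from the output above
test_size = 0.1    # fraction passed to train_test_split

# Assumption: a float test_size is rounded up, and training gets whatever is left
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```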

In [13]:
# Check the first 10 examples

train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

## Converting text into numbers

When dealing with a text problem, one of the first things you'll have to do before you can build a model is convert the text to numbers.

There are a few ways to do this:
* Tokenization - a direct mapping from each token (a word or character) to a number
* Embedding - a matrix with one feature vector per token (the vectors can be learned)
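
To make the difference between the two approaches in the list above concrete, here is a minimal pure-Python sketch. The example sentence, the `embedding_dim` of 4 and the random vectors are all assumptions chosen for illustration; in a real model the embedding values would be learned during training.

```python
import random

sentence = "forest fire near the forest"
tokens = sentence.split()

# Tokenization: give every distinct token an integer ID (0 is kept free for padding)
token_to_id = {}
for token in tokens:
    if token not in token_to_id:
        token_to_id[token] = len(token_to_id) + 1
token_ids = [token_to_id[token] for token in tokens]

# Embedding: one feature vector per ID; random here, learned in a real model
embedding_dim = 4  # hypothetical size chosen for readability
rng = random.Random(42)
embedding_matrix = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(token_to_id) + 1)  # +1 row for the padding ID 0
]
embedded = [embedding_matrix[token_id] for token_id in token_ids]

print(token_ids)                        # one integer per token
print(len(embedded), len(embedded[0]))  # one vector of length embedding_dim per token
```

Both occurrences of "forest" get the same ID and therefore the same vector, which is the point of a shared lookup table.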


## Text Vectorization (Tokenization)

In [14]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Use the default Text Vectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # How many words in the vocabulary (Auto adds <OOV>)
                                       standardize="lower_and_strip_punctuation",
                                       split="whitespace",
                                       ngrams=None, # Create groups of n_words,
                                       output_mode="int", # How to map tokens to numbers
                                       output_sequence_length=None, # How long do you want the sequence to be
                                       pad_to_max_tokens=True)

In [15]:
# Find the average number of tokens (words) in the training tweets

round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [16]:
# Set up text vectorization variables
max_vocab_length = 10000 # Max number of words to have in the vocabulary
max_length = 15 # Max length our sequences will be (the average number of tokens per Tweet)

# Update vectorizer
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [18]:
# fit the text vectorizer to the training text

text_vectorizer.adapt(train_sentences)

In [21]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
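
To see roughly what the layer did to the sample sentence, here is a pure-Python approximation of the same steps: lowercase, strip punctuation, split on whitespace, look up each token and pad with zeros up to `max_length`. The `vectorize` function is a hypothetical stand-in, not the TensorFlow implementation. The token IDs in `vocab_excerpt` are read off the tensor above, on the assumption that tokens map to IDs in order.

```python
import string

# Tiny vocabulary excerpt inferred from the tensor printed above (assumption: IDs map in token order)
vocab_excerpt = {"theres": 264, "a": 3, "flood": 232, "in": 4, "my": 13, "street": 698}
PAD_ID, UNK_ID = 0, 1
max_length = 15

def vectorize(sentence, vocab=vocab_excerpt, length=max_length):
    """Hypothetical stand-in for the layer: standardize, split, look up, pad or truncate."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [vocab.get(token, UNK_ID) for token in cleaned.split()]
    ids = ids[:length]                             # truncate long sentences
    return ids + [PAD_ID] * (length - len(ids))    # pad short ones with zeros

print(vectorize("There's a flood in my street!"))
```

Sentences longer than 15 tokens are cut off and shorter ones are padded, so every example ends up with the same shape.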

In [25]:
# Choose a random sentence from the training dataset and tokenize it

random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence}\
      \n\nVectorized version:")

text_vectorizer([random_sentence])

Original text:
 Those that I have sworn to defend have proven themselves to be friends of the House Hailstorm.      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 161,   16,    8,   24, 7853,    5, 3065,   24, 9516, 2656,    5,
          21,  819,    6,    2]])>

In [31]:
# Get the unique words in the vocab
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words= words_in_vocab[-5:]
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 Most common words: {top_5_words}")
print(f"5 Least common words: {bottom_5_words}")


Number of words in vocab: 10000
5 Most common words: ['', '[UNK]', 'the', 'a', 'in']
5 Least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
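
The output shows that index 0 is reserved for padding (`''`), index 1 for unknown tokens (`'[UNK]'`), and the remaining entries are ordered from most to least frequent. Here is a minimal sketch of how such a vocabulary could be built by hand. The `corpus` is toy data rather than the Tweets, and the order of equally frequent tokens in this sketch is not meant to mirror TensorFlow's.

```python
from collections import Counter

corpus = ["the fire in the forest", "the flood in town", "a fire"]  # toy data, not from the dataset
max_tokens = 5  # total vocabulary size, including the two reserved entries

counts = Counter(token for sentence in corpus for token in sentence.split())

# Indices 0 and 1 are reserved for padding and unknown tokens; the rest is sorted by frequency
reserved = ["", "[UNK]"]
most_common = [token for token, _ in counts.most_common(max_tokens - len(reserved))]
vocabulary = reserved + most_common

print(vocabulary)
```

Any token that falls outside the `max_tokens` budget (here "forest", "flood", "town" and "a") would be mapped to `[UNK]` at lookup time.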
