# Natural Language Processing

Natural Language Processing (NLP) is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human language.

NLP is used in businesses throughout the world: chatbots, sentiment analysis, word embeddings, text classification and many other applications have found their way into industry. But how does a machine view text?

Machine learning algorithms optimize loss functions over numbers, so text has to be converted to numbers in a way that preserves enough information for a model to extract meaning from it.

# Text Preprocessing using TensorFlow

Let's create some simple sentences:

In [4]:
texts = [
    "My dog is a very good boy!",
    "I do not think my cat likes me, therefore, I don't know if I like her"
]

As mentioned in the introduction, the first thing to do is to convert the given texts to numbers.

## Text index and sequences

The Tokenizer class from TensorFlow does all the heavy lifting for us. The method **fit_on_texts()** lowercases the text, strips punctuation (apostrophes are kept by default, which is why "don't" survives as a single token) and builds the word index dictionary, in which each unique word is assigned a unique integer.

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Initiating the class
# The oov_token is used to encode words that were not seen in the training set (out of vocabulary)
tokenizer = Tokenizer(oov_token='<OOV>')

# Fitting on our texts
tokenizer.fit_on_texts(texts)

# Printing out the word index
print(tokenizer.word_index)

{'<OOV>': 1, 'i': 2, 'my': 3, 'dog': 4, 'is': 5, 'a': 6, 'very': 7, 'good': 8, 'boy': 9, 'do': 10, 'not': 11, 'think': 12, 'cat': 13, 'likes': 14, 'me': 15, 'therefore': 16, "don't": 17, 'know': 18, 'if': 19, 'like': 20, 'her': 21}
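
To make the fitting step less of a black box, here is a minimal plain-Python sketch of the same idea. It is not the library's implementation: `tokenize` and `build_word_index` are hypothetical helper names, and the filter string is assumed to match the Tokenizer's default filters. On our two sentences it yields the same mapping as the output above.

```python
from collections import Counter

# Assumed to match the default filter characters of the Keras Tokenizer;
# note that the apostrophe is not among them
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text):
    """Lowercase the text, turn filtered characters into spaces and split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: map each word to an integer, most frequent words first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so words with equal counts keep their order of first appearance
    by_frequency = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    vocabulary = [oov_token] + [word for word, _ in by_frequency]
    # Indices start at 1; 0 is left free for padding
    return {word: index for index, word in enumerate(vocabulary, start=1)}


texts = [
    "My dog is a very good boy!",
    "I do not think my cat likes me, therefore, I don't know if I like her",
]
word_index = build_word_index(texts)
print(word_index)
```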


The word index is sorted by word frequency, so the most frequent words get the lowest integers; index 1 is reserved for the OOV token, and words with equal counts keep their order of first appearance. We can view the word counts using the tokenizer.word_counts attribute:

In [20]:
print(tokenizer.word_counts)

OrderedDict([('my', 2), ('dog', 1), ('is', 1), ('a', 1), ('very', 1), ('good', 1), ('boy', 1), ('i', 3), ('do', 1), ('not', 1), ('think', 1), ('cat', 1), ('likes', 1), ('me', 1), ('therefore', 1), ("don't", 1), ('know', 1), ('if', 1), ('like', 1), ('her', 1)])


We now have integers representing words. The next step is to convert the sentences to sequences. 

In [26]:
# Reverse lookup dictionary to decode integers back into words
index_word = dict((v,k) for k,v in tokenizer.word_index.items())

print(index_word)

{1: '<OOV>', 2: 'i', 3: 'my', 4: 'dog', 5: 'is', 6: 'a', 7: 'very', 8: 'good', 9: 'boy', 10: 'do', 11: 'not', 12: 'think', 13: 'cat', 14: 'likes', 15: 'me', 16: 'therefore', 17: "don't", 18: 'know', 19: 'if', 20: 'like', 21: 'her'}


In [22]:
sequences = tokenizer.texts_to_sequences(texts)

print(sequences)

[[3, 4, 5, 6, 7, 8, 9], [2, 10, 11, 12, 3, 13, 14, 15, 16, 2, 17, 18, 19, 2, 20, 21]]


The two lists represent the two original sentences. Each integer represents a word from the word index. Let us see what happens when new words appear in a sentence and we use the already fitted tokenizer.

In [24]:
test_text = ['My cat and my dog are good buddies'] 

# Converting to sequence 
print(tokenizer.texts_to_sequences(test_text))

[[3, 13, 1, 3, 4, 1, 8, 1]]

In [27]:
# Converting the integers back to words
[index_word.get(x) for x in tokenizer.texts_to_sequences(test_text)[0]]

['my', 'cat', '<OOV>', 'my', 'dog', '<OOV>', 'good', '<OOV>']

Notice that the words 'and', 'are' and 'buddies' were not in our training corpus, so they are encoded as '<OOV>', the token we defined for words outside the training vocabulary.
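
The lookup itself is simple enough to sketch in plain Python. The snippet below is an illustration under assumptions, not the Tokenizer's code: `text_to_sequence` is a hypothetical helper, the filter string is assumed to match the Tokenizer's defaults, and the word index is copied from the output printed earlier.

```python
# Assumed to match the default filter characters of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Word index copied from the fitted tokenizer output shown earlier
word_index = {
    '<OOV>': 1, 'i': 2, 'my': 3, 'dog': 4, 'is': 5, 'a': 6, 'very': 7,
    'good': 8, 'boy': 9, 'do': 10, 'not': 11, 'think': 12, 'cat': 13,
    'likes': 14, 'me': 15, 'therefore': 16, "don't": 17, 'know': 18,
    'if': 19, 'like': 20, 'her': 21,
}


def text_to_sequence(text, word_index, oov_token="<OOV>"):
    """Hypothetical helper: encode one text as a list of integers."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    words = text.lower().translate(table).split()
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for word in words:
        index = word_index.get(word, oov_index)
        if index is not None:  # unknown words are dropped when there is no OOV token
            sequence.append(index)
    return sequence


with_oov = text_to_sequence("My cat and my dog are good buddies", word_index)
without_oov = text_to_sequence("My cat and my dog are good buddies", word_index, oov_token=None)
print(with_oov)
print(without_oov)
```

The second call shows why the OOV token is useful: without it, unseen words simply vanish from the sequence and the sentence gets shorter.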

## Padding 

Now that we have our training sentences transformed into integer sequences, we have the problem of them being of different lengths. Recall that for most machine learning algorithms, all the observations need to have the same number of columns, even if some features are missing.

In our example, the first sentence has 7 words and the second has 16. To create a matrix with 2 rows and 16 columns we use the **pad_sequences()** function from TensorFlow.

In [31]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Padding the created sentences 
padded = pad_sequences(sequences)

# Printing out the results
print(padded)

[[ 0  0  0  0  0  0  0  0  0  3  4  5  6  7  8  9]
 [ 2 10 11 12  3 13 14 15 16  2 17 18 19  2 20 21]]


By default, **pad_sequences()** finds the longest sequence and pads every other sequence to that length. As we can see from the results, zeroes are added to the front of the first sentence to make the matrix rectangular. The value 0 is safe to use as padding because the tokenizer never assigns index 0 to a word.

I personally prefer the zeroes at the back. This can be achieved with the **padding='post'** parameter.

In [32]:
print(pad_sequences(sequences, padding='post'))

[[ 3  4  5  6  7  8  9  0  0  0  0  0  0  0  0  0]
 [ 2 10 11 12  3 13 14 15 16  2 17 18 19  2 20 21]]
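
For intuition, padding can be sketched in a few lines of plain Python. The `pad` function below is a hypothetical stand-in that returns nested lists rather than a NumPy array; its parameter names mirror `maxlen`, `padding` and `truncating` from pad_sequences, and the defaults are assumed to match the library's ('pre' for both).

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: bring all sequences to the same length."""
    if maxlen is None:
        # Without an explicit length, use the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the back
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[3, 4, 5], [2, 10, 11, 12, 3]]
pre_padded = pad(sequences)
post_padded = pad(sequences, padding="post")
truncated = pad(sequences, maxlen=4)
print(pre_padded)
print(post_padded)
print(truncated)
```

The `maxlen` case is worth keeping in mind for real data: a single very long text would otherwise dictate the width of the whole matrix.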


# Example using Twitter data 

Let's wrap everything up with a real-world example: Twitter data about disasters. In this dataset there are two types of tweets: ones that are about disasters and ones that are not.

The data can be downloaded here: https://www.kaggle.com/c/nlp-getting-started

In [44]:
import pandas as pd 
import os

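# Building the path with os.path.join keeps it portable across operating systems
csv_path = os.path.join(os.getcwd(), 'NLP', 'twitter-disaster-tweets', 'tweets.csv')
tweets = pd.read_csv(csv_path)['text'].values.tolist()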

print(f'Total tweets: {len(tweets)}')

Total tweets: 7613


In [45]:
# Sample of some tweets
tweets[0:10]

['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
 'Forest fire near La Ronge Sask. Canada',
 "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
 '13,000 people receive #wildfires evacuation orders in California ',
 'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ',
 '#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires',
 '#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas',
 "I'm on top of the hill and I can see a fire in the woods...",
 "There's an emergency evacuation happening now in the building across the street",
 "I'm afraid that the tornado is coming to our area..."]

In [42]:
# Initiating the tokenizer
tokenizer = Tokenizer(oov_token='<OOV>')

# Fitting on our texts
tokenizer.fit_on_texts(tweets)

# Text to sequences
sequences = tokenizer.texts_to_sequences(tweets)

# Padding
tweet_matrix = pad_sequences(sequences, padding='post')

In [43]:
tweet_matrix

array([[ 124, 4507,   27, ...,    0,    0,    0],
       [ 186,   47,  227, ...,    0,    0,    0],
       [  43, 1702, 1859, ...,    0,    0,    0],
       ...,
       [2733, 2333,  678, ...,    0,    0,    0],
       [  79, 1115,   41, ...,    0,    0,    0],
       [   5,  203,   57, ...,    0,    0,    0]])

In [49]:
tweet_matrix.shape

(7613, 33)
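
The shape reads as (number of tweets, length of the longest tokenized tweet). A small plain-Python sketch on three of the sample tweets above shows where the second number comes from; the filter string is assumed to match the Tokenizer's defaults, and the variable names are only illustrative.

```python
# Assumed to match the default filter characters of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
TABLE = str.maketrans({ch: " " for ch in FILTERS})

sample_tweets = [
    "Forest fire near La Ronge Sask. Canada",
    "13,000 people receive #wildfires evacuation orders in California ",
    "I'm afraid that the tornado is coming to our area...",
]

# Tokenize each tweet and count its tokens
token_lists = [tweet.lower().translate(TABLE).split() for tweet in sample_tweets]
lengths = [len(tokens) for tokens in token_lists]

# A padded matrix of these tweets would have one row per tweet
# and as many columns as the longest tweet has tokens
shape = (len(sample_tweets), max(lengths))
print(lengths)
print(shape)
```

Note that '13,000' becomes two tokens ('13' and '000') because the comma is one of the filtered characters, so token counts can differ from what a human would count as words.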

In [52]:
print(f'Number of unique words in the vocabulary: {len(tokenizer.word_index)}')

Number of unique words in the vocabulary: 21719
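
With more than 21,000 distinct tokens, many of them appearing only once, a common follow-up is to keep only the most frequent words. The Tokenizer has a `num_words` argument for this; as far as I know it leaves word_index untouched and applies the cap when texts are converted to sequences. The sketch below only illustrates that idea in plain Python, with `cap_sequence` as a hypothetical helper and the assumption that every index at or above `num_words` is replaced by the OOV index.

```python
def cap_sequence(sequence, num_words, oov_index=1):
    """Hypothetical helper: replace every index >= num_words with the OOV index."""
    return [index if index < num_words else oov_index for index in sequence]


# First sequence from the two-sentence example earlier in this article
sequence = [3, 4, 5, 6, 7, 8, 9]
capped = cap_sequence(sequence, num_words=6)
print(capped)
```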
