
# Natural language processing 

## <u>Tokenization</u>
Here we feed in sentences and assign each distinct word an integer index (a token).

In [None]:
#importing libraries 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
#define the list of sentences
sentences = [
             'I love my dog',
             'I love my cat'
]

In [None]:
#creating an instance of the tokenizer object
#num_words caps how many of the most frequent words are used when texts are converted to sequences
#(only words whose index is below num_words are kept)
tokenizer = Tokenizer(num_words = 100)

#fit_on_texts goes through every sentence, counts each word and ranks the words by frequency
tokenizer.fit_on_texts(sentences)

#word_index maps every word seen to its index, regardless of num_words
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


In [None]:
#the tokenizer lowercases the text and strips punctuation, so 'dog!' maps to the existing token 'dog' instead of creating a new one
#for example:
sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog!'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
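
To make the indexing rule concrete, here is a minimal pure-Python sketch of what fitting appears to do: lowercase the text, replace punctuation with spaces, count the words and number them from the most frequent down. The helper `build_word_index` is a hypothetical name, and the filter string is assumed to match the Keras default; this is a simplified re-implementation for illustration, not the library's own code.

```python
from collections import Counter

# Characters treated as separators (assumed to mirror the Keras default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: ' ' for ch in FILTERS})


def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter()
    for text in texts:
        # Lowercase, turn punctuation into spaces, then split on whitespace.
        counts.update(text.lower().translate(_TABLE).split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


word_index = build_word_index(['I love my dog', 'I love my cat', 'You love my dog!'])
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Index 0 is never assigned to a word, which leaves it free to be used as the padding value later on.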


## <u>Turning sentences into data</u>
A sentence is a sequence of tokens, so we can represent each sentence as the sequence of its words' indices.

In [None]:
#a sentence is a sequence of indexed tokens
#this list contains sentences with different numbers of words
sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
print()

#the texts_to_sequences method converts each sentence into the sequence of its token indices
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


In [None]:
#let's create a test sentence sequence
test_data = [
             'i really love my dog',
             'my dog loves me too'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3]]


Here, the words 'really', 'loves', 'me' and 'too' do not appear in the training sentences, so they have no tokens. Tokens exist only for words in the training corpus, and the tokenizer silently drops everything else: 'i really love my dog' becomes [4, 2, 1, 3] and 'my dog loves me too' shrinks to [1, 3].
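
The lookup behind this result can be sketched in a few lines of plain Python. The function `to_sequences` is a hypothetical stand-in written for illustration (it skips punctuation handling), and the `word_index` literal is copied from the output above. It also shows, under the same assumptions, how a `num_words` cap would discard the rarer words at this stage rather than during fitting.

```python
# Word index copied from the printed output above.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


def to_sequences(texts, word_index, num_words=None):
    """Replace each known word with its index and drop everything else."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word was never seen while fitting
            if num_words is not None and idx >= num_words:
                continue  # word is too rare to fall under the num_words cap
            seq.append(idx)
        sequences.append(seq)
    return sequences


test_data = ['i really love my dog', 'my dog loves me too']
print(to_sequences(test_data, word_index))
# [[4, 2, 1, 3], [1, 3]]

# With num_words=4 only indices 1 to 3 ('my', 'love', 'dog') survive.
print(to_sequences(['do you think my dog is amazing'], word_index, num_words=4))
# [[1, 3]]
```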

In [None]:
#so far we lose every word that is not in the training corpus, and the sequence becomes shorter than the sentence
#to overcome this, pass the oov_token parameter and
#set it to a string that will not occur as a real word anywhere in the training or test corpus
sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?'
]

#OOV stands for 'out of vocabulary'
tokenizer = Tokenizer(num_words= 100, oov_token= "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("The word index for training are:")
print(word_index)
print()

sequences = tokenizer.texts_to_sequences(sentences)
print("The word sequences for training set are:")
print(sequences)
print()

test_data = [
             'i really love my dog',
             'my dog loves me too'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("The word sequences for test set are: ")
print(test_seq)

The word index for training are:
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

The word sequences for training set are:
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

The word sequences for test set are: 
[[5, 1, 3, 2, 4], [2, 4, 1, 1, 1]]


Here, the length of each sequence is preserved, because every word in the test set that the tokenizer has not seen is replaced by the OOV token (index 1). We have still lost some meaning, though.
Because the OOV token replaces every unknown word, different sentences can also end up with the same sequence.
For example, "my dog loves me too" is [2, 4, 1, 1, 1]
and "my dog loves her too" is also [2, 4, 1, 1, 1]
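
The collision can be reproduced with a small pure-Python sketch. The helpers `encode`, `decode` and `oov_rate` are hypothetical names introduced here for illustration, and the `word_index` literal is copied from the output above. Decoding a sequence back to text shows what information is gone, and the OOV rate is one simple way to judge whether the vocabulary covers the test data well enough.

```python
# Word index copied from the printed output above.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
index_word = {idx: word for word, idx in word_index.items()}
OOV_INDEX = word_index['<OOV>']


def encode(text):
    """Map every word to its index, falling back to the OOV index."""
    return [word_index.get(word, OOV_INDEX) for word in text.lower().split()]


def decode(sequence):
    """Turn a sequence of indices back into words."""
    return ' '.join(index_word[idx] for idx in sequence)


def oov_rate(sequence):
    """Fraction of tokens in the sequence that are out of vocabulary."""
    return sequence.count(OOV_INDEX) / len(sequence)


first = encode('my dog loves me too')
second = encode('my dog loves her too')
print(first, second)    # [2, 4, 1, 1, 1] [2, 4, 1, 1, 1]
print(decode(first))    # my dog <OOV> <OOV> <OOV>
print(oov_rate(first))  # 0.6
```

A high OOV rate on the test set suggests that the training corpus is too small for the task, which is the case for the four sentences used here.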


The sequences still have different lengths. To create sequences of uniform length, we use padding.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
sentences = [
             'I love my dog',
             'I love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words= 100, oov_token= "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("The word index for training are:")
print(word_index)
print()

sequences = tokenizer.texts_to_sequences(sentences)
print("The word sequences for training set are:")
print(sequences)
print()

#by default, shorter sequences are padded with zeros at the front up to the length of the longest sequence
print("The padded sequence is:")
padded = pad_sequences(sequences)
print(padded)


The word index for training are:
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

The word sequences for training set are:
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

The padded sequence is:
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


In [None]:
#to pad after the end of the sentence instead, set padding='post'
print('Padding after the sentence:')
padded = pad_sequences(sequences, padding= 'post')
print(padded)

Padding after the sentence:
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


In [None]:
#to specify the maximum length of the padded sequences
print('Padding after the sentence and the max length of sequences is 5:')
padded = pad_sequences(sequences, padding= 'post',maxlen=5)
print(padded)

Padding after the sentence and the max length of sequences is 5:
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]


Here, the first three sequences have only four tokens, so each gets a single trailing zero. The last sequence has seven tokens and is cut down to its last five: by default, sequences are truncated from the front ('pre' truncating).

In [None]:
#if a sentence is longer than the specified maxlen, it is truncated (either 'post' or 'pre')
print('Padding after the sentence and the max length of sequences is 5, post-truncating:')
padded = pad_sequences(sequences, padding= 'post',maxlen=5, truncating='post')
print(padded)

Padding after the sentence and the max length of sequences is 5, post-truncating:
[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]


In [None]:
#pre-truncating (the default) drops tokens from the start of the sentence
print('Padding after the sentence and the max length of sequences is 5, pre-truncating:')
padded = pad_sequences(sequences, padding= 'post',maxlen=5, truncating='pre')
print(padded)

Padding after the sentence and the max length of sequences is 5, pre-truncating:
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]
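
The interplay of `padding`, `truncating` and `maxlen` is easier to follow once written out by hand. The function `pad` below is a hypothetical pure-Python sketch of the same rules, returning plain lists instead of a NumPy array; it is meant to mirror the outputs above, not to replace `pad_sequences`.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Bring every sequence to the same length by truncating and padding."""
    if maxlen is None:
        # Without maxlen, the longest sequence sets the target length.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == 'pre':
                seq = seq[len(seq) - maxlen:]  # keep the last maxlen tokens
            else:
                seq = seq[:maxlen]             # keep the first maxlen tokens
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad(sequences)[0])
# [0, 0, 0, 5, 3, 2, 4]
print(pad(sequences, padding='post', maxlen=5)[3])
# [9, 2, 4, 10, 11]
print(pad(sequences, padding='post', maxlen=5, truncating='post')[3])
# [8, 6, 9, 2, 4]
```

Which side to truncate depends on where the informative words sit; for headlines, where the start often carries the subject, 'post' truncating may be the safer choice.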


## <u>Training a model to recognize sentiment in text</u>

With the pre-processing in place, we can now build a classifier that recognizes sentiment in text.

We use a dataset of news headlines, each labelled as sarcastic or not sarcastic, and build the classifier on it.

In [None]:
#importing the json library to read a json file
import json

#loading the Sarcasm_Headlines_Dataset_v2.json file using the json library
#the file must be in the working directory, otherwise open() raises FileNotFoundError
with open("Sarcasm_Headlines_Dataset_v2.json", 'r') as f:
    datastore = json.load(f)

#creating lists for the headlines, the labels (sarcastic or not) and the links to the articles
sentences = []
labels = []
urls = []

#adding items to the list
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])


FileNotFoundError: ignored
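
The `FileNotFoundError` above means the file was not in the working directory when the cell ran, so it has to be uploaded or downloaded first. A more defensive loader could look like the sketch below. The function `load_headlines` is a hypothetical helper, and the fallback to one JSON object per line is an assumption about how some copies of this dataset are distributed, not something the notebook confirms; the field names are the ones used in the cell above.

```python
import json
from pathlib import Path


def load_headlines(path):
    """Return (sentences, labels, urls) from a JSON array or JSON Lines file."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; upload or download it first")
    text = path.read_text(encoding='utf-8').strip()
    if text.startswith('['):
        # The whole file is one JSON array of records.
        records = json.loads(text)
    else:
        # Assumption: one JSON object per line (JSON Lines).
        records = [json.loads(line) for line in text.split('\n') if line.strip()]
    sentences = [record['headline'] for record in records]
    labels = [record['is_sarcastic'] for record in records]
    urls = [record['article_link'] for record in records]
    return sentences, labels, urls


data_file = Path("Sarcasm_Headlines_Dataset_v2.json")
if data_file.exists():
    sentences, labels, urls = load_headlines(data_file)
    print(len(sentences), "headlines loaded")
else:
    print(f"{data_file} is missing; place it in the working directory first")
```

Checking for the file before loading keeps the rest of the notebook from running on empty lists.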