<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/NLP%20with%20Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Importing the dependencies

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

### Word Encodings

#### Define training sentences

In [2]:
train_sentences = [
                   "I am Satwik",
                   "I am Vinay",
                   "We live in Bengaluru"
]

#### Set up the tokenizer

In [3]:
tokenizer = Tokenizer(num_words = 100)

In [4]:
tokenizer.fit_on_texts(train_sentences)

In [5]:
tokenizer.word_index

{'am': 2,
 'bengaluru': 8,
 'i': 1,
 'in': 7,
 'live': 6,
 'satwik': 3,
 'vinay': 4,
 'we': 5}
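To make the numbering above less opaque, here is a minimal pure-Python sketch of how a frequency-ranked word index can be built. It is not the Keras implementation: `build_word_index` is a hypothetical helper, it splits on whitespace only (no punctuation filtering), and it assumes that words with equal counts keep their first-seen order, which is consistent with the output above.

```python
from collections import Counter

def build_word_index(sentences):
    """Hypothetical helper: map each word to a 1-based rank by frequency."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace (assumption: no punctuation to strip)
        counts.update(sentence.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    # Numbering starts at 1, leaving 0 free to act as a padding value
    return {word: rank for rank, word in enumerate(ranked, start=1)}

train_sentences = ["I am Satwik", "I am Vinay", "We live in Bengaluru"]
word_index = build_word_index(train_sentences)
print(word_index)
```

The most frequent words ("i" and "am") receive the smallest IDs, which is why a `num_words` limit keeps the most common part of the vocabulary.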

#### Text to Sequences

In [6]:
tokenizer.texts_to_sequences(train_sentences)

[[1, 2, 3], [1, 2, 4], [5, 6, 7, 8]]
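A quick way to sanity-check the sequences above is to map the IDs back to words. The sketch below does this with a plain dictionary copied from the `word_index` output; `decode` is a hypothetical helper written for illustration. The Keras `Tokenizer` keeps a comparable reverse mapping in `tokenizer.index_word`.

```python
# Word index copied from the tokenizer output above
word_index = {"i": 1, "am": 2, "satwik": 3, "vinay": 4,
              "we": 5, "live": 6, "in": 7, "bengaluru": 8}

# Invert the mapping: integer ID -> word
index_word = {idx: word for word, idx in word_index.items()}

def decode(sequence):
    """Hypothetical helper: turn a list of IDs back into a lowercase sentence."""
    return " ".join(index_word[i] for i in sequence)

sequences = [[1, 2, 3], [1, 2, 4], [5, 6, 7, 8]]
decoded = [decode(seq) for seq in sequences]
print(decoded)
```

The decoded text is lowercase because the tokenizer lowercases its input before indexing, so the original casing is not recoverable from the IDs.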

### Tokenizing new data using the same tokenizer

In [7]:
new_sentences = [
                 "its sunny day",
                 "Cloudy weather",
]

In [8]:
tokenizer.texts_to_sequences(new_sentences)

[[], []]

The tokenizer was never fitted on any of these words, and by default it silently drops words it does not know, so each sentence encodes to an empty list.

#### Using an OOV token to avoid this

In [9]:
tokenizer = Tokenizer(num_words = 100, oov_token = "<oov>")

In [10]:
tokenizer.fit_on_texts(train_sentences)

In [11]:
tokenizer.word_index

{'<oov>': 1,
 'am': 3,
 'bengaluru': 9,
 'i': 2,
 'in': 8,
 'live': 7,
 'satwik': 4,
 'vinay': 5,
 'we': 6}

In [12]:
tokenizer.texts_to_sequences(train_sentences)

[[2, 3, 4], [2, 3, 5], [6, 7, 8, 9]]

In [13]:
tokenizer.texts_to_sequences(new_sentences)

[[1, 1, 1], [1, 1]]
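The difference between the two tokenizers comes down to what happens when a lookup fails. The following pure-Python sketch reproduces both behaviours; `encode` is a hypothetical helper (not the Keras code), and the two dictionaries are copied from the `word_index` outputs above.

```python
def encode(sentence, word_index, oov_token=None):
    """Hypothetical helper: look up each word, handling unknown words."""
    ids = []
    for word in sentence.lower().split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_token is not None:
            # Unknown word: fall back to the ID reserved for the OOV token
            ids.append(word_index[oov_token])
        # With no OOV token, the unknown word is silently dropped
    return ids

# Word indexes copied from the two tokenizer outputs above
plain_index = {"i": 1, "am": 2, "satwik": 3, "vinay": 4,
               "we": 5, "live": 6, "in": 7, "bengaluru": 8}
oov_index = {"<oov>": 1, "i": 2, "am": 3, "satwik": 4, "vinay": 5,
             "we": 6, "live": 7, "in": 8, "bengaluru": 9}

new_sentences = ["its sunny day", "Cloudy weather"]
without_oov = [encode(s, plain_index) for s in new_sentences]
with_oov = [encode(s, oov_index, oov_token="<oov>") for s in new_sentences]
print(without_oov)  # [[], []]
print(with_oov)     # [[1, 1, 1], [1, 1]]
```

With an OOV token the sequence length matches the number of words in the sentence, so downstream padding and truncation operate on the real sentence length even when the content is unknown.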

### Padding

In [14]:
word_index = tokenizer.word_index

In [15]:
word_index

{'<oov>': 1,
 'am': 3,
 'bengaluru': 9,
 'i': 2,
 'in': 8,
 'live': 7,
 'satwik': 4,
 'vinay': 5,
 'we': 6}

In [16]:
sequences = tokenizer.texts_to_sequences(train_sentences)

In [17]:
sequences

[[2, 3, 4], [2, 3, 5], [6, 7, 8, 9]]

In [18]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [19]:
padded_sequence = pad_sequences(sequences = sequences)

In [20]:
padded_sequence

array([[0, 2, 3, 4],
       [0, 2, 3, 5],
       [6, 7, 8, 9]], dtype=int32)

In [21]:
print(train_sentences)
print(word_index)
print(sequences)
print(padded_sequence)

['I am Satwik', 'I am Vinay', 'We live in Bengaluru']
{'<oov>': 1, 'i': 2, 'am': 3, 'satwik': 4, 'vinay': 5, 'we': 6, 'live': 7, 'in': 8, 'bengaluru': 9}
[[2, 3, 4], [2, 3, 5], [6, 7, 8, 9]]
[[0 2 3 4]
 [0 2 3 5]
 [6 7 8 9]]


#### Customizing padding and truncation

In [22]:
padded_sequence = pad_sequences(sequences = sequences, padding = "post",
                                maxlen = 5, truncating = "post")

In [25]:
padded_sequence

array([[2, 3, 4, 0, 0],
       [2, 3, 5, 0, 0],
       [6, 7, 8, 9, 0]], dtype=int32)
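The `padding`, `maxlen`, and `truncating` arguments can be easier to reason about with a plain-Python version. The sketch below is a hypothetical `pad` helper that mirrors those arguments under the assumption that both default to `"pre"`; unlike `pad_sequences`, it returns nested lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper mirroring the arguments used above."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops items from the front, "post" drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the fill values before the sequence, "post" after it
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

sequences = [[2, 3, 4], [2, 3, 5], [6, 7, 8, 9]]
default_padded = pad(sequences)
post_padded = pad(sequences, maxlen=5, padding="post", truncating="post")
truncated = pad(sequences, maxlen=3, truncating="post")
print(default_padded)
print(post_padded)
print(truncated)
```

Truncation only takes effect when a sequence is longer than `maxlen`, which is why the last call uses `maxlen=3` to cut the four-word sentence down to its first three IDs.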

In [29]:
train_sentences = [
                   "I am Satwik",
                   "I am Vinay",
                   "We live in Bengaluru, Karnataka, India. ok , ieieeuheh"
]

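In [30]:
# Re-tokenize the updated sentences. The tokenizer is not refit here,
# so words it has not seen before map to <oov> (ID 1).
sequences = tokenizer.texts_to_sequences(train_sentences)
padded_sequence = pad_sequences(sequences = sequences, padding = "post",
                                maxlen = 5, truncating = "post")

In [31]:
padded_sequence

array([[2, 3, 4, 0, 0],
       [2, 3, 5, 0, 0],
       [6, 7, 8, 9, 1]], dtype=int32)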