<a href="https://colab.research.google.com/github/EISHKARAN/TSS-Resources/blob/main/Tokenization_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### About
Tokenization is one of the first steps in any NLP pipeline: it splits raw text into smaller units (tokens) such as words and punctuation marks.

* We will use spaCy to tokenize input sentences and compare its results with basic tokenization done in plain Python.


#### Requirements
```
pip install spacy
python -m spacy download en_core_web_sm
```

In [None]:
# Naive tokenization in plain Python: split the text on single spaces
doc = "Hi, There ! This is a notebook on Tokenization"
for i,token in enumerate(doc.split(" ")):
    print("Token {} - {}".format(i,token))

Token 0 - Hi,
Token 1 - There
Token 2 - !
Token 3 - This
Token 4 - is
Token 5 - a
Token 6 - notebook
Token 7 - on
Token 8 - Tokenization


However, this straightforward approach has many shortcomings because real text is noisy. Punctuation stays attached to the neighbouring word (note `Hi,` in the output above), and hyphenated words or multi-word proper nouns are not handled sensibly.
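
To make the limitation concrete, here is a minimal sketch that splits another sentence on whitespace. The sentence is made up for illustration and is not part of the original notebook.

```python
# Whitespace splitting leaves punctuation glued to the neighbouring word
# and cannot look inside contractions or hyphenated compounds.
text = "Don't over-think it, okay?"  # illustrative sentence (an assumption)
tokens = text.split(" ")
for i, token in enumerate(tokens):
    print("Token {} - {}".format(i, token))
```

The comma and the question mark stay attached to `it` and `okay`, and `Don't` and `over-think` come back as single opaque tokens.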

* BERT uses sub-word tokens (WordPiece): a word that is not in the vocabulary is broken into smaller pieces that are. This reduces the number of OOV (out-of-vocabulary) tokens.
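
The sub-word idea can be sketched with a greedy longest-match-first split, which is how WordPiece-style tokenizers typically break a word at inference time. This is only a rough illustration: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and a real BERT vocabulary is learned from data and is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##s", "note", "##book"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("notebooks", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Words such as `tokenization` never need their own vocabulary entry as long as their pieces are present; only a word with no matching pieces falls back to the unknown token.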

For word-level tokenization we therefore use spaCy, which provides an efficient tokenizer for NLP.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Tokenize the text; iterating over the Doc yields one token at a time
doc = nlp("Hi, There ! This is a notebook on Tokenization")
for token in doc:
    print("Token: {}".format(token))

Token: Hi
Token: ,
Token: There
Token: !
Token: This
Token: is
Token: a
Token: notebook
Token: on
Token: Tokenization


You can also add your own tokenizer rules; see the <a href="https://spacy.io/usage/linguistic-features#special-cases">spaCy documentation on special cases</a>.
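
As a minimal sketch of such a rule, the snippet below registers a special case that splits `gimme` into two tokens. It assumes spaCy is installed and uses `spacy.blank("en")` so that no trained model needs to be downloaded; the example word is chosen purely for illustration.

```python
import spacy
from spacy.symbols import ORTH

# A blank English pipeline still includes the default English tokenizer.
nlp = spacy.blank("en")

before = [token.text for token in nlp("gimme that")]
print(before)

# Special case: always split "gimme" into the two tokens "gim" and "me".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

after = [token.text for token in nlp("gimme that")]
print(after)
```

The pieces of a special case must concatenate back to the original string, because spaCy's tokenization never changes the underlying text.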

Besides this, models such as BERT, BART and their variants ship with their own tokenizers. Deep learning frameworks provide tokenizers too; let's look at a simple word-level one, the Keras `Tokenizer`.

In [None]:
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "Hi, This is our first Tokenizer Notebook",
    "Glad to see you here.",
    "What are you upto ?"
]

In [None]:
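# num_words=20 keeps only the 19 most frequent words when texts are converted to sequences
tokenizer = Tokenizer(num_words=20)
tokenizer.fit_on_texts(sentences)  # build the vocabulary from the sentences
word_idx = tokenizer.word_index    # word -> integer index, most frequent word first
print(word_idx)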

{'you': 1, 'hi': 2, 'this': 3, 'is': 4, 'our': 5, 'first': 6, 'tokenizer': 7, 'notebook': 8, 'glad': 9, 'to': 10, 'see': 11, 'here': 12, 'what': 13, 'are': 14, 'upto': 15}
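
The ordering in this dictionary comes from word frequency: `you` appears twice, so it gets index 1, and the remaining words follow in order of first appearance. The pure-Python sketch below reproduces that behaviour for these sentences. It is an approximation for illustration, not the Keras implementation; `build_word_index` is a hypothetical helper, and `FILTERS` is modeled on the Keras default filter characters.

```python
from collections import Counter

# Characters treated as separators; modeled on the Keras default filters (an assumption).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(texts):
    """Map each word to an integer rank, with 1 for the most frequent word."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        # Lowercase, replace filtered characters with spaces, split on whitespace.
        counts.update(text.lower().translate(table).split())
    # most_common() sorts by count and keeps first-seen order among ties.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = [
    "Hi, This is our first Tokenizer Notebook",
    "Glad to see you here.",
    "What are you upto ?",
]
word_index = build_word_index(sentences)
print(word_index)
```

Index 0 is never assigned to a word, which leaves it free to act as a padding value later.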


In [None]:
# Convert each sentence into a sequence of integer word indices
sequences = tokenizer.texts_to_sequences(sentences)
for seq in sequences:
    print(seq)

[2, 3, 4, 5, 6, 7, 8]
[9, 10, 11, 1, 12]
[13, 14, 1, 15]


In [None]:
# Neural networks need inputs of equal length, so pad every sequence
# with zeros at the end (padding='post') up to the longest sequence.
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences, padding='post')
for seq in padded_sequences:
    print(seq)

[2 3 4 5 6 7 8]
[ 9 10 11  1 12  0  0]
[13 14  1 15  0  0  0]
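
What `padding='post'` does can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical and only mirrors the behaviour seen in the output above; it returns lists rather than a NumPy array.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


sequences = [[2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 1, 12], [13, 14, 1, 15]]
padded = pad_post(sequences)
for row in padded:
    print(row)
```

Zero is a safe padding value because the word indices start at 1. Note that `pad_sequences` pads at the front by default (`padding='pre'`), which is why the notebook passes `padding='post'` explicitly.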
