## `Tokenization:`
Tokenization is the process of breaking text down into individual units, called tokens.

There are three main kinds of tokenization:
1. Word-Based --> Text is divided into individual words
2. Character-Based --> Text is split into individual characters
3. Subword-Based --> Frequently used words remain unsplit, while infrequent words are broken down into smaller pieces
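
To make the difference between the first two kinds concrete, here is a minimal sketch in plain Python. It assumes whitespace splitting is good enough for the word-based case, which real tokenizers refine with punctuation rules.

```python
text = "Unicorns are real"

# Word-based: split on whitespace (real tokenizers also handle punctuation).
word_tokens = text.split()

# Character-based: every character, including spaces, becomes a token.
char_tokens = list(text)

print(word_tokens)                         # ['Unicorns', 'are', 'real']
print(char_tokens[:8])                     # ['U', 'n', 'i', 'c', 'o', 'r', 'n', 's']
print(len(word_tokens), len(char_tokens))  # 3 17
```

Word-based tokens give short sequences but need a large vocabulary. Character-based tokens need only a tiny vocabulary but produce much longer sequences. Subword-based tokenization sits between the two.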


## Word-Based Tokenizer

In [None]:
import spacy
text = "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
token_list = [token.text for token in doc]
print("Tokens: ",token_list)
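
For comparison, the sketch below shows what a word-based tokenizer has to handle beyond a plain whitespace split. The regular expression is an illustrative assumption, not the rule set spaCy uses; spaCy applies its own language-specific rules on top of this idea.

```python
import re

text = "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today"

# Naive whitespace split: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Regex split: a run of word characters (optionally with one internal
# apostrophe part) or a single punctuation mark becomes a token.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens[:3])  # ['Unicorns', 'are', 'real.']
print(regex_tokens[:4])       # ['Unicorns', 'are', 'real', '.']
```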

In [5]:
!pip install transformers



In [1]:
import transformers

## Subword-based Tokenization

1. `WordPiece:` Evaluates the benefits and drawbacks of merging two symbols before deciding whether to merge them
2. `Unigram:` Breaks text into smaller pieces; starts from a large list of candidate pieces and narrows it down based on their frequency of appearance
3. `SentencePiece:` Segments text into manageable parts and assigns a unique ID to each part
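
Once a subword vocabulary exists, splitting a word with it is commonly done by greedy longest-match-first search. The sketch below illustrates that lookup step only, not how the vocabulary is learned. The function name `wordpiece_tokenize` and the `toy_vocab` set are hypothetical and chosen for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than a real one.
toy_vocab = {"token", "##ization", "##ize", "un", "##known"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unknown", toy_vocab))       # ['un', '##known']
print(wordpiece_tokenize("zebra", toy_vocab))         # ['[UNK]']
```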

### WordPiece

In [2]:
from transformers import BertTokenizer

# Structure --> variable = TokenizerClass.from_pretrained('model_name')
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") 
t_1 = tokenizer.tokenize('My name is Tinon Turja.')
t_2 = tokenizer.tokenize("You would get better by rigorous practicing.")
t_3 = tokenizer.tokenize("Your constant appreciation makes me shy.")
t_4 = tokenizer.tokenize('My name is Afsan Nahiyan Tripto.')
t_1,t_2,t_3,t_4

(['my', 'name', 'is', 'tin', '##on', 'tu', '##r', '##ja', '.'],
 ['you', 'would', 'get', 'better', 'by', 'rigorous', 'practicing', '.'],
 ['your', 'constant', 'appreciation', 'makes', 'me', 'shy', '.'],
 ['my',
  'name',
  'is',
  'af',
  '##san',
  'nah',
  '##iya',
  '##n',
  'trip',
  '##to',
  '.'])

The `##` prefix indicates that the token continues a word and should be attached to the previous token without a space.
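
As a small illustration of that rule, the sketch below glues `##` pieces back onto the preceding token. The helper name `merge_wordpieces` is hypothetical; the token list is copied from the output above.

```python
def merge_wordpieces(tokens):
    """Rebuild whole words by attaching ## pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the ## marker and append
        else:
            words.append(token)
    return words


tokens = ['my', 'name', 'is', 'tin', '##on', 'tu', '##r', '##ja', '.']
print(merge_wordpieces(tokens))  # ['my', 'name', 'is', 'tinon', 'turja', '.']
```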

## Unigram and SentencePiece

In [3]:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

tokenizer.tokenize("IBM taught me tokenization")

['▁IBM', '▁taught', '▁me', '▁token', 'ization']

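Tokens that start a new word are prefixed with `▁` (U+2581, which looks like an underscore), marking where a space preceded them in the original text. Tokens without the prefix, such as `ization`, continue the previous token.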
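
Because the marker records where the spaces were, the original text can be recovered from the pieces. A minimal sketch using the output above:

```python
pieces = ['▁IBM', '▁taught', '▁me', '▁token', 'ization']

# Join the pieces, turn every ▁ marker back into a space,
# and drop the leading space produced by the first marker.
text = "".join(pieces).replace("▁", " ").strip()

print(text)  # IBM taught me tokenization
```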

## Tokenization with PyTorch

1. Use the `torchtext` library for tokenization
2. Use the `build_vocab_from_iterator` function: creates a vocabulary from the tokens

   * Assigns each token a unique index --> the model uses these indices to look up each word in the vocabulary

For more details, see: https://pytorch.org/text/stable/index.html
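
To show the idea of a token-to-index vocabulary without relying on any library, here is a plain-Python sketch. The names `build_vocab` and `lookup` are hypothetical. The ordering rule (special tokens first, then most frequent tokens, ties broken alphabetically) is an assumption made for this sketch and not a statement about torchtext internals.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>",)):
    """Map each distinct token to a unique integer index."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Sort alphabetically, then stable-sort by descending frequency,
    # so equally frequent tokens stay in alphabetical order.
    ordered = sorted(counter.items(), key=lambda item: item[0])
    ordered.sort(key=lambda item: item[1], reverse=True)
    index_to_token = list(specials) + [token for token, _ in ordered]
    return {token: index for index, token in enumerate(index_to_token)}


corpus = [["introduction", "to", "nlp"], ["basics", "of", "nlp"]]
stoi = build_vocab(corpus)
unk_index = stoi["<unk>"]


def lookup(tokens):
    """Convert tokens to indices, falling back to <unk> for unseen tokens."""
    return [stoi.get(token, unk_index) for token in tokens]


print(stoi)                     # {'<unk>': 0, 'nlp': 1, 'basics': 2, 'introduction': 3, 'of': 4, 'to': 5}
print(lookup(["nlp", "rocks"])) # [1, 0]
```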

In [1]:
import torchtext 
import torchtext.transforms
# from torchtext.transforms import BERTTokenizer

In [2]:
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1,"NLP Named Entity,Sentiment Analysis, Machine Translation"),
    (1,"Machine Translation with NLP"),
    (1,"Named Entity vs Sentiment Analysis NLP")
]

In [26]:
type(dataset[0]),type(dataset)

(tuple, list)

In [27]:
dataset[1][1]

'Basics of PyTorch'

In [5]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
for _, sample in dataset:  # each item is a (label, text) pair
    print(tokenizer(sample))  # tokenize the text itself, not dataset[label]

['introduction', 'to', 'nlp']
['basics', 'of', 'pytorch']
['nlp', 'techniques', 'for', 'text', 'classification']
['named', 'entity', 'recognition', 'with', 'pytorch']
['sentiment', 'analysis', 'using', 'pytorch']
['machine', 'translation', 'with', 'pytorch']
['nlp', 'named', 'entity', ',', 'sentiment', 'analysis', ',', 'machine', 'translation']
['machine', 'translation', 'with', 'nlp']
['named', 'entity', 'vs', 'sentiment', 'analysis', 'nlp']


In [6]:
# Tokenization with PyTorch
def yield_tokens(data_iter):
    # Takes an iterable of (label, text) pairs and lazily yields
    # the token list for each text, one sentence at a time.
    for _, text in data_iter:
        yield tokenizer(text)
my_iterator = yield_tokens(dataset)

In [7]:
next(my_iterator)
next(my_iterator)

['basics', 'of', 'pytorch']
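
Only the second sentence is displayed because each `next` call advances the generator and a notebook cell shows just its last expression. The sketch below illustrates this with a stand-in generator; `yield_tokens_demo` and its lowercase-and-split tokenizer are assumptions for the example, not the torchtext tokenizer.

```python
def yield_tokens_demo(data_iter):
    # Stand-in for yield_tokens: lowercases and splits on whitespace.
    for _, text in data_iter:
        yield text.lower().split()


demo_data = [(1, "Introduction to NLP"), (2, "Basics of PyTorch")]
demo_iterator = yield_tokens_demo(demo_data)

first = next(demo_iterator)      # consumes the first sentence
second = next(demo_iterator)     # consumes the second sentence
remaining = list(demo_iterator)  # nothing is left after that

print(first)      # ['introduction', 'to', 'nlp']
print(second)     # ['basics', 'of', 'pytorch']
print(remaining)  # []
```

A generator cannot be rewound. To start again from the first sentence, create a fresh one by calling the generator function again.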

In [8]:
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(yield_tokens(dataset),specials =["<unk>"])
#vocab.set_default_index(vocab["<unk>"]) # use <unk> as the default index for tokens not found in the vocabulary
vocab.get_stoi()

{'vs': 21,
 'to': 19,
 'recognition': 16,
 'introduction': 14,
 'for': 13,
 'basics': 11,
 'with': 9,
 'translation': 8,
 'sentiment': 7,
 'classification': 12,
 'nlp': 1,
 ',': 10,
 'named': 6,
 'using': 20,
 'machine': 5,
 'text': 18,
 'entity': 4,
 'techniques': 17,
 '<unk>': 0,
 'of': 15,
 'analysis': 3,
 'pytorch': 2}

In [9]:
vocab(['introduction','to','nlp'])

[14, 19, 1]
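
The mapping also works in reverse. The sketch below inverts a few entries copied from the `get_stoi()` output above to turn indices back into tokens. It uses a plain dictionary as a stand-in for the vocabulary object.

```python
# A few entries copied from the vocabulary printed above.
stoi = {'<unk>': 0, 'nlp': 1, 'introduction': 14, 'to': 19}

# Invert the mapping: index -> token.
itos = {index: token for token, index in stoi.items()}

indices = [14, 19, 1]
decoded = [itos[index] for index in indices]

print(decoded)  # ['introduction', 'to', 'nlp']
```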

## Creating Tokens and indices

In [14]:
def get_tokenized_sentences_and_indices(iterator):
    
    tokenized_sentence = next(iterator)
    # tokenized_sentence is the list of tokens for one sentence, e.g. ['introduction', 'to', 'nlp']
    token_indices = [vocab[token] for token in tokenized_sentence]
    return tokenized_sentence,token_indices


tokenized_sentence,token_indices = get_tokenized_sentences_and_indices(my_iterator)
next(my_iterator)

print(f"Tokenized Sentence: {tokenized_sentence}")
print(f"Token Indices: {token_indices}")

Tokenized Sentence: ['sentiment', 'analysis', 'using', 'pytorch']
Token Indices: [7, 3, 20, 2]


In [13]:
tokenized_sentence,token_indices = get_tokenized_sentences_and_indices(my_iterator)


tokenized_sentence,token_indices 

(['named', 'entity', 'recognition', 'with', 'pytorch'], [6, 4, 16, 9, 2])

## Build Special Tokens and Build Vocab from iterator

In [30]:
pip install --force-reinstall --no-cache-dir torch thinc spacy



In [31]:
!pip install spacy
!python -m spacy download en_core_web_sm


Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/spacy/__init__.py", line 6, in <module>
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/spacy/compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/thinc/__init__.py", line 5, in <module>
    from .config import registry
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/thinc/config.py", line 5, in <module>
    from .types import Decorator
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/thinc/types.py", line 2

In [32]:
tokenizer_en = get_tokenizer(tokenizer = 'spacy',language="en_core_web_sm")
tokens = []
max_length = 0


OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

In [15]:
for line in lines:
    tokenized_line = 

['My', 'name', 'is', 'turja']

In [51]:
from torchtext.transforms import BERTTokenizer
tokenizer = BERTTokenizer(vocab_path='textfile.txt',
                         do_lower_case=True,
                         return_tokens=True)
tokenizer('My name is Tinon Turja')

['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]']

In [30]:
idx,sample = dataset[0]
type(idx)

int

In [32]:
for i,sample in dataset:
    print(i)
    print(sample)

1
Introduction to NLP
2
Basics of PyTorch
1
NLP Techniques for Text Classification
3
Named Entity Recognition with PyTorch
3
Sentiment Analysis using PyTorch
3
Machine Translation with PyTorch
1
NLP Named Entity,Sentiment Analysis, Machine Translation
1
Machine Translation with NLP
1
Named Entity vs Sentiment Analysis NLP


In [35]:
tokenizer(dataset[3][1])

['named', 'entity', 'recognition', 'with', 'pytorch']

In [16]:
s = 'My name is turja'
s.split(' ')

['My', 'name', 'is', 'turja']