## `Tokenization:`
Tokenization is the process of breaking text down into smaller units called tokens.

Tokenization comes in three kinds:
1. Word-based --> text is divided into individual words
2. Character-based --> text is split into individual characters
3. Subword-based --> frequently used words remain unsplit, while infrequent words are broken down into smaller pieces
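
To make the first two kinds concrete, here is a minimal pure-Python sketch. The regular expression is an illustrative assumption, not how any particular library splits words. Subword tokenization is left out because it needs a vocabulary learned from a corpus.

```python
import re

text = "Unicorns are real. I saw a unicorn yesterday."

# Word-based: runs of word characters, plus each punctuation mark on its own.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-based: every character (spaces included) becomes a token.
char_tokens = list(text)

print(word_tokens)
print(char_tokens[:8])
```

Word-based tokens keep meaning intact but lead to a large vocabulary, while character-based tokens give a tiny vocabulary at the cost of much longer sequences.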


## Word-Based Tokenizer

In [3]:
import spacy
text = "Unicorns are real. I saw a unicorn yesterday. I couldn't see it today"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
token_list = [token.text for token in doc]
print("Tokens: ",token_list)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.4 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/Users/tinonturjamajumder/anaconda3/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 736, in start
    se

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

In [5]:
!pip install transformers


Collecting numpy<2.0,>=1.17 (from transformers)
  Obtaining dependency information for numpy<2.0,>=1.17 from https://files.pythonhosted.org/packages/1a/2e/151484f49fd03944c4a3ad9c418ed193cfd02724e138ac8a9505d056c582/numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (114 kB)
Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.4
    Uninstalling numpy-2.2.4:
      Successfully uninstalled numpy-2.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. Th

In [6]:
import transformers

## Subword-based Tokenization

1. `WordPiece:` Evaluates the benefits and drawbacks of splitting and merging two symbols before deciding whether the merge is worthwhile
2. `Unigram:` Breaks text into smaller pieces, starting from a large list of candidate pieces and narrowing it down based on frequency of appearance
3. `SentencePiece:` Segments text into manageable parts and assigns each part a unique ID
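
As a rough illustration of how a WordPiece-style trainer might weigh a merge, the sketch below scores each adjacent pair by its frequency divided by the product of its parts' frequencies. This is one commonly described formulation, not a confirmed detail of any specific implementation. The toy corpus and the `pair_score` helper are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: each word is pre-split into symbols,
# with "##" marking symbols that continue a word.
corpus = [
    ["h", "##u", "##g"],
    ["h", "##u", "##g", "##s"],
    ["p", "##u", "##g"],
    ["b", "##u", "##n"],
]

symbol_freq = Counter(sym for word in corpus for sym in word)
pair_freq = Counter((a, b) for word in corpus for a, b in zip(word, word[1:]))

def pair_score(pair):
    """Score a candidate merge: pair frequency relative to its parts."""
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

best_pair = max(pair_freq, key=pair_score)
print(best_pair, pair_score(best_pair))
```

Under this score, a pair made of rare symbols can beat a more frequent pair whose parts also appear in many other contexts, which is the trade-off the first list item refers to.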

### WordPiece

In [13]:
from transformers import BertTokenizer

# Structure --> variable = TokenizerClass.from_pretrained('model_name')
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") 
t_1 = tokenizer.tokenize('My name is Tinon Turja.')
t_2 = tokenizer.tokenize("You would get better by rigorous practicing.")
t_3 = tokenizer.tokenize("Your constant appreciation makes me shy.")
t_4 = tokenizer.tokenize('My name is Afsan Nahiyan Tripto.')
t_1,t_2,t_3,t_4

(['my', 'name', 'is', 'tin', '##on', 'tu', '##r', '##ja', '.'],
 ['you', 'would', 'get', 'better', 'by', 'rigorous', 'practicing', '.'],
 ['your', 'constant', 'appreciation', 'makes', 'me', 'shy', '.'],
 ['my',
  'name',
  'is',
  'af',
  '##san',
  'nah',
  '##iya',
  '##n',
  'trip',
  '##to',
  '.'])

The `##` prefix indicates that the token should be attached to the previous token without a space.
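
The rule can be sketched as a small helper that rebuilds words from WordPiece tokens. `merge_wordpieces` is a hypothetical name used only for illustration, and the token list is copied from the output above.

```python
def merge_wordpieces(tokens):
    """Rebuild words by gluing '##' pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: drop the marker and append to the last word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words

tokens = ['my', 'name', 'is', 'tin', '##on', 'tu', '##r', '##ja', '.']
print(merge_wordpieces(tokens))
```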

### Unigram and SentencePiece

In [14]:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

tokenizer.tokenize("IBM taught me tokenization")

['▁IBM', '▁taught', '▁me', '▁token', 'ization']

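Tokens that start a new word are prefixed with `▁` (U+2581, which resembles an underscore), marking where a space preceded the word in the original text. Tokens without the prefix, such as `ization`, continue the previous token.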
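
Reversing the process shows what the marker encodes. This is a minimal sketch with a hypothetical helper name, assuming the marker only ever stands for a single space.

```python
def detokenize_sentencepiece(tokens):
    """Join the pieces, then turn each word-boundary marker back into a space."""
    return "".join(tokens).replace("\u2581", " ").strip()

tokens = ['▁IBM', '▁taught', '▁me', '▁token', 'ization']
print(detokenize_sentencepiece(tokens))
```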

## Tokenization with PyTorch

1. Use the `torchtext` library for tokenization
2. Use the `build_vocab_from_iterator` function: creates a vocabulary from the tokens

   * Assigns each token a unique index --> the model uses these indices to look up each word in the vocabulary

For more details, see: https://pytorch.org/text/stable/index.html
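
The idea of a vocabulary can be sketched in pure Python before reaching for `torchtext`. The `build_vocab` helper below is hypothetical and only illustrates the token-to-index mapping; it is not how `build_vocab_from_iterator` is implemented, and the ordering rule (by frequency, then alphabetically) is an assumption chosen to keep the result deterministic.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map each distinct token to an index: specials first, then by frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Descending frequency, ties broken alphabetically for a stable order.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

sentences = ["Introduction to NLP", "Basics of PyTorch", "Machine Translation with NLP"]
token_lists = [s.lower().split() for s in sentences]
vocab = build_vocab(token_lists)

# Tokens missing from the vocabulary fall back to the "<unk>" index.
ids = [vocab.get(tok, vocab["<unk>"]) for tok in "nlp with transformers".split()]
print(ids)
```

Reserving an index for unknown tokens lets the model handle words it never saw while the vocabulary was built.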

In [20]:
import torchtext 
import torchtext.transforms
from torchtext.transforms import BERTTokenizer

In [22]:
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1,"NLP Named Entity,Sentiment Analysis, Machine Translation"),
    (1,"Machine Translation with NLP"),
    (1,"Named Entity vs Sentiment Analysis NLP")
]

In [26]:
type(dataset[0]),type(dataset)

(tuple, list)

In [27]:
dataset[1][1]

'Basics of PyTorch'

In [28]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
# Each dataset entry is a (label, text) tuple, so unpack it directly
# instead of indexing back into the dataset with the label and the text.
for label, text in dataset:
    print(tokenizer(text))