# What is Tokenization?

Tokenizers are among the most important tools in NLP: they break text down into smaller units called tokens. These tokens can be words, characters, or subwords, which turns complex sentences into pieces a computer can process. In short, tokenizers bridge the gap between human language and machine understanding.
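
To make the word and character granularities concrete, here is a minimal sketch using only the Python standard library. The sample sentence and the regular expression are illustrative assumptions and are not how the libraries used later in this lab tokenize internally. Subword tokenization needs a learned vocabulary, so it is left out of this sketch.

```python
import re

# Illustrative sample sentence (an assumption, not taken from the lab data).
text = "Tokenizers bridge language and machines."

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every non-space character becomes its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)
print(char_tokens[:10])
```

The word-level split keeps the final period as its own token, which is the usual choice because punctuation often carries meaning for downstream models.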

# Setup

This lab uses the following libraries:

* [`nltk`](https://www.nltk.org/), the Natural Language Toolkit, will be used for data management tasks. It offers comprehensive tools and resources for processing natural language, making it a valuable choice for tasks such as text preprocessing and analysis.

* [`spaCy`](https://spacy.io/) is an open-source library for advanced natural language processing in Python. `spaCy` is known for its speed and accuracy in processing large volumes of text data.

* [`BertTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#berttokenizer) is part of the Hugging Face Transformers library, a popular library for working with state-of-the-art pre-trained language models. `BertTokenizer` is specifically designed for tokenizing text according to the BERT model's specifications.

* [`XLNetTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#xlnettokenizer) is another component of the Hugging Face Transformers library. It is tailored for tokenizing text in alignment with the XLNet model's requirements.

* [`torchtext`](https://pytorch.org/text/stable/index.html) is part of the PyTorch ecosystem and handles various natural language processing tasks. It simplifies working with text data and provides functionality for data preprocessing, tokenization, vocabulary management, and batching.
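
`BertTokenizer` is based on WordPiece, which splits a word into subword pieces by greedily matching the longest entry in a vocabulary. The sketch below illustrates that idea with the standard library only. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and chosen for illustration. This is not the Transformers implementation, which also handles normalization, casing, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has roughly 30,000 entries.
toy_vocab = {"token", "##ization", "##izer", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because rare words decompose into known pieces, a subword vocabulary can stay small while still covering words it never saw whole.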


# Installing Required Libraries

In [5]:
!pip install nltk
!pip install transformers==4.42.1
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install scikit-learn
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install numpy==1.26.0


Collecting numpy>=1.19.0 (from spacy)
  Obtaining dependency information for numpy>=1.19.0 from https://files.pythonhosted.org/packages/2b/3e/e7247c1d4f15086bb106c8d43c925b0b2ea20270224f5186fa48d4fb5cbd/numpy-2.2.4-cp311-cp311-macosx_14_0_arm64.whl.metadata
  Using cached numpy-2.2.4-cp311-cp311-macosx_14_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.2.4-cp311-cp311-macosx_14_0_arm64.whl (5.4 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.0
    Uninstalling numpy-1.26.0:
      Successfully uninstalled numpy-1.26.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tables 3.8.0 requires blosc2~=2.0.0, which is not installed.
tables 3.8.0 requires cython>=0.29.21, which is not installed.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
numba 0.57.1 requires numpy<1.25,>=1.21, b
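
Because the log above shows pip swapping one numpy version for another mid-install, it can help to confirm which versions ended up in the environment. The following is a minimal sketch, assuming the pins from the install cell are the versions you want. The helper name `check_pins` is hypothetical, and some builds report a local suffix (for example on `torch`) that this exact-match comparison would flag as a mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
pinned = {
    "transformers": "4.42.1",
    "torch": "2.2.2",
    "torchtext": "0.17.2",
    "numpy": "1.26.0",
}


def check_pins(pins):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_pins(pinned).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If the loop prints nothing, every pinned package is installed at the expected version.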

# Importing Required Libraries

In [None]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer


from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def warn(*args,**kwarg)