
# Tokenization Techniques in Python
This notebook demonstrates several tokenization techniques in Python: basic whitespace splitting with `split()`, word tokenization with `nltk` and `spaCy`, subword tokenization with `transformers`, sentence tokenization, and custom regex tokenization.

## 1. Basic Word Tokenization with `split()`
This method splits a sentence into tokens at whitespace. Punctuation stays attached to the neighboring word.

In [None]:
# Store the example string in the variable `sentence`.
sentence = 'Hello, world! Welcome to NLP.'
# With no argument, split() splits at any run of whitespace (spaces, tabs, newlines) and returns a list of substrings (tokens).
tokens = sentence.split()
print(tokens)

['Hello,', 'world!', 'Welcome', 'to', 'NLP.']
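
The whitespace behavior described above can be checked with a small sketch. The sample string `messy` is made up for illustration; it compares `split()` with no argument against `split(' ')`, which splits on single spaces only.

```python
# Hypothetical sample containing a tab, a newline, and repeated spaces.
messy = 'Hello,\tworld!\n  Welcome   to NLP.'

# No argument: any run of whitespace acts as one separator,
# and leading/trailing whitespace is ignored.
whitespace_tokens = messy.split()

# Explicit ' ': only single spaces separate tokens, so the tab and newline
# stay inside a token and repeated spaces produce empty strings.
space_tokens = messy.split(' ')

print(whitespace_tokens)  # ['Hello,', 'world!', 'Welcome', 'to', 'NLP.']
print(space_tokens)       # ['Hello,\tworld!\n', '', 'Welcome', '', '', 'to', 'NLP.']
```

For tokenization, the no-argument form is usually the safer default because it never yields empty tokens.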


## 2. Word Tokenization with `nltk`
`nltk` (the Natural Language Toolkit), a popular Python library for natural language processing, provides a more advanced tokenization function, `word_tokenize`.

In [None]:
import nltk
# Import the word_tokenize function from NLTK's tokenize module.
# It splits text into words and handles punctuation more accurately than the basic split() method.
from nltk.tokenize import word_tokenize
# Download the pre-trained Punkt models. word_tokenize first splits the text into sentences with Punkt,
# then splits each sentence into words, so it needs these models to work.
# Note: newer NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')
# Store the example string in the variable `sentence`.
sentence = 'Hello, world! Welcome to NLP.'
tokens = word_tokenize(sentence)   # punctuation marks become separate tokens
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Hello', ',', 'world', '!', 'Welcome', 'to', 'NLP', '.']


## 3. Word Tokenization with `spaCy`
`spaCy` provides tokenization with additional linguistic features.

In [None]:
# spaCy: a library for text processing and analysis.
import spacy
# en_core_web_sm: a small pre-trained English pipeline (tokenization, part-of-speech tagging, named entity recognition).
nlp = spacy.load('en_core_web_sm')

sentence = 'Hello, world! Welcome to NLP.'
# Apply the spaCy pipeline `nlp` to the sentence.
# The nlp object processes the text and returns a Doc object:
# a container for the tokens along with linguistic annotations (part-of-speech tags, dependencies, etc.).
doc = nlp(sentence)

# Iterate over the tokens in the Doc and extract the text attribute (a string) of each one.
tokens = [token.text for token in doc]
print(tokens)

['Hello', ',', 'world', '!', 'Welcome', 'to', 'NLP', '.']


## 4. Subword Tokenization with `transformers`
The Hugging Face `transformers` library provides subword tokenization through the tokenizers of pre-trained models.

Remember: the output depends on the tokenizer's vocabulary and on how it handles punctuation and subword splits.

In [None]:
# Import the AutoTokenizer class from the transformers library.
# AutoTokenizer automatically loads the appropriate tokenizer for a given pre-trained model.
from transformers import AutoTokenizer

# Load the pre-trained tokenizer for the BERT model with the identifier 'bert-base-uncased'.
# BERT stands for Bidirectional Encoder Representations from Transformers; "uncased" means the text is lowercased, so the model does not distinguish between uppercase and lowercase letters.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = 'Hello, world! Welcome to NLP.'

# Tokenize the sentence with the pre-trained BERT tokenizer: it lowercases the text, separates punctuation, and splits words missing from its vocabulary into subwords (continuation pieces are prefixed with '##').
tokens = tokenizer.tokenize(sentence)
print(tokens)

['hello', ',', 'world', '!', 'welcome', 'to', 'nl', '##p', '.']
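
To give a feel for how a split such as `'nl', '##p'` can arise, here is a minimal sketch of greedy longest-match-first subword splitting in the WordPiece style. It is an illustration under stated assumptions, not the tokenizer's actual implementation: the function name `wordpiece` and the tiny vocabulary `toy_vocab` are hypothetical, and the real `bert-base-uncased` vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not BERT's real vocab.txt.
toy_vocab = {'hello', 'world', 'nl', '##p', 'token', '##ization'}

print(wordpiece('nlp', toy_vocab))           # ['nl', '##p']
print(wordpiece('tokenization', toy_vocab))  # ['token', '##ization']
print(wordpiece('xyz', toy_vocab))           # ['[UNK]']
```

Because `'nlp'` is not in the toy vocabulary as a whole word, the longest matching prefix `'nl'` is taken first and the remainder is looked up as the continuation piece `'##p'`.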


## 5. Sentence Tokenization with `nltk`
Split a paragraph into individual sentences using `nltk`.

In [None]:
# Import the sent_tokenize function from NLTK's tokenize module.
# It relies on the Punkt models downloaded in section 2.
from nltk.tokenize import sent_tokenize
paragraph = 'Hello, world! Welcome to NLP. This is an exciting field.'

sentences = sent_tokenize(paragraph)
print(sentences)

['Hello, world!', 'Welcome to NLP.', 'This is an exciting field.']
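
For comparison, a naive rule-based splitter shows why a trained sentence tokenizer is worth using. This is only a sketch: the helper `naive_sent_split` and the second example sentence are made up for illustration.

```python
import re


def naive_sent_split(text):
    # Split at whitespace that directly follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text.strip())


simple = naive_sent_split('Hello, world! Welcome to NLP. This is an exciting field.')
tricky = naive_sent_split('Dr. Smith joined the lab. She studies NLP.')

print(simple)  # ['Hello, world!', 'Welcome to NLP.', 'This is an exciting field.']
print(tricky)  # ['Dr.', 'Smith joined the lab.', 'She studies NLP.']
```

The rule works on the simple paragraph but wrongly breaks after the abbreviation `Dr.`. Trained tokenizers such as Punkt are designed to cope with cases like this.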


## 6. Sentence Tokenization with `spaCy`
Tokenize a paragraph into sentences using `spaCy`.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm') # pretrained model
paragraph = 'Hello, world! Welcome to NLP. This is an exciting field.'
doc = nlp(paragraph)
sentences = [sent.text for sent in doc.sents]
print(sentences)

['Hello, world!', 'Welcome to NLP.', 'This is an exciting field.']


## 7. Custom Tokenization with Regular Expressions
Use regular expressions (the `re` module) to customize tokenization.

In [None]:
import re
sentence = 'Hello, world! Welcome to NLP.'
# re.findall searches the sentence for all non-overlapping substrings that match the pattern r'\b\w+\b'.
# \b: a word boundary, so the pattern matches whole words and not parts of words.
# \w+: one or more word characters.
#   \w matches any alphanumeric character (letters and digits) and the underscore.
#   + means "one or more" of these characters.
# Together, \b\w+\b matches whole words delimited by word boundaries.
tokens = re.findall(r'\b\w+\b', sentence)
print(tokens)   # punctuation is dropped; only the words are captured

['Hello', 'world', 'Welcome', 'to', 'NLP']
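
The pattern can be adapted to the task. The sketch below compares three variants on a made-up sentence containing a contraction; the second and third patterns are illustrative choices, not standard ones.

```python
import re

text = "Hello, world! Don't panic."

# Words only: the apostrophe is not a word character, so "Don't" is split in two.
words_only = re.findall(r'\b\w+\b', text)

# Words or single punctuation marks: \w+ matches a run of word characters,
# [^\w\s] matches one character that is neither a word character nor whitespace.
with_punct = re.findall(r'\w+|[^\w\s]', text)

# Same as above, but an apostrophe followed by word characters stays inside the word.
with_contractions = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(words_only)         # ['Hello', 'world', 'Don', 't', 'panic']
print(with_punct)         # ['Hello', ',', 'world', '!', 'Don', "'", 't', 'panic', '.']
print(with_contractions)  # ['Hello', ',', 'world', '!', "Don't", 'panic', '.']
```

Which variant is appropriate depends on whether punctuation and contractions matter for the downstream task.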
