<a href="https://colab.research.google.com/github/Danalmestadi/T5-Week-seven/blob/main/NLU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization Techniques in Python
This notebook demonstrates several tokenization techniques in Python: whitespace splitting with `split()`, word tokenization with `nltk` and `spaCy`, subword tokenization with `transformers`, sentence tokenization, and custom tokenization with regular expressions.

## 1. Basic Word Tokenization with `split()`
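Called with no arguments, `str.split()` splits a string on runs of whitespace. Punctuation stays attached to the neighboring word, as in `'Hello,'` in the output below.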

In [None]:
sentence = 'Hello, world! Welcome to NLP.'
tokens = sentence.split()
print(tokens)

['Hello,', 'world!', 'Welcome', 'to', 'NLP.']
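
To make the whitespace behavior concrete, here is a minimal sketch comparing `split()` with `split(' ')`. The `messy` string is a made-up input for illustration and is not part of the original notebook.

```python
# Hypothetical input with repeated spaces, a tab, and a newline.
messy = 'Hello,   world!\tWelcome\nto NLP.'

# With no argument, split() treats any run of whitespace as one separator.
default_tokens = messy.split()
print(default_tokens)
# ['Hello,', 'world!', 'Welcome', 'to', 'NLP.']

# With an explicit ' ' separator, every single space is a split point, so
# repeated spaces produce empty strings and the tab and newline are kept.
space_tokens = messy.split(' ')
print(space_tokens)
# ['Hello,', '', '', 'world!\tWelcome\nto', 'NLP.']
```

For plain text, the no-argument form is usually the more forgiving choice, since it does not emit empty tokens.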


## 2. Word Tokenization with `nltk`
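`nltk` provides `word_tokenize`, which separates punctuation into its own tokens. It relies on the Punkt tokenizer data, downloaded here with `nltk.download('punkt')`; newer NLTK releases may ask for the `punkt_tab` resource instead.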

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
sentence = 'Hello, world! Welcome to NLP.'
tokens = word_tokenize(sentence)
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Hello', ',', 'world', '!', 'Welcome', 'to', 'NLP', '.']



## 3. Word Tokenization with `spaCy`
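`spaCy` tokenizes text as the first step of its processing pipeline. Each token in the resulting `Doc` is an object that carries linguistic attributes, such as its part-of-speech tag and lemma, in addition to its text.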

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = 'Hello, world! Welcome to NLP.'
doc = nlp(sentence)
tokens = [token.text for token in doc]
print(tokens)

['Hello', ',', 'world', '!', 'Welcome', 'to', 'NLP', '.']



## 4. Subword Tokenization with `transformers`
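The `transformers` library provides the tokenizers that ship with pre-trained models. The `bert-base-uncased` tokenizer lowercases its input and splits words that are not in its vocabulary into subword pieces; a `##` prefix marks a piece that continues the previous one, as in `'nl', '##p'` in the output below.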

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence = 'Hello, world! Welcome to NLP.'
tokens = tokenizer.tokenize(sentence)
print(tokens)

['hello', ',', 'world', '!', 'welcome', 'to', 'nl', '##p', '.']
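
The split of `nlp` into `nl` and `##p` can be illustrated with a small sketch of greedy longest-match-first matching, the strategy BERT-style WordPiece tokenizers apply to each word. This is a simplified illustration, not the library's implementation: the `wordpiece` function and `toy_vocab` are hypothetical names, and the vocabulary is a tiny stand-in for the real `vocab.txt`.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not BERT's real vocabulary).
toy_vocab = {'hello', 'world', 'welcome', 'to', 'nl', '##p'}

print(wordpiece('nlp', toy_vocab))      # ['nl', '##p']
print(wordpiece('welcome', toy_vocab))  # ['welcome']
print(wordpiece('xyz', toy_vocab))      # ['[UNK]']
```

Because `nlp` is not in the vocabulary as a whole word, the longest matching prefix `nl` is taken first and the remainder is matched as the continuation piece `##p`.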



## 5. Sentence Tokenization with `nltk`
`sent_tokenize` splits a paragraph into sentences. It uses the same Punkt data downloaded in section 2.

In [None]:
from nltk.tokenize import sent_tokenize
paragraph = 'Hello, world! Welcome to NLP. This is an exciting field.'
sentences = sent_tokenize(paragraph)
print(sentences)

['Hello, world!', 'Welcome to NLP.', 'This is an exciting field.']
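
For comparison, here is a naive rule-based splitter written with the standard library only. It is a baseline sketch, not how `sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper, and the second input is a made-up example showing where a simple rule breaks down.

```python
import re


def naive_sent_split(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text.strip())


paragraph = 'Hello, world! Welcome to NLP. This is an exciting field.'
print(naive_sent_split(paragraph))
# ['Hello, world!', 'Welcome to NLP.', 'This is an exciting field.']

# Hypothetical input: the rule wrongly splits after the abbreviation 'Dr.'.
print(naive_sent_split('Dr. Smith arrived. He sat down.'))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

Abbreviations like this are the kind of case that trained sentence tokenizers such as Punkt are designed to handle.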



## 6. Sentence Tokenization with `spaCy`
`spaCy` exposes sentence boundaries through `doc.sents` once the pipeline has processed the text.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
paragraph = 'Hello, world! Welcome to NLP. This is an exciting field.'
doc = nlp(paragraph)
sentences = [sent.text for sent in doc.sents]
print(sentences)

['Hello, world!', 'Welcome to NLP.', 'This is an exciting field.']


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Same paragraph, but with no space after the first comma.
paragraph = 'Hello,world! Welcome to NLP. This is an exciting field.'
doc = nlp(paragraph)
sentences = [sent.text for sent in doc.sents]
print(sentences)

['Hello,world!', 'Welcome to NLP.', 'This is an exciting field.']


## 7. Custom Tokenization with Regular Expressions
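Regular expressions give full control over what counts as a token. The pattern `\b\w+\b` used here matches runs of word characters (letters, digits, and underscores), so punctuation is dropped.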

In [None]:
import re
sentence = 'Hello, world! Welcome to NLP.'
tokens = re.findall(r'\b\w+\b', sentence)
print(tokens)

['Hello', 'world', 'Welcome', 'to', 'NLP']
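
The pattern can be adjusted to change what is kept. As a small sketch, adding the alternative `[^\w\s]` keeps each punctuation mark as its own token; the contraction input `"don't"` is a made-up example showing a limitation of the original pattern.

```python
import re

sentence = 'Hello, world! Welcome to NLP.'

# \w+ matches a run of word characters; [^\w\s] matches a single character
# that is neither a word character nor whitespace, i.e. punctuation.
tokens_with_punct = re.findall(r'\w+|[^\w\s]', sentence)
print(tokens_with_punct)
# ['Hello', ',', 'world', '!', 'Welcome', 'to', 'NLP', '.']

# The word-only pattern splits a contraction at the apostrophe.
contraction_tokens = re.findall(r'\b\w+\b', "don't")
print(contraction_tokens)
# ['don', 't']
```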

