### Tokenization

Tokenization is the process of breaking text or data into smaller units called tokens (like words, subwords, or characters).
It's used in NLP to prepare text for analysis, and in data security to replace sensitive values with safe placeholders.
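
As a rough illustration of both senses, here is a minimal standard-library sketch. The regular expressions are simplified stand-ins rather than a library tokenizer, and `vault` and `mask_emails` are hypothetical names introduced only for this example.

```python
import re

# NLP sense: split text into word-level and character-level tokens.
sentence = "NLP turns raw text into tokens."
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)         # words or single punctuation marks
char_tokens = [ch for ch in sentence if not ch.isspace()]  # every non-space character

# Security sense: swap sensitive values for opaque placeholders.
# `vault` is a hypothetical in-memory mapping from placeholder to original value.
vault = {}

def mask_emails(text):
    """Replace each e-mail address with a numbered placeholder."""
    def replace(match):
        placeholder = f"<EMAIL_{len(vault) + 1}>"
        vault[placeholder] = match.group(0)
        return placeholder
    # Simplified e-mail pattern, good enough for this illustration only.
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", replace, text)

masked = mask_emails("Contact alice@example.com or bob@example.org.")

print(word_tokens)
print(char_tokens[:3])
print(masked)
```

The word-level list keeps the final period as its own token, while the masked string no longer contains either address; the originals can be recovered only through `vault`.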

- Corpus: A large collection of text data used for analysis or training NLP models.
- Document: A single text file or piece of content within a corpus.
- Paragraph: A block of related sentences within a document.
- Sentence: A grammatical unit expressing a complete thought.
- Token: The smallest unit of text (word, subword, or symbol) after tokenization.
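
The hierarchy in the list above can be sketched with plain Python. This is a minimal illustration under stated assumptions: the file names and text are made up, `build_hierarchy` is a hypothetical helper, and the regex-based sentence split is far cruder than a trained sentence tokenizer.

```python
import re

# A tiny hypothetical corpus: document name -> raw text.
corpus = {
    "doc1.txt": "Tokens are small units. They come from sentences.\n\nParagraphs group sentences.",
    "doc2.txt": "A corpus holds many documents.",
}

def build_hierarchy(corpus):
    """Return {document: [paragraph: [sentence: [token, ...]]]}."""
    hierarchy = {}
    for name, document in corpus.items():
        paragraphs = []
        # Paragraphs: blocks separated by one or more blank lines.
        for paragraph in re.split(r"\n\s*\n", document.strip()):
            sentences = []
            # Sentences: naive split after ., ! or ? followed by whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                # Tokens: runs of word characters or single punctuation marks.
                sentences.append(re.findall(r"\w+|[^\w\s]", sentence))
            paragraphs.append(sentences)
        hierarchy[name] = paragraphs
    return hierarchy

hierarchy = build_hierarchy(corpus)
for name, paragraphs in hierarchy.items():
    print(name, "->", len(paragraphs), "paragraph(s)")
    for sentences in paragraphs:
        for tokens in sentences:
            print("   ", tokens)
```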


**NLTK** is a Python library focused on teaching and research in NLP, offering many algorithms and datasets.

**spaCy** is a fast, production-ready NLP library with efficient pipelines and pretrained models. 

In [None]:
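## Install the nltk library (run once)
# pip install nltk
## sent_tokenize and word_tokenize also need the Punkt models (run once):
# import nltk; nltk.download("punkt")   # newer NLTK releases may ask for "punkt_tab" instead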

In [1]:
## Let's create a corpus
corpus = """Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. 
The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP encompasses a variety of tasks, 
including text analysis, sentiment analysis, machine translation, speech recognition, and chatbot development. By leveraging techniques from linguistics, computer science, and machine learning, 
NLP aims to bridge the gap between human communication and computer understanding, enabling more intuitive and efficient interactions with technology."""
print(corpus)


Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. 
The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. NLP encompasses a variety of tasks, 
including text analysis, sentiment analysis, machine translation, speech recognition, and chatbot development. By leveraging techniques from linguistics, computer science, and machine learning, 
NLP aims to bridge the gap between human communication and computer understanding, enabling more intuitive and efficient interactions with technology.


In [6]:
## Tokenization
## Paragraph into sentences
from nltk.tokenize import sent_tokenize         ## splits text into sentences
documents = sent_tokenize(corpus)
## sent_tokenize returns a list of sentences, so loop over it to print each one
for sentence in documents:
    print(sentence)

Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language.
The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
NLP encompasses a variety of tasks, 
including text analysis, sentiment analysis, machine translation, speech recognition, and chatbot development.
By leveraging techniques from linguistics, computer science, and machine learning, 
NLP aims to bridge the gap between human communication and computer understanding, enabling more intuitive and efficient interactions with technology.
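
Some sentences in the output above span two lines because the triple-quoted `corpus` string contains literal newlines, and they are carried into the sentence strings. One possible fix, sketched here on a shortened stand-in for the corpus, is to collapse all whitespace before tokenizing. The variable `normalized` is a name introduced for this example.

```python
# A shortened stand-in for the corpus, with the same kind of embedded line breaks.
corpus = """NLP encompasses a variety of tasks,
including text analysis and machine translation. By leveraging machine learning,
NLP aims to bridge the gap."""

# str.split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines), so joining the pieces with one space flattens the text.
normalized = " ".join(corpus.split())
print(normalized)
```

Passing `normalized` instead of `corpus` to `sent_tokenize` should then yield one sentence per printed line.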


In [None]:
## Paragraph into words
from nltk.tokenize import word_tokenize         ## to tokenize words
words = word_tokenize(corpus)
print(words)
print("----------------------------------------------------------------------------")

for sentence in documents:
    print(word_tokenize(sentence))

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '(', 'AI', ')', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'natural', 'language', '.', 'The', 'ultimate', 'goal', 'of', 'NLP', 'is', 'to', 'enable', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'way', 'that', 'is', 'both', 'meaningful', 'and', 'useful', '.', 'NLP', 'encompasses', 'a', 'variety', 'of', 'tasks', ',', 'including', 'text', 'analysis', ',', 'sentiment', 'analysis', ',', 'machine', 'translation', ',', 'speech', 'recognition', ',', 'and', 'chatbot', 'development', '.', 'By', 'leveraging', 'techniques', 'from', 'linguistics', ',', 'computer', 'science', ',', 'and', 'machine', 'learning', ',', 'NLP', 'aims', 'to', 'bridge', 'the', 'gap', 'between', 'human', 'communication', 'and', 'computer', 'understanding', ',', 'enabling', 'more', 'intuitive', 'an

In [10]:
from nltk.tokenize import wordpunct_tokenize           ## splits runs of word characters from runs of punctuation

words_punct = wordpunct_tokenize(corpus)
print(words_punct)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '(', 'AI', ')', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'natural', 'language', '.', 'The', 'ultimate', 'goal', 'of', 'NLP', 'is', 'to', 'enable', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'way', 'that', 'is', 'both', 'meaningful', 'and', 'useful', '.', 'NLP', 'encompasses', 'a', 'variety', 'of', 'tasks', ',', 'including', 'text', 'analysis', ',', 'sentiment', 'analysis', ',', 'machine', 'translation', ',', 'speech', 'recognition', ',', 'and', 'chatbot', 'development', '.', 'By', 'leveraging', 'techniques', 'from', 'linguistics', ',', 'computer', 'science', ',', 'and', 'machine', 'learning', ',', 'NLP', 'aims', 'to', 'bridge', 'the', 'gap', 'between', 'human', 'communication', 'and', 'computer', 'understanding', ',', 'enabling', 'more', 'intuitive', 'an
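
On this corpus the output looks the same as `word_tokenize`, because the text has no contractions or decimal numbers. `wordpunct_tokenize` is documented as a regular-expression tokenizer using the pattern `\w+|[^\w\s]+`, so the difference can be sketched with the standard library alone. The sample sentence below is made up for illustration.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer:
# a run of word characters, or a run of non-word, non-space characters.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

text = "Don't pay $3.50 for it!"
tokens = re.findall(WORDPUNCT_PATTERN, text)
print(tokens)
# ['Don', "'", 't', 'pay', '$', '3', '.', '50', 'for', 'it', '!']
```

The contraction and the price are both broken at every punctuation mark. `word_tokenize` typically handles these cases differently, for example by keeping `3.50` together, so the choice between the two depends on how such tokens should be treated downstream.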

In [None]:
from nltk.tokenize import TreebankWordTokenizer           ## Treebank tokenizer expects one sentence at a time: only the final period is split off, earlier ones stay attached (e.g. 'language.')
tokenizer = TreebankWordTokenizer()
treebank_tokens = tokenizer.tokenize(corpus)
print(treebank_tokens)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '(', 'AI', ')', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'natural', 'language.', 'The', 'ultimate', 'goal', 'of', 'NLP', 'is', 'to', 'enable', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'way', 'that', 'is', 'both', 'meaningful', 'and', 'useful.', 'NLP', 'encompasses', 'a', 'variety', 'of', 'tasks', ',', 'including', 'text', 'analysis', ',', 'sentiment', 'analysis', ',', 'machine', 'translation', ',', 'speech', 'recognition', ',', 'and', 'chatbot', 'development.', 'By', 'leveraging', 'techniques', 'from', 'linguistics', ',', 'computer', 'science', ',', 'and', 'machine', 'learning', ',', 'NLP', 'aims', 'to', 'bridge', 'the', 'gap', 'between', 'human', 'communication', 'and', 'computer', 'understanding', ',', 'enabling', 'more', 'intuitive', 'and', 'efficie
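
In the output above, tokens such as `'language.'` and `'useful.'` keep their period because the whole corpus was passed in as if it were a single sentence. The sketch below imitates that one behaviour with a simplified stand-in (`split_final_period` is a hypothetical helper, not the NLTK implementation) and shows how tokenizing sentence by sentence changes the result.

```python
import re

def split_final_period(text):
    """Simplified stand-in: detach a period only at the very end of the input."""
    tokens = text.split()
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens

text = "Tokens are small units. They come from sentences."

# Whole text at once: the first period stays attached to "units."
whole = split_final_period(text)

# One sentence at a time: every sentence-final period becomes its own token.
sentences = re.split(r"(?<=[.!?])\s+", text)
per_sentence = [tok for s in sentences for tok in split_final_period(s)]

print(whole)
print(per_sentence)
```

In the notebook, the corresponding approach would be to call `tokenizer.tokenize(sentence)` for each item returned by `sent_tokenize(corpus)` rather than on `corpus` directly.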