# NLP: Intro to Tokenization
> Description:
  * Tokenization is the process of breaking a piece of text into smaller units called tokens. A token can be a word, a phrase, a sentence, or an individual character. Tokenization is an essential early step in natural language processing (NLP) because it converts unstructured text into discrete units that later stages can count, index, and analyze.
  * Tokenization can be performed with regular expressions, rule-based methods, or machine learning algorithms. The right technique depends on the task and on the complexity of the text being analyzed.
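
As a rough illustration of the regular-expression approach mentioned above, the sketch below tokenizes a sentence with Python's built-in `re` module. The function name `regex_tokenize` and the pattern are assumptions chosen for this example and are not part of NLTK. The pattern keeps decimal numbers and hyphenated words together and emits each punctuation mark as its own token.

```python
import re

# Hypothetical pattern for illustration only; alternatives are tried left to right:
#   \d+(?:\.\d+)?   integers or decimals such as 2.7
#   \w+(?:-\w+)*    words, including hyphenated ones such as horn-shaped
#   [^\w\s]         any single punctuation character
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]")


def regex_tokenize(text):
    """Return the list of non-overlapping pattern matches found in text."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("The CMB is faint, at just 2.7 Kelvin."))
```

A pattern this simple has limits: it splits `100,000` into three tokens, for example. Cases like that are one reason to reach for a rule-based or trained tokenizer on real text.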

In [1]:
# Import the libraries
import nltk
from nltk.tokenize import word_tokenize

In [5]:
# Download the Punkt tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
# Create the paragraph that will be tokenized.

paragraph = '''The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.

The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.

The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expansion to form the large-scale structures we see today, such as galaxies and clusters of galaxies.

Studying the CMB has been crucial to our understanding of the universe and its evolution. It has provided strong evidence for the Big Bang theory, as well as for the existence of dark matter and dark energy. It has also allowed astronomers to measure the age, size, and composition of the universe with unprecedented accuracy.

In recent years, the study of the CMB has entered a new era, with a number of high-precision experiments, such as the Planck satellite and the Atacama Cosmology Telescope, providing even more detailed maps of the CMB and shedding light on some of the universe's deepest mysteries.
'''

### Tokenization can be performed in several different ways:
* word tokenization
* character tokenization
* sentence tokenization

In [6]:
# Sentence tokenization

# Split the paragraph into a list of sentences
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

['The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe.', 'It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.', 'The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey.', 'They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky.', 'After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.', 'The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius).', 'However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000.', 'These tiny fluctuations are thought to be the result of slight density variations in the early univers
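
To see why sentence tokenization is more than splitting on periods, the sketch below uses a naive regular-expression splitter written for this example (`naive_sent_split` is a hypothetical helper, not an NLTK function, and both input sentences are made up). It works on simple text but breaks after an abbreviation such as `Dr.`, which is the kind of case a trained sentence tokenizer is designed to handle.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustration only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Simple text: the split lands on the sentence boundary.
print(naive_sent_split("The CMB was discovered in 1964. It is very faint."))

# An abbreviation fools the naive rule: "Dr." is returned as its own sentence.
print(naive_sent_split("Dr. Wilson worked at Bell Labs."))
```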

In [8]:
# Word tokenization

# Split the paragraph into a list of words and punctuation marks
tokens = word_tokenize(paragraph)
print(tokens)

['The', 'Cosmic', 'Microwave', 'Background', '(', 'CMB', ')', 'is', 'a', 'form', 'of', 'electromagnetic', 'radiation', 'that', 'pervades', 'the', 'entire', 'universe', '.', 'It', 'is', 'thought', 'to', 'be', 'the', 'afterglow', 'of', 'the', 'Big', 'Bang', ',', 'the', 'event', 'that', 'marks', 'the', 'beginning', 'of', 'the', 'universe', 'as', 'we', 'know', 'it', '.', 'The', 'CMB', 'was', 'first', 'discovered', 'in', '1964', 'by', 'two', 'radio', 'astronomers', ',', 'Arno', 'Penzias', 'and', 'Robert', 'Wilson', ',', 'who', 'were', 'working', 'at', 'Bell', 'Labs', 'in', 'New', 'Jersey', '.', 'They', 'were', 'using', 'a', 'large', 'horn-shaped', 'antenna', 'to', 'study', 'radio', 'waves', 'emitted', 'by', 'the', 'Milky', 'Way', ',', 'but', 'they', 'kept', 'detecting', 'a', 'mysterious', 'signal', 'that', 'seemed', 'to', 'be', 'coming', 'from', 'all', 'directions', 'in', 'the', 'sky', '.', 'After', 'ruling', 'out', 'a', 'number', 'of', 'possible', 'explanations', ',', 'they', 'realized', '
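
The output above shows that `word_tokenize` returns punctuation marks such as `(`, `)`, `,` and `.` as separate tokens. If only words are wanted, one possible follow-up is to filter them out. The sketch below does this on the first few tokens copied from the output. The helper name `keep_word_tokens` is an assumption made for this example.

```python
def keep_word_tokens(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# First tokens copied from the word_tokenize output
sample = ['The', 'Cosmic', 'Microwave', 'Background', '(', 'CMB', ')', 'is', 'a', 'form']
print(keep_word_tokens(sample))
```

Checking for any alphanumeric character, rather than requiring the whole token to be alphanumeric, keeps hyphenated tokens such as `horn-shaped`.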

In [10]:
# Character tokenization

# list() splits a string into its individual characters, including spaces and newlines
tokens = list(paragraph)
print(tokens)

['T', 'h', 'e', ' ', 'C', 'o', 's', 'm', 'i', 'c', ' ', 'M', 'i', 'c', 'r', 'o', 'w', 'a', 'v', 'e', ' ', 'B', 'a', 'c', 'k', 'g', 'r', 'o', 'u', 'n', 'd', ' ', '(', 'C', 'M', 'B', ')', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'o', 'r', 'm', ' ', 'o', 'f', ' ', 'e', 'l', 'e', 'c', 't', 'r', 'o', 'm', 'a', 'g', 'n', 'e', 't', 'i', 'c', ' ', 'r', 'a', 'd', 'i', 'a', 't', 'i', 'o', 'n', ' ', 't', 'h', 'a', 't', ' ', 'p', 'e', 'r', 'v', 'a', 'd', 'e', 's', ' ', 't', 'h', 'e', ' ', 'e', 'n', 't', 'i', 'r', 'e', ' ', 'u', 'n', 'i', 'v', 'e', 'r', 's', 'e', '.', ' ', 'I', 't', ' ', 'i', 's', ' ', 't', 'h', 'o', 'u', 'g', 'h', 't', ' ', 't', 'o', ' ', 'b', 'e', ' ', 't', 'h', 'e', ' ', 'a', 'f', 't', 'e', 'r', 'g', 'l', 'o', 'w', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'B', 'i', 'g', ' ', 'B', 'a', 'n', 'g', ',', ' ', 't', 'h', 'e', ' ', 'e', 'v', 'e', 'n', 't', ' ', 't', 'h', 'a', 't', ' ', 'm', 'a', 'r', 'k', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 'g', 'i', 'n', 'n', 'i', 'n', 'g', ' ', 'o', 'f',
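
Because `list()` keeps every character, the token list above also contains the spaces between words, and line breaks in the paragraph become `'\n'` tokens. The following is a minimal sketch, assuming whitespace is not wanted, that drops it and counts the remaining characters with `collections.Counter` from the standard library. `char_tokens` and its `keep_whitespace` parameter are hypothetical names introduced for this example.

```python
from collections import Counter


def char_tokens(text, keep_whitespace=False):
    """Split text into single characters, optionally dropping whitespace."""
    return [ch for ch in text if keep_whitespace or not ch.isspace()]


tokens = char_tokens("The CMB\nis faint.")
print(tokens)

# Frequency of each remaining character, most common first
print(Counter(tokens).most_common(3))
```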