## **Tokenization**

Tokenization is the process of splitting text into smaller units called tokens. In natural language processing (NLP), tokens are usually sentences, words, punctuation marks, or other meaningful elements. The goal is to turn raw text into discrete units that later steps, such as counting, tagging, or vectorizing, can work with.
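
As a small illustration of why a dedicated tokenizer helps, the sketch below splits a sentence on whitespace only, using plain Python. The `sample` string and the variable names are made up for this example; it is not what NLTK does internally.

```python
# Naive whitespace splitting: punctuation stays glued to the neighbouring word.
sample = "The moon, Earth's only natural satellite, orbits Earth."

naive_tokens = sample.split()
print(naive_tokens)

# Tokens such as 'moon,' and 'Earth.' still carry punctuation, so 'Earth.' and
# 'Earth' would be counted as two different words in any later analysis.
tokens_with_punctuation = [tok for tok in naive_tokens if not tok.isalnum()]
print(tokens_with_punctuation)
```

A tokenizer that separates punctuation from words avoids this problem, which is what the NLTK functions used in the rest of this section do.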

In [1]:
corpus = """The moon, Earth's only natural satellite, orbits at an average distance of 238,855 miles and has a diameter of 3,474 kilometers, making it about 1/4th the size of our planet.
With just 1/6th of Earth's gravitational force, the moon lacks an atmosphere, resulting in a unique environment.
Its surface displays phases like new moon, first quarter, full moon, and last quarter due to its orbital dance with Earth, showcasing craters, mountains, and plains from meteoroid impacts.
The moon's synchronous rotation means the same side always faces Earth, while the far side remains hidden.
Playing a pivotal role in Earth's tides through gravitational interactions, the moon, with lunar regolith covering its surface, left an indelible mark on human history, especially through the successful Apollo missions landing astronauts between 1969 and 1972."""

In [2]:
print(corpus)

The moon, Earth's only natural satellite, orbits at an average distance of 238,855 miles and has a diameter of 3,474 kilometers, making it about 1/4th the size of our planet.
With just 1/6th of Earth's gravitational force, the moon lacks an atmosphere, resulting in a unique environment.
Its surface displays phases like new moon, first quarter, full moon, and last quarter due to its orbital dance with Earth, showcasing craters, mountains, and plains from meteoroid impacts.
The moon's synchronous rotation means the same side always faces Earth, while the far side remains hidden.
Playing a pivotal role in Earth's tides through gravitational interactions, the moon, with lunar regolith covering its surface, left an indelible mark on human history, especially through the successful Apollo missions landing astronauts between 1969 and 1972.


In [3]:
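## Tokenization
## Paragraph --> Sentences
## sent_tokenize needs the Punkt sentence models; if they are missing, run
## nltk.download('punkt') (or 'punkt_tab' on newer NLTK releases) once.
from nltk.tokenize import sent_tokenize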

In [4]:
documents = sent_tokenize(corpus)

In [5]:
type(documents), documents

(list,
 ["The moon, Earth's only natural satellite, orbits at an average distance of 238,855 miles and has a diameter of 3,474 kilometers, making it about 1/4th the size of our planet.",
  "With just 1/6th of Earth's gravitational force, the moon lacks an atmosphere, resulting in a unique environment.",
  'Its surface displays phases like new moon, first quarter, full moon, and last quarter due to its orbital dance with Earth, showcasing craters, mountains, and plains from meteoroid impacts.',
  "The moon's synchronous rotation means the same side always faces Earth, while the far side remains hidden.",
  "Playing a pivotal role in Earth's tides through gravitational interactions, the moon, with lunar regolith covering its surface, left an indelible mark on human history, especially through the successful Apollo missions landing astronauts between 1969 and 1972."])

In [6]:
## Tokenization
## Paragraph --> Words
## Sentence --> Words
## word_tokenize accepts either a whole paragraph or a single sentence.
from nltk.tokenize import word_tokenize

In [7]:
word_tokenize(corpus)[:10]

['The',
 'moon',
 ',',
 'Earth',
 "'s",
 'only',
 'natural',
 'satellite',
 ',',
 'orbits']

In [8]:
for sentence in documents:
    print(word_tokenize(sentence))

['The', 'moon', ',', 'Earth', "'s", 'only', 'natural', 'satellite', ',', 'orbits', 'at', 'an', 'average', 'distance', 'of', '238,855', 'miles', 'and', 'has', 'a', 'diameter', 'of', '3,474', 'kilometers', ',', 'making', 'it', 'about', '1/4th', 'the', 'size', 'of', 'our', 'planet', '.']
['With', 'just', '1/6th', 'of', 'Earth', "'s", 'gravitational', 'force', ',', 'the', 'moon', 'lacks', 'an', 'atmosphere', ',', 'resulting', 'in', 'a', 'unique', 'environment', '.']
['Its', 'surface', 'displays', 'phases', 'like', 'new', 'moon', ',', 'first', 'quarter', ',', 'full', 'moon', ',', 'and', 'last', 'quarter', 'due', 'to', 'its', 'orbital', 'dance', 'with', 'Earth', ',', 'showcasing', 'craters', ',', 'mountains', ',', 'and', 'plains', 'from', 'meteoroid', 'impacts', '.']
['The', 'moon', "'s", 'synchronous', 'rotation', 'means', 'the', 'same', 'side', 'always', 'faces', 'Earth', ',', 'while', 'the', 'far', 'side', 'remains', 'hidden', '.']
['Playing', 'a', 'pivotal', 'role', 'in', 'Earth', "'s", 

In [9]:
from nltk.tokenize import wordpunct_tokenize

In [10]:
wordpunct_tokenize(corpus)[:10]

['The', 'moon', ',', 'Earth', "'", 's', 'only', 'natural', 'satellite', ',']
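
The output above splits `Earth's` into three tokens (`Earth`, `'`, `s`) because `wordpunct_tokenize` is a purely regular-expression tokenizer: it matches either a run of word characters or a run of punctuation. The sketch below reproduces that behaviour with the standard `re` module, assuming the pattern `\w+|[^\w\s]+`; the helper name `simple_wordpunct` and the `sample` string are hypothetical.

```python
import re

# Assumed to mirror the WordPunctTokenizer pattern:
# one or more word characters, OR one or more non-word, non-space characters.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_wordpunct(text):
    """Return all word runs and punctuation runs found in text, in order."""
    return WORDPUNCT_PATTERN.findall(text)


sample = "The moon, Earth's only natural satellite, orbits Earth."
tokens = simple_wordpunct(sample)
print(tokens[:10])
```

Because the apostrophe is not a word character, it always becomes its own token, whereas `word_tokenize` keeps the possessive clitic together as `'s`.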

In [11]:
from nltk.tokenize import TreebankWordTokenizer

In [12]:
tokenizer = TreebankWordTokenizer()

In [13]:
tokenizer.tokenize(corpus)[:10]

['The',
 'moon',
 ',',
 'Earth',
 "'s",
 'only',
 'natural',
 'satellite',
 ',',
 'orbits']
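
The first ten tokens above are identical to the `word_tokenize` output, so the difference between the two is not visible here. `TreebankWordTokenizer` assumes its input is already a single sentence, so it only separates the period at the very end of the string. The sketch below uses a made-up two-sentence `text` to show this; it assumes NLTK is installed and needs no extra data downloads.

```python
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical two-sentence input, passed in WITHOUT sentence splitting first.
text = "The moon orbits Earth. It lacks an atmosphere."

treebank_tokens = TreebankWordTokenizer().tokenize(text)
print(treebank_tokens)

# Periods inside the string stay attached to the preceding word;
# only the final period is split off as its own token.
attached = [tok for tok in treebank_tokens if len(tok) > 1 and tok.endswith(".")]
print(attached)
```

To avoid tokens such as `Earth.`, split the text into sentences with `sent_tokenize` first and pass each sentence to the tokenizer separately.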