<a href="https://colab.research.google.com/gist/Melvinchen0404/8ed2215da4ad813d8e24858324b016f5/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Technique 1: Tokenization
**Tokenization** is the process of breaking raw text into smaller units, called **tokens**, for further processing. \
\
**STEP 1:** Install the **Natural Language Toolkit (NLTK)**, an open-source Python library for natural language processing: https://www.nltk.org/ \
**STEP 2:** Import the `nltk` package and download the `punkt` tokenizer models that the tokenization functions rely on. The small `color` class holds the ANSI escape codes used to print bold labels in the outputs
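
To make the idea of a token concrete before turning to NLTK, here is a minimal sketch that uses only the standard library. The `simple_tokenize` helper and its regular expression are illustrative assumptions, not NLTK's algorithm: each run of word characters becomes one token and each punctuation mark becomes its own token.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Stately, plump Buck Mulligan came from the stairhead."
tokens = simple_tokenize(sample)
print(tokens)
# ['Stately', ',', 'plump', 'Buck', 'Mulligan', 'came', 'from', 'the', 'stairhead', '.']
```

Note that punctuation marks come back as separate tokens here, which is why a later filtering step is needed if only words are wanted.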

In [None]:
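import nltk

# Download the Punkt tokenizer models used by word_tokenize and sent_tokenize.
# Assumption: recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

# ANSI escape codes for printing bold labels in the cell output
class color:
    BOLD = '\033[1m'
    END = '\033[0m'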

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**STEP 3:** Import the relevant tokenization functions (`word_tokenize`, `sent_tokenize`) and the `FreqDist` class \
**STEP 4:** Our example text is the first two sentences of James Joyce's *Ulysses* (https://www.gutenberg.org/files/4300/4300-h/4300-h.htm#chap01) \
**STEP 5:** Remove punctuation after **word tokenization** with the string method `.isalnum()`: it returns `True` only when every character in a token is alphanumeric, so a list comprehension that keeps only such tokens drops the punctuation tokens
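
The behaviour of `.isalnum()` in STEP 5 can be checked on its own. The sketch below uses a hand-written token list (an assumption for illustration, not the output of `word_tokenize`) to show that the test applies to each token as a whole rather than to individual characters.

```python
# Hand-written tokens for illustration: words, punctuation, a hyphenated word,
# a contraction fragment and a number.
raw_tokens = ["Stately", ",", "plump", "Buck", ".", "well-known", "n't", "42"]

# str.isalnum() is True only if the string is non-empty and
# every character in it is a letter or a digit.
for token in raw_tokens:
    print(repr(token), token.isalnum())

# Keep only the tokens that pass the test.
kept = [token for token in raw_tokens if token.isalnum()]
print(kept)  # ['Stately', 'plump', 'Buck', '42']
```

Because the filter discards whole tokens, anything containing a hyphen or an apostrophe is dropped along with the punctuation. That is harmless for the *Ulysses* excerpt but worth keeping in mind for other texts.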

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
text = "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air."
tokens1 = word_tokenize(text)  # split the text into word and punctuation tokens
tokens2 = [word for word in tokens1 if word.isalnum()]  # keep only fully alphanumeric tokens
print(color.BOLD + 'Original text: \n' + color.END, text)
print(color.BOLD + 'Tokenized words: \n' + color.END, tokens2)

Original text: 
Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air.
Tokenized words: 
['Stately', 'plump', 'Buck', 'Mulligan', 'came', 'from', 'the', 'stairhead', 'bearing', 'a', 'bowl', 'of', 'lather', 'on', 'which', 'a', 'mirror', 'and', 'a', 'razor', 'lay', 'crossed', 'A', 'yellow', 'dressinggown', 'ungirdled', 'was', 'sustained', 'gently', 'behind', 'him', 'on', 'the', 'mild', 'morning', 'air']


**STEP 6:** The `FreqDist` class counts how many times each token occurs in a list of tokens. Counts are case-sensitive, so `'A'` and `'a'` are tallied separately \
**STEP 7:** Calling `most_common()` returns a list of (token, count) pairs ordered from the most frequent to the least frequent; passing an integer *n* limits the list to the *n* most common tokens
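
`FreqDist` is a subclass of Python's `collections.Counter`, so the counting in STEP 6 and STEP 7 can be sketched with the standard library alone. The token list below is a hand-copied subset of the example tokens, and the lowercasing step is an added assumption that is not part of the original walkthrough; it only illustrates that the counts are case-sensitive.

```python
from collections import Counter

# Hand-copied subset of the filtered tokens from the Ulysses example.
tokens = ["a", "bowl", "of", "lather", "on", "which", "a", "mirror", "and",
          "a", "razor", "A", "yellow", "dressinggown", "on", "the", "mild",
          "morning", "air"]

# Count each distinct token, as FreqDist(tokens) would.
counts = Counter(tokens)
print(counts.most_common(2))        # [('a', 3), ('on', 2)]

# Lowercasing first merges 'A' with 'a'.
counts_lower = Counter(token.lower() for token in tokens)
print(counts_lower.most_common(1))  # [('a', 4)]
```

Whether to fold case before counting depends on the task: it merges sentence-initial words with their lowercase forms, but it also erases the distinction between proper nouns and common words.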

In [None]:
fdist1 = FreqDist(tokens2)
print(color.BOLD + 'Number of distinct and tokenized words (samples) and number of tokenized words (outcomes): \n' + color.END, fdist1)
print(color.BOLD + 'Frequency of each distinct and tokenized word: \n' + color.END, fdist1.most_common())

Number of distinct and tokenized words (samples) and number of tokenized words (outcomes): 
<FreqDist with 32 samples and 36 outcomes>
Frequency of each distinct and tokenized word: 
[('a', 3), ('the', 2), ('on', 2), ('Stately', 1), ('plump', 1), ('Buck', 1), ('Mulligan', 1), ('came', 1), ('from', 1), ('stairhead', 1), ('bearing', 1), ('bowl', 1), ('of', 1), ('lather', 1), ('which', 1), ('mirror', 1), ('and', 1), ('razor', 1), ('lay', 1), ('crossed', 1), ('A', 1), ('yellow', 1), ('dressinggown', 1), ('ungirdled', 1), ('was', 1), ('sustained', 1), ('gently', 1), ('behind', 1), ('him', 1), ('mild', 1), ('morning', 1), ('air', 1)]


**STEP 8:** We can also split the text into sentences with `sent_tokenize`; passing the resulting list to `FreqDist` counts the distinct sentences
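
As a rough picture of what sentence tokenization produces, the sketch below splits the example text with a naive regular expression and counts the words in each sentence. This is a stand-in written for illustration, not the Punkt model that `sent_tokenize` uses; the splitting rule and the `\w+` word pattern are both assumptions.

```python
import re

text = ("Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather "
        "on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, "
        "was sustained gently behind him on the mild morning air.")

# Naive rule: split on whitespace that directly follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Count the words (runs of word characters) in each sentence.
word_counts = [len(re.findall(r"\w+", sentence)) for sentence in sentences]

for number, (sentence, count) in enumerate(zip(sentences, word_counts), start=1):
    print(number, count, sentence)

print(len(sentences), word_counts)  # 2 [22, 14]
```

A rule this simple would also split after abbreviations such as "Mr.", which is one reason to prefer a trained sentence tokenizer for real text.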

In [None]:
sentences = sent_tokenize(text)  # split the text into a list of sentences
fdist2 = FreqDist(sentences)
print(color.BOLD + 'Tokenized sentences: \n' + color.END, sentences)
print(color.BOLD + 'Number of distinct and tokenized sentences (samples) and number of tokenized sentences (outcomes): \n' + color.END, fdist2)

Tokenized sentences: 
['Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.', 'A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air.']
Number of distinct and tokenized sentences (samples) and number of tokenized sentences (outcomes): 
<FreqDist with 2 samples and 2 outcomes>
