# NLP - Tokenization

#### Text preprocessing is usually the first step in any NLP project.

#### Common preprocessing steps:
    1. Stop Word Removal
    2. Tokenization
    3. Stemming

## Tokenization

#### Tokenization is the process of breaking a stream of text into words, terms, sentences, symbols, or other meaningful elements called tokens.

### Need for Tokenization
####
    1. Breaking unstructured natural language text into chunks of information that can be treated as discrete elements.
    2. Providing the first step in turning an unstructured string (text document) into a numerical data structure suitable for machine learning, since tokens can then be counted or mapped to ids.
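To make the second point concrete, here is a minimal sketch of how tokens might be mapped to integer ids. The helpers `build_vocab` and `encode` are hypothetical names invented for this illustration, not part of any library, and a real pipeline would typically also handle unknown words.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every token with its integer id."""
    return [vocab[token] for token in tokens]


# Whitespace tokens are used here only to keep the sketch dependency-free.
tokens = "I was born in India in 2000.".split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(vocab)
print(ids)
```

The repeated word `in` receives the same id both times, which is what lets a model treat the two occurrences as the same discrete element.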

#### Two main approaches:
    1. Word Tokenization
    2. Sentence Tokenization

### Tools for Tokenization
    1. NLTK
    2. spaCy
    3. TextBlob
    4. Gensim
    5. Keras

### White Space Tokenization
####
    1. The simplest form of tokenization
    2. Splits the text into words on whitespace, so punctuation stays attached to the neighbouring word

In [1]:
text = "I was born in India in 2000."
print(text.split())

['I', 'was', 'born', 'in', 'India', 'in', '2000.']
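The output above keeps the full stop glued to `2000.`. The following sketch, using only the standard library, shows two details worth knowing: `split()` with no argument collapses runs of whitespace while `split(" ")` does not, and leading or trailing punctuation can be stripped afterwards. Treating that punctuation as noise is an assumption made for this example, not a general rule.

```python
import string

text = "I was born in India in 2000."

# split() with no argument splits on any run of whitespace.
tokens = text.split()

# Assumption: punctuation at the edges of a token is not wanted.
cleaned = [token.strip(string.punctuation) for token in tokens]
print(cleaned)

# split(" ") treats every single space as a separator,
# so consecutive spaces produce empty strings.
spaced = "born  in   India"
print(spaced.split())
print(spaced.split(" "))
```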


## NLTK Word Tokenize

In [2]:
import nltk
import nltk.tokenize as tkzr

In [3]:
nltk.download('punkt')  # recent NLTK releases may ask for 'punkt_tab' instead

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SameerBidi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
text = "Hope, is the only thing stronger than Fear! #Hope #Sam.B"

#### Word Tokenizer

In [5]:
print(tkzr.word_tokenize(text))

['Hope', ',', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'Fear', '!', '#', 'Hope', '#', 'Sam.B']


#### Sentence Tokenizer

In [6]:
print(tkzr.sent_tokenize(text))

['Hope, is the only thing stronger than Fear!', '#Hope #Sam.B']


#### Punctuation Based Tokenizer

In [7]:
print(tkzr.wordpunct_tokenize(text))

['Hope', ',', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'Fear', '!', '#', 'Hope', '#', 'Sam', '.', 'B']
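As far as I know, `wordpunct_tokenize` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, meaning runs of word characters or runs of punctuation. The sketch below applies that pattern with the standard `re` module to show why `Sam.B` is split into three tokens here; it is an illustration of the idea, not a replacement for the NLTK function.

```python
import re

text = "Hope, is the only thing stronger than Fear! #Hope #Sam.B"

# Either a run of word characters, or a run of characters
# that are neither word characters nor whitespace.
pattern = r"\w+|[^\w\s]+"

tokens = re.findall(pattern, text)
print(tokens)
```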


#### Tweet Tokenizer

In [9]:
text = "Don't take cryptocurrency advice from Twitter 😂👌"

tokenizer = tkzr.TweetTokenizer()

print(tokenizer.tokenize(text))

["Don't", 'take', 'cryptocurrency', 'advice', 'from', 'Twitter', '😂', '👌']


#### MWE Tokenizer

In [10]:
text = "Hope, is the only thing stronger than fear! Hunger Games #Hope"

tokenizer = tkzr.MWETokenizer()

# No multi-word expressions are registered yet, so the tokens pass through unchanged
print(tokenizer.tokenize(tkzr.word_tokenize(text)))

# Register "Hunger Games" as one expression; matches are joined with "_" by default
tokenizer.add_mwe(("Hunger", "Games"))

print(tokenizer.tokenize(tkzr.word_tokenize(text)))

['Hope', ',', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'fear', '!', 'Hunger', 'Games', '#', 'Hope']
['Hope', ',', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'fear', '!', 'Hunger_Games', '#', 'Hope']
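To show what the merging step does, here is a simplified pure-Python sketch. The function `merge_mwes` is a hypothetical helper written for this example; it scans the token list and joins any registered expression it finds, and it makes no claim to mirror how NLTK implements `MWETokenizer` internally.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge known multi-word expressions into single tokens."""
    # Try longer expressions first so the longest match wins;
    # empty expressions are skipped to avoid matching nothing forever.
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at position i, keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["fear", "!", "Hunger", "Games", "#", "Hope"]
result = merge_mwes(tokens, [("Hunger", "Games")])
print(result)
```

Because the merge runs on an already tokenized list, the quality of the result depends on the word tokenizer used before it.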
