### Tokenization:

1. Tokenization is a crucial process in Natural Language Processing (NLP) that breaks a piece of text into smaller units called tokens.
2. These tokens can be words, parts of words, or even single characters such as punctuation marks.
3. The primary goal of tokenization is to represent text in a form that is meaningful to machines without losing its context.
4. It turns an unstructured string (a text document) into a sequence of discrete tokens, which can then be mapped to a numerical data structure suitable for machine learning.
5. The motivation behind tokenization is to give the computer a finite set of symbols that it can combine to produce the desired result.
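
As a rough illustration of points 4 and 5, the sketch below maps tokens to integer ids using plain Python. The helper name `encode` and the sample text are made up for this example, and splitting on whitespace stands in for a real tokenizer.

```python
def encode(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)   # next unused id
        ids.append(vocab[tok])
    return ids, vocab


tokens = "to be or not to be".split()   # simple whitespace tokenization
ids, vocab = encode(tokens)

print(tokens)
print(ids)
print(vocab)
```

Repeated tokens receive the same id, so the text becomes a sequence over a finite set of symbols.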

***All imports are taken from `nltk.tokenize`:  from nltk.tokenize import ......***

In [1]:
sents = '''Hello friends!
How are you? Welcome to python programming.'''

### 1. Sentence Tokenization:

In [2]:
from nltk.tokenize import sent_tokenize

sent_token = sent_tokenize(sents)   # returns a list of the sentences in the text

sent_token

['Hello friends!', 'How are you?', 'Welcome to python programming.']

### 2. Word Tokenization:

In [3]:
from nltk.tokenize import word_tokenize

word_token = word_tokenize(sents)  # returns a list of words; punctuation marks become separate tokens

word_token

['Hello',
 'friends',
 '!',
 'How',
 'are',
 'you',
 '?',
 'Welcome',
 'to',
 'python',
 'programming',
 '.']

### 3. White Space Tokenization:

In [4]:
from nltk.tokenize import WhitespaceTokenizer

white_token = WhitespaceTokenizer()   # tokenizer object that splits on any whitespace (spaces, tabs, newlines)

white_token.tokenize(sents)

['Hello',
 'friends!',
 'How',
 'are',
 'you?',
 'Welcome',
 'to',
 'python',
 'programming.']

### 4. Space Tokenization:

In [5]:
from nltk.tokenize import SpaceTokenizer

sp_token = SpaceTokenizer()

sp_token.tokenize(sents)

['Hello',
 'friends!\nHow',
 'are',
 'you?',
 'Welcome',
 'to',
 'python',
 'programming.']
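
The token `'friends!\nHow'` appears because the space tokenizer splits only on the space character, so the newline stays inside the token. The sketch below reproduces both behaviours with plain `str.split`, which the NLTK documentation describes as equivalent for these two tokenizers. The variable names are chosen for this example.

```python
sents = '''Hello friends!
How are you? Welcome to python programming.'''

# Split on the space character only: the newline is kept inside a token.
space_only = sents.split(' ')

# Split on any run of whitespace (spaces, tabs, newlines).
any_whitespace = sents.split()

print(space_only)
print(any_whitespace)
```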

### 5. Line Tokenization:

In [6]:
from nltk.tokenize import LineTokenizer

ln_token = LineTokenizer()    # splits the text at each newline ('\n') and returns a list of lines

ln_token.tokenize(sents)

['Hello friends!', 'How are you? Welcome to python programming.']

### 6. Tab Tokenization:

In [7]:
sent1 = '''Hello \tfriends!
How are you? Welcome to\t python programming.'''

from nltk.tokenize import TabTokenizer

tab_token = TabTokenizer()    # returns a list of the tab-separated parts

tab_token.tokenize(sent1)

['Hello ', 'friends!\nHow are you? Welcome to', ' python programming.']

### 7. Multi-word Tokenization:

In [8]:
sent = 'Van Rossom is in pune today. We welcomed Van Rossom here'

from nltk.tokenize import MWETokenizer

mwt_token = MWETokenizer(separator = ' ')

mwt_token.add_mwe(('Van', 'Rossom'))

token = mwt_token.tokenize(word_tokenize(sent))

token

['Van Rossom',
 'is',
 'in',
 'pune',
 'today',
 '.',
 'We',
 'welcomed',
 'Van Rossom',
 'here']
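
To make the merging step concrete, here is a minimal pure-Python sketch of the idea. It is not NLTK's implementation: `merge_mwe` is a hypothetical helper written for this example, and it handles a single multi-word expression only.

```python
def merge_mwe(tokens, mwe, separator=" "):
    """Join every occurrence of the token sequence `mwe` into one token."""
    mwe = tuple(mwe)
    n = len(mwe)
    merged = []
    i = 0
    while i < len(tokens):
        if n > 0 and tuple(tokens[i:i + n]) == mwe:
            merged.append(separator.join(mwe))   # emit the merged expression
            i += n                               # skip the words just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


words = ['Van', 'Rossom', 'is', 'in', 'pune', 'today', '.']
print(merge_mwe(words, ('Van', 'Rossom')))
```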

### 8. Tweet Tokenizer:

In [9]:
from nltk.tokenize import TweetTokenizer

sent2 = '''Hello friends! 😂 :!
How are you? Welcome to ☠️python programming 🫀 🫁 🧠.:D
Check my web: https://python.org'''

tk = TweetTokenizer()

tk.tokenize(sent2)    # returns a list of tokens, keeping emojis, emoticons and URLs as tokens

['Hello',
 'friends',
 '!',
 '😂',
 ':',
 '!',
 'How',
 'are',
 'you',
 '?',
 'Welcome',
 'to',
 '☠',
 '️python',
 'programming',
 '🫀',
 '🫁',
 '🧠',
 '.',
 ':D',
 'Check',
 'my',
 'web',
 ':',
 'https://python.org']

### 9. How to make our own tokenizer:

In [10]:
import re

def custom_tokenizer(text):
    return re.split(r"[.,;?!\s]+",text)    # splits wherever one or more of the listed characters occur; trailing punctuation leaves an empty string at the end

In [11]:
custom_tokenizer(sents)

['Hello',
 'friends',
 'How',
 'are',
 'you',
 'Welcome',
 'to',
 'python',
 'programming',
 '']
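
The empty string at the end of the list comes from the final full stop: `re.split` returns whatever follows the last delimiter, which here is nothing. One possible variant, sketched below, matches the tokens themselves with `re.findall` instead of splitting on the delimiters. The name `custom_tokenizer_no_empty` is made up for this example.

```python
import re


def custom_tokenizer_no_empty(text):
    # Match runs of characters that are NOT delimiters, so no empty strings appear.
    return re.findall(r"[^.,;?!\s]+", text)


sents = '''Hello friends!
How are you? Welcome to python programming.'''

print(custom_tokenizer_no_empty(sents))
```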

Thank you :)