### Tokenization

**Tokenization** is the process of breaking text into smaller units called tokens. Depending on the type of tokenization, a token can be a word, a punctuation mark, a sentence, or a single character.

**Purpose of Tokenization:**<br>
Tokenization is a fundamental step in Natural Language Processing (NLP). It prepares raw text for downstream tasks such as:

- Text classification  
- Sentiment analysis  
- Machine translation  

**Types of Tokenization:**<br>
- Word Tokenization:

    Splits text into individual words.  
    Example: "Hello, world!" → ["Hello", ",", "world", "!"]

- Sentence Tokenization:

    Splits text into sentences.  
    Example: "Hello world! How are you?" → ["Hello world!", "How are you?"]

- Character Tokenization:

    Splits text into individual characters.  
    Example: "Hello" → ["H", "e", "l", "l", "o"]
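
The three granularities above can be sketched with the standard library alone. This is a rough illustration, not how NLTK implements its tokenizers: the regular expressions and the helper names (`simple_word_tokenize`, `simple_sentence_tokenize`, `char_tokenize`) are assumptions made for this example, and the sentence splitter will mis-handle abbreviations such as "Dr.".

```python
import re

def simple_word_tokenize(text):
    # Runs of word characters, or single non-space punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentence_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def char_tokenize(text):
    # A string is already a sequence of characters
    return list(text)

print(simple_word_tokenize("Hello, world!"))
print(simple_sentence_tokenize("Hello world! How are you?"))
print(char_tokenize("Hello"))
```

These calls reproduce the three example outputs listed above. The NLTK functions used in the rest of this notebook handle many more edge cases.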

In [1]:
!pip install nltk



In [2]:
corpus = """I am Priyanka Dandale,
I am a AI enthusiast person! 
I am currently working as a data scientist. It's great to learn about AI."""

In [3]:
print(corpus)

I am Priyanka Dandale,
I am a AI enthusiast person! 
I am currently working as a data scientist. It's great to learn about AI.


In [4]:
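# Run once: sent_tokenize and word_tokenize need the Punkt tokenizer models.
# import nltk
# nltk.download('punkt')   # recent NLTK releases may ask for 'punkt_tab' instead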

##### Sentence Tokenization

In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
# Split the corpus (a paragraph) into sentences (documents)
documents = sent_tokenize(corpus)
documents

['I am Priyanka Dandale,\nI am a AI enthusiast person!',
 'I am currently working as a data scientist.',
 "It's great to learn about AI."]

In [7]:
type(documents)

list

In [8]:
for sentence in documents:
    print(sentence)

I am Priyanka Dandale,
I am a AI enthusiast person!
I am currently working as a data scientist.
It's great to learn about AI.
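
The first sentence prints on two lines because `sent_tokenize` keeps the original newline: a comma does not end a sentence, so the line break stays inside the token. If single-line sentences are wanted, one option is to collapse whitespace after splitting. The sketch below hard-codes the `documents` list from the output above so that it runs without NLTK; `normalize_whitespace` is a hypothetical helper name.

```python
# Sentences copied from the sent_tokenize output above
documents = [
    "I am Priyanka Dandale,\nI am a AI enthusiast person!",
    "I am currently working as a data scientist.",
    "It's great to learn about AI.",
]

def normalize_whitespace(sentence):
    # str.split() with no arguments splits on any whitespace run, including newlines
    return " ".join(sentence.split())

clean_documents = [normalize_whitespace(s) for s in documents]
for sentence in clean_documents:
    print(sentence)
```

After this step each sentence occupies exactly one line, which is convenient when writing one document per row to a file.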


##### Word Tokenization

In [9]:
# Split the whole corpus directly into word tokens
from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)
words

['I',
 'am',
 'Priyanka',
 'Dandale',
 ',',
 'I',
 'am',
 'a',
 'AI',
 'enthusiast',
 'person',
 '!',
 'I',
 'am',
 'currently',
 'working',
 'as',
 'a',
 'data',
 'scientist',
 '.',
 'It',
 "'s",
 'great',
 'to',
 'learn',
 'about',
 'AI',
 '.']

In [10]:
# Tokenize each sentence into words
for sentence in documents:
    word_tokens = word_tokenize(sentence)
    print(word_tokens)

['I', 'am', 'Priyanka', 'Dandale', ',', 'I', 'am', 'a', 'AI', 'enthusiast', 'person', '!']
['I', 'am', 'currently', 'working', 'as', 'a', 'data', 'scientist', '.']
['It', "'s", 'great', 'to', 'learn', 'about', 'AI', '.']


In [11]:
from nltk.tokenize import wordpunct_tokenize

# Unlike word_tokenize, wordpunct_tokenize splits "'s" into "'" and "s"
for sent in documents:
    print(wordpunct_tokenize(sent))

['I', 'am', 'Priyanka', 'Dandale', ',', 'I', 'am', 'a', 'AI', 'enthusiast', 'person', '!']
['I', 'am', 'currently', 'working', 'as', 'a', 'data', 'scientist', '.']
['It', "'", 's', 'great', 'to', 'learn', 'about', 'AI', '.']
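
The difference in the last line comes from how the two tokenizers treat punctuation: `word_tokenize` keeps the clitic `'s` together, while `wordpunct_tokenize` splits on every boundary between alphanumeric and punctuation characters. To my knowledge, NLTK's `WordPunctTokenizer` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with `re.findall` to mimic the behaviour without NLTK; the function name `regex_wordpunct` is made up for this example.

```python
import re

def regex_wordpunct(text):
    # Alphanumeric runs, or runs of consecutive non-space punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

print(regex_wordpunct("It's great to learn about AI."))
print(regex_wordpunct("Really?!"))
```

The first call gives the same tokens as the last line of the output above. The second shows that adjacent punctuation marks stay together as one token (`'?!'`), which is worth knowing before counting punctuation.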


##### Character Tokenization

In [14]:
# Input text
text = "Hello, World!"

# Character tokenization: split the string into individual characters
char_tokens = list(text)

# Print the result
print(char_tokens)


['H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!']
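
One caveat: `list(text)` splits on Unicode code points, not on what a reader perceives as characters. An accented letter stored as a base letter plus a combining mark therefore becomes two tokens. A minimal sketch, assuming NFC normalization from the standard `unicodedata` module is acceptable for the text at hand (the sample string and variable names are illustrative only):

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as one accented letter
decomposed = "cafe\u0301"
raw_tokens = list(decomposed)
print(len(raw_tokens))   # the accent is counted as its own token

# NFC composes the pair into the single code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)
nfc_tokens = list(composed)
print(len(nfc_tokens))
```

Normalization only helps where a precomposed form exists. Sequences such as multi-part emoji still span several code points and would need a grapheme-aware splitter.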

