# Tokenization

1. Tokenization is the process of splitting text into smaller units called tokens.
2. Tokens can be words, subwords, sentences, or characters, depending on the method.
3. It converts raw text into a structured format that downstream models can process.
4. It is usually the first step in NLP preprocessing.
5. It is used in tasks such as text classification, sentiment analysis, translation, and chatbots.
6. Consistent tokens give models the units from which they learn context, meaning, and relationships between words.
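
The granularities listed above can be illustrated without any library. The following is a minimal sketch, assuming a simple regular expression stands in for a real word tokenizer; the helper names `char_tokens` and `simple_word_tokens` are hypothetical and not part of NLTK.

```python
import re

def char_tokens(text):
    """Character-level tokens: every non-whitespace character is a token."""
    return [ch for ch in text if not ch.isspace()]

def simple_word_tokens(text):
    """Word-level tokens: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "AI helps doctors."
print(char_tokens(sample))         # one token per character
print(simple_word_tokens(sample))  # words plus the final full stop
```

Note that this regex splits a hyphenated word such as `AI-powered` into three tokens, so its results will not always match those of a dedicated tokenizer.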


In [2]:
!pip install nltk



In [3]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

text="""Artificial Intelligence is transforming healthcare. It helps doctors diagnose diseases early.
 Moreover,AI enables personalised treatment for each patient.

This shift can improve outcomes and reduces costs. Many hospitals now rely on AI-powered systems.
"""

# Word tokenizer: splits the text into words and punctuation marks
word_tokens = word_tokenize(text)
print(word_tokens)

# Sentence tokenizer: splits the text into sentences
sent_tokens = sent_tokenize(text)
print("sent tokens are:", sent_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['Artificial', 'Intelligence', 'is', 'transforming', 'healthcare', '.', 'It', 'helps', 'doctors', 'diagnose', 'diseases', 'early', '.', 'Moreover', ',', 'AI', 'enables', 'personalised', 'treatment', 'for', 'each', 'patient', '.', 'This', 'shift', 'can', 'improve', 'outcomes', 'and', 'reduces', 'costs', '.', 'Many', 'hospitals', 'now', 'rely', 'on', 'AI-powered', 'systems', '.']
sent tokens are: ['Artificial Intelligence is transforming healthcare.', 'It helps doctors diagnose diseases early.', 'Moreover,AI enables personalised treatment for each patient.', 'This shift can improve outcomes and reduces costs.', 'Many hospitals now rely on AI-powered systems.']
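
For comparison, sentence splitting can be approximated with a regular expression. This is a rough sketch, not how `sent_tokenize` works internally; the function name `naive_sent_tokenize` is hypothetical. It splits after `.`, `!`, or `?` when whitespace follows, which is enough for the sample text but breaks on abbreviations.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Works on simple sentences.
print(naive_sent_tokenize("It helps doctors. Many hospitals rely on AI."))

# Fails on abbreviations: "Dr." is treated as a sentence of its own.
print(naive_sent_tokenize("Dr. Smith uses AI. It helps."))
```

The second call shows why a dedicated sentence tokenizer is usually preferred over a hand-written pattern.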


In [4]:
# Word tokenizer
word_tokens = word_tokenize(text)
print(word_tokens)
print("before removing duplicates:", len(word_tokens))
# Keep the token list intact; store the unique-token count under its own name
unique_count = len(set(word_tokens))
print("after removing duplicates:", unique_count)

from nltk.corpus import stopwords
nltk.download('stopwords')
# English stop words as a set for fast membership checks
stop_words = set(stopwords.words('english'))
print(stop_words)

['Artificial', 'Intelligence', 'is', 'transforming', 'healthcare', '.', 'It', 'helps', 'doctors', 'diagnose', 'diseases', 'early', '.', 'Moreover', ',', 'AI', 'enables', 'personalised', 'treatment', 'for', 'each', 'patient', '.', 'This', 'shift', 'can', 'improve', 'outcomes', 'and', 'reduces', 'costs', '.', 'Many', 'hospitals', 'now', 'rely', 'on', 'AI-powered', 'systems', '.']
before removing duplicates: 40
after removing duplicates: 36
{'off', 'than', 'some', 'any', 'between', 'shouldn', 'you', 'doesn', 'where', 'but', "it'd", 'aren', "mightn't", "wasn't", 'hasn', 'up', 'yourselves', 'both', "don't", "he'd", "couldn't", 'hers', 'yours', 'its', 'at', 'out', 'very', 'ma', 'are', 'll', 'been', "shan't", 'same', "they're", 'or', 'can', 'these', 'o', "didn't", 'shan', 'each', 'a', "should've", 'weren', 'through', "i'd", 'against', 'not', 'from', 'has', 'under', 'while', 'and', 'above', 'so', "they'll", 'once', 'were', 'here', 'ours', 't', 'my', 'of', 'being', 'having', "he's", "needn't"

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
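
A stop-word set is typically used to filter a token list. The sketch below assumes a tiny hand-picked stop-word set (`demo_stop_words`, hypothetical) in place of the full NLTK list so that it runs without any downloads; the helper `remove_stop_words` is likewise illustrative.

```python
# Tokens for the first two sentences of the sample text.
tokens = ['Artificial', 'Intelligence', 'is', 'transforming', 'healthcare', '.',
          'It', 'helps', 'doctors', 'diagnose', 'diseases', 'early', '.']

# Assumption: a small illustrative set standing in for stopwords.words('english').
demo_stop_words = {'is', 'it', 'for', 'each', 'this', 'can', 'and', 'on'}

def remove_stop_words(tokens, stop_words):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    return [t for t in tokens
            if t.lower() not in stop_words and any(ch.isalnum() for ch in t)]

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

The comparison lowercases each token first because the stop words are stored in lowercase, so `It` would otherwise slip through.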


In [6]:
# Sentence tokenizer
sent_tokens = sent_tokenize(text)
print("sent tokens are:", sent_tokens)
# Unpack the index and the sentence so each line shows one sentence
for i, sent in enumerate(sent_tokens):
    print(f"{i}: {sent}")

sent tokens are: ['Artificial Intelligence is transforming healthcare.', 'It helps doctors diagnose diseases early.', 'Moreover,AI enables personalised treatment for each patient.', 'This shift can improve outcomes and reduces costs.', 'Many hospitals now rely on AI-powered systems.']
0: Artificial Intelligence is transforming healthcare.
1: It helps doctors diagnose diseases early.
2: Moreover,AI enables personalised treatment for each patient.
3: This shift can improve outcomes and reduces costs.
4: Many hospitals now rely on AI-powered systems.

In [7]:
# Paragraph tokenizer: split on blank lines and drop empty pieces
paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
print(paragraphs)
for i, p in enumerate(paragraphs, 1):
    print(f"{i}: {p}")

['Artificial Intelligence is transforming healthcare. It helps doctors diagnose diseases early.\n Moreover,AI enables personalised treatment for each patient.', 'This shift can improve outcomes and reduces costs. Many hospitals now rely on AI-powered systems.']
1: Artificial Intelligence is transforming healthcare. It helps doctors diagnose diseases early.
 Moreover,AI enables personalised treatment for each patient.
2: This shift can improve outcomes and reduces costs. Many hospitals now rely on AI-powered systems.
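
Splitting on the exact string `"\n\n"` misses blank lines that contain spaces or tabs. A slightly more tolerant variant, sketched here with the hypothetical helper `split_paragraphs`, uses a regular expression instead.

```python
import re

def split_paragraphs(text):
    """Split on blank lines, where a blank line may contain spaces or tabs."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]

# The separator line holds two spaces, so "\n\n" never occurs in this string.
doc = "First paragraph.\n  \nSecond paragraph."
print(doc.split("\n\n"))      # one piece: the plain split finds no separator
print(split_paragraphs(doc))  # two paragraphs
```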
