# **Tokenization in NLP**

### **First we learn tokenization without any library**

In [None]:
# split on runs of non-word characters (not just whitespace)
import re
text = 'I\'m with you for the entire life in U.K.!'
words = re.split(r'\W+', text)
print(words[:100])

['I', 'm', 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'U', 'K', '']


In this code, we performed basic tokenization without using any external NLP library. Tokenization is the process of breaking down a text into smaller units, usually words, so that we can analyze or process it further.

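Here, we used Python's `re` module (regular expressions) to split the text. The pattern `\W+` matches one or more consecutive non-word characters (anything other than letters, digits, and the underscore), so the text is split wherever such a run appears. This is why `I'm` becomes `I` and `m`, `U.K.` becomes `U` and `K`, and the trailing `.!` leaves an empty string at the end of the list.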
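The trailing empty string can be avoided in at least two ways. The following is a minimal sketch on the same sample sentence (the variable names `tokens_split` and `tokens_findall` are only illustrative): either filter empty items out of the `re.split` result, or use `re.findall` with `\w+` to collect the word runs directly.

```python
import re

text = "I'm with you for the entire life in U.K.!"

# Option 1: split on non-word runs, then drop empty strings
tokens_split = [tok for tok in re.split(r'\W+', text) if tok]

# Option 2: collect the runs of word characters directly
tokens_findall = re.findall(r'\w+', text)

print(tokens_split)
print(tokens_findall)
```

Both options return the same list here. Neither keeps `I'm` or `U.K.` together.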

In [None]:
# select words: split again on runs of non-word characters
words = re.split(r'\W+', text)
print(words[:100])

['I', 'm', 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'U', 'K', '']


This cell repeats the same split so the result can be inspected on its own. It extracts the words from the text by splitting on non-word characters using Python’s `re` module.

* `re.split(r'\W+', text)` divides the text wherever a run of non-word characters appears, such as spaces, punctuation, or symbols.

* The result is a list of word-like pieces. It is easier to analyze or process than the raw string, but contractions and abbreviations are broken apart (`I'm` gives `I` and `m`).

* `print(words[:100])` shows at most the first 100 items of the list so we can verify the tokenization.

This approach is a basic form of tokenization without using any NLP library, and it helps us understand how raw text can be broken down into word units.
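A slightly richer regular expression can keep contractions and dotted abbreviations intact. The pattern below is a hypothetical illustration (the name `TOKEN_PATTERN` and its three alternatives are assumptions, not a standard tokenizer); it only covers the cases that appear in the sample sentence.

```python
import re

# Hypothetical pattern: alternatives are tried left to right at each position
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}   # dotted abbreviations such as U.K. or U.S.
  | \w+(?:'\w+)?         # words, optionally followed by an apostrophe part (I'm)
  | [^\w\s]              # any other single non-space symbol
""", re.VERBOSE)

text = "I'm with you for the entire life in U.K.!"
tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

Because the abbreviation alternative comes first, `U.K.` is matched as a whole before the word alternative can split it, and the final `!` becomes its own token.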

In [None]:
import string
import re
# split into words by white space
words = text.split()
# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

['Im', 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'UK']


In this code, we performed tokenization with punctuation removal using basic Python techniques, without relying on any NLP library.

* First, we split the text into words using the `split()` method, which separates the text at every whitespace.

* Next, we compiled a regular expression that matches any punctuation character listed in Python’s `string.punctuation`.

* Using a list comprehension, we removed punctuation from each word in the list, resulting in words without symbols like `.`, `,`, `!`, or `?`. Note that this also turns `I'm` into `Im` and `U.K.!` into `UK`.

* Finally, `print(stripped[:100])` displays at most the first 100 words after cleaning, so we can verify the output.

This approach is useful for preparing text for NLP tasks where punctuation carries little information, such as basic word-level counting.
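The same cleaning can be done without a regular expression. This sketch uses `str.maketrans` and `str.translate` from the standard library; the second sample sentence and the variable names are made up for illustration. It also drops tokens that become empty once their punctuation is removed.

```python
import string

# Translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

text = "I'm with you for the entire life in U.K.!"
stripped = [word.translate(table) for word in text.split()]
print(stripped)

# A standalone punctuation mark becomes an empty string, so filter those out
spaced = "Hello , world !"
cleaned = [w for w in (word.translate(table) for word in spaced.split()) if w]
print(cleaned)
```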

In [None]:
# [^...] negates the set: match every character that is NOT in string.printable
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', w) for w in words]
print(result)

["I'm", 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'U.K.!']


In this code, we cleaned the text to keep only printable characters using Python’s `re` module:

* `string.printable` contains the ASCII characters considered printable: letters, digits, punctuation, and whitespace.

* `[^%s]` in the regular expression acts as an inverse match, meaning it selects everything except those characters.

* Using a list comprehension, we removed all non-printable characters from each word in the list.

* `print(result)` shows the cleaned words. The sample sentence contains only printable ASCII characters, so the output is identical to the input.

This method helps filter out unwanted or invisible characters that could interfere with tokenization or text analysis. Because `string.printable` covers ASCII only, it also removes accented letters and other non-ASCII text.
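To see the filter actually remove something, here is a small sketch on a made-up sample that contains an accented letter, a euro sign, and an invisible zero-width space. The sample string is an assumption for illustration only.

```python
import re
import string

# Match any character that is not in the ASCII printable set
re_print = re.compile('[^%s]' % re.escape(string.printable))

# \u00e9 is an accented e, \u20ac is the euro sign, \u200b is a zero-width space
sample = 'caf\u00e9 costs 5\u20ac\u200b today'
result = [re_print.sub('', w) for w in sample.split()]
print(result)
```

The invisible character is gone, but so are the accented letter and the euro sign, so this filter is a poor fit for text that is not plain ASCII English.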

In [None]:
# Normalizing Case

# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

["i'm", 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'u.k.!']


In this code, we **normalized the text by converting all words to lowercase**. This ensures that words like `Apple` and `apple` are treated the same, which is important for consistent text analysis in NLP.
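The effect on word counts can be sketched with `collections.Counter`. The sample strings are made up for illustration. The last two lines show `str.casefold()`, a more aggressive variant of `lower()` intended for caseless comparison, which matters for some non-English letters.

```python
from collections import Counter

tokens = "Apple apple APPLE pie".split()

# Without normalization every spelling is counted separately
raw_counts = Counter(tokens)

# After lowercasing the three spellings collapse into one entry
lower_counts = Counter(tok.lower() for tok in tokens)

print(raw_counts)
print(lower_counts)

# lower() keeps the German sharp s, casefold() expands it to "ss"
print('Straße'.lower())
print('Straße'.casefold())
```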


In [None]:
# _______________________________ Working with spaCy _________________________________


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
sentence = '"I\'m with you for the entire life in P.K.!"'  # named `sentence` so it does not shadow the `string` module imported earlier
print(sentence)

"I'm with you for the entire life in P.K.!"


In [None]:
doc = nlp(sentence)
for token in doc:
    print(token.text, end=' | ')

" | I | 'm | with | you | for | the | entire | life | in | P.K. | ! | " | 

In this code, we **used an NLP library (spaCy) to tokenize the text**.

* `nlp(sentence)` processes the text and creates a **document object** (`Doc`).
* The `for token in doc` loop goes through each **token (word or punctuation)** in the text.
* `print(token.text, end=' | ')` displays each token separated by a vertical bar, so we can clearly see how the text has been split.

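This approach is **library-based tokenization**. spaCy keeps punctuation as separate tokens, splits the contraction `I'm` into `I` and `'m`, and keeps the abbreviation `P.K.` together, which the manual splits earlier in this notebook could not do.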


In [None]:
doc2 = nlp(u"we're here to help! Send snail-mail, email fahad@gmail.com or visit us at https://hahad.blogspot.com/!")
for t in doc2:
    print(t)

we
're
here
to
help
!
Send
snail
-
mail
,
email
fahad@gmail.com
or
visit
us
at
https://hahad.blogspot.com/
!


In [None]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [None]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [None]:
len(doc)

13

In [None]:
len(doc.vocab)

797

In [None]:
doc5 = nlp(u'It is better to give than to receive.')
# Retrieve the third token:
doc5[2]

better

In [None]:
# retrieve three tokens from the middle
doc5[2:5]

better to give

In [None]:
# retrieve the last 4 tokens
doc5[-4:]

than to receive.

In [None]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [None]:
# try to change 'My dinner was horrible.' to 'Your dinner was delicious.'
# (this fails on purpose: Doc objects do not support item assignment)
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment

In [None]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

print('\n----')

for ent in doc8.ents:
    print(f'{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In this code, we **processed a text using spaCy** to perform **tokenization and named entity recognition (NER)**:

1. `doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')` processes the text and creates a spaCy **document object**. The named entities are detected during this call.
2. The first `for token in doc8` loop **prints each token** (word, number, or punctuation) separated by `|`, showing how the text is split into meaningful units.
3. The second loop `for ent in doc8.ents` iterates over the **named entities** that were found, such as organizations, locations, and money amounts.
4. For each entity, `ent.text` shows the entity itself, `ent.label_` shows its type (like `ORG`, `GPE`, `MONEY`), and `spacy.explain(ent.label_)` gives a **short explanation** of the entity type.

This demonstrates how spaCy can **automatically detect important real-world entities** in text while also tokenizing it. Note that an entity can span several tokens: `Hong Kong` is two tokens but one entity.


In [None]:
len(doc8.ents)

3

In [None]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [None]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [None]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


In [None]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [None]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

In [None]:
doc = nlp(u'This is a sentence.')
# displacy.serve starts a local web server and blocks until it is interrupted
displacy.serve(doc, style='dep')




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.




### **Important Points About Tokenization**

1. **Definition**:
   Tokenization is the process of **breaking text into smaller units**, usually words, phrases, or sentences, called **tokens**.

2. **Purpose**:

   * Prepares text for **further NLP tasks** like stemming, lemmatization, or analysis.
   * Helps in **counting words, building vocabulary, and text representation**.

3. **Types of Tokenization**:

   * **Word-level**: Splits text into individual words (`"I am happy"` → `['I', 'am', 'happy']`).
   * **Sentence-level**: Splits text into sentences.
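   * **Subword or character-level**: Splits words into smaller pieces or single characters. Subword schemes such as WordPiece (BERT) and byte-pair encoding (GPT models) are used in advanced NLP.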

4. **Methods**:

   * **Manual/Regex-based**: Using Python methods (`split()`) or regular expressions (`re.split()`).
   * **Library-based**: Using NLP libraries like **NLTK** or **spaCy**, which handle punctuation, contractions, and special cases better.

5. **Considerations**:

   * **Case normalization**: Convert text to lowercase to treat `Apple` and `apple` the same.
   * **Punctuation**: Decide whether to remove punctuation depending on the task.
   * **Special characters**: Remove non-printable or unwanted characters for clean tokens.
   * **Handling contractions**: Some libraries handle `"I'm"` → `['I', "'m"]` correctly.
   * **Language-specific rules**: Tokenization may differ for languages with different writing systems.

6. **Output**:
   The result of tokenization is typically a **list of tokens** that can be used in further NLP tasks like text analysis, vectorization, or named entity recognition.
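The three levels listed under point 3 can be sketched with the standard library alone. The sentence splitter here is a deliberately naive assumption (split after `.`, `!`, or `?` followed by whitespace), not how NLTK or spaCy segment sentences, and the last lines show where it goes wrong.

```python
import re

text = "I am happy. Are you happy too?"

# Word-level: whitespace split (punctuation stays attached to words)
word_tokens = text.split()

# Sentence-level: naive split after ., ! or ? followed by whitespace
sentence_tokens = re.split(r'(?<=[.!?])\s+', text)

# Character-level: every character of a word is a token
char_tokens = list("happy")

print(word_tokens)
print(sentence_tokens)
print(char_tokens)

# The naive splitter also breaks after abbreviations such as "St." and "U.S."
tricky = "Let's visit St. Louis in the U.S. next year."
tricky_sentences = re.split(r'(?<=[.!?])\s+', tricky)
print(tricky_sentences)
```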

---




### **Limitations of Tokenization**

1. **Ambiguity in Words**:
   Words with multiple meanings or abbreviations can be misinterpreted. For example, once text is lowercased, `US` ("United States") and `us` (the pronoun) become the same token, and only context can tell them apart.

2. **Handling Contractions**:
   Simple tokenization may fail to split contractions correctly. Whitespace splitting keeps `"I'm"` as one token and splitting on `\W+` gives `["I", "m"]`; neither produces `["I", "'m"]`.

3. **Punctuation Issues**:
   Deciding whether to keep or remove punctuation can affect meaning, especially in questions, exclamations, or URLs.

4. **Complex Languages**:
   Languages without clear word boundaries (like Chinese, Japanese) require **special tokenizers**, as simple whitespace splitting fails.

5. **Numbers, Symbols, and Emojis**:
   Tokenizers may not always handle numbers, currency, hashtags, or emojis correctly, which can affect analysis.

6. **Context Ignorance**:
   Basic tokenization doesn’t understand **semantic meaning**. After case normalization, “Apple” (company) and “apple” (fruit) become the same token, and telling them apart requires additional NLP processing such as named entity recognition.

7. **Inconsistent Outputs Across Tools**:
   Different libraries (NLTK, spaCy, etc.) may tokenize the **same text differently**, making results inconsistent if not standardized.
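A few of these limitations can be reproduced with plain Python. The sample strings below are assumptions chosen for illustration, and the two "tokenizers" being compared are just `str.split` and a `\w+` regex rather than real NLP libraries.

```python
import re

# Ambiguity: lowercasing makes the abbreviation and the pronoun identical
same_after_lower = 'US'.lower() == 'us'
print(same_after_lower)

# Complex languages: Chinese has no spaces, so whitespace splitting
# returns the whole sentence as a single token
chinese = '我喜欢自然语言处理'
chinese_tokens = chinese.split()
print(chinese_tokens)

# Inconsistent outputs: two simple methods disagree on the same text
text = "Send e-mail to U.K. office!"
by_whitespace = text.split()
by_regex = re.findall(r'\w+', text)
print(by_whitespace)
print(by_regex)
```

Any counts or vocabularies built from one method are not directly comparable with those built from another, which is why a pipeline should use one tokenizer throughout.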





### **Pros of Tokenization**

1. **Simplifies Text Processing** – Breaks text into manageable units for analysis.
2. **Foundation for NLP Tasks** – Essential for stemming, lemmatization, vectorization, and NER.
3. **Flexible Methods** – Can be done manually (regex) or with libraries (NLTK, spaCy).
4. **Supports Multi-Level Analysis** – Works at word, sentence, or subword level.
5. **Improves Consistency** – When combined with case normalization, it helps treat similar words uniformly.

---

### **Cons of Tokenization**

1. **Ambiguity Handling** – Can misinterpret abbreviations, contractions, or special cases.
2. **Language Limitations** – Whitespace-based tokenization fails for languages like Chinese or Japanese.
3. **Punctuation and Symbols** – Requires careful handling; may lose important information.
4. **Context Ignorance** – Doesn’t understand meaning; after lowercasing, “Apple” the company and “apple” the fruit become the same token.
5. **Inconsistent Outputs** – Different tools may tokenize the same text differently.


