# Tokenization in NLP

**Tokenization** is the process of breaking down text into smaller units called **tokens**. These tokens can be words, subwords, or characters depending on the tokenization strategy.
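
To make the three granularities concrete, here is a minimal pure-Python sketch. It is only an illustration, not how spaCy tokenizes: the regular expression, the `greedy_subwords` helper, and `toy_vocab` are all hypothetical and chosen for this example.

```python
import re

text = "Tokenizers don't guess."

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]


def greedy_subwords(word, vocab):
    """Split `word` by greedy longest-match against a toy vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


toy_vocab = {"token", "izer", "s"}  # hypothetical subword vocabulary
subword_tokens = greedy_subwords("tokenizers", toy_vocab)

print(word_tokens)
print(char_tokens[:5])
print(subword_tokens)
```

Note how the naive regex splits "don't" into three pieces; a linguistically informed tokenizer such as spaCy's handles contractions with dedicated rules.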

## Why is Tokenization Important?
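- It's the **first step** in most NLP pipelines, so later components work on tokens rather than raw text
- It exposes the basic structure of the text: word boundaries, punctuation, numbers
- It enables downstream tasks such as part-of-speech (POS) tagging and named entity recognition (NER)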

In this notebook, we'll explore tokenization using **spaCy**.

In [1]:
import spacy

## Basic Tokenization with spaCy

Let's create a blank English model and tokenize a simple sentence. A blank model contains only the tokenizer; it has no additional components such as a tagger, parser, or named entity recognizer.

In [2]:
nlp = spacy.blank("en")

doc = nlp("The quick brown fox jumps over 13 lazy dogs, and it doesn't stop.")

for token in doc:
    print(token)

The
quick
brown
fox
jumps
over
13
lazy
dogs
,
and
it
does
n't
stop
.
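
The following sketch checks two properties of the blank model used above: its component list is empty, and tokenization is non-destructive, so the original string can be rebuilt from the tokens. The variable names `nlp_blank` and `rebuilt` are introduced here only for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")
text = "The quick brown fox jumps over 13 lazy dogs, and it doesn't stop."
doc = nlp_blank(text)

# A blank pipeline has a tokenizer but no pipeline components.
print(nlp_blank.pipe_names)

# Each token remembers its trailing whitespace, so joining
# `text_with_ws` across tokens reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# Number of tokens, matching the printed list above.
print(len(doc))
```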


In [5]:
doc = nlp("I gave three $ to Peter.")

## Token Attributes

Each token in spaCy is a `Token` object with many useful attributes. Let's explore some of them:
- `token.text` - The original text of the token
- `token.i` - The index of the token in the document
- `token.is_alpha` - Is the token alphabetic?
- `token.is_punct` - Is the token punctuation?
- `token.like_num` - Does the token look like a number?
- `token.is_currency` - Is the token a currency symbol?

In [6]:
token0 = doc[0]
token0

I

In [8]:
type(token0)

spacy.tokens.token.Token

In [12]:
token2 = doc[2]
token2.text

'three'

In [10]:
token2.like_num

True

In [11]:
token3 = doc[3]
token3.text

'$'

In [13]:
token3.is_currency

True

In [14]:
for token in doc:
    print(token, "==>", "index: ", token.i,
          "is_alpha:", token.is_alpha,
          "is_punct:", token.is_punct,
          "like_num:", token.like_num,
          "is_currency:", token.is_currency)

I ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
three ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  6 is_alpha: False is_punct: True like_num: False is_currency: False
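
Attributes become more useful in combination. As a small sketch, the loop below pairs each number-like token with a currency symbol that directly follows it. The sentence and the names `nlp_en` and `amounts` are made up for this example.

```python
import spacy

nlp_en = spacy.blank("en")
doc = nlp_en("Tony gave two $ to Peter, Bruce gave 500 € to Steve")

amounts = []
for token in doc:
    # Look one token ahead, guarding against the end of the document.
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.is_currency:
            amounts.append(f"{token.text} {next_token.text}")

print(amounts)
```

The same pattern (test an attribute, then inspect a neighbour through `token.i`) works for any pair of lexical attributes.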


In [15]:
with open("students.txt") as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

## Practical Example: Extracting Emails from Text

Let's use the `like_email` token attribute to extract all email addresses from a file. This demonstrates how tokenization combined with token attributes can be used for information extraction.

In [16]:
text = ' '.join(text)
text



In [17]:
doc = nlp(text)
emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
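
The same steps can be wrapped in a small reusable function. This is a sketch under assumptions: `extract_emails`, `nlp_en`, and `sample_lines` are hypothetical names, and the sample lines are copied from the file contents shown above rather than read from disk.

```python
import spacy


def extract_emails(lines, nlp):
    """Return the email-like tokens found in an iterable of text lines."""
    doc = nlp(" ".join(lines))
    return [token.text for token in doc if token.like_email]


nlp_en = spacy.blank("en")
sample_lines = [
    "Name  birth day  email\n",
    "Maria  12 April, 2001  maria@sharapova.com\n",
    "Joe      1 May, 1997    joe@root.com\n",
]

emails = extract_emails(sample_lines, nlp_en)
print(emails)
```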

## Multilingual Tokenization

spaCy supports tokenization for many languages. Let's try tokenizing Hindi text using a blank Hindi model.

In [18]:
nlp = spacy.blank("hi")

# The sentence means "Ram wrote a letter to Sita."
doc = nlp("राम ने सीता को एक पत्र लिखा।")
for token in doc:
    print(token)

राम
ने
सीता
को
एक
पत्र
लिखा
।

In [20]:
nlp = spacy.blank("en")  # switch back from the blank Hindi model to English

doc = nlp("gimme double cheese extra large healthy pizza")

tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

## Custom Tokenization Rules

Sometimes the default tokenizer doesn't handle certain words the way we want. We can add **special cases** to customize tokenization behavior.

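For example, we might want "gimme" split into "gim" + "me", much as the English tokenizer already splits "doesn't" into "does" + "n't". A special case can only re-segment the text: the `ORTH` values must concatenate back to the original string, so "gimme" cannot be rewritten as "give" + "me" this way.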

In [22]:
from spacy.symbols import ORTH

nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme double cheese extra large healthy pizza")

tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
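
A special case can also attach a normalized form to a piece through the `NORM` attribute, while `ORTH` keeps the surface text unchanged. The sketch below assumes we want "gim" normalized to "give"; that mapping and the name `nlp_en` are illustrative choices, not part of the notebook.

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp_en = spacy.blank("en")
text = "gimme double cheese"

# ORTH values must join back to "gimme"; NORM is free to differ.
nlp_en.tokenizer.add_special_case(
    "gimme", [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
)

doc = nlp_en(text)
first = doc[0]
print(first.text, first.norm_)

# The surface text is still recoverable from the tokens.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```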

In [23]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion. Hydra is a dragon of India")
for sentence in doc.sents:
    print(sentence)

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

## Sentence Tokenization (Sentence Boundary Detection)

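Breaking text into sentences is another form of tokenization. spaCy offers several ways to set sentence boundaries, including the dependency parser and the sentence recognizer; the simplest is the rule-based **sentencizer** component, which splits on sentence-final punctuation.

> **Note:** A blank model has no sentence boundary detection by default, so iterating over `doc.sents` raises error E030. Adding the `sentencizer` component with `nlp.add_pipe` enables it.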

In [25]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7fe9d74da490>

In [26]:
nlp.pipe_names

['sentencizer']

In [27]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion. Hydra is a dragon of India")
for sentence in doc.sents:
    print(sentence)

Apple is looking at buying U.K. startup for $1 billion.
Hydra is a dragon of India
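
Because the sentencizer is rule-based, the set of boundary characters can be adjusted through its `punct_chars` setting. The sketch below treats a semicolon as a sentence end as well; the example text and the names `nlp_en` and `sentences` are made up for illustration.

```python
import spacy

nlp_en = spacy.blank("en")

# Override the default boundary characters with a custom list.
nlp_en.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc = nlp_en("Wash the rice; boil it for ten minutes. Serve hot!")
sentences = [sent.text for sent in doc.sents]

for sentence in sentences:
    print(sentence)
```

Text that needs boundaries not marked by punctuation calls for a trained component rather than this rule-based one.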
