# NLP Processing - Tokenization

**Tokenization** is the process of breaking raw text into smaller units called tokens. Tokens can be words, subwords, or individual characters, and they give a machine discrete pieces of text that it can process.
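
As a rough illustration of the three granularities, the sketch below uses plain Python only. The `TOY_VOCAB` set and the `subword_tokens` helper are hypothetical names made up for this example. The greedy longest-match split is just one simple way to produce subwords, not the method of any particular library.

```python
# Illustrative only: three token granularities with the standard library.
text = "tokenization helps machines"

# Word-level: split on runs of whitespace.
word_tokens = text.split()

# Character-level: every character (including spaces) becomes a token.
char_tokens = list(text)

# Subword-level: greedy longest-match against a tiny, made-up vocabulary.
TOY_VOCAB = {"token", "ization", "help", "s"}


def subword_tokens(word, vocab):
    """Split one word into the longest known pieces, falling back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


print(word_tokens)
print(char_tokens[:5])
print([subword_tokens(word, TOY_VOCAB) for word in word_tokens])
```

Words missing from the toy vocabulary (here `machines`) fall back to single characters, which is why subword schemes can represent any input.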

**The parts of Tokenization:**
> **Full Text:** The complete input sentence. <br>
`"I'm having a wedding party in my U.K. residence!"` <br>
> **Split on Whitespace:** The text is separated into chunks on whitespace only. <br>
`["I'm] [having] [a] [wedding] [party] [in] [my] [U.K.] [residence!"]` <br>
> **Prefix:** Characters at the beginning of a chunk are split off. <br>
`["] [I'm] [having] [a] [wedding] [party] [in] [my] [U.K.] [residence!"]` <br>
> **Exceptions:** Special cases (e.g. contractions, punctuation) that need a separate set of rules to create tokens. <br>
`["] [I] ['m] [having] [a] [wedding] [party] [in] [my] [U.K.] [residence] [!"]` <br>
> **Suffix:** Characters at the end of a chunk are split off. <br>
`["] [I] ['m] [having] [a] [wedding] [party] [in] [my] [U.K.] [residence] [!] ["]` <br>
> **Tokenized:** The final tokenized form of the text. <br>
`["] [I] ['m] [having] [a] [wedding] [party] [in] [my] [U.K.] [residence] [!] ["]` <br><br>

<img src="Tokenization_The Parts of Tokens.png" width=80%></img>

| Token Type | Description | Example |
| --- | --- | --- |
| **Prefix** | Characters at the beginning of a token | `$ ( { ‟` |
| **Suffix** | Characters at the end of a token | `km ) , . ! ?` |
| **Infix** | Characters in between | `- — / ...` |
| **Exceptions** | Special-case rules for splitting | `can't U.S.` |
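
The four rule types can be sketched as a toy tokenizer in plain Python. This is a minimal illustration, not spaCy's implementation. The rule sets `PREFIXES`, `SUFFIXES`, `INFIX_RE`, and `EXCEPTIONS`, and the `tokenize` function, are hypothetical and cover only a handful of characters. The order in which it peels affixes is a simplification.

```python
import re

# Hypothetical, deliberately tiny rule sets (assumptions for this sketch).
PREFIXES = ('"', "$", "(", "{")
SUFFIXES = ('"', "!", "?", ",", ".", ")", "km")
INFIX_RE = re.compile(r"(-|/)")  # capturing group keeps the separator as a token
EXCEPTIONS = {
    "I'm": ["I", "'m"],
    "can't": ["ca", "n't"],
    "U.K.": ["U.K."],  # kept whole so the final "." is not treated as a suffix
    "U.S.": ["U.S."],
}


def tokenize(text):
    """Split on whitespace, then apply exception, prefix, suffix, and infix rules."""
    tokens = []
    for chunk in text.split():
        trailing = []  # suffixes peeled from the right, collected in reverse order
        while chunk:
            if chunk in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
                continue
            prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
            if prefix:
                tokens.append(prefix)
                chunk = chunk[len(prefix):]
                continue
            suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
            if suffix:
                trailing.append(suffix)
                chunk = chunk[:-len(suffix)]
                continue
            # No affixes left: split on infixes and drop empty pieces.
            tokens.extend(piece for piece in INFIX_RE.split(chunk) if piece)
            chunk = ""
        tokens.extend(reversed(trailing))
    return tokens


print(tokenize('"I\'m having a wedding party in my U.K. residence!"'))
```

Checking exceptions before the affix rules is what keeps `U.K.` intact. Without that entry, the trailing `.` would be peeled off as a suffix.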

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
mystring = "You're in a hurry for work in U.K.!"
print(mystring)

You're in a hurry for work in U.K.!


In [4]:
doc = nlp(mystring)
for token in doc:
    print(token.text)

You
're
in
a
hurry
for
work
in
U.K.
!


In [5]:
mystring2 = "Let's have a user-based experience on the 1st demo of our site https://www.rhsp.com!"
print(mystring2)

Let's have a user-based experience on the 1st demo of our site https://www.rhsp.com!


In [6]:
doc = nlp(mystring2)
for token in doc:
    print(token.text, end=' | ')

Let | 's | have | a | user | - | based | experience | on | the | 1st | demo | of | our | site | https://www.rhsp.com | ! | 

In [7]:
mystring3 = "Achilles confronts Paris about what Hector died for, Troy or honor?"
print(mystring3)

Achilles confronts Paris about what Hector died for, Troy or honor?


In [8]:
doc1 = nlp(mystring3)
# for named entities
for entity in doc1.ents:
    print(entity, "\t=", spacy.explain(entity.label_))

Paris 	= Countries, cities, states
Hector 	= People, including fictional


In [9]:
doc2 = nlp("Google is not investing $300 million dollar for Taiwan-based stratups")

for entity in doc2.ents:
    print(f"{entity.text:<20} = {entity.label_} = {spacy.explain(entity.label_)}")

Google               = ORG = Companies, agencies, institutions, etc.
$300 million dollar  = MONEY = Monetary values, including unit
Taiwan               = GPE = Countries, cities, states


In [10]:
# noun chunks
for chunk in doc1.noun_chunks:
    print(chunk)

Achilles confronts
what
Hector
Troy
honor


In [11]:
for chunk in doc2.noun_chunks:
    print(chunk)

Google
$300 million dollar
Taiwan-based stratups


#### Summary of spaCy Functions

> `token.text` --> The verbatim text of each token in the document (an attribute, not a method).<br>
> `doc.ents` --> The named entities found by NER (Named Entity Recognition); `entity.label_` gives each entity's type.<br>
> `spacy.explain()` --> Returns a human-readable description of a tag or label. <br>
> `doc.noun_chunks` --> The base noun phrases in the document. <br>
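
To see how these pieces fit together, the sketch below gathers all three views of a processed document into one dictionary. `summarize_doc` is a hypothetical helper, not part of spaCy. It assumes only that `doc` is iterable over tokens with a `.text` attribute and exposes `.ents` and `.noun_chunks`, as a spaCy `Doc` from a full pipeline does.

```python
def summarize_doc(doc, explain=None):
    """Collect tokens, named entities, and noun chunks from a processed document.

    `explain` is an optional callable such as spacy.explain; when omitted,
    the description slot of each entity is None.
    """
    describe = explain if explain is not None else (lambda label: None)
    return {
        "tokens": [token.text for token in doc],
        "entities": [
            (entity.text, entity.label_, describe(entity.label_))
            for entity in doc.ents
        ],
        "noun_chunks": [chunk.text for chunk in doc.noun_chunks],
    }
```

With the objects from the cells above, a call might look like `summarize_doc(doc2, explain=spacy.explain)`.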

In [12]:
f_string = (
    "Tesla's new Cybertruck demo shouldn't be on the streets of L.A.! The design isn't normal and takes a lot more space " 
    "than anyother normal car on the streets. Yes, it's unique and fashionable, but you've to think about others too. This "
    "$2.5 billion investment is not only for autonomous, eco-friendly, and futuristic-style, but also for the betterment of "
    "the people around the big cities. People involves: drivers, pedestrians, service-worker, and also the passengers..."
)
f_string

"Tesla's new Cybertruck demo shouldn't be on the streets of L.A.! The design isn't normal and takes a lot more space than anyother normal car on the streets. Yes, it's unique and fashionable, but you've to think about others too. This $2.5 billion investment is not only for autonomous, eco-friendly, and futuristic-style, but also for the betterment of the people around the big cities. People involves: drivers, pedestrians, service-worker, and also the passengers..."

In [13]:
doc = nlp(f_string)

for token in doc:
    print(token.text, end=' | ')

Tesla | 's | new | Cybertruck | demo | should | n't | be | on | the | streets | of | L.A. | ! | The | design | is | n't | normal | and | takes | a | lot | more | space | than | anyother | normal | car | on | the | streets | . | Yes | , | it | 's | unique | and | fashionable | , | but | you | 've | to | think | about | others | too | . | This | $ | 2.5 | billion | investment | is | not | only | for | autonomous | , | eco | - | friendly | , | and | futuristic | - | style | , | but | also | for | the | betterment | of | the | people | around | the | big | cities | . | People | involves | : | drivers | , | pedestrians | , | service | - | worker | , | and | also | the | passengers | ... | 

In [14]:
for entity in doc.ents:
    print(f"{entity.text:<15} {entity.label_:<8} = {spacy.explain(entity.label_)}")

Tesla           ORG      = Companies, agencies, institutions, etc.
Cybertruck      PERSON   = People, including fictional
L.A.            GPE      = Countries, cities, states
$2.5 billion    MONEY    = Monetary values, including unit


In [15]:
for chunk in doc.noun_chunks:
    print(chunk)

Tesla's new Cybertruck demo
the streets
L.A.
The design
a lot more space
anyother normal car
the streets
it
you
others
This $2.5 billion investment
the betterment
the people
the big cities
People
drivers
pedestrians
service-worker
also the passengers
