## What is Tokenization

`Tokenization` is the process of splitting text into meaningful segments called tokens, such as words, punctuation marks, numbers, and symbols.
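To see why tokenization is more than splitting on spaces, here is a minimal sketch that uses only Python's standard library. The regular expression is an illustrative assumption, not spaCy's algorithm. It shows how naive rules either leave punctuation glued to words or break abbreviations such as `Dr.` apart.

```python
import re

text = "Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate."

# Naive approach 1: split on whitespace. Punctuation stays glued to words.
whitespace_tokens = text.split()

# Naive approach 2: runs of word characters, or single punctuation marks.
# This separates "2" and "$", but also breaks the abbreviation "Dr." apart.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[-3:])  # ['2$', 'per', 'plate.']
print(regex_tokens[:3])        # ['Dr', '.', 'Strange']
```

A rule-based tokenizer such as spaCy's sits between these two extremes: it splits off punctuation and symbols while keeping known abbreviations intact.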

In [5]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
import spacy

## Initializing the NLP Object


As we've discussed, spaCy is an `object-oriented` library, so to access its methods and attributes we first need to create an object. This object is called the **nlp object**.

In [23]:
# Creates an empty pipeline for the English language (another option is `spacy.load()`, which loads a trained pipeline)
nlp = spacy.blank("en")

type(nlp)

spacy.lang.en.English

To see all the language models: https://spacy.io/usage/models

## Creating a Document Object

In [24]:
doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

type(doc)

spacy.tokens.doc.Doc

In [22]:
# Calling nlp(...) on the text above has already tokenized it
for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
of
mumbai
as
it
costs
only
2
$
per
plate
.


In [11]:
# We can also access individual tokens by index
doc[0]

Dr.

In [25]:
type(doc[0])

spacy.tokens.token.Token

In [27]:
# We can also use slicing
span = doc[0:3]

print(span)
type(span)

Dr. Strange loves


spacy.tokens.span.Span
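As a small illustration of how `Doc`, `Token`, and `Span` relate, the sketch below rebuilds the blank English pipeline and inspects a few positional attributes: `Token.i` (token index in the document), `Token.idx` (character offset in the original text), and `Span.start` / `Span.end` (token indices of the slice). The variable names are only illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

token = doc[1]   # indexing a Doc returns a Token
span = doc[0:3]  # slicing a Doc returns a Span (a view, not a copy of the text)

print(token.text, token.i, token.idx)   # Strange 1 4
print(span.text, span.start, span.end)  # Dr. Strange loves 0 3
print(len(doc))                         # number of tokens in the Doc
```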

## How spaCy Tokenizes a Document

Imagine the sentence `"Let's go to N.Y.!"` (the quotation marks are part of the text):
1. Splitting the sentence on whitespace:
```
["Let's, go, to, N.Y.!"]
```
2. Splitting off a prefix (for example `"`, `(`, etc.):
```
[", Let's, go, to, N.Y.!"]
```
3. Applying the exception rules (`Let's` becomes `Let` and `'s`):
```
[", Let, 's, go, to, N.Y.!"]
```
4. Splitting off a suffix (the closing `"`):
```
[", Let, 's, go, to, N.Y.!, "]
```
5. Splitting off a suffix again (`!`); the remaining `N.Y.` matches an exception rule and stays a single token:
```
[", Let, 's, go, to, N.Y., !, "]
```

And we are done!
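The steps above can be sketched as a toy tokenizer in plain Python. This is a simplified illustration, not spaCy's implementation: the `PREFIXES`, `SUFFIXES`, and `EXCEPTIONS` tables are tiny hypothetical stand-ins for spaCy's much larger language-specific rules.

```python
# Hypothetical, minimal rule tables (spaCy's real rules are far larger).
PREFIXES = ('"', "(", "[", "$")
SUFFIXES = ('"', ")", "]", "!", "?", ",", ".")
EXCEPTIONS = {
    "Let's": ["Let", "'s"],  # contraction split into two tokens
    "N.Y.": ["N.Y."],        # abbreviation kept as one token
}

def tokenize_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    leading, trailing = [], []
    while chunk:
        if chunk in EXCEPTIONS:              # exception rules win first
            leading.extend(EXCEPTIONS[chunk])
            chunk = ""
        elif chunk.startswith(PREFIXES):     # peel one prefix character
            leading.append(chunk[0])
            chunk = chunk[1:]
        elif chunk.endswith(SUFFIXES):       # peel one suffix character
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        else:                                # nothing left to peel
            leading.append(chunk)
            chunk = ""
    # Suffixes were collected from the outside in, so reverse them.
    return leading + trailing[::-1]

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():               # step 1: split on whitespace
        tokens.extend(tokenize_chunk(chunk))
    return tokens

tokens = toy_tokenize('"Let\'s go to N.Y.!"')
print(tokens)  # ['"', 'Let', "'s", 'go', 'to', 'N.Y.', '!', '"']
```

Checking the exception table at the top of every loop iteration is what lets `N.Y.` survive intact once the trailing `"` and `!` have been peeled off.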

In [28]:
# We can list all the attributes and methods available on a token
dir(doc[0])

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 ...]

In [31]:
# Converting a token into text
text = doc[0].text

text, type(text)

('Dr.', str)

In [40]:
# Two useful attributes are `like_num` and `is_currency`
doc = nlp("Dr. Strange borrows from Tony two $")

print(doc[-2])
print(doc[-2].like_num)

print(doc[-1])
print(doc[-1].is_currency)

two
True
$
True


## Example 1

In [52]:
# Imagine the following document
d = """
Dayton high school, 8th grade students information
==================================================

Name    Birth Day       Email
----    --------        -----
Virat   5 June, 1882    virat@kholi.com
Maria   12 April, 2001  maria@sharapova.com
Serena  24 May, 1998    serena@gmail.com
Joe     4 July, 2004    joe@yahoo.de

""".lstrip()

print(d)

Dayton high school, 8th grade students information
==================================================

Name    Birth Day       Email
----    --------        -----
Virat   5 June, 1882    virat@kholi.com
Maria   12 April, 2001  maria@sharapova.com
Serena  24 May, 1998    serena@gmail.com
Joe     4 July, 2004    joe@yahoo.de




How can we extract the emails using spaCy?

In [59]:
# Collapse the document into a single line
text = " ".join(d.split("\n"))
text



In [61]:
# Creating the Document object
doc = nlp(text)

In [63]:
# Getting the email tokens
emails = []

for token in doc:
    if token.like_email:
        emails.append(token)

emails

[virat@kholi.com, maria@sharapova.com, serena@gmail.com, joe@yahoo.de]

## Customizing the Tokenization Rules

In [66]:
# Let's say we want to split the word `gimme` into the two tokens `gim` and `me`
from spacy.symbols import ORTH

nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"}
])

In [67]:
# After applying the special case:
doc = nlp("gimme double cheese pizza please")

for token in doc:
    print(token)

gim
me
double
cheese
pizza
please
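A special case can only split the original string; the `ORTH` values must join back to exactly `gimme`. If the goal is to treat `gim` as `give`, one option is to attach a normalized form with `NORM`. The following is a minimal sketch under that assumption, using a fresh blank pipeline so it runs on its own.

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp = spacy.blank("en")

# ORTH values must concatenate to the original text ("gim" + "me" == "gimme").
# NORM attaches a normalized form without changing the token text.
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim", NORM: "give"},
    {ORTH: "me"},
])

doc = nlp("gimme double cheese pizza please")
texts = [token.text for token in doc]
norms = [token.norm_ for token in doc]

print(texts)  # token texts, still "gim" and "me"
print(norms)  # normalized forms, with "give" for the first token
```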

## Sentence Tokenization

Remember that the pipeline we created is blank, so no component sets sentence boundaries and we cannot yet use `doc.sents` to split the text into sentences.

To solve this, we add a component to the pipeline using `add_pipe()`.

In [68]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7f9c9a99d640>

In [70]:
# Printing the pipeline components
nlp.pipe_names

['sentencizer']

In [71]:
# Now the object knows how to split a document into sentences
doc = nlp("Dr. Strange loves pav bhaji of mubai. Hulk loves chaat of delhi")

for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji of mubai.
Hulk loves chaat of delhi
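Because adding a component with a name that already exists raises an error, re-running the `add_pipe` cell in a notebook fails. A minimal sketch of a re-runnable version, using a fresh blank pipeline and a simple guard on `nlp.pipe_names`:

```python
import spacy

nlp = spacy.blank("en")

# Add the rule-based sentence splitter only if it is not already present,
# so the cell can be executed more than once.
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")

doc = nlp("Dr. Strange loves pav bhaji of mubai. Hulk loves chaat of delhi")
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
for sentence in sentences:
    print(sentence)
```

Note that `Dr.` does not end a sentence here because the tokenizer keeps it as a single token, so the sentencizer never sees a standalone period after `Dr`.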


## Example 2

Suppose you are an NLP engineer who wants to collect all the dataset websites mentioned in a book. To keep the exercise simple, you are given one paragraph from the book and want to grab all URLs from it using spaCy.

In [73]:
text = """
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
""".lstrip()

print(text)

Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.



In [74]:
urls = []
for token in nlp(text):
    if token.like_url:
        urls.append(token)

urls

[http://www.data.gov/,
 http://www.science,
 http://data.gov.uk/.,
 http://www3.norc.org/gss+website/,
 http://www.europeansocialsurvey.org/.]
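Some of the extracted tokens still carry the sentence punctuation that followed the URL. A small post-processing step in plain Python can strip it. The sketch below works on the token texts copied from the output above; the set of characters to strip is an assumption that suits this paragraph.

```python
# Token texts copied from the output of the extraction loop.
raw_urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

# Strip trailing sentence punctuation that got attached to the URL token.
clean_urls = [url.rstrip(".,;") for url in raw_urls]

for url in clean_urls:
    print(url)
```

This cannot repair `http://www.science`, which was cut short because the URL is wrapped across two lines in the source text; that would have to be fixed before tokenization.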

## Example 3

Extract all money transactions, along with their currency, from the sentence below.

In [83]:
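transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

results = []
num = None  # most recent number-like token not yet paired with a currency
for token in nlp(transactions):

    if token.like_num:
        num = token

    # Pair a currency symbol with the number that precedes it, if there is one
    if token.is_currency and num is not None:
        results.append(num.text + " " + token.text)
        num = None

results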

['two $', '500 €']

For more information about the linguistic features of spaCy: https://spacy.io/usage/linguistic-features#tokenization