<center><h1 style="color:green">Tokenization</h1></center>

In [1]:
import spacy

<b>Create a blank language object and tokenize words in a sentence</b>

In [2]:
nlp = spacy.blank("en")

In [11]:
doc = nlp("Let's go to N.Y.!")

In [12]:
for token in doc:
    print(token)

Let
's
go
to
N.Y.
!


In [13]:
doc = nlp('''"Let's go to N.Y.!"''')

In [14]:
for token in doc:
    print(token)

"
Let
's
go
to
N.Y.
!
"


In [15]:
doc = nlp("Dr. Strange loves biriyani of Dhaka as it costs only 2$ per plate.")

In [16]:
for token in doc:
    print(token)

Dr.
Strange
loves
biriyani
of
Dhaka
as
it
costs
only
2
$
per
plate
.


Creating a blank language object gives you a tokenizer and an empty pipeline.

<img src="spacy_blank_pipeline.jpg">
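As a small sketch of what "empty pipeline" means in practice: the blank object still tokenizes, but no component has run, so annotations that depend on a component are missing. The `Doc.has_annotation` check assumes spaCy v3; the sentence is reused from the cells above.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Let's go to N.Y.!")

# The tokenizer always runs, so the Doc already holds tokens.
tokens = [token.text for token in doc]
print(tokens)

# No components are registered in a blank pipeline.
print(nlp.pipe_names)

# Nothing has assigned dependency labels, so that annotation is missing.
print(doc.has_annotation("DEP"))
```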

<b>Using an index to grab tokens</b>

In [17]:
doc[0]

Dr.

In [18]:
doc[4]

of

In [19]:
doc[3]

biriyani

In [20]:
doc[-1]

.

In [21]:
token = doc[1]
token.text

'Strange'

In [22]:
dir(token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

In [23]:
type(nlp)

spacy.lang.en.English

In [24]:
type(doc)

spacy.tokens.doc.Doc

In [25]:
type(token)

spacy.tokens.token.Token

In [26]:
nlp.pipe_names

[]

<b>Span object</b>

In [29]:
span = doc[0:5]
span

Dr. Strange loves biriyani of

In [30]:
type(span)

spacy.tokens.span.Span
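A short sketch of how a `Span` relates to its parent `Doc`: it is a slice that remembers its token boundaries and can be indexed like the `Doc` itself. The sentence is the one used above; nothing here goes beyond standard `Span` attributes.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. Strange loves biriyani of Dhaka as it costs only 2$ per plate.")

span = doc[0:5]

# start is inclusive and end is exclusive, both counted in tokens.
print(span.start, span.end, len(span))

# Indexing a span is relative to the span, not to the doc.
print(span[0].text, span[-1].text)

# The span keeps a reference to the doc it was sliced from.
print(span.doc is doc)
print(span.text)
```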

<b>Token attributes</b>

In [31]:
doc = nlp("Tony gave two $ to Peter.")

In [32]:
token0 = doc[0]
token0

Tony

Output: True if the token is purely alphabetic, False otherwise

In [54]:
token0.is_alpha 

True

In [38]:
doc[3]

$

In [39]:
doc[3].is_alpha

False

In [46]:
doc[1]

gave

Output: True if the token resembles a number (digits or a number word such as "two"), False otherwise

In [53]:
doc[1].like_num

False

In [45]:
doc[2]

two

In [44]:
doc[2].like_num

True

In [47]:
doc[3]

$

Output: True if the token is a currency symbol, False otherwise

In [52]:
doc[3].is_currency

True

In [55]:
for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, 
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

Tony ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  6 is_alpha: False is_punct: True like_num: False is_currency: False
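The same flags can be collected into one record per token, which makes them easier to filter than printed lines. This is only a sketch; the `rows` structure and its key names are illustrative choices, not part of spaCy.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tony gave two $ to Peter.")

# Collect the same flags as above into one dict per token.
rows = [
    {
        "text": token.text,
        "is_alpha": token.is_alpha,
        "is_punct": token.is_punct,
        "like_num": token.like_num,
        "is_currency": token.is_currency,
    }
    for token in doc
]

# Filter on the flags, e.g. keep number-like tokens and currency symbols.
numbers = [row["text"] for row in rows if row["like_num"]]
currencies = [row["text"] for row in rows if row["is_currency"]]
print(numbers, currencies)
```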


<b>Collecting student email IDs from a student information sheet</b>

In [58]:
with open("student.txt") as f:
    text = f.readlines()
text

['\n',
 'Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com']

In [59]:
text = " ".join(text)
text



In [60]:
doc = nlp(text)
emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails 

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
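The loop above can be wrapped in a small helper so it is reusable on any text. This is a sketch: `extract_emails` is a hypothetical name, and the two rows are copied from the sheet above so the example does not depend on `student.txt` being present.

```python
import spacy

nlp = spacy.blank("en")


def extract_emails(nlp, text):
    """Return the text of every token that spaCy flags as email-like."""
    return [token.text for token in nlp(text) if token.like_email]


# Two rows from the sheet, joined with a space as in the cell above.
lines = [
    "Virat   5 June, 1882    virat@kohli.com\n",
    "Maria\t12 April, 2001  maria@sharapova.com\n",
]
emails = extract_emails(nlp, " ".join(lines))
print(emails)
```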

<b>Support in other languages</b><br>
spaCy supports many languages. Some of them do not have trained pipelines, though! See https://spacy.io/usage/models#languages

In [72]:
nlp = spacy.blank("bn")
doc = nlp("ভাই! ৫০০০ ৳ ঋণ ছিল, সেটা ফেরত দিয়ে দিন।")
for token in doc:
    print(token)

ভাই
!
৫০০০
৳
ঋণ
ছিল
,
সেটা
ফেরত
দিয়ে
দিন
।


In [73]:
doc[2].like_num

True

In [74]:
doc[3]

৳

In [75]:
doc[3].is_currency

True

<b>Customizing the tokenizer</b>

In [76]:
from spacy.symbols import ORTH

nlp = spacy.blank("en")
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

In [77]:
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"},
])
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
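Two details of special cases are worth a quick sketch: the `ORTH` pieces have to concatenate back to the original string, and the rule still applies when the word is followed by punctuation. The input `"gimme!"` is an illustrative example, not taken from the cells above.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# The ORTH pieces must join back to the original string: "gim" + "me" == "gimme".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

# The trailing "!" is split off as punctuation, and the rule matches what remains.
tokens = [token.text for token in nlp("gimme!")]
print(tokens)
```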

<b>Sentence Tokenization or Segmentation</b>

In [86]:
nlp = spacy.blank("en")
doc = nlp("Dr. Strange loves biriyani of Dhaka.It costs only 2$ per plate.")
for sentence in doc.sents:
    print(sentence)

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

In [81]:
nlp.pipeline

[]

In [87]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x151ad715b40>

In [88]:
nlp.pipeline

[('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0x151ad715b40>)]

In [89]:
doc = nlp("Dr. Strange loves biriyani of Dhaka. It costs only 2$ per plate.")
for sentence in doc.sents:
    print(sentence)

Dr. Strange loves biriyani of Dhaka.
It costs only 2$ per plate.
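Once the `sentencizer` has run, the boundaries are also stored on the tokens themselves. A minimal sketch, reusing the sentence above, that collects the sentences as strings and lists the tokens that open a sentence:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("Dr. Strange loves biriyani of Dhaka. It costs only 2$ per plate.")

# Each item in doc.sents is a Span; .text gives the sentence as a string.
sentences = [sent.text for sent in doc.sents]
print(sentences)

# Tokens that open a sentence have is_sent_start set to True.
starts = [token.text for token in doc if token.is_sent_start]
print(starts)
```

Note that "Dr." does not end a sentence here: the tokenizer keeps it as a single token, so it is not treated as a standalone full stop.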


<b>Collecting dataset websites from a book paragraph</b>

In [90]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

In [91]:
doc = nlp(text)
data_websites = [token.text for token in doc if token.like_url]
data_websites

['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']
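Some of the tokens above keep the sentence-final period, because it is attached directly to the URL. One possible clean-up, sketched in plain Python, is to strip trailing punctuation afterwards. `clean_url` is a hypothetical helper and the set of stripped characters is an assumption. The URL that was broken across two lines in the text (`http://www.science`) cannot be repaired this way.

```python
def clean_url(url: str) -> str:
    """Strip sentence punctuation that got attached to the end of a URL."""
    return url.rstrip(".,;:!?")


# Output of the cell above, copied verbatim.
data_websites = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

cleaned = [clean_url(url) for url in data_websites]
print(cleaned)
```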

In [93]:
text='''
When searching for data, consider academic institutions and international organizations, 
as they often provide high-quality, publicly accessible datasets. Repositories like 
https://dataverse.harvard.edu/ and https://datahub.io/ host a variety of datasets across multiple disciplines. 
The World Bank's Open Data platform at https://data.worldbank.org/ and the United Nations' 
data portal at https://data.un.org/ are excellent sources for global economic, social, and environmental statistics. 
For those interested in machine learning and AI research, https://www.kaggle.com/datasets offers a 
diverse collection of datasets curated by the data science community.
'''

In [94]:
doc = nlp(text)
data_websites = [token.text for token in doc if token.like_url]
data_websites

['https://dataverse.harvard.edu/',
 'https://datahub.io/',
 'https://data.worldbank.org/',
 'https://data.un.org/',
 'https://www.kaggle.com/datasets']

<b>Figure out all transactions from this text with amount and currency</b>

In [92]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
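    # Check the bounds first so a number-like token at the very end cannot raise IndexError
    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
        print(token.text, doc[token.i + 1].text)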

two $
500 €
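If the pairs are needed as data rather than printed lines, the same rule can be written as a function that returns tuples. This is a sketch: `find_transactions` is a hypothetical name, and pairing each token with its successor via `zip` is one way to avoid indexing past the end of the doc.

```python
import spacy

nlp = spacy.blank("en")


def find_transactions(doc):
    """Return (amount, currency) pairs: a number-like token followed by a currency symbol."""
    # zip stops at the shorter sequence, so the last token is never paired with anything.
    return [
        (token.text, nxt.text)
        for token, nxt in zip(doc, doc[1:])
        if token.like_num and nxt.is_currency
    ]


doc = nlp("Tony gave two $ to Peter, Bruce gave 500 € to Steve")
pairs = find_transactions(doc)
print(pairs)
```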
