# Tokenization using spaCy

In [3]:
import spacy

In [7]:
nlp = spacy.load("en_core_web_sm")

doc = nlp('''"Let's go to N.Y.!"''')

for token in doc:
    print(token)

"
Let
's
go
to
N.Y.
!
"


In [8]:
display(type(nlp))
display(type(doc))

spacy.lang.en.English

spacy.tokens.doc.Doc

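Each `Token` object exposes attributes that describe it, such as `is_alpha` (alphabetic characters only), `is_digit`, `like_num` (looks like a number), `is_currency` (currency symbol) and `is_stop`, alongside linguistic annotations like the lemma, part-of-speech tag and dependency label.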
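As a small illustration of the lexical flags, the sketch below collects a few of them for each token. It assumes a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm`, because these flags depend only on the token text and need no trained model; the sample sentence and the `flags` dictionary are made up for this example.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Tony paid 500 € for two pizzas")

# Map each token text to (is_alpha, is_digit, like_num, is_currency).
flags = {
    token.text: (token.is_alpha, token.is_digit, token.like_num, token.is_currency)
    for token in doc
}

for text, values in flags.items():
    print(text, values)
```

Note that `like_num` is broader than `is_digit`: it also accepts number words such as "two".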

In [10]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

" " PUNCT `` punct " False False
Let let VERB VB ROOT Xxx True False
's us PRON PRP nsubj 'x False True
go go VERB VB ccomp xx True True
to to ADP IN prep xx True True
N.Y. N.Y. PROPN NNP pobj X.X. False False
! ! PUNCT . punct ! False False
" " PUNCT '' punct " False False


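You can customize tokenization in spaCy by adding special-case rules to the tokenizer, for example to split a single word into several tokens.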

In [11]:
phrase = "gimme double cheese extra large healthy pizza"

doc = nlp(phrase)

tokens = [token.text for token in doc]
print(tokens)

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']


In [12]:
from spacy.symbols import ORTH

nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"}, 
    {ORTH: "me"}
])

doc = nlp(phrase)

tokens = [token.text for token in doc]
print(tokens)

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
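To check which rule produced each token, the tokenizer's `explain` method can help. The following is a minimal sketch, again assuming a blank English pipeline and a made-up sample string; the exact rule labels it prints may vary between spaCy versions, so none are shown here.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")  # tokenizer only; assumed here to avoid a model download

# Same special case as above: the ORTH pieces must join back to the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

sample = "gimme pizza!"
tokens = [token.text for token in nlp(sample)]

# explain() returns (rule_name, token_text) pairs describing how the text was split.
explained = nlp.tokenizer.explain(sample)
for rule, piece in explained:
    print(rule, piece)
```

A special case only changes how the matched string is split. As far as I know, spaCy rejects a rule whose pieces do not concatenate back to the original text.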


# Exercise
(1) Think Stats is a free book for learning statistics (https://greenteapress.com/thinkstats2/thinkstats2.pdf).

The book references many websites where you can download free datasets. Imagine you are an NLP engineer who wants to collect all the dataset websites mentioned in it. To keep the exercise simple, you are given one paragraph from the book, and your task is to extract every URL from that paragraph using spaCy.

In [13]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

# TODO: Write code here
# Hint: token has an attribute that can be used to detect a url
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
url_tokens = []

for token in doc:
    if token.like_url:
        url_tokens.append(token.text)
        
print(url_tokens)

['http://www.data.gov/', 'http://www.science', 'http://data.gov.uk/.', 'http://www3.norc.org/gss+website/', 'http://www.europeansocialsurvey.org/.']
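Two of the results above keep a trailing period, and the URL that wraps across a line break in the paragraph is cut off at `http://www.science`. The sketch below shows one possible cleanup for the first problem only: stripping sentence punctuation from the end of each match. The sample sentence is made up and a blank English pipeline is assumed. The line-wrapped URL is not repaired here, since rejoining it would require guessing where a URL continues on the next line.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; like_url needs no trained model

sample = "Start at http://www.data.gov/, then try http://data.gov.uk/."
doc = nlp(sample)

# Strip sentence punctuation that the tokenizer may leave attached to a URL.
urls = [token.text.rstrip(".,;:!?") for token in doc if token.like_url]
print(urls)
```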


(2) Extract all money transactions from the sentence below, along with their currency. The output should be:

two $

500 €

In [22]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

# TODO: Write code here
# Hint: Use token.i for the index of a token and token.is_currency for currency symbol detection
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(transactions)

curr_tokens = []


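for token in doc:
    # Skip a currency symbol at index 0 (doc[-1] would wrap around to the last token)
    # and keep the pair only when the preceding token looks like a number.
    if token.is_currency and token.i > 0 and doc[token.i - 1].like_num:
        curr_tokens.append(doc[token.i - 1].text + " " + token.text)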
        
print(curr_tokens)

['two $', '500 €']
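The solution above assumes the amount always comes before the symbol. As a hedged variation, the sketch below also accepts the symbol-first form (such as `$20`) and prints one transaction per line, matching the expected output format. `extract_amounts` is a hypothetical helper written for this example, and a blank English pipeline is assumed since `is_currency` and `like_num` need no trained model.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only


def extract_amounts(doc):
    """Pair each currency symbol with an adjacent number-like token (hypothetical helper)."""
    pairs = []
    for token in doc:
        if not token.is_currency:
            continue
        if token.i > 0 and doc[token.i - 1].like_num:
            # Amount before symbol, e.g. "500 €"
            pairs.append(f"{doc[token.i - 1].text} {token.text}")
        elif token.i + 1 < len(doc) and doc[token.i + 1].like_num:
            # Symbol before amount, e.g. "$20"
            pairs.append(f"{token.text}{doc[token.i + 1].text}")
    return pairs


doc = nlp("Tony gave two $ to Peter, Bruce gave 500 € to Steve")
amounts = extract_amounts(doc)
for amount in amounts:
    print(amount)
```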
