# Tokenization

Every text corpus is made up of individual units known as "tokens". Splitting the corpus into tokens is important because each token contributes to the meaning of the text.

There are different token types and different ways to tokenize.
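To illustrate why the choice of tokenization method matters, here is a minimal sketch using only the standard library. It compares a plain whitespace split with a hand-written regular expression. The pattern is an assumption made for this illustration and is not how spaCy tokenizes; for instance, it keeps "We're" as one token, whereas spaCy splits it in two.

```python
import re

text = '"We\'re moving to L.A.!"'

# Naive approach: split on whitespace only, so punctuation stays glued to words.
whitespace_tokens = text.split()

# Illustrative pattern (hypothetical, not spaCy's rules):
#   1. dotted abbreviations such as "L.A."
#   2. words, optionally with an apostrophe part such as "We're"
#   3. any single punctuation character
pattern = re.compile(r"(?:[A-Za-z]\.){2,}|\w+(?:'\w+)?|[^\w\s]")
regex_tokens = pattern.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split leaves the quotes and the exclamation mark attached to their neighbours, while the regular expression separates them and keeps the abbreviation intact.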

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


spaCy tokenization

In [2]:
doc = nlp(mystring)
for token in doc:
    print(token)

"
We
're
moving
to
L.A.
!
"


spaCy's tokenizer recognizes complex strings such as email addresses and web URLs and keeps each one as a single token, while still splitting off the surrounding punctuation.

In [3]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


Notice that in this case the dot character appears in the email address and in the URL, but it was not split off as a separate token.
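Tokens that survive intact like this can also be picked out programmatically. The sketch below uses the `like_email` and `like_url` token attributes. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that no trained model is required; the name `nlp_blank` and the shortened sentence are chosen for this illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components (assumption
# made so the sketch does not depend on a downloaded model).
nlp_blank = spacy.blank("en")
doc_contact = nlp_blank("Email support@oursite.com or visit http://www.oursite.com!")

# Show the lexical flags spaCy attaches to each token.
for token in doc_contact:
    print(f"{token.text:<25} like_email={token.like_email} like_url={token.like_url}")

# Collect the tokens that look like email addresses or URLs.
emails = [t.text for t in doc_contact if t.like_email]
urls = [t.text for t in doc_contact if t.like_url]
print(emails, urls)
```

These flags are based on the surface form of each token, so they are available without running a tagger or parser.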

## Entities
spaCy can identify named entities in a text, such as organizations, places, and monetary amounts.

In [4]:
doc = nlp("Apple to build a Hong Kong factory for $6 million.")
for entity in doc.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




Notice how the library labels each entity with a category that matches what it refers to.

## Noun chunks
spaCy can detect noun chunks: base noun phrases made up of a noun plus the words that describe it, which often span more than one word.

In [6]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
for chunk in doc.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufacturers


Notice that in the previous example 'Autonomous cars' is a multi-word noun phrase, and spaCy identifies it as a single chunk.

## displaCy

This module visualizes spaCy objects such as dependency parses and named entities.

In [9]:
from spacy import displacy

doc = nlp("Apple is going to build a U.K. factory for $6 million.")
# dep for syntactic dependency
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [10]:
doc = nlp("Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.")
# ent for named entities
displacy.render(doc, style='ent', jupyter=True)
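Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it inline. The sketch below assumes displaCy's manual input mode (`manual=True`), where the entity spans are written by hand as character offsets rather than predicted by a model. The example text and its single `ORG` annotation are made up for this illustration.

```python
from spacy import displacy

# Hand-written annotation in displaCy's manual format (illustrative data,
# not model output). "Apple" covers characters 0 to 5.
example = {
    "text": "Apple sold nearly 20 thousand iPods.",
    "ents": [{"start": 0, "end": 5, "label": "ORG"}],
    "title": None,
}

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can then be written to an `.html` file and opened in a browser.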