# Tokenization

In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [3]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
  print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

<img src="https://miro.medium.com/max/1400/1*Sibm12vBIZTjTDBp3I7yeQ.png" width=600/>

-  **Prefix**: Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**: Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**: Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`
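
The order in which these four rule types interact can be sketched in plain Python. This is a toy illustration under stated assumptions, not spaCy's implementation: the names `EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, `INFIX_RE`, `split_chunk` and `toy_tokenize` are hypothetical, and the rule sets cover only a handful of the characters listed above.

```python
import re

# Hypothetical, deliberately tiny rule sets; spaCy's real rules are much larger.
EXCEPTIONS = {
    "We're": ["We", "'re"],
    "L.A.": ["L.A."],
    "St.": ["St."],
    "U.S.": ["U.S."],
}
PREFIXES = ("$", "(", '"')
SUFFIXES = ("km", ")", ",", ".", "!", '"')
INFIX_RE = re.compile(r"(--|-|/)")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    head, tail = [], []
    while chunk:
        if chunk in EXCEPTIONS:
            # A special case wins over the punctuation rules.
            head.extend(EXCEPTIONS[chunk])
            break
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        if prefix:
            # Peel off one prefix, then re-check the remainder.
            head.append(prefix)
            chunk = chunk[len(prefix):]
            continue
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if suffix:
            # Peel off one suffix, then re-check the remainder.
            tail.insert(0, suffix)
            chunk = chunk[:-len(suffix)]
            continue
        # No affixes left: split what remains on infixes.
        head.extend(part for part in INFIX_RE.split(chunk) if part)
        break
    return head + tail


def toy_tokenize(text):
    """Split on whitespace first, then apply the rules to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(toy_tokenize('"We\'re moving to L.A.!"'))
# ['"', 'We', "'re", 'moving', 'to', 'L.A.', '!', '"']
```

The sketch reproduces the eight tokens shown earlier for `mystring`, but it has no notion of URLs or email addresses, which spaCy handles with additional matching rules.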

## Prefixes, Suffixes and Infixes


In [4]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
  print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


<font color=lightgreen>Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.</font>
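
Because the email address and the URL each survive as a single token, they can be picked out with the `like_email` and `like_url` token attributes. The sketch below assumes a blank English pipeline (`spacy.blank("en")`), which should apply the same tokenizer rules without needing a trained model; the shortened sentence is chosen for illustration.

```python
import spacy

# A blank pipeline has only a tokenizer, which is all this example needs.
nlp = spacy.blank("en")
doc = nlp("Email support@oursite.com or visit us at http://www.oursite.com!")

# Lexical attributes flag tokens that look like email addresses or URLs.
emails = [t.text for t in doc if t.like_email]
urls = [t.text for t in doc if t.like_url]

print(emails)
print(urls)
```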

In [5]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
  print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


<font color=lightgreen>Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.</font>

## Exceptions

In [6]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
  print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


<font color=lightgreen>Here the abbreviations for "Saint" and "United States" are both preserved.</font>
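
Exceptions are not limited to the built-in ones: the tokenizer accepts additional special cases through `add_special_case`. The sketch below follows the pattern used in spaCy's own documentation and assumes a blank English pipeline; the string `"gimme"` is only an illustration.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Without a special case, "gimme" is kept as one token.
before = [t.text for t in nlp("gimme that")]

# Register a rule that splits "gimme" into two tokens.
# The ORTH values must concatenate back to the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [t.text for t in nlp("gimme that")]

print(before)
print(after)
```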

## Counting Tokens
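A `Doc` has a fixed number of tokens, which `len()` returns. For the `doc` created from `mystring` above, that is the eight tokens printed earlier: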

In [7]:
len(doc)

8

## Counting Vocab Entries
A `Vocab` object stores the lexemes (vocabulary entries) shared by every `Doc` created from the same pipeline:

In [8]:
len(doc.vocab)

512

<font color=lightgreen>NOTE: This number changes based on the language library loaded at the start, and any new lexemes introduced to the `vocab` when the `Doc` was created.</font>
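
The effect of new lexemes can be observed by measuring the vocab before and after processing unseen text. This is a minimal sketch assuming a blank English pipeline and two arbitrary words; the exact counts depend on the spaCy version, so only the direction of the change is checked.

```python
import spacy

nlp = spacy.blank("en")

size_before = len(nlp.vocab)
# Tokenizing text creates lexemes for strings the vocab has not seen yet.
doc = nlp("zyzzyva quokka")
size_after = len(nlp.vocab)

print(size_before, size_after)
print(size_after > size_before)
```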

## Tokens can be retrieved by index position and slice
`Doc` objects can be thought of as sequences of `Token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [9]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [10]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [11]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.
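
Indexing and slicing return different types: an index yields a `Token`, while a slice yields a `Span` that remembers where it sits in the parent `Doc`. A short sketch, assuming a blank English pipeline since only tokenization is involved:

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")
doc5 = nlp("It is better to give than to receive.")

third = doc5[2]     # a single Token
middle = doc5[2:5]  # a Span covering tokens 2, 3 and 4

print(type(third).__name__, type(middle).__name__)
# Span boundaries are token offsets into the parent Doc.
print(middle.start, middle.end, len(middle))
```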

## Tokens cannot be reassigned
Although `Doc` objects can be considered lists of tokens, they do *not* support item reassignment:

In [12]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
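
One way around this restriction is to build a new `Doc` from an edited list of words. The sketch below is one possible approach, assuming a blank English pipeline; it uses the `Doc(vocab, words=..., spaces=...)` constructor and keeps the original spacing.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc6 = nlp("My dinner was horrible.")

# Item assignment is rejected, as shown above.
try:
    doc6[3] = "delicious"
    reassigned = True
except TypeError:
    reassigned = False

# Copy the words and trailing-space flags, edit the copy, and build a new Doc.
words = [t.text for t in doc6]
spaces = [bool(t.whitespace_) for t in doc6]
words[3] = "delicious"
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(reassigned)
print(new_doc.text)
```

The new `Doc` carries no annotations from the original, so any pipeline components would need to be run on it again.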

# Named Entities

In [13]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
  print(token.text, end=' | ')

print('\n----')

for ent in doc8.ents:
  print(f"{ent.text} - {ent.label_} - {str(spacy.explain(ent.label_))}")

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


<font color=lightgreen>Note how two tokens combine to form the entity `Hong Kong`, and three tokens combine to form the monetary entity:  `$6 million`</font>
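
Each entity is a `Span`, so its token boundaries can be inspected directly. The sketch below assumes a blank English pipeline, which has no trained entity recognizer; the three entities are therefore assigned by hand to mirror the output above rather than predicted by a model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc8 = nlp("Apple to build a Hong Kong factory for $6 million")

# Hand-assigned entities (an assumption standing in for a trained NER model).
doc8.ents = [
    Span(doc8, 0, 1, label="ORG"),
    Span(doc8, 4, 6, label="GPE"),
    Span(doc8, 8, 11, label="MONEY"),
]

# start and end are token offsets; len() is the number of tokens in the entity.
for ent in doc8.ents:
    print(ent.text, ent.label_, ent.start, ent.end, len(ent))
```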

In [14]:
# Counting the Number of Named Entities
len(doc8.ents)

3

# Noun Chunks

In [15]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
  print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [16]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
  print(chunk.text)

Red cars
higher insurance rates


In [17]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
  print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


# Built-in Visualizers

## Visualizing the dependency parse

In [18]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

## Visualizing the entity recognizer

In [19]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

## Creating Visualizations Outside of Jupyter

In [22]:
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
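
As an alternative to starting a server, `displacy.render` can return the markup as a string when called with `jupyter=False`, and that string can be written to a file. This is a sketch under stated assumptions: it uses a blank English pipeline with hand-assigned entities (no trained model), and `entities.html` is a hypothetical output filename.

```python
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple to build a Hong Kong factory for $6 million")

# Hand-assigned entities, since the blank pipeline has no entity recognizer.
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 4, 6, label="GPE"),
    Span(doc, 8, 11, label="MONEY"),
]

# page=True wraps the markup in a complete HTML page.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

out_path = Path("entities.html")  # hypothetical filename
out_path.write_text(html, encoding="utf-8")
```

The saved file can then be opened in any browser.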
