# Tokenization
The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

"We're moving to L.A.!"
" | We | 're | moving | to | L.A. | ! | " | 

<img src="../screenshots/tokenization.png" width="600">



**Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
**Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
**Infix**:	Character(s) in between &#9656; `- -- / ...`
**Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`
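
To make the four rule types concrete, here is a minimal pure-Python sketch of a tokenizer loop that checks exceptions first, then peels off prefixes and suffixes, and finally splits on infixes. It is not spaCy's implementation: the rule sets and the name `toy_tokenize` are hypothetical, heavily simplified, and ignore cases such as URLs and email addresses.

```python
import re

# Hypothetical, heavily simplified rule sets for illustration only.
# spaCy's real English rules are far larger and more careful.
EXCEPTIONS = {
    "St.": ["St."],
    "U.S.": ["U.S."],
    "L.A.": ["L.A."],
    "We're": ["We", "'re"],
    "Let's": ["Let", "'s"],
}
PREFIX_RE = re.compile(r'^[$(“"¿]')
SUFFIX_RE = re.compile(r'((?<=\d)km|[),.!”"])$')
INFIX_RE = re.compile(r'(--|\.\.\.|[-/])')


def toy_tokenize(text):
    """Split text on whitespace, then apply the four rule types to each chunk."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        while chunk and chunk not in EXCEPTIONS:
            match = PREFIX_RE.search(chunk)
            if match:
                # Peel one prefix character off the front.
                prefixes.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX_RE.search(chunk)
            if match:
                # Peel one suffix off the end, keeping left-to-right order.
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            break
        if chunk in EXCEPTIONS:
            # Exceptions override the punctuation rules.
            middle = EXCEPTIONS[chunk]
        else:
            # Split what is left on infixes, keeping the infix itself.
            middle = [part for part in INFIX_RE.split(chunk) if part]
        tokens.extend(prefixes + middle + suffixes)
    return tokens


print(toy_tokenize('"We\'re moving to L.A.!"'))
print(toy_tokenize("Let's visit St. Louis in the U.S."))
```

Checking the exception table before and after stripping punctuation is what lets `L.A.` survive intact even though `.` is also a suffix.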

## Prefix, Suffix, Infix

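spaCy applies its prefix, suffix, and infix rules conservatively. In the example below, the email address and the URL are each preserved as a single token, while the trailing exclamation mark is split off as a suffix. Abbreviations that appear in addresses, such as "St.", are handled by the exception rules covered in the next section.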

In [27]:
doc2 = nlp("Send email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

Send
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!
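
To see which rule produced each token, recent spaCy versions provide `nlp.tokenizer.explain`. The sketch below assumes such a version is installed and uses a blank English pipeline (tokenizer only) so that no trained model is required. The exact rule labels printed, such as `PREFIX`, `SUFFIX`, or `TOKEN`, may vary between versions.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model has to be downloaded for this sketch.
nlp = spacy.blank("en")
text = "Send email support@oursite.com or visit us at http://www.oursite.com!"

# Each entry pairs the name of the rule that fired with the token text.
explanation = nlp.tokenizer.explain(text)
for rule, token_text in explanation:
    print(f"{token_text:<25} {rule}")

# The explained tokens should line up with the tokenizer's real output.
explained_tokens = [token_text for _, token_text in explanation]
actual_tokens = [token.text for token in nlp.tokenizer(text)]
```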


## Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [26]:
doc4 = nlp("Let's visit St. Louis in the U.S.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
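
Exceptions can also be added at runtime. The sketch below follows the pattern shown in the spaCy documentation, using `add_special_case` to split the informal word "gimme" into two tokens. It uses a blank English pipeline, and the example word is chosen purely for illustration.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Without a special case, "gimme" is a single token.
before = [token.text for token in nlp("gimme that")]

# Register a special case: the pieces must join back into the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]

print(before)
print(after)
```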


## Tokens & Vocab
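`len(doc)` returns the number of tokens in a `Doc`. Every `Doc` also references a shared `Vocab` object, and `len(doc.vocab)` returns the number of lexemes (vocabulary entries) it currently stores. That count depends on the model version and on the text processed so far, so your number may differ from the one shown here.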

In [15]:
print(len(doc))
print(len(doc.vocab))

8
786
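
The vocab stores strings once and refers to them by hash. As a small sketch of that lookup (using a blank English pipeline and an arbitrary example sentence, both chosen here for illustration), a string can be converted to its hash and back, and the matching `Lexeme` can be fetched from the vocab:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Look up the hash for a string, then resolve the hash back to the string.
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]

# A Lexeme is the context-independent vocabulary entry for a word.
lexeme = nlp.vocab["coffee"]

print(len(doc))                      # number of tokens in the Doc
print(coffee_hash, coffee_text)
print(lexeme.text, lexeme.is_alpha)
```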


## Named Entities
spaCy can automatically recognize entities such as countries, organizations, money, etc.

In [20]:
doc5 = nlp('Apple will build a Hong Kong factory for $6 millions')


for token in doc5:
    print(token.text, end=' | ')

Apple | will | build | a | Hong | Kong | factory | for | $ | 6 | millions | 

### Entities

In [25]:
for ent in doc5.ents:
    print(f"{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}")

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 millions - MONEY - Monetary values, including unit
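
Each entity is a `Span` over the tokens of the `Doc`, with a label and token offsets. To show that structure without depending on a trained model, the sketch below sets the entities by hand on a blank English pipeline. The labels and token positions are assigned manually for illustration, not predicted by spaCy.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple will build a Hong Kong factory for $6 million")

# Hand-assigned entities: Span(doc, start_token, end_token, label).
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 4, 6, label="GPE"),
    Span(doc, 8, 11, label="MONEY"),
]

# start and end are token indices; end is exclusive.
rows = [(ent.text, ent.label_, ent.start, ent.end) for ent in doc.ents]
for row in rows:
    print(row)
```

With a trained pipeline such as `en_core_web_sm`, `doc.ents` is filled in by the named entity recognizer instead, and the same attributes are available.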


## Noun Chunks
In spaCy, noun chunks are "base noun phrases": flat phrases made up of a head noun and the words that describe it, such as determiners and adjectives. Nested noun phrases are not merged, so in the output below "an example" and "companies" come out as separate chunks rather than as "an example of companies". Noun chunks are computed from the dependency parse, so they provide useful information about the structure and content of a sentence.

Noun chunks are used in NLP tasks such as information extraction, text summarization, and entity recognition. Extracting them gives a system a quick view of the entities in a text and the words that describe them.

In [30]:
doc6 = nlp("An autonomous car uses computer vision. Tesla is an example of companies that make autonomous cars")

for chunk in doc6.noun_chunks:
    print(chunk)

An autonomous car
computer vision
Tesla
an example
companies
that
autonomous cars
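
Each chunk is a `Span` whose `root` token links it to the rest of the sentence. The sketch below assumes spaCy v3, where a `Doc` can be built from explicit `words`, `pos`, `heads` (absolute token indices), and `deps`. The annotations are written by hand for illustration, so no trained parser is needed; with a real pipeline the parser supplies them.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations for "An autonomous car uses computer vision".
words = ["An", "autonomous", "car", "uses", "computer", "vision"]
pos = ["DET", "ADJ", "NOUN", "VERB", "NOUN", "NOUN"]
heads = [2, 2, 3, 3, 5, 3]  # index of each token's syntactic head
deps = ["det", "amod", "nsubj", "ROOT", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# For each chunk: its text, its root noun, the root's dependency label,
# and the word the root attaches to.
rows = [
    (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
    for chunk in doc.noun_chunks
]
for row in rows:
    print(row)
```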


## Visualizing Tokenization

In [35]:
from spacy import displacy

doc7 = nlp("Microsoft has built a house in U.K. for $6 grand")

In [43]:
displacy.render(doc7, style='dep', jupyter=True, options={'distance': 65})

In [44]:
displacy.render(doc7, style='ent', jupyter=True)

### Visualizing in the browser

In [None]:
# displacy.serve(doc7, style="dep")
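
Outside a notebook, `displacy.render` can also return the markup as a string, which can then be written to an HTML file. The sketch below uses a blank English pipeline with one hand-assigned entity so that it runs without a trained model. The output filename in the commented-out line is a hypothetical example.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Microsoft has built a house in U.K. for $6 grand")

# Hand-assigned entity for illustration; a trained pipeline would predict it.
doc.ents = [Span(doc, 0, 1, label="ORG")]

# jupyter=False returns the markup instead of displaying it;
# page=True wraps it in a complete HTML page.
html = displacy.render(doc, style="ent", page=True, jupyter=False)
print(html[:200])

# To save the page (hypothetical filename):
# from pathlib import Path
# Path("entities.html").write_text(html, encoding="utf-8")
```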