
In [1]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [4]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text)

"
We
're
moving
to
L.A.
!
"


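spaCy split off the opening and closing quotation marks (a prefix and a suffix) and the exclamation point as separate tokens, split the contraction "We're" into "We" and "'re", and kept the abbreviation "L.A." intact. Next, let's see how it handles email addresses and URLs.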

In [5]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.
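
To see which tokenizer rule produced each token, spaCy offers a debugging helper, `nlp.tokenizer.explain`. The following is a minimal sketch, assuming spaCy 2.2.3 or later. It uses a blank English pipeline stored under the illustrative name `blank_nlp`, so the notebook's `nlp` is left untouched. Tokenization is rule-based, so no trained model is needed for this step.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because tokenization is rule-based. `blank_nlp` is an illustrative name.
blank_nlp = spacy.blank('en')

text = "We're here to help! Email support@oursite.com"

# explain() returns (rule_name, token_text) pairs in token order
explained = blank_nlp.tokenizer.explain(text)
for rule, token_text in explained:
    print(f'{token_text:<22}{rule}')
```

The token texts reported by `explain` should match the tokens of the Doc itself, which makes it a handy way to check why a string was or wasn't split.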

In [6]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


Here the distance unit and the dollar sign are assigned their own tokens, yet the dollar amount is preserved.

## Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [7]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


Here the abbreviations for "Saint" and "United States" are both preserved.
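
The list of exceptions is not fixed: the tokenizer accepts additional special cases at runtime. The sketch below assumes a blank English pipeline (illustrative name `blank_nlp`) and uses a made-up rule that splits "gimme" into two tokens. The rule itself is only an example, not something this notebook relies on.

```python
import spacy
from spacy.symbols import ORTH

blank_nlp = spacy.blank('en')  # illustrative name; tokenizer-only pipeline

# Before adding a rule, 'gimme' is a single token
before = [t.text for t in blank_nlp('gimme that')]

# Register a special case: the pieces must join back into the original string
blank_nlp.tokenizer.add_special_case('gimme', [{ORTH: 'gim'}, {ORTH: 'me'}])
after = [t.text for t in blank_nlp('gimme that')]

print(before)
print(after)
```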

## Counting Tokens
Each Doc object holds a fixed number of tokens, which len() returns:

In [8]:
len(doc4)

11

## Counting Vocab Entries
A Vocab object stores the lexemes (vocabulary entries) that the language library and its Docs share:

In [9]:
len(doc.vocab)

512

NOTE: This number depends on the language library loaded at the start and on any new lexemes added to the vocab when Docs were created.
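
The effect of new lexemes can be observed directly. This is a small sketch under two assumptions: a blank English pipeline (illustrative name `blank_nlp`) and a made-up word that is not already in the vocab. The exact counts depend on the spaCy version, so none are shown.

```python
import spacy

blank_nlp = spacy.blank('en')  # illustrative name; not the notebook's `nlp`

made_up_word = 'flibbertigibbetish'  # hypothetical string, unlikely to be known

size_before = len(blank_nlp.vocab)
blank_nlp(f'This is {made_up_word}.')  # tokenizing adds unseen lexemes
size_after = len(blank_nlp.vocab)

print(size_before, size_after)
```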

## Tokens can be retrieved by index position and slice
Doc objects can be thought of as lists of token objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [10]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [11]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [12]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.
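
Indexing and slicing return different object types: an integer index gives a Token, while a slice gives a Span that still points into the original Doc. A minimal sketch, assuming a blank English pipeline (illustrative names `blank_nlp`, `sample`, `single` and `window`):

```python
import spacy
from spacy.tokens import Span, Token

blank_nlp = spacy.blank('en')  # tokenizer-only pipeline
sample = blank_nlp('It is better to give than to receive.')

single = sample[2]      # integer index -> Token
window = sample[2:5]    # slice -> Span (a view onto the Doc, not a plain list)

print(type(single).__name__, '|', single.text)
print(type(window).__name__, '|', window.text, '|', window.start, window.end)
```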

## Tokens cannot be reassigned
Although Doc objects can be considered lists of tokens, they do not support item reassignment:

In [13]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [14]:
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
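
If a modified sentence is needed, one option is to build a new Doc rather than edit the existing one. The sketch below is one possible workaround, not the only one. It assumes a blank English pipeline (illustrative name `blank_nlp`), catches the error, then constructs a fresh Doc from the edited token texts.

```python
import spacy
from spacy.tokens import Doc

blank_nlp = spacy.blank('en')  # tokenizer-only pipeline
original = blank_nlp('My dinner was horrible.')

try:
    original[3] = 'delicious'
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print('TypeError:', err)

# Workaround sketch: copy the token texts, edit the copy, build a new Doc
words = [t.text for t in original]
spaces = [bool(t.whitespace_) for t in original]  # is each token followed by a space?
words[3] = 'delicious'

revised = Doc(blank_nlp.vocab, words=words, spaces=spaces)
print(revised.text)
```

A Doc built this way carries no tags, parse, or entities. To get those, pass the new text through a full pipeline again, for example `nlp(revised.text)`.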

## Named Entities
Going a step beyond tokens, named entities add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents property of a Doc object.

In [16]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')



Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [20]:
for entity in doc8.ents:
    print(entity)
    print(entity.label_)
    print('\n')

Apple
ORG


Hong Kong
GPE


$6 million
MONEY




In [21]:
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [22]:
len(doc8.ents)

3

## Noun Chunks
Like Doc.ents, Doc.noun_chunks is another property of a Doc object. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in Sheb Wooley's 1958 song, a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [23]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [24]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [25]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


## Built-in Visualizers
spaCy includes a built-in visualization tool called displaCy. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

In [26]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

## Visualizing the entity recognizer

In [27]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

## Creating Visualizations Outside of Jupyter

In [28]:
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
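
displacy.serve blocks until the server is stopped. When only the markup is needed, for example to save it to a file, displacy.render can return it as a string instead. The sketch below makes several assumptions: a blank English pipeline (illustrative name `blank_nlp`), an entity set by hand because a blank pipeline has no entity recognizer, and a hypothetical output file name `entities.html`.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

blank_nlp = spacy.blank('en')  # tokenizer-only pipeline
sample = blank_nlp('Apple is going to build a U.K. factory.')

# Assumption: the entity is set by hand because a blank pipeline has no
# entity recognizer; a trained model would fill in sample.ents itself.
sample.ents = [Span(sample, 0, 1, label='ORG')]

# jupyter=False makes render() return the markup as a string;
# page=True wraps it in a complete HTML page
html = displacy.render(sample, style='ent', jupyter=False, page=True)

with open('entities.html', 'w', encoding='utf-8') as f:  # hypothetical file name
    f.write(html)
```

Opening the saved file in a browser should show the same highlighted entities that the notebook renders inline.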
