## Tokenization Basics

Walkthrough for the blog:

### Loading the spaCy library and the Language object

In [1]:
import spacy
from spacy import displacy

In [2]:
# Load the small English pipeline; spacy.load returns a Language object
nlp = spacy.load('en_core_web_sm')

Viewing the contents within our **nlp pipeline**

In [3]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x1fa93b19308>),
 ('parser', <spacy.pipeline.DependencyParser at 0x1fa93b00c48>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x1fa93b1b228>)]

Creating a document object

In [4]:
doc = nlp(u"Here is our new fancy document. It's not very complex, but it will get the job done.")

Viewing each token within the document object

In [5]:
for token in doc:
    print(token.text)

Here
is
our
new
fancy
document
.
It
's
not
very
complex
,
but
it
will
get
the
job
done
.


In [6]:
# Viewing the length of the document object
len(doc)

21

Now we will create a function that prints out each token's text, part of speech, and syntactic dependency.

In [7]:
def doc_breakdown(doc):
    for token in doc:
        print(f"Actual text: {token.text:{10}} "
              f"Part of Speech: {token.pos_:{10}} "
              f"Syntactic dependency: {token.dep_:{10}}")

In [8]:
doc_breakdown(doc)

Actual text: Here       Part of Speech: ADV        Syntactic dependency: advmod    
Actual text: is         Part of Speech: VERB       Syntactic dependency: ROOT      
Actual text: our        Part of Speech: ADJ        Syntactic dependency: poss      
Actual text: new        Part of Speech: ADJ        Syntactic dependency: amod      
Actual text: fancy      Part of Speech: ADJ        Syntactic dependency: amod      
Actual text: document   Part of Speech: NOUN       Syntactic dependency: nsubj     
Actual text: .          Part of Speech: PUNCT      Syntactic dependency: punct     
Actual text: It         Part of Speech: PRON       Syntactic dependency: nsubj     
Actual text: 's         Part of Speech: VERB       Syntactic dependency: ROOT      
Actual text: not        Part of Speech: ADV        Syntactic dependency: neg       
Actual text: very       Part of Speech: ADV        Syntactic dependency: advmod    
Actual text: complex    Part of Speech: ADJ        Syntactic dependency: aco
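
The aligned columns above come from the nested braces in `doc_breakdown`: in `{token.text:{10}}` the inner `{10}` is evaluated first and becomes the field width. Here is a minimal sketch of that formatting on its own, using plain strings instead of tokens. The helper name `format_row` is ours, for illustration only.

```python
def format_row(text, pos, dep, width=10):
    """Pad each field to `width` characters, as doc_breakdown does with {10}."""
    # The inner braces are evaluated first, so {text:{width}} becomes {text:10}.
    return f"{text:{width}} {pos:{width}} {dep:{width}}"


# Strings are left-aligned by default; integers are right-aligned.
print(repr(format_row("is", "VERB", "ROOT")))
print(repr(f"{42:{6}}"))        # integer, padded on the left
print(repr(f"{'is':>{6}}"))     # '>' forces right alignment for a string
```

The default alignment depends on the type, which is why integer columns line up on the right unless we ask otherwise.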

Doc objects behave like sequences: we can index into them and take their length. They are not Python lists, however, and they do not support item assignment.

In [9]:
doc[0]

Here

In [10]:
try:
    doc[0] = 'Hear'
except TypeError as e:
    print(e)

'spacy.tokens.doc.Doc' object does not support item assignment
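
To see why indexing works while assignment fails, here is a toy read-only sequence in plain Python. It is not how spaCy implements `Doc`; the class name `ReadOnlyTokens` is made up for this sketch. It only shows that a class can define `__getitem__` and `__len__` without `__setitem__`.

```python
class ReadOnlyTokens:
    """Toy sequence: supports len(), indexing and slicing, but not assignment."""

    def __init__(self, words):
        self._words = tuple(words)

    def __len__(self):
        return len(self._words)

    def __getitem__(self, index):
        # Delegating to the tuple makes both tokens[0] and tokens[1:3] work.
        return self._words[index]


tokens = ReadOnlyTokens(["Here", "is", "our", "new", "fancy", "document", "."])
print(tokens[0], len(tokens), tokens[1:3])

try:
    tokens[0] = "Hear"  # no __setitem__ is defined, so Python raises TypeError
except TypeError as err:
    print(err)
```

One difference to keep in mind: slicing this toy class returns a tuple of strings, whereas slicing a real Doc returns a spaCy Span object.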


If we are not sure what a tag or label means, we can pass it to the spacy.explain() function to get a short definition.

In [11]:
print(f"Part of Speech: \n{doc[0].pos_} = {spacy.explain(doc[0].pos_)}\n\n"
     f"Syn. Dependency: \n{doc[0].dep_} = {spacy.explain(doc[0].dep_)}")

Part of Speech: 
ADV = adverb

Syn. Dependency: 
advmod = adverbial modifier


We can also directly pass in a string of the tag or label we want to be explained.

In [12]:
spacy.explain('advmod')

'adverbial modifier'
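
As far as we know, `spacy.explain()` returns `None` for a term it does not recognize, so it helps to fall back to the raw label when printing. The sketch below takes the lookup function as an argument; the helper `describe` and the `toy_glossary` dictionary are our own stand-ins so the example runs without spaCy installed.

```python
def describe(label, explain):
    """Return 'label = description', or the bare label if no description exists.

    `explain` is any callable that maps a label to a string or None.
    """
    description = explain(label)
    return f"{label} = {description}" if description else label


# Stand-in glossary so the sketch runs without spaCy installed.
toy_glossary = {"ADV": "adverb", "advmod": "adverbial modifier"}

print(describe("advmod", toy_glossary.get))
print(describe("made_up_tag", toy_glossary.get))
```

In the notebook, we would call `describe(doc[0].dep_, spacy.explain)` instead of passing the toy glossary.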

### Understanding Named Entities in text

Named entities build on tokens. If we check the contents of the nlp pipeline above, we see that it contains a **'ner'** component, the named entity recognizer. It predicts which spans of text are organization names, locations, monetary values, dates, and so on. Named entities are accessible through the **.ents** property of a Doc object.

In [13]:
# Creating a new document
doc2 = nlp(u"Tesla Company will pay $750,000 and build a solar roof to settle dozens of \
air-quality violations at its Fremont factory.")

In [14]:
print(doc2)

Tesla Company will pay $750,000 and build a solar roof to settle dozens of air-quality violations at its Fremont factory.


For each named entity found in doc2, we will print out the text of the named entity, the tag/label the pipeline predicts it to be, and an explanation of the tag/label.

In [15]:
for entity in doc2.ents:
    print(f"Entity: {entity}\nLabel: {entity.label_}\nLabel Explanation: {spacy.explain(entity.label_)}\n")

Entity: Tesla Company
Label: ORG
Label Explanation: Companies, agencies, institutions, etc.

Entity: 750,000
Label: MONEY
Label Explanation: Monetary values, including unit

Entity: dozens
Label: CARDINAL
Label Explanation: Numerals that do not fall under another type

Entity: Fremont
Label: GPE
Label Explanation: Countries, cities, states
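
Once a document has more than a handful of entities, it is often easier to read them grouped by label. The sketch below works on plain `(text, label)` pairs. The helper `group_by_label` is our own, and the sample pairs are typed in by hand from the output above.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (entity_text, label) pairs into {label: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# The pairs printed above for doc2, typed in by hand.
found = [("Tesla Company", "ORG"), ("750,000", "MONEY"),
         ("dozens", "CARDINAL"), ("Fremont", "GPE")]
print(group_by_label(found))
```

In the notebook, the same helper could be fed directly from the document with `group_by_label((ent.text, ent.label_) for ent in doc2.ents)`.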



### Using displacy With the Experimental Jupyter Parameter

Another tool worth covering is the **displacy module** within the spaCy library. **displaCy** is spaCy's built-in visualizer for dependency parses and named entities, and it lets us check our model's predictions. We can pass in one or more Doc objects and start a web server, export HTML files, or view the visualization directly in a Jupyter Notebook. Since we are using a Jupyter Notebook for this blog, we will view our visualizations directly in the notebook.
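
For the HTML export route, a small helper that writes markup to disk is enough. This is a sketch under the assumption that we already hold the rendered markup as a string. The name `save_html` is ours, and the placeholder markup stands in for real displacy output.

```python
import tempfile
from pathlib import Path


def save_html(markup, path):
    """Write rendered markup to `path` and return the resolved Path."""
    target = Path(path)
    target.write_text(markup, encoding="utf-8")
    return target.resolve()


# Demo with placeholder markup in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_html("<div>placeholder markup</div>", Path(tmp) / "entities.html")
    print(saved.name, saved.read_text(encoding="utf-8"))
```

To our understanding, calling `displacy.render` with `jupyter=False` returns the markup as a string instead of displaying it, so its result could be passed straight to `save_html`.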

In [16]:
displacy.render(doc2, style='ent', jupyter=True)

In [17]:
colors = {"MONEY": "lightgreen",
          "CARDINAL": "linear-gradient(180deg, yellow, orange)"
         }

options = {"colors": colors, 
          "ents": ["MONEY", "CARDINAL"]}

displacy.render(doc2, style='ent', jupyter=True, options=options)

In [18]:
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)", 
          "MONEY": "linear-gradient(45deg, lightgreen, white)",
          "CARDINAL": "linear-gradient(180deg, yellow, orange)",
          "GPE": "lightblue"}
options = {"ents": ["ORG", "MONEY", "CARDINAL", "GPE"], "colors": colors}

displacy.render(doc2, style='ent', jupyter=True, options=options)
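
In the two cells above, the `"ents"` list and the `"colors"` dictionary repeat the same labels, so they can drift apart when we edit one and forget the other. A small helper can derive both from a single dictionary. `make_ent_options` is a name we made up for this sketch; the `"ents"` and `"colors"` keys are the ones used above.

```python
def make_ent_options(colors):
    """Build a displacy-style options dict that shows only the colored labels."""
    # dict preserves insertion order, so "ents" lists labels in the order given.
    return {"ents": list(colors), "colors": dict(colors)}


options = make_ent_options({
    "ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
    "MONEY": "linear-gradient(45deg, lightgreen, white)",
    "CARDINAL": "linear-gradient(180deg, yellow, orange)",
    "GPE": "lightblue",
})
print(options["ents"])
```

The resulting dictionary has the same shape as the hand-written one, so it can be passed to `displacy.render` through the `options` argument in the same way.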

### The Dependency Style

In [19]:
doc3 = nlp(u"SpaCy Basics: The Importance of Tokens in Natural Language Processing")

In [20]:
displacy.render(doc3, style='dep', jupyter=True)

In [34]:
options = {
    'bg': 'linear-gradient(180deg, orange, #FEE715FF)',
    'color': 'black',
    'font': 'Verdana'
}

In [35]:
displacy.render(doc3, style='dep', options=options, jupyter=True)

### Extra Stuff

In [36]:
# Create a Doc object with a unicode string (u-string)
doc4 = nlp(u"SpaCy is a library for advanced Natural Language Processing in Python \
and Cython. It's built on the very latest research, and was designed from day \
one to be used in real products. SpaCy comes with pretrained pipelines and currently \
supports tokenization and training for 60+ languages. It features state-of-the-art \
speed and neural network models for tagging, parsing, named entity recognition, \
text classification and more, multi-task learning with pretrained transformers \
like BERT, as well as a production-ready training system and easy model packaging, \
deployment and workflow management. SpaCy is commercial open-source software, released \
under the MIT license.")

Tokens are the basic building blocks of a Doc object. Everything that helps us comprehend the meaning of a text is derived from the tokens and the relationships between them.

spaCy is also able to detect and separate sentences in a Doc object.

In [37]:
for i, sentence in enumerate(doc4.sents):
    print(f"{i+1}. {sentence}")

1. SpaCy is a library for advanced Natural Language Processing in Python and Cython.
2. It's built on the very latest research, and was designed from day one to be used in real products.
3. SpaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages.
4. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
5. SpaCy is commercial open-source software, released under the MIT license.


In [38]:
for chunk in doc4.noun_chunks:
    print(chunk)

SpaCy
a library
advanced Natural Language Processing
Python
Cython
It
the very latest research
day
real products
SpaCy
pretrained pipelines
tokenization
training
60+ languages
It
the-art
neural network models
text classification
more, multi-task learning
pretrained transformers
BERT
a production-ready training system
easy model packaging
deployment
workflow management
SpaCy
commercial open-source software
the MIT license


In [39]:
# No style is passed, so displacy falls back to its default, the dependency view
options = {"compact": True,
           "bg": "#09a3d5",
           "color": "white",
           "distance": 250}
displacy.render(doc2, jupyter=True, options=options)

### Using Lemmatization on Tokens

In [40]:
doc5 = nlp(u"I love to hike, especially on the weekends. I went hiking yesterday with my hiker friends.")

In [41]:
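# Note: this loop runs over doc3 (the title sentence), not doc5
# Columns: text, coarse part of speech, lemma hash (integer), lemma string
for token in doc3:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.lemma:{20}} {token.lemma_:>{10}}")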

SpaCy      PROPN      10639093010105930009      spacy
Basics     NOUN        2744231088585378001      basic
:          PUNCT      11532473245541075862          :
The        DET         7425985699627899538        the
Importance PROPN      14560146131139610552 importance
of         ADP          886050111519832510         of
Tokens     PROPN       8705076186404633026     tokens
in         ADP         3002984154512732771         in
Natural    PROPN       3743574233330547430    natural
Language   PROPN       8740476009882919263   language
Processing PROPN      10935198773122488114 processing
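
The long numbers in the third column are `token.lemma`, an integer ID, while `token.lemma_` in the last column is the readable string. spaCy stores each string once and refers to it by a hash. The toy class below imitates that two-way lookup in plain Python. `ToyStringStore` is our own illustration, and the blake2b hash is a stand-in, not the hash function spaCy uses.

```python
import hashlib


class ToyStringStore:
    """Two-way lookup between strings and 64-bit integer IDs (illustration only)."""

    def __init__(self):
        self._by_id = {}

    def add(self, string):
        # blake2b with an 8-byte digest is a stand-in hash, not what spaCy uses.
        digest = hashlib.blake2b(string.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = string
        return key

    def __getitem__(self, key):
        return self._by_id[key]


store = ToyStringStore()
lemma_id = store.add("importance")
print(lemma_id, store[lemma_id])
```

The same string always maps to the same ID, which is what makes comparing integers a cheap substitute for comparing strings.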


### Coarse- Versus Fine-Grained Part-of-Speech Tags

Fine-grained tags let us see details that the coarse-grained part of speech hides, such as whether a verb is in the present or past tense.

In [42]:
doc5 = nlp(u"I love to hike, especially on the weekends. I went hiking yesterday with my hiker friends.")

In [58]:
for token in doc5:
    print(f"{token.text:{10}} {token.pos_:{6}} {token.tag:<{25}} {token.tag_:{10}} {spacy.explain(token.tag_)}")

I          PRON   13656873538139661788      PRP        pronoun, personal
love       VERB   9188597074677201817       VBP        verb, non-3rd person singular present
to         PART   5595707737748328492       TO         infinitival to
hike       VERB   14200088355797579614      VB         verb, base form
,          PUNCT  2593208677638477497       ,          punctuation mark, comma
especially ADV    164681854541413346        RB         adverb
on         ADP    1292078113972184607       IN         conjunction, subordinating or preposition
the        DET    15267657372422890137      DT         determiner
weekends   NOUN   783433942507015291        NNS        noun, plural
.          PUNCT  12646065887601541794      .          punctuation mark, sentence closer
I          PRON   13656873538139661788      PRP        pronoun, personal
went       VERB   17109001835818727656      VBD        verb, past tense
hiking     NOUN   15308085513773655218      NN         noun, singular or mass
yesterday
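
The coarse tag `VERB` covers love, hike, and went alike; only the fine-grained tag tells them apart. Below is a minimal sketch of a lookup for the Penn Treebank-style verb tags. The `VERB_FORMS` table and the `verb_form` helper are ours, and the sample pairs are copied by hand from the output above.

```python
VERB_FORMS = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "present, non-3rd person singular",
    "VBZ": "present, 3rd person singular",
}


def verb_form(tag):
    """Return a short description for a Penn Treebank verb tag, else None."""
    return VERB_FORMS.get(tag)


# (text, fine-grained tag) pairs copied from the output above.
for text, tag in [("love", "VBP"), ("hike", "VB"), ("went", "VBD"), ("weekends", "NNS")]:
    print(f"{text:{10}} {tag:{5}} {verb_form(tag)}")
```

In the notebook, the same lookup could be applied to real tokens with `verb_form(token.tag_)`.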

In [64]:
# Count the tokens in doc4 by coarse-grained part of speech.
# count_by returns a dictionary keyed by integer attribute IDs, not strings.
doc4_dict = doc4.count_by(spacy.attrs.POS)

In [79]:
# Viewing coarse-grained part-of-speech
for k, v in doc4_dict.items():
    print(f"{spacy.explain(doc4.vocab[k].text)}: {v}")

punctuation: 19
symbol: 1
verb: 17
adjective: 11
adposition: 13
adverb: 4
coordinating conjunction: 8
determiner: 5
noun: 30
numeral: 2
particle: 1
pronoun: 2
proper noun: 10
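
Because `count_by` keys its result by integer IDs, the loop above has to look each key up in `doc4.vocab` before it can be explained. If we only need the string labels, an alternative is to count them with `collections.Counter`. This sketch uses a hand-typed list of tags rather than a real Doc, and `count_labels` is a helper name of our own.

```python
from collections import Counter


def count_labels(labels):
    """Count string labels and return (label, count) pairs, most frequent first."""
    return Counter(labels).most_common()


# A few coarse-grained tags typed in by hand for illustration.
sample = ["PRON", "VERB", "PART", "VERB", "PUNCT", "NOUN", "PUNCT", "VERB"]
for label, count in count_labels(sample):
    print(f"{label:{6}} {count}")
```

In the notebook, the equivalent call would be `count_labels(token.pos_ for token in doc4)`.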


In [80]:
# Viewing fine-grained tags
doc4_dict = doc4.count_by(spacy.attrs.TAG)
for k, v in doc4_dict.items():
    print(f"{spacy.explain(doc4.vocab[k].text)}: {v}")

adjective, comparative: 1
verb, gerund or present participle: 2
noun, proper singular: 10
verb, 3rd person singular present: 6
determiner: 5
adjective: 9
pronoun, personal: 2
cardinal number: 2
verb, base form: 1
conjunction, subordinating or preposition: 13
punctuation mark, sentence closer: 5
adverb: 4
symbol: 1
adjective, superlative: 1
verb, past tense: 1
conjunction, coordinating: 8
infinitival to: 1
noun, singular or mass: 25
verb, past participle: 7
punctuation mark, comma: 8
noun, plural: 5
punctuation mark, hyphen: 6
