## Tokenization Basics

Walkthrough for the blog:

In [87]:
import spacy
from spacy import displacy

In [100]:
# Load spaCy's small English pipeline (en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

In [132]:
doc = nlp(u"Here is our new fancy document. It's not very complex, but it will get the job done.")

In [133]:
for token in doc:
    print(token.text)

Here
is
our
new
fancy
document
.
It
's
not
very
complex
,
but
it
will
get
the
job
done
.


In [134]:
len(doc)

21
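
The count of 21 is higher than the 17 whitespace-separated chunks in the sentence because each punctuation mark and the contraction suffix 's becomes its own token. As a rough illustration only, the sketch below reproduces that split with a regular expression. The pattern and the `rough_tokenize` helper are hypothetical and far simpler than spaCy's tokenizer, which relies on language-specific rules and exceptions.

```python
import re

# Hypothetical pattern: the clitic 's, a run of word characters, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")

def rough_tokenize(text):
    """Split text into word, clitic and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

text = "Here is our new fancy document. It's not very complex, but it will get the job done."
tokens = rough_tokenize(text)

print(len(text.split()))  # whitespace-separated chunks: 17
print(len(tokens))        # tokens once punctuation and 's are split off: 21
print(tokens[5:9])        # ['document', '.', 'It', "'s"]
```

A pattern like this breaks down quickly on abbreviations, URLs or other contractions, which is a good reason to rely on the library's tokenizer for real text.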

In [139]:
def doc_breakdown(doc):
    for token in doc:
        print(f"Actual text: {token.text:{10}} "
              f"Part of Speech: {token.pos_:{10}} "
              f"Syntactic dependency: {token.dep_:{10}}")

In [140]:
doc_breakdown(doc)

Actual text: Here       Part of Speech: ADV        Syntactic dependency: advmod    
Actual text: is         Part of Speech: VERB       Syntactic dependency: ROOT      
Actual text: our        Part of Speech: ADJ        Syntactic dependency: poss      
Actual text: new        Part of Speech: ADJ        Syntactic dependency: amod      
Actual text: fancy      Part of Speech: ADJ        Syntactic dependency: amod      
Actual text: document   Part of Speech: NOUN       Syntactic dependency: nsubj     
Actual text: .          Part of Speech: PUNCT      Syntactic dependency: punct     
Actual text: It         Part of Speech: PRON       Syntactic dependency: nsubj     
Actual text: 's         Part of Speech: VERB       Syntactic dependency: ROOT      
Actual text: not        Part of Speech: ADV        Syntactic dependency: neg       
Actual text: very       Part of Speech: ADV        Syntactic dependency: advmod    
Actual text: complex    Part of Speech: ADJ        Syntactic dependency: aco
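
The `{token.text:{10}}` fields in `doc_breakdown` use a nested format specification: the inner `{10}` is evaluated first and becomes the field width, so every value is padded to ten characters and the columns line up. Here is a minimal sketch of the same idea, with plain strings standing in for token attributes (the `rows` data and the `width` variable are made up for illustration):

```python
# Plain strings stand in for token.text and token.pos_ (illustrative data only).
rows = [("Here", "ADV"), ("document", "NOUN"), (".", "PUNCT")]
width = 10

lines = []
for text, pos in rows:
    # {text:{width}} pads to `width` characters; strings are left-aligned by default.
    # {pos:>{width}} pads to the same width but right-aligns the value.
    lines.append(f"{text:{width}}|{pos:>{width}}|")

print("\n".join(lines))
```

Values longer than the width are not truncated; they simply push the following columns to the right.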

In [101]:
banner = nlp(u"Tokenization Basics: The Building Blocks of Natural Language Processing")

In [102]:
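# Styling options for the dependency visualization
options = {'distance': 100,
           'bg': 'linear-gradient(180deg, orange, #FEE715FF)',
           'color': 'black',
           'font': 'Verdana'}
# With jupyter=True the figure is displayed inline, so there is no return value worth keeping
displacy.render(banner, style='dep', jupyter=True, options=options)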

In [3]:
# Create a Doc object from a string (the u prefix is optional in Python 3)
doc = nlp(u"SpaCy is a library for advanced Natural Language Processing in Python \
and Cython. It's built on the very latest research, and was designed from day \
one to be used in real products. SpaCy comes with pretrained pipelines and currently \
supports tokenization and training for 60+ languages. It features state-of-the-art \
speed and neural network models for tagging, parsing, named entity recognition, \
text classification and more, multi-task learning with pretrained transformers \
like BERT, as well as a production-ready training system and easy model packaging, \
deployment and workflow management. SpaCy is commercial open-source software, released \
under the MIT license.")

Tokens are the basic building blocks of a Doc object. Everything that helps us understand the meaning of the text is derived from the tokens and the relationships between them.

In [6]:
# Print the first nine tokens with their part of speech and dependency label
for token in doc[:9]:
    print(f"Actual text: {token.text:{10}} Part of Speech: {token.pos_:{10}} "
          f"Syntactic dependency: {token.dep_:{10}}")

Actual text: SpaCy      Part of Speech: PROPN      Syntactic dependency: nsubj     
Actual text: is         Part of Speech: VERB       Syntactic dependency: ROOT      
Actual text: a          Part of Speech: DET        Syntactic dependency: det       
Actual text: library    Part of Speech: NOUN       Syntactic dependency: attr      
Actual text: for        Part of Speech: ADP        Syntactic dependency: prep      
Actual text: advanced   Part of Speech: ADJ        Syntactic dependency: amod      
Actual text: Natural    Part of Speech: PROPN      Syntactic dependency: compound  
Actual text: Language   Part of Speech: PROPN      Syntactic dependency: compound  
Actual text: Processing Part of Speech: PROPN      Syntactic dependency: pobj      


Check which components are currently in the nlp pipeline:

In [131]:
for item in nlp.pipeline:
    print(item)

('tagger', <spacy.pipeline.Tagger object at 0x0000016124840708>)
('parser', <spacy.pipeline.DependencyParser object at 0x0000016124834948>)
('ner', <spacy.pipeline.EntityRecognizer object at 0x0000016124834EE8>)


In [44]:
nlp.pipe_names

['tagger', 'parser', 'ner']

If we are not sure what a part-of-speech tag or syntactic dependency label stands for, we can call spacy.explain() to get a short description:

In [45]:
print(f"Part of speech: {spacy.explain(doc[0].pos_)}\n"
      f"Syntactic Dependency: {spacy.explain(doc[0].dep_)}")

Part of speech: proper noun
Syntactic Dependency: nominal subject


spaCy can also detect sentence boundaries: iterating over doc.sents yields each sentence in a Doc object as a Span.

In [51]:
for i, sentence in enumerate(doc.sents):
    print(f"{i+1}. {sentence}")

1. SpaCy is a library for advanced Natural Language Processing in Python and Cython.
2. It's built on the very latest research, and was designed from day one to be used in real products.
3. SpaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages.
4. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
5. SpaCy is commercial open-source software, released under the MIT license.


### Understanding Entities in Text

In [110]:
doc2 = nlp(u"Tesla Company will pay $750,000 and build a solar roof to settle dozens of \
air-quality violations at its Fremont factory.")

In [111]:
print(doc2)

Tesla Company will pay $750,000 and build a solar roof to settle dozens of air-quality violations at its Fremont factory.


In [112]:
for entity in doc2.ents:
    print(f"Entity: {entity}\nLabel: {entity.label_}\nLabel Explanation: {spacy.explain(entity.label_)}\n")

Entity: Tesla Company
Label: ORG
Label Explanation: Companies, agencies, institutions, etc.

Entity: 750,000
Label: MONEY
Label Explanation: Monetary values, including unit

Entity: dozens
Label: CARDINAL
Label Explanation: Numerals that do not fall under another type

Entity: Fremont
Label: GPE
Label Explanation: Countries, cities, states



In [113]:
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)", 
          "MONEY": "linear-gradient(45deg, lightgreen, white)",
          "CARDINAL": "linear-gradient(180deg, yellow, orange)",
          "GPE": "lightblue"}
options = {"ents": ["ORG", "MONEY", "CARDINAL", "GPE"], "colors": colors}

displacy.render(doc2, style='ent', jupyter=True, options=options)

In [114]:
for chunk in doc2.noun_chunks:
    print(chunk)

Tesla Company
a solar roof
dozens
air-quality violations
its Fremont factory


In [115]:
options = {"compact": True, "bg": "#09a3d5",
           "color": "white"}
displacy.render(doc2, jupyter=True, options=options)

### Using Lemmatization on Tokens

In [127]:
doc3 = nlp(u"I love to hike, especially on the weekends. I went hiking yesterday with my hiker friends.")
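
The next cell prints two attributes for each lemma: token.lemma, a large integer, and token.lemma_, the readable string. spaCy stores strings as 64-bit hash values and resolves them back to text through a shared string table, which is why identical lemmas always show the same integer. The toy class below sketches that idea only. `ToyStringStore` is a hypothetical name, and BLAKE2b is an arbitrary hash chosen for illustration, so it will not reproduce spaCy's integers.

```python
import hashlib

class ToyStringStore:
    """Hypothetical two-way string table; not spaCy's actual implementation."""

    def __init__(self):
        self._text_by_key = {}

    def add(self, text):
        # Derive a 64-bit integer key from the text (BLAKE2b is an arbitrary choice here).
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._text_by_key[key] = text
        return key

    def __getitem__(self, key):
        # Resolve an integer key back to the text it was derived from.
        return self._text_by_key[key]

store = ToyStringStore()
first = store.add("hike")
second = store.add("hike")
other = store.add("go")

print(first == second)  # the same text always maps to the same key
print(store[other])     # the key resolves back to its text
```

Comparing integers is cheaper than comparing strings, which is one plausible reason a library would keep both forms around.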

In [128]:
for token in doc3:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.lemma:{20}} {token.lemma_:>{10}}")

I          PRON         561228191312463089     -PRON-
love       VERB        3702023516439754181       love
to         PART        3791531372978436496         to
hike       VERB       13848590707136088471       hike
,          PUNCT       2593208677638477497          ,
especially ADV        13751905263548122051 especially
on         ADP         5640369432778651323         on
the        DET         7425985699627899538        the
weekends   NOUN        6105388159866233819    weekend
.          PUNCT      12646065887601541794          .
I          PRON         561228191312463089     -PRON-
went       VERB        8004577259940138793         go
hiking     NOUN       17096437952121612491     hiking
yesterday  NOUN        1756787072497230782  yesterday
with       ADP        12510949447758279278       with
my         ADJ          561228191312463089     -PRON-
hiker      NOUN        8618649641654341037      hiker
friends    NOUN       16302678419497547123     friend
.          PUNCT      126460