# Using spaCy for Natural Language Processing

In [1]:
import spacy

Load your language model. "en" loads the small English model "en_core_web_sm", which is already installed.

In [2]:
nlp = spacy.load('en')

`vocab` is a storage class for the vocabulary and other data shared across a language.

In [3]:
nlp.vocab.length

478

In [4]:
nlp.vocab[0].text
# This statement will print out text, if there is an equivalent string for the hash value at that index of the vocab
# If you remove .text, it prints the hash value at the location
# 454 is the last index which can print a string

''
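The relationship between hash values and strings described in the comments above can be pictured with a toy two-way table. This is only a sketch: `ToyStringStore`, its methods and the choice of `hashlib.blake2b` are assumptions made for illustration, and spaCy's own hashing and storage differ in detail. Returning an empty string for an unknown key is a simplification chosen to echo the `''` output above.

```python
import hashlib


class ToyStringStore:
    """Toy two-way table between strings and integer keys (not spaCy's implementation)."""

    def __init__(self):
        self._text_by_key = {}

    def add(self, text):
        # Derive a stable 64-bit integer from the text; the hash function is an arbitrary choice
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._text_by_key[key] = text
        return key

    def text_for(self, key):
        # Keys that were never added have no string behind them
        return self._text_by_key.get(key, "")


store = ToyStringStore()
key = store.add("prairies")
print(key, repr(store.text_for(key)))
print(repr(store.text_for(0)))
```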

**Bonus:** *Can you find out whether the first index of vocab contains a number? A letter? Use `nlp.vocab[0].is_digit` and `nlp.vocab[0].is_alpha`.*

## Exploring the pipeline

In [5]:
nlp.pipe_names

['tagger', 'parser', 'ner']

I've used the first line from Wizard of Oz. The article will explain some topics using the results from this, and other texts. Feel free to play around with them later!

In [6]:
text = "Dorothy lived in the midst of the great Kansas prairies, with Uncle Harry, who was a farmer, and Aunt Em, who was the farmer's wife."
doc_oz = nlp(text)

Tokens

In [7]:
[token.text for token in doc_oz]

['Dorothy',
 'lived',
 'in',
 'the',
 'midst',
 'of',
 'the',
 'great',
 'Kansas',
 'prairies',
 ',',
 'with',
 'Uncle',
 'Harry',
 ',',
 'who',
 'was',
 'a',
 'farmer',
 ',',
 'and',
 'Aunt',
 'Em',
 ',',
 'who',
 'was',
 'the',
 'farmer',
 "'s",
 'wife',
 '.']

### Part of Speech

In [8]:
[(token.text, token.pos_, spacy.explain(token.pos_)) for token in doc_oz]

[('Dorothy', 'PROPN', 'proper noun'),
 ('lived', 'VERB', 'verb'),
 ('in', 'ADP', 'adposition'),
 ('the', 'DET', 'determiner'),
 ('midst', 'NOUN', 'noun'),
 ('of', 'ADP', 'adposition'),
 ('the', 'DET', 'determiner'),
 ('great', 'ADJ', 'adjective'),
 ('Kansas', 'PROPN', 'proper noun'),
 ('prairies', 'NOUN', 'noun'),
 (',', 'PUNCT', 'punctuation'),
 ('with', 'ADP', 'adposition'),
 ('Uncle', 'PROPN', 'proper noun'),
 ('Harry', 'PROPN', 'proper noun'),
 (',', 'PUNCT', 'punctuation'),
 ('who', 'PRON', 'pronoun'),
 ('was', 'AUX', 'auxiliary'),
 ('a', 'DET', 'determiner'),
 ('farmer', 'NOUN', 'noun'),
 (',', 'PUNCT', 'punctuation'),
 ('and', 'CCONJ', 'coordinating conjunction'),
 ('Aunt', 'PROPN', 'proper noun'),
 ('Em', 'PROPN', 'proper noun'),
 (',', 'PUNCT', 'punctuation'),
 ('who', 'PRON', 'pronoun'),
 ('was', 'AUX', 'auxiliary'),
 ('the', 'DET', 'determiner'),
 ('farmer', 'NOUN', 'noun'),
 ("'s", 'PART', 'particle'),
 ('wife', 'NOUN', 'noun'),
 ('.', 'PUNCT', 'punctuation')]

In [9]:
#frequency of POS tags
pos_freq = doc_oz.count_by(spacy.attrs.POS)
for k,v in sorted(pos_freq.items()):
    print(f'{doc_oz.vocab[k].text:{6}}: {v}')

ADJ   : 1
ADP   : 3
AUX   : 2
CCONJ : 1
DET   : 4
NOUN  : 5
PART  : 1
PRON  : 2
PROPN : 6
PUNCT : 5
VERB  : 1
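As a cross-check on the counts above, the same tally can be reproduced without spaCy. The sketch below assumes only the coarse tags printed by cell [8], copied by hand in token order, and counts them with `collections.Counter`.

```python
from collections import Counter

# Coarse POS tags copied in token order from the output of cell [8]
tags = [
    "PROPN", "VERB", "ADP", "DET", "NOUN", "ADP", "DET", "ADJ", "PROPN", "NOUN",
    "PUNCT", "ADP", "PROPN", "PROPN", "PUNCT", "PRON", "AUX", "DET", "NOUN",
    "PUNCT", "CCONJ", "PROPN", "PROPN", "PUNCT", "PRON", "AUX", "DET", "NOUN",
    "PART", "NOUN", "PUNCT",
]

pos_counts = Counter(tags)
for tag, count in sorted(pos_counts.items()):
    print(f"{tag:6}: {count}")
```

`doc_oz.count_by(spacy.attrs.POS)` returns the same information keyed by integer IDs, which is why the cell above looks each key up in the vocab before printing it.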


**Bonus:** *Try finding the POS tags of two sentences that use homonyms, e.g. "Bear with me." and "I saw a bear today." What POS does the word "bear" have in each sentence?*

In [10]:
doc1 = nlp("")
doc2 = nlp("")

for token in doc1:
  print(token.text, token.pos_, spacy.explain(token.pos_))
print("-------------")
for token in doc2:
  print(token.text, token.pos_, spacy.explain(token.pos_))

-------------


### Parser - Dependency parsing, sentence boundary detection and chunking

Sentence boundary

In [11]:
doc_two = nlp("The Bible is a religious book. A Tale of Two Cities was written by Charles Dickens and first published in 1859. It is set in London and Paris before and during the French Revolution.")
for sent in doc_two.sents:
  print(sent)

The Bible is a religious book.
A Tale of Two Cities was written by Charles Dickens and first published in 1859.
It is set in London and Paris before and during the French Revolution.


In [12]:
[(token.text, token.dep_) for token in doc_oz]

[('Dorothy', 'nsubj'),
 ('lived', 'ROOT'),
 ('in', 'prep'),
 ('the', 'det'),
 ('midst', 'pobj'),
 ('of', 'prep'),
 ('the', 'det'),
 ('great', 'amod'),
 ('Kansas', 'compound'),
 ('prairies', 'pobj'),
 (',', 'punct'),
 ('with', 'prep'),
 ('Uncle', 'compound'),
 ('Harry', 'pobj'),
 (',', 'punct'),
 ('who', 'nsubj'),
 ('was', 'relcl'),
 ('a', 'det'),
 ('farmer', 'attr'),
 (',', 'punct'),
 ('and', 'cc'),
 ('Aunt', 'compound'),
 ('Em', 'conj'),
 (',', 'punct'),
 ('who', 'nsubj'),
 ('was', 'relcl'),
 ('the', 'det'),
 ('farmer', 'poss'),
 ("'s", 'case'),
 ('wife', 'attr'),
 ('.', 'punct')]

Importing the visualiser to better understand dependencies

In [13]:
from spacy import displacy

In [14]:
options = {"compact": True, "color": "blue"}
displacy.render(doc_oz, style="dep",jupyter=True, options=options )

A shorter sentence for a clearer understanding of dependency parsing


In [15]:
doc = nlp("The dog walked up the hill.")
[(token.text, token.dep_) for token in doc]

[('The', 'det'),
 ('dog', 'nsubj'),
 ('walked', 'ROOT'),
 ('up', 'prep'),
 ('the', 'det'),
 ('hill', 'pobj'),
 ('.', 'punct')]
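Each dependency label describes the relation between a token and its head. The output above lists labels only, so the head indices in the sketch below are an assumption (the usual reading of this sentence, with the root pointing to itself); the helper names are hypothetical too. It prints the tree as an indented outline in plain Python.

```python
# (text, dependency label, index of the head token)
# The head indices are assumed for illustration; the root is its own head.
tokens = [
    ("The", "det", 1),
    ("dog", "nsubj", 2),
    ("walked", "ROOT", 2),
    ("up", "prep", 2),
    ("the", "det", 5),
    ("hill", "pobj", 3),
    (".", "punct", 2),
]


def children_of(head_index):
    # Every token whose head is head_index, excluding the root's self-loop
    return [i for i, (_, _, head) in enumerate(tokens)
            if head == head_index and i != head_index]


def format_tree(index, depth=0):
    text, dep, _ = tokens[index]
    lines = ["  " * depth + f"{text} ({dep})"]
    for child in children_of(index):
        lines.extend(format_tree(child, depth + 1))
    return lines


root = next(i for i, (_, dep, _) in enumerate(tokens) if dep == "ROOT")
tree_lines = format_tree(root)
print("\n".join(tree_lines))
```

With a real `Doc`, the same information is available through `token.head` and `token.children`.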

In [16]:
doc = nlp("The dog walked up the hill.")
displacy.render(doc, style="dep", jupyter=True)

### NER - Named entity recognition

In [17]:
for entity in doc_oz.ents:
  print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
displacy.render(doc_oz, style="ent", jupyter=True)
# As you can see, Dorothy is not recognised as a named entity

Kansas - GPE - Countries, cities, states
Uncle Harry - PERSON - People, including fictional
Aunt Em - PERSON - People, including fictional


doc_two had more entities

In [18]:
for entity in doc_two.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
displacy.render(doc_two, style="ent", jupyter=True)
# "A Tale of Two Cities" should have been tagged as a WORK_OF_ART

Bible - WORK_OF_ART - Titles of books, songs, etc.
Two - CARDINAL - Numerals that do not fall under another type
Charles Dickens - PERSON - People, including fictional
first - ORDINAL - "first", "second", etc.
1859 - DATE - Absolute or relative dates or periods
London - GPE - Countries, cities, states
Paris - GPE - Countries, cities, states
the French Revolution - EVENT - Named hurricanes, battles, wars, sports events, etc.


**Bonus:** _Try to find some other named entities (e.g. nationalities, non-GPE locations, languages, time, money, quantity and percent)._

In [19]:
doc_ent=nlp("")

for ent in doc_ent.ents:
  print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
displacy.render(doc_ent, style="ent", jupyter=True)


## Adding custom pipe components

### Custom pipeline for ',' sentence boundary

In [20]:
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ",":
            # if the current token is a comma,
            # the next token should be considered the start of a sentence
            doc[token.i+1].is_sent_start = True
    return doc

nlp_custom = spacy.load("en")
nlp_custom.add_pipe(set_custom_boundaries, before="parser")
# Why do you think I put set_custom_boundaries before the parser?

In [21]:
doc_cities = nlp("It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.")
doc_cities_custom = nlp_custom("It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.")

print("WITHOUT CUSTOM: ")
for sent in doc_cities.sents:
  print(sent.text)
print("------------------------")
print("WITH CUSTOM: ")
for sent in doc_cities_custom.sents:
  print(sent.text)

WITHOUT CUSTOM: 
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
------------------------
WITH CUSTOM: 
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing b
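The rule applied by `set_custom_boundaries` can be traced on a plain list of token strings. The sketch below is not spaCy code: `split_on_commas` and the short token list are made up for illustration, and they only mimic the idea of flagging the token after each comma as a sentence start and then grouping tokens accordingly.

```python
def split_on_commas(tokens):
    # The first token always starts a sentence
    is_sent_start = [i == 0 for i in range(len(tokens))]
    # Same rule as set_custom_boundaries: the token after a comma starts a sentence
    for i, tok in enumerate(tokens[:-1]):
        if tok == ",":
            is_sent_start[i + 1] = True

    # Group tokens into sentences using the flags
    sentences, current = [], []
    for tok, starts in zip(tokens, is_sent_start):
        if starts and current:
            sentences.append(current)
            current = []
        current.append(tok)
    if current:
        sentences.append(current)
    return sentences


tokens = ["It", "was", "the", "best", "of", "times", ",",
          "it", "was", "the", "worst", "of", "times", "."]
for sentence in split_on_commas(tokens):
    print(" ".join(sentence))
```

As in the output above, each comma stays attached to the end of the sentence it closes.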

## Comparing similarities

We need to load the larger model to do comparison. This can take 3-5 minutes. You may need to restart the kernel afterwards (Runtime -> Restart runtime or ctrl+M)

You can also use the medium model if you wish. Simply comment out the first line and remove the '#' before the second.

In [22]:
 !python -m spacy download en_core_web_lg

# !python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


If you had to restart the kernel/runtime, remove the '#' before the two import statements (the first two lines).

If you downloaded the medium model, comment out the third line and un-comment the last line.

In [23]:
# import spacy
# from spacy import displacy

nlp = spacy.load('en_core_web_lg')

# nlp = spacy.load('en_core_web_md')

In [24]:
nlp.vocab.length

1340241

OOV stands for "out of vocabulary".

In [25]:
tokens = nlp("apple dog broken banana cat onomatopoeia asdfkj")
# asdfkj is a variant of "keyboard smashing" - a phenomenon which is seen on most social media
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

apple True 7.1346846 False
dog True 7.0336733 False
broken True 5.5968375 False
banana True 6.700014 False
cat True 6.6808186 False
onomatopoeia True 6.8262777 False
asdfkj False 0.0 True
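For intuition about the numbers in this section: `vector_norm` is the length of a token's vector, and spaCy's default `similarity` is the cosine similarity between two vectors. A minimal sketch in plain Python follows. The function name and the toy three-dimensional vectors are made up for illustration (real model vectors are much longer), and returning 0.0 when a vector has zero length is an assumption chosen to match the `asdfkj` rows in this section.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero-length vector has no direction, so report 0.0 instead of dividing by zero
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example
fruit_a = [1.0, 2.0, 0.5]
fruit_b = [0.9, 2.1, 0.4]
no_vector = [0.0, 0.0, 0.0]

print(cosine_similarity(fruit_a, fruit_b))
print(cosine_similarity(fruit_a, no_vector))
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector.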


You may see warnings, because 'asdfkj' does not have a vector.

In [26]:
for tok1 in tokens:
    for tok2 in tokens:
        print(tok1.text, tok2.text, tok1.similarity(tok2))

apple apple 1.0
apple dog 0.26339024
apple broken 0.30717567
apple banana 0.5831845
apple cat 0.28213844
apple onomatopoeia 0.04173978
apple asdfkj 0.0
dog apple 0.26339024
dog dog 1.0
dog broken 0.2948628
dog banana 0.24327643
dog cat 0.80168545
dog onomatopoeia 0.020367466
dog asdfkj 0.0
broken apple 0.30717567
broken dog 0.2948628
broken broken 1.0
broken banana 0.2577424
broken cat 0.30218005
broken onomatopoeia -0.06426965
broken asdfkj 0.0
banana apple 0.5831845
banana dog 0.24327643
banana broken 0.2577424
banana banana 1.0
banana cat 0.28154364
banana onomatopoeia 0.04627617
banana asdfkj 0.0
cat apple 0.28213844
cat dog 0.80168545
cat broken 0.30218005
cat banana 0.28154364
cat cat 1.0
cat onomatopoeia 0.04033484
cat asdfkj 0.0
onomatopoeia apple 0.04173978
onomatopoeia dog 0.020367466
onomatopoeia broken -0.06426965
onomatopoeia banana 0.04627617
onomatopoeia cat 0.04033484
onomatopoeia onomatopoeia 1.0
onomatopoeia asdfkj 0.0
asdfkj apple 0.0
asdfkj dog 0.0
asdfkj broken 0.0

Try it on your own data!

In [27]:
doc_1 = nlp("")
doc_2 = nlp("")
doc_1.similarity(doc_2)

1.0

## Custom Pipeline: Training the model on new data

In [28]:
# We will train nlp_train
nlp_train = spacy.load('en_core_web_lg')

Let's test the NER capabilities once again.

In [29]:
doc = nlp("Fridge need to be replaced ASAP. Dorothy has a dog named Toto. Horses are tall.")
for ent in doc.ents:
  print(ent.text, ent.label_)
displacy.render(doc, style="ent", jupyter=True)

Dorothy PERSON
Toto PERSON


We'll attempt to teach spaCy how to identify products and animals.

**Note:** *This is not a good example of training data. Good data is larger and more diverse.*

In [30]:
TRAINING_DATA=[
               # format is ("your_string", {"entities": [(start_index, end_index, "type")]}),
               # indices are character offsets: the first character of your_string is at index 0
               # and end_index is one past the last character of the entity
               # for simplicity, each example here annotates only one entity
    ("I left Vellore yesterday.", {"entities": [(7, 14, "GPE")]}),
    ("I need to buy more clothes.", {"entities": [(19, 26, "PRODUCT")]}),
    ("I rented a house.", {"entities": [(11, 16, "PRODUCT")]}),
    ("Fridge needs to be replaced ASAP ", {"entities": [(0,6, "PRODUCT")]}),
    ("Fridge can be ordered in Amazon ", {"entities": [(0,6, "PRODUCT")]}),
    ("There has been severe flooding in North India", {"entities": [(34, 45, "GPE")]}),
    ("I got my truck stolen", {"entities": [(9,14, "PRODUCT")]}),
    ("Aprajita orders clothes from amazon", {"entities": [(0,8, "PERSON")]}),
    ("I recently ordered from Shoppers Stop", {"entities": [(24,37,"ORG")]}),
    ("I bought a new bicycle", {"entities": [(15,22, "PRODUCT")]}),
    ("I donated my old toys", {"entities": [(17,21, "PRODUCT")]}),
    ("I bought a fancy new watch", {"entities": [(21,26, "PRODUCT")]}),
    ("I rented a cabin for our vacation", {"entities": [(11,16, "PRODUCT")]}),
    ("I borrowed a ball from our neighbour", {"entities": [(13,17, "PRODUCT")]}),
    ("I repaired my car", {"entities": [(14,17, "PRODUCT")]}),
    ("I got my computer fixed", {"entities": [(9,17, "PRODUCT")]}),
    ("Richa is starting school today", {"entities":[(0,5,"PERSON")]}),
    ("They adopted a boy named Amar", {"entities":[(25, 29,"PERSON")]}),
    ("Sanjay Dutt released a new film", {"entities":[(0,11,"PERSON")]}),
    ("Horses are too tall and they will hurt your feelings", {"entities":[(0, 6, "ANIMAL")]}),
    ("I want a dog", {"entities":[(9,12,"ANIMAL")]}),
    ("Cats are known to be evil", {"entities":[(0,4,"ANIMAL")]}),
    ("I saw a bird today", {"entities":[(8, 12, "ANIMAL")]}),
    ("Snoopy is a dog", {"entities":[(12,15,"ANIMAL")]}),
    ("Rio is a movie about birds", {"entities":[(21, 26,"ANIMAL")]}),
    ("Dogs should not eat chocolate", {"entities":[(0,4,"ANIMAL")]}),
    ("Cows give milk", {"entities":[(0,4,"ANIMAL")]}),
    ("Goats eat grass", {"entities":[(0, 5, "ANIMAL")]}),
    ("I do not buy Donkeys on Amazon", {"entities":[(13, 20, "ANIMAL")]}),
    ("Tigers are predators", {"entities":[(0, 6, "ANIMAL")]}),
    ("Rachna uses Amazon regularly", {"entities":[(0,6,"PERSON")]}),
    ("Archana uses Flipkart to order clothes", {"entities":[(0, 7, "PERSON")]}),
    ("Aparna has a purse", {"entities":[(0, 6, "PERSON")]})
   ]
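Counting character offsets by hand is error-prone, so it can help to compute them. The helper below is a hypothetical convenience, not part of spaCy: `make_example` finds the first occurrence of the entity text with `str.find` and returns a tuple in the same format as `TRAINING_DATA`.

```python
def make_example(text, entity_text, label):
    # str.find returns the offset of the first occurrence, or -1 if absent
    start = text.find(entity_text)
    if start == -1:
        raise ValueError(f"{entity_text!r} not found in {text!r}")
    # The end offset is exclusive, so text[start:end] == entity_text
    end = start + len(entity_text)
    return (text, {"entities": [(start, end, label)]})


example = make_example("I left Vellore yesterday.", "Vellore", "GPE")
print(example)
```

If the entity text occurs more than once in a sentence, only the first occurrence is annotated, so such cases still need a manual check.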

In [31]:
# Storing the ner pipeline as a variable for easy access
ner = nlp_train.get_pipe("ner")

We need to add the new labels to ner.

In [32]:
for _, annotations in TRAINING_DATA:
    for ent in annotations.get("entities"):
        # the third element of each entity tuple (index 2) is the label;
        # labels that are already present won't be added again
        ner.add_label(ent[2])

Disable the pipeline components that should not be changed

In [33]:
# pipe_exc includes pipeline components that we want to change
pipe_exc = ["ner", "trf_wordpiecer", "trf_tok2vec"]
disabled_pipes = [pipe for pipe in nlp_train.pipe_names if pipe not in pipe_exc]
# disabled_pipes contains the remaining components, here the tagger and the parser

In [34]:
# We need random to shuffle the input, minibatch to split the text data into minibatches,
# and compounding to yield an infinite series of compounding values (used as batch sizes)
import random
from spacy.util import minibatch, compounding
from pathlib import Path
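To see what `compounding` and `minibatch` contribute, here is a plain-Python sketch of the idea. The two functions are simplified re-implementations written for illustration (hence the `_sketch` suffix) and are not spaCy's code: the first yields a value that is multiplied by a fixed rate each time and capped at `stop`, the second cuts a sequence into batches whose sizes come from that series.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    # Yield start, start*compound, start*compound**2, ... capped at stop
    # (assumes start <= stop and compound >= 1)
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    # Take int(next size) items at a time until the input is exhausted
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch


# 33 stand-in items, the same number as the examples in TRAINING_DATA
sizes = compounding_sketch(4.0, 32.0, 1.001)
batches = list(minibatch_sketch(range(33), sizes))
print([len(batch) for batch in batches])
```

With a rate as small as 1.001 the batch size stays at 4 for a long time, so 33 examples give eight batches of four and one batch of one per pass.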

## Training the model


This may take a couple of minutes.

In [35]:
with nlp_train.disable_pipes(*disabled_pipes):

    # Training for 180 iterations
    for iteration in range(180):

        # shuffling examples before every iteration
        random.shuffle(TRAINING_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAINING_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp_train.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
            print("Losses", losses)

Losses {'ner': 25.80991291999817}
Losses {'ner': 55.057631969451904}
Losses {'ner': 82.3158814907074}
Losses {'ner': 107.46443819999695}
Losses {'ner': 131.09159231185913}
Losses {'ner': 154.8016436100006}
Losses {'ner': 171.26841616630554}
Losses {'ner': 192.00530004501343}
Losses {'ner': 193.98279032099185}
Losses {'ner': 24.513182520866394}
Losses {'ner': 42.650454223155975}
Losses {'ner': 67.68861073255539}
Losses {'ner': 95.73404043912888}
Losses {'ner': 122.27933818101883}
Losses {'ner': 148.3598888516426}
Losses {'ner': 174.05323773622513}
Losses {'ner': 197.3733792901039}
Losses {'ner': 203.11729405422557}
Losses {'ner': 16.88648808002472}
Losses {'ner': 41.96008884906769}
Losses {'ner': 69.14534389972687}
Losses {'ner': 90.61756098270416}
Losses {'ner': 106.46225643157959}
Losses {'ner': 132.0712844133377}
Losses {'ner': 156.8713639974594}
Losses {'ner': 173.17460453510284}
Losses {'ner': 180.69876942038536}
Losses {'ner': 21.526761770248413}
Losses {'ner': 44.38239645957947}
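The printed values climb and then drop because the `losses` dictionary accumulates across the batches of one iteration and is reset to `{}` at the start of the next. The sketch below recovers per-batch losses by differencing; it uses only the first nine values printed above, and `per_batch` is a hypothetical helper name.

```python
# Cumulative 'ner' losses copied from the first nine lines of the output above
cumulative = [
    25.80991291999817, 55.057631969451904, 82.3158814907074,
    107.46443819999695, 131.09159231185913, 154.8016436100006,
    171.26841616630554, 192.00530004501343, 193.98279032099185,
]


def per_batch(cumulative_losses):
    # Subtract the running total seen so far from each new total
    previous = 0.0
    result = []
    for total in cumulative_losses:
        result.append(total - previous)
        previous = total
    return result


batch_losses = per_batch(cumulative)
print([round(loss, 2) for loss in batch_losses])
```

The last difference is much smaller than the others, which is what one would expect if the final batch of each pass holds a single example.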


Testing the trained model

In [36]:
doc_test = nlp_train("Fridge need to be replaced ASAP. Dorothy has a dog named Toto. Horses are tall.")
displacy.render(doc_test, "ent", jupyter=True)

In [37]:
test_without = [nlp("Horses have weak legs."),
        nlp("Rabbits are nice."),
        nlp("Shriya is nice."),
        nlp("I bought a new car. Ansh already asked to drive it."),
        nlp("Roland owns a watch.") ,
        nlp("Snakes are predators."),
        nlp("I threw away my old laptop."),
        nlp("I saw a snake today.")]

In [38]:
from spacy import displacy
for item in test_without:
  displacy.render(item, style="ent", jupyter=True)

In [40]:
test = [nlp_train("Horses have weak legs."),
        nlp_train("Rabbits are nice."),
        nlp_train("Shriya is nice."),
        nlp_train("I bought a new car. Ansh already asked to drive it."),
        nlp_train("Roland owns a watch.") ,
        nlp_train("Snakes are predators."),
        nlp_train("I threw away my old laptop."),
        nlp_train("I saw a snake today.")]

In [41]:
from spacy import displacy
for item in test:
  displacy.render(item, style="ent", jupyter=True)

As you can see, objects that never appeared in the training data are now identified. The model is not simply memorising words either: "snake" is identified in one sentence but not in the other.

**Bonuses:**

1. Run the test data on a new nlp object. What happens?
2. Try adding more data to the training data. Try running the training a couple of times. Do you see any changes in the results?
3. Try to add a new POS tag to the model.