# spaCy

The spaCy website hosts an interactive course that teaches beginners how to perform NLP tasks with spaCy. This notebook works through those tasks and summarizes the chapters.

You can view it here: https://course.spacy.io/chapter1

This notebook contains a brief synopsis of all the concepts covered in the first three chapters of this interactive course.

# nlp() object

In [1]:
# Import the English language class
from spacy.lang.en import English

In [2]:
# Create an nlp object.
nlp = English()

The nlp object contains the processing pipeline. It also includes language-specific rules for tokenization.

In [3]:
doc = nlp("I am your father! This is the year 2019...")
for token in doc:
    print(token.text)

I
am
your
father
!
This
is
the
year
2019
...


This is similar to word_tokenize in the NLTK library. Text is tokenized as soon as you pass it to the nlp object, which in our case is English(). The result is a Doc, which you can index to get a single token or slice to get a span of tokens.

In [4]:
span = doc[2:4]
for i in span:
    print(i.text)

your
father


# Lexical Attributes 

You can check whether a token is punctuation, alphabetic or number-like using predefined token attributes.

- is_punct: whether the token is punctuation
- is_alpha: whether the token consists only of alphabetic characters
- like_num: whether the token resembles a number, e.g. "10" or "ten"

In [5]:
# All tokens
print("Tokens : {}\n".format([i.text for i in doc]))

# Get numbers only
print("Numeric: {}\n".format([i.text for i in doc if i.like_num]))

# Get punctuations only
print("Punctuations: {}\n".format([i.text for i in doc if i.is_punct]))

# Get alphabets/words only
print("Words: {}\n".format([i.text for i in doc if i.is_alpha]))

Tokens : ['I', 'am', 'your', 'father', '!', 'This', 'is', 'the', 'year', '2019', '...']

Numeric: ['2019']

Punctuations: ['!', '...']

Words: ['I', 'am', 'your', 'father', 'This', 'is', 'the', 'year']



## Token indices

Get each token's index while iterating over a Doc.

In [6]:
[[token.text, token.i] for token in doc]

[['I', 0],
 ['am', 1],
 ['your', 2],
 ['father', 3],
 ['!', 4],
 ['This', 5],
 ['is', 6],
 ['the', 7],
 ['year', 8],
 ['2019', 9],
 ['...', 10]]

# spaCy Statistical Models

spaCy's pre-trained statistical models predict the following three kinds of annotation out of the box. You can also update or train a model on your own examples to adapt it to a specific domain.

- Part of Speech Tags
- Syntactic Dependencies
- Named Entities

In [7]:
import spacy
# en_core_web_sm is a small (sm) English model that supports all core capabilities and is trained on web text.
nlp = spacy.load('en_core_web_sm')

## Part of Speech Tag

Link - https://spacy.io/usage/linguistic-features#pos-tagging

<img src="spacy_parts_of_speech.PNG" w=500 h=100> 

In [8]:
doc = nlp("There's a feeling within me, an everglow")

In [9]:
for token in doc:
    print("{0:<10}".format(token.text), token.pos_)

There      ADV
's         VERB
a          DET
feeling    NOUN
within     ADP
me         PRON
,          PUNCT
an         DET
everglow   NOUN


## Syntactic Dependencies Prediction
In addition to the part of speech, spaCy predicts the syntactic dependency of each word, i.e. whether it is a subject, an object, the verb that connects the two, and so on. token.dep_ gives the dependency label and token.head gives the parent token the word is attached to.

In [10]:
for token in doc:
    print('{0:<15}'.format(token.text), '{0:8}'.format(token.pos_), '{0:7}'.format(token.dep_), token.head.text)

There           ADV      expl    's
's              VERB     ROOT    's
a               DET      det     feeling
feeling         NOUN     attr    's
within          ADP      prep    feeling
me              PRON     pobj    within
,               PUNCT    punct   feeling
an              DET      det     everglow
everglow        NOUN     conj    feeling


## Named Entity Recognition

Named entities are real-world objects that are referred to by name, such as people, places and organizations, and each one is assigned a category label. For example, {Brady, Destin, Derek} would be labelled as <button>name</button> or person, {Australia, Alabama, Kingston} as <button>places</button>, and so on.

Link - https://spacy.io/api/annotation#pos-tagging

In [11]:
doc = nlp("Go to Alabama and Australia and use a sink to test the Coriolis Effect.")
for token in doc.ents:
    print('{0:<8}'.format(token.text), token.label_)    

Alabama  GPE
Australia GPE
the Coriolis Effect FAC


**Aside Tip**

Use spacy.explain to get the meaning of common tags and labels.

In [12]:
spacy.explain('FAC')

'Buildings, airports, highways, bridges, etc.'

In [13]:
spacy.explain('GPE')

'Countries, cities, states'

In [14]:
spacy.explain('NNP')

'noun, proper singular'

# Rule Based Matching

You can match patterns based on parts of speech, lemmas, text, named entities etc. In other words, you can define your own pattern and extract the matching text from any document.

Import the Matcher class and use it to accomplish this task.

You can add multiple match patterns to the same Matcher object.

Calling the matcher on a document returns a list of (match_id, start, end) tuples, which you can use to slice the matched spans out of the document.

In [15]:
pattern = [{'LEMMA':'love', 'POS':'VERB'}, {'POS':'PROPN'}]

The pattern above matches any verb form of the word "love" followed by a proper noun. Each dictionary in the list describes one token.
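To make the idea of "one dictionary per token" concrete, here is a minimal plain-Python sketch of how such a pattern could be checked against a sequence of tokens. It is not spaCy's Matcher: the helper find_matches is hypothetical, the token annotations are written by hand rather than predicted, and operators or overlapping patterns are ignored.

```python
def find_matches(tokens, pattern):
    """Return (start, end) pairs where each pattern dict matches one token in a row."""
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        # Every key/value in a pattern dict must equal the token's annotation
        if all(
            all(token.get(key) == value for key, value in spec.items())
            for token, spec in zip(window, pattern)
        ):
            matches.append((start, start + len(pattern)))
    return matches


# Hand-written annotations for illustration only (a real pipeline predicts these)
tokens = [
    {"TEXT": "I", "LEMMA": "I", "POS": "PRON"},
    {"TEXT": "loved", "LEMMA": "love", "POS": "VERB"},
    {"TEXT": "Kevin", "LEMMA": "Kevin", "POS": "PROPN"},
    {"TEXT": "Spacey", "LEMMA": "Spacey", "POS": "PROPN"},
]
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"POS": "PROPN"}]

matches = find_matches(tokens, pattern)
for start, end in matches:
    print(" ".join(token["TEXT"] for token in tokens[start:end]))
```

The (start, end) pairs play the same role as the start and end values returned by the matcher: they are token indices that can be used to slice the document.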

In [16]:
doc = nlp("I loved Kevin Spacey as Keyser Soze in The Usual Suspects 1995 but post 2017 that love has dwindled")

In [17]:
from spacy.matcher import Matcher
mt = Matcher(nlp.vocab)
mt.add('Pattern 1', None, pattern)
matches = mt(doc)    
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved Kevin


In [18]:
for token in doc:
    print('{0:<10}'.format(token.text), token.pos_)

I          PRON
loved      VERB
Kevin      PROPN
Spacey     PROPN
as         ADP
Keyser     PROPN
Soze       PROPN
in         ADP
The        DET
Usual      ADJ
Suspects   NOUN
1995       NUM
but        CCONJ
post       VERB
2017       NUM
that       ADP
love       NOUN
has        VERB
dwindled   VERB


In [19]:
pattern2 = [{'LOWER':'the'}, {'LOWER':'usual'}, {'LOWER':'suspects'}, {'IS_DIGIT':True}]

In [20]:
mt.add('MovieTitle', None, pattern2)
matches = mt(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved Kevin
The Usual Suspects 1995


<img src="Rule_Baded_Matching.PNG" w=500 h=50> 

# spaCy Data Structures

## Hash Values

- Vocab: stores data shared across multiple documents
- To save memory, spaCy encodes all strings to hash values
- Strings are only stored once in the StringStore via nlp.vocab.strings
- String store: lookup table in both directions

spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp.vocab.strings.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. 
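The two-way lookup can be pictured with a small toy class. This is only a sketch: ToyStringStore is a hypothetical name, and SHA-256 is used as a stand-in hash function, so the numbers it produces will not match the hashes spaCy generates.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(text):
        # Stand-in hash: first 8 bytes of SHA-256 as an unsigned 64-bit integer
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self._hash(text)
        self._by_hash[key] = text  # the string itself is stored only once
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self._hash(item)  # string -> hash can always be computed
        return self._by_hash[item]   # hash -> string needs a stored entry


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])              # resolves the hash back to the string
print(store["coffee"] == coffee_hash)  # the same string always gives the same hash
```

In this sketch, resolving a hash whose string was never added raises a KeyError, because a hash cannot be reversed without the stored entry.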

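**NOTE: a string must have been added to the string store (for example by processing text that contains it) before its hash can be resolved back to the string. Looking up a hash that is not in the store raises an error.**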

In [21]:
nlp = spacy.load('en_core_web_sm')

In [22]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('Badreesh Vinayak')
for tokens in doc:
    student_hash = nlp.vocab.strings[tokens.text]
    student_string = nlp.vocab.strings[student_hash]
    print(student_string, student_hash)

Badreesh 5385398226147478911
Vinayak 16178419844459703194


## The Doc Object

Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.
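What the spaces list encodes can be checked without spaCy: joining each word with a trailing space only where its flag is True gives back the original text. The helper join_tokens below is a hypothetical name used for illustration.

```python
words = ["Hello", "world", "!"]
spaces = [True, False, False]


def join_tokens(words, spaces):
    """Rebuild the text: each word, plus one space if its flag is True."""
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


text = join_tokens(words, spaces)
print(text)  # Hello world!
```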

In [23]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
# Spaces - Whether there's a space after the respective word or not.
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

## The Span Object

To create a Span manually, we can also import the class from spacy.tokens. We can then instantiate it with the doc and the span's start and end index.

To add an entity label to the span, we provide it as the label argument. The code below passes the label as a string; the alternative is to look up the string's hash in the string store first and pass that.

doc.ents is writable, so we can add entities manually by overwriting it with a list of spans.

In [24]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

## The Lexeme Object

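The object returned when you look up an item in the vocabulary is called a lexeme. A lexeme holds the text, the hash and lexical attributes like is_alpha, like_num etc. It is independent of context, so it has no part-of-speech tag, dependency or entity label.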

In [25]:
doc = nlp("Tom Hanks is a great guy. I like him.")
lexeme = nlp.vocab['Tom']
print("{0:<15}".format("Text:"), lexeme.text)
print("{0:<15}".format("Hash:"), lexeme.orth)
print("{0:<15}".format("Is alphabet:"), lexeme.is_alpha)
print("{0:<15}".format("Is Numeric:"), lexeme.like_num)
print("{0:<15}".format("Is Punctuation:"), lexeme.is_punct)

Text:           Tom
Hash:           6005358355014000477
Is alphabet:    True
Is Numeric:     False
Is Punctuation: False


## Word Vectors and Semantic Similarities

Compute the similarity between documents, spans and tokens with spaCy's similarity method, which is based on word vectors.

In [26]:
# Load the medium English model
# Word vectors ship with the medium model but not with the small one
nlp = spacy.load('en_core_web_md')

**Comparing two sentences/two documents**

In [27]:
doc1 = nlp("Naruto is a great anime.")
doc2 = nlp("Dragon Ball Z is an awesome anime.")

print(doc1.similarity(doc2))

0.8885209527571564


**Comparing a token in a sentence/document with another token from another sentence/document**

In [28]:
print(doc1[0].similarity(doc2[0]))

0.49898103


**Comparing a span with a document** 

In [29]:
doc1 = nlp("Naruto and Pokemon are awesomoe anime")
doc2 = nlp("Dragon Ball Z and Duel Masters are great Anime")

print(doc1[3:].similarity(doc2))

0.7416453577383586


<img src="wordVectors_Spacy.PNG" w=100 h=100> 

In [30]:
doc1[3].vector[:10]

array([-0.19859 , -0.062818, -0.36614 , -0.41786 ,  0.20962 , -0.26728 ,
        0.246   ,  0.12783 , -0.045845,  2.5253  ], dtype=float32)

**Caveat about word similarity**

Similarity is highly context-specific, and a high score does not mean two texts say the same thing. In the following example, one sentence expresses a positive sentiment and the other a negative one, yet both describe a person's feelings toward computers, so their vectors are very similar.

In [31]:
d1 = nlp("I love computers.")
d2 = nlp("I hate computers.")
d1.similarity(d2)

0.9524976951533566
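One way to see why the score stays high is to assume that a document vector is the average of its token vectors and that similarity is the cosine of the angle between two vectors. The sketch below uses made-up two-dimensional vectors, not values from en_core_web_md, and deliberately makes "love" and "hate" point in opposite directions.

```python
import math

# Made-up 2-d vectors for illustration only (not taken from any model)
toy_vectors = {
    "I": [1.0, 0.0],
    "love": [0.0, 1.0],
    "hate": [0.0, -1.0],
    "computers": [1.0, 1.0],
}


def doc_vector(tokens):
    """Average the token vectors dimension by dimension."""
    vectors = [toy_vectors[token] for token in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


word_sim = cosine(toy_vectors["love"], toy_vectors["hate"])
doc_sim = cosine(
    doc_vector(["I", "love", "computers"]),
    doc_vector(["I", "hate", "computers"]),
)
print("love vs hate:", word_sim)
print("sentence vs sentence:", round(doc_sim, 4))
```

Even with the two verbs as far apart as possible, the sentence-level score stays clearly positive, because the shared tokens dominate the averages.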

## Statistical Models and Rule Based Matching

Combining statistical models with rule-based matching is helpful in many ways. The following table summarizes how the two compare.

<img src="stats_vs_rules.png" w=200 h=100>

After doing rule based matching, you can find out different attributes of the individual tokens. 

In [32]:
from spacy.matcher import Matcher
mt = Matcher(nlp.vocab)
mt.add('Pat1', None, [{'LOWER':'successful'}, {'LOWER':'chap'}])
doc = nlp("Simon Cowell is an extremely successful chap")

for mt_id, start, end in mt(doc):
    span = doc[start:end]
    print('Matched Span:', span.text)
    print('\nMatch ID:', mt_id)
    print('\nStart Token:', start)
    print('\nEnd Token:', end)
    
    # Get the span's root and root head tokens
    print('\nRoot token: ', span.root.text)
    print('\nRoot head token: ', span.root.head.text)
    
    # Previous Token
    print("\nPrevious token: ", doc[start-1].text)

Matched Span: successful chap

Match ID: 6509405884090561913

Start Token: 5

End Token: 7

Root token:  chap

Root head token:  is

Previous token:  extremely


## Phrase Matcher
The PhraseMatcher works like the rule-based Matcher, but instead of lists of token dictionaries it takes Doc objects as patterns. It is faster and more efficient than the Matcher when you need to match exact phrases, and the patterns are easier to read.

In [33]:
from spacy.matcher import PhraseMatcher
mt = PhraseMatcher(nlp.vocab)

pattern = nlp("Simon Cowell")
mt.add("Celebrity", None, pattern)
doc = nlp("Simon Cowell is a great chap.")

for mt_id, start, end in mt(doc):
    span = doc[start:end]
    print("Matched Span:", span.text)

Matched Span: Simon Cowell


**Exercise**

In the doc provided below:

Create a <button>Pattern1</button> so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.

Create a <button>Pattern2</button> so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.

In [34]:
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"},{'IS_PUNCT':True},{'LOWER':'free'}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


# Processing Pipelines

*What does the nlp object actually do?*

First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.

The following are different components of the processing pipeline.
<img src = "pipeline_1.png" w = 100 h = 100>

Descriptions of the above components are shown below.

- The *part-of-speech tagger* sets the *token.tag* attribute.


- The *dependency parser* adds the *token.dep* and *token.head* attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.


- The *named entity recognizer* adds the detected entities to the *doc.ents* property. It also sets *entity type attributes* on the tokens that indicate if a token is part of an entity or not.


- Finally, the *text classifier* sets *category labels* that apply to the whole text, and adds them to the *doc.cats* property.


- Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.


<img src = "pipeline_components.png" w = 100 h = 100>

In [35]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [36]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x2181d9c32e8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x2181d8c2108>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x2181d8c2168>)]

## Custom Pipeline Components

A custom component is a function that takes a Doc, optionally modifies it, and returns it. You can add it to the pipeline at any position you want. The options for positioning a custom component are as follows:
<img src="pipeline_custom_component.png" w=200 h=200>
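The positioning logic can be pictured with plain Python. This is a rough sketch rather than spaCy's implementation: it assumes the options are first, before, after and last (the default), models the pipeline as a list of (name, function) pairs, and uses a plain list in place of a Doc. The component names length_logger and cleanup are hypothetical.

```python
def add_pipe(pipeline, name, component, first=False, before=None, after=None):
    """Insert (name, component) into a list of (name, callable) pairs."""
    names = [existing for existing, _ in pipeline]
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(pipeline)  # default: add the component last
    pipeline.insert(index, (name, component))


def run(pipeline, doc):
    """Apply every component in order; each takes the doc and returns it."""
    for _, component in pipeline:
        doc = component(doc)
    return doc


def make_component(name):
    def component(doc):
        doc.append(name)  # record the order in which components ran
        return doc
    return component


pipeline = []
for name in ("tagger", "parser", "ner"):
    add_pipe(pipeline, name, make_component(name))
add_pipe(pipeline, "length_logger", make_component("length_logger"), first=True)
add_pipe(pipeline, "cleanup", make_component("cleanup"), after="parser")

order = run(pipeline, [])
print(order)
```

Because every component returns the doc it received, the components can be chained in any order the list dictates.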

In [37]:
nlp = spacy.load('en_core_web_sm')

In [38]:
def new_component(doc):
    length = len(doc)
    print("Doc: {}".format(doc), "\nThe length of this document is: {}".format(length))
    return doc

In [39]:
nlp.add_pipe(new_component, first = True)

In [40]:
doc = nlp("Pokemon is the best anime ever.")

Doc: Pokemon is the best anime ever. 
The length of this document is: 7


## Extensions

Extensions can be used to add metadata to docs, tokens and spans. The data can be set as a default value when the extension is registered and overwritten at a later stage.

They are accessed through the ._ attribute.

In [41]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

In [42]:
# Overwrite an extension
doc = nlp("Mr Blue Sky Please tell us why")
doc[1]._.is_color = True

Doc: Mr Blue Sky Please tell us why 
The length of this document is: 7


In [43]:
for token in doc:
    if token._.is_color:
        print(token)

Blue


You can also define a custom getter and setter method for the meta-attribute value. 

In [44]:
from spacy.tokens import Token

# Define a getter: it receives the token and returns the computed value
def get_col(token):
    colors = ['orange', 'blue', 'purple']
    return token.text.lower() in colors

Token.set_extension('is_color', getter = get_col, force = True)

print(doc[1]._.is_color, ':', doc[1].text)

True : Blue


You can also register a custom method on the global Doc class, which then becomes available on every Doc.

In [45]:
def has_token(doc, token_text):
    in_doc = token_text in [token.text.lower() for token in doc]
    return in_doc

Doc.set_extension('has_token_', method = has_token)

doc = nlp("Mr. Blue sky please tell us why...")

print(doc._.has_token_('blue'), '- blue')

print(doc._.has_token_('sky'), 'sky')

Doc: Mr. Blue sky please tell us why... 
The length of this document is: 8
True - blue
True sky


The code below shows the power of extensions. It builds a Wikipedia search URL for every entity in the document whose label is one of those listed in the getter, such as a person or an organization.

In [46]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOCATION"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print("{0:<20}".format(ent.text), "{0:<20}".format(ent.label_), ent._.wikipedia_url)

over fifty years     DATE                 None
first                ORDINAL              None
David Bowie          PERSON               https://en.wikipedia.org/w/index.php?search=David_Bowie


## Scaling and Performance

Techniques to make spaCy run as fast as possible when processing large amounts of text.

### nlp.pipe()

If you need to process a lot of texts and create a lot of Doc objects in a row, the nlp.pipe method can speed this up significantly.

**It processes the texts as a stream and yields Doc objects.**

**It is much faster than just calling nlp on each text, because it batches up the texts.**

nlp.pipe is a generator that yields Doc objects, so to get a list of Docs, remember to wrap the call in list().
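The combination of streaming and batching can be sketched with a plain generator. This is an illustration of the idea only, not how spaCy implements it: the function pipe, the batch size of 2 and fake_process_batch (which just splits on whitespace) are all assumptions made for the example.

```python
from itertools import islice


def pipe(texts, process_batch, batch_size=2):
    """Lazily yield results, handing the texts to process_batch in chunks."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield from process_batch(batch)


calls = []  # records the size of every batch that was processed


def fake_process_batch(batch):
    calls.append(len(batch))
    return [text.split() for text in batch]  # stand-in for real processing


texts = [
    "You are awesome",
    "I love the way you handle things",
    "I am honored to be with you",
]

stream = pipe(texts, fake_process_batch)
calls_before = list(calls)  # still empty: nothing runs until we iterate
docs = list(stream)         # consuming the generator triggers the work

print(calls_before, calls, len(docs))
```

Nothing is processed when the generator is created; the work happens batch by batch as the results are consumed, which is why the result has to be wrapped in list() to get all Docs at once.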

In [47]:
docs = ['You are awesome', 'I love the way you handle things', 'I am honored to be with you']
# BAD - [nlp(doc) for doc in docs]
docs = list(nlp.pipe(docs))
docs

[You are awesome,
 I love the way you handle things,
 I am honored to be with you]

### Adding Context

nlp.pipe also supports passing in tuples of text / context if you set as_tuples to True.

The method will then yield doc / context tuples.

This is useful for passing in additional metadata, like an ID associated with the text, or a page number.

In [48]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16


### Adding Metadata
You can even add the context metadata to custom attributes.

In this example, we're registering two extensions, id and page_number, which default to None.

After processing the text and passing through the context, we can overwrite the doc extensions with our context metadata.

In [49]:
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']

### Temporary Pipe Disabling

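nlp.disable_pipes temporarily disables one or more pipeline components when it is used as a context manager.

In the with block, spaCy will only run the remaining components.

After the with block, the disabled pipeline components are automatically restored.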
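The disable-then-restore behaviour can be sketched with a small context manager. This is a toy model under stated assumptions, not spaCy's code: the pipeline is a plain list of (name, component) pairs, and the components are None placeholders because only the ordering matters here.

```python
from contextlib import contextmanager


@contextmanager
def disable_pipes(pipeline, *names):
    """Temporarily remove the named components, then put them back in place."""
    disabled = [(i, item) for i, item in enumerate(pipeline) if item[0] in names]
    pipeline[:] = [item for item in pipeline if item[0] not in names]
    try:
        yield
    finally:
        # Re-insert in ascending index order to rebuild the original sequence
        for index, item in disabled:
            pipeline.insert(index, item)


pipeline = [("tagger", None), ("parser", None), ("ner", None)]

with disable_pipes(pipeline, "tagger", "parser"):
    inside = [name for name, _ in pipeline]

after = [name for name, _ in pipeline]
print(inside)  # ['ner']
print(after)   # ['tagger', 'parser', 'ner']
```

The try/finally block ensures the components are restored even if the code inside the with block raises an exception.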

In [50]:
# Disable tagger and parser
text = "You are amazing."
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.text, doc.ents)

You are amazing. ()
