# NLP with spaCy

In [1]:
import spacy


In [3]:
spacy_obj=spacy.load('en_core_web_sm') # create a spaCy obj

In [4]:
spacy_obj.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x260dddfa6d0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x260e0c1e860>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x260df097580>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x260df097520>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x260e0cb76c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x260e0cb9cc0>)]

In [5]:
spacy_obj.pipe_names # view the pipeline component names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [None]:
# To disable pipeline components, use select_pipes() (disable_pipes() is the older, deprecated form)
#spacy_obj.select_pipes(disable=['attribute_ruler', 'tagger'])
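
A minimal sketch of disabling a component with `select_pipes`, assuming spaCy v3. To keep it independent of the trained model, it uses a blank English pipeline with a single `sentencizer` component as a stand-in (both are assumptions for illustration, not part of the notebook above). Used as a context manager, `select_pipes` restores the component when the block ends.

```python
import spacy

# Blank English pipeline with one rule-based component (stand-in for a real model).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # names of the components that are currently enabled

# Temporarily disable the component; it is re-enabled when the block exits.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    print(names_inside)

names_after = list(nlp.pipe_names)
print(names_after)
```

The same call should work on `spacy_obj` with component names such as `'tagger'`, although disabling a component that later components rely on can change their output.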

In [6]:
# Read a text into the spaCy object  
text='''Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander’st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.
'''

doc=spacy_obj(text)

In [7]:
print(doc)

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander’st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.



---

## 1. Tokenization  
Tokenization splits a string into tokens. In spaCy a token is usually a word, a punctuation mark, or a run of whitespace such as a newline.  


**(A) Tokenization into sentences**  
Iterate over the `sents` attribute of the doc to get its sentences.  

In [8]:
for sentences in doc.sents:
    print(sentences)

Shall I compare thee to a summer’s day?


Thou art more lovely and more temperate:

Rough winds do shake the darling buds of May,

And summer’s lease hath all too short a date:

Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;

And every fair from fair sometime declines,

By chance or nature’s changing course untrimm’d;

But thy eternal summer shall not fade

Nor lose possession of that fair thou owest;

Nor shall Death brag thou wander’st in his shade,

When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,

So long lives this and this gives life to thee.
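
In `en_core_web_sm` the sentence boundaries come from the dependency parser, so they are predictions and may not follow the line breaks of a poem. As a hedged alternative, the sketch below uses spaCy's rule-based `sentencizer` on a blank English pipeline, which splits on sentence-final punctuation only. The sample text is made up for illustration.

```python
import spacy

# Blank pipeline plus the rule-based sentence splitter; no trained model needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

sample = nlp("Rough winds shake the buds. Summer is short! Is it not?")

# Each sentence is a Span; .text gives its string form.
sentences = [sent.text for sent in sample.sents]
for sentence in sentences:
    print(sentence)

# The first token of every sentence has is_sent_start set to True.
starts = [token.text for token in sample if token.is_sent_start]
print(starts)
```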



**(B) Tokenization into individual words**  
This splits the text into individual tokens: words, punctuation marks, and whitespace characters such as newlines.  

In [10]:
for token in doc[:10]:
    print(token.text)

Shall
I
compare
thee
to
a
summer
’s
day
?


In [11]:
# print individual tokens using index notation
print(doc[0])
print(doc[3])

Shall
thee


spaCy tokenizes text in two stages:  
1. It first splits the raw text on whitespace, working from left to right.  
2. It then checks each whitespace-separated chunk and splits it further where needed, using two kinds of checks:  
  - (a) Exception rule check: the chunk is compared against a list of special cases, so that punctuation belonging to a known form (such as an abbreviation) is not split off blindly.  
  - (b) Prefix, suffix and infix check: punctuation such as commas, hyphens, quotation marks and periods at the start, at the end or inside a chunk is split off as a separate token.  

These checks are applied repeatedly to each chunk, from left to right, until nothing more can be split off.
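
The checks described above can be observed with `tokenizer.explain`, which reports the rule or pattern behind each token. This is a small sketch on a blank English pipeline (the tokenizer rules come with the language data, so no trained model is assumed); the sample string is made up for illustration.

```python
import spacy

# A blank pipeline contains only the tokenizer.
nlp = spacy.blank("en")

sample_text = "Don't stop (now)!"
tokens = [token.text for token in nlp(sample_text)]
print(tokens)

# explain() returns (rule_name, token_text) pairs in token order.
explained = nlp.tokenizer.explain(sample_text)
for rule, piece in explained:
    print(f"{rule:<12}{piece}")
```

The contraction is handled by a special-case rule, while the brackets and the exclamation mark are split off as a prefix and as suffixes.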

In [21]:
test_text=spacy_obj("I love cats and coffee, \"I'll get a big house for lots of cats in the U.S.A \"")

for token in test_text:
    print(token.text)

I
love
cats
and
coffee
,
"
I
'll
get
a
big
house
for
lots
of
cats
in
the
U.S.A
"


- spaCy identifies quotation marks, commas, question marks and other punctuation that appear as a prefix, suffix or infix, and separates each into an individual token. Contractions are split as well: in the example above, I'll was separated into two tokens, I and 'll.    
- Punctuation marks that are part of a known abbreviation are not separated. (In the example above, U.S.A is kept as U.S.A.)  
- Punctuation marks used as infixes are also exempt from splitting in email addresses, website URLs and some numerical figures.

**(C) Spanning/Slicing of words in the text**  


In [22]:
print(doc[0:7])


Shall I compare thee to a summer


To check whether a token starts a sentence, use the token's `is_sent_start` attribute. It returns a Boolean: `True` if the token is the first token of a sentence, and `False` otherwise.

In [24]:
print(doc[0].is_sent_start)
print(doc[1].is_sent_start)
print(doc[10].is_sent_start)
print(doc[18].is_sent_start)

True
False
True
False


**(D) Assignment of tokens is not allowed in spaCy.**



In [27]:
#doc[0] = doc[1] # raises a TypeError: a Doc does not support item assignment
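
A short sketch of what happens on assignment, and of one possible workaround: building a new `Doc` from an edited word list. The blank pipeline and the replacement word `Should` are assumptions made for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
sample = nlp("Shall I compare thee")

# Item assignment on a Doc is rejected.
try:
    sample[0] = sample[1]
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print("TypeError:", err)

# Workaround: copy the words, edit the copy, and create a fresh Doc.
words = [token.text for token in sample]
spaces = [bool(token.whitespace_) for token in sample]
words[0] = "Should"
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(new_doc.text)
```

The original `sample` is left unchanged; only `new_doc` carries the edit.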

---

## 2. Lemmatization  
Lemmatization reduces the inflected forms of a word to a common base form, the lemma, so that they can be analysed as a single term. Unlike stemming, which only strips affixes, it draws on the language's vocabulary and on a morphological analysis of each word.  
  
**Note that spaCy does not provide stemming; it offers lemmatization instead, which is generally considered more informative.**

In [32]:
for token in doc[:50]:
    print(token.text,"------->",token.pos_, "------->",token.lemma_)

Shall -------> AUX -------> shall
I -------> PRON -------> I
compare -------> VERB -------> compare
thee -------> PRON -------> thee
to -------> ADP -------> to
a -------> DET -------> a
summer -------> NOUN -------> summer
’s -------> PART -------> ’s
day -------> NOUN -------> day
? -------> PUNCT -------> ?

 -------> SPACE -------> 

Thou -------> DET -------> thou
art -------> NOUN -------> art
more -------> ADV -------> more
lovely -------> ADJ -------> lovely
and -------> CCONJ -------> and
more -------> ADV -------> more
temperate -------> NOUN -------> temperate
: -------> PUNCT -------> :

 -------> SPACE -------> 

Rough -------> ADJ -------> rough
winds -------> NOUN -------> wind
do -------> AUX -------> do
shake -------> VERB -------> shake
the -------> DET -------> the
darling -------> NOUN -------> darling
buds -------> NOUN -------> bud
of -------> ADP -------> of
May -------> PROPN -------> May
, -------> PUNCT -------> ,

 -------> SPACE -------> 

And -------> CCONJ

> Points to be noted:  
> - The lemma of a word is determined with its **part-of-speech** in mind, so the same surface form can receive different lemmas depending on how it is tagged.
> - Lemmatization does not reduce a word to its most basic synonym; it only maps the word to its base form.

---

## 3. Stopwords in spaCy  
Many English words are very common but carry little useful information on their own, for example is, am, are, a, an and the. If we keep them in the text they increase the vocabulary size, and training a model on that larger vocabulary can take more time and can make the model less efficient, since these words add little useful information.  

**spaCy ships with a built-in set of English stopwords (326 in the version used here). They are not removed automatically: tokens are only flagged as stopwords, and filtering them out is a preprocessing step we carry out ourselves. These stopwords are:**

In [33]:
print('[INFO] length of built in stopwords: ', len(spacy_obj.Defaults.stop_words))
print('[INFO] list of the stopwords \n', spacy_obj.Defaults.stop_words)

[INFO] length of built in stopwords:  326
[INFO] list of the stopwords 
 {'all', 'already', 'further', 'say', 'what', 'former', 'will', 'sometime', 'that', 'a', 'two', 'really', 'top', 'almost', 'rather', 'amongst', 'themselves', 'anywhere', 'whereafter', 'very', 'give', 'wherever', 'well', 'if', 'none', '‘re', 'whether', 'among', 'nine', 'over', 'does', 'own', 'whatever', 'bottom', 'mine', 'between', 'at', 'why', 'which', 'across', 'whereby', 'became', 'make', 'every', 'must', 'your', 'there', 'though', 'throughout', 'am', 'from', 'n‘t', 'latterly', 'the', 'hence', "n't", 'until', 'whither', 'one', 'another', 'keep', "'m", 'empty', 'ca', 'seeming', 'than', 'through', 'itself', 'beside', 'into', 'ours', 'such', 'somewhere', 'amount', 'might', 'front', 'us', 'enough', 'not', 'since', 'fifteen', 'indeed', 'latter', 'meanwhile', 'her', 'therein', 'thus', 'seems', 'while', 'nobody', 'could', 'about', 'cannot', 'serious', 'becomes', "'ve", 'least', 'hundred', 'would', 'fifty', 'on', 'with'

**Removal of stopwords from the text data**  


In [45]:
textual_data=spacy_obj("Oh Brick. I get so lonely. Living with someone you love can be lonelier than living entirely alone when the one you love doesn’t love you. You can’t even stand drinking out of the same glass can you? … No! No, I wouldn’t. Why can’t you lose your good looks Brick? Most drinking men lose theirs. Why can’t you. I think you’ve even gotten better looking since you weren’t on the bottle. You were such a wonderful love. … You were so exciting to be in love with. Mostly I guess because you were … If I thought you’d never never made love to me again, why I’d find me the longest sharpest knife I could and I’d stick it straight into my heart. I’d do that. Oh Brick how long does this have to go on, this punishment? Haven’t I served my term? Can’t I apply for a pardon? … Is it any wonder. You know what I feel like? I feel all the time like a cat on a hot tin roof.")

all_stopwords=spacy_obj.Defaults.stop_words  

# Define a function that removes all the stopwords from the text  
def remove_stopwords(data):
    # List that will store the lower-cased text of every token in `data`
    all_tokens=[] 
    
    # Iterate over the argument, not the global variable
    for token in data:
        all_tokens.append(token.text.lower())

    print("\n[INFO] original token list: \n", all_tokens)
    
    
    tokens_left = [word for word in all_tokens if word not in all_stopwords]
    print("\n[INFO] token list after stopword removal: \n", tokens_left)
    
    refined_textual_data=(' ').join(tokens_left)
    
    return refined_textual_data

print('\n[INFO] Actual data after stopword removal is: \n', remove_stopwords(textual_data))



[INFO] original token list: 
 ['oh', 'brick', '.', 'i', 'get', 'so', 'lonely', '.', 'living', 'with', 'someone', 'you', 'love', 'can', 'be', 'lonelier', 'than', 'living', 'entirely', 'alone', 'when', 'the', 'one', 'you', 'love', 'does', 'n’t', 'love', 'you', '.', 'you', 'ca', 'n’t', 'even', 'stand', 'drinking', 'out', 'of', 'the', 'same', 'glass', 'can', 'you', '?', '…', 'no', '!', 'no', ',', 'i', 'would', 'n’t', '.', 'why', 'ca', 'n’t', 'you', 'lose', 'your', 'good', 'looks', 'brick', '?', 'most', 'drinking', 'men', 'lose', 'theirs', '.', 'why', 'ca', 'n’t', 'you', '.', 'i', 'think', 'you', '’ve', 'even', 'gotten', 'better', 'looking', 'since', 'you', 'were', 'n’t', 'on', 'the', 'bottle', '.', 'you', 'were', 'such', 'a', 'wonderful', 'love', '.', '…', 'you', 'were', 'so', 'exciting', 'to', 'be', 'in', 'love', 'with', '.', 'mostly', 'i', 'guess', 'because', 'you', 'were', '…', 'if', 'i', 'thought', 'you', '’d', 'never', 'never', 'made', 'love', 'to', 'me', 'again', ',', 'why', 'i', '’
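
An alternative sketch that filters on the token flags `is_stop` and `is_punct` instead of comparing strings against the set. The helper name `strip_stopwords` and the sample sentence are hypothetical, and a blank English pipeline is assumed because the stopword list comes with the language data rather than with a trained model.

```python
import spacy

nlp = spacy.blank("en")

def strip_stopwords(text):
    """Return the tokens of `text` that are neither stopwords nor punctuation."""
    return [token.text for token in nlp(text)
            if not token.is_stop and not token.is_punct]

kept = strip_stopwords("The cat sat on the mat.")
print(kept)
```

Filtering on the flags keeps the original casing of the remaining tokens and drops punctuation in the same pass.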

**Adding a custom stopword for removal from the text**  

Sometimes the text contains words that are irrelevant to our task but are not part of the built-in stopword set. In such cases spaCy lets us add custom stopwords to the default set.

In [49]:
spacy_obj.Defaults.stop_words.add('theirs')
spacy_obj.vocab['theirs'].is_stop = True

# Verifying that our custom stopwords are added into the default list
print('[INFO] added extra stopword ', len(spacy_obj.Defaults.stop_words))

# Calling the remove_stopwords function again on the textual_data
print("[INFO] \nActual data after stop word removal is : \n",remove_stopwords(textual_data))

[INFO] added extra stopword  327

[INFO] original token list: 
 ['oh', 'brick', '.', 'i', 'get', 'so', 'lonely', '.', 'living', 'with', 'someone', 'you', 'love', 'can', 'be', 'lonelier', 'than', 'living', 'entirely', 'alone', 'when', 'the', 'one', 'you', 'love', 'does', 'n’t', 'love', 'you', '.', 'you', 'ca', 'n’t', 'even', 'stand', 'drinking', 'out', 'of', 'the', 'same', 'glass', 'can', 'you', '?', '…', 'no', '!', 'no', ',', 'i', 'would', 'n’t', '.', 'why', 'ca', 'n’t', 'you', 'lose', 'your', 'good', 'looks', 'brick', '?', 'most', 'drinking', 'men', 'lose', 'theirs', '.', 'why', 'ca', 'n’t', 'you', '.', 'i', 'think', 'you', '’ve', 'even', 'gotten', 'better', 'looking', 'since', 'you', 'were', 'n’t', 'on', 'the', 'bottle', '.', 'you', 'were', 'such', 'a', 'wonderful', 'love', '.', '…', 'you', 'were', 'so', 'exciting', 'to', 'be', 'in', 'love', 'with', '.', 'mostly', 'i', 'guess', 'because', 'you', 'were', '…', 'if', 'i', 'thought', 'you', '’d', 'never', 'never', 'made', 'love', 'to', '

**Removing a stopword from the stopword set**  

We can also remove stopwords from the default set. It can be done as follows:

In [50]:
spacy_obj.Defaults.stop_words.remove('theirs')
spacy_obj.vocab['theirs'].is_stop = False
# Verify that the custom stopword we added has been removed from the default set
print('[INFO] removed the stopword we added, length should be original ',len(spacy_obj.Defaults.stop_words))

[INFO] removed the stopword we added, length should be original  326


---  

## 4. Parts of Speech tagging (POS tagging)  
After tokenisation, the next step in the pipeline is to assign a part-of-speech tag to each token. This step is useful in many NLP tasks, such as information extraction, feature engineering and language understanding. The pipeline we loaded, `en_core_web_sm`, is a trained statistical model: it is distributed as binary data and was trained on a large number of labelled examples, which lets it make generalized predictions about the tag of each token.  
  
### Different types of POS tags available in spaCy  
There are two types of POS tags available in spaCy:  

**(a) coarse POS tags**    
These are the ordinary part-of-speech categories we already know. Each token is assigned its own coarse POS tag, for example:  
`{ ADJ, ADP, ADV, AUX, CONJ, CCONJ, DET, INTJ, NOUN }  `    
  
![coarse POS tags](../../pos_1.png)

**(b) fine-grained tags**  
Here each token is assigned a more detailed POS tag that also reflects its **morphology**, such as tense or number.  
The full list can be found on the documentation page.  
  
![fine-grained tags](../../pos_2.png)  
  
  


Now that we know the POS tags available in spaCy, let's work through some hands-on examples. We will generate the POS tags for each token in a new piece of text. Each token exposes its coarse tag through the `pos_` attribute and its fine-grained tag through the `tag_` attribute.

In [51]:
text_1 = "Look in thy glass, and tell the face thou viewest Now is the time that face should form another;."

doc_1 = spacy_obj(text_1)

for token in doc_1:
    print( "[INFO] Token '"+token.text + "'  its coarse tag is:   "+ token.pos_ + "  and it's fine-grained POS tag is: " + token.tag_)

[INFO] Token 'Look'  its coarse tag is:   VERB  and it's fine-grained POS tag is: VB
[INFO] Token 'in'  its coarse tag is:   ADP  and it's fine-grained POS tag is: IN
[INFO] Token 'thy'  its coarse tag is:   PRON  and it's fine-grained POS tag is: PRP$
[INFO] Token 'glass'  its coarse tag is:   NOUN  and it's fine-grained POS tag is: NN
[INFO] Token ','  its coarse tag is:   PUNCT  and it's fine-grained POS tag is: ,
[INFO] Token 'and'  its coarse tag is:   CCONJ  and it's fine-grained POS tag is: CC
[INFO] Token 'tell'  its coarse tag is:   VERB  and it's fine-grained POS tag is: VB
[INFO] Token 'the'  its coarse tag is:   DET  and it's fine-grained POS tag is: DT
[INFO] Token 'face'  its coarse tag is:   NOUN  and it's fine-grained POS tag is: NN
[INFO] Token 'thou'  its coarse tag is:   DET  and it's fine-grained POS tag is: DT
[INFO] Token 'viewest'  its coarse tag is:   NOUN  and it's fine-grained POS tag is: NN
[INFO] Token 'Now'  its coarse tag is:   ADV  and it's fine-grained P

We can use the `spacy.explain()` function to get a short description of a tag.

In [53]:
print(spacy.explain('ADP'))
print(spacy.explain('DET'))
print(spacy.explain('NNP'))

adposition
determiner
noun, proper singular


Note that spaCy encodes strings as unique integer values (hashes) in order to reduce memory usage and improve efficiency. The `pos` and `tag` attributes return these integer values for a token's coarse and fine-grained tags, while `pos_` and `tag_`, the attributes we used earlier, return the same information as readable strings. In general, an attribute name ending in an `underscore (_)` gives the string form, and the name without it gives the integer form.

In [54]:
# Printing the hash value of both the coarse POS tag and fine-grained POS tag for the very first token in doc_1.

print(doc_1[0].text + ' has coarse POS hash value: ' + str(doc_1[0].pos))
print(doc_1[0].text + ' has fine-grained POS hash value: '+ str(doc_1[0].tag))

Look has coarse POS hash value: 100
Look has fine-grained POS hash value: 14200088355797579614
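
The same string-to-integer convention applies to the token text itself. The sketch below looks a string up in the vocabulary's string store in both directions and compares it with `orth` and `orth_`. It assumes a blank English pipeline and a made-up sentence, so no tagger is involved.

```python
import spacy

nlp = spacy.blank("en")
sample = nlp("I love coffee")
token = sample[2]

# String -> integer hash, via the shared string store.
coffee_hash = nlp.vocab.strings["coffee"]

# orth is the integer form of the token text, orth_ the readable form.
print(token.orth, token.orth_)

# Integer hash -> string.
print(nlp.vocab.strings[coffee_hash])
```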


The same word can carry different meanings in two different sentences. The fine-grained tag that spaCy predicts for a token reflects its **morphology**, for example its tense, and this helps to tell such uses apart.

Let's have two sentences and analyze their results after doing POS tagging:

In [56]:
sent_1 = spacy_obj('Jesse loves to read books on Computer Vision.')
sent_2 = spacy_obj('Jesse read a book yesterday.')

word_1 = sent_1[3] # Assigning word_1 = read
word_2 = sent_2[1] # Assigning word_2 = read

#printing POS tags for read token in the first sentence
print('[INFO] sentence 1 POS tags: ', word_1.text,word_1.pos_, word_1.tag_, spacy.explain(word_1.tag_))

#printing POS tags for read token in the second sentence
print('[INFO] sentence 2 POS tags: ', word_2.text,word_2.pos_, word_2.tag_, spacy.explain(word_2.tag_))


[INFO] sentence 1 POS tags:  read VERB VB verb, base form
[INFO] sentence 2 POS tags:  read VERB VBD verb, past tense


If we analyze the result of the tagging, we can see that spaCy identified the correct form of the shared word in both sentences: the base form in the first and the past tense in the second. This is possible because the tagger looks at the context of each token, and the fine-grained tag records the resulting morphological form.

---  
## 5. Dependency Parsing  
Dependency parsing extracts the dependencies between the words of a sentence, which together represent its grammatical structure. Each dependency links a word to its head, and following the heads leads back to the root word of the sentence. Usually, the main verb of the sentence is treated as the root word.  
  
The dependencies amongst the words in a sentence can be represented as a directed graph, in which:  
- Each token (word) is represented by a node.  
- The grammatical relation between two tokens is represented by an edge.  
  

In [57]:
## Visualizing the dependency parse using displaCy

from spacy import displacy

data = spacy_obj('Jesse loves to read comic books.')
displacy.render(data, style='dep', jupyter=True, options={'distance': 100})
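
The graph that displaCy draws can also be read from the tokens: `token.head` is the node an edge comes from and `token.dep_` is the edge label. To keep the sketch independent of the trained parser, the heads and labels below are written by hand for a shortened sentence. They are illustrative assumptions, not parser output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse: heads are absolute token indices, the root points to itself.
words = ["Jesse", "loves", "comic", "books", "."]
heads = [1, 1, 3, 1, 1]
deps = ["nsubj", "ROOT", "compound", "dobj", "punct"]
parsed = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the only token that is its own head.
root = next(token for token in parsed if token.head.i == token.i)
print("root:", root.text)

# Every other token contributes one directed edge: head --label--> token.
edges = [(token.head.text, token.dep_, token.text)
         for token in parsed if token.head.i != token.i]
for head, label, child in edges:
    print(f"{head} --{label}--> {child}")
```

With the trained pipeline, the same loop can be run on `data` to list the edges that the parser predicted.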

---

## References:  
https://www.kaggle.com/saurabh48782/nlp-using-spacy-library-part-1

---