# NLP with spaCy

In [1]:
import spacy


In [3]:
spacy_obj=spacy.load('en_core_web_sm') # load the small English pipeline (a Language object)

In [4]:
spacy_obj.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x260dddfa6d0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x260e0c1e860>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x260df097580>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x260df097520>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x260e0cb76c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x260e0cb9cc0>)]

In [5]:
spacy_obj.pipe_names # view the pipeline component names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [None]:
# To disable pipeline components, use disable_pipes(); spaCy v3 prefers select_pipes(disable=[...])
#spacy_obj.disable_pipes('attribute_ruler', 'tagger')
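
As a minimal sketch of disabling components, the snippet below assumes spaCy v3 and uses a blank English pipeline, so no trained model is needed. The `sentencizer` component is added only so that there is something to disable; it is not part of the `en_core_web_sm` pipeline shown above.

```python
import spacy

# A blank pipeline has no components; add one so it can be disabled.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
names_before = list(nlp.pipe_names)

# As a context manager, select_pipes() restores the components on exit.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)

names_after = list(nlp.pipe_names)
print(names_before, names_inside, names_after)
```

Calling `select_pipes()` outside a `with` block keeps the components disabled until they are restored explicitly.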

In [6]:
# Read a text into the spaCy object  
text='''Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander’st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.
'''

doc=spacy_obj(text)

In [7]:
print(doc)

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander’st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.



---

## 1. Tokenization  
Tokenization splits a string into tokens. A token can be a word, a punctuation mark, or a whitespace character such as a newline.  


**(A) Tokenization into sentences**  
Use the `sents` attribute of the doc object to iterate over the sentences in the text.  

In [8]:
for sentences in doc.sents:
    print(sentences)

Shall I compare thee to a summer’s day?


Thou art more lovely and more temperate:

Rough winds do shake the darling buds of May,

And summer’s lease hath all too short a date:

Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;

And every fair from fair sometime declines,

By chance or nature’s changing course untrimm’d;

But thy eternal summer shall not fade

Nor lose possession of that fair thou owest;

Nor shall Death brag thou wander’st in his shade,

When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,

So long lives this and this gives life to thee.



**(B) Tokenization into individual words**  
Iterating over the doc yields one token at a time. Punctuation marks and escape sequences such as newlines come out as separate tokens.  

In [10]:
for token in doc[:10]:
    print(token.text)

Shall
I
compare
thee
to
a
summer
’s
day
?


In [11]:
# print individual tokens by using index notation
print(doc[0])
print(doc[3])

Shall
thee


spaCy tokenizes text in two stages:  
1. It first splits the raw text on whitespace, working from left to right.  
2. It then checks each resulting chunk and splits it further where needed, using two checks:  
  - (a) Exception rule check: chunks that match a known special case, such as an abbreviation containing periods, are left unsplit.  
  - (b) Prefix, suffix and infix check: punctuation marks such as commas, hyphens, quotation marks and periods are identified and split off as separate tokens.  

These checks are applied iteratively to each chunk, from left to right.
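
The two-stage procedure above can be sketched in plain Python. This is a toy illustration only, not spaCy's actual tokenizer: the `SPECIAL_CASES`, `PREFIXES` and `SUFFIXES` tables are hypothetical and tiny, and infix handling is left out.

```python
# Toy rule tables, invented for this sketch (not spaCy's real rules).
SPECIAL_CASES = {"U.S.A": ["U.S.A"], "I'll": ["I", "'ll"]}
PREFIXES = {'"', "("}
SUFFIXES = {'"', ")", ",", ".", "?", "!", ":", ";"}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():              # stage 1: split on whitespace
        suffixes = []                       # suffixes are peeled off right to left
        while chunk:                        # stage 2: repeat the checks
            if chunk in SPECIAL_CASES:      # (a) exception rule check
                tokens.extend(SPECIAL_CASES[chunk])
                chunk = ""
            elif chunk[0] in PREFIXES:      # (b) prefix check
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:     # (b) suffix check
                suffixes.append(chunk[-1])
                chunk = chunk[:-1]
            else:                           # nothing left to split
                tokens.append(chunk)
                chunk = ""
        tokens.extend(reversed(suffixes))   # restore left-to-right order
    return tokens

toy_tokens = toy_tokenize('I love coffee, "I\'ll move to the U.S.A "')
print(toy_tokens)
```

Because the exception check runs first on every pass, `U.S.A` survives intact while the comma after `coffee` is split off.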

In [21]:
test_text=spacy_obj("I love cats and coffee, \"I'll get a big house for lots of cats in the U.S.A \"")

for token in test_text:
    print(token.text)

I
love
cats
and
coffee
,
"
I
'll
get
a
big
house
for
lots
of
cats
in
the
U.S.A
"


- spaCy identifies quotation marks, commas, question marks and other punctuation marks that occur as a prefix, suffix or infix and separates each into an individual token. Contractions are split as well. (In the example above, I'll was separated into two tokens, I and 'll.)  
- Punctuation marks that are part of a known abbreviation are not separated. (In the example above, U.S.A is kept as U.S.A.)  
- Punctuation marks used as infixes are also left unsplit in email addresses, website addresses and some numerical figures.

**(C) Spanning/Slicing of words in the text**  


In [22]:
print(doc[0:7])


Shall I compare thee to a summer


To check whether a particular token starts a sentence, use the token's `is_sent_start` attribute. It returns a Boolean value: True if the token is the first token of a sentence, otherwise False.

In [24]:
print(doc[0].is_sent_start)
print(doc[1].is_sent_start)
print(doc[10].is_sent_start)
print(doc[18].is_sent_start)

True
False
True
False
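
A short sketch of how `is_sent_start` can be used to list sentence boundaries. It assumes spaCy v3 and swaps in the rule-based `sentencizer` on a blank pipeline instead of the parser from `en_core_web_sm`, so the boundaries here follow punctuation only and may differ from those shown above.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (an assumption for this
# sketch; the notebook itself relies on the trained parser).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
sample = nlp("Hello world. How are you?")

# Collect the index of every token that opens a sentence.
sentence_starts = [token.i for token in sample if token.is_sent_start]
print(sentence_starts)
```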


**(D) Tokens in a doc cannot be reassigned: doc objects do not support item assignment.**



In [27]:
#doc[0] = doc[1] # raises a TypeError: Doc objects do not support item assignment
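
A minimal sketch of that failure, together with one possible workaround: building a new `Doc` from an edited word list. It assumes spaCy v3 and a blank English pipeline, and the replacement word is arbitrary.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no trained model required
short_doc = nlp("Shall I compare thee")

try:
    short_doc[0] = short_doc[1]
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Workaround: copy the words, edit the copy, and build a fresh Doc.
words = [token.text for token in short_doc]
spaces = [bool(token.whitespace_) for token in short_doc]
words[0] = "Should"
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(assignment_failed, new_doc.text)
```

A `Doc` built this way carries no tags or parse information, so it would have to be passed through the pipeline components again if those annotations are needed.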

---

## 2. Lemmatization  
Lemmatization is the process of reducing the inflected forms of a word to a common base form, the lemma, so that they can be analyzed as a single term. It considers the language's full vocabulary so that it can apply morphological analysis to words.  
  
**Note that spaCy does not provide stemming. It offers only lemmatization, which is considered more informative than stemming.**

In [32]:
for token in doc[:50]:
    print(token.text,"------->",token.pos_, "------->",token.lemma_)

Shall -------> AUX -------> shall
I -------> PRON -------> I
compare -------> VERB -------> compare
thee -------> PRON -------> thee
to -------> ADP -------> to
a -------> DET -------> a
summer -------> NOUN -------> summer
’s -------> PART -------> ’s
day -------> NOUN -------> day
? -------> PUNCT -------> ?

 -------> SPACE -------> 

Thou -------> DET -------> thou
art -------> NOUN -------> art
more -------> ADV -------> more
lovely -------> ADJ -------> lovely
and -------> CCONJ -------> and
more -------> ADV -------> more
temperate -------> NOUN -------> temperate
: -------> PUNCT -------> :

 -------> SPACE -------> 

Rough -------> ADJ -------> rough
winds -------> NOUN -------> wind
do -------> AUX -------> do
shake -------> VERB -------> shake
the -------> DET -------> the
darling -------> NOUN -------> darling
buds -------> NOUN -------> bud
of -------> ADP -------> of
May -------> PROPN -------> May
, -------> PUNCT -------> ,

 -------> SPACE -------> 

And -------> CCONJ

> Points to be noted:  
> - The lemma of a word is determined with its **part of speech** in mind, so the result of lemmatization depends on the part-of-speech tag the word receives.
> - Also, Lemmatization doesn't reduce words to their most basic synonym.
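
To make the part-of-speech dependence concrete, here is a toy lookup in plain Python. The `TOY_LEMMAS` table and both helper functions are hypothetical and written only for this illustration; they are not how spaCy's lemmatizer is implemented.

```python
# Hypothetical lemma table keyed by (lowercased word, coarse POS tag).
TOY_LEMMAS = {
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}

def toy_lemma(word, pos):
    """Look up the lemma for a word given its POS; fall back to the word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

def toy_stem(word):
    """Naive suffix stripping, shown only for contrast with lemmatization."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_lemma("leaves", "NOUN"), toy_lemma("leaves", "VERB"))
print(toy_stem("leaves"))
```

The same surface form maps to two lemmas depending on its tag, while the naive stemmer ignores the tag and returns a truncated string that is not a dictionary word.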

---