# Natural Language Processing with Spacy and Python

##  NoteBook Context :

*   **[SpaCy Structure](#SpaCy-Structure)**
*   **[Sentence Boundary Detection(SBD):](#Sentence-Boundary-Detection(SBD):)**
*   **[Token Attributes](#Token-Attributes)**
*   **[Part of Speech Tagging(POS)](#Part-of-Speech-Tagging(POS))**
*   **[Named Entity Recognition(NER)](#Named-Entity-Recognition(NER))**
*   **[Word Vectors](#Word-Vectors)**
*   **[spaCy's Pipelines](#spaCy's-Pipelines)**

In [2]:
## importing packages

import spacy
import en_core_web_sm

In [3]:
## creating instance for work

nlp = spacy.load("en_core_web_sm")

In [4]:
nlp

<spacy.lang.en.English at 0x266e1f92e50>

In [5]:
with open('data/wiki_us.txt', 'r') as f:
    text = f.read()

In [6]:
# print(text)

## SpaCy Structure

![SpaCy Structure](http://spacy.pythonhumanities.com/_images/spacy_containers.png)

In [7]:
## applying the instance

Doc = nlp(text)

In [8]:
# print(Doc)

In [9]:
## compare between Doc and text

print(len(text))
print(len(Doc))

3535
652


In [10]:
## deep dive into text and Doc

for token in text[:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [11]:
for token in Doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


**Doc object is considering only the tokens like word or punctuations, whereas the text object is taking all letter seperately into accounts.**

In [12]:
for token in text.split(' ')[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


## Sentence Boundary Detection(SBD):

**In NLP, sentence boundary detection, or SBD, is the identification of sentences in a text.** -- Sentence Tokenizer

In [13]:
for sent in Doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [14]:
# sentence1 = Doc.sents[0]

    TypeError: 'generator' object is not subscriptable


**We got an error. That is because the sents attribute is a generator. In python, we can usually iterate over generators by converting them into a list. So, let’s do that.**

In [15]:
sentence1 = list(Doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


## Token Attributes

The token object contains a lot of different attributes that are VITAL do performing NLP in spaCy, such as:

*  text

*  head

*  left_edge

*  right_edge

*  ent_type_

*  iob_

*  lemma_

*  morph

*  pos_

*  dep_

*  lang_


*  Text: **The original word text.**
*  Lemma: **The base form of the word.**
*  POS: **The simple UPOS part-of-speech tag.**
*  Tag: **The detailed part-of-speech tag.**
*  Dep: **Syntactic dependency, i.e. the relation between tokens.**
*  Shape: **The word shape – capitalization, punctuation, digits.**
*  is alpha: **Is the token an alpha character?**
*  is stop: **Is the token part of a stop list, i.e. the most common words of the language?**

In [16]:
token2 = sentence1[7]
print(token2)

or


### Text

In [17]:
token2.text

'or'

### Head

**This tells to which word it is governed by, in this case, the primary verb, “is”, as it is part of the noun subject.**

In [18]:
token2.head

U.S.A.

### left_edge

In [19]:
token2.left_edge

or

### right_edge

In [20]:
token2.right_edge

or

In [21]:
### type of entity

token2.ent_type

0

In [22]:
### name of entity type

token2.ent_type_

''

In [23]:
### ent_iob

'''IOB code of named entity tag. 
“B” means the token begins an entity, 
“I” means it is inside an entity, 
“O” means it is outside an entity, 
and "" means no entity tag is set.'''

token2.ent_iob_

'O'

In [24]:
### lemma

'''Base form of the token, with no inflectional suffixes'''

token2.lemma_

'or'

In [25]:
# sentence1[12]

sentence1[12].lemma_

'know'

In [26]:
### morph

token2.morph, sentence1[12].morph

(ConjType=Cmp, Aspect=Perf|Tense=Past|VerbForm=Part)

In [27]:
### POS(Parts of Speech)


'''Coarse-grained part-of-speech from the Universal POS tag set.'''

token2.pos_, sentence1[12].pos_

('CCONJ', 'VERB')

In [28]:
### Syntatic Dependency

'''Syntactic dependency relation.'''

token2.dep_, sentence1[12].dep_, sentence1[2].dep_ 

('cc', 'acl', 'nsubj')

*  'cc' --> conjunction
*  'nsubj' --> noun subject
*  'acl' --> adnominal clause

In [29]:
### Language

'''Languageof the parent document's vocabulary'''

token2.lang_

'en'

## Part of Speech Tagging(POS)

In [30]:
for token in sentence1:
    print(token.text, token.pos_, token.dep_)

The DET det
United PROPN compound
States PROPN nsubj
of ADP prep
America PROPN pobj
( PUNCT punct
U.S.A. PROPN appos
or CCONJ cc
USA PROPN conj
) PUNCT punct
, PUNCT punct
commonly ADV advmod
known VERB acl
as ADP prep
the DET det
United PROPN compound
States PROPN pobj
( PUNCT punct
U.S. PROPN appos
or CCONJ cc
US PROPN conj
) PUNCT punct
or CCONJ cc
America PROPN conj
, PUNCT punct
is AUX ROOT
a DET det
country NOUN attr
primarily ADV advmod
located VERB acl
in ADP prep
North PROPN compound
America PROPN pobj
. PUNCT punct


In [31]:
text2 = "I hope this framework will be very useful for all of your machine learning projects that you plan to deploy in the future. Let's go to the next video."

In [32]:
Doc2 = nlp(text2)
print(Doc2)

I hope this framework will be very useful for all of your machine learning projects that you plan to deploy in the future. Let's go to the next video.


In [33]:
sentence2 = Doc2.sents
print(list(sentence2))

[I hope this framework will be very useful for all of your machine learning projects that you plan to deploy in the future., Let's go to the next video.]


In [34]:
for token in Doc2:
    print(token.text, token.pos_, token.dep_)

I PRON nsubj
hope VERB ROOT
this DET det
framework NOUN nsubj
will AUX aux
be AUX ccomp
very ADV advmod
useful ADJ acomp
for ADP prep
all PRON pobj
of ADP prep
your PRON poss
machine NOUN compound
learning NOUN compound
projects NOUN pobj
that PRON dobj
you PRON nsubj
plan VERB relcl
to PART aux
deploy VERB xcomp
in ADP prep
the DET det
future NOUN pobj
. PUNCT punct
Let VERB ROOT
's PRON nsubj
go VERB ccomp
to ADP prep
the DET det
next ADJ amod
video NOUN pobj
. PUNCT punct


In [35]:
## visualize the sentence with displacy function

from spacy import displacy
displacy.render(Doc2, style = 'dep')

In [36]:
text3 = 'My name is Saibal.'

Doc3 = nlp(text3)

In [37]:
for token in Doc3:
    print(token.text, token.lemma_, token.pos_,
         token.tag_, token.dep_, token.shape_,
         token.is_alpha, token.is_stop)

My my PRON PRP$ poss Xx True True
name name NOUN NN nsubj xxxx True True
is be AUX VBZ ROOT xx True True
Saibal Saibal PROPN NNP attr Xxxxx True False
. . PUNCT . punct . False False


In [38]:
## visualize the sentence

from spacy import displacy
displacy.render(Doc3, style = 'dep')

## Named Entity Recognition(NER)

In [39]:
for ent in Doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
third- or fourth DATE
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775Ã¢â‚¬â€œ1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The SpanishÃ¢â‚¬â€œAmerican War and World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the

In [40]:
## Visualize the Named Entity Recognition(NER)

from spacy import displacy
displacy.render(Doc, style = 'ent')

## Word Vectors

    Important


**Word vectors, or word embeddings, are numerical representations of words in multidimensional space through matrices. The purpose of the word vector is to get a computer system to understand a word. Computers cannot understand text efficiently. They can, however, process numbers quickly and well. For this reason, it is important to convert a word into a number.**

--------------------

In [41]:
## importing libraries

import spacy
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1-py3-none-any.whl (42.8 MB)
     ---------------------------------------- 42.8/42.8 MB 8.2 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [42]:
nlp = spacy.load('en_core_web_md')
with open('data/wiki_us.txt', "r") as f:
    text = f.read()
Doc = nlp(text) 
sentence1 = list(Doc.sents)[0]

In [43]:
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

In [44]:
# ## find synonyms of words in PyDictionary

# from PyDictionary import PyDictionary

# dictionary = PyDictionary()

# words = ['like', 'hate']

# for word in words:
#     syns = dictionary.synonym(word)
#     print(f"{word}: {syns[0:5]}\n")

In [45]:
# sentence1[0].vector

### Why to use WordVectors

In [46]:
import numpy as np


word = input()

## most similar
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n = 10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]

distances = ms[2]
print(words)

country
['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [47]:
## Document similarity

doc1 = nlp("I am good boy.")
doc2 = nlp("I donot study for whole year.")

print(doc1, "<->", doc2, doc1.similarity(doc2))

I am good boy. <-> I donot study for whole year. 0.5653599591420513


In [48]:
doc3 = nlp("Google, amazon, microsoft are big tech giant.")
doc4 = nlp("All these company has started layoff.")

print(doc3, "<->", doc4, doc3.similarity(doc4))

Google, amazon, microsoft are big tech giant. <-> All these company has started layoff. 0.6975344676195314


In [52]:
doc5 = nlp("John likes Harleen.")
doc6 = nlp("Harleen likes John")

print(doc5, "<->", doc6, doc5.similarity(doc6))

John likes Harleen. <-> Harleen likes John 0.8435589703721093


In [54]:
doc7 = nlp("Write down the code of building website using HTML, CSS, JS")
doc8 = nlp("I want to write the code to build portfolio website using HTML, CSS, HS")
print(doc7, "<->", doc8, doc7.similarity(doc8))

Write down the code of building website using HTML, CSS, JS <-> I want to write the code to build portfolio website using HTML, CSS, HS 0.7628555129063627


## spaCy's Pipelines