### spaCy
* #### Tokenisation
* #### Lemmatisation
* #### Stop Words / Punctuation Removal
* #### Named Entity Recognition
* #### Parser 
* #### Part of speech
* #### Dependency parser
* #### Data Cleaning


In [1]:
!pip install spacy 




In [7]:
! python -m spacy download en_core_web_md

^C


In [8]:
# Prerequisites (run once):
#   pip install spacy
#   python -m spacy download en_core_web_sm   (or en_core_web_md / en_core_web_lg)
import spacy

nlp = spacy.load('en_core_web_sm')  # load the small English pipeline

In [9]:
nlp

<spacy.lang.en.English at 0x2080f966828>



Collecting en-core-web-md==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.0.0/en_core_web_md-3.0.0-py3-none-any.whl (47.1 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


### Tokenization with spaCy

In [16]:
text = 'India spend 2.5% of the G.D.P on their defence budget'
doc = nlp(text)
list(doc)

[India, spend, 2.5, %, of, the, G.D.P, on, their, defence, budget]

In [13]:
for token in doc:
    print(token)

India
spend
2.5
%
of
the
G.D.P
on
their
defence
budget


### Removing Stop Words with spaCy

In [17]:
# Keep only the tokens that spaCy does not flag as stop words (token.is_stop)
filtered_text = [token for token in doc if not token.is_stop]
filtered_text


[India, spend, 2.5, %, G.D.P, defence, budget]

### Removing Punctuation with spaCy

In [18]:
no_punct = [token for token in doc if not token.is_punct]
no_punct

[India, spend, 2.5, of, the, G.D.P, on, their, defence, budget]

### Lemmatisation 
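Lemmatisation reduces each word to its dictionary form (its lemma). It is generally a slower process than stemming, which only strips suffixes.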

In [19]:
text = 'dog is barking at the human who looking  very much scariest'
doc = nlp(text)
for token in doc:
    print(token , '=>', token.lemma_)

dog => dog
is => be
barking => bark
at => at
the => the
human => human
who => who
looking => look
  =>  
very => very
much => much
scariest => scary
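
The difference between stemming and lemmatisation can be illustrated with a small, self-contained sketch. This is not spaCy's algorithm: the suffix list and the lookup table below are hypothetical and exist only to show why a rule-based stemmer can produce non-words while a dictionary-based lemmatiser returns real lemmas. The table entries mirror the lemmas printed above.

```python
# Toy comparison of stemming and lemmatisation (illustrative only).

SUFFIXES = ("ing", "est", "s")  # hypothetical suffix list for the toy stemmer

# Hypothetical lookup table; the entries mirror the lemmas printed above.
LEMMA_TABLE = {
    "is": "be",
    "barking": "bark",
    "looking": "look",
    "scariest": "scary",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


words = ["is", "barking", "looking", "scariest"]
comparison = {word: (toy_stem(word), toy_lemma(word)) for word in words}

for word, (stem, lemma) in comparison.items():
    print(f"{word:<10} stem={stem:<8} lemma={lemma}")
```

The stemmer turns "scariest" into the non-word "scari" and leaves "is" unchanged, whereas the lookup returns "scary" and "be". The lookup is what makes lemmatisation more accurate, and in a real system the extra analysis is also what makes it slower.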


### Dependency Parser

In a sentence, words are related to one another. The one word that does not depend on any other is called the root (or head) of the sentence. Every other word depends, directly or indirectly, on the root; these words are called dependents.

Dependency parsing is the task of analysing these relationships between the words of a sentence. Common dependency labels include:

* nsubj: nominal subject of a clause

* ROOT: the root of the sentence, which has no head of its own (generally the main verb)

* prep: prepositional modifier; it modifies the meaning of a noun, verb, adjective, or preposition

* cc and conj: coordination. cc marks a coordinating conjunction (for example, "and" or "or"); conj marks a word joined to another by that conjunction

* pobj: object of a preposition

* aux: auxiliary verb

* dobj: direct object of a verb

* det: determiner (for example, "a" or "the")

* poss: possession modifier (for example, "her" in "her leg")

In [22]:
import spacy
nlp = spacy.load("en_core_web_sm") # creating a spacy pipeline

In [23]:
# Parse a sample sentence
my_text = 'Ardra fell into a well and fractured her leg'
my_doc = nlp(my_text)

In [24]:
# Print the dependency label of each token
for token in my_doc:
    print(token, '=>', token.dep_)

Ardra => nsubj
fell => ROOT
into => prep
a => det
well => pobj
and => cc
fractured => conj
her => poss
leg => dobj
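
The head/dependent structure described above can be sketched without loading a model. In the example below the dependency labels are copied from the output, but the head indices are written by hand and are an assumption about how the parser attached each word. With a real `Doc`, the same information is available through `token.head` and `token.children`.

```python
# Hand-written arcs for "Ardra fell into a well and fractured her leg".
# Each entry is (token, dependency label, index of its head).
# The labels come from the output above; the head indices are assumed.
ARCS = [
    ("Ardra", "nsubj", 1),
    ("fell", "ROOT", 1),       # the root is recorded as its own head
    ("into", "prep", 1),
    ("a", "det", 4),
    ("well", "pobj", 2),
    ("and", "cc", 1),
    ("fractured", "conj", 1),
    ("her", "poss", 8),
    ("leg", "dobj", 6),
]


def find_root(arcs):
    """Return the token whose label is ROOT."""
    return next(token for token, dep, _ in arcs if dep == "ROOT")


def dependents_of(arcs):
    """Map each head token to the tokens that depend on it directly.

    Tokens are keyed by their text, which is safe here because every
    word in the sample sentence is unique.
    """
    children = {}
    for index, (token, _, head) in enumerate(arcs):
        if head == index:  # skip the root's self-reference
            continue
        children.setdefault(arcs[head][0], []).append(token)
    return children


root = find_root(ARCS)
children = dependents_of(ARCS)

print("root:", root)
for head, deps in children.items():
    print(f"{head} <- {deps}")
```

Under these assumed attachments, "fell" is the root and governs the subject, the preposition, the conjunction, and the second verb, while "leg" attaches to "fractured" as its direct object.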


In [25]:
from spacy import displacy
displacy.render(my_doc, style='dep', jupyter=True)


### POS Tagging

Each word has its own role in a sentence. For example, in ‘Geeta is dancing’, Geeta is the person, so it is a noun, and dancing is the action she performs, so it is a verb. Every word can be classified in this way. This is referred to as part-of-speech (POS) tagging.

What are the parts of speech?

* Noun
* Verb
* Adjective
* Pronoun
* Adverb
* Preposition
* Conjunction
* Interjection

spaCy's `pos_` attribute uses the Universal POS tag set, so its output also contains tags such as PROPN (proper noun), AUX (auxiliary) and ADP (adposition).


In [26]:
text = 'Nitika was screaming out loud as usual Her roommate kept ignoring her '
text_doc = nlp(text)

for token in text_doc:
    print(token.text.ljust(10), '-----', token.pos_)

Nitika     ----- PROPN
was        ----- AUX
screaming  ----- VERB
out        ----- ADV
loud       ----- ADV
as         ----- ADP
usual      ----- ADJ
Her        ----- PRON
roommate   ----- NOUN
kept       ----- VERB
ignoring   ----- VERB
her        ----- PRON
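
Once every token has a tag, the tags can be counted or used as a filter. The sketch below works on (token, tag) pairs copied by hand from the output above, so it runs without a model; with a real `Doc` the pairs would come from `token.text` and `token.pos_`, as in the cell above.

```python
from collections import Counter

# (token, coarse POS tag) pairs copied from the output above.
TAGGED = [
    ("Nitika", "PROPN"), ("was", "AUX"), ("screaming", "VERB"),
    ("out", "ADV"), ("loud", "ADV"), ("as", "ADP"), ("usual", "ADJ"),
    ("Her", "PRON"), ("roommate", "NOUN"), ("kept", "VERB"),
    ("ignoring", "VERB"), ("her", "PRON"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in TAGGED)

# Keep only the tokens tagged as verbs.
verbs = [token for token, tag in TAGGED if tag == "VERB"]

print(tag_counts)
print(verbs)
```

Filtering by tag in this way is a common first step when, for example, only the nouns or verbs of a text are of interest.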


### Named Entity Recognition
What is NER?

NER is the technique of identifying named entities in a text and assigning them to predefined categories such as person names, locations, and organizations. Some of the labels spaCy uses are:

* ‘ORG’: companies, organizations, etc.
* ‘PERSON’: names of people
* ‘GPE’: countries, cities, states, etc.
* ‘PRODUCT’: vehicles, food products, and so on
* ‘LANGUAGE’: names of languages

There are many other labels too.

In [59]:
# Creating a spacy doc of a sentence
sentence=' The building is located at London. It is the headquaters of Yahoo. John works there. He speaks English'
doc=nlp(sentence)
for entity in doc.ents:
    print(entity.text,"-N.E.R->", entity.label_)

London -N.E.R-> GPE
Yahoo -N.E.R-> ORG
English -N.E.R-> LANGUAGE
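
Recognised entities are often grouped by label for later processing. The sketch below does this for the (text, label) pairs copied from the output above, and also checks them against a set of labels one might have expected for this sentence. That expected set is an assumption made for illustration; it is not something spaCy provides.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
ENTITIES = [("London", "GPE"), ("Yahoo", "ORG"), ("English", "LANGUAGE")]

# Labels one might expect for this sentence (an assumption for illustration).
EXPECTED_LABELS = {"GPE", "ORG", "PERSON", "LANGUAGE"}


def group_by_label(entities):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(ENTITIES)
missing = EXPECTED_LABELS - grouped.keys()

print(grouped)
print("labels with no entity:", sorted(missing))
```

Here the PERSON label has no entity, which matches the output above: the small model did not tag "John" in this run. A check like this is a simple way to spot entities a model has missed.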


In [63]:
# Visualise the named entities inline in the notebook
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)