# Introduction to spaCy

spaCy is a relatively new framework in the Python NLP ecosystem, but it is quickly gaining ground and will most likely become the de facto library. There are good reasons for its popularity.

#### It's really FAST
Written in Cython, it was specifically designed to be as fast as possible.

#### It's really ACCURATE
spaCy's dependency parser is among the best-performing parsers available.


#### Important features included
* Index-preserving tokenization
* Models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
* Supports 8 languages out of the box
* Easy and beautiful visualizations
* Pretrained word vectors
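
To make "index-preserving tokenization" concrete: every token remembers the character offset where it starts in the original string, so the text can always be recovered from the tokens. The sketch below is plain Python, not spaCy itself. The `tokens` list is a hand-written stand-in for the `(token.text, token.idx)` pairs you would get for `'Hello   World!'`, and the helper names `offsets_match` and `rebuild` are made up for this illustration.

```python
text = "Hello   World!"

# Hand-written (token text, character offset) pairs, assumed to mirror
# what token.text and token.idx would report for this string.
tokens = [("Hello", 0), ("  ", 6), ("World", 8), ("!", 13)]


def offsets_match(text, tokens):
    """Check that slicing the original text at each offset gives the token back."""
    return all(text[idx:idx + len(tok)] == tok for tok, idx in tokens)


def rebuild(tokens):
    """Rebuild the original string, padding the gaps between tokens with spaces."""
    out = ""
    for tok, idx in tokens:
        out += " " * (idx - len(out))  # a gap is whitespace that is not a token itself
        out += tok
    return out


print(offsets_match(text, tokens))
print(rebuild(tokens) == text)
```

Because the offsets survive, anything you find at the token level (an entity, a tag) can be mapped back to an exact position in the raw text.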

#### Extensible
It plays nicely with all the other already existing tools that you know and love: Scikit-Learn, TensorFlow, gensim.

#### Deep Learning Ready
It also has its own deep learning framework designed especially for NLP tasks: Thinc.


#### Note: dependency parser

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads. The figure below shows a dependency parse of a short sentence. The arrow from the word *moving* to the word *faster* indicates that *faster* modifies *moving*, and the label `advmod` assigned to the arrow describes the exact nature of the dependency.

<img src="nndep-example.png">

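#### Installation
In a shell, install spaCy with either
* pip install -U spacy

or
* conda install -c conda-forge spacy (Anaconda)

Then, still in the shell, download the English model:
* python -m spacy download en

Note: the `en` shortcut works on spaCy 2.x. On spaCy 3.x, shortcuts were removed, so download and load the full model name (for example en_core_web_sm) instead.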


In [42]:
import spacy
nlp = spacy.load('en')
doc = nlp('Hello   World!')
for token in doc:
    print('"' + token.text + '"')
    # Uncomment to also print each token's character offset:
    # print('"' + token.text + '"', token.idx)

"Hello"
"  "
"World"
"!"


In [75]:
doc = nlp("Next week I'll   be in Madrid. 2019/07/20")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	-PRON-	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	VERB	MD
  	15	  	False	True	  	SPACE	_SP
be	17	be	False	False	xx	VERB	VB
in	20	in	False	False	xx	ADP	IN
Madrid	23	madrid	False	False	Xxxxx	PROPN	NNP
.	29	.	True	False	.	PUNCT	.
2019/07/20	31	2019/07/20	False	False	dddd/dd/dd	NUM	CD
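
The `shape_` column above is worth a closer look: it maps each character to a class (`X` for uppercase, `x` for lowercase, `d` for digit, anything else kept as-is) and shortens long runs of the same class. Here is a minimal plain-Python sketch of that idea. The function name `word_shape` and the run cap of four are assumptions chosen to reproduce the shapes printed above, not a copy of spaCy's own code.

```python
def word_shape(text):
    """Map a token to a coarse shape string, e.g. 'Madrid' -> 'Xxxxx'."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # cap runs of the same shape character at four
            shape.append(shape_char)
    return "".join(shape)


for word in ["Next", "Madrid", "'ll", "2019/07/20"]:
    print(word, word_shape(word))
```

Shapes like `dddd/dd/dd` are handy features: a model can learn that "something that looks like a date" behaves a certain way without having seen that exact date before.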


#### See the spaCy Token class documentation for the full list of token attributes
#### Check the Penn Treebank P.O.S. tags for the meaning of the `tag_` values
* https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [77]:
# Sentence segmentation

doc = nlp("These are apples. These are oranges. N.Y. is the biggest city of U.S.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.
N.Y. is the biggest city of U.S.


In [45]:
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]


In [46]:
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)
    
# For the meaning of the entity labels, check
# https://spacy.io/api/annotation
# and the Doc class in the spaCy API docs

Next week DATE
Madrid GPE


In [47]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG


In [48]:
from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

#### Dependency Parsing
This is where spaCy really stands out!

In [55]:
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
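
Each printed line above is one arc of a tree: a word, the label of its relation, and its head. To see the tree structure without spaCy, the sketch below copies those arcs into plain Python and walks them. The names `arcs`, `children`, and `subtree` are made up for this illustration, and using the word itself as a key only works because no word repeats in this sentence.

```python
from collections import defaultdict

# (word, dependency label, head word), copied from the output above
arcs = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "compound", "currencies"),
    ("currencies", "pobj", "on"),
]

# Position of each word in the sentence, used to restore word order
order = {word: i for i, (word, _, _) in enumerate(arcs)}

children = defaultdict(list)
root = None
for word, dep, head in arcs:
    if dep == "ROOT":
        root = word  # the root points at itself, so it is not anyone's child
    else:
        children[head].append(word)


def subtree(word):
    """Return the word plus everything that depends on it, in sentence order."""
    words = [word]
    for child in children[word]:
        words.extend(subtree(child))
    return sorted(words, key=order.get)


print(root)
print(children[root])
print(" ".join(subtree("piece")))
```

The subtree of `piece` is the whole object phrase, which is the kind of thing you typically want when extracting "who did what to what" from a sentence.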


In [79]:
doc = nlp('I saw a girl with my glasses.')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

I/PRP <--nsubj-- saw/VBD
saw/VBD <--ROOT-- saw/VBD
a/DT <--det-- girl/NN
girl/NN <--dobj-- saw/VBD
with/IN <--prep-- girl/NN
my/PRP$ <--poss-- glasses/NNS
glasses/NNS <--pobj-- with/IN
./. <--punct-- saw/VBD


In [57]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a sentence.")
displacy.serve(doc, style="dep")
# If spaCy raises the error shown below, try the interactive parser
# examples in the documentation instead:
# https://spacy.io/usage/linguistic-features

TypeError: __init__() got an unexpected keyword argument 'encoding'

In [60]:
# download vector model first 
# python -m spacy download en_core_web_lg
 
nlp = spacy.load('en_core_web_lg')
print(nlp.vocab['banana'].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

In [61]:
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']
 
print(dog.similarity(animal), dog.similarity(fruit))
print(banana.similarity(fruit), banana.similarity(animal)) 

0.66185343 0.2355285
0.67148364 0.24272852
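
The similarity scores above are cosine similarities between word vectors: close to 1 when two vectors point in the same direction, close to 0 when they are unrelated. The sketch below computes this by hand in plain Python. The three-dimensional vectors are toy values invented for this illustration (the real ones have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper, not a spaCy function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration only
dog = [0.9, 0.1, 0.3]
animal = [0.8, 0.2, 0.4]
fruit = [0.1, 0.9, 0.2]

print(cosine_similarity(dog, animal))
print(cosine_similarity(dog, fruit))
```

Only the direction of the vectors matters, not their length, which is why the measure works well for comparing words of very different frequency.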


In [82]:
target = nlp("Cats are beautiful animals. Sometimes, their behaviours are funny.")
 
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
doc4 = nlp("The Texas A&M accepts only 10% of applicants.")
 
print(target.similarity(doc1))  
print(target.similarity(doc2)) 
print(target.similarity(doc3))  
print(target.similarity(doc4)) 

0.7583627113755185
0.7731197920150381
0.72560987664483
0.3492357838011923
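
Why do all three animal sentences score so close together? By default, a document's vector is the average of its token vectors, and two documents are then compared with cosine similarity. Averaging blurs word order and detail, so sentences that share common words and a general topic end up near each other. The sketch below shows only the averaging step in plain Python. The vectors are toy values and `mean_vector` is a hypothetical helper written for this illustration.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy token vectors for a three-token document, made up for illustration
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

print(mean_vector(token_vectors))  # [2.0, 2.0, 1.0]
```

This is a rough but cheap estimate. If you need similarity that respects word order or negation, treat these scores as a first filter rather than a final answer.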


In [74]:
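from spacy.tokens import Doc
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# VADER needs its lexicon: run nltk.download('vader_lexicon') once beforehand
 
sentiment_analyzer = SentimentIntensityAnalyzer()
def polarity_scores(doc):
    """Return the VADER polarity scores for the full text of a spaCy Doc."""
    return sentiment_analyzer.polarity_scores(doc.text)
 
# Register the getter as a custom attribute, available as doc._.polarity_scores
Doc.set_extension('polarity_scores', getter=polarity_scores, force=True)
# Load the pipeline before using it to build the Doc
nlp = spacy.load('en')
doc = nlp("This isn't so great!")
#doc = nlp("This is great, isn't it?")
print(doc._.polarity_scores)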

{'neg': 0.596, 'neu': 0.404, 'pos': 0.0, 'compound': -0.6631}
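
The `compound` value is the one to look at for a single overall verdict: it is a normalized score between -1 (most negative) and +1 (most positive). A common convention with VADER is to call a text positive at 0.05 or above, negative at -0.05 or below, and neutral in between. The helper below is a small sketch of that rule. The name `label_from_compound` and the `threshold` parameter are assumptions for this illustration, not part of NLTK or spaCy.

```python
def label_from_compound(scores, threshold=0.05):
    """Turn a VADER-style score dict into 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The scores printed above for "This isn't so great!"
scores = {'neg': 0.596, 'neu': 0.404, 'pos': 0.0, 'compound': -0.6631}
print(label_from_compound(scores))
```

The same function could be registered as a second getter with `Doc.set_extension`, so that each document carries both the raw scores and a readable label.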
