# Trying to use SpaCy

### Contents:

* [Installing SpaCy](#SpaCy-install)
* [Using the parser](#Using-the-Parser)
* [Using the tokens from the parser](#Some-basic-usage-on-tokens)
* [Getting noun chunks and entities](#Noun-Chunks-and-Entities)
* [Getting logical dependency](#Logical-Dependency)
* [Parallelization](#Local-Parallelization)
* [Extra thing possible with the loaded dictionary](#Finding-other-similar-words)
* [Issues](#Issues)

## SpaCy install
```
pip install -U spacy
python -m spacy download en
```

The default `en` model comes with its own dictionary. A much more comprehensive one is also available:

```
python -m spacy download en_core_web_md
```

In [1]:
s = \
u'''Dr. Howard's dog, Ubuntu, jumped out the window.
I took a shot of whiskey; it tasted nothing like Angel's Envy from yesterday.
It smelled of unripe bananas, blood oranges and tart, but no lactic sourness.
'''

Load the more comprehensive spaCy model directly as a parser.

In [3]:
import spacy
# Load the larger model downloaded above; every later cell relies on `parser`.
parser = spacy.load('en_core_web_md')

ImportError: cannot import name util

Define cosine similarity with numpy, to compare against spaCy's built-in `similarity` later.

In [4]:
from numpy import dot
from numpy.linalg import norm

def cosine(v1, v2):
    return dot(v1, v2) / (norm(v1) * norm(v2))

def cos_similarity(w1, w2):
    return cosine(w1.vector, w2.vector)
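
As a quick sanity check of the formula, here is a dependency-free sketch of the same cosine computation on made-up vectors. The name `cosine_plain` and the toy vectors are my own assumptions for illustration; they are not part of spaCy or numpy.

```python
import math

def cosine_plain(v1, v2):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot_product / (norm1 * norm2)

# Toy vectors, made up for illustration (not real word vectors).
same_direction = cosine_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_plain([1.0, 0.0], [0.0, 1.0])
opposite = cosine_plain([1.0, 1.0], [-1.0, -1.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way should score close to 1, perpendicular ones close to 0, and opposite ones close to -1, up to floating-point rounding.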

## Using the Parser
Calling `parser` on a string tokenizes it as a single document.
(On Python 2 the string must be unicode!)

In [4]:
document = parser(s)
print(document)

Dr. Howard's dog, Ubuntu, jumped out the window.
I took a shot of whiskey; it tasted nothing like Angel's Envy from yesterday.
It smelled of unripe bananas, blood oranges and tart, but no lactic sourness.



In [5]:
for sentence in document.sents:
    print(sentence)

Dr. Howard's dog, Ubuntu, jumped out the window.

I took a shot of whiskey; it tasted nothing like Angel's Envy from yesterday.

It smelled of unripe bananas, blood oranges and tart, but no lactic sourness.



## Some basic usage on tokens
Loop over the tokens in the document and print each one's lemma, part of speech, similarity to "apples" (for fun, computed with both the built-in method and the numpy cosine above), and the sum of its word vector (to check that the vector exists).

In [6]:
apples = parser(u'apples')

print("{:10}  {:10}  {:10}  {:10}  {:10}  {:10}".format("Word", "Lemma", "Class", "Builtin", "Cosine", "Vector"))
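for token in document:
    # skip punctuation and other non-alphabetic tokens
    if token.is_alpha:
        print("{:10}  {:10}  {:10}  {:.5f}  {:10.5f} {:10.4f}".format(
            token.orth_,                    # surface form as a plain string
            token.lemma_,
            token.pos_,
            token.similarity(apples),       # spaCy's built-in similarity
            cos_similarity(apples, token),  # numpy cosine on the raw vectors
            sum(token.vector)))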

Word        Lemma       Class       Builtin     Cosine      Vector    
Howard      howard      PROPN       0.09012     0.09012    -4.7779
dog         dog         NOUN        0.20634     0.20634     2.9046
Ubuntu      ubuntu      PROPN       0.05689     0.05689   -13.1991
jumped      jump        VERB        0.16759     0.16759    -5.1246
out         out         ADP         0.28142     0.28142     4.5354
the         the         DET         0.18507     0.18507    -1.2607
window      window      NOUN        0.14864     0.14864     4.4842
I           -PRON-      PRON        0.20443     0.20443    -7.8471
took        take        VERB        0.18219     0.18219     2.5328
a           a           DET         0.12858     0.12858     0.0272
shot        shot        NOUN        0.12384     0.12384    -0.6114
of          of          ADP         0.17137     0.17137     1.1305
whiskey     whiskey     NOUN        0.34929     0.34929   -12.0690
it          -PRON-      PRON        0.29234     0.29234   

Spacy claims oranges are closer to apples than bananas.

## Noun Chunks and Entities

Amazingly, spaCy can extract noun chunks and named entities directly. I'm not sure how the entities might be useful, though.

In [7]:
red_apples = parser(u'Red Apples')

print("{:20}  {:20}  {:20}".format("Noun Chunk", "Lemmatized Chunk", "Similarity to Red Apples"))
for chunk in document.noun_chunks:
    # chunk.text is a plain string, which the width specifier needs
    print("{:20}  {:20}  {:.5f}".format(chunk.text, chunk.lemma_, chunk.similarity(red_apples)))

Noun Chunk            Lemmatized Chunk      Similarity to Red Apples
Dr. Howard's dog      dr. howard 's dog     0.31278
the window            the window            0.35839
I                     -PRON-                0.31608
a shot                a shot                0.32617
whiskey               whiskey               0.38732
it                    -PRON-                0.38155
nothing               nothing               0.37697
Angel's Envy          angel 's envy         0.38522
yesterday             yesterday             0.23516
It                    -PRON-                0.38155
unripe bananas        unripe banana         0.62457
blood oranges         blood orange          0.69092
tart                  tart                  0.57832
no lactic sourness    no lactic sourness    0.37129


In [8]:
print(document.ents)

(Howard, Angel, yesterday)


## Logical Dependency
spaCy can also work out how the words in a document depend on one another. This could be useful for building features.

In [9]:
print("{:14}  {:14}  {:14}  {:24}  {:24}".format("Relationship", "Word", "Target", "Left dependencies", "Right dependencies"))
for token in document:
    if token.is_alpha:
        # str() is needed because lists do not accept a width specifier on Python 3
        print("{:14}  {:14}  {:14}  {:24}  {:24}".format(
            token.dep_, token.orth_, token.head.orth_,
            str([t.orth_ for t in token.lefts]),
            str([t.orth_ for t in token.rights])))

Relationship    Word            Target          Left dependencies         Right dependencies      
poss            Howard          dog             [u'Dr.']                  [u"'s"]                 
nsubj           dog             jumped          [u'Howard']               [u',', u'Ubuntu', u','] 
appos           Ubuntu          dog             []                        []                      
ROOT            jumped          jumped          [u'dog']                  [u'out', u'.']          
prep            out             jumped          []                        [u'window']             
det             the             window          []                        []                      
pobj            window          out             [u'the']                  []                      
nsubj           I               took            []                        []                      
ccomp           took            tasted          [u'I']                    [u'shot']               
det       
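
To make the "left" and "right" columns concrete, here is a small sketch that rebuilds them for the first sentence from nothing but a list of head indices. The `heads` list is transcribed by hand from the table above (so treat it as my reading of that output), and the helper `children` is a hypothetical name, not a spaCy function.

```python
# Tokens of the first sentence and, for each one, the index of its head.
# Head indices are transcribed from the table above; the ROOT points to itself.
words = ["Dr.", "Howard", "'s", "dog", ",", "Ubuntu", ",", "jumped", "out", "the", "window", "."]
heads = [1, 3, 1, 7, 3, 3, 3, 7, 7, 10, 8, 7]

def children(index):
    """Return (lefts, rights): direct dependents before and after the token."""
    lefts = [words[i] for i, h in enumerate(heads) if h == index and i < index]
    rights = [words[i] for i, h in enumerate(heads) if h == index and i > index]
    return lefts, rights

for index, word in enumerate(words):
    lefts, rights = children(index)
    print("{:8} head={:8} lefts={} rights={}".format(word, words[heads[index]], lefts, rights))
```

The point is only that left and right dependents are the direct children of a token, split by whether they appear before or after it in the sentence.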

## Local Parallelization
spaCy can parallelize across threads when documents are streamed through `parser.pipe`. I have no idea whether it can automatically parallelize across machines, but that might be asking for too much.

In [10]:
def s_again_generator(n):
    "generate n documents about Dr. Howard's Dog"
    for i in range(n):
        yield s

In [11]:
def time_pipe(n):
    for doc in parser.pipe(s_again_generator(n), batch_size=50, n_threads=4):
        pass

def time_naive(n):
    for doc in s_again_generator(n):
        parser(doc)

print("Parallelized Time")
%time time_pipe(500)

print("\nUnparallelized Time")
%time time_naive(500)

Parallelized Time
CPU times: user 2.54 s, sys: 49.7 ms, total: 2.59 s
Wall time: 2.73 s

Unparallelized Time
CPU times: user 2.99 s, sys: 55.2 ms, total: 3.04 s
Wall time: 3.15 s
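
`%time` is an IPython magic, so the comparison above only runs inside a notebook or an IPython shell. A rough equivalent for a plain script might look like the sketch below, assuming Python 3.3 or newer for `time.perf_counter`. The helper `wall_time` and the stand-in `busy_work` are hypothetical names of mine; in practice `time_pipe` or `time_naive` would be passed instead.

```python
import time

def wall_time(func, *args):
    """Run func(*args) once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

def busy_work(n):
    # Hypothetical stand-in for time_pipe / time_naive so the sketch runs on its own.
    return sum(i * i for i in range(n))

elapsed = wall_time(busy_work, 100000)
print("Wall time: {:.4f} s".format(elapsed))
```

A single run is noisy, so repeating the measurement a few times would give a fairer comparison than one number per function.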


## Finding other similar words
Since the model loads a whole dictionary with word vectors, spaCy can be used to rank that dictionary by similarity to a target word.

In [12]:
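whiskey = parser(u'whiskey')

def is_actually_a_word(word):
    # keep lowercase, alphabetic vocabulary entries that have a word vector
    return word.has_vector and word.is_alpha and word.is_lower

# list(...) because filter() returns an iterator on Python 3, which has no .sort()
spacy_dictionary = list(filter(is_actually_a_word, parser.vocab))
# most similar first
spacy_dictionary.sort(key=lambda word: word.similarity(whiskey), reverse=True)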

print("Size of SpaCy's dictionary: {}".format(len(spacy_dictionary)))

Size of SpaCy's dictionary: 247539


In [13]:
spd = spacy_dictionary

print("Top 50 most similar words to whiskey:")
for w1, w2, w3, w4, w5 in zip(spd[:10], spd[10:20], spd[20:30], spd[30:40], spd[40:50]):   
    print("{:15}  {:15}  {:15}  {:15}  {:15}".format(w1.orth_, w2.orth_, w3.orth_, w4.orth_, w5.orth_))

Top 50 most similar words to whiskey:
whiskey          liquor           bottle           sherry           ale            
whisky           booze            whiskies         bitters          syrup          
bourbon          moonshine        beers            rye              alcoholic      
scotch           malt             liquors          brew             distilled      
gin              distillery       liqueur          cask             cocktails      
rum              whiskeys         drank            drinking         bottles        
vodka            cognac           martini          bottled          lager          
brandy           drink            coke             wine             molasses       
tequila          cider            pint             drinks           vermouth       
beer             wiskey           distilleries     schnapps         soda           
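
Sorting the whole dictionary works, but when only the top few neighbours are needed, a partial selection may be enough. Below is a sketch of that idea with `heapq.nlargest` on a made-up five-word vocabulary. The vectors, `toy_vectors`, and `most_similar` are assumptions for illustration only and have nothing to do with spaCy's real vectors or API.

```python
import heapq
import math

def cosine(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot_product / norms

# Made-up 3-dimensional vectors standing in for real word vectors.
toy_vectors = {
    "whiskey": [0.9, 0.1, 0.0],
    "bourbon": [0.8, 0.2, 0.1],
    "beer": [0.6, 0.4, 0.1],
    "window": [0.0, 0.2, 0.9],
    "dog": [0.1, 0.9, 0.2],
}

def most_similar(target, k):
    """Return the k words closest to target by cosine similarity, best first."""
    target_vector = toy_vectors[target]
    # Unlike the listing above, the target word itself is left out here.
    candidates = (word for word in toy_vectors if word != target)
    return heapq.nlargest(k, candidates, key=lambda word: cosine(toy_vectors[word], target_vector))

top_two = most_similar("whiskey", 2)
print(top_two)
```

With the real dictionary, the same pattern would select the top 50 without sorting all of the roughly 250,000 entries.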


## Issues
* #### I can't get sentiments to work

In [14]:
# max() picks the genuinely maximal word; an ascending sort would put the minimum first
max_word = max(spacy_dictionary, key=lambda word: word.sentiment)
print("Maximal sentiment word <{}> has a sentiment of {}.".format(max_word.orth_, max_word.sentiment))

Maximal sentiment word <whiskey> has a sentiment of 0.0.


* #### The way I treated parser(u'whiskey') like a token was deceptive

In [15]:
whiskey = parser(u'whiskey')
whiskey_token = list(whiskey)[0]
print("document: {}, token: {}".format(whiskey, whiskey_token))

document: whiskey, token: whiskey


In [16]:
print(type(whiskey))
print(type(whiskey_token))

<type 'spacy.tokens.doc.Doc'>
<type 'spacy.tokens.token.Token'>


In [17]:
print("whiskey-whiskey similarity: {}".format(whiskey.similarity(whiskey_token)))

whiskey-whiskey similarity: 1.00000000298
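
The near-1.0 score fits spaCy's documented default that a `Doc` vector is the average of its token vectors: a one-token document then has the same vector as its token. The sketch below illustrates that with made-up numbers. `average`, `whiskey_vec` and `sour_vec` are hypothetical names and values of mine, not spaCy output.

```python
import math

def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = float(len(vectors))
    return [sum(components) / count for components in zip(*vectors)]

def cosine(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot_product / norms

# Made-up token vectors, not real spaCy vectors.
whiskey_vec = [0.9, 0.1, 0.3]
sour_vec = [0.2, 0.7, 0.4]

one_token_doc = average([whiskey_vec])            # "document" with a single token
two_token_doc = average([whiskey_vec, sour_vec])  # "document" with two tokens

single = cosine(one_token_doc, whiskey_vec)
mixed = cosine(two_token_doc, whiskey_vec)
print(single, mixed)
```

So comparing a single-word `Doc` with a token happens to work, but as soon as the document has more words the score measures something different. The tiny excess over 1.0 in the output above looks like ordinary floating-point rounding.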


* #### I don't know how to get the similarity function to use lemmas (I don't plan to use it, though)

In [18]:
distil_doc = list(parser(u'distillery distilleries'))
print("lemmas: {} {}".format(distil_doc[0].lemma_, distil_doc[1].lemma_))
print("cosine: {}\n".format(distil_doc[0].similarity(distil_doc[1])))

print("token type: {}, lemma type: {}".format(type(distil_doc[0]),type(distil_doc[0].lemma_)))

lemmas: distillery distillery
cosine: 0.813246243064

token type: <type 'spacy.tokens.token.Token'>, lemma type: <type 'unicode'>
