# spaCy

spaCy is an open-source NLP library, similar to NLTK: https://spacy.io/

In [1]:
text = 'The Guardian is a British daily newspaper. It was founded in 1821 as The Manchester Guardian, and changed its name in 1959. Along with its sister papers The Observer and The Guardian Weekly, the Guardian is part of the Guardian Media Group, owned by the Scott Trust. The trust was created in 1936 to "secure the financial and editorial independence of the Guardian in perpetuity and to safeguard the journalistic freedom and liberal values of the Guardian free from commercial or political interference".[4] The trust was converted into a limited company in 2008, with a constitution written so as to maintain for The Guardian the same protections as were built into the structure of the Scott Trust by its creators. Profits are reinvested in journalism rather than distributed to owners or shareholders.'

# Source: Wikipedia

To analyze this text, we first need to download a suitable trained model from spaCy's model page: https://spacy.io/models

In [2]:
#!python -m spacy download en_core_web_lg

In [3]:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(text)

Calling nlp on the text returns a spaCy Doc object, whose attributes we can use to analyze it. First, we split the text into sentences.

In [7]:
# Iterate over the sentences detected by the pipeline
for sentence in doc.sents:
    print(sentence.text)
    print("Number of characters:", len(sentence.text))
    print(" — — — — — — — — — — — — — — — — — -")

The Guardian is a British daily newspaper.
Number of characters: 42
 — — — — — — — — — — — — — — — — — -
It was founded in 1821 as The Manchester Guardian, and changed its name in 1959.
Number of characters: 80
 — — — — — — — — — — — — — — — — — -
Along with its sister papers
Number of characters: 28
 — — — — — — — — — — — — — — — — — -
The Observer and The Guardian Weekly, the Guardian is part of the Guardian Media Group, owned by the Scott Trust.
Number of characters: 113
 — — — — — — — — — — — — — — — — — -
The trust was created in 1936 to "secure the financial and editorial independence of the Guardian in perpetuity and to safeguard the journalistic freedom and liberal values of the Guardian free from commercial or political interference".[4]
Number of characters: 240
 — — — — — — — — — — — — — — — — — -
The trust was converted into a limited company in 2008, with a constitution written so as to maintain for The Guardian the same protections as were built into the structure of the 

spaCy also splits the text into tokens. For every token, we can obtain its part-of-speech tag and its dependency label.

In [8]:
for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

The         DET       det       
Guardian    PROPN     nsubj     
is          VERB      ROOT      
a           DET       det       
British     ADJ       amod      
daily       ADJ       amod      
newspaper   NOUN      attr      
.           PUNCT     punct     
It          PRON      nsubjpass 
was         VERB      auxpass   
founded     VERB      ROOT      
in          ADP       prep      
1821        NUM       pobj      
as          ADP       prep      
The         DET       det       
Manchester  PROPN     compound  
Guardian    PROPN     pobj      
,           PUNCT     punct     
and         CCONJ     cc        
changed     VERB      conj      
its         DET       poss      
name        NOUN      dobj      
in          ADP       prep      
1959        NUM       pobj      
.           PUNCT     punct     
Along       ADP       ROOT      
with        ADP       prep      
its         DET       poss      
sister      NOUN      pobj      
papers      VERB      appos     
The       
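The tag and label abbreviations in this output can be hard to read. As a small sketch, spaCy's explain helper returns a short description for many of them. The labels below are simply picked from the output above, and explain returns None for a label it does not know.

```python
import spacy

# Labels picked from the token output above
labels = ["PROPN", "ADP", "nsubj", "pobj"]

# spacy.explain returns a short description, or None for unknown labels
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print("{:<8}{}".format(label, description))
```

The same helper also accepts entity labels such as NORP or WORK_OF_ART.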

In particular, we can get more information about named entities: spans of one or more tokens that refer to real-world objects such as organizations, nationalities or dates.

In [9]:
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Guardian ORG
British NORP
daily DATE
1821 DATE
The Manchester Guardian ORG
1959 DATE
The Observer and The Guardian Weekly WORK_OF_ART
Guardian ORG
the Guardian Media Group ORG
the Scott Trust ORG
1936 DATE
Guardian ORG
Guardian ORG
2008 DATE
Guardian ORG
the Scott Trust ORG
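An entity is a span of tokens with a label attached, not a single word. The sketch below builds two entities by hand on a blank English pipeline (no trained model, so nothing is predicted) to show what each entry of doc.ents carries. The names nlp_blank and doc_manual and the hand-picked token positions are assumptions for illustration, and passing the label as a string assumes spaCy 2.1 or later.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline only tokenizes; it predicts no entities on its own
nlp_blank = spacy.blank("en")
doc_manual = nlp_blank("The Guardian is a British daily newspaper.")

# Hand-labelled spans, given as token positions [start, end)
doc_manual.ents = [
    Span(doc_manual, 1, 2, label="ORG"),   # "Guardian"
    Span(doc_manual, 4, 5, label="NORP"),  # "British"
]

for ent in doc_manual.ents:
    # Text, label, and character offsets into the original string
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

With a trained model, doc.ents is filled in by the entity recognizer instead, which is why predictions such as "daily DATE" above can be wrong.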


displaCy, spaCy's built-in visualizer, can highlight these entities directly in the text.

In [11]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

The Matcher class finds sequences of tokens that match a pattern, described here by token attributes such as the lemma and the part-of-speech tag.

In [17]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

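# Add the pattern to the matcher and apply the matcher to the doc
# (spaCy >= 2.2.2 and v3 take a list of patterns; older v2 releases used
# matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern))
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])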
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip
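LEMMA and POS patterns depend on the predictions of a trained model. As a hedged sketch of the same idea without a model, the pattern below uses only lexical attributes (LOWER and IS_TITLE) on a blank English pipeline. The names nlp_blank, matcher_blank and title_pattern and the example sentence are made up for illustration, and the list-based matcher.add call assumes spaCy 2.2.2 or later.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline has a tokenizer but no tagger or lemmatizer
nlp_blank = spacy.blank("en")
matcher_blank = Matcher(nlp_blank.vocab)

# One dict per token: a spelled-out form of "download", then a title-cased token
title_pattern = [
    {"LOWER": {"IN": ["download", "downloaded", "downloading"]}},
    {"IS_TITLE": True},
]
matcher_blank.add("DOWNLOAD_TITLE_PATTERN", [title_pattern])

doc_blank = nlp_blank("i downloaded Fortnite and was downloading Minecraft yesterday")
found = [doc_blank[start:end].text for _, start, end in matcher_blank(doc_blank)]
print(found)
```

Listing the word forms by hand is cruder than matching on the lemma, and IS_TITLE only approximates "proper noun", which is the trade-off for not needing a trained model.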


In [18]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

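# Add the pattern to the matcher and apply the matcher to the doc
# (spaCy >= 2.2.2 and v3 take a list of patterns; older v2 releases used
# matcher.add("ADJ_NOUN_PATTERN", None, pattern))
matcher.add("ADJ_NOUN_PATTERN", [pattern])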
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


spaCy's larger models, such as en_core_web_lg, ship with word vectors similar to those produced by Word2Vec: each word is mapped to a numerical vector. This lets us compute the cosine similarity between words, or even add and subtract words.

In [19]:
from scipy import spatial

def cosine_similarity(x, y):
    # scipy returns the cosine distance, so the similarity is 1 - distance
    return 1 - spatial.distance.cosine(x, y)
print("apple vs banana: ", cosine_similarity(nlp.vocab['apple'].vector, nlp.vocab['banana'].vector))
print("car vs banana: ", cosine_similarity(nlp.vocab['car'].vector, nlp.vocab['banana'].vector))
print("car vs bus: ", cosine_similarity(nlp.vocab['car'].vector, nlp.vocab['bus'].vector))
print("tomatos vs banana: ", cosine_similarity(nlp.vocab['tomatos'].vector, nlp.vocab['banana'].vector))
print("tomatos vs cucumber: ", cosine_similarity(nlp.vocab['tomatos'].vector, nlp.vocab['cucumber'].vector))

apple vs banana:  0.5831844210624695
car vs banana:  0.16172660887241364
car vs bus:  0.48169606924057007
tomatos vs banana:  0.38079631328582764
tomatos vs cucumber:  0.5478045344352722
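To make the numbers above easier to interpret, here is a minimal from-scratch sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name cosine_from_scratch and the 3-dimensional toy vectors are assumptions for illustration, not real spaCy embeddings.

```python
import math

def cosine_from_scratch(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: same direction, then perpendicular
parallel = cosine_from_scratch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_from_scratch([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(parallel, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0, so a value such as 0.58 for apple and banana indicates moderately similar directions.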


In [20]:
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector
calculated_king = man - woman + queen
print('similarity between our calculated king vector and real king vector:', cosine_similarity(calculated_king, king))

similarity between our calculated king vector and real king vector: 0.771614134311676
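The arithmetic is easier to follow on a toy example. The sketch below uses hypothetical 2-dimensional vectors in which the first axis stands for "royalty" and the second for "gender"; real embeddings have hundreds of dimensions and no such clean axes.

```python
# Hypothetical 2-D vectors: first axis is "royalty", second is "gender"
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
}

# Element-wise man - woman + queen
toy_calculated_king = [
    m - w + q for m, w, q in zip(toy["man"], toy["woman"], toy["queen"])
]
print(toy_calculated_king, toy_calculated_king == toy["king"])
```

In this idealized setting the result lands exactly on the king vector. With real word vectors the relationship only holds approximately, which is consistent with a similarity of about 0.77 rather than 1.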


We can also compute the similarity between words or whole texts. By default, spaCy compares the averaged word vectors using cosine similarity.

In [21]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525
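As a rough sketch of what this score represents, assuming spaCy's default behavior of averaging the token vectors of each text and taking the cosine similarity of the two averages, the computation looks like this on hypothetical 2-dimensional token vectors. The helper names and the numbers are made up for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_from_scratch(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical token vectors for two tiny "documents"
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_b_tokens = [[1.0, 0.0]]

doc_a_vector = mean_vector(doc_a_tokens)  # [0.5, 0.5]
doc_b_vector = mean_vector(doc_b_tokens)  # [1.0, 0.0]
toy_similarity = cosine_from_scratch(doc_a_vector, doc_b_vector)
print(round(toy_similarity, 4))
```

Because the vectors are averaged, word order is ignored, and long texts tend to drift toward similar averages, so high scores between texts should be read with some caution.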
