# Chapter 2: Large-scale data analysis with spaCy
You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.


## Data Structures (1): Vocab, Lexemes and StringStore


In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

### Shared vocab and string store
- Vocab: stores data shared across multiple documents
- To save memory, spaCy encodes all strings to hash values
- Strings are only stored once in the StringStore via nlp.vocab.strings
- String store: lookup table in both directions


In [2]:
nlp.vocab.strings

<spacy.strings.StringStore at 0x7f4f4da211e0>

In [3]:
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]
coffee_hash, coffee_string

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [None]:
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])
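
The two cells above can be read as follows: a string can always be turned into a hash, but a hash can only be turned back into a string once that string has been stored. The sketch below imitates this behaviour in plain Python. `ToyStringStore` and its methods are hypothetical names made up for illustration, and BLAKE2b is only a stand-in for whatever hash function spaCy uses internally.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way table: string -> hash and hash -> string."""

    def __init__(self):
        self._strings = {}  # hash -> string, filled only by add()

    @staticmethod
    def hash_string(text):
        # Stand-in hash (an assumption): 8 bytes of BLAKE2b read as a 64-bit int
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        key = self.hash_string(text)
        self._strings[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # A string can always be hashed, even if it was never stored
            return self.hash_string(key)
        # A hash can only be resolved if its string was added earlier
        return self._strings[key]


store = ToyStringStore()
coffee_hash = store["coffee"]  # works: hashing needs no lookup

try:
    store[coffee_hash]
    resolved_before_add = True
except KeyError:
    resolved_before_add = False  # the string has not been stored yet

store.add("coffee")  # roughly what processing "I love coffee" does for the real store
coffee_string = store[coffee_hash]
print(resolved_before_add, coffee_string)
```

In this toy version, the lookup fails before `add()` and succeeds afterwards, which mirrors the `KeyError` in cell 3 and the working lookup once the text has been processed.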

### Lexemes
Lexemes are context-independent entries in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.

Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.


In [None]:

# A Lexeme object is an entry in the vocabulary
# Contains the context-independent information about a word
# Word text: lexeme.text and lexeme.orth (the hash)
# Lexical attributes like lexeme.is_alpha
# Not context-dependent part-of-speech tags, dependencies or entity labels

In [None]:

doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

![Doc, Token, Lexeme, Vocab and StringStore](./static/images/vocab_stringstore.png)

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.
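
To make the chain "token, then lexeme, then string store" more concrete, here is a rough sketch in plain Python. All names (`ToyLexeme`, `ToyToken`, `STRINGS`, `VOCAB`, `intern`, `get_lexeme`, `make_doc`) are invented for this illustration, and CRC32 is just a convenient stand-in hash. It is not how spaCy is implemented.

```python
import zlib
from dataclasses import dataclass

STRINGS = {}  # toy string store: hash ID -> string
VOCAB = {}    # toy vocab: hash ID -> lexeme, shared by all documents


def intern(text):
    """Store a string and return its ID (CRC32 is only a stand-in hash)."""
    key = zlib.crc32(text.encode("utf8"))
    STRINGS[key] = text
    return key


@dataclass(frozen=True)
class ToyLexeme:
    orth: int  # hash ID of the word text

    @property
    def text(self):
        return STRINGS[self.orth]  # resolve the hash via the string store

    @property
    def is_alpha(self):
        return self.text.isalpha()


@dataclass(frozen=True)
class ToyToken:
    lexeme: ToyLexeme
    pos: str  # context-dependent, so it lives on the token, not the lexeme

    @property
    def text(self):
        return self.lexeme.text


def get_lexeme(word):
    """Return the single shared lexeme for a word, creating it if needed."""
    key = intern(word)
    if key not in VOCAB:
        VOCAB[key] = ToyLexeme(orth=key)
    return VOCAB[key]


def make_doc(tagged_words):
    return [ToyToken(get_lexeme(word), pos) for word, pos in tagged_words]


doc1 = make_doc([("I", "PRON"), ("love", "VERB"), ("coffee", "NOUN")])
doc2 = make_doc([("love", "NOUN"), ("hurts", "VERB")])

# Same lexeme object in both docs, but a different tag in each context
print(doc1[1].lexeme is doc2[0].lexeme, doc1[1].pos, doc2[0].pos)
```

The point of the sketch is the split of responsibilities: the word text is stored once, the lexeme is shared, and only the token carries information that depends on the sentence.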

In [11]:
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

In [12]:
# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)

2644858412616767388


In [14]:

# Look up the ID for 'Bowie' in the German vocab
# print(nlp_de.vocab.strings[bowie_id])
# Why does this code throw an error?
# Hashes can’t be reversed, and the string 'Bowie' was never added to the German vocab. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.

## Data Structures (2): Doc, Span and Token


### Doc

In [16]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc

Hello world!
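
The `spaces` list records, for each word, whether it is followed by a space. As a minimal sketch of how the text is put back together from `words` and `spaces` (plain Python, no spaCy involved; `join_words` is a hypothetical helper, not a spaCy function):

```python
def join_words(words, spaces):
    """Rebuild the text from words and per-word trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if has_space else "")
        for word, has_space in zip(words, spaces)
    )


hello = join_words(["Hello", "world", "!"], [True, False, False])
cool = join_words(["spaCy", "is", "cool", "!"], [True, True, False, False])
print(hello)
print(cool)
```

Because the flags are stored per token, the original text can be recovered exactly, including cases like punctuation that attaches directly to the previous word.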

### Span

In [35]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

span_with_label, doc.ents, [ent.label_ for ent in doc.ents]

(Hello world, (Hello world,), ['GREETING'])

### Best practices
- Doc and Span are very powerful and hold references and relationships of words and sentences
- Convert result to strings as late as possible
- Use token attributes if available – for example, token.i for the token index
- Don't forget to pass in the shared vocab

Let's practice!


In [40]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [41]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


In [50]:
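# String-based version: converts tokens to plain strings up front (compare with the token-attribute version in the next cell)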
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]
print(pos_tags)
for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if there is a next token and whether it is a verb
        if index + 1 < len(pos_tags) and pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

['PROPN', 'AUX', 'DET', 'ADJ', 'NOUN']


In [53]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            result = token.text
            print("Found proper noun before a verb:", result)

__________________
## Word vectors and semantic similarity

### Comparing semantic similarity
- spaCy can compare two objects and predict similarity
- Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (0 to 1)
- Important: needs a model that has word vectors included, for example:  

✅ en_core_web_md (medium model)  
✅ en_core_web_lg (large model)  
🚫 NOT en_core_web_sm (small model)  



In [1]:
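# Note: this cell loads the small model, which does not include word vectors,
# so the scores below are only rough. For meaningful similarity, load a model with vectors:
# nlp = spacy.load('en_core_web_md')
import en_core_web_sm
nlp = en_core_web_sm.load()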

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1, "==", token2, token1.similarity(token2))

0.8018373287411041
pizza == pasta 0.32624283


*How does spaCy predict similarity?*
- Similarity is determined using word vectors
- Multi-dimensional meaning representations of words
- Generated using an algorithm like Word2Vec and lots of text
- Can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words
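
The two defaults named in the list above, cosine similarity and averaging token vectors, can be sketched in a few lines of plain Python. The vectors below are made-up 3-dimensional values chosen only for illustration (real models use far more dimensions), and `cosine_similarity`, `average_vector` and `phrase_vector` are hypothetical helper names, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Component-wise mean, used here as the vector of a whole phrase."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Made-up vectors, for illustration only
toy_vectors = {
    "i": [0.1, 0.1, 0.1],
    "like": [0.2, 0.7, 0.1],
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "cats": [0.0, 0.2, 0.9],
}


def phrase_vector(words):
    return average_vector([toy_vectors[word] for word in words])


pizza_pasta = cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"])
pizza_cats = cosine_similarity(toy_vectors["pizza"], toy_vectors["cats"])
doc_similarity = cosine_similarity(
    phrase_vector(["i", "like", "pizza"]),
    phrase_vector(["i", "like", "pasta"]),
)
print(pizza_pasta, pizza_cats, doc_similarity)
```

With these toy values, the two phrases score higher than the two food words on their own, because the shared words "i" and "like" pull the averaged vectors together. This is one reason why long texts with many common words tend to look similar to each other.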

In [9]:
doc = nlp("I have a banana")

# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 0.9383564  -2.9524927   1.1866798   0.49744225 -0.11475766  0.804008
  0.4672468  -1.1062207   2.9193573   1.800931   -0.31358248  1.1920271
 -1.2406584  -2.3237133   2.099902   -0.66673994 -0.96991694  0.8316833
  0.10666084 -0.42245626  1.6402073   0.95437694  1.2855074  -2.038612
 -0.7317371  -0.17545497  0.14752543  1.327169    3.2502053  -3.9332502
  1.7409098  -0.73711336  1.4852796  -2.8246899  -1.8938334  -1.2638527
  5.298433   -1.2850044  -2.7470415  -1.5607052   5.181785    2.242096
 -2.1922808  -5.310454    1.0295098   1.484088   -1.5894104  -0.14745024
  1.7829046   1.8879583   4.152973   -3.1493165  -0.18937713  2.09369
 -2.1269834   0.63290507  2.6979058   1.800016   -2.3953576   2.54901
  1.0445759  -1.3137031   2.4631662  -0.07756937 -1.129545    0.1169464
  1.3869805   0.53586185 -2.242661    2.8641388  -3.8719153  -0.6409143
  0.6971829   4.484493   -1.6210997   2.494869    0.7218447  -3.3112261
 -0.2163549  -2.5339773  -1.1702836  -0.9627162  -3.7210062   1.559916

#### Similarity depends on the application context
- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In [11]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.849750871465223


In [12]:
#__________________________________________________

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)


[ 1.5189981   1.0769155  -2.4386053  -1.0038385  -0.09056883  3.2566302
  1.1464092  -1.7798254   0.94800925  3.154657    1.5740933   1.4181501
 -0.6316143   0.6489268  -2.9598496  -3.2275555  -2.5892558   0.47790402
 -0.9010248  -0.08130313  1.0560243  -2.3348064  -2.1118944  -0.06434575
  2.2199314   3.2613754  -0.10951382 -2.4045494  -1.9903516   0.31546313
  0.34890515 -0.995184   -3.042478   -3.2609143   1.6715308  -1.5878627
  4.6034174  -1.991154   -0.45828784 -1.1500666   9.190532    3.157382
  1.064983   -1.8014977  -1.0308608  -0.79789555  0.17627871 -0.3339194
  0.9729749   2.3049207  -3.8109362  -3.8562965   0.1393863  -0.5373318
 -3.0516777   0.18260963  2.7131486  -0.23883188 -2.8489332   0.93734485
 -1.8697594   4.6760097  -0.4455679   0.33904123  3.8083265  -0.60235274
  1.409046   -2.2682066   1.1596813   0.8128674   1.7838109   0.44814005
 -0.62783265  2.7147424   0.922505   -0.02261603  0.06355977 -4.513331
  0.05755246 -3.054604   -2.3484707   3.1645236  -3.7710278 

In [16]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.5361258385408284


In [18]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.28746358


In [24]:

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[-4:-1]

# Get the similarity of the spans
similarity = span1.similarity(span2)
span1, span2, similarity

(great restaurant, really nice bar, 0.616326)

_____________________________
### Combining models and rules
#### Statistical models
Statistical models are useful if your application needs to be able to generalize based on a few examples.

- Use cases: application needs to generalize based on examples
- Real-world examples: product names, person names, subject/object relationships
- spaCy features: entity recognizer, dependency parser, part-of-speech tagger

#### Rule-based systems
Rule-based approaches, on the other hand, come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.


- Use cases: dictionary with finite number of examples
- Real-world examples: countries of the world, cities, drug names, dog breeds
- spaCy features: tokenizer, Matcher, PhraseMatcher

For complex tasks, it’s usually better to train a statistical entity recognition model. However, statistical models require training data, so for many situations, rule-based approaches are more practical. This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you “bootstrap” a statistical model.

[rule-based matching](https://spacy.io/usage/rule-based-matching)



In [33]:

# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]
matcher.add('VERY_HAPPY', None, pattern)

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matcher.vocab.strings[match_id], matched_span.text)

LOVE_CATS love cats
VERY_HAPPY very happy
VERY_HAPPY very very happy
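
The output above lists two `VERY_HAPPY` matches because the `+` operator allows "one or more" tokens, so both "very happy" and "very very happy" satisfy the pattern. The toy matcher below illustrates that idea on a plain list of strings. It is a simplified sketch, not spaCy's matching algorithm: `find_matches` is a hypothetical function, patterns are `(predicate, op)` pairs rather than dictionaries, and only the operators `"1"` (exactly once) and `"+"` (one or more) are supported.

```python
def find_matches(tokens, pattern):
    """Return all (start, end) spans where the pattern matches.

    pattern is a list of (predicate, op) pairs; op is "1" (exactly once)
    or "+" (one or more).
    """

    def ends_from(i, j):
        # Yield every end index at which pattern[j:] can finish
        # when matching starts at token i
        if j == len(pattern):
            yield i
            return
        if i == len(tokens):
            return
        predicate, op = pattern[j]
        if predicate(tokens[i]):
            if op == "+":
                # The same pattern item consumes another token
                yield from ends_from(i + 1, j)
            # Move on to the next pattern item
            yield from ends_from(i + 1, j + 1)

    found = set()
    for start in range(len(tokens)):
        for end in ends_from(start, 0):
            found.add((start, end))
    return sorted(found)


# Tokens written out by hand for this sketch
tokens = ["I", "love", "cats", "and", "I", "'m", "very", "very", "happy"]
very_happy = [
    (lambda t: t == "very", "+"),
    (lambda t: t == "happy", "1"),
]
matches = find_matches(tokens, very_happy)
spans = [" ".join(tokens[start:end]) for start, end in matches]
print(matches, spans)
```

As in the notebook output, the sketch reports every span that satisfies the pattern rather than only the longest one, although the order of the results may differ from spaCy's.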


In [34]:
# Adding statistical prediction
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


#### Efficient phrase matching
- PhraseMatcher: like regular expressions or keyword search – but with access to the tokens!
- Takes Doc objects as patterns
- More efficient and faster than the Matcher
- Great for matching large word lists

The phrase matcher is another helpful tool to find sequences of words in your data.

It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

It takes Doc objects as patterns.

It's also really fast.

This makes it very useful for matching large dictionaries and word lists on large volumes of text.

In [35]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
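
Conceptually, phrase matching is a keyword search over token sequences that returns token offsets instead of character positions. The sketch below shows that idea with plain lists of strings. `find_phrases` is a hypothetical helper written for this illustration; the real PhraseMatcher is implemented differently and much more efficiently.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of every phrase occurrence."""
    phrase_set = {tuple(phrase) for phrase in phrases}
    lengths = sorted({len(phrase) for phrase in phrase_set})
    matches = []
    for start in range(len(tokens)):
        for length in lengths:
            end = start + length
            # Compare the token window against the known phrases
            if end <= len(tokens) and tuple(tokens[start:end]) in phrase_set:
                matches.append((start, end))
    return matches


# Tokens written out by hand for this sketch
tokens = ["I", "have", "a", "Golden", "Retriever"]
phrases = [["Golden", "Retriever"], ["Poodle"]]

matches = find_phrases(tokens, phrases)
matched_texts = [" ".join(tokens[start:end]) for start, end in matches]
print(matches, matched_texts)
```

Because the result is a pair of token offsets, the match can be sliced straight out of the token list, which is the same reason the notebook can write `doc[start:end]` and then inspect the tokens in context.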


In [41]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"IS_PUNCT": True}, {"LOWER":"free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


In [40]:
[token.text for token in nlp("ad-free viewing")]

['ad', '-', 'free', 'viewing']

In [45]:
import json
from spacy.lang.en import English

with open("static/countries.json") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
print(patterns)
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia, Argentina]
[Czech Republic, Slovakia]


In [52]:
# Let’s use that country matcher on a longer text, analyze the syntax and update the document’s entities with the matched countries.

# Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
# Overwrite the entities in doc.ents and add the matched span.
# Get the matched span’s root head token.
# Print the text of the head token and the span.

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("static/countries.json") as f:
    COUNTRIES = json.loads(f.read())
with open("static/country_text.txt") as f:
    TEXT = f.read()

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Create a doc and find matches in it
doc = nlp(TEXT)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

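    # Get the span's root head token
    # The span’s root token is available as span.root. A token’s head is available via the token.head attribute.
    # Note: English() has no dependency parser, so no heads are predicted and the output
    # below just repeats a token from the span; load a model such as en_core_web_sm for real heads.
    span_root_head = span.root.head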
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

Namibia --> Namibia
South --> South Africa
Cambodia --> Cambodia
Kuwait --> Kuwait
Somalia --> Somalia
Haiti --> Haiti
Mozambique --> Mozambique
Somalia --> Somalia
Rwanda --> Rwanda
Singapore --> Singapore
Sierra --> Sierra Leone
Afghanistan --> Afghanistan
Iraq --> Iraq
Sudan --> Sudan
Congo --> Congo
Haiti --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
