# Chapter 2: Large-scale data analysis with spaCy

https://course.spacy.io/en/chapter2

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

In [1]:
import spacy
from spacy.lang.en import English
from spacy.matcher import Matcher

# Data Structures (1)
## Vocab, Lexemes, and StringStore

### Shared Vocab and StringStore Part 1
- __spaCy only communicates in hash ids__
- spaCy stores shared strings/tokens/data across multiple documents.
- spaCy saves memory by encoding all strings to hash values.
- Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
- String store: lookup table in both directions.
    - Passing a string returns a hash value
    ```python
    # Hash value
    earth_hash = nlp.vocab.strings['Earth']
    ```
    - Passing a hash value returns a string
    ```python
    # String value
    nlp.vocab.strings[earth_hash]
    ```
- Hashes cannot be reversed
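
The two-way lookup can be pictured with a small plain-Python stand-in. This is only a sketch under stated assumptions: `ToyStringStore` is a hypothetical class written for illustration, and it uses `hashlib.blake2b` as a stand-in hash function, so its hash values will not match the ones spaCy produces.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a two-way string/hash lookup table."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Assumption: any deterministic 64-bit hash is good enough for this sketch.
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        # Store the string under its hash so it can be resolved later.
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # String -> hash: always computable, nothing needs to be stored.
            return self.hash_string(item)
        # Hash -> string: only works if the string was stored earlier.
        return self._by_hash[item]


store = ToyStringStore()
earth_hash = store.add("Earth")
print(store["Earth"] == earth_hash)  # True
print(store[earth_hash])             # Earth

# A second, independent store has never seen "Earth".
other_store = ToyStringStore()
try:
    other_store[earth_hash]
except KeyError:
    print("Unknown hash: this store has never seen 'Earth'")
```

The hash alone carries no way back to the text, which is why the lookup from hash to string depends on a store that already holds the string.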

In [2]:
nlp = spacy.load("en_core_web_lg")

nlp.vocab.strings['Earth']



10533021089177626446

`nlp.vocab.strings[hash_value]` raises an error if the string store __has not seen the string__ behind that hash. Here the lookup succeeds because "Earth" is already in the store.

In [3]:
nlp.vocab.strings[10533021089177626446]

'Earth'

__Always pass around the shared vocab between a doc and the nlp object__

To look up a string from its hash value with `nlp.vocab.strings[input]`, the nlp object must first have processed text that contains the word we're trying to look up.

In [4]:
doc = nlp("I live on Earth. It's a beautiful planet with diverse life forms. "
          "Over millions of years, these creatures have adapted to harsh climates.")

# The nlp object has `memory` of the word 'Earth' and successfully returns the string, given its hash value.
nlp.vocab.strings[10533021089177626446]

'Earth'

### Shared Vocab and String Store Part 2
You can use the __nlp object__ and the __doc object__ to look up the string value or hash value of a token.

#### Find the string and hash values using the nlp object

In [5]:
doc = nlp("I love black coffee, from the hearts mountains of Costa Rica.")

# Display the hash value of the string "coffee"
print("Hash value:", nlp.vocab.strings['coffee'])

Hash value: 3197928453018144401


In [6]:
# Display the string of the hash value 3197928453018144401
print("String value:", nlp.vocab.strings[3197928453018144401])

String value: coffee


#### Find the string and hash values using the doc object

In [7]:
print("Hash value:", doc.vocab.strings['coffee'])
print("String value:", doc.vocab.strings[3197928453018144401])

Hash value: 3197928453018144401
String value: coffee


### Lexemes: entries in the vocabulary
A `Lexeme` object is an entry in the vocabulary. It contains the __context-independent__ information about a word.
- Word text: `lexeme.text` for the string and `lexeme.orth` for the hash value
- Lexical attributes of the string, e.g. `lexeme.is_alpha`
- Lexemes __DO NOT__ contain part-of-speech tags, dependencies, or entity labels.
>These attributes depend on the __CONTEXT__ of a sentence.

## Exercises: Strings to hashes

### Part 1

    Look up the string “cat” in nlp.vocab.strings to get the hash.
    Look up the hash to get back the string.


In [8]:
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(f"The hash id of the word 'cat' is {cat_hash}", end='\n\n')

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(f"The string of the hash id {cat_hash}, is '{cat_string}'")

The hash id of the word 'cat' is 5439657043933447811

The string of the hash id 5439657043933447811, is 'cat'


### Part 2

    Look up the string label “PERSON” in nlp.vocab.strings to get the hash.
    Look up the hash to get back the string.


In [9]:
from spacy.lang.en import English

nlp = English()
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(f"The hash id of the word 'PERSON' is {person_hash}", end='\n\n')

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(f"The string of the hash id {person_hash}, is '{person_string}'")

The hash id of the word 'PERSON' is 380

The string of the hash id 380, is 'PERSON'


## Exercises: Vocab, hashes, and lexemes

    Why does this code throw an error?

```python
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings["Bowie"]
print(bowie_id)

# Look up the ID for "Bowie" in the vocab
print(nlp_de.vocab.strings[bowie_id])
```
<br>

    Answer: The string "Bowie" isn’t in the German vocab, so the hash can’t be resolved in the string store.

# Data Structures (2)
## Doc, Span and Token

- The __doc__ object is the most important data structure.
- The doc object is created automatically whenever text is processed by the nlp object.

### The doc object

In [10]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a Doc object manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc.text

'Hello world!'
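
To see how `words` and `spaces` combine into the text, here is a plain-Python sketch. `join_words` is a hypothetical helper written for illustration and is not part of spaCy; it assumes each word is followed by a single space when its flag is `True`.

```python
def join_words(words, spaces):
    """Rebuild a text from words and per-word trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Append one space after a word only when its flag is True.
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


print(join_words(["Hello", "world", "!"], [True, False, False]))  # Hello world!
```

Reading the flags this way makes it easier to fill in `spaces` for a desired text: a word gets `True` exactly when a space follows it.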

### The Span object (1)

The span is the slice of a Doc consisting of one or more tokens.

The __Span__ object takes 3 arguments:
- The doc object it refers to.
- The start index
- The end index, exclusive
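
The exclusive end index follows the same convention as Python slicing. As a rough sketch, using a plain list of strings in place of a doc:

```python
tokens = ["Hello", "World", "!"]

start, end = 0, 2
# The slice covers indices 0 and 1; index 2 ("!") is excluded.
span_tokens = tokens[start:end]
print(span_tokens)  # ['Hello', 'World']

# With an exclusive end, the slice length is simply end - start.
print(len(span_tokens) == end - start)  # True
```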

### The Span object (2)

In [11]:
# Import the Doc and Span classes from spaCy
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "World", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
Span(doc, 0, 2)

Hello World

In [12]:
# Create a span with a label
span_with_label = Span(doc, 0, 2, label='GREETING')
print(span_with_label)
# .label is the hash ID of the label; .label_ returns the string instead
print(span_with_label.label)

Hello World
12946562419758953770


In [13]:
# Assign the span to doc.ents to add it to the doc's list of entities
doc.ents = [span_with_label]

In [14]:
doc.ents

(Hello World,)

## Best Practices

`Doc` and `Span` are very powerful and hold references and relationships of words and sentences.
- Convert results to strings as late as possible. Converting to strings earlier in the process loses all relationships between the tokens.
- Use token attributes if available - for example, token.i for the token index.
- Always pass shared vocab between the doc and the nlp object.

## Exercises: Creating a Doc

### Part 1

    Import the Doc from spacy.tokens.
    Create a Doc from the words and spaces. Don’t forget to pass in the vocab!


In [15]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


### Part 2

    Import the Doc from spacy.tokens.
    Create a Doc from the words and spaces. Don’t forget to pass in the vocab!


In [16]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]

# Each bool indicates whether the word is followed by a space
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


### Part 3

    Import the Doc from spacy.tokens.
    Complete the words and spaces to match the desired text and create a doc.

In [17]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


## Exercises: Docs, spans, entities from scratch

In this exercise, you’ll create the `Doc` and `Span` objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

    1. Import the `Doc` and `Span` classes from spacy.tokens.
    2. Use the `Doc` class directly to create a doc from the words and spaces.
    3. Create a `Span` for “David Bowie” from the doc and assign it the label "PERSON".
    4. Overwrite the `doc.ents` with a list of one entity, the “David Bowie” span.


In [18]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


## Exercises: Data structures best practices

The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.

### Part 1

    Why is the code bad?

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)
```

    Answer: It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.
    
    Response: That's correct!

    Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent. 

### Part 2

    Rewrite the code to use the native token attributes instead of lists of `token_texts` and `pos_tags`.
    Loop over each token in the doc and check the `token.pos_` attribute.
    Use `doc[token.i + 1]` to check for the next token and its `.pos_` attribute.
    If a proper noun before a verb is found, print its `token.text`.


In [19]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")


for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


    ✔ Great work! While the solution here works fine for the given example,
    there are still things that can be improved. If the doc ends with a proper noun,
    doc[token.i + 1] will fail. To make sure the code generalizes, you should first
    check if token.i + 1 < len(doc).

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")


for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        # Check the bounds first, then check if the next token is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin
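
Another way to avoid the out-of-range lookup is to iterate over adjacent pairs, so the last token never needs a successor. The sketch below uses a hypothetical `ToyToken` named tuple with hand-assigned tags in place of real spaCy tokens, so that it runs without a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; the tags below are assigned by hand.
ToyToken = namedtuple("ToyToken", ["i", "text", "pos_"])


def propn_before_verb(tokens):
    """Return the texts of proper nouns that are directly followed by a verb."""
    found = []
    # zip stops at the shorter sequence, so the final token is never
    # paired with a non-existent successor.
    for token, next_token in zip(tokens, tokens[1:]):
        if token.pos_ == "PROPN" and next_token.pos_ == "VERB":
            found.append(token.text)
    return found


ends_with_propn = [
    ToyToken(0, "I", "PRON"),
    ToyToken(1, "visited", "VERB"),
    ToyToken(2, "Berlin", "PROPN"),
]
propn_then_verb = [
    ToyToken(0, "Berlin", "PROPN"),
    ToyToken(1, "looks", "VERB"),
    ToyToken(2, "nice", "ADJ"),
]

print(propn_before_verb(ends_with_propn))  # []
print(propn_before_verb(propn_then_verb))  # ['Berlin']
```

With a real doc, the same pairing should be possible with `zip(doc, doc[1:])`, since slicing a doc yields a span that can be iterated token by token.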


# Word vectors and semantic similarity

### Comparing semantic similarity
- spaCy can compare two objects and predict how similar they are.
- `Doc.similarity()`, `Span.similarity()`, `Token.similarity()`
- Each method takes another object (`Doc`, `Span` or `Token`) and returns a similarity score, generally between 0 and 1.
- To compare similarity use a model that has word vectors included:
    - `en_core_web_md` or
    - `en_core_web_lg`
        
### Similarity example (1)

In [21]:
# Load a spaCy model with vectors
nlp = spacy.load("en_core_web_lg")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

print(f"doc1 and doc2 have {doc1.similarity(doc2):.0%} similarity.")

doc1 and doc2 have 86% similarity.


In [22]:
# Comparing two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]

print(f"token1 and token2 have {token1.similarity(token2):.0%} similarity.")

token1 and token2 have 74% similarity.


### Similarity example (2)

In [23]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(f"doc and token have {doc.similarity(token):.0%} similarity.")

doc and token have 33% similarity.


In [24]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(f"span and doc have {span.similarity(doc):.0%} similarity.")

span and doc have 62% similarity.


## How does spaCy predict similarity?

- Similarity is determined using word vectors
- Word vectors are multi-dimensional representations of word meaning
- Generated using an algorithm like Word2Vec and lots of text
    - Word2Vec is used to train word vectors from raw text
- Can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- `Doc` and `Span` vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words
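
The two defaults above, averaging token vectors and comparing with cosine similarity, can be sketched in plain Python. The helper names and the 3-dimensional vectors are made up for illustration; real word vectors have far more dimensions, and this is not spaCy's own implementation.

```python
import math


def average_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up token vectors; each "doc" vector is the average of its tokens.
doc_a = average_vector([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
doc_b = average_vector([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

print(doc_a)
print(cosine_similarity(doc_a, doc_b))
```

Because every token contributes equally to the average, words that two texts share pull their vectors together, which is one reason long documents with many irrelevant words compare less meaningfully than short phrases.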

### Word vectors in spaCy

In [25]:
# Load the larger model with vectors
nlp = spacy.load("en_core_web_lg")

doc = nlp("I have a banana")

# Access the vector via the token.vector attribute - Vector of banana
print(doc[3].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

### Similarity depends on the application context

- Useful for many applications: recommendation systems, flagging duplicates (posts on an online platform) etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In this example, both docs are about cats and share the words 'I' and 'cats', so their similarity is about 95%. In a sentiment analysis application, this high score would be misleading, because the two sentences express opposite sentiments.

In [26]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501446702124066


## Inspecting word vectors

In this exercise, you’ll use a larger English model, which includes around 685,000 word vectors. The model is already pre-installed.

    Load the medium "en_core_web_md" model with word vectors.
    Print the vector for "bananas" using the token.vector attribute.


In [27]:
import spacy

# Load the en_core_web_lg model
nlp = spacy.load("en_core_web_lg")

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

    ✔ Well done! In the next exercise, you'll be using spaCy to predict
    similarities between documents, spans and tokens via the word vectors under the
    hood.

## Comparing similarities
In this exercise, you’ll be using spaCy’s similarity methods to compare `Doc`, `Token` and `Span` objects and get similarity scores.

### Part 1

    Use the doc.similarity method to compare doc1 to doc2 and print the result.


In [28]:
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


### Part 2
    Use the token.similarity method to compare token1 to token2 and print the result.

In [29]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.2232533


### Part 3

    Create spans for “great restaurant”/“really nice bar”.
    Use span.similarity to compare them and print the result.

In [30]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.7517392


    ✔ Well done! Feel free to experiment with comparing more objects, if
    you like. The similarities are not *always* this conclusive. Once you're getting
    serious about developing NLP applications that leverage semantic similarity, you
    might want to train vectors on your own data, or tweak the similarity
    algorithm.

# Combining models and rules

Combining statistical models with rule-based systems is a powerful NLP technique.

## Statistical predictions vs. rules

|                         | Statistical models | Rule-based systems |
|:----------------------- | :----------------- | :----------------- |
| __Use Cases__           | Application needs to _generalize_ based on examples | Dictionary with a finite number of examples |
| __Real-world Examples__ | Product names, person names, subject/object relationships | Countries of the world, cities, drug names, dog breeds |
| __spaCy features__      | Entity recognizer, dependency parser, part-of-speech tagger | Tokenizer, Matcher, PhraseMatcher |


### Efficient phrase matching (1)
- `PhraseMatcher` works like regular expressions or keyword search, but with access to the tokens!
- Takes `Doc` objects as patterns
- More efficient and faster than the `Matcher`
- Useful for matching large word lists
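
To make "matching on tokens" concrete, here is a naive plain-Python sketch that scans a token list for known phrases. `find_phrases` is a hypothetical helper and a simple linear scan; it is not how `PhraseMatcher` is implemented and makes no claim about its performance.

```python
def find_phrases(tokens, phrases):
    """Return (label, start, end) for every phrase found in tokens.

    `phrases` maps a label to a list of phrases, each given as a list of
    tokens. The end index is exclusive, like a slice.
    """
    matches = []
    for label, patterns in phrases.items():
        for pattern in patterns:
            width = len(pattern)
            if width == 0:
                continue  # an empty phrase cannot match anything useful
            # Slide a window of the phrase's length over the tokens.
            for start in range(len(tokens) - width + 1):
                if tokens[start:start + width] == pattern:
                    matches.append((label, start, start + width))
    return matches


tokens = ["I", "have", "a", "Golden", "Retriever"]
phrases = {"DOG": [["Golden", "Retriever"]]}

for label, start, end in find_phrases(tokens, phrases):
    print(label, tokens[start:end])  # DOG ['Golden', 'Retriever']
```

Because each match comes back as token indices, the result can be sliced out of the original sequence instead of being searched for again as a raw string.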

### Efficient phrase matching (2)

In [31]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)


# Passing a Doc object as a new pattern using PhraseMatcher
pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])


doc = nlp("I have a Golden Retriever")

# Iterate over the matches in doc
for match_id, start, end in matcher(doc):
    # Return the matched span
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever


## Debugging patterns (1)

Why does this pattern not match the tokens “Silicon Valley” in the `doc`?

```python
pattern = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]

doc = nlp("Can Silicon Valley workers rein in big tech from within?")
```

Answer: The tokenizer doesn’t create tokens for single spaces, so there’s no token with the value " " in between.

That's correct!

The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token. 
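
The "one dictionary per token" rule can be illustrated with a toy matcher in plain Python. Everything here is an assumption made for the sketch: `token_matches` and `find_matches` are hypothetical helpers, only the `TEXT` and `LOWER` keys are supported, and whitespace splitting stands in for the tokenizer.

```python
def token_matches(token, spec):
    """Check one token string against one pattern dictionary."""
    for key, value in spec.items():
        if key == "TEXT":
            if token != value:
                return False
        elif key == "LOWER":
            if token.lower() != value:
                return False
        else:
            raise ValueError(f"Unsupported key in this sketch: {key}")
    return True


def find_matches(tokens, pattern):
    """Return (start, end) for each run of tokens that fits the pattern."""
    width = len(pattern)
    return [
        (start, start + width)
        for start in range(len(tokens) - width + 1)
        if all(token_matches(tokens[start + k], pattern[k]) for k in range(width))
    ]


# Assumption: whitespace splitting stands in for the tokenizer here.
tokens = "Can Silicon Valley workers rein in big tech from within ?".split()

with_space = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]
without_space = [{"LOWER": "silicon"}, {"LOWER": "valley"}]

print(find_matches(tokens, with_space))     # []
print(find_matches(tokens, without_space))  # [(1, 3)]
```

The first pattern asks for three consecutive tokens, the middle one being a single space. No such token exists, so nothing matches until the extra dictionary is removed.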

## Debugging patterns (2)

Both patterns in this exercise contain mistakes and won’t match as expected. Can you fix them? If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

    Edit pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
    Edit pattern2 so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.


In [32]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]

# The hyphen in `ad-free` is its own token in spaCy, so the pattern must be split into `ad`, `-`, and `free`
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


## Efficient phrase matching

Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable `COUNTRIES`.

    Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
    Add the phrase patterns and call the matcher on the doc.


In [33]:
import json
from spacy.lang.en import English

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.____ import ____

matcher = ____(____)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = ____(____)
print([doc[start:end] for match_id, start, end in matches])

FileNotFoundError: [Errno 2] No such file or directory: 'exercises/en/countries.json'

In [None]:
COUNTRIES = ['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'British Indian Ocean Territory',
 'United States Minor Outlying Islands',
 'Virgin Islands (British)',
 'Virgin Islands (U.S.)',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Cocos (Keeling) Islands',
 'Colombia',
 'Comoros',
 'Congo',
 'Congo (Democratic Republic of the)',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Falkland Islands (Malvinas)',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'French Southern Territories',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Heard Island and McDonald Islands',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 "Côte d'Ivoire",
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Isle of Man',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macao',
 'Macedonia (the former Yugoslav Republic of)',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Micronesia (Federated States of)',
 'Moldova (Republic of)',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Montserrat',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Niue',
 'Norfolk Island',
 "Korea (Democratic People's Republic of)",
 'Northern Mariana Islands',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine, State of',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Pitcairn',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Kosovo',
 'Réunion',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Saint Barthélemy',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin (French part)',
 'Saint Pierre and Miquelon',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Sint Maarten (Dutch part)',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Georgia and the South Sandwich Islands',
 'Korea (Republic of)',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Svalbard and Jan Mayen',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syrian Arab Republic',
 'Taiwan',
 'Tajikistan',
 'Tanzania, United Republic of',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tokelau',
 'Tonga',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United States of America',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam',
 'Wallis and Futuna',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe']

In [None]:
nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

    ✔ Well done! Let's use this matcher to add some custom entities.

## Extracting countries and relationships

In the previous exercise, you wrote a script using spaCy’s `PhraseMatcher` to find country names in text. Let’s use that country matcher on a longer text, analyze the syntax and update the document’s entities with the matched countries.

    Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
    Overwrite the entities in doc.ents and add the matched span.
    Get the matched span’s root head token.
    Print the text of the head token and the span.


In [None]:
TEXT = 'After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations such as Somalia, Haiti, Mozambique, and the former Yugoslavia. The UN mission in Somalia was widely viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission to Bosnia faced "worldwide ridicule" for its indecisive and confused mission in the face of ethnic cleansing. In 1994, the UN Assistance Mission for Rwanda failed to intervene in the Rwandan genocide amid indecision in the Security Council. Beginning in the last decades of the Cold War, American and European critics of the UN condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, withdrew his nation\'s funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, founded 1946) over allegations of mismanagement, followed by Britain and Singapore. 
Boutros Boutros-Ghali, Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of the organization somewhat. His successor, Kofi Annan (1997–2006), initiated further management reforms in the face of threats from the United States to withhold its UN dues. In the late 1990s and 2000s, international interventions authorized by the UN took a wider variety of forms. The UN mission in the Sierra Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 was overseen by NATO. In 2003, the United States invaded Iraq despite failing to pass a UN Security Council resolution for authorization, prompting a new round of questioning of the organization\'s effectiveness. Under the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the War in Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons inspectors to the Syrian Civil War. In 2013, an internal review of UN actions in the final battles of the Sri Lankan Civil War in 2009 concluded that the organization had suffered "systemic failure". One hundred and one UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization\'s history. The Millennium Summit was held in 2000 to discuss the UN\'s role in the 21st century. The three day meeting was the largest gathering of world leaders in history, and culminated in the adoption by all member states of the Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty reduction, gender equality, and public health. Progress towards these goals, which were to be met by 2015, was ultimately uneven. The 2005 World Summit reaffirmed the UN\'s focus on promoting development, peacekeeping, human rights, and global security. 
The Sustainable Development Goals were launched in 2015 to succeed the Millennium Development Goals. In addition to addressing global challenges, the UN has sought to improve its accountability and democratic legitimacy by engaging more with civil society and fostering a global constituency. In an effort to enhance transparency, in 2016 the organization held its first public debate between candidates for Secretary-General. On 1 January 2017, Portuguese diplomat António Guterres, who previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. Guterres has highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, more effective peacekeeping efforts, and streamlining the organization to be more responsive and versatile to global needs.\n'

In [None]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects for the countries and add them to the matcher
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Process the text
doc = nlp(TEXT)

# Reset the predicted entities so that only the matched GPE entities remain
doc.ents = []

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

In [None]:
# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

<br>
<br>

# Appendix

In [None]:
lexeme = nlp.vocab['love']

print(lexeme.text, lexeme.orth, lexeme.is_alpha)

`dir()` is a handy way to list every attribute available on a lexeme:
```python
dir(lexeme)
```

In [None]:
lexeme.text

In [None]:
doc = nlp("I love black coffee, from the hearts mountains of Costa Rica.")

for i in doc.noun_chunks:
    print(i)