# Chapter 2: Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

### 2.1 Strings to hashes
**Part 1** 

- Look up the string “cat” in `nlp.vocab.strings` to get the hash.
- Look up the hash to get back the string.

In [1]:
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


**Part 2**

- Look up the string label “PERSON” in `nlp.vocab.strings` to get the hash.
- Look up the hash to get back the string.

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


#### Exercise 2.1: Why does this code throw an error?

In [3]:
# Exercise 2.1:
# -----------------
# Why does this code throw an error?


from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings["Bowie"]
print(bowie_id)

# Look up the ID for "Bowie" in the vocab
print(nlp_de.vocab.strings[bowie_id])

2644858412616767388


KeyError: "[E018] Can't retrieve string for hash '2644858412616767388'. This usually refers to an issue with the `Vocab` or `StringStore`."

*Your answer here...*

- I think the error occurs because the string `Bowie` is not in the German vocab's string store, so its hash cannot be resolved back to a string.
- I also noticed that the hash for `Bowie` was first looked up with the English `nlp` object, but the reverse lookup was done with the German `nlp_de` object, which has its own vocab.
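
The lookup behaviour can be illustrated with a toy model. This is a minimal sketch, not spaCy's implementation: `ToyStringStore` is a hypothetical class, and it hashes with `hashlib` rather than the hash function spaCy uses. It only shows why a hash computed from a string can be resolved back solely in a store that has actually seen that string.

```python
import hashlib


class ToyStringStore:
    """Hypothetical stand-in for a string store: maps 64-bit hashes to strings."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Deterministic hash: the same string gives the same number in every store
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        # Remember the string so that its hash can be resolved later
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash needs no stored data
            return self.hash_of(key)
        # Hash -> string only works if the string was added to *this* store
        return self._by_hash[key]


english_store = ToyStringStore()
german_store = ToyStringStore()

bowie_id = english_store.add("Bowie")
print(english_store[bowie_id])

try:
    german_store[bowie_id]
except KeyError:
    print("The German store has never seen this hash")

# Once the string is added, the same hash resolves in the German store too
german_store.add("Bowie")
print(german_store[bowie_id])
```

In this toy model the fix is to add the string to the second store before looking up the hash there.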

### 2.2 Creating a Doc

Let’s create some Doc objects from scratch!

**Part 1**

- Import the `Doc` from `spacy.tokens`.
- Create a `Doc` from the `words` and `spaces`. Don’t forget to pass in the vocab!

In [4]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


**Part 2**

- Import the `Doc` from `spacy.tokens`.
- Create a `Doc` from the `words` and `spaces`. Don’t forget to pass in the vocab!

In [5]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


#### Exercise 2.2: Complete the code

- Import the `English`.
- Complete the `words` and `spaces` to match the desired text and create a `doc`.

In [6]:
#### Exercise 2.2: Complete the code

# Import English
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]
# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
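
The relationship between `words`, `spaces` and the resulting text can be sketched without spaCy. The helper below is hypothetical (`join_words` is not a spaCy function) and only mirrors the rule used in the exercises above: each word is followed by a space exactly when its flag is `True`.

```python
def join_words(words, spaces):
    """Rebuild a text: each word is followed by a space if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


print(join_words(["spaCy", "is", "cool", "!"], [True, True, False, False]))
print(join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False]))
```

Checking a `spaces` list this way before building a `Doc` makes it easier to spot a misplaced flag.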


### 2.3 Docs, spans and entities from scratch 

In this exercise, you’ll create the `Doc` and `Span` objects manually, and update the named entities – just like spaCy does behind the scenes. A shared `nlp` object has already been created.

- Import the `Doc` and `Span` classes from `spacy.tokens`.
- Use the `Doc` class directly to create a doc from the words and spaces.
- Create a `Span` for “David Bowie” from the `doc` and assign it the label `"PERSON"`.
- Overwrite the `doc.ents` with a list of one entity, the “David Bowie” `span`.

In [7]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
# Take a span, i.e. a slice of the text
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]
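
The indices passed to `Span` are token positions, with the end index exclusive, just like a Python slice. A small sketch on plain lists shows which tokens `2, 4` selects; `span_text` is a hypothetical helper, not part of spaCy.

```python
words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]


def span_text(words, spaces, start, end):
    """Text of the tokens start..end-1, without the last token's trailing space."""
    pieces = [w + (" " if s else "") for w, s in zip(words[start:end], spaces[start:end])]
    return "".join(pieces).rstrip(" ")


# Tokens 2 and 3 are selected; index 4 itself is excluded
entity = (span_text(words, spaces, 2, 4), "PERSON")
print(entity)
```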


### 2.4 Data structures and best practices

The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.

In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags

token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)


Found proper noun before a verb: Berlin


#### Exercise 2.3: Why is the code bad?


*Your answer here...*


- `token_texts` and `pos_tags` are unnecessary intermediate steps.
- So is the `enumerate` loop.
- That said, the code works, so in that sense it is still good code.

- The code could be written in the same style as in section **1.3** of **Chapter 1**.

#### Exercise 2.4: Rewrite the code?

- Rewrite the code to use the native token attributes instead of lists of `token_texts` and `pos_tags`.
- Loop over each `token` in the `doc` and check the `token.pos_` attribute.
- Use `doc[token.i + 1]` to check for the next token and its `.pos_` attribute.
- If a proper noun before a verb is found, print its `token.text`.

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Your code here
# --------------


# Iterate over the tokens in the text
for token in doc:
    # .pos_ gives the token's part-of-speech tag
    # Check whether the tag is PROPN, i.e. a proper noun
    if token.pos_ == "PROPN":
        # Check whether the proper noun is followed by a verb
        # doc[token.i + 1] gives the next token; skip the check on the last token
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            # Finally, print the original token ...
            # ... if one was found in the text
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin
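
One edge case in this kind of lookahead is a proper noun at the very end of a text, where index `i + 1` does not exist. Pairing each token with its successor avoids indexing past the end. The sketch below works on plain `(text, pos)` tuples; the tags are written out by hand for illustration and `proper_nouns_before_verbs` is a hypothetical helper.

```python
def proper_nouns_before_verbs(tagged_tokens):
    """Return the texts of proper nouns that are directly followed by a verb."""
    found = []
    # zip stops at the shorter sequence, so the last token is never a "current" token
    for (text, pos), (_, next_pos) in zip(tagged_tokens, tagged_tokens[1:]):
        if pos == "PROPN" and next_pos == "VERB":
            found.append(text)
    return found


# Hand-written tags, assumed for illustration
tagged = [
    ("Berlin", "PROPN"), ("looks", "VERB"), ("like", "ADP"),
    ("a", "DET"), ("nice", "ADJ"), ("city", "NOUN"),
]
print(proper_nouns_before_verbs(tagged))

# A proper noun in final position does not raise an error
print(proper_nouns_before_verbs([("I", "PRON"), ("like", "VERB"), ("Berlin", "PROPN")]))
```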


### 2.5 Inspecting word vectors

In this exercise, you’ll use a larger English model, which includes around 20,000 word vectors. The model is already pre-installed.

- Load the medium `"en_core_web_md"` model with word vectors.
- Print the vector for `"bananas"` using the `token.vector` attribute.

In [10]:
import spacy

# Load the en_core_web_md model
nlp = spacy.load("en_core_web_md")

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

### 2.6 Comparing similarities

In this exercise, you’ll be using spaCy’s similarity methods to compare Doc, Token and Span objects and get similarity scores.

**Part 1**

- Use the `doc.similarity` method to compare `doc1` to `doc2` and print the result.

In [11]:
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2

# Check how close these two sentences are to each other
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


**Part 2**

- Use the `token.similarity` method to compare `TV` to `books` and print the result.

In [12]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.22325331


#### Exercise 2.5: Similarity between “great restaurant” and “really nice bar”?

- Create spans for “great restaurant”/“really nice bar”.
- Use `span.similarity` to compare them and print the result.

In [13]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Exercise 2.5: Complete the code
# ----------------
# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:-1]

print("Span1:",span1)
print("Span2:",span2)

# Get the similarity of the spans
similarity = span1.similarity(span2)
print("Similarity:",similarity)

Span1: great restaurant
Span2: really nice bar
Similarity: 0.75173926
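
By default, spaCy's `similarity` methods compare vectors with cosine similarity, and the vector of a `Doc` or `Span` defaults to the average of its token vectors. The following is a minimal sketch of that calculation in plain Python. The three-dimensional `toy_vectors` are made up for illustration and are not taken from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up vectors, for illustration only
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "restaurant": [0.2, 0.8, 0.5],
    "nice": [0.8, 0.2, 0.4],
    "bar": [0.3, 0.7, 0.6],
}

span1_vector = average_vector([toy_vectors["great"], toy_vectors["restaurant"]])
span2_vector = average_vector([toy_vectors["nice"], toy_vectors["bar"]])
print(cosine_similarity(span1_vector, span2_vector))
```

Because the score depends only on the angle between the vectors, it says nothing about whether two phrases mean the same thing, only that their vectors point in similar directions.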


### 2.6 Debugging patterns (1)


#### Exercise 2.6: Why does this pattern not match the tokens “Silicon Valley” in the doc?

```pattern = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]```

```doc = nlp("Can Silicon Valley workers rein in big tech from within?")```


*Your answer here...*

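- The tokenizer does not create a token for the space between the words, so the middle part of the pattern, `{"TEXT": " "}`, never matches any token.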
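
The effect can be reproduced on a plain token list. This sketch is not the `Matcher` itself: `find_lowercase_sequence` is a hypothetical helper that compares lowercased tokens position by position, and the token list is written out by hand on the assumption that the tokenizer yields no whitespace tokens here.

```python
def find_lowercase_sequence(tokens, pattern):
    """Return (start, end) index pairs where the lowercased tokens equal the pattern."""
    lowered = [token.lower() for token in tokens]
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(lowered) - n + 1)
        if lowered[i:i + n] == pattern
    ]


# Assumed tokenization: one entry per token, no entry for the spaces
tokens = ["Can", "Silicon", "Valley", "workers", "rein", "in",
          "big", "tech", "from", "within", "?"]

# Three pattern entries cannot line up with two tokens
print(find_lowercase_sequence(tokens, ["silicon", " ", "valley"]))

# One entry per token matches
print(find_lowercase_sequence(tokens, ["silicon", "valley"]))
```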

### 2.7 Debugging patterns (2)

#### Exercise 2.7: Both patterns in this exercise contain mistakes and won’t match as expected. Can you fix them? 

If you get stuck, try printing the tokens in the `doc` to see how the text will be split and adjust the pattern so that each dictionary represents one token.

- Edit pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
- Edit pattern2 so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.


In [14]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns

# In pattern1, the value of "LOWER" must be written in lowercase
#pattern1 = [[{"LOWER": "Amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]]
pattern1 = [[{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]]
# In pattern2, "ad-free" must be split into several tokens ...
# ... because it consists of two parts joined by a punctuation mark
#pattern2 = [[{"LOWER": "ad-free"}, {"POS": "NOUN"}]]
pattern2 = [[{"LOWER": "ad"}, {"IS_PUNCT": True}, {"LOWER": "free"}, {"POS": "NOUN"}]]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", pattern1)
matcher.add("PATTERN2", pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


### 2.8 Efficient phrase matching

Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

- Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
- Add the phrase patterns and call the matcher on the doc.


In [17]:
import json
from spacy.lang.en import English

with open("data/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
#print(*patterns)

matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]
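
Conceptually, phrase matching compares token sequences in the text against a set of known phrases. The following is a minimal sketch of that idea on plain token lists, not the algorithm `PhraseMatcher` uses. `match_phrases`, the whitespace tokenization and the short `countries` list are assumptions made for illustration.

```python
def match_phrases(tokens, phrases):
    """Return (start, end) token index pairs for every phrase found in tokens."""
    phrase_tuples = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(phrase) for phrase in phrase_tuples})
    matches = []
    for start in range(len(tokens)):
        for n in lengths:
            end = start + n
            # A set lookup per candidate slice replaces per-token pattern checks
            if end <= len(tokens) and tuple(tokens[start:end]) in phrase_tuples:
                matches.append((start, end))
    return matches


tokens = "Czech Republic may help Slovakia protect its airspace".split()
countries = ["Czech Republic", "Slovakia", "Slovenia"]  # hypothetical subset

matches = match_phrases(tokens, countries)
print([" ".join(tokens[start:end]) for start, end in matches])
```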


### 2.9 Extracting countries and relationships
In the previous exercise, there was a script using spaCy’s PhraseMatcher to find country names in text. Let’s use that country matcher on a longer text, analyze the syntax and update the document’s entities with the matched countries.

- Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
- Overwrite the entities in doc.ents and add the matched span.
- Get the matched span’s root head token.
- Print the text of the head token and the span.

#### Exercise 2.8: Complete the code.

In [18]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("data/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())
with open("data/country_text.txt", encoding="utf8") as f:
    TEXT = f.read()

print(TEXT)
    
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Create a doc and reset existing entities
doc = nlp(TEXT)
doc.ents = []

# Exercise 2.8: Complete the code.
# ----------------

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])
print("GPE count:", len([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"]))



After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations suc

**Hint.** The output should be:
```
in --> Namibia
in --> South Africa
Africa --> Cambodia
of --> Kuwait
as --> Somalia
Somalia --> Haiti
Haiti --> Mozambique
in --> Somalia
for --> Rwanda
Britain --> Singapore
War --> Sierra Leone
of --> Afghanistan
invaded --> Iraq
in --> Sudan
of --> Congo
earthquake --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
```
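
When comparing entity counts between approaches, it helps to separate the total number of mentions from the number of distinct names. A small sketch using the entity list from the hint above, copied by hand; the variable names are illustrative only.

```python
from collections import Counter

# Entity list copied from the expected output above
entities = [
    ("Namibia", "GPE"), ("South Africa", "GPE"), ("Cambodia", "GPE"),
    ("Kuwait", "GPE"), ("Somalia", "GPE"), ("Haiti", "GPE"),
    ("Mozambique", "GPE"), ("Somalia", "GPE"), ("Rwanda", "GPE"),
    ("Singapore", "GPE"), ("Sierra Leone", "GPE"), ("Afghanistan", "GPE"),
    ("Iraq", "GPE"), ("Sudan", "GPE"), ("Congo", "GPE"), ("Haiti", "GPE"),
]

gpe_counts = Counter(text for text, label in entities if label == "GPE")

print("GPE mentions:", sum(gpe_counts.values()))
print("Distinct GPEs:", len(gpe_counts))
print("Mentioned more than once:", sorted(t for t, c in gpe_counts.items() if c > 1))
```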



#### Exercise 2.9: 

Do the same as in Exercise 2.8, but use the entities predicted by `en_core_web_sm` directly. Does it find the same number of GPEs as the code in Exercise 2.8?


In [19]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("data/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())
with open("data/country_text.txt", encoding="utf8") as f:
    TEXT = f.read()

print(TEXT)

# Your code here:
# ------------------
nlp = spacy.load("en_core_web_sm")
doc = nlp(TEXT)
ent = doc.ents

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])
print("GPE count:", len([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"]))


After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations suc

# Reflection
1. Print the Finnish stop word list. Is it longer or shorter than the English stop word list?
2. What is a "word vector"? How many features does a spaCy word vector have?
3. What is "bag of words"?
4. What is a lemma? What about lemmatization?
5. Why does spaCy not use stemming?

*Your answers here...*

In [20]:
import spacy
# Import the stop word sets explicitly so the cell does not rely on earlier imports
from spacy.lang.en.stop_words import STOP_WORDS as english_stopwords
from spacy.lang.fi.stop_words import STOP_WORDS as finnish_stopwords

print('spaCy Version: %s' % (spacy.__version__))
print('Number of english stop words: %d' % len(english_stopwords))
print('Number of finnish stop words: %d' % len(finnish_stopwords))

spaCy Version: 3.0.5
Number of english stop words: 326
Number of finnish stop words: 822


1. The output above shows that the Finnish stop word list is longer than the English one.

2. A word vector describes the properties of a word and consists of small numbers. Word vectors are easier for machines to process, and the properties of words can be recognized from them. The word vector has 300 dimensions, but it can be reduced to two dimensions using PCA.

3. It is a method used in natural language processing. The text is split into parts, and the frequency of each word is examined without regard to word order. This can later be used for machine learning and classification when analyzing the data (text).

4. A lemma is the base form of a word. Lemmatization is a method used in NLP when processing data (text). It returns the base form of a word so that machine learning algorithms can handle the text better.

5. spaCy only uses lemmatization, probably because lemmatization is a better way of doing the same job.
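
The bag-of-words idea from answer 3 can be sketched with the standard library alone. This is a minimal illustration with a deliberately crude regular-expression tokenizer; `bag_of_words` is a hypothetical helper, not a spaCy function.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased word occurs; word order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


bow = bag_of_words("The cat sat on the mat. The mat was warm.")
print(bow.most_common(2))

# Two texts with the same words in a different order give the same bag
print(bag_of_words("the mat was warm") == bag_of_words("warm was the mat"))
```

Because only the counts survive, two sentences with very different meanings can end up with identical representations, which is one motivation for the word vectors discussed in answer 2.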