# Chapter 2: Large-scale data analysis with spaCy

https://course.spacy.io/en/chapter2

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

In [1]:
import spacy
from spacy.lang.en import English
from spacy.matcher import Matcher

# Data Structures (1)
## Vocab, Lexemes, and StringStore

### Shared Vocab and StringStore Part 1
- __spaCy only communicates in hash ids__
- spaCy stores shared strings/tokens/data across multiple documents.
- spaCy saves memory by encoding all strings to hash values.
- Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
- String store: lookup table in both directions.
    - Passing a string returns a hash value
    ```python
    # Hash value
    earth_hash = nlp.vocab.strings['Earth']
    ```
    - Passing a hash value returns a string
    ```python
    # String value
    nlp.vocab.strings[earth_hash]
    ```
- Hashes cannot be reversed
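
The two-way lookup described in the list above can be sketched as follows. This is a minimal example, assuming a blank `English` pipeline (so no trained model is required) and using `StringStore.add` to register the string explicitly; the word "coffee" is only an illustrative choice.

```python
from spacy.lang.en import English

# A blank pipeline is enough here: the string store lives on nlp.vocab
nlp = English()

# add() stores the string and returns its hash
coffee_hash = nlp.vocab.strings.add("coffee")

# String -> hash and hash -> string both work once the string is stored
same_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

# Membership can be checked with the string itself
is_known = "coffee" in nlp.vocab.strings

print(coffee_hash == same_hash, coffee_string, is_known)
```

Because a hash cannot be reversed on its own, the hash-to-string direction only works for strings the store has already seen.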

In [2]:
nlp = spacy.load("en_core_web_lg")

nlp.vocab.strings['Earth']



10533021089177626446

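`nlp.vocab.strings[hash_value]` raises an error when the string store __has not seen the string behind that hash__. Here the lookup succeeds, presumably because the vocabulary loaded with `en_core_web_lg` already contains "Earth".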

In [3]:
nlp.vocab.strings[10533021089177626446]

'Earth'

__Always pass around the shared vocab between a doc and the nlp object__

To use both the string and the hash value as inputs to `nlp.vocab.strings[input]`, the nlp object must first process text that contains the word we want to look up by its hash value.

In [4]:
doc = nlp("I live on Earth. It's a beautiful planet with diverse life forms. "
          "Over millions of years, these creatures have adapted to harsh climates.")

# The nlp object now has 'Earth' in its string store and successfully returns the string, given its hash value.
nlp.vocab.strings[10533021089177626446]

'Earth'

### Shared Vocab and String Store Part 2
You can use the __nlp object__ and the __doc object__ to look up the string value or hash value of a token.

#### Find the string and hash values using the nlp object

In [5]:
doc = nlp("I love black coffee, from the hearts mountains of Costa Rica.")

# Display the hash value of the string "coffee"
print("Hash value:", nlp.vocab.strings['coffee'])

Hash value: 3197928453018144401


In [6]:
# Display the string of the hash value 3197928453018144401
print("String value:", nlp.vocab.strings[3197928453018144401])

String value: coffee


#### Find the string and hash values using the doc object

In [7]:
print("Hash value:", doc.vocab.strings['coffee'])
print("String value:", doc.vocab.strings[3197928453018144401])

Hash value: 3197928453018144401
String value: coffee
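
Both lookups return the same values because a doc keeps a reference to the vocab of the pipeline that created it. The sketch below illustrates this, assuming a blank `English` pipeline and a shortened example sentence.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love black coffee")

# The doc holds a reference to the same Vocab object as the pipeline
shares_vocab = doc.vocab is nlp.vocab

# A token stores the hash of its text; the string is resolved through the vocab
coffee_token = doc[3]
resolved = doc.vocab.strings[coffee_token.orth]

print(shares_vocab, coffee_token.orth == nlp.vocab.strings["coffee"], resolved)
```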


### Lexemes: entries in the vocabulary
A `Lexeme` object is an entry in the vocabulary. It contains the __context-independent__ information about a word.
- Word text: `lexeme.text` for the string and `lexeme.orth` for the hash value
- Lexical attributes of the string, e.g. `lexeme.is_alpha`
- Lexemes __DO NOT__ contain part-of-speech tags, dependencies, or entity labels.
>These attributes depend on the __CONTEXT__ of a sentence.
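
To make the idea of context independence concrete, here is a small sketch. It assumes a blank `English` pipeline, and the sentences are made up for illustration. The same word in two different docs points to the same vocabulary entry.

```python
from spacy.lang.en import English

nlp = English()

# Looking up a string on the vocab returns its Lexeme
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

# The word "coffee" appears in two unrelated sentences
doc_a = nlp("I love coffee")
doc_b = nlp("They sell coffee")

# Both tokens carry the same hash as the lexeme, regardless of context
same_entry = doc_a[2].orth == doc_b[2].orth == lexeme.orth
print(same_entry)
```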

## Exercises: Strings to hashes

### Part 1

    Look up the string “cat” in nlp.vocab.strings to get the hash.
    Look up the hash to get back the string.


In [8]:
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(f"The hash id of the word 'cat' is {cat_hash}", end='\n\n')

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(f"The string of the hash id {cat_hash}, is '{cat_string}'")

The hash id of the word 'cat' is 5439657043933447811

The string of the hash id 5439657043933447811, is 'cat'


### Part 2

    Look up the string label “PERSON” in nlp.vocab.strings to get the hash.
    Look up the hash to get back the string.


In [9]:
from spacy.lang.en import English

nlp = English()
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(f"The hash id of the word 'PERSON' is {person_hash}", end='\n\n')

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(f"The string of the hash id {person_hash}, is '{person_string}'")

The hash id of the word 'PERSON' is 380

The string of the hash id 380, is 'PERSON'


## Exercises: Vocab, hashes, and lexemes

    Why does this code throw an error?

```python
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings["Bowie"]
print(bowie_id)

# Look up the ID for "Bowie" in the vocab
print(nlp_de.vocab.strings[bowie_id])
```
<br>

    Answer: The string "Bowie" isn’t in the German vocab, so the hash can’t be resolved in the string store.
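
One way to see the answer in action is sketched below. The `try`/`except` wrapper and the explicit `add` call are additions for illustration, not part of the original exercise. Since the same string always hashes to the same value, the hash resolves in the German vocab once the string has been stored there.

```python
from spacy.lang.en import English
from spacy.lang.de import German

nlp = English()
nlp_de = German()

# Processing text stores "Bowie" in the English string store
doc = nlp("David Bowie")
bowie_id = nlp.vocab.strings["Bowie"]

# The German string store has not been given "Bowie", so the lookup may fail
try:
    resolved_before = nlp_de.vocab.strings[bowie_id]
except KeyError:
    resolved_before = None

# After adding the string, the same hash resolves in the German vocab too
nlp_de.vocab.strings.add("Bowie")
resolved_after = nlp_de.vocab.strings[bowie_id]

print(resolved_before, resolved_after)
```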

# Data Structures (2)
## Doc, Span and Token

- The __doc__ object is the most important data structure.
- The doc object is created automatically whenever text is processed by an nlp object.

### The doc object

In [10]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a Doc object manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc.text

'Hello world!'

### The Span object (1)

The span is the slice of a Doc consisting of one or more tokens.

The __Span__ object takes 3 arguments:
- The doc object it refers to.
- The start index
- The end index, exclusive

### The Span object (2)

In [11]:
# Import the Doc and Span classes from spaCy
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "World", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
Span(doc, 0, 2)

Hello World

In [12]:
# Create a span with a label
span_with_label = Span(doc, 0, 2, label='GREETING')
print(span_with_label)
print(span_with_label.label)

Hello World
12946562419758953770
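
The integer printed for `span_with_label.label` is the hash of the label string. As a minimal sketch (assuming a blank `English` pipeline), the readable label is available through the underscore attribute `label_`, and the hash resolves back to the string through the shared string store.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
doc = Doc(nlp.vocab, words=["Hello", "World", "!"], spaces=[True, False, False])
span = Span(doc, 0, 2, label="GREETING")

# .label is the hash; .label_ is the readable string
label_hash = span.label
label_text = span.label_

# The hash resolves back to the string through the shared string store
print(label_text, nlp.vocab.strings[label_hash] == label_text)
```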


In [13]:
# Assign the span to doc.ents to register it
# as an entity of the doc
doc.ents = [span_with_label]

In [14]:
doc.ents

(Hello World,)

## Best Practices

`Doc` and `Span` are very powerful and hold references and relationships of words and sentences.
- Convert results to strings as late as possible. Converting to strings too early loses all relationships between the tokens.
- Use token attributes if available - for example, token.i for the token index.
- Always pass shared vocab between the doc and the nlp object.
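
The practices above can be sketched as follows. The example is hypothetical: the sentence and the "number followed by a percent sign" rule are made up for illustration, and the doc is built by hand so that no trained model is needed. Spans are collected first and turned into strings only at the end.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
words = ["Sales", "rose", "20", "%", "in", "2019"]
spaces = [True, True, False, True, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Collect Span objects, not strings, so indices and context stay available
percentages = []
for token in doc:
    has_next = token.i + 1 < len(doc)
    if token.like_num and has_next and doc[token.i + 1].text == "%":
        percentages.append(doc[token.i : token.i + 2])

# Convert to strings only at the very end
print([(span.start, span.end, span.text) for span in percentages])
```

Keeping spans around means later steps can still ask for `span.start`, neighbouring tokens, or the parent `span.doc`, none of which a plain string can provide.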

## Exercises: Creating a Doc

### Part 1

    Import the Doc from spacy.tokens.
    Create a Doc from the words and spaces. Don’t forget to pass in the vocab!


In [15]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


### Part 2

    Import the Doc from spacy.tokens.
    Create a Doc from the words and spaces. Don’t forget to pass in the vocab!


In [16]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]

# Each bool indicates whether the token is followed by a space
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


### Part 3

    Import the Doc from spacy.tokens.
    Complete the words and spaces to match the desired text and create a doc.

In [17]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
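
To see how the `spaces` list determines the final text, the sketch below inspects each token's trailing whitespace through the `token.whitespace_` and `token.text_with_ws` attributes. It reuses the words and spaces from Part 3 and assumes a blank `English` pipeline.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# token.whitespace_ is " " when the token is followed by a space, else ""
for token in doc:
    print(repr(token.text), repr(token.whitespace_))

# Joining each token's text plus trailing whitespace reproduces doc.text
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == doc.text)
```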


## Exercises: Docs, spans, entities from scratch

In this exercise, you’ll create the `Doc` and `Span` objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

    1. Import the `Doc` and `Span` classes from spacy.tokens.
    2. Use the `Doc` class directly to create a doc from the words and spaces.
    3. Create a `Span` for “David Bowie” from the doc and assign it the label "PERSON".
    4. Overwrite the `doc.ents` with a list of one entity, the “David Bowie” span.


In [18]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


## Exercises: Data structures best practices

The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.

### Part 1

    Why is the code bad?

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)
```

    Answer: It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.
    
    Response: That's correct!

    Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent. 

### Part 2

    Rewrite the code to use the native token attributes instead of lists of `token_texts` and `pos_tags`.
    Loop over each token in the doc and check the `token.pos_` attribute.
    Use `doc[token.i + 1]` to check for the next token and its `.pos_` attribute.
    If a proper noun before a verb is found, print its `token.text`.


In [19]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")


for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


    ✔ Great work! While the solution here works fine for the given example,
    there are still things that can be improved. If the doc ends with a proper noun,
    doc[token.i + 1] will fail. To make sure the code generalizes, you should first
    check if token.i + 1 < len(doc).

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")


for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        # The bounds check must come first so the last token never raises an IndexError
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin
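
The edge case of a doc that ends with a proper noun can be checked without a trained model. The sketch below is an illustration under assumptions: `make_tagged_doc` and `proper_nouns_before_verbs` are hypothetical helper names, and the part-of-speech tags are assigned by hand rather than predicted.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

def make_tagged_doc(words, pos_tags):
    """Hypothetical helper: build a Doc and assign coarse POS tags by hand."""
    doc = Doc(nlp.vocab, words=words)
    for token, pos in zip(doc, pos_tags):
        token.pos_ = pos
    return doc

def proper_nouns_before_verbs(doc):
    """Return proper-noun tokens that are directly followed by a verb."""
    found = []
    for token in doc:
        # Bounds check first, so the last token never triggers an IndexError
        if token.pos_ == "PROPN" and token.i + 1 < len(doc):
            if doc[token.i + 1].pos_ == "VERB":
                found.append(token)
    return found

doc_ok = make_tagged_doc(["Berlin", "looks", "nice"], ["PROPN", "VERB", "ADJ"])
doc_edge = make_tagged_doc(["I", "like", "Berlin"], ["PRON", "VERB", "PROPN"])

print([t.text for t in proper_nouns_before_verbs(doc_ok)])
print([t.text for t in proper_nouns_before_verbs(doc_edge)])
```

The function returns tokens rather than strings, in line with the advice to convert to text as late as possible.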


# Word vectors and semantic similarity

# Combining models and rules

<br>
<br>

# Appendix

In [21]:
lexeme = nlp.vocab['love']

print(lexeme.text, lexeme.orth, lexeme.is_alpha)

love 3702023516439754181 True


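The built-in `dir()` function lists every attribute and method available on an object, which is handy for exploring what a lexeme offers: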
```python
dir(lexeme)
```

In [22]:
lexeme.text

'love'

In [23]:
doc = nlp("I love black coffee, from the hearts mountains of Costa Rica.")

for chunk in doc.noun_chunks:
    print(chunk)

I
black coffee
the hearts mountains
Costa Rica
