# Chapter 2: Large-scale data analysis with spaCy

https://course.spacy.io/en/chapter2

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

# Data Structures Part 1
## Vocab, Lexemes, and StringStore

### Shared Vocab and StringStore Part 1
- spaCy stores shared data, such as strings and lexical entries, once in a vocabulary (`nlp.vocab`) that is used across multiple documents.
- spaCy saves memory by encoding all strings to hash values.
- Each string is stored only once in the `StringStore`, available via `nlp.vocab.strings`.
- The string store is a lookup table that works in both directions:
    - Passing a string returns a hash value
    ```python
    # Hash value
    earth_hash = nlp.vocab.strings['Earth']
    ```
    - Passing a hash value returns a string
    ```python
    # String value
    nlp.vocab.strings[earth_hash]
    ```
- Hashes cannot be reversed
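
The two-way lookup described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not spaCy's implementation: `ToyStringStore` and `toy_hash` are hypothetical names, and the hash function (truncated BLAKE2b) is a stand-in chosen for convenience.

```python
import hashlib


def toy_hash(text: str) -> int:
    """Map a string to a 64-bit integer (stand-in hash, not the one spaCy uses)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        # Each string is stored exactly once, keyed by its hash.
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # A hash can always be computed from a string.
            return toy_hash(key)
        # A hash cannot be reversed, so the string must have been stored before.
        return self._by_hash[key]


store = ToyStringStore()
earth_hash = store.add("Earth")
print(store["Earth"] == earth_hash)  # string -> hash
print(store[earth_hash])             # hash -> string

unseen_hash = store["Mars"]  # computing a hash does not store the string
try:
    store[unseen_hash]
except KeyError:
    print("No string stored for this hash")
```

Because the hash is one-way, the only route from a hash back to its string is the stored table, which is why an unseen hash cannot be resolved.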

In [2]:
! python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation:
/usr/local/anaconda3/lib/python3.7/site-packages/spacy

NAME             SPACY               VERSION
en_core_web_md   >=3.0.0rc3,<3.1.0   3.0.0a1   ✔
en_core_web_sm   >=3.0.0rc3,<3.1.0   3.0.0a1   ✔
en_core_web_lg   >=3.0.0rc3,<3.1.0   3.0.0a1   ✔



In [3]:
nlp = spacy.load("en_core_web_lg")

nlp.vocab.strings['Earth']



10533021089177626446

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 2.3.5 requires catalogue<1.1.0,>=0.0.7, but you have catalogue 2.0.1 which is incompatible.
spacy 2.3.5 requires srsly<1.1.0,>=1.0.2, but you have srsly 2.3.2 which is incompatible.
spacy 2.3.5 requires thinc<7.5.0,>=7.4.1, but you have thinc 8.0.1 which is incompatible.
spacy-transformers 1.0.0rc0 requires transformers<3.1.0,>=3.0.0, but you have transformers 4.2.2 which is incompatible.
allennlp-models 1.0.0 requires allennlp==1.0.0, but you have allennlp 2.0.1 which is incompatible.

pip install catalogue==1.0.0
pip install srsly==1.0.2
pip install thinc==7.4.1
pip install transformers==3.0.0

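Looking up a hash with `nlp.vocab.strings[hash_value]` raises an error if the nlp object __has not seen the string behind that hash__. In the cell below the lookup succeeds, which indicates that `Earth` is already in the string store of this `nlp` object.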

In [4]:
nlp.vocab.strings[10533021089177626446]

'Earth'

__Always pass around the shared vocab between a doc and the nlp object__

To use both the string and its hash value as inputs to `nlp.vocab.strings[input]`, the nlp object must first be given text that contains the word we want to look up by its hash value.

In [5]:
doc = nlp("I live on Earth. It's a beautiful planet with diverse life forms. "
          "Over millions of years, these creatures have adapted to harsh climates")

# The nlp object has seen the word 'Earth' and successfully returns the string, given its hash value.
nlp.vocab.strings[10533021089177626446]

'Earth'
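
The effect of processing text on the string store is easier to see with a blank pipeline, which starts with far fewer strings than `en_core_web_lg`. The following is a minimal sketch using the `English` class imported at the top; `nlp_blank` is a name introduced here for illustration, and it assumes that a blank pipeline has not stored "Earth" before any text is processed.

```python
from spacy.lang.en import English

nlp_blank = English()  # blank pipeline: tokenizer only, no trained components

# Computing a hash from a string works even if the string was never seen.
earth_hash = nlp_blank.vocab.strings["Earth"]

# Whether this is False depends on the assumption stated above.
print("Seen before processing:", "Earth" in nlp_blank.vocab.strings)

# Tokenizing text adds the strings of its tokens to the shared store.
doc = nlp_blank("I live on Earth.")
print("Seen after processing:", "Earth" in nlp_blank.vocab.strings)

# The hash can now be resolved back to the string.
print(nlp_blank.vocab.strings[earth_hash])
```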

### Shared Vocab and StringStore Part 2
You can use the __nlp object__ and the __doc object__ to look up the string value or hash value of a token.

#### Find the string and hash values using the nlp object

In [6]:
doc = nlp("I love black coffee, from the hearts mountains of Costa Rica.")

# Display the hash value of the string "coffee"
print("Hash value:", nlp.vocab.strings['coffee'])

Hash value: 3197928453018144401


In [7]:
# Display the string of the hash value 3197928453018144401
print("String value:", nlp.vocab.strings[3197928453018144401])

String value: coffee


#### Find the string and hash values using the doc object

In [8]:
print("Hash value:", doc.vocab.strings['coffee'])
print("String value:", doc.vocab.strings[3197928453018144401])

Hash value: 3197928453018144401
String value: coffee
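
Both lookups return the same values because a doc holds a reference to the vocab of the nlp object that created it. A short sketch of that relationship, again with a blank `English` pipeline; `nlp_blank` and `manual_doc` are names introduced here for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp_blank = English()
doc = nlp_blank("I love black coffee.")

# The doc does not copy the vocab; it points to the same object.
print(doc.vocab is nlp_blank.vocab)

# A Doc built by hand shares strings too, as long as it is given the same vocab.
manual_doc = Doc(nlp_blank.vocab, words=["black", "coffee"])
coffee_hash = nlp_blank.vocab.strings["coffee"]
print(manual_doc.vocab.strings[coffee_hash])
```

This is the reason to always pass the shared vocab around: a doc created with a different vocab may not be able to resolve hashes that were stored elsewhere.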


### Lexemes: entries in the vocabulary
A `Lexeme` object is an entry in the vocabulary. It contains the __context-independent__ information about a word.
- Word text: `lexeme.text` for the string and `lexeme.orth` for the hash value
- Lexical attributes of the string, e.g. `lexeme.is_alpha`
- Lexemes DO NOT contain part-of-speech tags, dependencies, or entity labels. These attributes depend on the __CONTEXT__ of a sentence.

In [9]:
lexeme = nlp.vocab['coffee']

print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True
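
To connect lexemes back to tokens, the sketch below compares a token in a sentence with the vocabulary entry for the same word. It uses a blank `English` pipeline (`nlp_blank` is a name introduced here), so no trained components are assumed.

```python
from spacy.lang.en import English

nlp_blank = English()
doc = nlp_blank("I love black coffee.")

token = doc[3]                       # the token "coffee" in this sentence
lexeme = nlp_blank.vocab["coffee"]   # the context-independent vocabulary entry

# The token and the lexeme share the same hash and lexical attributes.
print(token.text, token.orth == lexeme.orth, token.is_alpha == lexeme.is_alpha)

# The lexeme's hash resolves to its text through the string store.
print(nlp_blank.vocab.strings[lexeme.orth])

# The token additionally knows its position, which depends on the sentence.
print(token.i)
```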
