# Data Structures: Vocab, Lexemes and StringStore

Welcome back! Now that we have had some real experience using spaCy's objects, it's time to learn more about what's actually going on under spaCy's hood.

In this lesson, we will take a look at the shared vocabulary and how spaCy deals with strings.

* spaCy stores all shared data in a vocabulary, the Vocab.

* This includes words, but also the label schemes for tags and entities. 
* To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time. 
* Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp.vocab.strings. 
* It's a lookup table that works in both directions. We can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs. 
* Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab. 

* Vocab: stores data shared across multiple documents 
* To save memory, spaCy encodes all strings to hash values. 
* Strings are only stored once in the StringStore via nlp.vocab.strings
* String store: lookup table in both directions. 
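
The two-way lookup described above can be sketched with a toy model in plain Python. This is only an illustration under stated assumptions: `ToyStringStore` and `toy_hash` are hypothetical names, and `hashlib.blake2b` is a stand-in for whatever hash function spaCy actually uses, so the hash values will not match spaCy's.

```python
import hashlib


def toy_hash(string: str) -> int:
    """Map a string to a 64-bit integer (stand-in for spaCy's real hash function)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string: str) -> int:
        key = toy_hash(string)
        self._by_hash[key] = string  # the string itself is stored only once
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return toy_hash(key)  # string -> hash can always be computed
        return self._by_hash[key]  # hash -> string needs a stored entry


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)  # True
print(store[coffee_hash])  # coffee

# A store that never saw "coffee" cannot turn the hash back into a string.
try:
    ToyStringStore()[coffee_hash]
except KeyError:
    print("unknown hash: the string was never stored")
```

The last lookup fails because a hash carries no information about the original characters. That is the reason the shared vocab has to travel with the data.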

In [None]:
pip install spacy

In [19]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     -------------------------------------- 12.8/12.8 MB 483.2 kB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')





In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.vocab.strings.add("coffee")
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

# Hashes can't be reversed, which is why we need to provide the shared vocab.
# This lookup works here because "coffee" was added above. On a vocab that
# has not seen the string before, it raises an error.
string = nlp.vocab.strings[3197928453018144401]

* To get the hash for a string, we can look it up in nlp.vocab.strings. 
* To get the string representation of a hash, we can look up the hash. 
* A Doc object also exposes its vocab and strings. 

In [21]:
# Look up string and hash in nlp.vocab.strings 
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [22]:
# The doc also exposes the vocab and strings 
doc = nlp("I love coffee")
print("hash value:", doc.vocab.strings["coffee"])

hash value: 3197928453018144401


## Lexemes: entries in the vocabulary

* Lexemes are context-independent entries in the vocabulary. 

* We can get a lexeme by looking up a string or hash ID in the vocab. 
* Lexemes expose attributes, just like tokens. 
* They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters. 
* Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.

In [26]:
# A Lexeme object is an entry in the vocabulary 
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


* A lexeme contains the context-independent information about a word. 
    * Word text: lexeme.text and lexeme.orth (the hash)
    * Lexical attributes like lexeme.is_alpha 
    * It does not contain context-dependent part-of-speech tags, dependencies or entity labels. 
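
To make the split between context-independent and context-dependent information concrete, here is a minimal toy sketch in plain Python. It does not use spaCy's classes: `ToyLexeme`, `ToyToken` and `get_lexeme` are hypothetical names, the hash is a `hashlib` stand-in, and the part-of-speech tags are assigned by hand rather than predicted by a model.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent facts about a word type (hypothetical model)."""
    text: str
    orth: int
    is_alpha: bool


@dataclass(frozen=True)
class ToyToken:
    """One occurrence of a word: a shared lexeme plus a context-dependent label."""
    lexeme: ToyLexeme
    pos: str  # assigned by hand here; a real pipeline would predict it


vocab = {}


def get_lexeme(text: str) -> ToyLexeme:
    """Return the single shared entry for a word, creating it on first use."""
    if text not in vocab:
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        orth = int.from_bytes(digest, "little")  # stand-in hash, not spaCy's
        vocab[text] = ToyLexeme(text, orth, text.isalpha())
    return vocab[text]


# "love" as a verb ("I love coffee") and as a noun ("their love of coffee")
verb = ToyToken(get_lexeme("love"), pos="VERB")
noun = ToyToken(get_lexeme("love"), pos="NOUN")

print(verb.lexeme is noun.lexeme)  # True: one vocabulary entry for both uses
print(verb.pos, noun.pos)  # VERB NOUN
print(verb.lexeme.text, verb.lexeme.is_alpha)  # love True
```

Both tokens point at the same vocabulary entry, so the text, hash and `is_alpha` flag are stored once, while the tag lives on each occurrence.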