# SpaCy

SpaCy tries to minimize the time you need to build an NLP application.
To do this, SpaCy provides language-specific classes that already contain the whole processing pipeline.
All you need to do is deactivate the pipeline steps you don't need and tune the models for
your specific dataset.

Below is a simple NLP example that uses SpaCy.

(This is an altered version of the tutorial by Ines Montani, available [here](https://course.spacy.io/en/).)

In [None]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Create a Doc by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

# Or get a specific token and print its text
print(doc[1].text)

# Different token attributes
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])
print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Now that we have seen a basic workflow, we can dive deeper into SpaCy's features:

## Statistical models

Statistical models are used to predict linguistic attributes in context:
* Part-of-speech tags
* Syntactic dependencies
* Named entities

This means SpaCy's models are trained on labeled example texts (and can be retrained or updated to increase accuracy in your project) to detect in what
context a word appears and what type of word it is.
This is especially useful later on, when we build queries to look for specific sentences in a text.

To use the models we need to install them first:

```
$ python -m spacy download en_core_web_sm
```

In [None]:
import spacy

# Load the model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text, the part-of-speech tag, the dependency label and the head (parent) token
    print(token.text, token.pos_, token.dep_, token.head.text)

We can also predict named entities, which are real-world objects that have a name, such as a person, a company or a country:

In [None]:
# Process text
doc = nlp("Apple is looking at buying U.K. startup for 1billion euro")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

To understand the tags and labels better, SpaCy has the explain function, which returns a short description of what a tag or label means:

In [None]:
print(spacy.explain("GPE"))

print(spacy.explain("NNP"))

## Match patterns
With pattern matching we can filter out specific sentences or parts of a given text.

To implement this, we create a list of dictionaries that describes what
the pattern should look like. Each dictionary describes one token:

In [None]:
# Match exact token texts
[{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Match lexical attributes (matches "iphone x" regardless of capitalization)
[{"LOWER": "iphone"}, {"LOWER": "x"}]

# Match any token attributes (a token whose lemma, i.e. base form, is "buy", followed by a noun)
[{"LEMMA": "buy"}, {"POS": "NOUN"}]

# Matches can be quantified with the "OP" key ("?": 0 or 1 times, "!": 0 times (negation), "+": 1 or more times, "*": 0 or more times)
[{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

The last pattern matches a token with the lemma "buy", followed by an optional determiner (article) and finally a noun.
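
To make the effect of the optional operator more concrete, here is a minimal sketch of a toy matcher in plain Python. It is not SpaCy's implementation and only supports "OP": "?"; the helper names (`match_at`, `find_matches`) and the hand-annotated tokens are assumptions made up for this illustration.

```python
# Toy matcher (NOT SpaCy's implementation): supports plain tokens and "OP": "?".

def match_at(tokens, start, pattern):
    """Return the exclusive end index of a match starting at `start`, or None."""

    def step(token_index, pattern_index):
        if pattern_index == len(pattern):
            return token_index  # whole pattern consumed
        spec = pattern[pattern_index]
        attrs = {key: value for key, value in spec.items() if key != "OP"}
        fits = token_index < len(tokens) and all(
            tokens[token_index].get(key) == value for key, value in attrs.items()
        )
        if fits:
            end = step(token_index + 1, pattern_index + 1)
            if end is not None:
                return end
        if spec.get("OP") == "?":
            # Optional token: try again while skipping this part of the pattern
            return step(token_index, pattern_index + 1)
        return None

    return step(start, 0)


def find_matches(tokens, pattern):
    """Collect (start, end) pairs for every position where the pattern matches."""
    matches = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None:
            matches.append((start, end))
    return matches


# Hand-annotated tokens (hypothetical values; in SpaCy the model predicts them)
tokens = [
    {"TEXT": "I", "LEMMA": "I", "POS": "PRON"},
    {"TEXT": "bought", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "a", "LEMMA": "a", "POS": "DET"},
    {"TEXT": "smartphone", "LEMMA": "smartphone", "POS": "NOUN"},
    {"TEXT": "before", "LEMMA": "before", "POS": "ADP"},
    {"TEXT": "buying", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "apps", "LEMMA": "app", "POS": "NOUN"},
]
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

matched = [
    " ".join(token["TEXT"] for token in tokens[start:end])
    for start, end in find_matches(tokens, pattern)
]
print(matched)
```

Both "bought a smartphone" (with determiner) and "buying apps" (without) are found, because the determiner is optional.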

### Example Matcher

In [None]:
import spacy

# Import Matcher
from spacy.matcher import Matcher

# Load model
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", None, pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)
    # Get the span's root token (decides category of the phrase)
    print("Root token:", span.root.text)
    # And root head token (The syntactic parent that governs the phrase)
    print("Root head token:", span.root.head.text)
    # Get the previous token and its POS tag
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

### Phrase matcher
PhraseMatcher is faster and more efficient than Matcher when matching long lists of exact phrases, because it takes whole Doc objects as patterns

#### Example

In [None]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)


## Shared vocab and string store

SpaCy stores every string together with its hash value in the string store, which is available via "nlp.vocab.strings".
Context-independent information about a word is kept as a "Lexeme" in the vocab.
Internally SpaCy only uses the hash values, which saves memory.

A Lexeme stores:
* Text (lexeme.text)
* Hash ID (lexeme.orth)
* Lexical attributes like lexeme.is_alpha

Note: if a string doesn't exist in the string store, we can't use its hash value to look it up
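
The two-way lookup and the note above can be illustrated with a small stand-in written in plain Python. This is only a sketch: the class `ToyStringStore` is hypothetical, and it uses a truncated SHA-256 digest from `hashlib` as an assumed stand-in for SpaCy's own hash function.

```python
import hashlib


class ToyStringStore:
    """Toy stand-in for a string store: maps 64-bit hashes back to strings."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Assumption: any stable hash works for the illustration
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        return self._by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(coffee_hash, "->", store[coffee_hash])

# A hash can always be computed, but it can only be resolved back to a
# string if that string was added to the store before.
tea_hash = ToyStringStore.hash_of("tea")
try:
    store[tea_hash]
    missing = False
except KeyError:
    missing = True
print("tea missing from store:", missing)
```

Storing each string once and passing only integers around is what keeps the memory footprint small.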

In [None]:
# Get hash value or text
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

# Other way to get the information
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])

# Get the lexeme itself and show its information
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

## Doc Data Structure

A Doc is created from 3 values:
* words: a list of the token texts in a string
* spaces: a list of boolean values that indicate whether a word is followed by a space
* vocab: the shared vocab

### Span
A Span is a slice of the Doc and takes 3 arguments:
* doc: the Doc it refers to
* start: the index of the first token
* end: the index of the end token (exclusive, so the Span doesn't contain the token at the end index)
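
The roles of the words and spaces lists and the exclusive end index can be checked with plain Python lists. This is a minimal sketch that mimics the idea without using SpaCy; the variable names are made up for the illustration.

```python
# Words plus a flag for "is this word followed by a space?"
words = ["hello", "world", "!"]
spaces = [True, False, False]

# Together they are enough to rebuild the original text exactly
text = "".join(word + (" " if space else "") for word, space in zip(words, spaces))
print(text)

# A span behaves like a Python slice: the end index is exclusive
start, end = 0, 2
span_words = words[start:end]
print(span_words)
```

The slice from 0 to 2 contains the tokens at index 0 and 1, but not the "!" at index 2.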

With that knowledge we can define our own named entities in the form of Spans and add them to doc.ents to mark specific
patterns in a text.

Good to know:
* Use token attributes when possible and convert tokens to strings as late as possible, otherwise the performance of your
    project will suffer
* Don't forget to pass in the shared vocab when creating the Doc

In [None]:
from spacy.tokens import Doc, Span

# Create doc manually
doc = Doc(nlp.vocab, words=["hello", "world", "!"], spaces=[True, False, False])

# Create span manually (with label)
span = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span]



## Word vectors and semantic similarity

Because SpaCy can store a word vector for every token, it allows you to compare two objects and predict how similar they are:
* Doc.similarity()
* Span.similarity()
* Token.similarity()

**Important**: only the medium and large models ship with word vectors, so you need one of them to use this feature!
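
By default, the similarity is the cosine similarity of two vectors, and the vector of a Doc or Span defaults to the average of its token vectors. The following sketch shows that calculation in plain Python. The three-dimensional vectors are made up for the illustration; real model vectors have far more dimensions.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b)


def average(vectors):
    """Element-wise mean, used here as the vector of a multi-token text."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Made-up toy vectors (assumption: food words point in a similar direction)
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

food_score = cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"])
car_score = cosine_similarity(toy_vectors["pizza"], toy_vectors["car"])
print("pizza vs pasta:", round(food_score, 3))
print("pizza vs car:  ", round(car_score, 3))

# Vector for a two-token text, built by averaging its token vectors
text_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])
print(text_vector)
```

Words whose vectors point in a similar direction get a score close to 1, unrelated ones a score close to 0.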

In [None]:
# Load medium/large model (they contain the needed vectors)
nlp = spacy.load("en_core_web_md")

# Compare two documents / span / token
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

## Combining models and rules
To gather information from data, SpaCy offers two approaches that can be combined:

| | Statistical models | Rule-based systems |
|----|----|----|
| Use cases | application needs to generalize based on examples | dictionary with finite number of examples |
| Real-world examples | product names, person names, subject/object relationships | countries of the world, cities, drug names, dog breeds |
| spaCy features | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, Matcher, PhraseMatcher |


## Processing pipelines

As mentioned earlier, SpaCy's nlp object already provides you with a whole NLP pipeline. It looks like this:

| Name | Description | Creates |
|---|---|---|
| 1. tokenizer | Turns text into a Doc object | Doc |
| 2. tagger | Part-of-speech tagger | Token.tag, Token.pos |
| 3. parser | Dependency parser | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| 4. ner | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type |
| 5. textcat | Text classifier | Doc.cats |

We can check which pipeline components are applied with the following commands:

In [None]:
# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

## Custom pipeline components
You can create custom pipeline steps like this:

In [None]:
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

In addition, you can decide where in the pipeline the component should be executed:

| Argument | Description | Example |
|---|---|---|
| last | If True, add last | nlp.add_pipe(component, last=True) |
| first | If True, add first | nlp.add_pipe(component, first=True) |
| before | Add before component | nlp.add_pipe(component, before="ner") |
| after | Add after component | nlp.add_pipe(component, after="tagger") |
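
The effect of the position arguments can be sketched with a toy pipeline in plain Python. This is not SpaCy's implementation: the functions `add_pipe`, `make_component` and `run` below are hypothetical helpers, and the "doc" is just a dictionary that records which components touched it.

```python
# Toy pipeline (NOT SpaCy's implementation): a list of (name, component) pairs.

def add_pipe(pipeline, name, component, first=False, before=None, after=None):
    """Insert a component; without a position argument it is added last."""
    names = [existing_name for existing_name, _ in pipeline]
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(pipeline)
    pipeline.insert(index, (name, component))


def make_component(name):
    """Build a hypothetical component that records its name on the doc."""
    def component(doc):
        doc["steps"].append(name)
        return doc
    return component


def run(pipeline, doc):
    """Apply every component in order; each one returns the (modified) doc."""
    for _, component in pipeline:
        doc = component(doc)
    return doc


pipeline = []
for name in ("tagger", "parser", "ner"):
    add_pipe(pipeline, name, make_component(name))

add_pipe(pipeline, "custom_first", make_component("custom_first"), first=True)
add_pipe(pipeline, "custom_before_ner", make_component("custom_before_ner"), before="ner")
add_pipe(pipeline, "custom_after_tagger", make_component("custom_after_tagger"), after="tagger")

result = run(pipeline, {"text": "Hello world!", "steps": []})
print(result["steps"])
```

Because every component receives the doc returned by the previous one, the position decides which attributes are already available when a custom step runs.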

## Custom attributes
Custom metadata can be added to Docs, Tokens and Spans, and read or changed via the "._" property.

There are 3 types of extensions:
1. Attribute extensions: set a default value that can be overwritten
2. Property extensions: define a getter and an optional setter function; the getter is only called when the attribute value is retrieved
3. Method extensions: make the extension attribute a callable method (you can pass one or more arguments to it)

**Important**: Attributes for Spans should almost always be handled with a getter function!

In [None]:
from spacy.tokens import Doc, Token, Span

# Set extensions on the global Doc, Token or Span classes via set_extension
Doc.set_extension("title", default=None)
Token.set_extension("is_color", default=False)
Span.set_extension("has_color", default=False)

# Change the attributes (the example text is only a placeholder)
doc = nlp("The sky is blue.")
token = doc[3]
doc._.title = "my document"
token._.is_color = True


# Attribute extension
# (force=True overwrites the "is_color" extension that was registered above)
Token.set_extension("is_color", default=False, force=True)

doc = nlp("Water is not green.")

# Overwrite attribute extension
doc[3]._.is_color = True


# Property extension
# Define getter function
def get_is_color(token):
    color = ["red", "yellow", "blue"]
    return token.text in color

# Set extension on the Token with getter
# (force=True replaces the attribute extension of the same name)
Token.set_extension("is_color", getter=get_is_color, force=True)

print(doc[3]._.is_color, "-", doc[3].text)


# Method extension
# Define a Method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc
Doc.set_extension("has_token", method=has_token)
print("Has blue: ", doc._.has_token("blue"))

## Scaling and performance
* Use the nlp.pipe method: it processes the texts as a stream, batches them internally and yields Doc objects
* This is much faster than calling nlp on each text
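
The idea of streaming and batching can be sketched with generators in plain Python. This is only an illustration of the principle, not SpaCy's code: `minibatch`, `pipe` and `fake_process_batch` are hypothetical helpers, and upper-casing stands in for the real processing.

```python
from itertools import islice


def minibatch(items, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def pipe(texts, process_batch, batch_size=2):
    """Process texts batch by batch and yield the results one at a time."""
    for batch in minibatch(texts, batch_size):
        yield from process_batch(batch)


batch_sizes = []


def fake_process_batch(batch):
    # Stand-in for the expensive model call, which is made once per batch
    batch_sizes.append(len(batch))
    return [text.upper() for text in batch]


results = list(pipe(["a", "b", "c", "d", "e"], fake_process_batch))
print(results)
print(batch_sizes)
```

Five texts lead to only three processing calls here, and because everything is a generator, the texts never have to be held in memory all at once.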

In [None]:
# Placeholder data, replace with your own list of texts
LOTS_OF_TEXTS = ["This is a text", "And another text"]

# BAD
docs = [nlp(text) for text in LOTS_OF_TEXTS]

# GOOD
docs = list(nlp.pipe(LOTS_OF_TEXTS))

With nlp.pipe you can also pass in a context for each text and get it back together with the Doc (useful for associating metadata with the Doc):

In [None]:
data = [
    ("This is a text", {"id": 1, "page_number":15}),
    ("And another text", {"id": 2, "page_number":16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_number"])

Another way to improve the processing speed is to skip unnecessary pipeline steps:
* We can use the "nlp.make_doc" method to only run the tokenizer
* We can disable specific pipeline components with "nlp.disable_pipes"

In [None]:
# only tokenizer
doc = nlp.make_doc("Hello world!")

# disable tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text and print the entities
    doc = nlp("Hello World!")
    print(doc.ents)

## Training and updating models
Why update the model?
* Better results on specific domains
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

How to train
1. **Initialize** the model weights randomly with nlp.begin_training
2. **Predict** a few examples with the current weights by calling nlp.update
3. **Compare** prediction with true labels
4. **Calculate** how to change weights to improve predictions
5. **Update** weights slightly
6. Go back to 2
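
The steps above can be sketched with a single weight in plain Python. This is only a toy illustration of the predict / compare / update cycle, not how SpaCy trains its neural models; the data (targets of the form y = 2x), the squared-error loss and the learning rate are assumptions chosen for the example.

```python
import random

random.seed(0)

# Toy examples where the "true label" is always 2 * x
examples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for step in range(200):
    x, target = random.choice(examples)
    # 2. Predict with the current weight
    prediction = weight * x
    # 3. Compare the prediction with the true label
    error = prediction - target
    # 4. Calculate how to change the weight (gradient of the squared error)
    gradient = 2 * error * x
    # 5. Update the weight slightly
    weight -= learning_rate * gradient
    # 6. The loop goes back to step 2

print(round(weight, 3))
```

After enough small updates the weight ends up close to 2, the value that makes the predictions match the labels.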

The steps of a training loop
1. **Loop** for a number of times
2. **Shuffle** the training data
3. **Divide** the data into batches
4. **Update** the model for each batch
5. **Save** the updated model

To label your own data quickly, you can use [brat](http://brat.nlplab.org/)

#### Example training loop

In [None]:
import random

TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]})
    # more examples...
]
# Initialize the model weights randomly
nlp.begin_training()

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model
# nlp.to_disk(path_to_model)
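
The entity annotations in the training data are character offsets into the text, with an exclusive end. Wrong offsets are an easy mistake to make, so it can help to check them before training. The following is a minimal sketch in plain Python; the helper `entity_texts` is a hypothetical name, not part of SpaCy.

```python
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("reddit is a website", {"entities": [(0, 6, "WEBSITE")]}),
]


def entity_texts(text, annotations):
    """Slice the text with each (start, end, label) to see what is annotated."""
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


checked = [entity_texts(text, annotations) for text, annotations in TRAINING_DATA]
for entities in checked:
    print(entities)
```

If a printed entity text is cut off or contains extra characters, the offsets of that example need to be corrected.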


### Updating an existing model
* Improve the predictions on new data
* Especially useful for improving existing categories like "PERSON"
* Also possible to add new categories
* Be careful, the model can forget the old categories!

#### Example: setting up a new pipeline from scratch

In [None]:
# start with blank English model
nlp = spacy.blank("en")

# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add a new label
ner.add_label("GADGET")

# Start training: initialize the model weights randomly
nlp.begin_training()

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

### Training Problems

#### Models can "forget" things
Don't update a model with examples of only one category, or it may unlearn the other categories:

In [None]:
# BAD
TRAINING_DATA = [
    ("reddit is a website", {"entities": [(0, 6, "WEBSITE")]})
]

# GOOD
TRAINING_DATA = [
    ("reddit is a website", {"entities": [(0, 6, "WEBSITE")]}),
    ("Obama is a person", {"entities": [(0, 5, "PERSON")]})
]

#### Models can't learn everything
* SpaCy's models make predictions based on local context
* A model can struggle to learn if the decision is difficult to make based on context
* Labels need to be consistent and not too specific
    * For example: "CLOTHING" is better than "ADULT_CLOTHING" and "CHILDRENS_CLOTHING"


## References
* https://course.spacy.io/en/
* https://www.youtube.com/watch?v=DBbBRwpneLs&feature=emb_logo