This is a continuation of the [part 1 notebook](https://github.com/JpChii/ML-Projects/blob/main/Spacy.ipynb).

# Processing Pipelines

Learn everything you need about spaCy's processing pipeline: what goes on under the hood when a text is processed, how to write custom components and add them to the pipeline, and how to use custom attributes to add your own metadata to documents, spans, and tokens.

![alt text](https://course.spacy.io/pipeline.png "What happens when nlp is called?")

**Built-in pipeline components**

|Name|Description|Creates|
|----|-----------|-------|
| tagger | Part-of-speech tagger | Token.tag, Token.pos |
| parser | Dependency parser | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| ner | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type |
| textcat | Text classifier | Doc.cats |

**Under the hood**
![alt text](https://course.spacy.io/package_meta.png "What happens when nlp is called?")

* The pipeline is defined, in order, in the model's `config.cfg`
* Built-in components need binary data to make predictions
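
As a rough mental model, calling `nlp` on a text first tokenizes it and then passes the resulting document through each component in the configured order. The sketch below imitates that flow in plain Python. `ToyDoc`, `toy_tokenizer`, and the two toy components are hypothetical stand-ins, not spaCy classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a spaCy Doc: tokens plus annotations."""
    tokens: list
    annotations: dict = field(default_factory=dict)


def toy_tokenizer(text):
    # Always runs first and turns the raw string into a document
    return ToyDoc(tokens=text.split())


def toy_tagger(doc):
    # A component receives the doc, adds annotations, and returns it
    doc.annotations["is_title"] = [tok.istitle() for tok in doc.tokens]
    return doc


def toy_counter(doc):
    doc.annotations["n_tokens"] = len(doc.tokens)
    return doc


# Ordered (name, component) pairs, similar in spirit to nlp.pipeline
toy_pipeline = [("toy_tagger", toy_tagger), ("toy_counter", toy_counter)]


def toy_nlp(text):
    doc = toy_tokenizer(text)
    for _name, component in toy_pipeline:
        doc = component(doc)
    return doc


doc = toy_nlp("Hello big World")
print(doc.annotations)
```

Because each component only sees what earlier components have written, the order of the pipeline matters.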

### 1.1 Inspecting the pipeline


In [1]:
import spacy
spacy.cli.download("en_core_web_sm")

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

print(nlp.pipeline)
print(nlp.pipe_names)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7faa5e7bf6e0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7faa5e7bf600>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7faa5eb2a450>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7faa5ebfc960>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7faa5ebf8780>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7faa5eadfd50>)]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


## 2. Custom Pipeline components

**Why custom components?**

* Make a function execute automatically when calling nlp
* Add own metadata to documents and tokens
* Updating built-in attributes like `doc.ents`

### 2.1 Simple component

In [3]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("Tst in")

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
This document is 2 tokens long.


### 2.2 Complex components

Write a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to `doc.ents`.

In [4]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))

print("animal_patterns", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    
    return doc

# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


Yay!! Written a custom pipeline component for rule-based entity matching.

## 3. Extension attributes

* Add custom metadata to documents, tokens and spans
* Accessible via the ._ property

**Extension attributes types**

1. Attribute extensions
2. Property extensions
3. Method extensions
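
The three types differ mainly in what accessing the name on `._` gives back. The plain-Python sketch below imitates that difference without spaCy. `ToySpan` and the `stored` dictionary are hypothetical stand-ins, and the use of `functools.partial` only illustrates a callable that is already bound to its object.

```python
import functools


class ToySpan:
    """Hypothetical stand-in for a spaCy Span."""

    def __init__(self, text):
        self.text = text


def get_upper(span):
    return span.text.upper()


def to_html(span, tag):
    return f"<{tag}>{span.text}</{tag}>"


span = ToySpan("Hello world")

# 1. Attribute extension: a stored value with a default that can be overwritten
stored = {"is_greeting": False}
stored["is_greeting"] = True

# 2. Property extension: the getter runs on access and returns a value
upper_value = get_upper(span)

# 3. Method extension: access returns a callable already bound to the object
bound_method = functools.partial(to_html, span)
html_value = bound_method("strong")

print(stored["is_greeting"], upper_value, html_value)
```

One practical consequence: a function registered as a method has to be called with parentheses, while a function registered as a getter is read like a plain attribute.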

### 3.1 Setting extension attributes (1)

In [5]:
# Attribute extension
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False, force=True)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [6]:
# Getter function extension
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Defining the getter function
def get_reversed(token):
    return token.text[::-1]

Token.set_extension("reversed", getter=get_reversed)

doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print(f"reversed: {token._.reversed}")

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


### 3.2 Setting extension attributes (2)

Set an extension function for the `Doc`

In [7]:
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)

# Register the function as a getter so that doc._.has_number returns the computed value
Doc.set_extension("has_number", getter=get_has_number, force=True)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print(f"has_number: {doc._.has_number}")

has_number: True


In [8]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"


# Register the Span method extension "to_html" with the method to_html
Span.set_extension("to_html", method=to_html, force=True)

# Process the text and call the to_html method on the span with the tag name "strong"
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello world</strong>


### 3.3 Entities and extensions

Get a Wikipedia search URL for a span if its label is in a given list of labels.

In [9]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_wikipedia_url(span):
    if span.label_ in ("PERSON", "ORG", "GPE", "LOCATION"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

Span.set_extension("wiki_url", getter=get_wikipedia_url, force=True)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)

for ent in doc.ents:
    print(ent.text, ent._.wiki_url)

over fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


Yay!! Created a getter extension that uses the named entities predicted by the model to generate Wikipedia search URLs and exposes them as a custom attribute.

In [10]:
nlp=spacy.blank("en")
nlp.pipe_names

[]

In [11]:
import json
import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

with open("exercises/en/capitals.json", encoding="utf8") as f:
    CAPITALS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))


@Language.component("countries_component")
def countries_component_function(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe("countries_component", first=True)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

FileNotFoundError: [Errno 2] No such file or directory: 'exercises/en/countries.json'

This is an example of how to add structured data to a spaCy pipeline.

## 4. Scaling and Performance

**Processing large volumes of text**

* Use `nlp.pipe` method
* Processes texts as a stream and yields `Doc` objects
* Much faster than calling `nlp` on each text

**Passing in context (1)**

* Setting `as_tuples=True` on `nlp.pipe` lets you pass in `(text, context)` tuples
* Yields `(doc, context)` tuples
* Useful for associating metadata with `doc`
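
The idea of streaming texts in batches while carrying a context object along can be sketched with a plain generator. This is only an illustration of the pattern and not how spaCy implements `nlp.pipe`. `toy_process`, `toy_pipe`, and the `page_number` key are hypothetical.

```python
from itertools import islice


def toy_process(text):
    # Hypothetical stand-in for the work done on each text
    return text.lower().split()


def toy_pipe(pairs, batch_size=2):
    """Lazily yield (result, context) for each (text, context) pair."""
    iterator = iter(pairs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # Work on one batch at a time instead of loading everything at once
        for text, context in batch:
            yield toy_process(text), context


data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
    ("One more", {"id": 3, "page_number": 17}),
]

results = list(toy_pipe(data))
for tokens, context in results:
    print(tokens, context["page_number"])
```

The context is never inspected by the processing step. It simply travels next to each text so it can be attached to the result afterwards.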

### 4.1 Processing streams

In [None]:
import json
import spacy

with open("data/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

In [None]:
# Getting all the tweet contents alone
tweets = [tweet["content"] for tweet in TEXTS]
tweets[0]

In [None]:
# Process the texts and print the adjectives
for doc in nlp.pipe(tweets[:20]):
    print([token.text for token in doc if token.pos_ == "ADJ"])

In [None]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("exercises/en/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the entities
# docs = [nlp(text) for text in TEXTS]  # Bad performance
docs = list(nlp.pipe(TEXTS))  # Good performance
entities = [doc.ents for doc in docs]
print(*entities)

### 4.3 Preprocessing with context

In this exercise, you'll use custom attributes to add author and book meta information to quotes.

A list of `[text, context]` examples is available as the variable `DATA`. The texts are quotes from famous books, and the contexts are dictionaries with the keys `"author"` and `"book"`.

* Use the `set_extension` method to register the custom attributes `"author"` and `"book"` on the `Doc`, which default to `None`.
* Process the `[text, context]` pairs in `DATA` using `nlp.pipe` with `as_tuples=True`.
* Overwrite `doc._.book` and `doc._.author` with the respective info passed in as the context.

In [None]:
import json
import spacy
from spacy.tokens import Doc

with open("exercises/en/bookquotes.json", encoding="utf8") as f:
    DATA = json.loads(f.read())

nlp = spacy.blank("en")

# Register the Doc extension "author" (default None)
Doc.set_extension("author", default=None)

# Register the Doc extension "book" (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and custom attribute data
    print(f"{doc.text}\n — '{doc._.book}' by {doc._.author}\n")

### 4.4 Selective processing

In this exercise, you'll use the `nlp.make_doc` and `nlp.select_pipes` methods to run only selected components when processing a text.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Perform only tokenization
doc = nlp.make_doc(text)
print([token.text for token in doc])

In [None]:
nlp.pipe_names

In [None]:
# Run only the ner component; the tagger, lemmatizer and all other components are disabled
with nlp.select_pipes(enable="ner"):
    doc = nlp(doc)
    print(doc.ents)

#### Summary

1. What happens behind an `nlp` pipeline call
2. Inspecting the pipeline components using `nlp.pipe_names` and `nlp.pipeline`
3. Using a custom `PhraseMatcher` component to match structured data and assign the matched spans to entities, and adding custom components to an existing pipeline
4. Extension attributes, properties (getters), and methods
5. Components with extensions

# Training a neural network model

Learn how to update spaCy's statistical models to customize them for your use case, for example to predict a new entity type in online comments. Train your own model from scratch and understand the basics of how training works.

## Training and updating models

**Why update the model?**

* Better results on specific domain
* Learn classification schemes specifically for a problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

**How training works (1)**

1. Initialize the model weights randomly
2. Predict a few examples with the current weights
3. Compare predictions with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to step 2
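
The steps above can be sketched on a toy problem in plain Python. This is not spaCy's training code: the model here is a single line `w * x + b`, the data and learning rate are made up for illustration, and the loss is mean squared error.

```python
import random

random.seed(0)

# Toy training data: inputs x with true labels y = 2x + 1
examples = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]

# Initialize the weights randomly
w, b = random.uniform(-1, 1), random.uniform(-1, 1)
learn_rate = 0.05


def mean_squared_error(w, b):
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_squared_error(w, b)

for epoch in range(200):
    random.shuffle(examples)
    for i in range(0, len(examples), 2):
        # Predict a few examples with the current weights
        batch = examples[i : i + 2]
        grad_w = grad_b = 0.0
        for x, y in batch:
            # Compare the prediction with the true label
            error = (w * x + b) - y
            # Accumulate the gradient: how to change the weights
            grad_w += 2 * error * x / len(batch)
            grad_b += 2 * error / len(batch)
        # Update the weights slightly, then continue with the next batch
        w -= learn_rate * grad_w
        b -= learn_rate * grad_b

final_loss = mean_squared_error(w, b)
print(f"loss: {initial_loss:.3f} -> {final_loss:.6f}, w={w:.2f}, b={b:.2f}")
```

The learning rate controls how "slightly" the weights move on each update. Too large a value can make the loss grow instead of shrink.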

**How training works (2)**
![alt text](https://course.spacy.io/training.png)

* Training data: Examples and their annotations
* Text: The input text the model should predict a label for
* Label: The label the model should predict
* Gradient: How to change the weights

**Example: Training the entity recognizer**

* The entity recognizer tags words and phrases in context
* Each token can only be part of one entity
* Examples need to come with context
* Texts with no entities are also important
* Goal: Teach the model to generalize
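
Because each token can only belong to one entity, overlapping candidate spans have to be resolved before they are used as annotations. A minimal sketch of one possible rule (prefer the longest span, then the earliest) is shown below on plain `(start, end, label)` token offsets. The function name and the `FOOD` label are hypothetical. spaCy ships a helper for `Span` objects, `spacy.util.filter_spans`, which is worth checking before writing your own.

```python
def filter_overlapping(spans):
    """Keep the longest spans so that no token index is used twice.

    Each span is a (start, end, label) tuple of token offsets, end exclusive.
    """
    # Prefer longer spans; break ties by earlier start
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, seen_tokens = [], set()
    for start, end, label in ordered:
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end, label))
            seen_tokens.update(range(start, end))
    return sorted(kept)


# Tokens: ["I", "like", "New", "York", "City", "pizza"]
candidates = [(2, 4, "GPE"), (2, 5, "GPE"), (5, 6, "FOOD")]
filtered = filter_overlapping(candidates)
print(filtered)
```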

**The training data**

* Examples of what we want the model to predict in context
* Update an *existing model*: a few hundred to a few thousand examples
* Train a *new category*: a few thousand to a million examples
    * spaCy's English models: 2 million words
* Usually created manually by human annotators
* Can be semi-automated, for example using spaCy's `Matcher`
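
One common way to write such examples down is a text paired with character offsets and a label for each entity. The sketch below builds that representation with a plain string search as a stand-in for the `Matcher`. The `annotate` helper and the `(text, entities)` layout are assumptions made for illustration, not a format required by spaCy.

```python
def annotate(text, phrases, label):
    """Return (text, entities) with character offsets for each phrase found."""
    entities = []
    for phrase in phrases:
        start = text.find(phrase)
        while start != -1:
            entities.append((start, start + len(phrase), label))
            start = text.find(phrase, start + len(phrase))
    return text, sorted(entities)


texts = ["iPhone X is coming", "I need a new phone! Any tips?"]
training_data = [annotate(text, ["iPhone X"], "GADGET") for text in texts]

for text, entities in training_data:
    print(text, entities)
```

The second text yields an empty entity list. Keeping such examples tells the model that not every mention of a phone is a `GADGET`.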

In [12]:
!cat data/iphone.json

[
  "How to preorder the iPhone X",
  "iPhone X is coming",
  "Should I pay $1,000 for the iPhone X?",
  "The iPhone 8 reviews are here",
  "iPhone 11 vs iPhone 8: What's the difference?",
  "I need a new phone! Any tips?"
]


## 1. Training and evaluation data

### 1.1 Creating Training data (1)

spaCy's rule-based `Matcher` is a great way to quickly create training data for named entity models. A list of sentences is available as the variable `TEXTS`. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as `"GADGET"`.

* Write a pattern for two tokens whose lowercase forms match `iphone` and `x`.
* Write a pattern for two tokens: one token whose lowercase form matches `iphone`, followed by a digit.

In [15]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

with open("data/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())
    
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

matcher.add("GADGET", [pattern1, pattern2])
docs = []

for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)

[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]


Nice! Let's use those patterns to quickly bootstrap some training data for our model.

### 1.2 Create training data (2)

After creating the data for our corpus, we need to save it out to a `.spacy` file. The code from the previous example is already available.

* Instantiate the `DocBin` with the list of `docs`.
* Save the `DocBin` to a file called `train.spacy`

In [16]:
docs

[How to preorder the iPhone X,
 iPhone X is coming,
 Should I pay $1,000 for the iPhone X?,
 The iPhone 8 reviews are here,
 iPhone 11 vs iPhone 8: What's the difference?,
 I need a new phone! Any tips?]

In [26]:
from spacy.tokens import DocBin
doc_bin = DocBin(docs=docs)
doc_bin.to_disk("data/train.spacy")

Well well well! The training data is now ready, so let's move on to configuring the training.

## 2. Configuring and running the training

**The training config (1)**

* *Single source of truth* for all settings
* Typically called `config.cfg`
* Defines how to initialize the `nlp` object
* Includes all settings about the pipeline components and their model implementations
* Configures the training process and hyperparameters
* Makes training more reproducible
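
To get a feel for the file layout, the sketch below reads a small fragment in the style of a generated `config.cfg` using only the standard library. It assumes that values follow JSON syntax, and it ignores features such as `${...}` variable references and nested sections. spaCy uses its own config loader, so treat this purely as an illustration of the structure.

```python
import configparser
import json

# Fragment in the style of a generated config; values follow JSON syntax
CONFIG_TEXT = """
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

# Turn every section into a dict of parsed Python values
config = {
    section: {key: json.loads(value) for key, value in parser[section].items()}
    for section in parser.sections()
}

print(config["nlp"]["pipeline"])
print(config["paths"]["train"])
```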

### 2.1 Generating a config file

The `init config` command auto-generates a config file for training with the default settings. We want to train a named entity recognizer, so let's generate a config file for one pipeline component, `ner`.

In [27]:
!python -m spacy init config ./config.cfg --lang en --pipeline ner

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [28]:
!cat ./config.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architec

### 2.2 Using the training CLI

Let's use the config file generated in the previous exercise and the training corpus we've created to train the named entity recognizer!

The `train` command lets us train a model from a training config file. The file `config_gadget.cfg` contains the training config.
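
Arguments such as `--paths.train` and `--paths.dev` override the matching entries in the `[paths]` section of the config, so the same file can be reused with different data. The sketch below shows the idea on a plain nested dictionary. `apply_overrides` is a hypothetical helper for illustration, not part of spaCy.

```python
def apply_overrides(config, overrides):
    """Return a copy of a nested config dict with dotted-key overrides applied."""
    result = {section: dict(values) for section, values in config.items()}
    for dotted_key, value in overrides.items():
        section, key = dotted_key.split(".", 1)
        result.setdefault(section, {})[key] = value
    return result


base_config = {"paths": {"train": None, "dev": None}, "nlp": {"lang": "en"}}
overrides = {
    "paths.train": "data/train_gadget.spacy",
    "paths.dev": "data/dev_gadget.spacy",
}

final_config = apply_overrides(base_config, overrides)
print(final_config["paths"])
```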

In [29]:
!curl https://raw.githubusercontent.com/explosion/spacy-course/master/exercises/en/config_gadget.cfg --output config_gadget.cfg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2730  100  2730    0     0   5645      0 --:--:-- --:--:-- --:--:--  5783


In [30]:
!python -m spacy train config_gadget.cfg --output ./output --paths.train data/train_gadget.spacy --paths.dev data/dev_gadget.spacy

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU
[2022-04-16 13:06:25,759] [INFO] Set up nlp object from config
[2022-04-16 13:06:25,776] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-04-16 13:06:25,789] [INFO] Created vocabulary
[2022-04-16 13:06:25,793] [INFO] Finished initializing nlp object
[2022-04-16 13:06:27,069] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     20.33    1.69    1.04    4.44    0.02
  d_xhat = N * dY - sum_dy - dist * var ** (-1.0) * sum_dy_dist
  1     200         29.20    986.03   76.92   76.09   77.78    0.77
  2     400         74.01    246.38   81.28   78.35   84.44    0.81
  4     600   
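
The `ENTS_P`, `ENTS_R`, and `ENTS_F` columns in the log above report precision, recall, and F-score for the predicted entities. As a quick sanity check, the F-score can be recomputed as the harmonic mean of the other two. The small differences come from the logged values being rounded to two decimals.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (ENTS_P, ENTS_R, ENTS_F) taken from the rows logged above
logged_rows = [(1.04, 4.44, 1.69), (76.09, 77.78, 76.92), (78.35, 84.44, 81.28)]

for p, r, logged_f in logged_rows:
    print(f"P={p} R={r} -> F={f_score(p, r):.2f} (logged {logged_f})")
```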

In [31]:
!ls output/

model-best  model-last
