<center>    
    <h1 id='spacy-chapter-3' style='color:#7159c1; font-size:350%'>Spacy: Chapter 3</h1>
    <i style='font-size:125%'>Training a Neural Network Model</i>
</center>

> **Topics**

```
- 💻 Applications
- 🏷️ Training Named Entity (NER) Pipeline
- 💾 Generating Binary Corpus
- 📝 Training Config File (Single Source of Truth)
- 💪 Training Pipeline
- 🪈 Loading Pipeline
- 📦 Packaging Pipeline
- 🥇 Best Practices for Training Neural Network Models with Spacy
```

<h1 id='0-applications' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>💻 | Applications</h1>

Spacy allows us to train and update our own models on our own datasets, which makes more advanced NLP tasks possible, such as `Text Classification`, `Specific Named Entity Recognition (NER)`, and `Improvements in Tagger (Tag and Part-of-Speech [POS]) and Parser (Dependency Label and Syntactic Head) Pipelines`.

Normally, every model goes through the following six steps (list and Figure 1) during the training phase:

1. Initialize the model with random weights;
2. Predict a few examples with the current weights;
3. Compare the predicted results with the real labels;
4. Calculate the changes to improve the weights;
5. Slightly update and improve the weights;
6. Go back to step 2.

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./images/3-steps-to-update-our-own-model.png' alt='Diagram of Training Steps of Neural Network Models in Spacy' />
    <figcaption>Figure 1 - Diagram of Training Steps of Neural Network Models in Spacy. By <a href='https://course.spacy.io/en/chapter4'>Spacy - Advanced NLP with Spacy Course - Chapter 4</a>.</figcaption>
</figure>

The cycle is repeated until good values for weights are achieved.
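
The six steps can be sketched with a toy model that has a single weight. This is not Spacy's training code, only a minimal illustration in plain Python, assuming a squared-error loss and a hypothetical `train_one_weight` helper that learns `y = w * x`:

```python
import random

def train_one_weight(examples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x with plain gradient descent, mirroring the six steps."""
    rng = random.Random(seed)
    weight = rng.uniform(-1.0, 1.0)                  # 1. start from a random weight
    for _ in range(epochs):                          # 6. go back to step 2
        batch = rng.sample(examples, k=min(2, len(examples)))
        gradient = 0.0
        for x, y_true in batch:
            y_pred = weight * x                      # 2. predict a few examples
            error = y_pred - y_true                  # 3. compare with the real labels
            gradient += 2 * error * x                # 4. calculate the change (squared-error gradient)
        weight -= learning_rate * gradient / len(batch)  # 5. slightly update the weight
    return weight

examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]      # generated with w = 3
learned_weight = train_one_weight(examples)
print(round(learned_weight, 3))                      # close to 3.0
```

A real pipeline has many weights and a different loss, but the loop has the same shape: predict, compare, compute an update, apply it, repeat.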

<h1 id='1-training-named-entity-ner-pipeline' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🏷️ | Training Named Entity (NER) Pipeline</h1>

In this notebook, let's focus on training the Named Entity Recognition (NER) Pipeline, so the model learns a new entity type as well as new words.

The number of examples needed for training varies according to our goals, for instance:

- **Update an Existing Model** - `a few hundred to a few thousand examples`;
- **Train a New Category** - `a few thousand to a million examples`;
- **Spacy's English Model** - `about 2 million words`.

For now, let's train a model to be able to identify `iPhone X` as a `GADGET` entity.

In [1]:
# Preparing Dataset
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank('en')

document1 = nlp_blank('iPhone X is coming')
document1.ents = [Span(document1, 0, 2, label='GADGET')]

document2 = nlp_blank('I need a new phone. Any tips?')
document2.ents = []

documents = [document1, document2]

In [2]:
# Splitting Data into Train (50.00%) and Validation (50.00%)
import random

random.shuffle(documents)
threshold = len(documents) // 2
train_documents = documents[:threshold]
valid_documents = documents[threshold:]
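
With only two documents the 50/50 split above is enough. For larger datasets, a seeded helper keeps the split reproducible between runs. The sketch below is one possible way to do it; `split_documents`, `train_fraction` and `seed` are illustrative names, not part of Spacy:

```python
import random

def split_documents(documents, train_fraction=0.8, seed=42):
    """Shuffle a copy of the documents and split it into train/validation lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError('train_fraction must be between 0 and 1')
    shuffled = list(documents)                 # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # same seed, same order
    threshold = int(len(shuffled) * train_fraction)
    return shuffled[:threshold], shuffled[threshold:]

# Integers stand in for Doc objects here
train_part, valid_part = split_documents(range(10), train_fraction=0.8)
print(len(train_part), len(valid_part))        # 8 2
```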

<h1 id='2-generating-binary-corpus' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>💾 | Generating Binary Corpus</h1>

Before training the model, we must export both the training and validation datasets into binary files. Spacy handles this through the `DocBin` class.

- **DocBin** - `container that serializes a collection of Documents into a compact binary format. Since it stores the shared vocabulary only once, it is faster and smaller than Pickle`.

In [3]:
# Generating Binary Corpus
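from pathlib import Path
from spacy.tokens import DocBin

# Make sure the output folder exists before writing the binary files
Path('./datasets').mkdir(parents=True, exist_ok=True)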

train_binary_documents = DocBin(docs=train_documents)
train_binary_documents.to_disk('./datasets/train.spacy')

valid_binary_documents = DocBin(docs=valid_documents)
valid_binary_documents.to_disk('./datasets/valid.spacy')
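
To check that the annotations survive serialization, a `DocBin` can be read back into Documents. The sketch below does the round trip in memory with `to_bytes` and `from_bytes` instead of files, and uses its own small example so it runs independently of the cells above:

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank('en')
doc = nlp('iPhone X is coming')
doc.ents = [Span(doc, 0, 2, label='GADGET')]

# Serialize to bytes (to_disk writes this same payload to a file)
payload = DocBin(docs=[doc]).to_bytes()

# Restore: the Documents are rebuilt against a vocabulary
restored_docs = list(DocBin().from_bytes(payload).get_docs(nlp.vocab))
restored_entities = [(ent.text, ent.label_) for ent in restored_docs[0].ents]
print(restored_entities)   # [('iPhone X', 'GADGET')]
```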

Files in Spacy's binary format normally use the `.spacy` extension. Sometimes, however, our training and validation data already exist in another annotation format and must be converted to Spacy's binary format. To do so, we can use the following command:

```bash
python -m spacy convert ./datasets/train.gold.conll ./datasets
```

The command can convert files with the following extensions into Spacy's binary format: `['.conllu', '.conll', '.iob']`.
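
As background for the `.iob` format: the IOB scheme tags each token as the Beginning of an entity, Inside one, or Outside any entity. The sketch below is a simplified illustration of how such tags map to entity spans. It is not Spacy's converter, and `iob_to_spans` is a hypothetical helper:

```python
def iob_to_spans(tags):
    """Convert token-level IOB tags into (start, end, label) spans, end exclusive."""
    spans = []
    start, label = None, None
    for index, tag in enumerate(tags):
        # An open span continues only on an 'I-' tag carrying the same label
        continues = tag.startswith('I-') and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, index, label))
            start, label = None, None
        # 'B-' always opens a span; a stray 'I-' opens one too
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            start, label = index, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

tokens = ['iPhone', 'X', 'is', 'coming']
tags = ['B-GADGET', 'I-GADGET', 'O', 'O']
print(iob_to_spans(tags))   # [(0, 2, 'GADGET')]
```

The resulting span `(0, 2, 'GADGET')` matches the `Span(document1, 0, 2, label='GADGET')` created by hand earlier in this notebook.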

<h1 id='3-training-config-file-single-source-of-truth' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📝 | Training Config File (Single Source of Truth)</h1>

The Config File, normally named `config.cfg`, tells Spacy how the NLP object must be initialized, which Pipeline components must be added to the model, how each component's internal model is configured, how the training and validation data are loaded, and which hyperparameter values to use.

There is no need to write this whole file by hand: Spacy provides a command that generates a default file, which we can then adjust to our needs. The command is the following:

```bash
python -m spacy init config ./configs/config.cfg --lang en --pipeline tagger,parser,ner,lemmatizer
```

In [4]:
!python -m spacy init config ./configs/config.cfg --lang en --pipeline ner

[!] To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
configs\config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
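
The generated file is organized in sections such as `[paths]`, `[nlp]` and `[training]`. The sketch below reads an abridged, illustrative excerpt to show that structure. The excerpt is an assumption about what a generated file typically contains, so actual keys and values may differ between Spacy versions. Spacy has its own config loader; `configparser` is used here only as a dependency-free stand-in:

```python
import configparser
import json

# Abridged, illustrative excerpt; a generated config.cfg has many more settings
CONFIG_EXCERPT = """
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]

[corpora.train]
path = ${paths.train}

[training]
max_epochs = 0
"""

# interpolation=None keeps ${...} references as plain text
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_EXCERPT)

sections = parser.sections()
pipeline = json.loads(parser['nlp']['pipeline'])   # values are written in JSON syntax
print(sections)   # ['paths', 'nlp', 'corpora.train', 'training']
print(pipeline)   # ['tok2vec', 'ner']
```

The `${paths.train}` reference is why `--paths.train` and `--paths.dev` can be passed on the command line: the data paths live in one place and the rest of the file points to them.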


<h1 id='4-training-pipeline' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>💪 | Training Pipeline</h1>

After creating the binary files of Training and Validation Documents and setting the Config file, we are ready to train our model using the following command:

```bash
python -m spacy train ./configs/config.cfg --output ./output --paths.train ./datasets/train.spacy --paths.dev ./datasets/valid.spacy
```

In [5]:
!python -m spacy train ./configs/config.cfg --output ./output --paths.train ./datasets/train.spacy --paths.dev ./datasets/valid.spacy

[i] Saving to output directory: output
[i] Using CPU
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------


OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
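
The `ENTS_P`, `ENTS_R` and `ENTS_F` columns in the training table stand for the precision, recall and F-score of the predicted entities on the validation data. The sketch below shows how these three numbers relate, assuming an entity counts as correct only when its start, end and label all match. `entity_scores` is a hypothetical helper, not Spacy's scorer:

```python
def entity_scores(gold_spans, predicted_spans):
    """Precision, recall and F-score over exact (start, end, label) matches."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

gold = [(0, 2, 'GADGET'), (5, 6, 'GADGET')]
predicted = [(0, 2, 'GADGET'), (8, 9, 'GADGET')]   # one hit, one false alarm
precision, recall, f_score = entity_scores(gold, predicted)
print(precision, recall, f_score)   # 0.5 0.5 0.5
```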


<h1 id='5-loading-pipeline' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🪈 | Loading Pipeline</h1>

After training, Spacy stores both the last and the best Pipelines on disk, and we can load them the same way we load `en_core_web_lg`.

In [None]:
# Loading Pipeline
nlp_best_pipeline = spacy.load('./output/model-best')
nlp_last_pipeline = spacy.load('./output/model-last')

<h1 id='6-packaging-pipeline' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Packaging Pipeline</h1>

Finally, we can package a Pipeline as a Python Package so that it becomes installable. To do so, we first run the following command:

```bash
python -m spacy package ./output/model-best ./packages --name my_pipeline --version 1.0.0
```

Then install the generated package in any project that needs it:

```bash
pip install ./packages/en_my_pipeline-1.0.0/dist/en_my_pipeline-1.0.0.tar.gz
```

And then load the Pipeline into our project:

```python
import spacy

nlp = spacy.load('en_my_pipeline')
```

Note that Spacy automatically prefixes the package's name with the language the Pipeline was trained on.
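
The naming convention can be written down as a small helper. This is only an illustration of the pattern described above; `packaged_pipeline_name` is a hypothetical function, not a Spacy API:

```python
def packaged_pipeline_name(lang, name, version):
    """Return the importable package name and its versioned folder name."""
    package_name = f'{lang}_{name}'
    return package_name, f'{package_name}-{version}'

package_name, folder_name = packaged_pipeline_name('en', 'my_pipeline', '1.0.0')
print(package_name, folder_name)   # en_my_pipeline en_my_pipeline-1.0.0
```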

<h1 id='7-best-practices-for-training-neural-network-models-with-spacy' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🥇 | Best Practices for Training Neural Network Models with Spacy</h1>

- **Catastrophic Forgetting Problem**

Description: when we update the model with many examples of one specific label, such as 'CARS', the model can `'unlearn'` how to predict labels it previously handled well, such as 'PERSON'.

Solution: always mix in examples of what the model previously got correct. For example, when training the model with 'CARS' labels, also include examples of 'PERSON' labels.
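
One possible way to build such a mixed training set is sketched below. The helper name, the `previous_ratio` parameter and the `(text, label)` tuples are illustrative assumptions. In practice the items would be annotated Documents:

```python
import random

def mix_training_examples(new_examples, previous_examples, previous_ratio=1.0, seed=0):
    """Combine new-label examples with a sample of examples the model already handled well."""
    rng = random.Random(seed)
    wanted = min(len(previous_examples), int(len(new_examples) * previous_ratio))
    mixed = list(new_examples) + rng.sample(list(previous_examples), wanted)
    rng.shuffle(mixed)   # avoid feeding all new-label examples in one block
    return mixed

cars = [('I drive a Civic', 'CARS'), ('The Mustang is fast', 'CARS')]
people = [('Ada wrote code', 'PERSON'), ('Alan broke ciphers', 'PERSON'), ('Grace found a bug', 'PERSON')]
mixed = mix_training_examples(cars, people)
print(len(mixed))   # 4: two 'CARS' examples plus two sampled 'PERSON' examples
```

With `previous_ratio=1.0` the set contains as many previously correct examples as new ones, capped by how many are available. The right ratio depends on the dataset and is worth tuning against the validation scores.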

- **Models Can't Learn Everything**

Description: models can only make predictions based on Local Context, that is, the words surrounding each token. If the information needed to decide on a label is not present in that context, the model may not learn the pattern as expected.

Solution: prefer general labels over very specific ones, for example a single 'CLOTHING' label rather than separate 'ADULT_CLOTHING' and 'CHILDREN_CLOTHING' labels. We can then add a `Rule-Based System` (Matcher or PhraseMatcher) to go from the generic label to the specific ones.
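
A minimal sketch of that second step, using `PhraseMatcher` on a blank Pipeline. The 'CLOTHING' entities are set by hand as a stand-in for a trained model's predictions, and the rule itself (a qualifier word directly before the entity) is only an assumption for illustration:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank('en')
doc = nlp('I bought kids shoes and a jacket')
# Stand-in for a trained model's output: two generic CLOTHING entities
doc.ents = [Span(doc, 3, 4, label='CLOTHING'), Span(doc, 6, 7, label='CLOTHING')]

# Rule: a qualifier phrase directly before the entity decides the specific label
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
matcher.add('CHILDREN_CLOTHING', [nlp.make_doc('kids'), nlp.make_doc('children')])
qualifier_ends = {end: nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)}

# Keep the generic label when no rule applies
specific_labels = [(ent.text, qualifier_ends.get(ent.start, ent.label_)) for ent in doc.ents]
print(specific_labels)   # [('shoes', 'CHILDREN_CLOTHING'), ('jacket', 'CLOTHING')]
```

The model only has to learn the easier, generic distinction, while the hand-written rule handles the detail that depends on specific words.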

---

Observations: when dealing with NER, consider using `Doccano` for small and medium projects, and `INCEpTION` for large projects. Both are annotation tools that help us speed up the labeling of data for the Specific Named Entity Recognition (NER) process.

- <a href='https://github.com/doccano/doccano'>Doccano</a>

- <a href='https://github.com/inception-project/inception'>INCEpTION</a>

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).