<center><h1>𝓬𝓾𝓼𝓽𝓸𝓶𝓲𝔃𝓲𝓷𝓰 𝓼𝓹𝓪𝓒𝔂 𝓶𝓸𝓭𝓮𝓵</h1></center>
    
[**Code**](https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter07)

In this chapter, we are going to learn how to train, store, and use custom statistical pipeline components. You will also learn how to make the best use of Prodigy, the annotation tool. We are going to discuss these topics:

* [**Getting started with data preparation**](#Data-Preparation)
* [**Annotating and preparing data**](#Annotating-and-preparing-data)
* Updating an existing pipeline component
* [**Training a pipeline component from scratch**](#Training-a-pipeline-component-from-scratch)




## Data Preparation

In the previous chapter, we looked at statistical models such as the NER and the POS tagger. In this chapter, we customize these models for a particular domain. Why do we need to customize them? Sometimes we work on very specific domains that spaCy models didn't see during training.

For example, Twitter data contains lots of hashtags, is often made up of phrases rather than full sentences, and includes many non-standard words. spaCy's models did not see text like this during training, because they are trained on grammatical English sentences. Another example is the medical domain. The medical domain contains many entities, such as drug, disease, and chemical compound names. These entities are not expected to be recognized by spaCy's NER model because it has no disease or drug entity labels. NER does not know anything about the medical domain at all.

Training your custom models requires time and effort. Before even starting the training process, you should decide whether the training is really necessary. To determine whether you really need custom training, you will need to ask yourself the following questions:

* Do spaCy models perform well enough on your data?
* Does your domain include many labels that are absent in spaCy models?
* Is there a pre-trained model/application in GitHub or elsewhere already?


**Do spaCy models perform well enough on your data?**

If the model performs well enough on your data (above 0.75 accuracy), you can customize the model output by means of another spaCy component. For
example, let's say we work on the navigation domain and we have utterances
such as the following:

```
navigate to my home
navigate to Oxford Street
```

```Python
import spacy
import en_core_web_md

nlp = en_core_web_md.load()

doc = nlp("navigate to my home")
print(doc.ents)  # ()

doc2 = nlp("navigate to Oxford Street")
print(doc2.ents)  # (Oxford Street,)

print(doc2.ents[0].label_)  # ORG

print(spacy.explain('ORG'))  # Companies, agencies, institutions, etc.
```

Here we want `home` and `Oxford Street` to be recognized as `GPE`, a location. Instead, `home` is not recognized at all and `Oxford Street` is labeled as an organization. In this case, we can train
NER further to recognize street names as GPE, as well as
some location words, such as work, home, and my mama's house.
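When the pretrained NER is mostly right and only a handful of phrases are off, a rule-based component is one candidate for the "another spaCy component" mentioned above. The following is a minimal sketch, assuming spaCy v3's built-in `entity_ruler` component and two hand-written, hypothetical patterns. It uses a blank English pipeline so that it runs without downloading a model; with `en_core_web_md` you would typically add the ruler with `before="ner"` so that its matches take priority over the statistical predictions.

```python
import spacy

# Blank pipeline: tokenizer only, no statistical components
nlp = spacy.blank("en")

# Rule-based entity component; the patterns below are illustrative assumptions
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "home"},
    {"label": "GPE", "pattern": [{"LOWER": "oxford"}, {"LOWER": "street"}]},
])

for text in ["navigate to my home", "navigate to Oxford Street"]:
    doc = nlp(text)
    print([(ent.text, ent.label_) for ent in doc.ents])
```

Rules like these only cover phrases you list explicitly, which is why further training is the better fit once the set of location words grows.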

**Does your domain include many labels that are absent in spaCy models?** 

For instance, in the preceding newspaper example, only one entity label,
vehicle, is missing from the labels of spaCy's NER model. Other entity types
are recognized. In this case, you don't need custom training.

Consider the medical domain again. The entities are diseases, symptoms,
drugs, dosages, chemical compound names, and so on. This is a specialized
and long list of entities. Obviously, for the medical domain, you require
custom model training.
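One way to make this question concrete is to compare the labels your domain needs with the labels the pretrained NER already has. The sketch below hardcodes the 18 OntoNotes-style labels of spaCy's English models (in a live session you could read them from `nlp.get_pipe("ner").labels` instead). The two domain label sets and the helper `missing_labels` are hypothetical, chosen only to mirror the newspaper and medical examples.

```python
# Entity labels of spaCy's pretrained English NER, hardcoded for this sketch
SPACY_NER_LABELS = {
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
}


def missing_labels(domain_labels):
    """Return the domain labels that the pretrained NER has no label for."""
    return sorted(set(domain_labels) - SPACY_NER_LABELS)


# Hypothetical label wish lists for two domains
newspaper_labels = {"PERSON", "ORG", "GPE", "DATE", "VEHICLE"}
medical_labels = {"DISEASE", "SYMPTOM", "DRUG", "DOSAGE", "CHEMICAL"}

print(missing_labels(newspaper_labels))  # ['VEHICLE']
print(missing_labels(medical_labels))    # ['CHEMICAL', 'DISEASE', 'DOSAGE', 'DRUG', 'SYMPTOM']
```

A single missing label can often be handled with rules, while a long list of missing labels points toward custom training.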

If we need custom model training, we usually follow these steps:
1. Collect your data.
2. Annotate your data.
3. Decide to update an existing model or train a model from scratch.


## Annotating and preparing data


Before training the model, we need to collect data from various sources. After collecting the data, we need to annotate it. spaCy's training code works with the `JSON` file format.

Example of annotated data: 
```Python 
annotations = {
    "sentence": "I visited JFK Airport.",
    "entities": {
        "label": "LOC",
        "value": "JFK Airport"
    }
}
```

Writing down JSON files manually can be error-prone and time-consuming.
Hence, in this section, we'll also see spaCy's annotation tool, Prodigy, along
with an open source data annotation tool, Brat. Prodigy is not open source or
free, but we will go over how it works to give you a better view of how
annotation tools work in general. Brat is open source and immediately
available for your use.

### Annotating Data with Brat 

Another annotation tool is Brat, which is a free and web-based tool for [**text
annotation**](https://brat.nlplab.org/introduction.html). After the annotation session is finished, Brat dumps
a JSON of annotated data as well.

### spaCy Training Data Format 

As we remarked earlier, spaCy's training code works with the JSON file format.
Let's see the details of the training data format.

For the NER, you need to provide a list of pairs of sentences and their
annotations. Each annotation should include the entity type, the start position
of the entity in terms of characters, and the end position of the entity in terms
of characters. Let's see an example of a dataset:

```Python 
training_data = [
    ("I will visit you in Munich.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Victoria's house.", {"entities": [(13, 23, "PERSON"),(24, 29, "GPE")]}), 
    ("I go there.", {"entities": []})
] 

```
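Counting character offsets by hand is an easy place to make mistakes, so it can help to derive them from the entity text. The following sketch uses only the standard library. The helper name `to_spacy_format` is hypothetical, and it assumes each entity value occurs in the sentence and that the first occurrence is the one you mean.

```python
def to_spacy_format(sentence, entities):
    """Convert (value, label) pairs into a (text, {"entities": [...]}) tuple.

    Assumes every value occurs in the sentence; only the first occurrence is used.
    """
    spans = []
    for value, label in entities:
        start = sentence.find(value)
        if start == -1:
            raise ValueError(f"{value!r} not found in {sentence!r}")
        spans.append((start, start + len(value), label))
    return (sentence, {"entities": spans})


record = to_spacy_format("I visited JFK Airport.", [("JFK Airport", "LOC")])
print(record)

# Sanity check: slicing the text with each offset pair gives back the entity text
text, annotations = record
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

The same slicing check is a quick way to validate offsets that came from an annotation tool before training on them.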

We cannot feed the raw text and annotations directly to spaCy. Instead, we
need to create an Example object for each training example. Let's see the
code:

```Python 
from spacy.training import Example

doc = nlp("I will visit you in Munich.") 

annotations = {'entities': [ (20, 26, 'GPE') ]} 
example_sent = Example.from_dict(doc, annotations)
```
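The same conversion can be applied to a whole dataset in a loop. This is a minimal sketch, assuming spaCy v3 and a blank English pipeline (so no model download is needed); it reuses the example sentences from the dataset above and prints the gold entities stored in each `Example`.

```python
import spacy
from spacy.training import Example

# A blank pipeline is enough here: only the tokenizer is needed
nlp = spacy.blank("en")

training_data = [
    ("I will visit you in Munich.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Victoria's house.", {"entities": [(13, 23, "PERSON"), (24, 29, "GPE")]}),
    ("I go there.", {"entities": []}),
]

examples = []
for text, annotations in training_data:
    doc = nlp.make_doc(text)  # tokenize without running other components
    examples.append(Example.from_dict(doc, annotations))

for example in examples:
    # example.reference holds the gold-standard annotations
    print([(ent.text, ent.label_) for ent in example.reference.ents])
```

Using `nlp.make_doc` keeps the input free of predictions, so the `Example` pairs an unannotated document with its gold annotations.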

**Training the spaCy model (procedure)** 
1. First, we'll disable all the other statistical pipeline components, including the POS tagger and the dependency parser.
2. We'll feed our domain examples to the training procedure.
3. We'll evaluate the new NER model.

Also, we will learn how to do the following:

1. Save the updated NER model to disk.
2. Read the updated NER model when we want to use it.
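For the evaluation step in the procedure above, a common choice is exact-match precision, recall, and F1 over entity spans. The sketch below is a plain-Python illustration of that idea, not spaCy's own scorer (spaCy v3 provides `nlp.evaluate` for scoring a list of `Example` objects). The function name `ner_scores` and the "after training" prediction are hypothetical.

```python
def ner_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "navigate to Oxford Street": the gold annotation marks characters 12-25 as GPE
gold = [(12, 25, "GPE")]
before_training = [(12, 25, "ORG")]  # what the pretrained model predicted earlier
after_training = [(12, 25, "GPE")]   # hypothetical prediction after the update

print(ner_scores(gold, before_training))  # (0.0, 0.0, 0.0)
print(ner_scores(gold, after_training))   # (1.0, 1.0, 1.0)
```

A span with the right boundaries but the wrong label counts as an error here, which matches the `ORG` versus `GPE` problem seen earlier.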

#### Let's Train our Model

```Python
# The first step is to disable the other statistical pipeline components

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']  # Pipes: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']
nlp.disable_pipes(*other_pipes)

# Another way of writing this code is as follows:

# other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
# with nlp.disable_pipes(*other_pipes):
#     training code goes here
```

spaCy's statistical components are neural network models, and training a neural network means adjusting its weights step by step. We will run multiple epochs because
showing an example only once is not enough. At each iteration, we shuffle the
training data so that the order of the training data does not matter. This
shuffling of training data helps train the neural network thoroughly. We will use **SGD** (stochastic gradient descent) as the optimizer. SGD starts from a random point on the loss function and travels
down its slope in steps until it reaches the lowest point of that function.
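To make the idea of epochs, shuffling, and stepping down the slope concrete, here is a toy sketch in plain Python. It is not spaCy's optimizer: the model has a single weight `w`, the loss for one example `x` is assumed to be `(w - x)**2`, and the data, learning rate, and epoch count are arbitrary illustrative choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data: with the loss (w - x)^2 per example, the best w is the mean, 3.0
data = [2.0, 3.0, 4.0]

w = random.uniform(-10, 10)  # start from a random point on the loss function
learning_rate = 0.01

for epoch in range(200):
    random.shuffle(data)  # reshuffle so the order of examples does not matter
    for x in data:
        gradient = 2 * (w - x)         # slope of the loss for this one example
        w -= learning_rate * gradient  # take a small step downhill

print(w)  # close to 3.0
```

Because each step uses a single example, `w` keeps wobbling slightly around the minimum instead of landing on it exactly, which is typical for SGD with a constant learning rate.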


```Python
import random  # for shuffling the training data

import spacy
from spacy.training import Example

nlp = spacy.load('en_core_web_md')

# training data: (text, annotations) pairs with character offsets
trainset = [
    ("navigate home", {"entities": [(9, 13, "GPE")]}),
    ("navigate to office", {"entities": [(12, 18, "GPE")]}),
    ("navigate", {"entities": []}),
    ("navigate to Oxford Street", {"entities": [(12, 25, "GPE")]})
]

epochs = 50

# disable every pipe except the NER while training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.create_optimizer()  # creates an optimizer object

    for i in range(epochs):
        random.shuffle(trainset)  # reshuffle every epoch so the order of the examples does not matter
        losses = {}  # accumulates the losses of this epoch

        for text, annotations in trainset:
            doc = nlp.make_doc(text)  # tokenize only; the gold annotations come from the dict
            example = Example.from_dict(doc, annotations)  # nlp.update expects Example objects

            # the actual training step
            nlp.update([example], sgd=optimizer, losses=losses, drop=0.2)
        print(f"Iteration {i}: Losses: {losses.get('ner')}")
```

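The code above should work, although it did not run on my computer. After training, the updated NER component can be saved to disk and loaded again like this:

```Python
# save the updated NER component after training
ner = nlp.get_pipe("ner")
ner.to_disk("navi_ner")

# later: load the base pipeline and read the updated weights into its "ner"
# component, so that the architecture matches the one that was saved
nlp = spacy.load("en_core_web_md")
ner = nlp.get_pipe("ner")
ner.from_disk("navi_ner")

print(nlp.pipe_names)  # the components of the loaded pipeline
```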

## Training a pipeline component from scratch


In the previous section, we saw how to update the existing NER component
according to our data. In this section, we will create a brand-new NER
component for the medicine domain.

Let's start with a small dataset to understand the training procedure. Then
we'll be experimenting with a real medical NLP dataset. The sentences in the
training set below belong to the medicine domain and include medical entities such as
drug and disease names.

The following code block shows how to train an NER component from
scratch. As we mentioned before, it's better to create our own NER rather
than updating spaCy's default NER model, as medical entities are not
recognized by spaCy's NER component at all. Let's see the code and also
compare it to the code from the previous section. We'll go step by step:


```Python
import random  # for shuffling the training data

import spacy
from spacy.training import Example

train_set = [
    ("Methylphenidate is effectively used in treating children with epilepsy and ADHD.",
     {"entities": [(0, 15, "DRUG"), (62, 70, "DISEASE"), (75, 79, "DISEASE")]}),
    ("Patients were followed up for 6 months.", {"entities": []}),
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
     {"entities": [(0, 26, "DRUG"), (52, 75, "DISEASE")]})
]

# create a blank model
nlp = spacy.blank('en')

# add a new, untrained NER component to the pipeline
ner = nlp.add_pipe('ner')

# the unique entity labels used in train_set
entities = ["DISEASE", "DRUG"]

# register each label with the NER component
for ent in entities:
    ner.add_label(ent)

epochs = 100

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']  # empty for this blank pipeline
loss_history = []  # losses of every epoch

with nlp.disable_pipes(*other_pipes):
    # initialize the model weights; a blank model starts without any prior knowledge
    # (nlp.initialize() replaces the deprecated nlp.begin_training() in spaCy v3)
    optimizer = nlp.initialize()

    for i in range(epochs):
        random.shuffle(train_set)
        losses = {}

        for text, annotation in train_set:
            doc = nlp.make_doc(text)  # turn the text into a Doc container

            example = Example.from_dict(doc, annotation)  # nlp.update expects a batch of Example objects

            # the actual training step
            nlp.update([example], sgd=optimizer, losses=losses, drop=0.5)

        print(f"Iteration {i}: Losses: {losses.get('ner')}")
        loss_history.append(losses)
```
            

The output will be 

<img src='images/sup.png' width="400"/>

## Working with a real-world dataset


In this section, we will train on a real-world corpus. We will train an NER
model on the CORD-19 corpus provided by the Allen Institute for AI
(https://allenai.org/). This is an open challenge for text miners to extract
information from this dataset to help medical professionals around the world
fight against Corona disease. CORD-19 is an open source dataset that is
collected from over 500,000 scholarly articles about Corona disease. The
training set consists of 20 annotated medical text samples:

```
The antiviral drugs amantadine and rimantadine inhibit a viral
ion channel (M2 protein), thus inhibiting replication of the
influenza A virus.[86] These drugs are sometimes effective
against influenza A if given early in the infection but are
ineffective against influenza B viruses, which lack the M2 drug
target.[160] Measured resistance to amantadine and rimantadine
in American isolates of H3N2 has increased to 91% in 2005.[161]
This high level of resistance may be due to the easy
availability of amantadines as part of over-the-counter cold
remedies in countries such as China and Russia,[162] and their
use to prevent outbreaks of influenza in farmed poultry.[163]
[164] The CDC recommended against using M2 inhibitors during the
2005–06 influenza season due to high levels of drug resistance.
[165]
```

As we see from this example, real-world medical text can be quite long,
and it can include many medical terms and entities. Nouns, verbs, and
entities are all related to the medicine domain. Entities can be numbers
(91%), number and units (100 ng/ml, 25 microg/ml), number-letter
combinations (H3N2), abbreviations (CDC), and also compound words
(qRT-PCR, PE-labeled).


The medical entities come in several shapes (numbers, number and letter
combinations, and compounds) as well as being very domain-specific.
Hence, a medical text is very different from everyday spoken/written
language and definitely needs custom training.
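The variety of entity shapes listed above can be illustrated with a few regular expressions. This is only a rough sketch using the standard library: the pattern set and the helper `entity_shape` are hypothetical, cover just the example strings mentioned in this section, and are not a substitute for a trained NER.

```python
import re

# (shape name, pattern) pairs, checked in order; each must match the whole string
SHAPE_PATTERNS = [
    ("number", re.compile(r"\d+(\.\d+)?%?")),
    ("number and unit", re.compile(r"\d+(\.\d+)? [A-Za-z]+(/[A-Za-z]+)?")),
    ("abbreviation", re.compile(r"[A-Z]{2,}")),
    ("number-letter combination",
     re.compile(r"(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*\d)[A-Za-z0-9]+")),
    ("compound word", re.compile(r"[A-Za-z]+(-[A-Za-z]+)+")),
]


def entity_shape(text):
    """Return the name of the first shape whose pattern matches the whole text."""
    for name, pattern in SHAPE_PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "other"


for sample in ["91%", "100 ng/ml", "25 microg/ml", "H3N2", "CDC", "qRT-PCR", "PE-labeled"]:
    print(f"{sample!r}: {entity_shape(sample)}")
```

Patterns like these can tell the shapes apart, but they cannot decide whether a given string is a drug, a virus, or an institution, which is the part that needs custom training.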

**The training data is available in this folder.**

[**All the code is available here**](https://colab.research.google.com/drive/1yz7i0GcADho46JNXnwd0422dXRanPz5W#scrollTo=T43slgxddiYg)
