@author Dennis

Notebook based on the example at https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/.

The example covers the basic spaCy functions and shows how to extend its NER models when we need something custom, such as introducing a new label, which is useful for our tasks.

While studying the topic, the problem of forgetting came up (https://github.com/explosion/spaCy/discussions/9414), which occurs when attempting incremental learning.
This problem does not seem to have an established solution. The recommendation is to retrain from scratch with spaCy.
Other places where the problem is discussed: https://support.prodi.gy/t/generating-examples-in-spacy-to-address-catastrophic-forgetting/5097.
See also: https://github.com/explosion/spaCy/discussions/5134

A possible solution (I still have to read it): https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
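
The general idea behind pseudo-rehearsal is to let the original model annotate some unlabeled text before fine-tuning, and then to mix those "revision" examples with the new ones so that the updates keep rewarding the old behaviour. The sketch below only illustrates the data mixing step. The helper `mix_with_revision`, the `predict` callable and the stub predictor are hypothetical; with spaCy, `predict` could wrap the original pipeline and return `(ent.start_char, ent.end_char, ent.label_)` for each entity in `doc.ents`.

```python
import random

def mix_with_revision(new_examples, raw_texts, predict, seed=0):
    """Combine new training examples with examples labelled by the original model.

    new_examples: list of (text, {"entities": [(start, end, label), ...]})
    raw_texts:    unlabeled texts to be annotated by the original model
    predict:      callable text -> list of (start, end, label); hypothetical here
    """
    revision = [(text, {"entities": predict(text)}) for text in raw_texts]
    mixed = list(new_examples) + revision
    random.Random(seed).shuffle(mixed)  # avoid feeding all revision data last
    return mixed

# Stub standing in for the original model's predictions (assumption for the demo).
def stub_predict(text):
    start = text.find("Amazon")
    return [(start, start + len("Amazon"), "ORG")] if start >= 0 else []

new_examples = [("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]})]
raw_texts = ["I recently ordered a book from Amazon"]
mixed = mix_with_revision(new_examples, raw_texts, stub_predict)
for text, annotations in mixed:
    print(text, annotations)
```

How many revision examples to mix in per new example is a tuning choice that this sketch does not address.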

From a theoretical standpoint, this is known as catastrophic interference: https://en.wikipedia.org/wiki/Catastrophic_interference

Original corpora on which the models were trained: https://spacy.io/models/it


In [1]:
# load a spaCy model
import spacy

# usually, the model is saved in the nlp object
nlp = spacy.load('en_core_web_sm')

In [2]:
# check if the model presents the ner step in its pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
# in case the model does not have ner among its steps
# nlp.add_pipe('ner')

Although spaCy ships with a built-in `ner` component for entity recognition, and although it generally performs well, it is not always accurate on our text (in particular, the Italian models are more limited than the English ones). Moreover, for Italian as well as for other languages, the category we need may not be built into spaCy.

Below is an example of how NER (in English) performs on an article about e-commerce companies.

In [29]:
# a variable with the whole text
article_text="""India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
In the wake of so many players coming together on one platform, the Indian e-commerce market is envisioned to reach USD 84 billion in 2021 from USD 24 billion in 2017. Further, with the rate at which internet penetration is increasing, we can expect more and more international retailers coming to India in addition to a large pool of new startups. This, in turn, will provide a major Philip to the organized retail market and boost its share from 12% in 2017 to 22-25% by 2021. 
Here’s a view to the e-commerce giants that are dominating India’s online shopping space:
Amazon - One of the uncontested global leaders, Amazon started its journey as a simple online bookstore that gradually expanded its reach to provide a large suite of diversified products including media, furniture, food, and electronics, among others. And now with the launch of Amazon Prime and Amazon Music Limited, it has taken customer experience to a godly level, which will remain undefeatable for a very long time. 

Flipkart - Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon. 

Snapdeal - Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce. 

ShopClues - Another renowned name in the Indian e-commerce industry, ShopClues was founded in July 2011. It’s a Gurugram based company having a current valuation of INR 1.1 billion and is backed by prominent names including Nexus Venture Partners, Tiger Global, and Helion Ventures as its major investors. Presently, the platform comprises more than 5 lac sellers selling products in nine different categories such as computers, cameras, mobiles, etc. 

Paytm Mall - To compete with the existing e-commerce giants, Paytm, an online payment system has also launched its online marketplace - Paytm Mall, which offers a wide array of products ranging from men and women fashion to groceries and cosmetics, electronics and home products, and many more. The unique thing about this platform is that it serves as a medium for third parties to sell their products directly through the widely-known app - Paytm. 

Reliance Retail - Given Reliance Jio’s disruptive venture in the Indian telecom space along with a solid market presence of Reliance, it is no wonder that Reliance will soon be foraying into retail space. As of now, it has plans to build an e-commerce space that will be established on online-to-offline market program and aim to bring local merchants on board to help them boost their sales and compete with the existing industry leaders. 
Big Basket - India’s biggest online supermarket, Big Basket provides a wide variety of imported and gourmet products through two types of delivery services - express delivery and slotted delivery. It also offers pre-cut fruits along with a long list of beverages including fresh juices, cold drinks, hot teas, etc. Moreover, it not only provides farm-fresh products but also ensures that the farmer gets better prices. 

Grofers - One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes. 

Digital Mall of Asia - Going live in 2020, Digital Mall of Asia is a very unique concept coined by the founders of Yokeasia Malls. It is designed to provide an immersive digital space equipped with multiple visual and sensory elements to sellers and shoppers. It will also give retailers exclusive rights to sell a particular product category or brand in their respective cities. What makes it unique is its zero-commission model enabling retailers to pay only a fixed amount of monthly rental instead of paying commissions. With its one-of-a-kind features, DMA is expected to bring
never-seen transformation to the current e-commerce ecosystem while addressing all the existing e-commerce worries such as counterfeiting. """


In [None]:

# give the text to the nlp object, obtain the Doc object in output
doc = nlp(article_text)

# now print the named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

Note that in the tutorial the entity Flipkart appeared to be misclassified as a PER (person) instead of an ORG. Here, however, it is correctly classified as an ORG. Evidently spaCy's models have been updated and improved over time.

The next example shows, however, that the model still cannot identify the entity "Alto", let alone classify it as a product.

In [4]:
# a test showing that, as it is, the model cannot classify 'Alto' as a product (a car)
doc = nlp("I was driving a Alto")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities []


What we do here is take a pre-trained spaCy model and update it with new examples. The first step is therefore to load a model that contains the `ner` component and to extract the named entity recognizer from it.

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')


# get the NER pipeline component
ner = nlp.get_pipe("ner")

Let's now see how to update the NER component according to the context and requirements we work with.
To do so, the model has to be updated with new examples. These must be numerous and representative for the system to improve. A few hundred is usually cited as the minimum.

spaCy accepts training data as a list of tuples. Each tuple should contain a text and a dictionary. The dictionary holds the start and end indices of each named entity within the text, together with the category/label of the entity. The indices are character offsets: the first is the index of the entity's first character, the second is the index of the first character after the entity. For example:

In [6]:
# just a one-tuple example of the data format used by spaCy
tuple_example = ("Walmart is a leading e-commerce company", {"entities" : [(0, 7, "ORG")]})
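
Counting character offsets by hand is error-prone, so a small helper can derive them from the entity text. This is a sketch under simple assumptions: the helper name `annotate` is hypothetical, and it only looks for the first occurrence of each entity string.

```python
def annotate(text, entities):
    """Build a spaCy-style training tuple from (entity_text, label) pairs."""
    spans = []
    for entity_text, label in entities:
        start = text.find(entity_text)  # index of the entity's first character
        if start == -1:
            raise ValueError(f"{entity_text!r} not found in {text!r}")
        end = start + len(entity_text)  # index of the first character after the entity
        spans.append((start, end, label))
    return (text, {"entities": spans})

example = annotate("Walmart is a leading e-commerce company", [("Walmart", "ORG")])
print(example)
```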

In [7]:
# training data
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31,37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16,19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20,29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0,6, "PRODUCT"), (25,31,"ORG")]}),
              ("I bought a new Washer", {"entities": [(15,21, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(15,20, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17,22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11,17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11,15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11,22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14,22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9,14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9,14, "PRODUCT")]}),
              ("Flipkart started it's journey from zero", {"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24,27, "ORG")]}),
              ("Flipkart is recognized as leader in market",{"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24,30, "ORG")]})
              ]
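
Before training, it can help to check that each offset pair really slices out the intended entity. The sketch below prints the text covered by every annotation, so that a span pointing at the wrong word stands out. The helper name `show_spans` and the two sample sentences are illustrative.

```python
def show_spans(train_data):
    """Return (covered_text, label) for every annotated span."""
    covered = []
    for text, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            covered.append((text[start:end], label))
    return covered

sample = [
    ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
    ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
]
spans = show_spans(sample)
for surface, label in spans:
    print(repr(surface), label)
```

The same call can be applied to the full training list to review every annotation at a glance.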

Some of the labels/classes used in the training data may not be among those currently recognized by the model. Therefore, before any update, we extend the list of labels available to the model.

In [8]:
# adding the new labels to the ner component of the pipeline
for _, annotations in TRAIN_DATA:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])

In [11]:
# it is possible to list the labels currently considered by the model
ner.labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

Now we use the training data (only a few examples here, just to get started) to update the model. Before training, keep in mind that, besides the `ner` component, the model contains other components in its pipeline. These components must not be affected by the training process, so we have to disable them.

In [12]:
# Disable pipeline components you don't need to change
# pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
pipe_exceptions = ["tok2vec", "ner"] # components that will be affected (only ner would have been ok too)
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] # components that won't be affected

In [13]:
unaffected_pipes

['tagger', 'parser', 'attribute_ruler', 'lemmatizer']

Some remarks from the online tutorial:

a) To train a `ner` model, it has to loop over the examples for a sufficient number of iterations. With only 5 or 6 iterations, training may not be effective.

b) Before every iteration it’s a good practice to shuffle the examples randomly through `random.shuffle()` function. This will ensure that the model does not make generalizations based on the order of the examples.

c) The training data is usually passed in batches.

***

spaCy's `minibatch()` function can be called on the training set, as done below. It returns the dataset split into batches. Its `size` parameter sets the size of each batch.
To obtain the value of `size` we in turn use another function, `compounding`. It takes three inputs: `start` (the smallest value that can be generated), `stop` (the largest value that can be generated) and `compound`, the compounding factor of the series. More information on these aspects __[here](https://spacy.io/api/top-level#util)__.
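
To make the batching schedule concrete, here is a plain-Python sketch of what the two helpers compute. The names `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins written for illustration, not spaCy's own code, and the sketch assumes a growing series (`compound` greater than 1). The first function yields a series that starts at `start`, is multiplied by `compound` at each step and is capped at `stop`; the second cuts the data into consecutive batches whose sizes are read from that series.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, sizes):
    """Split items into consecutive batches; each size is truncated to an int."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch

data = list(range(19))  # stand-in for a training set of 19 examples
batches = list(minibatch_sketch(data, compounding_sketch(4.0, 32.0, 1.001)))
print([len(batch) for batch in batches])
```

With a factor of 1.001 the batch size grows very slowly, so on 19 examples no batch holds more than 4 items.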

***

At each iteration, the `ner` model is updated through `nlp.update()`, which takes the following parameters: a list of examples, each made of a document and its annotations; `drop`, the dropout rate, i.e. the fraction of units randomly switched off during an update as a regularization technique; and `losses`, a dictionary that collects the loss observed for each pipeline component, which is useful for monitoring training. For the latter we pass an empty dictionary, and the function fills it in.
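
The way the `losses` dictionary is filled helps when reading the values printed during training: each call adds the loss of the current update to the running total stored under the component's name, so the numbers grow within an iteration and start again from zero once a fresh dictionary is created. A toy sketch of that bookkeeping, with a hypothetical `record_loss` helper and made-up loss values:

```python
def record_loss(losses, component, loss):
    """Add one update's loss to the running total kept for a component."""
    losses.setdefault(component, 0.0)
    losses[component] += loss

step_losses = [1.5, 2.0, 0.5]  # made-up per-update losses

for iteration in range(2):
    losses = {}  # a fresh dictionary resets the running total
    for loss in step_losses:
        record_loss(losses, "ner", loss)
        print(iteration, losses)
```

To compare iterations, it is therefore the last value printed in each iteration that matters, not the intermediate ones.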

***

At each iteration, the update method makes a prediction, checks the annotations to see whether it was right and, if it was not, adjusts the weights of the neural network so that the correct action gets a higher score next time.
Finally, training is performed only on the `ner` component, or more generally on the components we have left active. The others are not affected by training.


In [14]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example

# TRAIN THE MODEL
with nlp.disable_pipes(*unaffected_pipes):
    for iteration in range(30):
        # shuffle the examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                # Update the model
                nlp.update([example], losses=losses, drop=0.5)
                print("Losses", losses)

Losses {'tok2vec': 0.0, 'ner': 1.2652465105542852}
Losses {'tok2vec': 0.0, 'ner': 3.2651807615982587}
Losses {'tok2vec': 0.0, 'ner': 3.2671112190950553}
Losses {'tok2vec': 0.0, 'ner': 4.843149523724394}
Losses {'tok2vec': 0.0, 'ner': 6.827805456960367}
Losses {'tok2vec': 0.0, 'ner': 6.828644086123553}
Losses {'tok2vec': 0.0, 'ner': 8.08635737124986}
Losses {'tok2vec': 0.0, 'ner': 9.499662768721223}
Losses {'tok2vec': 0.0, 'ner': 11.577408242699434}
Losses {'tok2vec': 0.0, 'ner': 13.576520268507878}
Losses {'tok2vec': 0.0, 'ner': 15.674196717269455}
Losses {'tok2vec': 0.0, 'ner': 16.28660020716795}
Losses {'tok2vec': 0.0, 'ner': 18.065558597145966}
Losses {'tok2vec': 0.0, 'ner': 19.082501530245167}
Losses {'tok2vec': 0.0, 'ner': 20.87836748761061}
Losses {'tok2vec': 0.0, 'ner': 22.078238831484562}
Losses {'tok2vec': 0.0, 'ner': 24.064810429303577}
Losses {'tok2vec': 0.0, 'ner': 24.222429819703247}
Losses {'tok2vec': 0.0, 'ner': 26.3127833759316}
Losses {'tok2vec': 0.0, 'ner': 1.08627830

Now that training is complete, the model can be tested on new data. If it still does not behave as expected, more training data may be needed.

In [16]:
# Testing the model - now the model correctly classifies Alto!!
doc = nlp("I was driving a Alto")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('Alto', 'PRODUCT')]


As the cell above shows, the updated model is now able to recognize Alto and classify it correctly.

In [17]:
# Save the model to a directory
output_dir = Path('./english_model')
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("Fridge can be ordered in FlipKart" )
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Saved model to english_model
Loading from english_model
Entities [('Fridge', 'PRODUCT'), ('FlipKart', 'ORG')]


### Training a new entity type (label) in spaCy

Suppose we need a category that the model does not have yet. Let's imagine a new category, FOOD. spaCy lets us add a new category and train the model on it. This feature is useful because it allows us to add new entities, improving tasks such as information retrieval. We start by loading an existing spaCy model.

In [17]:
# Import and load the spacy model
import spacy
nlp=spacy.load("en_core_web_sm") 

# Getting the ner component
ner=nlp.get_pipe('ner')

Next we need to prepare the new label and the new training data.

In [18]:
# New label to add
LABEL = "FOOD"

# Training examples in the required format
TRAIN_DATA =[ ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
              ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]}),
              ("China's noodles are very famous", {"entities": [(8,15, "FOOD")]}),
              ("Shrimps are famous in China too", {"entities": [(0,7, "FOOD")]}),
              ("Lasagna is another classic of Italy", {"entities": [(0,7, "FOOD")]}),
              ("Sushi is extemely famous and expensive Japanese dish", {"entities": [(0,5, "FOOD")]}),
              ("Unagi is a famous seafood of Japan", {"entities": [(0,5, "FOOD")]}),
              ("Tempura , Soba are other famous dishes of Japan", {"entities": [(0,7, "FOOD")]}),
              ("Udon is a healthy type of noodles", {"entities": [(0,4, "FOOD")]}),
              ("Chocolate soufflé is extremely famous french cuisine", {"entities": [(0,17, "FOOD")]}),
              ("Flamiche is french pastry", {"entities": [(0,8, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0,7, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0,7, "FOOD")]}),
              ("Frenchfries are considered too oily", {"entities": [(0,11, "FOOD")]})
           ]

The first step is to add the FOOD label to the model through the `add_label()` method. As before, we disable the parts of the pipeline that are not relevant here.

In [25]:
# Add the new label to the ner component
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

Note that:
a) The model needs a sufficient number of iterations for the changes to be significant.

b) The model should also be fine-tuned with an eye on performance. Before each iteration, it is best to shuffle the examples so as not to introduce bias from the order in which they appear.

c) The training data should be passed in batches.

In [33]:
# Importing requirements
from spacy.util import minibatch, compounding
from spacy.training.example import Example
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :
  # Training for 30 iterations     
  for itn in range(30):
    # shuffle examples before training
    random.shuffle(TRAIN_DATA)
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(1.0, 4.0, 1.001))
    # dictionary to store losses
    losses = {}
    for batch in batches:
      for text, annotations in batch:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        # Update the model
        nlp.update([example], losses=losses, drop=0.5)
        print("Losses", losses)

Losses {'tok2vec': 0.0, 'ner': 3.9288996426526424e-10}
Losses {'tok2vec': 0.0, 'ner': 2.852108600148923e-07}
Losses {'tok2vec': 0.0, 'ner': 2.8541110670177405e-07}
Losses {'tok2vec': 0.0, 'ner': 1.6537065975908272}
Losses {'tok2vec': 0.0, 'ner': 1.6537065977106993}
Losses {'tok2vec': 0.0, 'ner': 1.6537097684635291}
Losses {'tok2vec': 0.0, 'ner': 1.6537104686822603}
Losses {'tok2vec': 0.0, 'ner': 1.658817139959233}
Losses {'tok2vec': 0.0, 'ner': 1.6588172068667404}
Losses {'tok2vec': 0.0, 'ner': 1.658818181673252}
Losses {'tok2vec': 0.0, 'ner': 1.658818183332906}
Losses {'tok2vec': 0.0, 'ner': 1.6588181833742635}
Losses {'tok2vec': 0.0, 'ner': 1.6588182390713566}
Losses {'tok2vec': 0.0, 'ner': 1.6590123154325729}
Losses {'tok2vec': 0.0, 'ner': 0.3396537902166091}
Losses {'tok2vec': 0.0, 'ner': 0.3417269009885401}
Losses {'tok2vec': 0.0, 'ner': 0.34172691737243993}
Losses {'tok2vec': 0.0, 'ner': 0.3417271712484607}
Losses {'tok2vec': 0.0, 'ner': 0.3417271715369572}
Losses {'tok2vec': 0.0

Let's test it.

In [36]:
# Testing the NER
test_text = "I ate Sushi yesterday. Maggi is a common fast food "

# make it a doc with the new nlp model
doc = nlp(test_text)

print("Entities in '%s'" % test_text)
for ent in doc.ents:
  print(ent, ent.label_)

Entities in 'I ate Sushi yesterday. Maggi is a common fast food '
Sushi FOOD
Maggi FOOD


In [32]:
# also run the updated model on the original article to check for forgetting
# (in the output below most entities end up labelled FOOD)
doc = nlp(article_text)
print("Entities in the original article" )
for ent in doc.ents:
  print(ent, ent.label_)


Entities in the original article
India FOOD
billion FOOD
billion FOOD
12% PERCENT
2021 DATE
Amazon FOOD
Prime FOOD
Limited FOOD
Founded FOOD
Flipkart FOOD
Amazon FOOD
Walmart FOOD
Started FOOD
Snapdeal FOOD
Unicommerce ORG
ShopClues FOOD
2011 FOOD
Gurugram FOOD
Partners FOOD
Global FOOD
Ventures FOOD
Paytm FOOD
Mall FOOD
Paytm FOOD
Reliance FOOD
Reliance FOOD
Basket FOOD
Moreover FOOD
Grofers - FOOD
Grofers FOOD
Going FOOD
Asia ORG
DMA ORG
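
The output above can be turned into a number. As a sketch, assuming the entities predicted before and after fine-tuning are available as `(text, label)` pairs, one can measure how many of the original predictions survive unchanged. The helper name `label_retention` is hypothetical, and the `before` list is made up for illustration, since the output of the original model on the article is not recorded in this notebook.

```python
from collections import Counter

def label_retention(before, after):
    """Fraction of (text, label) predictions from `before` still present in `after`."""
    if not before:
        return 1.0
    remaining = Counter(after)
    kept = 0
    for pair in before:
        if remaining[pair] > 0:
            remaining[pair] -= 1  # each prediction in `after` can match only once
            kept += 1
    return kept / len(before)

# Made-up "before" predictions; the "after" labels mirror a few lines of the output above.
before = [("India", "GPE"), ("Amazon", "ORG"), ("Flipkart", "ORG"), ("12%", "PERCENT")]
after = [("India", "FOOD"), ("Amazon", "FOOD"), ("Flipkart", "FOOD"), ("12%", "PERCENT")]

print(label_retention(before, after))
print(Counter(label for _, label in after))
```

A retention close to 1 would indicate that the old labels are largely preserved, while a low value points to the forgetting discussed at the top of the notebook.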
