@author Dennis

Notebook generated from the example at https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/.

The example shows the basic spaCy functions and how to extend its NER models when we need something custom, such as introducing a new label, which is useful for our tasks.

In [1]:
# load a spaCy model
import spacy

# by convention, the loaded model is stored in a variable named nlp
nlp = spacy.load('en_core_web_sm')

In [2]:
# check whether the model's pipeline includes the ner component
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
# in case the model does not have ner among its steps
# nlp.add_pipe('ner')

Although spaCy's built-in pipeline includes a `ner` component for entity recognition, and although it generally performs well, it is not always accurate on our text (in particular, the Italian models are more limited than the English ones). Moreover, for Italian and for other languages too, the category we need may simply not be built into spaCy.

Below is an example of how NER (in English) performs on an article about e-commerce companies.

In [4]:
# a variable with the whole text
article_text="""India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
In the wake of so many players coming together on one platform, the Indian e-commerce market is envisioned to reach USD 84 billion in 2021 from USD 24 billion in 2017. Further, with the rate at which internet penetration is increasing, we can expect more and more international retailers coming to India in addition to a large pool of new startups. This, in turn, will provide a major Philip to the organized retail market and boost its share from 12% in 2017 to 22-25% by 2021. 
Here’s a view to the e-commerce giants that are dominating India’s online shopping space:
Amazon - One of the uncontested global leaders, Amazon started its journey as a simple online bookstore that gradually expanded its reach to provide a large suite of diversified products including media, furniture, food, and electronics, among others. And now with the launch of Amazon Prime and Amazon Music Limited, it has taken customer experience to a godly level, which will remain undefeatable for a very long time. 

Flipkart - Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon. 

Snapdeal - Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce. 

ShopClues - Another renowned name in the Indian e-commerce industry, ShopClues was founded in July 2011. It’s a Gurugram based company having a current valuation of INR 1.1 billion and is backed by prominent names including Nexus Venture Partners, Tiger Global, and Helion Ventures as its major investors. Presently, the platform comprises more than 5 lac sellers selling products in nine different categories such as computers, cameras, mobiles, etc. 

Paytm Mall - To compete with the existing e-commerce giants, Paytm, an online payment system has also launched its online marketplace - Paytm Mall, which offers a wide array of products ranging from men and women fashion to groceries and cosmetics, electronics and home products, and many more. The unique thing about this platform is that it serves as a medium for third parties to sell their products directly through the widely-known app - Paytm. 

Reliance Retail - Given Reliance Jio’s disruptive venture in the Indian telecom space along with a solid market presence of Reliance, it is no wonder that Reliance will soon be foraying into retail space. As of now, it has plans to build an e-commerce space that will be established on online-to-offline market program and aim to bring local merchants on board to help them boost their sales and compete with the existing industry leaders. 
Big Basket - India’s biggest online supermarket, Big Basket provides a wide variety of imported and gourmet products through two types of delivery services - express delivery and slotted delivery. It also offers pre-cut fruits along with a long list of beverages including fresh juices, cold drinks, hot teas, etc. Moreover, it not only provides farm-fresh products but also ensures that the farmer gets better prices. 

Grofers - One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes. 

Digital Mall of Asia - Going live in 2020, Digital Mall of Asia is a very unique concept coined by the founders of Yokeasia Malls. It is designed to provide an immersive digital space equipped with multiple visual and sensory elements to sellers and shoppers. It will also give retailers exclusive rights to sell a particular product category or brand in their respective cities. What makes it unique is its zero-commission model enabling retailers to pay only a fixed amount of monthly rental instead of paying commissions. With its one-of-a-kind features, DMA is expected to bring
never-seen transformation to the current e-commerce ecosystem while addressing all the existing e-commerce worries such as counterfeiting. """

# pass the text to the nlp object to obtain a Doc object
doc = nlp(article_text)

# now print the named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

India GPE
one CARDINAL
Indian NORP
USD 84 billion MONEY
2021 DATE
USD 24 billion MONEY
2017 DATE
India GPE
Philip PERSON
12% PERCENT
2017 DATE
22-25% PERCENT
2021 DATE
India GPE
Amazon - One ORG
Amazon ORG
Amazon ORG
Amazon Music Limited ORG
Flipkart - Founded ORG
2007 DATE
Flipkart ORG
Indian NORP
Amazon ORG
Walmart ORG
one CARDINAL
US GPE
Amazon ORG
2010 DATE
Snapdeal GPE
2011 DATE
more than 3 CARDINAL
India GPE
over 30 million QUANTITY
over 125,000 CARDINAL
Indian NORP
recent years DATE
Freecharge and Unicommerce ORG
Indian NORP
ShopClues ORG
July 2011 DATE
Gurugram ORG
INR 1.1 billion MONEY
Nexus Venture Partners ORG
Tiger Global PERSON
Helion Ventures ORG
more than 5 CARDINAL
nine CARDINAL
Paytm Mall - To PERSON
Paytm GPE
third ORDINAL
Reliance Retail - Given PERSON
Reliance Jio’s PERSON
Indian NORP
Reliance ORG
Reliance ORG
two CARDINAL
Grofers ORG
2013 DATE
the last 5 years DATE
daily DATE
India GPE
Indian NORP
2020 DATE
Digital Mall of Asia ORG
Yokeasia GPE
zero CARDINAL
monthl

Note that in the tutorial the entity Flipkart appeared to be misclassified as PER (person) instead of ORG. Here, instead, it is correctly classified as ORG. Evidently spaCy's models have been updated and improved over time.

In any case, the rest of the notebook shows how to update the NER component to fit the context and requirements we work with.

In [5]:
# a test showing that, as it is, the model does not classify 'Alto' as a product (a car)
doc = nlp("I was driving a Alto")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities []


What we do here is take a pre-trained spaCy model and update it with new examples. The first step is therefore to load a model that contains the `ner` component and to extract the named entity recognizer from it.

In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')


# get the ner pipeline component
ner = nlp.get_pipe("ner")

At this point the model needs to be updated with new examples. There must be many of them, and they must be representative, for the system to improve. A few hundred is usually quoted as the minimum.

spaCy accepts training data as a list of tuples. Each tuple should contain the text and a dictionary. The dictionary holds the start and end indices of each named entity in the text, together with the entity's category/label. The indices are character offsets: the first is the index of the entity's first character, the second is the index of the first character after the entity. An example follows:

In [7]:
tuple_example = ("Walmart is a leading e-commerce company", {"entities" : [(0, 7, "ORG")]})
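
Counting character offsets by hand is error-prone. As a minimal sketch, the helper below derives the `(start, end, label)` triples from the entity strings with `str.find`. The name `make_example` is hypothetical and not part of spaCy; the helper assumes each entity string occurs in the text and only locates its first occurrence.

```python
def make_example(text, entities):
    """Build a (text, {"entities": [...]}) tuple from (substring, label) pairs.

    Hypothetical helper, not part of spaCy: it only locates the first
    occurrence of each substring in the text.
    """
    spans = []
    for substring, label in entities:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        # end is exclusive: the index of the first character after the entity
        spans.append((start, start + len(substring), label))
    return (text, {"entities": spans})


example = make_example("Walmart is a leading e-commerce company", [("Walmart", "ORG")])
print(example)
```

The result has the same shape as `tuple_example`, so tuples built this way can be collected directly into the training list.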

In [8]:
# training data
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31,37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16,19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20,29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0,6, "PRODUCT"), (25,31,"ORG")]}),
              ("I bought a new Washer", {"entities": [(15,21, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(15,20, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17,22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11,17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11,15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11,22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14,22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9,14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9,14, "PRODUCT")]}),
              ("Flipkart started it's journey from zero", {"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24,27, "ORG")]}),
              ("Flipkart is recognized as leader in market",{"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24,30, "ORG")]})
              ]
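
A wrong offset silently labels the wrong characters, so it can help to slice each text with its annotated offsets and read the result before training. The following is a minimal sketch in plain Python. `SAMPLE` and `annotated_strings` are illustrative names, and the last sample deliberately carries a wrong span to show what the check surfaces.

```python
SAMPLE = [
    ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
    ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
    # deliberately wrong span: (11, 13) covers "my", not "laptop"
    ("I repaired my laptop", {"entities": [(11, 13, "PRODUCT")]}),
]


def annotated_strings(data):
    """Yield (label, covered_text) for every annotated span."""
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            yield label, text[start:end]


for label, covered in annotated_strings(SAMPLE):
    print(label, repr(covered))
```

Any printed string that is not the intended entity points to an offset that needs fixing.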

In [9]:
# adding the new labels to the ner component of the pipeline
for _, annotations in TRAIN_DATA:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])

In [10]:
# you can look at all the available labels in this way
ner.labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

Now we use the training data (only a few examples here, but it is a starting point) to update the model. Before training, keep in mind that besides the `ner` component the model's pipeline contains other components. These must not be affected by the training process, so we have to disable them.

In [11]:
# Disable pipeline components you don't need to change
# pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
pipe_exceptions = ["tok2vec", "ner", "trf_wordpiecer"] # components that will be affected
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] # components that won't be affected

In [12]:
unaffected_pipes

['tagger', 'parser', 'attribute_ruler', 'lemmatizer']

Some observations from the online tutorial:
a) To train a `ner` model, the training loop has to pass over the examples for a sufficient number of iterations. With only 5 or 6 iterations, training may not be effective.
b) Before every iteration it is good practice to shuffle the examples randomly with the `random.shuffle()` function. This ensures that the model does not generalize based on the order of the examples.
c) The training data is usually passed in batches.
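
Points b) and c) can be sketched without spaCy. The two functions below are simplified pure-Python re-implementations written for illustration; they approximate how `spacy.util.compounding` and `spacy.util.minibatch` are used in this notebook and are not spaCy's own source. The list `examples` is a stand-in for the 19 training tuples.

```python
import itertools
import random


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ..., capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch(items, size):
    """Split items into consecutive batches; size is an iterator of batch sizes."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch


examples = [f"example {i}" for i in range(19)]  # stand-in for 19 training tuples
random.seed(0)  # fixed seed only to make the sketch reproducible

for iteration in range(2):
    random.shuffle(examples)  # new order at every iteration
    batches = list(minibatch(examples, size=compounding(4.0, 32.0, 1.001)))
    print(iteration, [len(batch) for batch in batches])
```

With a compounding factor of 1.001 the batch size grows very slowly, so on a dataset this small every batch holds about 4 examples; the schedule only matters on much larger training sets.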


In [13]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example

# TRAIN THE MODEL
with nlp.disable_pipes(*unaffected_pipes):
    for iteration in range(30):
        # shuffle the examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                # Update the model
                nlp.update([example], losses=losses, drop=0.5)
                print("Losses", losses)

Losses {'tok2vec': 0.0, 'ner': 2.781343273785751}
Losses {'tok2vec': 0.0, 'ner': 3.5311394005793115}
Losses {'tok2vec': 0.0, 'ner': 5.517842516289809}
Losses {'tok2vec': 0.0, 'ner': 6.816723941157423}
Losses {'tok2vec': 0.0, 'ner': 8.434883982253357}
Losses {'tok2vec': 0.0, 'ner': 9.591214979170033}
Losses {'tok2vec': 0.0, 'ner': 12.025451327235237}
Losses {'tok2vec': 0.0, 'ner': 13.446031380087083}
Losses {'tok2vec': 0.0, 'ner': 15.44596973468489}
Losses {'tok2vec': 0.0, 'ner': 17.784664457670907}
Losses {'tok2vec': 0.0, 'ner': 19.872319887269747}
Losses {'tok2vec': 0.0, 'ner': 19.92794597318956}
Losses {'tok2vec': 0.0, 'ner': 21.92186672565348}
Losses {'tok2vec': 0.0, 'ner': 21.952923573291866}
Losses {'tok2vec': 0.0, 'ner': 23.9447570733801}
Losses {'tok2vec': 0.0, 'ner': 25.95120491492561}
Losses {'tok2vec': 0.0, 'ner': 26.662619758050518}
Losses {'tok2vec': 0.0, 'ner': 29.9455194891815}
Losses {'tok2vec': 0.0, 'ner': 30.716698720403443}
Losses {'tok2vec': 0.0, 'ner': 1.99959211393
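
The printed values grow within an iteration because the same `losses` dictionary is passed to every `nlp.update` call, which adds to the entry of each component; the dictionary is reset only when the next iteration starts. As a minimal sketch, assuming that accumulating behavior, the per-update loss can be recovered from the first few values printed above (`step_losses` is an illustrative name).

```python
# First five cumulative 'ner' losses printed above
cumulative = [
    2.781343273785751,
    3.5311394005793115,
    5.517842516289809,
    6.816723941157423,
    8.434883982253357,
]


def step_losses(values):
    """Convert a running total into the amount added at each step."""
    previous = 0.0
    steps = []
    for value in values:
        steps.append(value - previous)
        previous = value
    return steps


for step in step_losses(cumulative):
    print(round(step, 3))
```

To compare iterations, it is the last value printed in each iteration (the total over all examples) that should be tracked, not the intermediate ones.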

In [14]:
# Testing the model - now the model correctly classifies Alto!!
doc = nlp("I was driving a Alto")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('Alto', 'PRODUCT')]


As the cell above shows, the model, now that it has been further trained, classifies 'Alto' as a PRODUCT.

In [17]:
# Save the model to a directory
output_dir = Path('./english_model')
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("Fridge can be ordered in FlipKart" )
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Saved model to english_model
Loading from english_model
Entities [('Fridge', 'PRODUCT'), ('FlipKart', 'ORG')]
