# NLP. Week 4. Named entity recognition (competition)

[Competition link](https://www.kaggle.com/t/a93a92d0dbe445d4814f161071d715c9)


## 4. Named Entity Recognition

Named Entity Recognition (NER) is the process of locating named entities in unstructured text and classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions. You can think of NER as a step beyond part-of-speech (PoS) tagging: both assign labels to tokens, but NER labels the spans that refer to entities.

### Perform NER with spaCy


In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
from collections import Counter
import en_core_web_sm

In [3]:
sentence = "European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices"

text_processing_pipeline = en_core_web_sm.load()
doc = text_processing_pipeline(sentence)
print([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]


In [4]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

## Task

Your goal is to perform NER on recipe ingredients. You are free to choose any NER approach to solve this task.

We recommend building a pipeline with ```spacy```. Start by creating an empty pipeline. Next, add a pipe component that will act as the NER tagger. Don't forget to set the tokenizer and the labels (i.e. entity names) from the preprocessed dataset. Finally, iterate over the dataset and train the model with the `update` function.

[This page](https://spacy.io/api/language) contains the documentation you need.

Starting the spacy approach...

In [5]:
import spacy
from tqdm.notebook import tqdm
import pandas as pd

import random
from spacy.training import Example
from spacy.util import minibatch
from spacy.tokenizer import Tokenizer
import re

In [6]:
CLASS_MAPPING = {
    0: "DF",
    1: "NAME",
    2: "O",
    3: "QUANTITY",
    4: "SIZE",
    5: "STATE",
    6: "TEMP",
    7: "UNIT",
}
INVERSE_CLASS_MAPPING = {v: k for k, v in CLASS_MAPPING.items()}

In [7]:
data = pd.read_csv("/kaggle/input/nlp-week-4-ner/train.csv")
data.head()

Unnamed: 0,tokens,labels
0,4 cloves garlic 2 cups cooked corned beef -LRB...,3 7 1 3 7 5 5 1 2 2 2 2
1,"2 tablespoons vegetable oil , divided 1 1/2 cu...",3 7 1 1 2 5 3 7 5 1
2,2 tablespoons dried marjoram 3 tablespoons pac...,3 7 0 1 3 7 5 1 1
3,"1 large red onion , 1/4-inch slices pulled int...",3 4 1 1 2 2 2 2 2 2 3 7 1 2 5
4,"2 jalapeno peppers , seeded and minced 1/2 - 3...",3 1 1 2 5 2 5 3 2 2 7 5 1 1
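
The `labels` column stores one integer id per whitespace-separated token. As a quick way to read a row, the sketch below pairs tokens with label names. It is only an illustration: `decode_row` is a hypothetical helper, and the token string is the first training row as it is printed in full later in this notebook.

```python
# Same ids as CLASS_MAPPING above, repeated so the snippet runs on its own.
CLASS_MAPPING = {
    0: "DF", 1: "NAME", 2: "O", 3: "QUANTITY",
    4: "SIZE", 5: "STATE", 6: "TEMP", 7: "UNIT",
}


def decode_row(tokens, labels):
    """Pair each whitespace-separated token with its label name.

    Hypothetical helper for inspection only; it is not part of the dataset API.
    """
    token_list = tokens.split()
    label_list = [CLASS_MAPPING[int(label_id)] for label_id in labels.split()]
    if len(token_list) != len(label_list):
        raise ValueError("tokens and labels have different lengths")
    return list(zip(token_list, label_list))


row_tokens = "4 cloves garlic 2 cups cooked corned beef -LRB- or canned -RRB-"
row_labels = "3 7 1 3 7 5 5 1 2 2 2 2"

for token, label in decode_row(row_tokens, row_labels):
    print(f"{token:10s} {label}")
```

Reading a few rows this way makes it easier to see what each class is meant to capture before training anything.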


In [8]:
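def prepare_training_data(data):
    """Convert rows of space-separated tokens and label ids into spaCy's
    (text, {"entities": [(start, end, label), ...]}) training format."""
    training_data = []
    for _, row in data.iterrows():
        tokens = row["tokens"].split()
        labels = list(map(int, row["labels"].split()))
        # Rebuild the text with single spaces so the character offsets computed
        # below stay valid even if the raw string contains repeated whitespace.
        text = " ".join(tokens)
        entities = []
        start = 0
        for token, label in zip(tokens, labels):
            end = start + len(token)
            if label != INVERSE_CLASS_MAPPING["O"]:  # "O" marks tokens outside any entity
                entities.append((start, end, CLASS_MAPPING[label]))
            start = end + 1  # Skip the single space that separates tokens
        training_data.append((text, {"entities": entities}))
    return training_data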

In [9]:
training_data = prepare_training_data(data)
print(training_data[0])

('4 cloves garlic 2 cups cooked corned beef -LRB- or canned -RRB-', {'entities': [(0, 1, 'QUANTITY'), (2, 8, 'UNIT'), (9, 15, 'NAME'), (16, 17, 'QUANTITY'), (18, 22, 'UNIT'), (23, 29, 'STATE'), (30, 36, 'STATE'), (37, 41, 'NAME')]})
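
The character offsets only help if every span starts and ends exactly on a token boundary. A minimal sanity check, assuming whitespace-delimited tokens, could look like the sketch below; `misaligned_spans` is a hypothetical helper, not a spaCy function.

```python
import re


def misaligned_spans(text, entities):
    """Return the entity spans that do not start and end on the boundaries
    of a whitespace-delimited token in `text`."""
    token_matches = list(re.finditer(r"\S+", text))
    starts = {match.start() for match in token_matches}
    ends = {match.end() for match in token_matches}
    return [
        (start, end, label)
        for start, end, label in entities
        if start not in starts or end not in ends
    ]


text = "4 cloves garlic 2 cups cooked corned beef -LRB- or canned -RRB-"
entities = [
    (0, 1, "QUANTITY"), (2, 8, "UNIT"), (9, 15, "NAME"), (16, 17, "QUANTITY"),
    (18, 22, "UNIT"), (23, 29, "STATE"), (30, 36, "STATE"), (37, 41, "NAME"),
]
print(misaligned_spans(text, entities))

# A double space shifts every later offset if they were computed assuming single spaces.
print(misaligned_spans("2  cups", [(0, 1, "QUANTITY"), (2, 6, "UNIT")]))
```

Running such a check over the whole of `training_data` before training is cheap, and an empty result for every row suggests the offsets are consistent with the text.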


In [10]:
# text_processing_pipeline = spacy.blank("en")
# text_processing_pipeline.add_pipe("ner", last=True)
# ner = text_processing_pipeline.get_pipe("ner")
# ...

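nlp = spacy.blank("en")
# Assumption: the dataset is already tokenized on whitespace, so a rule-free
# Tokenizer (which splits on whitespace only) keeps tokens such as "-LRB-"
# intact and aligned with the entity offsets.
nlp.tokenizer = Tokenizer(nlp.vocab)
ner = nlp.add_pipe("ner", last=True)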

In [11]:
for _, annotations in training_data:
    for ent in annotations['entities']:
        ner.add_label(ent[2])

In [12]:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

In [13]:
EPOCHS = 10
BATCH_SIZE = 10
DROPOUT = 0.4
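
Before training, it can help to hold out part of `training_data` so that choices such as `EPOCHS` or `DROPOUT` can be compared on unseen rows. The sketch below is one possible way to do that in plain Python. The helper names, the 10% default and the use of token-level accuracy are all assumptions; the competition metric is not stated in this notebook.

```python
import random


def split_train_validation(examples, validation_fraction=0.1, seed=0):
    """Shuffle a copy of `examples` and split off a validation subset.

    Hypothetical helper; returns (train_examples, validation_examples).
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def token_accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label id equals the gold one."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label sequences differ in length")
    if not gold_labels:
        return 0.0
    correct = sum(gold == pred for gold, pred in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Toy data in the same (text, annotations) shape as training_data.
toy_examples = [(f"text {i}", {"entities": []}) for i in range(20)]
train_part, validation_part = split_train_validation(toy_examples, validation_fraction=0.2)
print(len(train_part), len(validation_part))

print(token_accuracy([3, 7, 1, 2], [3, 7, 1, 1]))
```

With a split like this, the training loop would iterate over the training part only, and the held-out part would be used to compare settings.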

In [14]:
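optimizer = nlp.initialize()  # spaCy v3 replacement for the deprecated begin_training()
epoch_losses = []  # NER loss per epoch, useful for checking convergence

with nlp.disable_pipes(*other_pipes):  # no-op here: "ner" is the only pipe
    for _ in tqdm(range(EPOCHS)):
        random.shuffle(training_data)
        losses = {}  # accumulated by nlp.update over the epoch
        for batch in minibatch(training_data, size=BATCH_SIZE):
            examples = [
                Example.from_dict(nlp.make_doc(text), annotations)
                for text, annotations in batch
            ]
            nlp.update(examples, sgd=optimizer, drop=DROPOUT, losses=losses)
        epoch_losses.append(losses.get("ner"))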
        

[2025-02-12 10:19:06,459] [INFO] Created vocabulary
[2025-02-12 10:19:06,461] [INFO] Finished initializing nlp object




In [15]:
nlp.to_disk("recipe_ner_model")

In [16]:
doc = nlp("2 tablespoons olive oil, finely chopped onion")
displacy.render(doc, style="ent", jupyter=True)

In [17]:
test_data = pd.read_csv("/kaggle/input/nlp-week-4-ner/test.csv")
test_data.head()

Unnamed: 0,id,token
0,0,1/2
1,1,large
2,2,sweet
3,3,red
4,4,onion


In [18]:
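def predict_label(token):
    """Predict the label id for a single token; fall back to "O" if no entity is found."""
    doc = nlp(token)
    if doc.ents:
        # Use the first predicted entity; unknown labels also map to "O"
        return INVERSE_CLASS_MAPPING.get(doc.ents[0].label_, INVERSE_CLASS_MAPPING["O"])
    return INVERSE_CLASS_MAPPING["O"]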

In [19]:
test_data["label"] = test_data["token"].apply(predict_label)

In [20]:
test_data.head()

Unnamed: 0,id,token,label
0,0,1/2,3
1,1,large,4
2,2,sweet,1
3,3,red,1
4,4,onion,1


In [21]:
test_data = test_data.drop(columns=["token"])

In [22]:
test_data.head()

Unnamed: 0,id,label
0,0,3
1,1,4
2,2,1
3,3,1
4,4,1


In [23]:
test_data.to_csv("submission.csv", index=False)
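
Before uploading, a quick check of the written file can catch obvious mistakes. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, and the required columns (`id`, `label`) and the valid label range (0 to 7, as in `CLASS_MAPPING`) are assumptions based on the tables shown above, not on the competition rules.

```python
import csv
import io

VALID_LABEL_IDS = set(range(8))  # ids 0..7, as in CLASS_MAPPING


def check_submission(csv_file, required_columns=("id", "label")):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing = [column for column in required_columns if column not in fieldnames]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    seen_ids = set()
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if row["id"] in seen_ids:
            problems.append(f"line {line_number}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        label = row["label"]
        if not label.isdigit() or int(label) not in VALID_LABEL_IDS:
            problems.append(f"line {line_number}: invalid label {label!r}")
    return problems


# In-memory example; for the real file use: open("submission.csv", newline="")
sample = io.StringIO("id,label\n0,3\n1,4\n2,9\n")
print(check_submission(sample))
```

An empty list means none of these particular checks failed; it says nothing about prediction quality.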

Another approach is to train a CRF model, as in [this tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html)
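
For the CRF route, each recipe line becomes a list of per-token feature dictionaries, which is the input format used in the linked sklearn-crfsuite tutorial. The sketch below shows one possible feature function. The feature set is illustrative and untuned, and `token_features` and `sentence_features` are hypothetical names.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] using the token and its neighbours."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "lower": token.lower(),
        "suffix3": token[-3:].lower(),
        "is_digit": token.isdigit(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_slash": "/" in token,  # fractions such as 1/2
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of the sequence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of the sequence
    return features


def sentence_features(tokens):
    """Feature dicts for every token of one recipe line."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = "4 cloves garlic".split()
X = [sentence_features(tokens)]     # one sequence of feature dicts
y = [["QUANTITY", "UNIT", "NAME"]]  # label names as strings, one per token
print(X[0][1])
```

Lists shaped like `X` and `y` are what `sklearn_crfsuite.CRF().fit(X, y)` takes in the tutorial; the model itself is left out here so the snippet runs without extra dependencies.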


Alternatively, you could train the Stanford NER tagger, similar to this [example](https://data-ai.theodo.com/en/technical-blog/python-train-model-ntlk-stanford-ner-tagger). While this is a promising approach, it might be trickier in Jupyter because the tagger is implemented in Java. Please also refer to this [FAQ](https://nlp.stanford.edu/software/parser-faq.html#d).
