### NER Exercise

Created by: Tan Poh Keam, Republic Polytechnic


Techniques used to prepare the training and test data are taken from https://deepnote.com/@isaac-aderogba/Spacy-Food-Entities-LMLRnMOsQyGIUwvPLvVlsw .

In this notebook, we'll be training spaCy to identify FOOD entities from given snippets of text.

In the following cells, we will be generating the datasets using templates and food name fillers.

Once we have the dataset, we will train a blank spaCy NER model and add the custom FOOD entity.


In [1]:
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin
from spacy import displacy
import random
import re

**Prepare the training data**
We use the USDA's Branded Foods dataset. Download the April 2021 release from
https://fdc.nal.usda.gov/download-datasets.html

We will download it into the folder ../assets/ and unzip it there.


In [None]:
!wget -P ../assets https://fdc.nal.usda.gov/fdc-datasets/FoodData_Central_branded_food_csv_2021-04-28.zip 

In [None]:
!ls -l ../assets

In [None]:
!unzip -o -q ../assets/FoodData_Central_branded_food_csv_2021-04-28.zip -d ../assets

In [2]:
!ls -l ../assets

total 3433168
-rw-r--r--  1 tanpohkeam  staff     148996 Nov  9  2020 Download & API Field Descriptions April 2021.pdf
-rw-r--r--  1 tanpohkeam  staff  218215172 Apr 28 05:29 FoodData_Central_branded_food_csv_2021-04-28.zip
-rw-r--r--  1 tanpohkeam  staff     402397 Jun 29 09:55 FoodData_Central_branded_food_csv_2021-04-28.zip.1
-rw-r--r--  1 tanpohkeam  staff        140 Apr 27 05:25 all_downloaded_table_record_counts.csv
-rw-r--r--  1 tanpohkeam  staff  499213246 Apr 27 05:15 branded_food.csv
-rw-r--r--  1 tanpohkeam  staff       1173 Jun 27 10:36 dev.spacy
-rw-r--r--  1 tanpohkeam  staff   92669707 Apr 27 05:16 food.csv
-rw-r--r--  1 tanpohkeam  staff   63249964 Apr 27 05:16 food_attribute.csv
-rw-r--r--  1 tanpohkeam  staff  780640628 Apr 27 05:19 food_nutrient.csv
-rw-r--r--  1 tanpohkeam  staff   72102688 Apr 27 05:17 food_update_log_entry.csv
-rw-r--r--  1 tanpohkeam  staff       1173 Jun 27 10:36 train.spacy


In [3]:

path_to_datafile = "../assets/food.csv"
food_df = pd.read_csv(path_to_datafile)

# preview the first 5 rows
food_df.head(5)

Unnamed: 0,fdc_id,data_type,description,food_category_id,publication_date
0,1105904,branded_food,WESSON Vegetable Oil 1 GAL,,2020-11-13
1,1105905,branded_food,SWANSON BROTH BEEF,,2020-11-13
2,1105906,branded_food,CAMPBELL'S SLOW KETTLE SOUP CLAM CHOWDER,,2020-11-13
3,1105907,branded_food,CAMPBELL'S SLOW KETTLE SOUP CHEESE BROCCOLI,,2020-11-13
4,1105908,branded_food,SWANSON BROTH CHICKEN,,2020-11-13


In the dataframe, the column "description" contains the food item description.
For simplicity, we keep only food descriptions that do NOT contain special characters (anything other than letters and spaces), and we retain only descriptions of 1, 2, or 3 words.


In [7]:

foods = food_df[food_df["description"].str.contains("[^a-zA-Z ]") == False]["description"].apply(lambda food: food.lower())

# filter out foods with more than 3 words, drop any duplicates
foods = foods[foods.str.split().apply(len) <= 3].drop_duplicates()

# print the remaining size
print("food items total :" , foods.size)

# show first 5 food items
foods.head()

food items total : 40508


1          swanson broth beef
4       swanson broth chicken
25    pepperidge farm cookies
35      pepperidge farm bread
42    swanson broth vegetable
Name: description, dtype: object
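
To make the filtering rule concrete, here is a small plain-Python sketch of the same idea, applied to three of the descriptions shown in the preview. It is only an illustration: `keep_description` and `SPECIAL_CHARS` are hypothetical names that do not appear in the notebook, and the pandas cell above remains the code that actually builds `foods`.

```python
import re

# Same rule as the pandas filter, written in plain Python for illustration.
# `keep_description` is a hypothetical helper, not part of the notebook.
SPECIAL_CHARS = re.compile(r"[^a-zA-Z ]")

def keep_description(description, max_words=3):
    """Keep descriptions made only of letters and spaces, with 1 to max_words words."""
    if SPECIAL_CHARS.search(description):
        return False
    return 1 <= len(description.split()) <= max_words

samples = [
    "WESSON Vegetable Oil 1 GAL",                # rejected: contains a digit
    "SWANSON BROTH BEEF",                        # kept: letters and spaces, 3 words
    "CAMPBELL'S SLOW KETTLE SOUP CLAM CHOWDER",  # rejected: contains an apostrophe
]
kept = [s.lower() for s in samples if keep_description(s)]
print(kept)
```

Only "swanson broth beef" survives, which matches the first row of the filtered output above.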

Prepare separate lists of food entities with one word, two words, and three words.
This will be needed to simplify the way we create the training data.

In [9]:
one_worded_foods = foods[foods.str.split().apply(len) == 1]
two_worded_foods = foods[foods.str.split().apply(len) == 2]
three_worded_foods = foods[foods.str.split().apply(len) == 3]

In [12]:
one_worded_foods.head()

238    blueberries
314       crackers
501          dills
634          adobo
687        gummies
Name: description, dtype: object

In [13]:
two_worded_foods.head()

182       walnut butter
195       tita crackers
206      teriyaki sauce
214      dessert shells
236    italian dressing
Name: description, dtype: object

In [14]:
three_worded_foods.head()

1          swanson broth beef
4       swanson broth chicken
25    pepperidge farm cookies
35      pepperidge farm bread
42    swanson broth vegetable
Name: description, dtype: object

At this point, we want to create different template sentences with placeholders that we can insert our food entities into.
The training sentences should cater for mentioning one, two, or three food items.

In [15]:
food_templates = [
    "I ate my {}.",
    "I'm eating a {}.",
    "I just ate a {}.",
    "I only ate the {}.",
    "I'm done eating a {}",
    "I've already eaten a {}.",
    "I just finished my {}.",
    "When I was having lunch I ate a {}.",
    "I had a {} and a {} today",
    "I ate a {} and a {} for lunch",
    "I made a {} and {} for lunch",
    "I ate {} and {} just now",
    "today I ate a {} and a {} for lunch",
    "I had {} with my husband last night",
    "I brought you some {} on my birthday",
    "I made {} for yesterday's dinner",
    "last night, a {} was sent to me via Grabfood",
    "I had {} yesterday and I'd like to eat it anyway",
    "I ate a couple of {} last night",
    "I had some {} at dinner last night",
    "Last night, I ordered some {} as I was hungry",
    "I made a {} last night",
    "I had a bowl of {} with {} and I wanted to go to the mall today",
    "I brought a basket of {} for breakfast this morning",
    "I had a bowl of {} just now",
    "I ate a {} with {} in the morning",
    "I made a bowl of {} for my breakfast",
    "There's {} for breakfast in the bowl this morning",
    "I made a bowl of {} this morning",
    "I decided to have some {} as a little bonus",
    "I decided to enjoy some {} to relax",
    "I've decided to have some {} for dessert",
    "I had a {}, a {} and {} at home",
    "I took a {}, {} and {} on the weekend",
    "I ate a {} with {} and {} just now",
    "Last night, I ate an {} with {} and {} with my girl friend",
    "I tasted some {}, {} and {} at the office",
    "There's a basket of {}, {} and {} that I consumed",
    "I devoured a {}, {} and {} for dinner",
    "I've already had a bag of {}, {} and {} from the fridge"
]

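We need to prepare the data in the following format. Each item pairs a sentence with the character offsets (start inclusive, end exclusive) and the label of every entity in it.

In [16]:
# example
data = [
    ("I love chicken", {"entities": [(7, 14, "FOOD")]}),
    ("We ordered Sushi through Grabfood", {"entities": [(11, 16, "FOOD")]}),
]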
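
A quick way to check annotations of this kind is to slice each sentence with its offsets and confirm that the slice is exactly the entity text. The sketch below assumes the (text, {"entities": [...]}) layout that the generated data uses; `entity_texts` is a hypothetical helper written for this illustration.

```python
# Sanity-check character offsets: text[start:end] should be exactly the entity.
# The sentences and offsets are illustrative, not taken from the generated dataset.
examples = [
    ("I love chicken", {"entities": [(7, 14, "FOOD")]}),
    ("We ordered Sushi through Grabfood", {"entities": [(11, 16, "FOOD")]}),
]

def entity_texts(text, annotation):
    """Return the substring covered by each (start, end, label) triple."""
    return [(text[start:end], label) for start, end, label in annotation["entities"]]

for text, annotation in examples:
    print(entity_texts(text, annotation))
```

Because the end offset is exclusive, "chicken" spans 7 to 14 rather than 7 to 13.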

Next, we will generate the training data and convert it into the desired format.

#### Do Not Modify The Next Block of Code

In [19]:
## DO NOT MODIFY THIS BLOCK OF CODE.

## Create a list to store the generated sentences and their food entity positions.
## Do note that one_food != one_worded_food: one_food == "barbecue sauce", one_worded_food == "sauce".
## Rather, one_food refers to a sentence that talks about one type of food, regardless of the number of words
## in the food item.

# reset the data list
data = []
# the pattern to replace from the template sentences
pattern_to_replace = "{}"

# shuffle the data before starting
foods = foods.sample(frac=1)

# the count that helps us decide when to break from the for loop
food_entity_count = foods.size - 1

# start the while loop, ensure we don't get an index out of bounds error
while food_entity_count >= 2:
    entities = []

    # pick a random food template
    sentence = food_templates[random.randint(0, len(food_templates) - 1)]

    # find out how many braces "{}" need to be replaced in the template
    matches = re.findall(pattern_to_replace, sentence)

    # for each brace, replace with a food entity from the shuffled food data
    for match in matches:
        food = foods.iloc[food_entity_count]
        food_entity_count -= 1

        # replace the pattern, but then find the match of the food entity we just inserted
        sentence = sentence.replace(match, food, 1)
        match_span = re.search(food, sentence).span()

        # use that match to find the index positions of the food entity in the sentence, append
        entities.append((match_span[0], match_span[1], "FOOD"))

    # append the sentence and the positions of its entities to the data list
    data.append((sentence, {"entities": entities}))
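
One caveat with the block above: `re.search` returns the first occurrence of the food name, so when a sentence holds two foods and the later name also appears inside an earlier one (for example "rice cake" followed by "rice"), the recorded offsets can point at the earlier entity. Below is a minimal sketch of one way around this, assuming we record the position of each "{}" slot as we fill it. `fill_template` and `PLACEHOLDER` are hypothetical names for illustration and are not part of the original notebook.

```python
# A sketch of an alternative fill step that takes offsets from the position of
# the "{}" slot itself. `fill_template` is a hypothetical helper for illustration.
PLACEHOLDER = "{}"

def fill_template(template, foods):
    """Fill each "{}" in order and return (sentence, {"entities": [...]})."""
    sentence = template
    entities = []
    for food in foods:
        start = sentence.find(PLACEHOLDER)  # position of the next unfilled slot
        if start == -1:
            break  # more foods than slots: ignore the extras
        sentence = sentence[:start] + food + sentence[start + len(PLACEHOLDER):]
        entities.append((start, start + len(food), "FOOD"))
    return sentence, {"entities": entities}

print(fill_template("I ate {} and {} just now", ["rice cake", "rice"]))
```

Here the second entity is recorded at the slot it actually filled (20 to 24) instead of at the "rice" inside "rice cake".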

In [20]:
print ('Sample of the data contents\n')
for d in data[0:5]:
    print (d)

Sample of the data contents

('I brought you some sweetened diced cranberries on my birthday', {'entities': [(19, 46, 'FOOD')]})
('I decided to enjoy some cheddar to relax', {'entities': [(24, 31, 'FOOD')]})
('I tasted some spicy pickled mushrooms, homestyle cranberry sauce and holiday milk at the office', {'entities': [(14, 37, 'FOOD'), (39, 64, 'FOOD'), (69, 81, 'FOOD')]})
('When I was having lunch I ate a turkey stock concentrate.', {'entities': [(32, 56, 'FOOD')]})
('I just finished my hearty crispbread.', {'entities': [(19, 36, 'FOOD')]})


The next function converts the data (prepared in the format above) to the spaCy DocBin format.
**Do not change the next block of code.**

In [22]:
### Do NOT CHANGE THIS BLOCK OF CODES.

nlp = spacy.blank("en")  # load a new spacy model

def create_spacy3_training_data(DATA):
    db = DocBin()  # create a DocBin object
    for text, annot in tqdm(DATA):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []
        for start, end, label in annot["entities"]:  # add character indexes
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        db.add(doc)
    return db


### Task 1


Typically, a dataset needs to be split into training and validation sets.
The dataset that was generated is huge (about 25K sentences).
As a first step, slice it so that the first 1000 items are the training data and the next 200 are the validation data.

In [28]:
# Example to slice the first 10, use the following:

sample_slice = data[0:10]

In [23]:

# your codes
TRAIN_DATA= ?????
TEST_DATA= ??


### Task 2

From here onwards, convert the training and validation data sets to the spaCy DocBin format.



In [None]:
# your codes


### Task 3

Run the spaCy CLI commands with suitable parameters to start the NER training.


In [None]:
# your codes


### Task 4

Reload your new model. 
Test your new model against unseen sentences.


In [None]:
## your codes

sentences  = [ 'I ate wantan noodles and kopi for lunch', 
               'Last night, I ate instant noodles ',
               'I plan to eat beef steak and ice-cream for dinner.',
               'I ordered a chicken steak, beer and cheesecake for dinner']


### Solutions for Task 1, Task 2, and Task 3
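
The solution cells below use `TRAIN_DATA` and `TEST_DATA` from Task 1. Here is a minimal sketch of that slicing, assuming the first 1000 items go to training and the next 200 to validation, as described in Task 1. The `data` list built here is only a stand-in so the sketch runs on its own; in the notebook, `data` already holds the generated sentences and that line should be skipped.

```python
# Stand-in for the generated `data` list (skip this line in the notebook,
# where `data` already holds the generated (sentence, annotations) tuples).
data = [(f"sentence {i}", {"entities": []}) for i in range(1500)]

# First 1000 items for training, the next 200 for validation (no overlap).
TRAIN_DATA = data[0:1000]
TEST_DATA = data[1000:1200]

print(len(TRAIN_DATA), len(TEST_DATA))
```

Because `foods` was shuffled before the sentences were generated, taking consecutive slices is a reasonable way to get two disjoint samples.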

In [24]:
train_db = create_spacy3_training_data(TRAIN_DATA)
train_db.to_disk("./food_train.spacy") # save the docbin object

100%|██████████| 1000/1000 [00:00<00:00, 4470.60it/s]


In [25]:
test_db = create_spacy3_training_data(TEST_DATA[0:100])
test_db.to_disk("./food_dev.spacy") # save the docbin object

100%|██████████| 100/100 [00:00<00:00, 2333.34it/s]


In [26]:
!python -m spacy train config.cfg --output ./output --paths.train ./food_train.spacy --paths.dev food_dev.spacy

ℹ Using CPU
[2021-06-29 10:08:06,585] [INFO] Set up nlp object from config
[2021-06-29 10:08:06,589] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-06-29 10:08:06,590] [INFO] Created vocabulary
[2021-06-29 10:08:06,590] [INFO] Finished initializing nlp object
[2021-06-29 10:08:07,071] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     48.50    0.00    0.00    0.00    0.00
  0      50         15.43   1299.79   59.65   58.96   60.36    0.60
  0     100         13.13    166.88   97.30   98.78   95.86    0.97
  1     150          6.71     24.69   97.35   97.06   97.63    0.97
  1     200         13.57     28.88   97.91 

In [29]:
!ls -l ./output

total 0
drwxr-xr-x  8 tanpohkeam  staff  256 Jun 29 10:08 model-best
drwxr-xr-x  8 tanpohkeam  staff  256 Jun 29 10:08 model-last


In [30]:
best_nlp = spacy.load("./output/model-best")

sentences  = [ 'I ate wantan noodles and kopi for lunch', 
               'Last night, I ate instant noodles ',
               'I plan to eat beef steak and ice-cream for dinner.',
               'I ordered a chicken steak, beer and cheesecake for dinner']

for s in sentences:
   doc = best_nlp(s)
   displacy.render(doc, style="ent")