The Dataset
MIT Restaurant Dataset
https://groups.csail.mit.edu/sls/downloads/restaurant/

https://huggingface.co/datasets/tner/mit_restaurant


In [1]:

import warnings
warnings.filterwarnings('ignore')

! pip install -U transformers
! pip install -U accelerate
! pip install -U datasets




In [2]:

import pandas as pd
import json
import requests

In [3]:
train = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/mit_restaurant_search_ner/train.bio", sep="\t", header=None)
train.head()

| | 0 | 1 |
|---|---|---|
| 0 | B-Rating | 2 |
| 1 | I-Rating | start |
| 2 | O | restaurants |
| 3 | O | with |
| 4 | B-Amenity | inside |


In [14]:
response = requests.get("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/mit_restaurant_search_ner/train.bio")
response = response.text
response = response.splitlines()
train_tokens = []
train_tags = []

temp_tokens = []
temp_tags = []
for line in response:
    if line != "":
        tag, token = line.strip().split("\t")
        temp_tags.append(tag)
        temp_tokens.append(token)
    else:
        train_tokens.append(temp_tokens)
        train_tags.append(temp_tags)

        temp_tokens, temp_tags = [], []

print(len(train_tokens), len(train_tags))



response = requests.get("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/mit_restaurant_search_ner/test.bio")
response = response.text
response = response.splitlines()

test_tokens = []
test_tags = []

temp_tokens = []
temp_tags = []
for line in response:
    if line != "":
        tag, token = line.strip().split("\t")
        temp_tags.append(tag)
        temp_tokens.append(token)
    else:
        test_tokens.append(temp_tokens)
        test_tags.append(temp_tags)

        temp_tokens, temp_tags = [], []

print(len(test_tokens), len(test_tags))

from datasets import Dataset, DatasetDict

df = pd.DataFrame({'tokens': train_tokens, 'ner_tags_str': train_tags})
train = Dataset.from_pandas(df)

df = pd.DataFrame({'tokens': test_tokens, 'ner_tags_str': test_tags})
test = Dataset.from_pandas(df)

dataset = DatasetDict({'train': train, 'test': test, 'validation': test})

print(dataset)
print(dataset['train'][0])

7659 7659
1520 1520
DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags_str'],
        num_rows: 7659
    })
    test: Dataset({
        features: ['tokens', 'ner_tags_str'],
        num_rows: 1520
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags_str'],
        num_rows: 1520
    })
})
{'tokens': ['2', 'start', 'restaurants', 'with', 'inside', 'dining'], 'ner_tags_str': ['B-Rating', 'I-Rating', 'O', 'O', 'B-Amenity', 'I-Amenity']}


I've loaded and processed the MIT Restaurant Search Named Entity Recognition (NER) dataset. So far I have:

- Downloaded the training and test data from GitHub
- Parsed the BIO-formatted files into tokens and tags
- Created a Hugging Face DatasetDict with train, test, and validation splits (the validation split reuses the test data)
- Verified the data structure by examining the first example
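
The parsing loop above appends a sentence only when it reaches a blank line, so a final sentence that is not followed by a blank line would be skipped. Here is a minimal sketch of the same parsing as a reusable function that also flushes the last sentence. `parse_bio` is a hypothetical helper name and the sample string is made up for illustration; neither is part of the original notebook.

```python
def parse_bio(lines):
    """Split tab-separated 'TAG<TAB>token' lines into per-sentence lists."""
    all_tokens, all_tags = [], []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if line:
            tag, token = line.split("\t")
            tags.append(tag)
            tokens.append(token)
        elif tokens:
            # A blank line closes the current sentence.
            all_tokens.append(tokens)
            all_tags.append(tags)
            tokens, tags = [], []
    if tokens:
        # Flush the last sentence when there is no trailing blank line.
        all_tokens.append(tokens)
        all_tags.append(tags)
    return all_tokens, all_tags


# Illustrative input: two sentences, no blank line after the second one.
sample = "B-Rating\t2\nI-Rating\tstart\nO\trestaurants\n\nO\tany\nB-Cuisine\tthai"
sample_tokens, sample_tags = parse_bio(sample.splitlines())
print(sample_tokens)
print(sample_tags)
```

The same function could be called once for the training file and once for the test file, which would also remove the duplicated loop.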

The dataset contains tokens from restaurant searches and corresponding NER tags in BIO format, where:

- B-TAG marks the Beginning of an entity
- I-TAG marks the Inside (continuation) of an entity
- O marks tokens Outside any entity
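
To make the BIO scheme concrete, here is a small sketch that groups tags back into entity spans, using the first training example shown above. `bio_to_spans` is a hypothetical helper written for this illustration, and treating a stray `I-` tag as the start of an entity is an assumption of the sketch.

```python
def bio_to_spans(tags):
    """Return (entity_type, start, end) tuples; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the same type extends the entity that is open.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Anything else closes the open entity, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag != "O":
            # B- starts a new entity; a stray I- is treated as a start too.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


example_tokens = ['2', 'start', 'restaurants', 'with', 'inside', 'dining']
example_tags = ['B-Rating', 'I-Rating', 'O', 'O', 'B-Amenity', 'I-Amenity']
for entity_type, start, end in bio_to_spans(example_tags):
    print(entity_type, example_tokens[start:end])
```

For this example the sketch yields a Rating span covering "2 start" and an Amenity span covering "inside dining".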

In [15]:
unique_tags = set()
for tag in dataset['train']['ner_tags_str']:
    unique_tags.update(tag)

unique_tags = list(set([x[2:] for x in list(unique_tags) if x!='O']))

tag2index = {"O": 0}
for i, tag in enumerate(unique_tags):
    tag2index[f'B-{tag}'] = len(tag2index)
    tag2index[f'I-{tag}'] = len(tag2index)

index2tag = {v:k for k,v in tag2index.items()}
dataset = dataset.map(lambda example: {"ner_tags": [tag2index[tag] for tag in example['ner_tags_str']]})
dataset

Map:   0%|          | 0/7659 [00:00<?, ? examples/s]

Map:   0%|          | 0/1520 [00:00<?, ? examples/s]

Map:   0%|          | 0/1520 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags_str', 'ner_tags'],
        num_rows: 7659
    })
    test: Dataset({
        features: ['tokens', 'ner_tags_str', 'ner_tags'],
        num_rows: 1520
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags_str', 'ner_tags'],
        num_rows: 1520
    })
})

In this cell, I:

- Extracted the unique entity types from the dataset (dropping the B-/I- prefixes and the 'O' tag)
- Created a mapping from tag strings to numeric indices (tag2index)
- Created a reverse mapping from indices to tags (index2tag)
- Added a new 'ner_tags' field to the dataset with the numeric tag IDs
- Applied this transformation to all splits (train, test, validation)

This approach handles the BIO format correctly by:

- Setting 'O' (Outside) tags to index 0
- Creating separate indices for both B- (Beginning) and I- (Inside) versions of each entity type

This type of preprocessing is essential for NER models that require numeric tag IDs rather than string labels. The dataset is now in good shape for training a sequence labeling model.
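
One caveat: `unique_tags` is built from a `set`, so the order of the entity types, and with it the numeric ID of each tag, is not guaranteed to be the same from one Python session to the next. Below is a minimal sketch of a deterministic variant that sorts the entity types first. `build_tag_maps` is a hypothetical helper, and the two tag sequences are a small subset taken from the examples in this notebook, not the full tag set.

```python
def build_tag_maps(tag_sequences):
    """Build tag<->index maps with 'O' at 0 and entity types in sorted order."""
    entity_types = sorted(
        {tag[2:] for sequence in tag_sequences for tag in sequence if tag != "O"}
    )
    tag_to_index = {"O": 0}
    for entity_type in entity_types:
        tag_to_index[f"B-{entity_type}"] = len(tag_to_index)
        tag_to_index[f"I-{entity_type}"] = len(tag_to_index)
    index_to_tag = {index: tag for tag, index in tag_to_index.items()}
    return tag_to_index, index_to_tag


demo_sequences = [
    ['B-Rating', 'I-Rating', 'O', 'O', 'B-Amenity', 'I-Amenity'],
    ['B-Rating', 'I-Rating', 'O', 'B-Location', 'I-Location', 'I-Location'],
]
demo_tag2index, demo_index2tag = build_tag_maps(demo_sequences)
print(demo_tag2index)
```

IDs produced this way will differ from the ones in the outputs shown in this notebook, which come from the unsorted version.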

In [21]:
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
dataset['train'][2]

{'tokens': ['5', 'star', 'resturants', 'in', 'my', 'town'],
 'ner_tags_str': ['B-Rating',
  'I-Rating',
  'O',
  'B-Location',
  'I-Location',
  'I-Location'],
 'ner_tags': [1, 2, 0, 13, 14, 14]}

I've loaded the DistilBERT tokenizer. This example shows:

- Tokens: ['5', 'star', 'resturants', 'in', 'my', 'town']
- NER tags in string format: ['B-Rating', 'I-Rating', 'O', 'B-Location', 'I-Location', 'I-Location']
- NER tags converted to numeric IDs: [1, 2, 0, 13, 14, 14]

From this example, I can see:

- "5 star" is tagged as a Rating entity
- "in my town" has "in" tagged as B-Location and "my town" as I-Location

In [29]:
# Use a name other than `input` to avoid shadowing the Python builtin
words = dataset['train'][2]['tokens']
output = tokenizer(words, is_split_into_words=True)
tokenizer.convert_ids_to_tokens(output.input_ids)



['[CLS]', '5', 'star', 'rest', '##ura', '##nts', 'in', 'my', 'town', '[SEP]']
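
The `##` prefix marks a piece that continues the previous one. Here is a toy sketch of greedy longest-match-first WordPiece splitting, which is how I understand the split above to come about. The vocabulary is a tiny hand-written assumption (the real DistilBERT vocabulary is far larger, and the real tokenizer also normalizes text and adds `[CLS]`/`[SEP]`), and `wordpiece_split` is a hypothetical name.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedily take the longest vocabulary entry at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"5", "star", "rest", "##ura", "##nts", "in", "my", "town"}
print(wordpiece_split("resturants", toy_vocab))
print(wordpiece_split("town", toy_vocab))
```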

In [30]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            # if id=-100 then loss is not calculated
            if word_idx is None:
                label_ids.append(-100)

            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])

            else:
                label_ids.append(-100)

            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs['labels'] = labels

    return tokenized_inputs


In [31]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)


Map:   0%|          | 0/7659 [00:00<?, ? examples/s]

Map:   0%|          | 0/1520 [00:00<?, ? examples/s]

Map:   0%|          | 0/1520 [00:00<?, ? examples/s]

In [32]:
tokenized_dataset['train'][2]

{'tokens': ['5', 'star', 'resturants', 'in', 'my', 'town'],
 'ner_tags_str': ['B-Rating',
  'I-Rating',
  'O',
  'B-Location',
  'I-Location',
  'I-Location'],
 'ner_tags': [1, 2, 0, 13, 14, 14],
 'input_ids': [101, 1019, 2732, 2717, 4648, 7666, 1999, 2026, 2237, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 1, 2, 0, -100, -100, 13, 14, 14, -100]}

I implemented tokenization with label alignment.

First, I tested how the tokenizer splits the example:

- Original tokens: ['5', 'star', 'resturants', 'in', 'my', 'town']
- Tokenized: ['[CLS]', '5', 'star', 'rest', '##ura', '##nts', 'in', 'my', 'town', '[SEP]']
- Notice how "resturants" gets split into 3 tokens: "rest", "##ura", "##nts"

Then I implemented tokenize_and_align_labels which:

- Tokenizes the input text
- Aligns the original labels with the tokenized input
- Uses -100 for special tokens ([CLS], [SEP]) and subword continuation tokens (##...)
- Only assigns actual labels to the first token of each word


My mapped dataset now contains:

- Original tokens and tags
- Tokenized input_ids and attention_mask
- Aligned labels with -100 for tokens that should be ignored during loss calculation



This alignment strategy is important because:

- It ensures only the first token of each word contributes to the loss (the model still produces a prediction for every token)
- It prevents the model from being penalized for predicting on subword pieces
- It maintains the original label sequence during evaluation
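
The alignment rule can be checked in isolation on the example above. This is a minimal sketch in which the `word_ids` list is written out by hand to mirror what `tokenized_inputs.word_ids()` would return for this sentence (an assumption, since no tokenizer is loaded here); `align_labels` is a hypothetical standalone version of the inner loop.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Give each word's label to its first sub-token and ignore the rest."""
    aligned, previous_word_id = [], None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First sub-token of a word
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token (##...)
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# [CLS] 5 star rest ##ura ##nts in my town [SEP]
example_word_ids = [None, 0, 1, 2, 2, 2, 3, 4, 5, None]
example_word_labels = [1, 2, 0, 13, 14, 14]
print(align_labels(example_word_ids, example_word_labels))
```

The result matches the `labels` field shown for `tokenized_dataset['train'][2]`.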

## Data Collation and Metrics

In [33]:
!pip install seqeval
!pip install evaluate

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=4541d7dc3b548085bf29b0d987d2ba19ca1b7206201a0a77d23a14d3f48cc46e
  Stored in directory: /root/.cache/pip/wheels/bc/92/f0/243288f899c2eacdfa8c5f9aede4c71a9bad0ee26a01dc5ead
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [34]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [35]:
import evaluate
import numpy as np

metric = evaluate.load('seqeval')
label_names = list(tag2index)

def compute_metrics(eval_preds):
    logits, labels = eval_preds

    predictions = np.argmax(logits, axis=-1)
    true_labels = [[label_names[l] for l in label if l!=-100] for label in labels]

    true_predictions = [[label_names[p] for p, l in zip(prediction, label) if l != -100]
                        for prediction, label in zip(predictions, labels)]

    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)

    return {
        "precision": all_metrics['overall_precision'],
        'recall': all_metrics['overall_recall'],
        'f1': all_metrics['overall_f1'],
        'accuracy': all_metrics['overall_accuracy'],
    }

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

I've set up the evaluation components for the NER model:

Created a DataCollatorForTokenClassification from Hugging Face's transformers library, which will:

- Handle padding to the maximum length in each batch
- Properly mask padded tokens in the attention mask
- Make sure labels for padded positions are set to -100 (ignored in loss calculation)
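
As a rough picture of what that padding amounts to, here is a plain-Python sketch. It assumes right-padding and a pad token ID of 0, and it works on lists, whereas the real collator returns tensors. `pad_batch` and the second, shorter feature are made up for illustration.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every field to the longest sequence in the batch."""
    max_len = max(len(feature["input_ids"]) for feature in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for feature in features:
        pad = max_len - len(feature["input_ids"])
        batch["input_ids"].append(feature["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append(feature["attention_mask"] + [0] * pad)
        # Padded label positions are ignored by the loss.
        batch["labels"].append(feature["labels"] + [label_pad_id] * pad)
    return batch


features = [
    {   # the example from this notebook
        "input_ids": [101, 1019, 2732, 2717, 4648, 7666, 1999, 2026, 2237, 102],
        "attention_mask": [1] * 10,
        "labels": [-100, 1, 2, 0, -100, -100, 13, 14, 14, -100],
    },
    {   # a shorter, made-up example
        "input_ids": [101, 1019, 2732, 102],
        "attention_mask": [1] * 4,
        "labels": [-100, 1, 2, -100],
    },
]
padded = pad_batch(features)
print(padded["attention_mask"][1])
print(padded["labels"][1])
```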


Loaded the seqeval metric, which is specifically designed for evaluating sequence labeling tasks like NER by:

- Computing precision, recall, and F1 at the entity level (the reported accuracy is computed per token)
- Considering the entire entity span rather than just token-level predictions
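
The difference between entity-level and token-level scoring shows up as soon as one entity is cut short. The sketch below is a simplified, exact-match version written for illustration; it is not seqeval's implementation, which covers more tagging schemes and edge cases. The predicted tags are invented.

```python
def entity_spans(tags):
    """Collect (type, start, end) spans from BIO tags as a set."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = start is not None and tag == "I-" + tags[start][2:]
        if start is not None and not continues:
            spans.add((tags[start][2:], start, i))
            start = None
        if start is None and tag != "O":
            start = i
    return spans


true_tags = ['B-Rating', 'I-Rating', 'O', 'B-Location', 'I-Location', 'I-Location']
pred_tags = ['B-Rating', 'I-Rating', 'O', 'B-Location', 'I-Location', 'O']  # invented

true_spans, pred_spans = entity_spans(true_tags), entity_spans(pred_tags)
correct = len(true_spans & pred_spans)
precision = correct / len(pred_spans) if pred_spans else 0.0
recall = correct / len(true_spans) if true_spans else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)

print(f"entity F1 = {f1:.2f}, token accuracy = {token_accuracy:.2f}")
```

Here five of six tokens are right, but only one of the two entities matches exactly, so the entity-level F1 is much lower than the token accuracy.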


Defined a compute_metrics function that:

- Converts logits to predictions by taking the argmax
- Filters out ignored positions (-100)
- Maps numeric IDs back to the original tag strings
- Computes and returns the overall metrics
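
The first three of those steps can be traced by hand on a tiny input. This is a dependency-free sketch with invented logits and a three-label subset; `decode_predictions` is a hypothetical name and mirrors only the decoding part of `compute_metrics`, not the metric call.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def decode_predictions(logits, labels, label_names, ignore_index=-100):
    """Turn logits into tag strings, dropping positions labelled ignore_index."""
    true_predictions, true_labels = [], []
    for sentence_logits, sentence_labels in zip(logits, labels):
        predicted_ids = [argmax(row) for row in sentence_logits]
        kept = [(p, l) for p, l in zip(predicted_ids, sentence_labels) if l != ignore_index]
        true_predictions.append([label_names[p] for p, _ in kept])
        true_labels.append([label_names[l] for _, l in kept])
    return true_predictions, true_labels


demo_label_names = ["O", "B-Rating", "I-Rating"]
# One sentence: [CLS] 5 star [SEP], with invented scores per label
demo_logits = [[[0.9, 0.1, 0.0], [0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [0.5, 0.4, 0.3]]]
demo_labels = [[-100, 1, 2, -100]]

demo_predictions, demo_references = decode_predictions(demo_logits, demo_labels, demo_label_names)
print(demo_predictions)
print(demo_references)
```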



This evaluation setup handles the BIO tagging scheme and the label alignment strategy implemented earlier. The seqeval metric evaluates entity-level performance rather than just token-level accuracy, which is crucial for NER tasks.

## Model Training

I've initialized the DistilBERT model for token classification (NER):

- using AutoModelForTokenClassification.from_pretrained() to load the pretrained DistilBERT weights
- passing in index2tag dictionary as id2label to map prediction indices to tag names
- passing in tag2index dictionary as label2id to map tag names to indices

The model is now configured with the correct number of output classes to match the NER tag set. It adds a token classification layer on top of the DistilBERT encoder, with an output dimension equal to the number of NER tags in the dataset.

In [36]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(model_ckpt, id2label=index2tag, label2id=tag2index)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


I set up the training pipeline with the Hugging Face Trainer API.

Configured TrainingArguments with:

- Output directory: "finetuned-ner"
- Evaluation and checkpoint saving after each epoch
- Learning rate of 2e-5, a common choice for fine-tuning transformer models
- 3 training epochs
- Weight decay of 0.01 for regularization


Initialized the Trainer with:

- Token classification model
- Training arguments
- Training and validation datasets
- Data collator for batching and padding
- Evaluation metrics function
- Tokenizer, so that it is saved alongside the model checkpoints


Started the training process with trainer.train()

The model trains for 3 epochs on the NER dataset. After each epoch, it is evaluated on the validation set (here the same data as the test split) and the metrics (precision, recall, F1, accuracy) are reported.

In [37]:
from transformers import TrainingArguments, Trainer

# Note: newer transformers releases name this argument `eval_strategy`;
# `evaluation_strategy` is the older, deprecated spelling.
args = TrainingArguments("finetuned-ner", evaluation_strategy='epoch',
                         save_strategy='epoch',
                         learning_rate=2e-5,
                         num_train_epochs=3,
                         weight_decay=0.01)
# Passing the tokenizer means it is saved together with each checkpoint.
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset=tokenized_dataset['validation'],
                  data_collator=data_collator,
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer)
trainer.train()


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.6426 | 0.306422 | 0.736982 | 0.790794 | 0.76294 | 0.908408 |
| 2 | 0.25 | 0.285927 | 0.761185 | 0.799365 | 0.779808 | 0.913672 |
| 3 | 0.2058 | 0.286526 | 0.765738 | 0.803175 | 0.78401 | 0.915848 |


TrainOutput(global_step=2874, training_loss=0.31773001100756515, metrics={'train_runtime': 5661.1525, 'train_samples_per_second': 4.059, 'train_steps_per_second': 0.508, 'total_flos': 105239751014754.0, 'train_loss': 0.31773001100756515, 'epoch': 3.0})
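
The `global_step` and throughput figures in this output can be sanity-checked with a little arithmetic. The sketch assumes the Trainer's default per-device batch size of 8 on a single device, since no batch size is set in the arguments above.

```python
import math

num_train_examples = 7659   # rows in the train split
num_epochs = 3
batch_size = 8              # assumption: default batch size, one device
train_runtime = 5661.1525   # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under that assumption, the computed step count agrees with `global_step=2874`, and the two rates agree with the reported `train_samples_per_second` and `train_steps_per_second` after rounding.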

In [38]:
trainer.save_model("ner_with_distilbert")

In [39]:
from transformers import pipeline

checkpoint = "ner_with_distilbert"
pipe = pipeline('token-classification', model=checkpoint, aggregation_strategy='simple')

Device set to use cpu


In [40]:
pipe("which restaurant serves the best shushi in new york?")

[{'entity_group': 'Rating',
  'score': np.float32(0.9691859),
  'word': 'best',
  'start': 28,
  'end': 32},
 {'entity_group': 'Dish',
  'score': np.float32(0.89437824),
  'word': 'shushi',
  'start': 33,
  'end': 39},
 {'entity_group': 'Location',
  'score': np.float32(0.9060159),
  'word': 'new york',
  'start': 43,
  'end': 51}]
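
The `start` and `end` values are character offsets into the input string, with `end` exclusive, so each entity can be recovered by slicing the original text. Here is a small sketch using the output above, with the scores left out for brevity:

```python
query = "which restaurant serves the best shushi in new york?"
query_entities = [
    {"entity_group": "Rating", "word": "best", "start": 28, "end": 32},
    {"entity_group": "Dish", "word": "shushi", "start": 33, "end": 39},
    {"entity_group": "Location", "word": "new york", "start": 43, "end": 51},
]

for entity in query_entities:
    # Slice the original string with the reported offsets.
    surface = query[entity["start"]:entity["end"]]
    print(entity["entity_group"], repr(surface))
```

Slicing the original text rather than reading `word` is useful when the input has capital letters, because the uncased tokenizer lowercases the `word` field.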

In [41]:
from huggingface_hub import login

In [42]:
login()


In [44]:

# Prepare the model and tokenizer
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the saved model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("ner_with_distilbert")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # Or load it from the saved model directory

# Push to the Hub
# Set your Hugging Face username
username = "niruthiha"
model_name = "restaurant-ner"
repo_name = f"{username}/{model_name}"

# Push the model and tokenizer
model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

# You can also add a model card with description
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    language="en",
    license="mit",
    tags=["token-classification", "ner", "restaurant-search"],
    datasets=["mit_restaurant_search_ner"],
    model_name=model_name
)

card = ModelCard.from_template(
    card_data,
    model_description="""
    This model performs Named Entity Recognition (NER) on restaurant search queries.
    It can identify entities such as Rating, Location, Cuisine, Amenity, etc.

    The model was fine-tuned using DistilBERT on the MIT Restaurant Search NER dataset.
    """,
    model_usage="""
    from transformers import pipeline

    nlp = pipeline('token-classification',
                   model='your-username/restaurant-ner',
                   aggregation_strategy='simple')

    result = nlp("I want a 5 star Italian restaurant in Boston with outdoor seating")
    print(result)
    """
)

card.push_to_hub(repo_name)

model.safetensors:   0%|          | 0.00/266M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/niruthiha/restaurant-ner/commit/50f7ee6ccbc543b13a91ccc4b52989b029ffcdbb', commit_message='Upload README.md with huggingface_hub', commit_description='', oid='50f7ee6ccbc543b13a91ccc4b52989b029ffcdbb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/niruthiha/restaurant-ner', endpoint='https://huggingface.co', repo_type='model', repo_id='niruthiha/restaurant-ner'), pr_revision=None, pr_num=None)

In [1]:
from transformers import pipeline

nlp = pipeline('token-classification',
               model='niruthiha/restaurant-ner',
               aggregation_strategy='simple')

result = nlp("I want a 5 star Italian restaurant in Boston with outdoor seating")
print(result)



config.json:   0%|          | 0.00/1.31k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/266M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Device set to use cpu


[{'entity_group': 'Rating', 'score': 0.9812622, 'word': '5 star', 'start': 9, 'end': 15}, {'entity_group': 'Cuisine', 'score': 0.99342704, 'word': 'italian', 'start': 16, 'end': 23}, {'entity_group': 'Location', 'score': 0.8756751, 'word': 'boston', 'start': 38, 'end': 44}, {'entity_group': 'Amenity', 'score': 0.98906565, 'word': 'outdoor seating', 'start': 50, 'end': 65}]
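
A downstream search feature would probably want this list as a structured filter rather than raw pipeline output. Here is a minimal sketch that groups the entities by type and drops low-confidence ones; `entities_to_filters` and the 0.9 threshold are hypothetical choices for illustration, not part of the model or the pipeline.

```python
from collections import defaultdict


def entities_to_filters(entities, min_score=0.5):
    """Group entity words by type, keeping only confident predictions."""
    filters = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            filters[entity["entity_group"]].append(entity["word"])
    return dict(filters)


# The pipeline output shown above
pipeline_result = [
    {'entity_group': 'Rating', 'score': 0.9812622, 'word': '5 star', 'start': 9, 'end': 15},
    {'entity_group': 'Cuisine', 'score': 0.99342704, 'word': 'italian', 'start': 16, 'end': 23},
    {'entity_group': 'Location', 'score': 0.8756751, 'word': 'boston', 'start': 38, 'end': 44},
    {'entity_group': 'Amenity', 'score': 0.98906565, 'word': 'outdoor seating', 'start': 50, 'end': 65},
]

print(entities_to_filters(pipeline_result))
print(entities_to_filters(pipeline_result, min_score=0.9))
```

With the stricter threshold the Location entity (score about 0.88) is dropped, which shows why a cutoff needs tuning against real queries.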
