In [1]:
from transformers import BertForTokenClassification, pipeline, BertTokenizerFast
import pandas as pd

To test the model we use another Hugging Face utility: pipeline. It makes prediction easier because it handles tokenization, inference, and decoding of the output for us, so we do not need to preprocess the data manually.

In [2]:
label2id = {'O': 0, 'B-mount': 1, 'I-mount': 2} #labels are necessary for the model
id2label = {0: 'O', 1: 'B-mount', 2: 'I-mount'}
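
The two dictionaries must mirror each other, and the label names follow the common BIO tagging scheme: B- opens an entity, I- continues it, and O marks a token outside any entity. Below is a minimal sketch of deriving one mapping from the other so they cannot drift apart. The tagged sentence is a hypothetical illustration of the scheme, not actual model output.

```python
label2id = {"O": 0, "B-mount": 1, "I-mount": 2}

# Invert the mapping instead of typing it a second time
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical BIO tagging of a short sentence (illustration only)
tokens = ["ben", "nevis", "is", "high"]
tags = ["B-mount", "I-mount", "O", "O"]

# Convert the string tags to the integer ids the model works with
tag_ids = [label2id[tag] for tag in tags]
print(tag_ids)  # [1, 2, 0, 0]
```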

In [3]:
model_dir = "./model_save"  # directory with the saved model and tokenizer settings (renamed from dir to avoid shadowing the built-in)

First of all, we need to load the pretrained parameters of the model and the tokenizer.

In [4]:
# BertTokenizerFast is used instead of BertTokenizer because the slow tokenizer cannot handle aggregation_strategy="first" in a pipeline
tokenizer = BertTokenizerFast.from_pretrained(model_dir)
model = BertForTokenClassification.from_pretrained(model_dir,
                                                   num_labels=len(id2label),
                                                   id2label=id2label,
                                                   label2id=label2id)

In [5]:
pipe_ner = pipeline(task="token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")
# An aggregation strategy is set because we used WordPiece tokenization, so the raw model output is word pieces with their labels. aggregation_strategy="first" merges the pieces back into words, and each word takes the label of its first piece.
pipe_ner("Ben Nevis is a popular hiking destination, with 150,000 people a year visiting the peak.")

Device set to use cpu


[{'entity_group': 'mount',
  'score': 0.8071073,
  'word': 'ben',
  'start': 0,
  'end': 3},
 {'entity_group': 'mount',
  'score': 0.9702565,
  'word': 'nevis',
  'start': 4,
  'end': 9}]
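
To make the aggregation step more concrete, here is a simplified sketch of the idea behind the "first" strategy: word pieces are glued back into words, and each word keeps the label of its first piece. The function `aggregate_first` and the token split below are hypothetical and written only for illustration; the real pipeline also works with scores and character offsets.

```python
def aggregate_first(pieces):
    """Merge WordPiece tokens into words; each word keeps its first piece's label."""
    words = []
    for piece in pieces:
        if piece["token"].startswith("##") and words:
            # Continuation piece: append its text, ignore its own label
            words[-1]["word"] += piece["token"][2:]
        else:
            # First piece of a new word: its label becomes the word's label
            words.append({"word": piece["token"], "label": piece["label"]})
    return words


# Hypothetical word-piece output (the split and labels are made up)
pieces = [
    {"token": "kin", "label": "B-mount"},
    {"token": "##aba", "label": "I-mount"},
    {"token": "##lu", "label": "O"},
    {"token": "is", "label": "O"},
]
print(aggregate_first(pieces))
# [{'word': 'kinabalu', 'label': 'B-mount'}, {'word': 'is', 'label': 'O'}]
```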

We need a custom function to extract the mountain name(s) from the model's predictions.

In [6]:
def get_mountains_from_predictions(predictions):
    """Join all predicted entity words into one space-separated string."""
    # Collect the text of every entity returned by the pipeline
    words = [pred['word'] for pred in predictions]
    mountains = " ".join(words)

    # Merge leftover WordPiece fragments, then remove spaces before punctuation
    mountains = mountains.replace(" ##", "")
    mountains = mountains.replace(" ,", ",").replace(" .", ".").replace(" !", "!")
    return mountains

In [7]:
get_mountains_from_predictions(pipe_ner("Ben Nevis is a popular hiking destination, with 150,000 people a year visiting the peak."))

'ben nevis'
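
The helper joins every predicted word into one string, so several mountains in one sentence would end up merged. A possible variant, sketched under the assumption that each prediction carries the `start` and `end` character offsets seen in the pipeline output, groups adjacent predictions and slices the names from the original sentence, which also keeps the original casing. The function name `group_mountains` and the `max_gap` parameter are hypothetical.

```python
def group_mountains(sentence, predictions, max_gap=1):
    """Return a list of names, merging predictions at most max_gap characters apart."""
    spans = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        if spans and pred["start"] - spans[-1][1] <= max_gap:
            # Close enough to the previous span: extend it
            spans[-1][1] = pred["end"]
        else:
            # Otherwise start a new span
            spans.append([pred["start"], pred["end"]])
    # Slice the original text so casing and spacing are preserved
    return [sentence[start:end] for start, end in spans]


sentence = "Ben Nevis is a popular hiking destination, with 150,000 people a year visiting the peak."
predictions = [
    {"entity_group": "mount", "score": 0.8071073, "word": "ben", "start": 0, "end": 3},
    {"entity_group": "mount", "score": 0.9702565, "word": "nevis", "start": 4, "end": 9},
]
print(group_mountains(sentence, predictions))  # ['Ben Nevis']
```

Returning a list rather than a single string keeps separate mountains apart, at the cost of having to choose how large a gap still counts as the same name.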

Now we will try our model on a new dataset.

In [8]:
df = pd.read_csv("mountain_sentences.csv")

In [9]:
df

Unnamed: 0,Sentence
0,We took a road trip through the Ozarks.
1,Mount Kinabalu is on my travel list.
2,We spent the weekend exploring the city’s old ...
3,I’ve been dreaming of trekking through the Cau...
4,Our guide told us fascinating stories about th...
...,...
58,Elbrus is the highest point in Europe.
59,Camping by the river was a fantastic experience.
60,The view from Everest Base Camp is breathtaking.
61,"I just saw a documentary about the Himalayas, ..."


In [10]:
df["Mountain"] = ""  # start with empty strings so the column holds strings from the beginning

In [11]:
for i in range(len(df)):
  df.loc[i, "Mountain"] = get_mountains_from_predictions(pipe_ner(df.iloc[i,0]))



In [12]:
df.head(15)

Unnamed: 0,Sentence,Mountain
0,We took a road trip through the Ozarks.,ozarks
1,Mount Kinabalu is on my travel list.,kinabalu
2,We spent the weekend exploring the city’s old ...,
3,I’ve been dreaming of trekking through the Cau...,caucasus
4,Our guide told us fascinating stories about th...,
5,I saw the Northern Lights for the first time!,
6,I’ve been photographing wildflowers in the fie...,
7,The sunrise over the valley was unforgettable.,valley
8,I love hiking in the woods near my house.,
9,We watched the sunset over the Blue Ridge.,blue ridge


In [13]:
df.tail(15)

Unnamed: 0,Sentence,Mountain
48,Annapurna is one of the most dangerous climbs ...,annapurna
49,The Andes are a dream destination for trekkers.,andes
50,Have you ever heard of Rwenzori? It’s stunning.,rwenzori
51,Taking a boat ride through the mangroves was fun.,mangroves
52,I’ve been trying out different trails in the c...,
53,K2 remains one of the toughest peaks to conquer.,k2
54,We tried some local food at the farmer’s market.,
55,Our hike through the Alps was unforgettable.,alps
56,Ama Dablam is called the Matterhorn of the Him...,dablam matterhorn himalayas
57,Have you ever climbed up Mount Rainier?,rainier


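As we can see, the model finds the mountain names in most of these sentences, though not perfectly: it wrongly tags "valley" and "mangroves" as mountains, and it returns only "dablam" for Ama Dablam. I will save this dataset.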

In [14]:
df.to_csv("predictions.csv")

# Edge Cases

Let's go through some edge cases.

### 1. Typos

In [None]:
get_mountains_from_predictions(pipe_ner("I am looking forward to our trip to the HVrla"))  # intended: "I am looking forward to our trip to the Hoverla"

'hvrla'

In [26]:
get_mountains_from_predictions(pipe_ner("Should we visit Swss akps this weekends?"))  # intended: "Should we visit the Swiss Alps this weekend?"

'swss akps'

### 2. Overlap with other definitions

In [27]:
get_mountains_from_predictions(pipe_ner("Olympus is the highest point of Greece"))

'olympus'

In [33]:
get_mountains_from_predictions(pipe_ner("My favourite music band is Olympus"))

''

But when the same name comes first in the sentence, the model tags it as a mountain:

In [34]:
get_mountains_from_predictions(pipe_ner("Olympus is my favourite music band"))

'olympus'