# NER model for mountain names
I trained a model, but it has some problems: it does not fully recognize mountain names. The training script is in model_training.py. The output of my model is shown below:

In [1]:
%run inference.py

[('next', 'O'), ('on', 'O'), ('our', 'O'), ('list', 'O'), ('is', 'O'), ('den', 'O'), ('##ali', 'O'), ('peak', 'O'), (',', 'O'), ('also', 'O'), ('known', 'O'), ('as', 'O'), ('mount', 'B-MOUNTAIN'), ('everest', 'O'), (',', 'O'), ('in', 'O'), ('alaska', 'O'), ('.', 'O')]
next: O
on: O
our: O
list: O
is: O
den: O
##ali: O
peak: O
,: O
also: O
known: O
as: O
mount: B-MOUNTAIN
everest: O
,: O
in: O
alaska: O
.: O
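
The per-token output above is hard to check by eye because the tokenizer splits words into pieces ("den", "##ali"). Below is a minimal sketch of one way to merge word pieces back into words and group tagged words into entity strings. The helper names `merge_wordpieces` and `extract_spans` are hypothetical and not part of inference.py. It also assumes the label set has only `B-MOUNTAIN` and `O`, so any run of consecutive tagged words is treated as one mountain name.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_wordpieces(pairs: List[Pair]) -> List[Pair]:
    """Join '##' continuation pieces onto the preceding token.

    The merged word keeps the label of its first piece.
    """
    words: List[Pair] = []
    for token, label in pairs:
        if token.startswith("##") and words:
            prev_token, prev_label = words[-1]
            words[-1] = (prev_token + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


def extract_spans(words: List[Pair], outside: str = "O") -> List[str]:
    """Group consecutive words whose label is not `outside` into one string."""
    spans: List[str] = []
    current: List[str] = []
    for word, label in words:
        if label != outside:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that runs to the end of the sentence
        spans.append(" ".join(current))
    return spans


# Token/label pairs copied from the inference output above.
pairs = [
    ('next', 'O'), ('on', 'O'), ('our', 'O'), ('list', 'O'), ('is', 'O'),
    ('den', 'O'), ('##ali', 'O'), ('peak', 'O'), (',', 'O'), ('also', 'O'),
    ('known', 'O'), ('as', 'O'), ('mount', 'B-MOUNTAIN'), ('everest', 'O'),
    (',', 'O'), ('in', 'O'), ('alaska', 'O'), ('.', 'O'),
]

print(extract_spans(merge_wordpieces(pairs)))  # ['mount']
```

Printing entity strings instead of single tokens makes the gap obvious: the only span found is "mount", with no "denali peak" and no "mount everest".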


The model tags only "mount" (from "Mount Everest") as B-MOUNTAIN. It misses "Everest" and every token of "Denali Peak" ("den", "##ali", "peak"). "Alaska" is tagged O, which is correct, because it is a state and not a mountain. This suggests the model needs more fine-tuning or a different strategy for handling entities, especially entities that span several tokens or word pieces. I think I should add more training data or fine-tune for longer. Luckily, there is a good ready-made solution, a pre-trained BERT model for this problem:

In [4]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("dieumerci/mountain-recognition-ner")
tokenizer = AutoTokenizer.from_pretrained("dieumerci/mountain-recognition-ner")
classifier = pipeline("ner", model=model, tokenizer=tokenizer)

text = "Next on our list is Denali Peak, also known as Mount McKinley, in Alaska."
result = classifier(text)

label_map = {
    'LABEL_0': 'O',      
    'LABEL_1': 'B-MOUNTAIN',  
}

# The pipeline returns one entry per word piece and leaves out the special
# tokens [CLS] and [SEP], so take each token and its label from the same entry.
# Matching a separately tokenized list by position would be off by one,
# because that list starts with [CLS].
token_label_pairs = [
    (entry['word'], label_map.get(entry['entity'], 'O'))  # unknown labels map to 'O'
    for entry in result
]

for token, label in token_label_pairs:
    print(f'{token}: {label}')
print(result)

Device set to use cpu


Next: O
on: O
our: O
list: O
is: O
Den: B-MOUNTAIN
##ali: B-MOUNTAIN
Peak: B-MOUNTAIN
,: O
also: O
known: O
as: O
Mount: B-MOUNTAIN
McKinley: B-MOUNTAIN
,: O
in: O
Alaska: O
.: O
[{'entity': 'LABEL_0', 'score': 0.999171, 'index': 1, 'word': 'Next', 'start': 0, 'end': 4}, {'entity': 'LABEL_0', 'score': 0.99950564, 'index': 2, 'word': 'on', 'start': 5, 'end': 7}, {'entity': 'LABEL_0', 'score': 0.9994672, 'index': 3, 'word': 'our', 'start': 8, 'end': 11}, {'entity': 'LABEL_0', 'score': 0.99964094, 'index': 4, 'word': 'list', 'start': 12, 'end': 16}, {'entity': 'LABEL_0', 'score': 0.99947625, 'index': 5, 'word': 'is', 'start': 17, 'end': 19}, {'entity': 'LABEL_1', 'score': 0.9991271, 'index': 6, 'word': 'Den', 'start': 20, 'end': 23}, {'entity': 'LABEL_1', 'score': 0.9990953, 'index': 7, 'word': '##ali', 'start': 23, 'end': 26}, {'entity': 'LABEL_1', 'score': 0.99852437, 'index': 8, 'word': 'Peak', 'start': 27, 'end': 31}, {'entity': 'LABEL_0', 'score': 0.9979856, 'index'