### Idea
To build the dataset, I sent ChatGPT a series of prompts, each covering a different kind of sentence.

For each prompt, ChatGPT generated a sentence and then listed the mountain names that appear in it. This free-text format is the easiest for ChatGPT to produce, so it is less error-prone than asking it to generate more structured or already tokenized data.

I then combined all of ChatGPT's answers into `data/generated_dataset.txt`.

### Prompts Explanation


#### Mountain-Specific Prompt: 

```
Generate 200 different natural sentences that include the names of different mountains,  and then identify mentioned mountains. Present the output in the following format (each sentence with new line, without any numbering):
The hike to Everest or Elbrus is challenging but rewarding.
Everest, Elbrus
The view from the top of Mont Blanc is breathtaking.
Mont Blanc
```

This prompt explicitly asks for mountain names, so the model sees them in sentences where they are clearly mountain entities.


#### Mixed Prompt with Other Locations:

```
In same way generate 200 sentences which include different proper nouns of other places which are not mountains. Also sentences can contain mountains names and other places names. Present the output in the same format (leave blank line if there no mountains in sentence)
```

This prompt mixes mountain names with non-mountain locations, which makes the model more robust at telling mountains apart from other places.

#### Non-Location-Proper-Names Prompt:

```
in same way generate 100 sentences which include different proper nouns excepting places.
```

With these sentences, the model can learn to distinguish locations from other proper names.

#### Non-Mountain-Specific Prompt:

```
in same way generate another 100 sentences which include mountains names, but are obviously not related to mountains from context of the sentence. (so there no mountains name detected)
```

This prompt uses mountain names out of context, so the model can learn when a name does not refer to a mountain. These sentences therefore carry no mountain labels.

### Formatting data

Each sentence/mountain pair is transformed into a record that stores the character position of every entity as `[start, length, tag]`:
```
The journey to Denali is a test of endurance and skill.
Denali
```
$\downarrow$
```
{
    "text": "The journey to Denali is a test of endurance and skill.",
    "label": [[15, 6, "MOUNT"]]
}
```
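
As a quick check of the format above, the following minimal sketch computes one `[start, length, tag]` label and slices the sentence to confirm that the span recovers the entity. The helper name `make_label` is hypothetical and only illustrates the idea; it is not the function used in this notebook.

```python
def make_label(sentence, entity, tag="MOUNT"):
    """Return [start, length, tag] for the first occurrence of entity, or None."""
    start = sentence.find(entity)
    if start == -1:
        return None  # entity text does not appear in the sentence
    return [start, len(entity), tag]


sentence = "The journey to Denali is a test of endurance and skill."
label = make_label(sentence, "Denali")
start, length, _ = label

print(label)
print(sentence[start:start + length])  # slicing with the label recovers the entity
```

Storing the length rather than the end offset means the entity is always `text[start:start + length]`.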

In [135]:
import re


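def get_formatted_data(file_path):
    """Parse sentence / mountain-list line pairs into {"text", "label"} records."""
    with open(file_path, "r", encoding="UTF-8") as file:
        lines = file.readlines()

    formatted_data = []
    i = 0
    while i < len(lines):
        sentence = lines[i].strip()
        if sentence:
            i += 1
            # The line after a sentence lists its mountains (blank if none);
            # guard against a file that ends right after a sentence.
            mountains_line = lines[i].strip() if i < len(lines) else ""
            mountains = mountains_line.split(", ")
            labels = []
            if mountains[0] != "":  # The sentence has at least one mountain name
                for mountain in mountains:
                    # Only the first occurrence of each name is labeled
                    match = re.search(re.escape(mountain), sentence)
                    if match:
                        start_idx = match.start()
                        labels.append([start_idx, len(mountain), "MOUNT"])

            formatted_data.append({
                "text": sentence,
                "label": labels
            })
        i += 1

    return formatted_data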
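
`re.search` labels only the first occurrence of each name and also matches inside longer words. If that matters for a given dataset, one possible variant (a sketch, not what this notebook runs) collects every whole-word occurrence with `re.finditer`. The helper `find_all_spans` and the sample sentence are made up for illustration, and the `\b` word boundaries assume that names start and end with a letter or digit.

```python
import re


def find_all_spans(sentence, mountain, tag="MOUNT"):
    """Return [start, length, tag] for every whole-word occurrence of mountain."""
    # \b keeps "Everest" from matching inside a longer word such as "Everesting"
    pattern = r"\b" + re.escape(mountain) + r"\b"
    return [[m.start(), m.end() - m.start(), tag]
            for m in re.finditer(pattern, sentence)]


sentence = "Everesting is a cycling challenge named after Everest."
spans = find_all_spans(sentence, "Everest")
print(spans)
print([sentence[s:s + n] for s, n, _ in spans])
```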

In [136]:
data = get_formatted_data('data/generated_dataset.txt')

### Split Dataset

In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, test_size=0.15, random_state=42)
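
The split above is purely random. If the balance between sentences with zero, one, or two mountains matters, a stratified split is one option; scikit-learn's `train_test_split` accepts a `stratify` argument for this. The dependency-free sketch below shows the idea on toy records. The function `stratified_split` and the toy data are hypothetical and are not part of this notebook.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_size=0.15, seed=42):
    """Split samples so each mountain-count group sends about test_size to test."""
    groups = defaultdict(list)
    for sample in samples:
        groups[len(sample["label"])].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):
        group = groups[key][:]
        rng.shuffle(group)
        # Groups with a single sample stay in train; others give at least one to test
        n_test = max(1, round(len(group) * test_size)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data: 60 sentences without mountains, 40 with one mountain
toy = ([{"text": f"s{i}", "label": []} for i in range(60)]
       + [{"text": f"m{i}", "label": [[0, 1, "MOUNT"]]} for i in range(40)])
toy_train, toy_test = stratified_split(toy)
print(len(toy_train), len(toy_test))
```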

In [137]:
from collections import Counter
train_counts = Counter()
test_counts = Counter()
for item in train_data:
    train_counts[len(item['label'])] += 1
for item in test_data:
    test_counts[len(item['label'])] += 1

In [139]:
print("Mountain distribution in train sentences:", train_counts)
print("Mountain distribution in test sentences:", test_counts)

Mountain distribution in train sentences: Counter({0: 250, 1: 164, 2: 5})
Mountain distribution in test sentences: Counter({0: 42, 1: 30, 2: 2})


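The split is not stratified, so the proportions differ somewhat between the two sets, and sentences with two mountains are rare in both (5 in train, 2 in test). Still, both sets are dominated by zero- and one-mountain sentences in similar ratios, which is good enough here.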
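
To compare the two sets more directly, the counts printed above can be turned into percentages. This small sketch copies the two `Counter` values from the output; the helper `proportions` is a hypothetical name.

```python
from collections import Counter

# Counts copied from the output above
train_counts = Counter({0: 250, 1: 164, 2: 5})
test_counts = Counter({0: 42, 1: 30, 2: 2})


def proportions(counts):
    """Return the percentage of sentences per mountain count, rounded to 0.1."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in sorted(counts.items())}


print("train:", proportions(train_counts))
print("test: ", proportions(test_counts))
```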

In [154]:
import json
# Save datasets
with open('data/train_data.json', 'w') as f:
    json.dump(train_data, f)
with open('data/test_data.json', 'w') as f:
    json.dump(test_data, f)

### Tokenization

To use this data for NER training, each sentence must be tokenized and every token labeled with the BIO scheme used in CoNLL-style datasets: B-MOUNT (beginning of a mountain name), I-MOUNT (inside a mountain name), or O (outside any named entity). In the code below these tags are encoded as the integers 1, 2, and 0, respectively.
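
To make the tagging scheme concrete, here is a minimal sketch that assigns BIO tags to whitespace-separated words from `[start, length, tag]` spans. It deliberately ignores subword tokenization, which the notebook handles with the DistilBERT tokenizer; the function `bio_tags` is hypothetical.

```python
def bio_tags(sentence, spans):
    """Tag whitespace-separated words with B-/I-/O given [start, length, tag] spans."""
    tagged = []
    pos = 0
    for word in sentence.split():
        start = sentence.index(word, pos)  # character offset of this word
        end = start + len(word)
        pos = end
        tag = "O"
        for span_start, length, name in spans:
            span_end = span_start + length
            if start < span_end and end > span_start:  # word overlaps the span
                # The word that covers the span start gets B-, later words get I-
                tag = ("B-" if start <= span_start else "I-") + name
                break
        tagged.append((word, tag))
    return tagged


sentence = "The view from the top of Mont Blanc is breathtaking."
spans = [[sentence.find("Mont Blanc"), len("Mont Blanc"), "MOUNT"]]
for word, tag in bio_tags(sentence, spans):
    print(f"{word}\t{tag}")
```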

In [140]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

In [142]:
def tokenize_data(dataset):
    tokenized_data = []

    for sample in dataset:
        text = sample["text"]
        entities = sample["label"]
        # encode() adds [CLS] and [SEP] around the sentence tokens
        input_ids = tokenizer.encode(text)
        labels = [0] * len(input_ids)  # 0 = O (outside any entity)

        # Label mountain entities
        for start, length, _ in entities:
            # Number of tokens before the entity, plus 1 for the leading [CLS]
            start_token = len(tokenizer.tokenize(text[:start])) + 1
            # tokenize() adds no special tokens, so this counts only the entity's word pieces
            n_entity_tokens = len(tokenizer.tokenize(text[start:start + length]))

            labels[start_token] = 1  # B-MOUNT
            for idx in range(start_token + 1, start_token + n_entity_tokens):
                labels[idx] = 2  # I-MOUNT

        tokenized_data.append({
            'input_ids': input_ids,
            'labels': labels
        })
    return tokenized_data
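
Counting prefix tokens works when every entity starts at a token boundary. An alternative that does not rely on this is to align labels through character offsets, which fast Hugging Face tokenizers can return with `return_offsets_mapping=True`. The sketch below shows only the alignment step on hand-written offsets. The function `labels_from_offsets` is hypothetical, and the token split in the example is an assumed tokenization for illustration, not actual tokenizer output.

```python
def labels_from_offsets(offsets, entities):
    """Map [start, length, tag] entities to 0/1/2 labels using token char offsets."""
    labels = [0] * len(offsets)
    for ent_start, length, _ in entities:
        ent_end = ent_start + length
        first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # special tokens such as [CLS]/[SEP] have an empty span
            if tok_start >= ent_start and tok_end <= ent_end:
                labels[i] = 1 if first else 2  # B-MOUNT, then I-MOUNT
                first = False
    return labels


# Hand-written offsets for "Climb Denali." under an ASSUMED tokenization:
# [CLS] Climb Den ##ali . [SEP]
offsets = [(0, 0), (0, 5), (6, 9), (9, 12), (12, 13), (0, 0)]
entities = [[6, 6, "MOUNT"]]
print(labels_from_offsets(offsets, entities))
```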

In [143]:
tokenized_train_data = tokenize_data(train_data)
tokenized_test_data = tokenize_data(test_data)

In [144]:
import pandas as pd
from datasets import Dataset

train_ds = Dataset.from_pandas(pd.DataFrame(data=tokenized_train_data))
test_ds = Dataset.from_pandas(pd.DataFrame(data=tokenized_test_data))

Example:

In [146]:
example = train_ds[1]
input_ids = example["input_ids"]
labels = example["labels"]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print("Ids:", input_ids)
print("Tokens:", tokens)
print("Labels:", labels)

Ids: [101, 33745, 10124, 11053, 10142, 10474, 42235, 12898, 12221, 10111, 23704, 20783, 119, 102]
Tokens: ['[CLS]', 'Prague', 'is', 'known', 'for', 'its', 'beautiful', 'old', 'town', 'and', 'historic', 'architecture', '.', '[SEP]']
Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
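
To read such examples more easily, the integer labels can be decoded back into entity strings. The sketch below does this for the Prague example printed above and for a second, hand-written example whose tokenization is assumed for illustration. The names `ID2TAG` and `extract_entities` are hypothetical.

```python
ID2TAG = {0: "O", 1: "B-MOUNT", 2: "I-MOUNT"}


def extract_entities(tokens, labels):
    """Group tokens labeled 1 (B) and 2 (I) into entity strings, merging word pieces."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:  # a new entity starts here
            if current:
                entities.append(current)
            current = [token]
        elif label == 2 and current:  # continue the open entity
            current.append(token)
        else:  # outside any entity: close the open one, if any
            if current:
                entities.append(current)
            current = []
    if current:
        entities.append(current)
    # Join tokens with spaces, then glue "##" word pieces back onto the previous token
    return [" ".join(entity).replace(" ##", "") for entity in entities]


# Tokens and labels copied from the output above
prague_tokens = ['[CLS]', 'Prague', 'is', 'known', 'for', 'its', 'beautiful', 'old',
                 'town', 'and', 'historic', 'architecture', '.', '[SEP]']
prague_labels = [0] * 14
print(extract_entities(prague_tokens, prague_labels))

# Hand-written example with an ASSUMED tokenization
demo_tokens = ['[CLS]', 'Mont', 'Blanc', 'and', 'Den', '##ali', '.', '[SEP]']
demo_labels = [0, 1, 2, 0, 1, 2, 0, 0]
print([ID2TAG[label] for label in demo_labels])
print(extract_entities(demo_tokens, demo_labels))
```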


In [148]:
# Save data
train_ds.save_to_disk("data/train_dataset")
test_ds.save_to_disk("data/test_dataset")
