At first, I decided to use GPT for data generation. It produces good data, but only in small batches, which makes the process time-consuming.
As a first step, I generated 100 sentences for testing.
Next, I searched Kaggle and found a dataset with mountain names, coordinates, and other data, and decided to use it for my work.
First, I loaded the dataset and displayed the number of unique mountain names it contains.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('Mountain.csv')

# Counting the number of unique mountain names
num_mountains = df['Mountain'].nunique()
print(f"Number of unique mountain names: {num_mountains}")

Number of unique mountain names: 1621
I decided that it would be better to train the model on real data rather than GPT-generated data.
I considered where I could find texts about a large number of mountains and settled on Wikipedia.
I wrote a script that extracts the required amount of mountain data using the Wikipedia API.
In this script, you can adjust the number of mountains to extract: I started with 400 and eventually settled on 1200.
You can also control how many characters to extract from each mountain's article: I started with 2000 and eventually settled on 500.
Increasing the number of mountains improved the result. Since the essential information about a mountain typically appears in the first sentences of its article, I extracted only 500 characters (although I think even fewer would have been enough).
The naming could also be improved, since manually changing the variable name every time the number of extracted mountains changes is inconvenient.
For example, the variable mountain_names_600 below actually holds the first 1200 names.
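
One possible way around the renaming problem, sketched under the assumption that the limits are kept in two constants. NUM_MOUNTAINS, MAX_CHARS, and select_names are hypothetical names that do not appear in the original script:

```python
# Hypothetical configuration constants: change the values, not the variable names.
NUM_MOUNTAINS = 1200   # how many mountain names to take from the CSV column
MAX_CHARS = 500        # how many characters to keep per Wikipedia article


def select_names(all_names, limit=NUM_MOUNTAINS):
    """Return the first `limit` names as a plain list."""
    return list(all_names)[:limit]


# Toy usage with a short list instead of the real 'Mountain' column
mountain_names = select_names(["Everest", "K2", "Lhotse"], limit=2)
print(mountain_names)
```

With this layout the variable stays mountain_names regardless of how many names are extracted, and the experiment size is changed in one place.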

In [None]:
import wikipediaapi

# Initialize the Wikipedia API client with a User-Agent
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent="Mozilla/5.0 (compatible; MyNLPApp/1.0; +https://example.com)"
)

# Path to the mountain file
file_path = 'Mountain.csv'

# Loading data from a CSV file
df_mountains = pd.read_csv(file_path)

# Extract the first 1200 mountain names from the 'Mountain' column
mountain_names_600 = df_mountains['Mountain'].head(1200).tolist()

# Print the number of extracted mountain names as a check
print(f"Number of mountains extracted: {len(mountain_names_600)}")
print(f"First 10 mountains: {mountain_names_600[:10]}")

In [None]:
# Function for downloading article texts
def fetch_mountain_texts(mountain_names):
    texts = {}
    for name in mountain_names:
        page = wiki_wiki.page(name)
        if page.exists():
            texts[name] = page.text[:500]  # Limit the text to the first 500 characters
    return texts

# Loading texts
mountain_texts_600 = fetch_mountain_texts(mountain_names_600)
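
The fixed 500-character cut in fetch_mountain_texts can end in the middle of a sentence. A minimal sketch of an alternative, assuming it is acceptable to drop the trailing partial sentence. truncate_at_sentence is a hypothetical helper, not part of the original script:

```python
def truncate_at_sentence(text, max_chars=500):
    """Cut text to at most max_chars, ending at the last full sentence if one fits."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Position of the last sentence-ending punctuation mark inside the clipped text
    last_end = max(clipped.rfind(mark) for mark in ".!?")
    if last_end == -1:
        return clipped  # no sentence boundary found, fall back to the hard cut
    return clipped[:last_end + 1]


sample = ("Mount Everest is Earth's highest mountain. It lies in the Himalayas. "
          "Its summit is on the border.")
print(truncate_at_sentence(sample, max_chars=75))
```

This is only a heuristic: abbreviations and decimal points also contain a period, so the cut can still land in an awkward place.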

I displayed the text to evaluate it visually.

In [None]:
# Print a preview for multiple mountains
mountain_texts_preview_600 = {name: text[:300] for name, text in mountain_texts_600.items()}
len(mountain_texts_600), mountain_texts_preview_600

Next, I decided to determine the "main" mountain in each Wikipedia article, since a single article usually mentions several mountains (for comparison, etc.). Initially, I thought this would be helpful, but after building and testing the model on the new data, I began to suspect that it was a mistake and that I should not have done it.

In [None]:
import re
from collections import Counter

# Function to determine the main mountain in the text
# (mountain_names must already be lowercase, because the text is lowercased)
def get_main_mountain(text, mountain_names):
    mountain_counts = Counter()
    text_lower = text.lower()
    
    for mountain in mountain_names:
        count = text_lower.count(mountain)  # Count substring occurrences of each mountain name
        if count > 0:
            mountain_counts[mountain] = count
            
    if mountain_counts:
        main_mountain = mountain_counts.most_common(1)[0][0]  # Picking the mountain with the most mentions
        return main_mountain
    return None

# Function for marking up texts taking into account the main mountain
def annotate_mountains_by_main(texts, mountain_names):
    annotated_data = []
    for name, text in texts.items():
        main_mountain = get_main_mountain(text, mountain_names)  # Determine the main mountain
        if main_mountain:
            sentences = re.split(r'(?<=[.!?]) +', text)  # Split the text into sentences
            for sentence in sentences:
                found_mountains = [mountain for mountain in mountain_names if mountain in sentence.lower()]
                if found_mountains:
                    annotated_data.append({
                        'sentence': sentence,
                        'main_mountain': main_mountain,
                        'mentioned_mountains': found_mountains
                    })
    return annotated_data
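
Both functions above match names as plain substrings, so a short name can also match inside a longer word. A small sketch of whole-word counting, assuming that stricter behavior is wanted. count_mentions and the example name "ida" are hypothetical and only illustrate the effect:

```python
import re


def count_mentions(text, name):
    """Count whole-word, case-insensitive occurrences of name in text."""
    # (?<!\w) and (?!\w) keep the match from starting or ending inside a longer word
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


text = "Florida has no high peaks, unlike Mount Ida."
print(text.lower().count("ida"))    # substring counting also hits "Florida"
print(count_mentions(text, "ida"))  # whole-word counting only hits "Ida"
```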

Next, I decided to add synthetic data (generated by GPT) to improve the model's performance. Before doing so, I saved the existing dataset and displayed the current number of sentences in it:
Number of sentences in the dataset before adding synthetic data: 1958.

In [None]:
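# Lowercase the names for matching against lowercased text
# (mountain_names_lower is not defined in the cells above; this definition is an assumption)
mountain_names_lower = [name.lower() for name in mountain_names_600]

# Text markup
annotated_dataset_by_main = annotate_mountains_by_main(mountain_texts_600, mountain_names_lower)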

# Convert to DataFrame for saving
df_annotated_main = pd.DataFrame(annotated_dataset_by_main)

# Output the number of sentences in the dataset
print(f"Number of sentences in the dataset before adding synthetic data: {df_annotated_main.shape[0]}")

# Saving the dataset in CSV format
output_path = 'annotated_mountain_dataset.csv'
df_annotated_main.to_csv(output_path, index=False)

# Show the first few lines
df_annotated_main.head(20)

I generated about 1000 synthetic sentences using GPT, loaded them, processed them (filtering out the insertion-marker lines), mixed them with the Wikipedia data, and saved the updated dataset.
Combined dataset has been saved to 'combined_annotated_dataset.csv'
Size of the merged dataset: (2992, 3)

In [None]:
# Loading synthetic data
df_synthetic = pd.read_csv('synthetic_data.csv', header=None, names=['sentence'])

# Drop lines that start with "[" or end with "]" (this also removes lines consisting only of a bracket)
df_synthetic = df_synthetic[~df_synthetic['sentence'].str.contains(r'^\[|\]$', regex=True)]

# Create a column 'main_mountain' for synthetic data
# (heuristic: takes the second whitespace-separated word of each sentence)
df_synthetic['main_mountain'] = df_synthetic['sentence'].apply(lambda x: x.split()[1] if len(x.split()) > 1 else x)

# Combine the main dataset with synthetic data
df_combined = pd.concat([df_annotated_main, df_synthetic], ignore_index=True)

# Fill NaN values in 'mentioned_mountains' with the string '[]' (a placeholder for an empty list)
df_combined['mentioned_mountains'] = df_combined['mentioned_mountains'].fillna('[]')

# Save the combined dataset to a new CSV file
df_combined.to_csv('combined_annotated_dataset.csv', index=False)
print("Combined dataset has been saved to 'combined_annotated_dataset.csv'")

# Output the size of the new merged dataset for verification
print(f"Size of the merged dataset: {df_combined.shape}")
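
The filter pattern in the cell above is an alternation, so it matches any line that starts with "[" or ends with "]", not only lines made up of a single bracket. A small check with the standard re module (pandas' str.contains uses the same search semantics). BRACKET_ONLY is a hypothetical stricter alternative, shown only for comparison:

```python
import re

STARTS_OR_ENDS = re.compile(r'^\[|\]$')       # the pattern used in the cell above
BRACKET_ONLY = re.compile(r'^\s*[\[\]]\s*$')  # hypothetical: matches bracket-only lines

lines = ["[", "]", "[Everest] is the highest peak", "Mount Elbrus is in the Caucasus."]
for line in lines:
    # Columns: the line, whether the original pattern matches, whether the stricter one does
    print(repr(line), bool(STARTS_OR_ENDS.search(line)), bool(BRACKET_ONLY.search(line)))
```

Which of the two is appropriate depends on whether the synthetic file contains real sentences that begin or end with a bracket.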

Some mountain names consist of multiple words, so I wrote a function that labels each word of such names separately. As a result, three labels were defined:
B-MOUNTAIN – the beginning of a mountain name,
I-MOUNTAIN – the continuation of a mountain name,
O – all other tokens.
I also believe that improving this function could have enhanced the model's performance. Mountain-related tokens could have been handled better: for example, if the model identifies a word as the continuation of a mountain name, it should also make sure that the beginning of that name appears before it. Alternatively, the logic could be revised completely.
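
The consistency idea from the last paragraph could be sketched as a post-processing step on label sequences. This is only one possible interpretation, and repair_bio is a hypothetical helper that was not part of the original pipeline:

```python
def repair_bio(labels):
    """Turn any I-MOUNTAIN that does not follow B-/I-MOUNTAIN into B-MOUNTAIN."""
    repaired = []
    previous = 'O'
    for label in labels:
        if label == 'I-MOUNTAIN' and previous == 'O':
            label = 'B-MOUNTAIN'  # an orphan continuation becomes the start of a name
        repaired.append(label)
        previous = label
    return repaired


predicted = ['O', 'I-MOUNTAIN', 'I-MOUNTAIN', 'O', 'B-MOUNTAIN']
print(repair_bio(predicted))
```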

In [None]:
import string

def prepare_ner_dataset_fixed_multiword(df, mountain_names):
    """
    Prepare a dataset for training a NER model taking into account multi-word mountain names.

    Returns a list of sentences tokenized into words and the corresponding labels.
    """
    sentences = []
    labels = []
    
    for _, row in df.iterrows():
        sentence = row['sentence'].split()  # Whitespace tokenization
        main_mountain = row['main_mountain'].lower()  # Main mountain in the article (not used in the labeling below)
        
        sentence_labels = ['O'] * len(sentence)  # Initialize the label list with the value 'O'
        
        # Check every mountain name (single- or multi-word) against the sentence
        for mountain in mountain_names:
            mountain_tokens = mountain.split()  # Break the name of the mountain into words
            mountain_len = len(mountain_tokens)
            
            # Check each substring in the sentence
            for i in range(len(sentence) - mountain_len + 1):
                window = sentence[i:i + mountain_len]  # A substring of a sentence as long as the name of a mountain
                window_lower = [word.lower().strip(string.punctuation) for word in window]
                
                if window_lower == mountain_tokens:
                    sentence_labels[i] = 'B-MOUNTAIN'  # Beginning of mountain name
                    for j in range(1, mountain_len):
                        sentence_labels[i + j] = 'I-MOUNTAIN'  # Continuation of the mountain name
        
        sentences.append(sentence)
        labels.append(sentence_labels)
    
    return sentences, labels

In [None]:
# Preparing the dataset
sentences_fixed_multiword, labels_fixed_multiword = prepare_ner_dataset_fixed_multiword(df_combined, mountain_names_lower)

Next, I decided to remove stop words. The main purpose was to address the dataset's biggest issue: poor label balance, with far more O tokens than mountain tokens.
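
To see how unbalanced the labels are at each step, the label shares can be counted directly. A minimal sketch on toy labels. label_distribution is a hypothetical helper, and the numbers below do not come from the real dataset:

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label across all sentences."""
    counts = Counter(label for sentence_labels in labels for label in sentence_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()} if total else {}


toy_labels = [['B-MOUNTAIN', 'I-MOUNTAIN', 'O', 'O'], ['O', 'O', 'O', 'O']]
print(label_distribution(toy_labels))
```

Calling the same function on the label lists before and after each reduction step would show how much each step shifts the balance.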

In [None]:
# Let's output an example sentence before removing stop words
print("Before removing stop words:")
print(list(zip(sentences_fixed_multiword[10], labels_fixed_multiword[10])))

In [None]:
import nltk
from nltk.corpus import stopwords

# Load stop words and remove them from the dataset
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(sentences, labels):
    """
    Removes stop words for class O tokens only.
    """
    new_sentences = []
    new_labels = []
    
    for sentence, label in zip(sentences, labels):
        reduced_sentence = []
        reduced_label = []
        
        for word, l in zip(sentence, label):
            if l == 'O' and word.lower() in stop_words:
                continue  # Skip the token if it is a class O stop word
            reduced_sentence.append(word)
            reduced_label.append(l)
        
        new_sentences.append(reduced_sentence)
        new_labels.append(reduced_label)
    
    return new_sentences, new_labels

In [None]:
# Apply the function to the dataset
sentences_reduced, labels_reduced = remove_stopwords(sentences_fixed_multiword, labels_fixed_multiword)

In [None]:
# Let's output an example sentence after removing stop words
print("\nAfter removing stop words:")
print(list(zip(sentences_reduced[10], labels_reduced[10])))

Next, I considered what else could improve the balance and realized that not all of the extracted sentences contain mountain names.
I decided that a sentence without a mountain mention contributes less to the model's training, so removing such sentences significantly reduces the number of O tokens.
Number of sentences after filtering: 2700.

In [None]:
def filter_sentences_no_mountains(sentences, labels):
    """
    Removes sentences where all tokens have the label 'O' (i.e. no mention of mountain names).
    """
    filtered_sentences = []
    filtered_labels = []
    
    for sentence, label in zip(sentences, labels):
        if any(l != 'O' for l in label):  # Check if there is at least one label other than 'O'
            filtered_sentences.append(sentence)
            filtered_labels.append(label)
    
    return filtered_sentences, filtered_labels

# Apply the function to the reduced dataset
filtered_sentences, filtered_labels = filter_sentences_no_mountains(sentences_reduced, labels_reduced)

# Output the number of sentences after filtering
print(f"Number of sentences after filtering: {len(filtered_sentences)}")

# Example sentence after filtering
print("Example sentence after filtering:")
print(list(zip(filtered_sentences[0], filtered_labels[0])))

This slightly improved the situation, but the imbalance remained. So I decided to randomly reduce the number of O tokens in each sentence. This was a risky move, as it could remove tokens that the model needs as context. I chose to remove half of the O tokens to reach a 70/30 balance between O and mountain tokens, which I consider the minimum acceptable result.

As a result, the model's performance improved.
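
The effect of removing a fraction of O tokens on the overall balance can be estimated with simple arithmetic. A sketch using hypothetical token counts (8200 and 1800 are illustrative only, not measured on the real dataset). o_share_after_reduction is a hypothetical helper:

```python
def o_share_after_reduction(num_o, num_mountain, reduction_ratio=0.5):
    """Share of O tokens after removing a fraction of them (mountain tokens are kept)."""
    remaining_o = num_o * (1 - reduction_ratio)
    return remaining_o / (remaining_o + num_mountain)


# Hypothetical counts: an 82/18 split before the reduction
print(round(o_share_after_reduction(8200, 1800), 3))
```

In the actual function the number of tokens to remove is truncated to an integer per sentence, so slightly fewer than half of the O tokens are removed overall.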

In [None]:
import random

def reduce_o_tokens(sentences, labels, reduction_ratio=0.5):
    """
    Selectively reduces the number of class O tokens in sentences.

    Parameters:
    - sentences: list of sentences (tokenized into words)
    - labels: list of labels for each sentence
    - reduction_ratio: fraction of class O tokens to remove

    Returns new lists of sentences and labels.
    """
    new_sentences = []
    new_labels = []
    
    for sentence, label in zip(sentences, labels):
        o_indices = [i for i, l in enumerate(label) if l == 'O']
        num_to_remove = int(len(o_indices) * reduction_ratio)
        
        if num_to_remove > 0:
            indices_to_remove = set(random.sample(o_indices, num_to_remove))
        else:
            indices_to_remove = set() 
        
        reduced_sentence = [word for i, word in enumerate(sentence) if i not in indices_to_remove]
        reduced_label = [l for i, l in enumerate(label) if i not in indices_to_remove]
        
        new_sentences.append(reduced_sentence)
        new_labels.append(reduced_label)
    
    return new_sentences, new_labels

# Apply the function to the dataset
final_sentences_reduced, final_labels_reduced = reduce_o_tokens(filtered_sentences, filtered_labels, reduction_ratio=0.5)

print(f"Number of sentences after reducing class O tokens: {len(final_sentences_reduced)}")

Next, I assembled the final dataset after all these manipulations and proceeded to train the model.

In [None]:
# Combine words and tags into strings
final_sentences_str = [' '.join(sentence) for sentence in final_sentences_reduced]
final_labels_str = [' '.join(label) for label in final_labels_reduced]

# Creating DataFrame
df = pd.DataFrame({
    'sentence': final_sentences_str,
    'label': final_labels_str
})

# Saving DataFrame in CSV file
df.to_csv('final_dataset.csv', index=False)

print("File successfully saved as 'final_dataset.csv'")

In [None]:
# Loading dataset
df = pd.read_csv('final_dataset.csv')

# Split the strings back into token lists
final_sentences = [sentence.split() for sentence in df['sentence']]
final_labels = [label.split() for label in df['label']]
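
Because tokens and labels are stored as space-joined strings and split again on load, it may be worth checking that every sentence still has one label per token. A minimal sketch on toy data. find_misaligned is a hypothetical helper:

```python
def find_misaligned(sentences, labels):
    """Return indices of rows whose token count differs from their label count."""
    return [i for i, (tokens, tags) in enumerate(zip(sentences, labels))
            if len(tokens) != len(tags)]


toy_sentences = [['Mount', 'Everest', 'rises'], ['Elbrus', 'dormant']]
toy_labels = [['B-MOUNTAIN', 'I-MOUNTAIN', 'O'], ['B-MOUNTAIN']]
print(find_misaligned(toy_sentences, toy_labels))
```

On the real data the call would be find_misaligned(final_sentences, final_labels), and an empty list would mean the round trip through the CSV file kept tokens and labels aligned.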