#### Dataset Source
The dataset used in this project was obtained from the [NER Mountains GitHub Repository](https://github.com/antonItachi/ner-mountains/tree/main). 

- This dataset contains annotated sentences that mention mountains by name and is used to train a Named Entity Recognition (NER) model.
- The dataset was downloaded directly from the repository and then augmented with template-generated sentences to increase its diversity.

#### Data Collection
- Data was collected by scraping websites that contain information about mountains.
- The collected text was preprocessed to extract and annotate mountain names.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
from torch.utils.data import DataLoader
import torch

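# Project helpers used below (explicit names instead of a wildcard import)
from scripts.data_processing import augment_data, tokenize_and_align_labels, NERDataset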

In [2]:
# Load the dataset
dataset_path = "data/annotated_100.csv"
data = pd.read_csv(dataset_path)

In [3]:
# Display the first few rows of the dataset
print("First rows of the dataset:")
data.head()

First rows of the dataset:


| | sentence | annotation |
|---|---|---|
| 0 | Chimborazo in the ocean and even whole ranges ... | B-MOUNT O O O O O O O O O O O O O O O O O O O ... |
| 1 | I would rather go to the Elbrus than to the be... | O O O O O O B-MOUNT O O O O |
| 2 | Which do you like better, the sea or the Olympus? | O O O O O O O O O B-MOUNT |
| 3 | Some people like the sea; others prefer the An... | O O O O O O O O B-MOUNT |
| 4 | We watched the sun setting behind the Annapurna. | O O O O O O O B-MOUNT |


In [4]:
# Split every annotation string into labels and flatten them into one Series
all_labels = data['annotation'].str.split().explode()

# Extract and display unique annotations
print("\nUnique values in the 'annotation' column:")
unique_annotations = all_labels.unique()
print(unique_annotations)

# Count occurrences of 'I-MOUNT' and 'B-MOUNT' (computed once, reused for both)
label_counts = all_labels.value_counts()
i_mount_count = label_counts.get('I-MOUNT', 0)
b_mount_count = label_counts.get('B-MOUNT', 0)

print(f"\nNumber of 'I-MOUNT' annotations: {i_mount_count}")
print(f"Number of 'B-MOUNT' annotations: {b_mount_count}")


Unique values in the 'annotation' column:
['B-MOUNT' 'O' 'I-MOUNT']

Number of 'I-MOUNT' annotations: 6
Number of 'B-MOUNT' annotations: 101


The dataset consists of two columns:

- **`sentence`**: Contains the raw sentence text, which mentions a mountain by name.
- **`annotation`**: Contains one space-separated label per word in BIO format: "B-MOUNT" (beginning of a mountain name), "I-MOUNT" (inside a mountain name), and "O" (non-entity tokens).
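
To make the BIO scheme concrete, here is a minimal sketch of how a label sequence can be turned back into mountain-name spans. The function name `extract_entities` and the example sentence are illustrative assumptions and are not part of the project code.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_text, start, end) spans; end is exclusive."""
    entities = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B-MOUNT":
            # A new entity starts; close the previous one if it is still open
            if start is not None:
                entities.append((" ".join(tokens[start:i]), start, i))
            start = i
        elif tag == "I-MOUNT" and start is not None:
            continue  # the open entity simply extends over this token
        else:
            # "O" (or an I-MOUNT with no preceding B-MOUNT) closes any open entity
            if start is not None:
                entities.append((" ".join(tokens[start:i]), start, i))
            start = None
    if start is not None:
        entities.append((" ".join(tokens[start:]), start, len(tokens)))
    return entities


# Illustrative example (not taken from the dataset)
tokens = "I visited Table Mountain last year".split()
tags = "O O B-MOUNT I-MOUNT O O".split()
print(extract_entities(tokens, tags))
```

In this sketch an "I-MOUNT" that does not follow a "B-MOUNT" is ignored, which is one of several reasonable ways to handle malformed sequences.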

In [5]:
# Check for missing values in the dataset
print("\nMissing values check:")
data.isnull().sum()


Missing values check:


sentence      0
annotation    0
dtype: int64

There are no missing values in the dataset.

In [6]:
# Check the dataset size (number of rows and columns)
print("\nDataset size:")
data.shape


Dataset size:


(100, 2)

The dataset contains 100 rows and 2 columns.

### Data preprocessing 

In [7]:
# Split sentences and annotations for further processing
sentences = data['sentence'].values
annotations = data['annotation'].apply(lambda x: x.split()).values

The original data contains only 6 I-MOUNT labels against 101 B-MOUNT labels, so multi-word mountain names are heavily underrepresented. To address this, the dataset is augmented with template-generated sentences built around multi-word names, which adds more I-MOUNT annotations. This balances the label distribution and should improve the model's ability to tag every word of a mountain name.

In [8]:
# Define mountains and base sentences for augmentation
mountains = [
    "Mount Kilimanjaro", "Mount McKinley", "Rocky Mountains",
    "Blue Ridge Mountains", "Cascade Range", "Great Smoky Mountains"
]
base_sentences = [
    "I climbed {} last year.",
    "{} are breathtaking.",
    "The view from {} is spectacular.",
    "I recently visited {} with my friends.",
    "{} is a famous destination for hikers."
]

In [9]:
# Generate augmented data
augmented_sentences, augmented_annotations = augment_data(
    mountain_list=mountains,
    base_sentences=base_sentences,
    num_augmentations=100
)

# Combine original sentences with augmented sentences
sentences = list(sentences) + augmented_sentences
annotations = list(annotations) + augmented_annotations

# The annotations are already lists of labels (the strings were split in an
# earlier cell), so no further splitting is needed here
tokenized_annotations = annotations
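
The `augment_data` helper is imported from `scripts.data_processing` and its implementation is not shown in this notebook. The following is only a sketch of how such a helper could work, assuming it fills a random template with a random mountain name and emits one label per whitespace-separated word. The name `augment_data_sketch` and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Tuple


def augment_data_sketch(
    mountain_list: List[str],
    base_sentences: List[str],
    num_augmentations: int,
    seed: int = 0,  # hypothetical parameter, added here for reproducibility
) -> Tuple[List[str], List[List[str]]]:
    """Build synthetic sentences and matching word-level BIO labels."""
    rng = random.Random(seed)
    sentences, annotations = [], []
    for _ in range(num_augmentations):
        mountain = rng.choice(mountain_list)
        template = rng.choice(base_sentences)
        name_length = len(mountain.split())

        labels = []
        for word in template.split():
            if "{}" in word:
                # The placeholder expands to the mountain name:
                # first word is B-MOUNT, every following word is I-MOUNT
                labels.extend(["B-MOUNT"] + ["I-MOUNT"] * (name_length - 1))
            else:
                labels.append("O")

        sentences.append(template.format(mountain))
        annotations.append(labels)
    return sentences, annotations


demo_sentences, demo_annotations = augment_data_sketch(
    ["Mount Kilimanjaro"], ["I climbed {} last year."], num_augmentations=2
)
print(demo_sentences[0])
print(demo_annotations[0])
```

Because labels are produced per whitespace-separated word, trailing punctuation stays attached to its word, which matches the convention visible in the original CSV.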

In [10]:
# Split the dataset into training and testing sets
train_sentences, test_sentences, train_annotations, test_annotations = train_test_split(
    sentences, annotations, test_size=0.2, random_state=42
)

# Display the number of samples in the training and testing sets
print("Number of training samples:", len(train_sentences))
print("Number of testing samples:", len(test_sentences))

Number of training samples: 160
Number of testing samples: 40


In [11]:
# Initialize the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

In [12]:
# Tokenize and align labels for training and testing data
train_inputs, train_labels = tokenize_and_align_labels(train_sentences, train_annotations, tokenizer)
test_inputs, test_labels = tokenize_and_align_labels(test_sentences, test_annotations, tokenizer)
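
`tokenize_and_align_labels` also comes from `scripts.data_processing` and is not shown here. A common way to align word-level labels with sub-word tokens relies on the fast tokenizer's `word_ids()` method, which returns, for each token, the index of the word it came from (or `None` for special tokens). The sketch below shows only that alignment step on a hand-written `word_ids` list. The function name `align_labels` and the handling of out-of-range word indices are assumptions, not the project's confirmed behavior.

```python
from typing import List, Optional, Sequence, Union

IGNORE_INDEX = -100  # marker for positions that should not receive a real label


def align_labels(
    word_ids: Sequence[Optional[int]],
    word_labels: Sequence[str],
    ignore_index: int = IGNORE_INDEX,
) -> List[Union[str, int]]:
    """Give every token the label of the word it belongs to."""
    aligned: List[Union[str, int]] = []
    for word_id in word_ids:
        if word_id is None or word_id >= len(word_labels):
            # Special tokens ([CLS], [SEP], [PAD]) have no word; words without
            # a label are also ignored (assumption made for this sketch)
            aligned.append(ignore_index)
        else:
            # Every sub-word piece inherits the label of its word
            aligned.append(word_labels[word_id])
    return aligned


# Hand-written word_ids for: [CLS] Rocky Mountains is hike ##rs [SEP] [PAD]
example_word_ids = [None, 0, 1, 2, 3, 3, None, None]
example_labels = ["B-MOUNT", "I-MOUNT", "O", "O"]
print(align_labels(example_word_ids, example_labels))
```

An alternative design labels only the first piece of each word and sets the remaining pieces to the ignore marker; which variant the project uses can be confirmed by inspecting the alignment output for a sentence with split words.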

In [13]:
# Check if the number of tokens matches the number of labels for training data
for i, labels in enumerate(train_labels):
    tokens = train_inputs.encodings[i].tokens  # Get the tokens for the i-th sentence
    if len(labels) != len(tokens):  # Compare number of labels with number of tokens
        print(f"Mismatch in training sample {i}: {len(labels)} labels vs {len(tokens)} tokens")

In [14]:
# Check if the number of tokens matches the number of labels for testing data
for i, labels in enumerate(test_labels):
    tokens = test_inputs.encodings[i].tokens  # Get the tokens for the i-th sentence
    if len(labels) != len(tokens):  # Compare number of labels with number of tokens
        print(f"Mismatch in testing sample {i}: {len(labels)} labels vs {len(tokens)} tokens")

In [15]:
# Example debugging for tokenization and label alignment
example_index = 1  # Change this index to check different examples

# Get tokens and labels for the example
tokens = train_inputs.encodings[example_index].tokens  # Use .tokens to get token list
labels = train_labels[example_index]

# Check alignment visually
for token, label in zip(tokens[:15], labels[:15]):
    print(f"{token:15} --> {label}")

[CLS]           --> -100
Rocky           --> B-MOUNT
Mountains       --> I-MOUNT
is              --> O
a               --> O
famous          --> O
destination     --> O
for             --> O
hike            --> O
##rs            --> O
.               --> -100
[SEP]           --> -100
[PAD]           --> -100
[PAD]           --> -100
[PAD]           --> -100


This output shows the token-level labels for one training sentence, where:

- "Rocky" is labeled B-MOUNT (beginning of a mountain name).
- "Mountains" is labeled I-MOUNT (continuation of the name).
- All other words are labeled O (outside any entity); the sub-word pieces "hike" and "##rs" both carry the O label of "hikers".
- Special tokens ("[CLS]", "[SEP]", "[PAD]") and the final period are marked as -100.

In [16]:
# Define label-to-ID mapping
label2id = {"O": 0, "B-MOUNT": 1, "I-MOUNT": 2}

# Prepare datasets
train_dataset = NERDataset(train_inputs, train_labels, label2id)
test_dataset = NERDataset(test_inputs, test_labels, label2id)

# Create DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=8)
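
`NERDataset` is defined in `scripts.data_processing` and is not shown here. One step such a dataset has to perform is converting the string labels to integer IDs while leaving the -100 markers untouched. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so those positions do not contribute to the loss. A minimal sketch of that conversion, with the hypothetical helper name `encode_labels`:

```python
from typing import Dict, List, Sequence, Union


def encode_labels(
    labels: Sequence[Union[str, int]],
    label2id: Dict[str, int],
    ignore_index: int = -100,
) -> List[int]:
    """Map string labels to IDs and keep the ignore marker as it is."""
    return [
        ignore_index if label == ignore_index else label2id[label]
        for label in labels
    ]


demo_label2id = {"O": 0, "B-MOUNT": 1, "I-MOUNT": 2}
demo_labels = [-100, "B-MOUNT", "I-MOUNT", "O", -100]
print(encode_labels(demo_labels, demo_label2id))
```

An unknown label raises a `KeyError` in this sketch, which surfaces annotation typos early instead of silently mapping them to "O".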

In [20]:
# Save the inputs and labels
torch.save({
    'train_inputs': train_inputs,
    'train_labels': train_labels,
    'test_inputs': test_inputs,
    'test_labels': test_labels
}, 'data/preprocessed_data.pth')

In [22]:
# Save the tokenizer
tokenizer.save_pretrained('tokenizer')

('tokenizer\\tokenizer_config.json',
 'tokenizer\\special_tokens_map.json',
 'tokenizer\\vocab.txt',
 'tokenizer\\added_tokens.json',
 'tokenizer\\tokenizer.json')

### Creating a small test dataset for demonstrating the trained model

*The test sentences below were generated using ChatGPT (GPT-4o mini).*

In [2]:
# Test sentences
test_sentences = [
    "Mount Everest is the tallest mountain.",
    "I hiked in the Alps and the Pyrenees last summer.",
    "Kilimanjaro is in Tanzania.",
    "This sentence does not mention a mountain.",
    "Denali, also known as Mount McKinley, is in Alaska.",
    "The Andes are a stunning mountain range in South America.",
    "I visited Table Mountain in South Africa last year.",
    "Mount Fuji is an active volcano in Japan.",
    "Rocky Mountains extend across the United States and Canada.",
    "Mount Elbrus is the highest peak in Europe."
]

# Corresponding labels: one label per word and per punctuation mark
test_labels = [
    ["B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O"],
    ["O", "O", "O", "O", "B-MOUNT", "O", "O", "B-MOUNT", "O", "O", "O"],
    ["B-MOUNT", "O", "O", "O", "O"],
    ["O", "O", "O", "O", "O", "O", "O", "O"],
    ["B-MOUNT", "O", "O", "O", "O", "B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O"],
    ["O", "B-MOUNT", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
    ["O", "O", "B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O", "O"],
    ["B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O", "O", "O"],
    ["B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O", "O", "O", "O"],
    ["B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O", "O", "O"]
]

In [5]:
# Combine into a dataframe
test_data = pd.DataFrame({
    "sentence": test_sentences,
    "labels": [" ".join(labels) for labels in test_labels]
})

# Save to CSV
test_data.to_csv("data/test_dataset.csv", index=False)
print("Test dataset saved as 'test_dataset.csv'")

Test dataset saved as 'test_dataset.csv'
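
Hand-written labels are easy to get out of step with their sentences, so a quick length check can help. The sketch below assumes that the test labels count every punctuation mark as its own token, which is inferred from the label counts above and differs from the whitespace-based convention of the training CSV. The helper names are hypothetical.

```python
import re
from typing import List, Tuple


def split_words_and_punct(sentence: str) -> List[str]:
    """Split into words and single punctuation marks (assumed convention)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def find_length_mismatches(
    sentences: List[str], labels: List[List[str]]
) -> List[Tuple[int, int, int]]:
    """Return (index, n_tokens, n_labels) for every sentence whose counts differ."""
    mismatches = []
    for i, (sentence, sentence_labels) in enumerate(zip(sentences, labels)):
        n_tokens = len(split_words_and_punct(sentence))
        if n_tokens != len(sentence_labels):
            mismatches.append((i, n_tokens, len(sentence_labels)))
    return mismatches


check_sentences = ["Mount Everest is the tallest mountain.", "Kilimanjaro is in Tanzania."]
check_labels = [
    ["B-MOUNT", "I-MOUNT", "O", "O", "O", "O", "O"],
    ["B-MOUNT", "O", "O", "O"],  # deliberately one label short
]
print(find_length_mismatches(check_sentences, check_labels))
```

A length check only catches missing or extra labels; a label placed on the wrong token still needs a visual review of token/label pairs.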
