# **Demo notebook**

**In this notebook, we will review the language models that were trained with spaCy to identify mountain names in text.**

We will consider two language models.
The first model was trained on data that was annotated automatically using Python.
The second model was trained on manually annotated data.

Let's create validation data using **ChatGPT** and save it to data_example.txt.



```
Mountains have always captivated the human imagination, symbolizing strength, endurance, and majesty. From the highest peaks to the lesser-known ranges, they shape the landscapes and ecosystems of our world in profound ways. Mount Everest , standing at an astounding 8,848 meters, is the highest point on Earth and is located in the Himalayas. Everest’s incredible height and challenging conditions make it a coveted climb for mountaineers across the globe, though many find its summit elusive.

To the west of Everest lies another giant, K2 , also known as Mount Godwin-Austen . Standing at 8,611 meters in the Karakoram Range , K2 is often regarded as one of the most dangerous mountains to climb, earning the nickname " Savage Mountain " due to its steep inclines and unpredictable weather.

Not far from these towering peaks lies Kangchenjunga , the third-highest mountain in the world at 8,586 meters. Located on the border between Nepal and the Indian state of Sikkim, Kangchenjunga is revered by locals, and many climbers respect the tradition of not stepping on its summit out of deference to its spiritual significance.

Heading further west, Nanga Parbat rises at 8,126 meters. Known as the "Killer Mountain," this massif has claimed the lives of many climbers, particularly in the early 20th century when it was notorious for its difficult ascent.

In Europe, the Alps dominate the skyline, with Mont Blanc standing as its tallest peak at 4,809 meters. This mountain, located between France and Italy, attracts thousands of climbers, hikers, and tourists every year who come to admire its beauty and take on the challenge of scaling its snow-covered heights.

Across the ocean in North America, the Rocky Mountains stretch over 4,800 kilometers, with the highest point being Mount Elbert at 4,401 meters, located in Colorado. The Rockies are known for their rugged beauty, alpine lakes, and abundant wildlife.

In Alaska, Denali (formerly known as Mount McKinley) towers above the rest of the continent at 6,190 meters. Denali’s massive size and isolated location make it a tough but rewarding climb for mountaineers seeking adventure in the wilderness.

Further south in Argentina, Aconcagua stands as the tallest mountain in the Southern Hemisphere, at 6,960 meters. Part of the Andes , Aconcagua is not only a popular climbing destination but also a natural landmark of the region’s geological diversity.

In Africa, the iconic Mount Kilimanjaro rises from the plains of Tanzania. At 5,895 meters, it is the highest free-standing mountain in the world and a prominent symbol of Africa. Its snow-capped peak is a striking contrast to the tropical landscape surrounding it.

Moving to Asia, Mount Fuji is Japan's most famous mountain, rising 3,776 meters. Revered in Japanese culture and religion, Fuji is an active volcano that remains a popular site for both pilgrimage and tourism.

Further to the north, in Russia, Mount Elbrus stands as Europe’s highest peak at 5,642 meters. Part of the Caucasus mountain range, Elbrus is a dormant volcano, often climbed by mountaineers aiming to complete the Seven Summits challenge.

In the Andes of Peru, the majestic Huascarán rises at 6,768 meters. It is not only the highest point in Peru but also home to Huascarán National Park, which protects the rich biodiversity and beautiful glacial landscapes of the region.

In Bolivia, Illimani stands tall at 6,438 meters, overlooking the city of La Paz. Its presence is a significant part of local culture, and its snow-capped peaks are visible from great distances.

Closer to the equator in Ecuador, Chimborazo claims the title of the highest point on Earth when measured from the center of the Earth rather than sea level, due to the equatorial bulge. It stands at 6,263 meters and is considered sacred by indigenous peoples.

In New Zealand, the Southern Alps host Aoraki / Mount Cook , the country’s tallest peak at 3,724 meters. Aoraki holds a special place in Māori mythology and is a popular destination for climbers and adventurers.

In Mexico, Pico de Orizaba ( also known as Citlaltépetl ) is the highest mountain at 5,636 meters. This dormant volcano is a striking feature of the Mexican landscape and offers climbers a rewarding, albeit challenging, ascent.

Moving to the Himalayas once more, Makalu stands at 8,485 meters, making it the fifth-highest mountain in the world. Its distinctive pyramid shape and challenging terrain attract experienced climbers seeking to test their limits.

In the Andes , Huayna Potosí in Bolivia is a favorite among climbers due to its accessibility from La Paz and its height of 6,088 meters. The mountain is also known for its stunning views of the surrounding landscape.

In the Canadian Rockies , Mount Robson is the highest peak at 3,954 meters. Known for its striking vertical relief, Robson is a challenging climb and a beloved destination for hikers and mountaineers alike.

In Chile, Ojos del Salado is the highest volcano in the world, standing at 6,893 meters. Located in the Atacama Desert, Ojos del Salado offers climbers a unique experience, combining desert landscapes with snow-capped peaks.

On the border of Argentina and Chile lies Mount Fitz Roy ( also known as Cerro Chaltén ), which, at 3,405 meters, is not particularly tall but is renowned for its sheer granite faces and technical climbing challenges.

In the remote regions of Antarctica, Vinson Massif is the tallest mountain, standing at 4,892 meters. Its location in one of the coldest and most desolate places on Earth makes it a formidable challenge for even the most experienced climbers.

Back in Asia, Dhaulagiri stands at 8,167 meters, part of the Dhaulagiri mountain range in Nepal. Its name means "White Mountain" and it remains one of the more isolated eight-thousanders, adding to its allure.

Further south in the Himalayas is Annapurna , which, at 8,091 meters, is known for its extreme difficulty and high fatality rate among climbers. Despite this, it remains a coveted summit for those willing to brave its dangers.

In the United States, the Sierra Nevada range hosts Mount Whitney , which, at 4,421 meters, is the highest point in the contiguous United States. Whitney’s accessibility and scenic views make it a popular hike, though still demanding.

Another significant peak in the Andes is Nevado Sajama in Bolivia, the highest point in the country at 6,542 meters. Sajama’s volcanic origins and isolated location add to its mystique, attracting adventurers to its slopes.

Lastly, in the Pyrenees, Pic du Midi d'Ossau is a distinct peak rising at 2,884 meters, known for its sharp, needle-like appearance. Though not particularly tall, it is a well-known landmark and offers breathtaking views of the surrounding landscape.


```



For a clean check, let's **annotate the data manually** and save the annotations to valid.json.
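The evaluation code in this notebook reads valid.json as a list of `[text, {"entities": [[start, end, label], ...]}]` records, where `start` and `end` are character offsets into the text. The following is a minimal sketch of how one such record could be built and sanity-checked. The helper name `make_record` and the shortened sample sentence are illustrative assumptions, not the procedure actually used to annotate the data.

```python
import json

def make_record(text, names, label="MOUNTAIN"):
    """Build one [text, {"entities": [...]}] record from a list of names.

    Hypothetical helper: finds each name in order of appearance and
    stores its character offsets as [start, end, label].
    """
    entities = []
    cursor = 0
    for name in names:
        start = text.find(name, cursor)
        if start == -1:
            raise ValueError(f"{name!r} not found in text")
        end = start + len(name)
        entities.append([start, end, label])
        cursor = end  # continue searching after the previous match
    return [text, {"entities": entities}]

# Shortened sample sentence, used only for illustration.
sample_text = "To the west of Everest lies another giant, K2 ."
record = make_record(sample_text, ["Everest", "K2"])

# Sanity check: every span must slice back to the annotated name.
for start, end, _ in record[1]["entities"]:
    print(sample_text[start:end])

# The validation file holds a list of such records.
serialized = json.dumps([record], ensure_ascii=False, indent=2)
```

Checking that each span slices back to the intended name is worthwhile because the evaluation counts a prediction as correct only when its offsets match the annotation exactly.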

First, let's see how the model trained on **automatically annotated data** performs.

In [3]:
import json
import spacy

def load_validation_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        validation_data = json.load(f)
    return validation_data

def extract_entities(annotation):
    entities = []
    for ent in annotation['entities']:
        start, end, label = ent
        entities.append((start, end, label))
    return entities

def evaluate_model(nlp, text, true_entities):
    # Predict entities using the model
    doc = nlp(text)

    # Get predicted entities
    predicted_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Print predicted entities
    print("Predicted Entities:")
    print(predicted_entities)

    # Print true entities
    print("True Entities:")
    print(true_entities)

    # Convert lists to sets for comparison
    true_set = set(true_entities)
    pred_set = set(predicted_entities)

    # Calculate TP, FP, FN
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)

    # Calculate Precision, Recall, F1-Score
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    # Print metrics
    print("\nMetrics for the current text:")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

    # Return metrics for further use
    return tp, fp, fn

def main():
    # Path to the validation file with annotations
    validation_file_path = r'D:\(NER) model for the identification of mountain names\model_evaluation\valid.json'

    # Load the validation file
    validation_data = load_validation_file(validation_file_path)

    # Load the trained spaCy model
    model_path = r"D:\(NER) model for the identification of mountain names\NER_python_annotation\output_model"
    nlp = spacy.load(model_path)

    # Variables to accumulate metrics for the entire dataset
    total_tp = 0
    total_fp = 0
    total_fn = 0

    # For each text in the file
    for sample in validation_data:
        text = sample[0]
        annotation = sample[1]

        # Extract true entities from annotations
        true_entities = extract_entities(annotation)

        print("\nText for evaluation:")
        print(text)

        # Evaluate the model for the current text
        tp, fp, fn = evaluate_model(nlp, text, true_entities)

        # Accumulate metrics
        total_tp += tp
        total_fp += fp
        total_fn += fn

    # Calculate overall metrics
    overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
    overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
    overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall) if (overall_precision + overall_recall) > 0 else 0.0

    print("\nOverall metrics for the entire dataset:")
    print(f"Overall Precision: {overall_precision:.4f}")
    print(f"Overall Recall: {overall_recall:.4f}")
    print(f"Overall F1-Score: {overall_f1:.4f}")

if __name__ == "__main__":
    main()



Text for evaluation:
Mountains have always captivated the human imagination, symbolizing strength, endurance, and majesty. From the highest peaks to the lesser-known ranges, they shape the landscapes and ecosystems of our world in profound ways. Mount Everest , standing at an astounding 8,848 meters, is the highest point on Earth and is located in the Himalayas. Everest’s incredible height and challenging conditions make it a coveted climb for mountaineers across the globe, though many find its summit elusive.
Predicted Entities:
[(231, 238, 'MOUNTAIN'), (333, 342, 'MOUNTAIN'), (344, 351, 'MOUNTAIN')]
True Entities:
[(231, 238, 'MOUNTAIN'), (333, 343, 'MOUNTAIN'), (344, 353, 'MOUNTAIN')]

Metrics for the current text:
Precision: 0.3333
Recall: 0.3333
F1-Score: 0.3333

Text for evaluation:

Predicted Entities:
[]
True Entities:
[]

Metrics for the current text:
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000

Text for evaluation:
To the west of Everest lies another giant, K2 , also k

Overall metrics for the entire dataset:
Overall Precision: **0.8140**
Overall Recall: **0.5469**
Overall F1-Score: **0.6542**
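
The per-text figures above follow from exact span matching: a prediction counts as a true positive only if its start offset, end offset, and label all equal those of an annotated entity. The sketch below replays the spans printed for the first text through the same set arithmetic. The function name `span_metrics` is an illustrative assumption; the logic mirrors `evaluate_model`.

```python
def span_metrics(true_entities, predicted_entities):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    true_set = set(true_entities)
    pred_set = set(predicted_entities)

    tp = len(true_set & pred_set)   # spans present in both sets
    fp = len(pred_set - true_set)   # predicted but not annotated
    fn = len(true_set - pred_set)   # annotated but not predicted

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return tp, fp, fn, precision, recall, f1

# Spans copied from the output for the first text.
predicted = [(231, 238, 'MOUNTAIN'), (333, 342, 'MOUNTAIN'), (344, 351, 'MOUNTAIN')]
annotated = [(231, 238, 'MOUNTAIN'), (333, 343, 'MOUNTAIN'), (344, 353, 'MOUNTAIN')]

tp, fp, fn, precision, recall, f1 = span_metrics(annotated, predicted)
print(tp, fp, fn)                                  # 1 2 2
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")    # 0.3333 0.3333 0.3333
```

The spans (333, 342) and (333, 343) differ by a single character, so under exact matching that prediction is counted as both a false positive and a false negative.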

Next, let's see how the model trained on **manually annotated data** performs.

In [6]:
import json
import spacy

def load_validation_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        validation_data = json.load(f)
    return validation_data

def extract_entities(annotation):
    entities = []
    for ent in annotation['entities']:
        start, end, label = ent
        entities.append((start, end, label))
    return entities

def evaluate_model(nlp, text, true_entities):
    # Predict entities using the model
    doc = nlp(text)

    # Get predicted entities
    predicted_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Print predicted entities
    print("Predicted Entities:")
    print(predicted_entities)

    # Print true entities
    print("True Entities:")
    print(true_entities)

    # Convert lists to sets for comparison
    true_set = set(true_entities)
    pred_set = set(predicted_entities)

    # Calculate TP, FP, FN
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)

    # Calculate Precision, Recall, F1-Score
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    # Print metrics
    print("\nMetrics for the current text:")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

    # Return metrics for further use
    return tp, fp, fn

def main():
    # Path to the validation file with annotations
    validation_file_path = r'D:\(NER) model for the identification of mountain names\model_evaluation\valid.json'

    # Load the validation file
    validation_data = load_validation_file(validation_file_path)

    # Load the trained spaCy model
    model_path = r"D:\(NER) model for the identification of mountain names\NER_manual_annotation\output_model"
    nlp = spacy.load(model_path)

    # Variables to accumulate metrics for the entire dataset
    total_tp = 0
    total_fp = 0
    total_fn = 0

    # For each text in the file
    for sample in validation_data:
        text = sample[0]
        annotation = sample[1]

        # Extract true entities from annotations
        true_entities = extract_entities(annotation)

        print("\nText for evaluation:")
        print(text)

        # Evaluate the model for the current text
        tp, fp, fn = evaluate_model(nlp, text, true_entities)

        # Accumulate metrics
        total_tp += tp
        total_fp += fp
        total_fn += fn

    # Calculate overall metrics
    overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
    overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
    overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall) if (overall_precision + overall_recall) > 0 else 0.0

    print("\nOverall metrics for the entire dataset:")
    print(f"Overall Precision: {overall_precision:.4f}")
    print(f"Overall Recall: {overall_recall:.4f}")
    print(f"Overall F1-Score: {overall_f1:.4f}")

if __name__ == "__main__":
    main()



Text for evaluation:
Mountains have always captivated the human imagination, symbolizing strength, endurance, and majesty. From the highest peaks to the lesser-known ranges, they shape the landscapes and ecosystems of our world in profound ways. Mount Everest , standing at an astounding 8,848 meters, is the highest point on Earth and is located in the Himalayas. Everest’s incredible height and challenging conditions make it a coveted climb for mountaineers across the globe, though many find its summit elusive.
Predicted Entities:
[(231, 238, 'MOUNTAIN'), (333, 343, 'MOUNTAIN')]
True Entities:
[(231, 238, 'MOUNTAIN'), (333, 343, 'MOUNTAIN'), (344, 353, 'MOUNTAIN')]

Metrics for the current text:
Precision: 1.0000
Recall: 0.6667
F1-Score: 0.8000

Text for evaluation:

Predicted Entities:
[]
True Entities:
[]

Metrics for the current text:
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000

Text for evaluation:
To the west of Everest lies another giant, K2 , also known as Mount Godwin-Aus

Overall metrics for the entire dataset:
Overall Precision: **0.9167**
Overall Recall: **0.6875**
Overall F1-Score: **0.7857**

The results show that the model trained on **manually annotated data performs considerably better**, with the same hyperparameters for model training apart from the number of epochs.

---
The overall F1-score of the model trained on manually annotated data is about 13 percentage points higher than that of the model trained on automatically annotated data (0.7857 vs. 0.6542). It is worth noting that the model trained on manually annotated data was trained for 20 epochs, while the other model was trained for 35 epochs.
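
The gap of about 13 points is the absolute difference between the two overall F1-scores. As a quick check, the sketch below recomputes both F1-scores from the overall precision and recall printed above and also shows the relative improvement, which is roughly 20%. The function name `f1_score` is an illustrative assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overall precision and recall as printed in the outputs above.
auto_f1 = f1_score(0.8140, 0.5469)     # automatically annotated data
manual_f1 = f1_score(0.9167, 0.6875)   # manually annotated data

absolute_gain = manual_f1 - auto_f1        # difference in F1 points
relative_gain = absolute_gain / auto_f1    # improvement relative to the baseline

print(f"F1 (automatic): {auto_f1:.4f}")
print(f"F1 (manual):    {manual_f1:.4f}")
print(f"Absolute gain:  {absolute_gain:.4f}")
print(f"Relative gain:  {relative_gain:.1%}")
```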


---

To see which entities each model identifies and how many of them, refer to the **interface.py** file in the corresponding model folder.
