# Task: Mountain Name Recognition Using spaCy
The goal is to identify mountain names in sentences using spaCy. The steps involve dataset creation, model training, and testing.

## Install Required Libraries

In [39]:
# Install the libraries used in this notebook
!pip install spacy openai datasets seqeval evaluate



## 1. Import Libraries and Define Constants
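We import the libraries used throughout the notebook and define the list of mountain names used for labeling. The OpenAI API key is read from the `OPENAI_API_KEY` environment variable in the next section.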

In [40]:
import os
import re
import spacy
import shutil
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    DataCollatorForTokenClassification, TrainingArguments, Trainer
)
from datasets import Dataset
from openai import OpenAI
from spacy.tokens import DocBin

In [3]:
# List of mountain names used to label the dataset
MOUNTAIN_NAMES = [
    'Everest', 'Kilimanjaro', 'Vesuvius', 'Fuji', 'St. Helens', 'K2',
    'Olympus', 'McKinley', 'Denali', 'Cook', 'Rainier', 'Kailash Mountain',
    'Rocky Mountains', 'Andes Mountain range', 'Blanc', 'Hengshan',
    'Appalachian Mountains', 'Eiger', 'Elbrus', 'Popa', 'Lemmon', 'Robson',
    'Rushmore', 'El Capitan', 'Huangshan'
]

## 2. Dataset Generation Using OpenAI GPT
Using OpenAI's GPT, generate synthetic sentences containing mountain names. This helps in creating a labeled dataset.

In [None]:
# OpenAI API initialization (the key is read from the OPENAI_API_KEY environment variable)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def generate_mountain_data_via_chatgpt(prompt, num_samples=10, temperature=1.0):
    """
    Generates mountain-related sentences using GPT-based generation.

    Args:
        prompt (str): The prompt for generating sentences.
        num_samples (int): Number of sentence samples to generate.
        temperature (float): Sampling temperature for GPT.

    Returns:
        list: Generated sentences.
    """
    responses = []
    for _ in range(num_samples):
        response = client.chat.completions.create(
            messages=[
                {"role": "user", "content": prompt},
            ],
            model="gpt-4",
            temperature=temperature,
        )
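        response_text = response.choices[0].message.content
        # Split the reply into lines, skip blank lines, and strip list numbering
        sentences = [line.strip() for line in response_text.split('\n') if line.strip()]
        sentences_cleaned = [re.sub(r'^\d+\.\s*', '', sentence) for sentence in sentences]
        responses.extend(sentences_cleaned)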
    return responses

# Generating and Saving Data
prompt = "Generate 10 different sentences that include the name of a mountain. Each sentence should be unique and describe a different aspect of the mountain or related topic."
generated_texts = generate_mountain_data_via_chatgpt(prompt, num_samples=5)
df = pd.DataFrame(generated_texts, columns=["sentence"])
df.to_csv('mountains.csv', index=False)
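
The raw responses can still contain wrapping quotes (the dataset preview shows sentences starting with `"`) and duplicates across API calls. The following is a minimal sketch of an extra cleaning pass, assuming a hypothetical helper named `clean_generated_lines` that is not part of the original notebook:

```python
import re

def clean_generated_lines(raw_lines):
    """Hypothetical helper: normalize raw model output lines into unique sentences."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        # Drop leading list numbering such as "1. " or "2) "
        text = re.sub(r'^\s*\d+[.)]\s*', '', line)
        # Strip surrounding whitespace and wrapping double quotes
        text = text.strip().strip('"').strip()
        # Skip blank lines and sentences that were already collected
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

example = [
    '1. "Mount Fuji is snow-capped."',
    '',
    '2. K2 is steep.',
    '1. "Mount Fuji is snow-capped."',
]
cleaned_example = clean_generated_lines(example)
print(cleaned_example)
```

Applying such a pass before `df.to_csv(...)` would keep blank rows and repeated sentences out of `mountains.csv`.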

## 3. Load and Process Dataset
The dataset `mountains.csv` contains sentences mentioning mountain names. It was generated in a previous step. Now, we will load the dataset and process it to label mountain names within the sentences.

In [7]:
dataset_path = "mountains.csv"
# Read the dataset into a DataFrame (the CSV was saved with index=False, so there is no index column)
df = pd.read_csv(dataset_path)
# Display the first few rows to verify the structure
print("Dataset preview:")
print(df.head())

Dataset preview:
                                            sentence
0  "The glistening snowcaps of Mount Everest towe...
1  "Hikers come from all over the world to tackle...
2  "Mount Vesuvius looms over the city of Pompeii...
3  "The sunlight reflecting off Mount Fuji's sere...
4  "Geologists are continuously monitoring the vo...


In [10]:
def label_sentences(sentences, mountain_names):
    """
    Labels each word of every sentence as 'MOUNTAIN' or 'O'.

    Args:
        sentences (list): List of sentences to label.
        mountain_names (list): List of mountain names to detect.

    Returns:
        list: One list of word-level labels per sentence.
    """
    # Matching is done word by word, so multi-word names such as
    # 'Rocky Mountains' or 'St. Helens' are never labeled here.
    names_lower = {name.lower() for name in mountain_names}
    labeled_sentences = []
    for sentence in sentences:
        words = re.findall(r'\b\w+\b', sentence)
        labeled_words = ['MOUNTAIN' if word.lower() in names_lower else 'O' for word in words]
        labeled_sentences.append(labeled_words)
    return labeled_sentences
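
Because `label_sentences` compares one word at a time, multi-word entries in `MOUNTAIN_NAMES` cannot match. One possible alternative is to match whole names on the raw text and then project the matches onto tokens with B-/I- tags. The sketch below assumes a hypothetical helper `bio_tag` and the BIO tagging scheme, neither of which appears in the original notebook:

```python
import re

def bio_tag(sentence, mountain_names):
    """Hypothetical helper: token-level B-/I-MOUNTAIN tags, multi-word names included."""
    # Tokenize into word tokens and remember their character offsets
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\w+', sentence)]
    tags = ['O'] * len(tokens)
    # Longest names first so a longer name wins over a shorter overlapping one
    names = sorted(mountain_names, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(map(re.escape, names)) + r')\b', re.IGNORECASE
    )
    for match in pattern.finditer(sentence):
        # Tokens that lie fully inside the matched name
        inside = [i for i, (_, start, end) in enumerate(tokens)
                  if start >= match.start() and end <= match.end()]
        for position, token_index in enumerate(inside):
            tags[token_index] = 'B-MOUNTAIN' if position == 0 else 'I-MOUNTAIN'
    return [tok for tok, _, _ in tokens], tags

tokens, tags = bio_tag("The Rocky Mountains and K2 are popular.", ["Rocky Mountains", "K2"])
print(list(zip(tokens, tags)))
```

The token list matches what `re.findall(r'\b\w+\b', ...)` produces in `label_sentences`, so the tags stay aligned with the existing word splitting.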

In [15]:
df['labels'] = label_sentences(df['sentence'], MOUNTAIN_NAMES)
print("Labeled sentences added to the dataframe.")
df.head()

Labeled sentences added to the dataframe.


|   | sentence | labels |
|---|----------|--------|
| 0 | "The glistening snowcaps of Mount Everest towe... | [O, O, O, O, O, MOUNTAIN, O, O, O, O, O, O, O,... |
| 1 | "Hikers come from all over the world to tackle... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, MOU... |
| 2 | "Mount Vesuvius looms over the city of Pompeii... | [O, MOUNTAIN, O, O, O, O, O, O, O, O, O, O, O,... |
| 3 | "The sunlight reflecting off Mount Fuji's sere... | [O, O, O, O, O, MOUNTAIN, O, O, O, O, O, O, O,... |
| 4 | "Geologists are continuously monitoring the vo... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] |


## 4. Split Dataset into Training and Evaluation Sets
Split the labeled dataset into training and evaluation sets to prepare for model training and validation.


In [16]:
def split_dataset(df, test_size=0.2):
    """
    Splits the dataset into training and evaluation sets.

    Args:
        df (DataFrame): The dataset.
        test_size (float): Proportion of the dataset to include in the evaluation split.

    Returns:
        tuple: Training and evaluation datasets.
    """
    return train_test_split(df, test_size=test_size, random_state=42)

# Split the dataset
train_sentences, eval_sentences = split_dataset(df)
print("Training and evaluation sets prepared.")

Training and evaluation sets prepared.


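## 5. Convert Data to spaCy Format
Convert the sentences into the format used to train a spaCy Named Entity Recognition (NER) model: a list of `(text, {"entities": [(start, end, label), ...]})` tuples, where `start` and `end` are character offsets into the sentence.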


In [17]:
def convert_to_spacy_format(df, label="MOUNTAIN"):
    """
    Converts data into SpaCy format.

    Args:
        df (DataFrame): Data to convert.
        label (str): Label for entities.

    Returns:
        list: Data in SpaCy format.
    """
    # Build the name pattern once instead of once per row
    pattern = re.compile(
        r'\b(?:' + '|'.join(map(re.escape, MOUNTAIN_NAMES)) + r')\b', re.IGNORECASE
    )
    spacy_data = []
    for _, row in df.iterrows():
        text = row["sentence"]
        # Character offsets of every mountain-name match in the sentence
        entities = [(m.start(), m.end(), label) for m in pattern.finditer(text)]
        spacy_data.append((text, {"entities": entities}))
    return spacy_data

# Convert train and eval datasets to SpaCy format (both splits are already DataFrames)
train_data_spacy = convert_to_spacy_format(train_sentences)
eval_data_spacy = convert_to_spacy_format(eval_sentences)
print("Data converted to SpaCy format.")

Data converted to SpaCy format.
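
The pattern above is compiled with `re.IGNORECASE`, so a name that is also a common word (for example *Cook*) would be tagged even when it appears in lowercase as an ordinary word. The sketch below illustrates the effect on a made-up sentence; `find_entities` and the two-name list are hypothetical and only serve this comparison:

```python
import re

NAMES = ["Cook", "Everest"]  # small illustrative subset, not the full list

def find_entities(text, names, ignore_case=True, label="MOUNTAIN"):
    """Return (start, end, label) character spans for each name found in text."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, names)) + r')\b', flags)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

text = "We cook dinner below Mount Cook."
loose = find_entities(text, NAMES, ignore_case=True)
strict = find_entities(text, NAMES, ignore_case=False)

loose_matches = [text[start:end] for start, end, _ in loose]
strict_matches = [text[start:end] for start, end, _ in strict]
print(loose_matches)   # includes the verb "cook"
print(strict_matches)  # only the capitalized name
```

Whether case-sensitive matching is preferable depends on how consistently the generated sentences capitalize mountain names; it trades fewer spurious labels for the risk of missing lowercase mentions.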


## 6. Save Data in spaCy Format
Serialize the training and evaluation data to binary `.spacy` files (`DocBin`), which is the format read by `spacy train`.


In [18]:
def save_to_spacy(data, output_path, nlp):
    """
    Saves data in SpaCy format.

    Args:
        data (list): Data to save.
        output_path (str): File path to save data.
        nlp: SpaCy language model.
    """
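    db = DocBin()  # DocBin is already imported at the top of the notebook
    for text, annotations in data:
        doc = nlp.make_doc(text)
        entities = annotations["entities"]
        # char_span returns None when the offsets do not align with token boundaries
        spans = [doc.char_span(start, end, label=label) for start, end, label in entities]
        # Drop misaligned spans so that doc.ents only receives valid Span objects
        spans = [span for span in spans if span is not None]
        doc.ents = spans
        db.add(doc)
    db.to_disk(output_path)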

# Save the datasets
nlp = spacy.blank("en")
save_to_spacy(train_data_spacy, "train.spacy", nlp)
save_to_spacy(eval_data_spacy, "eval.spacy", nlp)
print("SpaCy datasets saved.")

SpaCy datasets saved.


## 7. Train spaCy NER Model
Generate a CPU-oriented training config with `spacy init config`, then train a Named Entity Recognition (NER) pipeline on the prepared `.spacy` files.


In [19]:
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./eval.spacy
print("SpaCy NER model training completed.")

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----

## 8. Test the Trained spaCy Model
Load the best checkpoint from `./output/model-best` and run it on sample sentences, including some that contain no mountain names.

In [26]:
# Load the trained model
nlp_trained = spacy.load("./output/model-best")

def test_spacy_model(texts, model):
    """
    Tests the SpaCy model with sample texts.

    Args:
        texts (list): List of texts to test.
        model: Trained SpaCy model.
    """
    for text in texts:
        doc = model(text)
        if doc.ents:
            for ent in doc.ents:
                print(f"Entity: {ent.text}, Label: {ent.label_}")
        else:
            print(f"No mountain names found in: \"{text}\"")


# Test the trained model
sample_texts = [
    "Everest is one of the tallest mountains in the world.",
    "Mount Kilimanjaro is a dormant volcano located in Tanzania.",
    "Mount Fuji in Japan is an active volcano and a cultural icon.",
    "The Appalachian Mountains span multiple states in the eastern United States.",
    "Rocky Mountains are a major mountain range in North America.",
    "Huangshan, also known as Yellow Mountain, is famous for its scenic beauty in China.",
    "Table Mountain offers stunning views of Cape Town in South Africa.",
    "The Andes is the longest mountain range in the world.",
    "This is a simple sentence without any mountain names.",
    "A random sentence about hiking trails and beautiful landscapes.",
    "The tallest peak in Antarctica is Mount Vinson.",
    "Denali, formerly known as Mount McKinley, is the highest mountain in North America.",
    "Mont Blanc is the highest mountain in the Alps and Western Europe.",
    "A sentence without any significant geographical names."
]

test_spacy_model(sample_texts, nlp_trained)

Entity: Everest, Label: MOUNTAIN
Entity: Kilimanjaro, Label: MOUNTAIN
Entity: Fuji, Label: MOUNTAIN
Entity: Appalachian Mountains, Label: MOUNTAIN
Entity: Rocky Mountains, Label: MOUNTAIN
Entity: Huangshan, Label: MOUNTAIN
Entity: Yellow Mountain, Label: MOUNTAIN
Entity: Table Mountain, Label: MOUNTAIN
Entity: Cape Town, Label: MOUNTAIN
No mountain names found in: "The Andes is the longest mountain range in the world."
No mountain names found in: "This is a simple sentence without any mountain names."
No mountain names found in: "A random sentence about hiking trails and beautiful landscapes."
Entity: Vinson, Label: MOUNTAIN
Entity: Denali, Label: MOUNTAIN
Entity: McKinley, Label: MOUNTAIN
Entity: Blanc, Label: MOUNTAIN
No mountain names found in: "A sentence without any significant geographical names."


### Results Analysis of the spaCy NER Model

**Test Results:**

1. **Correctly Identified Entities:**
   - The model correctly identified the following mountain names, including *Yellow Mountain*, *Table Mountain*, and *Vinson*, which are not in `MOUNTAIN_NAMES`:
     - *Everest* (Label: MOUNTAIN)
     - *Kilimanjaro* (Label: MOUNTAIN)
     - *Fuji* (Label: MOUNTAIN)
     - *Appalachian Mountains* (Label: MOUNTAIN)
     - *Rocky Mountains* (Label: MOUNTAIN)
     - *Huangshan* (Label: MOUNTAIN)
     - *Yellow Mountain* (Label: MOUNTAIN)
     - *Table Mountain* (Label: MOUNTAIN)
     - *Vinson* (Label: MOUNTAIN)
     - *Denali* (Label: MOUNTAIN)
     - *McKinley* (Label: MOUNTAIN)
     - *Blanc* (Label: MOUNTAIN)

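2. **Missed Entities (False Negatives):**
   - The model failed to recognize the mountain name in:
     - *The Andes is the longest mountain range in the world.*
   - Possible reasons:
     - `MOUNTAIN_NAMES` contains only the longer form *Andes Mountain range*, so the bare name *Andes* was probably never labeled in the training data.
     - Variance in phrasing not covered in training.

3. **False Positives:**
   - *Cape Town* was labeled MOUNTAIN in the *Table Mountain* sentence, although it is a city.

4. **True Negatives:**
   - Sentences without mountain names were correctly identified as having no entities.

5. **Performance Overview:**
   - Strengths:
     - Successfully identifies most explicitly mentioned mountain names.
     - Handles multi-word names well, e.g., *Yellow Mountain*.
   - Weaknesses:
     - Misses name variants that were not labeled in training, e.g., *Andes*.
     - Occasionally tags other capitalized place names as mountains, e.g., *Cape Town*.
     - Tags only the bare name without the *Mount*/*Mont* prefix, e.g., *Blanc* rather than *Mont Blanc*, which mirrors how the training data was labeled.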

**Suggestions for Improvement:**

- **Expand Training Dataset:** Include more sentences with underrepresented mountains such as *Andes* and *Elbrus*.
- **Augment Data:** Use paraphrasing to create diverse expressions of mountain mentions.
- **Iterate Training:** Train with additional examples to improve recognition consistency.

**Next Steps:**
1. Review missed and misclassified examples for patterns.
2. Augment training data to address weaknesses.
3. Retrain and re-evaluate the model.
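
To make the review of missed and misclassified examples repeatable, predictions can be compared against hand-written expected entities and summarized as precision, recall, and F1. The sketch below is one way to do this under stated assumptions: `score_entities` is a hypothetical helper, and the gold sets are written by hand for two of the test sentences rather than taken from the notebook.

```python
def score_entities(gold, predicted):
    """Micro-averaged precision, recall, and F1 over per-sentence entity sets."""
    tp = fp = fn = 0
    for gold_set, pred_set in zip(gold, predicted):
        tp += len(gold_set & pred_set)   # predicted and expected
        fp += len(pred_set - gold_set)   # predicted but not expected
        fn += len(gold_set - pred_set)   # expected but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}

# Two sentences from the test run; the gold sets are hand-written assumptions
gold = [{"Table Mountain"}, {"Andes"}]
predicted = [{"Table Mountain", "Cape Town"}, set()]
scores = score_entities(gold, predicted)
print(scores)
```

With a trained pipeline, the predicted sets could be built as `{ent.text for ent in nlp_trained(text).ents}` for each test sentence.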


## 9. Create a Compressed Archive of the Model

To save the trained spaCy model as a compressed archive, we package the `model-best` directory into a `.tar.gz` file.


In [38]:
# Create a compressed archive of the model directory
shutil.make_archive("spacy_model", "gztar", "./output", "model-best")

'/content/spacy_model.tar.gz'
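
For completeness, the archive can be unpacked again with `shutil.unpack_archive`. The sketch below uses a stand-in directory with a dummy `meta.json` instead of a real trained model, and `restore_model_archive` is a hypothetical helper name:

```python
import os
import shutil
import tempfile

def restore_model_archive(archive_path, target_dir):
    """Unpack a .tar.gz model archive and return the path of the extracted model-best folder."""
    shutil.unpack_archive(archive_path, target_dir, "gztar")
    return os.path.join(target_dir, "model-best")

# Self-contained demo: a stand-in directory replaces the real trained model
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "output", "model-best")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "meta.json"), "w") as fh:
        fh.write("{}")

    # Same call shape as in the cell above, with paths inside the temp directory
    archive = shutil.make_archive(
        os.path.join(workdir, "spacy_model"), "gztar",
        os.path.join(workdir, "output"), "model-best",
    )
    restored = restore_model_archive(archive, os.path.join(workdir, "restored"))
    restored_files = sorted(os.listdir(restored))
    print(restored_files)
```

With a real archive, the returned path can be passed to `spacy.load(...)` in the same way as `./output/model-best`.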