# MultiCoNER 2 - Few-Shot NER with Gemini API

This notebook uses Google's Gemini API for few-shot Named Entity Recognition on the MultiCoNER 2 dataset.

## Overview
- **Model**: Gemini 2.5 Flash
- **Approach**: Few-shot prompting with entity verification
- **Dataset**: MultiCoNER 2 English (7 entity types)
- **Entity Types**: Artist, Politician, HumanSettlement, PublicCorp, ORG, Facility, OtherPER

## 1. Setup and Imports

In [None]:
# Import required libraries
import google.generativeai as genai
from google.colab import userdata
import pandas as pd
import json
import time
from tqdm.notebook import tqdm

# Enable tqdm for pandas
tqdm.pandas()

print("All libraries imported successfully.")

## 2. Configure Gemini API

In [None]:
# Get API key from Colab secrets
GOOGLE_API_KEY = userdata.get('gemini_api_key')
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize Gemini model
gemini_model = genai.GenerativeModel('gemini-2.5-flash')

print("Gemini API configured successfully.")
print(f"Model: {gemini_model.model_name}")

## 3. Define System Prompt

This prompt instructs the model to:
- Act as a slow, deliberate reasoning agent
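- Verify ambiguous entities by "searching the internet" (a prompt-level instruction only: this notebook configures no search tool, so the model falls back on its own knowledge)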
- Apply strict BIO tagging rules
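
The BIO rules that the prompt asks for can also be checked mechanically on the model's output. The following is a minimal sketch and not part of the original notebook: the helper name `is_valid_bio` and the `ALLOWED` set are illustrative assumptions, built from the seven categories listed in the overview.

```python
# Hypothetical helper (not in the original notebook): checks BIO well-formedness.
ALLOWED = {"Artist", "Politician", "HumanSettlement", "PublicCorp",
           "ORG", "Facility", "OtherPER"}

def is_valid_bio(tags):
    """Return True if every tag is well-formed and each I- tag
    continues an entity of the same category."""
    prev_category = None  # category of the entity currently open, if any
    for tag in tags:
        if tag == "O":
            prev_category = None
            continue
        prefix, sep, category = tag.partition("-")
        if sep != "-" or prefix not in ("B", "I") or category not in ALLOWED:
            return False  # unknown prefix or category
        if prefix == "I" and category != prev_category:
            return False  # I- tag with no matching B-/I- tag before it
        prev_category = category
    return True
```

A check like this could run on each prediction before evaluation, flagging sequences such as `["B-ORG", "I-Artist"]` that have the right length but break the tagging scheme.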

In [None]:
SYSTEM_PROMPT = """### SYSTEM INSTRUCTION
**MODE: ACCURACY-FIRST / RESEARCH-ENABLED**
You are a slow, deliberate reasoning agent. You must NOT guess.
1. **Scan**: Identify every proper noun or capitalized token in the input.
2. **Verify**: If you are not 100% sure of an entity's type (e.g., is "Finisterre" a place or a book?), you MUST pause and search the internet.
3. **Classify**: Apply the strict class definitions below based on your search results.
4. **Format**: Output only the final JSON list.

---

### Role
You are an expert linguist and data labeling specialist specifically trained for the MultiCoNER 2 Shared Task. You possess deep knowledge of fine-grained Named Entity Recognition (NER). **Crucially, you act as a researcher who verifies facts using the internet when faced with ambiguity.**

### Context
The user will provide a list of text tokens (words/sub-words) derived from search queries, social media, or noisy web text. Your task is to analyze these tokens and map each one to a specific Named Entity Recognition tag. The data contains ambiguity, typos (e.g., "united stats"), and lacks capitalization cues.

### Rules
1. **Input**: A JSON list of tokens (e.g., `["new", "york", "is", "big"]`).
2. **Task**: Assign a BIO (Begin, Inside, Outside) tag to every single token.
3. **Classes**: You must strictly use only the following 7 entity categories, plus `O` for non-entities:
   - **Artist**: Musicians, bands, actors, authors, directors, painters. (e.g., "simon mayo", "picasso")
   - **Politician**: Government officials, politicians, heads of state. (e.g., "obama", "frank d. o'connor")
   - **HumanSettlement**: Cities, towns, villages, states, countries, counties. (e.g., "busan", "cleveland", "ohio")
   - **PublicCorp**: Commercial companies, businesses, brands. (e.g., "safeway", "mcdonald 's", "s&p global ratings")
   - **ORG**: Non-commercial organizations, government agencies, political parties, sports teams, unions. (e.g., "democrat", "united stats census bureau", "real madrid")
   - **Facility**: Buildings, stadiums, airports, highways, public places. (e.g., "village hall", "lanxess arena")
   - **OtherPER**: Persons who are not artists or politicians (e.g., athletes, scientists, soldiers, fictional characters, or general people). (e.g., "zcrny", "peter bourne")
   - **O**: Tokens that are not part of a named entity (CRITICAL: This includes Books, Movies, Songs, Albums, and Products).
4. **Tagging Scheme**:
   - Use `B-<Category>` for the first token of an entity.
   - Use `I-<Category>` for all subsequent tokens of the same entity.
   - Use `O` for non-entities.

### Verification Strategy (CRITICAL)
**If you are unsure about a proper noun, you MUST SEARCH THE INTERNET.**
* **Ambiguity**: If a word looks like a name (e.g., "finisterre", "wclv", "zcrny") but you don't know it, pause and search for it.
* **Distinction Logic**:
    * If search shows it is a **Book, Movie, Album, or Product** -> Tag as **O**. (We do not have tags for these in this specific task).
    * If search shows it is a **Company** -> Check if it is commercial (**PublicCorp**) or non-profit/sports (**ORG**).
    * If search shows it is a **Person** -> Check if they are a politician (**Politician**), creator (**Artist**), or other (**OtherPER**).

### Constraints
1. **Length Consistency**: The output list MUST have exactly the same number of items as the input list.
2. **Format**: Output ONLY a raw JSON list of strings. Do not include markdown formatting, explanations, or notes.
3. **Robustness**: Treat lower-cased proper nouns as entities (e.g., "paris" -> B-HumanSettlement). Context is king.

### Examples
**Input**:
["frank", "d.", "o'connor", "(", "1909", "–", "1992", ")", "lawyer", "judge", "and", "politician", "−", "head", "trauma"]
**Output**:
["B-Politician", "I-Politician", "I-Politician", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]

**Input**:
["prior", "to", "the", "stabbings", "he", "was", "an", "employee", "of", "safeway", "."]
**Output**:
["O", "O", "O", "O", "O", "O", "O", "O", "O", "B-PublicCorp", "O"]

**Input**:
["its", "current", "representative", "is", "democrat", "bruce", "antone", "."]
**Output**:
["O", "O", "O", "O", "B-ORG", "B-Politician", "I-Politician", "O"]

**Input**:
["finisterre", "a", "1943", "poetry", "collection", "by", "eugenio", "montale"]
**Output**:
["O", "O", "O", "O", "O", "O", "B-Artist", "I-Artist"]
*(Note: Search reveals 'Finisterre' is a book, so it is tagged 'O')*
"""

print("System prompt defined.")
print(f"Prompt length: {len(SYSTEM_PROMPT)} characters")

## 4. Define Prediction Function

In [None]:
def get_few_shot_prediction_with_delay(model, system_prompt, tokens, delay_seconds=6.0):
    """
    Generates NER predictions using Gemini API with rate limiting.
    
    Args:
        model: The initialized Gemini generative model
        system_prompt: The complete prompt with instructions and examples
        tokens: List of tokens for which to generate predictions
        delay_seconds: Delay between API calls (default 6.0 to avoid 429 errors)
    
    Returns:
        list: Predicted BIO tags, or error message string if parsing fails
    """
    # Add delay to avoid rate limiting
    time.sleep(delay_seconds)
    
    # Construct final prompt
    final_prompt = f"{system_prompt}\n\nNew Tokens: {json.dumps(tokens)}\nTags:"
    
    try:
        # Call Gemini API
        response = model.generate_content(final_prompt)
        raw_response = response.text.strip()
        
        # Clean response if wrapped in markdown code blocks
        if raw_response.startswith("```json") and raw_response.endswith("```"):
            raw_response = raw_response[7:-3].strip()  # Remove ```json and ```
        elif raw_response.startswith("```") and raw_response.endswith("```"):
            raw_response = raw_response[3:-3].strip()  # Remove ``` and ```
        
        # Parse JSON response
        predicted_tags = json.loads(raw_response)
        
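        # Validate response format: a list of strings, one tag per input token
        if not (isinstance(predicted_tags, list) and all(isinstance(tag, str) for tag in predicted_tags)):
            return f"Error: Invalid format. Raw: {raw_response}"
        if len(predicted_tags) != len(tokens):
            return f"Error: Expected {len(tokens)} tags, got {len(predicted_tags)}. Raw: {raw_response}"
        return predicted_tags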
            
    except json.JSONDecodeError as e:
        return f"Error: JSON parsing failed: {e}. Raw: {raw_response}"
    except Exception as e:
        return f"Error: API call failed: {e}"

print("Prediction function defined.")
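
The function above returns an error string when the API call fails, so a transient failure (for example a 429 rate-limit response) costs that example its prediction. One possible mitigation is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not part of the original notebook: `predict_with_retry`, `max_attempts`, `base_delay`, and the injectable `sleep` parameter are all hypothetical names, and `predict_fn` stands for any callable that follows the same convention (a list on success, a string on failure).

```python
import time

def predict_with_retry(predict_fn, tokens, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call predict_fn(tokens) until it returns a list or attempts run out.

    Waits base_delay, then 2 * base_delay, and so on, between failed attempts.
    Returns the last result, which is an error string if every attempt failed.
    """
    result = "Error: no attempts made"
    for attempt in range(max_attempts):
        result = predict_fn(tokens)
        if isinstance(result, list):
            return result  # success: a list of tags
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # back off before retrying
    return result
```

With the notebook's names, it could be used as `predict_with_retry(lambda t: get_few_shot_prediction_with_delay(gemini_model, SYSTEM_PROMPT, t), tokens)`. Note that this retries on any error string, including malformed JSON, which may or may not be desirable.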

## 5. Load Data

In [None]:
# Load validation split
try:
    val_data = pd.read_json('val_split.jsonl', lines=True)
    print(f"✓ Loaded val_split.jsonl: {len(val_data)} examples")
    print(f"  Columns: {list(val_data.columns)}")
    display(val_data.head(3))
except FileNotFoundError:
    print("✗ Error: val_split.jsonl not found")
    val_data = pd.DataFrame()
except Exception as e:
    print(f"✗ Error loading val_split.jsonl: {e}")
    val_data = pd.DataFrame()

In [None]:
# Load test data
try:
    test_data = pd.read_json('test_data.jsonl', lines=True)
    print(f"✓ Loaded test_data.jsonl: {len(test_data)} examples")
    print(f"  Columns: {list(test_data.columns)}")
    display(test_data.head(3))
except FileNotFoundError:
    print("✗ Error: test_data.jsonl not found")
    test_data = pd.DataFrame()
except Exception as e:
    print(f"✗ Error loading test_data.jsonl: {e}")
    test_data = pd.DataFrame()

## 6. Generate Predictions on Validation Set

**Note**: This step can take a long time, because each example incurs:
- a 6-second delay (to avoid rate limits)
- the Gemini API's own processing time

Estimated time: ~6-10 seconds per example
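
To make the estimate concrete, the per-example range can be turned into a total runtime. This is a small worked example rather than code from the notebook; `estimate_runtime_minutes` is a hypothetical helper, and the dataset size of 1,000 is an arbitrary illustration, not the size of the actual split.

```python
def estimate_runtime_minutes(n_examples, seconds_per_example):
    """Total wall-clock minutes for a sequential run."""
    return n_examples * seconds_per_example / 60

n = 1000  # hypothetical dataset size, for illustration only
low = estimate_runtime_minutes(n, 6)    # delay only
high = estimate_runtime_minutes(n, 10)  # delay plus slower API responses
print(f"{n} examples: {low:.0f} to {high:.0f} minutes")
# prints "1000 examples: 100 to 167 minutes"
```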

In [None]:
if not val_data.empty and 'tokens' in val_data.columns:
    print("Generating predictions for validation set...")
    print(f"Total examples: {len(val_data)}")
    print(f"Estimated time: at least {len(val_data) * 6 / 60:.1f} minutes (rate-limit delay alone)\n")
    
    # Generate predictions with progress bar
    val_data['predicted_tags'] = val_data['tokens'].progress_apply(
        lambda tokens: get_few_shot_prediction_with_delay(
            gemini_model, 
            SYSTEM_PROMPT, 
            tokens
        )
    )
    
    print("\n✓ Predictions complete!")
    display(val_data.head())
else:
    print("✗ Cannot generate predictions: val_data is empty or missing 'tokens' column")

## 7. Save Validation Predictions

In [None]:
if not val_data.empty and 'predicted_tags' in val_data.columns:
    output_file = 'val_split_predictions.jsonl'
    val_data.to_json(output_file, orient='records', lines=True)
    print(f"✓ Validation predictions saved to {output_file}")
    print(f"  Total examples: {len(val_data)}")
else:
    print("✗ No predictions to save")

## 8. Evaluate on Validation Set

In [None]:
try:
    import utils
    print("✓ utils.py imported successfully")
    
    # Load predictions
    predictions_df = pd.read_json('val_split_predictions.jsonl', lines=True)
    
    # Extract ground truth and predicted labels
    ground_truth = predictions_df['ner_tags'].tolist()
    predicted = predictions_df['predicted_tags'].tolist()
    
    # Filter out error messages and length mismatches (keep only valid predictions)
    valid_pairs = [
        (gt, pred) for gt, pred in zip(ground_truth, predicted)
        if isinstance(pred, list) and len(pred) == len(gt)
    ]
    
    if valid_pairs:
        ground_truth_clean = [pair[0] for pair in valid_pairs]
        predicted_clean = [pair[1] for pair in valid_pairs]
        
        print(f"\nEvaluating {len(valid_pairs)} valid predictions...")
        print(f"Skipped {len(ground_truth) - len(valid_pairs)} errors\n")
        
        # Evaluate
        if hasattr(utils, 'evaluate_ner'):
            results = utils.evaluate_ner(ground_truth_clean, predicted_clean)
            print("\n=== Evaluation Results ===")
            print(results)
        else:
            print("✗ Error: evaluate_ner function not found in utils.py")
    else:
        print("✗ No valid predictions to evaluate")
        
except FileNotFoundError as e:
    print(f"✗ File not found: {e}")
except ImportError:
    print("✗ Error: utils.py not found")
except Exception as e:
    print(f"✗ Error during evaluation: {e}")
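
The notebook delegates scoring to `utils.evaluate_ner`, whose implementation is not shown here. NER is commonly scored at the entity level, where a prediction counts only if both the span boundaries and the category match exactly. The sketch below illustrates that idea with micro-averaged precision, recall, and F1. It is an assumption about what such an evaluation might compute, and the actual `utils.evaluate_ner` may differ; `extract_spans` and `micro_f1` are hypothetical names.

```python
def extract_spans(tags):
    """Convert a BIO tag list into a set of (start, end, category) spans; end is exclusive."""
    spans, start, category = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and category == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, category))  # close the open entity
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]  # open a new entity
    return spans

def micro_f1(gold_sequences, pred_sequences):
    """Micro-averaged entity-level precision, recall, and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

In this sketch, stray `I-` tags that do not continue an open entity are ignored rather than treated as entity starts; other evaluators make a different choice, which can shift scores slightly.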

## 9. Generate Predictions on Test Set

In [None]:
if not test_data.empty and 'tokens' in test_data.columns:
    print("Generating predictions for test set...")
    print(f"Total examples: {len(test_data)}")
    print(f"Estimated time: at least {len(test_data) * 6 / 60:.1f} minutes (rate-limit delay alone)\n")
    
    # Generate predictions with progress bar
    test_data['predicted_tags'] = test_data['tokens'].progress_apply(
        lambda tokens: get_few_shot_prediction_with_delay(
            gemini_model, 
            SYSTEM_PROMPT, 
            tokens
        )
    )
    
    print("\n✓ Predictions complete!")
    display(test_data.head())
else:
    print("✗ Cannot generate predictions: test_data is empty or missing 'tokens' column")

## 10. Save Test Predictions

In [None]:
if not test_data.empty and 'predicted_tags' in test_data.columns:
    output_file = 'test_data_predictions.jsonl'
    test_data.to_json(output_file, orient='records', lines=True)
    print(f"✓ Test predictions saved to {output_file}")
    print(f"  Total examples: {len(test_data)}")
else:
    print("✗ No predictions to save")

## Summary

### Files Generated:
- `val_split_predictions.jsonl` - Validation set with predictions
- `test_data_predictions.jsonl` - Test set with predictions

### Next Steps:
1. Review evaluation results on validation set
2. Analyze errors (look for patterns in misclassifications)
3. Optionally refine the system prompt based on error analysis
4. Submit test predictions for final evaluation
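
For the error-analysis step, one simple starting point is to count which gold tags are most often replaced by which predicted tags. The following is a minimal sketch, not part of the original notebook: `tag_confusions` is a hypothetical helper, and it assumes the `ner_tags` and `predicted_tags` columns used in the evaluation cell. Rows whose prediction is an error string or has the wrong length are skipped.

```python
from collections import Counter

def tag_confusions(gold_sequences, pred_sequences):
    """Count token-level (gold_tag, predicted_tag) pairs that disagree."""
    confusions = Counter()
    for gold, pred in zip(gold_sequences, pred_sequences):
        if not isinstance(pred, list) or len(pred) != len(gold):
            continue  # skip error strings and length mismatches
        for g, p in zip(gold, pred):
            if g != p:
                confusions[(g, p)] += 1
    return confusions
```

Calling `tag_confusions(predictions_df['ner_tags'], predictions_df['predicted_tags']).most_common(10)` would then list the ten most frequent mismatches, for example whether `B-ORG` is often predicted as `B-PublicCorp`.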

### Notes:
- This is a **few-shot learning** approach (no fine-tuning required)
- Expected performance: Lower than fine-tuned models (M8/M9) but requires no training
- Main advantages: Quick experimentation and no GPU needed
- Main disadvantages: Slow inference (6+ seconds per example) and API costs