<a href="https://colab.research.google.com/github/SPVillacorta/GeoNER-SchemaEval/blob/main/notebooks/geoner_schema_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating the Performance of a Trained NER Model using a Confusion Matrix

This notebook demonstrates how to evaluate the performance of a pre-trained Named Entity Recognition (NER) model using a confusion matrix and classification report. It specifically uses the Flair NLP framework and `scikit-learn` for evaluation.
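
As a rough illustration of what a confusion matrix and the per-label scores in a classification report measure, the sketch below counts (gold, predicted) tag pairs in plain Python. The tag names `ROCK` and `MINERAL` and the two toy sequences are hypothetical, chosen only for illustration; the notebook itself relies on `scikit-learn` for these computations.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (gold, predicted) label pairs for two aligned tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))

def per_label_scores(counts, label):
    """Return (precision, recall) for one label from the pair counts."""
    tp = counts[(label, label)]
    fp = sum(n for (g, p), n in counts.items() if p == label and g != label)
    fn = sum(n for (g, p), n in counts.items() if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical token-level tags (not taken from the OzROCK data)
gold = ["ROCK", "O", "MINERAL", "O", "ROCK", "O"]
pred = ["ROCK", "O", "ROCK", "O", "O", "O"]

counts = confusion_counts(gold, pred)
print(counts)
print(per_label_scores(counts, "ROCK"))
```

Each off-diagonal pair, such as `("MINERAL", "ROCK")`, is one cell of the confusion matrix where the model confused two labels.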

**Important Note:**
For reproducibility, this notebook expects the following files at these paths, relative to the repository root. The example PDFs and the annotated dataset come with the cloned repository; the model is downloaded separately in Section 3.
* A pre-trained model: `./ozrock/model/best-model.pt`
* Example PDF documents: `./ozrock/IRpdfs/`
* Annotated dataset: `./ozrock/annotated-ozrock.csv`
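
A quick way to confirm these files are in place before running the later cells is to check each path. This is a minimal sketch, assuming the working directory is the repository root; `find_missing` is a hypothetical helper, not part of the notebook.

```python
import os

# Paths listed above, relative to the repository root
REQUIRED_PATHS = [
    "./ozrock/model/best-model.pt",
    "./ozrock/IRpdfs/",
    "./ozrock/annotated-ozrock.csv",
]

def find_missing(paths, root="."):
    """Return the subset of `paths` that does not exist under `root`."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]

missing = find_missing(REQUIRED_PATHS)
if missing:
    print("Missing files or folders:", missing)
else:
    print("All expected files found.")
```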

## 0. Setup and Dependencies

First, we need to install the required libraries and download necessary NLTK data.

```python
# Install required libraries
# (pdfplumber depends on pdfminer.six, not the legacy 'pdfminer' package)
!pip install flair pdfplumber pdfminer.six scikit-learn pandas seaborn matplotlib numpy nltk gdown

# Import necessary modules
import glob
import nltk
import matplotlib.pyplot as plt
import numpy as np
import os
import pdfplumber
import pdfminer
import pandas as pd
import seaborn as sns
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from pdfminer.pdfdocument import PDFNoValidXRef
from pdfminer.psparser import PSEOF
from sklearn.metrics import confusion_matrix, classification_report
from typing import List

# Download NLTK sentence tokenizer data
nltk.download("punkt")
# Newer NLTK releases look for 'punkt_tab' instead; requesting both is harmless
nltk.download("punkt_tab")
```

## 1. Prepare Your Environment and Data (Google Colab Specific)
When running this notebook on Google Colab, clone the GitHub repository to access the example PDFs and annotated data. The model itself is downloaded separately in Section 3.

```python
# Clone the GitHub repository
# Replace 'SPVillacorta/GeoNER-SchemaEval' with your actual GitHub username and repository name if different
!git clone https://github.com/SPVillacorta/GeoNER-SchemaEval.git

# Change the current working directory to the cloned repository's root.
# This ensures all relative paths (e.g., './ozrock/...') work correctly.
import os
os.chdir("GeoNER-SchemaEval") # Make sure this matches your repository's folder name
print(f"Current working directory set to: {os.getcwd()}")
!ls -F # Verify the contents of the current directory; you should see 'ozrock/'
```
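
Re-running the cell above in the same Colab session tries to clone and `chdir` a second time, which fails once the working directory is already inside the repository. The sketch below shows one way to make the directory change safe to repeat; `enter_repo` is a hypothetical helper and assumes the folder name matches the repository name.

```python
import os

def enter_repo(repo_dir: str) -> str:
    """Change into `repo_dir` unless the working directory is already that folder."""
    if os.path.basename(os.getcwd()) != repo_dir:
        if not os.path.isdir(repo_dir):
            raise FileNotFoundError(
                f"'{repo_dir}' not found; clone the repository first."
            )
        os.chdir(repo_dir)
    return os.getcwd()

# Usage (after cloning): enter_repo("GeoNER-SchemaEval")
```

Calling `enter_repo("GeoNER-SchemaEval")` twice leaves the working directory unchanged the second time.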

## 2. PDF to Sentences Conversion
This function extracts text from PDF files within a specified directory and tokenizes it into sentences.

```python
def pdf_to_sentences(pdf_dir: str) -> List[str]:
    """Extract text from every PDF in `pdf_dir` and split it into sentences."""
    # Sort the paths so the sentence order is the same on every run
    pdf_paths = sorted(glob.glob(os.path.join(pdf_dir, "*.pdf")))
    all_sentences = []
    for pdf_path in pdf_paths:
        try:
            with pdfplumber.open(pdf_path) as pdf:
                # extract_text() may return None or "" for pages without a text layer
                text = "\n".join(page.extract_text() or "" for page in pdf.pages)
            all_sentences.extend(nltk.sent_tokenize(text))
        except (PDFNoValidXRef, PSEOF) as e:
            # Corrupt or truncated PDFs are reported and skipped
            print(f"Warning: Skipping file {pdf_path} due to an error. {e}")
    return all_sentences
```
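
Text extracted from PDFs often keeps hard line breaks inside a sentence, and the tokenizer can emit fragments that are only whitespace or a stray page number. The sketch below shows one optional clean-up pass over the returned sentences; `clean_sentences` and its `min_chars` threshold are hypothetical and not part of the original notebook.

```python
import re
from typing import Iterable, List

def clean_sentences(sentences: Iterable[str], min_chars: int = 3) -> List[str]:
    """Collapse runs of whitespace and drop fragments shorter than `min_chars`."""
    cleaned = []
    for sentence in sentences:
        sentence = re.sub(r"\s+", " ", sentence).strip()
        if len(sentence) >= min_chars:
            cleaned.append(sentence)
    return cleaned

# Made-up examples of raw tokenizer output
raw = ["The granite\nintrusion is  exposed.", "  \n ", "12"]
print(clean_sentences(raw))
```

It could be applied as `clean_sentences(pdf_to_sentences("./ozrock/IRpdfs/"))` before the sentences are passed to the tagger.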

## 3. Load the Pre-trained Model
Load the pre-trained Flair SequenceTagger model from the specified path.
The model is too large to host directly on GitHub, so it is first downloaded from Google Drive to `./ozrock/model/best-model.pt`.

```python
# *** IMPORTANT: Replace 'YOUR_GOOGLE_DRIVE_FILE_ID_HERE' with your actual Google Drive file ID ***
# To get the ID:
# 1. Upload 'best-model.pt' to your Google Drive.
# 2. Right-click the file and select "Share" -> "Share".
# 3. Change access to "Anyone with the link".
# 4. Copy the link. The ID is the part between /d/ and /view.
#    Example: https://drive.google.com/file/d/THIS_IS_THE_ID/view?usp=sharing
model_file_id = 'YOUR_GOOGLE_DRIVE_FILE_ID_HERE'

# Local path where the model will be saved within the Colab environment
model_local_path = './ozrock/model/best-model.pt'

# Ensure the destination folder exists
!mkdir -p ./ozrock/model

# Download the model (the file must be shared with "Anyone with the link").
# The URL form is used because the --id flag is deprecated in recent gdown releases.
print(f"Downloading model from Google Drive (ID: {model_file_id})...")
!gdown "https://drive.google.com/uc?id={model_file_id}" -O {model_local_path}

# Report success only if the file actually arrived
if not os.path.exists(model_local_path):
    raise FileNotFoundError(
        f"Download failed: {model_local_path} not found. Check the file ID and sharing settings."
    )
print("Model downloaded successfully.")

# Load the model from the local path
model = SequenceTagger.load(model_local_path)
```
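
The file ID can also be pulled out of a share link programmatically instead of by hand. This is a minimal sketch, assuming the link has the `/d/<ID>/view` shape shown in the comments above; `drive_file_id` is a hypothetical helper.

```python
import re
from typing import Optional

def drive_file_id(share_link: str) -> Optional[str]:
    """Return the segment that follows '/d/' in a Drive share link, or None."""
    match = re.search(r"/d/([^/?#]+)", share_link)
    return match.group(1) if match else None

link = "https://drive.google.com/file/d/THIS_IS_THE_ID/view?usp=sharing"
print(drive_file_id(link))
```

Links in other formats return `None`, which signals that the ID has to be copied manually.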