# Mini-ConExion: A Minimal Implementation of Concept Extraction with Large Language Models

## Requirements

### Installing libraries

First, we install the `datasets` library to get access to the *Inspec* and *SemEval2017* datasets.

Google Colab comes with `datasets` pre-installed, but we need to downgrade it because the required datasets are not compatible with the latest version.

In [1]:
!pip install datasets==3.6.0

Collecting datasets==3.6.0
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.5/491.5 kB 9.2 MB/s eta 0:00:00
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
Successfully installed datasets-3.6.0


You might need to restart your runtime after the installation is complete.

To do so, navigate to `Runtime > Restart session` or press `Ctrl+M .`.
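
After the restart, it can help to confirm that the downgrade took effect. The following is a minimal sketch using only the standard library; the helper name `installed_version` is ours and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Hypothetical check: after the restart this should report the pinned version.
print("datasets version:", installed_version("datasets"))
```

If the printed version is not the one pinned in the install cell, the runtime is most likely still using the previously loaded package and needs another restart.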

### Importing libraries

In [2]:
import os # Used to handle file paths
import re # For text processing
import pandas as pd # To navigate and save results
from tqdm.auto import tqdm # Handles progress and time estimations
from google.colab import ai # Our models!
from datasets import load_dataset # To load and use the SemEval2017 & Inspec datasets

## Listing available models and our datasets

In [3]:
ai.list_models()

['google/gemini-2.5-flash', 'google/gemini-2.5-flash-lite']

In [5]:
semeval = load_dataset("midas/semeval2017", "raw")
semeval

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata'],
        num_rows: 350
    })
    test: Dataset({
        features: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata'],
        num_rows: 50
    })
})

In [None]:
# Print the first sample of each split, field by field
split_labels = {"train": "training", "validation": "validation", "test": "test"}
for split, label in split_labels.items():
    print(f"Sample from {label} dataset split")
    sample = semeval[split][0]
    print("Fields in the sample: ", list(sample.keys()))
    print("Tokenized Document: ", sample["document"])
    print("Document BIO Tags: ", sample["doc_bio_tags"])
    print("Extractive/present Keyphrases: ", sample["extractive_keyphrases"])
    print("Abstractive/absent Keyphrases: ", sample["abstractive_keyphrases"])
    print("\n-----------\n")

Sample from training dataset split
Fields in the sample:  ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata']
Tokenized Document:  ['It', 'is', 'well', 'known', 'that', 'one', 'of', 'the', 'long', 'standing', 'problems', 'in', 'physics', 'is', 'understanding', 'the', 'confinement', 'physics', 'from', 'first', 'principles.', 'Hence', 'the', 'challenge', 'is', 'to', 'develop', 'analytical', 'approaches', 'which', 'provide', 'valuable', 'insight', 'and', 'theoretical', 'guidance.', 'According', 'to', 'this', 'viewpoint,', 'an', 'effective', 'theory', 'in', 'which', 'confining', 'potentials', 'are', 'obtained', 'as', 'a', 'consequence', 'of', 'spontaneous', 'symmetry', 'breaking', 'of', 'scale', 'invariance', 'has', 'been', 'developed', '[1].', 'In', 'particular,', 'it', 'was', 'shown', 'that', 'a', 'such', 'theory', 'relies', 'on', 'a', 'scale-invariant', 'Lagrangian', 'of', 'the', 'type', '[2]', '(1)L=14w2−12w−FμνaFaμν,', 'where', 'Fμνa
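
Each sample pairs a tokenized document with one BIO tag per token. As a rough illustration of how such tags relate to the extractive keyphrases, the sketch below groups tagged tokens into phrases. It assumes the tag values are the plain strings `"B"`, `"I"`, and `"O"`; the function name `bio_to_phrases` and the toy tags are made up for this example and are not taken from the dataset.

```python
def bio_to_phrases(tokens, tags):
    """Join each run of tokens tagged B (begin) then I (inside) into one phrase."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; close the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I" with no open phrase) ends the current phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Toy input: the tags are invented for illustration only.
tokens = ["spontaneous", "symmetry", "breaking", "of", "scale", "invariance"]
tags = ["B", "I", "I", "O", "B", "I"]
print(bio_to_phrases(tokens, tags))
```

On real samples, the decoded phrases can be compared against `extractive_keyphrases` to check whether the two fields agree.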

In [7]:
inspec = load_dataset("midas/inspec", "raw")
inspec

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata'],
        num_rows: 500
    })
})

In [None]:
# Print the first sample of each split, field by field
split_labels = {"train": "training", "validation": "validation", "test": "test"}
for split, label in split_labels.items():
    print(f"Sample from {label} dataset split")
    sample = inspec[split][0]
    print("Fields in the sample: ", list(sample.keys()))
    print("Tokenized Document: ", sample["document"])
    print("Document BIO Tags: ", sample["doc_bio_tags"])
    print("Extractive/present Keyphrases: ", sample["extractive_keyphrases"])
    print("Abstractive/absent Keyphrases: ", sample["abstractive_keyphrases"])
    print("\n-----------\n")

Sample from training dataset split
Fields in the sample:  ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata']
Tokenized Document:  ['A', 'conflict', 'between', 'language', 'and', 'atomistic', 'information', 'Fred', 'Dretske', 'and', 'Jerry', 'Fodor', 'are', 'responsible', 'for', 'popularizing', 'three', 'well-known', 'theses', 'in', 'contemporary', 'philosophy', 'of', 'mind', ':', 'the', 'thesis', 'of', 'Information-Based', 'Semantics', '-LRB-', 'IBS', '-RRB-', ',', 'the', 'thesis', 'of', 'Content', 'Atomism', '-LRB-', 'Atomism', '-RRB-', 'and', 'the', 'thesis', 'of', 'the', 'Language', 'of', 'Thought', '-LRB-', 'LOT', '-RRB-', '.', 'LOT', 'concerns', 'the', 'semantically', 'relevant', 'structure', 'of', 'representations', 'involved', 'in', 'cognitive', 'states', 'such', 'as', 'beliefs', 'and', 'desires', '.', 'It', 'maintains', 'that', 'all', 'such', 'representations', 'must', 'have', 'syntactic', 'structures', 'mirroring', 'the', 's

In [None]:
datasets = {
    "Inspec": inspec,
    "SemEval2017": semeval
}

stats_list = []

for name, ds in datasets.items():
    for split in ["train", "test"]:
        if split not in ds:
            continue

        data = ds[split]
        n_doc = len(data)

        # Word-based lengths (as used in the paper); documents are stored as token lists
        doc_lengths = [len(doc) for doc in data['document']]

        # Number of extractive keyphrases per document, counted as listed
        # (no lowercasing or deduplication is applied here)
        concept_counts = [len(kp_list) for kp_list in data['extractive_keyphrases']]

        # Concept distribution: number of documents with 1, 2, 3, 4, or >=5 concepts
        dist = {1: 0, 2: 0, 3: 0, 4: 0, ">=5": 0}
        for c in concept_counts:
            if c >= 5:
                dist[">=5"] += 1
            elif c in dist:
                dist[c] += 1
            # Note: 0 is ignored in the paper's distribution table

        # Compile row to match Table 2 format
        stats_list.append({
            "Dataset": name,
            "Set": split.capitalize(),
            "N_doc": n_doc,
            "Avg_doc": round(sum(doc_lengths) / n_doc, 2),
            "Max_doc": max(doc_lengths),
            "Max/Min/Avg_con": f"{max(concept_counts)} / {min(concept_counts)} / {round(sum(concept_counts) / n_doc, 2)}",
            "1 (%)": round((dist[1] / n_doc) * 100, 1),
            "2 (%)": round((dist[2] / n_doc) * 100, 1),
            "3 (%)": round((dist[3] / n_doc) * 100, 1),
            "4 (%)": round((dist[4] / n_doc) * 100, 1),
            ">=5 (%)": round((dist[">=5"] / n_doc) * 100, 1)
        })

df_stats = pd.DataFrame(stats_list)
print(df_stats.to_string(index=False))

    Dataset   Set  N_doc  Avg_doc  Max_doc Max/Min/Avg_con  1 (%)  2 (%)  3 (%)  4 (%)  >=5 (%)
     Inspec Train   1000   141.51      557   24 / 0 / 6.39    3.3    8.4   11.2   11.7     64.6
     Inspec  Test    500   134.60      384   27 / 0 / 6.57    3.0    8.6   10.6   12.0     65.0
SemEval2017 Train    350   160.51      355  29 / 2 / 11.98    0.0    0.6    0.6    3.1     95.7
SemEval2017  Test    100   190.40      297  27 / 4 / 12.26    0.0    0.0    0.0    3.0     97.0
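
The statistics cell counts keyphrases exactly as they are listed in the dataset. For comparison, here is a minimal sketch of a deduplicated, lowercased count and the same bucketing into 1, 2, 3, 4, and >=5 concepts. The helper names and the toy keyphrase lists are hypothetical and were not used to produce the table above.

```python
def concept_count(keyphrases):
    """Count distinct keyphrases after trimming whitespace and lowercasing."""
    return len({kp.strip().lower() for kp in keyphrases if kp.strip()})


def bucket_distribution(counts):
    """Percentage of documents with 1, 2, 3, 4, or >=5 concepts (0 is ignored)."""
    buckets = {"1": 0, "2": 0, "3": 0, "4": 0, ">=5": 0}
    for c in counts:
        if c >= 5:
            buckets[">=5"] += 1
        elif c >= 1:
            buckets[str(c)] += 1
    n = len(counts)
    if n == 0:
        return {k: 0.0 for k in buckets}
    return {k: round(100 * v / n, 1) for k, v in buckets.items()}


# Toy data: three documents, one with a case-only duplicate, one with no keyphrases.
keyphrase_lists = [
    ["Neural Network", "neural network", "dropout"],
    ["graph"],
    [],
]
counts = [concept_count(kps) for kps in keyphrase_lists]
print(counts)
print(bucket_distribution(counts))
```

As in the statistics cell, the percentages are taken over all documents, so they need not sum to 100 when some documents have no keyphrases.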


## Zero-Shot Concept Extraction & Evaluation


In [8]:
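def extract_keyphrases(database, model_name, num_samples, results_csv_name, shuffle=True, seed=77):
    """Prompt `model_name` for keyphrases on `num_samples` test documents of
    `midas/{database}`, score them by lowercased exact match, and save the results to CSV."""
    results = []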
    dataset = load_dataset(f"midas/{database}", "raw")
    # Track global counts for Micro-averaging
    total_hits = 0
    total_predicted_count = 0
    total_ground_truth_count = 0

    # 1. Selection
    if shuffle:
        test_subset = dataset["test"].shuffle(seed=seed).select(range(num_samples))
    else:
        test_subset = dataset["test"].select(range(num_samples))
    for i, sample in enumerate(tqdm(test_subset, total=num_samples, desc=f"Evaluating {model_name}")):
        document_text = " ".join(sample["document"])
        ground_truth = sample["extractive_keyphrases"]

        prompt = f"""
        Extract the most important keyphrases from the following technical document.
        Return them as a simple comma-separated list.

        Document: {document_text}

        Keyphrases:"""

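        try:
            predicted_text = ai.generate_text(prompt, model_name=model_name)
        except Exception as e:
            # A failed request is scored as an empty prediction (zero hits),
            # so API errors lower the reported scores for this run
            predicted_text = ""
            print(f"Error on sample {i}: {e}")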

        # Cleaning
        predicted_text = predicted_text.replace("\n", " ").strip()
        predicted_keyphrases = [k.strip().lower() for k in predicted_text.split(",") if k.strip()]

        gt_set = {k.lower() for k in ground_truth}
        pred_set = set(predicted_keyphrases)

        # Per-sample stats
        hits = len(gt_set.intersection(pred_set))
        p = hits / len(pred_set) if len(pred_set) > 0 else 0.0
        r = hits / len(gt_set) if len(gt_set) > 0 else 0.0
        f1 = 2 * (p * r) / (p + r) if (p + r) > 0 else 0.0

        # Accumulate for Micro-averaging
        total_hits += hits
        total_predicted_count += len(pred_set)
        total_ground_truth_count += len(gt_set)

        results.append({
            "document": document_text,
            "predicted": ", ".join(pred_set),
            "ground_truth": ", ".join(gt_set),
            "precision": round(p, 3),
            "recall": round(r, 3),
            "f1_score": round(f1, 3),
        })

    # Create DataFrame
    df_samples = pd.DataFrame(results)

    # --- MACRO CALCULATION ---
    # Macro-averaging: unweighted mean of the per-sample scores
    # (these were already rounded to 3 decimals when stored above)
    macro_precision = df_samples["precision"].mean()
    macro_recall = df_samples["recall"].mean()
    macro_f1 = df_samples["f1_score"].mean()

    # --- MICRO CALCULATION ---
    # Micro-averaging focuses on the total number of correct phrases vs total phrases generated
    micro_precision = total_hits / total_predicted_count if total_predicted_count > 0 else 0.0
    micro_recall = total_hits / total_ground_truth_count if total_ground_truth_count > 0 else 0.0
    micro_f1 = (2 * micro_precision * micro_recall / (micro_precision + micro_recall)
                if (micro_precision + micro_recall) > 0 else 0.0)

    # Save summary
    summary_file = "evaluation_summary_log.csv"
    summary_data = {
        "model_name": model_name,
        "dataset": database,
        "num_samples": num_samples,
        "macro_p": round(macro_precision, 3),
        "macro_r": round(macro_recall, 3),
        "macro_f1": round(macro_f1, 3),
        "micro_p": round(micro_precision, 3),
        "micro_r": round(micro_recall, 3),
        "micro_f1": round(micro_f1, 3)
    }
    print(f"PREDICTION AND EVALUATION COMPLETE!\n\n<-----Parameters----->\nDataset: {database}\nModel: {model_name}\nNumber of samples: {num_samples}\n")
    df_summary = pd.DataFrame([summary_data])
    df_summary.to_csv(summary_file, mode='a', header=not os.path.exists(summary_file), index=False)
    print(f"Model Summary added to {summary_file}")
    # Save per-sample results separately
    df_samples.to_csv(f"{results_csv_name}.csv", index=False)
    print(f"Per-sample results saved to {results_csv_name}.csv")
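
To make the difference between the two averages concrete, the sketch below recomputes them on two made-up samples using the same set-based exact-match logic as `extract_keyphrases`. The helper name `prf` and the toy phrases are ours; no model or dataset is involved.

```python
def prf(gt_set, pred_set):
    """Exact-match hits, precision, recall, and F1 for one document."""
    hits = len(gt_set & pred_set)
    p = hits / len(pred_set) if pred_set else 0.0
    r = hits / len(gt_set) if gt_set else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return hits, p, r, f1


# Toy samples: (ground truth, prediction). The second prediction is long and noisy.
samples = [
    ({"scale invariance", "confinement"}, {"scale invariance", "confinement"}),
    ({"semantics", "atomism", "language of thought", "beliefs"},
     {"semantics", "mind", "philosophy", "thesis", "desires", "syntax", "states", "structure"}),
]

scores = [prf(gt, pred) for gt, pred in samples]

# Macro: average the per-document F1 scores, each document weighted equally.
macro_f1 = sum(f1 for _, _, _, f1 in scores) / len(scores)

# Micro: pool hits and phrase counts over all documents, then compute F1 once.
total_hits = sum(hits for hits, _, _, _ in scores)
total_pred = sum(len(pred) for _, pred in samples)
total_gt = sum(len(gt) for gt, _ in samples)
micro_p = total_hits / total_pred
micro_r = total_hits / total_gt
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

In this toy case the macro score is higher than the micro score, because the document with the long, noisy prediction contributes most of the pooled phrase counts but only half of the per-document average.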

In [None]:
extract_keyphrases("semeval2017", "google/gemini-2.5-flash-lite", 100, "gemini-2.5-flash-lite_semeval")

Evaluating google/gemini-2.5-flash-lite:   0%|          | 0/100 [00:00<?, ?it/s]

PREDICTION AND EVALUATION COMPLETE!

<-----Parameters----->
Dataset: semeval2017
Model: google/gemini-2.5-flash-lite
Number of samples: 100

Model Summary added to evaluation_summary_log.csv
Per-sample results saved to gemini-2.5-flash-lite_semeval.csv


In [None]:
extract_keyphrases("semeval2017", "google/gemini-2.5-flash", 100, "gemini-2.5-flash_semeval")

Evaluating google/gemini-2.5-flash:   0%|          | 0/100 [00:00<?, ?it/s]

PREDICTION AND EVALUATION COMPLETE!

<-----Parameters----->
Dataset: semeval2017
Model: google/gemini-2.5-flash
Number of samples: 100

Model Summary added to evaluation_summary_log.csv
Per-sample results saved to gemini-2.5-flash_semeval.csv


In [None]:
extract_keyphrases("inspec", "google/gemini-2.5-flash-lite", 500, "gemini-2.5-flash-lite_inspec")

Evaluating google/gemini-2.5-flash-lite:   0%|          | 0/500 [00:00<?, ?it/s]

PREDICTION AND EVALUATION COMPLETE!

<-----Parameters----->
Dataset: inspec
Model: google/gemini-2.5-flash-lite
Number of samples: 500

Model Summary added to evaluation_summary_log.csv
Per-sample results saved to gemini-2.5-flash-lite_inspec.csv


In [9]:
extract_keyphrases("inspec", "google/gemini-2.5-flash", 500, "gemini-2.5-flash_inspec")

Evaluating google/gemini-2.5-flash:   0%|          | 0/500 [00:00<?, ?it/s]

Error on sample 459: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
Error on sample 460: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
Error on sample 461: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
Error on sample 462: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
Error on sample 463: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
Error on sample 464: Error code: 429 - {'message': 'Insufficient quota available to perform this operation. Try again later.', 'type': 'invalid_request_error'}
Error on sample 465: Error code: 429 - {
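
The 429 responses above are quota errors, and the evaluation loop records each failed request as an empty prediction. One common mitigation is to retry with exponential backoff before giving up. The sketch below is a generic helper, not part of the notebook: `call_with_retry` and its parameters are hypothetical, it retries on any exception because the exact exception class raised by `ai.generate_text` is not shown in the log, and the demo uses a fake function instead of a real API call.

```python
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt seconds and retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that fails twice and then succeeds.
# Delays are recorded instead of slept so the example runs instantly.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("Error code: 429")
    return "keyphrase one, keyphrase two"


result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)

# Inside the evaluation loop, the call could be wrapped like this (assumption):
# predicted_text = call_with_retry(lambda: ai.generate_text(prompt, model_name=model_name))
```

Backoff only helps with short-lived rate limits. If the quota is exhausted for a longer period, the remaining samples would still need to be rerun later.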

## Results

In [12]:
results = pd.read_csv("evaluation_summary_log.csv")
print(results.to_string(index=False))

                  model_name     dataset  num_samples  macro_p  macro_r  macro_f1  micro_p  micro_r  micro_f1
google/gemini-2.5-flash-lite semeval2017          100    0.372    0.445     0.387    0.357    0.440     0.394
     google/gemini-2.5-flash semeval2017          100    0.315    0.322     0.305    0.317    0.313     0.315
google/gemini-2.5-flash-lite      inspec          500    0.395    0.635     0.462    0.383    0.617     0.473
     google/gemini-2.5-flash      inspec          500    0.338    0.474     0.372    0.363    0.436     0.396

Note that the `google/gemini-2.5-flash` run on Inspec hit quota errors (code 429) starting at sample 459. Because failed requests are scored as empty predictions, the scores in its Inspec row are lower than they would be for a run without errors.
