# Tutorial: ICL Text Classifier

This notebook shows how to use the **`icl-text-classifier`** project step by step:

1. Install the package in editable mode (`pip install -e .`).
2. Inspect the example input CSV.
3. Create a YAML configuration file directly from the notebook (using `%%writefile`).
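4. Run the classifier from Python and inspect the results.
5. Load the JSONL output file and inspect it.
6. Convert the results into a document × class matrix.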

> **Important:** Place this notebook in the **root folder** of the project (the same folder that contains `pyproject.toml`).


## 1. Installation (editable mode)

The command below installs the project in **editable mode**, so that any change in the source code
is immediately reflected in the environment.

Run it from the project root (where this notebook lives):


In [1]:
# Install the package in editable (development) mode
!pip install -e .


Obtaining file:///home/marcacini/Documents/PROJETOS/ICL-TEXT-CLASSIFIER
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: icl-text-classifier
  Building editable for icl-text-classifier (pyproject.toml) ... done
  Created wheel for icl-text-classifier: filename=icl_text_classifier-0.1.0-0.editable-py3-none-any.whl size=4626 sha256=ee80a942194a1eb6c21ffa8b7687e2ba1f333e690a64a1fff9da81ee4a9c77f5
  Stored in directory: /tmp/pip-ephem-wheel-cache-d9k9yhro/wheels/fd/fb/1d/2c6e0b47e8c1c1cffb2caf126d7277f96770c75fc078fd0e26
Successfully built icl-text-classifier
Installing collected packages: icl-text-classifier
  Attempting uninstall: icl-text-classifier
    Found existing installation: icl-text-classifier 0.1.0
    Uninstalling icl-text-class
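
To confirm that the editable install is visible to the notebook kernel, the standard library's `importlib.metadata` can report the installed version. This is a minimal sketch: the distribution name `icl-text-classifier` is taken from the pip log above, and the helper name `installed_version` is only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution name as it appears in the pip output above.
print(installed_version("icl-text-classifier"))
```

If this prints `None`, the kernel is probably running in a different environment from the one `pip` installed into.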

## 2. Inspect the example input CSV

The project includes an example CSV file in `examples/input.csv`. It contains short texts, one per row,
each with an `ID` and a `TEXT` column.


In [2]:
import pandas as pd
from pathlib import Path

csv_path = Path("examples/input.csv")
print(csv_path.resolve())

df = pd.read_csv(csv_path)
df.head()


/home/marcacini/Documents/PROJETOS/ICL-TEXT-CLASSIFIER/examples/input.csv


|   | ID | TEXT |
|---|---|---|
| 0 | doc001 | The city hospital has been struggling with lon... |
| 1 | doc002 | A new primary school is being built in the out... |
| 2 | doc003 | Residents are worried about the recent increas... |
| 3 | doc004 | The local government launched a mental health ... |
| 4 | doc005 | Teachers report that the lack of updated textb... |
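
Because the classifier reads the columns named in the configuration (`ID` and `TEXT` here), it can help to check the CSV header before spending API calls. The sketch below uses only the standard library; `missing_columns` is a hypothetical helper, not part of the project.

```python
import csv
from pathlib import Path


def missing_columns(csv_file, required=("ID", "TEXT")):
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig also accepts files that start with a byte-order mark
    with open(csv_file, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    return [col for col in required if col not in header]


example = Path("examples/input.csv")
if example.exists():
    # An empty list means every required column is present.
    print("Missing columns:", missing_columns(example))
```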


## 3. Create a YAML configuration from the notebook

Here we create a configuration file named `examples/config_example.yaml` using the notebook
cell magic `%%writefile`.

Before running the classifier, you must **replace `YOUR_OPENROUTER_API_KEY` with your real key**.


In [8]:
%%writefile examples/config_example.yaml
model:
  name: "mistralai/mistral-nemo"
  base_url: "https://openrouter.ai/api/v1"
  api_key: "YOUR_OPENROUTER_API_KEY"  # <-- Replace with your real key
  temperature: 0.0

classification:
  system_prompt: |
    You are an expert classifier that assigns highly relevant classes
    to each input text, based on public policy themes.

    The predefined classes are:
    {CLASSES_DESCRIPTION}

    Rules:
    - Only assign a class if it is clearly supported by the text.
    - You may assign MORE THAN ONE class if they are all highly relevant.
    - If no class is clearly relevant, return an empty list.
    - Use only the class_id values provided.

  classes:
    - id: "health"
      description: >
        Texts about hospitals, clinics, primary care, emergency rooms,
        doctors, nurses, vaccination campaigns, mental health support,
        telemedicine, chronic disease monitoring, mobile clinics, and
        other healthcare services or policies.

    - id: "education"
      description: >
        Texts about schools, universities, teachers, curricula, exams,
        remote learning platforms, tutoring, literacy programs, scholarships,
        educational infrastructure, and teacher training.

    - id: "public_safety"
      description: >
        Texts about crime, robberies, assaults, domestic violence, drug
        trafficking, policing strategies, community policing, surveillance
        cameras, patrols, checkpoints, and measures to improve safety in
        public spaces or transport.

    - id: "governance"
      description: >
        Texts about justice and legal systems, anti-corruption campaigns,
        transparency, reporting misuse of public funds, and citizen oversight
        of government actions.

  csv_input_path: "examples/input.csv"
  id_column: "ID"
  text_column: "TEXT"

  num_threads: 4
  max_tries: 3

  output_path: "examples/output_classification_notebook.jsonl"


Writing examples/config_example.yaml
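
To avoid typing the key into the notebook itself, one option is to keep the placeholder in the cell and substitute it from an environment variable afterwards. This is a sketch under assumptions: the variable name `OPENROUTER_API_KEY` and the helper `fill_api_key` are hypothetical, and the project may offer its own mechanism.

```python
import os

PLACEHOLDER = "YOUR_OPENROUTER_API_KEY"


def fill_api_key(config_text: str, env_var: str = "OPENROUTER_API_KEY") -> str:
    """Replace the placeholder with the key stored in an environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return config_text.replace(PLACEHOLDER, key)


# Possible usage (assumes the config file written above exists):
# from pathlib import Path
# path = Path("examples/config_example.yaml")
# path.write_text(fill_api_key(path.read_text(encoding="utf-8")), encoding="utf-8")
```

The rewritten file still contains the real key, so it should stay out of version control.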


## 4. Run the classifier from Python

Now we import `ICLClassifier`, load the configuration we just wrote,
run the classification, and save the results.


In [5]:
from icl_classifier import ICLClassifier
import json
import logging

logging.disable(logging.CRITICAL) # disable logging for cleaner output

config_path = "examples/config_example.yaml"
classifier = ICLClassifier(config_path=config_path)
results = classifier.run()
classifier.save_results(results)  # uses output_path from YAML

print(f"Number of documents classified: {len(results)}")
print("\nFirst 3 results:\n")
print(json.dumps(results[:3], ensure_ascii=False, indent=2))


Classifying documents: 100%|██████████| 100/100 [00:27<00:00,  3.65it/s]

Number of documents classified: 100

First 3 results:

[
  {
    "doc_id": "doc001",
    "relevant_classes": [
      {
        "class_id": "health",
        "justification": "The text explicitly mentions 'city hospital', 'emergency room', 'nurses', and 'doctors', indicating healthcare services."
      }
    ]
  },
  {
    "doc_id": "doc002",
    "relevant_classes": [
      {
        "class_id": "education",
        "justification": "The text explicitly mentions 'primary school', 'education', and 'access to education', making it highly relevant to the 'education' class."
      }
    ]
  },
  {
    "doc_id": "doc003",
    "relevant_classes": [
      {
        "class_id": "public_safety",
        "justification": "The text explicitly mentions an increase in robberies, a public safety concern, and residents asking for more police patrols."
      }
    ]
  }
]
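
The configuration sets `num_threads: 4` and `max_tries: 3`. How `ICLClassifier` uses them internally is not shown in this notebook; the sketch below only illustrates the general pattern such settings usually control, namely a thread pool with per-document retries. All function names are hypothetical, and the model call is replaced by a stub.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_with_retries(classify_one, doc, max_tries=3):
    """Call classify_one(doc) up to max_tries times, re-raising the last error."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for _ in range(max_tries):
        try:
            return classify_one(doc)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise last_error


def classify_all(classify_one, docs, num_threads=4, max_tries=3):
    """Classify docs concurrently; results come back in input order."""
    def task(doc):
        return classify_with_retries(classify_one, doc, max_tries)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(task, docs))


# Stub standing in for the LLM request (hypothetical).
def fake_classify(doc):
    return {"doc_id": doc["ID"], "relevant_classes": []}


print(classify_all(fake_classify, [{"ID": "doc001"}, {"ID": "doc002"}]))
```

Threads suit this workload because each request spends most of its time waiting on the network rather than on the CPU.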





## 5. Inspect the JSONL output

The results are stored in a JSONL file (one JSON object per line), at the path
defined in `output_path` in the YAML configuration.

Here we load the file and inspect the first few entries.


In [6]:
import json

output_path = "examples/output_classification_notebook.jsonl"
records = []
with open(output_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))

len(records), records[:3]


(100,
 [{'doc_id': 'doc001',
   'relevant_classes': [{'class_id': 'health',
     'justification': "The text explicitly mentions 'city hospital', 'emergency room', 'nurses', and 'doctors', indicating healthcare services."}]},
  {'doc_id': 'doc002',
   'relevant_classes': [{'class_id': 'education',
     'justification': "The text explicitly mentions 'primary school', 'education', and 'access to education', making it highly relevant to the 'education' class."}]},
  {'doc_id': 'doc003',
   'relevant_classes': [{'class_id': 'public_safety',
     'justification': 'The text explicitly mentions an increase in robberies, a public safety concern, and residents asking for more police patrols.'}]}])
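
The system prompt asks the model to use only the configured `class_id` values and to return an empty list when nothing fits, but a language model may not always comply. A small check over the loaded records can surface both cases. The helper `check_records` and the sample data are illustrative; the allowed IDs are the four classes from the YAML above.

```python
ALLOWED_CLASS_IDS = {"health", "education", "public_safety", "governance"}


def check_records(records, allowed=ALLOWED_CLASS_IDS):
    """Return (docs with unknown class_ids, docs with no class assigned)."""
    unknown, unlabeled = {}, []
    for rec in records:
        ids = [c["class_id"] for c in rec.get("relevant_classes", [])]
        bad = [cid for cid in ids if cid not in allowed]
        if bad:
            unknown[rec["doc_id"]] = bad
        if not ids:
            unlabeled.append(rec["doc_id"])
    return unknown, unlabeled


# Made-up records, only to show the shape of the result.
sample = [
    {"doc_id": "a", "relevant_classes": [{"class_id": "health"}]},
    {"doc_id": "b", "relevant_classes": [{"class_id": "sports"}]},
    {"doc_id": "c", "relevant_classes": []},
]
print(check_records(sample))  # ({'b': ['sports']}, ['c'])
```

In the notebook, `check_records(records)` would run the same check on the list loaded from the JSONL file.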

## 6. Convert to a doc × class matrix

As an extra step, we can convert the JSONL results into a simple
document × class matrix using `pandas`, with 1 indicating that the
class was assigned to the document and 0 otherwise.


In [7]:
import pandas as pd

# Collect all unique class_ids
all_class_ids = set()
for rec in records:
    for c in rec.get("relevant_classes", []):
        all_class_ids.add(c["class_id"])

all_class_ids = sorted(all_class_ids)
print("Classes found:", all_class_ids)

rows = []
for rec in records:
    row = {"doc_id": rec["doc_id"]}
    assigned = {c["class_id"] for c in rec.get("relevant_classes", [])}
    for cid in all_class_ids:
        row[cid] = 1 if cid in assigned else 0
    rows.append(row)

matrix_df = pd.DataFrame(rows)
matrix_df


Classes found: ['education', 'governance', 'health', 'public_safety']


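|   | doc_id | education | governance | health | public_safety |
|---|---|---|---|---|---|
| 0 | doc001 | 0 | 0 | 1 | 0 |
| 1 | doc002 | 1 | 0 | 0 | 0 |
| 2 | doc003 | 0 | 0 | 0 | 1 |
| 3 | doc004 | 0 | 0 | 1 | 0 |
| 4 | doc005 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 95 | doc096 | 0 | 0 | 1 | 0 |
| 96 | doc097 | 1 | 0 | 0 | 0 |
| 97 | doc098 | 0 | 1 | 0 | 1 |
| 98 | doc099 | 0 | 0 | 1 | 0 |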
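
A natural follow-up is to count how many documents received each class and to list the documents that received more than one. The sketch below works on the same record structure with the standard library only; `summarize` is a hypothetical helper, and the three sample records mirror rows `doc001`, `doc002`, and `doc098` of the matrix.

```python
from collections import Counter


def summarize(records):
    """Count documents per class and list documents with more than one class."""
    per_class = Counter()
    multi_label = []
    for rec in records:
        # A set ignores a class_id repeated within one document.
        ids = {c["class_id"] for c in rec.get("relevant_classes", [])}
        per_class.update(ids)
        if len(ids) > 1:
            multi_label.append(rec["doc_id"])
    return per_class, multi_label


sample = [
    {"doc_id": "doc001", "relevant_classes": [{"class_id": "health"}]},
    {"doc_id": "doc002", "relevant_classes": [{"class_id": "education"}]},
    {"doc_id": "doc098", "relevant_classes": [{"class_id": "governance"},
                                              {"class_id": "public_safety"}]},
]
counts, multi = summarize(sample)
print(dict(sorted(counts.items())))
print(multi)  # ['doc098']
```

With the DataFrame already built, `matrix_df.drop(columns="doc_id").sum()` should give the same per-class totals.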
