# Eval Notebook: Multiclass Classification of Cultural Items
**Homework 1 - Multilingual Natural Language Processing**

**Authors**: Joshua Edwin & Clemens Kubach

**Team**: teXt-Men

This eval notebook is a minimal excerpt of our development notebooks. Its only purpose is to generate an output file with the predictions for a given test file, once for our LM-based approach and once for our non-LM-based approach.

For full explanations and code documentation, see our [GitHub](https://github.com/ClemensKubach/multilingual-nlp-homeworks) repository.

## General Code
Code that is required in both approaches.

**The only prerequisite before running the notebook:**

Place the test file (`test_unlabeled.csv`) where `TEST_FILEPATH` below points, or adjust that path!

In [None]:
!pip install -q pandas scikit-learn joblib requests tqdm
!pip install accelerate -U
!pip install datasets



In [None]:
from pathlib import Path
from typing import Iterable

import pandas as pd
import torch
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, BertForSequenceClassification

Mount Google Drive to access the shared folder containing our test dataset and intermediate results.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Define the location of the test_unlabeled.csv file. Because the Google Drive shared folder approach is used as recommended in the submission guidelines, the shared folder has to be added/linked to your own Drive.

To do so, right-click our "teXt-Men_shared_folder" in the Google Drive web UI and choose "Add shortcut". [Source](https://stackoverflow.com/questions/54351852/accessing-shared-with-me-with-colab)

In [None]:
TEST_FILEPATH = Path("/content/drive/MyDrive/teXt-Men_shared_folder/test_unlabeled.csv") # Path("test_unlabeled.csv")
if not TEST_FILEPATH.exists():
    raise FileNotFoundError(f"Test file {TEST_FILEPATH} not found!")

Define Label-ClassID mapping.

In [None]:
LABEL2ID = {
    "cultural agnostic": 0,
    "cultural representative": 1,
    "cultural exclusive": 2,
}
ID2LABEL = {class_id: label for label, class_id in LABEL2ID.items()}

## LM-based Approach

Load the test data and prepare it.

In [None]:
def aggregate_selected_columns(df: pd.DataFrame, selected_fields: Iterable[str]) -> pd.DataFrame:
    """Aggregate selected string columns to a single column with name 'text'.

    Combine str cols in the following pattern: name: <name>; type: <type>; ...
    via iterating over selected_fields.

    Args:
        df: Input dataframe that includes all selected fields.
        selected_fields: Fields that are aggregated, in the given order.

    Returns:
        Dataframe with aggregated text column.
    """
    minimal_out_fields = ["item", "name"]  # "label" will be added at the end
    out_fields = list(set(minimal_out_fields + list(selected_fields)))
    agg_df = df[out_fields].copy()
    agg_df['text'] = agg_df.apply(
        lambda row: '; '.join(f"{col}: {row[col]}" for col in selected_fields),
        axis=1
    )
    return agg_df
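As a quick illustration of the `name: <name>; type: <type>; ...` pattern, the same join can be reproduced on a plain dictionary. This is only a sketch: `build_text` and `toy_row` are hypothetical names, and the values are copied from the FC Bayern Munich row of the test set.

```python
SELECTED_COLUMNS = ("name", "type", "category", "subcategory", "description")


def build_text(row: dict, selected_fields=SELECTED_COLUMNS) -> str:
    """Join the selected fields as 'col: value' pairs separated by '; '."""
    return "; ".join(f"{col}: {row[col]}" for col in selected_fields)


# Hypothetical stand-in for one dataframe row.
toy_row = {
    "item": "http://www.wikidata.org/entity/Q15789",
    "name": "FC Bayern Munich",
    "type": "named entity",
    "category": "sports",
    "subcategory": "sports club",
    "description": "association football club in Munich, Germany",
}

text = build_text(toy_row)
print(text)
```

The order of the pairs follows `selected_fields`, not the column order of the dataframe.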

In [None]:
test_df = pd.read_csv(TEST_FILEPATH)
SELECTED_COLUMNS = ("name", "type", "category", "subcategory", "description")
test_df = aggregate_selected_columns(test_df, selected_fields=SELECTED_COLUMNS)
hf_dataset = Dataset.from_pandas(test_df)
print("Test Set:")
display(test_df.head())

Test Set:


| | item | type | subcategory | description | name | category | text |
|---|---|---|---|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q2427430 | concept | historical event | Zhang Xueliang's announcement on 29 December 1... | Northeast Flag Replacement | History | name: Northeast Flag Replacement; type: concep... |
| 1 | http://www.wikidata.org/entity/Q125482 | concept | religious leader | Islamic leadership position | imam | philosophy and religion | name: imam; type: concept; category: philosoph... |
| 2 | http://www.wikidata.org/entity/Q15789 | named entity | sports club | association football club in Munich, Germany | FC Bayern Munich | sports | name: FC Bayern Munich; type: named entity; ca... |
| 3 | http://www.wikidata.org/entity/Q582496 | named entity | government agency | program intended to eradicate hunger and extre... | Fome Zero | politics | name: Fome Zero; type: named entity; category:... |
| 4 | http://www.wikidata.org/entity/Q572811 | named entity | literary award | awards given at Bouchercon for mystery literature | Anthony Award | Literature | name: Anthony Award; type: named entity; categ... |


Define the final model to load from the Hugging Face Hub.

In [None]:
model_id = "ClemensK/cultural-bert-base-multilingual-cased-classifier"
model_version_commit = "83167f0257239616ccc5af2497b7195fbec09c22"

Inference: Load the model and tokenizer and apply them on the given test data.

If Colab asks for access to your `HF_TOKEN` secret, you can allow or deny the request; the token is not required.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=model_version_commit)
model = BertForSequenceClassification.from_pretrained(model_id, revision=model_version_commit)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = model.to(device)

def predict_batch(batch):
    """Define a batch-wise function for the model inference.

    Args:
        batch: Dict of columns for one batch; its "text" entries must follow
          the format produced during preprocessing
          (see aggregate_selected_columns function).

    Returns:
        Batch with predicted labels.
    """
    inputs = tokenizer(
        batch["text"],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = logits.argmax(dim=-1).tolist()
    batch["label"] = [model.config.id2label[i] for i in ids]
    return batch

Using device: cuda
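The last two lines of `predict_batch` turn logits into label strings. A minimal pure-Python sketch of that step follows; it assumes the model's `id2label` matches the `LABEL2ID` mapping defined earlier, and `toy_logits` are made-up numbers, not model outputs.

```python
import math

# Assumed to mirror model.config.id2label (taken from LABEL2ID in this notebook).
ID2LABEL = {0: "cultural agnostic", 1: "cultural representative", 2: "cultural exclusive"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_labels(batch_logits):
    """Pick the index of the largest logit per row and map it to its label."""
    labels = []
    for logits in batch_logits:
        best = max(range(len(logits)), key=logits.__getitem__)
        labels.append(ID2LABEL[best])
    return labels


toy_logits = [[2.1, 0.3, -1.0], [-0.5, 0.2, 1.7]]  # invented values
predicted = logits_to_labels(toy_logits)
print(predicted)  # ['cultural agnostic', 'cultural exclusive']
```

Because softmax is monotonic, taking the argmax of the logits gives the same class as taking the argmax of the probabilities, so the notebook can skip the softmax.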


Add the predicted class labels as a new column `label` and save the result as `teXt-Men_test_LM_predictions.csv` in the `LM_approach` subfolder of the shared folder.



In [None]:
predicted_hf_dataset = hf_dataset.map(predict_batch, batched=True, batch_size=32)
predicted_test_df = predicted_hf_dataset.to_pandas()
predicted_test_df.head()

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

In [None]:
predicted_file_path = "/content/drive/MyDrive/teXt-Men_shared_folder/LM_approach/teXt-Men_test_LM_predictions.csv" # TEST_FILEPATH.parent / "teXt-Men_test_LM_predictions.csv"
predicted_test_df.to_csv(predicted_file_path, index=False)

## Non-LM-based Approach - Rule-based Classifier

### Step 1: Install necessary libraries
This is already done above.

In [None]:
#!pip install -q pandas scikit-learn joblib matplotlib seaborn requests tqdm accelerate datasets huggingface_hub

### Step 2: Enrichment Instructions (Performed Locally)

Enrichment is not performed in this eval notebook.

We do not run the Wikidata enrichment directly in Google Colab for several reasons:
- Slow performance (5-7 seconds per request, compared to 1-3 seconds locally)
- API rate limits and throttling from Wikidata
- Risk of incomplete results due to network timeouts or disconnections (e.g. because of Colab usage limits)

To avoid these issues, the enrichment was performed beforehand in a local environment (e.g., PyCharm), and the fully enriched dataset was uploaded to Google Drive for use in this notebook:

File path:  
`/content/drive/MyDrive/teXt-Men_shared_folder/non-LM_approach/teXt-Men_test_enriched.csv`

**What is enrichment in this context?**

Enrichment refers to the process of enhancing the raw dataset (which only contains QIDs or URLs) by retrieving structured metadata from Wikidata for each item.
This metadata provides the necessary cultural and contextual information used in rule-based classification.

The enrichment process involves:
- Connecting to the Wikidata API using each QID
- Extracting the following culturally relevant properties:
    - `instance_of` (P31)
    - `part_of_culture` (P2596)
    - `heritage_status` (P1435)
    - `country_of_origin` (P495)
    - `country` (P17)
    - `located_in` (P131)

These properties were selected because they provide essential cultural, geographical, and institutional context. For example, `part_of_culture` and `heritage_status` help indicate cultural specificity, while `instance_of` describes the entity type. Geographic properties like `country` and `located_in` aid in determining regional or national affiliation.

**Example:**

Given a raw entry like:
item: https://www.wikidata.org/wiki/Q1331793


After enrichment, this item might return:

* qid: q1331793
* instance_of: q571
* part_of_culture: q1763564
* country_of_origin: q38


This enriched metadata allows the classifier to recognize that the item is a "book" (`q571`), linked to a specific cultural heritage, and originating from a particular country (`q38`, Italy).

**Formatting Requirement:**

All QIDs extracted during enrichment must be converted to lowercase (e.g., `q42` instead of `Q42`). This is necessary because the predefined golden QID lists used for classification store their keys in lowercase, and dictionary lookups are case-sensitive. The conversion can be applied during or after enrichment, for example with `.str.lower()`.

No enrichment scripts are provided in this step, as enrichment is assumed to be completed externally and the final enriched file is already available in the specified Google Drive location.
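Although the original enrichment script is not part of this notebook, the idea can be sketched as follows. This is a hypothetical illustration, not the code that produced the enriched file: `extract_qids` and `fetch_entity` are made-up helper names, the sketch assumes Wikidata's public `Special:EntityData` JSON endpoint, and `sample_entity` is a hand-written stand-in that mirrors the example above rather than a real API response.

```python
import json
import urllib.request

# Property IDs listed in this notebook, keyed by the enriched column name.
PROPERTIES = {
    "instance_of": "P31",
    "part_of_culture": "P2596",
    "heritage_status": "P1435",
    "country_of_origin": "P495",
    "country": "P17",
    "located_in": "P131",
}


def extract_qids(entity: dict, properties: dict = PROPERTIES) -> dict:
    """Collect lowercase item QIDs per property from one entity record."""
    claims = entity.get("claims", {})
    result = {}
    for column, pid in properties.items():
        qids = []
        for statement in claims.get(pid, []):
            datavalue = statement.get("mainsnak", {}).get("datavalue", {})
            value = datavalue.get("value")
            # Item-valued statements carry the target QID under "id".
            if isinstance(value, dict) and "id" in value:
                qids.append(value["id"].lower())
        result[column] = qids
    return result


def fetch_entity(qid: str) -> dict:
    """Download one entity record (needs network access; not called here)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    request = urllib.request.Request(
        url, headers={"User-Agent": "enrichment-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["entities"][qid]


# Hand-written stand-in for an entity record (assumption, not real data).
sample_entity = {
    "claims": {
        "P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q571"}}}}],
        "P2596": [{"mainsnak": {"datavalue": {"value": {"id": "Q1763564"}}}}],
        "P495": [{"mainsnak": {"datavalue": {"value": {"id": "Q38"}}}}],
    }
}

enriched = extract_qids(sample_entity)
print(enriched)
```

Keeping the parsing separate from the download makes it possible to test the extraction offline and to retry failed requests without re-parsing.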



### Step 3: How the Golden QID Lists were created

To build the rule-based classifier, golden QID lists were manually created for three key properties:

- instance_of (P31)
- part_of_culture (P2596)
- heritage_status (P1435)

The golden QIDs were selected through the following methodology:

#### Analysis of Training and Validation Data:

The datasets were enriched with Wikidata properties, and the most frequently associated QIDs for each label (Cultural Agnostic, Cultural Representative, Cultural Exclusive) were examined.

#### Identification of Strong Cultural Signals:

QIDs that appeared predominantly or exclusively with a particular cultural class were identified as strong indicators.

#### Manual Curation and Iterative Polishing:

Initial selections were manually curated based on data patterns and subsequently refined by analyzing misclassified examples during validation runs.

#### Final Application:

These curated golden QIDs form the direct basis for classification decisions in the rule-based system, providing clear and explainable signals for cultural labeling.
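The frequency analysis behind the first two steps could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' procedure: `candidate_golden_qids`, `min_count`, and `min_purity` are hypothetical, the QIDs are placeholders, and the toy rows are invented. The real lists were additionally curated by hand.

```python
from collections import Counter, defaultdict


def candidate_golden_qids(rows, min_count=3, min_purity=0.8):
    """Suggest {qid: label} for QIDs seen mostly with a single label.

    rows: iterable of (label, [qids]) pairs from an enriched, labeled dataset.
    min_count / min_purity: illustrative thresholds (assumptions).
    """
    per_qid = defaultdict(Counter)
    for label, qids in rows:
        for qid in set(qids):  # count each QID once per item
            per_qid[qid.lower()][label] += 1

    candidates = {}
    for qid, label_counts in per_qid.items():
        total = sum(label_counts.values())
        label, top = label_counts.most_common(1)[0]
        if total >= min_count and top / total >= min_purity:
            candidates[qid] = label
    return candidates


# Invented toy data with placeholder QIDs.
toy_rows = [
    ("Cultural Agnostic", ["Q-A"]),
    ("Cultural Agnostic", ["Q-A"]),
    ("Cultural Agnostic", ["Q-A"]),
    ("Cultural Representative", ["Q-B"]),
    ("Cultural Representative", ["Q-B"]),
    ("Cultural Exclusive", ["Q-B"]),
    ("Cultural Exclusive", ["Q-B"]),
]

candidates = candidate_golden_qids(toy_rows)
print(candidates)  # {'q-a': 'Cultural Agnostic'}
```

Here `q-a` always co-occurs with one label and is kept, while `q-b` is split evenly between two labels and is dropped as an ambiguous signal.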

In [None]:
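# Golden QIDs for "instance_of" property
# Note: some keys near the end of this dict repeat earlier entries (e.g. "q11424",
# "q223393", "q11460"). Python keeps the last value for a repeated key, so these
# resolve to "Cultural Exclusive".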
golden_instance_qids = {
    "q10517054": "Cultural Representative",
    "q18127": "Cultural Exclusive",
    "q202866": "Cultural Representative",
    "q23413": "Cultural Exclusive",
    "q2490224": "Cultural Representative",
    "q64578911": "Cultural Exclusive",
    "q20860083": "Cultural Exclusive",
    "q5107": "Cultural Representative",
    "q82794": "Cultural Representative",
    "q1914636": "Cultural Agnostic",
    "q334166": "Cultural Agnostic",
    "q811102": "Cultural Representative",
    "q477298": "Cultural Representative",
    "q620591": "Cultural Representative",
    "q7075402": "Cultural Representative",
    "q713623": "Cultural Agnostic",
    "q122985379": "Cultural Representative",
    "q1641122": "Cultural Exclusive",
    "q1331793": "Cultural Exclusive",
    "q186165": "Cultural Exclusive",
    "q6881511": "Cultural Exclusive",
    "q4182287": "Cultural Exclusive",
    "q783794": "Cultural Exclusive",
    "q40050": "Cultural Representative",
    "q56055944": "Cultural Agnostic",
    "q11862829": "Cultural Agnostic",
    "q112231559": "Cultural Exclusive",
    "q11488158": "Cultural Agnostic",
    "q12737077": "Cultural Agnostic",
    "q20827480": "Cultural Agnostic",
    "q10675206": "Cultural Agnostic",
    "q1799072": "Cultural Agnostic",
    "q47728": "Cultural Agnostic",
    "q2915955": "Cultural Agnostic",
    "q5303685": "Cultural Agnostic",
    "q1123037": "Cultural Exclusive",
    "q788460": "Cultural Exclusive",
    "q105543609": "Cultural Exclusive",
    "q55488": "Cultural Exclusive",
    "q1339195": "Cultural Exclusive",
    "q12819564": "Cultural Exclusive",
    "q820477": "Cultural Exclusive",
    "q740752": "Cultural Exclusive",
    "q532": "Cultural Exclusive",
    "q685309": "Cultural Exclusive",
    "q123705": "Cultural Exclusive",
    "q15042037": "Cultural Exclusive",
    "q207694": "Cultural Representative",
    "q1200701": "Cultural Representative",
    "q2668072": "Cultural Representative",
    "q746549": "Cultural Representative",
    "q105763925": "Cultural Exclusive",
    "q380829": "Cultural Exclusive",
    "q109734237": "Cultural Agnostic",
    "q12684": "Cultural Representative",
    "q4442611": "Cultural Representative",
    "q1787375": "Cultural Exclusive",
    "q561068": "Cultural Exclusive",
    "q5737899": "Cultural Representative",
    "q1781513": "Cultural Agnostic",
    "q2355817": "Cultural Agnostic",
    "q751876": "Cultural Representative",
    "q1686006": "Cultural Representative",
    "q17376093": "Cultural Exclusive",
    "q215380": "Cultural Exclusive",
    "q4220920": "Cultural Agnostic",
    "q66715801": "Cultural Agnostic",
    "q16521": "Cultural Agnostic",
    "q4894405": "Cultural Agnostic",
    "q223642": "Cultural Agnostic",
    "q2445904": "Cultural Agnostic",
    "q206615": "Cultural Agnostic",
    "q59152282": "Cultural Representative",
    "q166142": "Cultural Representative",
    "q841645": "Cultural Representative",
    "q620615": "Cultural Representative",
    "q10931": "Cultural Representative",
    "q13418847": "Cultural Representative",
    "q119328980": "Cultural Representative",
    "q5193493": "Cultural Representative",
    "q51041800": "Cultural Representative",
    "q895526": "Cultural Representative",
    "q45971958": "Cultural Agnostic",
    "q349": "Cultural Agnostic",
    "q234460": "Cultural Exclusive",
    "q45594": "Cultural Representative",
    "q22811234": "Cultural Agnostic",
    "q26971562": "Cultural Representative",
    "q10541491": "Cultural Representative",
    "q60147807": "Cultural Agnostic",
    "q1344963": "Cultural Agnostic",
    "q43229": "Cultural Agnostic",
    "q13219666": "Cultural Representative",
    "q18608583": "Cultural Representative",
    "q968159": "Cultural Representative",
    "q19861951": "Cultural Representative",
    "q749316": "Cultural Representative",
    "q32880": "Cultural Representative",
    "q1802587": "Cultural Exclusive",
    "q63187384": "Cultural Representative",
    "q124467322": "Cultural Representative",
    "q2500378": "Cultural Exclusive",
    "q15711797": "Cultural Exclusive",
    "q778575": "Cultural Representative",
    "q891723": "Cultural Representative",
    "q116474095": "Cultural Representative",
    "q1418640": "Cultural Exclusive",
    "q41773366": "Cultural Exclusive",
    "q1478437": "Cultural Representative",
    "q268592": "Cultural Agnostic",
    "q2142903": "Cultural Agnostic",
    "q2467478": "Cultural Agnostic",
    "q7748": "Cultural Exclusive",
    "q19464263": "Cultural Exclusive",
    "q16335296": "Cultural Agnostic",
    "q200250": "Cultural Representative",
    "q1066984": "Cultural Representative",
    "q515": "Cultural Representative",
    "q208511": "Cultural Representative",
    "q174844": "Cultural Representative",
    "q51929311": "Cultural Representative",
    "q108178728": "Cultural Representative",
    "q621993": "Cultural Agnostic",
    "q209928": "Cultural Representative",
    "q1192063": "Cultural Representative",
    "q627436": "Cultural Representative",
    "q31629": "Cultural Agnostic",
    "q114962596": "Cultural Agnostic",
    "q162875": "Cultural Exclusive",
    "q1115221": "Cultural Exclusive",
    "q189819": "Cultural Exclusive",
    "q26213387": "Cultural Agnostic",
    "q17431399": "Cultural Representative",
    "q868557": "Cultural Exclusive",
    "q107655869": "Cultural Exclusive",
    "q6055843": "Cultural Agnostic",
    "q21114848": "Cultural Agnostic",
    "q10383930": "Cultural Agnostic",
    "q15839299": "Cultural Exclusive",
    "q1040689": "Cultural Agnostic",
    "q110401222": "Cultural Representative",
    "q49773": "Cultural Representative",
    "q16334295": "Cultural Representative",
    "q3516833": "Cultural Agnostic",
    "q385378": "Cultural Agnostic",
    "q375336": "Cultural Exclusive",
    "q100663018": "Cultural Representative",
    "q15711870": "Cultural Representative",
    "q123126876": "Cultural Representative",
    "q21114371": "Cultural Representative",
    "q17279032": "Cultural Representative",
    "q599151": "Cultural Representative",
    "q132364": "Cultural Agnostic",
    "q294414": "Cultural Agnostic",
    "q2612572": "Cultural Agnostic",
    "q58051350": "Cultural Agnostic",
    "q111972893": "Cultural Agnostic",
    "q70208": "Cultural Exclusive",
    "q19730508": "Cultural Exclusive",
    "q270791": "Cultural Exclusive",
    "q15265344": "Cultural Exclusive",
    "q378427": "Cultural Exclusive",
    "q209495": "Cultural Exclusive",
    "q1493054": "Cultural Agnostic",
    "q131207": "Cultural Agnostic",
    "q201676": "Cultural Agnostic",
    "q28468127": "Cultural Representative",
    "q117599782": "Cultural Representative",
    "q570116": "Cultural Representative",
    "q214609": "Cultural Agnostic",
    "q103112098": "Cultural Representative",
    "q721830": "Cultural Exclusive",
    "q5009242": "Cultural Exclusive",
    "q1021645": "Cultural Representative",
    "q2319498": "Cultural Representative",
    "q109852002": "Cultural Agnostic",
    "q3624078": "Cultural Representative",
    "q1520223": "Cultural Representative",
    "q6256": "Cultural Representative",
    "q24699794": "Cultural Exclusive",
    "q1428660": "Cultural Agnostic",
    "q24034076": "Cultural Representative",
    "q32090": "Cultural Representative",
    "q169336": "Cultural Agnostic",
    "q1415395": "Cultural Exclusive",
    "q108874": "Cultural Exclusive",
    "q81647906": "Cultural Agnostic",
    "q1402592": "Cultural Representative",
    "q483247": "Cultural Agnostic",
    "q98374631": "Cultural Agnostic",
    "q55983715": "Cultural Agnostic",
    "q10429667": "Cultural Representative",
    "q658255": "Cultural Representative",
    "q15081030": "Cultural Exclusive",
    "q106859689": "Cultural Agnostic",
    "q16823155": "Cultural Representative",
    "q16560": "Cultural Representative",
    "q5": "Cultural Representative",
    "q107357104": "Cultural Representative",
    "q188451": "Cultural Representative",
    "q220505": "Cultural Representative",
    "q11424": "Cultural Representative",
    "q223393": "Cultural Representative",
    "q28640": "Cultural Representative",
    "q28823": "Cultural Representative",
    "q11460": "Cultural Representative",
    "q2095": "Cultural Representative",
    "q116658403": "Cultural Representative",
    "q483394": "Cultural Representative",
    "q201658": "Cultural Representative",
    "q7777573": "Cultural Representative",
    "q112248470": "Cultural Representative",
    "q1007870": "Cultural Representative",
    "q1107679": "Cultural Representative",
    "q1762059": "Cultural Representative",
    "q371174": "Cultural Representative",
    "q4830453": "Cultural Exclusive",
    "q847017": "Cultural Exclusive",
    "q64578911": "Cultural Exclusive",
    "q20860083": "Cultural Exclusive",
    "q223393": "Cultural Exclusive",
    "q11424": "Cultural Exclusive",
    "q1007870": "Cultural Exclusive",
    "q12973014": "Cultural Exclusive",
    "q327333": "Cultural Exclusive",
    "q11795121": "Cultural Exclusive",
    "q28972120": "Cultural Exclusive",
    "q1826286": "Cultural Exclusive",
    "q7777573": "Cultural Exclusive",
    "q220505": "Cultural Exclusive",
    "q107357104": "Cultural Exclusive",
    "q11460": "Cultural Exclusive"
}

# Golden QIDs for "heritage_status" property
golden_heritage_qids = {
    "q12126757": "Cultural Exclusive",
    "q12127133": "Cultural Exclusive",
    "q26086651": "Cultural Exclusive",
    "q811165": "Cultural Representative",
    "q1516079": "Cultural Representative",
    "q10387684": "Cultural Exclusive",
    "q10387575": "Cultural Exclusive",
    "q23668083": "Cultural Representative",
    "q385405": "Cultural Exclusive",
    "q43113623": "Cultural Representative",
    "q17297633": "Cultural Representative"
}

# Golden QIDs for "part_of_culture" property
golden_culture_qids = {
    "q66049360": "Cultural Representative",
    "q928904": "Cultural Representative",
    "q495348": "Cultural Representative"
}
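A Python dict literal silently keeps the last value when a key is repeated, and `golden_instance_qids` lists a few QIDs twice (for example `"q11424"`). A small check like the following can flag such repeats; `find_duplicate_keys` is a hypothetical helper and `sample` is a shortened stand-in for the real literal.

```python
import ast
from collections import Counter


def find_duplicate_keys(dict_source: str) -> dict:
    """Return {key: occurrences} for constant keys repeated in a dict literal."""
    node = ast.parse(dict_source, mode="eval").body
    if not isinstance(node, ast.Dict):
        raise ValueError("expected a dict literal")
    keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
    return {key: n for key, n in Counter(keys).items() if n > 1}


sample = (
    '{"q11424": "Cultural Representative", '
    '"q5": "Cultural Representative", '
    '"q11424": "Cultural Exclusive"}'
)

duplicates = find_duplicate_keys(sample)
effective = ast.literal_eval(sample)["q11424"]
print(duplicates)  # {'q11424': 2}
print(effective)   # Cultural Exclusive
```

The check has to run on the source text, because once the dict is built the earlier value is already gone.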

### Step 4: Rule-Based Classification System

The functions below implement the rule-based classification system.
They take enriched metadata as input and classify items based on a priority sequence:

1. heritage_status ➔ 2. part_of_culture ➔ 3. instance_of

If no strong signal is found, the item is classified as Cultural Agnostic.
The system can also evaluate predictions if ground-truth labels are available.

In [None]:
import pandas as pd
import ast
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix


def load_enriched_data(path: str) -> pd.DataFrame:
    """Load enriched CSV and parse list fields."""
    df = pd.read_csv(path, dtype={'qid': str})
    fields = ['country_of_origin', 'country', 'located_in',
              'part_of_culture', 'instance_of', 'heritage_status']
    for f in fields:
        df[f] = df[f].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
    return df


def classify(row: pd.Series, inst_map: dict, culture_map: dict, heritage_map: dict) -> str:
    """Rule-based classification: heritage > culture > instance."""
    for q in row['heritage_status']:
        label = heritage_map.get(q.lower())
        if label: return label
    for q in row['part_of_culture']:
        label = culture_map.get(q.lower())
        if label: return label
    for q in row['instance_of']:
        label = inst_map.get(q.lower())
        if label: return label
    return 'Cultural Agnostic'


def evaluate(df: pd.DataFrame, labels: list) -> None:
    """Print metrics and plot confusion matrix."""
    print(f"Accuracy: {accuracy_score(df['label'], df['prediction']):.2f}")
    print(classification_report(df['label'], df['prediction']))
    cm = confusion_matrix(df['label'], df['prediction'], labels=labels)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()


def run(enriched_path: str, inst_map: dict, culture_map: dict,
        heritage_map: dict, evaluate_flag: bool = False) -> pd.DataFrame:
    """Load data, predict, optionally evaluate, and return results."""
    df = load_enriched_data(enriched_path)
    df['prediction'] = df.apply(
        lambda r: classify(r, inst_map, culture_map, heritage_map), axis=1
    ).str.title()
    if 'label' in df.columns:
        df['label'] = df['label'].str.title()
    if evaluate_flag and 'label' in df.columns:
        evaluate(df, ['Cultural Agnostic', 'Cultural Representative', 'Cultural Exclusive'])
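    # Only return 'label' when evaluation was requested and the column exists.
    cols = ['qid', 'prediction'] + (['label'] if evaluate_flag and 'label' in df.columns else [])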
    return df[cols]
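To make the parsing and the priority order concrete, here is a self-contained toy run. Everything in it is illustrative: the maps use placeholder QIDs rather than entries from the golden lists, and `parse_list_field` and `classify_toy` are hypothetical re-implementations of the logic above.

```python
import ast

# Toy golden maps with placeholder QIDs (not taken from the real lists).
heritage_map = {"q-h1": "Cultural Exclusive"}
culture_map = {"q-c1": "Cultural Representative"}
instance_map = {"q-i1": "Cultural Representative"}


def parse_list_field(value):
    """Turn a stringified list such as "['Q1', 'Q2']" into a Python list."""
    return ast.literal_eval(value) if isinstance(value, str) else []


def classify_toy(row: dict) -> str:
    """Check heritage_status first, then part_of_culture, then instance_of."""
    ordered = (
        ("heritage_status", heritage_map),
        ("part_of_culture", culture_map),
        ("instance_of", instance_map),
    )
    for field, mapping in ordered:
        for qid in row[field]:
            label = mapping.get(qid.lower())
            if label:
                return label
    return "Cultural Agnostic"


raw_rows = [
    # All three signals present: heritage_status wins.
    {"heritage_status": "['Q-H1']", "part_of_culture": "['Q-C1']", "instance_of": "['Q-I1']"},
    # Missing heritage value (NaN in the CSV): falls through to instance_of.
    {"heritage_status": float("nan"), "part_of_culture": "[]", "instance_of": "['Q-I1']"},
    # No known QID at all: default label.
    {"heritage_status": "[]", "part_of_culture": "[]", "instance_of": "['Q-UNKNOWN']"},
]

parsed_rows = [{k: parse_list_field(v) for k, v in row.items()} for row in raw_rows]
labels = [classify_toy(row) for row in parsed_rows]
print(labels)
# ['Cultural Exclusive', 'Cultural Representative', 'Cultural Agnostic']
```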

### Step 5: Running the Classifier and Saving Results

In this step, the rule-based classifier is applied to an enriched dataset.
Predictions are generated for each item, and the results are optionally evaluated (if labels are available).

The predicted labels are saved to a CSV file for submission or further analysis.

In [None]:
import pandas as pd

in_path = '/content/drive/MyDrive/teXt-Men_shared_folder/non-LM_approach/teXt-Men_test_enriched.csv'  # replace with the path of your enriched file
out_path = '/content/drive/MyDrive/teXt-Men_shared_folder/non-LM_approach/teXt-Men_test_non-LM_predictions.csv'  # change accordingly

# Execute classifier (set evaluate_flag=True if the test set is labeled)
results = run(in_path, golden_instance_qids, golden_culture_qids, golden_heritage_qids, evaluate_flag=False)

# Save results; the 'label' column is only present when evaluation was requested
df = results[['qid', 'prediction'] + (['label'] if 'label' in results.columns else [])]
df.to_csv(out_path, index=False)
print(f"Predictions saved to {out_path}")

In [None]:
# Visualize Rule-Based Test Predictions
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/teXt-Men_shared_folder/non-LM_approach/teXt-Men_test_non-LM_predictions.csv')

print("First few predictions:")
display(df.head())

counts = df['prediction'].value_counts()
print("\nPrediction counts:")
print(counts)

### Step 6: Applying ML Fallback to Enhance Rule-Based Predictions

The rule-based classifier we implemented in the previous step works well for items whose metadata includes known `instance_of`, `part_of_culture`, or `heritage_status` QIDs mapped to predefined cultural categories. However, for many items — especially those not present in our golden QID lists — the classifier defaults to the `"Cultural Agnostic"` label. This results in an overprediction of agnostic cases and underrepresents more culturally specific classifications.

To address this limitation, we introduce a **hybrid approach** that leverages a **machine learning (ML) fallback model**. This model takes the `name` and `description` of fallback items (those initially labeled as `"Cultural Agnostic"`) and reclassifies them using a trained **TF-IDF + Logistic Regression pipeline**.

This step improves overall recall, reduces default classifications, and helps recover potentially misclassified cultural items based on learned textual patterns. Below, we implement this fallback logic.
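For readers unfamiliar with this kind of model, a TF-IDF + Logistic Regression pipeline can be built with scikit-learn roughly as sketched below. The hyperparameters and training texts are assumptions made for illustration: the actual fallback model is the pre-trained `tfidf_fallback_model.pkl`, whose configuration is not shown in this notebook, and the toy labels here are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy examples in the "name description" format used for the fallback.
train_texts = [
    "hammer hand tool used for striking",
    "bridge structure built to span an obstacle",
    "pizza Italian dish of flattened dough with toppings",
    "kimono traditional Japanese garment",
    "local harvest festival celebrated in a single village",
    "regional folk dance performed only in one valley",
]
train_labels = [
    "cultural agnostic",
    "cultural agnostic",
    "cultural representative",
    "cultural representative",
    "cultural exclusive",
    "cultural exclusive",
]

# Vectorize the text with TF-IDF, then fit a linear classifier on top.
fallback = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
fallback.fit(train_texts, train_labels)

new_texts = [
    "screwdriver hand tool for turning screws",
    "traditional garment worn at a village festival",
]
predictions = fallback.predict(new_texts)
print(list(predictions))
```

Because the vectorizer and classifier live in one pipeline object, the whole model can be saved and restored with a single `joblib.dump` / `joblib.load` call, which is how the notebook loads its fallback model.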


#### Step 6.1: Load Fallback Model and Prepare Test Data

In this step, we:
- Load the fallback model (a joblib pickle) from the shared Google Drive folder
- Load the rule-based predictions and raw test data
- Merge the text fields (`name` + `description`) for use in ML prediction

This prepares the data needed for the ML fallback in the next step.

In [None]:
import pandas as pd
import joblib


MODEL_PATH = (
    '/content/drive/MyDrive/teXt-Men_shared_folder/'
    'non-LM_approach/tfidf_fallback_model.pkl'
)
PRED_PATH = (
    '/content/drive/MyDrive/teXt-Men_shared_folder/'
    'non-LM_approach/teXt-Men_test_non-LM_predictions.csv'
)
RAW_PATH = (
    '/content/drive/MyDrive/teXt-Men_shared_folder/'
    'test_unlabeled.csv'
)


def load_fallback_model(path: str):
    model = joblib.load(path)
    print(f"Model loaded from: {path}")
    return model


def load_csv(path: str) -> pd.DataFrame:
    return pd.read_csv(path)


def prepare_raw_text(df: pd.DataFrame) -> pd.DataFrame:
    df['qid'] = df['item'].str.extract(r'(Q\d+)', expand=False)
    return df[['qid', 'name', 'description', 'item']]


def merge_data(preds: pd.DataFrame, raw: pd.DataFrame) -> pd.DataFrame:
    merged = preds.merge(raw, on='qid', how='left')
    merged['text'] = merged['name'].fillna('') + ' ' + merged['description'].fillna('')
    return merged


model = load_fallback_model(MODEL_PATH)
predictions = load_csv(PRED_PATH)
raw_text = prepare_raw_text(load_csv(RAW_PATH))
data = merge_data(predictions, raw_text)

print(
    "Fallback model and test data prepared successfully."
    f" Columns: {list(data.columns)}"
)





In [None]:
data[['qid', 'text']].head()


#### Step 6.2: Apply ML Fallback and Save Hybrid Predictions

In this step, we:
- Identify items labeled as "Cultural Agnostic" by the rule-based classifier
- Reclassify those items using the fallback ML model
- Tag each prediction with its source ("ML" or "QID")
- Save the final hybrid predictions to a CSV file for further analysis

In [None]:
import pandas as pd
import joblib

MODEL_PATH = '/content/drive/MyDrive/teXt-Men_shared_folder/non-LM_approach/tfidf_fallback_model.pkl'
model = joblib.load(MODEL_PATH)

OUTPUT_PATH = (
    '/content/drive/MyDrive/teXt-Men_shared_folder/'
    'non-LM_approach/teXt-Men_test_non-LM_hybrid_predictions.csv'
)

def apply_ml_fallback(df: pd.DataFrame, model) -> pd.DataFrame:
    if 'text' not in df.columns:
        raise ValueError("Missing 'text' column. Ensure merge_data() has been applied.")

    mask = (
        df['prediction'].fillna('').str.lower() == 'cultural agnostic'
    ) & (
        df['text'].str.strip().ne('')
    )
    count = mask.sum()
    print(f"Found {count} rows to reclassify." if count else "No rows to reclassify.")

    # Default source for every row; avoids a KeyError when nothing is reclassified.
    df['source'] = 'QID'

    if count:
        preds = model.predict(df.loc[mask, 'text'])
        df.loc[mask, 'prediction'] = (
            pd.Series(preds, index=df.loc[mask].index)
            .str.title()
        )
        df.loc[mask, 'source'] = 'ML'

    missing = df['prediction'].isna().sum()
    if missing:
        print(f"Warning: {missing} missing predictions after fallback.")
    else:
        print("All fallback predictions applied successfully.")

    return df

def save_hybrid(df: pd.DataFrame, path: str = OUTPUT_PATH) -> None:
    df['label'] = df['prediction'].str.lower()
    df[['qid', 'prediction', 'source', 'name', 'item', 'label']].to_csv(path, index=False)
    print(f"Hybrid predictions saved to: {path}")



In [None]:
hybrid_df = apply_ml_fallback(data, model)  # apply the fallback model to the data from Step 6.1
save_hybrid(hybrid_df)

In [19]:
hybrid_df

| | qid | prediction | label | name | description | item | text | source |
|---|---|---|---|---|---|---|---|---|
| 0 | Q2427430 | Cultural Representative | Cultural Representative | Northeast Flag Replacement | Zhang Xueliang's announcement on 29 December 1... | http://www.wikidata.org/entity/Q2427430 | Northeast Flag Replacement Zhang Xueliang's an... | QID |
| 1 | Q125482 | Cultural Representative | Cultural Agnostic | imam | Islamic leadership position | http://www.wikidata.org/entity/Q125482 | imam Islamic leadership position | ML |
| 2 | Q15789 | Cultural Exclusive | Cultural Exclusive | FC Bayern Munich | association football club in Munich, Germany | http://www.wikidata.org/entity/Q15789 | FC Bayern Munich association football club in ... | QID |
| 3 | Q582496 | Cultural Exclusive | Cultural Exclusive | Fome Zero | program intended to eradicate hunger and extre... | http://www.wikidata.org/entity/Q582496 | Fome Zero program intended to eradicate hunger... | QID |
| 4 | Q572811 | Cultural Exclusive | Cultural Exclusive | Anthony Award | awards given at Bouchercon for mystery literature | http://www.wikidata.org/entity/Q572811 | Anthony Award awards given at Bouchercon for m... | QID |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | Q4878968 | Cultural Representative | Cultural Representative | bed skirt | bedding accessory consisting of a flat and gat... | http://www.wikidata.org/entity/Q4878968 | bed skirt bedding accessory consisting of a fl... | QID |
| 296 | Q1361932 | Cultural Representative | Cultural Representative | family film | film genre; films intended for a family audien... | http://www.wikidata.org/entity/Q1361932 | family film film genre; films intended for a f... | QID |
| 297 | Q639669 | Cultural Agnostic | Cultural Agnostic | musician | person who composes, conducts or performs music | http://www.wikidata.org/entity/Q639669 | musician person who composes, conducts or perf... | ML |
| 298 | Q616077 | Cultural Agnostic | Cultural Agnostic | VJing | broad designation for realtime visual performance | http://www.wikidata.org/entity/Q616077 | VJing broad designation for realtime visual pe... | ML |
