# CultureScope: Cultural Specificity Classification (LM-Based Approach)

**Task:** Classify items into three cultural specificity categories:
- `cultural agnostic` - Universally known, no specific cultural ownership
- `cultural representative` - Associated with a culture but known globally
- `cultural exclusive` - Known primarily within a specific culture

**Model:** Fine-tuned DeBERTa-v3-base transformer

**HuggingFace Model:** [ArchitRastogi/NLP_HW_LM_non_tuned](https://huggingface.co/ArchitRastogi/NLP_HW_LM_non_tuned)

---

## Setup

In [1]:
# Install required packages
!pip install -q datasets huggingface_hub transformers torch pandas numpy scikit-learn hf_transfer

In [2]:
import os
import warnings
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import login, whoami
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

warnings.filterwarnings('ignore')

print("Setup complete.")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

Setup complete.
PyTorch version: 2.8.0+cu128
CUDA available: True
CUDA device: NVIDIA RTX A4500


## Configuration

In [None]:
# ----- CONFIGURATION -----

# Input: Choose one of the following options
USE_HUGGINGFACE_DATASET = True  # Set to False to use local CSV file
INPUT_CSV_PATH = "test.csv"     # Path to input CSV (used if USE_HUGGINGFACE_DATASET=False)

# HuggingFace dataset configuration
HF_DATASET_NAME = "sapienzanlp/nlp2025_hw1_cultural_dataset"  # Dataset name on HuggingFace
HF_DATASET_SPLIT = "SET HERE"                         # Split to use (e.g., "test", "validation")

if USE_HUGGINGFACE_DATASET:
    HF_TOKEN = ""  # Add HuggingFace token here, as the dataset is gated
    login(token=HF_TOKEN)
    print("Logged in as:", whoami())

# Model configuration
HF_MODEL_REPO = "ArchitRastogi/NLP_HW_LM_non_tuned"  # LM model repository

# Inference settings
BATCH_SIZE = 32         # Batch size for inference
MAX_LENGTH = 384        # Maximum sequence length

# Output configuration
OUTPUT_CSV_PATH = "predictions_lm_approach.csv"  # Output file path

# Device configuration
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"\nConfiguration:")
print(f"  Input: {'HuggingFace Dataset' if USE_HUGGINGFACE_DATASET else INPUT_CSV_PATH}")
print(f"  Model: {HF_MODEL_REPO}")
print(f"  Device: {DEVICE}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Output: {OUTPUT_CSV_PATH}")
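
Hardcoding the token in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the token has been exported in an environment variable named `HF_TOKEN` (the helper `resolve_hf_token` is hypothetical and not part of the original notebook):

```python
import os

def resolve_hf_token(explicit_token="", env_var="HF_TOKEN"):
    """Return the explicit token if non-empty, else the environment variable, else None."""
    token = explicit_token.strip() or os.environ.get(env_var, "").strip()
    return token or None

token = resolve_hf_token()
if token is None:
    print("No token found; loading the gated dataset will likely fail.")
else:
    print("Token found; pass it to login(token=token).")
```

With this in place, `HF_TOKEN = ""` in the configuration cell can stay empty and the value is picked up from the environment instead.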

## Load Test Data

In [4]:
if USE_HUGGINGFACE_DATASET:
    print(f"Loading dataset from HuggingFace: {HF_DATASET_NAME}")
    dataset = load_dataset(HF_DATASET_NAME, split=HF_DATASET_SPLIT)
    test_df = pd.DataFrame(dataset)
    print(f"Loaded {len(test_df)} samples.")
else:
    print(f"Loading dataset from: {INPUT_CSV_PATH}")
    test_df = pd.read_csv(INPUT_CSV_PATH)
    print(f"Loaded {len(test_df)} samples.")

# Display sample
print("\nSample data:")
display(test_df.head())

# Show column info
print("\nDataset columns:")
print(test_df.columns.tolist())
print(f"\nDataset shape: {test_df.shape}")

Loading dataset from HuggingFace: sapienzanlp/nlp2025_hw1_cultural_dataset


Generating train split:   0%|          | 0/6251 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/300 [00:00<?, ? examples/s]

Loaded 300 samples.

Sample data:


|   | item | name | description | type | category | subcategory | label |
|---|---|---|---|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q15786 | 1. FC Nürnberg | German sports club based in Nuremberg, Bavaria | entity | sports | sports club | cultural representative |
| 1 | http://www.wikidata.org/entity/Q268530 | 77 Records | UK record label | entity | music | record label | cultural exclusive |
| 2 | http://www.wikidata.org/entity/Q216153 | A Bug's Life | 1998 animated film directed by John Lasseter a... | entity | comics and anime | animated film | cultural representative |
| 3 | http://www.wikidata.org/entity/Q593 | A Gang Story | 2011 film by Olivier Marchal | entity | films | film | cultural exclusive |
| 4 | http://www.wikidata.org/entity/Q192185 | Aaron Copland | American composer, composition teacher, writer... | entity | performing arts | choreographer | cultural representative |



Dataset columns:
['item', 'name', 'description', 'type', 'category', 'subcategory', 'label']

Dataset shape: (300, 7)
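
When the input comes from a local CSV rather than the HuggingFace dataset, a missing column would only surface later as a `KeyError`. A small sketch of an early check, assuming the later cells need the six columns listed below (`REQUIRED_COLUMNS` and `missing_columns` are illustrative names, not part of the notebook):

```python
# Columns the input-text and output cells read from the dataframe (assumed list).
REQUIRED_COLUMNS = ["item", "name", "description", "type", "category", "subcategory"]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`, in order."""
    present = set(columns)
    return [c for c in required if c not in present]

# Column list copied from the output above; in the notebook this would be
# test_df.columns.tolist().
columns = ["item", "name", "description", "type", "category", "subcategory", "label"]

missing = missing_columns(columns)
if missing:
    raise ValueError(f"Input is missing columns: {missing}")
print("All required columns present.")
```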


## Load Model and Tokenizer

In [7]:
print(f"Loading model and tokenizer from: {HF_MODEL_REPO}")
print(f"Device: {DEVICE}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_REPO)
print("Tokenizer loaded")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_REPO)
model.to(DEVICE)
model.eval()
print("Model loaded and moved to device")

# Get label mappings
id2label = model.config.id2label
label2id = model.config.label2id

print(f"\nModel configuration:")
print(f"  Number of labels: {model.config.num_labels}")
print(f"  Label mapping: {id2label}")

Loading model and tokenizer from: ArchitRastogi/NLP_HW_LM_non_tuned
Device: cuda


Tokenizer loaded


Model loaded and moved to device

Model configuration:
  Number of labels: 3
  Label mapping: {0: 'cultural agnostic', 1: 'cultural representative', 2: 'cultural exclusive'}


## Prepare Input Texts

In [8]:
def create_input_text(row):
    """Create formatted input text from row data"""
    parts = [f"Item: {row['name']}"]
    
    if pd.notna(row.get('description')) and str(row.get('description')).strip():
        parts.append(f"Description: {row['description']}")
    
    if pd.notna(row.get('type')) and str(row.get('type')).strip():
        parts.append(f"Type: {row['type']}")
    
    if pd.notna(row.get('category')) and str(row.get('category')).strip():
        parts.append(f"Category: {row['category']}")
    
    if pd.notna(row.get('subcategory')) and str(row.get('subcategory')).strip():
        parts.append(f"Subcategory: {row['subcategory']}")
    
    return ". ".join(parts) + "."

# Create input texts
print("Creating input texts...")
input_texts = [create_input_text(row) for _, row in tqdm(test_df.iterrows(), total=len(test_df))]

print(f"\nCreated {len(input_texts)} input texts")
print("\nExample input text:")
print(input_texts[0][:300] + "..." if len(input_texts[0]) > 300 else input_texts[0])

Creating input texts...



Created 300 input texts

Example input text:
Item: 1. FC Nürnberg. Description: German sports club based in Nuremberg, Bavaria. Type: entity. Category: sports. Subcategory: sports club.
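
The example above has every field populated. To illustrate how empty or missing fields are skipped, here is a pandas-free sketch of the same formatting rule (`build_text`, `is_missing`, and the sample row are illustrative stand-ins for `create_input_text`, not the notebook's own code):

```python
import math

# (row key, prefix used in the formatted text), in the order the notebook appends them.
FIELDS = [
    ("description", "Description"),
    ("type", "Type"),
    ("category", "Category"),
    ("subcategory", "Subcategory"),
]

def is_missing(value):
    """True for None, float NaN, or whitespace-only values."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return not str(value).strip()

def build_text(row):
    """Join the item name and every non-missing field into one sentence-like string."""
    parts = [f"Item: {row['name']}"]
    for key, prefix in FIELDS:
        value = row.get(key)
        if not is_missing(value):
            parts.append(f"{prefix}: {value}")
    return ". ".join(parts) + "."

row = {"name": "abaya", "description": float("nan"), "type": "concept",
       "category": "fashion", "subcategory": ""}
print(build_text(row))
# prints: Item: abaya. Type: concept. Category: fashion.
```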


## Run Inference

In [9]:
print(f"Starting inference on {len(input_texts)} samples...")
print(f"Batch size: {BATCH_SIZE}")

all_predictions = []
all_confidences = []
all_probabilities = []

# Process in batches
for i in tqdm(range(0, len(input_texts), BATCH_SIZE), desc="Inference batches"):
    batch_texts = input_texts[i:i + BATCH_SIZE]
    
    # Tokenize
    inputs = tokenizer(
        batch_texts,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True
    )
    
    # Move to device
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = F.softmax(logits, dim=-1)
    
    # Get predictions and probabilities
    pred_classes = torch.argmax(probs, dim=-1).cpu().numpy()
    batch_probs = probs.cpu().numpy()
    
    # Store results
    for pred_class, prob_dist in zip(pred_classes, batch_probs):
        pred_label = id2label[pred_class]
        confidence = prob_dist[pred_class]
        
        all_predictions.append(pred_label)
        all_confidences.append(confidence)
        all_probabilities.append(prob_dist)

print(f"\nInference complete!")
print(f"  Predicted {len(all_predictions)} samples")

Starting inference on 300 samples...
Batch size: 32



Inference complete!
  Predicted 300 samples
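
The confidence stored for each sample is the softmax probability of the argmax class. A minimal pure-Python sketch of that step on made-up logits (the `softmax` and `predict` helpers and the logit values are illustrative; the label mapping is copied from the model configuration printed earlier):

```python
import math

ID2LABEL = {0: "cultural agnostic", 1: "cultural representative", 2: "cultural exclusive"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, id2label=ID2LABEL):
    """Return (label, confidence, full distribution) for one sample's logits."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return id2label[idx], probs[idx], probs

label, confidence, probs = predict([0.2, 2.1, 1.0])  # made-up logits
print(label, round(confidence, 3))
print("probabilities sum to", round(sum(probs), 6))
```

Because softmax is monotonic, taking the argmax over the logits would select the same class; the softmax is only needed to report probabilities.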


## Prepare Output

In [10]:
# Create output dataframe
output_df = test_df[['item', 'name', 'description', 'type', 'category', 'subcategory']].copy()
output_df['label'] = all_predictions

# Add probability columns
for idx, class_label in id2label.items():
    col_name = f'prob_{class_label.replace(" ", "_")}'
    output_df[col_name] = [p[idx] for p in all_probabilities]

print("Output dataframe created")
print(f"Shape: {output_df.shape}")
print(f"\nColumns: {output_df.columns.tolist()}")

Output dataframe created
Shape: (300, 10)

Columns: ['item', 'name', 'description', 'type', 'category', 'subcategory', 'label', 'prob_cultural_agnostic', 'prob_cultural_representative', 'prob_cultural_exclusive']
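
The three probability columns are named by replacing spaces in each label with underscores. As a quick sketch of that mapping, plus a sanity check that one row's probabilities sum to 1 (the helper names and the example probabilities are made up for illustration):

```python
ID2LABEL = {0: "cultural agnostic", 1: "cultural representative", 2: "cultural exclusive"}

def prob_column_names(id2label=ID2LABEL):
    """Map each class index to its output column name, e.g. 0 -> 'prob_cultural_agnostic'."""
    return {idx: "prob_" + label.replace(" ", "_") for idx, label in id2label.items()}

def row_with_probs(pred_label, probs, id2label=ID2LABEL):
    """Build one output record: the predicted label plus one column per class probability."""
    row = {"label": pred_label}
    for idx, col in prob_column_names(id2label).items():
        row[col] = probs[idx]
    return row

row = row_with_probs("cultural exclusive", [0.05, 0.25, 0.70])  # made-up probabilities
print(row)

# Sanity check: the per-class probabilities in a row should sum to 1.
total = sum(v for k, v in row.items() if k.startswith("prob_"))
assert abs(total - 1.0) < 1e-6
```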


## Save Predictions

In [11]:
# Save to CSV
output_df.to_csv(OUTPUT_CSV_PATH, index=False)
print(f"Predictions saved to: {OUTPUT_CSV_PATH}")

# Show prediction distribution
print("\nPrediction distribution:")
print(output_df['label'].value_counts())

# Display sample predictions
print("\nSample predictions:")
display(output_df[[
    "item",
    "name",
    "description",
    "type",
    "category",
    "subcategory",
    "label",  
]].head(10))

Predictions saved to: predictions_lm_approach.csv

Prediction distribution:
label
cultural agnostic          125
cultural representative     92
cultural exclusive          83
Name: count, dtype: int64

Sample predictions:


|   | item | name | description | type | category | subcategory | label |
|---|---|---|---|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q15786 | 1. FC Nürnberg | German sports club based in Nuremberg, Bavaria | entity | sports | sports club | cultural exclusive |
| 1 | http://www.wikidata.org/entity/Q268530 | 77 Records | UK record label | entity | music | record label | cultural exclusive |
| 2 | http://www.wikidata.org/entity/Q216153 | A Bug's Life | 1998 animated film directed by John Lasseter a... | entity | comics and anime | animated film | cultural representative |
| 3 | http://www.wikidata.org/entity/Q593 | A Gang Story | 2011 film by Olivier Marchal | entity | films | film | cultural representative |
| 4 | http://www.wikidata.org/entity/Q192185 | Aaron Copland | American composer, composition teacher, writer... | entity | performing arts | choreographer | cultural representative |
| 5 | http://www.wikidata.org/entity/Q265890 | Aarwangen Castle | castle in Aarwangen in the canton of Bern, Swi... | entity | architecture | architectural structure | cultural exclusive |
| 6 | http://www.wikidata.org/entity/Q305718 | abaya | simple, loose over-garment wore by humans, esp... | concept | fashion | traditional costume | cultural representative |
| 7 | http://www.wikidata.org/entity/Q337267 | Academy of San Carlos | Located in Mexico City, it was the first major... | entity | visual arts | art gallery | cultural exclusive |
| 8 | http://www.wikidata.org/entity/Q15 | Africa | continent | entity | geography | geographic location | cultural agnostic |
| 9 | http://www.wikidata.org/entity/Q388170 | African American literature | body of literature by Americans of African des... | entity | literature | literary genre | cultural representative |


## Results on the Validation Set

root@0c1fb484b143:/# python evaluate_predictions.py --valid valid.csv --predictions predictions_lm_approach.csv
Number of samples: 300
Accuracy: 0.7733
Macro F1: 0.7578

Per-class metrics:
                         precision    recall  f1-score   support

      cultural agnostic       0.87      0.93      0.90       117
     cultural exclusive       0.65      0.71      0.68        76
cultural representative       0.75      0.64      0.69       107

               accuracy                           0.77       300
              macro avg       0.76      0.76      0.76       300
           weighted avg       0.77      0.77      0.77       300
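
The `evaluate_predictions.py` script is not shown in this notebook. As a rough sketch of how accuracy and macro F1 are conventionally computed from two label lists (the function name and the toy labels are illustrative assumptions, not the script's actual code):

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)

    # Macro F1 is the unweighted mean of the per-class F1 scores.
    return accuracy, sum(f1_scores) / len(f1_scores)

# Toy labels, not taken from the dataset.
acc, macro_f1 = accuracy_and_macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"])
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f}")
# prints: accuracy=0.7500 macro_f1=0.7778
```

Macro averaging weights the three classes equally regardless of support, which is why the macro F1 reported above sits slightly below the accuracy: the two weaker classes count as much as the stronger `cultural agnostic` class.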

---

## Summary

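This notebook runs the LM-based approach to cultural specificity classification: it loads a fine-tuned transformer from the HuggingFace Hub, formats each item's fields into a single input text, and predicts one of the three labels in batches. On the 300-sample validation split, the predictions reach 0.7733 accuracy and 0.7578 macro F1.

**Model:** DeBERTa-v3-base fine-tuned on the cultural specificity classification task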

**Output:** `predictions_lm_approach.csv` with predicted labels and class probabilities

In [None]:
# Final summary
print("\n" + "="*70)
print("INFERENCE COMPLETE")
print("="*70)
print(f"\nOutput file: {OUTPUT_CSV_PATH}")
print(f"Total predictions: {len(output_df)}")
print(f"Mean confidence: {np.mean(all_confidences):.4f}")
print("\nDone.")