
🌐 Cultural Norm Classification

From CultureBank to Cultural Intelligence

Anik Mallick, Gaurav Tiwari

Quick links: [Dataset] - [Hugging Face Models] - [Models]


🚀 Running the Inference App

1 · Clone the repository

git clone https://github.com/AnikMallick/norm-classifier
cd norm-classifier

2 · Install dependencies

Python 3.10+ is recommended.

pip install -r requirements.txt

3 · Download the RoBERTa model

The fine-tuned RoBERTa classifier is hosted on Hugging Face and is not included in this repo due to its size.
Run the setup script once — it will download the model into model-dir/ automatically.

python setup_model.py

What this does: fetches roberta-base-classifier-v01 from anik-owl/roberta_norm_classifier and saves it to ./model-dir/roberta-base-classifier-v01/.
All other model files (LR classifiers, FAISS indexes) are already included in the repo.
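For reference, a minimal sketch of what a script like setup_model.py can do with huggingface_hub (the repo id comes from the text above; the actual script may differ in details):

```python
# Hypothetical sketch; the real setup_model.py may differ.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="anik-owl/roberta_norm_classifier",
    local_dir="model-dir/roberta-base-classifier-v01",  # assumed layout
)
print(f"Model downloaded at: {local_path}")
```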

Expected output:

Model downloaded at: ./model-dir/roberta-base-classifier-v01


4 · Launch the app

streamlit run app/app.py

The app will open in your browser at http://localhost:8501.


Directory structure after setup

artifacts/
├── lr_model_fe_v01.pkl            ← included in repo
├── lr_model_fe_v02.pkl            ← included in repo
├── norm_faiss.index               ← included in repo
├── notnorm_faiss.index            ← included in repo
├── doc_norm_faiss.index           ← included in repo
├── doc_notnorm_faiss.index        ← included in repo
└── ...
model-dir/                         ← ignored by git
└── roberta-base-classifier-v01/   ← downloaded by setup_model.py
    ├── config.json
    ├── model.safetensors
    └── ...

Troubleshooting

| Issue | Fix |
| --- | --- |
| `ModuleNotFoundError` | Make sure you ran `pip install -r requirements.txt` |
| `OSError: model path not found` | Re-run `python setup_model.py` |
| `faiss` install fails on Apple Silicon | Use `pip install faiss-cpu` instead |
| Slow first load | Models are cached after the first run; subsequent loads are faster |


🔍 Overview

This project presents a comprehensive NLP pipeline for cultural norm classification using the CultureBank dataset (Reddit + TikTok, ~22,990 samples). Two interconnected objectives are addressed:

  1. Binary Norm Classification — distinguishing cultural norms from generic statements
  2. Cultural Group Attribution — identifying which cultural group a norm belongs to

These are treated as a cascade: a sentence classified as Norm is passed to the group identifier.
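In code, the cascade is simply conditional inference; a minimal sketch with hypothetical norm_classifier and group_identifier objects:

```python
def classify(sentence: str) -> dict:
    """Stage 1: binary norm detection; Stage 2: group attribution for norms only."""
    result = {"is_norm": bool(norm_classifier.predict([sentence])[0])}   # hypothetical
    if result["is_norm"]:
        result["group"] = group_identifier.predict([sentence])[0]        # hypothetical
    return result
```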

✨ Key Contributions

| Component | Description |
| --- | --- |
| 🔤 Deontic Cue Scorer | Linguistic rule signals (obligation, prohibition, judgement, convention) fused with neural confidence scores |
| 🔎 FAISS Retrieval Scorer | Class-discriminative similarity scores against labelled training corpora |
| 🧠 916-dim Fusion Framework | Unified feature matrix enabling classical classifiers to approach transformer-level performance |

🏆 Best Results at a Glance

| Model | Norm F1 | Macro F1 |
| --- | --- | --- |
| RoBERTa-base (full text + threshold) ⭐ | 65.4% | 80.2% |
| Fusion 916 — LR (default) | 64.8% | 80.8% |
| Fusion 916 — Bagged LR + threshold | 64.4% | 80.4% |
| TAPT + ULMFiT V02 | 63.8% | 80.1% |

📊 Dataset

CultureBank

| Source | Size | Rows |
| --- | --- | --- |
| Reddit corpus | 19.7 MB | 11,236 |
| TikTok corpus | 16.2 MB | 11,754 |
| Merged | – | 22,990 |

Each row contains 17 columns encoding cultural observations: cultural group, context, goal, relation, actor/recipient behaviour, topic, agreement score, scenario, and more.

Label Construction

The topic column is mapped as follows:

  • Social Norms and Etiquette → Label 1 (Norm)
  • All other topics → Label 0 (Not Norm)

Class Distribution

| Class | Count | Percentage |
| --- | --- | --- |
| Norm (1) | 2,225 | 9.7% |
| Not Norm (0) | 20,765 | 90.3% |
| Total | 22,990 | 100% |

⚠️ The ~1:10 class imbalance is the defining challenge of this project. All design decisions — class-weighted loss, threshold tuning, sampling strategies — must be viewed through this lens.

Data Quality Filtering

| Step | Decision |
| --- | --- |
| Missing values in `context` (~145 rows) | Dropped via Complete Case Analysis (CCA) |
| Agreement filtering | Retained only rows with agreement ≥ 0.70 |

Post-filtering dataset statistics:

| Metric | Value |
| --- | --- |
| Total observations | 22,845 |
| After agreement filter + CCA | 18,169 |
| Class imbalance ratio | ~9.4 : 1 |
| Unique cultural groups | 250+ |

Train / Validation / Test Split

A stratified 60% / 20% / 20% split is maintained consistently across all experiments to prevent leakage.
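A minimal sketch of such a split with scikit-learn (random_state is an assumption; X and y are the filtered features and labels):

```python
from sklearn.model_selection import train_test_split

# 60/20/20: carve off 40%, then split that half-and-half into val/test,
# stratifying on the label both times to preserve the ~9.4:1 imbalance.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```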


🏗️ Pipeline Architecture

Raw CultureBank Data (22,990 samples)
         │
         ▼
┌─────────────────────────┐
│  Preprocessing          │
│  - CCA (missing vals)   │
│  - Agreement filter     │
│  - Text normalisation   │
│  - NER masking (V1/V2)  │
└─────────┬───────────────┘
          │
    ┌─────┴──────────────────────────────┐
    │                                    │
    ▼                                    ▼
TF-IDF / LSA                  Sentence Embeddings
(classical ML)                (MiniLM / MPNet)
    │                                    │
    └──────────────┬─────────────────────┘
                   │
                   ▼
        ┌──────────────────┐
        │  LSTM / RoBERTa  │  ← transformer fine-tuning
        └──────────────────┘
                   │
                   ▼
       ┌───────────────────────┐
       │  Fusion (916-dim)     │
       │  LSA + MiniLM +       │
       │  Deontic + FAISS +    │
       │  RoBERTa signals      │
       └───────────────────────┘
                   │
          ┌────────┴────────┐
          ▼                 ▼
  Binary Norm        Cultural Group
  Classifier         Identifier

🔧 Data Preprocessing & Feature Engineering

Column Selection

Three columns carry the most discriminative signal (per EDA with ydata-profiling):

| Column | Description |
| --- | --- |
| `eval_whole_desc` | Full evaluation-oriented behavioural description (170–981 chars, median ~781) |
| `context` | Situational context (e.g., "in public", "at work") |
| `eval_persona` | Hypothetical persona posing the question |

Text Normalisation

Lowercasing → Punctuation removal → Stop-word removal (spaCy English)
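A minimal sketch of this normalisation step with spaCy (en_core_web_sm is an assumed pipeline name):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def normalise(text: str) -> str:
    # lowercase, drop punctuation (non-alphabetic tokens) and stop words
    doc = nlp(text.lower())
    return " ".join(tok.text for tok in doc if tok.is_alpha and not tok.is_stop)
```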

NER-Based Entity Masking

Two masking strategies implemented using spaCy NER:

| Strategy | Mapping |
| --- | --- |
| V1 | GPE, NORP, LOC → `[LOCATION]` |
| V2 | GPE/LOC → `[LOCATION]`, NORP → `[GROUP]`, PERSON → `[PERSON]`, ORG → `[ORG]`, LANGUAGE → `[LANGUAGE]` |

Rationale: masking America in a sample dropped the model's America confidence by ~5%, confirming that unmasked features leak group identity rather than normative content.
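A sketch of the V1 strategy using spaCy NER (pipeline name assumed; the repo's masking code may differ):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
V1_MAP = {"GPE": "[LOCATION]", "NORP": "[LOCATION]", "LOC": "[LOCATION]"}

def mask_entities(text: str) -> str:
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in V1_MAP:
            out.append(text[last:ent.start_char] + V1_MAP[ent.label_])
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)
```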

TF-IDF Configuration

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=30_000,
    min_df=3,
    ngram_range=(1, 2)
)

LSA

Truncated SVD grid-searched over {10, 50, 100, 500, 1000, 2500, 5000, 10000} components.
Optimal: n_components = 10,000 → Cross-validated ROC-AUC of 0.9347
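A sketch of that grid search (logistic regression is an assumed downstream scorer; X_tfidf is the TF-IDF matrix from above):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for k in [10, 50, 100, 500, 1000, 2500, 5000, 10000]:
    pipe = make_pipeline(TruncatedSVD(n_components=k, random_state=42),
                         LogisticRegression(max_iter=1000))
    auc = cross_val_score(pipe, X_tfidf, y, cv=5, scoring="roc_auc").mean()
    print(f"n_components={k}: ROC-AUC={auc:.4f}")
```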

Sentence Embeddings

| Model | Dimensions |
| --- | --- |
| `all-MiniLM-L6-v2` | 384 |
| `all-mpnet-base-v2` | 768 |
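Embedding with sentence-transformers is a one-liner; a minimal example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["Tipping is customary in American restaurants."])
print(emb.shape)  # (1, 384)
```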

🧪 Experiments & Results

Baseline Models

Raw concatenated text (eval_whole_desc + context + eval_persona) vectorised with TF-IDF:

| Model | Norm Prec | Norm Recall | Norm F1 | AUC |
| --- | --- | --- | --- | --- |
| Logistic Regression | 78.7% | 26.9% | 40.1% | 0.94 |
| Linear SVC ⭐ | 71.3% | 41.3% | 52.3% | 0.92 |
| XGB (gblinear) | 71.1% | 37.0% | 48.7% | 0.91 |
| SVM (RBF) | 82.5% | 26.9% | 40.6% | 0.94 |
| XGB (gbtree) | 10.7% | 98.6% | 19.3% | ❌ |

XGB (gbtree) collapses to predicting all-Norm — tree methods are highly sensitive to imbalance without explicit correction.

NER-Based Entity Masking

Effect of V1 masking (Linear SVC shows the largest improvement):

| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 | Δ vs Baseline |
| --- | --- | --- | --- | --- | --- |
| LR | 93.1% | 83.8% | 26.2% | 39.9% | −0.76 pp |
| Linear SVC ⭐ | 94.2% | 76.0% | 48.9% | 59.5% | +7.58 pp ↑ |
| XGB (gblinear) | 93.4% | 73.1% | 38.5% | 50.4% | +1.50 pp |
| SVM (RBF) | 93.3% | 87.0% | 27.4% | 41.7% | +0.50 pp |

LSA Dimensionality Reduction

LSA (5,000 components) on masked, normalised features; the class-weighted LR is the notable result:

| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 |
| --- | --- | --- | --- | --- |
| TFIDF_LSA_LR (weighted) | 93.1% | 47% | 86% | 61% |

High recall at the cost of precision — useful for high-recall deployment.

Sentence Embeddings

| Model + Embedding | Accuracy | Norm Prec | Norm Recall | Norm F1 |
| --- | --- | --- | --- | --- |
| MiniLM + LR | 93.7% | 74.0% | 42.3% | 53.8% |
| MiniLM + Linear SVM | 93.7% | 70.8% | 46.7% | 56.3% |
| MiniLM + NL SVM | 93.8% | 75.2% | 43.2% | 54.9% |
| MiniLM + NL XGBoost | 94.0% | 78.7% | 43.2% | 55.8% |
| MiniLM + MLP [128×5] | 93.6% | 66.6% | 53.6% | 59.4% |
| MPNet + MLP [32×5] | 93.8% | 74.2% | 44.5% | 55.6% |

LSTM

Trained at four vocabulary sizes and evaluated on the held-out test set:

| Vocab Size | Norm Prec | Norm Recall | Norm F1 |
| --- | --- | --- | --- |
| 3,000 | 0.41 | 0.79 | 0.54 |
| 5,000 | 0.52 | 0.71 | 0.60 |
| 7,000 | 0.56 | 0.68 | 0.61 |
| 15,000 | 0.39 | 0.83 | 0.53 |

The sweet spot is vocab = 7,000; larger vocabularies introduce noise that the limited data cannot support.

RoBERTa Fine-Tuning

TAPT + ULMFiT

Three training configurations with discriminative learning rates, slanted triangular LR scheduling, and gradual unfreezing (a sketch of two of these ingredients follows the table):

| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 |
| --- | --- | --- | --- | --- |
| TAPT + ULMFiT V01 | 93.5% | 62.1% | 64.7% | 63.4% |
| Base + ULMFiT V02 ⭐ | 93.6% | 62.9% | 64.7% | 63.8% |
| TAPT + ULMFiT V03 | 92.9% | 58.1% | 66.6% | 62.1% |
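A hedged PyTorch sketch of discriminative learning rates and gradual unfreezing (the base LR and decay factor are assumptions, not the values used here):

```python
import torch
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Discriminative LRs: the classifier head gets the base LR; each encoder
# layer below it gets a geometrically smaller one.
base_lr, decay = 2e-5, 0.95  # assumed values
groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for i, layer in enumerate(reversed(model.roberta.encoder.layer)):
    groups.append({"params": layer.parameters(), "lr": base_lr * decay ** (i + 1)})
optimizer = torch.optim.AdamW(groups)

# Gradual unfreezing: freeze the encoder at the start, then unfreeze
# one layer per epoch from the top down inside the training loop.
for p in model.roberta.parameters():
    p.requires_grad = False
```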

HuggingFace Trainer Variants

| Model / Approach | Macro F1 | Norm Recall | Norm Prec | Norm F1 |
| --- | --- | --- | --- | --- |
| RoBERTa-base (Trainer, v1 masked) | 79.7% | 65.3% | 61.1% | 63.1% |
| RoBERTa-base (v2 mask + tokens) | 80.1% | 65.3% | 62.4% | 63.8% |
| RoBERTa-large (threshold tuned) | 80.2% | 64.0% | 63.8% | 63.9% |
| RoBERTa-base (full text + threshold) | 80.2% | 66.3% | 61.8% | 65.4% |

Surprising finding: Full unmasked text outperforms masked variants — cultural group names carry genuinely useful semantic signal, not merely spurious memorisation.

Best Model — Full Classification Report

| Class | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Not-Norm (0) | 0.97 | 0.96 | 0.96 |
| Norm (1) | 0.62 | 0.66 | 0.64 |
| Macro avg | 0.79 | 0.81 | 0.80 |
| Weighted avg | 0.94 | 0.94 | 0.94 |

Zero-Shot LLM Baseline

Gemma-4-E2B-it evaluated zero-shot on 1,000 samples:

| Metric | Value |
| --- | --- |
| Overall Accuracy | 8% |
| Macro F1 | 7.4% |
| Norm Recall | 100% (degenerate: predicts Norm for everything) |
| Not-Norm Recall | 0.11% |

❌ Zero-shot LLMs without prompt engineering are not viable for this imbalanced task.


🚀 Novel Components

Deontic Cue Normativity Scoring

Motivated by linguistic theory — norm descriptions use specific modal/evaluative language patterns.

Four deontic categories:

| Category | Example Keywords |
| --- | --- |
| Obligation | should, must, are expected, required, ought to |
| Prohibition | should not, must not, avoid, forbidden, prohibited |
| Judgement | rude, polite, respectful, inappropriate, acceptable |
| Convention | customary, traditionally, commonly, typically, norm |

Scoring mechanism:

  • Per-category keyword matches → weighted points (e.g., Obligation match → +2.0)
  • Structural co-occurrence bonuses: "it is" + "customary/common" → +1.5
  • Negation penalty: "not required" → ×0.5
  • Final score normalised + sigmoid-squashed to [0, 1]

Features produced: Normativity score + binary category indicators + norm variation type indicators
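A compact sketch of this scorer (keyword lists and most weights are illustrative, drawn from the tables above; the repo's scorer is richer):

```python
import math
import re

CUES = {  # (weight per match, example keywords from the table above)
    "obligation":  (2.0, ["should", "must", "are expected", "required", "ought to"]),
    "prohibition": (2.0, ["should not", "must not", "avoid", "forbidden", "prohibited"]),
    "judgement":   (1.0, ["rude", "polite", "respectful", "inappropriate", "acceptable"]),
    "convention":  (1.0, ["customary", "traditionally", "commonly", "typically", "norm"]),
}

def deontic_score(text: str) -> float:
    t = text.lower()
    score = sum(w for w, kws in CUES.values() for kw in kws if kw in t)
    if "it is" in t and ("customary" in t or "common" in t):
        score += 1.5                       # structural co-occurrence bonus
    if re.search(r"\bnot required\b", t):
        score *= 0.5                       # negation penalty
    return 1 / (1 + math.exp(-score))      # sigmoid squash into (0, 1)
```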

FAISS Retrieval Scoring

A non-parametric retrieval system that classifies by semantic proximity to labelled training sentences.

Architecture:

  1. Split training data by class → Norm corpus + Not-Norm corpus
  2. Segment each corpus into sentences
  3. Encode with all-MiniLM-L6-v2 (384-dim, L2-normalised)
  4. Build two FAISS IndexFlatIP indices (cosine similarity)
  5. For each query: retrieval_score = avg_sim(top-k Norm) − avg_sim(top-k Not-Norm)

Sentence-level aggregate features: Max Score, Mean Score, Top-K Average, Std, Sentence Count
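A sketch of the two-index scheme (norm_sents / notnorm_sents are the class-split training sentences from step 2):

```python
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(sentences):
    vecs = encoder.encode(sentences, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # inner product = cosine after L2 norm
    index.add(vecs)
    return index

norm_index, notnorm_index = build_index(norm_sents), build_index(notnorm_sents)

def retrieval_score(query: str, k: int = 5) -> float:
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    sims_norm, _ = norm_index.search(q, k)
    sims_not, _ = notnorm_index.search(q, k)
    return float(sims_norm.mean() - sims_not.mean())
```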

Retrieval-only classifier results:

| Model | Norm Prec | Norm Recall | Norm F1 | Macro F1 |
| --- | --- | --- | --- | --- |
| LR (default) | 72.6% | 43.5% | 54.4% | 75.5% |
| Linear SVC (default) | 72.3% | 47.6% | 57.4% | 77.1% |
| NL XGBoost | 61.7% | 51.4% | 56.1% | 76.2% |
| RBF SVM (balanced) | 37.9% | 82.7% | 51.9% | 72.1% |

Retrieval-only achieves up to 57.4% Norm F1 with zero explicit text learning — competitive with TF-IDF+LSA baselines.

Fusion Feature Vector (916-dim)

Combines all signal sources into a single unified representation:

32-dim engineered feature vector:

| Component | Dimensions | Source |
| --- | --- | --- |
| Deontic features | 6 | Rule-based scorer |
| Norm-type features | 8 | Category indicators |
| Variation features | 6 | Norm modality indicators |
| RoBERTa features | 6 | Logit-based scores (sentence + doc level) |
| FAISS retrieval | 6 | Similarity statistics |
| **Total** | **32** | |

884-dim embedding matrix:

| Component | Dimensions |
| --- | --- |
| LSA (TruncatedSVD, 500 components) | 500 |
| MiniLM sentence embeddings | 384 |
| **Total** | **884** |

Final: 32 + 884 = 916-dim per observation
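Assembling the vector is plain concatenation; a sketch with illustrative array names:

```python
import numpy as np

engineered = np.hstack([deontic_6, norm_type_8, variation_6,
                        roberta_6, faiss_6])       # (n, 32) engineered block
embeddings = np.hstack([lsa_500, minilm_384])      # (n, 884) embedding block
fusion = np.hstack([engineered, embeddings])       # (n, 916) per observation
assert fusion.shape[1] == 916
```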

Fusion Pipeline Results

| Model / Config | Norm Prec | Norm Recall | Norm F1 | Macro F1 | Accuracy |
| --- | --- | --- | --- | --- | --- |
| LR (default) | 66.7% | 63.1% | 64.8% | 80.8% | 94.0% |
| Linear SVC (default) | 65.9% | 63.4% | 64.6% | 80.7% | 93.9% |
| Linear XGB (default) | 70.7% | 58.7% | 64.1% | 80.5% | 94.3% |
| NL XGBoost (default) | 66.4% | 61.8% | 64.1% | 80.4% | 94.0% |
| MLP [128×5] + threshold (0.468) | 64.2% | 64.4% | 64.3% | 80.4% | 93.8% |
| Bagged LR (25 est.) + threshold (0.768) | 61.5% | 67.5% | 64.4% | 80.4% | 93.5% |
| RBF SVM (balanced) | 53.5% | 74.1% | 62.2% | 78.9% | 92.1% |

🎯 Key Finding: The 916-feature fusion framework closes 90% of the gap between TF-IDF baseline and the best transformer — Logistic Regression at 64.8% Norm F1 nearly matches RoBERTa at 65.4%, at a fraction of inference cost.
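The "+ threshold" variants above tune the decision cutoff rather than using 0.5; a hedged sketch of standard threshold tuning on validation probabilities (clf and the validation arrays are assumed to exist):

```python
import numpy as np
from sklearn.metrics import f1_score

probs = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 181)
best = max(thresholds, key=lambda t: f1_score(y_val, (probs >= t).astype(int)))
print(f"best Norm-F1 threshold: {best:.3f}")
```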


🌍 Cultural Group Identification

Problem Setup

  • 2,398 unique cultural groups in raw data (most with < 10 examples)
  • Consolidated to 12 classes (top 11 + Other)

Consolidated Group Distribution (Training Set)

| Group | Count |
| --- | --- |
| America | 6,783 |
| German | 1,056 |
| British | 956 |
| Australian | 634 |
| French | 614 |
| Italian | 613 |
| Europe | 584 |
| Korean | 499 |
| Dutch | 441 |
| Japanese | 439 |
| Spanish | 338 |
| Other | 10,033 |

TF-IDF + SVD Results (dim=3000)

| Cultural Group | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| Other | 0.78 | 0.33 | 0.46 | 2,006 |
| America | 0.63 | 0.36 | 0.46 | 1,356 |
| Japanese | 0.33 | 0.62 | 0.43 | 88 |
| German | 0.26 | 0.34 | 0.29 | 211 |
| British | 0.21 | 0.36 | 0.26 | 191 |
| French | 0.15 | 0.43 | 0.22 | 123 |
| Korean | 0.14 | 0.46 | 0.22 | 100 |
| Europe | 0.15 | 0.43 | 0.22 | 117 |
| Australian | 0.14 | 0.40 | 0.21 | 127 |
| Italian | 0.14 | 0.35 | 0.20 | 122 |
| Dutch | 0.11 | 0.34 | 0.16 | 88 |
| Spanish | 0.11 | 0.38 | 0.17 | 68 |
| Macro Avg | 0.26 | 0.40 | 0.28 | |

Japanese achieves highest recall (0.62) — likely due to distinctive lexical patterns (formal addressing, specific food customs).

Effect of Masking on Group Prediction

For the text "it is customary for both Germans and their children to employ formal and informal addressing...":

| Setup | German Confidence | Top Prediction |
| --- | --- | --- |
| Unmasked | 18.28% | German ✅ (correct) |
| Masked (`[GROUP]`) | 13.84% | Korean (18.86%) |

Masking drops German confidence ~5% but normative content retains discriminative power.

TF-IDF vs BiLSTM

BiLSTM macro-F1 stays in the 0.27–0.37 range and skews heavily toward the high-volume classes (America, Other); TF-IDF+SVD outperforms BiLSTM for this task.

Structural limitation: Non-Western cultures (e.g., India/Diwali) fall into Other — excluded from 12 consolidated classes.


📈 Comprehensive Results

Norm Classification — Full Ranking

| Model / Approach | Macro F1 | Norm Recall | Norm Prec | Norm F1 |
| --- | --- | --- | --- | --- |
| Baseline LR (raw TF-IDF) | 68.0% | 26.9% | 78.7% | 40.1% |
| Lin. SVM (masked TF-IDF) | 78.2% | 48.9% | 76.0% | 59.5% |
| LSA (10k) + Lin. SVM | 78.1% | 48.9% | 75.6% | 59.4% |
| FAISS Retrieval Only (SVC) | 77.1% | 47.6% | 72.3% | 57.4% |
| LSTM (vocab 7k) | – | 68.0% | 56.0% | 61.0% |
| MiniLM + MLP [128×5] | – | 53.6% | 66.6% | 59.4% |
| TAPT + ULMFiT V02 | 80.1% | 64.7% | 62.9% | 63.8% |
| RoBERTa-base (Trainer, masked) | 79.7% | 65.3% | 61.1% | 63.1% |
| RoBERTa-large (threshold tuned) | 80.2% | 64.0% | 63.8% | 63.9% |
| Fusion 916 — Bagged LR + threshold | 80.4% | 67.5% | 61.5% | 64.4% |
| Fusion 916 — MLP [128×5] + threshold | 80.5% | 64.4% | 64.2% | 64.3% |
| Fusion 916 — LR (default) | 80.8% | 63.1% | 66.7% | 64.8% |
| RoBERTa-base (full text + threshold) | 80.2% | 66.3% | 61.8% | 65.4% |
| Gemma-4-E2B (zero-shot) ❌ | 7.4% | 100% | 7.9% | 14.7% |

⚠️ Limitations

| Limitation | Impact |
| --- | --- |
| No SMOTE applied universally | Norm recall likely underestimates the achievable ceiling |
| Group consolidation to 12 classes | 2,300+ minority cultures erased; non-Western cultures systematically excluded |
| No calibration analysis | Reliability diagrams not explored |
| No error analysis | Which types of Norms are most confused with Not-Norms remains open |

Why Norm Recall is Hard

Norm texts use hedged, conditional language ("it is customary for...", "individuals tend to...") that is descriptively similar to non-norm factual statements, making the boundary genuinely ambiguous even after agreement filtering (≥ 0.70).


Anik Mallick · Gaurav Tiwari

About

Fine-tuned RoBERTa classifier for detecting normative statements in text, with a feature-engineering LR ensemble (deontic keyword scoring, FAISS retrieval, sentence-level aggregation) and an interactive Streamlit inference app.
