Quick links: [Dataset] - [Huggingface Models] - [Models]
```
git clone https://github.com/AnikMallick/norm-classifier
cd norm-classifier
```

Python 3.10+ is recommended.

```
pip install -r requirements.txt
```

The fine-tuned RoBERTa classifier is hosted on Hugging Face and is not included in this repo due to its size. Run the setup script once — it will download the model into `model-dir/` automatically.

```
python setup_model.py
```

What this does: fetches `roberta-base-classifier-v01` from `anik-owl/roberta_norm_classifier` and saves it to `./model-dir/roberta-base-classifier-v01/`.
All other model files (LR classifiers, FAISS indexes) are already included in the repo.
Expected output:

```
Model downloaded at: ./model-dir/roberta-base-classifier-v01
```
```
streamlit run app/app.py
```

The app will open in your browser at http://localhost:8501.
```
artifacts/
├── lr_model_fe_v01.pkl       ← included in repo
├── lr_model_fe_v02.pkl       ← included in repo
├── norm_faiss.index          ← included in repo
├── notnorm_faiss.index       ← included in repo
├── doc_norm_faiss.index      ← included in repo
├── doc_notnorm_faiss.index
└── ...

model-dir/                         ← ignored by git
└── roberta-base-classifier-v01/   ← downloaded by setup_model.py
    ├── config.json
    ├── model.safetensors
    └── ...
```
| Issue | Fix |
|---|---|
| `ModuleNotFoundError` | Make sure you ran `pip install -r requirements.txt` |
| `OSError: model path not found` | Re-run `python setup_model.py` |
| `faiss` install fails on Apple Silicon | Use `pip install faiss-cpu` instead |
| Slow first load | Models are cached after the first run — subsequent loads are faster |
- Overview
- Dataset
- Pipeline Architecture
- Data Preprocessing
- Experiments & Results
- Novel Components
- Cultural Group Identification
- Comprehensive Results
- Limitations
- Setup & Usage
This project presents a comprehensive NLP pipeline for cultural norm classification using the CultureBank dataset (Reddit + TikTok, ~22,990 samples). Two interconnected objectives are addressed:
- Binary Norm Classification — distinguishing cultural norms from generic statements
- Cultural Group Attribution — identifying which cultural group a norm belongs to
These are treated as a cascade: a sentence classified as Norm is passed to the group identifier.
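The cascade can be sketched in a few lines; `norm_clf` and `group_clf` are placeholder callables standing in for the trained models:

```python
def classify(sentence, norm_clf, group_clf):
    """Two-stage cascade: group attribution only runs on predicted Norms."""
    if norm_clf(sentence) == 1:             # stage 1: binary norm classifier
        return "Norm", group_clf(sentence)  # stage 2: cultural group identifier
    return "Not Norm", None
```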
| Component | Description |
|---|---|
| 🔤 Deontic Cue Scorer | Linguistic rule-signals (obligation, prohibition, judgement, convention) fused with neural confidence scores |
| 🔎 FAISS Retrieval Scorer | Class-discriminative similarity scores against labelled training corpora |
| 🧠 916-dim Fusion Framework | Unified feature matrix enabling classical classifiers to approach transformer-level performance |
| Model | Norm F1 | Macro F1 |
|---|---|---|
| RoBERTa-base (full text + threshold) ⭐ | 65.4% | 80.2% |
| Fusion 916 — LR (default) | 64.8% | 80.8% |
| Fusion 916 — Bagged LR + threshold | 64.4% | 80.4% |
| TAPT + ULMFiT V02 | 63.8% | 80.1% |
| Source | Size | Rows |
|---|---|---|
| Reddit corpus | 19.7 MB | 11,236 |
| TikTok corpus | 16.2 MB | 11,754 |
| Merged | — | 22,990 |
Each row contains 17 columns encoding cultural observations: cultural group, context, goal, relation, actor/recipient behaviour, topic, agreement score, scenario, and more.
The topic column is mapped as follows:
- `Social Norms and Etiquette` → Label 1 (Norm)
- All other topics → Label 0 (Not Norm)
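As a one-line sketch of the mapping:

```python
def to_label(topic: str) -> int:
    # Only "Social Norms and Etiquette" maps to the positive (Norm) class.
    return 1 if topic == "Social Norms and Etiquette" else 0
```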
| Class | Count | Percentage |
|---|---|---|
| Norm (1) | 2,225 | 9.7% |
| Not Norm (0) | 20,765 | 90.3% |
| Total | 22,990 | 100% |
⚠️ The ~1:10 class imbalance is the defining challenge of this project. All design decisions — class-weighted loss, threshold tuning, sampling strategies — must be viewed through this lens.
| Step | Decision |
|---|---|
| Missing values in `context` (~145 rows) | Complete Case Analysis (CCA) — dropped |
| Agreement filtering | Retained only agreement ≥ 0.70 |
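Both decisions are one-liners in pandas; the column names here are assumptions for illustration:

```python
import pandas as pd

# Toy frame standing in for the merged CultureBank data.
df = pd.DataFrame({
    "context":   ["at work", None, "in public"],
    "agreement": [0.90, 0.80, 0.50],
})
df = df.dropna(subset=["context"])   # Complete Case Analysis: drop rows with missing context
df = df[df["agreement"] >= 0.70]     # retain only agreement >= 0.70
```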
Post-filtering dataset statistics:
| Metric | Value |
|---|---|
| Total Observations | 22,845 |
| After agreement filter + CCA | 18,169 |
| Class Imbalance Ratio | ~9.4 : 1 |
| Unique Cultural Groups | ~250+ |
Stratified 60% / 20% / 20% split maintained consistently across all experiments to prevent leakage.
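A 60/20/20 stratified split can be built from two calls to scikit-learn's `train_test_split` (toy labels mirroring the ~1:10 imbalance):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 90 + [1] * 10

# Carve out 40% for val+test, then halve it; stratify both calls.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```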
```
Raw CultureBank Data (22,990 samples)
        │
        ▼
┌─────────────────────────┐
│ Preprocessing           │
│  - CCA (missing vals)   │
│  - Agreement filter     │
│  - Text normalisation   │
│  - NER masking (V1/V2)  │
└─────────┬───────────────┘
          │
    ┌─────┴──────────────────────────────┐
    │                                    │
    ▼                                    ▼
TF-IDF / LSA                  Sentence Embeddings
(classical ML)                 (MiniLM / MPNet)
    │                                    │
    └──────────────┬─────────────────────┘
                   │
                   ▼
         ┌──────────────────┐
         │  LSTM / RoBERTa  │  ← transformer fine-tuning
         └──────────────────┘
                   │
                   ▼
        ┌───────────────────────┐
        │   Fusion (916-dim)    │
        │  LSA + MiniLM +       │
        │  Deontic + FAISS +    │
        │  RoBERTa signals      │
        └───────────────────────┘
                   │
          ┌────────┴────────┐
          ▼                 ▼
    Binary Norm       Cultural Group
    Classifier        Identifier
```
Three columns carry the most discriminative signal (from Y-Data profiling EDA):

| Column | Description |
|---|---|
| `eval_whole_desc` | Full evaluation-oriented behavioural description (170–981 chars, median ~781) |
| `context` | Situational context (e.g., "in public", "at work") |
| `eval_persona` | Hypothetical persona posing the question |
Lowercasing → Punctuation removal → Stop-word removal (spaCy English)
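A minimal sketch of that three-step pipeline; the tiny stop-word set here is an illustrative assumption (the project uses spaCy's English list):

```python
import string

STOPWORDS = {"it", "is", "to", "the", "for", "a", "an"}  # toy stand-in for spaCy's list

def normalise(text: str) -> str:
    text = text.lower()                                               # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    return " ".join(w for w in text.split() if w not in STOPWORDS)    # stop-word removal
```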
Two masking strategies implemented using spaCy NER:
| Strategy | Mapping |
|---|---|
| V1 | GPE, NORP, LOC → [LOCATION] |
| V2 | GPE/LOC → [LOCATION], NORP → [GROUP], PERSON → [PERSON], ORG → [ORG], LANGUAGE → [LANGUAGE] |
Rationale: Masking `America` in a sample dropped America confidence by ~5%, confirming unmasked features leak identity rather than normative content.
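The V2 mapping reduces to a dictionary over spaCy entity labels; in this sketch the `(text, label)` pairs stand in for spaCy NER output so the logic is self-contained:

```python
V2_MAP = {"GPE": "[LOCATION]", "LOC": "[LOCATION]", "NORP": "[GROUP]",
          "PERSON": "[PERSON]", "ORG": "[ORG]", "LANGUAGE": "[LANGUAGE]"}

def mask(text, entities):
    # Replace each recognised entity span with its placeholder token.
    for ent_text, label in entities:
        if label in V2_MAP:
            text = text.replace(ent_text, V2_MAP[label])
    return text
```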
```python
TfidfVectorizer(
    max_features=30_000,
    min_df=3,
    ngram_range=(1, 2)
)
```

Truncated SVD grid-searched over {10, 50, 100, 500, 1000, 2500, 5000, 10000} components.
Optimal: n_components = 10,000 → Cross-validated ROC-AUC of 0.9347
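The TF-IDF → TruncatedSVD step looks like this on a toy corpus (tiny component grid here; the real run searched up to 10,000 components):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["it is customary to bow when greeting",
        "tipping is expected in restaurants",
        "the train departed on time",
        "she bought a new laptop yesterday"]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
for n in (2, 3):  # grid over component counts
    reduced = TruncatedSVD(n_components=n, random_state=0).fit_transform(X)
```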
| Model | Dimensions |
|---|---|
| `all-MiniLM-L6-v2` | 384-dim |
| `all-mpnet-base-v2` | 768-dim |
Raw concatenated text (eval_whole_desc + context + eval_persona) vectorised with TF-IDF:
| Model | Norm Prec | Norm Recall | Norm F1 | AUC |
|---|---|---|---|---|
| Logistic Regression | 78.7% | 26.9% | 40.1% | 0.94 |
| Linear SVC ⭐ | 71.3% | 41.3% | 52.3% | 0.92 |
| XGB (gblinear) | 71.1% | 37.0% | 48.7% | 0.91 |
| SVM (RBF) | 82.5% | 26.9% | 40.6% | 0.94 |
| XGB (gbtree) | 10.7% | 98.6% | 19.3% ❌ | — |
XGB (gbtree) collapses to predicting all-Norm — tree methods are highly sensitive to imbalance without explicit correction.
Effect of V1 masking on Linear SVC (best improvement):
| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 | Δ vs Baseline |
|---|---|---|---|---|---|
| LR | 93.1% | 83.8% | 26.2% | 39.9% | −0.76pp |
| Linear SVC ⭐ | 94.2% | 76.0% | 48.9% | 59.5% | +7.58pp ↑ |
| XGB (gblinear) | 93.4% | 73.1% | 38.5% | 50.4% | +1.50pp |
| SVM (RBF) | 93.3% | 87.0% | 27.4% | 41.7% | +0.50pp |
LSA (5,000 components) on masked normalised features. Weighted LR notable result:
| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|---|
| TFIDF_LSA_LR (Weighted) | 93.1% | 47% | 86% | 61% |
High recall at the cost of precision — useful for high-recall deployment.
| Model + Embedding | Accuracy | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|---|
| MiniLM + LR | 93.7% | 74.0% | 42.3% | 53.8% |
| MiniLM + Linear SVM | 93.7% | 70.8% | 46.7% | 56.3% |
| MiniLM + NL SVM | 93.8% | 75.2% | 43.2% | 54.9% |
| MiniLM + NL XGBoost | 94.0% | 78.7% | 43.2% | 55.8% |
| MiniLM + MLP [128×5] ⭐ | 93.6% | 66.6% | 53.6% | 59.4% |
| MPNet + MLP [32×5] | 93.8% | 74.2% | 44.5% | 55.6% |
Trained at 4 vocabulary sizes; evaluated on held-out test set:
| Vocab Size | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|
| 3,000 | 0.41 | 0.79 | 0.54 |
| 5,000 | 0.52 | 0.71 | 0.60 |
| 7,000 ⭐ | 0.56 | 0.68 | 0.61 |
| 15,000 | 0.39 | 0.83 | 0.53 |
Sweet spot at vocab=7,000. Larger vocabularies introduce noise with insufficient data.
Three training configurations with discriminative learning rates, slanted triangular LR scheduling, and gradual unfreezing:
| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|---|
| TAPT + ULMFiT V01 | 93.5% | 62.1% | 64.7% | 63.4% |
| Base + ULMFiT V02 ⭐ | 93.6% | 62.9% | 64.7% | 63.8% |
| TAPT + ULMFiT V03 | 92.9% | 58.1% | 66.6% | 62.1% |
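Discriminative learning rates, one of the ULMFiT tricks above, assign geometrically smaller LRs to earlier layers. A PyTorch sketch with a toy model (the 2.6 decay factor follows the ULMFiT paper; the architecture is an assumption):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2))
base_lr, decay = 1e-3, 2.6

# Earlier layers get base_lr / decay^(depth from top).
groups = [{"params": layer.parameters(),
           "lr": base_lr / decay ** (len(model) - 1 - i)}
          for i, layer in enumerate(model)]
optimizer = torch.optim.AdamW(groups)
```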
| Model / Approach | Macro F1 | Norm Recall | Norm Prec | Norm F1 |
|---|---|---|---|---|
| RoBERTa-base (Trainer, v1 masked) | 79.7% | 65.3% | 61.1% | 63.1% |
| RoBERTa-base (v2 mask + tokens) | 80.1% | 65.3% | 62.4% | 63.8% |
| RoBERTa-large (threshold tuned) | 80.2% | 64.0% | 63.8% | 63.9% |
| RoBERTa-base (full text + threshold) ⭐ | 80.2% | 66.3% | 61.8% | 65.4% |
Surprising finding: Full unmasked text outperforms masked variants — cultural group names carry genuinely useful semantic signal, not merely spurious memorisation.
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Not-Norm (0) | 0.97 | 0.96 | 0.96 |
| Norm (1) | 0.62 | 0.66 | 0.64 |
| Macro avg | 0.79 | 0.81 | 0.80 |
| Weighted avg | 0.94 | 0.94 | 0.94 |
Gemma-4-E2B-it evaluated zero-shot on 1,000 samples:
| Metric | Value |
|---|---|
| Overall Accuracy | 8% |
| Macro F1 | 7.4% |
| Norm Recall | 100% (degenerate — predicts Norm for everything) |
| Not-Norm Recall | 0.11% |
❌ Zero-shot LLMs without prompt engineering are not viable for this imbalanced task.
Motivated by linguistic theory — norm descriptions use specific modal/evaluative language patterns.
Four deontic categories:
| Category | Example Keywords |
|---|---|
| Obligation | should, must, are expected, required, ought to |
| Prohibition | should not, must not, avoid, forbidden, prohibited |
| Judgement | rude, polite, respectful, inappropriate, acceptable |
| Convention | customary, traditionally, commonly, typically, norm |
Scoring mechanism:
- Per-category keyword matches → weighted points (e.g., Obligation match → +2.0)
- Structural co-occurrence bonuses: `"it is"` + `"customary/common"` → +1.5
- Negation penalty: `"not required"` → ×0.5
- Final score normalised + sigmoid-squashed to `[0, 1]`
Features produced: Normativity score + binary category indicators + norm variation type indicators
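A minimal sketch of the scorer; keyword lists come from the table above, while the exact weights beyond the stated examples are illustrative assumptions:

```python
import math

CUES = {  # (weight, keywords) per deontic category; weights are assumptions
    "obligation":  (2.0, ["should", "must", "are expected", "required", "ought to"]),
    "prohibition": (2.0, ["should not", "must not", "avoid", "forbidden", "prohibited"]),
    "judgement":   (1.0, ["rude", "polite", "respectful", "inappropriate", "acceptable"]),
    "convention":  (1.5, ["customary", "traditionally", "commonly", "typically", "norm"]),
}

def deontic_score(text: str) -> float:
    t = text.lower()
    score = sum(w for w, kws in CUES.values() for kw in kws if kw in t)
    if "it is" in t and ("customary" in t or "common" in t):
        score += 1.5                      # structural co-occurrence bonus
    if "not required" in t:
        score *= 0.5                      # negation penalty
    return 1 / (1 + math.exp(-score))     # sigmoid-squash to [0, 1]
```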
A non-parametric retrieval system that classifies by semantic proximity to labelled training sentences.
Architecture:
- Split training data by class → Norm corpus + Not-Norm corpus
- Segment each corpus into sentences
- Encode with `all-MiniLM-L6-v2` (384-dim, L2-normalised)
- Build two FAISS `IndexFlatIP` indices (cosine similarity)
- For each query: `retrieval_score = avg_sim(top-k Norm) − avg_sim(top-k Not-Norm)`
Sentence-level aggregate features: Max Score, Mean Score, Top-K Average, Std, Sentence Count
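The score is easy to reproduce with NumPy as a stand-in for the two `IndexFlatIP` indices (inner product on L2-normalised vectors equals cosine similarity); the corpora here are random placeholders rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

norm_corpus    = l2norm(rng.normal(size=(50, 384)))   # stand-in for Norm sentences
notnorm_corpus = l2norm(rng.normal(size=(80, 384)))   # stand-in for Not-Norm sentences

def retrieval_score(query, k=5):
    q = query / np.linalg.norm(query)
    top = lambda corpus: np.sort(corpus @ q)[-k:].mean()  # mean sim of top-k neighbours
    return top(norm_corpus) - top(notnorm_corpus)
```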
Retrieval-only classifier results:
| Model | Norm Prec | Norm Recall | Norm F1 | Macro F1 |
|---|---|---|---|---|
| LR (default) | 72.6% | 43.5% | 54.4% | 75.5% |
| Linear SVC (default) ⭐ | 72.3% | 47.6% | 57.4% | 77.1% |
| NL XGBoost | 61.7% | 51.4% | 56.1% | 76.2% |
| RBF SVM (balanced) | 37.9% | 82.7% | 51.9% | 72.1% |
Retrieval-only achieves up to 57.4% Norm F1 with zero explicit text learning — competitive with TF-IDF+LSA baselines.
Combines all signal sources into a single unified representation:
32-dim engineered feature vector:
| Component | Dimensions | Source |
|---|---|---|
| Deontic features | 6 | Rule-based scorer |
| Norm-type features | 8 | Category indicators |
| Variation features | 6 | Norm modality indicators |
| RoBERTa features | 6 | Logit-based scores (sentence + doc level) |
| FAISS retrieval | 6 | Similarity statistics |
| Total | 32 | — |
884-dim embedding matrix:
| Component | Dimensions |
|---|---|
| LSA (TruncatedSVD, 500 components) | 500-dim |
| MiniLM sentence embeddings | 384-dim |
| Total | 884-dim |
Final: 32 + 884 = 916-dim per observation
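Assembling the fusion matrix is a single horizontal concatenation; the dimensions follow the tables above and the feature values are placeholders:

```python
import numpy as np

n = 4                                 # number of observations
engineered = np.zeros((n, 32))        # deontic + norm-type + variation + RoBERTa + FAISS
lsa        = np.zeros((n, 500))       # TruncatedSVD of TF-IDF
minilm     = np.zeros((n, 384))       # sentence embeddings

fused = np.hstack([engineered, lsa, minilm])  # 916-dim per observation
```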
| Model / Config | Norm Prec | Norm Recall | Norm F1 | Macro F1 | Accuracy |
|---|---|---|---|---|---|
| LR (default) ⭐ | 66.7% | 63.1% | 64.8% | 80.8% | 94.0% |
| Linear SVC (default) | 65.9% | 63.4% | 64.6% | 80.7% | 93.9% |
| Linear XGB (default) | 70.7% | 58.7% | 64.1% | 80.5% | 94.3% |
| NL XGBoost (default) | 66.4% | 61.8% | 64.1% | 80.4% | 94.0% |
| MLP [128×5] + threshold (0.468) | 64.2% | 64.4% | 64.3% | 80.4% | 93.8% |
| Bagged LR (25 est.) + threshold (0.768) | 61.5% | 67.5% | 64.4% | 80.4% | 93.5% |
| RBF SVM (balanced) | 53.5% | 74.1% | 62.2% | 78.9% | 92.1% |
🎯 Key Finding: The 916-feature fusion framework closes 90% of the gap between TF-IDF baseline and the best transformer — Logistic Regression at 64.8% Norm F1 nearly matches RoBERTa at 65.4%, at a fraction of inference cost.
- 2,398 unique cultural groups in raw data (most with < 10 examples)
- Consolidated to 12 classes (top 11 + `Other`)
| Group | Count |
|---|---|
| America | 6,783 |
| German | 1,056 |
| British | 956 |
| Australian | 634 |
| French | 614 |
| Italian | 613 |
| Europe | 584 |
| Korean | 499 |
| Dutch | 441 |
| Japanese | 439 |
| Spanish | 338 |
| Other | 10,033 |
| Cultural Group | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Other | 0.78 | 0.33 | 0.46 | 2,006 |
| America | 0.63 | 0.36 | 0.46 | 1,356 |
| Japanese ⭐ | 0.33 | 0.62 | 0.43 | 88 |
| German | 0.26 | 0.34 | 0.29 | 211 |
| British | 0.21 | 0.36 | 0.26 | 191 |
| French | 0.15 | 0.43 | 0.22 | 123 |
| Korean | 0.14 | 0.46 | 0.22 | 100 |
| Europe | 0.15 | 0.43 | 0.22 | 117 |
| Australian | 0.14 | 0.40 | 0.21 | 127 |
| Italian | 0.14 | 0.35 | 0.20 | 122 |
| Dutch | 0.11 | 0.34 | 0.16 | 88 |
| Spanish | 0.11 | 0.38 | 0.17 | 68 |
| Macro Avg | 0.26 | 0.40 | 0.28 | — |
Japanese achieves highest recall (0.62) — likely due to distinctive lexical patterns (formal addressing, specific food customs).
For the text "it is customary for both Germans and their children to employ formal and informal addressing...":
| Setup | German Confidence | Top Prediction |
|---|---|---|
| Unmasked | 18.28% ✅ (correct) | German |
| Masked (`[GROUP]`) | 13.84% | Korean (18.86%) |
Masking drops German confidence ~5% but normative content retains discriminative power.
BiLSTM macro-F1 remains in 0.27–0.37 range. BiLSTM skews heavily toward high-volume classes (America, Other). TF-IDF+SVD outperforms BiLSTM for this task.
Structural limitation: Non-Western cultures (e.g., India/Diwali) fall into `Other` — excluded from the 12 consolidated classes.
| Model / Approach | Macro F1 | Norm Recall | Norm Prec | Norm F1 |
|---|---|---|---|---|
| Baseline LR (raw TF-IDF) | 68.0% | 26.9% | 78.7% | 40.1% |
| Lin. SVM (masked TF-IDF) | 78.2% | 48.9% | 76.0% | 59.5% |
| LSA (10k) + Lin. SVM | 78.1% | 48.9% | 75.6% | 59.4% |
| FAISS Retrieval Only (SVC) | 77.1% | 47.6% | 72.3% | 57.4% |
| LSTM (vocab 7k) | — | 68.0% | 56.0% | 61.0% |
| MiniLM + MLP [128×5] | — | 53.6% | 66.6% | 59.4% |
| TAPT + ULMFiT V02 | 80.1% | 64.7% | 62.9% | 63.8% |
| RoBERTa-base (Trainer, masked) | 79.7% | 65.3% | 61.1% | 63.1% |
| RoBERTa-large (threshold tuned) | 80.2% | 64.0% | 63.8% | 63.9% |
| Fusion 916 — Bagged LR + threshold | 80.4% | 67.5% | 61.5% | 64.4% |
| Fusion 916 — MLP [128×5] + threshold | 80.5% | 64.4% | 64.2% | 64.3% |
| Fusion 916 — LR (default) | 80.8% | 63.1% | 66.7% | 64.8% |
| RoBERTa-base (full text + threshold) ⭐ | 80.2% | 66.3% | 61.8% | 65.4% |
| Gemma-4-E2B (zero-shot) | 7.4% | 100% | 7.9% | 14.7% ❌ |
| Limitation | Impact |
|---|---|
| No SMOTE applied universally | Norm recall likely underestimates achievable ceiling |
| Group consolidation to 12 classes | 2,300+ minority cultures erased; non-Western cultures systematically excluded |
| No calibration analysis | Reliability diagrams not explored |
| No error analysis | What types of Norms are most confused with Not-Norms? |
Norm texts use hedged, conditional language — "it is customary for...", "individuals tend to..." — that is descriptively similar to non-norm factual statements, making the boundary genuinely ambiguous even after agreement filtering (≥ 0.70).
Anik Mallick · Gaurav Tiwari