Quick links: [Dataset] - [Huggingface Models] - [Models]
```
git clone https://github.com/AnikMallick/norm-classifier
cd norm-classifier
```

Python 3.10+ is recommended.

```
pip install -r requirements.txt
```

The fine-tuned RoBERTa classifier is hosted on Hugging Face and is not included in this repo due to its size. Run the setup script once — it will download the model into `model-dir/` automatically.

```
python setup_model.py
```

What this does: fetches `roberta-base-classifier-v01` from `anik-owl/roberta_norm_classifier` and saves it to `./model-dir/roberta-base-classifier-v01/`.
All other model files (LR classifiers, FAISS indexes) are already included in the repo.
Expected output:

```
Model downloaded at: ./model-dir/roberta-base-classifier-v01
```
```
streamlit run app/app.py
```

The app will open in your browser at http://localhost:8501.
```
artifacts/
├── lr_model_fe_v01.pkl       ← included in repo
├── lr_model_fe_v02.pkl       ← included in repo
├── norm_faiss.index          ← included in repo
├── notnorm_faiss.index       ← included in repo
├── doc_norm_faiss.index      ← included in repo
├── doc_notnorm_faiss.index
└── ...

model-dir/                         ← ignored by git
└── roberta-base-classifier-v01/   ← downloaded by setup_model.py
    ├── config.json
    ├── model.safetensors
    └── ...
```
| Issue | Fix |
|---|---|
| `ModuleNotFoundError` | Make sure you ran `pip install -r requirements.txt` |
| `OSError: model path not found` | Re-run `python setup_model.py` |
| `faiss` install fails on Apple Silicon | Use `pip install faiss-cpu` instead |
| Slow first load | Models are cached after the first run — subsequent loads are faster |
- Overview
- Dataset
- Pipeline Architecture
- Data Preprocessing
- Experiments & Results
- Novel Components
- Cultural Group Identification
- Comprehensive Results
- Limitations
- Setup & Usage
This project presents a comprehensive NLP pipeline for cultural norm classification using the CultureBank dataset (Reddit + TikTok, ~22,990 samples). Two interconnected objectives are addressed:
- Binary Norm Classification — distinguishing cultural norms from generic statements
- Cultural Group Attribution — identifying which cultural group a norm belongs to
These are treated as a cascade: a sentence classified as Norm is passed to the group identifier.
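The cascade can be sketched in a few lines; `norm_clf` and `group_clf` are placeholder callables standing in for the trained models:

```python
def classify(sentence, norm_clf, group_clf):
    """Two-stage cascade: group attribution only runs on predicted Norms."""
    if norm_clf(sentence) == 1:             # stage 1: binary norm classifier
        return "Norm", group_clf(sentence)  # stage 2: cultural group identifier
    return "Not Norm", None
```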
| Component | Description |
|---|---|
| 🔤 Deontic Cue Scorer | Linguistic rule-signals (obligation, prohibition, judgement, convention) fused with neural confidence scores |
| 🔎 FAISS Retrieval Scorer | Class-discriminative similarity scores against labelled training corpora |
| 🧠 916-dim Fusion Framework | Unified feature matrix enabling classical classifiers to approach transformer-level performance |
| Model | Norm F1 | Macro F1 |
|---|---|---|
| RoBERTa-base (full text + threshold) ⭐ | 65.4% | 80.2% |
| Fusion 916 — LR (default) | 64.8% | 80.8% |
| Fusion 916 — Bagged LR + threshold | 64.4% | 80.4% |
| TAPT + ULMFiT V02 | 63.8% | 80.1% |
| Source | Size | Rows |
|---|---|---|
| Reddit corpus | 19.7 MB | 11,236 |
| TikTok corpus | 16.2 MB | 11,754 |
| Merged | — | 22,990 |
Each row contains 17 columns encoding cultural observations: cultural group, context, goal, relation, actor/recipient behaviour, topic, agreement score, scenario, and more.
The topic column is mapped as follows:
- `Social Norms and Etiquette` → Label 1 (Norm)
- All other topics → Label 0 (Not Norm)
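As a one-line sketch of the mapping:

```python
def to_label(topic: str) -> int:
    # Only "Social Norms and Etiquette" maps to the positive (Norm) class.
    return 1 if topic == "Social Norms and Etiquette" else 0
```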
| Class | Count | Percentage |
|---|---|---|
| Norm (1) | 2,225 | 9.7% |
| Not Norm (0) | 20,765 | 90.3% |
| Total | 22,990 | 100% |
⚠️ The ~1:10 class imbalance is the defining challenge of this project. All design decisions — class-weighted loss, threshold tuning, sampling strategies — must be viewed through this lens.
| Step | Decision |
|---|---|
| Missing values in `context` (~145 rows) | Complete Case Analysis (CCA) — dropped |
| Agreement filtering | Retained only agreement ≥ 0.70 |
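Both decisions are one-liners in pandas; the column names here are assumptions for illustration:

```python
import pandas as pd

# Toy frame standing in for the merged CultureBank data.
df = pd.DataFrame({
    "context":   ["at work", None, "in public"],
    "agreement": [0.90, 0.80, 0.50],
})
df = df.dropna(subset=["context"])   # Complete Case Analysis: drop rows with missing context
df = df[df["agreement"] >= 0.70]     # retain only agreement >= 0.70
```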
Post-filtering dataset statistics:
| Metric | Value |
|---|---|
| Total Observations | 22,845 |
| After agreement filter + CCA | 18,169 |
| Class Imbalance Ratio | ~9.4 : 1 |
| Unique Cultural Groups | ~250+ |
Stratified 60% / 20% / 20% split maintained consistently across all experiments to prevent leakage.
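A 60/20/20 stratified split can be built from two calls to scikit-learn's `train_test_split` (toy labels mirroring the ~1:10 imbalance):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 90 + [1] * 10

# Carve out 40% for val+test, then halve it; stratify both calls.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```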
```
Raw CultureBank Data (22,990 samples)
        │
        ▼
┌─────────────────────────┐
│ Preprocessing           │
│  - CCA (missing vals)   │
│  - Agreement filter     │
│  - Text normalisation   │
│  - NER masking (V1/V2)  │
└─────────┬───────────────┘
          │
    ┌─────┴──────────────────────────────┐
    │                                    │
    ▼                                    ▼
TF-IDF / LSA                  Sentence Embeddings
(classical ML)                 (MiniLM / MPNet)
    │                                    │
    └──────────────┬─────────────────────┘
                   │
                   ▼
         ┌──────────────────┐
         │  LSTM / RoBERTa  │  ← transformer fine-tuning
         └──────────────────┘
                   │
                   ▼
        ┌───────────────────────┐
        │   Fusion (916-dim)    │
        │  LSA + MiniLM +       │
        │  Deontic + FAISS +    │
        │  RoBERTa signals      │
        └───────────────────────┘
                   │
          ┌────────┴────────┐
          ▼                 ▼
    Binary Norm       Cultural Group
    Classifier        Identifier
```
Three columns carry the most discriminative signal (from Y-Data profiling EDA):

| Column | Description |
|---|---|
| `eval_whole_desc` | Full evaluation-oriented behavioural description (170–981 chars, median ~781) |
| `context` | Situational context (e.g., "in public", "at work") |
| `eval_persona` | Hypothetical persona posing the question |
Lowercasing → Punctuation removal → Stop-word removal (spaCy English)
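A minimal sketch of that three-step pipeline; the tiny stop-word set here is an illustrative assumption (the project uses spaCy's English list):

```python
import string

STOPWORDS = {"it", "is", "to", "the", "for", "a", "an"}  # toy stand-in for spaCy's list

def normalise(text: str) -> str:
    text = text.lower()                                               # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    return " ".join(w for w in text.split() if w not in STOPWORDS)    # stop-word removal
```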
Two masking strategies implemented using spaCy NER:
| Strategy | Mapping |
|---|---|
| V1 | GPE, NORP, LOC → [LOCATION] |
| V2 | GPE/LOC → [LOCATION], NORP → [GROUP], PERSON → [PERSON], ORG → [ORG], LANGUAGE → [LANGUAGE] |
Rationale: Masking `America` in a sample dropped America confidence by ~5%, confirming unmasked features leak identity rather than normative content.
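The V2 mapping reduces to a dictionary over spaCy entity labels; in this sketch the `(text, label)` pairs stand in for spaCy NER output so the logic is self-contained:

```python
V2_MAP = {"GPE": "[LOCATION]", "LOC": "[LOCATION]", "NORP": "[GROUP]",
          "PERSON": "[PERSON]", "ORG": "[ORG]", "LANGUAGE": "[LANGUAGE]"}

def mask(text, entities):
    # Replace each recognised entity span with its placeholder token.
    for ent_text, label in entities:
        if label in V2_MAP:
            text = text.replace(ent_text, V2_MAP[label])
    return text
```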
```python
TfidfVectorizer(
    max_features=30_000,
    min_df=3,
    ngram_range=(1, 2)
)
```

Truncated SVD grid-searched over {10, 50, 100, 500, 1000, 2500, 5000, 10000} components.
Optimal: n_components = 10,000 → Cross-validated ROC-AUC of 0.9347
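The TF-IDF → TruncatedSVD step looks like this on a toy corpus (tiny component grid here; the real run searched up to 10,000 components):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["it is customary to bow when greeting",
        "tipping is expected in restaurants",
        "the train departed on time",
        "she bought a new laptop yesterday"]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
for n in (2, 3):  # grid over component counts
    reduced = TruncatedSVD(n_components=n, random_state=0).fit_transform(X)
```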
| Model | Dimensions |
|---|---|
| `all-MiniLM-L6-v2` | 384-dim |
| `all-mpnet-base-v2` | 768-dim |
Raw concatenated text (eval_whole_desc + context + eval_persona) vectorised with TF-IDF:
| Model | Norm Prec | Norm Recall | Norm F1 | AUC |
|---|---|---|---|---|
| Logistic Regression | 78.7% | 26.9% | 40.1% | 0.94 |
| Linear SVC ⭐ | 71.3% | 41.3% | 52.3% | 0.92 |
| XGB (gblinear) | 71.1% | 37.0% | 48.7% | 0.91 |
| SVM (RBF) | 82.5% | 26.9% | 40.6% | 0.94 |
| XGB (gbtree) | 10.7% | 98.6% | 19.3% ❌ | — |
XGB (gbtree) collapses to predicting all-Norm — tree methods are highly sensitive to imbalance without explicit correction.
Effect of V1 masking on Linear SVC (best improvement):
| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 | Δ vs Baseline |
|---|---|---|---|---|---|
| LR | 93.1% | 83.8% | 26.2% | 39.9% | −0.76pp |
| Linear SVC ⭐ | 94.2% | 76.0% | 48.9% | 59.5% | +7.58pp ↑ |
| XGB (gblinear) | 93.4% | 73.1% | 38.5% | 50.4% | +1.50pp |
| SVM (RBF) | 93.3% | 87.0% | 27.4% | 41.7% | +0.50pp |
LSA (5,000 components) on masked normalised features. Weighted LR notable result:
| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|---|
| TFIDF_LSA_LR (Weighted) | 93.1% | 47% | 86% | 61% |
High recall at the cost of precision — useful for high-recall deployment.
| Model + Embedding | Accuracy | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|---|
| MiniLM + LR | 93.7% | 74.0% | 42.3% | 53.8% |
| MiniLM + Linear SVM | 93.7% | 70.8% | 46.7% | 56.3% |
| MiniLM + NL SVM | 93.8% | 75.2% | 43.2% | 54.9% |
| MiniLM + NL XGBoost | 94.0% | 78.7% | 43.2% | 55.8% |
| MiniLM + MLP [128×5] ⭐ | 93.6% | 66.6% | 53.6% | 59.4% |
| MPNet + MLP [32×5] | 93.8% | 74.2% | 44.5% | 55.6% |
Trained at 4 vocabulary sizes; evaluated on held-out test set:
| Vocab Size | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|
| 3,000 | 0.41 | 0.79 | 0.54 |
| 5,000 | 0.52 | 0.71 | 0.60 |
| 7,000 ⭐ | 0.56 | 0.68 | 0.61 |
| 15,000 | 0.39 | 0.83 | 0.53 |
Sweet spot at vocab=7,000. Larger vocabularies introduce noise with insufficient data.
Three training configurations with discriminative learning rates, slanted triangular LR scheduling, and gradual unfreezing:
| Model | Accuracy | Norm Prec | Norm Recall | Norm F1 |
|---|---|---|---|---|
| TAPT + ULMFiT V01 | 93.5% | 62.1% | 64.7% | 63.4% |
| Base + ULMFiT V02 ⭐ | 93.6% | 62.9% | 64.7% | 63.8% |
| TAPT + ULMFiT V03 | 92.9% | 58.1% | 66.6% | 62.1% |
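Discriminative learning rates, one of the ULMFiT tricks above, assign geometrically smaller LRs to earlier layers. A PyTorch sketch with a toy model (the 2.6 decay factor follows the ULMFiT paper; the architecture is an assumption):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2))
base_lr, decay = 1e-3, 2.6

# Earlier layers get base_lr / decay^(depth from top).
groups = [{"params": layer.parameters(),
           "lr": base_lr / decay ** (len(model) - 1 - i)}
          for i, layer in enumerate(model)]
optimizer = torch.optim.AdamW(groups)
```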
| Model / Approach | Macro F1 | Norm Recall | Norm Prec | Norm F1 |
|---|---|---|---|---|
| RoBERTa-base (Trainer, v1 masked) | 79.7% | 65.3% | 61.1% | 63.1% |
| RoBERTa-base (v2 mask + tokens) | 80.1% | 65.3% | 62.4% | 63.8% |
| RoBERTa-large (threshold tuned) | 80.2% | 64.0% | 63.8% | 63.9% |
| RoBERTa-base (full text + threshold) ⭐ | 80.2% | 66.3% | 61.8% | 65.4% |
Surprising finding: Full unmasked text outperforms masked variants — cultural group names carry genuinely useful semantic signal, not merely spurious memorisation.
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Not-Norm (0) | 0.97 | 0.96 | 0.96 |
| Norm (1) | 0.62 | 0.66 | 0.64 |
| Macro avg | 0.79 | 0.81 | 0.80 |
| Weighted avg | 0.94 | 0.94 | 0.94 |
Gemma-4-E2B-it evaluated zero-shot on 1,000 samples:
| Metric | Value |
|---|---|
| Overall Accuracy | 8% |
| Macro F1 | 7.4% |
| Norm Recall | 100% (degenerate — predicts Norm for everything) |
| Not-Norm Recall | 0.11% |
❌ Zero-shot LLMs without prompt engineering are not viable for this imbalanced task.
Motivated by linguistic theory — norm descriptions use specific modal/evaluative language patterns.
Four deontic categories:
| Category | Example Keywords |
|---|---|
| Obligation | should, must, are expected, required, ought to |
| Prohibition | should not, must not, avoid, forbidden, prohibited |
| Judgement | rude, polite, respectful, inappropriate, acceptable |
| Convention | customary, traditionally, commonly, typically, norm |
Scoring mechanism:
- Per-category keyword matches → weighted points (e.g., Obligation match → +2.0)
- Structural co-occurrence bonuses: `"it is"` + `"customary/common"` → +1.5
- Negation penalty: `"not required"` → ×0.5
- Final score normalised + sigmoid-squashed to `[0, 1]`
Features produced: Normativity score + binary category indicators + norm variation type indicators
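A minimal sketch of the scorer; keyword lists come from the table above, while the exact weights beyond the stated examples are illustrative assumptions:

```python
import math

CUES = {  # (weight, keywords) per deontic category; weights are assumptions
    "obligation":  (2.0, ["should", "must", "are expected", "required", "ought to"]),
    "prohibition": (2.0, ["should not", "must not", "avoid", "forbidden", "prohibited"]),
    "judgement":   (1.0, ["rude", "polite", "respectful", "inappropriate", "acceptable"]),
    "convention":  (1.5, ["customary", "traditionally", "commonly", "typically", "norm"]),
}

def deontic_score(text: str) -> float:
    t = text.lower()
    score = sum(w for w, kws in CUES.values() for kw in kws if kw in t)
    if "it is" in t and ("customary" in t or "common" in t):
        score += 1.5                      # structural co-occurrence bonus
    if "not required" in t:
        score *= 0.5                      # negation penalty
    return 1 / (1 + math.exp(-score))     # sigmoid-squash to [0, 1]
```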
A non-parametric retrieval system that classifies by semantic proximity to labelled training sentences.
Architecture:
- Split training data by class → Norm corpus + Not-Norm corpus
- Segment each corpus into sentences
- Encode with `all-MiniLM-L6-v2` (384-dim, L2-normalised)
- Build two FAISS `IndexFlatIP` indices (cosine similarity)
- For each query: `retrieval_score = avg_sim(top-k Norm) − avg_sim(top-k Not-Norm)`
Sentence-level aggregate features: Max Score, Mean Score, Top-K Average, Std, Sentence Count
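The score is easy to reproduce with NumPy as a stand-in for the two `IndexFlatIP` indices (inner product on L2-normalised vectors equals cosine similarity); the corpora here are random placeholders rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

norm_corpus    = l2norm(rng.normal(size=(50, 384)))   # stand-in for Norm sentences
notnorm_corpus = l2norm(rng.normal(size=(80, 384)))   # stand-in for Not-Norm sentences

def retrieval_score(query, k=5):
    q = query / np.linalg.norm(query)
    top = lambda corpus: np.sort(corpus @ q)[-k:].mean()  # mean sim of top-k neighbours
    return top(norm_corpus) - top(notnorm_corpus)
```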
Retrieval-only classifier results:
| Model | Norm Prec | Norm Recall | Norm F1 | Macro F1 |
|---|---|---|---|---|
| LR (default) | 72.6% | 43.5% | 54.4% | 75.5% |
| Linear SVC (default) ⭐ | 72.3% | 47.6% | 57.4% | 77.1% |
| NL XGBoost | 61.7% | 51.4% | 56.1% | 76.2% |
| RBF SVM (balanced) | 37.9% | 82.7% | 51.9% | 72.1% |
Retrieval-only achieves up to 57.4% Norm F1 with zero explicit text learning — competitive with TF-IDF+LSA baselines.
Combines all signal sources into a single unified representation:
32-dim engineered feature vector:
| Component | Dimensions | Source |
|---|---|---|
| Deontic features | 6 | Rule-based scorer |
| Norm-type features | 8 | Category indicators |
| Variation features | 6 | Norm modality indicators |
| RoBERTa features | 6 | Logit-based scores (sentence + doc level) |
| FAISS retrieval | 6 | Similarity statistics |
| Total | 32 | — |
884-dim embedding matrix:
| Component | Dimensions |
|---|---|
| LSA (TruncatedSVD, 500 components) | 500-dim |
| MiniLM sentence embeddings | 384-dim |
| Total | 884-dim |
Final: 32 + 884 = 916-dim per observation
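Assembling the fusion matrix is a single horizontal concatenation; the dimensions follow the tables above and the feature values are placeholders:

```python
import numpy as np

n = 4                                 # number of observations
engineered = np.zeros((n, 32))        # deontic + norm-type + variation + RoBERTa + FAISS
lsa        = np.zeros((n, 500))       # TruncatedSVD of TF-IDF
minilm     = np.zeros((n, 384))       # sentence embeddings

fused = np.hstack([engineered, lsa, minilm])  # 916-dim per observation
```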
| Model / Config | Norm Prec | Norm Recall | Norm F1 | Macro F1 | Accuracy |
|---|---|---|---|---|---|
| LR (default) ⭐ | 66.7% | 63.1% | 64.8% | 80.8% | 94.0% |
| Linear SVC (default) | 65.9% | 63.4% | 64.6% | 80.7% | 93.9% |
| Linear XGB (default) | 70.7% | 58.7% | 64.1% | 80.5% | 94.3% |
| NL XGBoost (default) | 66.4% | 61.8% | 64.1% | 80.4% | 94.0% |
| MLP [128×5] + threshold (0.468) | 64.2% | 64.4% | 64.3% | 80.4% | 93.8% |
| Bagged LR (25 est.) + threshold (0.768) | 61.5% | 67.5% | 64.4% | 80.4% | 93.5% |
| RBF SVM (balanced) | 53.5% | 74.1% | 62.2% | 78.9% | 92.1% |
🎯 Key Finding: The 916-feature fusion framework closes 90% of the gap between TF-IDF baseline and the best transformer — Logistic Regression at 64.8% Norm F1 nearly matches RoBERTa at 65.4%, at a fraction of inference cost.
- 2,398 unique cultural groups in raw data (most with < 10 examples)
- Consolidated to 12 classes (top 11 + `Other`)
| Group | Count |
|---|---|
| America | 6,783 |
| German | 1,056 |
| British | 956 |
| Australian | 634 |
| French | 614 |
| Italian | 613 |
| Europe | 584 |
| Korean | 499 |
| Dutch | 441 |
| Japanese | 439 |
| Spanish | 338 |
| Other | 10,033 |
| Cultural Group | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Other | 0.78 | 0.33 | 0.46 | 2,006 |
| America | 0.63 | 0.36 | 0.46 | 1,356 |
| Japanese ⭐ | 0.33 | 0.62 | 0.43 | 88 |
| German | 0.26 | 0.34 | 0.29 | 211 |
| British | 0.21 | 0.36 | 0.26 | 191 |
| French | 0.15 | 0.43 | 0.22 | 123 |
| Korean | 0.14 | 0.46 | 0.22 | 100 |
| Europe | 0.15 | 0.43 | 0.22 | 117 |
| Australian | 0.14 | 0.40 | 0.21 | 127 |
| Italian | 0.14 | 0.35 | 0.20 | 122 |
| Dutch | 0.11 | 0.34 | 0.16 | 88 |
| Spanish | 0.11 | 0.38 | 0.17 | 68 |
| Macro Avg | 0.26 | 0.40 | 0.28 | — |
Japanese achieves highest recall (0.62) — likely due to distinctive lexical patterns (formal addressing, specific food customs).
For the text "it is customary for both Germans and their children to employ formal and informal addressing...":
| Setup | German Confidence | Top Prediction |
|---|---|---|
| Unmasked | 18.28% ✅ (correct) | German |
| Masked (`[GROUP]`) | 13.84% | Korean (18.86%) |
Masking drops German confidence ~5% but normative content retains discriminative power.
BiLSTM macro-F1 remains in 0.27–0.37 range. BiLSTM skews heavily toward high-volume classes (America, Other). TF-IDF+SVD outperforms BiLSTM for this task.
Structural limitation: Non-Western cultures (e.g., India/Diwali) fall into `Other` — excluded from the 12 consolidated classes.
| Model / Approach | Macro F1 | Norm Recall | Norm Prec | Norm F1 |
|---|---|---|---|---|
| Baseline LR (raw TF-IDF) | 68.0% | 26.9% | 78.7% | 40.1% |
| Lin. SVM (masked TF-IDF) | 78.2% | 48.9% | 76.0% | 59.5% |
| LSA (10k) + Lin. SVM | 78.1% | 48.9% | 75.6% | 59.4% |
| FAISS Retrieval Only (SVC) | 77.1% | 47.6% | 72.3% | 57.4% |
| LSTM (vocab 7k) | — | 68.0% | 56.0% | 61.0% |
| MiniLM + MLP [128×5] | — | 53.6% | 66.6% | 59.4% |
| TAPT + ULMFiT V02 | 80.1% | 64.7% | 62.9% | 63.8% |
| RoBERTa-base (Trainer, masked) | 79.7% | 65.3% | 61.1% | 63.1% |
| RoBERTa-large (threshold tuned) | 80.2% | 64.0% | 63.8% | 63.9% |
| Fusion 916 — Bagged LR + threshold | 80.4% | 67.5% | 61.5% | 64.4% |
| Fusion 916 — MLP [128×5] + threshold | 80.5% | 64.4% | 64.2% | 64.3% |
| Fusion 916 — LR (default) | 80.8% | 63.1% | 66.7% | 64.8% |
| RoBERTa-base (full text + threshold) ⭐ | 80.2% | 66.3% | 61.8% | 65.4% |
| Gemma-4-E2B (zero-shot) | 7.4% | 100% | 7.9% | 14.7% ❌ |
| Limitation | Impact |
|---|---|
| No SMOTE applied universally | Norm recall likely underestimates achievable ceiling |
| Group consolidation to 12 classes | 2,300+ minority cultures erased; non-Western cultures systematically excluded |
| No calibration analysis | Reliability diagrams not explored |
| No error analysis | What types of Norms are most confused with Not-Norms? |
Norm texts use hedged, conditional language — "it is customary for...", "individuals tend to..." — that is descriptively similar to non-norm factual statements, making the boundary genuinely ambiguous even after agreement filtering (≥ 0.70).
Anik Mallick · Gaurav Tiwari