# Unified Encoder-Only Transformer NLP Project

## **Objective**
Build a **unified NLP system** on **encoder-only Transformer models** (such as BERT, RoBERTa, and DistilBERT) that supports three tasks:
1. **Text Classification** – Assigning labels to entire documents or sentences.
2. **Entity Extraction** – Identifying and classifying spans of text into categories (Named Entity Recognition, NER).
3. **Semantic Similarity & Recommendations** – Measuring text similarity and finding the most relevant matches.

---

## **Purpose**
- **Practical Demonstration**: Showcase how encoder-only transformer architectures can power multiple NLP tasks.
- **Efficiency**: Consolidate code for training, evaluation, and inference across multiple NLP tasks.
- **Flexibility**: Enable switching between datasets, models, and tasks using configuration files or minimal code edits.
- **Reusability**: Serve as a foundation for chatbots, search engines, tagging systems, and recommendation platforms.
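
The configuration-driven switching described above might look like the following sketch. The field names mirror the `CommonConfig` dataclass and the keys that the script's `main()` reads. The `resolve_config` helper and all example values are hypothetical, and the dict stands in for a parsed YAML file.

```python
from dataclasses import dataclass, fields


@dataclass
class CommonConfig:
    model_name: str = "bert-base-uncased"
    output_dir: str = "outputs"
    max_length: int = 256
    batch_size: int = 16
    num_epochs: int = 3
    seed: int = 42


def resolve_config(cfg: dict) -> CommonConfig:
    """Hypothetical helper. Precedence: dataclass defaults < `common` block < top-level keys."""
    merged = dict(cfg.get("common", {}))
    for f in fields(CommonConfig):
        if f.name in cfg:
            merged[f.name] = cfg[f.name]
    return CommonConfig(**merged)


# Stand-in for the dict returned by yaml.safe_load; values are illustrative only.
example_cfg = {
    "common": {"batch_size": 32, "seed": 7},
    "model_name": "distilbert-base-uncased",
    "num_epochs": 5,
    "train_csv": "data/train.csv",  # task-specific keys are ignored by resolve_config
}

config = resolve_config(example_cfg)
print(config)
```

Switching models or batch sizes then only requires editing the config values, not the training code.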

---

## **Workflow Overview**
1. **Data Loading & Preprocessing**
   - Load datasets for each task.
   - Tokenize and prepare inputs for transformer models.
2. **Model Selection**
   - Use the Hugging Face `transformers` library to load pretrained encoder-only models.
3. **Task-Specific Training**
   - Text classification with `Trainer` API.
   - Token classification (NER) with `Trainer` API and token alignment.
   - Sentence similarity training with `sentence-transformers`.
4. **Evaluation**
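   - Report accuracy, precision, recall, and F1-score for classification and NER (the training helpers in this script compute weighted F1).
   - For recommendations, inspect the cosine similarity scores of the neighbors retrieved from the FAISS index.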
5. **Inference & Recommendations**
   - Run predictions on new text.
   - Retrieve similar documents/sentences from FAISS index.
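
To make the classification metrics in step 4 concrete, here is a minimal pure-Python sketch of accuracy and support-weighted precision, recall, and F1. The function name `classification_metrics` and the toy labels are assumptions for illustration. In practice scikit-learn already provides these scores, and the script itself uses `f1_score(..., average='weighted')`.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = n / total  # classes are weighted by their share of true examples
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, purely illustrative.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With support weighting, recall always equals accuracy, which is one reason to report precision and F1 alongside it.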

---

## **Expected Inputs**
- **Text Classification**: CSV with `text` and `label` columns.
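- **Entity Extraction (NER)**: JSON list of examples, each with word-level `tokens` and matching `labels` (datasets that store their tags under `ner_tags` need that key renamed, because the script reads `labels`).
- **Similarity**: CSV of sentence pairs with columns `a`, `b`, and `label` for training, plus a plain-text corpus with one sentence per line for indexing.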
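
As a rough illustration of these inputs, the sketch below writes one tiny file per task using only the standard library. The column and key names (`text`, `label`, `tokens`, `labels`, `a`, `b`) follow what the script's `main()` reads. The file names, the temporary directory, and every record are made up for the example.

```python
import csv
import json
import tempfile
from pathlib import Path

# Illustrative records only.
classification_rows = [
    {"text": "The battery lasts all day.", "label": 1},
    {"text": "The screen cracked within a week.", "label": 0},
]
# NER labels are integer ids, assuming a label_list of ["O", "B-PER", "I-PER", "B-LOC"].
ner_examples = [
    {"tokens": ["Ada", "Lovelace", "lived", "in", "London"], "labels": [1, 2, 0, 0, 3]},
]
similarity_pairs = [
    {"a": "How do I reset my password?", "b": "Password reset steps", "label": 1},
    {"a": "How do I reset my password?", "b": "Store opening hours", "label": 0},
]


def write_csv(path: Path, rows: list) -> None:
    """Write a list of dicts as a CSV file with a header row."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


data_dir = Path(tempfile.mkdtemp())  # hypothetical location; use your own data folder
write_csv(data_dir / "train.csv", classification_rows)
write_csv(data_dir / "pairs.csv", similarity_pairs)
(data_dir / "train.json").write_text(json.dumps(ner_examples), encoding="utf-8")
(data_dir / "corpus.txt").write_text(
    "Password reset steps\nStore opening hours\n", encoding="utf-8"
)

print(sorted(p.name for p in data_dir.iterdir()))
```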

## **Expected Outputs**
- **Classification**: Predicted label for each text.
- **Entity Extraction**: List of entities with positions and types.
- **Similarity**: Ranked list of similar texts with similarity scores.
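
The entity-extraction output can be derived from per-word tag predictions. The sketch below assumes the model's tags follow the common BIO scheme (`B-`, `I-`, `O`), which this document does not prescribe. The helper name `bio_to_spans` is hypothetical, and positions are token indices, not character offsets.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Group BIO tags into (text, type, start_token, end_token_exclusive) spans."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes a trailing entity
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not inside:
            # The current entity ends just before position i
            spans.append((" ".join(tokens[start:i]), ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray I- tag without a matching open entity starts a new one
            start, ent_type = i, tag[2:]
    return spans


tokens = ["Ada", "Lovelace", "lived", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
entities = bio_to_spans(tokens, tags)
for text, ent_type, start, end in entities:
    print(f"{ent_type}: {text!r} (tokens {start}..{end - 1})")
```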

---

## **Applications**
- Chatbots with intent detection and entity recognition.
- Automated tagging of news articles, emails, or documents.
- Semantic search and recommendation engines.
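
For the semantic search use case, the ranking reduces to cosine similarity between a query embedding and each corpus embedding. The pure-Python sketch below shows what an inner-product search over L2-normalized vectors computes. The three-dimensional vectors are toy stand-ins for real sentence embeddings, and `top_k_similar` is a hypothetical helper, not part of FAISS.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_similar(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    texts: Sequence[str],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank texts by cosine similarity, i.e. the inner product of unit vectors."""
    q = l2_normalize(query)
    scored = []
    for text, vec in zip(texts, corpus):
        v = l2_normalize(vec)
        scored.append((text, sum(a * b for a, b in zip(q, v))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy "embeddings"; real ones would come from a sentence encoder.
texts = ["reset your password", "store opening hours", "change account password"]
corpus = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query = [1.0, 0.2, 0.0]

results = top_k_similar(query, corpus, texts, top_k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

An exhaustive loop like this is fine for small corpora. A vector index such as FAISS serves the same ranking at larger scale.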


"""
Unified project combining three notebooks:
- Classification + Multi-label Tagging
- Named Entity Recognition (Token Classification)
- Semantic Similarity & Recommendation (Sentence Transformers + FAISS)

File: encoder_only_transformer_unified_project.py
Author: Generated for user

Usage examples (from terminal):
    # Classification training
    python encoder_only_transformer_unified_project.py --task classification --config configs/classification.yaml

    # NER training
    python encoder_only_transformer_unified_project.py --task ner --config configs/ner.yaml

    # Similarity training / index build
    python encoder_only_transformer_unified_project.py --task similarity --config configs/similarity.yaml

This single-file project organizes data loading, model building, training and evaluation helpers.
Replace dataset paths and small placeholders with your real data.
"""

import os
import argparse
import yaml
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional

# Core ML libs
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)

# For similarity
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
import faiss
import numpy as np

# Utilities
import pandas as pd
from sklearn.metrics import f1_score, classification_report


# -----------------------------
# Config dataclasses (simple)
# -----------------------------
@dataclass
class CommonConfig:
    model_name: str = "bert-base-uncased"
    output_dir: str = "outputs"
    max_length: int = 256
    batch_size: int = 16
    num_epochs: int = 3
    seed: int = 42


# -----------------------------
# Data modules (placeholders)
# -----------------------------
class TextDataset(Dataset):
    def __init__(self, texts: List[str], labels: List[int], tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length)
        item = {k: torch.tensor(v) for k, v in encoding.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item


class TokenClassificationDataset(Dataset):
    """Token classification dataset for inputs that are already split into words.
    Expects a list of dicts: {"tokens": [...], "labels": [...]} with one label per word,
    given either as an int id or as a string key of label2id.
    """
    def __init__(self, examples: List[Dict], tokenizer, label2id: Dict[str, int], max_length=256):
        self.examples = examples
        self.tokenizer = tokenizer
        self.label2id = label2id
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        tokens = ex['tokens']
        # Accept string tags (mapped through label2id) as well as integer ids
        labels = [self.label2id[l] if isinstance(l, str) else l for l in ex['labels']]
        # Tokenize the pre-split words; word_ids() below requires a fast tokenizer
        encoding = self.tokenizer(tokens,
                                  is_split_into_words=True,
                                  truncation=True,
                                  padding=False,
                                  max_length=self.max_length)
        word_ids = encoding.word_ids()
        label_ids = []
        prev_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != prev_word_idx:
                # assign the label for the first token of the word
                label_ids.append(labels[word_idx])
            else:
                # for subsequent tokens of a word, you can set -100 or label depending on your choice
                label_ids.append(-100)
            prev_word_idx = word_idx

        item = {k: torch.tensor(v) for k, v in encoding.items()}
        item['labels'] = torch.tensor(label_ids, dtype=torch.long)
        return item


# -----------------------------
# Model builders
# -----------------------------
class MultiTaskEncoder(nn.Module):
    """Encoder-only model with two heads: classification (single-label) and multi-label tagging (sigmoid outputs)
    This wraps a pretrained encoder and two heads.
    """
    def __init__(self, model_name: str, num_classes: int, num_tags: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_classes)
        self.multitag_head = nn.Linear(hidden_size, num_tags)

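    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        encoder_inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}
        # Some encoders (e.g. DistilBERT) do not accept token_type_ids, so pass it only when provided
        if token_type_ids is not None:
            encoder_inputs['token_type_ids'] = token_type_ids
        outputs = self.encoder(**encoder_inputs)
        pooled = outputs.last_hidden_state[:, 0, :]  # first token ([CLS] for BERT-style models)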
        class_logits = self.classifier(pooled)
        multitag_logits = self.multitag_head(pooled)
        return {
            'class_logits': class_logits,
            'multitag_logits': multitag_logits,
        }


# -----------------------------
# Training helpers
# -----------------------------

def train_classification_with_trainer(train_df: pd.DataFrame, val_df: pd.DataFrame, config: CommonConfig, label_list: List[str]):
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    # create datasets
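    # Map string labels to ids via label_list; integer labels pass through unchanged
    label2id = {l: i for i, l in enumerate(label_list)}
    to_id = lambda l: label2id[l] if isinstance(l, str) else int(l)
    train_texts, train_labels = train_df['text'].tolist(), [to_id(l) for l in train_df['label'].tolist()]
    val_texts, val_labels = val_df['text'].tolist(), [to_id(l) for l in val_df['label'].tolist()]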

    train_dataset = TextDataset(train_texts, train_labels, tokenizer, max_length=config.max_length)
    val_dataset = TextDataset(val_texts, val_labels, tokenizer, max_length=config.max_length)

    model = AutoModelForSequenceClassification.from_pretrained(config.model_name, num_labels=len(label_list))

    args = TrainingArguments(
        output_dir=config.output_dir,
        per_device_train_batch_size=config.batch_size,
        per_device_eval_batch_size=config.batch_size,
        evaluation_strategy="epoch",  # renamed to eval_strategy in recent transformers releases
        num_train_epochs=config.num_epochs,
        save_strategy="epoch",
        load_best_model_at_end=True,
    )

    data_collator = DataCollatorWithPadding(tokenizer)

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = np.argmax(pred.predictions, axis=1)
        return {"f1": f1_score(labels, preds, average='weighted')}

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    trainer.save_model(config.output_dir)


def train_token_classification_with_trainer(train_examples: List[Dict], val_examples: List[Dict], config: CommonConfig, label_list: List[str]):
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    label2id = {l: i for i, l in enumerate(label_list)}

    train_dataset = TokenClassificationDataset(train_examples, tokenizer, label2id, max_length=config.max_length)
    val_dataset = TokenClassificationDataset(val_examples, tokenizer, label2id, max_length=config.max_length)

    model = AutoModelForTokenClassification.from_pretrained(config.model_name, num_labels=len(label_list))

    args = TrainingArguments(
        output_dir=config.output_dir,
        per_device_train_batch_size=config.batch_size,
        per_device_eval_batch_size=config.batch_size,
        evaluation_strategy="epoch",  # renamed to eval_strategy in recent transformers releases
        num_train_epochs=config.num_epochs,
        save_strategy="epoch",
        load_best_model_at_end=True,
    )

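    # DataCollatorWithPadding does not pad the per-token labels, so use the token-classification collator
    from transformers import DataCollatorForTokenClassification
    data_collator = DataCollatorForTokenClassification(tokenizer)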

    # Note: this is a token-level weighted F1 over positions whose label is not -100, not an entity-level score
    def compute_metrics(pred):
        # pred.predictions shape = (num_examples, seq_len, num_labels)
        preds = np.argmax(pred.predictions, axis=-1).flatten()
        labels = pred.label_ids.flatten()
        mask = labels != -100
        return {"f1": f1_score(labels[mask], preds[mask], average='weighted')}

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    trainer.save_model(config.output_dir)


# -----------------------------
# Similarity training and FAISS index
# -----------------------------

def train_similarity_sentence_transformer(train_pairs: List[Tuple[str, str, int]], model_name: str, output_dir: str, epochs=1, batch_size=8):
    # train_pairs is list of (sent1, sent2, label) where label 1=similar, 0=not similar
    model = SentenceTransformer(model_name)
    if not train_pairs:
        # No pairs to fine-tune on: skip training so the pretrained model can still be used for indexing
        os.makedirs(output_dir, exist_ok=True)
        return model
    sents_train = [InputExample(texts=[a, b], label=float(label)) for a, b, label in train_pairs]
    train_loader = torch.utils.data.DataLoader(sents_train, shuffle=True, batch_size=batch_size)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(train_loader, loss)], epochs=epochs, output_path=output_dir)
    return model


def build_faiss_index(sentences: List[str], model: SentenceTransformer, index_path: str, dim: Optional[int] = None):
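    embeddings = model.encode(sentences, convert_to_numpy=True, show_progress_bar=True)
    # FAISS expects a contiguous float32 matrix
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)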
    if dim is None:
        dim = embeddings.shape[1]

    index = faiss.IndexFlatIP(dim)  # inner product for cosine if normalized
    # normalize for cosine
    faiss.normalize_L2(embeddings)
    index.add(embeddings)
    faiss.write_index(index, index_path)
    # Save sentences mapping
    with open(index_path + '.meta', 'w', encoding='utf-8') as f:
        for s in sentences:
            f.write(s.replace('\n', ' ') + '\n')
    return index


def query_faiss(index_path: str, model: SentenceTransformer, query: str, top_k=5):
    index = faiss.read_index(index_path)
    with open(index_path + '.meta', 'r', encoding='utf-8') as f:
        sentences = [line.strip() for line in f]
    q_emb = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    # FAISS fills missing results with index -1 when fewer than top_k vectors exist; skip those slots
    return [(sentences[i], float(D[0][k])) for k, i in enumerate(I[0]) if i >= 0]


# -----------------------------
# CLI and main orchestration
# -----------------------------

def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument('--task', type=str, choices=['classification', 'ner', 'similarity'], required=True)
    p.add_argument('--config', type=str, required=True, help='Path to yaml config for the task')
    return p.parse_args()


def load_config(path: str) -> dict:
    with open(path, 'r', encoding='utf-8') as f:
        return yaml.safe_load(f)


def main():
    args = parse_args()
    cfg = load_config(args.config)

    common = CommonConfig(**cfg.get('common', {}))

    if args.task == 'classification':
        # Expect train/val csv with columns text,label
        train_df = pd.read_csv(cfg['train_csv'])
        val_df = pd.read_csv(cfg['val_csv'])
        label_list = cfg['label_list']
        common.model_name = cfg.get('model_name', common.model_name)
        common.output_dir = cfg.get('output_dir', common.output_dir)
        common.max_length = cfg.get('max_length', common.max_length)
        common.batch_size = cfg.get('batch_size', common.batch_size)
        common.num_epochs = cfg.get('num_epochs', common.num_epochs)
        train_classification_with_trainer(train_df, val_df, common, label_list)

    elif args.task == 'ner':
        # Expect a JSON file holding a list of examples with tokens + labels (json.load does not read JSONL)
        # Example input format: [{"tokens": [...], "labels": [...]}, ...]
        import json
        with open(cfg['train_json'], 'r', encoding='utf-8') as f:
            train_examples = json.load(f)
        with open(cfg['val_json'], 'r', encoding='utf-8') as f:
            val_examples = json.load(f)
        label_list = cfg['label_list']
        common.model_name = cfg.get('model_name', common.model_name)
        common.output_dir = cfg.get('output_dir', common.output_dir)
        common.max_length = cfg.get('max_length', common.max_length)
        common.batch_size = cfg.get('batch_size', common.batch_size)
        common.num_epochs = cfg.get('num_epochs', common.num_epochs)
        train_token_classification_with_trainer(train_examples, val_examples, common, label_list)

    elif args.task == 'similarity':
        # Expect a csv or json with pairs for training and a corpus file for indexing
        train_pairs = []
        if 'train_pairs_csv' in cfg:
            df = pd.read_csv(cfg['train_pairs_csv'])
            # expect columns a,b,label
            train_pairs = list(df[['a','b','label']].itertuples(index=False, name=None))
        model_name = cfg.get('model_name', common.model_name)
        output_dir = cfg.get('output_dir', common.output_dir)
        epochs = cfg.get('num_epochs', 1)
        model = train_similarity_sentence_transformer(train_pairs, model_name, output_dir, epochs=epochs, batch_size=cfg.get('batch_size',8))
        # build index if corpus provided
        if 'corpus_txt' in cfg:
            with open(cfg['corpus_txt'], 'r', encoding='utf-8') as f:
                sentences = [l.strip() for l in f if l.strip()]
            index_path = os.path.join(output_dir, 'faiss.index')
            build_faiss_index(sentences, model, index_path)
            print('FAISS index written to', index_path)

    else:
        raise ValueError('Unknown task')


if __name__ == '__main__':
    main()
