# üáªüá≥ Vietnamese Paraphrase Identification ‚Äî Inference Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vmhdaica/vietnamese-paraphrase-identification/blob/main/notebooks/inference_demo.ipynb)

This notebook loads the fine-tuned **PhoBERT-base-v2** model from the Hugging Face Hub and lets you:
1. Run **batch predictions** on sample sentence pairs
2. Launch an **interactive Gradio UI** right inside Colab

**Model:** [`vmhdaica/vnpi_model_checkpoint_3135`](https://huggingface.co/vmhdaica/vnpi_model_checkpoint_3135)  
**Accuracy:** 97.02% | **Macro-F1:** 0.876 | **PR-AUC:** 0.9995

---

## 1. Install Dependencies

In [None]:
!pip -q install transformers torch gradio

## 2. Load Model from the Hugging Face Hub

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "vmhdaica/vnpi_model_checkpoint_3135"
MAX_LENGTH = 256

print(f"Loading model: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
print(f"✅ Model loaded on {device}")
print(f"   Labels: {model.config.id2label}")

## 3. Prediction Function

In [None]:
def predict_pair(s1: str, s2: str) -> dict:
    """Predict whether two Vietnamese sentences are paraphrases."""
    inputs = tokenizer(
        s1, s2,
        truncation=True, max_length=MAX_LENGTH,
        return_tensors="pt",
    ).to(device)
    inputs.pop("token_type_ids", None)  # PhoBERT/RoBERTa doesn't use this

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0].cpu().numpy()

    label = "✅ Paraphrase" if probs[1] > probs[0] else "❌ Not paraphrase"
    return {
        "label": label,
        "p_paraphrase": round(float(probs[1]), 4),
        "p_not_paraphrase": round(float(probs[0]), 4),
    }

# Quick test
result = predict_pair("Hôm nay trời mưa rất to.", "Thời tiết hôm nay mưa lớn.")
print(result)
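
The function above converts the model's two logits into probabilities with a softmax and picks the larger one. As a minimal, dependency-free sketch of that step: the logit values below are made up for illustration, and the convention that index 1 means "paraphrase" is taken from the code above, not verified against the model card.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits):
    """Return (label, probs); index 1 is assumed to mean 'paraphrase'."""
    probs = softmax(logits)
    # Strict comparison: an exact tie falls back to 'not_paraphrase'.
    label = "paraphrase" if probs[1] > probs[0] else "not_paraphrase"
    return label, probs

# Hypothetical logits, not real model output.
label, probs = decide([-1.2, 2.3])
print(label, [round(p, 4) for p in probs])
```

With only two classes, `probs[1] > probs[0]` is the same as `probs[1] > 0.5`, so the rule is equivalent to a fixed 0.5 threshold on the paraphrase probability.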

## 4. Batch Predictions: Sample Pairs

In [None]:
import pandas as pd

test_pairs = [
    ("Hôm nay trời mưa rất to.",
     "Thời tiết hôm nay mưa lớn."),
    ("Giá vàng tăng mạnh.",
     "Trận đấu tối qua rất hấp dẫn."),
    ("Thủ tướng đã họp với các bộ trưởng.",
     "Cuộc họp của Thủ tướng với nội các đã diễn ra."),
    ("Hà Nội là thủ đô của Việt Nam.",
     "TP.HCM là thành phố lớn nhất Việt Nam."),
    ("Cô ấy rất giỏi tiếng Anh.",
     "Khả năng tiếng Anh của cô ấy rất tốt."),
    ("Tôi đi ăn phở sáng nay.",
     "Sáng nay tôi đã thưởng thức một tô phở."),
    ("Việt Nam nằm ở Đông Nam Á.",
     "Đất nước Việt Nam thuộc khu vực Đông Nam Á."),
    ("Con mèo ngồi trên bàn.",
     "Chiếc xe đang chạy trên đường."),
]

results = []
for s1, s2 in test_pairs:
    r = predict_pair(s1, s2)
    results.append({
        "Sentence 1": s1,
        "Sentence 2": s2,
        "Prediction": r["label"],
        "P(paraphrase)": r["p_paraphrase"],
    })

df = pd.DataFrame(results)
df.style.set_properties(**{"text-align": "left"})
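
The loop above calls the model once per pair, which is fine for eight examples. For larger lists it may be worth grouping pairs into mini-batches. The sketch below shows only the batching logic in plain Python. `chunked`, `predict_in_batches`, and `dummy_score_batch` are hypothetical helper names, and the dummy scorer is a stand-in for the model, not part of the notebook.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]

def chunked(pairs: Sequence[Pair], size: int) -> Iterator[List[Pair]]:
    """Yield consecutive mini-batches of at most `size` pairs."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(pairs), size):
        yield list(pairs[start:start + size])

def predict_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[List[Pair]], List[float]],
    batch_size: int = 16,
) -> List[float]:
    """Apply `score_batch` to each mini-batch and flatten, keeping input order."""
    scores: List[float] = []
    for batch in chunked(pairs, batch_size):
        scores.extend(score_batch(batch))
    return scores

# Stand-in scorer for illustration only: 1.0 if both strings are identical.
def dummy_score_batch(batch: List[Pair]) -> List[float]:
    return [1.0 if a == b else 0.0 for a, b in batch]

demo_pairs = [("a", "a"), ("a", "b"), ("c", "c"), ("d", "e"), ("f", "f")]
print(predict_in_batches(demo_pairs, dummy_score_batch, batch_size=2))
```

In the notebook, a real `score_batch` could tokenize the two sentence lists together with `padding=True` (so the tensors in a batch share one length) and return the paraphrase column of the softmax output.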

## 5. Try Your Own Sentences

In [None]:
# ✏️ Change these sentences and re-run this cell!

sentence_1 = "Việt Nam có nhiều cảnh đẹp."
sentence_2 = "Đất nước Việt Nam rất nhiều phong cảnh tuyệt vời."

result = predict_pair(sentence_1, sentence_2)
print(f"  Sentence 1: {sentence_1}")
print(f"  Sentence 2: {sentence_2}")
print(f"  → {result['label']}  (P(paraphrase): {result['p_paraphrase']:.4f})")
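
Note that `p_paraphrase` is the probability of the paraphrase class, so for a pair predicted as not-paraphrase it will be low even when the model is very sure. A small sketch of reporting confidence in the predicted label instead: `predicted_confidence` is a hypothetical helper, and the example dict uses made-up values in the shape returned by `predict_pair`.

```python
def predicted_confidence(result: dict) -> float:
    """Probability the model assigns to the label it actually predicted."""
    return max(result["p_paraphrase"], result["p_not_paraphrase"])

# Hypothetical output in the shape returned by predict_pair (values made up).
example = {
    "label": "Not paraphrase",
    "p_paraphrase": 0.03,
    "p_not_paraphrase": 0.97,
}
print(f"confidence in predicted label: {predicted_confidence(example):.2f}")
```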

## 6. 🎨 Interactive Gradio Demo

Run this cell to launch a **live UI** directly inside Colab!

In [None]:
import gradio as gr

def gradio_predict(s1: str, s2: str) -> dict:
    if not s1.strip() or not s2.strip():
        return {"paraphrase": 0.0, "not_paraphrase": 1.0}
    inputs = tokenizer(s1, s2, truncation=True, max_length=MAX_LENGTH, return_tensors="pt").to(device)
    inputs.pop("token_type_ids", None)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0].cpu().numpy()
    return {"paraphrase": float(probs[1]), "not_paraphrase": float(probs[0])}

examples = [
    ["Hôm nay trời mưa rất to.", "Thời tiết hôm nay mưa lớn."],
    ["Giá vàng tăng mạnh.", "Trận đấu tối qua rất hấp dẫn."],
    ["Thủ tướng đã họp với các bộ trưởng.", "Cuộc họp của Thủ tướng với nội các đã diễn ra."],
    ["Hà Nội là thủ đô của Việt Nam.", "TP.HCM là thành phố lớn nhất Việt Nam."],
    ["Cô ấy rất giỏi tiếng Anh.", "Khả năng tiếng Anh của cô ấy rất tốt."],
    ["Tôi đi ăn phở sáng nay.", "Sáng nay tôi đã thưởng thức một tô phở."],
]

demo = gr.Interface(
    fn=gradio_predict,
    inputs=[
        gr.Textbox(label="Câu 1 / Sentence 1", placeholder="Nhập câu tiếng Việt...", lines=2),
        gr.Textbox(label="Câu 2 / Sentence 2", placeholder="Nhập câu tiếng Việt...", lines=2),
    ],
    outputs=gr.Label(label="Result", num_top_classes=2),
    title="🇻🇳 Vietnamese Paraphrase Identification",
    description="Compare two Vietnamese sentences. Model: PhoBERT-base-v2 fine-tuned on 40K+ pairs (97.02% accuracy).",
    examples=examples,
    theme=gr.themes.Soft(),
)

demo.launch(share=True)  # share=True creates a public URL
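
`gradio_predict` repeats the tokenize-and-softmax logic from `predict_pair`. One possible alternative, sketched below, wraps an existing prediction function and only reshapes its output into the label-to-score dict that the UI displays. `to_label_scores`, `make_gradio_fn`, and `fake_predict` are hypothetical names, and the fake predictor returns made-up numbers so the example runs without the model.

```python
from typing import Callable, Dict

def to_label_scores(result: dict) -> Dict[str, float]:
    """Map predict_pair-style output to a {label: score} dict."""
    return {
        "paraphrase": result["p_paraphrase"],
        "not_paraphrase": result["p_not_paraphrase"],
    }

def make_gradio_fn(predict: Callable[[str, str], dict]) -> Callable[[str, str], Dict[str, float]]:
    """Build a UI callback around `predict`, keeping the empty-input guard."""
    def fn(s1: str, s2: str) -> Dict[str, float]:
        # Same fallback as the notebook: blank input counts as not-paraphrase.
        if not s1.strip() or not s2.strip():
            return {"paraphrase": 0.0, "not_paraphrase": 1.0}
        return to_label_scores(predict(s1, s2))
    return fn

# Stand-in predictor for illustration only (values are made up).
def fake_predict(s1: str, s2: str) -> dict:
    return {"label": "Paraphrase", "p_paraphrase": 0.9, "p_not_paraphrase": 0.1}

callback = make_gradio_fn(fake_predict)
print(callback("a", "b"))
print(callback("   ", "b"))
```

In the notebook, `make_gradio_fn(predict_pair)` could then be passed as `fn=` so that both code paths share one implementation.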

---

**Resources:**
- 🤗 Model: [`vmhdaica/vnpi_model_checkpoint_3135`](https://huggingface.co/vmhdaica/vnpi_model_checkpoint_3135)
- 📂 GitHub: [`vietnamese-paraphrase-identification`](https://github.com/Hoang-ca/vietnamese-paraphrase-identification)
- 📝 Model Card: see `MODEL_CARD.md` in the repo