# 🧠 NLP Topic Classification Model (Stage B)

This notebook handles the **Training Phase** of the pipeline:
1.  **Load Data**: Reads the clean `nlp_dataset.jsonl` created in Stage A.
2.  **Tokenization**: Segments Vietnamese words using `pyvi` (e.g., "Hà Nội" -> "Hà_Nội").
3.  **Vectorization**: Converts text to **TF-IDF** feature vectors; stopwords are removed at this step.
4.  **Modeling**: Trains a **Logistic Regression** classifier.
5.  **Evaluation**: Reports accuracy and F1-score, then inspects individual misclassified articles.
6.  **Export**: Saves the trained model for deployment.

In [1]:
# --- 1. SETUP & IMPORTS ---

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from pyvi import ViTokenizer

# Scikit-learn modules
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

# Configuration
pd.set_option('display.max_colwidth', 200)
DATA_FILE = "nlp_dataset.jsonl"
MODEL_PATH = "nlp_topic_classifier.pkl"

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 2. Load Data & Prepare Stopwords

In [2]:
# Load Dataset
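print(f"📂 Loading dataset from {DATA_FILE}...")

try:
    df = pd.read_json(DATA_FILE, lines=True)
    print(f"✅ Loaded {len(df)} records.")
except (FileNotFoundError, ValueError) as exc:
    # Depending on the pandas version, a missing file raises FileNotFoundError or ValueError;
    # malformed JSON also raises ValueError.
    print(f"❌ Error: could not load {DATA_FILE} ({exc}). Please run 'create_dataset.py' first.")
    raise  # stop here so later cells do not fail on an undefined `df`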

# Define Vietnamese Stopwords
# These words are grammatical glue that don't carry much topic meaning
STOPWORDS = {
    "của", "và", "các", "có", "được", "cho", "trong", "một", "là", "với", "để",
    "những", "khi", "này", "tại", "đã", "thì", "mà", "như", "năm", "từ",
    "đến", "người", "sẽ", "cũng", "về", "vào", "ra", "nên", "nếu", "bị", "bởi",
    "lại", "do", "nhưng", "đang", "vẫn", "chỉ", "theo", "gì", "ai", "đâu", "rằng"
}
STOPWORDS_LIST = list(STOPWORDS)

# Preview data
df[['label_name', 'text']].head(3)

📂 Loading dataset from nlp_dataset.jsonl...
✅ Loaded 126687 records.


| | label_name | text |
|---|---|---|
| 0 | Ẩm thực | Một trong những phân khu sôi động nhất Hội chợ Mùa Thu 2025 đang diễn ra tại Trung tâm Hội chợ Triển lãm Quốc gia (ở xã Đông Anh, Hà Nội) là “Thu mỹ vị”. Khu vực này đem đến một đại tiệc ẩm thực t... |
| 1 | Ẩm thực | Chủ đề về ẩm thực Hà thành luôn thu hút sự quan tâm của nhiều người. Sau khi Dân trí đăng tải bài viết Tranh cãi mâm cỗ nghệ nhân ở Bát Tràng 6 người giá 2,7 triệu đồng đắt đỏ, độc giả và bạn đọc ... |
| 2 | Ẩm thực | Tự làm bánh mỳ, bánh phở, nhập mắm tôm từ Việt Nam sang Mỹ\n\nTrong bài viết được tác giả Helen Rosner chia sẻ nhận định trên tờ New Yorker, một nhà hàng do người Việt Nam làm chủ ở khu Upper West... |


## 3. Vietnamese Word Segmentation (Tách từ)
We use `pyvi` to join the syllables of compound words with underscores (e.g., `học sinh` -> `học_sinh`). This is crucial for TF-IDF to treat each compound as a single token.
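
Why the underscore matters can be seen without running `pyvi` at all. The sketch below applies the regular expression that `TfidfVectorizer` uses as its default `token_pattern` to a hand-segmented string. The sample strings and the `tokens` helper are illustrative assumptions, not part of the notebook, and the segmented string is typed by hand as a stand-in for real `pyvi` output.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer:
# runs of two or more word characters (the underscore counts as a word character).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Lowercase and tokenize the way the default vectorizer settings would."""
    return TOKEN_PATTERN.findall(text.lower())

raw = "học sinh Hà Nội"          # unsegmented text
segmented = "học_sinh Hà_Nội"    # hand-written stand-in for pyvi output

print(tokens(raw))        # ['học', 'sinh', 'hà', 'nội']
print(tokens(segmented))  # ['học_sinh', 'hà_nội']
```

Without segmentation the vectorizer sees four unrelated syllables; with it, each compound becomes one vocabulary entry.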

In [None]:
print("⚙️ Tokenizing text (This may take a minute)...")

def segment_text(text):
    if not isinstance(text, str): return ""
    return ViTokenizer.tokenize(text)

# Apply tokenization to a new column
df['segmented_text'] = df['text'].apply(segment_text)

print("✅ Tokenization Complete!")
print("Sample Original:  ", df['text'].iloc[0][:60])
print("Sample Segmented: ", df['segmented_text'].iloc[0][:60])

## 4. Train - Test Split
We split the data **BEFORE** vectorization to prevent data leakage: the TF-IDF vocabulary and IDF weights are then learned from the training set only.
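
To make the leakage concrete, here is a small pure-Python sketch of the smoothed IDF formula that scikit-learn applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The toy documents and the `smoothed_idf` helper are made up for illustration; the notebook itself relies on `TfidfVectorizer` inside the pipeline.

```python
import math

def smoothed_idf(docs):
    """IDF with scikit-learn's default smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

# Toy, hand-segmented documents (illustrative only)
train_docs = ["giá nhà tăng", "giá vàng giảm", "đội_tuyển thắng"]
test_docs = ["giá chung_cư tăng"]

idf_train_only = smoothed_idf(train_docs)
idf_leaky = smoothed_idf(train_docs + test_docs)  # what fitting before the split would do

print(round(idf_train_only["giá"], 3), round(idf_leaky["giá"], 3))  # 1.288 1.223
print("chung_cư" in idf_train_only, "chung_cư" in idf_leaky)        # False True
```

Fitting on all documents shifts the weight of `giá` and adds a test-only term to the vocabulary, so the test score would no longer reflect truly unseen text.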

In [None]:
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    df['segmented_text'], 
    df['label_id'], 
    df.index, # Keep index to trace back filenames later
    test_size=0.2, 
    random_state=42, 
    stratify=df['label_id'] # Preserve the class proportions in both splits
)

print(f"🔹 Training Set: {len(X_train)} samples")
print(f"🔹 Test Set:     {len(X_test)} samples")

## 5. Build & Train Pipeline
We use a `Pipeline` to bundle:
1.  **TfidfVectorizer**: Converts text to vectors, filtering stopwords and rare words.
2.  **LogisticRegression**: A fast and effective baseline model for text classification.
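
As a rough illustration of what the vectorizer step feeds the classifier, the sketch below mimics the order of operations of scikit-learn's word analyzer under its default settings: lowercase, tokenize, drop stop words, then build n-grams. The `word_ngrams` helper, the two-word stop list, and the sample sentence are assumptions made for this example; it is not the library's code.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # TfidfVectorizer's default token_pattern

def word_ngrams(text, stop_words, ngram_range=(1, 2)):
    """Lowercase -> tokenize -> drop stop words -> build word n-grams."""
    toks = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(toks) - n + 1):
            grams.append(" ".join(toks[i:i + n]))
    return grams

stop_words = {"của", "và"}               # tiny subset, for illustration
text = "giá của chung_cư và nhà_đất"     # hand-segmented sample

print(word_ngrams(text, stop_words))
# ['giá', 'chung_cư', 'nhà_đất', 'giá chung_cư', 'chung_cư nhà_đất']
```

Because stop words are removed before the n-grams are formed, a bigram such as `giá chung_cư` can bridge a removed word.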

In [None]:
print("🚀 Training Model...")

model_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=20000,      # Keep the 20k most frequent terms (unigrams + bigrams)
        ngram_range=(1, 2),      # Use unigrams and bigrams
        stop_words=STOPWORDS_LIST,
        min_df=2                 # Ignore terms that occur in fewer than 2 documents
    )),
    ('clf', LogisticRegression(
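        solver='sag',            # Stochastic average gradient; scales well to large datasets
        multi_class='multinomial',  # Deprecated since scikit-learn 1.5; omit it there (multinomial is already the default for 'sag')
        max_iter=1000,
        n_jobs=-1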
    ))
])

model_pipeline.fit(X_train, y_train)

print("✅ Training Finished.")

## 6. Evaluation
Assessing model performance on the Test set.
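
For readers who want to see where the numbers in a classification report come from, the following sketch derives per-class precision, recall, and F1 from a confusion matrix laid out as scikit-learn does (rows are true classes, columns are predictions). The matrix is made up and is not a result from this notebook; `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append((precision, recall, f1))
    return scores

# Made-up 3-class confusion matrix (illustrative only)
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

scores = per_class_scores(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)

for k, (p, r, f1) in enumerate(scores):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Macro F1 averages the per-class scores with equal weight, so a weak minority class lowers it even when overall accuracy looks healthy.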

In [None]:
# Predictions
y_pred = model_pipeline.predict(X_test)

# Label names sorted by label_id (positional lookups assume ids run 0..n-1)
label_map = df.drop_duplicates('label_id').sort_values('label_id')
label_names = label_map['label_name'].tolist()

# Metrics
acc = accuracy_score(y_test, y_pred)
print(f"🏆 Accuracy: {acc:.2%}\n")

print("📊 Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_names))

### Confusion Matrix Visualization

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=label_names, 
            yticklabels=label_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

## 7. Error Analysis (Important)
Identify which specific articles confused the model. Because the DataFrame index was carried through the split (`idx_test`), each prediction can be traced back to its source `filename`.
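
Before reading individual articles, it can help to rank which class pairs are confused most often. The sketch below does this on a made-up confusion matrix with placeholder English label names; `top_confusions` is a hypothetical helper, and it should work unchanged on the array returned by `confusion_matrix` since it only uses `len` and `cm[i][j]` indexing.

```python
def top_confusions(cm, label_names, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    pairs = [
        (label_names[i], label_names[j], cm[i][j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]

# Made-up labels and counts (illustrative only)
labels = ["Food", "Real estate", "Sports"]
cm = [
    [50, 4, 1],
    [7, 40, 0],
    [2, 0, 60],
]

for true_label, predicted_label, count in top_confusions(cm, labels):
    print(f"{true_label} -> {predicted_label}: {count}")
```

The most frequent pair is a natural starting point for filtering the `errors` table by `label_name` and `predicted_label`.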

In [None]:
# Reconstruct Test DataFrame with predictions
test_results = df.loc[idx_test].copy()
test_results['predicted_id'] = y_pred
id_to_name = dict(zip(label_map['label_id'], label_map['label_name']))  # works even if ids are not 0..n-1
test_results['predicted_label'] = test_results['predicted_id'].map(id_to_name)

# Filter Errors
errors = test_results[test_results['label_id'] != test_results['predicted_id']]

print(f"⚠️ Total Misclassified: {len(errors)} / {len(X_test)}")
print("Sample errors:")

# Show up to 5 random errors
if not errors.empty:
    cols_to_show = ['filename', 'label_name', 'predicted_label', 'text']
    display(errors[cols_to_show].sample(min(5, len(errors))))
else:
    print("No errors found. A perfect score is suspicious: check for train/test leakage or duplicated articles.")

## 8. Save Model & Live Inference

In [None]:
# Save pipeline
joblib.dump(model_pipeline, MODEL_PATH)
print(f"💾 Model saved to: {MODEL_PATH}")

# --- Inference Function ---
def predict_news(text):
    # Preprocessing (must match training)
    tokenized = ViTokenizer.tokenize(text)

    # Predict once and reuse the probabilities for both the label and the confidence
    proba = model_pipeline.predict_proba([tokenized])[0]
    best = proba.argmax()
    pred_id = model_pipeline.classes_[best]

    return label_names[pred_id], proba[best]

# Try it out
sample_news = "Giá chung cư tại Hà Nội tiếp tục lập đỉnh mới do nguồn cung khan hiếm."
topic, conf = predict_news(sample_news)

print(f"\n📰 Input: {sample_news}")
print(f"🤖 Prediction: {topic} (Confidence: {conf:.1%})")