# Thai Sentiment Classification with XLM-RoBERTa

This notebook trains a **XLM-RoBERTa** model for Thai text sentiment classification using the Wisesight-Sentiment-Thai dataset.

## 1. Install Dependencies

In [1]:
!pip install transformers datasets scikit-learn tensorflow



You should consider upgrading via the 'd:\Github\Text-classification-Thai\.venv\Scripts\python.exe -m pip install --upgrade pip' command.


## 2. Import Libraries

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
import pickle

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")

  from .autonotebook import tqdm as notebook_tqdm


TensorFlow version: 2.19.0
GPU Available: []


## 3. Load and Prepare Dataset

In [3]:
print("[INFO] Loading dataset from Hugging Face ...")

# Load ZombitX64/Wisesight-Sentiment-Thai dataset
dataset = None
texts = []
labels = []

try:
    # Method 1: Direct loading with specific configuration
    print("[INFO] Loading ZombitX64/Wisesight-Sentiment-Thai dataset...")
    
    # Try loading without split first to see available splits
    dataset_dict = load_dataset('ZombitX64/Wisesight-Sentiment-Thai')
    print(f"Available splits: {list(dataset_dict.keys())}")
    
    # Use train split
    dataset = dataset_dict['train']
    print(f"Dataset loaded successfully! Size: {len(dataset)}")
    
    # Check column names
    print(f"Column names: {dataset.column_names}")
    
    # Extract data based on available columns
    if 'text' in dataset.column_names:
        texts = dataset['text']
    elif 'texts' in dataset.column_names:
        texts = dataset['texts']
    else:
        # Find text column
        for col in dataset.column_names:
            if 'text' in col.lower():
                texts = dataset[col]
                break
    
    if 'sentiment' in dataset.column_names:
        labels = dataset['sentiment']
    elif 'label' in dataset.column_names:
        labels = dataset['label']
    elif 'category' in dataset.column_names:
        labels = dataset['category']
    else:
        # Find label column
        for col in dataset.column_names:
            if any(keyword in col.lower() for keyword in ['sentiment', 'label', 'category']):
                labels = dataset[col]
                break
    
    print(f"[INFO] Successfully extracted {len(texts)} texts and {len(labels)} labels")
    print(f"Sample text: {texts[0] if texts else 'No texts found'}")
    print(f"Sample labels: {labels[:5] if labels else 'No labels found'}")
    
except Exception as e:
    print(f"Method 1 failed: {e}")
    try:
        # Method 2: Try with trust_remote_code and different parameters
        print("[INFO] Trying alternative loading method...")
        dataset = load_dataset('ZombitX64/Wisesight-Sentiment-Thai', 
                              split='train',
                              trust_remote_code=True,
                              download_mode='force_redownload')
        
        texts = dataset['text'] if 'text' in dataset.column_names else dataset['texts']
        labels = dataset['sentiment'] if 'sentiment' in dataset.column_names else dataset['label']
        print("[INFO] Alternative method successful!")
        
    except Exception as e2:
        print(f"Method 2 failed: {e2}")
        try:
            # Method 3: Manual download approach
            print("[INFO] Trying manual loading approach...")
            
            # Use datasets library with specific configuration
            from datasets import load_dataset, DownloadConfig
            
            download_config = DownloadConfig(
                resume_download=True,
                force_download=False,
                use_etag=False
            )
            
            dataset = load_dataset('ZombitX64/Wisesight-Sentiment-Thai',
                                 split='train',
                                 download_config=download_config)
            
            # Extract data
            texts = dataset['text'] if 'text' in dataset.column_names else dataset['texts']
            labels = dataset['sentiment'] if 'sentiment' in dataset.column_names else dataset['label']
            print("[INFO] Manual loading successful!")
            
        except Exception as e3:
            print(f"Method 3 failed: {e3}")
            print("[INFO] All methods failed. Please check dataset availability.")
            print("Error details:")
            print(f"- Method 1: {e}")
            print(f"- Method 2: {e2}")  
            print(f"- Method 3: {e3}")
            
            # Exit if all methods fail
            raise Exception("Unable to load ZombitX64/Wisesight-Sentiment-Thai dataset")

print("[INFO] Dataset loading complete.")

# Verify data quality
print(f"Total samples: {len(texts)}")
print(f"Total labels: {len(labels)}")

if len(texts) != len(labels):
    min_len = min(len(texts), len(labels))
    texts = texts[:min_len]
    labels = labels[:min_len]
    print(f"Adjusted to {min_len} samples for consistency")

# Use subset for training (adjust based on your resources)
SAMPLE_SIZE = min(50000, len(texts))  # Use up to 50k samples
texts = texts[:SAMPLE_SIZE]
labels = labels[:SAMPLE_SIZE]

print(f"\nFinal dataset info:")
print(f"Total samples: {len(labels)}")
print(f"Sample texts: {texts[:3]}")
print(f"Sample labels: {labels[:10]}")
print(f"Unique labels: {set(labels)}")

# Show label distribution
import pandas as pd
label_counts = pd.Series(labels).value_counts()
print(f"\nLabel distribution:\n{label_counts}")

print("[INFO] ZombitX64/Wisesight-Sentiment-Thai dataset ready!")

[INFO] Loading dataset from Hugging Face ...
[INFO] Loading ZombitX64/Wisesight-Sentiment-Thai dataset...


Using the latest cached version of the dataset since ZombitX64/Wisesight-Sentiment-Thai couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at C:\Users\Admin\.cache\huggingface\datasets\ZombitX64___wisesight-sentiment-thai\default\0.0.0\ce65e1d7831a020c303f6a0e204c43bd351a0f3e (last modified on Mon Jul  7 09:49:35 2025).
Found the latest cached dataset configuration 'default' at C:\Users\Admin\.cache\huggingface\datasets\ZombitX64___wisesight-sentiment-thai\default\0.0.0\ce65e1d7831a020c303f6a0e204c43bd351a0f3e (last modified on Mon Jul  7 09:49:35 2025).


Available splits: ['train']
Dataset loaded successfully! Size: 628715
Column names: ['text', 'sentiment']
[INFO] Successfully extracted 628715 texts and 628715 labels
Sample text: คอกเทลตลกอ่าาา สงสารพี่พิธีกรมาก55555
Sample labels: ['positive', 'question', 'positive', 'positive', 'neutral']
[INFO] Dataset loading complete.
Total samples: 628715
Total labels: 628715

Final dataset info:
Total samples: 50000
Sample texts: ['คอกเทลตลกอ่าาา สงสารพี่พิธีกรมาก55555', 'I love you....คุณชมพู่ว่ากี่คำครับ😂', 'สวัสดีครับครู❤ผมหายซึมแล้วครับ']
Sample labels: ['positive', 'question', 'positive', 'positive', 'neutral', 'positive', 'positive', 'positive', 'positive', 'neutral']
Unique labels: {'neutral', 'question', 'negative', 'positive', 'mixed'}

Label distribution:
neutral     14541
question    12282
positive    12149
mixed        6608
negative     4420
Name: count, dtype: int64
[INFO] ZombitX64/Wisesight-Sentiment-Thai dataset ready!
[INFO] Successfully extracted 628715 texts and 628715 labels

## 4. Encode Labels

In [4]:
print("[INFO] Encoding labels ...")
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)
num_labels = len(le.classes_)

print(f"Total samples: {len(labels)}")
print(f"Label classes: {le.classes_}")
print(f"Number of classes: {num_labels}")

# Check label distribution
import pandas as pd
label_counts = pd.Series(labels).value_counts()
print(f"\nLabel distribution:\n{label_counts}")

[INFO] Encoding labels ...
Total samples: 50000
Label classes: ['mixed' 'negative' 'neutral' 'positive' 'question']
Number of classes: 5

Label distribution:
neutral     14541
question    12282
positive    12149
mixed        6608
negative     4420
Name: count, dtype: int64


## 5. Load Tokenizer and Prepare Data

In [5]:
MODEL_NAME = "xlm-roberta-base"  # Multilingual XLM-RoBERTa (supports Thai)
MAX_LENGTH = 128

print(f"[INFO] Loading tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

print("[INFO] Tokenizing texts ...")
# Fix: Use return_tensors='np' and ensure consistent padding
encodings = tokenizer(
    texts, 
    truncation=True, 
    padding='max_length',  # Changed from padding=True to padding='max_length'
    max_length=MAX_LENGTH,
    return_tensors='np'  # Return numpy arrays for consistent shapes
)
print("[INFO] Tokenization complete.")

print(f"Input IDs shape: {encodings['input_ids'].shape}")
print(f"Attention mask shape: {encodings['attention_mask'].shape}")
print(f"Sample tokens (first 10): {encodings['input_ids'][0][:10]}")

# Verify all sequences have the same length
print(f"All sequences length {MAX_LENGTH}: {all(len(seq) == MAX_LENGTH for seq in encodings['input_ids'])}")

[INFO] Loading tokenizer: xlm-roberta-base
[INFO] Tokenizing texts ...
[INFO] Tokenizing texts ...
[INFO] Tokenization complete.
Input IDs shape: (50000, 128)
Attention mask shape: (50000, 128)
Sample tokens (first 10): [     0  27704  40143 189865 216248   5407  23560   9250   9250  14990]
All sequences length 128: True
[INFO] Tokenization complete.
Input IDs shape: (50000, 128)
Attention mask shape: (50000, 128)
Sample tokens (first 10): [     0  27704  40143 189865 216248   5407  23560   9250   9250  14990]
All sequences length 128: True


## 6. Create TensorFlow Dataset

In [6]:
# Convert to tf.data.Dataset using numpy arrays (more reliable)
print("[INFO] Preparing tf.data.Dataset ...")

# Convert to tensorflow tensors directly
input_ids = tf.constant(encodings['input_ids'], dtype=tf.int32)
attention_mask = tf.constant(encodings['attention_mask'], dtype=tf.int32)
labels_tensor = tf.constant(labels_encoded, dtype=tf.int64)

print(f"Input IDs tensor shape: {input_ids.shape}")
print(f"Attention mask tensor shape: {attention_mask.shape}")
print(f"Labels tensor shape: {labels_tensor.shape}")

# Create dataset from tensors
dataset = tf.data.Dataset.from_tensor_slices({
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'labels': labels_tensor
})

# Split into train/validation
BATCH_SIZE = 16
dataset_size = len(texts)
train_size = int(0.8 * dataset_size)

# Shuffle and split
dataset = dataset.shuffle(buffer_size=10000, seed=42)
train_ds = dataset.take(train_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_ds = dataset.skip(train_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

print(f"[INFO] Dataset ready for training. Train size: {train_size}, Val size: {dataset_size - train_size}")

# Test the dataset structure
for batch in train_ds.take(1):
    print(f"Batch input_ids shape: {batch['input_ids'].shape}")
    print(f"Batch attention_mask shape: {batch['attention_mask'].shape}")
    print(f"Batch labels shape: {batch['labels'].shape}")
    break

[INFO] Preparing tf.data.Dataset ...
Input IDs tensor shape: (50000, 128)
Attention mask tensor shape: (50000, 128)
Labels tensor shape: (50000,)
[INFO] Dataset ready for training. Train size: 40000, Val size: 10000
Batch input_ids shape: (16, 128)
Batch attention_mask shape: (16, 128)
Batch labels shape: (16,)
[INFO] Dataset ready for training. Train size: 40000, Val size: 10000
Batch input_ids shape: (16, 128)
Batch attention_mask shape: (16, 128)
Batch labels shape: (16,)


## 7. Load and Compile Model

In [7]:
print("[INFO] Loading pre-trained model ...")
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

print("[INFO] Model loaded and compiled.")
print(f"Model summary:")
model.summary()

[INFO] Loading pre-trained model ...






TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You shou

[INFO] Model loaded and compiled.
Model summary:
Model: "tfxlm_roberta_for_sequence_classification"
_________________________________________________________________
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFXLMRobertaMainL  multiple                  277453056 
 ayer)                                                           
                                                                 
 classifier (TFXLMRobertaCl  multiple                  594437    
 assificationHead)                                               
                                                                 
 Layer (type)                Output Shape              Param #   
 roberta (TFXLMRobertaMainL  multiple                  277453056 
 ayer)                                                           
                                                                 
 classifier (TFXLMRobertaCl  multiple     

## 8. Train Model

In [None]:
print("[INFO] Training ...")

# Add callbacks for better monitoring
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(patience=1, factor=0.5)
]

# Prepare the datasets for training (extract labels from the dataset)
def extract_labels(batch):
    return {'input_ids': batch['input_ids'], 'attention_mask': batch['attention_mask']}, batch['labels']

train_ds_formatted = train_ds.map(extract_labels)
val_ds_formatted = val_ds.map(extract_labels)

# Train the model
EPOCHS = 3  # Increase if you have more time/resources
history = model.fit(
    train_ds_formatted,
    validation_data=val_ds_formatted,
    epochs=EPOCHS,
    callbacks=callbacks,
    verbose=1
)

print("[INFO] Training complete.")

[INFO] Training ...
Epoch 1/3
Epoch 1/3




  11/2500 [..............................] - ETA: 8:43:35 - loss: 1.6224 - accuracy: 0.1648

## 9. Visualize Training History

In [None]:
import matplotlib.pyplot as plt

# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot accuracy
ax1.plot(history.history['accuracy'], label='Train Accuracy')
ax1.plot(history.history['val_accuracy'], label='Val Accuracy')
ax1.set_title('Model Accuracy')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend()

# Plot loss
ax2.plot(history.history['loss'], label='Train Loss')
ax2.plot(history.history['val_loss'], label='Val Loss')
ax2.set_title('Model Loss')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend()

plt.tight_layout()
plt.show()

## 10. Test Model with Sample Texts

In [None]:
# Test the model with sample text
def predict_sentiment(text, model, tokenizer, label_encoder, max_len=128):
    inputs = tokenizer(text, truncation=True, padding='max_length', max_length=max_len, return_tensors='tf')
    logits = model(inputs)[0]
    probs = tf.nn.softmax(logits, axis=1).numpy()[0]
    pred_idx = np.argmax(probs)
    pred_label = label_encoder.classes_[pred_idx]
    confidence = probs[pred_idx]
    top_indices = np.argsort(probs)[::-1][:3]
    top_labels = [label_encoder.classes_[i] for i in top_indices]
    top_scores = [probs[i] for i in top_indices]
    return pred_label, confidence, top_labels, top_scores

test_texts = [
    "ร้านนี้อร่อยมาก ชอบมากเลย",  # Should be positive
    "แย่มาก บริการแย่ อาหารไม่อร่อย",  # Should be negative
    "ปกติ ไม่ดีไม่แย่",  # Should be neutral
    "ดีใจมาก ประทับใจ",  # Should be positive
    "เศร้า ผิดหวัง",  # Should be negative
    "ไม่รู้จะตอบอะไร",  # Should be question/mixed
    "หนังเรื่องนี้เป็นยังไง มีคนดูแล้วมั้ย"  # Should be question
]

print("\n=== Testing Transformer Model ===")
for i, text in enumerate(test_texts):
    predicted_label, confidence, top_labels, top_scores = predict_sentiment(text, model, tokenizer, le)
    print(f"\nTest {i+1}: {text}")
    print(f"Predicted sentiment: {predicted_label} (confidence: {confidence:.4f})")
    print(f"Top 3 predictions:")
    for label, score in zip(top_labels, top_scores):
        print(f"  {label}: {score:.4f}")

## 11. Save Model and Artifacts

In [None]:
print("[INFO] Saving model and tokenizer ...")

# Save model and tokenizer
model.save_pretrained('/content/thai_text_transformer_model')
tokenizer.save_pretrained('/content/thai_text_transformer_model')

# Save label encoder
with open('/content/label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

print("[INFO] Model, tokenizer, and label encoder saved successfully!")
print("\nFiles saved:")
print("- /content/thai_text_transformer_model/ (model and tokenizer)")
print("- /content/label_encoder.pkl (label encoder)")

## 12. Download Files (Optional)

In [None]:
# Create zip file for easy download
import shutil
from google.colab import files

print("Creating zip file for download...")
shutil.make_archive('/content/thai_sentiment_model', 'zip', '/content/', 'thai_text_transformer_model')

# Download files
print("Downloading model files...")
files.download('/content/thai_sentiment_model.zip')
files.download('/content/label_encoder.pkl')

print("Download complete!")

## 13. Load and Test Saved Model (Optional)

In [None]:
# Test loading the saved model
print("Testing saved model...")

# Load saved model
loaded_model = TFAutoModelForSequenceClassification.from_pretrained('/content/thai_text_transformer_model')
loaded_tokenizer = AutoTokenizer.from_pretrained('/content/thai_text_transformer_model')
with open('/content/label_encoder.pkl', 'rb') as f:
    loaded_le = pickle.load(f)

# Test with a sample
test_text = "ขอบคุณมากครับ บริการดีมาก"
pred_label, confidence, top_labels, top_scores = predict_sentiment(test_text, loaded_model, loaded_tokenizer, loaded_le)

print(f"\nTest text: {test_text}")
print(f"Predicted: {pred_label} (confidence: {confidence:.4f})")
print("Model loaded and working correctly!")

---

## Summary

This notebook successfully:
1. ✅ Loaded the Wisesight-Sentiment-Thai dataset
2. ✅ Fine-tuned XLM-RoBERTa for Thai sentiment classification
3. ✅ Trained the model with validation monitoring
4. ✅ Tested the model with sample Thai texts
5. ✅ Saved the model, tokenizer, and label encoder

**Model Performance:**
- Labels: mixed, negative, neutral, positive, question
- Architecture: XLM-RoBERTa (multilingual BERT)
- Max Length: 128 tokens
- Can be uploaded to Hugging Face Hub for public use

**Next Steps:**
- Upload to Hugging Face Hub
- Deploy as API or web service
- Fine-tune with domain-specific data
- Compare with other models (Thai-specific BERT, etc.)