# TSDAE Fine-tuning for Vietnamese Social Media & News

This notebook performs unsupervised fine-tuning of the embedding model using TSDAE (Transformer-based Sequential Denoising Auto-Encoder). TSDAE corrupts each input sentence and trains the model to reconstruct the original from its embedding, so the model learns the language patterns of Vietnamese social media and news without manual labels.
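To make the denoising idea concrete, here is a minimal sketch of the kind of corruption TSDAE applies: random token deletion. It assumes whitespace tokenization and a deletion ratio of 0.6 (the default commonly used for TSDAE). The function name `delete_tokens` and the sample sentence are illustrative only, and the training script in this repository may tokenize differently.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Return a noisy copy of `sentence` with roughly `del_ratio` of its tokens removed.

    Assumption: tokens are split on whitespace. At least one token is always kept
    so the encoder never receives an empty input.
    """
    rng = rng or random
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Keep each token with probability (1 - del_ratio), preserving order.
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


# Illustrative sentence, not taken from the dataset.
original = "Giá xăng hôm nay tăng mạnh tại Hà Nội"
noisy = delete_tokens(original, rng=random.Random(0))
print("original:", original)
print("noisy:   ", noisy)
```

During training, the encoder sees the noisy sentence and the decoder is asked to reproduce the original, which pushes the sentence embedding to carry as much of the meaning as possible.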

In [None]:
# 1. Setup Environment
!pip install -q sentence-transformers rich pandas pynvml

import os
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")

In [None]:
# 2. Clone Repository
# Read the access token from the environment rather than hard-coding it in the notebook
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/GadGadGad/Real-time-Event-Detection-on-Social-Media-Data"

!git clone {REPO_URL}
%cd Real-time-Event-Detection-on-Social-Media-Data

# Ensure data directory exists
!mkdir -p data models

### 3. Data Extraction
We extract raw text from the Kaggle dataset inputs.
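The extraction script lives in the repository and is not reproduced here. As a rough illustration only, assuming the training file holds one text per line, a cleanup pass over raw posts might look like the sketch below. The function name `clean_corpus`, the `min_chars` threshold, and the sample lines are hypothetical and are not taken from `extract_train_data.py`.

```python
import re


def clean_corpus(lines, min_chars=20):
    """Normalize whitespace, drop very short texts, and remove duplicates.

    Assumption: each element of `lines` is one post or news sentence.
    Duplicates are detected case-insensitively; the first occurrence is kept.
    """
    seen = set()
    cleaned = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()
        if len(text) < min_chars:
            continue
        key = text.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


# Illustrative input, not real dataset rows.
sample = [
    "Mưa lớn gây ngập nhiều tuyến đường   ở TP.HCM",
    "mưa lớn gây ngập nhiều tuyến đường ở TP.HCM",
    "ok",
]
for text in clean_corpus(sample):
    print(text)
```

Removing near-empty and repeated lines matters for TSDAE because every line becomes a reconstruction target, so duplicated boilerplate would be over-weighted during training.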

In [None]:
# Path to Kaggle datasets
KAGGLE_INPUTS = [
    "/kaggle/input/se363-final-dataset/facebook",
    "/kaggle/input/se363-final-dataset/news"
]

# Run extraction script
!python scripts/train/extract_train_data.py --inputs {' '.join(KAGGLE_INPUTS)} --output data/train_tsdae.txt

# Check sample
!head -n 5 data/train_tsdae.txt

### 4. Run TSDAE Training
This step fine-tunes the `paraphrase-multilingual-mpnet-base-v2` model on the extracted text.
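For orientation, the sketch below shows how a TSDAE run is typically wired up with the `sentence-transformers` 2.x API (`DenoisingAutoEncoderDataset` and `DenoisingAutoEncoderLoss`). It is not the contents of `train_tsdae.py`. The function names, the Hub ID prefix `sentence-transformers/`, CLS pooling, the learning rate, and the epoch count are assumptions based on the library's documented TSDAE recipe, and newer library releases may expose a different training interface. The dataset class also needs `nltk` with its tokenizer data installed.

```python
def read_corpus(path):
    """Read one training text per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def train_tsdae(
    sentences,
    model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",  # assumed Hub ID
    output_dir="models/tuned-mpnet-vietnamese-v1",
    batch_size=8,
    epochs=1,
):
    """Fine-tune `model_name` with TSDAE on a list of unlabeled sentences."""
    # Heavy imports are local so read_corpus stays usable without the GPU stack.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, datasets, losses, models

    # Encoder with CLS pooling, following the library's TSDAE recipe (an assumption here).
    word_embedding = models.Transformer(model_name)
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
    model = SentenceTransformer(modules=[word_embedding, pooling])

    # The dataset adds deletion noise on the fly; the loss attaches a tied decoder.
    train_data = datasets.DenoisingAutoEncoderDataset(sentences)
    loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    loss = losses.DenoisingAutoEncoderLoss(
        model, decoder_name_or_path=model_name, tie_encoder_decoder=True
    )

    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        weight_decay=0,
        scheduler="constantlr",
        optimizer_params={"lr": 3e-5},
        show_progress_bar=True,
    )
    model.save(output_dir)
    return model


# Example usage (requires a GPU runtime and the extracted corpus):
# model = train_tsdae(read_corpus("data/train_tsdae.txt"), batch_size=8)
```

Lowering `batch_size` is the first thing to try if the run exhausts GPU memory, since the decoder roughly doubles the memory needed compared with encoding alone.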

In [None]:
# No batch size is passed here, so the script's default applies; lower batch_size
# if you run out of VRAM (8-16 is usually safe on a T4 GPU)
!python scripts/train/train_tsdae.py

print("✅ Fine-tuning complete!")

### 5. Export Model
Zip the model for easy download.
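If the `zip` command is not available in the runtime, the same archive can be produced from Python. The sketch below uses only the standard library. The helper name `zip_model` is hypothetical, and unlike `zip -r` on the full path, it stores the model folder at the root of the archive (without the leading `models/` component).

```python
import shutil
import zipfile
from pathlib import Path


def zip_model(model_dir, archive_stem="tuned_model"):
    """Zip `model_dir` into `<archive_stem>.zip` and return (archive_path, entry_names)."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    # root_dir/base_dir make the archive contain "<model folder name>/..." entries.
    archive_path = shutil.make_archive(
        str(archive_stem), "zip", root_dir=model_dir.parent, base_dir=model_dir.name
    )
    # List the archive contents as a quick integrity check.
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
    return archive_path, names


# Example usage after training has finished:
# path, names = zip_model("models/tuned-mpnet-vietnamese-v1")
# print(path, len(names), "entries")
```

Listing the entries before downloading is a cheap way to confirm that the configuration and weight files actually made it into the archive.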

In [None]:
!zip -r tuned_model.zip models/tuned-mpnet-vietnamese-v1
print("Model zipped! You can download 'tuned_model.zip' from the output files.")