NTU AI6130 (Large Language Models) — Group G30
A system that classifies whether a social media post was written by a human or generated by AI, combining retrieval-augmented generation (RAG) with fine-tuned LLMs.
| Team | Responsibility | Environment |
|---|---|---|
| Alpha | Data preparation & retrieval pipeline | Local machine (no GPU) |
| Beta | Model training & experiments | NTU CCDS TC2 GPU Cluster |
```
src/data_pipeline/   # Team Alpha: data preparation pipeline (this README)
src/training/        # Team Beta: model training (added later)
src/eval/            # Team Beta: evaluation (added later)
data/raw/            # Downloaded datasets (git-ignored)
data/processed/      # Pipeline outputs (git-ignored)
scripts/             # SLURM job scripts (Team Beta)
configs/             # Training configs (Team Beta)
```
- Python 3.11
- A Gemini API key (free tier works)
```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY
```
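As a rough sketch of how a pipeline script can pick up the key from `.env`: this assumes the `python-dotenv` package is in `requirements.txt`; the actual scripts may load configuration differently.

```python
# Sketch: reading GEMINI_API_KEY from .env (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY is not set; copy .env.example to .env first.")
```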
Run each step sequentially from the project root directory.

```bash
python src/data_pipeline/download.py
```

Downloads three datasets to `data/raw/`:
- MultiSocial (~472K texts) from Zenodo — social media posts across 5 platforms, 7 AI models
- HC3 (~37K QA pairs) from HuggingFace — human vs ChatGPT answers
- RAID test set from raid-bench.xyz — adversarial evaluation data
Expected time: 10–30 min depending on connection speed.
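The Zenodo and raid-bench.xyz files are plain downloads, while HC3 comes from HuggingFace (e.g. via the `datasets` library). A minimal sketch of the skip-if-exists download pattern is below; the URL and filename are placeholders, not the actual links used by `download.py`.

```python
# Sketch of the skip-if-exists download pattern (placeholder URL/filename).
from pathlib import Path
import requests

RAW_DIR = Path("data/raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

def download_file(url: str, dest: Path, chunk_size: int = 1 << 20) -> None:
    """Stream a file to disk, skipping it if it was downloaded already."""
    if dest.exists():
        print(f"skip {dest.name} (already present)")
        return
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size):
                fh.write(chunk)

# Example call (placeholder URL):
# download_file("https://zenodo.org/record/.../multisocial.zip", RAW_DIR / "multisocial.zip")
```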
```bash
python src/data_pipeline/preprocess.py
```

Cleans and merges all datasets into a single file:

- Output: `data/processed/corpus.jsonl`
- Format: one JSON object per line with fields `id`, `text`, `label`, `source_model`, `platform`, `dataset`
- Prints statistics (record counts by label, dataset, and platform)
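For illustration, here is what one `corpus.jsonl` record might look like and how to tally label counts; the field names match the list above, but the example values (and the exact label strings) are assumptions.

```python
# Sketch: one corpus.jsonl record plus a quick per-dataset label count.
import json
from collections import Counter

example = {
    "id": "multisocial-000001",
    "text": "Just tried the new ramen place downtown, 10/10 would queue again.",
    "label": "human",        # illustrative; e.g. "human" or "ai"
    "source_model": None,     # e.g. the generator name for AI-written texts
    "platform": "twitter",
    "dataset": "multisocial",
}

counts = Counter()
with open("data/processed/corpus.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        counts[(record["dataset"], record["label"])] += 1
print(counts.most_common(10))
```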
```bash
python src/data_pipeline/embed.py
```

Converts each text into a 768-dimensional vector using the Gemini embedding API:

- Output: `data/processed/embeddings.npy` (float32, N × 768)
- Checkpoint: progress is saved every 10,000 texts. If interrupted, rerun the same command to resume.
- Expected time: ~30–60 min for 500K texts (depends on API rate limits)
- Cost: ~$5–10 for the full corpus
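A minimal sketch of the batch-plus-checkpoint loop is below. It assumes the `google-generativeai` client and the `text-embedding-004` model with a batch size of 100; the model name and batch size in `embed.py` may differ. The checkpoint path matches the one mentioned under Troubleshooting; rate-limit sleeps and retries are omitted here (see that section).

```python
# Sketch of embedding with periodic checkpoints (assumed client/model names).
import json
import os
import numpy as np
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
CHECKPOINT = "data/processed/embeddings_checkpoint.npz"
BATCH_SIZE = 100

texts = [json.loads(l)["text"] for l in open("data/processed/corpus.jsonl", encoding="utf-8")]

try:                                        # resume from the last checkpoint if present
    done = list(np.load(CHECKPOINT)["embeddings"])
except FileNotFoundError:
    done = []

for start in range(len(done), len(texts), BATCH_SIZE):
    batch = texts[start:start + BATCH_SIZE]
    result = genai.embed_content(model="models/text-embedding-004", content=batch)
    done.extend(result["embedding"])        # one 768-d vector per input text
    if (start // BATCH_SIZE) % 100 == 0:    # checkpoint every 100 batches = 10,000 texts
        np.savez(CHECKPOINT, embeddings=np.asarray(done, dtype=np.float32))

np.save("data/processed/embeddings.npy", np.asarray(done, dtype=np.float32))
```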
```bash
python src/data_pipeline/build_index.py
```

Builds a cosine-similarity search index from the embeddings:

- Output: `data/processed/corpus.index`
- Runs a sanity check (the first vector should retrieve itself with similarity ~1.0)
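The README does not name the index library; the sketch below assumes FAISS, using an exact inner-product index over L2-normalized vectors (which equals cosine similarity), followed by the self-retrieval sanity check described above.

```python
# Sketch of a cosine-similarity index (FAISS is an assumption, not confirmed).
import faiss
import numpy as np

emb = np.load("data/processed/embeddings.npy").astype(np.float32)
faiss.normalize_L2(emb)                    # in-place row normalization

index = faiss.IndexFlatIP(emb.shape[1])    # exact inner-product search
index.add(emb)
faiss.write_index(index, "data/processed/corpus.index")

# Sanity check: the first vector should retrieve itself with similarity ~1.0.
scores, ids = index.search(emb[:1], 5)
assert ids[0][0] == 0 and scores[0][0] > 0.99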
```bash
python src/data_pipeline/build_training_data.py
```

Creates two versions of the instruction-tuning data:

- `data/processed/train_with_rag.jsonl` — includes 5 similar retrieved texts as context
- `data/processed/train_without_rag.jsonl` — target text only, no context

Both files contain `instruction` and `output` fields, ready for LLM fine-tuning.
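To make the two formats concrete, here is a sketch of one record from each file; the instruction wording and the way retrieved context is formatted are assumptions, and `build_training_data.py` may phrase them differently.

```python
# Sketch: illustrative records for the two training files (wording is assumed).
with_rag_example = {
    "instruction": (
        "Decide whether the TARGET post was written by a human or generated by AI. "
        "Similar posts retrieved from the corpus are given as context.\n\n"
        "CONTEXT:\n1. [human] ...\n2. [ai] ...\n3. [human] ...\n4. [ai] ...\n5. [human] ...\n\n"
        "TARGET:\nJust tried the new ramen place downtown, 10/10 would queue again."
    ),
    "output": "human",
}

without_rag_example = {
    "instruction": (
        "Decide whether the following post was written by a human or generated by AI.\n\n"
        "Just tried the new ramen place downtown, 10/10 would queue again."
    ),
    "output": "human",
}
```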
| File | Approximate Size |
|---|---|
| `corpus.jsonl` | ~500–800 MB |
| `embeddings.npy` | ~1.5 GB (500K × 768 × 4 bytes) |
| `corpus.index` | ~1.5 GB |
| `train_with_rag.jsonl` | ~2–4 GB |
| `train_without_rag.jsonl` | ~500–800 MB |
Total disk space needed: ~8–12 GB
The embedding script sleeps 0.3 s between batches and retries failed calls with a 60 s backoff. If you still hit rate limits, increase `SLEEP_BETWEEN_CALLS` in `embed.py`.
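The pattern looks roughly like the sketch below. Only `SLEEP_BETWEEN_CALLS` is named in this README; the other names and the retry count are illustrative.

```python
# Sketch of the rate-limit handling described above: a fixed pause between
# batches plus a 60 s backoff-and-retry on errors (retry count is illustrative).
import time

SLEEP_BETWEEN_CALLS = 0.3   # raise this if you keep hitting rate limits
RETRY_BACKOFF_SECONDS = 60

def embed_with_retry(embed_fn, batch, max_retries=5):
    for attempt in range(max_retries):
        try:
            result = embed_fn(batch)
            time.sleep(SLEEP_BETWEEN_CALLS)
            return result
        except Exception as err:
            print(f"embedding call failed ({err}); retrying in {RETRY_BACKOFF_SECONDS}s")
            time.sleep(RETRY_BACKOFF_SECONDS)
    raise RuntimeError("embedding batch failed after retries")
```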
If `embed.py` crashes or you stop it, just rerun it — it automatically resumes from the last checkpoint (`data/processed/embeddings_checkpoint.npz`).
The full pipeline needs ~12 GB of free disk space. Check usage with `du -sh data/` periodically.
Each download step skips files that already exist. To re-download, delete the specific file from `data/raw/` and rerun `download.py`.
Training code and experiment configurations will be added by Team Beta under `src/training/`, `src/eval/`, `scripts/`, and `configs/`.