
Adigo10/Social-AI-Detector


Detection of AI-Generated Social Media Text

NTU AI6130 (Large Language Models) — Group G30

A system that classifies whether a social media post was written by a human or generated by AI, combining retrieval-augmented generation (RAG) with fine-tuned LLMs.

Team Structure

| Team  | Responsibility                        | Environment              |
|-------|---------------------------------------|--------------------------|
| Alpha | Data preparation & retrieval pipeline | Local machine (no GPU)   |
| Beta  | Model training & experiments          | NTU CCDS TC2 GPU Cluster |

Repository Layout

src/data_pipeline/       # Team Alpha: data preparation pipeline (this README)
src/training/            # Team Beta: model training (added later)
src/eval/                # Team Beta: evaluation (added later)
data/raw/                # Downloaded datasets (git-ignored)
data/processed/          # Pipeline outputs (git-ignored)
scripts/                 # SLURM job scripts (Team Beta)
configs/                 # Training configs (Team Beta)

Data Pipeline (Team Alpha)

Prerequisites

  • Python 3 with venv and pip
  • A Gemini API key (used by the embedding step)

Setup

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

Running the Pipeline

Run each step sequentially from the project root directory:

Step 1: Download Datasets

python src/data_pipeline/download.py

Downloads three datasets to data/raw/:

  • MultiSocial (~472K texts) from Zenodo — social media posts across 5 platforms, 7 AI models
  • HC3 (~37K QA pairs) from HuggingFace — human vs ChatGPT answers
  • RAID test set from raid-bench.xyz — adversarial evaluation data

Expected time: 10–30 min depending on connection speed.

Step 2: Preprocess & Unify

python src/data_pipeline/preprocess.py

Cleans and merges all datasets into a single file:

  • Output: data/processed/corpus.jsonl
  • Format: One JSON object per line with fields: id, text, label, source_model, platform, dataset
  • Prints statistics (record counts by label, dataset, platform)
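
To make the format concrete, a single line of corpus.jsonl might look like this (all field values below are hypothetical, not taken from the actual data):

```json
{"id": "multisocial-000123", "text": "just tried the new ramen place downtown, totally worth the queue", "label": "human", "source_model": null, "platform": "twitter", "dataset": "multisocial"}
```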

Step 3: Generate Embeddings

python src/data_pipeline/embed.py

Converts each text to a 768-dimensional vector using the Gemini Embedding 2 API.

  • Output: data/processed/embeddings.npy (float32, N × 768)
  • Checkpoint: Progress saved every 10,000 texts. If interrupted, rerun the same command to resume.
  • Expected time: ~30–60 min for 500K texts (depends on API rate limits)
  • Cost: ~$5–10 for the full corpus
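
The checkpoint-and-resume behaviour can be sketched as follows. This is a minimal sketch, not the real embed.py: `embed_batch` is a placeholder for the actual Gemini API call, and the batch size and constant names are illustrative.

```python
import numpy as np
from pathlib import Path

CHECKPOINT = Path("data/processed/embeddings_checkpoint.npz")
BATCH = 100          # texts per API call (illustrative)
SAVE_EVERY = 10_000  # checkpoint interval, as described above

def embed_batch(texts):
    """Placeholder for the Gemini embedding call: one 768-d vector per text."""
    return np.zeros((len(texts), 768), dtype=np.float32)

def embed_corpus(texts):
    vectors, done = [], 0
    if CHECKPOINT.exists():                       # resume from last checkpoint
        saved = np.load(CHECKPOINT)["vectors"]
        vectors.append(saved)
        done = len(saved)
    for start in range(done, len(texts), BATCH):
        vectors.append(embed_batch(texts[start:start + BATCH]))
        if (start + BATCH) % SAVE_EVERY < BATCH:  # periodic checkpoint
            CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
            np.savez(CHECKPOINT, vectors=np.concatenate(vectors))
    return np.concatenate(vectors)
```

Because progress is appended to the checkpoint file, rerunning the script after an interruption only re-embeds texts beyond the last saved batch.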

Step 4: Build FAISS Index

python src/data_pipeline/build_index.py

Builds a cosine similarity search index from the embeddings.

  • Output: data/processed/corpus.index
  • Runs a sanity check (first vector should find itself with similarity ~1.0)
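
The index relies on the fact that an inner product between L2-normalized vectors equals their cosine similarity. The sanity check above reduces to the following numpy-only illustration (this mirrors what a FAISS inner-product index computes; it is not the build_index.py code itself):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 768)).astype(np.float32)

# L2-normalize rows so that dot product = cosine similarity.
emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Query with the first vector: it should retrieve itself with similarity ~1.0.
scores = emb_n @ emb_n[0]
best = int(np.argmax(scores))
```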

Step 5: Generate Training Data

python src/data_pipeline/build_training_data.py

Creates two versions of instruction-tuning data:

  • data/processed/train_with_rag.jsonl — includes 5 similar retrieved texts as context
  • data/processed/train_without_rag.jsonl — target text only, no context

Both files have instruction and output fields, ready for LLM fine-tuning.
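
Building one record for each variant can be sketched like this. The prompt wording and function name are illustrative assumptions, not the script's actual text:

```python
import json

def make_record(text, label, neighbors=None):
    """Build one instruction-tuning record; pass neighbors for the RAG variant."""
    instruction = ("Decide whether the following social media post was "
                   "written by a human or generated by AI.\n")
    if neighbors:  # RAG variant: prepend the retrieved similar texts as context
        context = "\n".join(f"- {t}" for t in neighbors)
        instruction += f"Similar posts from the corpus:\n{context}\n"
    instruction += f"Post: {text}"
    return {"instruction": instruction, "output": label}

# One line of train_with_rag.jsonl vs. train_without_rag.jsonl:
with_rag = json.dumps(make_record("free crypto, click now!!", "ai",
                                  neighbors=["similar post 1", "similar post 2"]))
without_rag = json.dumps(make_record("free crypto, click now!!", "ai"))
```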

Expected Output Sizes

| File                    | Approximate Size               |
|-------------------------|--------------------------------|
| corpus.jsonl            | ~500–800 MB                    |
| embeddings.npy          | ~1.5 GB (500K × 768 × 4 bytes) |
| corpus.index            | ~1.5 GB                        |
| train_with_rag.jsonl    | ~2–4 GB                        |
| train_without_rag.jsonl | ~500–800 MB                    |

Total disk space needed: ~8–12 GB
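
The embeddings.npy figure follows directly from the array shape and dtype:

```python
# 500K texts × 768 dimensions × 4 bytes per float32
n_texts, dim, bytes_per_float32 = 500_000, 768, 4
size_gb = n_texts * dim * bytes_per_float32 / 1e9  # ≈ 1.54 GB
```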


Troubleshooting

Gemini API rate limits

The embedding script sleeps 0.3s between batches and retries on errors with a 60s backoff. If you still hit rate limits, increase SLEEP_BETWEEN_CALLS in embed.py.
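
The throttle-and-retry behaviour described above amounts to roughly the following (a sketch; the actual constants and call sites in embed.py may differ):

```python
import time

SLEEP_BETWEEN_CALLS = 0.3  # seconds between batches; raise this if rate-limited
RETRY_BACKOFF = 60         # seconds to wait after an API error

def call_with_retry(fn, *args, max_retries=5):
    """Call fn, sleeping RETRY_BACKOFF seconds after each failure."""
    for attempt in range(max_retries):
        try:
            result = fn(*args)
            time.sleep(SLEEP_BETWEEN_CALLS)  # throttle even on success
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(RETRY_BACKOFF)
```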

Checkpoint recovery

If embed.py crashes or you stop it, just rerun it — it automatically resumes from the last checkpoint (data/processed/embeddings_checkpoint.npz).

Disk space

The full pipeline needs ~12 GB of free disk space. Check with du -sh data/ periodically.

Missing datasets

Each download step skips files that already exist. To re-download, delete the specific file from data/raw/ and rerun download.py.
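
The skip-if-present behaviour can be implemented with a small helper like this (an illustrative sketch, not the script's actual code):

```python
from pathlib import Path
import urllib.request

def download_if_missing(url: str, dest: Path) -> bool:
    """Fetch url into dest unless the file already exists; True if fetched."""
    if dest.exists():
        print(f"skipping {dest.name}: already downloaded")
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # To force a re-download, delete dest and rerun.
    urllib.request.urlretrieve(url, dest)
    return True
```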


Training & Experiments

Training code and experiment configurations will be added by Team Beta under src/training/, src/eval/, scripts/, and configs/.
