## SetFit Model Training 🌟 🧮
#### **You do not need to run this notebook** - The model has already been trained and saved in the folder: `setfit_model` ✨

The SetFit notebook is designed for efficient model training using a sample size of 10,000. 🔬

#### **Requirements:**
- For optimal performance, it is **highly recommended** to run this notebook in **Google Colab with a T4 GPU.** 🚀
- NOTE: You likely won't be able to run this notebook locally (on your own computer) with 10,000 samples unless you have plenty of memory available.

**Estimated Runtime (Colab T4 GPU)**: Approximately 22 minutes for processing 10,000 samples. ⏱️ 💻

### Model Training Process

In this notebook, we use the SetFit library to train a Bot vs. Human text classifier. The process includes:

1. **Data Preparation**: Convert the pre-labeled Twitter dataset (Bot vs. Human) into a Hugging Face `Dataset` and split it 80/20 into training and test sets.
2. **Model Loading**: Load the pre-trained sentence-transformer `sentence-transformers/all-MiniLM-L6-v2` as the SetFit model body.
3. **Trainer Setup**: Configure the `SetFitTrainer` with the training and evaluation datasets and set parameters such as batch size (8) and number of iterations (10).
4. **Model Training**: Train the model on a sample of `TwitterData_Joined.csv`, using the `Tweet_text` column as the feature and `Label` as the target:
   - Bot = 0
   - Human = 1
5. **Model Evaluation**: Evaluate the model's performance on the test set.
6. **Model Saving**: Save the trained model as `setfit_model` for future use.
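
Because the trained model is already saved in `setfit_model`, it can be loaded for inference without re-running the training cells. The following is a minimal sketch, assuming the `setfit` package is installed and the `setfit_model` folder sits in the working directory. The helper names `to_label_names` and `classify_tweets` are hypothetical and are not part of this notebook.

```python
# Label encoding used in this notebook: Bot = 0, Human = 1
ID2LABEL = {0: "Bot", 1: "Human"}


def to_label_names(predictions):
    """Map numeric class ids (ints, NumPy ints, or 0-d tensors) to label names."""
    return [ID2LABEL[int(p)] for p in predictions]


def classify_tweets(texts, model_dir="setfit_model"):
    """Load the saved SetFit model and classify a list of tweet texts."""
    # Imported lazily so the label helper above works without setfit installed
    from setfit import SetFitModel

    model = SetFitModel.from_pretrained(model_dir)
    return to_label_names(model.predict(texts))


# Usage (requires setfit and the saved model folder):
# classify_tweets(["win a free iphone now", "had a great lunch with my sister"])
```

For comparable results, new texts should probably go through the same `clean_text` preprocessing that is applied to the training data.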

## Install Requirements 🎛️

In [None]:
!pip install -r requirements.txt -q

## Importing Libraries 🔌

In [2]:
# Data handling
import os
import re
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tqdm

# Disable Weights & Biases (W&B) logging (remove this line if you want training metrics logged to W&B)
os.environ['WANDB_DISABLED'] = 'true'

### SetFit Data Import 🔬 & Data Cleaning 🧹 (default sample size: 10,000)

In [3]:
# DATA IMPORT & SAMPLING

## You can change n=10000 to a smaller number for faster training

data = pd.read_csv('TwitterData_Joined.csv')
data = data.sample(n=10000, random_state=19)

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)  # escape the dot so "www." matches literally
    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)
    # Remove the '#' symbol (but keep the hashtag text)
    text = re.sub(r'#', '', text)
    # Remove emojis and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

# Data Cleaning
data = data.dropna(subset=['Tweet_text'])  # Drop missing tweets first: clean_text fails on NaN
data['Tweet_text'] = data['Tweet_text'].apply(clean_text)  # Custom cleaning function
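
To see what the cleaning step does to a single tweet, here is a small standalone sketch that mirrors the substitutions above on a made-up example (the sample tweet is invented for illustration). Removing tokens leaves extra whitespace behind, so the last step collapses it before printing; the notebook itself does not do this.

```python
import re


def clean_text(text):
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)              # mentions
    text = re.sub(r'#', '', text)                 # '#' symbol only, hashtag text stays
    text = re.sub(r'[^\w\s]', '', text)           # punctuation, emojis, other symbols
    return text.lower()


sample = "Check this out @user123 https://t.co/abc #AI is great!!! 😀"
cleaned = clean_text(sample)

# Removed tokens leave gaps, so collapse runs of whitespace for display
normalized = " ".join(cleaned.split())
print(normalized)  # check this out ai is great
```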

In [None]:
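# Note: SetFitTrainer is the pre-1.0 SetFit API; setfit >= 1.0 deprecates it in favor of Trainer + TrainingArguments
from setfit import SetFitModel, SetFitTrainer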
from datasets import Dataset

# convert pandas dataframe to hugging face dataset
dataset = Dataset.from_pandas(data)

# split the dataset into training and test
dataset = dataset.train_test_split(test_size=0.2, seed=42)

train_ds = dataset["train"]
test_ds = dataset["test"]

# load a pre-trained setfit model
model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    batch_size=8,
    num_iterations=10,
    num_epochs=1,
    column_mapping={"Tweet_text": "text", "Label": "label"}
)

# train the model
trainer.train()

# evaluate the model on the test set
metrics = trainer.evaluate()
print(metrics)  # the default metric is accuracy

model.save_pretrained("setfit_model")
print("model saved!")
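
The runtime is driven largely by how many contrastive sentence pairs the trainer builds. Below is a rough back-of-the-envelope sketch, assuming the pair-generation behavior of the older `SetFitTrainer` API, where each iteration creates one positive and one negative pair per training sample. That is an assumption about the installed library version, and it also assumes no rows are dropped during cleaning.

```python
import math

# Values taken from the cells above
n_samples = 10_000
test_size = 0.2
num_iterations = 10
batch_size = 8
num_epochs = 1

n_test = round(n_samples * test_size)      # 2,000 tweets held out for evaluation
n_train = n_samples - n_test               # 8,000 tweets used for training

# Assumption: 1 positive + 1 negative pair per sample per iteration
n_pairs = 2 * num_iterations * n_train
steps = math.ceil(n_pairs / batch_size) * num_epochs

print(n_train, n_pairs, steps)  # 8000 160000 20000
```

Under these assumptions, lowering `n` in the sampling cell or `num_iterations` in the trainer reduces the number of optimization steps proportionally.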


In [None]:
# show metrics
print(metrics)

{'accuracy': 0.862}
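
The evaluation above reports accuracy only. If per-class quality matters (for example, how many bots are missed), precision, recall, and F1 can be computed from the predictions. The following is a dependency-free sketch on made-up labels; the function name `precision_recall_f1` and the toy data are illustrative only. In practice `y_pred` would come from the model's predictions on the test texts.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration (Bot = 0, Human = 1), not real model output
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Passing `positive=0` gives the same metrics with Bot as the class of interest.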
