# CLIP Feature Extraction
In this notebook, we extract **image embeddings** from users' profile pictures using **CLIP (ViT-B/32)**, and evaluate how much information profile images contain about whether an account is a bot.  
In addition, we compute an optional semantic alignment score between each user's **profile image** and their **bio+posts**.  
This notebook forms the vision branch of the final multimodal bot-detection system.

**What we do here:**

1. Loading the cleaned `users.csv` and `posts.csv`
These come from the TwiBot-22 preprocessing notebook and include:
   - Local filesystem paths to downloaded user profile images
   - User bios and aggregated posts (text)
   - Bot/human labels
   - A flag indicating whether a valid profile image exists

<br>

2. Loading the CLIP model and processor
   - We use `openai/clip-vit-base-patch32`.
   - Images are preprocessed into 224x224 tensors.
   - Both modalities are encoded into **512-dimensional** embeddings.

<br>

3. Computing the image-text similarity score
For *all* users (including those with missing images):
   1. Construct **full_text** = *bio* + *posts*
   2. Encode full_text using CLIP's **text encoder** (truncated to CLIP's 77-token limit)
   3. Encode profile images using the vision encoder, when available
   4. Compute cosine similarity between image and text embeddings
   5. Store the result in `clip_img_text_sim`
Users without an image receive `NaN`, later filled with `0.0` in fusion.
This score approximates whether *the profile image semantically matches the account's textual identity*.

<br>

4. Saving the results for the Multimodal Fusion notebook
We export `users_with_clip_sim.csv`, which contains:
   - id
   - full_text
   - image_exists
   - profile_image_path
   - label_num
   - clip_img_text_sim

<br>

5. Extracting pure image embeddings
For users that have a profile image:
   - Balance the human/bot classes and split users into train, validation, and test sets
   - Create a PyTorch dataloader
   - Feed the images through CLIP's vision encoder
   - Normalize and store the 512-dimensional feature vectors

<br>

6. Training a simple classifier
   - Fit a logistic regression on the image embeddings only

<br>

7. Evaluating image-only performance
   - Accuracy, macro F1, and AUROC on the test set.
This gives us an estimate of how useful profile pictures alone are for bot detection.

## Section 1: Imports and Configurations

In [None]:
#---------------------------------------------------------------------------------#
# HuggingFace Cache Location                                                      #
#---------------------------------------------------------------------------------#
# By default, HuggingFace downloads pretrained models into the user directory     #
# (e.g., ~/.cache/huggingface/). To make the project fully reproducible and       #
# avoid polluting the user's global cache, we redirect HF_HOME to a local         #
# folder inside the project.                                                      #
#                                                                                 #
# If you prefer a different cache directory, simply modify HF_CACHE below.        #
# If the folder does not exist yet, HuggingFace will create it automatically.     #
#---------------------------------------------------------------------------------#
import os
from pathlib import Path

ROOT = Path.cwd().parent.resolve()
HF_CACHE = ROOT / "hf_cache"
os.environ["HF_HOME"] = str(HF_CACHE)

import random
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from PIL import Image

import torch
from torch.utils.data import Dataset, DataLoader

from transformers import CLIPProcessor, CLIPModel

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

# --- Paths and Configurations --- #
TWIBOT_PATH = ROOT / "data/twibot22"
USERS_PATH  = TWIBOT_PATH / "processed/users.csv"
POSTS_PATH  = TWIBOT_PATH / "processed/posts.csv"
OUTPUT_DIR  = ROOT / "outputs/clip_twibot22"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

SEED = 42
BATCH_SIZE = 32

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print(f"Using cuda on {torch.cuda.get_device_name()}")
else:
    print("Using cpu")

def space():
    print("\n", "-" * 100, "\n")

## Section 2: Loading TwiBot users with image info

In [None]:
users_df = pd.read_csv(str(USERS_PATH))
posts_df = pd.read_csv(str(POSTS_PATH))
print("Users shape:", users_df.shape)
print(users_df.head())

space()

users_df["profile_image_path"] = users_df["profile_image_path"].astype(str)
users_df["image_exists"] = users_df["profile_image_path"].apply(lambda p: Path(p).exists())
#users_df = users_df[users_df["image_exists"]].copy().reset_index(drop=True)

# --- Aggregating tweets per user (concatenated text) --- #
posts_agg = (
    posts_df.groupby("id")["text"]
    .apply(lambda x: " ".join(x.astype(str)))
    .reset_index()
    .rename(columns={"text": "tweets_text"})
)

# --- Merging into users --- #
data_df = users_df.merge(posts_agg, on="id", how="left")
data_df["tweets_text"] = data_df["tweets_text"].fillna("")

# --- Explicit Bio text --- #
data_df["bio_text"] = data_df["description"].fillna("")

MAX_TWEET_CHARS = 6000
data_df["tweets_text"] = data_df["tweets_text"].str.slice(0, MAX_TWEET_CHARS)

# --- Full text = description + tweets --- #
data_df["full_text"] = (
    "Bio: " + data_df["bio_text"] + " Posts: " + data_df["tweets_text"]
).str.strip()

# --- Attaching image info --- #
data_df["profile_image_path"] = data_df["profile_image_path"].astype(str)
data_df["image_exists"] = data_df["profile_image_path"].apply(lambda p: Path(p).exists())

# --- Length stats --- #
data_df["char_length"] = data_df["full_text"].apply(len)
print("\nDoc length summary (chars):")
display(data_df["char_length"].describe())

space()

# --- Filtering out users with very little text --- #
min_char = 50
data_df = data_df[data_df["char_length"] >= min_char].reset_index(drop=True)
print("\nUsers after min_char filter:", len(data_df))
print(data_df["label"].value_counts())

space()

print("\nData with full_text & image flag:", data_df.shape)
print(data_df[["id", "label_num", "image_exists"]].head())

space()

print("Label distribution (Before balancing):")
print(data_df["label_num"].value_counts())
print("\nImage availability by class (before balancing):")
print(pd.crosstab(data_df["label_num"], data_df["image_exists"]))

space()

users_mm = data_df[["id", "full_text", "image_exists", "profile_image_path", "label_num"]].copy()
print("Final Users shape:", users_mm.shape)
print(users_mm.head())

## Section 3: Loading CLIP model

In [None]:
MODEL_NAME = "openai/clip-vit-base-patch32"

clip_model = CLIPModel.from_pretrained(MODEL_NAME).to(device)
clip_processor = CLIPProcessor.from_pretrained(MODEL_NAME)

# The model is already on `device` (see from_pretrained above); switch to inference mode.
clip_model.eval()

print("CLIP loaded on:", device)

## Section 4: Computing CLIP similarities & Exporting the final results
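
For reference, the score stored in `clip_img_text_sim` is a plain cosine similarity between one image embedding and one text embedding. The sketch below reproduces that arithmetic with NumPy on made-up 4-dimensional vectors (real CLIP ViT-B/32 embeddings have 512 dimensions). The helper name `cosine_sim` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine_sim(img_emb, txt_emb):
    """Cosine similarity of two 1-D embeddings; NaN when the image is missing."""
    if img_emb is None:
        return float("nan")
    img = np.asarray(img_emb, dtype=float)
    txt = np.asarray(txt_emb, dtype=float)
    # L2-normalize both vectors, then take the dot product.
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt)
    return float(np.dot(img, txt))

toy_txt = [1.0, 0.0, 1.0, 0.0]  # hypothetical text embedding
sims = [
    cosine_sim([2.0, 0.0, 2.0, 0.0], toy_txt),  # same direction as the text
    cosine_sim([0.0, 3.0, 0.0, 0.0], toy_txt),  # orthogonal to the text
    cosine_sim(None, toy_txt),                  # user without a profile image
]

# Mirrors the 0.0 fill that is applied later in fusion.
filled = np.nan_to_num(np.array(sims), nan=0.0)
print(sims)
print(filled)
```

Note that filling `NaN` with `0.0` makes "no image" look the same as "image unrelated to the text", so the `image_exists` flag is probably worth keeping alongside the score in the fusion step.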

In [None]:
clip_sims = []

for _, row in tqdm(users_mm.iterrows(), total=len(users_mm), desc="Computing CLIP image-text similarity"):
    text = str(row["full_text"])
    img_path = row["profile_image_path"]

    sim = np.nan

    try:
        if isinstance(img_path, str) and Path(img_path).exists():
            image = Image.open(img_path).convert("RGB")
    
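            # CLIP's text encoder accepts at most 77 tokens, so truncation keeps
            # only the start of full_text (the bio and the first few posts).
            inputs = clip_processor(
                text=[text],
                images=image,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=77,
            ).to(device)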
    
            with torch.no_grad():
                outputs = clip_model(**inputs)
                img_emb = outputs.image_embeds
                txt_emb = outputs.text_embeds
    
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
            sim = (img_emb * txt_emb).sum(dim=-1).item()
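        # No image on disk: sim keeps its NaN default from above.

    except Exception:
        # Most likely an unreadable or corrupt image file: treat it as missing.
        sim = np.nan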

    clip_sims.append(sim)

users_mm.loc[:, "clip_img_text_sim"] = clip_sims

space()

display(users_mm.head())

space()

OUT_PATH = TWIBOT_PATH / "processed/users_with_clip_sim.csv"
users_mm.to_csv(str(OUT_PATH), index=False)
print("Saved:", OUT_PATH)

## Section 5: Train / Val / Test split (user-level, stratified)
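
The cell below balances the two classes by undersampling the larger one, then applies two successive 80/20 splits. A short arithmetic sketch of the resulting proportions, assuming a hypothetical 1,000 humans and 600 bots with images (these counts are made up for illustration; exact sizes in practice can differ by one user because of rounding):

```python
# Hypothetical class counts among users with a profile image (illustrative only).
n_humans, n_bots = 1000, 600

# Undersampling keeps min(len(humans), len(bots)) users per class.
n_per_class = min(n_humans, n_bots)
n_total = 2 * n_per_class

# First split: 20% of all balanced users are held out as the test set.
n_test = round(0.2 * n_total)
n_train_val = n_total - n_test

# Second split: 20% of the remainder becomes the validation set.
n_val = round(0.2 * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
print(n_train / n_total, n_val / n_total, n_test / n_total)
```

In other words, the two nested splits amount to roughly 64% train, 16% validation, and 20% test of the balanced set.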

In [None]:
users_img = users_mm[users_mm["image_exists"]].copy().reset_index(drop=True)

# --- Balancing human/bot distribution --- #
humans = users_img[users_img["label_num"] == 0]
bots = users_img[users_img["label_num"] == 1]

n = min(len(humans), len(bots))
humans_bal = humans.sample(n, random_state=SEED)
bots_bal = bots.sample(n, random_state=SEED)

data_bal = pd.concat([humans_bal, bots_bal]).sample(frac=1, random_state=SEED).reset_index(drop=True)
users_img = data_bal.copy().reset_index(drop=True)

train_val_df, test_df = train_test_split(
    users_img,
    test_size=0.2,
    stratify=users_img["label_num"],
    random_state=SEED
)

train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.2,
    stratify=train_val_df["label_num"],
    random_state=SEED
)

print("Train users:", len(train_df))
print("Val users:", len(val_df))
print("Test users:", len(test_df))

space()

print("\nTrain label distribution:")
print(train_df["label_num"].value_counts())
space()
print("\nVal label distribution:")
print(val_df["label_num"].value_counts())
space()
print("\nTest label distribution:")
print(test_df["label_num"].value_counts())

In [None]:
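class TwibotImageDataset(Dataset):
    """Yields (PIL image, label) pairs; CLIP preprocessing is applied per batch during extraction."""
    def __init__(self, df, processor):
        self.paths = df["profile_image_path"].tolist()
        self.labels = df["label_num"].astype(int).tolist()
        self.processor = processor  # stored for convenience; not used in __getitem__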

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        label = self.labels[idx]

        image = Image.open(path).convert("RGB")
        return image, label

train_dataset = TwibotImageDataset(train_df, clip_processor)
val_dataset   = TwibotImageDataset(val_df, clip_processor)
test_dataset  = TwibotImageDataset(test_df, clip_processor)

print("Train size:", len(train_dataset))
print("Val size:", len(val_dataset))
print("Test size:", len(test_dataset))

## Section 6: Helper -> Extracts CLIP image embeddings
Using this helper, we run CLIP once over the images and cache the 512-dim image features.
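
Caching pays off because the classifier experiments can then be rerun without touching the images again. A minimal sketch of the normalize-then-cache round trip, using random stand-in vectors instead of real CLIP outputs and a temporary file instead of the notebook's `OUTPUT_DIR` (both are assumptions made so the snippet runs on its own):

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
raw_feats = rng.normal(size=(5, 512))  # stand-in for CLIP image features
labels = np.array([0, 1, 0, 1, 1])     # stand-in human/bot labels

# L2-normalize each row so every feature vector has unit length.
feats = raw_feats / np.linalg.norm(raw_feats, axis=1, keepdims=True)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "features.npz"
    np.savez(cache_path, feats=feats, labels=labels)

    # Reload the cached arrays by the keyword names passed to np.savez.
    with np.load(cache_path) as cached:
        loaded_feats = cached["feats"]
        loaded_labels = cached["labels"]

print(loaded_feats.shape)
print(np.linalg.norm(loaded_feats, axis=1).round(6))
```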

In [None]:
def collate_fn(batch):
    images, labels = zip(*batch)
    return list(images), torch.tensor(labels, dtype=torch.long)

def extract_clip_image_features(dataset, batch_size=32):
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
    all_feats = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(dataloader, desc="Extracting CLIP features"):
            inputs = clip_processor(images=images, return_tensors="pt").to(device)
            outputs = clip_model.get_image_features(**inputs)
            feats = outputs.detach().cpu().numpy()

            feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
            all_feats.append(feats)
            all_labels.append(labels.numpy())

    all_feats = np.vstack(all_feats)
    all_labels = np.concatenate(all_labels)
    return all_feats, all_labels

train_feats, train_labels = extract_clip_image_features(train_dataset, batch_size=BATCH_SIZE)
val_feats, val_labels     = extract_clip_image_features(val_dataset, batch_size=BATCH_SIZE)
test_feats, test_labels   = extract_clip_image_features(test_dataset, batch_size=BATCH_SIZE)

print("Train feats:", train_feats.shape)
print("Val feats:", val_feats.shape)
print("Test feats:", test_feats.shape)

In [None]:
np.savez(
    OUTPUT_DIR / "twibot_clip_image_features.npz",
    train_feats=train_feats, train_labels=train_labels,
    val_feats=val_feats,     val_labels=val_labels,
    test_feats=test_feats,   test_labels=test_labels
)

print("Saved CLIP features.")

## Section 7: Training a simple classifier (logistic regression)
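
Whatever scaling is applied before the classifier, its statistics should come from the training split only and then be reused unchanged on validation and test data. Refitting on each split would give every split its own feature scale. The NumPy sketch below illustrates this with a variance-only scaling (dividing each column by its standard deviation without centering), which, as far as we understand, is what `StandardScaler(with_mean=False)` computes. The arrays are random stand-ins, not CLIP features.

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(scale=2.0, size=(200, 4))  # stand-in training features
val = rng.normal(scale=5.0, size=(50, 4))     # stand-in validation features with a wider spread

# Fit: per-column standard deviation estimated on the training split only.
train_std = train.std(axis=0)

# Transform: reuse the training statistics on every split.
train_scaled = train / train_std
val_scaled = val / train_std

# Anti-pattern for comparison: refitting on validation hides the difference in spread.
val_refit = val / val.std(axis=0)

print(train_scaled.std(axis=0).round(3))  # unit spread by construction
print(val_scaled.std(axis=0).round(3))    # still wider than train, as in the raw data
print(val_refit.std(axis=0).round(3))     # forced to unit spread, so the shift is lost
```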

In [None]:
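# Features are already L2-normalized, so we only rescale variance (no mean-centering).
scaler = StandardScaler(with_mean=False)
train_feats_scaled = scaler.fit_transform(train_feats)
# Reuse the training statistics; refitting on val/test would give each split its own scale.
val_feats_scaled = scaler.transform(val_feats)
test_feats_scaled = scaler.transform(test_feats)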

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1
)

clf.fit(train_feats_scaled, train_labels)

print("Validation performance (image-only CLIP):")
val_pred = clf.predict(val_feats_scaled)
print(classification_report(val_labels, val_pred, digits=3, zero_division=0))

## Section 8: Evaluating on Test set + Confusion Matrix
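
Besides the per-class report, the test cell prints AUROC, which can be read as the probability that a randomly chosen bot receives a higher predicted bot-probability than a randomly chosen human (ties counted as one half). The pure-Python sketch below computes it pairwise on made-up scores. The function name `pairwise_auroc` and the toy values are illustrative; the notebook itself relies on `roc_auc_score`.

```python
def pairwise_auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]           # 0 = human, 1 = bot
toy_scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted bot probabilities
auroc = pairwise_auroc(toy_labels, toy_scores)
print(auroc)  # 0.75: three of the four (bot, human) pairs are ordered correctly
```

The pairwise loop is quadratic in the number of users, so it is meant only to make the metric concrete, not to replace the library call.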

In [None]:
test_pred = clf.predict(test_feats_scaled)
test_proba = clf.predict_proba(test_feats_scaled)[:, 1]

print("Test performance (image-only CLIP):")
print(classification_report(test_labels, test_pred, digits=3, zero_division=0))

try:
    auroc = roc_auc_score(test_labels, test_proba)
    print("Test AUROC:", auroc)
except Exception as e:
    print("Could not compute AUROC:", e)

In [None]:
cm = confusion_matrix(test_labels, test_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["human", "bot"],
            yticklabels=["human", "bot"])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("TwiBot-22 Image-only CLIP - Confusion Matrix (Test)")
plt.tight_layout()
plt.savefig(str(OUTPUT_DIR / "confusion_matrix_test.png"))
plt.show()