# Pipeline Results Analysis
This notebook loads the output of the feature generation pipeline (`historical_tags.csv`) and inspects the VLM and text Transformer outputs alongside the generated policy tags.

In [13]:
import pandas as pd
import sys
import os

# Ensure we can find the config file if running from this directory
sys.path.append(os.getcwd())

try:
    import config
    # Use the absolute path defined in config
    OUTPUT_FILE = config.OFFLINE_FILE
except ImportError:
    # Fallback if config import fails (e.g. environment issues)
    OUTPUT_FILE = "historical_tags.csv"

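# Load data
try:
    df = pd.read_csv(OUTPUT_FILE)
    print(f"Loaded {len(df)} rows from {OUTPUT_FILE}")
except FileNotFoundError:
    # Fall back to an empty frame so later cells reach their
    # empty-check instead of raising NameError on `df`
    df = pd.DataFrame()
    print(f"File not found at {OUTPUT_FILE}. Please run main.py first.")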


Loaded 250 rows from /Users/shubhambhardwaj/Shubham/datascience/study/LLM/embeddings/research/harmful_meme/historical_tags.csv


## 1. VLM & Text Transformer Outputs
This section shows the raw outputs from the VLM (`visual_summary`, `ocr_text`) and the text Transformer (`keywords`).

In [14]:
# Filter important columns for visibility
cols = ['post_id', 'post_text', 'visual_summary', 'ocr_text', 'keywords']

# Check if columns exist (in case CSV structure changed)
display_cols = [c for c in cols if c in df.columns]

if not df.empty:
    display(df[display_cols].head(100))
else:
    print("Dataframe is empty.")

index,post_id,post_text,visual_summary,ocr_text,keywords
0,46971,bravery at its finest,['The image shows a man with a beard and glass...,['bravery at its finest'],"['bravery', 'shows', 'shows:', ""fingers.']."", ..."
1,3745,your order comes to $37.50 and your white priv...,['A young woman with long blonde hair is stand...,['your order comes to $37.50\nand your white p...,"['standing', 'text', 'total', 'image', 'discou..."
2,83745,it is time.. to send these parasites back to t...,['A person in armor is holding a sword with a ...,['it is time... to send these parasites back t...,"['back', 'text', 'desert.', 'image', 'time...'..."
3,5279,"knowing white people , that's probably the bab...","[""A woman is standing next to a horse, both lo...","[""knowing white people, that's probably the ba...","['shirt', 'vest,', 'standing', 'green', 'trees..."
4,1796,life hack #23 how to get stoned with no weed,['A woman in a hijab is giving a woman in a hi...,life hack #23 how to get stoned with no weed,"['giving', 'woman', 'hijab', 'weed', 'weed.', ..."
...,...,...,...,...,...
95,84015,my irony meter just exploded,['A man in a blue shirt is being handcuffed by...,['my irony meter just exploded'],"['shirt', ""exploded']"", 'exploded.', 'police',..."
96,87251,i love apes they are both ugly and cute,"[""A child's face with a neutral expression is ...","['i love apes', 'they are both ugly and c']","['love', 'text', 'both', ""apes',"", 'apes\'."",'..."
97,59738,what's the difference between a refugee and e....,"[""A split-screen image of two older men, both ...","[""what's the difference between a refugee and ...","['bottom', 'one.', 'black', ""'what's"", 'text',..."
98,3524,terrorist can come to this country way to easy,"['A crowd of people, some of whom are sitting ...","['terrorist can come to this country way, to e...","['else.', 'standing', 'facing', 'text', 'possi..."
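
In the output above, `visual_summary`, `ocr_text`, and `keywords` appear to be stored as stringified Python lists, while at least one `ocr_text` cell (row 4) is a plain string. The sketch below shows one way such cells could be normalized into real lists before further analysis. The helper name `parse_list_cell` is hypothetical and not part of the pipeline, and the assumption that the cells are Python list reprs is inferred only from the displayed output.

```python
import ast

def parse_list_cell(value):
    """Return a list for a cell that may hold a stringified Python list.

    Hypothetical helper: assumes list-like cells were written as Python
    list reprs (e.g. "['a', 'b']"), as the displayed output suggests.
    """
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # NaN or another missing value
    text = value.strip()
    if text.startswith("["):
        try:
            parsed = ast.literal_eval(text)
            if isinstance(parsed, list):
                return parsed
        except (ValueError, SyntaxError):
            pass  # malformed repr: fall through and keep the raw string
    return [value]  # plain string cell, wrapped for a uniform type

# Cells copied from the output above, plus a missing value
cells = [
    "['bravery at its finest']",
    "my irony meter just exploded",
    "['i love apes', 'they are both ugly and c']",
    float("nan"),
]
parsed_cells = [parse_list_cell(c) for c in cells]
for original, parsed in zip(cells, parsed_cells):
    print(repr(original), "->", parsed)
```

In the notebook, the same helper could be applied per column, for example `df['ocr_text'].apply(parse_list_cell)`. `ast.literal_eval` is used rather than `eval` because it only accepts Python literals and will not execute arbitrary code stored in the CSV.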


## 2. Generated Scores & Tags
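This section shows the policy score columns (any column whose name contains `Score`) and the final binary decisions (the `Is_*` columns). In the output below, only the binary `Is_*` tag columns appear.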

In [15]:
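# Select policy score columns and binary decision columns by name
score_cols = [c for c in df.columns if 'Score' in c or 'Is_' in c]
if not df.empty:
    display(df[['post_id'] + score_cols].head(200))
else:
    print("Dataframe is empty.")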

index,post_id,Is_Harmful_Content,Is_Political_Content,Is_Spam,Is_Copyright_Infringement
0,46971,0,0,0,0
1,3745,0,0,0,0
2,83745,0,0,0,0
3,5279,0,1,0,0
4,1796,0,0,0,0
...,...,...,...,...,...
195,51306,0,1,0,0
196,50241,0,1,0,0
197,17045,0,0,0,0
198,89536,0,1,0,0
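
Scanning 200 rows of zeros and ones is hard to read, so a per-tag summary may be more useful. The sketch below counts how many posts each binary tag flags and the corresponding rate. It runs on a small frame built from the first five rows shown above so that it is self-contained. The names `sample`, `tag_cols`, and `summary` are illustrative, and the code assumes the `Is_*` columns hold only 0 and 1.

```python
import pandas as pd

# First five rows of the output above, reproduced so the sketch runs standalone
sample = pd.DataFrame({
    "post_id": [46971, 3745, 83745, 5279, 1796],
    "Is_Harmful_Content": [0, 0, 0, 0, 0],
    "Is_Political_Content": [0, 0, 0, 1, 0],
    "Is_Spam": [0, 0, 0, 0, 0],
    "Is_Copyright_Infringement": [0, 0, 0, 0, 0],
})

# Binary decision columns, selected by their "Is_" prefix
tag_cols = [c for c in sample.columns if c.startswith("Is_")]

# With 0/1 values, the sum is the flagged count and the mean is the flagged rate
summary = pd.DataFrame({
    "flagged": sample[tag_cols].sum(),
    "rate": sample[tag_cols].mean(),
})
print(summary)
```

Replacing `sample` with `df` would give the same summary over the full file.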
