# Pipeline Results Analysis
This notebook loads the output of the feature generation pipeline (`historical_tags.csv`) and displays the VLM and Text Transformer outputs alongside the resulting policy flags.

In [1]:
import pandas as pd
import sys
import os

# Ensure we can find the config file if running from this directory
sys.path.append(os.getcwd())

try:
    import config
    # Use the absolute path defined in config
    OUTPUT_FILE = config.OFFLINE_FILE
except ImportError:
    # Fallback if config import fails (e.g. environment issues)
    OUTPUT_FILE = "historical_tags.csv"

# Load Data
try:
    df = pd.read_csv(OUTPUT_FILE)
    print(f"Loaded {len(df)} rows from {OUTPUT_FILE}")
except FileNotFoundError:
    # Fall back to an empty frame so later cells do not fail with a NameError
    df = pd.DataFrame()
    print(f"File not found at {OUTPUT_FILE}. Please run main.py first.")


Loaded 7 rows from /Users/shubhambhardwaj/Shubham/datascience/study/LLM/content_moderation/src/harmful_meme/historical_tags.csv


## 1. VLM & Text Transformer Outputs
Here we view the raw outputs from the VLM (Visual Summary, OCR) and the Text Transformer (Keywords).

In [2]:
# Filter important columns for visibility
cols = ['post_id', 'post_text', 'visual_summary', 'ocr_text', 'keywords']

# Check if columns exist (in case CSV structure changed)
display_cols = [c for c in cols if c in df.columns]

if not df.empty:
    display(df[display_cols].head(100))
else:
    print("Dataframe is empty.")

|  | post_id | post_text | visual_summary | ocr_text | keywords |
|---|---|---|---|---|---|
| 0 | 46971 | bravery at its finest | `["The image shows a man eating a cookie, with ...` | `['bravery at its finest']` | `["finest']", 'finest.', 'text:', 'written', 's...` |
| 1 | 3745 | your order comes to $37.50 and your white priv... | `['A woman with long blonde hair is standing be...` | `['your order comes to $37.50 and your white pr...` | `['privilege', 'pens.', 'chalkboard', 'denim', ...` |
| 2 | 83745 | it is time.. to send these parasites back to t... | `["A person dressed in a medieval costume, hold...` | `['it is time... to send these parasites back t...` | `["['it", 'time..', 'send', 'time', 'text:', 'd...` |
| 3 | 5279 | knowing white people , that's probably the bab... | `["The image shows a pregnant woman standing ne...` | `["knowing white people, that's probably the ba...` | `['horse,', 'people,', 'baby', 'father.', 'body...` |
| 4 | 1796 | life hack #23 how to get stoned with no weed | `['A woman in a hijab is kissing another woman ...` | `['life hack #23 how to get stoned with no weed']` | `['says:', 'shows:', 'hijab', 'text', 'image', ...` |
| 5 | 82301 | cooooooooooooon!!!! | `['A dog is howling at the top of a snowy mount...` | `['COOOOOOOOON!!!!']` | `["['cooooooooon!!!!']", 'says:', 'snowy', 'sho...` |
| 6 | 31752 | when you get to choose your own mental illness | `["A thumbs-up graphic with a rainbow-colored s...` | `['when you get to choose your own mental illne...` | `['illness\'."].', 'community,', 'thumbs-up', '...` |
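
The `visual_summary`, `ocr_text`, and `keywords` cells above look like Python lists that were written to the CSV as strings, so `pd.read_csv` would hand them back as plain text. The sketch below shows one way to turn such a cell back into a list using `ast.literal_eval`. The helper name `parse_list_cell` is hypothetical and not part of the pipeline, and the assumption that the cells are valid list literals is inferred from the displayed output only.

```python
import ast


def parse_list_cell(value):
    """Turn a stringified Python list back into a list (hypothetical helper)."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # e.g. NaN produced by an empty CSV cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]  # keep unparseable text as a single item
    return parsed if isinstance(parsed, list) else [parsed]


# The literal below mirrors the ocr_text value shown for post 46971.
ocr_items = parse_list_cell("['bravery at its finest']")
print(ocr_items[0])
```

On the loaded frame this could be applied per column, for example `df['ocr_text'].apply(parse_list_cell)`. Some keyword entries (such as `"['it"`) still carry bracket and quote characters, which may mean list-valued text was stringified before keyword extraction; parsing the CSV cells would not undo that.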


## 2. Generated Scores & Tags
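These are the policy scores and final binary decisions. The cell below selects every column whose name contains `Score` or `Is_`; in this run the CSV contains only the four binary `Is_` flags.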

In [3]:
score_cols = [c for c in df.columns if 'Score' in c or 'Is_' in c]
if not df.empty:
    display(df[['post_id'] + score_cols].head(200))

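|  | post_id | Is_Harmful_Content | Is_Political_Content | Is_Spam | Is_Copyright_Infringement |
|---|---|---|---|---|---|
| 0 | 46971 | 0 | 0 | 0 | 0 |
| 1 | 3745 | 0 | 1 | 0 | 0 |
| 2 | 83745 | 0 | 0 | 0 | 0 |
| 3 | 5279 | 0 | 0 | 0 | 0 |
| 4 | 1796 | 0 | 0 | 0 | 0 |
| 5 | 82301 | 0 | 1 | 0 | 0 |
| 6 | 31752 | 0 | 0 | 0 | 0 |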
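
With only seven rows the flags can be read by eye, but a per-policy count becomes more useful as the file grows. The following is a minimal sketch, not part of the original notebook: it rebuilds the table above as a small frame (the name `flags` is illustrative) and counts positives per `Is_` column. In the notebook itself the same two operations could be run on `df` directly.

```python
import pandas as pd

# Flag values copied from the table above (7 posts).
flags = pd.DataFrame({
    "post_id": [46971, 3745, 83745, 5279, 1796, 82301, 31752],
    "Is_Harmful_Content": [0, 0, 0, 0, 0, 0, 0],
    "Is_Political_Content": [0, 1, 0, 0, 0, 1, 0],
    "Is_Spam": [0, 0, 0, 0, 0, 0, 0],
    "Is_Copyright_Infringement": [0, 0, 0, 0, 0, 0, 0],
})

flag_cols = [c for c in flags.columns if c.startswith("Is_")]

# Number of posts flagged under each policy.
flag_counts = flags[flag_cols].sum()
print(flag_counts)

# IDs of posts with at least one positive flag.
flagged_posts = flags.loc[flags[flag_cols].any(axis=1), "post_id"].tolist()
print(flagged_posts)
```

For this run the only positives are the two `Is_Political_Content` flags, on posts 3745 and 82301; `Is_Harmful_Content` is 0 for every row.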
