
In [None]:
# ==========================================
# Project: Cosmic Crew - Scientific PDF Integrity Suite
# Team ID: P-2023-27-AI&DS-13 (Lab AD)
# Focus: Gen AI Core Implementation (CPU Execution)
# ==========================================

# --- Installation of Required Libraries ---
# transformers, torch: For loading RoBERTa and GPT-2 Gen AI models
# pymupdf (fitz): For robust PDF text extraction
# scipy, numpy: For statistical calculations
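# --index-url replaces PyPI for every package on the same command line, so the CPU wheel
# index is used only for the PyTorch packages; everything else is installed from PyPI.
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install -q transformers pymupdf scipy numpy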

print("Libraries installed successfully.")
print("Note: Running in CPU-only mode for privacy compliance.")

Libraries installed successfully.
Note: Running in CPU-only mode for privacy compliance.
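
A quick way to confirm what the install step produced is to query the installed package versions. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names in the list are assumed to match what pip installed in this environment.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names assumed from the pip commands in the install cell.
for name, version in installed_versions(
    ["torch", "transformers", "PyMuPDF", "scipy", "numpy"]
).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

A `None` entry means the package is not visible to the current interpreter, which usually points to a failed install or a runtime that needs restarting.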


In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, GPT2LMHeadModel, GPT2Tokenizer
import fitz  # PyMuPDF for PDF processing
import numpy as np
import os

# --- Configuration & Gen AI Parameters ---
# The specific pre-trained models chosen for the project
NEURAL_MODEL_NAME = "roberta-base-openai-detector" # RoBERTa for pattern detection
STYLOMETRIC_MODEL_NAME = "gpt2" # GPT-2 for Perplexity calculation

# RoBERTa accepts at most 512 tokens per input (special tokens included);
# capping the length here prevents the tensor size mismatch seen on long PDFs.
MAX_LENGTH = 512

# Enforcing CPU execution for local privacy compliance
DEVICE = torch.device("cpu")
print(f"Runtime Device Set To: {DEVICE}")

# --- Loading Gen AI Models & Tokenizers ---
print("\nLoading Gen AI Models (this may take a moment)...")

try:
    # 1. Load Neural Detector (RoBERTa)
    # Note: This model detects linguistic patterns typical of AI generation.
    roberta_tokenizer = AutoTokenizer.from_pretrained(NEURAL_MODEL_NAME)
    roberta_model = AutoModelForSequenceClassification.from_pretrained(NEURAL_MODEL_NAME).to(DEVICE)
    roberta_model.eval() # Set to evaluation mode
    print(f"[SUCCESS] Loaded {NEURAL_MODEL_NAME}")

    # 2. Load Stylometric Analyzer (GPT-2)
    # Note: This model is used to calculate how "predictable" the text is (Perplexity).
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained(STYLOMETRIC_MODEL_NAME)
    gpt2_model = GPT2LMHeadModel.from_pretrained(STYLOMETRIC_MODEL_NAME).to(DEVICE)
    gpt2_model.eval()
    print(f"[SUCCESS] Loaded {STYLOMETRIC_MODEL_NAME}")

except Exception as e:
    print(f"\n[ERROR] Failed to load models. Ensure internet access for initial download.\nError details: {e}")
    raise  # Stop here: the later cells depend on both models being loaded

Runtime Device Set To: cpu

Loading Gen AI Models (this may take a moment)...



RobertaForSequenceClassification LOAD REPORT from: roberta-base-openai-detector
Key                         | Status     |  | 
----------------------------+------------+--+-
roberta.pooler.dense.bias   | UNEXPECTED |  | 
roberta.pooler.dense.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[SUCCESS] Loaded roberta-base-openai-detector



GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



[SUCCESS] Loaded gpt2
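
The 512-token limit set by `MAX_LENGTH` covers the whole input, including the two special tokens (`<s>` and `</s>`) that the RoBERTa tokenizer adds around a single sequence. Below is a minimal sketch of truncating with that budget in mind, using plain integer lists in place of real token IDs. The function `truncate_token_ids` and its parameters are hypothetical names for illustration, not part of the transformers API.

```python
def truncate_token_ids(token_ids, max_length=512, num_special_tokens=2):
    """Keep only as many content tokens as fit once special tokens are added.

    Returns the kept token IDs and a flag saying whether anything was cut.
    """
    budget = max_length - num_special_tokens
    if budget <= 0:
        raise ValueError("max_length must exceed the number of special tokens")
    was_truncated = len(token_ids) > budget
    return token_ids[:budget], was_truncated


# Stand-in IDs: a long document (600 tokens) and a short one (188 tokens).
long_ids = list(range(600))
short_ids = list(range(188))

kept, cut = truncate_token_ids(long_ids)
print(len(kept), cut)

kept, cut = truncate_token_ids(short_ids)
print(len(kept), cut)
```

With a real tokenizer, passing `truncation=True, max_length=512` to the tokenizer call applies the same budget automatically.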


In [None]:
# --- Core Helper Functions ---

def extract_and_truncate_text(pdf_path, tokenizer, max_len=512):
    """
    Extracts text from PDF and applies the CRITICAL 512-token truncation.
    This solves the 'tensor size mismatch' error encountered in long PDFs.
    """
    print(f"\nProcessing PDF: {pdf_path}...")
    # The context manager closes the document even if extraction fails part-way
    with fitz.open(pdf_path) as doc:
        full_text = "".join(page.get_text() for page in doc)

    # 1. Tokenize the full text first
    tokens = tokenizer.encode(full_text, add_special_tokens=False)
    original_token_count = len(tokens)

    # 2. Apply Truncation Logic if necessary
    if original_token_count > max_len:
        print(f"  [Truncation Active] Text length ({original_token_count} tokens) exceeds limit.")
        print(f"  Truncating to first {max_len} tokens for model compatibility.")
        truncated_tokens = tokens[:max_len]
        # Decode back to text for passing to the models
        processed_text = tokenizer.decode(truncated_tokens)
    else:
        print(f"  [No Truncation Needed] Text length ({original_token_count} tokens) is within limits.")
        processed_text = full_text

    return processed_text


def calculate_neural_confidence(text, model, tokenizer, device):
    """
    Runs inference using the RoBERTa model to get AI probability.
    """
    # Truncate to MAX_LENGTH; a single sequence needs no padding, so none is added
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
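    # Resolve the AI-generated class from the checkpoint's own label mapping instead of
    # hard-coding an index. Assumption: the config names that class "Fake" (the original
    # OpenAI detector orders its outputs fake, real); index 0 is the fallback if it does not.
    label_to_index = {str(label).lower(): idx for idx, label in model.config.id2label.items()}
    ai_confidence = probs[0][label_to_index.get("fake", 0)].item()
    return ai_confidence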


def calculate_perplexity(text, model, tokenizer, device):
    """
    Calculates Perplexity (PPL) using GPT-2 as a stylometric metric.
    Lower PPL indicates higher predictability (often correlates with AI text).
    """
    # The text was already truncated upstream, so it fits in a single GPT-2 window
    # and no sliding-window (strided) evaluation is needed.
    encodings = tokenizer(text, return_tensors="pt").to(device)
    input_ids = encodings.input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean next-token cross-entropy
        outputs = model(input_ids, labels=input_ids)

    # Calculate loss and convert to perplexity
    loss = outputs.loss
    ppl = torch.exp(loss).item()
    return ppl
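
Perplexity is the exponential of the average negative log-likelihood per token, which is why `torch.exp(loss)` works when the loss is a mean cross-entropy. The sketch below shows the same arithmetic in plain Python on made-up per-token log-probabilities; the function name `perplexity_from_logprobs` is hypothetical and the numbers are illustrative, not model output.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
uniform_logprobs = [math.log(0.25)] * 5
print(round(perplexity_from_logprobs(uniform_logprobs), 6))

# More confident predictions (higher probabilities) give lower perplexity.
confident_logprobs = [math.log(0.9), math.log(0.8), math.log(0.95)]
print(round(perplexity_from_logprobs(confident_logprobs), 3))
```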

In [None]:
# --- Main Execution Loop ---
from google.colab import files

print("Please upload a sample scholarly PDF file for analysis.")
uploaded = files.upload()

if not uploaded:
    print("No file uploaded. Exiting.")
else:
    pdf_filename = next(iter(uploaded))

    try:
        # 1. Preprocessing Step (Extraction + Truncation)
        # We use the RoBERTa tokenizer to define truncation boundaries
        processed_text = extract_and_truncate_text(pdf_filename, roberta_tokenizer, MAX_LENGTH)

        if not processed_text.strip():
            print("[Error] Could not extract text from the PDF. It might be scanned image-only.")
        else:
            print("\n--- Running Gen AI Forensic Audits (CPU) ---")

            # 2. Neural Audit Step (RoBERTa)
            neural_score = calculate_neural_confidence(processed_text, roberta_model, roberta_tokenizer, DEVICE)
            print("Audit 1 Finished: Neural Classification")

            # 3. Stylometric Audit Step (GPT-2 Perplexity)
            ppl_score = calculate_perplexity(processed_text, gpt2_model, gpt2_tokenizer, DEVICE)
            print("Audit 2 Finished: Stylometric Analysis")


            # --- Final Report Generation ---
            print("\n" + "="*40)
            print("   COSMIC CREW: INTEGRITY REPORT   ")
            print("="*40)
            print(f"Analysed File: {pdf_filename}\n")

            print("--- GEN AI METRICS ---")
            # Neural Score Interpretation
            print(f"1. Neural AI Confidence: {neural_score:.4f}")
            if neural_score > 0.80:
                print("   -> INTERPRETATION: High probability of AI generation detected by RoBERTa.")
            else:
                print("   -> INTERPRETATION: Likely Human-authored based on neural patterns.")

            # Perplexity Interpretation
            print(f"\n2. Stylometric Perplexity (PPL): {ppl_score:.2f}")
            print("   -> NOTE: Lower PPL indicates highly predictable text (common in AI).")
            print("            Scholarly human text typically has higher PPL (>30-40).")

            print("="*40)
            print("Analysis Complete.")

    except Exception as e:
        print(f"An unexpected error occurred during analysis: {e}")

Please upload a sample scholarly PDF file for analysis.


Saving EH_CASESTUDY(LAV).pdf to EH_CASESTUDY(LAV).pdf

Processing PDF: EH_CASESTUDY(LAV).pdf...
  [No Truncation Needed] Text length (188 tokens) is within limits.

--- Running Gen AI Forensic Audits (CPU) ---
Audit 1 Finished: Neural Classification


Audit 2 Finished: Stylometric Analysis

========================================
   COSMIC CREW: INTEGRITY REPORT   
========================================
Analysed File: EH_CASESTUDY(LAV).pdf

--- GEN AI METRICS ---
1. Neural AI Confidence: 0.9970
   -> INTERPRETATION: High probability of AI generation detected by RoBERTa.

2. Stylometric Perplexity (PPL): 41.60
   -> NOTE: Lower PPL indicates highly predictable text (common in AI).
            Scholarly human text typically has higher PPL (>30-40).
========================================
Analysis Complete.
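
The neural score in the report comes from two steps: a softmax turns the classifier's two logits into probabilities, and the probability of the AI-generated class is compared against the 0.80 threshold. The sketch below reproduces those steps in plain Python. The `id2label` mapping, the logits, and the function names are assumptions for illustration only; the mapping that actually applies should be read from `model.config.id2label` on the loaded checkpoint.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits, id2label, ai_label="Fake", threshold=0.80):
    """Return the AI-class probability and a verdict based on the threshold."""
    probs = softmax(logits)
    # Find the AI class by name rather than assuming a fixed index.
    ai_index = next(i for i, label in id2label.items() if label.lower() == ai_label.lower())
    score = probs[ai_index]
    verdict = "likely AI-generated" if score > threshold else "likely human-authored"
    return score, verdict


# Hypothetical label mapping and logits, chosen only to exercise both branches.
example_id2label = {0: "Fake", 1: "Real"}
print(interpret([2.0, 0.0], example_id2label))
print(interpret([0.0, 2.0], example_id2label))
```

Looking the class up by name keeps the interpretation correct even if a checkpoint orders its labels differently from what the calling code expects.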
