# GPT-4.1 Model Selection Phase 1
#### Author: Arnav Ahuja (223271095)

## Objective
This notebook is part of Phase 1 of the LLM Team’s experiment plan, in which we test the GPT-4.1 model on our base prompts and data. The goal is to evaluate its suitability and performance so that we can select, from our shortlisted options, the best model for feedback generation and AI detection in the next phases.




## Azure API Integration
The notebook connects to Azure OpenAI Services using a securely provided API key. This setup ensures access to enterprise-grade deployments of GPT models while enabling structured experimentation, reproducibility, and controlled cost tracking.




## Experimental Goal
We run detection and feedback prompts on multiple student submissions and assess the accuracy, consistency, and usefulness of GPT-4.1’s output using human and GenAI rating rubrics. These results form the basis for comparing models across domains and will guide the LLM Team’s choice of model for the later phases.

In [17]:
!pip install -q "openai>=1.52.0"

### Configuration

This section sets the Azure OpenAI endpoint, API key, API version, and deployment name, along with the dataset files and the maximum number of few-shot examples used in the experiment.

In [18]:
import os, json, re, time, csv, dataclasses
from typing import List, Dict, Any, Tuple, Optional
from openai import AzureOpenAI

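# -----------------------------
# CONFIG – Azure values are read from environment variables when set;
# otherwise replace the placeholders below
# -----------------------------
AZURE_OPENAI_ENDPOINT    = os.getenv("AZURE_OPENAI_ENDPOINT", "<YOUR_ENDPOINT>")
AZURE_OPENAI_API_KEY     = os.getenv("AZURE_OPENAI_API_KEY", "<YOUR_API_KEY>")
AZURE_OPENAI_API_VERSION = "2024-12-01-preview"
AZURE_OPENAI_DEPLOYMENT  = "gpt-4.1"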

DATASETS = [
    "engineering.json",
    "accounting.json",
    "it.json",
    "psychology.json",
    "teaching.json"
]
MAX_EXAMPLES = 3
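
Later cells read a fixed set of keys from each dataset file: `domain`, `prompt`, `rubric`, and `submissions` at the top level, plus `final_submission` and `label_type` on each submission. The sketch below shows one way to check that shape before spending API calls. The helper `validate_dataset` and the toy data are hypothetical and not part of the original experiment.

```python
from typing import Any, Dict, List

# Keys the rest of this notebook reads from each dataset file.
REQUIRED_TOP_LEVEL = ("domain", "prompt", "rubric", "submissions")
KNOWN_LABELS = {"Human", "AI", "Hybrid"}

def validate_dataset(data: Dict[str, Any]) -> List[str]:
    """Return a list of readable problems; an empty list means the shape looks usable."""
    problems: List[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    for i, sub in enumerate(data.get("submissions", []), 1):
        if not str(sub.get("final_submission", "")).strip():
            problems.append(f"submission #{i}: empty or missing final_submission")
        label = str(sub.get("label_type", "")).strip()
        if label and label not in KNOWN_LABELS:
            problems.append(f"submission #{i}: unexpected label_type {label!r}")
    return problems

# Toy data for illustration only; not taken from the real datasets.
toy = {
    "domain": "Example Domain",
    "prompt": "Explain the topic.",
    "rubric": {"rubric_id": "r1", "criteria": []},
    "submissions": [
        {"final_submission": "Some text.", "label_type": "Human"},
        {"final_submission": "", "label_type": "Robot"},
    ],
}
issues = validate_dataset(toy)
print(issues)
```

A missing `label_type` is tolerated here because the main loop already falls back to `"NA"` for unlabeled submissions.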

### Helper Functions

This section contains supporting functions that make the experiment run more smoothly and consistently. These helpers are used throughout the workflow to:

- Load and manage data – bringing in student submissions and rubrics from files so they can be analyzed.
- Track progress – recording timestamps to know when different parts of the process are run.
- Select representative examples – choosing a balanced set of Human, AI, and Hybrid submissions to guide the model during classification.
- Prepare rubric information – turning rubric content into a clear and structured format that the model can use when generating feedback.

Together, these functions ensure that the core experiment focuses on evaluation, while the background tasks of organizing and preparing data are handled automatically.

In [19]:
# =============================
# Helpers
# =============================
def load_json(p: str) -> Dict[str, Any]:
    with open(p, "r", encoding="utf-8") as f:
        return json.load(f)

def now_ts() -> str:
    import datetime as dt
    return dt.datetime.now().isoformat(timespec="seconds")

def pick_few_shots(subs: List[Dict[str,Any]], max_examples:int=3) -> List[Dict[str,Any]]:
    """
    Prioritize one example from each label (Human, AI, Hybrid) if available, then fill up.
    """
    buckets = {"Human": [], "AI": [], "Hybrid": []}
    for s in subs:
        label = str(s.get("label_type", "")).strip()
        if label in buckets:
            buckets[label].append(s)
    shots: List[Dict[str,Any]] = []
    for lbl in ["Human","AI","Hybrid"]:
        if buckets[lbl]:
            shots.append(buckets[lbl][0])
    for s in subs:
        if len(shots) >= max_examples:
            break
        if s not in shots:
            shots.append(s)
    return shots[:max_examples]

def format_rubric(r: Dict[str,Any]) -> str:
    out = [f"Rubric ID: {r.get('rubric_id','unknown')}", "Criteria:"]
    for c in r.get("criteria", []):
        out.append(f"- {c.get('criterion_id')}: {c.get('name')}")
        out.append(f"  {c.get('description')}")
    return "\n".join(out)
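
To make the selection order of `pick_few_shots` concrete, the sketch below restates the function compactly so the snippet runs on its own, then applies it to toy submissions. The `id` field and the toy data are assumptions for illustration only.

```python
from typing import Any, Dict, List

def pick_few_shots(subs: List[Dict[str, Any]], max_examples: int = 3) -> List[Dict[str, Any]]:
    """Compact restatement of the helper above: one example per label first, then fill up."""
    buckets: Dict[str, List[Dict[str, Any]]] = {"Human": [], "AI": [], "Hybrid": []}
    for s in subs:
        label = str(s.get("label_type", "")).strip()
        if label in buckets:
            buckets[label].append(s)
    # First example of each available label, always in the order Human, AI, Hybrid.
    shots = [buckets[lbl][0] for lbl in ("Human", "AI", "Hybrid") if buckets[lbl]]
    # Fill remaining slots in dataset order, skipping examples already chosen.
    for s in subs:
        if len(shots) >= max_examples:
            break
        if s not in shots:
            shots.append(s)
    return shots[:max_examples]

# Toy submissions with no Hybrid example (hypothetical "id" field for readability).
subs = [
    {"id": 1, "label_type": "AI"},
    {"id": 2, "label_type": "AI"},
    {"id": 3, "label_type": "Human"},
]
picked = pick_few_shots(subs, max_examples=3)
capped = pick_few_shots(subs, max_examples=1)
print([(s["id"], s["label_type"]) for s in picked])
print([(s["id"], s["label_type"]) for s in capped])
```

Because the per-label picks are made first and the list is truncated afterwards, a `max_examples` below 3 drops the Hybrid example first and then the AI example.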

### Prompt Builders

This section defines the reusable instructions we send to the model for two tasks—classification and feedback—so every run is consistent, comparable, and easy to read.

#### Detection Prompt (Academic Integrity)

Frames the model as a careful classifier to decide whether a submission is Human, AI, or Hybrid. It encourages brief, evidence-based reasoning and returns a simple, structured text block with a label, short rationale bullets, and any notable flags (e.g., style inconsistency or generic phrasing).

#### Feedback Prompt (Rubric-Aligned)

Guides the model to act as a supportive assessor. It produces a concise, structured review that includes an overall summary, per-criterion feedback (rating, brief reasons, and one concrete improvement tip), and an overall rating—making results easy to compare across submissions and rubrics.

In [20]:
# =============================
# Prompt Builders
# =============================
SYSTEM_PROMPT = "You are a careful academic assistant. Be precise and give clear structured output (not JSON, not CSV, no files)."

def build_detection_prompt(submission: str, few_shots: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """
    Academic Integrity Detector Prompt
    ----------------------------------
    Purpose:
        Classifies student submissions as Human, AI, or Hybrid (AI-assisted).

    Technique:
        - Role-based prompting
        - Few-shot support
        - CoT (reasoning encouraged but hidden from output)
        - Output in plain text

    Expected Output (example format in plain text):
        Label: Human | AI | Hybrid
        Rationale:
        - short bullet point 1
        - short bullet point 2
        Flags: style_inconsistency / high_verbatim / generic_phrasing / none
    """
    # Build few-shot block
    shot_texts = []
    for s in few_shots:
        shot_texts.append(
            f'Submission: """{s.get("final_submission","")}"""\n'
            f'Your analysis (2–4 bullet points): <analysis>\n'
            f'Label: {s.get("label_type","")}\n'
        )
    examples_block = "\n\n".join(shot_texts) if shot_texts else "/* no examples available */"

    user = f"""
You are an AI text-source classifier for academic integrity.
Decide whether the student submission is Human, AI, or Hybrid (AI-assisted).

Guidelines:
- Consider discourse features (specificity, subjectivity, personal context), style consistency, local/global coherence, repetitiveness, and cliché patterns.
- Hybrid = meaningful human writing with some AI assistance, or explicit admission of mixed use.

Examples:
{examples_block}

Now analyze the NEW submission and respond in plain text with the following structure:
Label: ...
Rationale:
- point 1
- point 2
Flags: ...
NEW submission:
\"\"\"{submission}\"\"\"\n
"""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

def build_feedback_prompt(domain: str, assignment_prompt: str, rubric_text: str, submission: str) -> List[Dict[str, str]]:
    """
    Rubric-Aligned Feedback Prompt
    ------------------------------
    Purpose:
        Generates structured, supportive feedback for a student submission.

    Technique:
        - Role-based prompting
        - Rubric-grounded evaluation
        - Output in plain text

    Expected Output (example format in plain text):
        Overall Summary:
        <2–4 sentence overview>

        Criteria Feedback:
        Criterion: <criterion_id>
        Rating: Excellent | Good | Average | Needs Improvement | Poor
        Reason:
        - point 1
        - point 2
        Improvement Tip: one concrete suggestion

        Overall Rating: Excellent | Good | Average | Needs Improvement | Poor
    """
    user = f"""
You are a supportive assessor. Provide actionable feedback aligned to the rubric.
Return plain structured text only (no JSON, no files).

Sections to include:
1) Overall Summary: 2–4 sentences on strengths and priorities.
2) Criteria Feedback: for each rubric criterion include:
   - Criterion
   - Rating (Excellent | Good | Average | Needs Improvement | Poor)
   - Reason (1–3 bullet points citing reasoning)
   - Improvement Tip (one concrete step)
3) Overall Rating: Excellent | Good | Average | Needs Improvement | Poor

Context:
- Domain: {domain}
- Assignment prompt: {assignment_prompt}

Rubric (verbatim):
{rubric_text}

Student submission:
\"\"\"{submission}\"\"\"\n
"""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
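
Because the detection prompt asks for a fixed `Label:` / `Rationale:` / `Flags:` layout, the reply can be parsed back into fields for comparison with the ground-truth label. The following is a minimal sketch under that assumption. `parse_detection` is a hypothetical helper, not part of the original notebook, and real model output may deviate from the requested layout.

```python
import re
from typing import Any, Dict, List, Optional

CANONICAL = {"human": "Human", "ai": "AI", "hybrid": "Hybrid"}

def parse_detection(text: str) -> Dict[str, Any]:
    """Extract label, rationale bullets, and flags from a plain-text detection reply."""
    label: Optional[str] = None
    flags: Optional[str] = None
    rationale: List[str] = []
    in_rationale = False
    for raw in text.splitlines():
        line = raw.strip()
        lower = line.lower()
        if lower.startswith("label:"):
            # Accept only the three known labels; anything else stays None.
            m = re.match(r"(human|ai|hybrid)\b", line[len("label:"):].strip(), re.IGNORECASE)
            label = CANONICAL[m.group(1).lower()] if m else None
            in_rationale = False
        elif lower.startswith("rationale:"):
            in_rationale = True
        elif lower.startswith("flags:"):
            flags = line[len("flags:"):].strip() or None
            in_rationale = False
        elif in_rationale and line.startswith("-"):
            rationale.append(line.lstrip("-").strip())
    return {"label": label, "rationale": rationale, "flags": flags}

# Hand-written sample reply in the requested layout (not real model output).
sample = "Label: AI  \nRationale:\n- first point\n- second point\nFlags: generic_phrasing"
parsed = parse_detection(sample)
print(parsed)
```

Returning `None` for an unrecognized label, rather than guessing, keeps malformed replies visible when accuracy is computed later.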

### Azure Client

This part of the notebook sets up the connection to Azure OpenAI Services. It ensures that the model can be called securely and consistently using the configured endpoint, API key, version, and deployment.

- A small configuration class stores all key details for reuse.
- A client object is created to send messages to the GPT model.
- The helper function chat_complete wraps each request, returning both the model’s response and the time it took to generate it.

Together, these pieces form the bridge between our experiment code and the GPT-4.1 model hosted on Azure.

In [21]:
# =============================
# Azure Client
# =============================
@dataclasses.dataclass
class AzureCfg:
    endpoint: str
    api_key: str
    api_version: str
    deployment: str

if not AZURE_OPENAI_API_KEY or AZURE_OPENAI_API_KEY.startswith("<"):
    raise RuntimeError(
        "Azure OpenAI API key is missing or still set to the placeholder. "
        "Provide a real key for AZURE_OPENAI_API_KEY in the Configuration section."
    )

cfg = AzureCfg(
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_API_KEY,
    AZURE_OPENAI_API_VERSION,
    AZURE_OPENAI_DEPLOYMENT
)
client = AzureOpenAI(
    api_key=cfg.api_key,
    api_version=cfg.api_version,
    azure_endpoint=cfg.endpoint
)

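def chat_complete(msgs: List[Dict[str,str]], temp: float = 0.2, max_tokens: int = 800) -> Tuple[str, float]:
    """Send one chat request to the configured deployment; return (reply text, latency in seconds)."""
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=cfg.deployment,  # on Azure, `model` takes the deployment name
        messages=msgs,
        temperature=temp,
        max_tokens=max_tokens,
    )
    # message.content may be None, so fall back to an empty string
    txt = resp.choices[0].message.content or ""
    return txt, (time.perf_counter() - t0)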
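
`chat_complete` makes a single attempt, so a transient failure such as a rate limit or timeout would abort the whole batch. One possible mitigation is a small retry wrapper with exponential backoff, sketched below. `with_retries` and its parameters are hypothetical, and catching every `Exception` is a deliberate simplification; in practice the except clause should be narrowed to the client library's transient errors.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); after a failure wait base_delay * 2**k seconds and retry, up to `attempts` calls."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for k in range(attempts):
        try:
            return fn()
        except Exception:
            if k == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** k))
    raise AssertionError("unreachable")

# Demo with a fake call that fails twice, then succeeds; sleeps are recorded, not performed.
calls = {"n": 0}
delays: List[float] = []

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In this notebook the wrapper could be used as `with_retries(lambda: chat_complete(det_msgs, temp=0.2, max_tokens=500))`. The latency returned by `chat_complete` would then cover only the successful attempt, not the waiting time.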

### Main Processing

This section runs the core workflow. It loops through each dataset, classifies submissions as Human/AI/Hybrid, generates rubric-based feedback, and prints the results with response times.
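
The workflow prints its results but does not aggregate them. As a hedged sketch of how detection accuracy and latency could be summarized for model comparison, the snippet below assumes each run has been collected as a record with `true`, `pred`, and `latency` fields. That record layout, the helper `summarize`, and the toy values are all assumptions, not results from this experiment.

```python
from statistics import mean
from typing import Any, Dict, List

LABELS = ("Human", "AI", "Hybrid")

def summarize(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute overall and per-label detection accuracy plus mean latency."""
    per_label = {lbl: {"correct": 0, "total": 0} for lbl in LABELS}
    for r in records:
        if r["true"] not in per_label:
            continue  # no usable ground truth (e.g., "NA")
        per_label[r["true"]]["total"] += 1
        if r["pred"] == r["true"]:
            per_label[r["true"]]["correct"] += 1
    total = sum(v["total"] for v in per_label.values())
    correct = sum(v["correct"] for v in per_label.values())
    return {
        "accuracy": correct / total if total else None,
        "per_label": per_label,
        "mean_latency_s": mean(r["latency"] for r in records) if records else None,
    }

# Toy records for illustration only.
records = [
    {"true": "AI", "pred": "AI", "latency": 2.0},
    {"true": "Human", "pred": "AI", "latency": 3.0},
    {"true": "Hybrid", "pred": "Hybrid", "latency": 4.0},
    {"true": "NA", "pred": "Human", "latency": 1.0},
]
summary = summarize(records)
print(summary)
```

A prediction that could not be parsed (stored as `None`) counts as incorrect here, which penalizes replies that ignore the requested output format.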

In [22]:
# =============================
# Main Processing
# =============================
def main():
    run_started = now_ts()
    print(f"=== Batch started: {run_started} ===\n")
    for path in DATASETS:
        data = load_json(path)
        domain = data["domain"]
        rubric = data["rubric"]
        subs   = data["submissions"]
        rubric_text = format_rubric(rubric)
        shots = pick_few_shots(subs, MAX_EXAMPLES)

        print(f"\n==============================")
        print(f"DATASET: {path} | Domain: {domain}")
        print(f"Submissions: {len(subs)}")
        print(f"==============================\n")

        for i, sub in enumerate(subs, 1):
            text = sub["final_submission"]
            true = sub.get("label_type", "NA")

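            # --- Detection ---
            # Draw few-shot examples from the other submissions only, so the ground-truth
            # label of the submission under test never appears in its own prompt.
            det_shots = pick_few_shots([s for s in subs if s is not sub], MAX_EXAMPLES)
            det_msgs = build_detection_prompt(text, det_shots)
            det_text, det_lat = chat_complete(det_msgs, temp=0.2, max_tokens=500)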

            # --- Feedback ---
            fb_msgs  = build_feedback_prompt(domain, data["prompt"], rubric_text, text)
            fb_text, fb_lat = chat_complete(fb_msgs, temp=0.3, max_tokens=1200)

            # --- Print structured output ---
            print(f"----- [{domain}] Submission #{i} -----")
            print(f"Ground truth label (if any): {true}")
            print(f"\n[DETECTION] (latency ~{det_lat:.2f}s)")
            print(det_text.strip())
            print(f"\n[FEEDBACK] (latency ~{fb_lat:.2f}s)")
            print(fb_text.strip())
            print("-" * 40 + "\n")

    print(f"\n=== Batch finished: {now_ts()} ===")

if __name__ == "__main__":
    main()

=== Batch started: 2025-09-01T01:37:34 ===


DATASET: engineering.json | Domain: Manufacturing Engineering
Submissions: 6

----- [Manufacturing Engineering] Submission #1 -----
Ground truth label (if any): AI

[DETECTION] (latency ~2.12s)
Label: AI  
Rationale:
- The submission displays highly structured, formal, and technical language with consistent style throughout, characteristic of AI-generated text.
- It includes comprehensive coverage of the topic with precise terminology (e.g., "process flow diagrams," "OEE," "FMEA," "Kaizen," "5S audits") and lacks personal context, subjectivity, or anecdotal elements.
- The discourse is globally and locally coherent but somewhat generic, with no signs of personal insight or experience, and uses common AI patterns such as exhaustive listing and smooth transitions.
Flags: High technical density, absence of personal or subjective elements, polished and encyclopedic tone.

[FEEDBACK] (latency ~4.71s)
1) Overall Summary:
The submission provides a 

# End of Notebook