# Phase 1: DeepSeek R1 Model Selection
#### Author: Arnav Ahuja (223271095)

## Objective
This notebook is part of Phase 1 of the LLM Team’s experiment plan, evaluating DeepSeek-R1 on our base prompts and multi-domain datasets. The aim is to assess model suitability for two core tasks—AI/Human/Hybrid detection and rubric-aligned feedback—and to record latency and output quality for later model comparisons across phases.


## Azure API Integration
We integrate with Azure AI Inference using the azure-ai-inference SDK and our enterprise deployment endpoint. This provides governed access, reproducible runs, and clear cost/latency tracking while aligning with our security and compliance requirements. (The GPT-4.1 notebook uses Azure OpenAI; here we mirror the same experiment flow via Azure AI Inference for DeepSeek-R1.)


## Experimental Goal
The experimental goal is to run identical detection and feedback prompts over student submissions drawn from engineering, accounting, IT, psychology, and teaching datasets. This gathers evidence on how accurately and consistently the model distinguishes human, AI, and hybrid submissions, and on how useful and rubric-aligned its feedback is. Response time is measured for every call; token consumption is also part of the plan, although the processing loop in this notebook prints latency only. The information collected here is intended to become the basis for comparing DeepSeek-R1 with other models in subsequent phases.

In [None]:
%pip install azure-ai-inference

### Configuration
The configuration section declares the Azure endpoint, API key, API version, and model name (DeepSeek-R1). It also lists the dataset JSON files, each of which contains a prompt, a rubric, and submissions, along with a MAX_EXAMPLES parameter that caps the number of few-shot examples included in each detection prompt. This mirrors the GPT-4.1 setup and ensures consistent dataset handling across both experiments.

In [None]:
import os, json, re, time, csv, dataclasses
from typing import List, Dict, Any, Tuple, Optional

# Azure AI Inference SDK
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# -----------------------------
# CONFIG
# -----------------------------
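# Read the endpoint and key from the environment when available, falling back
# to the placeholders. The environment variable names are this notebook's own
# convention, not something Azure requires.
AZURE_AI_ENDPOINT     = os.environ.get("AZURE_AI_ENDPOINT", "<YOUR_ENDPOINT>")
AZURE_AI_API_KEY      = os.environ.get("AZURE_AI_API_KEY", "<YOUR_API_KEY>")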
AZURE_AI_API_VERSION  = "2024-05-01-preview"
AZURE_AI_MODEL_NAME   = "DeepSeek-R1"

DATASETS = [
    "engineering.json",
    "accounting.json",
    "it.json",
    "psychology.json",
    "teaching.json"
]
MAX_EXAMPLES = 3

### Helper Functions
To streamline execution, several helper functions are defined. These include loaders for JSON files, a timestamp generator for batch logging, a few-shot picker that balances human, AI, and hybrid labels, and a rubric formatter that compacts structured criteria into text. These utilities keep the main experiment loop clear and focused on evaluation logic, making it easier to maintain and reproduce results.

In [4]:
# =============================
# Helpers
# =============================
def load_json(p: str) -> Dict[str, Any]:
    with open(p, "r", encoding="utf-8") as f:
        return json.load(f)

def now_ts() -> str:
    import datetime as dt
    return dt.datetime.now().isoformat(timespec="seconds")

def pick_few_shots(subs: List[Dict[str,Any]], max_examples:int=3) -> List[Dict[str,Any]]:
    """
    Prioritize one example from each label (Human, AI, Hybrid) if available, then fill up.
    """
    buckets = {"Human": [], "AI": [], "Hybrid": []}
    for s in subs:
        label = str(s.get("label_type", "")).strip()
        if label in buckets:
            buckets[label].append(s)
    shots: List[Dict[str,Any]] = []
    for lbl in ["Human","AI","Hybrid"]:
        if buckets[lbl]:
            shots.append(buckets[lbl][0])
    for s in subs:
        if len(shots) >= max_examples:
            break
        if s not in shots:
            shots.append(s)
    return shots[:max_examples]

def format_rubric(r: Dict[str,Any]) -> str:
    out = [f"Rubric ID: {r.get('rubric_id','unknown')}", "Criteria:"]
    for c in r.get("criteria", []):
        out.append(f"- {c.get('criterion_id')}: {c.get('name')}")
        out.append(f"  {c.get('description')}")
    return "\n".join(out)
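
When the few-shot pool and the evaluated submissions come from the same list, a submission can end up among its own labelled examples, which would leak the answer into the detection prompt. The following is a minimal sketch of one way to avoid that. The function name `pick_few_shots_excluding` and the `id` field in the toy data are hypothetical and not part of the original helpers.

```python
from typing import Any, Dict, List

LABELS = ("Human", "AI", "Hybrid")


def pick_few_shots_excluding(
    subs: List[Dict[str, Any]],
    target: Dict[str, Any],
    max_examples: int = 3,
) -> List[Dict[str, Any]]:
    """Pick label-balanced examples, never returning the submission under evaluation."""
    # Compare by identity so distinct submissions with equal content stay eligible.
    pool = [s for s in subs if s is not target]
    shots: List[Dict[str, Any]] = []

    # First pass: at most one example per label, in a fixed label order.
    for lbl in LABELS:
        match = next(
            (s for s in pool if str(s.get("label_type", "")).strip() == lbl),
            None,
        )
        if match is not None and len(shots) < max_examples:
            shots.append(match)

    # Second pass: fill any remaining slots in dataset order.
    for s in pool:
        if len(shots) >= max_examples:
            break
        if not any(s is chosen for chosen in shots):
            shots.append(s)
    return shots


# Toy data for illustration only (not taken from the project datasets).
subs = [
    {"id": 1, "label_type": "Human", "final_submission": "a"},
    {"id": 2, "label_type": "AI", "final_submission": "b"},
    {"id": 3, "label_type": "Hybrid", "final_submission": "c"},
    {"id": 4, "label_type": "AI", "final_submission": "d"},
]
shots_for_second = pick_few_shots_excluding(subs, subs[1], max_examples=3)
print([s["id"] for s in shots_for_second])
```

With this variant the few-shot set is rebuilt per submission instead of once per dataset, which costs a little extra work but keeps each evaluated text out of its own prompt.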


### Prompt Builders

This section defines the reusable instructions we send to the model for two tasks—classification and feedback—so every run is consistent, comparable, and easy to read.

#### Detection Prompt (Academic Integrity)

Frames the model as a careful classifier to decide whether a submission is Human, AI, or Hybrid. It encourages brief, evidence-based reasoning and returns a simple, structured text block with a label, short rationale bullets, and any notable flags (e.g., style inconsistency or generic phrasing).

#### Feedback Prompt (Rubric-Aligned)

Guides the model to act as a supportive assessor. It produces a concise, structured review that includes an overall summary, per-criterion feedback (rating, brief reasons, and one concrete improvement tip), and an overall rating—making results easy to compare across submissions and rubrics.

In [5]:
# =============================
# Prompt Builders
# =============================
SYSTEM_PROMPT = "You are a careful academic assistant. Be precise and give clear structured output (not JSON, not CSV, no files)."

def build_detection_prompt(submission: str, few_shots: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """
    Academic Integrity Detector Prompt
    ----------------------------------
    Purpose:
        Classifies student submissions as Human, AI, or Hybrid (AI-assisted).

    Technique:
        - Role-based prompting
        - Few-shot support
        - CoT (reasoning encouraged but hidden from output)
        - Output in plain text
    """
    # Build few-shot block
    shot_texts = []
    for s in few_shots:
        shot_texts.append(
            f'Submission: """{s.get("final_submission","")}"""\n'
            f'Your analysis (2–4 bullet points): <analysis>\n'
            f'Label: {s.get("label_type","")}\n'
        )
    examples_block = "\n\n".join(shot_texts) if shot_texts else "/* no examples available */"

    user = f"""
You are an AI text-source classifier for academic integrity.
Decide whether the student submission is Human, AI, or Hybrid (AI-assisted).

Guidelines:
- Consider discourse features (specificity, subjectivity, personal context), style consistency, local/global coherence, repetitiveness, and cliché patterns.
- Hybrid = meaningful human writing with some AI assistance, or explicit admission of mixed use.

Examples:
{examples_block}

Now analyze the NEW submission and respond in plain text with the following structure:
Label: ...
Rationale:
- point 1
- point 2
Flags: ...
NEW submission:
\"\"\"{submission}\"\"\"\n
"""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

def build_feedback_prompt(domain: str, assignment_prompt: str, rubric_text: str, submission: str) -> List[Dict[str, str]]:
    """
    Rubric-Aligned Feedback Prompt
    ------------------------------
    Purpose:
        Generates structured, supportive feedback for a student submission.
    """
    user = f"""
You are a supportive assessor. Provide actionable feedback aligned to the rubric.
Return plain structured text only (no JSON, no files).

Sections to include:
1) Overall Summary: 2–4 sentences on strengths and priorities.
2) Criteria Feedback: for each rubric criterion include:
   - Criterion
   - Rating (excellent, good, average, needs_improvement, poor)
   - Reason (1–3 bullet points citing reasoning)
   - Improvement Tip (one concrete step)
3) Overall Rating: Excellent | Good | Average | Needs Improvement | Poor

Context:
- Domain: {domain}
- Assignment prompt: {assignment_prompt}

Rubric (verbatim):
{rubric_text}

Student submission:
\"\"\"{submission}\"\"\"\n
"""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
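
Because the detection prompt asks for a fixed `Label:` / `Rationale:` / `Flags:` layout, the reply can be turned back into fields for scoring. The sketch below assumes the model follows that layout reasonably closely; `parse_detection_output` is a hypothetical helper, and the sample reply is invented for illustration.

```python
import re
from typing import Dict, List, Optional

VALID_LABELS = {"human": "Human", "ai": "AI", "hybrid": "Hybrid"}


def parse_detection_output(text: str) -> Dict[str, object]:
    """Extract the label, rationale bullets, and flags from a plain-text reply."""
    # Ignore anything before a closing </think> tag so that labels mentioned
    # during the model's reasoning are not mistaken for the final answer.
    text = text.rsplit("</think>", 1)[-1]

    label: Optional[str] = None
    m = re.search(r"^\s*Label:\s*\**\s*([A-Za-z]+)", text, flags=re.MULTILINE)
    if m:
        label = VALID_LABELS.get(m.group(1).lower())

    rationale: List[str] = []
    flags = ""
    in_rationale = False
    for line in text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("rationale:"):
            in_rationale = True
        elif lowered.startswith("flags:"):
            in_rationale = False
            flags = stripped[len("flags:"):].strip()
        elif in_rationale and stripped.startswith("-"):
            rationale.append(stripped.lstrip("- ").strip())
    return {"label": label, "rationale": rationale, "flags": flags}


# Invented reply, used only to show the expected shape.
reply = """Label: Hybrid
Rationale:
- Personal anecdote in the introduction
- Generic phrasing in the conclusion
Flags: style inconsistency
"""
parsed = parse_detection_output(reply)
print(parsed["label"], len(parsed["rationale"]), parsed["flags"])
```

A `label` of `None` signals that the reply did not contain a recognisable label, which is worth counting separately from a wrong prediction.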


### Azure Client

The client for Azure AI Inference is wrapped in a simple class that holds configuration details and exposes a function for chat completions. Messages are converted into the correct Azure message format, requests are sent, and latency is captured. This mirrors the GPT-4.1 pattern but adapts it for the DeepSeek deployment, providing consistency in both structure and reporting.

In [6]:
# =============================
# Azure AI Inference Client (DeepSeek)
# =============================
@dataclasses.dataclass
class AzureAICfg:
    endpoint: str
    api_key: str
    api_version: str
    model_name: str

# The placeholder in CONFIG is "<YOUR_API_KEY>", so treat any "<...>" value as unset.
if not AZURE_AI_API_KEY or AZURE_AI_API_KEY.startswith("<"):
    raise RuntimeError(
        "Missing AZURE_AI_API_KEY. Set it via env var or paste it once in AZURE_AI_API_KEY."
    )

cfg = AzureAICfg(
    endpoint=AZURE_AI_ENDPOINT,
    api_key=AZURE_AI_API_KEY,
    api_version=AZURE_AI_API_VERSION,
    model_name=AZURE_AI_MODEL_NAME
)

client = ChatCompletionsClient(
    endpoint=cfg.endpoint,
    credential=AzureKeyCredential(cfg.api_key),
    api_version=cfg.api_version
)

def _to_azure_messages(msgs: List[Dict[str, str]]):
    """
    Convert [{'role': 'system'|'user'|'assistant', 'content': '...'}]
    into [SystemMessage(...), UserMessage(...)] for Azure AI Inference.
    (We map 'assistant' to UserMessage for simple parity since we don't
    need prior assistant turns here; adjust if you add multi-turn context.)
    """
    out = []
    for m in msgs:
        role = m.get("role", "user")
        content = m.get("content", "")
        if role == "system":
            out.append(SystemMessage(content=content))
        else:
            out.append(UserMessage(content=content))
    return out

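def chat_complete(msgs: List[Dict[str,str]], temp: float = 0.2, max_tokens: int = 800) -> Tuple[str, float]:
    """
    Send one chat request and return (reply text, wall-clock latency in seconds).
    The reply is returned unmodified, so for DeepSeek-R1 it can begin with a
    <think>...</think> reasoning block.
    """
    azure_msgs = _to_azure_messages(msgs)
    t0 = time.perf_counter()  # time the request only, not the message conversion
    resp = client.complete(
        messages=azure_msgs,
        model=cfg.model_name,
        max_tokens=max_tokens,
        temperature=temp,  # if your deployment doesn't accept this, remove it
    )
    latency = time.perf_counter() - t0
    txt = (resp.choices[0].message.content or "") if resp.choices else ""
    return txt, latency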
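
DeepSeek-R1 replies in this notebook start with a `<think>` block, so it can help to separate the reasoning from the final answer before printing or parsing. The sketch below also reads token counts from the response. It assumes the response object exposes a `usage` attribute with `prompt_tokens`, `completion_tokens`, and `total_tokens`, and falls back to `None` if it does not. Both helper names are hypothetical.

```python
import re
from types import SimpleNamespace
from typing import Any, Dict, Optional, Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) for a reply that may contain <think> blocks."""
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text)
    # An opening tag without a closing tag means the reply was cut off while
    # the model was still reasoning, so everything after it is reasoning too.
    if "<think>" in answer:
        head, _, tail = answer.partition("<think>")
        reasoning = (reasoning + "\n" + tail.strip()).strip()
        answer = head
    return reasoning, answer.strip()


def usage_dict(resp: Any) -> Dict[str, Optional[int]]:
    """Read token counts defensively; missing fields come back as None."""
    usage = getattr(resp, "usage", None)
    keys = ("prompt_tokens", "completion_tokens", "total_tokens")
    return {k: getattr(usage, k, None) for k in keys}


# Stand-in objects for illustration; a real call would pass the SDK response.
reasoning, answer = split_reasoning("<think>step one</think>\nLabel: AI")
fake_resp = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=10, completion_tokens=5, total_tokens=15)
)
print(answer)
print(usage_dict(fake_resp))
```

An empty answer together with non-empty reasoning is a useful signal that `max_tokens` ran out before the model reached its final output.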

### Main Processing
The main processing loop handles each dataset in turn. It loads the prompt, rubric, and submissions, builds a label-balanced few-shot set, and runs the detection and feedback prompts on every submission. For each submission it prints the ground-truth label, the detection output, and the feedback, each model call with its measured latency, separated by clear markers; the batch start and end are timestamped. This consistent structure enables easy log inspection and later aggregation into comparative analyses across domains and models.
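
For the later aggregation mentioned above, printed logs are harder to work with than rows. The following is a minimal sketch of collecting one row per submission and scoring detection accuracy. The column names and both helper functions are hypothetical, and the rows are made-up placeholders, not experiment results.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional

FIELDS = [
    "domain", "submission_idx", "true_label",
    "pred_label", "det_latency_s", "fb_latency_s",
]
KNOWN_LABELS = {"Human", "AI", "Hybrid"}


def write_results(rows: Iterable[Dict[str, object]], fh) -> None:
    """Write result rows as CSV to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def accuracy(rows: List[Dict[str, object]]) -> Optional[float]:
    """Share of correct predictions among rows that have a known true label."""
    scored = [r for r in rows if r["true_label"] in KNOWN_LABELS]
    if not scored:
        return None
    # An unparsed prediction (None) simply counts as incorrect.
    correct = sum(r["pred_label"] == r["true_label"] for r in scored)
    return correct / len(scored)


# Made-up rows for illustration only.
rows = [
    {"domain": "demo", "submission_idx": 1, "true_label": "AI",
     "pred_label": "AI", "det_latency_s": 1.0, "fb_latency_s": 2.0},
    {"domain": "demo", "submission_idx": 2, "true_label": "Human",
     "pred_label": "Hybrid", "det_latency_s": 1.0, "fb_latency_s": 2.0},
    {"domain": "demo", "submission_idx": 3, "true_label": "NA",
     "pred_label": "AI", "det_latency_s": 1.0, "fb_latency_s": 2.0},
]
buf = io.StringIO()  # swap for open("results.csv", "w", newline="") to persist
write_results(rows, buf)
print(buf.getvalue().splitlines()[0])
print(accuracy(rows))
```

Rows without a ground-truth label are written to the CSV but left out of the accuracy figure, so unlabeled submissions do not distort the score.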

In [7]:
# =============================
# Main Processing
# =============================
def main():
    run_started = now_ts()
    print(f"=== Batch started: {run_started} ===\n")
    for path in DATASETS:
        data = load_json(path)
        domain = data["domain"]
        rubric = data["rubric"]
        subs   = data["submissions"]
        rubric_text = format_rubric(rubric)
        shots = pick_few_shots(subs, MAX_EXAMPLES)

        print(f"\n==============================")
        print(f"DATASET: {path} | Domain: {domain}")
        print(f"Submissions: {len(subs)}")
        print(f"==============================\n")

        for i, sub in enumerate(subs, 1):
            text = sub["final_submission"]
            true = sub.get("label_type", "NA")

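            # --- Detection ---
            # DeepSeek-R1 returns its <think> reasoning inside the reply, so the
            # reasoning likely counts toward max_tokens; raise the limit if the
            # final Label/Rationale/Flags block is cut off.
            det_msgs = build_detection_prompt(text, shots)
            det_text, det_lat = chat_complete(det_msgs, temp=0.2, max_tokens=500)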

            # --- Feedback ---
            fb_msgs  = build_feedback_prompt(domain, data["prompt"], rubric_text, text)
            fb_text, fb_lat = chat_complete(fb_msgs, temp=0.3, max_tokens=1200)

            # --- Print structured output ---
            print(f"----- [{domain}] Submission #{i} -----")
            print(f"Ground truth label (if any): {true}")
            print(f"\n[DETECTION] (latency ~{det_lat:.2f}s)")
            print(det_text.strip())
            print(f"\n[FEEDBACK] (latency ~{fb_lat:.2f}s)")
            print(fb_text.strip())
            print("-" * 40 + "\n")

    print(f"\n=== Batch finished: {now_ts()} ===")

if __name__ == "__main__":
    main()

=== Batch started: 2025-09-03T09:38:42 ===


DATASET: engineering.json | Domain: Manufacturing Engineering
Submissions: 6

----- [Manufacturing Engineering] Submission #1 -----
Ground truth label (if any): AI

[DETECTION] (latency ~74.34s)
<think>
Okay, let's analyze this submission. The user wants to determine if it's Human, AI, or Hybrid.

First, looking at the discourse features. The text is very specific with technical terms like "process flow diagrams," "value stream maps," "takt time," "OEE (Overall Equipment Effectiveness)," and "FMEA." These are precise industry terms, which might indicate AI-generated content since humans might not list so many acronyms without explanation. Also, the structure is very methodical, each phase is laid out in a logical sequence without any personal anecdotes or subjective statements. 

Next, style consistency. The entire submission maintains a formal, technical tone throughout. Each sentence starts with a clear subject and follows a similar patter