# Week 3 Lab: Prompting for Effective LLM Use

Focus: clear rules, roles (system vs. user), separating data from instructions, formatting outputs for parsing, two-step prompting, and a small optional EDA demo.

Default model: gemini-2.5-flash-live (fast and free).

## 0) Setup

- pip install google-generativeai python-dotenv pandas numpy seaborn matplotlib scipy

Put your API key in .env as GEMINI_API_KEY=... and restart the kernel if needed.

In [None]:
from __future__ import annotations

import os, json, textwrap
from typing import Any, Dict

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

from dotenv import load_dotenv
load_dotenv()

GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
assert GEMINI_API_KEY, 'Please set GEMINI_API_KEY in your environment or .env file.'

import google.generativeai as genai
genai.configure(api_key=GEMINI_API_KEY)

MODEL_NAME = os.getenv('GEMINI_MODEL', 'gemini-2.5-flash')
GEN_CONFIG = genai.GenerationConfig(temperature=0.3, max_output_tokens=800)
GEN_CONFIG_EXTENDED = genai.GenerationConfig(temperature=0.3, max_output_tokens=5000)

model_default = genai.GenerativeModel(MODEL_NAME)
def make_model(system_instruction: str | None = None):
    return genai.GenerativeModel(MODEL_NAME, system_instruction=system_instruction)


## 1) Clear rules for prompting

- Be specific: ask only for what you need; set scope and length.
- State a role: use a system instruction for role and constraints.
- Separate parts: instructions, data, and output format.
- Format outputs: require strict JSON or a fenced code block.
- Acceptance criteria: add a checklist and self-verify.
- Few-shot (minimal): one short example can anchor style.

In [None]:
# --- Bad Prompt: Vague and unstructured ---
bad_prompt = "Tell me about large language models."
print("--- Bad Prompt ---")
print(model_default.generate_content(bad_prompt, generation_config=GEN_CONFIG_EXTENDED).text)


# --- Good Prompt: Clear, role-based, and structured ---
good_prompt = textwrap.dedent('''
    You are an expert technical writer.
    Explain the concept of "Large Language Models" to a university student.
    - Start with a one-sentence definition.
    - Provide three key bullet points on how they work.
    - Limit the total response to under 100 words.
''')
print("\n--- Good Prompt ---")
print(model_default.generate_content(good_prompt, generation_config=GEN_CONFIG).text)

### Exercise 
Improve the following vague prompt by applying at least three of the "clear rules" (e.g., add a role, specify the format, set constraints).

**Vague prompt:** `"Why should I learn data science?"`

## 2) Role prompting changes outcomes (system vs. user)

We keep the user prompt identical and only change the system instruction.

In [None]:
PROMPT = 'In one sentence, what do you think about skateboarding?'

resp_plain = model_default.generate_content(PROMPT, generation_config=GEN_CONFIG)
print('No system instruction:', resp_plain.text)

model_cat = make_model('You are a cat.')
resp_cat = model_cat.generate_content(PROMPT, generation_config=GEN_CONFIG)
print('System: You are a cat.', resp_cat.text)


### Exercise 
Using the same `PROMPT` about skateboarding, create a new model with the system instruction `"You are a worried parent."` and generate a response. How does it differ from the "cat" and default responses?

## 3) Separate data from instructions

Keep your instructions stable and swap data safely using delimiters.

In [None]:
instructions = textwrap.dedent('''
Summarize the data in one short sentence and produce 3 topical tags.
Return strict JSON with keys summary (str) and tags (list of str).
''').strip()
data_block = textwrap.dedent('''<data>
Skateboarding participation has risen globally over the past decade,
with growing inclusion in major events and broader demographics. During the last olympic games, skateboarding made its debut,
highlighting its increasing recognition as a competitive sport. The culture around skateboarding continues to evolve,
influencing fashion, music, and lifestyle trends worldwide.
</data>''').strip()

prompt = f'Instructions: {instructions}{data_block}'
resp = model_default.generate_content(prompt, generation_config=GEN_CONFIG)
text = resp.text
print(text)
# Strip markdown fences if the model added them
if text.strip().startswith("```json"):
    text = text.strip().removeprefix("```json\n").removesuffix("\n```")
try:
    parsed = json.loads(text)
    print('Parsed:', parsed)
except Exception as e:
    print('Parsing failed; consider asking the model to fix format.')


### Exercise 
Use the `instructions` variable from the example above, but replace the `data_block` with a new paragraph about a topic of your choice. Verify that the model still follows the instructions and produces the correct JSON output.

## 4) Formatting outputs for parsing

Ask for strict JSON and validate with json.loads. If parsing fails, ask the model to fix the format.

In [None]:
schema = 'title: str; bullets: list[str]'
prompt = (
    'System: Return strict JSON with keys title (str) and bullets (list[str]).'
    'User:'
    '- Task: Summarize why clear prompts matter for data science.'
    '- Length: Title + 3 bullets.'
    f'- Output: {schema}'
)
resp = model_default.generate_content(prompt, generation_config=GEN_CONFIG)
text = resp.text or ''
print(text)

# Strip markdown fences if the model added them
if text.strip().startswith("```json"):
    text = text.strip().removeprefix("```json\n").removesuffix("\n```")

try:
    obj = json.loads(text)
    print('Valid JSON with', len(obj.get('bullets', [])), 'bullets')
except json.JSONDecodeError as e:
    print('JSON parse error:', e)


### Exercise 
Modify the prompt to request a different JSON schema. Ask the model to return an object with two keys: `topic` (a string) and `key_takeaways` (a list of exactly two strings).

## 5) Two-step prompting (prompt-writer)

Use the model to craft the prompt first, then run that prompt.

In [None]:
task = textwrap.dedent('''
Write exactly 5 concise bullets for an EDA plan on a churn dataset
with columns: age, tenure_months, monthly_charges, contract_type, churn.
''').strip()

prompt_writer = textwrap.dedent('''
You are a prompt engineer. Draft a clear, minimal prompt for another LLM to:
- Role: senior data scientist
- Task: {TASK}
- Output: exactly 5 bullets in plain text
- Constraints: be specific, no extra commentary
Return ONLY the prompt text that I can paste into the other model.
''').format(TASK=task)

response = model_default.generate_content(prompt_writer, generation_config=GEN_CONFIG_EXTENDED)
draft = response.text if response.text else "Error: No response generated"
print('--- Drafted prompt ---')
print(draft)

if draft != "Error: No response generated":
	response = model_default.generate_content(draft, generation_config=GEN_CONFIG_EXTENDED)
	result = response.text if response.text else "Error: No response generated"
	print('\n--- Result ---')
	print(result)


### Exercise
Use the two-step "prompt-writer" pattern for a different task. First, have the model generate a prompt that asks another LLM to write a four-line poem about Marseille. Then, execute the generated prompt.

## 6) Optional: quick EDA demo (code generation)

We generate a small dataset and ask the model for a minimal analyze(df) function.
We require ONLY raw Python code (no fences) to simplify execution.

In [None]:
rng = np.random.default_rng(42)
n = 600
age = rng.integers(18, 75, size=n)
tenure = rng.integers(1, 72, size=n)
monthly = rng.normal(65, 20, size=n).clip(5, 200)
contract_type = rng.choice(['month-to-month','one-year','two-year'], size=n, p=[0.55,0.25,0.20])
p_churn = 1/(1+np.exp(-(0.35 - 0.003*tenure - 0.10*(contract_type=='two-year') - 0.05*(contract_type=='one-year') + 0.0015*(monthly-65))))
churn = rng.binomial(1, p_churn)
df = pd.DataFrame({'age': age, 'tenure_months': tenure, 'monthly_charges': monthly, 'contract_type': contract_type, 'churn': churn})
df.head()


In [None]:
eda_prompt = (
    'Return ONLY Python code (no fences) defining a function `def analyze(df):`. '
    'Inside the function: '
    '1) Use pandas, seaborn, and matplotlib to perform a basic EDA. '
    '2) Print the results of df.describe(). '
    '3) Create two interesting plots (e.g., a histogram and a boxplot) to visualize the data. '
    '4) Do NOT call plt.show() or plt.close() inside the function. '
    '5) The function should not return any value. '
)

response = model_default.generate_content(eda_prompt, generation_config=GEN_CONFIG_EXTENDED, safety_settings=safety_settings)

# Check if the response was blocked before proceeding
if not response.candidates:
    print("Code generation failed. The response was blocked.")
    print("Reason:", response.prompt_feedback)
else:
    try:
        eda_code = response.text
        # Strip markdown fences if the model added them
        if eda_code.strip().startswith("```python"):
            eda_code = eda_code.strip().removeprefix("```python\n").removesuffix("\n```")
        # Add a print statement to debug the generated code
        print("--- Generated Code ---")
        print(eda_code)
        print("----------------------")

        # The sandbox provides the allowed modules.
        # We  allow standard functions.
        sandbox = {'pd': pd, 'np': np, 'sns': sns, 'plt': plt, 'stats': stats}
        exec(compile(eda_code, '<generated>', 'exec'), sandbox, sandbox)
        
        assert 'analyze' in sandbox, 'The generated code did not define the "analyze" function.'
        
        result = sandbox['analyze'](df)
        print("\n--- Analysis Results ---")
        print({k: ('ok' if not isinstance(v, (int, float, str)) else v) for k, v in result.items()})
        plt.show()
    except Exception as e:
        print(f"\nAn error occurred while executing the generated code: {e}")


### Exercice
Play with the above code generation model. How far can you push it?

### Exercice
Use the techniques of your choice to analyse the dataset `\data\weightHeightSex.txt`.  