In [None]:
# Run this cell, don't change anything.
from datascience import *
import numpy as np

from IPython.display import HTML, display
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

import warnings
warnings.filterwarnings('ignore')
%reload_ext autoreload
%autoreload 2

from bs4 import BeautifulSoup
import re
import requests
import seaborn as sns
import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Project Part B: Introduction

The full specification for this project is in Google Docs:
* [Final Project: Part A](https://docs.google.com/document/d/11fLAJmCwJT55pWUVhHQQ2YXkPtPcGrH7Bmn0DBWM85M/edit?usp=sharing)
* [**Final Project: Part B**](https://docs.google.com/document/d/1xT-5wWedzn2U85U3GpnE40SFskxX-hNzwOGc67Asw8o/edit?usp=sharing)

The full description of each dataset is in Google Docs:
* [CSS Tasks](https://docs.google.com/document/d/1gWbpm0qHUWHySuj9rq2gl1CvN9BgcQDUQLTl2hYLlsk/edit?usp=sharing)

This Jupyter notebook contains code cells that should accompany your Final Project Part B PDF submission. You should submit this notebook (and associated files). 

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Gemini Setup: Installation, API Key

Run this section before running question cells. 

1. In `api_key.py`, set `my_client_access_token` to be the Gemini API key that we shared with you through email. Follow the corresponding instructions in the [Data 6 Notes](https://data6.org/notes/18-html/genius.html) to navigate to the `api_key.py` file and edit it.

2. After you have updated `api_key.py` (the file within this assignment directory), running the below cells should install Google Python packages, load your API token into `GOOGLE_API_KEY`, and create a Gemini Client.


<div class="alert alert-block alert-warning">

_**Important Note**_:

 Please DO NOT share your API key outside of this class. We will disable your API key (1) if you misuse it and/or exceed the free usage tier during the project, and (2) for all students after the semester has ended. If you'd like to play around with your code after the term, you'll have to get your own API key. Ask us how to do this!

</div>

In [None]:
# just run this cell
!pip install google-genai

In [None]:
# before running this cell, make sure that you have updated api_key.py with your API key
%reload_ext autoreload
%autoreload 2

import api_key
GOOGLE_API_KEY = api_key.my_client_access_token

In [None]:
# just run this cell. it makes a small API request for testing.
from google import genai

# first, create the client
client = genai.Client(api_key=GOOGLE_API_KEY)

# then make the API request
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain how AI works in a few words"
)
print(response.text)

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Part 1: Zero-Shot Prompting

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 1: Task Shortname

No code.

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 2: Zero-shot Prompt Engineering

Develop a zero-shot prompt to code (i.e., label/categorize) the 30 records in `<shortname>_val_uncoded.csv` using an LLM, where `<shortname>` is the shortname of your CSS task. Then, use Gemini's Python API to directly query the Gemini 2.5 Flash model with this prompt to generate AI-assisted codes, one per record in your dataset.

To avoid threats to validity, your zero-shot prompt should **not** include direct references to any "gold labels." Additionally, you should **not** include any direct references to the Coding Strategy you developed for human-coding in Part A. The goal is to write a prompt as close as possible to the one described in Ziems et al.

Additional prompting guidelines:
* Gemini API: Prompting Strategies
* Ziems et al., Table 1: LLM Prompting Guidelines to generate consistent, machine-readable outputs for CSS tasks.
* We strongly recommend that your prompt includes formatting instructions to cleanly output one code per record per line, so that you can quickly process these codes later. Some suggested ways to end prompts:
    * Provide me a CSV table of <A>, <B>, and <C>, where <A is…>, … . Ensure the output is a CSV table with three columns.
    * Provide an array of JSON objects with keys <A>, <B>, and <C>. <A is…>, … . Ensure the output is a JSON array of objects.
    * Provide me a list of <C>, where <C is …>, … Ensure the output is a list of … separated by newlines.
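
If you choose the newline-separated format, a small helper can turn the response text into a list and confirm you got one code per record. This is only a sketch: `parse_codes` is a hypothetical helper name, and the labels in `sample_text` are made up to stand in for `response.text`.

```python
def parse_codes(response_text, expected_count):
    """Split a newline-separated model response into a list of codes."""
    # drop blank lines and surrounding whitespace
    codes = [line.strip() for line in response_text.splitlines() if line.strip()]
    # fail loudly if the model skipped or invented records
    if len(codes) != expected_count:
        raise ValueError(f"expected {expected_count} codes, got {len(codes)}")
    return codes

# hypothetical response text, standing in for response.text
sample_text = "joy\nanger\n\nsadness\n"
print(parse_codes(sample_text, expected_count=3))
```

Checking the count right away makes it easier to notice when a prompt needs stricter formatting instructions.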

In the cells below, write code to make the appropriate API call. Some starter code is provided for you.

In [None]:
# edit these lines as needed

csv_fname = "datasets/emotion_val_uncoded.csv"
table_val_uncoded = Table.read_table(csv_fname)
arr = table_val_uncoded.column("text")

input_records = '\n'.join(arr)
print(input_records[:500]) # some preview

In [None]:
# edit these lines as needed

prompt = """Count the number of lines provided below. Return an integer."""
prompt_coda = """Make sure to format your output as an integer."""

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        prompt,
        '\n',
        input_records,
        '\n',
        prompt_coda,  # formatting reminder, placed after the records
    ]
)

print(response.text)


Then, in your Google Doc report, report your custom prompt, omitting input records.

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 3: Export all codes to a CSV

In the cell below, use Gemini's Python API to directly query the **Gemini 2.5 Pro model** with the exact prompt you wrote above to generate a second set of AI-assisted codes, one per record in the `<shortname>_val_uncoded.csv`.

See how to do this in the GenAI lab. The model name is `"gemini-2.5-pro"`.
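
One way to keep the prompt identical across both models is to wrap the request in a function and change only the model name. The sketch below assumes the `client`, `prompt`, and `input_records` variables from the earlier cells; `code_with_model` is a hypothetical helper name, not part of the Gemini API.

```python
def code_with_model(client, model_name, prompt, input_records):
    """Send the same prompt and records to the named model; return the response text."""
    response = client.models.generate_content(
        model=model_name,
        contents=[prompt, '\n', input_records, '\n'],
    )
    return response.text

# usage, assuming `client`, `prompt`, and `input_records` exist from the cells above:
# flash_text = code_with_model(client, "gemini-2.5-flash", prompt, input_records)
# pro_text = code_with_model(client, "gemini-2.5-pro", prompt, input_records)
```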

In [None]:
# your code here


Then, export a new CSV called `<shortname>_val_coded.csv` with all hand-coded and AI-coded labels. **Read the full description of this task (including CSV format) in the Google Doc.**

You can create this CSV by building a `datascience` table and saving it directly in DataHub; or by building a spreadsheet in, say, Google Sheets, exporting it to CSV, then uploading it to DataHub. See Final Project A's accompanying Jupyter Notebook (Part A.3) for how to do this.

Regardless of which approach you use, you may find it useful to use `datascience` Table methods to first find the corresponding gold labels for your records.

In [None]:
# your code here

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 4: Evaluate Performance Quantitatively

Compare the performance of these four strategies (two hand-coded, two AI-coded) by computing Cohen’s Kappa ($\kappa$) with respect to the human gold labels. Note that the human gold labels can still be unreliable, so computing Cohen’s Kappa is still applicable.

Notes:
* We have provided starter code that computes Cohen’s Kappa with the `cohen_kappa_score` function from the `sklearn` library. It assumes that you have correctly labeled and uploaded the CSV file in the previous part.
* It is okay to report low κ/agreement level in this part. This part is graded on completion.
* For each of the four strategies, $\kappa$ should be computed with respect to the human gold labels.
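
To see what the statistic measures, here is a minimal by-hand sketch of Cohen's Kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each coder's label frequencies. The function name `cohen_kappa` and the toy labels are illustrative assumptions; use `cohen_kappa_score` for your actual report.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(labels_a)
    # observed agreement: fraction of records where both coders match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of each coder's label proportions, summed over labels
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # note: undefined (division by zero) if p_e == 1, i.e., both coders use one identical label
    return (p_o - p_e) / (1 - p_e)

# toy labels, not from any project dataset
gold = ["cat", "ant", "cat", "cat", "ant", "bird"]
pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
print(cohen_kappa(gold, pred))  # p_o = 4/6, p_e = 15/36, so kappa = 3/7
```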

Complete this part by writing code to compute Cohen’s Kappa ($\kappa$) for each strategy in the cell below. Then, in your report, report the $\kappa$ **and** agreement level for each strategy.

In [None]:
# edit the below code as needed

from sklearn.metrics import cohen_kappa_score

# replace the below lines
# depending on where you uploaded it, could be "datasets/<filename>.csv"
coded_fname = "emotion_val_coded.csv" # replace this line
gold_label = "emotions" # replace this with the name of your gold label column

table_codes = Table.read_table(coded_fname)

# you should write additional code to compute kappa for other strategies
kappa_flash = cohen_kappa_score(table_codes.column("Gemini 2.5 Flash"),
                                table_codes.column(gold_label))
print("Gemini 2.5 Flash", kappa_flash)

...

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 5: Evaluate Performance Qualitatively

No code.

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Part 2: Context-Based, Few-Shot Prompting

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 6: Few-Shot Prompt Engineering

Develop a prompt to code (i.e., label/categorize) all records in `<shortname>.csv` using an LLM. Your prompt should include examples drawn from the example records in `<shortname>_train.csv`.

Notes:
* We suggest that your prompt include some of the examples provided in `<shortname>_train.csv`, including gold labels. You can also consider including components of your hand-coder Coding Strategy you developed in Part A.
    * To avoid threats to validity, your prompt should not include any direct references to other gold labels in the full `<shortname>.csv` dataset. 
* Important: We strongly **advise against** making one giant API request to code the entire dataset; your prompt will likely lose context quickly and make mistakes, and the request will be costly (both price-wise and time-wise).
    * Instead, choose a reasonable subset, e.g., **30 or 50 records**, to iterate rapidly through a good prompt.
    * Then construct the full set of dataset codes in the next question.
* You should complete this part using **Gemini 2.5 Flash**. Do **not** use Gemini 2.5 Pro; it is costly and will generate response times that are untenable for some tasks.
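
As a rough illustration of the structure, a few-shot prompt can be assembled from an instruction, a handful of labeled examples, and a formatting coda. Everything below is a hypothetical sketch: `build_few_shot_prompt` is an invented helper, and the example texts and labels are made up rather than drawn from any `<shortname>_train.csv`.

```python
def build_few_shot_prompt(instructions, examples, coda):
    """Combine instructions, (text, label) example pairs, and a formatting coda."""
    lines = [instructions, "", "Examples:"]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
    lines.append("")
    lines.append(coda)
    return "\n".join(lines)

# made-up examples; in practice, pick rows (with gold labels) from your train CSV
examples = [
    ("I can't believe we won!", "joy"),
    ("Why does this always happen to me?", "anger"),
]
few_shot_prompt = build_few_shot_prompt(
    "Label the emotion of each text below.",
    examples,
    "Ensure the output is a list of labels separated by newlines.",
)
print(few_shot_prompt)
```

The resulting string can take the place of `prompt` in the starter code, with the records to code passed separately in `contents`.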

In the cells below, write code to make the appropriate API call. Some starter code is provided for you.

In [None]:
# edit these lines as needed
# pulls small amount of records (e.g., 30) from the full dataset

csv_fname = "datasets/emotion.csv"
table_uncoded = Table.read_table(csv_fname)
arr = table_uncoded.column("text")
tiny_arr = arr[:30] # small amount of records

input_records = '\n'.join(tiny_arr)
print(tiny_arr)

In [None]:
# edit these lines as needed

prompt = """Count the number of lines provided below. Return an integer."""
prompt_coda = """Make sure to format your output as an integer."""

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        prompt,
        '\n',
        input_records,
        '\n',
        prompt_coda,  # formatting reminder, placed after the records
    ]
)

print(response.text)


Then, in your Google Doc report, report your custom prompt, omitting input records.

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 7: Export all codes to a CSV

Use Gemini’s Python API to directly query the Gemini 2.5 Flash model with the exact prompt you wrote above to generate a full set of AI-assisted codes for `<shortname>.csv`, one per record in your dataset. 

This question is challenging because you will need to design a strategy that will work for large sets of data—involving multiple API requests. But you will be able to do this task with the tools you have in this class. If you get stuck, definitely come by office hours or make an Ed post!!!

**Hints/Notes:** 

* This question is intentionally open-ended. That being said, see our recommended strategy below, and heed the following:
    * As above, we strongly advise against making one giant API request to code the entire dataset. Instead, **loop through** all records and iteratively code each subset of records. We recommend a subset size like 30 or 50.
    * Recall that with the Gemini API, the argument to `contents` can be a list of strings. In each iteration of your loop, you may find it useful to build a new list that includes your prompt string from before and a new subset of records to code (as a string).
    * Some of your datasets may be up to 20,000 records—this means that any loop will run **for a long time**, with or without prompting!
* It is alright to use Generative AI for this question, but you will be expected to understand your approach. Depending on your approach, during grading we may ask you follow-up questions about your code.


**Recommended strategy**:

1. Build the loop structure for a small part of your data, say, 500 records at most.
    1. First, write a loop that iterates through the subsets of records in your data. We recommend printing the length of these subsets, or the IDs, just as a sanity check. Let’s assume each subset has 50 records; for 500 records, your loop should then run 10 times.
    1. Next, add code to your loop that iteratively builds an array or list of outputs with each iteration. This will simulate the API response outputs that you will need to save with each iteration.
        * We recommend saving “dummy” values for now, just to check that everything works.
        * Additionally, we recommend just creating this array, then cleaning/post-processing in a different cell. See below.
    1. Then, add code to your loop that builds the right `contents` list, incorporating the prompt, for each iteration. Print the `contents` list in each iteration to double-check that it looks right.
    1. Finally, add code to your loop that makes the API call.
1. Run and check that the loop structure works for these 500 records. It may take a few minutes.
1. After you have your list of outputs, post-process the outputs in a separate cell, e.g., split into lines, make into an array, make into a column, save into the right CSV format, etc. Keeping the API calls in their own cell means you do not have to rerun all of the API calls just to do some string manipulation.
1. After you’ve verified that both your prompting **and** exporting cells work for the 500-record case, then edit your loop to run on the *entire dataset*.
    * If your dataset is not a clean multiple of your 50-record chunk, consider letting your loop deal with the general case, then separately process the last few records outside of the loop.
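
The first steps of the strategy above can be sketched with dummy values before any API call is added. This is one possible shape, not the required solution: `chunk_records` and `dummy_coder` are hypothetical names, and the 120 fake records are only there to show a final partial chunk. Because Python slices stop at the end of a sequence, this version handles a leftover chunk inside the loop instead of processing it separately.

```python
def chunk_records(records, chunk_size):
    """Yield consecutive slices of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]

def dummy_coder(chunk):
    """Stand-in for the API call: returns one dummy code per record, one per line."""
    return "\n".join("dummy" for _ in chunk)

# fake records, not from any project dataset
records = [f"record {i}" for i in range(120)]

raw_outputs = []  # one raw response string per chunk
for chunk in chunk_records(records, 50):
    print(len(chunk))  # sanity check on chunk sizes
    raw_outputs.append(dummy_coder(chunk))

# post-processing (would live in a separate cell): flatten the responses into one list
all_codes = [line for text in raw_outputs for line in text.splitlines()]
print(len(all_codes) == len(records))
```

Once this runs cleanly, `dummy_coder` can be swapped for a function that builds the `contents` list and makes the real request.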

In [None]:
# your code here


Then, export this set of code labels to a new CSV labeled `<shortname>_ai.csv`, where `<shortname>` is the shortname of your CSS task. **Read the full description of this task (including CSV format) in the Google Doc.**

In [None]:
# code cell if you need it

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 8: Evaluate Performance Quantitatively


---

### Question 8a: Cohen’s Kappa

Evaluate the performance of this AI-coding strategy by computing Cohen’s Kappa (κ) with respect to the **human gold labels**. Note that like before, the human gold labels can still be unreliable, so computing Cohen’s Kappa is still applicable.

Complete this part by writing code to compute Cohen’s Kappa ($\kappa$) below; we've filled out some parts for you. Then, in your report, report the $\kappa$ **and** agreement level for this strategy.


In [None]:
# edit the below code as needed

from sklearn.metrics import cohen_kappa_score

# replace the below lines
# depending on where you uploaded it, could be "datasets/<filename>.csv"
coded_fname = "emotion_ai.csv" # replace this line
gold_label = "emotions" # replace this with the name of your gold label column



table_codes = Table.read_table(coded_fname)
kappa = cohen_kappa_score(table_codes.column("Gemini 2.5 Flash"),
                                  table_codes.column(gold_label))
print(kappa)

<br></br>

### [Tutorial] Visualizing a Confusion Matrix

Below, we have provided some starter code that first computes a small confusion matrix with the `sklearn` function `confusion_matrix` [[sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)] then visualizes this matrix as a heatmap [[seaborn documentation](https://seaborn.pydata.org/generated/seaborn.heatmap.html)]. Feel free to start from this as an example.

In [None]:
# just run this cell
from sklearn.metrics import confusion_matrix
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]   # gold/standard labels
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]    # "AI"/predicted labels
labels = ["ant", "bird", "cat"] # categorical variable values, the order of entries in the resulting matrix

conf_matrix = confusion_matrix(y_true, y_pred, labels=labels) # a 2-D matrix!
conf_matrix

In [None]:
# just run this cell
import seaborn as sns
ax = sns.heatmap(conf_matrix,
                 cmap="YlOrBr",
                 xticklabels=labels,
                 yticklabels=labels,
                 annot=True,
                 annot_kws={"fontsize":20}
                )

# add x labels, change font size of x tick labels
ax.set_xlabel("Predicted labels")
ax.set_xticklabels(ax.get_xmajorticklabels(), fontsize=10)

# add y labels, change font size of y tick labels
ax.set_ylabel("Gold labels")
ax.set_yticklabels(ax.get_ymajorticklabels(), fontsize=10)
ax


<br></br>

---
    
### Question 8b: Confusion Matrix 

Evaluate the performance of this AI-coding strategy by computing the confusion matrix of the AI-coded labels and the human gold labels. Feel free to start from the example above.

In the cell below, write code that produces a visualization. We've copied over the relevant code; just modify the code where indicated to load in the correct columns of your `<shortname>_ai.csv` data. Remember, `tbl.column(colname)` returns the `colname` column of `tbl` as an array.  Set `y_true`, `y_pred` accordingly. You may need to manually input the categorical values of your label. 

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns


# edit this block of code
csv_fname = ...
table = Table.read_table(csv_fname)

y_true = table.column(...)   # gold/standard labels
y_pred = table.column(...)   # "AI"/predicted labels
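labels = [..., ..., ]   # categorical variable values, the order of entries in the resulting matrix

# recompute the matrix from your own columns; otherwise the tutorial's matrix gets plotted
conf_matrix = confusion_matrix(y_true, y_pred, labels=labels)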

# make minimal changes below this line
# only tweak color and formatting if needed
ax = sns.heatmap(conf_matrix,
                 cmap="YlOrBr",
                 xticklabels=labels,
                 yticklabels=labels,
                 annot=True,
                 annot_kws={"fontsize":20}
                )

# add x labels, change font size of x tick labels
ax.set_xlabel("Predicted labels")
ax.set_xticklabels(ax.get_xmajorticklabels(), fontsize=10)

# add y labels, change font size of y tick labels
ax.set_ylabel("Gold labels")
ax.set_yticklabels(ax.get_ymajorticklabels(), fontsize=10)
ax

Then, in your Google Doc report, include screenshot(s) of the resulting confusion matrix. You do not need to screenshot your code; writing your code in the cell above is sufficient.

Finally, in your Google Doc report, report the overall accuracy and the class accuracies (one per label in your categorical variable) of your AI codes. 
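
Both quantities can be read off a confusion matrix. The sketch below hard-codes the matrix from the tutorial example (rows are gold labels, columns are predicted labels) and assumes that "class accuracy" means the fraction of records with a given gold label that the AI coded correctly; check the Google Doc if your task defines it differently.

```python
import numpy as np

labels = ["ant", "bird", "cat"]
# tutorial confusion matrix: rows = gold labels, columns = predicted labels
conf_matrix = np.array([[2, 0, 0],
                        [0, 0, 1],
                        [1, 0, 2]])

# overall accuracy: correct predictions (the diagonal) over all records
overall_accuracy = np.trace(conf_matrix) / conf_matrix.sum()

# class accuracy (assumed definition): diagonal entry over its row total
# note: a label with no gold records has a zero row total and would divide by zero
class_accuracy = np.diag(conf_matrix) / conf_matrix.sum(axis=1)

print("overall", round(float(overall_accuracy), 3))
for label, acc in zip(labels, class_accuracy):
    print(label, round(float(acc), 3))
```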

An additional code cell is provided for you below.

In [None]:
# code here, if needed

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 9: Evaluate Performance Qualitatively

No code.

If you need it, the below cell samples a few rows from your dataset.

In [None]:
# depending on where you uploaded it, could be "datasets/<filename>.csv"
all_coded_fname = "emotion_ai.csv" # replace this line

all_table = Table.read_table(all_coded_fname)

# sample 10 rows of your dataset, with no repeats. Take Data 8 for more!
all_table.sample(10, with_replacement=False)

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 10: Reflect

No code.

# Done!!!

You should download a ZIP of this folder, which should include this notebook, the `<shortname>_val_coded.csv` file you created in Part B.1, and the `<shortname>_ai.csv` file you created in Part B.2. Read the [Google Doc](https://docs.google.com/document/d/1xT-5wWedzn2U85U3GpnE40SFskxX-hNzwOGc67Asw8o/edit?usp=sharing) for submission instructions. 