# Generating & Evaluating flashcards using LLMs

In this notebook, we'll explore the process of generating flashcards using a large language model (LLM) and systematically evaluating their quality based on several metrics: atomicity, conciseness, leakage, readability (Flesch-Kincaid grade), and duplication. We'll walk through environment setup, data loading, flashcard generation, metric definitions, evaluation, and initial analysis of results.

## 1. Installing all necessary dependencies

```bash
pip install lmstudio pandas google-genai python-dotenv nltk textstat sentence-transformers jupyter
```

## 2. Determining evaluation metrics

### Atomicity score
Checks whether the question focuses on a single core concept. The score ranges from 0.0 (poor atomicity) to 1.0 (good atomicity).

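```python
from nltk.tokenize import wordpunct_tokenize

# German coordinating conjunctions ("bzw" without the dot: the tokenizer splits it off)
CONJUNCTIONS = {"und", "oder", "aber", "jedoch", "bzw",
                "beziehungsweise", "sowie", "doch", "sondern"}

def count_verbs(question: str) -> int:
    raise NotImplementedError  # placeholder: verb count, e.g. from a POS tagger

def atomicity_score(question: str) -> float:
    tokens = wordpunct_tokenize(question)  # word and punctuation tokens
    verbs = count_verbs(question)
    conjs = sum(t.lower() in CONJUNCTIONS for t in tokens)
    score = (1.0 - 0.4 * max(0, verbs - 1)  # extra verbs
             - 0.3 * conjs                  # coordinating conjunctions
             - 0.2 * (len(tokens) > 20))    # more than 20 tokens
    return max(0.0, score)  # clamp to the documented 0.0-1.0 range
```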
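To make the penalty arithmetic concrete, here is a minimal sketch that applies the formula to counts that are already known. `atomicity_from_counts` is a hypothetical helper introduced only for illustration, and clamping the result at 0.0 is an assumption based on the stated score range.

```python
def atomicity_from_counts(verbs: int, conjs: int, n_tokens: int) -> float:
    """Apply the atomicity penalties to pre-computed counts (hypothetical helper)."""
    score = 1.0
    score -= 0.4 * max(0, verbs - 1)   # every verb beyond the first
    score -= 0.3 * conjs               # every coordinating conjunction
    score -= 0.2 * (n_tokens > 20)     # flat penalty for questions over 20 tokens
    # Clamping at 0.0 is an assumption; rounding only hides float noise.
    return round(max(0.0, score), 2)

examples = {
    "single concept": atomicity_from_counts(verbs=1, conjs=0, n_tokens=8),
    "one conjunction": atomicity_from_counts(verbs=1, conjs=1, n_tokens=17),
    "conjunction + long": atomicity_from_counts(verbs=1, conjs=1, n_tokens=25),
    "heavily compound": atomicity_from_counts(verbs=3, conjs=2, n_tokens=30),
}
for label, value in examples.items():
    print(f"{label}: {value}")
```

A single conjunction costs 0.3 and exceeding 20 tokens costs another 0.2, so a question that joins two sub-questions and runs long lands at 0.5.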

### Conciseness Score  
Measures whether the answer is short enough to be easily memorized.  

- **Range / Values**  
  - 1.0 if ≤ 10 tokens  
  - 0.7 if 11–20 tokens  
  - 0.4 if > 20 tokens  

```python
from nltk import word_tokenize

def conciseness_score(answer: str) -> float:
    """
    Is the answer short enough to be memorable?
      - 1.0 if <= 10 tokens
      - 0.7 if 11–20 tokens
      - 0.4 if > 20 tokens
    """
    n = len(word_tokenize(answer))
    if n <= 10:
        return 1.0
    elif n <= 20:
        return 0.7
    else:
        return 0.4
```
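One consequence of counting tokens rather than words is that punctuation and code symbols add to the length. The sketch below illustrates this with a regex tokenizer that only approximates `word_tokenize` (an assumption; NLTK's exact token boundaries may differ), applied to one of the code-heavy answers from the results further down.

```python
import re

def punct_aware_tokens(text: str) -> list[str]:
    """Rough stand-in for word_tokenize: runs of word characters or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

answer = "Declaration: `int[] a;` Creation: `a = new int[10];`"

n_words = len(answer.split())                # whitespace-separated words
n_tokens = len(punct_aware_tokens(answer))   # words plus every bracket, backtick, semicolon

print(f"words: {n_words}, tokens: {n_tokens}")
```

Under this approximation the answer has fewer than ten whitespace-separated words but more than twenty tokens, which puts a short code snippet in the lowest conciseness band.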

### Flesch-Kincaid Score
Estimates the U.S. school-grade level required to understand the answer. Lower values ⇒ easier to read. We generally aim for a grade level ≤ 8.

```python
import textstat

def fk_grade(answer: str) -> float:
    """
    Compute the Flesch–Kincaid Grade Level of the answer.
    Lower ⇒ easier to read (target: grade <= 8).
    """
    return textstat.flesch_kincaid_grade(answer)
```
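For reference, the standard Flesch-Kincaid Grade Level formula is shown below. The constants were calibrated on English text, so grades computed for German answers are probably best read as relative rather than as literal school grades.

```latex
\text{FK grade} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}}
                + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}}
                - 15.59
```

Long sentences raise the first term and long, multi-syllable words raise the second, which is why dense one-sentence definitions tend to score well above the target of 8.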


This cell generates flashcards with the German system prompt stored in `prompts.json` (`system_prompt_de`) and scores every card:

```python
import json
import os
import pandas as pd
from google import genai
from google.genai import types
from IPython.display import display

from dotenv import load_dotenv
load_dotenv()  # load API_KEY from a local .env file

from evaluation import (
    atomicity_score,
    conciseness_score,
    leakage_score,
    fk_grade,
    find_exact_and_question_duplicates,
    compute_max_semantic_similarity
)

API_KEY = os.getenv("API_KEY")
client = genai.Client(api_key=API_KEY)

# Source summary the flashcards are generated from
with open('summaries/esop.md', encoding='utf-8') as f:
    prompt = f.read()

# German system prompt
with open('prompts.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
system_prompt = data['system_prompt_de']

# Ask the model for JSON so the response can be parsed directly
response = client.models.generate_content(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(
        system_instruction=system_prompt,
        response_mime_type="application/json",
    ),
    contents=prompt,
)

cards = json.loads(response.text)
df_cards = pd.DataFrame(cards)

# Normalize the column names the model may return
rename_map = {
    "front": "question",
    "back":  "answer",
    "question": "question",
    "answer":   "answer",
    "frage": "question",
    "antwort": "answer",
}

df_cards.rename(columns={k: v for k, v in rename_map.items()
                         if k in df_cards.columns},
                inplace=True)

# Per-card metrics
df_cards["atomicity_score"] = df_cards["question"].apply(atomicity_score)
df_cards["conciseness_score"] = df_cards["answer"].apply(conciseness_score)
df_cards["leakage_score"] = df_cards.apply(
    lambda row: leakage_score(row["question"], row["answer"]), axis=1
)
df_cards["fk_grade"] = df_cards["answer"].apply(fk_grade)

# Duplicate detection: exact matches and maximum semantic similarity per card
exact_dups, question_dups = find_exact_and_question_duplicates(df_cards)
df_cards["is_exact_dup"] = df_cards.index.isin(exact_dups)
max_sims = compute_max_semantic_similarity(df_cards, model_name="all-MiniLM-L6-v2")
df_cards["semantic_similarity"] = df_cards.index.map(max_sims)

threshold = 0.90
df_cards["is_semantic_dup"] = df_cards["semantic_similarity"] >= threshold

pd.set_option('display.max_colwidth', 80)
pd.set_option('display.colheader_justify', 'center')

cols = [
    "question","answer",
    "atomicity_score","conciseness_score",
    "leakage_score","fk_grade",
    "semantic_similarity","is_exact_dup"
]

# Wrap long text and fix the number formatting for display
styled = (
    df_cards[cols]
      .style
      .set_properties(subset=["question","answer"], **{"white-space": "pre-wrap"})
      .set_table_styles([
          {"selector": "th", "props": [("text-align", "center")]},
      ])
      .format({
          "atomicity_score":      "{:.2f}",
          "conciseness_score":    "{:.2f}",
          "leakage_score":        "{:.2f}",
          "fk_grade":             "{:.1f}",
          "semantic_similarity":  "{:.2f}"
      })
)

display(styled)
```

| # | question | answer | atomicity_score | conciseness_score | leakage_score | fk_grade | semantic_similarity | is_exact_dup |
|---|---|---|---|---|---|---|---|---|
| 0 | Was ist ein Array in der Programmierung? | Eine Datenstruktur, die eine feste Anzahl von Werten desselben Datentyps in einer geordneten Folge speichert. | 1.0 | 0.7 | 1.0 | 9.9 | 0.61 | False |
| 1 | Wie greift man auf Elemente eines eindimensionalen Arrays in Java zu und wo beginnt der Index? | Über den Index in eckigen Klammern (z.B. `a[0]`). Der Index beginnt bei 0. | 0.7 | 0.4 | 0.8 | 5.1 | 0.69 | False |
| 2 | Was geben `a.length` und `a[0].length` bei einem zweidimensionalen Array `int[][] a` in Java zurück? | `a.length` gibt die Anzahl der Zeilen an, `a[0].length` die Anzahl der Spalten der ersten Zeile. | 0.5 | 0.4 | 0.92 | 2.3 | 0.66 | False |
| 3 | Welche Art von Wert speichert eine Array-Variable in Java? | Die Adresse des Speicherbereichs, nicht die Daten selbst (Array-Variablen sind Referenztypen). | 1.0 | 0.7 | 1.0 | 12.3 | 0.69 | False |
| 4 | Was ist der Unterschied zwischen einer Klasse und einem Objekt in Java? | Klassen sind benutzerdefinierte Datentypen (Blaupausen), die Daten und Methoden kapseln. Objekte sind Instanzen dieser Klassen. | 0.7 | 0.7 | 1.0 | 9.4 | 0.59 | False |
| 5 | Welchen Zweck erfüllen Zugriffsattribute (`private`, `public`) bei Instanzvariablen und Methoden in Java-Klassen? | `private` schützt Instanzvariablen (Geheimnisprinzip), `public` macht Methoden zur Manipulation/zum Auslesen des Zustands öffentlich zugänglich. | 0.7 | 0.4 | 0.71 | 16.0 | 0.5 | False |
| 6 | Wie verhält sich die Parameterübergabe in Java für Grundtypen im Vergleich zu Objektreferenzen? | Grundtypen werden 'Call-by-Value' (Wertkopie) übergeben. Objektreferenzen werden ebenfalls 'Call-by-Value' der Referenz übergeben, d.h. Änderungen im aufgerufenen Objekt wirken sich aus. | 1.0 | 0.4 | 0.86 | 10.6 | 0.68 | False |
| 7 | Welches Hauptproblem lösen generische Typen in Java im Vergleich zu nicht-generischen Containerklassen? | Sie eliminieren die Notwendigkeit manueller Abwärtskonvertierungen (`ClassCastException`) und ermöglichen typsichere, homogene Container, deren Typkonsistenz bereits zur Compile-Zeit sichergestellt wird. | 1.0 | 0.4 | 1.0 | 17.3 | 0.68 | False |
| 8 | Was ist der Hauptunterschied zwischen Byte-Streams und Character-Streams in Java I/O? | Byte-Streams (`InputStream`/`OutputStream`) verarbeiten rohe 8-Bit-Daten (z.B. Bilder), während Character-Streams (`Reader`/`Writer`) Text mit automatischer Zeichensatz-Konversion verarbeiten. | 0.7 | 0.4 | 1.0 | 22.7 | 0.45 | False |
| 9 | Was ist Serialisierung in Java, und welche Anforderung muss eine Klasse erfüllen, um serialisierbar zu sein? | Serialisierung ist das Speichern eines Objekts in einem Stream. Eine Klasse muss das Marker-Interface `Serializable` implementieren, um serialisiert werden zu können. | 0.7 | 0.4 | 0.83 | 13.2 | 0.59 | False |


This cell repeats the run with a minimal English system prompt defined inline:

```python
import json
import os
import pandas as pd
from google import genai
from google.genai import types
from IPython.display import display

from dotenv import load_dotenv
load_dotenv()  # load API_KEY from a local .env file

from evaluation import (
    atomicity_score,
    conciseness_score,
    leakage_score,
    fk_grade,
    find_exact_and_question_duplicates,
    find_semantic_question_duplicates,
    compute_max_semantic_similarity
)

API_KEY = os.getenv("API_KEY")
client = genai.Client(api_key=API_KEY)

# Source summary the flashcards are generated from
with open('summaries/esop.md', encoding='utf-8') as f:
    prompt = f.read()

# Minimal English system prompt
system_prompt = """
Generate flashcards from this summary following this format:
    {
        "question": "...",
        "answer": "..."
    }

Generate only 10 cards
"""

# Ask the model for JSON so the response can be parsed directly
response = client.models.generate_content(
    model="gemini-2.5-flash",
    config=types.GenerateContentConfig(
        system_instruction=system_prompt,
        response_mime_type="application/json",
    ),
    contents=prompt,
)

cards = json.loads(response.text)
df_cards = pd.DataFrame(cards)

# Normalize the column names the model may return
rename_map = {
    "front": "question",
    "back":  "answer",
    "question": "question",
    "answer":   "answer",
    "frage": "question",
    "antwort": "answer",
}

df_cards.rename(columns={k: v for k, v in rename_map.items()
                         if k in df_cards.columns},
                inplace=True)

# Per-card metrics
df_cards["atomicity_score"] = df_cards["question"].apply(atomicity_score)
df_cards["conciseness_score"] = df_cards["answer"].apply(conciseness_score)
df_cards["leakage_score"] = df_cards.apply(
    lambda row: leakage_score(row["question"], row["answer"]), axis=1
)
df_cards["fk_grade"] = df_cards["answer"].apply(fk_grade)

# Duplicate detection: exact matches and maximum semantic similarity per card
exact_dups, question_dups = find_exact_and_question_duplicates(df_cards)
df_cards["is_exact_dup"] = df_cards.index.isin(exact_dups)
max_sims = compute_max_semantic_similarity(df_cards, model_name="all-MiniLM-L6-v2")
df_cards["semantic_similarity"] = df_cards.index.map(max_sims)

threshold = 0.90
df_cards["is_semantic_dup"] = df_cards["semantic_similarity"] >= threshold

pd.set_option('display.max_colwidth', 80)
pd.set_option('display.colheader_justify', 'center')

cols = [
    "question","answer",
    "atomicity_score","conciseness_score",
    "leakage_score","fk_grade",
    "semantic_similarity","is_exact_dup"
]

# Wrap long text and fix the number formatting for display
styled = (
    df_cards[cols]
      .style
      .set_properties(subset=["question","answer"], **{"white-space": "pre-wrap"})
      .set_table_styles([
          {"selector": "th", "props": [("text-align", "center")]},
      ])
      .format({
          "atomicity_score":      "{:.2f}",
          "conciseness_score":    "{:.2f}",
          "leakage_score":        "{:.2f}",
          "fk_grade":             "{:.1f}",
          "semantic_similarity":  "{:.2f}"
      })
)

display(styled)
```

| # | question | answer | atomicity_score | conciseness_score | leakage_score | fk_grade | semantic_similarity | is_exact_dup |
|---|---|---|---|---|---|---|---|---|
| 0 | What is an array in Java? | An array is a data structure that stores a fixed number of values of the same data type in an ordered sequence. | 1.0 | 0.4 | 0.89 | 9.1 | 0.64 | False |
| 1 | How do you declare and create a 1D integer array named 'a' with 10 elements in Java? | Declaration: `int[] a;` Creation: `a = new int[10];` | 1.0 | 0.4 | 0.94 | 9.1 | 0.67 | False |
| 2 | In a Java array 'a' of length N, what are the valid indices for accessing elements, and how do you get its length? | Valid indices range from 0 to N-1. The length is obtained using the `length` property, e.g., `a.length`. | 0.8 | 0.4 | 0.85 | 4.4 | 0.61 | False |
| 3 | How do you access the element in the second row and second column of a 2D integer array named 'a' in Java? | `a[1][1]` (since indexing starts from 0). | 0.8 | 0.7 | 0.95 | 2.5 | 0.62 | False |
| 4 | When you assign one array variable to another (e.g., `int[] b = a;`), what is the consequence regarding changes made through 'b'? | Since array variables store the address (reference) to the memory area, both `a` and `b` point to the same array. Changes made through `b` will also affect the array referenced by `a`. | 0.8 | 0.4 | 0.76 | 9.1 | 0.58 | False |
| 5 | How do you iterate over a one-dimensional array 'a' of integers using the enhanced for-loop in Java? | ```java for (int val : a) {  System.out.println(val); } ``` | 0.8 | 0.7 | 0.88 | 10.4 | 0.67 | False |
| 6 | In object-oriented programming, what is the fundamental difference between a 'class' and an 'object'? | A class is a blueprint or a user-defined data type that defines data (attributes) and operations (methods). An object is an instance of a class, created using `new`, representing a concrete entity based on that blueprint. | 0.8 | 0.4 | 0.9 | 10.8 | 0.34 | False |
| 7 | In Java, what is the typical access modifier for instance variables and methods, and what principle do they adhere to? | Instance variables are typically `private` (verdeckte Daten), while methods are typically `public` (sichtbare Operationen). This adheres to the principle of information hiding or encapsulation. | 0.8 | 0.4 | 0.82 | 15.1 | 0.43 | False |
| 8 | What problem does the use of generics (e.g., `IStack`) primarily solve in Java collections? | Generics solve the problem of type unsafety and the need for explicit type casting when retrieving elements from collections. They ensure type consistency at compile-time, preventing `ClassCastException`s at runtime. | 0.8 | 0.4 | 0.81 | 12.9 | 0.39 | False |
| 9 | What is the purpose of a 'Stream' in Java's input/output (I/O) abstraction? | A Stream represents a source or destination of data. It abstracts away the details of the underlying device, allowing programs to read from or write to various sources (like files, network connections, or memory) in a uniform way. | 0.8 | 0.4 | 0.91 | 11.7 | 0.44 | False |
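As a first pass at analysing the results, the sketch below summarizes the two runs side by side. The score lists are copied by hand from the two result tables above; the dictionary layout and the names `runs`, `summary` and `FK_TARGET` are illustrative assumptions, and in the notebook itself the same numbers could be read from `df_cards` instead.

```python
from statistics import mean

# Scores copied from the two result tables (German prompt run, English prompt run).
runs = {
    "German prompt": {
        "conciseness": [0.7, 0.4, 0.4, 0.7, 0.7, 0.4, 0.4, 0.4, 0.4, 0.4],
        "fk_grade": [9.9, 5.1, 2.3, 12.3, 9.4, 16.0, 10.6, 17.3, 22.7, 13.2],
    },
    "English prompt": {
        "conciseness": [0.4, 0.4, 0.4, 0.7, 0.4, 0.7, 0.4, 0.4, 0.4, 0.4],
        "fk_grade": [9.1, 9.1, 4.4, 2.5, 9.1, 10.4, 10.8, 15.1, 12.9, 11.7],
    },
}

FK_TARGET = 8  # readability target from the Flesch-Kincaid metric definition

summary = {}
for name, scores in runs.items():
    summary[name] = {
        "mean_conciseness": round(mean(scores["conciseness"]), 2),
        "mean_fk_grade": round(mean(scores["fk_grade"]), 2),
        # number of cards whose answer meets the readability target
        "cards_within_fk_target": sum(g <= FK_TARGET for g in scores["fk_grade"]),
    }
    print(name, summary[name])
```

In both runs only two of the ten answers meet the grade-8 target and most answers fall in the lowest conciseness band, which suggests that answer length and readability are the first things to address in the prompts.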
