# Introduction
This notebook experiments with different prompts for the question generation model, which uses the ChatGPT API from OpenAI.
First, a helper function is created that calls the API with a given prompt. Several prompting techniques are then tried out and evaluated to find the best-performing prompt template.

In [63]:
import os
from dotenv import load_dotenv
import openai
from src.datageneration.extractor import extract_text_without_image
from pypdfium2 import PdfDocument
import pandas as pd
from sklearn.model_selection import train_test_split
from src.evaluation.eval_main import Metrics

load_dotenv()
openai.api_key = os.getenv("OPENAI-API-KEY2")

def chat_gpt(prompt):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

ModuleNotFoundError: No module named 'language_tool_python'
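
API calls can fail temporarily, for example because of rate limits or network errors. The sketch below shows one possible way to wrap a helper such as `chat_gpt` with retries and exponential backoff. The name `call_with_retry` and its parameters are assumptions made for illustration and are not part of the original notebook or the OpenAI library.

```python
import time


def call_with_retry(call, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call(prompt)` and retry on any exception with exponential backoff.

    `call` is any function that takes a prompt and returns a string,
    for example the `chat_gpt` helper defined above (assumption).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(prompt)
        except Exception:
            # Give up after the last attempt and surface the original error
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** (attempt - 1))
```

With this wrapper, a call would look like `call_with_retry(chat_gpt, prompt)`. Catching every exception keeps the sketch short; a real setup would probably retry only on the error types that are known to be temporary.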

Next, the evaluation data is loaded. The same data is used to evaluate the other question generation models in the Ankinator project.

In [2]:
# # initially retrieve extracted text for each slide - only execute once
# slide_path = "../../../datasets/IT-Security_all_slides_no_duplicates.pdf"
# pdf = PdfDocument(slide_path)
# text = extract_text_without_image(pdf.raw)
# extracted_content = pd.DataFrame(columns=['Pagenumber', 'Page-Text', 'OCR-text'])
# for i in text:
#     extracted_content = extracted_content.append({'Pagenumber': i[0], 'Page-Text': i[1], 'OCR-text': i[2]}, ignore_index=True)
#
# # Define the file path and name
# file_path = "../../../datasets/extracted_text_content.csv"
#
# # Save the DataFrame to the specified folder
# extracted_content.to_csv(file_path, index=False)

100%|██████████| 596/596 [06:26<00:00,  1.54it/s]
  extracted_content = extracted_content.append({'Pagenumber': i[0], 'Page-Text': i[1], 'OCR-text': i[2]}, ignore_index=True)

In [None]:
# reload extracted content from file
file_path = "../../../datasets/extracted_text_content.csv"
extracted_content = pd.read_csv(file_path)

In [52]:
file_path = '../../../datasets/Goldstandard.csv'

goldstandard = pd.read_csv(file_path, delimiter=';')

# Remove unnecessary columns
goldstandard.drop(['PDF-Name', 'Comment','Page Number'], axis=1, inplace=True)

# Join two DataFrames based on index
goldstandard = extracted_content.join(goldstandard, lsuffix='_left', rsuffix='_right')

# Delete records with value "No" and "no" in the "Marked for processing" column
goldstandard = goldstandard[(goldstandard['Marked for processing'] != 'No') & (goldstandard['Marked for processing'] != 'no')]

# Remove unnecessary columns
goldstandard.drop(['Marked for processing', 'Includes Image Data'], axis=1, inplace=True)

# Split the DataFrame into train, validation, and test sets
goldstandard_train_val, goldstandard_test = train_test_split(goldstandard, test_size=0.2, random_state=42)

print("Length of test set: ", len(goldstandard_test))
print(goldstandard_test)

Length of test set:  93
    Pagenumber                                          Page-Text  \
287        287  Phase 4: Post-Infection\r\n• Carries out attac...   
35          35  • Sig:\r\n– Given a message  ∈ ே and \r\nsecre...   
46          46  Access Control\r\n• Controls which authenticat...   
284        284  Phase 1: Development\r\n• Malware is developme...   
149        149  Phishing\r\n• Phishing = fishing for informati...   
..         ...                                                ...   
26          26  Key Agreement Protocol\r\nAlice Bob\r\nEve\r\n...   
449        449  HTTP\r\nWorkflow\r\n• Two parties:\r\n– Client...   
54          54  Access Matrix\r\nFormal Definition\r\n• At poi...   
217        217  Machine Learning\r\n• Extends content filterin...   
436        436  DNS Tunneling\r\n• Normal DNS requests only co...   

                                              OCR-text  \
287  te\nPhase 4: Post-Infection Be OP MANNHEIM\n\n...   
35   ol\nExample: RSA Signature

In [81]:
# Reset the index of the DataFrame
goldstandard_test = goldstandard_test.reset_index(drop=True)
goldstandard_train_val = goldstandard_train_val.reset_index(drop=True)

# this stores the candidate inputs for the ChatGPT model
content = goldstandard_test[["Page-Text", "OCR-text"]]

# this stores the reference
references = goldstandard_test[["Question"]]
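
The data preparation above drops rows whose "Marked for processing" value is exactly `No` or `no`. A slightly more tolerant check could be sketched as follows. The helper name `is_marked_for_processing` is hypothetical, and treating other spellings such as `NO` or ` no ` as "no" is an assumption that goes beyond the original filter.

```python
def is_marked_for_processing(value):
    """Return False only when the cell says 'no', ignoring case and surrounding spaces."""
    if not isinstance(value, str):
        # Missing or non-text cells are kept, which matches the original filter
        return True
    return value.strip().lower() != "no"
```

Applied to the DataFrame, this could be used as `goldstandard[goldstandard['Marked for processing'].map(is_marked_for_processing)]`.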

# Prompt Engineering
With the data prepared, prompt engineering can begin. It starts with simple prompts and moves on to more complex ones.
## Zero-Shot
1. Simple prompt

In [69]:
model_results = []
# Call the ChatGPT API for the first three test slides and store the results
for _, row in content.head(3).iterrows():
    prompt = f"""
    Generate a question in a flashcard style for the content delimited by triple backticks.
    ```{row['Page-Text']}```
    """
    model_results.append(chat_gpt(prompt))

print(model_results)
# # Performance is evaluated
# metrics = Metrics(save_to_file=False)
# result = pd.DataFrame(
#     metrics.evaluate(model_output=model_results, references=references),
#     index=["ChatGPT"]
# )
# result

['What are some examples of activities carried out during the post-infection phase of an attack?', 'What are the steps involved in the RSA Signature Scheme?', 'What is access control and on what levels can it exist?']
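
The same loop is repeated for every prompt variant in this notebook. As a minimal sketch, the loop could be factored into a helper that receives the prompt builder and the model call as arguments. The names `simple_prompt`, `generate_questions` and `FENCE` are assumptions made for illustration and do not appear in the original notebook.

```python
from itertools import islice

# Triple backticks used as the delimiter in the prompt
FENCE = "`" * 3


def simple_prompt(row):
    """Build the zero-shot prompt for one slide (row is a dict-like record)."""
    return (
        "Generate a question in a flashcard style for the content "
        "delimited by triple backticks.\n"
        f"{FENCE}{row['Page-Text']}{FENCE}"
    )


def generate_questions(rows, build_prompt, ask, limit=3):
    """Send build_prompt(row) to ask() for the first `limit` rows and collect the answers."""
    return [ask(build_prompt(row)) for row in islice(rows, limit)]
```

With the objects defined earlier, a call might look like `generate_questions(content.to_dict("records"), simple_prompt, chat_gpt)`. Passing the model call as an argument also makes it possible to try a prompt builder with a stub function before spending API calls.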


2. Provide more context on the task

In [74]:
model_results = []
# Call the ChatGPT API for the first three test slides and store the results
for _, row in content.head(3).iterrows():
    prompt = f"""
    Generate a question in a flashcard style for the content delimited by triple backticks.
    When there are examples do not focus on their specifics but try to cover the overarching concept or idea.
    ```{row['Page-Text']}```
    """
    model_results.append(chat_gpt(prompt))

print(model_results)
# # Performance is evaluated
# metrics = Metrics(save_to_file=False)
# result = pd.DataFrame(
#     metrics.evaluate(model_output=model_results, references=references),
#     index=["ChatGPT"]
# )
# result

['What actions does a virus typically carry out during the post-infection phase?', 'What is the purpose of the "Sig" operation in the RSA Signature Scheme?', 'What does access control in IT security refer to and on what levels can it exist?']


3. Even more context

In [76]:
model_results = []
# Call the ChatGPT API for the first three test slides and store the results
for _, row in content.head(3).iterrows():
    prompt = f"""
    Generate a question in a flashcard style for the content delimited by triple backticks.
    Focus on concepts, definitions and key-words.
    Take into account how exam questions are normally formulated and formulate the question accordingly.
    When there are examples do not focus on their specifics but try to cover the overarching concept or idea.
    ```{row['Page-Text']}```
    """
    model_results.append(chat_gpt(prompt))

print(model_results)
# # Performance is evaluated
# metrics = Metrics(save_to_file=False)
# result = pd.DataFrame(
#     metrics.evaluate(model_output=model_results, references=references),
#     index=["ChatGPT"]
# )
# result

['What are some examples of actions that can be carried out during the post-infection phase of a cyberattack?', 'What are the steps involved in the RSA signature scheme?', 'What does access control in IT-Security involve and what are the different levels at which it can exist?']
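
The evaluation step in these cells is commented out and relies on the project's `Metrics` class, whose internals are not shown in this notebook. As a rough stand-in for quick comparisons, the sketch below computes a unigram-overlap F1 score between a generated question and a reference. This is a swapped-in, simplified metric and not the one used by `Metrics`. The function name `unigram_f1` is an assumption.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings, ignoring case and punctuation."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it appears in both
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, the second outputs of prompt 1 and prompt 3 above differ only in capitalization, so `unigram_f1` treats them as identical. A word-overlap score like this says nothing about whether a question is well formed or answerable, so it can only complement a fuller evaluation.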


4. Comparison of copied text and OCR

In [79]:
model_results = []
# Call the ChatGPT API for the first three test slides and store the results
for _, row in content.head(3).iterrows():
    prompt = f"""
    You are a bot to support in the generation of flashcards from lecture slides.
    You are provided with two inputs. The first input delimited by triple backticks is the text that is copied from the slides.
    The second input delimited by triple quotation marks is retrieved with an OCR tool to extract all text from a slide.
    Follow the below process:
    1. Step: Compare the first input with the second input to retrieve the relevant information
    2. Step: Generate a question for this information in a flashcard style
    Only return the generated question.
    ```{row['Page-Text']}```
    \"\"\"{row['OCR-text']}\"\"\"
    """
    model_results.append(chat_gpt(prompt))

print(model_results)
# # Performance is evaluated
# metrics = Metrics(save_to_file=False)
# result = pd.DataFrame(
#     metrics.evaluate(model_output=model_results, references=references),
#     index=["ChatGPT"]
# )
# result

['What does "Phase 4: Post-Infection" entail?', 'What is the process of generating a signature in the RSA Signature Scheme?', 'What is the purpose of access control in a system?']
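
The prompt above asks the model to compare the copied text with the OCR text. To get a feeling for how much the two inputs differ before calling the API, a local check could be sketched as follows. The helper names `tokens` and `ocr_only_tokens` are hypothetical, and this word-level comparison is an illustration, not part of the original pipeline.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def ocr_only_tokens(page_text, ocr_text):
    """Return the words that appear in the OCR text but not in the copied text, in order."""
    known = set(tokens(page_text))
    extra = []
    for token in tokens(ocr_text):
        if token not in known and token not in extra:
            extra.append(token)
    return extra
```

For the shortened preview of slide 287 shown earlier, `ocr_only_tokens("Phase 4: Post-Infection", "te\nPhase 4: Post-Infection Be OP MANNHEIM")` returns the OCR-only words `te`, `be`, `op` and `mannheim`, which look like noise from the slide layout rather than lecture content.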


## Few-Shot
5. Few-Shot with simple question

In [89]:
model_results = []
# Call the ChatGPT API for the first three test slides and store the results
for _, row in content.head(3).iterrows():
    prompt = f"""
    Generate a question in a flashcard style for the content delimited by triple backticks.
    ```{row['Page-Text']}```
    Follow a similar style for generating the question as in these two examples:
    1) Input: {goldstandard_train_val.loc[0, 'Page-Text']}, question: {goldstandard_train_val.loc[0, 'Question']}
    2) Input: {goldstandard_train_val.loc[1, 'Page-Text']}, question: {goldstandard_train_val.loc[1, 'Question']}
    """
    model_results.append(chat_gpt(prompt))

print(goldstandard_test.loc[1, 'Page-Text'])

print(model_results)
# # Performance is evaluated
# metrics = Metrics(save_to_file=False)
# result = pd.DataFrame(
#     metrics.evaluate(model_output=model_results, references=references),
#     index=["ChatGPT"]
# )
# result

• Sig:
– Given a message  ∈ ே and 
secret key 
– Compute signature
 = ௗ  
• Ver:
– Given a signature  ∈ ே, a 
message  and public key 
– Check if  = ௘  
– If yes, output true, otherwise false
Example: RSA Signature Scheme
• Gen:
– Find large primes  and 
– Set  ≔  ⋅ .  is called the RSA￾modulus
– Public key: sample integer value , 
being co-prime (teilerfremd) to 
Φ  = ( − 1)( − 1)
– Secret key: determine  with  ⋅
  Φ() = 1.
Selected Topics in IT-Security
Prof. Dr. Frederik Armknecht - Chair of Computer Science IV 
6
['What activities are carried out during the post-infection phase?', 'What is the RSA Signature Scheme used for?', 'Question: What is Access Control and where can it exist in a system?']
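
The third result above starts with a leading `Question:` label, probably copied from the example format in the prompt. A minimal sketch of a few-shot prompt builder and a small clean-up step could look like this. The names `build_few_shot_prompt`, `strip_label` and `FENCE` are assumptions made for illustration, and the layout of the examples is one possible choice, not the format used in the cell above.

```python
import re

# Triple backticks used as the delimiter in the prompt
FENCE = "`" * 3


def build_few_shot_prompt(page_text, examples):
    """Build a few-shot prompt from (slide_text, question) example pairs."""
    lines = [
        "Generate a question in a flashcard style for the content "
        "delimited by triple backticks.",
        f"{FENCE}{page_text}{FENCE}",
        "Follow a similar style for generating the question as in these examples:",
    ]
    for number, (example_text, example_question) in enumerate(examples, start=1):
        lines.append(f"{number}) Input: {example_text}")
        lines.append(f"   Question: {example_question}")
    return "\n".join(lines)


def strip_label(answer):
    """Remove a leading 'Question:' label from a model answer, if present."""
    return re.sub(r"^\s*question\s*:\s*", "", answer, flags=re.IGNORECASE)
```

With the training split from above, the examples could be taken as `list(zip(goldstandard_train_val['Page-Text'][:2], goldstandard_train_val['Question'][:2]))`. Applying `strip_label` to each entry of `model_results` would turn the third answer into a plain question without the label.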
