In [None]:
# Make sure to set your OpenAI API key in the .env file
import openai
import os
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Text Summarization Use Case

In this notebook we'll look at a text summarization use case and run some prompt engineering experiments to reach a 
functional solution. 

We'll follow the general steps of our prompt engineering template from notebook 
[1.1](./1.1-simple-framework-for-building-prompts.ipynb), but we'll integrate a few extra strategies into each step to get more comprehensive results. 

1. Define your task clearly
2. Define an evaluation metric
3. Generate prompt candidates
4. Experiment
5. Settle on a required performance threshold to stop experimentation.

## Defining our task

Our task will be to __summarize research papers__. Our initial specification of this task will be:

_Summarize the main findings and methodology of the following paper._

(followed by feeding the model with the paper's contents).

Remember that this initial specification of the task can be revisited if experimentation shows that we are not reaching the desired results.

## Evaluation metric

There are many metrics we can use to quantify the quality of a summarization task, for example:

- __ROUGE__ (Recall-Oriented Understudy for Gisting Evaluation)
    - Source: Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
    - Definition: ROUGE is one of the most popular metrics used to evaluate automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-generated).
- __BLEU__ (Bilingual Evaluation Understudy)
    - Source: Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
    - Definition: Originally designed for evaluating machine-translated texts, BLEU is also used for summarization. It measures the precision of the generated summaries by comparing the n-grams of the output with the n-grams of the reference text, providing scores based on the overlap.
- __METEOR__ (Metric for Evaluation of Translation with Explicit ORdering)
    - Source: Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
    - Definition: METEOR is another metric used for evaluation, which improves upon the foundations of BLEU by incorporating synonymy and stemming, allowing for a more nuanced matching between the evaluated summary and the reference text.
- __BERTScore__
    - Source: Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.
    - Definition: BERTScore leverages the contextual embeddings from BERT language models to compare the semantic similarity of tokens between the candidate summary and reference texts. This metric evaluates the quality of content in summaries at a deeper, more semantic level.
- __LEPOR__ (Length Penalty, Precision, n-gram Position difference Penalty and Recall)
    - Source: Han, L., Wong, D. F., & Chao, L. S. (2012). LEPOR: An augmented machine translation evaluation metric. Proceedings of the Workshop on Statistical Machine Translation.
    - Definition: LEPOR is a metric that combines several factors including n-gram precision and recall, length penalty, and word order to evaluate translations, which can also be applied to summarization tasks. It provides a holistic view of the summary's quality.
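
To make the n-gram overlap idea behind metrics like ROUGE and BLEU concrete, here is a minimal sketch of a simplified ROUGE-1 (unigram overlap). It is an illustration only: the whitespace tokenization and the `rouge_1` helper are assumptions made for this example, and real implementations add stemming, multiple references, and longer n-grams.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1(
    candidate="the model summarizes the paper",
    reference="the model summarizes research papers well",
)
print(scores)
```

Note that "paper" and "papers" do not match here, which hints at why purely lexical metrics can undervalue a summary that is worded differently from the reference.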

However, as outlined in the G-Eval paper (Liu et al., 2023), conventional reference-based metrics such as BLEU and ROUGE have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Therefore, we'll use an approach inspired by this [example](https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization) from the OpenAI cookbook, which uses GPT-4 as a judge of the quality of the summaries along the following criteria:

1. Relevance: Evaluates if the summary includes only important information and excludes redundancies.
2. Coherence: Assesses the logical flow and organization of the summary.
3. Consistency: Checks if the summary aligns with the facts in the source document.
4. Fluency: Rates the grammar and readability of the summary.
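
One way to picture these four criteria is as a small record with one bounded score per criterion. The sketch below assumes the 0-5 scale that the evaluator's system message uses later in this notebook; the `CriteriaScores` class and its `average` method are hypothetical names introduced for illustration, not part of the pipeline.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CriteriaScores:
    relevance: float
    coherence: float
    consistency: float
    fluency: float

    def __post_init__(self):
        # Reject scores outside the assumed 0-5 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 5:
                raise ValueError(f"{f.name} must be within 0-5, got {value}")

    def average(self) -> float:
        # Unweighted mean across the four criteria.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


example = CriteriaScores(relevance=4.0, coherence=5.0, consistency=4.0, fluency=5.0)
print(example.average())  # 4.5
```

With pydantic, the same bounds can be declared on each field with `Field(ge=0, le=5)`.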


The simplified version of this experimentation pipeline will be:

![](./assets-resources/pipeline-experiment-summarization.png)

We iterate until we reach a score threshold, which we'll define empirically as we experiment with different examples.
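
As a rough sketch of that iteration, the loop below keeps generating and scoring candidates until the best score reaches a threshold or a round limit is hit. Everything here is an assumption for illustration: `SCORE_THRESHOLD`, `MAX_ROUNDS`, `run_experiment`, and the fake generator and scores stand in for the real LLM calls so the example runs offline.

```python
from typing import Callable, List, Optional, Tuple

SCORE_THRESHOLD = 4.5  # hypothetical stopping threshold on a 0-5 scale
MAX_ROUNDS = 3         # hypothetical cap on the number of experiment rounds


def run_experiment(
    generate_candidates: Callable[[int], List[str]],
    score_prompt: Callable[[str], float],
) -> Tuple[Optional[str], float]:
    """Run rounds of candidates until one reaches the threshold."""
    best_prompt, best_score = None, float("-inf")
    for round_idx in range(MAX_ROUNDS):
        for prompt in generate_candidates(round_idx):
            score = score_prompt(prompt)
            if score > best_score:
                best_prompt, best_score = prompt, score
        if best_score >= SCORE_THRESHOLD:
            break  # good enough: stop experimenting
    return best_prompt, best_score


# Stand-ins for the LLM calls (made-up data).
def fake_generate(round_idx: int) -> List[str]:
    return [f"candidate {round_idx}-{i}" for i in range(2)]


fake_scores = {
    "candidate 0-0": 3.5,
    "candidate 0-1": 4.0,
    "candidate 1-0": 4.75,
    "candidate 1-1": 4.25,
}

best_prompt, best_score = run_experiment(
    fake_generate, lambda p: fake_scores.get(p, 0.0)
)
print(best_prompt, best_score)  # candidate 1-0 4.75
```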

## Generate Prompt Candidates & Experiment

Now that we have a clear definition of our task and have settled on an evaluation metric, let's generate our first batch of prompt candidates and experiment with them.

To save some time, we'll also use GPT-4 for that, combined with pydantic and instructor to return structured outputs (a list of prompts to experiment with).

In [1]:
NUM_PROMPT_CANDIDATES = 5
TASK_DEFINITION = "Summarize the main findings and methodology of the following paper."
MODEL_GEN_PROMPT_CANDIDATES = "gpt-4-turbo"
MODEL_EVAL = "gpt-4-turbo"
MODEL_SUMMARY = "gpt-3.5-turbo-0125"
SYS_MSG_GEN_PROMPT_CANDIDATES = "You are a prompt engineering expert, specialized in generating appropriate prompts for a given task."
SYS_MSG_EVAL = """You are an evaluation engine, specialized in scoring the quality of summaries. \
    You always output 4 criteria scores: relevance(0-5), coherence(0-5), consistency(0-5) and fluency(0-5).
    1. Relevance: Evaluates if the summary includes only important information and excludes redundancies.
    2. Coherence: Assesses the logical flow and organization of the summary.
    3. Consistency: Checks if the summary aligns with the facts in the source document.
    4. Fluency: Rates the grammar and readability of the summary.
    """
SYS_MSG_SUMMARY = "You are a summarization engine, you will be fed a research paper and output a summary in a desired format, specified by the users."
TEMP_PROMPT_CANDIDATES = 0.2
TEMP_SUMMARY = 0.0
PDF_PATH = "./assets-resources/geval_method_paper.pdf"

In [2]:
from pydantic import BaseModel
import instructor
from typing import List
from openai import OpenAI

client_structured = instructor.from_openai(OpenAI())
client = OpenAI()

class PromptCandidates(BaseModel):
    prompt_candidates: List[str]

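# Ask explicitly for prompt candidates; the original wording omitted what to generate.
prompt_generate_candidates = f"Generate a list of {NUM_PROMPT_CANDIDATES} prompt candidates for the following task:\n\n{TASK_DEFINITION}"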

def generate_prompt_candidates(prompt: str):
    response = client_structured.chat.completions.create(
        model=MODEL_GEN_PROMPT_CANDIDATES,
        messages=[{"role": "system", "content": SYS_MSG_GEN_PROMPT_CANDIDATES},
                  {"role": "user", "content": prompt}],
        response_model=PromptCandidates,
        temperature=TEMP_PROMPT_CANDIDATES)
    
    return response

prompt_generate_candidates_response = generate_prompt_candidates(prompt_generate_candidates)
prompt_generate_candidates_response

PromptCandidates(prompt_candidates=['Summarize the main findings and methodology of the paper.', 'Provide a concise summary of the key findings and the methodology used in the paper.', 'Outline the principal results and the methods employed in the study described in the paper.', 'Summarize the core findings and the research methodology of the paper.', 'Detail the main outcomes and the approach used in the research paper.'])

In [15]:
prompt_candidates = prompt_generate_candidates_response.prompt_candidates
prompt_candidates[1]

'Provide a concise summary of the key findings and the methodology used in the paper.'

Perfect! Now that we have the list of prompt candidates, we can experiment with them. All we need is a function to generate the summaries using the OpenAI API and another to evaluate their quality using the evaluation criteria we've just discussed.

One thing to note! With the approach we've chosen, we have to feed the evaluator both the paper and the summary
to score each output, which is not ideal because the paper is sent twice per candidate and that adds to
the token cost. On the other hand, it lets us automate the entire process, which is quite nice! ;)
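
To get a feel for that trade-off, here is a back-of-the-envelope sketch of the input-token cost of summarizing and judging every candidate. The `estimate_run_cost` helper is hypothetical, it ignores output tokens and system messages, and the token counts and per-1K prices in the usage lines are made-up placeholders, not actual pricing.

```python
def estimate_run_cost(
    paper_tokens: int,
    prompt_tokens: int,
    summary_tokens: int,
    num_candidates: int,
    summary_price_per_1k: float,
    eval_price_per_1k: float,
) -> float:
    """Rough input-token cost of summarizing and judging every candidate."""
    # The summarizer reads the prompt plus the paper.
    summarizer_input = prompt_tokens + paper_tokens
    # The judge reads the paper again, plus the generated summary.
    judge_input = paper_tokens + summary_tokens
    per_candidate = (
        summarizer_input / 1000 * summary_price_per_1k
        + judge_input / 1000 * eval_price_per_1k
    )
    return num_candidates * per_candidate


# Placeholder numbers for illustration only.
cost = estimate_run_cost(
    paper_tokens=10_000,
    prompt_tokens=15,
    summary_tokens=300,
    num_candidates=5,
    summary_price_per_1k=0.0005,
    eval_price_per_1k=0.01,
)
print(f"Estimated input cost: {cost:.4f}")
```

With these placeholder numbers the judge dominates the total, since it reads the full paper at the higher per-token price.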

In [4]:
from pydantic import Field

class SummaryScore(BaseModel):
    relevance_score: float = Field(description="Evaluates if the summary includes only important information and excludes redundancies.")
    coherence_score: float = Field(description="Assesses the logical flow and organization of the summary.")
    consistency_score: float = Field(description="Checks if the summary aligns with the facts in the source document.")
    fluency_score: float = Field(description="Rates the grammar and readability of the summary.")


def eval_summary_output(paper_contents: str, summary_output: str):
    scoring_prompt = f"Given this task: {TASK_DEFINITION} with the contents of a research paper: \n\
        {paper_contents} \n\
        Score the quality of the summary generated below: \n\
            {summary_output}."
    response = client_structured.chat.completions.create(
        model=MODEL_EVAL,
        messages=[
            {"role": "system", "content": SYS_MSG_EVAL},
            {"role": "user", "content": scoring_prompt},
        ],
        response_model=SummaryScore)
    return response

In [5]:
def summarize_paper(prompt_summary: str):
    response = client.chat.completions.create(
        model=MODEL_SUMMARY,
        messages=[{"role": "system", "content": SYS_MSG_SUMMARY},
                  {"role": "user", "content": prompt_summary}],
        temperature=TEMP_SUMMARY)
    
    return response.choices[0].message.content

def load_paper(pdf_path: str):
    from langchain_community.document_loaders import PyPDFLoader
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    paper_contents = " ".join([p.page_content for p in pages])
    return paper_contents

In [6]:
paper_contents = load_paper(PDF_PATH)

paper_contents

'Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 2511–2522\nDecember 6-10, 2023 ©2023 Association for Computational Linguistics\nG-E VAL: NLG Evaluation using G PT-4 with Better Human Alignment\nYang Liu Dan Iter Yichong Xu\nShuohang Wang Ruochen Xu Chenguang Zhu\nMicrosoft Azure AI\nyaliu10@microsoft.com\nAbstract\nThe quality of texts generated by natural lan-\nguage generation (NLG) systems is hard to\nmeasure automatically. Conventional reference-\nbased metrics, such as BLEU and ROUGE,\nhave been shown to have relatively low cor-\nrelation with human judgments, especially for\ntasks that require creativity and diversity. Re-\ncent studies suggest using large language mod-\nels (LLMs) as reference-free metrics for NLG\nevaluation, which have the benefit of being ap-\nplicable to new tasks that lack human refer-\nences. However, these LLM-based evaluators\nstill have lower human correspondence than\nmedium-size neural evaluators. In thi

Before we run all the examples, let's run one through the entire pipeline to make sure everything works as intended.

In [7]:
prompt_summary = prompt_candidates[0] + "\n" + "'''"+paper_contents+"'''"
prompt_summary

'Summarize the main findings and methodology of the paper.\n\'\'\'Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 2511–2522\nDecember 6-10, 2023 ©2023 Association for Computational Linguistics\nG-E VAL: NLG Evaluation using G PT-4 with Better Human Alignment\nYang Liu Dan Iter Yichong Xu\nShuohang Wang Ruochen Xu Chenguang Zhu\nMicrosoft Azure AI\nyaliu10@microsoft.com\nAbstract\nThe quality of texts generated by natural lan-\nguage generation (NLG) systems is hard to\nmeasure automatically. Conventional reference-\nbased metrics, such as BLEU and ROUGE,\nhave been shown to have relatively low cor-\nrelation with human judgments, especially for\ntasks that require creativity and diversity. Re-\ncent studies suggest using large language mod-\nels (LLMs) as reference-free metrics for NLG\nevaluation, which have the benefit of being ap-\nplicable to new tasks that lack human refer-\nences. However, these LLM-based evaluators\nstill have lower

In [8]:
summary_output = summarize_paper(prompt_summary)
summary_output

'The paper "G-E VAL: NLG Evaluation using GPT-4 with Better Human Alignment" presents a framework for evaluating the quality of texts generated by natural language generation (NLG) systems. The framework, G-E VAL, utilizes large language models (LLMs) with chain-of-thoughts (CoT) and a form-filling paradigm to assess NLG outputs. The study focuses on two generation tasks: text summarization and dialogue generation. The main findings include:\n1. G-E VAL outperforms reference-based and reference-free baseline metrics in correlating with human quality judgments, especially for open-ended and creative NLG tasks.\n2. Automatic chain-of-thoughts improve the performance of LLM-based evaluators by providing more context and guidance.\n3. Re-weighting the discrete scores by their respective token probabilities provides a more fine-grained continuous score for G-EVAL.\n4. Analysis suggests a potential bias of LLM-based evaluators towards LLM-generated texts over human-written texts, raising con

In [9]:
eval_score = eval_summary_output(paper_contents, summary_output)
eval_score

SummaryScore(relevance_score=5.0, coherence_score=5.0, consistency_score=5.0, fluency_score=5.0)

In [10]:
print(f"Relevance: {eval_score.relevance_score}")
print(f"Coherence: {eval_score.coherence_score}")
print(f"Consistency: {eval_score.consistency_score}")
print(f"Fluency: {eval_score.fluency_score}")

Relevance: 5.0
Coherence: 5.0
Consistency: 5.0
Fluency: 5.0


Oh Nice! Great initial results!

Let's store the results for each prompt in an organized table to keep track of their performance so we can select the best candidate by comparing their results.

In [11]:
import pandas as pd
import tiktoken

def setup_prompt_summary(prompt_candidate: str, paper_contents: str) -> str:
    prompt_summary = prompt_candidate + "\n" + "'''"+paper_contents+"'''"
    return prompt_summary

def calc_summary_score_avg(eval_scores: List[float]) -> float:
    """Returns the mean of the criteria scores."""
    return sum(eval_scores) / len(eval_scores)


def get_num_tokens(prompt: str, model: str) -> int:
    """Calculates the number of tokens in a text prompt"""
    # Use the tokenizer of the model passed in rather than a hard-coded one.
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(prompt))

df = pd.DataFrame(columns=['prompt_candidate', 'summary-score-avg', 'relevance-score', 'coherence-score', 'consistency-score', 'fluency-score', 'model-summary', 'prompt-token-count'])
df.head()

df

| | prompt_candidate | summary-score-avg | relevance-score | coherence-score | consistency-score | fluency-score | model-summary | prompt-token-count |
|---|---|---|---|---|---|---|---|---|


In [12]:
for prompt in prompt_candidates:
    prompt_candidate_token_count = get_num_tokens(prompt, MODEL_SUMMARY)
    prompt_summary = setup_prompt_summary(prompt, paper_contents)
    summary_output = summarize_paper(prompt_summary)
    # The evaluator expects the raw paper contents, not the full summarization prompt.
    eval_score = eval_summary_output(paper_contents, summary_output)
    eval_scores_list = [eval_score.relevance_score, eval_score.coherence_score, eval_score.consistency_score, eval_score.fluency_score]
    summary_score_avg = calc_summary_score_avg(eval_scores_list)
    # Column order: prompt, average, the four criteria scores, model, token count.
    df.loc[len(df)] = [prompt, summary_score_avg, *eval_scores_list, MODEL_SUMMARY, prompt_candidate_token_count]
df

| | prompt_candidate | summary-score-avg | relevance-score | coherence-score | consistency-score | fluency-score | model-summary | prompt-token-count |
|---|---|---|---|---|---|---|---|---|
| 0 | Summarize the main findings and methodology of... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | gpt-3.5-turbo-0125 | 12 |
| 1 | Provide a concise summary of the key findings ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | gpt-3.5-turbo-0125 | 16 |
| 2 | Outline the principal results and the methods ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | gpt-3.5-turbo-0125 | 16 |
| 3 | Summarize the core findings and the research m... | 4.5 | 4.0 | 5.0 | 4.0 | 5.0 | gpt-3.5-turbo-0125 | 14 |
| 4 | Detail the main outcomes and the approach used... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | gpt-3.5-turbo-0125 | 13 |


In [13]:
best_score = df['summary-score-avg'].max()
best_prompts = df[df['summary-score-avg'] == best_score]['prompt_candidate'].tolist()
best_prompts

['Summarize the main findings and methodology of the paper.',
 'Provide a concise summary of the key findings and the methodology used in the paper.',
 'Outline the principal results and the methods employed in the study described in the paper.',
 'Detail the main outcomes and the approach used in the research paper.']

Since several prompts tie for the best score, we can break the tie by using the 
token count as the criterion for picking the overall best.

In [14]:
lowest_token_count_prompt = df[df["prompt_candidate"].isin(best_prompts)].sort_values("prompt-token-count").iloc[0]["prompt_candidate"]
lowest_token_count_prompt

'Summarize the main findings and methodology of the paper.'
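
The two-step selection above (keep the top scorers, then take the shortest prompt) can also be expressed as a single ordering: highest average score first, lowest token count second. The sketch below shows the idea in plain Python on made-up rows; the `rows` data and its field names are hypothetical and are not the notebook's actual results.

```python
# Made-up rows for illustration only.
rows = [
    {"prompt": "prompt A", "score_avg": 5.0, "token_count": 16},
    {"prompt": "prompt B", "score_avg": 4.5, "token_count": 10},
    {"prompt": "prompt C", "score_avg": 5.0, "token_count": 12},
]

# Negate the score so that min() prefers higher scores, then fewer tokens.
best = min(rows, key=lambda r: (-r["score_avg"], r["token_count"]))
print(best["prompt"])  # prompt C
```

On the DataFrame used in this notebook, the equivalent would be `df.sort_values(["summary-score-avg", "prompt-token-count"], ascending=[False, True]).iloc[0]`.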