## Problem Statement 3: Prompt Engineering
Problem: Design and evaluate prompts to improve the performance of a given AI model on a specific task (e.g., summarization, question answering).

Requirements:
* Experiment with different prompt designs.
* Evaluate the effectiveness of each prompt using appropriate metrics.

Evaluation Criteria:
* Creativity and effectiveness of prompt designs.
* Use of proper evaluation metrics.
* Clear explanation and documentation of the process and results.


In [None]:
pip install transformers datasets rouge-score


# Loading Model and Tokenizer

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

# Load the model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Initialize summarization pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)


## Model description
BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). This particular checkpoint has been fine-tuned on CNN Daily Mail, a large collection of text-summary pairs.
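
To make the pre-training idea concrete, here is a minimal sketch of step (1) under a strong simplification: each whitespace-separated word is independently replaced by a mask placeholder. The `mask_tokens` helper and its `mask_prob` and `seed` parameters are illustrative assumptions, not BART's actual noising code, which operates on subword tokens and uses several corruption schemes.

```python
import random

MASK = "<mask>"  # placeholder string used for this illustration

def mask_tokens(text, mask_prob=0.3, seed=0):
    """Toy noising function: mask each whitespace token with probability mask_prob."""
    rng = random.Random(seed)  # seeded so the corruption is repeatable
    tokens = text.split()
    return " ".join(MASK if rng.random() < mask_prob else tok for tok in tokens)

original = "BART learns to reconstruct the original text from a corrupted version"
corrupted = mask_tokens(original)

# During pre-training the model would read `corrupted` and be trained to emit `original`.
print("input :", corrupted)
print("target:", original)
```

In step (2), the corrupted string plays the role of the encoder input and the original string is the sequence the decoder learns to reproduce.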

# Defining Prompts and Texts

Here is how to use this model with the pipeline API:



In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]


In [None]:
# Define the prompts (for summarization, prompts may not be as explicit but can be varied in structure)
prompts = [
    "Summarize the following text into a brief overview:",
    "Extract and present the key points from the text:",
]

# Sample text
text = """
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines designed to think and learn like humans. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
"""

# Generating Summaries

In [None]:
def generate_summary(prompt, text):
    input_text = f"{prompt} {text}"
    summary = summarizer(input_text, max_length=100, min_length=30, do_sample=False)
    return summary[0]['summary_text']

# Generate summaries for each prompt
summaries = [generate_summary(prompt, text) for prompt in prompts]

for i, summary in enumerate(summaries):
    print(f"Summary for prompt {i + 1}:\n{summary}\n")

Your max_length is set to 100, but your input_length is only 99. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=49)


Summary for prompt 1:
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines designed to think and learn like humans. Leading AI textbooks define the field as the study of "intelligent agents"

Summary for prompt 2:
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines designed to think and learn like humans. Leading AI textbooks define the field as the study of "intelligent agents"
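
The warning printed before the summaries notes that `max_length` (100) exceeds the input length (99 tokens). One possible way to act on it is to derive `max_length` from the input length. The sketch below assumes that roughly half the input length is a reasonable target, which matches the value of 49 suggested in the warning; `suggest_max_length` and its `ratio` parameter are hypothetical names, not part of `transformers`.

```python
def suggest_max_length(input_length, ratio=0.5, min_length=30):
    """Return a max_length below the input length, but never below min_length."""
    return max(min_length, int(input_length * ratio))

# 99 is the input length reported in the warning above.
print(suggest_max_length(99))

# In the notebook, the token count could come from the tokenizer, e.g.:
#   input_length = len(tokenizer(input_text)["input_ids"])
#   summarizer(input_text, max_length=suggest_max_length(input_length),
#              min_length=30, do_sample=False)
```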



# Evaluating Summaries

In [None]:
# Define a reference summary
reference_summary = "AI is the simulation of human intelligence in machines, designed to think and learn like humans."

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

def evaluate_summary(generated_summary, reference_summary):
    scores = scorer.score(reference_summary, generated_summary)
    return scores

# Evaluate each summary
for i, summary in enumerate(summaries):
    scores = evaluate_summary(summary, reference_summary)
    print(f"Evaluation for prompt {i + 1}:")
    print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
    print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}\n")


Evaluation for prompt 1:
ROUGE-1: 0.638
ROUGE-L: 0.638

Evaluation for prompt 2:
ROUGE-1: 0.638
ROUGE-L: 0.638



## ROUGE Scores Interpretation
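ROUGE-1: Measures the overlap of unigrams (single words) between the generated and reference summaries. The value reported here is the F-measure (`fmeasure`), the harmonic mean of unigram precision and recall. A ROUGE-1 score of 0.638 therefore combines how much of the generated summary appears in the reference with how much of the reference is covered by the generated summary; it is not the share of matching words in either text alone.

ROUGE-L: Based on the longest common subsequence (LCS) between the generated and reference summaries, which rewards words that appear in the same order without requiring them to be adjacent. The ROUGE-L score of 0.638 is likewise the F-measure of LCS-based precision and recall.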
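
The following sketch shows how precision, recall, and F-measure arise for both metrics. It is a simplified re-implementation for illustration only: it assumes lowercase whitespace tokenization with no stemming, so its numbers will not match `rouge_score` exactly, and the helper names (`prf`, `rouge1`, `rouge_l`, `lcs_length`) are made up for this example.

```python
from collections import Counter

def prf(match, n_candidate, n_reference):
    """Precision, recall and F1 from a match count (0.0 where undefined)."""
    precision = match / n_candidate if n_candidate else 0.0
    recall = match / n_reference if n_reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rouge1(reference, candidate):
    ref, cand = reference.lower().split(), candidate.lower().split()
    # Clipped unigram overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return prf(overlap, len(cand), len(ref))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    ref, cand = reference.lower().split(), candidate.lower().split()
    return prf(lcs_length(ref, cand), len(cand), len(ref))

reference = "the cat sat on the mat"
candidate = "on the mat the cat sat"  # same words, different order

print("ROUGE-1 (P, R, F1):", rouge1(reference, candidate))
print("ROUGE-L (P, R, F1):", rouge_l(reference, candidate))
```

Because the candidate reuses every reference word but reorders them, the unigram score is perfect while the LCS-based score drops, which is the difference between the two metrics.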

## Analysis
### Consistent Scores:

Both prompts yield the same ROUGE-1 and ROUGE-L scores because the two generated summaries are word-for-word identical, as the printed outputs show. The matching scores therefore reflect identical outputs and do not indicate two different summaries of comparable quality.
### Prompt Effectiveness:

Since the ROUGE scores are identical, neither prompt shows an advantage over the other. For this text, the model's output was not influenced by the choice of prompt. One likely reason is that this checkpoint was fine-tuned on CNN Daily Mail text-summary pairs rather than on instructions, so a prepended instruction is treated as part of the text to summarize.
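
A quick sanity check before comparing scores is to confirm that the prompts produced different outputs at all. The sketch below applies such a check to the two summaries printed earlier in this notebook; `distinct_outputs` is a hypothetical helper name introduced for this example.

```python
def distinct_outputs(outputs):
    """Count distinct outputs after normalizing whitespace."""
    return len({" ".join(o.split()) for o in outputs})

# The two summaries printed above, copied verbatim.
summaries = [
    'Artificial Intelligence (AI) refers to the simulation of human intelligence in machines designed to think and learn like humans. Leading AI textbooks define the field as the study of "intelligent agents"',
    'Artificial Intelligence (AI) refers to the simulation of human intelligence in machines designed to think and learn like humans. Leading AI textbooks define the field as the study of "intelligent agents"',
]

if distinct_outputs(summaries) == 1:
    print("All prompts produced the same summary; metric scores cannot separate them.")
else:
    print(f"{distinct_outputs(summaries)} distinct summaries; comparing scores is meaningful.")
```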



---



# Experiment with new prompts:

In [None]:
from transformers import pipeline

# Load the summarization pipeline with BART
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Example text
text = """
Artificial Intelligence (AI) is a rapidly advancing field of technology that aims to create systems capable of performing tasks that typically require human intelligence. This includes tasks such as understanding natural language, recognizing patterns, and making decisions. AI technologies have the potential to revolutionize industries by automating processes, providing insights from data, and enhancing human capabilities. As AI continues to develop, it raises important questions about ethics, privacy, and the future of work.
"""

# New prompts
prompts = [
    "Provide a brief summary focusing on the main themes:",
    "Summarize the article in two sentences highlighting key points:",
    "Give a concise overview of the key facts and insights from the article:"
]

# Generate summaries for each prompt
for prompt in prompts:
    result = summarizer(f"{prompt} {text}", max_length=100, min_length=30, do_sample=False)
    print(f"Summary for prompt '{prompt}': {result[0]['summary_text']}")


Summary for prompt 'Provide a brief summary focusing on the main themes:': Artificial Intelligence (AI) is a rapidly advancing field of technology. It aims to create systems capable of performing tasks that typically require human intelligence. This includes tasks such as understanding natural language, recognizing patterns, and making decisions.
Summary for prompt 'Summarize the article in two sentences highlighting key points:': Artificial Intelligence (AI) aims to create systems capable of performing tasks that typically require human intelligence. As AI continues to develop, it raises important questions about ethics, privacy, and the future of work.
Summary for prompt 'Give a concise overview of the key facts and insights from the article:': Artificial Intelligence (AI) aims to create systems capable of performing tasks that typically require human intelligence. This includes tasks such as understanding natural language, recognizing patterns, and making decisions. AI technologies 

In [None]:
from rouge_score import rouge_scorer

# Define reference summary
reference_summary = """
Artificial Intelligence (AI) is a rapidly advancing field of technology that aims to create systems capable of performing tasks that typically require human intelligence. AI technologies have the potential to revolutionize industries by automating processes, providing insights from data, and enhancing human capabilities. It raises important questions about ethics, privacy, and the future of work.
"""

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# Evaluate each generated summary
for prompt in prompts:
    generated_summary = summarizer(f"{prompt} {text}", max_length=100, min_length=30, do_sample=False)[0]['summary_text']
    scores = scorer.score(reference_summary, generated_summary)
    print(f"Evaluation for prompt '{prompt}':")
    print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
    print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")

Evaluation for prompt 'Provide a brief summary focusing on the main themes:':
ROUGE-1: 0.543
ROUGE-L: 0.522
Evaluation for prompt 'Summarize the article in two sentences highlighting key points:':
ROUGE-1: 0.682
ROUGE-L: 0.682
Evaluation for prompt 'Give a concise overview of the key facts and insights from the article:':
ROUGE-1: 0.543
ROUGE-L: 0.522
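
To compare the prompts side by side, the scores printed above can be collected and ranked. This is a small sketch that copies the reported F-measures by hand; in the notebook, the same dictionary could be filled inside the evaluation loop instead. The names `results` and `ranked` are chosen for this example.

```python
# ROUGE F-measures copied from the evaluation output above.
results = {
    "Provide a brief summary focusing on the main themes:": {"rouge1": 0.543, "rougeL": 0.522},
    "Summarize the article in two sentences highlighting key points:": {"rouge1": 0.682, "rougeL": 0.682},
    "Give a concise overview of the key facts and insights from the article:": {"rouge1": 0.543, "rougeL": 0.522},
}

# Sort by ROUGE-1, highest first; ties keep their original order.
ranked = sorted(results.items(), key=lambda item: item[1]["rouge1"], reverse=True)

for rank, (prompt, scores) in enumerate(ranked, start=1):
    print(f"{rank}. ROUGE-1={scores['rouge1']:.3f}  ROUGE-L={scores['rougeL']:.3f}  {prompt}")

best_prompt = ranked[0][0]
```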


## Conclusion
### Effectiveness of Prompts:

### Best Prompt:
Prompt B (“Summarize the article in two sentences highlighting key points:”), the second prompt, achieved the highest ROUGE-1 and ROUGE-L scores (0.682 for both), meaning its summary had the greatest word overlap and the longest in-order match with the reference summary.
### Less Effective Prompts:
Prompts A (“Provide a brief summary focusing on the main themes:”) and C (“Give a concise overview of the key facts and insights from the article:”), the first and third prompts, scored lower (0.543 for ROUGE-1 and 0.522 for ROUGE-L), so their summaries were less aligned with the reference summary.
## Prompt Impact:

The results suggest that the specific instructions and structure of Prompt B guided the model toward a summary closer to the reference. Unlike the summary for Prompt A, the Prompt B summary includes the sentence about ethics, privacy, and the future of work, which also appears in the reference. Because the comparison rests on one text and one reference summary, the result is indicative rather than conclusive.

## Summary
Prompt B is recommended because it scored highest on both ROUGE metrics in this experiment.

Prompts A and C were less effective and may benefit from refinement. Further experiments on more texts and reference summaries are encouraged to improve prompt designs and confirm these results.
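
As a starting point for such further experiments, the sketch below averages a score per prompt over several (text, reference) pairs. It is an outline under assumptions, not the evaluation used above: `evaluate_prompts`, `toy_summarize`, and `toy_score` are hypothetical names, the toy functions are stand-ins so the example runs without downloading a model, and the example texts are made-up placeholders.

```python
from statistics import mean

def evaluate_prompts(prompts, examples, summarize_fn, score_fn):
    """Return the mean score per prompt over a list of (text, reference) pairs."""
    results = {}
    for prompt in prompts:
        scores = [
            score_fn(reference, summarize_fn(f"{prompt} {text}"))
            for text, reference in examples
        ]
        results[prompt] = mean(scores)
    return results

# Stand-ins for illustration only (NOT the BART pipeline or ROUGE).
def toy_summarize(input_text):
    return " ".join(input_text.split()[:6])  # keep the first six words

def toy_score(reference, candidate):
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    return len(ref & cand) / len(ref | cand) if ref | cand else 0.0  # Jaccard overlap

# Placeholder data, invented for this sketch.
demo_prompts = ["Summarize:", "Key points:"]
demo_examples = [
    ("AI systems learn patterns from data.", "AI learns from data."),
    ("Prompts are prepended to the input text.", "Prompts are prepended."),
]

demo_results = evaluate_prompts(demo_prompts, demo_examples, toy_summarize, toy_score)
for prompt, score in demo_results.items():
    print(f"{prompt!r}: mean score {score:.3f}")
```

In the notebook, `summarize_fn` could wrap the existing pipeline call, for example `lambda s: summarizer(s, max_length=100, min_length=30, do_sample=False)[0]['summary_text']`, and `score_fn` could return `scorer.score(reference, generated)['rouge1'].fmeasure`.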


---

