# **Natural Language Processing (NLP)**
# **Exploring GPT-3 API through Zero-Shot and Few-Shot Prompting**

### Student: Naif Ganadily
#### Professor Chandra Bhagavatula
### Final Due March 16 by 11:59pm


#### The goal of this assignment is to explore the GPT-3 API through zero-shot and few-shot prompting.

 

Download 5 datasets (COPA, RTE, WSC, ReCoRD and CommitmentBank) from the SuperGLUE dataset.
Implement zero-shot and few-shot (up to 5 examples chosen from the training set) prompting. Try 2 different prompts for each dataset. An example of a prompt:
               Translate the following sentences from English to French:
               
               English: Nice to meet you
               French: Ravi de vous rencontrer
               
               English: This assignment is due in two weeks.
               French:
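
A prompt of this shape can be assembled mechanically from an instruction, a list of solved examples, and the unsolved query. The sketch below is only an illustration: `build_prompt` and its parameter names are hypothetical helpers, not part of the assignment or of the OpenAI library. Passing an empty example list gives the zero-shot version of the same prompt.

```python
def build_prompt(instruction, examples, query, source="English", target="French"):
    """Assemble a zero-shot or few-shot prompt in the layout shown above.

    `examples` is a list of (source_text, target_text) pairs; an empty list
    produces a zero-shot prompt. All names here are illustrative assumptions.
    """
    blocks = [f"{source}: {src}\n{target}: {tgt}" for src, tgt in examples]
    # The final block leaves the target side empty for the model to complete.
    blocks.append(f"{source}: {query}\n{target}:")
    return instruction + "\n\n" + "\n\n".join(blocks)


prompt = build_prompt(
    "Translate the following sentences from English to French:",
    [("Nice to meet you", "Ravi de vous rencontrer")],
    "This assignment is due in two weeks.",
)
print(prompt)
```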
            
The few-shot training examples must be drawn from the training set. This can be done in three ways:
1.  Fixed: Choose training examples once and use them for all test instances.
2.  Random: For each test instance, randomly select N training examples and use them in the prompt.
3.  Relevant: Find training examples most similar to the test instance (use text similarity AND embedding-based similarity).
Evaluate and report results across all the settings. 
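
The "Relevant" strategy asks for text similarity as well as embedding similarity. One possible way to cover the text-similarity half, sketched here under the assumption that a character-level match ratio is an acceptable measure, is the standard library's `difflib.SequenceMatcher`. The helper names `text_similarity` and `most_similar_by_text`, the `key` parameter, and the toy sentences are all made up for illustration.

```python
import difflib


def text_similarity(a, b):
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_similar_by_text(query, train_examples, n, key="premise"):
    """Return the n training examples whose `key` field is closest to `query`."""
    ranked = sorted(
        train_examples,
        key=lambda example: text_similarity(query, example[key]),
        reverse=True,
    )
    return ranked[:n]


# Made-up sentences, used only to show the ranking.
toy_train = [
    {"premise": "Stock prices fell sharply"},
    {"premise": "The cat sat on a mat"},
    {"premise": "A cat is sitting on the mat"},
]
for example in most_similar_by_text("The cat sat on the mat", toy_train, n=2):
    print(example["premise"])
```

This measure is cheap and needs no model, but it compares surface strings only, so paraphrases with little character overlap score low.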
 

Relevant links:

1. Login to OpenAI GPT3 Access: https://openai.com/api/login/

2. OpenAI Documentation: https://platform.openai.com/docs/guides/completion

3. SuperGLUE Dataset: https://super.gluebenchmark.com/tasks

4. Code Examples: https://github.com/openai/openai-python

 

What to submit:

Code (either an executable Python script OR Jupyter Notebook). Please remove your API Keys from the submission.
A report (in PDF) reporting evaluation metrics on the five datasets across different settings described above. 
 

Due: March 16th.

In [None]:
!pip install jsonlines
!pip install openai
!pip install sentence_transformers

Collecting jsonlines
  Downloading jsonlines-3.1.0-py3-none-any.whl (8.6 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-3.1.0
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)

In [None]:
import itertools

In [None]:
import numpy as np
import pandas as pd
import jsonlines
import openai
import time
import warnings
warnings.simplefilter(action='ignore', category=Warning)
import random
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
from sentence_transformers import SentenceTransformer



In [None]:
# Define the paths to your JSONL files
datasets = {
    'CB': {
        'train': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/CB/train.jsonl',
        'test': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/CB/test.jsonl',
        'val': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/CB/val.jsonl'
    },
    'COPA': {
        'train': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/COPA/train.jsonl',
        'test': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/COPA/test.jsonl',
        'val': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/COPA/val.jsonl'
    },
    'RTE': {
        'train': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/RTE/train.jsonl',
        'test': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/RTE/test.jsonl',
        'val': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/RTE/val.jsonl'
    },
    'WSC': {
        'train': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/WSC/train.jsonl',
        'test': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/WSC/test.jsonl',
        'val': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/WSC/val.jsonl'
    },
    'ReCoRD': {
        'train': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/ReCoRD/train.jsonl',
        'test': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/ReCoRD/test.jsonl',
        'val': '/content/gdrive/MyDrive/NLP Projects/Assignment 2 NLP/ReCoRD/val.jsonl'
    },
    # Add the other datasets in the same format
}

data = {}

# Load the data from the JSONL files into the data dictionary
for dataset_name, paths in datasets.items():
    data[dataset_name] = {}
    for split, path in paths.items():
        data[dataset_name][split] = []
        with jsonlines.open(path) as reader:
            for obj in reader:
                data[dataset_name][split].append(obj)
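
The five tasks do not share one record layout: in the SuperGLUE files, CB and RTE records carry `premise` and `hypothesis`, while COPA, WSC and ReCoRD use other fields. A quick look at the field names before building prompts can catch a mismatch early. The helper below is a hypothetical sketch; `fields_by_dataset` is not part of the notebook, and the toy records are made up to mimic the file layout.

```python
def fields_by_dataset(data, split="train"):
    """Return the sorted top-level field names of the first record per dataset."""
    summary = {}
    for name, splits in data.items():
        records = splits.get(split, [])
        summary[name] = sorted(records[0].keys()) if records else []
    return summary


# Made-up miniature records shaped like the SuperGLUE files (illustration only).
toy_data = {
    "RTE": {"train": [{"premise": "p", "hypothesis": "h", "label": "entailment", "idx": 0}]},
    "COPA": {"train": [{"premise": "p", "choice1": "a", "choice2": "b",
                        "question": "cause", "label": 0, "idx": 0}]},
}
for name, fields in fields_by_dataset(toy_data).items():
    print(name, fields)
```

Calling `fields_by_dataset(data)` on the dictionary loaded above would list the real fields of each task.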


In [None]:
openai.api_key = ""
model = SentenceTransformer('paraphrase-distilroberta-base-v1')


In [None]:
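_fixed_cache = {}      # 'fixed' examples, sampled once per (training set, n)
_embedding_cache = {}  # training-set embeddings, computed once per training set

def get_few_shot_examples(dataset, n, method='fixed', test_instance=None):
    key = id(dataset)
    if method == 'fixed':
        # Sample once, then reuse the same examples for every test instance.
        if (key, n) not in _fixed_cache:
            _fixed_cache[(key, n)] = random.sample(dataset, n)
        return _fixed_cache[(key, n)]
    elif method == 'random':
        # Draw a fresh sample for each test instance.
        return random.sample(dataset, n)
    elif method == 'relevant':
        if test_instance is None:
            raise ValueError("test_instance must be provided for 'relevant' method.")
        # Encode the training premises once; with normalized vectors the inner product is the cosine similarity.
        if key not in _embedding_cache:
            _embedding_cache[key] = model.encode([x['premise'] for x in dataset], normalize_embeddings=True)
        # Encode the premise text of the test instance, not the whole record dict.
        test_embedding = model.encode([test_instance['premise']], normalize_embeddings=True)
        similarity_scores = np.inner(test_embedding, _embedding_cache[key])[0]
        relevant_indices = np.argsort(similarity_scores)[-n:]
        return [dataset[i] for i in relevant_indices]
    else:
        raise ValueError("Invalid method. Choose from 'fixed', 'random', or 'relevant'.")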


def gpt3_predict(test_instance, few_shot_examples, prompt_template):
    few_shot_str = "\n".join([f"{x['premise']} -> {x['hypothesis']} ({x['label']})" for x in few_shot_examples if 'label' in x])
    prompt = prompt_template.format(few_shot_str, test_instance['premise'], test_instance['hypothesis'])
    return gpt3_complete(prompt)
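
Ranking by inner product agrees with ranking by cosine similarity only when the vectors have unit length, which matters for the 'relevant' selection. The toy sketch below, with made-up two-dimensional vectors and a hypothetical helper `top_k_by_cosine`, shows a case where the two rankings disagree.

```python
import numpy as np


def top_k_by_cosine(query_vec, candidate_vecs, k):
    """Return (indices, scores) of the k candidates with the highest cosine similarity."""
    query = np.asarray(query_vec, dtype=float)
    candidates = np.asarray(candidate_vecs, dtype=float)
    # Scale every vector to unit length so the dot product equals the cosine.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()


# Made-up vectors: the first one is long but points in a different direction.
vectors = [[10.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
query = [1.0, 1.0]

raw_best = int(np.argmax(np.inner(vectors, query)))
cosine_best, _ = top_k_by_cosine(query, vectors, k=1)
print("best by raw inner product:", raw_best)
print("best by cosine similarity:", cosine_best[0])
```

The raw inner product favors the long first vector, while cosine similarity picks the vector that points the same way as the query.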

In [None]:
def gpt3_complete(prompt):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=50,
        n=1,
        stop=None,
        temperature=0.5,
    )

    return response.choices[0].text.strip()
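
A long evaluation run makes many API calls, and a single transient failure would otherwise stop the whole loop. One common pattern is to retry a failed call with exponentially growing pauses. The wrapper below is a generic sketch, not part of the OpenAI library: `call_with_retries` and its parameters are hypothetical, and it catches every exception for brevity, which a real run would probably narrow to rate-limit and connection errors.

```python
import time


def call_with_retries(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on failure with exponential backoff.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises the last error once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice before succeeding (illustration only).
attempts = {"count": 0}

def flaky_call(prompt):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"completion for: {prompt}"

# The sleep function is replaced so the demonstration does not actually wait.
print(call_with_retries(flaky_call, "hello", sleep=lambda seconds: None))
```

In the notebook this could wrap the completion call, for example `call_with_retries(gpt3_complete, prompt)`.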

In [None]:
import difflib
import pandas as pd
import time
from sklearn.metrics import confusion_matrix, classification_report

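def extract_entailment_decision(prediction):
    prediction = prediction.lower()
    # Check negated forms first: "does not entail" also contains "entail".
    if any(neg in prediction for neg in ('not entail', 'not_entail', "n't entail")):
        return 'not_entailment'
    elif 'entail' in prediction:
        return 'entailment'
    elif 'contradict' in prediction:
        return 'contradiction'
    elif 'neutral' in prediction:
        return 'neutral'
    else:
        return 'unknown'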


methods = ['zero_shot', 'fixed', 'random', 'relevant']
prompt_templates = [
    "{}\nGiven the premise: {}\nAnd the hypothesis: {}\nDoes it entail or not entail?",
    "{}\nBased on the information: {}\nAnd considering the hypothesis: {}\nIs it entailed or not entailed?",
]

# Example prompt
example_prompt = (
    "Given the premise: 'If someone eats a hot pepper, their mouth will feel hot.'\n"
    "And the hypothesis: 'Jenny ate a hot pepper, so her mouth feels hot.'\n"
    "Does it entail or not entail?"
)
print("Example prompt for GPT-3:")
print(example_prompt)

results = []
print("-----------------------------------------------------")
print(" ")
print(" ")
print(" ")
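for dataset_name, dataset_splits in data.items():
    # The templates above need 'premise' and 'hypothesis' fields, which only CB and RTE provide.
    if dataset_name not in ('CB', 'RTE'):
        continue
    for split_name, dataset in dataset_splits.items():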
        for method in methods:
            for prompt_index, prompt_template in enumerate(prompt_templates):
                correct = 0
                total = 0
                debug_count = 0
                true_labels = []  # Initialize the true_labels list
                predicted_labels = []  # Initialize the predicted_labels list

                for test_instance in dataset:
                    if method != 'zero_shot':
                        few_shot_examples = get_few_shot_examples(data[dataset_name]['train'], 5, method=method, test_instance=test_instance)
                        few_shot_str = '\n'.join([f"({i})\nGiven the premise: {example['premise']}\nAnd the hypothesis: {example['hypothesis']}\n{example['label']}" for i, example in enumerate(few_shot_examples)])
                        prompt = prompt_template.format(few_shot_str, test_instance['premise'], test_instance['hypothesis'])
                    else:
                        prompt = prompt_template.format("", test_instance['premise'], test_instance['hypothesis'])
                    
                    prediction = gpt3_complete(prompt)
                    extracted_prediction = extract_entailment_decision(prediction)

                    if 'label' in test_instance:
                        true_labels.append(test_instance['label'].lower())  # Update true_labels list
                        predicted_labels.append(extracted_prediction)  # Update predicted_labels list
                        if extracted_prediction == test_instance['label'].lower():
                            correct += 1
                        elif debug_count < 5:
                            print(f"Correct label: {test_instance['label']}")
                            print(f"Predicted label: {prediction}\n")
                            debug_count += 1
                    else:
                        print(f"Prompt {prompt_index + 1}: {prompt}")
                        print(f"Prediction: {prediction}\n")
                        
                    total += 1
                    
                    # Add a sleep statement to pause between API calls
                    time.sleep(1)

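                if true_labels:  # score only splits that carry gold labels
                    # 'not_entailment' is RTE's negative label; CB uses 'contradiction' and 'neutral'.
                    label_names = ['entailment', 'not_entailment', 'contradiction', 'neutral', 'unknown']
                    cm = confusion_matrix(true_labels, predicted_labels, labels=label_names)
                    print(f"\nConfusion Matrix:\n{cm}")

                    report = classification_report(true_labels, predicted_labels, labels=label_names, zero_division=0)
                    print(f"\nClassification Report:\n{report}")

                # Guard against an empty split before dividing.
                accuracy = correct / total if total else 0.0
                print(f"{dataset_name} - {split_name} - {method} - Prompt {prompt_index + 1}\nAccuracy: {accuracy}")
                # Keep one row per configuration so the runs can be compared afterwards.
                results.append({'dataset': dataset_name, 'split': split_name, 'method': method,
                                'prompt': prompt_index + 1, 'accuracy': accuracy})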

Example prompt for GPT-3:
Given the premise: 'If someone eats a hot pepper, their mouth will feel hot.'
And the hypothesis: 'Jenny ate a hot pepper, so her mouth feels hot.'
Does it entail or not entail?
-----------------------------------------------------
 
 
 
Correct label: contradiction
Predicted label: The hypothesis does not entail the premise.

Correct label: contradiction
Predicted label: The hypothesis does not entail that eliminating all witnesses would have needed much persuasion.

Correct label: contradiction
Predicted label: The hypothesis does not entail that any of the three kings stands a chance of ever making a comeback with him.

Correct label: contradiction
Predicted label: The hypothesis does not entail the premise.

Correct label: contradiction
Predicted label: It does not entail.


Confusion Matrix:
[[114   0   0   1]
 [117   0   0   2]
 [ 16   0   0   0]
 [  0   0   0   0]]

Classification Report:
               precision    recall  f1-score   support

   entail

KeyboardInterrupt: 
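
The partial output above can still be read by hand. In the matrix, rows are true labels and columns are predicted labels, in the order passed to `confusion_matrix` (entailment, contradiction, neutral, unknown). The sketch below copies the printed numbers and derives the totals from them; it also checks how a plain substring test treats one of the replies shown in the debug lines. Nothing here calls the API.

```python
# Confusion matrix copied from the output above
# (rows = true label, columns = predicted label).
labels = ["entailment", "contradiction", "neutral", "unknown"]
cm = [
    [114, 0, 0, 1],
    [117, 0, 0, 2],
    [16, 0, 0, 0],
    [0, 0, 0, 0],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

# Column sums show how often each label was predicted.
predicted_counts = {
    label: sum(row[j] for row in cm) for j, label in enumerate(labels)
}

print(f"instances: {total}, correct: {correct}, accuracy: {accuracy:.3f}")
print(predicted_counts)

# A negated reply still contains the substring "entail".
reply = "It does not entail."
print("entail" in reply.lower())
```

Of the 250 instances, 247 were mapped to 'entailment' and 3 to 'unknown', so every correct prediction comes from the entailment row. Because a reply such as "It does not entail." still contains the substring "entail", a substring-based label extractor would map negated replies to 'entailment' as well.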

In [None]:
def plot_confusion_matrix(cm, classes, title='Confusion Matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


In [None]:
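# Uses true_labels and predicted_labels left over from the last configuration evaluated above.
# 'not_entailment' is included so that RTE's negative label is not dropped from the matrix.
classes = ['entailment', 'not_entailment', 'contradiction', 'neutral', 'unknown']
cm = confusion_matrix(true_labels, predicted_labels, labels=classes)
print(f"\nConfusion Matrix:\n{cm}")

# Plot the confusion matrix
plt.figure()
plot_confusion_matrix(cm, classes=classes)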
plt.show()
