# Auto Big-bench
copyright 2023, Denis Rothman

[Big-bench](https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/README.md) contains more than 200 NLP tasks. Its goal is to evaluate a language model.

In this notebook, we take GPT-4 a step further. We will not ask GPT-4 to solve an existing Big-bench NLP problem and apply metrics. Instead, we will ask GPT-4 to create the task examples itself and solve them!

*The next generation of AI might be able to evaluate and benchmark itself.*

The program will feed GPT-4 a sample of 140+ Big-bench tasks with a two-part prompt:   

**The first part contains the instruction:**    
1. Explain the following task
2. Provide an example Solve it:  

**The second part is the description of a Big-bench task:**  
Given a narrative, choose the most related proverb  
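
A minimal sketch of how such a two-part prompt could be assembled. The full instruction text stored in `tasks.txt` is not shown in this notebook, so the `INSTRUCTION` string below is an assumption based on the description above, and `build_prompt` is a hypothetical helper. Only the `Solve it:` marker is taken from the notebook, which later splits each prompt on it.

```python
# Assumed instruction text; the real prompts are stored in tasks.txt.
INSTRUCTION = "1.Explain the following task 2.Provide an example Solve it:"

def build_prompt(description):
    """Join the fixed instruction (part 1) and one task description (part 2)."""
    return f"{INSTRUCTION} {description.strip()}"

prompt = build_prompt("Given a narrative, choose the most related proverb")
print(prompt)
```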

**The output will then be displayed for human evaluation:**  
Human evaluation plays an important role in LLM training and evaluation. Reinforcement Learning from Human Feedback (RLHF) can help mitigate the limits of automated models and evaluation metrics.

**Limit of the program:** The program does not run thousands of samples for each task. The goal is to show the potential of Large Language Models (LLMs).

**Potential:** We can see that GPT-4, PaLM 2, and other Foundation Models are just the beginning of what will become *Massive Multitask Language Understanding (MMLU)* models in one form or another in the years to come.



# Retrieve the list of Big-bench prompts designed for this notebook

The list was created from the task list of [Big-bench](https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/README.md).

In [None]:
#Development access to delete when going into production
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
with open('drive/MyDrive/files/github.txt', 'r') as f:
    github_token = f.read().strip()

In [None]:
!curl -H 'Authorization: token {github_token}' -L https://raw.githubusercontent.com/Denis2054/Transformers_3rd_Edition/master/Chapter15/tasks.txt --output "tasks.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17838  100 17838    0     0  53960      0 --:--:-- --:--:-- --:--:-- 54054


# Read the file into a pandas DataFrame

In [None]:
import pandas as pd

# read the file
df = pd.read_csv('tasks.txt', header=None, on_bad_lines='skip')

# If you want to add a column name after loading
df.columns = ['Tasks']

# print the dataframe
df

Unnamed: 0,Tasks
0,1.Explain the following task 2.Provide an exam...
1,1.Explain the following task 2.Provide an exam...
2,1.Explain the following task 2.Provide an exam...
3,1.Explain the following task 2.Provide an exam...
4,1.Explain the following task 2.Provide an exam...
...,...
139,1.Explain the following task 2.Provide an exam...
140,1.Explain the following task 2.Provide an exam...
141,1.Explain the following task 2.Provide an exam...
142,1.Explain the following task 2.Provide an exam...
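
Each prompt is free text that may contain commas, so a CSV parser can split or skip some lines. A hedged alternative, assuming `tasks.txt` holds one prompt per line, is to read the file line by line. `load_tasks` is a hypothetical helper that is not part of the notebook, and the demo file below only stands in for `tasks.txt`.

```python
import os
import tempfile
from pathlib import Path

def load_tasks(path):
    """Return one prompt per non-empty line, without any CSV parsing."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

# Demo on a small temporary file standing in for tasks.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_tasks.txt")
    Path(demo_path).write_text("Task A, with a comma\n\nTask B\n", encoding="utf-8")
    demo_tasks = load_tasks(demo_path)

print(demo_tasks)
```

The resulting list can be wrapped with `pd.DataFrame({'Tasks': demo_tasks})` so that the rest of the notebook stays unchanged.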


In [None]:
nbt=len(df)
print("Number of tasks: ", nbt)

Number of tasks:  144


# Install OpenAI

In [None]:
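#Importing openai
try:
  import openai
except ImportError:
  # This notebook uses the legacy openai.ChatCompletion interface,
  # which was removed in openai 1.0, so an earlier version is installed.
  !pip install "openai<1.0"
  import openai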

In [None]:
#API Key
#Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to somebody next to you)
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()  # strip the trailing newline

In [None]:
#The OpenAI Key
import os
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
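
Outside Colab, the key may already be set in the environment. The sketch below, under that assumption, prefers an environment variable and falls back to the first line of a file. `load_api_key` and the demo variable name are hypothetical and only illustrate the lookup order.

```python
import os
import tempfile

def load_api_key(path, env_var="OPENAI_API_KEY"):
    """Prefer an environment variable; otherwise read the first line of a file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    with open(path, "r") as f:
        return f.readline().strip()

# Demo with a throwaway file and a variable name that is assumed to be unset.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("sk-demo-key\n")
demo_key = load_api_key(tmp.name, env_var="DEMO_UNSET_VARIABLE")
os.remove(tmp.name)
print(demo_key)
```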

# Defining the role of the model

In [None]:
import openai

gptmodel="gpt-4" # or select gpt-3.5-turbo

def openai_chat(input_text):
    response = openai.ChatCompletion.create(
        model=gptmodel,
        messages=[
            {"role": "system", "content": "You are an expert Natural Language Processing exercise expert."},
            {"role": "assistant", "content": "1.You can explain any NLP task. 2.Create an example 3.Solve the example"},
            {"role": "user", "content": input_text}
        ],
        temperature=0.1  # Add the temperature parameter here and other parameters you need
    )
    return response['choices'][0]['message']['content']
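
Building the message list in a separate function makes it possible to inspect the conversation before any paid API call is made. This is only a sketch: `build_messages` is a hypothetical helper, and the message contents are copied from `openai_chat` above.

```python
def build_messages(input_text):
    """Return the three-message conversation sent for each task."""
    return [
        # The system message sets the overall role of the model.
        {"role": "system", "content": "You are an expert Natural Language Processing exercise expert."},
        # The assistant message primes the expected three-step behavior.
        {"role": "assistant", "content": "1.You can explain any NLP task. 2.Create an example 3.Solve the example"},
        # The user message carries the complete prompt read from tasks.txt.
        {"role": "user", "content": input_text},
    ]

messages = build_messages("Given a narrative, choose the most related proverb")
for m in messages:
    print(m["role"], "->", m["content"][:40])
```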

# Displaying the response of the model

In [None]:
# Create an empty output file before running the tasks, because
# the API runs on a pay-as-you-go policy.
html_file = open("output.html", "w")
html_file.close()

In [None]:
from IPython.display import display, HTML
def display_response(input_text, response, bb_task):
  html_content = f"""
  <!DOCTYPE html>
  <html>
  <head>
      <title>Big-bench Tasks</title>
      <style>
        p {{
            max-width: 600px;
        }}
    </style>
  </head>
  <body>
      <h1>{bb_task}</h1>
      <p>{response}</p>
  </body>
  </html>
  """

  # Display the page, then append it to output.html
  display(HTML(html_content))
  with open("output.html", "a") as html_file:
    html_file.write(html_content)
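
Model output is inserted into the page as raw HTML, so characters such as `<` or `&` in a response could break the layout. A minimal sketch, assuming the response should be shown as plain text, escapes it with the standard library before converting newlines. `to_safe_html` is a hypothetical helper.

```python
import html

def to_safe_html(text):
    """Escape HTML special characters, then turn newlines into <br> tags."""
    return html.escape(text).replace("\n", "<br>")

sample = "Use <b> for bold & keep going.\nSecond line."
print(to_safe_html(sample))
```

Escaping must happen before the newline replacement. Otherwise the inserted `<br>` tags would be escaped as well.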

# Running the tasks

Check OpenAI's policy for rate limits before running the tasks:
https://platform.openai.com/docs/guides/rate-limits/overview
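
A fixed pause is one way to stay under a rate limit. Another common pattern is to retry a failed call with exponential backoff. The sketch below is generic and not tied to any OpenAI class: `call_with_backoff` and `flaky` are hypothetical names, and catching every exception is a simplifying assumption. In practice, the `except` clause would be narrowed to the rate-limit error of the installed library version.

```python
import time

def call_with_backoff(func, *args, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}

def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return text.upper()

waits = []  # record the delays instead of really sleeping
result = call_with_backoff(flaky, "ok", sleep=waits.append)
print(result, waits)
```

In the notebook, the same wrapper could be applied as `call_with_backoff(openai_chat, input_text)`.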


In [None]:
import time

counter = 0
nb_tasks = nbt
for index, row in df.iterrows():
    input_text = row['Tasks']                 # the complete prompt
    counter += 1                              # task counter
    if counter > nb_tasks:
        break                                 # nb of tasks
    task = openai_chat(input_text)            # model call
    task = task.replace('\n', '<br>')         # formatting the output
    parts = input_text.split('Solve it:')     # extracting the task from the input
    bb_task = parts[1].strip()                # removing surrounding whitespace
    display_response(input_text, task, bb_task) # displaying the task and response

    if counter % 50 == 0:                     # if the counter is divisible by 50
        print(f"Processed {counter} tasks. Pausing for 60 seconds.")
        time.sleep(60)                        # pause for 60 seconds

Processed 50 tasks. Pausing for 60 seconds.


Processed 100 tasks. Pausing for 60 seconds.
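
The loop above assumes every prompt contains the `Solve it:` marker, and `parts[1]` raises an `IndexError` when it does not. A hedged sketch of a more tolerant extraction uses `str.partition`. `extract_task` is a hypothetical helper, and falling back to the whole prompt is an assumption about the desired behavior.

```python
def extract_task(prompt, marker="Solve it:"):
    """Return the text after the marker, or the whole prompt if it is absent."""
    _, found, tail = prompt.partition(marker)
    return tail.strip() if found else prompt.strip()

with_marker = extract_task("1.Explain the following task Solve it: Given a narrative, choose the most related proverb")
without_marker = extract_task("  A prompt without the marker  ")
print(with_marker)
print(without_marker)
```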
