### Upgrade pip

(The previous intern had this, not 100% sure when it's needed)

In [18]:
%pip install --upgrade pip 

Note: you may need to restart the kernel to use updated packages.


### Format the Google BigQuery query

It's strongly recommended that you first test your query in Google's BigQuery Sandbox (online). It returns results ~100 times faster, will yell at you pre-runtime if you introduce syntax errors, and also shows a preview of expected data usage for the given query.

In [19]:
QUERY = """
SELECT
  q.id,
  q.title,
  q.body,
  q.accepted_answer_id,
  q.view_count,
  q.tags,
  q.answer_count,
  q.score AS question_score,
  a.score AS answer_score,
  a.body AS stackoverflow_answer
FROM
  bigquery-public-data.stackoverflow.posts_questions q
LEFT JOIN
  bigquery-public-data.stackoverflow.posts_answers a
ON
  q.accepted_answer_id = a.id
WHERE
  q.answer_count > 0
  AND q.accepted_answer_id > 0
  AND EXTRACT(YEAR FROM q.creation_date) >= 2022
  AND a.score >= 0
"""

### Fetch dataset using Google BigQuery

Authentication info: https://googleapis.dev/python/google-api-core/latest/auth.html (the query will fail if you don't set up authentication first).

To use server-side sampling with BigQuery (culls data server-side, reducing download size), use `TABLESAMPLE SYSTEM` e.g.:

`bigquery-public-data.stackoverflow.posts_questions q TABLESAMPLE SYSTEM (10 PERCENT)`

`bigquery-public-data.stackoverflow.posts_answers a TABLESAMPLE SYSTEM (50 PERCENT)`

(Be aware that sampling both questions and answers at less than 100% can, if unlucky with rng, leave no matching pairs between the two, causing the query to return no results.)
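A minimal sketch of how a sampled table reference could be slotted into the `FROM`/`JOIN` part of the query. The helper name `sampled_table` and the 10% figure are illustrative assumptions; only the `TABLESAMPLE SYSTEM (n PERCENT)` clause itself comes from the text above. Here only the questions are sampled, so every accepted answer is still available to join against.

```python
def sampled_table(table, alias, percent):
    """Return a table reference with a TABLESAMPLE clause (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return f"{table} {alias} TABLESAMPLE SYSTEM ({percent} PERCENT)"

# sample only the questions; leave the answers table unsampled
questions_ref = sampled_table("bigquery-public-data.stackoverflow.posts_questions", "q", 10)
answers_ref = "bigquery-public-data.stackoverflow.posts_answers a"

from_clause = (
    "FROM\n"
    f"  {questions_ref}\n"
    "LEFT JOIN\n"
    f"  {answers_ref}\n"
    "ON\n"
    "  q.accepted_answer_id = a.id"
)
print(from_clause)
```

The printed fragment would replace the `FROM ... ON` lines of `QUERY`; the `SELECT` and `WHERE` parts stay unchanged.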

In [20]:
%pip install db-dtypes
import db_dtypes
import json
import os
import pandas as pd
from pathlib import Path
from google.cloud import bigquery as bq

# get project name from local secrets.json file
def load_project_ID(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["BQ_PROJECT_ID"]

# load a saved copy of the dataset from local disk (if it exists), otherwise run query and save it
PICKLED_RESULT = "dataset_pickled.pkl"  # the file name of the saved dataset (saved on / loaded from local disk)
cwd = Path().absolute()                 # current working directory (note: possibly different from execution directory)

# see if we have a copy of the dataset on local disk; if we do load that
try:
    dataset_path = os.path.join(cwd, PICKLED_RESULT)
    results = pd.read_pickle(dataset_path)
    print("Saved copy of dataset loaded from local disk.")

# if not, run the query (tested runtime was up to 1 hour for full query!)
except FileNotFoundError:
    print("Saved copy of dataset (" + PICKLED_RESULT + ") not found - running query...")

    # prepare query
    client = bq.Client(project=load_project_ID())

    # run the query and save result to a dataframe (QUERY has no placeholders, so no .format() call is needed)
    results = client.query(QUERY).result().to_dataframe()
    print("Dataset has been downloaded...")

    # pickle the dataframe for persistent copy on local storage (same path the load above checks)
    results.to_pickle(dataset_path)
    print("Dataset has been pickled and saved to local storage. File name: " + PICKLED_RESULT)

# yell if there's an *unexpected* error (chained so the original traceback is kept)
except Exception as err:
    raise RuntimeError("Error while loading dataset.") from err

# dump some extra info
print("Number of questions:", len(results))
results

Note: you may need to restart the kernel to use updated packages.
Saved copy of dataset loaded from local disk.
Number of questions: 383845


Unnamed: 0,id,title,body,accepted_answer_id,view_count,tags,answer_count,question_score,answer_score,stackoverflow_answer
0,73688486,"Android Studio Error ""requires libraries and a...",<p>All librabries are updated.</p>\n<p><strong...,73688665,1342,android|xml|kotlin|gradle|mobile,2,7,7,<p><strong>There are 2 ways to fix this error:...
1,70601508,Can I use Java 16 record with JPA entity?,<p>I am trying to do something similar like be...,70601646,2096,java|spring-boot|jpa|spring-data-jpa,1,7,9,"<p>See the article, <a href=""https://wkorando...."
2,72056470,streamWriter does not apply changes until I re...,<p>I'm making a game in Unity with an editor l...,72056589,49,c#|unity3d|streamwriter,1,2,6,<p>You just need to refresh the assets in Unit...
3,72336177,Error: req#logout requires a callback function,<p>can't able to find solution of this tried e...,72336399,13559,javascript|express|passport.js|google-authenti...,6,30,64,<p>Since version 0.6.0 (which was released onl...
4,71941928,How to transfer ERC20 tokens to another addres...,"<p>I create ERC20 tokens, and i want to transf...",71943184,3360,solidity|polygon|erc20,1,1,8,<p>Your logic is wrong!\nIf you want to send a...
...,...,...,...,...,...,...,...,...,...,...
383840,71787300,variable declaration in while loop,<pre><code>using namespace std;\n\nclass Count...,71787493,87,c++,2,1,5,<p>The <code>&lt;=</code> operator has a <a hr...
383841,72910589,Sort a nested List on two elements,"<p>I have a <em>nested list</em>, named <code>...",72910898,60,java|list|sorting|nested-lists|comparator,3,1,5,<h1>Use the Power of Objects</h1>\n<p>The way ...
383842,72442808,Google play integrity api issue,<p>I am migrating my app from Google play safe...,72655771,1644,android|google-api,2,2,5,"<p>I've seen that problem, and for me the solu..."
383843,72625732,How to convert Option[List[String]] to List[St...,<p>I am having <code>Option</code> of <code>Li...,72626052,82,scala|playframework|sbt|implicit,2,2,5,<p>You should check the documentation for <a h...


### Cull / filter dataset

For the demo we're just arbitrarily culling the size of the dataset to make it more manageable, but you could also filter for other reasons such as focusing on a specific tag, or sampling based on answer and/or question scores.

A pandas dataframe can be sampled either:
* Using a fractional value: e.g., ``.sample(frac=0.01)`` will result in a number of samples equivalent to 1% of the dataset.
* Using an integer value: e.g., ``.sample(n=1000)`` will result in 1000 samples from the dataset.
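A small sketch of both sampling styles plus a tag filter, run on a made-up stand-in for the `results` dataframe (the rows below are invented for illustration only). Passing `random_state` is an optional extra not used in this notebook: it fixes the seed so the same rows come back on every run.

```python
import pandas as pd

# tiny stand-in for the real `results` dataframe (made-up rows, illustration only)
demo = pd.DataFrame({
    "id": list(range(1, 11)),
    "tags": ["python|pandas", "java", "python", "r", "c#|unity3d",
             "python|bash", "go|docker", "r", "javascript", "python|numpy"],
    "question_score": [3, 0, 7, 1, 2, 5, 0, 4, 1, 9],
})

# a fixed seed makes the sample repeatable between notebook runs
by_fraction = demo.sample(frac=0.5, random_state=42)  # half of the 10 rows
by_count = demo.sample(n=3, random_state=42)          # exactly 3 rows

# filter first, then sample: keep only questions carrying the exact tag "python"
has_python = demo["tags"].str.split("|").apply(lambda tags: "python" in tags)
python_only = demo[has_python]

print(len(by_fraction), len(by_count), len(python_only))
```

Splitting on `|` before testing membership avoids false matches on tags that merely contain the text, such as a hypothetical `python-3.x`.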

In [21]:
# randomly sample dataset
fd_tiny = results.sample(frac=0.0001)       # used for testing / demo (intentionally small)
fd_1k   = results.sample(n=1000)            # used if/when verifying tests on a larger sample
fd_10k  = results.sample(n=10000)           # used if/when verifying tests on an even larger sample

# convenience: alias the filtered data so we can change it easily for later code
wd = fd_tiny

# dump info about the filtered result
print("Number of questions in currently selected filtered dataset:", len(wd))
wd

Number of questions in currently selected filtered dataset: 38


Unnamed: 0,id,title,body,accepted_answer_id,view_count,tags,answer_count,question_score,answer_score,stackoverflow_answer
375204,71148343,Can't docker build a Golang project with inter...,"<p>I'm trying to build a Golang project, which...",71148741,479,docker|go|dockerfile,1,2,2,<p>The issue is in your <code>Dockerfile</code...
53063,73286778,Insert data to different columns based on the ...,"<p>This is my text file:</p>\n<pre class=""lang...",73288668,49,file|apache-spark|pyspark|apache-spark-sql|tex...,1,1,1,<p>Thank you for the clarifications. Something...
342806,72371571,keeping track of injected variables when using...,<p>I'm looking for a way to keep track of vari...,72371985,59,kotlin|string-interpolation,2,0,3,<p>I have implemented a &quot;ContainerClass&q...
357314,72793940,Loading ANY Content Through PHP Script,<p>I have this script:</p>\n<pre><code>&lt;?ph...,72794228,28,php|switch-statement|mime-types|script,2,0,0,<p>You can use the php function <code>mime_con...
363895,73105574,Putting brackets around a specific column,"<p>With a bash script, I extracted a .conllu f...",73106177,55,python|bash|sed,2,1,1,<p>Using <code>sed</code></p>\n<pre><code>sed ...
265975,71770101,How to format a string into a string with curr...,"<p>Numerical formatting of int, double, decima...",71770217,70,c#|string|formatting,1,1,3,<blockquote>\n<p>Given it's a string I don't t...
132352,70961723,Find the minimum and maximum values of a numer...,<p>I have the dataframe below and I would like...,70963270,41,r,1,1,1,"<p>Initially, I arranged the data by year. In ..."
39321,71194498,fetch data from google sheets for svgmap,<p>i'm pretty sure i'm one line away from my s...,71197090,51,javascript|jquery|json|google-sheets,1,0,0,<p>All countries need to be added directly to ...
84850,71557745,Replace Hyperlinks to Local Directory,<p>I'm trying the following: when VBA finds hy...,71557854,29,vba|ms-word|hyperlink,1,0,0,<p>You can only set a range to a method if tha...
144560,73527552,How to observe multiple names that comes in a ...,<p>I have a data with a character and group:</...,73527634,31,r,3,0,2,<p><code>dplyr</code> option with <code>summar...


In [22]:
import numpy as np

# get info about view counts (from entire dataset)
# (variable names avoid shadowing the built-in max() and min())
view_avg = np.mean(results['view_count'])
view_max = np.max(results['view_count'])
view_min = np.min(results['view_count'])
view_med = np.median(results['view_count'])

# print the extra info
print("Misc view count stats for full dataset: ")
print("avg: " + str(view_avg))
print("max: " + str(view_max))
print("min: " + str(view_min))
print("med: " + str(view_med))

Misc view count stats for full dataset: 
avg: 187.74652529015617
max: 154191
min: 3
med: 66.0


### Strip HTML from filtered dataset

In the meeting on 2023-08-08 it was decided that we would strip HTML and use this "stripped" version as our default for evaluations.

TODO: contains a lot of unnecessary debugging and unused previously-WIP code

TODO: do we want tags included in the query? (the other paper did include the tags)

In [23]:
from bs4 import BeautifulSoup as soup

# separate out the text columns we want
titles  = wd["title"]
bodies  = wd["body"]
answers = wd["stackoverflow_answer"]

# preview the formatting currently being used pre-strip
print("Pre-stripped questions:\n========================")
for body in bodies:
    print(body)

# print(wd.applymap(lambda text: BeautifulSoup(text, 'html.parser').get_text()))

# # strip the HTML tags etc
# stripped_titles  = titles.apply(lambda x: BeautifulSoup(titles, "html.parser").get_text())
# stripped_bodies  = BeautifulSoup(bodies, "html.parser").get_text()
# stripped_answers = BeautifulSoup(answers, "html.parser").get_text()

# sanity check (surely this will always be true,  but *just in case*)
if len(titles) == len(bodies) and len(titles) == len(answers):
    pass
else:
    raise ValueError("columns are different lengths!")

# new approach
# for index, row in enumerate(titles):
#     print(str(index), row)

# modified harvey approach (iterating is slow(!) but comparatively easy to understand)
print("Post-stripped example questions:\n========================")
counter_good = 0
counter_bad = 0

# counteri = 0
# for index, row in wd.iterrows():
#     counteri = index
#     print(titles[i])

json_list = []

for idx, *row in wd.itertuples():
    print("::::::::::::::::::::::::::::::::::::", str(idx))
    try:
        question_composite = titles[idx] + "\n\n" + (soup(bodies[idx], "html.parser").get_text())

        print(question_composite)

        #print(soup(bodies[i], "html.parser").get_text())
        counter_good += 1

        stackoverflow_ans = (soup(answers[idx], "html.parser").get_text())

        json_object = {
            "input": question_composite,
            "ideal": stackoverflow_ans
        }
        print(json_object)

        json_list.append(json.dumps(json_object))

    # debugging bad entries (if everything goes smoothly this will never run)
    except Exception as err:
        counter_bad += 1
        print("check:", str(idx), "-", repr(err))
        print("DING!!!!!!!")

        # print whichever columns are still readable for this row, stopping at the first that isn't
        for column in (titles, bodies, answers):
            try:
                print(column[idx])
            except Exception:
                break

# because re-running the notebook changes the sampling we DON'T want a persistent jsonl file
# build the path from cwd so the file stays inside the working directory
JSONL_FILEPATH = os.path.join(cwd, "temp", "temp.jsonl")
os.makedirs(os.path.dirname(JSONL_FILEPATH), exist_ok=True)
with open(JSONL_FILEPATH, "w") as outfile:
    outfile.write("\n".join(json_list))

# # file path and file name for our jsonl file - used for both saving AND loading previously-saved file
# JSONL_FILENAME = "/temp/temp.jsonl"

# # first check if we have a saved .jsonl file
# try:
#     jsonl_path = os.path.join(cwd, JSONL_FILENAME)
#     print("A JSON Lines (.jsonl) file already exists. No action has been taken.")

# # if not, make one
# except FileNotFoundError:
#     print("No JSON Lines (.jsonl) file found, so a new file is being written...")

#     with open(JSONL_FILENAME, "w") as outfile:
#         outfile.write("\n".join(json_list))

# # yell if there's an *unexpected* error
# except:
#     raise Exception("Error while reading/writing JSON Lines (.jsonl) file.")

#         continue

# print("good results:", str(counter1))
# print("bad results:", str(counter2))

# # preview the formatting post-strip
# print("Post-stripped example question:\n========================")
# for body in stripped_answers["body"]:
#     print(body)

Pre-stripped questions:
<p>I'm trying to build a Golang project, which contains different levels of packages inside. I've uploaded an example project here: <a href="https://github.com/David-Lor/archive.org-telegrambot/tree/example-go-dockerfile-not-building" rel="nofollow noreferrer">https://github.com/David-Lor/archive.org-telegrambot/tree/example-go-dockerfile-not-building</a></p>
<h2>Files</h2>
<p>go.mod</p>
<pre><code>module github.com/David-Lor/go-example

go 1.16

require github.com/gammazero/workerpool v1.1.2
</code></pre>
<p>Dockerfile</p>
<pre><code>FROM golang:1.17.7

WORKDIR /app
COPY ./src/go.mod .
COPY ./src/go.sum .
RUN go mod download


COPY ./src/* ./
#RUN ls -lah # files are copied correctly; go.mod and main.go are in current directory
RUN go build -o /tmp/built
</code></pre>
<h2>Error</h2>
<p>When I docker build, I got the following error on the go build command:</p>
<pre class="lang-sh prettyprint-override"><code>Step 7/7 : RUN go build -o /tmp/built
 ---&gt; Running

### Restructure data into JSONL format

This is the format which OpenAI evals requires.

TODO: this is now outdated (and DOES NOT need to be run), since I'm stripping HTML at the same time as doing the JSONL formatting.
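For reference, a minimal sketch of the JSON Lines layout using the same `input` and `ideal` keys this notebook writes. The two records and the temporary file location are made up for illustration, and the exact schema a given eval expects may differ.

```python
import json
import os
import tempfile

# made-up records using the same two keys the notebook writes ("input" and "ideal")
records = [
    {"input": "How do I reverse a list?\n\nI have a list in Python.", "ideal": "Use reversed() or slicing."},
    {"input": "What does static mean?\n\nSeen in a Java class.", "ideal": "It belongs to the class itself."},
]

path = os.path.join(tempfile.mkdtemp(), "samples.jsonl")

# JSON Lines: one JSON object per line; json.dumps escapes embedded newlines as \n
with open(path, "w", encoding="utf-8") as outfile:
    for record in records:
        outfile.write(json.dumps(record) + "\n")

# reading it back is the reverse: parse each non-empty line on its own
with open(path, encoding="utf-8") as infile:
    loaded = [json.loads(line) for line in infile if line.strip()]

print(loaded == records)
```

Because `json.dumps` escapes the newlines inside each string, a multi-paragraph question still occupies exactly one line of the file.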

In [24]:
# title = filtered_dataset['title']
# body = filtered_dataset['body']
# stackoverflow_answer = filtered_dataset['stackoverflow_answer']

# # for index, item in enumerate(filtered_dataset, start=0):
# #     print(index, item)

# # for index, item in enumerate(results, start=0):
# #     print(index, item)

# # temp = range(len(results))
# # temp

# from bs4 import BeautifulSoup

# temp2 = filtered_dataset.iloc[[31]].to_json(orient='records')
# #temp2 = temp2.replace("'", "\'").replace('"', '\"')
# temp2 = BeautifulSoup(temp2, "html.parser").get_text()
# temp2

# # # MH modified iterations
# # for index, item in enumerate(body, start=0):
# #     print(index, item)

# # for index, item in enumerate(title, start=0):
# #     print(index, item)

# # for index, item in enumerate(stackoverflow_answer, start=0):
# #     print(index, item)

# # MH iteration v2
# # TODO ;_;

# # Iterate through the DataFrame rows
# #for i in range(len(filtered_dataset)):

#     #title[i]
#     #body[i]
#     # Truncate the body content to 4095 characters if needed
#     #body_content = title[i]+(body[i] if body[i] else "None").replace("'", "\\'").replace('"', '\\"')

In [25]:
# def create_json(test_name):
#   # Extract the necessary data from the DataFrame
#   body = filtered_dataset['body']
#   title = filtered_dataset['title']
#   stackoverflow_answer = filtered_dataset['stackoverflow_answer']
#   #chatgpt_answer = results['chatgpt_answer']

#   # Create a list to store the JSON strings
#   json_strings = []

#   # Iterate through the DataFrame rows
#   for i in range(len(filtered_dataset)):
#       # Truncate the body content to 4095 characters if needed
#       body_content = title[i]+(body[i] if body[i] else "None").replace("'", "\\'").replace('"', '\\"')
#       #chatgpt_ans = (chatgpt_answer[i][:2000] if chatgpt_answer[i] else "None").replace("'", "\\'").replace('"', '\\"')
#       stackoverflow_ans = (stackoverflow_answer[i] if stackoverflow_answer[i] else "None").replace("'", "\\'").replace('"', '\\"')

#       # Create a JSON object with the desired structure
#       json_object = {
#             "input": body_content,
#             #"input2": chatgpt_ans,
#             "ideal": stackoverflow_ans
#       }
#       print(json_object)

#       # Convert the JSON object to a string and append it to the list
#       json_strings.append(json.dumps(json_object))

#   filename = '/evals/evals/registry/data/{test_name}/expert/samples.jsonl'.format(test_name=test_name)
#   # Save the JSON strings to a file, with newline characters between them
#   with open(filename, "w") as outfile:
#       outfile.write("\n".join(json_strings))

# create_json('coqa')

### Configure OpenAI API Key

The key is currently read from `secrets.json` in the root directory. An alternative (which would require code changes) would be to read it from a system environment variable.

Key can be generated from: https://platform.openai.com/account/api-keys
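A sketch of the environment-variable alternative mentioned above, falling back to `secrets.json` when the variable is not set. The function name `load_api_key_from_env` and the choice of `OPENAI_API_KEY` as the variable name are assumptions for illustration; the notebook itself only reads the file.

```python
import json
import os

def load_api_key_from_env(secrets_file="secrets.json", env_var="OPENAI_API_KEY"):
    """Prefer the environment variable; fall back to the secrets file (hypothetical helper)."""
    key = os.environ.get(env_var)
    if key:
        return key
    # same file layout the notebook already uses: {"OPENAI_API_KEY": "..."}
    with open(secrets_file) as f:
        return json.load(f)[env_var]

# usage (not run here, since it needs either the variable or the file to exist):
# openai.api_key = load_api_key_from_env()
```

Checking the environment first means the same notebook can run on a machine where no secrets file has been created.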

In [26]:
# Install OpenAI
%pip install openai
import openai

# loads OpenAI API key from file https://stackoverflow.com/a/76148268
def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]

# read and set our OpenAI API key
api_key = load_api_key()
openai.api_key = api_key

Note: you may need to restart the kernel to use updated packages.


In [27]:
# test openAI query (TODO: remove - just checking that it works; not actually used in T2 project)

# response = openai.ChatCompletion.create(
#     model    = "gpt-3.5-turbo",#obviously gpt-4 is more expensive, so don't test too much with it! can use e.g. gpt-3.5-turbo instead
#     messages = [
#         {
#             "role": "system",
#             "content": "You will be provided with a piece of Python code, and your task is to find and fix bugs in it."
#         },
#         {
#             "role": "user",
#             "content": "import Random\na = random.randint(1,12)\nb = random.randint(1,12)\nfor i in range(10):\n    question = \"What is \"+a+\" x \"+b+\"? \"\n    answer = input(question)\n    if answer = a*b\n        print (Well done!)\n    else:\n        print(\"No.\")"
#         }
#     ],
#     temperature=0,
#     max_tokens=1024
# )

# print(response.choices[0].message.content)

### Install OpenAI evals

The T1 intern used their own modified evals (speculated to be for efficiency reasons?). For T2 we're trying the unmodified evals. They used the `-e` flag in their installation because they were modifying evals - since we're not doing that it's not needed.

In [28]:
# TODO: UNCOMMENT THIS CELL BEFORE PUBLISHING (but to save time on run-all, it's commented out since it's installed for me now)
# import shutil

# # openai evals uses git-lfs, so we need that first
# !git lfs install

# # get a local copy of evals (and if one already exists, nuke it first)
# try:
#     !rm -r evals
# finally:
#     !git clone https://github.com/openai/evals

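# # complete the remaining setup steps
# # ("!cd evals" only changes directory inside its own subshell, so point git at the clone with -C instead)
# !git -C evals lfs fetch --all
# !git -C evals lfs pull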
# %pip install evals

### Use/Run evals

This also simultaneously queries OpenAI's models for answers (they're not generated beforehand and then fed into the evaluator).

(As an aside, it is possible to use evals without magic commands or manual file creation (see https://medium.com/@sergioli/evaluating-chatgpt-using-openai-evals-7ca85c0ad139), but it's comparatively more complex.)

In [29]:
# test
#!oaieval gpt-3.5-turbo coqa-fact

In [34]:
# TODO: this is with the samples.jsonl file MANUALLY replaced just to do testing
#   probably best to do this programmatically so that no manual intervention is needed (easier for others to run)

!oaieval gpt-3.5-turbo coqa-fact

[2023-08-22 15:08:51,845] [registry.py:249] Loading registry from C:\Users\Mark\anaconda3\Lib\site-packages\evals\registry\evals
[2023-08-22 15:08:51,999] [registry.py:249] Loading registry from C:\Users\Mark\.evals\evals
[2023-08-22 15:08:52,001] [oaieval.py:110] Run started: 230822050852KRNV5FTI
[2023-08-22 15:08:52,002] [registry.py:249] Loading registry from C:\Users\Mark\anaconda3\Lib\site-packages\evals\registry\modelgraded
[2023-08-22 15:08:52,025] [registry.py:249] Loading registry from C:\Users\Mark\.evals\modelgraded
[2023-08-22 15:08:52,025] [data.py:75] Fetching coqa/samples.jsonl
[2023-08-22 15:08:52,035] [eval.py:34] Evaluating 38 samples
[2023-08-22 15:08:52,045] [eval.py:153] Running in threaded mode with 10 threads!

  0%|          | 0/38 [00:00<?, ?it/s]
  0%|          | 0/38 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "c:\Users\Mark\anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  Fil

In [31]:
import evals

def estimate_tokens(line):
    return 1# TODO: actually estimate tokens

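def create_batch(line_numbers):  # renamed from `list` to avoid shadowing the built-in
    return "test batch"  # TODO: actually return the intended batch

def replace_samples(eval_name, batch):  # renamed from `eval` to avoid shadowing the built-in
    pass  # TODO: implementation of replacing the samples.jsonl (for desired eval) with our batched version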

def smart_request():#needs a better name
    # TODO: implementation of request with rate limit handling etc; note: won't have a return, since file gets written to disk
    pass
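One possible shape for the batching stubs above, sketched as a generator that groups line numbers under a token budget. The names `estimate_tokens_rough` and `batch_by_token_budget`, and the "about 4 characters per token" rule of thumb, are assumptions for illustration and not a real tokenizer; the stubs in this notebook remain unimplemented.

```python
def estimate_tokens_rough(text):
    """Very rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

def batch_by_token_budget(lines, token_limit, estimate=estimate_tokens_rough):
    """Yield lists of 1-based line numbers whose estimated tokens fit within token_limit."""
    batch, used = [], 0
    for line_number, line in enumerate(lines, start=1):
        cost = estimate(line)
        # start a new batch when this line would push the current one over budget
        if batch and used + cost > token_limit:
            yield batch
            batch, used = [], 0
        batch.append(line_number)  # an oversized line still gets a batch of its own
        used += cost
    if batch:  # the final, possibly partial, batch
        yield batch

# estimated costs of these lines are 2, 2, 2, 10 and 1 tokens
lines = ["a" * 8, "b" * 8, "c" * 8, "d" * 40, "e" * 4]
print(list(batch_by_token_budget(lines, token_limit=5)))
# prints [[1, 2], [3], [4], [5]]
```

Yielding the leftover batch after the loop covers the "final batch" case, and giving an oversized line its own batch is one answer to the question of what to do when a single line exceeds the limit.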

In [32]:
token_counter = 0
token_limit = 3 #TODO: find what the number should be
batch_id_list = []
desired_eval = "test_eval_name"

#TODO: properly get the path instead of this jank
temp_path = "C:\\Users\\Mark\\Documents\\A2I2 T2 2023\\temp\\temp.jsonl"

# TODO: make sure that the handover between batches works, make sure that the final batch works
with open(temp_path, 'r') as f:

    # single pass, numbering lines from 1
    # (counting the lines first would exhaust the file iterator, so the loop body would never run)
    for line_number, line in enumerate(f, start=1):

        line_tokens = estimate_tokens(line)  # what to do if (when?) a single line blows the counter?

        # this line would reach the limit, so send off the current batch first
        if batch_id_list and token_counter + line_tokens >= token_limit:
            batch = create_batch(batch_id_list)
            replace_samples(desired_eval, batch)
            smart_request()
            token_counter = 0
            batch_id_list.clear()

        # the current line starts (or joins) the next batch, and its tokens are counted there
        batch_id_list.append(line_number)
        token_counter += line_tokens

# handle the final (possibly partial) batch once the file is exhausted
if batch_id_list:
    batch = create_batch(batch_id_list)
    replace_samples(desired_eval, batch)
    smart_request()

# TODO: there should be some kind of waiting or sleeping in here to proactively prevent rate limit issues (see openai cookbook)
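A minimal sketch of the waiting/sleeping idea from the TODO above: retry with exponential backoff plus jitter. The helper name `call_with_backoff` and its parameters are hypothetical. Which exception the OpenAI client raises on a rate limit is deliberately not assumed here; it is passed in through `retry_on`.

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call request(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# demo with a fake request that fails twice before succeeding
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"

waits = []  # record the delays instead of really sleeping
result = call_with_backoff(flaky_request, retry_on=(TimeoutError,), sleep=waits.append)
print(result, len(waits))
```

Injecting `sleep` keeps the demo instant; in the notebook the default `time.sleep` would be used, with `smart_request` wrapping the real call.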


In [33]:
# TODO combine all the locally-written responses into something coherent / usable