## OEIS Scraping

OEIS (the On-Line Encyclopedia of Integer Sequences) catalogs integer sequences together with English descriptions, and some entries also include short code samples in various languages that produce either specific elements of the sequence or the sequence itself. By prompting GPT with a few sample pairs of OEIS sequence descriptions and their corresponding code, we can test whether GPT understands a code snippet: we give it a new code sample and see whether it can extend the prompt with a valid description of what that code does.
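As a rough sketch of the prompting setup described above, the snippet below assembles a few-shot prompt from (code, description) pairs and leaves the final description open for the model to complete. The `build_prompt` helper, the `# Description:` marker, and the example pairs are all assumptions made for illustration; they are not taken from the project's scripts.

```python
def build_prompt(examples, new_code):
    """Assemble a few-shot prompt from (code, description) pairs.

    The layout (code followed by a '# Description:' line) is a
    hypothetical format chosen for this illustration.
    """
    parts = []
    for code, description in examples:
        parts.append(f"{code.strip()}\n# Description: {description.strip()}\n")
    # Leave the last description open so the model has to complete it.
    parts.append(f"{new_code.strip()}\n# Description:")
    return "\n".join(parts)


# Illustrative pairs only; real pairs would come from the scraped dataset.
examples = [
    ("def a(n):\n    return n * n", "The squares: a(n) = n^2."),
    ("def a(n):\n    return 2 * n", "The nonnegative even numbers: a(n) = 2*n."),
]
prompt = build_prompt(examples, "def a(n):\n    return n * (n + 1) // 2")
print(prompt)
```

Ending the prompt on an open description line is one simple way to steer a completion model toward describing the last snippet rather than writing more code.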

In [1]:
# uncomment to scrape dataset (depending on internet speed may take a while - will probably refactor later if needed)
# from oeis_scraper import *
# scrape('oeis_data.jsonl')

To keep the code as readable as possible, only sequences with Python code were collected. Python has a reputation for simple code that is easy for humans to read, and because of its considerable popularity it was part of the datasets that GPT-Codex was trained on. Out of the many sequences in the OEIS database, only 6,777 contained Python code. Also, because I am not a mathematician and many of the sequences are named after the mathematicians who discovered them, a smaller subset of about 80 of those sequences was selected for manual review, so that I could check whether GPT generates adequate descriptions. These sequences were chosen because they are relatively simple and do not require more advanced knowledge of number theory. If the model generates a high-level description of what the code does or what numbers it produces, I can confirm that the code really does what GPT says it does, whereas something like the Kolakoski sequence is unfamiliar to me and I would not be able to verify a description of it.

## Code obfuscation
Once we have the sequence descriptions and the code samples, some preprocessing is needed. Because GPT-Codex is trained on a large corpus of data scraped from the internet, it has likely seen OEIS before. In fact, testing GPT with unmodified code samples yields descriptions that are lifted verbatim from OEIS. To get around this, we need to obfuscate the code so that it remains fairly understandable (at least from a human perspective) while doing functionally the same thing as the original code sample.

Many of the code samples in the dataset come from a heavily mathematical context, so it makes sense that they tend to be more functional than imperative in style. High-level map, filter, and reduce calls are therefore an ideal target to modify, since they can easily be converted to and from imperative loops. List and set comprehensions are also good targets because they are fairly simple to rewrite. To stretch the limited data further and to see which combinations of changes work best for GPT, we can then chain different obfuscations in different orders. Full specifications of what I changed and how are listed in the obfuscate_data.py script.
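To make the kind of rewrite described above concrete, here is a minimal hand-written illustration of a filter/map/reduce chain and a list comprehension next to equivalent imperative loops. The function names are invented for this example, and the code only shows the before and after of such a transformation; it is not how obfuscate_data.py performs it.

```python
from functools import reduce


def sum_even_squares_functional(n):
    """Functional style: filter, map, and reduce chained together."""
    evens = filter(lambda k: k % 2 == 0, range(n))
    squares = map(lambda k: k * k, evens)
    return reduce(lambda acc, x: acc + x, squares, 0)


def sum_even_squares_imperative(n):
    """The same computation rewritten as an explicit loop."""
    total = 0
    for k in range(n):
        if k % 2 == 0:
            total += k * k
    return total


def even_squares_comprehension(n):
    """List comprehension form."""
    return [k * k for k in range(n) if k % 2 == 0]


def even_squares_loop(n):
    """The comprehension unrolled into a loop with append."""
    result = []
    for k in range(n):
        if k % 2 == 0:
            result.append(k * k)
    return result


# Both versions of each pair should agree on every input.
assert sum_even_squares_functional(10) == sum_even_squares_imperative(10)
assert even_squares_comprehension(10) == even_squares_loop(10)
```

Because each pair computes the same values, a rewrite like this changes the surface form of a snippet without changing the sequence it generates.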

In [2]:
from obfuscate_data import *
# obfuscate('oeis_data.jsonl', 'oeis_data_obfuscated.jsonl')
obfuscate('google-python-data/mbpp.jsonl', 'google-python-data/mbpp_obfuscated.jsonl')

## GPT-Codex processing
Now, we process the code samples through GPT-Codex. Currently, 5 obfuscations are applied one after another, with the intermediate results saved. There are also different settings that we can pass to GPT, including the sampling temperature and penalties for repeating words. For now, we'll experiment with the presence and frequency penalties, each set to either 0 or 1 (see the API documentation for details). That gives 4 combinations of settings per obfuscation stage, for a total of 20 different GPT calls.
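The count of 20 calls can be checked by enumerating the grid directly. The sketch below is only an illustration: `build_runs` and the dictionary layout are hypothetical, and the key names `presence_penalty` and `frequency_penalty` are assumed to match the corresponding completion API parameters.

```python
from itertools import product

NUM_OBFUSCATION_STAGES = 5  # from the text: 5 obfuscations applied in sequence
PENALTY_VALUES = (0, 1)     # each penalty is either off (0) or on (1)


def build_runs(num_stages=NUM_OBFUSCATION_STAGES, penalty_values=PENALTY_VALUES):
    """Enumerate every (stage, presence penalty, frequency penalty) combination."""
    runs = []
    for stage in range(1, num_stages + 1):
        for presence, frequency in product(penalty_values, repeat=2):
            runs.append({
                "obfuscation_stage": stage,
                "presence_penalty": presence,    # assumed API parameter name
                "frequency_penalty": frequency,  # assumed API parameter name
            })
    return runs


runs = build_runs()
print(len(runs))  # 5 stages x 4 penalty combinations
```

Keeping the grid as plain data makes it easy to add another setting later, such as sampling temperature, without touching the code that issues the calls.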

In [None]:
from gpt_codex_query import *

# indices.txt has the sequence numbers for 70 sequences that I can understand at a glance
# with open('indices.txt') as indices:
#    sequences = set(line.strip() for line in indices)

# get readable sequences
# write_completions('codex-completions-subset.jsonl', read_snippets('oeis_data_obfuscated.jsonl', sequences))
    
# get all sequences
# write_completions('codex-completions-full.jsonl', read_snippets('oeis_data_obfuscated.jsonl'))
write_completions('google-python-data/mbpp_completions.jsonl', read_snippets('google-python-data/mbpp_obfuscated.jsonl'))

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
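The warnings above come from how BLEU combines n-gram precisions: it takes their geometric mean, so a single order with zero overlaps forces the whole score to 0. The pure-Python sketch below reproduces that effect on a toy pair of sentences and shows how smoothing avoids it. It is an illustration only: it omits the brevity penalty, the helper names and sentences are invented, and the add-one smoothing used here is one simple choice that may differ from what the library's SmoothingFunction() does.

```python
from collections import Counter
from math import exp, log


def ngram_precision(reference, hypothesis, n):
    """Return (clipped matches, total) for n-grams of hypothesis vs. one reference."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matches, sum(hyp.values())


def geometric_mean_precision(reference, hypothesis, max_n=4, smooth=False):
    """Geometric mean of the 1..max_n gram precisions (no brevity penalty).

    With smooth=True, 1 is added to both the match count and the total
    for every order, which is an assumption made for this sketch.
    """
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = ngram_precision(reference, hypothesis, n)
        if smooth:
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # one zero precision zeroes the whole product
        log_sum += log(matches / total)
    return exp(log_sum / max_n)


reference = "numbers that are the sum of two squares".split()
hypothesis = "sum of squares of two numbers".split()

unsmoothed = geometric_mean_precision(reference, hypothesis)
smoothed = geometric_mean_precision(reference, hypothesis, smooth=True)
print(unsmoothed, smoothed)
```

Short generated descriptions rarely share 3-grams or 4-grams with the reference, so an unsmoothed score at the default order will often be 0 even when the description is broadly right.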
