# CODE SNIPPET GENERATION EXPERIMENTS 
Generates code (pseudocode and function snippets from comment prompts) using genetic programming and transformers.

### Imports

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer



#### Experiment 1: CodeGenT5
Note: The model can complete a snippet of Python code. The values of `max_time` and `max_new_tokens` and the wording of the input string (a comment followed by a snippet of code) all affect how accurate the completion is. The prompt needs a starter function definition with an open parameter list before the model actually writes code, so follow the comment with "\ndef x(". Beam search did not improve the results. A low temperature works best.
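The prompt format described in the note can be sketched as a small helper. This is only an illustration: `build_prompt` and its arguments are hypothetical names, not part of the notebook. It assumes the two stub styles used in these experiments, an open signature (`def x(`) and a closed one (`def x():\n`).

```python
def build_prompt(task, name="x", params=None):
    """Return a comment line followed by a function stub (hypothetical helper).

    params=None leaves the signature open ("def x(") so the model picks the
    parameters itself. A list, possibly empty, closes the signature
    ("def x(a, b):" plus a newline) so the model only writes the body.
    """
    header = f"# {task}\ndef {name}("
    if params is None:
        return header
    return header + ", ".join(params) + "):\n"


# Open stub: the model chooses the parameters
print(repr(build_prompt("write a function that multiplies all numbers in a list by a random number")))

# Closed stub with no parameters: the model writes only the body
print(repr(build_prompt("write a function that returns a random number that is a multiple of 2", params=[])))
```

Keeping the stub in one place makes it easier to compare phrasings of the comment while holding the rest of the prompt fixed.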

\> Input: `"# write a function that multiplies all numbers in a list by a random number\ndef x("`

\> Options:
```
options = {
    "max_new_tokens": 100,
    "temperature": 0.2,    #lower is better for more efficient code
    # "repetition_penalty": 0.01,
    "do_sample": True,
    "max_time": 10,   #maximum time allotted to generate
}

```

\> Output: 

```
# write a function that multiplies all numbers in a list by a random number
def x(n):
    return n*random.randint(1,10)

# write a function that takes a list of numbers and returns the largest number
def largest(n):
    return max(n)

# write a function that
```
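To make the effect of the `temperature` option concrete, here is a minimal sketch of temperature-scaled sampling probabilities. The logits are made-up scores for three candidate tokens, and `softmax_with_temperature` is a hypothetical helper written for this illustration. It mirrors the usual approach of dividing the logits by the temperature before the softmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores, not taken from the model

for t in (1.0, 0.55, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more probability mass lands on the highest-scoring token, so sampling behaves more like greedy decoding. That is one plausible reason the low setting gave more consistent snippets here.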

In [43]:
checkpoint = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# prompt: a comment describing the task, followed by an open function stub
text = "# write a function that multiplies all numbers in a list by a random number\ndef x("

options = {
    "max_new_tokens": 50,
    "temperature": 0.2,    # lower values gave more efficient code
    # "repetition_penalty": 0.01,
    "do_sample": True,
    "max_time": 10,   # maximum time (in seconds) allotted to generate
    "num_return_sequences": 3,
}
inputs = tokenizer(text, return_tensors="pt")
completion = model.generate(**inputs, **options)

# print the raw token ids and the decoded text of each returned sequence
for c in completion:
    print(c)
    print(tokenizer.decode(c))
    print("=========================================")



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([    2,  3551,   257,  2163,   326, 15082,   444,   477,  3146,   287,
          257,  1351,   416,   257,  4738,  1271,   198,  4299,  2124,     7,
           77,  2599,   198, 50284,  7783,   299,     9, 25120,    13, 25192,
          600,     7,    16,    11,   940,     8,   198,   198,     2,  3551,
          257,  2163,   326,  2753,   257,  1351,   286,  3146,   290,  5860,
          262,  2811,   286,   883,  3146,   198,  4299,  2811,     7,    77,
         2599,   198, 50284,  7783,  2160,     7,    77, 20679, 11925,     7])
# write a function that multiplies all numbers in a list by a random number
def x(n):
    return n*random.randint(1,10)

# write a function that takes a list of numbers and returns the average of those numbers
def average(n):
    return sum(n)/len(
tensor([    2,  3551,   257,  2163,   326, 15082,   444,   477,  3146,   287,
          257,  1351,   416,   257,  4738,  1271,   198,  4299,  2124,     7,
           77,  2599,   198, 50284,  7783,   299

In [53]:
print(tokenizer.eos_token_id)
print(tokenizer.decode(198))
print(tokenizer.encode("#"))
print("hehe")

50256


[2]
hehe


#### Experiment 2: CodeGenT5 with cutoff
Note: Gives the best function definitions by cutting each completion at the first double newline (a [198, 198] token sequence in the list of token ids). The prompt still needs to specify the function parameters: use the comment followed by 'def x():\n', with the parameters listed inside the parentheses. Phrasing matters a lot when communicating with the transformer: 'return random numbers that are multiples of 2' is harder for it than 'return a number that is a multiple of 2'.

In [110]:
import torch

# return the index of the first spot where two values from vs appear
# back to back in tensor x, or -1 if there is no such spot
def hasDouble(x, vs=(198,)):
    # get all indexes where the value is one of vs
    idx = []
    for v in vs:
        idxi = (x == v).nonzero().flatten()
        if len(idxi) > 0:
            idx += idxi.tolist()
    if len(idx) == 0:
        return -1

    # sort the indexes
    idx.sort()

    # check if two matching indexes are adjacent
    for i in range(len(idx)-1):
        if idx[i] == idx[i+1]-1:
            return idx[i]
    return -1


a = torch.Tensor([420,69,69,13,7,628,198,2,1,198,198,0,21])
print(hasDouble(a,[198,628]))  #should return 5

5
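The same cutoff can also be applied to the decoded text instead of the token ids. The sketch below is an alternative to `hasDouble`, not the method used in this notebook: `first_snippet` is a hypothetical helper that keeps everything before the first blank line. It assumes the generated function body contains no blank lines of its own.

```python
import re


def first_snippet(decoded):
    """Return the text before the first blank line, or all of it if none exists."""
    # a blank line is a newline, optional spaces or tabs, then another newline
    return re.split(r"\n[ \t]*\n", decoded, maxsplit=1)[0]


decoded = (
    "# write a function that multiplies all numbers in a list by a random number\n"
    "def x(n):\n"
    "    return n*random.randint(1,10)\n"
    "\n"
    "# write a function that takes a list of numbers and returns the largest number\n"
    "def largest(n):\n"
    "    return max(n)\n"
)

print(first_snippet(decoded))
```

Working on text avoids depending on specific token ids, at the cost of decoding the full sequence first.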


In [124]:
# try again - cutting off after double 198 tokens
txt = "# write a function that returns a random number that is a multiple of 2\ndef x():\n"
options = {
    "max_new_tokens": 50,
    "temperature": 0.55,    #lower is better for more efficient code
    "repetition_penalty": 0.99,
    "do_sample": True,
    "max_time": 10,   #maximum time allotted to generate
    "num_return_sequences": 3,
}
completion = model.generate(**tokenizer(txt, return_tensors="pt"), **options)

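for c in completion:
    early_stop = hasDouble(c,[198,628])
    # early_stop = hasDouble(c,[198])
    # hasDouble returns -1 when no cutoff is found; keep the whole sequence then
    end = early_stop if early_stop != -1 else len(c)
    print(tokenizer.decode(c[:end]))
    print("=========================================")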


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


# write a function that returns a random number that is a multiple of 2
def x():
    num = randint(1, 10)
    if num % 2 == 0:
        return num
    else:
        return 0
# write a function that returns a random number that is a multiple of 2
def x():
    x = random.randint(0,100)
    if x % 2 == 0:
        return x
    else:
        return x + 1
# write a function that returns a random number that is a multiple of 2
def x():
    # write your code here
    return random.randint(0, 99)


In [105]:
print(completion[0])
print(tokenizer.decode(628))

tensor([    2,  3551,   257,  2163,   326,  5860,  4738,  3146,   326,   389,
         5021,  2374,   286,   362,   198,  4299,  2124, 33529,   198, 50284,
           87,   796,  4738,    13, 25192,   600,     7,    16,    11,   838,
            8,   198, 50284,   361,  2124,  4064,   362,  6624,   657,    25,
          198, 50280,  7783,  2124,   198, 50284, 17772,    25,   198, 50280,
         7783,  2124,  1343,   352,   628,   198,     2,  3551,   257,  2163,
          326,  5860,   262,  2160,   286,   262,   717,   299,  3288])



