# Selection of Prompt Candidates for Code Generation

Only prompts that explicitly ask for code generation are selected. Prompts that contain an implementation question or ask for brainstorming are excluded, since they would need a follow-up prompt to actually generate code. Prompts that consist solely of pasted assignment task text are excluded. Prompts that result in very short code, or where the code is merely typo-corrected, are also excluded (e.g., "can you correct this?     print(f"First dimension (length): {objective(abalone.I, abalone.N, w_canonical_coordinate).2f}")").

Question: Should refactor prompts be included at all? We don't know whether the code was written by the user, generated earlier by the LLM, or taken from an existing code base.

In [12]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('../../giicg.db')
prompts = pd.read_sql("SELECT * FROM prompts", conn)

prompts

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,,Man (cisgender),6
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6
4,1137,55,user,Transform given code to process large .mbox file,You are tasked with separating user prompts in...,,Transform given code to process large .mbox file,Man (cisgender),6
...,...,...,...,...,...,...,...,...,...
930,726,31,user,"please update my code accordingly, no comments...","please update my code accordingly, no comments...",,,Man (cisgender),92
931,728,31,user,"Traceback (most recent call last):\n File ""/U...",,,"Traceback (most recent call last):\n File ""/U...",Man (cisgender),92
932,1131,54,user,import pandas as pd\nimport numpy as np\nfrom ...,"I want to tune optimal thresholds. Currently, ...",import pandas as pd\nimport numpy as np\nfrom ...,The narratives list looks like this:\nnarrativ...,Man (cisgender),92
933,1532,71,user,"from transformers import AutoTokenizer, AutoMo...",I want to use an LLM for listwise reranking in...,"from transformers import AutoTokenizer, AutoMo...",,Man (cisgender),92


In [13]:
prompt_candidates = prompts.groupby('conversation_id').first().reset_index()
prompt_candidates

Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6
1,3,3,user,Can you adapt the following code so that inste...,Can you adapt the following code so that inste...,# Create a bar plot to visualize the counts of...,,Woman (cisgender),11
2,5,1733,user,\n SET_ALL_TABLES action is currently not f...,SET_ALL_TABLES action is currently not fetchin...,,,Man (cisgender),15
3,6,5,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16
4,7,43,user,whats the best way to encode and compress a ja...,whats the best way to encode and compress a ja...,,,Man (cisgender),25
...,...,...,...,...,...,...,...,...,...
78,86,1664,user,wie kann man mehrere xarray unter einer neuen ...,wie kann man mehrere xarray unter einer neuen ...,,,Woman (cisgender),60
79,87,1670,user,import numpy as np\nfrom sklearn.cluster impor...,please add information on the amounts of each ...,import numpy as np\nfrom sklearn.cluster impor...,You are tasked with separating user prompts in...,Woman (cisgender),73
80,88,1694,user,Can you document and lint this code please\n\n...,Can you document and lint this code please,@jit\ndef norm():\n OMEGA_R = 4.2 * 10**(-5...,,Man (cisgender),77
81,89,1697,user,ps aux | grep main_py.py\nsimul7 1711 108...,You are tasked with separating user prompts in...,ps aux | grep main_py.py,simul7 1711 1087 0.8 3589100 283700 pts/...,Woman (cisgender),79
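One detail of `groupby(...).first()` is worth keeping in mind here: pandas takes the first non-null value of each column separately, so a resulting row can combine fields from different messages of the same conversation. The sketch below uses a made-up toy frame (not the real `prompts` table) to show the effect, and one possible alternative that keeps whole rows, assuming that the lowest `message_id` marks the first message.

```python
import pandas as pd

# Toy data for illustration only: conversation 1 has two messages,
# and only the second one carries code.
toy = pd.DataFrame({
    "conversation_id": [1, 1, 2],
    "message_id": [10, 12, 20],
    "code": [None, "print('hi')", None],
})

# first() picks the first non-null value per column, so conversation 1
# gets message_id 10 combined with the code of message 12.
mixed = toy.groupby("conversation_id").first().reset_index()

# Alternative: keep the complete first row of each conversation.
first_rows = (
    toy.sort_values("message_id")
       .drop_duplicates("conversation_id", keep="first")
       .reset_index(drop=True)
)

print(mixed)
print(first_rows)
```

Whether the mixing matters depends on how sparse the `conversational`, `code`, and `other` columns are in the first message of each conversation.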


## From Scratch


In [14]:
import numpy as np

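# All "from scratch" candidates, regardless of target language
# (despite the name, this list still includes the non-Python IDs below).
scratch_ids_python = np.array([5, 43, 47, 57, 65, 126, 242, 266, 268, 290, 606, 608, 656, 730, 752, 756, 764, 766, 861, 865, 981, 985, 1023, 1133, 1162, 1164, 1464, 1510, 1520, 1524, 1534, 1538, 1598, 1664])
# Candidates that target a language other than Python
scratch_ids_ts = [606, 752, 764, 985]
scratch_ids_js = [43, 1520]
scratch_ids_jshtml = [608, 1510]
scratch_ids_htmlcss = [656, 1162]

# Remove the non-Python candidates; setdiff1d returns sorted unique values
ids_to_remove = np.concatenate([scratch_ids_ts, scratch_ids_js, scratch_ids_jshtml, scratch_ids_htmlcss])
filtered_ids = np.setdiff1d(scratch_ids_python, ids_to_remove)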
print(filtered_ids)


[   5   43   47   57   65  126  242  266  268  290  730  756  766  861
  865  981 1023 1133 1164 1464 1524 1534 1538 1598 1664]
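As a possible way to keep the language labels and the ID lists in one place, the non-Python IDs could be stored in a dictionary and the removal list derived from it. This is only a sketch on a small subset of the IDs above; the names `all_ids`, `non_python`, and `python_only` are illustrative and not part of the notebook.

```python
import numpy as np

# Small subset of the candidate IDs, for illustration only
all_ids = np.array([5, 43, 606, 608, 47])

# Non-Python candidates keyed by language (labels as in the cell above)
non_python = {
    "ts": [606],
    "js": [43],
    "js_html": [608],
}

# Flatten all non-Python IDs and subtract them from the full list.
# np.setdiff1d returns the sorted unique values of the first array
# that do not occur in the second.
to_remove = np.concatenate(list(non_python.values()))
python_only = np.setdiff1d(all_ids, to_remove)

print(python_only)
```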


In [15]:
scratch_prompts = prompt_candidates[prompt_candidates['message_id'].isin(filtered_ids)]
scratch_prompts


Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id
3,6,5,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16
4,7,43,user,whats the best way to encode and compress a ja...,whats the best way to encode and compress a ja...,,,Man (cisgender),25
5,8,47,user,I have a pandas dataframe like this:\ndata\tpe...,I have a pandas dataframe like this:\n\nI want...,data\tpersona\tinstruction\toriginal\tcritique...,"Some prompts may only contain code, some only ...",Woman (cisgender),28
6,10,57,user,"as a NLP and LLM researcher, I am recently dow...","as a NLP and LLM researcher, I am recently dow...",,,Non-binary,30
8,12,65,user,Blender and Python. I have a collection of hun...,Blender and Python. I have a collection of hun...,,,Man (cisgender),34
9,13,126,user,"how to run a Python future without blocking, i...","how to run a Python future without blocking, i...",,,Man (cisgender),46
11,15,242,user,hey can you write me a short python script for...,hey can you write me a short python script for...,,,Woman (cisgender),48
14,18,266,user,wie kann ich zwei grib dateien in jupyter note...,wie kann ich zwei grib dateien in jupyter note...,,,Woman (cisgender),60
16,20,268,user,Ich arbeite mit Python und muss ein NMEA File ...,Ich arbeite mit Python und muss ein NMEA File ...,,Das file heißt Aufgabe3_NMEA.txt,Woman (cisgender),65
17,21,290,user,please write method to unzip file in python,please write method to unzip file in python,,,Woman (cisgender),73
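Because `prompt_candidates` holds only one row per conversation, `isin` silently drops any selected ID that is not among those rows. A quick check like the following could make such cases visible. It is a sketch on toy data: the frame is a three-row stand-in for `prompt_candidates`, and the ID 999 is made up to trigger the check.

```python
import pandas as pd

# Toy stand-in for prompt_candidates
candidates = pd.DataFrame({
    "conversation_id": [6, 7, 8],
    "message_id": [5, 43, 47],
})

# 999 is a made-up ID that does not exist in the toy frame
selected_ids = [5, 47, 999]

# Same filter as in the cell above
selected = candidates[candidates["message_id"].isin(selected_ids)]

# IDs that were requested but not found among the candidates
missing_ids = sorted(set(selected_ids) - set(candidates["message_id"]))

print(missing_ids)
```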


In [16]:
scratch_prompts.to_sql('scratch_prompts', conn, if_exists='replace', index=False)

25
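The number printed above is the return value of `to_sql`, which in recent pandas versions reports the number of rows written. A minimal round-trip sketch, assuming an in-memory SQLite database and a two-row toy frame rather than the real `giicg.db`, shows how the written table could be read back for verification.

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch does not touch any file
conn = sqlite3.connect(":memory:")

# Toy frame standing in for scratch_prompts
df = pd.DataFrame({"message_id": [5, 43], "role": ["user", "user"]})

# if_exists='replace' drops and recreates the table on every run
rows_written = df.to_sql("scratch_prompts", conn, if_exists="replace", index=False)

# Read the table back to confirm what was stored
round_trip = pd.read_sql("SELECT * FROM scratch_prompts", conn)
conn.close()

print(rows_written, len(round_trip))
```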

## Refactor

In [17]:
refactor_ids = [3, 248, 1751, 592, 596, 600, 654, 724, 732, 736, 855, 881, 1131, 1147, 1155, 1200, 1208, 1532, 1566, 1594, 1632, 1646, 1648, 1670, 1694]
refactor_prompts = prompt_candidates[prompt_candidates['message_id'].isin(refactor_ids)]
refactor_prompts

Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id
1,3,3,user,Can you adapt the following code so that inste...,Can you adapt the following code so that inste...,# Create a bar plot to visualize the counts of...,,Woman (cisgender),11
12,16,248,user,import itertools\nimport numpy as np\nimport p...,wrtie me a script that updates this chart to i...,import itertools\nimport numpy as np\nimport p...,Patients\n CT\n MRI\n WSI\n Genomics\n Clinica...,Woman (cisgender),55
15,19,1751,user,\n def remove_fast_irregular_vectors(fl...,can you give me a new code snipped where I jus...,"def remove_fast_irregular_vectors(flow, magnit...",You are tasked with separating user prompts in...,Woman (cisgender),63
18,22,592,user,I have this function that is supposed to perfo...,I have this function that is supposed to perfo...,"@jit\ndef simpson_uniform(f, x):\n """"""\n ...",100 0.3333333333333333 0.3483835051532227\n\n-...,Man (cisgender),77
19,23,596,user,import matplotlib.pyplot as plt\nimport numpy ...,partendo da qui potresti mettermi a zero tutti...,import matplotlib.pyplot as plt\nimport numpy ...,You are tasked with separating user prompts in...,Woman (cisgender),79
20,24,600,user,Ich habe ein Programmierprojekt und brauche Hi...,Ich habe ein Programmierprojekt und brauche Hi...,starter.py:\n\nimport sys\nimport os\nimport s...,,Man (cisgender),81
25,29,654,user,is there a way to get the object key in here?\...,is there a way to get the object key in here?,def get_events_as_notifications(bin_file: byte...,it is an sns notification,Woman (cisgender),90
27,31,724,user,import pandas as pd\nimport numpy as np\nfrom ...,Please replace my retrieval pipeline here with...,import pandas as pd\nimport numpy as np\nfrom ...,You are tasked with separating user prompts in...,Man (cisgender),92
29,33,732,user,I want to remove all rows where the task_id=56...,I want to remove all rows where the task_id=56...,df_base_incorrect_fast_removed = df_base_incor...,# remove users who were very fast and did not ...,Woman (cisgender),11
30,34,736,user,I have this three classes that are very simila...,I have this three classes that are very simila...,"class Block:\n """"""\n Implements a block ...",,Woman (cisgender),16


In [18]:
refactor_prompts.to_sql('refactor_prompts', conn, if_exists='replace', index=False)

25

In [19]:
conn.close()