# Selection of Prompt Candidates for Code Generation

Only prompts that explicitly ask for code generation are selected. Prompts that contain an implementation question or ask for brainstorming are excluded, since a follow-up prompt would be needed to actually generate code. Prompts that consist purely of assignment task text are excluded. Prompts that result in very short code, or where the code is merely typo-corrected, are also excluded (e.g., "can you correct this? `print(f"First dimension (length): {objective(abalone.I, abalone.N, w_canonical_coordinate).2f}")`").

Question: Should refactor prompts be included at all? We don't know whether the code was previously written by the user or by the LLM, or whether it was part of an existing codebase.

In [1]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('../../giicg.db')
prompts = pd.read_sql("SELECT * FROM prompts", conn)

prompts

Unnamed: 0,message_id,conversation_id,role,message_text,conversational,code,other,gender,user_id
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6
1,730,32,user,Write python function to do operations with in...,Write python function to do operations with in...,,,Man (cisgender),6
2,1133,55,user,Write shortest tutorial on creating RAG on ema...,Write shortest tutorial on creating RAG on ema...,,,Man (cisgender),6
3,1135,55,user,what is FAISS,what is FAISS,,,Man (cisgender),6
4,1137,55,user,Transform given code to process large .mbox file,You are tasked with separating user prompts in...,,Transform given code to process large .mbox file,Man (cisgender),6
...,...,...,...,...,...,...,...,...,...
934,1646,82,user,"def run_query(query, n_results):\n query_em...",this is my code. I want to: Get nodes and edge...,"def run_query(query, n_results):\n query_em...",,Man (cisgender),92
935,1845,37,user,\n nun möchte ich judgement balancing m...,\n nun möchte ich judgement balancing m...,,,Woman (cisgender),29
936,1847,37,user,\n ich sehe keine veränderung im Plot. Was ...,\n ich sehe keine veränderung im Plot. Was ...,,,Woman (cisgender),29
937,1849,2,user,\n I am working on the problem of reconstru...,\n I am working on the problem of reconstru...,,Classic CV - Drone navigation\nIf you ever tho...,Man (cisgender),8


## Select only the first prompt of each conversation

In [2]:
prompt_candidates = prompts.groupby('conversation_id').first().reset_index()
prompt_candidates

Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id
0,1,1,user,"parsing data from python iterator, how it coul...","parsing data from python iterator, how it coul...",,,Man (cisgender),6
1,2,1849,user,\n I am working on the problem of reconstru...,\n I am working on the problem of reconstru...,,Classic CV - Drone navigation\nIf you ever tho...,Man (cisgender),8
2,3,3,user,Can you adapt the following code so that inste...,Can you adapt the following code so that inste...,# Create a bar plot to visualize the counts of...,,Woman (cisgender),11
3,5,1733,user,\n SET_ALL_TABLES action is currently not f...,SET_ALL_TABLES action is currently not fetchin...,,,Man (cisgender),15
4,6,5,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16
...,...,...,...,...,...,...,...,...,...
80,86,1664,user,wie kann man mehrere xarray unter einer neuen ...,wie kann man mehrere xarray unter einer neuen ...,,,Woman (cisgender),60
81,87,1670,user,import numpy as np\nfrom sklearn.cluster impor...,please add information on the amounts of each ...,import numpy as np\nfrom sklearn.cluster impor...,You are tasked with separating user prompts in...,Woman (cisgender),73
82,88,1694,user,Can you document and lint this code please\n\n...,Can you document and lint this code please,@jit\ndef norm():\n OMEGA_R = 4.2 * 10**(-5...,,Man (cisgender),77
83,89,1697,user,ps aux | grep main_py.py\nsimul7 1711 108...,You are tasked with separating user prompts in...,ps aux | grep main_py.py,simul7 1711 1087 0.8 3589100 283700 pts/...,Woman (cisgender),79
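
One caveat about `groupby(...).first()`: pandas takes the first non-null value of each column separately, so the result for a conversation can mix fields from different messages. The sketch below uses hypothetical toy data to show the difference from keeping the first row of each group with `head(1)`. Sorting by `message_id` is an assumption made for illustration only. Whether `message_id` reflects the chronological order of prompts in this dataset is not confirmed by the notebook.

```python
import pandas as pd

# Hypothetical toy data: conversation 1 has two prompts, the first without code.
toy = pd.DataFrame({
    "conversation_id": [1, 1, 2],
    "message_id": [10, 12, 20],
    "code": [None, "print('hi')", "x = 1"],
})

# GroupBy.first() picks the first non-null value per column, so the row for
# conversation 1 combines message_id 10 with the code of message 12.
per_column_first = toy.groupby("conversation_id").first().reset_index()

# Sorting (assumed order: message_id) and keeping head(1) returns the first
# row of each conversation unchanged, including its missing values.
first_rows = (
    toy.sort_values("message_id")
       .groupby("conversation_id")
       .head(1)
       .reset_index(drop=True)
)

print(per_column_first)
print(first_rows)
```

If whole rows are needed, the `head(1)` variant is the safer of the two, because every field in a result row then comes from the same message.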


## From Scratch


In [4]:
import numpy as np

# Note: despite its name, scratch_ids_python lists every selected from-scratch
# prompt, including the TypeScript, JavaScript and HTML IDs repeated below.
scratch_ids_python = np.array([5, 43, 47, 57, 65, 126, 242, 266, 268, 290, 606, 608, 656, 730, 752, 756, 764, 766, 861, 865, 981, 985, 1023, 1133, 1162, 1164, 1464, 1510, 1520, 1524, 1534, 1538, 1598, 1664])
scratch_ids_ts = [606, 752, 764, 985]
scratch_ids_js = [43, 1520]
scratch_ids_jshtml = [608, 1510]
scratch_ids_htmlcss = [656, 1162]

# np.unique drops the IDs that appear in more than one list.
scratch_ids = np.unique(np.concatenate([scratch_ids_python, scratch_ids_ts, scratch_ids_js, scratch_ids_jshtml, scratch_ids_htmlcss]))
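
The per-language ID lists could also serve as the single source for both the combined ID list and the language lookup, so the two cannot drift apart. The following is a minimal sketch of that idea, not the notebook's method. `ids_by_language` and `build_language_lookup` are hypothetical names, and only a small subset of the IDs is used.

```python
# Hypothetical per-language ID lists (a small subset, for illustration only).
ids_by_language = {
    "python": [5, 47, 57],
    "typescript": [606, 752],
    "javascript": [43, 1520],
}


def build_language_lookup(ids_by_language):
    """Invert {language: [ids]} into {id: language}, rejecting duplicates."""
    lookup = {}
    for language, ids in ids_by_language.items():
        for message_id in ids:
            if message_id in lookup:
                raise ValueError(
                    f"message_id {message_id} is listed under two languages"
                )
            lookup[message_id] = language
    return lookup


language_lookup = build_language_lookup(ids_by_language)
# The combined ID list falls out of the lookup keys.
all_ids = sorted(language_lookup)
print(all_ids)
```

Raising on a duplicate ID makes an accidental double assignment visible right away.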



In [5]:
scratch_prompts = prompt_candidates[prompt_candidates['message_id'].isin(scratch_ids)]
scratch_prompts


Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id
4,6,5,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16
5,7,43,user,whats the best way to encode and compress a ja...,whats the best way to encode and compress a ja...,,,Man (cisgender),25
6,8,47,user,I have a pandas dataframe like this:\ndata\tpe...,I have a pandas dataframe like this:\n\nI want...,data\tpersona\tinstruction\toriginal\tcritique...,"Some prompts may only contain code, some only ...",Woman (cisgender),28
7,10,57,user,"as a NLP and LLM researcher, I am recently dow...","as a NLP and LLM researcher, I am recently dow...",,,Non-binary,30
9,12,65,user,Blender and Python. I have a collection of hun...,Blender and Python. I have a collection of hun...,,,Man (cisgender),34
10,13,126,user,"how to run a Python future without blocking, i...","how to run a Python future without blocking, i...",,,Man (cisgender),46
12,15,242,user,hey can you write me a short python script for...,hey can you write me a short python script for...,,,Woman (cisgender),48
15,18,266,user,wie kann ich zwei grib dateien in jupyter note...,wie kann ich zwei grib dateien in jupyter note...,,,Woman (cisgender),60
17,20,268,user,Ich arbeite mit Python und muss ein NMEA File ...,Ich arbeite mit Python und muss ein NMEA File ...,,Das file heißt Aufgabe3_NMEA.txt,Woman (cisgender),65
18,21,290,user,please write method to unzip file in python,please write method to unzip file in python,,,Woman (cisgender),73


### Annotate with programming language

In [7]:
language_dict = {
    5: "python",
    47: "python",
    57: "python",
    65: "python",
    126: "python",
    242: "python",
    266: "python",
    268: "python",
    290: "python",
    730: "python",
    756: "python",
    766: "python",
    861: "python",
    865: "python",
    981: "python",
    1023: "python",
    1133: "python",
    1164: "python",
    1464: "python",
    1524: "python",
    1534: "python",
    1538: "python",
    1598: "python",
    1664: "python",

    43: "javascript",
    1520: "javascript",

    606: "typescript",
    752: "typescript",
    764: "typescript",
    985: "typescript",

    608: "html_js",
    1510: "html_js",

    656: "html_css",
    1162: "html_css",
}


def get_language(message_id):
    # Fall back to "unknown" for IDs that were not annotated manually.
    return language_dict.get(message_id, "unknown")

# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when a column is written into a filtered slice.
scratch_prompts = scratch_prompts.assign(
    programming_language=scratch_prompts["message_id"].apply(get_language)
)
scratch_prompts


Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id,programming_language
4,6,5,user,I want to use Dummy Hot encoding to replace th...,I want to use Dummy Hot encoding to replace th...,,,Woman (cisgender),16,python
5,7,43,user,whats the best way to encode and compress a ja...,whats the best way to encode and compress a ja...,,,Man (cisgender),25,javascript
6,8,47,user,I have a pandas dataframe like this:\ndata\tpe...,I have a pandas dataframe like this:\n\nI want...,data\tpersona\tinstruction\toriginal\tcritique...,"Some prompts may only contain code, some only ...",Woman (cisgender),28,python
7,10,57,user,"as a NLP and LLM researcher, I am recently dow...","as a NLP and LLM researcher, I am recently dow...",,,Non-binary,30,python
9,12,65,user,Blender and Python. I have a collection of hun...,Blender and Python. I have a collection of hun...,,,Man (cisgender),34,python
10,13,126,user,"how to run a Python future without blocking, i...","how to run a Python future without blocking, i...",,,Man (cisgender),46,python
12,15,242,user,hey can you write me a short python script for...,hey can you write me a short python script for...,,,Woman (cisgender),48,python
15,18,266,user,wie kann ich zwei grib dateien in jupyter note...,wie kann ich zwei grib dateien in jupyter note...,,,Woman (cisgender),60,python
17,20,268,user,Ich arbeite mit Python und muss ein NMEA File ...,Ich arbeite mit Python und muss ein NMEA File ...,,Das file heißt Aufgabe3_NMEA.txt,Woman (cisgender),65,python
18,21,290,user,please write method to unzip file in python,please write method to unzip file in python,,,Woman (cisgender),73,python


In [8]:
scratch_prompts.to_sql('scratch_prompts', conn, if_exists='replace', index=False)

34
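
The number printed above is the return value of `to_sql`, which reports the rows written. A quick round-trip check can confirm that the stored table matches the DataFrame. The sketch below runs against an in-memory SQLite database with hypothetical demo rows, so it does not touch `giicg.db`.

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch leaves the project database untouched.
demo_conn = sqlite3.connect(":memory:")

# Hypothetical demo rows standing in for the annotated prompts.
demo = pd.DataFrame({
    "message_id": [5, 43],
    "programming_language": ["python", "javascript"],
})

demo.to_sql("scratch_prompts", demo_conn, if_exists="replace", index=False)

# Read the row count back and compare it with the DataFrame length.
stored = pd.read_sql("SELECT COUNT(*) AS n FROM scratch_prompts", demo_conn)
row_count = int(stored.loc[0, "n"])
demo_conn.close()

print(row_count == len(demo))
```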

## Refactor

In [17]:
refactor_ids = [3, 248, 592, 596, 600, 654, 724, 732, 736, 855, 881, 1131, 1147, 1155, 1200, 1208, 1532, 1566, 1594, 1632, 1646, 1648, 1670, 1694, 1751]
refactor_prompts = prompt_candidates[prompt_candidates['message_id'].isin(refactor_ids)]
refactor_prompts

Unnamed: 0,conversation_id,message_id,role,message_text,conversational,code,other,gender,user_id
1,3,3,user,Can you adapt the following code so that inste...,Can you adapt the following code so that inste...,# Create a bar plot to visualize the counts of...,,Woman (cisgender),11
12,16,248,user,import itertools\nimport numpy as np\nimport p...,wrtie me a script that updates this chart to i...,import itertools\nimport numpy as np\nimport p...,Patients\n CT\n MRI\n WSI\n Genomics\n Clinica...,Woman (cisgender),55
15,19,1751,user,\n def remove_fast_irregular_vectors(fl...,can you give me a new code snipped where I jus...,"def remove_fast_irregular_vectors(flow, magnit...",You are tasked with separating user prompts in...,Woman (cisgender),63
18,22,592,user,I have this function that is supposed to perfo...,I have this function that is supposed to perfo...,"@jit\ndef simpson_uniform(f, x):\n """"""\n ...",100 0.3333333333333333 0.3483835051532227\n\n-...,Man (cisgender),77
19,23,596,user,import matplotlib.pyplot as plt\nimport numpy ...,partendo da qui potresti mettermi a zero tutti...,import matplotlib.pyplot as plt\nimport numpy ...,You are tasked with separating user prompts in...,Woman (cisgender),79
20,24,600,user,Ich habe ein Programmierprojekt und brauche Hi...,Ich habe ein Programmierprojekt und brauche Hi...,starter.py:\n\nimport sys\nimport os\nimport s...,,Man (cisgender),81
25,29,654,user,is there a way to get the object key in here?\...,is there a way to get the object key in here?,def get_events_as_notifications(bin_file: byte...,it is an sns notification,Woman (cisgender),90
27,31,724,user,import pandas as pd\nimport numpy as np\nfrom ...,Please replace my retrieval pipeline here with...,import pandas as pd\nimport numpy as np\nfrom ...,You are tasked with separating user prompts in...,Man (cisgender),92
29,33,732,user,I want to remove all rows where the task_id=56...,I want to remove all rows where the task_id=56...,df_base_incorrect_fast_removed = df_base_incor...,# remove users who were very fast and did not ...,Woman (cisgender),11
30,34,736,user,I have this three classes that are very simila...,I have this three classes that are very simila...,"class Block:\n """"""\n Implements a block ...",,Woman (cisgender),16


In [18]:
refactor_prompts.to_sql('refactor_prompts', conn, if_exists='replace', index=False)

25

In [19]:
conn.close()
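
If a cell fails before `conn.close()` is reached, the connection stays open. One option, sketched here with an in-memory database and a hypothetical one-column table, is to wrap the connection in `contextlib.closing`. A `sqlite3` connection used directly in a `with` statement only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() is called when the block exits,
# even if one of the statements inside raises an exception.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE prompts (message_id INTEGER)")
    demo_conn.execute("INSERT INTO prompts VALUES (1)")
    demo_conn.commit()
    rows = demo_conn.execute("SELECT message_id FROM prompts").fetchall()

print(rows)
```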