# Populating the Database

In this notebook, I populate the database with all of the data. To do this, I go through each data file I want to use, inspect how it is structured, and then insert its rows into the table.

In [1]:
# Importing libraries
import pandas as pd
import re

## Putting the different data sources into the table

In this section, I go through the files for each data source and put them into the table.

### DAIGT data - Llama 70B and Falcon 180B

In [2]:
# Getting the data csvs
falcon = pd.read_csv('../data/llama70b-and-falcon70b-generated/falcon_180b_v1.csv')
llama = pd.read_csv('../data/llama70b-and-falcon70b-generated/llama_70b_v2.csv',index_col=0)
combined = pd.read_csv('../data/llama70b-and-falcon70b-generated/llama_falcon_v3.csv')

In [3]:
# Viewing falcon generated
falcon.head(5) # falcon has no 'generated' column (the LLM-written label), so it needs to be added

Unnamed: 0,generated_text,writing_prompt
0,"Dear Principal,\n\nI am writing to express my ...",Your principal is considering changing school ...
1,When people are faced with a difficult decisio...,"When people ask for advice, they sometimes tal..."
2,"As a grade 12 student, I believe that summer p...",Some schools require students to complete summ...
3,"Dear Principal,\n\nI am writing to share my th...",Some of your friends perform community service...
4,"""Making Mona Lisa Smile"" is an interesting art...","In the article ""Making Mona Lisa Smile,"" the a..."


In [4]:
# Viewing llama generated
llama.head(5)

Unnamed: 0,generated_text,writing_prompt,generated
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1


In [5]:
# Viewing combined generated
combined.head(5)

Unnamed: 0,text,generated,prompt_name,model
0,One way we can make a positive change is by li...,1,Car-free cities,llama_70b
1,The experimental district of Vauban in Germany...,1,Car-free cities,llama_70b
2,"The successful ""Day Without Cars"" event in Bog...",1,Car-free cities,llama_70b
3,"The exhaust from cars pollutes our air, and th...",1,Car-free cities,llama_70b
4,"Recently, Paris faced a severe pollution probl...",1,Car-free cities,llama_70b


In [6]:
# Want to see how many prompts I have in combined
combined['prompt_name'].value_counts() # 7 total prompts

prompt_name
Car-free cities                     1000
Does the electoral college work?    1000
The Face on Mars                    1000
"A Cowboy Who Rode the Waves"       1000
Exploring Venus                     1000
Facial action coding system         1000
Driverless cars                     1000
Name: count, dtype: int64

Based on this initial look, the falcon and llama dataframes are nearly ready to be inserted into the table. The falcon dataframe just needs a generated column, set to 1 for every row. For the combined dataframe, I need to map each prompt_name to its full prompt text. Fortunately, the PERSUADE 2.0 corpus provides this mapping, so I can apply it before adding the combined dataframe to the table.

The table also needs a word count feature, so I add a word_count column to each dataframe.

In [7]:
# Adding a generated column in falcon
falcon['generated'] = 1
falcon.head(5)

Unnamed: 0,generated_text,writing_prompt,generated
0,"Dear Principal,\n\nI am writing to express my ...",Your principal is considering changing school ...,1
1,When people are faced with a difficult decisio...,"When people ask for advice, they sometimes tal...",1
2,"As a grade 12 student, I believe that summer p...",Some schools require students to complete summ...,1
3,"Dear Principal,\n\nI am writing to share my th...",Some of your friends perform community service...,1
4,"""Making Mona Lisa Smile"" is an interesting art...","In the article ""Making Mona Lisa Smile,"" the a...",1


In [8]:
# Creating a function to get the word count of a piece of text
def get_word_count(text:str) -> int:
    """
    get_word_count

    A function to get the word count of some text.

    inputs:
    - text: the string to count the words of.

    outputs:
    - an integer representing the word count
    """
    # A "word" here is any run of ASCII letters or underscores;
    # digits and punctuation act as separators
    return len(re.findall(r'[a-zA-Z_]+',text))
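
Because the pattern only matches runs of letters and underscores, the count can differ from a plain whitespace split: contractions are split at the apostrophe and digits are dropped. A small sketch of that behaviour on a made-up sentence (the sample string is my own, not from the data):

```python
import re

def get_word_count(text: str) -> int:
    # Same pattern as the notebook's function: runs of ASCII letters/underscores
    return len(re.findall(r'[a-zA-Z_]+', text))

# Made-up sample sentence, used only for illustration
sample = "I don't like 70b models"
tokens = re.findall(r'[a-zA-Z_]+', sample)

print(tokens)                   # ['I', 'don', 't', 'like', 'b', 'models']
print(get_word_count(sample))   # 6
print(len(sample.split()))      # 5 (whitespace split, for comparison)
```

This matters mostly when comparing against word counts that were computed elsewhere with a different rule.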

In [9]:
# Adding a word count feature to the falcon dataset
falcon['word_count'] = falcon['generated_text'].apply(get_word_count)
falcon.tail(5)

Unnamed: 0,generated_text,writing_prompt,generated,word_count
1050,(I am not capable of personal opinions or beli...,The role of zoos in conservation and education...,1,360
1051,"In ""The Challenge of Exploring Venus,"" the aut...","In ""The Challenge of Exploring Venus,"" the aut...",1,442
1052,"The article ""Making Mona Lisa Smile"" discusses...","In the article ""Making Mona Lisa Smile,"" the a...",1,327
1053,"As a grade 6 student, I am still learning abou...",The issue of gun control is a highly contentio...,1,313
1054,Passage 1:\n\nCars are one of the main ways in...,Write an explanatory essay to inform fellow ci...,1,401


In [10]:
# Adding a word count feature to llama dataset
llama['word_count'] = llama['generated_text'].apply(get_word_count)
llama.head(5)

Unnamed: 0,generated_text,writing_prompt,generated,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307


In [11]:
# Adding a word count feature to combined dataset
combined['word_count'] = combined['text'].apply(get_word_count)
combined.head(5)

Unnamed: 0,text,generated,prompt_name,model,word_count
0,One way we can make a positive change is by li...,1,Car-free cities,llama_70b,365
1,The experimental district of Vauban in Germany...,1,Car-free cities,llama_70b,486
2,"The successful ""Day Without Cars"" event in Bog...",1,Car-free cities,llama_70b,416
3,"The exhaust from cars pollutes our air, and th...",1,Car-free cities,llama_70b,367
4,"Recently, Paris faced a severe pollution probl...",1,Car-free cities,llama_70b,466


In [12]:
# Combining the llama dataset and the falcon dataset into one dataframe
llama_and_falcon = pd.concat([llama,falcon],axis=0,ignore_index=True)
llama_and_falcon

Unnamed: 0,generated_text,writing_prompt,generated,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
2222,(I am not capable of personal opinions or beli...,The role of zoos in conservation and education...,1,360
2223,"In ""The Challenge of Exploring Venus,"" the aut...","In ""The Challenge of Exploring Venus,"" the aut...",1,442
2224,"The article ""Making Mona Lisa Smile"" discusses...","In the article ""Making Mona Lisa Smile,"" the a...",1,327
2225,"As a grade 6 student, I am still learning abou...",The issue of gun control is a highly contentio...,1,313


In [13]:
# Renaming the columns
llama_and_falcon.rename(columns={'generated_text':'essay','writing_prompt':'prompt','generated':'LLM_written'},inplace=True)
llama_and_falcon.head(5)

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307


### PERSUADE 2.0 corpus

In [14]:
# Getting the data
persuade_corpus = pd.read_csv('../data/persuade-corpus/persuade_2.0_human_scores_demo_id_github.csv')
persuade_corpus.head(5)

Unnamed: 0,essay_id_comp,full_text,holistic_essay_score,word_count,prompt_name,task,assignment,source_text,gender,grade_level,ell_status,race_ethnicity,economically_disadvantaged,student_disability_status
0,423A1CA112E2,Phones\n\nModern humans today are always on th...,3,378,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,Black/African American,,
1,BC75783F96E3,This essay will explain if drivers should or s...,4,432,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,Black/African American,,
2,74C8BC7417DE,Driving while the use of cellular devices\n\nT...,2,179,Phones and driving,Independent,Today the majority of humans own and operate c...,,F,,,White,,
3,A8445CABFECE,Phones & Driving\n\nDrivers should not be able...,3,221,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,Black/African American,,
4,6B4F7A0165B9,Cell Phone Operation While Driving\n\nThe abil...,4,334,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,White,,


In [15]:
# Dropping the unnecessary columns
persuade_corpus.drop(['essay_id_comp','holistic_essay_score','task','source_text','gender',
                      'grade_level','ell_status','race_ethnicity','economically_disadvantaged',
                      'student_disability_status'],axis=1,inplace=True)
persuade_corpus.head(5)

Unnamed: 0,full_text,word_count,prompt_name,assignment
0,Phones\n\nModern humans today are always on th...,378,Phones and driving,Today the majority of humans own and operate c...
1,This essay will explain if drivers should or s...,432,Phones and driving,Today the majority of humans own and operate c...
2,Driving while the use of cellular devices\n\nT...,179,Phones and driving,Today the majority of humans own and operate c...
3,Phones & Driving\n\nDrivers should not be able...,221,Phones and driving,Today the majority of humans own and operate c...
4,Cell Phone Operation While Driving\n\nThe abil...,334,Phones and driving,Today the majority of humans own and operate c...


In [16]:
# Getting one (prompt_name, assignment) record per prompt for the mapper
column_dict = persuade_corpus[['prompt_name','assignment']].drop_duplicates(subset='prompt_name').to_dict('records')

# Building the prompt_mapper: prompt_name -> assignment text
prompt_mapper = {row['prompt_name']: row['assignment'] for row in column_dict}

# printing the prompt mapper
prompt_mapper

{'Phones and driving': 'Today the majority of humans own and operate cell phones on a daily basis. In essay form, explain if drivers should or should not be able to use cell phones in any capacity while operating a vehicle.',
 'Car-free cities': 'Write an explanatory essay to inform fellow citizens about the advantages of limiting car usage. Your essay must be based on ideas and information that can be found in the passage set. Manage your time carefully so that you can read the passages; plan your response; write your response; and revise and edit your response. Be sure to use evidence from multiple sources; and avoid overly relying on one source. Your response should be in the form of a multiparagraph essay. Write your essay in the space provided.',
 'Summer projects': 'Some schools require students to complete summer projects to assure they continue learning during their break. Should these summer projects be teacher-designed or student-designed? Take a position on this question. Su
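
The mapper keeps the first assignment seen for each prompt_name, which is only safe if every prompt_name has a single assignment text. I have not verified that for the real file here; the following is a minimal sketch of how such a check could look, using a toy dataframe whose values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for persuade_corpus; the values are invented
toy = pd.DataFrame({
    'prompt_name': ['A', 'A', 'B'],
    'assignment': ['Prompt text A', 'Prompt text A', 'Prompt text B'],
})

# Count distinct assignment texts per prompt_name
per_name = toy.groupby('prompt_name')['assignment'].nunique()
ambiguous = per_name[per_name > 1].index.tolist()
print(ambiguous)  # [] means every prompt_name has exactly one assignment

# Equivalent way to build the mapper once the check passes
toy_mapper = (
    toy.drop_duplicates(subset='prompt_name')
       .set_index('prompt_name')['assignment']
       .to_dict()
)
print(toy_mapper)
```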

In [17]:
# Using the mapping to add the prompt to the combined dataset
combined['prompt'] = combined['prompt_name'].map(prompt_mapper)
combined.head(5)

Unnamed: 0,text,generated,prompt_name,model,word_count,prompt
0,One way we can make a positive change is by li...,1,Car-free cities,llama_70b,365,Write an explanatory essay to inform fellow ci...
1,The experimental district of Vauban in Germany...,1,Car-free cities,llama_70b,486,Write an explanatory essay to inform fellow ci...
2,"The successful ""Day Without Cars"" event in Bog...",1,Car-free cities,llama_70b,416,Write an explanatory essay to inform fellow ci...
3,"The exhaust from cars pollutes our air, and th...",1,Car-free cities,llama_70b,367,Write an explanatory essay to inform fellow ci...
4,"Recently, Paris faced a severe pollution probl...",1,Car-free cities,llama_70b,466,Write an explanatory essay to inform fellow ci...
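
`Series.map` with a dictionary returns NaN for any key that is missing from the dictionary, so a prompt_name that is spelled differently in the two sources would silently end up without a prompt. A hedged sketch of a check for that, on toy data (the names `toy` and `unmapped` and the shortened prompt texts are placeholders of mine):

```python
import pandas as pd

# Placeholder mapper with shortened texts; not the real prompts
toy_prompt_mapper = {
    'Car-free cities': 'Write an explanatory essay ...',
    'Driverless cars': 'In the article ...',
}
toy = pd.DataFrame(
    {'prompt_name': ['Car-free cities', 'Driverless cars', 'Unknown prompt']}
)

toy['prompt'] = toy['prompt_name'].map(toy_prompt_mapper)

# Keys missing from the mapper become NaN, so list them explicitly
unmapped = toy.loc[toy['prompt'].isna(), 'prompt_name'].unique().tolist()
print(unmapped)  # ['Unknown prompt']
```

An empty list from the same check on the combined dataframe would confirm that all seven prompt names were found in the mapper.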


In [18]:
# Dropping and renaming columns
combined.drop(['prompt_name','model'],axis=1,inplace=True)
combined.rename(columns={'text':'essay','generated':'LLM_written'},inplace=True)
combined.head(5)

Unnamed: 0,essay,LLM_written,word_count,prompt
0,One way we can make a positive change is by li...,1,365,Write an explanatory essay to inform fellow ci...
1,The experimental district of Vauban in Germany...,1,486,Write an explanatory essay to inform fellow ci...
2,"The successful ""Day Without Cars"" event in Bog...",1,416,Write an explanatory essay to inform fellow ci...
3,"The exhaust from cars pollutes our air, and th...",1,367,Write an explanatory essay to inform fellow ci...
4,"Recently, Paris faced a severe pollution probl...",1,466,Write an explanatory essay to inform fellow ci...


In [19]:
# Concatenating combined with the current dataset
fully_concatenated_data = pd.concat([llama_and_falcon,combined])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
6995,"Driverless cars, also known as autonomous cars...","In the article “Driverless Cars are Coming,” t...",1,397
6996,Driverless Cars: A Necessity for Our Future\n\...,"In the article “Driverless Cars are Coming,” t...",1,405
6997,The Pros and Cons of Driverless Cars\n\nThe wo...,"In the article “Driverless Cars are Coming,” t...",1,661
6998,The development of driverless cars has been a ...,"In the article “Driverless Cars are Coming,” t...",1,596
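
The tail of this output shows index labels up to 6998 rather than a running row count, because `pd.concat` keeps each frame's original index unless `ignore_index=True` is passed. A short sketch of the difference on two toy frames (the frame names and values are placeholders):

```python
import pandas as pd

a = pd.DataFrame({'essay': ['a0', 'a1']})
b = pd.DataFrame({'essay': ['b0', 'b1']})

kept = pd.concat([a, b])                      # original labels are kept
fresh = pd.concat([a, b], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())    # [0, 1, 0, 1]
print(fresh.index.tolist())   # [0, 1, 2, 3]
print(kept.index.is_unique)   # False
```

Duplicate labels are harmless if the index is discarded when writing to the table, but label-based lookups such as `.loc[0]` would return several rows.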


In [21]:
# Working with the persuade corpus
persuade_corpus

Unnamed: 0,full_text,word_count,prompt_name,assignment
0,Phones\n\nModern humans today are always on th...,378,Phones and driving,Today the majority of humans own and operate c...
1,This essay will explain if drivers should or s...,432,Phones and driving,Today the majority of humans own and operate c...
2,Driving while the use of cellular devices\n\nT...,179,Phones and driving,Today the majority of humans own and operate c...
3,Phones & Driving\n\nDrivers should not be able...,221,Phones and driving,Today the majority of humans own and operate c...
4,Cell Phone Operation While Driving\n\nThe abil...,334,Phones and driving,Today the majority of humans own and operate c...
...,...,...,...,...
25991,80% of Americans believe seeking multiple opin...,1050,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."
25992,"When people ask for advice,they sometimes talk...",373,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."
25993,"During a group project, have you ever asked a ...",631,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."
25994,Making choices in life can be very difficult. ...,417,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."


In [22]:
# Adding the LLM_written column (0 = human-written)
persuade_corpus['LLM_written'] = 0

# Dropping prompt name
persuade_corpus.drop(['prompt_name'],axis=1,inplace=True)

# Renaming columns
persuade_corpus.rename(columns={'full_text':'essay','assignment':'prompt'},inplace=True)
persuade_corpus.head(5)

Unnamed: 0,essay,word_count,prompt,LLM_written
0,Phones\n\nModern humans today are always on th...,378,Today the majority of humans own and operate c...,0
1,This essay will explain if drivers should or s...,432,Today the majority of humans own and operate c...,0
2,Driving while the use of cellular devices\n\nT...,179,Today the majority of humans own and operate c...,0
3,Phones & Driving\n\nDrivers should not be able...,221,Today the majority of humans own and operate c...,0
4,Cell Phone Operation While Driving\n\nThe abil...,334,Today the majority of humans own and operate c...,0
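
Each source goes through the same steps: pick the essay and prompt columns, rename them, add the LLM_written label, and make sure word_count exists. One possible way to factor that out is sketched below; `to_common_schema` is a hypothetical helper of mine, not something this notebook defines, and the toy row is invented.

```python
import re
import pandas as pd

def get_word_count(text: str) -> int:
    # Same counting rule as the notebook's function
    return len(re.findall(r'[a-zA-Z_]+', text))

def to_common_schema(df, essay_col, prompt_col, llm_written):
    """Hypothetical helper: return essay, prompt, LLM_written, word_count."""
    out = df[[essay_col, prompt_col]].rename(
        columns={essay_col: 'essay', prompt_col: 'prompt'}
    ).copy()
    out['LLM_written'] = llm_written
    if 'word_count' in df.columns:
        # Reuse a word count the source already provides
        out['word_count'] = df['word_count']
    else:
        out['word_count'] = out['essay'].apply(get_word_count)
    return out[['essay', 'prompt', 'LLM_written', 'word_count']]

# Invented one-row frame shaped like the persuade corpus columns
toy = pd.DataFrame({
    'full_text': ['Phones are everywhere today'],
    'assignment': ['Explain ...'],
})
result = to_common_schema(toy, 'full_text', 'assignment', llm_written=0)
print(result.columns.tolist())
print(result.loc[0, 'word_count'])  # 4
```

Returning a fixed column order also keeps every frame aligned before concatenation.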


In [23]:
# Concatenating the persuade corpus with the running dataset
fully_concatenated_data = pd.concat([fully_concatenated_data,persuade_corpus])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
25991,80% of Americans believe seeking multiple opin...,"When people ask for advice, they sometimes tal...",0,1050
25992,"When people ask for advice,they sometimes talk...","When people ask for advice, they sometimes tal...",0,373
25993,"During a group project, have you ever asked a ...","When people ask for advice, they sometimes tal...",0,631
25994,Making choices in life can be very difficult. ...,"When people ask for advice, they sometimes tal...",0,417


### 1000 essays from Anthropic

In [28]:
anthropic = pd.read_csv('../data/persuade15_claude_instant1.csv')
anthropic.head(5)

Unnamed: 0,prompt_id,essay_title,essay_text
0,14,Some schools offer distance learning as an opt...,While distance learning offers certain benefit...
1,11,"In the article “Driverless Cars are Coming,” t...",The Development of Driverless Cars\n\nWhile dr...
2,8,You have read the article 'Unmasking the Face ...,"While the mysterious formation known as the ""F..."
3,6,"In ""The Challenge of Exploring Venus,"" the aut...",Studying Venus Remains a Worthy Pursuit\n\nWhi...
4,11,"In the article “Driverless Cars are Coming,” t...",Driverless Cars: An Argument in Favor\n\nThe d...


In [29]:
# Getting the word counts
anthropic['word_count'] = anthropic['essay_text'].apply(get_word_count)
anthropic.head(5)

Unnamed: 0,prompt_id,essay_title,essay_text,word_count
0,14,Some schools offer distance learning as an opt...,While distance learning offers certain benefit...,307
1,11,"In the article “Driverless Cars are Coming,” t...",The Development of Driverless Cars\n\nWhile dr...,356
2,8,You have read the article 'Unmasking the Face ...,"While the mysterious formation known as the ""F...",318
3,6,"In ""The Challenge of Exploring Venus,"" the aut...",Studying Venus Remains a Worthy Pursuit\n\nWhi...,303
4,11,"In the article “Driverless Cars are Coming,” t...",Driverless Cars: An Argument in Favor\n\nThe d...,322


In [30]:
# Dropping columns, adding columns, and renaming columns
anthropic.drop(['prompt_id'],axis=1,inplace=True)
anthropic['LLM_written'] = 1
anthropic.rename(columns={'essay_title':'prompt','essay_text':'essay'},inplace=True)
anthropic.head(5)

Unnamed: 0,prompt,essay,word_count,LLM_written
0,Some schools offer distance learning as an opt...,While distance learning offers certain benefit...,307,1
1,"In the article “Driverless Cars are Coming,” t...",The Development of Driverless Cars\n\nWhile dr...,356,1
2,You have read the article 'Unmasking the Face ...,"While the mysterious formation known as the ""F...",318,1
3,"In ""The Challenge of Exploring Venus,"" the aut...",Studying Venus Remains a Worthy Pursuit\n\nWhi...,303,1
4,"In the article “Driverless Cars are Coming,” t...",Driverless Cars: An Argument in Favor\n\nThe d...,322,1


In [31]:
# Adding anthropic to the final dataset
fully_concatenated_data = pd.concat([fully_concatenated_data,anthropic])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
995,Limiting car usage has many benefits for moder...,Write an explanatory essay to inform fellow ci...,1,349
996,The Rise of Driverless Cars\n\nThe development...,"In the article “Driverless Cars are Coming,” t...",1,297
997,Schools should allow students to design their ...,Some schools require students to complete summ...,1,303
998,The Open Sea Beckons\n\nThe Seagoing Cowboys p...,"You have just read the article, 'A Cowboy Who ...",1,357


### ArguGPT

In [33]:
# Importing the data
argugpt = pd.read_csv('../data/ArguGPT/argugpt.csv')
argugpt.head(5)

Unnamed: 0,id,prompt_id,prompt,text,model,temperature,exam_type,score,score_level
0,weccl_30,WECCL-17,Some people think the university education is ...,There are many people who think that universit...,text-babbage-001,0.5,weccl,19,high
1,weccl_51,WECCL-17,Some people think the university education is ...,There are a number of reasons why people might...,text-babbage-001,0.65,weccl,13,medium
2,weccl_48,WECCL-17,Some people think the university education is ...,There are many reasons why university educatio...,text-babbage-001,0.65,weccl,13,medium
3,weccl_50,WECCL-17,Some people think the university education is ...,There are many people who think that universit...,text-babbage-001,0.65,weccl,12,low
4,weccl_55,WECCL-17,Some people think the university education is ...,There is a general consensus that university e...,text-babbage-001,0.8,weccl,13,medium


In [34]:
# Getting the columns that I need (copying to avoid a SettingWithCopyWarning)
argugpt = argugpt[['prompt','text']].copy()

# Adding the word_count and LLM_written columns
argugpt['word_count'] = argugpt['text'].apply(get_word_count)
argugpt['LLM_written'] = 1

# Renaming the text column
argugpt.rename(columns={'text':'essay'},inplace=True)
argugpt.head()

Unnamed: 0,prompt,essay,word_count,LLM_written
0,Some people think the university education is ...,There are many people who think that universit...,324,1
1,Some people think the university education is ...,There are a number of reasons why people might...,241,1
2,Some people think the university education is ...,There are many reasons why university educatio...,114,1
3,Some people think the university education is ...,There are many people who think that universit...,174,1
4,Some people think the university education is ...,There is a general consensus that university e...,111,1


In [35]:
# Concatenating this data
fully_concatenated_data = pd.concat([fully_concatenated_data,argugpt])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
4033,The notion that one must be forced to defend a...,Only by being forced to defend an idea against...,1,568
4034,I strongly agree with the statement that menta...,Students should be encouraged to realize that ...,1,352
4035,"In today’s world, where competition is highly ...",The best preparation for life or a career is n...,1,540
4036,Education is one of the most powerful tools th...,AII nations should help support the developmen...,1,428
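
With all sources in one dataframe, the remaining step is writing the rows to the table. This section does not show which database engine or table name is used, so the following is only a sketch under my own assumptions: an in-memory SQLite database and a table called `essays`, filled from an invented two-row frame.

```python
import sqlite3
import pandas as pd

# Invented rows with the same four columns as fully_concatenated_data
toy = pd.DataFrame({
    'essay': ['First essay text', 'Second essay text'],
    'prompt': ['Prompt A', 'Prompt B'],
    'LLM_written': [1, 0],
    'word_count': [3, 3],
})

# Assumption: SQLite and the table name 'essays' are placeholders
conn = sqlite3.connect(':memory:')

# index=False drops the dataframe index, so duplicate labels do not matter
toy.to_sql('essays', conn, if_exists='append', index=False)

row_count = conn.execute('SELECT COUNT(*) FROM essays').fetchone()[0]
llm_rows = conn.execute(
    'SELECT COUNT(*) FROM essays WHERE LLM_written = 1'
).fetchone()[0]
print(row_count, llm_rows)  # 2 1
conn.close()
```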
