# Populating the Database

In this notebook, I populate the database with all of the data. To do this, I go through each data file I want to use, inspect how it is laid out, and then insert its rows into the table.

In [70]:
# Importing libraries
import pandas as pd
import re
import sys
import sqlalchemy

# Adding the credentials path
sys.path.append('../')
from credentials import credentials
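
The insert step itself can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's actual database code: an in-memory SQLite database stands in for the real one (whose connection details live in `credentials`), and the table name `essays` is hypothetical.

```python
import pandas as pd
import sqlalchemy

# Assumption: an in-memory SQLite database stands in for the real one, whose
# connection details come from the credentials module and are not shown here.
engine = sqlalchemy.create_engine("sqlite:///:memory:")

# Toy rows in the four-column layout used throughout this notebook.
rows = pd.DataFrame({
    "essay": ["A short human essay.", "A short generated essay."],
    "prompt": ["Write about cars.", "Write about cars."],
    "LLM_written": [0, 1],
    "word_count": [4, 4],
})

# "essays" is a hypothetical table name. if_exists="append" adds rows to an
# existing table (creating it if it is missing) instead of replacing it.
rows.to_sql("essays", engine, if_exists="append", index=False)

# Reading the rows back to confirm the insert.
with engine.connect() as conn:
    stored = pd.read_sql(sqlalchemy.text("SELECT * FROM essays"), conn)
```

Passing `index=False` keeps the dataframe index out of the table, which matters when the index labels carry no meaning.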

## Putting the different data sources into the table

In this section, I go through the files needed for each data source and put them into the table.

### DAIGT data - Llama 70b and Falcon 180b

In [2]:
# Getting the data csvs
falcon = pd.read_csv('../data/llama70b-and-falcon70b-generated/falcon_180b_v1.csv')
llama = pd.read_csv('../data/llama70b-and-falcon70b-generated/llama_70b_v2.csv',index_col=0)
combined = pd.read_csv('../data/llama70b-and-falcon70b-generated/llama_falcon_v3.csv')

In [3]:
# Viewing falcon generated
falcon.head(5) # Falcon doesn't have a column for LLM generated, need to add it

Unnamed: 0,generated_text,writing_prompt
0,"Dear Principal,\n\nI am writing to express my ...",Your principal is considering changing school ...
1,When people are faced with a difficult decisio...,"When people ask for advice, they sometimes tal..."
2,"As a grade 12 student, I believe that summer p...",Some schools require students to complete summ...
3,"Dear Principal,\n\nI am writing to share my th...",Some of your friends perform community service...
4,"""Making Mona Lisa Smile"" is an interesting art...","In the article ""Making Mona Lisa Smile,"" the a..."


In [4]:
# Viewing llama generated
llama.head(5)

Unnamed: 0,generated_text,writing_prompt,generated
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1


In [5]:
# Viewing combined generated
combined.head(5)

Unnamed: 0,text,generated,prompt_name,model
0,One way we can make a positive change is by li...,1,Car-free cities,llama_70b
1,The experimental district of Vauban in Germany...,1,Car-free cities,llama_70b
2,"The successful ""Day Without Cars"" event in Bog...",1,Car-free cities,llama_70b
3,"The exhaust from cars pollutes our air, and th...",1,Car-free cities,llama_70b
4,"Recently, Paris faced a severe pollution probl...",1,Car-free cities,llama_70b


In [6]:
# Want to see how many prompts I have in combined
combined['prompt_name'].value_counts() # 7 total prompts

prompt_name
Car-free cities                     1000
Does the electoral college work?    1000
The Face on Mars                    1000
"A Cowboy Who Rode the Waves"       1000
Exploring Venus                     1000
Facial action coding system         1000
Driverless cars                     1000
Name: count, dtype: int64

Based on this initial look, the falcon and llama dataframes are nearly ready to be inserted into the table. The falcon dataframe only needs a generated column, set to 1 for every row. For the combined dataframe, I need to map each prompt_name to its full prompt text. Fortunately, the persuade 2.0 corpus contains both the prompt name and the prompt text, so I can build the mapping from it before adding the combined dataframe to the table.

The table also needs a word count feature, so I add a word_count column to each dataframe.

In [7]:
# Adding a generated column in falcon
falcon['generated'] = 1
falcon.head(5)

Unnamed: 0,generated_text,writing_prompt,generated
0,"Dear Principal,\n\nI am writing to express my ...",Your principal is considering changing school ...,1
1,When people are faced with a difficult decisio...,"When people ask for advice, they sometimes tal...",1
2,"As a grade 12 student, I believe that summer p...",Some schools require students to complete summ...,1
3,"Dear Principal,\n\nI am writing to share my th...",Some of your friends perform community service...,1
4,"""Making Mona Lisa Smile"" is an interesting art...","In the article ""Making Mona Lisa Smile,"" the a...",1


In [8]:
# Creating a function to get the word count of a piece of text
def get_word_count(text:str) -> int:
    """
    get_word_count

    A function to get the word count of some text.

    inputs:
    - text: the string whose words should be counted.

    outputs:
    - an integer representing the word count
    """
    return len(re.findall(r'[a-zA-Z_]+',text))
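
To make clear what this pattern treats as a word, here is a small sketch that compares it with whitespace splitting. The helper name `count_letter_runs` and the sample sentence are made up for illustration; the regular expression is the one used above.

```python
import re

def count_letter_runs(text: str) -> int:
    # Same pattern as get_word_count: runs of ASCII letters or underscores.
    return len(re.findall(r'[a-zA-Z_]+', text))

sample = "A well-known, state-of-the-art idea from 2023."

# Hyphenated words are split into their parts and the number is skipped.
letter_runs = count_letter_runs(sample)

# Splitting on whitespace keeps hyphenated words whole and counts the number.
whitespace_tokens = len(sample.split())

print(letter_runs, whitespace_tokens)  # prints: 9 6
```

The two counts differ, so word counts from this function are only comparable with counts produced by the same rule.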

In [9]:
# Adding a word count feature to the falcon dataset
falcon['word_count'] = falcon['generated_text'].apply(get_word_count)
falcon.tail(5)

Unnamed: 0,generated_text,writing_prompt,generated,word_count
1050,(I am not capable of personal opinions or beli...,The role of zoos in conservation and education...,1,360
1051,"In ""The Challenge of Exploring Venus,"" the aut...","In ""The Challenge of Exploring Venus,"" the aut...",1,442
1052,"The article ""Making Mona Lisa Smile"" discusses...","In the article ""Making Mona Lisa Smile,"" the a...",1,327
1053,"As a grade 6 student, I am still learning abou...",The issue of gun control is a highly contentio...,1,313
1054,Passage 1:\n\nCars are one of the main ways in...,Write an explanatory essay to inform fellow ci...,1,401


In [10]:
# Adding a word count feature to llama dataset
llama['word_count'] = llama['generated_text'].apply(get_word_count)
llama.head(5)

Unnamed: 0,generated_text,writing_prompt,generated,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307


In [11]:
# Adding a word count feature to combined dataset
combined['word_count'] = combined['text'].apply(get_word_count)
combined.head(5)

Unnamed: 0,text,generated,prompt_name,model,word_count
0,One way we can make a positive change is by li...,1,Car-free cities,llama_70b,365
1,The experimental district of Vauban in Germany...,1,Car-free cities,llama_70b,486
2,"The successful ""Day Without Cars"" event in Bog...",1,Car-free cities,llama_70b,416
3,"The exhaust from cars pollutes our air, and th...",1,Car-free cities,llama_70b,367
4,"Recently, Paris faced a severe pollution probl...",1,Car-free cities,llama_70b,466


In [12]:
# Combining the llama dataset and the falcon dataset into one dataframe
llama_and_falcon = pd.concat([llama,falcon],axis=0,ignore_index=True)
llama_and_falcon

Unnamed: 0,generated_text,writing_prompt,generated,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
2222,(I am not capable of personal opinions or beli...,The role of zoos in conservation and education...,1,360
2223,"In ""The Challenge of Exploring Venus,"" the aut...","In ""The Challenge of Exploring Venus,"" the aut...",1,442
2224,"The article ""Making Mona Lisa Smile"" discusses...","In the article ""Making Mona Lisa Smile,"" the a...",1,327
2225,"As a grade 6 student, I am still learning abou...",The issue of gun control is a highly contentio...,1,313


In [13]:
# Renaming the columns
llama_and_falcon.rename(columns={'generated_text':'essay','writing_prompt':'prompt','generated':'LLM_written'},inplace=True)
llama_and_falcon.head(5)

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
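
The same rename, add-a-label, add-a-word-count steps recur for every data source, so they could be wrapped in one function. The sketch below is one possible way to do that; `to_common_schema` and `TARGET_COLUMNS` are hypothetical names that do not appear in the notebook.

```python
import re
import pandas as pd

# The four-column layout every source is converted to in this notebook.
TARGET_COLUMNS = ["essay", "prompt", "LLM_written", "word_count"]

def to_common_schema(df, essay_col, prompt_col, llm_written):
    """Hypothetical helper: return a copy of df in the shared layout."""
    # Copy first so that the caller's dataframe is left unchanged.
    out = df[[essay_col, prompt_col]].copy()
    out = out.rename(columns={essay_col: "essay", prompt_col: "prompt"})
    out["LLM_written"] = llm_written
    # Same counting rule as get_word_count.
    out["word_count"] = out["essay"].apply(
        lambda text: len(re.findall(r"[a-zA-Z_]+", text))
    )
    return out[TARGET_COLUMNS]

# Toy input shaped like the falcon and llama files.
toy = pd.DataFrame({
    "generated_text": ["Dear Principal, hello"],
    "writing_prompt": ["Write a letter."],
})
tidy = to_common_schema(toy, "generated_text", "writing_prompt", llm_written=1)
```

Returning the columns in a fixed order also keeps every source aligned before concatenation.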


### persuade corpus 2.0

In [14]:
# Getting the data
persuade_corpus = pd.read_csv('../data/persuade-corpus/persuade_2.0_human_scores_demo_id_github.csv')
persuade_corpus.head(5)

Unnamed: 0,essay_id_comp,full_text,holistic_essay_score,word_count,prompt_name,task,assignment,source_text,gender,grade_level,ell_status,race_ethnicity,economically_disadvantaged,student_disability_status
0,423A1CA112E2,Phones\n\nModern humans today are always on th...,3,378,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,Black/African American,,
1,BC75783F96E3,This essay will explain if drivers should or s...,4,432,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,Black/African American,,
2,74C8BC7417DE,Driving while the use of cellular devices\n\nT...,2,179,Phones and driving,Independent,Today the majority of humans own and operate c...,,F,,,White,,
3,A8445CABFECE,Phones & Driving\n\nDrivers should not be able...,3,221,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,Black/African American,,
4,6B4F7A0165B9,Cell Phone Operation While Driving\n\nThe abil...,4,334,Phones and driving,Independent,Today the majority of humans own and operate c...,,M,,,White,,


In [15]:
# Dropping the unnecessary columns
persuade_corpus.drop(['essay_id_comp','holistic_essay_score','task','source_text','gender',
                      'grade_level','ell_status','race_ethnicity','economically_disadvantaged',
                      'student_disability_status'],axis=1,inplace=True)
persuade_corpus.head(5)

Unnamed: 0,full_text,word_count,prompt_name,assignment
0,Phones\n\nModern humans today are always on th...,378,Phones and driving,Today the majority of humans own and operate c...
1,This essay will explain if drivers should or s...,432,Phones and driving,Today the majority of humans own and operate c...
2,Driving while the use of cellular devices\n\nT...,179,Phones and driving,Today the majority of humans own and operate c...
3,Phones & Driving\n\nDrivers should not be able...,221,Phones and driving,Today the majority of humans own and operate c...
4,Cell Phone Operation While Driving\n\nThe abil...,334,Phones and driving,Today the majority of humans own and operate c...


In [16]:
# Getting one row per prompt_name with its assignment text
unique_prompts = persuade_corpus[['prompt_name','assignment']].drop_duplicates(subset='prompt_name')

# Building the prompt_mapper: prompt_name -> assignment
prompt_mapper = dict(zip(unique_prompts['prompt_name'],unique_prompts['assignment']))

# Printing the prompt mapper
prompt_mapper

{'Phones and driving': 'Today the majority of humans own and operate cell phones on a daily basis. In essay form, explain if drivers should or should not be able to use cell phones in any capacity while operating a vehicle.',
 'Car-free cities': 'Write an explanatory essay to inform fellow citizens about the advantages of limiting car usage. Your essay must be based on ideas and information that can be found in the passage set. Manage your time carefully so that you can read the passages; plan your response; write your response; and revise and edit your response. Be sure to use evidence from multiple sources; and avoid overly relying on one source. Your response should be in the form of a multiparagraph essay. Write your essay in the space provided.',
 'Summer projects': 'Some schools require students to complete summer projects to assure they continue learning during their break. Should these summer projects be teacher-designed or student-designed? Take a position on this question. Su

In [17]:
# Using the mapping to add the prompt to the combined dataset
combined['prompt'] = combined['prompt_name'].map(prompt_mapper)
combined.head(5)

Unnamed: 0,text,generated,prompt_name,model,word_count,prompt
0,One way we can make a positive change is by li...,1,Car-free cities,llama_70b,365,Write an explanatory essay to inform fellow ci...
1,The experimental district of Vauban in Germany...,1,Car-free cities,llama_70b,486,Write an explanatory essay to inform fellow ci...
2,"The successful ""Day Without Cars"" event in Bog...",1,Car-free cities,llama_70b,416,Write an explanatory essay to inform fellow ci...
3,"The exhaust from cars pollutes our air, and th...",1,Car-free cities,llama_70b,367,Write an explanatory essay to inform fellow ci...
4,"Recently, Paris faced a severe pollution probl...",1,Car-free cities,llama_70b,466,Write an explanatory essay to inform fellow ci...
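
`Series.map` with a dictionary leaves `NaN` wherever a key is missing, so it may be worth checking that every prompt name found a match. The sketch below uses toy data; the name `Unknown topic` and the placeholder prompt text are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: "Unknown topic" is deliberately missing from the mapper.
prompt_mapper = {"Car-free cities": "placeholder prompt text"}
toy = pd.DataFrame({"prompt_name": ["Car-free cities", "Unknown topic"]})

# Keys that are absent from the dictionary map to NaN.
toy["prompt"] = toy["prompt_name"].map(prompt_mapper)

# Collecting the prompt names that did not receive a prompt.
unmapped = sorted(toy.loc[toy["prompt"].isna(), "prompt_name"].unique())
print(unmapped)  # prints: ['Unknown topic']
```

An empty list here would mean every row has a prompt. A non-empty list points at names whose spelling or quoting differs between the two sources.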


In [18]:
# Renaming and dropping columns 
combined.drop(['prompt_name','model'],axis=1,inplace=True)
combined.rename(columns={'text':'essay','generated':'LLM_written'},inplace=True)
combined.head(5)

Unnamed: 0,essay,LLM_written,word_count,prompt
0,One way we can make a positive change is by li...,1,365,Write an explanatory essay to inform fellow ci...
1,The experimental district of Vauban in Germany...,1,486,Write an explanatory essay to inform fellow ci...
2,"The successful ""Day Without Cars"" event in Bog...",1,416,Write an explanatory essay to inform fellow ci...
3,"The exhaust from cars pollutes our air, and th...",1,367,Write an explanatory essay to inform fellow ci...
4,"Recently, Paris faced a severe pollution probl...",1,466,Write an explanatory essay to inform fellow ci...


In [19]:
# Concatenating combined with the llama and falcon data
fully_concatenated_data = pd.concat([llama_and_falcon,combined])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
6995,"Driverless cars, also known as autonomous cars...","In the article “Driverless Cars are Coming,” t...",1,397
6996,Driverless Cars: A Necessity for Our Future\n\...,"In the article “Driverless Cars are Coming,” t...",1,405
6997,The Pros and Cons of Driverless Cars\n\nThe wo...,"In the article “Driverless Cars are Coming,” t...",1,661
6998,The development of driverless cars has been a ...,"In the article “Driverless Cars are Coming,” t...",1,596
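
Note that `pd.concat` keeps each input's own index labels unless told otherwise, which is why the tail above restarts at the numbering of `combined`. A small sketch with toy frames shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

a = pd.DataFrame({"essay": ["first", "second"]})
b = pd.DataFrame({"essay": ["third"]})

# Default behaviour: the labels of b are kept, so label 0 appears twice.
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())     # prints: [0, 1, 0]
print(renumbered.index.tolist())  # prints: [0, 1, 2]
```

Duplicate labels are harmless if the index is dropped when the rows are written to the table, but they make label-based lookups such as `.loc[0]` return several rows.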


In [20]:
# Working with the persuade corpus
persuade_corpus

Unnamed: 0,full_text,word_count,prompt_name,assignment
0,Phones\n\nModern humans today are always on th...,378,Phones and driving,Today the majority of humans own and operate c...
1,This essay will explain if drivers should or s...,432,Phones and driving,Today the majority of humans own and operate c...
2,Driving while the use of cellular devices\n\nT...,179,Phones and driving,Today the majority of humans own and operate c...
3,Phones & Driving\n\nDrivers should not be able...,221,Phones and driving,Today the majority of humans own and operate c...
4,Cell Phone Operation While Driving\n\nThe abil...,334,Phones and driving,Today the majority of humans own and operate c...
...,...,...,...,...
25991,80% of Americans believe seeking multiple opin...,1050,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."
25992,"When people ask for advice,they sometimes talk...",373,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."
25993,"During a group project, have you ever asked a ...",631,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."
25994,Making choices in life can be very difficult. ...,417,Seeking multiple opinions,"When people ask for advice, they sometimes tal..."


In [21]:
# Adding the LLM_written column (0 = written by a human)
persuade_corpus['LLM_written'] = 0

# Dropping prompt name
persuade_corpus.drop(['prompt_name'],axis=1,inplace=True)

# Renaming columns
persuade_corpus.rename(columns={'full_text':'essay','assignment':'prompt'},inplace=True)
persuade_corpus.head(5)

Unnamed: 0,essay,word_count,prompt,LLM_written
0,Phones\n\nModern humans today are always on th...,378,Today the majority of humans own and operate c...,0
1,This essay will explain if drivers should or s...,432,Today the majority of humans own and operate c...,0
2,Driving while the use of cellular devices\n\nT...,179,Today the majority of humans own and operate c...,0
3,Phones & Driving\n\nDrivers should not be able...,221,Today the majority of humans own and operate c...,0
4,Cell Phone Operation While Driving\n\nThe abil...,334,Today the majority of humans own and operate c...,0


In [22]:
# Concatenating the persuade corpus with the data collected so far
fully_concatenated_data = pd.concat([fully_concatenated_data,persuade_corpus])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
25991,80% of Americans believe seeking multiple opin...,"When people ask for advice, they sometimes tal...",0,1050
25992,"When people ask for advice,they sometimes talk...","When people ask for advice, they sometimes tal...",0,373
25993,"During a group project, have you ever asked a ...","When people ask for advice, they sometimes tal...",0,631
25994,Making choices in life can be very difficult. ...,"When people ask for advice, they sometimes tal...",0,417


### 1000 essays from Anthropic

In [23]:
anthropic = pd.read_csv('../data/persuade15_claude_instant1.csv')
anthropic.head(5)

Unnamed: 0,prompt_id,essay_title,essay_text
0,14,Some schools offer distance learning as an opt...,While distance learning offers certain benefit...
1,11,"In the article “Driverless Cars are Coming,” t...",The Development of Driverless Cars\n\nWhile dr...
2,8,You have read the article 'Unmasking the Face ...,"While the mysterious formation known as the ""F..."
3,6,"In ""The Challenge of Exploring Venus,"" the aut...",Studying Venus Remains a Worthy Pursuit\n\nWhi...
4,11,"In the article “Driverless Cars are Coming,” t...",Driverless Cars: An Argument in Favor\n\nThe d...


In [24]:
# Getting the word counts
anthropic['word_count'] = anthropic['essay_text'].apply(get_word_count)
anthropic.head(5)

Unnamed: 0,prompt_id,essay_title,essay_text,word_count
0,14,Some schools offer distance learning as an opt...,While distance learning offers certain benefit...,307
1,11,"In the article “Driverless Cars are Coming,” t...",The Development of Driverless Cars\n\nWhile dr...,356
2,8,You have read the article 'Unmasking the Face ...,"While the mysterious formation known as the ""F...",318
3,6,"In ""The Challenge of Exploring Venus,"" the aut...",Studying Venus Remains a Worthy Pursuit\n\nWhi...,303
4,11,"In the article “Driverless Cars are Coming,” t...",Driverless Cars: An Argument in Favor\n\nThe d...,322


In [25]:
# Dropping columns, adding columns, and renaming columns
anthropic.drop(['prompt_id'],axis=1,inplace=True)
anthropic['LLM_written'] = 1
anthropic.rename(columns={'essay_title':'prompt','essay_text':'essay'},inplace=True)
anthropic.head(5)

Unnamed: 0,prompt,essay,word_count,LLM_written
0,Some schools offer distance learning as an opt...,While distance learning offers certain benefit...,307,1
1,"In the article “Driverless Cars are Coming,” t...",The Development of Driverless Cars\n\nWhile dr...,356,1
2,You have read the article 'Unmasking the Face ...,"While the mysterious formation known as the ""F...",318,1
3,"In ""The Challenge of Exploring Venus,"" the aut...",Studying Venus Remains a Worthy Pursuit\n\nWhi...,303,1
4,"In the article “Driverless Cars are Coming,” t...",Driverless Cars: An Argument in Favor\n\nThe d...,322,1


In [26]:
# Adding anthropic to the final dataset
fully_concatenated_data = pd.concat([fully_concatenated_data,anthropic])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
995,Limiting car usage has many benefits for moder...,Write an explanatory essay to inform fellow ci...,1,349
996,The Rise of Driverless Cars\n\nThe development...,"In the article “Driverless Cars are Coming,” t...",1,297
997,Schools should allow students to design their ...,Some schools require students to complete summ...,1,303
998,The Open Sea Beckons\n\nThe Seagoing Cowboys p...,"You have just read the article, 'A Cowboy Who ...",1,357


### ArguGPT

In [27]:
# Importing the data
argugpt = pd.read_csv('../data/ArguGPT/argugpt.csv')
argugpt.head(5)

Unnamed: 0,id,prompt_id,prompt,text,model,temperature,exam_type,score,score_level
0,weccl_30,WECCL-17,Some people think the university education is ...,There are many people who think that universit...,text-babbage-001,0.5,weccl,19,high
1,weccl_51,WECCL-17,Some people think the university education is ...,There are a number of reasons why people might...,text-babbage-001,0.65,weccl,13,medium
2,weccl_48,WECCL-17,Some people think the university education is ...,There are many reasons why university educatio...,text-babbage-001,0.65,weccl,13,medium
3,weccl_50,WECCL-17,Some people think the university education is ...,There are many people who think that universit...,text-babbage-001,0.65,weccl,12,low
4,weccl_55,WECCL-17,Some people think the university education is ...,There is a general consensus that university e...,text-babbage-001,0.8,weccl,13,medium


In [28]:
# Getting the columns that I need (copying to avoid a SettingWithCopyWarning)
argugpt = argugpt[['prompt','text']].copy()

# Adding the word_count and LLM_written columns
argugpt['word_count'] = argugpt['text'].apply(get_word_count)
argugpt['LLM_written'] = 1

# Renaming the text column
argugpt.rename(columns={'text':'essay'},inplace=True)
argugpt.head()

Unnamed: 0,prompt,essay,word_count,LLM_written
0,Some people think the university education is ...,There are many people who think that universit...,324,1
1,Some people think the university education is ...,There are a number of reasons why people might...,241,1
2,Some people think the university education is ...,There are many reasons why university educatio...,114,1
3,Some people think the university education is ...,There are many people who think that universit...,174,1
4,Some people think the university education is ...,There is a general consensus that university e...,111,1


In [29]:
# Concatenating this data
fully_concatenated_data = pd.concat([fully_concatenated_data,argugpt])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
4033,The notion that one must be forced to defend a...,Only by being forced to defend an idea against...,1,568
4034,I strongly agree with the statement that menta...,Students should be encouraged to realize that ...,1,352
4035,"In today’s world, where competition is highly ...",The best preparation for life or a career is n...,1,540
4036,Education is one of the most powerful tools th...,AII nations should help support the developmen...,1,428


### DAIGT | External Dataset

In [30]:
# Importing the data
daigt = pd.read_csv('../data/daigt_external_dataset.csv')
daigt.head(5)

Unnamed: 0,id,text,instructions,source_text
0,6060D28C05B6,Some schools in United States ofter classes fr...,\nTask: Write a persuasive essay on whether or...,\nWhen considering the pros and cons of attend...
1,60623DB5DE7A,"Four-day work week, a remarkable idea to conse...",\nTask: Research the advantages and disadvanta...,\nOne of the primary arguments for implementin...
2,607A39D981DE,Students and their families should consider an...,\nTask: \n\n1. Talk to your parents before tak...,\nBefore making any decisions about getting in...
3,60ACDFA1609E,Agree you will never grow if something beyond ...,\nTask: Write an essay discussing the benefits...,"\nRalph Waldo Emerson once said, ""Go confident..."
4,60AE13D3F07B,I think our character traits are formed by inf...,\nTask: Research and discuss how character tra...,\nHuman character traits are shaped by a wide ...


In [31]:
# In the DAIGT data, there are essays written by students (text column) and essays generated by ChatGPT (source_text column)
# Splitting up the data into student and generated
student_essays = daigt[['text','instructions']].copy()
generated_essays = daigt[['source_text','instructions']].copy()

In [32]:
# Adding columns, renaming columns
student_essays.rename(columns={'text':'essay','instructions':'prompt'},inplace=True)
generated_essays.rename(columns={'source_text':'essay','instructions':'prompt'},inplace=True)

student_essays['word_count'] = student_essays['essay'].apply(get_word_count)
generated_essays['word_count'] = generated_essays['essay'].apply(get_word_count)

student_essays['LLM_written'] = 0
generated_essays['LLM_written'] = 1

In [33]:
# Concatenating
fully_concatenated_data = pd.concat([fully_concatenated_data,student_essays,generated_essays])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
2416,\nBecoming a surgeon requires a great deal of ...,\nTask: Research different kinds of medical pr...,1,114
2417,\nSchools should offer an after school homewor...,\nTask: Write an essay discussing why schools ...,1,297
2418,\nIt’s human nature to be afraid to make mista...,\nTask: Write an essay about how having a few ...,1,193
2419,\nOne of the main debates of 2020 for many stu...,\nTask: \n\nWrite an essay exploring the pros ...,1,245


### LLM - Detect AI Generated Text

In [34]:
# Getting the data and the prompt mapper
data = pd.read_csv('../data/llm-detect-ai-generated-text/train_essays.csv')
prompts = pd.read_csv('../data/llm-detect-ai-generated-text/train_prompts.csv')
data.head()

Unnamed: 0,id,prompt_id,text,generated
0,0059830c,0,Cars. Cars have been around since they became ...,0
1,005db917,0,Transportation is a large necessity in most co...,0
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0
3,00940276,0,How often do you ride in a car? Do you drive a...,0
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0


In [35]:
prompts.head()

Unnamed: 0,prompt_id,prompt_name,instructions,source_text
0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,1,Does the electoral college work?,Write a letter to your state senator in which ...,# What Is the Electoral College? by the Office...


In [36]:
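# Joining the 2 datasets
# Note: join matches data['prompt_id'] against the index of prompts, which works here
# because the prompts index (0, 1) is the same as its prompt_id column
joined_data = data.join(other=prompts,on='prompt_id',rsuffix='_prompt')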
joined_data.head()

Unnamed: 0,id,prompt_id,text,generated,prompt_id_prompt,prompt_name,instructions,source_text
0,0059830c,0,Cars. Cars have been around since they became ...,0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,005db917,0,Transportation is a large necessity in most co...,0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
2,008f63e3,0,"""America's love affair with it's vehicles seem...",0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
3,00940276,0,How often do you ride in a car? Do you drive a...,0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
4,00c39458,0,Cars are a wonderful thing. They are perhaps o...,0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
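
`DataFrame.join` with `on=` compares the left column against the index of the other frame, so the result above depends on the prompts index matching `prompt_id`. As an alternative sketch (not what the notebook does), `merge` matches on the column itself; the toy frames below are made up for illustration.

```python
import pandas as pd

essays = pd.DataFrame({"prompt_id": [0, 1, 0], "text": ["a", "b", "c"]})
prompts = pd.DataFrame({
    "prompt_id": [0, 1],
    "instructions": ["first prompt", "second prompt"],
})

# Matching on the prompt_id column of both frames. validate="many_to_one"
# raises an error if a prompt_id appears more than once in prompts.
merged = essays.merge(prompts, on="prompt_id", how="left", validate="many_to_one")

print(merged["instructions"].tolist())
# prints: ['first prompt', 'second prompt', 'first prompt']
```

With `merge`, the key column appears only once in the result, so there is no `prompt_id_prompt` column to drop afterwards.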


In [37]:
# Dropping columns and renaming columns
joined_data.drop(['prompt_id','prompt_id_prompt','prompt_name','source_text','id'],axis=1,inplace=True)
joined_data.rename(columns={'text':'essay','instructions':'prompt','generated':'LLM_written'},inplace=True)
joined_data['word_count'] = joined_data['essay'].apply(get_word_count)
joined_data.head()

Unnamed: 0,essay,LLM_written,prompt,word_count
0,Cars. Cars have been around since they became ...,0,Write an explanatory essay to inform fellow ci...,584
1,Transportation is a large necessity in most co...,0,Write an explanatory essay to inform fellow ci...,459
2,"""America's love affair with it's vehicles seem...",0,Write an explanatory essay to inform fellow ci...,750
3,How often do you ride in a car? Do you drive a...,0,Write an explanatory essay to inform fellow ci...,698
4,Cars are a wonderful thing. They are perhaps o...,0,Write an explanatory essay to inform fellow ci...,863


In [38]:
# Concatenating the data
fully_concatenated_data = pd.concat([fully_concatenated_data,joined_data])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1,307
...,...,...,...,...
1373,There has been a fuss about the Elector Colleg...,Write a letter to your state senator in which ...,0,428
1374,Limiting car usage has many advantages. Such a...,Write an explanatory essay to inform fellow ci...,0,398
1375,There's a new trend that has been developing f...,Write an explanatory essay to inform fellow ci...,0,745
1376,As we all know cars are a big part of our soci...,Write an explanatory essay to inform fellow ci...,0,524
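Note that the row labels restart for each source, because `pd.concat` keeps every frame's own index by default. The `to_sql` call later passes `index=False`, so these labels never reach the database, but if unique labels are wanted inside the notebook, `ignore_index=True` is one option. A small sketch with toy frames (the frame names here are made up for illustration):

```python
import pandas as pd

# Toy frames standing in for fully_concatenated_data and joined_data
first = pd.DataFrame({"essay": ["a", "b"], "LLM_written": [1, 1]})
second = pd.DataFrame({"essay": ["c", "d"], "LLM_written": [0, 0]})

# Default behaviour: each frame keeps its own labels, so 0 and 1 appear twice
kept = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([first, second], ignore_index=True)
```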


### LLM-generated essay using PaLM from Google Gen-AI

In [39]:
# Loading the PaLM-generated essays
palm = pd.read_csv('../data/LLM_generated_essay_PaLM.csv')
palm.head()

Unnamed: 0,id,prompt_id,text,generated
0,0,0.0,## The Advantages of Limiting Car Usage\n\nIn ...,1.0
1,1,0.0,"The United States is a car-dependent nation, w...",1.0
2,2,0.0,"In recent years, there has been a growing move...",1.0
3,3,0.0,"In recent years, there has been a growing move...",1.0
4,4,0.0,"In the past few decades, the United States has...",1.0


In [40]:
# Joining each essay with its prompt (prompt_id is matched against the index of prompts)
palm = palm.join(prompts,on='prompt_id',rsuffix="_prompt")
palm.head()

Unnamed: 0,id,prompt_id,text,generated,prompt_id_prompt,prompt_name,instructions,source_text
0,0,0.0,## The Advantages of Limiting Car Usage\n\nIn ...,1.0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
1,1,0.0,"The United States is a car-dependent nation, w...",1.0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
2,2,0.0,"In recent years, there has been a growing move...",1.0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
3,3,0.0,"In recent years, there has been a growing move...",1.0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
4,4,0.0,"In the past few decades, the United States has...",1.0,0,Car-free cities,Write an explanatory essay to inform fellow ci...,"# In German Suburb, Life Goes On Without Cars ..."
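The `prompt_id_prompt` column in this output comes from how `DataFrame.join` works: with `on='prompt_id'`, the column in the left frame is matched against the index of the right frame, and any column name that exists on both sides gets the `rsuffix` appended on the right. A toy sketch of that behaviour (the frames and their contents are invented, not the real data):

```python
import pandas as pd

# Toy essays frame with a prompt_id column
essays = pd.DataFrame({"prompt_id": [0, 1, 0], "text": ["x", "y", "z"]})

# Toy prompts frame: default index 0, 1 plus its own prompt_id column
prompts_toy = pd.DataFrame(
    {"prompt_id": [0, 1], "instructions": ["Explain cars", "Write a letter"]}
)

# essays.prompt_id is matched against prompts_toy's index;
# the clashing right-hand column is renamed to prompt_id_prompt
joined = essays.join(prompts_toy, on="prompt_id", rsuffix="_prompt")
```

Because the match is against the index, this only lines up correctly when the index of the prompts frame agrees with its `prompt_id` values, which appears to be the case here.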


In [41]:
# Dropping columns and renaming columns
palm.drop(['id','prompt_id','prompt_id_prompt','prompt_name','source_text'],axis=1,inplace=True)
palm.rename(columns={'text':'essay','generated':'LLM_written','instructions':'prompt'},inplace=True)
palm['word_count'] = palm['essay'].apply(get_word_count)
palm.head()

Unnamed: 0,essay,LLM_written,prompt,word_count
0,## The Advantages of Limiting Car Usage\n\nIn ...,1.0,Write an explanatory essay to inform fellow ci...,483
1,"The United States is a car-dependent nation, w...",1.0,Write an explanatory essay to inform fellow ci...,443
2,"In recent years, there has been a growing move...",1.0,Write an explanatory essay to inform fellow ci...,402
3,"In recent years, there has been a growing move...",1.0,Write an explanatory essay to inform fellow ci...,448
4,"In the past few decades, the United States has...",1.0,Write an explanatory essay to inform fellow ci...,367


In [42]:
# Concatenating
fully_concatenated_data = pd.concat([fully_concatenated_data,palm])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1.0,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1.0,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1.0,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1.0,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1.0,307
...,...,...,...,...
1379,"Dear Senator,\n\nI am writing to you today to ...",Write a letter to your state senator in which ...,1.0,374
1380,"Dear Senator,\n\nI am writing to you today to ...",Write a letter to your state senator in which ...,1.0,351
1381,"Dear Senator,\n\nI am writing to you today to ...",Write a letter to your state senator in which ...,1.0,274
1382,"Dear Senator,\n\nI am writing to you today to ...",Write a letter to your state senator in which ...,1.0,253
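After this concatenation `LLM_written` shows as `1.0` instead of `1`. The PaLM file stores `generated` as a float, so pandas upcasts the whole combined column. If integer labels are preferred, the column can be cast back. The sketch below uses toy frames and assumes the column has no missing values (a plain `astype(int)` would fail on NaN):

```python
import pandas as pd

# Integer labels from the earlier sources, float labels from the PaLM file
ints = pd.DataFrame({"LLM_written": [1, 0]})
floats = pd.DataFrame({"LLM_written": [1.0, 1.0]})

combined = pd.concat([ints, floats], ignore_index=True)
dtype_before = combined["LLM_written"].dtype  # upcast to float

# Cast back to integers once all sources are combined
combined["LLM_written"] = combined["LLM_written"].astype(int)
```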


### essays-with-instructions

In [45]:
# Loading the essays-with-instructions parquet file
data = pd.read_parquet('../data/essays-with-titles&instructions.parquet')
data.head()

Unnamed: 0,instructions,titles,essays,urls
0,Write the original essay that generated the fo...,The essay discusses the issue of whether or n...,2-6-Year-Olds’ Criminal Actions Irresponsibili...,https://ivypanda.com/essays/2-6-year-olds-crim...
1,Provide the inputted essay that when summarize...,\n\nThe organization's learning and developmen...,5Ways Foodservices: Staff Learning and Develop...,https://ivypanda.com/essays/5ways-foodservices...
0,Write the original essay for the following sum...,The author of this essay discusses how busine...,10 Steps of Getting Started With Social Media ...,https://ivypanda.com/essays/10-steps-of-gettin...
1,Provide the full text for the following summar...,"\n\n""A Darkling Plain"" a Book by Kristen Monro...",“A Darkling Plain” a Book by Kristen Monroe Es...,https://ivypanda.com/essays/a-darkling-plain-a...
2,Provide the inputted essay that when summarize...,A little electronic magic at Alibaba.com Case...,A Little Electronic Magic at Alibaba.com Case ...,https://ivypanda.com/essays/a-little-electroni...


In [46]:
# Column adjustment
data.rename(columns={'instructions':'prompt','essays':'essay'},inplace=True)
data.drop(['titles','urls'],axis=1,inplace=True)
data['word_count'] = data['essay'].apply(get_word_count)
# Labelling every essay from this source as not LLM-written
data['LLM_written'] = 0
data.head()

Unnamed: 0,prompt,essay,word_count,LLM_written
0,Write the original essay that generated the fo...,2-6-Year-Olds’ Criminal Actions Irresponsibili...,662,0
1,Provide the inputted essay that when summarize...,5Ways Foodservices: Staff Learning and Develop...,2123,0
0,Write the original essay for the following sum...,10 Steps of Getting Started With Social Media ...,925,0
1,Provide the full text for the following summar...,“A Darkling Plain” a Book by Kristen Monroe Es...,1157,0
2,Provide the inputted essay that when summarize...,A Little Electronic Magic at Alibaba.com Case ...,909,0


In [47]:
# Concatenating
fully_concatenated_data = pd.concat([fully_concatenated_data,data])
fully_concatenated_data

Unnamed: 0,essay,prompt,LLM_written,word_count
0,"Dear State Senator,\n\nI'm writting to you tod...",Write a letter to your state senator in which ...,1.0,291
1,"Uh, hi! So, like, summers are, like, awesome r...",Some schools require students to complete summ...,1.0,311
2,"When peoples ask for advices, they sometimes t...","When people ask for advice, they sometimes tal...",1.0,333
3,I think art edukation is super impotent for ki...,Many people believe that arts education is ess...,1.0,308
4,I think we should totally switch to renewable ...,"In recent years, there has been a push towards...",1.0,307
...,...,...,...,...
24,Selected Works of Lu Hsun Research Paper\n\nFi...,Provide the full text for the following summar...,0.0,2631
25,Self-Understanding Role in Organizational Beha...,Create the inputted essay that provided the fo...,0.0,1061
26,“Sequoia Gardens” by Ernest Finney Literature ...,Write the full essay for the following summary...,0.0,1134
27,Service Marketing: Food Market Essay\n\nTable ...,Convert the following summary back into the or...,0.0,555


### Putting the Fully Concatenated DataFrame into the SQL Table
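
The connection string in the cell that follows interpolates the credentials straight into the URL. That works as long as the user name and password contain no characters that are special in a URL (such as `@`, `/` or `:`). If they might, percent-encoding them first with the standard library is a common precaution. A sketch, with made-up credentials and a hypothetical helper name `build_connector_string`:

```python
from urllib.parse import quote_plus

# Hypothetical credentials; the real values come from credentials.py
example_credentials = {"user": "app_user", "password": "p@ss/word:1", "host": "localhost"}


def build_connector_string(creds, database="AuthenticAI"):
    """Build a mysql+mysqlconnector URL with percent-encoded user and password."""
    user = quote_plus(creds["user"])
    password = quote_plus(creds["password"])
    return f'mysql+mysqlconnector://{user}:{password}@{creds["host"]}/{database}'


safe_string = build_connector_string(example_credentials)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as `connector_string`.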

In [100]:
# Connecting to the database
connector_string = f'mysql+mysqlconnector://{credentials["user"]}:{credentials["password"]}@{credentials["host"]}/AuthenticAI'
# echo=True makes SQLAlchemy log every statement it issues (see the output below)
engine = sqlalchemy.create_engine(connector_string,echo=True)

# Opening a connection
with engine.connect() as db_connection:
    # Pushing the dataframe to the SQL table
    fully_concatenated_data.to_sql('essays',db_connection,chunksize=10000,if_exists='append',index=False)

2024-01-04 16:30:43,464 INFO sqlalchemy.engine.Engine SELECT DATABASE()
2024-01-04 16:30:43,469 INFO sqlalchemy.engine.Engine [raw sql] {}
2024-01-04 16:30:43,498 INFO sqlalchemy.engine.Engine SELECT @@sql_mode
2024-01-04 16:30:43,515 INFO sqlalchemy.engine.Engine [raw sql] {}
2024-01-04 16:30:43,527 INFO sqlalchemy.engine.Engine SELECT @@lower_case_table_names
2024-01-04 16:30:43,536 INFO sqlalchemy.engine.Engine [raw sql] {}
2024-01-04 16:30:43,545 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-01-04 16:30:43,560 INFO sqlalchemy.engine.Engine DESCRIBE `authenticai`.`essays`
2024-01-04 16:30:43,564 INFO sqlalchemy.engine.Engine [raw sql] {}


2024-01-04 16:30:43,782 INFO sqlalchemy.engine.Engine INSERT INTO essays (essay, prompt, `LLM_written`, word_count) VALUES (%(essay)s, %(prompt)s, %(LLM_written)s, %(word_count)s)
2024-01-04 16:30:43,783 INFO sqlalchemy.engine.Engine [generated in 0.12136s] [{'essay': "Dear State Senator,\n\nI'm writting to you today to tell you that we should keep the Electoral College. I know some people say it's unfair but I thin ... (1271 characters truncated) ... them.\n\nSo, I think we should keep the Electoral College. It's a good system that helps make sure the president is fair to everyone.\n\nSincerely,\n", 'prompt': 'Write a letter to your state senator in which you argue in favor of keeping the Electoral College or changing to election by popular vote for the pre ... (300 characters truncated) ... es; and avoid overly relying on one source. Your response should be in the form of a multiparagraph essay. Write your response in the space provided.', 'LLM_written': 1.0, 'word_count': 291}, {'es