# Exploratory Data Analysis 2

In the first notebook, I looked at overall differences between the essays from each class. In this notebook, I dive deeper and ask questions about the individual words and the language structure of each essay.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import sqlalchemy
from tqdm import tqdm
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import nltk
from torchtext.data import get_tokenizer
from torchtext.data.utils import ngrams_iterator
import spacy
from transformers import pipeline, AutoTokenizer
from sklearn.model_selection import train_test_split

# Adding the credentials
sys.path.append('../')
from credentials import credentials

# Making pandas tqdm
tqdm.pandas()

In [2]:
# Creating the database engine 
connector_string = f'mysql+mysqlconnector://{credentials["user"]}:{credentials["password"]}@{credentials["host"]}/AuthenticAI'
db_engine = sqlalchemy.create_engine(connector_string,echo=True)

# Connecting to the database
db_conn = db_engine.connect()

TypeError: 'Credentials' object is not subscriptable

## How many unique words per essay?

According to some papers, human text tends to have more unique words than LLM generated text. Furthermore, the ratio of unique words to total words tends to be higher. Researchers attribute this result to LLMs being more like mimickers of human text: they are very rule based and not "conscious". In this section, I look at how many unique words each essay has and what the ratio of unique words to total words is for each essay.

In [None]:
# Getting the tokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')

In [None]:
# Making a function to get the unique words
def get_unique_words(text:str) -> int:
    # Tokenize the text
    tokenized = set(tokenizer.tokenize(text))
    return len(tokenized)

In [None]:
# Getting the data 
data = pd.DataFrame([row for row in db_conn.execute(sqlalchemy.text('select * from essays;'))])
data

In [None]:
# Applying the unique word count function
data['unique_word_count'] = data['essay'].progress_apply(get_unique_words)

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(data,x='unique_word_count',hue='LLM_written')
plt.title('Box Plot of Unique Word Counts for Each Class')
plt.show()

This box plot shows that the LLM written essays have few outliers in the number of unique words, whereas the student essays have many.

In [None]:
# Looking at the describe statistics 
print('Student Unique Words')
print(data[data['LLM_written'] == 0]['unique_word_count'].describe())
print()
print('LLM Unique Words')
print(data[data['LLM_written'] == 1]['unique_word_count'].describe())

From the descriptive statistics, I see that the student essays tend to have a higher average number of unique words. However, this value may be skewed by the many outliers. Specifically, one essay has as many as 2708 unique words, which is much higher than the LLM essays. The LLM essays, on the other hand, have a much lower number of unique words, as evidenced by the mean and standard deviation. Furthermore, the maximum number of unique words for an LLM essay is lower. However, is this because student essays tend to be longer? Obviously, a longer essay has more opportunities to include unique words. To close off this analysis, I need to look at the ratio of unique words to total words for each essay.

In [None]:
data['unique_to_total'] = data['unique_word_count'] / data['word_count']

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(data,x='unique_to_total',hue='LLM_written')
plt.title('Box Plot of Unique Word Counts to Total Words Ratio for Each Class')
plt.show()

In [None]:
# Looking at the describe statistics 
print('Student Unique Words to Total')
print(data[data['LLM_written'] == 0]['unique_to_total'].describe())
print()
print('LLM Unique Word to Total')
print(data[data['LLM_written'] == 1]['unique_to_total'].describe())

From this analysis, it seems that the first finding (student essays have more unique words than LLM essays) was flawed. The box plot shows that LLM written essays tend to have more unique words relative to their total number of words, whereas student written essays tend to have fewer. However, one flaw with this comparison is that student essays tend to be longer. Thus, for a long essay to reach a high unique word count / total word count, its unique word count needs to be much higher.

Both experiments produce interesting findings, but both share the same flaw: essay length. To mitigate it, I will look only at essays of at most 400 words (400 is the median word count for student essays).
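
To make the length flaw concrete, here is a minimal sketch on toy text. It assumes the same letters-only tokenization as the `RegexpTokenizer` pattern above, implemented with `re` so it runs on its own; `unique_stats` and the two toy essays are hypothetical names used only for illustration, and the total here is counted from the tokens rather than read from the `word_count` column.

```python
import re

def unique_stats(text: str) -> tuple[int, int, float]:
    """Return (unique words, total words, unique / total) for one text."""
    # Same pattern as the notebook's tokenizer: runs of ASCII letters
    words = re.findall(r'[a-zA-Z]+', text)
    unique = len(set(words))
    total = len(words)
    ratio = (unique / total) if total else 0.0
    return unique, total, ratio

# Hypothetical toy essays: identical vocabulary, different lengths
short_essay = "the cat sat on the mat"
long_essay = " ".join([short_essay] * 3)

print(unique_stats(short_essay))
print(unique_stats(long_essay))
```

The unique word count is the same for both texts, but the ratio drops for the longer one simply because the denominator grows. This is why comparing essays of similar length is a reasonable control.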

In [None]:
smaller_word_count = data[data['word_count'] <= 400]
smaller_word_count['LLM_written'].value_counts()

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(smaller_word_count,x='unique_word_count',hue='LLM_written')
plt.title('Box Plot of Unique Word Counts for Each Class for Essays <= 400 words')
plt.show()

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(smaller_word_count,x='unique_to_total',hue='LLM_written')
plt.title('Box Plot of Unique Word Counts to Total Word Counts for Each Class for Essays <= 400 words')
plt.show()

Analysis:

If I cap the essay word count at 400, I am left with about 15000 student essays and 12000 LLM written essays. I chose 400 since that is the median word count of the student essays. When I perform the same analysis on this subset, LLM written essays tend to have more unique words and a higher unique word ratio. The first experiment showed the opposite for raw unique word counts, which can be attributed to the fact that student written essays are much longer than LLM written ones. Hence, there is more room to add words.

TLDR: A high unique word count combined with a high unique word count / total_word_count indicates that the essay might be written by an LLM. I can utilize these 2 findings as features.

## How many stop words per essay? Is there a difference between an LLM and a student written essay?

Stop words are defined as "filler" words such as the, a, I. These words don't provide much context to a piece of text. In this experiment, I see if there is a difference in the number of stop words between an LLM and a student written essay. I also look at the ratio of stop words to total words for each essay to determine how much of a factor total_word_count is.
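
As a small illustration of the two quantities, here is a sketch on a toy sentence. `TOY_STOP_WORDS` is a hypothetical, hand-picked subset used only so the example runs without downloading anything; the notebook itself uses NLTK's English list. Lowercasing each token before the lookup is an assumption of this sketch, made because stop word lists are typically lowercase.

```python
import re

# Hypothetical tiny stop word list for illustration only
TOY_STOP_WORDS = {"the", "a", "i", "to", "is", "and", "of"}

def stop_word_stats(text: str, stop_words: set[str]) -> tuple[int, float]:
    """Return (stop word count, stop words / total words) for one text."""
    words = re.findall(r'[a-zA-Z]+', text)
    # Lowercase so that a sentence-initial "The" still matches "the"
    count = sum(1 for word in words if word.lower() in stop_words)
    ratio = (count / len(words)) if words else 0.0
    return count, ratio

print(stop_word_stats("The dog is a friend of the family", TOY_STOP_WORDS))
```

Using a set rather than a list keeps each membership check fast, which matters when the function is applied to every essay.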

In [None]:
# Getting the list of stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set gives fast membership checks
print(stop_words)

In [None]:
# Making a function to count the stop words for each essay
def stop_word_count(text:str) -> int:
    # Tokenize the text
    tokenized = tokenizer.tokenize(text)
    # Lowercase each token before the lookup since the NLTK stop word list is all lowercase
    return sum(1 for word in tokenized if word.lower() in stop_words)

In [None]:
# Getting the stop word count
data['stop_word_count'] = data['essay'].progress_apply(stop_word_count)

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(data,x='stop_word_count',hue='LLM_written')
plt.title('Box Plot of Stop Words')
plt.show()

In [None]:
# Looking at the describe statistics 
print('Student Stop Words')
print(data[data['LLM_written'] == 0]['stop_word_count'].describe())
print()
print('LLM Stop Words')
print(data[data['LLM_written'] == 1]['stop_word_count'].describe())

In [None]:
# Stop word ratio
data['stop_word_ratio'] = data['stop_word_count'] / data['word_count']

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(data,x='stop_word_ratio',hue='LLM_written')
plt.title('Box Plot of Stop Word Ratio')
plt.show()

From this box plot, it is clear that student essays have a higher ratio of stop words to total words. However, there are a lot of outliers beyond the first quartile. At first glance, student essays seem to include more stop words, but I want to see how much word_count affects this, so I will perform the same experiment on essays <= 400 words.

In [None]:
smaller_word_count = data[data['word_count'] <= 400]
smaller_word_count['LLM_written'].value_counts()

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(smaller_word_count,x='stop_word_count',hue='LLM_written')
plt.title('Box Plot of Stop Words for Essays <= 400 words')
plt.show()

In [None]:
# Making a box plot
plt.figure(figsize=(15,6))
sns.boxplot(smaller_word_count,x='stop_word_ratio',hue='LLM_written')
plt.title('Box Plot of Stop Words/Total Words for Essays <= 400 words')
plt.show()

Analysis:

This experiment leads to a simple result: students tend to use more stop words than LLMs. This conclusion follows from the distribution of stop_word_count for both classes. Even when I cap word_count at 400, the median stop word count is higher in the student essays. Furthermore, the ratio of stop words to total words is higher in the student essays. This makes sense since the last experiment showed that LLMs tend to incorporate more unique words.

TLDR: Include columns for the stop word count and the ratio of stop words to total words.

## How do the counts of punctuation differ?

In this section, I want to analyze how punctuation differs between the 2 classes. Do students utilize more diverse punctuation? Is there a difference in the counts between each class?

In [None]:
# Getting the tokenizer
pytorch_tokenizer = get_tokenizer('spacy',language='en_core_web_sm')

In [None]:
# Functions for counting punctuations
# (?, !, ;, :)
def count_punc(text: str) -> tuple[int, int, int, int]:
    tokenized_text = pytorch_tokenizer(text)
    count_q = 0
    count_ex = 0
    count_semi = 0
    count_col = 0
    for token in tokenized_text:
        if token == "?":
            count_q += 1
        elif token == "!":
            count_ex += 1
        elif token == ";":
            count_semi += 1
        elif token == ":":
            count_col += 1
    
    return count_q, count_ex, count_semi, count_col

In [None]:
# Making columns and adding to the dataframe
counts = data['essay'].progress_apply(count_punc)
data['count_question'] = [row[0] for row in counts]
data['count_exclamation'] = [row[1] for row in counts]
data['count_semi'] = [row[2] for row in counts]
data['count_colon'] = [row[3] for row in counts]
data.head()

In [None]:
print('Student')
print(data[data['LLM_written'] == 0][['count_question','count_exclamation','count_semi','count_colon']].describe())
print()
print('LLM')
print(data[data['LLM_written'] == 1][['count_question','count_exclamation','count_semi','count_colon']].describe())

Across the board, students tend to use every one of these punctuation marks more. The means are higher, but the medians are the same. What is curious is that the maximum counts for these marks are much higher for students than for LLMs, which suggests that the higher means are driven by a subset of student essays that use these marks heavily. For question marks and semi-colons in particular, the mean is much higher for the student essays. The student essays also have a higher standard deviation for each mark. This means that the counts are more spread out, leading me to believe that they can get very high in comparison to the LLM essays.

TLDR: The counts of each of the punctuation marks matter, and I can use them as features. The higher the punctuation mark count, the more likely the essay is a student essay. Question mark and semi-colon counts are the strongest predictors.
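
For a quick check that does not need spaCy, the same four marks can be counted at the character level. This is a swapped-in technique, not the token-based method used in this notebook, and `count_marks` is a hypothetical helper name. The two approaches may disagree when a mark is embedded inside a larger token.

```python
from collections import Counter

# The four marks tracked in this section
PUNCTUATION_MARKS = "?!;:"

def count_marks(text: str) -> dict[str, int]:
    """Count each tracked punctuation character in the text."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION_MARKS)
    # Report every tracked mark, including those that never appear
    return {mark: counts[mark] for mark in PUNCTUATION_MARKS}

print(count_marks("Really? Yes! Wait; what: now?"))
```

Returning a dictionary keyed by mark makes it straightforward to expand the result into one column per mark.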

## What n-grams separate each essay?

In this section, I want to see which n-grams separate the two classes. Are there words and phrases that appear more in LLM essays than in student essays?

### Unigrams

I start with unigrams and work my way up to tri-grams.

In [None]:
# I tokenize each essay once; the unigram counts are then stored per class (student and llm)
tokenized_essays = data['essay'].progress_apply(pytorch_tokenizer)

In [None]:
# Iterating through each tokenized essay to get the unigrams
unigrams = {'student':{},'llm':{}}
labels = data['LLM_written'].tolist()
for index in tqdm(range(len(labels))):
    if labels[index] == 0:
        label = 'student'
    else:
        label = 'llm'
    for token in tokenized_essays[index]:
        if token in unigrams[label].keys():
            count = unigrams[label][token] + 1
            unigrams[label][token] = count
        else:
            unigrams[label][token] = 1

In [None]:
unigrams_df = pd.DataFrame.from_dict(unigrams).fillna(value=0)
unigrams_df['student_dom'] = unigrams_df['student'] - unigrams_df['llm']
unigrams_df['llm_dom'] = unigrams_df['llm'] - unigrams_df['student']
unigrams_df.sort_values(by='student_dom',ascending=False).head(20)

From this table, I didn't learn much other than the fact that the student essays tend to have more commas. All the other tokens are stop words or \n\n. There isn't really a word or set of words that differentiates the classes.

In [None]:
unigrams_df.sort_values(by='llm_dom',ascending=False).head(20)

This table doesn't show me anything I didn't already know: the LLM essays have more unique words than the student essays. But I do notice that an LLM tends to use the words "potential" and "Face" a lot more.

### Bigrams

In [None]:
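# Getting the tokenized essays by bigrams
# ngrams_iterator yields the len(essay) unigrams first and then the bigrams,
# so slicing from len(essay) onward keeps only the bigrams
tokenized_essays_bigrams = []
for essay in tqdm(tokenized_essays):
    tokenized_essays_bigrams.append(list(ngrams_iterator(essay,2))[len(essay):])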

In [None]:
# Iterating through each tokenized essay to get the bigrams
bigrams = {'student':{},'llm':{}}
labels = data['LLM_written'].tolist()
for index in tqdm(range(len(labels))):
    if labels[index] == 0:
        label = 'student'
    else:
        label = 'llm'
    for token in tokenized_essays_bigrams[index]:
        if token in bigrams[label].keys():
            count = bigrams[label][token] + 1
            bigrams[label][token] = count
        else:
            bigrams[label][token] = 1

In [None]:
bigrams_df = pd.DataFrame.from_dict(bigrams).fillna(value=0)
bigrams_df['student_dom'] = bigrams_df['student'] - bigrams_df['llm']
bigrams_df['llm_dom'] = bigrams_df['llm'] - bigrams_df['student']
bigrams_df.sort_values(by='student_dom',ascending=False).head(20)

For the bigrams, the top student entries are still just combinations of stop words like "to be" or "to do". Again, not a new finding.

In [None]:
bigrams_df.sort_values(by='llm_dom',ascending=False).head(20)

The bigrams that are more common in the LLM essays seem to carry more meaning. Things like "In conclusion" or "Additionally" pop up more. Parts of the prompts are present as well. Let's see if tri-grams are better!

### Tri-grams

In [None]:
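# Getting the tokenized essays by tri-grams
# ngrams_iterator yields len(essay) unigrams, then len(essay)-1 bigrams, then the trigrams,
# so the trigrams start at index len(essay)*2-1
tokenized_essays_trigrams = []
for essay in tqdm(tokenized_essays):
    tokenized_essays_trigrams.append(list(ngrams_iterator(essay,3))[len(essay)*2-1:])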

In [None]:
# Iterating through each tokenized essay to get the trigrams
trigrams = {'student':{},'llm':{}}
labels = data['LLM_written'].tolist()
for index in tqdm(range(len(labels))):
    if labels[index] == 0:
        label = 'student'
    else:
        label = 'llm'
    for token in tokenized_essays_trigrams[index]:
        if token in trigrams[label].keys():
            count = trigrams[label][token] + 1
            trigrams[label][token] = count
        else:
            trigrams[label][token] = 1

In [None]:
trigrams_df = pd.DataFrame.from_dict(trigrams).fillna(value=0)
trigrams_df['student_dom'] = trigrams_df['student'] - trigrams_df['llm']
trigrams_df['llm_dom'] = trigrams_df['llm'] - trigrams_df['student']
trigrams_df.sort_values(by='student_dom',ascending=False).head(20)

In [None]:
trigrams_df.sort_values(by='llm_dom',ascending=False).head(20)

Very interesting. Both the LLM essays and the student essays tend to use the first person: trigrams such as "I believe that" and "I think that" are popular in both groups. The LLM essays seem to restate prompts more and follow a classic essay structure, and "In conclusion" shows up in the most popular trigrams. I wonder if adding a prompt name feature would help. I imagine it would help detect LLM essays, since LLM essays contain key words found in the prompts, such as "the Electoral College" for the electoral college prompt. However, this method may only work for prompts found in the dataset. Unseen prompts may cause the model to go awry.

However, what is clear from this analysis is that the more I extend the n-grams (unigram to bigram to trigram), the clearer it becomes that the LLM essays have more structure and fewer stop words. The words in the LLM trigrams seem to have more "meaning" than the words in the student trigrams. If I need a deep learning approach, I think a vocabulary of unigrams, bigrams, and trigrams would be worth testing, since it could capture some of the structure present in LLM essays and missing in student ones.
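
One caveat with the dominance columns in this section is that they subtract raw counts, so whichever class has more total tokens can dominate the ranking. A possible alternative, sketched below under that assumption, is to compare relative frequencies within each class. The helper names and the toy token lists are hypothetical and are not taken from the dataset; this is not the method used in the cells above.

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count space-joined n-grams over a list of tokenized essays."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted copies of the token list yields each n-gram once
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(gram) for gram in grams)
    return counts

def relative_frequency(counts):
    """Scale counts so that they sum to 1 within a class."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}

# Hypothetical toy tokens for illustration only
student_tokens = [
    ["i", "think", "that", "cars", "are", "bad"],
    ["i", "think", "that", "is", "fine"],
]
llm_tokens = [["in", "conclusion", ",", "cars", "are", "useful"]]

student_freq = relative_frequency(ngram_counts(student_tokens, 3))
llm_freq = relative_frequency(ngram_counts(llm_tokens, 3))

# Positive values lean student, negative values lean LLM
dominance = {
    gram: student_freq.get(gram, 0.0) - llm_freq.get(gram, 0.0)
    for gram in set(student_freq) | set(llm_freq)
}
top_student = max(dominance, key=dominance.get)
print(top_student, round(dominance[top_student], 3))
```

The same two helpers work for unigrams and bigrams by changing `n`, so the three counting loops could share one code path.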

## Does emotion play a role?

Some papers have mentioned that LLMs are sometimes devoid of emotion, meaning that the texts they produce can be somewhat neutral. Furthermore, LLMs are less likely to produce texts that express negative emotions such as anger because of the guardrails placed on them. In my last experiment, I want to see if emotion differs between the essays. Do student essays have a wider variety of emotions present? To do this, I utilize the Emotion English DistilRoBERTa-base model available on Hugging Face.

I will use a random sample of 1000 examples, stratified by label, since it would take a while for the model to make all the predictions without a GPU.

In [None]:
model_tokenizer = AutoTokenizer.from_pretrained("j-hartmann/emotion-english-distilroberta-base")
def num_of_tokens(text:str) -> int:
    tokenized_text = model_tokenizer(text)['input_ids']
    return len(tokenized_text)

In [None]:
# Getting the tokenized text for each text
data['token_count'] = data['essay'].progress_apply(num_of_tokens)

In [None]:
# Selecting the examples that match the token count
valid_examples = data[data['token_count'] <= 512]

In [None]:
_, sample = train_test_split(valid_examples,test_size=1000,random_state=42,shuffle=True,stratify=valid_examples['LLM_written'])
sample['LLM_written'].value_counts()

In [None]:
# Getting the model
classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base")

In [None]:
# Making predictions
emotion_predictions = []
for essay in tqdm(sample['essay']):
    emotion_predictions.append(classifier(essay))

In [None]:
sample['emotion_pred'] = [prediction[0]['label'] for prediction in emotion_predictions]
sample.head()

In [None]:
# Converting LLM_written to categories
def llm_written_cat(label:int) -> str:
    if label == 1:
        return 'LLM'
    else:
        return 'student'
sample['LLM_written_cat'] = sample['LLM_written'].progress_apply(llm_written_cat)        

In [None]:
# Making a count plot
plt.title('Emotion Prediction Per Class')
plot = sns.countplot(sample,x='LLM_written_cat',hue='emotion_pred')
for i in plot.containers:
    plot.bar_label(i,)
plt.show()

In [None]:
# How the emotions are broken up for students
probs_given_student = sample[sample['LLM_written'] == 0]['emotion_pred'].value_counts() / sample[sample['LLM_written'] == 0].shape[0]
probs_given_student

In [None]:
# How the emotions are broken up for LLM
probs_given_llm = sample[sample['LLM_written'] == 1]['emotion_pred'].value_counts() / sample[sample['LLM_written'] == 1].shape[0]
probs_given_llm

In [None]:
# Overall probabilities
sample_probs = sample['emotion_pred'].value_counts() / sample.shape[0]
sample_probs

In [None]:
# Using Bayes' rule to find P(y = student | emotion) and P(y = llm | emotion)
# total_probs holds the priors P(y); it is indexed by label (0 = student, 1 = LLM)
total_probs = sample['LLM_written'].value_counts() / sample.shape[0]
student_given_emotion = (probs_given_student * total_probs[0]) / sample_probs
llm_given_emotion = (probs_given_llm * total_probs[1]) / sample_probs
student_given_emotion

In [None]:
llm_given_emotion

Analysis:

From the plot, I identified that student essays have a lot more anger, surprise, and sadness predictions. From this finding, I decided to use Bayes' rule to find P(Y|Emotion) for each Y and each emotion. The probabilities I found line up with the findings from the plot: if an essay is predicted to have an angry tone, it is almost guaranteed to be a student essay. The same can be said for sadness and surprise. For fear, whilst there are no guarantees, an essay is more likely to be a student's essay if it is predicted to exhibit fear. The probabilities for disgust, joy, and neutral are roughly the same for both classes.
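
The Bayes' rule step can be sketched on a handful of made-up predictions. The `toy_pairs` list below is hypothetical and does not reflect the notebook's results; `posterior` is an illustrative helper that computes P(label | emotion) as P(emotion | label) * P(label) / P(emotion).

```python
# Hypothetical (label, predicted emotion) pairs, not the notebook's results
toy_pairs = [
    ("student", "anger"), ("student", "anger"),
    ("student", "neutral"), ("student", "joy"),
    ("llm", "neutral"), ("llm", "neutral"),
    ("llm", "joy"), ("llm", "neutral"),
]

def posterior(pairs, label, emotion):
    """Return P(label | emotion) via Bayes' rule from observed pairs."""
    n = len(pairs)
    label_count = sum(1 for lab, _ in pairs if lab == label)
    emotion_count = sum(1 for _, emo in pairs if emo == emotion)
    joint_count = sum(1 for lab, emo in pairs if lab == label and emo == emotion)
    likelihood = joint_count / label_count  # P(emotion | label)
    prior = label_count / n                 # P(label)
    evidence = emotion_count / n            # P(emotion)
    return likelihood * prior / evidence

for emotion in ("anger", "neutral", "joy"):
    print(emotion, posterior(toy_pairs, "student", emotion))
```

Because every term comes from the same sample, the result equals the share of essays with that emotion that belong to the class. A posterior close to 1 is only as reliable as the number of essays behind it, so emotions with very few predictions deserve caution.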

TLDR: I need to add categorical features that mark whether an essay exhibits anger, surprise, sadness, or fear. I found these to be suitable predictors.

In [None]:
# Closing connections and deleting the engine
db_conn.close()
db_engine.dispose()