# DATA

First, we need to import the data. This step is simple and well documented online, so we won't explain it in detail. It just downloads the dataset from where it is hosted online.

The import uses wget to download the file we uploaded: the Persuade 2.0 Corpus, which contains over 25,000 student essays, each scored on a scale of 1-6.

In [2]:
import pandas as pd
import numpy as np
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
!wget https://raw.githubusercontent.com/Azraelix316/AI_Club_Files/main/persuade_2.0_human_scores_demo_id_github.csv -O persuade_corpus_2.0_train.csv
# Step 1: Load the training and testing datasets
# Quoting options are set explicitly so essays containing commas and quotes parse correctly
train_data = pd.read_csv(
    "persuade_corpus_2.0_train.csv",
    on_bad_lines="skip",              # Skip rows with errors
    quoting=csv.QUOTE_MINIMAL,        # Use minimal quoting (only when necessary)
    quotechar='"',                    # Handle quotes correctly (even though none expected)
    engine='python',                  # Use the Python engine for better error handling
    encoding='utf-8',                 # Ensure proper encoding
)

# Load the test data (note: this reads the same file as train_data, so it is not a held-out set)
test_data = pd.read_csv(
    "persuade_corpus_2.0_train.csv",
    on_bad_lines="skip",              # Skip rows with errors
    quoting=csv.QUOTE_MINIMAL,        # Use minimal quoting (only when necessary)
    quotechar='"',                    # Handle quotes correctly (even though none expected)
    engine='python',                  # Use the Python engine for better error handling
    encoding='utf-8',                 # Ensure proper encoding
)

--2025-02-19 13:54:18--  https://raw.githubusercontent.com/Azraelix316/AI_Club_Files/main/persuade_2.0_human_scores_demo_id_github.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75931833 (72M) [text/plain]
Saving to: ‘persuade_corpus_2.0_train.csv’


2025-02-19 13:54:24 (273 MB/s) - ‘persuade_corpus_2.0_train.csv’ saved [75931833/75931833]



Next, we do some data cleaning and preparation. These steps are specific to our setup: the training data is a CSV file, and the column names come from the Persuade Corpus.


In [3]:
# Convert scores to numeric and handle any errors
train_data['holistic_essay_score'] = pd.to_numeric(train_data['holistic_essay_score'], errors='coerce')
print(1)
# Drop rows with NaN in 'holistic_essay_score', then renumber the rows so the index has no gaps
train_data = train_data.dropna(subset=['holistic_essay_score']).reset_index(drop=True)
print(2)
# Step 2: Prepare the training features and labels
X_train = train_data['full_text']  # Text column in the training data
y_train = train_data['holistic_essay_score']  # Score column in the training data


1
2
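
As a small illustration of the cleaning step above, the sketch below runs the same `pd.to_numeric(..., errors='coerce')` and `dropna` calls on a made-up four-row table. The toy rows and the final `astype(int)` conversion are assumptions for illustration only. It also resets the index, which matters if later code looks rows up by position.

```python
import pandas as pd

# Toy stand-in for the corpus: the column names match the notebook, the rows are made up.
toy = pd.DataFrame({
    "full_text": ["essay a", "essay b", "essay c", "essay d"],
    "holistic_essay_score": ["3", "oops", "5", None],
})

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
toy["holistic_essay_score"] = pd.to_numeric(toy["holistic_essay_score"], errors="coerce")

# Drop unscored rows, then renumber so the index runs 0..n-1 again.
clean = toy.dropna(subset=["holistic_essay_score"]).reset_index(drop=True)

# With the NaN rows gone, the scores can safely be stored as integers.
clean["holistic_essay_score"] = clean["holistic_essay_score"].astype(int)

print(clean)
```

Without the index reset, the surviving rows would keep their original labels (0 and 2 here), so a loop over `range(len(...))` that indexes by label could hit a missing key.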


# ACTUAL CODE

Everything above just loads the data and is not part of TF-IDF itself. Now we start using the algorithm. First, we create a TF-IDF vectorizer for each score, 1 through 6.

Each vectorizer measures how **similar** our test essay is to one **document**: all the training essays at a given score level. We create 6 vectorizers, one per level, so we can check how similar the test data is to level 1 writing, level 2 writing, and so on.

In [4]:

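# One list of essays per score: scores[0] holds score-1 essays, scores[5] holds score-6 essays
one = []
two = []
three = []
four = []
five = []
six = []
scores = [one, two, three, four, five, six]

# Pair each essay with its score directly, so gaps in the index cannot break the lookup
for text, score in zip(X_train, y_train):
    scores[int(score) - 1].append(text)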

# Initialize vectorizers for each score (1-6)
vectorizers = [TfidfVectorizer(stop_words='english', lowercase=True) for _ in range(6)]

# Fit and transform the training data to get the TF-IDF matrix for each score group
tfidf_matrices = [vectorizers[i].fit_transform(scores[i]) for i in range(6)]
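
To make the per-score idea concrete, here is a minimal sketch on two tiny, made-up score groups (the names `toy_groups` and `toy_vectorizers` and all the sentences are hypothetical). It assumes the same `TfidfVectorizer` settings as the cell above and shows that each vectorizer only knows the vocabulary of its own group.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature score groups; the notebook uses six groups of real essays.
toy_groups = {
    1: ["stuff and more stuff", "stuff happens a lot"],
    6: ["evidence supports the argument", "the argument cites strong evidence"],
}

# One vectorizer per group, each fitted only on that group's essays.
toy_vectorizers = {
    score: TfidfVectorizer(stop_words="english", lowercase=True).fit(texts)
    for score, texts in toy_groups.items()
}

essay = ["Stuff happens and then more stuff happens"]
for score, vec in toy_vectorizers.items():
    row = vec.transform(essay)
    # Words outside a group's vocabulary contribute nothing to that group's row.
    print(score, sorted(vec.vocabulary_), float(row.sum()))
```

The essay shares words with the score-1 group only, so its row under the score-6 vectorizer is all zeros. That difference between rows is what the prediction step compares.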

Next, we initialize X_test, which holds the essay we want to grade. This one is a horrendous filler essay; replace it with any text you want to use.

In [5]:
X_test = pd.Series(["So like, let’s talk about stuff. Y’know, like, stuff that happens, stuff that we see, stuff that we just do without even thinking, like, all that. I was thinking about it the other day, like, there’s this thing where we all do stuff without even thinking and then we just move on with life like it’s no biggie. But honestly, sometimes it’s a big deal and sometimes it’s just whatever. Like, you ever think about stuff you did last week? I bet you probably forgot. But it was there, you know? Stuff just goes and then we forget, but it’s always around us. Kinda like how we eat, but we don’t really care what we eat until we eat it and then we’re like, “oh yeah, that was food.” But then we’re full and move on, and that’s the vibe. Also, I’ve been seeing some stuff happening on social media, and it’s like, whoa, there’s too much stuff happening at once. Everyone’s just out here posting stuff that nobody really cares about, but they think people care, so it’s a whole thing. Like, who even asked for all this stuff? It’s just too much stuff all the time, and it’s hard to keep up. Then, there’s the whole “chill” thing. We’re supposed to chill, right? But no one knows how to chill anymore. We just keep getting caught up in the stuff and forget how to chill. People used to just sit and talk or do nothing, but now we gotta be “doing stuff” all the time. Like, why is everything so fast-paced? Just slow down, bro. And then, like, let’s talk about school or work or whatever you do, ‘cause it’s all the same. It’s just more stuff. You get through one thing, and there’s another thing right after it, and you gotta get through that too. It’s like one long line of stuff, and there’s no end to it. So much stuff. And nobody even knows why we’re doing any of it. It’s just stuff, and we do it ‘cause that’s what we do. No biggie, right? In conclusion, life is just a whole lot of stuff. Stuff everywhere, stuff in your face, stuff that never stops. "
                    "So just remember, when you’re doing all the stuff, just don’t forget to chill and breathe ‘cause at the end of the day, it’s all just stuff. Yeah "], dtype="object")  # Create a new Series with the desired element and dtype
# X_test = test_data['full_text']
# Transform the test data using the same vectorizer for each score group
X_test_tfidf = [vectorizers[i].transform(X_test) for i in range(6)]

Finally, we make a prediction. For each of the 6 vectorizers, we sum the essay's TF-IDF weights, normalize that sum by the number of training essays at that score, and output the score with the highest result.

In [6]:
# Now for prediction (find which vectorizer has the highest sum for each test document)
res = []

# Iterate over rows (documents) in the test data
for i in range(X_test_tfidf[0].shape[0]):  # Assume all matrices have the same number of rows
    column_sums = [np.array(matrix[i, :].sum()) for matrix in X_test_tfidf]

    # Find the index of the matrix with the maximum sum
    max_matrix_index = max(range(6), key=lambda x: column_sums[x]/len(scores[x]))  # 0-indexed max
    res.append(max_matrix_index + 1)  # Convert to 1-indexed score

print(res)

[1]
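
The prediction rule above can also be packaged as one function. The sketch below is a hypothetical helper (`predict_score` and `toy_groups` are not part of the notebook) that refits a vectorizer per group on each call, which is fine for a toy example but wasteful on the full corpus. It assumes the same normalization as the cell above: the summed TF-IDF weights divided by the number of essays in the group.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def predict_score(essay, groups):
    """Return the score whose group gives the largest size-normalized TF-IDF sum.

    `groups` maps a score to the list of training essays with that score.
    Hypothetical helper for illustration only.
    """
    best_score, best_value = None, -np.inf
    for score, texts in groups.items():
        vec = TfidfVectorizer(stop_words="english", lowercase=True).fit(texts)
        total = vec.transform([essay]).sum()  # sum of the essay's TF-IDF weights
        value = total / len(texts)            # divide by the group's essay count
        if value > best_value:                # ties keep the earlier score
            best_score, best_value = score, value
    return best_score

# Made-up groups, just to exercise the function.
toy_groups = {
    1: ["stuff and more stuff", "stuff happens a lot"],
    6: ["evidence supports the argument", "the argument cites strong evidence"],
}

print(predict_score("stuff happens all the time", toy_groups))
print(predict_score("strong evidence for the argument", toy_groups))
```

Note that `TfidfVectorizer` L2-normalizes each row by default, so the sum reflects which in-vocabulary words appear and their relative weights rather than the raw essay length.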
