# Data Analysis


To build a classifier that predicts which of two responses a user will prefer, we first have to identify which properties of a text make a user like or dislike it. This data analysis aims to find those properties. Raw text encodes a lot of information, but without further processing it is hard for a model to base a prediction on it, so our goal is to derive more informative features from the underlying data.

We make the following assumptions:
- Whether a text is liked is not random but follows certain patterns.
- These patterns can be identified through data analysis.
- The patterns are related to measurable text characteristics.
- Users have consistent preferences that can be learned.
- These preferences generalize across different texts.


Possibly relevant data:
1. Number of nouns in the response
2. Number of adjectives in the response
3. Sentence sentiment
4. Number of words per response
5. Number of punctuation marks (`.`, `,`, `-`, `!`, etc.) per response
6. Number of letters per response
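The purely surface-level candidates in this list (items 4 to 6) can be computed without any linguistic resources. The following is a minimal sketch, not part of the original notebook: the helper name `surface_features` and the choice of `string.punctuation` as the punctuation set are assumptions made for illustration.

```python
import string

def surface_features(text: str) -> dict:
    """Return simple length and punctuation counts for one response."""
    return {
        # Whitespace-separated tokens (item 4)
        "word_count": len(text.split()),
        # ASCII punctuation characters such as . , - ! (item 5)
        "punct_count": sum(ch in string.punctuation for ch in text),
        # Alphabetic characters only, no digits or spaces (item 6)
        "letter_count": sum(ch.isalpha() for ch in text),
    }

print(surface_features("Hello, world! It works."))
# {'word_count': 4, 'punct_count': 3, 'letter_count': 17}
```

Applied to `response_a` and `response_b`, such a helper would yield one feature pair per row, in the same way as the noun and adjective counts.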

In [15]:
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler


In [16]:
# Import nouns and adjectives
from nltk.corpus import wordnet as wn
import nltk

# Ensure WordNet is downloaded
nltk.download('wordnet')

# Function to extract all nouns
def get_nouns():
    nouns = set()
    for synset in wn.all_synsets(pos=wn.NOUN):
        for lemma in synset.lemmas():
            name = lemma.name()
            # Replace underscores with empty string to concatenate words
            name = name.replace('_', '')
            nouns.add(name)
    return nouns

# Function to extract all adjectives
def get_adjectives():
    adjectives = set()
    for synset in wn.all_synsets(pos=wn.ADJ):
        for lemma in synset.lemmas():
            adjectives.add(lemma.name())
    return adjectives

# Get nouns and adjectives
nouns = get_nouns()
adjectives = get_adjectives()

print("Nouns: ", list(nouns)[:10])
print("Adjectives: ", list(adjectives)[:10])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\siran\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Nouns:  ['climbingfumitory', 'leucothoe', 'secularization', 'solicitorship', 'grapehyacinth', 'perissodactylmammal', 'pinacloth', 'acetylcholine', 'PAGAD', 'dolmanjacket']
Adjectives:  ['sequined', 'milklike', 'uncollectible', 'emulous', 'premature', 'intracranial', 'calcifugous', 'ill-starred', 'tubby', 'dogged']


In [17]:
# Load the different datasets
df_train = pd.read_csv('../Data/LLM Classification Finetuning/train.csv')
df_test = pd.read_csv('../Data/LLM Classification Finetuning/test.csv')

# Display basic information about the datasets
print("Training dataset shape:", df_train.shape)
print("Test dataset shape:", df_test.shape) 
print("Dataset columns:", df_train.columns)


Training dataset shape: (57477, 9)
Test dataset shape: (3, 4)
Dataset columns: Index(['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b',
       'winner_model_a', 'winner_model_b', 'winner_tie'],
      dtype='object')


## 1. Number of nouns in response

In [18]:
def count_nouns(sentence):
    words = sentence.split(" ")
    count = 0
    for word in words:
        if word.lower() in nouns:
            count += 1
    return count

    
def count_nouns_rows(row):
    a_count = count_nouns(row["response_a"])
    b_count = count_nouns(row["response_b"])
    return a_count, b_count


In [19]:
# apply returns a Series of (a_count, b_count) tuples; split it into two columns
counts = df_train.apply(count_nouns_rows, axis=1)
df_train["noun_count_a"] = [x[0] for x in counts]
df_train["noun_count_b"] = [x[1] for x in counts]
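Splitting on single spaces leaves punctuation attached to words, so a token such as `dog.` or `Dog,` is never found in the vocabulary. The sketch below shows one possible alternative that tokenizes with a regular expression and takes the vocabulary as a parameter, so the same helper can serve nouns and adjectives. The name `count_in_vocab` and the toy vocabulary are hypothetical and not part of the original notebook.

```python
import re

def count_in_vocab(text: str, vocab: set) -> int:
    """Count tokens of `text` that appear in `vocab`, ignoring case and punctuation."""
    # Lowercase words, optionally joined by hyphens or apostrophes (e.g. ill-starred)
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    return sum(token in vocab for token in tokens)

toy_nouns = {"dog", "park"}  # hypothetical stand-in for the WordNet noun set
sentence = "The dog ran. Dog, park!"

print(count_in_vocab(sentence, toy_nouns))  # 3
# Splitting on spaces only matches the one token without trailing punctuation
print(sum(w.lower() in toy_nouns for w in sentence.split(" ")))  # 1
```

Because tokens are lowercased, vocabulary entries that contain capital letters would need to be lowercased as well to be matched.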

In [1]:
# Perform t-test to compare noun counts between winning and losing responses
# Requires df_train with the noun_count_a / noun_count_b columns from the cells above
from scipy import stats

winning_nouns = []
losing_nouns = []

# Rows marked as ties (winner_tie == 1) are skipped
for _, row in df_train.iterrows():
    if row['winner_model_a'] == 1:
        winning_nouns.append(row['noun_count_a'])
        losing_nouns.append(row['noun_count_b'])
    elif row['winner_model_b'] == 1:
        winning_nouns.append(row['noun_count_b'])
        losing_nouns.append(row['noun_count_a'])

t_stat, p_value = stats.ttest_ind(winning_nouns, losing_nouns)

print("T-test Results for Noun Counts:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print("\nMean noun count in winning responses:", f"{np.mean(winning_nouns):.2f}")
print("Mean noun count in losing responses:", f"{np.mean(losing_nouns):.2f}")

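NameError: name 'df_train' is not defined

The execution count `In [1]` shows that this cell was run in a fresh kernel, before `df_train` was loaded, so no t-test results are available here. The cells above have to be rerun first.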
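Each winning count comes from the same row as one losing count, so the two samples are paired rather than independent. A paired test on the per-row differences may therefore be a better fit than `ttest_ind`. The sketch below computes the paired t-statistic by hand with the standard library. The function name and the toy numbers are illustrative assumptions, not results from the dataset. `scipy.stats.ttest_rel` should return the same statistic together with a p-value.

```python
import math
import statistics

def paired_t_statistic(winning, losing):
    """t-statistic of the per-row differences (winning - losing)."""
    diffs = [w - l for w, l in zip(winning, losing)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n))

# Toy counts for illustration only, not taken from the dataset
t = paired_t_statistic([5, 4, 6], [3, 3, 3])
print(f"paired t-statistic: {t:.4f}")
# paired t-statistic: 3.4641
```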

## 2. Number of adjectives in response

In [22]:
# Count the words of a response that appear in the WordNet adjective set
def count_adjectives(sentence: str):
    words = sentence.split(" ")
    count = 0
    for word in words:
        # Look the word up in the adjective set, not in the noun set
        if word.lower() in adjectives:
            count += 1
    return count

def count_adjectives_row(row):
    a_count = count_adjectives(row["response_a"])
    b_count = count_adjectives(row["response_b"])
    return a_count, b_count

In [23]:
counts = df_train.apply(count_adjectives_row, axis=1)
df_train["adj_count_a"] = [x[0] for x in counts]
df_train["adj_count_b"] = [x[1] for x in counts]

In [24]:
df_train[:2]

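| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie | noun_count_a | noun_count_b | adj_count_a | adj_count_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | ["Is it morally right to try to have a certain... | ["The question of whether it is morally right ... | ["As an AI, I don't have personal beliefs or o... | 1 | 0 | 0 | 217 | 61 | 217 | 61 |
| 1 | 53567 | koala-13b | gpt-4-0613 | ["What is the difference between marriage lice... | ["A marriage license is a legal document that ... | ["A marriage license and a marriage certificat... | 0 | 1 | 0 | 213 | 203 | 213 | 203 |

In this output the adjective counts are identical to the noun counts in both rows. This indicates that `count_adjectives` looked words up in `nouns` rather than in `adjectives` when the cell ran, so these `adj_count_*` values do not yet measure adjectives.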
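The preview above suggests that `prompt`, `response_a`, and `response_b` are stored as strings that look like JSON-encoded lists (they start with `["`). If that assumption holds, decoding them before counting keeps the brackets and quotes out of the word statistics. The helper below is a minimal sketch under that assumption. The name `response_text` is hypothetical, and the fallback behaviour for non-JSON values is a design choice, not something confirmed by the dataset.

```python
import json

def response_text(raw) -> str:
    """Join the turns of a JSON-encoded response list into one string."""
    try:
        turns = json.loads(raw)
    except (TypeError, ValueError):
        # Not a JSON string: fall back to the raw value as text
        return str(raw)
    if not isinstance(turns, list):
        return str(turns)
    # Keep only string turns; entries such as null are dropped
    return " ".join(turn for turn in turns if isinstance(turn, str))

print(response_text('["A marriage license", "is a legal document"]'))
# A marriage license is a legal document
```

The decoded text could then be passed to the counting functions in place of the raw column value, for example `count_nouns(response_text(row["response_a"]))`.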
