# Data Analysis


To develop a classifier that predicts which of two responses a user will prefer, we first have to identify which properties of a text make a user like or dislike it. This data analysis aims to isolate the data that is relevant for that task.

We approach the task with the following assumptions:
- The liking of a text is not random but follows certain patterns
- These patterns can be identified through data analysis
- The patterns are related to measurable text characteristics
- Users have consistent preferences that can be learned
- The preferences can be generalized across different texts
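
To make "measurable text characteristics" concrete, here is a minimal sketch that computes a few surface features of a text. The helper name `text_features` and the choice of features are assumptions made for illustration only; they are not necessarily the features used later in this analysis.

```python
import re


def text_features(text: str) -> dict:
    """Compute a few simple surface features of a text (hypothetical helper)."""
    # Words: runs of letters and apostrophes
    words = re.findall(r"[A-Za-z']+", text)
    # Sentences: non-empty chunks between ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


example_text = "Short answer. A slightly longer second sentence!"
example = text_features(example_text)
print(example)
```

Features like these could be computed separately for each of the two candidate texts and then compared, for example as a difference or a ratio.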


In [2]:
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler


In [14]:
# Import nouns and adjectives
from nltk.corpus import wordnet as wn
import nltk

# Ensure WordNet is downloaded
nltk.download('wordnet')

# Function to extract all nouns
def get_nouns():
    nouns = set()
    for synset in wn.all_synsets(pos=wn.NOUN):
        for lemma in synset.lemmas():
            name = lemma.name()
            # Replace underscores with empty string to concatenate words
            name = name.replace('_', '')
            nouns.add(name)
    return nouns

# Function to extract all adjectives
def get_adjectives():
    adjectives = set()
    for synset in wn.all_synsets(pos=wn.ADJ):
        for lemma in synset.lemmas():
            adjectives.add(lemma.name())
    return adjectives

# Get nouns and adjectives
nouns = get_nouns()
adjectives = get_adjectives()

print("Nouns: ", list(nouns)[:10])
print("Adjectives: ", list(adjectives)[:10])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\siran\AppData\Roaming\nltk_data...
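
One possible use of these word sets is to count how many tokens of a text appear in each of them. The sketch below assumes a hypothetical helper `count_pos_words` and uses small stand-in sets so that it runs without WordNet; in the notebook, the `nouns` and `adjectives` sets from the cell above could be passed instead. Note that it lowercases the text, so capitalized WordNet lemmas (for example proper nouns) would not match.

```python
import re


def count_pos_words(text: str, nouns: set, adjectives: set) -> dict:
    """Count tokens found in the given noun and adjective sets (hypothetical helper)."""
    # Simple tokenization: lowercase runs of letters
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        "n_nouns": sum(t in nouns for t in tokens),
        "n_adjectives": sum(t in adjectives for t in tokens),
    }


# Stand-in sets for illustration only
toy_nouns = {"dog", "park"}
toy_adjectives = {"happy", "green"}

counts = count_pos_words("The happy dog ran through the green park.", toy_nouns, toy_adjectives)
print(counts)
```

A set lookup ignores context, so a word that can be both a noun and an adjective is counted in both categories.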


In [4]:
# Load the different datasets
df_train = pd.read_csv('../Data/LLM Classification Finetuning/train.csv')
df_test = pd.read_csv('../Data/LLM Classification Finetuning/test.csv')

# Display basic information about the datasets
print("Training dataset shape:", df_train.shape)
print("Test dataset shape:", df_test.shape)
print("Dataset columns:", df_train.columns)


Training dataset shape: (57477, 9)
Test dataset shape: (3, 4)
Dataset columns: Index(['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b',
       'winner_model_a', 'winner_model_b', 'winner_tie'],
      dtype='object')
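
The three `winner_*` columns can be collapsed into a single label column, which is often easier to inspect and plot. The sketch below uses a toy frame with the same column names and assumes that exactly one of the three columns is 1 in each row; that assumption is not confirmed by the output above, so the code checks it explicitly. The column name `winner` is introduced here for illustration.

```python
import pandas as pd

# Toy frame with the same winner columns as df_train (illustrative values)
toy = pd.DataFrame({
    "winner_model_a": [1, 0, 0],
    "winner_model_b": [0, 1, 0],
    "winner_tie": [0, 0, 1],
})

winner_cols = ["winner_model_a", "winner_model_b", "winner_tie"]

# Assumption: the columns are one-hot, i.e. exactly one winner per row
assert (toy[winner_cols].sum(axis=1) == 1).all()

# Take the name of the column holding the 1 and drop the "winner_" prefix
toy["winner"] = toy[winner_cols].idxmax(axis=1).str.replace("winner_", "", regex=False)

print(toy["winner"].value_counts())
```

Applied to `df_train`, the same steps would give the class distribution of the target.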


In [12]:
# Load adjectives from a CSV file (loading nouns is currently commented out)
# nouns = set(pd.read_csv('../Data/English/nouns.csv', header=None)[0].values)
adjectives = set(pd.read_csv('../Data/English/adjectives.csv', header=None)[0].values)

print(adjectives)


