Binary classification (basic/advanced) for synsets

We use the already classified synsets from the data folder as the training/test set.
Each synset comes with one or more words that belong to it, plus a label (basic/advanced).

We use the following features for classification:
1. The vector representation of the first word **in the dataset**
2. The depth of the synset in the WordNet hierarchy
3. The pronunciation complexity of the first word **in the dataset**
4. The length of the first word **in the dataset**
5. The classification of the synset as a concrete or abstract concept

Given a new synset to classify, we compute its features by:
1. Getting the vector representation of the first word **in the synset**
2. Getting the depth of the synset in the WordNet hierarchy
3. Getting the pronunciation complexity of the first (most frequently used) word **in the synset**
4. Getting the length of the first (most frequently used) word **in the synset**
5. Predicting whether the synset is a concrete or abstract concept

After building the dataset, we train a binary classifier to predict the label (basic/advanced) of the synset.
Since the dataset is small, we use a simple logistic regression classifier, trained on an 80/20 train/test split and additionally evaluated with 5-fold cross-validation.

Steps:
1. Load and format the JSON dataset (synsets, word(s), labels, definitions) 

In [247]:
from typing import List, Dict
import json
import joblib
import pandas as pd
from pandas import DataFrame
import openpyxl # install it as it is required by pandas to read excel files

from nltk.corpus.reader import Synset
from nltk.corpus import wordnet as wn
from nltk.corpus import cmudict
import spacy

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score

# one-time downloads (uncomment on first run; the nltk lines also need `import nltk`)
# nltk.download('wordnet')
# nltk.download('cmudict')  # CMU Pronouncing Dictionary
# spacy.cli.download("en_core_web_md")

nlp = spacy.load("en_core_web_md")

## 1. Load and format the JSON dataset (synsets, word(s), labels, definitions)

### 1.1 Functions needed later on

Get the WordNet Synset from a string of the form "Synset('word.pos.nn')", e.g. "Synset('war.n.01')"

In [248]:
def get_synset_from_string(s: str) -> Synset:
    # find first ' and last '
    start = s.find('\'')
    end = s.rfind('\'')
    # get synset name from start to end
    synset_name = s[start + 1:end]
    return wn.synset(synset_name)
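
The quote-based slicing above can also be written as a regular expression that rejects malformed input instead of silently slicing it. This is only a minimal sketch: `extract_synset_name` is a hypothetical helper that is not part of the notebook, and it stops before the WordNet lookup so that it runs without the NLTK data.

```python
import re

# Matches strings such as "Synset('war.n.01')" and captures the name between the quotes.
_SYNSET_REPR = re.compile(r"^Synset\('(?P<name>.+)'\)$")


def extract_synset_name(s: str) -> str:
    """Return the synset name embedded in a Synset repr string (hypothetical helper)."""
    match = _SYNSET_REPR.match(s.strip())
    if match is None:
        raise ValueError(f"not a Synset repr: {s!r}")
    return match.group("name")


print(extract_synset_name("Synset('war.n.01')"))
print(extract_synset_name("Synset('return_on_invested_capital.n.01')"))
```

The returned name can then be passed to `wn.synset(...)` exactly as in the cell above.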

Calculate the pronunciation complexity as the number of phonemes.
If a word is not found in the dictionary, return 50 to indicate high complexity (the maximum complexity is around 25 in this dataset).

In [249]:
pronunce_dict = cmudict.dict()

def calculate_pronunce_complexity(sentence: str) -> int:
    complexity = 0
    for word in sentence.split():
        pronunciations = pronunce_dict.get(word.lower())
        if not pronunciations:
            # a word missing from the dictionary is treated as uncommon: return high complexity
            return 50
        # complexity is the number of phonemes in the first listed pronunciation
        complexity += len(pronunciations[0])

    if complexity == 0:  # empty input: no word was found in the dictionary, return high complexity
        return 50
    return complexity
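
To see how the scoring behaves without downloading the CMU dictionary, the sketch below runs the same logic against a tiny hand-written dictionary. `TOY_PRONUNCIATIONS` and `toy_complexity` are illustrative names introduced here, and the three entries are assumptions typed in by hand rather than loaded from NLTK.

```python
from typing import Dict, List

UNKNOWN_WORD_COMPLEXITY = 50  # same fallback value as in the notebook

# Toy stand-in for cmudict.dict(): word -> list of pronunciations (each a list of phonemes).
# These entries are hand-written assumptions for illustration only.
TOY_PRONUNCIATIONS: Dict[str, List[List[str]]] = {
    "war": [["W", "AO1", "R"]],
    "bed": [["B", "EH1", "D"]],
    "texture": [["T", "EH1", "K", "S", "CH", "ER0"]],
}


def toy_complexity(sentence: str, pronunciations: Dict[str, List[List[str]]]) -> int:
    """Sum the phoneme counts of all words; any unknown word yields the fallback value."""
    total = 0
    for word in sentence.split():
        entry = pronunciations.get(word.lower())
        if not entry:
            return UNKNOWN_WORD_COMPLEXITY
        total += len(entry[0])
    return total if total > 0 else UNKNOWN_WORD_COMPLEXITY


print(toy_complexity("war", TOY_PRONUNCIATIONS))       # 3
print(toy_complexity("war bed", TOY_PRONUNCIATIONS))   # 6
print(toy_complexity("stopcock", TOY_PRONUNCIATIONS))  # 50
```

Note that a multi-word entry gets the sum over its words, so long phrases score high even when each word is common.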

### 1.2 Load dataset and concrete/abstract classifier

Load dataset from JSON file and get dataset and answers values

In [250]:
with open('data/basicness_dataset.json') as f:
    data = json.load(f)
    dataset = data['dataset']
    labels = data['answers']

Load Logistic Regression classifier used later to predict if a word is abstract or concrete

In [251]:
# get Logistic Regression classifier with joblib
concrete_abstract_cls: LogisticRegression = joblib.load('trained_models/concrete_abstract_classifier.joblib')

### 1.3 Create dataset containing the features

Create a new DataFrame containing the original dataset information
- Synset
- Words
- Label
- Definition

plus the features needed for training and prediction:
- Synset depth
- Pronunciation complexity
- Length of most frequently used word
- Abstract/concrete classification
- Vector representation of most frequently used word

In [252]:
cols = ['synset', 'words', 'synset_depth', 'pronunce_complexity', 'first_word_length', 'abstract', 'word_vector', 'label', 'definition']
dataset_df: DataFrame = pd.DataFrame(columns=cols)

label_index = 0
for row in dataset:
    # split the raw row string into fields on ':' and then on '|'
    row_list: List[str] = []
    for elem in row.split(':'):
        row_list.extend(elem.split('|'))
    
    # get synset
    synset: Synset = get_synset_from_string(row_list[0])
    # get words
    words = row_list[1]
    words = words.split(',')
    words = [word.strip() for word in words]
    # take only first word, it should be the most frequently used
    first_word = words[0]
    # get synset depth
    synset_depth = synset.max_depth()
    # get pronunce complexity
    pronunce = calculate_pronunce_complexity(first_word)
    # get first word length
    first_word_length = len(first_word)
    # get the spaCy vector of the first token (multi-word entries use only their first token)
    word_vector = nlp(first_word)[0].vector
    # get concreteness (abstract/concrete) from the pre-trained classifier
    # alternative using probability instead of binary label:
    #   need to change the classifier to treat this as a numerical feature instead of a categorical one
    # is_abstract = concrete_abstract_cls.predict_proba([word_vector])[0][1]
    is_abstract = concrete_abstract_cls.predict([word_vector])[0]
    # get label
    label = labels[label_index]
    label_index += 1
    # get definition
    definition = row_list[3]
    # add row to dataframe
    new_row = [[synset, first_word, synset_depth, pronunce, first_word_length, is_abstract, word_vector, label, definition]]
    dataset_df = pd.concat(
        [dataset_df, pd.DataFrame(new_row, columns=cols)], 
        ignore_index=True)
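
The field splitting at the top of the loop can be isolated into a small function, which makes the index positions (0 for the synset, 1 for the words, 3 for the definition) easier to check. The raw row layout is not shown in this notebook, so the sample row below is made up purely so that those positions line up; `ParsedRow` and `parse_row` are hypothetical names.

```python
from typing import List, NamedTuple


class ParsedRow(NamedTuple):
    synset_repr: str
    words: List[str]
    definition: str


def parse_row(row: str) -> ParsedRow:
    # Same splitting as the loop above: first on ':', then each piece on '|'.
    fields: List[str] = []
    for piece in row.split(":"):
        fields.extend(piece.split("|"))
    words = [w.strip() for w in fields[1].split(",")]
    return ParsedRow(synset_repr=fields[0], words=words, definition=fields[3])


# Hypothetical row: the field at index 2 is a placeholder whose real meaning is not known here.
sample = "Synset('war.n.01'):war, warfare|placeholder|the waging of armed conflict against an enemy"
parsed = parse_row(sample)
print(parsed.words[0])
print(parsed.definition)
```

One consequence of splitting on both separators is that a definition containing `:` or `|` would be cut into several fields, so only its first part would end up at index 3.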

In [253]:
dataset_df.head(600)

Unnamed: 0,synset,words,synset_depth,pronunce_complexity,first_word_length,abstract,word_vector,label,definition
0,Synset('war.n.01'),war,7,3,3,1,"[1.4858, -1.8245, -3.4561, -2.0548, 4.5762, 3....",basic,the waging of armed conflict against an enemy
1,Synset('fiefdom.n.01'),fiefdom,6,6,7,1,"[-4.6732, -7.3621, 0.26127, 2.5247, 4.8547, -5...",advanced,the domain controlled by a feudal lord
2,Synset('bed.n.03'),bed,5,3,3,0,"[-2.0862, 1.5808, -7.5852, -1.8082, -1.3864, 3...",basic,a depression forming the ground under a body ...
3,Synset('return_on_invested_capital.n.01'),return on invested capital,6,22,26,1,"[-1.72, 1.7105, -1.5638, 1.3427, 4.4956, 5.316...",advanced,"(corporate finance) the amount, expressed as ..."
4,Synset('texture.n.02'),texture,9,6,7,0,"[-1.7606, -0.68817, -2.7257, 0.86493, -0.88825...",basic,the essential quality of something
...,...,...,...,...,...,...,...,...,...
499,Synset('reading.n.03'),reading,6,5,7,0,"[1.5773, -2.6604, 1.7931, -3.062, -0.093512, -...",basic,a datum about some physical state that is pre...
500,Synset('sanctimoniousness.n.01'),sanctimoniousness,10,50,17,1,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",advanced,the quality of being hypocritically devout
501,Synset('chalcedony.n.01'),chalcedony,8,9,10,0,"[-2.3944, -0.11777, -1.3401, 3.253, 3.1655, -2...",advanced,a milky or greyish translucent to transparent...
502,Synset('stopcock.n.01'),stopcock,11,50,8,0,"[-2.5869, 1.5372, -2.7638, 5.6035, 1.5544, 3.7...",advanced,faucet consisting of a rotating device for re...


## 2. Train a binary classifier to predict the label (basic/advanced) of the synset

### 2.1 Fix dataset formatting and split into features and labels

In [254]:
# drop columns that are not features
X = dataset_df.drop(columns=['synset', 'label', 'definition', 'words'])
y = dataset_df['label']

# split word_vector into columns, each element of the vector is a feature (column)
X = pd.concat([X, pd.DataFrame(X['word_vector'].to_list(), columns=[f'word_vector_{i}' for i in range(300)])], axis=1)
X.drop(columns=['word_vector'], inplace=True) # drop original word_vector column, not needed anymore
# X.drop(columns=['abstract'], inplace=True) # we tried to drop abstract column

In [255]:
X.head()

Unnamed: 0,synset_depth,pronunce_complexity,first_word_length,abstract,word_vector_0,word_vector_1,word_vector_2,word_vector_3,word_vector_4,word_vector_5,...,word_vector_290,word_vector_291,word_vector_292,word_vector_293,word_vector_294,word_vector_295,word_vector_296,word_vector_297,word_vector_298,word_vector_299
0,7,3,3,1,1.4858,-1.8245,-3.4561,-2.0548,4.5762,3.0929,...,11.635,-3.5747,0.10567,6.7869,-3.8354,2.2621,-0.92491,-0.51409,-5.9212,-0.30886
1,6,6,7,1,-4.6732,-7.3621,0.26127,2.5247,4.8547,-5.0618,...,5.648,0.22874,3.145,2.2475,5.105,5.3162,-3.2155,-3.5213,1.1198,0.96926
2,5,3,3,0,-2.0862,1.5808,-7.5852,-1.8082,-1.3864,3.3168,...,3.0212,-2.8594,3.4525,0.70655,-8.1775,-0.32947,-5.4147,2.303,-1.9646,1.6448
3,6,22,26,1,-1.72,1.7105,-1.5638,1.3427,4.4956,5.3168,...,4.1487,-0.13711,-3.0225,1.7869,1.6244,1.4162,-2.0241,-2.7348,-4.6322,0.12388
4,9,6,7,0,-1.7606,-0.68817,-2.7257,0.86493,-0.88825,-6.8168,...,2.2933,0.69526,4.4373,-2.3009,-1.0168,-0.34995,5.309,-0.48802,-2.6492,0.1563


### 2.2 Split data into training and test set, preprocess and create pipeline

In [256]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# define numerical features
word_vector_features = [f'word_vector_{i}' for i in range(300)] # get column names for word_vector features
numeric_features = ['synset_depth', 'pronunce_complexity', 'first_word_length'] + word_vector_features

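# create feature transformers
# note: ColumnTransformer defaults to remainder='drop', so the 'abstract' column,
# which is not listed in numeric_features, is not passed on to the classifier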
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features)
    ])

# create pipeline for preprocessing and classification
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])
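
For readers unfamiliar with `StandardScaler`, the sketch below reproduces the per-column transformation in plain Python: subtract the column mean and divide by the population standard deviation. `standardize` is an illustrative helper, not part of the notebook, and it is applied here to only the first five `synset_depth` values shown earlier.

```python
from statistics import fmean, pstdev
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Z-score a column: subtract the mean, divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # constant column: only center it (a zero scale is treated as 1)
        return [v - mean for v in values]
    return [(v - mean) / std for v in values]


# first five synset_depth values from the dataset preview
depths = [7, 6, 5, 6, 9]
scaled = standardize(depths)
print([round(v, 3) for v in scaled])
```

In the pipeline the mean and standard deviation are estimated when `fit` is called, so with `clf.fit(X_train, y_train)` they come from the training split only and are then reused for the test split.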

### 2.3 Train and evaluate classifier

In [257]:
# train classifier
clf.fit(X_train, y_train)

# predict test set and evaluate
y_pred = clf.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

Accuracy score: 0.7722772277227723
              precision    recall  f1-score   support

    advanced       0.68      0.69      0.68        36
       basic       0.83      0.82      0.82        65

    accuracy                           0.77       101
   macro avg       0.75      0.75      0.75       101
weighted avg       0.77      0.77      0.77       101
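
The figures in the report can be recomputed from a confusion matrix. The notebook does not print one, so the counts below are inferred from the rounded report values and the class supports (36 and 65); treat them as a reconstruction rather than a reported result.

```python
from typing import Dict, Tuple

# (true label, predicted label) -> count; inferred from the rounded report, not printed by the notebook
confusion: Dict[Tuple[str, str], int] = {
    ("advanced", "advanced"): 25,
    ("advanced", "basic"): 11,
    ("basic", "advanced"): 12,
    ("basic", "basic"): 53,
}


def precision_recall_f1(label: str) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for one class from the confusion counts."""
    tp = confusion[(label, label)]
    predicted = sum(n for (_, pred), n in confusion.items() if pred == label)
    actual = sum(n for (true, _), n in confusion.items() if true == label)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(confusion.values())
accuracy = sum(n for (true, pred), n in confusion.items() if true == pred) / total
print(f"accuracy: {accuracy:.4f}")
for label in ("advanced", "basic"):
    p, r, f = precision_recall_f1(label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Read this way, the classifier misses 11 of the 36 advanced synsets and mislabels 12 of the 65 basic ones as advanced.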


5-fold cross validation

In [258]:
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy: {scores.mean()}")

Mean accuracy: 0.7420990099009902
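
As a rough illustration of what `cv=5` does, the sketch below splits 503 sample indices (the dataset preview runs from 0 to 502) into five contiguous folds, each used once as the test set. It is a plain, unstratified version: for a classifier and an integer `cv`, scikit-learn additionally stratifies the folds by label, which this sketch does not reproduce. `k_fold_indices` is a hypothetical helper.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more sample
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


fold_sizes = [len(test) for _, test in k_fold_indices(503, 5)]
print(fold_sizes)
```

Each fold trains a fresh copy of the pipeline on roughly 80% of the data, and the reported mean accuracy is the average of the five test-fold scores.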


## 3. Additional code

### 3.1 Correlation between features and labels (basic/advanced)

In [259]:
from scipy.stats import chi2_contingency

def chi_square_test(feature, label):
    contingency_table = pd.crosstab(feature, label)
    chi2, p_val, _, _ = chi2_contingency(contingency_table)
    return chi2, p_val

In [260]:
chi_square_results = {}

labels = dataset_df['label']
# convert label to binary
binary_labels = labels.apply(lambda x: 1 if x == 'advanced' else 0)

In [261]:
for col in dataset_df.columns:
    if col in ['synset', 'words', 'label', 'definition', 'word_vector']:
        continue
    pearson_corr = dataset_df[col].corr(binary_labels)
    spearman_corr = dataset_df[col].corr(binary_labels, 'spearman')
    chi2, p_val = chi_square_test(dataset_df[col], binary_labels)
    chi_square_results[col] = {'Chi-square': chi2, 'p-value': p_val}
    print(f"Correlation between {col} and labels")
    print(f"\tPearson: {pearson_corr}")
    print(f"\tSpearman: {spearman_corr}")
    print(f"\tChi-square test: {chi2}, p-value: {p_val}")
    print()

Correlation between synset_depth and labels
	Pearson: 0.3092353073716985
	Spearman: 0.3085555402867728
	Chi-square test: 52.5260479799988, p-value: 2.188834326325188e-07

Correlation between pronunce_complexity and labels
	Pearson: 0.4667051196561559
	Spearman: 0.5158882133823424
	Chi-square test: 151.45632548090128, p-value: 3.3062077113394753e-22

Correlation between first_word_length and labels
	Pearson: 0.4435413690068497
	Spearman: 0.4488886147922671
	Chi-square test: 113.39734295225205, p-value: 1.1443724223891585e-14

Correlation between abstract and labels
	Pearson: 0.058893876665495996
	Spearman: 0.0588938766654962
	Chi-square test: 1.5176244751259842, p-value: 0.2179793352116985
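
The Pearson and Spearman values for `abstract` are identical up to floating-point noise because both variables are binary: in that case both coefficients reduce to the phi coefficient of the 2x2 contingency table, and the uncorrected chi-square statistic equals n times phi squared. The sketch below checks this on a made-up table (the counts are not taken from the dataset). It also applies Yates' continuity correction, which `chi2_contingency` uses by default for 2x2 tables and which lowers the statistic.

```python
from math import sqrt
from typing import List


def chi_square_2x2(table: List[List[int]], yates: bool = False) -> float:
    """Pearson chi-square statistic for a 2x2 table, optionally with Yates' correction."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # shrink each deviation by 0.5
            stat += diff ** 2 / expected
    return stat


def phi_coefficient(table: List[List[int]]) -> float:
    """Correlation between two binary variables, computed from their 2x2 table."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


toy_table = [[30, 20], [10, 40]]  # illustrative counts only
n = sum(sum(row) for row in toy_table)
phi = phi_coefficient(toy_table)
print(f"phi = {phi:.4f}")
print(f"chi-square (uncorrected) = {chi_square_2x2(toy_table):.4f}")
print(f"n * phi^2 = {n * phi ** 2:.4f}")
print(f"chi-square (Yates) = {chi_square_2x2(toy_table, yates=True):.4f}")
```

For the non-binary features the chi-square test treats every distinct value as its own category, so its statistic is not directly comparable with the correlation coefficients.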


### 3.2 Predict basicness of a new synset or word

In [262]:
def predict_basicness(to_predict: Synset | str) -> str:
    if isinstance(to_predict, str):
        # get first synset
        synsets = wn.synsets(to_predict)
        if len(synsets) == 0:
            print(f"Warning: word '{to_predict}' not found in WordNet")
            return "advanced" # we consider a word not related to any concept in WordNet as advanced
        synset = synsets[0]
        print("Synset selected:", synset)
    elif isinstance(to_predict, Synset):
        synset = to_predict
    else:
        raise ValueError("to_predict must be a string or a Synset")
    
    # get words
    words = synset.lemma_names()
    words = [word.strip() for word in words]
    # take only first word, it should be the most frequently used
    # note: WordNet lemma names join multi-word entries with '_', whereas the dataset uses spaces
    first_word = words[0]
    # get synset depth
    synset_depth = synset.max_depth()
    # get pronunce complexity
    pronunce = calculate_pronunce_complexity(first_word)
    # get first word length
    first_word_length = len(first_word)
    # get the spaCy vector of the first token, then predict concreteness from it
    word_vector = nlp(first_word)[0].vector
    is_abstract = concrete_abstract_cls.predict([word_vector])[0]
    
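    # build a one-row DataFrame with the same column names used for training
    feature_names = ['synset_depth', 'pronunce_complexity', 'first_word_length', 'abstract', *word_vector_features]
    X = pd.DataFrame(
        [[synset_depth, pronunce, first_word_length, is_abstract, *word_vector]],
        columns=feature_names)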
    # predict
    return clf.predict(X)[0]

In [263]:
# test prediction
synset = wn.synset('dog.n.01')
print(f"Input: {synset}")
print(f"Result: {predict_basicness(synset)} \n")

words = ['person', 'car', 'apple', 'tree', 'galaxy', 'rpg', 'celebrity', 'aberration', 'fps']

for word in words:
    print(f"Input: {word}")
    print(f"Result: {predict_basicness(word)}\n")


Input: Synset('dog.n.01')
Result: basic 

Input: person
Synset selected: Synset('person.n.01')
Result: basic

Input: car
Synset selected: Synset('car.n.01')
Result: basic

Input: apple
Synset selected: Synset('apple.n.01')
Result: basic

Input: tree
Synset selected: Synset('tree.n.01')
Result: basic

Input: galaxy
Synset selected: Synset('galaxy.n.01')
Result: advanced

Input: rpg
Result: advanced

Input: celebrity
Synset selected: Synset('celebrity.n.01')
Result: advanced

Input: aberration
Synset selected: Synset('aberrance.n.01')
Result: advanced

Input: fps
Synset selected: Synset('federal_protective_service.n.01')
Result: advanced
