Binary classification (basic/advanced) for synsets

We use the already classified synsets from the data folder as the training/test set.
Each synset comes with one or more words that belong to it and with a label (basic/advanced).

We use the following features for classification:
1. The depth of the synset in the WordNet hierarchy
2. The pronunciation complexity of the first word **in the dataset**
3. The length of the first word **in the dataset**
4. Whether the synset denotes a concrete or an abstract concept

Given a new synset to classify, we obtain its features by:
1. Getting the depth of the synset in the WordNet hierarchy
2. Getting the pronunciation complexity of the first (most frequently used) word **in the synset**
3. Getting the length of the first (most frequently used) word **in the synset**
4. Predicting whether the synset denotes a concrete or an abstract concept
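
The first three steps above can be sketched as a single function. This is a minimal sketch, not the notebook's own code: `extract_features` and its `complexity_fn` parameter are hypothetical names, and the function assumes, as the text does, that the first lemma of a synset is its most frequently used word. It relies only on the `max_depth()` and `lemma_names()` methods of an NLTK `Synset`. The concrete/abstract prediction is left out because the text does not say how it is obtained.

```python
from typing import Callable, Dict


def extract_features(synset, complexity_fn: Callable[[str], int]) -> Dict[str, int]:
    """Compute depth, pronunciation complexity and word length for one synset.

    `synset` is expected to behave like an nltk Synset (max_depth, lemma_names).
    `complexity_fn` is any function mapping a word or phrase to a complexity score.
    """
    # WordNet lemma names join multi-word expressions with underscores,
    # so turn them back into spaces before measuring anything.
    first_word = synset.lemma_names()[0].replace('_', ' ')

    return {
        'synset_depth': synset.max_depth(),
        'pronunce_complexity': complexity_fn(first_word),
        'first_word_length': len(first_word),
    }
```

With NLTK loaded, a call could look like `extract_features(wn.synset('war.n.01'), calculate_pronunce_complexity)`, where `calculate_pronunce_complexity` is the helper defined later in this notebook.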

After defining the data, we train a binary classifier to predict the label (basic/advanced) of a synset.
Since the dataset is small, we use a simple logistic regression classifier trained with 5-fold cross-validation.
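
To make the 5-fold scheme concrete, here is a minimal pure-Python sketch of how the samples could be partitioned. `k_fold_indices` is a hypothetical helper, and the sample count of 503 is an assumption based on the rows listed in the dataset preview (indices 0 to 502). Each fold serves once as the test set while the other four are used for training. In practice the data would usually be shuffled (and possibly stratified by label) first, and a library routine would likely do the splitting; the text does not say which one is used.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # the first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


splits = k_fold_indices(503, k=5)
print([len(test) for _, test in splits])  # [101, 101, 101, 100, 100]
```

Every sample appears in exactly one test fold, so each prediction used for evaluation comes from a model that never saw that sample during training.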

Steps:
1. Load and format the JSON dataset (synsets, word(s), labels, definitions) 

In [210]:
from typing import List, Dict
import json
import pandas as pd
from pandas import DataFrame

import nltk
from nltk.corpus.reader import Synset
from nltk.corpus import wordnet as wn
from nltk.corpus import cmudict
nltk.download('wordnet')
# download the CMU Pronouncing Dictionary
nltk.download('cmudict')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Gianl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package cmudict to
[nltk_data]     C:\Users\Gianl\AppData\Roaming\nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


True

## 1. Load and format the JSON dataset (synsets, word(s), labels, definitions)

In [211]:
with open('data/1.json') as f:
    data = json.load(f)
    dataset = data['dataset']
    labels = data['answers']

Get the Synset object from a string of the form "Synset('word.pos.nn')", e.g. "Synset('war.n.01')"

In [212]:
def get_synset_from_string(s: str) -> Synset:
    # find first ' and last '
    start = s.find('\'')
    end = s.rfind('\'')
    synset_name = s[start + 1:end]
    return wn.synset(synset_name)
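
As an illustration of the string format, the name extraction can also be done without WordNet. The sketch below is a stricter, regex-based variant that raises on input that does not look like a Synset repr; `parse_synset_name` is a hypothetical helper, not part of NLTK, and the second example name is made up to show that apostrophes inside the name are preserved.

```python
import re

# Matches the whole repr and captures everything between the outer quotes.
_SYNSET_REPR = re.compile(r"^Synset\('(.+)'\)$")


def parse_synset_name(s: str) -> str:
    """Extract the synset name from a string such as "Synset('war.n.01')"."""
    match = _SYNSET_REPR.match(s.strip())
    if match is None:
        raise ValueError(f"not a Synset repr: {s!r}")
    return match.group(1)


print(parse_synset_name("Synset('war.n.01')"))            # war.n.01
print(parse_synset_name("Synset('ne'er-do-well.n.01')"))  # ne'er-do-well.n.01
```

The returned name can then be passed to `wn.synset(...)` exactly as in the function above.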

Calculate the pronunciation complexity as the total number of phonemes of the word(s).
If a word is not found in the dictionary, return 50 to indicate high complexity (the maximum complexity is around 25 in this dataset).

In [213]:
pronunce_dict = cmudict.dict()

def calculate_pronunce_complexity(sentence: str) -> int:
    """Sum the phoneme counts of all words; return 50 if any word is unknown."""
    complexity = 0
    for word in sentence.lower().split():
        if word not in pronunce_dict:
            # TODO: check whether returning high complexity when a single word is missing is the better choice
            return 50
        phonemes = pronunce_dict[word][0]  # first listed phonetic representation
        complexity += len(phonemes)  # complexity based on number of phonemes

    if complexity == 0:  # empty input, nothing was looked up: return high complexity
        return 50
    return complexity
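
The behaviour of this scoring rule can be checked without downloading the CMU dictionary. The sketch below is a parameterised variant, not the notebook's function: `phoneme_complexity`, `toy_dict` and the `missing` parameter are hypothetical names, and `toy_dict` is a hand-written stand-in with the same shape as `cmudict.dict()` (word mapped to a list of pronunciations, each a list of phoneme strings).

```python
from typing import Dict, List

# Hand-written stand-in for cmudict.dict(), for illustration only.
toy_dict: Dict[str, List[List[str]]] = {
    'war': [['W', 'AO1', 'R']],
    'bed': [['B', 'EH1', 'D']],
}


def phoneme_complexity(phrase: str, pron_dict: Dict[str, List[List[str]]], missing: int = 50) -> int:
    """Sum phoneme counts over the words of `phrase`; return `missing` if any word is unknown."""
    total = 0
    for word in phrase.lower().split():
        if word not in pron_dict:
            return missing
        total += len(pron_dict[word][0])  # use the first listed pronunciation
    return total if total > 0 else missing


print(phoneme_complexity('war', toy_dict))       # 3
print(phoneme_complexity('war bed', toy_dict))   # 6
print(phoneme_complexity('stopcock', toy_dict))  # 50
print(phoneme_complexity('', toy_dict))          # 50
```

Multi-word entries add up the phonemes of each word, which is why a long phrase can score far higher than a single word, while any out-of-dictionary word jumps straight to the fallback value.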

Select the synset and word(s) from each dataset row and compute the features

In [214]:
# dataframe containing synset, word(s), definition, label
cols = ['synset', 'words', 'synset_depth', 'pronunce_complexity', 'first_word_length', 'label', 'definition']

rows: List[list] = []
for label_index, row in enumerate(dataset):
    # split the row on ':' and then on '|' to get a flat list of fields
    row_list: List[str] = []
    for elem in row.split(':'):
        row_list.extend(elem.split('|'))

    # get synset
    synset = get_synset_from_string(row_list[0])
    # get words
    words = [word.strip() for word in row_list[1].split(',')]
    # take only first word for now, it should be the most frequently used
    first_word = words[0]
    # get synset depth
    synset_depth = synset.max_depth()
    # get pronunce complexity
    pronunce = calculate_pronunce_complexity(first_word)
    # get first word length
    first_word_length = len(first_word)
    # get label
    label = labels[label_index]
    # get definition
    definition = row_list[3]
    # collect the row; note that the 'words' column stores only the first word
    rows.append([synset, first_word, synset_depth, pronunce, first_word_length, label, definition])

# build the dataframe once instead of concatenating row by row
dataset_df: DataFrame = pd.DataFrame(rows, columns=cols)

In [215]:
dataset_df.head(600)

index,synset,words,synset_depth,pronunce_complexity,first_word_length,label,definition
0,Synset('war.n.01'),war,7,3,3,basic,the waging of armed conflict against an enemy
1,Synset('fiefdom.n.01'),fiefdom,6,6,7,advanced,the domain controlled by a feudal lord
2,Synset('bed.n.03'),bed,5,3,3,basic,a depression forming the ground under a body ...
3,Synset('return_on_invested_capital.n.01'),return on invested capital,6,22,26,advanced,"(corporate finance) the amount, expressed as ..."
4,Synset('texture.n.02'),texture,9,6,7,basic,the essential quality of something
...,...,...,...,...,...,...,...
499,Synset('reading.n.03'),reading,6,5,7,basic,a datum about some physical state that is pre...
500,Synset('sanctimoniousness.n.01'),sanctimoniousness,10,50,17,advanced,the quality of being hypocritically devout
501,Synset('chalcedony.n.01'),chalcedony,8,9,10,advanced,a milky or greyish translucent to transparent...
502,Synset('stopcock.n.01'),stopcock,11,50,8,advanced,faucet consisting of a rotating device for re...
