Binary classification (basic/advanced) for synsets

We use the already classified synsets from the data folder as the training/test set.
Each synset comes with one or more relevant words that belong to it, together with a label (basic/advanced).

We then use the following features for classification:
1. The depth of the synset in the WordNet hierarchy
2. The pronunciation complexity of the first word **in the dataset**
3. The length of the first word **in the dataset**
4. The classification of the synset as a concrete or abstract concept

Given a new synset to classify, we obtain its features by:
1. Getting the depth of the synset in the WordNet hierarchy
2. Getting the pronunciation complexity of the first (most frequently used) word **in the synset**
3. Getting the length of the first (most frequently used) word **in the synset**
4. Predicting whether the synset is a concrete or abstract concept
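
As a rough illustration of how these four features could be assembled into a single vector, the sketch below uses only the standard library. The depth and the concrete/abstract flag are passed in as plain values (in the notebook they would come from WordNet and from a separate predictor), and `pronunciation_complexity` is a hypothetical stand-in that simply counts vowel groups; the measure actually used is not specified in this section. The function names and the depth value are illustrative assumptions.

```python
import re
from typing import List


def pronunciation_complexity(word: str) -> int:
    """Hypothetical proxy: number of vowel groups (a rough syllable count)."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def synset_features(depth: int, words: List[str], is_concrete: bool) -> List[float]:
    """Assemble [depth, pronunciation complexity, word length, concreteness]."""
    first_word = words[0]  # assumed to be the most frequently used word
    return [
        float(depth),
        float(pronunciation_complexity(first_word)),
        float(len(first_word)),
        1.0 if is_concrete else 0.0,
    ]


# The depth of 5 is an arbitrary value chosen for the example
features = synset_features(depth=5, words=["fiefdom"], is_concrete=False)
```

Keeping the feature order fixed matters, since the classifier sees only positions, not names.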

After defining the data, we train a binary classifier to predict the label (basic/advanced) of a synset.
Because the dataset is small, we use a simple logistic regression classifier, trained and evaluated using 5-fold cross-validation.
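
To make the evaluation scheme concrete, here is a minimal standard-library sketch of how 5-fold cross-validation partitions a dataset: each sample lands in the test fold exactly once, and the classifier would be fit on the remaining four folds. The function name `k_fold_indices` and the sample count are illustrative assumptions, not taken from the notebook; in practice a library routine such as scikit-learn's `KFold` or `cross_val_score` would normally handle this.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute indices round-robin so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Training set is the union of all folds except the i-th
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(n_samples=12, k=5)
```

With a small dataset, averaging the score over the five test folds gives a steadier estimate than a single train/test split would.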

Steps:
1. Load and format the JSON dataset (synsets, word(s), labels, definitions) 

In [146]:
from typing import List, Dict
import json
import pandas as pd
from pandas import DataFrame

import nltk
from nltk.corpus.reader import Synset
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Gianl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# 1. Load and format the JSON dataset (synsets, word(s), labels, definitions)

In [147]:
with open('data/1.json') as f:
    data = json.load(f)
    dataset = data['dataset']
    labels = data['answers']

Get the Synset object from a string of the form "Synset('word.pos.nn')", e.g. "Synset('war.n.01')"

In [148]:
def get_synset_from_string(s: str) -> Synset:
    # the synset name lies between the first and the last single quote
    start = s.find('\'')
    end = s.rfind('\'')
    synset_name = s[start + 1:end]
    return wn.synset(synset_name)
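
The name-extraction step can be tried in isolation, without WordNet installed. The sketch below is a hypothetical, stricter variant that validates the whole string with a regular expression and raises an error on malformed input, whereas the `find`/`rfind` version above silently returns whatever lies between the quotes. The names `SYNSET_REPR` and `extract_synset_name` are assumptions introduced for this example.

```python
import re

# Greedy match so that apostrophes inside the name are kept,
# mirroring the first-quote/last-quote logic of the original helper
SYNSET_REPR = re.compile(r"^Synset\('(?P<name>.+)'\)$")


def extract_synset_name(s: str) -> str:
    """Return the synset name from a string such as "Synset('war.n.01')"."""
    match = SYNSET_REPR.match(s.strip())
    if match is None:
        raise ValueError(f"not a Synset representation: {s!r}")
    return match.group("name")


name = extract_synset_name("Synset('war.n.01')")
```

The extracted name could then be passed to `wn.synset` as in the cell above.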

Select synset and word(s) from dataset

In [149]:
# Build a dataframe containing synset, word(s), label, definition
columns = ['synset', 'words', 'label', 'definition']

rows: List[list] = []
for label_index, row in enumerate(dataset):
    # Split each row on ':' and then on '|' into one flat list of fields
    row_list: List[str] = []
    for elem in row.split(':'):
        row_list.extend(elem.split('|'))

    synset = get_synset_from_string(row_list[0])
    # Normalise the comma-separated word list by stripping whitespace
    words = ",".join(word.strip() for word in row_list[1].split(','))
    label = labels[label_index]
    definition = row_list[3]
    rows.append([synset, words, label, definition])

# Create the dataframe once instead of concatenating inside the loop
dataset_df: DataFrame = pd.DataFrame(rows, columns=columns)

In [150]:
dataset_df.head()

,synset,words,label,definition
0,Synset('war.n.01'),"war,warfare",basic,the waging of armed conflict against an enemy
1,Synset('fiefdom.n.01'),fiefdom,advanced,the domain controlled by a feudal lord
2,Synset('bed.n.03'),"bed,bottom",basic,a depression forming the ground under a body ...
3,Synset('return_on_invested_capital.n.01'),"return on invested capital,return on investmen...",advanced,"(corporate finance) the amount, expressed as ..."
4,Synset('texture.n.02'),texture,basic,the essential quality of something
