# **Adept**

This is an overview of the ADEPT dataset, first introduced by Emami et al. in [ADEPT: An Adjective-Dependent Plausibility Task](https://aclanthology.org/2021.acl-long.553/) (2021).

The overview examines the statistical distributions of dataset features, such as labels and sentence lengths, to provide an introductory but informative look at the data.

**By team Tennant: Anna Golub, Beate Zywietz**

# Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# Read the data

The data is split into a train set, a dev (validation) set and a test set.

In [None]:
train = pd.read_json('../adept/train-dev-test-split/train.json')
dev = pd.read_json('../adept/train-dev-test-split/val.json')
test = pd.read_json('../adept/train-dev-test-split/test.json')

Dataset sizes

In [None]:
print('Train:', train.shape[0])
print('Dev:', dev.shape[0])
print('Test:', test.shape[0])

Dataset features:
* sentence1 - plain sentence
* sentence2 - sentence with modifier
* modifier - modifier added in sentence2
* noun - noun that is being modified
* label - class label
* idx - data point index

In [None]:
train.head()

Check for missing values - none found

In [None]:
train.isna().sum().sum(), dev.isna().sum().sum(), test.isna().sum().sum()

In [None]:
train[train['label'] == 4].head()

# Label distribution
0 - very implausible  
4 - very plausible

As shown by the bar chart below, most of the data lies in the middle of the scale (the annotators were unsure about how plausible those sentences are). The rest of the data is heavily skewed towards the implausible end. The label 4 (very plausible) is represented by only 65 examples.

In [None]:
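# rename_axis + reset_index(name=...) yields 'label' and 'count' columns
# regardless of how the installed pandas version names value_counts() output
label_counts = train['label'].value_counts().rename_axis('label').reset_index(
    name='count'
).sort_values(by='label')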
label_counts

In [None]:
labels_total = train['label'].value_counts().sum()
label_counts['bar_chart_labels'] = label_counts['count'].apply(
    lambda x: '< 1%' if x / labels_total < 0.01 else '{:2.2%}'.format(x / labels_total)
)

In [None]:
ax = sns.barplot(data=label_counts, x='label', y='count', color='b');
ax.bar_label(ax.containers[0], labels=label_counts['bar_chart_labels']);
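
As a cross-check on the percentages shown on the bars, the relative label frequencies can also be read directly from `value_counts(normalize=True)`. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `train`), so the numbers are illustrative only and say nothing about the real distribution.

```python
import pandas as pd

# Hypothetical stand-in for `train`; only the `label` column (values 0-4) is needed.
toy = pd.DataFrame({'label': [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]})

# normalize=True returns shares instead of raw counts;
# sort_index() orders the result by label rather than by frequency.
label_share = toy['label'].value_counts(normalize=True).sort_index()
print(label_share)
```

Replacing `toy` with `train` should give the same shares that the bar labels are derived from.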

# Nouns & modifiers

## Noun distribution

Let's look at noun occurrences

In [None]:
train_uniq_noun = list(train['noun'].unique())
dev_uniq_noun = list(dev['noun'].unique())
test_uniq_noun = list(test['noun'].unique())

print(f'Overall: {len(set(train_uniq_noun + dev_uniq_noun + test_uniq_noun))} unique nouns')
print(f'Train set: {len(train_uniq_noun)} unique nouns')
print(f'Dev set: {len(dev_uniq_noun)} unique nouns;  {len([m for m in dev_uniq_noun if m not in train_uniq_noun])} of them are NOT in train')
print(f'Test set: {len(test_uniq_noun)} unique nouns;  {len([m for m in test_uniq_noun if m not in train_uniq_noun])} of them are NOT in train')
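
The "not in train" counts can also be computed with set differences, which avoids scanning a list for every value. The following is a minimal sketch on made-up data; `unseen_values`, `toy_train` and `toy_dev` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd

def unseen_values(train_df, other_df, column):
    """Return the values of `column` that occur in `other_df` but not in `train_df`."""
    return set(other_df[column].unique()) - set(train_df[column].unique())

# Hypothetical toy splits, for illustration only.
toy_train = pd.DataFrame({'noun': ['menu', 'dog', 'car', 'dog']})
toy_dev = pd.DataFrame({'noun': ['menu', 'tree', 'lamp']})

unseen = unseen_values(toy_train, toy_dev, 'noun')
print(f'{len(unseen)} dev nouns are NOT in train: {sorted(unseen)}')
```

The same helper would work for any other column, for example `unseen_values(train, test, 'modifier')`.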

Which nouns are the most common in the training set,
and how common are they in the development and test set?

In [None]:
print('Most common nouns\n')
n = 5  # number of instances to show
print(f"train:\n{train['noun'].value_counts().nlargest(n)}")
print(f"\ntest:\n{test['noun'].value_counts().nlargest(n)}")

# count how often the most common nouns from the training set appear in the other sets,
# relative to the size of the split
dev_counts = dev['noun'].value_counts()
test_counts = test['noun'].value_counts()
print("\nnoun\ttrain\tdev\ttest")
for noun, train_val in train['noun'].value_counts().head(n).items():
    # Series.get falls back to 0 when the noun does not occur in that split
    train_share = train_val / len(train)
    dev_share = dev_counts.get(noun, 0) / len(dev)
    test_share = test_counts.get(noun, 0) / len(test)
    print('{}\t{:2.2%}\t{:2.2%}\t{:2.2%}'.format(noun, train_share, dev_share, test_share))

With the exception of the word 'menu', the most frequent nouns differ between the three sets. No single noun is overly common in any set: the most frequent noun in the training set appears in less than 1% of all instances.

## Modifier distribution

Let's look at modifier occurrences

In [None]:
train_uniq_mod = list(train['modifier'].unique())
dev_uniq_mod = list(dev['modifier'].unique())
test_uniq_mod = list(test['modifier'].unique())

print(f'Overall: {len(set(train_uniq_mod + dev_uniq_mod + test_uniq_mod))} unique modifiers')
print(f'Train set: {len(train_uniq_mod)} unique modifiers')
print(f'Dev set: {len(dev_uniq_mod)} unique modifiers;  {len([m for m in dev_uniq_mod if m not in train_uniq_mod])} of them are NOT in train')
print(f'Test set: {len(test_uniq_mod)} unique modifiers;  {len([m for m in test_uniq_mod if m not in train_uniq_mod])} of them are NOT in train')

Which modifiers are the most common in the training set,
and how common are they in the development and test set?

In [None]:
print('Most common modifiers\n')

n = 5  # number of instances to show
print(f"train:\n{train['modifier'].value_counts().nlargest(n)}\n")
print(f"test:\n{test['modifier'].value_counts().nlargest(n)}")

# count how often the most common modifiers from the training set appear in the other sets,
# relative to the size of the split
dev_counts = dev['modifier'].value_counts()
test_counts = test['modifier'].value_counts()
print("\nmod\ttrain\tdev\ttest")
for mod, train_val in train['modifier'].value_counts().head(n).items():
    # Series.get falls back to 0 when the modifier does not occur in that split
    train_share = train_val / len(train)
    dev_share = dev_counts.get(mod, 0) / len(dev)
    test_share = test_counts.get(mod, 0) / len(test)
    print('{}\t{:2.2%}\t{:2.2%}\t{:2.2%}'.format(mod, train_share, dev_share, test_share))

Even though there are more distinct modifiers than nouns, the most common modifiers make up a larger share of each set (~3% instead of <1%). Modifiers that are frequent in one set are also frequent in the other sets.

## Noun-modifier distribution
Most common noun-modifier combinations

In [None]:
# count how often each (noun, modifier) pair appears; the result is indexed by (noun, modifier)
nm_series = train.groupby(['noun', 'modifier']).size()
print(nm_series.sort_values(ascending=False)[:5])

# Sentences

## Unique sentences

Let's look at the plain sentences (sentence1): almost all of them occur only once, but not all. There is also some overlap between the train set and the dev and test sets.

In [None]:
train_uniq_sent = list(train['sentence1'].unique())
dev_uniq_sent = list(dev['sentence1'].unique())
test_uniq_sent = list(test['sentence1'].unique())

print(f'Overall: {len(set(train_uniq_sent + dev_uniq_sent + test_uniq_sent))} unique sentences')
print(f'Train set: {len(train_uniq_sent)} unique sentences')
print(f'Dev set: {len(dev_uniq_sent)} unique sentences;  {len([m for m in dev_uniq_sent if m in train_uniq_sent])} of them ARE in train')
print(f'Test set: {len(test_uniq_sent)} unique sentences;  {len([m for m in test_uniq_sent if m in train_uniq_sent])} of them ARE in train')

Let's check whether all modified sentences (sentence2) are unique: they are not.

There are duplicates within the train set; that is, some sentences are recorded multiple times, with either the same or different labels.

There is a 4-sentence overlap between train and dev and a 1-sentence overlap between train and test. These can be used for sanity checks later on during model training.

In [None]:
train['set'] = 'train'
dev['set'] = 'dev'
test['set'] = 'test'
df = pd.concat([train, dev, test])
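# value_counts() already sorts in descending order; rename_axis + reset_index(name=...)
# yields 'sentence2' and 'count' columns regardless of the installed pandas version
sent_counts = df['sentence2'].value_counts().rename_axis('sentence2').reset_index(
    name='count'
)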
df = df.merge(sent_counts, on='sentence2')
df[df['count'] > 1][['sentence2', 'set', 'label']]
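
To separate exact duplicates from duplicates that carry different labels, one option is to count the distinct labels per modified sentence. This is a minimal sketch on made-up sentences; `toy` is a hypothetical stand-in for `train`, and the sentences and labels are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real frame is `train` with columns `sentence2` and `label`.
toy = pd.DataFrame({
    'sentence2': ['A red car is fast.', 'A red car is fast.',
                  'A dead plant grows.', 'A dead plant grows.',
                  'A small dog barks.'],
    'label': [2, 2, 0, 1, 3],
})

# Number of distinct labels recorded for each modified sentence.
labels_per_sentence = toy.groupby('sentence2')['label'].nunique()

# Sentences that were recorded with more than one label.
conflicting = labels_per_sentence[labels_per_sentence > 1].index.tolist()
print(conflicting)
```

Sentences that appear several times with a single label are consistent duplicates. The ones returned here have conflicting annotations and may deserve a closer look before training.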

## Sentence length

Since some ML models struggle with long sentences, we decided to find the longest sentences in the dataset. Their length is calculated based on their character count, including spaces.

In [None]:
# character count of each modified sentence, keyed by the DataFrame index (not the idx column)
sl_series = train['sentence2'].str.len()
for k, v in sl_series.nlargest(5).items():
    print(v, train['sentence2'][k])

Mean sentence length

In [None]:
sl_series.mean()
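
Character counts are only one way to measure length. The sketch below compares them with a whitespace-based token count on made-up sentences. Both `toy` and the whitespace tokenization are assumptions for illustration: the tokenizer of an actual model would most likely produce different counts.

```python
import pandas as pd

# Hypothetical toy sentences standing in for train['sentence2'].
toy = pd.DataFrame({'sentence2': [
    'A red car is fast.',
    'Dogs bark.',
    'A very old menu lists many dishes.',
]})

# Length in characters, including spaces (the measure used in this overview).
char_len = toy['sentence2'].str.len()

# Length in whitespace-separated tokens (a rough proxy, not a model tokenizer).
token_len = toy['sentence2'].str.split().str.len()

# describe() summarizes count, mean, std, min, quartiles and max in one call.
print(char_len.describe())
print(token_len.describe())
```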