# Exploring the Labels

I did not do the train/dev split myself, so in addition to regular EDA, this notebook checks that the two splits have similar characteristics.

In [1]:
import string
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

TRAIN_PATH = "data/train.csv"
DEV_PATH = "data/dev.csv"

train = pd.read_csv(TRAIN_PATH, header=0)
train.head()

Unnamed: 0,text,admiration,amusement,gratitude,love,pride,relief,remorse
0,My favourite food is anything I didn't have to...,0,0,0,0,0,0,0
1,"Now if he does off himself, everyone will thin...",0,0,0,0,0,0,0
2,Yes I heard abt the f bombs! That has to be wh...,0,0,1,0,0,0,0
3,Damn youtube and outrage drama is super lucrat...,1,0,0,0,0,0,0
4,It might be linked to the trust factor of your...,0,0,0,0,0,0,0


In [2]:
dev = pd.read_csv(DEV_PATH, header=0)
dev.head()

Unnamed: 0,text,admiration,amusement,gratitude,love,pride,relief,remorse
0,Is this in New Orleans?? I really feel like th...,0,0,0,0,0,0,0
1,"You know the answer man, you are programmed to...",0,0,0,0,0,0,0
2,The economy is heavily controlled and subsidiz...,0,0,0,0,0,0,0
3,"Thank you for your vote of confidence, but we ...",0,0,1,0,0,0,0
4,There it is!,0,0,0,0,0,0,0


In [3]:
# process labels
def process_labels(data):
    labels_raw = data.iloc[:, 1:]
    labels_list = labels_raw.apply(lambda row: row.tolist(), axis=1)
    return {"list": labels_list, "raw": labels_raw}
train_labels = process_labels(train)
dev_labels = process_labels(dev)
print("Train Labels Summary:")
print(train_labels["list"].describe())
print("Dev Labels Summary:")
print(dev_labels["list"].describe())

Train Labels Summary:
count                     25196
unique                       31
top       [0, 0, 0, 0, 0, 0, 0]
freq                      14001
dtype: object
Dev Labels Summary:
count                      3149
unique                       20
top       [0, 0, 0, 0, 0, 0, 0]
freq                       1741
dtype: object


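- In both training and dev, the most frequent label combination is all zeros, i.e. none of the seven classes applies (14001 of 25196 training rows and 1741 of 3149 dev rows, roughly 55% in each).
- There are 31 unique label combinations in the training data, and only 20 in the dev data.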
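
Since dev has fewer unique combinations than train, it may be worth checking whether every dev combination also appears in train. The following is a minimal sketch on made-up label rows, not the real data; `label_combos` is a hypothetical helper that is not part of the notebook.

```python
from collections import Counter

def label_combos(rows):
    """Count how often each label combination occurs; each row is a list of 0/1 values."""
    return Counter(tuple(row) for row in rows)

# Toy stand-ins for train_labels["list"] and dev_labels["list"] (made-up values)
toy_train = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 0]]
toy_dev = [[0, 0, 0], [0, 0, 1]]

train_combos = label_combos(toy_train)
dev_combos = label_combos(toy_dev)

# Combinations that appear in dev but never in train
unseen_in_train = set(dev_combos) - set(train_combos)
print(unseen_in_train)
```

With the real data, passing `train_labels["list"]` and `dev_labels["list"]` should work the same way, since each element is a list of 0/1 values.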

## How many instances have more than one label?

In [4]:
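def n_labels(labels_lists):
    """Return the proportion of instances with more than one positive label."""
    summed_labels = [sum(lst) for lst in labels_lists]
    multiple_labels = [x for x in summed_labels if x > 1]
    # this is a proportion (not a count), rounded to two decimals
    n_multiple_labels = round(len(multiple_labels)/len(labels_lists), 2)
    return n_multiple_labels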
train_n_multiple_labels = n_labels(train_labels["list"])
dev_n_multiple_labels = n_labels(dev_labels["list"])
print(f"Training Data: {train_n_multiple_labels}\nDev Data: {dev_n_multiple_labels}")

Training Data: 0.03
Dev Data: 0.03
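
The single proportion above only separates "more than one label" from everything else. A fuller view is the share of instances with exactly 0, 1, 2, ... labels. This is a sketch on made-up rows; `label_count_distribution` is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter

def label_count_distribution(labels_lists):
    """Return the share of instances for each number of positive labels."""
    counts = Counter(sum(lst) for lst in labels_lists)
    total = len(labels_lists)
    return {k: counts[k] / total for k in sorted(counts)}

# Toy label rows (made-up values), standing in for train_labels["list"]
toy_labels = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 0]]

dist = label_count_distribution(toy_labels)
print(dist)
```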


## Is there class imbalance?

In [5]:
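def class_sums(data):
    """Return each class's share of all positive labels, sorted descending."""
    summed_labels = data.sum(axis=0)
    total_n = summed_labels.sum()
    # proportions are relative to the total number of positive labels, not the number of rows
    proportions = summed_labels / total_n
    sorted_proportions = proportions.sort_values(ascending=False)
    return sorted_proportions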
train_class_sums = class_sums(train_labels["raw"])
dev_class_sums = class_sums(dev_labels["raw"])
print("Train:")
print(train_class_sums)
print("Dev:")
print(dev_class_sums)

Train:
admiration    0.343737
gratitude     0.221556
amusement     0.193758
love          0.173616
remorse       0.045360
relief        0.012734
pride         0.009238
dtype: float64
Dev:
admiration    0.324900
gratitude     0.238349
amusement     0.201731
love          0.167776
remorse       0.045273
relief        0.011984
pride         0.009987
dtype: float64


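There is significant class imbalance in both the training and the dev data, and the pattern is consistent across the two: admiration makes up roughly a third of all positive labels, while relief and pride each make up about 1%.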
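
To put a number on how closely the two splits agree, one option is the absolute difference in each class's share. The sketch below copies the proportions from the printed output above into plain dictionaries; the variable names are illustrative only.

```python
# Proportions copied from the printed output of class_sums
train_props = {
    "admiration": 0.343737, "gratitude": 0.221556, "amusement": 0.193758,
    "love": 0.173616, "remorse": 0.045360, "relief": 0.012734, "pride": 0.009238,
}
dev_props = {
    "admiration": 0.324900, "gratitude": 0.238349, "amusement": 0.201731,
    "love": 0.167776, "remorse": 0.045273, "relief": 0.011984, "pride": 0.009987,
}

# Absolute train/dev gap per class
diffs = {label: abs(train_props[label] - dev_props[label]) for label in train_props}
largest = max(diffs, key=diffs.get)

# Ratio between the most and least frequent class in train
imbalance_ratio = max(train_props.values()) / min(train_props.values())

print(largest, round(diffs[largest], 4))
print(round(imbalance_ratio, 1))
```

On these figures the largest train/dev gap is under two percentage points, while the most frequent class is dozens of times more common than the rarest.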

## How long are the text strings?

In [6]:
# combine the dataframes
data = pd.concat([train, dev])
text = data['text'].tolist()
def text_clean(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

clean = [text_clean(sentence) for sentence in text]

# make string tokens
def count_tokens(text):
    split = text.split()
    n_words = len(split)
    return n_words

n_words_clean = [count_tokens(sentence) for sentence in clean]
n_words_dirty = [count_tokens(sentence) for sentence in text]

n_words_clean_np = np.asarray(n_words_clean)
n_words_dirty_np = np.asarray(n_words_dirty)

print(f"Punctuation stripped sentence length range: ({min(n_words_clean_np)}, {max(n_words_clean_np)})")
print(f"unprocessed sentence length range: ({min(n_words_dirty_np)}, {max(n_words_dirty_np)})")

Punctuation stripped sentence length range: (0, 33)
unprocessed sentence length range: (1, 33)


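After removing the punctuation, at least one text string has zero tokens. Since every unprocessed string has at least one token, such a string must consist entirely of punctuation.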
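
The strings that become empty can be listed directly. This sketch reuses the notebook's `text_clean` on a few example strings; `"???"` and `"Thank you :)"` are made up, and with the real data the list comprehension would run over `text` instead.

```python
import string

def text_clean(text):
    # same cleaning as in the notebook: strip ASCII punctuation, then lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

# Example strings; the last two are made-up values for illustration
toy_texts = ["There it is!", "???", "Thank you :)"]

# Keep the original strings whose cleaned version has no tokens left
empty_after_clean = [t for t in toy_texts if len(text_clean(t).split()) == 0]
print(empty_after_clean)
```

Such rows may need special handling later on, for example if a model or tokenizer expects at least one token per input.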

## What is the distribution of word frequencies?

In [22]:
def concat_texts(data):
    """takes a list of strings and joins them into a single space-separated string"""
    return " ".join(data)

text_string = text_clean(concat_texts(text))

words = text_string.split()
word_freq = dict(Counter(words))

# sort the dict by ascending word frequency, so words that occur once come first
word_freq = dict(sorted(word_freq.items(), key=lambda item: item[1]))

count = 0
for freq in word_freq.values():
    if freq == 1:
        count += 1
    else:
        break

print(count)

13032


Nearly half of the unique words in the corpus occur only once. Embeddings should help with associating similar words among all the infrequent ones.
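
The cell above prints the number of words that occur once but not the vocabulary size, so the share cannot be read off the output. A minimal sketch of how that share could be computed, using a made-up sentence instead of the real corpus:

```python
from collections import Counter

# Made-up word list standing in for `words`
toy_words = "the cat sat on the mat the end".split()

word_freq = Counter(toy_words)

# Words that occur exactly once
n_hapax = sum(1 for freq in word_freq.values() if freq == 1)

# Share of the vocabulary (unique words) that occurs only once
hapax_share = n_hapax / len(word_freq)
print(n_hapax, len(word_freq), round(hapax_share, 2))
```

Counting with a generator over `word_freq.values()` also avoids relying on the dictionary being sorted by frequency first.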