# TODO
## Unified Emotion
- ~~remove classes with limited data~~
- ~~convert multiple labels to multiple examples with different labels~~
- come up with better assignment scheme for train/valid/test splits

## Go Emotions
- Get rid of print when loading (low priority)

## Manual Tokenizer
- check if it works for GoEmotions
- ~~incorporate special tokens into huggingface tokenizer~~
- ~~drop "#SemST" from ssec sentences~~
- include goemotions cases for manual tokenizer

## Dataloaders
- loop StratifiedLoader for infinite sampling
- ~~rewrite train script to use correct dataloaders~~
- ~~allow StratifiedLoader to keep all classes (for supervised training)~~
- ~~allow StratifiedLoader to map a subset of classes to an internal mapping (for meta training)~~

In [2]:
import torch

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

# Datasets
## Unified Emotion

Data source download: https://drive.google.com/file/d/1y7yjshepNRPhnh-Qz5MTRbnopGn7KzUm/view?usp=sharing
Originally from: https://github.com/sarnthil/unify-emotion-datasets


Bostan, L. & Klinger, R. (2018, August). An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2104-2119).

In [2]:
import pandas as pd
from data.unified_emotion import unified_emotion, unified_emotion_info

pd.DataFrame(unified_emotion_info())

| | source | size | domain | classes | special |
|---|---|---|---|---|---|
| 0 | affectivetext | 250 | headlines | 6 | non-discrete, multiple labels |
| 1 | crowdflower | 40000 | tweets | 14 | includes no-emotions class |
| 2 | dailydialog | 13000 | conversations | 6 | includes no-emotions class |
| 3 | electoraltweets | 4058 | tweets | 8 | includes no-emotions class |
| 4 | emobank | 10000 | headlines | 3 | VAD regression |
| 5 | emoint | 7097 | tweets | 6 | annotated by experts |
| 6 | emotion-cause | 2414 | artificial | 6 | |
| 7 | fb-valence-arousal-anon | 2800 | facebook | 3 | VA regression |
| 8 | grounded_emotions | 2500 | tweets | 2 | |
| 9 | ssec | 4868 | tweets | 8 | multiple labels per sentence |


In [3]:
unified = unified_emotion(
    "./data/datasets/unified-dataset.jsonl",
    include=['crowdflower', 'dailydialog', 'electoraltweets', 'emoint',
             'emotion-cause', 'grounded_emotions', 'ssec', 'tec'],
)

unified.prep()

In [4]:
unified.lens

{'grounded_emotions': 2585,
 'ssec': 15444,
 'crowdflower': 39821,
 'dailydialog': 102805,
 'emotion-cause': 1960,
 'tec': 21043,
 'emoint': 7090,
 'electoraltweets': 3682}

In [5]:
for k in unified.lens.keys():
    dataset = unified.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

grounded_emotions
1 RT @NICKinNOVA: @TuxcedoCat @RosLehtinen @Deidramayfair And that's without Medicaid expansion.
0 #CEO &amp; CoFounder of #fes Parimal Naik and my #family @BallysVegas #team #creditsisters #credit #business #training https://t.co/6LCeW2PVb9

ssec
0 Stupid Feminists, the civilization you take for granted was built with the labour, blood sweat and tears of men. #SemST
1 it's ironic that ppl will perform lifesaving therapies on animals to preserve their lives-but have staunch views in favor of #SemST
2 #Weneedfeminism because Twitter has its very own misogynist harassment machine. #YesAllWomen #HeForShe #Feminist #SemST
4 @TarheelKrystle @primatemachine Bowing down but not in a threatening way. B/c even bowing down can be triggering. #SemST
5 Meredith giving Don crap was great,but HOLY CRAP PEGGY. No spoilers, but DAMN was it a great scene. #Peggy #MadMen #SemST
6 You stick in the postcode on where you want to go and God will set the SatNav onhow to get there @davegilpi

## GoEmotions

In [6]:
from data.go_emotions import go_emotions

go_emotion = go_emotions(first_label_only=True)
go_emotion.prep()

go_emotion.lens

No config specified, defaulting to: go_emotions/simplified
Reusing dataset go_emotions (C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-07a134cc41feca48.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-d4dc7f7f3530d91a.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-91437c43c19e5bec.arrow
Removed go_emotions/embarrassment for too little data |train|=236, |test|=28
Removed go_emotions/grief for too little data |train|=63, |test|=6
Removed go_emotions/remo

{'go_emotions': 35447}

In [7]:
go_emotion = go_emotions(first_label_only=False)
go_emotion.prep()

go_emotion.lens

No config specified, defaulting to: go_emotions/simplified
Reusing dataset go_emotions (C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-07a134cc41feca48.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-d4dc7f7f3530d91a.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-91437c43c19e5bec.arrow
Removed go_emotions/embarrassment for too little data |train|=338, |test|=42
Removed go_emotions/grief for too little data |train|=84, |test|=6
Removed go_emotions/reli

{'go_emotions': 43101}

In [8]:
for k in go_emotion.lens.keys():
    dataset = go_emotion.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

go_emotions
2 It's a photograph with one of our greatest ever defender and captain. What's your problem?
14 Not only do they make me cringe, I become angry because usually or is my nmother and her flying monkey daughter posting them. Just horrible.
3 A machine would literally do a worse job than him.
26 Wow, that crazy. Hope this doesn't happen again with you. Stay safe!
15 Thanks bud, hope it’s a great one!
8 I want to give my peers money, so they can fight for my rights and privilges- see the bus pass discussion in this thread. 
20 Oh that also would have been good. Plus, breaking pottery is so satisfying
0 I can only focus on that pastry.
6 Trash pick up vehicle driver. Uh.. a garbage man?
1 This is what I'm currently dealing with and I'm glad I read this. Nice reminder thank you lol.
4 This should be interesting...
5 I am not having a great day either. Sending good vibes, hope you feel better :)
22 For some reason I find these charming 🤷🏼‍♀️
25 They usually do hoss, hence the steak

## Or just get everything at once with the MetaDataset class

In [1]:
from data.meta_dataset import MetaDataset

dataset = MetaDataset(verbose=True, include=['go_emotions'])
dataset.prep()

Removed a total of 0 classes and 0 examples.
No config specified, defaulting to: go_emotions/simplified
Reusing dataset go_emotions (C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-07a134cc41feca48.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-d4dc7f7f3530d91a.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-91437c43c19e5bec.arrow
Removed go_emotions/embarrassment for too little data |train|=338, |test|=42
Removed go_emotions/grief for too little data 

In [7]:
import torch

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

for k in dataset.lens.keys():
    data_subset = dataset.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=data_subset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

go_emotions
2 Oh look another shit call
14 This must have been totally horrifying for you . I m glad to hear you found peace in the second birth .
3 Tyl has got to be the most cringey , condescending thing I ve seen on Reddit in a while .
26 OH MY GOD ! The PTA has disbanded ! Ahh ! Ahh ! AHHHHH jumps through window
15 Nice to see someone actually posted something useful here . :D Thanks !
8 Maybe one day but I am very cultured *
20 Those not a part of the Austinfred bashing hive mind at the other sub will find themselves here tomorrow . :face_with_rolling_eyes: .
0 I consider everyone attractive in their own way , so everyone is intimidating .
6 Is this season 7 ? I do not remember it .
1 Oh , I thought it was about the audio glitch at the end , lol Now I see
4 My vanity point for my love for SS is that it has a lance lord .
5 It may get you pregnant so just make sure you eat all her birth control before hand .
22 I thought it was gen 2 that was affected by the dead batteries due to t

In [9]:
dataset.datasets['go_emotions']['train'][0][0]['text'].encode('latin-1').decode('utf-8')

'Damn youtube and outrage drama is super lucrative for reddit'
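
The `encode('latin-1').decode('utf-8')` round trip above undoes a common kind of mojibake: text whose UTF-8 bytes were mistakenly decoded as Latin-1. Below is a minimal sketch of that repair. The helper name `fix_mojibake` and its fallback behaviour are assumptions made for illustration and are not taken from the repository.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the round
    trip fails, i.e. when the text was not garbled in this way.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("cafÃ©"))        # garbled input
print(fix_mojibake("plain ascii"))  # already clean input
```

The `try`/`except` keeps the helper safe to apply to every sentence, including ones that were never garbled.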

# Custom Tokenizer
Here we define some rules for manually cleaning the imported data.
Since all of it is sourced from the internet, defining at least some cleaning rules is strongly recommended.
The current manual tokenizer will:
- Correct the text encodings
- Align contractions with BERT tokenizers
- Handle emojis (using the emoji package) and Twitter handles
- Deal with some edge cases where spaCy's tokenizer fails
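
To give a feel for this kind of rule, here is a small regex-based sketch covering only two of the cases: Twitter handles and URLs. It is not the repository's `manual_tokenizer`. The function name `simple_clean`, the two patterns, and the choice to drop URLs entirely are assumptions for illustration.

```python
import re

# Hypothetical patterns; the real manual_tokenizer may use different rules.
HANDLE_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def simple_clean(text: str) -> str:
    """Replace Twitter handles with <USER>, drop URLs, squeeze whitespace."""
    text = HANDLE_RE.sub("<USER>", text)
    text = URL_RE.sub("", text)
    return " ".join(text.split())


print(simple_clean("@JihadiJew thank you! i will :)"))
print(simple_clean("Trump adviser admits contact - CBS News https://t.co/HD5M7bwx3F"))
```

Unlike the full tokenizer, this sketch does not split punctuation, fix contractions, or handle emojis.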

In [10]:
from transformers import AutoTokenizer, AutoModel

from data.utils.tokenizer import manual_tokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.additional_special_tokens = ["HTTPURL", "@USER"]

The raw data

In [11]:
dataset = unified.datasets['grounded_emotions']['train']
trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=8)

labels, text, _, _ = next(trainloader)
text

['@belisawriter @Eliawriter fucking awesome!',
 'Trump Wants Faster Growth. The Fed Isnt So Sure. - The New York Times https://t.co/uQulh2CYHk',
 '@kenzie_MFC @AubreyCyles_ what? How?',
 '@JihadiJew thank you! i will :)',
 'RT @mysafela: Join us March 18th in San Pedro for a #SmokeAlarmAwarenessMonth community event in #SanPedro https://t.co/lESN9H2yCY',
 "RT @CynthiaEriVo: Much respect @DanielKaluuya_ well said, heartbreakingly honest and for what it's worth I thought you were brilliant. http",
 'RT @SenWarren: .@realDonaldTrump, your Muslim ban is now 0 for 2 vs the Constitution. Stop fighting the rule of law and start fighting for',
 'Trump adviser admits contact with Guccifer 2.0 during campaign - CBS News https://t.co/HD5M7bwx3F',
 'RT @VAPolitical: Treason: Appearing on Russian state television, longtime Trump adviser Roger Stone pushes Trumps wiretap lie https://t.co',
 '@ReaganBattalion @Gavin_McInnes this guy needs a big Jewish foot up his ass! What sick twisted guy @Gavin_McI

The same sample, now manually tokenized

In [12]:
list(map(manual_tokenizer, text))

['<USER> <USER> fucking awesome !',
 'Trump Wants Faster Growth . The Fed Is nt So Sure . - The New York Times',
 '<USER> <USER> _ what ? How ?',
 '<USER> thank you ! i will :)',
 'RT <USER> : Join us March 18th in San Pedro for a # SmokeAlarmAwarenessMonth community event in # SanPedro',
 'RT <USER> : Much respect <USER> _ well said , heartbreakingly honest and for what it s worth I thought you were brilliant . http',
 'RT <USER> : .<USER> , your Muslim ban is now 0 for 2 vs the Constitution . Stop fighting the rule of law and start fighting for',
 'Trump adviser admits contact with Guccifer 2.0 during campaign - CBS News',
 'RT <USER> : Treason : Appearing on Russian state television , longtime Trump adviser Roger Stone pushes Trumps wiretap lie',
 '<USER> <USER> this guy needs a big Jewish foot up his ass ! What sick twisted guy <USER> is ... #Nazi # Bigot',
 'If <USER> lets one # weak corporate owned dem vote for # trumpcare then he is totally ineffective as minority wh',
 'RT <USE

The manual tokenizer can easily be slotted into the data loading process.
It does take quite a lot longer, though...

In [13]:
# Use below if you additionally want to limit sentences to those that overlap well with BERT
# Not recommended for initial training 
#unified.prep(text_tokenizer=manual_tokenizer, text_tokenizer_kwargs={'bert_vocab': tokenizer.vocab.keys(), 'OOV_cutoff' :0.5, 'verbose':True})

unified.prep(text_tokenizer=manual_tokenizer)

In [14]:
print('\nExample data')
for k in unified.lens.keys():
    dataset = unified.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)

    labels, text, _, _ = next(trainloader)
    print(k)

    label_map = {v: k for k, v in unified.label_map[k].items()}
    tokenized_texts = list(map(tokenizer.decode, tokenizer(text)['input_ids']))
    for txt, label in zip(tokenized_texts, labels):
        print(label_map[label], txt)
    print()


Example data
grounded_emotions
sadness [CLS] rt < user > : amazing video showing how people rely on medicaid to * live * donald trump wants to slash the program by $ 880 billion. [SEP]
joy [CLS] rt < user > : scared little hands! so - called potus # trumptraitor # russianpawn # p2 # tcot # resistance # trumprussia [SEP]

ssec
anger [CLS] < user > thank you, # progressive - minded pursuers of familial and societal dystopia. # democrats # liberals [SEP]
disgust [CLS] < user > and where is < user >? nowhere to be seen - cause they only care about [SEP]
fear [CLS] < user > your treatment of the press, amb stevens fam, and the intel of the american ppl is enough to send you to jail [SEP]
sadness [CLS] i see more and more people each day question god s work and why he does things. if you believe, you should not have any questions. [SEP]
surprise [CLS] fantastic emmet county dem meeting tonight in estherville. glad to meet some < user > supporters! [SEP]
trust [CLS] pundits say jim webb faci

In [15]:
#go_emotion.prep(text_tokenizer=manual_tokenizer)

# Sampling
## Dataset sampling

In [16]:
from data.utils.sampling import dataset_sampler

source_name = dataset_sampler(unified, sampling_method='sqrt')
source_name

'tec'
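
The implementation of `dataset_sampler` is not shown here. One plausible reading of `sampling_method='sqrt'` is that each source is drawn with probability proportional to the square root of its size, which softens the dominance of large datasets such as dailydialog. The sketch below works under that assumption; `sample_source` and the `'proportional'` and `'uniform'` options are hypothetical names.

```python
import math
import random


def sample_source(lens: dict, method: str = "sqrt", rng=None) -> str:
    """Pick a dataset name with a probability derived from its size.

    'sqrt' weights each dataset by the square root of its size,
    'proportional' by its size, and 'uniform' ignores size.
    """
    rng = rng or random.Random()
    names = list(lens)
    if method == "sqrt":
        weights = [math.sqrt(lens[n]) for n in names]
    elif method == "proportional":
        weights = [float(lens[n]) for n in names]
    elif method == "uniform":
        weights = [1.0] * len(names)
    else:
        raise ValueError(f"unknown sampling method: {method}")
    return rng.choices(names, weights=weights, k=1)[0]


lens = {"grounded_emotions": 2585, "ssec": 15444, "dailydialog": 102805}
print(sample_source(lens, method="sqrt", rng=random.Random(0)))
```

With these sizes, dailydialog is about 40 times larger than grounded_emotions, but under square-root weighting it is only about 6 times as likely to be drawn.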

## Dataloaders
Changed somewhat from last time.

Dataloaders must now be created manually from a specific dataset (a dict with labels as keys and lists of examples as values).

Each dataloader samples from the data and returns **both** the support and query sets.

Thus,

IN: dataset

OUT: support labels, support text, query labels, query text

If a Huggingface tokenizer is passed, the text is returned as full model input (attention masks, token types, etc.).

It can then be fed into the model as,

```
model(**text)
```

### Stratified Sampling
Traditional N-way k-shot, balanced across classes.

Requires manually specifying k, the number of examples per class, which determines the batch size.
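
As a rough illustration of balanced N-way k-shot sampling over a label-to-examples dict, here is a pure-Python sketch. It is not the `StratifiedLoader` implementation: the function name `stratified_batch` and the choice to draw support and query examples without overlap are assumptions.

```python
import random


def stratified_batch(dataset: dict, k: int, rng=None):
    """Draw k support and k query examples per class, without overlap.

    `dataset` maps each label to a list of examples. Returns
    (support_labels, support_text, query_labels, query_text).
    """
    rng = rng or random.Random()
    support_labels, support_text = [], []
    query_labels, query_text = [], []
    for label, examples in dataset.items():
        if len(examples) < 2 * k:
            raise ValueError(f"class {label!r} has fewer than {2 * k} examples")
        drawn = rng.sample(examples, 2 * k)
        support_text += drawn[:k]
        query_text += drawn[k:]
        support_labels += [label] * k
        query_labels += [label] * k
    return support_labels, support_text, query_labels, query_text


toy = {0: [f"neg {i}" for i in range(10)], 1: [f"pos {i}" for i in range(10)]}
s_lab, s_txt, q_lab, q_txt = stratified_batch(toy, k=2, rng=random.Random(0))
print(s_lab, q_lab)
```

Every class contributes exactly k examples to each set, so a dataset with N classes yields N × k support and N × k query examples.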

In [17]:
from collections import Counter

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

In [18]:
dataset = unified.datasets['ssec']['train']
trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=16)
support_labels, support_text, query_labels, query_text = next(trainloader)

In [19]:
Counter(support_labels)

Counter({0: 16, 1: 16, 2: 16, 4: 16, 5: 16, 6: 16, 3: 16})

In [20]:
Counter(query_labels)

Counter({0: 16, 1: 16, 2: 16, 4: 16, 5: 16, 6: 16, 3: 16})

In [21]:
#while True:
#    next(trainloader)

### Adaptive N-way k-shot
Dataloader with adaptive/stochastic N-way, k-shot batches.

The support set has a random number of examples per class, although proportional to class size.

The query set is always balanced.

Not all classes are present in a batch if the dataset has more than 5 classes.

Algorithm taken from:

    Triantafillou et al. (2019). Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.

Steps are as follows:
    
1. Sample subset of classes (min 5, max all classes)
    
2. Define query set size (max 10 per class)
    
3. Define support set size (max 128 for all)
    
4. Fill support set with samples, stochastically proportional to support set size
    
5. Fill query set with remaining samples
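
The steps above can be sketched in plain Python as follows. This is a simplified reading of the procedure, not the `AdaptiveNKShotLoader` code. The function name `adaptive_batch`, the default limits, the "at most half of the smallest class" query rule, and the noise range on the class proportions are all assumptions loosely modelled on Triantafillou et al. Labels are returned as dataset labels, without any temporary re-mapping.

```python
import math
import random
from collections import Counter


def adaptive_batch(dataset, max_support_size=128, max_query_per_class=10,
                   min_classes=5, rng=None):
    """Sample one adaptive N-way, k-shot episode from a label -> examples dict."""
    rng = rng or random.Random()

    # 1. Sample a subset of classes (at least min_classes, at most all).
    labels = list(dataset)
    n_way = rng.randint(min(min_classes, len(labels)), len(labels))
    chosen = rng.sample(labels, n_way)

    # 2. Balanced query size: capped, and at most half of the smallest class.
    smallest = min(len(dataset[c]) for c in chosen)
    if smallest < 2:
        raise ValueError("every class needs at least 2 examples")
    n_query = min(max_query_per_class, smallest // 2)

    # 3. Total support size: a random fraction of what is left, capped,
    #    but large enough for one shot per class.
    available = sum(len(dataset[c]) - n_query for c in chosen)
    support_size = max(n_way, min(max_support_size,
                                  math.ceil(rng.random() * available)))

    # 4. Per-class shots: noisy proportion of the class size.
    weights = {c: math.exp(rng.uniform(math.log(0.5), math.log(2.0)))
                  * len(dataset[c]) for c in chosen}
    total = sum(weights.values())

    support_labels, support_text, query_labels, query_text = [], [], [], []
    for c in chosen:
        shots = min(int(weights[c] / total * (support_size - n_way)) + 1,
                    len(dataset[c]) - n_query)
        drawn = rng.sample(dataset[c], shots + n_query)
        support_text += drawn[:shots]
        support_labels += [c] * shots
        # 5. Query examples come from the same draw, so they never
        #    overlap with the support examples.
        query_text += drawn[shots:]
        query_labels += [c] * n_query
    return support_labels, support_text, query_labels, query_text


toy = {c: [f"{c}-{i}" for i in range(20 + 5 * c)] for c in range(7)}
s_lab, s_txt, q_lab, q_txt = adaptive_batch(toy, max_support_size=64,
                                            rng=random.Random(0))
print(Counter(s_lab), Counter(q_lab))
```

Reserving one shot per class before distributing the rest guarantees that every sampled class appears in the support set, while the floor on the proportional share keeps the total within the support size chosen in step 3.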


In [22]:
dataset = unified.datasets['ssec']['train']
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64)
support_labels, support_text, query_labels, query_text = next(trainloader)

In [23]:
Counter(support_labels)

Counter({0: 38, 2: 15, 1: 10})

In [24]:
Counter(query_labels)

Counter({0: 21, 2: 21, 1: 21})

Set `subset_classes=False` to retain all classes

In [25]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, subset_classes=False)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(Counter(support_labels), len(set(support_labels)))

Counter({4: 14, 2: 12, 6: 9, 1: 8, 0: 7, 3: 6, 5: 5}) 7
Counter({0: 12, 4: 11, 1: 9, 6: 9, 3: 8, 2: 6, 5: 5}) 7
Counter({0: 16, 4: 16, 1: 12, 3: 6, 5: 4, 6: 4, 2: 3}) 7
Counter({4: 16, 3: 13, 0: 8, 1: 7, 2: 7, 6: 5, 5: 5}) 7
Counter({4: 15, 0: 12, 2: 8, 6: 8, 1: 6, 5: 6, 3: 5}) 7
Counter({0: 17, 6: 14, 4: 11, 3: 6, 2: 6, 1: 4, 5: 3}) 7
Counter({6: 12, 3: 11, 1: 10, 2: 8, 4: 7, 0: 7, 5: 5}) 7
Counter({0: 14, 3: 12, 4: 11, 6: 9, 2: 6, 1: 6, 5: 3}) 7
Counter({4: 17, 3: 15, 0: 8, 1: 6, 6: 5, 2: 4, 5: 4}) 7
Counter({0: 16, 6: 14, 3: 9, 4: 8, 2: 6, 1: 5, 5: 3}) 7


In [26]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, subset_classes=True)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(Counter(support_labels), len(set(support_labels)))

Counter({0: 33, 1: 30}) 2
Counter({1: 9, 0: 6}) 2
Counter({3: 18, 4: 15, 1: 12, 0: 9, 2: 8}) 5
Counter({1: 18, 0: 17, 2: 13, 3: 10, 4: 4}) 5
Counter({2: 22, 0: 21, 1: 19}) 3
Counter({5: 12, 2: 12, 3: 7, 1: 7, 4: 6, 0: 6}) 6
Counter({1: 36, 2: 14, 0: 13}) 3
Counter({3: 22, 0: 20, 2: 9, 1: 7, 4: 4}) 5
Counter({2: 18, 4: 14, 0: 12, 1: 10, 3: 9}) 5
Counter({0: 27, 2: 23, 1: 12}) 3


Set `temp_map=False` to retain the label definitions of the dataset.

The labels then need to be re-mapped before one-hot vectors can be generated for loss computation.
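
One way to do such a re-mapping is to map the labels present in a batch onto `0..N-1` and keep the mapping so that predictions can be translated back. The sketch below uses plain lists; `remap_labels` and `one_hot` are hypothetical helpers, not functions from the repository.

```python
def remap_labels(support_labels, query_labels):
    """Map the labels present in a batch onto 0..N-1.

    Returns the remapped support and query labels plus the mapping,
    so predictions can be translated back to dataset labels.
    """
    mapping = {lab: i for i, lab in enumerate(sorted(set(support_labels)))}
    return ([mapping[lab] for lab in support_labels],
            [mapping[lab] for lab in query_labels],
            mapping)


def one_hot(labels, n_classes):
    """Turn contiguous integer labels into one-hot rows."""
    return [[1 if i == lab else 0 for i in range(n_classes)] for lab in labels]


support, query, mapping = remap_labels([3, 6, 6, 3], [6, 3])
print(support, query, mapping)
print(one_hot(support, len(mapping)))
```

Without the re-mapping, a batch containing only dataset labels 3 and 6 would need a 7-wide one-hot vector with five unused columns.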

In [27]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, temp_map=True)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(sorted(Counter(support_labels).keys()))

[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4, 5]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1]
[0, 1]
[0, 1, 2, 3]


In [28]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, temp_map=False)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(sorted(Counter(support_labels).keys()))

[1, 2, 3, 4, 5, 6]
[3, 6]
[2, 3, 4, 5, 6]
[1, 2]
[0, 4, 6]
[0, 1, 3, 4, 5, 6]
[0, 1, 2, 4, 5, 6]
[2, 5]
[0, 1]
[1, 3, 5]


In [29]:
dataset = unified.datasets['ssec']['train']

trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cuda'), tokenizer=tokenizer, max_support_size=8, temp_map=True)
for i in range(1000):
    batch = next(trainloader)
    support_labels, support_text, query_labels, query_text = batch

In [30]:
for task in unified.lens.keys():
    subset = unified.datasets[task]['test']
    for c in subset.keys():
        print(task, c, len(subset[c]))

grounded_emotions 1 211
grounded_emotions 0 306
ssec 0 1245
ssec 1 912
ssec 3 757
ssec 6 1205
ssec 4 1061
ssec 2 800
ssec 5 527
crowdflower 5 1893
crowdflower 6 1033
crowdflower 3 1854
crowdflower 2 1692
crowdflower 7 438
crowdflower 4 769
crowdflower 0 287
dailydialog 1 71
dailydialog 4 17115
dailydialog 3 2577
dailydialog 0 205
dailydialog 6 365
dailydialog 5 230
emotion-cause 3 96
emotion-cause 4 115
emotion-cause 0 97
emotion-cause 2 85
tec 5 770
tec 4 766
tec 3 1647
tec 1 153
tec 2 564
tec 0 311
emoint 0 340
emoint 1 449
emoint 2 323
emoint 3 307
electoraltweets 3 328
electoraltweets 0 114
electoraltweets 1 64
electoraltweets 9 162
electoraltweets 5 70
