# TODO
## Unified Emotion
- Remove samples with no-emotion class
- Convert multiple labels to multiple examples with different labels
- Come up with better assignment scheme for train/valid/test splits
- Drop "#SemST" from ssec sentences

## Go Emotions
- Get rid of print when loading (low priority)
- Include cases for manual tokenizer
- Convert multiple labels to multiple examples with different labels (check with Luuk)

## Manual Tokenizer
- Check that it works for GoEmotions
- Incorporate special tokens into the Huggingface tokenizer

## Dataloaders
- Loop StratifiedLoader for infinite sampling
- Rewrite train script to use correct dataloaders

In [1]:
import torch

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

# Datasets
## Unified Emotion

Data source download: https://drive.google.com/file/d/1y7yjshepNRPhnh-Qz5MTRbnopGn7KzUm/view?usp=sharing
Originally from: https://github.com/sarnthil/unify-emotion-datasets


Bostan, L. A. M., & Klinger, R. (2018, August). An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2104-2119).

In [2]:
import pandas as pd
from data.unified_emotion import unified_emotion, unified_emotion_info

pd.DataFrame(unified_emotion_info())

| | source | size | domain | classes | special |
|---|---|---|---|---|---|
| 0 | affectivetext | 250 | headlines | 6 | non-discrete, multiple labels |
| 1 | crowdflower | 40000 | tweets | 14 | includes no-emotions class |
| 2 | dailydialog | 13000 | conversations | 6 | includes no-emotions class |
| 3 | electoraltweets | 4058 | tweets | 8 | includes no-emotions class |
| 4 | emobank | 10000 | headlines | 3 | VAD regression |
| 5 | emoint | 7097 | tweets | 6 | annotated by experts |
| 6 | emotion-cause | 2414 | artificial | 6 | |
| 7 | fb-valence-arousal-anon | 2800 | facebook | 3 | VA regression |
| 8 | grounded_emotions | 2500 | tweets | 2 | |
| 9 | ssec | 4868 | tweets | 8 | multiple labels per sentence |


In [3]:
unified = unified_emotion(
    "./data/datasets/unified-dataset.jsonl",
    include=['crowdflower', 'dailydialog', 'electoraltweets', 'emoint', 'emotion-cause', 'grounded_emotions', 'ssec', 'tec'],
)

unified.prep()

In [4]:
unified.lens

{'grounded_emotions': 2585,
 'ssec': 4868,
 'crowdflower': 40000,
 'dailydialog': 102979,
 'emotion-cause': 2414,
 'tec': 21051,
 'emoint': 7102,
 'electoraltweets': 4056}

In [5]:
for k in unified.lens.keys():
    dataset = unified.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

grounded_emotions
1 @realDonaldTrump @POTUS @IvankaTrump Mental health benefits removed from trumpcare. #trumpcare https://t.co/pE5a3YFh7Z
0 RT @Tackspayer: @IorettaIynch @HITEXECUTIVE I'd prefer an unemployed Constitutional scholar, Barack Obama. The karma would be deliciously sâ¦

ssec
0 @brandileighhhhh its called sexual coercion, and it is the same as rape. #RapeCulture #SemST
2 #Northwest #HeatWave continues. First time. Ever. My tomato plants have fruit. In JUNE! #Oregon #Organic #SemST
3 Make sure to make it to the Brew House in Pella, IA tomorrow @ 3 to meet with @HillaryClinton supporters! #SemST
1 The guy in the multicolored shirt looks chi as fuck. #SemST
6 We are what we are. Nothing more, nothing less. #spirituality #SemST
4 Pretend I'm a #tree and #save me. -babies everywhereyouthgen #SemST
5 Serious question for my atheist libertarians: How can rights exist without God? #ChristianLibertarian #SemST

crowdflower
5 @jeffparks Good morning, sir
6 Fell down the stairs at da

## GoEmotions

In [6]:
from data.go_emotions import go_emotions

go_emotion = go_emotions(first_label_only=True)
go_emotion.prep()

go_emotion.lens

No config specified, defaulting to: go_emotions/simplified
Reusing dataset go_emotions (C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-07a134cc41feca48.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-d4dc7f7f3530d91a.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-91437c43c19e5bec.arrow


{'go_emotions': 36491}

In [7]:
go_emotion = go_emotions(first_label_only=False)
go_emotion.prep()

go_emotion.lens



{'go_emotions': 44208}

In [8]:
for k in go_emotion.lens.keys():
    dataset = go_emotion.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

go_emotions
2 Whaaaaaaat the heck was that?
14 I agree with you, but you know, sharia *is* sharia, which I find a tad more frightening 
3 You finna be nauseous as hell man, Dont stress it though just throw up
26 OMG what??? ...[NAME] actually showed up for a gig without cancelling with a terrible excuse?
15 Nah I'm good thanks, I still got Netflix and the Pirate Bay
8 Just got the Ultimate edition and it's already my favourite game of all time. I just wish everyone could agree [NAME] is best girl
20 No problem mate. If you're looking for activities, give these boys a look: 
0 I think it doesn't have to be only in school, but anyway, good attitude, I'm sure you can overcome it (with patience)
6 they might have seen me pooping
1 It's an mi5 threat to Ireland. Funny how [NAME] may wants to amend GFA the next day
4 Hmm. I don't disagree. It's unfortunate.
5 Wish you the best of luck and greener pastures!
12 What a cringeworthy load of kak this is. Wow I have no words.
22 I think the only p

# Custom Tokenizer
Here we define some rules for manually cleaning the imported data.
Since all of it is sourced from the internet, defining at least some cleaning rules is strongly recommended.
The current manual tokenizer will:
- Correct the text encodings
- Align contractions with BERT tokenizers
- Handle emojis (using the emoji package) and Twitter handles
- Deal with some edge cases where spaCy's tokenizer fails
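As a rough illustration of two of these rules, the sketch below lowercases the text and masks URLs and Twitter handles with the `HTTPURL` and `@USER` placeholders. It is not the `manual_tokenizer` from `data.utils.tokenizer`; the function name `normalize_tweet` and both regular expressions are assumptions made for this example, and encoding repair, contractions, and emojis are left out.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Lowercase the text, then mask URLs and user handles with placeholders."""
    text = text.lower()
    # Mask after lowercasing so the placeholder tokens keep their upper case.
    text = URL_RE.sub("HTTPURL", text)
    text = HANDLE_RE.sub("@USER", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


print(normalize_tweet("I know you @spinaltap https://t.co/ny8EdRLthF"))
# prints: i know you @USER HTTPURL
```

Masking after lowercasing keeps the placeholders distinguishable from ordinary lowercase words.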

In [9]:
from transformers import AutoTokenizer, AutoModel

from data.utils.tokenizer import manual_tokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.additional_special_tokens = ["HTTPURL", "@USER"]

The raw data

In [10]:
dataset = unified.datasets['grounded_emotions']['train']
trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=8)

labels, text, _, _ = next(trainloader)
text

['I know you @spinaltap https://t.co/ny8EdRLthF',
 "https://t.co/on7RtxSkOl HOW MANY MORE WAYS CAN #45 SHOW US HE DOESN'T GIVE A FUCK ABOUT AMERICA(NS)! He's doing all he can to hurt us!!",
 "RT @PhyllisSilver: Andy: Can't say it better than Gump...Stupid is as stupid does! These cuts are #PennyWiseAndPoundFoolish They will costâ\x80¦",
 "@jonfavs he has no clue. Like all the EO's.",
 'RT @CreativeFuture: Photographer Spencer Amonwatvorakul talks about quitting his day job for his dream #StandCreative2 https://t.co/ToJcRG1â\x80¦',
 "My #TMCð\x9f\x87ºð\x9f\x87¸ð\x9f\x98\x8e, U have been on pointâ\x9c\x85ð\x9f\x94¥!Exemplary comebacks &amp; brilliant original content!Doesn't get any better than that!!Getting caught upâ\x98ºï¸\x8fð\x9f\x87ºð\x9f\x87¸.",
 'RT @petermaer: #Oklahoma authorities get Shortey.  #GOP state senator, former #Trump coordinator faces child prostitution charges. https:/â\x80¦',
 'RT @NastyResister: @SecPriceMD @POTUS @DanaBashCNN @wolfblitzer They should take your me

The same sample, now manually tokenized

In [11]:
list(map(manual_tokenizer, text))

['i know you @USER HTTPURL',
 "HTTPURL how many more ways can # 45 show us he doesn ' t give a fuck about america ( ns ) ! he s doing all he can to hurt us ! !",
 'rt @USER : andy : can not say it better than gump ... stupid is as stupid does ! these cuts are # pennywiseandpoundfoolish they will cost ...',
 '@USER he has no clue . like all the eo s .',
 'rt @USER : photographer spencer amonwatvorakul talks about quitting his day job for his dream # standcreative2 HTTPURL ...',
 "my # tmc 🇺 🇸 :smiling_face_with_sunglasses: , u have been on point :check_mark_button: :fire: !exemplary comebacks brilliant original content ! doesn ' t get any better than that !! getting caught up :smiling_face:  🇺 🇸 .",
 'rt @USER : # oklahoma authorities get shortey . # gop state senator , former # trump coordinator faces child prostitution charges . https :/ ...',
 'rt @USER : @USER @USER @USER @USER they should take your medical license for violating your oath . " first do n ...',
 'that is truly sickeni

The manual tokenizer can easily be slotted into the data loading process.
It does take considerably longer, though.

In [12]:
# Use below if you additionally want to limit sentences to those that overlap well with BERT
# Not recommended for initial training 
#unified.prep(text_tokenizer=manual_tokenizer, text_tokenizer_kwargs={'bert_vocab': tokenizer.vocab.keys(), 'OOV_cutoff' :0.5, 'verbose':True})

unified.prep(text_tokenizer=manual_tokenizer)

Removed sentence for bad encoding. (message printed 17 times)


In [13]:
print('\nExample data')
for k in unified.lens.keys():
    dataset = unified.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)

    labels, text, _, _ = next(trainloader)
    print(k)

    # Invert the name -> index map so that integer labels can be printed as names
    label_map = {idx: name for name, idx in unified.label_map[k].items()}
    tokenized_texts = list(map(tokenizer.decode, tokenizer(text)['input_ids']))
    for txt, label in zip(tokenized_texts, labels):
        print(label_map[label], txt)
    print()


Example data
grounded_emotions
sadness [CLS] @ user what regulations caused harm to remington? thank you. [SEP]
joy [CLS] i am doing it!! @ user @ user # atmosphereplus # socool # nbavirginnomore # staplescenter... httpurl [SEP]

ssec
anger [CLS] ( it s just a night watcheasy to set up. staggered pickets ( google it ) work best ) # whoisburningblackchurches # semst [SEP]
fear [CLS] rosalind peterson addressing un on how aerosol spraying ( chemtrails ) is affecting agriculture. # notadebate # geoengineering # semst [SEP]
joy [CLS] coming from a female that was taken away the ability to have children. i still believe women should have the ability to choose. # semst [SEP]
disgust [CLS] i don't buy a dress that i can not completely zip up and fasten on my own. # semst [SEP]
trust [CLS] people who have been pregnant can be pro choice. people who can not have kids can be pro choice. people who have a uterus can be # semst [SEP]
sadness [CLS] pretend i am a # tree and # save me. - babies eve

In [14]:
#go_emotion.prep(text_tokenizer=manual_tokenizer)

# Sampling
## Dataset sampling

In [15]:
from data.utils.sampling import dataset_sampler

source_name = dataset_sampler(unified, sampling_method='sqrt')
source_name

'dailydialog'
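One plausible reading of `sampling_method='sqrt'` is that a source dataset is drawn with probability proportional to the square root of its size, which dampens the dominance of the largest datasets. The sketch below illustrates that idea using the sizes printed by `unified.lens`. It is an assumption about `dataset_sampler`, not its actual implementation; `source_probabilities`, `sample_source`, and the `'proportional'` option are hypothetical names introduced here.

```python
import math
import random

# Dataset sizes as reported by unified.lens earlier in this notebook.
lens = {
    'grounded_emotions': 2585, 'ssec': 4868, 'crowdflower': 40000,
    'dailydialog': 102979, 'emotion-cause': 2414, 'tec': 21051,
    'emoint': 7102, 'electoraltweets': 4056,
}


def source_probabilities(lens, sampling_method="sqrt"):
    """Return a sampling probability for each dataset name."""
    if sampling_method == "sqrt":
        weights = {name: math.sqrt(size) for name, size in lens.items()}
    elif sampling_method == "proportional":
        weights = {name: float(size) for name, size in lens.items()}
    else:
        raise ValueError(f"unknown sampling method: {sampling_method}")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def sample_source(lens, rng, sampling_method="sqrt"):
    """Draw one dataset name according to the chosen weighting."""
    probs = source_probabilities(lens, sampling_method)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


rng = random.Random(0)
print(sample_source(lens, rng))
for name, p in source_probabilities(lens).items():
    print(f"{name}: {p:.3f}")
```

Under this weighting the largest dataset is still the most likely draw, but its share is far smaller than under size-proportional sampling.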

## Dataloaders
The dataloaders have changed somewhat since the previous version.

Dataloaders must now be created manually from a specific dataset (a dict with labels as keys and lists of examples as values).

Each call to `next` samples from the data and returns **both** the support and the query set.

Thus,

IN: dataset

OUT: support labels, support text, query labels, query text

If a Huggingface tokenizer is passed, the text is returned as the full model input (attention masks, token types, etc.).

It can then be fed into the model as,

```
model(**text)
```

### Stratified Sampling
Traditional N-way k-shot, balanced across classes.

Requires manually specifying k, the number of examples drawn per class, which determines the batch size (N × k for the support set).
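The idea can be sketched in a few lines on a toy dataset in the same format (a dict with labels as keys and lists of examples as values). This is a minimal illustration and not the `StratifiedLoader` implementation: the function `stratified_episode` is hypothetical, it returns raw strings rather than tokenized model input, and splitting one draw of 2k examples into support and query is an assumption.

```python
import random
from collections import Counter


def stratified_episode(dataset, k, rng):
    """Draw k support and up to k query examples for every class."""
    support_labels, support_text, query_labels, query_text = [], [], [], []
    for label, examples in dataset.items():
        # Sample without replacement so support and query never overlap.
        drawn = rng.sample(examples, min(2 * k, len(examples)))
        support, query = drawn[:k], drawn[k:]
        support_labels += [label] * len(support)
        support_text += support
        query_labels += [label] * len(query)
        query_text += query
    return support_labels, support_text, query_labels, query_text


# Toy data: 3 classes with 10 examples each.
toy = {label: [f"class {label}, example {i}" for i in range(10)] for label in range(3)}
rng = random.Random(0)

support_labels, support_text, query_labels, query_text = stratified_episode(toy, k=2, rng=rng)
print(Counter(support_labels))
print(Counter(query_labels))
```

With k=2 and three classes, both the support and the query set hold six examples, two per class. A class with fewer than 2k examples would end up with a smaller query set in this sketch.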

In [16]:
from collections import Counter

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

In [17]:
dataset = unified.datasets['ssec']['train']
trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=16)
support_labels, support_text, query_labels, query_text = next(trainloader)

In [18]:
Counter(support_labels)

Counter({0: 16, 2: 16, 3: 16, 1: 16, 6: 16, 4: 16, 5: 16})

In [19]:
Counter(query_labels)

Counter({0: 16, 2: 16, 3: 16, 1: 16, 6: 16, 4: 16, 5: 15})

In [20]:
#while True:
#    next(trainloader)

### Adaptive N-way k-shot
A dataloader with adaptive/stochastic N-way, k-shot batches.

The support set has a random number of examples per class, roughly proportional to class size.

The query set is always balanced.

If the dataset has more than 5 classes, a batch may contain only a subset of them.

Algorithm taken from:

    Triantafillou et al. (2019). Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.

Steps are as follows:
    
1. Sample subset of classes (min 5, max all classes)
    
2. Define query set size (max 10 per class)
    
3. Define support set size (max 128 for all)
    
4. Fill support set with samples, stochastically proportional to class size
    
5. Fill query set with remaining samples
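
The steps above can be sketched as follows, loosely following the episode sampling procedure described by Triantafillou et al. This is a sketch under stated assumptions, not the `AdaptiveNKShotLoader` implementation: the function `adaptive_episode`, the cap of 100 support candidates per class, the query cap at half the smallest class, and the random factor between 0.5 and 2 are taken from my reading of the paper or chosen for illustration. It also assumes every class has at least 2 examples and returns raw text rather than tokenized model input.

```python
import math
import random
from collections import Counter


def adaptive_episode(dataset, rng, min_classes=5, max_query=10,
                     max_support_size=128, max_support_per_class=100):
    """Sample one episode from a {label: [examples]} dict."""
    # 1. Sample a subset of classes (at least min_classes, at most all).
    labels = list(dataset)
    n_way = rng.randint(min(min_classes, len(labels)), len(labels))
    classes = rng.sample(labels, n_way)

    # 2. Query size per class: capped at max_query and half the smallest class.
    n_query = min(max_query, min(len(dataset[c]) for c in classes) // 2)

    # 3. Total support size: a random fraction of what is available, capped.
    beta = 1.0 - rng.random()  # uniform in (0, 1]
    available = sum(
        math.ceil(beta * min(max_support_per_class, len(dataset[c]) - n_query))
        for c in classes
    )
    support_size = min(max_support_size, available)

    # 4. Split the support budget across classes: class size times a random
    #    factor between 0.5 and 2, normalised over the sampled classes.
    weights = {
        c: math.exp(rng.uniform(math.log(0.5), math.log(2.0))) * len(dataset[c])
        for c in classes
    }
    total_weight = sum(weights.values())

    support_labels, support_text, query_labels, query_text = [], [], [], []
    for c in classes:
        share = weights[c] / total_weight
        n_support = min(math.floor(share * (support_size - n_way)) + 1,
                        len(dataset[c]) - n_query)
        # 5. Draw without replacement; the part not used for support fills the query.
        drawn = rng.sample(dataset[c], n_support + n_query)
        support_labels += [c] * n_support
        support_text += drawn[:n_support]
        query_labels += [c] * n_query
        query_text += drawn[n_support:]
    return support_labels, support_text, query_labels, query_text


# Toy data: 7 classes of very different sizes.
sizes = [300, 40, 120, 60, 25, 20, 80]
toy = {label: [f"class {label}, example {i}" for i in range(size)]
       for label, size in enumerate(sizes)}
rng = random.Random(0)

support_labels, support_text, query_labels, query_text = adaptive_episode(toy, rng)
print(Counter(support_labels))
print(Counter(query_labels))
```

Because every class receives at least one support example and the remaining budget is split by rounding down, the support set never exceeds `max_support_size`, while the query set has the same number of examples for every sampled class.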


In [21]:
dataset = unified.datasets['ssec']['train']
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=128)
support_labels, support_text, query_labels, query_text = next(trainloader)

In [22]:
Counter(support_labels)

Counter({4: 3, 6: 3, 0: 84, 5: 2, 3: 25, 2: 8})

In [23]:
Counter(query_labels)

Counter({4: 10, 6: 10, 0: 10, 5: 10, 3: 10, 2: 10})