# Text Analytics Coursework -- Data Loader

This notebook helps you get started with the datasets used in the coursework assignment.

For this coursework, we recommend using the virtual environment that you created for the labs.

In [19]:
%load_ext autoreload
%autoreload 2

# Use HuggingFace's datasets library to access the Emotion dataset
from datasets import load_dataset
import numpy as np



# Social Media Emotion Classification

The dataset classifies Tweets into anger, joy, optimism or sadness.

First we need to load the data, which is already split into train, validation and test sets. The _validation_ set (also called the 'development' set or 'devset') can be used to measure your model's performance when tuning hyperparameters, choosing combinations of features, or examining the errors your model makes before improving it. This lets you hold out the test set (i.e., not look at it at all while developing your method) so that it gives a fair evaluation of how well the model generalises to new examples, and it avoids tuning the model to specific examples in the test set. Instead of a single fixed validation set, you can alternatively use [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html).
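To make the cross-validation idea concrete, here is a minimal sketch that uses only the Python standard library. The helper name `k_fold_indices` and the toy size of 10 items are assumptions for illustration; in practice you would more likely use the splitters in scikit-learn's `model_selection` module (for example `KFold` or `StratifiedKFold`).

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross validation.

    Hypothetical helper for illustration, not part of the coursework code.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one item.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# Toy example: 10 items split into 5 folds.
splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    print(f"train on {len(train_idx)} items, validate on {len(val_idx)}")
```

Each item appears in exactly one validation fold, so averaging the validation scores over the folds uses all of the non-test data for both training and validation, while the test set stays untouched.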

In [20]:
cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="train",
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")


val_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="validation",
    cache_dir=cache_dir,
)
print(f"Development/validation dataset with {len(val_dataset)} instances loaded")


test_dataset = load_dataset(
    "tweet_eval",
    name="emotion",
    split="test",
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

# Access the input text and target labels like this...

train_texts = train_dataset['text']
train_labels = train_dataset['label']

val_texts = val_dataset['text']
val_labels = val_dataset['label']

test_texts = test_dataset['text']
test_labels = test_dataset['label']

Training dataset with 3257 instances loaded
Development/validation dataset with 374 instances loaded
Test dataset with 1421 instances loaded
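Before training, it can help to check how the four emotion classes are balanced. The sketch below is one possible way to do that; the helper name `label_distribution` and the toy label list are made up for illustration, and the id-to-name mapping mirrors the one used elsewhere in this notebook (0 = anger, 1 = joy, 2 = optimism, 3 = sadness).

```python
from collections import Counter

# Label ids as used in this notebook.
EMOTION_NAMES = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}


def label_distribution(labels):
    """Return the fraction of instances per emotion name (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    return {EMOTION_NAMES[i]: counts.get(i, 0) / total for i in sorted(EMOTION_NAMES)}


# Made-up labels for illustration; in the notebook you would pass train_labels.
toy_labels = [0, 0, 1, 3, 2, 0, 3, 1]
distribution = label_distribution(toy_labels)
for name, fraction in distribution.items():
    print(f"{name}: {fraction:.3f}")
```

Calling `label_distribution(train_labels)` on the real data would show whether some classes are rare, which matters when choosing an evaluation metric.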


In [21]:
import pandas as pd

In [22]:
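def textsets_to_csv(textsets, labels):
    """Build a DataFrame with the text, the numeric label and the emotion name.

    Note: despite its name, this function only returns the DataFrame;
    the caller writes it to CSV.
    """
    # Mapping from numeric label to emotion name
    emotion = {
        0: "Anger",
        1: "Joy",
        2: "Optimism",
        3: "Sadness"
    }
    data = pd.DataFrame()
    data["text"] = textsets
    data["label"] = labels
    data["emotion"] = data["label"].apply(lambda x: emotion[x])
    return data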

In [23]:
import os

# Make sure the output directory exists before writing the CSV files
os.makedirs("../data", exist_ok=True)

train = textsets_to_csv(train_texts, train_labels)
train.to_csv("../data/train.csv", index=False)

val = textsets_to_csv(val_texts, val_labels)
val.to_csv("../data/val.csv", index=False)

test = textsets_to_csv(test_texts, test_labels)
test.to_csv("../data/test.csv", index=False)

In [18]:
import matplotlib.pyplot as plt
import seaborn as sns

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | “Worry is a down payment on a problem you may ... | 2 |
| 1 | My roommate: it's okay that we can't spell bec... | 0 |
| 2 | No but that's so cute. Atsu was probably shy a... | 1 |
| 3 | Rooneys fucking untouchable isn't he? Been fuc... | 0 |
| 4 | it's pretty depressing when u hit pan on ur fa... | 3 |
| ... | ... | ... |
| 3252 | I get discouraged because I try for 5 fucking ... | 3 |
| 3253 | The @user are in contention and hosting @user ... | 3 |
| 3254 | @user @user @user @user @user as a fellow UP g... | 0 |
| 3255 | You have a #problem? Yes! Can you do #somethin... | 0 |


In [8]:
type(train_texts)

list

In [4]:
train_texts[:5]

["“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 "My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs",
 "No but that's so cute. Atsu was probably shy about photos before but cherry helped her out uwu",
 "Rooneys fucking untouchable isn't he? Been fucking dreadful again, depay has looked decent(ish)tonight",
 "it's pretty depressing when u hit pan on ur favourite highlighter"]

# Bio Creative V

This dataset marks chemicals and diseases in PubMed articles as named entities. For further details, see [the HuggingFace page](https://huggingface.co/datasets/tner/bc5cdr).

In [3]:
ner_dataset = load_dataset(
    "tner/bc5cdr", 
)

print(f'The dataset is a dictionary with {len(ner_dataset)} splits: \n\n{ner_dataset}')

In [None]:
# It may be useful to obtain the data in a list format for some sequence tagging methods
train_sentences_ner = [item['tokens'] for item in ner_dataset['train']]
train_labels_ner = [[str(tag) for tag in item['tags']] for item in ner_dataset['train']]

val_sentences_ner = [item['tokens'] for item in ner_dataset['validation']]
val_labels_ner = [[str(tag) for tag in item['tags']] for item in ner_dataset['validation']]

test_sentences_ner = [item['tokens'] for item in ner_dataset['test']]
test_labels_ner = [[str(tag) for tag in item['tags']] for item in ner_dataset['test']]

In [None]:
# Show the different tag values in the dataset:
np.unique(np.concatenate(train_labels_ner))
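The tag values are numeric ids stored as strings, so converting them to readable names can make error analysis easier. The sketch below shows one way to do this. Note that the id-to-name mapping is an assumption written from memory of the dataset card and should be verified against the HuggingFace page before use; the helper name `to_tag_names` and the toy sequences are also made up for illustration.

```python
from collections import Counter

# ASSUMPTION: this mapping must be checked against the tner/bc5cdr dataset card.
ASSUMED_ID2LABEL = {
    "0": "O",
    "1": "B-Chemical",
    "2": "B-Disease",
    "3": "I-Disease",
    "4": "I-Chemical",
}


def to_tag_names(label_seqs, id2label=ASSUMED_ID2LABEL):
    """Convert sequences of string tag ids into sequences of tag names."""
    return [[id2label[tag] for tag in seq] for seq in label_seqs]


# Made-up tag sequences in the same format as train_labels_ner.
toy_labels_ner = [["0", "1", "0"], ["2", "3", "0", "1", "4"]]
named = to_tag_names(toy_labels_ner)
tag_counts = Counter(tag for seq in named for tag in seq)
print(named)
print(tag_counts)
```

With a verified mapping, `to_tag_names(train_labels_ner)` would give tag sequences that are easier to read when inspecting model errors.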

### (Optional) Transformer Sequence Tagger

If you want to use a transformer for task 2, you may want to take a look at the [Token Classification tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ) from HuggingFace. There is no requirement to use a transformer to achieve high marks; it is just one option that you may consider. Feel free to skip this part of the notebook if you are using a different kind of model that does not require it.

A useful function provided by HuggingFace as part of the Token Classification tutorial is `tokenize_and_align_labels`. You can reuse this function if you are working with a method that tokenizes the text in a different way from the Bio Creative V dataset. The function is provided below:

In [None]:
def tokenize_and_align_labels(examples):
    # Relies on the globals `tokenizer` and `label_all_tokens`, which are defined in the next cell.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, max_length=128, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
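To see what the alignment does without loading a tokenizer, the sketch below applies the same per-token logic to a hand-written list of word ids. The helper name `align_labels` and the toy inputs are hypothetical: the example assumes a three-word sentence whose second word is split into two subword tokens, with one special token at each end.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Apply the alignment logic from tokenize_and_align_labels to one sentence."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens are ignored by the loss function.
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            # First token of a word receives the word's label.
            label_ids.append(word_labels[word_idx])
        else:
            # Later tokens of the same word: copy the label or ignore them.
            label_ids.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return label_ids


# Hypothetical tokenization: [special, word 0, word 1 (two pieces), word 2, special]
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_word_labels = [0, 2, 0]

first_only = align_labels(toy_word_ids, toy_word_labels)
all_tokens = align_labels(toy_word_ids, toy_word_labels, label_all_tokens=True)
print(first_only)  # [-100, 0, 2, -100, 0, -100]
print(all_tokens)  # [-100, 0, 2, 2, 0, -100]
```

With `label_all_tokens=False`, only the first subword token of each word carries a real label, so each original word contributes exactly one label to the loss.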

In [None]:
from transformers import AutoTokenizer

# An example of how to use tokenize_and_align_labels:
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
label_all_tokens = False

tokenized_dataset = ner_dataset.map(tokenize_and_align_labels, batched=True)

When we get the predictions from the transformer sequence tagger, we will need to skip tokens with a training label of -100, as these are either special tokens or (when `label_all_tokens` is False) the non-initial subword tokens of a word.
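A minimal sketch of that filtering step is shown below. It assumes the predictions have already been converted to one predicted label id per token; the helper name `strip_ignored` and the toy values are made up for illustration.

```python
IGNORE_INDEX = -100  # label value used above for tokens that should be skipped


def strip_ignored(predictions, labels):
    """Drop every position whose gold label is IGNORE_INDEX (hypothetical helper)."""
    kept_preds, kept_labels = [], []
    for pred_seq, label_seq in zip(predictions, labels):
        pairs = [(p, l) for p, l in zip(pred_seq, label_seq) if l != IGNORE_INDEX]
        kept_preds.append([p for p, _ in pairs])
        kept_labels.append([l for _, l in pairs])
    return kept_preds, kept_labels


# Made-up example: one sentence with six tokens, three of which are ignored.
toy_labels = [[-100, 0, 2, -100, 0, -100]]
toy_preds = [[0, 0, 2, 3, 1, 0]]

kept_preds, kept_labels = strip_ignored(toy_preds, toy_labels)
print(kept_preds)   # [[0, 2, 1]]
print(kept_labels)  # [[0, 2, 0]]
```

After filtering, each remaining prediction lines up with one word of the original sentence, so the sequences can be compared with the word-level tags when computing evaluation metrics.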