### Student Information
Name: Yu Ying Chen

Student ID: 108060030

GitHub ID: ARui-tw

Kaggle name: imissina.com

Kaggle private scoreboard snapshot: [Snapshot](img/pic0.png)

----

### Instructions
1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the [DM2022-Lab2-master Repo](https://github.com/keziatamus/DM2022-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm2022-isa5810-lab2-homework) regarding Emotion Recognition on Twitter by this link https://www.kaggle.com/t/2b0d14a829f340bc88d2660dc602d4bd. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (i.e., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).   
    Submit your last submission __BEFORE the deadline (Nov. 22nd 11:59 pm, Tuesday)__. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 25th 11:59 pm, Friday)__. 

In [1]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn.functional as F
from transformers import BertTokenizer, BertConfig,AdamW, BertForSequenceClassification,get_linear_schedule_with_warmup


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report
# Metrics for evaluating each validation batch: accuracy and the Matthews correlation coefficient (MCC)
from sklearn.metrics import accuracy_score,matthews_corrcoef

import tqdm.notebook  # import the submodule explicitly; tqdm.notebook.trange is used in the training loop
import random
import os
import io

In [2]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [3]:
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

SEED = 19

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if device == torch.device("cuda"):
    torch.cuda.manual_seed_all(SEED)

In [4]:
data = pd.read_pickle('data/data_pd.pkl')
test = pd.read_pickle('data/test_pd.pkl')

In [5]:
data['emotion'].unique()

array(['anticipation', 'sadness', 'fear', 'joy', 'anger', 'trust',
       'disgust', 'surprise'], dtype=object)

In [6]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data['label_enc'] = labelencoder.fit_transform(data['emotion'])

In [7]:
data[['emotion','label_enc']].drop_duplicates(keep='first')

| tweet_id | emotion | label_enc |
|---|---|---|
| 0x376b20 | anticipation | 1 |
| 0x2d5350 | sadness | 5 |
| 0x1cd5b0 | fear | 3 |
| 0x1d755c | joy | 4 |
| 0x1fde89 | anger | 0 |
| 0x33832f | trust | 7 |
| 0x2f4b5c | disgust | 2 |
| 0x1d5cff | surprise | 6 |
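
`LabelEncoder` assigns integer codes by sorting the class names, which is why `anger` maps to 0 and `trust` to 7 in the table above. A minimal plain-Python sketch of the same rule (the helper name `encode_labels` is hypothetical and only used for illustration):

```python
# Plain-Python illustration of how sorted class names become integer codes.
# `encode_labels` is a hypothetical helper, not part of scikit-learn.
emotions = ['anticipation', 'sadness', 'fear', 'joy',
            'anger', 'trust', 'disgust', 'surprise']

def encode_labels(names):
    """Map each distinct name to its rank in sorted order."""
    classes = sorted(set(names))
    return {name: code for code, name in enumerate(classes)}

mapping = encode_labels(emotions)
print(mapping)
```

Because the codes depend only on the sorted set of class names, the mapping stays the same regardless of the order in which the tweets appear.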


In [8]:
data.rename(columns={'emotion':'label_desc'},inplace=True)
data.rename(columns={'label_enc':'label'},inplace=True)

In [9]:
data.head()

| tweet_id | hashtags | text | score | crawldate | label_desc | identification | label |
|---|---|---|---|---|---|---|---|
| 0x376b20 | Snapchat | Snapchat People who post add me on Snapchat ... | 391 | 2015-05-23 11:42:47 | anticipation | train | 1 |
| 0x2d5350 | freepress Trump Legacy CNN | freepress Trump Legacy CNN brianklaas As we se... | 433 | 2016-01-28 04:52:09 | sadness | train | 5 |
| 0x1cd5b0 |  | Now ISSA is stalking Tasha 😂😂😂 LH | 376 | 2016-01-24 23:53:05 | fear | train | 3 |
| 0x1d755c | authentic Laugh Out Loud | authentic Laugh Out Loud RISKshow The Kevin A... | 120 | 2015-06-11 04:44:05 | joy | train | 4 |
| 0x2c91a8 |  | Still waiting on those supplies Liscus LH | 1021 | 2015-08-18 02:30:07 | anticipation | train | 1 |


In [10]:
small_data = data

In [11]:
sentences = small_data.text.values

#check distribution of data based on labels
print("Distribution of data based on labels: \n", small_data.label.value_counts())

# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 256

## Load the BERT tokenizer, which converts text into the token ids expected by the pretrained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',do_lower_case=True)
input_ids = [tokenizer.encode(sent, add_special_tokens=True,max_length=MAX_LEN,pad_to_max_length=True) for sent in sentences]
labels = small_data.label.values

print("Actual sentence before tokenization: ",sentences[2])
print("Encoded Input from dataset: ",input_ids[2])

## Create attention masks: 1 for every real token, 0 for every padding token
## (this works because the [PAD] token id is 0 in the BERT vocabulary)
attention_masks = [[float(i>0) for i in seq] for seq in input_ids]
print(attention_masks[2])

Distribution of data based on labels: 
 4    516017
1    248935
7    205478
5    193437
2    139101
3     63999
6     48729
0     39867
Name: label, dtype: int64


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


KeyboardInterrupt: 
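
The mask rule used in the cell above (`float(i>0)`) relies on the `[PAD]` token having id 0 in the `bert-base-uncased` vocabulary. A minimal sketch of padding and masking by hand, with made-up token ids (the helper `pad_and_mask` and the ids between `[CLS]` and `[SEP]` are illustrative assumptions, not real tokenizer output):

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def pad_and_mask(token_ids, max_len):
    """Truncate or right-pad a list of ids to max_len and build its mask.

    Note: this naive truncation may cut off the trailing [SEP];
    the real tokenizer keeps special tokens when it truncates.
    """
    ids = token_ids[:max_len]
    ids = ids + [PAD_ID] * (max_len - len(ids))
    mask = [float(i > 0) for i in ids]  # same rule as in the notebook
    return ids, mask

# 101 and 102 are [CLS] and [SEP]; the two ids in between are made up.
ids, mask = pad_and_mask([101, 2085, 2003, 102], max_len=8)
print(ids)
print(mask)
```

Every sequence ends up with exactly `max_len` entries, so a larger `MAX_LEN` means more padding positions per tweet for the model to process.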

In [None]:
train_inputs,validation_inputs,train_labels,validation_labels = train_test_split(input_ids,labels,random_state=41,test_size=0.1)
train_masks,validation_masks,_,_ = train_test_split(attention_masks,input_ids,random_state=41,test_size=0.1)
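
Both `train_test_split` calls above use `random_state=41` on arrays of the same length, so they apply the same shuffle and the masks stay aligned with the inputs. The idea can be sketched without scikit-learn by drawing one index permutation and applying it to every array. `split_aligned` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling:

```python
import random

def split_aligned(arrays, test_size, seed):
    """Split several equal-length lists using one shared permutation."""
    n = len(arrays[0])
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [([a[i] for i in train_idx], [a[i] for i in test_idx])
            for a in arrays]

# Toy data: row k of every list belongs to the same example.
inputs = [[10, 11], [20, 21], [30, 31], [40, 41], [50, 51]]
masks = [[1, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
labels = [0, 1, 2, 3, 4]

(train_x, val_x), (train_m, val_m), (train_y, val_y) = split_aligned(
    [inputs, masks, labels], test_size=0.2, seed=41)
print(len(train_x), len(val_x))
```

Passing all arrays to a single split call avoids relying on two separate calls happening to share the same seed.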

In [None]:
# convert all our data into torch tensors, required data type for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 32

# Wrap the tensors in a torch DataLoader so that training iterates over shuffled mini-batches.
# Note: TensorDataset keeps the full dataset in memory; only the batches moved to the GPU are small.
train_data = TensorDataset(train_inputs,train_masks,train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data,sampler=train_sampler,batch_size=batch_size)

validation_data = TensorDataset(validation_inputs,validation_masks,validation_labels)
validation_sampler = RandomSampler(validation_data)
validation_dataloader = DataLoader(validation_data,sampler=validation_sampler,batch_size=batch_size)

In [14]:
# Load BertForSequenceClassification, the pretrained BERT model with a single linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=8).to(device)

# Parameters:
lr = 2e-5
adam_epsilon = 1e-8

# Number of training epochs (authors recommend between 2 and 4)
epochs = 3

num_warmup_steps = 0
num_training_steps = len(train_dataloader)*epochs

### In Transformers, the optimizer and the scheduler are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr,eps=adam_epsilon,correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
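
With `num_warmup_steps = 0`, the schedule created in the cell above decays the learning rate linearly from `lr` to zero over `num_training_steps`. A minimal sketch of that rule (the function name `linear_lr` and the step counts are assumptions chosen for illustration, not the real number of batches):

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` updates under linear warmup then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

total_steps = 3000  # hypothetical: 1000 batches per epoch for 3 epochs
for epoch in (1, 2, 3):
    print(epoch, linear_lr(epoch * 1000, 2e-5, 0, total_steps))
```

Under this rule the rate is at two thirds of `lr` after the first of three epochs, one third after the second, and zero at the end of training.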

In [15]:
## Store our loss and accuracy for plotting
train_loss_set = []
learning_rate = []

# Gradients are accumulated by default, so clear them before training starts
model.zero_grad()

i = 0

# tqdm.notebook.trange is a tqdm progress-bar wrapper around the normal python range
for _ in tqdm.notebook.trange(1,epochs+1,desc='Epoch'):
  print("<" + "="*22 + F" Epoch {_} "+ "="*22 + ">")
  # Calculate total loss for this epoch
  batch_loss = 0

  for step, batch in enumerate(train_dataloader):
    # Set our model to training mode (as opposed to evaluation mode)
    model.train()
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch

    # Forward pass
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    loss = outputs[0]
    
    # Backward pass
    loss.backward()
    
    # Clip the norm of the gradients to 1.0
    # Gradient clipping is not in AdamW anymore
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    
    # Update learning rate schedule
    scheduler.step()

    # Clear the previous accumulated gradients
    optimizer.zero_grad()
    
    # Update tracking variables
    batch_loss += loss.item()

  # Calculate the average loss over the training data.
  avg_train_loss = batch_loss / len(train_dataloader)

  #store the current learning rate
  for param_group in optimizer.param_groups:
    print("\n\tCurrent Learning rate: ",param_group['lr'])
    learning_rate.append(param_group['lr'])
    
  train_loss_set.append(avg_train_loss)
  print(F'\n\tAverage Training loss: {avg_train_loss}')
    
  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Tracking variables 
  eval_accuracy,eval_mcc_accuracy,nb_eval_steps = 0, 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    
    # Move logits and labels to CPU
    logits = logits[0].to('cpu').numpy()
    label_ids = b_labels.to('cpu').numpy()

    pred_flat = np.argmax(logits, axis=1).flatten()
    labels_flat = label_ids.flatten()
    
    # Record the current epoch number (the loop variable `_`), not the total number of epochs
    df_metrics=pd.DataFrame({'Epoch':_,'Actual_class':labels_flat,'Predicted_class':pred_flat})
    
    tmp_eval_accuracy = accuracy_score(labels_flat,pred_flat)
    tmp_eval_mcc_accuracy = matthews_corrcoef(labels_flat, pred_flat)
    
    eval_accuracy += tmp_eval_accuracy
    eval_mcc_accuracy += tmp_eval_mcc_accuracy
    nb_eval_steps += 1

  print(F'\n\tValidation Accuracy: {eval_accuracy/nb_eval_steps}')
  print(F'\n\tValidation MCC Accuracy: {eval_mcc_accuracy/nb_eval_steps}')

  i += 1
  model_save_folder = 'model/'
  tokenizer_save_folder = 'tokenizer/'

  path_model = f'./kaggle/working/{model_save_folder}_{i}'
  path_tokenizer = f'./kaggle/working/{tokenizer_save_folder}_{i}'

  #create the dir

  !mkdir -p {path_model}
  !mkdir -p {path_tokenizer}

  ## Now let's save our model and tokenizer to a directory
  model.save_pretrained(path_model)
  tokenizer.save_pretrained(path_tokenizer)

  # Also save the raw state dict inside the same per-epoch folder
  model_save_name = 'fineTuneModel.pt'
  path = f'{path_model}/{model_save_name}'
  torch.save(model.state_dict(),path)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]


	Current Learning rate:  1.3333333333333333e-05

	Average Training loss: 1.0666608575488914

	Validation Accuracy: 0.6496505642265699

	Validation MCC Accuracy: 0.5505549229091288

	Current Learning rate:  6.666666666666667e-06

	Average Training loss: 0.8855636883391453

	Validation Accuracy: 0.6616001423651456

	Validation MCC Accuracy: 0.5661935263752111

	Current Learning rate:  0.0

	Average Training loss: 0.7615322709681235

	Validation Accuracy: 0.6620499403322551

	Validation MCC Accuracy: 0.5691702111614892
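
The validation loop above reports the mean of the per-batch accuracy and MCC values. When the last batch is smaller than the others, this mean of batch scores can differ from the score computed over all predictions at once. A small sketch with made-up labels and predictions, using accuracy only:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions, split into batches of size 4, 4 and 2.
y_true = [4, 1, 7, 5, 4, 4, 2, 1, 0, 3]
y_pred = [4, 1, 4, 5, 4, 1, 2, 1, 4, 4]
batch_size = 4

batches = [(y_true[i:i + batch_size], y_pred[i:i + batch_size])
           for i in range(0, len(y_true), batch_size)]

mean_of_batches = sum(accuracy(t, p) for t, p in batches) / len(batches)
overall = accuracy(y_true, y_pred)
print(mean_of_batches, overall)
```

Collecting all predictions and labels first and scoring them once gives a figure that does not depend on the batch size.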


In [16]:
data[['label','label_desc']].drop_duplicates(keep='first')

| tweet_id | label | label_desc |
|---|---|---|
| 0x376b20 | 1 | anticipation |
| 0x2d5350 | 5 | sadness |
| 0x1cd5b0 | 3 | fear |
| 0x1d755c | 4 | joy |
| 0x1fde89 | 0 | anger |
| 0x33832f | 7 | trust |
| 0x2f4b5c | 2 | disgust |
| 0x1d5cff | 6 | surprise |


In [17]:
## emotion labels
label2int = {
  "anticipation": 1,
  "sadness": 5,
  "fear": 3,
  "joy": 4,
  "anger": 0,
  "trust": 7,
  "disgust": 2,
  "surprise": 6
}

In [18]:
model_save_folder = 'model/'
tokenizer_save_folder = 'tokenizer/'

i = 4

path_model = F'./kaggle/working/{model_save_folder}_{i}'
path_tokenizer = F'./kaggle/working/{tokenizer_save_folder}_{i}'

#create the dir

!mkdir -p {path_model}
!mkdir -p {path_tokenizer}

## Now let's save our model and tokenizer to a directory
model.save_pretrained(path_model)
tokenizer.save_pretrained(path_tokenizer)

model_save_name = 'fineTuneModel.pt'
# Save the state dict inside the versioned model folder created above
path = f'{path_model}/{model_save_name}'
torch.save(model.state_dict(),path)

## Load models and test data

In [4]:
## Now load the model and tokenizer from the directory
model_save_folder = 'model2/'
tokenizer_save_folder = 'tokenizer2/'

i = 1

path_model = F'./kaggle/working/{model_save_folder}_{i}'
path_tokenizer = F'./kaggle/working/{tokenizer_save_folder}_{i}'

model = BertForSequenceClassification.from_pretrained(path_model, num_labels=8).to(device)
tokenizer = BertTokenizer.from_pretrained(path_tokenizer)

In [5]:
## Test the model
test = pd.read_pickle('data/test_pd.pkl')

test.head()

| tweet_id | hashtags | text | score | crawldate | emotion | identification |
|---|---|---|---|---|---|---|
| 0x28b412 | bibleverse | bibleverse Confident of your obedience I writ... | 232 | 2017-12-25 04:39:20 |  | test |
| 0x2de201 |  | Trust is not the same as faith A friend is so... | 989 | 2016-01-08 17:18:59 |  | test |
| 0x218443 | materialism money possessions | materialism money possessions When do you hav... | 66 | 2015-09-09 09:22:55 |  | test |
| 0x2939d5 | Gods Plan Gods Work | Gods Plan Gods Work God woke you up now chas... | 104 | 2015-10-10 14:33:26 |  | test |
| 0x26289a |  | In these tough times who do YOU turn to as yo... | 310 | 2016-10-23 08:49:50 |  | test |


In [6]:
## emotion labels
label2int = {
  "anticipation": 1,
  "sadness": 5,
  "fear": 3,
  "joy": 4,
  "anger": 0,
  "trust": 7,
  "disgust": 2,
  "surprise": 6
}

In [7]:
MAX_LEN = 512

sentences = test.text.values

input_ids = [tokenizer.encode(sent, add_special_tokens=True,max_length=MAX_LEN,pad_to_max_length=True) for sent in sentences]

## Create attention masks: 1 for every real token, 0 for every padding token
## (this works because the [PAD] token id is 0 in the BERT vocabulary)
attention_masks = [[float(i>0) for i in seq] for seq in input_ids]

input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)


test_dataset = TensorDataset(input_ids, attention_masks)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [8]:
a = []
for batch in test_dataloader:
	# Add batch to GPU
	batch = tuple(t.to(device) for t in batch)
	# Unpack the inputs from our dataloader
	b_input_ids, b_input_mask = batch
	# Disable gradient tracking to save memory and speed up inference
	with torch.no_grad():
		# Forward pass, calculate logit predictions
		logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

	# Move logits to CPU and pick the highest-scoring class for each tweet
	logits = logits[0].to('cpu').numpy()
	pred_flat = np.argmax(logits, axis=1).flatten()
	a.append(pred_flat)

print(len(a))
print(a[:10])


12875
[array([1, 7, 4, 1, 7, 4, 3, 0, 4, 5, 4, 1, 5, 4, 4, 4, 4, 2, 4, 4, 5, 4,
       4, 5, 4, 4, 2, 0, 4, 4, 5, 5]), array([5, 4, 5, 5, 5, 4, 4, 4, 1, 1, 4, 4, 1, 4, 5, 4, 2, 4, 4, 5, 2, 4,
       5, 4, 5, 5, 2, 6, 5, 7, 7, 2]), array([4, 1, 4, 1, 4, 4, 4, 4, 5, 1, 1, 5, 4, 5, 4, 5, 4, 7, 2, 4, 4, 4,
       4, 5, 5, 4, 1, 1, 4, 2, 7, 5]), array([5, 4, 1, 2, 4, 2, 7, 5, 2, 4, 5, 5, 1, 1, 7, 2, 1, 4, 4, 4, 5, 5,
       2, 7, 5, 5, 5, 4, 2, 1, 5, 5]), array([5, 1, 5, 4, 5, 2, 4, 1, 4, 4, 1, 4, 7, 2, 1, 4, 4, 1, 5, 2, 4, 7,
       5, 4, 5, 2, 5, 4, 4, 3, 4, 2]), array([4, 4, 7, 1, 5, 5, 5, 4, 7, 5, 5, 5, 5, 4, 4, 7, 2, 7, 2, 2, 4, 5,
       5, 5, 4, 7, 5, 6, 5, 3, 4, 4]), array([1, 5, 5, 7, 2, 1, 1, 2, 4, 5, 5, 1, 1, 4, 4, 3, 5, 5, 4, 5, 2, 5,
       4, 5, 5, 2, 4, 5, 4, 2, 1, 4]), array([3, 5, 1, 4, 2, 1, 2, 4, 1, 1, 0, 4, 5, 4, 5, 4, 7, 4, 5, 4, 4, 7,
       4, 7, 5, 1, 5, 4, 2, 4, 4, 4]), array([1, 5, 3, 5, 5, 5, 4, 5, 1, 2, 1, 5, 7, 7, 5, 4, 5, 5, 4, 4, 4, 2,
       7, 2, 5, 7, 1, 2,

In [9]:
# output result
test['label'] = np.concatenate(a)
test.head()

| tweet_id | hashtags | text | score | crawldate | emotion | identification | label |
|---|---|---|---|---|---|---|---|
| 0x28b412 | bibleverse | bibleverse Confident of your obedience I writ... | 232 | 2017-12-25 04:39:20 |  | test | 1 |
| 0x2de201 |  | Trust is not the same as faith A friend is so... | 989 | 2016-01-08 17:18:59 |  | test | 7 |
| 0x218443 | materialism money possessions | materialism money possessions When do you hav... | 66 | 2015-09-09 09:22:55 |  | test | 4 |
| 0x2939d5 | Gods Plan Gods Work | Gods Plan Gods Work God woke you up now chas... | 104 | 2015-10-10 14:33:26 |  | test | 1 |
| 0x26289a |  | In these tough times who do YOU turn to as yo... | 310 | 2016-10-23 08:49:50 |  | test | 7 |


In [10]:
# change label to label_desc
inv_map = {v: k for k, v in label2int.items()}

test['label_desc'] = test['label'].map(lambda x: inv_map[x])
test.index.names = ['id']
test.drop(columns=['emotion'],inplace=True)
test.rename(columns={'label_desc':'emotion'},inplace=True)
test.head()

| id | hashtags | text | score | crawldate | identification | label | emotion |
|---|---|---|---|---|---|---|---|
| 0x28b412 | bibleverse | bibleverse Confident of your obedience I writ... | 232 | 2017-12-25 04:39:20 | test | 1 | anticipation |
| 0x2de201 |  | Trust is not the same as faith A friend is so... | 989 | 2016-01-08 17:18:59 | test | 7 | trust |
| 0x218443 | materialism money possessions | materialism money possessions When do you hav... | 66 | 2015-09-09 09:22:55 | test | 4 | joy |
| 0x2939d5 | Gods Plan Gods Work | Gods Plan Gods Work God woke you up now chas... | 104 | 2015-10-10 14:33:26 | test | 1 | anticipation |
| 0x26289a |  | In these tough times who do YOU turn to as yo... | 310 | 2016-10-23 08:49:50 | test | 7 | trust |
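
Inverting `label2int` with a dict comprehension is only safe because every emotion has a distinct integer code. A short sketch that checks this and decodes the first five predicted ids shown in the table above (the check itself is an addition for illustration, not part of the original pipeline):

```python
label2int = {"anticipation": 1, "sadness": 5, "fear": 3, "joy": 4,
             "anger": 0, "trust": 7, "disgust": 2, "surprise": 6}

# The inverse mapping is well defined only if no two labels share a code.
assert len(set(label2int.values())) == len(label2int)
inv_map = {v: k for k, v in label2int.items()}

predicted = [1, 7, 4, 1, 7]  # class ids of the first five test tweets
decoded = [inv_map[p] for p in predicted]
print(decoded)
```

If two labels ever shared a code, the comprehension would silently keep only one of them, so the explicit check fails early instead.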


In [11]:
test.to_csv(f'output_{i}.csv', columns = ['emotion'], index=True)