### Student Information
Name: 陳培熹

Student ID: 113065425

GitHub ID: Tedious8

Kaggle name: Tadeus

Kaggle private scoreboard snapshot:

![Scoreboard snapshot](./img/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants and x is your rank (i.e., if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%).
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained.


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
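
The leaderboard rule in item 2 can be worked through in a few lines. This is only a sketch of the arithmetic as stated above: the function name `leaderboard_points` is hypothetical, and the reading that ranks beyond 0.6N receive the flat 20 points is an assumption.

```python
def leaderboard_points(n_participants: int, rank: int) -> float:
    """Points (out of 30) for the Kaggle part, following the stated formula."""
    cutoff = 0.6 * n_participants  # assumed boundary between formula and flat score
    if rank > cutoff:
        return 20.0  # assumed flat score for ranks below the boundary
    return (cutoff + 1 - rank) / cutoff * 10 + 20


print(round(leaderboard_points(100, 3), 2))  # 29.67, the example from the instructions
print(round(leaderboard_points(100, 1), 2))  # 30.0
```

Under this reading the score is continuous at the boundary: rank 60 of 100 earns about 20.17 points and rank 61 earns 20.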

DISCLAIMER: THIS CODE WAS RUN ON A KAGGLE NOTEBOOK WITH A P100 GPU

# 1. Import libraries

In [None]:
# Import necessary libraries
import pandas as pd
from tqdm import tqdm
import json
from transformers import TFAutoModelForSequenceClassification, DataCollatorWithPadding, AutoTokenizer, AutoConfig, create_optimizer
from transformers.keras_callbacks import KerasMetricCallback
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from datasets import Dataset
import tensorflow as tf

# 2. Preprocessing

Skip to Section 2.2 if the data has already been preprocessed.

## 2.1 Preprocess the given data

In [None]:
# Load datasets
data_identification = pd.read_csv('/kaggle/input/dm-2024-isa-5810-lab-2-homework/data_identification.csv')
emotion = pd.read_csv('/kaggle/input/dm-2024-isa-5810-lab-2-homework/emotion.csv')
with open('/kaggle/input/dm-2024-isa-5810-lab-2-homework/tweets_DM.json', 'r', encoding='utf-8') as file:
    tweets_DM = []
    for line in tqdm(file):
        # Parse each line as a JSON object
        json_object = json.loads(line)
        tweets_DM.append(json_object)

In [None]:
# Merge dataframes
df = pd.merge(data_identification, emotion, on='tweet_id', how='left')
df.head()

In [None]:
# Extract tweet information
tweets_df = []
for tweet in tqdm(tweets_DM):
    tweet_info = tweet['_source']['tweet']
    tweet_info['_score'] = tweet['_score']
    tweet_info['_index'] = tweet['_index']
    tweet_info['_crawldate'] = tweet['_crawldate']
    tweet_info['_type'] = tweet['_type']
    tweets_df.append(tweet_info)

# Convert to DataFrame and drop unnecessary columns
tweets_df = pd.DataFrame(tweets_df)
tweets_df.drop(columns=['_index', '_type'], inplace=True)
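
As an alternative to the explicit loop above, `pandas.json_normalize` can flatten nested records in one call. The sketch below uses two made-up records that mimic the key layout the loop reads (`_source` → `tweet`, plus `_score`, `_index`, `_crawldate`, `_type`); all field values are assumptions for illustration only.

```python
import pandas as pd

# Toy records with the same nesting the loop above reads (values are made up)
toy_tweets = [
    {"_score": 10, "_index": "idx", "_crawldate": "2024-01-01", "_type": "tweets",
     "_source": {"tweet": {"tweet_id": "0x1", "text": "first tweet"}}},
    {"_score": 20, "_index": "idx", "_crawldate": "2024-01-02", "_type": "tweets",
     "_source": {"tweet": {"tweet_id": "0x2", "text": "second tweet"}}},
]

# Flatten nested dicts; nested keys become dotted column names
flat = pd.json_normalize(toy_tweets)

# Strip the '_source.tweet.' prefix so the columns match the loop's output
flat.columns = [col.replace("_source.tweet.", "") for col in flat.columns]
flat = flat.drop(columns=["_index", "_type"])

print(sorted(flat.columns))  # ['_crawldate', '_score', 'text', 'tweet_id']
```

Whether this is faster than the loop on the full file has not been measured here; it mainly saves a few lines.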

In [None]:
# Merge tweet information into the main dataframe
df = pd.merge(df, tweets_df, on='tweet_id')

In [None]:
# Separate data into training and testing datasets
raw_train = df[df['identification'] == 'train'][['tweet_id', 'emotion', 'text']]
raw_test = df[df['identification'] == 'test'][['tweet_id', 'emotion', 'text']]
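
The left merge and the train/test split above can be checked on a toy example. The frames below are made-up stand-ins for `data_identification` and `emotion`; the sketch assumes that `emotion.csv` holds labels only for training tweets, which is why the test rows end up with a missing `emotion`.

```python
import pandas as pd

# Toy stand-ins for data_identification and emotion (values are made up)
ident = pd.DataFrame({"tweet_id": ["a", "b", "c"],
                      "identification": ["train", "test", "train"]})
labels = pd.DataFrame({"tweet_id": ["a", "c"], "emotion": ["joy", "anger"]})

# A left merge keeps every row of `ident`; unlabeled tweets get NaN emotion.
# validate="one_to_one" raises if tweet_id is duplicated on either side.
merged = pd.merge(ident, labels, on="tweet_id", how="left", validate="one_to_one")

train_part = merged[merged["identification"] == "train"]
test_part = merged[merged["identification"] == "test"]

print(train_part["emotion"].notna().all())  # True: every training row has a label
print(test_part["emotion"].isna().all())    # True: test rows carry no label
```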

In [None]:
# Save datasets to pickles
raw_train.to_pickle('raw_train.pkl')
raw_test.to_pickle('raw_test.pkl')

## 2.2 Load the preprocessed data

In [None]:
# Load preprocessed datasets
raw_train = pd.read_pickle('/kaggle/input/dm2024-lab2-homework-data/raw_train.pkl')
raw_test = pd.read_pickle('/kaggle/input/dm2024-lab2-homework-data/raw_test.pkl')

# 3. Training using Cardiffnlp model

For the first training run, load the pre-trained model directly from Cardiffnlp:

In [None]:
# Specify the pre-trained model
MODEL = 'cardiffnlp/twitter-roberta-base-sentiment-latest'

For later runs, load the saved model instead:

In [None]:
# Kaggle model path
MODEL = '/kaggle/input/dm2024-lab2-homework/model/'

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)


In [None]:
# Encode emotion labels
encoder = OneHotEncoder()
encoder.fit(np.array(raw_train.emotion).reshape(-1, 1))

In [None]:
# Define preprocessing function for tokenization and encoding labels
def preprocess_function(examples):
    # Tokenize text with truncation
    out = tokenizer(examples['text'], truncation=True)

    # Encode labels and get class indices
    out['label'] = encoder.transform(np.array(examples['emotion']).reshape(-1, 1)).toarray().argmax(axis=1)
    
    return out
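
The label step in `preprocess_function` one-hot encodes each emotion and then takes the argmax, which amounts to mapping each emotion to its position in the sorted list of categories. The sketch below shows the same mapping with plain NumPy on a made-up sample; it is an illustration of the idea, not the code the notebook runs.

```python
import numpy as np

emotions = np.array(["joy", "anger", "trust", "joy", "fear"])  # made-up sample

# np.unique returns the sorted unique values and, with return_inverse=True,
# the index of each element within that sorted array
categories, label_ids = np.unique(emotions, return_inverse=True)

print(categories.tolist())  # ['anger', 'fear', 'joy', 'trust']
print(label_ids.tolist())   # [2, 0, 3, 2, 1]
```

Because both approaches sort the categories, the integer ids line up with the `label2id` mapping built later in the notebook.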

In [None]:
# Create Hugging Face Dataset and tokenize it
ds = Dataset.from_pandas(raw_train[['text', 'emotion']])
tokenized_ds = ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/1455563 [00:00<?, ? examples/s]

In [None]:
# Split dataset into training and validation sets
# (fixed seed so that separate training runs reuse the same validation split)
tokenized_ds = tokenized_ds.train_test_split(test_size=0.1, seed=42)

In [None]:
# Create a data collator for padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
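
Since the tokenizer call above truncates but does not pad, padding happens per batch in the collator. The sketch below illustrates that idea in NumPy: each batch is padded only to its own longest sequence. The helper `pad_batch`, the token ids, and `pad_id=0` are hypothetical; the real collator takes the padding token from the tokenizer.

```python
import numpy as np


def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # copy real tokens
        attention_mask[row, :len(seq)] = 1    # mark real tokens as attended
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[101, 7592, 102], [101, 102]])
print(batch_ids.shape)      # (2, 3)
print(batch_mask.tolist())  # [[1, 1, 1], [1, 1, 0]]
```

Padding per batch rather than to a global maximum length keeps short tweets from being padded to the length of the longest tweet in the dataset.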

In [None]:
# Install and import evaluation metric
%pip install evaluate -q
import evaluate
accuracy = evaluate.load("accuracy")


In [None]:
# Define metric computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class indices
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
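
What `compute_metrics` does can be reproduced with NumPy alone: take the argmax of the logits per row and compare it with the reference labels. The function name `accuracy_from_logits` and the toy logits below are made up for illustration.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}


toy_logits = np.array([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.0, 0.9, 0.8]])
toy_labels = np.array([0, 2, 1, 2])

print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```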

In [None]:
# Create label mappings for the model
categories = encoder.categories_[0]  # Get the list of unique categories
label2id = {label: idx for idx, label in enumerate(categories)}
id2label = {idx: label for label, idx in label2id.items()}
print("label2id:", label2id)
print("id2label:", id2label)

label2id: {'anger': 0, 'anticipation': 1, 'disgust': 2, 'fear': 3, 'joy': 4, 'sadness': 5, 'surprise': 6, 'trust': 7}
id2label: {0: 'anger', 1: 'anticipation', 2: 'disgust', 3: 'fear', 4: 'joy', 5: 'sadness', 6: 'surprise', 7: 'trust'}


Note: I did not train for 5 epochs in a single run. Instead, I trained for one epoch at a time, in three separate runs.

In [None]:
# Configure training parameters
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_ds["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
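
The step arithmetic above can be worked through by hand. The sketch below assumes roughly 90% of the 1,455,563 training rows end up in the training split, and models the schedule as a plain linear decay to zero with no warmup, which is my understanding of the `create_optimizer` defaults. The helper `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=1):
    """Linearly decay from init_lr to 0 over total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)


n_train = int(1_455_563 * 0.9)      # approximate size of the training split
steps_per_epoch = n_train // 16     # batch_size = 16
total_steps = steps_per_epoch * 5   # num_epochs = 5

print(steps_per_epoch, total_steps)  # 81875 409375
# Learning rate after one epoch of a five-epoch schedule: roughly 1.6e-05
print(linear_decay_lr(steps_per_epoch, total_steps=total_steps))
```

One consequence worth noting: if each one-epoch run recreates the optimizer with a five-epoch schedule, the learning rate restarts at 2e-5 every time and only decays about a fifth of the way.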

The two cells below set up the model for the first training run; skip them on later runs.

In [None]:
# Load model configuration and set label mappings
config = AutoConfig.from_pretrained(MODEL)
config.id2label = id2label
config.label2id = label2id
config.num_labels = 8

In [None]:
# Load pre-trained model
model = TFAutoModelForSequenceClassification.from_pretrained(
    MODEL, config=config, ignore_mismatched_sizes=True
)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
Some weights of TFDistilBertForSequenceClassification were no

The cell below is used only once a saved model exists on Kaggle, that is, after the first training run.

In [None]:
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
# Prepare datasets for TensorFlow (reuse batch_size from the training config)
tf_train_set = model.prepare_tf_dataset(
    tokenized_ds["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_ds["test"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [None]:
# Compile the model
model.compile(optimizer=optimizer)

In [None]:
# Add metric evaluation callback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
callbacks = [metric_callback]

In [None]:
# Train the model
model.fit(x=tf_train_set, 
          validation_data=tf_validation_set, 
          epochs=num_epochs, 
          callbacks=callbacks)

Cause: for/else statement not yet supported


I0000 00:00:1732457295.188817     112 service.cc:145] XLA service 0x7f1edf93de70 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1732457295.188872     112 service.cc:153]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
I0000 00:00:1732457295.188878     112 service.cc:153]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
I0000 00:00:1732457295.389783     112 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.




<tf_keras.src.callbacks.History at 0x7f1f31f82320>

In [None]:
# Save the trained model and tokenizer
model.save_pretrained("/kaggle/working/model")
tokenizer.save_pretrained("/kaggle/working/model")

('/kaggle/working/model/tokenizer_config.json',
 '/kaggle/working/model/special_tokens_map.json',
 '/kaggle/working/model/vocab.txt',
 '/kaggle/working/model/added_tokens.json',
 '/kaggle/working/model/tokenizer.json')

In [None]:
# Preprocess test dataset
def preprocess_test(examples):
    return tokenizer(examples['text'], truncation=True)

test_ds = Dataset.from_pandas(raw_test[['text']])
tokenized_test_ds = test_ds.map(preprocess_test, batched=True)

Map:   0%|          | 0/411972 [00:00<?, ? examples/s]

In [None]:
# Prepare test set for TensorFlow
tf_test_set = model.prepare_tf_dataset(
    tokenized_test_ds,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

In [None]:
# Get the predictions
predictions = model.predict(tf_test_set)

# Convert logits to predicted class indices
predicted_class_indices = np.argmax(predictions.logits, axis=1)

# Map class indices to their corresponding labels
predicted_labels = [id2label[idx] for idx in predicted_class_indices]

 1279/25749 [>.............................] - ETA: 12:08

In [None]:
# Create submission file
# Copy only the id column so the assignment below does not write into a slice of raw_test
submission = raw_test[['tweet_id']].copy()
submission['emotion'] = predicted_labels
submission = submission.rename(columns={'tweet_id': 'id'})
submission.to_csv('submission.csv', index=False)
submission.head()
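
Before uploading, the submission file can be sanity-checked. The sketch below is a hypothetical helper, `check_submission`, run on a made-up two-row frame; it assumes the expected columns are `id` and `emotion` (as written by the cell above) and that valid labels are the eight emotions printed in the label mapping.

```python
import pandas as pd

VALID_EMOTIONS = {"anger", "anticipation", "disgust", "fear",
                  "joy", "sadness", "surprise", "trust"}


def check_submission(sub: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(sub.columns) == ["id", "emotion"], "unexpected columns"
    assert len(sub) == expected_rows, "row count mismatch"
    assert sub["id"].is_unique, "duplicate ids"
    assert set(sub["emotion"]) <= VALID_EMOTIONS, "unknown label"


toy_submission = pd.DataFrame({"id": ["0x1", "0x2"], "emotion": ["joy", "fear"]})
check_submission(toy_submission, expected_rows=2)
print("submission looks well-formed")
```

On the real data, `expected_rows` would be the number of test tweets, for example `len(raw_test)`.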