# CS 335: Introduction to Large Language Models
## Assignment 01
### **Total Marks**: 100
### **Deadline**: Sunday, 3rd March, 2024, 11:59 PM
### **Name**: Muhammad Talha
### **ID**: mj06974

#Instructions

1. Please rename your notebook as *Assignment_1_aa1234.ipynb* before the final submission. Notebooks which do not follow appropriate naming convention will not be graded.

2. Please submit your own work. If you have any questions, please feel free to reach out to the course instructors or RA.



# Assignment Overview

In this assignment, you are required to fine tune a LLM model of your that classifies which human value category a textual arguement belongs to. Your model will evaluated against 1-baseline, random-baseline results on the following dataset: test, Nahjalbalagha, Zhihu


# Setup



In [None]:
# IMPORT ALL YOUR LIBRARIES
# SUGGESTED LIBRARIES
import torch
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score

# Download Files


##Evaluator

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://raw.githubusercontent.com/touche-webis-de/touche-code/main/semeval23/human-value-detection/evaluator/evaluator.py

## 1-Baseline

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://raw.githubusercontent.com/touche-webis-de/touche-code/main/semeval23/human-value-detection/1-baseline/1-baseline.py

## Random-Baseline

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://raw.githubusercontent.com/touche-webis-de/touche-code/main/semeval23/human-value-detection/random-baseline/random-baseline.py

## Dataset Files

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://zenodo.org/api/records/10564870/files-archive

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!unzip files-archive -d Dataset

In [None]:
!mkdir Dataset/zhihu
!mkdir Dataset/nahjalbalagha
!mkdir Dataset/train
!mkdir Dataset/test
!mkdir Dataset/validation

In [None]:
!mv Dataset/*-zhihu.tsv Dataset/zhihu
!mv Dataset/*-nahjalbalagha.tsv Dataset/nahjalbalagha
!mv Dataset/*-training.tsv Dataset/train
!mv Dataset/*-test.tsv Dataset/test
!mv Dataset/*-validation.tsv Dataset/validation

# Background Information

## Human Value Detection 2023 <br/>
## SemEval 2023 Task 4. ValueEval: Identification of Human Values behind Arguments



Given a textual argument and a human value category, classify whether or not the argument draws on that category. This task uses a set of 20 value categories compiled from the social science literature and described in our [ACL paper](https://webis.de/publications.html#kiesel_2022b). Arguments are given as premise text, conclusion text, and binary stance of the premise to the conclusion ("in favor of" or "against").

The 20 value categories are shown here on Schwartz' value continuum below:

[![JEPBxUu.md.png](https://iili.io/JEPBxUu.md.png)](https://freeimage.host/i/JEPBxUu)







## Data


Data is provided as tab-separated values files with one header line. The arguments-validation.tsv files contain one argument per line: its unique argument ID, the conclusion, the premise's stance towards the conclusion, and the premise itself. Example with tab-separated columns are shown below

<pre><span class="column">Argument ID</span>	<span class="column">Conclusion</span>	<span class="column">Stance</span>	<span class="column">Premise</span>
<span class="column">A01010</span>	<span class="column">We should prohibit school prayer</span>	<span class="column">against</span>	<span class="column">it should be allowed if the student wants to pray as long as it is not interfering with his classes</span>
<span class="column">A01011</span>	<span class="column">We should abolish the three-strikes laws</span>	<span class="column">in favor of</span>	<span class="column">three strike laws can cause young people to be put away for life without a chance to straight out their life</span>
<span class="column">A01012</span>	<span class="column">The use of public defenders should be mandatory</span>	<span class="column">in favor of</span>	<span class="column">the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't</span>
</pre>

The labels-validation.tsv  files also contain one argument per line: its unique argument ID and one column for each of the 20 value categories with a 1 meaning that the argument resorts to the value category and a 0 that not. Example with tab-separated columns are shown below:

<pre><span class="column">Argument ID</span>	<span class="column">Self-direction: thought</span>	<span class="column">Self-direction: action</span>	<span class="column">Stimulation</span>	<span class="column">Hedonism</span>	<span class="column">Achievement</span>	<span class="column">Power: dominance</span>	<span class="column">Power: resources</span>	<span class="column">Face</span>	<span class="column">Security: personal</span>	<span class="column">Security: societal</span>	<span class="column">Tradition</span>	<span class="column">Conformity: rules</span>	<span class="column">Conformity: interpersonal</span>	<span class="column">Humility</span>	<span class="column">Benevolence: caring</span>	<span class="column">Benevolence: dependability</span>	<span class="column">Universalism: concern</span>	<span class="column">Universalism: nature</span>	<span class="column">Universalism: tolerance</span>	<span class="column">Universalism: objectivity</span>
<span class="column">A01010</span>	<span class="column">1</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>
<span class="column">A01011</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">1</span>
<span class="column">A01012</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>
</pre>

In addition, there are other datasets for evaluating the robustness of our model: validation-zhihu from the recommendation and hotlist section of the Chinese question-answering website Zhihu, test-nahjalbalagha from and based on the Nahj al-Balagha.



## Evaluation

Runs are evaluated on the basis of F1-score, Precision, and Recall: averaged over all value categories and for each category individually.

## Baseline Results

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!mkdir baseline
!mkdir output

### 1-Baseline

#### Test Dataset

In [None]:
# DO NOT EDIT
!python3 /content/1-baseline.py --inputDataset /content/Dataset/test --outputDataset /content/baseline
!python3 evaluator.py --inputDataset /content/Dataset/test/ --inputRun /content/baseline --outputDataset /content/output
!head -n 12 /content/output/evaluation.prototext

#### Zhihu

In [None]:
# DO NOT EDIT
!python3 /content/1-baseline.py --inputDataset /content/Dataset/zhihu/ --outputDataset /content/baseline
!python3 evaluator.py --inputDataset /content/Dataset/zhihu/ --inputRun /content/baseline --outputDataset /content/output
!head -n 12 /content/output/evaluation.prototext

#### Nahjalbalagha

In [None]:
# DO NOT EDIT
!python3 /content/1-baseline.py --inputDataset /content/Dataset/nahjalbalagha/ --outputDataset /content/baseline
!python3 evaluator.py --inputDataset /content/Dataset/nahjalbalagha/ --inputRun /content/baseline --outputDataset /content/output
!head -n 12 /content/output/evaluation.prototext

### Random-Baseline


#### Test

In [None]:
# DO NOT EDIT
!python3 random-baseline.py --inputDataset Dataset/test --outputDataset baseline
!python3 evaluator.py --inputDataset Dataset/test/ --inputRun baseline1 --outputDataset output
!head -n 12 output/evaluation.prototext

#### Zhihu


In [None]:
# DO NOT EDIT
!python3 random-baseline.py --inputDataset Dataset/zhihu/ --outputDataset baseline
!python3 evaluator.py --inputDataset Dataset/zhihu/ --inputRun baseline --outputDataset output
!head -n 12 output/evaluation.prototext

#### Nahjalbalagha

In [None]:
# DO NOT EDIT
!python3 random-baseline.py --inputDataset Dataset/nahjalbalagha/ --outputDataset baseline
!python3 evaluator.py --inputDataset Dataset/nahjalbalagha/ --inputRun baseline --outputDataset output
!head -n 12 output/evaluation.prototext



# Tasks

## [20 Points] Task 01 - Load Datasets

In this task, you are required to load the Training, Test, Validation, Nahjalbalagha & Zhihu into seperate dataframes.

In [39]:
# IMPORT ALL YOUR LIBRARIES
# SUGGESTED LIBRARIES
import torch
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification,BertModel
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score
from torch import cuda
import math
import torch
from transformers import pipeline, BertForSequenceClassification, BertTokenizerFast
from torch.utils.data import Dataset
import evaluate


In [40]:
# Importing files into dataframes:  naming convention: 
# dataset_arguments = the data
# dataset_result_sem_eval = the answers to the data
# dataset_result_level_1 = don't know what that is right now, maybe some answers for some other task. 

# Nahjalbalagah
nahjalbalagha_arguments = pd.read_csv("Dataset/nahjalbalagha/arguments-test-nahjalbalagha.tsv",sep="\t")
nahjalbalagha_result_sem_eval= pd.read_csv("Dataset/nahjalbalagha/labels-test-nahjalbalagha.tsv",sep="\t")
nahjalbalagha_result_level_1= pd.read_csv("Dataset/nahjalbalagha/level1-labels-test-nahjalbalagha.tsv",sep="\t")
# Zhihu 
zhihu_arguments = pd.read_csv("Dataset/zhihu/arguments-validation-zhihu.tsv",sep="\t")
zhihu_result_sem_eval = pd.read_csv("Dataset/zhihu/labels-validation-zhihu.tsv",sep="\t")
zhihu_result_level_1 = pd.read_csv("Dataset/zhihu/level1-labels-validation-zhihu.tsv",sep="\t")

# Testing 
test_arguments = pd.read_csv("Dataset/test/arguments-test.tsv",sep="\t")
test_result_sem_eval = pd.read_csv("Dataset/test/labels-test.tsv",sep="\t")
test_result_level_1 = pd.read_csv("Dataset/test/level1-labels-test.tsv",sep="\t")

# Training Dataset
training_arguments = pd.read_csv("Dataset/train/arguments-training.tsv",sep="\t")
training_result_sem_eval = pd.read_csv("Dataset/train/labels-training.tsv",sep="\t")
training_result_level_1 = pd.read_csv("Dataset/train/level1-labels-training.tsv",sep="\t")

# Validation 
validation_arguments = pd.read_csv("Dataset/validation/arguments-validation.tsv",sep="\t")
validation_result_sem_eval = pd.read_csv("Dataset/validation/labels-validation.tsv",sep="\t")
validation_result_level_1 = pd.read_csv("Dataset/validation/level1-labels-validation.tsv",sep="\t")



##  Defining Labels

In [41]:
labels = [
    "Self-direction: thought",
    "Self-direction: action",
    "Stimulation",
    "Hedonism",
    "Achievement",
    "Power: dominance",
    "Power: resources",
    "Face",
    "Security: personal",
    "Security: societal",
    "Tradition",
    "Conformity: rules",
    "Conformity: interpersonal",
    "Humility",
    "Benevolence: caring",
    "Benevolence: dependability",
    "Universalism: concern",
    "Universalism: nature",
    "Universalism: tolerance",
    "Universalism: objectivity"
]
label_dict={
    1: "Self-direction: thought",
    2: "Self-direction: action",
    3: "Stimulation",
    4: "Hedonism",
    5: "Achievement",
    6: "Power: dominance",
    7: "Power: resources",
    8: "Face",
    9: "Security: personal",
    10: "Security: societal",
    11: "Tradition",
    12: "Conformity: rules",
    13: "Conformity: interpersonal",
    14: "Humility",
    15: "Benevolence: caring",
    16: "Benevolence: dependability",
    17: "Universalism: concern",
    18: "Universalism: nature",
    19: "Universalism: tolerance",
    20: "Universalism: objectivity"
}
label_dict_inverse = {v: k for k, v in label_dict.items()}


##  Merging Dataframes


Train

In [42]:

train_argument_values= {}

# Iterate over each row in the dataframe
for index, row in training_result_sem_eval.iterrows():
    # Get the 'Argument ID'
    argument_id = row['Argument ID']
    # Initialize an empty list to store column names with value 1
    columns_with_1 = []
    # Iterate over each column in the row (starting from index 1 to skip 'Argument ID')
    for col in training_result_sem_eval.columns[1:]:
        # Check if the value in the current column is 1
        if row[col] == 1:
            # If so, append the column name to the list
            columns_with_1.append(label_dict_inverse[col])
    # Add the 'Argument ID' and list of columns with value 1 to the dictionary
    train_argument_values[argument_id] = columns_with_1
df_argument_values = pd.DataFrame(train_argument_values.items(), columns=['Argument ID', 'label'])

train_df = pd.merge(training_arguments, df_argument_values, on='Argument ID', how='left')

train_labels=train_df['label'].tolist()

train_labels = [max(label) if label and not any(math.isnan(x) for x in label) else 0 for label in train_labels]


Validate

In [49]:
validate_argument_values= {}
# Iterate over each row in the dataframe
for index, row in validation_result_sem_eval.iterrows():
    # Get the 'Argument ID'
    argument_id = row['Argument ID']
    # Initialize an empty list to store column names with value 1
    columns_with_1 = []
    # Iterate over each column in the row (starting from index 1 to skip 'Argument ID')
    for col in validation_result_sem_eval.columns[1:]:
        # Check if the value in the current column is 1
        if row[col] == 1:
            # If so, append the column name to the list
            columns_with_1.append(label_dict_inverse[col])

    # Add the 'Argument ID' and list of columns with value 1 to the dictionary
    validate_argument_values[argument_id] = columns_with_1
df_argument_values = pd.DataFrame(validate_argument_values.items(), columns=['Argument ID', 'label'])

validate_df = pd.merge(validation_arguments, df_argument_values, on='Argument ID', how='left')
validate_labels=validate_df['label'].tolist()

validate_labels = [max(label) if label and not any(math.isnan(x) for x in label) else 0 for label in validate_labels]
print(validate_labels)



[10, 17, 17, 2, 20, 20, 14, 12, 9, 20, 20, 9, 17, 15, 10, 9, 19, 19, 17, 20, 5, 2, 9, 18, 19, 6, 17, 16, 14, 15, 10, 12, 20, 15, 16, 19, 15, 20, 17, 20, 16, 20, 17, 19, 15, 15, 20, 20, 19, 12, 20, 16, 20, 17, 19, 19, 12, 20, 15, 19, 15, 17, 17, 10, 15, 9, 17, 19, 14, 15, 17, 19, 16, 11, 15, 15, 9, 9, 3, 9, 17, 18, 15, 17, 19, 9, 17, 17, 17, 20, 13, 15, 7, 17, 7, 17, 9, 19, 17, 19, 18, 16, 19, 20, 9, 17, 20, 16, 19, 19, 15, 19, 20, 5, 17, 17, 17, 18, 20, 17, 9, 5, 20, 20, 15, 20, 20, 20, 5, 9, 16, 17, 15, 14, 17, 17, 12, 17, 16, 5, 19, 16, 16, 2, 17, 20, 20, 5, 16, 8, 10, 15, 20, 17, 15, 17, 19, 20, 11, 17, 15, 15, 19, 15, 19, 17, 17, 15, 18, 9, 18, 17, 16, 16, 9, 4, 18, 15, 15, 9, 17, 15, 19, 17, 15, 7, 5, 17, 9, 9, 17, 17, 9, 12, 15, 17, 15, 17, 15, 15, 19, 17, 9, 16, 12, 16, 20, 19, 17, 8, 17, 6, 20, 16, 16, 19, 19, 20, 19, 16, 17, 19, 2, 15, 12, 20, 17, 3, 17, 16, 20, 17, 17, 14, 17, 15, 1, 1, 20, 17, 20, 15, 17, 17, 20, 11, 12, 19, 17, 16, 12, 16, 17, 15, 17, 19, 17, 20, 20, 20, 19

## [10 Points] Task 02 - Define Tokenizer & Model


In this task, you are required to define the Tokenizer and LLM model of your choice.

In [50]:

# Write your code here
device = 'cuda' if cuda.is_available() else 'cpu'
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',num_labels=20)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",id2label=label_dict,label2id=label_dict_inverse)
model.to(device)

train_data = train_df.apply(lambda row: ','.join(map(str, row)), axis=1).tolist()
validate_data=validate_df.apply(lambda row: ','.join(map(str, row)), axis=1).tolist()


train_encodings=tokenizer(train_data,padding='max_length',truncation=True)
validate_encodings=tokenizer(validate_data,padding='max_length',truncation=True)

# for sentence in train_data:
#     train_encodings.append(tokenize(sentence))
# validate_encodings=[]
# for sentence in validate_data:
#     validate_encodings.append(tokenizer(sentence))
# print(train_encodings[0])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [51]:
class DataLoader(Dataset):
    """
    Custom Dataset class for handling tokenized text data and corresponding labels.
    Inherits from torch.utils.data.Dataset.
    """
    def __init__(self, encodings,labels):
        """
        Initializes the DataLoader class with encodings and labels.

        Args:
            encodings (dict): A dictionary containing tokenized input text data
                              (e.g., 'input_ids', 'token_type_ids', 'attention_mask').
            labels (list): A list of integer labels for the input text data.
        """
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        """
        Returns a dictionary containing tokenized data and the corresponding label for a given index.

        Args:
            idx (int): The index of the data item to retrieve.

        Returns:
            item (dict): A dictionary containing the tokenized data and the corresponding label.
        """
        # Retrieve tokenized data for the given index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Add the label for the given index to the item dictionary
        return item

    def __len__(self):
        """
        Returns the number of data items in the dataset.

        Returns:
            (int): The number of data items in the dataset.
        """
        return len(self.labels)
    
    


train_dataloader=DataLoader(train_encodings,train_labels)

validate_dataloader=DataLoader(validate_encodings,validate_labels)




## [20 Points] Task 03 - Optimizer & Hyperparameters


In this task, you are required to define the hyperparameters & the optimizer for training your model.

In [52]:
# Write your code here
BATCH_SIZE = 32
NUM_PROCS = 32
LR = 0.00005
EPOCHS = 5
MODEL = 'bert-base-uncased'
OUT_DIR = 'finetuned '+ MODEL

In [53]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

## [20 Points] Task 04 -  Training Loop


In this task, you are required to implement the training loop for fine tuning your model. You are also required to plot on the same graph: Loss vs Epochs & Accuracy vs Epochs

In [57]:
#Write your code
from transformers import TrainingArguments, Trainer


trainer=Trainer(
    model=model, 
    train_dataset=train_dataloader,         
    eval_dataset=validate_dataloader,  
    args=TrainingArguments('test-trainer'),
    compute_metrics=compute_metrics
)
trainer.train()

ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,token_type_ids,attention_mask.

## [10 Points]  Task 05 - Model Evaluation: Test Dataset

In this task, you are required your fine tuned model on the Test dataset using ``evaluator.py`` and compare your results with random and 1-baseline.

ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,token_type_ids,attention_mask.

## [10 Points]  Task 06 - Model Evaluation: Zhihu Dataset

In this task, you are required your fine tuned model on the Zhihu
 dataset using ``evaluator.py`` and compare your results with random and 1-baseline.

## [10 Points]  Task 07 - Model Evaluation: Nahjalbalagha Dataset

In this task, you are required your fine tuned model on the Nahjalbalagha dataset using ``evaluator.py`` and compare your results with random and 1-baseline.

# References

In this section, cite any resources or references that you use for solving this assignment.


title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding}
Link : https://huggingface.co/google-bert/bert-base-uncased 

