# CS 335: Introduction to Large Language Models
## Assignment 01
### **Total Marks**: 100
### **Deadline**: Sunday, 3rd March, 2024, 11:59 PM
### **Name**: Muhammad Talha
### **ID**: mj06974

# Instructions

1. Please rename your notebook as *Assignment_1_aa1234.ipynb* before the final submission. Notebooks that do not follow the naming convention will not be graded.

2. Please submit your own work. If you have any questions, please feel free to reach out to the course instructors or RA.



# Assignment Overview

In this assignment, you are required to fine-tune an LLM of your choice that classifies which human value category a textual argument belongs to. Your model will be evaluated against the 1-baseline and random-baseline results on the following datasets: test, Nahjalbalagha, and Zhihu.


# Setup



In [None]:
# IMPORT ALL YOUR LIBRARIES
# SUGGESTED LIBRARIES
import torch
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW  # transformers' own AdamW is deprecated
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score

# Download Files


## Evaluator

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://raw.githubusercontent.com/touche-webis-de/touche-code/main/semeval23/human-value-detection/evaluator/evaluator.py

## 1-Baseline

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://raw.githubusercontent.com/touche-webis-de/touche-code/main/semeval23/human-value-detection/1-baseline/1-baseline.py

## Random-Baseline

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://raw.githubusercontent.com/touche-webis-de/touche-code/main/semeval23/human-value-detection/random-baseline/random-baseline.py

## Dataset Files

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!wget https://zenodo.org/api/records/10564870/files-archive

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!unzip files-archive -d Dataset

In [None]:
!mkdir Dataset/zhihu
!mkdir Dataset/nahjalbalagha
!mkdir Dataset/train
!mkdir Dataset/test
!mkdir Dataset/validation

In [None]:
!mv Dataset/*-zhihu.tsv Dataset/zhihu
!mv Dataset/*-nahjalbalagha.tsv Dataset/nahjalbalagha
!mv Dataset/*-training.tsv Dataset/train
!mv Dataset/*-test.tsv Dataset/test
!mv Dataset/*-validation.tsv Dataset/validation

# Background Information

## Human Value Detection 2023
## SemEval 2023 Task 4. ValueEval: Identification of Human Values behind Arguments



Given a textual argument and a human value category, classify whether or not the argument draws on that category. This task uses a set of 20 value categories compiled from the social science literature and described in our [ACL paper](https://webis.de/publications.html#kiesel_2022b). Arguments are given as premise text, conclusion text, and binary stance of the premise to the conclusion ("in favor of" or "against").

The 20 value categories are shown below on Schwartz's value continuum:

[![JEPBxUu.md.png](https://iili.io/JEPBxUu.md.png)](https://freeimage.host/i/JEPBxUu)







## Data


Data is provided as tab-separated values files with one header line. The arguments-validation.tsv files contain one argument per line: its unique argument ID, the conclusion, the premise's stance towards the conclusion, and the premise itself. An example with tab-separated columns is shown below:

<pre><span class="column">Argument ID</span>	<span class="column">Conclusion</span>	<span class="column">Stance</span>	<span class="column">Premise</span>
<span class="column">A01010</span>	<span class="column">We should prohibit school prayer</span>	<span class="column">against</span>	<span class="column">it should be allowed if the student wants to pray as long as it is not interfering with his classes</span>
<span class="column">A01011</span>	<span class="column">We should abolish the three-strikes laws</span>	<span class="column">in favor of</span>	<span class="column">three strike laws can cause young people to be put away for life without a chance to straight out their life</span>
<span class="column">A01012</span>	<span class="column">The use of public defenders should be mandatory</span>	<span class="column">in favor of</span>	<span class="column">the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't</span>
</pre>

The labels-validation.tsv files also contain one argument per line: its unique argument ID and one column for each of the 20 value categories, with a 1 meaning that the argument resorts to the value category and a 0 meaning that it does not. An example with tab-separated columns is shown below:

<pre><span class="column">Argument ID</span>	<span class="column">Self-direction: thought</span>	<span class="column">Self-direction: action</span>	<span class="column">Stimulation</span>	<span class="column">Hedonism</span>	<span class="column">Achievement</span>	<span class="column">Power: dominance</span>	<span class="column">Power: resources</span>	<span class="column">Face</span>	<span class="column">Security: personal</span>	<span class="column">Security: societal</span>	<span class="column">Tradition</span>	<span class="column">Conformity: rules</span>	<span class="column">Conformity: interpersonal</span>	<span class="column">Humility</span>	<span class="column">Benevolence: caring</span>	<span class="column">Benevolence: dependability</span>	<span class="column">Universalism: concern</span>	<span class="column">Universalism: nature</span>	<span class="column">Universalism: tolerance</span>	<span class="column">Universalism: objectivity</span>
<span class="column">A01010</span>	<span class="column">1</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>
<span class="column">A01011</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">1</span>
<span class="column">A01012</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">1</span>	<span class="column">0</span>	<span class="column">0</span>	<span class="column">0</span>
</pre>
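To make the label layout concrete, here is a minimal sketch that reads rows in this format and lists the categories marked with 1 for each argument. The sample text is a hypothetical two-category excerpt (real files carry all 20 category columns), and the helper name `active_categories` is an assumption made for illustration.

```python
import csv
import io

# Hypothetical excerpt of a labels file, shortened to two category columns
SAMPLE_LABELS_TSV = (
    "Argument ID\tSelf-direction: thought\tTradition\n"
    "A01010\t1\t1\n"
    "A01012\t0\t0\n"
)

def active_categories(tsv_text):
    """Map each Argument ID to the list of category names marked with 1."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        argument_id = row.pop("Argument ID")
        # Values are read as strings, so compare against "1" rather than 1
        result[argument_id] = [name for name, flag in row.items() if flag == "1"]
    return result

parsed = active_categories(SAMPLE_LABELS_TSV)
```

An argument may map to several categories or to none, which is why the task is treated as one binary decision per category.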

In addition, two further datasets are used to evaluate the robustness of the model: validation-zhihu, drawn from the recommendation and hotlist section of the Chinese question-answering website Zhihu, and test-nahjalbalagha, based on the Nahj al-Balagha.



## Evaluation

Runs are evaluated on the basis of F1-score, Precision, and Recall: averaged over all value categories and for each category individually.
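As a rough illustration of these metrics, the sketch below computes precision, recall, and F1 per category from 0/1 label rows and then takes a plain mean over categories. This is an assumption-laden toy version: the function names are hypothetical, and the official `evaluator.py` may aggregate the averaged scores differently, so its output remains the reference.

```python
def precision_recall_f1(gold, predicted):
    """Per-category (precision, recall, F1) for binary multi-label rows.

    gold and predicted are lists of equal-length 0/1 lists, one inner list per argument.
    """
    num_categories = len(gold[0])
    scores = []
    for c in range(num_categories):
        tp = sum(1 for g, p in zip(gold, predicted) if g[c] == 1 and p[c] == 1)
        fp = sum(1 for g, p in zip(gold, predicted) if g[c] == 0 and p[c] == 1)
        fn = sum(1 for g, p in zip(gold, predicted) if g[c] == 1 and p[c] == 0)
        # Guard against division by zero when a category is never predicted or never present
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append((precision, recall, f1))
    return scores

def mean_over_categories(scores):
    """Plain arithmetic mean of precision, recall, and F1 across categories."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Toy data: three arguments, two categories
gold = [[1, 0], [1, 1], [0, 1]]
pred = [[1, 0], [0, 1], [0, 1]]
per_category = precision_recall_f1(gold, pred)
averaged = mean_over_categories(per_category)
```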

## Baseline Results

In [None]:
# DO NOT EDIT
# RUN ONLY ONCE
!mkdir baseline
!mkdir output
!pip install -U accelerate
!pip install -U transformers

### 1-Baseline

#### Test Dataset

In [None]:
# DO NOT EDIT
!python3 /content/1-baseline.py --inputDataset /content/Dataset/test --outputDataset /content/baseline
!python3 evaluator.py --inputDataset /content/Dataset/test/ --inputRun /content/baseline --outputDataset /content/output
!head -n 12 /content/output/evaluation.prototext

#### Zhihu

In [None]:
# DO NOT EDIT
!python3 /content/1-baseline.py --inputDataset /content/Dataset/zhihu/ --outputDataset /content/baseline
!python3 evaluator.py --inputDataset /content/Dataset/zhihu/ --inputRun /content/baseline --outputDataset /content/output
!head -n 12 /content/output/evaluation.prototext

#### Nahjalbalagha

In [None]:
# DO NOT EDIT
!python3 /content/1-baseline.py --inputDataset /content/Dataset/nahjalbalagha/ --outputDataset /content/baseline
!python3 evaluator.py --inputDataset /content/Dataset/nahjalbalagha/ --inputRun /content/baseline --outputDataset /content/output
!head -n 12 /content/output/evaluation.prototext

### Random-Baseline


#### Test

In [None]:
# DO NOT EDIT
!python3 random-baseline.py --inputDataset Dataset/test --outputDataset baseline
!python3 evaluator.py --inputDataset Dataset/test/ --inputRun baseline --outputDataset output
!head -n 12 output/evaluation.prototext

#### Zhihu


In [None]:
# DO NOT EDIT
!python3 random-baseline.py --inputDataset Dataset/zhihu/ --outputDataset baseline
!python3 evaluator.py --inputDataset Dataset/zhihu/ --inputRun baseline --outputDataset output
!head -n 12 output/evaluation.prototext

#### Nahjalbalagha

In [None]:
# DO NOT EDIT
!python3 random-baseline.py --inputDataset Dataset/nahjalbalagha/ --outputDataset baseline
!python3 evaluator.py --inputDataset Dataset/nahjalbalagha/ --inputRun baseline --outputDataset output
!head -n 12 output/evaluation.prototext



# Tasks

## [20 Points] Task 01 - Load Datasets

In this task, you are required to load the Training, Test, Validation, Nahjalbalagha & Zhihu datasets into separate dataframes.

In [None]:
# IMPORT ALL YOUR LIBRARIES
# SUGGESTED LIBRARIES
import math

import evaluate
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
from torch import cuda
from torch.optim import AdamW  # transformers' own AdamW is deprecated
from torch.utils.data import DataLoader, Dataset, TensorDataset
from transformers import (
    BertForSequenceClassification,
    BertModel,
    BertTokenizer,
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    get_linear_schedule_with_warmup,
    pipeline,
)


In [None]:
# Importing files into dataframes. Naming convention:
# <dataset>_arguments       = the arguments (conclusion, stance, premise)
# <dataset>_result_sem_eval = the gold labels for the 20 value categories
# <dataset>_result_level_1  = labels from the level1-labels files (not used in this notebook)

# Nahjalbalagha
nahjalbalagha_arguments = pd.read_csv("Dataset/nahjalbalagha/arguments-test-nahjalbalagha.tsv",sep="\t")
nahjalbalagha_result_sem_eval= pd.read_csv("Dataset/nahjalbalagha/labels-test-nahjalbalagha.tsv",sep="\t")
nahjalbalagha_result_level_1= pd.read_csv("Dataset/nahjalbalagha/level1-labels-test-nahjalbalagha.tsv",sep="\t")
# Zhihu 
zhihu_arguments = pd.read_csv("Dataset/zhihu/arguments-validation-zhihu.tsv",sep="\t")
zhihu_result_sem_eval = pd.read_csv("Dataset/zhihu/labels-validation-zhihu.tsv",sep="\t")
zhihu_result_level_1 = pd.read_csv("Dataset/zhihu/level1-labels-validation-zhihu.tsv",sep="\t")

# Testing 
test_arguments = pd.read_csv("Dataset/test/arguments-test.tsv",sep="\t")
test_result_sem_eval = pd.read_csv("Dataset/test/labels-test.tsv",sep="\t")
test_result_level_1 = pd.read_csv("Dataset/test/level1-labels-test.tsv",sep="\t")

# Training Dataset
training_arguments = pd.read_csv("Dataset/train/arguments-training.tsv",sep="\t")
training_result_sem_eval = pd.read_csv("Dataset/train/labels-training.tsv",sep="\t")
training_result_level_1 = pd.read_csv("Dataset/train/level1-labels-training.tsv",sep="\t")

# Validation 
validation_arguments = pd.read_csv("Dataset/validation/arguments-validation.tsv",sep="\t")
validation_result_sem_eval = pd.read_csv("Dataset/validation/labels-validation.tsv",sep="\t")
validation_result_level_1 = pd.read_csv("Dataset/validation/level1-labels-validation.tsv",sep="\t")


##  Defining Labels

In [None]:
labels = [
    "Self-direction: thought",
    "Self-direction: action",
    "Stimulation",
    "Hedonism",
    "Achievement",
    "Power: dominance",
    "Power: resources",
    "Face",
    "Security: personal",
    "Security: societal",
    "Tradition",
    "Conformity: rules",
    "Conformity: interpersonal",
    "Humility",
    "Benevolence: caring",
    "Benevolence: dependability",
    "Universalism: concern",
    "Universalism: nature",
    "Universalism: tolerance",
    "Universalism: objectivity"
]
label_dict={
    0: "Self-direction: thought",
    1: "Self-direction: action",
    2: "Stimulation",
    3: "Hedonism",
    4: "Achievement",
    5: "Power: dominance",
    6: "Power: resources",
    7: "Face",
    8: "Security: personal",
    9: "Security: societal",
    10: "Tradition",
    11: "Conformity: rules",
    12: "Conformity: interpersonal",
    13: "Humility",
    14: "Benevolence: caring",
    15: "Benevolence: dependability",
    16: "Universalism: concern",
    17: "Universalism: nature",
    18: "Universalism: tolerance",
    19: "Universalism: objectivity"
}
label_dict_inverse = {v: k for k, v in label_dict.items()}


##  Merging Dataframes


Train

In [None]:

train_argument_values= {}

# Iterate over each row in the dataframe
for index, row in training_result_sem_eval.iterrows():
    # Get the 'Argument ID'
    argument_id = row['Argument ID']
    # Initialize an empty list to store column names with value 1
    columns_with_1 = []
    # Iterate over each column in the row (starting from index 1 to skip 'Argument ID')
    for col in training_result_sem_eval.columns[1:]:
        # Check if the value in the current column is 1
        if row[col] == 1:
            # If so, append the category's integer index to the list
            columns_with_1.append(label_dict_inverse[col])
    # Add the 'Argument ID' and list of columns with value 1 to the dictionary
    train_argument_values[argument_id] = columns_with_1
df_argument_values = pd.DataFrame(train_argument_values.items(), columns=['Argument ID', 'label'])

train_df = pd.merge(training_arguments, df_argument_values, on='Argument ID', how='left')

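train_labels=train_df['label'].tolist()

# NOTE: this collapses the multi-label annotation to a single class per argument:
# the highest active category index, or 0 when no category is active.
train_labels = [max(label) if label and not any(math.isnan(x) for x in label) else 0 for label in train_labels]
# print(max(train_labels))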

Validate

In [None]:
validate_argument_values= {}
# Iterate over each row in the dataframe
for index, row in validation_result_sem_eval.iterrows():
    # Get the 'Argument ID'
    argument_id = row['Argument ID']
    # Initialize an empty list to store column names with value 1
    columns_with_1 = []
    # Iterate over each column in the row (starting from index 1 to skip 'Argument ID')
    for col in validation_result_sem_eval.columns[1:]:
        # Check if the value in the current column is 1
        if row[col] == 1:
            # If so, append the category's integer index to the list
            columns_with_1.append(label_dict_inverse[col])

    # Add the 'Argument ID' and list of columns with value 1 to the dictionary
    validate_argument_values[argument_id] = columns_with_1
df_argument_values = pd.DataFrame(validate_argument_values.items(), columns=['Argument ID', 'label'])

validate_df = pd.merge(validation_arguments, df_argument_values, on='Argument ID', how='left')
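validate_labels=validate_df['label'].tolist()

# NOTE: as for the training set, this keeps only the highest active category index
# per argument (0 when no category is active).
validate_labels = [max(label) if label and not any(math.isnan(x) for x in label) else 0 for label in validate_labels]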

# print(max(validate_labels))
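Because an argument can resort to several value categories at once, taking the maximum index discards information. One alternative, sketched below as an assumption rather than the approach used in this notebook, is to keep one multi-hot vector per argument. The helper name `to_multi_hot` is hypothetical; the example indices are those of argument A01010 from the labels excerpt in the Data section.

```python
NUM_CATEGORIES = 20  # number of value categories in the task

def to_multi_hot(active_indices, num_categories=NUM_CATEGORIES):
    """Turn a list of active category indices into a 0.0/1.0 vector."""
    vector = [0.0] * num_categories
    for index in active_indices:
        vector[index] = 1.0
    return vector

# A01010 is labelled with categories 0, 1, 10 and 16 in the example above
example_active = [0, 1, 10, 16]
multi_hot = to_multi_hot(example_active)

# The single-label collapse keeps only the largest index and drops the other three
collapsed = max(example_active)
```

Float vectors of this shape are the label format expected when a Hugging Face sequence classification model is configured with `problem_type="multi_label_classification"`, which switches the loss to a per-category binary cross-entropy.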

## [10 Points] Task 02 - Define Tokenizer & Model


In this task, you are required to define the Tokenizer and LLM model of your choice.

In [None]:

# # Write your code here

# tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
# model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
# model.to('cpu')

# train=train_df['Conclusion'].tolist()
# validate=validate_df['Conclusion'].tolist()
# train_encoding=tokenizer(train,padding=True,truncation=True)
# validate_encoding=tokenizer(validate,padding=True,truncation=True)


# # tokenized_train = [tokenizer.tokenize(sentence) for sentence in train]
# # train_labels=torch.tensor(train_labels)
# # train_input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_train]
# # train_input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(ids) for ids in train_input_ids], batch_first=True)
# # train_attention_masks = torch.tensor([[1] * len(ids) for ids in train_input_ids])  # Assuming all tokens are valid
# # train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)

# # tokenized_validate = [tokenizer.tokenize(sentence) for sentence in validate]
# # validate_labels=torch.tensor(validate_labels)
# # validate_input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_validate]
# # validate_input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(ids) for ids in validate_input_ids], batch_first=True)
# # validate_attention_masks = torch.tensor([[1] * len(ids) for ids in validate_input_ids])  # Assuming all tokens are valid
# # validate_dataset = TensorDataset(validate_input_ids, validate_attention_masks, validate_labels)




# Write your code here

model_name = "bert-base-uncased"

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=20)  # Assuming 20 human value categories
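# Fall back to the CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)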

# tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
# model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")


train=train_df['Premise'].tolist()
validate=validate_df['Premise'].tolist()
train_encodings = tokenizer(train, truncation=True, padding=True, max_length=512, return_tensors="pt")
validate_encodings=tokenizer(validate,truncation=True, padding=True, max_length=512, return_tensors="pt")


# tokenized_train = [tokenizer.tokenize(sentence) for sentence in train]
# train_labels=torch.tensor(train_labels)
# train_input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_train]
# train_input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(ids) for ids in train_input_ids], batch_first=True)
# train_attention_masks = torch.tensor([[1] * len(ids) for ids in train_input_ids])  # Assuming all tokens are valid
# train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)

# tokenized_validate = [tokenizer.tokenize(sentence) for sentence in validate]
# validate_labels=torch.tensor(validate_labels)
# validate_input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_validate]
# validate_input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(ids) for ids in validate_input_ids], batch_first=True)
# validate_attention_masks = torch.tensor([[1] * len(ids) for ids in validate_input_ids])  # Assuming all tokens are valid
# validate_dataset = TensorDataset(validate_input_ids, validate_attention_masks, validate_labels)



In [None]:

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    
train_dataset=CustomDataset(train_encodings,train_labels)
validate_dataset=CustomDataset(validate_encodings,validate_labels)

## [20 Points] Task 03 - Optimizer & Hyperparameters


In this task, you are required to define the hyperparameters & the optimizer for training your model.
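One common choice for this task is an AdamW optimizer combined with a linear warmup and linear decay of the learning rate. The sketch below reproduces only the schedule arithmetic in plain Python so the shape of the curve is easy to inspect. The function name and the values of `BASE_LR`, `WARMUP_STEPS`, and `TOTAL_STEPS` are assumptions chosen for illustration, not tuned settings.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier in [0, 1] for a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 back to 0 at the final step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 2e-5       # assumed base learning rate
WARMUP_STEPS = 500   # assumed number of warmup steps
TOTAL_STEPS = 3000   # hypothetical total number of optimizer steps

# Effective learning rate at a few checkpoints of the run
checkpoints = [0, 250, 500, 1750, 3000]
schedule = {s: BASE_LR * linear_warmup_decay(s, WARMUP_STEPS, TOTAL_STEPS) for s in checkpoints}
```

The total number of steps is roughly the number of batches per epoch times the number of epochs, so the warmup length should be chosen relative to that figure.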

## [20 Points] Task 04 -  Training Loop


In this task, you are required to implement the training loop for fine tuning your model. You are also required to plot on the same graph: Loss vs Epochs & Accuracy vs Epochs

In [None]:
#Write your code

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=2,  # batch size per device during training
    per_device_eval_batch_size=4,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)
trainer = Trainer(
    model=model,                    # The instantiated Transformers model to be trained
    args=training_args,             # Training arguments
    train_dataset=train_dataset,    # Training dataset
    eval_dataset=validate_dataset,  # Evaluation dataset (held-out validation split)
)
trainer.train()
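For the required Loss vs Epochs plot, the logged training losses first need to be grouped by epoch. The sketch below does that grouping on a list of log entries shaped like those the `Trainer` keeps in `trainer.state.log_history` (dictionaries with `loss`, `epoch`, and `step` keys). The sample log is hypothetical data and `mean_loss_per_epoch` is an illustrative helper, not part of any library.

```python
import math
from collections import defaultdict

def mean_loss_per_epoch(log_history):
    """Average the logged training loss within each epoch."""
    buckets = defaultdict(list)
    for entry in log_history:
        # Skip evaluation entries, which carry "eval_loss" instead of "loss"
        if "loss" in entry and "epoch" in entry:
            epoch = max(1, math.ceil(entry["epoch"]))
            buckets[epoch].append(entry["loss"])
    return {epoch: sum(values) / len(values) for epoch, values in sorted(buckets.items())}

# Hypothetical log entries standing in for trainer.state.log_history
sample_log = [
    {"loss": 3.0, "epoch": 0.5, "step": 10},
    {"loss": 2.0, "epoch": 1.0, "step": 20},
    {"loss": 1.5, "epoch": 1.5, "step": 30},
    {"loss": 0.5, "epoch": 2.0, "step": 40},
    {"eval_loss": 0.9, "epoch": 2.0, "step": 40},
]
epoch_losses = mean_loss_per_epoch(sample_log)
```

The resulting dictionary can be passed to a plotting library, with the epochs on the x-axis and the mean loss on the y-axis. An accuracy curve would need a per-epoch evaluation step that records accuracy in the same way.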

In [None]:
trainer.evaluate()
save_dir='./finetuned'
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)





In [None]:
finetuned_model=BertForSequenceClassification.from_pretrained(save_dir)  # Reloads the 20-label classifier saved above
finetuned_tokenizer=BertTokenizer.from_pretrained(save_dir)
# "text-classification" is the general task name; "sentiment-analysis" is an alias for it
nlp= pipeline("text-classification", model=finetuned_model, tokenizer=finetuned_tokenizer)


## [10 Points]  Task 05 - Model Evaluation: Test Dataset

In this task, you are required to evaluate your fine-tuned model on the Test dataset using ``evaluator.py`` and compare your results with the random and 1-baseline results.

In [45]:
test_premise=test_arguments['Premise'].tolist()
test_premise_ids=test_arguments["Argument ID"].tolist()

test_answer=[]
columns = [
    "Argument ID",
    "Self-direction: thought",
    "Self-direction: action",
    "Stimulation",
    "Hedonism",
    "Achievement",
    "Power: dominance",
    "Power: resources",
    "Face",
    "Security: personal",
    "Security: societal",
    "Tradition",
    "Conformity: rules",
    "Conformity: interpersonal",
    "Humility",
    "Benevolence: caring",
    "Benevolence: dependability",
    "Universalism: concern",
    "Universalism: nature",
    "Universalism: tolerance",
    "Universalism: objectivity"
]

data_frame = pd.DataFrame(columns=columns)
data_frame["Argument ID"]=test_premise_ids
for index,row in test_arguments.iterrows():
    print(row['Premise'])


0 Argument ID                                              A26004
Conclusion                     We should end affirmative action
Stance                                                  against
Premise        affirmative action helps with employment equity.
Name: 0, dtype: object
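To hand predictions to the evaluator, they need to be written as a tab-separated file. The sketch below assumes the run file uses the same layout as the labels files (an `Argument ID` column followed by one 0/1 column per category) and that per-category probabilities are already available. The category list is shortened to three entries, the second argument ID and all probabilities are made up, and the 0.5 threshold is an arbitrary choice. Check `evaluator.py` for the file name and location it expects.

```python
import csv
import io

# Shortened for the example; a real run would list all 20 categories
CATEGORIES = ["Self-direction: thought", "Self-direction: action", "Stimulation"]

def probabilities_to_rows(argument_ids, probabilities, threshold=0.5):
    """Turn per-category probabilities into rows of [Argument ID, 0/1, 0/1, ...]."""
    rows = []
    for argument_id, probs in zip(argument_ids, probabilities):
        rows.append([argument_id] + [1 if p >= threshold else 0 for p in probs])
    return rows

def write_run_tsv(handle, rows, categories=CATEGORIES):
    """Write a header line plus one tab-separated line per argument."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Argument ID"] + list(categories))
    writer.writerows(rows)

# Hypothetical probabilities for two arguments; written to memory instead of disk
buffer = io.StringIO()
rows = probabilities_to_rows(["A26004", "A26010"], [[0.9, 0.2, 0.5], [0.1, 0.7, 0.3]])
write_run_tsv(buffer, rows)
run_text = buffer.getvalue()
```

Replacing the `io.StringIO` buffer with a file opened for writing produces the run file on disk; the same helpers can be reused for the Zhihu and Nahjalbalagha datasets.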


## [10 Points]  Task 06 - Model Evaluation: Zhihu Dataset

In this task, you are required to evaluate your fine-tuned model on the Zhihu dataset using ``evaluator.py`` and compare your results with the random and 1-baseline results.

## [10 Points]  Task 07 - Model Evaluation: Nahjalbalagha Dataset

In this task, you are required to evaluate your fine-tuned model on the Nahjalbalagha dataset using ``evaluator.py`` and compare your results with the random and 1-baseline results.

# References

In this section, cite any resources or references that you use for solving this assignment.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Link: https://huggingface.co/google-bert/bert-base-uncased

