## Homework 2, CS678 Fall 2023

### This is due on October 13th, 2023. 

Submit the report to Gradescope with the naming convention of
**John_Doe_HW2_Report_CS678_F23.pdf** if your name is John Doe.
Only the report should be submitted to Gradescope. The rest of the code and data files, 
including this notebook in its completion form,
should be submitted to Blackboard in a single zipped folder.

<!-- #### IMPORTANT: 

After copying this notebook to your Google Drive, please paste a link to it
below. To get a publicly-accessible link, hit the *Share* button at the top
right, then click "Get shareable link" and copy over the result.If you fail to
do this, you will receive no credit for this homework!

***LINK:***

--- -->


##### *How to submit this problem set:*
- Write all the answers in this notebook. 

- When creating your final version of the notebook to hand in, 
  please do a fresh restart and execute every cell in order. 
  One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.

##### *Policy regarding Google Colab:*
- The instruction in this notebook assumes that you will use Colab.

- However, using Colab is not required. You are free to run the code on your local machine, though in that case you may suffer from a slow runtime due to the lack of proper GPU resources.

---

##### *Academic honesty*

- We will audit the notebooks from a set number of students, chosen at
  random. The audits will check that the code you wrote actually generates the
  answers in your report PDF. If you turn in correct answers on your PDF without code
  that actually generates those answers, we will consider this a serious case of
  cheating. See the course page for honesty policies.

- We will also run automatic checks of notebooks for plagiarism. 
  Copying code from others is also considered a serious case of cheating.

---

# Part 1: Data Collection and Annotation

In this homework, you will first collect a labeled dataset of **150** sentences for a text classification task of your choice. This process will include:

1. *Data collection*: Collect 150 sentences from any source you find interesting (e.g., literature, Tweets, news articles, reviews, etc.)

2. *Task design*: Come up with a multilabel sentence-level classification task that you would like to perform on your sentences. 

3. On your dataset, collect annotations from **two** classmates for your task on a **second, separate set** of a minimum of **150** sentences. Everyone in this class will need to both create their own dataset and also serve as an annotator for two other classmates. In order to get everything done on time, you need to complete the following steps:

> *   Find two classmates willing to label 150 sentences each (use the Piazza "search for teammates" thread if you're having issues finding labelers).
*   Collect the labeled data from each of the two annotators.
*   Sanity check the data for basic cleanliness (are all examples annotated? are all labels allowable ones?)

4. Collect feedback from annotators about the task including annotation time and obstacles encountered (e.g., maybe some sentences were particularly hard to annotate!)

5. Calculate and report inter-annotator agreement.

6. Aggregate output from both annotators to create final dataset (include your first 150 sentences too).

7. Perform NLP experiments on your new dataset!

The mapping of label names and IDs in seed.tsv is as follows:

```json
{
    'Economic': 1.0,
    'Capacity and Resources': 2.0,
    'Morality': 3.0,
    'Fairness and Equality': 4.0,
    'Legality, Constitutionality, Jurisdiction': 5.0,
    'Policy Prescription and Evaluation': 6.0,
    'Crime and Punishment': 7.0,
    'Security and Defense': 8.0,
    'Health and Safety': 9.0,
    'Quality of Life': 10.0,
    'Cultural Identity': 11.0,
    'Public Sentiment': 12.0,
    'Political': 13.0,
    'External Regulation and Reputation': 14.0,
    'Other': 15.0
}
```

Make sure that this mapping is followed in all of your data files.

## Question 3 (8 points):
Now, compute the inter-annotator agreement between your two annotators. Upload both .tsv files to your Colab session (click the folder icon in the sidebar to the left of the screen). In the code cell below, read the data from the two files and compute both the raw agreement (% of examples for which both annotators agreed on the label) and the [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa). Feel free to use implementations in existing libraries (e.g., [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)). After you’re done, report the raw agreement and Cohen’s scores in your report.

*If you're curious, Cohen suggested the Kappa result be interpreted as follows: values ≤ 0 as indicating no agreement and 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41– 0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement.*

In [66]:
### WRITE CODE TO LOAD ANNOTATIONS AND 
### COMPUTE AGREEMENT + COHEN'S KAPPA HERE!
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def raw_agreement(annotator1_df, annotator2_df):
    num_matches = annotator1_df["label_ID"].eq(annotator2_df["label_ID"]).sum()

    raw_agreement = num_matches / len(annotator1_df)
    
    return raw_agreement

def kohens_cappa(annotator1_df, annotator2_df):

    kohens_cappa = cohen_kappa_score(annotator1_df["label_ID"], annotator2_df["label_ID"])

    return kohens_cappa


# TODO  Using CSV as subtitute format, TSV wasn't parsing correctly. 
annotator1_df = pd.read_csv('./data/cleung5_annotations.csv')
annotator2_df = pd.read_csv('./data/ichen6_annotations.csv')

print(f"Len Annotator1 - {len(annotator1_df)}")
print(f"Len Annotator2 - {len(annotator2_df)}")


print("--- Raw agreement between annotator1 and annotator2 ---")
print(raw_agreement(annotator1_df, annotator2_df))

print("--- Cohen's kappa score between annotator1 and annotator2 ---")
print(kohens_cappa(annotator1_df, annotator2_df))


Len Annotator1 - 150
Len Annotator2 - 150
--- Raw agreement between annotator1 and annotator2 ---
0.44
--- Cohen's kappa score between annotator1 and annotator2 ---
0.3854258121158911




### *RAW AGREEMENT*: 0.44

### *COHEN'S KAPPA*: 0.3854258121158911

# Part 2: Model Training and Testing

Now we'll move onto fine-tuning  pretrained language models specifically on your dataset. This part of the homework is meant to be an introduction to the HuggingFace library, and it contains code that will potentially be useful for your final projects. Since we're dealing with large models, the first step is to change to a GPU runtime.

## Adding a hardware accelerator

Please go to the menu and add a GPU as follows:

`Edit > Notebook Settings > Hardware accelerator > (GPU)`

Run the following cell to confirm that the GPU is detected.

In [67]:
import torch
torch.cuda.empty_cache()

# Confirm that the GPU is detected

assert torch.cuda.is_available()

# Get the GPU device name.
device_name = torch.cuda.get_device_name()
n_gpu = torch.cuda.device_count()
print(f"Found device: {device_name}, n_gpu: {n_gpu}")
device = torch.device("cuda")

Found device: NVIDIA GeForce RTX 2060, n_gpu: 1


In [68]:
import random
import numpy as np

def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_everything()

## Installing Hugging Face's Transformers library
We will use Hugging Face's Transformers (https://github.com/huggingface/transformers), an open-source library that provides general-purpose architectures for natural language understanding and generation with a collection of various pretrained models made by the NLP community. This library will allow us to easily use pretrained models like `BERT` and perform experiments on top of them. We can use these models to solve downstream target tasks, such as text classification, question answering, and sequence labeling.

Run the following cell to install Hugging Face's Transformers library and download a sample data file called seed.tsv that contains 250 sentences in English, annotated with their frame.

In [7]:
# !pip install transformers
# !pip install -U -q PyDrive

The cell below imports some helper functions we wrote to demonstrate the task on
the sample seed dataset.

#### *IMPORTANT NOTE*:

The tokenize_and_format function in helpers.py uses bert-base-uncased as the
model for the tokenizer. If you are using a different model for training in this
notebook or for running predictions in a different notebook or python file, you
need to change the model name as well in the tokenizer, otherwise you will get
arbitrarily incorrect results down the line.

If you update the model name for the tokenizer, you would need to reload the
file which can be done simply by re-running the cell below.

In [8]:
from helpers import tokenize_and_format, flat_accuracy

# Part 1: Data Prep and Model Specifications

Upload your data using the file explorer to the left. We have provided a
function below to tokenize and format your data as BERT requires. Make sure that
your tsv file, titled final_data.tsv, has one column "sentence" and another
column "label_ID" containing integers/float. (basically the same format as
seed.tsv should be maintained for the sentence and label columns)

If you run the cell below without modifications, it will run on the seed.tsv
example data we have provided. It imports some helper functions we wrote to
demonstrate the task on the sample dataset. You should first run all of the
following cells with seed.tsv just to see how everything works. Then, once you
understand the whole preprocessing / fine-tuning process, change the tsv in the
below cell to your final_data.tsv file, add any extra preprocessing code you
wish, and then run the cells again on your own data.


#### Important Note :

The code below expects the data to be in a tsv file  with the columns as "sentence"
and "label_ID" (other columns are not that relevant here). But this is different
from the instructions in the report where you are expected to create data with
"text" and "label" columns for all of the annotation steps. 

Modify the code below to suitably handle this.

In [69]:
from helpers import tokenize_and_format, flat_accuracy
import pandas as pd
import numpy as np

seed_everything()

# df = pd.read_csv('./data/final_data_v2.csv') # TODO : Uncomment this line to use the full dataset
# df = pd.read_csv('seed.tsv', delimiter="\t")
# df = pd.read_csv('./data/para_data.csv') # TODO : Uncomment this line to use the full dataset
df = pd.read_csv('./data/augmented_df_v1.csv') # TODO : Uncomment this line to use the full dataset

df = df.sample(frac=1).reset_index(drop=True)

print(len(df))

texts = df.sentence.values # this assumes that the column containing the text is called "sentence"
labels = df.label_ID.values # this assumes that the column containing the labels is called "label_ID"

### tokenize_and_format() is a helper function provided in helpers.py ###
### Male sure you use the correct model name for your tokenizer! ###
input_ids, attention_masks = tokenize_and_format(texts)

# print(texts)
# print(labels)
label_list = []
for l in labels:
  label_array = np.zeros(len(set(labels)))
  label_array[int(l)-1] = 1
  label_list.append(label_array)

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(np.array(label_list))

# Print sentence 0, now as a list of IDs.
print('Original: ', texts[0])
print('Token IDs:', input_ids[0])

34200
Original:  "If someone is gay and he searches for God and has good will, who am I to judge?"
Token IDs: tensor([  101,   107, 11526, 25839, 10127, 16013, 10110, 10191, 17047, 10165,
        10139, 13131, 10110, 10438, 12050, 11229,   117, 10488, 10345,   151,
        10114, 20290,   136,   107,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])


## Create train/test/validation splits

Here we split your dataset into 3 parts: a training set, a validation set, and a testing set. Each item in your dataset will be a 3-tuple containing an input_id tensor, an attention_mask tensor, and a label tensor.



In [71]:
seed_everything()

total = len(df)

num_train = int(total * .8)
num_val = int(total * .1)
num_test = total - num_train - num_val

# make lists of 3-tuples (already shuffled the dataframe in cell above)
train_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train)]
val_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train, num_val+num_train)]
test_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_val + num_train, total)]

train_text = [texts[i] for i in range(num_train)]
val_text = [texts[i] for i in range(num_train, num_val+num_train)]
test_text = [texts[i] for i in range(num_val + num_train, total)]


Here we choose the model we want to finetune from https://huggingface.co/transformers/pretrained_models.html. Because the task requires us to label sentences, we wil be using BertForSequenceClassification below. You may see a warning that states that `some weights of the model checkpoint at [model name] were not used when initializing. . .` This warning is expected and means that you should fine-tune your pre-trained model before using it on your downstream task. See [here](https://github.com/huggingface/transformers/issues/5421#issuecomment-652582854) for more info.

In [13]:
from transformers import BertForSequenceClassification, AutoModelForSequenceClassification, BertConfig
from torch.optim import AdamW

# model_name = "bert-base-uncased"
model_name = "bert-base-multilingual-uncased"
# model = BertForSequenceClassification.from_pretrained(
    # Use the 12-layer English BERT model, with an uncased vocab.
    # model_name,
    # num_labels = 15, # The number of output labels.   
    # output_attentions = False, # Whether the model returns attentions weights.
    # output_hidden_states = False, # Whether the model returns all hidden-states.
# )
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels = 15, 
)
# Tell pytorch to run this model on the GPU.
model.cuda()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

# TODO: ACTION REQUIRED #

Define your fine-tuning hyperparameters in the cell below (we have randomly picked some values to start with). We want you to experiment with different configurations to find the one that works best (i.e., highest accuracy) on your validation set. Feel free to also change pretrained models to others available in the HuggingFace library (you'll have to modify the cell above to do this). You might find papers on BERT fine-tuning stability (e.g., [Mosbach et al., ICLR 2021](https://openreview.net/pdf?id=nzpLWnVAyah)) to be of interest.

In [14]:
batch_size = 50
# you can change lr and eps values in the AdamW call if you like
lr = 1e-4
optimizer = AdamW(model.parameters(), lr = lr) #with default values of learning rate and epsilon value
epochs = 10

# Fine-tune your model
Here we provide code for fine-tuning your model, monitoring the loss, and checking your validation accuracy. Rerun both of the below cells when you change your hyperparameters above.

In [74]:
# function to get validation accuracy
def get_validation_performance(val_set):
    # Put the model in evaluation mode
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0

    num_batches = int(len(val_set)/batch_size) + 1

    total_correct = 0

    for i in range(num_batches):

      end_index = min(batch_size * (i+1), len(val_set))

      batch = val_set[i*batch_size:end_index]
      
      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])
      
      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device)
        
      # Tell pytorch not to bother with constructing the compute graph during
      # the forward pass, since this is only needed for backprop (training).
      with torch.no_grad():        

        # Forward pass, calculate logit predictions.
        # Note: this line of code might need to change depending on the model
        # the current line will work for bert-base-uncased
        # please refer to huggingface documentation for other models
        outputs = model(b_input_ids, 
                                attention_mask=b_input_mask,
                                labels=b_labels)
                                # token_type_ids=None, 
        loss = outputs.loss
        logits = outputs.logits
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()
        
        # Move logits and labels to CPU
        logits = (logits).detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()


        # Calculate the number of correctly labeled examples in batch
        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = np.argmax(label_ids, axis=1).flatten()

        num_correct = np.sum(pred_flat == labels_flat)
        total_correct += num_correct
        
    # Report the final accuracy for this validation run.
    print("Num of correct predictions =", total_correct)
    avg_val_accuracy = total_correct / len(val_set)
    return avg_val_accuracy, pred_flat, labels_flat



In [16]:
import random
seed_everything()

# training loop

# For each epoch...
for epoch_i in range(0, epochs):
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode.
    model.train()

    # For each batch of training data...
    num_batches = int(len(train_set)/batch_size) + 1

    for i in range(num_batches):
      end_index = min(batch_size * (i+1), len(train_set))

      batch = train_set[i*batch_size:end_index]

      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])

      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device) 

      optimizer.zero_grad()

      # Perform a forward pass (evaluate the model on this training batch).
      # this line of code might need to change depending on the model
      # outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
      outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
      
      loss = outputs.loss
      logits = outputs.logits

      total_train_loss += loss.item() 

      # Perform a backward pass to calculate the gradients.
      loss.backward()

      # Update parameters and take a step using the computed gradient.
      optimizer.step()
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set. Implement this function in the cell above.
    print(f"Total loss: {total_train_loss}")
    val_acc, _, _ = get_validation_performance(val_set)
    print(f"Validation accuracy: {val_acc}")
    
print("")
print("Training complete!")

# TODO: SAVE YOUR MODEL HERE... (Refer PyTorch documentation for how to save models)
import datetime 
datetime_str = datetime.datetime.now().strftime('%d%m%y_%H%M')
# torch.save(model.state_dict(), f"./model_chkpts/{model_name}_{lr}-{datetime_str}")
# torch.save(model.state_dict(), f"./model_chkpts/{model_name}_{lr}-{datetime_str}")

model.save_pretrained(f"./model_chkpts/datetime_str")


Training...
Total loss: 74.63601944548239
Num of correct predictions = 2827
Validation accuracy: 0.8266081871345029

Training...
Total loss: 27.559611477300372
Num of correct predictions = 2902
Validation accuracy: 0.8485380116959065

Training...
Total loss: 22.36485901166143
Num of correct predictions = 2861
Validation accuracy: 0.8365497076023392

Training...
Total loss: 20.90613511565004
Num of correct predictions = 2898
Validation accuracy: 0.8473684210526315

Training...
Total loss: 20.13559644352039
Num of correct predictions = 2896
Validation accuracy: 0.8467836257309942

Training...
Total loss: 19.638587995649402
Num of correct predictions = 2894
Validation accuracy: 0.8461988304093567

Training...
Total loss: 19.108792634117407
Num of correct predictions = 2900
Validation accuracy: 0.847953216374269

Training...
Total loss: 18.68482497637267
Num of correct predictions = 2901
Validation accuracy: 0.8482456140350877

Training...
Total loss: 18.448451487382023
Num of correct pre

# Evaluate your model on the test set
After you're satisfied with your hyperparameters (i.e., you're unable to achieve higher validation accuracy by modifying them further), it's time to evaluate your model on the test set! Run the below cell to compute test set accuracy.


In [75]:
seed_everything()

# If your notebook disconnects during training, then here, first load the best
# model you saved (refer PyTorch docs), then check validation performance

test_acc, test_pred, test_labels = get_validation_performance(test_set)
print(len(test_set))

Num of correct predictions = 2867
3420


## Question 8 (10 points):
Finally, perform an *error analysis* on your model. This is good practice for your final project. Write some code in the below code cell to print out the text of up to five test set examples that your model gets **wrong**. If your model gets more than five test examples wrong, randomly choose five of them to analyze. If your model gets fewer than five examples wrong, please design five test examples that fool your model (i.e., *adversarial examples*). Then, in the following text cell, perform a qualitative analysis of these examples. See if you can figure out any reasons for errors that you observe, or if you have any informed guesses (e.g., common linguistic properties of these particular examples). Does this analysis suggest any possible future steps to improve your classifier?

In [18]:
seed_everything()
torch.cuda.empty_cache()


## print out up to 5 test set examples (or adversarial examples) that your model gets wrong
## YOUR ERROR ANALYSIS CODE HERE

# Get index of all wrong labels. 
wrong_labels_mask = (test_pred != test_labels)
wrong_labels_idx = np.argwhere(wrong_labels_mask).flatten()

# Labels of wrong idx.
wrong_pred_array = test_pred[wrong_labels_idx]
wrong_label_array = test_labels[wrong_labels_idx]

# Select 5 random wrong labels
random_wrong_idx = np.random.choice(wrong_labels_idx, 5)

for wrong_idx in random_wrong_idx:
    wrong_pred = test_pred[wrong_idx]
    wrong_lbl = test_labels[wrong_idx]
    print(f"(Pred,Lbl) - ({wrong_pred}, {wrong_lbl}) Text - {test_text[wrong_idx]}")

(Pred,Lbl) - (13, 4) Text - It is clear that the results of the interaction and integration of people from different cultures have brought many benefits and cultural enrichment.
(Pred,Lbl) - (3, 2) Text - ফেডারেল পদক্ষেপের অনুপস্থিতিতে, কিছু রাজ্য অভিবাসীদের জন্য কভারেজের ফাঁক পূরণ করছে।
(Pred,Lbl) - (13, 4) Text - It is clear that the results of the interaction and integration of people from different cultures have brought many benefits and cultural enrichment.
(Pred,Lbl) - (13, 4) Text - It is clear that the results of the interaction and integration of people from different cultures have brought many benefits and cultural enrichment.
(Pred,Lbl) - (3, 2) Text - ফেডারেল পদক্ষেপের অনুপস্থিতিতে, কিছু রাজ্য অভিবাসীদের জন্য কভারেজের ফাঁক পূরণ করছে।


# *Unlabeled Data Section*


In [23]:
# *Working with unlabled data

unlabelled_test_df = pd.read_json("./data/HW2_unlabeled_test_set.json")

# this assumes that the column containing the text is called "sentence"
texts = unlabelled_test_df.sentence.values

### tokenize_and_format() is a helper function provided in helpers.py ###
### Male sure you use the correct model name for your tokenizer! ###
input_ids, attention_masks = tokenize_and_format(texts)

# print(texts)
# print(labels)

# Convert the lists into tensors.
output_input_ids = torch.cat(input_ids, dim=0)
output_attention_masks = torch.cat(attention_masks, dim=0)

# Print sentence 0, now as a list of IDs.
print('Original: ', texts[0])
print('Token IDs:', input_ids[0])

print(len(unlabelled_test_df))
print(unlabelled_test_df.head())

Original:  గత 15 సంవత్సరాలుగా ఆర్లింగ్టన్ కౌంటీ బోర్డ్ సభ్యుడిగా ఉన్న ఫిసెట్ (D) వివాహం చేసుకున్నారు.
Token IDs: tensor([[  101,   848, 12584, 10217,   879, 12573, 47717, 47442, 12688, 12706,
           834, 32385, 16837, 48589, 22668,   868, 52079,   879, 44447, 13354,
         92233, 12706, 19043,   867, 51904,   113,   146,   114,   876, 14800,
         59389, 47012, 48190, 83692,   119,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]])
38100
   id                                           sentence predicted_label  \
0   1  గత 15 సంవత్సరాలుగా ఆర్లింగ్టన్ కౌంటీ బోర్డ్ సభ...            XXXX   
1   2  Soll illegalen Einwanderern der Führerschein v...            XXXX   
2   3  由 D-San Francisco 议员 Mark Leno 赞助。大会司法委员会于 4 月...            XXXX   
3   4  एक वरिष्ठ प्रशासनिक अधिकारीले भने कि रेनो घोषण...        

In [56]:
test_set = [(input_ids[i], attention_masks[i])
                       for i in range(len(output_input_ids))]

print(len(test_set))


38100


In [21]:
model_test = AutoModelForSequenceClassification.from_pretrained(
    r"./model_chkpts", local_files_only=True,
    num_labels=15,
)
# Tell pytorch to run this model on the GPU.
model_test.cuda()


In [22]:
labels_dict = {
    0: 'Economic' ,
    1: 'Capacity and Resources',
    2: 'Morality',
    3: 'Fairness and Equality',
    4: 'Legality, Constitutionality, Jurisdiction',
    5: 'Policy Prescription and Evaluation',
    6: 'Crime and Punishment',
    7: 'Security and Defense',
    8: 'Health and Safety',
    9: 'Quality of Life',
    10: 'Cultural Identity',
    11: 'Public Sentiment',
    12: 'Political',
    13: 'External Regulation and Reputation',
    14: 'Other'
}

print(labels_dict[12.0])


Political


In [52]:
def get_prediction_labels(model, test_set):

  model.eval()
  
  # Tracking variables

  num_batches = int(len(test_set)/batch_size) + 1


  label_preds = []
  for i in range(num_batches):

    end_index = min(batch_size * (i+1), len(test_set))

    batch = test_set[i*batch_size:end_index]

    if len(batch) == 0:
        continue

    input_id_tensors = torch.stack([data[0] for data in batch]).squeeze(1)
    input_mask_tensors = torch.stack([data[1] for data in batch])

    # Move tensors to the GPU
    b_input_ids = input_id_tensors.to(device)
    b_input_mask = input_mask_tensors.to(device)

    # Tell pytorch not to bother with constructing the compute graph during
      # the forward pass, since this is only needed for backprop (training).
    with torch.no_grad():

      # Forward pass, calculate logit predictions.
      # Note: this line of code might need to change depending on the model
      # the current line will work for bert-base-uncased
      # please refer to huggingface documentation for other models
      outputs = model(b_input_ids,
                      attention_mask=b_input_mask)
                      # token_type_ids=None,
      # print(outputs)
      logits = outputs.logits

      # Accumulate the validation loss.
      # total_eval_loss += loss.item()

      # Move logits and labels to CPU
      logits = (logits).detach().cpu().numpy()
      # label_ids = b_labels.to('cpu').numpy()

        # Calculate the number of correctly labeled examples in batch
      pred_flat = np.argmax(logits, axis=1).flatten()
      label_preds.extend(pred_flat)
      # labels_flat = np.argmax(label_ids, axis=1).flatten()
  return label_preds


In [57]:
label_pred = get_prediction_labels(model_test, test_set)

print(label_pred)
print(len(label_pred))

[9, 2, 5, 12, 12, 14, 9, 14, 14, 12, 5, 11, 11, 12, 5, 9, 12, 12, 5, 12, 12, 9, 9, 11, 1, 5, 12, 5, 12, 12, 5, 9, 9, 9, 5, 12, 12, 2, 14, 2, 14, 4, 3, 11, 9, 4, 7, 12, 2, 5, 1, 12, 3, 12, 10, 12, 12, 7, 0, 0, 12, 4, 11, 5, 0, 12, 11, 5, 2, 5, 4, 5, 12, 5, 5, 12, 9, 12, 12, 12, 2, 10, 2, 11, 7, 5, 11, 12, 5, 12, 11, 12, 13, 14, 11, 12, 9, 5, 12, 9, 11, 12, 2, 9, 9, 2, 2, 3, 12, 9, 14, 2, 12, 5, 14, 11, 5, 5, 9, 9, 13, 13, 5, 12, 1, 0, 10, 12, 12, 4, 12, 12, 4, 9, 10, 12, 2, 5, 5, 4, 12, 12, 12, 5, 9, 10, 5, 5, 12, 7, 14, 9, 2, 12, 14, 12, 13, 11, 11, 12, 5, 5, 3, 7, 12, 11, 12, 2, 2, 3, 0, 9, 2, 12, 5, 12, 12, 11, 5, 5, 4, 5, 4, 3, 0, 9, 12, 9, 3, 9, 2, 7, 5, 12, 9, 12, 11, 3, 4, 13, 12, 12, 12, 12, 11, 2, 5, 12, 3, 9, 2, 0, 2, 1, 0, 5, 12, 2, 5, 12, 4, 3, 12, 5, 2, 9, 3, 9, 2, 11, 12, 5, 12, 12, 0, 11, 5, 5, 12, 8, 2, 7, 11, 11, 0, 2, 4, 0, 13, 13, 12, 12, 12, 11, 11, 6, 2, 12, 6, 12, 2, 4, 13, 12, 9, 9, 10, 2, 12, 9, 11, 2, 12, 12, 4, 2, 12, 10, 2, 0, 2, 2, 2, 0, 8, 12, 11, 4, 12, 12,

In [59]:
# Fill in the unlabeled dataset
unlabelled_test_df["label_ID"] = label_pred
unlabelled_test_df["predicted_label"] = unlabelled_test_df["label_ID"].apply(
    lambda x: labels_dict[x])

print(unlabelled_test_df.head())


   id                                           sentence  \
0   1  గత 15 సంవత్సరాలుగా ఆర్లింగ్టన్ కౌంటీ బోర్డ్ సభ...   
1   2  Soll illegalen Einwanderern der Führerschein v...   
2   3  由 D-San Francisco 议员 Mark Leno 赞助。大会司法委员会于 4 月...   
3   4  एक वरिष्ठ प्रशासनिक अधिकारीले भने कि रेनो घोषण...   
4   5  The panel Wednesday approved the resolution in...   

                      predicted_label  label_ID language  
0                     Quality of Life         9       te  
1                            Morality         2       de  
2  Policy Prescription and Evaluation         5    zh-CN  
3                           Political        12       ne  
4                           Political        12       en  


In [61]:
unlabelled_test_df.to_json("class_output.json", orient="records")

### *DESCRIBE YOUR QUALITATIVE ANALYSIS OF THE ABOVE EXAMPLES IN YOUR REPORT*



---

Finished? Remember to upload the PDF file of this notebook, report and your three dataset files (annotator1.tsv, annotator2.tsv, and final_data.tsv) to Gradescope with the filename line formatted as **Firstname_Lastname_HW2**.
