# BERT_fine-tune

This notebook fine-tunes the [**bert-base-uncased**](https://huggingface.co/bert-base-uncased) pre-trained language model on the [**CLOTH**](https://www.cs.cmu.edu/~glai1/data/cloth/) or [**DGen**](https://github.com/DRSY/DGen) dataset.

* Paper: "CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model"
* Author: AndyChiangSH
* Time: 2022/10/15
* GitHub: https://github.com/AndyChiangSH/CDGP

## Download datasets

### CLOTH

In [None]:
!wget https://github.com/AndyChiangSH/CDGP/raw/main/datasets/CLOTH.zip

In [None]:
!unzip ./CLOTH.zip -d ./CLOTH

### DGen

In [None]:
!wget https://github.com/AndyChiangSH/CDGP/raw/main/datasets/DGen.zip

In [None]:
!unzip ./DGen.zip -d ./DGen

## Data preprocessing

Run only one of the two loading cells below. Both assign to `dataset`, so whichever runs last determines the data used for masking and fine-tuning.

### CLOTH

In [None]:
import json

with open("./CLOTH/CLOTH_train_cleaned.json", "r") as file:
    dataset = json.load(file)

print(len(dataset))
print(dataset[0])

### DGen

In [None]:
import json

with open("./DGen/DGen_train_cleaned.json", "r") as file:
    dataset = json.load(file)

print(len(dataset))
print(dataset[0])

### Data masking

In [None]:
from tqdm.notebook import tqdm

input_list = []
label_list = []

for data in tqdm(dataset):
  answer = data["answer"]
  distractors = data["distractors"]
  sentence = data["sentence"]
  # Replace the blank with BERT's mask token and append the answer after [SEP]
  mask_sentence = sentence.replace("**blank**", "[MASK]")
  mask_sentence += " [SEP] " + answer
  # One training pair per distractor: the masked input, and the same text with the distractor filled in
  for distractor in distractors:
    dis_sentence = mask_sentence.replace("[MASK]", distractor)
    input_list.append(mask_sentence)
    label_list.append(dis_sentence)
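
To make the transformation concrete, here is a minimal sketch of the same masking logic applied to a single record. The record is made up for illustration and the helper name `build_pairs` is hypothetical. The field names (`sentence`, `answer`, `distractors`) and the `**blank**` placeholder follow the loop above.

```python
def build_pairs(record):
    """Return (input, label) pairs for one cloze record, mirroring the masking loop."""
    masked = record["sentence"].replace("**blank**", "[MASK]")
    masked += " [SEP] " + record["answer"]
    pairs = []
    for distractor in record["distractors"]:
        # The label is the masked input with the distractor filled into the blank
        pairs.append((masked, masked.replace("[MASK]", distractor)))
    return pairs


# Hypothetical record, for illustration only
example = {
    "sentence": "The old man was waiting for a ride across the **blank** .",
    "answer": "river",
    "distractors": ["town", "country", "island"],
}

pairs = build_pairs(example)
for model_input, label in pairs:
    print(model_input, "->", label)
```

Each record therefore contributes as many training pairs as it has distractors, and every pair for that record shares the same input text.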

In [None]:
print("input_list:", len(input_list))
print(input_list[:10])

In [None]:
print("label_list:", len(label_list))
print(label_list[:10])

## Fine-tune BERT

In [None]:
!pip install transformers datasets

In [None]:
PLM = "bert-base-uncased"
BATCH_SIZE = 64
EPOCH = 1
LR = 0.0001
MAX_LENGTH = 64

### Set up the Dataset

In [None]:
data_dic = {"input": input_list, "label": label_list}

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict(data_dic)

In [None]:
print(len(dataset))

### Set up the DataLoader

In [None]:
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

In [None]:
print(len(dataloader))
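
The printed value is the number of batches per epoch. As a quick sanity check, it can be reproduced by hand: with the default `drop_last=False`, a `DataLoader` yields the example count divided by the batch size, rounded up. The helper `num_batches` and the example count of 1000 below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Batches per epoch for a map-style dataset of the given size."""
    if drop_last:
        # The final, incomplete batch is discarded
        return num_examples // batch_size
    # The final batch may be smaller than batch_size
    return math.ceil(num_examples / batch_size)


print(num_batches(1000, 64))                  # 16
print(num_batches(1000, 64, drop_last=True))  # 15
```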

### Fine-tune the model

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained(PLM)
model = BertForMaskedLM.from_pretrained(PLM, return_dict=True)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
print(device)

In [None]:
# progress bar
num_training_steps = EPOCH * len(dataloader)
progress_bar = tqdm(range(num_training_steps))

# start fine-tuning
loss_history = []
for epoch in range(EPOCH):
  for batch in dataloader:
    # Tokenize the masked inputs and the distractor-filled labels to the same fixed length
    inputs = tokenizer(batch["input"], truncation=True, padding="max_length", max_length=MAX_LENGTH, return_tensors="pt")
    labels = tokenizer(batch["label"], truncation=True, padding="max_length", max_length=MAX_LENGTH, return_tensors="pt")["input_ids"]
    optimizer.zero_grad()
    output = model(**inputs.to(device), labels=labels.to(device))
    loss = output.loss
    loss_history.append(loss.item())
    loss.backward()
    optimizer.step()
    progress_bar.update(1)

  # Loss of the last batch in this epoch
  print(f"[epoch {epoch+1}] loss: {loss.item()}")

### Show the loss line chart

In [None]:
print(loss_history)
print(len(loss_history))

In [None]:
# plot the training loss curve
import matplotlib.pyplot as plt

plt.plot(loss_history)
plt.title('Training loss')
plt.ylabel('loss')
plt.xlabel('batch')
plt.legend(['loss'], loc='upper right')
plt.show()
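
Per-batch loss is usually noisy, so a smoothed curve can be easier to read. The following is a minimal sketch of a trailing moving average. The helper `moving_average` and its default window of 50 are assumptions for illustration and are not part of the original notebook.

```python
def moving_average(values, window=50):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


print(moving_average([4.0, 2.0, 3.0, 1.0], window=2))  # [4.0, 3.0, 2.5, 2.0]
```

Passing `moving_average(loss_history)` to `plt.plot` in place of `loss_history` would draw the smoothed curve.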

### Save the model

In [None]:
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained("./cdgp-csg-bert-dgen")

### Delete the model

In [None]:
del model
del model_to_save
torch.cuda.empty_cache()

## Testing

### Testing data

In [None]:
questions = {
    "q1": {
        "sentence": "To make Jane live a [MASK] life, Mother was very careful about spending money. [SEP] happy",
        "answer": "happy",
        "distractors": ["poor", "busy", "sad"]
    },
    "q2": {
        "sentence": "[MASK] , Jane didn't understand her. [SEP] However",
        "answer": "However",
        "distractors": ["Though", "Although", "Or"]
    },
    "q3": {
        "sentence": "Every day Mother was busy with her [MASK] while Jane was studying at school, so they had little time to enjoy themselves. [SEP] work",
        "answer": "work",
        "distractors": ["writing", "housework", "research"]
    },
    "q4": {
        "sentence": "One day, Mother realized Jane was unhappy and even [MASK] to her. [SEP] unfriendly",
        "answer": "unfriendly",
        "distractors": ["loyal", "kind", "cruel"]
    },
    "q5": {
        "sentence": "The old man was waiting for a ride across the [MASK] . [SEP] river",
        "answer": "river",
        "distractors": ["town", "country", "island"]
    },
    "q6": {
        "sentence": "I felt uncomfortable and out of place as the professor carefully [MASK] what she expected us to learn. [SEP] explained",
        "answer": "explained",
        "distractors": ["showed", "designed", "offered"]
    },
    "q7": {
        "sentence": "As I listened, I couldn't help but [MASK] of my own oldest daughter. [SEP] think",
        "answer": "think",
        "distractors": ["speak", "talk", "hear"]
    },
    "q8": {
        "sentence": "As we were [MASK] on the third floor for old people with Alzheimer, most of them stared off at the walls or floor. [SEP] singing",
        "answer": "singing",
        "distractors": ["meeting", "gathering", "dancing"]
    },
    "q9": {
        "sentence": "As we got [MASK] with each song, she did as well. [SEP] louder",
        "answer": "louder",
        "distractors": ["higher", "nearer", "faster"]
    },
    "q10": {
        "sentence": "Mr. Petri, [MASK] injured in the fire, was rushed to hospital. [SEP] seriously",
        "answer": "seriously",
        "distractors": ["blindly", "hardly", "slightly"]
    },
    "q11": {
        "sentence": "If an object is attracted to a magnet, the object is most likely made of [MASK]. [SEP] metal",
        "answer": "metal",
        "distractors": ["wood", "plastic", "cardboard"]
    },
    "q12": {
        "sentence": "the main organs of the respiratory system are [MASK]. [SEP] lungs",
        "answer": "lungs",
        "distractors": ["ovaries", "intestines", "kidneys"]
    },
    "q13": {
        "sentence": "The products of photosynthesis are glucose and [MASK] else. [SEP] oxygen",
        "answer": "oxygen",
        "distractors": ["carbon", "hydrogen", "nitrogen"]
    },
    "q14": {
        "sentence": "frogs have [MASK] eyelid membranes. [SEP] three",
        "answer": "three",
        "distractors": ["two", "four", "one"]
    },
    "q15": {
        "sentence": "the only known planet with large amounts of water is [MASK]. [SEP] earth",
        "answer": "earth",
        "distractors": ["saturn", "jupiter", "mars"]
    },
    "q16": {
        "sentence": "[MASK] is responsible for erosion by flowing water and glaciers. [SEP] gravity",
        "answer": "gravity",
        "distractors": ["kinetic", "electromagnetic", "weight"],
    },
    "q17": {
        "sentence": "Common among mammals and insects , pheromones are often related to [MASK] type of behavior. [SEP] reproductive",
        "answer": "reproductive",
        "distractors": ["aggressive", "immune", "cardiac"]
    },
    "q18": {
        "sentence": "[MASK] can reproduce by infecting the cell of a living host. [SEP] virus",
        "answer": "virus",
        "distractors": ["bacteria", "mucus", "carcinogens"]
    },
    "q19": {
        "sentence": "proteins are encoded by [MASK]. [SEP] genes",
        "answer": "genes",
        "distractors": ["DNA", "RNA", "codons"]
    },
    "q20": {
        "sentence": "Producers at the base of ecological food webs are also known as [MASK]. [SEP] autotrophic",
        "answer": "autotrophic",
        "distractors": ["endoscopic", "symbiotic", "mutualistic"],
    },
    "q21": {
        "sentence": "Today morning, I saw a [MASK] sitting on the wall. [SEP] cat",
        "answer": "cat",
        "distractors": [],
    },
    "q22": {
        "sentence": "Ukrainian presidential adviser says situation is ' [MASK] control' in suburbs and outskirts of Kyiv. [SEP] under",
        "answer": "under",
        "distractors": [],
    },
    "q23": {
        "sentence": "I don't think that after what is [MASK] now, Ukraine has weak positions. [SEP] happening",
        "answer": "happening",
        "distractors": [],
    },
}

### Load the model

In [None]:
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained(PLM)
model = BertForMaskedLM.from_pretrained("./cdgp-csg-bert-dgen")
model.eval()

### Generate distractors

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask", tokenizer=tokenizer, model=model, top_k=10)

In [None]:
unmasker(questions["q1"]["sentence"])
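
For a sentence with a single `[MASK]`, the `fill-mask` pipeline returns a list of dictionaries, each containing a `token_str` and a `score` among other keys. The model may rank the answer itself among the candidates, so a simple post-processing step is to drop it and keep the first few remaining words. The sketch below shows one such filter. The helper `pick_distractors` is hypothetical, the predictions are mock values rather than real model output, and this filter does not reproduce any ranking or selection step from the paper.

```python
def pick_distractors(predictions, answer, k=3):
    """Keep the first k predicted words that differ from the answer (case-insensitive)."""
    picked = []
    seen = {answer.strip().lower()}
    for pred in predictions:
        word = pred["token_str"].strip()
        if word.lower() in seen:
            # Skip the answer itself and any repeated candidate
            continue
        seen.add(word.lower())
        picked.append(word)
        if len(picked) == k:
            break
    return picked


# Mock predictions shaped like fill-mask output; the scores are made up
mock_predictions = [
    {"token_str": "happy", "score": 0.30},
    {"token_str": "poor", "score": 0.20},
    {"token_str": "busy", "score": 0.15},
    {"token_str": "Poor", "score": 0.10},
    {"token_str": "sad", "score": 0.05},
]

print(pick_distractors(mock_predictions, "happy"))  # ['poor', 'busy', 'sad']
```

With the real pipeline, the call would look like `pick_distractors(unmasker(questions["q1"]["sentence"]), questions["q1"]["answer"])`.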