<a href="https://colab.research.google.com/github/Aleena24/Large-Language-Model/blob/main/lab1_TransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer learning: fine-tuning a language model on a Question Answering/Machine Reading Comprehension dataset

In [3]:
!pip install datasets



In [4]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from datasets import load_dataset
from transformers import TrainingArguments, Trainer
import torch

In [5]:
dataset = load_dataset("rajpurkar/squad", split={'train': 'train[:10%]', 'validation': 'validation[:10%]'})
print(dataset['train'][0])
print(dataset['validation'][0])


{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
{'id': '56be4db0acb8001400a502ec', 'title': 'Super_Bow
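Each SQuAD record stores its answer as a text string plus a character offset (`answer_start`) into `context`. The short check below sketches how the two fit together. It uses a shortened excerpt of the first training record, so the offset is recomputed with `str.find` rather than taken from the dataset; both the excerpt and the recomputed offset are illustrative assumptions.

```python
# Shortened excerpt of the first training record (illustrative, not the full context).
record = {
    "context": (
        "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
        "reputedly appeared to Saint Bernadette Soubirous in 1858."
    ),
    "answers": {"text": ["Saint Bernadette Soubirous"]},
}

answer_text = record["answers"]["text"][0]

# In the real dataset `answer_start` is provided; here it is recomputed
# because the excerpt is shorter than the original context.
answer_start = record["context"].find(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end, in characters

# Slicing the context with the character span must reproduce the answer text.
assert record["context"][answer_start:answer_end] == answer_text
print(answer_start, answer_end)
```

The preprocessing step later in this notebook relies on exactly this character span when it converts answers into token positions.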

# Loading the pre-trained bert-base-uncased model

In [6]:
model_name = "bert-base-uncased"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
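The warning above is expected: the encoder weights come from the checkpoint, while the question answering head (`qa_outputs`) is new and starts from random values. As a rough mental model, that head is a single linear layer that maps each token's hidden vector to two scores, a start logit and an end logit. The plain-Python sketch below illustrates the shape of that computation; the helper names, the random hidden states, and the 0.02 initialisation scale are illustrative assumptions, not the library's code.

```python
import random

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased


def init_qa_head(hidden_size, seed=0):
    """Randomly initialise a (2 x hidden_size) weight matrix and a zero bias."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(2)]
    bias = [0.0, 0.0]
    return weight, bias


def qa_head(hidden_states, weight, bias):
    """Map each token vector to a (start_logit, end_logit) pair."""
    start_logits, end_logits = [], []
    for h in hidden_states:
        start_logits.append(sum(w * x for w, x in zip(weight[0], h)) + bias[0])
        end_logits.append(sum(w * x for w, x in zip(weight[1], h)) + bias[1])
    return start_logits, end_logits


weight, bias = init_qa_head(HIDDEN_SIZE)

# Five fake token vectors standing in for the encoder output.
rng = random.Random(1)
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(5)]

start_logits, end_logits = qa_head(hidden_states, weight, bias)
n_new_params = 2 * HIDDEN_SIZE + 2  # weight entries plus bias entries
print(len(start_logits), len(end_logits), n_new_params)
```

Because these parameters start out random, predictions are meaningless until the model has been fine-tuned.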


# Fine-Tuning

In [7]:
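def train_data(examples):
    """Tokenize question/context pairs and label each feature with answer token positions."""
    # Long contexts are split into overlapping windows (features): only the
    # context ("only_second") is truncated, and consecutive windows share
    # `stride` tokens.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=128,
        stride=32,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )
    # Maps each feature back to the example it was cut from.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # Character (start, end) offsets in the source text for every token.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        # The [CLS] position is used as the label when the answer is not in this window.
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # sequence_ids: 0 for question tokens, 1 for context tokens, None for special/padding tokens.
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Character span of the (first) gold answer in the original context.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        answer_start = answers["answer_start"][0]
        answer_end = answer_start + len(answers["text"][0])

        # First token of the context within this window.
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1

        # Last token of the context within this window.
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        # Answer not fully contained in this window: label the feature with [CLS].
        if not (offsets[token_start_index][0] <= answer_start and offsets[token_end_index][1] >= answer_end):
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Advance to the last token that starts at or before the answer start.
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= answer_start:
                token_start_index += 1
            tokenized_examples["start_positions"].append(token_start_index - 1)

            # Move back to the first token that ends at or after the answer end.
            while offsets[token_end_index][1] >= answer_end:
                token_end_index -= 1
            tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples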

train_dataset = dataset["train"].map(train_data, batched=True, remove_columns=dataset["train"].column_names)
validation_dataset = dataset["validation"].map(train_data, batched=True, remove_columns=dataset["validation"].column_names)
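The preprocessing above does two things: it cuts each context into overlapping windows, and it converts the character-level answer span into token positions for every window, falling back to a "no answer here" label when the window does not contain the whole answer. The toy version below sketches that logic with a whitespace tokenizer. All helper names are hypothetical, the question and special tokens are left out, and `None` stands in for the `[CLS]` fallback used in the notebook.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (char_start, char_end) for every token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def make_windows(n_tokens, window, overlap):
    """Return (start, end) token ranges; consecutive windows share `overlap` tokens."""
    step = window - overlap
    windows, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += step
    return windows


def label_window(offsets, answer_start, answer_end):
    """Token (start, end) of the answer inside one window, or None if not fully inside."""
    if not offsets or offsets[0][0] > answer_start or offsets[-1][1] < answer_end:
        return None
    # Last token starting at or before the answer start.
    start_idx = max(i for i, (s, _) in enumerate(offsets) if s <= answer_start)
    # First token ending at or after the answer end.
    end_idx = min(i for i, (_, e) in enumerate(offsets) if e >= answer_end)
    return start_idx, end_idx


context = "Immediately behind the basilica is the Grotto a Marian place of prayer and reflection"
answer = "the Grotto"
answer_start = context.find(answer)
answer_end = answer_start + len(answer)

offsets = tokenize_with_offsets(context)
windows = make_windows(len(offsets), window=6, overlap=2)
labels = [label_window(offsets[s:e], answer_start, answer_end) for s, e in windows]

for (s, e), label in zip(windows, labels):
    print((s, e), label)
```

Only the window that fully contains the answer receives real token positions; the other windows are labelled as not containing it, which is how a single example can produce several training features.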



# Training the Model

In [8]:
!pip install accelerate -U



In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    save_steps=500,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
)

trainer.train()




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.5222        | 1.150925        |
| 1000 | 1.0003        | 0.995923        |
| 1500 | 0.6245        | 1.037585        |
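In the log above, training loss keeps falling while validation loss is lowest at step 1000 and rises again at step 1500, which is a common sign that the model is starting to overfit. A small sketch of how one might read this off the logged values (the variable names are illustrative; the numbers are copied from the table):

```python
# (step, training_loss, validation_loss) copied from the training log above.
log = [
    (500, 1.5222, 1.150925),
    (1000, 1.0003, 0.995923),
    (1500, 0.6245, 1.037585),
]

# Checkpoint with the lowest validation loss.
best_step, _, best_val_loss = min(log, key=lambda row: row[2])

# Steps where training loss fell but validation loss rose relative to the
# previous evaluation, i.e. where the two curves start to diverge.
diverging_steps = [
    cur[0]
    for prev, cur in zip(log, log[1:])
    if cur[1] < prev[1] and cur[2] > prev[2]
]

print(best_step, best_val_loss, diverging_steps)
```

With only three evaluation points this is suggestive rather than conclusive, but it indicates that the checkpoint saved around step 1000 may generalise better than the final one.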


# Model Evaluation

In [None]:
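def ask_question(question, context, model, tokenizer):
    """Return the answer span the model predicts for `question` within `context`."""
    device = next(model.parameters()).device
    model.eval()  # switch off dropout, which is still active after training

    # Truncate the context if question + context exceed the model's maximum input length.
    inputs = tokenizer(question, context, add_special_tokens=True,
                       truncation="only_second", return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Most likely start and end token positions (end is made exclusive for slicing).
    answer_start = torch.argmax(answer_start_scores).item()
    answer_end = torch.argmax(answer_end_scores).item() + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    return answer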

context = '''A Large Language Model (LLM) is a type of artificial intelligence designed to
            understand and generate human language. These models are trained on vast amounts of text data,
            enabling them to comprehend context, generate coherent text, translate languages, and perform other
            complex language-related tasks. Prominent examples include OpenAI's GPT-4 and Google's BERT.
            They leverage deep learning techniques, specifically transformer architectures, to process and produce
            text that mimics human language patterns. LLMs have a wide range of applications, from chatbots and content
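            creation to aiding in research and improving accessibility for diverse linguistic needs.'''
question = "Examples of LLM?"
# Use the GPU when one is available; otherwise fall back to the CPU.
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')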
answer = ask_question(question, context, model, tokenizer)
print(answer)


to protect the chinese states and empires against the raids and invasions of the various nomadic groups of the eurasian steppe.
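`ask_question` picks the start and end positions with two independent `argmax` calls, so nothing prevents the predicted end from landing before the predicted start, in which case the slice is empty. A common alternative is to score every valid (start, end) pair jointly and keep the best one. The sketch below shows the idea on plain Python lists; the logits and the `max_answer_len` limit are hypothetical values chosen for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, with end >= start."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best_start, best_end = score, s, e
    return best_start, best_end


# Hypothetical logits for a 6-token input.
start_logits = [0.1, 0.2, 0.3, 2.5, 0.4, 0.1]
end_logits = [0.2, 3.0, 0.1, 0.3, 2.0, 0.2]

# Independent argmax, as in ask_question: the end can precede the start.
naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)

joint_start, joint_end = best_span(start_logits, end_logits)

print((naive_start, naive_end), (joint_start, joint_end))
```

With real model outputs, the same function could be applied to `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`; restricting the search to context tokens would be a further refinement.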
