<a href="https://colab.research.google.com/github/SproutCoder/text_mining_23/blob/main/project_4_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 4: Huggingface

### Enter names and mat. numbers:
- Group Name PiKa

- Sebastian Pirozhkov, 421892
- Christopher Kaschny, 447930


(Large) language models constitute a paradigm shift in NLP. For this project, you may explore the world of fine-tuning by finding yourself a project to solve with a fine-tuned language model.

To do so, use Huggingface's transformers library and the pre-trained models it provides.
Suitable projects are, e.g.,
- text classification (for instance based on pre-trained *BERT model* with a text classification head)
- fine-tuning a generative language model to infer *cooking recipes* or *song texts* (for instance GPT-2, LLaMA, etc.)
- fine-tuning a generative language model to infer *prompts* for LLMs

Go to [Kaggle](https://www.kaggle.com/datasets) for more datasets and/or check [huggingface](https://huggingface.co/) and their [Task Guides](https://huggingface.co/docs/transformers/tasks/sequence_classification "click here for the text classification tutorial").

With the `from datasets import load_dataset` import, you may use huggingface's datasets.

## Question answering (extractive)

We chose to fine-tune a model for extractive question answering.
We are going to fine-tune DistilBERT on the SQuAD dataset and use the model for inference on a test example.

In [1]:
! pip install transformers datasets evaluate



### Load SQuAD dataset

We are going to load a subset of the SQuAD dataset from the Huggingface Datasets library to experiment with it.

In [2]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")



Split into training and test sets:

In [3]:
squad = squad.train_test_split(test_size=0.2)

In [4]:
# look at an example
squad["train"][0]

{'id': '56cca3676d243a140015f055',
 'title': 'IPod',
 'context': 'The games are in the form of .ipg files, which are actually .zip archives in disguise[citation needed]. When unzipped, they reveal executable files along with common audio and image files, leading to the possibility of third party games. Apple has not publicly released a software development kit (SDK) for iPod-specific development. Apps produced with the iPhone SDK are compatible only with the iOS on the iPod Touch and iPhone, which cannot run clickwheel-based games.',
 'question': 'What is the only operating system on which iPhone SDK-made games can be played?',
 'answers': {'text': ['iOS'], 'answer_start': [397]}}

### Preprocess

We are going to load the DistilBERT tokenizer to process the ```question``` and ```context``` fields:

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

One should be aware of a few preprocessing steps that are particular to question answering:

- Some examples may have a long ```context``` that exceeds the maximum input length of the model. To deal with that, we truncate only the ```context``` by setting ```truncation="only_second"```.
- We map the start and end positions of the answer to the original ```context``` by setting ```return_offsets_mapping=True```.
- Now we can find the start and end tokens of the answer using this mapping. We use the ```sequence_ids``` method to find which part of the offset mapping corresponds to the ```question``` and which corresponds to the ```context```.

We create a function that handles these preprocessing steps (truncate, then map the start/end tokens of the ```answer``` to the ```context```):

In [6]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]] # creates a list of questions stripped of leading/trailing whitespace
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second", # only the context is truncated (when exceeding max_length)
        return_offsets_mapping=True, # needed to identify the position of the answer in the original text
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i) #  Sequence IDs indicate which part of the offset mapping corresponds to the question and which part corresponds to the context

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
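
To make the token search in ```preprocess_function``` easier to follow, here is a minimal sketch of the same idea on a hand-written toy input. The offsets and sequence ids below are made up for illustration (they are not real tokenizer output), and `find_answer_tokens` is a hypothetical helper name. The sketch labels an answer `(0, 0)` whenever it is not fully contained in the (possibly truncated) context.

```python
def find_answer_tokens(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token)."""
    # Tokens with sequence id 1 belong to the context (0 = question, None = special/padding).
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_tokens[0], context_tokens[-1]

    # Answer (partly) cut off by truncation: use (0, 0) as the "no answer" label.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy input (hypothetical): question "Who?" and context "Bob won."
# Tokens:       [CLS]   who     ?       [SEP]   bob     won     .       [SEP]
sequence_ids = [None,   0,      0,      None,   1,      1,      1,      None]
offsets =      [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 7), (7, 8), (0, 0)]

bob_span = find_answer_tokens(offsets, sequence_ids, start_char=0, end_char=3)
won_span = find_answer_tokens(offsets, sequence_ids, start_char=4, end_char=7)
# If truncation had dropped the final "." token, the answer "won." (chars 4 to 8) would not fit.
cut_span = find_answer_tokens(offsets[:6] + [(0, 0)], sequence_ids[:6] + [None], 4, 8)
print(bob_span, won_span, cut_span)
```

In this toy case "Bob" maps to token 4, "won" maps to token 5, and the cut-off answer falls back to the `(0, 0)` label.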

We apply the preprocessing function to the entire dataset using Huggingface's ```map``` function.

In [7]:
tokenized_squad = squad.map(preprocess_function,
                            batched=True, # process multiple elements at once
                            remove_columns=squad["train"].column_names) # remove columns that are no longer needed

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

To create batches of examples without any additional preprocessing, we use Huggingface's ```DefaultDataCollator```.

In [8]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

### Train

In [9]:
# load DistilBERT
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to

In [10]:
! pip install -U accelerate
! pip install -U transformers




In [11]:
from torch.optim import AdamW

# define training parameters
training_args = TrainingArguments(
    output_dir="./qa_model_test",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

optimizer = AdamW(model.parameters(), lr=training_args.learning_rate) # use PyTorch's AdamW, since the AdamW implementation in transformers is deprecated

# Passing the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    optimizers=(optimizer, None)
)

# fine tune the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.092271 |
| 2 | 2.667000 | 1.596333 |
| 3 | 2.667000 | 1.547049 |


TrainOutput(global_step=750, training_loss=2.2371070556640626, metrics={'train_runtime': 116.4738, 'train_samples_per_second': 103.027, 'train_steps_per_second': 6.439, 'total_flos': 1175877900288000.0, 'train_loss': 2.2371070556640626, 'epoch': 3.0})
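
The numbers in the training output can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` defaults `logging_steps=500` and `save_steps=500` (none of these are set explicitly in the training arguments, so treat them as assumptions). Under these assumptions it reproduces the 750 global steps and suggests why the first epoch reports "No log" for the training loss and why a `checkpoint-500` folder is written.

```python
import math

train_examples = 4000   # size of the training split (80% of 5000)
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
logging_steps = 500     # assumed Trainer default
save_steps = 500        # assumed Trainer default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which the training loss is logged and a checkpoint is written.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
saved_at = list(range(save_steps, total_steps + 1, save_steps))

# Epoch (1-based) during which the first training-loss log happens.
first_logged_epoch = math.ceil(logged_at[0] / steps_per_epoch)

print(steps_per_epoch, total_steps, logged_at, saved_at, first_logged_epoch)
```

With 250 steps per epoch, the first training-loss log falls at the end of epoch 2, so the evaluation after epoch 1 has no training loss to show, and the value logged at step 500 is repeated for epoch 3.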

Now the model is fine-tuned for question answering.

### Evaluate

Evaluation for question answering requires a lot of postprocessing, so we skip it in this exploration of fine-tuning Huggingface models. A description of how to evaluate properly: https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt#postprocessing
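
As a rough idea of what that postprocessing involves, the sketch below picks an answer span from start and end logits. It is a simplified illustration, not the full procedure from the linked chapter: it works on plain Python lists with made-up logits, `best_span` and its parameters are hypothetical names, and it skips the mapping from token indices back to characters in the context.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring valid span, or None."""
    # Consider only the n_best highest-scoring start and end positions.
    top_starts = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:n_best]
    top_ends = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:n_best]

    best = None
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up logits for a 4-token input.
start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [0.2, 0.1, 3.0, 1.0]

# Taking each argmax independently gives start=3, end=2, which is not a valid span.
naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
best = best_span(start_logits, end_logits)
print(naive, best)
```

The point of the example is that the two argmax positions cannot simply be combined; the start and end scores have to be searched jointly under the constraint that the end does not precede the start.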

### Inference

Now we can use the fine-tuned model for inference.

In [12]:
# example from the Huggingface tutorial:
question_1 = "How many programming languages does BLOOM support?"
context_1 = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

In [14]:
# our example from the NLP textbook (Jurafsky & Martin):
question_2 = "What privacy issues might be there?"
context_2 = (
    "Potential Harms from Language Models Large pretrained neural language models exhibit many of the potential harms discussed in Chapter 4 and Chapter 6. Many of these harms become realized when pretrained language models are fine-tuned to downstream tasks, particularly those involving text generation, such as in assistive technologies like web search query completion, or predictive typing for email (Olteanu et al., 2020). For example, language models can generate toxic language. Gehman et al. (2020) show that many kinds of completely non-toxic prompts can nonetheless lead large language models to output hate speech and abuse. Brown et al. (2020) and Sheng et al. (2019) showed that large language models generate sentences displaying negative attitudes toward minority identities such as being Black or gay. Indeed, language models are biased in a number of ways by the distributions of their training data. Gehman et al. (2020) shows that large language model training datasets include toxic text scraped from banned sites, such as Reddit communities that have been shut down by Reddit but whose data may still exist in dumps. In addition to problems of toxicity, internet data is disproportionately generated by authors from developed countries, and many large language models trained on data from Reddit, whose authors skew male and young. Such biased population samples likely skew the resulting generation away from the perspectives or topics of underrepresented populations. Furthermore, language models can amplify demographic and other biases in training data, just as we saw for embedding models in Chapter 6. Language models can also be a tool for generating text for misinformation, phishing, radicalization, and other socially harmful activities (Brown et al., 2020). McGuffie and Newhouse (2020) show how large language models generate text that emulates online extremists, with the risk of amplifying extremist movements and their attempt to radicalize and recruit. "
    "Finally, there are important privacy issues. Language models, like other machine learning models, can leak information about their training data. It is thus possible for an adversary to extract individual training-data phrases from a language model such as an individual person’s name, phone number, and address (Henderson et al. 2017, Carlini et al. 2020). This is a problem if large language models are trained on private datasets such as electronic health records (EHRs). Mitigating all these harms is an important but unsolved research question in NLP. Extra pretraining (Gururangan et al., 2020) on non-toxic subcorpora seems to reduce a language model’s tendency to generate toxic language somewhat (Gehman et al., 2020). And analyzing the data used to pretrain large language models is important to understand toxicity and bias in generation, as well as privacy, making it extremely important that language models include datasheets (page 16) or model cards (page 76) giving full replicable information on the corpora used to train them."
)

We are going to use a pipeline to pass text into our model for question answering. (One could also do this manually.)

In [15]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="./qa_model_test")

In [None]:
question_answerer(question=question_1, context=context_1)

In [None]:
question_answerer(question=question_2, context=context_2)

So our model seems to be working :)

### Save folder (Colab specific)

Don't forget to save the folder ```qa_model_test``` locally or in Drive when running in Colab.


**How to save folder in Google Drive (Version 1)**

To save a folder from Google Colab to Google Drive, you can mount your Google Drive in Colab and then copy the folder to the desired location in your Drive. Here’s how you can do it:

Mount your Google Drive in Colab by running the following code:
```python
from google.colab import drive
drive.mount('/content/drive')
```
Once your drive is mounted, you can copy the folder from Colab to your Google Drive using the !cp command. For example, if you want to copy a folder named my_folder from Colab to a folder named Colab Notebooks in your Google Drive, you can use the following command:
```python
!cp -r /content/my_folder /content/drive/MyDrive/Colab\ Notebooks/
```
Make sure to replace my_folder and Colab Notebooks with the actual names of your folders.

(For transparency: This instruction was generated using Bing-Chat.)

In [16]:
from google.colab import drive
drive.mount('/content/drive')

!cp -r /content/drive/MyDrive/qa_model_test /content/drive/MyDrive/Colab\ Notebooks/

Mounted at /content/drive
cp: cannot stat '/content/drive/MyDrive/qa_model_test': No such file or directory
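
The `cp` call above fails because its source path points into Drive, where no `qa_model_test` folder exists. The training cell writes to `./qa_model_test`, which in Colab typically resolves to `/content/qa_model_test` (the path used by the `zip` command further down). As a hedged alternative, the sketch below does the copy in Python and checks the source first. `copy_folder` is a hypothetical helper, and the demo uses temporary directories as stand-ins so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_folder(src, dst_parent):
    """Copy the folder `src` into `dst_parent`, keeping the folder name."""
    src, dst_parent = Path(src), Path(dst_parent)
    if not src.is_dir():
        # Fail early with a clear message instead of a bare "cannot stat".
        raise FileNotFoundError(f"source folder not found: {src}")
    target = dst_parent / src.name
    shutil.copytree(src, target, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+
    return target


# Demo with temporary stand-ins for the Colab runtime and the mounted Drive.
with tempfile.TemporaryDirectory() as tmp:
    runtime = Path(tmp) / "content"
    drive = Path(tmp) / "drive" / "MyDrive" / "Colab Notebooks"
    (runtime / "qa_model_test").mkdir(parents=True)
    (runtime / "qa_model_test" / "config.json").write_text("{}")
    drive.mkdir(parents=True)

    copied = copy_folder(runtime / "qa_model_test", drive)
    copied_ok = (copied / "config.json").exists()

    try:
        copy_folder(drive / "does_not_exist", drive)
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(copied_ok, missing_raises)
```

In Colab, after mounting Drive, the call would presumably be `copy_folder("/content/qa_model_test", "/content/drive/MyDrive/Colab Notebooks")`, assuming the model folder sits at that runtime path.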


**How to save model folder in Google Drive (Version 2):**

To save the model folder to Google Drive from a Colab notebook, you first need to mount your Google Drive in the notebook. Here's how you can do it:

1. Run the following command in a cell to mount your Google Drive:
```python
from google.colab import drive
drive.mount('/content/drive')
```

2. Follow the instructions and enter the authorization code when prompted.

3. Once your Google Drive is mounted, you can access it from the `/content/drive/MyDrive` directory.

4. To save the model to a directory in your Google Drive, you can change the `output_dir` argument in `TrainingArguments` to a path within your Google Drive. For example:
```python
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/qa_model_test",
    ...
)
```

This will save the model to the `qa_model_test` directory in your Google Drive.

(For transparency: This instruction was generated using Bing-Chat btw.)

**How to save files/folders locally:**


If you want to save the model locally from a Colab notebook, you can download the files from the Colab runtime to your local machine. Here's how you can do it:

1. After training the model, the files will be saved to the `output_dir` specified in `TrainingArguments`. In your case, it is `"./qa_model_test"`.

2. To download the entire directory, you can compress it into a zip file using the following command:
```python
!zip -r /content/qa_model_test.zip /content/qa_model_test
```

3. Once the zip file is created, you can download it to your local machine by clicking on the folder icon on the left sidebar of the Colab notebook, navigating to the `/content` directory, right-clicking on the `qa_model_test.zip` file and selecting "Download".

This will download the `qa_model_test.zip` file to your local machine. You can then extract its contents to access the saved model files.
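
The same kind of archive can also be built from Python with the standard library, which avoids the shell call. This is a minimal sketch: `zip_folder` is a hypothetical helper, and the demo zips a temporary stand-in folder rather than the real model directory.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, archive_base):
    """Zip `folder` into `<archive_base>.zip` and return the archive path."""
    folder = Path(folder)
    # root_dir/base_dir keep the folder name as the top-level entry in the archive.
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )


# Demo on a temporary stand-in for the model folder.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "qa_model_test"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")

    archive = zip_folder(model_dir, Path(tmp) / "qa_model_test")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

In Colab, something like `zip_folder("/content/qa_model_test", "/content/qa_model_test")` should produce `/content/qa_model_test.zip`, which can then be downloaded from the file browser as described above.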

(For transparency: This instruction was generated using Bing-Chat btw.)

In [17]:
!zip -r /content/qa_model_test.zip /content/qa_model_test

  adding: content/qa_model_test/ (stored 0%)
  adding: content/qa_model_test/checkpoint-500/ (stored 0%)
  adding: content/qa_model_test/checkpoint-500/scheduler.pt (deflated 45%)
  adding: content/qa_model_test/checkpoint-500/optimizer.pt (deflated 27%)
  adding: content/qa_model_test/checkpoint-500/tokenizer_config.json (deflated 43%)
  adding: content/qa_model_test/checkpoint-500/vocab.txt (deflated 53%)
  adding: content/qa_model_test/checkpoint-500/tokenizer.json (deflated 71%)
  adding: content/qa_model_test/checkpoint-500/trainer_state.json (deflated 52%)
  adding: content/qa_model_test/checkpoint-500/pytorch_model.bin (deflated 8%)
  adding: content/qa_model_test/checkpoint-500/rng_state.pth (deflated 28%)
  adding: content/qa_model_test/checkpoint-500/special_tokens_map.json (deflated 42%)
  adding: content/qa_model_test/checkpoint-500/training_args.bin (deflated 48%)
  adding: content/qa_model_test/checkpoint-500/config.json (deflated 44%)
  adding: content/qa_model_test/runs