# Objective
The purpose of this notebook is to fine-tune a question-answering (QA) chatbot model built on a transformer architecture. We use the Hugging Face Transformers library and start from `deepset/roberta-base-squad2`, a RoBERTa checkpoint (RoBERTa is a BERT variant) that has already been trained on SQuAD 2.0, then fine-tune it on a custom dataset of movie reviews. The notebook covers data preparation, fine-tuning, evaluation, and inference, and ends with a simple Gradio user interface for interacting with the model.

## Steps
1. **Data Collection and Preparation**: Load and preprocess the movie review dataset.
2. **Model Selection**: Load the RoBERTa model for fine-tuning.
3. **Training and Fine-Tuning**: Customize the model on the movie review QA dataset.
4. **Evaluation**: Measure the model's performance.
5. **User Interface**: Build a simple UI using Gradio for user interaction.

### Installation and Setup

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

!apt install git-lfs


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

### Hugging Face Login

In [None]:
!git config --global user.email "m.moussa@alustudent.com"
!git config --global user.name "Kakaymi10"

In [None]:
from huggingface_hub import notebook_login

notebook_login()


### Model and Tokenizer Setup

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





In [None]:
tokenizer.is_fast

True

# Data Preparation
Load and preprocess the movie review dataset, including tokenization and alignment of QA pairs.

1. **Load Dataset**: Import the CSV files containing training and test data.
2. **Adjust Columns**: Rename and adjust columns to suit model requirements, including question, context, and answers.
3. **Define Preprocessing Functions**: Create functions to process and align answer spans for model compatibility.

In [None]:
import pandas as pd
df_train=pd.read_csv('/content/train.csv')
df_test=pd.read_csv('/content/test.csv')

In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
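The span-alignment step inside `preprocess_training_examples` can be checked in isolation. The sketch below repeats the same search on hand-written offsets. `char_span_to_token_span` and `toy_offsets` are hypothetical names introduced for illustration; they are not part of the notebook or of the Transformers API.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside the window,
    mirroring the labelling rule used in the preprocessing function.
    """
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy offsets (assumed): token 0 is a special token,
# tokens 1-4 cover the text "this show is OUTSTANDING"
toy_offsets = [(0, 0), (0, 4), (5, 9), (10, 12), (13, 24)]

inside = char_span_to_token_span(toy_offsets, 1, 4, 5, 12)    # "show is"
outside = char_span_to_token_span(toy_offsets, 1, 4, 5, 40)   # runs past the window
print(inside, outside)
```

A span that falls outside the current window gets the label `(0, 0)`, which points at the first token of the sequence.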

In [None]:
df_train.columns

Index(['item_id', 'domain', 'nn_mod', 'nn_asp', 'query_mod', 'query_asp',
       'q_review_id', 'q_reviews_id', 'question', 'question_subj_level',
       'ques_subj_score', 'is_ques_subjective', 'review_id', 'review',
       'human_ans_spans', 'human_ans_indices', 'answer_subj_level',
       'ans_subj_score', 'is_ans_subjective'],
      dtype='object')

In [None]:
df_train.iloc[0].question

'Who is the author of this series?'

In [None]:
df_train.iloc[0].review

"Whether it be in her portrayal of a nerdy lesbian or a punk rock rebel, Maslany's plural personalities, (though very stereotypical), are entertaining eye-candy. Combined with a complex and unpredictable plot line, this show is surprisingly addictive. ANSWERNOTFOUND"

In [None]:
df_train.iloc[0].human_ans_indices

'(251, 265)'

In [None]:
df_train.iloc[0].review[251:265]

'ANSWERNOTFOUND'

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
df_train=df_train[['question','human_ans_indices','review','human_ans_spans']].copy()
df_test=df_test[['question','human_ans_indices','review','human_ans_spans']].copy()

In [None]:
import numpy as np
df_train['id']=np.linspace(0,len(df_train)-1,len(df_train))
df_test['id']=np.linspace(0,len(df_test)-1,len(df_test))

df_train['id']=df_train['id'].astype(str)
df_test['id']=df_test['id'].astype(str)

In [None]:
int(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[0])

251

In [None]:
float(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])

265.0

In [None]:
df_train['answers']=df_train['human_ans_spans']
df_test['answers']=df_test['human_ans_spans']

In [None]:
for i in range(0,len(df_train)):
  answer1={}
  si=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_train.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_train.at[i, 'answers']=answer1
  #print(df_train.iloc[i].answers,df_train.iloc[i].human_ans_spans)

In [None]:
for i in range(0,len(df_test)):
  answer1={}
  si=int(df_test.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_test.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_test.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_test.at[i, 'answers']=answer1
  #print(df_test.iloc[i].answers,df_test.iloc[i].human_ans_spans)
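The chained `split` calls above work for well-formed strings such as `'(251, 265)'`. As a hedged alternative, the same parsing can be sketched with `ast.literal_eval` from the standard library. `build_answer` is a hypothetical helper, and the short review string is made up for the example.

```python
import ast


def build_answer(review, span_str):
    """Turn a span string like '(251, 265)' into a SQuAD-style answers dict."""
    # literal_eval safely parses the tuple literal without executing code
    start, end = ast.literal_eval(span_str)
    return {"text": [review[start:end]], "answer_start": [start]}


# Made-up review that ends with the dataset's "no answer" marker
review = "Great plot. ANSWERNOTFOUND"
answer = build_answer(review, "(12, 26)")
print(answer)
```

With a helper like this, both loops could fill the `answers` column from the `review` and `human_ans_indices` values of each row.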

In [None]:
df_train.columns

Index(['question', 'human_ans_indices', 'review', 'human_ans_spans', 'id',
       'answers'],
      dtype='object')

In [None]:
df_train.columns=['question', 'human_ans_indices', 'context', 'human_ans_spans', 'id',
       'answers']

df_test.columns=['question', 'human_ans_indices', 'context', 'human_ans_spans','id',
       'answers']

In [None]:
import datasets

val_dataset2 = datasets.Dataset.from_pandas(df_test)
train_dataset2 = datasets.Dataset.from_pandas(df_train)


# Model Preparation for Fine-Tuning

In [None]:
train_dataset = train_dataset2.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=train_dataset2.column_names,
)
len(train_dataset2), len(train_dataset)

Map:   0%|          | 0/2501 [00:00<?, ? examples/s]

(2501, 4862)
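The feature count (4862) exceeds the example count (2501) because reviews longer than `max_length` are split into overlapping windows. The sketch below gives a rough estimate of how many windows one example produces. `count_windows` is a hypothetical helper, and the default of 4 special tokens is an assumption about how RoBERTa encodes a question/context pair. It approximates the tokenizer's behaviour rather than reproducing it.

```python
import math


def count_windows(context_tokens, question_tokens, max_length=384, stride=128, special_tokens=4):
    """Estimate the number of overlapping windows for one example."""
    # Room left for context tokens in each window
    window = max_length - question_tokens - special_tokens
    if context_tokens <= window:
        return 1
    # Consecutive windows share `stride` tokens, so each new window
    # advances by (window - stride) tokens
    step = window - stride
    return 1 + math.ceil((context_tokens - window) / step)


short_review = count_windows(context_tokens=100, question_tokens=10)
long_review = count_windows(context_tokens=1000, question_tokens=10)
print(short_review, long_review)
```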

In [None]:
train_dataset2.shape

(2501, 6)

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [None]:
validation_dataset = val_dataset2.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=val_dataset2.column_names,
)
len(validation_dataset)

Map:   0%|          | 0/582 [00:00<?, ? examples/s]

1104

In [None]:
len(validation_dataset)

1104

In [None]:
validation_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 1104
})

In [None]:
len(val_dataset2)

582

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



# Metrics Evaluation

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering


In [None]:
import collections



In [None]:
import evaluate

metric = evaluate.load("squad")


In [None]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
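The core of `compute_metrics` is the search over the `n_best` start and end positions. The sketch below isolates that search on toy logits. `best_span` is a hypothetical helper, and it leaves out the offset check that skips tokens outside the context.

```python
import numpy as np


def best_span(start_logit, end_logit, n_best=20, max_answer_length=30):
    """Return (start_index, end_index, score) of the highest-scoring valid span."""
    # Indices of the n_best largest logits, highest first
    start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
    end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()

    best = None
    for s in start_indexes:
        for e in end_indexes:
            # Skip spans that end before they start or are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = float(start_logit[s] + end_logit[e])
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits for a four-token sequence
start = np.array([0.1, 2.0, 0.3, 0.2])
end = np.array([0.0, 0.5, 3.0, 0.1])
print(best_span(start, end))  # (1, 2, 5.0)
```

The score of a candidate span is the sum of its start and end logits, which matches the `logit_score` field used above.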

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


# Fine-Tuning the Model


In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "roberta-finetuned-subjqa-movies_2",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True,
)



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)




In [None]:
import numpy as np
n_best=20
max_answer_length = 30

# Model Evaluation

In [None]:
# Baseline: score the pre-trained checkpoint before any fine-tuning
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, val_dataset2)

{'exact_match': 2.7491408934707904, 'f1': 9.852373992728378}

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Model Preparation Time |
|---|---|---|---|
| 1 | 1.2742 | No log | 0.0059 |
| 2 | 1.0163 | No log | 0.0059 |
| 3 | 0.843 | No log | 0.0059 |
| 4 | 0.6959 | No log | 0.0059 |
| 5 | 0.5892 | No log | 0.0059 |


TrainOutput(global_step=3040, training_loss=0.8837257786800987, metrics={'train_runtime': 889.7692, 'train_samples_per_second': 27.322, 'train_steps_per_second': 3.417, 'total_flos': 4764093117189120.0, 'train_loss': 0.8837257786800987, 'epoch': 5.0})

In [None]:
trainer.push_to_hub(commit_message="Training complete")


CommitInfo(commit_url='https://huggingface.co/MoussaMoustapha/roberta-finetuned-subjqa-movies_2/commit/3dfc661dbb0d37fd02c818a12609f9f8df406259', commit_message='Training complete', commit_description='', oid='3dfc661dbb0d37fd02c818a12609f9f8df406259', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Re-score the model on the same validation set after fine-tuning
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, val_dataset2)


{'exact_match': 63.57388316151203, 'f1': 65.60935129342873}
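Exact match counts a prediction only when it equals a reference answer, while F1 gives partial credit for overlapping words. The sketch below shows a simplified token-level F1. `simple_f1` is a hypothetical helper that only lowercases and splits on whitespace; the `squad` metric loaded earlier also strips punctuation and articles, so its scores can differ.

```python
from collections import Counter


def simple_f1(prediction, reference):
    """Simplified token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Multiset intersection counts each shared token at most as often as it appears in both
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# One extra word in the prediction lowers precision but not recall
score = simple_f1("this show is OUTSTANDING", "show is outstanding")
print(round(score, 4))
```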

In [None]:
#Inference!

# Inference and Testing


In [None]:
from transformers import pipeline

In [None]:
# Replace this with your own checkpoint
model_checkpoint2 = "MoussaMoustapha/roberta-finetuned-subjqa-movies_2"
question_answerer = pipeline("question-answering", model=model_checkpoint2)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
import pandas as pd
df_train1=pd.read_csv('/content/train.csv')
df_test1=pd.read_csv('/content/test.csv')

In [None]:
df_train1.iloc[13].question

'Why is the movie soo confusing?'

In [None]:
context = df_train1.iloc[13].review
question = df_train1.iloc[13].question
question_answerer(question=question, context=context)

{'score': 0.27877935767173767,
 'start': 10,
 'end': 72,
 'answer': "is an interesting movie but might not be everyone's cup of tea"}

In [None]:
# Baseline for comparison: the original checkpoint before fine-tuning
model_checkpoint_o = "deepset/roberta-base-squad2"
question_answerer_old = pipeline("question-answering", model=model_checkpoint_o)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
context = df_train1.iloc[13].review
question = df_train1.iloc[13].question
question_answerer_old(question=question, context=context)

{'score': 0.4546010494232178,
 'start': 2649,
 'end': 2669,
 'answer': 'they were so orderly'}

In [None]:
df_train.iloc[3].question

'Is this series good and excelent?'

In [None]:
df_train.iloc[3].answers

{'text': ['this show is OUTSTANDING'], 'answer_start': [296]}

In [None]:
df_train[['id','question','context','answers']].head()

| | id | question | context | answers |
|---|---|---|---|---|
| 0 | 0.0 | Who is the author of this series? | Whether it be in her portrayal of a nerdy lesb... | {'text': ['ANSWERNOTFOUND'], 'answer_start': [... |
| 1 | 1.0 | Can we enjoy the movie along with our family ? | An outstanding romantic comedy, 13 Going on 30... | {'text': ['ANSWERNOTFOUND'], 'answer_start': [... |
| 2 | 2.0 | Does this one good? | To let the truth be known, I watched this movi... | {'text': ['ANSWERNOTFOUND'], 'answer_start': [... |
| 3 | 3.0 | Is this series good and excelent? | At the time of my review, there had been 910 c... | {'text': ['this show is OUTSTANDING'], 'answer... |
| 4 | 4.0 | How is the costume design? | "Fright Night" is great! This is how the story... | {'text': ['The costume design by Susan Matheso... |


In [None]:
len(df_train)

2501

# User Interface

In [None]:
!pip install gradio


In [8]:
import pandas as pd
import gradio as gr
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Load the QA model and tokenizer
model_name = "MoussaMoustapha/roberta-finetuned-subjqa-movies_2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)



In [10]:

# Build the pipeline once so it is not re-created on every request
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Function to ask questions and get answers
def ask_question(question):
    # `context` is the global review text set in the inference cells above
    result = qa_pipeline(question=question, context=context)
    return result['answer']

# Create Gradio interface
iface = gr.Interface(
    fn=ask_question,
    inputs=gr.Textbox(lines=2, placeholder="Enter your question about the movie review dataset..."),
    outputs="text",
    title="Movie Review Dataset QA",
    description="Enter a question about the movie review dataset, and the model will try to answer based on the provided context.",
)
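The interface above answers every question against the single review stored in the global `context`. One possible variant lets the user supply the review as a second input. The sketch below shows only the callback logic. `make_answer_fn` and `fake_qa` are hypothetical names, and `fake_qa` is a stand-in so the example runs without downloading a model.

```python
def make_answer_fn(qa):
    """Wrap a question-answering callable so it can serve as a Gradio `fn`."""
    def answer(question, context):
        # Guard against empty inputs before calling the model
        if not question.strip() or not context.strip():
            return "Please provide both a question and a context."
        return qa(question=question, context=context)["answer"]
    return answer


# Stand-in for the real pipeline (assumption): returns the first sentence of the context
def fake_qa(question, context):
    return {"answer": context.split(".")[0]}


answer = make_answer_fn(fake_qa)
reply = answer("How is the plot?", "The plot is complex. The cast is great.")
print(reply)
```

In the notebook, `make_answer_fn(qa_pipeline)` could be passed as `fn` to `gr.Interface` with two `gr.Textbox` inputs, one for the question and one for the review text.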

In [11]:
iface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://e50c80efa03fcb95fd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


