# **QuestBot**: AI-Powered Insight Generator

## **Plan of Attack:**


1. ***Data Preprocessing***: Clean the dataset by dropping unused columns, then tokenize each question and review into model inputs.

2. ***Model Selection***: Choose a pre-trained RoBERTa model from the Hugging Face Transformers library as the base model for fine-tuning.

3. ***Fine-Tuning***: Fine-tune the RoBERTa model on the custom dataset using the question-answering objective. Train the model to predict the start and end of the answer span in the review, given a question.

4. ***Save the Fine-Tuned Model***: Save the fine-tuned RoBERTa model after training so it can be used later for inference.

5. ***Deployment***: Deploy the fine-tuned RoBERTa model as a Q&A bot behind a user-friendly interface (e.g., Gradio) so users can enter questions and receive relevant answers.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate



### Let's load the RoBERTa tokenizer

In [None]:
from datasets import load_dataset
import datasets

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
tokenizer.is_fast

True

### Loading the dataset and creating preprocessing functions for the training and validation sets

In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [None]:
import pandas as pd
df_train=pd.read_csv('/content/train.csv')
df_test=pd.read_csv('/content/test.csv')

In [None]:
df_train.columns

Index(['item_id', 'domain', 'nn_mod', 'nn_asp', 'query_mod', 'query_asp',
       'q_review_id', 'q_reviews_id', 'question', 'question_subj_level',
       'ques_subj_score', 'is_ques_subjective', 'review_id', 'review',
       'human_ans_spans', 'human_ans_indices', 'answer_subj_level',
       'ans_subj_score', 'is_ans_subjective'],
      dtype='object')

In [None]:
df_train.iloc[0].question

'Who is the author of this series?'

In [None]:
df_train.iloc[0].review

"Whether it be in her portrayal of a nerdy lesbian or a punk rock rebel, Maslany's plural personalities, (though very stereotypical), are entertaining eye-candy. Combined with a complex and unpredictable plot line, this show is surprisingly addictive. ANSWERNOTFOUND"

In [None]:
df_train.iloc[0].human_ans_indices

'(251, 265)'

In [None]:
df_train.iloc[0].review[251:265]

'ANSWERNOTFOUND'

In [None]:
df_train=df_train[['question','human_ans_indices','review','human_ans_spans']]
df_test=df_test[['question','human_ans_indices','review','human_ans_spans']]

#### Creating 'id' column in the dataset

In [None]:
import numpy as np

# Sequential string ids ("0", "1", ...) so each prediction can be matched back to its example
df_train['id'] = np.arange(len(df_train)).astype(str)
df_test['id'] = np.arange(len(df_test)).astype(str)

In [None]:
int(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[0])

251

In [None]:
float(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])

265.0

In [None]:
df_train['answers']=df_train['human_ans_spans']
df_test['answers']=df_test['human_ans_spans']

### Generating answers from the review column

In [None]:
for i in range(0,len(df_train)):
  answer1={}
  si=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_train.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_train.at[i, 'answers']=answer1
  #print(df_train.iloc[i].answers,df_train.iloc[i].human_ans_spans)

In [None]:
for i in range(0,len(df_test)):
  answer1={}
  si=int(df_test.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_test.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_test.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_test.at[i, 'answers']=answer1
  #print(df_test.iloc[i].answers,df_test.iloc[i].human_ans_spans)
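
The two loops above parse `human_ans_indices` with chained `split` calls. A minimal alternative sketch, assuming every value is a plain tuple literal such as `'(251, 265)'`, uses `ast.literal_eval` and wraps the logic in a helper. The name `build_answer` and the toy review are hypothetical and are not part of the notebook.

```python
import ast


def build_answer(review, indices):
    """Turn a '(start, end)' string into a SQuAD-style answers dict."""
    # literal_eval parses the tuple literal without executing arbitrary code
    start, end = ast.literal_eval(indices)
    return {"text": [review[start:end]], "answer_start": [start]}


# Toy review (hypothetical), not a row from the dataset
review = "Great cast and a clever plot. ANSWERNOTFOUND"
answer = build_answer(review, "(30, 44)")
print(answer)
```

The returned dict has the same `text` and `answer_start` keys that the loops above store in the `answers` column.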

In [None]:
df_train.columns

Index(['question', 'human_ans_indices', 'review', 'human_ans_spans', 'id',
       'answers'],
      dtype='object')

In [None]:
df_train.columns=['question', 'human_ans_indices', 'context', 'human_ans_spans', 'id',
       'answers']

df_test.columns=['question', 'human_ans_indices', 'context', 'human_ans_spans','id',
       'answers']

In [None]:
val_dataset2 = datasets.Dataset.from_pandas(df_test)
train_dataset2 = datasets.Dataset.from_pandas(df_train)


In [None]:
train_dataset = train_dataset2.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=train_dataset2.column_names,
)
len(train_dataset2), len(train_dataset)

Map:   0%|          | 0/2501 [00:00<?, ? examples/s]

(2501, 4862)
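
The 2501 examples become 4862 features because any context longer than `max_length` is split into several overlapping windows, with `stride` tokens shared between consecutive windows. The sketch below illustrates the idea on a plain list. It is a simplification and not the tokenizer's implementation: in the real tokenizer the question and special tokens also use part of each window. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens, window, overlap):
    """Split tokens into chunks of at most `window` items that share `overlap` items."""
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the sequence
        if start + window >= len(tokens):
            break
        start += step
    return chunks


tokens = list(range(10))
chunks = split_with_overlap(tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk after the first repeats the last two items of the previous one, which is the role `stride=128` plays for the 384-token windows used here.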

In [None]:
train_dataset2.shape

(2501, 6)

In [None]:
train_dataset.shape

(4862, 4)

### Similarly, we preprocess the validation dataset

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
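
The list comprehension in `preprocess_validation_examples` keeps character offsets only for context tokens, so that post-processing can later reject any predicted span that touches the question or a special token. A toy sketch of that masking step follows. The sequence ids and offsets are made up for illustration, and the helper name `mask_non_context_offsets` is hypothetical.

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep offsets only for context tokens (sequence id 1); all others become None."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


# Hypothetical feature laid out as: <s> question </s></s> context </s>
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 6), (0, 0), (0, 0), (0, 4), (5, 9), (10, 14), (0, 0)]

masked = mask_non_context_offsets(offsets, sequence_ids)
print(masked)
```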

In [None]:
validation_dataset = val_dataset2.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=val_dataset2.column_names,
)
len(validation_dataset)

Map:   0%|          | 0/582 [00:00<?, ? examples/s]

1104

In [None]:
validation_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 1104
})

In [None]:
len(val_dataset2)

582

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering


In [None]:
import collections



In [None]:
import evaluate

metric = evaluate.load("squad")

### Creating the metric function for scoring predictions

In [None]:
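from tqdm.auto import tqdm


# Relies on the globals n_best and max_answer_length, which must be set
# before this function is called (they are defined ahead of trainer.train()).
def compute_metrics(start_logits, end_logits, features, examples):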
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
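
The core of `compute_metrics` is the search over the top `n_best` start and end positions for the highest-scoring valid span. The sketch below isolates that step in plain Python on made-up logits. It omits the offset check for non-context tokens, and the function name `best_span` and the small default values are hypothetical.

```python
def best_span(start_logits, end_logits, n_best=2, max_answer_length=3):
    """Return (score, start, end) for the highest-scoring valid span, or None."""

    def top(logits):
        # Indices of the n_best largest logits, best first
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

    candidates = []
    for s in top(start_logits):
        for e in top(end_logits):
            # Skip spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append((start_logits[s] + end_logits[e], s, e))
    return max(candidates) if candidates else None


start_logits = [0.1, 2.0, 0.3, 1.5, 0.2]
end_logits = [0.0, 0.4, 2.5, 0.1, 1.0]
result = best_span(start_logits, end_logits)
print(result)  # (4.5, 1, 2)
```

Here the pairs (1, 4) and (3, 2) are rejected for being too long and for ending before they start, which leaves (1, 2) as the best remaining candidate.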

## Fine-tuning the model

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "roberta-finetuned-subjqa-movies_2",
    evaluation_strategy="epoch",  # the validation features carry no labels, so "Validation Loss" shows "No log"
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
    # push_to_hub=True,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)


In [None]:
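import numpy as np

n_best = 20             # number of top start/end logits considered per feature
max_answer_length = 30  # maximum length of a predicted answer span, in tokens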

In [None]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2951        | No log          |
| 2     | 1.0322        | No log          |
| 3     | 0.8542        | No log          |
| 4     | 0.713         | No log          |
| 5     | 0.6147        | No log          |


TrainOutput(global_step=3040, training_loss=0.9018189380043431, metrics={'train_runtime': 820.6534, 'train_samples_per_second': 29.623, 'train_steps_per_second': 3.704, 'total_flos': 4764093117189120.0, 'train_loss': 0.9018189380043431, 'epoch': 5.0})
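
As a sanity check, the `global_step` and throughput figures in `TrainOutput` can be reproduced from the dataset size. The sketch below assumes the default per-device training batch size of 8, a single device, and no gradient accumulation, since `TrainingArguments` above does not override any of these.

```python
import math

num_features = 4862   # rows in train_dataset after preprocessing
batch_size = 8        # assumed: default per_device_train_batch_size on one device
num_epochs = 5
train_runtime = 820.6534  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_features / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_features * num_epochs / train_runtime

print(steps_per_epoch, total_steps)  # 608 3040
print(round(samples_per_second, 3))
```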

### Calculating Validation Score

In [None]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, val_dataset2)

  0%|          | 0/582 [00:00<?, ?it/s]

{'exact_match': 62.7147766323024, 'f1': 64.87899203047395}
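
To make these two numbers concrete, the sketch below scores a single prediction in the SQuAD style: exact match compares normalized strings, and F1 measures token overlap. It is modeled on the usual SQuAD normalization (lowercasing, stripping punctuation and articles) rather than copied from the `evaluate` implementation, and the two example strings are invented. The reported scores are these per-example values averaged over the validation set and scaled to 100.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def f1(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction and reference answer
prediction = "they were so orderly"
truth = "The scenes were so orderly."

em_score = exact_match(prediction, truth)
f1_score = f1(prediction, truth)
print(em_score, f1_score)
```

Three of the four tokens on each side overlap, so precision and recall are both 0.75 and the prediction earns partial F1 credit while scoring zero on exact match.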

## Saving the model locally

In [None]:
import os

# Get the current working directory in Colab
current_directory = os.getcwd()

# Define the name of the directory where you want to save the model and tokenizer
desired_directory_name = "my_saved_model"

# Create the full path to the desired directory
desired_directory_path = os.path.join(current_directory, desired_directory_name)

# Print the path of the desired directory
print("Path of desired directory:", desired_directory_path)


Path of desired directory: /content/my_saved_model


In [None]:
# Save the trained model
model_save_path = "/content/my_saved_model"
model.save_pretrained(model_save_path)

# Save the tokenizer
tokenizer_save_path = "/content/my_saved_tokenizer"
tokenizer.save_pretrained(tokenizer_save_path)

# Print the paths for confirmation
print("Trained model saved at:", model_save_path)
print("Tokenizer saved at:", tokenizer_save_path)

Trained model saved at: /content/my_saved_model
Tokenizer saved at: /content/my_saved_tokenizer


### Loading the saved model into a pipeline for inference

In [None]:
from transformers import pipeline

In [None]:
# Replace this with your own checkpoint
model_checkpoint2 = "/content/roberta-finetuned-subjqa-movies_2/my_checkpoint"
question_answerer = pipeline("question-answering", model=model_checkpoint2)

In [None]:
import pandas as pd
df_train1=pd.read_csv('/content/train.csv')
df_test1=pd.read_csv('/content/test.csv')

In [None]:
df_train1.iloc[13].question

'Why is the movie soo confusing?'

In [None]:
df_train1.iloc[13].review

"Inception is an interesting movie but might not be everyone's cup of tea.  Dom Cobb (Leonardo DiCaprio) is a man who specializes in dream extractions (think corporate espionage) - going into a shared dream state using a military derived technique with a team and a subject to extract a piece of information from that subject's subconscious mind.  This can sometimes involve going into a dream within a dream (or more).Much was made of the complexities of this movie and indeed it is complex but for a generation growing up watching Matrix movies (especially the latter two) this is positively straight forward - and that's not a bad thing.  This is helped by Cobb's new dream architect (played by Ellen Page) who is also new to this world and to whom he explains the rules of the dream world (and thereby us - something not as well done in the Matrix flicks).  In addition to these dream rules there are the complexities of Cobb himself and the guilt he feels over the tragedy of his wife and kids a

In [None]:
context = df_train1.iloc[13].review
question = df_train1.iloc[13].question
question_answerer(question=question, context=context)

{'score': 0.4546007215976715,
 'start': 2649,
 'end': 2669,
 'answer': 'they were so orderly'}