### Importing necessary libraries & data files

In [1]:
import pandas as pd
from transformers import (
    AutoModel, AutoModelForQuestionAnswering, AutoModelForSeq2SeqLM, AutoTokenizer,
    DataCollatorWithPadding, Trainer, TrainingArguments, pipeline,
)
import torch
from datasets import Dataset

In [12]:
article_info = pd.read_csv("articles_info.csv")  
additional_info = pd.read_csv("additional_info.csv")

article_info['combined_text'] = article_info['title'] + ". " + article_info['content'] + " Tags: " + article_info['tags']

# Copy the slice so the fillna assignment below works on an independent DataFrame
# instead of triggering a SettingWithCopyWarning.
context_df = article_info[['title', 'combined_text']].copy()

additional_info = additional_info.rename(columns={
    'info_title': 'title',
    'content': 'combined_text'
})

# A missing title, content, or tags value turns the whole concatenation into NaN; replace those with ''.
context_df['combined_text'] = context_df['combined_text'].fillna('')

context_df = pd.concat([context_df, additional_info], ignore_index=True)
context_df.head()


|   | title | combined_text |
|---|-------|---------------|
| 0 | Understanding Cholera: A brief look into its c... | Understanding Cholera: A brief look into its c... |
| 1 | From HPP Innovation Week – Part 2 | From HPP Innovation Week – Part 2. This is the... |
| 2 | From HPP Innovation week – Part 1 | From HPP Innovation week – Part 1. Hiperbaric ... |
| 3 | Coliforms and their role in ensuring the safet... | Coliforms and their role in ensuring the safet... |
| 4 | Diverse burden of foodborne disease | Diverse burden of foodborne disease. Foodborne... |


### Generating Q&A based on the data files

In [3]:
qg_pipeline = pipeline('text2text-generation', model="valhalla/t5-base-qg-hl")

def generate_multiple_questions(context, num_questions=10):
    input_text = f"generate question: {context}"
    questions = qg_pipeline(input_text, num_return_sequences=num_questions, num_beams=num_questions, max_length=64)
    
    return [q['generated_text'] for q in questions]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
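
The `-hl` suffix in the checkpoint name refers to answer highlighting. The following is a minimal sketch of building such a prompt, assuming (based on the model card, not on anything shown in this notebook) that the model expects the target answer span wrapped in `<hl>` tokens. `highlight_answer` is a hypothetical helper, not part of any library.

```python
def highlight_answer(context: str, answer: str, hl_token: str = "<hl>") -> str:
    """Wrap the first occurrence of `answer` in highlight tokens (hypothetical helper)."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer span not found in context")
    end = start + len(answer)
    # Assumption: the checkpoint was trained on "<hl> answer <hl>" style input
    highlighted = f"{context[:start]}{hl_token} {answer} {hl_token}{context[end:]}"
    return f"generate question: {highlighted}"


prompt = highlight_answer("Cholera is caused by Vibrio cholerae.", "Vibrio cholerae")
print(prompt)
```

The resulting string could be passed to `qg_pipeline` in place of the plain `generate question: {context}` prompt. Whether that improves the generated questions would need to be checked on the actual data.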


In [4]:
# Load BioBERT model for Q&A
bio_model_name = "dmis-lab/biobert-base-cased-v1.1"
bio_model = AutoModelForQuestionAnswering.from_pretrained(bio_model_name)
bio_tokenizer = AutoTokenizer.from_pretrained(bio_model_name)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
gen_qa_pipeline = pipeline('text2text-generation', model="google/flan-t5-large")

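def truncate_context(context, max_length=512):
    # The cut-off is counted in BioBERT WordPiece tokens, so it only approximates
    # how many tokens the FLAN-T5 generator will see for the same text.
    tokenized_text = bio_tokenizer.tokenize(context)
    if len(tokenized_text) > max_length:
        tokenized_text = tokenized_text[:max_length]
    # Rebuild plain text from the kept tokens
    return bio_tokenizer.convert_tokens_to_string(tokenized_text)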

def generate_longer_answers(context, question, min_length=50, max_length=200):
    truncated_context = truncate_context(context)
    
    input_text = f"question: {question} context: {truncated_context}"
    answers = gen_qa_pipeline(input_text, num_return_sequences=1, min_length=min_length, max_length=max_length)

    return answers[0]['generated_text']

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
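
The prompt built in `generate_longer_answers` adds the `question:`/`context:` template and the question itself on top of the truncated context, so a context cut at exactly `max_length` tokens can still produce a longer prompt. The following is a minimal sketch of reserving room for those extra tokens before truncating. `truncate_to_budget` is a hypothetical helper, and whitespace splitting is only a stand-in for a real tokenizer.

```python
def truncate_to_budget(context, question, max_tokens=512,
                       tokenize=str.split, detokenize=" ".join):
    """Trim `context` so the full prompt stays within `max_tokens`.

    `tokenize`/`detokenize` default to whitespace splitting as a stand-in;
    a real tokenizer's functions would be passed in instead.
    """
    # Tokens used by the fixed prompt template and by the question
    reserved = len(tokenize("question: context:")) + len(tokenize(question))
    budget = max(max_tokens - reserved, 0)
    return detokenize(tokenize(context)[:budget])


question = "What causes cholera?"
context = " ".join(f"word{i}" for i in range(1000))
trimmed = truncate_to_budget(context, question, max_tokens=20)
prompt = f"question: {question} context: {trimmed}"
print(len(prompt.split()))
```

With a real tokenizer the count would differ from the whitespace count, but the idea is the same: subtract the fixed parts of the prompt from the limit first.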


### BioBERT model

In [2]:
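# Load BioBERT model and tokenizer.
# Note: this base checkpoint has no trained QA head (qa_outputs is newly initialized,
# per the warning below), so its answers are unreliable until it is fine-tuned on a QA dataset.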
model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Set up the question-answering pipeline
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [26]:
# Answer one question against the context in every row of the DataFrame
def answer_question_on_df(question, df, context_column='combined_text'):
    # Get an answer for a single context
    def get_answer(context):
        # NaN is a truthy float, so `not context` alone would let it through;
        # treat anything that is not a non-blank string as missing.
        if not isinstance(context, str) or not context.strip():
            return "No context provided"

        response = qa_pipeline(question=question, context=context)
        return response['answer']

    # Apply the function to the specified column and store the results in a new 'answer' column.
    # Note: the column is added to the DataFrame that was passed in.
    df['answer'] = df[context_column].apply(get_answer)
    return df
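
Missing values loaded by pandas arrive as float NaN, which Python treats as truthy, so a plain truthiness test does not filter them out. The following is a small sketch of a stricter check. `has_usable_context` is a hypothetical helper name.

```python
def has_usable_context(value) -> bool:
    """Return True only for non-blank strings (hypothetical helper)."""
    return isinstance(value, str) and bool(value.strip())


nan = float("nan")
print(bool(nan))                                 # NaN counts as truthy
print(has_usable_context(nan))                   # rejected: not a string
print(has_usable_context("   "))                 # rejected: blank
print(has_usable_context("Foodborne disease"))   # accepted
```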

In [None]:
# Example question
question = "What is microbiology?"

# Get answers for each row
df_with_answers = answer_question_on_df(question, context_df)

# Display the updated DataFrame with answers
print(df_with_answers)
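
Printing the whole DataFrame makes it hard to see which row gave the strongest answer. The following is a rough sketch of picking the best-scoring result, assuming the question-answering pipeline's result dict carries a `score` field alongside `answer`. `fake_qa` is a stand-in for `qa_pipeline` so the snippet runs without downloading a model, and its scores are made up.

```python
def fake_qa(question, context):
    """Stand-in for qa_pipeline: returns a dict shaped like its output (made-up scores)."""
    words = context.split()
    return {"answer": words[0] if words else "", "score": len(words) / 10.0}


def best_answer(question, contexts, qa=fake_qa):
    """Run `qa` on every non-blank context and return the highest-scoring result."""
    results = [qa(question=question, context=c) for c in contexts if c.strip()]
    # default=None covers the case where every context was blank
    return max(results, key=lambda r: r["score"], default=None)


contexts = ["Microbiology studies microorganisms", "", "Foodborne disease"]
print(best_answer("What is microbiology?", contexts))
```

Passing `qa=qa_pipeline` would use the real model instead. With a QA head that has not been fine-tuned, the scores would not be meaningful.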