
# Data generation for BERT training

This notebook builds a dataframe for fine-tuning BERT from text that was previously extracted from PDF documents. For each sentence of that text, it generates questions and answers and pairs them with the sentence itself as context. The results are collected into a single dataframe that can be used as input for fine-tuning a BERT model.

In [None]:
!pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
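import locale

# Workaround: make Python report UTF-8 as the preferred encoding, which
# Colab's `!` shell commands can require.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
!pip install transformers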

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!python -m nltk.downloader punkt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
!git clone https://github.com/patil-suraj/question_generation.git

Cloning into 'question_generation'...
remote: Enumerating objects: 268, done.
remote: Total 268 (delta 0), reused 0 (delta 0), pack-reused 268
Receiving objects: 100% (268/268), 299.04 KiB | 9.06 MiB/s, done.
Resolving deltas: 100% (140/140), done.


In [None]:
%cd question_generation

/content/question_generation/question_generation/question_generation/question_generation/question_generation


## Importing the necessary data and libraries

In [None]:
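from pathlib import Path

# Load the cleaned sentences (one per line); assumes Google Drive is mounted.
pdf_text = Path('/content/drive/MyDrive/cleaned_sentences.txt').read_text()
cleaned_sentences_with_stopwords = pdf_text.split('\n')
# Same text as a single string, with newlines replaced by spaces.
clean_text = pdf_text.replace('\n', ' ')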

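As a minimal sketch of what this loading step produces, the snippet below runs the same split and replace operations on a short in-memory string. The string `sample_text` and the names `sentences`, `flat_text`, and `non_empty` are hypothetical stand-ins, not part of the notebook; the real input is the text file on Drive.

```python
# Hypothetical stand-in for the contents of cleaned_sentences.txt.
sample_text = "copd is a lung disease\n\nsmoking is a major risk factor\n"

# One entry per line. Blank lines and the trailing newline produce empty
# strings, which is why empty entries have to be skipped later on.
sentences = sample_text.split("\n")

# The same text as one string, with newlines replaced by spaces.
flat_text = sample_text.replace("\n", " ")

# Optional alternative: drop empty entries up front instead of in the loop.
non_empty = [s.strip() for s in sentences if s.strip()]

print(sentences)
print(non_empty)
```

Filtering empty entries once, as in `non_empty`, is only one possible design; the notebook instead checks for empty strings inside the generation loop.
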
## Model


The code below imports the `pipeline` helper from the `pipelines` module of the cloned `question_generation` repository and uses it to load a pre-trained question generation model, `valhalla/t5-small-qg-prepend`. It also imports `pandas` for data manipulation and creates an empty list, `questions_dataset`, to collect the generated questions.

The code then iterates over each sentence in `cleaned_sentences_with_stopwords`. For every non-empty sentence, it generates question-answer pairs with the model and stores them in a DataFrame called `df_sentence`, adding a `text` column that holds the original sentence.

Each `df_sentence` is appended to `questions_dataset`. Finally, `pd.concat` combines all of these DataFrames into a single DataFrame, `df_bert_training`, which can be used as a training dataset for fine-tuning a BERT model on the generated question-answer pairs.

In [None]:
from pipelines import pipeline
import pandas as pd
nlp = pipeline("question-generation", model="valhalla/t5-small-qg-prepend", qg_format="prepend")

In [None]:
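questions_dataset = []
for sentence in cleaned_sentences_with_stopwords:
  # Skip blank lines left over from splitting the text on newlines.
  if sentence != '':
    # One row per generated question-answer pair.
    df_sentence = pd.DataFrame(nlp(sentence))
    # Keep the source sentence as context for each pair.
    df_sentence['text'] = sentence
    questions_dataset.append(df_sentence)
# Row labels restart at 0 for every sentence, so the index is not unique.
df_bert_training = pd.concat(questions_dataset)
df_bert_training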

Unnamed: 0,answer,question,text
0,<pad> emphysema and chronic bronchitis,What are the symptoms of chronic obstructive p...,chronic obstructive pulmonary disease copd is ...
1,airflow obstruction,What is the cause of chronic obstructive pulmo...,chronic obstructive pulmonary disease copd is ...
0,<pad> copd,What is an important contributor to mortality ...,copd is an important contributor to mortality ...
0,<pad> to reduce activity limitations among adu...,What is the goal of healthy people 2020?,healthy people 2020 has several copdrelated ob...
0,<pad> 2013,In what year did cdc analyze data from the beh...,to assess the statelevel prevalence of copd an...
...,...,...,...
0,<pad> cdc and fda,Who continue to monitor hpvassociated outcomes?,postlicensure monitoring and evaluation by cdc...
0,<pad> ongoing,What are evaluations ongoing?,evaluations are ongoing
0,<pad> 214,How many cervical cancer screenings are there?,to date there is no indication replacement wit...
0,<pad> cervical cancer,What type of cancer does hpv vaccine affect?,evaluation of the impact of hpv vaccination on...

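Many answers in the output above start with a `<pad>` marker left over from the model's decoding. The sketch below shows one possible way to strip it and to locate each answer inside its context sentence. It assumes that the downstream fine-tuning expects SQuAD-style character offsets, which the notebook does not state. The helper names `strip_pad` and `find_answer_start` are hypothetical, the first sample row is taken from the table above, and the second row is made up for illustration.

```python
PAD_TOKEN = "<pad>"


def strip_pad(answer: str) -> str:
    """Remove a leading '<pad>' marker and any surrounding whitespace."""
    answer = answer.strip()
    if answer.startswith(PAD_TOKEN):
        answer = answer[len(PAD_TOKEN):]
    return answer.strip()


def find_answer_start(context: str, answer: str) -> int:
    """Return the character offset of answer in context, or -1 if absent."""
    return context.find(answer)


rows = [
    # Row taken from the output table above.
    {"answer": "<pad> ongoing",
     "question": "What are evaluations ongoing?",
     "text": "evaluations are ongoing"},
    # Hypothetical row, made up for illustration only.
    {"answer": "airflow obstruction",
     "question": "What characterizes copd?",
     "text": "copd is characterized by airflow obstruction"},
]

cleaned = []
for row in rows:
    answer = strip_pad(row["answer"])
    cleaned.append({
        "question": row["question"],
        "context": row["text"],
        "answer": answer,
        "answer_start": find_answer_start(row["text"], answer),
    })

for item in cleaned:
    print(item)
```

With the DataFrame from this notebook, the same helper could be applied column-wise, for example `df_bert_training['answer'].map(strip_pad)`. Rows where the offset is -1 would need manual review, because the generated answer does not appear verbatim in the sentence.
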

In [None]:
df_bert_training.to_csv('/content/drive/MyDrive/df_bert_train.csv')