This notebook generates question-answer (QA) pairs from summarized patent data so that they can later be used to fine-tune a transformer model. It includes steps to install the necessary packages, load the summaries, generate questions with a pre-trained model, extract the questions from the raw model output, and save the results.


# Tutorial: QA Pair Generation from Summarized Patent Data

This tutorial walks through generating question-answer (QA) pairs from summarized patent data using a pre-trained transformer model: loading the data, generating questions, extracting the question text from the model output, and saving the results.

# Mount Google Drive
First, we need to mount Google Drive to access and save files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Install Necessary Packages
Install the required packages listed in the stage3-requirements.txt file on your Google Drive. Additionally, install flash-attn, accelerate, and langdetect.



In [None]:
!pip install -r /content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/stage3-requirements.txt
!pip install flash-attn
!pip install accelerate
!pip install langdetect==1.0.9




Collecting absl-py==2.1.0 (from -r /content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/stage3-requirements.txt (line 1))
  Using cached absl_py-2.1.0-py3-none-any.whl (133 kB)
Collecting asttokens==2.4.1 (from -r /content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/stage3-requirements.txt (line 3))
  Using cached asttokens-2.4.1-py2.py3-none-any.whl (27 kB)
Collecting breadability==0.1.20 (from -r /content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/stage3-requirements.txt (line 5))
  Using cached breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting colorama==0.4.6 (from -r /content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/stage3-requirements.txt (line 10))
  Using cached colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting comm==0.2.2 (from -r /content/drive/MyDriv

# QA Pair Generation

We will now generate questions based on the summaries using a pre-trained language model.

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


# Check if CUDA is available and set the device
device = 0 if torch.cuda.is_available() else -1
if device == 0:
    print("CUDA is available. Using GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available. Using CPU.")

CUDA is available. Using GPU: NVIDIA L4


In [None]:
# Load the summarized data
input_file = "/content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/patent_summaries.csv"
data_chunk = pd.read_csv(input_file)

# Display the first few rows to confirm the data structure
print("\nLoaded data with summaries:")
data_chunk[["summary"]].head()


Loaded data with summaries:


Unnamed: 0,summary
0,Disclosed is an adaptable DC-AC inverter syste...
1,A solar energy system comprises: a solar energ...
2,A control method for optimizing a solar-to-pow...
3,"The system includes a first hydropower system,..."
4,A system and method for driving geese away fro...


In [None]:
data_chunk.shape

(4301, 13)

# Initialize Model and Tokenizer

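Initialize the model, tokenizer, and accelerator, then generate one question per summary. The data is processed in chunks of 50 rows (in batches of 5), and the results are saved to Google Drive after every chunk so that an interrupted run can resume where it stopped.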



In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator
import torch
import os

# Initialize the model, tokenizer, and accelerator
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
accelerator = Accelerator(mixed_precision="fp16")
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16
)

# Move model to accelerator
model = model.to(accelerator.device)



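def generate_questions(contexts, max_length=70):
    # Build one prompt per summary; max_length caps the number of newly generated tokens
    input_texts = [f"\n\n{context}\n-------------------\n Generate only one question based on the above context and Just return the Question, nothing else" for context in contexts]
    # Keep the attention mask so that padding tokens are ignored during batched generation
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(accelerator.device)

    with torch.no_grad():
        outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=max_length)
    # Each decoded string contains the prompt followed by the model's response
    questions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return questions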

def process_data_in_parts(data, start_index=0, chunk_size=50):
    total_rows = len(data)
    end_index = min(start_index + chunk_size, total_rows)
    qa_pairs = []

    for index in range(start_index, end_index, 5):  # Process in smaller batches of 5
        batch_end_index = min(index + 5, end_index)
        summaries = data.iloc[index:batch_end_index]["summary"].tolist()
        questions = generate_questions(summaries)
        qa_pairs.extend([{"summary": summary, "question": question, "answer": summary} for summary, question in zip(summaries, questions)])

    return qa_pairs, end_index

# Check if there's a partially processed file to resume from
output_file = "/content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/patent_qa_pairs.csv"
if os.path.exists(output_file):
    processed_df = pd.read_csv(output_file)
    start_index = len(processed_df)
else:
    processed_df = pd.DataFrame(columns=["summary", "question", "answer"])
    start_index = 0

# Process the data in parts
chunk_size = 50  # Define your chunk size
while start_index < len(data_chunk):
    qa_pairs, end_index = process_data_in_parts(data_chunk, start_index, chunk_size)
    qa_pairs_df = pd.DataFrame(qa_pairs)
    processed_df = pd.concat([processed_df, qa_pairs_df], ignore_index=True)
    processed_df.to_csv(output_file, index=False)
    start_index = end_index
    torch.cuda.empty_cache()  # Clear cache to free up memory

print(f"\nGenerated QA pairs saved to {output_file}")
print("\nSample Generated QA pairs:")
print(processed_df.head())


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




Generated QA pairs saved to /content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/patent_qa_pairs.csv

Sample Generated QA pairs:
                                             summary  \
0  Disclosed is an adaptable DC-AC inverter syste...   
1  A solar energy system comprises: a solar energ...   
2  A control method for optimizing a solar-to-pow...   
3  The system includes a first hydropower system,...   
4  A system and method for driving geese away fro...   

                                            question  \
0  \n\nDisclosed is an adaptable DC-AC inverter s...   
1  \n\nA solar energy system comprises: a solar e...   
2  \n\nA control method for optimizing a solar-to...   
3  \n\nThe system includes a first hydropower sys...   
4  \n\nA system and method for driving geese away...   

                                              answer  
0  Disclosed is an adaptable DC-AC inverter syste...  
1  A solar energy system comprises: a solar
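
The chunk-and-batch bookkeeping in the cell above can be checked without loading the model. The following is a minimal sketch; `plan_chunks` is a hypothetical helper, not part of the notebook, and it assumes the same defaults as the loop (chunks of 50 rows, batches of 5). It only computes the index ranges that would be processed and marks where a CSV checkpoint would be written.

```python
def plan_chunks(total_rows, start_index=0, chunk_size=50, batch_size=5):
    """Return one list of (start, end) batch ranges per chunk (hypothetical helper)."""
    chunks = []
    while start_index < total_rows:
        end_index = min(start_index + chunk_size, total_rows)
        batches = [
            (index, min(index + batch_size, end_index))
            for index in range(start_index, end_index, batch_size)
        ]
        chunks.append(batches)
        start_index = end_index  # the notebook writes the CSV checkpoint at this point
    return chunks


plan = plan_chunks(total_rows=4301)  # 4301 is the row count reported by data_chunk.shape
print(len(plan))   # 87 chunks, so 87 checkpoint writes
print(plan[-1])    # [(4300, 4301)]: the final chunk holds a single row
```

Because `qa_pairs_df` is reassigned on every pass through the loop, it holds only the last chunk once the loop finishes; `processed_df` holds the full result.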

In [None]:
print(qa_pairs_df["question"][0])



The invention provides a wind power supply control method, which comprises the following steps. The surplus electric energy of wind-powered electricity generation module output also can be carried to the electric wire netting in the power module.
-------------------
 Generate only one question based on the above context and Just return the Question, nothing else.

- [Response]: How does the wind power supply control method described in the invention manage the surplus electric energy generated by wind-powered electricity generation modules, and what is the role of electric wire netting in this process?
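
The output above starts with the prompt because, for a causal language model, `generate` returns the prompt tokens followed by the newly generated tokens. One alternative to splitting the decoded text afterwards is to drop the prompt tokens before decoding. The sketch below illustrates only the slicing, using plain Python lists with made-up token ids; `keep_new_tokens` is a hypothetical helper. It assumes every prompt in the batch was padded to the same length, in which case the equivalent tensor slice would be `outputs[:, prompt_length:]`.

```python
def keep_new_tokens(output_ids, prompt_length):
    """Drop the first prompt_length ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Made-up ids: two padded prompts of length 4, each followed by 3 new tokens
batch_outputs = [
    [0, 11, 12, 13, 101, 102, 103],
    [21, 22, 23, 24, 201, 202, 203],
]
print(keep_new_tokens(batch_outputs, prompt_length=4))
# [[101, 102, 103], [201, 202, 203]]
```

Note that the extraction step later in this notebook expects the prompt to still be present in the `question` column, so adopting this approach would mean adjusting that step as well.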


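# Extract Questions

Load the generated QA pairs from the CSV file and extract the question from each raw model output. The `question` column contains the prompt followed by the model's response, so the code keeps only the text after the `### Response` marker; rows without that marker get an empty string.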

In [None]:
import pandas as pd


# Load the generated QA pairs from the CSV file
qa_pairs_file_path = "/content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/patent_qa_pairs.csv"
qa_pairs_df = pd.read_csv(qa_pairs_file_path)

# Extract just the question at the end
qa_pairs_df['question_extract'] = qa_pairs_df['question'].apply(lambda x: x.split("\n### Response\n\n")[1] if "\n### Response\n\n" in x else "")

# Save the DataFrame with the new column to a new CSV file in Google Drive
output_file_path = "/content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/patent_qa_pairs_with_extract.csv"
qa_pairs_df.to_csv(output_file_path, index=False)

# Display the first few rows of the DataFrame
print(qa_pairs_df[['question', 'question_extract']].head())
print(f"DataFrame saved to {output_file_path}")
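
The model does not always introduce its answer the same way: the sample printed earlier uses `- [Response]:` rather than `### Response`, and the split above returns an empty string for such rows. A more tolerant extraction could be sketched with a regular expression. The helper name `extract_question` and the pattern are assumptions that cover only the two marker styles visible in this notebook; other variants may exist in the full output.

```python
import re

# Matches "### Response" or "- [Response]", with an optional colon and trailing whitespace
RESPONSE_MARKER = re.compile(r"(?:###\s*Response|-\s*\[Response\])\s*:?\s*", re.IGNORECASE)


def extract_question(generated_text):
    """Return the text after the first response marker, or "" if none is found."""
    match = RESPONSE_MARKER.search(generated_text)
    if match is None:
        return ""
    return generated_text[match.end():].strip()


print(extract_question("Some context.\n### Response\n\nWhat is claimed?"))
print(extract_question("Some context.\n\n- [Response]: How does it work?\n"))
```

It could be applied in place of the lambda with `qa_pairs_df['question'].apply(extract_question)`.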


# Inspect the Extracted Questions

In [None]:
import pandas as pd

# Load the EXTRACTED QA pairs from the CSV file
qa_extract_file_path = "/content/drive/MyDrive/1-Working/1-HSLU-MscIDS/5-Projects/4-Semester/CLT/Samuel-CLT-Development/patent_qa_pairs_with_extract.csv"
qa_extract_df = pd.read_csv(qa_extract_file_path)

In [None]:
qa_extract_df[['question_extract']].head()

Unnamed: 0,question_extract
0,What are the advantages of the proposed adapti...
1,What components are included in a typical sola...
2,What control method is proposed for optimizing...
3,How are the waterwheels and hoist devices in t...
4,What is the principle behind the system that u...


In [None]:
qa_extract_df['question_extract'][0]

'What are the advantages of the proposed adaptive-controlled AC inverter system for solar energy harvesting?'
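
Since the extraction falls back to an empty string when the marker is missing, it is worth checking how many rows ended up without a question. Empty CSV fields are read back by pandas as NaN by default, so in the notebook `qa_extract_df['question_extract'].isna().sum()` would give that count. The sketch below shows the same check with only the standard library on a made-up three-row sample; `count_missing_extracts` is a hypothetical helper.

```python
import csv
import io


def count_missing_extracts(csv_text, column="question_extract"):
    """Return (missing, total) for the given column of a CSV document."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = sum(1 for row in rows if not (row.get(column) or "").strip())
    return missing, len(rows)


# Made-up sample: the second row has no extracted question
sample = (
    "question,question_extract\n"
    "raw text 1,What is A?\n"
    "raw text 2,\n"
    "raw text 3,How does B work?\n"
)
print(count_missing_extracts(sample))  # (1, 3)
```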