
# Tutorial: QA Pair Generation from Summarized Media Data

In this tutorial, we will guide you through the process of generating question-answer (QA) pairs from summarized media data using a pre-trained transformer model. This will involve loading data, generating questions, extracting meaningful questions, and saving the results.

# Mount Google Drive
First, we need to mount Google Drive to access and save files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Install Necessary Packages
Install the packages pinned in the stage3-requirements.txt file from your Google Drive, then install flash-attn, accelerate, and langdetect separately.



In [None]:
!pip install -r /content/drive/MyDrive/CLT/stage3-requirements.txt
!pip install flash-attn
!pip install accelerate
# !pip install -i https://pypi.org/simple/ bitsandbytes
!pip install langdetect==1.0.9

Collecting absl-py==2.1.0 (from -r /content/drive/MyDrive/CLT/stage3-requirements.txt (line 1))
  Downloading absl_py-2.1.0-py3-none-any.whl (133 kB)
Collecting accelerate==0.30.1 (from -r /content/drive/MyDrive/CLT/stage3-requirements.txt (line 2))
  Using cached accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting asttokens==2.4.1 (from -r /content/drive/MyDrive/CLT/stage3-requirements.txt (line 3))
  Downloading asttokens-2.4.1-py2.py3-none-any.whl (27 kB)
Collecting breadability==0.1.20 (from -r /content/drive/MyDrive/CLT/stage3-requirements.txt (line 5))
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting colorama==0.4.6 (from -r /content/drive/MyDrive/CLT/stage3-requirements.txt 

# QA Pair Generation
We will now generate questions from the summaries using a pre-trained language model. Before doing so, we check whether a GPU is available and load the data.


# Load and Verify Summarized Data
Load the summarized data from a CSV file and verify its contents.

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


# Check if CUDA is available and set the device
device = 0 if torch.cuda.is_available() else -1
if device == 0:
    print("CUDA is available. Using GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available. Using CPU.")

CUDA is available. Using GPU: NVIDIA L4


In [None]:
# Load the summarized data
input_file = "/content/drive/MyDrive/CLT/media_data_with_summaries.csv"
data_chunk = pd.read_csv(input_file)

# Display the first few rows to confirm the data structure
print("\nLoaded data with summaries:")
data_chunk[["summary"]].head()


Loaded data with summaries:


                                             summary
0  Qatar Petroleum ( QP) is targeting aggressive ...
1  Kakrapar-3 is the first of India's 700 megawat...
2  New US President Joe Biden took office this we...
3  The slow pace of Japanese reactor restarts con...
4  Two of New York City's largest pension funds s...


In [None]:
data_chunk.shape

(9588, 8)

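# Initialize Model and Tokenizer
Initialize the model, tokenizer, and accelerator, then generate one question per summary in small batches. Results are written to Google Drive after every chunk, so an interrupted run can resume where it stopped.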

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator
import torch
import os

# Initialize the model, tokenizer, and accelerator
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
accelerator = Accelerator(mixed_precision="fp16")
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16
)

# Move model to accelerator
model = model.to(accelerator.device)


def generate_questions(contexts, max_length=70):
    # Build one instruction prompt per summary
    input_texts = [
        f"\n\n{context}\n-------------------\n Generate only one question based on the above context and Just return the Question, nothing else" for context in contexts]
    # Tokenize as a padded batch and keep the attention mask so that
    # padding tokens are ignored during generation
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512).to(accelerator.device)

    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids, attention_mask=inputs.attention_mask,
            max_new_tokens=max_length)
    # Each decoded string contains the prompt followed by the generated text
    questions = [tokenizer.decode(
        output, skip_special_tokens=True) for output in outputs]
    return questions


def process_data_in_parts(data, start_index=0, chunk_size=50):
    total_rows = len(data)
    end_index = min(start_index + chunk_size, total_rows)
    qa_pairs = []

    for index in range(start_index, end_index, 5):  # Process in smaller batches of 5
        batch_end_index = min(index + 5, end_index)
        summaries = data.iloc[index:batch_end_index]["summary"].tolist()
        questions = generate_questions(summaries)
        qa_pairs.extend([{"summary": summary, "question": question, "answer": summary}
                        for summary, question in zip(summaries, questions)])

    return qa_pairs, end_index


# Check if there's a partially processed file to resume from
output_file = "/content/drive/MyDrive/CLT/media_qa_pairs.csv"
if os.path.exists(output_file):
    processed_df = pd.read_csv(output_file)
    start_index = len(processed_df)
else:
    processed_df = pd.DataFrame(columns=["summary", "question", "answer"])
    start_index = 0

# Process the data in parts
chunk_size = 50  # Define your chunk size
while start_index < len(data_chunk):
    qa_pairs, end_index = process_data_in_parts(
        data_chunk, start_index, chunk_size)
    qa_pairs_df = pd.DataFrame(qa_pairs)
    processed_df = pd.concat([processed_df, qa_pairs_df], ignore_index=True)
    processed_df.to_csv(output_file, index=False)
    start_index = end_index
    torch.cuda.empty_cache()  # Clear cache to free up memory

print(f"\nGenerated QA pairs saved to {output_file}")
print("\nSample Generated QA pairs:")
print(processed_df.head())

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


Generated QA pairs saved to /content/drive/MyDrive/CLT/media_qa_pairs.csv

Sample Generated QA pairs:
                                             summary  \
0  Qatar Petroleum ( QP) is targeting aggressive ...   
1  Kakrapar-3 is the first of India's 700 megawat...   
2  New US President Joe Biden took office this we...   
3  The slow pace of Japanese reactor restarts con...   
4  Two of New York City's largest pension funds s...   

                                            question  \
0  \n\nQatar Petroleum ( QP) is targeting aggress...   
1  \n\nKakrapar-3 is the first of India's 700 meg...   
2  \n\nNew US President Joe Biden took office thi...   
3  \n\nThe slow pace of Japanese reactor restarts...   
4  \n\nTwo of New York City's largest pension fun...   

                                              answer  
0  Qatar Petroleum ( QP) is targeting aggressive ...  
1  Kakrapar-3 is the first of India's 700 megawat...  
2  New US President Joe Biden took office this we...  
3  
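
The chunking and resume logic in the cell above comes down to index arithmetic. The following is a minimal sketch of that arithmetic in plain Python, with no model involved. The helper name `plan_batches` is hypothetical and not part of the notebook. It assumes, as the notebook does, that the output CSV holds exactly one row per processed summary, so its length is the index of the next row to process.

```python
def plan_batches(total_rows, start_index=0, chunk_size=50, batch_size=5):
    """Return the (start, end) row ranges that one chunk is split into.

    Mirrors process_data_in_parts: a chunk covers at most chunk_size rows
    and is processed in batches of batch_size rows.
    """
    end_index = min(start_index + chunk_size, total_rows)
    return [
        (i, min(i + batch_size, end_index))
        for i in range(start_index, end_index, batch_size)
    ]


# Resuming at row 9550 of 9588: the final chunk is shorter than chunk_size,
# and its last batch holds only three rows.
batches = plan_batches(total_rows=9588, start_index=9550)
print(len(batches), batches[0], batches[-1])  # 8 (9550, 9555) (9585, 9588)
```

Because the CSV is rewritten after every chunk, an interruption should cost at most one chunk of work under these assumptions.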

In [None]:
print(qa_pairs_df["question"][1])



Dandelion Energy and Palmetto will collaborate to expand access to renewable energy solutions. The partnership will begin via a launch program with Palmet to in Massachusetts, where Dandelion expanded services in 2022.
-------------------
 Generate only one question based on the above context and Just return the Question, nothing else.  Generate a question question that could be answered based on on the context:
 context is:  The Dandelion Foundation is a non-profit organization that aims to support and empower women in the renewable energy sector. The foundation provides various programs and initiatives to help women enter and thrive in the industry,
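
As the output above shows, the decoded text begins with the prompt itself, because a causal language model returns the input tokens followed by the newly generated ones. One possible way to separate the two is to remove the known prompt from the front of the decoded string. The sketch below is illustrative only: `build_prompt` and `strip_prompt` are hypothetical helpers, and the approach assumes the prompt was not cut short by the 512-token truncation. An alternative would be to slice the generated token ids past the prompt length before decoding.

```python
PROMPT_SUFFIX = (
    "\n-------------------\n Generate only one question based on the above "
    "context and Just return the Question, nothing else"
)


def build_prompt(context):
    # Same template as the one used in generate_questions.
    return f"\n\n{context}{PROMPT_SUFFIX}"


def strip_prompt(decoded, context):
    """Drop the echoed prompt from a decoded generation, if present."""
    text = decoded.strip()
    head = build_prompt(context).strip()
    if text.startswith(head):
        return text[len(head):].strip()
    # Prompt not found verbatim (for example, it was truncated): leave as is.
    return text


example = build_prompt("Some summary.") + ".\n\n- response: What is it?"
print(repr(strip_prompt(example, "Some summary.")))
```

The remainder can still contain lead-in text such as "- response:", so a question extraction step like the one in the next section is still useful.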


# Extract Questions

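Load the generated QA pairs from the CSV file and inspect the raw model output. Each entry in the question column still contains the echoed prompt, so the actual question has to be extracted from it.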


In [None]:
import pandas as pd


# Load the generated QA pairs from the CSV file
qa_pairs_file_path = "/content/drive/MyDrive/CLT/media_qa_pairs.csv"
qa_pairs_df = pd.read_csv(qa_pairs_file_path)

qa_pairs_df['question'].head()

0    \n\nQatar Petroleum ( QP) is targeting aggress...
1    \n\nKakrapar-3 is the first of India's 700 meg...
2    \n\nNew US President Joe Biden took office thi...
3    \n\nThe slow pace of Japanese reactor restarts...
4    \n\nTwo of New York City's largest pension fun...
Name: question, dtype: object

In [None]:
print(qa_pairs_df["question"][0])



Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. The company is also aiming to reduce gas flaring intensity across its upstream facilities.
-------------------
 Generate only one question based on the above context and Just return the Question, nothing else.

- response: What are the key strategies Qatar Petroleum (QP) is implementing to aggressively cut greenhouse gas emissions and reduce gas flaring intensity in its Phase 2 LNG expansion and upstream facilities?


# Extract Questions Function

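Define a function that extracts the question from the response using two strategies: first it looks for a question word that follows a ": " delimiter, then it falls back to a regular expression that matches common question openings up to the next question mark.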

In [None]:
import pandas as pd
import re

# Function to extract the question from the response using multiple strategies


def extract_question(response):
    # Check for specific delimiters and extract text between pattern and '?'
    pattern_keywords = ["How", "What", "Why", "Explain", "Describe", "Which", "In what", "To what", "In which", "What are", "What is", "How does", "What role", "How is", "How can",
                        "What benefits", "What advantages", "What impact", "What function", "What effect", "How will", "What purpose", "What contribution", "Why does", "Explain the impact", "How do"]
    for keyword in pattern_keywords:
        if f": {keyword}" in response:
            parts = response.split(f": {keyword}", 1)
            question_part = f"{keyword}{parts[1]}"
            question_parts = question_part.split("?")
            if len(question_parts) > 1:
                return question_parts[0].strip() + "?"
            return question_part.strip()

    # Use regular expressions to capture common question patterns
    pattern = re.compile(r'((How|What|Why|Explain|Describe|Which|In what|To what|In which|What are|What is|How does|What role|How is|How can|What benefits|What advantages|What impact|What function|What effect|How will|What purpose|What contribution|Why does|Explain the impact|How do).*?\?)', re.IGNORECASE)
    match = pattern.search(response)
    if match:
        return match.group(0).strip()

    # If no specific pattern is found, return None
    return None


# Apply the extraction function to the DataFrame
qa_pairs_df["question_extract"] = qa_pairs_df["question"].apply(
    extract_question)

# Replace None with pd.NA
qa_pairs_df["question_extract"] = qa_pairs_df["question_extract"].replace(
    [None], [pd.NA])

# Save the DataFrame with the new column to a new CSV file in Google Drive
output_file_path = "/content/drive/MyDrive/CLT/media_qa_pairs_with_extract.csv"
qa_pairs_df.to_csv(output_file_path, index=False)

# Display the first few rows of the DataFrame
print(qa_pairs_df[["question", "question_extract"]].head())
print(f"DataFrame saved to {output_file_path}")

                                            question  \
0  \n\nQatar Petroleum ( QP) is targeting aggress...   
1  \n\nKakrapar-3 is the first of India's 700 meg...   
2  \n\nNew US President Joe Biden took office thi...   
3  \n\nThe slow pace of Japanese reactor restarts...   
4  \n\nTwo of New York City's largest pension fun...   

                                    question_extract  
0  What are the key strategies Qatar Petroleum (Q...  
1  What is the significance of Kakrapar-3 being t...  
2  How might the Biden administration's focus on ...  
3                                               <NA>  
4  What is the approximate value of the shares of...  
DataFrame saved to /content/drive/MyDrive/CLT/media_qa_pairs_with_extract.csv
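
One caveat about the regular-expression fallback: it has no word boundaries and ignores case, so a match can start in the middle of a word such as "somewhat" or "show". The sketch below shows a possible variant that anchors each question opener with `\b`. The names `STARTERS`, `QUESTION_RE`, and `first_question` are hypothetical, and this is not the function used to produce the results in this tutorial; applying it could change the number of extracted questions.

```python
import re

# Question openers; longer forms such as "What are" are already covered
# by their first word.
STARTERS = ["How", "What", "Why", "Explain", "Describe", "Which",
            "In what", "To what", "In which"]

# \b keeps a match from starting or ending in the middle of a word.
# [^?\n]* reads up to the first question mark on the same line.
QUESTION_RE = re.compile(
    r"\b(?:" + "|".join(STARTERS) + r")\b[^?\n]*\?",
    re.IGNORECASE,
)


def first_question(text):
    """Return the first question-like span in text, or None."""
    match = QUESTION_RE.search(text)
    return match.group(0).strip() if match else None


print(first_question("It is somewhat unclear. What is the plan?"))
# What is the plan?
```

Without the boundaries, the same input would yield "what unclear. What is the plan?", because the search would begin inside "somewhat".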


In [None]:
print(qa_pairs_df["question_extract"].shape)

print(qa_pairs_df[["question_extract"]].isnull().sum())

print(qa_pairs_df[["question_extract"]].head())

qa_pairs_df[["question_extract"]].tail()

(9588,)
question_extract    1186
dtype: int64
                                    question_extract
0  What are the key strategies Qatar Petroleum (Q...
1  What is the significance of Kakrapar-3 being t...
2  How might the Biden administration's focus on ...
3                                               <NA>
4  What is the approximate value of the shares of...


                                       question_extract
9583  What is the purpose of the $30 million revolvi...
9584  How will Ørsted's deployment of SparkCognition...
9585  What is the expected energy production and dur...
9586  What were the key factors that led to the sign...
9587  What type of financing did Arevon Energy Inc. ...


In [None]:
print(qa_pairs_df["question_extract"][5000])

What specific strategies has California Resources Corporation (CRC) implemented to achieve a reduction of more than 60% in methane emissions from its 2013 baseline by 2030?


In [None]:
print(qa_pairs_df["question_extract"][9586])

What were the key factors that led to the significant expansion of the wind power industry in the United States, as evidenced by the numerous Chapter 11 bankruptcy filings in the sector?


# Clean Extracted Questions

Drop rows with a missing question_extract value and save the cleaned data to a new CSV file.

In [None]:
# Drop NA in question_extract column and save to csv

qa_pairs_df = qa_pairs_df.dropna(subset=["question_extract"])

qa_pairs_df = qa_pairs_df[["summary", "question_extract"]]

# rename question_extract to question and summary to answer
qa_pairs_df = qa_pairs_df.rename(
    columns={"question_extract": "question", "summary": "answer"})

qa_pairs_df.to_csv("/content/drive/MyDrive/CLT/media_qa_pairs_clean.csv", index=False)


In [None]:
print(qa_pairs_df.isnull().sum())
print(qa_pairs_df.shape)


answer      0
question    0
dtype: int64
(8402, 2)
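
The row count above can be cross-checked against the earlier outputs: the file held 9588 rows, of which 1186 had no extracted question. A quick sanity check of that arithmetic, using only the numbers reported in this tutorial:

```python
total_rows = 9588   # rows in the QA pairs file
missing = 1186      # rows where question_extract is NA

kept = total_rows - missing
print(kept)                        # 8402
print(f"{kept / total_rows:.1%}")  # 87.6%
```

So roughly one in eight generated responses did not yield a question that the extraction function could recognize.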


In [None]:
qa_pairs_df.head()

                                              answer  \
0  Qatar Petroleum ( QP) is targeting aggressive ...   
1  Kakrapar-3 is the first of India's 700 megawat...   
2  New US President Joe Biden took office this we...   
4  Two of New York City's largest pension funds s...   
5  Japan's Supreme Court will likely become the a...   

                                            question  
0  What are the key strategies Qatar Petroleum (Q...  
1  What is the significance of Kakrapar-3 being t...  
2  How might the Biden administration's focus on ...  
4  What is the approximate value of the shares of...  
5  What issue is likely to be decided by Japan's ...  
