## Generating Q&A

In [1]:
import asyncio
import nest_asyncio
from groq import AsyncGroq
import pandas as pd
import os
from aiohttp import ClientResponseError
import time
import re
from tqdm import tqdm 
import json

In [2]:
nest_asyncio.apply()
client = AsyncGroq(api_key='')

questions_prompt = """You are a helpful assistant. You will receive an article and the title related to Embedded Linux, the Linux kernel, U-boot or a vendor-specific documentation. Your objective is to output a valid list of strings with three questions that could be answer EXCLUSIVELY with the content of the article. These questions must be deeply technical and refer to the intricacies of what is refered on the article text

===============================================
Example 1)


title: "Introduction to Device Tree"
article_text: "Device Tree is a data structure for describing hardware. It is used by the Linux kernel to manage the hardware it runs on. The Device Tree source can be compiled into a binary format. Device Tree provides a way to describe the peripherals of a system, such as memory, CPUs, and buses, without hardcoding these descriptions in the operating system code."

output: ['1. What is Device Tree?', '2. How is the Device Tree source compiled?', '3. What does Device Tree describe in a system?']

Example 2)

title: "Understanding U-Boot",
article_text: "U-Boot is a boot loader used in embedded devices to package the instructions to boot the device's operating system kernel. It supports multiple filesystems, network booting, and various hardware platforms. U-Boot is highly configurable and can be customized for different hardware architectures and requirements."

output: ['1. What is U-Boot?', '2. What functionalities does U-Boot support?', '3. How can U-Boot be customized?']

Example 3)

title: "Linux Kernel Overview",
article_text: "The Linux kernel is the core component of the Linux operating system. It manages the system's hardware, runs user programs, and provides services to the software. The kernel handles process management, memory management, hardware device control, and system calls."

output: ['1. What is the Linux kernel?', '2. What is the Linux Kernel for?', '3. How does the Linux kernel manage hardware devices?']
===============================================

The output MUST be on the schema below:

['1. Question 1', '2. Question 2', '3. Question 3']

Do not add any text, explanation or extra element to the output. THE OUTPUT MUST BE SOLELY THE LIST
"""

categories_prompt = """
You are a helpful assistant. You will receive an article and the title related to Embedded Linux, the Linux kernel, U-boot or a vendor-specific documentation. Your objective is to output a valid list of strings with one or two labels/categories that you can infer EXCLUSIVELY with the content of the article and title. These labels must be related to the embedded Linux area, hardware, electronics, Linux, vendors, etc.

===============================================
Example 1)


title: "Introduction to Device Tree"
article_text: "Device Tree is a data structure for describing hardware. It is used by the Linux kernel to manage the hardware it runs on. The Device Tree source can be compiled into a binary format. Device Tree provides a way to describe the peripherals of a system, such as memory, CPUs, and buses, without hardcoding these descriptions in the operating system code."

output: ['Device Tree']

Example 2)

title: "Understanding U-Boot",
article_text: "U-Boot is a boot loader used in embedded devices to package the instructions to boot the device's operating system kernel. It supports multiple filesystems, network booting, and various hardware platforms. U-Boot is highly configurable and can be customized for different hardware architectures and requirements."

output: ['U-Boot', 'Bootloader']

Example 3)

title: "Linux Kernel Overview",
article_text: "The Linux kernel is the core component of the Linux operating system. It manages the system's hardware, runs user programs, and provides services to the software. The kernel handles process management, memory management, hardware device control, and system calls."

output: ['Linux kernel', 'Device Drivers']
===============================================

The output MUST be on the schema below:

['Category 1', 'Category 2']

Do not add any text, explanation or extra element to the output. THE OUTPUT MUST BE SOLELY THE LIST
"""

In [3]:
async def groq_completion(title, article, prompt):
    result = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": prompt,
            },
            {
                "role": "user",
                "content": f"title: {title} | article: {article}",
            }
        ],
        model="llama3-70b-8192",
        temperature=0.1,
        max_tokens=1024,
        top_p=1,
        stop=None,
        stream=False,
        seed=100,
    )
    return eval(result.choices[0].message.content)

async def groq_completion_with_retry(title, article, prompt, retries=8):
    for attempt in range(retries):
        try:
            result = await groq_completion(title, article, prompt)
            return result
        except Exception as e:
            error_message = str(e)
            if "rate_limit_exceeded" in error_message:
                wait_time_match = re.search(r'Please try again in (\d+)m(\d+\.\d+)s', error_message)
                if wait_time_match:
                    minutes = int(wait_time_match.group(1))
                    seconds = float(wait_time_match.group(2))
                    wait_time = minutes * 60 + seconds
                    print(f"Rate limit exceeded. Waiting for {wait_time} seconds before retrying.")
                else:
                    wait_time = 2 ** (attempt + 1)  # exponential backoff as fallback
                    print(f"Rate limit exceeded. Retrying in {wait_time} seconds (attempt {attempt + 1}/{retries})")
            else:
                wait_time = 2 ** (attempt + 1)  # exponential backoff for other errors
                print(f"Retry {attempt + 1}/{retries} after exception: {str(e)}")
            await asyncio.sleep(wait_time)
    raise Exception(f"Failed after {retries} attempts")

In [4]:
dataset_path = os.path.join('..', '..', 'dataset', 'deduplicated_dataset.json')

data = pd.read_json(dataset_path, lines=True)

In [5]:
def save_progress_to_json(results, filename='final_results.json'):
    with open(filename, 'w') as f:
        json.dump(results, f, indent=4)

def load_progress_from_json(filename='final_results.json'):
    if os.path.exists(filename):
        with open(filename, 'r') as f:
            return json.load(f)
    return []

results = []

In [6]:
async def process_row(index, row):
    title = row['title']
    article = row['article_text']

    result = {
        'title': title,
        'article_text': article,
        'questions': [],
        'labels': []
    }

    try:
        result['questions'] = await groq_completion_with_retry(title, article, questions_prompt)
    except Exception as e:
        print(f"An unexpected error occurred (questions): {str(e)}")

    try:
        result['labels'] = await groq_completion_with_retry(title, article, categories_prompt)
    except Exception as e:
        print(f"An unexpected error occurred (labels): {str(e)}")

    return result

async def process_rows_in_batches(data, start_index, batch_size=10):
    total_batches = (len(data) - start_index) // batch_size + 1
    pbar = tqdm(total=total_batches, desc="Generating questions and labels", initial=start_index // batch_size)
    
    for i in range(start_index, len(data), batch_size):
        batch = data.iloc[i:i+batch_size]
        tasks = [process_row(index, row) for index, row in batch.iterrows()]
        results_batch = await asyncio.gather(*tasks, return_exceptions=True)

        # Append results and save progress
        results.extend(results_batch)
        save_progress_to_json(results)

        # Update the progress bar
        pbar.update(1)

    pbar.close()

In [7]:
results = load_progress_from_json()
start_index = len(results)
start_index

1741

In [10]:
asyncio.run(process_rows_in_batches(data, start_index))
save_progress_to_json(results, filename='final_results.json')


[A

Retry 1/8 after exception: Error code: 413 - {'error': {'message': 'Request Entity Too Large', 'type': 'invalid_request_error', 'code': 'request_too_large'}}
Rate limit exceeded. Waiting for 635.124 seconds before retrying.
Rate limit exceeded. Waiting for 170.203 seconds before retrying.
Rate limit exceeded. Waiting for 249.909 seconds before retrying.
Rate limit exceeded. Waiting for 1179.782 seconds before retrying.
Rate limit exceeded. Waiting for 377.356 seconds before retrying.
Retry 2/8 after exception: Error code: 413 - {'error': {'message': 'Request Entity Too Large', 'type': 'invalid_request_error', 'code': 'request_too_large'}}
Retry 3/8 after exception: Error code: 413 - {'error': {'message': 'Request Entity Too Large', 'type': 'invalid_request_error', 'code': 'request_too_large'}}
Retry 4/8 after exception: Error code: 413 - {'error': {'message': 'Request Entity Too Large', 'type': 'invalid_request_error', 'code': 'request_too_large'}}
Retry 5/8 after exception: Error code

Generating questions and labels:  46%|████▌     | 174/377 [17:58<?, ?it/s]
Generating questions and labels:  46%|████▌     | 174/377 [02:22<?, ?it/s]


Rate limit exceeded. Retrying in 2 seconds (attempt 1/8)
Rate limit exceeded. Retrying in 2 seconds (attempt 1/8)
Retry 7/8 after exception: Error code: 413 - {'error': {'message': 'Request Entity Too Large', 'type': 'invalid_request_error', 'code': 'request_too_large'}}
Rate limit exceeded. Waiting for 168.484 seconds before retrying.
Rate limit exceeded. Retrying in 4 seconds (attempt 2/8)
Rate limit exceeded. Waiting for 222.148 seconds before retrying.
Retry 8/8 after exception: Error code: 413 - {'error': {'message': 'Request Entity Too Large', 'type': 'invalid_request_error', 'code': 'request_too_large'}}
Rate limit exceeded. Retrying in 8 seconds (attempt 3/8)
Rate limit exceeded. Retrying in 16 seconds (attempt 4/8)
Rate limit exceeded. Waiting for 119.75999999999999 seconds before retrying.
Rate limit exceeded. Retrying in 32 seconds (attempt 5/8)
Rate limit exceeded. Waiting for 327.63 seconds before retrying.
Rate limit exceeded. Retrying in 64 seconds (attempt 6/8)
Rate lim

KeyboardInterrupt: 