## Take-Home Test: Reformat a Public Dataset for LLM Training

### Objective

The goal of this task is to prepare a public dataset for more effective use in training and fine-tuning Large Language Models (LLMs). You will reformat a specific subset of a public dataset into a structured, consistent format so that it is easier to use.

### Detailed Instructions

#### 1. Dataset Selection and Preparation

- **Dataset:** You are assigned the `Headline` subset of the [AdaptLLM/finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks) dataset.

- **Task Description:** Each entry in the `input` column contains multiple "Yes" or "No" questions alongside their respective answers. Your task is to:

  - Develop a Python script to parse and separate each question and its answer from the entry.
  - Save each question-answer pair in a structured JSON format as follows:
    ```json
    {
      "id": "<unique_identifier>",
      "Question": "<question_text>",
      "Answer": "<answer_text>"
    }
    ```

  - You are encouraged to introduce additional attributes if needed to preserve the integrity and completeness of the information. Adding relevant tag information is strongly recommended.
- **Automation Requirement:** The task must be completed using Python. Manual editing or data manipulation is strictly prohibited. Your script should efficiently handle variations in data format within the column.
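
As a rough illustration of the parsing step described above, the sketch below pulls question-answer pairs out of a single text entry with a regular expression. It assumes each question ends in `?` and is followed directly by `Yes` or `No`; the real `input` entries may be laid out differently, so the pattern, the `extract_pairs` helper, and the sample text are all hypothetical.

```python
import re

# Assumed layout (not confirmed against the dataset): question text ending
# in "?", optional whitespace, then a literal "Yes" or "No".
QA_PATTERN = re.compile(r"([^?\n]*\?)\s*(Yes|No)\b")


def extract_pairs(entry: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples found in one entry string."""
    return [(q.strip(), a) for q, a in QA_PATTERN.findall(entry)]


# Hypothetical sample entry, used only to exercise the pattern.
sample = (
    "Does the headline mention a price increase? Yes\n"
    "Does the headline mention a merger? No"
)
pairs = extract_pairs(sample)
for question, answer in pairs:
    print(question, "->", answer)
```

A single compiled pattern keeps the extraction fully automated. Entries that do not match the assumed layout yield no pairs rather than raising, so they can be counted and inspected separately.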

#### 2. Deliverables

- **Reformatted Dataset:** Provide the schema of the final format you adopted for saving the results.
- **Transformation Code:** Submit the complete code used for converting the dataset into the designated format.
- **Statistics:** Report the total number of question-answer pairs extracted from the dataset.
- **Performance Metrics:** Document the time taken to complete the dataset cleanup and transformation process.
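
One possible way to cover the schema, statistics, and timing deliverables together is sketched below. The `QARecord` class, the `source_row` attribute, and the `headline-<row>-<pair>` id scheme are illustrative assumptions rather than part of the task specification, and the input pairs are made-up placeholders.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class QARecord:
    id: str
    Question: str
    Answer: str
    source_row: int  # hypothetical extra attribute: row the pair came from


def build_records(pairs_by_row):
    """Turn per-row lists of (question, answer) tuples into QARecord objects."""
    records = []
    for row_idx, row_pairs in enumerate(pairs_by_row):
        for pair_idx, (question, answer) in enumerate(row_pairs):
            records.append(
                QARecord(
                    id=f"headline-{row_idx}-{pair_idx}",  # assumed id scheme
                    Question=question,
                    Answer=answer,
                    source_row=row_idx,
                )
            )
    return records


# Placeholder input standing in for the parsed dataset.
start = time.perf_counter()
records = build_records([[("Q1?", "Yes"), ("Q2?", "No")], [("Q3?", "No")]])
elapsed = time.perf_counter() - start

report = {"total_pairs": len(records), "seconds": round(elapsed, 4)}
payload = json.dumps([asdict(r) for r in records], indent=2)
print(report)
```

Ids derived from row and pair positions stay the same across runs, which makes two outputs easy to compare. Random UUIDs are an equally valid choice when reproducibility does not matter.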


In [3]:
import pandas as pd
import json
import uuid
import time

start_time = time.time()


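def parse_row(row):
    """Collect (question, answer) pairs from one row.

    A cell containing "?" is treated as a question; the cell in the
    next column is taken as its answer.
    """
    qas = []
    for pos, col in enumerate(row.index):
        if "?" in str(row[col]):
            question = str(row[col]).strip()
            answer_col = pos + 1
            if answer_col < len(row.index):
                # Use .iloc for positional access: integer keys on a
                # label-indexed Series are deprecated in pandas.
                answer = str(row.iloc[answer_col]).strip()
                qas.append((question, answer))
    return qas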


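# Read the file as tab-separated text. quoting=3 is csv.QUOTE_NONE, so quote
# characters are kept literally; malformed lines are skipped.
df = pd.read_csv("train.csv", delimiter='\t', on_bad_lines='skip', quoting=3)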

structured_data = []
for idx, row in df.iterrows():
    qas = parse_row(row)
    for qa in qas:
        structured_data.append({
            "id": str(uuid.uuid4()),
            "Question": qa[0],
            "Answer": qa[1]
        })

output_file_path = 'output.json'
with open(output_file_path, 'w') as outfile:
    json.dump(structured_data, outfile, indent=2)

end_time = time.time()
execution_time = end_time - start_time

output_file_path, len(structured_data), execution_time


('output.json', 83330, 2.707728862762451)
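
A quick sanity check on the extracted pairs can help confirm that the reported count reflects genuine Yes/No answers. The sketch below assumes records shaped like the ones written to `output.json`; the `summarize` helper and the demo records are hypothetical.

```python
from collections import Counter

VALID_ANSWERS = {"Yes", "No"}


def summarize(records):
    """Count answers and collect records whose answer is not Yes/No."""
    counts = Counter(r["Answer"] for r in records)
    invalid = [r for r in records if r["Answer"] not in VALID_ANSWERS]
    return counts, invalid


# Made-up records for illustration; in practice, load them from output.json.
demo = [
    {"id": "a", "Question": "Is it up?", "Answer": "Yes"},
    {"id": "b", "Question": "Is it down?", "Answer": "No"},
    {"id": "c", "Question": "Is it flat?", "Answer": "nan"},
]
counts, invalid = summarize(demo)
print(dict(counts), len(invalid))
```

Any record that lands in `invalid` suggests the parser picked up something other than an answer, which is worth investigating before reporting the final count.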