## Reformat a Public Dataset for LLM Training

### Objective

The goal of this task is to prepare public datasets for more effective use in training and fine-tuning Large Language Models (LLMs). You are required to reformat a specific subset of a public dataset into a structured, consistent format that is easier to consume downstream.

### Detailed Instructions

#### 1. Dataset Selection and Preparation

- **Dataset:** You are assigned the `Headline` subset of the [AdaptLLM/finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks) dataset.

- **Task Description:** Each entry in the `input` column contains multiple "Yes" or "No" questions alongside their respective answers. Your task is to:

  - Develop a Python script to parse and separate each question and its answer from the entry.
  - Save each question-answer pair in a structured JSON format as follows:
    ```json
    {
      "id": "<unique_identifier>",
      "Question": "<question_text>",
      "Answer": "<answer_text>"
    }
    ```

  - You are encouraged to introduce additional attributes if needed to preserve the integrity and completeness of the information. Adding relevant tag information is strongly recommended.
- **Automation Requirement:** The task must be completed using Python. Manual editing or data manipulation is strictly prohibited. Your script should efficiently handle variations in data format within the column.

#### 2. Deliverables

- **Reformatted Dataset:** Provide the schema of the final format you adopted for saving the results.
- **Transformation Code:** Submit the complete code used for converting the dataset into the designated format.
- **Statistics:** Report the total number of question-answer pairs extracted from the dataset.
- **Performance Metrics:** Document the time taken to complete the dataset cleanup and transformation process.
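
The statistics and timing deliverables can be collected in a single pass. The following is a minimal sketch, not a prescribed method: `run_with_metrics` and `toy_transform` are hypothetical names introduced here for illustration, and the toy transform simply stands in for whatever parsing function is actually used.

```python
import time

def run_with_metrics(transform, records):
    """Run `transform` over `records`; return (metrics, pairs)."""
    start = time.perf_counter()  # monotonic clock intended for measuring durations
    pairs = transform(records)
    elapsed = time.perf_counter() - start
    return {"num_pairs": len(pairs), "seconds": elapsed}, pairs

# Hypothetical stand-in for the real transformation: one pair per record.
def toy_transform(records):
    return [{"id": i, "Question": r, "Answer": "Yes"} for i, r in enumerate(records)]

metrics, pairs = run_with_metrics(toy_transform, ["q1?", "q2?", "q3?"])
print(metrics["num_pairs"], f"{metrics['seconds']:.6f}s")
```

Measuring around the whole transformation, rather than per record, keeps the timing overhead negligible.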


In [210]:
from tqdm import tqdm
import pandas as pd
import random
import json 
import re

In [211]:
# Use a context manager so the file handle is closed after loading
with open('test.json', encoding='utf-8') as f:
    dataset_fina = json.load(f)

In [212]:
for key, value in dataset_fina[0].items():
    # Print the key and the type of its corresponding value
    print(f"Key: {key} -> Type: {type(value)}")

Key: id -> Type: <class 'int'>
Key: input -> Type: <class 'str'>
Key: options -> Type: <class 'list'>
Key: gold_index -> Type: <class 'int'>
Key: class_id -> Type: <class 'int'>


### 1. Parse the data into reformatted file

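##### Some entries contain a formatting error: the text "gold tumbles to 9-month low at rs 8,520 \n\n" includes a stray blank line, which a split on "\n\n" would treat as a boundary between question-answer pairs
- The ids of the affected entries are: [2480, 14500, 17352, 17990]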
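
The effect of the stray blank line can be reproduced on a toy string. This is only a sketch: the headline text is the one quoted above, but the question wording appended to it is made up for illustration and is not taken from the dataset.

```python
# Headline text from the affected entries; the question wording is hypothetical.
headline = "gold tumbles to 9-month low at rs 8,520 \n\n"
block = headline + "Does the news headline talk about price? Yes"

# Splitting on the blank line yields a fragment with no answer attached.
broken = block.split("\n\n")

# Collapsing the stray blank line to a single newline keeps the block whole.
fixed = block.replace(headline, headline[:-1]).split("\n\n")

print(len(broken), len(fixed))
```

After the replacement, the block stays in one piece and still ends with its answer.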

In [213]:
reformatted_json = []
typo_error_list = [2480, 14500, 17352, 17990]
typo_error_str = "gold tumbles to 9-month low at rs 8,520 \n\n"

def extract_last_yes_or_no(sentence: str):
    """
    Find the last standalone "Yes" or "No" (case-insensitive) in the sentence.
    Return (answer, sentence_with_answer_removed) on success,
    or (sentence, None) if no "Yes"/"No" is found.
    """
    # The greedy first group pushes the match to the last "Yes"/"No"
    pattern = r'(.*\b)(Yes|No)(\b.*)$'
    match = re.search(pattern, sentence, flags=re.IGNORECASE)

    if match:
        # Extract the word "Yes" or "No" that was matched
        word_removed = match.group(2)
        # Remove the word from the sentence
        modified_sentence = re.sub(pattern, r'\1\3', sentence, flags=re.IGNORECASE)
        return word_removed, modified_sentence
    else:
        return sentence, None


def fill_into_template(reformatted_json: list[dict], item: dict, typo_error_list: list[int] = None) -> None:
    """
    Append the item's question-answer pairs to reformatted_json (in place).
    The function splits the input into question-answer sentences
    on "\n\n" and extracts each answer as the last "Yes" or "No"
    in the sentence.
    """
    # Guard against the default of None before testing membership
    if typo_error_list and item['id'] in typo_error_list:
        # Collapse the stray blank line ("\n\n" -> "\n") so it is not split on
        item['input'] = item['input'].replace(typo_error_str, typo_error_str[:-1])
    # Append the gold option so the final sentence also ends with an answer
    text = item['input'] + ' ' + item['options'][item['gold_index']]
    question_answer_list = text.split("\n\n")
    for sentence in question_answer_list:
        answer, question = extract_last_yes_or_no(sentence)
        if question and answer:
            item_template = {
                'id': len(reformatted_json),  # running index, unique across the output
                'Question': question.replace("\"", ""),
                'Answer': answer,
                'class_id': item['class_id'],
            }
            reformatted_json.append(item_template)
        else:
            # Report sentences where no answer could be extracted
            print(f"Error: {sentence}")
            print(item['id'])


In [214]:
for item in tqdm(dataset_fina, desc='Processing dataset'):
    fill_into_template(reformatted_json, item, typo_error_list)

Processing dataset: 100%|██████████| 20547/20547 [00:07<00:00, 2869.19it/s]


### 2. Let's see the reformatted output

In [215]:
print(f"Number of questions in the reformatted dataset: {len(reformatted_json)}")

Number of questions in the reformatted dataset: 123282


In [216]:
# random.randint is inclusive on both ends and could index past the list; random.choice avoids that
random.choice(reformatted_json)

{'id': 89959,
 'Question': 'SEBI allows gold ETFs to invest in Gold Deposit Schemes Does the news headline talk about price in the past? ',
 'Answer': 'No',
 'class_id': 0}
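
Beyond spot-checking a single record, the answer and `class_id` distributions are worth a quick look. Because the extraction regex is case-insensitive, lowercase answers could in principle slip through, so the sketch below normalizes case before counting. `summarize` and the sample records are hypothetical and only follow the output schema shown above.

```python
from collections import Counter

def summarize(pairs):
    """Count case-normalized answers and class_id values; list ids with odd answers."""
    answers = Counter(p["Answer"].strip().capitalize() for p in pairs)
    classes = Counter(p["class_id"] for p in pairs)
    unexpected = [p["id"] for p in pairs
                  if p["Answer"].strip().lower() not in {"yes", "no"}]
    return answers, classes, unexpected

# Hypothetical records in the notebook's output schema.
sample = [
    {"id": 0, "Question": "q0?", "Answer": "No", "class_id": 0},
    {"id": 1, "Question": "q1?", "Answer": "yes", "class_id": 1},
    {"id": 2, "Question": "q2?", "Answer": "Yes", "class_id": 1},
]
answers, classes, unexpected = summarize(sample)
print(answers, classes, unexpected)
```

On the real data, `summarize(reformatted_json)` would be the equivalent call.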

### 3. Let's look at it as a DataFrame

In [217]:
df = pd.DataFrame(reformatted_json)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123282 entries, 0 to 123281
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        123282 non-null  int64 
 1   Question  123282 non-null  object
 2   Answer    123282 non-null  object
 3   class_id  123282 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 3.8+ MB


In [218]:
# output reformatted json to reformatted.json
with open('reformatted.json', 'w') as f:
    json.dump(reformatted_json, f, indent=4)
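
A round-trip check helps confirm that the saved file matches the schema reported as a deliverable. The sketch below is one possible check, assuming the four keys produced above; `check_output` and `REQUIRED_KEYS` are hypothetical names, and the demonstration writes to a temporary directory rather than touching the real output.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"id", "Question", "Answer", "class_id"}

def check_output(path):
    """Reload a JSON file; verify required keys and unique ids; return record count."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for r in records:
        missing = REQUIRED_KEYS - r.keys()
        assert not missing, f"record {r.get('id')} is missing {missing}"
    ids = [r["id"] for r in records]
    assert len(ids) == len(set(ids)), "duplicate ids found"
    return len(records)

# Demonstration on a temporary file with hypothetical records.
sample = [
    {"id": 0, "Question": "q0?", "Answer": "No", "class_id": 0},
    {"id": 1, "Question": "q1?", "Answer": "Yes", "class_id": 1},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reformatted.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)
    count = check_output(path)
print(count)
```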