## Take Home Test: Reformat a Public Dataset for LLM Training

### Objective

The goal of this task is to prepare public datasets for more effective use in training and fine-tuning Large Language Models (LLMs). You are required to reformat a specific subset of a public dataset into a structured, consistent format that is easier to use.

### Detailed Instructions

#### 1. Dataset Selection and Preparation

- **Dataset:** You are assigned the `Headline` subset of the [AdaptLLM/finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks) dataset.

- **Task Description:** Each entry in the `input` column contains multiple "Yes" or "No" questions alongside their respective answers. Your task is to:

  - Develop a Python script to parse and separate each question and its answer from the entry.
  - Save each question-answer pair in a structured JSON format as follows:
    ```json
    {
      "id": "<unique_identifier>",
      "Question": "<question_text>",
      "Answer": "<answer_text>"
    }
    ```

  - You are encouraged to introduce additional attributes if needed to preserve the integrity and completeness of the information. Adding relevant tag information is strongly recommended.
- **Automation Requirement:** The task must be completed using Python. Manual editing or data manipulation is strictly prohibited. Your script should efficiently handle variations in data format within the column.
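
As a minimal sketch of the target record, the helper below builds one question-answer pair in the required schema and lets extra attributes ride along. The function name `make_record`, the `head` and `tag` attribute names, and the sample headline are all hypothetical, chosen only for illustration; the task itself prescribes just `id`, `Question`, and `Answer`.

```python
import json


def make_record(entry_id, index, question, answer, **extra):
    """Build one question-answer record in the required schema.

    `extra` carries optional attributes (for example a headline or a tag).
    Those attribute names are illustrative, not prescribed by the task.
    """
    record = {
        "id": f"{entry_id}-{index}",  # source row id plus position within the row
        "Question": question.strip(),
        "Answer": answer.strip(),
    }
    record.update(extra)
    return record


example = make_record(
    0,
    1,
    "Does the headline talk about price going up? ",
    "No",
    head="Gold futures edge lower",  # made-up headline for illustration
    tag="price_direction",           # hypothetical tag name
)
print(json.dumps(example, indent=2))
```

Combining the source row id with the pair's position is one simple way to keep identifiers unique while staying traceable to the original entry.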

#### 2. Deliverables

- **Reformatted Dataset:** Provide the schema of the final format you adopted for saving the results.
- **Transformation Code:** Submit the complete code used for converting the dataset into the designated format.
- **Statistics:** Report the total number of question-answer pairs extracted from the dataset.
- **Performance Metrics:** Document the time taken to complete the dataset cleanup and transformation process.
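
The statistics and timing deliverables could be gathered along the following lines. This is a sketch under assumptions: `timed`, `count_pairs`, and `toy_transform` are hypothetical names, and the toy rows stand in for the real parsed dataset.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def count_pairs(parsed_rows):
    """Total number of question-answer pairs across all parsed rows."""
    return sum(len(pairs) for pairs in parsed_rows)


def toy_transform(rows):
    """Stand-in for the real transformation: copies each row's pairs."""
    return [[(q, a) for q, a in row] for row in rows]


rows = [[("Q1", "Yes"), ("Q2", "No")], [("Q3", "No")]]
parsed, seconds = timed(toy_transform, rows)
print(f"pairs extracted: {count_pairs(parsed)}")
print(f"elapsed: {seconds:.4f} s")
```

`time.perf_counter` measures wall-clock time with the highest available resolution, which makes it a reasonable choice for short runs.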


In [None]:
# ========================
# Load
# ========================
import pandas as pd
text_df = pd.read_json("/content/sample_data/test.json")

In [None]:
# ========================
# Helper functions
# ========================
import re

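def remove_options(text):
    """Strip the "Options: No / Yes" blocks (in either order) from the text."""
    # First pass: options listed as No, then Yes.
    pattern = re.compile(r'\nOptions:\n- No\n- Yes')
    cleaned_text = pattern.sub('', text)
    # Second pass: options listed as Yes, then No.
    pattern = re.compile(r'\nOptions:\n- Yes\n- No')
    cleaned_text = pattern.sub('', cleaned_text)
    return cleaned_text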


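def split_A(text):
    """Split text into [headline_and_question, answer] pairs.

    An answer is "yes" or "no" (any case) followed by a blank line. Trailing
    text that has no answer after it is dropped.
    """
    # Earlier approach: split the text into segments on '\n\n'.
    # segments = text.split('\n\n')
    # The capture group keeps each answer as its own element, so the result
    # alternates: [text, answer, text, answer, ..., trailing_text].
    # Note: no word boundary is enforced, so a word that merely ends in
    # "yes" or "no" before a blank line would also be treated as an answer.
    pattern = re.compile(r'(yes|no)\s*\n\n', re.IGNORECASE)
    segments = pattern.split(text)

    valid_segments = []
    buff = []
    for segment in segments:
        buff.append(segment)
        if len(buff) == 2:
            # Keep the pair only when its second element is an answer.
            if segment.lower() in {'yes', 'no'}:
                valid_segments.append(buff)
            buff = []

    # Earlier approach, kept for reference:
    # for segment in segments:
    #     # Split on whitespace to separate the body from the last word.
    #     parts = segment.rsplit(maxsplit=1)
    #     if len(parts) == 2:
    #         body, last_word = parts
    #         # Check whether the last word, lowercased, is 'yes' or 'no'.
    #         if last_word.lower() in {'yes', 'no'}:
    #             valid_segments.append(parts)

    return valid_segments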
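

def find_puncidx_before_does(text):
    """Return the index of the first newline, colon, or double quote that is
    followed (after optional whitespace) by "Does ", or None if there is none."""
    # Case-sensitive: re.IGNORECASE is deliberately left off.
    pattern = re.compile(r'([\n:"])(?=\s*Does\s)')
    match = pattern.search(text)
    return match.start() if match else None

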
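def process_text(text):
    """Parse one `input` entry into a list of (head, question, answer) tuples.

    When no "Does" question can be located, the head is the marker "[HQA]"
    and the unsplit text is kept in the question slot.
    """
    L_HQ_A = split_A(remove_options(text))
    res = []
    for HQ, A in L_HQ_A:
        idx = find_puncidx_before_does(HQ)
        if idx is None:
            res.append(("[HQA]", HQ, A))
        else:
            # The head keeps the delimiter; newlines are removed from the question.
            res.append((HQ[:idx + 1], HQ[idx + 1:].replace("\n", ""), A))
    return res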
  # idx = find_puncidx_before_does(HQ)
  # H,Q = HQ[:idx+1],HQ[idx+1:]
  # return "[HEAD]"+H,'[QUE]'+Q,'[ASW]'+A


def findHQA(vsl):
    """Return True if any tuple in vsl carries the "[HQA]" parse-failure marker."""
    return any(v[0] == '[HQA]' for v in vsl)

In [None]:
%%time
# ========================
# Process
# ========================

text_df['resList2'] = text_df['input'].map(process_text )
text_df['processERR2'] = text_df['resList2'].map( findHQA )

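# A row is flagged as failed if any of its pairs failed, so the first figure
# counts every pair in rows that contain at least one failure.
print(
    "Number of sentence pairs that failed to parse:",
    sum(len(x) for x in text_df[text_df['processERR2'] == True]['resList2'].values)
)

print(
    "Number of sentence pairs parsed successfully:",
    sum(len(x) for x in text_df[text_df['processERR2'] == False]['resList2'].values)
)

# Number of sentence pairs that failed to parse: 0
# Number of sentence pairs parsed successfully: 102735
# CPU times: user 1.55 s, sys: 0 ns, total: 1.55 s
# Wall time: 1.55 s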

In [None]:
# ========================
# Saving
# ========================

import json

jsondata = []
for id1, vls in zip(text_df['id'].values, text_df['resList2'].values):
    for i, vl in enumerate(vls):
        jsondata.append({
            "id": f'{id1}-{i}',  # source row id plus the pair's position in that row
            "head": vl[0],
            "Question": vl[1],
            "Answer": vl[2],
        })

with open('/content/sample_data/AdaptLLM-finance-tasks-Headline.json', 'w') as f:
    json.dump(jsondata, f)
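
A quick sanity check on the saved file might look like the sketch below. It assumes the output is a JSON list of records with `id`, `Question`, and `Answer` keys; the function name `validate_records` and the sample records are hypothetical, and the demo writes to a temporary file rather than the real output path.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"id", "Question", "Answer"}


def validate_records(path):
    """Load a JSON list of records; check keys, answer values, and id uniqueness.

    Returns the number of records, or raises ValueError on the first problem.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    seen_ids = set()
    for rec in records:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {rec.get('id')} is missing {sorted(missing)}")
        if rec["Answer"].strip().lower() not in {"yes", "no"}:
            raise ValueError(f"record {rec['id']} has unexpected answer {rec['Answer']!r}")
        if rec["id"] in seen_ids:
            raise ValueError(f"duplicate id {rec['id']}")
        seen_ids.add(rec["id"])
    return len(records)


# Demo on made-up records written to a temporary file.
sample = [
    {"id": "0-0", "head": "Headline:", "Question": "Does it mention gold?", "Answer": "Yes"},
    {"id": "0-1", "head": "Headline:", "Question": "Does it mention silver?", "Answer": "No"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    n_valid = validate_records(sample_path)
print(f"{n_valid} records passed validation")
```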