## Take Home Test: Reformat a Public Dataset for LLM Training

### Objective

The goal of this task is to prepare public datasets for more effective use in training and fine-tuning Large Language Models (LLMs). Your job is to reformat one subset of a public dataset into a structured, consistent format that is easier to use.

### Detailed Instructions

#### 1. Dataset Selection and Preparation

- **Dataset:** You are assigned the `Headline` subset of the [AdaptLLM/finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks) dataset.

- **Task Description:** Each entry in the `input` column contains multiple "Yes" or "No" questions alongside their respective answers. Your task is to:

  - Develop a Python script to parse and separate each question and its answer from the entry.
  - Save each question-answer pair in a structured JSON format as follows:
    ```json
    {
      "id": "<unique_identifier>",
      "Question": "<question_text>",
      "Answer": "<answer_text>"
    }
    ```

  - You are encouraged to introduce additional attributes if needed to preserve the integrity and completeness of the information. Adding relevant tag information is strongly recommended.
- **Automation Requirement:** The task must be completed using Python. Manual editing or data manipulation is strictly prohibited. Your script should efficiently handle variations in data format within the column.

#### 2. Deliverables

- **Reformatted Dataset:** Provide the schema of the final format you adopted for saving the results.
- **Transformation Code:** Submit the complete code used for converting the dataset into the designated format.
- **Statistics:** Report the total number of question-answer pairs extracted from the dataset.
- **Performance Metrics:** Document the time taken to complete the dataset cleanup and transformation process.
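
The statistics and timing deliverables can be collected in the same run. Here is a minimal sketch, assuming the parsed result is a list of per-row lists of `(head, question, answer)` tuples. The names `count_pairs` and `timed` and the toy rows are hypothetical and serve only as illustration:

```python
import time

def count_pairs(parsed_rows):
    """Total number of question-answer pairs across all rows."""
    return sum(len(pairs) for pairs in parsed_rows)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Toy data (invented): two rows holding 2 and 1 parsed pairs.
toy_rows = [
    [("head a", "Does it mention gold? ", "Yes"), ("head a", "Does it mention silver? ", "No")],
    [("head b", "Does it mention gold? ", "No")],
]

total, elapsed = timed(count_pairs, toy_rows)
print(f"pairs: {total}, seconds: {elapsed:.6f}")
```

`time.perf_counter` measures wall-clock time, which corresponds to the `Wall time` figure that `%%time` reports in a notebook.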


# Load

In [1]:
import pandas as pd
text_df = pd.read_json("/content/sample_data/test.json")
text_df.head()

Unnamed: 0,id,input,options,gold_index,class_id
0,0,"Headline: ""Gold falls to Rs 30,800; silver dow...","[No, Yes]",1,0
1,1,Headline: february gold rallies to intraday hi...,"[No, Yes]",0,7
2,2,Please answer a question about the following h...,"[No, Yes]",0,5
3,3,"Read this headline: ""gold closes lower as doll...","[No, Yes]",1,3
4,4,"gold adds $42, or 2.4%, to trade at $1,833.30/...","[No, Yes]",0,1


# Processing

In [36]:
%%time
import re
def remove_options(text):
    # Strip the "Options" block, which appears in either order (No/Yes or Yes/No)
    pattern = re.compile(r'\nOptions:\n- No\n- Yes')
    cleaned_text = pattern.sub('', text)
    pattern = re.compile(r'\nOptions:\n- Yes\n- No')
    cleaned_text = pattern.sub('', cleaned_text)
    return cleaned_text


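def split_A(text):
    # Split on an answer word ("yes"/"no") followed by a blank line. Because the
    # pattern has a capturing group, re.split keeps each answer as its own segment,
    # so the result alternates: [body, answer, body, answer, ..., trailing text].
    # Earlier version: split the text into segments on '\n\n'
    # segments = text.split('\n\n')
    pattern = re.compile(r'(yes|no)\s*\n\n', re.IGNORECASE)
    segments = pattern.split(text)

    valid_segments = []
    buff = []
    for segment in segments:
        buff.append(segment)
        if len(buff) == 2:
            # Keep the [body, answer] pair only if the second item is an answer word
            if segment.lower() in {'yes', 'no'}:
                valid_segments.append(buff)
            buff = []

    # for segment in segments:
    #     # Split the segment on whitespace, separating the body from the last word
    #     parts = segment.rsplit(maxsplit=1)

    #     if len(parts) == 2:
    #         body, last_word = parts
    #         # Check whether the last word, lowercased, is 'yes' or 'no'
    #         if last_word.lower() in {'yes', 'no'}:
    #             valid_segments.append(parts)

    return valid_segments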
def find_puncidx_before_does(text):
    pattern = re.compile(r'([\n:"])(?=\s*Does\s)')#, re.IGNORECASE
    match = pattern.search(text)
    if match:
        return match.start()
    else:
        return None
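def process_text(text):
  # Split the entry into (headline + question, answer) pairs, then split each
  # headline + question string at the punctuation that precedes "Does".
  L_HQ_A = split_A(
      remove_options(text)
  )
  res = []
  for HQ,A in L_HQ_A:
    idx = find_puncidx_before_does(HQ)
    if idx is None:
      # No split point found: flag the pair with the '[HQA]' marker
      res.append( ("[HQA]",HQ,A) )
    else:
      # The head keeps the punctuation; newlines are removed from the question
      res.append( (HQ[:idx+1],HQ[idx+1:].replace("\n",""),A) )
  return res
  # idx = find_puncidx_before_does(HQ)
  # H,Q = HQ[:idx+1],HQ[idx+1:]
  # return "[HEAD]"+H,'[QUE]'+Q,'[ASW]'+A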
text_df['resList2'] = text_df['input'].map(process_text)

def findHQA(vsl):
    # True if any triple in the row carries the '[HQA]' failure marker
    return any(v[0] == '[HQA]' for v in vsl)

text_df['processERR2'] = text_df['resList2'].map(findHQA)

print(
    "Number of sentence pairs that failed to parse",
    sum([len(x) for x in text_df[text_df['processERR2']==True]['resList2'].values])
  )

print(
    'Number of sentence pairs parsed successfully:',
    sum([len(x) for x in text_df[text_df['processERR2']==False]['resList2'].values])
)

# Number of sentence pairs that failed to parse 0

# Number of sentence pairs parsed successfully: 102735

# CPU times: user 1.55 s, sys: 0 ns, total: 1.55 s

# Wall time: 1.55 s

Number of sentence pairs that failed to parse 0
Number of sentence pairs parsed successfully: 102735
CPU times: user 1.37 s, sys: 20.4 ms, total: 1.39 s
Wall time: 1.4 s
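
The split in `split_A` relies on a property of `re.split`: when the pattern contains a capturing group, the captured text is returned as its own list element. The sketch below illustrates this on an invented three-question string (the text is not taken from the dataset). The trailing unanswered question ends up as a lone final segment, and `split_A` leaves it out of its result because it never forms a pair.

```python
import re

# Same pattern as in split_A: an answer word followed by a blank line.
pattern = re.compile(r'(yes|no)\s*\n\n', re.IGNORECASE)

# Toy input (invented): two answered questions, then one without an answer.
toy = (
    'Headline: "gold falls" Does the headline talk about price going down? Yes\n\n'
    'Headline: "gold falls" Does the headline talk about price going up? No\n\n'
    'Headline: "gold falls" Does the headline talk about silver?'
)

segments = pattern.split(toy)

# Even positions hold the headline + question text, odd positions the answers.
pairs = list(zip(segments[0::2], segments[1::2]))

for body, answer in pairs:
    print(repr(answer), '<-', repr(body))
print('unpaired tail:', repr(segments[-1]))
```

The pattern has no word boundary, so a word that merely ends in "no" or "yes" right before a blank line would also be treated as an answer. Whether that case occurs in this subset is not checked here.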


In [63]:
%%time
import json

jsondata = []
for id1, vls in zip(text_df['id'].values, text_df['resList2'].values):
    for i, vl in enumerate(vls):
        # Build a unique id from the source row id and the pair's position in that row
        idv = f'{id1}-{i}'
        jsondata.append({
            "id": idv,
            "head": vl[0],
            "Question": vl[1],
            "Answer": vl[2],
        })

with open('/content/sample_data/AdaptLLM-finance-tasks-Headline.json', 'w') as f:
    json.dump(jsondata, f)


CPU times: user 978 ms, sys: 46.8 ms, total: 1.02 s
Wall time: 1.03 s
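
Before reporting the final count, it may be worth reading the file back and checking it. Here is a minimal sketch, assuming the schema written above (`id`, `head`, `Question`, `Answer`). The helper `validate_records`, the temporary file, and the two toy records are hypothetical and stand in for the real output file:

```python
import json
import os
import tempfile

def validate_records(records):
    """Return (total, duplicate_ids, bad_answers) for a list of record dicts."""
    ids = [r["id"] for r in records]
    duplicate_ids = len(ids) - len(set(ids))
    bad_answers = sum(
        1 for r in records if r["Answer"].strip().lower() not in {"yes", "no"}
    )
    return len(records), duplicate_ids, bad_answers

# Toy records (invented) that follow the same schema as jsondata.
toy_records = [
    {"id": "0-0", "head": "toy headline\n", "Question": "Does it mention gold? ", "Answer": "Yes"},
    {"id": "0-1", "head": "toy headline\n", "Question": "Does it mention silver? ", "Answer": "No"},
]

# Round trip through a temporary file, mirroring the json.dump call above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json")
    with open(path, "w") as f:
        json.dump(toy_records, f)
    with open(path) as f:
        loaded = json.load(f)

summary = validate_records(loaded)
print(summary)
```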


# TODO: Clean head

In [22]:
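def find_first_two_quotes(text):
    # Return the positions of the first two double quotes in `text`;
    # a position is None when that quote is missing.
    pattern = re.compile(r'"')
    indices = [match.start() for match in pattern.finditer(text)]
    if len(indices) < 2:
        return (indices[0], None) if indices else (None, None)

    return indices[0], indices[1]

# `tex` is not defined in this cell; it is assigned in another cell (tex = data['head']).
quotes_indices = find_first_two_quotes(tex)
headline = None
if None not in quotes_indices:
    # The slice keeps the opening quote and stops before the closing one.
    headline = tex[quotes_indices[0] : quotes_indices[1]]
print(tex)
headline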

Headline: "Gold falls to Rs 30,800; silver down at Rs 41,200 per kg" Now answer this question:


'"Gold falls to Rs 30,800; silver down at Rs 41,200 per kg'
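
One possible way to finish this TODO is to take everything between the first and last double quote, and to fall back to stripping a leading `Headline:` label when no quotes are present. This is a sketch under those assumptions, not a rule verified against every row. `clean_head` and both patterns are hypothetical, and unquoted heads with trailing instructions are not handled.

```python
import re

# Greedy match: from the first double quote to the last one.
QUOTED = re.compile(r'"(.*)"', re.DOTALL)
# Optional leading label such as "Headline:" (assumed form, case-insensitive).
PREFIX = re.compile(r'^\s*headline\s*:\s*', re.IGNORECASE)

def clean_head(head):
    """Return the bare headline text from a `head` string."""
    match = QUOTED.search(head)
    if match:
        return match.group(1).strip()
    return PREFIX.sub('', head).strip()

examples = [
    'Headline: "Gold falls to Rs 30,800; silver down at Rs 41,200 per kg" Now answer this question:',
    '"Gold futures rise to Rs 29,889 per 10 gm" Answer this question:',
    "jewellers' body opposes government's move to increase import duty on gold\n",
    'Headline: gold prices steady',  # invented example of an unquoted, labelled head
]

for example in examples:
    print(repr(clean_head(example)))
```

Unlike the slice in the cell above, this drops both the opening and the closing quote.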

In [38]:
ress = []
for data in jsondata:
    if 'does'  in data['Question']:
        ress.append(data)

In [39]:
ress

[]

In [61]:
for data in jsondata:
    if ('"' not  in data['head'] and ':' not  in data['head'] ):
        tex = data['head']
        print(data)
        break


{'id': '8-0', 'head': "jewellers' body opposes government's move to increase import duty on gold\n", 'Question': 'Does the news headline talk about price going down? Yes or No? ', 'Answer': 'No'}


In [53]:
data

{'id': '20546-4',
 'head': '"Gold futures rise to Rs 29,889 per 10 gm" Answer this question:',
 'Question': ' Does the news headline talk about price in the past? ',
 'Answer': 'Yes'}

In [44]:
data['head']

'"Gold futures rise to Rs 29,889 per 10 gm" Answer this question:'

In [17]:

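# Note: `res` is not defined in any cell shown here; it presumably held a list of
# record dicts (like `jsondata`) when this cell was run.
res2 = []
for data in res:
    if ':' not in data['head']:
        res2.append(data)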
        

In [62]:
# res2