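This code builds an additional training dataset from the data of users who have used the meeting-minutes generation service.

Questions that users add beyond those in the default minutes template, together with their answers, can be regarded as a dataset customized to our model. We therefore add them as a train dataset for the existing MRC model and fine-tune it further to improve accuracy.

During development, we built the dataset by manually entering answers to a variety of questions about arbitrarily collected colloquial text (assumed to be meeting content).

The dataset follows the same format as the SQuAD 2.0 data used for the MRC model.
The DataFrame below consists of the following columns, in order: a portion of the meeting script text (context), the question (question), the answer (text), the meeting name (title), the character offset in context at which text begins (answer_start), and id.
From this DataFrame we generate a JSON file, and model training then proceeds as in Finetuning_with_Custom_Data.ipynb.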
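
To make the target layout concrete, here is a minimal sketch of a single record nested the way SQuAD 2.0 files are (version, data, paragraphs, qas, answers). The context, question, title, and id values are made up for illustration and are not taken from the actual dataset.

```python
import json

# Hypothetical sample values, used only to illustrate the nesting.
context = "The meeting started at nine. Peter Sullivan wrote the report."
answer_text = "Peter Sullivan"

record = {
    "version": "v2.0",
    "data": [
        {
            "title": "meeting_example",  # meeting name
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "0",
                            "question": "who wrote the report?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": answer_text,
                                    # character offset of the answer within context
                                    "answer_start": context.find(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(record, indent=2))
```

Slicing `context[answer_start:answer_start + len(text)]` should give back the answer text, which is a convenient invariant to check on real data.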



In [None]:
!pip install transformers==3.3.0
import pandas as pd
import json


In [3]:
!git clone https://github.com/Soyeon-ErinLee/Dobby-AI

Cloning into 'Dobby-AI'...
remote: Enumerating objects: 130, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (87/87), done.
remote: Total 703 (delta 72), reused 86 (delta 38), pack-reused 573
Receiving objects: 100% (703/703), 27.16 MiB | 24.53 MiB/s, done.
Resolving deltas: 100% (308/308), done.


In [4]:
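def find_word_idx(context, answer):
    """Return the character offset of the first occurrence of answer in context.

    Returns 0 when the answer is missing, empty, or not found.
    """
    if pd.isnull(answer) or answer == '':
        return 0
    # Literal search: answers may contain regex metacharacters such as '.' or '?'.
    idx = context.find(answer)
    return idx if idx != -1 else 0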

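def create_paragraphs(context, question, text, answer_start, id):
    """Build one SQuAD 2.0 style paragraph entry holding a single question."""
    # An empty or missing answer marks the question as unanswerable.
    is_impossible = bool(pd.isnull(text) or text == '')
    # For answerable questions, the same answer is repeated three times in the list.
    answers = [] if is_impossible else [{'answer_start': answer_start, 'text': text}] * 3
    qas = {
        'answers': answers,
        'id': id,
        'is_impossible': is_impossible,
        'question': question,
    }
    return {'context': context, 'qas': [qas]}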

def create_data(df):
    """Group rows by meeting title and assemble the SQuAD 2.0 style dataset dict."""
    data = []
    for title in df.title.unique():
        df_temp = df[df.title == title]
        # One paragraph entry per row (each row holds one question).
        paragraphs = df_temp.apply(
            lambda x: create_paragraphs(x['context'], x['question'], x['text'], x['answer_start'], x['id']),
            axis=1,
        ).tolist()
        data.append({'title': title, 'paragraphs': paragraphs})  # title is the meeting name
    return {'version': 'v2.0', 'data': data}
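
When computing answer_start, literal search and regex search can disagree, because characters such as `.` act as wildcards in a regular expression. The following sketch shows the difference on made-up strings; the context and answer here are hypothetical and not taken from the dataset.

```python
import re

# Hypothetical strings chosen so that '.' in the answer matters.
context = "Cost was 1x5 units. Final cost: 1.5 units."
answer = "1.5"

# Unescaped regex: '.' matches any character, so "1x5" is matched first.
regex_pos = re.search(answer, context).start()

# Literal search finds the exact text.
literal_pos = context.find(answer)

# Escaping the answer makes the regex behave like a literal search.
escaped_pos = re.search(re.escape(answer), context).start()

print(regex_pos, literal_pos, escaped_pos)
```

An answer containing an unbalanced `(` would fail to compile as a pattern at all, so literal search (or `re.escape`) is the safer choice for free-form answer text.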



In [5]:
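# Keep only the first four columns of the CSV.
df = pd.read_csv('/content/Dobby-AI/Data/dataset_통합.csv').iloc[:, :4]
# Normalize newlines and fill missing answers before computing offsets,
# so that answer_start refers to the final context and text strings.
df = df.replace('\n', ' ', regex=True)
df['text'] = df['text'].fillna('')
df['answer_start'] = df.apply(lambda x: find_word_idx(x['context'], x['text']), axis=1)
df['id'] = df.index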


In [6]:
df.head(10)

,context,question,text,title,answer_start,id
0,"Honey, I’m home. How was your day? Alright. He...",why does he want to take the tent back?,too heavy to carry around,meeting_1,470,0
1,I must apologize for dragging you all here at ...,who wrote the report?,Peter Sullivan.,meeting_2,701,1
2,I must apologize for dragging you all here at ...,what department Mr. Sullivan works in?,in the Risk Assessment and Management Office a...,meeting_2,1050,2
3,I must apologize for dragging you all here at ...,when did the meeting start?,,meeting_2,0,3
4,I must apologize for dragging you all here at ...,what is the main issue?,,meeting_2,0,4
5,"As you probably know, over the last 36 to 41 m...",what is the main issue?,pushing the risk profile without raising any r...,meeting_2,873,5
6,"As you probably know, over the last 36 to 41 m...",when did the problem start?,the last two weeks.,meeting_2,1383,6
7,"So, you're saying this has already happened? S...",what does the model predict about the company?,if those assets decrease by just 25% and remai...,meeting_2,229,7
8,"Nothing more. And standing here tonight, I'm a...",What is the conclusion that the mayor made?,Sell it all. Today.,meeting_2,613,8
9,"Nothing more. And standing here tonight, I'm a...",why should we sell all the trades before noon?,,meeting_2,0,9
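
Since answer_start is obtained by string search, it can be worth confirming that each offset really points at the answer text before exporting. The sketch below runs such a check on a small made-up DataFrame standing in for the real CSV; the helper name `check_offsets` is an assumption introduced for illustration.

```python
import pandas as pd

def check_offsets(df):
    """Return rows whose non-empty answer is not found at answer_start in context."""
    def is_bad(row):
        text = row["text"]
        if not isinstance(text, str) or text == "":
            return False  # unanswered rows carry no span to verify
        start = row["answer_start"]
        return row["context"][start:start + len(text)] != text
    return df[df.apply(is_bad, axis=1)]

# Hypothetical toy data: one correct span, one empty answer, one answer
# that does not occur in the context (so its offset defaulted to 0).
toy = pd.DataFrame({
    "context": ["Sell it all. Today."] * 3,
    "text": ["Today.", "", "tomorrow"],
    "answer_start": [13, 0, 0],
})

bad = check_offsets(toy)
print(bad)
```

Rows returned by the check are candidates for manual correction, for example answers typed with different spacing or punctuation than the context.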


In [None]:
data=create_data(df)
with open("/content/drive/MyDrive/kpmg/data/additional_qa.json", "w") as outfile:  
    json.dump(data, outfile) 
data['data'][1]['paragraphs']

[{'context': "I must apologize for dragging you all here at such an uncommon hour. But from what I've been told, this matter needs to be dealt with urgently. So urgently, in fact, it probably should have been addressed weeks ago. But that is spilt milk under the bridge. So, why doesn't somebody tell me what they think is going on here? Mr. Tuld, as I mentioned earlier, if you compare the figure at the top of page 13... Jared, it's a little early for all that. Just speak to me in plain English. Okay. In fact, I'd like to speak to the guy who put this together. Mr. Sullivan, is it? Does he speak English? I'd like to speak with the analyst who seems to have stumbled across this mess. Certainly. That would be Peter Sullivan. Right here. Oh, Mr. Sullivan, you're here. Good morning. Maybe you could tell me what you think is going on here. And please, speak as you might to a young child or a golden retriever. It wasn't brains that got me here. I can assure you of that. Well, um... Sir, as you