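## EDA of Anomalous Data

- The `preprocess.py` module applies three preprocessing steps automatically

    - **Remove the interjections ["음~","어~","아~","그~"]**

    - **Remove empty utterances**

    - **Fix incorrect outputs (the 401st and 402nd train samples)**

    - Running `python preprocess.py` in a terminal preprocesses the data and saves it to `resource/data` by default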
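The interjection step is not shown in this notebook, so here is a minimal sketch of how it might look. Only the interjection list comes from the text above; the function name `strip_interjections` and the regex-based matching are assumptions, not the actual contents of `preprocess.py`.

```python
import re

# Interjections listed above; how preprocess.py matches them is an assumption.
INTERJECTIONS = ["음~", "어~", "아~", "그~"]
_INTERJECTION_PATTERN = re.compile("|".join(re.escape(token) for token in INTERJECTIONS))


def strip_interjections(utterance: str) -> str:
    """Remove the listed interjections and collapse the leftover whitespace."""
    cleaned = _INTERJECTION_PATTERN.sub("", utterance)
    return re.sub(r"\s+", " ", cleaned).strip()


example = strip_interjections("음~ 나는 어~ 결혼식 갔어")
```

An utterance that consists only of interjections becomes an empty string, which is one reason to run empty-utterance removal after this step.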

In [7]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [2]:
def make_dataframe(path: str) -> pd.DataFrame:
    """
    Read a json file and return a pandas DataFrame.

    Parameters:
    path (str): Path to the json file.

    Returns:
    pd.DataFrame: DataFrame of the json file.
    """
    # Read the json file
    with open(path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    # Create a DataFrame
    # columns = ['id', 'conversation', 'subject_keyword', 'output']
    df = pd.DataFrame(data)
    df['conversation'] = df['input'].apply(lambda x: x['conversation'])
    df['subject_keyword'] = df['input'].apply(lambda x: x['subject_keyword'])

    # Drop the 'input' column
    df.drop('input', axis=1, inplace=True)

    # Speakers in the conversation
    df['speakers'] = df['conversation'].apply(lambda turns: list(set(turn['speaker'] for turn in turns)))

    # Reorder the columns
    df = df[['id', 'conversation', 'subject_keyword', 'speakers', 'output']]

    return df

In [3]:
train_df = make_dataframe('../resource/filtered_data/일상대화요약_train.json')
dev_df = make_dataframe('../resource/filtered_data/일상대화요약_dev.json')
test_df = make_dataframe('../resource/filtered_data/일상대화요약_test.json')

In [98]:
train_df = make_dataframe('../resource/data/일상대화요약_train.json')
dev_df = make_dataframe('../resource/data/일상대화요약_dev.json')
test_df = make_dataframe('../resource/data/일상대화요약_test.json')

<br/>

<br/>

## Check that `speakers` matches the speakers mentioned in `output`

- Check that every speaker ID has the form 'SD' followed by 7 digits

In [90]:
def check_speaker_structure(df: pd.DataFrame) -> None:
    """
    Check the structure of the speakers in the conversation.

    Parameters:
    df (pd.DataFrame): DataFrame of the json file.
    """
    # Check the structure of the speakers in the conversation
    cnt = 0
    for i, speakers in enumerate(df['speakers']):
        for speaker in speakers:
            # fullmatch rejects IDs with trailing characters, which re.match would accept
            if not re.fullmatch(r'SD\d{7}', speaker):
                print(f'Row {i}: {speaker}')
                cnt += 1

    if cnt == 0:
        print('All speakers are in the correct format.')

In [21]:
check_speaker_structure(train_df)
check_speaker_structure(dev_df)
check_speaker_structure(test_df)

All speakers are in the correct format.
All speakers are in the correct format.
All speakers are in the correct format.


- Collect every speaker ID that appears in `output`
    - and find the samples where that set differs from the speakers who actually take part in the conversation

In [91]:
def check_invalid_output(df: pd.DataFrame) -> pd.DataFrame:
    """
    Find the rows whose output mentions a different set of speakers than the conversation.

    Parameters:
    df (pd.DataFrame): DataFrame to check.

    Returns:
    pd.DataFrame: Rows with invalid output.
    """
    def is_not_valid_output(row):
        # extract speakers in the output
        speakers = re.findall(r'SD\d{7}', row['output'])

        # real speakers
        real_speakers = row['speakers']

        # check the validity
        if set(speakers) != set(real_speakers):
            print("real_speakers: ", set(real_speakers), "output_speakers: ", set(speakers))
            
        return set(speakers) != set(real_speakers)

    # find the rows with invalid output
    is_not_valid = df.apply(lambda row: is_not_valid_output(row), axis=1)

    return df[is_not_valid]

In [17]:
check_invalid_output(train_df)

real_speakers:  {'SD2100503', 'SD2110504'} output_speakers:  {'SD2100503', 'SD2110504', 'SD2100504'}
real_speakers:  {'SD2100503', 'SD2110504'} output_speakers:  {'SD2110504', 'SD2110503'}


| | id | conversation | subject_keyword | speakers | output |
|---|---|---|---|---|---|
| 400 | nikluge-2024-일상 대화 요약-train-000401 | [{'speaker': 'SD2100503', 'utterance': '언니 결혼 ... | [결혼] | [SD2100503, SD2110504] | 대화에서 SD2100503과 SD2100504는 결혼식에 대해 이야기를 나눴습니다.... |
| 401 | nikluge-2024-일상 대화 요약-train-000402 | [{'speaker': 'SD2110504', 'utterance': '너는 누구랑... | [결혼] | [SD2100503, SD2110504] | 이 대화에서 SD2110503과 SD2110504는 결혼에 대해 이야기를 나눴습니다... |


In [18]:
check_invalid_output(dev_df)

| | id | conversation | subject_keyword | speakers | output |
|---|---|---|---|---|---|


- Fix the anomalous data by hand
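The notebook does not show how the two rows were corrected. As a rough sketch, a single mistyped speaker ID could be swapped in one row's `output` like this. The helper `replace_speaker_id` is hypothetical, and the wrong/right pair below is only inferred from the mismatch printed for row 400, so treat it as an assumption rather than the actual fix.

```python
import pandas as pd


def replace_speaker_id(df: pd.DataFrame, row: int, wrong: str, right: str) -> pd.DataFrame:
    """Return a copy of df with `wrong` replaced by `right` in one row's output."""
    fixed = df.copy()
    fixed.loc[row, 'output'] = fixed.loc[row, 'output'].replace(wrong, right)
    return fixed


# Toy frame standing in for train_df; the output text is a placeholder.
toy_df = pd.DataFrame({'output': ['SD2100503과 SD2100504는 ...']}, index=[400])
toy_fixed = replace_speaker_id(toy_df, 400, 'SD2100504', 'SD2110504')
```

Working on a copy keeps the original frame available for comparison before the corrected data is saved.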

In [55]:
# Reload the data after the fix
train_df = make_dataframe('../resource/filtered_data/일상대화요약_train.json')
dev_df = make_dataframe('../resource/filtered_data/일상대화요약_dev.json')

check_invalid_output(train_df)

| | id | conversation | subject_keyword | speakers | output |
|---|---|---|---|---|---|


<br/>

<br/>

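## Check for samples with empty utterances

- Confirmed that `output` and `speaker` are filled in for every sample

### Results
- Empty utterances were found in 30 train samples, 1 dev sample, and 4 test samples

    - With the current default, `chat.append(f"화자{speaker}: {utterance}")` adds these empty utterances to the model input
    - This incurs computation that carries no information, so remove them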
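To illustrate the problem, the sketch below builds the chat lines with the same `chat.append(...)` format string and skips turns whose utterance is empty. The function name `build_chat` and the surrounding loop are assumptions; only the format string comes from the default code quoted above.

```python
from typing import Dict, List


def build_chat(turns: List[Dict[str, str]]) -> List[str]:
    """Format each turn as a chat line, skipping empty utterances."""
    chat = []
    for turn in turns:
        speaker = turn['speaker']
        utterance = turn['utterance']
        # Skip turns that would only contribute the speaker prefix
        if not utterance.strip():
            continue
        chat.append(f"화자{speaker}: {utterance}")
    return chat


sample_turns = [
    {'speaker': 'SD2100503', 'utterance': '안녕'},
    {'speaker': 'SD2110504', 'utterance': ''},
]
sample_chat = build_chat(sample_turns)
```

Filtering at load time works too, but removing the empty turns from the data files, as done below, keeps every downstream consumer consistent.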

In [80]:
train_df = make_dataframe('../resource/filtered_data/일상대화요약_train.json')
dev_df = make_dataframe('../resource/filtered_data/일상대화요약_dev.json')
test_df = make_dataframe('../resource/filtered_data/일상대화요약_test.json')

In [81]:
# Find the samples that have empty utterances

def find_empty_utterances(df: pd.DataFrame) -> pd.Series:
    """
    Find the samples that have empty utterances.

    Parameters:
    df (pd.DataFrame): DataFrame to check.

    Returns:
    pd.Series: Boolean mask, True for samples with at least one empty utterance.
    """
    # Find the samples that have empty utterances
    def has_empty_utterances(turns):
        return any(not turn['utterance'] for turn in turns)

    empty_utterances = df['conversation'].apply(lambda turns: has_empty_utterances(turns))
    print(f'Number of samples that have empty utterances: {empty_utterances.sum()}')
    
    return empty_utterances

In [82]:
empty_train = find_empty_utterances(train_df)
empty_dev = find_empty_utterances(dev_df)
empty_test = find_empty_utterances(test_df)

Number of samples that have empty utterances: 30
Number of samples that have empty utterances: 1
Number of samples that have empty utterances: 4


- Remove the empty utterances

In [95]:
# Remove the turns that have empty utterances

def remove_empty_utterances(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove the turns that have empty utterances from every conversation.

    The samples themselves are kept, and df is modified in place.

    Parameters:
    df (pd.DataFrame): DataFrame to check.

    Returns:
    pd.DataFrame: The same DataFrame with no empty utterances.
    """
    # Filter each conversation turn by turn
    def remove_empty_utterance_turn(turns):
        # Remove the turns that have empty utterances
        return [turn for turn in turns if turn['utterance']]
    
    df['conversation'] = df['conversation'].apply(lambda turns: remove_empty_utterance_turn(turns))
    print('Empty utterances removed.')
    return df

In [96]:
fine_train_df = remove_empty_utterances(train_df)
fine_dev_df = remove_empty_utterances(dev_df)
fine_test_df = remove_empty_utterances(test_df)

Empty utterances removed.
Empty utterances removed.
Empty utterances removed.


In [99]:
_ = find_empty_utterances(fine_train_df)
_ = find_empty_utterances(fine_dev_df)
_ = find_empty_utterances(fine_test_df)

Number of samples that have empty utterances: 0
Number of samples that have empty utterances: 0
Number of samples that have empty utterances: 0


In [87]:
# save the fine data to json files

def save_to_json(df: pd.DataFrame, path: str) -> None:
    """
    Save the DataFrame to a json file.

    Parameters:
    df (pd.DataFrame): DataFrame to save.
    path (str): Path to save the json file.
    """
    def make_input_column(row):
        # Rebuild the nested 'input' field from the flattened columns
        return row[['conversation', 'subject_keyword']].to_dict()

    # Work on a copy so the caller's DataFrame keeps its columns
    df = df.copy()
    df['input'] = df.apply(make_input_column, axis=1)

    # Keep only the columns of the original schema, in their original order
    df = df[['id', 'input', 'output']]

    # Save the DataFrame to a json file
    data = df.to_dict(orient='records')

    # Write UTF-8 explicitly because ensure_ascii=False emits raw Korean text
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)

In [88]:
save_to_json(fine_train_df, './sample.json')
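After saving, it may be worth confirming that the file still has the nested `id` / `input` / `output` layout that `make_dataframe` reads. The sketch below is one way to do that; the helper name `has_expected_schema` is hypothetical and the record it checks is a toy example, not real data.

```python
import json
import os
import tempfile


def has_expected_schema(path: str) -> bool:
    """Check that every record has 'id', 'output', and a nested 'input'."""
    with open(path, 'r', encoding='utf-8') as file:
        records = json.load(file)
    return all(
        {'id', 'input', 'output'} <= set(record)
        and {'conversation', 'subject_keyword'} <= set(record['input'])
        for record in records
    )


# Usage on a throwaway file with one toy record
toy_records = [{
    'id': 'toy-000001',
    'input': {'conversation': [], 'subject_keyword': ['결혼']},
    'output': 'placeholder summary',
}]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'sample.json')
    with open(toy_path, 'w', encoding='utf-8') as file:
        json.dump(toy_records, file, ensure_ascii=False, indent=4)
    schema_ok = has_expected_schema(toy_path)
```

The same check can be pointed at `./sample.json` once `save_to_json` has run.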