# GPT API Chinese Semantic Checker

This notebook checks the semantics of Chinese language units. It is based on the GPT-3.5 and GPT-4 APIs.

Authors: Zexin Xu, Zilu Zhang

## GPT API Query

This snippet involves the OpenAI API. The `API_KEY` value has been removed for security reasons; please use your own API key. For more details on using GPT models, refer to the [GPT API](https://platform.openai.com/docs/introduction) documentation.
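One way to avoid pasting a key into the notebook is to read it from an environment variable. The sketch below is a minimal illustration, not part of the original notebook: the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed here only as a conventional variable name.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: the name and default variable are assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage inside the notebook (assumes the variable is already set):
# API_KEY = load_api_key()
```

Failing early keeps an empty key from reaching the API call, where the resulting error is harder to read.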

In [None]:
import pandas as pd
import openai
from datetime import datetime

API_KEY = ""

In [14]:
openai.api_key = API_KEY
#NOTE If you don't have access to GPT-4, feel free to use GPT-3.5-turbo or another suitable model
model_id = "gpt-4"

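def ChatGPT_conversation(conversation):
    """Send the conversation to the model and append its reply to the conversation."""
    #NOTE openai.ChatCompletion is the pre-1.0 interface of the openai package
    response = openai.ChatCompletion.create(
        model=model_id,
        messages=conversation
    )
    # Store the reply in the same {'role', 'content'} format as the user messages
    conversation.append({'role': response.choices[0].message.role, 'content': response.choices[0].message.content})
    return conversation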

In [None]:
#NOTE The questions are asked in Chinese: “<sentence>”，这句话是否有语病？ (roughly: does this sentence contain a language error?)
question_type = ['语病']
question_suffix = '，这句话是否有'

#NOTE Put your own dataframe here, and change column names accordingly
chatgpt_tunit_df = pd.DataFrame()
data_df = chatgpt_tunit_df

In [None]:
%%time

"""
    This setup works around GPT's hourly rate limit by generating multiple result files and combining them later.
    If you receive a bad gateway error or a rate limit error, the loop stops and prints the index
    of the last row. You can then restart from that index and run the code again. Remember to change the
    output file path as well. The second try-except block prevents accidentally overwriting an existing file.
"""

start_index = 0
stop_index = 0
conversation = []
try:
    for i, row in data_df.loc[start_index:, :].iterrows():
        for q_type in question_type:
            question = '“' + row['sentence'] + "”" + question_suffix + q_type + "？"
            # Appending keeps the earlier turns in the conversation so GPT sees them as context
            conversation.append({'role': 'user', 'content': question})
            conversation = ChatGPT_conversation(conversation)
            data_df.loc[i, q_type] = conversation[-1]['content'].strip()
            if row['sen'] == 1:
                conversation = []
        stop_index = i
        if i % 20 == 0:
            print(f"{i}th iteration done...")
except Exception as e:
    print(type(e).__name__, "/", str(e))
    print("stop at index: ", stop_index - 1)
    now = datetime.now()
    print("current time = ", now.strftime("%H:%M:%S"))

#NOTE Prevent overwriting existing file
try:
    data_df.to_csv('chatgpt/GPT4/yubing/tunit_df_result_context.csv', mode='x', index=False, encoding='utf-8-sig')
except FileExistsError:
    print('File already exists! Change it to another name.')
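The stop-and-resume pattern described in the cell above can be sketched in isolation. The example below is a simplified illustration under stated assumptions, not the notebook's code: `process_from` and `fake_query` are hypothetical names, a plain list stands in for the dataframe, and a fake handler stands in for the API call.

```python
def process_from(items, start_index, handle):
    """Apply handle to items[start_index:] until an error occurs.

    Returns (results, next_index), where next_index is the first index
    that was NOT processed and can be reused as start_index on the next run.
    """
    results = {}
    next_index = start_index
    for i in range(start_index, len(items)):
        try:
            results[i] = handle(items[i])
        except Exception as exc:
            # Report the error and stop, keeping everything finished so far
            print(type(exc).__name__, "/", exc)
            break
        next_index = i + 1
    return results, next_index


def fake_query(sentence):
    """Stand-in for an API call that fails on one particular input."""
    if sentence == "c":
        raise RuntimeError("rate limit")
    return sentence.upper()


results, next_index = process_from(["a", "b", "c", "d"], 0, fake_query)
# results now holds indices 0 and 1, and next_index is 2
```

Returning the first unprocessed index, rather than the last processed one, means the value can be passed straight back in as `start_index` without any off-by-one adjustment.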
    

## Result processing

This snippet processes the results returned by the GPT API. We use keyword filtering to label each result: a response containing any phrase that indicates no error is labeled 1, and every other response is labeled 0.
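As a minimal sketch of keyword filtering on its own, the example below labels a response by substring matching. The function name `label_response` and the short keyword list are illustrative assumptions; the list used in the notebook is longer.

```python
def label_response(text, keywords):
    """Return 1 if any keyword occurs in text as a substring, else 0."""
    return int(any(keyword in text for keyword in keywords))


# Illustrative subset of phrases meaning "no error" (assumed for this sketch)
NO_ERROR_KEYWORDS = ["没有语病", "语法正确", "没有语法错误"]

clean = label_response("这句话没有语病。", NO_ERROR_KEYWORDS)
faulty = label_response("这句话有语病，缺少主语。", NO_ERROR_KEYWORDS)
# Substring matching ignores negation, so this response is also labeled 1
negated = label_response("这句话并不是没有语病。", NO_ERROR_KEYWORDS)
```

Because the match is a plain substring test, a negated phrase such as the last example is still labeled 1, so it may be worth spot-checking a sample of the labels by hand.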

In [8]:
import pandas as pd 

result_tunit_df = pd.read_csv('chatgpt/GPT4/yubing/tunit_df_result_context.csv', encoding='utf-8-sig')
result_sen_df = pd.read_csv('chatgpt/GPT4/yubing/sen_df_result.csv', encoding='utf-8-sig')

#NOTE Phrases in a GPT response that indicate the sentence was judged to have no error
NO_ERROR_KEYWORDS = (
    '没有语病',
    '没有语病错误',
    '不算有语病',
    '没有问题',
    '语病正确',
    '语病上是正确的',
    '语病上没有错误',
    '语病没有错误',
    '没有语病问题',
    '没有明显错误',
    '不算是语病',
    '语法上可以说是正确的',
    '语法上没有明显错误',
    '没有明显的',
    '没有显著的',
    '没有错误',
    '完全正确',
    '基本正确',
    '基本上正确',
    '是正确的语法',
    '标准的英语表达',
    '语法是正确的',
    '语法正确',
    '没有语法错误',
    '没有。',
)

def modify_yufa_result(df):
    """Set yubing_label to 1 if the '语病' response contains a no-error phrase, else 0."""
    for i, row in df.iterrows():
        matched = any(keyword in row['语病'] for keyword in NO_ERROR_KEYWORDS)
        df.loc[i, 'yubing_label'] = 1 if matched else 0

modify_yufa_result(result_tunit_df)
modify_yufa_result(result_sen_df)
result_tunit_df.to_csv('chatgpt/GPT4/tunit_df_result_mod.csv', index=False, encoding='utf-8-sig')
result_sen_df.to_csv('chatgpt/GPT4/sen_df_result_mod.csv', index=False, encoding='utf-8-sig')