# How to format inputs to ChatGPT models

ChatGPT is powered by `gpt-3.5-turbo` and `gpt-4`, OpenAI's most advanced models.

You can build your own applications with `gpt-3.5-turbo` or `gpt-4` using the OpenAI API.

Chat models take a series of messages as input, and return an AI-written message as output.

This guide illustrates the chat format with a few example API calls.

## 1. Import the openai library

In [1]:
# if needed, install the OpenAI Python library
# note: the `openai.ChatCompletion` interface used below was removed in openai 1.0, so pin an earlier release

# %pip install "openai<1.0"


In [1]:
# import the OpenAI Python library for calling the OpenAI API
import openai


## 2. An example chat API call

A chat API call has two required inputs:
- `model`: the name of the model you want to use (e.g., `gpt-3.5-turbo`, `gpt-4`, `gpt-4-0314`)
- `messages`: a list of message objects, where each object has two required fields:
    - `role`: the role of the messenger (either `system`, `user`, or `assistant`)
    - `content`: the content of the message (e.g., `Write me a beautiful poem`)

Messages can also contain an optional `name` field, which gives the messenger a name (e.g., `example-user`, `Alice`, `BlackbeardBot`). Names may not contain spaces.

Typically, a conversation will start with a system message that tells the assistant how to behave, followed by alternating user and assistant messages, but you are not required to follow this format.
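
As a minimal sketch of the message format described above, a conversation can be built as a plain list of dictionaries and checked before it is sent. The helper `validate_messages` and the example contents are illustrative assumptions, not part of the OpenAI library.

```python
# Illustrative only: `validate_messages` is a hypothetical helper, not an OpenAI API.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_messages(messages):
    """Check that each message has a valid role, string content, and a space-free name."""
    for m in messages:
        if m.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"invalid role: {m.get('role')!r}")
        if not isinstance(m.get("content"), str):
            raise ValueError("content must be a string")
        if " " in m.get("name", ""):
            raise ValueError("name may not contain spaces")
    return True

# A system message followed by alternating user and assistant messages.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "name": "Alice", "content": "Write me a beautiful poem"},
    {"role": "assistant", "content": "Roses are red..."},
    {"role": "user", "name": "Alice", "content": "Make it shorter."},
]
validate_messages(messages)
```

Checking the list locally like this catches malformed messages before a request is made; the API itself remains the authority on what it accepts.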

Let's look at an example that uses the chat API to classify Persian tweets about insurance by topic, to see how the chat format works in practice.

In [2]:
path = '../data/'

In [9]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn import metrics
from tqdm import tqdm

import time

In [10]:

df_labeled = pd.read_csv(f'{path}insurance_dataset.csv')
df_labeled = df_labeled[df_labeled['div']=='test'].reset_index(drop=True)
df_labeled['label']=df_labeled['label'].apply(lambda x:x.replace('stok','stock'))
print('Test data labels:',Counter(df_labeled.label))

Test data labels: Counter({'tamin': 31, 'insurance': 29, 'health': 20, 'person': 17})




In [11]:
from gensim.models import CoherenceModel  # pandas, numpy, Counter and metrics are imported above

def q_metrics(y_true, y_pred, my_model=None):
    """Print clustering quality: purity, NMI and, for a topic model, u_mass coherence."""
    # rows are true classes, columns are predicted clusters
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # purity: each predicted cluster is credited with its most frequent true class
    purity = np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

    print('purity_score:', purity)
    print('NMI:', metrics.normalized_mutual_info_score(y_true, y_pred))

    if my_model is not None:
        # NOTE: `bow_corpus` and `dictionary` are globals that are not defined in this notebook
        cm = CoherenceModel(model=my_model, corpus=bow_corpus, dictionary=dictionary, coherence='u_mass')
        print('Coherence:', cm.get_coherence())

def print_result(resres, my_model=None):
    """Score per-document topic distributions by taking the most probable topic of each row."""
    y_pred = [np.argmax(row) for row in resres]
    # NOTE: `df` is a global that is not defined in this notebook
    y_true = df[df['div'] == 'test']['topic']
    q_metrics(y_true, y_pred, my_model)
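
To make the purity computation in `q_metrics` concrete, here is a small dependency-free sketch on toy data. The function name `purity_score` and the toy labels are assumptions made for illustration; the logic mirrors taking the column-wise maximum of the contingency matrix.

```python
from collections import Counter, defaultdict

def purity_score(y_true, y_pred):
    """Purity: each predicted cluster is credited with its most frequent true label."""
    clusters = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        clusters[p][t] += 1
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(y_true)

# Toy data (hypothetical): cluster 0 holds a,a and cluster 1 holds a,b,b,c.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = [0, 0, 1, 1, 1, 1]
purity = purity_score(y_true, y_pred)  # (2 + 2) / 6
```

Purity only compares groupings, so it never looks at whether a predicted label name equals the gold label name. This is why purity can be much higher than accuracy when the two label vocabularies differ.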
  

In [6]:
len(df_labeled['label'])

97

In [20]:
# %%capture
# !pip install googletrans==3.1.0a0

In [12]:
from googletrans import Translator
translator = Translator()
import re

In [7]:
q = '''This text is about insurance in Iran.
The text is about which of these topics?

- <Third-party>: Third-party insurance is a type of insurance. It is often mandatory for certain types of insurance, like car insurance, and is generally less expensive than comprehensive coverage. In Iran, third-party insurance is compulsory for cars, and "car insurance" without further specification means third-party insurance.
- <Health>: Health insurance is taken out to cover the cost of medical care, doctor visits, medicine, hospital stays, and so on.
- <Social-Security>: Social security covers the Social Security Organisation as well as life insurance, unemployment insurance, and retirement insurance.
- <Other>

Now, which of the topics <Third-party>, <Health>, <Social-Security>, <Other> best fits the following tweet? Answer with only the single most accurate option and nothing else. Just name one of them with no further explanation.


'''

output = 'output is only one word <stok> or <currency> or <good> or <other>'


In [None]:
# Example OpenAI Python library request: classify each tweet with one chat completion

openai.api_key = 'YOUR_API_KEY'
MODEL = "gpt-3.5-turbo"

responses = []
for tweet in tqdm(df_labeled['text']):
    # flatten newlines and strip URLs before building the prompt
    tweet = re.sub(r"http\S+", '', tweet.replace('\n', ' '))
    # optional: translate the tweet to English first
    # tweet = translator.translate(tweet, dest='en').text
    prompt_text = q + "\nText:\n" + tweet
    response = openai.ChatCompletion.create(
        model=MODEL,
        messages=[
            # NOTE: the labels in this system message (<stok>, <currency>, <good>, <other>) differ from the topics listed in `q`
            {"role": "system", "content": "You are a classifier that tags Persian texts. Your output is only one word: <stok> or <currency> or <good> or <other>"},
            {"role": "user", "content": prompt_text},
        ],
        temperature=0,  # lowest sampling randomness
    )

    responses.append(response.choices[0].message.content)
    time.sleep(25)  # fixed 25-second pause between requests
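
The loop above pauses for a fixed 25 seconds after every request. A different technique, shown here only as a sketch, is to retry a failed call with exponential backoff. `call_with_retry` and `flaky_request` are hypothetical names introduced for this example; `flaky_request` stands in for the API call so the block runs without network access.

```python
import time

def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for the API call (assumption): fails twice, then succeeds.
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "<Health>"

delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_request, sleep=delays.append)
```

In a real run, `fn` would wrap the chat completion request and `sleep` would be left at its default. Catching every `Exception` is a simplification; narrowing it to the library's rate-limit error would be more precise.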


In [30]:
norm_label = lambda x:x.lower().replace(' ','').replace('<','').replace('>','')
responses = list(map(norm_label,responses))
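
`norm_label` only removes spaces and angle brackets, so a reply with extra words or punctuation passes through unchanged. As a hedged alternative, the sketch below searches each reply for the first known topic name. `extract_label`, the fallback to `other`, and the sample replies are assumptions made for illustration; the label list comes from the prompt `q`.

```python
# Topic names taken from the prompt `q`, lowercased.
LABELS = ["third-party", "health", "social-security", "other"]

def extract_label(raw, labels=LABELS, default="other"):
    """Return the known label that appears earliest in the reply, or `default` if none does."""
    text = raw.lower()
    hits = [(text.find(label), label) for label in labels if label in text]
    return min(hits)[1] if hits else default

# Hypothetical replies for illustration.
samples = ["<Third-party>", "The topic is <Health>.", "I am not sure."]
cleaned = [extract_label(s) for s in samples]
```

Plain substring matching is a simplification: a reply containing a word such as "another" would match `other`, so stricter matching may be needed on real output.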

In [35]:
from sklearn.metrics import classification_report

# pair each prediction with its gold label and source text
df2 = pd.DataFrame({'predict_ChatGPT': responses})
df2['label'] = df_labeled['label'].to_list()
df2['text'] = df_labeled['text'].to_list()

q_metrics(df2['label'], df2['predict_ChatGPT'])
print(classification_report(df2['label'], df2['predict_ChatGPT']))

purity_score: 0.6597938144329897
NMI: 0.44193190397109255
                 precision    recall  f1-score   support

         health       0.60      0.60      0.60        20
      insurance       0.00      0.00      0.00        29
          other       0.00      0.00      0.00         0
         person       0.00      0.00      0.00        17
social-security       0.00      0.00      0.00         0
          tamin       0.00      0.00      0.00        31
    third-party       0.00      0.00      0.00         0

       accuracy                           0.12        97
      macro avg       0.09      0.09      0.09        97
   weighted avg       0.12      0.12      0.12        97
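
The rows with support 0 in the report above are labels that occur only in the predictions, which suggests the predicted and gold label vocabularies differ. A quick way to check this, sketched here with the label names copied from the outputs above, is to compare the two sets directly. The variable names are illustrative.

```python
# Gold labels from the test-set Counter printed earlier.
gold_labels = {"tamin", "insurance", "health", "person"}
# Labels the model produced: the zero-support rows of the report, plus 'health'.
predicted_labels = {"health", "other", "social-security", "third-party"}

only_predicted = sorted(predicted_labels - gold_labels)
only_gold = sorted(gold_labels - predicted_labels)
shared = sorted(gold_labels & predicted_labels)

print("only in predictions:", only_predicted)
print("only in gold labels:", only_gold)
print("shared:", shared)
```

Only `health` appears on both sides, so exact-match metrics such as accuracy can only credit that class until the two vocabularies are mapped onto each other.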





In [105]:
df2

Unnamed: 0,predict_ChatGPT,label,text
0,other,other,رامین پرچمی: بخاطر ۱۳۰ میلیون تومان بدهی ۳ سال...
1,other,other,استعفا در سکوت/ کناره‌گیری #محمدرضا_عارف از ری...
2,good,stock,بورس کالا دلیل افزایش ۲۰ درصدی قیمت آهن؟\nبه گ...
3,currency,currency,جالب است که دلار در مسیر کانال 30 هزار تومانی ...
4,currency,currency,ادعای فارین پالیسی درباره استفاده ایران از ارز...
...,...,...,...
95,good,good,فرمانده دریابانی #بوشهر از کشف بیش از 15 هزار ...
96,good,good,عضو کمیسیون اقتصادی: مجلس با مسببین گرانی خودر...
97,other,other,‼️اظهارات تحقیر آمیز ترامپ خطاب به کشورهای عرب...
98,other,other,نخست وزیر اسراییل همزمان با مذاکرات وین: \n\nط...


In [106]:
df2.to_csv(f'{path}predict_ChatGPT-economics-gtrans.csv')  # `path` already ends with '/'

In [50]:
import ast

res2 = []
for i, r in enumerate(responses):
    try:
        # first attempt: parse the {...} part of the reply as a dict literal and read its 'Topic' key
        topic = ast.literal_eval("{" + re.search('{(.*)}', r).group(1) + "}")['Topic']
    except Exception:
        try:
            # fallback: take the first word after 'Topic: '
            topic = r.replace('Topic": ', 'Topic: ').split('Topic: ')[-1].split()[0]
        except Exception:
            # could not parse this reply: report its index and skip it
            print(i)
            continue
    topic = str(topic).lower().replace('"', '').replace('}', '')
    # fold 'politics' into 'social'
    res2.append('social' if topic == 'politics' else topic)


In [51]:
df2 = pd.DataFrame({'predict_ChatGPT': res2})
df2 = pd.concat([df2, df_labeled[['label']]], axis=1)
df2



Unnamed: 0,predict_ChatGPT,label
0,social,social
1,sport,sport
2,poem,poem
3,health,social
4,social,social
...,...,...
156,politics,social
157,health,health
158,sport,health
159,health,health


In [61]:
Counter(df2['predict_ChatGPT'])

Counter({'social': 28,
         'sport': 31,
         'poem': 23,
         'health': 44,
         'economics': 22,
         'art': 13})

In [62]:
df2.to_csv(f'{path}predict_ChatGPT-people-fa.csv')  # `path` already ends with '/'

In [64]:
q_metrics(df2['label'],df2['predict_ChatGPT'])

purity_score: 0.9254658385093167
NMI: 0.8417225303570115


In [3]:
df_labeled = pd.read_csv(f'{path}art_dataset-stm_pred.csv')



In [16]:


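# name of the highest-scoring prediction column ('pred.1', 'pred.2' or 'pred.3') for each row
df_labeled['stm_pred'] = df_labeled[['pred.1', 'pred.2', 'pred.3']].idxmax(axis=1)
# convert the column name to a zero-based topic index, e.g. 'pred.2' -> 1
df_labeled['stm_pred'] = df_labeled['stm_pred'].apply(lambda x: int(x[-1]) - 1)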

In [17]:
q_metrics(df_labeled[df_labeled['div']=='test']['label'],df_labeled[df_labeled['div']=='test']['stm_pred'])

purity_score: 0.8461538461538461
NMI: 0.6265048896128346


In [27]:
from sklearn.metrics import classification_report

# map zero-based topic indices to label names
topic_name = {1: 'film', 0: 'art', 2: 'poet'}
test_rows = df_labeled[df_labeled['div'] == 'test']
y_pred = [topic_name.get(i) for i in test_rows['stm_pred']]
print(classification_report(test_rows['label'].to_list(), y_pred))


              precision    recall  f1-score   support

         art       0.22      0.14      0.17        14
        film       0.78      0.83      0.80        42
        poet       0.95      1.00      0.97        35

    accuracy                           0.79        91
   macro avg       0.65      0.66      0.65        91
weighted avg       0.76      0.79      0.77        91
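
The post-processing used for the `stm_pred` column above (highest-scoring column, then zero-based index, then label name) can be sketched without pandas. The score rows below are made-up values for illustration and `row_to_topic` is a hypothetical helper; only the `topic_name` mapping and the `pred.N` column names come from the notebook.

```python
topic_name = {1: "film", 0: "art", 2: "poet"}  # mapping used in the notebook

# Hypothetical per-document scores, one dict per row.
rows = [
    {"pred.1": 0.2, "pred.2": 0.7, "pred.3": 0.1},
    {"pred.1": 0.6, "pred.2": 0.3, "pred.3": 0.1},
    {"pred.1": 0.1, "pred.2": 0.2, "pred.3": 0.7},
]

def row_to_topic(row):
    best_col = max(row, key=row.get)   # plays the role of idxmax(axis=1)
    index = int(best_col[-1]) - 1      # 'pred.2' -> 1
    return topic_name[index]

topics = [row_to_topic(r) for r in rows]
```

Reading the index from the last character of the column name works only while there are fewer than ten prediction columns.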

