ASL, v.060323

# Introduction

In this notebook, we use ChatGPT via the OpenAI API to classify the MedWeb tweets studied in the previous notebooks (`DAT255-NLP-2.0-MedTweets-fastai-ULMFiT.ipynb` and `DAT255-NLP-3.0-MedTweets-transformers.ipynb`).

# Setup

In [1]:
%matplotlib inline
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from pathlib import Path

In [2]:
import openai

Load the API key (get yours at https://platform.openai.com/). The secret key is stored in an environment file (note: make sure to add that file to `.gitignore`!). We load it with `python-dotenv`: https://pypi.org/project/python-dotenv/

In [3]:
NB_DIR = Path.cwd()

In [4]:
dotenv_file = NB_DIR/'.env'

In [5]:
import dotenv
dotenv.load_dotenv(dotenv_file)

True
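
Before sending any requests, it can be useful to confirm that the key was actually picked up, without printing it in full. The following is a minimal sketch, assuming the `.env` file defines `OPENAI_API_KEY` (the variable name read later in this notebook). `require_env` and `mask` are hypothetical helpers written for illustration; they are not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to display."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

Calling `mask(require_env("OPENAI_API_KEY"))` after `dotenv.load_dotenv(...)` would then show only the last four characters of the key, or raise if the key is missing.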

# Load and prepare the data

In [6]:
df = pd.read_csv('https://github.com/HVL-ML/DAT255/raw/main/3-NLP/data/medwebdata.csv')
df.head()

Unnamed: 0,ID,Tweet,Influenza,Diarrhea,Hayfever,Cough,Headache,Fever,Runnynose,Cold,labels,is_test
0,1en,The cold makes my whole body weak.,0,0,0,0,0,0,0,1,Cold,False
1,2en,It's been a while since I've had allergy sympt...,0,0,1,0,0,0,1,0,Hayfever;Runnynose,False
2,3en,I'm so feverish and out of it because of my al...,0,0,1,0,0,1,1,0,Hayfever;Fever;Runnynose,False
3,4en,"I took some medicine for my runny nose, but it...",0,0,0,0,0,0,1,0,Runnynose,False
4,5en,I had a bad case of diarrhea when I traveled t...,0,0,0,0,0,0,0,0,sober,False


In [7]:
df.drop(['labels'], axis=1, inplace=True)

In [8]:
df['labels'] = df.apply(lambda x: [x[c] for c in df.columns[2:-1]], axis=1)

In [9]:
df.head()

Unnamed: 0,ID,Tweet,Influenza,Diarrhea,Hayfever,Cough,Headache,Fever,Runnynose,Cold,is_test,labels
0,1en,The cold makes my whole body weak.,0,0,0,0,0,0,0,1,False,"[0, 0, 0, 0, 0, 0, 0, 1]"
1,2en,It's been a while since I've had allergy sympt...,0,0,1,0,0,0,1,0,False,"[0, 0, 1, 0, 0, 0, 1, 0]"
2,3en,I'm so feverish and out of it because of my al...,0,0,1,0,0,1,1,0,False,"[0, 0, 1, 0, 0, 1, 1, 0]"
3,4en,"I took some medicine for my runny nose, but it...",0,0,0,0,0,0,1,0,False,"[0, 0, 0, 0, 0, 0, 1, 0]"
4,5en,I had a bad case of diarrhea when I traveled t...,0,0,0,0,0,0,0,0,False,"[0, 0, 0, 0, 0, 0, 0, 0]"
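
The new `labels` column stores one 0/1 flag per symptom, in the same order as the symptom columns. As a small illustration of that encoding, the sketch below maps such a vector back to label names. `decode_labels` is a hypothetical helper, and the label order is copied from the table header above.

```python
LABEL_NAMES = ['Influenza', 'Diarrhea', 'Hayfever', 'Cough',
               'Headache', 'Fever', 'Runnynose', 'Cold']


def decode_labels(encoded, names=LABEL_NAMES):
    """Map a multi-hot vector back to the names of the active labels."""
    if len(encoded) != len(names):
        raise ValueError(f"expected {len(names)} flags, got {len(encoded)}")
    return [name for name, flag in zip(names, encoded) if flag]


# The third row of the table above:
print(decode_labels([0, 0, 1, 0, 0, 1, 1, 0]))
# -> ['Hayfever', 'Fever', 'Runnynose']
```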


# Set up the ChatGPT prompt

In [10]:
prompt = """
When I write a short text, I want you to return whether or not the text deals with 
one or more of the terms, and if so, which ones. 
['Influenza', 'Diarrhea', 'Hayfever', 'Cough', 'Headache', 'Fever', 'Runnynose', 'Cold']. 
Respond only with the terms that the text I provide deals with as a comma-separated inside brackets. 

For example, when I write, "I'm feeling really bad. My head hurts. My nose is runny. 
I've felt like this for days." you should respond with [Headache, Runnynose].

Here is my text: 
"""

# Create API request

## Test

In [11]:
tweet = "I'm so feverish and out of it because of my allergies. I'm so sleepy."

In [12]:
messages=[
        {"role": "user", "content": prompt},
        {"role": "user", "content": tweet}]

In [13]:
messages

[{'role': 'user',
  'content': '\nWhen I write a short text, I want you to return whether or not the text deals with \none or more of the terms, and if so, which ones. \n[\'Influenza\', \'Diarrhea\', \'Hayfever\', \'Cough\', \'Headache\', \'Fever\', \'Runnynose\', \'Cold\']. \nRespond only with the terms that the text I provide deals with as a comma-separated inside brackets. \n\nFor example, when I write, "I\'m feeling really bad. My head hurts. My nose is runny. \nI\'ve felt like this for days." you should respond with [Headache, Runnynose].\n\nHere is my text: \n'},
 {'role': 'user',
  'content': "I'm so feverish and out of it because of my allergies. I'm so sleepy."}]
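
The notebook sends both the instructions and the tweet as `user` messages. The chat format also accepts `system` and `assistant` roles, so one possible variation (not what is used here, and not evaluated) is to put the instructions in a `system` message and supply worked examples as earlier `user`/`assistant` turns. A sketch, where `build_messages` is a hypothetical helper and the instruction text is a shortened placeholder:

```python
def build_messages(instructions, tweet, examples=()):
    """Assemble a chat message list: instructions, optional worked examples, then the tweet."""
    messages = [{"role": "system", "content": instructions.strip()}]
    for example_text, example_answer in examples:
        # Each worked example is a user turn followed by the desired reply.
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": tweet})
    return messages


demo = build_messages(
    "Respond only with the matching terms as a comma-separated list inside brackets.",
    "I'm so feverish and out of it because of my allergies. I'm so sleepy.",
    examples=[("My head hurts. My nose is runny.", "[Headache, Runnynose]")],
)
for message in demo:
    print(message["role"], "->", message["content"])
```

Whether this changes the quality of the predictions would have to be checked empirically.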

In [14]:
import os
openai.api_key = os.getenv("OPENAI_API_KEY")

model = "gpt-3.5-turbo"

In [15]:
response = openai.ChatCompletion.create(
  model=model,
  messages=messages,
  temperature=0.9,
  max_tokens=150,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.6,
  stop=[" Human:", " AI:"]
)

In [16]:
response

<OpenAIObject chat.completion id=chatcmpl-6r1TJhhsqbkmcUzIycLLlAo2VraBn at 0x7fb8c5cd6d40> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "[Fever, Hayfever]",
        "role": "assistant"
      }
    }
  ],
  "created": 1678095061,
  "id": "chatcmpl-6r1TJhhsqbkmcUzIycLLlAo2VraBn",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 177,
    "total_tokens": 186
  }
}
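
The fields of interest in this response are the reply text, the finish reason, and the token usage. The sketch below pulls them out of a plain dictionary that mirrors the JSON shown above; the dictionary stands in for the response object (an assumption made so the example runs without an API call), and `summarize` is a hypothetical helper.

```python
# Plain-dict stand-in for the response printed above.
response_like = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": "[Fever, Hayfever]", "role": "assistant"},
        }
    ],
    "usage": {"completion_tokens": 9, "prompt_tokens": 177, "total_tokens": 186},
}


def summarize(response):
    """Collect the reply text, whether generation ended normally, and the token count."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finished": choice["finish_reason"] == "stop",
        "total_tokens": response["usage"]["total_tokens"],
    }


print(summarize(response_like))
```

Here the prompt accounts for 177 of the 186 tokens, so the instructions dominate the size of each request.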

## Function to get predictions

In [17]:
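def get_prediction(tweet):
    """Ask the model which labels apply to `tweet`; return them as a list of strings."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "user", "content": tweet}]

    # Same request settings as in the test above.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0.9,
        max_tokens=150,
        top_p=1,
        frequency_penalty=0.0,
        presence_penalty=0.6,
        stop=[" Human:", " AI:"]
    )

    # The reply text looks like "[Fever, Hayfever]".
    prediction = response.get("choices")[0].get("message").get("content")

    # Drop the brackets and split on commas. Stripping each item tolerates
    # missing spaces or stray quotes, and an empty reply gives an empty list.
    items = prediction.strip().strip('][').split(',')
    return [item.strip(" '\"") for item in items if item.strip()]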
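
Splitting the reply on commas relies on the model following the requested format exactly. A stricter alternative is to search the reply for the known label names and ignore everything else. The sketch below does this case-insensitively; `parse_labels` is a hypothetical helper, not part of the notebook's pipeline.

```python
import re

KNOWN_LABELS = ['Influenza', 'Diarrhea', 'Hayfever', 'Cough',
                'Headache', 'Fever', 'Runnynose', 'Cold']


def parse_labels(reply, known=KNOWN_LABELS):
    """Return the known labels that appear as whole words in `reply`, in canonical order."""
    words = {word.lower() for word in re.findall(r"[A-Za-z]+", reply)}
    return [label for label in known if label.lower() in words]


print(parse_labels("[Fever, Hayfever]"))
# -> ['Hayfever', 'Fever']
```

This approach has its own limits: a reply such as "Runny nose" (two words) would be missed, and a reply such as "No fever" would still count as `Fever`.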

In [18]:
df["predictions"] = ""

In [19]:
all_labels = df.columns[2:10]
all_labels

Index(['Influenza', 'Diarrhea', 'Hayfever', 'Cough', 'Headache', 'Fever',
       'Runnynose', 'Cold'],
      dtype='object')

In [20]:
def get_pred_encoded(tweet):
    """Return the model's prediction for `tweet` as a 0/1 list over `all_labels`."""
    pred = get_prediction(tweet)
    # 1 if the label was named in the reply, otherwise 0.
    return [int(label in pred) for label in all_labels]

In [21]:
get_pred_encoded(df.iloc[0]["Tweet"])

[0, 0, 0, 0, 0, 0, 0, 1]

# Get predictions for test data

In [22]:
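# Copy the subset so that assigning a new column later does not
# trigger pandas' SettingWithCopyWarning.
test_df = df.loc[df.is_test].copy()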

In [23]:
test_df.head()

Unnamed: 0,ID,Tweet,Influenza,Diarrhea,Hayfever,Cough,Headache,Fever,Runnynose,Cold,is_test,labels,predictions
1920,1921en,I went on a trip and got the flu as a souvenir.,1,0,0,0,0,1,0,0,True,"[1, 0, 0, 0, 0, 1, 0, 0]",
1921,1922en,Difficult bosses are one kind of headache,0,0,0,0,0,0,0,0,True,"[0, 0, 0, 0, 0, 0, 0, 0]",
1922,1923en,I'm dying and need someone to translate for me...,0,0,0,0,0,0,0,0,True,"[0, 0, 0, 0, 0, 0, 0, 0]",
1923,1924en,Flu crisis.,1,0,0,0,0,1,0,0,True,"[1, 0, 0, 0, 0, 1, 0, 0]",
1924,1925en,"I have a horribly stuffy nose, there's no way ...",0,0,0,0,0,0,1,0,True,"[0, 0, 0, 0, 0, 0, 1, 0]",


In [24]:
len(test_df)

640

> **Note:** Sending all 640 test tweets to the API one by one will likely hit the rate limit, so we only try a couple of tweets here.
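
One common way to work within a rate limit is to retry failed requests with exponentially growing pauses. The following is a generic sketch of that idea, not something the notebook runs. `with_backoff` is a hypothetical helper, and which exception class signals a rate limit depends on the client library version, so it is left as the `retry_on` parameter.

```python
import random
import time


def with_backoff(func, *args, max_tries=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call `func`, retrying on `retry_on` errors with exponential backoff plus jitter."""
    for attempt in range(max_tries):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of retries: let the caller see the error
            # Wait base_delay * 1, 2, 4, ... plus a small random offset.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

With such a wrapper, a call like `with_backoff(get_pred_encoded, tweet, retry_on=(SomeRateLimitError,))` would pause and retry instead of failing on the first rejected request; `SomeRateLimitError` is a placeholder for whichever exception the installed client raises.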

In [26]:
for tweet in df["Tweet"].sample(2):
    pred = get_pred_encoded(tweet)
    print(tweet)
    print(pred)

My cough is so terrible, my stomach muscles hurts bad.
[0, 0, 0, 1, 0, 0, 0, 0]
They say allergies will be horrid in the spring, but don't they say the same thing every year anyway?
[0, 0, 1, 0, 0, 0, 0, 0]


In [None]:
#test_df["predictions"] = test_df["Tweet"].apply(get_pred_encoded)
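
Once the `predictions` column is filled in, it can be compared with the `labels` column. The sketch below computes two common multi-label scores, exact-match accuracy and micro-averaged F1, using NumPy only. The choice of metrics is an assumption made for illustration; it is not necessarily what the previous notebooks report. `multilabel_scores` is a hypothetical helper.

```python
import numpy as np


def multilabel_scores(y_true, y_pred):
    """Exact-match accuracy and micro-averaged F1 for 0/1 label matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # A row counts as correct only if every label matches.
    exact_match = float((y_true == y_pred).all(axis=1).mean())
    tp = int((y_true & y_pred).sum())   # predicted and present
    fp = int((~y_true & y_pred).sum())  # predicted but absent
    fn = int((y_true & ~y_pred).sum())  # present but not predicted
    denom = 2 * tp + fp + fn
    micro_f1 = 2 * tp / denom if denom else 0.0
    return {"exact_match": exact_match, "micro_f1": micro_f1}
```

For the test set this could be called as `multilabel_scores(list(test_df["labels"]), list(test_df["predictions"]))`.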