# GPT 3.5 Classification
We will use GPT 3.5 Turbo (ChatGPT) to find cancelled studies based on the `description` field.

<small>**NOTE:** We did not find any additional cancelled studies that weren't already found by the GPT 3.5 Classification.</small>

<small>**NOTE:** Not used for final analysis. We instead used a manual classification approach.</small>

First import the packages:

In [2]:
from openai import OpenAI
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

import json

# NOTE: configure the environment variable OPENAI_API_KEY="your_api_key_here" first
client = OpenAI()

This variable has to be `True` to run the model. 
<br>**This will cost credits!**

In [3]:
RUN_AI_TASK = True

Next we import the data that we want to patch. The helper definitions below give us Python-friendly column names and control which strings are read as $\texttt{NA}$ values.

In [None]:
na_values = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", 
    "1.#IND", "1.#QNAN", "<NA>", "NULL", "NaN", "None", "nan", "null"
    # "N/A",
    # "NA",
    # "n/a",
]

def python_name_converter(x):
    # Convert "Some Column" to "some_column"; leave '$'-prefixed meta-fields untouched.
    return '_'.join([word.lower() for word in x.split(' ')]) if x[0] != '$' else x
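
As a quick check of the naming helper, the sketch below applies it to a few headers. The sample headers are modelled on columns that appear in the output further down; their exact spelling in the spreadsheet is an assumption.

```python
def python_name_converter(x):
    # Same logic as the helper above: snake_case everything except '$' meta-fields.
    return '_'.join([word.lower() for word in x.split(' ')]) if x[0] != '$' else x

# Assumed header spellings, used only to illustrate the behaviour.
sample_headers = ["Registration Date", "Lead Institution Encepp", "$CANCELLED_MANUAL"]
converted = [python_name_converter(header) for header in sample_headers]
print(converted)
```

Meta-fields that start with `$` pass through unchanged, so they stay easy to tell apart from the source columns.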

We can now read the data and assign the new meta-field.

In [None]:
regex_df = pd.read_excel(
    '../../../data/ema_rwd/ema_rwd_patched_regex_v2.xlsx',
    keep_default_na=False,
    na_values=na_values,
    na_filter=True
).iloc[:, 1:].rename(
    columns=python_name_converter
).assign(
    **{'$CANCELLED_GPT': pd.NA}
)

We will classify the study based on the `description` field. We return `pd.NA` when this field is not present or when we get a $\texttt{JSON}$ error. 

By default, values that were already generated are kept and not overridden (`override=False`).

In [66]:
system_prompt = "You will be presented with the description of a Post-authorisation study (PAS or PASS or PAES) and your job is to detect if the study was cancelled, stopped, withdrawn, etc. Provide your answer as either True or False in JSON-Format (key: answer, type: boolean). Choose ONLY from these two values but NOT both. Answer True even if ONLY a single sentence indicates that the study was cancelled.\n\nNote that a study can also be cancelled if the Market-authorisation holder (MAH) or the researched medical product was withdrawn."

def classify(description, current, override=False):
    if pd.notna(current) and not override:
        return current

    if pd.isna(description):
        return pd.NA

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_format={ "type": "json_object" },
        messages=[
            {
              "role": "system",
              "content": system_prompt
            },
            {
              "role": "user",
              "content": json.dumps(description)
            }
        ],
        temperature=1,
        max_tokens=10,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
        
    try:
        answer = json.loads(response.choices[0].message.content)['answer']
        return answer
    except Exception as e:
        print('Exception: ', e)
        return pd.NA
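
The reply handling in `classify` can be tried out without calling the API. The following is a minimal sketch, not the notebook's implementation: `parse_answer` is a hypothetical helper, `None` stands in for `pd.NA`, and the extra `isinstance` check is an addition that `classify` does not perform. The sample replies are invented.

```python
import json

def parse_answer(raw_reply):
    """Return the boolean stored under 'answer', or None if the reply is unusable."""
    try:
        answer = json.loads(raw_reply)["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, missing key, or a non-object reply.
        return None
    # Hypothetical extra guard: accept real booleans only.
    return answer if isinstance(answer, bool) else None

# Invented replies covering the main failure modes.
replies = {
    "valid": '{"answer": true}',
    "truncated": '{"answer": tr',       # e.g. a reply cut off by the token limit
    "wrong_key": '{"result": false}',
    "wrong_type": '{"answer": "True"}',
}
parsed = {name: parse_answer(raw) for name, raw in replies.items()}
print(parsed)
```

Only the first reply yields a label; the other three map to the missing value, which mirrors the fallback to `pd.NA` described above.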

We will partition the data first and run the code separately on every chunk, so that we lose as little data and as few credits as possible if an error occurs.

**NOTE: Running this cell resets `gpt_chunks` and therefore discards the generated data. If you only want to fill up empty cells, skip this cell and rerun the cell below instead.**

In [None]:
def chunker(seq, size):
    return {i: seq[pos:pos + size] for i, pos in enumerate(range(0, len(seq), size))}

chunks = chunker(regex_df, 100)
gpt_chunks = dict()
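
To see how `chunker` splits the data, here is a small sketch on a plain list. The list is a stand-in for `regex_df`, and its length of 2760 is inferred from the progress output of the main task (27 chunks of 100 rows and one of 60), not read from the file.

```python
def chunker(seq, size):
    # Map a running chunk number to consecutive slices of at most `size` items.
    return {i: seq[pos:pos + size] for i, pos in enumerate(range(0, len(seq), size))}

rows = list(range(2760))  # stand-in for the rows of regex_df (assumed length)
demo_chunks = chunker(rows, 100)
sizes = {k: len(v) for k, v in demo_chunks.items()}

print(len(demo_chunks), sizes[0], sizes[27])  # prints: 28 100 60
```

The last chunk is simply shorter than the others, so no rows are dropped or padded.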

This will run the main task if `RUN_AI_TASK` is set to `True`. 

**This will use your CREDITS!**

In [67]:
def find_gpt_cancelled_by_index(df, index):
    return df.at[index, '$CANCELLED_GPT'] if isinstance(df, pd.DataFrame) else pd.NA

if RUN_AI_TASK:
    for k, chunk_df in chunks.items():
        tqdm.pandas(desc=f'GPT Classification {k}')
        # Labels from an earlier run of this chunk (if any) are passed on as `current`.
        previous = gpt_chunks.get(k, None)
        gpt_chunks[k] = chunk_df.assign(
            **{'$CANCELLED_GPT': chunk_df.progress_apply(
                lambda x: classify(x['description'], find_gpt_cancelled_by_index(previous, x.name)),
                axis='columns'
            )}
        )

GPT Classification 0:   0%|          | 0/100 [00:00<?, ?it/s]

...

GPT Classification 27:   0%|          | 0/60 [00:00<?, ?it/s]
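
The reason a rerun of the main task does not spend credits on rows that already have a label can be illustrated with a stub. This is a sketch under assumptions: `fake_classify` copies only the guard clauses of `classify`, replaces the API request with a keyword check, and uses `None` in place of `pd.NA`. The descriptions are invented.

```python
calls = 0  # counts how often the (stubbed) model would have been queried

def fake_classify(description, current, override=False):
    """Stub with the same guard clauses as classify; no API request is made."""
    global calls
    if current is not None and not override:
        return current            # keep the label from an earlier run
    if description is None:
        return None               # nothing to classify
    calls += 1                    # a real run would spend credits here
    return "cancelled" in description.lower()

# Invented descriptions keyed by row index.
descriptions = {0: "Study was cancelled by the MAH.", 1: None, 2: "Ongoing cohort study."}
results = {}

def run_once():
    for idx, text in descriptions.items():
        results[idx] = fake_classify(text, results.get(idx))

run_once()
first_pass_calls = calls
run_once()
second_pass_calls = calls - first_pass_calls
print(results, first_pass_calls, second_pass_calls)
```

The second pass makes no further stub calls, because every row either has a stored label or has no description to send.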

We will now concatenate the chunks and revert some of the changes made to the `pd.DataFrame`.

In [107]:
def excel_name_converter(x):
    # Convert "some_column" back to "Some Column"; leave '$'-prefixed meta-fields untouched.
    return ' '.join([word.capitalize() for word in x.split('_')]) if x[0] != '$' else x

gpt_df = pd.concat(gpt_chunks.values())
gpt_df.loc[:, ['$CANCELLED_GPT', '$CANCELLED_MANUAL']] = gpt_df.loc[:, ['$CANCELLED_GPT', '$CANCELLED_MANUAL']].astype(bool)
gpt_df = gpt_df.assign(
    **{
        '$CANCELLED_GPT': np.where(gpt_df['description'].isna(), pd.NA, gpt_df['$CANCELLED_GPT']),
        '$CANCELLED_MANUAL' : np.where(gpt_df['description'].isna(), pd.NA, gpt_df['$CANCELLED_MANUAL'])
    }
).rename(columns=excel_name_converter)
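
Note that converting names to snake case and back is not an exact round trip for headers that contain acronyms. The sketch below shows this with a header whose original spelling is an assumption; only the converted form is known from the notebook.

```python
def python_name_converter(x):
    return '_'.join([word.lower() for word in x.split(' ')]) if x[0] != '$' else x

def excel_name_converter(x):
    return ' '.join([word.capitalize() for word in x.split('_')]) if x[0] != '$' else x

original = "EU PAS Register Number"  # assumed spelling of the source header
snake = python_name_converter(original)
restored = excel_name_converter(snake)

print(snake)
print(restored)
print(restored == original)
```

Because `str.capitalize` lowercases everything after the first letter, acronyms come back in title case, which is consistent with headers such as `Eu Pas Register Number` and `Url` in the exported table.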

We can now view and export the data.

In [108]:
gpt_df

Unnamed: 0,Eu Pas Register Number,Title,Url,Registration Date,Update Date,Puri,Countries,Description,State,Lead Institution Encepp,...,Check Stability,Check Logical Consistency,Conducted Data Characterisation,$CANCELLED,$CANCELLED_MANUAL,$UPDATED_state,$UPDATED_state_eq_state,$CANCELLED_extracted_word,$CANCELLED_extracted_sentence,$CANCELLED_GPT
0,1578,Long-term outcomes and adverse events of thera...,https://catalogues.ema.europa.eu/study/9133,2010-10-27,2015-03-30,https://redirect.ema.europa.eu/resource/9133,Italy,Objectives: to measure long-term outcomes and ...,Finalised,Department of Epidemiology of the Regional Hea...,...,Unknown,Unknown,No,False,False,Finalised,True,,,False
1,1587,DEVELOPMENT AND VALIDATION OF A SHORTENED VERS...,https://catalogues.ema.europa.eu/study/1588,2010-10-25,2010-10-25,https://redirect.ema.europa.eu/resource/1588,Spain,Questionnaires for measuring quality of life i...,Finalised,University Hospital Vall d’Hebron (HUVH),...,Unknown,Unknown,No,False,False,Finalised,True,,,False
2,1591,Cardiac profile of patients using rosiglitazon...,https://catalogues.ema.europa.eu/study/15148,2010-10-06,2016-09-09,https://redirect.ema.europa.eu/resource/15148,United Kingdom,The study is a retrospective analysis of a coh...,Finalised,European Medicines Agency (EMA),...,Unknown,Unknown,No,False,False,Finalised,True,,,False
3,1594,Validation of statistical signal detection pro...,https://catalogues.ema.europa.eu/study/14160,2010-10-20,2016-07-19,https://redirect.ema.europa.eu/resource/14160,United Kingdom,OBJECTIVE: To evaluate whether statistical sig...,Finalised,European Medicines Agency (EMA),...,Unknown,Unknown,No,False,False,Finalised,True,,,False
4,1597,International Active Surveillance study - Fola...,https://catalogues.ema.europa.eu/study/36862,2010-10-26,2020-08-24,https://redirect.ema.europa.eu/resource/36862,Canada; Russian Federation; Ukraine; United St...,Oral Contraceptives containing DRSP have been ...,Finalised,Berlin Center for Epidemiology & Health Resear...,...,Unknown,Unknown,No,True,True,Finalised,True,discontinue,"Because of the rarity of CRC, the recruitments...",True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2755,108481,Post Marketing Surveillance of Effectiveness (...,https://catalogues.ema.europa.eu/study/199010,2024-01-18,2024-01-18,https://redirect.ema.europa.eu/resource/199010,China,This is a multicenter observational study invo...,Ongoing,Merck & Co.,...,Unknown,Unknown,No,False,False,Ongoing,True,,,False
2756,108728,A Non-Interventional Multi-Database Post-Autho...,https://catalogues.ema.europa.eu/study/199011,2024-01-16,2024-01-16,https://redirect.ema.europa.eu/resource/199011,Denmark; Finland; France; Germany; United States,The aim of this study is to describe congenita...,Planned,IQVIA,...,Unknown,Unknown,No,False,False,Planned,True,,,False
2757,108847,Post-Authorisation Safety Study of Comirnaty O...,https://catalogues.ema.europa.eu/study/199012,2024-01-22,2024-01-22,https://redirect.ema.europa.eu/resource/199012,Italy; Netherlands; Norway; Spain; United Kingdom,This study aims to answer the research questio...,Planned,Pfizer,...,Unknown,Unknown,No,False,False,Planned,True,,,False
2758,108904,Prospective Observational Study to Monitor and...,https://catalogues.ema.europa.eu/study/199013,2024-01-22,2024-02-17,https://redirect.ema.europa.eu/resource/199013,Brazil; Bulgaria; Denmark; France; Germany; Is...,"This is a multinational, non-interventional, o...",Planned,IQVIA,...,Unknown,Unknown,No,False,False,Planned,True,,,False


In [109]:
gpt_df.to_excel('ema_rwd_patched_gpt.xlsx', sheet_name='PAS')