In [1]:
!pip install openai --upgrade


Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Installing collected packages: openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed openai-0.28.1

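The next cell calls `openai.ChatCompletion.create`, which is only available in pre-1.0 releases of the `openai` package such as the 0.28.1 installed above. The following is a minimal sketch of a guard that could be run before the main cell; the helper name `is_legacy_openai` is hypothetical and not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version


def is_legacy_openai(version_string):
    """Return True for pre-1.0 releases, which still expose openai.ChatCompletion."""
    major = int(version_string.split(".")[0])
    return major < 1


try:
    installed = version("openai")
    print(f"openai {installed}: legacy interface available = {is_legacy_openai(installed)}")
except PackageNotFoundError:
    print("openai is not installed")
```

If the check reports a 1.x release, pinning the install (for example `pip install "openai<1"`) is one way to keep the code below working unchanged.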

In [3]:
import pandas as pd
import os
import openai
from dateutil.parser import parse
# SYSTEM PROMPT
SYSTEM_PROMPT = """

Task:
Imagine you are a data analyst working in a hospital tasked with evaluating the sequential validity of time-related fields in a healthcare dataset. Your role involves assessing the quality of medical data across various hospitals.
When you receive medical tabular data, it's crucial to distinguish and understand the chronological order of the column names.

Procedure:
Analyze various CSV files from different researchers. The variable names might be explicit, but sometimes they might be abbreviated. Consider this in the context of medical tabular data. For example:
EDREGTIME (Emergency Department registration time)
ADMITTIME (Admission time)
DISCHTIME (Discharge time)
DEATHTIME (Death time)
Infer the full meaning of variable names like ADMITTIME.
From the input columns, identify and exclude columns that aren't related to medical events.
Determine the sequence and relationship of the relevant columns without including additional information. Express the order of medical data using logical operators.
Use symbols like "<" for "less than", "<=" for "less than or equal to", ">" for "greater than", and ">=" for "greater than or equal to" to define relationships, and end each condition with "&&".

Note:
Make sure to output the exact column name; be careful with lowercase and uppercase. When interpreting medical data columns like discharge_date, admission_date, diagnosis_date, medication_start_date, and medication_end_date, avoid making specific or strict assumptions about the order of events that might not be universally valid. Instead, only establish general sequences that have broad applicability. For instance, it's universally valid that medication has a start and an end date, so medication_start_date should always be on or before medication_end_date. Similarly, a patient's date of admission should always be before their date of discharge. However, do not make assumptions like diagnosis_date is always before admission_date, as this is not universally true. The goal is to capture the broadest and most general sequence of events without being overly prescriptive.

Output format:
1. Columns to exclude: List the columns that aren't directly related to the medical sequence.
2. Reason for exclusion: Provide reasons for why certain columns were excluded from the medical sequence.
3. Relevant columns: Describe the medical significance of each column in a dictionary format, for example:
ADMITTIME: The time a patient is admitted to the hospital.
DISCHTIME: The time a patient is discharged from the hospital.
DEATHTIME: The time a patient dies.
EDREGTIME: The time a patient registers at the emergency department.
4. Sequential: Define the logical sequence of the columns. Make sure to use the exact variable name including lower and uppercase.
5. Sequence rationale: Describe the rationale or reasoning behind the defined sequence.

Example:
MIMIC data

Input: ADMITTIME, DISCHTIME, DEATHTIME, EDREGTIME, EDOUTTIME, LABTEST_DATE, MEDICATION_DATE, SURGERY_DATE, FOLLOWUP_DATE, RADIOLOGY_DATE, HEIGHT, WEIGHT, STORETIME, VALUE

Output:
1. Columns to exclude: HEIGHT, WEIGHT, STORETIME, VALUE

2. Reason for exclusion: These columns do not directly represent chronological medical events or timestamps related to patient care in the MIMIC dataset.

3. Relevant columns:
ADMITTIME: The time a patient is admitted to the hospital.
DISCHTIME: The time a patient is discharged from the hospital.
DEATHTIME: The time a patient dies.
EDREGTIME: The time a patient registers at the emergency department.
EDOUTTIME: The time a patient leaves the emergency department.
LABTEST_DATE: The time a lab test was conducted.
MEDICATION_DATE: The time medication was administered.
SURGERY_DATE: The time surgery was conducted.
FOLLOWUP_DATE: The time of a patient's follow-up visit.
RADIOLOGY_DATE: The time a radiological test was conducted.

4. Sequential:
EDREGTIME < EDOUTTIME &&
ADMITTIME < DISCHTIME &&

5. Sequence rationale:
The patient first registers at the emergency department, then leaves it.
Once admitted, various medical events like lab tests, medications, surgeries, radiological tests, and follow-ups can occur in different orders, so we omit them from our sequence. Instead, we show the more general sequence: the patient is admitted to the hospital and then discharged.


Input: discharge_date, admission_date, diagnosis_date, medication_start_date, medication_end_date, age_at_recruitment, ethnicitiy, bmi, smoking_status, blood_glucose

Output:

1. Columns to exclude: age_at_recruitment, ethnicitiy, bmi, smoking_status, blood_glucose

2. Reason for exclusion: These columns do not directly represent chronological medical events or timestamps related to patient care. age_at_recruitment and ethnicitiy are demographic information; bmi, smoking_status, and blood_glucose are health status or lifestyle information but not time-related.

3. Relevant columns:

discharge_date: The date a patient was discharged from the hospital.
admission_date: The date when a patient was admitted.
diagnosis_date: The date when a patient was diagnosed.
medication_start_date: The date when medication was started.
medication_end_date: The date when medication was ended.

4. Sequential:
medication_start_date <= medication_end_date &&
admission_date < discharge_date &&

5. Sequence rationale:
Medication administration has a clear start and end, so the start date should always be on or before the end date. Similarly, the date of admission to a hospital should always precede the discharge date. These sequences capture general, universally valid relationships in medical data without making assumptions about the specific order of other events.

"""

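# Helper function to check whether a string can be parsed as a date
def is_date(string):
    try:
        parse(string)
        print(f"Checked date: {string} is valid.")
        return True
    except (ValueError, OverflowError):
        # dateutil raises ValueError (including ParserError) or OverflowError on bad input
        print(f"Checked date: {string} is invalid.")
        return False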


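# Validate a single comparison; returns False only when both values are dates and the comparison fails
def validate_comparison(row, left_col, right_col, operator):
    # Column names were lowercased in import_data, so lowercase the names returned by GPT too
    left_col = left_col.lower()
    right_col = right_col.lower()

    # Compare only when both values parse as dates; missing or unparseable values are not penalized
    if is_date(str(row[left_col])) and is_date(str(row[right_col])):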
        left_val = pd.to_datetime(row[left_col])
        right_val = pd.to_datetime(row[right_col])

        if operator == "<" and not left_val < right_val:
            return False
        elif operator == "<=" and not left_val <= right_val:
            return False
        elif operator == ">" and not left_val > right_val:
            return False
        elif operator == ">=" and not left_val >= right_val:
            return False
    return True

# 1. Importing data
def import_data(file_path):
    try:
        df = pd.read_csv(file_path)
        print(f"Data imported successfully with columns: {df.columns.tolist()}")
        df.columns = df.columns.str.lower()  # Convert column names to lowercase
        return df
    except Exception as e:
        print(f"Error importing data: {e}")
        return None  # The caller checks for None before continuing


# 2. Ask GPT for the expected chronological order of the columns
def get_gpt_output(df, temperature=0):
    # Send every column name; the system prompt asks the model to exclude non-temporal ones
    column_names = df.columns.tolist()

    # Read the API key from the environment; never hard-code secrets in a notebook
    openai.api_key = os.getenv("OPENAI_API_KEY")

    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ", ".join(column_names)}
        ]
    )
    return response['choices'][0]['message']['content']

# 3. Extract the "4. Sequential:" section from the GPT output
def extract_sequence_order(gpt_output):
    lines = gpt_output.split('\n')
    sequence = ""
    capturing = False

    for line in lines:
        line = line.strip()
        if line.startswith("4. Sequential:"):
            capturing = True
            # Start capturing the sequence, remove the "4. Sequential:" part
            sequence += line.replace("4. Sequential:", "").strip() + " "
        elif capturing:
            print(line)  # Debug output: show each captured line
            if line.startswith("5. Sequence rationale:"):
                # Stop capturing when the next section starts
                break
            # Keep the line as-is, including any "&&"; validate_rows splits on it later
            sequence += line + " "

    return sequence.strip()

# 4. Validation
def validate_rows(df, sequence):
    valid_rows = 0
    invalid_rows = []
    conditions = [cond.strip().replace("<=", "≤").replace(">=", "≥") for cond in sequence.split("&&") if cond.strip()]

    for index, row in df.iterrows():
        valid = True
        row_errors = []
        for condition in conditions:
            parts = ''
            operator = ''
            if "≤" in condition or "≥" in condition:
                parts = condition.split("≤") if "≤" in condition else condition.split("≥")
                operator = "<=" if "≤" in condition else ">="
            elif "<" in condition or ">" in condition:
                parts = condition.split("<") if "<" in condition else condition.split(">")
                operator = "<" if "<" in condition else ">"

            for i in range(len(parts) - 1):
                left_col = parts[i].replace("≤", "").replace("≥", "").strip()
                right_col = parts[i + 1].replace("≤", "").replace("≥", "").strip()

                # if "≤" in parts[i + 1]:
                #     operator = "<="
                # elif "≥" in parts[i + 1]:
                #     operator = ">="

                if not validate_comparison(row, left_col, right_col, operator):
                    valid = False
                    #change operator
                    comp_symbol = change_operator(operator)
                    row_errors.append(f"{left_col} {comp_symbol} {right_col}")

        # Print and record the result for this row
        if valid:
            valid_rows += 1
            print(f"Row {index} is valid.")
        else:
            invalid_rows.append((index, row_errors))
            print(f"Row {index} is invalid due to: {', '.join(row_errors)}")

    # Print the summary of validation
    print(f"Total valid rows: {valid_rows}, Invalid rows: {len(invalid_rows)}")
    return valid_rows


# 5. Validity Calculation
def calculate_validity_percentage(df, valid_rows):
    total_rows = len(df)
    if total_rows == 0:
        return 0.0  # Avoid division by zero on an empty file
    return (valid_rows / total_rows) * 100

# 6. Negate an operator so error messages show the relationship actually observed
def change_operator(operator):
    negations = {'<': '>=', '>': '<=', '<=': '>', '>=': '<'}
    # Unknown operators are returned unchanged
    return negations.get(operator, operator)

# Main execution block
if __name__ == "__main__":
    file_path = "/content/split_1.csv"  # Update this path as needed
    df = import_data(file_path)

    if df is not None:
        gpt_output = get_gpt_output(df)
        print("GPT Output:")
        print(gpt_output)

        sequence = extract_sequence_order(gpt_output)
        print(f"Extracted sequence: {sequence}")

        valid_rows = validate_rows(df, sequence)
        if valid_rows is not None:
            percentage = calculate_validity_percentage(df, valid_rows)
            print(f"Percentage of valid rows: {percentage:.2f}%")

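The cell above recovers operators by temporarily swapping `<=` and `>=` for `≤` and `≥` before splitting. As an alternative, a regular expression can return each condition as a `(left, operator, right)` triple directly. This is only a sketch under stated assumptions: the name `parse_sequence` is hypothetical, column names are assumed to contain only word characters, and chained conditions such as `A < B < C` are not handled.

```python
import re

# Two-character operators are listed first so "<=" is not read as "<".
_CONDITION = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*(\w+)\s*$")


def parse_sequence(sequence):
    """Split 'A < B && C <= D' into (left, operator, right) triples."""
    triples = []
    for chunk in sequence.split("&&"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing "&&"
        match = _CONDITION.match(chunk)
        if match is None:
            raise ValueError(f"Unrecognized condition: {chunk!r}")
        triples.append(match.groups())
    return triples


print(parse_sequence("STARTDATE <= ENDDATE && ADMITTIME < DISCHTIME"))
```

Raising on an unrecognized condition makes malformed model output visible, rather than silently skipping it.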

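The row-by-row loop in `validate_rows` parses every value twice and prints one line per check, which may be slow on large files. A vectorized variant of a single comparison could look like the sketch below. The name `violation_mask` and the sample data are hypothetical, and `pd.to_datetime(..., errors="coerce")` may accept or reject slightly different strings than `dateutil.parser.parse`, so results are not guaranteed to match the loop exactly.

```python
import operator

import pandas as pd

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def violation_mask(df, left_col, op, right_col):
    """Boolean Series: True where both values parse as dates and the comparison fails."""
    left = pd.to_datetime(df[left_col], errors="coerce")
    right = pd.to_datetime(df[right_col], errors="coerce")
    # As in validate_comparison, rows with a missing or unparseable date are not flagged
    comparable = left.notna() & right.notna()
    return comparable & ~OPS[op](left, right)


# Hypothetical sample data for illustration only
sample = pd.DataFrame({
    "admittime": ["2020-01-01", "2020-01-05", None],
    "dischtime": ["2020-01-03", "2020-01-04", "2020-01-10"],
})
mask = violation_mask(sample, "admittime", "<", "dischtime")
print(mask.tolist())
```

Combining the masks of all conditions with `|` would give the set of invalid rows, and `(~combined).mean() * 100` the validity percentage.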

Data imported successfully with columns: ['STARTDATE', 'ENDDATE', 'SUBJECT_ID', 'ADMITTIME', 'DISCHTIME']
GPT Output:
Output:

1. Columns to exclude: subject_id

2. Reason for exclusion: The column 'subject_id' is an identifier and does not directly represent a medical event or timestamp.

3. Relevant columns:

STARTDATE: The day an activity, such as a medication regimen, begins.

ENDDATE: The day an activity, such as a medication regimen, ends.

ADMITTIME: The time a patient is admitted to the hospital.

DISCHTIME: The time a patient is discharged from the hospital.

4. Sequential:

STARTDATE <= ENDDATE &&
ADMITTIME < DISCHTIME

5. Sequence rationale:
The 'STARTDATE' and 'ENDDATE' likely refers to a period of a certain activity, such as medication administration, therefore the start should always be on or before the end. The 'ADMITTIME' and 'DISCHTIME' represent the period a patient is in the hospital, hence, the admission should always take place before the discharge.

STARTDATE <= E