# Statistical Signal Detection: Reporting Odds Ratio (ROR)

## Objective
While the Power BI dashboard visualizes the *counts* of adverse events, this script quantifies the *strength of association* between a product and an adverse event by comparing reporting odds across two product cohorts.

## Methodology
We calculate the **Reporting Odds Ratio (ROR)** with a 95% Confidence Interval to compare the reporting frequency of **Nausea** between two cohorts:
1.  **Ozempic (Diabetes)**
2.  **Wegovy (Obesity)**
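
For reference, the quantities computed by the script can be written as follows. Here $a$ and $b$ are the Ozempic counts with and without the event, and $c$ and $d$ are the Wegovy equivalents, matching the cell labels in the script's 2x2 table. The multiplier 1.96 is the normal quantile the script uses for a 95% interval.

```latex
\mathrm{ROR} = \frac{a/b}{c/d} = \frac{ad}{bc}, \qquad
\mathrm{SE}\bigl(\ln \mathrm{ROR}\bigr) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}, \qquad
95\%\ \mathrm{CI} = \exp\bigl(\ln \mathrm{ROR} \pm 1.96\,\mathrm{SE}\bigr)
```

The script treats an interval that lies entirely above or entirely below 1 as a signal.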

## Hypothesis
* **Null Hypothesis (H0):** The reporting odds of Nausea are the same for the Ozempic (Diabetes) and Wegovy (Obesity) cohorts (ROR = 1).
* **Alternative Hypothesis (H1):** The reporting odds of Nausea differ between the two cohorts (ROR ≠ 1).
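
One way to test these hypotheses directly is a two-sided Wald z-test on the log ROR, which uses the same standard error as the confidence interval in the script. The sketch below is illustrative rather than part of the original analysis: `wald_test_log_ror` is a hypothetical helper name, and the counts are the ones the script prints in its 2x2 table for Nausea. It relies only on the standard library, using `math.erfc` for the normal tail probability.

```python
import math


def wald_test_log_ror(a, b, c, d):
    """Two-sided Wald z-test of H0: ROR = 1, computed on the log scale.

    a, b: cohort 1 counts with / without the event
    c, d: cohort 2 counts with / without the event
    (hypothetical helper, not part of the original script)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_ror = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_ror / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts taken from the script's printed 2x2 table for Nausea
z, p_value = wald_test_log_ror(2708, 57830, 861, 16979)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

A p-value below 0.05 here corresponds to the 95% interval excluding 1, so the test and the interval always lead to the same decision.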

In [5]:
import pandas as pd
import numpy as np
import scipy.stats as stats

# 1. Load the Data
df = pd.read_csv(r"C:\Users\findo\OneDrive\Desktop\GLP1-Safety-Project\data\processed\faers_glp1_final.csv")

# 2. Define the Cohorts (Case Insensitive)
# Cohort A: Ozempic (Diabetes)
# Cohort B: Wegovy (Obesity)
# Rows naming only generic "Semaglutide" fall into 'Other' and are excluded from this test
def define_cohort(product_name):
    product_name = str(product_name).upper()
    if 'OZEMPIC' in product_name:
        return 'Ozempic (Diabetes)'
    elif 'WEGOVY' in product_name:
        return 'Wegovy (Obesity)'
    else:
        return 'Other'

df['cohort'] = df['drugname_clean'].apply(define_cohort)

# Filter dataset to only keep Ozempic and Wegovy rows
stats_df = df[df['cohort'] != 'Other'].copy()

# 3. Define the Target Event
target_event = "Nausea"  # You can change this to "Vomiting" or "Pancreatitis"

# Create the 2x2 Contingency Table
# -----------------------------------------------
#                     | Event (Yes) | Event (No)
# Ozempic (Diabetes)  |     a       |     b     
# Wegovy (Obesity)    |     c       |     d     
# -----------------------------------------------

# Calculate a, b, c, d
is_ozempic = stats_df['cohort'] == 'Ozempic (Diabetes)'
is_wegovy = stats_df['cohort'] == 'Wegovy (Obesity)'
is_event = stats_df['pt'] == target_event

a = int((is_ozempic & is_event).sum())   # Ozempic rows with the event
b = int((is_ozempic & ~is_event).sum())  # Ozempic rows without the event
c = int((is_wegovy & is_event).sum())    # Wegovy rows with the event
d = int((is_wegovy & ~is_event).sum())   # Wegovy rows without the event

print(f"--- 2x2 Table for {target_event} ---")
print(f"Ozempic (Diabetes): {a} cases / {b} controls")
print(f"Wegovy  (Obesity) : {c} cases / {d} controls")

# Safety check: the ROR and its standard error divide by every cell, so none may be zero
if min(a, b, c, d) == 0:
    print("\nERROR: At least one cell of the 2x2 table is 0, so the ROR is undefined. Check your spelling or data!")
else:
    # 4. Calculate ROR (Reporting Odds Ratio)
    # Formula: (a/b) / (c/d)
    ror = (a / b) / (c / d)

    # 5. Calculate 95% Confidence Interval
    ln_ror = np.log(ror)
    se_ln_ror = np.sqrt((1/a) + (1/b) + (1/c) + (1/d))
    lower_ci = np.exp(ln_ror - 1.96 * se_ln_ror)
    upper_ci = np.exp(ln_ror + 1.96 * se_ln_ror)

    print("\n" + "="*40)
    print(f"STATISTICAL RESULT: {target_event}")
    print("="*40)
    print(f"Reporting Odds Ratio (ROR): {ror:.2f}")
    print(f"95% Confidence Interval:  [{lower_ci:.2f}, {upper_ci:.2f}]")
    
    # Interpretation: the ROR compares reporting odds between cohorts, not patient-level risk
    if lower_ci > 1.0:
        print("\nCONCLUSION: SIGNAL DETECTED.")
        print(f"The reporting odds of {target_event} for Ozempic are {ror:.2f}x those for Wegovy.")
    elif upper_ci < 1.0:
        print("\nCONCLUSION: INVERSE SIGNAL.")
        print(f"Ozempic patients are LESS likely to report {target_event} than Wegovy patients.")
    else:
        print("\nCONCLUSION: NO SIGNIFICANT DIFFERENCE.")
        print("The reporting odds do not differ significantly between groups.")

--- 2x2 Table for Nausea ---
Ozempic (Diabetes): 2708 cases / 57830 controls
Wegovy  (Obesity) : 861 cases / 16979 controls

STATISTICAL RESULT: Nausea
Reporting Odds Ratio (ROR): 0.92
95% Confidence Interval:  [0.85, 1.00]

CONCLUSION: INVERSE SIGNAL.
Ozempic patients are LESS likely to report Nausea than Wegovy patients.
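
The interval above is printed to two decimals, so the upper bound appears as 1.00 even though the script's `upper_ci < 1.0` branch fired. The sketch below recomputes the same quantities from the printed counts with more decimals so the bound can be inspected directly. `reporting_odds_ratio` and `RorResult` are hypothetical names introduced for illustration; the formula and the 1.96 multiplier are the ones used in the script.

```python
import math
from typing import NamedTuple


class RorResult(NamedTuple):
    ror: float
    lower: float
    upper: float


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """ROR with a Wald interval on the log scale (z=1.96 gives a 95% interval).

    Hypothetical helper mirroring the script's calculation.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("ROR is undefined when any cell is zero")
    ror = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_ror = math.log(ror)
    return RorResult(ror, math.exp(log_ror - z * se), math.exp(log_ror + z * se))


# Counts taken from the printed 2x2 table for Nausea
result = reporting_odds_ratio(2708, 57830, 861, 16979)
print(f"ROR = {result.ror:.4f}, 95% CI = [{result.lower:.4f}, {result.upper:.4f}]")
print("Upper bound below 1:", result.upper < 1.0)
```

With these counts the upper bound evaluates to just under 1, which is why it rounds to 1.00 while still satisfying the script's inverse-signal condition. The result is therefore borderline and sensitive to small changes in the counts.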
