![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# PRR, ROR, and EBGM Calculation from DRUG and ADE relations

**Introduction**

This notebook focuses on signal detection for adverse drug events using Natural Language Processing (NLP): it identifies and analyzes relationships between drugs and Adverse Drug Events (ADEs) in clinical text.

**Purpose & Importance:**

🩺 Pharmacovigilance:
The science and activities related to the detection, assessment, understanding, and prevention of adverse effects or any other drug-related problems.

Adverse Drug Reactions (ADRs):
Any unexpected, harmful reaction experienced after the administration of a drug under normal conditions of use.

Signal Detection:
The process of identifying potential safety issues or new adverse reactions related to a drug.

Detecting unexpected ADRs early is crucial for public health safety.

**📊 Statistical Methods for Signal Detection:**

**PRR (Proportional Reporting Ratio):**
A statistical measure that compares the proportion of reports mentioning a specific adverse event for a particular drug against the proportion of reports mentioning that event for all other drugs.

High PRR values may indicate a potential safety signal.
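
As a minimal sketch of the definition above, PRR can be computed from a 2x2 table of report counts. The function name `prr` and the counts in the example are illustrative assumptions; `a`, `b`, `c`, `d` follow the contingency-table convention used in the calculation section of this notebook.

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional Reporting Ratio from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with the event, for all other drugs
    d: reports with neither the drug nor the event
    """
    rate_drug = a / (a + b)      # share of the drug's reports that mention the event
    rate_others = c / (c + d)    # same share among reports for all other drugs
    return rate_drug / rate_others


# Hypothetical counts, chosen only to illustrate the arithmetic: 0.20 / 0.05
example_prr = prr(a=20, b=80, c=50, d=950)
print(round(example_prr, 2))  # 4.0
```

Note that the ratio is undefined when `c` is 0, which is common in very small datasets.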

**ROR (Reporting Odds Ratio):**
Another measure that compares the odds of an adverse event occurring with a specific drug versus the odds with all other drugs.

High ROR values suggest a possible drug-event association.
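
The sketch below computes ROR together with an approximate 95% confidence interval based on the standard error of the log odds ratio. The function name `ror_with_ci` and the counts are illustrative assumptions, not something this notebook's pipeline defines.

```python
import math


def ror_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Reporting Odds Ratio with an approximate confidence interval.

    Requires all four cells to be non-zero.
    """
    ror = (a / b) / (c / d)                             # equivalent to (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of ln(ROR)
    lower = math.exp(math.log(ror) - z * se_log)
    upper = math.exp(math.log(ror) + z * se_log)
    return ror, lower, upper


# Hypothetical counts, for illustration only.
ror, lower, upper = ror_with_ci(a=20, b=80, c=50, d=950)
print(f"ROR={ror:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A lower bound above 1 is often used as a screening criterion, since it suggests the association is unlikely to be explained by sampling noise alone.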

**EBGM (Empirical Bayes Geometric Mean):**
A Bayesian-adjusted statistical measure that smooths raw estimates by incorporating prior knowledge.

Useful when dealing with small sample sizes or sparse data to reduce false positives.
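
To make the shrinkage idea concrete, here is a simplified sketch that assumes a single Gamma(alpha, beta) prior on the reporting ratio. The full empirical Bayes method fits a mixture of two gamma distributions to the data, so this is only an illustration. With an observed count and an expected count, the posterior is Gamma(alpha + observed, beta + expected), and its geometric mean is `exp(digamma(alpha + observed) - log(beta + expected))`. The helper names and counts are assumptions; `digamma_int` is a pure-Python stand-in that is valid only for positive integers.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def shrunk_ratio(observed: int, expected: float, alpha: int = 1, beta: float = 1.0) -> float:
    """Geometric mean of the Gamma(alpha + observed, beta + expected) posterior."""
    return math.exp(digamma_int(alpha + observed) - math.log(beta + expected))


# Hypothetical counts: both pairs have a raw observed/expected ratio of 10.
sparse = shrunk_ratio(observed=1, expected=0.1)     # a single report
dense = shrunk_ratio(observed=100, expected=10.0)   # many reports
print(round(sparse, 2), round(dense, 2))
```

The sparse pair is pulled strongly toward the prior, while the well-supported pair stays close to its raw ratio; this is the behaviour that reduces false positives on small counts.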

**🤖 Combining NLP + Signal Detection:**
Natural Language Processing (NLP):
A branch of AI focused on enabling computers to understand and process human language.

The notebook leverages Spark NLP for Healthcare, an advanced NLP library, to:

1. Extract Drug-ADE relations from unstructured clinical notes (free-text format in medical records).

2. Apply pre-trained models specialized in recognizing medical entities and their relationships.

3. Compute PRR, ROR, and EBGM scores from the extracted relations to evaluate potential adverse drug events systematically.

**🌐 Normalization with Medical Ontologies:**
To ensure consistency and interoperability, drug and ADE mentions are normalized to standardized medical terminologies:

**RxNorm:**
A standardized nomenclature for medications, maintained by the U.S. National Library of Medicine, providing unique identifiers for drugs.

**ICD10 (International Classification of Diseases - 10th Revision):**
A globally used system for coding diseases, symptoms, and health conditions.

**MedDRA (Medical Dictionary for Regulatory Activities):**
A clinically validated international medical terminology used by regulatory authorities and the pharmaceutical industry for adverse event reporting.

**Why is normalization important?**

It allows for more accurate aggregation, comparison, and regulatory reporting by mapping various drug and ADE mentions to standardized codes.
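
The effect on aggregation can be sketched with a toy lookup table. The dictionary below is a hypothetical stand-in for a real resolver model, and the codes are placeholders rather than real RxNorm identifiers.

```python
from collections import Counter

# Hypothetical lookup table; "DRUG_001" and "DRUG_002" are placeholder codes.
HYPOTHETICAL_DRUG_CODES = {
    "lipitor": "DRUG_001",
    "atorvastatin": "DRUG_001",   # brand and generic name share one code
    "zocor": "DRUG_002",
}

mentions = ["Lipitor", "atorvastatin", "LIPITOR", "Zocor"]

# Counting raw strings treats every spelling as a different drug.
raw_counts = Counter(mentions)

# Counting normalized codes merges spellings that refer to the same drug.
normalized_counts = Counter(HYPOTHETICAL_DRUG_CODES[m.lower()] for m in mentions)

print(raw_counts)
print(normalized_counts)
```

Because the disproportionality statistics are built from these counts, merging equivalent mentions directly changes the contingency tables they are computed from.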



# JSL Setup

Let's set up the JSL dependencies and start the Spark session.

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import os
import json


import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.5.3
Spark NLP_JSL Version : 5.5.3


# Helper function to get relations between DRUGs and ADEs

This helper converts the output of the relation extraction model into a Pandas DataFrame, which is easier to analyze.

In [None]:
# get relations in a pandas dataframe
import pandas as pd

def get_relations_df(results, rel_col='relations', chunk_col='ner_chunks'):
    rel_pairs=[]
    chunks = []

    for rel in results[rel_col]:
        rel_pairs.append((
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity1'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['entity2'],
            rel.result,
            rel.metadata['confidence'],
        ))

    for chunk in results[chunk_col]:
        chunks.append((
            chunk.metadata["sentence"],
            chunk.begin,
            chunk.end,
            chunk.result,
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['entity1_begin', 'entity1_end', 'chunk1', 'entity1', 'entity2_begin', 'entity2_end', 'chunk2', 'entity2', 'relation', 'confidence'])

    chunks_df = pd.DataFrame(chunks, columns = ["sentence", "begin", "end", "chunk"])
    chunks_df.begin = chunks_df.begin.astype(str)
    chunks_df.end = chunks_df.end.astype(str)

    result_df = pd.merge(rel_df,chunks_df, left_on=["entity1_begin", "entity1_end", "chunk1"], right_on=["begin", "end", "chunk"])[["sentence"] + list(rel_df.columns)]


    return result_df
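
The string casts in the helper matter for the final merge. The toy example below assumes, as those casts suggest, that relation metadata carries character offsets as strings while chunk annotations carry them as integers. All values are made up for illustration.

```python
import pandas as pd

# Offsets as they would come from relation metadata (strings).
rel_df = pd.DataFrame({
    "entity1_begin": ["32"], "entity1_end": ["35"], "chunk1": ["5-FU"],
    "chunk2": ["severe mucositis"],
})

# Offsets as they would come from chunk annotations (integers).
chunks_df = pd.DataFrame({
    "sentence": ["0", "0"], "begin": [32, 76], "end": [35, 91],
    "chunk": ["5-FU", "severe mucositis"],
})

# Align dtypes so the join keys are comparable.
chunks_df["begin"] = chunks_df["begin"].astype(str)
chunks_df["end"] = chunks_df["end"].astype(str)

# Join each relation to the chunk that matches its first entity,
# which attaches the sentence id to the relation row.
merged = pd.merge(
    rel_df, chunks_df,
    left_on=["entity1_begin", "entity1_end", "chunk1"],
    right_on=["begin", "end", "chunk"],
)[["sentence"] + list(rel_df.columns)]
print(merged)
```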

# DRUG-ADE and relations extraction pipeline

Let's build a Spark NLP pipeline that processes clinical text and extracts Drug-ADE relations:



In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

ner_tagger = MedicalNerModel()\
    .pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

reModel = RelationExtractionModel()\
    .pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(20)\
    .setRelationPairs(["drug-ade", "ade-drug"])\
    .setRelationPairsCaseSensitive(False)\
    .setCustomLabels({"1": "is_related", "0": "not_related"})



pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    reModel
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
ade_model = pipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_ade_clinical download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
re_ade_clinical download started this may take some time.
[OK!]


# Test-Inference on clinical notes

Let's apply the pipeline to sample clinical texts covering various drug-event scenarios and collect the results:



In [None]:
text = ["""
Hypersensitivity to aspirin can be manifested as acute asthma, urticaria and/or angioedema, or a systemic anaphylactoid reaction.
A patient had undergone a renal transplantation as a result of malignant hypertension, and immunosuppressive therapy consisting of cyclosporin and prednisone ,  developed  sweating  and  thrombosis alone 5 years following the transplantation but there were not stomach pain.
A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for   rheumatoid arthritis  presented  with  tense bullae and cutaneous fragility on the face and the back of the hands.""",

"""We describe the side effects of 5-FU in a colon cancer patient who suffered severe mucositis,  prolonged myelosuppression, and neurologic toxicity that required admission to the intensive care unit who  has a healthy appetite.
The reported cases of in utero exposure to cyclosposphamide shared the following manifestations with our patient who suffered  growth deficiency, developmental delay, craniosynostosis, blepharophimosis, flat nasal bridge and abnormal ears.
I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication.
I experienced fatigue, muscle cramps, anxiety, agression and sadness after taking Lipitor but no more adverse after passing Zocor.
A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.""" ]

In [None]:
light_results = LightPipeline(ade_model).fullAnnotate(text)

## Get all results

Convert the model output into a structured table using the `get_relations_df` helper function:

In [None]:
# Build the relations table for the second sample document (index 1)
df = get_relations_df(light_results[1], 'relations')

df

Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,0,32,35,5-FU,DRUG,76,91,severe mucositis,ADE,is_related,1.0
1,0,32,35,5-FU,DRUG,95,120,prolonged myelosuppression,ADE,is_related,1.0
2,0,32,35,5-FU,DRUG,127,145,neurologic toxicity,ADE,is_related,1.0
3,1,270,285,cyclosposphamide,DRUG,354,370,growth deficiency,ADE,is_related,0.9993292
4,1,270,285,cyclosposphamide,DRUG,373,391,developmental delay,ADE,is_related,0.9999902
5,1,270,285,cyclosposphamide,DRUG,394,409,craniosynostosis,ADE,is_related,0.99999976
6,1,270,285,cyclosposphamide,DRUG,412,427,blepharophimosis,ADE,is_related,0.9987691
7,1,270,285,cyclosposphamide,DRUG,430,446,flat nasal bridge,ADE,not_related,0.98880184
8,1,270,285,cyclosposphamide,DRUG,452,464,abnormal ears,ADE,is_related,0.8681236
9,2,477,493,allergic reaction,ADE,498,507,vancomycin,DRUG,is_related,1.0


## Get only DRUG and ADE pairs required for the terms calculation

We only need the related DRUG and ADE pairs to calculate PRR, ROR, and EBGM later:

In [None]:
import pandas as pd
import numpy as np
from scipy.special import digamma


# Optional: Filter only rows where relation is "is_related"
df = df[df['relation'] == "is_related"].copy()

# Define a function to extract the drug and ADE based on entity types.
def extract_drug_ade(row):
    # Standardize the entity type text (in case of case differences)
    ent1 = row['entity1'].upper()
    ent2 = row['entity2'].upper()

    if ent1 == 'DRUG' and ent2 == 'ADE':
        return pd.Series({'drug': row['chunk1'], 'ade': row['chunk2']})
    elif ent1 == 'ADE' and ent2 == 'DRUG':
        return pd.Series({'drug': row['chunk2'], 'ade': row['chunk1']})
    else:
        # If neither condition is met, return NaN values.
        return pd.Series({'drug': np.nan, 'ade': np.nan})

# Apply the extraction function row-wise.
df_extracted = df.apply(extract_drug_ade, axis=1)

# Remove any rows where extraction failed.
df_extracted = df_extracted.dropna()
df_extracted


Unnamed: 0,drug,ade
0,5-FU,severe mucositis
1,5-FU,prolonged myelosuppression
2,5-FU,neurologic toxicity
3,cyclosposphamide,growth deficiency
4,cyclosposphamide,developmental delay
5,cyclosposphamide,craniosynostosis
6,cyclosposphamide,blepharophimosis
8,cyclosposphamide,abnormal ears
9,vancomycin,allergic reaction
10,vancomycin,itchy skin
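
For larger tables, the row-wise `apply` can be replaced with a column-wise version. The sketch below is one possible alternative, run on a made-up relations table with the same column names; it is not a change to the notebook's own function.

```python
import pandas as pd

# Toy relations table (values are made up for illustration).
toy = pd.DataFrame({
    "chunk1": ["5-FU", "allergic reaction", "naproxen"],
    "entity1": ["DRUG", "ADE", "DRUG"],
    "chunk2": ["severe mucositis", "vancomycin", "oxaprozin"],
    "entity2": ["ADE", "DRUG", "DRUG"],
})

ent1 = toy["entity1"].str.upper()
ent2 = toy["entity2"].str.upper()
drug_first = (ent1 == "DRUG") & (ent2 == "ADE")   # DRUG ... ADE ordering
ade_first = (ent1 == "ADE") & (ent2 == "DRUG")    # ADE ... DRUG ordering

# Series.where keeps a value where the mask is True and falls back to `other`;
# rows matching neither ordering end up as NaN and are dropped.
drug = toy["chunk1"].where(drug_first, toy["chunk2"].where(ade_first))
ade = toy["chunk2"].where(drug_first, toy["chunk1"].where(ade_first))

pairs = pd.DataFrame({"drug": drug, "ade": ade}).dropna()
print(pairs)
```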


# PRR-ROR-EBGM terms calculation

Here we calculate the statistical measures for signal detection: PRR (Proportional Reporting Ratio), ROR (Reporting Odds Ratio), and EBGM (Empirical Bayes Geometric Mean).

Key Steps:

1. Build a contingency table (counts a, b, c, d) for each Drug-ADE pair.

2. Compute PRR and ROR from those counts.

3. Apply Bayesian smoothing (via the digamma function) to compute EBGM.
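
The first step can be sketched on a handful of made-up reports. The drug and ADE names below are placeholders, and `contingency` is a hypothetical helper that mirrors how the four counts are derived from pair, drug, and ADE totals.

```python
from collections import Counter

# Made-up (drug, ade) reports, one tuple per extracted relation.
reports = [
    ("drug_x", "nausea"), ("drug_x", "nausea"), ("drug_x", "rash"),
    ("drug_y", "nausea"), ("drug_y", "headache"), ("drug_z", "rash"),
]

pair_counts = Counter(reports)
drug_counts = Counter(drug for drug, _ in reports)
ade_counts = Counter(ade for _, ade in reports)
total = len(reports)


def contingency(drug: str, ade: str):
    a = pair_counts[(drug, ade)]   # this drug with this ADE
    b = drug_counts[drug] - a      # this drug with other ADEs
    c = ade_counts[ade] - a        # other drugs with this ADE
    d = total - (a + b + c)        # other drugs with other ADEs
    return a, b, c, d


print(contingency("drug_x", "nausea"))  # (2, 1, 1, 2)
```

The four counts always sum to the total number of reports, which is a quick sanity check on any contingency table built this way.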

In [None]:
import pandas as pd
import numpy as np
from scipy.special import digamma

def compute_drug_ade_statistics(
    df_extracted,
    drug_col='drug',
    ade_col='ade',
    alpha_prior=1,
    beta_prior=1
):
    """
    Computes PRR, ROR, and EBGM statistics for drug-ADE pairs from an extracted DataFrame.

    Parameters:
    - df_extracted (pd.DataFrame): DataFrame containing drug and ADE columns.
    - drug_col (str, optional): Column name for drugs. Default is 'drug'.
    - ade_col (str, optional): Column name for ADEs. Default is 'ade'.
    - alpha_prior (int, optional): Prior alpha parameter for EBGM calculation. Default is 1.
    - beta_prior (int, optional): Prior beta parameter for EBGM calculation. Default is 1.

    Returns:
    - pd.DataFrame: DataFrame with columns [drug_col, ade_col, 'a', 'b', 'c', 'd', 'PRR', 'ROR', 'EBGM'].
    """

    # Aggregate counts for each drug-ADE pair (a)
    pair_counts = df_extracted.groupby([drug_col, ade_col]).size().reset_index(name='a')

    # Calculate overall counts
    drug_counts = df_extracted.groupby(drug_col).size().reset_index(name='drug_count')
    ade_counts = df_extracted.groupby(ade_col).size().reset_index(name='ade_count')

    # Total number of reports
    total_reports = len(df_extracted)

    # Merge counts
    df_stats = pair_counts.merge(drug_counts, on=drug_col).merge(ade_counts, on=ade_col)

    # Build contingency table
    df_stats['b'] = df_stats['drug_count'] - df_stats['a']
    df_stats['c'] = df_stats['ade_count'] - df_stats['a']
    df_stats['d'] = total_reports - (df_stats['a'] + df_stats['b'] + df_stats['c'])

    # Calculate PRR and ROR
    df_stats['PRR'] = (df_stats['a'] / (df_stats['a'] + df_stats['b'])) / (df_stats['c'] / (df_stats['c'] + df_stats['d']))
    df_stats['ROR'] = (df_stats['a'] / df_stats['b']) / (df_stats['c'] / df_stats['d'])

    # Calculate EBGM
    def compute_ebgm(a, b, c, d):
        alpha_post = alpha_prior + a
        beta_post = beta_prior + (b + c + d)
        return np.exp(digamma(alpha_post) - np.log(beta_post))

    df_stats['EBGM'] = df_stats.apply(lambda row: compute_ebgm(row['a'], row['b'], row['c'], row['d']), axis=1)

    # Return selected columns
    return df_stats[[drug_col, ade_col, 'a', 'b', 'c', 'd', 'PRR', 'ROR', 'EBGM']]



result_df = compute_drug_ade_statistics(df_extracted, drug_col='drug', ade_col='ade')
result_df

Unnamed: 0,drug,ade,a,b,c,d,PRR,ROR,EBGM
0,5-FU,neurologic toxicity,1,2,0,17,inf,inf,0.07631
1,5-FU,prolonged myelosuppression,1,2,0,17,inf,inf,0.07631
2,5-FU,severe mucositis,1,2,0,17,inf,inf,0.07631
3,Lipitor,agression,1,3,0,16,inf,inf,0.07631
4,Lipitor,fatigue,1,3,0,16,inf,inf,0.07631
5,Lipitor,muscle cramps,1,3,0,16,inf,inf,0.07631
6,Lipitor,sadness,1,3,0,16,inf,inf,0.07631
7,cyclosposphamide,abnormal ears,1,4,0,15,inf,inf,0.07631
8,cyclosposphamide,blepharophimosis,1,4,0,15,inf,inf,0.07631
9,cyclosposphamide,craniosynostosis,1,4,0,15,inf,inf,0.07631
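
PRR and ROR are `inf` in the table above because `c` is 0 for every pair: in this small sample each ADE is reported with only one drug. One common workaround, shown here as an optional sketch rather than as part of the notebook's method, is to add 0.5 to every cell before computing the ratios. The function name `prr_ror_corrected` is an assumption.

```python
def prr_ror_corrected(a, b, c, d, correction=0.5):
    """PRR and ROR after adding a constant to every cell of the 2x2 table."""
    a, b, c, d = (x + correction for x in (a, b, c, d))
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)
    return prr, ror


# Counts from the first row of the table above: a=1, b=2, c=0, d=17.
prr, ror = prr_ror_corrected(1, 2, 0, 17)
print(round(prr, 2), round(ror, 2))  # 13.5 21.0
```

The corrected values are finite, but with so few reports they remain unstable and should not be read as evidence of a signal.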


# DRUG and ADE normalization: example of using RxNorm, ICD10 and MedDRA codes

Normalization enhances consistency and interoperability by mapping entities to standardized codes.
In the following part we use RxNorm for DRUG normalization, and ICD10 and MedDRA for ADE normalization, then calculate the same statistics on the normalized codes:

## RxNorm pipeline for DRUG normalization

In [None]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline

document_assembler_rx = DocumentAssembler()\
    .setInputCol("drug")\
    .setOutputCol("document")


sbert_embedder_rx = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)


rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_v2", "en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

# Assemble the RxNorm pipeline
rxnorm_pipeline = Pipeline(stages=[
    document_assembler_rx,
    sbert_embedder_rx,
    rxnorm_resolver
])


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented_v2 download started this may take some time.
[OK!]


## ICD10 pipeline for ADE normalization

In [None]:
document_assembler_ade = DocumentAssembler()\
    .setInputCol("ade")\
    .setOutputCol("document")


sbert_embedder_med = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)


icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")


icd_pipeline = Pipeline(stages=[
    document_assembler_ade,
    sbert_embedder_med,
    icd_resolver
])


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
[OK!]


## MedDRA pipeline for ADE normalization

In [None]:
sbert_embedder_med = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

# Load the MedDRA resolver from a local path (the model must already be saved there)
meddra_resolver = SentenceEntityResolverModel.load("sbiobertresolve_meddra_preferred_term") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("meddra_pt_code")\
    .setDistanceFunction("EUCLIDEAN")

medra_pipeline = Pipeline(stages=[
    document_assembler_ade,
    sbert_embedder_med,
    meddra_resolver
])


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


# PRR-ROR-EBGM terms calculation normalized

Calculating PRR, ROR, and EBGM based on RxNorm and ICD10 normalization:

## RxNorm (Drug) <> ICD10 (ADE)

In [None]:
# Convert the pandas DataFrame to a Spark DataFrame and add a unique identifier
spark_df = spark.createDataFrame(df_extracted)
spark_df = spark_df.withColumn("uid", F.monotonically_increasing_id())

# -------------------------------
# Apply the RxNorm Pipeline to the 'drug' column
# -------------------------------
drug_df = spark_df.select("uid", "drug")
rxnorm_model = rxnorm_pipeline.fit(drug_df)
rxnorm_result = rxnorm_model.transform(drug_df)
# Extract the first (and only) code from the array and rename the column
rxnorm_df = rxnorm_result.select("uid", F.col("rxnorm_code.result").getItem(0).alias("drug_rxnorm_code"))

# -------------------------------
# Apply the ICD10 Pipeline to the 'ade' column
# -------------------------------
ade_df = spark_df.select("uid", "ade")
icd_model = icd_pipeline.fit(ade_df)
icd_result = icd_model.transform(ade_df)
# Extract the first (and only) code from the array and rename the column
icd_df = icd_result.select("uid", F.col("icd10cm_code.result").getItem(0).alias("ade_icd10_code"))

# -------------------------------
# Combine the results with the original DataFrame, remove uid, and show them
# -------------------------------
final_df = spark_df.join(rxnorm_df, on="uid") \
                   .join(icd_df, on="uid") \
                   .drop("uid").toPandas()


result_df = compute_drug_ade_statistics(final_df, drug_col='drug_rxnorm_code', ade_col='ade_icd10_code')
result_df

Unnamed: 0,drug_rxnorm_code,ade_icd10_code,a,b,c,d,PRR,ROR,EBGM
0,11124,L28.2,1,3,0,16,inf,inf,0.07631
1,11124,R20.0,1,3,0,16,inf,inf,0.07631
2,11124,R20.8,1,3,0,16,inf,inf,0.07631
3,11124,T78.40,1,3,0,16,inf,inf,0.07631
4,153165,E31.0,1,3,0,16,inf,inf,0.07631
5,153165,R45.2,1,3,0,16,inf,inf,0.07631
6,153165,R53,1,3,0,16,inf,inf,0.07631
7,153165,T75.1,1,3,0,16,inf,inf,0.07631
8,215018,D75.89,1,2,0,17,inf,inf,0.07631
9,215018,R60.0,1,2,0,17,inf,inf,0.07631


Calculating PRR, ROR, and EBGM based on RxNorm and MedDRA normalization:

## RxNorm (Drug) <> MedDRA (ADE)

In [None]:
# Convert the pandas DataFrame to a Spark DataFrame and add a unique identifier
spark_df = spark.createDataFrame(df_extracted)
spark_df = spark_df.withColumn("uid", F.monotonically_increasing_id())

# -------------------------------
# Apply the RxNorm Pipeline to the 'drug' column
# -------------------------------
drug_df = spark_df.select("uid", "drug")
rxnorm_model = rxnorm_pipeline.fit(drug_df)
rxnorm_result = rxnorm_model.transform(drug_df)
# Extract the first (and only) code from the array and rename the column
rxnorm_df = rxnorm_result.select("uid", F.col("rxnorm_code.result").getItem(0).alias("drug_rxnorm_code"))

# -------------------------------
# Apply the MedDRA Pipeline to the 'ade' column
# -------------------------------
ade_df = spark_df.select("uid", "ade")
medra_model = medra_pipeline.fit(ade_df)
medra_result = medra_model.transform(ade_df)
# Extract the first (and only) code from the array and rename the column
medra_df = medra_result.select("uid", F.col("meddra_pt_code.result").getItem(0).alias("ade_meddra_pt_code"))

# -------------------------------
# Combine the results with the original DataFrame, remove uid, and show them
# -------------------------------
final_df = spark_df.join(rxnorm_df, on="uid") \
                   .join(medra_df, on="uid") \
                   .drop("uid").toPandas()


result_df = compute_drug_ade_statistics(final_df, drug_col='drug_rxnorm_code', ade_col='ade_meddra_pt_code')
result_df



Unnamed: 0,drug_rxnorm_code,ade_meddra_pt_code,a,b,c,d,PRR,ROR,EBGM
0,11124,10002198,1,3,0,16,inf,inf,0.07631
1,11124,10051788,1,3,0,16,inf,inf,0.07631
2,11124,10054786,1,3,0,16,inf,inf,0.07631
3,11124,10077855,1,3,0,16,inf,inf,0.07631
4,153165,10016256,1,3,0,16,inf,inf,0.07631
5,153165,10028334,1,3,0,16,inf,inf,0.07631
6,153165,10031071,1,3,0,16,inf,inf,0.07631
7,153165,10039367,1,3,0,16,inf,inf,0.07631
8,215018,10028584,1,2,0,17,inf,inf,0.07631
9,215018,10030111,1,2,0,17,inf,inf,0.07631
