<a href="https://colab.research.google.com/github/FritscheLab/EPID731/blob/main/Day2/EPID731_BonusWhiteRabbit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Before you start: Create a copy of this notebook
The original notebook is read-only, so please follow these steps to get a copy you can modify and interact with:

1. Go to the File menu in the top left corner and select "Save a copy in Drive". (If you can't see the File menu, click the ^ button in the top right corner.)
2. Close the tab with the original file `EPID731_BonusWhiteRabbit.ipynb`.
3. Open and follow the exercise in your new copy, named `Copy of EPID731_BonusWhiteRabbit.ipynb`, which is now saved in your Google Drive.

# Exercise Objective:

In this exercise, you will use the WhiteRabbit tool to perform data profiling on the MIMIC-IV demo OMOP dataset.
This will help you understand the dataset's structure and prepare it for further analysis tasks.




## Introduction to WhiteRabbit:

WhiteRabbit is a data profiling tool developed by OHDSI to help prepare ETL (Extract, Transform, Load) processes. It helps you explore
and understand healthcare databases, identify data issues, and create ETL specifications for mapping to the OMOP Common Data Model.
Key features include database scanning, scan report generation, and support for writing ETL specifications.

Reference: [WhiteRabbit GitHub](https://github.com/OHDSI/WhiteRabbit/releases/tag/v1.0.0-RC2)


## MIMIC-IV Demo OMOP Dataset

This dataset contains de-identified health records from the MIMIC-IV database, converted into the OMOP Common Data Model format. It is designed for demonstrating the use of the OMOP CDM with real-world healthcare data. The dataset includes various medical records suitable for research and educational purposes.

Reference: [PhysioNet](https://physionet.org/content/mimic-iv-demo-omop/0.9/)



## Step 1. Downloading and Setting Up WhiteRabbit:

First, download and unzip the WhiteRabbit version 1.0.0 release.  
Here we use the macOS/Linux launcher `./WhiteRabbit_v1.0.0/bin/whiteRabbit`.  
(On Windows, the equivalent command is `./WhiteRabbit_v1.0.0/bin/whiteRabbit.bat`.)

In [None]:
# download.file() ships with base R, so no additional package is needed here
download.file("https://github.com/OHDSI/WhiteRabbit/releases/download/v1.0.0/WhiteRabbit_v1.0.0.zip", destfile = "WhiteRabbit.zip")
# Unzip with the system's unzip command
system("unzip WhiteRabbit.zip")

if (file.exists("./WhiteRabbit_v1.0.0")) {
  cat("WhiteRabbit downloaded and unzipped successfully!\n")
} else {
  cat("Failed to download or unzip WhiteRabbit. Please check the URL and file paths.\n")
}
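
Before moving on, it can help to confirm that the launcher script is where we expect it and that Java is available, since WhiteRabbit is a Java-based tool. The sketch below is one way to do that; `whiterabbit_launcher` and `has_command` are hypothetical helper names introduced only for this illustration.

```r
# Hypothetical helper: choose the launcher script for the current OS.
# `os_type` defaults to R's own platform flag ("unix" or "windows").
whiterabbit_launcher <- function(base_dir = "./WhiteRabbit_v1.0.0",
                                 os_type = .Platform$OS.type) {
  script <- if (os_type == "windows") "whiteRabbit.bat" else "whiteRabbit"
  file.path(base_dir, "bin", script)
}

# Hypothetical helper: TRUE if `cmd` can be found on the system PATH.
has_command <- function(cmd) {
  unname(nzchar(Sys.which(cmd)))
}

launcher <- whiterabbit_launcher()
cat("Launcher path:", launcher, "\n")
cat("Launcher exists:", file.exists(launcher), "\n")
cat("Java found on PATH:", has_command("java"), "\n")
```

If either check prints `FALSE`, the scan in Step 4 will most likely fail, so it is worth resolving the problem first.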

## Step 2. Downloading the MIMIC-IV Demo OMOP Dataset:

We will download the demo dataset (10.8 MB) and unzip the file using the following commands:

In [None]:
download.file("https://physionet.org/static/published-projects/mimic-iv-demo-omop/mimic-iv-demo-data-in-the-omop-common-data-model-0.9.zip", destfile = "MIMIC-IV-Demo.zip")
unzip("MIMIC-IV-Demo.zip")

if (file.exists("./mimic-iv-demo-data-in-the-omop-common-data-model-0.9")) {
  cat("Dataset downloaded and unzipped successfully!\n")
} else {
  cat("Failed to download or unzip the dataset. Please check the URL and file paths.\n")
}
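
To see which tables the download actually contains, you could list the CSV files in the extracted folder. This is a minimal sketch; `list_csv_tables` is a hypothetical helper, and the `1_omop_data_csv` subfolder is the one used as the working folder in Step 3.

```r
# Hypothetical helper: summarise the CSV files found in a folder.
list_csv_tables <- function(folder) {
  paths <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  data.frame(
    table = basename(paths),
    size_kb = round(file.size(paths) / 1024, 1),
    stringsAsFactors = FALSE
  )
}

csv_folder <- "./mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv"
print(list_csv_tables(csv_folder))
```

The file names shown here are the values you can use for `TABLES_TO_SCAN` in the INI file.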

## Step 3. Writing an INI File for WhiteRabbit

To use the WhiteRabbit command line tools, we need an INI file. Below is the original template for reference:

```ini
# Usage: dist/bin/whiteRabbit -ini <ini_file_path>
WORKING_FOLDER = /users/joe                   # Path to the folder where all output will be written
DATA_TYPE = PostgreSQL                        # "Delimited text files", "MySQL", "Oracle", "SQL Server", "PostgreSQL", "MS Access", "Redshift", "BigQuery", "Azure", "Teradata", "SAS7bdat"
SERVER_LOCATION = 127.0.0.1/data_base_name    # Name or address of the server. For Postgres, add the database name
USER_NAME = joe                               # User name for the database
PASSWORD = supersecret                        # Password for the database
DATABASE_NAME = schema_name                   # Name of the data schema used
DELIMITER = ,                                 # The delimiter that separates values "," or "tab"
TABLES_TO_SCAN = *                            # Comma-delimited list of table names to scan. Use "*" (asterix) to include all tables in the database
SCAN_FIELD_VALUES = yes                       # Include the frequency of field values in the scan report? "yes" or "no"
MIN_CELL_COUNT = 5                            # Minimum frequency for a field value to be included in the report
MAX_DISTINCT_VALUES = 1000                    # Maximum number of distinct values per field to be reported
ROWS_PER_TABLE = 100000                       # Maximum number of rows per table to be scanned for field values
CALCULATE_NUMERIC_STATS = no                  # Include average, standard deviation and quartiles in the scan report? "yes" or "no"
NUMERIC_STATS_SAMPLER_SIZE = 500              # Maximum number of rows used to calculate numeric statistics
```

For our scan, we need the following options:

- `WORKING_FOLDER`: Path where output will be written.
- `DATA_TYPE`: Set to "Delimited text files".
- `DELIMITER`: The delimiter separating values, e.g., a comma.
- `TABLES_TO_SCAN`: Set to "person.csv,measurement.csv" to scan only the person and measurement tables (use "*" to scan all).
- `SCAN_FIELD_VALUES`: Set to "yes" to include the frequency of field values.
- `MIN_CELL_COUNT`: Set the minimum frequency for a field value to be included in the report.
- `MAX_DISTINCT_VALUES`: Set the maximum number of distinct values per field.
- `ROWS_PER_TABLE`: Set the maximum number of rows per table to scan.
- `CALCULATE_NUMERIC_STATS`: Set to "yes" to include numeric statistics.
- `NUMERIC_STATS_SAMPLER_SIZE`: Set the number of rows used for calculating numeric statistics.

We do not need to specify `SERVER_LOCATION`, `USER_NAME`, `PASSWORD`, or `DATABASE_NAME` since we are using delimited text files.
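
To build intuition for `MIN_CELL_COUNT` and `MAX_DISTINCT_VALUES`, the sketch below counts the values of a single column and applies both limits. It is purely illustrative and is not WhiteRabbit's actual implementation: `value_frequencies` is a hypothetical function, the order in which the two limits are applied is an assumption, and the example column is made up.

```r
# Illustrative only: NOT WhiteRabbit's actual implementation.
# Counts the values of one column, keeps at most `max_distinct_values`
# of the most frequent ones, and drops values that occur fewer than
# `min_cell_count` times.
value_frequencies <- function(x, min_cell_count = 5, max_distinct_values = 1000) {
  counts <- table(x)  # NA values are dropped by default
  result <- data.frame(
    value = names(counts),
    frequency = as.integer(counts),
    stringsAsFactors = FALSE
  )
  result <- result[order(-result$frequency), , drop = FALSE]
  result <- head(result, max_distinct_values)
  result <- result[result$frequency >= min_cell_count, , drop = FALSE]
  rownames(result) <- NULL
  result
}

# Made-up example column, not taken from the MIMIC-IV demo data
gender <- c(rep("F", 6), rep("M", 5), rep("U", 2))
print(value_frequencies(gender, min_cell_count = 5))
```

In this toy example the value `"U"` occurs only twice, so it falls below the minimum cell count and is left out of the result.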

In [None]:
ini_content <- "
WORKING_FOLDER = /content/mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv
DATA_TYPE = Delimited text files
DELIMITER = ,
TABLES_TO_SCAN = person.csv,measurement.csv
SCAN_FIELD_VALUES = yes
MIN_CELL_COUNT = 5
MAX_DISTINCT_VALUES = 1000
ROWS_PER_TABLE = 100000
CALCULATE_NUMERIC_STATS = yes
NUMERIC_STATS_SAMPLER_SIZE = 500
"
writeLines(ini_content, "/content/mimic-iv-demo.ini")

cat("INI file created successfully. Path: '/content/mimic-iv-demo.ini'\n")
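
If you expect to change the settings often, the same INI content could be generated from a named list instead of a hand-written string. The following is a sketch under that assumption; `build_ini` is a hypothetical helper, and the file is written to a temporary folder so that the INI file created above is not overwritten.

```r
# Hypothetical helper: turn a named list into "KEY = value" lines.
# format(..., scientific = FALSE) keeps 100000 from becoming "1e+05".
build_ini <- function(settings) {
  values <- vapply(
    settings,
    function(v) format(v, scientific = FALSE, trim = TRUE),
    character(1)
  )
  paste0(names(settings), " = ", values)
}

settings <- list(
  WORKING_FOLDER = "/content/mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv",
  DATA_TYPE = "Delimited text files",
  DELIMITER = ",",
  TABLES_TO_SCAN = "person.csv,measurement.csv",
  SCAN_FIELD_VALUES = "yes",
  MIN_CELL_COUNT = 5,
  MAX_DISTINCT_VALUES = 1000,
  ROWS_PER_TABLE = 100000,
  CALCULATE_NUMERIC_STATS = "yes",
  NUMERIC_STATS_SAMPLER_SIZE = 500
)

# Write to a temporary location and print the result for inspection
ini_path <- file.path(tempdir(), "mimic-iv-demo.ini")
writeLines(build_ini(settings), ini_path)
cat(readLines(ini_path), sep = "\n")
```

Keeping the settings in a list makes it easy to change a single option, such as `TABLES_TO_SCAN`, and regenerate the file.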


## Step 4. Running WhiteRabbit to Generate Scan Report

Now, use the WhiteRabbit tool to scan the dataset as specified in your INI file and generate a report.


In [None]:
system("WhiteRabbit_v1.0.0/bin/whiteRabbit -ini ./mimic-iv-demo.ini", intern = TRUE)
moved <- file.rename("./mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv/ScanReport.xlsx", "./mimic-iv-demo_ScanReport.xlsx")

if (moved) {
  cat("Scan report generated and moved successfully. Available at './mimic-iv-demo_ScanReport.xlsx'\n")
} else {
  cat("Scan report not found. Please check the INI file and the WhiteRabbit output.\n")
}

**Next Steps:**

The scan report is now available in your current directory as `./mimic-iv-demo_ScanReport.xlsx`. Download this file to your local machine to explore it further.
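
If you would rather take a quick look at the report without leaving the notebook, a sketch like the following could work. It assumes the `openxlsx` package is installed (in this notebook it is only installed in Step 5, so you may need `install.packages("openxlsx")` first). `preview_scan_report` is a hypothetical helper name, and the sheet names are read from the file rather than assumed.

```r
# Hypothetical helper: print the sheet names and the first rows of the
# first sheet of a scan report. Assumes the openxlsx package is installed.
preview_scan_report <- function(report_path, n = 6) {
  if (!file.exists(report_path)) {
    cat("Scan report not found at", report_path, "\n")
    return(invisible(NULL))
  }
  sheets <- openxlsx::getSheetNames(report_path)
  cat("Sheets:", paste(sheets, collapse = ", "), "\n")
  print(head(openxlsx::read.xlsx(report_path, sheet = sheets[1]), n))
  invisible(sheets)
}

preview_scan_report("./mimic-iv-demo_ScanReport.xlsx")
```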

**Access Online Version:**

If you prefer to view a precompiled scan report online, you can explore this [Google Sheets version of mimic-iv-demo_ScanReport.xlsx](https://docs.google.com/spreadsheets/d/1BD7RVmHaWFJXKL52JFbfCYCF902_X43CPmDpxRJnaAo/edit?usp=sharing).

## Step 5. Alternative using `whiteRRabbit` (R-based tool)

As an alternative to the Java-based WhiteRabbit, you can use `whiteRRabbit`, an R-based tool for data profiling. It is optimized for performance with `data.table` and offers additional features like date shifting and column exclusion.

For more details, you can visit the [whiteRRabbit GitHub repository](https://github.com/FritscheLab/whiteRRabbit.git).

First, let's clone the repository and install the required R packages.

In [None]:
system("git clone https://github.com/FritscheLab/whiteRRabbit.git")
install.packages(c("data.table", "optparse", "openxlsx", "lubridate"))
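
Package installation can fail quietly in a notebook, so it may be worth confirming that the four packages can be loaded before running the script. This is a small sketch; `missing_packages` is a hypothetical helper name.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("data.table", "optparse", "openxlsx", "lubridate")
still_missing <- missing_packages(needed)

if (length(still_missing) == 0) {
  cat("All required packages are available.\n")
} else {
  cat("Missing packages:", paste(still_missing, collapse = ", "), "\n")
}
```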

Now, we will run the `whiteRRabbit.R` script with parameters equivalent to the INI file configuration used for the Java-based WhiteRabbit. This command will process the `person.csv` and `measurement.csv` files.

In [None]:
# Create a subdirectory and copy the relevant CSV files there for whiteRRabbit to scan
dir.create("whiterrabbit_input")
file.copy(
  from=c("./mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv/person.csv", "./mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv/measurement.csv"),
  to="whiterrabbit_input"
)

# Run whiteRRabbit
system(paste(
  "Rscript ./whiteRRabbit/whiteRRabbit.R",
  "--working_folder ./whiterrabbit_input",
  "--delimiter ','",
  "--output_dir ./",
  "--output_format xlsx",
  "--prefix whiteRRabbit_ScanReport",
  "--maxRows 100000",
  "--maxDistinctValues 1000",
  "--scan_field_values",
  "--min_cell_count 5"
))

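if (file.exists("./whiteRRabbit_ScanReport.xlsx")) {
  cat("whiteRRabbit scan report generated successfully. Available at './whiteRRabbit_ScanReport.xlsx'\n")
} else {
  cat("whiteRRabbit scan report not found. Please check the command and file paths.\n")
}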
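
The command above is assembled by pasting strings together, which makes quoting (for example, of the `,` delimiter) easy to get wrong. As an alternative sketch, the arguments could be built from a named list and quoted with `shQuote()`. `build_args` is a hypothetical helper; the flag names are simply the ones used in the command above.

```r
# Hypothetical helper: build a command-line argument vector from a named list.
# A value of TRUE produces a bare flag such as --scan_field_values;
# any other value is formatted and shell-quoted.
build_args <- function(options) {
  args <- character(0)
  for (name in names(options)) {
    value <- options[[name]]
    flag <- paste0("--", name)
    if (isTRUE(value)) {
      args <- c(args, flag)
    } else {
      args <- c(args, flag, shQuote(format(value, scientific = FALSE, trim = TRUE)))
    }
  }
  args
}

args <- build_args(list(
  working_folder = "./whiterrabbit_input",
  delimiter = ",",
  output_dir = "./",
  output_format = "xlsx",
  prefix = "whiteRRabbit_ScanReport",
  maxRows = 100000,
  maxDistinctValues = 1000,
  scan_field_values = TRUE,
  min_cell_count = 5
))

# Print the assembled command for inspection before running it
cat("Rscript ./whiteRRabbit/whiteRRabbit.R", args, "\n")

# To run it, uncomment the next line:
# system2("Rscript", c("./whiteRRabbit/whiteRRabbit.R", args))
```

Printing the command first lets you check the options before starting a scan.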