# Notebook #3: Exploratory Data Analyses, Statistical Testing, and Time Series Analysis
The `rhino_health` Python library supports a wide range of statistical and epidemiological analyses over federated datasets, a capability that is crucial in healthcare and medical research. In this notebook, we demonstrate how to run analyses over multiple distributed datasets. Specifically, we'll analyze data related to pneumonia outcomes, performing exploratory data analysis to support our machine learning project as well as hypothesis testing for a traditional biostatistics analysis.

#### Import Rhino's Metrics Module

The `rhino_health.lib.metrics` module in the Rhino Health library is a comprehensive suite designed for diverse statistical and epidemiological analyses of healthcare data. This module is divided into several submodules, each targeting specific types of metrics and analyses.

In the following code blocks, we'll import some basic metrics such as `Mean` and `Count` and authenticate to the Rhino cloud using your own credentials.

### Install the Rhino Health Python SDK, Load All Necessary Libraries and Login to the Rhino FCP

In [None]:
import getpass
from pprint import pprint
import rhino_health as rh

from rhino_health.lib.metrics.basic import Sum, Count, Mean, StandardDeviation
from rhino_health.lib.metrics.aggregate_metrics.aggregation_service import get_cloud_aggregated_metric_data

In [None]:
my_username = "FCP_LOGIN_EMAIL" # Replace this with the email you use to log into Rhino Health
session = rh.login(username=my_username, password=getpass.getpass())

### Load the Cohorts
We'll use our SDK to identify the relevant cohorts that we'd like to perform exploratory analyses on. It is **critical to understand that the cohorts must have the same data schema in order to generate statistics on multiple cohorts simultaneously.**
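
One way to picture the same-schema requirement is as a comparison of column names and types before any metric is requested. The sketch below is plain Python and is not part of the `rhino_health` SDK; the `schema_differences` and `schemas_match` helpers and the example column names and types are hypothetical.

```python
# Hypothetical helpers: not part of the rhino_health SDK.
def schema_differences(reference: dict, other: dict) -> dict:
    """Return columns that are missing, unexpected, or typed differently in `other`."""
    missing = sorted(set(reference) - set(other))
    unexpected = sorted(set(other) - set(reference))
    mismatched = sorted(
        col for col in set(reference) & set(other) if reference[col] != other[col]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


def schemas_match(reference: dict, other: dict) -> bool:
    """True when both schemas have identical column names and types."""
    return not any(schema_differences(reference, other).values())


# Illustrative schemas only; the real column names come from your project.
site_a_schema = {"Pneumonia": "int", "Age": "float", "Gender": "str"}
site_b_schema = {"Pneumonia": "int", "Age": "float", "Gender": "str"}
site_c_schema = {"Pneumonia": "int", "Age": "str"}

print(schemas_match(site_a_schema, site_b_schema))
print(schema_differences(site_a_schema, site_c_schema))
```

A check of this kind would flag a cohort such as the made-up `site_c_schema`, which lacks one column and stores another with a different type, before it is combined with the others.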

<Include screenshot of the data schema>

In [None]:
# Replace with your project and cohort names (raw data and harmonized data)
project = session.project.get_project_by_name("YOUR_PROJECT_NAME")

cxr_cohorts = [
    project.get_cohort_by_name("mimic_cxr_dev"), # Replace Me
    project.get_cohort_by_name("mimic_cxr_hco"), # Replace Me
] 

### An Introduction to Federated Metrics

The Rhino Federated Computing Platform allows you to quickly and securely calculate metrics using federated computing across multiple sites. Each metric on the Rhino platform has two components:

#### The Metric Object

The metric configuration object is a crucial component, serving as a blueprint for metric retrieval. It allows you to specify the metric variables, grouping preferences, and data filters. For example, let's define two metrics:

1. Count of total cases across both cohorts
2. Count of positive pneumonia cases across both cohorts

#### The Response Object

When retrieving a metric, *all results are returned in a MetricResponse object*. The MetricResponse object is a Python class that includes the specific outcome values in the 'output' attribute, such as statistical measures, and details about the metric configuration ('metric_configuration_dict').

The metric results are always stored in the `output` attribute, under the metric name key (for example, "chi_square" for a chi-square test). The individual values are then stored under their value names (for example, "p_value"). The metric configuration used to generate the output can be found in the "metric_configuration_dict" attribute.
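
The lookup path described here can be sketched with a plain dictionary standing in for the `output` attribute. This is not a real `MetricResponse`; the nested layout, the `get_metric_value` helper, and the numeric value are assumptions made for illustration.

```python
# Stand-in for `response.output`; the nested layout and the value are
# illustrative assumptions, not results returned by the platform.
example_output = {"chi_square": {"p_value": 0.03}}


def get_metric_value(output: dict, metric_name: str, value_name: str):
    """Look up output[metric_name][value_name], raising a clear error if absent."""
    try:
        return output[metric_name][value_name]
    except (KeyError, TypeError) as exc:
        raise KeyError(
            f"'{value_name}' not found under metric '{metric_name}'"
        ) from exc


p_value = get_metric_value(example_output, "chi_square", "p_value")
print(f"p-value: {p_value}")
```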


### Exploratory Data Analysis

In the rest of this notebook, we'll analyze the federated data that we've prepared in the preceding notebooks. The data will be aggregated across two sites.
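
To illustrate what aggregating across sites means, the sketch below combines per-site summary statistics into a global count and mean without pooling any row-level data. It is a conceptual sketch in plain Python, not a description of how the Rhino platform is implemented; the `SiteSummary` class, the helper functions, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Summary statistics a site could share instead of row-level data."""
    count: int
    total: float  # sum of the variable over the site's rows


def summarize(values: list) -> SiteSummary:
    """Runs locally at a site: reduce raw values to a count and a sum."""
    return SiteSummary(count=len(values), total=float(sum(values)))


def aggregate(summaries: list) -> dict:
    """Runs centrally: combine the site summaries into global statistics."""
    count = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    mean = total / count if count else float("nan")
    return {"count": count, "mean": mean}


# Made-up ages at two sites, for illustration only.
site_1 = summarize([60, 70, 80])
site_2 = summarize([50])
print(aggregate([site_1, site_2]))
```

Note that the global mean is weighted by each site's row count, so it differs from a simple average of the two site means.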

### Defining a simple metric without a filter:

We'll start with the simplest possible metric, a count of the number of rows across both of our cohorts:

In [None]:
# Count the number of entries in the dataset
pneumonia_count_response = session.project.aggregate_cohort_metric(
    cohort_uids=[str(cohort.uid) for cohort in cxr_cohorts],  # UIDs of the relevant cohorts
    metric_configuration=Count(variable="Pneumonia"),  # Metric configuration
)

pneumonia_count = pneumonia_count_response.output
print(f"Entries in Dataset: {pneumonia_count}")

### Adding a filter to our metric
The `data_filters` parameter enables you to refine your analysis by restricting the output to rows that meet certain conditions. We'll now filter our `Count()` metric by a value; in this case, pneumonia cases are identified by a `Pneumonia` value of 1, so we'll specify `Pneumonia` as the `filter_column` and `1` as the `filter_value`. In the cell below, the filter is defined inside the `variable` dictionary.

In [None]:
# Count the number of entries with pneumonia (Pneumonia == 1)
pneumonia_count_configuration = Count(
    variable={
        "data_column": "Pneumonia",
        "filter_column": "Pneumonia",
        "filter_value": 1,
    }
)

pneumonia_count_response = session.project.aggregate_cohort_metric(
    cohort_uids=[str(cohort.uid) for cohort in cxr_cohorts],
    metric_configuration=pneumonia_count_configuration,
)

pneumonia_count = pneumonia_count_response.output

print(f"Pneumonia Cases in Dataset: {pneumonia_count}")
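
The effect of `filter_column` and `filter_value` can be pictured on a small local table. The sketch below is plain Python with made-up rows and a hypothetical `count_rows` helper; it assumes that a count covers rows with a non-missing value in the data column, which may differ from how the platform treats missing values.

```python
def count_rows(rows, data_column, filter_column=None, filter_value=None):
    """Count rows with a non-missing `data_column`, optionally keeping only
    rows where `filter_column` equals `filter_value`."""
    count = 0
    for row in rows:
        # Skip rows that do not satisfy the filter, if one was given.
        if filter_column is not None and row.get(filter_column) != filter_value:
            continue
        if row.get(data_column) is not None:
            count += 1
    return count


# Made-up rows for illustration; 1 marks a pneumonia case.
rows = [
    {"Pneumonia": 1},
    {"Pneumonia": 0},
    {"Pneumonia": 1},
    {"Pneumonia": 0},
    {"Pneumonia": 0},
]

total = count_rows(rows, "Pneumonia")
positive = count_rows(rows, "Pneumonia", filter_column="Pneumonia", filter_value=1)
print(f"{positive} of {total} rows are pneumonia cases")
```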