In [None]:
import os
import dtale
home_dir=os.path.expanduser('~')
os.chdir(f"{home_dir}/nexus_correlation_discovery/")
from demo import nexus_demo
from nexus.utils.time_point import TEMPORAL_GRANU
from nexus.utils.coordinate import SPATIAL_GRANU
from nexus.nexus_api import API
from nexus.utils.data_model import Variable
from demo.cluster_utils import CorrCommunity
from demo.demo_ui import show_communities

# Nexus Introduction

Correlation analysis is a vital initial step for investigating causation, essential for understanding complex phenomena and making informed choices. While it is hard to establish causality from vast observational data without assumptions and expert knowledge, identifying correlations remains a key strategy to “cast a wide net” and detect potential causal links. Our system, Nexus, identifies correlations over collections of spatio-temporal tabular data, aiming to surface interesting hypotheses and provide a good starting point for further causal analysis. Nexus focuses on two personas.

**Persona 1: Exploring an Existing Hypothesis.** A researcher at a medical school, Bob, has a dataset with asthma attack incidences in hospitals across various zip codes in Chicago. Bob's research goal is to explore what factors could potentially affect asthma attacks. Thus, he wants to start by finding variables that are correlated with asthma attacks. Persona 1 is someone who has an initial dataset and seeks to enrich such a dataset with additional variables relevant to the analysis.

<img src="persona1.png" alt="persona 1" width="500"/>

**Persona 2: Data-Driven Hypothesis Generation.** Amy, a researcher in Chicago, finds that [Chicago Open Data](https://data.cityofchicago.org) has many interesting datasets. She wants to know whether she can form new hypotheses from this large body of data. That is, she wants to find all correlations in Chicago Open Data and see whether any of them are interesting enough to lead to new hypotheses or insights.

<img src="persona2.png" alt="persona 2" width="400"/>

In this demonstration, we illustrate how Nexus assists Personas 1 and 2 in analyzing real-world datasets.

## Install Nexus

Let's first install Nexus!

In [None]:
nexus_demo.install_nexus()

## Create Nexus API

Nexus indexes Chicago Open Data offline and stores the data in `demo.db`.

In [None]:
conn_str = 'data/demo.db'
nexus_api = API(conn_str)

# Persona 1: Enrich the asthma dataset with additional variables

Bob, a researcher from a medical school, has a dataset with asthma attack incidences in hospitals across various zip codes in Chicago.

| Zip5\*   | enc_asthma\*\* | encAsthmaExac\*\*\* | AttackPer\*\*\*\*  |
|--------|------------|---------------|-----------|
| 60604.0| 10.0       | 1.0           | 0.1       |
| 60605.0| 47.0       | 7.0           | 0.15      |
| 60606.0| 33.0       | 13.0          | 0.39      |
| 60607.0| 12.0       | 3.0           | 0.25      |
| ...| ...       | ...          | ...      |

\* zipcode

\*\* Count of asthma visits 2009-2019, denominator.

\*\*\* Count of visits for asthma attacks (a.k.a., exacerbations) 2009-2019, numerator.

\*\*\*\* Asthma attacks as a fraction of all asthma visits (encAsthmaExac / enc_asthma), so 0.1 means 10%.
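
The `AttackPer` column can be recomputed from the numerator and denominator columns. The short check below uses only the four rows shown in the table; the tuple layout and the helper name `attack_fraction` are illustrative and not part of Nexus.

```python
# Rows copied from the table above: (Zip5, enc_asthma, encAsthmaExac, AttackPer).
rows = [
    (60604, 10, 1, 0.10),
    (60605, 47, 7, 0.15),
    (60606, 33, 13, 0.39),
    (60607, 12, 3, 0.25),
]


def attack_fraction(visits, attacks):
    """Asthma attacks divided by all asthma visits, rounded to two decimals."""
    return round(attacks / visits, 2)


for zip5, visits, attacks, reported in rows:
    # Print the recomputed value next to the value reported in the table.
    print(zip5, attack_fraction(visits, attacks), reported)
```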

Bob wants to find variables correlated with asthma attacks from Chicago Open Data. 

<!-- He finds that [Chicago Open Data](https://data.cityofchicago.org/) has a wealth of datasets on diverse societal aspects such as education, business, and crime in Chicago. He believes there are some variables in Chicago Open Data that are useful for his research. Thus, he adds Chicago Open Data as a data source in Nexus. -->

## Browse Data Assets

In [None]:
catalog = nexus_api.get_catalog()
dtale.show(catalog)

Given a dataset ID, you can use Nexus to look at any dataset in the catalog.

In [None]:
dataset_id = 'ijzp-q8t2_location_6'
df = nexus_api.get_agg_dataset(dataset_id)
dtale.show(df)

## Find correlations from an input table

In [None]:
dataset = 'asthma'
# No temporal granularity: align datasets on ZIP code only.
temporal_granularity, spatial_granularity = None, SPATIAL_GRANU.ZIPCODE
overlap_threshold = 5
correlation_threshold = 0.5
correlations = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity,
                                                overlap_threshold, correlation_threshold,
                                                correlation_type="pearson")
dtale.show(correlations)
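
To make the two thresholds concrete, here is a minimal sketch of what a single correlation check might look like once two datasets have been aggregated to ZIP code. It is not Nexus's implementation. The helper `correlate_on_overlap` and the toy values are hypothetical. Reading `overlap_threshold` as the minimum number of shared ZIP codes, and `correlation_threshold` as a minimum on the absolute Pearson coefficient, is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


def correlate_on_overlap(left, right, overlap_threshold, correlation_threshold):
    """Hypothetical helper: left and right map a ZIP code to one aggregated value."""
    shared = sorted(left.keys() & right.keys())
    if len(shared) < overlap_threshold:
        return None  # too few shared ZIP codes (assumed meaning of the threshold)
    r = pearson([left[z] for z in shared], [right[z] for z in shared])
    # A NaN coefficient fails this comparison and is therefore dropped as well.
    return r if abs(r) >= correlation_threshold else None


# Toy values, made up for illustration only.
toy_left = {60604: 1.0, 60605: 2.0, 60606: 3.0, 60607: 4.0, 60608: 5.0}
toy_right = {60604: 2.0, 60605: 4.5, 60606: 5.5, 60607: 8.0, 60608: 10.5, 60609: 1.0}
print(correlate_on_overlap(toy_left, toy_right, overlap_threshold=5, correlation_threshold=0.5))
```

Raising `overlap_threshold` above 5 in this toy example makes the helper return `None`, because only five ZIP codes appear in both dictionaries.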

## Display the detailed profile of a correlation

In [None]:
correlation_idx = 9
nexus_api.show_correlation_profile(correlations, correlation_idx)

## Control for variables

In [None]:
control_variables = [Variable('chicago_income_by_zipcode_zipcode_6', 'avg_income_household_median')]
df_control = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity, 
                                              overlap_threshold, correlation_threshold, 
                                              correlation_type="pearson", control_variables=control_variables)
dtale.show(df_control)
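
This notebook does not say how Nexus controls for a variable. One standard technique is partial correlation: regress the control variable out of both variables of interest, then correlate the residuals. The sketch below illustrates that technique only, under the assumption of a single control variable. The function names and the toy numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, zs):
    """Residuals of a least-squares line fit of ys on the control zs."""
    n = len(ys)
    mean_y, mean_z = sum(ys) / n, sum(zs) / n
    var_z = sum((z - mean_z) ** 2 for z in zs)
    slope = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys)) / var_z
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    return pearson(residuals(xs, zs), residuals(ys, zs))


# Toy values, made up for illustration: both xs and ys track the control zs.
zs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xs = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
ys = [1.2, 1.8, 3.1, 4.2, 4.8, 6.1]
print("raw:", pearson(xs, ys))
print("controlled:", partial_correlation(xs, ys, zs))
```

When two variables are correlated mainly because both follow the control, the controlled coefficient can differ sharply from the raw one.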

## Assemble a dataset from multiple variables

In [None]:
row_idx = 10
aligned, prov = nexus_api.get_joined_data_from_row(df_control.loc[row_idx])
dtale.show(aligned)

Nexus also offers a `join_and_project` API that assembles a dataset from any given set of variables.

In [None]:
variables = [Variable('divg-mhqk_location_6', 'count'), Variable('4u6w-irs9_location_6', 'avg_square_feet')]
df, prov = nexus_api.join_and_project(variables)
dtale.show(df)

Nexus provides data provenance information for all data assembly APIs.

In [None]:
print(prov)

# Persona 2: Data-Driven Hypothesis Generation

In [None]:
chicago_correlations = nexus_demo.find_all_correlations(TEMPORAL_GRANU.MONTH, SPATIAL_GRANU.TRACT)
print(f"Nexus found {len(chicago_correlations)} correlations in total")

## Correlation Distillation Using Nexus Variable Clusters

In [None]:
variable_clusters = nexus_demo.get_correlation_communities(chicago_correlations)
print(f"Nexus extracts {len(variable_clusters.comps)} variable clusters out of {len(chicago_correlations)} correlations")
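
The general idea of distilling many pairwise correlations into a few groups can be shown with a graph view: variables are nodes and each correlation is an edge. How Nexus actually extracts its clusters is not described in this notebook. The sketch below uses plain connected components, the simplest possible grouping, as a stand-in. The function `connected_components` and the variable names in the toy input are hypothetical.

```python
def connected_components(edges):
    """Group nodes that are linked, directly or indirectly, by any edge."""
    parent = {}

    def find(v):
        # Union-find lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    # Sort members and groups so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy correlations, made up for illustration: each pair is one correlated variable pair.
toy_correlations = [
    ("crime_count", "vacant_buildings"),
    ("vacant_buildings", "rodent_complaints"),
    ("avg_income", "building_permits"),
]
clusters = connected_components(toy_correlations)
print(f"{len(clusters)} clusters out of {len(toy_correlations)} correlations")
for cluster in clusters:
    print(cluster)
```

Community detection methods are usually more selective than connected components, since a single weak edge is enough to merge two groups here.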

### Examine Correlation Communities

In [None]:
show_communities(variable_clusters, show_corr_in_same_tbl=False)