In [2]:
import os
import dtale
home_dir=os.path.expanduser('~')
os.chdir(f"{home_dir}/nexus_correlation_discovery/")
from demo import nexus_demo

# Nexus Introduction

Correlation analysis is a vital initial step for investigating causation, essential for understanding complex phenomena and making informed choices. While it is hard to establish causality from vast observational data without assumptions and expert knowledge, identifying correlations remains a key strategy to “cast a wide net” and detect potential causal links. Our system Nexus identifies correlations over collections of spatio-temporal tabular data, aiming to identify interesting hypotheses and provide a good starting point for further causal analysis. Nexus focuses on two personas.

**Persona 1: Enrich an Existing Dataset.** A researcher at a medical school, Bob, has a dataset with asthma attack incidences in hospitals across various zip codes in Chicago. Bob's research goal is to explore what factors could potentially affect asthma attacks. Thus, he wants to start by finding variables that are correlated with asthma attacks. Persona 1 is someone who has an initial dataset and seeks to enrich such a dataset with additional variables relevant to the analysis.

**Persona 2: Data-Driven Hypothesis Generation.** Amy, a social scientist in Chicago, is seeking to discover intriguing phenomena within the city for her research. To avoid limiting her analysis to existing knowledge, she employs a data-driven strategy. Recognizing that Chicago Open Data has a wealth of datasets on diverse societal aspects such as education, business, and crime, Amy wants to identify interesting correlations automatically to generate new hypotheses. Persona 2 has a large repository of tabular data and wants to automatically identify interesting correlations to formulate new hypotheses for further causal analysis.

In this demonstration, we illustrate how Nexus assists Personas 1 and 2 in analyzing real-world datasets.

## Install Nexus

Let's first install Nexus!

In [None]:
nexus_demo.install_nexus()

## Create Nexus API

In [None]:
from nexus.nexus_api import API
conn_str = 'data/quickstart.db'
nexus_api = API(conn_str)

# Persona 1: Enrich the asthma dataset with additional variables

Bob, a researcher from a medical school, has a dataset with asthma attack incidences in hospitals across various zip codes in Chicago.

| Zip5\*   | enc_asthma\*\* | encAsthmaExac\*\*\* | AttackPer\*\*\*\*  |
|--------|------------|---------------|-----------|
| 60604.0| 10.0       | 1.0           | 0.1       |
| 60605.0| 47.0       | 7.0           | 0.15      |
| 60606.0| 33.0       | 13.0          | 0.39      |
| 60607.0| 12.0       | 3.0           | 0.25      |
| ...| ...       | ...          | ...      |

\* zipcode

\*\* Count of asthma visits 2009-2019, denominator.

\*\*\* Count of visits for asthma attacks (a.k.a., exacerbations) 2009-2019, numerator.

\*\*\*\* Asthma attacks as a percentage of all asthma visits.

Bob is searching for variables correlated with asthma attacks from external data sources. He finds that [Chicago Open Data](https://data.cityofchicago.org/) has a wealth of datasets on diverse societal aspects such as education, business, and crime in Chicago. He believes there are some variables in Chicago Open Data that are useful for his research. Thus, he adds Chicago Open Data as a data source in Nexus.

## Browse Data Assets

Now that Chicago Open Data has been added to Nexus, Bob can use Nexus to browse the data catalog. Note that this catalog contains both the original datasets and their aggregated versions.

For example, table `ijzp-q8t2` is Crimes - 2001 to Present. This table originally has geo-coordinate granularity. To combine it with the asthma dataset, which has zipcode granularity, Nexus automatically resolves the granularity inconsistency and creates table `ijzp-q8t2_location_6`, which aggregates `ijzp-q8t2` to the zipcode granularity using the `location` attribute.
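
As a rough illustration of this aggregation step, the sketch below groups point-level records by zipcode and computes a count and an average per group. The record layout, the `aggregate_by_zipcode` helper, and the sample values are hypothetical; how Nexus actually maps coordinates to zipcodes and aggregates them may differ.

```python
from collections import defaultdict


def aggregate_by_zipcode(records, attribute):
    """Group records by zipcode and compute count and average of `attribute`.

    Hypothetical helper for illustration only; it assumes every record
    already carries a resolved "zipcode" field.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["zipcode"]].append(record[attribute])
    return {
        zipcode: {
            "count": len(values),
            f"avg_{attribute}": sum(values) / len(values),
        }
        for zipcode, values in groups.items()
    }


# Made-up point-level rows, not real Chicago Open Data values.
records = [
    {"zipcode": "60604", "square_feet": 1000.0},
    {"zipcode": "60604", "square_feet": 3000.0},
    {"zipcode": "60605", "square_feet": 500.0},
]
aggregated = aggregate_by_zipcode(records, "square_feet")
print(aggregated["60604"])
```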

In [None]:
catalog = nexus_api.get_catalog()
dtale.show(catalog)

You can use Nexus to look at a dataset in the catalog given the dataset id.

In [None]:
dataset_id = '4u6w-irs9_location_6'
df = nexus_api.get_agg_dataset(dataset_id)
dtale.show(df)

## Find correlations from an input table

Bob's goal is to explore what factors could potentially affect asthma attacks. Thus, he starts by finding variables that are correlated with asthma attacks. He can achieve this easily by using the `find_correlations_from` API in Nexus.

In this API, Nexus aligns the asthma dataset with tables from Chicago Open Data and computes correlations. Tables from Chicago Open Data originally have geo-coordinate spatial granularity. Nexus aggregates them to the zip code level using the aggregate functions "avg" and "count". For example, an attribute named `avg_basketball_courts` is the result of applying `avg` to the original attribute `basketball_courts`.
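
To make the roles of the overlap and correlation thresholds concrete, here is a minimal sketch that aligns two zipcode-keyed columns on their shared keys and computes a Pearson coefficient. The `correlate_on_overlap` helper, the `>=` comparisons, and the sample data are assumptions for illustration, not the Nexus implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate_on_overlap(col_a, col_b, overlap_threshold, correlation_threshold):
    """Return (r, overlap) if both thresholds are met, otherwise None.

    Hypothetical helper: columns are dicts mapping a zipcode to a value.
    """
    shared = sorted(col_a.keys() & col_b.keys())
    if len(shared) < overlap_threshold:
        return None  # too few aligned rows to trust the coefficient
    r = pearson([col_a[k] for k in shared], [col_b[k] for k in shared])
    if abs(r) < correlation_threshold:
        return None  # correlation too weak to report
    return r, len(shared)


# Made-up values keyed by placeholder zipcodes.
col_a = {"z1": 1.0, "z2": 2.0, "z3": 3.0, "z4": 4.0, "z5": 5.0}
col_b = {"z1": 2.0, "z2": 4.0, "z3": 6.0, "z4": 8.0, "z5": 10.0, "z6": 3.0}
result = correlate_on_overlap(col_a, col_b, overlap_threshold=5, correlation_threshold=0.5)
print(result)
```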

In [None]:
from nexus.utils.time_point import TEMPORAL_GRANU
from nexus.utils.coordinate import SPATIAL_GRANU

dataset = 'asthma'
# The asthma data has only a spatial attribute, so the temporal granularity is set to ALL.
temporal_granularity, spatial_granularity = TEMPORAL_GRANU.ALL, SPATIAL_GRANU.ZIPCODE
overlap_threshold = 5
correlation_threshold = 0.5
# you can change correlation_type to 'spearman' or 'kendall'
correlations = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity, 
                                      overlap_threshold, correlation_threshold, 
                                      correlation_type="pearson")
dtale.show(correlations)

In [None]:
from nexus.utils.time_point import TEMPORAL_GRANU
from nexus.utils.coordinate import SPATIAL_GRANU

dataset = 'ijzp-q8t2'
temporal_granularity, spatial_granularity = TEMPORAL_GRANU.DAY, SPATIAL_GRANU.TRACT
overlap_threshold = 5
correlation_threshold = 0.5
# you can change correlation_type to 'spearman' or 'kendall'
correlations = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity, 
                                      overlap_threshold, correlation_threshold, 
                                      correlation_type="pearson")
dtale.show(correlations)

## Display the detailed profile of a correlation

todo

## Control for variables

Bob gets 234 correlations for the asthma dataset. After browsing several of them, he realizes that "poverty" might be driving these correlations. Thus, he wants to control for the income level of each zipcode when calculating correlations. To do so, users can specify the variables they want to control for in the `control_variables` field. After controlling for the median household income in a zipcode, only 63 correlations are left.
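
One common way to control for a variable is the first-order partial correlation, sketched below. This is a textbook formula used here for illustration; the text above does not say which method Nexus applies, and the data are synthetic, constructed so that two variables are related only through a shared "income" term.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data: both variables are "income" plus independent noise.
income = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noise_a = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
noise_b = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
var_a = [z + e for z, e in zip(income, noise_a)]
var_b = [z + e for z, e in zip(income, noise_b)]

raw = pearson(var_a, var_b)
controlled = partial_correlation(var_a, var_b, income)
print(f"raw r = {raw:.2f}, r controlling for income = {abs(controlled):.2f}")
```

In this constructed example the raw correlation is strong, but it vanishes once the shared term is controlled for, which mirrors how many of Bob's correlations drop out after controlling for income.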

In [None]:
from nexus.utils.data_model import Variable

dataset = 'asthma'
temporal_granularity, spatial_granularity = TEMPORAL_GRANU.ALL, SPATIAL_GRANU.ZIPCODE
overlap_threshold = 5
correlation_threshold = 0.5
control_variables = [Variable('chicago_income_by_zipcode_zipcode_6', 'avg_income_household_median')]
df_control = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity, 
                                              overlap_threshold, correlation_threshold, 
                                              correlation_type="pearson", control_variables=control_variables)
dtale.show(df_control)

## Assemble a dataset from multiple variables

Bob identifies a few interesting correlations and wants to combine variables involved in these correlations to assemble a new dataset. Nexus provides data assembly APIs to make it easy for Bob.

Suppose Bob finds the first correlation intriguing and wishes to explore the data used to calculate it. He can pass the corresponding row of the correlation results to Nexus to obtain the integrated dataset.

In [None]:
row_idx = 0  # index of the correlation to inspect
aligned, prov = nexus_api.get_joined_data_from_row(df_control.loc[row_idx])
dtale.show(aligned)

Nexus also offers a `join_and_project` API that can assemble a dataset from any set of given variables.
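
Conceptually, assembling a dataset from several variables amounts to joining their columns on a shared key and keeping only the requested attributes. The sketch below shows this with plain dictionaries; the `join_columns` helper, the inner-join semantics, and the sample values are assumptions for illustration and may not match what `join_and_project` does internally.

```python
def join_columns(columns):
    """Inner-join named columns on their shared keys.

    Hypothetical helper: `columns` maps a column name to a {key: value} dict,
    and the result is one row per key present in every column.
    """
    shared_keys = set.intersection(*(set(col) for col in columns.values()))
    return [
        {"key": key, **{name: col[key] for name, col in columns.items()}}
        for key in sorted(shared_keys)
    ]


# Made-up values keyed by zipcode.
columns = {
    "count": {"60604": 3, "60605": 5},
    "avg_square_feet": {"60605": 1200.0, "60606": 900.0},
}
rows = join_columns(columns)
print(rows)
```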

In [None]:
variables = [Variable('divg-mhqk_location_6', 'count'), Variable('4u6w-irs9_location_6', 'avg_square_feet')]
df, prov = nexus_api.join_and_project(variables)
dtale.show(df)

Nexus provides the data provenance information for all data assembly APIs.

In [None]:
print(prov)

## Regression Analysis

When you find several intriguing correlations and want to run a regression on the variables involved, first use Nexus's `join_and_project` function to assemble the dataset. You can then use any data analysis library for the regression; here we use `scikit-learn`.

In [None]:
from sklearn import linear_model

dependent_variable = Variable('asthma_Zip5_6', 'avg_enc_asthma')
independent_variables = [Variable('ijzp-q8t2_location_6', 'count'), Variable('n26f-ihde_pickup_centroid_location_6', 'avg_tip')]

data_to_analyze, provenance = nexus_api.join_and_project([dependent_variable] + independent_variables)
# apply any data analysis method
regression_model = linear_model.LinearRegression() # OLS regression

x = data_to_analyze[[variable.attr_name for variable in independent_variables]]
y = data_to_analyze[dependent_variable.attr_name]
model = regression_model.fit(x, y)
r_squared = model.score(x, y)

print("coefficient of each independent variable:", model.coef_)
print("R-squared score:", r_squared)

# Persona 2: Data-Driven Hypothesis Generation

Amy, a social scientist in Chicago, is seeking to discover intriguing phenomena within the city for her research. To avoid limiting her analysis to existing knowledge, she employs a data-driven strategy. Recognizing that Chicago Open Data has a wealth of datasets on diverse societal aspects such as education, business, and crime, Amy wants to identify interesting correlations automatically to generate new hypotheses. 

She points Nexus to Chicago Open Data and uses the `find_all_correlations` API to identify all correlations within Chicago Open Data at the census tract and month granularity.

In [None]:
from nexus.utils.time_point import TEMPORAL_GRANU
from nexus.utils.coordinate import SPATIAL_GRANU
chicago_correlations = nexus_demo.find_all_correlations(TEMPORAL_GRANU.MONTH, SPATIAL_GRANU.TRACT)
print(f"Nexus found {len(chicago_correlations)} correlations in total")

## Use Nexus Variable Clusters

Nexus found 40,538 correlations in total, far too many for users to sift through manually in search of interesting ones.

Luckily, Nexus can distill the structure of correlations and extract a small number of variable clusters from the vast array of correlations. These variable clusters can help users identify causal links and confounders.

Nexus searches for an optimal set of signals that, when applied as filters, yield a correlation graph with the highest modularity score. The signals that we consider for Chicago Open Data include:

- Missing value ratio in the aggregated column
- Missing value ratio in the original column
- Zero value ratio in the aggregated column
- Zero value ratio in the original column
- The absolute value of correlation coefficient
- Overlap: number of samples used to calculate the correlation

In Chicago Open Data, the best thresholds for these signals are `[1.0, 1.0, 1.0, 0.8, 0.6, 70]`. In the order that `filter_on_signals` reads them, this keeps correlations with aggregated missing ratio <= 1.0, aggregated zero ratio <= 1.0, original missing ratio <= 1.0, original zero ratio <= 0.8, |r| >= 0.6, and samples >= 70.

You can play with different sets of thresholds as well!
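
For intuition about the modularity score being maximized, the sketch below computes the standard (Newman) modularity of a given partition on a small unweighted graph. The `modularity` helper and the toy graph are illustrative assumptions; Nexus's own scoring code may be implemented differently.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of a simple undirected graph.

    Assumes `edges` has no duplicates or self-loops and that
    `communities` is a list of disjoint node sets.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    score = 0.0
    for community in communities:
        # edges with both endpoints inside the community
        internal = sum(1 for u, v in edges if u in community and v in community)
        total_degree = sum(degree.get(node, 0) for node in community)
        score += internal / m - (total_degree / (2 * m)) ** 2
    return score


# Two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
natural_split = [{"a", "b", "c"}, {"d", "e", "f"}]
arbitrary_split = [{"a", "d"}, {"b", "e"}, {"c", "f"}]
q_natural = modularity(edges, natural_split)
q_arbitrary = modularity(edges, arbitrary_split)
print(q_natural > q_arbitrary)
```

A set of filters that leaves well-separated groups of variables produces a graph whose best partition scores high on this measure, which is why modularity is a useful target when tuning the thresholds.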

In [None]:
from demo.cluster_utils import CorrCommunity
from demo.demo_ui import show_communities
import random
import networkx as nx

def filter_on_signals(corr, signals, ts):
    return corr[
        (corr["missing_ratio1"].values <= ts[0])
        & (corr["zero_ratio1"].values <= ts[1])
        & (corr["missing_ratio2"].values <= ts[0])
        & (corr["zero_ratio2"].values <= ts[1])
        & (corr["missing_ratio_o1"].values <= ts[2])
        & (corr["zero_ratio_o1"].values <= ts[3])
        & (corr["missing_ratio_o2"].values <= ts[2])
        & (corr["zero_ratio_o2"].values <= ts[3])
        & (abs(corr["r_val"]).values >= ts[4])
        & (corr["samples"].values >= ts[5])
    ]



from collections import defaultdict


def build_graph_on_vars(corrs, threshold=0, weighted=False):
    # Build an undirected graph with one node per (table, aggregated attribute)
    # and one edge per correlation. `threshold` and `weighted` are currently unused.
    G = nx.Graph()
    labels = {}
    for _, row in corrs.iterrows():
        tbl_id1, tbl_id2, tbl_name1, tbl_name2, agg_attr1, agg_attr2 = (
            row["table_id1"],
            row["table_id2"],
            row["table_name1"],
            row["table_name2"],
            row["agg_attr1"],
            row["agg_attr2"],
        )
        G.add_edge(f"{tbl_id1}--{agg_attr1}", f"{tbl_id2}--{agg_attr2}")
        labels[f"{tbl_id1}--{agg_attr1}"] = f"{tbl_name1}--{agg_attr1}"
        labels[f"{tbl_id2}--{agg_attr2}"] = f"{tbl_name2}--{agg_attr2}"

    # Attach human-readable "table name--attribute" labels to the nodes.
    nx.set_node_attributes(G, labels, "label")
    return G


def get_communities(G):
    random.seed(9)
    # detect communities (variable clusters) with the Louvain method
    comps = nx.community.louvain_communities(G, resolution=1)
    print(len(comps))
    all_communities = {}
    for i, comp in enumerate(comps):
        community = defaultdict(list)
        for tbl_var in comp:
            tbl_var = G.nodes[tbl_var]["label"]
            x = tbl_var.split("--")
            tbl, var = x[0], x[1]
            community[tbl].append(var)
        all_communities[f"Cluster {i}"] = community
    return all_communities, comps

signal_thresholds = [1.0, 1.0, 1.0, 0.8, 0.6, 70]
filtered_corr = filter_on_signals(chicago_correlations, None, signal_thresholds)
G = build_graph_on_vars(filtered_corr, 0, False)
communities, _ = get_communities(G)
corr_community = CorrCommunity(chicago_correlations, 'chicago')
corr_community.all_communities = communities
# corr_community.get_correlation_communities_chicago(signal_thresholds)

### Examine Correlation Communities

We implement a simple interface for you to explore our correlation communities. Each community is composed of a group of variables. By default, the display is set to only show the tables where these variables are found. To view the specific variables within a community, simply click the "Show Variables" button.

Clicking the "Show Correlations" button will reveal all the correlations within a community. Once displayed, you have the flexibility to apply any filters to the resulting dataframe.

FAQ:

Why do some communities display the exact same set of tables?

The reason is that while the tables might be the same, the variables within these communities differ. We construct the correlation graph based on variables, and then present it in a table-view for clarity.

In [None]:
show_communities(corr_community, show_corr_in_same_tbl=False)

## Use Factor Analysis

Factor analysis aims to extract common factors from observed variables and represent existing variables using fewer factors. 

Factor analysis can take as input a correlation matrix. It derives factors that are essentially linear combinations of the observed variables. These factors are crafted to closely approximate the original correlation matrix when observed variables are projected onto them. 
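
To illustrate the idea of approximating a correlation matrix with a few factors, the sketch below extracts a single set of loadings from a small correlation matrix using power iteration. This is a principal-component style simplification, not a full factor analysis (it does not estimate unique variances), and the matrix values and the `leading_factor_loadings` helper are made up for the example.

```python
import math


def leading_factor_loadings(corr_matrix, iterations=200):
    """Loadings of the leading principal factor of a correlation matrix.

    Uses power iteration to find the dominant eigenvector, then scales it
    by the square root of its eigenvalue. Illustrative sketch only.
    """
    n = len(corr_matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(corr_matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    # Rayleigh quotient gives the eigenvalue for the unit-length eigenvector.
    eigenvalue = sum(
        vector[i] * sum(corr_matrix[i][j] * vector[j] for j in range(n))
        for i in range(n)
    )
    return [math.sqrt(eigenvalue) * value for value in vector]


# Three variables that are all strongly correlated with each other.
corr_matrix = [
    [1.0, 0.8, 0.8],
    [0.8, 1.0, 0.8],
    [0.8, 0.8, 1.0],
]
loadings = leading_factor_loadings(corr_matrix)
# Products of loadings approximate the off-diagonal correlations.
approx = [[a * b for b in loadings] for a in loadings]
print([round(value, 3) for value in loadings])
```

Variables with large loadings on the same factor move together, which is the basis for grouping them into a cluster.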

In [None]:
import pickle

# NOTE: `load_corrs_from_dir`, `corr_path`, and `use_qgrid` are not defined in this notebook;
# they must be imported or defined before running this cell.

# Remove correlations with values of 1 or -1 to avoid a singular matrix.
corrs, corr_map = load_corrs_from_dir(corr_path, remove_perfect_corrs=True)
# Use the same signal thresholds as in the previous example.
signals = [1.0, 1.0, 1.0, 0.8, 0.6, 70]
corrs_filtered = filter_on_signals(corrs, None, signals)

n_factors = 10  # number of factors to extract

# The following line fits a factor analysis model on the correlation matrix online.
# It takes 10 minutes to run; save_path is the path where the fitted model (fa) is saved.
# fa, clusters = nexus_api.factor_analysis(corrs_filtered, corr_map, n_factors, save_path="chicago_open_data_factor_analysis.pkl")

# For the purpose of this demo, we load the factor analysis model from
# the file "chicago_open_data_factor_analysis.pkl".
with open("chicago_open_data_factor_analysis.pkl", "rb") as f:
    fa = pickle.load(f)
clusters, covered_vars = nexus_api.build_factor_clusters(fa, corrs_filtered, corr_map, n_factors, threshold=0.5)
corr_community = CorrCommunity(corrs_filtered, 'chicago', clusters)
show_communities(corr_community, show_corr_in_same_tbl=False, use_qgrid=use_qgrid)