In [1]:
import os
import dtale
home_dir=os.path.expanduser('~')
os.chdir(f"{home_dir}/nexus_correlation_discovery/")
from demo import nexus_demo

# Nexus Introduction

Correlation analysis is a vital initial step for investigating causation, essential for understanding complex phenomena and making informed choices. While it is hard to establish causality from vast observational data without assumptions and expert knowledge, identifying correlations remains a key strategy to “cast a wide net” and detect potential causal links. Our system Nexus identifies correlations over collections of spatio-temporal tabular data, aiming to identify interesting hypotheses and provide a good starting point for further causal analysis. Nexus focuses on two personas.

**Persona 1: Exploring an Existing Hypothesis.** A researcher at a medical school, Bob, has a dataset with asthma attack incidences in hospitals across various zip codes in Chicago. Bob's research goal is to explore what factors could potentially affect asthma attacks. Thus, he wants to start by finding variables that are correlated with asthma attacks. Persona 1 is someone who has an initial dataset and seeks to enrich such a dataset with additional variables relevant to the analysis.

<img src="persona1.png" alt="persona 1" width="500"/>

**Persona 2: Data-Driven Hypothesis Generation.** Amy, a researcher in Chicago, finds that [Chicago Open Data](https://data.cityofchicago.org) has many interesting datasets. She wants to know whether she can form new hypotheses from this large collection of data, that is, to find all correlations in Chicago Open Data and see whether any of them lead to new hypotheses or insights.

<img src="persona2.png" alt="persona 2" width="400"/>

In this demonstration, we illustrate how Nexus assists Personas 1 and 2 in analyzing real-world datasets.

## Install Nexus

Let's first install Nexus!

In [17]:
nexus_demo.install_nexus()

Installation Nexus successful!


## Create Nexus API

Nexus indexes Chicago Open Data offline and stores the data in a database file (`data/demo.db` in this demo).

In [27]:
from nexus.nexus_api import API
# conn_str = 'data/quickstart.db'
conn_str = 'data/demo.db'
nexus_api = API(conn_str)

# Persona 1: Enrich the asthma dataset with additional variables

Bob, a researcher from a medical school, has a dataset with asthma attack incidences in hospitals across various zip codes in Chicago.

| Zip5\*   | enc_asthma\*\* | encAsthmaExac\*\*\* | AttackPer\*\*\*\*  |
|--------|------------|---------------|-----------|
| 60604.0| 10.0       | 1.0           | 0.1       |
| 60605.0| 47.0       | 7.0           | 0.15      |
| 60606.0| 33.0       | 13.0          | 0.39      |
| 60607.0| 12.0       | 3.0           | 0.25      |
| ...| ...       | ...          | ...      |

\* zipcode

\*\* Count of asthma visits 2009-2019, denominator.

\*\*\* Count of visits for asthma attacks (a.k.a., exacerbations) 2009-2019, numerator.

\*\*\*\* Asthma attacks as a percentage of all asthma visits.

Bob wants to find variables correlated with asthma attacks from Chicago Open Data. 

<!-- He finds that [Chicago Open Data](https://data.cityofchicago.org/) has a wealth of datasets on diverse societal aspects such as education, business, and crime in Chicago. He believes there are some variables in Chicago Open Data that are useful for his research. Thus, he adds Chicago Open Data as a data source in Nexus. -->

## Browse Data Assets

Chicago Open Data has been added to Nexus and Bob can use Nexus to browse the data catalog. 

Note that this data catalog contains both the original datasets and their aggregated versions. For example, table `ijzp-q8t2` is Crimes - 2001 to Present. This table originally has geo-coordinate granularity. To combine it with the asthma dataset, which has zipcode granularity, Nexus automatically resolves the granularity mismatch and creates table `ijzp-q8t2_location_6`, which aggregates `ijzp-q8t2` to the zipcode granularity using the `location` attribute.

In [28]:
catalog = nexus_api.get_catalog()
dtale.show(catalog)



You can use Nexus to look at a dataset in the catalog given the dataset id.

In [18]:
dataset_id = 'ijzp-q8t2_location_6'
df = nexus_api.get_agg_dataset(dataset_id)
dtale.show(df)



## Find correlations from an input table

Bob's goal is to explore what factors could potentially affect asthma attacks. Thus, he starts by finding variables that are correlated with asthma attacks. He can achieve this easily by using the `find_correlations_from` API in Nexus.

In this API, Nexus aligns the asthma dataset with tables from Chicago Open Data and computes correlations. Tables from Chicago Open Data originally have geo-coordinate spatial granularity. We aggregate them to the zip code level and apply the aggregate functions `avg` and `count`. For example, an attribute named `avg_basketball_courts` means that the original attribute is `basketball_courts` and that the `avg` function was applied to it.
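
As a rough illustration of this aggregation step, the sketch below groups point-level records by zipcode and applies `count` and `avg`. The record layout, the `aggregate_by_zipcode` helper, and the sample values are hypothetical; how Nexus itself maps coordinates to zipcodes and performs the aggregation is not shown here.

```python
from collections import defaultdict

# Hypothetical point-level records that have already been mapped to a zipcode.
records = [
    {"zipcode": "60604", "basketball_courts": 2},
    {"zipcode": "60604", "basketball_courts": 4},
    {"zipcode": "60605", "basketball_courts": 1},
]


def aggregate_by_zipcode(rows, attribute):
    """Group rows by zipcode and compute `count` and `avg_<attribute>`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["zipcode"]].append(row[attribute])
    return {
        zipcode: {
            "count": len(values),
            f"avg_{attribute}": sum(values) / len(values),
        }
        for zipcode, values in groups.items()
    }


aggregated = aggregate_by_zipcode(records, "basketball_courts")
print(aggregated)
```

Each zipcode ends up with one row, which is what lets the aggregated table be aligned with the zipcode-level asthma data.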

In [19]:
from nexus.utils.time_point import TEMPORAL_GRANU
from nexus.utils.coordinate import SPATIAL_GRANU

dataset = 'asthma'
# dataset = 'ijzp-q8t2'
# The asthma data has only a spatial attribute, so the temporal granularity is set to None.
temporal_granularity, spatial_granularity = None, SPATIAL_GRANU.ZIPCODE
# temporal_granularity, spatial_granularity = TEMPORAL_GRANU.MONTH, SPATIAL_GRANU.ZIPCODE
overlap_threshold = 5
correlation_threshold = 0.5
# you can change correlation_type to 'spearman' or 'kendall'
correlations = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity, 
                                      overlap_threshold, correlation_threshold, 
                                      correlation_type="pearson")
dtale.show(correlations)

total number of correlations: 219
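
To make the two thresholds concrete, here is a minimal sketch of one plausible reading: a pair of variables is kept only if the two tables share at least `overlap_threshold` keys and the absolute Pearson coefficient on those shared keys reaches `correlation_threshold`. This reading is an assumption, not Nexus's actual code; the `pearson` and `correlate_if_enough_overlap` helpers and all values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate_if_enough_overlap(left, right, overlap_threshold, correlation_threshold):
    """Return the coefficient if both thresholds are met, otherwise None."""
    shared = sorted(left.keys() & right.keys())  # zipcodes present in both tables
    if len(shared) < overlap_threshold:
        return None
    r = pearson([left[k] for k in shared], [right[k] for k in shared])
    return r if abs(r) >= correlation_threshold else None


# Hypothetical zipcode -> value mappings; five zipcodes appear in both.
left = {"60604": 1.0, "60605": 2.0, "60606": 3.0, "60607": 4.0, "60608": 5.0, "60609": 6.0}
right = {"60605": 4.1, "60606": 5.9, "60607": 8.2, "60608": 9.8, "60609": 12.1, "60610": 3.0}

kept = correlate_if_enough_overlap(left, right, overlap_threshold=5, correlation_threshold=0.5)
dropped = correlate_if_enough_overlap(left, right, overlap_threshold=6, correlation_threshold=0.5)
print(kept, dropped)
```

Raising `overlap_threshold` discards pairs that are aligned on too few samples, while raising `correlation_threshold` discards weak correlations.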




## Display the detailed profile of a correlation

In [6]:
correlation_idx = 9
nexus_api.show_correlation_profile(correlations, correlation_idx)

Variable 1 - table id: asthma, aggregated table: asthma_Zip5_6, aggregated attribute: avg_encAsthmaExac
	 Missing value ratio: 0.0
	 zero value ratio: 0.02
Variable 2 - table id: 9xs2-f89t, aggregated table: 9xs2-f89t_location_6, aggregated attribute: avg_general_services_route_
	 Missing value ratio: 0.0
	 zero value ratio: 0.0
Correlation Profile
	Correlation coefficient: 0.603
	p value: 0.0
	Number of samples: 46
	Spatio-temporal key type: spatial


## Control for variables

Bob gets more than 200 correlations for the asthma dataset. After browsing several of them, he suspects that poverty might be driving these correlations, so he wants to control for the income level of each zipcode when calculating correlations. To do so, he can list the variables to control for in the `control_variables` parameter. After controlling for the median household income in each zipcode, far fewer correlations remain (21 in the run below).
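
One standard way to control for a variable is partial correlation: regress both variables on the control, then correlate the residuals. The sketch below illustrates that idea with made-up numbers and Pearson correlation. Whether Nexus uses exactly this procedure is an assumption, and `pearson` and `residuals` are hypothetical helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, controls):
    """Residuals of the least-squares fit ys ~ intercept + slope * controls."""
    n = len(ys)
    mean_c, mean_y = sum(controls) / n, sum(ys) / n
    slope = sum((c - mean_c) * (y - mean_y) for c, y in zip(controls, ys)) / sum(
        (c - mean_c) ** 2 for c in controls
    )
    intercept = mean_y - slope * mean_c
    return [y - (intercept + slope * c) for c, y in zip(controls, ys)]


# Made-up data: both variables are largely driven by the control variable.
control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
variable_a = [3.0, 3.0, 6.0, 8.0, 9.0, 13.0]
variable_b = [3.5, 6.5, 8.0, 11.0, 15.5, 18.5]

raw_r = pearson(variable_a, variable_b)
partial_r = pearson(residuals(variable_a, control), residuals(variable_b, control))
print(f"raw: {raw_r:.3f}, controlled: {partial_r:.3f}")
```

In this toy data the raw correlation is strong, but it vanishes once the shared dependence on the control is removed, which is the pattern Bob suspects for poverty.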

In [26]:
from nexus.utils.data_model import Variable

dataset = 'asthma'
temporal_granularity, spatial_granularity = None, SPATIAL_GRANU.ZIPCODE
overlap_threshold = 5
correlation_threshold = 0.5
control_variables = [Variable('chicago_income_by_zipcode_zipcode_6', 'avg_income_household_median')]
df_control = nexus_api.find_correlations_from(dataset, temporal_granularity, spatial_granularity, 
                                              overlap_threshold, correlation_threshold, 
                                              correlation_type="spearman", control_variables=control_variables)
dtale.show(df_control)

total number of correlations: 21




## Assemble a dataset from multiple variables

Bob identifies a few interesting correlations and wants to combine variables involved in these correlations to assemble a new dataset. Nexus provides data assembly APIs to make it easy for Bob.

Suppose Bob finds one of the correlations intriguing and wishes to explore the data used to calculate it. He can pass the corresponding row of the correlation table to Nexus to obtain the integrated dataset.

In [21]:
row_idx = 40
aligned, prov = nexus_api.get_joined_data_from_row(df_control.loc[row_idx])
dtale.show(aligned)



Nexus also offers the `join_and_project` API, which assembles a dataset from any given set of variables.

In [22]:
variables = [Variable('divg-mhqk_location_6', 'count'), Variable('4u6w-irs9_location_6', 'avg_square_feet')]
df, prov = nexus_api.join_and_project(variables)
dtale.show(df)



Nexus provides the data provenance information for all data assembly APIs.

In [10]:
print(prov)

SELECT "divg-mhqk_location_6".count,"4u6w-irs9_location_6".avg_square_feet FROM "divg-mhqk_location_6" INNER JOIN "4u6w-irs9_location_6" ON "divg-mhqk_location_6".val = "4u6w-irs9_location_6".val
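
The provenance string is plain SQL: an inner join of two aggregated tables on their shared key column `val`. The sketch below runs a query of the same shape against an in-memory SQLite database using Python's built-in `sqlite3` module. The table names `table_a` and `table_b`, the sample rows, and the choice of SQLite are assumptions for illustration; the notebook does not state which database engine Nexus uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical aggregated tables keyed by `val` (the zipcode).
conn.execute('CREATE TABLE "table_a" (val TEXT, count INTEGER)')
conn.execute('CREATE TABLE "table_b" (val TEXT, avg_square_feet REAL)')
conn.executemany(
    'INSERT INTO "table_a" VALUES (?, ?)',
    [("60604", 3), ("60605", 7), ("60606", 2)],
)
conn.executemany(
    'INSERT INTO "table_b" VALUES (?, ?)',
    [("60605", 1200.0), ("60606", 950.5), ("60607", 800.0)],
)

# Same shape as the provenance query, plus an ORDER BY for a stable result.
query = (
    'SELECT "table_a".count, "table_b".avg_square_feet '
    'FROM "table_a" INNER JOIN "table_b" '
    'ON "table_a".val = "table_b".val '
    'ORDER BY "table_a".val'
)
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Only keys present in both tables survive the inner join, so the assembled dataset covers the overlap of the two variables.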


## Regression Analysis

Once you have found several intriguing correlations, you may want to run a regression analysis on the variables of interest. First use Nexus's `join_and_project` function to assemble the necessary dataset, then apply any data analysis library. Here we illustrate the process with `scikit-learn`.

In [11]:
from sklearn import linear_model

dependent_variable = Variable('asthma_Zip5_6', 'avg_enc_asthma')
independent_variables = [Variable('ijzp-q8t2_location_6', 'count'), Variable('n26f-ihde_pickup_centroid_location_6', 'avg_tip')]

data_to_analyze, provenance = nexus_api.join_and_project([dependent_variable] + independent_variables)
# apply any data analysis method
regression_model = linear_model.LinearRegression() # OLS regression

x = data_to_analyze[[variable.attr_name for variable in independent_variables]]
y = data_to_analyze[dependent_variable.attr_name]
model = regression_model.fit(x, y)
r_squared = model.score(x, y)

print("coefficients of each independent variables:", model.coef_)
print("r square score:", r_squared)

coefficients of each independent variables: [ 3.17139472e-02 -5.13593106e+02]
r square score: 0.3408732623177211


# Persona 2: Data-Driven Hypothesis Generation

Amy, a researcher in Chicago, finds that [Chicago Open Data](https://data.cityofchicago.org) has many interesting datasets. She wants to know whether she can form new hypotheses from this large collection of data, that is, to find all correlations in Chicago Open Data and see whether any of them lead to new hypotheses or insights.

She can use the `find_all_correlations` API to identify all correlations within Chicago Open Data at the census tract and month granularity.

In [23]:
from nexus.utils.time_point import TEMPORAL_GRANU
from nexus.utils.coordinate import SPATIAL_GRANU
chicago_correlations = nexus_demo.find_all_correlations(TEMPORAL_GRANU.MONTH, SPATIAL_GRANU.TRACT)
print(f"Nexus found {len(chicago_correlations)} correlations in total")

Nexus found 40538 correlations in total


## Correlation Distillation Using Nexus Variable Clusters

Nexus found 40,538 correlations in total, which is an overwhelming number for users to discern interesting correlations manually.

Luckily, Nexus can distill the structure of correlations and extract a small number of variable clusters from the vast array of correlations. These variable clusters can help users identify causal links and confounders.

In [24]:
from demo.cluster_utils import CorrCommunity
from demo.demo_ui import show_communities

variable_clusters = nexus_demo.get_correlation_communities(chicago_correlations)
print(f"Nexus extracts {len(variable_clusters.comps)} variable clusters out of {len(chicago_correlations)} correlations")

Nexus extracts 23 variable clusters out of 40538 correlations
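
A simple way to picture this distillation is to treat each variable as a node and each reported correlation as an edge, then group variables that are connected. The sketch below does that with a small union-find. It is a stand-in for intuition only: the clustering method Nexus actually uses is not described in this notebook, and the `cluster_variables` helper and all variable names are invented.

```python
def cluster_variables(correlated_pairs):
    """Group variables into connected components of the correlation graph."""
    parent = {}

    def find(node):
        # Follow parent links to the root, compressing the path on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in correlated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical correlated variable pairs.
pairs = [
    ("ccvi_score", "divvy_docks"),
    ("divvy_docks", "taxi_trips"),
    ("crime_count", "vacant_buildings"),
]
toy_clusters = cluster_variables(pairs)
print(toy_clusters)
```

Three pairwise correlations collapse into two groups here. Summarizing tens of thousands of correlations as a few dozen clusters follows the same idea at a larger scale.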


### Examine Correlation Communities

Nexus reduces Amy's burden of examining correlations by extracting 23 clusters from the full set of correlations.

One cluster (Cluster 14) contains tables related to Divvy bike stations, taxi trips, and the Chicago COVID-19 Community Vulnerability Index (CCVI). The CCVI score measures a community's susceptibility to the negative impacts of COVID-19 based on various social and economic factors. A lower CCVI score means less vulnerability, indicating that an area has a higher socio-economic status.

The significant negative correlations between the CCVI score and Divvy bike docks inspire Amy to hypothesize that Divvy bike locations are biased towards richer areas. Notably, this hypothesis has been verified in existing studies [1].

[1] Elizabeth Flanagan et al. 2016. Riding tandem: Does cycling infrastructure investment mirror gentrification and privilege in Portland, OR and Chicago, IL? Research in Transportation Economics 60 (2016), 14–24.

In [25]:
show_communities(variable_clusters, show_corr_in_same_tbl=False)


### Notes on using factor analysis

Factor analysis aims to extract common factors from observed variables and represent existing variables using fewer factors. 

It takes as input a correlation matrix. It derives factors that are essentially linear combinations of the observed variables. These factors are crafted to closely approximate the original correlation matrix when observed variables are projected onto them. 

We also implement factor analysis in Nexus, but it has several limitations when applied on a large correlation matrix:

1. Assumption. Factor analysis assumes these correlations among observed variables are computed on the same set of samples. However, in our scenario, variables are from different datasets and aligned on different samples, which breaks this assumption.

2. Scalability Issue. Factor analysis does matrix decomposition and its runtime grows quadratically. It runs for 10 minutes on 556 variables.

3. Hard to determine the number of factors. Although there are methods for choosing the number of factors, they do not work well on large correlation matrices. For example, the most commonly used method (the Kaiser criterion) takes the number of eigenvalues of the correlation matrix that are greater than 1 as the number of factors. We tried this method and found that we needed 253 factors, far too many for a human to interpret.

4. Hard to determine the threshold for assigning variables to factors. There is no golden rule to determine this threshold. As a rule of thumb, 0.7 or higher factor loading represents that the factor extracts sufficient variance from that variable.
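
The eigenvalue rule from item 3 can be sketched with NumPy as follows. The 4x4 correlation matrix is a made-up toy example with two blocks of strongly related variables; it is not taken from the Nexus data, and this is not Nexus's factor analysis implementation.

```python
import numpy as np

# Hypothetical correlation matrix: variables 0-1 and 2-3 form two correlated pairs.
corr = np.array(
    [
        [1.0, 0.8, 0.0, 0.0],
        [0.8, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.8],
        [0.0, 0.0, 0.8, 1.0],
    ]
)

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(corr)

# Keep as many factors as there are eigenvalues greater than 1.
num_factors = int((eigenvalues > 1.0).sum())
print(eigenvalues, num_factors)
```

On this small, cleanly structured matrix the rule recovers the two underlying groups. On a large matrix built from many loosely related variables, many eigenvalues can exceed 1, which is how the rule ends up suggesting hundreds of factors.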

