___

<h1 align="center" style=font-size:52px>Demonstration of MATE and COCOA for Data Discovery</h1>

___

<p align="center" style=padding:50px>
  <img src="datalake_indexes_qr.png" width=160px/>
</p>

___

## Google Colab Setup

Uncomment and run this cell to set up the Google Colab environment for the demo.

In [1]:
#%cd /content
#! git clone https://github.com/LUH-DBS/datalake_indexes
#%cd datalake_indexes
#! git pull
#%pip install .

## General Setup
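Now, we select one of the following data lakes for the demonstration:
- GitTables (`"gittables"`)
- DWTC (`"webtable"`)
- German Open Data (`"open_data"`)

Uncommenting the corresponding line in the cell below (and commenting out the others) initializes a demo instance with that data lake.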

In [2]:
from maco.demo.datalake_indexes_demo import DatalakeIndexesDemo
import pandas as pd
from IPython.display import display, HTML

demo = DatalakeIndexesDemo("gittables")
# demo = DatalakeIndexesDemo("webtable")
# demo = DatalakeIndexesDemo("open_data")

___

# 1) Input Preparation

___

## Reading the input dataset

___

In [3]:
demo.load_dataset("who")

Unnamed: 0,Country,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2007.5,58.19375,269.0625,78.25,0.014375,34.96011,64.5625,2362.25,15.51875,107.5625,48.375,8.2525,52.3125,0.1,340.015425,9972259.8125,16.58125,15.58125,0.415375,8.2125
1,Albania,2007.5,75.15625,45.0625,0.6875,4.84875,193.259091,98.0,53.375,49.06875,0.9375,98.125,5.945625,98.0625,0.1,2119.726679,696911.625,1.61875,1.7,0.709875,12.1375
2,Algeria,2007.5,73.61875,108.1875,20.3125,0.668929,236.185241,78.735115,1943.875,48.74375,23.5,91.75,4.687387,91.875,0.1,2847.853392,21649827.4375,6.09375,5.975,0.694875,12.7125
3,Angola,2007.5,49.01875,328.5625,83.75,5.669554,102.100268,74.911452,3561.3125,18.01875,132.625,46.125,4.045512,47.6875,2.36875,1975.143045,10147099.1875,6.19375,6.66875,0.458375,8.04375
4,Antigua and Barbuda,2007.5,75.05625,127.5,0.0,7.740179,1001.585226,97.183779,0.0,38.425,0.0,96.9375,4.863012,98.3125,0.125,9759.305728,12753375.120052,3.425,3.375,0.488625,8.84375


___

# 2) Joinability Discovery

___

## Finding the top-20 joinable tables using the Super Key Index and MATE

___

In [None]:
demo.joinability_discovery(
    # number of tables to return
    k=20,   
    
    # number of candidates to evaluate
    k_c=200,  
    
    # minimum number of joinable rows per table
    min_join_ratio=0,     
    
    # use the Super Key to filter irrelevant candidates
    use_hash_optimization=True,  
    
    # use Bloom Filter to filter irrelevant candidates
    use_bloom_filter=False,
    
    # calculate hash online instead of fetching Super Key from DB
    online_hash_calculation=False
)

Preparing input dataset...
Done.
Fetching joinable tables based on first query column...
Done.
Running hash-based row filtering...


  0%|          | 0/200 [00:00<?, ?it/s]

Done.
Generating join maps...


___

## Inspecting the joinability scores for the retrieved joinable tables

___

In [None]:
#demo.plot_joinability_scores()

___

## Inspecting one of the retrieved joinable tables

___

In [None]:
demo.display_joinable_table(3)

___

# 3) Duplicate Detection using XASH

___

## Discovering duplicate tables and their relationship within the joinable tables

___

In [None]:
#demo.duplicate_detection().show("./maco/demo/nb.html")
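
As a rough illustration of what a relationship between two joinable tables can mean, the sketch below compares tables as sets of normalized rows. This is an assumption made for the example: it uses exact row tuples rather than XASH, and `row_fingerprints` and `table_relationship` are hypothetical helpers, not functions of the demo.

```python
def row_fingerprints(rows):
    """Represent a table as a set of normalized row tuples."""
    return {tuple(cell.strip().lower() for cell in row) for row in rows}


def table_relationship(table_a, table_b) -> str:
    """Classify how the rows of table_a relate to the rows of table_b."""
    rows_a, rows_b = row_fingerprints(table_a), row_fingerprints(table_b)
    if rows_a == rows_b:
        return "duplicate"
    if rows_a < rows_b:
        return "subset"
    if rows_a > rows_b:
        return "superset"
    if rows_a & rows_b:
        return "overlapping"
    return "disjoint"


full = [["Albania", "Tirana"], ["Angola", "Luanda"]]
copy = [["angola", "Luanda "], ["Albania", "Tirana"]]  # same rows, different order and casing
part = [["Albania", "Tirana"]]

relationship_copy = table_relationship(full, copy)
relationship_part = table_relationship(part, full)
```

Under this definition, a table classified as `"duplicate"` or `"subset"` of another contributes no new rows, which is one plausible reason to drop it from the result list.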

___

## Removing duplicates within the top joinable tables

___

In [None]:
#demo.remove_duplicates()

___

# 4) Correlation Calculation

___

## Obtaining the top-10 correlating features using the Order Index and COCOA

___

In [None]:
demo.correlation_calculation(
    # number of features to return
    k_c=10,
    
    # calculate order index online instead of fetching it from the DB
    online_index_generation=False
)
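
To make the ranking step more concrete, here is a minimal sketch of selecting the top-k features by rank correlation. It assumes that features are scored with Spearman's coefficient, computed as Pearson's coefficient on the ranks, and that the ranks are derived on the fly rather than read from a precomputed index. All function names are hypothetical and do not reflect how COCOA or the Order Index are implemented.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho as Pearson's r on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_features(target, features, k):
    """Return the k (name, rho) pairs with the largest absolute correlation."""
    scored = [(name, spearman(target, column)) for name, column in features.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]


# Toy values for illustration only
target = [1, 2, 3, 4, 5]
features = {
    "a": [10, 20, 30, 40, 50],  # perfectly increasing
    "b": [5, 4, 3, 2, 1],       # perfectly decreasing
    "c": [1, 3, 2, 5, 4],       # mostly increasing
}
top = top_k_features(target, features, k=2)
```

Sorting by the absolute value keeps strongly negative correlations in the result, since they are as informative for a downstream model as strongly positive ones.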

___

## Inspecting the correlation coefficients for the retrieved features

___

In [None]:
#demo.plot_correlation_coefficients()

___

## Materializing the join for the top-3 correlating features

___

In [None]:
demo.add_external_features([1, 2, 3])
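
Materializing a join here means attaching an external column to the input rows through the join key. The sketch below shows this as a plain left join over dictionaries, assuming a single key column. `add_feature`, the `capital` column, and its values are hypothetical and serve only as an illustration; the two input rows reuse values from the dataset preview.

```python
def add_feature(input_rows, key, feature_name, feature_by_key):
    """Left join: copy each input row and attach the feature value for its key.

    Rows without a match keep all original columns and get None for the feature.
    """
    return [
        {**row, feature_name: feature_by_key.get(row[key])}
        for row in input_rows
    ]


input_rows = [
    {"Country": "Albania", "Life expectancy": 75.15625},
    {"Country": "Angola", "Life expectancy": 49.01875},
]
external = {"Albania": "Tirana"}  # hypothetical external feature, keyed by country

enriched = add_feature(input_rows, "Country", "capital", external)
```

A left join keeps the number of input rows unchanged, so the enriched dataset can be compared row by row with the original one.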

___

## Inspecting the differences between Spearman and Pearson correlation coefficients

___

In [None]:
#demo.plot_spearman_pearson()

___

## Inspecting the correlation between input and top-correlating features

___

In [None]:
#demo.plot_correlation_heatmap()

___

## Comparing the RMSE of the model trained on the input and enriched dataset

___

In [None]:
demo.fit_and_evaluate_model()
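
For reference, the RMSE (root-mean-square error) is the square root of the mean squared difference between true and predicted values, so a lower value means a better fit. The sketch below computes it for two sets of predictions. The numbers are toy values chosen for illustration, not results of the demo, and `rmse` is a hypothetical helper.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only
y_true = [3.0, 5.0, 7.0]
pred_input_only = [2.0, 6.0, 9.0]  # hypothetical model trained on the input dataset
pred_enriched = [3.0, 5.0, 8.0]    # hypothetical model trained on the enriched dataset

rmse_input_only = rmse(y_true, pred_input_only)
rmse_enriched = rmse(y_true, pred_enriched)
improved = rmse_enriched < rmse_input_only
```

Both models have to be evaluated on the same held-out rows for such a comparison to be meaningful.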