___

<h1 align="center" style=font-size:52px>Demonstration of MATE and COCOA for Data Discovery</h1>

___

<p align="center" style=padding:50px>
  <img src="datalake_indexes_qr.png" width=160px/>
</p>

___

## Google Colab Setup

To set up the Google Colab environment for the demo, uncomment the lines in this cell and run it.

In [1]:
#%cd /content
#! git clone https://github.com/LUH-DBS/datalake_indexes
#%cd datalake_indexes
#! git pull
#%pip install .

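## General Setup

Next, we select one of the following data lakes for the demonstration:
- GitTables (`"gittables"`)
- DWTC (`"webtable"`)
- German Open Data (`"open_data"`)

Uncomment the corresponding line in the cell below to initialize the demo instance with that data lake. GitTables is selected by default.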

In [1]:
from maco.demo.datalake_indexes_demo import DatalakeIndexesDemo
import pandas as pd
from IPython.display import display, HTML

demo = DatalakeIndexesDemo("gittables")
# demo = DatalakeIndexesDemo("webtable")
# demo = DatalakeIndexesDemo("open_data")

___

# 1) Input Preparation

___

## Reading the input dataset

___

In [2]:
demo.load_dataset("movie")

| Unnamed: 0 | Movie Title | Director Name | IMDB Score |
|---|---|---|---|
| 0 | Unleashed | Louis Leterrier | 7.0 |
| 1 | Vaalu | Vijay Chandar | 5.1 |
| 2 | The Da Vinci Code | Ron Howard | 6.6 |
| 3 | Midnight in Paris | Woody Allen | 7.7 |
| 4 | Why Did I Get Married Too? | Tyler Perry | 4.4 |


___

# 2) Joinability Discovery

___

## Finding the top-20 joinable tables using the Super Key Index and MATE

___

In [None]:
demo.joinability_discovery(
    # number of tables to return
    k=20,   
    
    # number of candidates to evaluate
    k_c=200,  
    
    # minimum number of joinable rows per table
    min_join_ratio=0,     
    
    # use the Super Key to filter irrelevant candidates
    use_hash_optimization=True,  
    
    # use Bloom Filter to filter irrelevant candidates
    use_bloom_filter=False,
    
    # calculate hash online instead of fetching Super Key from DB
    online_hash_calculation=False
)

Preparing input dataset...
Done.
Fetching joinable tables based on first query column...
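
The `use_hash_optimization` option above relies on the Super Key, a per-row bit mask that allows rows to be discarded when they cannot contain a given combination of key values. The following is a minimal sketch of that filtering idea, not the MATE implementation: it assumes the super key is the bitwise OR of per-cell hashes, and it uses a toy SHA-256-based hash in place of XASH. The names `cell_hash`, `super_key`, and `may_contain`, as well as the hash width, are hypothetical.

```python
import hashlib

HASH_BITS = 128  # assumed width of the toy hash, chosen for illustration only


def cell_hash(value: str, bits_set: int = 3) -> int:
    """Toy stand-in for XASH: set a few bit positions derived from the value."""
    digest = hashlib.sha256(value.lower().encode("utf-8")).digest()
    result = 0
    for i in range(bits_set):
        result |= 1 << (digest[i] % HASH_BITS)
    return result


def super_key(row: list[str]) -> int:
    """Aggregate the hashes of all cells in a row with bitwise OR."""
    key = 0
    for cell in row:
        key |= cell_hash(cell)
    return key


def may_contain(row_super_key: int, query_values: list[str]) -> bool:
    """Return False only if the row certainly lacks one of the query values."""
    query_mask = super_key(query_values)
    return (row_super_key & query_mask) == query_mask


row = ["Unleashed", "Louis Leterrier", "7.0"]
row_key = super_key(row)

# Values that are in the row always pass the filter (no false negatives).
print(may_contain(row_key, ["Unleashed", "Louis Leterrier"]))
# Values that are not in the row are usually rejected, but bit collisions
# can let them through, so surviving rows still need an exact comparison.
print(may_contain(row_key, ["Vaalu", "Vijay Chandar"]))
```

Because the filter can produce false positives but never false negatives, it only reduces the number of rows that must be compared exactly; it does not replace that comparison.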


___

## Inspecting the joinability scores for the retrieved joinable tables

___

In [None]:
demo.plot_joinability_scores()
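
To make the plotted scores easier to interpret, here is a simplified sketch of a joinability score and the top-k ranking built on it. It assumes the score is the number of distinct composite key values of the query that also occur in a candidate table, for one fixed column mapping. How the demo computes its scores internally is not shown in this notebook, so `joinability`, `top_k_joinable`, and the candidate tables are hypothetical.

```python
def joinability(query_keys, candidate_keys):
    """Count distinct query key tuples that also occur in the candidate."""
    return len(set(query_keys) & set(candidate_keys))


def top_k_joinable(query_keys, candidates, k):
    """Rank candidate tables by joinability and keep the best k."""
    scores = {
        name: joinability(query_keys, keys) for name, keys in candidates.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Composite keys (Movie Title, Director Name) taken from the input dataset.
query = [
    ("Unleashed", "Louis Leterrier"),
    ("Vaalu", "Vijay Chandar"),
    ("The Da Vinci Code", "Ron Howard"),
]

# Made-up candidate tables, reduced to their key columns.
candidates = {
    "table_a": [("Unleashed", "Louis Leterrier"), ("The Da Vinci Code", "Ron Howard")],
    "table_b": [("Vaalu", "Vijay Chandar")],
    "table_c": [("Alien", "Ridley Scott")],
}

print(top_k_joinable(query, candidates, k=2))
```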

___

## Inspecting one of the retrieved joinable tables

___

In [None]:
demo.display_joinable_table(3)

___

# 3) Duplicate Detection using XASH

___

## Discovering duplicate tables and their relationship within the joinable tables

___

In [None]:
#demo.duplicate_detection().show("./maco/demo/nb.html")
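
One generic way to classify the relationship between two tables is to compare their sets of row fingerprints. The sketch below illustrates that idea under stated assumptions: it uses SHA-256 fingerprints instead of XASH, ignores row order, and treats column order as significant. It is not the method behind `duplicate_detection()`, and the function names are hypothetical.

```python
import hashlib


def row_fingerprints(rows):
    """Hash each row; row order is ignored, cell order within a row is kept."""
    return {
        hashlib.sha256("\x1f".join(map(str, row)).encode("utf-8")).hexdigest()
        for row in rows
    }


def relationship(rows_a, rows_b):
    """Classify how the row sets of two tables relate to each other."""
    a, b = row_fingerprints(rows_a), row_fingerprints(rows_b)
    if a == b:
        return "duplicate"
    if a < b:
        return "a is contained in b"
    if a > b:
        return "b is contained in a"
    if a & b:
        return "partial overlap"
    return "disjoint"


table_a = [("Unleashed", "Louis Leterrier"), ("Vaalu", "Vijay Chandar")]
table_b = [("Vaalu", "Vijay Chandar"), ("Unleashed", "Louis Leterrier")]
table_c = table_a + [("Midnight in Paris", "Woody Allen")]

print(relationship(table_a, table_b))
print(relationship(table_a, table_c))
```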

___

## Removing duplicates within the top joinable tables

___

In [None]:
#demo.remove_duplicates()

___

# 4) Correlation Calculation

___

## Obtaining the top-10 correlating features using the Order Index and COCOA

___

In [7]:
demo.correlation_calculation(
    # number of features to return
    k_c=10,
    
    # calculate order index online instead of fetching it from the DB
    online_index_generation=False
)

  0%|          | 0/20 [00:00<?, ?it/s]

--------------------------------------------
Runtime:
--------------------------------------------
Total runtime: 1.39s
Preparation runtime: 1.26s
Correlation calculation runtime: 0.13s

--------------------------------------------
Statistics:
--------------------------------------------
Evaluated features: 92
Max. correlation coefficient: 0.0270
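
The statistics above report the largest correlation coefficient among the evaluated features. As a point of reference, the ranking step can be sketched in plain Python, assuming that features are scored by the absolute value of Spearman's rank correlation with the target column. This sketch computes the ranks directly and does not use the Order Index, so it illustrates what is computed rather than how COCOA computes it efficiently. All function names and the sample feature values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_features(target, features, k):
    """Rank features by absolute Spearman correlation with the target."""
    scored = [(name, spearman(target, column)) for name, column in features.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]


imdb_score = [7.0, 5.1, 6.6, 7.7, 4.4]  # target column of the input dataset
features = {  # made-up external features, already aligned with the input rows
    "feature_a": [70, 51, 66, 77, 44],
    "feature_b": [1, 2, 3, 4, 5],
}
print(top_k_features(imdb_score, features, k=1))
```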


___

## Inspecting the correlation coefficients for the retrieved features

___

In [None]:
#demo.plot_correlation_coefficients()

___

## Materializing the join for the top-3 correlating features

___

In [None]:
demo.add_external_features([1, 2, 3])
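
Conceptually, materializing the join means attaching each selected external feature to the input rows through the join key. The sketch below shows a left join in plain Python, assuming the feature is available as a lookup from key tuple to value. It does not reflect the internals of `add_external_features`; the function name, the feature name, and the feature values are hypothetical.

```python
def left_join_feature(rows, key_columns, feature_lookup, feature_name):
    """Add one external feature to every input row; unmatched rows get None."""
    enriched = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        enriched.append({**row, feature_name: feature_lookup.get(key)})
    return enriched


input_rows = [
    {"Movie Title": "Unleashed", "Director Name": "Louis Leterrier", "IMDB Score": 7.0},
    {"Movie Title": "Vaalu", "Director Name": "Vijay Chandar", "IMDB Score": 5.1},
]

# Made-up external feature, keyed by (Movie Title, Director Name).
external_feature = {("Unleashed", "Louis Leterrier"): 1.0}

for row in left_join_feature(
    input_rows, ["Movie Title", "Director Name"], external_feature, "feature_1"
):
    print(row)
```

A left join keeps every input row, so rows without a join partner show up with a missing value that has to be handled before training a model.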

___

## Inspecting the differences between Spearman and Pearson correlation coefficients

___

In [None]:
#demo.plot_spearman_pearson()

___

## Inspecting the correlation between the input and the top-correlating features

___

In [None]:
#demo.plot_correlation_heatmap()

___

## Comparing the RMSE of models trained on the input and on the enriched dataset

___

In [None]:
demo.fit_and_evaluate_model()