# Dataset Discovery and Exploration: State-of-the-art, Challenges and Opportunities
## Part 1: Dataset Search
### Framework Overview -- D3L
#### Input Dataset
The input is a structured (tabular) dataset with several attributes. The table below is a placeholder for its first 10 rows, showing the structure and type of data involved:

| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |
| ...     | ...     | ...     |

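D3L measures how related two datasets are by combining several types of evidence, each backed by one of the indexes built below:

1. **Attribute Header Similarity**
2. **Value Similarity**
3. **Format Similarity**
4. **Value Distribution**
5. **Embedding Similarity**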
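
To make the link between these evidence types and the LSH indexes built later more concrete, here is a minimal sketch of how a set-based evidence type such as attribute header similarity could be compared. It assumes that column names are split into character q-grams and compared with Jaccard similarity, which MinHash signatures can approximate. The function names, the q-gram length, and the hashing scheme are illustrative assumptions, not D3L's actual implementation.

```python
import hashlib


def qgrams(name: str, q: int = 3) -> set:
    """Split a lowercased column name into overlapping character q-grams."""
    text = name.lower()
    if len(text) <= q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """Keep the minimum hash value per seed; each seed stands in for one random permutation."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Illustrative column names (not taken from the dataset).
a = qgrams("country_name")
b = qgrams("country_code")
print("exact Jaccard:", jaccard(a, b))
print("MinHash estimate:", estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

An LSH index groups signatures into buckets so that only columns likely to exceed a similarity threshold are compared, which is what keeps search over a large collection of tables fast.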

#### Output Datasets: Top-k search results

In [None]:
import sys
print(sys.executable)

##### Generate LSH indexes for all evidence in D3L

In [ ]:
# Import D3L and initialize the data loader
import os

from d3l.d3l.indexing.similarity_indexes import NameIndex, FormatIndex, ValueIndex, EmbeddingIndex, DistributionIndex
from d3l.d3l.input_output.dataloaders import CSVDataLoader
from d3l.d3l.querying.query_engine import QueryEngine
from d3l.d3l.utils.functions import pickle_python_object, unpickle_python_object

data_path = "Datasets"  # folder containing the collection of tables
threshold = 0.5  # similarity threshold passed to every index

dataloader = CSVDataLoader(
    root_path=data_path,
    encoding='utf-8'
)


##### Generating/loading NameIndex of tables

In [None]:
name_lsh = os.path.join(data_path, './Name.lsh')
if os.path.isfile(name_lsh):
    name_index = unpickle_python_object(name_lsh)
    print("Name LSH index: LOADED!")
else:
    name_index = NameIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(name_index, name_lsh)
    print("Name LSH index: SAVED!")




##### Generating/loading FormatIndex of tables

In [ ]:
format_lsh = os.path.join(data_path, './format.lsh')
if os.path.isfile(format_lsh):
    format_index = unpickle_python_object(format_lsh)
    print("Format LSH index: LOADED!")
else:
    format_index = FormatIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(format_index, format_lsh)
    print("Format LSH index: SAVED!")

##### Generating/loading ValueIndex of tables

In [ ]:
value_lsh = os.path.join(data_path, './value.lsh')
if os.path.isfile(value_lsh):
    value_index = unpickle_python_object(value_lsh)
    print("Value LSH index: LOADED!")
else:
    value_index = ValueIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(value_index, value_lsh)
    print("Value LSH index: SAVED!")

##### Generating/loading DistributionIndex of tables

In [ ]:
distribution_lsh = os.path.join(data_path, './distribution.lsh')
if os.path.isfile(distribution_lsh):
    distribution_index = unpickle_python_object(distribution_lsh)
    print("Distribution LSH index: LOADED!")
else:
    distribution_index = DistributionIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(distribution_index, distribution_lsh)
    print("Distribution LSH index: SAVED!")

##### Generating/loading EmbeddingIndex of tables

In [ ]:
embed_name = './embedding_.lsh'
embedding_lsh = os.path.join(data_path, embed_name)

if os.path.isfile(embedding_lsh):
    embedding_index = unpickle_python_object(embedding_lsh)
    print("Embedding LSH index: LOADED!")
else:
    embedding_index = EmbeddingIndex(dataloader=dataloader,
                                     index_similarity_threshold=threshold)
    pickle_python_object(embedding_index, embedding_lsh)
    print("Embedding LSH index: SAVED!")
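
The five cells above repeat the same load-or-build pattern. As a minimal sketch, that pattern could be factored into one helper. The helper name `load_or_build` is hypothetical, and the sketch uses the standard `pickle` module instead of D3L's `pickle_python_object` and `unpickle_python_object`, so whether the two produce interchangeable files is an assumption to check before mixing them.

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(path: str, build: Callable[[], Any]) -> Any:
    """Return the object pickled at `path`; if the file is missing, build it, save it, and return it."""
    if os.path.isfile(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    obj = build()  # the expensive step, e.g. constructing an index
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return obj
```

With such a helper, each index cell would shrink to a single call, for example `load_or_build(name_lsh, lambda: NameIndex(dataloader=dataloader, index_similarity_threshold=threshold))`.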


##### Show the input table

In [None]:
searched_table = 'T2DV2_122'
table_df = dataloader.read_table(searched_table)
print(table_df.head(3))

In [ ]:
# Top-k search results over all five indexes, with k = 10
qe = QueryEngine(name_index, format_index, value_index, embedding_index, distribution_index)
results, extended_results = qe.table_query(table=dataloader.read_table(table_name=searched_table),
                                           aggregator=None, k=10, verbose=True)
results
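
The query above is run with `aggregator=None`. Assuming this means each candidate table comes back with one score per evidence type rather than a single combined score, the sketch below shows one way such scores could be collapsed into a ranking. The function name `rank_by_aggregate`, the table names, and the scores are made up for illustration and do not come from D3L or from the dataset.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def rank_by_aggregate(
    scores: Dict[str, Sequence[float]],
    aggregator: Callable[[Sequence[float]], float] = mean,
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Collapse each table's per-evidence scores into one value and return the k best tables."""
    ranked = sorted(
        ((table, aggregator(values)) for table, values in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Made-up scores: one value per evidence type for each candidate table.
toy_scores = {
    "table_a": [0.9, 0.4, 0.7, 0.6, 0.5],
    "table_b": [0.2, 0.3, 0.1, 0.4, 0.2],
    "table_c": [0.8, 0.8, 0.9, 0.7, 0.6],
}
print(rank_by_aggregate(toy_scores, k=2))
```

The choice of aggregator changes the ranking: a mean rewards tables that are moderately similar on every evidence type, while `max` rewards a single strong match.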

In [None]:
# Individual index queries, using the first column of the input table as the query
k = 10
query_column = table_df.columns[0]
query_values = table_df[query_column].dropna().astype(str).tolist()

# Name index query: the query is a column name; tokenization is performed automatically.
name_results = name_index.query(query=query_column, k=k)

# Format index query: the query is a collection of string values; format descriptors are extracted automatically.
format_results = format_index.query(query=query_values, k=k)

# Value index query: the query is a collection of string values; value pre-processing is performed automatically.
value_results = value_index.query(query=query_values, k=k)

# Embedding index query: the query is a collection of string values; embeddings are extracted automatically.
embedding_results = embedding_index.query(query=query_values, k=k)

# Distribution index query: the query is a collection of numerical values; the distribution is extracted automatically.
numeric_columns = table_df.select_dtypes(include="number").columns
distribution_results = (
    distribution_index.query(query=table_df[numeric_columns[0]].dropna().tolist(), k=k)
    if len(numeric_columns) > 0 else []
)

## Part 2: Dataset Navigation
### Framework Overview -- Aurum
To be completed: this part is still in progress.

## Part 3: Dataset Annotation
### Framework Overview -- TableMiner+
#### Input dataset: 26 tables from 13 domains, 2 tables per domain (TBC)
The 13 domains include:
1. **Airport**
2. **City**
3. **CollegeOrUniversity**
4. **Company**
5. **Continent**
6. **Country**
7. **Hospital**
8. **LandmarksOrHistoricalBuildings**
9. **Monarch**
10. **Movie**
11. **Museum**
12. **Scientist**
13. **VideoGame**

TableMiner+ has 4 steps:
1. **Subject column detection**, including column (data) type detection
2. **NE-column interpretation, the LEARNING phase:**
   - 2.1 preliminary cell annotation
   - 2.2 column semantic type annotation
   - 2.3 property annotation
3. **NE-column interpretation, the UPDATE phase:** revise the annotations until they have all stabilized
4. **Relation enumeration and annotation of literal columns** (not included yet)
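
The UPDATE phase in step 3 is an iterate-until-stable loop. The sketch below illustrates only that control flow, under the assumption that annotations can be compared for equality between rounds. The names `update_until_stable` and `toy_revise`, and the toy revision rule, are hypothetical and are not TableMiner+'s actual revision logic.

```python
from typing import Callable, Dict, Tuple

Annotations = Dict[str, str]  # column name -> winning concept


def update_until_stable(
    initial: Annotations,
    revise: Callable[[Annotations], Annotations],
    max_iterations: int = 10,
) -> Tuple[Annotations, int]:
    """Apply `revise` until the annotations stop changing or the iteration cap is reached."""
    current = dict(initial)
    for iteration in range(1, max_iterations + 1):
        revised = revise(current)
        if revised == current:  # nothing changed in this round: stabilized
            return current, iteration
        current = revised
    return current, max_iterations


def toy_revise(annotations: Annotations) -> Annotations:
    """Toy rule: a generic 'Place' column becomes 'City' when the table also has a 'Country' column."""
    revised = dict(annotations)
    if "Country" in annotations.values():
        for column, concept in annotations.items():
            if concept == "Place":
                revised[column] = "City"
    return revised


final, iterations = update_until_stable({"col0": "Place", "col1": "Country"}, toy_revise)
print(final, iterations)
```

The iteration cap guards against revision rules that oscillate and never reach a stable state.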

##### Show the example annotation table

In [ ]:
import pandas as pd
from TableMiner.LearningPhase.Update import TableLearning, updatePhase
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
Table = pd.read_csv(r"E:\Project\EDBTDemo\Datasets\T2DV2_122.csv")  # 125
print(Table, "\n")

##### Perform NE-Column interpretation - the LEARNING phase (table learning includes subject column detection)

In [ ]:
tableLearning = TableLearning(Table)
tableLearning.table_learning()

##### Perform NE-Column interpretation - the UPDATE phase

In [ ]:
updatePhase(tableLearning)

##### Check the column annotations (this needs refactoring)

In [ ]:
annotation_class = tableLearning.get_annotation_class()
for learning_class in annotation_class.values():
    column = learning_class.get_column()
    print(f"column is {column}")
    cell_annotation = learning_class.get_cell_annotation()
    column_semantics = learning_class.get_winning_concepts()
    print(f"Cell annotation of the column: {cell_annotation}")
    print(f"Column semantic type of the column: {column_semantics}")

## Part 4: Schema Inference
### Framework Overview -- Starmie
#### Input dataset: all 200 tables covering 13 specific domains
Starmie performs column clustering on the embedding of each column.
Embedding generation: Description ...  
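
To make the idea of clustering columns by their embeddings concrete, here is a minimal sketch that groups columns whose embeddings have a cosine similarity above a threshold. This section does not describe how Starmie generates embeddings or which clustering algorithm it uses, so the single-link grouping, the threshold, the function names, and the toy vectors below are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_columns(embeddings: Dict[str, Sequence[float]], threshold: float = 0.9) -> List[List[str]]:
    """Single-link clustering: columns joined by a chain of similarities >= threshold share a cluster."""
    parent = {name: name for name in embeddings}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(embeddings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up two-dimensional embeddings; real column embeddings would have many more dimensions.
toy_embeddings = {
    "airports.city": [1.0, 0.1],
    "hospitals.town": [0.9, 0.2],
    "movies.year": [0.0, 1.0],
}
print(cluster_columns(toy_embeddings))
```

Each resulting cluster can be read as a candidate attribute of an inferred schema, since it gathers columns from different tables that appear to carry the same kind of information.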