# Dataset Discovery and Exploration: State-of-the-art, Challenges and Opportunities
## Part 1: Dataset Search
### Framework Overview -- D3L
#### Input Dataset
The input is a collection of structured tables, each with several attributes. The table below is a placeholder for the first 10 rows of one such table and shows the expected layout:

| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |
| ...     | ...     | ...     |

D3L measures how related two attributes are using five types of evidence:

1. **Attribute Header Similarity**
2. **Value Similarity**
3. **Format Similarity**
4. **Value Distribution**
5. **Attribute Value Embeddings**
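
To make the evidence types above more concrete, here is a minimal sketch of three of them (header, value, and format similarity) computed as Jaccard overlaps. The q-gram tokenization, the pattern rules in `format_descriptor`, and all function names are illustrative assumptions; they are not taken from the D3L code base.

```python
import re

def qgrams(text, q=4):
    """Lower-case the text and return its set of overlapping q-grams."""
    text = text.lower()
    if len(text) <= q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def format_descriptor(value):
    """Collapse a value into a coarse pattern: letter runs become A, digit runs become 9."""
    pattern = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", pattern)

# Header similarity: overlap between the q-grams of two column names.
header_score = jaccard(qgrams("Fans' Rank"), qgrams("Rank"))

# Value similarity: overlap between the value sets of two columns.
value_score = jaccard({"london", "paris", "rome"}, {"paris", "rome", "berlin"})

# Format similarity: overlap between the sets of format patterns.
format_score = jaccard(
    {format_descriptor(v) for v in ["2021-03-04", "1999-12-31"]},
    {format_descriptor(v) for v in ["2020-01-01"]},
)

print(header_score, value_score, format_score)
```

Each score lies in [0, 1]; two columns can be close on one type of evidence and far apart on another, which is why several types are combined.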
#### Output Datasets: Top-k Search Results

##### Generate LSH indexes for all evidence in D3L

In [31]:
# import and initialize D3L
from d3l.indexing.similarity_indexes import NameIndex, FormatIndex, ValueIndex, EmbeddingIndex, DistributionIndex
from d3l.input_output.dataloaders import CSVDataLoader
from d3l.querying.query_engine import QueryEngine
from d3l.utils.functions import pickle_python_object, unpickle_python_object
import os
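
data_path = "Datasets"    # folder containing the CSV tables to index
result_path = "Result"    # folder where the LSH indexes are cached
threshold = 0.5           # passed to every index as index_similarity_threshold

# Load the collection of tables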
dataloader = CSVDataLoader(
        root_path=data_path,
        encoding='utf-8'
)
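
The `threshold` set above is later passed to each index as `index_similarity_threshold`. The sketch below illustrates the general idea behind MinHash, the kind of signature that LSH indexes over sets are typically built on: two sets are compared through short signatures instead of their full contents. It is a stand-alone illustration under assumed names (`minhash_signature`, `estimate_jaccard`) and does not reproduce D3L's internals.

```python
import hashlib

def minhash_signature(items, num_perm=256):
    """Return a MinHash signature for a non-empty set of strings.

    For each seed, every item is hashed and the minimum hash is kept.
    """
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = {f"v{i}" for i in range(100)}        # v0 .. v99
b = {f"v{i}" for i in range(50, 150)}    # v50 .. v149, true Jaccard = 50 / 150
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))

# A pair is kept as a candidate only if its estimated similarity reaches the threshold.
is_candidate = estimate >= 0.5
print(round(estimate, 3), is_candidate)
```

The estimate is approximate; more hash functions (`num_perm`) reduce its variance at the cost of longer signatures.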


##### Generating/loading NameIndex of tables

In [28]:
name_lsh = os.path.join(result_path, 'Name.lsh')
print(name_lsh)
if os.path.isfile(name_lsh):
    name_index = unpickle_python_object(name_lsh)
    print("Name LSH index: LOADED!")
else:
    name_index = NameIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(name_index, name_lsh)
    print("Name LSH index: SAVED!")

Result\Name.lsh
Name LSH index: LOADED!


##### Generating/loading FormatIndex of tables

In [None]:
format_lsh = os.path.join(result_path, 'format.lsh')
if os.path.isfile(format_lsh):
    format_index = unpickle_python_object(format_lsh)
    print("Format LSH index: LOADED!")
else:
    format_index = FormatIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(format_index, format_lsh)
    print("Format LSH index: SAVED!")

##### Generating/loading ValueIndex of tables

In [29]:
value_lsh = os.path.join(result_path, 'value.lsh')
if os.path.isfile(value_lsh):
    value_index = unpickle_python_object(value_lsh)
    print("Value LSH index: LOADED!")
else:
    value_index = ValueIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(value_index, value_lsh)
    print("Value LSH index: SAVED!")

Value LSH index: LOADED!


##### Generating/loading DistributionIndex of tables

In [None]:
# DistributionIndex
distribution_lsh = os.path.join(result_path, 'distribution.lsh')
if os.path.isfile(distribution_lsh):
    distribution_index = unpickle_python_object(distribution_lsh)
    print("Distribution LSH index: LOADED!")
else:
    distribution_index = DistributionIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(distribution_index, distribution_lsh)
    print("Distribution LSH index: SAVED!")

##### Generating/loading EmbeddingIndex of tables

In [None]:
embed_name = 'embedding_.lsh'
embedding_lsh = os.path.join(result_path, embed_name)

if os.path.isfile(embedding_lsh):
    embedding_index = unpickle_python_object(embedding_lsh)
    print("Embedding LSH index: LOADED!")
else:
    embedding_index = EmbeddingIndex(dataloader=dataloader,
                                     index_similarity_threshold=threshold)
    pickle_python_object(embedding_index, embedding_lsh)
    print("Embedding LSH index: SAVED!")
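
The five cells above all follow the same load-or-build pattern. A minimal sketch of how that pattern could be factored into one helper is shown below. The helper name `load_or_build` is hypothetical, and it uses the standard `pickle` module directly rather than D3L's `pickle_python_object` and `unpickle_python_object`; a plain dictionary stands in for an index object.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return (object, status): load the pickle at `path` if present, else build and save it."""
    if os.path.isfile(path):
        with open(path, "rb") as handle:
            return pickle.load(handle), "LOADED"
    obj = build()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return obj, "SAVED"

# Demonstration with a dictionary standing in for an index.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.lsh")
    first, first_status = load_or_build(demo_path, lambda: {"threshold": 0.5})
    second, second_status = load_or_build(demo_path, lambda: {"threshold": 0.9})

print(first_status, second_status, second)
```

In the notebook, `build` could be something like `lambda: NameIndex(dataloader=dataloader, index_similarity_threshold=threshold)`. Note that a cached index is returned as-is, so changing `threshold` has no effect until the cached file is deleted.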


##### Show the input table

In [None]:
searched_table = 'T2DV2_122'
table_df = dataloader.read_table(searched_table)
print(table_df.head(3))

##### Query the table using all of the above indexes

In [None]:
# Search for the top-k related tables (k = 10) using all five indexes
qe = QueryEngine(name_index, format_index, value_index, embedding_index, distribution_index)
results, extended_results = qe.table_query(table=dataloader.read_table(table_name=searched_table),
                                           aggregator=None, k=10, verbose=True)
print(results)
# results[1][0] is the name of the second-ranked table; load it for inspection
table_df = dataloader.read_table(results[1][0])

In [None]:
print(table_df)
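
With `aggregator=None`, the per-evidence scores are left uncombined. The sketch below shows one simple way to rank tables from such scores by averaging them. It assumes each entry has the shape `(table_name, scores)`, with one score per index; that shape and the toy data are assumptions for illustration, so check the printed `results` before relying on it.

```python
from statistics import mean

def rank_by_mean(results, k=3):
    """Rank (table_name, scores) pairs by the mean of their scores, highest first."""
    ranked = [(name, mean(scores)) for name, scores in results]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy scores: one value per type of evidence.
toy_results = [
    ("table_a", [1.0, 0.5, 0.0]),
    ("table_b", [0.9, 0.9, 0.9]),
    ("table_c", [0.2, 0.1, 0.3]),
]
top = rank_by_mean(toy_results, k=2)
print(top)
```

Averaging treats every type of evidence as equally important; a weighted mean is a natural variation when some evidence is known to be more reliable for a given data lake.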

##### Individual search using different methods

In [None]:
# Individual search results: query each index on its own.
# The first column of the query table is used purely for illustration.
query_df = dataloader.read_table(table_name=searched_table)
column_name = query_df.columns[0]
column_values = query_df[column_name].dropna().astype(str).tolist()
k = 10

# Name index query: the query is a column name; tokenization is performed automatically.
name_results = name_index.query(query=column_name, k=k)

# Format index query: the query is a collection of string values; format descriptors are extracted automatically.
format_results = format_index.query(query=column_values, k=k)

# Value index query: the query is a collection of string values; value pre-processing is performed automatically.
value_results = value_index.query(query=column_values, k=k)

# Embedding index query: the query is a collection of string values; embeddings are extracted automatically.
embedding_results = embedding_index.query(query=column_values, k=k)

# Distribution index query: the query is a collection of numerical values; the distribution is extracted automatically.
numeric_df = query_df.select_dtypes(include="number")
if not numeric_df.empty:
    distribution_results = distribution_index.query(query=numeric_df.iloc[:, 0].dropna().tolist(), k=k)

## Part 2: Dataset Navigation
### Framework Overview -- Aurum

#### Prerequisites: detect subject columns and column types

In [None]:
import os
import pickle

import pandas as pd
from TableMiner.SCDection.TableAnnotation import TableColumnAnnotation as TA

# Annotate every table in the collection and cache the results in one dictionary
table_names = os.listdir("Datasets")
table_dict = {}
for table_name in table_names:
    table_dict[table_name] = []
    table = pd.read_csv(os.path.join("Datasets", table_name))
    try:
        annotation = TA(table)
        annotation.subcol_Tjs()
        table_dict[table_name].append(annotation.annotation)
        table_dict[table_name].append(annotation.column_score)
    except Exception as error:
        # Report the reason instead of silently swallowing every exception
        print(f"table {table_name} fails: {error}")
with open("Result/dict.pkl", "wb") as save_file:
    pickle.dump(table_dict, save_file)

("T2DV2_1.csv.Fans' Rank", 0     442
1     267
2     505
3     175
4     486
     ... 
95     44
96     78
97     74
98    495
99    281
Name: Fans' Rank, Length: 100, dtype: int64)
T2DV2_162.Fans' Rank [1.0, 0.0]
T2DV2_137.Fans' Rank [1.0, 0.0]
T2DV2_100.Fans' Rank [1.0, 0.0]
T2DV2_75.Fans' Rank [1.0, 0.0]
T2DV2_235.Fans' Rank [1.0, 0.0]
T2DV2_133.Fans' Rank [1.0, 0.0]
T2DV2_1.Fans' Rank [1.0, 0.0]
T2DV2_77.Rank [0.524, 0.0]
T2DV2_187.Rank [0.524, 0.0]
T2DV2_55.Rank [0.524, 0.0]
T2DV2_205.Rank [0.524, 0.0]
T2DV2_81.Rank [0.524, 0.0]
T2DV2_19.Rank [0.524, 0.0]
T2DV2_22.Rank [0.524, 0.0]
T2DV2_221.Rank [0.524, 0.0]
T2DV2_146.Rank [0.524, 0.0]
T2DV2_203.Rank [0.524, 0.0]
T2DV2_214.Rank [0.524, 0.0]
T2DV2_193.Rank [0.524, 0.0]
T2DV2_122.Rank [0.524, 0.0]
T2DV2_164.Rank [0.524, 0.0]
T2DV2_145.Rank [0.524, 0.0]
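
The prerequisite step in this part relies on `TableColumnAnnotation` to type the columns and find the subject column. As a rough illustration of what such a step decides, the sketch below uses a much simpler heuristic: among text columns, pick the one with the highest share of distinct values. This heuristic, the function names, and the toy table are assumptions for illustration only and are not the TableMiner implementation.

```python
def column_type(values):
    """Classify a column as 'numeric', 'text', or 'empty' by parsing its non-blank cells."""
    cells = [str(v).strip() for v in values if str(v).strip()]
    if not cells:
        return "empty"

    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    numeric = sum(is_number(cell) for cell in cells)
    return "numeric" if numeric / len(cells) >= 0.9 else "text"

def guess_subject_column(table):
    """Pick the text column with the highest fraction of distinct values; ties go to the leftmost."""
    best_name, best_score = None, -1.0
    for name, values in table.items():
        if column_type(values) != "text":
            continue
        score = len(set(values)) / len(values)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

toy_table = {
    "Rank": [1, 2, 3, 4],
    "Company": ["Acme", "Globex", "Initech", "Umbrella"],
    "Country": ["US", "US", "US", "UK"],
}
subject = guess_subject_column(toy_table)
print(subject)
```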


## Part 3: Dataset Annotation
### Framework Overview -- TableMiner+
#### Input dataset: 26 tables from 13 domains, 2 tables per domain (TBC)
The 13 domains include:
1. **Airport**
2. **City**
3. **CollegeOrUniversity**
4. **Company**
5. **Continent**
6. **Country**
7. **Hospital**
8. **LandmarksOrHistoricalBuildings**
9. **Monarch**
10. **Movie**
11. **Museum**
12. **Scientist**
13. **VideoGame**

TableMiner+ has 4 steps:
1. **Subject column detection**, including column (data) type detection
2. **NE-column interpretation, the LEARNING phase:**
   1. Preliminary cell annotation
   2. Column semantic type annotation
   3. Property annotation
3. **NE-column interpretation, the UPDATE phase:** revise the annotations until they all stabilize
4. **Relation enumeration and literal-column annotation** (not included yet)
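
To illustrate how cell annotations can feed a column semantic type in the learning phase, here is a minimal sketch in which each annotated cell votes for its candidate concepts and the most frequent concept wins. Plain majority voting is a simplifying assumption that stands in for TableMiner+'s own scoring; the function name and toy data are hypothetical.

```python
from collections import Counter

def vote_for_concepts(cell_concepts):
    """Return the concept(s) with the most votes across the annotated cells of one column."""
    votes = Counter(concept for concepts in cell_concepts for concept in concepts)
    if not votes:
        return []
    top = max(votes.values())
    return sorted(concept for concept, count in votes.items() if count == top)

# Each inner list holds the candidate concepts of one cell in the column.
toy_cells = [["City", "Place"], ["City"], ["City", "Place"], ["Film"]]
column_concepts = vote_for_concepts(toy_cells)
print(column_concepts)
```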

##### Show the example annotation table

In [ ]:
import pandas as pd
from TableMiner.LearningPhase.Update import TableLearning, updatePhase
# Raw string so the backslashes in the Windows path are not read as escape sequences
Table = pd.read_csv(r"E:\Project\EDBTDemo\Datasets\T2DV2_122.csv") #125
print(Table, "\n")

##### Perform NE-column interpretation (table learning includes subject column detection)

In [ ]:
tableLearning = TableLearning(Table)
tableLearning.table_learning()

##### Perform NE-Column interpretation - the UPDATE phase

In [ ]:
updatePhase(tableLearning)
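
The update phase revises the annotations until they stabilize. The sketch below shows that control flow in isolation as a generic fixed-point loop. The revision rule `toy_revise` is a hypothetical placeholder, and nothing here reflects how `updatePhase` is actually implemented.

```python
def run_until_stable(annotations, revise, max_rounds=10):
    """Apply `revise` repeatedly until the annotations stop changing or `max_rounds` is reached."""
    for round_number in range(1, max_rounds + 1):
        revised = revise(annotations)
        if revised == annotations:
            return annotations, round_number
        annotations = revised
    return annotations, max_rounds

def toy_revise(annotations):
    """Hypothetical rule: relabel every cell with the most common label in the column."""
    labels = list(annotations.values())
    most_common = max(set(labels), key=labels.count)
    return {cell: most_common for cell in annotations}

start = {"row0": "City", "row1": "City", "row2": "Town"}
final, rounds = run_until_stable(start, toy_revise)
print(final, rounds)
```

The `max_rounds` cap guards against revision rules that oscillate instead of converging.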

##### Check the column annotations (this needs refactoring)

In [ ]:
annotation_class = tableLearning.get_annotation_class()
# Print the cell-level and column-level annotations of every annotated column
for column_index, learning_class in annotation_class.items():
    column = learning_class.get_column()
    print(f"column is {column}")
    cell_annotation = learning_class.get_cell_annotation()
    column_semantics = learning_class.get_winning_concepts()
    print(f"Cell annotation of the column: {cell_annotation}")
    print(f"Column semantic type of the column: {column_semantics}")

## Part 4: Schema Inference
### Framework Overview -- Starmie
#### Input dataset: all 200 tables covering 13 specific domains
Starmie performs column clustering on the embedding of each column.
Embedding generation: Description ...  
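
The cells in this part request the "Agglomerative" clustering method. As a minimal sketch of what agglomerative clustering over embeddings does, the pure-Python example below repeatedly merges the two closest clusters until the requested number remains. Average linkage and cosine distance are assumptions chosen for illustration, the vectors are toy data, and non-zero vectors are assumed; this is not the code behind `typeInference` or `conceptualAttri`.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def agglomerative(vectors, num_clusters):
    """Average-linkage agglomerative clustering; returns clusters as sorted lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the members of the two clusters
                dist = sum(cosine_distance(vectors[a], vectors[b])
                           for a in clusters[i] for b in clusters[j])
                dist /= len(clusters[i]) * len(clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(cluster) for cluster in clusters)

toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_clusters = agglomerative(toy_embeddings, num_clusters=2)
print(toy_clusters)
```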

##### Cluster table embeddings to detect the types/domains of tables

In [None]:
from clustering import typeInference, result_precision, conceptualAttri, current_dir_path

# Cluster the table embeddings into 13 groups
clustering_result = typeInference("Result\\tableEmbeddings.pkl", "Agglomerative", numEstimate=13)
print(clustering_result)
# Use a new variable name so the imported result_precision function is not overwritten
precision = result_precision(clustering_result)
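
The body of `result_precision` is not shown in this notebook. One common way to score a clustering against known domain labels is purity: the fraction of items that carry the majority label of their cluster. The sketch below computes it on toy data; the function name, the input shapes, and the choice of purity itself are assumptions and may differ from what `result_precision` computes.

```python
from collections import Counter

def cluster_purity(clusters, true_labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    total = sum(len(members) for members in clusters.values())
    majority = sum(
        Counter(true_labels[member] for member in members).most_common(1)[0][1]
        for members in clusters.values() if members
    )
    return majority / total

# Cluster id -> table names, and table name -> known domain (toy data).
toy_clusters = {0: ["t1", "t2", "t3"], 1: ["t4", "t5"]}
toy_labels = {"t1": "City", "t2": "City", "t3": "Movie", "t4": "Movie", "t5": "Movie"}
purity = cluster_purity(toy_clusters, toy_labels)
print(purity)
```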

##### Cluster column embeddings to detect column types

In [ ]:
column_clustering = conceptualAttri(os.path.join(current_dir_path, "Datasets"),
                os.path.join(current_dir_path, "Result/tableEmbeddings.pkl"),
                clustering_method="Agglomerative",
                numEstimate=12)
for column_index, column_clusters in column_clustering.items():
    print(column_index)