# Dataset Discovery and Exploration: State-of-the-art, Challenges and Opportunities
## Part 1: Dataset Search
### Framework Overview -- D3L


Our demo utilizes structured data derived from the Web Data Commons project, focusing on:
- **T2Dv2 Gold Standard for Matching Web Tables to DBpedia**: 108 tables from 9 entity classes. [Access here](https://webdatacommons.org/webtables/goldstandardV2.html).
- **Schema.org Table Corpus 2023**: 92 tables from 8 entity classes. [Access here](https://webdatacommons.org/structureddata/schemaorgtables/2023/index.html#toc3).
#### Input Dataset
The input dataset consists of structured data with various attributes. Below is a glimpse of the top 5 rows, showcasing the structure and type of data we are dealing with:

| Rank | Title                                | Category         | Publisher |
|------|--------------------------------------|------------------|-----------|
| 1    | Super Smash Bros. Melee              | Fighting         | Nintendo  |
| 2    | Pikmin 2                             | Strategy/Sim     | Nintendo  |
| 3    | Legend of Zelda: Collector's Edition | RPG              | Nintendo  |
| 4    | Legend of Zelda: The Wind Waker      | Action Adventure | Nintendo  |
| 5    | Metal Gear Solid: Twin Snakes        | Action Adventure | Konami    |



D3L utilizes a comprehensive approach based on:

1. **Attribute Header Similarity**
2. **Value Similarity**
3. **Format Similarity**
4. **Value Distribution**
5. **Attribute Value Embeddings**
#### Output Datasets: Top-k search results



In [26]:
# Download the required NLTK resources
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Autoload all modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\raulm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\raulm\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\raulm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


##### Generate LSH indexes for all evidence in D3L

In [27]:
from Utils import mkdir

# Import and initialize modules
from d3l.indexing.similarity_indexes import NameIndex, FormatIndex, ValueIndex, EmbeddingIndex, DistributionIndex
from d3l.input_output.dataloaders import CSVDataLoader
from d3l.querying.query_engine import QueryEngine
from d3l.utils.functions import pickle_python_object, unpickle_python_object
import os
import pandas as pd

data_path = "Datasets"
result_path = "Result/"
threshold = 0.5
mkdir(result_path)

dataloader = CSVDataLoader(
        root_path=data_path,
        encoding='utf-8'
)

# Metrics
dataloader.print_table_statistics()


Number of columns: 1413
Number of rows: 47825
Total number of cells: 364049



##### Generating/loading NameIndex of tables
Name index: uses q-gram analysis of attribute names and measures the Jaccard distance between their q-gram sets.

In [28]:
name_lsh = os.path.join(result_path, 'Name.lsh')
print(name_lsh)
if os.path.isfile(name_lsh):
    name_index = unpickle_python_object(name_lsh)
    print("Name LSH index: LOADED!")
else:
    name_index = NameIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(name_index, name_lsh)
    print("Name LSH index: SAVED!")

Result/Name.lsh
Name LSH index: LOADED!
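
The q-gram comparison behind the name index can be illustrated with a minimal, self-contained sketch. It is not D3L's implementation: the gram length `q=3`, the lowercasing, and the helper names `qgrams` and `jaccard_similarity` are assumptions made for illustration. The Jaccard distance is one minus the similarity computed here.

```python
def qgrams(name: str, q: int = 3) -> set:
    """Return the set of overlapping q-grams of a lowercased attribute name."""
    text = name.lower()
    if len(text) <= q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_similarity(a: set, b: set) -> float:
    """|a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Header names that occur in the tables of this demo
print(jaccard_similarity(qgrams("Title"), qgrams("title")))  # identical after lowercasing
print(jaccard_similarity(qgrams("Title"), qgrams("Titel")))  # only the gram "tit" is shared
```

An LSH index avoids computing this for every pair of names: it hashes the q-gram sets so that names whose similarity exceeds the threshold are likely to land in the same bucket.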


##### Generating/loading FormatIndex of tables
Format index: identifies the data formats of attribute values using regular expressions.

In [29]:
format_lsh = os.path.join(result_path, 'format.lsh')
if os.path.isfile(format_lsh):
    format_index = unpickle_python_object(format_lsh)
    print("Format LSH index: LOADED!")
else:
    format_index = FormatIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(format_index, format_lsh)
    print("Format LSH index: SAVED!")

Format LSH index: LOADED!
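
As a rough illustration of regex-based format detection, the sketch below maps each value to a coarse pattern string and represents a column by its set of patterns. The character classes (`U` uppercase, `L` lowercase, `N` digits, `S` whitespace, `P` anything else) and the helper names are assumptions for this example; D3L's actual regular expressions may differ.

```python
import re

# One named group per coarse character class (an assumed, simplified alphabet)
_TOKEN = re.compile(
    r"(?P<U>[A-Z]+)|(?P<L>[a-z]+)|(?P<N>[0-9]+)|(?P<S>\s+)|(?P<P>[^A-Za-z0-9\s]+)"
)


def format_pattern(value) -> str:
    """Replace every run of same-class characters with the letter of its class."""
    return "".join(match.lastgroup for match in _TOKEN.finditer(str(value)))


def column_formats(values) -> set:
    """Represent a column by the set of format patterns of its values."""
    return {format_pattern(value) for value in values}


years = column_formats([1997, 1982, 1926])
dates = column_formats(["2020-01-15", "1999-12-31"])
names = column_formats(["David Fincher", "Jerry Zucker"])
print(years, dates, names)
```

Two columns can then be compared through the overlap of their pattern sets, in the same way that the name index compares q-gram sets.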


##### Generating/loading ValueIndex of tables
Value index: represents values by their TF-IDF tokens and assesses similarity with the Jaccard distance between the resulting token sets.

In [30]:
value_lsh = os.path.join(result_path, 'value.lsh')
if os.path.isfile(value_lsh):
    value_index = unpickle_python_object(value_lsh)
    print("Value LSH index: LOADED!")
else:
    value_index = ValueIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(value_index, value_lsh)
    print("Value LSH index: SAVED!")

Value LSH index: LOADED!
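
The idea of representing a column by its most informative tokens can be sketched in plain Python. This is a simplified stand-in, not D3L's code: the tokenizer, the smoothed IDF formula, the `top_n` cut-off, and the column labels `a.Title`, `b.Title` and `c.City` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(values):
    """Lowercase the values and split them into alphanumeric tokens."""
    return [tok for value in values for tok in re.findall(r"[a-z0-9]+", str(value).lower())]


def tfidf_token_sets(columns: dict, top_n: int = 3) -> dict:
    """Treat each column as a document and keep its top_n tokens by TF-IDF."""
    counts = {name: Counter(tokenize(values)) for name, values in columns.items()}
    n_docs = len(counts)
    doc_freq = Counter(tok for column_counts in counts.values() for tok in column_counts)
    token_sets = {}
    for name, column_counts in counts.items():
        total = sum(column_counts.values())
        scores = {
            tok: (n / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, n in column_counts.items()
        }
        # Sort by descending score; ties are broken alphabetically for determinism
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        token_sets[name] = set(ranked[:top_n])
    return token_sets


def jaccard_similarity(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


columns = {
    "a.Title": ["The Game", "Gandhi", "Gattaca"],
    "b.Title": ["The General", "Ghost", "Gandhi"],
    "c.City": ["Paris", "Lima"],
}
token_sets = tfidf_token_sets(columns)
print(jaccard_similarity(token_sets["a.Title"], token_sets["b.Title"]))
print(jaccard_similarity(token_sets["a.Title"], token_sets["c.City"]))
```

Frequent tokens such as "the" receive a lower IDF weight and drop out of the token sets first, so the comparison focuses on distinctive values.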


##### Generating/loading DistributionIndex of tables
Distribution index: assesses the relatedness of numeric attribute values with the Kolmogorov-Smirnov statistic, which indicates whether two samples are likely to originate from the same domain.

In [31]:
distribution_lsh = os.path.join(result_path, 'distribution.lsh')
if os.path.isfile(distribution_lsh):
    distribution_index = unpickle_python_object(distribution_lsh)
    print("Distribution LSH index: LOADED!")
else:
    distribution_index = DistributionIndex(dataloader=dataloader, index_similarity_threshold=threshold)
    pickle_python_object(distribution_index, distribution_lsh)
    print("Distribution LSH index: SAVED!")

Distribution LSH index: LOADED!
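
The two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of two samples. The sketch below computes it directly in plain Python for two non-empty numeric columns; the function name `ks_statistic` is chosen for this example and the code is not taken from D3L.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The empirical CDFs only change at observed values, so checking those suffices
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


# Columns "Year" and "Fans' Rank" from the first rows of T2DV2_1
years = [1997, 1982, 1997, 1926, 1990]
fan_ranks = [442, 267, 505, 175, 486]
print(ks_statistic(years, years))      # identical samples
print(ks_statistic(years, fan_ranks))  # value ranges do not overlap
```

A statistic close to 0 suggests that both columns draw from the same distribution, whereas a value of 1 means that their value ranges are completely separated.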


##### Generating/loading EmbeddingIndex of tables
Embedding index: determines the relatedness of textual content through the cosine distance between the vector representations of the attributes.

In [32]:
embedding_lsh = os.path.join(result_path, 'embedding.lsh')
if os.path.isfile(embedding_lsh):
    embedding_index = unpickle_python_object(embedding_lsh)
    print("Embedding LSH index: LOADED!")
else:
    embedding_index = EmbeddingIndex(dataloader=dataloader,
                                     index_similarity_threshold=threshold)
    pickle_python_object(embedding_index, embedding_lsh)
    print("Embedding LSH index: SAVED!")


File exists. Use --overwrite to download anyway.
Loading embeddings. This may take a few minutes ...
Embedding LSH index: LOADED!
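
To make the cosine comparison concrete, the following sketch averages word vectors into a column vector and compares two columns. The three-dimensional `TOY_VECTORS` are invented for this example; a real index would use pretrained embeddings with hundreds of dimensions, and the helper names are assumptions as well.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only)
TOY_VECTORS = {
    "drama": [0.9, 0.1, 0.0],
    "thriller": [0.8, 0.2, 0.1],
    "paris": [0.0, 0.1, 0.9],
    "lima": [0.1, 0.0, 0.8],
}


def column_embedding(values):
    """Average the vectors of all values that have a known embedding."""
    vectors = [TOY_VECTORS[v.lower()] for v in values if v.lower() in TOY_VECTORS]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


genres = column_embedding(["Drama", "Thriller"])
cities = column_embedding(["Paris", "Lima"])
print(cosine_similarity(genres, column_embedding(["Thriller"])))
print(cosine_similarity(genres, cities))
```

The cosine distance is one minus this similarity, so columns with related content end up close to each other even when they share no literal values.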


##### Show the input table

In [33]:
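# Use the first file in the data directory as the query table (os.listdir order depends on the OS)
searched_table = os.path.splitext(os.listdir(data_path)[0])[0]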
print(searched_table)
table_df = dataloader.read_table(searched_table)
print(table_df.head(5))

T2DV2_1
   Fans' Rank        Title  Year                       Director(s)  \
0         442     The Game  1997                     David Fincher   
1         267       Gandhi  1982              Richard Attenborough   
2         505      Gattaca  1997                     Andrew Niccol   
3         175  The General  1926  Clyde Bruckman and Buster Keaton   
4         486        Ghost  1990                      Jerry Zucker   

   Overall Rank  
0          1491  
1           251  
2           950  
3            45  
4           416  


##### Query the table in the framework using all of the above indexes

In [34]:
# Top-k search results (k = 10)
qe = QueryEngine(name_index, value_index, embedding_index, format_index, distribution_index)
results, extended_results = qe.table_query(table=dataloader.read_table(table_name=searched_table),
                                           aggregator=None, k=10, verbose=True)
print(results)


Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['eramos', 'estabamos', 'estais', 'estan', 'estara', 'estaran', 'estaras', 'estare', 'estareis', 'estaria', 'estariais', 'estariamos', 'estarian', 'estarias', 'esteis', 'esten', 'estes', 'estuvieramos', 'estuviesemos', 'fueramos', 'fuesemos', 'habeis', 'habia', 'habiais', 'habiamos', 'habian', 'habias', 'habra', 'habran', 'habras', 'habre', 'habreis', 'habria', 'habriais', 'habriamos', 'habrian', 'habrias', 'hayais', 'hubieramos', 'hubiesemos', 'mas', 'mia', 'mias', 'mio', 'mios', 'seais', 'sera', 'seran', 'seras', 'sere', 'sereis', 'seria', 'seriais', 'seriamos', 'serian', 'serias', 'tambien', 'tendra', 'tendran', 'tendras', 'tendre', 'tendreis', 'tendria', 'tendriais', 'tendriamos', 'tendrian', 'tendrias', 'teneis', 'tengais', 'tenia', 'teniais', 'teniamos', 'tenian', 'tenias', 'tuvieramos', 'tuviesemos'] not in stop_words.



[('T2DV2_1', [1.0, 0.4, 0.4, 0.4, 0.6]), ('T2DV2_100', [1.0, 0.0, 0.3730386169712881, 0.0, 0.3080637030641807]), ('T2DV2_235', [1.0, 0.0, 0.37270460820719714, 0.0, 0.32245873854598917]), ('T2DV2_75', [1.0, 0.0, 0.3724816521097934, 0.0, 0.35692228998979797]), ('T2DV2_137', [1.0, 0.0, 0.3676786243929949, 0.0, 0.24352459667395487]), ('T2DV2_162', [1.0, 0.0, 0.3660358903393094, 0.0, 0.18288983195023673]), ('T2DV2_133', [1.0, 0.0, 0.36353461697171324, 0.0, 0.3195188827010667]), ('T2DV2_19', [0.9069044055544515, 0.0, 0.36697226314091913, 0.11261610944659746, 0.19399263370781453]), ('T2DV2_164', [0.9069044055544515, 0.0, 0.3661840406714171, 0.0, 0.21583410007146123]), ('T2DV2_187', [0.9069044055544515, 0.0, 0.36580861465015974, 0.0, 0.20309494508660556])]
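
Because the query passes `aggregator=None`, each result carries one score per evidence type instead of a single relevance value. A simple way to obtain one ranking is to average the scores, as sketched below with three rows copied from the output above. The score order (name, value, embedding, format, distribution) is assumed from the order in which the indexes are passed to `QueryEngine`, and `rank_by_mean` is a hypothetical helper rather than part of D3L.

```python
from statistics import mean

# Three (table, scores) rows copied from the query output above
results = [
    ("T2DV2_1", [1.0, 0.4, 0.4, 0.4, 0.6]),
    ("T2DV2_100", [1.0, 0.0, 0.3730386169712881, 0.0, 0.3080637030641807]),
    ("T2DV2_75", [1.0, 0.0, 0.3724816521097934, 0.0, 0.35692228998979797]),
]


def rank_by_mean(results):
    """Collapse the per-evidence scores into their mean and sort descending."""
    scored = [(table, mean(scores)) for table, scores in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank_by_mean(results)
for table, score in ranked:
    print(f"{table}: {score:.3f}")
```

An unweighted mean treats all five evidence types as equally reliable; a weighted combination would be a natural refinement when some evidence is known to be noisier.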


##### Output the results and validate them against the ground truth: check whether the returned tables belong to the same class as the input query table.

In [24]:
# Summarize searched results in a table

# class_input_table = df[df['fileName'] == searched_table+".csv"]['class'].tolist()[0]

# data = []
# exceptions = []
# average = []
# for table, score in results:
#         print(table)
#         print(score)
#         class_table = df[df['fileName'] == table+".csv"]['class'].tolist()
#         if len(class_table)==0:
#             class_table = "No Class found"
#         else:
#             class_table = class_table[0]
#         data.append((table, score,class_table))
#         average.append(sum(score)/len(score))
#         if class_table!=class_input_table:
#             exceptions.append(table)

# Creating the DataFrame

# result_summarization = pd.DataFrame(data, columns=["Table Name", "Scores", "Ground Truth Class"])
# result_summarization = pd.concat([result_summarization.drop(["Scores"], axis=1), result_summarization["Scores"].apply(pd.Series)], axis=1).round(3)
# result_summarization.columns = ["Table Name", "Class", "Header Score", "Value Score", "Embedding Score","Format Score",  "Distribution Score"]
# result_summarization["average score"] = average
# print(result_summarization)
# print(result_summarization["Table Name"])

##### For tables that do not belong to the same class as the input table, show the specific table.

In [25]:
# for table_name in exceptions:
#     table_except = dataloader.read_table(table_name)
#     table_except_part = table_except.head(5)
#     print(table_except_part)
#     break

##### Individual search using different methods

In [35]:
# Individual search results
# Name index query
topk = 10
def remove_search_col(listA, check_col):
    return [i for i, score in listA if i!=check_col]
        
def check_column(Dataloader: CSVDataLoader, combined_column_name):
    # Split only on the first dot so that column names containing dots stay intact
    table_name, column_name = combined_column_name.split(".", 1)
    table = Dataloader.read_table(table_name)
    return table[column_name]

def table_results(list_result):
    return pd.DataFrame(list_result, columns=["Column Name", "Scores"])

# "Tipo organismo" matches no attribute name in the indexed tables, so the result below is empty
name_results = name_index.query(query="Tipo organismo", k=topk)
print(f"Name results are \n {table_results(name_results)} \n")

Name results are 
 Empty DataFrame
Columns: [Column Name, Scores]
Index: [] 



In [36]:
# Value index query
# Query with a column that exists in the input table shown above (T2DV2_1)
query_column = "Title"
value_results = value_index.query(query=table_df[query_column], k=topk)
print(f"Value results are \n {value_results}\n")
# Exclude the query column itself from the example columns
columns = [check_column(dataloader, column) for column, score in value_results if column != f"{searched_table}.{query_column}"]
print(f"Value index results are \n {table_results(value_results)}\n")
if columns:
    print(f"Example result from searching the attribute value index:\n {columns[0]} \n")

In [37]:
# Embedding index query
# The input table has only five columns, so select the textual column by name instead of a fixed position
embedding_query_column = "Title"
print(table_df[embedding_query_column])
embedding_results = embedding_index.query(query=table_df[embedding_query_column], k=topk)
print(f"Embedding index results are \n{table_results(embedding_results)} \n")
# Exclude the query column itself from the example columns
embedding_column = [check_column(dataloader, column) for column, score in embedding_results if column != f"{searched_table}.{embedding_query_column}"]
if embedding_column:
    print(f"Example result from searching the embedding index:\n {embedding_column[0]} \n")

## Part 2: Dataset Navigation
### Framework Overview -- Aurum
This is a simplified version of Aurum. It consists of two stages: signature building and relationship building.
- **Signatures**: the LSH indexes from D3L, namely the name index and the value index.
- **Relationship building**: search for similar columns based on their similarity in the name and value LSH indexes.
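
The relationship building stage can be pictured as turning column similarities into a graph and then navigating it. The sketch below is a toy version in plain Python: the similarity scores in `similar_pairs` are invented (the column names are taken from this demo), the `threshold` mirrors the 0.5 used for the indexes, and `build_graph` and `reachable` are hypothetical helpers, not the Aurum API.

```python
from collections import defaultdict, deque

# Hypothetical pairwise column similarities, standing in for LSH index lookups
similar_pairs = [
    ("T2DV2_1.Title", "T2DV2_100.Title", 0.9),
    ("T2DV2_100.Title", "T2DV2_122.Title", 0.7),
    ("T2DV2_101.City", "T2DV2_112.City", 0.8),
    ("T2DV2_1.Title", "T2DV2_101.City", 0.2),
]


def build_graph(pairs, threshold=0.5):
    """Connect two columns with an undirected edge when they are similar enough."""
    graph = defaultdict(set)
    for left, right, score in pairs:
        if score >= threshold:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def reachable(graph, start):
    """Breadth-first search: every column connected to start by a path."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(similar_pairs)
print(sorted(reachable(graph, "T2DV2_1.Title")))
```

Starting from one column, navigation reaches columns that are only indirectly related, such as `T2DV2_122.Title` via `T2DV2_100.Title` in this toy data.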


#### Prerequisites: detect subject columns and column types

In [38]:
import pickle
from TableMiner.SCDection.TableAnnotation import TableColumnAnnotation as TA


def subjectColDetection(DATA_PATH, RESULT_PATH):
    """
    Find the column types and named-entity (NE) scores of each table and
    store them per table in a dict that is cached as a pickle file.
    """
    table_dict = {}
    dict_file = os.path.join(RESULT_PATH, "dict.pkl")
    # Try to load the dict from the pickle file
    if os.path.isfile(dict_file):
        with open(dict_file, "rb") as f:
            table_dict = pickle.load(f)
    else:
        for tableName in os.listdir(DATA_PATH):
            table_dict[tableName] = []
            # Read from DATA_PATH instead of a hard-coded directory
            table = pd.read_csv(os.path.join(DATA_PATH, tableName))
            try:
                annotation_table = TA(table, SearchingWeb=False)
                annotation_table.subcol_Tjs()
                table_dict[tableName].append(annotation_table.annotation)
                table_dict[tableName].append(annotation_table.column_score)
            except Exception as e:
                # The entry of a failed table stays incomplete
                print(f"Error in {tableName} : {e}")
                continue
        # Save the dict as a pickle file
        with open(dict_file, "wb") as save_file:
            pickle.dump(table_dict, save_file)
    return table_dict

# Perform the call to the method
SubjectCol_dict = subjectColDetection(data_path, "Result")

##### Find the subject columns of the result tables from Part 1 (Dataset Search)

In [39]:
result_tables = os.listdir(data_path)
subject_columns=[]
all_columns = []
tables_without_ne = []

# Use the column info dict built above to find the subject columns
# (and all NE columns) of each table.
for table in result_tables:
    table_name = table[:-4]
    df_table = dataloader.read_table(table_name)
    table_info = SubjectCol_dict.get(table, [])
    # Tables whose annotation failed have an incomplete entry; treat them like tables without NE columns
    if len(table_info) != 2 or not table_info[1]:
        tables_without_ne.append(table)
        continue
    annotation, NE_column_score = table_info
    max_score = max(NE_column_score.values())
    all_columns.extend([f"{table_name}.{df_table.columns[i]}" for i in NE_column_score.keys()])
    # Every column that reaches the maximum NE score counts as a subject column
    subcol_index = [key for key, value in NE_column_score.items() if value == max_score]
    for index in subcol_index:
        subject_columns.append(f"{table_name}.{df_table.columns[index]}")
print(subject_columns)
print("Amount of tables that don't have NE columns: ", len(tables_without_ne))
print("Tables without NE columns: ", tables_without_ne)

['T2DV2_1.Title', 'T2DV2_100.Title', 'T2DV2_101.City', 'T2DV2_106.Country', 'T2DV2_109.Name', 'T2DV2_111.Name', 'T2DV2_112.City', 'T2DV2_114.Country', 'T2DV2_117.Unnamed: 0', 'T2DV2_12.Name', 'T2DV2_121.Game', 'T2DV2_122.Title', 'T2DV2_125.Name', 'T2DV2_129.Title', 'T2DV2_130.Titel', 'T2DV2_131.Unnamed: 1', 'T2DV2_133.Title', 'T2DV2_134.Company', 'T2DV2_136.Company', 'T2DV2_137.Title', 'T2DV2_138.Name', 'T2DV2_142.COUNTRY', 'T2DV2_145.Title', 'T2DV2_146.Company', 'T2DV2_147.Title', 'T2DV2_148.Country', 'T2DV2_150.Title', 'T2DV2_151.A', 'T2DV2_152.Game', 'T2DV2_154.Unnamed: 1', 'T2DV2_155.Country Name (Click For Destination Guides)', 'T2DV2_156.Mathematician', 'T2DV2_159.Name', 'T2DV2_162.Title', 'T2DV2_163.Title', 'T2DV2_164.Title', 'T2DV2_165.City', 'T2DV2_168.City', 'T2DV2_169.Prize', 'T2DV2_17.Company', 'T2DV2_171.Country Name:', 'T2DV2_172.Hospital Name', 'T2DV2_178.Country', 'T2DV2_180.Middle East', 'T2DV2_183.City', 'T2DV2_184.Title', 'T2DV2_186.Emperor', 'T2DV2_187.Title', 'T2DV

In [41]:
import networkx as nx
from Aurum.graph import buildGraph, draw_interactive_network, save_graph

# Use Aurum to build the graph
aurum_graph = buildGraph(dataloader, data_path, [name_index, value_index], target_path="Result", table_dict=SubjectCol_dict)


def subgraph(given_nodes, graph: nx.Graph):
    """
    Return the subgraph that contains the given nodes and every node
    that is connected to them by a path.
    """
    relevant_nodes = set()
    for component in nx.connected_components(graph):
        # Keep the whole connected component if it contains at least one given node
        if not component.isdisjoint(given_nodes):
            relevant_nodes.update(component)
    # Use the graph passed as an argument rather than the global aurum_graph
    return graph.subgraph(relevant_nodes).copy()


result_SC_graph = subgraph(subject_columns, aurum_graph)
save_graph(result_SC_graph, "Result")
draw_interactive_network(result_SC_graph)


Grafo guardado en 'Result\grafo.pkl'.


In [42]:
# See all columns in the graph
result_graph = subgraph(all_columns, aurum_graph)
draw_interactive_network(result_graph)

In [46]:
from Aurum.graph import draw_interactive_network_with_filters
draw_interactive_network_with_filters(result_graph)

## Part 3: Dataset Annotation
### Framework Overview -- TableMiner+
#### Input dataset: 13 tables from 13 domains, one table per domain
The 13 domains include:
1. **Airport**
2. **City**
3. **CollegeOrUniversity**
4. **Company**
5. **Continent**
6. **Country**
7. **Hospital**
8. **LandmarksOrHistoricalBuildings**
9. **Monarch**
10. **Movie**
11. **Museum**
12. **Scientist**
13. **VideoGame**

TableMiner+ has 4 steps:
1. **Subject column detection**, including column (data) type detection
2. **NE-column interpretation, the LEARNING phase:**
   - 2.1 preliminary cell annotation
   - 2.2 column semantic type annotation
   - 2.3 property annotation
3. **NE-column interpretation, the UPDATE phase:** revise the annotations until all of them have stabilized
4. **Relation enumeration and annotation of literal columns** (not included yet)
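
The UPDATE phase is essentially a fixed-point iteration: annotations are revised repeatedly until a pass changes nothing. The toy sketch below shows only that control flow. The revision rule (adopt the column's majority concept whenever it is among a cell's candidates), the candidate sets and the function names are assumptions made for illustration; the revision logic of TableMiner+ itself is considerably richer.

```python
from collections import Counter


def revise(cell_concepts, candidates):
    """Move each cell to the column's majority concept if that concept is a candidate."""
    winning = Counter(cell_concepts.values()).most_common(1)[0][0]
    return {
        cell: winning if winning in candidates[cell] else concept
        for cell, concept in cell_concepts.items()
    }


def update_until_stable(cell_concepts, candidates, max_iterations=10):
    """Repeat the revision until the annotations stop changing."""
    current = dict(cell_concepts)
    for iteration in range(1, max_iterations + 1):
        revised = revise(current, candidates)
        if revised == current:
            return current, iteration
        current = revised
    return current, max_iterations


# Hypothetical candidate concepts per cell and the preliminary cell annotations
candidates = {
    "Gandhi": {"Person", "Film"},
    "Gattaca": {"Film"},
    "Ghost": {"Film", "Book"},
}
preliminary = {"Gandhi": "Person", "Gattaca": "Film", "Ghost": "Film"}

final, iterations = update_until_stable(preliminary, candidates)
print(final, iterations)
```

In this example the ambiguous cell "Gandhi" is pulled towards "Film" by the rest of the column, and the second pass confirms that nothing changes any more.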

##### Show the example annotation tables

In [None]:
import os
import pandas as pd
from TableMiner.LearningPhase.Update import TableLearning, updatePhase
from TableMiner.SearchOntology import SearchDBPedia

# Strip the .csv extension from the table names
table_domains = [os.path.splitext(table)[0] for table in os.listdir(data_path)]
print(table_domains)

##### Perform NE-Column interpretation (Table Learning includes the process of subject column detection of a table)

In [35]:
def table_annotation(tableName, subcol_dict):
    tableD = dataloader.read_table(tableName)
    print(tableD)
    annotation_table, NE_Score = subcol_dict[tableName + ".csv"]
    print(annotation_table)
    ### NE-Column interpretation - the LEARNING phase of TableMiner+
    tableLearning = TableLearning(tableD, NE_column=NE_Score)
    print("starting learning phase")
    tableLearning.table_learning()
    ### NE-Column interpretation - the UPDATE phase
    print("starting update phase")
    updatePhase(tableLearning)
    return tableLearning

In [None]:
import json

# Duplicated code: pretty-prints the JSON of the requests.
def pretty_print_json(loaded_json):
    print(json.dumps(loaded_json, indent=2, ensure_ascii=False))

# Merges two dictionaries.
# On a key collision, the values of dict1 overwrite the values of dict2.
def merge_dicts(dict1, dict2):
    return {**dict2, **dict1}

# Adds the requests saved in the pickle file to the request dictionaries of SearchDBPedia.
# Values already held in memory are not removed.
# On a key collision, the values of SearchDBPedia take precedence.
def load_ontology_requests(dict_path, dict_name):
    target_file = os.path.join(dict_path, dict_name)
    if not os.path.isfile(target_file):
        return {}

    request_cache = unpickle_python_object(target_file)

    # SearchDBPedia is passed first so that its in-memory values win on collisions
    SearchDBPedia.searches_dictionary = merge_dicts(SearchDBPedia.searches_dictionary, request_cache['searches'])
    SearchDBPedia.retrieve_entity_triples_dictionary = merge_dicts(SearchDBPedia.retrieve_entity_triples_dictionary, request_cache['retrieve_entity_triples'])
    SearchDBPedia.retrieve_concepts_dictionary = merge_dicts(SearchDBPedia.retrieve_concepts_dictionary, request_cache['retrieve_concepts'])
    SearchDBPedia.retrieve_concept_uri_dictionary = merge_dicts(SearchDBPedia.retrieve_concept_uri_dictionary, request_cache['get_concept_uri'])
    SearchDBPedia.retrieve_definitional_sentence_dictionary = merge_dicts(SearchDBPedia.retrieve_definitional_sentence_dictionary, request_cache['get_definitional_sentence'])
    return request_cache


request_cache = load_ontology_requests("Result", "ontologyRequests.pkl")    
pretty_print_json(request_cache.get('searches', {}))
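
The precedence rule of `merge_dicts` follows from dictionary unpacking: when a key occurs in both operands, the one unpacked last wins. The short example below demonstrates this with made-up cache entries; the keys and values are hypothetical.

```python
def merge_dicts(dict1, dict2):
    # dict1 is unpacked last, so it overrides dict2 on shared keys
    return {**dict2, **dict1}


# Hypothetical cache contents: one key collides, one exists only on disk
in_memory = {"Gandhi": "fresh response"}
on_disk = {"Gandhi": "stale response", "Ghost": "cached response"}

merged = merge_dicts(in_memory, on_disk)
print(merged)
```

The argument order therefore decides which side wins, which matters whenever the in-memory cache and the pickled cache hold different responses for the same request.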

In [None]:
# Learning phase for the selected table.

# searched_table = table_domains[4]
# learning = table_annotation(searched_table, SubjectCol_dict)

# Learning phase for all tables
learning = {}
for table in table_domains:
    print(f"\n ---- \n Starting learning for {table} \n ---- \n")
    learning[table] = table_annotation(table, SubjectCol_dict)

In [None]:
# print(learning["table_name"].get_annotation_class()[0].get_cell_annotation())
# print(learning["table_name"].get_annotation_class()[0].get_winning_concepts())

In [None]:
print("Network Calls")
print("Amount of searches", SearchDBPedia.amount_of_search)
print("Amount of unique searches", len(SearchDBPedia.unique_searches), "\n", SearchDBPedia.unique_searches, "\n")

print("Amount of retrieve entity triples", SearchDBPedia.amount_of_retrieve_entity_triples)
print("Amount of unique entity triples", len(SearchDBPedia.unique_retrieve_entity_triples), "\n", SearchDBPedia.unique_retrieve_entity_triples, "\n")

print("Amount of retrieve concepts", SearchDBPedia.amount_of_retrieve_concepts)
print("Amount of unique concepts", len(SearchDBPedia.unique_retrieve_concepts), "\n", SearchDBPedia.unique_retrieve_concepts, "\n")

print("Amount of concept uri", SearchDBPedia.amount_of_get_concept_uri)
print("Amount of unique concept uri", len(SearchDBPedia.unique_get_concept_uri), "\n", SearchDBPedia.unique_get_concept_uri, "\n")

print("Amount of definitional sentences", SearchDBPedia.amount_of_get_definitional_sentence)
print("Amount of unique definitional sentences", len(SearchDBPedia.unique_get_definitional_sentence), "\n", SearchDBPedia.unique_get_definitional_sentence, "\n")

In [None]:
def store_learning(table, learning, dict_path, dict_name):
    target_file = os.path.join(dict_path, dict_name)
    if os.path.isfile(target_file):
        with open(target_file, 'rb') as file:
            dict_annotation = pickle.load(file)
    else:
        dict_annotation = {}
    dict_annotation[table] = learning[table]
    with open(target_file, 'wb') as file:
        pickle.dump(dict_annotation, file)

store_learning(searched_table, learning, "Result", "annotationDict.pkl")

In [88]:
# Stores the cached ontology requests in a pickle file.
# Loads the requests saved so far and adds the new ones to them.
def store_ontology_requests(dict_path, dict_name):
    target_file = os.path.join(dict_path, dict_name)
    if not os.path.exists(target_file):
        saved_requests = {
            'searches': {},
            'retrieve_entity_triples': {},
            'retrieve_concepts': {},
            'get_concept_uri': {},
            'get_definitional_sentence': {}
        }
    else:
        saved_requests = load_ontology_requests(dict_path, dict_name)

    request_caching = {}
    request_caching['searches'] = merge_dicts(SearchDBPedia.searches_dictionary, saved_requests['searches'])
    request_caching['retrieve_entity_triples'] = merge_dicts(SearchDBPedia.retrieve_entity_triples_dictionary, saved_requests['retrieve_entity_triples'])
    request_caching['retrieve_concepts'] = merge_dicts(SearchDBPedia.retrieve_concepts_dictionary, saved_requests['retrieve_concepts'])
    request_caching['get_concept_uri'] = merge_dicts(SearchDBPedia.retrieve_concept_uri_dictionary, saved_requests['get_concept_uri'])
    request_caching['get_definitional_sentence'] = merge_dicts(SearchDBPedia.retrieve_definitional_sentence_dictionary, saved_requests['get_definitional_sentence'])
    pickle_python_object(request_caching, target_file)

store_ontology_requests("Result", "ontologyRequests.pkl")

In [None]:
def findAnnotation(dict_of_annotation, tableN):
    learningT = dict_of_annotation[tableN]
    annotation_class = learningT.get_annotation_class()
    # Read the table once instead of once per annotated column
    tableDataframe = dataloader.read_table(tableN)
    for columnIndex, learning_class in annotation_class.items():
        column = tableDataframe.iloc[:, columnIndex]
        cellAnnotation = learning_class.get_cell_annotation()[:5]
        ColumnSemantics = learning_class.get_winning_concepts()
        df_t = pd.concat([column[:5], cellAnnotation], axis=1)
        print(f"Column and cell annotations of the column:\n{df_t}\n")
        print(f"Column {column.name} semantic type: {ColumnSemantics}")


with open("Result/annotationDict.pkl", 'rb') as file:
    dict_annotation = pickle.load(file)
    
findAnnotation(dict_annotation, searched_table)

##### Check the tables' subject column annotations

## Part 4: Schema Inference
### Framework Overview -- Starmie
#### Input dataset: all 200 tables covering 13 specific domains
- **Embedding method**: Starmie
- **Table class inference method**: hierarchical clustering
- **Similarity metric**: computed over the average of each table's column embeddings
- **Cluster label**: the class that appears most frequently in the ground truth within each cluster
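
The pipeline outlined above can be sketched end to end on toy data: average each table's column embeddings, then merge the closest tables bottom-up. The two-dimensional vectors and table names below are invented, and the average-linkage implementation is a plain-Python stand-in written for this example (assuming `n_clusters >= 1`); it is not the clustering code used in this notebook, which relies on real Starmie embeddings.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Represent a table by the average of its column embeddings."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def average_linkage(points, left, right):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(points[a], points[b]) for a in left for b in right)
    return total / (len(left) * len(right))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(points, clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i stays valid
    return clusters


# Hypothetical 2-dimensional column embeddings for four tables
column_embeddings = {
    "movies_a": [[1.0, 0.0], [0.9, 0.1]],
    "movies_b": [[0.8, 0.2], [1.0, 0.1]],
    "cities_a": [[0.0, 1.0], [0.1, 0.9]],
    "cities_b": [[0.2, 1.0], [0.0, 0.8]],
}
table_vectors = {name: mean_vector(cols) for name, cols in column_embeddings.items()}
clusters = agglomerative(table_vectors, n_clusters=2)
print(clusters)
```

Averaging keeps the table representation independent of the number of columns, at the cost of blurring tables that mix columns from several domains.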


##### Use clustering on the tables' embeddings to detect the types/domains of tables

In [None]:
import clustering as c

# Generation of embeddings: run the ./starmie/cmd.sh script to generate the tables' embeddings.
# The sample embeddings generated by this script are uploaded to the Result path.
clustering_result = c.typeInference("Result/tableEmbeddings.pkl", "Agglomerative", numEstimate=13)
index_cluster = [i for i, tables in clustering_result.items() if "T2DV2_122.csv" in tables]
# The overall clustering result as a dataframe: cluster id, GT label, size and precision,
# ranked by precision
result_precision = c.result_precision(clustering_result)
print(result_precision)
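
One plausible reading of the per-cluster precision reported above is the share of tables in a cluster that carry the cluster's majority ground-truth class. The sketch below computes that quantity on invented data; `cluster_precision` and the table-to-class mapping are assumptions for illustration and may differ from what `c.result_precision` actually does.

```python
from collections import Counter


def cluster_precision(cluster_tables, ground_truth):
    """Return the majority class of a cluster and the fraction of tables that carry it."""
    labels = Counter(ground_truth[table] for table in cluster_tables)
    label, count = labels.most_common(1)[0]
    return label, count / len(cluster_tables)


# Hypothetical ground-truth classes and one hypothetical cluster
ground_truth = {
    "t1.csv": "Movie",
    "t2.csv": "Movie",
    "t3.csv": "VideoGame",
    "t4.csv": "Movie",
}
cluster = ["t1.csv", "t2.csv", "t3.csv", "t4.csv"]

label, precision = cluster_precision(cluster, ground_truth)
print(label, precision)
```

Under this definition a low-precision cluster mixes several classes, which is why inspecting the tables inside such a cluster is a useful next step.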

In [None]:
# Check the score of the second-to-last cluster
print("\n",result_precision.iloc[-2])
# Check which kinds of tables the cluster contains
checked_cluster = clustering_result[result_precision.iloc[-2]["cluster id"]]
innerInfo = c.inner_cluster(checked_cluster)
print(innerInfo)
# Select two random tables from the sampled cluster; check and compare their column headers
# to find the differences/similarities between them
rows = c.sample_tables_cluster(innerInfo)
first_table = dataloader.read_table(rows.iloc[0, 0][:-4])
second_table = dataloader.read_table(rows.iloc[1, 0][:-4])
print(first_table.columns,"\n",second_table.columns)

##### Use clustering on column embeddings to detect the columns' types
Column type inference: hierarchical clustering on the column embeddings.


In [None]:
import os
### Cluster all column embeddings under specified domain
column_clustering = c.conceptualAttri(os.path.join(c.current_dir_path, "Datasets"),
                os.path.join(c.current_dir_path, "Result/tableEmbeddings.pkl"),
                clustering_method="Agglomerative",
                domain="VideoGame",
                numEstimate=13)

check_table = "T2DV2_122"

for column_index, column_clusters in column_clustering.items():
    # Columns of the checked table that fall into this cluster
    matching_columns = [name for name in column_clusters if check_table in name]
    if matching_columns:
        print(f"Cluster has {matching_columns[0]} \n", column_clusters, "\n")
