# WordNet Dataset experiment #
What do I want to achieve with these experiments?

### Correct vs Incorrect gradient behaviour expectation
The correct gradient adds an additional factor to the attractive and repulsive forces. I expect this to speed up convergence.

The attractive term (1st term) is sparse (most values in the summation are 0), so I expect less of an effect on attraction (unless the perplexity is increased).

The repulsive term (2nd term) will be affected more strongly. As points move apart, $q^{H}_{ij}$ goes down (to 0), but the extra factor $d^{H}_{ij}$ increases (exponentially?).

So the expectation is that points will be repelled faster and more strongly.
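
To check the "(exponentially?)" guess numerically, here is a minimal sketch. It assumes the standard Poincaré-disk distance and a t-SNE-style weight $(1 + d^2)^{-1}$ for the unnormalised low-dimensional similarity. The function names are hypothetical and not part of `hyperbolicTSNE`, and the normalisation of $q^{H}_{ij}$ is ignored.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit (Poincare) disk."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def repulsion_weights(d):
    """Unnormalised similarity weight without and with the extra distance factor.

    ASSUMPTION: the weight is 1 / (1 + d^2) and the 'extra term' is a plain factor d.
    """
    base = 1.0 / (1.0 + d * d)
    return base, base * d

# Move one point from the origin towards the boundary of the disk.
rows = []
for r in (0.1, 0.5, 0.9, 0.99, 0.999):
    d = poincare_distance((0.0, 0.0), (r, 0.0))
    without_d, with_d = repulsion_weights(d)
    rows.append((r, d, without_d, with_d))
    print(f"r={r:<6} d={d:.4f}  weight={without_d:.4f}  weight*d={with_d:.4f}")
```

Printing both columns side by side makes it easy to see how quickly the distance grows relative to how quickly the weight decays, which is what decides the net effect on repulsion.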


### Early exaggeration
At the beginning we scale up the high-dimensional probabilities so that the attractive term dominates. This lets similar points start clustering together before the repulsive force separates them.
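
As a rough illustration of that schedule, the sketch below uses the values set later in this notebook (`exaggeration_factor = 12`, `ex_iterations = 250`, `main_iterations = 1000`). The helper names are hypothetical; how `SequentialOptimizer` applies the exaggeration internally is not shown here and may differ.

```python
def exaggeration_at(iteration, factor=12.0, ex_iterations=250):
    """Multiplier applied to the high-dim. probabilities at a given iteration."""
    # During the early-exaggeration phase the probabilities are scaled up;
    # afterwards they are used unchanged.
    return factor if iteration < ex_iterations else 1.0

def exaggerated_p(p_ij, iteration):
    """High-dim. probability as seen by the attractive term at this iteration."""
    return exaggeration_at(iteration) * p_ij

# First and last iteration of each phase (250 exaggerated + 1000 main iterations).
schedule = [exaggeration_at(it) for it in (0, 249, 250, 1249)]
print(schedule)
```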



### WordNet-specific experiment aims
1. WordNet is supposedly a hierarchical (hyperbolic) dataset
    - Is this a property that we can "obviously" observe from the embeddings?
    - We expect the WordNet visualizations to display obvious hyperbolic/hierarchical structure
2. Is there a noteworthy difference in speed/embedding quality for the incorrect vs correct gradient?

3. Enable acceleration


### WordNet dataset experiment notes
1. My current setup can't handle a `num_points` fraction above 0.1 well: it takes a long time, so test things with less data.
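
For a sense of scale, the sketch below turns a sample fraction into a point count and the resulting learning rate, using the dataset sizes listed in the parameter cell and the learning-rate formula from the experiment loop. The helper names are hypothetical, and rounding the sample size down is an assumption about how `load_data` treats `sample`.

```python
def sample_size(n_total, fraction):
    """Number of points kept; -1 means all the data (rounding down is an assumption)."""
    return n_total if fraction == -1 else int(n_total * fraction)

def learning_rate(n_samples, exaggeration_factor, lr_factor):
    """Same formula as in the experiment loop: n / (exaggeration_factor * lr_factor)."""
    return n_samples / (exaggeration_factor * lr_factor)

# Dataset sizes taken from the comments in the parameter cell.
for name, n_total in (("MNIST", 70000), ("C_ELEGANS", 89701)):
    for fraction in (0.1, 0.5, -1):
        n = sample_size(n_total, fraction)
        for lr_factor in (1000, 10000):
            rate = learning_rate(n, 12, lr_factor)
            print(f"{name:<10} sample={fraction:<4} n={n:<6} lr_factor={lr_factor:<6} lr={rate:.4f}")
```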

In [None]:
import os
import traceback

from hyperbolicTSNE.visualization import plot_poincare, animate
from hyperbolicTSNE import load_data, Datasets, SequentialOptimizer, initialization, HyperbolicTSNE
from hyperbolicTSNE.cost_functions_ import CoSNE, HyperbolicKL, GlobalHSNE
from hyperbolicTSNE.util import find_last_embedding, opt_config, initialize_logger, write_data, store_visuals

In [None]:
data_home = "datasets"
log_path = "temp/poincare/"  # path for saving embedding snapshots

In [None]:
seed = 42

# Different experiment values
# num_points = [0.25, 0.5, -1]                     # -1 means all the data
# perplexities = [10, 30, 50, 80, 100, 150]

# Parameters specific to experiments
# MNIST -> (70000, 784)
# C_ELEGANS -> (89701, 20222)
datasets = [Datasets.C_ELEGANS, Datasets.PLANARIA] 
num_points = [0.5]
perplexities = [50]
correct_gradients = [False, True]                # NOTE: Recompile with correct flag (GRAD_FIX flag)
exact = False                                    # Exact computation or BH estimation of gradient
pca_components = 50                              # Number of PCA components used to reduce the high dim. data
grad_scale_fix = True                            # Whether we multiply the gradient by the inverse metric tensor of hyperbolic space or not
                                                 # Note that the correct hyperbolic gradient has an inverse metric tensor factor
learning_rates_factors = [1000, 10000]

# General parameters to be set for all experiments
exaggeration_factor = 12
ex_iterations = 250
main_iterations = 1000

# Parameters to be set regarding saving of data
# TODO: Automate cost identification for saving purposes. Right now it's done manually
data_header = ['dataset', 'data_size', 
                'data_dim', 'pca_init', 
                'perplexity', 'pca_components',
                'cost_function_value', 'cost_function', 
                'runtime', 'total_iterations', 
                'exact', 'correct_gradient', 'grad_scale_fix', 'lr_factor']

In [None]:
# Setting up the cost function
# cf = CoSNE
# lambda_1 = 20
# lambda_2 = 0.1

# def CoSNE_params(x_norm, l1, l2, n_samples):
#     return {
#         "x_norm" : x_norm,
#         "lambda_1" : l1,
#         "lambda_2" : l2,
#         "n_samples" : n_samples
#     }

# print(cf.class_str())
# cf = GlobalHSNE
# lbda = 0.1
# def GlobalHSNE_params(P_hat, lbda, n_samples):
#     return {
#         "P_hat": P_hat,
#         "lbda": lbda,
#         "n_samples": n_samples,
#     }
cf = HyperbolicKL

In [None]:
""" 
Code for WordNet Dataset experiments
"""
for dataset in datasets:
    for correct_gradient in correct_gradients:
        for lr in learning_rates_factors:
            for np in num_points:
                for perplexity in perplexities:
                    ##################
                    # INITIALIZATION #
                    ##################
                    print("\nExperiment: ", dataset, "num_points: ", np, "perp: ", perplexity, "correct gradient: ", correct_gradient, "lr factor: ", lr)

                    # Load the data
                    # NOTE: when sample != -1, load_data also returns the sample indices; *rest catches them
                    dataX, dataLabels, D, V, *rest = load_data(
                        dataset, 
                        data_home=data_home, 
                        pca_components=pca_components,
                        random_state=seed, 
                        to_return="X_labels_D_V",
                        hd_params={"perplexity": perplexity}, 
                        sample=np, 
                        knn_method="hnswlib"  # we use an approximation of high-dimensional neighbors to speed up computations
                    )
                    
                    # Compute initial embedding in Poincare disk (PCA embedding)
                    X_embedded = initialization(
                        n_samples=dataX.shape[0], 
                        n_components=2,
                        X=dataX,
                        random_state=seed,
                        method="pca"
                    ) 

                    # Initialize config and parameters
                    learning_rate = (dataX.shape[0] * 1) / (exaggeration_factor * lr)

                    opt_conf = opt_config(cf, learning_rate, exaggeration_factor, ex_iterations, main_iterations, exact, correct_gradient)
                    opt_params = SequentialOptimizer.sequence_poincare(**opt_conf) 
                    opt_params, opt_conf = initialize_logger(opt_params, opt_conf, log_path=log_path, grad_path=None)

                    # TODO: In the case of CoSNE, we must pass additional parameters to the cost function
                    # x_norm = (X_embedded * X_embedded).sum(axis=1)
                    # opt_params["cf_params"].update(CoSNE_params(x_norm, lambda_1, lambda_2, dataX.shape[0]))  # Add CoSNE params to cf_params dict
                    # print(x_norm.shape, dataX.shape[0], X_embedded.shape)
                    opt_params["cf_params"].update({"grad_fix" : correct_gradient})     # So the cost function knows which gradient to use
                    print(opt_params["cf_params"])

                    #########################
                    # EMBEDDING COMPUTATION #
                    #########################
                    # Set up H-TSNE object 
                    htsne = HyperbolicTSNE(
                        init=X_embedded, 
                        n_components=2, 
                        metric="precomputed", 
                        verbose=True, 
                        opt_method=SequentialOptimizer,    # the optimizer we use
                        opt_params=opt_params              # the parameters for the optimizer
                    )

                    # Compute embedding:
                    try:
                        hyperbolicEmbedding = htsne.fit_transform((D, V))
                        
                    except ValueError:
                        hyperbolicEmbedding = find_last_embedding(log_path)
                        traceback.print_exc()


                    ###################
                    # STORING RESULTS #
                    ###################
                    # Write results to csv
                    data_row = [
                        dataset, np, dataX.shape[1], pca_components,
                        perplexity, pca_components, htsne.cf, cf.class_str(), htsne.runtime, htsne.its,
                        opt_conf['exact'], correct_gradient, grad_scale_fix, lr
                    ]
                
                    results_path = f"results/csvs/{dataset.name}.csv"       # Path of csv file to store results
                    save_folder = f"results/{dataset.name}"                 # Path of folder to store visualizations
                    
                    # Name of visualization file (.png and .gif)
                    file_name = f"{save_folder}/exact:{exact}_correctgrad:{correct_gradient}_scalefix:{grad_scale_fix}_points:{np}_pca:{pca_components}_perp:{perplexity}_cf:{cf.class_str()}_lrfactor:{lr}"

                    # Store results
                    write_data(data_header, data_row, file_path=results_path)
                    store_visuals(hyperbolicEmbedding, dataLabels, save_folder, file_name, opt_params)
                    

""" 
Dimensions for wordnet data
x_norm: (41058,) dataX.shape[0]: 41058 X_embedded: (41058, 2)
grad1.shape: (82116,) Y.shape: (82116,) x_norm: (41058,)
"""