# Greenlight presentation notes

## Current progress

tldr: HyperbolicKL has some issues, so we switch to GaussianKL to overcome them (hopefully).

### HyperbolicKL notes
I'm trying to figure out what exactly is going wrong when certain flags are set:
1. Baseline experiment: (MNIST, wrong_grad, no scale_fix, HyperbolicKL, accel. on/off) \
   Works perfectly well. The results match the paper (baseline).

2. Experiment 2: (MNIST, wrong_grad, scale_fix, HyperbolicKL, accel. on/off) \
   Works fairly well. We get results similar to baseline, but it requires (many) more iterations to converge nicely.
   This is due to the inverse metric tensor term, which consistently scales the gradients down.

3. Experiment 3: (MNIST, correct_grad, no scale_fix, HyperbolicKL, accel. on/off) \
   Takes much longer to converge properly. The extra d^H_ij term makes gradients (near) 0 for a long time.
   Gradients also get very large at some point (embeddings overshoot the Disk).

4. Experiment 4: (MNIST, correct_grad, scale_fix, HyperbolicKL, accel. on/off) \
   Takes even longer to converge properly. Now we have both d^H_ij (initially) and scale_fix scaling the gradients down. \
   However, gradients now stay within a reasonable range and don't explode to large values, presumably due to the scale_fix. \
   These settings reproduce the result where points are pushed towards the boundary (the conceptually sensible one).
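
A minimal sketch of the scaling effect noted in experiments 2 and 4, assuming scale_fix multiplies the Euclidean gradient by the inverse metric tensor of the Poincaré disk with curvature -1, i.e. the factor (1 - ||y||^2)^2 / 4. The function names are hypothetical and not taken from the project code.

```python
import numpy as np


def poincare_inverse_metric_scale(y):
    """Return (1 - ||y||^2)^2 / 4 for each row of y (points in the unit disk)."""
    sq_norm = np.sum(y ** 2, axis=-1)
    return (1.0 - sq_norm) ** 2 / 4.0


def rescale_gradient(y, euclidean_grad):
    """Hypothetical stand-in for scale_fix: scale each gradient row."""
    return poincare_inverse_metric_scale(y)[:, None] * euclidean_grad


# Points at increasing distance from the center of the disk.
points = np.array([[0.0, 0.0], [0.5, 0.0], [0.9, 0.0], [0.99, 0.0]])
grads = np.ones_like(points)  # dummy Euclidean gradients

scales = poincare_inverse_metric_scale(points)
scaled = rescale_gradient(points, grads)
for p, s in zip(points, scales):
    print(f"||y|| = {np.linalg.norm(p):.2f}  scale = {s:.6f}")
```

Under this assumption the factor is at most 1/4 (at the center) and shrinks towards 0 near the boundary, which would be consistent with both the slower convergence and the absence of exploding gradients.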

In points (3, 4) we get other issues. Because gradients are very small, no progress is made for a long time, so the early-stopping check ends the optimization.
\
If we use both fixes (correct gradient, scale_fix) then we get the "issue" that points are pushed towards the boundary.

To avoid early stopping, increase the **n_iter_check** flag in **opt_config** and manually change **n_iter_without_progress** in **solver.py**.
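
A minimal sketch of how such an early-stopping check typically behaves, assuming **n_iter_check** and **n_iter_without_progress** have the same meaning as in common t-SNE solvers (progress is checked every `n_iter_check` iterations, and the run stops once the cost has not improved for more than `n_iter_without_progress` iterations). The function below is hypothetical; the actual logic in **solver.py** may differ.

```python
def stopping_iteration(costs, n_iter_check=1, n_iter_without_progress=300):
    """Return the iteration at which optimization would stop.

    `costs` is a precomputed list of per-iteration cost values that stands
    in for a real optimizer loop.
    """
    best_cost = float("inf")
    best_iter = 0
    for it, cost in enumerate(costs):
        if it % n_iter_check != 0:
            continue  # progress is only checked every n_iter_check iterations
        if cost < best_cost:
            best_cost = cost
            best_iter = it
        elif it - best_iter > n_iter_without_progress:
            return it  # no improvement for too long: stop early
    return len(costs) - 1  # ran to completion


# A flat cost curve mimics the long phase with near-zero gradients.
flat_costs = [1.0] * 1000
stopped_default = stopping_iteration(flat_costs)
stopped_patient = stopping_iteration(flat_costs, n_iter_without_progress=2000)
```

With a flat cost curve the default patience ends the run long before the last iteration, while a larger patience lets it run to completion.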

#### HyperbolicKL on Tree data
Does not work very well. Often points don't even move from the center or other weirdness happens. I haven't tested this specific case much yet though.

#### HyperbolicKL on Real data
Pushes points towards the boundaries much more strongly than GaussianKL
\
See experiment (73, 74)


### HyperbolicKL to GaussianKL Motivation

The Gaussian has faster-decaying tails, so forces are propagated outward less and embeddings are less likely to be pushed outwards towards the edge.
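
A small numerical illustration of the tail argument, assuming HyperbolicKL uses the heavy-tailed Student-t kernel 1 / (1 + d^2) from standard t-SNE and GaussianKL uses exp(-d^2 / (2 var)) on the same distance d. Both kernel forms and the `var` parameter are assumptions made for this sketch; normalization is ignored.

```python
import numpy as np


def student_t_kernel(d):
    """Heavy-tailed kernel (assumed form for HyperbolicKL)."""
    return 1.0 / (1.0 + d ** 2)


def gaussian_kernel(d, var=1.0):
    """Light-tailed kernel (assumed form for GaussianKL)."""
    return np.exp(-d ** 2 / (2.0 * var))


distances = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
ratio = gaussian_kernel(distances) / student_t_kernel(distances)
for d, r in zip(distances, ratio):
    print(f"d = {d:4.1f}  gaussian / student-t = {r:.3e}")
```

At large distances the Gaussian similarity is many orders of magnitude smaller than the Student-t one, so far-apart pairs contribute much less to the forces.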

### GaussianKL notes

Some general notes:

- size_tol parameter and the Hyperbolic variance really affect the embedding. This requires some hand-tuning
- Early exaggeration 
- BH accel. numerical problems
- Large tree depth problems
- Crowding problem? Clusters being squished to points?
- GaussianKL gradient: (2/var term and learning rate)

#### GaussianKL on Tree dataset
On the tree dataset, GaussianKL seems to produce much better results than HyperbolicKL, although there are some issues.

#### GaussianKL on Real datasets (MNIST, C_ELEGANS for now)
On the real datasets, GaussianKL does not push embeddings towards the boundaries as much as HyperbolicKL. 
\
See experiment (71, 72)

# Tree dataset

To properly test the notion that tree-like data can be nicely embedded in Hyperbolic Space, we turn to an artificially created tree-like dataset. 
\
I basically wrote some code that generates a distance matrix **D** and an affinity matrix **V** whose entries reflect a tree-like ordering of the data.
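
A sketch of one way such matrices could be built, not the actual generator: it assumes a balanced tree, hop-count distances for **D**, and a single global Gaussian (with hypothetical parameter `var`) to turn distances into affinities **V**. All function names are made up for illustration.

```python
from collections import deque

import numpy as np


def balanced_tree_edges(branching, depth):
    """Edges (parent, child) of a balanced tree, nodes numbered breadth-first."""
    edges = []
    frontier = [0]
    next_id = 1
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                edges.append((parent, next_id))
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return edges, next_id


def tree_distance_matrix(edges, n_nodes):
    """All-pairs hop distances, using one breadth-first search per node."""
    neighbors = [[] for _ in range(n_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    D = np.full((n_nodes, n_nodes), -1.0)  # -1 marks "not visited yet"
    for source in range(n_nodes):
        D[source, source] = 0.0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if D[source, nxt] < 0:
                    D[source, nxt] = D[source, node] + 1.0
                    queue.append(nxt)
    return D


def affinity_matrix(D, var=1.0):
    """Symmetric affinities that sum to 1, from a Gaussian of the distances."""
    V = np.exp(-D ** 2 / (2.0 * var))
    np.fill_diagonal(V, 0.0)
    return V / V.sum()


edges, n_nodes = balanced_tree_edges(branching=2, depth=3)
D = tree_distance_matrix(edges, n_nodes)
V = affinity_matrix(D)
```

Siblings end up with a small distance and a high affinity, while leaves in different subtrees of the root are far apart and have a low affinity.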

# Experiments

Some notable experiments and analysis:

#### Experiment 38 (also 68)
This experiment nicely showcases a tree embedding using GaussianKL. However, this experiment had an extra grad \*= 4 term in the cost function (a leftover from when I copied over the HyperbolicKL code) that caused the embedding to converge to this specific visualization.
\
\
If this term is removed, the clusters are more dense and converge to points. (experiment 66)
\
\
However, we can regain clusters by adjusting the learning rate (this basically mimics the grad \*=4 mistake).
It seems that adjusting the learning rate will give us "better" (subjective and context-dependent) embeddings.
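
A toy check of why the two are interchangeable, assuming a plain gradient-descent update with momentum in which the gradient only enters through the product learning_rate * grad. The optimizer and cost below are stand-ins, not the project's solver.

```python
import numpy as np


def descend(grad_fn, y0, learning_rate, grad_multiplier=1.0,
            momentum=0.5, n_iter=50):
    """Gradient descent with momentum on a copy of y0."""
    y = y0.copy()
    update = np.zeros_like(y)
    for _ in range(n_iter):
        grad = grad_multiplier * grad_fn(y)
        update = momentum * update - learning_rate * grad
        y = y + update
    return y


def toy_grad(y):
    """Gradient of the toy cost sum(y**2)."""
    return 2.0 * y


y0 = np.array([[0.3, -0.2], [0.1, 0.4]])
with_multiplier = descend(toy_grad, y0, learning_rate=0.01, grad_multiplier=4.0)
with_larger_lr = descend(toy_grad, y0, learning_rate=0.04, grad_multiplier=1.0)
print(np.allclose(with_multiplier, with_larger_lr))
```

The equivalence would break if the solver used the raw gradient elsewhere, for example in gradient clipping or a gradient-norm stopping criterion.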


#### Experiment 73, 74
Showcases HyperbolicKL on MNIST. Points are pushed along the boundary very strongly.

#### Experiment 71, 72
Showcases GaussianKL on MNIST. Points are pushed much less toward the boundary and the embedding actually takes shape inside the visible part of the disk.

#### Experiment 83, 84
Attempt to reproduce 38, 68 without the grad \*= 4 term, by adjusting the learning rate instead, since it achieves basically the same result. This attempt was successful, which confirms that the learning rate strongly affects the embeddings.

#### Experiment 85, 86
Experiment that showcases how adjusting learning rate also improves embeddings for larger trees. Retains tree-structure and still displays inter-cluster nodes nicely.
\
\
Exp 86: Experiment to see whether BH accel. improves performance. It indeed does improve performance and the embeddings still look nice.

#### Experiment 88, 89
C_ELEGANS dataset using GaussianKL, BH accel. on/off, and adjusted learning rates

#### Experiment 91
HyperbolicKL on a tree-like dataset. Embeddings are strongly pushed towards the boundary

# Future work

- Proper comparison of GaussianKL vs HyperbolicKL losses (on different data sets)
- Extra investigation into BH accel. for GaussianKL
- Possible "crowding" issues again with GaussianKL (although HyperbolicKL seems to have it too).
  Clusters basically become point-like (for tree datasets), which is strange since it is not the case in experiment 38
- Produce regular t-sne visualizations of Tree-like data
- Quantitative comparison of gradients 
- Comparing cost function between gradients
- Motivate why we should use the correct gradient; compare the "wrong" gradient with the "right" gradient objectives: what is being minimized in either case? Do they both minimize the same objective?
- Writing; TU Delft writing center
- Structure all the goals and findings. Connect the contributions and progress with the experiments that back up the contributions. Have a logical ordering from hunter's thesis, to the wrong gradient, to the Gaussian gradient, and back up/connect the steps together. Identify my contributions in this chain and note them down for the thesis.
- Write down claims/goals explicitly, where is the story/thesis heading to. What are my claims?
- Add quantitative measurements for NE methods to substantiate my claims

- Time plan for this structure
- Aggregate important information into a file for teams
- Share overleaf link in teams
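
For the gradient-comparison items in the list above, one possible starting point is a finite-difference check: compare each analytic gradient against a numerical gradient of the cost it is supposed to minimize. The sketch below uses a toy cost and a deliberately incomplete gradient as stand-ins; none of the names come from the project code.

```python
import numpy as np


def numerical_gradient(cost_fn, y, eps=1e-6):
    """Central-difference gradient of a scalar cost with respect to y."""
    grad = np.zeros_like(y)
    for idx in np.ndindex(y.shape):
        step = np.zeros_like(y)
        step[idx] = eps
        grad[idx] = (cost_fn(y + step) - cost_fn(y - step)) / (2.0 * eps)
    return grad


def cosine_similarity(a, b):
    """Cosine of the angle between two gradients, flattened."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_cost(y):
    return float(np.sum(y ** 4))


def toy_exact_grad(y):
    return 4.0 * y ** 3


def toy_inexact_grad(y):
    return 4.0 * y  # drops a factor y**2, analogous to omitting a term


y = np.array([[0.3, -0.2], [0.1, 0.4]])
reference = numerical_gradient(toy_cost, y)
exact_error = np.max(np.abs(toy_exact_grad(y) - reference))
inexact_cosine = cosine_similarity(toy_inexact_grad(y), reference)
print(f"max error of exact gradient: {exact_error:.2e}")
print(f"cosine(inexact, reference):  {inexact_cosine:.3f}")
```

A cosine between 0 and 1 means the inexact gradient still points downhill on this cost but is not its true gradient, which is the kind of statement the comparison between the "wrong" and "right" gradients would need to make quantitatively.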

# Thesis structure

1. Introduction
    - Explain my contributions, where the thesis is heading
    - Why would people care
    - Challenges
2. Related work
    - data visualization (tsne)
    - Hyperbolic space visualizations
    - Hunters' work, other hyperbolic visualization
    - BH acceleration
3. Background
    - Technical details about what is required to understand my thesis
    - Hyperbolic space, t-sne, gradient descent, math etc..
4. Methods
    - What did I do on a conceptual level. 
    - What are my proposed solutions, adjustments, methods
    - The conceptual/theoretical part, motivations, derivations
5. Experiments
    - Experiments that motivate this thesis, connect them to my contributions/goals
    - Each experiment starts with a question targeting a contribution/goal, answers the question using the experiment, and relates it to the next experiment
6. Discussion
    - Discuss experiments and connect them together to shape a strong motivation for my goal
    - Focussed on content of thesis (methods/experiments)
7. Conclusion
    - Link experiments, discussion back to the original questions, goals, theme of the thesis. 
    - Present recommendations, risks, guidelines based on experiments/discussion
    - Put things in context globally

# Thesis storyline

1. Start with Hunters' work. Hyperbolic Embeddings in general. Try to obtain some results that bring doubt to claims in those original papers?
    - Wrong gradient vs Right gradient?
    - What is being optimized wrt. wrong gradient?
    - Is the actual cost function being minimized when we use the wrong gradient?
    - What is happening to the cost function when we use the wrong gradient?
    - How does this relate to the resulting embedding?
    - Talk about related works (Poincare map, co-sne, h-sne) that also use the wrong gradient.
    - What goes on when we use the correct gradient?
    - Can we compare gradients? (think about this a bit more)
    - Why do we care about a correct gradient? (Hyperbolic space, capturing hierarchical structure)
    - Does the wrong gradient capture hierarchical structure?
    - TODO: Experiment with wrong gradient on various datasets

Also go into core motivations behind Hyperbolic embeddings. Are we capable of capturing hierarchy (visually) using Hunters' work? If not, why not? How do we know we're not? 
\
\
What is the end goal of such work? What kind of visualizations would we like to obtain? Do we have baseline experiments that can affirm this? (HyperbolicKL variants on tree-like data)

2. Lead insights, problems, "mistakes" from .1 to the focus of my thesis. 
    - What does Hunters' work etc.. fail to capture? What would we expect vs. what is being captured?
    - 

3. My proposals for "resolving" issues/mistakes/problems arising from .1 and .2
    - GaussianKL (and why?)