Feature/statistical metric distance #21

NeonNeon · 2018-02-09T07:40:56Z

The statistical metric distance from Lu et al. (2013).

Needs some more work, but fixes #20

Need to fix the method that checks if a vlmc is a null model.

Kalior

Can probably clean this up a bit, but better to get it working first

Kalior · 2018-02-09T13:57:51Z

src/distance/statisticalmetric.pyx

+        return False
+    return True
+
+  cdef dict remove_larger_probabilities(self, tree, p_value):


This can probably use some comments

Kalior · 2018-02-09T14:07:26Z

src/distance/statisticalmetric.pyx

+
+    return tree
+
+  cdef str get_next_context(self, context_before, next_char, vlmc_to_approximate):


Parts of this function should probably be moved into the vlmc class; we seem to want to do this all over the place.

Kalior · 2018-02-09T14:09:24Z

src/distance/statisticalmetric.pyx

+        return True
+    return False
+
+  cdef bint check_from_context(self, start_context, sequence, right_vlmc):


Check what? We should probably rename this function to something more descriptive.

Kalior · 2018-02-09T14:12:26Z

src/distance/statisticalmetric.pyx

+    if 1 - p_value < self.significance_level:
+      return True
+
+  cdef int nbr_of_non_zero_probabilities(self, vlmc):


Doesn't seem to be used

Kalior · 2018-02-09T14:15:15Z

src/distance/statisticalmetric.pyx

+      times_visited_node = context_count[context]
+      if times_visited_node > 0:
+        alphabet = ["A", "C", "G", "T"]
+        probabilites_original = list(zip(alphabet, list(map(lambda x: original_vlmc.tree[context][x], alphabet ))))


Can we extract the map onto a separate line, maybe?
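One possible reading of this suggestion, as a minimal sketch: do the lookup on its own line and pair the result with the alphabet afterwards. Here `tree` and `context` are hypothetical stand-ins for `original_vlmc.tree` and the current context, and the probabilities are made up for illustration.

```python
# Hypothetical stand-in for original_vlmc.tree: context -> {character: probability}.
tree = {"AC": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}}
context = "AC"
alphabet = ["A", "C", "G", "T"]

# Look up the transition probabilities first...
probabilities = [tree[context][character] for character in alphabet]
# ...then pair each probability with its character on a separate line.
probabilities_original = list(zip(alphabet, probabilities))
```

A list comprehension replaces `map` plus `lambda` here, which keeps the behaviour but shortens the line.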

Kalior · 2018-02-09T15:14:52Z

src/test_distance_function.py


 def test_distance_function(d):
  tree_dir = "../trees"
  parse_trees_to_json.parse_trees(tree_dir)
  vlmcs = VLMC.from_json_dir(tree_dir)
  metadata = get_metadata_for([vlmc.name for vlmc in vlmcs])

-  for vlmc in vlmcs:
+  for vlmc in vlmcs[0:2]:


This probably shouldn't be left in

Added a few TODOs

Kalior · 2018-02-12T07:52:31Z

src/vlmc/vlmc.pyx

+    values, vectors = scipy.sparse.linalg.eigs(matrix, k=1, sigma=1)
+    np.set_printoptions(threshold=np.NaN)
+    print(values[0])
+    print(self.tree.keys())


These prints should probably be removed

Kalior · 2018-02-12T07:53:01Z

src/vlmc/vlmc.pyx

+            if truncated_context in self.tree:
+              contexts_we_can_get_to.append((truncated_context, probability_of_char))
+              break
+      print(contexts_we_can_get_to)


Remove this print.

Kalior · 2018-02-12T07:55:52Z

src/vlmc/vlmc.pyx

+    nbr_of_states = len(self.tree.keys())
+    alphabet = ["A", "C", "G", "T"]
+    rows = []
+    for left in self.tree.keys():


Can probably rename left to keys or contexts or something

Kalior · 2018-02-12T08:02:32Z

src/vlmc/vlmc.pyx

+
+    one = vectors[:, 0]
+    sum_ = np.sum(one)
+    scaled_vector = np.real(np.around(one.real / sum_, decimals=4))


Probably don't need to round this off?

Also add the estimated distribution whenever the eigenvector computation does not converge.

This is done to avoid crashes when the root is missing

Kalior

Some things we should go over. Also, we should remember the _ before "private" methods in statisticalmetric.pyx

Kalior · 2018-02-13T19:44:45Z

src/distance/statisticalmetric.pyx

+  7. Else exit with no equivalence.
+  """
+  cpdef double distance(self, left_vlmc, right_vlmc):
+    p_values = np.arange(0, 0.001, 0.00005) # should come from a function


We should probably look into this before merging

Kalior · 2018-02-13T19:56:35Z

src/distance/statisticalmetric.pyx

+    return True
+
+
+  cdef object remove_unlikely_events(self, vlmc, threshhold_probability):


I feel like maybe we could create three functions here, one for removing contexts, one for normalisation, and one for detecting if a context should be removed. May make this function a bit more clear.
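A rough sketch of what such a three-way split could look like. The function names, the dictionary layout (character to probability), and the choice to zero out transitions rather than delete whole contexts are all assumptions made for illustration, not the code in this PR.

```python
def should_remove(probability, threshold_probability):
    """Detect whether a transition is unlikely enough to be dropped."""
    return 0 < probability < threshold_probability


def remove_unlikely(transitions, threshold_probability):
    """Return a copy of the transitions with the flagged ones set to zero."""
    return {
        character: 0.0 if should_remove(probability, threshold_probability) else probability
        for character, probability in transitions.items()
    }


def normalize(transitions):
    """Rescale the remaining probabilities so they sum to one again."""
    total = sum(transitions.values())
    if total == 0:
        # Nothing left to rescale; return an unchanged copy.
        return dict(transitions)
    return {character: probability / total for character, probability in transitions.items()}


# Made-up transition probabilities for a single context.
transitions = {"A": 0.5, "C": 0.3, "G": 0.19, "T": 0.01}
pruned = normalize(remove_unlikely(transitions, 0.05))
```

Keeping detection, removal, and normalisation separate also makes each step easy to test on its own.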

Kalior · 2018-02-13T20:02:30Z

src/distance/statisticalmetric.pyx

+    for context in right_vlmc.tree.keys():
+      context_counters[context] = 0
+      transition_counters[context] = {}
+      for character in ["A", "C", "G", "T"]:


right_vlmc.alphabet?

Kalior · 2018-02-13T20:06:02Z

src/distance/statisticalmetric.pyx

+    for character in sequence:
+      context_counters[current_context] += 1
+      transition_counters[current_context][character] += 1
+      current_context = self.get_relvant_context(current_context, character, right_vlmc)


Can we use relevant context function from vlmc here?

Kalior · 2018-02-13T20:10:59Z

src/distance/statisticalmetric.pyx

+        # find probabilites that are greater than zero
+        probs_without_zeros = [item for item in transition_probabilitites if item[1] > 0]
+        # loop through all of these exept one (last)
+        for character_probability_tuple in probs_without_zeros[:-1]:


Can we destructure the tuples from the array here?
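For illustration, tuple unpacking in both the filter and the loop might look like the sketch below. The variable names mirror the diff loosely, the data is invented, and `visited` exists only so the example has something to show.

```python
# Made-up (character, probability) pairs for one context.
transition_probabilities = [("A", 0.5), ("C", 0.0), ("G", 0.3), ("T", 0.2)]

# Find probabilities that are greater than zero, unpacking each tuple by name.
probs_without_zeros = [
    (character, probability)
    for character, probability in transition_probabilities
    if probability > 0
]

# Loop through all of these except the last one.
visited = []
for character, probability in probs_without_zeros[:-1]:
    visited.append((character, probability))
```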

Kalior · 2018-02-13T20:20:51Z

src/vlmc/vlmc.pyx


  cpdef str generate_sequence(self, sequence_length, pre_sample_length):
    total_length = sequence_length + pre_sample_length
    cdef str generated_sequence = ""
+
+    if not "" in self.tree:


I kind of want to say this is a malformed vlmc, but this is needed since we prune away states, right? Have we considered protecting nodes with full transitions from being pruned? Maybe we can look into how Dalevi et al. prune as well?

We should definitely look at this more

Kalior · 2018-02-13T20:25:03Z

src/vlmc/vlmc.pyx

+            to_context = possible_context[-(self.order - i):]
+            if to_context in self.tree:
+              reachable_contexts.append((to_context, probability_of_char))
+              break


Can use _get_context() here?

Kalior · 2018-02-13T20:26:11Z

src/vlmc/vlmc.pyx

+
+    """
+    nbr_of_states = len(self.tree.keys())
+    rows = []


maybe we can preallocate a matrix here, for speedups.
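A minimal sketch of the preallocation idea, assuming the transitions are already available as `(from_context, to_context, probability)` triples (a hypothetical input format). Nested lists keep the example dependency-free; in the actual code a NumPy array of zeros would be the more likely choice.

```python
def build_transition_matrix(contexts, transitions):
    """Fill a preallocated square matrix instead of appending rows."""
    index_of = {context: i for i, context in enumerate(contexts)}
    size = len(contexts)
    # Allocate the full matrix up front.
    matrix = [[0.0] * size for _ in range(size)]
    for from_context, to_context, probability in transitions:
        # += so that several characters leading to the same context add up.
        matrix[index_of[from_context]][index_of[to_context]] += probability
    return matrix


# Made-up example: two characters take "T" back to the empty context.
contexts = ["", "G", "T"]
transitions = [("T", "", 0.3), ("T", "", 0.2), ("T", "G", 0.4), ("T", "T", 0.1)]
matrix = build_transition_matrix(contexts, transitions)
```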

Kalior · 2018-02-13T20:27:38Z

src/vlmc/vlmc.pyx

+    stationary_distibution = {}
+    for i, context in enumerate(self.tree.keys()):
+      stationary_distibution[context] = normalized_eigen_vector[i]
+    return stationary_distibution


Maybe we can create a function for this last normalisation step
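Such a helper could look roughly like this. The name `to_stationary_distribution` and its signature are hypothetical, and the sketch uses plain Python numbers instead of the NumPy vector returned by `eigs`.

```python
def to_stationary_distribution(contexts, eigen_vector):
    """Map each context to its normalised share of the eigenvector's real part."""
    real_values = [complex(value).real for value in eigen_vector]
    total = sum(real_values)
    if total == 0:
        raise ValueError("eigenvector sums to zero, cannot normalise")
    return {context: value / total for context, value in zip(contexts, real_values)}


# Made-up contexts and eigenvector entries.
contexts = ["", "A", "C"]
eigen_vector = [0.2 + 0j, 0.6 + 0j, 1.2 + 0j]
stationary_distribution = to_stationary_distribution(contexts, eigen_vector)
```

No rounding is applied in this sketch, so the values sum to one up to floating-point error.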

Kalior · 2018-02-13T20:27:51Z

src/vlmc/vlmc.pyx

+      print("Calculation of eigenvector did not converge, using estimated stationary distribution.")
+      return self._estimated_context_distribution(50000)
+
+    eigen_vector = vectors[:, 0]


This may need a comment

Occurs, for example, when generating a sequence and the vlmc ends up in a context for which there are no outgoing transitions with non-zero probability.

When removing the low-probability events of a model, only removes transitions that belong to leaf-contexts. For a full Markov chain of order 3, this means the outgoing transitions from all contexts with 3 letters, e.g. ACT.

If this is not done, we change the actual objects which makes the coming distance calculations completely wrong.

If a context, e.g. "T", has several transitions that go to the empty context (because, for example, the contexts "A" and "C" do not exist), we need to add these probabilities to get the total probability of moving from "T" to the empty context.

Or when it converges but the vector is full of nan-values.

The event probability of a transition _t_ is the probability that _t_ happens at any random index in a generated sequence

NeonNeon added 3 commits February 8, 2018 09:10

First working version of statistical-metric, only checks equivalence

43e9aee

Some refactoring and renaming

24ad7b5

Iterating over small P-values

c9f6fcb

Need to fix the method that checks if a vlmc is a null model.

Kalior reviewed Feb 9, 2018


NeonNeon added 3 commits February 11, 2018 22:42

Remove slice that was used to test locally

2807968

Add method to get the stationary distribution of the vlmc

55a986f

Some clean up, better names for functions etc

7d3e685

Added a few TODOs

Kalior reviewed Feb 12, 2018


NeonNeon and others added 15 commits February 12, 2018 10:21

Change some names and added some whitespace

3ac1b15

Extract the alphabet to an instance variable

b95b7e7

Make the vlmc able to generate a sequence even if it has no root

6c006ff

Tidy some code

88c6ad6

Further improve the stationary distribution method

4af9279

Also add the estimated distribution whenever the eigenvector computation does not converge.

Lots of cleaned up code in the statistical metric distance class

409c7f3

Refer to newly created alphabet variable

b47373c

Distance based on the stationary distribution of output from the model

fcb03f2

Distance based only on the acgt content in the training sequence

37f1f0f

Extract method to get the context given a sequence

2d8639c

Divide neg-log-likelihood by length of sequence

86ecf8e

Specify python version for tree parsing

8ce8560

bug-fix stationary distribution

9f4d948

First working version of statistical-metric, only checks equivalence

05c3d40

Look at longer sequences when estimating the stationary distribution

af25df8

This is done to avoid crashes when the root is missing

Kalior reviewed Feb 13, 2018


NeonNeon added 3 commits February 14, 2018 15:10

Return zero in likelihood if any probability is zero

632eb4f

Add the usage of AbsorbingStateException

4f850d2

Occurs, for example, when generating a sequence and the vlmc ends up in a context for which there are no outgoing transitions with non-zero probability.
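As a hedged illustration of the situation described above: the class name `AbsorbingStateException` comes from the commit title, but its definition and the `sample_next_character` helper below are assumptions, not the code in this PR.

```python
import random


class AbsorbingStateException(Exception):
    """Raised when a context has no outgoing transition with non-zero probability."""


def sample_next_character(transitions, rng=random):
    """Draw the next character, or raise if the context is absorbing."""
    characters = [c for c, probability in transitions.items() if probability > 0]
    if not characters:
        raise AbsorbingStateException("no outgoing transitions with non-zero probability")
    weights = [transitions[c] for c in characters]
    return rng.choices(characters, weights=weights, k=1)[0]


# A caller can catch the exception instead of crashing mid-sequence.
try:
    sample_next_character({"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0})
    hit_absorbing_state = False
except AbsorbingStateException:
    hit_absorbing_state = True
```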

Handle the AbsorbingStateException in the distance function

2c2355e

NeonNeon added 15 commits February 14, 2018 15:16

Use the get_context method from the vlmc instead of another method

9b4c410

Refactor remove_unlikely_events to smaller methods

231d627

Allocate transition matrix instead of appending values

735d61a

Use alphabet variable instead of redefining the array

537290b

Some more clean up

ffcc05f

Refactor get_transition_matrix method, hopefully clearer

42d4de1

Refactor stationary_context method

1667f92

Change prints during testing of statistical metric distance

768b1a4

Only consider leaf context transitions when pruning the VLMCs

66d1682

When removing the low-probability events of a model, only removes transitions that belong to leaf-contexts. For a full Markov chain of order 3, this means the outgoing transitions from all contexts with 3 letters, e.g. ACT.

Make sure to copy the vlmcs before calculating the distance

4b7855d

If this is not done, we change the actual objects which makes the coming distance calculations completely wrong.
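A small sketch of why the copy matters, using plain dictionaries in place of real vlmc objects. The helper `zero_small_transitions` and the tree layout are hypothetical.

```python
import copy


def zero_small_transitions(tree, threshold_probability):
    """Mutate the tree in place, zeroing transitions below the threshold."""
    for transitions in tree.values():
        for character, probability in transitions.items():
            if probability < threshold_probability:
                transitions[character] = 0.0


# Made-up tree: context -> {character: probability}.
original_tree = {"A": {"A": 0.98, "C": 0.02}}

# Deep-copy first so that pruning never touches the caller's model.
working_copy = copy.deepcopy(original_tree)
zero_small_transitions(working_copy, 0.05)
```

A shallow copy would not be enough here, because the inner transition dictionaries would still be shared with the original.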

Estimating context distribution when eigs not converging

9dc56b0

Or when it converges but the vector is full of nan-values.

Using event probabilities to know which transitions to prune

5323a4d

The event probability of a transition _t_ is the probability that _t_ happens at any random index in a generated sequence
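One way to read this definition, assuming the chain has reached its stationary distribution: the event probability of a transition is the stationary probability of its context multiplied by the transition probability of the character. The sketch below computes that product. The function name and the toy model are hypothetical, and this reading is an assumption, not something the PR states.

```python
def event_probabilities(stationary_distribution, tree):
    """Return {(context, character): stationary probability * transition probability}."""
    events = {}
    for context, transitions in tree.items():
        for character, probability in transitions.items():
            events[(context, character)] = stationary_distribution[context] * probability
    return events


# Toy two-context model whose stationary distribution is (0.75, 0.25).
tree = {"A": {"A": 0.8, "C": 0.2}, "C": {"A": 0.6, "C": 0.4}}
stationary_distribution = {"A": 0.75, "C": 0.25}
events = event_probabilities(stationary_distribution, tree)
```

Under this reading the event probabilities of all transitions sum to one, so a single threshold can be compared against every transition in the model.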

Merge branch 'master' into feature/statistical-metric-distance

20822fb

Check if _other_ is None in __eq__ method of vlmc

3599a2c
