New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/statistical metric distance #21
base: master
Are you sure you want to change the base?
Conversation
Need to fix the method that checks if a vlmc is a null model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably clean this up a bit, but better to get it working first
src/distance/statisticalmetric.pyx
Outdated
return False | ||
return True | ||
|
||
cdef dict remove_larger_probabilities(self, tree, p_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably use some comments
src/distance/statisticalmetric.pyx
Outdated
|
||
return tree | ||
|
||
cdef str get_next_context(self, context_before, next_char, vlmc_to_approximate): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parts of this function should probably be moved into the vlmc class, we seem to be want to do this all over the place.
src/distance/statisticalmetric.pyx
Outdated
return True | ||
return False | ||
|
||
cdef bint check_from_context(self, start_context, sequence, right_vlmc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check
what? We should probably rename this function into something more descriptive
src/distance/statisticalmetric.pyx
Outdated
if 1 - p_value < self.significance_level: | ||
return True | ||
|
||
cdef int nbr_of_non_zero_probabilities(self, vlmc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem to be used
src/distance/statisticalmetric.pyx
Outdated
times_visited_node = context_count[context] | ||
if times_visited_node > 0: | ||
alphabet = ["A", "C", "G", "T"] | ||
probabilites_original = list(zip(alphabet, list(map(lambda x: original_vlmc.tree[context][x], alphabet )))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract the map
onto a separate line maybe
src/test_distance_function.py
Outdated
|
||
def test_distance_function(d): | ||
tree_dir = "../trees" | ||
parse_trees_to_json.parse_trees(tree_dir) | ||
vlmcs = VLMC.from_json_dir(tree_dir) | ||
metadata = get_metadata_for([vlmc.name for vlmc in vlmcs]) | ||
|
||
for vlmc in vlmcs: | ||
for vlmc in vlmcs[0:2]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably shouldn't be left in
src/vlmc/vlmc.pyx
Outdated
values, vectors = scipy.sparse.linalg.eigs(matrix, k=1, sigma=1) | ||
np.set_printoptions(threshold=np.NaN) | ||
print(values[0]) | ||
print(self.tree.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These print
s should probably be removed
src/vlmc/vlmc.pyx
Outdated
if truncated_context in self.tree: | ||
contexts_we_can_get_to.append((truncated_context, probability_of_char)) | ||
break | ||
print(contexts_we_can_get_to) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print
remove
src/vlmc/vlmc.pyx
Outdated
nbr_of_states = len(self.tree.keys()) | ||
alphabet = ["A", "C", "G", "T"] | ||
rows = [] | ||
for left in self.tree.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably rename left
to keys
or contexts
or something
src/vlmc/vlmc.pyx
Outdated
|
||
one = vectors[:, 0] | ||
sum_ = np.sum(one) | ||
scaled_vector = np.real(np.around(one.real / sum_, decimals=4)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't need to round this off?
Also add the estimaded distribution whenever the eigen vector computation does not converge.
This is done to avoid crashes when the root is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some things we should go over. Also, we should remember the _
before "private" methods in statisticalmetric.pyx
src/distance/statisticalmetric.pyx
Outdated
7. Else exit with no equivalence. | ||
""" | ||
cpdef double distance(self, left_vlmc, right_vlmc): | ||
p_values = np.arange(0, 0.001, 0.00005) # should come from a function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably look into this before merging
src/distance/statisticalmetric.pyx
Outdated
return True | ||
|
||
|
||
cdef object remove_unlikely_events(self, vlmc, threshhold_probability): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like maybe we could create three functions here, one for removing contexts, one for normalisation, and one for detecting if a context should be removed. May make this function a bit more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/distance/statisticalmetric.pyx
Outdated
for context in right_vlmc.tree.keys(): | ||
context_counters[context] = 0 | ||
transition_counters[context] = {} | ||
for character in ["A", "C", "G", "T"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right_vlmc.alphabet?
src/distance/statisticalmetric.pyx
Outdated
for character in sequence: | ||
context_counters[current_context] += 1 | ||
transition_counters[current_context][character] += 1 | ||
current_context = self.get_relvant_context(current_context, character, right_vlmc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use relevant context function from vlmc here?
src/distance/statisticalmetric.pyx
Outdated
# find probabilites that are greater than zero | ||
probs_without_zeros = [item for item in transition_probabilitites if item[1] > 0] | ||
# loop through all of these exept one (last) | ||
for character_probability_tuple in probs_without_zeros[:-1]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we deconstruct the tuples from the array here?
|
||
cpdef str generate_sequence(self, sequence_length, pre_sample_length): | ||
total_length = sequence_length + pre_sample_length | ||
cdef str generated_sequence = "" | ||
|
||
if not "" in self.tree: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of want to say this is a malformed vlmc, but this is needed since we prune away states, right? Have we considered protecting nodes with full transitions from being pruned? Maybe we can look into how Dalevi et al. prune as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely look at this more
src/vlmc/vlmc.pyx
Outdated
to_context = possible_context[-(self.order - i):] | ||
if to_context in self.tree: | ||
reachable_contexts.append((to_context, probability_of_char)) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use _get_context() here?
src/vlmc/vlmc.pyx
Outdated
|
||
""" | ||
nbr_of_states = len(self.tree.keys()) | ||
rows = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can preallocate a matrix here, for speedups.
stationary_distibution = {} | ||
for i, context in enumerate(self.tree.keys()): | ||
stationary_distibution[context] = normalized_eigen_vector[i] | ||
return stationary_distibution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can create a function for this last normalisation step
src/vlmc/vlmc.pyx
Outdated
print("Calculation of eigenvector did not converge, using estimated stationary distribution.") | ||
return self._estimated_context_distribution(50000) | ||
|
||
eigen_vector = vectors[:, 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may need a comment
Occurs for example, when generating a sequence and the vlmc end up in a context for which there is no outgoing transitions with non-zero probability.
When removing the low-probability events of a model, only removes transitions that belong to leaf-contexts. For a full Markov chain of order 3, this means the outgoing transitions from all contexts with 3 letters, e.g. ACT.
If this is not done, we change the actual objects which makes the coming distance calculations completely wrong.
If a context, e.g. "T", have several transtions that goes to the empty context, because context "A" and "C" does not exist e.g., we need to add these probabilites to get the total probability of moving from "T" to the empty context
Or when it converges but the vector is full of nan-values.
The event probability of a transition _t_ is the probability that _t_ happens at any random index in a generated sequence
The statistical metric distance from Lu et. al 2013
Needs some more work, but fixes #20