
Feature/statistical metric distance #21

Open. Wants to merge 39 commits into base: master.
Conversation

@NeonNeon (Collaborator) commented Feb 9, 2018:

The statistical metric distance from Lu et al. (2013).

Needs some more work, but fixes #20.

@Kalior (Owner) left a comment:

Can probably clean this up a bit, but better to get it working first

return False
return True

cdef dict remove_larger_probabilities(self, tree, p_value):
@Kalior (Owner):

This can probably use some comments


return tree

cdef str get_next_context(self, context_before, next_char, vlmc_to_approximate):
@Kalior (Owner):

Parts of this function should probably be moved into the vlmc class; we seem to want to do this all over the place.

return True
return False

cdef bint check_from_context(self, start_context, sequence, right_vlmc):
@Kalior (Owner):

Check what? We should probably rename this function to something more descriptive.

if 1 - p_value < self.significance_level:
return True

cdef int nbr_of_non_zero_probabilities(self, vlmc):
@Kalior (Owner):

Doesn't seem to be used

times_visited_node = context_count[context]
if times_visited_node > 0:
alphabet = ["A", "C", "G", "T"]
probabilites_original = list(zip(alphabet, list(map(lambda x: original_vlmc.tree[context][x], alphabet ))))
@Kalior (Owner):

Can we extract the map onto a separate line, maybe?
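A quick sketch of the extraction this comment suggests (the tree layout, `original_vlmc`, and `context` are assumptions taken from the surrounding diff):

```python
alphabet = ["A", "C", "G", "T"]

def context_probabilities(tree, context):
    # Build the per-character probabilities on their own line first,
    # then pair them with the alphabet, instead of nesting a map()
    # inside the zip() call.
    probabilities = [tree[context][char] for char in alphabet]
    return list(zip(alphabet, probabilities))

# Tiny stand-in tree, for illustration only.
tree = {"A": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}}
probabilites_original = context_probabilities(tree, "A")
```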


def test_distance_function(d):
tree_dir = "../trees"
parse_trees_to_json.parse_trees(tree_dir)
vlmcs = VLMC.from_json_dir(tree_dir)
metadata = get_metadata_for([vlmc.name for vlmc in vlmcs])

for vlmc in vlmcs:
for vlmc in vlmcs[0:2]:
@Kalior (Owner):

This probably shouldn't be left in

values, vectors = scipy.sparse.linalg.eigs(matrix, k=1, sigma=1)
np.set_printoptions(threshold=np.NaN)
print(values[0])
print(self.tree.keys())
@Kalior (Owner):

These prints should probably be removed

if truncated_context in self.tree:
contexts_we_can_get_to.append((truncated_context, probability_of_char))
break
print(contexts_we_can_get_to)
@Kalior (Owner):

Remove this print as well.

nbr_of_states = len(self.tree.keys())
alphabet = ["A", "C", "G", "T"]
rows = []
for left in self.tree.keys():
@Kalior (Owner):

Can probably rename left to keys or contexts or something


one = vectors[:, 0]
sum_ = np.sum(one)
scaled_vector = np.real(np.around(one.real / sum_, decimals=4))
@Kalior (Owner):

Probably don't need to round this off?
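If the rounding is dropped, the normalization step could look like this (a sketch; `vectors` comes from the eigensolver in the diff above):

```python
import numpy as np

def normalize_eigenvector(vector):
    # Take the real part of the (possibly complex) eigenvector and
    # scale it so its entries sum to 1, keeping full precision instead
    # of rounding to 4 decimals.
    real = np.real(vector)
    return real / np.sum(real)

# Illustration with a hand-made complex vector.
v = np.array([2.0 + 0j, 1.0 + 0j, 1.0 + 0j])
scaled = normalize_eigenvector(v)
```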

@Kalior (Owner) left a comment:

Some things we should go over. Also, we should remember the _ before "private" methods in statisticalmetric.pyx

7. Else exit with no equivalence.
"""
cpdef double distance(self, left_vlmc, right_vlmc):
p_values = np.arange(0, 0.001, 0.00005) # should come from a function
@Kalior (Owner):

We should probably look into this before merging
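One way to stop hard-coding the grid, as a starting point (a hypothetical helper, not from the PR; the start/stop/step values are the ones in the diff):

```python
import numpy as np

def p_value_candidates(start=0.0, stop=0.001, step=0.00005):
    # Centralize the grid of p-value thresholds so it can be tuned,
    # or eventually computed from the data, in one place.
    return np.arange(start, stop, step)

p_values = p_value_candidates()
```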

return True


cdef object remove_unlikely_events(self, vlmc, threshhold_probability):
@Kalior (Owner):

I feel like maybe we could create three functions here: one for removing contexts, one for normalisation, and one for detecting if a context should be removed. That may make this function a bit clearer.
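A sketch of that three-function split, operating on one node of the tree (the context -> {char: probability} layout is an assumption from the surrounding diff):

```python
def should_remove(probabilities, threshold):
    # Detection: flag each transition whose probability is below the
    # threshold.
    return {char: p < threshold for char, p in probabilities.items()}

def remove_unlikely(probabilities, threshold):
    # Removal: zero out the flagged transitions.
    return {char: (0.0 if p < threshold else p)
            for char, p in probabilities.items()}

def normalize(probabilities):
    # Normalisation: rescale the surviving probabilities to sum to 1.
    total = sum(probabilities.values())
    if total == 0:
        return probabilities
    return {char: p / total for char, p in probabilities.items()}

node = {"A": 0.05, "C": 0.15, "G": 0.4, "T": 0.4}
pruned = normalize(remove_unlikely(node, threshold=0.1))
```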

@NeonNeon (Collaborator, Author):

bitmoji

for context in right_vlmc.tree.keys():
context_counters[context] = 0
transition_counters[context] = {}
for character in ["A", "C", "G", "T"]:
@Kalior (Owner):

right_vlmc.alphabet?

for character in sequence:
context_counters[current_context] += 1
transition_counters[current_context][character] += 1
current_context = self.get_relvant_context(current_context, character, right_vlmc)
@Kalior (Owner):

Can we use relevant context function from vlmc here?

# find probabilities that are greater than zero
probs_without_zeros = [item for item in transition_probabilitites if item[1] > 0]
# loop through all of these except the last one
for character_probability_tuple in probs_without_zeros[:-1]:
@Kalior (Owner):

Can we destructure the tuples from the list here?
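That would look like this (sample probabilities are made up; only the loop header changes):

```python
transition_probabilitites = [("A", 0.5), ("C", 0.0), ("G", 0.3), ("T", 0.2)]
probs_without_zeros = [item for item in transition_probabilitites if item[1] > 0]

collected = []
# Unpack (character, probability) directly in the loop header, so the
# body never needs item[0] / item[1] indexing.
for character, probability in probs_without_zeros[:-1]:
    collected.append((character, probability))
```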


cpdef str generate_sequence(self, sequence_length, pre_sample_length):
total_length = sequence_length + pre_sample_length
cdef str generated_sequence = ""

if not "" in self.tree:
@Kalior (Owner):

I kind of want to say this is a malformed vlmc, but this is needed since we prune away states, right? Have we considered protecting nodes with full transitions from being pruned? Maybe we can look into how Dalevi et al. prune as well?

@NeonNeon (Collaborator, Author):

We should definitely look at this more

to_context = possible_context[-(self.order - i):]
if to_context in self.tree:
reachable_contexts.append((to_context, probability_of_char))
break
@Kalior (Owner):

Can use _get_context() here?


"""
nbr_of_states = len(self.tree.keys())
rows = []
@Kalior (Owner):

Maybe we can preallocate a matrix here, for speedups.
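A preallocation sketch (the `transition_probability` callable is a placeholder for whatever the diff computes per context pair):

```python
import numpy as np

def build_transition_matrix(tree, transition_probability):
    # Preallocate an n x n array and fill it in place, instead of
    # appending Python lists row by row and converting at the end.
    contexts = list(tree.keys())
    n = len(contexts)
    matrix = np.zeros((n, n))
    for i, left in enumerate(contexts):
        for j, right in enumerate(contexts):
            matrix[i, j] = transition_probability(left, right)
    return matrix

# Illustration: a two-state tree with identity transitions.
m = build_transition_matrix({"A": None, "C": None},
                            lambda l, r: 1.0 if l == r else 0.0)
```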

stationary_distibution = {}
for i, context in enumerate(self.tree.keys()):
stationary_distibution[context] = normalized_eigen_vector[i]
return stationary_distibution
@Kalior (Owner):

Maybe we can create a function for this last normalisation step
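Something like this, perhaps (a hypothetical helper; the names follow the diff context):

```python
def stationary_distribution_from(contexts, normalized_eigen_vector):
    # Pair each context with its entry in the normalized eigenvector,
    # in iteration order.
    return {context: normalized_eigen_vector[i]
            for i, context in enumerate(contexts)}

dist = stationary_distribution_from(["A", "C"], [0.25, 0.75])
```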

print("Calculation of eigenvector did not converge, using estimated stationary distribution.")
return self._estimated_context_distribution(50000)

eigen_vector = vectors[:, 0]
@Kalior (Owner):

This may need a comment

Occurs, for example, when generating a sequence and the vlmc ends up in a context with no outgoing non-zero-probability transitions.

When removing the low-probability events of a model, only removes transitions that belong to leaf contexts. For a full Markov chain of order 3, this means the outgoing transitions from all contexts with 3 letters, e.g. ACT.

If this is not done, we change the actual objects, which makes the subsequent distance calculations completely wrong.

If a context, e.g. "T", has several transitions that go to the empty context (because contexts "A" and "C" do not exist, say), we need to add these probabilities to get the total probability of moving from "T" to the empty context.

Or when it converges but the vector is full of NaN values.

The event probability of a transition _t_ is the probability that _t_ happens at any random index in a generated sequence.
Successfully merging this pull request may close these issues.

Implement HMM distance function from Lu et al. 2013
2 participants