Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Shared Weights + Tests #33

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Implement Shared Weights + Tests #33

wants to merge 12 commits into from

Conversation

morgsmss7
Copy link

@morgsmss7 morgsmss7 commented Mar 14, 2020

I'm initiating this PR because for some reason, the test_tree.py file will not run on @jheiko1's machine. She will make comments for her tests and assign reviewers below.

Fixes #28

@jheiko1 and I have been working on the same branch. I have updated the criteria to save the outputs used in impurity calculations. @jheiko1 wrote tests for the new split criteria.

As of right now, the shared weights code is not ready for review, but the tests are.

@@ -65,14 +65,11 @@ cdef class Criterion:
cdef double impurity_improvement(self, double impurity) nogil
cdef double proxy_impurity_improvement(self) nogil

<<<<<<< HEAD
cdef double node_impurity2(self, double* pred_weights)
cdef void children_impurity2(self, double* impurity_left,
double* impurity_right, double* pred_weights)
cdef double proxy_impurity_improvement2(self, double* pred_weights) nogil
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@morgsmss7 is this used?

Copy link
Author

@morgsmss7 morgsmss7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted a few things about the tests. @jheiko1 Mostly they were good. My main comments would be to split everything into different tests, recalculate MSE for the tests with new y values, and look over some of the logic for the random state tests. Looks good otherwise :). I also added some comments for myself. Don't worry about those.

sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_splitter.pyx Outdated Show resolved Hide resolved
Comment on lines 1814 to 1832
-----------------------

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jheiko1 make sure you have charts for each of the tests like the ones above and include your impurity calculations in the comments as well.

Likewise for Right node:
Right node Mean1 = Mean2 = 4
Total error = ((4 - 4)^2 * 1.0)
= 0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jheiko1 add a separate test with a descriptive name for each thing you are testing

Copy link
Author

@morgsmss7 morgsmss7 Mar 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for the tree fit at line 1841, I think this is incorrect because you are going to need a try-except block. Since there is only 1 split, the algorithm is going to randomly choose either the first output (3,3,4,7,8) OR the second output (2,4,3,6,7) to split on, so the impurity can either be the same as line 1852 if the first output is chosen or it could be what you calculated here. Just make sure you write out your calculation. (I edited this after looking at it more)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove line 1845

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, for some reason it won't let me comment on some of the lines, but at line 1855, I might suggest increasing to range 30 or something just because it might be stronger.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for lines 1870 and 1875, I think you need to set a variable to np.random.randint(1,100,(5,7)) and then use the same y for both dt_axis_3 and dt_axis_4 because otherwise it will definitely be different because they are using different data. If it doesn't pass, I would increase the number or outputs from 7 to like 20.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after line 1878, I think you also need to break the for loop because just because the assert runs doesn't mean it won't continue trying until i = 100. The way it is right now, whether the test passes is contingent on what happens at random state 100.

Likewise for Right node:
Right node Mean1 = Mean2 = 4
Total error = ((4 - 4)^2 * 1.0)
= 0

Right Impurity = Total error / total weight
= 0 / 1.0
= 0.0
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jheiko1 line 1987 needs to be different. You'll need to calculate what happens when both y1 and y2 are chosen here. You also need to add another nested try-except block or two because a few things could happen: just y1 can be chosen with weight 1 or -1, just y2 can be chosen with weight 1 or -1, both can be chosen with weights 1, both can be chosen with weights -1, or both can be chosen y1 with weight 1 and y2 with weight -1 or y2 with weight 1 and y1 with weight -1. You are going to need a try except for each case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 2007, same as previous comment about axis_proj. use the same y for both.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also line 1992, increase to a larger number to make the argument stronger

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 2009: same comment as axis. I think you need a break statement.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in line 2014. should be oblique not axis (that was probably me)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in 2025. should be oblique not MAE (that was probably also me)

@@ -1657,3 +1440,4 @@ def __init__(self,
min_impurity_split=min_impurity_split,
random_state=random_state,
ccp_alpha=ccp_alpha)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of unnecessary changes throughout and in this file. I'll fix these. @jheiko1

@morgsmss7
Copy link
Author

The shared weights portion of the PR should be ready for review now!

@sampan501 sampan501 self-requested a review April 6, 2020 18:57
Copy link
Member

@sampan501 sampan501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started with a few areas where the algorithm speed could be improved. Since you are using cython, I highly recommend using MemoryView syntax for numpy arrays whenever are doing basic math operations


cdef DOUBLE_t w = 1.0
for p in range(start, pos):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next 2 loops can be merged into one


cdef DOUBLE_t w = 1.0

for p in range(start, end):
i = samples[p]
if sample_weight != NULL:
w = sample_weight[i]
y_ik = self.y[i, k]
sq_sum_total += w * y_ik * y_ik
for k in range(self.n_outputs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be faster to use matrix manipulation instead of loop. Something like np.sum(w * self.y * self.y * pred_weights)


impurity = sq_sum_total / self.weighted_n_node_samples
impurity -= (sum_total[k] / self.weighted_n_node_samples)**2.0
for k in range(self.n_outputs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, faster with list comprehension do something like impurity -= np.sum([sum_total[k] * pred_weights[k] / self.weighted_n_node_samples for k in range(self.n_outputs)

proxy_impurity_right += sum_right[k] * sum_right[k]


for k in range(self.n_outputs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list comprehension or broadcasting

impurity_left[0] = sq_sum_left / self.weighted_n_left
impurity_right[0] = sq_sum_right / self.weighted_n_right

impurity_left[0] -= (sum_left[k] / self.weighted_n_left) ** 2.0
impurity_right[0] -= (sum_right[k] / self.weighted_n_right) ** 2.0
for k in range(self.n_outputs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list comprehension thing here again

cdef SIZE_t i
cdef SIZE_t p
cdef SIZE_t k
cdef UINT32_t rand_r_state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this code removed?

cdef double* sum_left = self.sum_left
cdef double* sum_right = self.sum_right

cdef SIZE_t k
cdef double proxy_impurity_left = 0.0
cdef double proxy_impurity_right = 0.0

cdef UINT32_t rand_r_state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same about this code block

@@ -1642,10 +1553,11 @@ cdef class ObliqueProjection(RegressionCriterion):

if sample_weight != NULL:
w = sample_weight[i]

for k in range(self.n_outputs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, same list comment

@jheiko1 jheiko1 requested review from bdpedigo and j1c May 14, 2020 01:45
@morgsmss7
Copy link
Author

@j1c @bdpedigo I've looked over the checks that aren't passing, and I'm not sure what the problem is really. All of the tests that fail in the checks pass on my machine (which I guess is understandable, but I can't really fix it if I can't recreate it) and a lot of them seem pretty unrelated to the changes we made. I know this was a reach goal for the first sprint this semester, so it doesn't really matter in terms of grades, but I'm wondering how you guys think I should proceed.

@bdpedigo bdpedigo removed their request for review November 16, 2020 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests for oblique and axis projection splitters.
3 participants