
Tree - grammar, typos and some re-wording #39

Merged
merged 17 commits into from Oct 13, 2020

Conversation

@lucyleeow (Collaborator) commented Aug 13, 2020

Nice notebook.

I would have liked to see the decision tree plot (with arrows and boxes) earlier and to show it with the plots of decision boundary - it is a nice way to visualise the splits. It also makes it obvious that the splits are done per box/partition (i.e., for the 2nd level you need 2 splits, one for each partition).

I will put specific comments at the relevant sections.
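A minimal sketch of that suggestion (assuming the fitted `tree` classifier, the `X_train`/`y_train` penguin features as a DataFrame, and the `plot_decision_function` helper defined in the notebook; exact variable names may differ):

```python
# Hypothetical side-by-side view: tree structure (boxes and arrows) next to
# the decision boundary it produces. All names are assumed from the notebook.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, (ax_tree, ax_boundary) = plt.subplots(ncols=2, figsize=(12, 5))
plot_tree(tree, feature_names=list(X_train.columns), filled=True, ax=ax_tree)
plot_decision_function(X_train, y_train, tree, ax=ax_boundary)
plt.show()
```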

Comment on lines 91 to 93
# In a previous notebook, we learnt that a linear classifier will define a
# linear separation to split classes using a linear combination of the input
# features. In our 2-dimensional space, it means that a linear classifier will
Collaborator Author

This is actually more explanation than is in the linear models notebook. I think I already mentioned it in the other PR but it would be nice explain in linear models that the linear separation line formula is a linear combination of the features.
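(To spell that out with a formula, in my wording rather than the notebook's: in our 2-dimensional space the linear separation would be the set of points where something like $w_1 x_1 + w_2 x_2 + b = 0$ holds, with $w_1, w_2$ the learnt coefficients and $b$ the intercept, i.e., a straight, possibly oblique, line.)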

Contributor

Yes indeed, I think it is okay to repeat things.

it would be nice explain in linear models that the linear separation line formula is a linear combination of the features.

+1

Collaborator Author

Sorry, I meant that linear.py should include this nice explanation of linear separation.

# defined some oblique lines that best separate our classes. We define a
# function below that given a set of data point and a classifier will plot the
# decision boundaries learnt by the classifier.
# define some oblique lines that best separate our classes. We define a
Collaborator Author

We only demonstrated binary classification in the linear model notebook. Maybe we could give more details about how these oblique lines work/look for >2 classes, e.g., is there one line for each class?

Also, maybe explain how these oblique separating lines relate to the decision boundaries.
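For instance (a throwaway example, not notebook code), scikit-learn's linear classifiers expose one coefficient vector, i.e., one oblique line in 2D, per class when there are more than two classes:

```python
# Made-up 3-class toy data; the predicted class is the one whose linear
# score (one coefficient vector per class) is highest.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.coef_.shape)  # (3, 2): one coefficient vector (one line) per class
```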

# For a binary problem, the entropy function for one of the class can be
# depicted as follows:
# For a binary problem (e.g., only 2 classes of penguins), the entropy function
# for one of the class can be depicted as follows:
#
# ![title](https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg)
#
Collaborator Author

(Below) Nitpick - I would explain it in terms of one class (as this is what is shown in the plot above), e.g., entropy is maximal when the proportion of that class is 50% (as the other class is also 50%), and minimal when only class x is present, or only the other class is present (probability 0).

Collaborator Author

Also, maybe a quick explanation of >2 classes, e.g., (I assume) for 3 classes entropy is maximal when all 3 classes are at 33%?
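A quick numerical check of that claim (a throwaway snippet, not part of the notebook):

```python
import numpy as np

def entropy(p):
    # entropy in bits of a class-probability vector
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))        # 1.0, the maximum for 2 classes
print(entropy([1.0, 0.0]))        # 0.0, a pure partition
print(entropy([1/3, 1/3, 1/3]))   # ~1.585, the maximum for 3 classes
```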

Contributor

In the code below I can read:

and minimum when only samples for a single class is present.

Do you think we should add more details here?
With 3 classes, it is going to be very verbose, I think.

Collaborator Author

Made a suggestion in a new commit.

Comment on lines +479 to +483
# times (until there is no classification error on the training set,
# i.e., all final partitions consist of only one class). In
# the above example, it corresponds to setting the `max_depth` parameter to
# `None`. This allows the algorithm to keep making splits until the final
# partitions are pure.
Collaborator Author

Tricky, but should we be talking about splitting until there is no error... (as this would be inadvisable due to overfitting)? Is this the 'default' behaviour of the tree algorithm?

Contributor

For the moment I'm fine with that. It seems to me it is clearly explained here.

Collaborator Author

It is clearly explained. My concern is that doing this is often not advisable as it leads to overfitting.

Is it too verbose to say something like:
'until there is no classification error on the training set or one of the limiting parameters such as `max_depth` or `min_samples_leaf` has been reached'?
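A rough sketch of that wording in code (reusing the `X_train`/`y_train` names from the notebook):

```python
from sklearn.tree import DecisionTreeClassifier

# max_depth=None: keep splitting until every leaf is pure on the training set
full_tree = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)
# a limiting parameter stops the splits earlier and usually generalises better
small_tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# the fully grown tree scores 1.0 on the training set (unless identical
# samples carry different labels); the limited one typically scores below 1.0
print(full_tree.score(X_train, y_train), small_tree.score(X_train, y_train))
```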

@@ -626,9 +632,10 @@ def plot_regression_model(X, y, model, extrapolate=False, ax=None):
_ = plot_regression_model(X_train, y_train, tree)

# %% [markdown]
# We see that the decision tree model does not have a priori and do not end-up
# We see that the decision tree model does not have a priori and we do not
Collaborator Author

I assume 'a priori' = no assumption on the data distribution?

Maybe make the connection why straight line = no assumption about the data distribution...?

Contributor

For me too, it is not clear what "a priori" means here.

# with a straight line to regress flipper length and body mass. The prediction
# of a new sample, which was already present in the training set, will give the
# of a sample from the training set will give the
Collaborator Author

I'm not sure exactly how to interpret the regression tree plot. How can one visualise the splits? Maybe a tree plot alongside would be helpful?

Also, it's not obvious to me that predicting using a training sample will give the target of the training sample...?

Collaborator Author

Below: 'In the case of regression, the predicted value corresponds to the mean of the target in the node.' - I don't know how to tell what a node is in the graph.

Collaborator Author

Below - 'extrapolate to unseen data' - do you mean extrapolate to unseen data outside of the range seen in the training dataset?
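To make the extrapolation point concrete, a tiny made-up 1D example (not the penguins data):

```python
# Outside the range seen during training, a tree keeps predicting the
# constant value of its left- or right-most leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_fit = np.linspace(0, 10, 50).reshape(-1, 1)
y_fit = 2 * X_fit.ravel() + rng.normal(scale=0.5, size=50)

reg = DecisionTreeRegressor(max_depth=3).fit(X_fit, y_fit)
# 15 and 100 are far outside [0, 10] but get the same prediction as 9.9,
# which already falls in the right-most leaf
print(reg.predict([[9.9], [15.0], [100.0]]))
```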

@TwsThomas (Contributor) commented Aug 17, 2020

I would have liked to see the decision tree plot (with arrows and boxes) earlier

There are some slides on trees, with that kind of figure, that should be presented before this notebook.

@@ -340,8 +340,11 @@ def plot_decision_function(X, y, clf, ax=None):
#
# Therefore, the entropy will be maximum when the proportion of samples from
# each class is equal (i.e. $p(X_k)$ is 50%) and minimum when only samples for
# a single class is present (i.e., $p(X_k)$ is 100%, definitely class `X`,
# or 0%, definitely the other class).
# a single class is present (i.e., $p(X_k)$ is 100%, only class `X`,
Contributor

I am not sure what the notation $p(X_k)$ refers to.
Is X the data? I guess 'k' is the class.
Maybe we could use p_k to represent the frequency of class k in the partition.

Collaborator Author

Whoops, I was trying to use the same notation as the graph: https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg
I have amended it to use the same notation as the graph.

Contributor

We should be careful to keep the same notation as the entropy formula above.

Collaborator Author

Happy to make it the same as the formula above. The notation of the graph won't be the same, though.

@lucyleeow (Collaborator Author) commented Aug 19, 2020

Wait, in my first version I did use the same notation as above?

The entropy is defined as: $H(X) = - \sum_{k=1}^{K} p(X_k) \log p(X_k)$

Collaborator Author

Did you change the notation?

Contributor

Did you change the notation?

Yes ^^
I replaced p(X_k) with p_k for the probability of class k.

Collaborator Author

Sure, I've amended it, but now it doesn't match the notation in the graph.

lucyleeow and others added 12 commits August 19, 2020 10:39
# We see that the decision tree model does not have a priori distribution
# for the data and we do not end-up
# with a straight line to regress flipper length and body mass.
# Having different body masses
# for a same flipper length, the tree will be predicting the mean of the
# targets.
#
# So in classification setting, we saw that the predicted value was the most
# probable value in the node of the tree. In the case of regression, the
Collaborator Author

Should 'node' be 'leaf' here as well then, @TwsThomas?
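One way to check the 'mean of the target' statement (a sketch assuming the fitted regression `tree` and the `X_train`/`y_train` from that section):

```python
import numpy as np

# leaf index reached by each training sample
leaf_ids = tree.apply(X_train)
# mean target of the training samples that share the first sample's leaf
leaf_mean = np.asarray(y_train)[leaf_ids == leaf_ids[0]].mean()
# with the default squared-error criterion the prediction is that leaf mean
print(leaf_mean, tree.predict(X_train[:1])[0])
```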

@lesteve merged commit 67f967f into INRIA:master on Oct 13, 2020