
Tree - grammar, typos and some re-wording #39

Merged
merged 17 commits into from Oct 13, 2020

Conversation

@lucyleeow (Collaborator) commented Aug 13, 2020

Nice notebook.

I would have liked to see the decision tree plot (with arrows and boxes) earlier and to show it with the plots of decision boundary - it is a nice way to visualise the splits. It also makes it obvious that the splits are done per box/partition (i.e., for the 2nd level you need 2 splits, one for each partition).

I will put specific comments at the relevant sections.
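A minimal sketch of that suggestion (assuming the fitted `tree` classifier, the `X_train`/`y_train` penguin features as a DataFrame, and the `plot_decision_function` helper defined in the notebook; exact variable names may differ):

```python
# Hypothetical side-by-side view: tree structure (boxes and arrows) next to
# the decision boundary it produces. All names are assumed from the notebook.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, (ax_tree, ax_boundary) = plt.subplots(ncols=2, figsize=(12, 5))
plot_tree(tree, feature_names=list(X_train.columns), filled=True, ax=ax_tree)
plot_decision_function(X_train, y_train, tree, ax=ax_boundary)
plt.show()
```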

Comment on lines 91 to 93
# In a previous notebook, we learnt that a linear classifier will define a
# linear separation to split classes using a linear combination of the input
# features. In our 2-dimensional space, it means that a linear classifier will
Collaborator Author

This is actually more explanation than is in the linear models notebook. I think I already mentioned it in the other PR but it would be nice explain in linear models that the linear separation line formula is a linear combination of the features.
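(To spell that out with a formula, in my wording rather than the notebook's: in our 2-dimensional space the linear separation would be the set of points where something like $w_1 x_1 + w_2 x_2 + b = 0$ holds, with $w_1, w_2$ the learnt coefficients and $b$ the intercept, i.e., a straight, possibly oblique, line.)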

Contributor

Yes indeed, I think it is okay to repeat things.

it would be nice explain in linear models that the linear separation line formula is a linear combination of the features.

+1

Collaborator Author

Sorry, I meant that linear.py should include this nice explanation of linear separation.

# defined some oblique lines that best separate our classes. We define a
# function below that given a set of data point and a classifier will plot the
# decision boundaries learnt by the classifier.
# define some oblique lines that best separate our classes. We define a
Collaborator Author

We only demonstrated binary classification in the linear model notebook. Maybe we could give more details about how these oblique lines work/look for >2 classes, e.g., is there one line for each class?

Also, maybe explain how these oblique separating lines relate to the decision boundaries.
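For instance (a throwaway example, not notebook code), scikit-learn's linear classifiers expose one coefficient vector, i.e., one oblique line in 2D, per class when there are more than two classes:

```python
# Made-up 3-class toy data; the predicted class is the one whose linear
# score (one coefficient vector per class) is highest.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.coef_.shape)  # (3, 2): one coefficient vector (one line) per class
```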

# For a binary problem, the entropy function for one of the class can be
# depicted as follows:
# For a binary problem (e.g., only 2 classes of penguins), the entropy function
# for one of the class can be depicted as follows:
#
# ![title](https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg)
#
Collaborator Author

(Below) Nitpick - I would explain it in terms of one class (as this is what is shown in the plot above), e.g., entropy is maximal when the proportion of that class is 50% (as the other class is also 50%), and minimal when only class x is present, or only the other class is present (probability 0).

Collaborator Author

Also, maybe a quick explanation of >2 classes, e.g., (I assume) for 3 classes entropy is maximal when all 3 classes are at 33%?
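A quick numerical check of that claim (a throwaway snippet, not part of the notebook):

```python
import numpy as np

def entropy(p):
    # entropy in bits of a class-probability vector
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))        # 1.0, the maximum for 2 classes
print(entropy([1.0, 0.0]))        # 0.0, a pure partition
print(entropy([1/3, 1/3, 1/3]))   # ~1.585, the maximum for 3 classes
```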

Contributor

In the code below I can read:

and minimum when only samples for a single class is present.

Do you think we should add more details here?
With 3 classes, it is going to be very verbose, I think.

Collaborator Author

Made a suggestion in a new commit.

Comment on lines +479 to +483
# times (until there is no classification error on the training set,
# i.e., all final partitions consist of only one class). In
# the above example, it corresponds to setting the `max_depth` parameter to
# `None`. This allows the algorithm to keep making splits until the final
# partitions are pure.
Collaborator Author

Tricky, but should we be talking about splitting until there is no error... (as this would be inadvisable due to overfitting)? Is this the 'default' behaviour of the tree algorithm?

Contributor

For the moment I'm fine with that. It seems to me it is clearly explained here.

Collaborator Author

It is clearly explained. My concern is that doing this is often not advisable as it leads to overfitting.

Is it too verbose to say something like:
'until there is no classification error on the training set or one of the limiting parameters such as `max_depth` or `min_samples_leaf` has been reached'?
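A rough sketch of that wording in code (reusing the `X_train`/`y_train` names from the notebook):

```python
from sklearn.tree import DecisionTreeClassifier

# max_depth=None: keep splitting until every leaf is pure on the training set
full_tree = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)
# a limiting parameter stops the splits earlier and usually generalises better
small_tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# the fully grown tree scores 1.0 on the training set (unless identical
# samples carry different labels); the limited one typically scores below 1.0
print(full_tree.score(X_train, y_train), small_tree.score(X_train, y_train))
```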

@@ -626,9 +632,10 @@ def plot_regression_model(X, y, model, extrapolate=False, ax=None):
_ = plot_regression_model(X_train, y_train, tree)

# %% [markdown]
# We see that the decision tree model does not have a priori and do not end-up
# We see that the decision tree model does not have a priori and we do not
Collaborator Author

I assume 'a priori' = no assumption on the data distribution?

Maybe make the connection why straight line = no assumption about the data distribution...?

Contributor

For me too, it is not clear what "a priori" means here.

# with a straight line to regress flipper length and body mass. The prediction
# of a new sample, which was already present in the training set, will give the
# of a sample from the training set will give the
Collaborator Author

I'm not sure exactly how to interpret the regression tree plot. How can one visualise the splits? Maybe a tree plot alongside would be helpful?

Also, it's not obvious to me that predicting using a training sample will give the target of the training sample...?

Collaborator Author

Below: 'In the case of regression, the predicted value corresponds to the mean of the target in the node.' - I don't know how to tell what a node is in the graph.

Collaborator Author

Below - 'extrapolate to unseen data' - do you mean extrapolate to unseen data outside of the range seen in the training dataset?
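To make the extrapolation point concrete, a tiny made-up 1D example (not the penguins data):

```python
# Outside the range seen during training, a tree keeps predicting the
# constant value of its left- or right-most leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_fit = np.linspace(0, 10, 50).reshape(-1, 1)
y_fit = 2 * X_fit.ravel() + rng.normal(scale=0.5, size=50)

reg = DecisionTreeRegressor(max_depth=3).fit(X_fit, y_fit)
# 15 and 100 are far outside [0, 10] but get the same prediction as 9.9,
# which already falls in the right-most leaf
print(reg.predict([[9.9], [15.0], [100.0]]))
```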

@TwsThomas (Contributor) commented Aug 17, 2020

I would have liked to see the decision tree plot (with arrows and boxes) earlier

There are some slides on trees, with that kind of figure, that should be presented before this notebook.

@@ -340,8 +340,11 @@ def plot_decision_function(X, y, clf, ax=None):
#
# Therefore, the entropy will be maximum when the proportion of samples from
# each class is equal (i.e. $p(X_k)$ is 50%) and minimum when only samples for
# a single class is present (i.e., $p(X_k)$ is 100%, definitely class `X`,
# or 0%, definitely the other class).
# a single class is present (i.e., $p(X_k)$ is 100%, only class `X`,
Contributor

I am not sure what the notation $p(X_k)$ refers to.
Is X the data? I guess 'k' is the class.
Maybe we could use p_k to represent the frequency of class k in the partition.

Collaborator Author

Whoops, I was trying to use the same notation as the graph: https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg
I have amended it to use the same notation as the graph.

Contributor

We should be careful to keep the same notation as the entropy formula above.

Collaborator Author

Happy to make it the same as the formula above. The notation of the graph won't be the same, though.

@lucyleeow (Collaborator Author) commented Aug 19, 2020

Wait, in my first version I did use the same notation as above?

The entropy is defined as: $H(X) = - \sum_{k=1}^{K} p(X_k) \log p(X_k)$

Collaborator Author

Did you change the notation?

Contributor

Did you change the notation?

Yes ^^
I replaced p(X_k) with p_k for the probability of class k.

Collaborator Author

Sure, I've amended it, but now it doesn't match the notation in the graph.

lucyleeow and others added 12 commits August 19, 2020 10:39
# We see that the decision tree model does not have a priori distribution
# for the data and we do not end-up
# with a straight line to regress flipper length and body mass.
# Having different body masses
# for a same flipper length, the tree will be predicting the mean of the
# targets.
#
# So in classification setting, we saw that the predicted value was the most
# probable value in the node of the tree. In the case of regression, the
Collaborator Author

Should 'node' be 'leaf' here as well then, @TwsThomas?
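One way to check the 'mean of the target' statement (a sketch assuming the fitted regression `tree` and the `X_train`/`y_train` from that section):

```python
import numpy as np

# leaf index reached by each training sample
leaf_ids = tree.apply(X_train)
# mean target of the training samples that share the first sample's leaf
leaf_mean = np.asarray(y_train)[leaf_ids == leaf_ids[0]].mean()
# with the default squared-error criterion the prediction is that leaf mean
print(leaf_mean, tree.predict(X_train[:1])[0])
```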

@lesteve merged commit 67f967f into INRIA:master on Oct 13, 2020