Tree - grammar, typos and some re-wording #39
Conversation
# In a previous notebook, we learnt that a linear classifier will define a
# linear separation to split classes using a linear combination of the input
# features. In our 2-dimensional space, it means that a linear classifier will
This is actually more explanation than is in the linear models notebook. I think I already mentioned it in the other PR, but it would be nice to explain in linear models that the linear separation line formula is a linear combination of the features.
Yes indeed, I think it is okay to repeat things.
it would be nice to explain in linear models that the linear separation line formula is a linear combination of the features.
+1
Sorry, I meant for linear.py to include this nice explanation of linear separation.
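To make that concrete, here is a minimal sketch of what such an explanation could demonstrate (the dataset and variable names are made up, not from the notebook): a fitted linear classifier's decision function is literally a linear combination of the input features, and the separating line is where it equals zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy 2D binary classification problem (illustrative only).
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
# The decision function is the linear combination w1*x1 + w2*x2 + b;
# the separating line is where this combination equals zero.
manual = w1 * X[:, 0] + w2 * X[:, 1] + b
assert np.allclose(manual, clf.decision_function(X))
```

The sign of that combination also gives the predicted class, which is the point the notebook is making.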
# defined some oblique lines that best separate our classes. We define a
# function below that, given a set of data points and a classifier, will plot
# the decision boundaries learnt by the classifier.
# define some oblique lines that best separate our classes. We define a
We only demonstrated binary classification in the linear model notebook. Maybe we could give more details about how these oblique lines work/look for >2 classes, e.g., is there one line for each class?
Also, maybe explain how these oblique separating lines relate to the decision boundaries.
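For what it's worth, the "one line per class" intuition can be checked with a quick sketch (hypothetical, not notebook code): with 3 classes, scikit-learn's `LogisticRegression` learns one weight vector per class, and the boundary between two classes is where their linear scores tie.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Three well-separated 2D blobs, one per class (illustrative only).
X, y = make_blobs(n_samples=150, centers=3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One weight vector (hence one oblique line) per class.
print(clf.coef_.shape)  # (3, 2): 3 classes x 2 features
# A point is assigned to the class with the highest linear score, so the
# decision boundary between two classes is where their scores are equal.
```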
# For a binary problem, the entropy function for one of the classes can be
# depicted as follows:
# For a binary problem (e.g., only 2 classes of penguins), the entropy function
# for one of the classes can be depicted as follows:
#
# ![title](https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg)
#
(Below) Nitpick - I would explain it in terms of one class (as this is what is shown in the plot above), e.g., entropy is maximal when the proportion from the class is 50% (as the other class is also 50%), and minimal when only class X is present, or only the other class is present (prob=0).
Also, maybe add a quick explanation for >2 classes, e.g., (I assume) for 3 classes the entropy is maximal when all 3 classes are at 33%?
In the code below I can read:
"and minimum when only samples for a single class is present."
Do you think we should add more details here?
With 3 classes, it is going to be very verbose I think.
Made a suggestion in new commit.
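For reference, the 3-class behaviour discussed above can be checked numerically. A small sketch (hand-rolled helper, not part of the notebook) confirming that entropy is maximal for a uniform class distribution and zero for a pure partition:

```python
import numpy as np

def entropy(proportions):
    """Shannon entropy (in bits) of a vector of class proportions."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]  # convention: 0 * log(0) = 0
    return -(p * np.log2(p)).sum()

print(entropy([0.5, 0.5]))        # 1.0 -> maximal for 2 classes
print(entropy([1.0, 0.0]))        # 0.0 -> pure partition
print(entropy([1/3, 1/3, 1/3]))   # ~1.585 -> maximal for 3 classes
print(entropy([0.8, 0.1, 0.1]))   # smaller than the 3-class uniform case
```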
# times (until there is no classification error on the training set,
# i.e., all final partitions consist of only one class). In
# the above example, it corresponds to setting the `max_depth` parameter to
# `None`. This allows the algorithm to keep making splits until the final
# partitions are pure.
Tricky, but should we be talking about splitting until no error ... (as this would be inadvisable due to overfitting)? Is this the 'default' action of the tree algorithm?
For the moment I'm fine with that. It seems to me it is clearly explained here.
It is clearly explained. My concern is that doing this is often not advisable as it leads to overfitting.
Is it too verbose to say something like:
'until there is no classification error on the training set or one of the limiting parameters, such as `max_depth` or `min_samples_leaf`, has been reached'?
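The proposed wording can be illustrated with a quick sketch (hypothetical, using iris rather than the penguins data): `max_depth=None` grows the tree until every leaf is pure, so training accuracy reaches 100%, while a limiting parameter such as `max_depth` stops growth early.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=None: keep splitting until every leaf is pure -> zero training
# error, but this is often inadvisable as it can overfit.
full = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)
# A limiting parameter such as max_depth stops the growth early instead.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(full.score(X, y))                   # 1.0 on the training set
print(full.get_depth(), shallow.get_depth())
```

Comparing test-set scores of the two trees would be the natural follow-up to show the overfitting concern.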
python_scripts/trees.py
Outdated
@@ -626,9 +632,10 @@ def plot_regression_model(X, y, model, extrapolate=False, ax=None):
_ = plot_regression_model(X_train, y_train, tree)

# %% [markdown]
# We see that the decision tree model does not have a priori and do not end-up
# We see that the decision tree model does not have a priori and we do not
I assume priori = no assumption on data distribution ?
Maybe make the connection why straight line = no assumption about the data distribution... ?
For me also it is not clear what "a priori" means here
python_scripts/trees.py
Outdated
# with a straight line to regress flipper length and body mass. The prediction
# of a new sample, which was already present in the training set, will give the
# of a sample from the training set will give the
I'm not sure exactly how to interpret the regression tree plot. How can one visualise the splits? Maybe a tree plot alongside would be helpful?
Also, it's not obvious to me that predicting using a training sample will give the target of the training sample?
Below 'In the case of regression, the predicted value corresponds to the mean of the target in the node.' - I don't know how to tell what is a node in the graph.
Below - 'extrapolate to unseen data' - do you mean extrapolate to unseen data outside of the range seen in the training dataset?
There are some slides on trees, with that kind of figure, that should be presented before this notebook.
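One way to visualise the splits, as asked above, is to place `sklearn.tree.plot_tree` next to the regression plot. A minimal sketch (synthetic data standing in for flipper length and body mass, not the notebook's code): each internal box shows a split threshold, and each leaf box's "value" is the mean target predicted for that partition.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree

rng = np.random.RandomState(0)
flipper = np.sort(rng.uniform(170, 230, size=(40, 1)), axis=0)
body_mass = 30 * flipper.ravel() + rng.normal(scale=300, size=40)

tree = DecisionTreeRegressor(max_depth=2).fit(flipper, body_mass)

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 4))
ax0.scatter(flipper, body_mass)
grid = np.linspace(170, 230, 300).reshape(-1, 1)
ax0.plot(grid, tree.predict(grid), color="black")  # piecewise-constant fit
# The tree diagram next to it: internal boxes are thresholds on the
# feature, leaf boxes show the mean target ("value") for each partition.
plot_tree(tree, feature_names=["flipper_length"], filled=True, ax=ax1)
fig.savefig("tree_regression.png")
```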
python_scripts/trees.py
Outdated
@@ -340,8 +340,11 @@ def plot_decision_function(X, y, clf, ax=None):
#
# Therefore, the entropy will be maximum when the proportion of samples from
# each class is equal (i.e. $p(X_k)$ is 50%) and minimum when only samples for
# a single class is present (i.e., $p(X_k)$ is 100%, definitely class `X`,
# or 0%, definitely the other class).
# a single class is present (i.e., $p(X_k)$ is 100%, only class `X`,
I am not sure what the notation refers to.
Is X the data? I guess 'k' is the class.
Maybe we could use $p_k$ to represent the frequency of class k in the partition.
Whoops, I was trying to use the same notation as the graph: https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg
I have amended it to be the same notation as the graph
We should be careful to keep the same notation as the entropy formula above.
Happy to make it the same as the formula above. The notation of the graph won't be the same though
Wait, in my first version I did use the same notation as above:
The entropy is defined as: $H(X) = - \sum_{k=1}^{K} p(X_k) \log p(X_k)$
Did you change the notation?
Did you change the notation?
Yes ^^
I replaced $p(X_k)$ by $p_k$ for the probability of class k.
Sure, I've amended but now it doesn't match notation in the graph
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Tree - review from PR INRIA#39
# We see that the decision tree model does not have a priori distribution
# for the data and we do not end up
# with a straight line to regress flipper length and body mass.
# Having different body masses
# for the same flipper length, the tree will predict the mean of the
# targets.
#
# So in the classification setting, we saw that the predicted value was the
# most probable value in the node of the tree. In the case of regression, the
Should 'node' be 'leaf' here as well then @TwsThomas ?
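To check the "mean of the targets" claim, a tiny sketch (made-up numbers, not the penguins data): two training samples share the same flipper length but have different body masses, so they land in the same leaf and the prediction for either is the mean of their targets, not the exact training value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two samples share flipper length 180 mm but have different body masses.
X = np.array([[180.], [180.], [200.], [220.]])
y = np.array([3500., 3700., 4200., 4400.])

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
# With depth 1, the best split is at 190 mm; both 180 mm samples fall in
# the left leaf, so the prediction is the mean of their targets.
print(tree.predict([[180.]]))  # [3600.]
```

This also illustrates the point above: a prediction for a training sample is only equal to its target when that sample sits alone (or with identical targets) in its leaf.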
Nice notebook.
I would have liked to see the decision tree plot (with arrows and boxes) earlier, shown alongside the decision boundary plots - it is a nice way to visualise the splits. It also makes it obvious that the splits are done per box/partition (i.e., for the 2nd level you need 2 splits, one for each partition).
I will put specific comments at the relevant sections.