# Mud card questions

## SVM

- **How does kernel density relate to the SVR?**
    - SVMs with radial basis functions use a similar technique to 'smooth' the points
    - but SVMs can have other kernels too

- **How can I select kernel for SVM?**
    - check the manual

- **For SVC: Is there a typical range of gamma values that we'd be able to predict for an unseen dataset or would we use trial and error?**
    - trial and error
    - we don't know in advance which technique and what parameters will give the best performance.

- **It seems like gamma = 1 always gives the best result for the examples. Under what circumstances would a different gamma be better?**
    - gamma = 1 might work best on that one specific dataset but there is no guarantee it will be best on a different dataset
    - always tune your hyperparameters to optimize performance!
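
A minimal sketch of what "trial and error" looks like for gamma. The toy dataset (`make_moons`), the split sizes, and the gamma grid are assumptions chosen only for illustration; the best gamma it reports applies to this dataset and this split, nothing more.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data, used here only for illustration
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Try gamma values spread over several orders of magnitude
gammas = np.logspace(-3, 3, 7)  # 0.001, 0.01, ..., 1000
val_scores = []
for gamma in gammas:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0)
    clf.fit(X_train, y_train)                   # fit on the training set
    val_scores.append(clf.score(X_val, y_val))  # accuracy on the validation set

# Keep the gamma with the best validation score
best_gamma = gammas[int(np.argmax(val_scores))]
for gamma, score in zip(gammas, val_scores):
    print(f"gamma={gamma:g}: validation accuracy {score:.3f}")
print("best gamma on this dataset and split:", best_gamma)
```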

## RF

- **I was surprised to learn that random forests are not ideal for large datasets. Are linear and logistic regression the only tools used for large datasets?**
    - it's not that RF must not be used on large datasets
    - you just need to anticipate longer training times as a con for this technique
    - RFs can also be parallelized, so you can use multiple cores to speed up the calculations (`n_jobs` parameter)
    - Just FYI, SVMs are much slower on large datasets than RFs, and SVMs can't be run in parallel, so that's the technique I would not use on large datasets
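
As a small illustration of the `n_jobs` parameter mentioned above, here is a sketch on a synthetic dataset (the dataset and its size are assumptions for demonstration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 asks sklearn to use all available cores when fitting the trees
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)
print("number of trees fitted:", len(rf.estimators_))
```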
    

- **For decision tree in regression: I understood the explanation in Quiz 2 for why a tree can't be arbitrarily deep since we can't split a node if it doesn't have at least two points - but I want to clarify, does that mean that even for a huge dataset with thousands of features there will eventually be some max_depth parameter that is too big for the tree? Because the quiz says that the max_depth parameter can be arbitrarily large, but how can that be if the actual depth of the tree is limited? Shouldn't the max_depth reach a maximum achievable point for every tree?**
    - by default, the tree keeps splitting until there is one point in each leaf; let's call that depth `d_tree`
    - if the max_depth value is smaller than `d_tree`, the tree will not reach a depth of `d_tree`; it will be truncated earlier and some leaves will contain more than one point
    - if max_depth is equal to or larger than `d_tree`, it has no impact on the tree and no error message will be given
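
This can be checked with a quick sketch. The synthetic regression data below is an assumption for illustration; `get_depth()` reports the depth the fitted tree actually reached.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression data, used here only for illustration
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=200)

# Default max_depth=None: the tree grows until it cannot split any further
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
d_tree = full_tree.get_depth()

# A max_depth far above d_tree changes nothing and raises no error
huge = DecisionTreeRegressor(max_depth=10 * d_tree, random_state=0).fit(X, y)

# A max_depth below d_tree truncates the tree
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

print("d_tree:", d_tree)
print("depth with max_depth = 10 * d_tree:", huge.get_depth())
print("depth with max_depth = 2:", shallow.get_depth())
```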


- **Finally, for random forests, why do we assign equal weights to all decision trees, regardless of their accuracy? Would we achieve a better overall prediction if we weighted decision trees by performance within a random forest?**
- **on the tree question I understand that in that case all of the trees have equal weight but could you build a tree that has differently weighted subtrees that works as an effective classifier?**
    - yes, you might, but sklearn's random forest is not implemented that way.
    - you can go to sklearn's repository and check whether someone else has requested this feature in the past

- **I'm wondering how random forest makes decisions about which variables to place at what level on the tree and how continuous features are split (for example: >50 years old left branch, <=50 years old right branch etc.)? Is it actually at random and sklearn picks the trees that perform best or is it more involved?**
    - not random, it's more involved :)
    - Classification and Regression Trees, Breiman et al., 1984
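
To give a rough idea of what "more involved" means, here is a simplified sketch of the greedy threshold search used by CART-style trees for one continuous feature: try candidate thresholds and keep the one that gives the purest children. The helper names `gini` and `best_threshold` and the toy age data are hypothetical, and this is not sklearn's actual (optimized) implementation. In a random forest, a search like this is repeated at every node over a random subset of the features.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try the midpoints between sorted unique values of x and return the
    threshold with the lowest weighted Gini impurity of the two children."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    best_t, best_score = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical toy data: the label switches from 0 to 1 between ages 48 and 52
age = np.array([22, 35, 41, 48, 52, 58, 63, 70])
label = np.array([0, 0, 0, 0, 1, 1, 1, 1])

threshold, impurity = best_threshold(age, label)
print("best threshold:", threshold, "weighted impurity:", impurity)
```

On this toy data the search picks a threshold of 50, which separates the two classes perfectly.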


- **Is the decision tree always binary?**
    - all tree implementations that I am aware of are binary

- **could you please go over a bit of how to tune the other parameters of RF like min_split_weights, and to what extent are they actually impactful on the results on both classification or regression.**
    - same as any other parameter, select a couple of values, calculate train and validation scores, check which value optimizes the validation score
    - no clue how impactful that parameter is. it's different for every dataset.

- **Is there a way to control the random forest parameters to bias it toward certain features or is it the randomness the strength of having multiple trees up to a point?**
    - no way to manually bias the trees and you don't want to do that anyway
    - the optimization algorithms will automatically find the model that's best on the training set
    - the hyperparameter tuning will find the model that's best on the validation set

## Other

- **Can you explain the y_pred part and what is the function of train, test, and validation so I know what to fit or what to run which command on?**
    - I would really prefer if you understood what you do and why rather than just apply the commands blindly
    - the commands you need to apply are not always the same
    - generally you fit your model on the training set, and you apply .predict or .predict_proba on the validation and test sets
    - you apply .predict and .predict_proba only once on the test set, at the very end of the ML pipeline after you tuned the hyperparameters and you found the best parameter values
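
A minimal sketch of that workflow. The synthetic dataset, the 60/20/20 split, logistic regression as the model, and the grid of `C` values are all assumptions picked for illustration; the commands in a real pipeline depend on the problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 60% train, 20% validation, 20% test
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_other, y_other, test_size=0.25, random_state=0
)

best_score, best_model = -1.0, None
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)            # fit on the training set only
    y_val_pred = model.predict(X_val)      # predict on the validation set
    score = (y_val_pred == y_val).mean()   # validation accuracy
    if score > best_score:
        best_score, best_model = score, model

# The test set is used once, at the very end, with the tuned model
y_test_pred = best_model.predict(X_test)
test_score = (y_test_pred == y_test).mean()
print(f"best validation accuracy: {best_score:.3f}")
print(f"test accuracy: {test_score:.3f}")
```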

- **How do we know whether to choose the class 0 or class 1 probability when working with y_new in quiz 1?**
    - usually the condition is that if pred_proba > p_crit, the point belongs to class 1, otherwise to class 0.
    - with that condition, pred_proba should be the class 1 probabilities
    - if you change the condition, you might need to change which class probabilities are used
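
A small sketch of both conditions. The probability array is made up for illustration and stands in for the output of `.predict_proba`, where the columns follow the order of the classifier's `classes_` attribute (column 0 is class 0 and column 1 is class 1 for labels 0 and 1).

```python
import numpy as np

# Hypothetical predict_proba output: column 0 = P(class 0), column 1 = P(class 1)
pred_proba = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.5, 0.5],
                       [0.2, 0.8]])
p_crit = 0.5

# Condition on the class 1 probabilities
y_new = (pred_proba[:, 1] > p_crit).astype(int)
print(y_new)  # [0 1 0 1]

# Same decision rule written with the class 0 probabilities:
# the condition has to flip because P(class 0) = 1 - P(class 1)
y_new_alt = (pred_proba[:, 0] < 1 - p_crit).astype(int)
print(y_new_alt)  # [0 1 0 1]
```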

- **What does smooth prediction mean?**
    - let's check the figures in the lecture notes

- **When finding the right fit using either SVM or random forests how do you precisely determine the parameters that will minimize bias and variance?**
    - select the parameters that give the best validation score

- **What does the big data structure Z from the linear and logistic regression represent?**
    - I don't know what you are referring to

- **Throughout the lecture & notebook we've only seen examples with two features. I would like to see examples with more features.**
    - if the dataset had more than two features, the code would be exactly the same except for the preprocessing step
    - we use the adult and house price datasets which have more than two features

- **In quiz 3, we write the tree by our common sense or by looking a little bit into our dataset, however, if we later apply random forest classifier by Sklearn package in real-world data, how can we understand the generation process of those trees?**
    - the trees are generated to maximize performance on the training data
    - you can print out and visualize each individual tree, see [here](https://scikit-learn.org/stable/modules/tree.html#tree)
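
For example, a fitted forest keeps its individual trees in `estimators_`, and `sklearn.tree.export_text` prints the split rules of one of them. The iris dataset and the small forest settings below are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

iris = load_iris()

# A small, shallow forest so the printed rules stay readable
rf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
rf.fit(iris.data, iris.target)

# Each fitted tree is available in rf.estimators_
first_tree = rf.estimators_[0]
rules = export_text(first_tree, feature_names=list(iris.feature_names))
print(rules)
```

`sklearn.tree.plot_tree` draws the same tree as a figure if matplotlib is installed.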

- **In Quiz 5, I used a different random state (1) and got 10 as the gamma that maximizes accuracy, instead of 1. The same thing happened when I ran Andras' code. Why is that? I thought the random state didn't matter, apart from reproducibility**
    - different random states give slightly different results.
    - this is why the random state needs to be fixed (for reproducibility) and you need to try several different states.
    - if you try different states, you'll see the uncertainty due to the inherent randomness in splitting and some ML algorithms
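
A sketch of how to look at that uncertainty: repeat the split with a few random states and report the mean and standard deviation of the validation score for each gamma. The toy dataset, the three gamma values, and the choice of five random states are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data, used here only for illustration
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
gammas = [0.1, 1, 10]
n_states = 5

# scores[i, j] = validation accuracy for random state i and gamma j
scores = np.zeros((n_states, len(gammas)))
for i in range(n_states):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=i
    )
    for j, gamma in enumerate(gammas):
        clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
        scores[i, j] = clf.score(X_val, y_val)

# The spread across random states shows how much the split alone moves the score
for gamma, mean, std in zip(gammas, scores.mean(axis=0), scores.std(axis=0)):
    print(f"gamma={gamma}: validation accuracy {mean:.3f} +/- {std:.3f}")
```

If the differences between gamma values are smaller than the spread across random states, the "best" gamma can easily change from one random state to another.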

- **Quiz 3, Quiz 4, Quiz 5**