# Mud card and Piazza questions

## Mix

- **How can I compare the different evaluation metrics if they result in different suggestions? Like a model may have high R^2, low MSE or MAE**
    - you do not compare different metrics!
    - you choose one metric, calculate that metric for different models, and compare the metric values of different models

- **It would be great if you could give rough regions of what is considered good and bad values of each metric, like when you mentioned that 0.35 was pretty good for log-loss.**
    - keep in mind that you use the evaluation metric to compare different models
    - you need to know two things:
        - is the model better if my score is larger?
            - for accuracy, recall, precision, f score, yes
            - logloss is better if it is smaller
            - MSE, RMSE, MAE are better if they are smaller
            - R2 is better the closer to 1 it is
        - what's the evaluation metric if my model has no predictive power (chance level)?
            - for accuracy: it is the fraction of points in the most populous class
            - for MSE: it is the variance of the target variable (what you get if you always predict the mean)
            - for RMSE: it is the standard deviation of the target variable
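
The chance-level values above can be checked numerically. This is a minimal sketch with made-up targets (the arrays `y_class` and `y_reg` are illustrative, not course data): the baseline classifier always predicts the most populous class, and the baseline regressor always predicts the mean of the target.

```python
import numpy as np

# Classification: a made-up imbalanced target (90% class 0, 10% class 1).
y_class = np.array([0] * 90 + [1] * 10)
# Baseline model: always predict the most populous class.
majority_class = np.bincount(y_class).argmax()
baseline_accuracy = np.mean(y_class == majority_class)

# Regression: a made-up continuous target.
y_reg = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# Baseline model: always predict the mean of the target.
baseline_pred = np.full_like(y_reg, y_reg.mean())
baseline_mse = np.mean((y_reg - baseline_pred) ** 2)
baseline_rmse = np.sqrt(baseline_mse)

print("baseline accuracy:", baseline_accuracy)             # fraction of the majority class
print("baseline MSE:", baseline_mse, "variance:", np.var(y_reg))
print("baseline RMSE:", baseline_rmse, "std:", np.std(y_reg))
```

A model is only useful if it beats these numbers, so an accuracy of 0.9 on this classification target would mean no predictive power at all.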

- **Is there any kind of way to determine which metric to choose? (I know it said consult with a team and that it is based on the question we are asking), but once we have a question formulated is there some kind of guide to help decide or do we try all metrics?**
- **I still don't fully understand how to choose an evaluation metric. If there isn't stakeholder input and there aren't ethical considerations, how are evaluation metrics selected? How can we compare them?**
    - if the problem is imbalanced classification: accuracy won't work so try the f1 score or maybe logloss
    - if you want to avoid false positives, try precision or f_beta with beta < 1.
    - if you want to avoid false negatives, try recall or f_beta with beta > 1.
    - if you want to make sure your predicted probabilities are as accurate as possible because you'll need to rank your points based on the predicted probabilities for an intervention, use logloss.
    - in regression, RMSE or R2 are used most often.

- **How can we illustrate confidence levels for classification and regression?**
    - most ML algorithms do not return a confidence level for an individual prediction.
        - regression models return one predicted value
        - classification models return one predicted probability or one most likely class
    - you could train several ML models using different random states for splitting, model training, etc.
        - then you'd have multiple predictions for the same point in your test set and you can use that to calculate e.g., mean and std predicted values
    - [here](https://stanfordmlgroup.github.io/projects/ngboost/) is a brand new ML technique which returns probabilistic predictions
        - I haven't read the paper yet
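
The idea of training several models with different random states can be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic, the "model" is a straight-line fit with `np.polyfit`, and only the training subset changes with the random state. The resulting spread shows how much predictions vary across training subsets, which is not a calibrated confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: a noisy linear relationship.
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 1.0 + rng.normal(0, 2.0, size=200)
# Points for which we want a prediction and its spread.
X_test = np.array([2.0, 5.0, 8.0])

n_models = 20
preds = np.empty((n_models, X_test.size))
for seed in range(n_models):
    split_rng = np.random.default_rng(seed)
    # A different random state selects a different training subset each time.
    train_idx = split_rng.choice(X.size, size=150, replace=False)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)
    preds[seed] = slope * X_test + intercept

# One mean and one standard deviation per test point.
mean_pred = preds.mean(axis=0)
std_pred = preds.std(axis=0)
print("mean predictions:", mean_pred)
print("std of predictions:", std_pred)
```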

- **How should we choose an evaluation metric if our data is not iid?**
    - tough question
    - iid vs. non-iid usually doesn't impact which metric to choose

## Regression

- **What are the benefits of using MSE vs. MAE/RMSE and how do you decide which metrics to focus on?**
- **in what situations would MAE be more useful than MSE?**
- **For regression models I'm wondering how one might decide between MSE, RMSE, MAE and R^2 for evaluating a model?**
    - MSE is often used but its unit is not the same as the target variable's unit
        - for example, if the target variable is in dollars, the unit of MSE is dollars squared
    - this is why RMSE is sometimes preferred over MSE
    - MAE is not used very often in my experience but it is a good metric

- **Could you please explain more in detail about regression metrics? Which metrics should I use in a specific circumstance?**
    - regression metrics are much easier than classification metrics, that's why we spent less time on them
    - you normally just pick one of MSE, RMSE, MAE, or R2
        - R2 is nice when you want a normalized metric
            - a negative R2 means the model is worse than the baseline (always predicting the mean of the target)
            - an R2 of 0 is baseline performance
            - R2 is 1 if the model is perfect
        - MSE, RMSE, MAE are all fine if you want a metric with lower values indicating a better model
            - I prefer RMSE because it has the same unit as the target variable, but the same is true for MAE
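
A small worked example of the four regression metrics, using made-up true and predicted values and the usual definition R2 = 1 - MSE / (variance of the target):

```python
import numpy as np

# Made-up true values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)        # unit: target unit squared
rmse = np.sqrt(mse)               # unit: same as the target
mae = np.mean(np.abs(errors))     # unit: same as the target
r2 = 1 - mse / np.var(y_true)     # unitless, 1 is perfect

# Baseline: always predict the mean of the target, which gives R2 = 0.
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
r2_baseline = 1 - baseline_mse / np.var(y_true)

print("MSE:", mse, "RMSE:", rmse, "MAE:", mae)
print("R2:", r2, "baseline R2:", r2_baseline)
```

Whichever metric you pick, compute the same one for every model you want to compare.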

## Classification

- **I am mostly confused about the C matrix: how it works and why it is important. Also, in the ROC plot, what do the x-axis and y-axis mean?**

- **Considering you get an n x n confusion matrix when choosing an evaluation metric for classification, does this mean it's a better idea to reduce the number of unique classifiers? For instance, you have ~50 unique classifiers (e.g. US states) but can reduce that number to 4 (regions in the US). Is that good practice, in light of choosing evaluation metrics?**
    - terminology mix up!
    - a classifier is a trained classification model, I think you mean class or label
    - you could also talk about the number of categories in a categorical feature but that's not a target variable
    - sometimes you can reduce the number of classes, but I generally wouldn't recommend doing so.

- **I am interested in when precision-recall curve would be more useful than the ROC curve.**
    - when the dataset is imbalanced
    - if the dataset is imbalanced, you want to choose an evaluation metric that does not use TN from the confusion matrix
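
To see why TN matters here, consider a made-up confusion matrix for a dataset with 1% positives (the counts below are illustrative only). The false positive rate on the x-axis of the ROC curve has TN in its denominator, so it looks excellent, while precision, which ignores TN, shows that most predicted positives are wrong.

```python
# Made-up confusion matrix counts: 10,000 points, 100 of them positive.
tn, fp, fn, tp = 9800, 100, 50, 50

accuracy = (tp + tn) / (tp + tn + fp + fn)
# ROC curve axes: false positive rate (x) and true positive rate (y).
fpr = fp / (fp + tn)        # uses TN, so it is tiny on imbalanced data
tpr = tp / (tp + fn)        # also called recall
# Precision-recall curve axes: recall (x) and precision (y).
precision = tp / (tp + fp)  # does not use TN

print("accuracy:", accuracy)
print("FPR:", fpr, "TPR / recall:", tpr)
print("precision:", precision)
```

In this example only one in three predicted positives is correct, yet the false positive rate is about 1% and accuracy is above 98%.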

- **How do we interpret the F-score in a confusion matrix? And how do we interpret the value of the logloss?**
    - the f score is calculated based on the recall and precision, the closer it is to 1, the better.
    - logloss is better the closer it is to 0. 
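
To get a feel for logloss values, here is a minimal sketch of binary logloss on made-up labels. The helper `log_loss_binary` is written for illustration (sklearn provides `log_loss` in `sklearn.metrics`). Predicting 0.5 for every point gives ln(2), roughly 0.693, which is a useful reference point for a balanced binary problem.

```python
import numpy as np

def log_loss_binary(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels given P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = [1, 0, 1, 0]
chance = log_loss_binary(y_true, [0.5, 0.5, 0.5, 0.5])
confident_right = log_loss_binary(y_true, [0.9, 0.1, 0.9, 0.1])
confident_wrong = log_loss_binary(y_true, [0.1, 0.9, 0.1, 0.9])

print("always 0.5:", chance)
print("confident and right:", confident_right)
print("confident and wrong:", confident_wrong)
```

Confidently wrong predictions are penalized heavily, so logloss can be far above the chance-level value.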

- **What's the mechanic behind the f_beta score? Is it just a defined indicator? Would you please give some more examples about beta choosing?**
    - I don't know what a defined indicator is
    - the f score is the weighted harmonic mean of precision and recall, and the beta parameter tells you how much weight you give to precision vs. recall: beta < 1 favors precision, beta > 1 favors recall (see the question above on choosing a metric)
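
A minimal sketch of the standard formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with made-up precision and recall values. The helper `f_beta` is written for illustration (sklearn provides `fbeta_score` in `sklearn.metrics`, which works on labels rather than on precision and recall directly).

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up model with high precision and low recall.
precision, recall = 0.8, 0.4

f_half = f_beta(precision, recall, 0.5)  # weights precision more
f_one = f_beta(precision, recall, 1.0)   # the usual f1 score
f_two = f_beta(precision, recall, 2.0)   # weights recall more

print("F0.5:", f_half, "F1:", f_one, "F2:", f_two)
```

Because this model has better precision than recall, its score drops as beta grows and recall gets more weight.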


- **Is the critical probability that we use as a cutoff for predicting the class considered a hyper-parameter of the model? Is it something that we tune along with other model hyper-parameters in order to maximize the value of our metric?**
- **Also for the predict part, in the examples we specified p_crit but in looking at some of the documentation it doesn't seem like you get to specify any p_crit value, is there any way you can use a different value if you'd want to for some reason (would you ever want to/should you ever)?**
    - that's a tough question because the critical probability is not a hyper-parameter of any ML algorithm
    - you can tune it but it requires additional coding
    - I usually tune the critical probability out of desperation when my model is not predictive enough with the nominal 50% critical probability :) 
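
The additional coding can look roughly like this. It is a sketch under assumptions: `y_val` and `p_val` are made-up validation labels and predicted probabilities of class 1 (in practice `p_val` would come from the second column of `.predict_proba`), the metric is the f1 score, and the helper `f1_at_threshold` is written for illustration. Tune the critical probability on the validation set, not on the test set.

```python
import numpy as np

def f1_at_threshold(y_true, p_pred, p_crit):
    """f1 score when points with p_pred >= p_crit are predicted as class 1."""
    y_hat = (p_pred >= p_crit).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

# Made-up validation labels and predicted probabilities of class 1.
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.03, 0.08, 0.12, 0.22, 0.27, 0.31, 0.37, 0.42, 0.47, 0.83])

# Try critical probabilities from 0.05 to 0.95 in steps of 0.05.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_at_threshold(y_val, p_val, t) for t in thresholds]
best_p_crit = thresholds[int(np.argmax(scores))]

print("f1 at p_crit = 0.5:", f1_at_threshold(y_val, p_val, 0.5))
print("best p_crit:", best_p_crit, "with f1:", max(scores))
```

With these made-up numbers, lowering the critical probability below 50% improves the f1 score because the model rarely assigns high probabilities to the positive class.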

- **The .predict_proba and .predict methods come from the specific sklearn classifier but are methods that all the classifiers have. I am guessing that the predict methods all do the same thing using the probabilities but the underlying process for predict_proba would be different for each classifier?**
    - that is exactly right!
    - we will cover ML algorithms during the next two weeks so you'll see how different algorithms calculate probabilities.

- **The algorithm that provides every wrong answer is actually good right? You can just make conclusion opposite to what it predicts.**
    - lol

- **I am still confused about quiz 4, would you mind going over it during class?**

- **Would love to go over quiz 3! Confused what we classified as true positive, when it wasn't precisely binary (0,1,2 as options).**