# DATS Daily Quiz Session One: ML-Basics

[Quiz Source](https://www.1point3acres.com/bbs/thread-713903-1-1.html)

## Day 1, 03-25-2021

**Question:** Explain the concepts of *Overfitting* and *Underfitting*

**Ans from Web:** 
* *Overfitting:* Overfitting refers to a model that models the training data too well. Overfitting happens when a model **learns the detail and noise in the training data** to the extent that it negatively impacts the performance of the model on new data (test set). This means that the **noise or random fluctuations in the training data are picked up and learned as concepts by the model.** The problem is that **these concepts DO NOT apply to new data** and negatively impact the model's **ability to generalize.** Overfitting is more likely with nonparametric and nonlinear models that **have more flexibility** when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to **limit and constrain how much detail the model learns.** For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by **pruning a tree** after it has learned in order to remove some of the detail it has picked up.
* *Underfitting:* Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

**In a Nutshell:**
* *Overfitting:* The resulting model performs much better on the training set than on the test set, because it learns ungeneralizable details and noise as "concepts" that belong to the training set only. Flexible models, e.g. non-parametric models like decision trees, are usually prone to this problem.
* *Underfitting:* The resulting model performs poorly on both the training set and the test set, because it is too simple (or too constrained) to capture the underlying pattern in the data.
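
The contrast can be made concrete with a small experiment. The sketch below is only an illustration under assumed settings: the data come from a made-up curve ($y = \sin(x)$ plus noise), and a k-nearest-neighbour regressor stands in for "a model", with `k` controlling flexibility (small `k` is very flexible, `k` equal to the training set size always predicts the global mean). All function names are hypothetical.

```python
import math
import random


def make_data(n, rng, noise=0.3):
    """Sample n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN regressor on the points (xs, ys)."""
    errors = [(knn_predict(x, train_x, train_y, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

results = {}
for k in (1, 15, 200):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(test_x, test_y, train_x, train_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With `k=1` the model memorizes the training set (zero training error) but does worse on new data, which is the overfitting pattern; with `k=200` it is poor on both sets, which is the underfitting pattern.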


## Day 2, 03-26-2021

**Question:** Explain the notion of *Variance-Bias Trade-off*

**Ans from Web:** 

Please refer to the [wiki](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).

**In a Nutshell:** 

The *Variance-Bias Trade-off* often refers to the *Model Complexity-Generalizability Trade-off*. Intuitively, complex or **High Variance** models, such as neural network families, tend to overfit the training data, while simple or **High Bias** models, like linear regression, tend to underfit it. Conceptually, for squared loss, we have:

$$E[Loss] = Bias[\hat{f}]^2 + Var[\hat{f}] + Var[\epsilon]\ (irreducible\ noise)$$
    
So bias and variance usually pull against each other: reducing one tends to increase the other. Note that the "complexity" here does not necessarily mean the number of parameters or the complexity of the model architecture. Rather, we'd better think of (1) the amount of assumptions we impose on our algorithm and (2) how well our assumptions fit the actual situation. Therefore, during the modeling process, we seek "just the right amount of complexity", trying to minimize bias and variance at the same time as a dual objective problem.
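
The decomposition can be checked numerically. The following is a minimal sketch, assuming a made-up ground truth $f(x) = \sin(x)$ and a k-nearest-neighbour regressor as the model family; `bias_variance` and its parameters are hypothetical names. It refits the model on many resampled training sets and measures the squared bias and the variance of the prediction at a single point.

```python
import math
import random


def true_f(x):
    """Assumed ground-truth function for this illustration."""
    return math.sin(x)


def knn_at(x0, xs, ys, k):
    """k-NN regression prediction at the single query point x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance(k, x0=math.pi / 2, n=100, reps=300, noise=0.3, seed=0):
    """Estimate bias^2, variance and mean squared error (against f) at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(reps):
        # Draw a fresh training set and refit (k-NN needs no explicit fitting).
        xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
        preds.append(knn_at(x0, xs, ys, k))
    mean_pred = sum(preds) / reps
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / reps
    mse = sum((p - true_f(x0)) ** 2 for p in preds) / reps
    return bias_sq, variance, mse


stats = {k: bias_variance(k) for k in (1, 50)}
for k, (bias_sq, variance, mse) in stats.items():
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  mse={mse:.4f}")
```

The flexible model (`k=1`) shows low bias and high variance, the rigid one (`k=50`) the opposite, and in both cases the error against the true function equals squared bias plus variance. The irreducible noise term does not appear here because the error is measured against $f$ rather than against noisy targets.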


## Day 3, 03-27-2021

**Question:** How to prevent *Overfitting* problems?

**Ans from Web:** 

Can be found [here](https://www.zhihu.com/question/59201590/answer/167392763).

Regularization can be found [here](https://blog.csdn.net/jinping_shi/article/details/52433975).

**In a Nutshell:** 

Usually we can prevent overfitting through 3 kinds of ideas: 

* Increase the amount of data. As long as data is given to allow the model to "see" as many "exceptions" as possible, it will continue to modify itself to get better results. 

* Regularization. Regularization adds a penalty on the size of the parameters to the loss function: the more parameters there are, or the larger they are, the larger the loss. This punishes the model for growing its parameters, so it cannot increase them recklessly, which reduces overfitting. For $L_2$ regularization, the gradient of the penalty term causes the parameters to be multiplied by a factor smaller than 1 at every iteration (weight decay), so the parameters are steadily shrunk toward zero.

* Model Ensemble. For a convex loss such as squared error, the expected error of one model picked at random from N models is at least as large as the error of the averaged output of all N models. Therefore, we can consider ensemble methods such as bagging (which mainly reduces variance) and boosting to help reduce overfitting.
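
The "multiplied by a factor less than 1" behaviour of $L_2$ regularization can be sketched for a one-weight linear model. This is an illustration under assumed conventions: the objective is $\frac{1}{n}\sum_i (w x_i - y_i)^2 + \lambda w^2$, the data are synthetic, and `fit_gd` and `ridge_closed_form` are hypothetical helper names. Other scalings of the penalty change the constants but not the idea.

```python
import random


def fit_gd(xs, ys, lam, lr=0.1, steps=500):
    """Gradient descent on mean((w*x - y)^2) + lam * w^2 for a single weight w."""
    n = len(xs)
    w = 0.0
    for _ in range(steps):
        data_grad = 2.0 / n * sum((w * x - y) * x for x, y in zip(xs, ys))
        # The L2 term contributes 2*lam*w to the gradient, so each update first
        # multiplies w by (1 - 2*lr*lam) < 1 and then follows the data gradient.
        w = (1.0 - 2.0 * lr * lam) * w - lr * data_grad
    return w


def ridge_closed_form(xs, ys, lam):
    """Exact minimiser of the same objective, used as a cross-check."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # synthetic data, true slope 3

weights = {lam: fit_gd(xs, ys, lam) for lam in (0.0, 0.5)}
for lam, w in weights.items():
    print(f"lambda={lam}: w={w:.4f}  (closed form {ridge_closed_form(xs, ys, lam):.4f})")
```

The penalised weight ends up noticeably smaller in magnitude than the unpenalised one, which is the shrinkage effect described in the regularization bullet.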

## Day 4, 03-28-2021

**N/A**

## Day 5, 03-29-2021

**Question:** What are the differences between *Generative* and *Discriminative* models?

**Ans from Web:** 

This is actually a very in-depth topic, which has a lot to do with Learning Theory. For further reading, please refer to a blog [here](https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3) and discussions [here](https://stats.stackexchange.com/questions/12421/generative-vs-discriminative#:~:text=3%3A%20Generative%20models%20often%20outperform,your%20model%20that%20prevent%20overfitting.)

**In a Nutshell:** 

There are two modeling paradigms for solving classification problems, namely *Generative* and *Discriminative*. Conceptually, discriminative models try to find the decision boundary that separates one class from another. On the other hand, generative models try to capture the "characteristic" of a given class, *i.e.* the joint distribution of *target* and *features*. Mathematically, discriminative ones model $p(y\ |\ x)$ directly while their generative counterparts model $p(x\ |\ y)\times p(y)$, or $p(x,\ y)$. So one can see that **the generative models are modeling a more general problem** from which we can derive the corresponding discriminative models by applying Bayes' rule:

$$p(y\ |\ x) = \frac{p(x,\ y)}{p(x)}$$

One typical discriminative model for binary classification is **Logistic Regression**, and its generative counterpart is the **(Naive) Bayes Classifier**. One may see the connection between these two models by taking the logarithm of the posterior odds (the derivation is not shown here). A good summary of the pros and cons of these two models is shown as follows:

![pros-and-cons](fig/day-5-gen-dis-diff.png)
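
The connection between the two paradigms can be checked numerically in a simple special case. The sketch below assumes one feature, Gaussian class-conditional densities with a shared standard deviation, and made-up parameter values; the function names are hypothetical. It computes $p(y=1\ |\ x)$ once through Bayes' rule from the generative model, and once through the logistic form that results from taking the log of the posterior odds.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_generative(x, mu0, mu1, sigma, prior1):
    """p(y=1 | x) via Bayes' rule from p(x | y) and p(y)."""
    joint1 = gaussian_pdf(x, mu1, sigma) * prior1
    joint0 = gaussian_pdf(x, mu0, sigma) * (1.0 - prior1)
    return joint1 / (joint1 + joint0)


def posterior_logistic(x, mu0, mu1, sigma, prior1):
    """The same posterior written as sigmoid(a*x + b), the logistic regression form."""
    a = (mu1 - mu0) / sigma ** 2
    b = (mu0 ** 2 - mu1 ** 2) / (2.0 * sigma ** 2) + math.log(prior1 / (1.0 - prior1))
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


# Assumed, purely illustrative parameters.
params = dict(mu0=0.0, mu1=2.0, sigma=1.0, prior1=0.3)
for x in (-1.0, 0.0, 1.0, 2.0, 3.0):
    gen = posterior_generative(x, **params)
    dis = posterior_logistic(x, **params)
    print(f"x={x:+.1f}  Bayes rule: {gen:.6f}  logistic form: {dis:.6f}")
```

The two columns agree, so under these assumptions the generative model implies a posterior of exactly the shape that logistic regression fits directly. The difference is in what gets estimated: the generative route estimates means, variance and prior, while the discriminative route estimates only `a` and `b`.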



## Day 6, 03-30-2021

**Question:** Given a set of ground truths and two models, how can you be confident that *one model is better than another?*

**Ans from Web:** 

A collection of performance comparison tests can be found [here](https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html).

**In a Nutshell:** 

When comparing the performance of two models, e.g. in terms of classification accuracy, it is not sufficient to rely on a **point estimate**. Instead, we'd better compare the two models in terms of their **performance distributions**. In a nutshell, we should perform a statistical test to determine whether the performances of the two models are **statistically different**. One of the most basic methods for this kind of comparison is the *(Student's) T-Test*. Roughly speaking, instead of performing one single evaluation on the entire validation set, we first partition the dataset into many folds and evaluate the models on each fold separately. In this way, we can estimate the "spread" of the two models' performances. Then, following the procedure of a two-sample t-test, one may conclude whether the performances are truly different. Besides the t-test, there are many other tests we can perform to obtain a more comprehensive evaluation along multiple dimensions. For instance, what should we do if we have multiple evaluation metrics? Please read the link provided above for more information.
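
As a rough sketch of the procedure, the snippet below runs the paired variant of the t-test, which is a natural choice when both models are scored on the same folds (a swapped-in detail, since the text above only says "two-sample"). The per-fold accuracies are invented for illustration, and `paired_t_statistic` is a hypothetical helper. The critical value 2.262 is the two-sided 5% quantile of Student's t with 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical per-fold accuracies of two models evaluated on the same 10 folds.
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.81]
scores_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.78, 0.81, 0.79]


def paired_t_statistic(a, b):
    """t statistic of the per-fold differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


t_stat = paired_t_statistic(scores_a, scores_b)
T_CRITICAL = 2.262  # two-sided, alpha = 0.05, 9 degrees of freedom
significant = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, significant at the 5% level: {significant}")
```

Note that scores from cross-validation folds are not fully independent, because the training sets overlap, so this plain test can be optimistic; it is meant only to show the mechanics.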

## Day 7, 03-31-2021

**Question:** $L_1$ vs $L_2$, which one is which, and differences.

**Ans from Web:** 

Mathematical principles can be found [here](https://blog.csdn.net/red_stone1/article/details/80755144?utm_source=app&app_version=4.5.7).

**In a Nutshell:** 

The difference between $L_1$ and $L_2$ derives from their penalties. For $L_1$, the penalty is the sum of the absolute values of the coefficients, so in two dimensions its constraint region is a **square** (rotated so that its corners lie on the axes). The contour lines of the loss tend to touch this region first at one of its corners, where some parameters are exactly 0, which yields a **sparse** coefficient vector. So $L_1$ can be used for feature selection. For $L_2$, the penalty is the sum of the squares of the coefficients, so its constraint region is a **circle**. The contour lines of the loss usually touch the circle at a point where no coordinate is zero, so the parameters are shrunk but rarely become exactly 0. In both cases, the strength of the penalty is controlled by the parameter $\lambda$, which is adjusted to prevent overfitting.
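
The same contrast shows up algebraically in the simplest possible setting. The sketch below assumes a single coefficient whose unpenalised estimate is `z`, so that each penalised problem has a closed-form minimiser; `l1_shrink` and `l2_shrink` are hypothetical names and the input values are made up.

```python
def l1_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def l2_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)


unpenalised = [2.0, -1.5, 0.3, -0.2, 0.05]  # illustrative unpenalised estimates
lam = 0.5
l1_solution = [l1_shrink(z, lam) for z in unpenalised]
l2_solution = [l2_shrink(z, lam) for z in unpenalised]
print("L1:", l1_solution)
print("L2:", [round(w, 4) for w in l2_solution])
```

Under $L_1$, every coefficient with magnitude below `lam` lands exactly on zero, while $L_2$ scales all coefficients by the same factor and leaves none of them at zero.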

## Day 8, 04-01-2021

**Question:** Please explain and derive Lasso regression and Ridge regression, and answer why the solution produced by $L_1$ is more sparse than that of $L_2$.

**Ans from Web:** 

A great article answering this question can be found [here](https://bjlkeng.github.io/posts/probabilistic-interpretation-of-regularization/). One can have a pretty good understanding of this question after reading the derivations. So here I will just give a brief summary.

**In a Nutshell:** 

In a word: **assumption of the coefficients' prior distribution**. From a *Bayesian Curve-fitting* point of view, the variance in the observed data stems from the variance of the coefficients. Without any assumption about the distribution of the regression coefficients, say $\theta$, we model the regression problem as:

$$p(y\ |\ \mu,\ \sigma) = p(y\ |\ \theta^Tx,\ \sigma)$$

where $\sigma$ denotes the strength of the "noise". The parameters are then estimated via *Maximum Likelihood Estimation*. With a prior assumption about the parameters $\theta$, we can think of the model fitting procedure as **updating our belief about $\theta$ after we have "witnessed" the dataset $(y,\ x)$**. That is, we maximize:

$$p(\theta\ |\ y) = \frac{p(y\ |\ \theta)p(\theta)}{p(y)}$$

And the difference between $L_1$ and $L_2$ is the direct consequence of choosing different assumptions about $p(\theta)$:

$$L_1 \Longleftrightarrow \theta\sim Laplace(0,\ b)$$

$$L_2 \Longleftrightarrow \theta\sim \mathcal{N}(0,\ \sigma_{\theta})$$

This also explains why imposing an $L_1$ penalty results in sparse solutions: the density of a Laplace distribution is sharply peaked at 0, meaning that we have a strong prior assumption that the coefficients are very likely to be 0.
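
For completeness, here is a sketch of how the two penalties fall out of the posterior. It assumes i.i.d. Gaussian noise with standard deviation $\sigma$, samples indexed by $i = 1, \dots, n$, independent priors on each coefficient $\theta_j$, and reads $\sigma_{\theta}$ as a standard deviation; these conventions are assumptions, not taken from the linked article. Taking the negative logarithm of the posterior and dropping terms that do not depend on $\theta$:

$$-\log p(\theta\ |\ y) = -\log p(y\ |\ \theta) - \log p(\theta) + const$$

$$-\log p(y\ |\ \theta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \theta^Tx_i)^2 + const$$

$$\theta_j\sim \mathcal{N}(0,\ \sigma_{\theta}):\quad -\log p(\theta) = \frac{1}{2\sigma_{\theta}^2}\sum_j\theta_j^2 + const\ \Longrightarrow\ \hat{\theta}_{Ridge} = \arg\min_{\theta}\ \sum_{i=1}^{n}(y_i - \theta^Tx_i)^2 + \frac{\sigma^2}{\sigma_{\theta}^2}\sum_j\theta_j^2$$

$$\theta_j\sim Laplace(0,\ b):\quad -\log p(\theta) = \frac{1}{b}\sum_j|\theta_j| + const\ \Longrightarrow\ \hat{\theta}_{Lasso} = \arg\min_{\theta}\ \sum_{i=1}^{n}(y_i - \theta^Tx_i)^2 + \frac{2\sigma^2}{b}\sum_j|\theta_j|$$

So under these assumptions the regularization strength is $\lambda = \sigma^2/\sigma_{\theta}^2$ for Ridge and $\lambda = 2\sigma^2/b$ for Lasso: a tighter prior (smaller $\sigma_{\theta}$ or $b$) means a stronger penalty.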

## Day 9, 04-02-2021

**Question:** Why not $L_3$, $L_4$ ...?

**Ans from Web:** 

Something useful about L1, L2 and L3 can be found [here](https://blog.csdn.net/nymph_h/article/details/95068873?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-2.control&dist_request_id=&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-2.control). More general information can be found in [wiki](https://en.wikipedia.org/wiki/Lp_space).

**In a Nutshell:** 

For the $L_p$ norm, when $0<p<1$ it does not satisfy the triangle inequality $\|u+v\|_p \le \|u\|_p + \|v\|_p$, so it cannot be regarded as a norm. When $p=0$, $L_0$ counts the number of non-zero entries in a vector. $L_1$ can be used for feature selection and $L_2$ can be used to prevent overfitting, as we discussed before. So as far as we are concerned, $L_1$ and $L_2$ satisfy most of our needs in model optimization. Besides, they are much simpler to compute than $L_3$, $L_4$, ...
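
A two-dimensional counterexample makes the failure for $0<p<1$ concrete. This is a small sketch with hypothetical helper names `lp` and `l0`; the vectors are chosen by hand.

```python
def lp(vector, p):
    """(sum_i |v_i|^p)^(1/p); this is a true norm only when p >= 1."""
    return sum(abs(v) ** p for v in vector) ** (1.0 / p)


def l0(vector):
    """Number of non-zero entries (the so-called L0 'norm')."""
    return sum(1 for v in vector if v != 0)


u, v = [1.0, 0.0], [0.0, 1.0]
u_plus_v = [a + b for a, b in zip(u, v)]

for p in (0.5, 1, 2, 3):
    lhs = lp(u_plus_v, p)
    rhs = lp(u, p) + lp(v, p)
    print(f"p={p}: ||u+v|| = {lhs:.3f}, ||u|| + ||v|| = {rhs:.3f}, triangle inequality holds: {lhs <= rhs}")

print("non-zero entries in [0, 3, 0, -1]:", l0([0.0, 3.0, 0.0, -1.0]))
```

For `p = 0.5` the left-hand side exceeds the right-hand side, so the triangle inequality fails, while it holds for `p = 1, 2, 3`.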

## Day 10, 04-03-2021

**Question:** Explain the notion of *Precision* and *Recall*, as well as the trade-off between them.

**Ans from Web:** 

This is actually one of a series of [ML Courses](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall). Here is the [wiki](https://en.wikipedia.org/wiki/Precision_and_recall).

**In a Nutshell:**

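* Precision: Of all the samples your model flags as "bad guys" (predicted positive), how many are **Actually Bad Guys**? That is, $Precision = TP / (TP + FP)$.

* Recall: Of all the actual "bad guys" (true positives plus false negatives), how many are identified by your model? That is, $Recall = TP / (TP + FN)$.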

When talking about the trade-off between these two measurements, we need to discuss it with regard to **one specific model** and **adjust the classification threshold**. Intuitively speaking, if the threshold is high, then we have a "conservative" model. Only samples with strong evidence of being positive are classified as positive, which results in **High Precision** and **Low Recall**. On the other hand, if we classify all samples as positive, as a result of an extremely low threshold, then of course we can pick out all the "bad guys", i.e. **High Recall**, but obviously the **Precision is Very Poor**.
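
The effect of moving the threshold can be seen on a toy example. The scores and labels below are invented for illustration, a sample is predicted positive when its score is at least the threshold, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when samples with score >= threshold are flagged positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Precision is undefined when nothing is flagged; report NaN in that case.
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical model scores and true labels (1 = "bad guy").
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for threshold in (0.0, 0.5, 0.85):
    precision, recall = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

As the threshold rises, precision goes up and recall goes down on this data; at threshold 0 everything is flagged, so recall is perfect and precision equals the share of positives.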

## Day 10, 04-04-2021

**Question:** Which metrics will you use when labels are imbalanced?

**Ans from Web:** 
General information about 24 evaluation metrics can be found [here](https://neptune.ai/blog/evaluation-metrics-binary-classification).

**In a Nutshell:**

When positive samples account for only 1% of all samples, a model that simply predicts every sample as negative reaches 99% accuracy, yet this model is clearly not good.
In this case, you can consider using **precision** and **recall** to evaluate the model. For example, because TP is 0, the recall is 0.
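
The arithmetic behind this example, as a short sketch on a synthetic label vector (1,000 samples, 1% positive) and a "model" that always answers negative:

```python
# Synthetic, purely illustrative labels: 10 positives out of 1,000 samples.
labels = [1] * 10 + [0] * 990
predictions = [0] * len(labels)  # a degenerate model that predicts negative for everyone

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

Accuracy looks excellent while recall shows that not a single positive sample was found.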

More generally, we can consider using **ROC** and **KS** to evaluate the classification model:

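ROC is a curve with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis, traced out as the classification threshold varies. The area under the curve (a.k.a. AUC) can be used to evaluate the classifier.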

KS is the maximum difference between TPR and FPR over all classification thresholds. At the threshold where this maximum is reached, the model separates positive samples from negative samples most strongly.

## Day 11, 05-04-2021

**Question:** Explain AUC(the probability of ranking a randomly selected positive sample higher blabla...)

Let:

* $f(x)$: Positive Class (bad guys) Score Density Function
* $g(x)$: Negative Class (good guys) Score Density Function
* $t$: Classification Threshold (a sample with score $x \le t$ is classified as positive)

Assuming equal class priors (otherwise $f$ and $g$ are weighted by the class proportions):

$$Precision = \frac{\int_{-\infty}^{t}f(x)\ dx}{\int_{-\infty}^{t}\left(f(x) + g(x)\right)\ dx}$$

$$TPR\ (True\ Positive\ Rate) = \mathcal{F}(t) = \int_{-\infty}^{t}f(x)\ dx$$

$$FPR\ (False\ Positive\ Rate) = \mathcal{G}(t) = \int_{-\infty}^{t}g(x)\ dx$$

$$ROC: \mathbb{R} \longrightarrow \mathbb{R}^2,\ ROC(t) = (FPR,\ TPR) = (\mathcal{G}(t),\ \mathcal{F}(t))$$

$$KS = \max_{t} \int_{-\infty}^{t} \left(f(x) - g(x)\right)\ dx = \max_{t}\ \left(\mathcal{F}(t) - \mathcal{G}(t)\right)$$
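
To connect these quantities to the ranking interpretation in the question, here is a small sketch on invented scores. It follows the convention of the integrals above, which run from $-\infty$ to $t$, so lower scores indicate the positive class and a sample is flagged when its score is at most the threshold. Under that convention AUC is the probability that a randomly chosen positive sample scores lower than a randomly chosen negative one. All function names are hypothetical.

```python
def rates_at(threshold, pos, neg):
    """TPR and FPR when every score <= threshold is flagged as positive."""
    tpr = sum(1 for s in pos if s <= threshold) / len(pos)
    fpr = sum(1 for s in neg if s <= threshold) / len(neg)
    return tpr, fpr


def roc_points(pos, neg):
    """ROC curve as a list of (FPR, TPR) points, sweeping the threshold upward."""
    points = [(0.0, 0.0)]
    for t in sorted(set(pos + neg)):
        tpr, fpr = rates_at(t, pos, neg)
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_ranking(pos, neg):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(pos, neg):
    """Largest gap TPR - FPR over all thresholds."""
    return max(tpr - fpr for fpr, tpr in roc_points(pos, neg))


# Hypothetical scores: positives ("bad guys") tend to score low.
pos_scores = [0.1, 0.2, 0.35, 0.5]
neg_scores = [0.3, 0.6, 0.7, 0.9]

print("AUC (area under ROC):", auc_trapezoid(roc_points(pos_scores, neg_scores)))
print("AUC (ranking view):  ", auc_ranking(pos_scores, neg_scores))
print("KS:                  ", ks_statistic(pos_scores, neg_scores))
```

Both ways of computing AUC give the same number on this data, which is the point of the ranking interpretation: the area under the ROC curve is a statement about how well the scores order positives relative to negatives, independent of any single threshold.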

## Day 11, 06-04-2021

**Question:** Explain Log-loss and when should we use it?

**Ans from Web:** 
Information about KL divergence and cross entropy can be found [here](https://blog.csdn.net/b1055077005/article/details/100152102).

**In a Nutshell:**

KL divergence is used to measure the difference between the true probability distribution and the probability distribution predicted by the model. The log-loss (cross entropy) is equal to the KL divergence plus the entropy of the true distribution, which is a constant that does not depend on the model. Therefore, when optimizing the model, we can convert the target from min(KL divergence) to min(logloss).

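In Shannon's framework, unlikely events carry a large amount of information (surprise), and the information of independent events adds up when they are combined, which is why the measure is a logarithm. Therefore, in classification problems such as binary classification, the shape of the log-loss means that confident mistakes are punished strongly: if a "bad guy" is given a very low predicted probability of being bad and thus goes unrecognized, the loss $-\log(p)$ becomes very large. Log-loss is thus a natural choice when the model outputs probabilities and we care about how well calibrated they are.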
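
Both statements can be verified numerically. The sketch below uses two made-up discrete distributions and hypothetical helper names; it checks that cross entropy equals KL divergence plus the entropy of the true distribution, and shows how the per-sample log-loss grows as the predicted probability for the true class shrinks.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy H(p, q) in nats; p is the true distribution, q the model's."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def log_loss_single(y, p):
    """Binary log-loss of one sample with true label y and predicted P(y=1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


true_dist = [0.7, 0.2, 0.1]   # illustrative "true" distribution
model_dist = [0.5, 0.3, 0.2]  # illustrative model prediction

ce = cross_entropy(true_dist, model_dist)
kl = kl_divergence(true_dist, model_dist)
h = entropy(true_dist)
print(f"cross entropy = {ce:.6f}, KL + entropy = {kl + h:.6f}")

# A true positive (y = 1) that the model scores with decreasing confidence.
for p in (0.9, 0.4, 0.01):
    print(f"predicted p = {p:<4}  log-loss = {log_loss_single(1, p):.3f}")
```

Since the entropy term does not involve the model, minimizing the cross entropy and minimizing the KL divergence lead to the same model.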