# **`Introduction to Machine Learning-2`**

`Q1: Define overfitting and underfitting in machine learning. What are the consequences of each, and how
can they be mitigated?`

`Overfitting`
Overfitting occurs when our machine learning model tries to fit every data point in the given dataset, including more detail than the underlying pattern warrants. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, which reduces its accuracy on new data. An overfitted model has low bias and high variance.

The chance of overfitting tends to increase the longer we train our model: the more we train it, the more likely it is to become overfitted.

Overfitting is one of the main problems that occur in supervised learning.

**How to avoid overfitting in a model**:
- Cross-Validation
- Training with more data
- Removing features
- Early stopping the training
- Regularization
- Ensembling

`Underfitting`
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. It can happen, for example, when training is stopped too early in an attempt to avoid overfitting, so that the model does not learn enough from the training data. As a result, it may fail to find the best fit of the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data, and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

**How to avoid underfitting**:
- By increasing the training time of the model.
- By increasing the number of features.

`Q2: How can we reduce overfitting? Explain in brief.`

Overfitting occurs when the model performs well on training data but generalizes poorly to unseen data.

Simple Techniques to Prevent Overfitting:

1. `Hold-out (data)`
Rather than using all of our data for training, we can simply split our dataset into two sets: training and testing. A common split ratio is 80% for training and 20% for testing. We train our model until it performs well not only on the training set but also on the testing set. This indicates good generalization capability, since the testing set represents unseen data that was not used for training. However, this approach requires a dataset that is still large enough to train on after splitting.
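
The 80/20 split described above can be sketched in a few lines. This is only an illustration: the helper name `hold_out_split`, the `seed` parameter, and the stand-in data are assumptions made for the example, not part of any particular library.

```python
import random

def hold_out_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # shuffle so the split is not order-dependent
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))          # stand-in for 100 labelled examples
train, test = hold_out_split(data)
print(len(train), len(test))     # 80 20
```

The test portion is set aside once and never used for fitting, which is what lets it stand in for unseen data.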

2. `Cross-validation (data)`
We can split our dataset into k groups (k-fold cross-validation). We let one of the groups be the testing set (see the hold-out explanation) and use the others as the training set, repeating this process until each group has been used as the testing set (i.e., k repeats). Unlike hold-out, cross-validation allows all data to eventually be used for training, but it is also more computationally expensive than hold-out.

![image.png](attachment:image.png)
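
As a rough sketch of how the k folds could be generated, the hypothetical helper `k_fold_indices` below (a name chosen for this example) yields the index sets for each repeat. It assumes the samples have already been shuffled; in practice one would usually shuffle first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
```

Each sample lands in the testing set exactly once, and a model would be trained k times, once per fold.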

3. `Data augmentation (data)`
A larger dataset would reduce overfitting. If we cannot gather more data and are constrained to the data we have in our current dataset, we can apply data augmentation to artificially increase the size of our dataset. For example, if we are training for an image classification task, we can perform various image transformations to our image dataset (e.g., flipping, rotating, rescaling, shifting).

![image.png](attachment:image-2.png)
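
A minimal sketch of such transformations, assuming NumPy is available and that an image is represented as a 2-D array. The `augment` helper and the toy 3x3 array are illustrative assumptions; real pipelines typically apply random transformations on the fly.

```python
import numpy as np

def augment(image):
    """Return simple geometric variants of a 2-D image array."""
    return [
        np.fliplr(image),             # horizontal flip
        np.flipud(image),             # vertical flip
        np.rot90(image),              # rotate 90 degrees counter-clockwise
        np.roll(image, 1, axis=1),    # shift one pixel right (wraps around)
    ]

image = np.arange(9).reshape(3, 3)    # toy 3x3 "image"
augmented = augment(image)
print(len(augmented), "new samples from one original")
```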

4. `Feature selection (data)`
If we have only a limited number of training samples, each with a large number of features, we should select only the most important features for training, so that our model doesn’t have to learn from so many features and eventually overfit. We can simply test out different features, train individual models for these features, and evaluate their generalization capabilities, or use one of the various widely used feature selection methods.

![image.png](attachment:image-3.png)
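
One simple filter-style approach, shown here only as an example, is to score each feature by the absolute value of its correlation with the target and keep the top few. The synthetic data below is an assumption for illustration: only features 0 and 2 actually influence `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                          # four candidate features
noise = rng.normal(scale=0.5, size=n)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + noise            # only features 0 and 2 matter

# Score each feature by |Pearson correlation| with the target.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top_two = sorted(np.argsort(scores)[-2:].tolist())
print("scores:", np.round(scores, 2))
print("selected features:", top_two)
```

Correlation only detects linear, one-feature-at-a-time relationships, so this is a starting point rather than a complete selection method.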

5. `L1 / L2 regularization (learning algorithm)`
Regularization is a technique to constrain our network from learning a model that is too complex and may therefore overfit. In L1 or L2 regularization, we add a penalty term to the cost function that pushes the estimated coefficients towards zero (so that they do not take extreme values). L2 regularization lets weights decay towards zero but not exactly to zero, while L1 regularization can drive weights exactly to zero.

![image.png](attachment:image-4.png)
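
To make the effect of the penalty concrete, here is a small sketch for the L2 case using linear regression, where the penalised cost has a closed-form minimiser. The function name `ridge_fit`, the penalty strength `lam`, and the synthetic data are assumptions for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimise ||Xw - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([4.0, -3.0, 0.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=30)

norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)   # the weight norm shrinks as lam grows
```

With `lam = 0` this reduces to ordinary least squares; larger values trade some training fit for smaller weights.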

6. `Remove layers / number of units per layer (model)`
As mentioned under L1 / L2 regularization, an over-complex model is more likely to overfit. Therefore, we can directly reduce the model’s complexity by removing layers and reducing the size of our model. We may further reduce complexity by decreasing the number of neurons in the fully-connected layers. We should aim for a model whose complexity balances underfitting and overfitting for our task.

![image.png](attachment:image-5.png)

7. `Dropout (model)`
By applying dropout, which is a form of regularization, to our layers, we ignore a random subset of the units of our network with a set probability during training. Using dropout, we can reduce interdependent learning among units, which can otherwise lead to overfitting. However, with dropout, we typically need more epochs for our model to converge.

![image.png](attachment:image-6.png)
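
The mechanism can be sketched as follows. This uses the common "inverted dropout" variant, in which the kept activations are rescaled during training so that nothing needs to change at evaluation time; that choice, and the helper name `dropout`, are assumptions of this example rather than something stated above.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations                              # no-op at evaluation time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob    # True = keep the unit
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((2, 8))                                     # toy layer activations
h_train = dropout(h, drop_prob=0.5, rng=rng)
h_eval = dropout(h, drop_prob=0.5, rng=rng, training=False)
print(h_train)   # entries are either 0.0 or 2.0
```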

8. `Early stopping (model)`
We can first train our model for an arbitrarily large number of epochs and plot the validation loss graph (e.g., using hold-out). Once the validation loss begins to degrade (i.e., stops decreasing and begins increasing), we stop the training and keep the model saved at the point of lowest validation loss. We can implement this either by monitoring the loss graph or by setting an early stopping trigger. The saved model is then the best model for generalization among the training epochs tried.

![image.png](attachment:image-7.png)
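
An early stopping trigger could look roughly like the sketch below. The `patience` parameter (how many non-improving epochs to tolerate) and the list of validation losses are hypothetical values chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the model from best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = stop_epoch = 0
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch

# Hypothetical validation losses: improving, then degrading from epoch 4 on.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses)
print(best_epoch, stop_epoch)   # 3 5
```

Here training would halt at epoch 5, and the weights saved at epoch 3 would be restored.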

`Q3: Explain underfitting. List scenarios where underfitting can occur in ML.`

`Underfitting` : A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly on the training data as well as on the testing data. (It’s just like trying to fit into undersized pants!) Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or the algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases, the rules of the machine learning model are too simple and rigid to describe the data, and therefore the model will probably make a lot of wrong predictions. Underfitting can be reduced by using more data, adding relevant features, or increasing the complexity of the model.

In a nutshell, underfitting refers to a model that can neither perform well on the training data nor generalize to new data.

`Reasons for Underfitting`:

* The model has high bias and low variance.
* The training dataset is too small.
* The model is too simple.
* The training data has not been cleaned and contains noise.

`Q4: Explain the bias-variance tradeoff in machine learning. What is the relationship between bias and
variance, and how do they affect model performance?`

`Bias-Variance Trade-Off`

While building a machine learning model, it is important to take care of bias and variance in order to avoid overfitting and underfitting. If the model is very simple with few parameters, it may have low variance and high bias, whereas if the model has a large number of parameters, it will tend to have high variance and low bias. So, we need to strike a balance between the bias and variance errors, and this balance is known as the Bias-Variance trade-off.

![image.png](attachment:image.png)

For accurate predictions, a model needs both low variance and low bias. In practice this is hard to achieve, because bias and variance are typically in tension with each other:

* If we decrease the variance, the bias tends to increase.
* If we decrease the bias, the variance tends to increase.

Bias-Variance trade-off is a central issue in supervised learning. Ideally, we need a model that accurately captures the regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, it is typically not possible to do both perfectly: a high-variance algorithm may perform well on training data but overfit to noise, whereas a high-bias algorithm produces a much simpler model that may not even capture important regularities in the data. So, we need to find a sweet spot between bias and variance to make an optimal model.

Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance errors.
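
For squared-error loss, the trade-off above can be written as a decomposition of the expected prediction error at a point $x$. This is the standard textbook form, stated under assumptions that are not spelled out above: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Only the first two terms depend on the model, which is why lowering one at the expense of the other can still reduce the total error.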

`Q5: Discuss some common methods for detecting overfitting and underfitting in machine learning models.
How can you determine whether your model is overfitting or underfitting?`

Overfitting and underfitting are common issues in machine learning, and detecting them is essential for building accurate models. Here are some of the common methods for detecting overfitting and underfitting:

1. **`Visual inspection of training and validation curves`**: One of the simplest methods to detect overfitting and underfitting is by visualizing the training and validation curves. A model that is overfitting has low training error but high validation error, while a model that is underfitting has high training error and high validation error.

2. **`Cross-validation`**: Cross-validation is a popular method for evaluating the performance of a machine learning model. It involves splitting the data into k-folds, training the model on k-1 folds, and testing it on the remaining fold. This process is repeated k times, and the average performance is calculated. If the model performs well on the training data but poorly on the test data, it is likely overfitting. If the model performs poorly on both the training and test data, it is likely underfitting.

3. **`Regularization`**: Regularization is a technique used to prevent overfitting in machine learning models. It involves adding a penalty term to the loss function to reduce the complexity of the model. If the regularization parameter is too high, the model may underfit, and if it is too low, the model may overfit. Varying the regularization strength and watching how the validation error responds can therefore indicate which of the two problems the model has.

4. **`Learning curves`**: Learning curves are plots that show the model's performance as a function of the number of training examples. If the training and validation curves converge to a high error, the model is likely underfitting. If the validation error is much higher than the training error, the model is likely overfitting.

5. **`Feature importance`**: Feature importance is a measure of how much each feature contributes to the model's performance. If a model is overfitting, reducing the number of features or removing the less important features may improve its performance. On the other hand, if a model is underfitting, adding more features or increasing the importance of the existing features may improve its performance.
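
The comparison of training and validation error described in items 1 and 4 can be sketched with polynomial models of increasing degree. Everything here is an illustrative assumption: the sine-shaped synthetic data, the noise level, and the chosen degrees (1, 3, 12).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    print(degree, results[degree])   # (training MSE, validation MSE)
```

A high error on both sets (degree 1) points to underfitting, while a training error far below the validation error (degree 12) points to overfitting.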

In summary, detecting overfitting and underfitting requires a combination of methods such as visual inspection of training and validation curves, cross-validation, regularization, learning curves, and feature importance analysis. By using these methods, you can determine whether your model is overfitting or underfitting and adjust it accordingly to improve its accuracy.

`Q6: Compare and contrast bias and variance in machine learning. What are some examples of high bias
and high variance models, and how do they differ in terms of their performance?`

Bias and variance are two fundamental concepts in machine learning that help to understand the performance of a model.

`Bias` refers to the errors that are introduced by a model's assumptions regarding the data. A model with high bias tends to underfit the training data and may miss relevant patterns. In other words, it fails to capture the complexity of the data. Bias is often a result of a model being too simple or having too few parameters to capture the underlying patterns in the data. For example, a linear regression model may have high bias if the relationship between the features and the target variable is not linear.

`Variance`, on the other hand, refers to the sensitivity of the model's predictions to changes in the training data. A model with high variance tends to overfit the training data: it captures too much of the complexity of the data, including noise, and may not generalize well to new data. Variance is often a result of a model being too complex or having so many parameters that it can fit the noise in the training data. For example, a deep neural network may have high variance if it has too many layers or neurons.

`High bias` models are those that make strong assumptions about the data and are unable to capture the complexity of the relationship between the features and the target variable. Examples of high bias models include linear regression, logistic regression, and decision trees with shallow depths. These models tend to underfit the training data and have low variance. They may perform well on simple datasets with few features but may not perform well on more complex datasets.

`High variance` models are those that are very flexible and can capture complex relationships between the features and the target variable. Examples of high variance models include deep neural networks, deep (unpruned) decision trees, and k-nearest neighbors with low values of k. These models tend to overfit the training data and have low bias. They may fit complex datasets with many features very closely but may not generalize well to new data.

To understand the difference between high bias and high variance models, let's take the `example of a classification problem`. Suppose we have a dataset with two features, and the target variable is binary. A high bias model such as logistic regression may perform poorly on this dataset if the classes are not linearly separable, because it assumes a linear decision boundary and cannot capture non-linear relationships between the features and the target variable. On the other hand, a high variance model such as a deep neural network may perform very well on the training data, but it may also overfit and not generalize well to new data. In contrast, a well-balanced model such as a random forest, which averages many trees to reduce variance, may perform well on this dataset by capturing the non-linear relationships between the features and the target variable without overfitting the training data.
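
The contrast can also be estimated numerically. The sketch below uses a regression setting rather than classification for simplicity, refits a rigid model (degree 1) and a flexible one (degree 9) on many noisy versions of the same training inputs, and measures squared bias and variance of the predictions. The data-generating function, noise level, and degrees are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 25)      # fixed training inputs
x_test = np.linspace(0.05, 0.95, 50)     # points where predictions are compared

def bias2_and_variance(degree, n_trials=200, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit by resampling noise."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        y = true_f(x_train) + rng.normal(scale=noise, size=x_train.size)
        preds[t] = Polynomial.fit(x_train, y, deg=degree)(x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = float(np.mean((mean_pred - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance

simple = bias2_and_variance(degree=1)     # rigid model
flexible = bias2_and_variance(degree=9)   # flexible model
print("degree 1: bias^2=%.3f variance=%.3f" % simple)
print("degree 9: bias^2=%.3f variance=%.3f" % flexible)
```

The rigid model should show the larger squared bias and the flexible model the larger variance, mirroring the descriptions above.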

`Q7: What is regularization in machine learning, and how can it be used to prevent overfitting? Describe
some common regularization techniques and how they work.`

`Regularization` is a technique in machine learning that helps to ***prevent overfitting*** of a model by adding a penalty term to the loss function. The penalty term discourages the model from fitting the noise in the training data and encourages it to find a simpler solution that generalizes better to new data.

Common regularization techniques include:

1. `L1 Regularization (Lasso)`: This regularization technique adds a penalty term to the loss function proportional to the absolute value of the weights of the model. It encourages the model to produce sparse solutions where many of the weights are set to zero. By reducing the number of non-zero weights, L1 regularization can improve the model's interpretability and reduce overfitting. L1 regularization can also be used for feature selection, as the zero-valued weights indicate the least important features.

2. `L2 Regularization (Ridge)`: This regularization technique adds a penalty term to the loss function proportional to the square of the weights of the model. It encourages the model to produce small weights and can help to reduce the impact of noisy features in the data. L2 regularization can also help to stabilize the model by reducing the sensitivity of the weights to changes in the input data.

3. `Elastic Net Regularization`: This regularization technique combines L1 and L2 regularization by adding a penalty term that is a combination of the absolute value and the square of the weights. It provides a balance between the sparsity-inducing effect of L1 regularization and the smoothness-inducing effect of L2 regularization.

4. `Dropout Regularization`: This regularization technique is commonly used in neural networks and randomly drops out some of the neurons during training. This prevents the neural network from relying too much on any single neuron and encourages the network to learn more robust and generalizable features.

5. `Data Augmentation`: This regularization technique involves generating additional training examples by applying random transformations to the original data, such as rotation, flipping, or scaling. This can increase the size and diversity of the training dataset and reduce overfitting by helping the model learn more generalized features.
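
The difference between the sparsity of L1 and the smooth shrinkage of L2 (items 1 and 2) can be seen in a simplified setting. The sketch below applies the shrinkage step that each penalty induces on a single weight when the rest of the loss is a plain squared distance to the unregularised value; this is a deliberate simplification, and the function names and example weights are assumptions.

```python
import numpy as np

def l1_shrink(w, lam):
    """Soft-thresholding: the shrinkage induced by an L1 penalty of strength lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Proportional shrinkage induced by an L2 penalty of strength lam."""
    return w / (1.0 + lam)

w = np.array([3.0, -0.5, 0.2, -2.0])   # hypothetical unregularised weights
w_l1 = l1_shrink(w, 1.0)
w_l2 = l2_shrink(w, 1.0)
print(w_l1)   # small weights become exactly zero
print(w_l2)   # every weight is halved, none is zero
```

This is why L1 can act as a feature selector while L2 keeps every feature but with smaller weights.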

Regularization helps to prevent overfitting by reducing the model's complexity and constraining the weights to smaller values. This can improve the model's ability to generalize to new data by reducing the impact of noisy or irrelevant features. Regularization can also improve the interpretability of the model by identifying the most important features in the data.