Q1: Define overfitting and underfitting in machine learning. What are the consequences of each, and how can they be mitigated?

Answer: Overfitting is an undesirable machine learning behavior that occurs when the model gives accurate predictions for the training data but not for new data. When data scientists use machine learning models to make predictions, they first train the model on a known data set. Based on this information, the model then tries to predict outcomes for new data sets. An overfit model has effectively memorized the training data, including its noise, so it gives inaccurate predictions on new data and does not generalize.

Overfitting can be mitigated by the following steps:

-Using K-fold cross-validation

-Using Regularization techniques such as Lasso and Ridge

-Training model with sufficient data

-Adopting ensembling techniques
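
As a rough illustration of the first item, the sketch below shows how K-fold cross-validation splits sample indices so that every sample is used for validation exactly once. The function name `k_fold_indices` is hypothetical, and the sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Averaging the validation score over all folds gives a more reliable estimate of performance on new data than a single train/validation split, which makes overfitting easier to spot.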


Underfitting is another type of error that occurs when the model cannot determine a meaningful relationship between the input and output data. Models tend to underfit when they are too simple for the problem or have not been trained for long enough on enough data points. An underfit model performs poorly on both the training data and new data.

Underfitting can be mitigated by the following steps:

-Increase the number of features in the dataset

-Increase model complexity

-Reduce noise in the data

-Increase the training duration

Q2: How can we reduce overfitting? Explain in brief.

Answer: Overfitting can be reduced with the following techniques:

>Early stopping

Early stopping halts the training phase before the machine learning model starts to learn the noise in the data. Getting the timing right is important: stop too early and the model underfits, stop too late and it still overfits.
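
One common way to pick the stopping point is to watch the validation loss and stop once it has not improved for a few epochs. The sketch below assumes the per-epoch validation losses are already available as a list; the function name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through every epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical losses: they fall, then rise as the model starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)  # prints: 5 3
```

In practice the model weights from `best_epoch` are usually the ones kept, since the later epochs only made validation performance worse.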

>Pruning

You might identify several features or parameters that impact the final prediction when you build a model. Feature selection—or pruning—identifies the most important features within the training set and eliminates irrelevant ones. For example, to predict if an image is an animal or human, you can look at various input parameters like face shape, ear position, body structure, etc. You may prioritize face shape and ignore the shape of the eyes.
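
A very simple form of feature selection ranks features by how strongly they correlate with the target and keeps only the top few. The sketch below assumes numeric features and uses made-up values; the feature names echo the example above but the numbers are purely illustrative, and real pipelines often use more robust criteria than a linear correlation.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_features(features, target, k):
    """Return the k feature names with the largest absolute correlation, plus all scores."""
    scores = {name: abs(pearson_correlation(values, target))
              for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical numeric features and target, for illustration only.
features = {
    "face_shape": [1, 2, 3, 4, 5],
    "ear_position": [1, 1, 2, 3, 3],
    "eye_shape": [3, 1, 4, 1, 5],
}
target = [2, 4, 6, 8, 10]
selected, scores = select_top_features(features, target, k=2)
print(selected)
```

Here `eye_shape` has the weakest relationship with the target, so it is the feature that gets dropped.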

>Regularization

Regularization is a collection of training/optimization techniques that seek to reduce overfitting. These methods add a penalty on model complexity to the training objective, which discourages the model from relying on factors that have little impact on the prediction outcomes. For example, a penalty on the size of the model's coefficients shrinks the weights of features with minimal impact toward zero. Consider a statistical model attempting to predict the housing prices of a city in 20 years. With regularization, informative features like population growth and average annual income would keep relatively large weights, while a weakly related feature like the average annual temperature of the city would have its weight shrunk toward zero.

>Ensembling

Ensembling combines predictions from several separate machine learning models. Some models are called weak learners because their individual results are often inaccurate. Ensemble methods combine many weak learners to get more accurate results: multiple models analyze the sample data, and their predictions are aggregated, for example by voting or averaging. The two main ensemble methods are bagging and boosting. Boosting trains models one after another, with each new model focusing on the errors of the previous ones, while bagging trains them independently (and so possibly in parallel) on different random samples of the training data.
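
The aggregation step can be sketched with a simple majority vote. The example below does not train anything: the three lists of predictions are hypothetical outputs from three weak classifiers, each of which makes different mistakes.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote per sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


true_labels = [1, 0, 1, 1, 0, 0]
# Hypothetical predictions from three weak models.
model_predictions = [
    [1, 0, 1, 0, 0, 1],  # wrong on samples 3 and 5
    [1, 1, 1, 1, 0, 0],  # wrong on sample 1
    [0, 0, 1, 1, 0, 0],  # wrong on sample 0
]
ensemble = majority_vote(model_predictions)
print(ensemble)  # prints: [1, 0, 1, 1, 0, 0]
print([accuracy(p, true_labels) for p in model_predictions], accuracy(ensemble, true_labels))
```

Because the individual models err on different samples, the vote corrects every mistake here. The benefit shrinks when the models tend to make the same errors.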

>Data augmentation

Data augmentation is a machine learning technique that changes the sample data slightly every time the model processes it, by altering the input in small ways that preserve its label. When done in moderation, data augmentation makes the training samples appear unique to the model and prevents it from memorizing their specific details. For example, you can apply transformations such as translation, flipping, and rotation to input images.
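
The three image transformations mentioned above can be sketched on a tiny image stored as a list of pixel rows. The helper names are hypothetical, and real pipelines would normally rely on an image library rather than nested lists.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns (shift >= 0), padding the left edge with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[:width - shift] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
print(flip_horizontal(image))      # prints: [[3, 2, 1], [6, 5, 4]]
print(translate_right(image, 1))   # prints: [[0, 1, 2], [0, 4, 5]]
print(rotate_90_clockwise(image))  # prints: [[4, 1], [5, 2], [6, 3]]
```

Each transformed copy keeps the original label, so the model sees more varied examples of the same class.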

Q3: Explain underfitting. List scenarios where underfitting can occur in ML.

Answer: When a model has not learned the patterns in the training data well and therefore cannot generalize to new data, it is said to be underfitting. An underfit model performs poorly even on the training data and produces unreliable predictions. Underfitting is characterized by high bias and low variance.
    
Reasons for Underfitting

-Data used for training is not cleaned and contains noise (garbage values) in it

-The model has a high bias

-The size of the training dataset used is not enough

-The model is too simple

Q4: Explain the bias-variance tradeoff in machine learning. What is the relationship between bias and variance, and how do they affect model performance?

Answer: The bias-variance tradeoff is a fundamental concept in machine learning and statistics. It refers to the delicate balance between two sources of error in a predictive model: bias and variance.

Bias represents the error due to overly simplistic assumptions in the learning algorithm. High bias can cause the model to underfit the data, leading to poor performance on both training and unseen data.

Variance, on the other hand, reflects the model’s sensitivity to small fluctuations in the training data. High variance can lead to overfitting, where the model captures noise in the training data and performs poorly on new, unseen data.

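Bias and variance typically pull in opposite directions. For a given model family and amount of training data, it is usually not possible to minimize both at once: when a data engineer modifies the ML algorithm to fit a given data set more closely, bias goes down but variance goes up, and simplifying the model has the opposite effect. The goal is the level of complexity that minimizes the total error on unseen data.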

A model with high variance may represent the data set accurately but could lead to overfitting to noisy or otherwise unrepresentative training data. In comparison, a model with high bias may underfit the training data due to a simpler model that overlooks regularities in the data.
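
For squared-error loss, this tradeoff is often written as a decomposition of the expected prediction error at a point x. The notation here is an assumption introduced for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a randomly drawn training set, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term cannot be reduced by any model, so improving a model means trading off the first two terms.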

Q5: Discuss some common methods for detecting overfitting and underfitting in machine learning models. How can you determine whether your model is overfitting or underfitting?

Answer: Your model is underfitting the training data when it performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y). Your model is overfitting the training data when it performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.

Poor performance on the training data could be because the model is too simple (the input features are not expressive enough) to describe the target well. Performance can be improved by increasing model flexibility. To increase model flexibility, try the following:

-Add new domain-specific features and more feature Cartesian products, and change the types of feature processing used (e.g., increase the n-gram size)

-Decrease the amount of regularization used



If your model is overfitting the training data, it makes sense to take actions that reduce model flexibility. To reduce model flexibility, try the following:

-Feature selection: consider using fewer feature combinations, decreasing the n-gram size, and decreasing the number of numeric attribute bins

-Increase the amount of regularization used
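
The detection rule described in this answer (compare training performance with evaluation performance) can be written down as a small check. This is only a sketch: the function name and both thresholds are hypothetical and would need to be chosen per problem.

```python
def diagnose_fit(train_error, validation_error, acceptable_error=0.10, max_gap=0.05):
    """Classify a model's fit from its training and validation error rates.

    `acceptable_error` and `max_gap` are illustrative thresholds, not standard values.
    """
    if train_error > acceptable_error:
        # Poor performance even on data the model has seen.
        return "underfitting"
    if validation_error - train_error > max_gap:
        # Good on training data, clearly worse on unseen data.
        return "overfitting"
    return "good fit"


print(diagnose_fit(0.30, 0.32))  # prints: underfitting
print(diagnose_fit(0.02, 0.25))  # prints: overfitting
print(diagnose_fit(0.05, 0.07))  # prints: good fit
```

Plotting both errors over training epochs or over training-set size (learning curves) gives the same information in more detail.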

Q6: Compare and contrast bias and variance in machine learning. What are some examples of high bias and high variance models, and how do they differ in terms of their performance?

Answer: 
    
>Bias

- Bias develops when the algorithm used in a machine learning model makes assumptions that are too simple for the data, so the model does not fit well.

- Bias is the difference between the model's average prediction and the value that was actually observed.

- A high-bias model fails to find the patterns in the dataset it was trained on, and it produces inaccurate results for both seen and unseen data.

>Variance

- Variance is the amount by which the estimate of the target function would change if the model were trained on a different set of training data.

- More generally, a random variable's variance measures how far its values spread around its expected value (mean).

- A high-variance model learns most of the patterns in the dataset but also learns the noise, that is, details of the training data that do not carry over to new data.


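Linear regression is a typical example of a high-bias model: when the true relationship is nonlinear, it performs poorly on both training and new data. A decision tree (especially a deep, unpruned one) is a typical example of a high-variance model: it can fit the training data almost perfectly yet perform noticeably worse on new data.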
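
The difference in behavior can be seen with two deliberately extreme stand-ins, used here instead of linear regression and decision trees to keep the sketch dependency-free: a model that always predicts the training mean (high bias) and a 1-nearest-neighbor model that memorizes the training set (high variance). The synthetic data, noise level, and seed are assumptions chosen for illustration.

```python
import random


def fit_mean_model(xs, ys):
    """High-bias model: ignores x and always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_one_nearest_neighbor(xs, ys):
    """High-variance model: returns the target of the closest training point."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Noisy linear data: y = 2x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(30, rng)

mean_model = fit_mean_model(train_x, train_y)
nn_model = fit_one_nearest_neighbor(train_x, train_y)

mean_train = mean_squared_error(mean_model, train_x, train_y)
mean_test = mean_squared_error(mean_model, test_x, test_y)
nn_train = mean_squared_error(nn_model, train_x, train_y)
nn_test = mean_squared_error(nn_model, test_x, test_y)

print(f"mean model: train MSE {mean_train:.2f}, test MSE {mean_test:.2f}")
print(f"1-NN model: train MSE {nn_train:.2f}, test MSE {nn_test:.2f}")
```

The mean model has a large error on both sets, the signature of high bias. The 1-nearest-neighbor model has zero training error because it reproduces every training target, noise included, but a clearly nonzero test error, the signature of high variance.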

Q7: What is regularization in machine learning, and how can it be used to prevent overfitting? Describe some common regularization techniques and how they work.

Answer: Regularization refers to techniques that calibrate machine learning models by minimizing an adjusted loss function (the original loss plus a penalty term) in order to prevent overfitting.

Using regularization, we can fit our machine learning model appropriately on the training set so that it generalizes to a given test set, and hence reduce its errors there.

Regularization Techniques: There are two main types of regularization techniques: Ridge Regularization and Lasso Regularization.

-Ridge Regularization: Also known as Ridge Regression or L2 regularization, it corrects over-fitted models by adding a penalty equal to the sum of the squares of the coefficient magnitudes to the loss function.

This means that the loss function of our machine learning model is minimized together with this penalty when the coefficients are calculated. The coefficients are squared and summed, so large coefficients are penalized heavily. Ridge Regression thus performs regularization by shrinking the coefficients toward zero, although they rarely become exactly zero.


-Lasso Regression: Also known as L1 regularization, it corrects over-fitted models by adding a penalty equal to the sum of the absolute values of the coefficients to the loss function.

Lasso regression also shrinks the coefficients, but instead of squaring them, it penalizes their absolute values. This penalty can drive some coefficients to exactly 0, which removes the corresponding features from the model, so Lasso also acts as a form of feature selection.
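
The different effect of the two penalties is easiest to see with a single feature and no intercept, where both problems have a closed-form solution. The sketch below assumes the objectives 0.5·Σ(y − b·x)² + 0.5·λ·b² for Ridge and 0.5·Σ(y − b·x)² + λ·|b| for Lasso; the function names, the data, and the values of λ are hypothetical, and the feature is assumed not to be all zeros.

```python
def ridge_coefficient(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2 for a single feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_coefficient(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# Hypothetical data with a weak relationship between x and y.
xs = [1, -1, 1, -1]
ys = [0.5, -0.3, 0.2, -0.6]

for lam in (0.0, 1.0, 2.0):
    print(lam, round(ridge_coefficient(xs, ys, lam), 4), round(lasso_coefficient(xs, ys, lam), 4))
```

With λ = 0 both functions return the ordinary least-squares coefficient. As λ grows, the Ridge coefficient keeps shrinking but stays nonzero, while the Lasso coefficient reaches exactly zero once λ is large enough.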