## General Linear Model:

Que1: What is the purpose of the General Linear Model (GLM)?

Ans: A GLM relates a response variable to a linear combination of the predictors. In its generalized form, a link function connects the expected value of the response to that linear predictor, so the model stays linear in its parameters even when the response itself (for example, a count or a probability) is not linearly related to the predictors.

Que2: What are the key assumptions of the General Linear Model?

Ans: The general linear model fitted using ordinary least squares (which includes Student's t-test, ANOVA, and linear regression) makes four assumptions: linearity, homoskedasticity (constant variance of the errors), normality of the errors, and independence of the errors.

Que3: How do you interpret the coefficients in a GLM?

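Ans: The interpretation depends on the link function. With the identity link, a coefficient is the change in the expected response for a one-unit increase in the predictor, holding the other predictors fixed. With a log link, a one-unit increase multiplies the expected response by exp(β1), and with a logit link it multiplies the odds by exp(β1). If a log link is used for a binary response, you can interpret the coefficients as multiplying the probabilities by exp(β1); however, these models can give predicted probabilities greater than 1 and often don't converge (don't give an answer).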

Que4: What is the difference between a univariate and multivariate GLM?

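Ans: A univariate GLM has a single response (dependent) variable, whereas a multivariate GLM models two or more response variables jointly. The number of predictors is a separate distinction: a regression with one explanatory (predictor) variable x is a simple regression, and one with at least two explanatory variables x1, x2, ..., xn is a multiple regression.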

Que5: Explain the concept of interaction effects in a GLM.

Ans: Interaction effects include simultaneous effects of two or more variables on the process output or response. Interaction occurs when the effect of one independent variable changes depending on the level of another independent variable.

Que6: How do you handle categorical predictors in a GLM?

Ans: 

1) Drop Categorical Variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the columns did not contain useful information.

2) Label Encoding

Label encoding assigns each unique value to a different integer. Because the integers imply an ordering, it suits ordinal categories best.

3) One-Hot Encoding

One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data. 
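As a rough illustration of options 2 and 3, the sketch below encodes a small, made-up table with pandas. The column names and values are assumptions for the example only; `drop_first=True` is one common choice to avoid a redundant column when the model has an intercept.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "weather": ["rain", "sun", "rain", "sun"],
    "sales": [100, 10000, 100, 10000],
})

# Label encoding: each category becomes an integer code (rain -> 0, sun -> 1).
label_codes = df["weather"].astype("category").cat.codes

# One-hot encoding: one indicator column per category. drop_first=True drops
# the first level so the indicators are not collinear with an intercept.
encoded = pd.get_dummies(df, columns=["weather"], drop_first=True)
print(encoded)
```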

Que7: What is the purpose of the design matrix in a GLM?

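Ans: The design matrix X has one row per observation and one column per model term: an intercept column, the numeric predictors, and indicator (dummy) columns for categorical predictors and interactions. It lets the model be written as y = Xβ + ε, and it is also where constraints on the parameters are encoded, which gives researchers the flexibility to build models that a simpler parameterization cannot express.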

Que 8: How do you test the significance of predictors in a GLM?

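Ans: Each predictor is tested against the null hypothesis that its coefficient is zero, typically with a t-test (or Wald test) on the coefficient, or with an F-test or likelihood-ratio test that compares nested models. A low p-value (conventionally < 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable.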

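Que 9: What is the difference between Type I, Type II, and Type III sums of squares in a GLM?

Ans: Type I (sequential) sums of squares test each term after the terms entered before it, so the result depends on the order of the terms. Type II tests each main effect after the other main effects, ignoring interactions that contain it. Type III tests each term after all other terms in the model, including interactions. The three agree for balanced designs and can differ for unbalanced data, as in the example below.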

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [2]:
weekday = ['sat', 'sat', 'sat', 'sat', 'sat', 'sat', 'sun', 'sun', 'sun', 'sun']
weather = ['rain', 'rain', 'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'sun']
sales = [100, 100, 100, 100, 100, 10000, 10000, 10000, 10000, 10000]

In [3]:
data = pd.DataFrame({'weekday': weekday, 'weather': weather, 'sales': sales})
data

  weekday weather  sales
0     sat    rain    100
1     sat    rain    100
2     sat    rain    100
3     sat    rain    100
4     sat    rain    100
5     sat     sun  10000
6     sun     sun  10000
7     sun     sun  10000
8     sun     sun  10000
9     sun     sun  10000


In [4]:
# Type I tells us that weekday is more important. The interaction effect is not significant.
lm = ols('sales ~ C(weekday)*C(weather)',data=data).fit()
table = sm.stats.anova_lm(lm, typ=1) # Type 1 ANOVA DataFrame
print(table)

                        df        sum_sq       mean_sq             F  \
C(weekday)             1.0  1.633500e+08  1.633500e+08  2.438328e+31   
C(weather)             1.0  8.167500e+07  8.167500e+07  1.219164e+31   
C(weekday):C(weather)  1.0  1.464446e-24  1.464446e-24  2.185981e-01   
Residual               7.0  4.689484e-23  6.699263e-24           NaN   

                              PR(>F)  
C(weekday)             3.689375e-108  
C(weather)             4.174051e-107  
C(weekday):C(weather)   6.543160e-01  
Residual                         NaN  


In [5]:
# Type II tells us that weather is more important. There is no interaction effect.
lm = ols('sales ~ C(weekday) + C(weather)',data=data).fit()
table = sm.stats.anova_lm(lm, typ=2) # Type 2 ANOVA DataFrame
print(table)

                  sum_sq   df             F         PR(>F)
C(weekday)  1.654361e-23  1.0  2.509430e-01   6.317769e-01
C(weather)  8.167500e+07  1.0  1.238893e+30  1.247833e-103
Residual    4.614803e-22  7.0           NaN            NaN


In [6]:
# Type III tells us that weather is more important. The interaction effect is not significant.
lm = ols('sales ~ C(weekday)*C(weather)',data=data).fit()
table = sm.stats.anova_lm(lm, typ=3) # Type 3 ANOVA DataFrame
print(table)

                             sum_sq   df             F         PR(>F)
Intercept              5.000000e+04  1.0  7.463508e+27   7.353172e-96
C(weekday)             1.118348e-22  1.0  1.669360e+01   4.655642e-03
C(weather)             8.167500e+07  1.0  1.219164e+31  4.174051e-107
C(weekday):C(weather)  2.382280e-23  1.0  3.556033e+00   1.013070e-01
Residual               4.689484e-23  7.0           NaN            NaN


Que 10: Explain the concept of deviance in a GLM.

Ans: Deviance is a goodness-of-fit metric for statistical models, particularly used for GLMs. It is defined as twice the difference between the log-likelihood of the saturated model and that of the proposed model, and it can be thought of as how much variation in the data the proposed model leaves unexplained. Therefore, the lower the deviance, the better the model fits.

## Regression:

Que 11: What is regression analysis and what is its purpose?

Ans: Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.

Que 12: What is the difference between simple linear regression and multiple linear regression?

Ans: Simple linear regression models the response with a single independent variable: y = c + b*x.

Multiple linear regression incorporates two or more independent variables, each with its own coefficient: y = c + b1*x1 + b2*x2 + ... + bn*xn.

Que 13: How do you interpret the R-squared value in regression?

Ans: In linear regression models, R-squared is a goodness-of-fit measure. It is the proportion of the variance in the dependent variable that the model explains, which reflects the strength of the relationship between the model and the dependent variable. It is conveniently measured on a scale of 0 to 100%.

Que 14: What is the difference between correlation and regression?

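Ans: Correlation is a statistical measure that determines the strength and direction of the association between two variables.

Regression describes how to numerically relate an independent variable to the dependent variable, for example by fitting an equation that represents the linear relationship between two variables. Unlike correlation, it distinguishes the predictor from the response and can be used for prediction.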

Que 15: What is the difference between the coefficients and the intercept in regression?

Ans: The simple linear regression model is a linear equation of the form y = c + b*x, where y is the dependent variable (outcome), x is the independent variable (predictor), b is the slope of the line (also known as the regression coefficient), and c is the intercept (labeled as the constant).

The coefficient b is the expected change in y for a one-unit increase in x, while the intercept c is simply the expected value of y when x = 0.

Que 16: How do you handle outliers in regression analysis?

Ans:

Method 1: “Forget about it…”

One option for dealing with outliers is to drop the observations altogether. This can be a suitable option if further investigation shows that the survey entry was made in error.

Method 2: Replacing the Outlier With Another Value

If there is reason to keep the outliers in the model, another option is to set a ceiling or floor for the variable in question.

Method 3: Assign a Dummy Variable to Outliers

This is often my preferred option when dealing with outliers. It allows the model to use all the sample data and also gives information about the outliers in the data.
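The three methods can be sketched with NumPy as follows. The data are made up, and the 1.5 × IQR rule used to flag outliers is only one common convention, assumed here for illustration; the source does not prescribe a detection rule.

```python
import numpy as np

# Hypothetical sample with one extreme value.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Assumed detection rule: points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (x < lower) | (x > upper)  # Method 3: dummy variable for the model
x_dropped = x[~is_outlier]              # Method 1: drop the observations
x_capped = np.clip(x, lower, upper)     # Method 2: apply a floor and ceiling
```

The `is_outlier` array can be added to the regression as an extra 0/1 predictor, which keeps every row in the sample.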

Que 17: What is the difference between ridge regression and ordinary least squares regression?

Ans: Ordinary least squares chooses the coefficients that minimize the sum of squared residuals. Ridge regression is an extension of linear regression in which the loss function is modified to also limit the complexity of the model. This is done by adding a penalty term equal to a tuning parameter times the sum of the squared coefficients, which shrinks the coefficients toward zero.
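A minimal sketch of the difference, assuming a model without an intercept and synthetic data: the closed-form solution below reduces to OLS when `alpha` is 0 and to ridge when `alpha` is positive. The function name and the data are hypothetical.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    """Closed-form least squares with an optional L2 penalty.

    alpha = 0 gives OLS; alpha > 0 gives ridge. No intercept, for brevity.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = fit_linear(X, y)
w_ridge = fit_linear(X, y, alpha=10.0)
print(w_ols, w_ridge)  # the ridge coefficients are shrunk toward zero
```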

Que 18: What is heteroscedasticity in regression and how does it affect the model?

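Ans: Heteroskedasticity refers to situations where the variance of the residuals is unequal over a range of measured values. When running a regression analysis, heteroskedasticity results in an unequal scatter of the residuals (the estimates of the error term). The OLS coefficient estimates remain unbiased, but their standard errors are no longer reliable, so p-values and confidence intervals can be misleading.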

Que 19. How do you handle multicollinearity in regression analysis?

Ans: Remove some of the highly correlated independent variables.

Linearly combine the independent variables, such as adding them together.

Principal component regression and partial least squares regression replace the correlated predictors with a smaller set of uncorrelated components to include in the model.

LASSO and Ridge regression are advanced forms of regression analysis that can handle multicollinearity. If you know how to perform linear least squares regression, you’ll be able to handle these analyses with just a little additional study.

Que 20. What is polynomial regression and when is it used?

Ans: A polynomial regression model captures non-linear relationships between variables by adding powers of the predictor (x^2, x^3, ...) as extra terms, so the fitted curve is non-linear in x while the model is still linear in its coefficients. It is used when a straight line does not adequately capture the relationship.
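As a small illustration, the sketch below fits a straight line and a degree-2 polynomial to noiseless, made-up quadratic data with `np.polyfit`. The data and the helper `sse` are assumptions for the example.

```python
import numpy as np

# Made-up data following an exact quadratic.
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x - 0.5 * x**2

line_coeffs = np.polyfit(x, y, deg=1)  # straight line
quad_coeffs = np.polyfit(x, y, deg=2)  # coefficients, highest power first

def sse(coeffs):
    """Sum of squared errors of a fitted polynomial on the data."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

print(sse(line_coeffs), sse(quad_coeffs))
```

The straight line leaves a large residual, while the degree-2 fit recovers the generating coefficients.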

## Loss function:

Que 21. What is a loss function and what is its purpose in machine learning?

Ans: A loss function is a measure of how well your prediction model predicts the expected outcome (or value). We convert the learning problem into an optimization problem, define a loss function, and then optimize the algorithm to minimize the loss function.

Que 22. What is the difference between a convex and non-convex loss function?

Ans: A convex function is one in which a line segment drawn between any two points on the graph lies on the graph or above it. As a result, any local minimum is also a global minimum.

A non-convex function is one in which a line segment drawn between two points on the graph may pass below the graph. Such functions are often described as "wavy" and can have several local minima.

Que 23. What is mean squared error (MSE) and how is it calculated?

Ans: The Mean Squared Error measures how close a regression line is to a set of data points. It is a risk function corresponding to the expected value of the squared error loss. Mean square error is calculated by taking the average, specifically the mean, of errors squared from data as it relates to a function.

Que 24. What is mean absolute error (MAE) and how is it calculated?

Ans: The MAE score is measured as the average of the absolute error values. The absolute value is a mathematical function that makes a number non-negative.

Mean Absolute Error (MAE) is calculated by taking the summation of the absolute difference between the actual and calculated values of each observation over the entire array and then dividing the sum obtained by the number of observations in the array.
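The two calculations described above for MSE and MAE can be written in a few lines of NumPy. The function names and the sample values are illustrative.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mean_squared_error(y_true, y_pred))   # 1.25
print(mean_absolute_error(y_true, y_pred))  # 0.75
```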

Que 25. What is log loss (cross-entropy loss) and how is it calculated?

Ans: Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

The cross-entropy formula takes in two distributions, p(x), the true distribution, and q(x), the estimated distribution, defined over the discrete variable x, and is given by

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
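A minimal sketch of the calculation, assuming natural logarithms and a small clipping constant `eps` to avoid log(0). Both function names are hypothetical, and `binary_log_loss` is the usual two-class special case averaged over samples.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log(q(x)) for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(-np.sum(p * np.log(q)))

def binary_log_loss(y_true, y_prob, eps=1e-12):
    """Average cross-entropy for 0/1 labels and predicted P(y = 1)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

print(cross_entropy([1, 0, 0], [0.5, 0.25, 0.25]))  # equals log(2)
print(binary_log_loss([1, 0], [0.9, 0.2]))
```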

Que 26. How do you choose the appropriate loss function for a given problem?

Ans: As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss function, that can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.

Neural network models learn a mapping from inputs to outputs from examples and the choice of loss function must match the framing of the specific predictive modeling problem, such as classification or regression. Further, the configuration of the output layer must also be appropriate for the chosen loss function.

Que 27. Explain the concept of regularization in the context of loss functions.

Ans: In L2 regularization, the loss function of the neural network is extended by a so-called regularization term, here called Ω. The regularization term Ω is defined as the squared Euclidean norm (or L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix. The term is multiplied by a regularization strength and added to the original loss.

Que 28. What is Huber loss and how does it handle outliers?

Ans: In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data than the squared error loss. 

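The squared error behaves like the mean and the absolute error like the median, which is much more robust to outliers than the mean. Huber loss is a balanced compromise between these two: it is quadratic for small residuals and linear for large ones, so it is robust to outliers but does not completely ignore them either.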
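The sketch below shows the standard piecewise form of the Huber loss. The threshold name `delta` and its default of 1.0 are assumptions for the example.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

losses = huber_loss([0.5, 3.0])
print(losses)  # 0.125 for the small residual, 2.5 for the large one
```

For the residual of 3.0, the squared-error counterpart 0.5 * 3.0**2 would be 4.5, so the outlier contributes less under the Huber loss.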

Que 29. What is quantile loss and when is it used?

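Ans: Quantile loss (also called pinball loss) is a flexible loss function that can be incorporated into a regression model to predict a chosen quantile of the target variable rather than its mean. It penalizes under-prediction and over-prediction asymmetrically, and it is used when an interval or a conservative estimate is needed, for example a 90th-percentile forecast.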
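A minimal sketch of the quantile (pinball) loss, with `q` denoting the target quantile. The function name and the example values are illustrative.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Mean pinball loss: weight q on under-prediction, 1 - q on over-prediction."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# For the 0.9 quantile, predicting too low costs more than predicting too high.
under = quantile_loss([10.0], [8.0], q=0.9)   # 0.9 * 2 = 1.8
over = quantile_loss([10.0], [12.0], q=0.9)   # 0.1 * 2 = 0.2
print(under, over)
```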

Que 30. What is the difference between squared loss and absolute loss?

Ans: Squared loss penalizes the square of the error, so it is minimized on average by predicting the mean of y at x = x0. Large errors are penalized heavily, which makes it sensitive to outliers.

Absolute loss penalizes the magnitude of the error and is minimized by predicting the median, which makes it more robust to outliers.

## Optimizer (GD):


Que 31. What is an optimizer and what is its purpose in machine learning?

Ans: Optimization is the process of finding the model parameters that minimize the error function.

The ultimate goal of training an ML model is to reach the minimum of the loss function. After we pass the input, we calculate the error and update the weights accordingly. This is where the optimizer comes into play: it defines how to tweak the parameters to get closer to the minimum.

Que 32. What is Gradient Descent (GD) and how does it work?

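Ans: Gradient Descent is the most common optimization algorithm in machine learning and deep learning. It is a first-order optimization algorithm, which means it only takes into account the first derivative when updating the parameters. At each step it moves the parameters a small distance in the direction opposite to the gradient of the loss, scaled by the learning rate, and repeats until the loss stops decreasing.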
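As a rough sketch, the loop below applies gradient descent to a least-squares problem. The data, the function name, and the choices of learning rate and step count are assumptions made for the example.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_steps=2000):
    """Minimize mean squared error of a linear model with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        residual = X @ w - y
        grad = 2.0 * X.T @ residual / len(y)  # gradient of the MSE w.r.t. w
        w -= learning_rate * grad             # step against the gradient
    return w

# Made-up data from y = 1 + 3x; the first column of X is the intercept term.
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x

w = gradient_descent(X, y)
print(w)  # close to [1, 3]
```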

Que 33. What are the different variations of Gradient Descent?

Ans: There are three common variants of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. They differ in how much of the training data is used to compute each gradient.

Que 34. What is the learning rate in GD and how do you choose an appropriate value?

Ans: The learning rate hyperparameter controls the rate or speed at which the model learns. Specifically, it controls how much of the estimated error gradient is applied to the weights each time they are updated.

It is a small positive value, often in the range between 0.0 and 1.0. There is no universally best value: it is usually chosen by trying several values (often on a logarithmic scale such as 0.1, 0.01, 0.001) and keeping the one with the best validation loss, or by using a schedule that decreases it during training.

Que 35. How does GD handle local optima in optimization problems?

Ans: A local optimum is an extremum (minimum or maximum) of the objective function within a given region of the input space, e.g. a basin in a minimization problem.

Gradient descent is an iterative optimisation algorithm that finds the parameters or coefficients at which a function has a minimum value. It has no built-in mechanism for escaping local optima: it is not guaranteed to find the global minimum and can get stuck at a local minimum. Common mitigations include restarting from different initial values, using momentum, and relying on the noise in stochastic gradient descent.

Que 36. What is Stochastic Gradient Descent (SGD) and how does it differ from GD?

Ans: SGD tries to solve the main problem in batch gradient descent, which is the use of the whole training data to calculate the gradient at each step. SGD is stochastic in nature, i.e. it picks a random instance of the training data at each step and then computes the gradient, making each step much faster because there is far less data to process at a time. The price is a noisier path toward the minimum.

Que 37. Explain the concept of batch size in GD and its impact on training.

Ans: The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

Think of a batch as a for-loop iterating over one or more samples and making predictions. At the end of the batch, the predictions are compared to the expected output variables and an error is calculated. From this error, the update algorithm is used to improve the model, e.g. move down along the error gradient.

Que 38. What is the role of momentum in optimization algorithms?

Ans: Momentum is an extension to the gradient descent optimization algorithm that allows the search to build inertia in a direction in the search space and overcome the oscillations of noisy gradients and coast across flat spots of the search space.

Que 39. What is the difference between batch GD, mini-batch GD, and SGD?

Ans: 

* Batch Gradient Descent:

The samples from the whole dataset are used to optimize the parameters, i.e. to compute the gradient for a single update. For a dataset of 100 samples, updates occur only once per epoch.

* Mini Batch Gradient Descent:

This is meant to capture the good aspects of Batch and Stochastic GD. Instead of a single sample (Stochastic GD) or the whole dataset (Batch GD), we take small batches or chunks of the dataset and update the parameters accordingly. For a dataset of 100 samples, a batch size of 5 gives 20 batches. Hence, updates occur 20 times per epoch.

* Stochastic Gradient Descent:

Stochastic GD computes the gradient for each individual sample in the dataset and hence makes an update for every sample. For a dataset of 100 samples, updates occur 100 times per epoch.
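The update counts above can be checked with a small batching helper. `iterate_batches` is a hypothetical name; it shuffles the sample indices and yields them in chunks of `batch_size`, one chunk per parameter update.

```python
import numpy as np

def iterate_batches(n_samples, batch_size, rng):
    """Yield shuffled index batches that together cover one epoch."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
updates_per_epoch = {
    "batch": sum(1 for _ in iterate_batches(100, 100, rng)),
    "mini-batch": sum(1 for _ in iterate_batches(100, 5, rng)),
    "stochastic": sum(1 for _ in iterate_batches(100, 1, rng)),
}
print(updates_per_epoch)  # {'batch': 1, 'mini-batch': 20, 'stochastic': 100}
```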

Que 40. How does the learning rate affect the convergence of GD?

Ans: The learning rate is a configurable hyperparameter with a small positive value, often in the range between 0.0 and 1.0.

The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.

A learning rate that is too large can cause the updates to overshoot, so the model oscillates, diverges, or settles quickly on a suboptimal solution, whereas a learning rate that is too small can make progress so slow that the process appears to get stuck.
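A toy illustration of these three regimes, assuming the one-dimensional objective f(x) = x**2 (gradient 2*x) and three hand-picked learning rates. Real models behave less cleanly, so treat this only as a sketch.

```python
def minimize_quadratic(learning_rate, n_steps=50, x0=1.0):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * 2.0 * x
    return x

final = {lr: minimize_quadratic(lr) for lr in (0.001, 0.1, 1.1)}
for lr, x in final.items():
    print(lr, x)
# 0.001: barely moves from the start; 0.1: close to the minimum at 0;
# 1.1: each step overshoots and the iterate grows without bound.
```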

## Regularization:


Que 41. What is regularization and why is it used in machine learning?

Ans: Regularization is one of the most important concepts of machine learning. It is a technique to prevent the model from overfitting by adding extra information to it.

This technique makes it possible to keep all variables or features in the model while reducing the magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the model.

It mainly regularizes, or shrinks, the coefficients of the features toward zero. In simple words, "In the regularization technique, we reduce the magnitude of the coefficients while keeping the same number of features."

Que 42: What is the difference between L1 and L2 regularization?

Ans: L1 regularization penalizes the sum of absolute values of the weights, whereas L2 regularization penalizes the sum of squares of the weights.

Que 43. Explain the concept of ridge regression and its role in regularization.

Ans: Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we get better long-term predictions. It is a regularization technique used to reduce the complexity of the model, and it is also called L2 regularization.

Que 44. What is the elastic net regularization and how does it combine L1 and L2 penalties?

Ans: The elastic net is a linear regression regularization technique that combines both the L1 (Lasso) and L2 (Ridge) regularization penalties. It is particularly useful when dealing with datasets that have high collinearity or when there are more predictors than observations.
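One common parameterization of the combined penalty uses an overall strength `alpha` and a mixing weight `l1_ratio` between 0 and 1. Both names, and the factor 0.5 on the L2 part, are assumptions of this sketch; other texts scale the two terms differently.

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))   # Lasso part
    l2 = np.sum(w ** 2)      # Ridge part
    return float(alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2))

w = [1.0, -2.0, 0.0]
print(elastic_net_penalty(w, l1_ratio=1.0))  # pure L1: 3.0
print(elastic_net_penalty(w, l1_ratio=0.0))  # pure L2: 2.5
print(elastic_net_penalty(w, l1_ratio=0.5))  # mixture: 2.75
```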

Que 45. How does regularization help prevent overfitting in machine learning models?

Ans: Regularization is a technique that penalizes the coefficients. In an overfit model, the coefficients are generally inflated. Regularization adds a penalty on the coefficients to the cost function, which keeps any of them from being weighted too heavily.

Que 46. What is early stopping and how does it relate to regularization?

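Ans: In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point this also improves performance on unseen data; past that point the model starts to overfit, so training is stopped once the validation error stops improving.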
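A simple way to picture early stopping is a patience rule over the validation losses recorded after each epoch. The function name, the `patience` parameter, and the loss values below are hypothetical; real training loops would also save and restore the model weights.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss seen before stopping."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: no improvement for `patience` epochs in a row
    return best_epoch

val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
best = early_stopping_epoch(val_losses)
print(best)  # 2: training stops after epoch 4, so epoch 5 is never reached
```

The example also shows the trade-off in choosing the patience: a larger value would have reached the lower loss at epoch 5, at the cost of more training.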

Que 47. Explain the concept of dropout regularization in neural networks.

Ans: Dropout is a regularization method that approximates training many neural networks with different architectures at once. During training, some layer outputs are ignored, or dropped, at random. This makes the layer look like, and be treated as, a layer with a different number of nodes and different connectivity to the preceding layer.
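The sketch below implements "inverted" dropout, a common variant in which the kept activations are rescaled during training so that nothing needs to change at inference time. The function name and arguments are assumptions for the example.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob and rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True where the unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones(1000)
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
print(dropped[:10], dropped.mean())  # values are 0 or 2; the mean stays near 1
```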

Que 48. How do you choose the regularization parameter in a model?

Ans: 
    
1) on the training set, we estimate several different Ridge regressions, with different values of the regularization parameter;

2) on the validation set, we choose the best model (the regularization parameter which gives the lowest MSE on the validation set);

3) on the test set, we check how much overfitting we have done by doing model selection on the validation set.
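The three steps can be sketched with a closed-form ridge fit on synthetic data. The split sizes, the grid of candidate values, and the helper names are assumptions made for the example.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression without an intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (illustrative only), split into train / validation / test.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=150)
X_train, y_train = X[:90], y[:90]
X_val, y_val = X[90:120], y[90:120]
X_test, y_test = X[120:], y[120:]

# 1) fit one model per candidate value on the training set
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {a: mse(X_val, y_val, fit_ridge(X_train, y_train, a)) for a in alphas}

# 2) keep the value with the lowest validation MSE
best_alpha = min(val_mse, key=val_mse.get)

# 3) report the error of that model on the untouched test set
test_mse = mse(X_test, y_test, fit_ridge(X_train, y_train, best_alpha))
print(best_alpha, test_mse)
```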

Que 49. What is the difference between feature selection and regularization?

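Ans: Feature selection, also known as feature subset selection, variable selection, or attribute selection, removes dimensions (e.g. columns) from the input data and results in a reduced data set for model training and inference.

Regularization keeps the features but constrains the solution space during optimization, shrinking the coefficients instead of removing columns. L1 regularization can drive some coefficients exactly to zero, so it also acts as a form of feature selection.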

Que 50. What is the trade-off between bias and variance in regularized models?

Ans: If the model is too simple (for example, a hypothesis with a linear equation), it tends to have high bias and low variance and underfits. If the model is too complex (for example, a hypothesis with a high-degree equation), it tends to have high variance and low bias and overfits. Regularization moves a model along this trade-off: a stronger penalty increases bias but reduces variance.

## SVM:


Que 51. What is Support Vector Machines (SVM) and how does it work?

Ans: Support vector machines (SVMs) are powerful machine learning tools for data classification and prediction (Vapnik, 1995). The problem of separating two classes is handled using a hyperplane that maximizes the margin between the classes. The data points that lie on the margins are called support vectors.

Que 52. How does the kernel trick work in SVM?

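Ans: The kernel trick replaces the inner product of the mapped data points, φ(x)·φ(z), with a kernel function K(x, z) computed directly on the original data points. The trick is to identify kernel functions that equal the inner product of some mapping function, so the mapping into the higher-dimensional space never has to be carried out explicitly. This makes the computation easy.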
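To make this concrete, the sketch below uses the degree-2 polynomial kernel K(x, z) = (x·z)^2 on two-dimensional inputs, for which an explicit feature map φ is known. The function names are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x . z)**2 in two dimensions."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """The same inner product, computed without leaving the input space."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
explicit = float(np.dot(phi(x), phi(z)))
implicit = poly_kernel(x, z)
print(explicit, implicit)  # both equal 121 up to floating-point rounding
```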

Que 53. What are support vectors in SVM and why are they important?

Ans: Support vectors are the data points closest to the hyperplane, and they determine its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane, whereas removing other points leaves it unchanged. These are the points that help us build our SVM.

Que 54. Explain the concept of the margin in SVM and its impact on model performance.

Ans: Margin: it is the distance between the hyperplane and the observations closest to the hyperplane (support vectors). In SVM large margin is considered a good margin. There are two types of margins hard margin and soft margin.

The performance of the SVM depends on parameters such as the penalty factor and the kernel parameter. Choosing an appropriate kernel function can also improve the recognition score and lower the amount of computation.

Que 55. How do you handle unbalanced datasets in SVM?

Ans: Perhaps the simplest and most common extension to SVM for imbalanced classification is to weight the C value in proportion to the importance of each class. To accommodate these factors in SVMs an instance-level weighted modification was proposed.

Que 56. What is the difference between linear SVM and non-linear SVM?

Ans: Linear SVM: When the data points are linearly separable into two classes, the data is called linearly-separable data. We use the linear SVM classifier to classify such data. Non-linear SVM: When the data is not linearly separable, we use the non-linear SVM classifier to separate the data points.

Que 57. What is the role of C-parameter in SVM and how does it affect the decision boundary?

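Ans: The C parameter adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications. If C is large, the SVM tries to minimize the number of misclassified points, which results in a decision boundary with a smaller margin.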

Que 58. Explain the concept of slack variables in SVM.

Ans: Slack variables are introduced to allow certain constraints to be violated. That is, certain training points will be allowed to be within the margin. We want the number of points within the margin to be as small as possible, and of course we want their penetration of the margin to be as small as possible.
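For a fixed linear classifier, the slack of each training point can be computed as max(0, 1 - y_i (w·x_i + b)), with labels y_i in {-1, +1}. The weights, bias, and points below are made up for illustration.

```python
import numpy as np

def slack_variables(X, y, w, b):
    """Slack per point: 0 outside the margin, positive inside or beyond it."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

# Hypothetical classifier w.x + b with four labelled points.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1])
w = np.array([1.0, 0.0])
b = 0.0

xi = slack_variables(X, y, w, b)
print(xi)  # [0.  0.5 1.5 0. ]
```

A slack of 0 means the point respects the margin, a value between 0 and 1 means it lies inside the margin but on the correct side, and a value above 1 means it is misclassified.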

Que 59. What is the difference between hard margin and soft margin in SVM?

Ans: When the data is linearly separable, and we don't want to have any misclassifications, we use SVM with a hard margin. However, when a linear boundary is not feasible, or we want to allow some misclassifications in the hope of achieving better generality, we can opt for a soft margin for our classifier.

Que 60. How do you interpret the coefficients in an SVM model?

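Ans: Recall that in a linear SVM, the result is a hyperplane that separates the classes as well as possible. The weights represent this hyperplane by giving the coordinates of a vector that is orthogonal to it; these are the coefficients returned by the SVM. The sign of each coefficient indicates which class the feature pushes toward, and, for features on comparable scales, a larger absolute value indicates a larger influence on the decision.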

## Decision Trees:



Que 61. What is a decision tree and how does it work?

Ans:  A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.

* Steps :

1) Start with your idea. Begin your diagram with one main idea or decision. ...

2) Add chance and decision nodes. ...

3) Expand until you reach end points. ...

4) Calculate tree values. ...

5) Evaluate outcomes.

Que 62. How do you make splits in a decision tree?

Ans: For each split, individually calculate the entropy of each child node. Calculate the entropy of each split as the weighted average entropy of child nodes. Select the split with the lowest entropy or highest information gain

Que 63. What are impurity measures (e.g., Gini index, entropy) and how are they used in decision trees?

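Ans: Impurity measures quantify how mixed the class labels in a node are: they are zero for a node that contains a single class and largest when the classes are evenly mixed. The Gini impurity measure, defined as 1 minus the sum of the squared class proportions, is one of the methods used in decision tree algorithms to decide the optimal split from a root node, and subsequent splits. To put it into context, a decision tree is trying to create sequential questions such that it partitions the data into smaller, purer groups.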
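A short sketch of Gini impurity and of scoring a candidate split by the size-weighted impurity of its two children. The function names and labels are illustrative.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

print(gini_impurity(["a", "a", "b", "b"]))    # 0.5, evenly mixed
print(gini_impurity(["a", "a", "a", "a"]))    # 0.0, pure node
print(weighted_gini(["a", "a"], ["b", "b"]))  # 0.0, a perfect split
```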


Que 64. Explain the concept of information gain in decision trees.

Ans: Information gain is the basic criterion to decide whether a feature should be used to split a node or not. The feature with the optimal split i.e., the highest value of information gain at a node of a decision tree is used as the feature for splitting the node.
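Information gain can be computed as the entropy of the parent node minus the size-weighted entropy of the child nodes. The sketch below uses base-2 logarithms and made-up labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect = information_gain(parent, [["yes", "yes"], ["no", "no"]])
useless = information_gain(parent, [["yes", "no"], ["yes", "no"]])
print(perfect, useless)  # 1.0 for the perfect split, 0.0 for the uninformative one
```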

Que 65. How do you handle missing values in decision trees?

Ans: A simple approach is to impute the missing values before training, for example by taking the majority of the K nearest values. Random forests are also often described as robust to categorical data with missing values, although this depends on the implementation. Many decision tree-based algorithms such as XGBoost and CatBoost support data with missing values directly.

Que 66. What is pruning in decision trees and why is it important?

Ans: A Decision tree that is trained to its full depth will highly likely lead to overfitting the training data - therefore Pruning is important. In simpler terms, the aim of Decision Tree Pruning is to construct an algorithm that will perform worse on training data but will generalize better on test data.

Que 67. What is the difference between a classification tree and a regression tree?

Ans: Classification trees are used when the response variable is categorical, so the data needs to be split into the classes of the response variable. Regression trees, on the other hand, are used when the response variable is continuous.

Que 68. How do you interpret the decision boundaries in a decision tree?

Ans: The decision boundary of a decision tree is built from axis-aligned (orthogonal) splits, one for each successive decision, so the feature space ends up partitioned into rectangular regions, each assigned to a class or value.

Que 69. What is the role of feature importance in decision trees?

Ans: Feature importance is a common way to make machine learning models interpretable and to explain existing models. It lets you see the big picture while making decisions and avoid black-box models. In a decision tree, a feature's importance is typically computed as the total reduction in impurity achieved by the splits that use that feature.

Que 70. What are ensemble techniques and how are they related to decision trees?

Ans: A single decision tree can be problematic because it might not be stable enough; using multiple decision trees and combining their results usually works much better. Combining multiple models in one predictor is called ensembling, and decision trees are the most common base learners (as in random forests and gradient boosting). The basic idea of ensemble methods is to reduce the error, often by reducing the variance.

## Ensemble Techniques:


Que 71. What are ensemble techniques in machine learning?

Ans: Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods in machine learning usually produce more accurate solutions than a single model would.

Que 72. What is bagging and how is it used in ensemble learning?

Ans: Bagging, also known as Bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model.

Que 73. Explain the concept of bootstrapping in bagging.

Ans: Bagging is composed of two parts: bootstrapping and aggregation. Bootstrapping is a sampling method in which a sample is drawn from a set with replacement. The learning algorithm is then run on each of the samples.

Because bootstrapping samples with replacement, every draw is made from the full dataset and is independent of the previous draws, so the same observation can appear more than once in a sample. When sampling without replacement, each selection depends on the previous selections.
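A minimal sketch of bootstrapping followed by aggregation. Here the "model" trained on each sample is simply its mean, an assumption made to keep the example short; in bagging proper each sample would train a separate learner.

```python
import numpy as np

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    idx = rng.integers(0, len(data), size=len(data))
    return data[idx]

rng = np.random.default_rng(0)
data = np.arange(10)

# Bootstrapping: 100 resampled datasets of the same size as the original.
samples = [bootstrap_sample(data, rng) for _ in range(100)]

# Aggregation: average the per-sample estimates.
bagged_estimate = float(np.mean([s.mean() for s in samples]))
print(samples[0], bagged_estimate)
```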

Que 74. What is boosting and how does it work?

Ans: Boosting is a method used in machine learning to reduce errors in predictive data analysis. Data scientists train machine learning software, called machine learning models, on labeled data to make guesses about unlabeled data. A single machine learning model might make prediction errors depending on the accuracy of the training dataset. For example, if a cat-identifying model has been trained only on images of white cats, it may occasionally misidentify a black cat. Boosting tries to overcome this issue by training multiple models sequentially to improve the accuracy of the overall system.

Decision trees

Decision trees are data structures in machine learning that work by dividing the dataset into smaller and smaller subsets based on their features. The idea is that a decision tree splits the data repeatedly until each subset contains (mostly) a single class. For example, the tree may ask a series of yes or no questions and divide the data into categories at every step.

Boosting ensemble method

Boosting creates an ensemble model by combining several weak decision trees sequentially. It assigns weights to the training examples, and the examples that the first decision tree misclassifies are given a higher weight before being passed to the next tree. The output of each tree is also weighted according to its accuracy. After numerous cycles, the boosting method combines these weak rules into a single powerful prediction rule.
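
The reweighting loop can be sketched as follows. This is an AdaBoost-style illustration on a one-dimensional toy dataset, using decision stumps as the weak learners; the function names and the data are assumptions for this example, and a real implementation would handle multi-dimensional features.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) of the best 1-D stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best

def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x >= threshold else -polarity

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(xs, ys, weights)
        err = min(max(stump[2], 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)     # weight of this stump's vote
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy labels (made up) that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
training_errors = sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)
print(training_errors)
```

Each stump on its own misclassifies several points; the weighted vote of the three stumps is what fits the training labels.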

Boosting compared to bagging

Boosting and bagging are the two common ensemble methods that improve prediction accuracy. The main difference between these learning methods is the method of training. In bagging, data scientists improve the accuracy of weak learners by training several of them in parallel, each on a different bootstrap sample of the data. In contrast, boosting trains weak learners one after another, with each learner focusing on the errors of the previous ones.

Que 75. What is the difference between AdaBoost and Gradient Boosting?

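Ans: AdaBoost was one of the first practical boosting algorithms and is tied to a particular loss function (the exponential loss): after each round it increases the weights of the misclassified examples so that the next weak learner focuses on them. Gradient Boosting, on the other hand, is a generic algorithm that searches for approximate solutions to the additive modelling problem: each new learner is fit to the negative gradient of a chosen differentiable loss function with respect to the current predictions. This makes Gradient Boosting more flexible than AdaBoost.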
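
To make the gradient boosting side concrete, here is a minimal sketch for regression with squared-error loss, where the negative gradient is simply the residual `y - prediction`. Each round fits a one-split regression stump to the current residuals. The function names, the toy data, and the learning rate of 0.5 are assumptions chosen for illustration.

```python
def fit_regression_stump(xs, targets):
    """Best single split on a 1-D feature, minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the smallest x so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def gradient_boost(xs, ys, rounds, learning_rate):
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x < threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lm if x < t else rm) for t, lm, rm in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data (made up): y = x^2 on ten points.
xs = list(range(10))
ys = [x * x for x in xs]
baseline = (sum(ys) / len(ys), [], 0.5)  # constant model with no stumps
model = gradient_boost(xs, ys, rounds=50, learning_rate=0.5)
mse_baseline = mse(baseline, xs, ys)
mse_boosted = mse(model, xs, ys)
print(round(mse_baseline, 2), round(mse_boosted, 4))
```

Swapping in a different differentiable loss would only change how `residuals` is computed, which is the flexibility the answer refers to.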

Que 76. What is the purpose of random forests in ensemble learning?

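Ans: A random forest is a machine learning technique that's used to solve regression and classification problems. It utilizes ensemble learning: it builds many decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, and combines their predictions by voting (classification) or averaging (regression). This reduces the variance of a single decision tree.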

Que 77. How do random forests handle feature importance?

Ans: The final feature importance at the random forest level is its average over all the trees. The importance of the feature in each tree is summed and divided by the total number of trees:

$$RFfi_i = \frac{\sum_{j=1}^{T} fi_{ij}}{T}$$

where $RFfi_i$ is the importance of feature $i$ calculated from all trees in the random forest model, $fi_{ij}$ is the importance of feature $i$ in tree $j$, and $T$ is the total number of trees.
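
The averaging step can be sketched in a few lines. The per-tree importance values below are made up, and normalising each tree's importances to sum to 1 before averaging is an assumption of this sketch rather than something stated above.

```python
def normalize(importances):
    """Scale one tree's importances so they sum to 1 (assumed pre-processing step)."""
    total = sum(importances)
    return [value / total for value in importances]

def forest_feature_importance(per_tree_importances):
    """Average each feature's (normalised) importance over all trees."""
    normalized = [normalize(tree) for tree in per_tree_importances]
    n_trees = len(normalized)
    n_features = len(normalized[0])
    return [sum(tree[i] for tree in normalized) / n_trees
            for i in range(n_features)]

# Hypothetical raw importances: one row per tree, one column per feature.
per_tree = [
    [2.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [3.0, 0.0, 1.0],
]
rf_importance = forest_feature_importance(per_tree)
print([round(v, 3) for v in rf_importance])
```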

Que 78. What is stacking in ensemble learning and how does it work?

Ans: 
Stacking is one of the most popular ensemble machine learning techniques. It trains multiple models to solve the same problem and then builds a new model on their combined outputs to improve performance.

Stacking involves training multiple base-models to predict the target variable in a machine learning problem while at the same time, a meta-model learns to use the predictions of each base model to predict the value of the target variable.
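
A small sketch of this idea, under simplifying assumptions: two base models (one-feature linear regressions) are fit on one part of the data, and a meta-model (a least-squares weighting of the two base predictions) is fit on a separate held-out part. This hold-out variant is sometimes called blending; cross-validated out-of-fold predictions are a common alternative. The synthetic data and all function names are made up for the example.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def line_predict(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]

def fit_meta(p1, p2, ys):
    """Least-squares weights for y ~ w1 * p1 + w2 * p2 (2x2 normal equations)."""
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    d = sum(v * v for v in p2)
    e = sum(u * y for u, y in zip(p1, ys))
    f = sum(v * y for v, y in zip(p2, ys))
    det = a * d - b * b
    return (e * d - b * f) / det, (a * f - b * e) / det

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic data (made up): the target depends on both features.
rng = random.Random(0)
x1 = [rng.uniform(-1, 1) for _ in range(300)]
x2 = [rng.uniform(-1, 1) for _ in range(300)]
y = [2 * a + 3 * b + rng.gauss(0, 0.1) for a, b in zip(x1, x2)]

base, meta, test = slice(0, 100), slice(100, 200), slice(200, 300)

# Level 0: each base model sees only one feature.
model_a = fit_line(x1[base], y[base])
model_b = fit_line(x2[base], y[base])

# Level 1: the meta-model learns how to combine the base predictions,
# using data the base models were not trained on.
w1, w2 = fit_meta(line_predict(model_a, x1[meta]),
                  line_predict(model_b, x2[meta]),
                  y[meta])

pred_a = line_predict(model_a, x1[test])
pred_b = line_predict(model_b, x2[test])
stacked = [w1 * u + w2 * v for u, v in zip(pred_a, pred_b)]

mse_a, mse_b, mse_stacked = mse(pred_a, y[test]), mse(pred_b, y[test]), mse(stacked, y[test])
print(round(mse_a, 3), round(mse_b, 3), round(mse_stacked, 3))
```

Fitting the meta-model on held-out predictions matters: if it were trained on the same rows as the base models, it would tend to reward whichever base model overfits most.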

Que 79. What are the advantages and disadvantages of ensemble techniques?

Ans:

Pros:

Ensemble methods offer several advantages over single models, such as improved accuracy and performance, especially for complex and noisy problems. They can also reduce the risk of overfitting and underfitting by balancing the trade-off between bias and variance, and by using different subsets and features of the data. Furthermore, ensemble methods can handle different types of data and tasks, such as classification, regression, clustering, and anomaly detection, by using different types of base models and aggregation methods. Additionally, they can provide more confidence and reliability by measuring the diversity and agreement of the base models, and by providing confidence intervals and error estimates for the predictions.

Cons:

Ensemble methods have some drawbacks and challenges, such as being computationally expensive and time-consuming due to the need for training and storing multiple models, and combining their outputs. This can increase the complexity and memory requirements of the system. Additionally, they can be difficult to interpret and explain, as they involve multiple layers of abstraction and aggregation, which can obscure the logic and reasoning behind the predictions. Furthermore, they can be prone to overfitting and underfitting if the base models are too weak or too strong, or if the aggregation method is too simple or too complex. This can lead to underestimating or overestimating the uncertainty and variability of the data. Lastly, they can be sensitive to the quality and diversity of the data and the base models, as they depend on the assumptions and limitations of the individual models, and on the representativeness and independence of the data samples and features.

Que 80. How do you choose the optimal number of models in an ensemble?

Ans:

Step 1: Find the KS (Kolmogorov-Smirnov) statistic of each individual model.

Step 2: Index all the models for easy access. 

Step 3: Choose the first two models as the initial selection and set a correlation limit. 

Step 4: Iteratively choose all the models which are not highly correlated with any of the already chosen models.
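
The steps above can be sketched as a greedy selection loop. The scores and prediction vectors below are made up, and ordering the models by score before taking the first two is an assumption of this sketch (the steps only say to index the models). The correlation limit of 0.8 is likewise an arbitrary illustrative value.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def select_models(scores, predictions, corr_limit):
    """Greedily keep models whose predictions are not highly correlated
    with those of any model already chosen."""
    # Steps 1-2: index the models, best score first (assumed ordering).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Step 3: start from the first two models.
    chosen = order[:2]
    # Step 4: add a model only if it stays under the correlation limit
    # against every model chosen so far.
    for idx in order[2:]:
        if all(abs(pearson(predictions[idx], predictions[c])) < corr_limit
               for c in chosen):
            chosen.append(idx)
    return chosen

# Hypothetical per-model scores and predictions on five validation rows.
scores = [0.45, 0.42, 0.40, 0.38]
predictions = [
    [1, 2, 3, 4, 5],
    [2, -1, -2, -1, 2],
    [1, 2, 3, 4, 6],    # nearly identical to model 0
    [-1, 2, 0, -2, 1],
]
chosen = select_models(scores, predictions, corr_limit=0.8)
print(chosen)
```

Model 2 is dropped because its predictions track model 0 almost exactly, so it would add little diversity to the ensemble.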