#### General Linear Model


1. The purpose of the General Linear Model (GLM) is to model and analyze the relationship between dependent variables and independent variables, allowing for the examination of various factors and their effects on the outcome.
<br>

2. The key assumptions of the GLM include linearity, independence of errors, constant variance of errors (homoscedasticity), and normally distributed errors.
<br>

3. Coefficients in a GLM represent the change in the mean response associated with a one-unit change in the corresponding predictor, assuming all other predictors are held constant.
<br>

4. A univariate GLM involves a single dependent variable, whereas a multivariate GLM includes multiple dependent variables simultaneously, allowing for the examination of their relationships with the independent variables.
<br>

5. Interaction effects in a GLM occur when the relationship between one predictor variable and the outcome variable depends on the level of another predictor variable. It means that the effect of one predictor is not constant across different levels of the other predictor.
<br>

6. Categorical predictors in a GLM are typically handled by converting them into a set of binary (dummy) variables, representing different levels of the categorical variable. These binary variables are then included as predictors in the model.
<br>

7. The design matrix in a GLM holds the values of the predictors, with one row per observation and one column per model term: typically an intercept column, the numeric predictors, the dummy variables for categorical predictors, and any interaction terms. The model parameters are estimated from this matrix together with the observed responses.
<br>

8. The significance of predictors in a GLM can be tested using hypothesis tests, such as the t-test or F-test, by comparing the estimated coefficients to their standard errors. This helps determine if the predictors have a statistically significant effect on the outcome variable.
<br>

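9. Type I, Type II, and Type III sums of squares are different methods for partitioning the variation explained by a GLM among its predictors. Type I (sequential) sums of squares test each term after the terms entered before it, so the results depend on the order of entry. Type II tests each main effect after the other main effects, ignoring higher-order interactions that contain it, and Type III tests each term after all other terms, including interactions. Types II and III do not depend on the order of entry. The three types agree when the design is balanced but can give different results otherwise.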
<br>

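10. Deviance in a GLM measures how far the fitted model is from a saturated model that reproduces the data perfectly; it is defined as twice the difference between the two log-likelihoods. For a linear model with normal errors it reduces to the residual sum of squares. It is used to assess the goodness-of-fit of the model and to compare nested models. Lower deviance values indicate a closer fit to the data, although adding parameters never increases deviance, so comparisons should also account for model complexity.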
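
The dummy coding in item 6 and the design matrix in item 7 can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `design_matrix`, the column order, and the toy `(dose, group)` data are all assumptions made for the example.

```python
def design_matrix(rows, levels, reference):
    """Build one design-matrix row per observation.

    Columns: intercept, numeric predictor, one dummy per non-reference
    level, then the numeric-by-dummy interaction terms.
    """
    non_reference = [level for level in levels if level != reference]
    matrix = []
    for x, group in rows:
        # Dummy variables: 1.0 if the observation belongs to that level.
        dummies = [1.0 if group == level else 0.0 for level in non_reference]
        # Interaction terms: numeric predictor times each dummy.
        interactions = [x * d for d in dummies]
        matrix.append([1.0, float(x)] + dummies + interactions)
    return matrix


# Hypothetical observations: (dose, treatment group).
rows = [(1.0, "a"), (2.0, "b"), (3.0, "c")]
X = design_matrix(rows, levels=["a", "b", "c"], reference="a")
for row in X:
    print(row)
```

The reference level `"a"` gets no dummy column of its own; its effect is absorbed by the intercept, which is why a categorical predictor with three levels contributes two columns.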

#### Regression


11. Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. Its purpose is to understand and predict the value of the dependent variable based on the values of the independent variables.


12. Simple linear regression involves a single independent variable predicting a dependent variable, while multiple linear regression involves two or more independent variables predicting a dependent variable. Multiple linear regression allows for the examination of the combined effects of multiple predictors on the outcome.


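13. The R-squared value in regression represents the proportion of the variance in the dependent variable that is explained by the independent variables. For a least-squares model with an intercept it ranges from 0 to 1 on the training data, with higher values indicating a closer fit. It describes how well the model fits the data it was trained on, not how accurately it will predict new data, and it never decreases when predictors are added, which is why adjusted R-squared is often reported alongside it.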


14. Correlation measures the strength and direction of the linear relationship between two variables, while regression analyzes the relationship between a dependent variable and one or more independent variables, including the estimation of coefficients and prediction of the dependent variable.


15. Coefficients in regression represent the estimated effect of each independent variable on the dependent variable, indicating the magnitude and direction of the relationship. The intercept represents the predicted value of the dependent variable when all independent variables are set to zero.


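16. Outliers in regression analysis should first be investigated. Points that turn out to be data-entry or measurement errors can be corrected or removed, but genuine observations should not be dropped merely because they are influential. For genuine outliers, use methods that are less sensitive to them, such as robust regression with a Huber loss, least absolute deviations, or weighted least squares that down-weights suspect points.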


17. Ordinary least squares (OLS) regression aims to minimize the sum of squared residuals, while ridge regression adds a penalty term to the OLS objective function to reduce the impact of multicollinearity by shrinking the coefficient estimates towards zero. Ridge regression helps stabilize the model when there is multicollinearity among the predictors.


18. Heteroscedasticity in regression refers to the unequal variance of the residuals across the range of the predictor variables. It violates the assumption of constant variance and can affect the reliability of statistical inference. To address heteroscedasticity, robust standard errors or transformations of the variables can be used.


19. Multicollinearity in regression occurs when independent variables are highly correlated with each other. It can be handled by removing highly correlated variables, performing dimensionality reduction techniques, or using regularization methods like ridge regression or lasso regression.


20. Polynomial regression is a form of regression analysis where the relationship between the independent and dependent variables is modeled as an nth-degree polynomial. It is used when the relationship between the variables is nonlinear, allowing for a more flexible representation of the data.
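
Items 12, 13, and 15 can be made concrete with a small least-squares fit. The sketch below uses the closed-form solution for simple linear regression and computes R-squared from the residuals; the function names and the toy data are assumptions chosen for illustration.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Proportion of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data that is roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_simple_ols(x, y)
r2 = r_squared(x, y, intercept, slope)
print(intercept, slope, r2)
```

The slope is the estimated change in `y` per one-unit change in `x`, and the intercept is the predicted `y` at `x = 0`, matching the interpretation in item 15.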

#### Loss Function

21. A loss function is a mathematical function that measures the discrepancy between predicted values and actual values in machine learning. Its purpose is to quantify the model's performance and guide the learning process by minimizing the error or maximizing the accuracy.


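22. A convex loss function has the property that any local minimum is also a global minimum, so gradient-based methods cannot get trapped in a worse local solution; if the function is strictly convex, that minimum is unique. Non-convex loss functions can have multiple local minima and saddle points, making it challenging to find the global minimum and potentially leading to suboptimal solutions.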


23. Mean squared error (MSE) is a commonly used loss function that measures the average squared difference between predicted and actual values. It is calculated by summing the squared residuals and dividing by the number of samples.


24. Mean absolute error (MAE) is a loss function that measures the average absolute difference between predicted and actual values. It is calculated by summing the absolute residuals and dividing by the number of samples.


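25. Log loss, also known as cross-entropy loss, is a loss function often used in classification tasks. It quantifies the difference between predicted probabilities and true class labels by penalizing confident wrong predictions heavily. For binary labels `y` and predicted probabilities `p`, it is the negative average over the samples of `y log(p) + (1 - y) log(1 - p)`; in the multiclass case it is the negative average log-probability assigned to the true class.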


26. The choice of loss function depends on the nature of the problem and the specific goals. For example, MSE is commonly used in regression tasks, while log loss is suitable for binary classification. Understanding the properties of different loss functions and considering the problem requirements helps in selecting an appropriate one.


27. Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. It encourages the model to find simpler and more generalizable solutions by balancing the fit to the training data and complexity. Regularization methods include L1 (Lasso) and L2 (Ridge) regularization.


28. Huber loss is a loss function that combines properties of both squared loss (MSE) and absolute loss (MAE). It is less sensitive to outliers compared to squared loss and provides a more robust estimation. Huber loss uses a delta parameter to determine the threshold for switching between squared and absolute differences.


29. Quantile loss is a loss function used in quantile regression, where the goal is to estimate different quantiles of the conditional distribution. It measures the difference between predicted and actual quantiles, allowing for more flexible modeling of the distribution's tails or specific percentiles of interest.


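30. Squared loss (MSE) penalizes larger errors more than absolute loss (MAE) because it squares the residuals, so the penalty grows quadratically with the size of the error. This makes squared loss more sensitive to outliers and can amplify their impact on the overall loss. In contrast, absolute loss grows linearly with the error, so a single large residual has less influence; the price is that its gradient has constant magnitude and is undefined at zero, which can make optimization less smooth near the minimum.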
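
The loss functions in items 23 to 30 are short enough to write out directly. The following is a sketch under common textbook definitions; the function names, the default `delta`, and the clipping constant `eps` are assumptions for the example rather than any particular library's API.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Quadratic for residuals up to delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)


def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile q in (0, 1)."""
    return sum(
        max(q * (t - p), (q - 1.0) * (t - p)) for t, p in zip(y_true, y_pred)
    ) / len(y_true)


# One large residual (an outlier) among otherwise perfect predictions.
y_true = [1.0, 2.0, 3.0, 10.0]
y_pred = [1.0, 2.0, 3.0, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

With a single residual of 6, the squared loss is dominated by the outlier while the absolute and Huber losses grow only linearly with it, which is the contrast described in items 28 and 30.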

#### Optimizers

31. An optimizer is an algorithm or method used in machine learning to adjust the parameters of a model to minimize the loss function. Its purpose is to find the optimal values for the model's parameters that lead to the best performance on the training data.


32. Gradient Descent (GD) is an iterative optimization algorithm used to find the minimum of a function, typically the loss function in machine learning. It works by calculating the gradients of the function with respect to the parameters and updating the parameters in the opposite direction of the gradient.


33. Different variations of Gradient Descent include Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent. These variations differ in the amount of data used to compute the gradients and update the parameters.


34. The learning rate in GD determines the step size at each iteration. It controls how much the parameters are adjusted based on the gradients. Choosing an appropriate learning rate involves a trade-off between faster convergence (with a larger learning rate) and stability (with a smaller learning rate), and it often requires experimentation or tuning.


35. GD can struggle with local optima in optimization problems, as it can get stuck in suboptimal solutions. However, by using techniques like random initialization of parameters or employing variations of GD, such as adding momentum or using adaptive learning rates, it is possible to mitigate the issue and find better solutions.


36. Stochastic Gradient Descent (SGD) is a variation of GD where the gradient is computed and the parameters are updated for each individual training sample instead of the entire dataset. This makes each update much cheaper and often speeds up progress on large datasets, but the updates are noisier because a single-sample gradient is a high-variance estimate of the full gradient.


37. The batch size in GD refers to the number of training samples used in each iteration to compute the gradients and update the parameters. A larger batch size (e.g., full batch GD) uses the entire dataset, while a smaller batch size (e.g., mini-batch GD) uses a subset of the data. The choice of batch size impacts the trade-off between accuracy (using more data) and computational efficiency (using less data).


38. Momentum is a concept in optimization algorithms that helps accelerate convergence and navigate narrow and steep regions in the loss function. It introduces a "momentum" term that adds a fraction of the previous parameter update to the current update, allowing for more consistent and stable movement in the parameter space.


39. Batch GD uses the entire training dataset to compute gradients and update parameters in each iteration. Mini-batch GD uses a subset (mini-batch) of the training data, striking a balance between the efficiency of SGD and the stability of batch GD. SGD computes gradients and updates parameters for each individual training sample.


40. The learning rate affects the convergence of GD by determining the step size taken in the parameter space. If the learning rate is too large, GD may overshoot the minimum and fail to converge. If the learning rate is too small, GD may converge slowly or get stuck in local optima. An appropriate learning rate should be chosen to balance convergence speed and stability.
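
The batch-size variants (items 33, 36, 37, 39), momentum (item 38), and the effect of the learning rate (items 34 and 40) can all be seen in one small routine. This is a minimal sketch that fits a one-parameter model `y = w * x` by minimizing mean squared error; the function name `minibatch_gd`, its parameters, and the noise-free toy data are assumptions for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate, epochs, momentum=0.0, seed=0):
    """Fit y = w * x with gradient descent on the mean squared error.

    batch_size == len(xs) gives batch GD, batch_size == 1 gives SGD,
    and anything in between gives mini-batch GD.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch MSE with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            # Momentum keeps a fraction of the previous update.
            velocity = momentum * velocity - learning_rate * grad
            w += velocity
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true weight w = 2

w_batch = minibatch_gd(xs, ys, batch_size=4, learning_rate=0.01, epochs=200)
w_mini = minibatch_gd(xs, ys, batch_size=2, learning_rate=0.01, epochs=200)
w_sgd = minibatch_gd(xs, ys, batch_size=1, learning_rate=0.01, epochs=200)
w_momentum = minibatch_gd(xs, ys, batch_size=4, learning_rate=0.01, epochs=200, momentum=0.9)
w_diverged = minibatch_gd(xs, ys, batch_size=4, learning_rate=0.2, epochs=20)
print(w_batch, w_mini, w_sgd, w_momentum, w_diverged)
```

On this noise-free problem all three batch sizes approach the true weight with a small learning rate, while the last call overshoots on every step and moves away from the minimum, illustrating the "too large" case in item 40.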

#### Regularization

41. Regularization is a technique used in machine learning to prevent overfitting and improve the generalization performance of models. It adds a penalty term to the loss function that encourages simpler models with smaller parameter values, thereby reducing the impact of noisy or irrelevant features.


42. L1 regularization (Lasso regularization) adds the absolute values of the model's coefficients to the loss function as a penalty term. L2 regularization (Ridge regularization) adds the squared values of the coefficients. L1 regularization promotes sparsity by driving some coefficients to exactly zero, while L2 regularization encourages smaller but non-zero coefficients.


43. Ridge regression is a linear regression technique that uses L2 regularization. It adds the sum of squared coefficients multiplied by a regularization parameter (lambda) to the loss function. Ridge regression shrinks the coefficient estimates towards zero, reducing the impact of multicollinearity and improving the stability of the model.


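44. Elastic net regularization combines L1 and L2 penalties in the loss function, adding both the sum of absolute values (L1) and the sum of squared values (L2) of the coefficients. It is usually parameterized either by two separate penalty strengths or by one overall strength plus a mixing ratio that sets the balance between L1 and L2; the names of these parameters (for example alpha and lambda) vary between libraries. Elastic net regularization offers a balance between L1 and L2 regularization, allowing for variable selection and handling multicollinearity.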


45. Regularization helps prevent overfitting by imposing constraints on the model's complexity. It discourages large parameter values and reduces the model's sensitivity to noise in the training data, leading to better generalization performance on unseen data.


46. Early stopping is a form of regularization that involves stopping the training process before the model fully converges. It monitors the model's performance on a validation set during training and stops training when the performance starts to deteriorate. Early stopping prevents overfitting by finding an optimal trade-off between model complexity and performance on the validation data.


47. Dropout regularization is a technique used in neural networks. It randomly "drops out" a fraction of the nodes (neurons) in each training iteration, preventing them from contributing to the forward and backward passes. This encourages the network to learn more robust and independent representations, reducing overreliance on specific neurons and preventing overfitting.


48. The regularization parameter is typically chosen using techniques like cross-validation or grid search. These methods involve evaluating the model's performance on a validation set for different values of the regularization parameter and selecting the one that gives the best trade-off between performance and complexity.


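49. Feature selection aims to identify and select a subset of relevant features, discarding irrelevant or redundant ones. Regularization instead penalizes the size of the coefficients during fitting. L2 regularization shrinks all coefficients but rarely sets any to exactly zero, so every feature stays in the model, whereas L1 regularization can drive some coefficients to exactly zero and therefore performs a form of feature selection as a by-product. Feature selection explicitly chooses a subset of features, while regularization automatically downplays the impact of less important features.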


50. Regularized models strike a trade-off between bias and variance. By adding regularization penalties, the models tend to have smaller coefficients and reduced complexity, leading to higher bias but lower variance. Regularization helps in finding a balance between underfitting (high bias) and overfitting (high variance) by controlling the model's complexity and reducing the risk of over-reliance on noisy or irrelevant features.
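
The difference between L1 and L2 shrinkage (items 42 to 44) is easiest to see in the one-coefficient case, where the penalized problems have closed-form solutions. The sketch below assumes a single coefficient whose unregularized estimate is `z` and minimizes `0.5 * (w - z)**2` plus the penalty; the function names and the penalty scaling are assumptions for illustration, not any library's convention.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) (soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def elastic_net_shrink(z, l1, l2):
    """Minimizer of 0.5 * (w - z)**2 + l1 * abs(w) + 0.5 * l2 * w**2."""
    return lasso_shrink(z, l1) / (1.0 + l2)


# Hypothetical unregularized coefficient estimates.
unregularized = [3.0, 0.4, -0.2, -2.0]
ridge = [ridge_shrink(z, 0.5) for z in unregularized]
lasso = [lasso_shrink(z, 0.5) for z in unregularized]
print(ridge)
print(lasso)
```

Ridge scales every coefficient toward zero by the same factor and leaves all of them non-zero, while the lasso subtracts a fixed amount and sets the two small coefficients to exactly zero.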

#### SVM

51. A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. For classification, it finds the hyperplane that separates the classes with the largest possible margin, that is, the greatest distance between the decision boundary and the closest training points (the support vectors). For regression, it fits a function that keeps most points within a tolerance band around the prediction.


52. The kernel trick is a technique used in SVM to handle non-linearly separable data. It maps the original input data into a higher-dimensional feature space, where the data becomes linearly separable. By applying the kernel function, the need to explicitly compute the transformed feature space is avoided, making the computations more efficient.


53. Support vectors in SVM are the data points closest to the decision boundary or within the margin. They play a crucial role in defining the decision boundary and determining the optimal hyperplane. Support vectors have non-zero coefficients and contribute to the SVM's decision-making process.


54. The margin in SVM is the region between the decision boundary and the closest training points of each class; its width is the distance between the two parallel hyperplanes that pass through the support vectors. It represents the separation between classes and influences the generalization ability of the model. A larger margin generally indicates a more robust model, as it provides a wider buffer zone against misclassifications.


55. Unbalanced datasets in SVM can be handled by using techniques such as class weighting, oversampling the minority class, undersampling the majority class, or using specialized SVM variants like weighted SVM or cost-sensitive SVM. These techniques help to address the issue of imbalanced class distribution and improve the model's performance.


56. Linear SVM constructs a linear decision boundary to separate classes in the original feature space, while non-linear SVM uses the kernel trick to transform the data into a higher-dimensional feature space where it becomes linearly separable. Non-linear SVM captures complex relationships by implicitly mapping the data to a higher-dimensional space without explicitly computing the transformed features.


57. The C-parameter in SVM controls the trade-off between maximizing the margin and minimizing the classification error on the training data. A smaller C-value tolerates more margin violations, giving a wider margin and a simpler boundary that may underfit. A larger C-value penalizes violations more heavily, giving a narrower margin that fits the training data more closely and may overfit. The choice of C therefore influences the model's bias-variance trade-off.


58. Slack variables in SVM are introduced in soft margin SVM to allow for misclassifications and violations of the margin. They represent the degree of misclassification or proximity to the margin and are used to relax the optimization problem. Slack variables allow the algorithm to find a more flexible decision boundary that handles overlapping or noisy data.


59. Hard margin SVM aims to find a decision boundary with no misclassifications, meaning it strictly enforces the margin and does not allow any data points to be inside the margin or misclassified. Soft margin SVM, on the other hand, allows for some misclassifications and violations of the margin by introducing slack variables. Soft margin SVM provides a more flexible and robust solution, suitable for handling noisy or overlapping data.


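60. The coefficients in a linear SVM are the weights assigned to each feature in the decision function, and they define the orientation of the separating hyperplane. When the features are on comparable scales, larger absolute values indicate stronger influence on the decision boundary, while coefficients close to zero suggest less importance. The sign of a coefficient determines the direction of influence (+/-) on the predicted class. With a non-linear kernel there are no per-feature weights in the original feature space; the model is instead expressed through the support vectors and their dual coefficients, so this interpretation does not apply directly.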
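
Two ideas from this section lend themselves to a quick numeric check: the kernel trick (items 52 and 56) and slack variables (item 58). The sketch below assumes 2-D inputs and a homogeneous degree-2 polynomial kernel, and shows that the kernel value equals a dot product in an explicit 3-D feature space; the function names and the toy points are assumptions for illustration.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs: (x . z) ** 2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def explicit_features(x):
    """Feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def slack(w, b, x, y):
    """Slack max(0, 1 - y * (w . x + b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


x, z = (1.0, 2.0), (3.0, 4.0)
k_implicit = poly_kernel(x, z)
k_explicit = dot(explicit_features(x), explicit_features(z))
print(k_implicit, k_explicit)

# Hypothetical linear classifier w . x + b with w = (1, 0), b = 0.
w, b = (1.0, 0.0), 0.0
print(slack(w, b, (2.0, 0.0), +1))   # outside the margin
print(slack(w, b, (0.5, 0.0), +1))   # inside the margin, correct side
print(slack(w, b, (-1.0, 0.0), +1))  # misclassified
```

The kernel computes the same quantity as the explicit mapping without ever constructing the higher-dimensional vectors. For the slack values, zero means the point respects the margin, a value between 0 and 1 means it lies inside the margin but on the correct side, and a value above 1 means it is misclassified.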

#### Decision Trees

61. A decision tree is a supervised machine learning algorithm that uses a tree-like structure to make decisions or predictions. It recursively partitions the feature space based on the values of the input features, creating a hierarchical set of rules that lead to the final decision or prediction.


62. Splits in a decision tree are made based on a chosen feature and a threshold value. The algorithm evaluates different splits by considering different feature thresholds and selects the split that maximizes the information gain or minimizes the impurity measure.


63. Impurity measures, such as the Gini index or entropy, quantify the degree of impurity or disorder within a node in a decision tree. These measures help determine the quality of a split and are used to evaluate the homogeneity of the target variable within each node.


64. Information gain is a concept used in decision trees to measure the reduction in impurity or uncertainty achieved by a particular split. It quantifies how much information is gained about the target variable when a specific feature is used for splitting. The feature with the highest information gain is chosen as the splitting criterion.


65. Missing values in decision trees can be handled by different strategies. One approach is to assign the missing values to the most frequent category or the mean/median value of the feature. Another option is to create a separate category for missing values or to use algorithms specifically designed to handle missing data, such as surrogate splits or missing value imputation.


66. Pruning is a technique used in decision trees to prevent overfitting. It involves removing or collapsing nodes and branches that do not contribute significantly to improving the tree's predictive accuracy. Pruning helps simplify the tree, improve generalization, and reduce the risk of overfitting to noisy or irrelevant features.


67. A classification tree is used for categorical or discrete target variables and aims to classify instances into specific classes or categories. A regression tree, on the other hand, is used for continuous target variables and predicts a numeric value as the output based on the input features.


68. Decision boundaries in a decision tree are represented by the splits and the resulting branches. Each split creates a partition in the feature space, dividing the data into separate regions. The decision boundaries are determined by the feature values and thresholds used in the splits, which dictate the path of the instance through the tree.


69. Feature importance in decision trees quantifies the relative importance or contribution of each feature in the tree's decision-making process. It is often derived from metrics such as the total reduction in impurity or information gain associated with the feature. Feature importance helps identify the most influential features and provides insights into the underlying relationships.


70. Ensemble techniques combine multiple individual models, often decision trees, to improve predictive performance. Bagging (Bootstrap Aggregating) builds each tree on a bootstrap sample of the data, and Random Forest additionally considers only a random subset of features at each split. Boosting methods like AdaBoost and Gradient Boosting train decision trees sequentially, with each new tree focusing on the errors made by the trees before it. Ensemble techniques leverage the diversity and collective decision-making of multiple models to enhance accuracy and robustness.
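
The split-selection procedure described in items 62 to 64 can be sketched for a single numeric feature. This is an illustrative outline rather than how any specific library implements it: the function names are assumptions, and candidate thresholds are taken as midpoints between consecutive distinct values.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def best_threshold(values, labels, impurity=entropy):
    """Return (threshold, gain) for the best split of one numeric feature."""
    best = (None, 0.0)
    unique = sorted(set(values))
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2.0
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        gain = information_gain(labels, left, right, impurity)
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical feature values and class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(values, labels)
print(gini(labels), entropy(labels), threshold, gain)
```

A full tree builder would apply `best_threshold` to every feature at a node, split on the winner, and recurse on the two children until a stopping rule or pruning step (item 66) ends the growth.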

#### Ensemble Techniques

71. Ensemble techniques in machine learning combine multiple individual models to make more accurate predictions. They leverage the diversity and collective decision-making of the models to enhance performance and robustness, often outperforming single models.


72. Bagging (Bootstrap Aggregating) is an ensemble technique where multiple models are trained on different subsets of the training data using bootstrapping (sampling with replacement). The individual models are then combined by averaging their predictions (for regression) or voting (for classification) to make the final prediction.


73. Bootstrapping in bagging involves randomly sampling the training data with replacement to create multiple subsets. Each subset has the same size as the original data but may contain duplicate instances. This sampling process allows for the generation of diverse training sets to train different models in the bagging ensemble.


74. Boosting is an ensemble technique that combines weak models (learners) to create a strong model. It works by iteratively training models on different weighted versions of the training data, with each subsequent model focusing on the instances that were misclassified by the previous models. The final prediction is made by aggregating the predictions of all models.


75. AdaBoost (Adaptive Boosting) and Gradient Boosting are both boosting algorithms. AdaBoost assigns weights to instances and adjusts the weights based on their misclassification, allowing subsequent models to focus more on difficult instances. Gradient Boosting, on the other hand, uses gradient descent to optimize a loss function, with each subsequent model fitting the negative gradient of the loss function.


76. Random forests are an ensemble technique that combines multiple decision trees. Each tree is trained on a different bootstrap sample of the data, and at each split only a random subset of the features is considered, which decorrelates the trees. The final prediction is obtained by averaging the predictions (for regression) or taking the majority vote (for classification) of all the trees.


77. Random forests commonly measure feature importance in two ways. Impurity-based importance sums the reduction in impurity (for example Gini) contributed by every split on a feature, averaged over the trees. Permutation importance measures how much predictive accuracy decreases, typically on out-of-bag or held-out data, when the values of a feature are randomly permuted. In both cases, features with a larger score are considered more important.


78. Stacking is an ensemble technique that combines the predictions of multiple models by training a meta-model on their outputs. The predictions of the individual models serve as input features for the meta-model. Stacking enables the higher-level model to learn how to best combine the predictions of the individual models, potentially improving the overall performance.


79. The advantages of ensemble techniques include improved predictive accuracy, robustness to noise and outliers, and the ability to capture complex relationships. However, disadvantages include increased computational complexity, potential overfitting if not properly tuned, and reduced interpretability compared to individual models.


80. The optimal number of models in an ensemble depends on the specific problem and data. For bagging and random forests, adding more models typically improves performance up to a plateau with diminishing returns, at the cost of extra computation, rather than causing overfitting. For boosting, too many iterations can overfit, so the number of models is usually tuned, often with early stopping. Model selection techniques, such as cross-validation or monitoring the performance on a validation set, can help determine the number of models that balances performance and computational efficiency.
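
Bagging as described in items 72 and 73 comes down to two steps: bootstrap sampling and aggregation by vote. The sketch below uses a deliberately simple weak learner, a one-feature threshold rule, as a stand-in for a decision tree. The function names, the toy data, and the assumption that class 1 has the larger feature values are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def fit_stump(sample):
    """Weak learner: threshold midway between the two class means.

    Assumes class 1 has larger x than class 0. If a bootstrap sample
    contains only one class, the stump always predicts that class.
    """
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        only = 1 if xs1 else 0
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0
    return lambda x: 1 if x > threshold else 0


def bagged_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    return majority_vote([model(x) for model in models])


# Hypothetical 1-D training data: (feature, label).
data = [(float(x), 0) for x in range(1, 6)] + [(float(x), 1) for x in range(11, 16)]
rng = random.Random(0)
sample = bootstrap_sample(data, rng)
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(models, 0.0), bagged_predict(models, 20.0))
```

Each model sees a slightly different training set because the bootstrap sample repeats some points and omits others, and the vote averages out the resulting differences between models. A random forest adds random feature subsets at each split on top of this scheme.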
