1. The purpose of the General Linear Model (GLM) is to analyze the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the data.

2. The key assumptions of the General Linear Model include linearity, independence, homoscedasticity (constant variance), normality of residuals, and absence of multicollinearity.

3. Coefficients in a GLM represent the estimated change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant.

4. A univariate GLM involves analyzing the relationship between a single dependent variable and one or more independent variables, while a multivariate GLM deals with multiple dependent variables simultaneously.

5. Interaction effects in a GLM occur when the relationship between the dependent variable and an independent variable depends on the value of another independent variable. It implies that the effect of one predictor variable on the outcome depends on the levels of another predictor variable.
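
A minimal sketch of an interaction term, assuming a model with two predictors. The function names and the coefficient values are made up for illustration; the point is only that the effect of `x1` changes with `x2`.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Linear model with an interaction term b3 * x1 * x2 (coefficients are made up)."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def effect_of_x1(x2, b1=2.0, b3=1.5):
    """Change in the prediction per one-unit increase in x1, at a given x2."""
    return b1 + b3 * x2


for x2 in (0.0, 2.0):
    change = predict(1.0, x2) - predict(0.0, x2)
    print(f"x2={x2}: a one-unit increase in x1 changes y by {change}")
```

Without the `b3` term, the change would be the same for every value of `x2`.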

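6. Categorical predictors in a GLM are typically represented using dummy (indicator) variables. A categorical variable with k levels is encoded as k - 1 binary (0 or 1) variables when the model includes an intercept; the omitted level serves as the reference category, and each dummy coefficient is the estimated difference from that reference.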

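7. The design matrix in a GLM is the matrix of predictor values used to fit the model. Each row represents an observation or case and each column represents a predictor, usually with a leading column of ones for the intercept and with categorical predictors expanded into dummy columns.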
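
As a rough illustration, the sketch below builds a small design matrix with NumPy and fits it by least squares. The helper `build_design_matrix` and the toy data are hypothetical; the sketch drops one level of the categorical variable as the reference, a common choice when an intercept column is present.

```python
import numpy as np


def build_design_matrix(x_numeric, categories):
    """Return (X, names): intercept, one numeric predictor, and k-1 dummy columns."""
    levels = sorted(set(categories))
    other_levels = levels[1:]  # the first level (sorted order) is the reference
    columns = [np.ones(len(x_numeric)), np.asarray(x_numeric, dtype=float)]
    names = ["intercept", "x"]
    for level in other_levels:
        columns.append(np.array([1.0 if c == level else 0.0 for c in categories]))
        names.append(f"group[{level}]")
    return np.column_stack(columns), names


# Hypothetical toy data: y = 1 + 2*x, plus 3 when group == "b" (no noise)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
group = ["a", "b", "a", "b", "a", "b"]
y = np.array([1 + 2 * xi + (3 if g == "b" else 0) for xi, g in zip(x, group)])

X, names = build_design_matrix(x, group)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficient estimates
print(dict(zip(names, beta.round(3))))
```

Each fitted coefficient is the change in `y` per unit change in its column with the other columns held fixed; the `group[b]` coefficient is the shift relative to the reference level `a`.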

8. The significance of predictors in a GLM is typically tested using hypothesis tests, such as the t-test or F-test, to determine if the coefficients are significantly different from zero. This helps assess the contribution of each predictor to the model.

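9. Type I, Type II, and Type III sums of squares are different ways of partitioning the explained variability in a GLM among multiple predictors. Type I (sequential) sums of squares test each term after the terms entered before it, so the results depend on the order of entry. Type II tests each main effect after adjusting for the other main effects but not for interactions that contain it. Type III tests each term after adjusting for all other terms, including interactions. The three agree when the design is balanced and differ otherwise.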

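10. Deviance is a measure of the discrepancy between the observed data and the fitted model, defined as twice the difference between the log-likelihood of a saturated model and that of the fitted model. It is used mainly with generalized linear models, and for normally distributed errors it reduces to the residual sum of squares. Lower deviance indicates a closer fit, and the difference in deviance between nested models is used for hypothesis testing and model comparison.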

11. Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. Its purpose is to predict or estimate the value of the dependent variable based on the values of the independent variables.

12. Simple linear regression involves a single independent variable, while multiple linear regression involves two or more independent variables.

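13. The R-squared value in regression represents the proportion of the variance in the dependent variable that is explained by the independent variables. For a least-squares fit with an intercept it ranges from 0 to 1 on the training data, where a higher value indicates a closer fit; it can be negative when computed on new data or for a model without an intercept. Because R-squared never decreases when predictors are added, adjusted R-squared is often preferred for comparing models of different sizes.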
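
A minimal sketch of the usual R-squared formula, 1 - SS_res / SS_tot, in plain Python. The function name `r_squared` and the sample values are illustrative.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))                      # perfect predictions
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
print(r_squared(y, [4.0, 3.0, 2.0, 1.0]))   # predictions worse than the mean
```

The last call uses predictions that are worse than simply predicting the mean, in which case this formula gives a negative value.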

14. Correlation measures the strength and direction of the relationship between two variables, while regression analyzes the relationship between a dependent variable and one or more independent variables, allowing for prediction and estimation.

15. Coefficients in regression represent the estimated effects of the independent variables on the dependent variable. The intercept is the estimated value of the dependent variable when all independent variables are set to zero.

16. Outliers in regression analysis should first be investigated. They can be removed if they result from data entry or measurement errors; genuine but influential observations are better handled with robust regression techniques that are less sensitive to outliers (for example, Huber regression), or by transforming the variables.

17. Ordinary least squares (OLS) regression aims to minimize the sum of squared residuals, while ridge regression adds a penalty term to the OLS objective function to mitigate multicollinearity and reduce the coefficients' variance.
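
The difference can be sketched with the closed-form solutions, assuming centered data with no intercept column. The function names and the synthetic, nearly collinear data are illustrative only.

```python
import numpy as np


def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y; lam is the penalty strength."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)      # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)

beta_ols = ols_fit(X, y)
beta_ridge = ridge_fit(X, y, lam=1.0)
print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

With nearly collinear columns the OLS coefficients can be large and unstable, while the ridge coefficients are pulled toward smaller, more similar values.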

18. Heteroscedasticity in regression refers to the unequal spread or variability of the residuals across the range of values of the independent variables. It can affect the model by violating the assumption of homoscedasticity, leading to biased standard errors and inefficient parameter estimates.

19. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other. It can be handled by removing one of the correlated variables, performing dimensionality reduction techniques like principal component analysis, or using regularization techniques like ridge regression.

20. Polynomial regression is a form of regression analysis where the relationship between the independent variable(s) and the dependent variable is modeled as an nth-degree polynomial. It is used when the relationship is nonlinear and can capture more complex patterns in the data.
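
A small sketch with NumPy, using noiseless made-up data. It shows that polynomial regression is still linear in the coefficients: the powers of `x` simply become columns of the design matrix.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 21)
y = 1.0 - 2.0 * x + 0.5 * x**2             # noiseless quadratic, for illustration

# Columns are x^0, x^1, x^2; the model is linear in these columns.
X = np.vander(x, N=3, increasing=True)
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)                               # intercept, linear, quadratic terms
```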

21. A loss function measures the inconsistency between predicted and actual values in machine learning. Its purpose is to quantify the model's performance and guide the optimization process.

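22. For a convex loss function, every local minimum is also a global minimum, so gradient-based methods cannot get trapped in a poor local minimum (a strictly convex loss has at most one minimizer). A non-convex loss function can have multiple local minima and saddle points.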

23. Mean squared error (MSE) is a loss function that calculates the average squared difference between predicted and actual values. It is computed by summing the squared residuals and dividing by the number of samples.

24. Mean absolute error (MAE) is a loss function that calculates the average absolute difference between predicted and actual values. It is computed by summing the absolute residuals and dividing by the number of samples.
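
Both definitions can be written in a few lines of plain Python. The function names and sample values are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]                   # one prediction is off by 3
print(mean_squared_error(y_true, y_pred))  # the single error contributes 9 / 3
print(mean_absolute_error(y_true, y_pred)) # the single error contributes 3 / 3
```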

25. Log loss, also known as cross-entropy loss, is a loss function used in classification problems. It quantifies the dissimilarity between predicted probabilities and true labels. It is calculated as the negative logarithm of the predicted probability of the correct class.
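
A minimal sketch for the binary case, assuming labels in {0, 1} and predicted probabilities for class 1. The clipping constant `eps` is an implementation choice to avoid taking log(0).

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)    # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


print(binary_log_loss([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_log_loss([1, 0], [0.5, 0.5]))  # uninformative: loss equals ln 2
print(binary_log_loss([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```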

26. The choice of the appropriate loss function depends on the nature of the problem. MSE is commonly used for regression, while log loss is used for binary and multiclass classification. The choice can also be influenced by the desired properties of the model and the data distribution; for example, MAE or Huber loss may be preferred over MSE when the data contain outliers.

27. Regularization is a technique used in loss functions to prevent overfitting and improve the generalization of the model. It adds a penalty term to the loss function, which discourages complex or large parameter values.

28. Huber loss is a loss function that combines the characteristics of squared loss (MSE) and absolute loss (MAE). It is less sensitive to outliers than squared loss and provides a smooth gradient near zero.
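
A minimal sketch of the per-residual Huber loss. The threshold `delta`, which separates the quadratic region from the linear region, is a tuning parameter; the default of 1.0 here is only an example.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


for r in (0.5, 1.0, 3.0, -3.0):
    print(r, huber_loss(r), 0.5 * r ** 2)   # Huber loss next to half the squared loss
```

For large residuals the Huber value grows linearly rather than quadratically, which is what limits the influence of outliers.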

29. Quantile loss is a loss function used for quantile regression, where the goal is to predict conditional quantiles of the target variable. It measures the discrepancy between predicted and actual quantiles.
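
A minimal sketch of the quantile (pinball) loss for a target quantile `q` between 0 and 1. The function name and numbers are illustrative.

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss: under-predictions cost q, over-predictions cost (1 - q)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# For q = 0.9, predicting too low is penalized more than predicting too high.
print(quantile_loss([10.0], [8.0], q=0.9))   # under-prediction by 2
print(quantile_loss([10.0], [12.0], q=0.9))  # over-prediction by 2
```

With `q = 0.5` the loss is half the absolute error, so minimizing it estimates the conditional median.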

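30. Squared loss penalizes large errors disproportionately because the penalty grows with the square of the error, which makes it sensitive to outliers: a few extreme points can pull the fit toward themselves. Absolute loss grows linearly with the size of the error, so it is more robust to outliers. Minimizing squared loss estimates the conditional mean, while minimizing absolute loss estimates the conditional median.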

31. An optimizer is an algorithm or method used to adjust the parameters of a model in order to minimize the loss function during the training process in machine learning.

32. Gradient Descent (GD) is an optimization algorithm that iteratively updates the model's parameters in the direction of the steepest descent of the loss function. It starts with an initial guess and adjusts the parameters by computing gradients.
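
The update rule can be sketched on a one-parameter problem. The loss f(w) = (w - 3)^2, the starting point, and the learning rate are arbitrary choices for illustration.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w


# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0,
                           learning_rate=0.1, n_steps=100)
print(w_final)
```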

33. Different variations of Gradient Descent include Batch GD, Mini-Batch GD, and Stochastic GD. They differ in the amount of data used to compute the gradient and update the parameters at each iteration.

34. The learning rate in GD controls the step size taken in the direction of the gradients during parameter updates. An appropriate value is chosen based on the problem and can impact the convergence and stability of the optimization process.

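35. On non-convex loss functions, GD can get stuck in a local optimum or stall near a saddle point, so it may converge to a suboptimal solution. On convex losses this is not an issue, since any local minimum is global. Techniques such as momentum, learning rate scheduling, the noise of stochastic updates, or random restarts can help escape poor local optima and find better solutions.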

36. Stochastic Gradient Descent (SGD) is a variation of GD that updates the parameters using one randomly selected training sample at a time. It is computationally efficient but introduces more noise and exhibits more frequent parameter updates.

37. Batch size in GD refers to the number of training samples used to compute the gradient and update the parameters in each iteration. It impacts the trade-off between computational efficiency and parameter update frequency.

38. Momentum is a technique used in optimization algorithms to accelerate convergence and navigate through areas with low gradients. It accumulates past gradients to determine the direction and speed of parameter updates.
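
A sketch of classical (heavy-ball) momentum on a one-parameter problem. The momentum coefficient `beta`, the learning rate, and the test function are illustrative choices, and other formulations of momentum exist.

```python
def momentum_descent(grad, w0, learning_rate=0.1, beta=0.9, n_steps=500):
    """Gradient descent with a velocity term that accumulates past gradients."""
    w, velocity = w0, 0.0
    for _ in range(n_steps):
        velocity = beta * velocity - learning_rate * grad(w)
        w = w + velocity
    return w


# Same toy loss as a plain quadratic: f(w) = (w - 3)^2, gradient 2 * (w - 3).
w_final = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

The velocity keeps the parameter moving when successive gradients point the same way, and `beta` controls how quickly old gradients are forgotten.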

39. Batch GD uses the entire training dataset to compute the gradient, while mini-batch GD uses a subset (mini-batch) of the training dataset. SGD uses a single random sample at a time. They differ in the computational cost and the quality of the parameter updates.
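
The three variants differ only in the batch size, which the sketch below makes explicit for a one-parameter model y = w * x. The function name, the noiseless toy data, and the hyperparameters are assumptions for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, n_epochs=200, seed=0):
    """Fit y = w * x by minimizing squared error with the given batch size."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)                       # new sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]                         # true slope is 2

print(minibatch_gd(xs, ys, batch_size=len(xs)))    # batch GD: one update per epoch
print(minibatch_gd(xs, ys, batch_size=2))          # mini-batch GD
print(minibatch_gd(xs, ys, batch_size=1))          # SGD: one update per sample
```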

40. The learning rate affects the convergence of GD. If it is too high, the optimization process may diverge. If it is too low, the convergence may be slow. Choosing an appropriate learning rate is crucial for efficient and stable optimization.
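
The three regimes can be seen on the toy loss f(w) = w^2, where each step multiplies w by (1 - 2 * learning_rate). The helper name and the specific rates are illustrative.

```python
def distance_after(learning_rate, n_steps=50, w0=1.0):
    """Run GD on f(w) = w^2 (gradient 2w) and return |w| at the end."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * 2.0 * w
    return abs(w)


print(distance_after(1.1))    # too high: |w| grows every step (divergence)
print(distance_after(0.001))  # too low: still close to the starting point
print(distance_after(0.4))    # reasonable: very close to the minimum at 0
```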

41. Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the objective function. It helps to control the complexity of the model and encourage simpler models that generalize better to unseen data.

42. L1 regularization, also known as Lasso regularization, adds the absolute values of the coefficients as a penalty term. L2 regularization, also known as Ridge regularization, adds the squared values of the coefficients as a penalty term.

43. Ridge regression is a form of linear regression that incorporates L2 regularization. It adds the sum of squared coefficients multiplied by a regularization parameter to the least squares objective function, balancing the model's fit to the data with the magnitude of the coefficients.

44. Elastic Net regularization combines L1 and L2 penalties by adding both the absolute values and squared values of the coefficients as penalty terms. It allows for variable selection (like L1) while also handling correlated predictors effectively (like L2).
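
The three penalties can be written directly. The parameter names `lam` and `l1_ratio` are assumptions, and the elastic net mix shown is one common parameterization; libraries differ in scaling constants.

```python
def l1_penalty(coefs, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c * c for c in coefs)


def elastic_net_penalty(coefs, lam, l1_ratio):
    """Weighted mix of the two: l1_ratio = 1 is pure L1, l1_ratio = 0 is pure L2."""
    return l1_ratio * l1_penalty(coefs, lam) + (1.0 - l1_ratio) * l2_penalty(coefs, lam)


coefs = [1.0, -2.0, 3.0]
print(l1_penalty(coefs, lam=0.5))
print(l2_penalty(coefs, lam=0.5))
print(elastic_net_penalty(coefs, lam=0.5, l1_ratio=0.5))
```

In each case the penalty is added to the data-fitting loss, so larger coefficients must earn their size through a better fit.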

45. Regularization helps prevent overfitting by reducing the complexity of the model, constraining the magnitude of the coefficients, and discouraging excessive reliance on individual predictors. It prevents the model from fitting the noise in the training data and improves its generalization performance on unseen data.

46. Early stopping is a form of regularization that stops the training process before the model starts overfitting the training data. It involves monitoring the model's performance on a validation set during training and stopping when the performance begins to degrade.
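
A minimal sketch of a patience-based stopping rule, applied to a made-up list of validation losses. The function name and the `patience` parameter are illustrative; in practice the loop would wrap actual training epochs and keep a copy of the best model.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    waited = 0                                 # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                          # no improvement for `patience` epochs
    return best_epoch, stopped_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]
print(early_stopping(losses, patience=2))
```

Training stops two epochs after the best loss of 0.7, so the later value of 0.6 is never reached; a larger `patience` trades extra training time for a lower chance of stopping too soon.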

47. Dropout regularization is a technique used in neural networks where randomly selected neurons are ignored or "dropped out" during the training process. It helps prevent overfitting by reducing the interdependencies between neurons and encourages the network to learn more robust and generalized representations.
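
A sketch of "inverted" dropout on a list of activations, a common variant in which kept values are rescaled during training so that nothing needs to change at inference time. The function name and the choice of this variant are assumptions.

```python
import random


def dropout(activations, drop_prob, training=True, rng=None):
    """Zero each activation with probability drop_prob; rescale the survivors."""
    if not training or drop_prob == 0.0:
        return list(activations)               # inference: pass values through
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


layer_output = [1.0] * 10
print(dropout(layer_output, drop_prob=0.5, rng=random.Random(0)))
print(dropout(layer_output, drop_prob=0.5, training=False))
```

Dividing by `keep_prob` keeps the expected value of each activation the same during training and inference.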

48. The regularization parameter in a model, such as the lambda in ridge regression, is typically chosen through techniques like cross-validation. It involves evaluating the model's performance on different subsets of the training data with varying regularization parameter values and selecting the one that provides the best trade-off between bias and variance.
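
A rough sketch of k-fold cross-validation for choosing the ridge penalty, using NumPy and the closed-form ridge solution. The function names, candidate values, and synthetic data are assumptions; the rows are already in random order, so contiguous folds are used.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept; data assumed centered)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def cv_mse(X, y, lam, n_folds=5):
    """Average validation MSE over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False            # hold this fold out
        beta = ridge_fit(X[train_mask], y[train_mask], lam)
        residuals = y[val_idx] - X[val_idx] @ beta
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))


def select_lambda(X, y, candidates, n_folds=5):
    """Pick the candidate with the lowest cross-validated MSE."""
    return min(candidates, key=lambda lam: cv_mse(X, y, lam, n_folds))


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=40)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best = select_lambda(X, y, candidates)
print(best, cv_mse(X, y, best))
```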

49. Feature selection aims to select a subset of relevant features from the original set, while regularization aims to shrink the coefficients towards zero, effectively reducing the impact of less relevant features. Feature selection explicitly chooses features, while regularization implicitly shrinks their coefficients.
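
The contrast can be illustrated for a single coefficient. Assuming an unpenalized estimate `z`, minimizing 0.5 * (b - z)^2 + lam * |b| gives the soft-thresholding rule below, while minimizing 0.5 * (b - z)^2 + 0.5 * lam * b^2 gives proportional shrinkage. The function names are illustrative.

```python
def soft_threshold(z, lam):
    """L1 solution: shrink by lam and set small values exactly to zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """L2 solution: scale toward zero, never reaching exactly zero for z != 0."""
    return z / (1.0 + lam)


for z in (0.3, 2.0, -2.0):
    print(z, soft_threshold(z, lam=0.5), ridge_shrink(z, lam=0.5))
```

This is why an L1 penalty can act as feature selection (some coefficients become exactly zero), while an L2 penalty only shrinks.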

50. Regularized models have a trade-off between bias and variance. Increasing regularization reduces variance but may introduce bias by simplifying the model. Decreasing regularization reduces bias but may increase variance by allowing the model to fit the noise in the training data more closely. The regularization parameter controls this trade-off. 

51. Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that separates different classes or fits the data with the maximum margin.

52. The kernel trick in SVM allows nonlinear decision boundaries to be learned by mapping the original input space into a higher-dimensional feature space where the data becomes linearly separable. This avoids the need for explicitly transforming the data into a higher dimension.
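
A small check of the idea for two-dimensional inputs, using the degree-2 polynomial kernel k(x, z) = (x . z)^2. The explicit feature map shown is the standard one for this kernel; the function names are illustrative.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, z = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, z))                      # computed without mapping
print(dot(feature_map(x), feature_map(z)))    # same value via the explicit map
```

The kernel gives the inner product in the higher-dimensional space while only ever working with the original coordinates.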

53. Support vectors in SVM are the training points that lie on the margin, inside it, or on the wrong side of the decision boundary. They are important because they alone determine the location of the decision boundary; removing any other training point leaves the solution unchanged.

54. The margin in SVM is the distance between the decision boundary and the closest training points. A larger margin tends to give better generalization and more robustness to new data, so SVM maximizes the margin during training.

55. Unbalanced datasets in SVM can be handled by adjusting the class weights or using techniques like oversampling the minority class, undersampling the majority class, or employing specialized SVM algorithms that explicitly address class imbalance.

56. Linear SVM uses a linear decision boundary to separate classes in the input space. Non-linear SVM uses the kernel trick to implicitly map the data into a higher-dimensional feature space where a linear boundary can be applied.

57. The C-parameter in SVM controls the trade-off between achieving a larger margin and allowing training errors. Smaller values of C increase the margin but may lead to misclassified training examples, while larger values of C reduce the margin to avoid misclassifications.

58. Slack variables in SVM allow for a soft margin by relaxing the strict requirement of perfect separation. They measure the extent to which training examples violate the margin and are used to handle cases where the data is not linearly separable.

59. In SVM, a hard margin means the decision boundary must perfectly separate the classes without any misclassifications, which may not be possible in some cases. A soft margin allows for some misclassifications, giving flexibility to the model to fit the data better.
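
A sketch of the soft-margin objective for a linear SVM with labels in {-1, +1}: half the squared norm of the weights plus C times the total slack. The function names, weights, and points are made up for illustration.

```python
def slack(w, b, x, y):
    """Slack of one example: how far it falls short of a functional margin of 1."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, X, Y, C):
    """0.5 * ||w||^2 + C * (sum of slacks); a hard margin requires every slack to be 0."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(
        slack(w, b, x, y) for x, y in zip(X, Y))


w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
Y = [1, 1, 1]

print([slack(w, b, x, y) for x, y in zip(X, Y)])
print(soft_margin_objective(w, b, X, Y, C=1.0))
print(soft_margin_objective(w, b, X, Y, C=10.0))
```

The first point is outside the margin (slack 0), the second is correctly classified but inside the margin (slack 0.5), and the third is misclassified (slack 2). Raising `C` makes the same violations more costly.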

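60. In a linear SVM, the coefficients are the weights assigned to each feature in the decision function. The sign of a coefficient indicates which class larger values of that feature favor, and, for features on comparable scales, its magnitude indicates how strongly the feature influences the decision boundary. With a non-linear kernel, per-feature coefficients are not directly available because the boundary is expressed in terms of the support vectors.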

61. A decision tree is a supervised machine learning algorithm that recursively splits the data based on the values of features to make predictions or decisions. It works by constructing a tree-like model of decisions and their possible consequences.

62. Splits in a decision tree are made by evaluating different features and their values to create subsets of data that are more homogeneous in terms of the target variable. The goal is to maximize the homogeneity or purity of the subsets.

63. Impurity measures, such as the Gini index and entropy, are used in decision trees to quantify the impurity or randomness of a subset of data. They help determine the best feature and value to split on by maximizing the reduction in impurity.

64. Information gain in decision trees measures the reduction in entropy or impurity achieved by splitting on a specific feature. It quantifies how much information is gained by knowing the feature's value, helping to decide the optimal splits.
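
The impurity measures and information gain can be computed in a few lines of plain Python. The function names and the toy labels are illustrative; entropy is measured in bits.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (base 2) of the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["a", "a", "b", "b"]
print(gini(parent), entropy(parent))
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # split separates the classes
print(information_gain(parent, ["a", "b"], ["a", "b"]))  # split changes nothing
```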

65. Missing values in decision trees can be handled by imputing them (for example, with the most common value or the mean), by sending the affected observations down a default branch or down both branches, or by using surrogate splits, which substitute a correlated feature when the primary split feature is missing.

66. Pruning in decision trees involves removing or collapsing branches of the tree to reduce overfitting. It helps simplify the model by removing unnecessary complexity and improving its generalization capability.

67. A classification tree is used for predicting categorical or discrete target variables, while a regression tree is used for predicting continuous target variables.

68. Decision boundaries in a decision tree are determined by the splits in the tree. Each split creates a boundary that separates the data into different regions or classes based on the feature values.

69. Feature importance in decision trees measures the significance or contribution of each feature in making accurate predictions. It helps identify the most informative features and understand their impact on the model's performance.

70. Ensemble techniques combine multiple models, often decision trees, to improve prediction accuracy and generalization. They can be used to reduce overfitting, increase robustness, and handle complex relationships in the data.

71. Ensemble techniques in machine learning combine multiple models, known as base learners, to make predictions or decisions. They leverage the collective knowledge of the individual models to improve overall performance.

72. Bagging, or bootstrap aggregating, is an ensemble technique that involves training multiple base models on different bootstrap samples of the training data and then combining their predictions through averaging or voting.

73. Bootstrapping in bagging is the process of creating multiple bootstrap samples by randomly selecting data points from the training set with replacement. It allows each bootstrap sample to have the same size as the original dataset.
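
A minimal sketch of bootstrap sampling and of combining predictions by majority vote. The function names and the toy data are illustrative.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Combine class predictions from several models by taking the most common one."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
data = list(range(1000))
sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)
print(len(sample), unique_fraction)
print(majority_vote(["cat", "dog", "cat"]))
```

Because sampling is with replacement, each bootstrap sample contains only about 63% of the distinct original points on average (1 - 1/e), which is what makes the base models differ from one another.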

74. Boosting is an ensemble technique that trains weak base models sequentially, each one focusing on the errors of the models before it, and combines them into a strong predictive model. AdaBoost, for example, assigns higher weights to misclassified instances in each iteration to focus on the difficult examples.

75. AdaBoost and Gradient Boosting are both boosting algorithms. AdaBoost adjusts the weights of instances based on misclassification, while Gradient Boosting fits subsequent models to the residuals of the previous models.
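
The two mechanisms can be sketched side by side: one AdaBoost reweighting round for labels in {-1, +1}, and gradient boosting for squared loss with one-split regression stumps fitted to residuals. All names, the toy data, and the learning rate are assumptions; this is an illustration of the ideas, not a full implementation of either algorithm.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights). Weights must sum to 1."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)      # assumes 0 < err < 1
    scaled = [w * math.exp(-alpha * t * p)
              for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(scaled)
    return alpha, [s / total for s in scaled]


def fit_stump(xs, targets):
    """Best single-threshold split on x (least squares); needs >= 2 distinct xs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def gradient_boost(xs, ys, n_rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


alpha, new_weights = adaboost_reweight([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(alpha, new_weights)              # the misclassified point gains weight

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
for rounds in (0, 1, 20):
    print(rounds, mse(gradient_boost(xs, ys, rounds), xs, ys))
```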

76. Random forests are an ensemble technique that trains many decision trees independently, each on a bootstrap sample of the training data and with a random subset of features considered at each split. They aggregate the predictions of the individual trees, by majority vote for classification or by averaging for regression, to make the final prediction.

77. Random forests measure feature importance in two common ways: the average decrease in impurity contributed by splits on a feature across all trees, or the drop in prediction accuracy when that feature's values are randomly permuted (permutation importance).

78. Stacking in ensemble learning combines the predictions of multiple base models by training a meta-model that learns to combine their outputs. It uses the predictions of the base models as input features for the meta-model.

79. The advantages of ensemble techniques include improved predictive performance, robustness to noise and outliers, and the ability to handle complex relationships. Disadvantages may include increased complexity, computation time, and potential overfitting.

80. The optimal number of models in an ensemble depends on various factors, such as the dataset size, complexity, and diversity of the base models. It can be determined through cross-validation or by monitoring the ensemble's performance on a validation set.