
These are questions related to the material in the folder "slides _notes on classification - decision trees - DC ensemble - SVM".

### 1. **Chapter 4: Classification**
1. What is the primary goal of classification in machine learning?
2. How does linear regression differ from logistic regression in handling binary classification tasks?
3. What issue arises when using linear regression for binary classification with probabilities?
4. In logistic regression, what is the range of predicted probabilities for a binary outcome?
5. Why might probabilities less than zero or greater than one be problematic in classification?
6. What is an advantage of estimating probabilities for each class in classification rather than only predicting labels?
7. How is logistic regression related to linear discriminant analysis?
8. Describe a scenario where estimating the probability of a classification is more valuable than just predicting the class.
9. What are qualitative variables, and how are they treated in classification tasks?
10. How does logistic regression model the probability of a binary response?

---

### 2. **Chapter 8: Tree-Based Methods**
1. What are the main types of problems for which decision trees are used?
2. Describe how a decision tree segments the predictor space.
3. What is the difference between internal nodes and terminal nodes in a decision tree?
4. How is the response for each terminal node in a decision tree typically determined in regression?
5. Explain the concept of recursive binary splitting in decision tree building.
6. What role does “Years” play in the sample decision tree for predicting baseball player salaries?
7. What is the purpose of bagging in tree-based methods?
8. How does boosting improve prediction accuracy in tree-based methods?
9. What is the trade-off between interpretability and accuracy in bagging and boosting?
10. Describe the general process for fitting a decision tree model.

---

### 3. **Chapter 9: Support Vector Machines**
1. What is a hyperplane, and how is it defined mathematically?
2. How does a maximal margin classifier differ from other separating hyperplanes?
3. Describe the concept of a support vector in SVM.
4. Explain the role of the regularization parameter \( C \) in a support vector classifier.
5. How does the kernel trick enable SVMs to handle non-linear data?
6. What is the primary goal when finding a separating hyperplane in SVM?
7. Explain the concept of a soft margin in SVM and when it is useful.
8. How does enlarging the feature space help in classifying non-linearly separable data?
9. What types of decision boundaries are possible by using non-linear transformations in SVM?
10. When and why might a linear boundary fail in an SVM model?

---

### 4. **Notes on Decision Trees and XGBoost**
1. In decision trees, how is the input domain partitioned into different regions?
2. How is the prediction for a region in a decision tree typically determined for classification tasks?
3. Explain the significance of axis-aligned subsets in decision tree regions.
4. What is the purpose of the gain function \( G(j, \theta) \) in decision tree construction?
5. Describe the greedy algorithm used in fitting decision trees.
6. How does XGBoost differ from traditional decision tree methods?
7. What does it mean for a function to be non-differentiable in the context of decision tree optimization?
8. Explain the process of recursive partitioning in decision tree construction.
9. How is feature importance determined in a decision tree?
10. Describe the NP-completeness of finding an optimal tree model and its practical implications.

---

### 5. **Trees, Forests, Bagging, and Boosting Notes**
1. What does CART stand for, and what are its primary uses?
2. Describe the process of model fitting in a regression tree.
3. Explain the difference between bagging and boosting in ensemble methods.
4. How does random forest handle the problem of overfitting compared to a single decision tree?
5. What are surrogate splits, and how do they handle missing values in tree models?
6. Describe the role of the Gini index in decision tree classification.
7. What is the purpose of pruning in decision trees, and how does it help avoid overfitting?
8. How does entropy differ from the Gini index as a criterion for splits in decision trees?
9. Why are tree models considered easy to interpret compared to other machine learning methods?
10. Explain the concept of feature selection in the context of decision trees and ensemble methods.



# Exercises on the same material
The computational exercises below each specify the metrics to use when measuring model performance: accuracy, precision, recall, F1-score, and others as appropriate.

---

### **1. Classification (Logistic Regression and Linear Discriminant Analysis)**

**Exercise 1**: Implement logistic regression on a synthetic binary classification dataset (e.g., using `make_classification`). Calculate class probabilities and classify based on a 0.5 threshold. Evaluate the model’s performance using accuracy, precision, and recall.

- *Hint*: Use `make_classification` to generate data, then fit `LogisticRegression` from `sklearn.linear_model`. Use `predict_proba` for class probabilities and evaluate using `accuracy_score`, `precision_score`, and `recall_score` from `sklearn.metrics`.
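
One possible starting point for this exercise is sketched below. The sample size, the 70/30 split, and the random seeds are arbitrary assumptions, not part of the exercise statement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem; sizes and seeds are illustrative choices.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated P(y = 1 | x).
proba = model.predict_proba(X_test)[:, 1]

# Classify with an explicit 0.5 threshold on the probability.
y_pred = (proba >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Changing the threshold in the `y_pred` line is a quick way to see how precision and recall trade off against each other.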

**Exercise 2**: Apply logistic regression to a multi-class classification problem and assess model performance using accuracy, precision, and the macro-average F1-score.

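- *Hint*: Use `make_classification` with `n_classes=3` and `n_informative` of at least 3 (the default of 2 informative features is too few for three classes with the default two clusters per class). Fit `LogisticRegression`, which handles multi-class targets with a multinomial model by default in recent scikit-learn releases; the explicit `multi_class='multinomial'` argument is deprecated there. Use `accuracy_score`, and `precision_score` and `f1_score` with `average='macro'`, from `sklearn.metrics` for evaluation.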
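
A minimal sketch of the multi-class case follows. It assumes a recent scikit-learn release in which `LogisticRegression` fits a multinomial model by default, and it sets `n_informative=4` so that `make_classification` can place three classes; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split

# Three classes need enough informative features:
# n_classes * n_clusters_per_class must not exceed 2 ** n_informative.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
proba = model.predict_proba(X_test)  # one column per class

accuracy = accuracy_score(y_test, y_pred)
# Macro averaging computes the metric per class, then takes the unweighted mean.
macro_precision = precision_score(y_test, y_pred, average="macro")
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.3f} precision={macro_precision:.3f} f1={macro_f1:.3f}")
```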

**Exercise 3**: Compare linear regression and logistic regression for binary classification on a synthetic dataset, assessing both models using ROC-AUC.

- *Hint*: Use `make_classification` and fit both `LinearRegression` and `LogisticRegression`. For logistic regression, take the positive-class column of `predict_proba` as the score; for linear regression, use the raw output of `predict` as the score. Pass each score to `roc_auc_score` to evaluate the models.
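
The comparison might look like the sketch below. It treats the raw linear regression output as a ranking score, which is an assumption about how to make the two models comparable; it also counts how many linear predictions fall outside the interval [0, 1].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Linear regression is fitted directly on the 0/1 labels.
lin = LinearRegression().fit(X_train, y_train)
log = LogisticRegression(max_iter=1000).fit(X_train, y_train)

lin_scores = lin.predict(X_test)              # unbounded real values
log_scores = log.predict_proba(X_test)[:, 1]  # values in [0, 1]

# Linear predictions are not constrained to be valid probabilities.
n_outside = int(np.sum((lin_scores < 0) | (lin_scores > 1)))

# ROC-AUC depends only on the ranking of the scores, not on their scale.
lin_auc = roc_auc_score(y_test, lin_scores)
log_auc = roc_auc_score(y_test, log_scores)
print(f"linear AUC={lin_auc:.3f}, logistic AUC={log_auc:.3f}, "
      f"linear predictions outside [0, 1]: {n_outside}")
```

Because ROC-AUC only looks at ranking, the two models can score similarly even when the linear model produces invalid probabilities.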

**Exercise 4**: Visualize the decision boundary of a logistic regression model on a 2D dataset, and measure performance using confusion matrix and accuracy.

- *Hint*: Use `make_blobs` to create a 2D dataset, fit `LogisticRegression` with different `C` values, and plot the decision boundaries using `DecisionBoundaryDisplay.from_estimator`. Assess the model with `confusion_matrix` and `accuracy_score`.

**Exercise 5**: Fit a logistic regression model on a synthetic dataset and interpret the model’s coefficients. Evaluate model performance with precision, recall, and F1-score.

- *Hint*: Use `make_classification`, fit `LogisticRegression`, and check `coef_` to interpret feature influence. Evaluate with `precision_score`, `recall_score`, and `f1_score`.
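
A hedged sketch of coefficient interpretation follows. Standardizing the features first is an added assumption, made so that coefficient magnitudes are comparable across features; the exponentiated coefficients are read as odds ratios per one standard deviation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

coefs = model.coef_[0]         # one log-odds coefficient per feature
odds_ratios = np.exp(coefs)    # multiplicative change in odds per +1 std
for j, (c, o) in enumerate(zip(coefs, odds_ratios)):
    print(f"feature {j}: coef={c:+.3f} odds ratio={o:.3f}")

y_pred = model.predict(scaler.transform(X_test))
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```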

---

### **2. Tree-Based Methods (Decision Trees)**

**Exercise 1**: Implement a decision tree classifier on a synthetic dataset and visualize it. Use precision, recall, and F1-score to measure performance.

- *Hint*: Generate data with `make_classification` and fit `DecisionTreeClassifier` from `sklearn.tree`. Use `plot_tree` for visualization and evaluate using `precision_score`, `recall_score`, and `f1_score`.

**Exercise 2**: Fit a decision tree regressor on a synthetic regression dataset and evaluate performance using mean absolute error (MAE) and R-squared.

- *Hint*: Use `make_regression` to generate data, fit `DecisionTreeRegressor`, and measure performance with `mean_absolute_error` and `r2_score` from `sklearn.metrics`.
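
As a rough sketch, the regression case could be set up as below. The noise level and `max_depth=5` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Each leaf predicts the mean response of the training samples it contains.
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.2f} R^2={r2:.3f}")
```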

**Exercise 3**: Use cross-validation to find the optimal depth of a decision tree classifier on a synthetic dataset, and evaluate using accuracy and F1-score.

- *Hint*: Use `DecisionTreeClassifier` with `GridSearchCV` to search over `max_depth`. Generate data with `make_classification`, and evaluate using `accuracy_score` and `f1_score`.
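
The depth search can be sketched as follows. The candidate depths, the 5-fold setting, and the choice of F1 as the selection metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depths = [2, 3, 5, 8, None]  # None lets the tree grow until leaves are pure
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

# By default the best configuration is refitted on the full training split.
best_depth = search.best_params_["max_depth"]
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"best max_depth={best_depth} accuracy={accuracy:.3f} f1={f1:.3f}")
```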

**Exercise 4**: Prune a decision tree by adjusting `max_depth` and compare performance on the training and testing datasets using accuracy and recall.

- *Hint*: Use `make_classification`, fit `DecisionTreeClassifier` with varying `max_depth`, and compare overfitting by assessing `accuracy_score` and `recall_score` on train vs. test data.
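
One way to lay out the train-versus-test comparison is sketched below. The list of depths is an arbitrary assumption; a widening gap between the training and test scores as depth grows is the usual sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (1, 3, 5, 10, None):  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    pred_train = tree.predict(X_train)
    pred_test = tree.predict(X_test)
    results[depth] = {
        "train_acc": accuracy_score(y_train, pred_train),
        "test_acc": accuracy_score(y_test, pred_test),
        "train_recall": recall_score(y_train, pred_train),
        "test_recall": recall_score(y_test, pred_test),
    }

for depth, r in results.items():
    print(f"max_depth={depth}: train acc={r['train_acc']:.3f} "
          f"test acc={r['test_acc']:.3f} train recall={r['train_recall']:.3f} "
          f"test recall={r['test_recall']:.3f}")
```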

**Exercise 5**: Plot feature importance for a decision tree classifier on a synthetic dataset and measure model performance using precision and accuracy.

- *Hint*: Fit `DecisionTreeClassifier` on data from `make_classification`. Plot `feature_importances_` and evaluate using `precision_score` and `accuracy_score`.

---

### **3. Support Vector Machines (SVM)**

**Exercise 1**: Fit a linear SVM classifier on a 2D dataset, visualizing the hyperplane, support vectors, and margins. Measure model performance using precision, recall, and F1-score.

- *Hint*: Use `make_blobs` to create separable data, then fit `SVC(kernel='linear')`. Visualize the hyperplane and evaluate with `precision_score`, `recall_score`, and `f1_score`.
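
The quantities needed for the plot can be extracted as in the sketch below; plotting itself is left out. The blob parameters and `C=1.0` are assumptions. For a linear kernel the hyperplane is w·x + b = 0 and the margin width is 2/‖w‖.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

w = clf.coef_[0]                 # normal vector of the separating hyperplane
b = clf.intercept_[0]            # offset term
margin_width = 2.0 / np.linalg.norm(w)
support_vectors = clf.support_vectors_  # training points on or inside the margin

# Points on the decision boundary satisfy w[0]*x1 + w[1]*x2 + b = 0;
# the margin edges satisfy the same expression equal to +1 and -1.
print(f"w={w}, b={b:.3f}, margin width={margin_width:.3f}, "
      f"support vectors={len(support_vectors)}")

y_pred = clf.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```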

**Exercise 2**: Apply a non-linear kernel SVM on a synthetic dataset, and compare performance with the linear SVM using accuracy and ROC-AUC.

- *Hint*: Use `make_circles` for non-linearly separable data, and fit `SVC(kernel='rbf')` and `SVC(kernel='linear')`. Use `accuracy_score` on the predicted labels and `roc_auc_score` on the `decision_function` scores for model comparison.
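
A minimal sketch of the kernel comparison follows. The `noise` and `factor` values are assumptions chosen so that the two rings are clearly not linearly separable.

```python
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    # decision_function gives signed distances, usable as ranking scores.
    auc = roc_auc_score(y_test, clf.decision_function(X_test))
    results[kernel] = (acc, auc)
    print(f"{kernel}: accuracy={acc:.3f} ROC-AUC={auc:.3f}")
```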

**Exercise 3**: Investigate the effect of the regularization parameter \( C \) on a linear SVM model, assessing each model’s performance using accuracy and precision.

- *Hint*: Generate data with `make_classification` and fit `SVC(kernel='linear')` with varying `C` values. Use `accuracy_score` and `precision_score` to evaluate.

**Exercise 4**: Use cross-validation to select optimal parameters (e.g., \( C \) and kernel) for an SVM classifier on a synthetic dataset. Measure performance using cross-validated accuracy and F1-score.

- *Hint*: Use `GridSearchCV` with `SVC` on `make_moons` or `make_circles`. Pass `scoring=['accuracy', 'f1']` (with `refit` set to one of the two) so that each configuration is evaluated with cross-validated accuracy and F1-score.
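
The search might be set up as in the sketch below. The parameter grid, the 5-fold setting, and refitting on F1 are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(
    SVC(),
    param_grid=param_grid,
    scoring=["accuracy", "f1"],  # both metrics are recorded for every candidate
    refit="f1",                  # the final model is chosen by mean F1
    cv=5,
)
search.fit(X, y)

cv = search.cv_results_
for params, acc, f1 in zip(cv["params"], cv["mean_test_accuracy"], cv["mean_test_f1"]):
    print(f"{params}: accuracy={acc:.3f} f1={f1:.3f}")
print("selected:", search.best_params_)
```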

**Exercise 5**: Experiment with feature expansion by adding polynomial features before applying a linear SVM, and evaluate the model using accuracy and recall.

- *Hint*: Use `PolynomialFeatures` to transform a `make_classification` dataset, then fit `LinearSVC`. Measure model performance using `accuracy_score` and `recall_score`.
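
Feature expansion followed by a linear SVM can be chained in a pipeline, as sketched below. Degree 2, the added `StandardScaler` step, and the raised `max_iter` are assumptions, included to help `LinearSVC` converge on the expanded features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Degree-2 expansion adds squares and pairwise products of the 5 inputs,
# so a boundary that is linear in the new space is quadratic in the old one.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000),
)
model.fit(X_train, y_train)

n_expanded = model.named_steps["polynomialfeatures"].n_output_features_
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"expanded features={n_expanded} accuracy={accuracy:.3f} recall={recall:.3f}")
```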

---

### **4. Decision Trees and XGBoost**

**Exercise 1**: Implement a binary decision tree using a synthetic dataset, and assess performance using accuracy and F1-score.

- *Hint*: Use `make_classification` and fit `DecisionTreeClassifier`. Measure performance with `accuracy_score` and `f1_score` from `sklearn.metrics`.

**Exercise 2**: Fit an XGBoost model on a synthetic dataset, plot feature importances, and evaluate performance using precision, recall, and ROC-AUC.

- *Hint*: Use `XGBClassifier` on `make_classification`, plot feature importances using `plot_importance`, and evaluate with `precision_score`, `recall_score`, and `roc_auc_score`.

**Exercise 3**: Use cross-validation to find the optimal number of boosting rounds for an XGBoost classifier on a synthetic dataset. Measure performance using cross-validated accuracy and F1-score.

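- *Hint*: Use `make_classification` and one of two routes. The `cv` function from `xgboost` takes booster parameters and a `DMatrix` rather than an `XGBClassifier`, and reports a metric for each boosting round. Alternatively, use `cross_validate` from `sklearn.model_selection` with `XGBClassifier` over a range of `n_estimators` values and `scoring=['accuracy', 'f1']` to obtain cross-validated accuracy and F1-score directly.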

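**Exercise 4**: Adjust the learning rate in XGBoost and observe its impact on accuracy (for a classifier) and mean squared error (MSE, for a regressor).

- *Hint*: For regression, generate data with `make_regression`, train `XGBRegressor` with different `learning_rate` values, and measure performance with `mean_squared_error`. For classification, repeat with `make_classification` and `XGBClassifier`, and measure performance with `accuracy_score`, which does not apply to continuous regression outputs.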

**Exercise 5**: Implement early stopping in XGBoost on a synthetic dataset and evaluate using precision and F1-score.

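- *Hint*: Use `make_classification`, hold out a validation set, and fit `XGBClassifier` with `early_stopping_rounds` set (a constructor argument in recent xgboost releases) and the validation data passed through `eval_set` in `fit`. Use `precision_score` and `f1_score` on a separate test set to evaluate the early-stopped model.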

---

### **5. Trees, Forests, Bagging, and Boosting**

**Exercise 1**: Implement a bagged ensemble of decision trees on a synthetic dataset, and compare its performance to a single tree using accuracy and F1-score.

- *Hint*: Use `BaggingClassifier` with `DecisionTreeClassifier` as the base estimator on `make_classification`. Measure ensemble vs. single tree performance using `accuracy_score` and `f1_score`.
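
A sketch of the ensemble-versus-single-tree comparison follows. The base tree is passed positionally because the keyword for it was renamed between scikit-learn releases; the ensemble size of 50 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Each of the 50 trees is trained on a bootstrap sample of the training set;
# the ensemble prediction aggregates the individual trees.
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
).fit(X_train, y_train)

scores = {}
for name, model in (("single tree", single), ("bagging", bag)):
    y_pred = model.predict(X_test)
    scores[name] = (accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))
    print(f"{name}: accuracy={scores[name][0]:.3f} f1={scores[name][1]:.3f}")
```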

**Exercise 2**: Fit a random forest on a synthetic dataset and analyze feature importances. Measure performance using precision and recall.

- *Hint*: Use `RandomForestClassifier` on `make_classification`. Plot feature importances using `feature_importances_` and evaluate with `precision_score` and `recall_score`.
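
The feature importance analysis might begin as sketched below, with the ranking printed instead of plotted. The forest size and the dataset shape are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=10, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances, averaged over trees and normalized to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first
for j in ranking:
    print(f"feature {j}: importance={importances[j]:.3f}")

y_pred = forest.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```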

**Exercise 3**: Train an AdaBoost model with decision stumps on a synthetic dataset and measure accuracy improvement over rounds.

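- *Hint*: Use `make_classification` and fit `AdaBoostClassifier` with `estimator=DecisionTreeClassifier(max_depth=1)` (the argument was named `base_estimator` before scikit-learn 1.2). Track accuracy improvement over rounds by applying `accuracy_score` to each output of `staged_predict`.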
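
Tracking accuracy over boosting rounds can be sketched as below. The stump is passed positionally to sidestep the keyword rename across scikit-learn versions; 50 rounds is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A decision stump is a tree with a single split (max_depth=1).
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
).fit(X_train, y_train)

# staged_predict yields the ensemble prediction after each boosting round.
staged_acc = [accuracy_score(y_test, pred) for pred in ada.staged_predict(X_test)]

for round_idx in (0, len(staged_acc) // 2, len(staged_acc) - 1):
    print(f"round {round_idx + 1}: test accuracy={staged_acc[round_idx]:.3f}")
```

Boosting can stop before the requested number of rounds, so the length of `staged_acc` is read from the list and not assumed to be 50.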

**Exercise 4**: Compare bagging and boosting on a synthetic dataset, evaluating each using accuracy and recall.

- *Hint*: Use `BaggingClassifier` and `AdaBoostClassifier` on `make_classification` and compare with `accuracy_score` and `recall_score`.

**Exercise 5**: Implement a basic gradient boosting model on a synthetic dataset, measuring the reduction in MSE at each boosting stage.

- *Hint*: Use `make_regression`, fit `GradientBoostingRegressor`, and compute `mean_squared_error` on each output of `staged_predict` to observe the error at each boosting stage.
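
A minimal sketch of the per-stage error curve follows. The number of stages, the learning rate, and the tree depth are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
).fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
test_mse = [mean_squared_error(y_test, pred) for pred in gbr.staged_predict(X_test)]

for stage in (1, 10, 50, 100):
    print(f"stage {stage}: test MSE={test_mse[stage - 1]:.1f}")
```

Plotting `test_mse` against the stage index shows where additional trees stop helping on held-out data.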

