# MSDA 9213: Data Mining 

### 1. Discuss the difference between supervised and unsupervised methods

- **Supervised Learning** involves labeled data. The algorithm learns a mapping from inputs (features) to outputs (labels). Examples: classification, regression.
- **Unsupervised Learning** involves unlabeled data. The algorithm tries to identify hidden patterns or groupings. Examples: clustering, dimensionality reduction.

---

### 2. Describe two real-life applications:
#### i. Classification
- **Application**: Spam detection in emails.
- **Response**: Binary variable (Spam or Not Spam).
- **Predictors**: Email content, sender, subject, etc.
- **Goal**: Prediction — we want to predict the category of new emails.

#### ii. Regression
- **Application**: Predicting house prices.
- **Response**: Continuous variable (Price).
- **Predictors**: Size, location, number of bedrooms, etc.
- **Goal**: Prediction — we estimate a continuous outcome.

---

### 3. Why is Naïve Bayes so “naïve”?

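Naïve Bayes assumes that the predictors are **conditionally independent** given the class label, which is rarely true in real data. This “naïve” assumption simplifies computation, because each predictor's class-conditional distribution can be estimated separately, and the classifier often works well even when the assumption is violated.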
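
Written out, with $X_1, \dots, X_p$ denoting the predictors and $k$ a class label (notation assumed here for illustration), the assumption lets the posterior factor into one term per predictor:

$$P(Y = k \mid X_1, \dots, X_p) \propto P(Y = k) \prod_{j=1}^{p} P(X_j \mid Y = k)$$

Each factor $P(X_j \mid Y = k)$ can be estimated from the training data one predictor at a time, which is where the computational saving comes from.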

---

### 4. What is the difference between KNN and K-means?

| Feature           | KNN (K-Nearest Neighbors) | K-Means Clustering |
|------------------|---------------------------|--------------------|
| Type             | Supervised Learning       | Unsupervised Learning |
| Goal             | Classification or regression | Clustering |
| Input Required   | Labeled data              | Unlabeled data |
| Output           | Class label or prediction | Cluster assignments |
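
The contrast in the table can be made concrete with a minimal sketch in plain Python. The toy points, the labels, and the explicit `init_centroids` argument are assumptions made for illustration; in practice the initial centroids are usually chosen at random.

```python
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: classify `query` by majority vote among its k nearest labeled points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = [train_labels[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)

def kmeans(points, init_centroids, n_iter=10):
    """Unsupervised: group unlabeled points around k centroids (no labels used)."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                       for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Toy data (hypothetical): two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

knn_label = knn_predict(points, labels, query=(1.1, 1.0), k=3)    # needs labels
clusters = kmeans(points, init_centroids=[points[0], points[3]])  # ignores labels
```

Note that `knn_predict` cannot run without `labels`, while `kmeans` never sees them and returns arbitrary cluster indices rather than class names.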

---

### 5. When is ridge regression favorable over Lasso regression?

- Ridge regression is preferred when **many predictors have small/medium effects**, and we want to **shrink coefficients** but not eliminate them.
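- Unlike Lasso, Ridge **does not perform variable selection**, but it handles multicollinearity better: correlated predictors have their coefficients shrunk together, whereas Lasso tends to keep one of them and set the others to zero.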
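
The difference in shrinkage behavior is easiest to see in a simplified special case. The sketch below assumes an orthonormal design matrix and the objectives RSS + λ·Σβ² (ridge) and RSS + λ·Σ|β| (lasso); under those assumptions each coefficient has a closed form in terms of its least-squares estimate. The example coefficients are made up.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding by lam / 2."""
    if abs(beta_ols) <= lam / 2.0:
        return 0.0  # small effects are removed entirely
    return beta_ols - lam / 2.0 if beta_ols > 0 else beta_ols + lam / 2.0

# Hypothetical least-squares estimates: one large effect, two small ones.
ols = [3.0, 0.4, -0.2]
ridge = [ridge_shrink(b, lam=1.0) for b in ols]
lasso = [lasso_shrink(b, lam=1.0) for b in ols]
```

Here ridge keeps all three coefficients (each halved), while lasso keeps only the large one. When many small effects are genuinely present, discarding them as lasso does can hurt prediction, which is the situation where ridge is favorable.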

---

### 6. What is a confusion matrix and how does it work?

A confusion matrix is a performance summary for classification models. It compares actual vs predicted classes:

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| Actual Positive | True Positive (TP)   | False Negative (FN) |
| Actual Negative | False Positive (FP)  | True Negative (TN)  |

It helps compute metrics like accuracy, precision, recall, and F1-score.
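
As a minimal sketch of how those metrics follow from the four cells, the code below counts TP, FP, FN, and TN for a binary problem and derives each metric. The label vectors are invented for illustration, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the cell counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
```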

---

### 7. Cross-validation

#### i. How k-fold cross-validation is implemented:
- The dataset is randomly split into *k* folds of approximately equal size.
- The model is trained on *k−1* folds and tested on the remaining fold.
- This process is repeated *k* times, each fold serving once as the test set.
- The performance is averaged across all k trials.
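
The steps above can be sketched in plain Python. The `fit` and `score` callables, and the mean-predicting model used to exercise them, are hypothetical stand-ins for whatever learning method and error metric are being evaluated.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / k

# Hypothetical model: always predict the training mean; score by mean squared error.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
```

Setting `k` equal to the number of observations turns the same loop into LOOCV.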

#### ii. Advantages and disadvantages:

**a. Compared to the validation set approach:**
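- **Advantage**: Uses the data more efficiently. Every observation is used for both training and testing, so the test-error estimate is less variable and depends less on one particular random split.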
- **Disadvantage**: More computationally intensive.

**b. Compared to LOOCV (Leave-One-Out Cross-Validation):**
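- **Advantage**: Less computation than LOOCV (*k* model fits instead of *n*), and typically lower variance in the test-error estimate, because the LOOCV training sets overlap almost entirely and their fitted models are highly correlated.
- **Disadvantage**: Slightly higher bias than LOOCV, since each model is trained on fewer observations.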

---

### 8. Estimating standard deviation of prediction

To estimate the standard deviation of our prediction (also known as the standard error of the prediction, $SE(\hat{Y}_0)$) for a particular value of the predictor $X_0$, we consider two main sources of variability: uncertainty in the model's parameters and irreducible error.

1.  **For Parametric Models (e.g., Linear Regression):**
    For models with an explicit mathematical form, analytical formulas are typically available. For a simple linear regression predicting $Y$ at $X_0$:

    $$SE(\hat{Y}_0) = \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \right)}$$

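    Here, $\hat{\sigma}^2$ estimates the variance of the irreducible error, which enters through the leading 1 inside the parentheses. The remaining two terms account for the uncertainty in the estimated model parameters, and that uncertainty grows as $X_0$ moves away from $\bar{X}$. Statistical software usually provides these values directly.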

2.  **For Non-Parametric or Complex Models (e.g., Tree-based methods):**
    For models without simple analytical solutions, resampling methods like the **bootstrap** are used:
    * **Generate Bootstrap Samples:** Create many new datasets by sampling with replacement from the original training data.
    * **Train and Predict:** Train your statistical learning method on each bootstrap sample and make a prediction for $X_0$. This yields multiple predictions ($\hat{Y}_{0,1}, \hat{Y}_{0,2}, \dots, \hat{Y}_{0,B}$).
    * **Calculate Standard Deviation:** The standard deviation of these $B$ predictions is then used as the estimate for the standard deviation of the prediction. This captures the variability that comes from fitting the model to a particular training sample; it does not include the irreducible error of a new observation.
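
The bootstrap procedure can be sketched as follows. A simple least-squares line (`fit_line`) stands in for the statistical learning method, and the data, `n_boot`, and `seed` values are assumptions chosen for illustration; any fitting routine could be substituted.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (stand-in learning method)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

def bootstrap_prediction_sd(xs, ys, x0, n_boot=500, seed=0):
    """Standard deviation of the prediction at x0 across bootstrap refits."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    while len(predictions) < n_boot:
        # Sample n observations with replacement from the training data.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: slope undefined, draw again
        intercept, slope = fit_line(bx, by)
        predictions.append(intercept + slope * x0)
    return statistics.stdev(predictions)

# Made-up training data, roughly y = 2x + 1 with some noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.3, 4.8, 7.1, 9.2, 10.7, 13.1, 15.2, 16.8, 19.1, 21.3]
sd_at_x0 = bootstrap_prediction_sd(xs, ys, x0=5.5)
```

If the training data lie exactly on a line, every refit returns the same prediction and the estimated standard deviation is essentially zero, which shows that this estimate reflects model uncertainty only.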

---

### 9. Two methods for variable selection and how they work

1. **Forward Selection**:
   - Start with no variables.
   - At each step, add the single predictor that most improves model performance (for example, the one giving the largest reduction in residual sum of squares).
   - Stop when adding more variables doesn’t significantly improve the model.

2. **Lasso Regression**:
   - Adds an L1 penalty (the sum of the absolute values of the coefficients) to the loss function.
   - Shrinks some coefficients to exactly zero, effectively performing variable selection.
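
Forward selection for a least-squares model can be sketched in plain Python. The stopping rule used here (`min_gain`, a minimum fraction of the total sum of squares that a new predictor must explain) is a hypothetical choice; in practice criteria such as a p-value, adjusted R², or cross-validated error are common. The data are constructed so that the response depends on the first two columns only.

```python
def forward_selection(X, y, min_gain=0.01):
    """Greedy forward selection; X is a list of rows. Returns column indices in order added."""
    n, p = len(X), len(X[0])

    def centered(v):
        mean = sum(v) / n
        return [a - mean for a in v]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Centering the response and every column accounts for the intercept.
    resid = centered(y)
    cols = {j: centered([row[j] for row in X]) for j in range(p)}
    total_ss = dot(resid, resid)
    selected = []
    while cols:
        # Reduction in RSS from adding each remaining candidate to the current model.
        gains = {j: dot(z, resid) ** 2 / dot(z, z)
                 for j, z in cols.items() if dot(z, z) > 1e-12}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= min_gain * total_ss:
            break  # no candidate improves the fit enough
        z = cols.pop(best)
        zz = dot(z, z)
        selected.append(best)
        # Remove the chosen direction from the residual and from the other candidates,
        # so later gains measure improvement beyond the variables already in the model.
        coef = dot(z, resid) / zz
        resid = [r - coef * a for r, a in zip(resid, z)]
        for j in list(cols):
            proj = dot(cols[j], z) / zz
            cols[j] = [a - proj * b for a, b in zip(cols[j], z)]
    return selected

# Constructed data: y depends on columns 0 and 1; column 2 is irrelevant.
X = [[1, 1, 2], [2, -1, 1], [3, 1, 0], [4, -1, 0], [5, 1, 1], [6, -1, 2]]
y = [2 * row[0] + 0.5 * row[1] for row in X]
chosen = forward_selection(X, y)
```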

---


## Decision tree
Decision trees consist of nodes (root, internal, leaf) representing tests or outcomes, and branches representing the test results. Pruning is essential to prevent overfitting, improve interpretability, and reduce computational cost by simplifying the tree. Decision trees are attractive for classification due to their high interpretability, ability to handle various data types without extensive pre-processing, and capacity to capture non-linear relationships.
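
The structure described above can be illustrated with a small hand-built tree. The feature names, thresholds, and labels are hypothetical (they reuse the spam example from earlier), and the pruned tree is written by hand rather than produced by a pruning algorithm.

```python
# Each internal node holds a test; each leaf holds an outcome.
tree = {
    "feature": "contains_link", "threshold": 0.5,            # root node
    "left": {"label": "not spam"},                           # leaf
    "right": {
        "feature": "num_exclamations", "threshold": 3,       # internal node
        "left": {"label": "not spam"},                       # leaf
        "right": {"label": "spam"},                          # leaf
    },
}

def predict(node, sample):
    """Follow the branch chosen by each test until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def count_leaves(node):
    """Tree size, a simple measure of complexity."""
    if "label" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Pruning replaces a subtree with a single leaf, giving a simpler tree.
pruned = {**tree, "right": {"label": "spam"}}

email = {"contains_link": 1, "num_exclamations": 5}
label = predict(tree, email)
```

The pruned tree has one fewer leaf and one fewer test, which is the trade of a little training accuracy for a simpler, more interpretable model.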

Improving an existing rule-based classifier with a data-driven approach involves collecting and preparing a large dataset, selecting an appropriate machine learning model, training it on the data, and then rigorously evaluating and tuning its performance using a separate validation set. To test the validity of the new model, assess its performance quantitatively on an independent test set using metrics such as accuracy, precision, and recall, and compare the results directly against the existing rule-based system. Ideally, domain experts should also review the model qualitatively to confirm its real-world applicability and interpretability.

## Parameter tuning

Parameter tuning, also known as hyperparameter tuning, is the process of finding the best combination of settings (hyperparameters) for a machine learning model that results in optimal performance on a given task.
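
One common way to carry this out is a grid search scored on held-out data. The sketch below tunes the number of neighbors *k* for a one-dimensional KNN classifier; the data (including one deliberately noisy training point at 2.5), the candidate values, and the use of validation accuracy as the score are all assumptions made for illustration.

```python
def knn_predict(train, query, k):
    """train is a list of (x, label) pairs; vote among the k nearest x values."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)

def validation_accuracy(train, valid, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == label for x, label in valid)
    return hits / len(valid)

def grid_search(train, valid, candidates):
    """Score every candidate hyperparameter value and return the best one."""
    scores = {k: validation_accuracy(train, valid, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up data: class "a" at low x, class "b" at high x, one noisy "b" at 2.5.
train = [(1.0, "a"), (2.0, "a"), (2.5, "b"), (3.0, "a"),
         (6.0, "b"), (7.0, "b"), (8.0, "b")]
valid = [(2.6, "a"), (1.5, "a"), (6.5, "b"), (7.5, "b")]

best_k, scores = grid_search(train, valid, candidates=[1, 3, 7])
```

On this data `k=1` follows the noisy point, `k=7` always predicts the majority class, and the intermediate value scores best on the validation set. The hyperparameter is chosen on validation data rather than training data so that the selected value reflects performance on unseen observations.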