[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mouryarahul/7CS107_PracticalWorks_Assignment/blob/master/Week5_Assignment.ipynb)

# 7CS107: Advanced AI and Machine Learning — Programming Assignment (Week 5)

**Topic coverage:** Decision Theory, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), model evaluation.

**Total marks:** 100 (5 questions × 20 marks)

**Allowed libraries:** NumPy, SciPy, scikit-learn, pandas, matplotlib/seaborn, and Python standard library.

**Submission:** Submit this **executed** notebook (.ipynb) on the Canvas with **all outputs visible**.

> **Academic integrity:** Work must be your own. Cite any external sources. You may discuss general ideas, but code must be written independently.

---

## How to work through this assignment (step-by-step)

1. **Read the task** in the markdown cell before each question.
2. **Open the code cell** that contains a function with a `# TODO` block.
3. **Implement only inside `# TODO`** (do not change function names or signatures).
4. **Run the code cell** to define your function.
5. **Run the test cell** that follows. The test cell:
   - Creates or loads data,
   - Calls your function,
   - Computes metrics,
   - Checks thresholds using `assert` statements.
6. If a test fails:
   - Read the **inline comments** (every line in the test cells is commented to explain intent),
   - Print intermediate values if needed,
   - Refine your code and re-run.
7. **Keep code concise** (a few lines are sufficient). You may add brief comments to explain your reasoning.
8. **Save and re-run `Kernel > Restart & Run All`** before submission to ensure a clean run.

### Marking scheme (per question, 20 marks)
- **Implementation correctness (12 marks):** passes tests; follows the requested approach.
- **Result quality (5 marks):** metrics meet/beat thresholds in the tests (or are well-justified if borderline).
- **Code quality (3 marks):** clear, concise, readable; uses appropriate library functions.

### Datasets used (auto-downloaded by scikit-learn)
- **20 Newsgroups (text)**: `sklearn.datasets.fetch_20newsgroups`
- **Breast Cancer Wisconsin**: `sklearn.datasets.load_breast_cancer`
- **Iris**: `sklearn.datasets.load_iris`

For dataset documentation, see scikit-learn docs (no manual download needed).


In [None]:
# Standard imports used across questions
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal

from sklearn.datasets import fetch_20newsgroups, load_breast_cancer, load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import (confusion_matrix, classification_report, f1_score, accuracy_score,
                             roc_auc_score, roc_curve)
import matplotlib.pyplot as plt
import seaborn as sns

# Global random seed for reproducibility across tests
np.random.seed(42)


## Q1 (20 marks) — Decision Theory: Bayes Optimal (Cost-Sensitive) Classifier for Gaussian Models

We consider a two-class problem with **known** class-conditional densities and class priors:
- $x\mid y=k \sim \mathcal{N}(\mu_k,\Sigma_k)$,
- Priors $(\pi_0,\pi_1)$,
- Optional **cost matrix** $C$ where $C[i,j]$ is the cost of predicting class $j$ when the true class is $i$.

### Your task
1. Compute the **posterior** $P(y=k\mid x)$ using Bayes' rule. Use **log-densities** for numerical stability.
2. If a cost matrix is supplied, choose the class that **minimizes expected risk**:
   $$\hat{y}(x)=\arg\min_j \sum_i C[i,j] \cdot P(y=i\mid x).$$
   If `cost_matrix=None`, perform **MAP** (argmax posterior).

### Implementation tips
- Use `scipy.stats.multivariate_normal.logpdf` for $\log p(x\mid y=k)$.
- Convert log-posteriors to probabilities with a log-sum-exp trick: subtract the row-wise max, exponentiate, and normalize.
- Return a 1D array of predicted class indices of shape `(n,)`.


In [None]:
from typing import Sequence, Optional

def bayes_decision_gaussian(X: np.ndarray,
                            priors: Sequence[float],
                            mus: Sequence[np.ndarray],
                            covs: Sequence[np.ndarray],
                            cost_matrix: Optional[np.ndarray] = None) -> np.ndarray:
    """Bayes (cost-sensitive) classifier for Gaussian class-conditionals.

    Parameters
    ----------
    X : (n, d) array of inputs.
    priors : length-K sequence of class priors that sum to 1.
    mus : list of K arrays of shape (d,) (class means).
    covs : list of K arrays of shape (d,d) (class covariances).
    cost_matrix : (K,K) array or None. If None, do MAP; else minimize expected risk.

    Returns
    -------
    y_pred : (n,) array of predicted class indices (0..K-1)
    """
    X = np.asarray(X)
    K = len(priors)
    n = X.shape[0]

    # TODO: Compute unnormalized log-posteriors: log p(x|y=k) + log pi_k
    log_posts = np.zeros((n, K))

    # TODO: Normalize to get posteriors using log-sum-exp stabilization
    # subtract row-wise max
    # exponentiate stabilized values
    # row-wise normalization to 1

    if cost_matrix is None:
        # return MAP decision

    # TODO: Expected risk for predicting class j: sum_i C[i,j] * P(y=i|x)
    # compute expected risks for all classes
    # return minimize expected risk

In [None]:
# === Tests for Q1 ===
np.random.seed(0)  # set seed for this test to make it reproducible

# --- Create a synthetic 2D Gaussian dataset for two classes ---
mu0 = np.array([0.0, 0.0])                # mean of class 0
mu1 = np.array([2.0, 2.0])                # mean of class 1
Sigma0 = np.array([[1.0, 0.2],            # covariance of class 0
                   [0.2, 1.0]])
Sigma1 = np.array([[1.0, -0.3],           # covariance of class 1
                   [-0.3, 1.2]])
priors = [0.6, 0.4]                       # class priors p(y=0)=0.6, p(y=1)=0.4

# Sample points from each class
n0, n1 = 90, 30                          # number of samples per class
X0 = np.random.multivariate_normal(mu0, Sigma0, size=n0)  # samples of class 0
X1 = np.random.multivariate_normal(mu1, Sigma1, size=n1)  # samples of class 1
X = np.vstack([X0, X1])                    # stack into a single dataset (n, d)
y_true = np.hstack([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])  # true labels (n,)

# --- Define a cost matrix to penalize false negatives more heavily ---
# cost[i,j] = cost of predicting j when true class is i
cost = np.array([[0.0, 1.0],               # predicting 1 when true is 0 costs 1
                 [5.0, 0.0]])              # predicting 0 when true is 1 costs 5 (more serious)

# --- Call the student's function in two modes: cost-sensitive and MAP ---
y_pred_cost = bayes_decision_gaussian(X, priors, [mu0, mu1], [Sigma0, Sigma1], cost_matrix=cost)  # minimize expected risk
y_pred_map  = bayes_decision_gaussian(X, priors, [mu0, mu1], [Sigma0, Sigma1], cost_matrix=None)  # MAP (no costs)

# --- Compute confusion matrices to see error patterns ---
conf_cost = confusion_matrix(y_true, y_pred_cost, labels=[0, 1])  # confusion for cost-sensitive predictions
conf_map  = confusion_matrix(y_true, y_pred_map,  labels=[0, 1])  # confusion for MAP predictions

# --- Compute empirical expected risk for each strategy ---
# Multiply elementwise by the cost matrix and average by number of samples
risk_cost = (conf_cost * cost).sum() / len(X)
risk_map  = (conf_map  * cost).sum() / len(X)

# --- Display diagnostics ---
print("Confusion (cost-sensitive):", conf_cost)
print("Confusion (MAP):", conf_map)
print(f"Empirical expected risk (cost-sensitive): {risk_cost:.3f}")
print(f"Empirical expected risk (MAP):           {risk_map:.3f}")

# --- Assertion: cost-sensitive decision should not have higher expected risk than MAP here ---
assert risk_cost <= risk_map + 1e-6, "Cost-sensitive decision should not yield higher expected risk than MAP here."
print("[Q1] Tests passed ✅")


Confusion (cost-sensitive): [[74 16]
 [ 1 29]]
Confusion (MAP): [[86  4]
 [ 5 25]]
Empirical expected risk (cost-sensitive): 0.175
Empirical expected risk (MAP):           0.242
[Q1] Tests passed ✅


## Q2 (20 marks) — Naive Bayes on Text (20 Newsgroups)

Train a **Multinomial Naive Bayes** classifier on a **2-class** subset of the 20 Newsgroups dataset (default: `sci.space` vs `rec.autos`).

### Your task
1. Load train **and** test splits for the chosen categories with `fetch_20newsgroups` (remove headers/footers/quotes).
2. Vectorize text using `CountVectorizer(min_df=2)` (bag-of-words counts).
3. Fit `MultinomialNB(alpha=alpha)` on training data and predict on test data.
4. Return **macro F1** on the test set.

### Tips
- Keep code compact (4–6 lines inside the function is enough).
- Use provided defaults: `categories=("sci.space","rec.autos")`, `alpha=1.0`.


In [6]:
def train_nb_20ng(categories=("sci.space", "rec.autos"), alpha: float = 1.0) -> float:
    """Train MultinomialNB on a two-class 20NG subset and return macro F1 on the test set."""
    # TODO: load train & test data for the specified categories
    data_train = fetch_20newsgroups(subset='train', categories=list(categories), remove=('headers', 'footers', 'quotes'))
    data_test  = fetch_20newsgroups(subset='test',  categories=list(categories), remove=('headers', 'footers', 'quotes'))

    # TODO: vectorize text and fit NB
    # Use CountVectorizer with min_df=2 to vectorize the text data
    # Fit and transform the training data
    # Transform the test data
    vectorizer = CountVectorizer(min_df=2)
    X_train = vectorizer.fit_transform(data_train.data)
    X_test  = vectorizer.transform(data_test.data)
    # create Multinomial Naive Bayes classifier with given alpha
    # fit the model
    # predict on test data
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X_train, data_train.target)
    y_pred = clf.predict(X_test)
    # TODO: compute macro F1
    # return macro F1 score
    return f1_score(data_test.target, y_pred, average='macro')

In [7]:
# === Tests for Q2 ===
# Train on the default categories and evaluate macro F1
f1 = train_nb_20ng()                                     # run train_nb_20ng function with defaults
print(f"Macro F1: {f1:.3f}")                             # display the macro-averaged F1 score

# Minimal performance threshold for this simple baseline
assert f1 >= 0.75, "F1 should be at least 0.75 on this binary subset with bag-of-words + NB."
print("[Q2] Tests passed ✅")


Macro F1: 0.898
[Q2] Tests passed ✅


## Q3 (20 marks) — Logistic Regression (Breast Cancer) with ROC–AUC

Train a **Logistic Regression** classifier on the Breast Cancer Wisconsin dataset. Use a train/test split (stratified), a standardization step, and report **ROC–AUC** on the test set.

### Your task
1. Split into train/test with `train_test_split(..., stratify=y, test_size=0.25, random_state=42)`.
2. Build a pipeline: `StandardScaler()` → `LogisticRegression(max_iter=1000, solver='liblinear')`.
3. Return **ROC–AUC** on the test set using predicted probabilities.

### Tips
- The `solver='liblinear'` works well for smaller datasets; keep defaults unless you experiment.
- Expose `C` and `class_weight` as parameters to the function and pass them into the model.


In [10]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
def logreg_breast_cancer(C: float = 1.0, class_weight=None) -> float:
    """Train LogisticRegression on breast cancer dataset and return ROC–AUC on test set."""
    # TODO: load data and split
    # load breast cancer dataset
    # split into train and test sets
    data = load_breast_cancer()
    X = data.data
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y
    )
    # TODO: pipeline and fit
    # create pipeline with scaling and logistic regression
    # fit the model
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    clf = LogisticRegression(C=C, class_weight=class_weight, solver='liblinear', max_iter=1000, random_state=42)
    clf.fit(X_train_s, y_train)
    # TODO: ROC–AUC on test set
    # probability of positive class
    #  return ROC-AUC score
    y_prob = clf.predict_proba(X_test_s)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    return auc

In [11]:
# === Tests for Q3 ===
auc = logreg_breast_cancer()                          # call logreg_breast_cancer with defaults
print(f"Test ROC–AUC: {auc:.3f}")                     # display the ROC–AUC on the test split

# Require a strong baseline with proper scaling (typical performance is high on this dataset)
assert auc >= 0.95, "ROC–AUC should be at least 0.95 on Breast Cancer with scaling + LR."
print("[Q3] Tests passed ✅")


Test ROC–AUC: 0.998
[Q3] Tests passed ✅


## Q4 (20 marks) — RBF SVM (Iris) with Cross-Validation

Train an **RBF-kernel SVM** on the Iris dataset using a pipeline (scaling + SVC). Report **mean accuracy** using stratified 5-fold cross-validation.

### Your task
1. Build pipeline: `StandardScaler()` → `SVC(kernel='rbf', C=C, gamma=gamma)`.
2. Compute **mean accuracy** via stratified K-fold CV (default `cv=5`, shuffled with `random_state=42`).
3. Return the **mean** CV accuracy.

### Tips
- Use `cross_val_score` with `StratifiedKFold` to ensure class balance across folds.
- Keep defaults unless you want to experiment with `C` and `gamma`.


In [12]:
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
import numpy as np
def svm_rbf_iris(C: float = 1.0, gamma='scale', cv: int = 5) -> float:
    """Train SVC with RBF kernel on Iris and return mean CV accuracy."""
    # TODO: load data and split
    # create pipeline with scaling and SVC
    # perform stratified CV
    # return mean accuracy
    data = load_iris()
    X, y = data.data, data.target

    model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=C, gamma=gamma, random_state=42))

    cv_splitter = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv_splitter, scoring='accuracy')

    return scores.mean()


In [13]:
# === Tests for Q4 (fully commented) ===
acc = svm_rbf_iris()                              # run svm_rbf_iris with defaults
print(f"Mean CV accuracy: {acc:.3f}")             # display mean accuracy across folds

# The Iris dataset is clean and low-dimensional; RBF SVM typically achieves high accuracy
assert acc >= 0.95, "Mean CV accuracy should be at least 0.95 on Iris with RBF SVM."
print("[Q4] Tests passed ✅")


Mean CV accuracy: 0.960
[Q4] Tests passed ✅


## Q5 (20 marks) — Nested Cross-Validation for SVM (Breast Cancer)

Perform **nested cross-validation** to estimate the generalization performance of an RBF SVM on the Breast Cancer dataset while tuning hyperparameters $(C, \gamma)$.

### Your task
1. **Outer loop**: `StratifiedKFold(outer_k, shuffle=True, random_state=42)` splits data into train/test.
2. **Inner loop**: `GridSearchCV` over a parameter grid for `C` and `gamma` with `StratifiedKFold(inner_k, shuffle=True, random_state=123)`.
3. For each outer split, fit the inner **grid search** on the outer-train portion, then evaluate the **best model** on the outer-test portion.
4. Return `(mean_acc, std_acc)` across all outer folds.

### Tips
- Build pipeline: `StandardScaler()` → `SVC(kernel='rbf')` and name parameters in grid as `svc__C`, `svc__gamma`.
- Keep grids small for runtime: defaults `(0.1, 1, 10)` for `C`, and `('scale', 0.01, 0.1, 1.0)` for `gamma`.


In [24]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score
import numpy as np
def nested_cv_svm_breast(outer_k: int = 3, inner_k: int = 3,
                         Cs = (0.1, 1, 10), gammas = ('scale', 0.01, 0.1, 1.0)):
    "Perform nested CV with SVM on breast cancer and return mean and std of outer accuracies."""
    # TODO: load data
    ds = load_breast_cancer()
    X, y = ds.data, ds.target

    #TODO: perform outer CV (stratified)
    outer_cv = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=42)
    outer_scores = []

    # TODO: for each outer fold, perform inner CV grid search and evaluate on outer test fold
    for train_idx, test_idx in outer_cv.split(X, y):
        # Split data for this outer fold
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]

        # Perform Inner CV grid search on the training portion of the outer fold
        model = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=42))
        param_grid = {'svc__C': Cs, 'svc__gamma': gammas}
        inner_cv = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=42)
        grid = GridSearchCV(model, param_grid, cv=inner_cv, scoring='accuracy')
        grid.fit(X_tr, y_tr)
        # Evaluate the best model found in inner loop on the held-out outer test split
        best_model = grid.best_estimator_
        y_pred = best_model.predict(X_te)
        outer_scores.append(accuracy_score(y_te, y_pred))

    # return mean and std of outer accuracies
    return {np.mean(outer_scores), np.std(outer_scores)}


In [25]:
# === Tests for Q5 ===
# Run nested CV with default outer/inner folds and parameter grids
mean_acc, std_acc = nested_cv_svm_breast()                      # returns mean and std accuracy across outer folds
print(f"Outer mean accuracy: {mean_acc:.3f} ± {std_acc:.3f}")   # display aggregate performance

# Expect strong performance on Breast Cancer with proper scaling and model selection
assert mean_acc >= 0.93, "Outer mean accuracy should be at least 0.93 for SVM on Breast Cancer."
print("[Q5] Tests passed ✅")

Outer mean accuracy: 0.975 ± 0.013
[Q5] Tests passed ✅
