# Quantum Feature Selection — Detailed Notes (Session 11)
**Course:** CS490/5590 — Quantum Computing Applications in Data Science, AI, & Deep Learning  
**Instructor:** Luke Miller  

> **Purpose.**  These notes expand the slide bullets into a stand-alone reference. You’ll learn what “feature selection” means in the quantum-kernel setting, how to approximate quantum mutual information, how to build hybrid quantum-classical pipelines that choose a small but informative subset of features, and how to prototype everything in Qiskit Machine Learning.  Mini-exercises (with brief answers) are included at the end.

---

## Session Road-map  
1. Recap: quantum kernels & QSVMs  
2. Why feature selection? Challenges in high-dimensional QML  
3. Quantum mutual information (QMI) as a relevance score  
4. Quantum-kernel–based scoring (individual & joint)  
5. Circuits for QMI / kernel estimation (swap test, overlap)  
6. Hybrid pipeline: quantum scoring → classical ranking  
7. Qiskit demo (breast-cancer dataset, top-k selection)  
8. Noise analysis & mitigation strategies  
9. Q&A  

---

## 0) Why bother with feature selection in quantum ML?  

| Pain-point | Consequence | Feature-selection remedy |
|------------|-------------|--------------------------|
| # qubits = # selected features | Device size caps the feature count (e.g., 127 qubits on IBM Eagle-class chips) | Drop irrelevant features |
| Kernel matrix $K_{ij}$ costs $O(N^2)$ circuits | Expensive for large $N$ or high shot counts | Smaller feature set → shorter circuit per kernel entry |
| Noise grows with circuit depth | Kernel / MI estimates biased | Fewer qubits → shallower entangling layers |

Goal: **retain predictive power** while respecting NISQ resource bounds.

---

## 1) Classical vs quantum mutual information  

### 1.1 Classical MI  
$$
I(X;Y)=\sum_{x,y}p_{XY}(x,y)\log\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}.
$$
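
As a quick numerical check of this definition, the sketch below evaluates a plug-in estimate of $I(X;Y)$ in bits on a toy sample where $Y=X^2$ and $X$ is uniform on $\{-1,0,1\}$. The helper names `mutual_information` and `pearson` are illustrative, not taken from any library. The relation is purely non-linear, so the Pearson coefficient vanishes while the MI does not.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    p_xy = Counter(pairs)
    p_x = Counter(x for x, _ in pairs)
    p_y = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )

def pearson(pairs):
    """Sample Pearson correlation coefficient of the (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

# Toy data: y = x^2 with x uniform on {-1, 0, 1}
samples = [(x, x * x) for x in (-1, 0, 1)] * 100
mi = mutual_information(samples)   # about 0.918 bits, equal to H(Y) here
r = pearson(samples)               # zero: no linear relation
print(f"MI = {mi:.3f} bits, Pearson r = {r:.3f}")
```

Because $Y$ is a deterministic function of $X$, the MI equals $H(Y)$; a linear score would rank this feature as useless.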

### 1.2 Quantum MI  
Given a bipartite state $\rho_{XY}$,
$$
I_Q(X{:}Y)=S(\rho_X)+S(\rho_Y)-S(\rho_{XY}),
$$
where $S(\rho)=-\operatorname{Tr}(\rho\log\rho)$ is the von Neumann entropy.

- Captures *classical* **and** *entanglement* correlations.  
- Reduces to classical MI when $\rho_{XY}$ is a classical mixture, i.e., diagonal in a product basis.
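
The two bullet points can be checked numerically on two-qubit states. The following is a minimal NumPy sketch (function names are illustrative) that evaluates $I_Q$ in bits for a classically correlated mixture and for a Bell state.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues (in bits)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # convention: 0 * log 0 = 0
    return float(-np.sum(evals * np.log2(evals)))

def reduced_states(rho_xy):
    """Partial traces of a two-qubit density matrix (qubit order: X, then Y)."""
    t = rho_xy.reshape(2, 2, 2, 2)        # indices: x, y, x', y'
    rho_x = np.einsum("ijkj->ik", t)      # trace over Y
    rho_y = np.einsum("ijik->jk", t)      # trace over X
    return rho_x, rho_y

def quantum_mi(rho_xy):
    """I_Q(X:Y) = S(rho_X) + S(rho_Y) - S(rho_XY)."""
    rho_x, rho_y = reduced_states(rho_xy)
    return (von_neumann_entropy(rho_x) + von_neumann_entropy(rho_y)
            - von_neumann_entropy(rho_xy))

# Classically correlated mixture: 1/2 |00><00| + 1/2 |11><11|
rho_classical = np.diag([0.5, 0.0, 0.0, 0.5])

# Bell state (|00> + |11>) / sqrt(2)
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho_bell = np.outer(bell, bell.conj())

mi_classical = quantum_mi(rho_classical)   # 1 bit, same as the classical MI
mi_bell = quantum_mi(rho_bell)             # 2 bits, above the classical maximum
print(mi_classical, mi_bell)
```

The Bell state reaches twice the value available to any classical pair of bits, which is the entanglement contribution mentioned above.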

---

## 2) Encoding features into quantum states  

- **Assign one qubit per feature** (or encode a block of features).  
- **Feature map** $U_\phi(x)$ (e.g., `ZZFeatureMap` with depth 1–2).  
- For *single-feature relevance*, compute the state of qubit $i$ conditioned on the label.

**Binary label encoding**: add an extra ancilla qubit $|y\rangle$ or condition classically on the label.
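
To make the "state of qubit $i$ conditioned on the label" concrete, here is a small sketch using classical conditioning. It assumes a plain single-qubit RY angle encoding as a simplified stand-in for `ZZFeatureMap`; `encode_feature` and `class_conditional_state` are hypothetical helper names.

```python
import numpy as np

def encode_feature(x):
    """Assumed angle encoding: RY(x)|0> = cos(x/2)|0> + sin(x/2)|1>."""
    return np.array([np.cos(x / 2), np.sin(x / 2)])

def class_conditional_state(values, labels, target):
    """Average density matrix of the encoded feature over samples with label == target."""
    kets = [encode_feature(x) for x, y in zip(values, labels) if y == target]
    return sum(np.outer(k, k) for k in kets) / len(kets)

x_i = np.array([0.1, 0.2, 3.0, 3.1])     # toy values of one feature (radians)
y = np.array([0, 0, 1, 1])

rho_0 = class_conditional_state(x_i, y, 0)
rho_1 = class_conditional_state(x_i, y, 1)

# Tr(rho_0 rho_1) is the quantity a swap test between the two states estimates.
overlap = float(np.trace(rho_0 @ rho_1))
print(f"overlap between class-conditional states: {overlap:.4f}")
```

A small overlap indicates that the encoded feature separates the two classes well; an overlap near the purities of the individual states indicates that it carries little label information.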

---

## 3) Estimating quantum mutual information  

### 3.1 Entropy via swap test (two copies)  

$$
S_2(\rho)= -\log \operatorname{Tr}(\rho^2)
\;\;\text{(Rényi-2 entropy, cheap to measure).}
$$

The swap-test circuit measures $\operatorname{Tr}(\rho^2)$. Use $S_2$ as a proxy for the von Neumann entropy; since $S_2(\rho)\le S(\rho)$, the proxy is a lower bound.
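
The measurement statistics can be mimicked classically. The sketch below assumes an ideal, noiseless swap test and samples the ancilla outcome directly from $p_0=\tfrac12+\tfrac12\operatorname{Tr}(\rho^2)$ rather than simulating a circuit. `swap_test_purity` and `renyi2_entropy` are illustrative names.

```python
import math
import random

def swap_test_purity(purity, shots, seed=0):
    """Simulate a swap test on two copies of rho with Tr(rho^2) = purity.

    The ancilla reads 0 with probability 1/2 + purity/2; inverting that
    relation turns the observed frequency into a purity estimate.
    """
    rng = random.Random(seed)
    p0 = 0.5 + 0.5 * purity
    zeros = sum(rng.random() < p0 for _ in range(shots))
    return 2.0 * zeros / shots - 1.0

def renyi2_entropy(purity):
    """S_2 = -log2 Tr(rho^2), in bits."""
    return -math.log2(purity)

true_purity = 0.5                        # maximally mixed single qubit
est = swap_test_purity(true_purity, shots=20000)
print(f"estimated purity {est:.3f}, S_2 about {renyi2_entropy(est):.3f} bits")
```

With few shots the estimate can fall to zero or below, in which case the logarithm is undefined; clipping the estimate to the physical range $[1/d, 1]$ before taking the log is one simple guard.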

### 3.2 Workflow for feature $f_i$

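1. Prepare the dataset superposition, entangled with a sample-index register $|k\rangle$,  
   $|\Psi\rangle=\frac1{\sqrt N}\sum_{k}|k\rangle\,|x_{k,i}\rangle\,|y_k\rangle$.  
2. Trace out everything except feature qubit $i$ and label qubit $\ell$ (the index register and any other feature qubits). Without a discarded register the joint state would be pure and $S(\rho_{i\ell})$ would vanish.  
3. Run swap tests to estimate the purities of $\rho_i$, $\rho_\ell$, and the joint state $\rho_{i\ell}$.  
4. Compute the Rényi-2 proxy
   $I_Q(f_i{:}Y)\approx S_2(\rho_i)+S_2(\rho_\ell)-S_2(\rho_{i\ell}).$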

Rank features by $I_Q$.
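
The workflow can be emulated exactly with density matrices. This sketch assumes the reduced state is the mixture $\rho_{i\ell}=\frac1N\sum_k|\psi(x_{k,i})\rangle\langle\psi(x_{k,i})|\otimes|y_k\rangle\langle y_k|$, which is what tracing out a sample-index register would leave. It also assumes a single-qubit RY angle encoding, and it computes purities exactly where hardware would use swap tests. All function names are illustrative.

```python
import numpy as np

def encode(x):
    """Assumed angle encoding of one feature value into one qubit."""
    return np.array([np.cos(x / 2), np.sin(x / 2)])

def renyi2(rho):
    """S_2(rho) = -log2 Tr(rho^2)."""
    return float(-np.log2(np.trace(rho @ rho).real))

def renyi2_mi(feature_values, labels):
    """S_2(rho_i) + S_2(rho_l) - S_2(rho_il) for a binary label."""
    n = len(labels)
    rho_il = np.zeros((4, 4))
    for x, y in zip(feature_values, labels):
        ket = np.kron(encode(x), np.eye(2)[y])      # |psi(x)> tensor |y>
        rho_il += np.outer(ket, ket) / n
    t = rho_il.reshape(2, 2, 2, 2)
    rho_i = np.einsum("ijkj->ik", t)                # trace out the label
    rho_l = np.einsum("ijik->jk", t)                # trace out the feature
    return renyi2(rho_i) + renyi2(rho_l) - renyi2(rho_il)

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    np.where(y == 0, 0.2, 2.9),                           # feature 0 tracks the label
    np.array([0.3, 2.8, 0.3, 2.8, 0.3, 2.8, 0.3, 2.8]),   # feature 1 ignores it
])

scores = [renyi2_mi(X[:, i], y) for i in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]                  # best feature first
print(scores, ranking)
```

Feature 1 is statistically independent of the label, so its joint state factorises and its score is zero up to rounding; feature 0 scores close to one bit.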

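> **Shot count** is set by the target precision (roughly $1/\epsilon^2$ shots for additive error $\epsilon$ on each purity), not directly by the dataset size $N$. The catch is state preparation: loading $N$ samples into superposition generally costs circuit depth that grows with $N$ unless a QRAM-style loader is assumed.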
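
A rough shot budget follows from the binomial statistics of the ancilla. The sketch below assumes an ideal swap test and targets a given standard error on the purity estimate; `shots_for_purity` is a hypothetical helper.

```python
import math

def shots_for_purity(purity, eps):
    """Shots so that the swap-test purity estimate has standard error <= eps.

    The ancilla is a Bernoulli variable with p0 = (1 + purity) / 2, and the
    estimator 2 * p0_hat - 1 has variance (1 - purity**2) / shots.
    """
    return math.ceil((1.0 - purity ** 2) / eps ** 2)

# Budget for a maximally mixed qubit (purity 0.5) at three precision targets
budget = {eps: shots_for_purity(0.5, eps) for eps in (0.05, 0.02, 0.01)}
for eps, shots in budget.items():
    print(f"eps = {eps}: about {shots} shots")
```

Halving $\epsilon$ quadruples the shot count, independent of how many samples were loaded into the superposition.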

---

## 4) Kernel-based scoring alternative  

- Compute the quantum kernel matrix **with and without** feature $f_i$.  
- **Score** $s_i = \text{Accuracy}_\text{all} - \text{Accuracy}_{\text{all}\setminus f_i}$.  
- This is leave-one-feature-out importance applied to the quantum kernel.

Cheaper when the kernel is already being computed, and it avoids entropy estimation; the cost is one extra kernel matrix per candidate feature.
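
A toy version of this score fits in a few lines. Two simplifying assumptions are made and should be replaced in real experiments: the kernel is the closed-form fidelity of an unentangled RY angle-encoding feature map, $k(a,b)=\prod_j\cos^2\!\big((a_j-b_j)/2\big)$, rather than a `ZZFeatureMap` kernel, and a kernel-mean classifier stands in for the QSVM. Function names are illustrative.

```python
import numpy as np

def product_kernel(A, B):
    """Fidelity kernel of the assumed product feature map."""
    diff = A[:, None, :] - B[None, :, :]
    return np.prod(np.cos(diff / 2) ** 2, axis=2)

def accuracy(X_tr, y_tr, X_te, y_te, cols):
    """Kernel-mean classifier: pick the class with the larger mean similarity."""
    K = product_kernel(X_te[:, cols], X_tr[:, cols])
    sims = np.stack([K[:, y_tr == c].mean(axis=1) for c in (0, 1)], axis=1)
    return float(np.mean(np.argmax(sims, axis=1) == y_te))

def lofo_scores(X_tr, y_tr, X_te, y_te):
    """s_i = accuracy(all features) - accuracy(all features except i)."""
    p = X_tr.shape[1]
    full = accuracy(X_tr, y_tr, X_te, y_te, list(range(p)))
    return [full - accuracy(X_tr, y_tr, X_te, y_te, [j for j in range(p) if j != i])
            for i in range(p)]

y_tr = np.array([0, 0, 0, 1, 1, 1])
X_tr = np.column_stack([[0.1, 0.2, 0.3, 2.8, 2.9, 3.0],    # informative feature
                        [0.1, 0.3, 0.5, 0.1, 0.3, 0.5]])   # identical in both classes
y_te = np.array([0, 0, 1, 1])
X_te = np.column_stack([[0.15, 0.25, 2.85, 2.95], [0.2, 0.4, 0.2, 0.4]])

scores = lofo_scores(X_tr, y_tr, X_te, y_te)
print(scores)   # dropping feature 0 costs accuracy; dropping feature 1 does not
```

The loop evaluates one reduced kernel per feature, which is where the extra cost of this approach comes from.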

---

## 5) Hybrid quantum-classical selection loop  

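```python
from sklearn.svm import SVC
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel

# Assumes X (N x p array), y (labels), k (number of features to keep) and a
# user-supplied quantum_MI(feature, data, labels, shots) routine are defined.
p = X.shape[1]                     # p original features
scores = []
for i in range(p):
    score = quantum_MI(feature=i, data=X, labels=y, shots=2048)
    scores.append((i, score))
# Classical ranking: sort by score (descending) and keep the k best features.
top_k = [i for i, _ in sorted(scores, key=lambda t: -t[1])[:k]]

# Downstream model on the reduced feature set.
X_reduced = X[:, top_k]
# FidelityQuantumKernel is the current class name; older qiskit-machine-learning
# releases exposed QuantumKernel(feature_map=..., quantum_instance=backend).
qkernel = FidelityQuantumKernel(feature_map=ZZFeatureMap(len(top_k)))
clf = SVC(kernel=qkernel.evaluate).fit(X_reduced, y)
```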

Classical RFE or LASSO can replace the MI scoring if desired; the quantum part then supplies only the kernel.

---

## 6) Qiskit experiment — breast-cancer dataset (mini demo)

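```python
import numpy as np
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Load with scikit-learn, keep 100 training / 50 test points, and compress to
# 6 PCA components scaled to [0, pi] so they can serve as rotation angles.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, test_size=50, stratify=y, random_state=1)
prep = make_pipeline(StandardScaler(), PCA(n_components=6),
                     MinMaxScaler(feature_range=(0, np.pi)))
X_train, X_test = prep.fit_transform(X_train), prep.transform(X_test)

# Feature relevance via simple variance as a placeholder (replace with quantum_MI).
variances = np.var(X_train, axis=0)
top_idx = np.argsort(variances)[-3:]      # keep the 3 highest-variance features

# ZZFeatureMap takes feature_dimension (one qubit per selected feature).
feature_map = ZZFeatureMap(feature_dimension=len(top_idx), reps=1)
# With no fidelity primitive supplied, the kernel runs on a local simulator.
qkernel = FidelityQuantumKernel(feature_map=feature_map)
clf = SVC(kernel=qkernel.evaluate).fit(X_train[:, top_idx], y_train)
print("QSVM acc:", clf.score(X_test[:, top_idx], y_test))
```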

For homework, replace the variance scoring with a QMI routine.

---

## 7) Noise & mitigation  

| Noise source | Effect on MI / kernel | Counter-measure |
|--------------|-----------------------|-----------------|
| SPAM errors  | Bias probabilities    | Read-out calibration |
| Decoherence  | Shrinks off-diagonal  | Dynamical decoupling |
| Shot noise   | Variance in scores    | Adaptive shot allocation |
| Gate errors  | Systematic bias       | Zero-noise extrapolation (ZNE), SKQD |

**Emerging**: *SQD/SKQD* aim to incorporate stochastic noise models directly into MI estimation to debias scores; this remains exploratory.
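
Zero-noise extrapolation from the table can be illustrated on a toy model. The sketch assumes a single-qubit depolarising model in which each noisy gate shrinks the Bloch vector by $1-4p/3$, and assumes the noise can be amplified by integer scale factors (for example by gate folding). It uses an exponential fit, one of several common ZNE variants, on noiseless synthetic data; real measurements would add shot noise. Function names are illustrative.

```python
import numpy as np

def noisy_bloch_length(r0, p, gates, scale):
    """Bloch-vector length after `gates` depolarising gates with error
    probability p, with the noise stretched by `scale` (assumed model)."""
    return r0 * (1.0 - 4.0 * p / 3.0) ** (gates * scale)

def zne_exponential(scales, values):
    """Fit log(value) = a + b * scale and return the scale -> 0 limit exp(a)."""
    slope, intercept = np.polyfit(scales, np.log(values), 1)
    return float(np.exp(intercept))

r0, p, gates = 0.8, 0.02, 3
scales = np.array([1.0, 2.0, 3.0])
measured = noisy_bloch_length(r0, p, gates, scales)

r_mitigated = zne_exponential(scales, measured)
purity_raw = 0.5 * (1.0 + measured[0] ** 2)        # what the device reports
purity_mitigated = 0.5 * (1.0 + r_mitigated ** 2)  # extrapolated to zero noise
print(f"raw purity {purity_raw:.4f}, mitigated {purity_mitigated:.4f}")
```

In this idealised setting the extrapolation recovers the noise-free purity $\tfrac12(1+r_0^2)$; with finite shots the extrapolated value trades bias for extra variance.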

---

## 8) Mini-exercises (answers in Appendix)

1. Derive the swap-test probability of reading the ancilla as $|0\rangle$: $p_\text{swap}=\tfrac12+\tfrac12\operatorname{Tr}(\rho^2)$.  
2. For 4 features, how many qubits are needed to estimate the pairwise QMI between each feature and a 1-qubit label using swap tests?  
3. Implement the kernel leave-one-feature-out score and show the accuracy change on the binary iris dataset for $k=2$ vs all features.  
4. Under depolarising error $p=0.02$ per two-qubit gate, estimate the bias introduced in the QMI of a single qubit (hint: fidelity shrink factor).  
5. Explain why QMI may detect non-linear dependencies missed by Pearson correlation.

---

## 9) FAQ  

- **“Can I encode >1 feature per qubit?”** Yes, via data-reuploading circuits; selection then chooses which parameters to keep, not which qubits.  
- **“Do I need two copies of the state for swap test?”** Yes; an alternative is classical shadows (randomized measurements), which estimate purity from single copies.  
- **“Is quantum MI always better than classical?”** No. Any advantage is conjectured for data with complex entangled structure or when classical MI estimation is hard in high dimensions.  
- **“How large a dataset can I handle?”** The kernel matrix still needs $O(N^2)$ circuits; feature selection reduces the qubit count but not the scaling with sample count.
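
The sample scaling in the last answer is easy to quantify. The count below assumes a symmetric training kernel with a unit diagonal, so only the strict upper triangle needs circuits, plus a full test-versus-train block; `kernel_circuit_count` is a hypothetical helper.

```python
def kernel_circuit_count(n_train, n_test=0):
    """Distinct fidelity circuits: upper triangle of the train kernel
    plus every test-vs-train entry."""
    return n_train * (n_train - 1) // 2 + n_test * n_train

counts = {n: kernel_circuit_count(n) for n in (100, 1000, 10000)}
total_shots = kernel_circuit_count(100, n_test=50) * 1024

print(counts)        # {100: 4950, 1000: 499500, 10000: 49995000}
print(total_shots)   # 10188800
```

Even the small 100/50 split already needs about ten million shots at 1024 shots per circuit, regardless of how few features are kept.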

---

## 10) Summary (Session 11)

- Feature selection is key to bringing high-dimensional data onto today’s small quantum processors.  
- **Quantum mutual information** and **kernel-based importance** provide principled relevance scores.  
- Hybrid paradigm: quantum scoring, classical ranking, quantum or classical downstream model.  
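- Qiskit Machine Learning offers tools (quantum-kernel classes such as `FidelityQuantumKernel`, named `QuantumKernel` in older releases, plus feature maps) to prototype quickly; swap-test and classical-shadow circuits are still needed for MI experiments.  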
- Noise remains main bottleneck; mitigation plus emerging SQD/SKQD techniques show promise.

---

## 11) Looking ahead  

- **Next Session:** Mid-term review and open project brainstorming.  
- **Homework 3 add-on:**  
  - Implement QMI feature ranking on small synthetic data; compare to mutual information in `scikit-learn`.  
  - Evaluate QSVM accuracy vs number of features (1–4) with 1024 shots per circuit.

---

## Appendix — mini-exercise solutions (sketch)

1. Swap-test derivation: $\Pr(|0\rangle)=\frac12+\frac12\operatorname{Tr}(\rho\sigma)$; set $\sigma=\rho$.  
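2. Each swap test needs two copies plus one ancilla. Preparing the full state (4 features + label = 5 qubits) twice takes 10 qubits, or 11 with the ancilla; if each run encodes only feature $i$ and the label, $2\times2+1=5$ qubits suffice.  
3. Expect a small drop, for example from about 0.97 (all features) to about 0.95 (top-2 features); exact values depend on the split and feature map.  
4. Under a single-qubit depolarising model the Bloch vector shrinks by $(1-4p/3)^{g}$, where $g$ is the gate count; for $p=0.02$ and $g=3$ the shrink factor is ≈ 0.92, so the purity $\tfrac12(1+|\vec r|^2)$ falls and $S_2$ is biased upward.  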
5. Pearson captures linear relation; QMI detects any probabilistic dependence including XOR-type parity.
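
The derivation sketched in item 1 can be written out for pure inputs $|\psi\rangle,|\phi\rangle$ with the ancilla listed first (Hadamard, controlled-SWAP, Hadamard); the mixed-state result follows by linearity:

$$
\begin{aligned}
|0\rangle|\psi\rangle|\phi\rangle
&\xrightarrow{H}\tfrac{1}{\sqrt2}\bigl(|0\rangle+|1\rangle\bigr)|\psi\rangle|\phi\rangle \\
&\xrightarrow{\text{c-SWAP}}\tfrac{1}{\sqrt2}\bigl(|0\rangle|\psi\rangle|\phi\rangle+|1\rangle|\phi\rangle|\psi\rangle\bigr) \\
&\xrightarrow{H}\tfrac12|0\rangle\bigl(|\psi\rangle|\phi\rangle+|\phi\rangle|\psi\rangle\bigr)+\tfrac12|1\rangle\bigl(|\psi\rangle|\phi\rangle-|\phi\rangle|\psi\rangle\bigr),\\
\Pr(|0\rangle)&=\tfrac14\bigl\|\,|\psi\rangle|\phi\rangle+|\phi\rangle|\psi\rangle\bigr\|^2=\tfrac12+\tfrac12|\langle\psi|\phi\rangle|^2 .
\end{aligned}
$$

For mixed inputs the same circuit gives $\Pr(|0\rangle)=\tfrac12+\tfrac12\operatorname{Tr}\bigl[\mathrm{SWAP}\,(\rho\otimes\sigma)\bigr]=\tfrac12+\tfrac12\operatorname{Tr}(\rho\sigma)$, and setting $\sigma=\rho$ yields the purity.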

