# Classification and Evaluation on Reduced Lung Data

## Overview
This notebook implements multiple classification algorithms and evaluates their performance on dimensionally reduced lung microRNA data using various preprocessing techniques.

## Table of Contents
1. Data Preparation and Standardization
2. Minimum Distance Classifier
3. Bayes Classifier
4. Naive Bayes Classifier
5. K-Nearest Neighbors (KNN)
6. Linear Discriminant Analysis (LDA)
7. Kernel Discriminant Analysis (KDA)
8. Performance Evaluation

## Dataset Information
- **Dataset**: Lung.csv (dimensionally reduced)
- **Train/Test Split**: 872/219 samples
- **Classification Task**: Lung cancer classification
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score

## Author
- Raja Ram Bitra

## Classifiers:
- Imagine you have data on many lung cancer patients and want a tool that looks at a new patient's data and predicts whether they have cancer. That tool is called a classifier.
- **Definition:** A classifier is a decision-maker. It learns patterns from existing labeled data (where you already know who has cancer and who doesn't). Once trained, it takes new, unlabeled data and predicts which group each sample belongs to (e.g., "cancer" or "no cancer").

**Types**

**1. Minimum Distance Classifier (MDC)**
- **What it does:** This classifier is like finding the "average" patient for each group (cancerous and non-cancerous). When a new patient comes in, it simply checks which group's "average" patient they are closest to, and assigns them to that group. It's very simple and just measures direct distance.
- **When it's useful:** Use this if you believe that patients with cancer are generally quite different from non-cancerous patients, and these differences can be captured by looking at their average feature values. For example, if all cancer patients consistently have a very high average of one specific biomarker.

**2. Bayes Classifier (Optimal Bayes Classifier)**
- **What it does:** This is the "smartest" theoretical decision-maker. It tries to calculate the exact probability of a patient having cancer given their specific features. It then assigns the patient to the group (cancer or no cancer) that has the highest probability. It's "optimal" because, in theory, it makes the best possible decision if you know all the true probabilities.
- **When it's useful:** It is mostly a theoretical benchmark, because the true probabilities are rarely known. It tells you the best any classifier could do: if the true class probabilities were known exactly, no other decision rule would achieve a lower error rate. It's the goal other classifiers try to approximate.
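
As a minimal sketch of the decision rule described above, assume (purely for illustration) a single feature whose class-conditional densities are known Gaussians with known priors. The priors, means, and standard deviations below are made-up values, not estimates from Lung.csv, and `bayes_posterior` and `bayes_predict` are hypothetical helper names.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical, fully known model: two classes, one feature.
priors = np.array([0.7, 0.3])   # P(class 0), P(class 1)
means = np.array([0.0, 2.0])    # class-conditional means
stds = np.array([1.0, 1.0])     # class-conditional standard deviations

def bayes_posterior(x):
    """Return P(class | x) for each class via Bayes' rule."""
    likelihoods = norm.pdf(x, loc=means, scale=stds)  # p(x | class)
    joint = priors * likelihoods                      # p(x, class)
    return joint / joint.sum()                        # divide by p(x)

def bayes_predict(x):
    """Pick the class with the highest posterior probability."""
    return int(np.argmax(bayes_posterior(x)))

print(bayes_posterior(1.5), bayes_predict(1.5))
print(bayes_posterior(1.0), bayes_predict(1.0))
```

With these invented parameters the decision boundary sits slightly above x = 1.4, so the larger prior of class 0 pulls the boundary toward the class 1 mean.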

**3. Naive Bayes Classifier**
- **What it does:** This is a simplified version of the Bayes Classifier. It makes a "naive" (simplifying) assumption: it assumes that all your patient features (like gene expression, tumor size, etc.) are independent of each other given the patient's cancer status. So, knowing a patient has a large tumor doesn't change the probability of them having a high biomarker, if you already know they have cancer. Despite this simplification, it often works surprisingly well.
- **When it's useful:** It's a good choice if your features are somewhat independent, or if you have a lot of data. It's very fast to train and can perform well even with limited computational resources. For example, if different biomarkers each provide unique, non-overlapping clues about cancer.
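
The independence assumption can be checked numerically. Under a Gaussian model, summing per-feature log densities gives the same value as a multivariate Gaussian whose covariance matrix is diagonal. The sketch below uses invented parameters (`mean`, `var`, `x`) for a single class with three features.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical per-class parameters for three features.
mean = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.25, 4.0])
x = np.array([0.0, -0.5, 3.0])

# Naive Bayes: sum the per-feature log densities (independence given the class).
log_lik_naive = norm.logpdf(x, loc=mean, scale=np.sqrt(var)).sum()

# Same quantity from a multivariate Gaussian with a diagonal covariance matrix.
log_lik_diag = multivariate_normal.logpdf(x, mean=mean, cov=np.diag(var))

print(log_lik_naive, log_lik_diag)
```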

**4. K-Nearest Neighbors (KNN)**
- **What it does:** KNN is like consulting your "neighborhood" of past patients. When a new patient comes in, it looks at the 'K' (a number you choose, like 3 or 5) most similar past patients. If most of those 'K' neighbors had cancer, then it predicts the new patient also has cancer. It classifies based on the majority vote of its closest friends.
- **When it's useful:** This is good if you believe that patients with similar characteristics should have the same diagnosis. It's simple to understand and works well when your data has clear, well-defined clusters for each class.
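
The majority vote can be sketched in a few lines of NumPy. This is an illustrative toy, assuming Euclidean distance and made-up 2-D points; `knn_predict` is a hypothetical helper, and the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up points: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # query near the class-0 cluster
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # query near the class-1 cluster
```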

**5. Linear Discriminant Analysis (LDA)**
- **What it does:** LDA is a bit like PCA, but specifically for classification. Instead of just finding components that capture data variance, it finds new "directions" in your data that best separate your groups (cancerous vs. non-cancerous). It tries to draw straight lines or planes that optimally divide the classes.
- **When it's useful:** Use LDA if you think your different patient groups (cancer vs. non-cancer) can be effectively separated by a straight line or flat plane based on their features. It's good when your data classes are somewhat distinct and normally distributed.
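
A small sketch of LDA on synthetic data, assuming two well-separated Gaussian blobs with equal spread (the setting LDA is designed for). The data and the names `X_toy`, `y_toy`, and `lda_toy` are invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian blobs with the same spread, far apart.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[6.0, 6.0], scale=1.0, size=(100, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 100 + [1] * 100)

lda_toy = LinearDiscriminantAnalysis()
lda_toy.fit(X_toy, y_toy)

# With two classes, LDA projects onto a single discriminant direction.
Z = lda_toy.transform(X_toy)
print(Z.shape)
print(lda_toy.score(X_toy, y_toy))
```

With C classes, LDA yields at most C - 1 discriminant directions, which is why the projection here has a single column.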

**6. Kernel Discriminant Analysis (KDA)**
- **What it does:** KDA is the "smarter" version of LDA, just like KPCA is to PCA. If the separation between your cancer and non-cancer groups isn't a straight line, but a complex curve, KDA can handle it. It uses the "kernel trick" (like in KPCA) to project your data into a higher-dimensional space where those complex, curved separations become straight lines, then applies LDA there.
- **When it's useful:** This is powerful when the boundary between cancerous and non-cancerous patients is non-linear and complex. If your data points for cancer and non-cancer are intertwined in a way that a straight line can't separate them, KDA can find that curvy boundary.
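
To illustrate why a richer feature space helps, the sketch below uses an explicit feature map instead of a kernel, which is a deliberate simplification: it appends the squared radius as a third feature and then applies ordinary LDA. A kernel method reaches a comparable space implicitly through kernel evaluations. The concentric-circle data comes from scikit-learn's `make_circles` and is unrelated to the lung dataset.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: an inner circle (class 1) surrounded by an outer ring (class 0).
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear boundary in the original 2-D space cannot separate the rings.
acc_linear = LinearDiscriminantAnalysis().fit(X_c, y_c).score(X_c, y_c)

# Explicit nonlinear map: append the squared radius as a third feature.
X_lifted = np.column_stack([X_c, (X_c ** 2).sum(axis=1)])
acc_lifted = LinearDiscriminantAnalysis().fit(X_lifted, y_c).score(X_lifted, y_c)

print(f"LDA in input space:  {acc_linear:.2f}")
print(f"LDA in lifted space: {acc_lifted:.2f}")
```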

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, linear_kernel
from scipy.linalg import solve, pinv
from sklearn.svm import SVC
import time

np.random.seed(42)

## 3a. Data Preparation and Standardization

- Load the lung cancer dataset, split it into training and test sets, reduce dimensionality with PCA (retaining 95% of the variance), and standardize the resulting components.

In [2]:
# Load and prepare data
df = pd.read_csv('../Lung.csv')
data = df.iloc[:,:-1].to_numpy()
label = df.iloc[:, -1].to_numpy()

# Split data
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=42)

# Whole data (raw features, no scaling or reduction)
X_train_whole, X_test_whole = X_train, X_test

# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"Original training shape: {X_train.shape}")
print(f"PCA training shape: {X_train_pca.shape}")
print(f"Components retained: {pca.n_components_}")

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_pca)
X_test_scaled = scaler.transform(X_test_pca)

print("Data standardized and ready for classification.")

Original training shape: (872, 1881)
PCA training shape: (872, 8)
Components retained: 8
Data standardized and ready for classification.
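
To make the `n_components=0.95` setting concrete: a float between 0 and 1 asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance ratio passes that fraction. The sketch below checks this on synthetic data; `X_demo` and its feature scales are invented for illustration and are unrelated to Lung.csv.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples, 20 features with decaying scale.
scales = np.linspace(10.0, 0.5, 20)
X_demo = rng.normal(size=(200, 20)) * scales

# Fit all components and find the smallest k whose cumulative ratio reaches 0.95.
full = PCA().fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95) + 1)

# Passing a float in (0, 1) asks scikit-learn to make the same choice.
k_auto = PCA(n_components=0.95).fit(X_demo).n_components_

print(k_manual, k_auto)
```

Because PCA is variance-driven, fitting it on unscaled features (as in the cell above) lets high-variance features dominate the leading components.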


## Classifier Implementations

### 3b. Minimum Distance Classifier

- Implementation of a simple distance-based classifier that assigns samples to the closest class centroid.

In [3]:
# Minimum Distance Classifier (built from scratch)

def min_distance_classifier(X_train, y_train, X_test):
    """Assign each test sample to the class whose training centroid is nearest."""
    classes = np.unique(y_train)
    # Centroid (mean feature vector) of each class
    class_means = {}
    for c in classes:
        class_means[c] = np.mean(X_train[y_train == c], axis=0)

    predictions = []
    for x in X_test:
        # Euclidean distance from x to every class centroid
        dists = [np.linalg.norm(x - class_means[c]) for c in classes]
        predictions.append(classes[np.argmin(dists)])
    return np.array(predictions)
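
As a sanity check, the same rule can be written with broadcasting and compared against scikit-learn's `NearestCentroid`, which uses Euclidean distance by default. This is a sketch on synthetic data; `min_distance_vectorized`, `X_demo`, `y_demo`, and `X_new` are hypothetical names introduced here.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

def min_distance_vectorized(X_train, y_train, X_test):
    """Nearest-centroid rule written with broadcasting instead of a Python loop."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # dists[i, j] = Euclidean distance from test sample i to centroid j
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
# Synthetic three-class data, for illustration only.
X_demo = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (0.0, 3.0, 6.0)])
y_demo = np.repeat([0, 1, 2], 50)
X_new = rng.normal(loc=3.0, size=(20, 4))

pred_scratch = min_distance_vectorized(X_demo, y_demo, X_new)
pred_sklearn = NearestCentroid().fit(X_demo, y_demo).predict(X_new)
print(np.array_equal(pred_scratch, pred_sklearn))
```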

### 3c. Bayes Classifier (From Scratch)
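- Implementation of a Bayes classifier under a Gaussian assumption. Each feature's likelihood is modeled independently with a per-class mean and variance, so this is in effect a Gaussian naive Bayes model.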

In [4]:
# Bayes Classifier (built from scratch)

class BayesClassifier:
    def __init__(self):
        self.classes = None
        self.priors = None
        self.mean = None
        self.variance = None

    def fit(self, X, y):
        self.classes = np.unique(y)
        n_features = X.shape[1]
        n_classes = len(self.classes)
        # Initialize arrays to store class-wise statistics
        self.mean = np.zeros((n_classes, n_features))
        self.variance = np.zeros((n_classes, n_features))
        self.priors = np.zeros(n_classes)
        for idx, cls in enumerate(self.classes):
            X_c = X[y == cls]
            self.mean[idx, :] = X_c.mean(axis=0)
            self.variance[idx, :] = X_c.var(axis=0)
            self.priors[idx] = X_c.shape[0] / float(X.shape[0])
    
    def _calculate_likelihood(self, mean, var, x):
        eps = 1e-6  # Small epsilon added to the variance to avoid division by zero
        var = var + eps  # Use the same smoothed variance in both terms
        coeff = 1 / np.sqrt(2 * np.pi * var)
        exponent = -((x - mean) ** 2) / (2 * var)
        return coeff * np.exp(exponent)
    
    def _calculate_posterior(self, X):
        posteriors = []
        for idx, cls in enumerate(self.classes):
            prior = np.log(self.priors[idx])
            likelihood = np.sum(np.log(self._calculate_likelihood(self.mean[idx, :], self.variance[idx, :], X)), axis=1)
            posteriors.append(prior + likelihood)
        return np.array(posteriors).T

    def predict(self, X):
        posteriors = self._calculate_posterior(X)
        return self.classes[np.argmax(posteriors, axis=1)]
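
Taking `np.log` of a density that has underflowed to zero yields `-inf` and a runtime warning. One way to avoid this, sketched here under the same per-feature Gaussian assumption, is to compute the log density directly. `gaussian_log_likelihood` is a hypothetical helper, not part of the class above, and the test point is invented.

```python
import numpy as np

def gaussian_log_likelihood(X, mean, var, eps=1e-6):
    """Per-sample sum of per-feature Gaussian log densities, computed in log space."""
    var = var + eps                               # guard against zero variance
    log_coeff = -0.5 * np.log(2.0 * np.pi * var)  # log of 1 / sqrt(2*pi*var)
    log_exp = -((X - mean) ** 2) / (2.0 * var)    # exponent, never exponentiated
    return np.sum(log_coeff + log_exp, axis=1)

# A point 50 standard deviations from the mean: the density underflows to 0,
# so log(pdf) is -inf, while the log-space value stays finite.
X_far = np.array([[50.0]])
mean, var = np.array([0.0]), np.array([1.0])

with np.errstate(divide="ignore"):
    naive = np.log(np.exp(-((X_far - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var))
stable = gaussian_log_likelihood(X_far, mean, var)

print(naive, stable)
```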

### 3d. Naive Bayes Classifier
- Gaussian Naive Bayes, using scikit-learn's `GaussianNB` with default settings.

In [5]:
# Naive Bayes Classifier
naive_bayes = GaussianNB()
print("Naive Bayes classifier initialized")

Naive Bayes classifier initialized


### 3e. K-Nearest Neighbors Classifier
- K-Nearest Neighbors, using scikit-learn's `KNeighborsClassifier` with k = 10.

In [6]:
# K-Nearest Neighbors Classifier
knn = KNeighborsClassifier(n_neighbors=10)
print("KNN classifier initialized with k=10")

KNN classifier initialized with k=10


### 3f. Linear Discriminant Analysis
- Linear Discriminant Analysis, using scikit-learn's `LinearDiscriminantAnalysis` with default settings.

In [7]:
# Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis()
print("LDA classifier initialized")

LDA classifier initialized


### 3g. Kernel Discriminant Analysis (KDA) - From Scratch
- Implementation of Kernel Discriminant Analysis with different kernel functions (RBF, Polynomial, Linear).

In [8]:
# Build KDA classifier 
class KernelDiscriminantAnalysis:
    def __init__(self, kernel='linear', degree=3, coef0=1, gamma=None, reg=1e-3):
        self.kernel = kernel
        self.degree = degree
        self.coef0 = coef0
        self.gamma = gamma
        self.reg = reg
        self.eigenvectors = None
        self.class_means = None
        self.X_train = None
        self.y_train = None
        self.label_dict = {}

    def compute_kernel(self, X, Y=None):
        if self.kernel == 'rbf':
            return rbf_kernel(X, Y, gamma=self.gamma)
        elif self.kernel == 'poly':
            return polynomial_kernel(X, Y, degree=self.degree, coef0=self.coef0)
        elif self.kernel == 'linear':
            return linear_kernel(X, Y)
        else:
            raise ValueError("Unsupported kernel. Choose from 'rbf', 'poly', or 'linear'.")

    def encode_labels(self, y):
        unique_classes = np.unique(y)
        self.label_dict = {label: idx for idx, label in enumerate(unique_classes)}
        return np.array([self.label_dict[label] for label in y])

    def fit(self, X, y):
        self.X_train = X
        self.y_train = self.encode_labels(y)

        n_samples = X.shape[0]
        K = self.compute_kernel(X)

        classes = np.unique(self.y_train)
        N_c, K_c, mean_c = {}, {}, {}

        # Compute class-wise kernel matrices
        for c in classes:
            idx = np.where(self.y_train == c)[0]
            K_c[c] = K[:, idx]
            N_c[c] = len(idx)
            mean_c[c] = np.mean(K_c[c], axis=1, keepdims=True)

        # Compute between-class scatter matrix M
        mean_total = np.mean(K, axis=1, keepdims=True)
        M = np.zeros((n_samples, n_samples))
        for c in classes:
            diff = mean_c[c] - mean_total
            M += N_c[c] * (diff @ diff.T)

        # Compute within-class scatter matrix N
        N = np.zeros((n_samples, n_samples))
        for c in classes:
            N += K_c[c] @ (np.eye(N_c[c]) - (1 / N_c[c]) * np.ones((N_c[c], N_c[c]))) @ K_c[c].T

        # Regularize N to ensure positive definiteness
        N += np.eye(N.shape[0]) * self.reg

        # Solve the generalized eigenvalue problem via N^{-1} M, using scipy.linalg.solve
        try:
            eigvals, eigvecs = np.linalg.eig(solve(N, M))  # Solve for eigenvectors
        except np.linalg.LinAlgError:
            print("Warning: N is singular, using pseudo-inverse instead.")
            eigvals, eigvecs = np.linalg.eig(pinv(N) @ M)

        # Select top discriminant directions
        idx = np.argsort(-eigvals)
        self.eigenvectors = eigvecs[:, idx[:len(classes) - 1]]

        # Normalize eigenvectors
        self.eigenvectors /= np.linalg.norm(self.eigenvectors, axis=0)

        # Compute class means in the transformed space
        self.class_means = {}
        transformed_X = self.transform(X)
        for c in classes:
            self.class_means[c] = np.mean(transformed_X[self.y_train == c], axis=0)

    def transform(self, X):
        K_test = self.compute_kernel(X, self.X_train)
        return K_test @ self.eigenvectors

    def predict(self, X):
        X_proj = self.transform(X)
        predictions = []
        for x in X_proj:
            # Assign to nearest class mean
            distances = {c: np.linalg.norm(x - self.class_means[c]) for c in self.class_means}
            predictions.append(min(distances, key=distances.get))
        return np.array(predictions)
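
One possible alternative to `np.linalg.eig(solve(N, M))`, offered as a sketch rather than a drop-in replacement: when M is symmetric and the regularized N is symmetric positive definite, `scipy.linalg.eigh(M, N)` solves the generalized problem directly and returns real eigenvalues. By contrast, `np.linalg.eig` may return complex-valued output because the product of the inverse of N with M is generally not symmetric. `M_demo` and `N_demo` below are random stand-ins, not the scatter matrices from the class.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 6

# Stand-ins for the scatter matrices: M symmetric PSD, N symmetric positive definite.
A = rng.normal(size=(n, 2))
M_demo = A @ A.T                     # rank-2 "between-class" matrix
B = rng.normal(size=(n, n))
N_demo = B @ B.T + 1e-3 * np.eye(n)  # regularized "within-class" matrix

# eigh solves M v = lambda N v and returns real eigenvalues in ascending order.
eigvals, eigvecs = eigh(M_demo, N_demo)
order = np.argsort(eigvals)[::-1]    # largest first
top_vals, top_vecs = eigvals[order[:2]], eigvecs[:, order[:2]]

# Each retained pair should satisfy the generalized eigenvalue equation.
residual = M_demo @ top_vecs - (N_demo @ top_vecs) * top_vals
print(top_vals, np.abs(residual).max())
```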

## Data Variants Preparation
- Creating different data transformations to evaluate classifier performance across various feature spaces.

In [9]:
# Standardize the raw features (overwrites the PCA-based X_train_scaled from In [2])
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nPCA Results")
print("PCA n_components retained:", pca.n_components_)
print("X_train_pca shape:", X_train_pca.shape)
print("X_test_pca shape: ", X_test_pca.shape)

# Use the same number of components for KPCA as PCA retained above
# RBF Kernel PCA
kpca_rbf = KernelPCA(n_components=pca.n_components_, kernel='rbf', gamma=0.1)
kpca_rbf.fit(X_train_scaled)

X_train_kpca_rbf = kpca_rbf.transform(X_train_scaled)
X_test_kpca_rbf  = kpca_rbf.transform(X_test_scaled)

print("\nRBF Kernel PCA Results")
print("X_train_kpca_rbf shape:", X_train_kpca_rbf.shape)
print("X_test_kpca_rbf shape: ", X_test_kpca_rbf.shape)

# Polynomial Kernel PCA
kpca_poly = KernelPCA(n_components=pca.n_components_, kernel='poly', degree=3, coef0=1, gamma=0.1)
kpca_poly.fit(X_train_scaled)

X_train_kpca_poly = kpca_poly.transform(X_train_scaled)
X_test_kpca_poly  = kpca_poly.transform(X_test_scaled)

print("\nPolynomial Kernel PCA Results")
print("X_train_kpca_poly shape:", X_train_kpca_poly.shape)
print("X_test_kpca_poly shape: ", X_test_kpca_poly.shape)

# Linear Kernel PCA
kpca_lin = KernelPCA(n_components=pca.n_components_, kernel='linear')
kpca_lin.fit(X_train_scaled)

X_train_kpca_lin = kpca_lin.transform(X_train_scaled)
X_test_kpca_lin  = kpca_lin.transform(X_test_scaled)

print("\nLinear Kernel PCA Results")
print("X_train_kpca_lin shape:", X_train_kpca_lin.shape)
print("X_test_kpca_lin shape: ", X_test_kpca_lin.shape)



PCA Results
PCA n_components retained: 8
X_train_pca shape: (872, 8)
X_test_pca shape:  (219, 8)

RBF Kernel PCA Results
X_train_kpca_rbf shape: (872, 8)
X_test_kpca_rbf shape:  (219, 8)

Polynomial Kernel PCA Results
X_train_kpca_poly shape: (872, 8)
X_test_kpca_poly shape:  (219, 8)

Linear Kernel PCA Results
X_train_kpca_lin shape: (872, 8)
X_test_kpca_lin shape:  (219, 8)


In [10]:
def get_top10_features(X_train, X_test, y_train, y_test, C=0.1):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

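    # Per-feature variance, taken from the diagonal of the covariance matrix.
    # Note: the features were standardized above, so every non-constant feature
    # has variance n / (n - 1) and this ranking reflects floating-point rounding only.
    cov_matrix = np.cov(X_train_scaled, rowvar=False)

    variances = np.diag(cov_matrix)
    top10_indices = np.argsort(variances)[::-1][:10]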
    print("\nTop 10 feature indices by variance:", top10_indices)

    X_train_top10 = X_train_scaled[:, top10_indices] 
    X_test_top10 = X_test_scaled[:, top10_indices]

    print("Shape of X_train_top10:", X_train_top10.shape)
    print("Shape of X_test_top10:", X_test_top10.shape)

    # Train an SVM classifier
    classifier = SVC(kernel='linear', C=C)
    classifier.fit(X_train_top10, y_train)

    # Predict and evaluate accuracy
    y_pred = classifier.predict(X_test_top10)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy using top 10 variance features:", accuracy)

    return accuracy, top10_indices, X_train_top10, X_test_top10

accuracy, top10_indices, X_train_top10, X_test_top10 = get_top10_features(X_train, X_test, y_train, y_test)


Top 10 feature indices by variance: [ 426 1787 1344 1343  815  556 1573 1798   97  741]
Shape of X_train_top10: (872, 10)
Shape of X_test_top10: (219, 10)
Accuracy using top 10 variance features: 0.5114155251141552
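
A caveat worth checking: `StandardScaler` rescales every non-constant feature to unit variance, so variances computed after scaling no longer distinguish features. The sketch below illustrates this on synthetic data (`X_raw` and its column scales are made up) and ranks features on the unscaled variances instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features with very different spreads (std 1, 10, and 100).
X_raw = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])

raw_var = X_raw.var(axis=0, ddof=1)
scaled_var = StandardScaler().fit_transform(X_raw).var(axis=0, ddof=1)

print("Raw variances:   ", raw_var)     # differ by orders of magnitude
print("Scaled variances:", scaled_var)  # all equal to n / (n - 1) up to rounding

# Ranking on the raw variances recovers the intended ordering of features.
top_by_raw_variance = np.argsort(raw_var)[::-1]
print(top_by_raw_variance)
```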


In [11]:
print("\n" + "="*60)
print("FINAL DATA VARIANTS SUMMARY")
print("="*60)
print(f"- Whole data (unscaled): {X_train_whole.shape}")
print(f"- PCA: {X_train_pca.shape}")
print(f"- KPCA RBF: {X_train_kpca_rbf.shape}")
print(f"- KPCA Polynomial: {X_train_kpca_poly.shape}")
print(f"- KPCA Linear: {X_train_kpca_lin.shape}")
print(f"- Top 10 features: {X_train_top10.shape}")


FINAL DATA VARIANTS SUMMARY
- Whole data (unscaled): (872, 1881)
- PCA: (872, 8)
- KPCA RBF: (872, 8)
- KPCA Polynomial: (872, 8)
- KPCA Linear: (872, 8)
- Top 10 features: (872, 10)


In [12]:
# Creating Dictionary of data variants
data_variants = {
    'whole': (X_train_whole, X_test_whole),
    'pca': (X_train_pca, X_test_pca),
    'kpca_rbf': (X_train_kpca_rbf, X_test_kpca_rbf),
    'kpca_poly': (X_train_kpca_poly, X_test_kpca_poly),
    'kpca_lin': (X_train_kpca_lin, X_test_kpca_lin),
    'top10': (X_train_top10, X_test_top10)
}
   
# List of classifiers for future use
classifiers = [
    'min_dist',
    'BayesClassifier',
    'naive_bayes',
    'knn',
    'lda',
    'kda_rbf',
    'kda_poly',
    'kda_linear'
]

## Performance Evaluation
- Setting up the evaluation framework with data variants and classifier execution functions.
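
The Dataset Information section lists F1-Score among the evaluation metrics. As a minimal sketch of how macro-averaged F1 could be computed alongside accuracy, precision, and recall, the example below uses scikit-learn's `f1_score` on made-up labels. `summarize`, `y_true`, and `y_hat` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels for a three-class problem, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

def summarize(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

print(summarize(y_true, y_hat))
```

Macro averaging weights every class equally, so a classifier that predicts only the majority class can score a reasonable accuracy while its macro recall stays near 1 divided by the number of classes.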

In [13]:
def run_classifier(clf_name, X_train, y_train, X_test, y_test=None):
    if clf_name == 'min_dist':
        y_pred = min_distance_classifier(X_train, y_train, X_test)
    
    elif clf_name == 'BayesClassifier':
        model = BayesClassifier()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    elif clf_name == 'naive_bayes':
        model = GaussianNB()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    elif clf_name == 'knn':
        model = KNeighborsClassifier(n_neighbors=10)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    elif clf_name == 'lda':
        model = LinearDiscriminantAnalysis()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    elif clf_name in ['kda_rbf', 'kda_poly', 'kda_linear']:
        kernel_type = clf_name.split('_')[1]  # Extracts 'rbf', 'poly', or 'linear'
        model = KernelDiscriminantAnalysis(kernel=kernel_type)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    else:
        raise ValueError(f"Unknown classifier name: {clf_name}")

    return y_pred

In [14]:
results = []
for clf_name in classifiers:
    print(f"Evaluating {clf_name}...")
    for variant_name, (Xtr, Xts) in data_variants.items():
        try:
            y_pred = run_classifier(clf_name, Xtr, y_train, Xts)
            acc = accuracy_score(y_test, y_pred)
            prec = precision_score(y_test, y_pred, average='macro', zero_division=0)
            rec = recall_score(y_test, y_pred, average='macro', zero_division=0)
            
            results.append((clf_name, variant_name, acc, prec, rec))
            
        except Exception as e:
            print(f"  Error with {variant_name}: {str(e)}")
            results.append((clf_name, variant_name, 0.0, 0.0, 0.0))

# Create and display results DataFrame
df_results = pd.DataFrame(results, columns=['Classifier', 'DataVariant', 'Accuracy', 'Precision', 'Recall'])
print("\n" + "="*80)
print("FINAL CLASSIFICATION RESULTS")
print("="*80)
print(df_results.to_string(index=False))

# Display best performing combinations
print("\n" + "="*80)
print("TOP PERFORMING CLASSIFIER-DATA VARIANT COMBINATIONS")
print("="*80)
top_results = df_results.nlargest(10, 'Accuracy')
print(top_results.to_string(index=False))

Evaluating min_dist...
Evaluating BayesClassifier...
Evaluating naive_bayes...
Evaluating knn...
Evaluating lda...
Evaluating kda_rbf...
Evaluating kda_poly...
Evaluating kda_linear...


FINAL CLASSIFICATION RESULTS
     Classifier DataVariant  Accuracy  Precision   Recall
       min_dist       whole  0.086758   0.244732 0.206489
       min_dist         pca  0.091324   0.252673 0.207130
       min_dist    kpca_rbf  0.515982   0.103196 0.200000
       min_dist   kpca_poly  0.022831   0.102885 0.203540
       min_dist    kpca_lin  0.100457   0.182473 0.158289
       min_dist       top10  0.063927   0.179937 0.281313
BayesClassifier       whole  0.196347   0.182181 0.136878
BayesClassifier         pca  0.493151   0.195478 0.196794
BayesClassifier    kpca_rbf  0.027397   0.005479 0.200000
BayesClassifier   kpca_poly  0.054795   0.327175 0.259637
BayesClassifier    kpca_lin  0.178082   0.197662 0.223276
BayesClassifier       top10  0.068493   0.170366 0.237898
    naive_bayes       whole  0.073059   0.172586 0.244988
    naive_bayes         pca  0.493151   0.195478 0.196794
    naive_bayes    kpca_rbf  0.315068   0.063014 0.200000
    naive_bayes   kpca_poly  0.054795   0.