# Setting_2 - 62.5% accuracy on test dataset
Building a machine learning pipeline that uses the minimum number of features while achieving more than 62.5% accuracy on the test dataset, using only topics covered in the lecture so far.


## Imports and Environment Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


## Data Loading, Vectorizing and Preprocessing

In [2]:
# Load dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)  # Features (18846 samples, 2000 features each)
y = newsgroups.target  # Labels (integer class indices 0 to 19, one per newsgroup)

# Convert sparse to dense for PCA
X_dense = X.toarray()

# Split into training and test sets (80% train, 20% test), stratified by class
X_train, X_test, y_train, y_test = train_test_split(X_dense, y, test_size=0.2, random_state=42, stratify=y)
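
The cell above fits the TF-IDF vectorizer on the full corpus before splitting, so the vocabulary and IDF weights also see the test documents. As a hedged alternative (not what this notebook does, and the reported numbers were not produced this way), the split can come first and the vectorizer can be fitted on the training documents only. The toy corpus and all variable names below are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 4 documents per class, 2 classes.
toy_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the satellite orbit was adjusted",
    "astronauts trained for the space station",
    "the goalie saved the penalty shot",
    "the hockey team won the final game",
    "the pitcher threw a perfect game",
    "fans cheered as the team scored",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, so the test documents stay unseen.
docs_train, docs_test, y_tr, y_te = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Fit vocabulary and IDF weights on the training documents only,
# then reuse them unchanged to transform the test documents.
toy_vectorizer = TfidfVectorizer(stop_words='english')
X_tr = toy_vectorizer.fit_transform(docs_train)
X_te = toy_vectorizer.transform(docs_test)

print(X_tr.shape, X_te.shape)
```

Both matrices share the same column count because the test documents are projected onto the training vocabulary; words that appear only in the test documents are ignored.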

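## Initial hyperparameter tuning
Phase 1: Tune the SVM hyperparameters (C, gamma) with the number of PCA components fixed at 99
- to identify the best combination of C and gamma for the SVM, using a 3-fold cross-validated grid search on the training set.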

In [3]:
# PHASE 1: Grid Search for best SVM params at fixed PCA size

initial_pca_components = 99

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=initial_pca_components, random_state=42)),
    ('clf', SVC(kernel='rbf'))
])

param_grid = {
    'clf__C': [0.1, 1, 10],
    'clf__gamma': [0.001, 0.01, 0.1]
}

grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)

best_params = grid.best_params_
print("Best hyperparameters (initial PCA=99):", best_params)

Best hyperparameters (initial PCA=99): {'clf__C': 1, 'clf__gamma': 0.01}
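
Only the winning combination is printed above. To see how close the other grid points were, the cross-validation scores can be read from `cv_results_`. The following is a minimal sketch on small synthetic data (an assumption, chosen so it runs quickly); `demo_pipe`, `demo_grid`, and `demo_search` are hypothetical names, and the scores it prints say nothing about the 20 Newsgroups run.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 120 samples, 20 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Same pipeline shape as in the notebook, with a smaller PCA size for the toy data.
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5, random_state=42)),
    ('clf', SVC(kernel='rbf')),
])
demo_grid = {'clf__C': [0.1, 1, 10], 'clf__gamma': [0.001, 0.01, 0.1]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Pair each parameter combination with its mean cross-validation accuracy
# and list them from best to worst.
results = demo_search.cv_results_
ranked = sorted(
    zip(results['mean_test_score'], results['params']),
    key=lambda pair: pair[0],
    reverse=True,
)
for mean_score, params in ranked:
    print(f"{mean_score:.3f}  {params}")
```

If several combinations score within a fraction of a percent of each other, the choice of C and gamma is less decisive than a single `best_params_` line suggests.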


## PCA Dimension Optimization
Phase 2: Find the minimum number of PCA components that reaches ≥62.5% accuracy
- to find the smallest PCA dimension that still reaches the target accuracy, searching upward from 65 components and stopping at the first hit.

In [4]:
# PHASE 2: Minimize PCA components while maintaining >= 62.5% accuracy

best_C = best_params['clf__C']
best_gamma = best_params['clf__gamma']

for n_components in range(65, 100):
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=n_components, random_state=42)),
        ('clf', SVC(kernel='rbf', C=best_C, gamma=best_gamma, random_state=42))
    ])
    
    pipe.fit(X_train, y_train)
    acc = pipe.score(X_test, y_test)
    print(f"PCA components = {n_components:3d} → Test Accuracy = {acc * 100:.2f}%")
    
    if acc >= 0.625:
        print(f"\nFound minimum features: PCA components = {n_components} with accuracy = {acc * 100:.2f}%")
        break


PCA components =  65 → Test Accuracy = 61.86%
PCA components =  66 → Test Accuracy = 62.39%
PCA components =  67 → Test Accuracy = 61.46%
PCA components =  68 → Test Accuracy = 62.25%
PCA components =  69 → Test Accuracy = 62.28%
PCA components =  70 → Test Accuracy = 62.49%
PCA components =  71 → Test Accuracy = 62.81%

Found minimum features: PCA components = 71 with accuracy = 62.81%
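
The printed accuracies are not monotone in the number of components (66 components score higher than 67), so the first count that reaches the threshold depends on where the search starts. A small sketch over the values reported above makes this explicit; the dictionary simply copies the printed output, and the variable names are illustrative.

```python
# Test accuracies (in %) copied from the output above, keyed by PCA component count.
reported = {65: 61.86, 66: 62.39, 67: 61.46, 68: 62.25, 69: 62.28, 70: 62.49, 71: 62.81}
threshold = 62.5

counts = sorted(reported)

# First component count whose accuracy reaches the threshold (None if no count does).
first_hit = next((k for k in counts if reported[k] >= threshold), None)

# Consecutive pairs where adding one component lowered the accuracy.
drops = [(a, b) for a, b in zip(counts, counts[1:]) if reported[b] < reported[a]]

print("First count reaching the threshold:", first_hit)
print("Accuracy drops between consecutive counts:", drops)
```

On these values the first hit is 71, and the only drop is between 66 and 67 components. The 70-component run misses the threshold by 0.01 percentage points.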


## Summary & Conclusion
We built a machine learning pipeline using TF-IDF features from the 20 Newsgroups dataset, reduced dimensionality with PCA, and used an RBF-kernel SVM as the classifier. After tuning the hyperparameters (C and gamma) with GridSearchCV at 99 PCA components, we searched component counts upward from 65 and stopped at the first one that reached the target accuracy on the test set. We found that:

- Best Model: SVM (RBF kernel)
- Best Parameters: C=1, gamma=0.01
- Minimum PCA Components to reach ≥62.5% Accuracy (searching from 65 upward): 71
- Test Accuracy Achieved: 62.81%