<b>Q2.1</b> Download the a9a dataset from the LIBSVM Dataset page. This is a preprocessed version of the Adult dataset in the UCI Machine Learning Repository, which consists of a training set (available here) and a test set (available here). Each file (the training set or the test set) is in a text format in which each line represents a labeled data instance as follows:

label index1:value1 index2:value2 ...

where “label” denotes the class label of each instance, “indexT” denotes the T-th feature, and “valueT” denotes the value of the T-th feature of the instance. Read Pages 63-64 of the lecture notes of L5 for reference. The details of the original Adult dataset can be found here. This is a sparse format, where only non-zero feature values are stored for each instance. For example, suppose each data instance in a dataset has 5 dimensions (features). If a data instance has the label “+1” and the input vector [2 0 2.5 4.3 0], then it is presented in a line as

+1 1:2 3:2.5 4:4.3 
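
As a minimal sketch of how such a line maps back to a dense vector, the function below parses the example by hand. The name `parse_libsvm_line` and its `n_features` argument are illustrative only and are not part of scikit-learn.

```python
def parse_libsvm_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense_vector)."""
    tokens = line.split()
    label = float(tokens[0])              # "+1" parses to 1.0
    dense = [0.0] * n_features            # features not listed are zero
    for token in tokens[1:]:
        index, value = token.split(":")
        dense[int(index) - 1] = float(value)   # indices in the file are 1-based
    return label, dense


label, vector = parse_libsvm_line("+1 1:2 3:2.5 4:4.3", n_features=5)
print(label, vector)
```

The parsed vector matches the dense form [2 0 2.5 4.3 0] given in the example.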

Hint: scikit-learn provides an API (“sklearn.datasets.load_svmlight_file”) to load such a sparse data format. Detailed information is available here.

In [1]:
from sklearn.datasets import load_svmlight_files

X_train, y_train, X_test, y_test = load_svmlight_files((
    "a9a.txt",   # train
    "a9a.t"      # test
))
print(X_train.shape, X_test.shape)

(32561, 123) (16281, 123)


### About Dataset
The <b>LIBSVM</b> format stores only non-zero features, so load_svmlight_file (and load_svmlight_files, used above) returns a sparse matrix to save memory and speed up computation.

The training matrix has shape (32561, 123) and the test matrix has shape (16281, 123), as printed above.

Most entries are zero, so only the <u>non-zero ones are stored and printed</u>.
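
To make the storage idea concrete, here is a small pure-Python sketch of compressed sparse row (CSR) storage, the layout scikit-learn loaders typically return via SciPy. The helper `to_csr` and the toy rows are assumptions for illustration; the real matrix is a SciPy object, not three Python lists.

```python
def to_csr(dense_rows):
    """Convert dense rows into CSR-style lists (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(col)     # its column position
        indptr.append(len(data))        # where the next row starts in `data`
    return data, indices, indptr


rows = [
    [2, 0, 2.5, 4.3, 0],
    [0, 0, 0, 1, 0],
]
data, indices, indptr = to_csr(rows)
print(data, indices, indptr)
print(f"stored {len(data)} of {len(rows) * len(rows[0])} entries")
```

Only the non-zero values and their positions are kept, which is why a mostly-zero matrix takes far less memory in this form.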

### Preprocessing
The original Adult dataset has 14 features, but when it is preprocessed for LIBSVM (as in the a9a dataset), it expands into 123 binary features through a combination of one-hot encoding and discretization.

For each categorical column:
* Every distinct value becomes its own column
* The value is 1 if the instance has that category, 0 otherwise

Numeric features such as age or hours-per-week are often discretized:
* Split into bins (e.g., age 18–25, 26–35, etc.)
* Each bin becomes a binary column
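
The two encodings above can be sketched in a few lines. The category list `WORKCLASS` and the cut points `AGE_EDGES` below are hypothetical and chosen only for illustration; they are not the actual categories or bin boundaries used to build a9a.

```python
from bisect import bisect_right


def one_hot(value, categories):
    """One binary column per distinct category."""
    return [1 if value == category else 0 for category in categories]


def bin_one_hot(value, edges):
    """One binary column per bin; `edges` are the lower bounds of bins 2..n."""
    bin_index = bisect_right(edges, value)
    return [1 if i == bin_index else 0 for i in range(len(edges) + 1)]


WORKCLASS = ["Private", "Self-emp", "Gov"]   # hypothetical categories
AGE_EDGES = [26, 36, 51]                     # hypothetical bins: <26, 26-35, 36-50, >=51

# One categorical feature and one numeric feature become 3 + 4 binary columns.
row = one_hot("Self-emp", WORKCLASS) + bin_one_hot(30, AGE_EDGES)
print(row)
```

Each original feature contributes exactly one 1 to the encoded row, which is why the resulting matrix is so sparse.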

In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

In [3]:
# Default params
print("Linear Kernel")
model = SVC(kernel="linear")
params = model.get_params()
print(params)

print("RBF Kernel")
model = SVC(kernel='rbf')
params = model.get_params()
print(params)

Linear Kernel
{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
RBF Kernel
{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}


In [4]:
# Dict mapping mean CV accuracy -> [C, gamma]; settings that tie on accuracy overwrite each other
accuracy_to_params = {}

<b>Q2.2</b> Regarding the linear kernel, show 3-fold cross-validation results in terms of classification accuracy on the training set with different values of the parameter C in {0.01, 0.05, 0.1, 0.5, 1}, respectively, in the following table. Note that for all the other parameters, you can simply use the default values or specify the specific values you used in your submitted PDF file.
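
For intuition about what a stratified 3-fold split does, here is a simplified pure-Python sketch. It is not scikit-learn's actual `StratifiedKFold` algorithm; the function `stratified_folds` and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within each class
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)  # deal round-robin into folds
    return folds


toy_labels = [-1] * 9 + [1] * 3
folds = stratified_folds(toy_labels, n_splits=3, seed=64)
for fold in folds:
    positives = sum(1 for i in fold if toy_labels[i] == 1)
    print(len(fold), "samples,", positives, "positive")
```

Each fold serves once as the held-out set while the model is trained on the other two, and the three accuracies are averaged.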

In [5]:
C_values = [0.01, 0.05, 0.1, 0.5, 1]
results = {}
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=64)

for C in C_values:
    model = SVC(kernel="linear", C=C)
    scores = cross_val_score(model, X_train, y_train, cv=skf, scoring="accuracy", n_jobs=-1)
    results[C] = np.mean(scores)
    accuracy_to_params[results[C]] = [C, -1]  # -1 is a sentinel: the linear kernel has no gamma

In [6]:
print("C-value\tAccuracy")
for C, acc in results.items():
    print(f"{C}\t{acc:.6f}")

C-value	Accuracy
0.01	0.843494
0.05	0.846995
0.1	0.847210
0.5	0.848008
1	0.848285


<b>Q2.3</b> Regarding the rbf kernel, show 3-fold cross-validation results in terms of classification accuracy on the training set with different values of the parameter gamma (i.e., σ² in the lecture notes) in {0.01, 0.05, 0.1, 0.5, 1} and different values of the parameter C in {0.01, 0.05, 0.1, 0.5, 1}, respectively, in the following table. Note that for all the other parameters, you can simply use the default values or specify the specific values you used in your submitted PDF file.
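
For reference, scikit-learn's RBF kernel is K(x, z) = exp(-gamma * ||x - z||^2), so a larger gamma makes similarity fall off faster with distance. The sketch below evaluates it for two toy binary vectors; how gamma corresponds to σ in the lecture notes depends on the notes' parameterization, which is not reproduced here.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two toy binary feature vectors that differ in two positions.
x = [1, 0, 1, 0]
z = [1, 1, 0, 0]
for gamma in [0.01, 0.05, 0.1, 0.5, 1]:
    print(gamma, round(rbf_kernel(x, z, gamma), 4))
```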

In [7]:
gamma_values = [0.01, 0.05, 0.1, 0.5, 1]
results = {}

for C in C_values:
    row = []
    for gamma in gamma_values:
        model = SVC(kernel='rbf', C=C, gamma=gamma)
        scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1)
        row.append(np.mean(scores))
        accuracy_to_params[row[-1]] = [C, gamma]
    results[C] = row

In [8]:
# Print the table header
print(f"{'C gamma':<10}", end="")
for gamma in gamma_values:
    print(f"{gamma:<10}", end="")
print()

# Print each row of the table
for C in C_values:
    print(f"{C:<10}", end="")
    for acc in results[C]:
        print(f"{acc:<10.6f}", end="")  # same column width as the header
    print()

C gamma   0.01      0.05      0.1       0.5       1         
0.01      0.759190  0.819723  0.819569  0.759190  0.759190  
0.05      0.830411  0.836000  0.834065  0.789687  0.759190  
0.1       0.837444  0.839563  0.839010  0.806333  0.761002  
0.5       0.843340  0.845091  0.847087  0.833298  0.789503  
1         0.844814  0.847179  0.847578  0.837444  0.799423  
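
The value 0.759190 appears in four cells of the table. One plausible explanation, not verified in this notebook, is that in those settings the classifier predicts the same class for every instance, so the accuracy equals the majority-class share. A minimal sketch of that baseline, using toy labels in place of `y_train`:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


toy_labels = [-1, -1, -1, 1]   # toy stand-in for y_train
print(majority_baseline_accuracy(toy_labels))
```

Running the same function on `y_train` and comparing the result with 0.759190 would confirm or rule out this reading.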


<b>Q2.4</b> Based on the results shown in Tables 1-2, determine the best kernel and the best parameter setting. Use the best kernel with the best parameter setting to train an SVM using the whole training set and make predictions on the test set to generate the following table:

In [9]:
best_accuracy = max(accuracy_to_params)
best_C, best_gamma = accuracy_to_params[best_accuracy]
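
Keying a dictionary by accuracy works here, but two settings with identical accuracy overwrite each other. An alternative sketch keeps one record per setting and picks the maximum; the record layout and the name `pick_best` are assumptions, and the four accuracies are copied from the tables above.

```python
def pick_best(records):
    """Return the (accuracy, kernel, C, gamma) record with the highest accuracy."""
    return max(records, key=lambda record: record[0])


records = [
    (0.848285, "linear", 1, None),    # linear kernel has no gamma
    (0.847578, "rbf", 1, 0.1),
    (0.759190, "rbf", 0.01, 0.01),
    (0.759190, "rbf", 0.01, 1),       # same accuracy as the row above, both kept
]
top_accuracy, top_kernel, top_C, top_gamma = pick_best(records)
print(top_kernel, top_C, top_gamma, top_accuracy)
```

Storing the kernel name explicitly also removes the need for the -1 sentinel for gamma.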

In [10]:
from sklearn.metrics import accuracy_score

In [11]:
if best_gamma == -1: # Linear Kernel performed best
    print("Using linear Kernel with C=%.2f"%(best_C))
    best_model = SVC(kernel="linear", C=best_C)
else: # RBF kernel performed best
    print("Using RBF Kernel with C=%.2f and gamma=%.2f"%(best_C, best_gamma))
    best_model = SVC(kernel='rbf', C=best_C, gamma=best_gamma)

# Fitting and predicting
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

Using linear Kernel with C=1.00


In [12]:
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.6f}")

Test accuracy: 0.849764
