### Module 9: Supervised Learning - II

#### Case Study – 3

Objective:

• Employ an SVM from scikit-learn for binary classification.
• Observe the impact of data preprocessing and of hyperparameter search using grid search.

Questions:

1. Load the data from “college.csv”, which contains attributes collected about private and public colleges for a particular year. We will try to predict the private/public status of each college from the other attributes.
2. Use LabelEncoder to encode the target variable in numerical form, and split the data so that 20% is set aside for testing.
3. Fit a linear SVM from scikit-learn and observe the accuracy.
[Hint: Use LinearSVC]
4. Preprocess the data using StandardScaler, fit the same model again, and observe the change in accuracy.
[Hint: Refer to scikit-learn’s preprocessing methods]
5. Use scikit-learn’s grid search to select the best hyperparameters for a non-linear SVM, and identify the model with the best score and its parameters.
[Hint: Refer to the model_selection module of scikit-learn]

In [1]:
# Load the data from “college.csv” that has attributes collected about private and public colleges for a particular year. 
# We will try to predict the private/public status of the college from other attributes.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
df = pd.read_csv("college.csv")

# Define target and features
# Assuming 'Private' column indicates private/public status (binary: Yes/No)
y = df['Private'].map({'Yes': 1, 'No': 0})   # convert to numeric
X = df.drop(columns=['Private'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing (scaling is critical for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define SVM model
svm = SVC()

# Hyperparameter tuning with grid search
# Note: 'gamma' only affects the 'rbf' and 'poly' kernels; it is ignored for 'linear'
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_)

# Evaluate on test set
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test_scaled)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Best CV Accuracy: 0.9500774193548386
Test Accuracy: 0.9230769230769231

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.84      0.86        43
           1       0.94      0.96      0.95       113

    accuracy                           0.92       156
   macro avg       0.91      0.90      0.90       156
weighted avg       0.92      0.92      0.92       156
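
In the cell above, the scaler is fitted on the whole training split before cross-validation, so every validation fold has already contributed to the scaling statistics. One common way to avoid this is to wrap the scaler and the classifier in a `Pipeline`, so that the scaler is refitted inside each fold. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for the college data, and the step names `scaler` and `svc` and all `*_demo` variable names are hypothetical choices, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the college data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling and classification as one estimator; the scaler is refitted per CV fold
pipe_demo = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid_demo = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=5, scoring="accuracy")
search_demo.fit(X_tr, y_tr)

print("Best parameters:", search_demo.best_params_)
print("Test accuracy:", search_demo.score(X_te, y_te))
```

With this layout the raw (unscaled) features are passed to both `fit` and `score`, because the pipeline applies the scaler itself.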



In [2]:
# Use LabelEncoder to encode the target variable into numerical form and split the data such that 20% of the data is set aside for testing.

from sklearn.preprocessing import LabelEncoder

# Encode target variable using LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['Private'])  # 'Yes' → 1, 'No' → 0

# Define features (drop target column)
X = df.drop(columns=['Private'])

# Train-test split (20% test size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Encoded target classes:", list(label_encoder.classes_))
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])


Encoded target classes: ['No', 'Yes']
Training set size: 621
Test set size: 156
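
The printed class order explains the encoding: `LabelEncoder` assigns integer codes following the sorted order of the labels, so 'No' becomes 0 and 'Yes' becomes 1. The short sketch below shows how the mapping can be recovered from `classes_` and how codes can be turned back into labels. The label list and the `*_demo` names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'Private' column
labels_demo = ["Yes", "No", "Yes", "Yes", "No"]

encoder_demo = LabelEncoder()
codes_demo = encoder_demo.fit_transform(labels_demo)

# classes_ is sorted, so the position of each class is its integer code
mapping_demo = {str(label): code for code, label in enumerate(encoder_demo.classes_)}
print(mapping_demo)  # {'No': 0, 'Yes': 1}

# inverse_transform converts integer codes back into the original labels
decoded_demo = list(encoder_demo.inverse_transform(codes_demo))
print(decoded_demo)
```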


In [3]:
# Fit a linear SVM from scikit-learn and observe the accuracy. [Hint: Use LinearSVC]

from sklearn.svm import LinearSVC

# Encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['Private'])  # 'Yes' → 1, 'No' → 0

# Define features
X = df.drop(columns=['Private'])

# Train-test split (20% test size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing (scaling is important for SVM)
# Note: this cell already fits on scaled features; the unscaled baseline is fitted in the next cell
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit Linear SVM
svm_linear = LinearSVC(max_iter=5000, random_state=42)
svm_linear.fit(X_train_scaled, y_train)

# Predictions
y_pred = svm_linear.predict(X_test_scaled)

# Evaluate accuracy
print("Linear SVM Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Linear SVM Accuracy: 0.9102564102564102

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.79      0.83        43
           1       0.92      0.96      0.94       113

    accuracy                           0.91       156
   macro avg       0.90      0.87      0.88       156
weighted avg       0.91      0.91      0.91       156



In [4]:
# Preprocess the data using StandardScaler, fit the same model again, and observe the change in accuracy.
# [Hint: Refer to scikit-learn’s preprocessing methods]

# Fit Linear SVM (without scaling for comparison)
svm_no_scale = LinearSVC(max_iter=5000, random_state=42)
svm_no_scale.fit(X_train, y_train)
y_pred_no_scale = svm_no_scale.predict(X_test)

print("Accuracy without scaling:", accuracy_score(y_test, y_pred_no_scale))

# Fit Linear SVM (with scaling)
svm_scaled = LinearSVC(max_iter=5000, random_state=42)
svm_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = svm_scaled.predict(X_test_scaled)

print("Accuracy with StandardScaler:", accuracy_score(y_test, y_pred_scaled))
print("\nClassification Report (Scaled Data):\n", classification_report(y_test, y_pred_scaled))


Accuracy without scaling: 0.9102564102564102
Accuracy with StandardScaler: 0.9102564102564102

Classification Report (Scaled Data):
               precision    recall  f1-score   support

           0       0.87      0.79      0.83        43
           1       0.92      0.96      0.94       113

    accuracy                           0.91       156
   macro avg       0.90      0.87      0.88       156
weighted avg       0.91      0.91      0.91       156
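
The two accuracies above are identical on this split, which does not by itself show that scaling is irrelevant. One thing worth checking is whether the solver converged: `LinearSVC` can emit a `ConvergenceWarning` when features sit on very different scales. The sketch below shows one way to count such warnings. It is an illustration under assumptions: the data is synthetic, the column rescaling is artificial, and `count_convergence_warnings` is a hypothetical helper, not a scikit-learn function.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def count_convergence_warnings(X, y):
    """Fit a LinearSVC and return how many ConvergenceWarnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LinearSVC(max_iter=5000, random_state=42).fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)


# Synthetic stand-in data; each column is stretched to a different range
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X_unscaled = X_demo * rng.uniform(1, 10_000, size=X_demo.shape[1])
X_scaled = StandardScaler().fit_transform(X_unscaled)

print("Warnings without scaling:", count_convergence_warnings(X_unscaled, y_demo))
print("Warnings with scaling:", count_convergence_warnings(X_scaled, y_demo))
```

If the unscaled fit reports a warning, its accuracy comes from a solver that stopped at `max_iter` rather than from a converged model, which is worth keeping in mind when comparing the two numbers.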



In [5]:
# Use scikit-learn’s grid search to select the best hyperparameters for a non-linear SVM, and identify the model with the best score and its parameters.
# [Hint: Refer to the model_selection module of scikit-learn]

# Define SVM model
svm = SVC()

# Define hyperparameter grid for non-linear SVM
param_grid = {
    'C': [0.1, 1, 10, 100],        # Regularization parameter
    'gamma': [0.001, 0.01, 0.1, 1], # Kernel coefficient for RBF
    'kernel': ['rbf']               # Non-linear kernel
}

# Grid Search with cross-validation
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Best model and parameters
print("Best Parameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_)

# Evaluate on test set
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test_scaled)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Best Parameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
Best CV Accuracy: 0.9468645161290322
Test Accuracy: 0.9230769230769231

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.79      0.85        43
           1       0.92      0.97      0.95       113

    accuracy                           0.92       156
   macro avg       0.92      0.88      0.90       156
weighted avg       0.92      0.92      0.92       156
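
Besides `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps the scores of every parameter combination in `cv_results_`, which helps to judge how close the runners-up are to the winner. The sketch below loads those results into a DataFrame and sorts them by rank. It is a minimal illustration on synthetic data, with a smaller grid than the one above; the `*_demo` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the college dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search_demo = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
    scoring="accuracy",
)
search_demo.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
columns_demo = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
results_demo = pd.DataFrame(search_demo.cv_results_)[columns_demo].sort_values("rank_test_score")

print(results_demo.head())
```

A small gap in `mean_test_score` relative to `std_test_score` suggests that several settings perform about equally well on this data.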

