# Lab 07 Practice: Logistic Regression and Support Vector Machines
**CS201L: Artificial Intelligence Laboratory**  
**Indian Institute of Technology, Dharwad**

---

In this lab we will:
1. Load and preprocess the **Date Fruit** dataset
2. Train a **Logistic Regression** model and analyze feature weights
3. Train **SVM** models with Linear, Polynomial, and RBF kernels
4. Tune the hyperparameter **C** using the validation set
5. Evaluate all models using standard classification metrics

## Step 0: Import Libraries

In [None]:
# Importing all the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

print("All libraries imported successfully!")

## Step 1: Load the Dataset

The dataset is already split into Train (60%), Validation (20%), and Test (20%) sets.  
We load each CSV separately and separate the features from the target label (`Class`).

In [None]:
# Loading train, validation and test splits
X_train_df = pd.read_csv("DateFruit_Train.csv")
X_val_df   = pd.read_csv("DateFruit_Validation.csv")
X_test_df  = pd.read_csv("DateFruit_Test.csv")

# Separating features and labels
# 'Class' is the target column, rest are features
X_train = X_train_df.drop(columns=["Class"])
y_train = X_train_df["Class"]

X_val   = X_val_df.drop(columns=["Class"])
y_val   = X_val_df["Class"]

X_test  = X_test_df.drop(columns=["Class"])
y_test  = X_test_df["Class"]

print("Dataset loaded successfully!")
print(f"Train size     : {X_train.shape}")
print(f"Validation size: {X_val.shape}")
print(f"Test size      : {X_test.shape}")
print(f"\nClasses: {y_train.unique()}")

In [None]:
# Let's take a quick look at the training data
X_train_df.head()

## Step 2: Data Scaling

**Why is scaling important?**  
This dataset has 34 features with very different value ranges. Without scaling, features with large numeric ranges dominate the dot products and distances that SVM kernels rely on, and they can slow down or destabilize the optimizer used by Logistic Regression.

`StandardScaler` transforms each feature to have **zero mean** and **unit variance**.

> **Important:** We `fit` the scaler **only on training data** and then `transform` val/test with the same scaler. Fitting on val/test would be data leakage!
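
To see what "fit on train, transform the rest" means numerically, here is a minimal sketch on made-up data. The arrays `X_tr` and `X_va` and the name `toy_scaler` are hypothetical and are not the Date Fruit features. It checks that `StandardScaler` standardizes validation rows with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[5.0, 5000.0], scale=[1.0, 800.0], size=(100, 2))
X_va = rng.normal(loc=[5.0, 5000.0], scale=[1.0, 800.0], size=(30, 2))

# Statistics are learned from the training rows only
toy_scaler = StandardScaler().fit(X_tr)

# Manual z-score using the *training* mean and population std (ddof=0)
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_va_manual = (X_va - mu) / sigma

same = np.allclose(toy_scaler.transform(X_va), X_va_manual)
print("Scaler matches manual z-score:", same)
print("Validation means after scaling:", X_va_manual.mean(axis=0).round(3))
```

The validation means come out close to zero but not exactly zero, which is expected: the validation set is scaled with statistics it did not contribute to.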

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Fit ONLY on training data, then transform
X_train_scaled = scaler.fit_transform(X_train)

# Only transform val and test (DO NOT fit again!)
X_val_scaled   = scaler.transform(X_val)
X_test_scaled  = scaler.transform(X_test)

print("Data scaling done using StandardScaler.")
print(f"Mean of first feature in train (should be ~0): {X_train_scaled[:, 0].mean():.4f}")
print(f"Std  of first feature in train (should be ~1): {X_train_scaled[:, 0].std():.4f}")

## Step 3: Helper Function for Metrics

We'll be evaluating many models, so let's write a helper function once to avoid repeating the same code every time.

In [None]:
def print_metrics(y_true, y_pred, dataset_name="Test"):
    """
    Prints confusion matrix and classification metrics
    for multiclass classification.
    """
    cm       = confusion_matrix(y_true, y_pred)
    acc      = accuracy_score(y_true, y_pred)
    prec_mac = precision_score(y_true, y_pred, average='macro',  zero_division=0)
    prec_mic = precision_score(y_true, y_pred, average='micro',  zero_division=0)
    rec_mac  = recall_score(y_true, y_pred,    average='macro',  zero_division=0)
    rec_mic  = recall_score(y_true, y_pred,    average='micro',  zero_division=0)
    f1_mac   = f1_score(y_true, y_pred,        average='macro',  zero_division=0)
    f1_mic   = f1_score(y_true, y_pred,        average='micro',  zero_division=0)

    print(f"\n{'='*50}")
    print(f"  {dataset_name} Results")
    print(f"{'='*50}")
    print("Confusion Matrix:")
    print(cm)
    print(f"\nAccuracy          : {acc:.4f}")
    print(f"Precision (Macro) : {prec_mac:.4f}")
    print(f"Precision (Micro) : {prec_mic:.4f}")
    print(f"Recall    (Macro) : {rec_mac:.4f}")
    print(f"Recall    (Micro) : {rec_mic:.4f}")
    print(f"F1-Score  (Macro) : {f1_mac:.4f}")
    print(f"F1-Score  (Micro) : {f1_mic:.4f}")

    return acc

print("Helper function defined!")
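
The helper reports both macro and micro averages. The sketch below, on hand-made labels (hypothetical, not model output), shows how they differ: macro averages the per-class scores with equal weight per class, while micro pools all predictions first. For single-label multiclass problems, micro-averaged precision equals accuracy.

```python
import numpy as np

# Hypothetical labels for an imbalanced 3-class problem
y_true_toy = np.array(["A", "A", "A", "A", "A", "A", "B", "B", "C", "C"])
y_pred_toy = np.array(["A", "A", "A", "A", "A", "A", "A", "B", "A", "C"])

classes = np.unique(y_true_toy)
# True positives and false positives per class
tp = np.array([np.sum((y_pred_toy == c) & (y_true_toy == c)) for c in classes])
fp = np.array([np.sum((y_pred_toy == c) & (y_true_toy != c)) for c in classes])

# Every class is predicted at least once here, so tp + fp is never zero
per_class_precision = tp / (tp + fp)

macro_precision = per_class_precision.mean()        # each class counts equally
micro_precision = tp.sum() / (tp.sum() + fp.sum())  # each sample counts equally
accuracy = np.mean(y_true_toy == y_pred_toy)

print("Per-class precision:", dict(zip(classes, per_class_precision.round(3))))
print(f"Macro: {macro_precision:.4f}  Micro: {micro_precision:.4f}  Accuracy: {accuracy:.4f}")
```

If a class were never predicted, `tp + fp` would be zero for it; that is the case the `zero_division=0` argument in `print_metrics` handles.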

---
## Task 1: Logistic Regression

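Logistic Regression fits a linear model to the log-odds. In the binary case the sigmoid function turns the score into a class probability. For multiclass problems (like our Date Fruit dataset), recent scikit-learn versions with the default `lbfgs` solver fit a single multinomial (softmax) model rather than One-vs-Rest (OvR).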
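
As a quick check of how the class probabilities are produced, the sketch below fits a multiclass model on synthetic data and compares `predict_proba` with a hand-written softmax over the linear scores. It assumes a recent scikit-learn version in which a multiclass `LogisticRegression` with the default solver is multinomial. The names `X_toy`, `y_toy`, and `toy_lr` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 3-class toy problem (not the Date Fruit data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=0)

toy_lr = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Linear scores, one column per class: X @ coef_.T + intercept_
scores = toy_lr.decision_function(X_toy)

# Softmax over classes (subtract the row max for numerical stability)
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
manual_proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)

matches_softmax = np.allclose(manual_proba, toy_lr.predict_proba(X_toy))
print("predict_proba equals softmax of scores:", matches_softmax)
```

Under an OvR setup the probabilities would instead come from per-class sigmoids that are then normalized, so this comparison is also a simple way to see which scheme a given installation uses.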

In [None]:
# Initializing and training the Logistic Regression model
# max_iter=1000 because the default 100 might not be enough to converge
logistic_reg = LogisticRegression(max_iter=1000)
logistic_reg.fit(X_train_scaled, y_train)

print("Logistic Regression model trained successfully!")

In [None]:
# Predicting on validation and test data
y_val_pred_lr  = logistic_reg.predict(X_val_scaled)
y_test_pred_lr = logistic_reg.predict(X_test_scaled)

# Printing metrics for both splits
print_metrics(y_val,  y_val_pred_lr,  dataset_name="Logistic Regression - Validation")
print_metrics(y_test, y_test_pred_lr, dataset_name="Logistic Regression - Test")

### Task 1 – Feature Importance via Weight Vector

For multiclass Logistic Regression, `coef_` has shape `(n_classes, n_features)`.  
We take the **mean of absolute values** across classes to get one importance score per feature. Because the features were standardized in Step 2, these magnitudes are comparable across features.
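
The reason for taking absolute values first can be seen on a tiny hand-made weight matrix (the values and feature names below are hypothetical): a feature can push strongly toward one class and away from another, so the signed mean cancels out even though the feature matters.

```python
import numpy as np

# Hypothetical coef_ for 3 classes (rows) and 3 features (columns)
toy_coef = np.array([[ 1.0, -2.0,  0.0],
                     [-3.0,  2.0,  0.5],
                     [ 2.0,  0.0, -0.5]])
toy_features = ["f0", "f1", "f2"]

signed_mean = toy_coef.mean(axis=0)          # signs cancel across classes
importance  = np.abs(toy_coef).mean(axis=0)  # magnitudes are preserved

# Features sorted from most to least important
ranking = [toy_features[i] for i in np.argsort(importance)[::-1]]

print("Signed mean :", signed_mean)
print("Mean |w|    :", importance.round(4))
print("Ranking     :", ranking)
```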

In [None]:
# Accessing the weight vector
weights = logistic_reg.coef_
print(f"Shape of coef_ : {weights.shape}  (n_classes x n_features)")

# Taking mean of absolute values across all classes
avg_weights   = np.mean(np.abs(weights), axis=0)
feature_names = X_train.columns

# Plotting bar graph of feature weights
plt.figure(figsize=(14, 5))
plt.bar(range(len(avg_weights)), avg_weights, color='steelblue')
plt.xticks(range(len(feature_names)), feature_names, rotation=90, fontsize=7)
plt.xlabel("Features")
plt.ylabel("Mean |Weight|")
plt.title("Logistic Regression – Feature Importance (Mean Absolute Weights)")
plt.tight_layout()
plt.savefig("lr_feature_importance.png", dpi=150)
plt.show()
print("Plot saved as 'lr_feature_importance.png'")

---
## Task 2: Support Vector Machine (SVM)

SVM tries to find the **optimal hyperplane** that separates classes with the **maximum margin**. We will experiment with three types of kernels:
- **Linear** – a good fit when the classes are (nearly) linearly separable
- **Polynomial** – captures non-linear feature interactions (degrees 2, 3, 4, 5)
- **RBF** – a flexible general-purpose non-linear kernel

We tune the regularization parameter **C** using the validation set.
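
For reference, the standard soft-margin formulation for the binary case makes the role of C explicit. The symbols $w$, $b$, and $\xi_i$ are the usual textbook notation (not names used in this notebook), and the labels are $y_i \in \{-1, +1\}$:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0 .
$$

A large C penalizes margin violations heavily, which gives a narrower margin and a higher risk of overfitting. A small C tolerates more violations, which gives a wider margin and stronger regularization.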

In [None]:
# C values to try for hyperparameter tuning
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# We'll store all results here to make a comparison table at the end
results_table = []

print("C values to try:", C_values)

### Task 2a: SVM with Linear Kernel

In [None]:
print("SVM with Linear Kernel – Tuning C\n")
print(f"{'C':<10} {'Val Accuracy':<18} {'Test Accuracy'}")
print("-" * 45)

best_val_acc_linear = 0
best_C_linear       = None
best_linear_model   = None

for C in C_values:
    svm_linear = SVC(kernel='linear', C=C)
    svm_linear.fit(X_train_scaled, y_train)

    y_val_pred  = svm_linear.predict(X_val_scaled)
    y_test_pred = svm_linear.predict(X_test_scaled)

    val_acc  = accuracy_score(y_val,  y_val_pred)
    test_acc = accuracy_score(y_test, y_test_pred)

    print(f"{C:<10} {val_acc:<18.4f} {test_acc:.4f}")
    results_table.append({"Kernel": "Linear", "C": C, "Degree": "-",
                           "Val Acc": round(val_acc, 4), "Test Acc": round(test_acc, 4)})

    # Track the best model based on validation accuracy
    if val_acc > best_val_acc_linear:
        best_val_acc_linear = val_acc
        best_C_linear       = C
        best_linear_model   = svm_linear

print(f"\n>> Best C for Linear SVM: {best_C_linear}  (Val Accuracy: {best_val_acc_linear:.4f})")

In [None]:
# Detailed metrics for the best linear SVM
y_val_pred_best  = best_linear_model.predict(X_val_scaled)
y_test_pred_best = best_linear_model.predict(X_test_scaled)

print_metrics(y_val,  y_val_pred_best,  dataset_name=f"Linear SVM (C={best_C_linear}) – Validation")
print_metrics(y_test, y_test_pred_best, dataset_name=f"Linear SVM (C={best_C_linear}) – Test")

In [None]:
# Weight plot for Linear SVM
# coef_ is available only for the linear kernel
# Shape: (n_classes*(n_classes-1)/2, n_features) for OvO multi-class
lin_weights     = best_linear_model.coef_
avg_lin_weights = np.mean(np.abs(lin_weights), axis=0)

print(f"Shape of coef_ for Linear SVM: {lin_weights.shape}")

plt.figure(figsize=(14, 5))
plt.bar(range(len(avg_lin_weights)), avg_lin_weights, color='darkorange')
plt.xticks(range(len(feature_names)), feature_names, rotation=90, fontsize=7)
plt.xlabel("Features")
plt.ylabel("Mean |Weight|")
plt.title(f"Linear SVM (C={best_C_linear}) – Feature Weights")
plt.tight_layout()
plt.savefig("svm_linear_weights.png", dpi=150)
plt.show()
print("Plot saved as 'svm_linear_weights.png'")
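
The One-vs-One shape noted in the comment above can be checked on a small synthetic problem. This is only a sketch; `X_toy`, `y_toy`, and `toy_svm` are hypothetical names, and the data has nothing to do with the Date Fruit set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical 4-class toy problem with 5 features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, n_informative=4,
                                   n_redundant=0, n_classes=4,
                                   n_clusters_per_class=1, random_state=0)

toy_svm = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

n_classes = len(toy_svm.classes_)
n_pairs = n_classes * (n_classes - 1) // 2   # one binary classifier per class pair

print("coef_ shape:", toy_svm.coef_.shape, "| expected rows:", n_pairs)
```

Each row of `coef_` belongs to one pair of classes, so the averaged bar plot summarizes all pairwise classifiers at once rather than one weight vector per class.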

### Task 2b: SVM with Polynomial Kernel

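Polynomial kernel: $K(x_i, x_j) = (\gamma \cdot x_i^T x_j + r)^d$, where $d$ is `degree`, $\gamma$ is `gamma`, and $r$ is `coef0` in scikit-learn's `SVC` (we leave `coef0` at its default of 0).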

We try degrees [2, 3, 4, 5] and all C values.
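
To make the "kernel trick" concrete, the sketch below compares the polynomial kernel with an explicit feature map for degree 2 on 2-D inputs. The values $\gamma = 1$ and $r = 1$ are chosen for illustration only (the notebook uses `gamma='scale'` and the default `coef0`), and the helper names are hypothetical.

```python
import numpy as np

def poly_kernel(x, y, gamma=1.0, r=1.0, d=2):
    """Polynomial kernel (gamma * <x, y> + r) ** d."""
    return (gamma * np.dot(x, y) + r) ** d

def phi_degree2(x):
    """Explicit feature map matching d=2, gamma=1, r=1 for 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

k_value   = poly_kernel(a, b)                        # computed in the original 2-D space
phi_value = np.dot(phi_degree2(a), phi_degree2(b))   # explicit 6-D dot product

print("Kernel value      :", k_value)
print("Explicit dot prod :", phi_value)
```

Both numbers agree, but the kernel never builds the 6-D vectors. This is also why no weight vector in the original feature space is available for non-linear kernels.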

In [None]:
print("SVM with Polynomial Kernel – Tuning C and Degree\n")
degrees = [2, 3, 4, 5]
print(f"{'C':<10} {'Degree':<10} {'Val Accuracy':<18} {'Test Accuracy'}")
print("-" * 55)

for degree in degrees:
    for C in C_values:
        poly_svm = SVC(kernel='poly', degree=degree, C=C, gamma='scale')
        poly_svm.fit(X_train_scaled, y_train)

        y_val_pred  = poly_svm.predict(X_val_scaled)
        y_test_pred = poly_svm.predict(X_test_scaled)

        val_acc  = accuracy_score(y_val,  y_val_pred)
        test_acc = accuracy_score(y_test, y_test_pred)

        print(f"{C:<10} {degree:<10} {val_acc:<18.4f} {test_acc:.4f}")
        results_table.append({"Kernel": f"Poly(deg={degree})", "C": C, "Degree": degree,
                               "Val Acc": round(val_acc, 4), "Test Acc": round(test_acc, 4)})

In [None]:
# Finding the best polynomial configuration based on validation accuracy
poly_results = [r for r in results_table if "Poly" in r["Kernel"]]
best_poly    = max(poly_results, key=lambda x: x["Val Acc"])

print(f"Best Poly Config: Degree={best_poly['Degree']}, C={best_poly['C']}  "
      f"(Val Acc: {best_poly['Val Acc']:.4f})")

# Training best poly model and printing detailed metrics
best_poly_model = SVC(kernel='poly', degree=best_poly['Degree'], C=best_poly['C'], gamma='scale')
best_poly_model.fit(X_train_scaled, y_train)

print_metrics(y_val,  best_poly_model.predict(X_val_scaled),
              dataset_name=f"Poly SVM (deg={best_poly['Degree']}, C={best_poly['C']}) – Validation")
print_metrics(y_test, best_poly_model.predict(X_test_scaled),
              dataset_name=f"Poly SVM (deg={best_poly['Degree']}, C={best_poly['C']}) – Test")

# Weight vectors are NOT accessible for polynomial kernel
print("\nNote: Weight vectors are not available for Polynomial kernel.")
print("The decision boundary exists in a higher-dimensional space.")

### Task 2c: SVM with RBF Kernel

RBF kernel: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

It measures similarity between points based on their distance: the value is 1 for identical points and decays toward 0 as they move apart. It is often a strong default for non-linear problems.
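
A small numeric sketch of the RBF formula, using hand-picked points (hypothetical, not dataset rows). It also computes what `gamma='scale'` means in scikit-learn, namely `1 / (n_features * X.var())`.

```python
import numpy as np

def rbf_kernel_value(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for two 1-D vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

p      = np.array([0.0, 0.0])
q_near = np.array([0.1, 0.1])
q_far  = np.array([1.0, 1.0])

k_self = rbf_kernel_value(p, p, gamma=0.5)        # identical points
k_near = rbf_kernel_value(p, q_near, gamma=0.5)   # small distance
k_far  = rbf_kernel_value(p, q_far, gamma=0.5)    # squared distance 2, so exp(-1)

# gamma='scale' for a toy feature matrix (3 samples, 2 features)
X_demo = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

print(f"k_self={k_self:.4f}  k_near={k_near:.4f}  k_far={k_far:.4f}")
print(f"gamma='scale' on X_demo: {gamma_scale:.4f}")
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood.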

In [None]:
print("SVM with RBF Kernel – Tuning C\n")
print(f"{'C':<10} {'Val Accuracy':<18} {'Test Accuracy'}")
print("-" * 45)

best_val_acc_rbf = 0
best_C_rbf       = None
best_rbf_model   = None

for C in C_values:
    rbf_svm = SVC(kernel='rbf', C=C, gamma='scale')
    rbf_svm.fit(X_train_scaled, y_train)

    y_val_pred  = rbf_svm.predict(X_val_scaled)
    y_test_pred = rbf_svm.predict(X_test_scaled)

    val_acc  = accuracy_score(y_val,  y_val_pred)
    test_acc = accuracy_score(y_test, y_test_pred)

    print(f"{C:<10} {val_acc:<18.4f} {test_acc:.4f}")
    results_table.append({"Kernel": "RBF", "C": C, "Degree": "-",
                           "Val Acc": round(val_acc, 4), "Test Acc": round(test_acc, 4)})

    if val_acc > best_val_acc_rbf:
        best_val_acc_rbf = val_acc
        best_C_rbf       = C
        best_rbf_model   = rbf_svm

print(f"\n>> Best C for RBF SVM: {best_C_rbf}  (Val Accuracy: {best_val_acc_rbf:.4f})")

In [None]:
# Detailed metrics for best RBF SVM
print_metrics(y_val,  best_rbf_model.predict(X_val_scaled),
              dataset_name=f"RBF SVM (C={best_C_rbf}) – Validation")
print_metrics(y_test, best_rbf_model.predict(X_test_scaled),
              dataset_name=f"RBF SVM (C={best_C_rbf}) – Test")

print("\nNote: Weight vectors are not available for RBF kernel.")
print("The decision boundary exists in an infinite-dimensional space.")

---
## Task 3: Complete Results Table

Let's display all results in one table to compare across all kernels and C values.

In [None]:
# Creating a DataFrame from the collected results
results_df = pd.DataFrame(results_table)

# Displaying the full table
print("COMPLETE RESULTS TABLE – All Kernels and C values")
results_df

In [None]:
# Saving to CSV for future reference
results_df.to_csv("svm_results_table.csv", index=False)
print("Results saved to 'svm_results_table.csv'")

# Quick summary: Best config per kernel
print("\n--- Best Configuration per Kernel (based on Val Accuracy) ---")
best_per_kernel = results_df.loc[results_df.groupby("Kernel")["Val Acc"].idxmax()]
print(best_per_kernel.to_string(index=False))

---
## Summary

| Task | Model | Notes |
|------|-------|-------|
| Task 1 | Logistic Regression | Linear classifier, feature weights accessible via `coef_` |
| Task 2a | SVM – Linear | Weight vector accessible, good baseline |
| Task 2b | SVM – Polynomial | Non-linear, weights not accessible, tune degree + C |
| Task 2c | SVM – RBF | Non-linear, weights not accessible, often a strong performer |
| Task 3 | Metrics | Confusion Matrix, Accuracy, Precision, Recall, F1 (Macro & Micro) |

**Key takeaways:**
- Always **scale your data** before training LR or SVM.
- Use the **validation set** to tune C (not the test set!).
- The RBF kernel is often a strong choice for non-linear data, but confirm this on the validation set.
- Weight vectors can only be interpreted for the **linear kernel**.