# Chapter 24: Support Vector Machines

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the fundamental concepts of Support Vector Machines (SVMs) for classification and regression
- Explain the role of margins, support vectors, and the kernel trick
- Apply SVM for classification (SVC) to predict NEPSE price direction
- Use SVM for regression (SVR) to forecast returns or prices
- Choose appropriate kernel functions (linear, polynomial, RBF) based on data characteristics
- Tune SVM hyperparameters (C, gamma, degree) using time‑series cross‑validation
- Scale features appropriately for SVM
- Interpret SVM models and understand their limitations
- Compare SVM performance with linear models and tree‑based models on the NEPSE dataset

---

## **24.1 SVM Fundamentals**

Support Vector Machines (SVMs) are a class of powerful supervised learning algorithms originally developed for binary classification. The core idea is to find a hyperplane that separates the classes with the largest possible margin. The “support vectors” are the data points closest to the decision boundary; they alone determine the hyperplane. SVMs can also be extended to regression tasks (SVR) and, through the **kernel trick**, can capture non‑linear relationships without explicitly computing features in a high‑dimensional space.

In the context of the NEPSE prediction system, we can use SVM to classify whether the next day's return will be positive (up) or negative (down). Alternatively, we can use SVR to forecast the exact return magnitude.

### **24.1.1 Maximum Margin Classification**

For a binary classification problem with linearly separable data, there are infinitely many separating hyperplanes. SVM chooses the one that maximizes the distance (margin) between the two classes. The decision function is:

`f(x) = w·x + b`

where `w` is the weight vector and `b` is the bias. The class is determined by the sign of `f(x)`. The margin is `2/||w||`, so maximizing the margin is equivalent to minimizing `||w||²` subject to the constraint that every training point lies on the correct side of the margin: `yᵢ(w·xᵢ + b) ≥ 1`, with labels `yᵢ ∈ {−1, +1}`.
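To make the `2/||w||` formula concrete, here is a minimal sketch on hypothetical toy data (four hand-picked points, not NEPSE data). It fits a linear `SVC` with a very large `C` to approximate a hard margin, then reads the margin width off the fitted weights.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (hypothetical): class 0 sits at x1 = 0, class 1 at x1 = 2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel='linear', C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]        # weight vector of the separating hyperplane
b = clf.intercept_[0]   # bias term
margin = 2 / np.linalg.norm(w)

print(f"w = {w}, b = {b:.3f}")
print(f"Margin width 2/||w|| = {margin:.3f}")
```

The two classes are separated by a gap of width 2 along the first feature, so the printed margin should come out close to 2.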

In practice, data is often not perfectly separable. SVM introduces **slack variables** to allow some points to be on the wrong side of the margin (or even misclassified). This is controlled by the hyperparameter `C`:

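- Large `C` penalizes margin violations heavily and approaches a hard margin (fewer training errors, but a narrower margin that may overfit).
- Small `C` tolerates more violations and gives a wider, softer margin (potentially better generalization, but it may underfit if set too low).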
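The effect of `C` can be observed directly. The sketch below uses synthetic, overlapping two-class data from `make_blobs` (an assumption made for illustration, not NEPSE data) and reports the margin width and the number of support vectors for several values of `C`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (not NEPSE data)
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                            cluster_std=1.5, random_state=0)

n_support = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    n_support[C] = int(clf.n_support_.sum())        # support vectors across both classes
    margin = 2 / np.linalg.norm(clf.coef_[0])       # margin width from the fitted weights
    print(f"C={C:<6} margin width={margin:.2f}  support vectors={n_support[C]}")
```

On data like this, a smaller `C` typically yields a wider margin with more points inside it, and therefore more support vectors.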

### **24.1.2 The Kernel Trick**

To handle non‑linear decision boundaries, SVM can map the original features into a higher‑dimensional space where the classes become linearly separable. The **kernel trick** allows this mapping to be performed implicitly by replacing dot products with a kernel function `K(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ)`. Common kernels:

- **Linear:** `K(x, y) = x·y` (equivalent to no mapping).
- **Polynomial:** `K(x, y) = (γ x·y + r)ᵈ` where `d` is the degree.
- **RBF (Radial Basis Function):** `K(x, y) = exp(-γ ||x - y||²)`, the most popular non‑linear kernel.
- **Sigmoid:** `K(x, y) = tanh(γ x·y + r)`.

For time‑series financial data, RBF is often a good starting point because it can model complex, non‑linear relationships.
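As a quick check of the kernel formulas, the sketch below computes the RBF and polynomial kernel matrices by hand on a small random matrix (hypothetical data) and compares them with scikit-learn's `rbf_kernel` and `polynomial_kernel`. The values `gamma = 0.5`, `r = 1`, and `d = 2` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # 5 hypothetical samples, 3 features
gamma = 0.5

# RBF: exp(-gamma * ||x - y||^2) for every pair of rows
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K_rbf_manual = np.exp(-gamma * sq_dists)

# Polynomial: (gamma * x·y + r)^d with r = 1, d = 2
K_poly_manual = (gamma * A @ A.T + 1.0) ** 2

rbf_matches = np.allclose(K_rbf_manual, rbf_kernel(A, gamma=gamma))
poly_matches = np.allclose(
    K_poly_manual, polynomial_kernel(A, degree=2, gamma=gamma, coef0=1.0)
)
print(f"RBF matches sklearn:        {rbf_matches}")
print(f"Polynomial matches sklearn: {poly_matches}")
```

Note that only the 5×5 kernel matrix is ever computed; the high-dimensional mapping `φ` is never constructed, which is the point of the kernel trick.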

---

## **24.2 SVM for Classification (SVC)**

We'll first apply SVM to the binary direction prediction task from previous chapters. The target is 1 if tomorrow's return > 0, else 0.

### **24.2.1 Data Preparation and Scaling**

SVM is sensitive to feature scales because it relies on distances. Therefore, we **must** scale features, typically to zero mean and unit variance.

```python
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

# Load and prepare NEPSE data (same as previous chapters)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Use a single symbol for simplicity
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Create features and target (as in Chapter 23)
df_stock['Return'] = df_stock['Close'].pct_change() * 100
df_stock['Return_Lag1'] = df_stock['Return'].shift(1)
df_stock['Return_Lag2'] = df_stock['Return'].shift(2)
df_stock['Volume_Lag1'] = df_stock['Vol'].shift(1)
df_stock['MA_5'] = df_stock['Close'].rolling(5).mean()
df_stock['Volatility_5'] = df_stock['Return'].rolling(5).std()
# RSI (simplified)
delta = df_stock['Close'].diff()
gain = delta.where(delta > 0, 0)
loss = -delta.where(delta < 0, 0)
avg_gain = gain.rolling(14).mean()
avg_loss = loss.rolling(14).mean()
rs = avg_gain / avg_loss
df_stock['RSI'] = 100 - (100 / (1 + rs))

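# Target: direction (1 if next day's return > 0, else 0)
df_stock['Next_Return'] = df_stock['Return'].shift(-1)
df_stock['Target'] = (df_stock['Next_Return'] > 0).astype(int)

# Drop NaN rows, including the last row, whose next-day return is unknown
# (NaN > 0 evaluates to False, so it would otherwise be mislabeled as a down day)
df_stock = df_stock.dropna()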

# Feature columns
feature_cols = ['Return_Lag1', 'Return_Lag2', 'Volume_Lag1', 'MA_5', 'Volatility_5', 'RSI']
X = df_stock[feature_cols]
y = df_stock['Target']

# Temporal split
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# Scale features (fit on training only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

**Explanation:**

- Scaling is crucial: without it, features with larger ranges (e.g., volume) would dominate the distance calculations, leading to poor performance.
- We fit the scaler only on the training set to avoid data leakage.
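To see why unscaled volume dominates, the sketch below measures how much of the squared Euclidean distance between two rows comes from the volume column, before and after standardization. The four rows are made-up values, and the scaler is fitted on all of them only because this is a distance illustration rather than a train/test experiment.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [daily return in %, traded volume]
X_raw = np.array([
    [ 0.5, 120_000.0],
    [-1.2,  95_000.0],
    [ 2.0, 150_000.0],
    [-0.3,  80_000.0],
])

def volume_share(X):
    """Fraction of the squared distance between rows 0 and 1 that comes from volume."""
    diff_sq = (X[0] - X[1]) ** 2
    return diff_sq[1] / diff_sq.sum()

X_std = StandardScaler().fit_transform(X_raw)

print(f"Volume share of squared distance, raw:    {volume_share(X_raw):.6f}")
print(f"Volume share of squared distance, scaled: {volume_share(X_std):.6f}")
```

On the raw data, virtually the entire distance is due to volume, so the return column has no practical influence on an RBF kernel. After scaling, both features contribute on a comparable scale.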

### **24.2.2 Training an SVM Classifier**

We'll start with a linear SVM (linear kernel) as a baseline, then move to RBF.

```python
# Linear SVM
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train_scaled, y_train)

y_pred_linear = svm_linear.predict(X_test_scaled)
acc_linear = accuracy_score(y_test, y_pred_linear)
print(f"Linear SVM test accuracy: {acc_linear:.4f}")
```

**Explanation:**

- `C=1.0` is the default. Larger C tries to classify every training point correctly, which may lead to overfitting.
- The linear SVM is similar to logistic regression, but it optimizes a margin‑based hinge loss instead of the log loss. It can serve as a baseline.

### **24.2.3 RBF Kernel SVM**

The RBF kernel can model non‑linear boundaries. It has two important hyperparameters: `C` (regularization) and `gamma` (kernel coefficient). A small `gamma` gives each training point a wide radius of influence, which produces smoother decision boundaries; a large `gamma` confines each point's influence to its immediate neighborhood, so the boundary can wrap tightly around individual training points and overfit.

```python
# RBF SVM with default gamma
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_rbf.fit(X_train_scaled, y_train)

y_pred_rbf = svm_rbf.predict(X_test_scaled)
acc_rbf = accuracy_score(y_test, y_pred_rbf)
print(f"RBF SVM test accuracy: {acc_rbf:.4f}")
```

**Explanation:**

- `gamma='scale'` uses `1 / (n_features * X.var())` as the gamma value, a reasonable default. `gamma='auto'` uses `1 / n_features`.
- The RBF kernel often outperforms linear if the true decision boundary is non‑linear.
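The `gamma='scale'` formula can be verified by computing the value by hand and passing it explicitly. The sketch below uses synthetic data from `make_classification` as a stand-in for the training matrix (an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the training matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())   # variance over all entries of X
gamma_auto = 1.0 / X_demo.shape[1]
print(f"gamma='scale' -> {gamma_scale:.4f}")
print(f"gamma='auto'  -> {gamma_auto:.4f}")

# Passing the number explicitly should reproduce gamma='scale'
svm_named = SVC(kernel='rbf', gamma='scale').fit(X_demo, y_demo)
svm_value = SVC(kernel='rbf', gamma=gamma_scale).fit(X_demo, y_demo)
same = np.allclose(svm_named.decision_function(X_demo),
                   svm_value.decision_function(X_demo))
print(f"Identical decision values: {same}")
```

One consequence: after `StandardScaler`, every training column has unit variance, so `X.var()` is approximately 1 and `'scale'` and `'auto'` give nearly the same `gamma` on standardized training data.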

### **24.2.4 Hyperparameter Tuning with Grid Search**

To get the best performance, we must tune `C` and `gamma`. We'll use `GridSearchCV` with time‑series cross‑validation.

```python
from sklearn.model_selection import TimeSeriesSplit

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf']
}

# TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)

# Grid search
svm = SVC(random_state=42)
grid = GridSearchCV(svm, param_grid, cv=tscv, scoring='accuracy', verbose=1, n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation accuracy: {grid.best_score_:.4f}")

# Evaluate on test set
y_pred_best = grid.predict(X_test_scaled)
acc_best = accuracy_score(y_test, y_pred_best)
print(f"Test accuracy with best model: {acc_best:.4f}")
```

**Explanation:**

- We search over a range of `C` and `gamma` values. Useful values for both can span several orders of magnitude, so the grid is usually spaced logarithmically (powers of 10 here).
- `TimeSeriesSplit` ensures that the validation sets are temporally after the training folds, preventing look‑ahead.
- The best model is then evaluated on the held‑out test set.
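One subtlety: the grid search above runs on `X_train_scaled`, whose scaler was fitted on the entire training set, so each validation fold contributed to the mean and variance used to scale it. The effect is often small, but it can be avoided by placing the scaler inside a `Pipeline` so that it is re-fitted on the training part of every fold. The sketch below shows one way to do this on synthetic data; the names `pipe`, `param_grid_pipe`, and `tscv_demo` are illustrative, not part of the chapter's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled, time-ordered training data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),   # re-fitted on the training part of every fold
    ('svc', SVC(kernel='rbf')),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid_pipe = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.01, 0.1, 'scale'],
}

tscv_demo = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv_demo.split(X_demo):
    # Every validation index comes after every training index
    print(f"train: 0..{train_idx[-1]}  validate: {val_idx[0]}..{val_idx[-1]}")

grid_pipe = GridSearchCV(pipe, param_grid_pipe, cv=tscv_demo, scoring='accuracy')
grid_pipe.fit(X_demo, y_demo)   # pass unscaled data; the pipeline scales it
print(f"Best parameters: {grid_pipe.best_params_}")
```

With this setup, the fitted `grid_pipe` is applied to the unscaled test features, because the pipeline applies the scaler itself.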

### **24.2.5 Interpreting SVM Results**

SVM models are not as interpretable as linear models or decision trees. However, we can gain some insight by:

- Examining the support vectors (the points that define the margin).
- For linear SVM, the coefficients (weights) indicate feature importance, though they are affected by scaling.

```python
# Support vectors
print(f"Number of support vectors: {len(grid.best_estimator_.support_)}")
print(f"Support vector indices: {grid.best_estimator_.support_[:5]}")  # first five

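# Coefficients exist only for a linear kernel; the grid above searched only 'rbf',
# so this branch runs only if 'linear' is added to param_grid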
if grid.best_estimator_.kernel == 'linear':
    coef = grid.best_estimator_.coef_[0]
    imp_df = pd.DataFrame({'feature': feature_cols, 'coef': coef})
    print(imp_df.sort_values('coef', key=abs, ascending=False))
```

**Explanation:**

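- Support vectors are the training points that lie on or inside the margin (or are misclassified); they alone define the decision boundary. A large share of support vectors usually signals heavily overlapping classes or an overly complex boundary, so check it alongside the validation accuracy.
- For linear SVM, the magnitude of each coefficient indicates feature importance. Because the features were standardized, the coefficients are on a common scale and can be compared directly.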

---

## **24.3 SVM for Regression (SVR)**

Support Vector Regression (SVR) aims to find a function that approximates the target values while keeping errors within a margin (epsilon). Points with errors larger than epsilon are penalized.

The hyperparameters are:
- `epsilon`: width of the insensitive tube.
- `C`: regularization (trade‑off between flatness and tolerance of errors).
- Kernel parameters (e.g., `gamma` for RBF).
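The role of `epsilon` is easiest to see in the loss itself. The sketch below implements the epsilon-insensitive loss, `max(0, |y − f(x)| − epsilon)`, on a few made-up return values; the helper name `epsilon_insensitive_loss` is illustrative.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample SVR loss: zero inside the tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical next-day returns (in %) and predictions
y_true = np.array([0.50, -0.20, 1.00, 0.00])
y_pred = np.array([0.45, -0.50, 0.40, 0.05])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first and last predictions miss by 0.05, which is inside the tube, so their loss is zero. The other two miss by 0.30 and 0.60 and are charged only for the part beyond `epsilon` (about 0.2 and 0.5).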

### **24.3.1 Applying SVR to NEPSE Return Prediction**

We'll use the same features but now the target is the continuous next‑day return.

```python
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Target: next day's return (the last row has no next-day value and is dropped)
y_reg = df_stock['Return'].shift(-1).dropna()

# Keep only the feature rows that have a target
X_reg = X.loc[y_reg.index]

# Temporal split (recomputed because X_reg can be shorter than X)
split_idx_reg = int(len(X_reg) * 0.8)
X_train_reg, X_test_reg = X_reg.iloc[:split_idx_reg], X_reg.iloc[split_idx_reg:]
y_train_reg, y_test_reg = y_reg.iloc[:split_idx_reg], y_reg.iloc[split_idx_reg:]

# Scale
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

# SVR with RBF kernel (default)
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train_reg_scaled, y_train_reg)

y_pred_svr = svr.predict(X_test_reg_scaled)
rmse_svr = np.sqrt(mean_squared_error(y_test_reg, y_pred_svr))
print(f"SVR test RMSE: {rmse_svr:.4f}")
```

**Explanation:**

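- `epsilon=0.1` means prediction errors smaller than 0.1 (percentage points, since returns are expressed in percent) are not penalized. The right value is domain‑specific; for daily returns, a small epsilon (e.g., 0.1 to 0.5) is a reasonable starting range.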
- `C` and `gamma` should be tuned similarly to classification.

### **24.3.2 Tuning SVR Hyperparameters**

We'll use `GridSearchCV` again.

```python
param_grid_svr = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale'],
    'epsilon': [0.01, 0.1, 0.2, 0.5],
    'kernel': ['rbf']
}

svr_tune = SVR()
grid_svr = GridSearchCV(svr_tune, param_grid_svr, cv=tscv, scoring='neg_mean_squared_error', 
                        verbose=1, n_jobs=-1)
grid_svr.fit(X_train_reg_scaled, y_train_reg)

print(f"Best SVR params: {grid_svr.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_svr.best_score_):.4f}")

y_pred_svr_best = grid_svr.predict(X_test_reg_scaled)
rmse_svr_best = np.sqrt(mean_squared_error(y_test_reg, y_pred_svr_best))
print(f"SVR test RMSE (tuned): {rmse_svr_best:.4f}")
```

**Explanation:**

- scikit-learn scorers follow a higher‑is‑better convention, so MSE is exposed as `neg_mean_squared_error`. We negate `best_score_` before taking the square root to recover the RMSE.
- The best model is then evaluated on the test set.

---

## **24.4 Kernel Selection and Characteristics**

Choosing the right kernel depends on the data:

- **Linear kernel:** Use when data is approximately linearly separable, or as a baseline. Fast, interpretable.
- **Polynomial kernel:** Can model interactions but has more hyperparameters (degree, coef0). May overfit with high degree.
- **RBF kernel:** Most flexible; can approximate a wide range of non‑linear boundaries when `C` and `gamma` are tuned well. Generally a good default.

For NEPSE data, the relationship between features and returns is likely non‑linear, so RBF is a strong candidate. However, we can compare performance across kernels.

```python
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
results = {}

for kernel in kernels:
    if kernel == 'poly':
        svm = SVC(kernel=kernel, degree=2, C=1, random_state=42)  # default degree 3; try 2 for simplicity
    else:
        svm = SVC(kernel=kernel, C=1, random_state=42)
    svm.fit(X_train_scaled, y_train)
    y_pred = svm.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    results[kernel] = acc
    print(f"{kernel} kernel accuracy: {acc:.4f}")
```

**Explanation:**

- For polynomial, we set `degree=2` to avoid too much complexity. In practice, tuning degree is also needed.
- Sigmoid kernel may behave similarly to RBF for some parameters but is less common.
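- The results give a first indication of which kernel family is promising. Because they are measured on the test set, the final kernel choice should be made with time‑series cross‑validation on the training data so that the test set remains an unbiased check.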

---

## **24.5 Scaling and Preprocessing**

SVM requires careful scaling. Always:

- Scale features to zero mean and unit variance (StandardScaler) or to a [0,1] range (MinMaxScaler). StandardScaler is more common.
- Fit the scaler on the training set only.
- Apply the same transformation to test/validation sets.

If the data contains outliers, robust scaling (using median and quantiles) may be beneficial, but SVM with RBF is somewhat sensitive to outliers anyway.

```python
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
# Then train SVM...
```

---

## **24.6 Strengths and Limitations of SVM**

### **Strengths**

- **Effective in high‑dimensional spaces** – even when number of features exceeds samples (though careful with overfitting).
- **Memory efficient at prediction time** – the decision function depends only on the support vectors, not on all training points.
- **Versatile** through different kernel functions.
- **Good theoretical foundations** – maximum margin principle often leads to good generalization.

### **Limitations**

- **Not suitable for large datasets** – training time grows between O(n²) and O(n³) in the number of samples. For NEPSE with thousands of rows, it's fine, but for millions, it's slow.
- **Sensitive to feature scaling** – must scale carefully.
- **Difficult to interpret** – especially with non‑linear kernels.
- **Choice of kernel and hyperparameters** can be non‑intuitive; extensive tuning required.
- **No direct probability estimates** – though Platt scaling can be applied (SVC has `probability=True` option, but it's slower).
- **Can perform poorly on very noisy data** – if classes overlap heavily, SVM may not separate well.

---

## **24.7 Comparison with Other Models**

Let's compare SVM with logistic regression and random forest on the same classification task.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Logistic Regression (with scaling)
lr = LogisticRegression(random_state=42)
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
acc_lr = accuracy_score(y_test, y_pred_lr)

# Random Forest (no scaling needed)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)  # note: uses unscaled data
y_pred_rf = rf.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)

# Best SVM (from grid search)
acc_svm = acc_best

print("Model Comparison:")
print(f"Logistic Regression: {acc_lr:.4f}")
print(f"Random Forest:       {acc_rf:.4f}")
print(f"SVM (RBF, tuned):    {acc_svm:.4f}")
```

**Explanation:**

- Logistic regression is a linear baseline; SVM may outperform if non‑linearity is present.
- Random forest is non‑linear and often competitive; comparing helps decide which model family suits the data.

---

## **24.8 Practical Considerations for NEPSE**

### **24.8.1 Dealing with Imbalanced Classes**

If the market has more up days than down days (or vice versa), the classes are imbalanced. The `class_weight` parameter lets SVM compensate for this.

```python
svm_balanced = SVC(kernel='rbf', class_weight='balanced', random_state=42)
```

This assigns a higher penalty to misclassifications of the minority class.
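With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, and the penalty `C` for a class is multiplied by its weight. The sketch below reproduces those weights by hand for a made-up label vector and compares them with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 60 up days (1) and 40 down days (0)
y_demo = np.array([1] * 60 + [0] * 40)

classes = np.unique(y_demo)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
from_sklearn = compute_class_weight('balanced', classes=classes, y=y_demo)

for cls, w_manual, w_sk in zip(classes, manual, from_sklearn):
    print(f"class {cls}: manual weight={w_manual:.3f}, sklearn weight={w_sk:.3f}")
```

Here the minority class (down days) receives a weight of 1.25 and the majority class about 0.83, so errors on down days cost roughly 1.5 times as much.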

### **24.8.2 Probability Outputs**

If you need probability estimates (e.g., for position sizing), set `probability=True`. This fits an additional Platt scaling model using internal 5‑fold cross‑validation, which increases training time.

```python
svm_prob = SVC(kernel='rbf', probability=True, random_state=42)
svm_prob.fit(X_train_scaled, y_train)
probs = svm_prob.predict_proba(X_test_scaled)[:, 1]  # probability of class 1
```

### **24.8.3 Computational Efficiency**

For larger datasets, consider using `LinearSVC` (for linear kernel) which is more scalable. For non‑linear, you may need to sample data or use a different algorithm.
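As a rough illustration, the sketch below fits `LinearSVC` on a synthetic dataset and, as one possible option for the non-linear case, pairs it with scikit-learn's `Nystroem` kernel approximation. The Nystroem approach is not covered in this chapter; it is shown here only as an example of "a different algorithm", and the values of `gamma` and `n_components` are arbitrary assumptions that would need tuning.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for a larger dataset (not NEPSE data)
X_big, y_big = make_classification(n_samples=5000, n_features=6, random_state=0)
X_tr, X_te = X_big[:4000], X_big[4000:]
y_tr, y_te = y_big[:4000], y_big[4000:]

# Linear SVM: scales much better with sample count than SVC(kernel='linear')
linear_model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
linear_model.fit(X_tr, y_tr)

# Approximate RBF SVM: map to 100 Nystroem features, then fit a linear SVM
approx_model = make_pipeline(
    StandardScaler(),
    Nystroem(kernel='rbf', gamma=0.1, n_components=100, random_state=42),
    LinearSVC(C=1.0, max_iter=5000),
)
approx_model.fit(X_tr, y_tr)

acc_linear_demo = linear_model.score(X_te, y_te)
acc_approx_demo = approx_model.score(X_te, y_te)
print(f"LinearSVC accuracy:            {acc_linear_demo:.4f}")
print(f"Nystroem + LinearSVC accuracy: {acc_approx_demo:.4f}")
```

Both pipelines include the scaler, so they can be passed directly to `GridSearchCV` with `TimeSeriesSplit` without scaling the data beforehand.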

### **24.8.4 Feature Engineering**

SVM with RBF can capture non‑linear interactions automatically, so extensive feature engineering (like polynomial features) is less critical. However, domain‑specific features (e.g., RSI, moving averages) still help because they encode domain knowledge.

---

## **24.9 Chapter Summary**

In this chapter, we explored Support Vector Machines for both classification and regression, using the NEPSE dataset as a concrete example.

- **SVM fundamentals:** maximum margin, support vectors, and the kernel trick.
- **SVC for direction prediction:** we trained linear and RBF SVMs, tuned hyperparameters (`C`, `gamma`) with time‑series cross‑validation, and evaluated on a test set.
- **SVR for return forecasting:** we applied SVR to continuous targets, tuning `C`, `gamma`, and `epsilon`.
- **Kernel selection:** RBF is a flexible default; we compared it against linear, polynomial, and sigmoid kernels.
- **Scaling is mandatory** – we used `StandardScaler`.
- **Strengths:** effective in high dimensions, versatile, good generalization.
- **Limitations:** slow on large data, sensitive to scaling, hard to interpret.
- **Comparison with other models** helps contextualize SVM performance.

### **Practical Takeaways for the NEPSE System:**

- SVM can be a powerful tool for predicting stock direction, especially if the decision boundary is non‑linear.
- Always scale features and tune hyperparameters using time‑series CV.
- For large datasets, consider linear SVM or sampling; for small to medium, RBF is fine.
- SVM does not provide native feature importance, but for a linear SVM the coefficients can be examined.
- Combine SVM with domain‑specific features for best results.

In the next chapter, **Chapter 25: Neural Network Fundamentals**, we will dive into the basics of neural networks, building the foundation for deep learning models in time‑series forecasting.

---

**End of Chapter 24**