# Lab 4: Support Vector Machine (SVM) Regression

This notebook demonstrates the application of Support Vector Regression (SVR) with different kernel functions on the SDSS (Sloan Digital Sky Survey) dataset.

## Objective
- Implement SVR with Linear, Polynomial, and RBF kernels
- Compare the performance of different kernels
- Predict redshift values based on astronomical features

## 1. Import Required Libraries

First, we import all necessary libraries for data manipulation, visualization, and machine learning.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline

## 2. Load and Explore the Dataset

We load the SDSS dataset, which contains astronomical measurements including:
- **objid**: Object Identifier
- **ra, dec**: Right Ascension and Declination (coordinates)
- **u, g, r, i, z**: Photometric magnitudes in the five SDSS filter bands
- **redshift**: Target variable (cosmological redshift)

In [None]:
data_path = r"F:\Sem 5\Machine Leaning\archive (1)\Skyserver_SQL2_27_2018 6_51_39 PM.csv"
df = pd.read_csv(data_path)

df.head()

## 3. Exploratory Data Analysis

Let's examine the dataset structure, check for missing values, and visualize the distribution of our target variable (redshift).

In [None]:
df.info()  # info() prints its summary directly and returns None, so no print() is needed
print(df.isnull().sum())
print(df.describe())

sns.histplot(df['redshift'], bins=50, kde=True)
plt.title("Distribution of Redshift (Target)")
plt.show()

## 4. Data Preprocessing

We perform the following preprocessing steps:
1. Remove any rows with missing values
2. Drop unnecessary identifier columns
3. Select relevant numeric features for modeling

In [None]:
df = df.dropna()

df = df.drop(columns=['objid','specobjid','plate','mjd','fiberid'])

numeric_cols = ['ra','dec','u','g','r','i','z']

## 5. Train-Test Split and Feature Scaling

We split the data into training (80%) and testing (20%) sets, then standardize the features using StandardScaler. The scaler is fit on the training set only, so no test-set statistics leak into training.

**Why scaling is important for SVM:**
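- SVR kernels are built from dot products or distances between samples, so the fit depends on feature scales
- Features with larger numeric ranges can dominate those computations
- Standardization puts every feature on a comparable scale (zero mean, unit variance)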
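To make the effect of scale concrete, the sketch below compares Euclidean distances before and after standardization. It uses two made-up features, one in degrees (like `ra`) and one in magnitudes (like `u`). The array `X_demo` and its values are illustrative assumptions, not rows from the SDSS file.

```python
import numpy as np

# Hypothetical values: column 0 mimics a coordinate in degrees,
# column 1 mimics a magnitude. Not taken from the SDSS file.
X_demo = np.array([
    [150.0, 18.0],
    [210.0, 18.1],
    [151.0, 22.0],
])

def standardize(X, mean, std):
    """Center and scale columns using precomputed statistics."""
    return (X - mean) / std

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
X_scaled = standardize(X_demo, mean, std)

def dist(a, b):
    """Euclidean distance between two samples."""
    return float(np.linalg.norm(a - b))

# Raw distances are dominated by column 0: a 60-degree gap swamps a 4-magnitude gap.
raw_01, raw_02 = dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2])
# After scaling, both features contribute on a comparable scale.
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

In the raw data, sample 0 looks far closer to sample 2 than to sample 1, purely because of units. After scaling, the two distances are of similar size.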

In [None]:
X = df[numeric_cols]  # feature matrix
y = df['redshift']    # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## 6. Linear Kernel SVR

The linear kernel is the simplest kernel function. It works well when the target is approximately a linear function of the features.

**Formula:** K(x, x') = x · x'

**Parameters:**
- C=1.0: Regularization parameter (controls the trade-off between keeping the regression function flat and penalizing training errors that exceed the epsilon margin)
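In regression, `C` scales the penalty on training errors, and SVR only counts errors that fall outside a tube of half-width epsilon around the prediction (scikit-learn's `SVR` defaults to `epsilon=0.1`). The sketch below shows that epsilon-insensitive loss. The function name `epsilon_insensitive_loss` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Made-up targets and predictions for illustration.
y_true = np.array([0.10, 0.50, 1.00])
y_pred = np.array([0.15, 0.80, 1.00])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)

# In the SVR objective, C multiplies the sum of these losses,
# so a larger C makes errors outside the tube more expensive.
C = 1.0
print("Penalty term:", C * losses.sum())
```

Only the second sample lies outside the tube, so it is the only one that contributes to the penalty.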

In [10]:
linear_svr = SVR(kernel='linear', C=1.0)
linear_svr.fit(X_train, y_train)

y_pred_linear = linear_svr.predict(X_test)

print("Linear SVR R2 Score:", r2_score(y_test, y_pred_linear))
print("Linear SVR MSE:", mean_squared_error(y_test, y_pred_linear))

## 7. Polynomial Kernel SVR

The Polynomial kernel can model non-linear relationships by computing polynomial combinations of features.

**Formula:** K(x, x') = (γ (x · x') + r)^d

**Parameters:**
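- degree=3: Degree of the polynomial (d in the formula)
- C=1.0: Regularization parameter
- γ and r are left at their scikit-learn defaults (`gamma='scale'`, `coef0=0.0`)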
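A minimal NumPy sketch of the polynomial kernel formula above. The parameter names `gamma`, `coef0`, and `degree` mirror the scikit-learn `SVR` arguments; the helper `polynomial_kernel` and the input vectors are made up for illustration.

```python
import numpy as np

def polynomial_kernel(x, x_prime, gamma=1.0, coef0=0.0, degree=3):
    """K(x, x') = (gamma * <x, x'> + coef0) ** degree."""
    return (gamma * np.dot(x, x_prime) + coef0) ** degree

# Illustrative vectors with dot product 1*0.5 + 2*(-1) = -1.5.
x = np.array([1.0, 2.0])
x_prime = np.array([0.5, -1.0])

k_default = polynomial_kernel(x, x_prime)             # (-1.5) ** 3
k_shifted = polynomial_kernel(x, x_prime, coef0=1.0)  # (-1.5 + 1.0) ** 3
print(k_default, k_shifted)
```

With `coef0=0` the expanded kernel contains only terms of degree `d`; a non-zero `coef0` mixes in lower-order terms, which is one reason it is worth tuning.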

In [11]:
poly_svr = SVR(kernel='poly', degree=3, C=1.0)
poly_svr.fit(X_train, y_train)

y_pred_poly = poly_svr.predict(X_test)

print("Polynomial SVR R2 Score:", r2_score(y_test, y_pred_poly))
print("Polynomial SVR MSE:", mean_squared_error(y_test, y_pred_poly))

## 8. RBF (Radial Basis Function) Kernel SVR

The RBF kernel is the default kernel in scikit-learn's `SVR` and a common first choice for non-linear relationships.

**Formula:** K(x, x') = exp(-γ||x - x'||²)

**Parameters:**
- C=1.0: Regularization parameter
- gamma=0.1: Kernel coefficient (controls the influence of individual training examples)
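The sketch below evaluates the RBF formula directly to show how `gamma` controls the reach of each training example. The helper `rbf_kernel` and the points are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=0.1):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    diff = np.asarray(x) - np.asarray(x_prime)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = np.array([0.0, 0.0])
near = np.array([1.0, 0.0])  # squared distance 1 from x
far = np.array([3.0, 4.0])   # squared distance 25 from x

# A larger gamma makes similarity fall off faster with distance.
for gamma in (0.1, 1.0):
    print(f"gamma={gamma}: near={rbf_kernel(x, near, gamma):.4f}, "
          f"far={rbf_kernel(x, far, gamma):.2e}")
```

A small `gamma` lets distant points still influence a prediction, giving a smoother function; a large `gamma` makes each point matter only in its immediate neighborhood.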

In [12]:
rbf_svr = SVR(kernel='rbf', C=1.0, gamma=0.1)
rbf_svr.fit(X_train, y_train)

y_pred_rbf = rbf_svr.predict(X_test)

print("RBF SVR R2 Score:", r2_score(y_test, y_pred_rbf))
print("RBF SVR MSE:", mean_squared_error(y_test, y_pred_rbf))

## 9. Kernel Comparison

Let's visualize the R² scores of all three kernels to compare their performance.

**R² Score Interpretation:**
- R² = 1: Perfect prediction
- R² = 0: Model performs as well as predicting the mean
- R² < 0: Model performs worse than predicting the mean
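The three cases above follow from the definition R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up values; `r2_manual` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variance around the mean
    return 1.0 - ss_res / ss_tot

# Made-up targets for illustration.
y_true = np.array([0.1, 0.2, 0.3, 0.4])

r2_perfect = r2_manual(y_true, y_true)                            # exact predictions
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
r2_bad = r2_manual(y_true, y_true[::-1])                          # predictions in reverse order

print(r2_perfect, r2_mean, r2_bad)
```

The reversed predictions have a larger squared error than the constant mean prediction, which is what pushes R² below zero.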

In [13]:
r2_scores = {
    'Linear': r2_score(y_test, y_pred_linear),
    'Polynomial': r2_score(y_test, y_pred_poly),
    'RBF': r2_score(y_test, y_pred_rbf)
}

plt.figure(figsize=(8,5))
sns.barplot(x=list(r2_scores.keys()), y=list(r2_scores.values()))
plt.title("SVR Kernel Comparison (R2 Score)")
plt.ylabel("R2 Score")
plt.show()

## 10. Actual vs Predicted Values Visualization

This scatter plot helps us visualize how well our best model (RBF) predicts the actual values.

Points closer to the diagonal line indicate better predictions.

In [14]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rbf, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Redshift')
plt.ylabel('Predicted Redshift')
plt.title('RBF SVR: Actual vs Predicted Redshift Values')
plt.tight_layout()
plt.show()

## 11. Conclusions

Based on the results:

1. **RBF Kernel** performed best, with the highest R² score (~0.65)
2. **Linear Kernel** performed weakly (R² of ~0.20)
3. **Polynomial Kernel** performed poorly (negative R² score, worse than predicting the mean)

**Key Takeaways:**
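- The gap between the RBF and linear scores suggests that the relationship between the features and redshift is non-linear
- The RBF kernel is the most suitable of the three kernels tested on this dataset
- The polynomial kernel may need hyperparameter tuning (for example `degree`, `coef0`, or `C`) to improve performance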
- Feature engineering and hyperparameter optimization could further improve results
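One possible way to approach the hyperparameter optimization mentioned above is a cross-validated grid search. This is a sketch under stated assumptions: the data is synthetic (`X_demo`, `y_demo` are stand-ins, not the SDSS features), and the grid values are arbitrary starting points rather than recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the features and redshift; not real survey data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=200)

# Putting the scaler inside the pipeline refits it on each training fold,
# so validation folds never influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Hypothetical grid; sensible ranges depend on the data.
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__gamma": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated R2:", search.best_score_)
```

To try this on the notebook's data, the unscaled `X` and `y` would be split first and the search fit on the training portion only, keeping the test set for a final evaluation.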