### Support Vector Machine (SVM) Classifier

1. **Model Definition**: We use an SVM classifier with a linear kernel, a common choice for high-dimensional data.
2. **Training**: The classifier is trained on a stratified 50,000-row sample of the training data.
3. **Prediction**: The trained model predicts labels for the validation set.
4. **Evaluation**: The classification report and accuracy metrics are displayed to analyze the SVM model's performance.
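
The stratified sampling used in the training step can be illustrated on toy data. This is a minimal sketch, not the notebook's dataset: the label array `y`, the feature matrix `X`, and the subset size of 200 are hypothetical stand-ins chosen so the class ratios are easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 15% class 1, 5% class 2
rng = np.random.default_rng(42)
y = np.repeat([0, 1, 2], [800, 150, 50])
X = rng.normal(size=(len(y), 4))

# Keep 200 rows; stratify=y preserves the class ratios in the subset
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=200, random_state=42, stratify=y
)

full_ratio = np.bincount(y) / len(y)
sub_ratio = np.bincount(y_sub) / len(y_sub)
print("full:  ", full_ratio)
print("subset:", sub_ratio)
```

Without `stratify`, a small sample could under-represent the rare class, which would matter for the per-class metrics reported later.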


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os

# Define dataset filenames
train_data_file = "split_data/train_1_split.csv"
val_data_file = "split_data/val_1_split.csv"

# Construct dynamic paths based on the current working directory
current_dir = os.getcwd()
train_data_path = os.path.join(current_dir, "data/features", train_data_file)
val_data_path = os.path.join(current_dir, "data/features", val_data_file)

# Optional check if paths exist
if not os.path.isfile(train_data_path):
    print(f"Warning: {train_data_path} not found.")
if not os.path.isfile(val_data_path):
    print(f"Warning: {val_data_path} not found.")


# Load the training and validation datasets
train_data = pd.read_csv(train_data_path)
val_data = pd.read_csv(val_data_path)

# Separate features and labels (assumes the first two columns are non-feature columns, including 'label')
X_train, y_train = train_data.iloc[:, 2:], train_data['label']
X_val, y_val = val_data.iloc[:, 2:], val_data['label']

# Sample a stratified subset for training (requires more than 50,000 training rows)
X_train_sample, _, y_train_sample, _ = train_test_split(
    X_train, y_train, train_size=50000, random_state=42, stratify=y_train
)


## SVM Model: Training, Prediction, and Evaluation

In this cell, we implement a Support Vector Machine (SVM) classifier with a linear kernel and evaluate its performance on our validation data. The steps involved are as follows:

1. **Import Required Libraries**:
   - `SVC` from `sklearn.svm` to create the SVM model.
   - `classification_report` and `accuracy_score` from `sklearn.metrics` to evaluate the model.

2. **Define the SVM Classifier**:
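   - `svm_clf` is set up with `kernel='linear'`, which fits a linear decision boundary and works best when the classes are approximately linearly separable.
   - `random_state=42` fixes the seed for the data shuffling behind the probability estimates, keeping results reproducible. It has no effect when `probability=False`.
   - `probability=True` enables `predict_proba` through Platt scaling with internal cross-validation. This slows training noticeably, so it is only worth enabling if probability estimates are needed for further analysis.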

3. **Train the Model**:
   - The classifier is trained on the sampled training data (`X_train_sample` and `y_train_sample`), allowing it to learn the relationship between features and labels.

4. **Make Predictions**:
   - Using the trained classifier, predictions are made on the validation dataset (`X_val`). The result (`svm_y_pred`) holds one predicted label per validation sample.

5. **Evaluate the Model’s Performance**:
   - We compute the model’s accuracy, the fraction of validation samples predicted correctly.
   - The classification report provides detailed metrics like precision, recall, and F1-score for each class, helping us understand the model's performance across different classes.

By analyzing these metrics, we can gauge how well the SVM model performs and identify areas for potential improvement.
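
To make the effect of `probability=True` concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the name `clf` are assumptions for illustration only. It shows that `decision_function` returns unbounded scores while `predict_proba` returns per-class probability estimates that sum to one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the notebook's features
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)

clf = SVC(kernel="linear", probability=True, random_state=42)
clf.fit(X, y)

scores = clf.decision_function(X)  # unbounded one-vs-rest scores
proba = clf.predict_proba(X)       # Platt-scaled probability estimates

print(scores.shape, proba.shape)
print(proba.sum(axis=1)[:3])
```

`decision_function` is available even with `probability=False`, so the extra training cost is only justified when calibrated probabilities are actually used.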


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Define the SVM classifier with a linear kernel and probability estimates enabled
svm_clf = SVC(kernel='linear', random_state=42, probability=True)

# Train the classifier on the sampled training data
svm_clf.fit(X_train_sample, y_train_sample)

# Make predictions on the validation data
svm_y_pred = svm_clf.predict(X_val)

# Evaluate the model
print("Support Vector Machine (SVM) Classifier Accuracy:", accuracy_score(y_val, svm_y_pred))
print("\nClassification Report:\n", classification_report(y_val, svm_y_pred))


### Model Evaluation Metrics

In this cell, we calculate and display a set of summary metrics to evaluate the SVM model:

1. **Accuracy**: The overall percentage of correct predictions.
2. **Precision (Weighted)**: The average of per-class precision, weighted by each class's support (its number of true samples), so larger classes contribute more.
3. **Recall (Weighted)**: The support-weighted average of per-class recall.
4. **F1 Score (Weighted)**: The support-weighted average of per-class F1 scores, balancing precision and recall.
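
The weighting can be checked by hand. The sketch below uses a small hypothetical label set (`y_true` and `y_pred` are made up, not taken from the validation data) and recomputes the weighted scores from the confusion matrix. It also shows that support-weighted recall equals plain accuracy.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Hypothetical labels for a small 3-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)
support = cm.sum(axis=1)                              # true samples per class
per_class_recall = cm.diagonal() / support            # correct / true count
per_class_precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted count

# Support-weighted averages, matching average='weighted'
weighted_precision = np.average(per_class_precision, weights=support)
weighted_recall = np.average(per_class_recall, weights=support)

print(weighted_precision, precision_score(y_true, y_pred, average="weighted"))
print(weighted_recall, recall_score(y_true, y_pred, average="weighted"))
print(accuracy_score(y_true, y_pred))
```

Because weighted recall and accuracy coincide, reporting both adds no new information. Macro averaging (`average='macro'`) is the usual alternative when minority classes should count equally.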


In [None]:
# Import necessary libraries
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Step 1: Calculate Evaluation Metrics
# Precision, Recall, F1-Score, and Accuracy
precision = precision_score(y_val, svm_y_pred, average='weighted')
recall = recall_score(y_val, svm_y_pred, average='weighted')
f1 = f1_score(y_val, svm_y_pred, average='weighted')
accuracy = accuracy_score(y_val, svm_y_pred)

# Step 2: Print the metrics
print("SVM Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


### SVM Model Performance Visualizations

This cell includes several visualizations to evaluate the SVM classifier's performance:

1. **Confusion Matrix**:
   - A heatmap of the confusion matrix showing the distribution of true and false predictions for each class.
   - Helps in identifying where the classifier performs well and where it misclassifies.

2. **ROC Curve and AUC for Each Class**:
   - Displays ROC (Receiver Operating Characteristic) curves for each class, illustrating the trade-off between true positive rate and false positive rate.
   - AUC (Area Under the Curve) values provide a measure of separability for each class, with higher AUC indicating better class distinction.

3. **Class-wise Accuracy**:
   - A bar chart showing the accuracy per class, calculated as the number of correctly predicted samples of a class divided by the number of true samples of that class (that is, per-class recall).
   - Useful for identifying classes where the model has high or low accuracy.
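
The one-vs-rest ROC computation can be sketched without plotting. This example runs on synthetic data (the `make_classification` dataset and the names `clf`, `X_te`, `y_te` are assumptions, not the notebook's variables). It computes one AUC per class by pairing each column of the binarized labels with the matching column of decision scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the notebook's features
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = SVC(kernel="linear", random_state=42).fit(X_tr, y_tr)

# One indicator column per class, in the same order as clf.classes_
y_te_bin = label_binarize(y_te, classes=clf.classes_)
scores = clf.decision_function(X_te)  # (n_samples, n_classes) for 3+ classes

roc_auc = {}
for i, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_te_bin[:, i], scores[:, i])
    roc_auc[label] = auc(fpr, tpr)

for label, value in roc_auc.items():
    print(f"class {label}: AUC = {value:.3f}")
```

Note that this column-wise pairing only works with three or more classes. With two classes, `decision_function` returns a 1-D array and `label_binarize` returns a single column, so the loop would need a special case.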


In [None]:
# Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

# Step 1: Confusion Matrix
conf_matrix = confusion_matrix(y_val, svm_y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix for SVM Classifier")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Step 2: ROC Curve and AUC for each class (One-vs-Rest)
# Binarize labels in the classifier's class order so columns align with decision_function
# Note: assumes three or more classes; with two classes decision_function returns a 1-D array
y_val_binary = label_binarize(y_val, classes=svm_clf.classes_)
svm_y_pred_proba = svm_clf.decision_function(X_val)  # Decision scores, not probabilities

# Plot ROC curve for each class
plt.figure(figsize=(10, 7))
for i, class_label in enumerate(svm_clf.classes_):
    fpr, tpr, _ = roc_curve(y_val_binary[:, i], svm_y_pred_proba[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"Class {class_label} (AUC = {roc_auc:.2f})")

plt.plot([0, 1], [0, 1], "k--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("SVM Classifier ROC Curve (One-vs-Rest)")
plt.legend(loc="best")
plt.show()

# Step 3: Accuracy by Class from Confusion Matrix
class_accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
plt.figure(figsize=(12, 6))
plt.bar(range(len(class_accuracies)), class_accuracies, color="skyblue")
plt.title("Class-wise Accuracy for SVM Classifier")
plt.xlabel("Class")
plt.ylabel("Accuracy")
plt.xticks(range(len(class_accuracies)))
plt.show()
