### Support Vector Machine (SVM) Binary Classification

This notebook demonstrates the implementation of **Support Vector Machine (SVM)** for binary classification using the **Pima Indian Diabetes Prediction Dataset**. SVM is a robust supervised learning algorithm that excels at classification tasks, particularly when dealing with high-dimensional and non-linearly separable data.

---

#### 1. **What is SVM?**

**Support Vector Machine (SVM)** is a powerful machine learning algorithm designed to classify data by finding the optimal decision boundary that separates classes. The key idea of SVM is to maximize the **margin**, which is the distance between the decision boundary (hyperplane) and the nearest data points from each class, called **support vectors**.

Key Features of SVM:
- **Maximizing Margin**: Ensures better generalization by finding the hyperplane that best separates the classes.
- **Kernel Trick**: Maps data into higher dimensions to enable linear separation of non-linear data. Common kernels include:
  - `Linear`: Works when the data is linearly separable.
  - `RBF (Radial Basis Function)`: Useful for non-linear data and is widely used.
  - `Polynomial`: Fits non-linear data with polynomial relationships.
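- **Regularization Parameter (C)**: Controls the trade-off between maximizing the margin and minimizing classification errors. A larger `C` penalizes misclassified training points more heavily (narrower margin, higher risk of overfitting); a smaller `C` tolerates more errors in exchange for a wider margin.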
- **Gamma**: Defines how far the influence of a single data point reaches, particularly in RBF and polynomial kernels.

SVM is especially effective in scenarios with smaller datasets and a clear margin of separation between classes.
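
To make the role of `gamma` concrete, here is a minimal sketch of the RBF kernel, $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, written with NumPy only. The helper name `rbf_kernel` and the two toy points are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two 1-D points."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = [0.0, 0.0]
z = [1.0, 1.0]  # toy points; squared distance between them is 2

# A larger gamma makes the similarity decay faster with distance,
# so each training point influences a smaller neighbourhood.
for gamma in (0.01, 0.1, 1.0):
    print(gamma, round(rbf_kernel(x, z, gamma), 4))
```

With a small `gamma` the two points still look similar to the kernel; with a large `gamma` they look almost unrelated, which is why large values tend to produce more wiggly decision boundaries.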

---

#### 2. **Steps Implemented in this Notebook**

1. **Dataset Loading and Exploration**:
   - The **Pima Indian Diabetes Prediction Dataset** is loaded to predict diabetes based on features like glucose level, BMI, and age.
   - The dataset consists of two classes: `0` (No Diabetes) and `1` (Diabetes).

2. **Data Preprocessing**:
   - **Feature scaling**: Features are standardized (zero mean, unit variance), because SVM is sensitive to the range of feature values.
   - **Invalid values**: Zeros that are physiologically impossible (for example, a BMI of 0) are replaced with the column median computed on the training set.

3. **Train-Test Split**:
   - The dataset is split into training and testing sets in a 90:10 ratio (`test_size=0.1`), stratified on `Outcome` so that both sets keep the same class balance.

4. **Model Training**:
   - An SVM model is trained using the **RBF kernel** to handle non-linear decision boundaries.
   - Hyperparameters (`C`, `gamma`, and `kernel`) are tuned with a grid search using 3-fold cross-validation.

5. **Model Evaluation**:
   - Performance metrics such as **accuracy, precision, recall, and F1-score** are used to evaluate the model.
   - A confusion matrix is generated to visualize the classification results, including true positives, false positives, true negatives, and false negatives.

6. **Insights and Visualization**:
   - Analyzes the influence of hyperparameters and kernels on the classification task.
   - Visualizes the decision boundaries (for reduced feature dimensions) to understand how SVM separates the two classes.
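
As a rough sketch, the scaling and training steps listed above can also be chained with scikit-learn's `make_pipeline`, so that the scaler is always fitted on training data only. This is an alternative arrangement, not how the cells below are written (they call `StandardScaler` and `SVC` separately), and the synthetic data from `make_classification` is a stand-in assumption for the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the diabetes dataset): 8 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# The pipeline fits the scaler on X_tr only, then trains an RBF-kernel SVC.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to X_te before predicting.
print('test accuracy:', pipe.score(X_te, y_te))
```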

---

#### 3. **Why Use SVM?**

- SVM is ideal for binary classification tasks where the margin of separation is critical.
- Its flexibility to handle non-linear data using kernels makes it a versatile choice.
- SVM is less prone to overfitting, especially when properly tuned, and works well even with a smaller number of samples.

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

SEED = 42

## 1. Data Collection

In [None]:
!wget https://raw.githubusercontent.com/devdio/flyai_datasets/main/diabetes.csv

In [None]:
path = 'diabetes.csv'
diabetes = pd.read_csv(path)
diabetes.shape

In [None]:
diabetes.head()

In [None]:
df = diabetes.copy()
df.info()  # No missing values; all columns are numeric, so no encoding is needed

In [None]:
df.describe().T

### Categorical variables

In [None]:
set(df.columns)

In [None]:
df['Outcome'].value_counts()

In [None]:
sns.countplot(data=df, x='Outcome')

### Continuous variables

In [None]:
tmp = df['Pregnancies'].sort_values(ascending=False)
tmp = tmp.reset_index()
tmp.head()

In [None]:
sns.barplot(x=tmp.index, y = tmp['Pregnancies'])

In [None]:
df.hist()

### Missing data

In [None]:
df.isna().sum()

### Duplication

In [None]:
df.duplicated().sum()

### Outliers

In [None]:
# Draw boxplot
df.boxplot(figsize=(10,10))

In [None]:
df.describe().T

## 2. Split into train and test data

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=SEED, stratify=df['Outcome'])
train.shape, test.shape

In [None]:
train['Outcome'].value_counts()
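
The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical toy frame (`toy`, not the diabetes data) with a 65/35 class balance and compares the positive-class share before and after the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 130 zeros and 70 ones.
toy = pd.DataFrame({'x': range(200), 'Outcome': [0] * 130 + [1] * 70})

tr, te = train_test_split(
    toy, test_size=0.1, random_state=42, stratify=toy['Outcome']
)

# With stratify, both parts keep approximately the original share of class 1.
print(toy['Outcome'].mean(), tr['Outcome'].mean(), te['Outcome'].mean())
```

Without `stratify`, a small test set can end up with a noticeably different class balance, which makes metrics such as recall harder to compare.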

In [None]:
train.head()

### Separate features (X) and target (y)

In [None]:
X_train = train.drop('Outcome', axis=1)
y_train = train['Outcome']

X_train.shape, y_train.shape

## 3. Replace invalid zero values with the median

In [None]:
# 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'

In [None]:
median_list = []

col_list = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in col_list:
  med = X_train[col].median()
  X_train.loc[X_train[col] == 0, col] = med
  median_list.append(med)

In [None]:
# Confirm that the imputed columns no longer have a minimum of 0 (Pregnancies can legitimately be 0).
X_train.describe().T
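
Note that the loop above computes each median over all rows, including the zeros that are about to be replaced, which pulls the median down for columns with many zeros. One possible variant, sketched here under that assumption, takes the median of the non-zero training values only. The helper `impute_zeros_with_median` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def impute_zeros_with_median(train_df, other_df, cols):
    """Replace zeros in `cols` with the median of the non-zero training values."""
    train_out, other_out = train_df.copy(), other_df.copy()
    medians = {}
    for col in cols:
        # Median is learned from the training data only, ignoring zeros.
        medians[col] = train_out.loc[train_out[col] != 0, col].median()
        # Cast to float so a fractional median fits without dtype problems.
        train_out[col] = train_out[col].astype(float).replace(0, medians[col])
        other_out[col] = other_out[col].astype(float).replace(0, medians[col])
    return train_out, other_out, medians

# Toy data: most training values are 0, so a median over all rows would be 0.
toy_train = pd.DataFrame({'Insulin': [0, 0, 0, 100, 200]})
toy_test = pd.DataFrame({'Insulin': [0, 50]})

tr, te, med = impute_zeros_with_median(toy_train, toy_test, ['Insulin'])
print(med)
print(tr['Insulin'].tolist(), te['Insulin'].tolist())
```

Returning the medians as a dictionary keyed by column name also avoids relying on list positions when the same values are reused for the test set.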

### Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_s = ss.fit_transform(X_train)
X_train_s  # fit_transform returns a NumPy array, not a DataFrame

In [None]:
print(ss.mean_)  # Per-column mean learned from the training set
print(ss.var_)   # Per-column variance learned from the training set
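
Standardization itself is just $z = (x - \mu) / \sigma$ applied per column. The following NumPy-only sketch reproduces it on a toy array (an assumption for illustration), using the population variance (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Toy data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

mean = X.mean(axis=0)
var = X.var(axis=0)              # ddof=0 (population variance)
X_s = (X - mean) / np.sqrt(var)  # column-wise standardization

print(X_s.mean(axis=0))  # approximately 0 for each column
print(X_s.std(axis=0))   # approximately 1 for each column
```

For the test set, the same `mean` and `var` from the training data are reused, which is what `ss.transform` does.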

In [None]:
y_train_e = y_train.to_numpy()
y_train_e

In [None]:
print(X_train_s.shape, y_train_e.shape)
print(type(X_train_s), type(y_train_e))

## 4. Model Training

In [None]:
from sklearn.svm import SVC

clf = SVC(random_state=SEED)
clf.fit(X_train_s, y_train_e)

## 5. Validation

In [None]:
X_test = test.drop('Outcome', axis=1)
y_test = test['Outcome']

X_test.shape, y_test.shape

In [None]:
# Data preprocessing (test): reuse the medians computed on the training set
col_list = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for i, col in enumerate(col_list):
  X_test.loc[X_test[col] == 0, col] = median_list[i]

In [None]:
X_test_s = ss.transform(X_test)
X_test_s

In [None]:
y_test_e = y_test.to_numpy()
y_test_e

In [None]:
print(X_test_s.shape, y_test_e.shape)
print(type(X_test_s), type(y_test_e))

In [None]:
y_pred = clf.predict(X_test_s)
y_pred

In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix

# Define a function to print classification metrics and display a confusion matrix heatmap
def print_metrics(y_true, y_pred, ave='binary'):
    # Use the y_true argument rather than a global variable, so the function works for any label array
    print('accuracy:', accuracy_score(y_true, y_pred))
    print('recall:', recall_score(y_true, y_pred, average=ave))
    print('precision:', precision_score(y_true, y_pred, average=ave))
    print('f1 :', f1_score(y_true, y_pred, average=ave))

    # Generate and display the confusion matrix as a heatmap
    clm = confusion_matrix(y_true, y_pred)
    s = sns.heatmap(clm, annot=True, fmt='d', cbar=False)
    s.set(xlabel='Predicted', ylabel='Actual')  # Set axis labels
    plt.show()


In [None]:
print_metrics(y_test_e, y_pred)
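
The four scores printed by `print_metrics` follow directly from the confusion-matrix counts. As a small worked example on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not model output), they can be computed by hand:

```python
import numpy as np

# Made-up labels for illustration only.
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 1, 0, 0, 1, 0, 1])

tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # true positives
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))  # true negatives
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # false positives
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```

For a diabetes screen, recall is often the score to watch, since a false negative (a missed case) is usually costlier than a false positive.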

## 6. Model tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define a parameter grid for hyperparameter tuning
prams_grid = {
    'C': [0.01, 0.02, 0.05, 0.1, 0.5, 1, 10, 100],  # Regularization parameter
    'gamma': [1, 0.1, 0.01, 0.001],  # Kernel coefficient
    'kernel': ['rbf', 'poly']  # Types of kernel functions
}

# Initialize the SVC model with a fixed random seed for reproducibility
clf = SVC(random_state=SEED)

# Set up GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=clf,               # Model to be optimized
    param_grid=prams_grid,       # Parameter grid to search
    cv=3,                        # 3-fold cross-validation
    n_jobs=-1,                   # Use all available CPU cores for parallel processing
    refit=True,                  # Refit the model with the best parameters on the entire training data
    verbose=2,                   # Increase verbosity for progress updates
    return_train_score=True      # Include training scores in the results
)

# Perform grid search and fit the model on the training data
grid_search.fit(X_train_s, y_train_e)
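
To get a feel for the cost of this search, the number of model fits can be counted from the grid itself. The sketch below copies the grid values from the cell above and uses only the standard library; the names `param_grid`, `cv_folds`, and `candidates` are illustrative.

```python
from itertools import product

# Same values as the grid defined above.
param_grid = {
    'C': [0.01, 0.02, 0.05, 0.1, 0.5, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'poly'],
}
cv_folds = 3

# Every combination of C, gamma, and kernel is one candidate model.
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds  # each candidate is fitted once per fold

print(len(candidates), 'candidates,', n_fits, 'cross-validation fits')
```

With `refit=True`, one additional fit on the full training set follows for the best candidate. Adding another value to any list multiplies the total, so grids grow quickly.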


In [None]:
# Retrieve the best estimator (model) from the grid search
# This will provide the SVC model with the optimal hyperparameters found during grid search
grid_search.best_estimator_

In [None]:
# Retrieve the best hyperparameters found during the grid search
# This will return a dictionary containing the optimal parameter values for the model
grid_search.best_params_

In [None]:
# Use the best estimator (model with optimal hyperparameters) to make predictions on the test data
y_pred = grid_search.best_estimator_.predict(X_test_s)
y_pred

In [None]:
# Evaluate the model's performance using the custom print_metrics function
# This will display accuracy, recall, precision, F1 score, and the confusion matrix heatmap
print_metrics(y_test_e, y_pred)