
<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>

## Subsurface Data Analytics Final Project

### Impact of Outliers on Margin Length of Support Vector Machines

#### Barun Das, Graduate Student, The University of Texas at Austin

#### Lei Liu, Graduate Student, The University of Texas at Austin

#### Michael Pyrcz, Associate Professor, The University of Texas at Austin


### PGE 383 Final Project: Support Vector Machines (SVM) for Subsurface Modeling in Python 

#### Table of Contents
#### 1) Executive Summary
#### 2) SVM Introduction
#### 3) Mathematical Concepts behind SVMs
#### 4) Applications of SVMs

#### 1) Executive Summary

This project shows how the number of outliers (here, points whose class labels are flipped) affects the margin size of SVMs. The `SVC` class of scikit-learn is used to fit the SVMs. It is shown that as the number of outliers increases, the margin also increases.

#### 2) SVM Introduction

Support Vector Machines (SVMs) are powerful supervised learning models used for classification and regression tasks. SVMs find the decision boundary, known as the hyperplane, that best separates different classes in the input space. They work by identifying the hyperplane that maximizes the margin, the distance between the hyperplane and the closest data points of each class, with the aim of better generalization to new, unseen data. SVMs are effective in high-dimensional spaces, even when the number of dimensions exceeds the number of samples. They handle both linear and non-linear classification through kernel functions such as the polynomial, radial basis function (RBF), or sigmoid kernels.


#### 3) Mathematical Concepts behind SVMs

**1. Hyperplane:**
SVMs aim to find the optimal hyperplane that separates classes in the input space. For a binary classification problem, the hyperplane is represented by the equation $w \cdot x + b = 0$ in a feature space, where $w$ is the weight vector perpendicular to the hyperplane, $x$ represents the input features, and $b$ is the bias term.
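
As a minimal sketch of how the hyperplane equation classifies points, the snippet below evaluates $w \cdot x + b$ for a few inputs and assigns a class by its sign. The values of `w` and `b` and the helper names `decision_value` and `predict` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen for illustration only
w = np.array([2.0, -1.0])   # weight vector, perpendicular to the hyperplane
b = -1.0                    # bias term

def decision_value(x):
    """Return w . x + b: zero on the hyperplane, signed on either side."""
    return np.dot(x, w) + b

def predict(x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return np.where(decision_value(x) >= 0, 1, -1)

points = np.array([[2.0, 1.0],   # 2*2 - 1 - 1 =  2, positive side
                   [0.0, 1.0],   # 0 - 1 - 1   = -2, negative side
                   [1.0, 1.0]])  # 2 - 1 - 1   =  0, on the hyperplane
values = decision_value(points)
labels = predict(points)
print(values, labels)
```

Assigning points that lie exactly on the hyperplane to class +1 is an arbitrary tie-breaking choice made for this sketch.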

**2. Margin:**
The margin is the region between the hyperplane and the nearest data points of each class, known as support vectors. SVMs aim to maximize this margin. The distance from the hyperplane to the nearest point on either side is $\frac{1}{\|w\|}$, so the full margin width, measured between the two margin boundaries, is $\frac{2}{\|w\|}$, where $\|w\|$ represents the Euclidean norm of the weight vector.
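
The margin formula can be checked numerically. The sketch below fits a linear `SVC` to a tiny toy set (illustrative values, not the project data) and computes $\frac{2}{\|w\|}$ from the fitted `coef_` attribute. The very large `C` is an assumption used to approximate a hard-margin fit.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: one class at x = 0, the other at x = 2 (illustrative only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin SVM on separable data
model = SVC(kernel='linear', C=1e6).fit(X_toy, y_toy)

w = model.coef_[0]                       # weight vector of the fitted hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries
half_margin = 1.0 / np.linalg.norm(w)    # distance from hyperplane to nearest point
print(f"margin width: {margin_width:.3f}, half margin: {half_margin:.3f}")
```

Because the two classes are separated by a gap of 2 along the first feature, the widest possible margin for this toy set is 2, which the computed value should approach.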

**3. Objective Function:**
SVMs use an objective function to optimize the hyperplane. The objective is to maximize the margin while classifying the training points correctly. For linearly separable data, this is formulated as minimizing $\frac{1}{2}\|w\|^2$ subject to the constraint $y_i(w \cdot x_i + b) \geq 1$ for every data point $(x_i, y_i)$, where $x_i$ is the input and $y_i$ is the class label ($-1$ or $1$). Equality holds for the support vectors, which lie exactly on the margin boundary.
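
The constrained problem above has no solution when the classes overlap or some labels are noisy, which is the situation explored later in this project. A common relaxation, and the form solved by scikit-learn's `SVC`, is the soft-margin problem. It introduces slack variables $\xi_i$ and a penalty parameter $C$; both symbols, along with $n$ for the number of training points, are additions to the notation used in this section:

$$
\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n
$$

A point with $\xi_i = 0$ satisfies the original constraint, $0 < \xi_i \leq 1$ places it inside the margin but on the correct side, and $\xi_i > 1$ means it is misclassified. A smaller $C$ tolerates more violations in exchange for a wider margin.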

**4. Lagrange Multipliers and Dual Formulation:**
Solving the optimization problem involves using Lagrange multipliers to convert it into its dual form, allowing for more efficient computation. This leads to expressing the problem in terms of Lagrange multipliers $\alpha_i$ and formulating the dual objective function that needs to be maximized, which involves summations over pairs of data points.
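
For reference, the standard textbook dual of the separable problem can be written as follows, with $n$ denoting the number of training points (a symbol not used elsewhere in this section):

$$
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
\quad \text{subject to} \quad
\alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
$$

The weight vector is recovered as $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, and only the support vectors have $\alpha_i > 0$. The data enter the dual only through the dot products $x_i \cdot x_j$, which is what makes the kernel trick possible.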

**5. Kernel Trick:**
For non-linearly separable data, SVMs use the kernel trick to implicitly map the input data into a higher-dimensional space where the classes might be linearly separable. This allows the SVM to find a linear decision boundary in the higher-dimensional space without explicitly computing the transformed features.
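
A small numerical check makes the kernel trick concrete. The sketch below uses the homogeneous degree-2 polynomial kernel as an example and compares it with the dot product of an explicit feature map. The function names `poly2_kernel` and `poly2_features` are hypothetical helpers written for this illustration.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

def poly2_features(x):
    """Explicit 3-D feature map for 2-D input matching poly2_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly2_kernel(x, z)                             # computed in the 2-D input space
k_explicit = np.dot(poly2_features(x), poly2_features(z))   # computed in the 3-D feature space
print(k_implicit, k_explicit)
```

Both values agree, so the kernel gives the feature-space dot product without ever constructing the transformed features.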

**6. Optimization:**
The optimization process involves solving the dual objective function, a quadratic programming problem, with a solver such as Sequential Minimal Optimization (SMO) to find the optimal $\alpha_i$ values, which then allow computation of $w$ and $b$ to define the separating hyperplane.
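
The step from the dual solution back to $w$ can be sketched with a fitted scikit-learn model. For a binary linear `SVC`, the `dual_coef_` attribute holds the products $\alpha_i y_i$ for the support vectors, so multiplying it by `support_vectors_` should reproduce `coef_`. The data set and variable names below are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Illustrative two-class data set
X_demo, y_demo = datasets.make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42)

model = SVC(kernel='linear', C=1.0).fit(X_demo, y_demo)

# w = sum_i (alpha_i * y_i) * x_i, summed over the support vectors only
w_from_dual = model.dual_coef_ @ model.support_vectors_   # shape (1, 2)
b = model.intercept_                                      # bias term

print("w from dual coefficients:", w_from_dual)
print("w reported by coef_:     ", model.coef_)
print("number of support vectors:", len(model.support_vectors_))
```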

Overall, SVMs offer an effective way to find the best possible separation between classes by maximizing the margin and handling both linear and non-linear classification tasks by using appropriate kernels.

#### 4) Applications of SVMs

Support Vector Machines (SVMs) find applications across various domains due to their effectiveness in both classification and regression tasks. Some typical uses of SVMs include:

#### 1. **Image Classification**
   SVMs are used in image classification tasks, such as object recognition, facial expression analysis, and handwritten digit recognition. They perform well in distinguishing between different classes within images.

#### 2. **Text and Document Classification**
   In Natural Language Processing (NLP), SVMs are employed for text classification tasks like sentiment analysis, spam detection, topic classification, and document categorization.

#### 3. **Biomedical Applications**
   SVMs aid in medical diagnosis, such as cancer classification from tissue samples, identifying disease risk factors from genetic data, and predicting patient outcomes based on medical records.

#### 4. **Financial Forecasting**
   SVMs are used in stock market forecasting, credit scoring, fraud detection, and risk assessment due to their ability to handle complex data and make accurate predictions.

#### 5. **Handwriting Recognition**
   They are utilized in Optical Character Recognition (OCR) systems to recognize handwritten characters or text in forms, documents, or bank checks.

#### 6. **Bioinformatics**
   SVMs play a role in genomics, proteomics, and other bioinformatics fields for tasks like protein classification, gene expression analysis, and protein structure prediction.

#### 7. **Remote Sensing**
   SVMs analyze remote sensing data for land cover classification, crop yield prediction, object detection in satellite images, and environmental monitoring.

#### 8. **Anomaly Detection**
   SVMs are effective in anomaly detection tasks, such as detecting intrusions in network security, identifying defective products in manufacturing, or finding outliers in datasets.

#### 9. **Regression Analysis**
   Though primarily a classification algorithm, SVMs can also be used for regression tasks by modifying the formulation to predict continuous values.

#### 10. **Feature Extraction and Dimensionality Reduction**
   SVMs contribute to feature selection and dimensionality reduction techniques, assisting in reducing the complexity of high-dimensional data while preserving important information.

SVMs are versatile and perform well in various scenarios, especially when the data is well-structured and the classes or groups in the dataset are clearly separated. However, training becomes slow on large datasets, noise and outliers can shift the decision boundary and the margin, and careful selection of hyperparameters, such as the penalty parameter `C` and the kernel, is crucial for optimal results.
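
As a brief illustration of the regression use mentioned in item 9, the sketch below fits scikit-learn's `SVR` to a synthetic sine curve. The data, the variable names, and the hyperparameter values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression problem (illustrative data, not from the project)
x_reg = np.linspace(0.0, 2.0 * np.pi, 50).reshape(-1, 1)
y_reg = np.sin(x_reg).ravel()

# epsilon sets the width of the tube inside which errors are not penalized
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(x_reg, y_reg)
y_pred = svr.predict(x_reg)

mean_abs_error = np.mean(np.abs(y_pred - y_reg))
print(f"mean absolute error on the training points: {mean_abs_error:.3f}")
```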

#### Step 1: Import relevant packages

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC
from ipywidgets import interact, IntSlider

#### Step 2: Generate synthetic data

In [2]:
X, y = datasets.make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)
y = np.where(y == 0, -1, 1)  # Change labels to -1 and 1

#### Step 3: Create plotting function

In [3]:
def plot_svm(num_outliers):
    np.random.seed(42)
    outliers_indices = np.random.choice(len(X), num_outliers, replace=False)
    y_copy = y.copy()
    y_copy[outliers_indices] *= -1  # Flip the labels for outliers

    plt.figure(figsize=(6, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y_copy, cmap='viridis')

    # Fit SVM with outliers
    svm = SVC(kernel='linear', C=1.0)
    svm.fit(X, y_copy)

    # Get the current axes and their limits
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = svm.decision_function(xy).reshape(XX.shape)

    # Plot decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1], s=100, linewidth=1, facecolors='none', edgecolors='k')
    plt.title(f"SVM with {num_outliers} Outliers")
    plt.show()
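
The plotting function draws the margin but does not report its size. As a hedged sketch of how the trend could be quantified, the hypothetical helper `margin_width` below flips labels the same way as `plot_svm`, fits the same linear `SVC`, and returns $\frac{2}{\|w\|}$. It regenerates the data under the assumed names `X_demo` and `y_demo` so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Same synthetic data settings as in the notebook, under separate names
X_demo, y_demo = datasets.make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42)
y_demo = np.where(y_demo == 0, -1, 1)  # Change labels to -1 and 1

def margin_width(num_outliers, C=1.0, seed=42):
    """Flip the labels of num_outliers random points and return 2 / ||w||."""
    rng = np.random.RandomState(seed)
    flip = rng.choice(len(X_demo), num_outliers, replace=False)
    y_flipped = y_demo.copy()
    y_flipped[flip] *= -1                      # flipped labels act as outliers
    model = SVC(kernel='linear', C=C).fit(X_demo, y_flipped)
    return 2.0 / np.linalg.norm(model.coef_)   # margin width for a linear kernel

# Tabulate the margin width for a few outlier counts
widths = {n: margin_width(n) for n in range(0, 21, 5)}
for n, width in widths.items():
    print(f"{n:2d} outliers -> margin width {width:.3f}")
```

Printing the widths alongside the interactive plot gives a numerical check on the visual impression of the margin.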

#### Step 4: Create slider for number of outliers

In [4]:
interact(plot_svm, num_outliers=IntSlider(min=0, max=20, step=1, value=0, description='Num Outliers:'))
