Name : Marcel Zama <br> 
ID: C00260146 <br> 
Date: 25/02/2024

# Support Vector Machines <br>
## 1. Business Understanding <br>
This program illustrates how Support Vector Machines (SVMs) work. SVMs are powerful supervised machine learning algorithms used for both classification and regression tasks; this program focuses on classification.<br>
SVMs are effective for tasks where we want to classify data points into two or more classes. The main idea behind SVMs is to find the hyperplane that best separates the classes in the feature space. This hyperplane is chosen to maximize the margin between the classes, making it robust to new, unseen data points.
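
The margin idea can be made concrete with a small sketch. The four toy points below are hypothetical values chosen for illustration (they are not part of the breast cancer data), and a very large `C` is assumed here to approximate a hard-margin SVM. For a linear kernel, the margin width is 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: class 0 lies on x=0, class 1 on x=2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# Assumption: a very large C approximates a hard-margin SVM
linear_svm = SVC(kernel="linear", C=1e6)
linear_svm.fit(X_toy, y_toy)

w = linear_svm.coef_[0]        # normal vector of the separating hyperplane
b = linear_svm.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", linear_svm.support_vectors_)
```

Because the two classes are two units apart along the first axis, the reported margin width should come out close to 2.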


## Types of SVMs used in the Program:
In this program, we are using the SVC implementation of SVMs, which stands for Support Vector Classification. SVC is designed for classification tasks and is particularly versatile due to its ability to use different kernel functions.<br>
Kernel Functions:<br>
- A key feature of SVMs is the use of kernel functions, which implicitly map the input data into higher-dimensional spaces without computing the mapped coordinates explicitly.<br>
- The kernel parameter in the SVC implementation allows us to specify different kernel functions. In this program, we use the 'rbf' (radial basis function) kernel.<br>
- The 'rbf' kernel is commonly used and is effective for non-linear classification tasks. It allows SVMs to learn complex decision boundaries that can better separate classes that are not linearly separable in the original feature space.
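
As a rough illustration of what the 'rbf' kernel computes, the sketch below evaluates k(x, x') = exp(-gamma * ||x - x'||^2) by hand and compares it with scikit-learn's `rbf_kernel`. The two points and the `gamma` value are arbitrary assumptions for the example; the helper `rbf_kernel_manual` is hypothetical and not part of the program.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_manual(a, b, gamma):
    """Hypothetical helper: RBF kernel value for two 1-D vectors."""
    squared_distance = np.sum((a - b) ** 2)
    return np.exp(-gamma * squared_distance)

# Arbitrary example points and gamma (assumptions for illustration)
x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 4.0])
gamma = 0.5

manual_value = rbf_kernel_manual(x1, x2, gamma)
# rbf_kernel expects 2-D arrays of shape (n_samples, n_features)
sklearn_value = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]

print("manual:", manual_value)
print("scikit-learn:", sklearn_value)
```

The kernel value is 1 for identical points and decays towards 0 as the points move apart; a larger `gamma` makes that decay faster, which lets the decision boundary bend more tightly around individual samples.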

Support Vector Machines (SVMs) are effective in the following scenarios:<br>

1. Binary Classification: SVMs are natively binary classifiers and are most often used for two-class tasks, such as spam vs. non-spam emails, fraudulent vs. non-fraudulent transactions, or cancerous vs. non-cancerous tumors. Multi-class problems are handled by combining several binary classifiers (scikit-learn's SVC uses a one-vs-one scheme).<br>

2. Linearly Separable Data: SVMs work well when the data is linearly separable, meaning the classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions). SVMs aim to find the hyperplane that best separates the classes with the maximum margin.<br>

3. High-Dimensional Space: SVMs are effective in high-dimensional spaces, such as text classification and image classification tasks, where each feature represents a different dimension.<br>

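4. Small to Medium-Sized Datasets: SVMs can handle small to medium-sized datasets efficiently. The decision function depends only on the support vectors, a subset of the training points, so the trained model is memory efficient and can perform well without a large amount of training data. Training time, however, grows quickly as the number of samples increases.<br>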

5. Data with Non-Linear Separability: SVMs can also handle non-linear data by using different kernel tricks such as polynomial kernel, radial basis function (RBF) kernel, or sigmoid kernel. These kernels allow SVMs to map the input features into a higher-dimensional space where the classes become linearly separable.<br>

6. When Regularization is Needed: SVMs have a regularization parameter (C) that helps control overfitting. When the dataset has noise or outliers, SVMs with appropriate C values can generalize well and avoid overfitting.<br>

7. Image Recognition: SVMs have been successfully used in image recognition tasks, such as handwriting recognition, object detection, and face detection.<br>

8. Text and Document Classification: SVMs are popular for text classification tasks, such as sentiment analysis, spam detection, and topic categorization. They can efficiently handle large feature spaces and sparse data.<br>

9. Medical Diagnosis: SVMs can be used in medical diagnosis to predict diseases based on patient data, such as symptoms, test results, and patient history.<br>

10. When Interpretability is Not the Top Priority: SVMs can be complex models, especially when using non-linear kernels. While they can provide high accuracy, the decision boundary can be difficult to interpret.<br>

11. When Robust Generalization is Required: SVMs aim to maximize the margin between classes, which leads to better generalization on unseen data. They tend to perform well in practice when properly tuned.<br>
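
Scenario 5 in the list above (non-linear separability) can be illustrated with a small comparison. The sketch below uses synthetic concentric-circle data from `make_circles`, which is an assumption made for the example and unrelated to the breast cancer dataset, and fits one SVC with a linear kernel and one with the RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (assumption for illustration): one class forms an inner circle,
# the other an outer ring, so no straight line can separate them
X_circles, y_circles = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_circles, y_circles, test_size=0.25, random_state=0
)

scores = {}
for kernel_name in ("linear", "rbf"):
    model = SVC(kernel=kernel_name)
    model.fit(X_c_train, y_c_train)
    scores[kernel_name] = model.score(X_c_test, y_c_test)

print(scores)
```

On data of this shape the linear kernel is expected to score near chance level, while the RBF kernel should separate the two rings almost perfectly.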

In summary, Support Vector Machines (SVMs) are a good choice for binary classification problems, linearly separable data, and high-dimensional spaces, and when good generalization performance is desired. They are particularly effective when the dataset is small to medium-sized and when the classes can be separated by a clear margin. SVMs also work well when the data has a non-linear relationship, because kernel functions can map it into a higher-dimensional space where the classes become separable.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Data Understanding

In [None]:
# Load the Breast Cancer Wisconsin (Diagnostic) dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
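
Before modelling, it can help to check the size of the dataset and how the two classes are distributed. The following is a minimal sketch that reloads the dataset so it runs on its own; the variable names `cancer_data` and `class_counts` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Number of samples and number of features
print("Feature matrix shape:", cancer_data.data.shape)

# Label 0 maps to the first class name, label 1 to the second
print("Class names:", list(cancer_data.target_names))

# Count how many samples belong to each class
class_counts = pd.Series(cancer_data.target_names[cancer_data.target]).value_counts()
print(class_counts)
```

In scikit-learn's copy of the dataset, label 0 is malignant and label 1 is benign, and the benign class is the larger one, so accuracy alone should be read alongside the per-class metrics.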

In [None]:
# Preview the first rows of the features and the target
print("Data preview:")
print("X - Features (first 5 rows):")
print(pd.DataFrame(X, columns=cancer.feature_names).head())
print("\ny - Target (first 5 rows):")
print(pd.DataFrame(y, columns=["Target"]).head())

## 3. Data Preparation

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
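
The cell above fits the scaler on the training split only, which keeps information from the test set out of the preprocessing step. One possible way to get the same behaviour under cross-validation is to wrap the scaler and the classifier in a pipeline. The sketch below is an optional variation, not the program's method; it reloads the data so it runs on its own, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)

# The pipeline refits StandardScaler on the training part of every fold,
# so the held-out part never influences the scaling statistics
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Assumption: 5-fold cross-validation
cv_scores = cross_val_score(scaled_svm, X_all, y_all, cv=5)

print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```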

## 4. Model

In [None]:
# Create an SVM classifier with the RBF kernel
# (random_state only has an effect when probability=True)
svm_classifier = SVC(kernel='rbf', random_state=42)

# Train the SVM model
svm_classifier.fit(X_train, y_train)
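
The model above uses the default regularization strength (`C=1.0`). To get a feel for what `C` does, the sketch below refits the classifier with a few values and reports how many support vectors each model keeps. The three `C` values are arbitrary assumptions for illustration, and the data is reloaded and split the same way as in the program so the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

sv_counts = {}
for c_value in (0.01, 1.0, 100.0):  # arbitrary values chosen for illustration
    model = SVC(kernel="rbf", C=c_value).fit(X_tr_scaled, y_tr)
    sv_counts[c_value] = int(model.n_support_.sum())  # support vectors over both classes
    test_accuracy = model.score(X_te_scaled, y_te)
    print(f"C={c_value}: {sv_counts[c_value]} support vectors, test accuracy {test_accuracy:.3f}")
```

A small `C` tolerates more margin violations, so more training points end up as support vectors; a large `C` penalizes violations more heavily and usually keeps fewer support vectors, at a higher risk of overfitting.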

## 5. Evaluation

In [None]:
# Make predictions on the testing set
y_pred = svm_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

In [None]:
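# Plot the confusion matrix as a heatmap with class names on both axes
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=cancer.target_names, yticklabels=cancer.target_names)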
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()

## Observations:<br>
- Accuracy: The overall accuracy of the SVM model on the test set is printed first.<br>
- Classification Report:<br>
  - Precision: For each class, the proportion of instances predicted as that class that truly belong to it.<br>
  - Recall: For each class, the proportion of actual instances of that class that the model identified correctly.<br>
  - F1-score: The harmonic mean of precision and recall, providing a balance between the two.<br>
  - Support: The number of actual occurrences of each class in the test set.<br>
- Confusion Matrix:<br>
  - The diagonal elements represent the number of correct predictions for each class.<br>
  - Off-diagonal elements represent misclassifications.<br>
  - The heatmap helps visualize the distribution of correct and incorrect predictions.<br><br>

By analyzing the classification report and confusion matrix, we can gain insights into how well the SVM model classifies breast cancer diagnoses (malignant or benign) based on the given features. The confusion matrix shows where the model makes its errors, such as false positives or false negatives. In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign, so a malignant tumor predicted as benign appears in the first row, second column of the confusion matrix.<br>
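
To show how the metrics in the classification report relate to the confusion matrix, the sketch below recomputes precision, recall, and F1-score by hand and compares them with scikit-learn's functions. The labels `y_true_demo` and `y_pred_demo` are hypothetical values made up for the example; they are not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (not output of the SVM above)
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp,
# with label 1 treated as the positive class
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # correct positive predictions / all positive predictions
recall = tp / (tp + fn)      # correct positive predictions / all actual positives
f1 = 2 * precision * recall / (precision + recall)

print("manual:      ", precision, recall, f1)
print("scikit-learn:", precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

By default these functions treat label 1 as the positive class, which is benign in this dataset. To report recall for malignant tumors instead, `pos_label=0` can be passed to `recall_score`.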

For the program to run, the following packages must be installed:<br>
*pip install numpy*<br>
*pip install pandas*<br>
*pip install scikit-learn*<br>
*pip install matplotlib*<br>
*pip install seaborn*<br>