# SVM Classification of Cancer Images (Basic SVM)
Sources:
* [Datacamp](https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python)
* [Stackabuse](https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/)
### Overview
SVM (Support Vector Machines):
* A classifier model which splits data into classes using hyperplanes
* A hyperplane is a boundary between classes, optimized to create the biggest possible margin between the plane itself and the nearest data points on either side
* Support vectors are the data points closest to the hyperplane; they alone fix its position, so they have the largest impact on maximising the margin
* Kernels are the functions your SVM model uses to compare data points; they implicitly transform the data points from their original dimension to a higher dimension in order to more easily split the data into classes
* For example, if data cannot be linearly split (e.g. you need to draw a ring around the data instead of a straight line), then a suitable kernel can transform your data points into a linearly separable distribution to solve this issue
* Common kernels include linear (straight line), polynomial (curve) and radial basis function (RBF, flexible non-linear boundaries). Each kernel has its own parameters which need to be set and tuned to optimize your model (e.g. polynomial has the parameter d (for degree, `degree` in scikit-learn) which controls how curved the boundary can be)
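
The ring example above can be sketched with synthetic data. This is only an illustration, not part of the cancer analysis: the dataset comes from scikit-learn's `make_circles`, and the sample size, noise level and seed are arbitrary assumptions. Both scores are accuracy on the same points the models were fitted to.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# synthetic ring-shaped data: one class forms a circle around the other
# (n_samples, factor, noise and random_state are illustrative choices)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# fit a linear kernel and an RBF kernel to the same points
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```

A straight line cannot separate an inner ring from an outer ring, so the linear kernel should score far lower than the RBF kernel here.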

![SVM](Images/Kernels.png)

### Breast Cancer Images
Here, we will use an SVM model to determine whether breast cancer images show malignant (cancerous) or benign (non-cancerous) cells. The images have already been converted into numeric features, and we have a set of targets from previously classified data which will allow us to train and validate our model.

### Load and Investigate Data
scikit-learn provides us with the dataset already split into features and targets.

In [12]:
# load library
from sklearn import datasets

# load data
data = datasets.load_breast_cancer()

# show feature names
data.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [13]:
# show example of data row
data.data[0]

array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])

In [15]:
# show target names
data.target_names

array(['malignant', 'benign'], dtype='<U9')

In [18]:
# show targets
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [21]:
# dimensions of data
data.data.shape

(569, 30)

To summarize, we have 30 features, which are various properties measured from the cancer images, and 1 target, which classifies each image as either malignant (0) or benign (1). We have 569 images in total and our data is already nicely stored in NumPy arrays.
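
As a quick follow-up check, the sketch below counts how many samples belong to each class. It reloads the dataset so that it runs on its own; `counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

data = datasets.load_breast_cancer()

# count how many samples fall into each target class (index 0 = malignant, 1 = benign)
counts = np.bincount(data.target)

for name, count in zip(data.target_names, counts):
    print(name, count)
```

Knowing the class balance helps when reading the accuracy figures later, since a model that always predicts the majority class would already score above 50%.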

### Train and Test Sets
Let's split our data into train (70%) and test (30%) sets. This allows us to build our model on one part and validate it later on data it has never seen, without leaking test information into training.

In [22]:
# load library
from sklearn.model_selection import train_test_split

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=109) # set seed
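
A small sanity check on the split can be sketched as follows. It repeats the same call so that it runs standalone, then prints the size of each set and the share of benign samples in each.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=109)

# number of rows and features in each split
print(X_train.shape, X_test.shape)

# share of benign (1) samples in each split
print(y_train.mean(), y_test.mean())
```

If the two shares differ noticeably, `train_test_split` also accepts `stratify=data.target` to keep the class proportions equal in both sets. That option is not used in this notebook, so treat it as an optional variation.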

### Build and Fit Model
Let's build our model and fit it to our training data. We can then make predictions on our test features and compare with our test targets to assess initial accuracy.

We'll begin with the simplest SVM model, a linear kernel, and see how well this works as a first attempt.

In [23]:
# load library
from sklearn import svm

# build classifier model
clf = svm.SVC(kernel='linear') # linear
#clf = svm.SVC(kernel='poly', degree=8) # polynomial of degree 8
#clf = svm.SVC(kernel='rbf') # gaussian/rbf
#clf = svm.SVC(kernel='sigmoid') # sigmoid

# fit model to train data
clf.fit(X_train, y_train)

# make predictions on test features
y_pred = clf.predict(X_test)
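
The commented-out lines in the cell above list alternative kernels. One way to compare them, sketched here under the assumption that every kernel keeps scikit-learn's default hyperparameters (so the polynomial kernel uses its default degree rather than 8), is to loop over the kernel names and record test accuracy for each. The `scores` dictionary is a name introduced for this sketch.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=109)

# fit one classifier per kernel and record its accuracy on the test set
scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVC(kernel=kernel)  # default hyperparameters for every kernel
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

for kernel, score in scores.items():
    print(kernel, round(score, 3))
```

Untuned scores like these are only a starting point; a kernel that looks weak with default settings may improve considerably once its parameters are tuned.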

### Model Accuracy
Now that we've made our initial predictions, let's compare them with our known results and assess the accuracy of this first attempt (before tuning, optimizing etc.).

All of the below scores are very high, most likely because this pre-built dataset is designed to be easy to model (i.e. good, clean data). Since the scores are measured on the held-out test set, they are not in themselves a sign of over-fitting, although a single train/test split can still give an optimistic picture. Note that precision and recall are additional metrics which look at the distribution of true/false positives/negatives between our predicted values and actual results.

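Our results show us that there are 6 misclassifications out of the 171 test samples: 2 malignant samples predicted as benign and 4 benign samples predicted as malignant. Treating benign (1) as the positive class, that is 2 false positives and 4 false negatives.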

In [26]:
# load libraries
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# show accuracy (pred vs. actual)
#print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

# show precision (true positive / all pred. positives)
#print("Precision: ", metrics.precision_score(y_test, y_pred))

# show recall (true positive / all real positives)
#print("Recall: ", metrics.recall_score(y_test, y_pred))

# confusion matrix
print(confusion_matrix(y_test, y_pred))

# classification report
print(classification_report(y_test, y_pred))

[[ 61   2]
 [  4 104]]
              precision    recall  f1-score   support

           0       0.94      0.97      0.95        63
           1       0.98      0.96      0.97       108

    accuracy                           0.96       171
   macro avg       0.96      0.97      0.96       171
weighted avg       0.97      0.96      0.97       171
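
To make the link between the confusion matrix and the report explicit, the arithmetic can be sketched by hand. The matrix values are copied from the output above; benign (1) is treated as the positive class, which matches scikit-learn's default `pos_label=1`.

```python
import numpy as np

# confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes
cm = np.array([[61, 2],
               [4, 104]])

# with class 1 (benign) as positive, the flattened order is tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)        # true positives / all predicted positives
recall = tp / (tp + fn)           # true positives / all actual positives

print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

These values correspond to the `accuracy` row and the class `1` row of the classification report.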



### Model Tuning and Principles
Gamma:
* Put simply, a low value of gamma tends to lead to underfitting, a high value to overfitting
* Gamma sets how far the influence of a single training point reaches: for a low value, each point's influence reaches far, so the hyperplane is shaped by many points at once and stays smooth
* For a high value, each point's influence is limited to its immediate neighbourhood, so the hyperplane bends tightly around individual points
* You can tweak this parameter to essentially decide how closely your model chooses to stick to every data point. Thinking of data points as magnets for the hyperplane, as gamma increases each magnet pulls harder at short range while its pull fades more quickly with distance
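
The effect of gamma on fit can be sketched on synthetic data. Everything below is an illustrative assumption rather than part of the cancer analysis: the data comes from `make_moons`, and the three gamma values are arbitrary picks for "low", "medium" and "high".

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# noisy two-class data with a curved boundary (illustrative settings)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# record (train accuracy, test accuracy) for each gamma value
results = {}
for gamma in [0.01, 1, 1000]:
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(gamma, results[gamma])
```

A large gap between training and test accuracy at the highest gamma is the usual symptom of overfitting, while low scores on both at the lowest gamma point to underfitting.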

Regularization:
* SVM is a trade-off between maximizing the margin between classes and minimizing misclassification of classes
* In a perfect scenario, you could have a hyperplane with a large margin that separates all classes exactly; in practice, however, there are often cases where you cannot separate classes without misclassification (i.e. overlapping classes)
* The regularization parameter (C in scikit-learn, which acts as the inverse of the regularization strength lambda) controls how heavily misclassifications are penalised, and so how much misclassification you're willing to accept
* A smaller value of C means a more relaxed model where underfitting can occur (i.e. larger hyperplane margins) whilst a higher value of C means a stricter model where overfitting can occur (i.e. narrower margins)
* In general, lower (but not tiny) values of C tend to generalise well for SVM
* [Regularization Explained](https://datascience.stackexchange.com/questions/4943/intuition-for-the-regularization-parameter-in-svm#:~:text=The%20regularization%20parameter%20(lambda)%20serves,minimizing%20the%20amount%20of%20misclassifications.&text=For%20non%2Dlinear%2Dkernel%20SVM%20the%20idea%20is%20the%20similar)
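
The margin trade-off can be sketched by counting support vectors for different values of C. The data here is synthetic and assumed for illustration only: two overlapping clusters from `make_blobs`, with hand-picked centres and three arbitrary C values.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two overlapping clusters, so no straight line separates them perfectly
# (centres, spread and seed are illustrative choices)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.0, random_state=0)

support_counts = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # points on or inside the margin become support vectors
    support_counts[C] = int(clf.n_support_.sum())
    print(C, support_counts[C], clf.score(X, y))
```

A wider margin catches more points, so a small C typically leaves many more support vectors than a large C.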

![Regularization](Images/Regularization.png)

### Pros and Cons
Pros:
* Faster prediction than some methods (e.g. Naive Bayes)
* Better accuracy than a lot of models when the right kernel is used and the dataset has clear class separation
* Less memory than some models, because the decision function only uses a subset of the training data (the support vectors)
* Works well with high dimensional data (i.e. many features)

Cons:
* Slows down a lot with larger datasets vs. other models
* Doesn't work brilliantly if classes overlap significantly
* Sensitive to which kernel method is used and hyperparameter tuning

### TO DO:
* Visualize data and model (i.e. how is the initial linear kernel working and what does the hyperplane look like?)
* Feature selection (correlation, variance, RFE etc.)
* Cleaning and checks (any nulls etc.?)