# Support Vector Machine Classifier
Advanced tool for classifying vacuum gauges 

* [Setup](#setup)
    * [Environment](#env)
    * [Data Retrieval (illustrative)](#retrieval)
    * [Data Normalization (illustrative)](#norm)
    * [Data Interpolation (illustrative)](#interpolate)
    * [Data Masking (illustrative)](#masking)
* [Dataset Creation](#create)
    * [Bin Data (illustrative)](#level)
    * [Generate Dataset](#gen)
    * [Select Dataset](#select)
* [Support Vector Machine Classification](#support)
    * [Split Dataset](#split)
    * [Train Model (illustrative)](#train)
    * [Confusion Matrix (illustrative)](#confusion)
    * [Parameter Optimization (illustrative)](#crossvalidate)
    * [Manual Optimization (illustrative)](#manual)
    * [Final Evaluation](#eva)
* [Use Model](#use)
    * [Save Model](#save)
    * [Load Model](#load)

# <a id='setup'> Setup </a>
## <a id='env'>Environment </a>

In [3]:

%run BackEnd_Plotters.ipynb
%run BackEnd_DataProcessing.ipynb
%run BackEnd_Classifiers.ipynb




Populating the interactive namespace from numpy and matplotlib


## <a id='retrieval'> Retrieve Data </a>

In [None]:
gauge_id = "VGPB.4.7L5.R.PR"
fillNo = 2219

pressure_readings,\
time_readings,\
beam_time,\
beam_energy = retrieve_gauge_data(gauge_id, fillNo,show_plot=True)


## <a id='norm'> Data Normalization </a>

In [None]:
pressure_readings,\
beam_energy = normalize_y(time_readings,
                          beam_time,
                          pressure_readings,
                          beam_energy,
                          show_plot=True)
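
The backend helper `normalize_y` is not shown in this notebook. As a rough illustration of what a normalization step can look like, here is a minimal min-max scaling sketch. The function name `min_max_normalize` and the sample values are hypothetical, and the real helper may use a different scheme.

```python
import numpy as np


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # A constant signal carries no shape information; map it to zeros.
        return np.zeros_like(values)
    return (values - values.min()) / span


# Made-up pressure readings, for illustration only
pressure = np.array([2.0e-9, 4.0e-9, 6.0e-9, 1.0e-8])
normalized = min_max_normalize(pressure)
print(normalized)
```

Scaling every gauge to the same range lets readings from gauges with very different absolute pressures be compared by shape rather than magnitude.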

## <a id=interpolate> Data Interpolation </a>

In [None]:
time_readings,\
pressure_readings,\
beam_time,\
beam_energy = interpolate_readings(pressure_readings, time_readings, beam_time, beam_energy,
                                   show_plot=True)
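
The internals of `interpolate_readings` are not shown here. One common way to bring two signals onto a shared time axis is linear interpolation with `numpy.interp`, sketched below. The arrays are made up, and the assumption that the helper resamples beam energy onto the pressure timestamps is ours.

```python
import numpy as np

# Hypothetical timestamps: pressure is sampled more densely than beam energy
t_pressure = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t_energy = np.array([0.0, 2.0, 4.0])
energy = np.array([0.0, 10.0, 20.0])

# Linearly interpolate the energy signal at the pressure timestamps
energy_on_pressure_grid = np.interp(t_pressure, t_energy, energy)
print(energy_on_pressure_grid)
```

Once both signals share one time axis, a mask computed from the beam energy can be applied directly to the pressure readings.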


## <a id='masking'> Data Masking </a>

In [None]:
mask,\
threshold = double_threshold_energy_masking(time_readings, 
                             beam_time,
                             beam_energy,
                            show_plot=True)

# <a id='create'> Dataset Creation </a>
## <a id='level'> Bin the Data </a>
Split data into segments and return their averages. <br>
These are used as the **features** for the algorithm.

In [None]:
pressure_levels = bin_data(pressure_readings[mask],8)
plot_levelled_data(gauge_id, pressure_levels, time_readings[mask],pressure_readings[mask])
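
The binning idea described above (split into segments, average each one) can be sketched with NumPy. This is a minimal illustration under our own assumptions: `bin_means` is a hypothetical stand-in for the backend's `bin_data`, whose actual implementation is not shown.

```python
import numpy as np


def bin_means(values, n_bins):
    """Split values into n_bins contiguous segments and return each segment's mean."""
    segments = np.array_split(np.asarray(values, dtype=float), n_bins)
    return np.array([segment.mean() for segment in segments])


readings = np.arange(16, dtype=float)  # 0, 1, ..., 15
levels = bin_means(readings, 4)
print(levels)
```

Each returned value becomes one feature, so the number of bins sets the dimensionality of the feature vector handed to the classifier.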

## <a id='gen'> Generate Datasets </a>
Use CSV file catalogue containing probe/gauge information and their ground truth to generate a SciKit Learn dataset ('bunch').


In [4]:

balanced_datasets = get_or_create_datasets("Balanced", feature_spec=(1,13), reset=0, verbose=1)


Catalogue Balanced: (495, 6)


In [None]:
for key in balanced_datasets.keys(): 
    print("{0}Balanced Dataset with {2} Features:{1}".format(color.BOLD,color.END,key))
    display(balanced_datasets[key].head())

In [5]:

representative_datasets = get_or_create_datasets("Representative", feature_spec=(1,13), reset=0, verbose=1)


Catalogue Representative: (1248, 6)


In [None]:
for key in representative_datasets.keys(): 
    print("{0}Representative Dataset with {2} Features:{1}".format(color.BOLD,color.END,key))
    display(representative_datasets[key].head())

## <a id="select"> Select Specific Dataset </a>
This step illustrates how the algorithm is prepared step by step. For full optimization, <br>
all loaded datasets can be used to vary the number of features.

In [6]:
feature_options = representative_datasets.keys()
feature_field = widgets.BoundedIntText(
    value=4,
    min=min(feature_options),
    max=max(feature_options),
    step=1,
    description='No Of Features:',
)
display(feature_field)

BoundedIntText(value=4, description='No Of Features:', max=12, min=1)

In [7]:
dataset = representative_datasets[feature_field.value]

X,y,lookup = dataset.iloc[:,0:dataset.shape[1]-3],dataset.Response,dataset.iloc[:,-2:]

plt.figure()
plt.bar(x=['0:Normal','1:Anomalous'],height=[y.value_counts()[0],y.value_counts()[1]],color=['brown','blue'])
plt.ylabel("Count")
plt.title("Frequency of classes in dataset")
plt.show()

display(dataset.head())


| | Bin0 | Bin1 | Bin2 | Bin3 | Response | Probe ID | Fill |
|---|---|---|---|---|---|---|---|
| 0 | 0.419812 | 0.190751 | 0.233995 | 0.241369 | 1 | VGPB.220.1L1.X.PR | 5979 |
| 1 | 0.05515 | 0.050539 | 0.044476 | 0.037265 | 0 | VGPB.222.1L1.X.PR | 5979 |
| 2 | 0.937469 | 0.862674 | 0.766351 | 0.654398 | 0 | VGPB.190.1L2.X.PR | 5979 |
| 3 | 0.8142 | 0.682644 | 0.557005 | 0.493102 | 0 | VGPB.220.1L2.X.PR | 5979 |
| 4 | 0.801912 | 0.663048 | 0.541328 | 0.486384 | 1 | VGPB.222.1L2.X.PR | 5979 |


# <a id='support'> Support Vector Machine Classification </a>
## <a id='split'> Setup Training and Testing sets </a>
Split the data into a test and training set using a random seed. 

In [8]:
X_train, X_test, X_train_labels, X_test_labels = split_dataset(X,y, seed = 0)
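
`split_dataset` is a backend helper whose code is not shown. A seeded split of this kind can be sketched with scikit-learn's `train_test_split`. The 20% test size and the `stratify` argument are assumptions for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((100, 4))            # 100 samples with 4 binned features
y_demo = np.array([0] * 80 + [1] * 20)   # imbalanced labels: 80 normal, 20 anomalous

# random_state fixes the shuffle, so the same split is reproduced on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape, y_te.sum())
```

Fixing the seed makes results reproducible, but any single split is still only one of many possible partitions of the data.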

# Support Vector Machine
Support vector machines belong to the fifth tribe of machine learning: the analogizers. <br>
They are effective in higher-dimensional problems, so it may be possible to use more parameters than with the other models. <br>
In our case the number of features is far smaller than the number of samples, so an SVM may be especially effective. <br>

<img src=https://i.imgur.com/iFIeg67.png alt='Illustration of Support Vector Machine' align='left' width=800></img>

## Unoptimized Model
Complete run-through of the creation and evaluation of a simple Support Vector Machine model.<br>
No attempt is made to optimize parameters in this section.

### <a id='train'>Train Model</a>
Fit a Support Vector Machine with fixed, untuned settings (`C=1.0`, RBF kernel) on the training set. <br>


In [10]:
SV = sklearn.svm.SVC(C=1.0, kernel='rbf',gamma='auto',probability=True,verbose=True,random_state=0).fit(X_train,X_train_labels)
print(SV)

[LibSVM]SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=True, random_state=0, shrinking=True, tol=0.001,
    verbose=True)


### Predict classes in unseen data
With our model trained, we can try to use it to predict labels of unseen data.

In [11]:
yhat = SV.predict(X_test)
print(yhat)

[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0]


We can also see how confident each prediction was by looking at the probability of <br> 
each class for each sample.

In [12]:
yhat_prob = SV.predict_proba(X_test)
# Columns of predict_proba follow SV.classes_, i.e. class 0 first, then class 1
print("{0}   0     1{1}\n{2}\n...\n {3}".format(color.BOLD,color.END,yhat_prob[0:5],yhat_prob[-1]))

   0     1
[[0.79334036 0.20665964]
 [0.83126277 0.16873723]
 [0.0095463  0.9904537 ]
 [0.91910311 0.08089689]
 [0.98615383 0.01384617]]
...
 [0.97422316 0.02577684]


### <a id='confusion'>Confusion Matrix</a>
A confusion matrix is a useful tool for understanding how our model performs.<br>

<b>Precision:</b> The fraction of predicted positives that are truly positive.<br> Indicates how confident we are that a predicted positive is indeed positive. <br> Hence Precision = True Positive / (True Positive + False Positive)

<b>Recall:</b> The number of true positive samples over the total number of positive samples. <br> Indicates how well a class is correctly recognized. <br> Hence Recall = True Positive / (True Positive + False Negative)

In [13]:

cnf_matrix = get_confusion_matrix(X_test_labels,yhat, normal=False)



Confusion matrix, without normalization
[[152   0]
 [ 38   2]]
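
As a worked example, the precision and recall of the anomalous class (label 1) can be computed by hand from the counts printed above. The sketch assumes the usual scikit-learn layout, with rows as true classes and columns as predicted classes.

```python
# Counts copied from the confusion matrix printed above
cnf = [[152, 0],
       [38, 2]]

tn, fp = cnf[0]  # true class 0: correctly predicted normal, wrongly flagged anomalous
fn, tp = cnf[1]  # true class 1: missed anomalies, correctly flagged anomalies

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The model is always right when it flags an anomaly, but it finds only 2 of the 40 anomalous samples, which is why high accuracy alone is misleading on this imbalanced dataset.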


### Classification Report
Break down the confusion matrix above to get the precision and recall for each class. <br>

<b>F1 Score:</b> Harmonic mean of the precision and the recall; a value of 1.0 indicates a perfect classifier.

In [14]:
print(classification_report(X_test_labels, yhat))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89       152
           1       1.00      0.05      0.10        40

    accuracy                           0.80       192
   macro avg       0.90      0.53      0.49       192
weighted avg       0.84      0.80      0.72       192
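
The F1 values in the report can be reproduced from the rounded precision and recall of each class. A minimal sketch, with `f1_score_from` as a hypothetical helper name:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_normal = f1_score_from(0.80, 1.00)     # class 0 row of the report
f1_anomalous = f1_score_from(1.00, 0.05)  # class 1 row of the report
print(f"{f1_normal:.2f} {f1_anomalous:.2f}")
```

The harmonic mean is dominated by the smaller of the two inputs, so the poor recall of class 1 drags its F1 score down to about 0.10 despite perfect precision.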



### Simple Metric Accuracy
To cross-compare with other machine learning algorithms, we also calculate the accuracy, i.e. how often our model <br>
correctly predicts labels in the test set based on their ground truths.

In [15]:
yhat = SV.predict(X_test)
xhat = SV.predict(X_train)
training_std_dev = np.std(xhat==X_train_labels)/np.sqrt(xhat.shape[0]) # std error appropriate?
testing_std_dev = np.std(yhat==X_test_labels)/np.sqrt(yhat.shape[0])
print("%sTrain set Accuracy:%s %.3f \xb1 %.3f" %(color.BOLD, color.END, metrics.accuracy_score(X_train_labels, xhat),training_std_dev))
print("%sTest set Accuracy:%s %.3f \xb1 %.3f\n" %(color.BOLD, color.END, metrics.accuracy_score(X_test_labels, yhat),testing_std_dev))

Train set Accuracy: 0.847 ± 0.013
Test set Accuracy: 0.802 ± 0.029
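
The uncertainty quoted for the test set can be checked by hand. For a 0/1 array of correct and incorrect predictions, `np.std` equals sqrt(p(1-p)), so dividing by sqrt(n) gives the standard error of the accuracy. The sketch below uses the test-set counts from the confusion matrix earlier in this notebook (152 + 2 correct out of 192).

```python
import math

n_test = 192
n_correct = 152 + 2  # diagonal of the confusion matrix

accuracy = n_correct / n_test
# Standard error of a proportion: sqrt(p * (1 - p) / n)
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
print(f"{accuracy:.3f} +/- {std_error:.3f}")
```

Strictly speaking this quantity is a standard error of the mean rather than a standard deviation, which matters when comparing it with the spread of scores across cross-validation folds.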



# <a id='crossvalidate'> Grid Search Cross-Validation </a>
Test our model over a range of different values to find the best value for that parameter. <br>

We implement grid search once manually, to reveal what it does, and then use the built-in function provided by scikit-learn. <br>

This section corresponds to the left branch of this flowchart, as we are trying to find the best parameters to build our model with. <br>

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" alt="Grid Search Workflow" width="400" align="left"> </img>

## Manual Grid-Search Walkthrough

### Cross-Validation
If we simply split our dataset once and train our model this way, it is possible that the model will be biased by how we happen to split 
up the original data.

To deal with this potential bias in performance, we employ cross-validation.

Here we split the dataset differently over multiple iterations to improve generalization. <br>

For each split with the cross-validator, we put aside 20% of the training data for validation and use the remaining 80% for training. <br>
<b>Background Reading: </b><a href="https://scikit-learn.org/stable/modules/cross_validation.html">SciKit Learn Documentation: Cross Validation</a> <br>
    

Here we have chosen a stratified splitter (`StratifiedShuffleSplit` in the cell below), which preserves the original class distribution of the dataset (about 50% of each class for a balanced dataset) in both the training and validation sets.<br>
By convention, we run 10 passes to ensure a fair scoring. <br>
<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0071.png" alt="Stratified KFold" width="600" align="left"></img>

In [16]:
cv_splitter,\
average_cv_score = cross_validate_sets(X_train,
                                       X_train_labels,
                                       scorer = metrics.balanced_accuracy_score,
                                       cv_splitter = sklearn.model_selection.StratifiedShuffleSplit,
                                       classifier=sklearn.svm.SVC,
                                       verbose=0)


>>>> Cross Validation <<<<
    Average CV score: 0.5444444444444445
    Train Set: 765.6
    Validation set: 191.4
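
The backend helper `cross_validate_sets` is not shown. A minimal sketch of the same idea, assuming 10 stratified shuffle splits with 20% held out for validation and balanced accuracy as the score, could look as follows. The data is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the binned gauge features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.8, 0.2], random_state=0,
)

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = []
for train_idx, val_idx in splitter.split(X_demo, y_demo):
    # Fit on the training part of this split, score on the held-out part
    clf = SVC(kernel="rbf", gamma="scale").fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(balanced_accuracy_score(y_demo[val_idx], clf.predict(X_demo[val_idx])))

mean_cv_score = float(np.mean(scores))
print(mean_cv_score)
```

Averaging over several splits gives a score that depends less on any one lucky or unlucky partition of the training data.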


##  <a id="manual">Find Best Parameter Manually</a>
If cross-validation is used to reduce bias from splitting, it is possible to objectively <br>
find the best parameter for maximizing a certain score (e.g. accuracy or the F1 score).

Here this is used to vary a single parameter and find its optimum value; in the grid search <br>
in the following section this is generalized to any number of parameters. 

Note that the seed only fixes the CV splitter; results may still vary depending on which seed <br>
was used to split the dataset earlier.

In [19]:
best_score, best_param_setting = find_best_parameter_manualy(X_train, X_train_labels,
                                                         scorer = metrics.balanced_accuracy_score, cv_splitter=sklearn.model_selection.StratifiedShuffleSplit,
                                                         classifier = sklearn.svm.SVC, seed = 0, verbose=1, show_plot=True)

print("{}Best Score{}: {} for parameter set to {}".format(color.BOLD,color.END,best_score, best_param_setting))

>>>>DOCUMENTATION<<<<

    C : float, optional (default=1.0)
        Penalty parameter C of the error term.

    kernel : string, optional (default='rbf')
        Specifies the kernel type to be used in the algorithm.
        It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
        a callable.
        If none is given, 'rbf' will be used. If a callable is given it is
        used to pre-compute the kernel matrix from data matrices; that matrix
        should be an array of shape ``(n_samples, n_samples)``.

    degree : int, optional (default=3)
        Degree of the polynomial kernel function ('poly').
        Ignored by all other kernels.

    gamma : float, optional (default='auto')
        Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.

        Current default is 'auto' which uses 1 / n_features,
        if ``gamma='scale'`` is passed then it uses 1 / (n_features * X.var())
        as value of gamma. The current default of gamma, 'auto', will chan

IntProgress(value=0, bar_style='info', description='Progress:', max=40)


        Best balanced_accuracy_score: 0.575925925925926
        Best kernel: linear 
        Mean balanced_accuracy_score:


| | linear | poly | rbf | sigmoid |
|---|---|---|---|---|
| kernel | 0.575926 | 0.5 | 0.544444 | 0.503704 |


Best Score: 0.575925925925926 for parameter set to linear
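
The single-parameter search above can be approximated in a few lines with `cross_val_score`. This is a sketch on synthetic data, not the code of `find_best_parameter_manualy`, which lives in the backend notebook and is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Vary one parameter (the kernel) and record the mean cross-validated score
mean_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    scores = cross_val_score(
        SVC(kernel=kernel, gamma="scale"), X_demo, y_demo,
        cv=splitter, scoring="balanced_accuracy",
    )
    mean_scores[kernel] = scores.mean()

best_kernel = max(mean_scores, key=mean_scores.get)
print(best_kernel, mean_scores[best_kernel])
```

Because the splitter is seeded, every kernel is scored on the same ten splits, which keeps the comparison fair.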


## <a id="optimal">Optimal Parameters for the Model</a>

Employ techniques from optimizing the k-neighbour & logistic regression classifiers (see associated notebooks) to <br>
identify the best parameters for our model. <br>

In this case we can vary:
* <b>C</b>: the penalty parameter of the error term
* <b>Kernel</b>: Kernel type used in the algorithm
    * <b> Degree</b>: Degree for the polynomial kernel
    * <b> Gamma</b>: Coefficient for 'rbf', 'poly' and 'sigmoid'
    * <b> Coef0</b>: Coefficient for 'poly' and 'sigmoid'
* <b>Shrinking</b>: Whether to use shrinking heuristic

As before we will vary these parameters by iterating over them. For each parameter combination we then <br>
cross-validate the performance to make sure our score (the balanced accuracy) isn't biased by how we happened to <br>
split the training set. <br>

<b> WARNING </b> As this is an exhaustive technique, it will take a long time to execute
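
For reference, an exhaustive search of this kind can be written directly with scikit-learn's `GridSearchCV`. The sketch below uses a deliberately small grid and synthetic data so that it runs quickly. It is not the backend's `find_best_parameters`, whose implementation is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# 2 kernels x 3 values of C x 2 gamma settings = 12 candidate models
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": np.linspace(0.1, 2.0, 3),
    "gamma": ["scale", "auto"],
}

search = GridSearchCV(
    SVC(),
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The cost grows multiplicatively: every extra parameter value multiplies the number of candidates, and each candidate is fitted once per cross-validation split.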


### Varying The No. Of Features

Earlier, we binned the data by calculating the average over each of X intervals. 

Changing the number of intervals could affect how well our model can discriminate between expected and unexpected behaviour, <br> 
since too many may confuse it by giving it meaningless patterns and too few may not give it enough useful information.

To implement this we iterate over different datasets with different number of bins.<br>
For each dataset we use Grid-Search Cross-validation as before to get the best parameters for that dataset.

**WARNING** If results between manual and grid search do NOT match, this is likely due to the seed value being different

In [22]:
best_no_of_features, best_params, best_score = find_best_parameters(balanced_datasets,
                                                                    param_grid = {'kernel':['linear', 'poly', 'rbf', 'sigmoid'],'C': np.linspace(0.01,2,20),'degree':np.arange(3,6),'gamma':['auto','scale']},
                                                                    classifier = sklearn.svm.SVC,
                                                                    cv_splitter = sklearn.model_selection.StratifiedShuffleSplit,
                                                                    scorer=metrics.balanced_accuracy_score,
                                                                    seed = 0,
                                                                    verbose=1)
# Use explicit indices throughout: mixing "{}" with numbered fields raises a ValueError
print("{0}Best Score{1}: {2:.2f} with {3} features and parameters: \n {4}".format(color.BOLD,color.END,best_score, best_no_of_features, best_params))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed: 38.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   47.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   53.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   41.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   39.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   39.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   40.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   40.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   38.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   39.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   40.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 480 candidates, totalling 4800 fits
Best Score: 0.82 with parameters: 
 {'C': 1.8952631578947368, 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf'}


[Parallel(n_jobs=1)]: Done 4800 out of 4800 | elapsed:   40.1s finished


## <a id='eva'>Final Evaluation </a>
Uses seeded training_set (not cross validated!) for a final evaluation of the model, including the standard deviation.

In [25]:
testing_dataset = representative_datasets[best_no_of_features-min(representative_datasets.keys())]

model = model_evaluation(balanced_datasets, testing_dataset = testing_dataset, classifier = sklearn.svm.SVC,
                                no_of_features = best_no_of_features, best_params = best_params,
                                scorer = metrics.balanced_accuracy_score, seed = 0, verbose = 0)

Train Set balanced_accuracy_score : 0.8301962256758522
Test Set balanced_accuracy_score : 0.7713385484571925



Confusion matrix, without normalization
[[626 154]
 [ 46 131]]
Classification Report
              precision    recall  f1-score   support

           0       0.93      0.80      0.86       780
           1       0.46      0.74      0.57       177

    accuracy                           0.79       957
   macro avg       0.70      0.77      0.71       957
weighted avg       0.84      0.79      0.81       957



# <a id='use'>Use Model</a>
## <a id='save'>Store Model</a>

In [26]:
save_model(model, best_no_of_features)

Model Desc:
 SVC(C=1.8952631578947368, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
Please enter a name for your model
Note: the no.of.features will be appended)
>>>SVM_optimized_for_balance
Successfully saved model to SVM_optimized_for_balance_6.joblib


## <a id='load'>Load Model</a>

In [None]:
model = load_model()
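
The saved file above has a `.joblib` extension, which suggests that `save_model` and `load_model` wrap `joblib.dump` and `joblib.load`. That is an assumption, since the helpers themselves are not shown. A minimal round-trip sketch with a toy model and a hypothetical file name:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

# Tiny toy model, only to demonstrate the save/load round trip
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_demo = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm_demo_2.joblib")  # hypothetical file name
    joblib.dump(clf, path)        # serialize the fitted estimator to disk
    restored = joblib.load(path)  # read it back into a new object

same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A restored model expects inputs with the same number of features it was trained on, which is presumably why the feature count is appended to the file name.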