# Introduction

This notebook provides you an opportunity to demonstrate proficiency in meeting course learning goals by applying a support vector machine to solve a classification problem using widely-used ML libraries and an ML workflow.


# Mine Detection (Revisited)

In this notebook, you will revisit [a previously seen classification problem](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons), and see if you can build a better classification model that can predict whether or not a sonar signature is from a mine or a rock.

<div class="alert alert-block alert-warning">
<b>Tip:</b> We suggest reviewing your Notebook 4: Classification with Perceptrons.
</div>

We'll use a version of the [sonar data set](https://www.openml.org/search?type=data&sort=runs&id=40&status=active) by Gorman and Sejnowski. Take a moment now to [reacquaint yourself with the subject matter of this data set](https://datahub.io/machine-learning/sonar%23resource-sonar), and look at the details of the version of this data set, [Mines vs Rocks, hosted on Kaggle](https://www.kaggle.com/datasets/mattcarter865/mines-vs-rocks).

Similar to [a previous notebook](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons), this notebook expects each student to implement the ML workflow steps. We will get you started by providing the first step, loading the data, and providing some landmarks and tips below. Your process should demonstrate:

1. Loading the data
2. Exploring the data
3. Preprocessing the data
4. Preparing the training and test sets
5. Creating and configuring a sklearn.svm.SVC
6. Training the SVM
7. Validating and Testing the SVM
8. Demonstrating making predictions
9. Evaluate (and Improve) the results

Can you train a classifier that can predict whether a sonar signature is from a mine or a rock? "Three trained human subjects were each tested on 100 signals, chosen at random from the set of 208 returns used to create this data set. Their responses ranged between 88% and 97% correct." Can your classifier outperform the human subjects?

Most importantly, how does the performance of the SVM classifier compare to the perceptron results observed in [Notebook 4](https://www.kaggle.com/code/bakosy/cs-513-notebook-4-classification-with-perceptrons)?



## Step 1: Load the Data

The notebook comes pre-bundled with the [Mines vs Rocks data set](https://www.kaggle.com/datasets/mattcarter865/mines-vs-rocks). Our first step is to create a pandas DataFrame from the CSV file. Note that the CSV file has no header row. Loading the CSV file into a DataFrame will make it easy for us to explore the data, preprocess it, and split it into training and test sets.


In [1]:
import pandas as pd

sonar_csv_path = "../input/mines-vs-rocks/sonar.all-data.csv"
sonar_data = pd.read_csv(sonar_csv_path, header=None)



We now have a pandas DataFrame encapsulating the sonar data, and can proceed with our data exploration.
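As a quick illustration of why `header=None` matters for a headerless CSV, here is a minimal sketch. The two rows of text are made up for this example and are not taken from the sonar file; only the layout (features followed by a label) mirrors it.

```python
import io
import pandas as pd

# Two made-up rows in a headerless layout: feature values, then a label.
csv_text = "0.02,0.04,R\n0.05,0.06,M\n"

# header=None keeps the first row as data and assigns integer column names.
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.shape)
print(list(df.columns))
```

Without `header=None`, pandas would treat the first data row as column names and the DataFrame would be one row short.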

## Step 2: Explore the Data

Replace this text with a prose description of what you will do in this section, and why you have chosen to do this. We *strongly* recommend using **multiple markdown and code cells** to illustrate each exploration step you take.

You are welcome to use your work from Notebook 4 here, if it meets the expectations and requirements.

First, we begin by **looking at the first few rows of data**.

In [2]:
sonar_data.head(n=10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R
5,0.0286,0.0453,0.0277,0.0174,0.0384,0.099,0.1201,0.1833,0.2105,0.3039,...,0.0045,0.0014,0.0038,0.0013,0.0089,0.0057,0.0027,0.0051,0.0062,R
6,0.0317,0.0956,0.1321,0.1408,0.1674,0.171,0.0731,0.1401,0.2083,0.3513,...,0.0201,0.0248,0.0131,0.007,0.0138,0.0092,0.0143,0.0036,0.0103,R
7,0.0519,0.0548,0.0842,0.0319,0.1158,0.0922,0.1027,0.0613,0.1465,0.2838,...,0.0081,0.012,0.0045,0.0121,0.0097,0.0085,0.0047,0.0048,0.0053,R
8,0.0223,0.0375,0.0484,0.0475,0.0647,0.0591,0.0753,0.0098,0.0684,0.1487,...,0.0145,0.0128,0.0145,0.0058,0.0049,0.0065,0.0093,0.0059,0.0022,R
9,0.0164,0.0173,0.0347,0.007,0.0187,0.0671,0.1056,0.0697,0.0962,0.0251,...,0.009,0.0223,0.0179,0.0084,0.0068,0.0032,0.0035,0.0056,0.004,R


We can explore the data in a similar manner with the `describe` function to get summary statistics.


In [3]:
sonar_data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
count,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,...,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0
mean,0.029164,0.038437,0.043832,0.053892,0.075202,0.10457,0.121747,0.134799,0.178003,0.208259,...,0.016069,0.01342,0.010709,0.010941,0.00929,0.008222,0.00782,0.007949,0.007941,0.006507
std,0.022991,0.03296,0.038428,0.046528,0.055552,0.059105,0.061788,0.085152,0.118387,0.134416,...,0.012008,0.009634,0.00706,0.007301,0.007088,0.005736,0.005785,0.00647,0.006181,0.005031
min,0.0015,0.0006,0.0015,0.0058,0.0067,0.0102,0.0033,0.0055,0.0075,0.0113,...,0.0,0.0008,0.0005,0.001,0.0006,0.0004,0.0003,0.0003,0.0001,0.0006
25%,0.01335,0.01645,0.01895,0.024375,0.03805,0.067025,0.0809,0.080425,0.097025,0.111275,...,0.008425,0.007275,0.005075,0.005375,0.00415,0.0044,0.0037,0.0036,0.003675,0.0031
50%,0.0228,0.0308,0.0343,0.04405,0.0625,0.09215,0.10695,0.1121,0.15225,0.1824,...,0.0139,0.0114,0.00955,0.0093,0.0075,0.00685,0.00595,0.0058,0.0064,0.0053
75%,0.03555,0.04795,0.05795,0.0645,0.100275,0.134125,0.154,0.1696,0.233425,0.2687,...,0.020825,0.016725,0.0149,0.0145,0.0121,0.010575,0.010425,0.01035,0.010325,0.008525
max,0.1371,0.2339,0.3059,0.4264,0.401,0.3823,0.3729,0.459,0.6828,0.7106,...,0.1004,0.0709,0.039,0.0352,0.0447,0.0394,0.0355,0.044,0.0364,0.0439


Let's see the number of rows and columns.

In [4]:
sonar_data.shape


(208, 61)

Let's check for missing data.

In [5]:
sonar_data.isna().sum().sum()

0

Great, no missing data. Let's see what the range of the data is.

In [6]:
print(f"The min value in sonar_data is {min(sonar_data.describe().loc['min'])}\nThe max value in sonar_data is:{max(sonar_data.describe().loc['max'])}")

The min value in sonar_data is 0.0
The max value in sonar_data is:1.0


We can see this dataset has 61 columns: 60 numerical features (the independent variables) and 1 label column (the dependent variable) holding “R” for rock or “M” for mine. There are 208 rows of data. All feature values lie between 0 and 1, so the features appear to be already normalized, and there is no missing data.
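One further exploration step worth considering is checking how balanced the two classes are, since a strong imbalance would change how we read accuracy later. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the actual sonar data) with the label in the last column, as in `sonar_data`.

```python
import pandas as pd

# Toy stand-in for sonar_data: two feature columns plus a label column.
# The values are made up for illustration only.
toy = pd.DataFrame({0: [0.02, 0.05, 0.03, 0.01, 0.08],
                    1: [0.04, 0.05, 0.06, 0.02, 0.07],
                    2: ["R", "R", "M", "M", "M"]})

# Count how many rows belong to each class in the last column.
class_counts = toy.iloc[:, -1].value_counts()
print(class_counts)

# Summarize the numeric columns only; the label column is excluded.
numeric_summary = toy.select_dtypes("number").describe()
print(numeric_summary.loc[["min", "max"]])
```

The same two calls should work unchanged on `sonar_data`, assuming the label is still the last column.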

## Step 3: Preprocess the Data
The data is already normalized with no missing values, so a degree of preprocessing has already occurred. Important additional steps include splitting the columns into the features (X) and the labels (y), and then scaling the features with z-score normalization using the scikit-learn library. We could also convert the strings "R" and "M" to numeric values, for example 1 for rock (R) and -1 for mine (M), although scikit-learn's `SVC` accepts string labels directly, so the labels are left as strings below.
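If numeric labels are wanted, the mapping described above could be sketched as follows. The `labels` Series and the `label_map` name are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy label column standing in for the last column of sonar_data.
labels = pd.Series(["R", "M", "M", "R"])

# Map each string label to a number: 1 for rock, -1 for mine.
label_map = {"R": 1, "M": -1}
y_numeric = labels.map(label_map)
print(y_numeric.tolist())  # [1, -1, -1, 1]
```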

Let's split the data.

In [7]:
# Separate features and target columns
X = sonar_data.iloc[:, :-1]
y = sonar_data.iloc[:, -1]
print("Sanity check:")
print("X.shape:",X.shape)
print("y.shape:",y.shape)


Sanity check:
X.shape: (208, 60)
y.shape: (208,)


Now that we have split the data, let's scale the features with z-score normalization.

In [8]:
from sklearn.preprocessing import StandardScaler

# Create a scaled copy of X using StandardScaler and fit_transform (z-score normalization)
X_scale = StandardScaler().fit_transform(X)

Let's look at the original data again.

In [9]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0033,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0241,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0156,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094


Let's look at the scaled data.

In [10]:
X_scale[0:4]

array([[-0.39955135, -0.04064823, -0.02692565, -0.71510513,  0.36445605,
        -0.10125288,  0.52163841,  0.29784323,  1.12527153,  0.02118591,
        -0.56738192, -0.65854108, -0.35204302, -1.41437288, -1.24041609,
        -0.65141323, -0.40291277, -0.5842021 ,  0.01161165, -0.31809184,
        -0.11959712, -0.45902868, -0.85816473, -0.49322534, -0.01769506,
        -0.24662866,  0.03364482,  0.48168725,  0.15448626, -0.8865206 ,
        -1.75089006, -0.83977659,  0.46054842,  1.52357887,  1.78380502,
         1.76803946,  1.27600761,  1.27102447,  0.84846088, -0.20651076,
        -1.39574065,  0.03033902,  0.25932835,  1.59077057,  0.44206152,
        -0.16488536, -0.20004835,  0.68858804, -0.37997825,  0.87851031,
         0.59528304, -1.11543184, -0.59760446,  0.68089736, -0.29564577,
         1.4816347 ,  1.76378447,  0.06987027,  0.17167808, -0.65894689],
       [ 0.70353822,  0.42163039,  1.05561832,  0.32333027,  0.77767571,
         2.60721675,  1.52262508,  2.51098151,  1.

We can see that we successfully split the initial data set, with the labels now stored in the separate variable y and the features stored in X. We then applied z-score normalization to the features, which we confirmed by comparing the first few rows of X and X_scale. We need to keep this transformation in mind: the model will be built on the transformed data, so raw data cannot be passed to it without applying the same transformation first.
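To make the z-score step concrete, here is a minimal sketch on a tiny made-up matrix (`X_toy` and `new_raw` are hypothetical values, not sonar data). It checks `StandardScaler` against the formula z = (x - mean) / std and shows a new raw sample being passed through the same fitted scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): 4 samples, 2 features.
X_toy = np.array([[0.1, 0.5], [0.2, 0.7], [0.3, 0.9], [0.4, 1.1]])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

# z = (x - mean) / std, computed per column with the population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(np.allclose(X_toy_scaled, manual))

# A new raw sample must go through the same fitted scaler before prediction.
new_raw = np.array([[0.25, 0.8]])
new_scaled = scaler.transform(new_raw)
print(new_scaled)
```

Keeping a reference to the fitted scaler object, rather than calling `StandardScaler().fit_transform(X)` in one expression, is what makes it possible to transform new samples later.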

## Step 4: Prepare the Training and Test Data Sets

We need to prepare the data for training and testing by using the `train_test_split` function. We already separated the features and labels of the data set into separate variables, so we will split these variables accordingly.

In [11]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.2, random_state=42)

Let's do a sanity check and use slicing and `shape` to examine our split for size and randomness before proceeding.


In [12]:
X_train[0:4]

array([[-0.45187181, -0.04368954,  1.34255769,  0.61417836, -0.90768305,
        -0.94584766,  0.33506877,  0.73458488,  0.07872089, -0.2062695 ,
        -0.16703512, -0.7007639 , -0.53906812,  1.14476704,  1.99870086,
         1.73300595,  1.42416855,  0.61280871, -0.32953484, -1.07757393,
        -0.36648896,  0.51955348,  0.85195413,  1.37228427,  0.69689187,
        -0.08521347, -0.02062633, -0.4637177 , -0.6879166 , -0.27621655,
        -0.37650578, -1.31551107, -1.21553597, -1.49308858,  0.0159718 ,
         1.25188163,  1.53798351,  0.05291702, -1.34393564, -0.23680843,
         0.76595565,  0.64820169, -0.01833151, -0.10585083,  0.22125484,
         0.20932255,  0.84786934,  1.26353613,  2.36066913,  1.42134574,
        -0.14764689, -0.42870441, -1.05193401,  0.04930977, -0.88965714,
        -0.44066765,  0.20441707, -0.34845747, -0.185088  , -0.81834326],
       [-0.45623184, -0.1166809 , -0.70514597, -0.77973804, -0.64784187,
         0.99095403,  1.31496496,  0.40732294,  0.

In [13]:
X_train.shape

(166, 60)

In [14]:
X_test[0:4]

array([[ 0.05825262, -0.06497869, -0.58515314, -0.67201652, -0.53416135,
        -0.64566036, -0.4517684 , -0.44261634,  0.63925206,  0.77887342,
        -0.12700044,  0.08500989, -0.22332993, -0.47945489, -0.99692116,
        -1.09779167, -1.24460191, -1.06101333, -1.27837283, -1.56112358,
        -2.16899385, -1.73847198, -0.97396238, -0.43160064, -0.07253736,
         0.20085481,  0.62083423,  0.90642216,  0.93764137,  1.3998491 ,
         1.88321806,  1.94223593,  1.47163553,  0.9396658 ,  0.90027525,
         1.13840281,  1.54550435,  1.35762871,  1.24222617,  1.68709347,
         1.58435394,  1.29636339,  1.84739854,  2.73087609,  2.60120715,
         2.39245146,  3.05783437,  2.99480438,  2.53073983,  1.61207171,
         0.92083661,  0.68462643, -0.52661546, -0.54108732, -0.09764198,
         0.11854758, -0.0728038 , -0.58086178, -0.39590432, -0.8781169 ],
       [ 0.02773236,  0.70143061,  0.55217016,  0.82315809,  1.55719924,
         2.11708042,  1.55507197,  0.80639415, -0.

In [15]:
X_test.shape

(42, 60)

In [16]:
y_test[0:9]

161    M
15     R
73     R
96     R
166    M
9      R
100    M
135    M
18     R
Name: 60, dtype: object

In [17]:
y_test.shape

(42,)

In [18]:
y_train[0:9]

86     R
203    M
67     R
82     R
205    M
194    M
38     R
24     R
60     R
Name: 60, dtype: object

We successfully split the data into training and test sets using an 80:20 split, keeping the features and labels separate. We verified the split and saw that X_train has 166 rows and X_test has 42 rows, and the same division occurred for y_train and y_test. The X data holds the 60 features, and the y data holds only the "R" and "M" labels, as we expect. We should be aware that an 80:20 split might not be optimal.
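The row counts above follow from the arithmetic: 20% of 208 is 41.6, which `train_test_split` rounds up to 42 test rows, leaving 166 for training. The sketch below reproduces those shapes on synthetic data of the same size. The random features, the evenly balanced labels, and the `stratify` argument are assumptions for illustration and are not what this notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the sonar features (208 x 60).
rng = np.random.default_rng(0)
X_demo = rng.random((208, 60))

# Made-up, evenly balanced labels (not the real class balance).
y_demo = np.array(["R", "M"] * 104)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print((y_te == "R").sum(), (y_te == "M").sum())
```

With only 42 test rows, each misclassified sample moves test accuracy by about 2.4 percentage points, so stratifying is one way to keep the small test set representative.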

## Step 5: Instantiate and Configure an SVM

In this step we will create the SVM that we will train on the sonar data. An SVM is a good fit here because this data set has 60 features and the model is an effective classifier for high-dimensional data. We will use scikit-learn functions for this step.

In [19]:
from sklearn.svm import SVC

# generates the model
sonar_SVC = SVC(kernel='rbf', C=1.0, gamma=0.01)

Here we simply created the support vector machine model and set a few key parameters: C to 1.0, the kernel to RBF, and gamma to 0.01. During tuning we will likely want to adjust these hyperparameters to improve accuracy.
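To give some intuition for what gamma does, the RBF kernel scores the similarity of two points as K(a, b) = exp(-gamma * ||a - b||^2). The sketch below computes this by hand for two made-up points and compares it with scikit-learn's `rbf_kernel` helper. The points `a` and `b` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up points in a 3-dimensional feature space.
a = np.array([[1.0, 0.0, 2.0]])
b = np.array([[0.0, 1.0, 2.0]])
gamma = 0.01

# RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2)
sq_dist = np.sum((a - b) ** 2)
k_manual = np.exp(-gamma * sq_dist)
k_sklearn = rbf_kernel(a, b, gamma=gamma)[0, 0]
print(k_manual, k_sklearn)
```

A small gamma keeps the similarity close to 1 even for points that are some distance apart, which is why each training sample has a wider reach.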

## Step 6: Train the SVM
To train the SVM model we will pass the X_train and y_train data from the earlier split to the `fit` function.

In [20]:
# Training the model
sonar_SVC.fit(X_train, y_train)

SVC(gamma=0.01)

This was a simple one-line step. In later steps we will want to remember that if we tune the hyperparameters of the model, we will need to refit it as well.

## Step 7: Validate and Test the SVM

Here we will validate and test our model to make sure it produces sensible outputs and see how accurate they are given the initial C, kernel, and gamma hyperparameters.

Let's examine the accuracy of the SVM on its training set.

In [21]:
#accuracy check
accuracy_train = sonar_SVC.score(X_train, y_train)

print(f"Training set accuracy: {accuracy_train*100:0.2f}%")

Training set accuracy: 95.78%


Let's examine the accuracy of the SVM on its test set.


In [22]:
#accuracy check
accuracy_test = sonar_SVC.score(X_test, y_test)

print(f"Test set accuracy: {accuracy_test*100:0.2f}%")

Test set accuracy: 85.71%


We can see that the model is functioning well with the initial parameters, classifying the training set with an accuracy above 95% and the test set with an accuracy of roughly 86%. We'll want to keep in mind that training set accuracy is less informative than test set accuracy about how well the model will perform on unseen data.
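Accuracy alone does not show which class the errors fall in, which matters when a missed mine is costlier than a false alarm. A confusion matrix is one way to see this. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not this notebook's results).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels, for illustration only.
y_true = ["M", "R", "R", "M", "M", "R"]
y_pred = ["M", "R", "M", "M", "R", "R"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["M", "R"])
acc = accuracy_score(y_true, y_pred)
print(cm)
print(acc)
```

In this notebook the equivalent call would presumably be `confusion_matrix(y_test, sonar_SVC.predict(X_test), labels=["M", "R"])`.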



## Step 8: Demonstrate Making Predictions

Replace this text with a prose description of what you will do in this section, and why you have chosen to do this. Demonstrate how you can use the SVM to make new predictions. Feel free to use arbitrary examples from the training set to demonstrate this. Show the predicted class label(s).

Let's make some predictions on a subset of the test data, where R = Rock and M = Mine.

In [23]:
#generate prediction of subset of tests
results = sonar_SVC.predict(X_test[0:9])
results

array(['M', 'R', 'R', 'R', 'M', 'R', 'M', 'M', 'R'], dtype=object)

Let's see what the actual values were.

In [24]:
print(y_test[0:9])

161    M
15     R
73     R
96     R
166    M
9      R
100    M
135    M
18     R
Name: 60, dtype: object


Here we showed the sonar_SVC model making predictions on a small subset of the test data. Visually comparing the predicted results to the actual values, we can see that the model was correct on all 9 predictions. That being said, this is a very small subset, and it cannot be inferred that the model has 100% accuracy.

## Step 9: Evaluate (and Improve?)

The classifier performed well, with an initial test set accuracy of 85.7%. The model was configured with the RBF kernel to allow for nonlinear decision boundaries. The gamma value was set to 0.01, which increases the reach of each training record and reduces overfitting. The C=1 parameter controls how much the model “cares” about misclassification: a larger C penalizes misclassified training points more heavily, at the cost of a narrower margin.

The initial configuration of the SVM model performed similarly to, but not quite as well as, the perceptron model from Notebook 4, which reached 88% accuracy on the same data set. Optimizing the model by setting C = 10 while keeping gamma at 0.01 and the RBF kernel increased the accuracy on the training set to 100% and on the test set to greater than 95%. This is a notable increase in performance over both the perceptron and the initial SVC. The different possible values of gamma, C, and kernel were tested using the `GridSearchCV` function to find the optimized parameters to use in the model. An accuracy of 95% seems quite good for a classifier on high-dimensional data.



Let's see if we can optimize C, gamma, or the kernel type to improve the model accuracy using the convenient `GridSearchCV` function.

In [25]:
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values for the grid search to try
parameters = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'linear','poly']}

# Set up a cross-validated search over every parameter combination
grid_search_model = GridSearchCV(SVC(), parameters)

# Fit one model per combination on the training data
grid_search_model.fit(X_train, y_train)

# Show the estimator with the best cross-validated score
grid_search_model.best_estimator_

SVC(C=10, gamma=0.01)

It looks like increasing C from 1 to 10 is suggested to improve the model's performance. Let's see:

In [26]:
# Create the optimized model with the parameters found by the grid search
sonar_SVC_OP = SVC(C=10, gamma=0.01)

# Training the optimized model
sonar_SVC_OP.fit(X_train, y_train)

#accuracy check test and train
accuracy_test_OP = sonar_SVC_OP.score(X_test, y_test)
accuracy_train_OP = sonar_SVC_OP.score(X_train, y_train)
print(f"Optimized model train set accuracy: {accuracy_train_OP*100:0.2f}%")
print(f"Optimized model test set accuracy: {accuracy_test_OP*100:0.2f}%")

Optimized model train set accuracy: 100.00%
Optimized model test set accuracy: 95.24%


This indeed improved the model's performance by increasing the training accuracy to 100% and the test accuracy to 95%. It worked!
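Rather than retyping the winning values by hand, the fitted search object can be queried directly, which avoids transcription slips. The sketch below shows this on synthetic data. The `make_classification` data, the smaller grid, and the explicit `cv=5` are assumptions for illustration and do not reproduce this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the scaled sonar features.
X_syn, y_syn = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has already been refit on all
# of X_tr using best_params_, so it can be scored on the test set directly.
best_params = search.best_params_
test_score = search.best_estimator_.score(X_te, y_te)
print(best_params)
print(search.best_score_)
print(test_score)
```

Note that `best_score_` is the mean cross-validated accuracy on the training folds, so it will generally differ from the test set accuracy.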

## Conclusion
In this notebook we went through the machine learning process on the rock and mine data set using a support vector machine model. An important preprocessing step was applying z-score normalization. The data set was split 80:20, and the SVC class from scikit-learn was used to generate an initial model with an accuracy of 85.7% on the test set. Then the C parameter was optimized to 10, resulting in an improved model with a test set accuracy of 95%.

The first notable thing about this example was using z-score normalization, which I found interesting because I've heard of this value and seen it referenced in lab reports and statistics but had never actually applied it to a data set that I've manipulated. I'm impressed at how well support vector machine modeling can work on a 60-dimensional data set, and that this model performed better than the perceptron, which is the basis of the neural nets that are all the rage these days. Finally, I'm continuously impressed by the power of these scikit-learn functions and how much can be accomplished with so few lines of code. It sure beats having to write all this out by hand.

The model was optimized using the `GridSearchCV` function to find the best combination of gamma, C, and kernel. The same approach was used in the perceptron problem to tune that model's hyperparameters, and it performed well in this example by identifying that increasing C to 10 improves the model's accuracy to 95%.

An additional step that we could take to improve the performance of the classifier would be to reduce the dimensionality of the data set. This worked on the perceptron example, and with 60 features in the data set used in the SVC example, I could see it improving accuracy further. The scikit-learn documentation also lists many other parameters beyond kernel, gamma, and C that could be optimized, but it's not clear whether these would have much influence beyond those three primary parameters.
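As one possible way to try the dimensionality reduction idea, the sketch below chains scaling, PCA, and an SVC in a scikit-learn pipeline. PCA is our choice for illustration and is not necessarily the technique used in Notebook 4. The synthetic data and `n_components=20` are arbitrary assumptions, and no accuracy claim is implied.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with the same shape as the sonar features (208 x 60).
X_syn, y_syn = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

# Scale, project onto 20 principal components (arbitrary choice), then classify.
pca_svc = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    SVC(kernel="rbf", C=10, gamma=0.01),
)
pca_svc.fit(X_syn, y_syn)

# Apply only the preprocessing steps to inspect the reduced feature matrix.
reduced = pca_svc[:-1].transform(X_syn)
predictions = pca_svc.predict(X_syn)
print(reduced.shape)
```

Because the scaler and PCA live inside the pipeline, the number of components could also be added to a grid search alongside C and gamma.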
 

