# Using a KNN Classifier to Differentiate Water Mines from Rocks

## Introduction

### The Sonar system
Sonar is a system that uses sound waves to detect objects underwater. (oceanservice.noaa.gov) It is mostly used to locate underwater hazards to navigation, to search for and map objects on the seafloor such as shipwrecks, and to map the seafloor itself. (oceanservice.noaa.gov) What makes these sound wave signals interesting for our project is that the pattern of a signal bouncing off a metal cylinder differs from that of a signal bouncing off a similarly shaped rock. This property can be used to detect hidden mines in a minefield.



### The Data Structure of the Presented Sonar Signals

Gorman and Sejnowski reported a data set that contains sonar signals from mine and rock samples. A beam of sonar waves was sent to the target. The reflected signal was processed by the sonar system and recorded in a matrix. The spectrum was divided into 60 bands, and the intensity of the signal in each frequency band was recorded as a value between 0 and 1.

Each of these sets of frequencies is tied to a label which is either “M” for metal cylinder or “R” for roughly cylindrical rock. 
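
As a rough illustration of this layout, the sketch below describes one observation as 60 intensity values plus a label and checks that a record fits that shape. The column names `Freq_1` to `Freq_60` and `Label` follow the table shown later in this notebook; the helper `is_valid_record` and the example values are hypothetical.

```python
# Column layout of one observation: 60 band intensities plus a class label
FEATURE_COLUMNS = [f"Freq_{i}" for i in range(1, 61)]
VALID_LABELS = {"M", "R"}  # "M" = metal cylinder (mine), "R" = rock


def is_valid_record(record):
    """Return True if a record has 60 intensities in [0, 1] and a known label."""
    intensities_ok = all(
        0.0 <= record.get(col, -1.0) <= 1.0 for col in FEATURE_COLUMNS
    )
    return intensities_ok and record.get("Label") in VALID_LABELS


# Hypothetical record with made-up intensities (not taken from the data set)
example = {col: 0.05 for col in FEATURE_COLUMNS}
example["Label"] = "R"

print(len(FEATURE_COLUMNS), is_valid_record(example))
```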

### Introduction to Knn classifier model

To do

In this project, we will train a model to determine whether a sonar signal was bounced off a metal cylinder or a roughly cylindrical rock.
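
A K-nearest-neighbors classifier labels a new observation by looking at the K closest training observations and taking a majority vote. The block below is a minimal sketch of that idea, assuming Euclidean distance and an unweighted vote. The helper `knn_predict` and the toy two-band numbers are made up for illustration; the project itself relies on scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy two-band "signals" (made-up numbers, not taken from the sonar data)
X_toy = [[0.10, 0.20], [0.15, 0.25], [0.80, 0.90], [0.85, 0.75], [0.90, 0.80]]
y_toy = ["R", "R", "M", "M", "M"]

prediction = knn_predict(X_toy, y_toy, [0.82, 0.85], k=3)
print(prediction)
```

With real sonar data each observation has 60 values instead of two, but the distance and voting steps stay the same.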

In [3]:
# set up the environment
import random

import altair as alt
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Methods:

In this project, we will train a mine detecting model through the following stages:

  - Exploratory data analysis
  - Using GridSearchCV to select the best K value
  - Model training
  - Model performance analysis

First, the data set is loaded and cleaned. Then, we verify that the classes are balanced. Next, the data is split into training and testing sets in a 75-25 ratio. After verifying the balance of the two subsets, we conduct some exploratory visualization to build intuition about the data set.

We will train a KNN model to distinguish mines from rocks. To do so, we perform a grid search with 5-fold cross-validation to find the best value for K (the number of neighbors). Using the best K, we train the KNN classifier and evaluate its performance on the test set.
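
As a rough picture of what 5-fold cross-validation does, the sketch below splits a placeholder label vector into five folds and checks that every row is held out for validation exactly once. The labels (60 of one class, 40 of the other) and the dummy feature matrix are made up. `StratifiedKFold` is used because scikit-learn applies stratified folds when `GridSearchCV` receives an integer `cv` together with a classifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels (made-up class counts); the features are dummy zeros
# because only the fold bookkeeping matters in this sketch.
y_demo = np.array([1] * 60 + [0] * 40)
X_demo = np.zeros((len(y_demo), 1))

folds = StratifiedKFold(n_splits=5)
fold_sizes = []
validated = np.zeros(len(y_demo), dtype=int)
for train_idx, valid_idx in folds.split(X_demo, y_demo):
    fold_sizes.append(len(valid_idx))
    validated[valid_idx] += 1  # count how often each row is held out

print(fold_sizes)
```

Each candidate K is scored on all five held-out folds, and the mean of those five scores is what the grid search compares.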

### Exploratory data analysis:
#### Load Data

The data set is available on Kaggle in .csv format.

To ensure the data can be loaded by any user, the file is hosted on GitHub and accessed through the permanent link to the raw file.

To read the data set, we simply call pd.read_csv. The file uses commas as delimiters, so no extra arguments are needed.

In [38]:
sonar_data = pd.read_csv("https://raw.githubusercontent.com/OminiCarlos/DSCI100_Group_14_Proposal_Mine_Finder/main/sonar.all-data.csv")
sonar_data

Unnamed: 0,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,Freq_10,...,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60,Label
0,0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,0.0187,0.0346,0.0168,0.0177,0.0393,0.1630,0.2028,0.1694,0.2328,0.2684,...,0.0116,0.0098,0.0199,0.0033,0.0101,0.0065,0.0115,0.0193,0.0157,M
204,0.0323,0.0101,0.0298,0.0564,0.0760,0.0958,0.0990,0.1018,0.1030,0.2154,...,0.0061,0.0093,0.0135,0.0063,0.0063,0.0034,0.0032,0.0062,0.0067,M
205,0.0522,0.0437,0.0180,0.0292,0.0351,0.1171,0.1257,0.1178,0.1258,0.2529,...,0.0160,0.0029,0.0051,0.0062,0.0089,0.0140,0.0138,0.0077,0.0031,M
206,0.0303,0.0353,0.0490,0.0608,0.0167,0.1354,0.1465,0.1123,0.1945,0.2354,...,0.0086,0.0046,0.0126,0.0036,0.0035,0.0034,0.0079,0.0036,0.0048,M


#### Data Cleaning and Wrangling
After a thorough inspection, we can say that the data are already in a tidy format: each column is a variable, each row is one observation, and each cell holds a single value.

#### Summary Statistics
To understand the content of the data set, we first take a look at the sample size.

In [5]:
nb_observations = sonar_data.shape[0]
nb_observations

208

208 is not a very big sample size, but it's enough to train our model. 


Secondly, we count the samples in each class to verify that the data set is balanced. With imbalanced data, the dominant class tends to win the majority vote among neighbors even when it should not, which biases the model toward the majority class and hurts its accuracy.

In [6]:
sample_breakdown = sonar_data["Label"].value_counts(normalize = True)
sample_breakdown

M    0.533654
R    0.466346
Name: Label, dtype: float64

We have about 53% mine samples and 47% rock samples. Since the two classes are roughly the same size, it is safe to say the data set is balanced. Therefore, we do not need to resample our data set.
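
One way to make "balanced" concrete is to compute the accuracy of a classifier that always answers with the larger class. The sketch below does this from the sample size and proportions reported above; the class counts are derived from those numbers rather than read from the file.

```python
n_total = 208
# Class counts implied by the proportions above (0.533654 and 0.466346 of 208)
n_mine = round(0.533654 * n_total)
n_rock = n_total - n_mine

# Accuracy of a classifier that always answers with the larger class
baseline_accuracy = max(n_mine, n_rock) / n_total
print(n_mine, n_rock, round(baseline_accuracy, 3))
```

A baseline this close to 50% means neither class can dominate simply by being more common, which is consistent with skipping resampling.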

All 60 predictors are the same kind of measurement, signal intensity in a frequency band, and they share the same 0-to-1 scale, so no predictor dominates the distance calculation merely because of its units. Therefore, we choose not to standardize the data set.
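
If one wanted to check this choice empirically, a possible approach is to compare cross-validated recall for a plain KNN classifier and for a pipeline that standardizes first. The sketch below shows the mechanics on synthetic data only: the feature matrix, the labelling rule, and the resulting scores are made up and say nothing about the sonar data itself.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sonar table: 200 rows, 60 features in [0, 1]
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 60))
y_demo = (X_demo[:, :10].mean(axis=1) > 0.5).astype(int)  # made-up labelling rule

raw_knn = KNeighborsClassifier()
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_recall = cross_val_score(raw_knn, X_demo, y_demo, cv=5, scoring="recall").mean()
scaled_recall = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring="recall").mean()
print(f"raw: {raw_recall:.3f}, scaled: {scaled_recall:.3f}")
```

Swapping `X_demo` and `y_demo` for the real training predictors and numeric labels would give the comparison for this project.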

#### Separate Test and Training Data
Now that we have checked the data set, it's time to split it into training and testing sets.

In [8]:
# set random state
np.random.seed(1024)

sonar_train, sonar_test = train_test_split(
    sonar_data, train_size=0.75, stratify=sonar_data["Label"]
)

sonar_train

Unnamed: 0,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,Freq_10,...,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60,Label
63,0.0067,0.0096,0.0024,0.0058,0.0197,0.0618,0.0432,0.0951,0.0836,0.1180,...,0.0048,0.0023,0.0020,0.0040,0.0019,0.0034,0.0034,0.0051,0.0031,R
87,0.0856,0.0454,0.0382,0.0203,0.0385,0.0534,0.2140,0.3110,0.2837,0.2751,...,0.0172,0.0138,0.0079,0.0037,0.0051,0.0258,0.0102,0.0037,0.0037,R
158,0.0107,0.0453,0.0289,0.0713,0.1075,0.1019,0.1606,0.2119,0.3061,0.2936,...,0.0164,0.0120,0.0113,0.0021,0.0097,0.0072,0.0060,0.0017,0.0036,M
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
91,0.0253,0.0808,0.0507,0.0244,0.1724,0.3823,0.3729,0.3583,0.3429,0.2197,...,0.0178,0.0073,0.0079,0.0038,0.0116,0.0033,0.0039,0.0081,0.0053,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,0.0459,0.0437,0.0347,0.0456,0.0067,0.0890,0.1798,0.1741,0.1598,0.1408,...,0.0067,0.0032,0.0109,0.0164,0.0151,0.0070,0.0085,0.0117,0.0056,R
200,0.0131,0.0387,0.0329,0.0078,0.0721,0.1341,0.1626,0.1902,0.2610,0.3193,...,0.0150,0.0076,0.0032,0.0037,0.0071,0.0040,0.0009,0.0015,0.0085,M
123,0.0270,0.0163,0.0341,0.0247,0.0822,0.1256,0.1323,0.1584,0.2017,0.2122,...,0.0189,0.0204,0.0085,0.0043,0.0092,0.0138,0.0094,0.0105,0.0093,M
130,0.0443,0.0446,0.0235,0.1008,0.2252,0.2611,0.2061,0.1668,0.1801,0.3083,...,0.0274,0.0205,0.0141,0.0185,0.0055,0.0045,0.0115,0.0152,0.0100,M


In [10]:
train_breakdown = sonar_train["Label"].value_counts(normalize = True)
train_breakdown

M    0.532051
R    0.467949
Name: Label, dtype: float64

In [11]:
test_breakdown = sonar_test["Label"].value_counts(normalize = True)
test_breakdown

M    0.538462
R    0.461538
Name: Label, dtype: float64

The training and test sets are roughly balanced, too. Now we can move on to separating the predictor columns from the label column.

In [21]:
# Separate predictors and labels
X_train = sonar_train.drop('Label', axis=1)
y_train = sonar_train['Label']
y_train

63     R
87     R
158    M
1      R
91     R
      ..
93     R
200    M
123    M
130    M
82     R
Name: Label, Length: 156, dtype: object

In [16]:
# Separate predictors and labels
X_test = sonar_test.drop('Label', axis=1)
y_test = sonar_test['Label']
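
The effect of `stratify` in the split above can be checked on the labels alone. The sketch below uses a placeholder label column with the class counts implied by the full data set (111 "M" and 97 "R", derived from the proportions reported earlier). It passes `random_state` directly instead of calling `np.random.seed`, so the selected rows are not the same as in the cells above; only the sizes and proportions are comparable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder label column with the same class counts as the full data set
labels = pd.Series(["M"] * 111 + ["R"] * 97, name="Label")

train_labels, test_labels = train_test_split(
    labels, train_size=0.75, stratify=labels, random_state=1024
)

print(len(train_labels), len(test_labels))
print(train_labels.value_counts(normalize=True).round(3).to_dict())
print(test_labels.value_counts(normalize=True).round(3).to_dict())
```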

### Exploratory Visualization

We will group the data into two sets, Mine and Rock, to see whether there is a visible pattern in their sonar profiles. First we group the training data by label, then we aggregate it by calculating the mean intensity at each frequency. After that, we plot the mean sonar profile of both classes in a bar plot to compare the intensity at each frequency.

In [12]:
snoar_data_agg = sonar_train.groupby(["Label"]).mean().reset_index()
snoar_data_agg

Unnamed: 0,Label,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,...,Freq_51,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60
0,M,0.034478,0.046443,0.054616,0.068734,0.089508,0.114306,0.127623,0.14906,0.220994,...,0.019617,0.016706,0.01198,0.012099,0.009931,0.008971,0.007801,0.008966,0.008417,0.006922
1,R,0.023474,0.029532,0.035075,0.039905,0.062152,0.098152,0.117866,0.121126,0.140244,...,0.012119,0.010173,0.009723,0.008992,0.008041,0.007697,0.007526,0.00667,0.006537,0.006077
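
To make the comparison in this table concrete, the sketch below takes the nine class means that are visible for `Freq_1` to `Freq_9` and computes the gap between mines and rocks at each band. Only the values shown above are used; the columns hidden behind the ellipsis are not reconstructed.

```python
import pandas as pd

# Class means for the first nine bands, copied from the table above
means = pd.DataFrame(
    {
        "M": [0.034478, 0.046443, 0.054616, 0.068734, 0.089508,
              0.114306, 0.127623, 0.14906, 0.220994],
        "R": [0.023474, 0.029532, 0.035075, 0.039905, 0.062152,
              0.098152, 0.117866, 0.121126, 0.140244],
    },
    index=[f"Freq_{i}" for i in range(1, 10)],
)

# Gap between the mine and rock mean at each band
means["gap"] = means["M"] - means["R"]
largest_gap_band = means["gap"].idxmax()
print(means["gap"].round(4).to_dict())
print(largest_gap_band)
```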


We can see that the mean signal of the mines differs from that of the rocks. For example, the mean intensity at Freq_9 is noticeably higher for mines (about 0.22) than for rocks (about 0.14). Now it's time to visualize the data.

In [13]:
# sort columns
sorted_cols = ['Label'] + sorted(snoar_data_agg.columns[1:], key=lambda x: int(x.split('_')[1]))

# melt data into long format
sonar_agg_melt = snoar_data_agg.melt(id_vars = "Label",
                                     var_name="Frequency",
                                     value_name = "Intensity"
                                    )
sonar_agg_melt

Unnamed: 0,Label,Frequency,Intensity
0,M,Freq_1,0.034478
1,R,Freq_1,0.023474
2,M,Freq_2,0.046443
3,R,Freq_2,0.029532
4,M,Freq_3,0.054616
...,...,...,...
115,R,Freq_58,0.006670
116,M,Freq_59,0.008417
117,R,Freq_59,0.006537
118,M,Freq_60,0.006922


In [14]:
agg_bar_plot = alt.Chart(sonar_agg_melt).mark_bar().encode(
    x = alt.X("Frequency", sort=sorted_cols[1:]),
    y = alt.Y("Intensity"),
    color = "Label"
)

agg_bar_plot

As can be seen in the bar plot, the rock signal is generally weaker than the mine signal. The difference is most obvious in the region between Freq_18 and Freq_31. This pattern tells us that it is possible to distinguish a rock from a mine using the sonar signal.

### Finding the Best K for the Classifier
Now it's time to search for the best parameter K for our KNN model. To do this, we first need to specify the arguments of GridSearchCV.

In [26]:
# Convert M (Mine) to 1 and R (Rock) to 0, so the data can be used by the model.
y_train = y_train.replace({'M': 1, 'R': 0})
y_test = y_test.replace({'M': 1, 'R': 0})

In [27]:
# Specify the estimator. 
# Since all predictors share the same 0-to-1 scale, we don't standardize them
knn_spec =  KNeighborsClassifier()
display (knn_spec.get_params()) # get the parameter that specifies the No. of neighbors.

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

We will create a search grid by plugging K = 2 to 20 into the 'n_neighbors' parameter.

Then we will perform grid search with 5-fold cross-validation to find the best value for K (number of neighbors).

Here we choose **recall** as our score, because we want our model to produce as few false negatives as possible. The reason is that water mines are dangerous: a mine that we fail to detect could damage ships and even cost lives. A false positive, on the other hand, only costs people the time needed to verify whether the object is really a mine, which is acceptable.

Lastly, we will report the performance of the model in a dataframe.

In [28]:
#set the parameter grid.
param_grid = {
    'n_neighbors': range(2, 21)
}

# Perform grid search with 5-fold cross-validation to find the best value for k (number of neighbors)
mine_detector_grid = GridSearchCV(estimator = knn_spec,
                                  param_grid = param_grid,
                                  cv=5,
                                  return_train_score=True,
                                  scoring='recall')

mine_detector_grid.fit(X_train, y_train)

# Collect the cross-validation results of the fitted search in a dataframe
CV_results = pd.DataFrame(mine_detector_grid.cv_results_)
CV_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.002997,0.000396,0.005378,0.000243,2,{'n_neighbors': 2},0.647059,0.6875,0.875,0.823529,...,0.759559,0.084017,11,0.848485,0.910448,0.835821,0.818182,0.878788,0.858345,0.032712
1,0.003133,0.000397,0.005887,0.000876,3,{'n_neighbors': 3},0.764706,0.8125,0.9375,0.941176,...,0.879412,0.075682,1,0.878788,0.940299,0.895522,0.924242,0.924242,0.912619,0.022241
2,0.008446,0.011055,0.005346,0.00048,4,{'n_neighbors': 4},0.588235,0.625,0.9375,0.823529,...,0.771324,0.139714,8,0.833333,0.865672,0.880597,0.893939,0.863636,0.867436,0.020279
3,0.003445,0.000465,0.006304,0.000757,5,{'n_neighbors': 5},0.764706,0.75,0.9375,0.882353,...,0.843382,0.07322,2,0.878788,0.865672,0.910448,0.939394,0.909091,0.900678,0.025961
4,0.003286,0.000423,0.006063,0.000704,6,{'n_neighbors': 6},0.588235,0.6875,0.75,0.882353,...,0.746324,0.102889,13,0.787879,0.820896,0.791045,0.848485,0.833333,0.816327,0.023633
5,0.002771,3.3e-05,0.005215,9e-05,7,{'n_neighbors': 7},0.705882,0.75,0.9375,0.882353,...,0.819853,0.084415,5,0.863636,0.850746,0.880597,0.893939,0.893939,0.876572,0.017051
6,0.002768,3.4e-05,0.005265,5.2e-05,8,{'n_neighbors': 8},0.470588,0.75,0.875,0.764706,...,0.701471,0.136186,18,0.69697,0.820896,0.835821,0.818182,0.818182,0.79801,0.050944
7,0.00285,0.000145,0.005388,0.000247,9,{'n_neighbors': 9},0.588235,0.8125,0.9375,0.882353,...,0.820588,0.122761,4,0.818182,0.850746,0.880597,0.909091,0.818182,0.85536,0.035524
8,0.002764,3.5e-05,0.00532,0.000251,10,{'n_neighbors': 10},0.529412,0.8125,0.9375,0.882353,...,0.797059,0.141119,6,0.69697,0.820896,0.791045,0.787879,0.787879,0.776934,0.04187
9,0.002908,0.000299,0.005162,7.2e-05,11,{'n_neighbors': 11},0.647059,0.8125,0.9375,0.882353,...,0.832353,0.100781,3,0.757576,0.835821,0.880597,0.863636,0.833333,0.834193,0.042175


With the result in hand, we visualize the model's performance with different K in a line chart, to help us find the best K. 

In [29]:
recall_plot = alt.Chart(CV_results).mark_line(point=True).encode(
    x = alt.X("param_n_neighbors"),
    y = alt.Y("mean_test_score")
)
recall_plot 

The mean cross-validated recall is highest at K = 3 (about 0.88), so we will use this value to train our model. Other K values give us a mean recall below 85%, which we consider unacceptable.
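
The same choice can be read off programmatically. The sketch below copies the `mean_test_score` values for K = 2 to 11 from the rows displayed above (the rows for larger K are not shown in the output, so they are left out) and picks the K with the highest mean recall. On the fitted search object itself, `mine_detector_grid.best_params_` reports the selected value directly.

```python
import pandas as pd

# Mean cross-validated recall for K = 2..11, copied from the rows shown above
cv_recall = pd.Series(
    [0.759559, 0.879412, 0.771324, 0.843382, 0.746324,
     0.819853, 0.701471, 0.820588, 0.797059, 0.832353],
    index=range(2, 12),
    name="mean_test_score",
)

best_k = cv_recall.idxmax()
others_below_threshold = bool((cv_recall.drop(best_k) < 0.85).all())
print(best_k, cv_recall[best_k], others_below_threshold)
```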

In [32]:
# Set the random seed.
np.random.seed(2023)

mine_detector_spec = KNeighborsClassifier( n_neighbors = 3)
mine_detector_fit = mine_detector_spec.fit (X_train,y_train)

The model is trained. Now we are going to use the test data set to evaluate the model's performance. We will compare the prediction and the true value side-by-side.

In [33]:
Mine_detector_reality_check = pd.DataFrame({"True Label": y_test,
                                            "Prediction": mine_detector_fit.predict(X_test)})
Mine_detector_reality_check

Unnamed: 0,True Label,Prediction
145,1,0
134,1,1
177,1,0
31,0,0
15,0,0
147,1,1
14,0,0
106,1,1
207,1,0
16,0,0


In several instances the model gives an incorrect label. We will take a closer look by constructing a confusion matrix.

In [34]:
mine_conf_mat = pd.crosstab(Mine_detector_reality_check["True Label"],
                            Mine_detector_reality_check["Prediction"])

mine_conf_mat

True Label,Predicted 0,Predicted 1
0,19,5
1,6,22


Recalling that 0 is Rock and 1 is Mine, we move on to report the precision and recall of our model.

$\mathrm{precision} = \frac{\mathrm{number \; of  \; correct \; positive \; predictions}}{\mathrm{total \;  number \;  of \; positive  \; predictions}}$

$\mathrm{recall} = \frac{\mathrm{number \; of  \; correct  \; positive \; predictions}}{\mathrm{total \;  number \;  of  \; positive \; test \; set \; observations}}$


In [36]:
Mine_precision= 22/ (5+22)
Mine_precision

0.8148148148148148

In [37]:
Mine_recall = 22/ (6+22)
Mine_recall

0.7857142857142857
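
The hand-computed values above can be cross-checked with scikit-learn's metric functions. The sketch below rebuilds label vectors from the four confusion-matrix counts (an assumption made so the block is self-contained) and also reports accuracy. In the notebook itself, `y_test` and the model's predictions could be passed to the same functions directly.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Rebuild label vectors from the confusion-matrix counts above (0 = Rock, 1 = Mine)
y_true = [0] * 19 + [0] * 5 + [1] * 6 + [1] * 22
y_pred = [0] * 19 + [1] * 5 + [0] * 6 + [1] * 22

conf_mat = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # Mine (1) is the positive class
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print(conf_mat)
print(precision, recall, accuracy)
```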

## Discussion

With this analysis, we have trained a KNN model that can distinguish water mines from rocks based on the signal of a sonar system. We have determined that, for the sake of public safety, recall should be favored when training the model. With a grid search, we conclude that the model produces the highest cross-validated recall with K = 3. On the test set, this classifier achieved a recall of 78.6% and a precision of 81.5%, which is a good start. However, about one in five mines in the test set was still missed, so there is a lot of room for improvement.

## For Winnie and Farbod

- discuss whether this is what you expected to find?
    We expect to use our classifier to predict whether the object underwater is an explosive mine or a rock.
- discuss what impact could such findings have?
    This model allows us to automate water mine detection. With such a model, underwater robots can locate and flag mines automatically and search a large area of water thoroughly. This is hard for humans to do, because it is expensive to train and hire human analysts. Automated detection can save the lives of civilians, especially fishermen. It is also important for national defense, because it can protect expensive battleships.
    
- discuss what future questions could this lead to?
    This analysis could lead to future questions such as whether there is a way to train the model with fewer parameters and, if so, how to find them, whether the model can detect mines of other shapes, and whether we can develop a model that can label the type of the mine.


## References

Gorman, R. Paul and Terrence J. Sejnowski. “Analysis of hidden units in a layered network trained to classify sonar targets.” Neural Networks 1 (1988): 75-89.