# Using a KNN Classifier to Differentiate Water Mines from Rocks

## Introduction

- Sonar is a system that uses sound waves to detect objects underwater. (oceanservice.noaa.gov) It is mostly used to locate underwater hazards to navigation, to search for and map objects on the seafloor such as shipwrecks, and to map the seafloor itself. (oceanservice.noaa.gov) What makes these signals useful for our project is that sonar returns bouncing off metal cylinders have a different frequency profile from returns bouncing off rocks. The data were obtained by recording signals at different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.
- In this project, we will train a classifier to determine whether a sonar signal was bounced off a metal cylinder or a roughly cylindrical rock.
- The data set has 60 feature columns, each with values ranging from 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. Each set of 60 values is tied to a label, which is either “M” for metal cylinder or “R” for roughly cylindrical rock. We will use these data to train the classifier to recognize whether a signal comes from a metal cylinder or a roughly cylindrical rock.

In [1]:
# set up the environment
import random

import altair as alt
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

**Preliminary exploratory data analysis:**

- The data set is available on Kaggle in .csv format.
- To ensure the data can be loaded by any user, the file is hosted on GitHub and accessed through the permanent link to the raw file.
- To read the data set, we use pd.read_csv. The file is comma-delimited, so no extra parsing options are needed.

In [2]:
sonar_data = pd.read_csv("https://raw.githubusercontent.com/OminiCarlos/DSCI100_Group_14_Proposal_Mine_Finder/main/sonar.all-data.csv")
sonar_data

Unnamed: 0,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,Freq_10,...,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60,Label
0,0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,0.0187,0.0346,0.0168,0.0177,0.0393,0.1630,0.2028,0.1694,0.2328,0.2684,...,0.0116,0.0098,0.0199,0.0033,0.0101,0.0065,0.0115,0.0193,0.0157,M
204,0.0323,0.0101,0.0298,0.0564,0.0760,0.0958,0.0990,0.1018,0.1030,0.2154,...,0.0061,0.0093,0.0135,0.0063,0.0063,0.0034,0.0032,0.0062,0.0067,M
205,0.0522,0.0437,0.0180,0.0292,0.0351,0.1171,0.1257,0.1178,0.1258,0.2529,...,0.0160,0.0029,0.0051,0.0062,0.0089,0.0140,0.0138,0.0077,0.0031,M
206,0.0303,0.0353,0.0490,0.0608,0.0167,0.1354,0.1465,0.1123,0.1945,0.2354,...,0.0086,0.0046,0.0126,0.0036,0.0035,0.0034,0.0079,0.0036,0.0048,M


After a thorough inspection, we can say that the data are already in a clean format. Each column is a variable, each row is one observation, and each cell is a single value.
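Part of this inspection can be automated. The following is a minimal sketch, not the procedure actually used here: `check_tidy` is a hypothetical helper, and the three-row frame only copies a few values from the table above for illustration. It checks for missing values, confirms that every feature column is numeric, and lists the classes found in the label column.

```python
import pandas as pd


def check_tidy(df: pd.DataFrame, label_col: str = "Label") -> dict:
    """Return a few basic cleanliness checks for a feature/label table."""
    features = df.drop(columns=[label_col])
    return {
        # total number of missing cells in the whole table
        "missing_values": int(df.isna().sum().sum()),
        # True only if every feature column has a numeric dtype
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in features.dtypes),
        # distinct class labels, sorted alphabetically
        "labels": sorted(df[label_col].unique()),
    }


# Illustrative toy frame (a few values copied from the table above), NOT the full data set
toy = pd.DataFrame(
    {
        "Freq_1": [0.0200, 0.0453, 0.0187],
        "Freq_2": [0.0371, 0.0523, 0.0346],
        "Label": ["R", "R", "M"],
    }
)

report = check_tidy(toy)
print(report)
```

On the real data, the same helper could be called as `check_tidy(sonar_data)`.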

To understand the content of the data set, we first take a look at the sample size.

In [3]:
nb_observations = sonar_data.shape[0]
nb_observations

208

208 is not a very big sample size, but it's enough to train our model. 


Second, we check the number of samples in each class to verify that the data set is balanced. With imbalanced data, the dominant class tends to win the majority vote among the nearest neighbors even when it should not. This biases predictions toward the majority class and hurts the model's accuracy.

In [4]:
sample_breakdown = sonar_data["Label"].value_counts(normalize = True)
sample_breakdown

Label
M    0.533654
R    0.466346
Name: proportion, dtype: float64

We have about 53% mine samples and 47% rock samples. Since the two classes are roughly the same size, it is safe to say the data set is balanced. Therefore, we do not need to resample our data set.
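The class proportions also give a useful reference point: the accuracy of a trivial classifier that always predicts the most common class. The sketch below computes that baseline along with a simple imbalance ratio. The counts 111 and 97 are inferred from the proportions and the total of 208 reported above; the variable names are illustrative.

```python
import pandas as pd

# Class counts inferred from the proportions above (0.533654 * 208 and 0.466346 * 208)
labels = pd.Series(["M"] * 111 + ["R"] * 97, name="Label")

proportions = labels.value_counts(normalize=True)

# Accuracy of a trivial classifier that always predicts the most common class
baseline_accuracy = proportions.max()

# Ratio of the larger class to the smaller one (1.0 means perfectly balanced)
imbalance_ratio = proportions.max() / proportions.min()

print(f"Majority-class baseline: {baseline_accuracy:.3f}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

Any trained model should beat this baseline to be considered useful.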

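All features measure the same quantity, the energy in a frequency band, in the same units and within the same 0.0 to 1.0 range, so they are already on a comparable scale. Standardization is therefore not strictly required for this data set. The pipeline in the Methods section nonetheless includes a StandardScaler step.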
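Whether features really share a scale can be checked by comparing their means and standard deviations. As a rough sketch, the example below takes the first four values of Freq_1 and Freq_9 from the table above and shows what `StandardScaler` does to them. Four rows are far too few to draw conclusions, so this only illustrates the mechanics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First four rows of Freq_1 and Freq_9 from the table above (illustration only)
X = np.array(
    [
        [0.0200, 0.3109],
        [0.0453, 0.3337],
        [0.0262, 0.5598],
        [0.0100, 0.0598],
    ]
)

print("Column means before scaling:", X.mean(axis=0))
print("Column std before scaling:  ", X.std(axis=0))

# StandardScaler subtracts each column's mean and divides by its standard deviation
X_scaled = StandardScaler().fit_transform(X)

print("Column means after scaling:", X_scaled.mean(axis=0).round(6))
print("Column std after scaling:  ", X_scaled.std(axis=0).round(6))
```

If the columns differ a lot in spread before scaling, the wider columns dominate the Euclidean distances that KNN relies on.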

Now that we have checked that the data set is sound, it's time to split the data into training and testing sets.

In [5]:
# set random state
np.random.seed(1024)

sonar_data_train, sonar_data_test = train_test_split(
    sonar_data, train_size=0.75, stratify=sonar_data["Label"]
)

sonar_data_train

Unnamed: 0,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,Freq_10,...,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60,Label
63,0.0067,0.0096,0.0024,0.0058,0.0197,0.0618,0.0432,0.0951,0.0836,0.1180,...,0.0048,0.0023,0.0020,0.0040,0.0019,0.0034,0.0034,0.0051,0.0031,R
87,0.0856,0.0454,0.0382,0.0203,0.0385,0.0534,0.2140,0.3110,0.2837,0.2751,...,0.0172,0.0138,0.0079,0.0037,0.0051,0.0258,0.0102,0.0037,0.0037,R
158,0.0107,0.0453,0.0289,0.0713,0.1075,0.1019,0.1606,0.2119,0.3061,0.2936,...,0.0164,0.0120,0.0113,0.0021,0.0097,0.0072,0.0060,0.0017,0.0036,M
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
91,0.0253,0.0808,0.0507,0.0244,0.1724,0.3823,0.3729,0.3583,0.3429,0.2197,...,0.0178,0.0073,0.0079,0.0038,0.0116,0.0033,0.0039,0.0081,0.0053,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,0.0459,0.0437,0.0347,0.0456,0.0067,0.0890,0.1798,0.1741,0.1598,0.1408,...,0.0067,0.0032,0.0109,0.0164,0.0151,0.0070,0.0085,0.0117,0.0056,R
200,0.0131,0.0387,0.0329,0.0078,0.0721,0.1341,0.1626,0.1902,0.2610,0.3193,...,0.0150,0.0076,0.0032,0.0037,0.0071,0.0040,0.0009,0.0015,0.0085,M
123,0.0270,0.0163,0.0341,0.0247,0.0822,0.1256,0.1323,0.1584,0.2017,0.2122,...,0.0189,0.0204,0.0085,0.0043,0.0092,0.0138,0.0094,0.0105,0.0093,M
130,0.0443,0.0446,0.0235,0.1008,0.2252,0.2611,0.2061,0.1668,0.1801,0.3083,...,0.0274,0.0205,0.0141,0.0185,0.0055,0.0045,0.0115,0.0152,0.0100,M
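The effect of `stratify` in the split above can be sanity-checked by comparing class proportions in the two subsets. The sketch below uses a stand-in label column with the same class counts as the data set (111 M, 97 R, inferred from the proportions reported earlier). Passing `random_state` directly is an assumption made here for reproducibility; the cell above seeds NumPy's global generator instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in label column with the same class counts as the data set (111 M, 97 R)
toy = pd.DataFrame({"Label": ["M"] * 111 + ["R"] * 97})

toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Label"], random_state=1024
)

# With stratify, both subsets should keep roughly the original 53/47 class mix
train_share = toy_train["Label"].value_counts(normalize=True)
test_share = toy_test["Label"].value_counts(normalize=True)

print(len(toy_train), len(toy_test))
print(train_share.round(3).to_dict())
print(test_share.round(3).to_dict())
```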


We will group the data into two sets, Mine and Rock, to see whether their sonar profiles show a visible pattern. First we group the training data by label, then we aggregate by calculating the mean value of each frequency band. After that, we plot the sonar profile of both classes as bar plots to compare the intensity at each frequency.

In [6]:
snoar_data_agg = sonar_data_train.groupby(["Label"]).mean().reset_index()
snoar_data_agg

Unnamed: 0,Label,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,...,Freq_51,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60
0,M,0.034478,0.046443,0.054616,0.068734,0.089508,0.114306,0.127623,0.14906,0.220994,...,0.019617,0.016706,0.01198,0.012099,0.009931,0.008971,0.007801,0.008966,0.008417,0.006922
1,R,0.023474,0.029532,0.035075,0.039905,0.062152,0.098152,0.117866,0.121126,0.140244,...,0.012119,0.010173,0.009723,0.008992,0.008041,0.007697,0.007526,0.00667,0.006537,0.006077


We can see that the mine signal differs from the rock signal. For example, the mean intensity of Freq_9 is noticeably higher for mines than for rocks. Now it's time to visualize the data.

In [7]:
# sort columns
sorted_cols = ['Label'] + sorted(snoar_data_agg.columns[1:], key=lambda x: int(x.split('_')[1]))

# melt data into long format
sonar_agg_melt = snoar_data_agg.melt(id_vars = "Label",
                                     var_name="Frequency",
                                     value_name = "Intensity"
                                    )
sonar_agg_melt

Unnamed: 0,Label,Frequency,Intensity
0,M,Freq_1,0.034478
1,R,Freq_1,0.023474
2,M,Freq_2,0.046443
3,R,Freq_2,0.029532
4,M,Freq_3,0.054616
...,...,...,...
115,R,Freq_58,0.006670
116,M,Freq_59,0.008417
117,R,Freq_59,0.006537
118,M,Freq_60,0.006922


In [8]:
agg_bar_plot = alt.Chart(sonar_agg_melt).mark_bar().encode(
    x = alt.X("Frequency", sort=sorted_cols[1:]),
    y = alt.Y("Intensity"),
    color = "Label"
)

agg_bar_plot

As the bar plot shows, the intensity of the rock signal is generally weaker than that of the mine signal. The difference is most visible between Freq_18 and Freq_31. This pattern suggests that it is possible to distinguish a rock from a mine using the sonar signal.
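The visual comparison can be backed by a simple number: the absolute gap between the two class means in each band. The sketch below uses only three bands whose means are copied from the aggregated table above, so it does not reproduce the full ranking. On the real data the same expression could be applied to the whole aggregated frame.

```python
import pandas as pd

# A few class means copied from the aggregated table above (not the full 60 bands)
agg = pd.DataFrame(
    {
        "Label": ["M", "R"],
        "Freq_1": [0.034478, 0.023474],
        "Freq_9": [0.220994, 0.140244],
        "Freq_60": [0.006922, 0.006077],
    }
).set_index("Label")

# Absolute gap between the mine and rock mean in each band, largest first
gap = (agg.loc["M"] - agg.loc["R"]).abs().sort_values(ascending=False)
print(gap)
```

Bands with a larger gap are more likely to help the classifier, although a gap in means alone says nothing about how much the two classes overlap.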


## Methods:

### Data selected:

- We do not need all 60 frequencies. The visualization shows that after Freq_49 the difference between the classes is negligible. We will discard those bands to get a smaller data set, so we can train the model more efficiently.
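One possible way to drop the higher bands is to filter the column names by their numeric suffix. This is a sketch under the assumption that the columns follow the `Freq_1` ... `Freq_60` naming shown earlier; the one-row frame holds placeholder values only.

```python
import pandas as pd

# Column names following the Freq_1 ... Freq_60 pattern used in the data set
all_freq_cols = [f"Freq_{i}" for i in range(1, 61)]

# Keep bands 1 through 49; the cutoff comes from the visual inspection above
cutoff = 49
kept_cols = [c for c in all_freq_cols if int(c.split("_")[1]) <= cutoff]

# Toy one-row frame just to demonstrate the selection (values are placeholders)
toy = pd.DataFrame([[0.0] * 60 + ["R"]], columns=all_freq_cols + ["Label"])
toy_reduced = toy[kept_cols + ["Label"]]

print(toy_reduced.shape)
```

Applied to the real data, the same selection would be `sonar_data[kept_cols + ["Label"]]`.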
    
    
### Plan of action:

Using the K-nearest neighbors classification algorithm:

- Execute a cross-validation in Python to choose the number of neighbors with a K-nearest neighbors classifier.
- Combine preprocessing and model training using make_pipeline
- Construct the confusion matrix to evaluate the performance of the model. 
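To make the idea behind the algorithm concrete, here is a minimal from-scratch sketch of K-nearest neighbors prediction: compute the distance from a new observation to every training point, take the K closest, and let them vote. The helper `knn_predict` and the toy points are hypothetical and only for illustration; the analysis itself relies on scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-feature points, for illustration only
X_toy = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.2], [0.8, 0.9], [0.9, 0.8], [0.9, 0.9]]
y_toy = ["R", "R", "R", "M", "M", "M"]

print(knn_predict(X_toy, y_toy, [0.15, 0.15], k=3))
print(knn_predict(X_toy, y_toy, [0.85, 0.85], k=3))
```

The choice of K controls how many neighbors vote, which is why it is tuned by cross-validation rather than fixed in advance.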


In [1]:
# encode labels, split the data, and tune K with cross-validation
from sklearn.preprocessing import LabelEncoder

# encode the class labels as integers (M -> 0, R -> 1)
le = LabelEncoder()
sonar_data['Label'] = le.fit_transform(sonar_data['Label'])

X = sonar_data.drop('Label', axis=1)
y = sonar_data['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=1024)
# create the search model
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# set the parameter grid
param_grid = {'kneighborsclassifier__n_neighbors': range(1, 21)}

# create the grid search model using GridSearchCV
grid_search = GridSearchCV(knn_pipe,
                           param_grid,
                           cv=5,
                           return_train_score=True,
                           scoring='accuracy')

grid_search.fit(X_train, y_train)
# collect the cross-validation results of the fitted search
CV_results = pd.DataFrame(grid_search.cv_results_)
CV_results

In [None]:
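from sklearn.metrics import classification_report, confusion_matrix

# best_estimator_ is the pipeline refit on the whole training set with the best K
best_knn_model = grid_search.best_estimator_

y_pred = best_knn_model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

sonar_conf_mat = conf_matrix

# mean cross-validated accuracy of the best K, as a percentage
sonar_acc = grid_search.best_score_ * 100
sonar_acc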
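A confusion matrix carries more information than a single accuracy figure. The sketch below shows how accuracy, recall, and precision for the mine class can be read off it. The labels and predictions are hypothetical and are NOT results from the sonar model; they only illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, NOT results from the sonar model
y_true = ["M", "M", "M", "M", "R", "R", "R", "R"]
y_hat = ["M", "M", "M", "R", "R", "R", "M", "R"]

# Rows are true classes, columns are predicted classes, ordered as in `labels`
cm = confusion_matrix(y_true, y_hat, labels=["M", "R"])

tp, fn = cm[0]  # true mines predicted as mine / as rock
fp, tn = cm[1]  # true rocks predicted as mine / as rock

accuracy = (tp + tn) / cm.sum()
recall_mine = tp / (tp + fn)     # share of real mines that were flagged
precision_mine = tp / (tp + fp)  # share of flagged objects that were mines

print(cm)
print(accuracy, recall_mine, precision_mine)
```

For mine detection, recall on the mine class is arguably the number to watch, since a missed mine is likely more costly than a rock flagged by mistake.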

## Expected outcomes and significance:
- We expect our classifier to predict whether an underwater object is an explosive mine or a rock.
- This model allows us to automate water mine detection. With such a model, underwater robots can locate and flag mines automatically and search a large area of water thoroughly. Doing the same work manually is hard because human analysts are expensive to train and hire. Automated detection can save the lives of civilians, especially fishermen. It is also important for national defence because it can protect expensive battleships.
- This analysis could lead to future questions, such as whether the model can be trained with fewer parameters and, if so, how to find them; whether the model can detect mines of other shapes; and whether we can develop a model that labels the type of mine.
