# **SONAR** <h3> *Rock* **vs** *Mine* </h3>

![submarine_vs_minefield](./submarine_vs_minefield.jpg "https://media.istockphoto.com/id/932625038/photo/3d-illustration-of-a-submarine-passing-through-a-minefield.jpg?s=612x612&w=0&k=20&c=WyPyf29iGu3V1_VMCVI5u0WHurX1Dxy04YCht6bXc98=")

## Objective

In this project, we build a system that predicts whether the object beneath a submarine is a mine or a rock, based on sonar readings.

First, let's import all the dependencies.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

Now, we shall read the csv data into a pandas dataframe.

In [2]:
sonar_data = pd.read_csv("sonar_data.csv", header=None)
sonar_data.head()

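| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.02 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.009 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.018 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.015 | 0.0085 | 0.0073 | 0.005 | 0.0044 | 0.004 | 0.0117 | R |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.011 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | R |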


Let's try to understand our dataset better before we go into the model building stuff.

In [3]:
sonar_data.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | ... | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 | 208.0 |
| mean | 0.029164 | 0.038437 | 0.043832 | 0.053892 | 0.075202 | 0.10457 | 0.121747 | 0.134799 | 0.178003 | 0.208259 | ... | 0.016069 | 0.01342 | 0.010709 | 0.010941 | 0.00929 | 0.008222 | 0.00782 | 0.007949 | 0.007941 | 0.006507 |
| std | 0.022991 | 0.03296 | 0.038428 | 0.046528 | 0.055552 | 0.059105 | 0.061788 | 0.085152 | 0.118387 | 0.134416 | ... | 0.012008 | 0.009634 | 0.00706 | 0.007301 | 0.007088 | 0.005736 | 0.005785 | 0.00647 | 0.006181 | 0.005031 |
| min | 0.0015 | 0.0006 | 0.0015 | 0.0058 | 0.0067 | 0.0102 | 0.0033 | 0.0055 | 0.0075 | 0.0113 | ... | 0.0 | 0.0008 | 0.0005 | 0.001 | 0.0006 | 0.0004 | 0.0003 | 0.0003 | 0.0001 | 0.0006 |
| 25% | 0.01335 | 0.01645 | 0.01895 | 0.024375 | 0.03805 | 0.067025 | 0.0809 | 0.080425 | 0.097025 | 0.111275 | ... | 0.008425 | 0.007275 | 0.005075 | 0.005375 | 0.00415 | 0.0044 | 0.0037 | 0.0036 | 0.003675 | 0.0031 |
| 50% | 0.0228 | 0.0308 | 0.0343 | 0.04405 | 0.0625 | 0.09215 | 0.10695 | 0.1121 | 0.15225 | 0.1824 | ... | 0.0139 | 0.0114 | 0.00955 | 0.0093 | 0.0075 | 0.00685 | 0.00595 | 0.0058 | 0.0064 | 0.0053 |
| 75% | 0.03555 | 0.04795 | 0.05795 | 0.0645 | 0.100275 | 0.134125 | 0.154 | 0.1696 | 0.233425 | 0.2687 | ... | 0.020825 | 0.016725 | 0.0149 | 0.0145 | 0.0121 | 0.010575 | 0.010425 | 0.01035 | 0.010325 | 0.008525 |
| max | 0.1371 | 0.2339 | 0.3059 | 0.4264 | 0.401 | 0.3823 | 0.3729 | 0.459 | 0.6828 | 0.7106 | ... | 0.1004 | 0.0709 | 0.039 | 0.0352 | 0.0447 | 0.0394 | 0.0355 | 0.044 | 0.0364 | 0.0439 |


In [4]:
sonar_data.isnull().sum()

0     0
1     0
2     0
3     0
4     0
     ..
56    0
57    0
58    0
59    0
60    0
Length: 61, dtype: int64

In [5]:
sonar_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 61 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       208 non-null    float64
 1   1       208 non-null    float64
 2   2       208 non-null    float64
 3   3       208 non-null    float64
 4   4       208 non-null    float64
 5   5       208 non-null    float64
 6   6       208 non-null    float64
 7   7       208 non-null    float64
 8   8       208 non-null    float64
 9   9       208 non-null    float64
 10  10      208 non-null    float64
 11  11      208 non-null    float64
 12  12      208 non-null    float64
 13  13      208 non-null    float64
 14  14      208 non-null    float64
 15  15      208 non-null    float64
 16  16      208 non-null    float64
 17  17      208 non-null    float64
 18  18      208 non-null    float64
 19  19      208 non-null    float64
 20  20      208 non-null    float64
 21  21      208 non-null    float64
 22  22

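We saw that the data doesn't contain any null values (in the *sonar_data.info()* command) and that the features are already on a comparable scale, with every displayed value lying between 0 and 1 (in the *sonar_data.describe()* command). Note that this is not standardization in the strict sense (zero mean and unit variance), but we use the values as they are.
So, now we're ready to build a model that distinguishes between a rock and a mine.
First, let's define our feature variables and our target variable.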

In [6]:
X = sonar_data.drop(60, axis=1)
y = sonar_data[60]
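
Before splitting, it can be useful to check how many samples each class has and to confirm the value range of the features. The following is a minimal sketch on a small synthetic frame that mimics the layout of *sonar_data* (numeric feature columns followed by an `R`/`M` label column). The names `demo`, `X_demo` and `y_demo` and all values in it are assumptions made for illustration, not the real sonar data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same layout as sonar_data: numeric feature
# columns followed by a label column of "R"/"M" strings (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 1, size=(10, 4)))
demo[4] = ["R"] * 6 + ["M"] * 4

X_demo = demo.drop(4, axis=1)
y_demo = demo[4]

class_counts = y_demo.value_counts()        # number of samples per class
feature_min = X_demo.min().min()            # smallest value over all features
feature_max = X_demo.max().max()            # largest value over all features
n_missing = int(demo.isnull().sum().sum())  # total number of missing cells

print(class_counts)
print(feature_min, feature_max, n_missing)
```

The same three checks could be run on `X` and `y` from the cell above.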

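Now, we split our dataset into a training set and a test set. With 208 samples and *test_size=0.1*, this leaves 187 samples for training and 21 for testing.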

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=0)
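
The `stratify=y` argument keeps the class proportions the same in both splits. Here is a small sketch of that behaviour on toy data; the arrays `X_demo` and `y_demo` and the 60/40 class ratio are assumptions chosen only to make the effect easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 60% class "M" and 40% class "R".
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["M"] * 60 + ["R"] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# With stratify, the 10 test samples keep the 60/40 class ratio.
test_m = int((y_te == "M").sum())
test_r = int((y_te == "R").sum())
print(len(y_tr), len(y_te), test_m, test_r)
```

Without `stratify`, a small test set could by chance end up with a noticeably different class ratio from the training set.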

We need to encode our categorical target variable as numerical values before we can feed it to our model. All our feature variables are already in numerical format (float64), as we saw earlier in the *sonar_data.info()* command. Note that the encoder is fitted on the training labels only and then applied to both sets.

In [8]:
le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)
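
To see which number each label is mapped to, we can inspect the fitted encoder. The sketch below uses a separate encoder, `le_demo` (a name introduced here for illustration), fitted on a few hand-written labels. `LabelEncoder` sorts the classes, so `M` comes before `R`.

```python
from sklearn.preprocessing import LabelEncoder

# Fit a demo encoder on a few hand-written labels (illustrative only).
le_demo = LabelEncoder()
le_demo.fit(["R", "M", "R", "M"])

classes = list(le_demo.classes_)             # sorted labels; position = encoded value
encoded = le_demo.transform(["M", "R"])      # labels -> integers
decoded = le_demo.inverse_transform([0, 1])  # integers -> labels
print(classes, encoded, decoded)
```

So in the encoded target, 0 stands for a mine and 1 stands for a rock, and `inverse_transform` turns predictions back into the original letters.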

Now, let's create our model object. We're using a **Logistic Regression** model for this task.

In [9]:
lr = LogisticRegression(max_iter=10000)
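
As a reminder of what this model computes: for a binary problem, logistic regression passes a linear combination of the features through the sigmoid function to get the probability of class 1. The sketch below checks this on synthetic data generated with `make_classification`. The data and the names `clf`, `z` and `manual_proba` are assumptions for illustration and are unrelated to the sonar dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid 1 / (1 + exp(-z)).
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```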

We'll perform a Grid Search to find the best values for our hyperparameters. First, we'll create *params*, a list of dictionaries. Each dictionary describes one group of compatible settings (not every solver supports every penalty), and grid search tries every combination within each dictionary.

In [10]:
params = [{"solver": ["lbfgs", "sag", "newton-cg"], "penalty": ["l2"], "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
          # Note: scikit-learn 1.2 and later expect None here instead of the string "none".
          {"solver": ["lbfgs", "sag", "newton-cg", "saga"], "penalty": ["none"]},
          {"solver": ["liblinear"], "penalty": ["l2"], "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
          {"solver": ["liblinear"], "penalty": ["l1"], "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
          {"solver": ["saga"], "penalty": ["l1", "l2"], "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
          {"solver": ["saga"], "penalty": ["elasticnet"], "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000], "l1_ratio": list(np.linspace(0, 1, 100))}]
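
To get a feel for the size of this search, we can count the candidates with `ParameterGrid` before fitting anything. The sketch below rebuilds the same grid under the names `grid` and `C_values` (both introduced here for illustration) and multiplies the number of candidates by the 20 folds used later.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
grid = [
    {"solver": ["lbfgs", "sag", "newton-cg"], "penalty": ["l2"], "C": C_values},
    {"solver": ["lbfgs", "sag", "newton-cg", "saga"], "penalty": ["none"]},
    {"solver": ["liblinear"], "penalty": ["l2"], "C": C_values},
    {"solver": ["liblinear"], "penalty": ["l1"], "C": C_values},
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": C_values},
    {"solver": ["saga"], "penalty": ["elasticnet"], "C": C_values,
     "l1_ratio": list(np.linspace(0, 1, 100))},
]

# ParameterGrid only enumerates combinations; it does not train any model.
candidates_per_grid = [len(ParameterGrid(g)) for g in grid]
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 20  # one fit per candidate and fold

print(candidates_per_grid, n_candidates, n_fits)
# [21, 4, 7, 7, 14, 700] 753 15060
```

Almost all of the work comes from the elastic-net dictionary, because of the 100 values for `l1_ratio`. A coarser `l1_ratio` grid would shorten the search considerably.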

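Now, we pass the parameter grid to **GridSearchCV()** and run a 20-fold Stratified K-Fold cross-validation to find the best hyperparameter values for our logistic regression model. We loop over the six dictionaries one at a time and wrap the loop in **tqdm()** from the tqdm library to track progress with a progress bar. Note that every call to *Grid1.fit()* replaces the results of the previous call, so after the loop *Grid1* only holds the results for the last dictionary (the elastic-net grid).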

In [11]:
Grid1 = GridSearchCV(lr, param_grid=params, cv=20, n_jobs=-1)
for param in tqdm(Grid1.param_grid):
    Grid1.set_params(param_grid=param)
    Grid1.fit(X_train, y_train)

100%|████████████████████████████████████████████████████████████████████████████████| 6/6 [2:17:59<00:00, 1379.94s/it]
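
Because refitting the same `GridSearchCV` object discards earlier results, a comparison across all dictionaries needs either a single fit on the whole list or one stored result per dictionary. The sketch below shows both options on a small synthetic problem. The data, the two-entry `demo_grids` list and all variable names are assumptions for illustration, and it is not a rerun of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately tiny grid (illustrative only).
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)
demo_grids = [
    {"solver": ["lbfgs"], "C": [0.1, 1, 10]},
    {"solver": ["liblinear"], "C": [0.1, 1, 10]},
]

# Option 1: pass the whole list, so every candidate is ranked in one search.
search_all = GridSearchCV(LogisticRegression(max_iter=10000),
                          param_grid=demo_grids, cv=5)
search_all.fit(X_demo, y_demo)

# Option 2: fit one dictionary at a time, but keep each result.
per_grid_best = []
for grid in demo_grids:
    search = GridSearchCV(LogisticRegression(max_iter=10000),
                          param_grid=grid, cv=5)
    search.fit(X_demo, y_demo)
    per_grid_best.append((search.best_score_, search.best_params_))

# Pick the overall winner across the separately fitted grids.
overall_score, overall_params = max(per_grid_best, key=lambda item: item[0])
print(search_all.best_params_, overall_params, overall_score)
```

Option 2 keeps the per-dictionary progress bar idea intact, since the loop can still be wrapped in `tqdm()`.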


In [12]:
Grid1.best_estimator_

LogisticRegression(C=10, l1_ratio=0.9595959595959597, max_iter=10000,
                   penalty='elasticnet', solver='saga')

In [13]:
Grid1.best_score_

0.7927777777777779

Among the elastic-net configurations (the last grid that was fitted), the best cross-validation accuracy of about 0.793 is obtained with **C**=10, **l1_ratio**=0.9595959595959597, **penalty**='elasticnet' and **solver**='saga'. The value **max_iter**=10000 was fixed in advance rather than tuned. Now, just out of curiosity, let's look at the other top-ranked hyperparameter combinations.

In [17]:
results_df = pd.DataFrame(Grid1.cv_results_)
top100 = results_df.sort_values(by="rank_test_score").head(100)
top100.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_l1_ratio | param_penalty | param_solver | params | split0_test_score | ... | split13_test_score | split14_test_score | split15_test_score | split16_test_score | split17_test_score | split18_test_score | split19_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 496 | 3.520919 | 1.449479 | 0.012503 | 0.007965 | 10 | 0.969697 | elasticnet | saga | {'C': 10, 'l1_ratio': 0.9696969696969697, 'pen... | 0.9 | ... | 0.666667 | 0.777778 | 0.777778 | 0.888889 | 0.777778 | 0.888889 | 0.777778 | 0.792778 | 0.135504 | 1 |
| 495 | 2.007602 | 0.312085 | 0.012064 | 0.010239 | 10 | 0.959596 | elasticnet | saga | {'C': 10, 'l1_ratio': 0.9595959595959597, 'pen... | 0.9 | ... | 0.666667 | 0.777778 | 0.777778 | 0.888889 | 0.777778 | 0.888889 | 0.777778 | 0.792778 | 0.135504 | 1 |
| 490 | 1.638791 | 0.315326 | 0.012851 | 0.008603 | 10 | 0.909091 | elasticnet | saga | {'C': 10, 'l1_ratio': 0.9090909090909092, 'pen... | 0.9 | ... | 0.777778 | 0.777778 | 0.777778 | 0.888889 | 0.666667 | 0.888889 | 0.777778 | 0.792778 | 0.135504 | 3 |
| 489 | 1.500664 | 0.152334 | 0.012358 | 0.006468 | 10 | 0.89899 | elasticnet | saga | {'C': 10, 'l1_ratio': 0.8989898989898991, 'pen... | 0.9 | ... | 0.777778 | 0.777778 | 0.777778 | 0.888889 | 0.666667 | 0.888889 | 0.777778 | 0.792778 | 0.135504 | 3 |
| 497 | 3.216443 | 0.46887 | 0.018249 | 0.016348 | 10 | 0.979798 | elasticnet | saga | {'C': 10, 'l1_ratio': 0.9797979797979799, 'pen... | 0.9 | ... | 0.666667 | 0.777778 | 0.777778 | 0.888889 | 0.777778 | 0.888889 | 0.777778 | 0.787778 | 0.144098 | 5 |


Using the best hyperparameter values, let's train the model again on the full training set and see how well it performs on the test data.

In [15]:
lr = LogisticRegression(C=10, l1_ratio=0.9595959595959597, max_iter=10000, penalty='elasticnet', solver='saga')
lr.fit(X_train, y_train)
y_hat = lr.predict(X_test)
print("Accuracy on the test set:", accuracy_score(y_test, y_hat))

Accuracy on the test set: 0.7619047619047619
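
Accuracy alone does not tell us which kind of mistake the model makes, and for this task missing a mine is arguably worse than mistaking a rock for one. A confusion matrix separates the two error types. The sketch below uses hand-written label arrays (`y_true_demo` and `y_pred_demo` are hypothetical and are not the predictions from the cell above), with 0 standing for a mine and 1 for a rock.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical encoded labels (0 = mine, 1 = rock), not the notebook's predictions.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
# [[3 1]
#  [2 4]]
# 0.7
```

In this toy example, the top-right entry (1) counts mines predicted as rocks, and the bottom-left entry (2) counts rocks predicted as mines. Running `confusion_matrix(y_test, y_hat)` would give the same breakdown for the real test set.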


As we can see, our model reaches an accuracy of about 76.19% on the test set, which corresponds to 16 of the 21 test samples being classified correctly. Thank you for your time.