Suppose two countries are at war, and a submarine from one country is travelling through waters where the other country has laid mines. The mines are explosive and can detonate if the submarine comes into contact with them. The task is to build a machine learning model that predicts whether an object under water is a rock or a mine.


Dataset:
*   The submarine sends SONAR signals and the returned data is collected.
*   The SONAR signals were bounced off metal cylinders (mines are made of metal) and rocks.
*   This data can be used to predict whether a given object is a mine or a rock.
*   It is a .csv file. Link: https://drive.google.com/file/d/1pQxtljlNVh0DHYg-Ye7dtpDTlFceHVfa/view
*   It has a total of 61 columns and 208 data rows. The last column has only 2 values: "R", which means "rock", and "M", which means "mine".
*   The dataset does not have a header row (a row containing the column titles).
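
Because the file has no header row, how it is read matters. The sketch below uses two made-up rows (shortened to 3 feature columns plus the label, not taken from the real file) to show what `header=None` changes. The names `sample_csv`, `with_header_none`, and `default_header` are only for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the sonar file, shortened to
# 3 feature columns plus the label column (the real file has 60 + 1).
sample_csv = "0.0200,0.0371,0.0428,R\n0.0453,0.0523,0.0843,M\n"

# header=None keeps the first row as data and names the columns 0, 1, 2, ...
with_header_none = pd.read_csv(io.StringIO(sample_csv), header=None)

# Without header=None, pandas consumes the first row as column names.
default_header = pd.read_csv(io.StringIO(sample_csv))

print(with_header_none.shape)          # (2, 4)
print(default_header.shape)            # (1, 4)
print(list(with_header_none.columns))  # [0, 1, 2, 3]
```

With the default setting, one sample would silently be lost to the header, which is why the loading step needs `header=None`.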









Workflow:

*   First, we collect the data.
*   We cannot use the raw data directly; we need to process it so that insights can be extracted from it. This is called data preprocessing.
*   We then split the data into 2 parts (training data and testing data).
*   The training data is then fed into a machine learning model, which in this case is a logistic regression model. Logistic regression is a supervised learning algorithm that can be used for binary classification problems.
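
The workflow can be sketched end to end on synthetic data before touching the real file. Everything below is an assumption made for illustration: the random features, the made-up labelling rule, `max_iter=1000`, and the fixed seeds are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 208 rows and 60 features in [0, 1), like the sonar file.
features = rng.random((208, 60))
# Made-up rule for the labels so that the toy problem is learnable.
labels = np.where(features[:, 0] + features[:, 1] > 1.0, "M", "R")

# 1. split the data  2. fit the model  3. evaluate on the held-out rows
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=0
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(accuracy_score(y_te, clf.predict(X_te)))
```

The accuracy printed here says nothing about the sonar data; the sketch only shows the order of the steps.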







In [None]:
#importing the dependencies
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#train_test_split splits the data into training and testing sets so we do not have to do it manually
from sklearn.linear_model import LogisticRegression
#LogisticRegression is the class that implements the logistic regression model
from sklearn.metrics import accuracy_score
#accuracy_score is a function that calculates the accuracy of the model


Data collection and data processing

In [None]:
#mount your Google Drive in Google Colab to access the data file
from google.colab import drive
drive.mount('/content/drive')
dataset_path="/content/drive/MyDrive/data science/ml projects/sonar_data.csv"
#loading the dataset into a pandas dataframe
df=pd.read_csv(dataset_path,header=None)
#header=None tells pandas that the file has no header row, so the 1st row is treated as data and not as column names
print(df.head(10))
#df.head(10) prints the first 10 rows of the data; by default df.head() prints only 5 rows

Mounted at /content/drive
       0       1       2       3       4       5       6       7       8   \
0  0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1  0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2  0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3  0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4  0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   
5  0.0286  0.0453  0.0277  0.0174  0.0384  0.0990  0.1201  0.1833  0.2105   
6  0.0317  0.0956  0.1321  0.1408  0.1674  0.1710  0.0731  0.1401  0.2083   
7  0.0519  0.0548  0.0842  0.0319  0.1158  0.0922  0.1027  0.0613  0.1465   
8  0.0223  0.0375  0.0484  0.0475  0.0647  0.0591  0.0753  0.0098  0.0684   
9  0.0164  0.0173  0.0347  0.0070  0.0187  0.0671  0.1056  0.0697  0.0962   

       9   ...      51      52      53      54      55      56      57  \
0  0.2111  ...  0.0027  0.0065  0.0159  0.0072  0.01

In [None]:
#number of rows and columns
df.shape

(208, 61)

In [None]:
df.describe()
#describe() prints summary statistics for each numeric column: count, mean, standard deviation, min, quartiles (the 50% row is the median) and max

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
count,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,...,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0
mean,0.029164,0.038437,0.043832,0.053892,0.075202,0.10457,0.121747,0.134799,0.178003,0.208259,...,0.016069,0.01342,0.010709,0.010941,0.00929,0.008222,0.00782,0.007949,0.007941,0.006507
std,0.022991,0.03296,0.038428,0.046528,0.055552,0.059105,0.061788,0.085152,0.118387,0.134416,...,0.012008,0.009634,0.00706,0.007301,0.007088,0.005736,0.005785,0.00647,0.006181,0.005031
min,0.0015,0.0006,0.0015,0.0058,0.0067,0.0102,0.0033,0.0055,0.0075,0.0113,...,0.0,0.0008,0.0005,0.001,0.0006,0.0004,0.0003,0.0003,0.0001,0.0006
25%,0.01335,0.01645,0.01895,0.024375,0.03805,0.067025,0.0809,0.080425,0.097025,0.111275,...,0.008425,0.007275,0.005075,0.005375,0.00415,0.0044,0.0037,0.0036,0.003675,0.0031
50%,0.0228,0.0308,0.0343,0.04405,0.0625,0.09215,0.10695,0.1121,0.15225,0.1824,...,0.0139,0.0114,0.00955,0.0093,0.0075,0.00685,0.00595,0.0058,0.0064,0.0053
75%,0.03555,0.04795,0.05795,0.0645,0.100275,0.134125,0.154,0.1696,0.233425,0.2687,...,0.020825,0.016725,0.0149,0.0145,0.0121,0.010575,0.010425,0.01035,0.010325,0.008525
max,0.1371,0.2339,0.3059,0.4264,0.401,0.3823,0.3729,0.459,0.6828,0.7106,...,0.1004,0.0709,0.039,0.0352,0.0447,0.0394,0.0355,0.044,0.0364,0.0439


*   Each column has a count of 208, which means there are no NULL (missing) values.
*   It reports the mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum for each column.
*   Note that it shows only 60 columns and not 61. That is because the last column contains text ("R" or "M"), for which statistics such as mean and median are not defined.
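
Both observations can be checked directly instead of being read off the table. The sketch below uses a tiny made-up frame (`toy`) with one missing value so the effect is visible; the values are invented and are not from the sonar file.

```python
import numpy as np
import pandas as pd

# Made-up frame: two numeric columns (one with a missing value) and a text column.
toy = pd.DataFrame({
    0: [0.02, 0.05, np.nan],
    1: [0.04, 0.05, 0.06],
    2: ["R", "M", "M"],
})

# Number of missing values per column.
missing_per_column = {col: int(n) for col, n in toy.isnull().sum().items()}
# describe() summarizes only the numeric columns by default.
numeric_summary = toy.describe()

print(missing_per_column)                        # {0: 1, 1: 0, 2: 0}
print(list(numeric_summary.columns))             # [0, 1]
print(int(numeric_summary.loc["count", 0]))      # 2
```

On the real dataframe, `df.isnull().sum().sum()` being 0 would confirm what the count row suggests.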







In [None]:
#count the number of rocks and mines in the last column
df[60].value_counts()
#60 is the index of the column whose values we want to count
#the two categories should have roughly equal counts; otherwise the accuracy can be misleading because the model may favor the larger class

60,count
M,111
R,97


M ---> MINE




R ---> ROCK
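
To put the counts above into perspective, they can be turned into proportions. The sketch rebuilds a label column with the same counts (111 "M" and 97 "R") rather than reading the file; `labels`, `counts`, and `shares` are illustrative names.

```python
import pandas as pd

# Rebuild a label column with the counts reported above.
labels = pd.Series(["M"] * 111 + ["R"] * 97)

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)  # fractions instead of counts

print(int(counts["M"]), int(counts["R"]))                          # 111 97
print(round(float(shares["M"]), 3), round(float(shares["R"]), 3))  # 0.534 0.466
```

A model that always answers "M" would already be right about 53.4% of the time on this data, so that is the baseline any accuracy figure should be compared against.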

In [None]:
'''
groupby --> groups the rows according to the category in the given column (here column index 60)
After grouping, you can perform operations (like sum, mean, count, etc.) on each group, for each column separately.
It's like organizing your data into smaller chunks based on some feature and then doing calculations on each chunk.'''
#count() counts the number of instances in each group (M and R)
df.groupby(60).count()

60,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
M,111,111,111,111,111,111,111,111,111,111,...,111,111,111,111,111,111,111,111,111,111
R,97,97,97,97,97,97,97,97,97,97,...,97,97,97,97,97,97,97,97,97,97
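
The grouping idea is easier to see on a frame small enough to check by hand. The values below are made up; only the column label 60 mirrors the real dataframe.

```python
import pandas as pd

# Made-up frame: two feature columns and a label column named 60.
toy = pd.DataFrame({
    0: [0.10, 0.30, 0.20, 0.40],
    1: [1.0, 3.0, 2.0, 6.0],
    60: ["M", "M", "R", "R"],
})

group_sizes = toy.groupby(60).count()  # rows per group, per column
group_means = toy.groupby(60).mean()   # column means within each group

print(group_sizes)
print(group_means)
```

Each group has 2 rows, and the group means are simply the averages of those 2 rows: for column 1 that is 2.0 for "M" and 4.0 for "R".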


In [None]:
df.groupby(60).median()
#median() computes the median of each column within each group (M and R)

60,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
M,0.0269,0.0353,0.0386,0.0547,0.0748,0.1091,0.1232,0.1298,0.1864,0.2245,...,0.0171,0.0132,0.0101,0.0096,0.0072,0.0074,0.0057,0.007,0.007,0.0053
R,0.0201,0.0242,0.0288,0.035,0.0476,0.0792,0.1015,0.0973,0.1054,0.1264,...,0.0107,0.0088,0.0081,0.0088,0.0077,0.0065,0.0061,0.0052,0.0058,0.0054


In [None]:
df.groupby(60).mean()
#mean() computes the mean of each column within each group (M and R)

60,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
M,0.034989,0.045544,0.05072,0.064768,0.086715,0.111864,0.128359,0.149832,0.213492,0.251022,...,0.019352,0.016014,0.011643,0.012185,0.009923,0.008914,0.007825,0.00906,0.008695,0.00693
R,0.022498,0.030303,0.035951,0.041447,0.062028,0.096224,0.11418,0.117596,0.137392,0.159325,...,0.012311,0.010453,0.00964,0.009518,0.008567,0.00743,0.007814,0.006677,0.007078,0.006024


In [None]:
#separate the data (features) and the label
#columns=60 already says that a column is dropped, so the axis argument is not needed
X=df.drop(columns=60)
Y=df[60]
print(X)
print(Y)

         0       1       2       3       4       5       6       7       8   \
0    0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1    0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2    0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3    0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4    0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   
..      ...     ...     ...     ...     ...     ...     ...     ...     ...   
203  0.0187  0.0346  0.0168  0.0177  0.0393  0.1630  0.2028  0.1694  0.2328   
204  0.0323  0.0101  0.0298  0.0564  0.0760  0.0958  0.0990  0.1018  0.1030   
205  0.0522  0.0437  0.0180  0.0292  0.0351  0.1171  0.1257  0.1178  0.1258   
206  0.0303  0.0353  0.0490  0.0608  0.0167  0.1354  0.1465  0.1123  0.1945   
207  0.0260  0.0363  0.0136  0.0272  0.0214  0.0338  0.0655  0.1400  0.1843   

         9   ...      50      51      52      53   

In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.1,random_state=None,stratify=Y)
#X is the data and Y is the label
#X_train is the data and Y_train is the corresponding label; these are used for training the ML model
#X_test is the data and Y_test is the corresponding label; these are used to test the model
#test_size=0.1 means the split is 90% for training and 10% for testing
#stratify=Y keeps the proportion of rocks and mines in the training and testing data the same as in the full dataset, so neither split is skewed toward one class
#If random_state is None (the default), the split is not seeded, so it may be different each time the code is run.
#If random_state is set to a fixed number (e.g., random_state=10), the data is split the same way on every run, which keeps experiments consistent.

In [None]:
print(X.shape,X_train.shape,X_test.shape)

(208, 60) (187, 60) (21, 60)
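
The effect of `stratify` and `random_state` can be checked without the real features. The sketch below splits a rebuilt label list with the same class counts as the dataset (111 "M", 97 "R"); the seed value 10 and all variable names are assumptions chosen for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Rebuilt labels with the dataset's class counts; row ids stand in for features.
labels = ["M"] * 111 + ["R"] * 97
rows = list(range(len(labels)))

# Two splits with the same fixed seed.
tr1, te1, ytr1, yte1 = train_test_split(
    rows, labels, test_size=0.1, stratify=labels, random_state=10
)
tr2, te2, ytr2, yte2 = train_test_split(
    rows, labels, test_size=0.1, stratify=labels, random_state=10
)

share_all = labels.count("M") / len(labels)
share_test = yte1.count("M") / len(yte1)

print(Counter(yte1))
print(round(share_all, 3), round(share_test, 3))
print(te1 == te2)  # True: the same seed gives the same split
```

Stratifying keeps the share of mines in the test set close to the share in the full data. It does not force the two classes to be equal in size.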


In [None]:
model=LogisticRegression()
#we will work with the logistic regression model

In [None]:
#let's train the model
model.fit(X_train,Y_train)  #model.fit(data,label)

In [None]:
#now we use the test data to measure the accuracy of the model
#the model has already seen the training data, so it may have memorized it; accuracy on the training data can therefore look better than the model really is
#that is why we also evaluate on data the model has not seen
#we can still calculate the accuracy on the training data for comparison
train_predictions=model.predict(X_train) #model.predict returns the model's predictions for the input data, which here is X_train
test_predictions=model.predict(X_test) #predictions for the testing data are stored in test_predictions; they will be compared with Y_test to calculate accuracy


In [None]:
train_accuracy=accuracy_score(Y_train,train_predictions) #accuracy_score(true labels, predicted labels)
print(train_accuracy)

0.839572192513369


In [None]:
test_accuracy=accuracy_score(Y_test,test_predictions) #accuracy_score(true labels, predicted labels)
print(test_accuracy)

0.8571428571428571
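
Accuracy is just the fraction of predictions that match the true labels. The test accuracy printed above is consistent with 18 correct predictions out of 21 test samples (18/21 ≈ 0.857). The sketch below reproduces that arithmetic with made-up label lists; which particular samples are right or wrong is invented.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 21 test samples, 18 of them predicted correctly.
y_true = ["M"] * 11 + ["R"] * 10
y_pred = ["M"] * 10 + ["R"] * 1 + ["R"] * 8 + ["M"] * 2

# Count the matching positions by hand and compare with the library result.
matches = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = matches / len(y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print(matches, len(y_true))
print(manual_accuracy, library_accuracy)
print(1 / len(y_true))  # change in accuracy caused by a single sample
```

With only 21 test samples, a single prediction shifts the accuracy by almost 5 percentage points, and the split changes between runs when `random_state=None`. A test accuracy slightly above the training accuracy is therefore not surprising here.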
