<a href="https://colab.research.google.com/github/Tanishq0055/Rock_vs_Mine_Prediction/blob/main/Rock_vs_Mine_Prediction_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The task is to predict whether a sonar return comes from a metal object (a mine) or a rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time.

We are going to cover the following steps:

1. Load the dataset (import libraries and load the data)
2. Analyze the data (descriptive statistics)
3. Split the dataset into the training set and test set
4. Compare algorithms
5. Tune the algorithms
6. Finalize the model based on the best method



# **Load the Dataset (Import libraries and Load dataset)**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the dataset into a pandas DataFrame
sonar_data = pd.read_csv('/content/sonar_data.csv', header=None)  # the file has no header row, so pass header=None

# **Analyze Data (Descriptive Statistics)**

In [None]:
sonar_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


In [None]:
# number of rows and columns
sonar_data.shape

(208, 61)

In [None]:
sonar_data.describe()  #describe --> statistical measures of the data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
count,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,...,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0,208.0
mean,0.029164,0.038437,0.043832,0.053892,0.075202,0.10457,0.121747,0.134799,0.178003,0.208259,...,0.016069,0.01342,0.010709,0.010941,0.00929,0.008222,0.00782,0.007949,0.007941,0.006507
std,0.022991,0.03296,0.038428,0.046528,0.055552,0.059105,0.061788,0.085152,0.118387,0.134416,...,0.012008,0.009634,0.00706,0.007301,0.007088,0.005736,0.005785,0.00647,0.006181,0.005031
min,0.0015,0.0006,0.0015,0.0058,0.0067,0.0102,0.0033,0.0055,0.0075,0.0113,...,0.0,0.0008,0.0005,0.001,0.0006,0.0004,0.0003,0.0003,0.0001,0.0006
25%,0.01335,0.01645,0.01895,0.024375,0.03805,0.067025,0.0809,0.080425,0.097025,0.111275,...,0.008425,0.007275,0.005075,0.005375,0.00415,0.0044,0.0037,0.0036,0.003675,0.0031
50%,0.0228,0.0308,0.0343,0.04405,0.0625,0.09215,0.10695,0.1121,0.15225,0.1824,...,0.0139,0.0114,0.00955,0.0093,0.0075,0.00685,0.00595,0.0058,0.0064,0.0053
75%,0.03555,0.04795,0.05795,0.0645,0.100275,0.134125,0.154,0.1696,0.233425,0.2687,...,0.020825,0.016725,0.0149,0.0145,0.0121,0.010575,0.010425,0.01035,0.010325,0.008525
max,0.1371,0.2339,0.3059,0.4264,0.401,0.3823,0.3729,0.459,0.6828,0.7106,...,0.1004,0.0709,0.039,0.0352,0.0447,0.0394,0.0355,0.044,0.0364,0.0439


In [None]:
sonar_data[60].value_counts()  # column 60 holds the label: 111 entries for mines (M) and 97 entries for rocks (R)

M    111
R     97
Name: 60, dtype: int64

M --> Mine

R --> Rock
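
As a point of reference for the accuracy figures later on, the class counts above imply a majority-class baseline: a classifier that always answers `M` would already be right a little over half the time. A minimal sketch of that arithmetic, with the counts copied from the output above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {"M": 111, "R": 97}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(total)                        # 208
print(majority_label)               # M
print(round(baseline_accuracy, 4))  # 0.5337
```

Any model worth keeping should beat this figure by a clear margin.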

In [None]:
sonar_data.groupby(60).mean() # mean values for all the columns for mine and rock separately

60,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
M,0.034989,0.045544,0.05072,0.064768,0.086715,0.111864,0.128359,0.149832,0.213492,0.251022,...,0.019352,0.016014,0.011643,0.012185,0.009923,0.008914,0.007825,0.00906,0.008695,0.00693
R,0.022498,0.030303,0.035951,0.041447,0.062028,0.096224,0.11418,0.117596,0.137392,0.159325,...,0.012311,0.010453,0.00964,0.009518,0.008567,0.00743,0.007814,0.006677,0.007078,0.006024


In [None]:
# Separate the features and the labels
X = sonar_data.drop(columns=60)  # drop the label column; the remaining 60 columns are the features
Y = sonar_data[60]

In [None]:
print(X)
print(Y)

         0       1       2       3       4       5       6       7       8   \
0    0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1    0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2    0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3    0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4    0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   
..      ...     ...     ...     ...     ...     ...     ...     ...     ...   
203  0.0187  0.0346  0.0168  0.0177  0.0393  0.1630  0.2028  0.1694  0.2328   
204  0.0323  0.0101  0.0298  0.0564  0.0760  0.0958  0.0990  0.1018  0.1030   
205  0.0522  0.0437  0.0180  0.0292  0.0351  0.1171  0.1257  0.1178  0.1258   
206  0.0303  0.0353  0.0490  0.0608  0.0167  0.1354  0.1465  0.1123  0.1945   
207  0.0260  0.0363  0.0136  0.0272  0.0214  0.0338  0.0655  0.1400  0.1843   

         9   ...      50      51      52      53   

# Splitting the dataset into the Training set and Test set

In [None]:
# stratify=Y keeps the proportion of rocks and mines in the training and test sets
# the same as in the full dataset.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, stratify=Y, random_state=1)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(208, 60) (187, 60) (21, 60)
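
To make the effect of `stratify=Y` concrete, here is a simplified, pure-Python sketch of a stratified split. It is not scikit-learn's implementation, which handles rounding more carefully; the helper name `stratified_split` and the synthetic label list are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels with the same class counts as the sonar data.
labels = ["M"] * 111 + ["R"] * 97
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

With these counts the sketch places 11 mines and 10 rocks in the test set, 21 rows in total, which is consistent with the `(21, 60)` test shape printed above.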


In [None]:
print(X_train)
print(Y_train)

         0       1       2       3       4       5       6       7       8   \
115  0.0414  0.0436  0.0447  0.0844  0.0419  0.1215  0.2002  0.1516  0.0818   
38   0.0123  0.0022  0.0196  0.0206  0.0180  0.0492  0.0033  0.0398  0.0791   
56   0.0152  0.0102  0.0113  0.0263  0.0097  0.0391  0.0857  0.0915  0.0949   
123  0.0270  0.0163  0.0341  0.0247  0.0822  0.1256  0.1323  0.1584  0.2017   
18   0.0270  0.0092  0.0145  0.0278  0.0412  0.0757  0.1026  0.1138  0.0794   
..      ...     ...     ...     ...     ...     ...     ...     ...     ...   
140  0.0412  0.1135  0.0518  0.0232  0.0646  0.1124  0.1787  0.2407  0.2682   
5    0.0286  0.0453  0.0277  0.0174  0.0384  0.0990  0.1201  0.1833  0.2105   
154  0.0117  0.0069  0.0279  0.0583  0.0915  0.1267  0.1577  0.1927  0.2361   
131  0.1150  0.1163  0.0866  0.0358  0.0232  0.1267  0.2417  0.2661  0.4346   
203  0.0187  0.0346  0.0168  0.0177  0.0393  0.1630  0.2028  0.1694  0.2328   

         9   ...      50      51      52      53   

# COMPARING **ALGORITHMS**

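We will use 10-fold cross-validation. K-fold cross-validation is not an evaluation metric in itself; it is a resampling procedure that shows how stable a model's score is across different splits of the training data, which helps reveal whether a single split gives a biased picture. We will evaluate the algorithms using the accuracy metric. This is a coarse metric that gives a quick idea of how correct a given model is. It is most useful on binary classification problems where the classes are roughly balanced, as they are here.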
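
The mechanics of k-fold cross-validation can be sketched without any library: the data is cut into k folds, and each fold serves once as the validation set while the remaining folds are used for training. The helper `fold_sizes` below is hypothetical and ignores stratification, which scikit-learn applies by default for classifiers; it only shows how 187 training rows divide into 10 folds.

```python
def fold_sizes(n_samples, n_folds):
    """Sizes of the folds when n_samples are spread as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds get one additional sample.
    return [base + 1 if i < extra else base for i in range(n_folds)]

sizes = fold_sizes(187, 10)
print(sizes)  # [19, 19, 19, 19, 19, 19, 19, 18, 18, 18]

for i, validation_size in enumerate(sizes):
    training_size = sum(sizes) - validation_size
    print(f"fold {i}: train on {training_size}, validate on {validation_size}")
```

A validation fold of 19 samples means each fold's accuracy moves in steps of 1/19 (about 0.053), which is one reason per-fold scores on a dataset this small vary noticeably.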

In [None]:
FinalPred_dataFrame = pd.DataFrame()
FinalPred_dataFrame['action']=Y_test

In [None]:
# Algorithms we want to compare
AlgoModels = []
AlgoModels.append(('LogisticRegression', LogisticRegression()))
AlgoModels.append(('RandomForestClassifier', RandomForestClassifier()))
AlgoModels.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
AlgoModels.append(('GaussianNB', GaussianNB()))
AlgoModels.append(('KNeighborsClassifier', KNeighborsClassifier()))
AlgoModels.append(('SupportVectorMachine', SVC()))

In [None]:
Final_res = []
Algo_Names = []
for name, model in AlgoModels:
    Cross_val_res = cross_val_score(model, X_train, Y_train, cv=10) # on training data set
    Final_res.append(Cross_val_res)
    Algo_Names.append(name)
    print('Validation score->')  # mean and spread of the 10-fold cross-validation accuracy
    print ("%s: %f (%f)" % (name, Cross_val_res.mean(), Cross_val_res.std()))
    answer_mean = "%s Mean: %f " % (name, Cross_val_res.mean())
    answer_std  = "%s Std: %f" % (name,Cross_val_res.std())
    print(answer_mean)
    print(answer_std)
    print()

Validation score->
LogisticRegression: 0.759064 (0.077053)
LogisticRegression Mean: 0.759064 
LogisticRegression Std: 0.077053

Validation score->
RandomForestClassifier: 0.813158 (0.062909)
RandomForestClassifier Mean: 0.813158 
RandomForestClassifier Std: 0.062909

Validation score->
DecisionTreeClassifier: 0.707018 (0.059620)
DecisionTreeClassifier Mean: 0.707018 
DecisionTreeClassifier Std: 0.059620

Validation score->
GaussianNB: 0.711696 (0.142888)
GaussianNB Mean: 0.711696 
GaussianNB Std: 0.142888

Validation score->
KNeighborsClassifier: 0.775439 (0.084318)
KNeighborsClassifier Mean: 0.775439 
KNeighborsClassifier Std: 0.084318

Validation score->
SupportVectorMachine: 0.801462 (0.089956)
SupportVectorMachine Mean: 0.801462 
SupportVectorMachine Std: 0.089956



In [None]:
Final_res = []
Algo_Names = []
for name, model in AlgoModels:
    Cross_val_res = cross_val_score(model, X_train, Y_train, cv=10)  # cross-validation on the training set
    Final_res.append(Cross_val_res)
    Algo_Names.append(name)
    model.fit(X_train, Y_train)  # refit on the full training set
    y_pred = model.predict(X_test)  # predict on the held-out test set
    FinalPred_dataFrame[name] = y_pred
    test_data_accuracy = accuracy_score(Y_test, y_pred)  # argument order is (y_true, y_pred)
    Acc = "Accuracy of model on the test data by %s  : %f" % (name, test_data_accuracy)
    print(Acc)

Accuracy of model on the test data by LogisticRegression  : 0.761905
Accuracy of model on the test data by RandomForestClassifier  : 0.761905
Accuracy of model on the test data by DecisionTreeClassifier  : 0.761905
Accuracy of model on the test data by GaussianNB  : 0.619048
Accuracy of model on the test data by KNeighborsClassifier  : 0.809524
Accuracy of model on the test data by SupportVectorMachine  : 0.809524


In [None]:
FinalPred_dataFrame

Unnamed: 0,action,LogisticRegression,RandomForestClassifier,DecisionTreeClassifier,GaussianNB,KNeighborsClassifier,SupportVectorMachine
113,M,M,M,M,R,M,M
23,R,R,R,R,R,R,R
45,R,R,R,R,R,R,R
81,R,M,M,R,M,M,M
82,R,M,M,R,M,M,M
109,M,M,R,R,M,M,M
176,M,M,M,M,M,M,M
134,M,M,M,M,M,M,M
96,R,R,R,R,R,R,R
98,M,M,M,R,M,R,R


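This shows that the **k-Nearest Neighbors, Support Vector Machine and Random Forest** algorithms have the highest accuracy: KNN and SVM score best on the test set (0.81), while Random Forest and SVM score best in cross-validation (0.81 and 0.80).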
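
Accuracy alone hides which kind of mistake a model makes. As a small worked example, the sketch below tallies a confusion table for the `SupportVectorMachine` column using only the ten rows displayed above (not the full 21-row test set), so the counts are illustrative rather than a complete evaluation.

```python
from collections import Counter

# True labels ('action') and SVM predictions for the ten rows shown above.
actual    = ["M", "R", "R", "R", "R", "M", "M", "M", "R", "M"]
predicted = ["M", "R", "R", "M", "M", "M", "M", "M", "R", "R"]

# Count (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))

# Rows are the true label, columns the predicted label, both in the order M, R.
for true_label in ("M", "R"):
    row = [confusion[(true_label, pred_label)] for pred_label in ("M", "R")]
    print(true_label, row)

correct = confusion[("M", "M")] + confusion[("R", "R")]
print(correct / len(actual))  # 0.7
```

scikit-learn's `confusion_matrix`, which the notebook already imports, produces the same kind of table from `Y_test` and a model's predictions.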

# K-NN algorithm Tuning

We can start off by tuning the number of neighbors for KNN. The default number of neighbors in scikit-learn's `KNeighborsClassifier` is 5. Below we try all odd values of k from 1 to 21, covering the default value of 5. Each k value is evaluated using 10-fold cross-validation on a standardized copy of the training dataset.
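
Standardizing matters for KNN because distances are otherwise dominated by the features with the largest spread. The sketch below shows the transformation on the five first-column values from the `head()` output earlier; it uses the population standard deviation, which is what `StandardScaler` uses, and the variable names are illustrative.

```python
from statistics import fmean, pstdev

# First-column values from the five rows shown by sonar_data.head().
values = [0.0200, 0.0453, 0.0262, 0.0100, 0.0762]

mean = fmean(values)
std = pstdev(values)  # population standard deviation (ddof=0)

# z-score: subtract the mean, divide by the standard deviation.
standardized = [(v - mean) / std for v in values]
print([round(z, 3) for z in standardized])

# After standardizing, the values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(abs(fmean(standardized)) < 1e-9, abs(pstdev(standardized) - 1) < 1e-9)
```

Note that `describe()` reports the sample standard deviation (ddof=1), so its `std` row differs slightly from the value a fitted `StandardScaler` divides by.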

In [None]:
# KNN algorithm tuning
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
neighbors = [1,3,5,7,9,11,13,15,17,19,21]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=10,n_jobs=-1)
grid_result = grid.fit(rescaledX, Y_train)
print("Best accuracy : %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
ranks = grid_result.cv_results_['rank_test_score']
for mean, stdev, param, rank in zip(means, stds, params, ranks):
    print("#%d %f (%f) with: %r" % (rank, mean, stdev, param))


Best accuracy : 0.849708 using {'n_neighbors': 1}
#1 0.849708 (0.067279) with: {'n_neighbors': 1}
#2 0.844444 (0.062244) with: {'n_neighbors': 3}
#4 0.796199 (0.063526) with: {'n_neighbors': 5}
#3 0.802339 (0.074561) with: {'n_neighbors': 7}
#5 0.780409 (0.103067) with: {'n_neighbors': 9}
#6 0.742398 (0.106132) with: {'n_neighbors': 11}
#8 0.731871 (0.099862) with: {'n_neighbors': 13}
#9 0.726608 (0.111290) with: {'n_neighbors': 15}
#11 0.721345 (0.116769) with: {'n_neighbors': 17}
#10 0.721930 (0.126878) with: {'n_neighbors': 19}
#7 0.732456 (0.113917) with: {'n_neighbors': 21}
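
One caveat with the cell above: the scaler is fitted on all of `X_train` before cross-validation, so every validation fold has already influenced the scaling. A common way to avoid this is to put the scaler and the classifier in a `Pipeline`, so the scaler is refitted inside each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the sonar features (an assumption, not the real dataset); the step names `scaler` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sonar training data (assumption, not the real set).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),      # refitted on each training fold
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}

grid = GridSearchCV(pipeline, param_grid=param_grid, scoring="accuracy", cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_, round(grid.best_score_, 3))
```

On the real data this would mean passing the unscaled `X_train` to `grid.fit`; the scores may differ somewhat from those printed above.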


In [None]:
df=pd.DataFrame(grid_result.cv_results_)
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001757,0.000122,0.003997,0.00204,1,{'n_neighbors': 1},0.947368,0.947368,0.736842,0.842105,0.894737,0.894737,0.789474,0.833333,0.833333,0.777778,0.849708,0.067279,1
1,0.001759,0.000181,0.004943,0.002599,3,{'n_neighbors': 3},0.947368,0.894737,0.789474,0.842105,0.842105,0.894737,0.789474,0.722222,0.888889,0.833333,0.844444,0.062244,2
2,0.00164,0.00014,0.003735,0.00197,5,{'n_neighbors': 5},0.894737,0.894737,0.684211,0.789474,0.842105,0.789474,0.789474,0.777778,0.777778,0.722222,0.796199,0.063526,4
3,0.001489,0.000101,0.002987,0.000109,7,{'n_neighbors': 7},0.842105,0.894737,0.684211,0.789474,0.894737,0.684211,0.789474,0.777778,0.888889,0.777778,0.802339,0.074561,3
4,0.001404,5.7e-05,0.003024,0.000107,9,{'n_neighbors': 9},0.842105,0.894737,0.736842,0.789474,0.947368,0.631579,0.684211,0.666667,0.888889,0.722222,0.780409,0.103067,5
5,0.003126,0.002781,0.00338,0.000571,11,{'n_neighbors': 11},0.789474,0.842105,0.736842,0.736842,0.947368,0.631579,0.684211,0.611111,0.833333,0.611111,0.742398,0.106132,6
6,0.002725,0.002552,0.003056,9.6e-05,13,{'n_neighbors': 13},0.789474,0.842105,0.631579,0.684211,0.947368,0.684211,0.684211,0.611111,0.777778,0.666667,0.731871,0.099862,8
7,0.00167,0.000159,0.003912,0.001892,15,{'n_neighbors': 15},0.842105,0.842105,0.631579,0.684211,0.947368,0.631579,0.631579,0.666667,0.777778,0.611111,0.726608,0.11129,9
8,0.001396,0.000107,0.002798,4.3e-05,17,{'n_neighbors': 17},0.842105,0.842105,0.578947,0.684211,0.947368,0.631579,0.631579,0.666667,0.777778,0.611111,0.721345,0.116769,11
9,0.001456,0.000141,0.003111,0.00039,19,{'n_neighbors': 19},0.842105,0.842105,0.631579,0.631579,0.947368,0.631579,0.526316,0.666667,0.833333,0.666667,0.72193,0.126878,10


# Tuning SVM

We can tune two key parameters of the SVM algorithm, the value of C (how much to relax the margin) and the type of kernel.


The default for SVM (the SVC class) is to use the Radial Basis Function (RBF) kernel with a C value set to 1.0.


Like with KNN, we will perform a grid search using 10-fold cross validation with a standardized copy of the training dataset.

In [None]:
# SVM algorithm tuning
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=10)
grid_result = grid.fit(rescaledX, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
ranks = grid_result.cv_results_['rank_test_score']
for mean, stdev, param, rank in zip(means, stds, params, ranks):
    print("#%d %f (%f) with: %r" % (rank, mean, stdev, param))

Best: 0.876901 using {'C': 2.0, 'kernel': 'rbf'}
#17 0.786257 (0.115016) with: {'C': 0.1, 'kernel': 'linear'}
#39 0.588304 (0.046652) with: {'C': 0.1, 'kernel': 'poly'}
#40 0.567251 (0.030917) with: {'C': 0.1, 'kernel': 'rbf'}
#29 0.732456 (0.071374) with: {'C': 0.1, 'kernel': 'sigmoid'}
#25 0.743275 (0.088174) with: {'C': 0.3, 'kernel': 'linear'}
#21 0.749415 (0.077096) with: {'C': 0.3, 'kernel': 'poly'}
#18 0.764327 (0.080270) with: {'C': 0.3, 'kernel': 'rbf'}
#20 0.754678 (0.073780) with: {'C': 0.3, 'kernel': 'sigmoid'}
#34 0.716374 (0.072312) with: {'C': 0.5, 'kernel': 'linear'}
#8 0.802924 (0.087331) with: {'C': 0.5, 'kernel': 'poly'}
#9 0.802339 (0.113247) with: {'C': 0.5, 'kernel': 'rbf'}
#23 0.744444 (0.111120) with: {'C': 0.5, 'kernel': 'sigmoid'}
#35 0.711404 (0.066831) with: {'C': 0.7, 'kernel': 'linear'}
#7 0.803216 (0.107355) with: {'C': 0.7, 'kernel': 'poly'}
#12 0.796784 (0.099941) with: {'C': 0.7, 'kernel': 'rbf'}
#22 0.749415 (0.099106) with: {'C': 0.7, 'kernel': 'sigm
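
The rank numbers in the output run up to 40 because the grid is the Cartesian product of ten C values and four kernels, and each combination is fitted once per fold. A quick sketch of that count, reusing the same value lists as the cell above:

```python
from itertools import product

c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ["linear", "poly", "rbf", "sigmoid"]
n_folds = 10

combinations = list(product(c_values, kernel_values))
print(len(combinations))            # 40 parameter combinations
print(len(combinations) * n_folds)  # 400 cross-validation fits

# The first few combinations, in the same order as the printed results.
print(combinations[:4])
```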

In [None]:
df=pd.DataFrame(grid_result.cv_results_)
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002944,0.000182,0.000787,4.3e-05,0.1,linear,"{'C': 0.1, 'kernel': 'linear'}",0.789474,0.894737,0.736842,0.947368,0.894737,0.526316,0.684211,0.777778,0.777778,0.833333,0.786257,0.115016,17
1,0.002988,0.00015,0.000874,4.8e-05,0.1,poly,"{'C': 0.1, 'kernel': 'poly'}",0.631579,0.578947,0.578947,0.526316,0.631579,0.526316,0.631579,0.555556,0.555556,0.666667,0.588304,0.046652,39
2,0.003953,0.000136,0.001069,5.8e-05,0.1,rbf,"{'C': 0.1, 'kernel': 'rbf'}",0.526316,0.578947,0.526316,0.578947,0.578947,0.578947,0.526316,0.555556,0.611111,0.611111,0.567251,0.030917,40
3,0.00412,0.001208,0.00099,0.000199,0.1,sigmoid,"{'C': 0.1, 'kernel': 'sigmoid'}",0.789474,0.736842,0.684211,0.736842,0.894737,0.631579,0.684211,0.666667,0.722222,0.777778,0.732456,0.071374,29
4,0.003448,0.000385,0.000701,6.6e-05,0.3,linear,"{'C': 0.3, 'kernel': 'linear'}",0.789474,0.736842,0.684211,0.894737,0.842105,0.631579,0.631579,0.722222,0.833333,0.666667,0.743275,0.088174,25
5,0.002933,0.000576,0.000831,0.000131,0.3,poly,"{'C': 0.3, 'kernel': 'poly'}",0.631579,0.789474,0.684211,0.842105,0.789474,0.684211,0.684211,0.777778,0.888889,0.722222,0.749415,0.077096,21
6,0.003908,0.000144,0.001071,8.9e-05,0.3,rbf,"{'C': 0.3, 'kernel': 'rbf'}",0.842105,0.789474,0.684211,0.736842,0.894737,0.631579,0.842105,0.666667,0.777778,0.777778,0.764327,0.08027,18
7,0.003738,0.000334,0.001058,0.000161,0.3,sigmoid,"{'C': 0.3, 'kernel': 'sigmoid'}",0.736842,0.789474,0.631579,0.789474,0.842105,0.631579,0.736842,0.722222,0.833333,0.833333,0.754678,0.07378,20
8,0.00508,0.001959,0.000885,0.00016,0.5,linear,"{'C': 0.5, 'kernel': 'linear'}",0.789474,0.736842,0.684211,0.789474,0.789474,0.578947,0.684211,0.722222,0.777778,0.611111,0.716374,0.072312,34
9,0.003296,0.000739,0.000952,0.000239,0.5,poly,"{'C': 0.5, 'kernel': 'poly'}",0.894737,0.842105,0.736842,0.736842,0.894737,0.736842,0.631579,0.777778,0.888889,0.888889,0.802924,0.087331,8


# Tuning Random Forest

In [None]:
# random forest classifier algorithm tuning
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
n_estimators =[10,20,30,40,50]
criterion = ["gini", "entropy", "log_loss"]  # "log_loss" needs scikit-learn >= 1.1; on older versions those fits fail and score nan
param_grid = dict(n_estimators=n_estimators,criterion=criterion)
model = RandomForestClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=10)
grid_result = grid.fit(rescaledX, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
ranks = grid_result.cv_results_['rank_test_score']
for mean, stdev, param, rank in zip(means, stds, params, ranks):
    print("#%d %f (%f) with: %r" % (rank, mean, stdev, param))

Best: 0.839474 using {'criterion': 'gini', 'n_estimators': 30}
#8 0.801754 (0.089014) with: {'criterion': 'gini', 'n_estimators': 10}
#10 0.792690 (0.087743) with: {'criterion': 'gini', 'n_estimators': 20}
#1 0.839474 (0.085333) with: {'criterion': 'gini', 'n_estimators': 30}
#6 0.802632 (0.097577) with: {'criterion': 'gini', 'n_estimators': 40}
#7 0.802047 (0.078861) with: {'criterion': 'gini', 'n_estimators': 50}
#5 0.808480 (0.107547) with: {'criterion': 'entropy', 'n_estimators': 10}
#9 0.797368 (0.068337) with: {'criterion': 'entropy', 'n_estimators': 20}
#4 0.812865 (0.068461) with: {'criterion': 'entropy', 'n_estimators': 30}
#2 0.818713 (0.077835) with: {'criterion': 'entropy', 'n_estimators': 40}
#3 0.813158 (0.071173) with: {'criterion': 'entropy', 'n_estimators': 50}
#11 nan (nan) with: {'criterion': 'log_loss', 'n_estimators': 10}
#12 nan (nan) with: {'criterion': 'log_loss', 'n_estimators': 20}
#13 nan (nan) with: {'criterion': 'log_loss', 'n_estimators': 30}
#14 nan (nan)

50 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 467, in fit
    for i, t in enumerate(trees)
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python3.7/
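
The `nan` rows come from the `log_loss` criterion, which scikit-learn accepts for random forests only from version 1.1 onward; the Python 3.7 paths in the traceback suggest an older release was installed. One way to guard against this is to build the criterion list from the installed version. The helper `forest_criteria` below is hypothetical and takes the version as a string so it can be checked without scikit-learn; in a notebook you would pass `sklearn.__version__`.

```python
def forest_criteria(sklearn_version):
    """Return split criteria that the given scikit-learn version supports."""
    # Only the major and minor parts are needed, e.g. "1.0.2" -> (1, 0).
    major, minor = (int(part) for part in sklearn_version.split(".")[:2])
    criteria = ["gini", "entropy"]
    if (major, minor) >= (1, 1):
        # "log_loss" was added as a criterion name in scikit-learn 1.1.
        criteria.append("log_loss")
    return criteria

print(forest_criteria("1.0.2"))  # ['gini', 'entropy']
print(forest_criteria("1.3.0"))  # ['gini', 'entropy', 'log_loss']
```

In versions that accept it, `log_loss` and `entropy` both refer to the Shannon information gain, so leaving `log_loss` out of the grid loses nothing here.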

In [None]:
df=pd.DataFrame(grid_result.cv_results_)
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.034073,0.003849,0.003663,0.000252,gini,10,"{'criterion': 'gini', 'n_estimators': 10}",0.842105,0.947368,0.684211,0.947368,0.842105,0.684211,0.736842,0.777778,0.777778,0.777778,0.801754,0.089014,8
1,0.064568,0.00966,0.005494,0.000792,gini,20,"{'criterion': 'gini', 'n_estimators': 20}",0.684211,0.894737,0.631579,0.736842,0.842105,0.736842,0.789474,0.833333,0.888889,0.888889,0.79269,0.087743,10
2,0.060052,0.003998,0.005218,0.000294,gini,30,"{'criterion': 'gini', 'n_estimators': 30}",0.894737,0.789474,0.631579,0.894737,0.894737,0.947368,0.842105,0.833333,0.888889,0.777778,0.839474,0.085333,1
3,0.074823,0.002686,0.008359,0.002228,gini,40,"{'criterion': 'gini', 'n_estimators': 40}",0.894737,0.842105,0.684211,0.894737,0.842105,0.736842,0.631579,0.722222,0.944444,0.833333,0.802632,0.097577,6
4,0.09282,0.003937,0.009589,0.002689,gini,50,"{'criterion': 'gini', 'n_estimators': 50}",0.789474,0.947368,0.684211,0.684211,0.842105,0.842105,0.842105,0.722222,0.833333,0.833333,0.802047,0.078861,7
5,0.024749,0.003441,0.002577,0.000331,entropy,10,"{'criterion': 'entropy', 'n_estimators': 10}",0.894737,0.789474,0.684211,0.894737,0.894737,0.578947,0.736842,0.833333,0.944444,0.833333,0.80848,0.107547,5
6,0.047955,0.003736,0.003983,0.000341,entropy,20,"{'criterion': 'entropy', 'n_estimators': 20}",0.789474,0.842105,0.684211,0.736842,0.894737,0.842105,0.684211,0.833333,0.833333,0.833333,0.797368,0.068337,9
7,0.069333,0.004059,0.005779,0.001325,entropy,30,"{'criterion': 'entropy', 'n_estimators': 30}",0.894737,0.789474,0.684211,0.842105,0.894737,0.789474,0.789474,0.722222,0.833333,0.888889,0.812865,0.068461,4
8,0.092052,0.005993,0.007243,0.001502,entropy,40,"{'criterion': 'entropy', 'n_estimators': 40}",0.894737,0.894737,0.631579,0.842105,0.842105,0.789474,0.736842,0.833333,0.833333,0.888889,0.818713,0.077835,2
9,0.11299,0.006361,0.008964,0.002953,entropy,50,"{'criterion': 'entropy', 'n_estimators': 50}",0.789474,0.842105,0.631579,0.789474,0.894737,0.842105,0.842105,0.777778,0.833333,0.888889,0.813158,0.071173,3


# Best Algorithm: Support Vector Machine

Training the model with **SVM**
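
The choice follows from comparing the best cross-validated accuracy of each grid search above. A small sketch of that comparison, with the scores copied from the three `Best` lines (the dictionary name is illustrative):

```python
# Best mean cross-validated accuracy reported by each grid search.
best_cv_accuracy = {
    "KNeighborsClassifier (n_neighbors=1)": 0.849708,
    "SVC (C=2.0, kernel='rbf')": 0.876901,
    "RandomForestClassifier (gini, 30 trees)": 0.839474,
}

# Print the candidates from best to worst.
for name, score in sorted(best_cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.4f}  {name}")

winner = max(best_cv_accuracy, key=best_cv_accuracy.get)
print(winner)  # SVC (C=2.0, kernel='rbf')
```

The gaps between the three are smaller than the fold-to-fold standard deviations reported for KNN and Random Forest (about 0.07 and 0.09), so the ranking is indicative rather than decisive.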

In [None]:
model = SVC(C=2.0,kernel='rbf')

In [None]:
#training the SVM with training data
model.fit(X_train, Y_train)

SVC(C=2.0)
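
Note that the grid search selected `C=2.0` on standardized features, while the model above is fitted on the raw features. One way to keep the two consistent is to bundle the scaler with the classifier, so the same scaling is applied at fit and predict time. This is a sketch on synthetic data from `make_classification` (an assumption standing in for the sonar set), not a rerun of the notebook, so its score says nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=1
)

# The scaler is fitted on the training data only and reused for prediction.
svm_pipeline = make_pipeline(StandardScaler(), SVC(C=2.0, kernel="rbf"))
svm_pipeline.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, svm_pipeline.predict(X_te))
print(demo_accuracy)
```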

Model Evaluation

In [None]:
# Accuracy on the training data.
# Training accuracy is usually higher than test accuracy because the model has already seen the training data.
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy on training data : ', training_data_accuracy)

Accuracy on training data :  0.9411764705882353


In [None]:
# Accuracy on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # argument order is (y_true, y_pred)

In [None]:
print('Accuracy on test data : ', test_data_accuracy)

Accuracy on test data :  0.8095238095238095
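
With only 21 test samples, the test accuracy is a coarse estimate. The worked arithmetic below recovers the number of correct predictions behind the two figures above and gives a rough binomial standard error for the test score; treat the standard error as a back-of-the-envelope indication, not a formal confidence interval.

```python
from math import sqrt

n_train, n_test = 187, 21
train_accuracy = 0.9411764705882353
test_accuracy = 0.8095238095238095

# Number of correctly classified samples behind each accuracy.
print(round(train_accuracy * n_train))  # 176
print(round(test_accuracy * n_test))    # 17

# One test sample changes the accuracy by 1/21.
print(round(1 / n_test, 4))  # 0.0476

# Approximate binomial standard error of the test accuracy.
standard_error = sqrt(test_accuracy * (1 - test_accuracy) / n_test)
print(round(standard_error, 3))  # 0.086
```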


Making a Predictive System

In [None]:
input_data = (0.0307,0.0523,0.0653,0.0521,0.0611,0.0577,0.0665,0.0664,0.1460,0.2792,0.3877,0.4992,0.4981,0.4972,0.5607,0.7339,0.8230,0.9173,0.9975,0.9911,0.8240,0.6498,0.5980,0.4862,0.3150,0.1543,0.0989,0.0284,0.1008,0.2636,0.2694,0.2930,0.2925,0.3998,0.3660,0.3172,0.4609,0.4374,0.1820,0.3376,0.6202,0.4448,0.1863,0.1420,0.0589,0.0576,0.0672,0.0269,0.0245,0.0190,0.0063,0.0321,0.0189,0.0137,0.0277,0.0152,0.0052,0.0121,0.0124,0.0055)
data_array = np.asarray(input_data)
reshaped_data = data_array.reshape(1, -1)  # one row; -1 lets NumPy infer the number of columns
prediction = model.predict(reshaped_data)
print(prediction)
if prediction[0] == 'R':  # prediction[0] is the first (and only) predicted label
    print('This shows that the object is a Rock')
else:
    print('This shows that the object is a mine')


['M']
This shows that the object is a mine
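
The same steps can be wrapped in a small function that checks the input length before predicting. The function name `classify_reading` and the stand-in `AlwaysMine` model are hypothetical; in the notebook you would pass the fitted `model` from above. Passing a list that holds one row plays the same role as `reshape(1, -1)`.

```python
def classify_reading(model, reading, n_features=60):
    """Return 'Rock' or 'Mine' for one sonar reading of n_features values."""
    if len(reading) != n_features:
        raise ValueError(f"expected {n_features} values, got {len(reading)}")
    # Estimators expect a 2-D input: a list holding one row.
    label = model.predict([list(reading)])[0]
    return "Rock" if label == "R" else "Mine"

class AlwaysMine:
    """Stand-in for a fitted classifier, used only to exercise the function."""
    def predict(self, rows):
        return ["M" for _ in rows]

print(classify_reading(AlwaysMine(), [0.0] * 60))  # Mine
```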
