# __Create a machine learning model capable of detecting Rock or Mine__

## based on the response of the 60 separate sonar frequencies.


### __Data Source:__
[https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks))

#### **Data Description**

---

The file **"sonar.mines"** contains **111 patterns** obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The file **"sonar.rocks"** contains **97 patterns** obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.

**Each pattern** is a **set of 60 numbers** in the **range 0.0 to 1.0**. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.

The label associated with each record contains the letter **"R"** if the object is a **rock** and **"M"** if it is a **mine** (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.


#### **Abstract**

|Info                      | Answer        |
|--------------------------|---------------|
|Data Set Characteristics: | Multivariate  | 
|Attribute Characteristics:| Real          |
|Associated Tasks:         | Classification| 
|Number of Web Hits:       | 213249        |
|Number of Instances:      | 208           |
|Area:                     | Physical      |
|Number of Attributes:     | 60            |
|Date Donated:             | N/A           |
|Missing Values?           | N/A           |

## **Table of contents**

* ### **Part 1 - Data Preprocessing** 
    1. Importing Libraries
    2. Importing Datasets
    3. Exploratory Data Analysis (EDA)
    4. Data visualization
    5. Determining the Features and the Target Variable
    6. Splitting the Data to Train & Test
    7. Feature Scaling
    
    
* ### **Part 2 - Building and Training the Classification model**
 
    8. Training the model
    9. Prediction
    10. Evaluating the Model
    11. Selecting the best K
    12. Spot check some algorithms
    13. Algorithm Tuning: KNN shows as the most promising option
    14. Ensembles
    15. Finalizing the model


# __Part 1 - Data Preprocessing__

Install __*'pandas-profiling'*__ and __*'sweetviz'*__, which generate reports that help with EDA.

In [None]:
# EDA using Pandas Profiling
!pip install pandas_profiling

In [None]:
# EDA using Sweetviz
# !pip install sweetviz

## __1. Importing Libraries__

In [None]:
# Load libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pandas_profiling as pp
# import sweetviz as sv
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report, balanced_accuracy_score



# Color Palette

custom_colors = ["#85CEDA","#D2A7D8", "#A67BC5", "#BB1C8B", "#05A4C0"]
# sns.set_palette returns None, so keep the palette object in its own variable
customPalette = sns.color_palette(custom_colors)
sns.set_palette(customPalette)

# Set size

sns.palplot(sns.color_palette(custom_colors),size=1)
plt.tick_params(axis='both', labelsize=0, length = 0)


## **2. Importing Datasets**

In [None]:
df = pd.read_csv('../input/connectionist-bench-sonar-mines-vs-rocks-uci/sonar.all-data-uci.csv')

## **3. Exploratory Data Analysis (EDA)**

In [None]:
# shape
df.shape

In [None]:
# info
df.info()

__The dataset has 208 samples and 60 features + the target variable (Label).__ All features are floats and the target is of type object.

In [None]:
# type
# pd.set_option('display.max_row', 500)
df.dtypes

In [None]:
# peek at data
# pd.set_option('display.width', 100)
df.head(20)

In [None]:
# describe data
df.describe()

In [None]:
# class distribution
df.groupby(by='Label').size()

__The sonar dataset has 111 mines and 97 rocks.__ 
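These class counts give a useful reference point for the accuracy figures reported later: a model that always predicts the majority class would already be right a little over half the time. A minimal sketch using only the counts stated above (the variable names are illustrative):

```python
# Class counts as stated above: 111 mines ("M") and 97 rocks ("R")
n_mines = 111
n_rocks = 97
n_total = n_mines + n_rocks

# A classifier that always predicts the majority class ("M")
# is right for every mine and wrong for every rock.
baseline_accuracy = max(n_mines, n_rocks) / n_total

print(n_total)                      # 208
print(round(baseline_accuracy, 3))  # 0.534
```

Any model worth keeping should clearly beat this baseline.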

#### __Dataset information (Pandas Profiling & Sweetviz Report)__

* #### Pandas Profile

In [None]:
pp.ProfileReport(df,title= 'Pandas Profile report of "Sonar" dataset', html= {'style':{'full_width': True}})

* #### Sweetviz Report

In [None]:
# Uncomment the lines below to show the result
# report = sv.analyze(df)
# report.show_html("Sonar_EDA_report.html") # specify a name for the report
# report

## __4. Data Visualization📊📈__

In [None]:
# Histogram
df.hist(sharex= False, sharey= False, xlabelsize= 1, ylabelsize= 1, figsize=(12,12))
plt.show()

In [None]:
# density
df.plot(kind='density', subplots=True, layout=(8,8), sharex=False, legend=False, fontsize=1, figsize=(12,12))
plt.show()

In [None]:
# correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
# correlate the numeric features only; the 'Label' column holds strings
cax = ax.matshow(df.drop('Label', axis=1).corr(), vmin=-1, vmax=1, interpolation='none')
fig.colorbar(cax)
fig.set_size_inches(10,10)
plt.show()

## __5. Determining the Features and the Target Variable__

In [None]:
X = df.drop('Label', axis= 1)
y = df['Label']

## __6. Splitting the Data to Train & Test__

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
# test options
num_folds = 10
seed = 101
scoring = 'accuracy'

## __7. Feature Scaling__

__Feature scaling matters for KNN because its predictions depend on distances between samples, so features with larger ranges would otherwise dominate.__

In [None]:
scaler= StandardScaler()
scaler.fit(X_train)
scaled_X_train= scaler.transform(X_train)
scaled_X_test= scaler.transform(X_test)
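To illustrate why scaling can change what KNN predicts, here is a small sketch on toy data. The arrays `X_toy`, `y_toy` and `query` are hypothetical and deliberately exaggerated: one feature spans roughly 0 to 1 and the other roughly 0 to 1000. The sonar features already share the 0.0 to 1.0 range, so the effect on the real data is likely milder.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical, not from the sonar set): feature 0 spans about 0..1,
# feature 1 spans about 0..1000, so raw distances are dominated by feature 1.
X_toy = np.array([[0.1, 560.0], [0.9, 510.0], [0.0, 0.0], [1.0, 1000.0]])
y_toy = np.array(["M", "R", "M", "R"])
query = np.array([[0.1, 500.0]])

# 1-NN on the raw features
raw_pred = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy).predict(query)[0]

# 1-NN after standardizing with statistics learned from the training points
toy_scaler = StandardScaler().fit(X_toy)
scaled_pred = (
    KNeighborsClassifier(n_neighbors=1)
    .fit(toy_scaler.transform(X_toy), y_toy)
    .predict(toy_scaler.transform(query))[0]
)

print(raw_pred, scaled_pred)  # R M
```

On the raw features the query's nearest neighbour is the point that is close only in the large-range feature. After standardization both features contribute comparably and the nearest neighbour changes.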

# **Part 2 - Building and Training the Classification model**

## __8. Training the Model__

__K nearest neighbors (KNN)__ assigns a label to a new sample based on the __labels of the K training samples closest to it, as measured by distance.__

In [None]:
knn_model= KNeighborsClassifier(n_neighbors=1)
knn_model.fit(scaled_X_train, y_train)
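With `n_neighbors=1`, the rule reduces to "copy the label of the single closest training sample". A minimal NumPy sketch of that rule, assuming Euclidean distance; the function `predict_1nn` and the toy arrays are illustrative and not part of scikit-learn:

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Label each row of X_new with the label of its closest training row."""
    # Pairwise Euclidean distances, shape (n_new, n_train)
    diffs = X_new[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the nearest training sample for every new sample
    nearest = distances.argmin(axis=1)
    return y_train[nearest]

# Toy data (hypothetical): two "R" points near the origin, two "M" points near (1, 1)
X_known = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y_known = np.array(["R", "R", "M", "M"])
X_unknown = np.array([[0.2, 0.1], [0.8, 0.9]])

print(predict_1nn(X_known, y_known, X_unknown))  # ['R' 'M']
```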

## __9. Prediction__

In [None]:
y_pred= knn_model.predict(scaled_X_test)

In [None]:
# Compare predicted values with actual values

pd.DataFrame({'Y_Test':y_test, 'Y_Pred': y_pred})

## __10. Evaluating the Model__

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))
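The accuracy score and the confusion matrix are two views of the same counts: accuracy is the sum of the matrix diagonal divided by the total number of samples. A small sketch on toy labels (hypothetical values, not this notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical), not the notebook's actual predictions
y_true_demo = ["M", "M", "M", "R", "R", "R", "R", "M"]
y_pred_demo = ["M", "R", "M", "R", "R", "M", "R", "M"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["M", "R"])
print(cm)
# [[3 1]
#  [1 3]]

# Accuracy is the diagonal (correct predictions) divided by the total count
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true_demo, y_pred_demo))  # 0.75 0.75
```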

## __11. Selecting the best K__

* #### __1st Method: Elbow Method__

In [None]:
test_error_rate= []


for k in range(1, 30):
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(scaled_X_train, y_train)
    
    y_pred_test = knn_model.predict(scaled_X_test)
    
    test_error = 1 - accuracy_score(y_test, y_pred_test)
    test_error_rate.append(test_error)
    
test_error_rate

In [None]:
plt.figure(figsize=(6, 4), dpi = 200)
plt.plot(range(1, 30), test_error_rate, label='Test Error')
plt.legend()
plt.ylabel('Error Rate')
plt.xlabel('K Value')
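Besides reading the curve by eye, the K with the lowest error can be pulled out of the list programmatically. A minimal sketch with made-up error rates (the numbers in `demo_error_rate` are illustrative, not notebook output). Keep in mind that choosing K on the test split tends to make the later test score optimistic, which is one reason to prefer the cross-validated approach in the second method.

```python
import numpy as np

# Hypothetical error rates for K = 1..6 (illustrative numbers, not notebook output)
demo_error_rate = [0.10, 0.14, 0.12, 0.18, 0.20, 0.22]
k_range = range(1, len(demo_error_rate) + 1)

# np.argmin returns a 0-based index, so map it back to the K it belongs to
best_k = k_range[int(np.argmin(demo_error_rate))]
print(best_k)  # 1
```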

* #### __2nd Method: Grid Search Cross Validation (Pipeline application)__

In [None]:
scaler= StandardScaler()
knn= KNeighborsClassifier()

In [None]:
operations= [('scaler', scaler), ('knn', knn)]
pipe= Pipeline(operations)
k_values= list(range(1, 20))
param_grid= {'knn__n_neighbors': k_values}
full_cv_classifier= GridSearchCV(pipe, param_grid, cv=5, scoring= scoring)
full_cv_classifier.fit(X_train, y_train)

In [None]:
full_cv_classifier.best_estimator_.get_params()
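`get_params()` lists every parameter of the fitted pipeline, so the chosen K is buried in a long dictionary. A fitted `GridSearchCV` also exposes `best_params_` and `best_score_`, which report only the winning grid entry and its mean cross-validated score. A sketch on synthetic stand-in data (the `make_classification` data and the `demo_*` names are assumptions for illustration, not the sonar set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): same shape as the sonar set, different values
X_demo, y_demo = make_classification(n_samples=208, n_features=60, random_state=101)

# Scaler and classifier in one pipeline, so scaling is refit inside each CV fold
demo_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
demo_grid = GridSearchCV(
    demo_pipe,
    {"knn__n_neighbors": list(range(1, 20))},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# The winning K and its mean cross-validated accuracy
print(demo_grid.best_params_)
print(demo_grid.best_score_)
```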

## __12. Spot check some algorithms__

In [None]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))

In [None]:
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# compare algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,6)
plt.show()

In [None]:
# standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()), ('LR', LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()), ('LDA', LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()), ('SVM', SVC())])))

In [None]:
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle= True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# compare scaled algorithms
fig = plt.figure()
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,6)
plt.show()

## __13. Algorithm Tuning: KNN shows as the most promising option__

In [None]:
# KNN algorithm tuning
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
neighbors = [1,2,3,5,7,9,11,13,15,17,19,21]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier()
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle= True)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, y_train)

In [None]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
ranks = grid_result.cv_results_['rank_test_score']
for mean, stdev, param, rank in zip(means, stds, params, ranks):
    print("#%d %f (%f) with: %r" % (rank, mean, stdev, param))

In [None]:
plt.figure(figsize=(6, 4), dpi = 200)
# `means` holds mean cross-validated accuracy per K, not an error rate
plt.plot(neighbors, means, label='Mean CV Accuracy')
plt.legend()
plt.ylabel('Accuracy')
plt.xlabel('K Value')

__KNN's best cross-validated accuracy is 84.9%. (But what about variance? KNN seemed to show a tighter variance during spot checking.)__

Let's try some ensemble methods. The data is not standardized this time, because apparently all four ensembles we are using are based on decision trees and are thus less sensitive to data distributions. (Ok. Nice tip!)
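The reasoning behind skipping standardization can be checked on a single decision tree: its splits depend only on how the values of each feature are ordered, and standardization preserves that order. A sketch on synthetic data (the `make_classification` data and all variable names are assumptions for illustration). The agreement is expected to be at or very near 1.0, although tiny floating-point differences could in principle flip an occasional prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split by position into fit and evaluation parts
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)
X_fit, X_eval = X_syn[:150], X_syn[150:]
y_fit = y_syn[:150]

syn_scaler = StandardScaler().fit(X_fit)

# Same tree settings, once on raw features and once on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(syn_scaler.transform(X_fit), y_fit)

# Fraction of evaluation samples on which both trees predict the same class
agreement = np.mean(
    tree_raw.predict(X_eval) == tree_scaled.predict(syn_scaler.transform(X_eval))
)
print(agreement)
```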

## __14. Ensembles__

In [None]:
# ensembles
ensembles = []
# Boosting methods
ensembles.append(('AB', AdaBoostClassifier()))
ensembles.append(('GBM', GradientBoostingClassifier()))
# Bagging methods
ensembles.append(('RF', RandomForestClassifier()))
ensembles.append(('ET', ExtraTreesClassifier()))

In [None]:
results = []
names = []
for name, model in ensembles:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle= True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# compare ensemble algorithms
fig = plt.figure()
fig.suptitle('Ensemble Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,6)
plt.show()

ET might be worthy of further study.

## __15. Finalizing the model__

In [None]:
scaler= StandardScaler()
knn1= KNeighborsClassifier(n_neighbors= 1)
operations= [('scaler', scaler), ('knn1', knn1)]

In [None]:
pipe= Pipeline(operations)
pipe.fit(X_train, y_train)

In [None]:
# estimate accuracy on the held-out test set
predictions = pipe.predict(X_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))