
Application of machine learning to classify the Sonar data.




Introduction

This project is related to the course Applied Machine Learning (Basic) of the PhD in Physics. It consists of the application of machine learning to the Sonar dataset, in order to discriminate between rocks and mines. This dataset is widely used by machine learning practitioners to evaluate the capabilities of their algorithms.

Since this project is related to the basic part of the course, only basic machine learning algorithms for binary classification will be used, and neural networks will not be considered.

Software setup and run

Setup

  1. Download and unzip the repository.

  2. Once the repository is downloaded and unzipped, cd into it and enter:

source setup.sh

If it is the first time you run it, a virtual environment venv will be created and activated, and the prerequisite Python modules will be installed. From the second time on, the script will simply activate the virtual environment.

⚠️ Be sure to have installed the virtualenv package with the pip install virtualenv command.

  3. Download the dataset from here and move it into the data directory:
mkdir -p data
mv path/to/dataset data

Run

First of all, cd into the src directory.

To run the data preprocessing part:

./all_analysis.sh processing

To run the modelling part:

./all_analysis.sh modelling

To run the final results extraction:

./all_analysis.sh results

To run the entire analysis:

./all_analysis.sh

Data preprocessing

Data are distributed over 60 feature columns, plus one containing the labels to be used for classification. All feature values lie in the range between 0 and 1. The first 6 columns are printed as an example (a minimal loading sketch follows the preview):

         F0      F1      F2      F3      F4      F5
0    0.0200  0.0371  0.0428  0.0207  0.0954  0.0986
1    0.0453  0.0523  0.0843  0.0689  0.1183  0.2583
2    0.0262  0.0582  0.1099  0.1083  0.0974  0.2280
3    0.0100  0.0171  0.0623  0.0205  0.0205  0.0368
4    0.0762  0.0666  0.0481  0.0394  0.0590  0.0649
..      ...     ...     ...     ...     ...     ...
203  0.0187  0.0346  0.0168  0.0177  0.0393  0.1630
204  0.0323  0.0101  0.0298  0.0564  0.0760  0.0958
205  0.0522  0.0437  0.0180  0.0292  0.0351  0.1171
206  0.0303  0.0353  0.0490  0.0608  0.0167  0.1354
207  0.0260  0.0363  0.0136  0.0272  0.0214  0.0338
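A minimal sketch of how such a preview can be produced with pandas is shown below; the file name data/sonar.all-data and the F0..F59 column naming are assumptions, not taken from the project code:

# Minimal sketch: load the Sonar dataset and preview the first feature columns.
# NOTE: the file name "data/sonar.all-data" and the F0..F59 column names are assumptions.
import pandas as pd

data = pd.read_csv("data/sonar.all-data", header=None)
data.columns = [f"F{i}" for i in range(60)] + ["Label"]  # 60 features + 1 label column

print(data.iloc[:, :6])  # first 6 feature columns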

The following procedures have been applied, in this order, for data preprocessing (a sketch is shown after the list):

  • Feature selection: through the SelectKBest algorithm, which selects features according to the k-highest scores. From this step it emerged that only 14 out of 60 features are really important.
  • Data standardization: through the StandardScaler, which standardizes features by removing the mean and scaling to unit variance.
  • Data normalization: through the Normalizer algorithm, which normalizes samples individually to unit norm.
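A minimal sketch of this preprocessing chain with scikit-learn, assuming the DataFrame data from the loading sketch above, k=14 selected features and f_classif as the scoring function (the latter is an assumption), could be:

# Sketch of the preprocessing chain described above (not the project's exact code).
# Assumptions: f_classif as the SelectKBest scoring function, k=14 selected features.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler, Normalizer

X = data.drop(columns=["Label"]).values
y = data["Label"].values

X_sel = SelectKBest(score_func=f_classif, k=14).fit_transform(X, y)  # feature selection
X_std = StandardScaler().fit_transform(X_sel)                        # standardization
X_pre = Normalizer().fit_transform(X_std)                            # normalization to unit norm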

Some control plots used for feature exploration have then been produced after data manipulation:

Histograms

Density Plots

Scatter Matrix

Box Plots

Box plots show some outliers in features 0, 11, 12 and 13.

Correlation Matrix

Modelling

Modelling studies have been performed on the processed data obtained with the procedure described in the previous section. First of all, data have been split into training and test sets using the ShuffleSplit cross-validator, which creates random splits of the data and repeats the splitting and evaluation process multiple times.
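A minimal sketch of this splitting, assuming the preprocessed arrays X_pre and y from the previous section and illustrative values for the number of splits and the test size, could be:

# Sketch of the train/test splitting with ShuffleSplit (parameter values are illustrative).
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)
for train_idx, test_idx in cv.split(X_pre):
    X_train, X_test = X_pre[train_idx], X_pre[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]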

Several models have been used to perform the classification; they are presented in the following sections.

Hyperparametrization

Hyperparameter optimization through the GridSearchCV algorithm has been used, in order to choose the best parameter combination for each model and improve its score.

These combinations have been used to fine-tune the following models (a GridSearchCV sketch follows the list):

  • LogisticRegression(max_iter=500, penalty='l1', solver='liblinear').
  • KNeighborsClassifier(metric='manhattan', n_neighbors=1).
  • DecisionTreeClassifier(max_depth=10).
  • SVC(degree=1, gamma='auto', probability=True).
  • RandomForestClassifier(n_jobs=2, random_state=1).
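As an example, a minimal GridSearchCV sketch for one of these models, assuming the cross-validator cv and the arrays X_pre, y from the previous sketches, could be (the parameter grid shown here is illustrative, not the one scanned in the project):

# Sketch of hyperparameter optimization with GridSearchCV (grid values are illustrative).
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": [1, 3, 5, 7], "metric": ["euclidean", "manhattan"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
search.fit(X_pre, y)
print(search.best_params_)  # e.g. {'metric': 'manhattan', 'n_neighbors': 1}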

⚠️ Hyperparametrization is disabled by default, since it has been used only during the code development. To enable it you can open the src/all_analysis.sh script and change the line --hyperparametrization=off into --hyperparametrization=on; note that this may slow down each model's execution a bit.

Metrics comparison

For each model the following metrics are computed (a sketch of their evaluation follows the list):

  • Accuracy.
  • Negative log-loss.
  • Area under the ROC curve (AUC).
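A minimal sketch of how these metrics can be evaluated with cross_val_score, using the standard scikit-learn scoring names and one model as an example (assuming X_pre, y and cv from the previous sketches), could be:

# Sketch of the metrics evaluation (scoring names are the standard scikit-learn ones).
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

model = SVC(degree=1, gamma="auto", probability=True)
for scoring in ("accuracy", "neg_log_loss", "roc_auc"):
    scores = cross_val_score(model, X_pre, y, cv=cv, scoring=scoring)
    print(f"{scoring}: {scores.mean():.3f} +/- {scores.std():.3f}")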

Results for each metric are shown below:

Accuracy

Area Under the ROC Curve (AUC)

Negative log-loss

With such high scores it is worth checking whether some overfitting is occurring; therefore, learning curves need to be plotted.

Learning curves

Learning curves of the accuracy metric for the training vs cross-validation sets are provided below as a cross-check to investigate possible overfitting of each model (a sketch of how such a curve can be produced is shown right below):
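A minimal sketch of how a learning curve for one model can be produced with scikit-learn, assuming X_pre, y and cv from the previous sketches (plotting details are illustrative), could be:

# Sketch of a learning curve for one model (plotting details are illustrative).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=500, penalty="l1", solver="liblinear")
sizes, train_scores, cv_scores = learning_curve(
    model, X_pre, y, cv=cv, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 5)
)
plt.plot(sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(sizes, cv_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("Training examples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()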

LogisticRegression

The training and cross-validation accuracy approach each other as the number of training samples increases. So, even with more training examples, the situation will not improve much: fit times will increase, but accuracy will not improve significantly.

DecisionTreeClassifier

The training score doesn't change much when adding more examples, while the cross-validation score definitely does. This means that adding more examples beyond the ones we currently have is probably not required.

GaussianNB

The training score is very high at the beginning and slightly decreases, while the cross-validation score is a bit low at the beginning and consistently increases. Both of them are pretty good at the end.

KNeighborsClassifier

The training score doesn't change much when adding more examples, while the cross-validation score definitely does. This means that adding more examples beyond the ones we currently have is probably not required.

LinearDiscriminantAnalysis

The training score is very high at the beginning and slightly decreases, while the cross-validation score is a bit low at the beginning and consistently increases. Both of them are pretty good at the end.

LogisticRegression

The cross-validation score seems to increase with the number of training examples, so a larger number of samples could probably improve the accuracy.

RandomForestClassifier

The training score doesn't change much when adding more examples, while the cross-validation score definitely does. This means that adding more examples beyond the ones we currently have is probably not required.

SVC

The training score doesn't change much when adding more examples, while the cross-validation score definitely does. This means that adding more examples beyond the ones we currently have is probably not required.

Final results

Final results are obtained by computing the accuracy over the test set produced by each split of the ShuffleSplit cross-validator; the results are then averaged for each model (a sketch is shown below).
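A minimal sketch of this averaging, assuming X_pre, y and cv from the previous sketches and a hypothetical models dictionary (not part of the project code), could be:

# Sketch of the final accuracy averaging over the ShuffleSplit test sets.
# NOTE: the "models" dictionary is a hypothetical container, not part of the project code.
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

models = {"SVC": SVC(degree=1, gamma="auto", probability=True)}
for name, model in models.items():
    accuracies = []
    for train_idx, test_idx in cv.split(X_pre):
        model.fit(X_pre[train_idx], y[train_idx])
        accuracies.append(accuracy_score(y[test_idx], model.predict(X_pre[test_idx])))
    print(f"{name}: {100 * sum(accuracies) / len(accuracies):.3f}%")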

Final results are the following:

  • LogisticRegression: 98.725%
  • LinearDiscriminantAnalysis: 95.913%
  • KNeighborsClassifier: 99.797%
  • DecisionTreeClassifier: 99.493%
  • GaussianNB: 92.217%
  • SVC: 99.594%
  • RandomForestClassifier: 99.594%
