This project is part of the Applied Machine Learning (Basic) course of the PhD in Physics programme. It applies machine learning to classify the Sonar data and discriminate between rocks and mines. This dataset is widely used by machine learning practitioners to evaluate the capabilities of their algorithms.
Since this project covers the basic part of the course, only basic machine learning algorithms for binary classification are used; neural networks are not considered.
- Download and unzip the repository.
- Once the repository is downloaded and unzipped, `cd` into it and enter:

  ```shell
  source setup.sh
  ```

  If it is the first time you run it, a virtual environment `venv` will be created and activated, and the prerequisite Python modules will be installed. From the second time on, the script will simply activate the virtual environment.

  ⚠️ be sure to have installed the `virtualenv` package with the `pip install virtualenv` command.
- Download the dataset from here and move it into the `data` directory:

  ```shell
  mkdir -p data
  mv path/to/dataset data
  ```
First of all, `cd` into the `src` directory.

To run the data preprocessing part:

```shell
./all_analysis.sh processing
```

To run the modelling part:

```shell
./all_analysis.sh modelling
```

To run the final results extraction:

```shell
./all_analysis.sh results
```

To run the entire analysis:

```shell
./all_analysis.sh
```
Data are distributed over 60 feature columns plus one containing the labels to be used for classification. All values lie in the range between 0 and 1. The first 6 columns are printed as an example:
```
     F0      F1      F2      F3      F4      F5
0    0.0200  0.0371  0.0428  0.0207  0.0954  0.0986
1    0.0453  0.0523  0.0843  0.0689  0.1183  0.2583
2    0.0262  0.0582  0.1099  0.1083  0.0974  0.2280
3    0.0100  0.0171  0.0623  0.0205  0.0205  0.0368
4    0.0762  0.0666  0.0481  0.0394  0.0590  0.0649
..      ...     ...     ...     ...     ...     ...
203  0.0187  0.0346  0.0168  0.0177  0.0393  0.1630
204  0.0323  0.0101  0.0298  0.0564  0.0760  0.0958
205  0.0522  0.0437  0.0180  0.0292  0.0351  0.1171
206  0.0303  0.0353  0.0490  0.0608  0.0167  0.1354
207  0.0260  0.0363  0.0136  0.0272  0.0214  0.0338
```
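A minimal sketch of how such a table can be obtained with pandas; the file name `data/sonar.all-data` and the F0, F1, … column naming are assumptions based on the layout above:

```python
import pandas as pd

def load_sonar(path="data/sonar.all-data"):
    """Load the Sonar CSV: 60 feature columns plus 1 label column, no header.

    The file name is an assumption; adjust it to the actual dataset file.
    """
    df = pd.read_csv(path, header=None)
    # Name the feature columns F0, F1, ... and the last column 'label',
    # matching the naming used in the table above.
    df.columns = [f"F{i}" for i in range(df.shape[1] - 1)] + ["label"]
    return df
```

Printing `load_sonar().iloc[:, :6]` would then reproduce a table like the one shown above.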
The following procedures have been applied, in this order, for data preprocessing:

- Feature selection: through the `SelectKBest` algorithm, which selects features according to the k highest scores. From this step it has been realized that only 14 of the 60 features are really important.
- Data standardization: through the `StandardScaler`, which standardizes each feature by removing the mean and scaling to unit variance.
- Data normalization: through the `Normalizer` algorithm, which normalizes each sample individually to unit norm.
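The three steps above can be sketched as a scikit-learn pipeline. `k=14` follows the feature count reported above; the `f_classif` score function is an assumption, as the repository may use a different one:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Preprocessing pipeline: keep the 14 best features, standardize them,
# then rescale each sample to unit norm.
preprocess = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=14)),
    ("standardize", StandardScaler()),
    ("normalize", Normalizer()),
])
```

Applied as `X_proc = preprocess.fit_transform(X, y)`; the labels `y` are needed because `SelectKBest` scores features against them.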
Some control plots used for feature exploration have then been produced after data manipulation:
Box plots show some outliers in features 0, 11, 12 and 13.
Modelling studies have been performed on the processed data obtained with the procedure described in the previous section. First of all, data have been split into training and test sets using the `ShuffleSplit` cross-validator, which creates a random split of the data and repeats the splitting and evaluation process multiple times.
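A minimal sketch of the splitting strategy just described; the `n_splits`, `test_size` and `random_state` values are illustrative assumptions:

```python
from sklearn.model_selection import ShuffleSplit

# ShuffleSplit draws several independent random train/test partitions,
# so each model is fitted and evaluated multiple times.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)

# Each iteration yields index arrays for one random split:
# for train_idx, test_idx in cv.split(X):
#     model.fit(X[train_idx], y[train_idx])
#     score = model.score(X[test_idx], y[test_idx])
```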
Several models have been used to perform the classification:

- `LogisticRegression`
- `LinearDiscriminantAnalysis`
- `KNeighborsClassifier`
- `DecisionTreeClassifier`
- `GaussianNB`
- `SVC`
- `RandomForestClassifier`
Hyperparameter tuning using the `GridSearchCV` algorithm has been performed in order to choose the best parameter combination for each model and improve its score.
These combinations have been used to fine-tune the following models:

- `LogisticRegression(max_iter=500, penalty='l1', solver='liblinear')`
- `KNeighborsClassifier(metric='manhattan', n_neighbors=1)`
- `DecisionTreeClassifier(max_depth=10)`
- `SVC(degree=1, gamma='auto', probability=True)`
- `RandomForestClassifier(n_jobs=2, random_state=1)`
⚠️ hyperparameter tuning is disabled by default, since it was used only during code development. To enable it, open the `src/all_analysis.sh` script and change the line `--hyperparametrization=off` into `--hyperparametrization=on`; note that this may slow down each model execution a bit.
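The grid search can be sketched for one of the models; the parameter grid below is an illustrative assumption, as the actual grids used in the repository may differ:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for two SVC hyperparameters (illustrative grid).
param_grid = {
    "gamma": ["scale", "auto"],
    "degree": [1, 2, 3],
}

# GridSearchCV fits the model for every combination in the grid and
# keeps the one with the best cross-validated accuracy.
search = GridSearchCV(SVC(probability=True), param_grid,
                      scoring="accuracy", cv=5)
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
```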
For each model the following metrics are computed:
- Accuracy.
- Negative log-loss.
- Area under the ROC curve (AUC).
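A minimal sketch of how the three metrics can be computed with cross-validation, using the standard scikit-learn scorer names; the cross-validator settings are the same illustrative assumptions as above:

```python
from sklearn.model_selection import ShuffleSplit, cross_val_score

def evaluate(model, X, y):
    """Mean cross-validated score of the three metrics listed above."""
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)
    # 'accuracy', 'neg_log_loss' and 'roc_auc' are the scikit-learn
    # scorer identifiers for the three metrics.
    return {
        metric: cross_val_score(model, X, y, cv=cv, scoring=metric).mean()
        for metric in ("accuracy", "neg_log_loss", "roc_auc")
    }
```

Note that the log-loss scorer is negated (higher is better), which is why the table below reports it as "negative log-loss".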
Results for each metric are shown below:
Accuracy
Area Under the ROC Curve (AUC)
Negative log-loss
With such high scores it is worth checking whether some kind of overfitting has occurred, so learning curves need to be plotted.
Learning curves of the accuracy metric for training vs test sets are provided below as a cross-check to investigate possible overfitting of each model:
The training and cross-validation accuracy approach each other as the number of training samples grows. So, even with more training examples, the situation would not improve much: fit times would increase, but we would not obtain a big improvement in accuracy.
The training score doesn’t change much by adding more examples. But the cross-validation score definitely does. This means that adding more examples over the ones we currently have is probably not required.
The training score is very high at the beginning and slightly decreases, while the cross-validation score is a bit low at the beginning and consistently increases. Both of them are pretty good at the end.
The training score doesn’t change much by adding more examples. But the cross-validation score definitely does. This means that adding more examples over the ones we currently have is probably not required.
The training score is very high at the beginning and slightly decreases, while the cross-validation score is a bit low at the beginning and consistently increases. Both of them are pretty good at the end.
The cross-validation score seems to increase with the number of training examples, so, probably, a larger number of samples could improve the accuracy.
The training score doesn’t change much by adding more examples. But the cross-validation score definitely does. This means that adding more examples over the ones we currently have is probably not required.
The training score doesn’t change much by adding more examples. But the cross-validation score definitely does. This means that adding more examples over the ones we currently have is probably not required.
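The curves commented on above can be produced with scikit-learn's `learning_curve`; the cross-validator settings and the five train-size steps are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, learning_curve

def accuracy_learning_curve(model, X, y):
    """Training vs cross-validation accuracy as the training set grows."""
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=cv, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 5),
    )
    # Average over the cross-validation splits at each training size.
    return sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)
```

Plotting the two returned score arrays against `sizes` gives the training and cross-validation curves discussed above.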
Final results are obtained by computing the accuracy over the test set produced by each split of the `ShuffleSplit` cross-validator; the results are then averaged for each model.
Final results are the following:

- `LogisticRegression`: 98.725%
- `LinearDiscriminantAnalysis`: 95.913%
- `KNeighborsClassifier`: 99.797%
- `DecisionTreeClassifier`: 99.493%
- `GaussianNB`: 92.217%
- `SVC`: 99.594%
- `RandomForestClassifier`: 99.594%