In [67]:
using DataFrames,CSV
using Plots
using InformationMeasures
using StatsBase
using Statistics
using ScikitLearn
using IJulia
IJulia.installkernel("Julia nodeps", "--depwarn=no")

#models
@sk_import linear_model: LogisticRegression;
@sk_import neighbors:KNeighborsClassifier;
@sk_import tree:DecisionTreeClassifier;
@sk_import svm: SVC;

# utilities
@sk_import model_selection:train_test_split;
@sk_import decomposition:PCA;
@sk_import metrics:confusion_matrix;
@sk_import preprocessing:MinMaxScaler;
@sk_import metrics:f1_score;


┌ Info: Installing Julia nodeps kernelspec in /home/alchemistdude/.local/share/jupyter/kernels/julia-nodeps-1.7
└ @ IJulia /home/alchemistdude/.julia/packages/IJulia/AQu2H/deps/kspec.jl:94
└ @ ScikitLearn.Skcore /home/alchemistdude/.julia/packages/ScikitLearn/Kn82b/src/Skcore.jl:169


In [2]:
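# header=0 means the file has no header row, so columns are auto-named Column1..Column61
df = CSV.read("data/sonar.csv", DataFrame, header=0);
# Encode the target as integers: "M" (mine) -> 1, "R" (rock) -> 0
df.Column61 .= replace.(df.Column61, "M" => 1)
df.Column61 .= replace.(df.Column61, "R" => 0);
df.Column61 = parse.(Int, df.Column61);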
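
The same label encoding can be sketched with an explicit lookup table, which avoids the string-replace-then-parse round trip. This is only an illustration on a toy vector of plain strings; `labels`, `encoding`, and `encoded` are hypothetical names that do not appear in the notebook.

```julia
# Hypothetical label vector standing in for the raw target column.
labels = ["M", "R", "R", "M"]

# Map each class name to an integer code: mine -> 1, rock -> 0.
encoding = Dict("M" => 1, "R" => 0)
encoded = [encoding[label] for label in labels]
```

A lookup table also fails loudly (with a `KeyError`) if an unexpected label shows up, which can be a useful sanity check.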

In [3]:
size(df)

(208, 61)

We have 208 rows and 61 columns; the 61st column is the target.

In [4]:
unique!(eltype.(eachcol(df)))

2-element Vector{DataType}:
 Float64
 Int64

The feature columns are Float64; the Int64 entry comes from the target column we just encoded.

In [6]:
df[1:5,:] #head of the dataframe

Row,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109
2,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337
3,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598
4,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598
5,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564


Summary of the dataframe

In [7]:
describe(df)

Row,variable,mean,min,median,max,nunique,nmissing,eltype
,Symbol,Float64,Real,Float64,Real,Nothing,Nothing,DataType
1,Column1,0.0291639,0.0015,0.0228,0.1371,,,Float64
2,Column2,0.0384365,0.0006,0.0308,0.2339,,,Float64
3,Column3,0.0438322,0.0015,0.0343,0.3059,,,Float64
4,Column4,0.0538923,0.0058,0.04405,0.4264,,,Float64
5,Column5,0.0752024,0.0067,0.0625,0.401,,,Float64
6,Column6,0.10457,0.0102,0.09215,0.3823,,,Float64
7,Column7,0.121747,0.0033,0.10695,0.3729,,,Float64
8,Column8,0.134799,0.0055,0.1121,0.459,,,Float64
9,Column9,0.178003,0.0075,0.15225,0.6828,,,Float64
10,Column10,0.208259,0.0113,0.1824,0.7106,,,Float64


Let's check for class imbalance:

In [8]:
countmap(df.Column61)

Dict{Int64, Int64} with 2 entries:
  0 => 97
  1 => 111
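
To put these counts in perspective, the short sketch below turns them into class shares. The counts are copied from the output above; the variable names are illustrative only.

```julia
# Class counts taken from the countmap output above.
counts = Dict(0 => 97, 1 => 111)

total = sum(values(counts))                        # 208 samples in total
proportions = Dict(k => v / total for (k, v) in counts)

# Share of the majority class, i.e. the accuracy of always predicting it.
majority_share = maximum(values(proportions))
println(round(majority_share, digits=3))
```

The majority class (mines) makes up roughly 53% of the samples, so a classifier that always predicts "mine" would be right about 53% of the time. That is a useful baseline for the scores reported later.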

The classes are slightly imbalanced (111 mines vs. 97 rocks), but we can work with that. Now let's look for possible correlations:

In [9]:
corr = cor(Matrix(df))

61×61 Matrix{Float64}:
 1.0       0.735896  0.571537   …  0.357116   0.347078   0.271694
 0.735896  1.0       0.779916      0.3522     0.358761   0.231238
 0.571537  0.779916  1.0           0.425047   0.373948   0.192195
 0.491438  0.606684  0.781786      0.420266   0.400626   0.250638
 0.344797  0.419669  0.546141      0.290982   0.25371    0.222232
 0.238921  0.332329  0.346275   …  0.220573   0.178158   0.132327
 0.260815  0.27904   0.190434      0.183578   0.222493   0.114748
 0.355523  0.334615  0.237884      0.1944     0.146216   0.189314
 0.35342   0.316733  0.252691      0.0972928  0.0952431  0.321448
 0.318276  0.270782  0.219637      0.0582733  0.0973581  0.341142
 0.344058  0.297065  0.27461    …  0.0677261  0.0896953  0.432855
 0.210861  0.194102  0.214807      0.0446137  0.0713637  0.392245
 0.210722  0.249596  0.258767      0.151804   0.0614105  0.312811
 ⋮                              ⋱                        ⋮
 0.269287  0.245868  0.0810956     0.178118   0.139944   0.1

Now let's look at the correlations with the target:

In [10]:
corr[:,61]

61-element Vector{Float64}:
 0.2716941061552168
 0.23123798457330438
 0.192194745755887
 0.25063845884088126
 0.22223183509528327
 0.13232650383573896
 0.11474838990488473
 0.189314275187784
 0.3214483861926137
 0.3411418488122266
 0.4328549236892342
 0.39224547508336305
 0.31281078441856947
 ⋮
 0.18022415905580047
 0.29320460713370367
 0.2886505526223154
 0.1418711269012208
 0.1826874301454812
 0.09563853711970001
 0.129340554882823
 0.0009328275756250693
 0.18419099614710369
 0.13082593671114393
 0.09005534016567657
 1.0

As we can see, none of the displayed features has a strong linear relationship with the target; the largest correlation shown is about 0.43.
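
Reading a long vector of correlations by eye is error-prone, so one option is to rank the features by absolute correlation. The sketch below does this on a small made-up matrix; `X_toy`, `y_toy`, `target_cor`, and `ranking` are hypothetical names. In the notebook the analogous input would be the first 60 entries of `corr[:,61]`.

```julia
using Statistics

# Toy data (hypothetical values): 3 features in columns, binary target.
X_toy = [0.1 0.9 0.3;
         0.2 0.1 0.8;
         0.8 0.4 0.2;
         0.9 0.6 0.7]
y_toy = [0.0, 0.0, 1.0, 1.0]

# Pearson correlation of each feature column with the target.
target_cor = vec(cor(X_toy, y_toy))

# Feature indices ordered from strongest to weakest absolute correlation.
ranking = sortperm(abs.(target_cor), rev=true)
```

Sorting by absolute value matters because a strongly negative correlation is just as informative as a strongly positive one.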

Let's scale the features to the range [0, 1]:

In [21]:
scaler = MinMaxScaler(); 
X = Matrix(df[:,1:60])        
scaler.fit(X)
X = scaler.transform(X)
y = df.Column61;

This is how the data was transformed:

In [24]:
X[1:5,:]

5×60 Matrix{Float64}:
 0.136431   0.156451   0.135677  0.0354256  …  0.185355   0.245179  0.0600462
 0.323009   0.221603   0.272011  0.150024      0.105263   0.140496  0.0877598
 0.182153   0.246892   0.35611   0.243699      0.368421   0.258953  0.166282
 0.0626844  0.0707244  0.199737  0.0349501     0.0938215  0.107438  0.256351
 0.550885   0.282898   0.153088  0.0798859     0.102975   0.292011  0.203233
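
For reference, min-max scaling maps each column with x' = (x - min) / (max - min). The sketch below is a plain-Julia illustration of that formula, not the scaler's actual implementation; `minmax_scale` and `A` are hypothetical names, and the code assumes no column is constant.

```julia
# Column-wise min-max scaling: x' = (x - min) / (max - min).
# Assumes no column is constant, otherwise the denominator is zero.
function minmax_scale(X::AbstractMatrix)
    lo = minimum(X, dims=1)   # 1×p row of column minima
    hi = maximum(X, dims=1)   # 1×p row of column maxima
    return (X .- lo) ./ (hi .- lo)
end

# Toy matrix (hypothetical values), two features in columns.
A = [1.0 10.0;
     2.0 20.0;
     4.0 40.0]
scaled = minmax_scale(A)

# Check against the notebook: Column1 has min 0.0015 and max 0.1371,
# and its first value is 0.02.
first_entry = (0.02 - 0.0015) / (0.1371 - 0.0015)
```

The hand-computed `first_entry` is about 0.1364, which agrees with the first entry of the scaled matrix shown above.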

# Feature Engineering

With 60 input features, the dimensionality is high, so let's try PCA. A common target is to keep enough components to explain 95-99% of the variance; here we use 95%:

In [27]:
pca = PCA(0.95)
pca.fit(X)
X = pca.transform(X); 

In [30]:
size(X)

(208, 21)

After PCA we went from 60 features to 21.
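
To make the 95% threshold concrete, the sketch below picks the number of components from the cumulative explained variance ratio, using only the standard library. It is an illustration of the idea under the assumption that components are added until the cumulative ratio exceeds the threshold; `n_components_for` and `B` are hypothetical names, and this is not the code path `PCA(0.95)` runs internally.

```julia
using LinearAlgebra, Statistics

# Smallest number of principal components whose cumulative explained
# variance ratio exceeds `threshold`.
function n_components_for(X::AbstractMatrix, threshold::Real)
    centered = X .- mean(X, dims=1)      # PCA operates on centered data
    s = svdvals(centered)                # singular values, in descending order
    ratio = s .^ 2 ./ sum(s .^ 2)        # explained variance ratio per component
    return findfirst(>(threshold), cumsum(ratio))
end

# Toy data (hypothetical values): the second column is nearly twice the first,
# so most of the variance lies along a single direction.
B = [1.0 2.0 0.3;
     2.0 4.1 0.1;
     3.0 5.9 0.4;
     4.0 8.0 0.2]
k = n_components_for(B, 0.95)
```

On this toy matrix a single component is enough, because two of the three columns are almost perfectly collinear. On the sonar data the same criterion kept 21 of 60 components.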

Splitting into train, test, and validation sets:

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8,random_state=66)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size=0.9,random_state=66);

In [32]:
print("Train -> ",size(X_train),"\n")
print("Test -> ",size(X_test),"\n")
print("Validation -> ",size(X_valid),"\n")

Train -> (149, 21)
Test -> (42, 21)
Validation -> (17, 21)
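
These sizes follow from applying the two fractions in sequence. The sketch below reproduces the arithmetic, assuming the training part is rounded down and the remainder goes to the held-out part (which matches the sizes printed above); `split_sizes` is a hypothetical helper.

```julia
# Sizes of a two-way split for a given training fraction:
# the training part is rounded down, the remainder is held out.
function split_sizes(n::Integer, train_frac::Real)
    n_train = floor(Int, train_frac * n)
    return (n_train, n - n_train)
end

n_trainval, n_test = split_sizes(208, 0.8)         # first split: 80% / 20%
n_train, n_valid = split_sizes(n_trainval, 0.9)    # second split: 90% / 10%
println((n_train, n_test, n_valid))
```

This prints `(149, 42, 17)`. Note that the validation set holds only 17 samples, so each validation prediction shifts the scores noticeably.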


In [58]:
function interpret_cm(m)
    print("Rocks labeled as rocks: ",m[1,1])
    print("\nRocks labeled as mines: ",m[1,2])
    print("\nMines labeled as rocks: ",m[2,1])
    print("\nMines labeled as mines: ",m[2,2])
end
    

interpret_cm (generic function with 1 method)

# Logistic Regression

In [59]:
model_log = LogisticRegression();
model_log.fit(X_train,y_train);
predicted_log = model_log.predict(X_valid);

In [60]:
interpret_cm(confusion_matrix(y_valid,predicted_log))

Rocks labeled as rocks: 6
Rocks labeled as mines: 1
Mines labeled as rocks: 1
Mines labeled as mines: 9

In [47]:
f1 = f1_score(y_valid,predicted_log)

0.9
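
The score above can be reproduced by hand from the confusion matrix, since F1 = 2·TP / (2·TP + FP + FN) with mines (label 1) as the positive class. The sketch below is a minimal illustration; `f1_from_cm` and `cm_log` are hypothetical names, and the matrix layout is assumed to match `interpret_cm` (rows are true labels, columns are predictions).

```julia
# F1 for the positive class (mines, label 1) from a 2×2 confusion matrix
# laid out as in interpret_cm: rows are true labels, columns are predictions.
function f1_from_cm(m::AbstractMatrix)
    tp = m[2, 2]   # mines labeled as mines
    fp = m[1, 2]   # rocks labeled as mines
    fn = m[2, 1]   # mines labeled as rocks
    return 2 * tp / (2 * tp + fp + fn)
end

# Counts printed above for logistic regression.
cm_log = [6 1;
          1 9]
f1_log = f1_from_cm(cm_log)
```

With 9 true positives, 1 false positive, and 1 false negative this gives 18 / 20 = 0.9, matching the reported value.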

# KNN

In [61]:
model_knn = KNeighborsClassifier();
model_knn.fit(X_train,y_train);
predicted_knn = model_knn.predict(X_valid);

In [62]:
interpret_cm(confusion_matrix(y_valid,predicted_knn))

Rocks labeled as rocks: 4
Rocks labeled as mines: 3
Mines labeled as rocks: 3
Mines labeled as mines: 7

In [63]:
f1 = f1_score(y_valid,predicted_knn)

0.7

# Decision Tree Classifier

In [64]:
model_dtc = DecisionTreeClassifier()
model_dtc.fit(X_train,y_train);
predicted_dtc = model_dtc.predict(X_valid);

In [65]:
interpret_cm(confusion_matrix(y_valid,predicted_dtc))

Rocks labeled as rocks: 4
Rocks labeled as mines: 3
Mines labeled as rocks: 1
Mines labeled as mines: 9

In [66]:
f1 = f1_score(y_valid,predicted_dtc)

0.8181818181818182

# SVM

In [68]:
model_svc = SVC(gamma="auto");
model_svc.fit(X_train,y_train);
predicted_svc = model_svc.predict(X_valid);

In [69]:
interpret_cm(confusion_matrix(y_valid,predicted_svc))

Rocks labeled as rocks: 6
Rocks labeled as mines: 1
Mines labeled as rocks: 1
Mines labeled as mines: 9

In [70]:
f1 = f1_score(y_valid,predicted_svc)

0.9