# Example usage of TitanicClassifier

### Set up the environment

In [23]:
using Pkg
Pkg.activate(".")

using DataFrames
using CSV

using Revise
using TitanicClassifier

[32m[1m  Activating[22m[39m project at `~/CTU/SEM_5/JUL/TitanicClassifier/examples`


### Load data

I recommend merging the train and test data so that all preprocessing is applied to both at once, then splitting the dataset back into train and test.

In [2]:
train = CSV_to_df("../data/train.csv");
test = CSV_to_df("../data/test.csv");
merged = vcat(select(train, Not("Survived")), test);
select!(merged, Not("PassengerId"));
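To illustrate why preprocessing the merged data helps, here is a minimal sketch using plain vectors. It assumes categorical features are mapped to integer codes, as the preprocessed table suggests; `encode_levels` is a hypothetical helper written for this example and is not part of TitanicClassifier.

```julia
# Hypothetical helper: map each distinct value to an integer code,
# numbered in order of first appearance.
function encode_levels(values)
    codes = Dict{eltype(values), Int}()
    return [get!(codes, v, length(codes) + 1) for v in values]
end

train_embarked = ["S", "C", "S"]
test_embarked = ["Q", "S"]

# Encoding each part separately gives "S" a different code in each part.
separate_train = encode_levels(train_embarked)
separate_test = encode_levels(test_embarked)

# Encoding the merged vector keeps the codes consistent; split back by length.
merged_codes = encode_levels(vcat(train_embarked, test_embarked))
train_codes = merged_codes[1:length(train_embarked)]
test_codes = merged_codes[length(train_embarked)+1:end]
```

In the merged encoding, "S" has the same code in both parts, which is what a model trained on the train part needs when it is applied to the test part.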

### Data preprocessing

For the Titanic dataset we can use the `titanic_preprocessing` function.

In [3]:
preprocessed = dropmissing(titanic_preprocessing(merged)) # dropmissing is used here only to fix the column types of the features
display(first(preprocessed, 5))

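| Row | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64 | Int64 | Int64 | Int64 | Float64 | Int64 | Int64 |
| 1 | 3 | 1 | 1 | 22.0 | 1 | 0 | 299 | 7.25 | 1 | 1 |
| 2 | 1 | 2 | 2 | 38.0 | 1 | 0 | 261 | 71.2833 | 2 | 2 |
| 3 | 3 | 3 | 2 | 26.0 | 0 | 0 | 896 | 7.925 | 1 | 1 |
| 4 | 1 | 2 | 2 | 35.0 | 1 | 0 | 464 | 53.1 | 2 | 1 |
| 5 | 3 | 1 | 1 | 35.0 | 0 | 0 | 846 | 8.05 | 1 | 1 |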


### SVM

Standardize the data matrix for both the train and test data, and transform the default {0, 1} labels of the `Survived` feature into {-1, 1} labels.

In [18]:
X_merged, y_train = prepare_data_for_SVM(Matrix{Float64}(preprocessed), train[!, "Survived"]);
X_train = X_merged[1:length(y_train), :];
X_test = X_merged[length(y_train)+1:nrow(preprocessed), :];
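The two operations described above can be sketched as follows. This is only an illustration of column-wise z-score standardization and the label mapping. The internals of `prepare_data_for_SVM` are not shown in this notebook, and `standardize_columns` and `to_pm_one` are hypothetical names.

```julia
using Statistics

# Hypothetical helper: z-score each column (zero mean, unit standard deviation).
function standardize_columns(X::AbstractMatrix{<:Real})
    mu = mean(X; dims=1)
    sigma = std(X; dims=1)
    sigma[sigma .== 0] .= 1  # leave constant columns unscaled
    return (X .- mu) ./ sigma
end

# Hypothetical helper: map {0, 1} labels to {-1, 1}.
to_pm_one(y) = 2 .* y .- 1

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]
Xs = standardize_columns(X)
y = to_pm_one([0, 1, 1])
```

Standardizing the merged matrix, as in the cell above, means train and test rows are scaled with the same column statistics.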

Train with multiple hyperparameter options and choose the one with the lowest average error, as estimated by cross-validation. This may run for a while...

In [22]:
best_err, best_hyperparams = hyperparam_cross_validation(X_train, y_train; num_iter=5)

("C" => 1000.0, "Kernel" => RBFKernel(10))
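As a rough picture of what such a search does, here is a minimal k-fold cross-validation sketch over a small candidate grid. It uses a toy threshold classifier instead of an SVM, and `kfold_indices` and `cv_error` are hypothetical helpers. How `hyperparam_cross_validation` splits the data, and what `num_iter` controls, is not shown in this notebook.

```julia
using Random, Statistics

# Hypothetical helper: split indices 1:n into k shuffled folds of near-equal size.
function kfold_indices(n::Integer, k::Integer; rng=Random.default_rng())
    perm = randperm(rng, n)
    return [perm[i:k:n] for i in 1:k]
end

# Hypothetical helper: average validation error of a fit/predict pair over k folds.
function cv_error(fit, predict, X, y; k=5, rng=Random.default_rng())
    errs = Float64[]
    for val_idx in kfold_indices(length(y), k; rng=rng)
        train_idx = setdiff(1:length(y), val_idx)
        model = fit(X[train_idx, :], y[train_idx])
        y_hat = predict(X[val_idx, :], model)
        push!(errs, mean(y_hat .!= y[val_idx]))
    end
    return mean(errs)
end

# Toy data: one feature, label is the sign of that feature.
X = reshape(collect(-4.5:1.0:4.5), :, 1)
y = [x > 0 ? 1 : -1 for x in X[:, 1]]

# Toy "hyperparameter": a fixed decision threshold t.
candidates = [-1.0, 0.0, 1.0]
errors = [cv_error((Xtr, ytr) -> t,
                   (Xv, model) -> ifelse.(Xv[:, 1] .> model, 1, -1),
                   X, y; k=5) for t in candidates]
best_err, best_idx = findmin(errors)
best_t = candidates[best_idx]
```

Each candidate is scored only on rows that were held out of the corresponding training fold, so the selected value is not chosen on data the model has already seen.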

Use the best hyperparameters to train the SVM algorithm on the entire training dataset.

In [9]:
model = solve_SVM(X_train, y_train, 1000; kernel=RBFKernel(10))

Dict{String, Any} with 5 entries:
  "kernel" => RBFKernel(10)
  "bias"   => -3.79939
  "z"      => [0.000226665, 0.00230322, 1000.0, 0.00202367, 1000.0, 547.069, 10…
  "sv"     => [0.841595 -0.716097 … -0.438512 -0.603205; -1.54551 0.178341 … 0.…
  "y"      => [-1, 1, 1, -1, -1, -1, 1, 1, -1, -1  …  -1, 1, 1, -1, -1, -1, -1,…
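The fields of the returned dictionary resemble the ingredients of a standard kernel SVM decision rule. The sketch below assumes that `z` holds the dual coefficients, that each row of `sv` is a training point, and that the RBF kernel is parametrized as exp(-‖a - b‖² / (2σ²)). None of this is confirmed by the notebook, and `rbf` and `decide` are hypothetical names, not the package's `classify_SVM`.

```julia
using LinearAlgebra

# Assumed RBF parametrization: K(a, b) = exp(-‖a - b‖² / (2σ²)).
rbf(a, b; sigma=1.0) = exp(-norm(a - b)^2 / (2 * sigma^2))

# Hypothetical decision rule: sign of sum_i z[i] * y[i] * K(sv[i, :], x) + bias.
function decide(x, sv, y, z, bias; sigma=1.0)
    score = bias
    for i in axes(sv, 1)
        score += z[i] * y[i] * rbf(sv[i, :], x; sigma=sigma)
    end
    return score >= 0 ? 1 : -1
end

# Two toy support vectors with opposite labels and equal weights.
sv = [0.0 0.0; 2.0 2.0]
y = [-1, 1]
z = [1.0, 1.0]

near_first = decide([0.1, 0.0], sv, y, z, 0.0)
near_second = decide([1.9, 2.0], sv, y, z, 0.0)
```

Under this reading, a query point takes the label of the support vectors that dominate the weighted kernel sum. The entries of `z` equal to 1000.0 in the output above would be consistent with coefficients capped at the chosen `C`.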

Use the trained SVM model to make predictions on the test data, and save them to a file called `my_pred.csv` in the data folder.

In [21]:
pred = classify_SVM(X_test, model)
pred[pred .== -1] .= 0 # map the {-1, 1} labels back to {0, 1}

df = DataFrame(PassengerId = test[!, "PassengerId"], Survived=pred)

CSV.write("../data/my_pred.csv", df);