### Titanic challenge

In this tutorial, we will walk through the complete process of solving the famous Titanic challenge on [Kaggle](https://www.kaggle.com/c/titanic/) using [ML.Net](https://github.com/dotnet/machinelearning) and [MLNet.AutoPipeline](https://github.com/LittleLittleCloud/machinelearning-auto-pipeline). The goal of the Titanic challenge is to predict whether a passenger survived, based on features such as age, ticket class and fare.

#### ML.Net and MLNet.AutoPipeline

[ML.Net](https://github.com/dotnet/machinelearning) is a C# machine learning framework created and maintained by Microsoft. It provides an easy-to-use API for creating pipelines, training and consuming models, along with many examples that are easy to start from. Visit the official [website](https://dotnet.microsoft.com/learn/ml-dotnet) for more information.

[MLNet.AutoPipeline](https://github.com/LittleLittleCloud/machinelearning-auto-pipeline) is a third-party library built on top of [ML.Net](https://github.com/dotnet/machinelearning). It provides add-on APIs for creating sweepable pipelines that support automatic hyper-parameter tuning.

#### Let's start!

### Install dependencies and include namespaces


In [1]:
#i "nuget:https://pkgs.dev.azure.com/xiaoyuz0315/BigMiao/_packaging/MLNet-Auto-Pipeline/nuget/v3/index.json"
#r "nuget:MLNet.AutoPipeline,0.9.0-v202008143"
using Microsoft.ML;
using Microsoft.ML.Data;
using MLNet.AutoPipeline;
using MLNet.AutoPipeline.Metric;
using MLNet.Sweeper;
using System;
using System.Threading.Tasks;

Installed package MLNet.AutoPipeline version 0.9.0-v202008143

### Load Dataset

The dataset is downloaded from [Kaggle](https://www.kaggle.com/c/titanic/data) and is split into two parts: train.csv and test.csv. train.csv is used to train and validate the model, while test.csv is used to create the submission to Kaggle.

In [2]:
// Titanic dataset class. The LoadColumn indices must match the column order of the CSV file being loaded.
class Titanic
{
    [LoadColumn(0)]
    public string PassengerId;

    [LoadColumn(1)]
    public int Pclass;
    
    [LoadColumn(2)]
    public string Name;
    
    [LoadColumn(3)]
    public string Sex;
    
    [LoadColumn(4)]
    public float Age;
    
    [LoadColumn(5)]
    public float SibSp;
    
    [LoadColumn(6)]
    public float Parch;
    
    [LoadColumn(7)]
    public string Ticket;
    
    [LoadColumn(8)]
    public float Fare;
    
    [LoadColumn(9)]
    public string Cabin;
    
    [LoadColumn(10)]
    public string Embarked;

    [LoadColumn(11)]
    public bool Survived;
}

var context = new MLContext(seed:0);
var dataset = context.Data.LoadFromTextFile<Titanic>(@"titanic/train.csv", separatorChar: ',', hasHeader: true, allowQuoting: true );

// hold out 10% of the rows as a test set; the rest is used for training
var split = context.Data.TrainTestSplit(dataset, 0.1);
Console.WriteLine($"train split: {split.TrainSet.Preview(1000)}");
Console.WriteLine($"test split: {split.TestSet.Preview(1000)}");

train split: 13 columns, 806 rows
test split: 13 columns, 85 rows


### Data preprocessing

Judging by the definition of each field, PassengerId, Name and Ticket should have no effect on the final result and can be left out. Cabin might be useful, but it contains too many empty values, so we leave that column out as well. Pclass, Sex and Embarked are categorical values and should be one-hot encoded before being fed to the trainer. Age is useful but has some missing values, which we replace with the column mean.

In [3]:
var preprocessingPipeline = context.Transforms.Categorical.OneHotEncoding("Pclass", "Pclass")
                                   .Append(context.Transforms.Categorical.OneHotEncoding("Sex"))
                                   .Append(context.Transforms.Categorical.OneHotEncoding("Embarked"))
                                   .Append(context.Transforms.ReplaceMissingValues("Age",replacementMode : Microsoft.ML.Transforms.MissingValueReplacingEstimator.ReplacementMode.Mean))
                                   .Append(context.Transforms.Concatenate("features", new string[] { "Pclass", "Sex", "Embarked", "Age", "Fare", "Parch", "SibSp" }));
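
To make the two transformations above concrete, here is a minimal plain C# sketch of what mean replacement and one-hot encoding do to a column. It does not use ML.Net and is not how the library implements them. The sample values are made up for illustration, and ordering the categories by first appearance is a choice made for this sketch.

```csharp
using System;
using System.Linq;

// Hypothetical toy values, not rows from the Titanic file.
float[] ages = { 22f, float.NaN, 38f, 30f };
string[] sexes = { "male", "female", "female", "male" };

// Mean replacement: average the known values, then fill the gaps with that average.
float mean = ages.Where(a => !float.IsNaN(a)).Average();
float[] filledAges = ages.Select(a => float.IsNaN(a) ? mean : a).ToArray();

// One-hot encoding: one slot per distinct category, ordered by first appearance.
string[] categories = sexes.Distinct().ToArray();
float[][] oneHotSex = sexes
    .Select(s => categories.Select(c => c == s ? 1f : 0f).ToArray())
    .ToArray();

Console.WriteLine(string.Join(", ", filledAges));
Console.WriteLine(string.Join(" | ", oneHotSex.Select(v => string.Join(",", v))));
```

The missing age becomes the mean of the three known ages, and each Sex value becomes a two-element vector with a single 1 marking its category.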


### Trainer

We will use [FastTreeBinaryTrainer](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers.fasttree.fasttreebinarytrainer?view=ml-dotnet) to train our model. It is a binary classifier that implements the [MART](https://arxiv.org/abs/1505.01866) gradient boosting algorithm and performs well on many datasets. We will not use that trainer in our pipeline directly, though. Instead, we will use MLNet.AutoPipeline to create a sweepable pipeline, which sweeps over the hyper-parameters of FastTreeBinaryTrainer to find the best set of parameters.

The MLNet.AutoPipeline counterpart of FastTreeBinaryTrainer is [FastTree](https://automlnet.com/api/MLNet.AutoPipeline.SweepableBinaryClassificationTrainerExtension.html#MLNet_AutoPipeline_SweepableBinaryClassificationTrainerExtension_FastTree_MLNet_AutoPipeline_SweepableBinaryClassificationTrainers_System_String_System_String_MLNet_AutoPipeline_SweepableOption_Microsoft_ML_Trainers_FastTree_FastTreeBinaryTrainer_Options__Microsoft_ML_Trainers_FastTree_FastTreeBinaryTrainer_Options_), which has the same name as the method that creates a FastTreeBinaryTrainer in ML.Net. Besides FastTree, MLNet.AutoPipeline provides sweepable versions of almost all binary trainers available in ML.Net. You can find the full list [here](https://automlnet.com/api/MLNet.AutoPipeline.SweepableBinaryClassificationTrainerExtension.html).

In [16]:
var trainingPipeline = preprocessingPipeline.Append(context.AutoML().BinaryClassification.FastTree("Survived", "features"));

// Default sweeping configuration
Console.WriteLine(FastTreeBinaryTrainerSweepableOptions.Default);

Type of option: Options

Parameter Name: LabelColumnName
Parameter Type: String
Parameter Value: MLNet.Sweeper.ObjectParameterValue

Parameter Name: FeatureColumnName
Parameter Type: String
Parameter Value: MLNet.Sweeper.ObjectParameterValue

Parameter Name: ExampleWeightColumnName
Parameter Type: String
Parameter Value: MLNet.Sweeper.ObjectParameterValue

Parameter Name: NumberOfLeaves
Parameter Type: int
Min Value: 10
Max Value: 1000
Steps: 100
Log Base: True

Parameter Name: NumberOfTrees
Parameter Type: int
Min Value: 1
Max Value: 1000
Steps: 100
Log Base: True

Parameter Name: MinimumExampleCountPerLeaf
Parameter Type: int
Min Value: 1
Max Value: 100
Steps: 100
Log Base: True

Parameter Name: LearningRate
Parameter Type: double
Min Value: 0.0001
Max Value: 1
Steps: 100
Log Base: True
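
According to the defaults printed above, LearningRate is swept between 0.0001 and 1 in 100 steps on a logarithmic scale. The sketch below shows one way such a grid could be built, assuming the candidates are evenly spaced in log10 space. That spacing is an assumption about the sweeper rather than something taken from its documentation, and `LogGrid` is a hypothetical helper, not part of MLNet.AutoPipeline.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns steps + 1 candidates evenly spaced in log10 space.
double[] LogGrid(double min, double max, int steps)
{
    double logMin = Math.Log10(min);
    double logMax = Math.Log10(max);
    return Enumerable.Range(0, steps + 1)
        .Select(i => Math.Pow(10, logMin + (logMax - logMin) * i / steps))
        .ToArray();
}

// Same range and step count as the default LearningRate option printed above.
double[] learningRates = LogGrid(0.0001, 1, 100);

// Grid point 22 is 10^(-3.12), roughly 0.00075858.
Console.WriteLine(learningRates[22]);
```

Under this assumption, grid point 22 is about 0.00075858, which agrees with the LearningRate reported for the first iteration of the training log in this notebook. A log-scale grid spends as many candidates between 0.0001 and 0.001 as between 0.1 and 1, which suits parameters whose useful values span several orders of magnitude.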




### Train the model

In [None]:
class Reporter : IProgress<IterationInfo>
{
    public void Report(IterationInfo value)
    {
        Console.WriteLine(value.ParameterSet);
        Console.WriteLine(value.SweepablePipeline.Summary());
        Console.WriteLine($"validate score: {value.ScoreMetric.Name}: {value.ScoreMetric.Score}");
        Console.WriteLine($"training time: {value.TrainingTime}");
    }
}

public class AccuracyMetric : IMetric
{
    public string Name => "Accuracy";

    public bool IsMaximizing => true;

    public double Score(MLContext context, IDataView eval, string label)
    {
        return context.BinaryClassification.Evaluate(eval, label).Accuracy;
    }
}

var experimentOption = new Experiment.Option()
{
    ScoreMetric = new AccuracyMetric(),
    Label = "Survived",
    Iteration = 150,
};
var experiment = context.AutoML().CreateExperiment(trainingPipeline, experimentOption);
var reporter = new Reporter();
var result = await experiment.TrainAsync(split.TrainSet, validateFraction: 0.1f, reporter: reporter);

LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=36 NumberOfTrees=9 MinimumExampleCountPerLeaf=63 LearningRate=0.0007585775750291848
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.7947598253275109
training time: 0.0768245
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=1000 NumberOfTrees=7 MinimumExampleCountPerLeaf=35 LearningRate=0.008317637711026712
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.8078602620087336
training time: 0.068316
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=380 NumberOfTrees=22 MinimumExampleCountPerLeaf=6 LearningRate=0.0013

validate score: Accuracy: 0.7947598253275109
training time: 0.1119246
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=69 NumberOfTrees=2 MinimumExampleCountPerLeaf=3 LearningRate=0.0006918309709189371
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.7816593886462883
training time: 0.0564044
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=11 NumberOfTrees=331 MinimumExampleCountPerLeaf=44 LearningRate=0.7585775750291833
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.759825327510917
training time: 0.1566927
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=26

SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.8253275109170306
training time: 0.4376742
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=240 NumberOfTrees=21 MinimumExampleCountPerLeaf=11 LearningRate=0.00043651583224016643
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.7991266375545851
training time: 0.0788892
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=115 NumberOfTrees=708 MinimumExampleCountPerLeaf=1 LearningRate=0.012022644346174135
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=

training time: 0.123526
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=66 NumberOfTrees=5 MinimumExampleCountPerLeaf=4 LearningRate=0.0006918309709189371
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.7903930131004366
training time: 0.0630087
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=316 NumberOfTrees=7 MinimumExampleCountPerLeaf=76 LearningRate=0.2089296130854038
SweepablePipeline(OneHotEncodingEstimator=>OneHotEncodingEstimator=>OneHotEncodingEstimator=>MissingValueReplacingEstimator=>ColumnConcatenatingEstimator=>FastTreeBinaryTrainer)
validate score: Accuracy: 0.8165938864628821
training time: 0.0605416
LabelColumnName=Label FeatureColumnName=Features ExampleWeightColumnName= NumberOfLeaves=32 NumberOfTrees=234 MinimumExampleCountPerLeaf=

### Evaluate the model

In [18]:
var bestModel = result.BestModel;
var eval = bestModel.Transform(split.TestSet);
var metric = context.BinaryClassification.Evaluate(eval, "Survived");
Console.WriteLine($"best model test score: {metric.Accuracy}");

best model test score: 0.8352941176470589


### Train and evaluate using the default setting

In this section, we use [FastTreeBinaryTrainer](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers.fasttree.fasttreebinarytrainer?view=ml-dotnet) from ML.Net with default parameters to train and evaluate a model, then compare its score with the score from the sweepable pipeline.

In [19]:
context = new MLContext();
var estimatorChain = preprocessingPipeline.Append(context.BinaryClassification.Trainers.FastTree("Survived", "features"));
              
var mlModel = estimatorChain.Fit(split.TrainSet);
var mlModel_eval_train = mlModel.Transform(split.TrainSet);
var mlModel_eval_test = mlModel.Transform(split.TestSet);
var mlModel_train_metric = context.BinaryClassification.Evaluate(mlModel_eval_train, "Survived");
var mlModel_test_metric = context.BinaryClassification.Evaluate(mlModel_eval_test, "Survived");
Console.WriteLine($"mlnet estimator chain test accuracy: {mlModel_test_metric.Accuracy}");

mlnet estimator chain test accuracy: 0.8117647058823529
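
To put the two test scores in perspective, the short sketch below converts them back into counts of correctly classified rows. It assumes both accuracies were computed on the 85-row test split reported earlier, and the variable names are chosen for this illustration.

```csharp
using System;

const int testRows = 85;                     // size of the test split reported earlier
double sweptAccuracy = 0.8352941176470589;   // best model from the sweepable pipeline
double defaultAccuracy = 0.8117647058823529; // FastTree with default parameters

// Accuracy is correct / total, so multiplying by the row count recovers the count.
int sweptCorrect = (int)Math.Round(sweptAccuracy * testRows);
int defaultCorrect = (int)Math.Round(defaultAccuracy * testRows);

Console.WriteLine($"swept: {sweptCorrect}/{testRows}, default: {defaultCorrect}/{testRows}");
```

On a split this small, the gap between the two models comes down to two passengers, so the improvement is best read as indicative rather than conclusive.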
