# TMVA Regression
This macro provides examples for the training and testing of the
TMVA regression methods.

The input data is a toy-MC sample. This macro uses two input variables
(var1 and var2) and one regression target (fvalue).

The methods to be used can be switched on and off by means of booleans, or
via the prompt command, for example:

    root -l TMVARegression.C\(\"LD,MLP\"\)

(Note that the backslashes are mandatory.)
If no method is given, a default set is used.

The output file "TMVAReg.root" can be analysed with the use of dedicated
macros (simply say: root -l <macro.C>), which can be conveniently
invoked through a GUI that will appear at the end of the run of this macro.
- Project   : TMVA - a Root-integrated toolkit for multivariate data analysis
- Package   : TMVA
- Root Macro: TMVARegression



**Author:** Andreas Hoecker  
<i><small>This notebook tutorial was automatically generated with <a href= "https://github.com/root-mirror/root/blob/master/documentation/doxygen/converttonotebook.py">ROOTBOOK-izer (Beta)</a> from the macro found in the ROOT repository  on Thursday, January 26, 2017 at 01:24 AM.</small></i>

In [1]:
%%cpp -d
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

#include "TChain.h"
#include "TFile.h"
#include "TTree.h"
#include "TString.h"
#include "TObjString.h"
#include "TSystem.h"
#include "TROOT.h"

#include "TMVA/Tools.h"
#include "TMVA/Factory.h"
#include "TMVA/DataLoader.h"
#include "TMVA/TMVARegGui.h"


using namespace TMVA;

 Arguments are defined. 

In [2]:
TString myMethodList = "";

The shared library libTMVA is loaded explicitly in TMVAlogon.C, which is defined in .rootrc.
 If you use a private .rootrc, or run from a different directory, please copy the
 corresponding lines from .rootrc.

Methods to be processed can be given as an argument; use format:

     mylinux~> root -l TMVARegression.C\(\"myMethod1,myMethod2,myMethod3\"\)

---------------------------------------------------------------
 This loads the library

In [3]:
TMVA::Tools::Instance();

Default MVA methods to be trained and tested

In [4]:
std::map<std::string,int> Use;

Multidimensional likelihood and nearest-neighbour methods

In [5]:
Use["PDERS"]           = 0;
Use["PDEFoam"]         = 1;
Use["KNN"]             = 1;

 Linear Discriminant Analysis

In [6]:
Use["LD"]              = 1;

 Function Discriminant analysis

In [7]:
Use["FDA_GA"]          = 1;
Use["FDA_MC"]          = 0;
Use["FDA_MT"]          = 0;
Use["FDA_GAMT"]        = 0;

 Neural Network

In [8]:
Use["MLP"]             = 1;
Use["DNN"]             = 0;

 Support Vector Machine

In [9]:
Use["SVM"]             = 0;

 Boosted Decision Trees

In [10]:
Use["BDT"]             = 0;
Use["BDTG"]            = 1;

---------------------------------------------------------------

In [11]:
std::cout << std::endl;
std::cout << "==> Start TMVARegression" << std::endl;


==> Start TMVARegression


Select methods (don't look at this code - not of interest)

In [12]:
if (myMethodList != "") {
   for (std::map<std::string,int>::iterator it = Use.begin(); it != Use.end(); it++) it->second = 0;

   std::vector<TString> mlist = gTools().SplitString( myMethodList, ',' );
   for (UInt_t i=0; i<mlist.size(); i++) {
      std::string regMethod(mlist[i]);

      if (Use.find(regMethod) == Use.end()) {
         std::cout << "Method \"" << regMethod << "\" not known in TMVA under this name. Choose among the following:" << std::endl;
         for (std::map<std::string,int>::iterator it = Use.begin(); it != Use.end(); it++) std::cout << it->first << " ";
         std::cout << std::endl;
         return;
      }
      Use[regMethod] = 1;
   }
}
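
The selection logic in the cell above can be sketched without ROOT. The helper below is hypothetical (`selectMethods` is not a TMVA function) and uses only the standard library: an empty list keeps the defaults, otherwise every flag is cleared and only the named methods are switched back on.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of TMVA): enable only the methods named in a
// comma-separated list. Returns false if a name is not a key of `use`.
bool selectMethods(std::map<std::string, int>& use, const std::string& methodList)
{
   if (methodList.empty()) return true;          // keep the default set
   for (auto& entry : use) entry.second = 0;     // switch everything off first

   std::stringstream stream(methodList);
   std::string name;
   while (std::getline(stream, name, ',')) {
      auto it = use.find(name);
      if (it == use.end()) return false;         // unknown method name
      it->second = 1;
   }
   return true;
}
```

Calling `selectMethods(Use, "LD,MLP")` on the map defined earlier would leave only `LD` and `MLP` enabled.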

--------------------------------------------------------------------------------------------------

Here the preparation phase begins

Create a new ROOT output file

In [13]:
TString outfileName( "TMVAReg.root" );
TFile* outputFile = TFile::Open( outfileName, "RECREATE" );

Create the factory object. Later you can choose the methods
 whose performance you'd like to investigate. The factory will
 then run the performance analysis for you.

 The first argument is the base of the name of all the
 weight files in the directory weights/.

 The second argument is the output file for the training results.
 All TMVA output can be suppressed by removing the "!" (not) in
 front of the "Silent" argument in the option string.

In [14]:
TMVA::Factory *factory = new TMVA::Factory( "TMVARegression", outputFile,
                                            "!V:!Silent:Color:!DrawProgressBar:AnalysisType=Regression" );


TMVA::DataLoader *dataloader=new TMVA::DataLoader("dataset");
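
To make the option-string convention concrete, here is a minimal sketch of how such a string could be split. It is not TMVA's own parser; `parseOptions` is a hypothetical helper that treats `:` as the separator, `key=value` as a setting, a bare word as a flag that is on, and a leading `!` as a flag that is off.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical parser (not TMVA's own): split a colon-separated option string
// into key/value pairs. Bare flags become "true", "!flag" becomes "false".
std::map<std::string, std::string> parseOptions(const std::string& options)
{
   std::map<std::string, std::string> result;
   std::stringstream stream(options);
   std::string token;
   while (std::getline(stream, token, ':')) {
      if (token.empty()) continue;                        // skip empty tokens
      const auto eq = token.find('=');
      if (eq != std::string::npos)
         result[token.substr(0, eq)] = token.substr(eq + 1);
      else if (token[0] == '!')
         result[token.substr(1)] = "false";
      else
         result[token] = "true";
   }
   return result;
}
```

Applied to the factory options above, this sketch yields `Silent` mapped to `"false"` and `AnalysisType` mapped to `"Regression"`.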

If you wish to modify default settings
 (please check "src/Config.h" to see all available global options):

     (TMVA::gConfig().GetVariablePlotting()).fTimesRMS = 8.0;
     (TMVA::gConfig().GetIONames()).fWeightFileDir = "myWeightDirectory";

Define the input variables that shall be used for the MVA training.
 Note that you may also use variable expressions, such as: "3*var1/var2*abs(var3)"
 [all types of expressions that can also be parsed by TTree::Draw( "expression" )].

In [15]:
dataloader->AddVariable( "var1", "Variable 1", "units", 'F' );
dataloader->AddVariable( "var2", "Variable 2", "units", 'F' );

You can add so-called "spectator variables", which are not used in the MVA training
 but will appear in the final "TestTree" produced by TMVA. This TestTree will contain the
 input variables, the response values of all trained MVAs, and the spectator variables.

In [16]:
dataloader->AddSpectator( "spec1:=var1*2",  "Spectator 1", "units", 'F' );
dataloader->AddSpectator( "spec2:=var1*3",  "Spectator 2", "units", 'F' );

Add the variable carrying the regression target

In [17]:
dataloader->AddTarget( "fvalue" );

It is also possible to declare additional targets for multi-dimensional regression, i.e.:
     dataloader->AddTarget( "fvalue2" );
 BUT: this is currently ONLY implemented for MLP.

Read training and test data (see TMVAClassification for reading ASCII files):
 load the regression event sample from a ROOT tree.

In [18]:
TFile *input(0);
TString fname = "./tmva_reg_example.root";
if (!gSystem->AccessPathName( fname ))
   input = TFile::Open( fname ); // check if file in local directory exists
else
   input = TFile::Open( "http://root.cern.ch/files/tmva_reg_example.root" ); // if not: download from ROOT server

if (!input) {
   std::cout << "ERROR: could not open data file" << std::endl;
   exit(1);
}
std::cout << "--- TMVARegression           : Using input file: " << input->GetName() << std::endl;

--- TMVARegression           : Using input file: http://root.cern.ch/files/tmva_reg_example.root
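
The cell above negates `gSystem->AccessPathName` because that call returns false when the path is accessible. The fallback logic itself can be sketched with the standard library only; `fileExists` and `chooseInput` are hypothetical names used for illustration.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: true if the file can be opened for reading.
bool fileExists(const std::string& path)
{
   std::ifstream file(path);
   return file.good();
}

// Prefer the local copy; otherwise fall back to the remote location.
std::string chooseInput(const std::string& localPath, const std::string& remoteUrl)
{
   return fileExists(localPath) ? localPath : remoteUrl;
}
```

With no local `tmva_reg_example.root` present, `chooseInput` returns the URL, which matches the printed output of this run.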


Register the regression tree

In [19]:
TTree *regTree = (TTree*)input->Get("TreeR");

Global event weights per tree (see below for setting event-wise weights)

In [20]:
Double_t regWeight  = 1.0;

You can add an arbitrary number of regression trees

In [21]:
dataloader->AddRegressionTree( regTree, regWeight );

DataSetInfo              : [dataset] : Added class "Regression"
                         : Add Tree TreeR of type Regression with 10000 events


This sets individual event weights (the variables defined in the
 expression need to exist in the original TTree)

In [22]:
dataloader->SetWeightExpression( "var1", "Regression" );
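
As a rough picture of how the two kinds of weight fit together, assume the global tree weight and the per-event weight expression combine multiplicatively (an assumption about TMVA's behaviour, not something this notebook prints). The weight of event i would then be:

```latex
w_i = w_{\mathrm{tree}} \cdot \mathrm{var1}_i , \qquad w_{\mathrm{tree}} = \texttt{regWeight} = 1.0
```

Under that assumption each event in this example is simply weighted by its own value of var1.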

Apply additional cuts on the regression sample

In [23]:
TCut mycut = ""; // for example: TCut mycut = "abs(var1)<0.5 && abs(var2-0.5)<1";

Tell the dataloader to use all remaining events in the trees after training for testing:

In [24]:
dataloader->PrepareTrainingAndTestTree( mycut,
                                      "nTrain_Regression=1000:nTest_Regression=0:SplitMode=Random:NormMode=NumEvents:!V" );

                         : Dataset[dataset] : Class index : 0  name : Regression


     dataloader->PrepareTrainingAndTestTree( mycut,
            "nTrain_Regression=0:nTest_Regression=0:SplitMode=Random:NormMode=NumEvents:!V" );

If no numbers of events are given, half of the events in the tree are used
 for training, and the other half for testing:

     dataloader->PrepareTrainingAndTestTree( mycut, "SplitMode=random:!V" );
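
The effect of the event-number options can be illustrated with a simplified model. `splitCounts` is a hypothetical helper, not TMVA code: it covers the two cases described in the text, and the case where only the training number is zero is an assumption made by symmetry.

```cpp
#include <cassert>
#include <utility>

// Simplified model (an assumption, not TMVA's code): returns {nTrain, nTest}.
//  - both requests 0   -> half of the events each
//  - only nTest == 0   -> all events left after training are used for testing
//  - only nTrain == 0  -> assumed symmetric: the remainder is used for training
std::pair<long, long> splitCounts(long total, long nTrainRequested, long nTestRequested)
{
   assert(total >= 0 && nTrainRequested >= 0 && nTestRequested >= 0);
   assert(nTrainRequested + nTestRequested <= total);
   if (nTrainRequested == 0 && nTestRequested == 0) {
      const long nTrain = total / 2;
      return {nTrain, total - nTrain};
   }
   if (nTestRequested == 0) return {nTrainRequested, total - nTrainRequested};
   if (nTrainRequested == 0) return {total - nTestRequested, nTestRequested};
   return {nTrainRequested, nTestRequested};
}
```

For the 10000 events in TreeR, `splitCounts(10000, 1000, 0)` gives 1000 training and 9000 testing events, the same numbers reported in the DataSetFactory output of this run.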

Book MVA methods

 Please look up the various method configuration options in the corresponding cxx files, e.g.
 src/MethodCuts.cxx, or here: http://tmva.sourceforge.net/optionRef.html
 It is possible to preset ranges in the option string in which the cut optimisation should be done:
 "...:CutRangeMin[2]=-1:CutRangeMax[2]=1...", where [2] is the third input variable.

PDE-RS method

In [25]:
if (Use["PDERS"])
   factory->BookMethod( dataloader,  TMVA::Types::kPDERS, "PDERS",
                        "!H:!V:NormTree=T:VolumeRangeMode=Adaptive:KernelEstimator=Gauss:GaussSigma=0.3:NEventsMin=40:NEventsMax=60:VarTransform=None" );

And the option strings for the MinMax and RMS methods, respectively:

      "!H:!V:VolumeRangeMode=MinMax:DeltaFrac=0.2:KernelEstimator=Gauss:GaussSigma=0.3" );
      "!H:!V:VolumeRangeMode=RMS:DeltaFrac=3:KernelEstimator=Gauss:GaussSigma=0.3" );

In [26]:
if (Use["PDEFoam"])
    factory->BookMethod( dataloader,  TMVA::Types::kPDEFoam, "PDEFoam",
			    "!H:!V:MultiTargetRegression=F:TargetSelection=Mpv:TailCut=0.001:VolFrac=0.0666:nActiveCells=500:nSampl=2000:nBin=5:Compress=T:Kernel=None:Nmin=10:VarTransform=None" );

Factory                  : Booking method: PDEFoam
                         : 
DataSetFactory           : [dataset] : Number of events in input trees
                         : 
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Regression -- training events            : 1000
                         : Regression -- testing events             : 9000
                         : Regression -- training and testing events: 10000
                         : 
DataSetInfo              : Correlation matrix (Regression):
                         : ------------------------
                         :             var1    var2
                         :    var1:  +1.000  -0.017
                         :    var2:  -0.017  +1.000
                         : ------------------------
DataSetFactory           : [dataset] :  
                         : 


k-nearest-neighbour method (KNN)

In [27]:
if (Use["KNN"])
   factory->BookMethod( dataloader,  TMVA::Types::kKNN, "KNN",
                        "nkNN=20:ScaleFrac=0.8:SigmaFact=1.0:Kernel=Gaus:UseKernel=F:UseWeight=T:!Trim" );

Factory                  : Booking method: KNN
                         : 
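
For readers unfamiliar with the idea, here is a minimal sketch of plain k-nearest-neighbour regression. It is a textbook version and not TMVA's implementation: it ignores event weights, kernels and variable scaling, and the `Event` struct and `knnPredict` function are hypothetical names. The `nkNN=20` option above presumably plays the role of `k`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Event {
   double var1;
   double var2;
   double target;
};

// Plain k-nearest-neighbour regression: average the targets of the k training
// events closest (Euclidean distance) to the query point.
double knnPredict(const std::vector<Event>& training, double var1, double var2, std::size_t k)
{
   assert(k > 0 && k <= training.size());
   std::vector<std::pair<double, double>> distTarget;   // (squared distance, target)
   distTarget.reserve(training.size());
   for (const Event& e : training) {
      const double d1 = e.var1 - var1;
      const double d2 = e.var2 - var2;
      distTarget.emplace_back(d1 * d1 + d2 * d2, e.target);
   }
   // Move the k smallest distances to the front.
   std::partial_sort(distTarget.begin(), distTarget.begin() + k, distTarget.end());
   double sum = 0.0;
   for (std::size_t i = 0; i < k; ++i) sum += distTarget[i].second;
   return sum / static_cast<double>(k);
}
```

The prediction is a local average, so it needs no functional form for the target but depends on how densely the training sample covers the input space.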


Linear discriminant

In [28]:
if (Use["LD"])
   factory->BookMethod( dataloader,  TMVA::Types::kLD, "LD",
                        "!H:!V:VarTransform=None" );

// Function discriminant analysis (FDA): test of various fitters; the recommended one is Minuit (or GA or SA)
if (Use["FDA_MC"])
   factory->BookMethod( dataloader,  TMVA::Types::kFDA, "FDA_MC",
                       "!H:!V:Formula=(0)+(1)*x0+(2)*x1:ParRanges=(-100,100);(-100,100);(-100,100):FitMethod=MC:SampleSize=100000:Sigma=0.1:VarTransform=D" );

if (Use["FDA_GA"]) // can also use Simulated Annealing (SA) algorithm (see Cuts_SA options) .. the formula of this example is good for parabolas
   factory->BookMethod( dataloader,  TMVA::Types::kFDA, "FDA_GA",
                        "!H:!V:Formula=(0)+(1)*x0+(2)*x1:ParRanges=(-100,100);(-100,100);(-100,100):FitMethod=GA:PopSize=100:Cycles=3:Steps=30:Trim=True:SaveBestGen=1:VarTransform=Norm" );

if (Use["FDA_MT"])
   factory->BookMethod( dataloader,  TMVA::Types::kFDA, "FDA_MT",
                        "!H:!V:Formula=(0)+(1)*x0+(2)*x1:ParRanges=(-100,100);(-100,100);(-100,100):FitMethod=MINUIT:ErrorLevel=1:PrintLevel=-1:FitStrategy=2:UseImprove:UseMinos:SetBatch" );

if (Use["FDA_GAMT"])
   factory->BookMethod( dataloader,  TMVA::Types::kFDA, "FDA_GAMT",
                        "!H:!V:Formula=(0)+(1)*x0+(2)*x1:ParRanges=(-100,100);(-100,100);(-100,100):FitMethod=GA:Converger=MINUIT:ErrorLevel=1:PrintLevel=-1:FitStrategy=0:!UseImprove:!UseMinos:SetBatch:Cycles=1:PopSize=5:Steps=5:Trim" );

Factory                  : Booking method: LD
                         : 
Factory                  : Booking method: FDA_GA
                         : 
FDA_GA                   : [dataset] : Create Transformation "Norm" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'var1' <---> Output : variable 'var1'
                         : Input : variable 'var2' <---> Output : variable 'var2'
                         : Input : target 'fvalue' <---> Output : target 'fvalue'
                         : Create parameter interval for parameter 0 : [-100,100]
                         : Create parameter interval for parameter 1 : [-100,100]
                         : Create parameter interval for parameter 2 : [-100,100]
                         : User-defined formula string       : "(0)+(1)*x0+(2)*x1"
                         : TFormula-compatible formula string: "[0
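
Written out, the user-defined formula string booked above corresponds to a linear model. In the notation below (the symbols are chosen here for illustration), p_0, p_1, p_2 stand for the parameters written as (0), (1), (2), and x_0, x_1 for the first and second input variable:

```latex
f(x_0, x_1) = p_0 + p_1 x_0 + p_2 x_1 , \qquad p_0, p_1, p_2 \in [-100, 100]
```

The fitter (GA, MC or MINUIT, depending on the booked variant) searches the three parameter intervals for the values that best reproduce the target.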

Neural network (MLP)

In [29]:
if (Use["MLP"])
   factory->BookMethod( dataloader,  TMVA::Types::kMLP, "MLP", "!H:!V:VarTransform=Norm:NeuronType=tanh:NCycles=20000:HiddenLayers=N+20:TestRate=6:TrainingMethod=BFGS:Sampling=0.3:SamplingEpoch=0.8:ConvergenceImprove=1e-6:ConvergenceTests=15:!UseRegulator" );

if (Use["DNN"])
{
/*
    TString layoutString ("Layout=TANH|(N+100)*2,LINEAR");
    TString layoutString ("Layout=SOFTSIGN|100,SOFTSIGN|50,SOFTSIGN|20,LINEAR");
    TString layoutString ("Layout=RELU|300,RELU|100,RELU|30,RELU|10,LINEAR");
    TString layoutString ("Layout=SOFTSIGN|50,SOFTSIGN|30,SOFTSIGN|20,SOFTSIGN|10,LINEAR");
    TString layoutString ("Layout=TANH|50,TANH|30,TANH|20,TANH|10,LINEAR");
    TString layoutString ("Layout=SOFTSIGN|50,SOFTSIGN|20,LINEAR");
    TString layoutString ("Layout=TANH|100,TANH|30,LINEAR");
 */
    TString layoutString ("Layout=TANH|100,LINEAR");

    TString training0 ("LearningRate=1e-5,Momentum=0.5,Repetitions=1,ConvergenceSteps=500,BatchSize=50,TestRepetitions=7,WeightDecay=0.01,Regularization=NONE,DropConfig=0.5+0.5+0.5+0.5,DropRepetitions=2");
    TString training1 ("LearningRate=1e-5,Momentum=0.9,Repetitions=1,ConvergenceSteps=170,BatchSize=30,TestRepetitions=7,WeightDecay=0.01,Regularization=L2,DropConfig=0.1+0.1+0.1,DropRepetitions=1");
    TString training2 ("LearningRate=1e-5,Momentum=0.3,Repetitions=1,ConvergenceSteps=150,BatchSize=40,TestRepetitions=7,WeightDecay=0.01,Regularization=NONE");
    TString training3 ("LearningRate=1e-6,Momentum=0.1,Repetitions=1,ConvergenceSteps=500,BatchSize=100,TestRepetitions=7,WeightDecay=0.0001,Regularization=NONE");

    TString trainingStrategyString ("TrainingStrategy=");
    trainingStrategyString += training0 + "|" + training1 + "|" + training2 + "|" + training3;


 //       TString trainingStrategyString ("TrainingStrategy=LearningRate=1e-1,Momentum=0.3,Repetitions=3,ConvergenceSteps=20,BatchSize=30,TestRepetitions=7,WeightDecay=0.0,L1=false,DropFraction=0.0,DropRepetitions=5");

    TString nnOptions ("!H:V:ErrorStrategy=SUMOFSQUARES:VarTransform=G:WeightInitialization=XAVIERUNIFORM");
 //       TString nnOptions ("!H:V:VarTransform=Normalize:ErrorStrategy=CHECKGRADIENTS");
    nnOptions.Append (":"); nnOptions.Append (layoutString);
    nnOptions.Append (":"); nnOptions.Append (trainingStrategyString);

    factory->BookMethod(dataloader, TMVA::Types::kDNN, "DNN", nnOptions ); // NN
}

Factory                  : Booking method: MLP
                         : 
MLP                      : [dataset] : Create Transformation "Norm" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'var1' <---> Output : variable 'var1'
                         : Input : variable 'var2' <---> Output : variable 'var2'
                         : Input : target 'fvalue' <---> Output : target 'fvalue'
MLP                      : Building Network. 
                         : Initializing weights
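
The "Norm" transformation mentioned in the output above can be pictured as a linear rescaling. The sketch below assumes each quantity is mapped from its observed range onto [-1, 1]; the function names are hypothetical and the code is not taken from TMVA.

```cpp
#include <cassert>

// Linear map of x from [min, max] onto [-1, 1].
double normalise(double x, double min, double max)
{
   assert(max > min);
   return 2.0 * (x - min) / (max - min) - 1.0;
}

// Inverse map, e.g. to bring a normalised prediction back to the original scale.
double denormalise(double y, double min, double max)
{
   assert(max > min);
   return 0.5 * (y + 1.0) * (max - min) + min;
}
```

Since the log lists the target fvalue among the transformed quantities, a prediction made in the normalised space would need the inverse map to return to the original units.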


Support vector machine

In [30]:
if (Use["SVM"])
   factory->BookMethod( dataloader,  TMVA::Types::kSVM, "SVM", "Gamma=0.25:Tol=0.001:VarTransform=Norm" );

Boosted decision trees

In [31]:
if (Use["BDT"])
  factory->BookMethod( dataloader,  TMVA::Types::kBDT, "BDT",
                        "!H:!V:NTrees=100:MinNodeSize=1.0%:BoostType=AdaBoostR2:SeparationType=RegressionVariance:nCuts=20:PruneMethod=CostComplexity:PruneStrength=30" );

if (Use["BDTG"])
  factory->BookMethod( dataloader,  TMVA::Types::kBDT, "BDTG",
                        "!H:!V:NTrees=2000:BoostType=Grad:Shrinkage=0.1:UseBaggedBoost:BaggedSampleFraction=0.5:nCuts=20:MaxDepth=4" );

Factory                  : Booking method: BDTG
                         : 
                         : the option *InverseBoostNegWeights* does not exist for BoostType=Grad --> change
                         : to new default for GradBoost *Pray*
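
The role of the Shrinkage option in gradient boosting can be summarised by the textbook update rule. This is the generic form, not a statement about TMVA's internals, and the symbols are chosen here for illustration: F_m is the model after m trees, h_m the tree fitted at step m, and the learning rate is identified with Shrinkage.

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M, \quad \nu = 0.1, \quad M = 2000
```

A small learning rate means each tree contributes only a fraction of its fit, which is why such configurations typically use many trees.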


--------------------------------------------------------------------------------------------------

Now you can tell the factory to train, test, and evaluate the MVAs

Train MVAs using the set of training events

In [32]:
factory->TrainAllMethods();

Factory                  : Train all methods
Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'var1' <---> Output : variable 'var1'
                         : Input : variable 'var2' <---> Output : variable 'var2'
TFHandler_Factory        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     var1:     3.4152     1.1962   [  0.0026062     4.9957 ]
                         :     var2:     2.4350     1.4125   [  0.0092062     4.9990 ]
                         :   fvalue:     164.97     82.189   [     1.7144     391.23 ]
                         : -----------------------------------------------------------
                         : Ranking input variables (method unspecific)...

Evaluate all MVAs using the set of test events

In [33]:
factory->TestAllMethods();

Factory                  : Test all methods
Factory                  : Test method: PDEFoam for Regression performance
                         : 
                         : Dataset[dataset] : Create results for testing
                         : Dataset[dataset] : Evaluation of PDEFoam on testing sample
                         : Dataset[dataset] : Elapsed time for evaluation of 9000 events: 0.0554 sec
                         : Create variable histograms
                         : Create regression target histograms
                         : Create regression average deviation
                         : Results created
Factory                  : Test method: KNN for Regression performance
                         : 
                         : Dataset[dataset] : Create results for testing
                         : Dataset[dataset] : Evaluation of KNN on testing sample
                         : Dataset[dataset] : Elapsed time for evaluation of 9000 events: 

Evaluate and compare performance of all configured MVAs

In [34]:
factory->EvaluateAllMethods();

Factory                  : Evaluate all methods
                         : Evaluate regression method: PDEFoam
TFHandler_PDEFoam        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     var1:     3.3308     1.1858   [ 0.00020069     5.0000 ]
                         :     var2:     2.4914     1.4394   [ 0.00071490     5.0000 ]
                         :   fvalue:     164.02     83.934   [     1.6186     394.84 ]
                         : -----------------------------------------------------------
                         : Evaluate regression method: KNN
TFHandler_KNN            : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     var1:     3.3308     1.1858   [ 0.00020069     5.0000 ]
                         :     var2:   
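
A common way to compare regression methods is to look at the deviation between the predicted and the true target on the test sample. The sketch below computes the mean and the root mean square of that deviation; it is a generic illustration with hypothetical names, not TMVA's exact definition of its evaluation quantities.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Deviation {
   double mean;   // average of (prediction - target)
   double rms;    // root mean square of (prediction - target)
};

Deviation deviation(const std::vector<double>& prediction, const std::vector<double>& target)
{
   assert(!prediction.empty() && prediction.size() == target.size());
   double sum = 0.0;
   double sumSq = 0.0;
   for (std::size_t i = 0; i < prediction.size(); ++i) {
      const double d = prediction[i] - target[i];
      sum += d;
      sumSq += d * d;
   }
   const double n = static_cast<double>(prediction.size());
   return {sum / n, std::sqrt(sumSq / n)};
}
```

A mean close to zero indicates an unbiased prediction, while the RMS measures the typical size of the error.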

--------------------------------------------------------------

Save the output

In [35]:
outputFile->Close();

std::cout << "==> Wrote root file: " << outputFile->GetName() << std::endl;
std::cout << "==> TMVARegression is done!" << std::endl;

delete factory;
delete dataloader;

==> Wrote root file: TMVAReg.root
==> TMVARegression is done!


Launch the GUI for the ROOT macros

In [36]:
if (!gROOT->IsBatch()) TMVA::TMVARegGui( outfileName );