[df] Reduce the memory footprint of the computation graph
Every node of the computation graph needs to know which columns it has access
to. This information is stored in the RColumnRegister class, which holds a map
associating every column name available to a given node with the corresponding
RDefineReader. This object can become quite heavy, because each column name is
stored as a std::string and the readers are held by an RDefinesWithReaders
object, which is itself not a trivial type. For very deep computation graphs
(e.g. O(10K) `Define` calls chained one after another in the same branch),
creating the graph alone can take up several GB of memory, and a large portion
of the runtime is spent creating and then destroying these heavy objects. The
map of registered variations follows a similar logic, but the number of
variations grows much more slowly than the number of calls to Define, so its
impact is harder to notice.
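
To make the growth pattern concrete, here is a toy model rather than the actual RColumnRegister code. It assumes that every new Define produces a fresh copy of the name collection (as the class documentation's "replaced with a new updated instance" wording suggests) and that all nodes of the chain stay alive. `ToyOwningRegister`, `AddName` and `CountLiveNameCopies` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-node register that owns its column names.
struct ToyOwningRegister {
   std::shared_ptr<const std::vector<std::string>> fNames;
};

// Assumption: adding a name builds a new vector holding deep copies of all
// previous names plus the new one; the previous node keeps its own vector.
ToyOwningRegister AddName(const ToyOwningRegister &prev, const std::string &name)
{
   auto names = std::make_shared<std::vector<std::string>>(*prev.fNames);
   names->push_back(name);
   return ToyOwningRegister{std::move(names)};
}

// Chain nDefines toy nodes, keep them all alive, and count how many
// std::string objects exist across the whole chain.
std::size_t CountLiveNameCopies(std::size_t nDefines)
{
   std::vector<ToyOwningRegister> nodes;
   nodes.push_back(ToyOwningRegister{std::make_shared<const std::vector<std::string>>()});
   for (std::size_t i = 0; i < nDefines; ++i)
      nodes.push_back(AddName(nodes.back(), "col_" + std::to_string(i)));

   std::size_t total = 0;
   for (const auto &node : nodes)
      total += node.fNames->size();
   return total;
}
```

In this model a chain of N nodes keeps N(N+1)/2 strings alive, so the count grows quadratically with the depth of the chain. The real memory cost also depends on the reader objects, which the sketch leaves out.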

This commit refactors how these objects are handled within the RDataFrame
computation graph. First, both the collection of define readers and the
collection of variation readers are stripped of their ownership
responsibilities. RDefinesWithReaders and RVariationsWithReaders objects are
created through the RColumnRegister class API, but they are registered
centrally in the RLoopManager, which now owns them all via unique_ptr. The
RColumnRegister class only holds non-owning references to those objects. As a
further memory optimization, all the strings holding column/variation names
are also cached centrally in the RLoopManager, and only views of those strings
are kept in the RColumnRegister. To avoid circular references in the
shared_ptr ownership of the RLoopManager itself, RColumnRegister no longer
owns the RLoopManager. The owners of the RLoopManager are the nodes of the
computation graph themselves (via RInterfaceBase). When the last node of the
computation graph is destroyed, the RLoopManager is destroyed as well, which
in turn triggers the deregistration of all the define and variation readers.
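
The ownership layout described above can be sketched in a few lines of standalone C++. This is a simplified illustration under stated assumptions, not the code from this commit: `ToyLoopManager`, `ToyReaders`, `ToyRegister`, `ToyNode` and `MakeDefineNode` are hypothetical stand-ins, and the real classes carry far more state. The sketch relies on the fact that `std::unordered_set` is node-based, so references to its elements stay valid when other elements are inserted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for RDefinesWithReaders.
struct ToyReaders {
   std::string_view fName;
};

// Hypothetical stand-in for RLoopManager: owns the strings and the readers.
class ToyLoopManager {
   std::unordered_set<std::string> fCachedColNames;
   std::vector<std::unique_ptr<ToyReaders>> fOwnedReaders;

public:
   // Store the name once and hand out a view of the cached string.
   std::string_view CacheName(std::string_view name)
   {
      return *fCachedColNames.insert(std::string(name)).first;
   }
   // Create a readers object owned by the manager, return a non-owning pointer.
   ToyReaders *Register(std::string_view name)
   {
      fOwnedReaders.push_back(std::make_unique<ToyReaders>(ToyReaders{CacheName(name)}));
      return fOwnedReaders.back().get();
   }
   std::size_t NCachedNames() const { return fCachedColNames.size(); }
};

// Hypothetical stand-in for RColumnRegister: only views and raw pointers.
struct ToyRegister {
   ToyLoopManager *fLoopManager; // non-owning
   std::vector<std::pair<std::string_view, ToyReaders *>> fDefines;
};

// Hypothetical stand-in for a graph node: the node owns the manager.
struct ToyNode {
   std::shared_ptr<ToyLoopManager> fLoopManager;
   ToyRegister fRegister;
};

// Create the node that follows `prev` and defines one more column.
ToyNode MakeDefineNode(const ToyNode &prev, std::string_view name)
{
   ToyNode next{prev.fLoopManager, prev.fRegister};
   ToyReaders *readers = next.fLoopManager->Register(name);
   next.fRegister.fDefines.emplace_back(readers->fName, readers);
   return next;
}
```

Each register entry in this sketch is one view plus one pointer, so copying the collection from node to node does not copy any string. Because only nodes hold a `shared_ptr` to the manager, destroying the last node releases the manager together with every cached name and readers object.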

Fixes root-project#14510
vepadulano authored and Piter Amador Paye Mamani committed Jun 3, 2024
1 parent 432c0c1 commit 45fa7fd
Showing 22 changed files with 392 additions and 233 deletions.
2 changes: 2 additions & 0 deletions tree/dataframe/CMakeLists.txt
@@ -101,6 +101,7 @@ ROOT_STANDARD_LIBRARY_PACKAGE(ROOTDataFrame
src/RActionBase.cxx
src/RCsvDS.cxx
src/RDefineBase.cxx
src/RDefineReader.cxx
src/RCutFlowReport.cxx
src/RDataFrame.cxx
src/RDatasetSpec.cxx
@@ -125,6 +126,7 @@ ROOT_STANDARD_LIBRARY_PACKAGE(ROOTDataFrame
src/RSample.cxx
src/RResultPtr.cxx
src/RVariationBase.cxx
src/RVariationReader.cxx
src/RVariationsDescription.cxx
src/RRootDS.cxx
src/RTrivialDS.cxx
14 changes: 13 additions & 1 deletion tree/dataframe/inc/ROOT/RDF/GraphNode.hxx
@@ -131,7 +131,19 @@ public:

////////////////////////////////////////////////////////////////////////////
/// \brief Adds the column defined up to the node
void AddDefinedColumns(const std::vector<std::string> &columns) { fDefinedColumns = columns; }
void AddDefinedColumns(const std::vector<std::string_view> &columns)
{
// TODO: Converting the string_views for backward compatibility.
// Since they are names of defined columns, they were added to the
// register of column names of the RLoopManager object by the
// RColumnRegister, so we could also change fDefinedColumns to only
// store string_views
fDefinedColumns.clear();
fDefinedColumns.reserve(columns.size());
for (const auto &col : columns) {
fDefinedColumns.push_back(std::string(col));
};
}

std::string GetColor() const { return fColor; }
unsigned int GetID() const { return fID; }
2 changes: 1 addition & 1 deletion tree/dataframe/inc/ROOT/RDF/RAction.hxx
@@ -143,7 +143,7 @@ public:

auto upmostNode = AddDefinesToGraph(thisNode, GetColRegister(), prevColumns, visitedMap);

thisNode->AddDefinedColumns(GetColRegister().GetNames());
thisNode->AddDefinedColumns(GetColRegister().GenerateColumnNames());
upmostNode->SetPrevNode(prevNode);
return thisNode;
}
94 changes: 36 additions & 58 deletions tree/dataframe/inc/ROOT/RDF/RColumnRegister.hxx
@@ -1,7 +1,8 @@
// Author: Enrico Guiraud, Danilo Piparo, Massimo Tumolo CERN 06/2018
// Author: Vincenzo Eduardo Padulano CERN 05/2024

/*************************************************************************
* Copyright (C) 1995-2021, Rene Brun and Fons Rademakers. *
* Copyright (C) 1995-2024, Rene Brun and Fons Rademakers. *
* All rights reserved. *
* *
* For the licensing terms see $ROOTSYS/LICENSE. *
@@ -17,7 +18,9 @@
#include <unordered_map>
#include <memory>
#include <string>
#include <string_view>
#include <vector>
#include <utility>

namespace ROOT {
namespace RDF {
@@ -36,38 +39,10 @@ namespace RDF {

namespace RDFDetail = ROOT::Detail::RDF;

class RDefineReader;
class RVariationBase;
class RVariationReader;

/// A helper type that keeps track of RDefine objects and their corresponding RDefineReaders.
class RDefinesWithReaders {
using RDefineBase = RDFDetail::RDefineBase;

// this is a shared_ptr only because we have to track its lifetime with a weak_ptr that we pass to jitted code
// (see BookDefineJit). it is never null.
std::shared_ptr<RDefineBase> fDefine;
// Column readers per variation (in the map) per slot (in the vector).
std::vector<std::unordered_map<std::string, std::unique_ptr<RDefineReader>>> fReadersPerVariation;

public:
RDefinesWithReaders(std::shared_ptr<RDefineBase> define, unsigned int nSlots);
RDefineBase &GetDefine() const { return *fDefine; }
RDefineReader &GetReader(unsigned int slot, const std::string &variationName);
};

class RVariationsWithReaders {
// this is a shared_ptr only because we have to track its lifetime with a weak_ptr that we pass to jitted code
// (see BookVariationJit). it is never null.
std::shared_ptr<RVariationBase> fVariation;
// Column readers for this RVariation for a given variation (map key) and a given slot (vector element).
std::vector<std::unordered_map<std::string, std::unique_ptr<RVariationReader>>> fReadersPerVariation;

public:
RVariationsWithReaders(std::shared_ptr<RVariationBase> variation, unsigned int nSlots);
RVariationBase &GetVariation() const { return *fVariation; }
RVariationReader &GetReader(unsigned int slot, const std::string &colName, const std::string &variationName);
};
class RDefinesWithReaders;
class RVariationsWithReaders;

/**
* \class ROOT::Internal::RDF::RColumnRegister
@@ -85,64 +60,67 @@ public:
* between one node and the next, then the new node contains a new instance of a RColumnRegister that shares all data
* members with the previous instance except for the one data member that needed updating, which is replaced with a new
* updated instance.
*
* The contents of the collections that keep track of other objects of the computation graph are not owned by the
* RColumnRegister object. They are registered centrally by the RLoopManager and only accessed via reference in the
* RColumnRegister.
*/
class RColumnRegister {
using ColumnNames_t = std::vector<std::string>;
using DefinesMap_t = std::unordered_map<std::string, std::shared_ptr<RDefinesWithReaders>>;
/// See fVariations for more information on this type.
using VariationsMap_t = std::unordered_multimap<std::string, std::shared_ptr<RVariationsWithReaders>>;
using VariationsMap_t = std::unordered_multimap<std::string_view, ROOT::Internal::RDF::RVariationsWithReaders *>;
using DefinesMap_t = std::vector<std::pair<std::string_view, ROOT::Internal::RDF::RDefinesWithReaders *>>;
using AliasesMap_t = std::vector<std::pair<std::string_view, std::string_view>>;

std::shared_ptr<RDFDetail::RLoopManager> fLoopManager;
/// The head node of the computation graph this register belongs to. Never null.
ROOT::Detail::RDF::RLoopManager *fLoopManager;

/// Immutable multimap of Variations, can be shared among several nodes.
/// The key is the name of an existing column, the values are all variations
/// that affect that column. Variations that affect multiple columns are
/// inserted in the map multiple times, once per column, and conversely each
/// column (i.e. each key) can have several associated variations.
std::shared_ptr<const VariationsMap_t> fVariations;
/// Immutable collection of Defines, can be shared among several nodes.
/// The pointee changes if a new Define node is added to the RColumnRegister.
std::shared_ptr<DefinesMap_t> fDefines;
/// It is a vector because we rely on insertion order to recreate the branch
/// of the computation graph where necessary.
std::shared_ptr<const DefinesMap_t> fDefines;
/// Immutable map of Aliases, can be shared among several nodes.
std::shared_ptr<const std::unordered_map<std::string, std::string>> fAliases;
/// Immutable multimap of Variations, can be shared among several nodes.
/// The key is the name of an existing column, the values are all variations that affect that column.
/// Variations that affect multiple columns are inserted in the map multiple times, once per column,
/// and conversely each column (i.e. each key) can have several associated variations.
std::shared_ptr<VariationsMap_t> fVariations;
std::shared_ptr<const ColumnNames_t> fColumnNames; ///< Names of Defines and Aliases registered so far.

void AddName(std::string_view name);
/// The pointee changes if a new Alias node is added to the RColumnRegister.
/// It is a vector because we rely on insertion order to recreate the branch
/// of the computation graph where necessary.
std::shared_ptr<const AliasesMap_t> fAliases;

RVariationsWithReaders *FindVariationAndReaders(const std::string &colName, const std::string &variationName);

public:
RColumnRegister(const RColumnRegister &) = default;
RColumnRegister(RColumnRegister &&) = default;
RColumnRegister &operator=(const RColumnRegister &) = default;

explicit RColumnRegister(std::shared_ptr<RDFDetail::RLoopManager> lm);
~RColumnRegister();
explicit RColumnRegister(ROOT::Detail::RDF::RLoopManager *lm);

////////////////////////////////////////////////////////////////////////////
/// \brief Return the list of the names of the defined columns (Defines + Aliases).
ColumnNames_t GetNames() const { return *fColumnNames; }
std::vector<std::string_view> GenerateColumnNames() const;

ColumnNames_t BuildDefineNames() const;
std::vector<std::string_view> BuildDefineNames() const;

RDFDetail::RDefineBase *GetDefine(const std::string &colName) const;
RDFDetail::RDefineBase *GetDefine(std::string_view colName) const;

bool IsDefineOrAlias(std::string_view name) const;

void AddDefine(std::shared_ptr<RDFDetail::RDefineBase> column);

void AddAlias(std::string_view alias, std::string_view colName);

bool IsAlias(const std::string &name) const;
bool IsAlias(std::string_view name) const;
bool IsDefine(std::string_view name) const;

std::string ResolveAlias(std::string_view alias) const;
std::string_view ResolveAlias(std::string_view alias) const;

void AddVariation(std::shared_ptr<RVariationBase> variation);

std::vector<std::string> GetVariationsFor(const std::string &column) const;

std::vector<std::string> GetVariationDeps(const std::string &column) const;

std::vector<std::string> GetVariationDeps(const ColumnNames_t &columns) const;
std::vector<std::string> GetVariationDeps(const std::vector<std::string> &columns) const;

ROOT::RDF::RVariationsDescription BuildVariationsDescription() const;

5 changes: 3 additions & 2 deletions tree/dataframe/inc/ROOT/RDF/RDefinePerSample.hxx
@@ -39,8 +39,9 @@ class R__CLING_PTRCHECK(off) RDefinePerSample final : public RDefineBase {

public:
RDefinePerSample(std::string_view name, std::string_view type, F expression, RLoopManager &lm)
: RDefineBase(name, type, RDFInternal::RColumnRegister{nullptr}, lm, /*columnNames*/ {}),
fExpression(std::move(expression)), fLastResults(lm.GetNSlots() * RDFInternal::CacheLineStep<RetType_t>())
: RDefineBase(name, type, RDFInternal::RColumnRegister{&lm}, lm, /*columnNames*/ {}),
fExpression(std::move(expression)),
fLastResults(lm.GetNSlots() * RDFInternal::CacheLineStep<RetType_t>())
{
fLoopManager->Register(this);
auto callUpdate = [this](unsigned int slot, const ROOT::RDF::RSampleInfo &id) { this->Update(slot, id); };
29 changes: 27 additions & 2 deletions tree/dataframe/inc/ROOT/RDF/RDefineReader.hxx
@@ -18,6 +18,12 @@
#include <limits>
#include <type_traits>

#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>

namespace ROOT {
namespace Internal {
namespace RDF {
@@ -48,8 +54,27 @@ public:
}
};

}
}
/// A helper type that keeps track of RDefine objects and their corresponding RDefineReaders.
class RDefinesWithReaders {

// this is a shared_ptr only because we have to track its lifetime with a weak_ptr that we pass to jitted code
// (see BookDefineJit). it is never null.
std::shared_ptr<ROOT::Detail::RDF::RDefineBase> fDefine;
// Column readers per variation (in the map) per slot (in the vector).
std::vector<std::unordered_map<std::string_view, std::unique_ptr<RDefineReader>>> fReadersPerVariation;

// Strings that were already used to represent column names in this RDataFrame instance.
std::unordered_set<std::string> &fCachedColNames;

public:
RDefinesWithReaders(std::shared_ptr<ROOT::Detail::RDF::RDefineBase> define, unsigned int nSlots,
std::unordered_set<std::string> &cachedColNames);
ROOT::Detail::RDF::RDefineBase &GetDefine() const { return *fDefine; }
ROOT::Internal::RDF::RDefineReader &GetReader(unsigned int slot, std::string_view variationName);
};

} // namespace RDF
} // namespace Internal
}

#endif
2 changes: 1 addition & 1 deletion tree/dataframe/inc/ROOT/RDF/RFilter.hxx
@@ -181,7 +181,7 @@ public:
auto upmostNode = AddDefinesToGraph(thisNode, fColRegister, prevColumns, visitedMap);

// Keep track of the columns defined up to this point.
thisNode->AddDefinedColumns(fColRegister.GetNames());
thisNode->AddDefinedColumns(fColRegister.GenerateColumnNames());

upmostNode->SetPrevNode(prevNode);
return thisNode;
6 changes: 3 additions & 3 deletions tree/dataframe/inc/ROOT/RDF/RInterface.hxx
@@ -1207,7 +1207,7 @@ public:
std::string_view columnNameRegexp = "",
const RSnapshotOptions &options = RSnapshotOptions())
{
const auto definedColumns = fColRegister.GetNames();
const auto definedColumns = fColRegister.GenerateColumnNames();
auto *tree = fLoopManager->GetTree();
const auto treeBranchNames = tree != nullptr ? ROOT::Internal::TreeUtils::GetTopLevelBranchNames(*tree) : ColumnNames_t{};
const auto dsColumns = fDataSource ? fDataSource->GetColumnNames() : ColumnNames_t{};
@@ -1350,7 +1350,7 @@ public:
/// is empty, all columns are selected. See the previous overloads for more information.
RInterface<RLoopManager> Cache(std::string_view columnNameRegexp = "")
{
const auto definedColumns = fColRegister.GetNames();
const auto definedColumns = fColRegister.GenerateColumnNames();
auto *tree = fLoopManager->GetTree();
const auto treeBranchNames =
tree != nullptr ? ROOT::Internal::TreeUtils::GetTopLevelBranchNames(*tree) : ColumnNames_t{};
@@ -2628,7 +2628,7 @@ public:
// certainly does not contain named filters.
// The number 4 takes into account the implicit columns for entry and slot number
// and their aliases (2 + 2, i.e. {r,t}dfentry_ and {r,t}dfslot_)
if (std::is_same<Proxied, RLoopManager>::value && fColRegister.GetNames().size() > 4)
if (std::is_same<Proxied, RLoopManager>::value && fColRegister.GenerateColumnNames().size() > 4)
returnEmptyReport = true;

auto rep = std::make_shared<RCutFlowReport>();
4 changes: 2 additions & 2 deletions tree/dataframe/inc/ROOT/RDF/RInterfaceBase.hxx
@@ -52,7 +52,7 @@ namespace RDFInternal = ROOT::Internal::RDF;
class RInterfaceBase {
protected:
///< The RLoopManager at the root of this computation graph. Never null.
RDFDetail::RLoopManager *fLoopManager;
std::shared_ptr<ROOT::Detail::RDF::RLoopManager> fLoopManager;
/// Non-owning pointer to a data-source object. Null if no data-source. RLoopManager has ownership of the object.
RDataSource *fDataSource = nullptr;

@@ -125,7 +125,7 @@ protected:
}
}

RDFDetail::RLoopManager *GetLoopManager() const { return fLoopManager; }
RDFDetail::RLoopManager *GetLoopManager() const { return fLoopManager.get(); }

ColumnNames_t GetValidatedColumnNames(const unsigned int nColumns, const ColumnNames_t &columns)
{
29 changes: 29 additions & 0 deletions tree/dataframe/inc/ROOT/RDF/RLoopManager.hxx
@@ -22,8 +22,11 @@
#include <limits>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// forward declarations
@@ -44,6 +47,8 @@ std::vector<std::string> GetBranchNames(TTree &t, bool allowDuplicates = true);
class GraphNode;
class RActionBase;
class RVariationBase;
class RDefinesWithReaders;
class RVariationsWithReaders;

namespace GraphDrawing {
class GraphCreatorHelper;
@@ -176,14 +181,26 @@ class RLoopManager : public RNodeBase {
void UpdateSampleInfo(unsigned int slot, const std::pair<ULong64_t, ULong64_t> &range);
void UpdateSampleInfo(unsigned int slot, TTreeReader &r);

std::unordered_set<std::string> fCachedColNames;
std::set<std::pair<std::string_view, std::unique_ptr<ROOT::Internal::RDF::RDefinesWithReaders>>>
fUniqueDefinesWithReaders;
std::set<std::pair<std::string_view, std::unique_ptr<ROOT::Internal::RDF::RVariationsWithReaders>>>
fUniqueVariationsWithReaders;

public:
RLoopManager(TTree *tree, const ColumnNames_t &defaultBranches);
RLoopManager(std::unique_ptr<TTree> tree, const ColumnNames_t &defaultBranches);
RLoopManager(ULong64_t nEmptyEntries);
RLoopManager(std::unique_ptr<RDataSource> ds, const ColumnNames_t &defaultBranches);
RLoopManager(ROOT::RDF::Experimental::RDatasetSpec &&spec);

// Rule of five

RLoopManager(const RLoopManager &) = delete;
RLoopManager &operator=(const RLoopManager &) = delete;
RLoopManager(RLoopManager &&) = delete;
RLoopManager &operator=(RLoopManager &&) = delete;
~RLoopManager() = default;

void JitDeclarations();
void Jit();
@@ -243,6 +260,18 @@ public:

void SetEmptyEntryRange(std::pair<ULong64_t, ULong64_t> &&newRange);
void ChangeSpec(ROOT::RDF::Experimental::RDatasetSpec &&spec);

std::unordered_set<std::string> &GetColumnNamesCache() { return fCachedColNames; }
std::set<std::pair<std::string_view, std::unique_ptr<ROOT::Internal::RDF::RDefinesWithReaders>>> &
GetUniqueDefinesWithReaders()
{
return fUniqueDefinesWithReaders;
}
std::set<std::pair<std::string_view, std::unique_ptr<ROOT::Internal::RDF::RVariationsWithReaders>>> &
GetUniqueVariationsWithReaders()
{
return fUniqueVariationsWithReaders;
}
};

/// \brief Create an RLoopManager that reads a TChain.