Commit

Refactor io into reading and processing and create table tree structure (#607)

The function that creates the graph and related structures from a CSV folder has been split into
two functions. The first reads the CSV folder into a new TableTree structure.
The second processes the TableTree structure into the graph and related structures.
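
In rough outline, the new flow is sketched below (the input folder path is illustrative; any folder following the expected CSV schema works):

```julia
using TulipaEnergyModel

input_dir = "test/inputs/Tiny"  # illustrative path

# Step 1: read the CSV files into the new TableTree structure.
table_tree = create_input_dataframes_from_csv_folder(input_dir)

# Step 2: process the TableTree into the graph and related structures.
graph, representative_periods, timeframe = create_internal_structures(table_tree)
```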
abelsiqueira committed Apr 26, 2024
1 parent 49a243d commit 29814b7
Showing 6 changed files with 146 additions and 68 deletions.
23 changes: 21 additions & 2 deletions docs/src/how-to-use.md
@@ -171,10 +171,29 @@ It hides the complexity behind the energy problem, making the usage more friendly

The `EnergyProblem` can also be constructed using the minimal constructor below.

- `EnergyProblem(graph, representative_periods, timeframe)`: Constructs a new `EnergyProblem` object with the given graph, representative periods, and timeframe. The `constraints_partitions` field is computed from the `representative_periods`, and the other fields are initialized with default values.
- `EnergyProblem(table_tree)`: Constructs a new `EnergyProblem` object with the given [`table_tree`](@ref TableTree) object. The `graph`, `representative_periods`, and `timeframe` are computed using `create_internal_structures`. The `constraints_partitions` field is computed from the `representative_periods`, and the other fields are initialized with default values.

See the [basic example tutorial](@ref basic-example) to see how these can be used.
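
For example, a minimal sketch of the `table_tree`-based construction (the input folder is illustrative):

```julia
using TulipaEnergyModel

table_tree = create_input_dataframes_from_csv_folder("test/inputs/Tiny")  # illustrative folder

# This constructor calls create_internal_structures(table_tree) internally.
energy_problem = EnergyProblem(table_tree)
```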

### TableTree

To store and move the data around, we use [DataFrames](https://dataframes.juliadata.org) and a tree-like structure (`TableTree`) that links these tables together.
Each field of this structure is a NamedTuple. Its fields are listed below, followed by a short access sketch:

- `static`: Stores the data that does not vary within a year. Its fields are:
  - `assets`: Assets data.
  - `flows`: Flows data.
- `profiles`: Stores the profile data, indexed by:
  - `assets`: Dictionary with the references to assets' profiles, indexed by period type (`"rep-periods"` or `"timeframe"`).
  - `flows`: References to flows' profiles for representative periods.
  - `data`: Actual profile data. Dictionary of dictionaries indexed by period type and then by profile name.
- `partitions`: Stores the partitions data, indexed by:
  - `assets`: Dictionary with the specification of the assets' partitions, indexed by period type.
  - `flows`: Specification of the flows' partitions for representative periods.
- `periods`: Stores the periods data, indexed by:
  - `rep_periods`: Representative periods data.
  - `mapping`: Mapping between the timeframe periods and the representative periods.
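
A minimal access sketch, assuming `table_tree` was created with `create_input_dataframes_from_csv_folder`:

```julia
# Static asset and flow tables (DataFrames).
table_tree.static.assets
table_tree.static.flows

# Profile references per period type ("rep-periods" or "timeframe") and the actual profile data.
table_tree.profiles.assets["rep-periods"]
table_tree.profiles.flows
table_tree.profiles.data["rep-periods"]

# Partition specifications and period information.
table_tree.partitions.assets["timeframe"]
table_tree.partitions.flows
table_tree.periods.rep_periods
table_tree.periods.mapping
```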

### Graph

The energy problem is defined using a graph.
@@ -185,7 +204,7 @@ Using MetaGraphsNext we can define a graph with metadata, i.e., associate data w
Furthermore, we can define the labels of each asset as keys to access the elements of the graph.
The assets in the graph are of type [GraphAssetData](@ref), and the flows are of type [GraphFlowData](@ref).

The graph can be created using the [`create_graph_and_representative_periods_from_csv_folder`](@ref) function, or it can be accessed from an [EnergyProblem](@ref).
The graph can be created using the [`create_internal_structures`](@ref) function, or it can be accessed from an [EnergyProblem](@ref).

See how to use the graph in the [graph tutorial](@ref graph-tutorial).
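
As a short sketch (assuming `MetaGraphsNext` is loaded and `table_tree` comes from the previous section):

```julia
using MetaGraphsNext

graph, representative_periods, timeframe = create_internal_structures(table_tree)

# All asset labels and all flows as (u, v) pairs of asset labels.
assets = collect(MetaGraphsNext.labels(graph))
flows = collect(MetaGraphsNext.edge_labels(graph))

# Metadata lookup by label: GraphAssetData for an asset, GraphFlowData for a flow.
asset_data = graph[first(assets)]
u, v = first(flows)
flow_data = graph[u, v]
```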

10 changes: 8 additions & 2 deletions docs/src/tutorials.md
@@ -83,15 +83,21 @@ energy_problem.objective_value, energy_problem.termination_status
### Manually creating all structures without EnergyProblem

For additional control, it might be desirable to use the internal structures of `EnergyProblem` directly.
This can be error-prone, but it is slightly more efficient.
This can be error-prone, so use it with care.
The full description for these structures can be found in [Structures](@ref).

```@example manual
using TulipaEnergyModel
input_dir = "../../test/inputs/Tiny" # hide
# input_dir should be the path to Tiny
graph, representative_periods, timeframe = create_graph_and_representative_periods_from_csv_folder(input_dir)
table_tree = create_input_dataframes_from_csv_folder(input_dir)
```

The `table_tree` contains all tables in the folder, which are then processed into the internal structures below:

```@example manual
graph, representative_periods, timeframe = create_internal_structures(table_tree)
```

We also need a time partition for the constraints to create the model.
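
As a hedged sketch of that step (the qualified name is used in case the helper is not exported):

```julia
constraints_partitions =
    TulipaEnergyModel.compute_constraints_partitions(graph, representative_periods)
```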
126 changes: 69 additions & 57 deletions src/io.jl
@@ -1,5 +1,6 @@
export create_energy_problem_from_csv_folder,
create_graph_and_representative_periods_from_csv_folder,
create_input_dataframes_from_csv_folder,
create_internal_structures,
save_solution_to_file,
compute_assets_partitions!,
compute_flows_partitions!
@@ -14,15 +15,14 @@ the `EnergyProblem` structure.
Set `strict = true` to error if assets are missing from partition data.
"""
function create_energy_problem_from_csv_folder(input_folder::AbstractString; strict = false)
graph, representative_periods, timeframe =
create_graph_and_representative_periods_from_csv_folder(input_folder; strict = strict)
return EnergyProblem(graph, representative_periods, timeframe)
table_tree = create_input_dataframes_from_csv_folder(input_folder; strict = strict)
return EnergyProblem(table_tree)
end

"""
graph, representative_periods, timeframe = create_graph_and_representative_periods_from_csv_folder(input_folder; strict = false)
table_tree = create_input_dataframes_from_csv_folder(input_folder; strict = false)
Returns the `graph` structure that holds all data, and the `representative_periods` array.
Returns the `table_tree::TableTree` structure that holds all data.
Set `strict = true` to error if assets are missing from partition data.
The following files are expected to exist in the input folder:
@@ -39,48 +39,31 @@ The following files are expected to exist in the input folder:
- `profiles-rep-periods-<type>.csv`: Following the schema `schemas.rep_periods.profiles_data`.
- `rep-periods-data.csv`: Following the schema `schemas.rep_periods.data`.
- `rep-periods-mapping.csv`: Following the schema `schemas.rep_periods.mapping`.
The returned structures are:
- `graph`: a MetaGraph with the following information:
+ `labels(graph)`: All assets.
+ `edge_labels(graph)`: All flows, in pair format `(u, v)`, where `u` and `v` are assets.
+ `graph[a]`: A [`TulipaEnergyModel.GraphAssetData`](@ref) structure for asset `a`.
+ `graph[u, v]`: A [`TulipaEnergyModel.GraphFlowData`](@ref) structure for flow `(u, v)`.
- `representative_periods`: An array of
[`TulipaEnergyModel.RepresentativePeriod`](@ref) ordered by their IDs.
- `timeframe`: Information of
[`TulipaEnergyModel.Timeframe`](@ref).
"""
function create_graph_and_representative_periods_from_csv_folder(
input_folder::AbstractString;
strict = false,
)
function create_input_dataframes_from_csv_folder(input_folder::AbstractString; strict = false)
df_assets_data = read_csv_with_implicit_schema(input_folder, "assets-data.csv")
df_flows_data = read_csv_with_implicit_schema(input_folder, "flows-data.csv")
df_rep_period = read_csv_with_implicit_schema(input_folder, "rep-periods-data.csv")
df_rep_periods = read_csv_with_implicit_schema(input_folder, "rep-periods-data.csv")
df_rp_mapping = read_csv_with_implicit_schema(input_folder, "rep-periods-mapping.csv")

df_assets_profiles = Dict(
profile_type =>
read_csv_with_implicit_schema(input_folder, "assets-$profile_type-profiles.csv") for
profile_type in ["timeframe", "rep-periods"]
period_types = ["rep-periods", "timeframe"]

dfs_assets_profiles = Dict(
period_type =>
read_csv_with_implicit_schema(input_folder, "assets-$period_type-profiles.csv") for
period_type in period_types
)
df_flows_profiles =
read_csv_with_implicit_schema(input_folder, "flows-rep-periods-profiles.csv")
df_assets_partitions = Dict(
"timeframe" =>
read_csv_with_implicit_schema(input_folder, "assets-timeframe-partitions.csv"),
"rep-periods" =>
read_csv_with_implicit_schema(input_folder, "assets-rep-periods-partitions.csv"),
dfs_assets_partitions = Dict(
period_type =>
read_csv_with_implicit_schema(input_folder, "assets-$period_type-partitions.csv")
for period_type in period_types
)
df_flows_partitions =
read_csv_with_implicit_schema(input_folder, "flows-rep-periods-partitions.csv")

df_profiles = Dict(
dfs_profiles = Dict(
period_type => Dict(
begin
regex = "profiles-$(period_type)-(.*).csv"
@@ -90,13 +73,13 @@ function create_graph_and_representative_periods_from_csv_folder(
key => value
end for filename in readdir(input_folder) if
startswith("profiles-$period_type-")(filename)
) for period_type in ["rep-periods", "timeframe"]
) for period_type in period_types
)

# Error if partition data is missing assets (if strict)
if strict
missing_assets =
setdiff(df_assets_data[!, :name], df_assets_partitions["rep-periods"][!, :asset])
setdiff(df_assets_data[!, :name], dfs_assets_partitions["rep-periods"][!, :asset])
if length(missing_assets) > 0
msg = "Error: Partition data missing for these assets: \n"
for a in missing_assets
@@ -108,24 +91,53 @@ function create_graph_and_representative_periods_from_csv_folder(
end
end

# Sets and subsets that depend on input data
table_tree = TableTree(
(assets = df_assets_data, flows = df_flows_data),
(assets = dfs_assets_profiles, flows = df_flows_profiles, data = dfs_profiles),
(assets = dfs_assets_partitions, flows = df_flows_partitions),
(rep_periods = df_rep_periods, mapping = df_rp_mapping),
)

return table_tree
end

"""
graph, representative_periods, timeframe = create_internal_structures(table_tree)
Return the `graph`, `representative_periods`, and `timeframe` structures given the input `table_tree::TableTree` structure.
The details of these structures are:
- `graph`: a MetaGraph with the following information:
+ `labels(graph)`: All assets.
+ `edge_labels(graph)`: All flows, in pair format `(u, v)`, where `u` and `v` are assets.
+ `graph[a]`: A [`TulipaEnergyModel.GraphAssetData`](@ref) structure for asset `a`.
+ `graph[u, v]`: A [`TulipaEnergyModel.GraphFlowData`](@ref) structure for flow `(u, v)`.
- `representative_periods`: An array of
[`TulipaEnergyModel.RepresentativePeriod`](@ref) ordered by their IDs.
- `timeframe`: Information of
[`TulipaEnergyModel.Timeframe`](@ref).
"""
function create_internal_structures(table_tree::TableTree)
# TODO: Depending on the outcome of issue #294, this can be done more efficiently with DataFrames, e.g.,
# combine(groupby(df_rp_mapping, :rep_period), :weight => sum => :weight)
# combine(groupby(table_tree.periods.mapping, :rep_period), :weight => sum => :weight)

# Create a dictionary of weights and populate it.
weights = Dict{Int,Dict{Int,Float64}}()
for sub_df in DataFrames.groupby(df_rp_mapping, :rep_period)
for sub_df in DataFrames.groupby(table_tree.periods.mapping, :rep_period)
rp = first(sub_df.rep_period)
weights[rp] = Dict(Pair.(sub_df.period, sub_df.weight))
end

representative_periods = [
RepresentativePeriod(weights[row.id], row.num_timesteps, row.resolution) for
row in eachrow(df_rep_period)
row in eachrow(table_tree.periods.rep_periods)
]

timeframe = Timeframe(maximum(df_rp_mapping.period), df_rp_mapping)
timeframe = Timeframe(maximum(table_tree.periods.mapping.period), table_tree.periods.mapping)

asset_data = [
row.name => GraphAssetData(
@@ -147,7 +159,7 @@ function create_graph_and_representative_periods_from_csv_folder(
row.initial_storage_capacity,
row.initial_storage_level,
row.energy_to_power_ratio,
) for row in eachrow(df_assets_data)
) for row in eachrow(table_tree.static.assets)
]

flow_data = [
@@ -164,11 +176,11 @@ function create_graph_and_representative_periods_from_csv_folder(
row.initial_export_capacity,
row.initial_import_capacity,
row.efficiency,
) for row in eachrow(df_flows_data)
) for row in eachrow(table_tree.static.flows)
]

num_assets = length(asset_data)
name_to_id = Dict(name => i for (i, name) in enumerate(df_assets_data.name))
name_to_id = Dict(name => i for (i, name) in enumerate(table_tree.static.assets.name))

_graph = Graphs.DiGraph(num_assets)
for flow in flow_data
@@ -181,7 +193,7 @@ function create_graph_and_representative_periods_from_csv_folder(
for a in MetaGraphsNext.labels(graph)
compute_assets_partitions!(
graph[a].rep_periods_partitions,
df_assets_partitions["rep-periods"],
table_tree.partitions.assets["rep-periods"],
a,
representative_periods,
)
@@ -190,19 +202,19 @@ function create_graph_and_representative_periods_from_csv_folder(
for (u, v) in MetaGraphsNext.edge_labels(graph)
compute_flows_partitions!(
graph[u, v].rep_periods_partitions,
df_flows_partitions,
table_tree.partitions.flows,
u,
v,
representative_periods,
)
end

# For timeframe, only the assets where is_seasonal is true are selected
for row in eachrow(df_assets_data)
for row in eachrow(table_tree.static.assets)
if row.is_seasonal
# Search for this row in the df_assets_partitions and error if it is not found
# Search for this row in the table_tree.partitions.assets and error if it is not found
found = false
for partition_row in eachrow(df_assets_partitions["timeframe"])
for partition_row in eachrow(table_tree.partitions.assets["timeframe"])
if row.name == partition_row.asset
graph[row.name].timeframe_partitions = _parse_rp_partition(
Val(partition_row.specification),
@@ -220,11 +232,11 @@ function create_graph_and_representative_periods_from_csv_folder(
end
end

for asset_profile_row in eachrow(df_assets_profiles["rep-periods"]) # row = asset, profile_type, profile_name
for asset_profile_row in eachrow(table_tree.profiles.assets["rep-periods"]) # row = asset, profile_type, profile_name
gp = DataFrames.groupby( # 3. group by RP
filter(
row -> row.profile_name == asset_profile_row.profile_name, # 2. Filter profile_name
df_profiles["rep-periods"][asset_profile_row.profile_type], # 1. Get the profile of given type
table_tree.profiles.data["rep-periods"][asset_profile_row.profile_type], # 1. Get the profile of given type
),
:rep_period,
)
@@ -236,11 +248,11 @@ function create_graph_and_representative_periods_from_csv_folder(
end
end

for flow_profile_row in eachrow(df_flows_profiles)
for flow_profile_row in eachrow(table_tree.profiles.flows)
gp = DataFrames.groupby(
filter(
row -> row.profile_name == flow_profile_row.profile_name,
df_profiles["rep-periods"][flow_profile_row.profile_type],
table_tree.profiles.data["rep-periods"][flow_profile_row.profile_type],
),
:rep_period,
)
@@ -252,10 +264,10 @@ function create_graph_and_representative_periods_from_csv_folder(
end
end

for asset_profile_row in eachrow(df_assets_profiles["timeframe"]) # row = asset, profile_type, profile_name
for asset_profile_row in eachrow(table_tree.profiles.assets["timeframe"]) # row = asset, profile_type, profile_name
df = filter(
row -> row.profile_name == asset_profile_row.profile_name, # 2. Filter profile_name
df_profiles["timeframe"][asset_profile_row.profile_type], # 1. Get the profile of given type
table_tree.profiles.data["timeframe"][asset_profile_row.profile_type], # 1. Get the profile of given type
)
graph[asset_profile_row.asset].timeframe_profiles[asset_profile_row.profile_type] = df.value
end
47 changes: 43 additions & 4 deletions src/structures.jl
@@ -4,6 +4,42 @@ export GraphAssetData,
const TimestepsBlock = UnitRange{Int}
const PeriodsBlock = UnitRange{Int}

const PeriodType = String
const TableNodeStatic = @NamedTuple{assets::DataFrame, flows::DataFrame}
const TableNodeProfiles = @NamedTuple{
assets::Dict{PeriodType,DataFrame},
flows::DataFrame,
data::Dict{PeriodType,Dict{Symbol,DataFrame}},
}
const TableNodePartitions = @NamedTuple{assets::Dict{PeriodType,DataFrame}, flows::DataFrame}
const TableNodePeriods = @NamedTuple{rep_periods::DataFrame, mapping::DataFrame}

"""
Structure to hold the tabular data.
## Fields
- `static`: Stores the data that does not vary within a year. Its fields are:
  - `assets`: Assets data.
  - `flows`: Flows data.
- `profiles`: Stores the profile data, indexed by:
  - `assets`: Dictionary with the references to assets' profiles, indexed by period type (`"rep-periods"` or `"timeframe"`).
  - `flows`: References to flows' profiles for representative periods.
  - `data`: Actual profile data. Dictionary of dictionaries indexed by period type and then by profile name.
- `partitions`: Stores the partitions data, indexed by:
  - `assets`: Dictionary with the specification of the assets' partitions, indexed by period type.
  - `flows`: Specification of the flows' partitions for representative periods.
- `periods`: Stores the periods data, indexed by:
  - `rep_periods`: Representative periods data.
  - `mapping`: Mapping between the timeframe periods and the representative periods.
"""
struct TableTree
static::TableNodeStatic
profiles::TableNodeProfiles
partitions::TableNodePartitions
periods::TableNodePeriods
end

"""
Structure to hold the data of the timeframe.
"""
@@ -197,6 +233,7 @@ It hides the complexity behind the energy problem, making the usage more friendly
See the [basic example tutorial](@ref basic-example) to see how these can be used.
"""
mutable struct EnergyProblem
table_tree::TableTree
graph::MetaGraph{
Int,
SimpleDiGraph{Int},
@@ -221,15 +258,17 @@ mutable struct EnergyProblem
time_solve_model::Float64

"""
EnergyProblem(graph, representative_periods, timeframe)
EnergyProblem(dfs_input)
Constructs a new EnergyProblem object with the given graph, representative periods, and timeframe. The `constraints_partitions` field is computed from the `representative_periods`,
and the other fields are `nothing` or set to default values.
Constructs a new EnergyProblem object from the input dataframes.
This will call [`create_internal_structures`](@ref).
"""
function EnergyProblem(graph, representative_periods, timeframe)
function EnergyProblem(dfs_input)
graph, representative_periods, timeframe = create_internal_structures(dfs_input)
constraints_partitions = compute_constraints_partitions(graph, representative_periods)

return new(
dfs_input,
graph,
representative_periods,
constraints_partitions,
