Widelong #1

Open · wants to merge 30 commits into main
Conversation

martinwellman (Collaborator)

This is the first attempt at the ODM wide-to-long-to-wide mapping. Can you go over it?

martinwellman (Collaborator, Author)

Rewrote the mapping spec document to focus on the mapping operations, rather than the database formats.

  • specs/mapping-operations.qmd is the main mapping operations spec.
  • R/docs/doc_mdtables.R contains R functions to create Markdown tables from dataframes, tibbles, lists, and CSV files.
  • specs/examples/ contains example CSV files for ODM wide and ODM long.

yulric commented Oct 13, 2023

@martinwellman Can you let me know when this PR is good to review again?

martinwellman (Collaborator, Author)

This PR is ready to review again. @yulric

…rst column name

Problem: On Windows machines, if you do not specify the file encoding to
be UTF-8-BOM when reading in a CSV file, a ï.. prefix is appended to the
first column header name. For example, if the column name is measures it
will become ï..measures.

Solution: Add the fileEncoding argument when using the read.csv
function.
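
A minimal sketch of the fix (the file path here is illustrative):

```r
# Declaring the encoding strips the UTF-8 byte-order mark, so the first
# header is read as "measures" rather than "ï..measures".
df <- read.csv("assets/examples/odm/measure-singledate-odmlong.csv",
               fileEncoding = "UTF-8-BOM")
```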
yulric left a comment

The document itself looks pretty good! Just some comments on the code and organization.

  • I recommend using a library to handle pretty-printing CSV files. It's a pretty common piece of functionality and there are established libraries that have been doing it for some time. For example, DT.
  • I recommend using renv to track all your dependencies in your project. It's like the requirements.txt file in Python (see the sketch after this list).
  • We try to put all asset files (like CSV files, images, etc.) in a folder called assets.
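
A minimal sketch of the renv workflow being suggested (these are renv's own functions):

```r
# One-time setup: create the project library and lockfile.
renv::init()
# After adding or updating packages, record the exact versions in renv.lock.
renv::snapshot()
# On another machine, restore the recorded versions.
renv::restore()
```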

```yaml
group_by: [ date, compartment ]
```

## Usage: Grouping

I think it's worthwhile to provide a quick grouping example here, one that's used in the other operations.

@@ -0,0 +1,59 @@
#' This file contains functions for converting various R types to

An annoying thing about R is that you're technically not allowed to have sub-directories within the R folder. I would move this file to the root of the R folder.

# pipe characters, 2) add optional word breaks to strings (<wbr>),
# 3) concatenate strings in row with pipe separator.
row <- row %>%
sapply(function(x) str_replace_all(x, "\\|", "\\\\|")) %>%

When using a function from a library, can you call it using the library name? So this would be stringr::str_replace_all().
With R it's hard to know whether a function is from base R or from a library, so we always use the above syntax to call a function from a library.
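
Applied to the snippet above, the suggested style would look like:

```r
# Same pipeline, with library functions called through their namespace.
row <- row %>%
  sapply(function(x) stringr::str_replace_all(x, "\\|", "\\\\|")) %>%
  sapply(function(x) stringr::str_replace_all(x, "_", "_<wbr>"))
```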

# 3) concatenate strings in row with pipe separator.
row <- row %>%
sapply(function(x) str_replace_all(x, "\\|", "\\\\|")) %>%
sapply(function(x) str_replace_all(x, "_", "_<wbr>")) %>%

Why are you replacing _ with _<wbr> here?

martinwellman (Collaborator, Author)

I updated this PR with all the requested changes. Let me know if everything looks good.

I replace _ with _<wbr> for the headers in a markdown table to allow optional wrapping of the headers at each underscore. Since the ODM wide format can have very long names, this helps the tables take up less space horizontally and makes them a bit easier to read. I've commented the code to explain this.

mathew-thomson (Collaborator) left a comment

I think this looks great! A fantastic breakdown, and it really lays things out clearly. I think we may need to talk about ID generation when we move from wide to long, but otherwise I think this is in fantastic shape.


## Grouping

No grouping is required for wide-to-long, as we process each row in isolation (i.e. row-by-row).
Collaborator

We may (forgive my ignorance here) need to do some grouping, however, in the sense that each wide row probably is a "measure set" and so would need a generated "measure set ID" to be common across all the new long rows.

|---------------|-------------|----------|-------------------|---------------|-------------------|-----------------|-------|-------|
| Sept 18, 2023 | water | sample | liquid | SARS-CoV-2-N1 | gene copies per L | arithmetic mean | 1 | 40 |
| Sept 18, 2023 | water | sample | liquid | PMMoV | gene copies per L | arithmetic mean | 1 | 45 |
| Sept 18, 2023 | water | sample | liquid | pH | unitless | arithmetic mean | 1 | 6.1 |
Collaborator

As alluded to in my other comment, I think we may also need an operation for generating IDs for some of the new rows as we move from wide to long format. Measure sets, as I mention above, are one example. But quality IDs would be another example. I can try to generate an exhaustive list, if that would be helpful.

yulric left a comment

Minor comment, otherwise looks good.

R/doc_mdtables.R (Outdated)
df <- df[rows, ]

# Add optional word breaks after underscores, allowing headers to wrap
# and take up less room horizontally.

Suggested change:
- # and take up less room horizontally.
+ # and take up less room horizontally. Effectively, if a table gets too long this will result in the addition of a horizontal scrollbar instead of trying to squish all the columns. Squishing all the columns resulted in word-wrapping occurring that would look weird.

```txt
(water, sample, liquid, SARS-CoV-2-N1, gene copies per L, arithmetic mean, 1)
```

We can apply some string maps or custom transforms to clean these values:


I'd like to suggest describing this as a separate operation. I'd plan for each operation to be an individual function. These operations could be grouped into a method. So, pivoting wider and longer could be a 'method' (a collection of operations), but it seems more like an operation.

DougManuel left a comment

The PR looks good.

The pivot wider and longer works well. These are common concepts and practices with analysts, and the description works well from that analyst’s perspective.

Two late-coming comments:

  1. I suggest splitting the pivot wide into more operations. Specifically, this could be a recode or similar:

     We can apply some string maps, or custom transforms to clean these values:

  2. We should be clear or explicit that variable mapping involves getting from or putting into specific tables. The table information is all available in the ODM, but it is worth reviewing.


# Operation: Pivot Wider

Pivoting wider converts long format rows to wide format columns. It is based on R's [pivot_wider()](https://tidyr.tidyverse.org/reference/pivot_wider.html) function in the tidyverse and is the inverse of the [Pivot Longer](#operation-pivot-longer) operation. When pivoting wider we have (in the source table) name columns and value columns. The name columns determine the wide format column name, while the value columns determine the new column's value. Often there is just one value column, but more than one can be provided, in which case we may transform the multiple values into a single value (e.g. we can concatenate the values '24' and '12' to get the single value '24.12').
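
For intuition, a minimal tidyr sketch of the same idea (the data and column-naming scheme are illustrative, not the spec's format):

```r
library(tidyr)

long <- data.frame(
  reportDate  = "2023-09-18",
  measure     = c("covN1", "pmmov"),
  unit        = "gcPerL",
  aggregation = "meanNr",
  value       = c(40, 45)
)

# The name columns (measure, unit, aggregation) build each wide column
# name; the value column supplies that new column's value.
wide <- pivot_wider(
  long,
  names_from  = c(measure, unit, aggregation),
  values_from = value,
  names_sep   = "_"
)
# wide: reportDate | covN1_gcPerL_meanNr | pmmov_gcPerL_meanNr
```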
Collaborator

Is the aggregation operation something the user can set? E.g., maybe a user wants the values to be concatenated into a string, while others might want to take the arithmetic or geometric mean of the values

```{r, echo=FALSE}
mdtable_csv_file("../assets/examples/odm/measure-singledate-odmlong.csv", rows = 1)
```

Applying this same rule to the remaining two wide-name columns in the example, we get two more rows:
Collaborator

Depending on how the value aggregation is handled in long-to-wide, the number of resulting rows from this wide-to-long operation could change. If the long-to-wide yields a list of values, each item would be mapped to its own row, whereas if the values were aggregated, it'll result in one row where there originally might have been several. Maybe that should be noted somehow. For measure reports, the value of the aggregation column could be changed to reflect what happened, or a note could be appended to the notes field.

jeandavidt (Collaborator) left a comment

It all looks good.
The only questions I have revolve around how many "long" rows for the same type of measurement are collapsed into a single "wide" value. Something to discuss next meeting :)

martinwellman (Collaborator, Author)

Updated the specs and ready for re-review. See specs/mapping-configs.qmd and specs/mapping-operations.qmd. In mapping-configs.qmd I'd like to rethink the sections "Creating New Output Rows" and "Multiple Input and Output Rows" to make it simpler, suggestions are welcome.

yulric left a comment

The main comment is about the organization of the documents. Right now the specification of the mapping config is in two locations, mapping-configs.qmd and mapping-operations.qmd. I think we should move all of that information into the mapping-configs.qmd file and turn the mapping-operations.qmd file into a high-level document that goes over the different features the implementation should support, showing real-world use cases/examples for why each feature is needed. This will decouple the implementation specifications from the high-level specifications, so that if we decide, for example, to use the linkml transformer schema instead, the high-level specifications do not need to change.


# Quarto documentation
specs/*_files/
specs/*.html
README.html

I recommend creating a Quarto project file. It has the following advantages:

  1. You get one single website that has all your documentation files. You don't have to run different render commands, one for each Quarto file.
  2. You can specify an output directory where the entire website is built and stored. This way you have only one entry in your .gitignore for the website rather than multiple.


This document describes the various operations required for mapping between different database formats, such as between ODM long format and ODM wide format (and vice versa), and PHA4GE to ODM. Each section describes the operation with examples, and provides a list of database conversions that require the operation.
This section lists all operations allowable in a mapping configuration file. For details on the mapping configuration file see [Mapping Configuration Files](mapping-configs.qmd). With these operations, the source or input tables only contain the rows for the current group being processed. If the `group_by` key is not specified for the `main_inputs` table, then each source/input table processed by the operations will only have one row.

I'm not sure what you mean by the second sentence. Do you mind expanding or giving an example?

Collaborator

I think I'm in the same boat as Yulric here - also unclear as to what is meant by "With these operations, the source or input tables only contain the rows for the current group being processed."


Maybe an example?

```{r, echo=FALSE}
mdtable_csv_file("../assets/examples/odm/measure-multidate-odmlong.csv")
```
The `copy` operation will copy one or more columns from the source table to the target table, optionally format the value(s), and optionally cast it to a specified type. The `method` key specifies which source rows are used for copying. For a method of `use_source_rows`, if an intermediate output table is already created, it will use the [source row numbers](mapping-behavior.qmd#assigning-source-table-and-source-row-numbers) assigned to each of the output table rows. If an intermediate output table is not already created, it will copy all the rows from the current source table. For a method of `exact`, all source rows are copied in order, populating the output table from top to bottom and creating new rows if required.

Is there a use case for the exact method?

```yaml
operation: copy
operation_config:
  target_column: mr_reportDate
  target_value: "The report date is {reportDate}"
```

Will there be a situation where they will need to specify the table within the curly braces? For example, if two tables have the same column?

Collaborator

Is the source_table key what determines what table the column is coming from?


For example, using the following long table group, resulting from [grouping](#operation-grouping) by date:
If an exact match occurs (e.g. the column `mr_reportDate` matching the regular expression `mr_reportDate`) then the column is placed in the position of that match. Otherwise, if no exact match occurs, the position of a wildcard match is used. If multiple columns match a wildcard then those columns are sorted alphabetically for consistency. If no match occurs then the column is placed at the end. For example, if a table has the columns `mr_reportDate`, `mr_specimen`, `mr_aggregation`, and `sm_sampleMat`, and the following configuration is used:

Not sure what you mean by "the position of a wildcard match is used"?


We can apply some string maps or custom transforms to clean these values:
From the input rows (of the input table specified by `source_table`) `pivot_longer` finds any column that fully matches the regular expression specified in `match_column_values`. The regular expression must match the full column name, not just a part of it. It then takes the captures in the regular expression and assigns them to the columns (in order) specified in `target_columns`. In the `target_columns` field a value of `NULL` can be used if that capture should be ignored. It also takes the value in the input table found in the matched column and assigns it to the column specified in `target_value_column`. Using the input row below as an example:
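
For comparison, tidyr expresses the same capture-to-column mapping via names_pattern; a sketch with an illustrative column-naming scheme:

```r
library(tidyr)

wide <- data.frame(
  reportDate          = "2023-09-18",
  covN1_gcPerL_meanNr = 40,
  pmmov_gcPerL_meanNr = 45
)

# The anchored regex must match the full column name; each capture is
# assigned, in order, to a names_to column, and the matched cell's value
# goes to the values_to column.
long <- pivot_longer(
  wide,
  cols          = -reportDate,
  names_pattern = "^([^_]+)_([^_]+)_([^_]+)$",
  names_to      = c("measure", "unit", "aggregation"),
  values_to     = "value"
)
```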

Are we comfortable that all our needs can be met with a regular expression?

```{r, echo=FALSE}
mdtable_csv_file("../assets/examples/odm/measure-singledate-odmwide.csv")
```
The `method` field specifies how the output table is populated. If it is `new_rows` then each pivot will result in a new row. If it is `existing_rows` then no new rows are created (with one exception, see below), instead the values are pivoted into the existing rows (possibly overwriting values in those rows). With `existing_rows`, the rows used for pivoting are the [source rows](mapping-behavior.qmd#assigning-source-table-and-source-row-numbers) assigned to each existing output row. The one exception for `existing_rows` is if the output table is currently empty, in which case pivoting longer behaves as `new_rows` until a single pivot is complete.

Is there a use case for existing_rows?


Could you add an example for how this would work?


- ODM long format to ODM wide format
The `pivot_wider` operation pivots individual input rows into individual output columns. It is the inverse of the `pivot_longer` operation. When mapping a single input row to a column we take the values in various columns of the input row to form a new column name (specified by `target_column`). We then set the target value to the value found in the `source_column` column.
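
tidyr's names_glue is a close analogue of the {} templating in `target_column`; a hedged sketch with illustrative data:

```r
library(tidyr)

long <- data.frame(
  measure = c("covN1", "pmmov"),
  unit    = "gcPerL",
  value   = c(40, 45)
)

# The glue template plays the role of target_column: values from the name
# columns are substituted into the {} placeholders to build each wide name.
wide <- pivot_wider(
  long,
  names_from  = c(measure, unit),
  values_from = value,
  names_glue  = "{measure}_{unit}"
)
# Columns: covN1_gcPerL, pmmov_gcPerL
```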

Add text about the difference between text in the target_column that is surrounded by {} and text that is not?


# Operation: Pivot Longer
The `method` key specifies how new or existing rows are created or populated in the output. `stack` fills the column in the output table from top to bottom, starting with the first empty row, and adding new rows if required. `stack_no_new_rows` also fills the column in the output table from top to bottom, starting with the first empty row, but if there are more values to stack than available existing output rows we end and do not create new rows. `existing_rows` will only populate existing rows, starting from top to bottom, and will also use the [source rows](mapping-behavior.qmd#assigning-source-table-and-source-row-numbers) attached to each output row as the source of data.
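
A rough R sketch of one reading of the `stack` method (the helper and its semantics are an illustration, not the spec's implementation):

```r
# Fill target_column from the first empty cell downward, growing the table
# with new rows when values remain. `stack_no_new_rows` would instead stop
# at the last existing row rather than growing the table.
stack_values <- function(df, target_column, values) {
  col <- if (target_column %in% names(df)) df[[target_column]] else rep(NA, nrow(df))
  start <- match(NA, col, nomatch = nrow(df) + 1)  # first empty cell
  end <- start + length(values) - 1
  if (end > nrow(df)) df[(nrow(df) + 1):end, ] <- NA  # add new rows
  df[start:end, target_column] <- values
  df
}
```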

Could you give an example for each method using the same input table?

```txt
(water, sample, liquid, SARS-CoV-2-N1, gene copies per L, arithmetic mean, 1)
```
The `set` operation assigns predefined values to columns in the output rows. If the output currently has more than one row, the row index to copy the value to can be set with the `target_index` key, which is a 0-based index. `target_index` can also be an array of integers to specify multiple row indices to set. If `target_index` is not specified, or `all` is specified, then all rows are set. The default value is `all`.
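
A minimal sketch of these semantics (the function name is illustrative; note the spec's 0-based indices versus R's 1-based rows):

```r
set_value <- function(df, target_column, value, target_index = "all") {
  rows <- if (identical(target_index, "all")) seq_len(nrow(df))
          else target_index + 1  # spec indices are 0-based
  df[rows, target_column] <- value
  df
}

# e.g. set organizationID on the first and third output rows:
# df <- set_value(df, "organizationID", "org-1", target_index = c(0, 2))
```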

I wonder if something other than target_index would be better, since people would not really know what the target index would be. Like, set the organizationId column values to 1 for rows with sampleId equal to 2.

Collaborator

I agree that having some ability to match on filtering rules would be helpful

mathew-thomson (Collaborator) left a comment

Looking good! Curious to see responses to Yulric's comment. Some of the documentation is a little over my head, unfortunately, but that's maybe to be expected.

@@ -1,2 +1,6 @@
# PHES-ODM-Map
Transform data from various database formats to ODM.

Transform data from various database formats to ODM, and ODM wide format to/from ODM long format. We are currently developing the specs for this project, please view the documentation below:
Collaborator

May wish to specify that eventually the hope is to be able to move to and from the ODM.

Collaborator

Love these diagram files - super clean, clear and helpful.

Collaborator

Really like these example images - really clean and clear. Nice work!

Collaborator

Really detailed summary, well explained, and makes a lot of sense. Covers a lot of bases.



Collaborator

Some of the jargon/terms in this document flew over my head a bit, but from what I was able to follow I think it's looking quite good. I'm interested to see the responses to Yulric's questions/comments as well.

jeandavidt (Collaborator) left a comment

Great work Martin! The documentation reads really well and the figures are very clear.

To avoid having to enter many separate `copy` operations, multiple `copy` operations can be specified using arrays for `target_column`, `source_table`, and `source_column`. The example below copies `samples["siteID"]` to the output's `sm_siteID` column, `samples["saMaterial"]` to the output's `sm_sampleMat` column, and `contacts["contactID"]` to the output's `co_contactID` column:

```yaml
operation: copy
Collaborator

Could one set derived values (like in the example below) here as well?

```yaml
operation: copy
operation_config:
  target_column: mr_reportDate
  target_value: "The report date is {reportDate}"
```



```yaml
metadata:
  title: ODM 2.0 Long to Wide
```

I expect we will eventually want more metadata, such as `description:` and maybe `map_from` and `map_to`.

The authoring approach for YAML (in Quarto and other similar approaches) is:

```yaml
author:
  - name: "Author Name"
    affiliation: "Author Affiliation"
    orcid: "Author ORCID"
    email: "Author Email"
    url: "Author URL"
```


![Figure 1: Example construction of output table based on input table](assets/example_table_construction.png){fig-align="center"}

`main_inputs` is the main input table for the current output table. We will iterate over the rows of this table one at a time. For each input row, we will apply all mapping operations one at a time to generate the new row(s) for the output table. When proceeding to each subsequent mapping operation, we carry forward the resulting output rows from the previous operation. This will construct the new row(s) iteratively, growing the row(s) as we proceed. Once all operations have been applied, we save the output rows and proceed to the next input row and repeat the process to generate other new rows.
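
A rough sketch of the loop being described (apply_operation() is a hypothetical stand-in for the operation dispatch, not a function in this package):

```r
# For each main_inputs row: run every operation in order, carrying the
# growing output rows forward, then bank them and move to the next row.
map_table <- function(main_inputs, operations) {
  results <- list()
  for (i in seq_len(nrow(main_inputs))) {
    input_row <- main_inputs[i, , drop = FALSE]
    out_rows <- data.frame()          # grows as each operation is applied
    for (op in operations) {
      out_rows <- apply_operation(op, input_row, out_rows)
    }
    results[[i]] <- out_rows          # save the finished rows
  }
  do.call(rbind, results)
}
```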


I wondered if we just want `inputs`, rather than `main_inputs` and `other_inputs`. I think you are right that folks need to start somewhere.

```{r, echo=FALSE, eval=TRUE, file="../R/doc_mdtables.R"}
```

# Introduction


It may help to have a high-level figure:

Configuration --> Define tables --> Operations --> Post-operations --> Save output.

The process kinda reminded me of PMML steps (PMML = Predictive Modelling Mark-up Language)

DougManuel left a comment

The document looks comprehensive. I have a feeling that there are additional operations, but I can't think of any.

The folks over at linkML have been active in the last few weeks on the transformer: https://github.com/linkml/linkml-transformer

There are links to a meeting last spring, which has a presentation and a good review of various mapping libraries. I see they are using the pydantic library: https://pydantic.dev
