-
Couldn't load subscription status.
- Fork 2
IN-930 Create data flow diagram #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
971e377 to
cc13640
Compare
Why these changes are being introduced: * Provide a visual that encapsulates the flow of data from source to destination. How this addresses that need: * Create a diagram showing data flow for a Carbon run * Add link to Confluence document in the README Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-930
cc13640 to
833d505
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks fantastic, and I actually think this slightly simplified mermaid is even desirable over a diagram that exposes the threads in play.
Perhaps another diagram could be helpful -- if the need or request presents itself -- that really dives into the pipes and threads, but this diagram feels good for a high level, project level README.
Feel free to use or disregard any comments, and consider approved as-is. I think it's clear as-is, and my suggestions are just moving the needle slightly in the direction of conceptual/simple over technical/precise.
README.md
Outdated
| subgraph in-memory[Application In-memory] | ||
| direction TB | ||
| rec-generator([Record generator]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it accurate that this "Record generator" is a python generator from the data warehouse query results? If so, I might suggest renaming to "Query Results Generator" to really drive home that it's looping over the records from the data warehouse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Update made!
README.md
Outdated
| end | ||
| mit-dwrhs -->|Fetch query results | rec-generator | ||
| rec-generator-->|Yielding records one at a time, <br> transform record into normalized XML strings <br> and write to a file| buffered-writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing:
and write to a file
to:
and passes to write buffer
Please pushback though if I'm misunderstanding how it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on saying "write to file" (as in "file-like object") or "write to stream", but I do think "pass to write buffer" is clean and easy to understand.
README.md
Outdated
| mit-dwrhs -->|Fetch query results | rec-generator | ||
| rec-generator-->|Yielding records one at a time, <br> transform record into normalized XML strings <br> and write to a file| buffered-writer | ||
| buffered-writer -->|Stream contents from file to another read end of pipe| buffered-reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To continue simplifying for people new to codebase, what about changing:
Stream contents from file to another read end of pipe
to:
Direct pipe
Acknowledging this may paper over some details, at the end of the day it feels like a reader should understand that the "writer" is "piped" to the "reader" and they don't need to think about it very deeply past that (unless digging into code of course...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this over Slack, and I made changes to reflect our conversation! Please let me know what you think.
7791e75 to
bbb2f9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, approved!
I think diagramming the read/write pipes in Carbon is difficult, depending a lot on what kind of detail you want to show, but I think this diagram gives enough information to a developer to understand where they could look further.
At a very high level, my understanding is this:
data warehouse --> query --> results generator --> normalize XML --> read/write pipe --> FTPS client --> FTPS server
What this flow suggests to me is that each record that is normalized, is immediately moving through some kind of pipe and straight onto the FTP server.
What I think is tricky, is what do you define as the "Buffered Writer" or "Buffered Readers" there?
I think your diagram communicates that flow, and to me, that's the important thing at this README level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
|
Thank you both for your review! |
What does this PR do?
Create a data flow diagram for the Carbon application.
Helpful background context
I went back-and-forth on what type of diagram to use (flowchart, sequence, or state flow), but ultimately, I decided that this flow chart showing where the data comes from and where it ends up were the key questions a data engineer might ask when first working on the application. I couldn't figure out a diagram type--with the existing
mermaidfeatures--that could easily communicate the 'threading' component of the Carbon application, so i opted to just create this diagram for now. However, I'd be happy to get suggestions for future work!A little while ago, I made this diagram on Figma that tries to show the concurrent threads (i.e., green bars are used to indicate which thread is running for a given action), but I couldn't see how to replicate this with

mermaid.How can a reviewer manually see the effects of these changes?
Please see the
README.Includes new or updated dependencies?
NO
Developer
Code Reviewer
(not just this pull request message)