-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use of .R files in the data directory #165
Comments
@connectedblue I'm not quite following the problem. The convention of ProjectTemplate is that only variable names that match the file name will be checked prior to processing the file in the In my opinion keeping the global namespace clean is the responsibility of the user. If the user wishes to create multiple global variables during the data step this is their choice. Sometimes the intended behavior is to create multiple global variables, as is the case when an excel workbook with multiple sheets is loaded. This is why I'm not sure I'm following the issue! Perhaps I'm missing the larger picture here. In your attached example (thanks for that btw!) what did you originally have? How did your original solution create an unexpected behavior within ProjectTemplate? What was the expected behavior compared to what happened? That might help me better understand the larger context of your comment :) |
Hi @KentonWhite, I agree the user has to be responsible - I just also wanted to be lazy and avoid having to wrap unwanted variables in a function :) As you say, multiple variables are also created in excel files, so that is a smaller point and perhaps a red herring. However, the larger point is the unnecessary processing that occurs when the variable name is not the same. In my example, I originally had the variable called What was happening is that So, I think the bigger picture is that users who have slow loading data in So, in a round about way, I'm saying that when users are using The reason I would make this a convention is to document it so that users don't have to go through the same thinking to arrive at the most efficient conclusion. We can add examples on the website to show how The excel case is different though. Usually the user has been given a file from another source with multiple tabs and they all have to be loaded from the same file. Here it is also possible that none of the tab names have the same name as the file, so it will be unnecessarily loaded every time, even though the underlying data has already been cached. Excel loads can also take quite a long time. What I was going to suggest for the excel case was if, after loading the file there is no global env variable with the same name as the file, then have the reader create one with a value of Sorry for the long reply. Does that make more sense? I'm really keen to make sure that users run By the way, I don't know if you listen to the Not So Standard Deviations podcast, but @hilaryparker made a throw away comment in the last episode that she only runs |
@connectedblue Ok I see. just to make sure I understand, the proposal is to make it a convention that there can only be one global variable in a |
Yes, @KentonWhite , that's what I was thinking. The Some appropriate messaging during data load would help guide the user if the convention is not followed, and the man page would outline the convention. I can knock up a prototype to test the concept further and see if it makes sense. |
I've run into an issue on one of my projects which @KentonWhite highlighted when #160 was being developed to auto cache data during load.project()
This file demonstrates the issues, and how I worked around them:
https://github.com/connectedblue/capstone/blob/master/capstone/data/corpus.R
In this project, I am trying to create a test data set from the original data during my exploratory analysis (I use a custom config variable flag to toggle between the full and testing set to save rewriting the data acquisition code).
The way
.R
files are treated in thedata
directory results in the following issues:corpus_dir
,docs
anddoc
are side effects used to create the intended variablecorpus
. I wrapped all the processing into a function so that these variables did not find their way into the global environment..R
file, the data file will always be processed in thedata
directory, This duplicates processing unnecessarily when the variables are already in memory or the cache. I got around this by making the generated variable the same name as the file, and it is skipped duringload.project()
as a result.These same issues would also occur in other data loaders such as excel and sql where the datafile can generate multiple global environment variables.
I have a proposed solution for the
.R
problems above, but I want to get the community's view before submitting a PR.Essentially I propose using convention over configuration, and enforce a rule that
.R
files in thedata
directory can only produce one global variable with the same name as the file. Any other global variables would be deleted which means that side effect variables like mine above don't have to be wrapped in a function to avoid polluting the global environment.We'd have to put some migration logic in to warn the user about any legacy code that would need to be looked at.
I think it's good practice anyway to have one variable per
.R
file in thedata
directory, and enforcing it brings more benefits to users when they runload.project()
making it faster to load by relying on memory and disk caching.The change would be quite straightforward - I'm happy to do it, but I don't want to get distracted from my project so it'll be a couple of weeks before I raise a PR. In the meantime, I'd like to get feedback on the approach to this issue.
The text was updated successfully, but these errors were encountered: