Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samples update 2018 #199

Merged
merged 16 commits into from
Apr 19, 2019
Merged

Samples update 2018 #199

merged 16 commits into from
Apr 19, 2019

Conversation

Evildoor
Copy link
Contributor

@Evildoor Evildoor commented Feb 4, 2019

The data changes overtime, which means that something can be missing in the current samples that were taken from the 2016's data. Samples from the 2018's data should be added to properly represent the data, but the system must work on both sets.

  • Select a time interval to gather the 2018 data. Should be 100-200 lines in stage 009's output.
  • Add samples for stage 009.
  • Add samples for stage 091.
  • Add samples for stage 025.
  • Add samples for stage 016.
  • Add samples for stage 093.
  • Add samples for stage 095.
  • Add samples for stage 019.
  • Rename all 2016 samples.

Note - data gathering intervals were saved in Utils/Dataflow/009_oracleConnector/README.

Copy link
Collaborator

@mgolosova mgolosova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments before I checked the samples themselves, please take a look.

Can you also try and resolve the conflicts with master? There are two ways to do:

git fetch; git rebase origin/master
<...solve conflicts here...>
git push --force

or

git fetch; git merge origin/master
<...solve conflicts here...>
git push

The first will "hide" conflicts, so that no one will be able to see and check how they were solved (well, except that recently GitHub started to keep the story of push-forces, allowing to see the difference between two versions); the second will save this information within the merge commit.

Both looks fine for me; merge is definitely more "right" way, yet rebase is "less noisy". So I don`t know which way would I prefer in same situation )

Utils/Dataflow/091_datasetsRucio/README Show resolved Hide resolved
Utils/Dataflow/025_chicagoES/README Outdated Show resolved Hide resolved
Utils/Dataflow/016_task2es/README Outdated Show resolved Hide resolved
Utils/Dataflow/019_esFormat/README Outdated Show resolved Hide resolved
Utils/Dataflow/019_esFormat/README Show resolved Hide resolved
@mgolosova mgolosova mentioned this pull request Mar 15, 2019
5 tasks
@Evildoor
Copy link
Contributor Author

Resolved the conflicts.

@Evildoor Evildoor changed the title [WIP] Samples update 2018 Samples update 2018 Mar 20, 2019
This was referenced Mar 20, 2019
@mgolosova
Copy link
Collaborator

e9424f3

This commit is a bit confusing, as it mixes up "what was" (that requires one more fix) and "what became".
My suggestions:

  • join it with the two following ones: like making things "as they should be after all the changes" (1 commit);
  • join it with each of the following two: "DF/016: ", "DF/019: ..." (2 commits);
  • first fix broken links (making them pointing exactly to the same file as before), next update links (making them pointing to the proper stage's output), and then add 2018 (separately for 016 and 019) (4 commits).

Actualy, the good rule is: when you want to write in commit log something like "do(ne) this and do(ne) that" (here: "update" and "add"), stop for a moment and think. If you can`t clearly explain, how this two changes are related to each other, it is better to separate them and make two individual commits.

An example of "related" changes is: "Data processing stage changed: now it produces records with additional field 'X' (change 'A'). So 'X' is added to the mapping (change 'B') and output samples are updated (change 'C')." If one does not perform 'B' or 'C' at all, things become inconsistent: previous version of the output data example simply becomes incorrect, and missed field in the mapping makes general behaviour unpredictable and may lead to conflicts in the futire (one who wants to know what data are available in the final storage, will check the mapping and won`t be aware of the field 'X'; one who wants to make changes to another stage, may try to add his own version of 'X' field, and it will lead to a conflict). This three chages may go separately, as three commits (within a single PR), if the author wishes, yet it is fine to commit three-in-one, as they are clearly related to each other.

@Evildoor
Copy link
Contributor Author

Evildoor commented Mar 27, 2019

You are absolutely right about e9424f3. I tried to properly update the existing commits with symbolic links' changes, but something got in the way, so I put them into a separate commit to deal with later, if necessary. Well, that "later" is now then. The commit was scrapped and its changes were incorporated into earlier commits:

  1. 8626cf0: link to 025's 2018 sample is created (it was used as input).
  2. d43ffb0: links to 016's 2018 sample are created (it was used as input).
  3. 6a2dca6: links are updated due to renaming.

This way, the symbolic links should properly represent the situation with the samples at all points of time.

While doing so I had to rewrite history, and now numerous commits merged from master are displayed separately. Is this ok, or should I redo this PR in a separate branch?

@mgolosova
Copy link
Collaborator

Doesn`t look very good, you`re right. Here`s what happened.
Before merge, you had commit history like this:

--- A --- B --- C (master)
     \--- D --- E (samples-2018)

After merge:

--- A --- B --- C ----- \ (master)
     \--- D --- E --- (merge) --- F --- G (samples-2018)

After rebase:

--- A --- B --- C (master)
     \--- D --- E --- B' --- C' --- F' --- G' (samples-2018)

In other words: to samples-2018 were added copies of commits from master, instead of keeping the information about merge.

What you can do:

  • create new branch, that will be in the same state, as samples-2018 before rebase (git branch samples-2018-copy ea8d513, where ea8d513 -- the tip of the branch before rebase);
  • take a look at the commit history (git log --oneline --graph), find the merge there;
  • perform the rebase you did, but this time with key -p (--preserve-merges) -- or, even better, start the rebasing form the commit right after the merge: git rebase -i 523dec48. First will try to keep the merge commit as a merge, while the second won`t even touch it;
  • after rebase, check the commit history (git log --oneline --graph), make sure the merge is still there and everything is fine;
  • make sure there is no difference between samples-2018 and samples-2018-copy (git diff samples-2018 samples-2018-copy);
  • reset samples-2018 to the state of the new branch (git reset --hard samples-2018-copy) -- this will replace commit history of samples-2018 to the one you have prepared in ...-copy;
  • push the new version here (git push --force).

New branch is created just to have an easy way to step back and start it over, same can be done right in the samples-2018 -- just use origin/samples-2018 in diff for comparison.

Functionally there is no difference if you close this PR and open another one from the new branch: the only difference is that you can keep the name of this PR`s source branch, replacing its content.

@mgolosova
Copy link
Collaborator

@Evildoor
Copy link
Contributor Author

Aha, I see. Thank you very much for the detailed explanation.

As ATLAS' development progresses, the data can change. Some types
of new information, such as `hs06sec` for Oracle / Chicago ES, can
be absent in 2016 sample. Therefore, new 2018 sample are added to
properly illustrate the state of the data, but the 2016 one are
preserved because the system must work correctly regardless of the
data gathering time.

Update the stage's README to describe the samples.

Conflicts:
	Utils/Dataflow/009_oracleConnector/README
Update the stage's README to describe the samples.
Update the stage's README to describe the samples.
Update the stage's README to describe the samples.
Add a symbolic link to the input used for creating the sample.
Update the stage's README to describe the samples.
Update the stage's README to describe the samples.
Update the stage's README to describe the samples.
Add symbolic links to the input files used to produce the sample.
Update all READMEs to reflect the change.
Update symbolic links broken by the change.
Current sample was obtained by processing 009's output, while the stage should
be working with 091's output. Replace the sample with correct one.
Current sample was obtained by processing 091's output, while the stage should
be working with 025's output. Replace the sample with correct one.
Current sample was obtained with stage 025 skipped. Update it properly.
@Evildoor
Copy link
Contributor Author

Evildoor commented Apr 3, 2019

@mgolosova
Unfortunately, the simplest and most logical thing

or, even better, start the rebasing form the commit right after the merge: git rebase -i 523dec4

was not an option due to the fact that at least one commit before the merge had to be changed.

I decided that I got tired of hauling this merge in this PR, so I started new branch from current master, cherry-picked commits into it and reset this branch to a new one (diff said that they were identical).

Please, take a look.

Copy link
Collaborator

@mgolosova mgolosova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Evildoor, still can not approve the PR.

Main reason is that the 2018 output sample of the 025 does not contain toths06* metadata.
Maybe there were no information in Chicago ES when you created the PR, but now it is definitely there. So I think I should ask you to update the sample; without these fields it simply does not serve its purpose...


Also, just noticed that Stage 025 input directory still linked to the 009's output (091 output/input_* files are expected). There are two options:

  • simply link 025_*/input to 091_*/output, as in 093.
    Then in the README (of both 025 and 093 stages) should be specified which input files were used;
  • create a directory with two links to 091_*/output/input_* files.
    Then the 093's input should be updated accordingly (to contain only 091_*/output/output_* files).

And, talking about links to the output directory and individual files: 016's input is a directory with two links, while it can be a single link to the 025's output. I see it is due to the fact that originally there was a link to the 091's specific output file, but now it can be changed. It`s just because links to directories simplify further samples management: if one needs to rename samples, he/she will rename only output samples, not input and output together. Won`t work for all stages, of course, but still...


There`s an artificial line in 009's 2016 output sample:

{"taskid": 13319225, "start_time" : "27-02-2018 16:56:38", "end_time" : "07-03-2018 14:24:46", "status": "done"}

It was added in #191, to have at least one record with new metadata fields. Now it can (and should) be removed, as exactly for this purpose 2018 samples were added.

Seems like this record was to be propagated from there only to the 025's output, which was updated in this PR and does not contain this line any longer, so it is the only sample to fix.


Everything else seems to be fine, thank you!

The line was added by hand for sake of having at least one record with new
metadata fields in the sample. Now such records can be obtained from 2018
samples, so the line is no longer needed.
- Redirect stage 025's input to stage 091's output as it should be.
- Specify which of the 091's samples were used in stage 025's README.
  This is important since stage 091 produces two types of files.
- Do the same in stage 093's README for consistency.
@Evildoor
Copy link
Contributor Author

Evildoor commented Apr 8, 2019

Main reason is that the 2018 output sample of the 025 does not contain toths06* metadata.
Maybe there were no information in Chicago ES when you created the PR, but now it is definitely there. So I think I should ask you to update the sample; without these fields it simply does not serve its purpose...

Done.

Also, just noticed that Stage 025 input directory still linked to the 009's output (091 output/input_* files are expected). There are two options:

  • simply link 025_*/input to 091_*/output, as in 093.
    Then in the README (of both 025 and 093 stages) should be specified which input files were used;
  • create a directory with two links to 091_*/output/input_* files.
    Then the 093's input should be updated accordingly (to contain only 091_*/output/output_* files).

Missed that, thank you. Done, first option, because...

And, talking about links to the output directory and individual files: 016's input is a directory with two links, while it can be a single link to the 025's output. I see it is due to the fact that originally there was a link to the 091's specific output file, but now it can be changed. Its just because links to directories simplify further samples management: if one needs to rename samples, he/she will rename only output samples, not input and output together. Wont work for all stages, of course, but still...

... I agree with you - links to directories are easier to work with.

There`s an artificial line in 009's 2016 output sample:

{"taskid": 13319225, "start_time" : "27-02-2018 16:56:38", "end_time" : "07-03-2018 14:24:46", "status": "done"}

It was added in #191, to have at least one record with new metadata fields. Now it can (and should) be removed, as exactly for this purpose 2018 samples were added.

Seems like this record was to be propagated from there only to the 025's output, which was updated in this PR and does not contain this line any longer, so it is the only sample to fix.

Removed it.

@mgolosova
Copy link
Collaborator

@Evildoor, thank you for the update.

You forget to update samples of stages "after" 025 according to the changes in 09fadf1; everything else is fine now.

- Re-produce the sample to add missing toths06* fields.
- Update the corresponding samples of stages 016 and 019 that follow the 025.
Directory links are easier to maintain.
@Evildoor
Copy link
Contributor Author

You forget to update samples of stages "after" 025 according to the changes in 09fadf1

My bad, fixed.

@Evildoor Evildoor mentioned this pull request Apr 18, 2019
@mgolosova mgolosova merged commit 9bc49fd into master Apr 19, 2019
@mgolosova mgolosova deleted the samples-2018 branch April 19, 2019 08:43
@mgolosova
Copy link
Collaborator

@Evildoor, thank you! Approved and merged now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants