Samples update 2018 #199

Evildoor · 2019-02-04T14:58:43Z

The data changes over time, so the current samples, which were taken from the 2016 data, can be missing something. Samples from the 2018 data should be added to represent the data properly, but the system must work on both sets.

Select a time interval to gather the 2018 data. It should yield 100-200 lines in stage 009's output.
Add samples for stage 009.
Add samples for stage 091.
Add samples for stage 025.
Add samples for stage 016.
Add samples for stage 093.
Add samples for stage 095.
Add samples for stage 019.
Rename all 2016 samples.

Note - data gathering intervals were saved in Utils/Dataflow/009_oracleConnector/README.

mgolosova

A few comments before I check the samples themselves; please take a look.

Can you also try to resolve the conflicts with master? There are two ways to do it:

git fetch; git rebase origin/master
<...solve conflicts here...>
git push --force

or

git fetch; git merge origin/master
<...solve conflicts here...>
git push

The first will "hide" the conflicts, so no one will be able to see and check how they were solved (well, except that GitHub recently started keeping the history of force pushes, which makes it possible to see the difference between two versions); the second will save this information in the merge commit.

Both look fine to me; merge is definitely the more "right" way, yet rebase is "less noisy". So I don't know which one I would prefer in the same situation )

Utils/Dataflow/091_datasetsRucio/README

Utils/Dataflow/025_chicagoES/README

Utils/Dataflow/016_task2es/README

Utils/Dataflow/019_esFormat/README

Evildoor · 2019-03-18T11:05:41Z

Resolved the conflicts.

mgolosova · 2019-03-26T12:50:23Z

`e9424f3`

This commit is a bit confusing, as it mixes up "what was" (which requires one more fix) and "what became".
My suggestions:

join it with the two following ones, making things "as they should be after all the changes" (1 commit);
join it with each of the following two: "DF/016: ", "DF/019: ..." (2 commits);
first fix the broken links (making them point to exactly the same files as before), next update the links (making them point to the proper stage's output), and then add 2018 (separately for 016 and 019) (4 commits).

Actually, a good rule is: when you want to write something like "do(ne) this and do(ne) that" in the commit log (here: "update" and "add"), stop for a moment and think. If you can't clearly explain how these two changes are related to each other, it is better to separate them and make two individual commits.

An example of "related" changes is: "Data processing stage changed: now it produces records with additional field 'X' (change 'A'). So 'X' is added to the mapping (change 'B') and output samples are updated (change 'C')." If one does not perform 'B' or 'C' at all, things become inconsistent: the previous version of the output data example simply becomes incorrect, and a field missing from the mapping makes the general behaviour unpredictable and may lead to conflicts in the future (someone who wants to know what data are available in the final storage will check the mapping and won't be aware of the field 'X'; someone who wants to make changes to another stage may try to add their own version of the 'X' field, and that will lead to a conflict). These three changes may go separately, as three commits (within a single PR), if the author wishes, yet it is fine to commit three-in-one, as they are clearly related to each other.
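
The consistency between the mapping (change 'B') and the samples (change 'C') can be checked mechanically. The following is only a minimal sketch, assuming the samples are NDJSON (one JSON object per line) and the mapping can be reduced to a set of field names; the function name and the data are hypothetical and not taken from the repository.

```python
import json


def unmapped_fields(sample_lines, mapping_fields):
    """Return the sorted field names used in the sample but absent from the mapping."""
    seen = set()
    for line in sample_lines:
        if line.strip():  # tolerate blank lines
            seen.update(json.loads(line).keys())
    return sorted(seen - set(mapping_fields))


# Hypothetical data: the stage now emits field "X" (change 'A') and the
# sample was updated (change 'C'), but the mapping was not (change 'B').
sample = ['{"taskid": 1, "X": 42}']
mapping = {"taskid"}
print(unmapped_fields(sample, mapping))
```

An empty result would mean that every field present in the sample is also known to the mapping.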

Evildoor · 2019-03-27T11:05:35Z

You are absolutely right about e9424f3. I tried to properly update the existing commits with the symbolic link changes, but something got in the way, so I put them into a separate commit to deal with later, if necessary. Well, that "later" is now. The commit was scrapped and its changes were incorporated into earlier commits:

8626cf0: link to 025's 2018 sample is created (it was used as input).
d43ffb0: links to 016's 2018 sample are created (it was used as input).
6a2dca6: links are updated due to renaming.

This way, the symbolic links should properly represent the situation with the samples at all points of time.

While doing so I had to rewrite history, and now numerous commits merged from master are displayed separately. Is this ok, or should I redo this PR in a separate branch?

mgolosova · 2019-03-27T12:20:13Z

Doesn't look very good, you're right. Here's what happened.
Before the merge, you had a commit history like this:

--- A --- B --- C (master)
     \--- D --- E (samples-2018)

After merge:

--- A --- B --- C ----- \ (master)
     \--- D --- E --- (merge) --- F --- G (samples-2018)

After rebase:

--- A --- B --- C (master)
     \--- D --- E --- B' --- C' --- F' --- G' (samples-2018)

In other words: copies of the commits from master were added to samples-2018, instead of keeping the information about the merge.

What you can do:

create a new branch in the same state as samples-2018 before the rebase (git branch samples-2018-copy ea8d513, where ea8d513 is the tip of the branch before the rebase);
take a look at the commit history (git log --oneline --graph) and find the merge there;
perform the rebase you did, but this time with the -p (--preserve-merges) option, or, even better, start the rebase from the commit right after the merge: git rebase -i 523dec48. The first will try to keep the merge commit as a merge, while the second won't even touch it;
after the rebase, check the commit history (git log --oneline --graph) and make sure the merge is still there and everything is fine;
make sure there is no difference between samples-2018 and samples-2018-copy (git diff samples-2018 samples-2018-copy);
reset samples-2018 to the state of the new branch (git reset --hard samples-2018-copy); this will replace the commit history of samples-2018 with the one you have prepared in ...-copy;
push the new version here (git push --force).

The new branch is created just to have an easy way to step back and start over; the same can be done right in samples-2018, just use origin/samples-2018 in the diff for comparison.

Functionally, it makes no difference if you close this PR and open another one from the new branch; the only difference is that this way you can keep the name of this PR's source branch while replacing its content.

mgolosova · 2019-03-27T12:46:22Z

Samples update: add 2018 data.

Evildoor · 2019-03-28T07:55:06Z

Aha, I see. Thank you very much for the detailed explanation.

As ATLAS' development progresses, the data can change. Some types of new information, such as `hs06sec` for Oracle / Chicago ES, can be absent from the 2016 sample. Therefore, new 2018 samples are added to properly illustrate the state of the data, but the 2016 ones are preserved because the system must work correctly regardless of the data gathering time.
Update the stage's README to describe the samples.
Conflicts: Utils/Dataflow/009_oracleConnector/README


Evildoor · 2019-04-03T12:48:09Z

@mgolosova
Unfortunately, the simplest and most logical option,

> or, even better, start the rebase from the commit right after the merge: git rebase -i 523dec4

was not available, because at least one commit before the merge had to be changed.

I got tired of hauling this merge around in this PR, so I started a new branch from the current master, cherry-picked the commits into it, and reset this branch to the new one (diff said that they were identical).

Please, take a look.

mgolosova

@Evildoor, I still cannot approve the PR.

The main reason is that the 2018 output sample of stage 025 does not contain the toths06* metadata.
Maybe there was no such information in Chicago ES when you created the PR, but now it is definitely there. So I think I should ask you to update the sample; without these fields it simply does not serve its purpose...
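
A check of this kind can be scripted. The sketch below is only an illustration, assuming the sample is an NDJSON file (one JSON object per line) and that the fields in question share the `toths06` prefix; the function name, the concrete field name `toths06_finished` and the example records are hypothetical.

```python
import json


def records_missing_prefix(lines, prefix="toths06"):
    """Return the taskids of records that have no field starting with `prefix`.

    `lines` is any iterable of NDJSON lines (one JSON object per line).
    """
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if not any(key.startswith(prefix) for key in record):
            missing.append(record.get("taskid"))
    return missing


# Hypothetical records, made up for illustration only.
sample = [
    '{"taskid": 1, "toths06_finished": 10}',
    '{"taskid": 2, "status": "done"}',
]
print(records_missing_prefix(sample))
```

If every record ends up in the returned list, the sample most likely predates the appearance of these fields in the source.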

Also, I just noticed that the stage 025 input directory is still linked to 009's output (the 091 output/input_* files are expected). There are two options:

simply link 025_*/input to 091_*/output, as in 093.
Then the READMEs of both the 025 and 093 stages should specify which input files were used;
create a directory with two links to the 091_*/output/input_* files.
Then 093's input should be updated accordingly (to contain only the 091_*/output/output_* files).

And, talking about links to the output directory versus individual files: 016's input is a directory with two links, while it could be a single link to 025's output. I see this is because originally there was a link to a specific output file of 091, but now it can be changed. The point is that links to directories simplify further sample management: if one needs to rename samples, only the output samples have to be renamed, not input and output together. It won't work for all stages, of course, but still...
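
The difference between the two kinds of links can be reproduced in a scratch directory. This is a minimal sketch on a POSIX file system; the directory and file names (`stage_a`, `sample.ndjson`, `sample2016.ndjson` and so on) are hypothetical and do not reflect the repository layout.

```python
import os
import tempfile


def demo_links(root):
    """Compare a directory link with a per-file link after a sample rename."""
    out_dir = os.path.join(root, "stage_a", "output")
    os.makedirs(out_dir)
    sample = os.path.join(out_dir, "sample.ndjson")
    with open(sample, "w") as f:
        f.write("{}\n")

    # Option 1: the next stage's input is a single link to the output directory.
    dir_link = os.path.join(root, "stage_b_input")
    os.symlink(out_dir, dir_link)

    # Option 2: the input is a real directory holding a link to each file.
    files_dir = os.path.join(root, "stage_c_input")
    os.makedirs(files_dir)
    file_link = os.path.join(files_dir, "sample.ndjson")
    os.symlink(sample, file_link)

    # Rename the output sample only, without touching any link.
    os.rename(sample, os.path.join(out_dir, "sample2016.ndjson"))

    return {
        "dir_link_listing": sorted(os.listdir(dir_link)),
        "file_link_valid": os.path.exists(file_link),  # follows the link
    }


with tempfile.TemporaryDirectory() as tmp:
    result = demo_links(tmp)
print(result)
```

After the rename, the directory link still lists the renamed sample, while the per-file link is left dangling and has to be recreated by hand.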

There's an artificial line in 009's 2016 output sample:

{"taskid": 13319225, "start_time" : "27-02-2018 16:56:38", "end_time" : "07-03-2018 14:24:46", "status": "done"}

It was added in #191 to have at least one record with the new metadata fields. Now it can (and should) be removed, since the 2018 samples were added for exactly this purpose.

It seems this record was propagated from there only to 025's output, which was updated in this PR and no longer contains this line, so this is the only sample to fix.

Everything else seems to be fine, thank you!


- Redirect stage 025's input to stage 091's output as it should be.
- Specify which of the 091's samples were used in stage 025's README. This is important since stage 091 produces two types of files.
- Do the same in stage 093's README for consistency.

Evildoor · 2019-04-08T13:35:13Z

> The main reason is that the 2018 output sample of stage 025 does not contain the toths06* metadata.
> Maybe there was no such information in Chicago ES when you created the PR, but now it is definitely there. So I think I should ask you to update the sample; without these fields it simply does not serve its purpose...

Done.

> Also, I just noticed that the stage 025 input directory is still linked to 009's output (the 091 output/input_* files are expected). There are two options:
>
> simply link 025_*/input to 091_*/output, as in 093.
> Then the READMEs of both the 025 and 093 stages should specify which input files were used;
>
> create a directory with two links to the 091_*/output/input_* files.
> Then 093's input should be updated accordingly (to contain only the 091_*/output/output_* files).

Missed that, thank you. Done, first option, because...

> And, talking about links to the output directory versus individual files: 016's input is a directory with two links, while it could be a single link to 025's output. I see this is because originally there was a link to a specific output file of 091, but now it can be changed. The point is that links to directories simplify further sample management: if one needs to rename samples, only the output samples have to be renamed, not input and output together. It won't work for all stages, of course, but still...

... I agree with you: links to directories are easier to work with.

> There's an artificial line in 009's 2016 output sample:
> {"taskid": 13319225, "start_time" : "27-02-2018 16:56:38", "end_time" : "07-03-2018 14:24:46", "status": "done"}
> It was added in #191 to have at least one record with the new metadata fields. Now it can (and should) be removed, since the 2018 samples were added for exactly this purpose.
>
> It seems this record was propagated from there only to 025's output, which was updated in this PR and no longer contains this line, so this is the only sample to fix.

Removed it.

mgolosova · 2019-04-17T08:29:25Z

@Evildoor, thank you for the update.

You forgot to update the samples of the stages "after" 025 according to the changes in 09fadf1; everything else is fine now.


Evildoor · 2019-04-18T09:08:36Z

> You forgot to update the samples of the stages "after" 025 according to the changes in 09fadf1

My bad, fixed.

mgolosova · 2019-04-19T08:44:35Z

@Evildoor, thank you! Approved and merged now.

Evildoor self-assigned this Feb 4, 2019

Evildoor force-pushed the samples-2018 branch from 3bac0bd to 6ca050c Compare February 6, 2019 13:46

Evildoor force-pushed the samples-2018 branch from e3a7e27 to f32a8e4 Compare February 18, 2019 12:37

Evildoor requested a review from mgolosova March 14, 2019 10:02

mgolosova reviewed Mar 15, 2019

mgolosova mentioned this pull request Mar 15, 2019

Task Chain #201

Merged

5 tasks

Evildoor force-pushed the samples-2018 branch from 795131f to e9424f3 Compare March 20, 2019 08:05

Evildoor changed the title ~~[WIP] Samples update 2018~~ Samples update 2018 Mar 20, 2019

This was referenced Mar 20, 2019

Stage 091 readme #235

Merged

Stage 095 readme #237

Merged

Evildoor force-pushed the samples-2018 branch from ea8d513 to 5b08c58 Compare March 27, 2019 10:36

Evildoor added 12 commits April 3, 2019 14:21

Add 2018 sample for stage 091.

e5b70e5

Update the stage's README to describe the samples.

Add 2018 sample for stage 025.

18ea2b2

Update the stage's README to describe the samples.

Add 2018 sample for stage 016.

5d35844

Update the stage's README to describe the samples. Add a symbolic link to the input used for creating the sample.

Add 2018 sample for stage 093.

27260de

Update the stage's README to describe the samples.

Add 2018 sample for stage 095.

e71889f

Update the stage's README to describe the samples.

Add 2018 sample for stage 019.

db01599

Update the stage's README to describe the samples. Add symbolic links to the input files used to produce the sample.

Rename 2016 samples from *.ndjson to *2016.ndjson.

30c82d6

Update all READMEs to reflect the change. Update symbolic links broken by the change.

Explain the contents of the paired samples.

65b2528

Update 2016 sample for stage 025.

d76053f

Current sample was obtained by processing 009's output, while the stage should be working with 091's output. Replace the sample with correct one.

Update 2016 sample for stage 016.

fc607af

Current sample was obtained by processing 091's output, while the stage should be working with 025's output. Replace the sample with correct one.

Update 2016 task sample for stage 019.

2f3a93c

Current sample was obtained with stage 025 skipped. Update it properly.

Evildoor force-pushed the samples-2018 branch from 6684de1 to 2f3a93c Compare April 3, 2019 12:33

mgolosova requested changes Apr 8, 2019

Evildoor added 2 commits April 8, 2019 14:02

Remove artificial line from a sample.

3e41c43

The line was added by hand for sake of having at least one record with new metadata fields in the sample. Now such records can be obtained from 2018 samples, so the line is no longer needed.

Evildoor force-pushed the samples-2018 branch from 4e82fb6 to 767c7f0 Compare April 8, 2019 13:30

Evildoor added 2 commits April 18, 2019 11:05

Update stage 025's 2018 sample.

b7730af

- Re-produce the sample to add missing toths06* fields.
- Update the corresponding samples of stages 016 and 019 that follow the 025.

Replace links to files with a link to directory.

6f6866f

Directory links are easier to maintain.

Evildoor force-pushed the samples-2018 branch from 767c7f0 to 6f6866f Compare April 18, 2019 09:06

Evildoor mentioned this pull request Apr 18, 2019

Oracle-ES consistency #240

Merged

mgolosova approved these changes Apr 19, 2019

mgolosova merged commit 9bc49fd into master Apr 19, 2019

mgolosova deleted the samples-2018 branch April 19, 2019 08:43
