Unit tests #68

Closed

Conversation


@sebastianludwig sebastianludwig commented Feb 15, 2021

The goal of this PR is to establish an integration test baseline to build upon in the future. Avoiding regressions is the foundation for confidently moving forward with future development. Since the whole tool can be thought of as a function with input parameters, deterministic behaviour, and a result, it makes sense to assert correctness at this level first, before caring about the internals. As long as this works, the rest doesn't matter too much.

The rough plan is:

  1. Have an artificial, curated, fake Google Takeout in a folder (input), containing examples of all cases (including weird edge cases).
  2. Let GPTH process the folder (output).
  3. Ensure that the generated output equals a static reference (reference).
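
To make step 3 concrete, here's a rough sketch of what the comparison could look like (the folder names, flags, and the main(argv) entry point are placeholders, not the final layout):

import filecmp
import unittest

from __main__ import main  # hypothetical import - depends on the final module layout


class IntegrationTest(unittest.TestCase):
    def test_output_matches_reference(self):
        # steps 1 + 2: run GPTH over the curated fake takeout
        main(['--input', 'tests/fixtures/input', '--output', 'tests/fixtures/output'])

        # step 3: compare the generated output against the static reference
        # (dircmp only checks the top level and compares shallowly by default;
        # a real version would recurse and compare file contents)
        comparison = filecmp.dircmp('tests/fixtures/output', 'tests/fixtures/reference')
        self.assertEqual(comparison.left_only, [])   # no unexpected files
        self.assertEqual(comparison.right_only, [])  # no missing files
        self.assertEqual(comparison.diff_files, [])  # no files with differing content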

Opening this PR as an early draft to discuss the approach.

The next steps required to start reaping the benefits of this are:

  • Resolve the TODOs (actually implement the comparison with the reference)
  • Add a variety of input data (the broader the spectrum, the better).

A future step would be to integrate this into some kind of CI, but even without that it's already very helpful, as a quick test run on a local checkout is easy.

@TheLastGimbus Regarding the second point: you mentioned that you already collected test data from other users. Since sending files to you privately is still different from having them permanently stored in a public repo, I also added a script that removes the actual image data from the files while preserving the EXIF metadata. It would be great if you could clean and provide what you have collected; that would probably already make a very solid test data set. The files I added so far only demonstrate the approach.
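
Roughly, the cleaning script boils down to something like this (a sketch using piexif and Pillow, not the exact code in this PR):

import piexif
from PIL import Image


def strip_image_data(path):
    """Replace a JPEG's pixel data with a tiny placeholder while keeping its EXIF."""
    exif_dict = piexif.load(path)
    exif_dict['thumbnail'] = None  # the embedded thumbnail would leak the picture too
    exif_dict['1st'] = {}          # ...as would its IFD
    exif_bytes = piexif.dump(exif_dict)
    placeholder = Image.new('RGB', (8, 8), color=(128, 128, 128))
    placeholder.save(path, 'jpeg', exif=exif_bytes)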

@sebastianludwig sebastianludwig mentioned this pull request Feb 15, 2021
@sebastianludwig sebastianludwig marked this pull request as draft February 16, 2021 06:39
@TheLastGimbus TheLastGimbus added the tests Testing this pile-of-spaggheti code label Feb 16, 2021
@TheLastGimbus
Owner

So I see you added some images directly to the repo...

My approach was to put them on a public Google Drive (because it has faster download speeds, I hope) and download them on the go in GitHub Actions (since their VMs have very high internet speeds) - so we don't force everyone who just wants to git clone this repo to download 50 images and videos

Alternatively, we could put them in a separate repo if GDrive didn't suit us...

That's why I put @Akianonymus's script for downloading stuff from Google Drive in scripts/...

remove the actual image data from the files while preserving the EXIF metadata

Having seen as many edge cases as I have, I would prefer to keep the actual "test images" as close as possible to what you get directly from Google Takeout...

you could clean and provide what you have collected

Here they are:

./gdl.bash https://drive.google.com/drive/folders/1UEgWofLWzaNGACavUlvy3-h3Qx1oledE?usp=sharing

@sebastianludwig
Author

My approach was to put them on a public Google Drive (because it has faster download speeds, I hope) and download them on the go in GitHub Actions (since their VMs have very high internet speeds) - so we don't force everyone who just wants to git clone this repo to download 50 images and videos
Alternatively, we could put them in a separate repo if GDrive didn't suit us...

Mmhh.. I see your point. I haven't thought much about videos yet 🤔

I directly added the photos to create a strong link between test data and code. If a new, unhandled edge case is found, the code that fixes it should be added together with the new test data in the same PR. We wouldn't want to add it to the test set before the fix is merged, because that would break the CI for the current implementation. We also don't want to add it after the fix, because we want the CI to tell us that the PR indeed fixes the new test case. We could of course reference different files on Google Drive, like sample-takeout.zip, sample-takeout-v2.zip, sample-takeout-v3.zip, but then we would just have re-invented version control (poorly).

Another downside of Google Drive I see (besides the 1.5k lines of Bash code which would need to be maintained - BTW, that's 3x the Python code 😅) is access rights: on the one hand you want to be in control of the test data set, and on the other hand you don't want to be the bottleneck for adding new data. I don't think Google's rights management is up to the task.

TL;DR: Git > GDrive IMO

I quickly looked into LFS, but that seems too costly.

I like your suggestion of a separate Git repo 👍 We could add it as a Git submodule and get the benefit of tight coupling without the downside of every user needing to download large amounts of data. Fellow contributing developers probably know how to handle Git submodules, and we could also offer convenience scripts to avoid some of the common pitfalls.

If you agree, do you mind creating a repo?

Having seen as many edge cases as I have, I would prefer to keep the actual "test images" as close as possible to what you get directly from Google Takeout...

Can you give an example of where the result depended on the actual image data?

@sebastianludwig sebastianludwig marked this pull request as ready for review February 21, 2021 17:28
@sebastianludwig
Author

I went ahead with the described approach. If you agree, we should either transfer https://github.com/sebastianludwig/GooglePhotosTakeoutHelperTestData to you, @TheLastGimbus, or simply re-create that repo and update the reference.

@TheLastGimbus
Owner

Whoa... those are some big changes...

an example of where the result depended on the actual image data?

I can't specifically pinpoint one, but I'm too afraid to try to mess with them 😬

I can't promise any timeline to merge this - I will need to learn some of these myself

@@ -0,0 +1,3 @@
[submodule "tests/integration/fixtures"]
Owner

Hey, you look like a guy who knows submodules (I don't 😆) - could you show me how to add the GDrive scripts as a submodule? And how to use them after that?

Author

I found this to be a decent introduction: https://www.atlassian.com/git/tutorials/git-submodule

Git submodules sound awesome when you first learn about them, but be cautioned: they have a few rough edges, and most of the time there's a better choice. For example, having the GDrive scripts directly in this repo doesn't hurt, I think. Or a completely separate repo that's only linked in the README, if you want to decouple them from this repo.

Owner

Yeah, last time I tried, they were very, very problematic (because of course I didn't read any tutorials 💯)

Huh - could we actually keep the test repo out of submodules too, and just tell people to git clone it? We could just add a tag there, so it would be nicely versioned when the CI/tests change

Author

Yes, we could. But I don't think we'd gain anything. The test code by itself is very small and won't impact the repo size noticeably.

On the other hand, having tests and implementation split over two repos, and thus two PRs, makes reviewing a lot harder. Often a PR changes both (implementation changed, tests added), and not being able to discuss both in the same place is going to be cumbersome and annoying.

Owner

No no - by test repo I mean the repo with the test images

@@ -753,4 +751,4 @@ def _walk_with_tqdm(res, bar: _tqdm):


if __name__ == '__main__':
main()
main(sys.argv[1:])
Owner

Why is that?

Author

To be able to call main from the tests and pass "command line args" as parameters.
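
For example (the flag names are just illustrative):

# a test can now drive the tool in-process and get real stack traces:
main(['--input', 'fixtures/input', '--output', 'fixtures/output'])

# while the command line behaviour stays the same:
#   python3 __main__.py --input ... --output ...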

Comment on lines +83 to +92
['0th', piexif.ImageIFD.DateTime],
['Exif', piexif.ExifIFD.DateTimeOriginal],
['Exif', piexif.ExifIFD.DateTimeDigitized],
['GPS', piexif.GPSIFD.GPSVersionID],
['GPS', piexif.GPSIFD.GPSLatitude],
['GPS', piexif.GPSIFD.GPSLatitudeRef],
['GPS', piexif.GPSIFD.GPSLongitude],
['GPS', piexif.GPSIFD.GPSLongitudeRef],
['GPS', piexif.GPSIFD.GPSAltitude],
['GPS', piexif.GPSIFD.GPSAltitudeRef]
Owner

I see you added a lot of piexif stuff - I'm worried about this, because I wanted to start transitioning to another library - this one is pretty dead - and now I have more stuff making me procrastinate on that decision, and more stuff that can break :(

Author

Do you already have one in mind? If you haven't decided yet, I don't think that should block the merge of this PR. On the contrary, it makes sense to have a testing rig in place when swapping out integral parts to be sure everything keeps working. Also the unit tests aren't coupled to the implementation: we can replace the EXIF lib in the implementation and keep the tests on piexif for a while and see how it goes. Or the other way round.

Also, I've contained all the piexif stuff in this one method - I could also move the import down here if you'd like that to be more explicit.
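
For reference, the comparison boils down to something like this (a simplified sketch, not the exact code in the PR):

import piexif

CHECKED_TAGS = [
    ('0th', piexif.ImageIFD.DateTime),
    ('Exif', piexif.ExifIFD.DateTimeOriginal),
    ('GPS', piexif.GPSIFD.GPSLatitude),
    # ...and the remaining tags from the list above
]


def assert_same_exif(actual_path, reference_path):
    actual = piexif.load(actual_path)
    reference = piexif.load(reference_path)
    for ifd, tag in CHECKED_TAGS:
        assert actual[ifd].get(tag) == reference[ifd].get(tag), \
            '%s/%s differs in %s' % (ifd, tag, actual_path)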

Owner

exif looks nicer and is actually maintained

Author

How about the following plan:

  1. Get the test setup in place
  2. Add a couple of more tests to increase the confidence
  3. Switch the tests over to exif - in a separate PR, so the change is isolated and it can easily be judged whether the switch broke anything
  4. After gaining some experience with exif, switch over the implementation, or try another lib in the tests

?

@sebastianludwig
Author

I can't specifically pinpoint one, but I'm too afraid to try to mess with them 😬

I hear you. What would you need to be comfortable with the cleaning approach? More theoretical reasoning about why I'm convinced this won't cause issues, more practical proof like the steps described in contributing.md, anything else? I'm happy to elaborate.

I can't promise any timeline to merge this - I will need to learn some of these myself

No worries, feel free to ask about anything.

@TheLastGimbus
Owner

What would you need to be comfortable with the cleaning approach?

I just don't think it's necessary. We just need a few pictures to test whether dates are set right, whether .jpgs and .tiffs work the same, etc. - I'm very comfortable sharing a couple of mine, and I'm sure people who had some edge cases won't have a big problem either - I've already got some (also quite a few through email)

TL;DR I just really don't think it's necessary

@sebastianludwig
Author

Ah, yes, the deduplication. I completely hadn't thought about that. Yes, the image contents do make a difference there.

Mmhh.. I still think it makes sense to test and handle the different aspects of the tool differently, to 1) give privacy-conscious people a way to comfortably share their test data and 2) not waste space in the test data repo unnecessarily.

So, suggestion: use cleaned images for EXIF data tests, use full images for deduplication tests.

I mean, with the process outlined in contributing.md we will find out whether the cleaning causes any problems (the second point of step 7 would fail). I don't think there's anything to lose, only something to gain.

I'm sure people who had some edge cases won't have a big problem either

You just met the first one ;-)

@TheLastGimbus
Owner

TheLastGimbus commented Mar 10, 2021

Okay, I thought a bit about this, and here is my vision/manifesto for it:

This script is generally one big dump of code. It's not modular or easy to divide into pieces - that's why, when testing it, I think we should just run it like we run it from the cmd - that way it won't require modifying the __main__.py code itself

The Actions workflow will look like this:

  • Download test dataset
  • Run python3 __main__.py --options
  • Save output in GH artifacts
  • Run other tests, which each check individual properties of the output files

For the properties themselves... we can make a JSON file (or a couple of them) containing a list of stuff to assert. It would look something like this:

dates_assert.json:

{
  "photos": {
    "IMG_2137.jpg": {   // name of a file that should be present in the output folder
      "file_modified": 124523432,   // Unix epoch of the file modified date
      "exif_date": 124523432    // time inside the EXIF properties
    },
    "IMG_42069.png": {
      "file_modified": 124523432
      // png - no EXIF properties here
    }
  }
}

location_assert.json:

{
  "photos": {
    "IMG_2137.jpg": {
      "lat": 32.534543,
      "lng": 20.543523
    }
  }
}

etc...

Then just loop through it in a Python test and we're good - that way we can easily add new edge-case photos and their expected values
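
For example, roughly like this (a sketch - assuming the real files are plain JSON without the explanatory comments above):

import json
import os

with open('dates_assert.json') as f:
    expected = json.load(f)

for name, props in expected['photos'].items():
    path = os.path.join('output', name)
    assert os.path.exists(path), name + ' missing from output'
    if 'file_modified' in props:
        assert int(os.path.getmtime(path)) == props['file_modified'], \
            name + ': wrong file modified date'
    # 'exif_date' would be checked the same way with an EXIF library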

What do you think, @sebastianludwig?

As for the test data itself, yeah, I created a test data repo, and let's just keep everything there - just don't upload any 2 GB videos there

@sebastianludwig
Author

That's definitely a valid approach. Before I form an opinion, can you share your thoughts on why you think this is the way forward and the benefits you see compared to the approach proposed in this PR?

@TheLastGimbus
Owner

What you did so far:

  • Launch the main() function from test code - since we won't be going any deeper than main() (no testing of individual functions, because they are bloated and global as hell), this is pretty much the same thing as launching the script from bash, and I just prefer to leave main() untouched - just a stylistic reason, nothing serious
  • Some assertFileDates(), assertExifData() functions - those can stay, they will just be launched/used differently, and maybe chopped into smaller pieces

TL;DR this will be pretty similar, just less Pythonic and with one more abstraction layer (that is, the asserts.json files)

@TheLastGimbus
Owner

Also, here is what I want to have as a result:

When someone makes a PR, the tests launch, and I have a couple of green/red marks on GitHub - like this:
[screenshot: green check marks next to each CI job on a PR]
As I understand it, to achieve this I need to run a couple of different "jobs", which all run in separate VMs - I can't do it all in a single VM, right?

If so, I would make separate files for all the different tests - or just have some way to run them individually

Does using unittest benefit us a lot here? Or can we just use assert?

@sebastianludwig
Author

this is pretty much the same thing as launching the script from bash, and I just prefer to leave main() untouched - just a stylistic reason, nothing serious

Not completely. In case of failures we wouldn't get stack traces, only console logs. Also, debugging wouldn't be possible, because the helper would be running in a different process. I'm not convinced stylistic reasons are worth these tangible drawbacks.

with one more abstraction layer

Abstractions always come with a cost in complexity, understandability, and maintainability. The JSON files, for example, would need to be maintained separately, and separate parsing logic would be necessary. Also, they would only check the aspects explicitly encoded into the JSON files instead of everything we could check against the reference images, effectively reducing the code/case coverage. If you see this explicitness as a benefit, I think it'd still be better to write the asserts directly into the unit tests instead of JSON files: if we have to write the expected values by hand anyway, why introduce additional indirection and probably sacrifice error explicitness (if multiple images fail their assertions, all the error messages originate from the same assert statement)?
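
For illustration, writing the expectation directly as a (hypothetical) unit test would look like this, and a failure would point straight at the file and property:

import os
import unittest


class DatesTest(unittest.TestCase):
    def test_img_2137_modified_date(self):
        # one named assertion per expectation - the failing test names the file directly
        self.assertEqual(int(os.path.getmtime('output/IMG_2137.jpg')), 124523432)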

The CI integration and the fancy green check marks are icing on the cake and possible with either setup, so the UI shouldn't drive your decision about the technical foundation.

Leaving the "this is how it could work as well" level, I'm still curious what you think why this different approach is better. Like what problems of the proposed solution does it solve, which downsides does it improve on?

@TheLastGimbus
Owner

in case of failures we wouldn't get stack traces, only console logs

But we've got those super sweet logs from loguru! Look how sick it looks: #63 (comment)

Here is my backstory - I've never written any tests before, in any language. I want to learn it myself from the beginning, so I went with the purest and simplest kind - just assert statements 👌 I created my separate PR: #80 - hope you don't mind, but as I said, I just wanted to learn it myself 😅 - you still helped me decide how to structure it ❤️

I'm still curious why you think this different approach is better

I really have no good reason. I'm just experimenting, and I don't know if any of these ideas are good. Maybe the JSONs will be a pain, but I want to try it - this repo is a nice sandbox where I can play with a shovel and rakes

@sebastianludwig
Author

Yeah, no worries. Glad I could help. Good luck with this project 🚀
