Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tap-github allows inputting buggy data into its output #110

Closed
laurentS opened this issue Apr 21, 2022 · 1 comment · Fixed by #112
Closed

Tap-github allows inputting buggy data into its output #110

laurentS opened this issue Apr 21, 2022 · 1 comment · Fixed by #112

Comments

@laurentS
Copy link
Contributor

It is possible to "corrupt" data coming out of the tap by inputting incorrectly cased values in the config file. While it does not directly affect the tap itself, it can lead to problems downstream (we've just hit this problem).
I am not sure what the correct solution is, so I'm opening this ticket for discussion and awareness.

In config, set "repositories": ["meltanolabs/tap-github"] (lowercase M and L on purpose) and run the tap deselecting the repositories stream, and keeping at least one child stream, issues for instance.

The tap will run fine, since the github api will handle the case inconsistencies in the request and return data containing MeltanoLabs instead of meltanolabs.

But because of these lines https://github.com/MeltanoLabs/tap-github/blob/0851e80578bbb17f69faa4c6b2220d30ed1802c5/tap_github/repository_streams.py#L103-107 the output from the tap will contain org=meltanolabs but full_name=MeltanoLabs/tap-github (note the difference in upper/lowercase chars). If you use the org and repo columns to join records from different streams together, you can end up with inconsistent values which are hard to reconcile.

The obvious solution to this problem is to avoid incorrectly cased input in the config, but this is fairly fragile. Another option would be to change the way the child streams get their org/repo values by parsing the url field for instance (which is received from github and matches the "official" repo name casing), but this seems fairly costly performance-wise.

Are there any best practices around this in other taps?

@aaronsteers
Copy link
Contributor

aaronsteers commented Apr 21, 2022

Hi, @laurentS. Thanks for raising. I definitely think this is worth fixing but I'm not sure as of yet what the best fix would be.

This is happening, I think, because MeltanoLabs/tap-github is both a config value and also a data element. While the source API may not be case aware or case sensitive, we certainly should expect the output data to be.

A less than ideal solution would be to just throw a hard failure when casing doesn't match up. But perhaps more gracious would be to defer/coerce to the datasets' casing rules if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants