Get multimedia categories from multimedia.txt #61

Closed
peterdesmet opened this issue Feb 11, 2015 · 6 comments

Comments

@peterdesmet
Member

Description

For a given dataset, I want to know how many records have associated media and whether those media have any issues. This is an enhancement over #43: it also counts media that have no type (and there are lots of those).

Output

datasetKey
multimedia_not_provided
multimedia_url_invalid
multimedia_valid

Terms we need

issues (occurrence.txt)
gbifID (multimedia.txt)

Process

Search for related media.

FOR EACH occurrence
    IF issues CONTAINS ( MULTIMEDIA_URI_INVALID )
        THEN category = "multimedia_url_invalid" // This issue takes priority over valid multimedia
    ELSE search for related multimedia in multimedia.txt
        IF found
            THEN category = "multimedia_valid"
        ELSE // no related media found
            category = "multimedia_not_provided"
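
The process above could be sketched in Python roughly as follows. This is only an illustration, not the project's implementation: the function name `categorize`, the `media_ids` set (the gbifIDs that appear in multimedia.txt) and the assumption that `issues` is a semicolon-separated string are all hypothetical.

```python
def categorize(issues, gbif_id, media_ids, sep=";"):
    """Return the multimedia category for one occurrence record.

    issues    -- raw value of the issues field (assumed `sep`-separated)
    gbif_id   -- gbifID of the occurrence
    media_ids -- set of gbifIDs that have at least one row in multimedia.txt
    """
    flags = {flag.strip() for flag in issues.split(sep) if flag.strip()}
    # The invalid-URL issue takes priority over valid multimedia.
    if "MULTIMEDIA_URI_INVALID" in flags:
        return "multimedia_url_invalid"
    # Otherwise, any related row in multimedia.txt counts as valid media.
    if gbif_id in media_ids:
        return "multimedia_valid"
    return "multimedia_not_provided"
```

Because the invalid-URL check comes first, an occurrence that has both the issue and a row in multimedia.txt is still counted as `multimedia_url_invalid`.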
@peterdesmet peterdesmet added this to the Media type milestone Feb 11, 2015
@peterdesmet
Member Author

@niconoe, some notes:

  1. I'm ignoring multiple media for the same occurrence, because I want to assess the number of occurrences with media, not the number of linked media. I'm selecting the type of the first media (a drawback). My loop assumes the rows are ordered by gbifID so that duplicates can be skipped. That holds for my example file, but it is not guaranteed.
  2. In the process described above, we're not populating media_not_provided: for that we need to look into occurrence.txt as well. Currently, we assume that the remainder (total occurrences minus media_video, media_audio, media_image and media_unknown) is media_not_provided, but that is somewhat flaky.
  3. In the process described above, we're not populating media_url_invalid: for that we need to look into occurrence.txt as well. This is quite a big drawback, as the linked images might all be broken.

If you have a solution for looking into occurrence.txt and multimedia.txt together, we can solve 2 and 3.
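
On note 1, one way to drop the ordering assumption is to collect the gbifIDs from multimedia.txt into a set, so duplicates collapse whatever the row order. A minimal sketch, assuming multimedia.txt is tab-delimited with a header row containing `gbifID`; the helper name `media_gbif_ids` and the sample rows are made up for illustration.

```python
import csv
import io


def media_gbif_ids(lines):
    """Return the set of gbifIDs with at least one multimedia row.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    # A set ignores duplicates, so the rows do not need to be sorted.
    return {row["gbifID"] for row in reader}


# Made-up sample: occurrence 10 has two media, and the rows are not ordered.
sample = io.StringIO("gbifID\ttype\n10\tStillImage\n12\t\n10\tSound\n")
ids = media_gbif_ids(sample)
```

Here `ids` contains each occurrence once, including the one whose media has no type.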

@niconoe
Member

niconoe commented Feb 13, 2015

@peterdesmet: there are definitely possible improvements over issue #43, since I have tools to look into occurrence.txt and multimedia.txt in parallel. But they currently work by starting from the occurrence and then reaching the attached extensions, so your algorithm is not really applicable as-is...

So to be honest, I'm a bit lost in the implementation discussion...
Could you clarify the "what and why"? I'll take care of the "how". More specifically:

  • Compared to the existing implementation, what should change in the output data? The format? Only richer content because we inspect multimedia.txt? Or both?
  • What advantage does that provide over the existing implementation?
  • What's the priority of this ticket compared to the others?

Generally speaking, I think we can become even better if we think more in terms of interface/implementation separation (the black box analogy)! Best

@peterdesmet
Member Author

Ok, it's actually better that you look in parallel. I'll update the issue.

@peterdesmet
Member Author

Issue updated. Let me know if it makes sense.

@peterdesmet
Member Author

@niconoe, I discussed this with @bartaelterman. The precedence of one media type over another is a bit weird, so we decided to simplify to three categories:

  • Has valid media
  • URL invalid
  • Has no media

Media types are something we should tackle per media item, not per occurrence. We'll do this in another issue, and it is probably beyond the scope of this POC.
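
With only three categories, the per-dataset output reduces to a simple tally. A rough sketch, assuming each occurrence has already been assigned one of the three category names from the Output list; `summarize` and `CATEGORIES` are hypothetical names, not part of the existing code.

```python
from collections import Counter

# The three output columns listed in the issue description.
CATEGORIES = (
    "multimedia_not_provided",
    "multimedia_url_invalid",
    "multimedia_valid",
)


def summarize(dataset_key, categories):
    """Build one output row: the datasetKey plus a count per category."""
    counts = Counter(categories)
    row = {"datasetKey": dataset_key}
    for name in CATEGORIES:
        # Counter returns 0 for missing keys, so every column is present.
        row[name] = counts[name]
    return row
```

Categories that never occur still appear in the row with a count of 0, which keeps the output shape the same for every dataset.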

@peterdesmet peterdesmet changed the title Get media type categories from multimedia.txt Get multimedia from multimedia.txt Feb 16, 2015
@peterdesmet peterdesmet changed the title Get multimedia from multimedia.txt Get multimedia categories from multimedia.txt Feb 16, 2015
peterdesmet added a commit that referenced this issue Feb 16, 2015
@bartaelterman
Member

As discussed with @peterdesmet, I documented an extraction procedure in #63 that combines the backend requirements of this issue with those of #60.
