Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple images support #889

Merged
merged 35 commits into from
Aug 13, 2023
Merged

Conversation

AudricV
Copy link
Member

@AudricV AudricV commented Aug 9, 2022

  • I carefully read the contribution guidelines and agree to them.
  • I have tested the API against NewPipe.
    No, because it requires to do changes in it which I don't know how to do, at least in the best way.
  • I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.
    No, because I don't know how to do the app support, at least properly, such as implementing options to select an image which fits the resolution of the UI element on which it will be shown (or a highest one if it is already in cache), selecting maximum or minimum resolutions.

This pull request introduces support of multiple images instead of a single image URL.

This will allow control over the image quality used, and also improve it in several places, especially on YouTube.

These changes of course will benefit to all extractor clients, and not only to NewPipe itself: Piped is a good example on which low quality of images can be easily seen.

Click here to hide or show how images of Creative Commons channel on Piped looks and how they would look if best resolutions are used instead
Current look Look if higher images provided are used
Current look Look if higher images provided are used

Concept implementation and code changes details

Data of images are handled by a specific new class, Image, containing the URL to an image, its height and width, if they are known; otherwise, the relevant constants are returned (HEIGHT_UNKNOWN and WIDTH_UNKNOWN).

Contrary to a previous attempt (#268), the real size is so returned instead of a quality level. In my opinion, it is not to the extractor to decide if images are high or small, but to clients, as each client has its own needs.

Images are returned as unmodifiable lists, because you should be not able to modify data extracted by extractors in its objects, i.e. insert a new stream in a stream list, clients who want to do so should use copies of the lists or their elements (note: that's the case in some places and this should be fixed, but that's to be discussed in a separated issue).

The images methods (and attributes) of extractor classes have been updated to reflect the changes:

Click here to hide or show image methods (and attributes) changes of extractor classes
Class/Interface Method(s) removed

(getters return type: String; setters return type: void)
Method(s) added

(getters return type: List<Image>; setters return type: void)
InfoItem - getThumbnailUrl()
- setThumbnailUrl(String)
- getThumbnails()
- setThumbnails(List<Image>)
CommentsInfoItem - getUploaderAvatarUrl()
- setUploaderAvatarUrl(String)
- getUploaderAvatars()
- setUploaderAvatars(List<Image>)
StreamInfoItem - getUploaderAvatarUrl()
- setUploaderAvatarUrl(String)
- getUploaderAvatars()
- setUploaderAvatars(List<Image>)
InfoItemExtractor - getThumbnailUrl() - getThumbnails()
CommentsInfoItemExtractor - getUploaderAvatarUrl() - getUploaderAvatars()
StreamInfoItemExtractor - getUploaderAvatarUrl() - getUploaderAvatars()
ChannelExtractor - getAvatarUrl()
- getBannerUrl()
- getParentChannelAvatarUrl()
- getAvatars()
- getBanners()
- getParentChannelAvatars()
PlaylistExtractor - getUploaderAvatarUrl()
- getThumbnailUrl()
- getBannerUrl()
- getSubChannelAvatarUrl()
- getUploaderAvatars()
- getThumbnails()
- getBanners()
- getSubChannelAvatars()
StreamExtractor - getThumbnailUrl()
- getUploaderAvatarUrl()
- getSubChannelAvatarUrl()
- getThumbnails()
- getUploaderAvatars()
- getSubChannelAvatars()
ChannelInfo - getParentChannelAvatarUrl()
- setParentChannelAvatarUrl(String)
- getAvatarUrl()
- setAvatarUrl(String)
- getBannerUrl()
- setBannerUrl(String)
- getParentChannelAvatars()
- setParentChannelAvatars(List<Image)
- getAvatars()
- setAvatars(List<Image>)
- getBanners()
- setBanners(List<Image>)
PlaylistInfo - getThumbnailUrl()
- setThumbnailUrl(String)
- getBannerUrl()
- setBannerUrl(String)
- getUploaderAvatarUrl()
- setUploaderAvatarUrl(String)
- getSubChannelAvatarUrl()
- setSubChannelAvatarUrl(String)
- getThumbnails()
- setThumbnails(List<Image>)
- getBanners()
- setBanners(List<Image)
- getUploaderAvatars()
- setUploaderAvatars(List<Image>)
- getSubChannelAvatars()
- setSubChannelAvatars(List<Image>)
StreamInfo - getThumbnailUrl()
- setThumbnailUrl(String)
- getUploaderAvatarUrl()
- setUploaderAvatarUrl(String)
- getSubChannelAvatarUrl()
- setSubChannelAvatarUrl(String)
- getThumbnails()
- setThumbnails(List<Image>)
- getUploaderAvatars()
- setUploaderAvatars(List<Image>)
- getSubChannelAvatars()
- setSubChannelAvatars(List<Image>)

Services implementations

Several methods have been added, changed and/or removed, depending of the services. These changes can be found in the commits of this PR.

The implementation of multiple images support differs between each service.

  • YouTube: images are got from thumbnails arrays returned by YouTube, except for mixes when they are extracted (and so when they do not come from a related item or a search result item), for which image and their resolutions are hardcoded:

    • default.jpg: 120x90px;
    • mqdefault.jpg: 320x180px;
    • hqdefault.jpg: 480x360px.
  • SoundCloud: resolutions are always hardcoded:

    Click here to hide or show image resolutions used on SoundCloud
    • for artworks and avatars, image and resolutions returned are:
      • mini: 16x16px;
      • t20x20: 20x20px;
      • small: 32x32px;
      • badge: 47x47px;
      • t50x50: 50x50px;
      • t60x60: 60x60px;
      • t67x67: 67x67px;
      • large: 100x100px (the resolution provided by SoundCloud in its internal (and public?) API);
      • t120x120: 120x120px;
      • t200x200: 200x200px;
      • t240x240: 240x240px;
      • t250x250: 250x250px;
      • t300x300: 300x300px;
      • t500x500: 500x500px.
    • for visuals/user banners, image and resolutions returned are (the resolution provided by SoundCloud in its internal (and public?) API is original):
      • t1240x260: 1240x260px;
      • t2480x520: 2480x520px.
  • PeerTube: resolution of images are only provided, and so known, for banners and avatars. The implementation uses first avatars and banners JSON arrays, as they contain multiple images; avatar and banner objects are also used as a fallback if these arrays are not present or empty (so compatibility with old instances is kept);

  • Bandcamp: as images may be not squares, only image resolutions which preserve aspect ratios are selected (and hardcoded). As a matter of fact, only one dimension of an image is known per image ID:

    Click here to hide or show Bandcamp image IDs and their dimension known
    • 10: 1200px wide;
    • 101: 90px high;
    • 170: 422px high;
    • 171: 646px high;
    • 20: 1024px wide;
    • 200: 420px high;
    • 201: 280px high;
    • 202: 140px high;
    • 204: 360px high;
    • 205: 240px high;
    • 206: 180px high;
    • 207: 120px high;
    • 43: 100px high;
    • 44: 200px high.

    This does not apply on banners, for which only the banner image URL extracted from the website is used, and so its dimensions are not known. See why this has been chosen in the Known limitations/issues section above.

  • MediaCCC: There is only one image for contents. A switch to the higher image resolution has been also made in this PR.

To help with implementations and harcoded image resolutions, which are always suffixes to image URLs and/or paths, a class has been added to manage them: ImageSuffix. An object of this class stores an image suffix string to add and the height and width corresponding to the image itself.

Known limitations/issues

  • Multi-service:

    • original images are not returned, for performance and size purposes;
    • no changes have been also made on Framesets and on StreamSegments: changes required to implement multi-image support on these classes should be made in a separate PR.
  • YouTube:

    • videos which have a high resolution thumbnail are advertised to have a full HD size, even if it should be only HD. This is a YouTube issue;
    • some videos can have a high resolution thumbnail and it can be not returned by YouTube. This is a YouTube issue and also happens with the official API;
    • not all resolutions are returned, depending of the type of the content requested (video, short, channel, playlist, mix, search). This is a YouTube limitation, at least from the internal API: only image resolutions which would fit the best in the desktop website image elements are returned;
  • SoundCloud: if default prefixes are changed by SoundCloud (the ones returned in the image URL provided by the internal API), image extraction would be broken;

  • PeerTube: resolution of thumbnails is not known. Should an issue be opened in the PeerTube repository to request the addition of it?

  • Bandcamp: among the image IDs used for avatars and covers, some of them does not seem to match the same dimension property known on banners, according to the following test made on the banner URL returned for the artist used in BandcampChannelExtractorTest:

    Click here to hide or show results
    • 10: maximum size provided (975x180px);
    • 101: height respected (90px);
    • 170: height not respected and maximum size not provided (750x138px);
    • 171: maximum size provided (975x180px);
    • 20: maximum size provided (975x180px);
    • 200: maximum size provided (975x180px);
    • 201: height not respected and maximum size not provided (750x138px);
    • 202: height not respected and maximum size not provided (375x69px);
    • 204: height not respected and maximum size not provided (960x177px);
    • 205: height not respected and maximum size not provided (640x118px);
    • 206: height not respected and different resolution provided (480x89px);
    • 207: height not respected and different resolution provided (320x59px);
    • 43: height respected (542x100px);
    • 44: maximum size provided (975x180px).

    This should be investigated later and not in this PR.

  • MediaCCC: image sizes are not provided and so not known. Should an issue be opened in the MediaCCC repository to request the addition of it?

Changes on tests

Changes were also required on the tests structure, and the ones made are the following ones:

Click here to hide or show changes on the tests structure
Class/Interface Method(s) removed

(return type: void)
Method(s) added

(return type: void)
BaseChannelExtractorTest - testAvatarUrl()
- testBannerUrl()
- testAvatars()
- testBanners()
BasePlaylistExtractorTest - testThumbnailUrl()
- testBannerUrl()
- testUploaderAvatarUrl()
- testThumbnails()
- testBanners()
- testUploaderAvatars()
BaseStreamExtractorTest - testUploaderAvatarUrl()
- testSubChannelAvatarUrl()
- testThumbnailUrl()
- testUploaderAvatars()
- testSubChannelAvatars()
- testThumbnails()

Image test methods which used URL in their name have been renamed to reflect the multiple images support changes.

A default test on image collections has been added and ensures that:

  • image collections are not null;
  • for each image of the collections, the following conditions are met:
    • its URL is secure;
    • its height and width are greater or equal to the relevant unknown constants.

Each test is responsible to call this default method or use custom ones like it is made on YouTube (using the default test and asserting that each image URL contains the string yt) and Bandcamp (using the default test and asserting that each image URL contains the string f4.bcbits.com/img and ends with .jpg or .png).

Other changes

A few other changes have been also made in this PR:

  • licence headers have been moved to the top of the files modified, where it was applicable and it wasn't already the case;
  • YoutubeMusicSearchExtractor's InfoItems have been moved to external classes, which do not depend of the YouTube ones anymore (it was useless to do so, as the objects of YouTube are not ordered in the same way/are not the same than on YouTube Music) in order to increase readability and perform code changes easier;
  • unneeded public modifiers have been removed from edited test classes;
  • missing @Test annotations have been added on old Junit 4 tests on modified test classes and the corresponding tests have been fixed if needed;
  • single class imports are used instead of wildcard imports in test classes modified when it was not the case;
  • a little bit of code has been improved/refactored in some places of the files changed with this PR.

Closes #649, fixes #763.

Copy link
Member

@B0pol B0pol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiservice:
original images are not returned, for performance and size purposes

I don’t understand why. We provide all resolutions so that the client can choose, so why remove original? If it’s too large the client can decide not to use it.

PeerTube: resolution of thumbnails is not known. Should an issue be opened in the PeerTube repository to request the addition of it?

I think yes

Contrary to a previous attempt (#268), the real size is so returned instead of a quality level. In my opinion, it is not to the extractor to decide if images are high or small, but to clients, as each client has its own needs.

But how will the client decide if both width and height quality are unknown?
Taking example with PeerTube, previewThumbnail is high quality, thumbnailPath is low quality, we return both as unknown, how is the client supposed to know which one to choose?

While if we have LOW, MEDIUM, HIGH (maybe LOWEST, HIGHEST too), we can set previewPath to HIGH and thumbnailPath to low and the client will know even if there is both HEIGHT_UNKNOWN and WIDTH_UNKNOWN

IMO we should both return actual width/height and estimated quality by extractor.
So that client can choose with size if they want, but when it’s unknown they can use what extractor tells (or even always).
Maybe we should also sort by Lists by image quality, either from lowest to highest or the other way.

@AudricV
Copy link
Member Author

AudricV commented Aug 17, 2022

Maybe we should also sort by Lists by image quality, either from lowest to highest or the other way.

I asked about this in the IRC channel and here is the conversation I had with Stypox (Matrix link):

Me:

How do you think a list of images should be sorted?
I want your advice because I am implementing support of extracting multiple images provided by services in the extractor.
For now, I'm using images' height because it seems to be the most dimension available, even if it can be unknown in some cases (such as in some Bandcamp image formats which are based on the width).
For reference, here is the code I use in Java to do the sorting of a list:
The Image class has three properties: URL, height and width
imageList.sort(Comparator.comparingInt(Image::getHeight));

Stypox:

I don't think they should not be sorted at all
If somebody wants to sort them, they can do that on their own, we don't need to do unneeded computation
Just like video streams are not sorted

Me:

Ok, then I will only make image lists unmodifiable
But this will be needed in clients such as NewPipe

Stypox:

Mmmh, I don't think so, in NewPipe we just need a filter that chooses the best fitting thumbnail according to a size criteria (or something like that), which doesn't require any kind of sorting (on a side note, sorting would run in O(nlogn), while without sorting it is O(n))

Copy link
Member

@TobiGr TobiGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full review is going to come in the next days. The code I have read so far, looks good 👍

I agree with the comments made by @B0pol

Just a note: please annotate the new methods which are going to be accessible for library users. I did not highlight every usage, just a few.

Copy link
Member

@Stypox Stypox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code structure looks well done, thank you! I reviewed the code but I did not check whether the URLs you extract actually point somewhere, but I trust you and (in part) the tests for that. I also didn't check if there are more thumbnails that can be extracted, I also trust you here (though even if there were, it wouldn't be a big problem). I have just a few comments + B0pol&TobiGr's comments need to be solved, too.

}
if (streamInfo.getAudioStreams() == null) {
streamInfo.setAudioStreams(Collections.emptyList());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you removing this?

Copy link
Member Author

@AudricV AudricV Feb 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am removing this because the default values are now set in the corresponding attributes.

@ghost
Copy link

ghost commented Jan 22, 2023

is the developement still active on this one ?

@AudricV
Copy link
Member Author

AudricV commented Feb 4, 2023

Yes, it is.

@Qaz-7

This comment was marked as spam.

@ghost
Copy link

ghost commented Feb 25, 2023

What's stopping this from being merged? If for any reason there is some unwanted drawback to implementing this, any kind of response can help clear the water.

…s and image URLs

These new public and static methods, added in SoundcloudParsingHelper,
getAllImagesFromArtworkOrAvatarUrl(String) and
getAllImagesFromVisualUrl(String) (which call a common private method,
getAllImagesFromImageUrlReturned(String, List<ImageSuffix>, List<Image>)),
return an unmodifiable list of JPEG images containing almost every image
resolution provided by SoundCloud except the original size and the tiny
resolution (for artworks and avatars, as the image size is 20x20 for artworks
and 18x18 for avatars, so very close to or equal to the t20x20 resolution):

- for artworks and avatars:
  - mini: 16x16;
  - t20x20: 20x20;
  - small: 32x32;
  - badge: 47x47;
  - t50x50: 50x50;
  - t60x60: 60x60;
  - t67x67: 67x67;
  - large: 100x100;
  - t120x120: 120x120;
  - t200x200: 200x200;
  - t240x240: 240x240;
  - t250x250: 250x250;
  - t300x300: 300x300;
  - t500x500: 500x500.

- for visuals/user banners:
  - t1240x260: 1240x260;
  - t2480x520: 2480x520.

Duplicated code in two methods of SoundcloudParsingHelper
(getUsersFromApi(ChannelInfoItemsCollector, String) and
getStreamsFromApi(StreamInfoItemsCollector, String, boolean)) has been merged
into one common private method, getNextPageUrlFromResponseObject(JsonObject).
…r avatars as uploader avatars in SoundcloudStreamInfoItemExtractor
… and channels

Four new static methods have been added in PeertubeParsingHelper to do so:
- two public methods to get the corresponding image type:
  getAvatarsFromOwnerAccountOrVideoChannelObject(String, JsonObject) and
  getBannersFromAccountOrVideoChannelObject(String, JsonObject);
- two private methods as helper methods: getImagesFromAvatarsOrBanners(String,
  JsonObject, String, String) and getImagesFromAvatarOrBannerArray(String,
  JsonArray).
This method, getThumbnailsFromPlaylistOrVideoItem, has been added in
PeertubeParsingHelper and returns the two image variants for playlists and
videos.
Also lower the visibility of attributes of channels and playlists InfoItems to
private.
…vatar picture

The default avatar picture was used when no profile picture was found, but it
was removed and split in multiple images.

Thumbnails' size is not known, as this data is not provided by the API.
Bandcamp images work with image IDs, which provide different resolutions.

Images on Bandcamp are not always squares, and some IDs respect aspect ratios
where some others not.

The extractor will only use the ones which preserve aspect ratio and will not
provide original images, for performance and size purposes.

Because of this aspect ratio preservation constraint, only one dimension will
be known at a time.

The image IDs with their respective dimension used are:

- 10: 1200w;
- 101: 90h;
- 170: 422h;
- 171: 646h;
- 20: 1024w;
- 200: 420h;
- 201: 280h;
- 202: 140h;
- 204: 360h;
- 205: 240h;
- 206: 180h;
- 207: 120h;
- 43: 100h;
- 44: 200h.

(Where w represents the width of the image and h the height of the image)

Note that these dimensions are theoretical because if the image size is less
than the dimensions of the image ID, it will be not upscaled but kept to its
original size.

All these resolutions are stored in a private static list of ThumbnailSuffixes
in BandcampExtractorHelper, in which the methods to get mutliple images have
been added:

- getImagesFromImageUrl(String): public method to get images from an image URL;
- getImagesFromImageId(long, boolean): public method to get images from an
  image ID;
- getImagesFromImageBaseUrl(String): private utility method to get images from
  the static list of ThumbnailSuffixes from a given image base URL, containing
  the path to the image, a "a" letter if it comes from an album, its ID and an
  underscore.

Some existing methods have been also edited:

- the documentation of getImageUrl(long, boolean) has been changed to reflect
  the Bandcamp images findings;
- getThumbnailUrlFromSearchResult has been renamed to
  getImagesFromSearchResult, and a documentation has been added to this method.

The method replaceHttpWithHttps of the Utils class has been also used in
BandcampExtractorHelper instead of doing manually what the method does.
…os and streams

These three new methods, added in MediaCCCParsingHelper,
getImageListFromImageUrl(String), getThumbnailsFromStreamItem(JsonObject) and
getThumbnailsFromLiveStreamItem(JsonObject) (the last two are based on a common
method, getThumbnailsFromObject(JsonObject, String, String)), return an empty
list if the case no image URL could be extracted.

Images returned have their height and width unknown and a resolution level
depending on the image key of the JSON API response.
Also remove usage of the conference logo as the banner of a conference, as it
is a logo and not a banner.
Also suppress unused warnings in BaseStreamExtractorTest, like it is done on
other BaseExtractorTests interfaces.
This new method, defaultTestImageList(List<Image), will check that the image
list is not null.

For each image, it will test that its URL is secure and its height and width
are more than or equal to their relevant unknown constants in the Image class
(HEIGHT_UNKNOWN and WIDTH_UNKNOWN).
… is empty and to test image collections

Two new methods have been added in ExtractorAsserts to check if a collection is
empty:

- assertNotEmpty(String, Collection<?>), checking:
  - the non nullity of the collection;
  - its non emptiness (if that's not case, an exception will be thrown using
    the provided message).

- assertNotEmpty(Collection<?>), calling assertNotEmpty(String, Collection<?>)
  with null as the value of the string argument.

A new one has been added to this assertion class to check the contrary:
assertEmpty(Collection<?>), checking emptiness of the collection only if it is
not null.

Three new methods have been added in ExtractorAsserts as utility test methods
for image collections:

- assertContainsImageUrlInImageCollection(String, Collection<Image>), checking
that:
  - the provided URL and image collection are not null;
  - the image collection contains at least one image which has the provided
    string value as its URL (which is a string) property.

- assertContainsOnlyEquivalentImages(Collection<Image>, Collection<Image>),
  checking that:
  - both collections are not null;
  - they have the same size;
  - each image of the first collection has its equivalent in the second one.
    This means that the properties of an image in the first collection must be
    equal in an image of the second one.

- assertNotOnlyContainsEquivalentImages(Collection<Image>, Collection<Image>),
  checking that:
  - both collections are not null;
  - one of the following conditions is met:
    - they have different sizes;
    - an image of the first collection has not its equivalent in the second one.
      This means that the properties of an image in the first collection must
      be not equal in an image of the second one.

These methods will be used by services extractors tests (and default ones) to
test image collections.
This method, testImages(Collection<Image>), will use first the default image
collection test in DefaultTests and then will check that each image URL
contains the string yt.

The JavaDoc of the class has been also updated to reflect the changes made in
it (it is now more general).
Also remove some public test methods modifiers, add missing Test annotations on
old Junit 4 tests (and update them if needed), and use final in some places
where it was possible.
Also remove some public test methods modifiers, add missing Test annotations on
old Junit 4 tests (and update them if needed), and improve some code.
This method, testImages(Collection<Image>), will use first the default image
collection test in DefaultTests and then will check that each image URL
contains f4.bcbits.com/img and ends with .jpg or .png.

To do so, a new non-instantiable final class has been added: BandcampTestUtils.
Also remove some public test methods modifiers, add missing Test annotations on
old Junit 4 tests (and update them if needed), and use final in some places
where it was possible.

BandcampChannelExtractorTest.testLength has been removed as the test is always
true.
Also remove some public test methods modifiers.
@AudricV AudricV marked this pull request as ready for review August 12, 2023 21:15
Copy link
Member

@Stypox Stypox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Comment on lines +72 to +73
replaceHttpWithHttps(url), HEIGHT_UNKNOWN, WIDTH_UNKNOWN,
ResolutionLevel.UNKNOWN))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size of the banner seems to be stored in the img tag: <img src="https://f4.bcbits.com/img/0032808625_100.png" width="975" height="180">. Also, as you can see the url contains an id followed by 100. Changing the 100 to other values yields other images. So I guess in a separate PR the bandcamp image support could be improved.

@Stypox Stypox merged commit 1f08d28 into TeamNewPipe:dev Aug 13, 2023
3 of 4 checks passed
@AudricV AudricV deleted the multiple-images-support branch August 13, 2023 10:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[YouTube] Couldn't get high quality thumbnails [PeerTube] Implement banners
8 participants