Podcasts: Pre-caching image and more-info data can bring the server to a crawl #668
Conversation
When browsing within the podcast menu, pre-caching activity will be initiated for each feed where pre-cached image or podcast info data is not found. A guard is needed to prevent launching the pre-cache process multiple times for the same feed. This will happen whenever the user is browsing within the podcast menu and a pre-existing pre-cache process has been launched but has not completed. This can have an adverse impact on modestly powered servers, and/or where there are a large number of feeds to be handled. This change handles the issue by pre-caching place-holder data each time a feed pre-cache process is started. The place-holder data will be replaced by real data when the process completes.
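The placeholder guard described above can be sketched as follows. This is an illustrative Python sketch (LMS itself is written in Perl), and `cache`, `start_precache` and `fetch_and_parse` are hypothetical stand-ins for the actual LMS cache and fetch/parse code:

```python
# Hypothetical in-memory cache standing in for the LMS cache layer.
cache = {}

PLACEHOLDER = {"status": "pending"}

def start_precache(feed_url, fetch_and_parse):
    """Launch a pre-cache run for a feed unless data is already present.

    Writing a placeholder entry up front means that a second browse of the
    podcast menu, arriving before the first fetch has completed, finds
    something in the cache and does not launch a duplicate fetch.
    """
    if feed_url in cache:
        return  # real data, or a placeholder from an in-flight run
    cache[feed_url] = PLACEHOLDER     # guard: claim the slot first
    data = fetch_and_parse(feed_url)  # slow network + parse step
    cache[feed_url] = data            # replace placeholder with real data
```

In the real (asynchronous) code the placeholder sits in the cache for the whole duration of the fetch; here the synchronous call collapses that window, but the guard logic is the same.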
At present, a podcast's 'moreInfo' data is cached for the default period of one hour, in contrast to image data, which is cached for 90 days. This generates regular pre-caching activity, at hourly or shorter intervals, whenever the user browses within the podcast menu. This can become a significant burden if the user subscribes to more than a handful of feeds. This change aligns the caching period with that of image data. Staleness will not be a significant issue, as this data rarely changes. Additionally, the caching period is now randomized to fall within a window of +/- 10 days around 90 days. This reduces the chance that all cached data becomes stale at the same time, thereby spreading the impact of pre-caching activity over a number of user browse sessions.
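The randomized caching period might look something like this. A Python sketch: the 90 +/- 10 day window matches the description above, but the function name and structure are illustrative, not the actual LMS code:

```python
import random

DAY = 24 * 60 * 60  # seconds

def randomized_ttl(base_days=90, jitter_days=10):
    """Pick a cache lifetime uniformly in [base - jitter, base + jitter] days.

    Spreading the expiry times out means the cached entries for many feeds
    do not all go stale in the same browse session, so refresh activity is
    spread across sessions instead of arriving as one burst.
    """
    days = random.uniform(base_days - jitter_days, base_days + jitter_days)
    return days * DAY  # TTL in seconds, as cache APIs typically expect
```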
Take the opportunity to refresh/update a podcast's pre-cached data while browsing the feed. This is, essentially, free, and extends the life of the cached data. By preventing cached data from becoming stale, we reduce the burden of podcast feed pre-caching activity that can take place whenever the user browses within the podcast menu.
You're right about the load on simpler servers and the need not to query the same thing multiple times. I often forget that LMS re-descends the XML tree multiple times during navigation. But don't you think that something basic like this would be good enough?
I'm not a big fan of timers in general, and this alternative simply serializes the requests, so there is no flooding of hundreds of requests. When descending a tree, grabbing the episodes of a single feed will compete with at most one caching request. It's faster (no need to wait 30s) when the opportunity permits, while not getting in the way of other user-requested browsing.
Thanks @mw9 for identifying this issue - good catch! I like the idea of the randomized TTL for the cache and the update whenever available. But I'd agree with @philippe44 about the timers: they're always wrong - too short or too long, but never right 😀. I mentioned some modules in the forum thread, but @philippe44's approach seems good enough to me. Could you please factor in @philippe44's suggestion to get rid of the timer? The rest looks good to me. Thanks again!
This change adapts the podcast image and 'more info' pre-caching activity to take place over an extended period of time. The intention is to avoid overly burdening modestly endowed servers. Some users may have a considerable number of subscribed feeds (in excess of 100 have been reported), and fetching and parsing that number of feeds all at the same time can cause considerable delays (reportedly in the tens of seconds) when browsing the podcast UI. This will be particularly noticeable when using the podcast menu after clearing server caches, or re-installing LMS. In those circumstances all feeds need to be pre-cached, and this will be attempted on first use of the podcast menu. With this change, fetching and parsing each feed takes place sequentially, so the server will only be processing one feed at a time, instead of processing all feeds concurrently. No explicit expiry time need be set for the retrieved feed itself, as we cache only the data we want from it.
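The one-at-a-time ordering can be sketched as below. The real LMS implementation chains asynchronous HTTP callbacks rather than blocking in a loop; this synchronous Python sketch only illustrates the serialization, and `fetch_and_parse` and `store` are hypothetical stand-ins for the LMS fetch/parse and cache-write steps:

```python
from collections import deque

def precache_sequentially(feed_urls, fetch_and_parse, store):
    """Process feeds one at a time instead of firing all requests at once.

    Each feed is fetched and parsed only after the previous one has
    finished, so a server with 100+ subscribed feeds handles a single
    request at a time rather than all of them concurrently.
    """
    queue = deque(feed_urls)
    while queue:
        url = queue.popleft()
        data = fetch_and_parse(url)  # blocking: next feed waits for this one
        store(url, data)             # cache only the extracted data; the raw
                                     # feed itself needs no explicit expiry
```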
cf0e21e to fa945f7
I have taken on board your discomfort with my laid-back timers 😀, and have taken up @philippe44's elegant suggestion in its place. It seems to work fine on my modestly powered server. Just the last commit is changed:
A remark: while playing around with this last week, I ran across a curious issue. It seems that, if there is an http failure, there are circumstances in which … But the remote http server that was triggering the problem has now started to behave itself, and I have not been able to reproduce the circumstances. At first I thought it might be something to do with … There is no particularly adverse effect on this proposed change: it just means that we would be triggered into fetching the next feed a little earlier than anticipated. It will probably be a rare occurrence.
Thanks for the update. I was wondering whether you didn't pre-cache things twice now: once in the parser, then in the callback triggered with the parser's result?
sub 'preCacheFeedData' requires a URL and a feed as parameters. Check that we are given something that looks believable. Also, 'preCacheFeedData' is already called by the parser, we don't need to do it again.
Thanks for this PR! I merged it to 8.3 for now. Should it turn out to be good, I'll back-port to 8.2. Please give it some testing.
Forum user @Aphonia, who raised the issue, has provided me with his list of (125) podcasts that were causing him pain. One matter has arisen: it is quite acceptable for a podcast not to provide an image. In this case we provide a default image, but only for 24 hours. Thereafter we just fetch the feed again, and re-try. That seems, to me, somewhat unnecessary, and perhaps "rude". I shall propose in a follow-up PR that we adopt the same circa 90-day policy that is applied to "real" images. I also noted some small changes that might be made to the XML/RSS feed parser, which I will offer up in a follow-up PR.
These proposed changes seek to address issues that have arisen with pre-caching Podcast feed images and data.
At present, there is a problem with the pre-caching process, because it potentially generates a large number of simultaneous feed requests which can overload a modestly powered server, especially where the user subscribes to a large number of feeds. It's not an obvious problem on, say, an iMac, but it has created problems on a Raspberry Pi, as reported here: Slow loading of a list of podcast episodes.
I have submitted the PR against LMS 8.2, on the premise that the issue is a bug in 8.2. But it can be applied equally well to 8.3.
The proposed changes are:
Podcast: Pre-caching - Guard against fetching same feed multiple times
This change ensures that only one pre-caching attempt per day will be made for each feed. At present, a pre-caching attempt may be made on each occasion that the user browses within the podcast menu.
Podcast: Pre-caching - Randomize image & moreInfo caching period
This change implements randomization of the cache period, to reduce the likelihood that future cache refreshes will happen all at once. It also sets the cache period for a podcast's 'moreInfo' data to the same period as the podcast image. At present that data is only cached for the default period of one hour, which can become burdensome if the user has many feeds.
Podcast: Pre-caching - Refresh pre-cached data while browsing a feed
This change implements an opportunistic refresh of the cache when the user browses into a podcast feed. It will help to keep the cache up to date.
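An opportunistic refresh of this kind can be sketched as follows. All names here are hypothetical Python stand-ins, not the actual LMS API; the point is that, since browsing already fetches and parses the feed, re-writing the pre-cached entries at that moment costs essentially nothing and resets their lifetime:

```python
def extract_precache_data(feed):
    # Illustrative: keep just the fields the podcast menu pre-caches.
    return {"image": feed.get("image"), "moreInfo": feed.get("description")}

class Cache:
    """Minimal stand-in for the LMS cache (TTL stored but not enforced)."""
    def __init__(self):
        self.data = {}
    def set(self, key, value, ttl=None):
        self.data[key] = (value, ttl)

def browse_feed(feed_url, cache, fetch_and_parse, ttl):
    """Browse a feed, refreshing its pre-cached data as a side effect.

    The fetch and parse happen anyway to serve the browse request, so the
    cache write is essentially free and makes the data less likely to be
    stale on the next visit to the podcast menu.
    """
    feed = fetch_and_parse(feed_url)  # needed for browsing regardless
    cache.set(feed_url, extract_precache_data(feed), ttl=ttl)
    return feed
```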
Podcast: Pre-caching - Schedule activity over an extended period
This change spreads out the pre-caching activity by launching each feed fetch at 30-second intervals. Lower-powered servers will benefit from this somewhat laid-back approach. It does mean that a user with a large number of feeds will wait longer to see the podcast images properly populated, but perhaps that is not so very important.
Update: Commit now revised - refer below.