Images being repeatedly sideloaded #18

tamw-wnet · 2022-06-14T21:01:07Z

I've noticed since late December that my site is sideloading several extra copies of the 'primary' image for stories. I've gone from sideloading about 3-5GB of images from NPR stories per month to around 15GB. @jwcounts
I need to do some further review, but I think it might have something to do with the changes around this block of code in NPRAPIWordpress.php starting at line 393:

                  foreach( $attached_images as $att_image ) {
                    // see if the filename is very similar
                    // $att_guid = explode( '.', $att_image->guid );
                    $attach_url = wp_get_attachment_url( $att_image->ID );
                    $attach_url_parse = parse_url( $attach_url );
                    $attach_url_parts = pathinfo( $attach_url_parse['path'] );

                    $imagep_attach_url = wp_get_attachment_url( $image_post->ID );
                    $imagep_url_parse = parse_url( $imagep_attach_url );
                    $imagep_url_parts = pathinfo( $imagep_url_parse['path'] );
                    // so if the already attached image name is part of the name of the file
                    // coming in, ignore the new/temp file, it's probably the same
                    if ( strtolower( $attach_url_parts['filename'] ) == strtolower( $imagep_url_parts['filename'] ) ) {
                      @unlink( $file_array['tmp_name'] );
                      wp_delete_attachment( $image_upload_id );
                      $file_OK = FALSE;
                    }
                  }

I'm guessing what's happening is that the previous filename matching logic is breaking somehow, so every time there's a revision to a story the new image gets created.

This could well be some issue on the API end as well -- I was on vacation the week we started seeing this behavior, so a code update making its way through seems... unlikely.

jwcounts · 2022-06-14T21:05:12Z

Let me know what you find out, and I can start doing some inspection on my end.

tamw-wnet · 2022-06-14T21:32:19Z

here's an example -- from NPR Story ID 1087462333 on March 19 I have 3 identical versions of image 0b6a9874_custom-a4f9308ca707a02ec5dcd6c9d1a3a032525f9a8d.jpg

and I have 9 versions of this March 10 nursing-related image
20220124-nprnursing-0106-edit_custom-46609f71513d9b0bd45791c9230ee8ad9624136e.jpg

12 versions of a March 3 Wuhan related image
2022_03_03_wuhanmarket-1193097829-32a0b94cbc67e11c0b6f01e030b1a2cbb53de86f.jpg

I'd be interested if you guys see the same multiple downloads for these (or other) images.

jwcounts · 2022-06-15T16:47:00Z

I haven't been able to verify any on my end. Do you have any recorded duplicates after March 30, 2022?

I committed a change to the repo on March 30th, which changed line 391 in NPRAPIWordpress.php. Previously, that line was passing the internal WordPress ID of the imported article instead of the ID of the imported image. wp_get_attachment_url() returns false if the ID referenced is not an attachment, so no duplicates would ever be found.

tamw-wnet · 2022-06-15T17:49:56Z

Here are two images from June 13:
ap22164713613679_custom-be6cbaed21844e94defba9d0a2508a68a05207c4-3-scaled.jpg from "Here's every word of the second Jan. 6 committee hearing on its investigation" story id 1104690690

This image has 9 versions -- ap22145785567460_wide-dfbbced2e2a33aa5a0899ca7d6cd6f257e13ed37-9-scaled.jpg. It's from the June 13 story "Republican primaries show that Trump voters don't always follow his endorsements", story id 1103956855

tamw-wnet · 2022-06-15T17:51:37Z

I'm using 1.9.3.1, I updated to it on June 8

tamw-wnet · 2022-06-15T17:54:29Z

I've got a feeling the issue is that this is failing:

if ( strtolower( $attach_url_parts['filename'] ) == strtolower( $imagep_url_parts['filename'] ) ) {

I'm going to error_log those two values for a couple of days and see what I see.

jwcounts · 2022-06-15T18:07:24Z

Okay, previous theory busted. I'll run some tests with the two images you referenced above.

Also, comparing the filenames you referenced with the output from the API (example: ap22164713613679_custom-be6cbaed21844e94defba9d0a2508a68a05207c4-3-scaled.jpg), I get where "-3" is coming from (since it's the third copy), but where is the "-scaled" coming from? With the function being used, WordPress should be returning the URL to the original full-sized file, not a scaled or resized variant.

tamw-wnet · 2022-06-15T18:12:05Z

Don't worry about the '-scaled' I was using '-8-scaled' as a grep criteria against ls

tamw-wnet · 2022-06-15T18:18:03Z

I'm starting to think the issue is earlier -- the code always sideloads the image and creates an attachment first, then compares the filename of the new attachment with the filenames of the old attachments; if they match, it deletes the new attachment. However, because the new image has already been downloaded and its filename set to an incremented '-version.jpg', the match always fails.
The comparison needs to be done against the original filename before download.

I'll work on a patch.
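
To illustrate why the post-download comparison can never match, here is a minimal standalone sketch. It does not use the plugin's code: `url_filename()` and `simulate_unique_name()` are hypothetical helpers, the latter a rough stand-in for the renaming WordPress applies when a file of the same name already exists, and the URL and filenames are made up.

```php
<?php
// Hypothetical helper: lowercase, extension-less filename from a URL.
function url_filename( string $url ): string {
	$path = parse_url( $url, PHP_URL_PATH );
	return strtolower( pathinfo( $path, PATHINFO_FILENAME ) );
}

// Hypothetical stand-in for the "-1", "-2", ... suffix added on a name clash.
function simulate_unique_name( string $filename, array $existing ): string {
	$candidate = $filename;
	$n         = 1;
	while ( in_array( $candidate, $existing, true ) ) {
		$candidate = $filename . '-' . $n;
		$n++;
	}
	return $candidate;
}

// Filenames already attached to the post (assumed data).
$existing     = array( 'bmf_3885-920766ab' );
$incoming_url = 'https://example.org/assets/BMF_3885-920766ab.jpg?s=1400';

$incoming   = url_filename( $incoming_url );
$sideloaded = simulate_unique_name( $incoming, $existing );

// Comparing AFTER the sideload: the renamed file never equals an existing name.
$match_after = in_array( $sideloaded, $existing, true );

// Comparing BEFORE the sideload: the incoming name matches, so the download can be skipped.
$match_before = in_array( $incoming, $existing, true );

var_dump( $sideloaded, $match_after, $match_before );
```

In this sketch only the pre-download check detects the duplicate, which is the behavior the patch would need.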

jwcounts · 2022-06-15T18:28:10Z

Yeah, good point. I think that logic was created before our time, though I touched it last when I swapped around what IDs were being checked. Doing the filename check earlier sounds like a good idea.

tamw-wnet · 2022-06-15T18:30:38Z

Agreed, it looks like it was 7 years ago.

jwcounts · 2022-06-15T20:04:40Z

I think I have a solution cooked up if need be. Otherwise, I will defer to yours.

tamw-wnet · 2022-06-15T20:05:50Z

you're probably further along, go for it. I'm still trying to debug where the filename matching isn't happening.

jwcounts · 2022-06-15T20:12:37Z

Okay, sent the commit to the repo. Basically made it loop through attached images right after determining $image_url from the API. If any of the attached images match the filename in $image_url, that API image is skipped.

If no attached images match, then the download/sideload/etc. actually happens. Looks like it will work, but need to run some tests.

jwcounts · 2022-06-15T20:17:42Z

Nope, too aggressive. My testing site isn't importing images at all now.

jwcounts · 2022-06-15T20:23:28Z

Fixed it. Forgot that 'filename' from pathinfo() doesn't include the file extension, whereas 'basename' does
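
For reference, a small sketch of the difference between the two pathinfo() keys, using a made-up URL rather than a real API response:

```php
<?php
// Made-up image URL for illustration only.
$image_url = 'https://example.org/img/ap22140559338813-71e6ea25.jpg';

// parse_url() isolates the path; pathinfo() splits it into parts.
$parts = pathinfo( parse_url( $image_url, PHP_URL_PATH ) );

echo $parts['basename'], "\n";  // name with extension
echo $parts['filename'], "\n";  // name without extension
echo $parts['extension'], "\n"; // extension only
```

Both sides of a comparison need to use the same key, otherwise a name with an extension is compared against one without.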

tamw-wnet · 2022-06-15T21:27:00Z

I think a related issue is introduced by the '-scaled' thing -- a built-in WordPress feature since 5.3.

Here's some debug output I got from a statement right after line 403 -- $file_array is the parsed $image_url, $attach_url_parts is from the post's existing attached images, and $imagep_url_parts is from the newly sideloaded image.

$file_array['name'] = ap22140559338813-71e6ea258db1d20a0c21b6459be57551dc1ce167.jpg
$attach_url_parts['filename'] = ap22140559338813-71e6ea258db1d20a0c21b6459be57551dc1ce167-scaled
$imagep_url_parts['filename'] = ap22140559338813-71e6ea258db1d20a0c21b6459be57551dc1ce167-1-scaled

tamw-wnet · 2022-06-15T21:38:22Z

We need an additional arg to the two calls to wp_get_attachment_url() -- in both cases we need to specify the size, 'full' aka the original downloaded size
eg
$imagep_attach_url = wp_get_attachment_url( $image_post->ID , 'full');

jwcounts · 2022-06-15T21:42:45Z

Unfortunately, that function only accepts the ID. However, they introduced a new function in 5.3: wp_get_original_image_url(). It also only accepts the ID, but otherwise, it functions just like wp_get_attachment_url(). I did a preliminary test on my end and it seems to work, so I'll be curious what kind of outcomes you see on your end. I updated NPRAPIWordpress.php and committed it to the repo.

tamw-wnet · 2022-06-16T13:54:27Z

I'd actually thought we were using wp_get_attachment_image_src() but doh. I'll check it out.

tamw-wnet · 2022-06-16T14:18:13Z

Well, I'm still seeing the issue. I added a little bit of debugging to see what's happening -- line 378 right before the decision to 'continue' I added
error_log($attach_url_parts['filename'] . " " . $imagep_url_parts['filename']);

Here's the output from the most recent run:

[16-Jun-2022 14:10:16 UTC] bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918 bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918
[16-Jun-2022 14:10:16 UTC] bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918-1 bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918
[16-Jun-2022 14:10:16 UTC] bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918-2 bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918

However, "bmf_3885-920766ab62abd8b5da6f9356ab8af60807f72918-3" ended up being added to the Media Library. I'll see if I can figure out what is going on and why it's still getting sideloaded.

tamw-wnet · 2022-06-16T14:21:00Z

actually I already see the issue, the 'continue' doesn't actually stop the download from happening in line 385.

jwcounts · 2022-06-16T14:29:27Z

You're right. The continue only skips to the next iteration of the inner loop; it doesn't affect the outer one. What about this? Starting at 369:

// Track whether any existing attachment matches the incoming image.
$attach_match = false;
if ( !empty( $attached_images ) ) {
	foreach( $attached_images as $att_image ) {
		$attach_url = wp_get_original_image_url( $att_image->ID );
		$attach_url_parse = parse_url( $attach_url );
		$attach_url_parts = pathinfo( $attach_url_parse['path'] );

		if ( strtolower( $attach_url_parts['filename'] ) === strtolower( $imagep_url_parts['filename'] ) ) {
			$attach_match = true;
			break; // one match is enough, no need to check the rest
		}
	}
}
// Skip this image in the outer loop if it is already attached.
if ( $attach_match ) {
	continue;
}

tamw-wnet · 2022-06-16T14:30:51Z

easier would be to just change the orig continue to 'continue 2' I think
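
A minimal sketch of the difference, with made-up names rather than the plugin's actual variables. The numeric argument tells PHP how many enclosing loops the continue applies to:

```php
<?php
$incoming = array( 'a', 'b', 'c' ); // images offered by the API (assumed data)
$attached = array( 'b', 'x' );      // images already attached (assumed data)

// Plain continue: only moves on to the next $existing in the inner loop.
$plain = array();
foreach ( $incoming as $name ) {
	foreach ( $attached as $existing ) {
		if ( $name === $existing ) {
			continue;
		}
	}
	$plain[] = $name; // still reached for 'b', so the duplicate is "downloaded"
}

// continue 2: jumps to the next $name in the outer loop.
$skipped = array();
foreach ( $incoming as $name ) {
	foreach ( $attached as $existing ) {
		if ( $name === $existing ) {
			continue 2;
		}
	}
	$skipped[] = $name; // not reached for 'b'
}

print_r( $plain );
print_r( $skipped );
```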

tamw-wnet · 2022-06-16T14:32:39Z

I'm testing the 'continue 2' approach on my server now, with debugging, I'll let you know how it goes

jwcounts · 2022-06-16T14:32:51Z

Oh wow, I didn't know you could do that. Makes sense though.

tamw-wnet · 2022-06-16T14:36:36Z

Looking good! Just got 5 story updates, none of them got downloaded, but a new story got its image.

jwcounts · 2022-06-16T14:39:20Z

Nice! I've got a couple of small bug fixes queued for a release, so this will round it out pretty well.

tamw-wnet closed this as completed Jun 16, 2022
