It looks like this is resetting the URL to the beginning. I uploaded a log here: https://gist.github.com/Locke/e6fd4d1a1966980e27ce0e453b9436c2 #8

Open · Blickfeldkurier opened this issue Jul 13, 2020 · 9 comments

@Blickfeldkurier (Owner)

It looks like this is resetting the URL to the beginning. I uploaded a log here: https://gist.github.com/Locke/e6fd4d1a1966980e27ce0e453b9436c2

Small excerpt:

Get: http://athalis.soup.io/since/613560173?mode=own
Looking for next Page
Process Posts
no next found. retry 1 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 2 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 3 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 4 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 5 of 10
Get: http://athalis.soup.io
Looking for next Page
	...found script
Process Posts
		Image:
			Skip https://asset.soup.io/asset/14977/4363_42f4_600.jpeg: File exists
[...]
Get: http://athalis.soup.io/since/696310839?mode=own
Looking for next Page
Process Posts
no next found. retry 1 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 2 of 10
Get: http://athalis.soup.io
Looking for next Page
	...found script
Process Posts
		Image:
			Skip https://asset.soup.io/asset/14977/4363_42f4_600.jpeg: File exists
[...]

Originally posted by @Locke in #7 (comment)

@Blickfeldkurier (Owner)

It might work again.
It seems the way dlurl was overwritten was not quite up to the task: after a failed lookup for the next-page link, the URL fell back to the bare root instead of staying on the current page, which is the reset visible in the log.
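
For illustration, a toy sketch of the suspected failure (hypothetical helper, not the actual commit): building the next URL as root + found, where found is empty on a failed lookup, collapses the address back to the bare root, exactly the reset in the log above.

    ROOT = "http://athalis.soup.io"

    def next_dlurl(root, current, found):
        # Keep the current URL when no next-page link was found;
        # "root + found" with an empty found is just the bare root,
        # matching the resets to http://athalis.soup.io in the log.
        return root + found if found else current

    # URL ids taken from the log excerpt above.
    current = ROOT + "/since/613560173?mode=own"
    assert next_dlurl(ROOT, current, "") == current
    assert next_dlurl(ROOT, current, "/since/696310839?mode=own") == \
        ROOT + "/since/696310839?mode=own"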

@Blickfeldkurier (Owner)

Waiting for the next soup.io timeout to prove it works :-)

@Locke (Contributor) commented Jul 13, 2020

LGTM, thank you very much! I also like the idea of printing the status code :)

Full log: https://gist.github.com/Locke/cfe735fd21b069242378fb62a1d75c67

Small excerpt:

Looking for next Page
	...found script: /since/613164780?mode=own
Get: http://athalis.soup.io/since/613164780?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/613164780?mode=own
Process Posts
[...]

@Blickfeldkurier (Owner)

Now I've got it down to just one repetition? *scratches head*

@Blickfeldkurier (Owner)

And retry count works again ^^

@Locke (Contributor) commented Jul 13, 2020

There was just one repetition because the second attempt worked. Now I've seen some more repetitions (edit: still on the previous commit dadb614):

[...]
			Skip https://asset.soup.io/asset/13161/6161_644e_960.png: File exists
Looking for next Page
	...found script: /since/612830072?mode=own
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Process Posts
		Image:
			Skip https://asset.soup.io/asset/13161/4615_0573_928.png: File exists
[...]

@Blickfeldkurier (Owner)

As long as requests.get returns a status != 200, there should be no repetitions.
If, however, the soup.io page simply doesn't add the next-page link (the annoying red box: "could not load more posts"), the page will still get processed and we get repetitions.
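
For reference, a hypothetical sketch of what find_next_page might do, reconstructed from the "...found script: /since/...?mode=own" lines in the logs above; the actual method in this repo (which also sets self.dlnextfound and takes the current dlurl) may well differ:

    import re
    from bs4 import BeautifulSoup

    def find_next_page(page):
        # Scan all <script> tags for a relative next-page URL of the
        # form /since/<id>?mode=own, as seen in the log output.
        for script in page.find_all('script'):
            match = re.search(r'/since/\d+\?mode=own', script.string or '')
            if match:
                return match.group(0)
        # The "annoying red box" case: the page rendered, but without a
        # next-page link; an empty result is what triggers the retries.
        return ''

    # Illustrative usage with a made-up page (the id comes from the log):
    page = BeautifulSoup('<script>var next = "/since/613164780?mode=own";</script>',
                         'html.parser')
    print(find_next_page(page))  # -> /since/613164780?mode=own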

@Blickfeldkurier (Owner)

        retrycount = 0      # counters from earlier in the method;
        maxretrycount = 10  # the max of 10 matches the log above
        dlurl = self.rooturl + cont_url
        old_url = ""
        while retrycount < maxretrycount:
            print("Get: " + dlurl)
            dl = requests.get(dlurl)
            if dl.status_code == 200:
                page = BeautifulSoup(dl.content, 'html.parser')
                # Only process a page we have not already seen; a retry
                # of the same URL would otherwise duplicate its posts.
                if dlurl != old_url:
                    print("Process Posts")
                    self.process_posts(page, dlurl)
                print("Looking for next Page")
                old_url = dlurl
                dlurl = self.rooturl + self.find_next_page(page, dlurl)
            else:
                self.dlnextfound = False
                print("Failed with Status Code: " + str(dl.status_code))
            if not self.dlnextfound:
                retrycount = retrycount + 1
                print("no next found. retry {} of {}".format(retrycount, maxretrycount))
            else:
                retrycount = 0

The dlurl != old_url test might fix this.
I have no idea about the edge cases, though.
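
A toy trace of the guard (hypothetical values, for illustration only): when the same page comes back twice because the next-page link is missing, the test ensures the posts are processed only once.

    processed = []
    old_url = ""
    # The same URL arrives twice (missing next-page link), then a new one.
    for dlurl in ["/since/A?mode=own", "/since/A?mode=own", "/since/B?mode=own"]:
        if dlurl != old_url:
            processed.append(dlurl)  # stands in for self.process_posts(...)
        old_url = dlurl
    assert processed == ["/since/A?mode=own", "/since/B?mode=own"]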

@Locke (Contributor) commented Jul 13, 2020

Ok, I've let f1e0670 run for a bit and it looks fine: https://gist.github.com/Locke/abf4f1066b2aceed5ae6c6363585faca
