It looks like this is resetting the URL to the beginning. I uploaded a log here: https://gist.github.com/Locke/e6fd4d1a1966980e27ce0e453b9436c2 #8

Open · Blickfeldkurier opened this issue Jul 13, 2020 · 9 comments

@Blickfeldkurier (Owner)

It looks like this is resetting the URL to the beginning. I uploaded a log here: https://gist.github.com/Locke/e6fd4d1a1966980e27ce0e453b9436c2

Small excerpt:

Get: http://athalis.soup.io/since/613560173?mode=own
Looking for next Page
Process Posts
no next found. retry 1 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 2 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 3 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 4 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 5 of 10
Get: http://athalis.soup.io
Looking for next Page
	...found script
Process Posts
		Image:
			Skip https://asset.soup.io/asset/14977/4363_42f4_600.jpeg: File exists
[...]
Get: http://athalis.soup.io/since/696310839?mode=own
Looking for next Page
Process Posts
no next found. retry 1 of 10
Get: http://athalis.soup.io
Looking for next Page
Process Posts
no next found. retry 2 of 10
Get: http://athalis.soup.io
Looking for next Page
	...found script
Process Posts
		Image:
			Skip https://asset.soup.io/asset/14977/4363_42f4_600.jpeg: File exists
[...]

Originally posted by @Locke in #7 (comment)

@Blickfeldkurier (Owner)

It might work again.
It seems the way dlurl was overwritten was not quite up to the task: after a failed lookup for the next-page link, the URL fell back to the bare root instead of staying on the current page, which is the reset visible in the log.
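
For illustration, a toy sketch of the suspected failure (hypothetical helper, not the actual commit): building the next URL as root + found, where found is empty on a failed lookup, collapses the address back to the bare root, exactly the reset in the log above.

    ROOT = "http://athalis.soup.io"

    def next_dlurl(root, current, found):
        # Keep the current URL when no next-page link was found;
        # "root + found" with an empty found is just the bare root,
        # matching the resets to http://athalis.soup.io in the log.
        return root + found if found else current

    # URL ids taken from the log excerpt above.
    current = ROOT + "/since/613560173?mode=own"
    assert next_dlurl(ROOT, current, "") == current
    assert next_dlurl(ROOT, current, "/since/696310839?mode=own") == \
        ROOT + "/since/696310839?mode=own"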

@Blickfeldkurier (Owner)

Waiting for the next soup.io timeout to prove it works :-)

@Locke (Contributor) commented Jul 13, 2020

LGTM, thank you very much! I also like the idea of printing the status code :)

Full log: https://gist.github.com/Locke/cfe735fd21b069242378fb62a1d75c67

Small excerpt:

Looking for next Page
	...found script: /since/613164780?mode=own
Get: http://athalis.soup.io/since/613164780?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/613164780?mode=own
Process Posts
[...]

@Blickfeldkurier (Owner)

Now I've got it down to just one repetition? *scratches head*

@Blickfeldkurier (Owner)

And retry count works again ^^

@Locke (Contributor) commented Jul 13, 2020

There was just one repetition because the second attempt worked. Now I've seen some more repetitions (edit: still on the previous commit dadb614):

[...]
			Skip https://asset.soup.io/asset/13161/6161_644e_960.png: File exists
Looking for next Page
	...found script: /since/612830072?mode=own
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Failed with Status Code: 503
Get: http://athalis.soup.io/since/612830072?mode=own
Process Posts
		Image:
			Skip https://asset.soup.io/asset/13161/4615_0573_928.png: File exists
[...]

@Blickfeldkurier (Owner)

As long as requests.get returns a status != 200, there should be no repetitions.
If, however, the soup.io page simply doesn't add the next-page link (the annoying red box: "could not load more posts"), the page will still get processed and we get repetitions.
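
For reference, a hypothetical sketch of what find_next_page might do, reconstructed from the "...found script: /since/...?mode=own" lines in the logs above; the actual method in this repo (which also sets self.dlnextfound and takes the current dlurl) may well differ:

    import re
    from bs4 import BeautifulSoup

    def find_next_page(page):
        # Scan all <script> tags for a relative next-page URL of the
        # form /since/<id>?mode=own, as seen in the log output.
        for script in page.find_all('script'):
            match = re.search(r'/since/\d+\?mode=own', script.string or '')
            if match:
                return match.group(0)
        # The "annoying red box" case: the page rendered, but without a
        # next-page link; an empty result is what triggers the retries.
        return ''

    # Illustrative usage with a made-up page (the id comes from the log):
    page = BeautifulSoup('<script>var next = "/since/613164780?mode=own";</script>',
                         'html.parser')
    print(find_next_page(page))  # -> /since/613164780?mode=own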

@Blickfeldkurier (Owner)

        retrycount = 0      # counters from earlier in the method;
        maxretrycount = 10  # the max of 10 matches the log above
        dlurl = self.rooturl + cont_url
        old_url = ""
        while retrycount < maxretrycount:
            print("Get: " + dlurl)
            dl = requests.get(dlurl)
            if dl.status_code == 200:
                page = BeautifulSoup(dl.content, 'html.parser')
                # Only process a page we have not already seen; a retry
                # of the same URL would otherwise duplicate its posts.
                if dlurl != old_url:
                    print("Process Posts")
                    self.process_posts(page, dlurl)
                print("Looking for next Page")
                old_url = dlurl
                dlurl = self.rooturl + self.find_next_page(page, dlurl)
            else:
                self.dlnextfound = False
                print("Failed with Status Code: " + str(dl.status_code))
            if not self.dlnextfound:
                retrycount = retrycount + 1
                print("no next found. retry {} of {}".format(retrycount, maxretrycount))
            else:
                retrycount = 0

The dlurl != old_url test might fix this.
I have no idea about the edge cases, though.
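
A toy trace of the guard (hypothetical values, for illustration only): when the same page comes back twice because the next-page link is missing, the test ensures the posts are processed only once.

    processed = []
    old_url = ""
    # The same URL arrives twice (missing next-page link), then a new one.
    for dlurl in ["/since/A?mode=own", "/since/A?mode=own", "/since/B?mode=own"]:
        if dlurl != old_url:
            processed.append(dlurl)  # stands in for self.process_posts(...)
        old_url = dlurl
    assert processed == ["/since/A?mode=own", "/since/B?mode=own"]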

@Locke (Contributor) commented Jul 13, 2020

Ok, I've let f1e0670 run for a bit and it looks fine: https://gist.github.com/Locke/abf4f1066b2aceed5ae6c6363585faca
