User file/avoid scraping the same user twice if not necessary #3
Sounds good! On Tue, Mar 29, 2016 at 5:22 AM, Emil Kirkegaard notifications@github.com
I have implemented saving the user info to
It works on my end, but I see some settings were incorrectly changed.
So you want me to remove the changes to .gitignore and settings.py on my side?
Just use
I pushed a new version. Looking forward to your response.
I pulled your original project.
I think this issue is already solved.
We need to test it. I tried testing it with the --u option. However, it still scrapes the user twice.
Twice? It scrapes the user only once.
If I run the scraper twice with the same user in --u, the profile is scraped twice. It should be skipped the second time if the number of questions answered in the profile is the same as the number in
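A minimal sketch of the skip check being asked for here, assuming a hypothetical users.csv layout of username, answer count, timestamp; the project's actual columns and function names may differ:

```python
import csv

def should_skip(username, profile_answer_count, user_file="users.csv"):
    """Return True if this user was already scraped and their answer
    count has not changed since, so re-scraping would be wasted work.

    Assumes (hypothetically) that each row of the user file looks like:
        username,answer_count,scraped_at
    """
    try:
        with open(user_file, newline="") as f:
            for row in csv.reader(f):
                if row and row[0] == username:
                    # Skip only if the count on the profile matches
                    # the count we recorded last time.
                    return int(row[1]) == profile_answer_count
    except FileNotFoundError:
        return False  # no user file yet, so nothing has been scraped
    return False  # user never seen before, must be scraped
```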
Hi, Emil. I pushed a new version.
I tested this with user mama_crossasaur. This user has not answered any hidden questions. The code correctly skips her. I also tried the
There is a problem with users who have answered some questions privately. The scraper cannot scrape private answers, so they are not counted. However, the number shown in the profile includes them, so the two counts never match and such users get re-scraped every run. The solution is to use the number shown in the profile on both sides of the comparison. I will make this change myself.
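To illustrate the fix described above with hypothetical helper names: record the profile-displayed count at scrape time, and compare against the same profile-displayed count on the next run, so private answers can never cause a false mismatch:

```python
def record_user(username, profile_count, seen):
    # Store the count as displayed on the profile page, not the number of
    # answers actually scraped: privately answered questions are included
    # in the profile total but are never downloaded, so the two differ.
    seen[username] = profile_count

def needs_rescrape(username, current_profile_count, seen):
    # Both sides of the comparison use the profile-displayed number, so a
    # user with private answers is not needlessly re-scraped every run.
    return seen.get(username) != current_profile_count
```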
Can you add the number of questions answered to
I sent you a new version.
Good. I am currently doing a large test of the scraper (scraping 1000 users). I am looking to see if there are more bugs that we just haven't seen yet. I will try your new version after I'm done with that. |
Fixed in 4b9a67a |
Currently, the scraper picks users semi-randomly and scrapes them, then picks some more, and so on. Although there are hundreds of thousands of users, doing it this way makes it possible that the same user will get scraped twice, which potentially wastes the scraper's time.
To avoid this problem, one can create a user file that keeps track of which users were scraped, how many questions they had answered, and when they were scraped.
Creating the users.csv will be easy enough, just a little change to save_as_csv. Then a change must be made to the scraper function (get_target, I think?) so that it skips users that have not changed their data.
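The bookkeeping described above could look roughly like this; the three-column layout and function names are assumptions for illustration, not the project's actual save_as_csv/get_target code:

```python
import csv
from datetime import datetime, timezone

def load_seen(path="users.csv"):
    """Return {username: answer_count} from a previous run, or an
    empty dict if the user file does not exist yet."""
    seen = {}
    try:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if row:
                    seen[row[0]] = int(row[1])
    except FileNotFoundError:
        pass
    return seen

def mark_scraped(username, answer_count, path="users.csv"):
    """Append one row after a user has been scraped successfully."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [username, answer_count, datetime.now(timezone.utc).isoformat()]
        )
```

The scraper function would then call load_seen() once at startup, skip any username whose stored count still matches the profile, and call mark_scraped() after each fresh scrape.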