
User file/avoid scraping the same user twice if not necessary #3

Closed
Deleetdk opened this issue Mar 29, 2016 · 16 comments

Comments

@Deleetdk (Owner)

Currently, the scraper picks users semi-randomly and scrapes them, then picks some more, and so on. Although there are hundreds of thousands of users, this approach makes it possible for the same user to be scraped twice, which wastes the scraper's time.

To avoid this problem, we can create a user file that keeps track of which users were scraped, how many questions each had answered, and when they were scraped.

Creating users.csv will be easy enough: just a small change to save_as_csv. Then the scraper function (get_target, I think?) must be changed so that it skips users whose data has not changed.
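A minimal sketch of the idea, assuming users.csv has columns username, num_answered, and scraped_at (the actual column names and file layout in the project may differ):

```python
import csv
import os
from datetime import datetime, timezone

USERS_FILE = "users.csv"  # assumed location of the tracking file

def load_scraped_users(path=USERS_FILE):
    """Return {username: num_answered} for every user already scraped."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["username"]: int(row["num_answered"])
                for row in csv.DictReader(f)}

def record_user(username, num_answered, path=USERS_FILE):
    """Append one row after a user has been scraped."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["username", "num_answered", "scraped_at"])
        if is_new:
            writer.writeheader()
        writer.writerow({
            "username": username,
            "num_answered": num_answered,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        })

def should_skip(username, current_num_answered, scraped):
    """Skip a user whose answer count has not changed since the last scrape."""
    return scraped.get(username) == current_num_answered
```

get_target would load the dictionary once at startup and consult should_skip before fetching each profile.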

@tomwalter2287 (Contributor)

Sounds good!


@Deleetdk (Owner, Author)

I have implemented saving the user info to users.csv in 0450111

It works on my end, but I see that some settings were incorrectly changed.

@tomwalter2287 (Contributor)

So you want me to remove the changes to .gitignore and settings.py on my side?


@Deleetdk (Owner, Author)

Just use git pull before you use git commit. This updates your local version with the server's, so that there are no conflicts. Make sure that your .gitignore file is correct because otherwise you are uploading temporary files (those ending with ~) and data files (those in data/) to the repository. If you look at https://github.com/Deleetdk/OKCubot2/blob/master/.gitignore#L64 you will see that I have excluded these.

@tomwalter2287 (Contributor)

I pushed a new version.

Looking forward to your response.


@tomwalter2287 (Contributor)

I pulled your original project, so this issue should be solved.

@tomwalter2287 (Contributor)

I think this issue is already solved.

@Deleetdk (Owner, Author)

We need to test it. I tried testing it with the --u option. However, it still scrapes the user twice.

@tomwalter2287 (Contributor)

Twice?

It scrapes the user only once.

@Deleetdk (Owner, Author)

If I run the scraper twice with the same user in --u, the profile is scraped twice. It should be skipped the second time if the number of questions answered on the profile matches the number in users.csv. Make sure the skipping feature can be disabled with a command-line argument, e.g. --noskip.
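With argparse, the flag could look like this. This is a sketch: the scraper's actual option parsing may differ, and the two positional arguments are guesses based on the usage shown later in the thread (they appear to be login credentials):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="OKCubot scraper")
    parser.add_argument("username")            # login credentials (assumed)
    parser.add_argument("password")
    parser.add_argument("--u", dest="target",  # scrape one specific profile
                        help="username of a single profile to scrape")
    parser.add_argument("--noskip", action="store_true",
                        help="scrape even if the user's answer count is unchanged")
    return parser

# Example invocation matching the test commands used in this thread:
args = build_parser().parse_args(
    ["PlainSeagull", "PlainSeagullPlainSeagull",
     "--u", "mama_crossasaur", "--noskip"])
```

The scrape loop would then check `if not args.noskip and should_skip(...)` before fetching a profile.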

@tomwalter2287 (Contributor)

Hi, Emil.

I pushed a new version; this issue is solved in it.

@Deleetdk (Owner, Author)

I tested this with user mama_crossasaur. This user has not answered any hidden questions. The code correctly skips her. I also tried the --noskip argument. Also worked.

python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur
python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur --noskip

There is a problem with users who have answered some questions privately. The scraper cannot scrape those answers, so they are not counted in the stored total. However, the number shown on the profile includes them, so the comparison never matches and these users are re-scraped every time.

The solution is to use the number shown on the profile for the stored value as well, so both sides of the comparison come from the same source. I will make this change myself.

@Deleetdk (Owner, Author)

Can you add the number of questions answered to target_info? Call it m_numberanswered. In that case, I can get the information I need in the save_as_csv function.
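Assuming target_info is a plain dict built while parsing the profile, the change might look like this. m_numberanswered is the name requested above; the parsing helper and the input format are hypothetical:

```python
# In the profile-parsing step: store the publicly displayed answer count,
# since privately answered questions cannot be scraped individually.
def add_answer_count(answer_count_text, target_info):
    # hypothetical: answer_count_text is e.g. "1,234 questions"
    count = int(answer_count_text.split()[0].replace(",", ""))
    info = dict(target_info)
    info["m_numberanswered"] = count
    return info

# save_as_csv can then read the same number the skip check compares against:
def answered_count(target_info):
    return target_info.get("m_numberanswered", 0)
```

Because both the stored value and the comparison value now come from the displayed count, privately answered questions no longer cause spurious re-scrapes.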

@tomwalter2287 (Contributor)

I sent you the new version.

@Deleetdk (Owner, Author)

Good. I am currently running a large test of the scraper (scraping 1000 users), looking for bugs we just haven't seen yet.

I will try your new version after I'm done with that.

@Deleetdk (Owner, Author)

Deleetdk commented Apr 7, 2016

Fixed in 4b9a67a

@Deleetdk closed this as completed Apr 7, 2016