
User file/avoid scraping the same user twice if not necessary #3

Closed
Deleetdk opened this issue Mar 29, 2016 · 16 comments

Comments

@Deleetdk (Owner)

Currently, the scraper picks users semi-randomly and scrapes them, then picks some more, and so on. Although there are hundreds of thousands of users, this approach makes it possible for the same user to be scraped twice, which wastes the scraper's time.

To avoid this problem, we can create a user file that keeps track of which users were scraped, how many questions each had answered, and when they were scraped.

Creating users.csv will be easy enough: just a small change to save_as_csv. Then the scraper function (get_target, I think?) must be changed so that it skips users whose data has not changed.
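A minimal sketch of the idea, assuming users.csv has columns username, num_answered, and scraped_at (the actual column names and file layout in the project may differ):

```python
import csv
import os
from datetime import datetime, timezone

USERS_FILE = "users.csv"  # assumed location of the tracking file

def load_scraped_users(path=USERS_FILE):
    """Return {username: num_answered} for every user already scraped."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["username"]: int(row["num_answered"])
                for row in csv.DictReader(f)}

def record_user(username, num_answered, path=USERS_FILE):
    """Append one row after a user has been scraped."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["username", "num_answered", "scraped_at"])
        if is_new:
            writer.writeheader()
        writer.writerow({
            "username": username,
            "num_answered": num_answered,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        })

def should_skip(username, current_num_answered, scraped):
    """Skip a user whose answer count has not changed since the last scrape."""
    return scraped.get(username) == current_num_answered
```

get_target would load the dictionary once at startup and consult should_skip before fetching each profile.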

@tomwalter2287 (Contributor)

Sounds good!


@Deleetdk (Owner, Author)

I have implemented saving the user info to users.csv in 0450111

It works on my end, but I see that some settings were incorrectly changed.

@tomwalter2287 (Contributor)

So you want me to remove the changes to .gitignore and settings.py on my side?


@Deleetdk (Owner, Author)

Just use git pull before you use git commit. This updates your local version with the server's, so that there are no conflicts. Make sure that your .gitignore file is correct because otherwise you are uploading temporary files (those ending with ~) and data files (those in data/) to the repository. If you look at https://github.com/Deleetdk/OKCubot2/blob/master/.gitignore#L64 you will see that I have excluded these.

@tomwalter2287 (Contributor)

I pushed a new version.

Looking forward to your response.


@tomwalter2287 (Contributor)

I pulled your original project, so this issue should be solved.

@tomwalter2287 (Contributor)

I think this issue is already solved.

@Deleetdk (Owner, Author)

We need to test it. I tried testing it with the --u option. However, it still scrapes the user twice.

@tomwalter2287 (Contributor)

Twice?

It scrapes the user only once.

@Deleetdk (Owner, Author)

If I run the scraper twice with the same user in --u, the profile is scraped twice. It should be skipped the second time if the number of questions answered on the profile matches the number in users.csv. Make sure the skipping feature can be disabled with a command-line argument, e.g. --noskip.
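With argparse, the flag could look like this. This is a sketch: the scraper's actual option parsing may differ, and the two positional arguments are guesses based on the usage shown later in the thread (they appear to be login credentials):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="OKCubot scraper")
    parser.add_argument("username")            # login credentials (assumed)
    parser.add_argument("password")
    parser.add_argument("--u", dest="target",  # scrape one specific profile
                        help="username of a single profile to scrape")
    parser.add_argument("--noskip", action="store_true",
                        help="scrape even if the user's answer count is unchanged")
    return parser

# Example invocation matching the test commands used in this thread:
args = build_parser().parse_args(
    ["PlainSeagull", "PlainSeagullPlainSeagull",
     "--u", "mama_crossasaur", "--noskip"])
```

The scrape loop would then check `if not args.noskip and should_skip(...)` before fetching a profile.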

@tomwalter2287 (Contributor)

Hi, Emil.

I pushed a new version; this issue is solved in it.

@Deleetdk (Owner, Author)

I tested this with user mama_crossasaur. This user has not answered any hidden questions. The code correctly skips her. I also tried the --noskip argument. Also worked.

python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur
python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur --noskip

There is a problem with users who have answered some questions privately. The scraper cannot scrape those answers, so they are not counted in the stored total. However, the number shown on the profile includes them, so the comparison never matches and these users are re-scraped every time.

The solution is to use the number shown on the profile for the stored value as well, so both sides of the comparison come from the same source. I will make this change myself.

@Deleetdk (Owner, Author)

Can you add the number of questions answered to target_info? Call it m_numberanswered. In that case, I can get the information I need in the save_as_csv function.
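Assuming target_info is a plain dict built while parsing the profile, the change might look like this. m_numberanswered is the name requested above; the parsing helper and the input format are hypothetical:

```python
# In the profile-parsing step: store the publicly displayed answer count,
# since privately answered questions cannot be scraped individually.
def add_answer_count(answer_count_text, target_info):
    # hypothetical: answer_count_text is e.g. "1,234 questions"
    count = int(answer_count_text.split()[0].replace(",", ""))
    info = dict(target_info)
    info["m_numberanswered"] = count
    return info

# save_as_csv can then read the same number the skip check compares against:
def answered_count(target_info):
    return target_info.get("m_numberanswered", 0)
```

Because both the stored value and the comparison value now come from the displayed count, privately answered questions no longer cause spurious re-scrapes.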

@tomwalter2287 (Contributor)

I sent you the new version.

@Deleetdk (Owner, Author)

Good. I am currently running a large test of the scraper (scraping 1000 users), looking for bugs we just haven't seen yet.

I will try your new version after I'm done with that.

@Deleetdk (Owner, Author)

Deleetdk commented Apr 7, 2016

Fixed in 4b9a67a

@Deleetdk closed this as completed Apr 7, 2016