Complete medicalboards and clinical officers scrapers #2
Conversation
Excellent work. Love the tests. I've requested a couple of changes.
healthtools/config.py
Outdated
    "region_name": os.getenv("MORPH_AWS_REGION")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"
Just to avoid replication, let's have the AWS configs be in this format:
AWS = {
    'config': {
        'access_id': os.getenv('MORPH_AWS_ACCESS_ID'),
        'secret_key': os.getenv('MORPH_AWS_SECRET_KEY'),
        'region': os.getenv('MORPH_AWS_REGION')
    },
    'cloudsearch': {
        'doctors': '...',
        'cos': '...'
    }
}
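With a consolidated dict like the one suggested, each scraper can look up the shared credentials and its own endpoint in one place. A minimal sketch of that lookup, using illustrative placeholder endpoint values (the real ones are elided above):

```python
import os

# Mirrors the suggested consolidated config; endpoint URLs are placeholders,
# not the project's real CloudSearch endpoints.
AWS = {
    "config": {
        "access_id": os.getenv("MORPH_AWS_ACCESS_ID"),
        "secret_key": os.getenv("MORPH_AWS_SECRET_KEY"),
        "region": os.getenv("MORPH_AWS_REGION"),
    },
    "cloudsearch": {
        "doctors": "https://doctors-endpoint.example",  # placeholder
        "cos": "https://cos-endpoint.example",          # placeholder
    },
}

def cloudsearch_endpoint(scraper_name):
    """Return the CloudSearch endpoint for a given scraper key."""
    return AWS["cloudsearch"][scraper_name]
```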
except Exception as err:
    skipped_pages += 1
    print "ERROR: scrape_site() - source: {} - page: {} - {}".format(url, page_num, err)
    continue
If a page scrape fails, we should retry up to 5 times; if it still fails, quit the scrape for that specific site. In this case, if any page in the local or foreign doctors scrape fails more than 5 times, the scraper should move on to trying to scrape the COs. The failure counter should also reset to zero after a successful retry.
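The retry behaviour described above could be sketched as follows. The function and parameter names here are illustrative, not the actual `base_scraper` API; the page-scraping step is passed in as a callable so the control flow is visible:

```python
def scrape_site(pages, scrape_page, max_retries=5):
    """Scrape each page, retrying up to max_retries times per page.

    If one page keeps failing, abandon the whole site (return None)
    so the caller can move on to the next scraper. The failure
    counter is per page, so a successful scrape effectively resets it.
    Illustrative sketch only, not the project's real implementation.
    """
    results = []
    for page in pages:
        failures = 0
        while True:
            try:
                results.append(scrape_page(page))
                break  # success: counter resets for the next page
            except Exception:
                failures += 1
                if failures >= max_retries:
                    return None  # give up on this site entirely
    return results
```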
@RyanSept one more change here.
healthtools/scrapers/base_scraper.py
Outdated
    self.document_id += 1
    return entries, delete_batch
except Exception as err:
    print "ERROR: Failed to scrape data from page {} -- {}".format(page_url, str(err))
Should also return an error variable.
'''
Generate an id for an entry
'''
_id = "31291493112" + str(self.document_id)
Not sure I understand the id generated here. At a basic level, it should be an auto-incrementing number starting from 1.
This is a prefix to each document's id, needed because the local and foreign doctors scrapers share a CloudSearch endpoint. Whichever scraper runs second ends up overwriting the previous scraper's documents, since their ids collide: e.g. local doctor document 0 is uploaded, then foreign doctor document 0 is uploaded, deleting the initially uploaded one.
OK, for clinical officers we wouldn't face this problem? As for the foreign doctors dilemma, let's fetch the last id after scraping and pass it to the foreign doctors scraper to increment from, probably from the scraper.py file.
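The handoff suggested here can be sketched as follows. The class and method names are illustrative (not the actual `base_scraper` API); the point is that the foreign doctors scraper is seeded with the local scraper's last id instead of prefixing ids:

```python
class Scraper(object):
    """Minimal sketch of id handoff between two scrapers sharing
    a CloudSearch endpoint. Illustrative API only."""

    def __init__(self, document_id=1):
        self.document_id = document_id

    def scrape(self, records):
        # Assign each record a sequential id, continuing from wherever
        # document_id was seeded.
        for record in records:
            record["id"] = self.document_id
            self.document_id += 1
        return records

# In scraper.py, the local doctors scraper runs first, then its last
# id is passed to the foreign doctors scraper to increment from:
local = Scraper()
local.scrape([{"name": "local doc A"}, {"name": "local doc B"}])
foreign = Scraper(document_id=local.document_id)
foreign_docs = foreign.scrape([{"name": "foreign doc C"}])
```

Because the foreign scraper continues from the local scraper's counter, no two documents share an id and neither upload deletes the other's records.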
Yes, the clinical officers scraper wouldn't have this issue. Sure thing, will implement that.
.gitignore
Outdated
# Ignore output of scraper
data.sqlite
*.pyc
delete_*.json
Any reason why we are ignoring both of these file patterns?
The .pyc files are compiled Python files, created any time a .py file is run. The delete_*.json files shouldn't be there, since we now store the delete docs on S3; will be changing that.
What does this PR do?
Add scrapers that retrieve data from the local doctors, foreign doctors, and clinical officers registries and archive it.
Description of Task to be completed?
Create scrapers to retrieve the data available on the medical board and clinical officers websites. These scrapers should upload the resulting data to AWS CloudSearch and archive a copy of it on AWS S3 if anything has changed since the last scrape.
How should this be manually tested?
Open https://morph.io/RyanSept/healthtools_ke.