
Complete medicalboards and clinical officers scrapers #2

Merged
merged 22 commits into CodeForAfrica-SCRAPERS:master on Mar 28, 2017

Conversation

RyanSept
Contributor

What does this PR do?

Add scrapers that retrieve data from the local doctors, foreign doctors and clinical officers registries and archive it.

Description of Task to be completed?

Create scrapers to retrieve the data available on the medical board and clinical officers websites. These scrapers should then upload the resulting data to AWS CloudSearch and archive it on AWS S3 if it has changed since the last scrape.
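
For illustration, a minimal sketch of that flow (upload to CloudSearch, then archive to S3 only when the data changed). It assumes boto3; the bucket, key, endpoint and function names are placeholders, not the PR's actual code:

import hashlib
import json

import boto3

def upload_and_archive(entries, s3_bucket, s3_key, cloudsearch_endpoint):
    # Upload the scraped entries to AWS CloudSearch as a document batch.
    cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=cloudsearch_endpoint)
    payload = json.dumps(entries)
    cloudsearch.upload_documents(documents=payload, contentType="application/json")

    # Archive to S3 only if the payload differs from the last scrape.
    s3 = boto3.client("s3")
    new_hash = hashlib.md5(payload.encode("utf-8")).hexdigest()
    try:
        old = s3.get_object(Bucket=s3_bucket, Key=s3_key)["Body"].read()
        old_hash = hashlib.md5(old).hexdigest()
    except s3.exceptions.NoSuchKey:
        old_hash = None
    if new_hash != old_hash:
        s3.put_object(Bucket=s3_bucket, Key=s3_key, Body=payload)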

How should this be manually tested?

Open https://morph.io/RyanSept/healthtools_ke.

@RyanSept RyanSept changed the title Complete Local and foreign doctors scrapers as well as the clinical officers scraper Complete medicalboards and clinical officers scrapers Mar 27, 2017
@DavidLemayian left a comment

Excellent work. Love the tests. I've requested a couple of changes.

"region_name": os.getenv("MORPH_AWS_REGION")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"
Contributor

Just to avoid replication, let's have the AWS configs be in this format:

AWS = {
    'config': {
        'access_id': os.getenv('MORPH_AWS_ACCESS_ID'),
        'secret_key': os.getenv('MORPH_AWS_SECRET_KEY'),
        'region': os.getenv('MORPH_AWS_REGION')
    },
    'cloudsearch': {
        'doctors': '...',
        'cos': '...'
    }
}

except Exception as err:
    skipped_pages += 1
    print "ERROR: scrape_site() - source: {} - page: {} - {}".format(url, page_num, err)
    continue
Contributor

If a page scrape doesn't work, we should retry 5 times and, if it still fails, quit the scrape for that specific site.

In this case, if any page in the doctors or foreign doctors scrape fails more than 5 times, the scrape should move on to trying to scrape the COs.

The failure counter should also reset to zero after a successful retry.
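
A rough sketch of that retry behaviour, assuming a per-page scrape_page callable and a max_failures limit (names are illustrative, not the PR's code):

def scrape_site_with_retries(scrape_page, site_url, num_pages, max_failures=5):
    # Scrape every page of a site, retrying failed pages and giving up on the
    # whole site after 5 consecutive failures so the run can move to the next site.
    failures = 0
    results = []
    for page_num in range(1, num_pages + 1):
        while True:
            try:
                results.extend(scrape_page(site_url, page_num))
                failures = 0  # reset the failure counter after a successful scrape
                break
            except Exception as err:
                failures += 1
                print("ERROR: page {} failed (attempt {}/{}) - {}".format(
                    page_num, failures, max_failures, err))
                if failures >= max_failures:
                    print("Giving up on {}, moving on to the next site.".format(site_url))
                    return results
    return results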

Contributor

@RyanSept one more change here.

    self.document_id += 1
    return entries, delete_batch
except Exception as err:
    print "ERROR: Failed to scrape data from page {} -- {}".format(page_url, str(err))
Contributor

Should also return an error variable.
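
A minimal sketch of what that could look like; the parse_page helper and the three-value return shape are assumptions for illustration, not the PR's code:

def scrape_page(page_url, parse_page):
    # Return the scraped entries, the delete batch, and any error that occurred,
    # so the caller can track failures instead of silently swallowing them.
    try:
        entries, delete_batch = parse_page(page_url)
        return entries, delete_batch, None
    except Exception as err:
        print("ERROR: Failed to scrape data from page {} -- {}".format(page_url, err))
        return None, None, err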

'''
Generate an id for an entry
'''
_id = "31291493112" + str(self.document_id)
Contributor

Not sure I understand the id generated here?

At a basic level, it should be an auto-incrementing number from 1 to whatever.

Contributor Author

This is a prefix to each document's id, needed because the local and foreign doctors scrapers share a CloudSearch endpoint. Whichever scraper runs second would otherwise overwrite the previous scraper's documents since their ids are the same, e.g. local doctor document 0 is uploaded, then foreign doctor document 0 is uploaded, deleting the initially uploaded one.

Contributor

Ok, so for clinical officers we wouldn't face this problem?

And for the foreign doctors dilemma, let's fetch the last id after scraping and pass it to the foreign doctors scraper to increment from. Probably from the scraper.py file.
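
For example, something along these lines from scraper.py; the class name and attributes here are assumptions sketching the idea, not the PR's code:

class DoctorsScraper(object):
    # Stub standing in for the real scraper classes, to show the id hand-off.
    def __init__(self, start_id=0):
        self.document_id = start_id

    def scrape_site(self):
        # ... scrape pages, incrementing self.document_id once per document ...
        pass

local_scraper = DoctorsScraper()
local_scraper.scrape_site()

# Start the foreign doctors scrape from the last id the local scrape used,
# so the two scrapers never reuse an id on the shared CloudSearch endpoint.
foreign_scraper = DoctorsScraper(start_id=local_scraper.document_id)
foreign_scraper.scrape_site()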

Contributor Author

Yes, the clinical officers scraper wouldn't have this issue.
Sure thing, will implement that.

.gitignore Outdated
# Ignore output of scraper
data.sqlite
*.pyc
delete_*.json
Contributor

Any reason why we are ignoring both these files?

Contributor Author

@RyanSept Mar 27, 2017

The .pyc files are compiled Python files and are created any time a .py file is run. The delete_*.json files shouldn't be there, as we're now storing the delete docs on S3. Will be changing that.

@DavidLemayian DavidLemayian merged commit 46305e2 into CodeForAfrica-SCRAPERS:master Mar 28, 2017
DavidLemayian pushed a commit that referenced this pull request Jun 23, 2017