Complete medicalboards and clinical officers scrapers #2
Conversation
Excellent work. Love the tests. I've requested a couple of changes.
healthtools/config.py
Outdated
    "region_name": os.getenv("MORPH_AWS_REGION")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"
Just to avoid replication, let's have the AWS configs be in this format:
AWS = {
    'config': {
        'access_id': os.getenv('MORPH_AWS_ACCESS_ID'),
        'secret_key': os.getenv('MORPH_AWS_SECRET_KEY'),
        'region': os.getenv('MORPH_AWS_REGION')
    },
    'cloudsearch': {
        'doctors': '...',
        'cos': '...'
    }
}
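With a consolidated dict like the one suggested, each scraper can look up the shared credentials and its own endpoint in one place. A minimal sketch of that lookup, using illustrative placeholder endpoint values (the real ones are elided above):

```python
import os

# Mirrors the suggested consolidated config; endpoint URLs are placeholders,
# not the project's real CloudSearch endpoints.
AWS = {
    "config": {
        "access_id": os.getenv("MORPH_AWS_ACCESS_ID"),
        "secret_key": os.getenv("MORPH_AWS_SECRET_KEY"),
        "region": os.getenv("MORPH_AWS_REGION"),
    },
    "cloudsearch": {
        "doctors": "https://doctors-endpoint.example",  # placeholder
        "cos": "https://cos-endpoint.example",          # placeholder
    },
}

def cloudsearch_endpoint(scraper_name):
    """Return the CloudSearch endpoint for a given scraper key."""
    return AWS["cloudsearch"][scraper_name]
```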
except Exception as err:
    skipped_pages += 1
    print "ERROR: scrape_site() - source: {} - page: {} - {}".format(url, page_num, err)
    continue
If a page scrape fails, we should retry up to 5 times; if it still fails, quit the scrape for that specific site. In this case, if any page in the local or foreign doctors scrape fails more than 5 times, the scraper should move on to trying to scrape the COs. The failure counter should also reset to zero after a successful retry.
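The retry behaviour described above could be sketched as follows. The function and parameter names here are illustrative, not the actual `base_scraper` API; the page-scraping step is passed in as a callable so the control flow is visible:

```python
def scrape_site(pages, scrape_page, max_retries=5):
    """Scrape each page, retrying up to max_retries times per page.

    If one page keeps failing, abandon the whole site (return None)
    so the caller can move on to the next scraper. The failure
    counter is per page, so a successful scrape effectively resets it.
    Illustrative sketch only, not the project's real implementation.
    """
    results = []
    for page in pages:
        failures = 0
        while True:
            try:
                results.append(scrape_page(page))
                break  # success: counter resets for the next page
            except Exception:
                failures += 1
                if failures >= max_retries:
                    return None  # give up on this site entirely
    return results
```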
@RyanSept one more change here.
healthtools/scrapers/base_scraper.py
Outdated
    self.document_id += 1
    return entries, delete_batch
except Exception as err:
    print "ERROR: Failed to scrape data from page {} -- {}".format(page_url, str(err))
Should also return an error variable.
'''
Generate an id for an entry
'''
_id = "31291493112" + str(self.document_id)
Not sure I understand the id generated here. At a basic level, it should be an auto-incrementing number starting from 1.
This is a prefix to each document's id, needed because the local and foreign doctors scrapers share a CloudSearch endpoint. Whichever scraper runs second ends up overwriting the previous scraper's documents, since their ids collide: e.g. local doctor document 0 is uploaded, then foreign doctor document 0 is uploaded, deleting the initially uploaded one.
OK, for clinical officers we wouldn't face this problem? As for the foreign doctors dilemma, let's fetch the last id after scraping and pass it to the foreign doctors scraper to increment from, probably from the scraper.py file.
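The handoff suggested here can be sketched as follows. The class and method names are illustrative (not the actual `base_scraper` API); the point is that the foreign doctors scraper is seeded with the local scraper's last id instead of prefixing ids:

```python
class Scraper(object):
    """Minimal sketch of id handoff between two scrapers sharing
    a CloudSearch endpoint. Illustrative API only."""

    def __init__(self, document_id=1):
        self.document_id = document_id

    def scrape(self, records):
        # Assign each record a sequential id, continuing from wherever
        # document_id was seeded.
        for record in records:
            record["id"] = self.document_id
            self.document_id += 1
        return records

# In scraper.py, the local doctors scraper runs first, then its last
# id is passed to the foreign doctors scraper to increment from:
local = Scraper()
local.scrape([{"name": "local doc A"}, {"name": "local doc B"}])
foreign = Scraper(document_id=local.document_id)
foreign_docs = foreign.scrape([{"name": "foreign doc C"}])
```

Because the foreign scraper continues from the local scraper's counter, no two documents share an id and neither upload deletes the other's records.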
Yes, the clinical officers scraper wouldn't have this issue. Sure thing, will implement that.
.gitignore
Outdated
# Ignore output of scraper
data.sqlite
*.pyc
delete_*.json
Any reason why we are ignoring both of these file patterns?
The .pyc files are compiled Python files, created any time a .py file is run. The delete_*.json files shouldn't be there, since we now store the delete docs on S3; will be changing that.
What does this PR do?
Add scrapers that retrieve data from the local doctors, foreign doctors, and clinical officers registries and archive it.
Description of Task to be completed?
Create scrapers to retrieve the data available on the medical board and clinical officers websites. These scrapers should upload the resulting data to AWS CloudSearch and archive a copy of it on AWS S3 if anything has changed since the last scrape.
How should this be manually tested?
Open https://morph.io/RyanSept/healthtools_ke.