merge master
Gathondu committed Jun 29, 2017
2 parents 8a772a1 + da31e1a commit fed8126
Showing 17 changed files with 277 additions and 535 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,2 +1 @@
*.pyc
/data/
37 changes: 6 additions & 31 deletions README.md
@@ -23,40 +23,15 @@ Change directory into package `$ cd healthtools_ke`
Install the dependencies by running `$ pip install -r requirements.txt`

You can set the required environment variables like so
```
$ export MORPH_AWS_REGION=<aws_region>
$ export MORPH_AWS_ACCESS_KEY_ID=<aws_access_key_id>
$ export MORPH_AWS_SECRET_KEY=<aws_secret_key>
```

$ export MORPH_AWS_REGION=<aws_region>
$ export MORPH_AWS_ACCESS_KEY_ID=<aws_access_key_id>
$ export MORPH_AWS_SECRET_KEY=<aws_secret_key>
$ export MORPH_S3_BUCKET=<s3_bucket_name> # If not set, data will be archived locally in the project's folder in a folder called data
$ export MORPH_ES_HOST=<elastic_search_host_endpoint> # Do not set this if you would like to use elastic search locally on your machine
$ export MORPH_ES_PORT=<elastic_search_host_port> # Do not set this if you would like to use elastic search default port 9200
$ export MORPH_WEBHOOK_URL=<slack_webhook_url> # Do not set this if you don't want to post error messages on Slack

**If you want to use elasticsearch locally on your machine use the following instructions to set it up**

For linux and windows users, follow instructions from this [link](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)

For mac users run `brew install elasticsearch` on your terminal

**If you want to post messages on slack**

Set up `Incoming Webhooks` [here](https://slack.com/signin?redir=%2Fservices%2Fnew%2Fincoming-webhook) and set the global environment for the `MORPH_WEBHOOK_URL`

If you set up elasticsearch locally run it `$ elasticsearch`

You can now run the scrapers `$ python scraper.py` (It might take a while)

**FOR DEVELOPMENT PURPOSES**

Set the **BATCH** and **HF_BATCH** (for health facilities) in the config file that will ensure the scraper doesn't scrape entire sites but just the number
of pages that you would like it to scrape defined by this variable.

use `$ python scraper.py small_batch` to run the scrapers
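
The development-mode switch described above could be wired up roughly as follows. This is a minimal sketch, not the project's `scraper.py`: the function name `pages_to_scrape` and the `total_pages` parameter are hypothetical, and only the `small_batch` argument and the value 5 come from the README text and `config.py`.

```python
import sys

# Value taken from SMALL_BATCH in healthtools/config.py.
SMALL_BATCH = 5


def pages_to_scrape(argv, total_pages, small_batch=SMALL_BATCH):
    """Return how many pages to fetch (hypothetical helper).

    If "small_batch" was passed on the command line, cap the run at
    `small_batch` pages; otherwise scrape every page.
    """
    if "small_batch" in argv[1:]:
        return min(small_batch, total_pages)
    return total_pages


if __name__ == "__main__":
    # 40 is an arbitrary page count used only for this demonstration.
    print(pages_to_scrape(sys.argv, total_pages=40))
```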

You can now run the scrapers `$ python scraper.py` (It might take a while and you might need to change the endpoints in config.py if you haven't authorization for them)

## Running the tests
_**make sure if you use elasticsearch locally, it's running**_

Use nosetests to run tests (with stdout) like this:
```$ nosetests --nocapture```

17 changes: 0 additions & 17 deletions circle.yml

This file was deleted.

33 changes: 12 additions & 21 deletions healthtools/config.py
@@ -2,32 +2,23 @@

# sites to be scraped
SITES = {
"DOCTORS": "http://medicalboard.co.ke/online-services/retention/?currpage={}",
"FOREIGN_DOCTORS": "http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}",
"CLINICAL_OFFICERS": "http://clinicalofficerscouncil.org/online-services/retention/?currpage={}",
"TOKEN_URL": "http://api.kmhfl.health.go.ke/o/token/"
}
'DOCTORS': 'http://medicalboard.co.ke/online-services/retention/?currpage={}',
'FOREIGN_DOCTORS': 'http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}',
'CLINICAL_OFFICERS': 'http://clinicalofficerscouncil.org/online-services/retention/?currpage={}',
'TOKEN_URL': 'http://api.kmhfl.health.go.ke/o/token/'
}

AWS = {
"aws_access_key_id": os.getenv("MORPH_AWS_ACCESS_KEY"),
"aws_secret_access_key": os.getenv("MORPH_AWS_SECRET_KEY"),
"region_name": os.getenv("MORPH_AWS_REGION", "eu-west-1"),
"s3_bucket": os.getenv("MORPH_S3_BUCKET", None)
}

ES = {
"host": os.getenv("MORPH_ES_HOST", "127.0.0.1"),
"port": os.getenv("MORPH_ES_PORT", "9200"),
"index": "healthtools"
}
"region_name": 'eu-west-1',
# Doctors document endpoint
"cloudsearch_doctors_endpoint": "http://doc-cfa-healthtools-ke-doctors-m34xee6byjmzcgzmovevkjpffy.eu-west-1.cloudsearch.amazonaws.com/",
# Clinical document endpoint
"cloudsearch_cos_endpoint": "http://doc-cfa-healthtools-ke-cos-nhxtw3w5goufkzram4er7sciz4.eu-west-1.cloudsearch.amazonaws.com/",
# Health facilities endpoint
"cloudsearch_health_faciities_endpoint":"https://doc-health-facilities-ke-65ftd7ksxazyatw5fiv5uyaiqi.eu-west-1.cloudsearch.amazonaws.com",

SLACK = {
"url": os.getenv("MORPH_WEBHOOK_URL")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"

SMALL_BATCH = 5 # No of pages from clinical officers, doctors and foreign doctors sites, scraped in development mode
SMALL_BATCH_HF = 100 # No of records scraped from health-facilities sites in development mode

DATA_DIR = os.getcwd() + "/data/"
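
The `os.getenv` calls in this hunk supply a fallback value when a variable is unset. A minimal sketch of that pattern for the `ES` settings follows. The function name `load_es_config` and its `env` parameter are illustrative assumptions and not part of the repository, and the integer conversion of the port is an addition (the hunk keeps it as a string).

```python
import os


def load_es_config(env=None):
    """Build Elasticsearch settings from environment variables (hypothetical helper).

    `env` defaults to os.environ; passing a plain dict makes the
    fallback behaviour easy to check without touching the real environment.
    """
    env = os.environ if env is None else env
    return {
        # Fall back to a local Elasticsearch instance when the host is unset.
        "host": env.get("MORPH_ES_HOST", "127.0.0.1"),
        # Environment values are always strings, so convert the port explicitly.
        "port": int(env.get("MORPH_ES_PORT", "9200")),
        "index": "healthtools",
    }


if __name__ == "__main__":
    print(load_es_config())
```

Note that the README exports `MORPH_AWS_ACCESS_KEY_ID`, while the `AWS` dictionary in this hunk reads `MORPH_AWS_ACCESS_KEY`; as written, the exported value would not be picked up by that lookup.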
