Revert "Merge pull request CodeForAfrica-SCRAPERS#16 from CodeForAfrica-SCRAPERS/develop"

This reverts commit b0cdc90, reversing
changes made to a7d25a4.
DavidLemayian committed Jun 29, 2017
1 parent b0cdc90 commit da31e1a
Showing 17 changed files with 278 additions and 537 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,2 +1 @@
*.pyc
/data/
37 changes: 6 additions & 31 deletions README.md
@@ -23,40 +23,15 @@ Change directory into package `$ cd healthtools_ke`
Install the dependencies by running `$ pip install -r requirements.txt`

You can set the required environment variables like so
```
$ export MORPH_AWS_REGION=<aws_region>
$ export MORPH_AWS_ACCESS_KEY_ID=<aws_access_key_id>
$ export MORPH_AWS_SECRET_KEY=<aws_secret_key>
```

$ export MORPH_AWS_REGION=<aws_region>
$ export MORPH_AWS_ACCESS_KEY_ID=<aws_access_key_id>
$ export MORPH_AWS_SECRET_KEY=<aws_secret_key>
$ export MORPH_S3_BUCKET=<s3_bucket_name> # If not set, data is archived locally in a folder called data inside the project's folder
$ export MORPH_ES_HOST=<elastic_search_host_endpoint> # Leave unset to use Elasticsearch locally on your machine
$ export MORPH_ES_PORT=<elastic_search_host_port> # Leave unset to use the default Elasticsearch port, 9200
$ export MORPH_WEBHOOK_URL=<slack_webhook_url> # Leave unset if you don't want error messages posted to Slack
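
As a rough illustration of how these optional variables fall back to defaults, here is a minimal sketch. The `load_settings` and `archives_locally` helpers are hypothetical names introduced for this example and are not part of the project; the default host, port and region mirror the values that appear in `healthtools/config.py` in this commit, and converting the port to an integer is a choice made only for this sketch.

```python
import os


def load_settings(environ=None):
    """Collect the MORPH_* settings, falling back to local defaults.

    Hypothetical helper: it only illustrates the fallback behaviour
    described in the README comments above.
    """
    env = os.environ if environ is None else environ
    return {
        "aws_region": env.get("MORPH_AWS_REGION", "eu-west-1"),
        # No bucket configured means data is archived in the local data folder.
        "s3_bucket": env.get("MORPH_S3_BUCKET"),
        # Unset host/port means a local Elasticsearch on the default port.
        "es_host": env.get("MORPH_ES_HOST", "127.0.0.1"),
        "es_port": int(env.get("MORPH_ES_PORT", "9200")),
        # No webhook URL means no error messages are posted to Slack.
        "slack_url": env.get("MORPH_WEBHOOK_URL"),
    }


def archives_locally(settings):
    """Return True when no S3 bucket is configured."""
    return settings["s3_bucket"] is None
```

Calling `load_settings({})` with an empty mapping shows the purely local setup: Elasticsearch on `127.0.0.1:9200`, no S3 bucket and no Slack webhook.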

**To use Elasticsearch locally on your machine, set it up as follows**

Linux and Windows users: follow the instructions at this [link](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)

macOS users: run `brew install elasticsearch` in your terminal

**If you want to post messages on Slack**

Set up `Incoming Webhooks` [here](https://slack.com/signin?redir=%2Fservices%2Fnew%2Fincoming-webhook) and set the `MORPH_WEBHOOK_URL` environment variable to the webhook URL
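
For orientation, posting to a Slack incoming webhook amounts to an HTTP POST of a JSON body with a `text` field. The sketch below uses only the standard library; `build_slack_request` and `post_error` are hypothetical names for this example and do not describe how the scrapers actually send their messages.

```python
import json
import urllib.request


def build_slack_request(webhook_url, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def post_error(webhook_url, message):
    """Send the message, or do nothing when no webhook URL is configured."""
    if not webhook_url:
        return False
    with urllib.request.urlopen(build_slack_request(webhook_url, message)) as resp:
        return resp.status == 200
```

With `MORPH_WEBHOOK_URL` unset, `post_error(None, "...")` returns `False` without touching the network, which matches the README's note that the variable is optional.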

If you set up Elasticsearch locally, start it with `$ elasticsearch`

You can now run the scrapers `$ python scraper.py` (It might take a while)

**FOR DEVELOPMENT PURPOSES**

Set **SMALL_BATCH** and **SMALL_BATCH_HF** (for health facilities) in the config file so that the scraper fetches only the number of pages (or, for health facilities, records) that you specify rather than entire sites.

Use `$ python scraper.py small_batch` to run the scrapers in this mode
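
One way such a development mode could cap the work is sketched below. This is an assumption about the mechanism, not the project's actual code: `pages_to_scrape` is a hypothetical helper, pages are assumed to be numbered from 1, and the value 5 mirrors `SMALL_BATCH` in `healthtools/config.py`.

```python
SMALL_BATCH = 5  # pages per site in development mode, as in config.py


def pages_to_scrape(total_pages, argv):
    """Return the page numbers to fetch, capped when `small_batch` is passed.

    `argv` is expected to look like sys.argv, e.g. ["scraper.py", "small_batch"].
    """
    small = len(argv) > 1 and argv[1] == "small_batch"
    last_page = min(total_pages, SMALL_BATCH) if small else total_pages
    # Assumption for this sketch: page numbering starts at 1.
    return list(range(1, last_page + 1))
```

For a site with 200 pages, `pages_to_scrape(200, ["scraper.py", "small_batch"])` yields only the first five page numbers, while omitting the argument yields all 200.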

You can now run the scrapers `$ python scraper.py` (It might take a while, and you might need to change the endpoints in config.py if you are not authorized to use them)

## Running the tests
_**If you use Elasticsearch locally, make sure it is running**_

Use nosetests to run tests (with stdout) like this:
```$ nosetests --nocapture```

17 changes: 0 additions & 17 deletions circle.yml

This file was deleted.

33 changes: 12 additions & 21 deletions healthtools/config.py
@@ -2,32 +2,23 @@

# sites to be scraped
SITES = {
"DOCTORS": "http://medicalboard.co.ke/online-services/retention/?currpage={}",
"FOREIGN_DOCTORS": "http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}",
"CLINICAL_OFFICERS": "http://clinicalofficerscouncil.org/online-services/retention/?currpage={}",
"TOKEN_URL": "http://api.kmhfl.health.go.ke/o/token/"
}
'DOCTORS': 'http://medicalboard.co.ke/online-services/retention/?currpage={}',
'FOREIGN_DOCTORS': 'http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}',
'CLINICAL_OFFICERS': 'http://clinicalofficerscouncil.org/online-services/retention/?currpage={}',
'TOKEN_URL': 'http://api.kmhfl.health.go.ke/o/token/'
}

AWS = {
"aws_access_key_id": os.getenv("MORPH_AWS_ACCESS_KEY"),
"aws_secret_access_key": os.getenv("MORPH_AWS_SECRET_KEY"),
"region_name": os.getenv("MORPH_AWS_REGION", "eu-west-1"),
"s3_bucket": os.getenv("MORPH_S3_BUCKET", None)
}

ES = {
"host": os.getenv("MORPH_ES_HOST", "127.0.0.1"),
"port": os.getenv("MORPH_ES_PORT", "9200"),
"index": "healthtools"
}
"region_name": 'eu-west-1',
# Doctors document endpoint
"cloudsearch_doctors_endpoint": "http://doc-cfa-healthtools-ke-doctors-m34xee6byjmzcgzmovevkjpffy.eu-west-1.cloudsearch.amazonaws.com/",
# Clinical document endpoint
"cloudsearch_cos_endpoint": "http://doc-cfa-healthtools-ke-cos-nhxtw3w5goufkzram4er7sciz4.eu-west-1.cloudsearch.amazonaws.com/",
# Health facilities endpoint
"cloudsearch_health_faciities_endpoint":"https://doc-health-facilities-ke-65ftd7ksxazyatw5fiv5uyaiqi.eu-west-1.cloudsearch.amazonaws.com",

SLACK = {
"url": os.getenv("MORPH_WEBHOOK_URL")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"

SMALL_BATCH = 5 # Number of pages scraped from the clinical officers, doctors and foreign doctors sites in development mode
SMALL_BATCH_HF = 100 # Number of records scraped from health-facilities sites in development mode

DATA_DIR = os.getcwd() + "/data/"
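
The `{}` at the end of each paginated `SITES` URL is a `str.format` placeholder for the page number. A minimal sketch of how page URLs could be built from it follows; `page_urls` is a hypothetical helper for illustration, and only the `DOCTORS` entry is copied from the config above.

```python
SITES = {
    "DOCTORS": "http://medicalboard.co.ke/online-services/retention/?currpage={}",
}


def page_urls(site, first_page, last_page):
    """Fill the currpage placeholder for every page in the inclusive range."""
    template = SITES[site]
    return [template.format(page) for page in range(first_page, last_page + 1)]
```

For example, `page_urls("DOCTORS", 1, 2)` produces the URLs ending in `?currpage=1` and `?currpage=2`.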
