Moving to Elastic #16

Merged: 74 commits, merged Jun 29, 2017

Commits
7c059c0
Add support for elastic search cloud service
Gathondu Jun 14, 2017
82d7e49
Add support for health facilities to elastic search
Gathondu Jun 15, 2017
e59d6ba
Convert to save to local elastic search and not cloud service
Gathondu Jun 15, 2017
05bbd87
Add CircleCi integration
Gathondu Jun 15, 2017
8329b4f
Use elastic cloud search host
Gathondu Jun 16, 2017
4e703be
Index to elasticsearch AWS Service
Gathondu Jun 19, 2017
4c4e2f0
Shift data indexing to Elasticsearch from cloudsearch (#1)
Gathondu Jun 20, 2017
be4db2d
Add slack integration
Gathondu Jun 20, 2017
ddfd6f8
Refractor slack integration to use webhook's url and not test tokens
Gathondu Jun 20, 2017
c762f18
Add slack error notification to health facilities scrapper
Gathondu Jun 20, 2017
49b591a
Merge pull request #9 from Gathondu/feature-elastic-search
DavidLemayian Jun 20, 2017
c0b8361
Merge branch 'develop' of https://github.com/CodeForAfrica-SCRAPERS/h…
Gathondu Jun 21, 2017
15f9499
Merge branch 'develop' into failing-scraper-slack-notification
Gathondu Jun 21, 2017
792af44
Refactor code for sending slack messages to a method
Gathondu Jun 21, 2017
9ff0a90
Merge branch 'failing-scraper-slack-notification' of github.com:Gatho…
Gathondu Jun 21, 2017
698975c
Refactor error printing method
Gathondu Jun 22, 2017
2212a2e
Merge pull request #10 from Gathondu/failing-scraper-slack-notification
DavidLemayian Jun 22, 2017
9a26c33
Merge branch 'master' into develop
DavidLemayian Jun 22, 2017
7193987
Change order of logs.
DavidLemayian Jun 22, 2017
ade3c0b
Add timestamp whilst printing error messages on terminal
Gathondu Jun 22, 2017
d784d39
Refactor code to allow to use elastic search locally
Gathondu Jun 22, 2017
f9a84ae
Update circle ci file
Gathondu Jun 22, 2017
510b0cd
Merge pull request #11 from Gathondu/failing-scraper-slack-notification
DavidLemayian Jun 22, 2017
7f9bc3e
Small changes.
DavidLemayian Jun 22, 2017
a1e5878
Merge remote-tracking branch 'origin/develop' into develop
DavidLemayian Jun 22, 2017
7fec6da
Merge branch 'develop' of https://github.com/CodeForAfrica-SCRAPERS/h…
Gathondu Jun 23, 2017
fb16d0b
Merge branch 'master' into develop
Gathondu Jun 23, 2017
b963494
Merge pull request #2 from Gathondu/develop
Gathondu Jun 23, 2017
563edb5
Add support for elastic search cloud service
Gathondu Jun 14, 2017
1a85699
Add support for health facilities to elastic search
Gathondu Jun 15, 2017
d4f0af0
Add CircleCi integration
Gathondu Jun 15, 2017
5fcf4a8
Shift data indexing to Elasticsearch from cloudsearch (#1)
Gathondu Jun 20, 2017
90eb34c
Add slack integration
Gathondu Jun 20, 2017
828ee8a
Refractor slack integration to use webhook's url and not test tokens
Gathondu Jun 20, 2017
7deb780
Change order of logs.
DavidLemayian Jun 22, 2017
37a4447
Merge pull request #3 from Gathondu/develop
Gathondu Jun 23, 2017
fbdf663
Merge pull request #12 from Gathondu/master
DavidLemayian Jun 23, 2017
9bc1602
Make space for the work and the plans.
DavidLemayian Jun 23, 2017
8184e9c
Code style.
DavidLemayian Jun 23, 2017
52cffdd
A few changes once more.
DavidLemayian Jun 23, 2017
0814c58
Refactor code for determining elastic search host
Gathondu Jun 23, 2017
c3ce7b7
Add functionality to allow limiting of pages to be scraped
Gathondu Jun 24, 2017
31898c9
Merge with develop branch from Code of Africa
Gathondu Jun 27, 2017
4ab9b74
Allow to save locally if s3 bucket not set
Gathondu Jun 27, 2017
cc9e066
pep8 fixes
Gathondu Jun 27, 2017
f8bc619
Merge pull request #13 from Gathondu/develop
DavidLemayian Jun 27, 2017
d9a2e61
updates to allow to work with morph.io
Gathondu Jun 28, 2017
b762433
refactor posting messages on slack
Gathondu Jun 28, 2017
43767e9
updates to allow to work with morph.io
Gathondu Jun 28, 2017
7b43ea9
refactor posting messages on slack
Gathondu Jun 28, 2017
7ad9879
Merge branch 'develop' of github.com:Gathondu/healthtools_ke into dev…
Gathondu Jun 28, 2017
eb9fd54
Merge pull request #4 from Gathondu/develop
Gathondu Jun 28, 2017
6d48981
Refactor slack posting code
Gathondu Jun 28, 2017
c2c9333
Merge pull request #5 from Gathondu/develop
Gathondu Jun 28, 2017
567405f
add machine name while outputin error
Gathondu Jun 28, 2017
02f770d
Refactor slack posting code
Gathondu Jun 28, 2017
1b757c0
add machine name while outputing error
Gathondu Jun 28, 2017
a00ddca
Merge branch 'master' of github.com:Gathondu/healthtools_ke
Gathondu Jun 28, 2017
a401d6a
add machine name while outputing error
Gathondu Jun 28, 2017
9d3908c
Merge branch 'master' of github.com:Gathondu/healthtools_ke
Gathondu Jun 28, 2017
57241bc
Merge branch 'master' of github.com:Gathondu/healthtools_ke
Gathondu Jun 28, 2017
112ab6b
Merge branch 'master' of github.com:Gathondu/healthtools_ke
Gathondu Jun 28, 2017
c53579a
Merge pull request #14 from Gathondu/master
DavidLemayian Jun 29, 2017
ea139f1
updates to allow to work with morph.io
Gathondu Jun 28, 2017
da080ea
refactor posting messages on slack
Gathondu Jun 28, 2017
2a6efbc
rebase develop
Gathondu Jun 29, 2017
b6109eb
add machine name while outputin error
Gathondu Jun 28, 2017
f4f8ca7
rebase
Gathondu Jun 29, 2017
b3b8fbb
add machine name while outputing error
Gathondu Jun 28, 2017
1e90914
merge
Gathondu Jun 29, 2017
c8840b6
add machine name while outputing error
Gathondu Jun 28, 2017
4963f78
rebase from upstream
Gathondu Jun 29, 2017
7f01733
update README
Gathondu Jun 29, 2017
d37c03b
Merge pull request #15 from Gathondu/develop
DavidLemayian Jun 29, 2017
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
*.pyc
/data/
37 changes: 31 additions & 6 deletions README.md
@@ -23,15 +23,40 @@ Change directory into package `$ cd healthtools_ke`
Install the dependencies by running `$ pip install -r requirements.txt`

You can set the required environment variables like so
```
$ export MORPH_AWS_REGION=<aws_region>
$ export MORPH_AWS_ACCESS_KEY_ID= <aws_access_key_id>
$ export MORPH_AWS_SECRET_KEY= <aws_secret_key>
```

You can now run the scrapers `$ python scraper.py` (It might take a while and you might need to change the endpoints in config.py if you haven't authorization for them)
```
$ export MORPH_AWS_REGION=<aws_region>
$ export MORPH_AWS_ACCESS_KEY_ID=<aws_access_key_id>
$ export MORPH_AWS_SECRET_KEY=<aws_secret_key>
$ export MORPH_S3_BUCKET=<s3_bucket_name> # if not set, data is archived locally in the project's data/ folder
$ export MORPH_ES_HOST=<elastic_search_host_endpoint> # leave unset to use Elasticsearch locally on your machine
$ export MORPH_ES_PORT=<elastic_search_host_port> # leave unset to use Elasticsearch's default port, 9200
$ export MORPH_WEBHOOK_URL=<slack_webhook_url> # leave unset if you don't want error messages posted to Slack
```
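
The `MORPH_S3_BUCKET` fallback described above amounts to logic like the following sketch (illustrative only, assuming `boto3` for the S3 path; not the project's archiving code):

```
import json
import os

from healthtools.config import AWS, DATA_DIR

def archive(filename, records):
    # Upload to S3 when a bucket is configured, otherwise keep a local copy.
    if AWS.get("s3_bucket"):
        import boto3  # assumed dependency for the S3 path
        s3 = boto3.client("s3", region_name=AWS["region_name"],
                          aws_access_key_id=AWS["aws_access_key_id"],
                          aws_secret_access_key=AWS["aws_secret_access_key"])
        s3.put_object(Bucket=AWS["s3_bucket"], Key=filename,
                      Body=json.dumps(records))
    else:
        if not os.path.exists(DATA_DIR):
            os.makedirs(DATA_DIR)
        with open(os.path.join(DATA_DIR, filename), "w") as f:
            json.dump(records, f)
```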

**If you want to use Elasticsearch locally on your machine, use the following instructions to set it up**

For Linux and Windows users, follow the instructions in this [link](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)

For macOS users, run `brew install elasticsearch` in your terminal
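
To confirm that a local Elasticsearch node is actually up before running the scrapers, a quick check along these lines can help (a minimal sketch, assuming the `requests` package is installed; the defaults mirror the fallbacks in `healthtools/config.py`):

```
import os
import requests

# Hit the Elasticsearch root endpoint and print the cluster info.
host = os.getenv("MORPH_ES_HOST", "127.0.0.1")
port = os.getenv("MORPH_ES_PORT", "9200")

response = requests.get("http://{}:{}/".format(host, port))
print(response.json())
```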

**If you want to post messages on Slack**

Set up `Incoming Webhooks` [here](https://slack.com/signin?redir=%2Fservices%2Fnew%2Fincoming-webhook) and set the `MORPH_WEBHOOK_URL` environment variable to the webhook URL
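
Posting to an incoming webhook is a single HTTP request. The helper below is a hypothetical sketch (assuming `requests`), not the project's actual Slack code:

```
import json
import os
import requests

def notify_slack(message):
    """Post a plain-text message to the configured Slack incoming webhook."""
    webhook_url = os.getenv("MORPH_WEBHOOK_URL")
    if not webhook_url:
        return  # Slack notifications are optional
    requests.post(webhook_url,
                  data=json.dumps({"text": message}),
                  headers={"Content-Type": "application/json"})

notify_slack("Doctors scraper failed: could not reach medicalboard.co.ke")
```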

If you set up Elasticsearch locally, start it with `$ elasticsearch`

You can now run the scrapers with `$ python scraper.py` (it might take a while)

**FOR DEVELOPMENT PURPOSES**

Set **SMALL_BATCH** and **SMALL_BATCH_HF** (for health facilities) in the config file; they limit the scrapers to that number of pages (or, for health facilities, records) instead of scraping entire sites.

Run `$ python scraper.py small_batch` to use them
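
The sketch below illustrates how the `small_batch` argument might cap the number of pages scraped; it is not the project's `scraper.py`:

```
import sys

from healthtools.config import SMALL_BATCH  # e.g. 5 pages in development mode

# Illustrative only: cap the number of pages scraped when the script is
# invoked as `python scraper.py small_batch`.
TOTAL_PAGES = 100  # hypothetical full page count for a site

small_batch = len(sys.argv) > 1 and sys.argv[1] == "small_batch"
num_pages = SMALL_BATCH if small_batch else TOTAL_PAGES

for page in range(1, num_pages + 1):
    print("scraping page {}".format(page))  # fetch, parse and index here
```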


## Running the tests
_**If you use Elasticsearch locally, make sure it is running.**_

Use nosetests to run tests (with stdout) like this:
```$ nosetests --nocapture```
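
As a rough illustration of a nose-discoverable test (hypothetical, not part of the project's test suite), the following checks that the Elasticsearch node the tests depend on is reachable:

```
import requests

from healthtools.config import ES

def test_elasticsearch_is_reachable():
    # nose picks up any function named test_*; this one fails loudly if the
    # local (or configured) Elasticsearch node is not running.
    url = "http://{}:{}/".format(ES["host"], ES["port"])
    assert requests.get(url).status_code == 200
```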

17 changes: 17 additions & 0 deletions circle.yml
@@ -0,0 +1,17 @@
machine:
  python:
    version: 2.7.5
  java:
    version: openjdk8
dependencies:
  pre:
    - pip install -r requirements.txt
  post:
    - wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.4.2.tar.gz
    - tar -xzf elasticsearch-5.4.2.tar.gz
    - elasticsearch-5.4.2/bin/elasticsearch: {background: true}
    - sleep 10 && wget --waitretry=5 --retry-connrefused -v http://127.0.0.1:9200/

test:
  override:
    - nosetests --nocapture
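
The `sleep 10 && wget --waitretry ...` step only waits for the background Elasticsearch process to start accepting connections before the tests run; a rough Python equivalent of that wait (illustrative only) is:

```
import time
import requests

def wait_for_elasticsearch(url="http://127.0.0.1:9200/", retries=30, delay=2):
    # Poll the node until it answers or the retries are exhausted.
    for _ in range(retries):
        try:
            if requests.get(url).status_code == 200:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(delay)
    return False
```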
33 changes: 21 additions & 12 deletions healthtools/config.py
@@ -2,23 +2,32 @@

# sites to be scraped
SITES = {
    'DOCTORS': 'http://medicalboard.co.ke/online-services/retention/?currpage={}',
    'FOREIGN_DOCTORS': 'http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}',
    'CLINICAL_OFFICERS': 'http://clinicalofficerscouncil.org/online-services/retention/?currpage={}',
    'TOKEN_URL': 'http://api.kmhfl.health.go.ke/o/token/'
}
    "DOCTORS": "http://medicalboard.co.ke/online-services/retention/?currpage={}",
    "FOREIGN_DOCTORS": "http://medicalboard.co.ke/online-services/foreign-doctors-license-register/?currpage={}",
    "CLINICAL_OFFICERS": "http://clinicalofficerscouncil.org/online-services/retention/?currpage={}",
    "TOKEN_URL": "http://api.kmhfl.health.go.ke/o/token/"
}

AWS = {
    "aws_access_key_id": os.getenv("MORPH_AWS_ACCESS_KEY"),
    "aws_secret_access_key": os.getenv("MORPH_AWS_SECRET_KEY"),
    "region_name": 'eu-west-1',
    # Doctors document endpoint
    "cloudsearch_doctors_endpoint": "http://doc-cfa-healthtools-ke-doctors-m34xee6byjmzcgzmovevkjpffy.eu-west-1.cloudsearch.amazonaws.com/",
    # Clinical document endpoint
    "cloudsearch_cos_endpoint": "http://doc-cfa-healthtools-ke-cos-nhxtw3w5goufkzram4er7sciz4.eu-west-1.cloudsearch.amazonaws.com/",
    # Health facilities endpoint
    "cloudsearch_health_faciities_endpoint": "https://doc-health-facilities-ke-65ftd7ksxazyatw5fiv5uyaiqi.eu-west-1.cloudsearch.amazonaws.com",
    "region_name": os.getenv("MORPH_AWS_REGION", "eu-west-1"),
    "s3_bucket": os.getenv("MORPH_S3_BUCKET", None)
}

ES = {
    "host": os.getenv("MORPH_ES_HOST", "127.0.0.1"),
    "port": os.getenv("MORPH_ES_PORT", "9200"),
    "index": "healthtools"
}

SLACK = {
    "url": os.getenv("MORPH_WEBHOOK_URL")
}

TEST_DIR = os.getcwd() + "/healthtools/tests"

SMALL_BATCH = 5  # Number of pages scraped from the clinical officers, doctors and foreign doctors sites in development mode
SMALL_BATCH_HF = 100  # Number of records scraped from the health facilities sites in development mode

DATA_DIR = os.getcwd() + "/data/"
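
To show how the `ES` settings above might be consumed, here is a minimal sketch assuming the `elasticsearch-py` client (not the project's actual indexing code):

```
from elasticsearch import Elasticsearch

from healthtools.config import ES

# Connect to the configured node and index a single (made-up) document.
es = Elasticsearch([{"host": ES["host"], "port": int(ES["port"])}])

es.index(index=ES["index"], doc_type="doctors", id=1,
         body={"name": "Jane Doe", "reg_no": "A1234"})
print(es.get(index=ES["index"], doc_type="doctors", id=1))
```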