This script pulls data from politikus.sinarproject.org and caches it in a networkx graph to enable offline processing. The cache can then be saved into a Neo4j database for further processing and visualization.
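The fetch-then-cache flow described above can be sketched roughly as follows. This is an illustration only and not the project's actual code: `crawl`, `fetch_page`, `save_cache`, `load_cache`, and the `"@id"` key are hypothetical names, and a plain dict stands in for the networkx graph so the sketch runs with the standard library alone.

```python
import pickle
import time


def crawl(fetch_page, start_url, interval=1.0, sleep=time.sleep):
    """Collect items page by page, pausing `interval` seconds between calls.

    `fetch_page(url)` is a hypothetical callable that returns
    (items, next_url), where next_url is None on the last page.
    """
    cache = {}
    url = start_url
    while url is not None:
        items, next_url = fetch_page(url)
        for item in items:
            # Assumption: every item carries its URI under the "@id" key.
            cache[item["@id"]] = item
        if next_url is not None:
            sleep(interval)  # throttle before the next API call
        url = next_url
    return cache


def save_cache(cache, path):
    """Write the cache to disk so later steps can run offline."""
    with open(path, "wb") as handle:
        pickle.dump(cache, handle)


def load_cache(path):
    """Read back a cache written by save_cache."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Passing the fetch function and the sleep function in as arguments keeps the loop testable without network access or real delays.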
The project depends on the following tools and Python packages in order to build and install properly.
- Python 3.6 and up
- Neo4j - while the development work targets Neo4j 4.1, earlier versions should work.
- Poetry - follow its official installation instructions.
- Python wheel - you can install via pip
pip3 install wheel
- In order to generate graphs, Python needs to be compiled with Tk support, which requires the tk-dev package on Ubuntu.
- Clone this project
git clone https://github.com/Sinar/popit_relationship
cd popit_relationship
- Build and install the project
poetry build
Install the built project with pip (the filename of the .whl file may vary). Please ensure your PATH is configured properly.
pip3 install ./dist/popit_relationship-0.1.0-py3-none-any.whl
If you are reinstalling after pulling the latest changes, add the --force-reinstall flag
pip3 install --force-reinstall ./dist/popit_relationship-0.1.0-py3-none-any.whl
Most of the configuration is stored in the .env file; please refer to .env.example for an example. Apart from NEO4J_AUTH and NEO4J_URI, the script should work with the default settings.
- NEO4J_AUTH stores the username and password pair separated by a forward slash (/), e.g. neo4j/s0meCompl!catedPassword
- NEO4J_URI stores the URI of the Neo4j database, e.g. bolt://hostname:7687
- ENDPOINT_API stores the API endpoint URI, which currently defaults to https://politikus.sinarproject.org/@search; the script should work with other similar APIs
- CRAWL_INTERVAL stores the time to wait between API calls (defaults to 1 second)
- CACHE_PATH stores the path to the cache file (defaults to ./primport-cache.gpickle)
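As a rough illustration of how these settings fit together, the sketch below reads them from environment variables and falls back to the documented defaults. It is not the project's own loader: `load_settings`, `DEFAULTS`, and the returned key names are hypothetical, and treating NEO4J_AUTH and NEO4J_URI as required is an assumption.

```python
import os

# Defaults as documented in the list above.
DEFAULTS = {
    "ENDPOINT_API": "https://politikus.sinarproject.org/@search",
    "CRAWL_INTERVAL": "1",
    "CACHE_PATH": "./primport-cache.gpickle",
}


def load_settings(environ=None):
    """Build a settings dict from environment variables (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # Assumption: NEO4J_AUTH and NEO4J_URI must be set, so a missing
    # value raises KeyError instead of falling back to a default.
    # Split on the first "/" only, so a password may itself contain "/".
    user, _, password = environ["NEO4J_AUTH"].partition("/")
    return {
        "neo4j_user": user,
        "neo4j_password": password,
        "neo4j_uri": environ["NEO4J_URI"],
        "endpoint_api": environ.get("ENDPOINT_API", DEFAULTS["ENDPOINT_API"]),
        "crawl_interval": float(
            environ.get("CRAWL_INTERVAL", DEFAULTS["CRAWL_INTERVAL"])
        ),
        "cache_path": environ.get("CACHE_PATH", DEFAULTS["CACHE_PATH"]),
    }
```

Because the function accepts any mapping, values set on the command line and values loaded from a .env file can be handled the same way.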
The configuration environment variables can be overridden while executing the script (please refer to the usage examples below).
After following the installation guide, if the Python environment is properly configured, a script named primport should be made available. Sub-commands can then be issued for different tasks.
Configuration options can be overridden with environment variables, e.g. when running primport in Bash
NEO4J_AUTH=neo4j/someOtherPassword primport reset db
- primport reset cache resets the cache file
- primport reset db clears the Neo4j database
- primport sync person fetches the Person API
- primport sync org fetches the Organization API
- primport sync post fetches the Post API
- primport sync membership fetches the Membership API
- primport sync relationship fetches the Relationship API
- primport sync ownership fetches the Ownership Control Statement API
- primport sync all fetches all of the above
- primport visualize $node1 [$node2 $node3 ...] generates a graph from the cache including $node1 ($node2, $node3 etc. are optional).
  - Each $node is a URI to an entity, for instance https://politikus.sinarproject.org/organizations/government-linked-companies/1mdb-real-estate-sdn-bhd
  - The maximum depth can be overridden by passing the --depth flag, e.g. --depth=1 (the value defaults to 3).
- primport save saves the cached data to the Neo4j database to allow further work.
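To give a feel for what a maximum depth means when visualizing from one or more starting nodes, here is a minimal breadth-first sketch. It assumes depth is counted in hops from the nearest starting node; how primport actually walks the cached graph is not documented here, and `nodes_within_depth` and the adjacency dict are hypothetical stand-ins.

```python
from collections import deque


def nodes_within_depth(adjacency, start_nodes, depth=3):
    """Return {node: hops} for every node within `depth` hops of a start node.

    `adjacency` maps a node URI to an iterable of neighbouring node URIs.
    """
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance
```

With a chain a, b, c, d, e and a depth of 1, starting from a would keep only a and b; starting from both a and d would also keep d and e.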
- The script can also be executed directly from the source tree as follows (just replace primport with poetry run python src/popit_relationship/primport.py)
git clone https://github.com/Sinar/popit_relationship
cd popit_relationship
poetry install
poetry run python src/popit_relationship/primport.py reset db
Tests are run with pytest
poetry run pytest