mysql-export/scripts
pgloader
.gitignore
README.rst
filius_forma_extract.py
name_string_indices.py
name_strings.py
name_strings_words.py
process-open-tree-data.ipynb
settings.py
vernacular_string_indices.py
vernacular_strings.py


GnHarvester

Workflow to import data from MySQL to PostgreSQL:

  1. Export data from MySQL to TSV with the scripts in mysql-export/scripts.
  2. Run a Spark job to convert the TSV files to PostgreSQL format:
SPARK_HOME=/path/to/spark-dists/spark-src_1.6.2/ \
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip \
PYSPARK_PYTHON=/path/to/anaconda2/envs/snakes/bin/python \
    /path/to/spark-dists/spark-src_1.6.2/bin/spark-submit \
    --jars "/path/to/scala-2.10/gnparser-spark-python-assembly-0.3.3-SNAPSHOT.jar" \
    --driver-class-path="/path/to/scala-2.10/gnparser-spark-python-assembly-0.3.3-SNAPSHOT.jar" \
    --packages com.databricks:spark-csv_2.10:1.4.0 \
    --executor-memory 20G --driver-memory 20G \
    python_script.py
  3. Import the data into PostgreSQL, replacing <<TABLE>> with the table name:
time cat csvs/<<TABLE>>/part-* | psql development -h localhost -U postgres -c "COPY <<TABLE>> FROM STDIN"
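
The import step streams the Spark output files into ``COPY <<TABLE>> FROM STDIN``, which by default expects PostgreSQL's text format: tab-separated fields, ``\N`` for NULL, and backslash escapes for tabs, newlines, and backslashes. The sketch below shows one way such a line could be produced. The helper names ``copy_escape`` and ``copy_line`` are hypothetical and are not taken from this repository's scripts, which may handle encoding differently.

```python
# Minimal sketch of encoding rows for PostgreSQL's default COPY text format.
# The function names are illustrative assumptions, not part of this repository.


def copy_escape(value):
    """Encode one field for COPY ... FROM STDIN (text format)."""
    if value is None:
        return "\\N"  # COPY's default NULL marker
    # Escape the backslash first so the escapes added below are not doubled.
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def copy_line(fields):
    """Join encoded fields with tabs, COPY's default delimiter."""
    return "\t".join(copy_escape(f) for f in fields)


if __name__ == "__main__":
    # A row with an embedded tab and a NULL value.
    print(copy_line(["1", "Homo\tsapiens", None]))
```

Lines produced this way can be piped to ``psql`` exactly as in step 3, without passing extra options to ``COPY``.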