Merge branch 're-master' into 'master'
it works

See merge request lbsn/lbsntransform!8
Sieboldianus committed Mar 8, 2021
2 parents 3ff4150 + 2f9afae commit 5f9ec06
Showing 9 changed files with 206 additions and 48 deletions.
4 changes: 2 additions & 2 deletions .gitlab-ci.yml
@@ -106,7 +106,7 @@ docker-image-branch:
before_script:
- docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY
script:
- docker build --pull --tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG" .
- docker build --pull --tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG" -f docker/Dockerfile .
- docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG"
rules:
- if: $CI_COMMIT_BRANCH != "master"
@@ -119,7 +119,7 @@ docker-image:
before_script:
- docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY
script:
- docker build --pull --tag "$CI_REGISTRY_IMAGE" .
- docker build --pull --tag "$CI_REGISTRY_IMAGE" -f docker/Dockerfile .
- docker push "$CI_REGISTRY_IMAGE"
rules:
- if: $CI_COMMIT_BRANCH == "master"
34 changes: 0 additions & 34 deletions Dockerfile

This file was deleted.

25 changes: 25 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,25 @@
FROM python:slim

COPY lbsntransform/ ./lbsntransform/
COPY resources/ ./resources/
COPY setup.py README.md ./

RUN set -ex; \
\
apt-get update; \
apt-get install -y --no-install-recommends \
libpq-dev \
build-essential \
; \
pip install --upgrade pip; \
pip install psycopg2-binary; \
pip install --ignore-installed --editable . \
; \
apt-get purge -y \
build-essential \
; \
apt-get autoremove -y \
; \
rm -rf /var/lib/apt/lists/*;

ENTRYPOINT ["lbsntransform"]
48 changes: 38 additions & 10 deletions docs/argparse/examples.md
@@ -1,4 +1,6 @@
lbsntransform has a Command Line Interface (CLI) that can be used to convert many input formats to common lbsnstructure, including to its privacy-aware hll implementation.
lbsntransform has a Command Line Interface (CLI) that can be used to convert many
input formats to common lbsnstructure, including to its privacy-aware hll implementation.


!!! Note
Substitute bash linebreak character `\` in examples below with `^` if you're on Windows command line
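
The substitution described in the note can also be scripted. The following is a minimal sketch in Python, assuming every continued line ends with a single backslash; `to_windows_continuations` is a hypothetical helper for illustration and not part of lbsntransform:

```python
def to_windows_continuations(command: str) -> str:
    """Replace trailing bash line continuations with cmd.exe carets (^)."""
    converted = []
    for line in command.splitlines():
        stripped = line.rstrip()
        if stripped.endswith("\\"):
            # Swap only the final backslash; backslashes inside paths are kept.
            stripped = stripped[:-1] + "^"
        converted.append(stripped)
    return "\n".join(converted)


# Hypothetical two-line command, using flags that appear in the examples below.
example = "lbsntransform --origin 21 \\\n    --file_input"
print(to_windows_continuations(example))
```

Only the last character of each line is touched, so Windows paths that contain backslashes elsewhere in the line are left unchanged.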
@@ -32,13 +34,22 @@ If your files are spread across subdirectories in (e.g.) `.01_Input/`, add `--re

A specific mapping is provided for the [YFCC100m Dataset](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).

The YFCC100m Dataset consists of multiple files, with the core dataset of 100 Million Flickr photo metadata records (yfcc100m_dataset.csv) and several "expansion sets".

The YFCC100m Dataset consists of multiple files, with the core dataset of 100 Million
Flickr photo metadata records (yfcc100m_dataset.csv) and several "expansion sets".


The only expansion-set that is available for mapping is places-expansion (yfcc100m_places.csv).


Both photo metadata and places metadata can be processed in parallel by using `--zip_records`.

Before executing the following, make sure you've started the [lbsn-raw database docker](https://gitlab.vgiscience.de/lbsn/databases/rawdb). This includes the postgres implementation of the common lbsn structure format. You can run the docker db container on any host, but we suggest testing your setup locally - in this case, `127.0.0.1` refers to _localhost_ and port `15432` (the default for lbsn-raw).

Before executing the following, make sure you've started the [lbsn-raw database docker](https://gitlab.vgiscience.de/lbsn/databases/rawdb).
This includes the postgres implementation of the common lbsn structure format. You
can run the docker db container on any host, but we suggest testing your setup locally
- in this case, `127.0.0.1` refers to _localhost_ and port `15432` (the default for
lbsn-raw).


```bash
@@ -80,25 +91,42 @@ If you have stored the Flickr-dataset locally, simply replace the urls with:

# Privacy-aware output (HyperLogLog)

We've developed a privacy-aware implementation of lbsn-raw format, based on the probabilistic data structure HyperLogLog and the postgres implementation from [Citus](https://github.com/citusdata/postgresql-hll).
We've developed a privacy-aware implementation of lbsn-raw format, based on
the probabilistic data structure HyperLogLog and the postgres implementation from
[Citus](https://github.com/citusdata/postgresql-hll).
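
To illustrate why HyperLogLog suits privacy-aware storage, here is a toy sketch in Python. It is not the Citus implementation and does not reproduce its hash function or storage format; the class name `TinyHLL` and the use of SHA-256 are assumptions made for this example. It shows the core idea: only small per-register maxima are kept, never the original values, and two sets can be merged by taking register-wise maxima.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog: estimates distinct counts without storing the items."""

    def __init__(self, precision: int = 10) -> None:
        self.p = precision
        self.m = 1 << precision  # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Use the first 64 bits of a SHA-256 digest as the hash value.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other: "TinyHLL") -> "TinyHLL":
        if self.p != other.p:
            raise ValueError("both sets must use the same precision")
        merged = TinyHLL(self.p)
        # Merging is a register-wise maximum; no raw values are needed.
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = TinyHLL()
for i in range(5000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate close to 5000, not an exact count
```

With the default precision of 10 (1024 registers), the expected relative error is roughly 3%.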

Two preparation steps are necessary:

* Prepare a postgres database with the HLL version of lbsnstructure. You can use the [lbsn-hll database docker](https://gitlab.vgiscience.de/lbsn/databases/hlldb)
* Prepare a read-only (empty) database with Citus HyperLogLog extension installed. You can use the [hll importer docker](https://gitlab.vgiscience.de/lbsn/tools/importer)
* Prepare a postgres database with the HLL version of lbsnstructure. You can use
the [lbsn-hll database docker](https://gitlab.vgiscience.de/lbsn/databases/hlldb)

* Prepare a read-only (empty) database with Citus HyperLogLog extension installed.
You can use the [hll importer docker](https://gitlab.vgiscience.de/lbsn/tools/importer)


We've designed this rather complex setup to separate concerns:
- the importer db (called `hllworkerdb` in the command below) will be used by lbsntransform to calculate hll `shards` from raw data - it will not store any data, nor will it get any additional (privacy-relevant) information. Shards are calculated in-memory and returned. The importer is prepared with global hll-settings that must not change during the whole lifetime of the final output.

For example, as a means of additional security, before creating shards, distinct values can be one-way-hashed. This hashing can be improved using a `salt` that is only known to **importer**.
- the importer db (called `hllworkerdb` in the command below) will be used by lbsntransform
to calculate hll `shards` from raw data
- it will not store any data, nor will it get any additional (privacy-relevant) information.
- Shards are calculated in-memory and returned. The importer is prepared with global
hll-settings that must not change during the whole lifetime of the final output.

Finally, as a result, output hll db will not retrieve any privacy-relevant data because this is removed before transmission.
For example, as a means of additional security, before creating shards, distinct
values can be one-way-hashed. This hashing can be improved using a `salt` that is
only known to **importer**.

Finally, as a result, output hll db will not retrieve any privacy-relevant data because
this is removed before transmission.
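
The salted one-way hashing mentioned above can be sketched as follows. This is an illustration only: the text here does not specify how the importer hashes values, so HMAC-SHA256 is an assumption, and `hash_with_salt` and the sample values are hypothetical.

```python
import hashlib
import hmac


def hash_with_salt(value: str, salt: str) -> str:
    """One-way hash of `value`, keyed with a secret `salt` (HMAC-SHA256)."""
    return hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical values for illustration only.
print(hash_with_salt("user-123", salt="secret-known-only-to-importer"))
```

The same value always maps to the same digest for a given salt, so distinct counting still works, while someone without the salt cannot recompute digests from guessed input values.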

!!! Note
Depending on chosen `bases` and the type of input data, data may still contain privacy sensitive references. Have a look at the [lbsn-docs](https://lbsn.vgiscience.org) for further information.
Depending on chosen `bases` and the type of input data, data may still contain
privacy sensitive references. Have a look at the [lbsn-docs](https://lbsn.vgiscience.org)
for further information.

To convert YFCC100m photo metadata and places and transfer to a local hll-db, use:


```bash
lbsntransform --origin 21 \
--file_input \
48 changes: 48 additions & 0 deletions docs/install-alternatives.md
@@ -0,0 +1,48 @@
# Windows

There are many ways to install Python tools; on Windows, this can become particularly frustrating.

1. For most Windows users, the recommended way is to install lbsntransform with [conda package manager](/install-conda/)
2. You can also use the [Docker](/install-docker/) approach in Windows, e.g. in combination with Windows Subsystem for Linux (WSL)
3. If you _need_ to install with pip in Windows, a possible approach is to install all dependencies first (use [Gohlke wheels]
if necessary) and then install lbsntransform with

```bash
pip install lbsntransform --no-deps
```

# Linux

!!! note
Use of pip can be problematic, even on Linux. Some sub-dependencies outside python [cannot
be managed by pip][1], such as `libpq-dev`, which is required by [psycopg2].
Use [conda](/install-conda/) if you're new to python package managers. You've been warned.

For most Linux users, it is recommended to first create some type of virtual environment,
and then install lbsntransform in the virtual env, e.g.:

```bash
apt-get install -y libpq-dev # required for psycopg2
apt-get install python3-venv # required for virtual env
python3 -m venv lbsntransform_env
source ./lbsntransform_env/bin/activate
pip install lbsntransform
```
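
After these steps, it can help to confirm that the interpreter in use really belongs to the virtual environment and that the package is visible to it. A minimal sketch using only the Python standard library (Python 3.8 or later is assumed; the helper names are hypothetical):

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtual_env() -> bool:
    """Return True if the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("virtual env active:", in_virtual_env())
print("lbsntransform version:", installed_version("lbsntransform"))
```

Run this with the `python` from the activated environment; if the second line prints `None`, pip installed the package into a different interpreter.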

You can also directly install the latest release of lbsntransform with pip:

```bash
pip install lbsntransform
```

Or clone the repository and install lbsntransform directly:

```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
python setup.py install
```

[1]: https://stackoverflow.com/q/27734053/4556479#comment43880476_27734053
[psycopg2]: https://www.psycopg.org/install/
[Gohlke wheels]: https://www.lfd.uci.edu/~gohlke/pythonlibs/
48 changes: 48 additions & 0 deletions docs/install-conda.md
@@ -0,0 +1,48 @@
# Installation with conda

This is the recommended way for all systems.

This approach is independent of the OS used.

If you have conda package manager, you can install lbsntransform dependencies
with the `environment.yml` that is available in the lbsntransform repository:

```yaml
{!../environment.yml!}
```

1. Create a conda env using `environment.yml`

```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
# not necessary, but recommended:
conda config --env --set channel_priority strict
conda env create -f environment.yml
```

2. Install lbsntransform without dependencies

```bash
conda activate lbsntransform
```

Either use the release version from PyPI. This creates a static installation that must
be upgraded manually when new package versions appear.

```bash
pip install lbsntransform --no-deps --upgrade
```

Or use pip `--editable`, linking the lbsntransform folder:

```bash
pip install --no-deps --editable .
```

This is the recommended way if you want to edit files, or use the latest commits from the repository.

The `lbsntransform` package will be directly linked to the folder.

!!! note "Why isn't the package available on conda-forge?"
This is planned to happen in one of the next versions.
40 changes: 40 additions & 0 deletions docs/install-docker.md
@@ -0,0 +1,40 @@
# Docker

This is the recommended approach:

- If you have Docker
- and if you do not want to develop or make changes to lbsntransform.

You can use the latest Docker image to run lbsntransform directly in a container:

First, pull the image.
```bash
docker pull registry.gitlab.vgiscience.org/lbsn/lbsntransform:latest
docker tag registry.gitlab.vgiscience.org/lbsn/lbsntransform:latest lbsntransform
```

Then run it.
```bash
docker run \
--rm \
lbsntransform \
--version
```

!!! note
Replace `--version` with the CLI commands for your use case.
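
If the container is called from a script, the same invocation can be assembled programmatically. The sketch below assumes the image was tagged `lbsntransform` as shown above; `docker_run_argv` is a hypothetical helper, not part of lbsntransform.

```python
import shlex
from typing import List, Sequence


def docker_run_argv(cli_args: Sequence[str], image: str = "lbsntransform") -> List[str]:
    """Build the argument list for `docker run --rm <image> <cli_args...>`."""
    return ["docker", "run", "--rm", image, *cli_args]


argv = docker_run_argv(["--version"])
print(shlex.join(argv))
# To actually run it (requires Docker and the tagged image):
# import subprocess; subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems when CLI values contain spaces.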

Or, use the Dockerfile in `docker/Dockerfile` to build the image yourself.

```dockerfile
{!../docker/Dockerfile!}
```

Example:
```bash
docker build -t lbsntransform -f docker/Dockerfile .
docker run \
--rm \
lbsntransform \
--version
```
2 changes: 1 addition & 1 deletion docs/quick-guide.md
@@ -86,7 +86,7 @@ pip install lbsntransform
```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
pip install .
python setup.py install
```

[1]: https://stackoverflow.com/q/27734053/4556479#comment43880476_27734053
5 changes: 4 additions & 1 deletion mkdocs.yml
@@ -27,7 +27,10 @@ markdown_extensions:
nav:
- Introduction: index.md
- User Guide:
- Quick Installation: quick-guide.md
- Quick Installation:
- Conda: install-conda.md
- Docker: install-docker.md
- "Alternatives: Windows and Linux": install-alternatives.md
- Use Cases: use-cases.md
- Input Types: input-types.md
- Mappings: