
Commit

docs: fix links and rephrase sections
Sieboldianus committed Dec 14, 2020
1 parent baa6976 commit 76875bf
Showing 6 changed files with 156 additions and 97 deletions.
59 changes: 46 additions & 13 deletions docs/input-mappings.md
@@ -6,37 +6,45 @@
is available from the Python version of the Proto Buf Spec.
Mappings are loaded dynamically. You can provide a path to a folder
containing mappings with the flag `--mappings_path ./subfolder`.

To use the provided example mappings (Twitter or YFCC100M), clone the
repository and use:
```bash
lbsntransform --mappings_path ./resources/mappings/
```

If no path is provided, `lbsn raw` is assumed as input, for which
the file mapping is available in [lbsntransform/input/field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html),
including lbsn db query syntax defined in [lbsntransform/input/db_query.py](/api/input/mappings/db_query.html).
the file mapping is available in [lbsntransform/input/field_mapping_lbsn.py](lbsntransform/docs/api/input/mappings/field_mapping_lbsn.html),
including lbsn db query syntax defined in [lbsntransform/input/db_query.py](lbsntransform/docs/api/input/mappings/db_query.html).

Predefined mappings exist for the [Flickr YFCC100M dataset](https://lbsn.vgiscience.org/yfcc-introduction/) (CSV) and Twitter (JSON).

Have a look at the two mappings in the [resources folder](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources).

Predefined mappings exist for Flickr (CSV/JSON) and Twitter (JSON)
in the [resources folder](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources).
If the git repository is cloned to a local folder, use
`--mappings_path ./resources/mappings/` to load Flickr or Twitter mappings.

Input mappings must have some specific attributes to be recognized.

Primarily, a class constant "MAPPING_ID" is used to load mappings,
e.g. the [field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html)
e.g. the [field_mapping_lbsn.py](lbsntransform/docs/api/input/mappings/field_mapping_lbsn.html)
has the following module-level constant:
```py
MAPPING_ID = 0
```

**Examples:**

To load data with the default mapping, use `lbsntransform --origin 0`.
To load data with the default mapping (MAPPING_ID `0`), use `lbsntransform --origin 0`.

To load data from Twitter json, use use
To load data from Twitter json, use
```bash
lbsntransform --origin 3 \
--mappings_path ./resources/mappings/ \
--file_input \
--file_type "json"
```

To load data from Flickr YFCC100M, use use
To load data from Flickr YFCC100M, use

```bash
lbsntransform --origin 21 \
@@ -48,7 +56,7 @@
lbsntransform --origin 21 \

# Custom Input Mappings

Start with any of the predefined mappings, either from [field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html),
Start with any of the predefined mappings, either from [field_mapping_lbsn.py](lbsntransform/docs/api/input/mappings/field_mapping_lbsn.html),
or [field_mapping_twitter.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources/field_mapping_twitter.py) (JSON) and
[field_mapping_yfcc100m.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources/field_mapping_yfcc100m.py) (CSV).

@@ -95,16 +103,41 @@
class importer():
        record A single row from CSV, stored as list type.
        """
        # extract/convert all lbsn records
        lbsn_records = self.extract_post(record)
        lbsn_records = []
        lbsn_record = self.extract_post(record)
        lbsn_records.append(lbsn_record)
        return lbsn_records

    # def parse_json_record(self, record, record_type: Optional[str] = None):
    #     """Entry point for processing JSON data:
    #     Attributes:
    #         record A single record, stored as dictionary type.
    #     """
    #     # extract lbsn objects
    #     return lbsn_records

    def extract_post(self, record):
        post_record = HF.new_lbsn_record_with_id(
            lbsn.Post(), post_guid, self.origin)
        return post_record
```
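
The json entry point is only stubbed out above. Here is a minimal sketch of what an implemented `parse_json_record()` could look like, mirroring the csv variant. Note that the `record.get("id")` lookup is an assumption about the input json, not part of the actual lbsn spec:

```py
# Hypothetical sketch only: mirrors parse_csv_record() above.
# The "id" key used to derive the post guid is an assumption
# about the input json, not a fixed schema.
def parse_json_record(self, record, record_type=None):
    """Entry point for processing JSON data.

    Attributes:
        record A single record, stored as (nested) dictionary type.
    """
    lbsn_records = []
    post_guid = record.get("id")
    lbsn_record = HF.new_lbsn_record_with_id(
        lbsn.Post(), post_guid, self.origin)
    lbsn_records.append(lbsn_record)
    return lbsn_records
```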


* **JSON or CSV?** Database records and JSON objects are read as nested dictionaries.
CSV records are read using a dict reader and provided as flat dictionaries
(see the illustration after this list).
* Each mapping must have either `parse_csv_record()` or `parse_json_record()` defined.
* JSON and CSV mappings can be combined in one file, but it is recommended to keep
mappings for different input file formats in two separate mappings.
* The class attributes shown above are currently required to be defined, even if
they are not actually used.
* Both `parse_csv_record()` and `parse_json_record()` must return a list of lbsn objects.
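
To illustrate the difference in shape (values are made up; depending on the reader, csv rows may also arrive as plain lists, as in the `parse_csv_record()` docstring above):

```py
# Illustrative only: the record shapes handed to the two entry points.
json_record = {"user": {"id": "123"}, "text": "hello"}  # nested dict (json/db input)
csv_record = {"user_id": "123", "text": "hello"}        # flat dict (csv input)
```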


!!! Note
    For one lbsn origin, many mappings may exist. For example,
    for the above example origin with id "99", you may have
    mappings with ids 991, 992, 993 etc. This can be used to
    create separate mappings for json, csv etc.
    for the above example origin with id `99`, you may have
    mappings with ids `991`, `992`, `993` etc. This can be used to
    create separate mappings for json, csv etc.

    The actual `origin_id` that is stored in the database is
    given in the `importer.origin` attribute.
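
As a purely hypothetical sketch (class names invented for illustration), two such mappings would differ only in their mapping id, which is selected via `--origin`:

```py
# Hypothetical: two mappings feeding the same origin (99),
# one for json and one for csv input.
class JsonImporter():
    """Maps json input to lbsn objects; select with --origin 991."""
    MAPPING_ID = 991

class CsvImporter():
    """Maps csv input to lbsn objects; select with --origin 992."""
    MAPPING_ID = 992
```
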
93 changes: 50 additions & 43 deletions docs/input-types.md
@@ -1,47 +1,54 @@
# Input type: File, URL, or Database?

lbsntransform can read data from different common types of data sources:
lbsntransform can read data from different common types of data sources.

The following cli arguments are available:
The two main input types to distinguish are input from files and databases.

* file input `--file_input`
    * json files `--file_type json`
        * stacked `--is_stacked_json`
          The typical form for json is `[{json1},{json2}]`. If `--is_stacked_json` is set,
          jsons in the form of `{json1}{json2}` (no comma) can be imported.
        * line separated `--is_line_separated_json`
          If this flag is used, lbsntransform expects one json per line (separated with a line break).
    * csv files `--file_type csv`
        * Set CSV delimiter with `--csv_delimiter`, common types are e.g.:
            * Comma: `','` (default)
            * Semi-colon: `';'`
            * Tab: `$'\t'`
    * Additional flags for file input:
        * `--input_path_url` the folder, path or url to read from, e.g.:
            * `--input_path_url 01_Input` Read from the relative subfolder "01_Input" (default).
            * `--input_path_url ~/data/` Read from the user's home folder "data".
            * `--input_path_url /c/tmp/data` Read from a WSL mounted subdir from Windows.
            * `--input_path_url "/d/03_EvaVGI/01_Daten/02_FlickrCommons/Flickr_Commons_100Million_YFCC100M_dataset/"` Read from an absolute path.
        * `--recursive_load` to recursively process local sub directories (default depth: 2).
        * `--skip_until_file x` to process all files until a file with name `x` is found.
        * `--zip_records` Allows zipping records from multiple sources, separated by semi-colon (`;`), e.g.:
            * `--input_path_url "https://mypage.org/dataset_col1.csv;https://mypage.org/dataset_col2.csv"`
              will process records from both csv files in parallel, by zipping them.
* data base input (Postgres)
    * `--dbuser_input "postgres"` the name of the dbuser
    * `--dbserveraddress_input "127.0.0.1:5432"` the name and (optional) the port to use. The default postgres port is `5432`.
    * `--dbname_input "rawdb"` the name of the database.
    * `--dbpassword_input "mypw"` the password to use when connecting.
    * `--dbformat_input "lbsn"` the format of the database. Currently, only "lbsn" and "json" are supported.
    * Additional flags for db input:
        - `--records_tofetch 1000` If retrieving from a db, limit the
          number of records to fetch per batch. Defaults to 10k.
        - `--startwith_db_rownumber xyz` To resume processing from an arbitrary ID.
          If input db type is "LBSN", provide the primary key to start from (e.g. post_guid, place_guid etc.).
          This flag will only work if processing a single lbsnObject (e.g. lbsnPost).
        - `--endwith_db_rownumber xyz` To stop processing at a particular row-id.
        - `--include_lbsn_objects` If processing from lbsn rawdb, provide a comma separated list of
          [lbsn objects](https://lbsn.vgiscience.org/structure/) to include. May contain:
          origin,country,city,place,user_groups,user,post,post_reaction,event.
          Excluded objects will not be queried, but empty objects may be created due to referenced
          foreign key relationships. Defaults to origin,post.
The following cli arguments are available for the two types.

## File input

* activated by `--file_input`
    * json files `--file_type json`
        * stacked `--is_stacked_json`
          The typical form for json is `[{json1},{json2}]`. If `--is_stacked_json` is set,
          jsons in the form of `{json1}{json2}` (no comma) can be imported.
        * line separated `--is_line_separated_json`
          If this flag is used, lbsntransform expects one json per line (separated by a line break).
    * csv files `--file_type csv`
        * Set the CSV delimiter with `--csv_delimiter`; common types are e.g.:
            * Comma: `','` (default)
            * Semi-colon: `';'`
            * Tab: `$'\t'`
    * Additional flags for file input:
        * `--input_path_url` the folder, path or url to read from, e.g.:
            * `--input_path_url 01_Input` Read from the relative subfolder "01_Input" (default).
            * `--input_path_url ~/data/` Read from the folder "data" in the user's home directory.
            * `--input_path_url /c/tmp/data` Read from a WSL-mounted subdir from Windows.
            * `--input_path_url "/d/03_EvaVGI/01_Daten/02_FlickrCommons/Flickr_Commons_100Million_YFCC100M_dataset/"` Read from an absolute path.
        * `--recursive_load` to recursively process local subdirectories (default depth: 2).
        * `--skip_until_file x` to process all files until a file with name `x` is found.
        * `--zip_records` allows zipping records from multiple sources, separated by semi-colon (`;`), e.g.:
            * `--input_path_url "https://mypage.org/dataset_col1.csv;https://mypage.org/dataset_col2.csv"`
              will process records from both csv files in parallel, by zipping them.
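
For illustration, several of the flags above combined into a single file-input call (origin, mappings path and input path are placeholders to adapt):

```bash
lbsntransform --origin 3 \
    --mappings_path ./resources/mappings/ \
    --file_input \
    --file_type "json" \
    --is_line_separated_json \
    --input_path_url ~/data/ \
    --recursive_load
```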

## Database input (Postgres)

* activated by default
* `--dbuser_input "postgres"` the name of the db user
* `--dbserveraddress_input "127.0.0.1:5432"` the host name and (optionally) the port to use. The default postgres port is `5432`.
* `--dbname_input "rawdb"` the name of the database.
* `--dbpassword_input "mypw"` the password to use when connecting.
* `--dbformat_input "lbsn"` the format of the database. Currently, only "lbsn" and "json" are supported.
* Additional flags for db input:
    - `--records_tofetch 1000` If retrieving from a db, limit the
      number of records to fetch per batch. Defaults to 10k.
    - `--startwith_db_rownumber xyz` To resume processing from an arbitrary ID.
      If the input db type is "LBSN", provide the primary key to start from (e.g. post_guid, place_guid etc.).
      This flag will only work if processing a single lbsn object (e.g. lbsnPost).
    - `--endwith_db_rownumber xyz` To stop processing at a particular row-id.
    - `--include_lbsn_objects` If processing from lbsn rawdb, provide a comma separated list of
      [lbsn objects](https://lbsn.vgiscience.org/structure/) to include. May contain:
      `origin,country,city,place,user_groups,user,post,post_reaction,event`
      Note: Excluded objects will not be queried, but empty objects may be created due to referenced
      foreign key relationships. Defaults to `origin,post`.
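
Combined into a single call, reading posts from a local lbsn rawdb could look like this (credentials are placeholders):

```bash
lbsntransform --origin 0 \
    --dbuser_input "postgres" \
    --dbserveraddress_input "127.0.0.1:5432" \
    --dbname_input "rawdb" \
    --dbpassword_input "mypw" \
    --dbformat_input "lbsn" \
    --records_tofetch 1000 \
    --include_lbsn_objects "origin,post"
```
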
33 changes: 23 additions & 10 deletions docs/output-mappings.md
@@ -1,6 +1,6 @@
**lbsntransform** can output data to a database with the [common lbsn structure](https://lbsn.vgiscience.org/structure/),
called [rawdb](https://gitlab.vgiscience.de/lbsn/databases/rawdb)
or the privacy-aware version, called [hlldb](https://gitlab.vgiscience.de/lbsn/databases/hllb).
or the privacy-aware version, called [hlldb](https://gitlab.vgiscience.de/lbsn/databases/hlldb).

**Examples:**

@@ -51,15 +51,20 @@
It further allows to separate processing into individual components.

If no hll worker is available, hlldb may be used.

??? Why do I need a database connection?
    There's a [python package](https://github.com/AdRoll/python-hll) available that
    allows making hll calculations in python. However, it is not as performant
    as the native Postgres implementation.

Use `--include_lbsn_objects` to specify which input data you want to convert to
the privacy-aware version. For example, `--include_lbsn_objects "origin,post"`
would process [lbsn objects](https://lbsn.vgiscience.org/structure/)
of type origin and post (default).

Use `--include_lbsn_bases` to specify which output structures the input data should be converted to.

We call this "bases", and they are defined in output mappings in
[lbsntransform/input/field_mapping_lbsn.py](/api/output/hll/hll_bases.html),
We refer to the different output structures as "bases"; they are defined
in the output mappings in [lbsntransform/output/hll/hll_bases.py](lbsntransform/docs/api/output/hll/hll_bases.html).

Bases can be separated by comma and may include:

@@ -100,17 +105,25 @@
For example:
lbsntransform --include_lbsn_bases hashtag,place,date,community
```

..would fill/update entries of the hlldb structures:
...would convert and transfer any input data to the hlldb structures:

- topical.hashtag
- spatial.place
- temporal.date
- social.community
- `topical.hashtag`
- `spatial.place`
- `temporal.date`
- `social.community`

This name refers to `schema.table`.
The name refers to `schema.table` in the Postgres implementation.

!!! Upsert (Insert or Update)
    Because lbsntransform cannot know whether output
    records (primary keys) already exist, all data is transferred using the
    [Upsert](https://wiki.postgresql.org/wiki/UPSERT) syntax, i.e.
    `INSERT ... ON CONFLICT UPDATE`: records are inserted if their primary
    keys do not yet exist, or updated otherwise, merging hll sets with `hll_union()`.
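
A minimal sketch of this upsert pattern in Python with psycopg2, assuming a hypothetical `topical.hashtag` table with a unique `hashtag` key and a `post_hll` column (the actual hlldb schema may differ):

```py
# Hypothetical sketch of the upsert pattern described above; the table
# and column names are illustrative, not the actual hlldb definition.
import psycopg2

UPSERT_SQL = """
    INSERT INTO topical.hashtag (hashtag, post_hll)
    VALUES (%(hashtag)s, hll_empty())
    -- real code would insert the hll value computed by the hll worker
    ON CONFLICT (hashtag)
    DO UPDATE SET post_hll = hll_union(
        topical.hashtag.post_hll, EXCLUDED.post_hll);
"""

with psycopg2.connect(
        "dbname=hlldb user=postgres host=127.0.0.1") as conn:
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, {"hashtag": "lbsn"})
```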

It is possible to define your own output hll db mappings. The best place
to start is [lbsntransform/input/field_mapping_lbsn.py](/api/output/hll/hll_bases.html).
to start is [lbsntransform/output/hll/hll_bases.py](lbsntransform/docs/api/output/hll/hll_bases.html).

Have a look at the pre-defined bases and add any additional ones needed. It is
recommended to use inheritance. After adding your own mappings, the hlldb must be
prepared with the respective table structures. Have a look at the
61 changes: 33 additions & 28 deletions docs/quick-guide.md
@@ -1,12 +1,43 @@
# Installation with conda

This is the recommended way for all systems.

This approach is independent of the OS used.

If you have the conda package manager, you can install the lbsntransform dependencies
with the `environment.yml` that is available in the lbsntransform repository:

```yaml
{!../environment.yml!}
```

1. Create a conda env using `environment.yml`

```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
# not necessary, but recommended:
conda config --env --set channel_priority strict
conda env create -f environment.yml
```

2. Install lbsntransform without dependencies

```bash
conda activate lbsntransform
python setup.py install --no-deps
```

# Windows

There are many ways to install python tools; in Windows this can become particularly frustrating.

1. For most Windows users, the recommended way is to install lbsntransform with [conda package manager](#installation-with-conda)
2. If you _need_ to install with pip in Windows, a possible approach is to install all dependencies first (use [Gohlke wheels] if necessary) and then install lbsntransform with

pip install lbsntransform --no-deps

```bash
pip install lbsntransform --no-deps
```

# Linux

!!! note
@@ -39,32 +70,6 @@
cd lbsntransform
python setup.py install
```

# Installation with conda

This approach is independent of the OS used.

If you have conda package manager, you can install lbsntransform dependencies
with the `environment.yml` that is available in the lbsntransform repository:

```yaml
{!../environment.yml!}
```

1. Create a conda env using `environment.yml`

```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
conda env create -f environment.yml
```

2. Install lbsntransform without dependencies

```bash
conda activate lbsntransform
python setup.py install --no-deps
```

[1]: https://stackoverflow.com/q/27734053/4556479#comment43880476_27734053
[psycopg2]: https://www.psycopg.org/install/
[Gohlke wheels]: https://www.lfd.uci.edu/~gohlke/pythonlibs/
4 changes: 2 additions & 2 deletions docs/use-cases.md
@@ -17,5 +17,5 @@
The following two primary use cases exist:

For any conversion,

- the input type must be provided, see [input-types](/input-types)
- a mapping must exist, see [input-mappings](/input-mappings)
- the input type must be provided, see [input-types](lbsntransform/docs/input-types)
- a mapping must exist, see [input-mappings](lbsntransform/docs/input-mappings)
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -2,9 +2,10 @@
site_name: lbsntransform Documentation
site_url: https://lbsn.vgiscience.org/lbsntransform/docs/
site_author: Alexander Dunkel
copyright: CC BY 4.0, Alexander Dunkel, vgiscience.org and contributors
site_dir: site

repo_url: https://github.com/Sieboldianus/lbsntransform
site_dir: site
docs_dir: docs

theme:
name: 'readthedocs'
