
Commit

docs: fix links and rephrase sections
Sieboldianus committed Dec 14, 2020
1 parent baa6976 commit 76875bf
Showing 6 changed files with 156 additions and 97 deletions.
59 changes: 46 additions & 13 deletions docs/input-mappings.md
@@ -6,37 +6,45 @@
is available from the Python version of the Proto Buf Spec.
Mappings are loaded dynamically. You can provide a path to a folder
containing mappings with the flag `--mappings_path ./subfolder`.

To use the provided example mappings (Twitter or YFCC100M), clone the
repository and use:
```bash
lbsntransform --mappings_path ./resources/mappings/
```

If no path is provided, `lbsn raw` is assumed as input, for which
the file mapping is available in [lbsntransform/input/field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html),
including lbsn db query syntax defined in [lbsntransform/input/db_query.py](/api/input/mappings/db_query.html).
the file mapping is available in [lbsntransform/input/field_mapping_lbsn.py](lbsntransform/docs/api/input/mappings/field_mapping_lbsn.html),
including lbsn db query syntax defined in [lbsntransform/input/db_query.py](lbsntransform/docs/api/input/mappings/db_query.html).

Predefined mappings exist for the [Flickr YFCC100M dataset](https://lbsn.vgiscience.org/yfcc-introduction/) (CSV) and Twitter (JSON).

Have a look at the two mappings in the [resources folder](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources).

Predefined mappings exist for Flickr (CSV/JSON) and Twitter (JSON)
in the [resources folder](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources).
If the git repository is cloned to a local folder, use
`--mappings_path ./resources/mappings/` to load Flickr or Twitter mappings.

Input mappings must have some specific attributes to be recognized.

Primarily, a class constant "MAPPING_ID" is used to load mappings,
e.g. the [field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html)
e.g. the [field_mapping_lbsn.py](lbsntransform/docs/api/input/mappings/field_mapping_lbsn.html)
has the following module-level constant:
```py
MAPPING_ID = 0
```

**Examples:**

To load data with the default mapping, use `lbsntransform --origin 0`.
To load data with the default mapping (MAPPING_ID `0`), use `lbsntransform --origin 0`.

To load data from Twitter json, use use
To load data from Twitter json, use
```bash
lbsntransform --origin 3 \
--mappings_path ./resources/mappings/ \
--file_input \
--file_type "json"
```

To load data from Flickr YFCC100M, use use
To load data from Flickr YFCC100M, use

```bash
lbsntransform --origin 21 \
@@ -48,7 +56,7 @@
lbsntransform --origin 21 \

# Custom Input Mappings

Start with any of the predefined mappings, either from [field_mapping_lbsn.py](/api/input/mappings/field_mapping_lbsn.html),
Start with any of the predefined mappings, either from [field_mapping_lbsn.py](lbsntransform/docs/api/input/mappings/field_mapping_lbsn.html),
or [field_mapping_twitter.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources/field_mapping_twitter.py) (JSON) and
[field_mapping_yfcc100m.py](https://gitlab.vgiscience.de/lbsn/lbsntransform/resources/field_mapping_yfcc100m.py) (CSV).

@@ -95,16 +103,41 @@
class importer():
        record A single row from CSV, stored as list type.
        """
        # extract/convert all lbsn records
        lbsn_records = self.extract_post(record)
        lbsn_records = []
        lbsn_record = self.extract_post(record)
        lbsn_records.append(lbsn_record)
        return lbsn_records

    # def parse_json_record(self, record, record_type: Optional[str] = None):
    #     """Entry point for processing JSON data:
    #     Attributes:
    #         record A single record, stored as dictionary type.
    #     """
    #     # extract lbsn objects
    #     return lbsn_records

    def extract_post(self, record):
        post_record = HF.new_lbsn_record_with_id(
            lbsn.Post(), post_guid, self.origin)
        return post_record
```
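
The json entry point is only stubbed out above. Here is a minimal sketch of what an implemented `parse_json_record()` could look like, mirroring the csv variant. Note that the `record.get("id")` lookup is an assumption about the input json, not part of the actual lbsn spec:

```py
# Hypothetical sketch only: mirrors parse_csv_record() above.
# The "id" key used to derive the post guid is an assumption
# about the input json, not a fixed schema.
def parse_json_record(self, record, record_type=None):
    """Entry point for processing JSON data.

    Attributes:
        record A single record, stored as (nested) dictionary type.
    """
    lbsn_records = []
    post_guid = record.get("id")
    lbsn_record = HF.new_lbsn_record_with_id(
        lbsn.Post(), post_guid, self.origin)
    lbsn_records.append(lbsn_record)
    return lbsn_records
```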


* **JSON or CSV?** Database records and JSON objects are read as nested dictionaries.
CSV records are read using a dict reader and provided as flat dictionaries
(see the illustration after this list).
* Each mapping must have either `parse_csv_record()` or `parse_json_record()` defined.
* JSON and CSV mappings can be combined in one file, but it is recommended to keep
mappings for different input file formats in two separate mappings.
* The class attributes shown above are currently required to be defined, even if
they are not actually used.
* Both `parse_csv_record()` and `parse_json_record()` must return a list of lbsn objects.
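
To illustrate the difference in shape (values are made up; depending on the reader, csv rows may also arrive as plain lists, as in the `parse_csv_record()` docstring above):

```py
# Illustrative only: the record shapes handed to the two entry points.
json_record = {"user": {"id": "123"}, "text": "hello"}  # nested dict (json/db input)
csv_record = {"user_id": "123", "text": "hello"}        # flat dict (csv input)
```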


!!! Note
    For one lbsn origin, many mappings may exist. For example,
    for the above example origin with id "99", you may have
    mappings with ids 991, 992, 993 etc. This can be used to
    create separate mappings for json, csv etc.
    for the above example origin with id `99`, you may have
    mappings with ids `991`, `992`, `993` etc. This can be used to
    create separate mappings for json, csv etc.

    The actual `origin_id` that is stored in the database is
    given in the `importer.origin` attribute.
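
As a purely hypothetical sketch (class names invented for illustration), two such mappings would differ only in their mapping id, which is selected via `--origin`:

```py
# Hypothetical: two mappings feeding the same origin (99),
# one for json and one for csv input.
class JsonImporter():
    """Maps json input to lbsn objects; select with --origin 991."""
    MAPPING_ID = 991

class CsvImporter():
    """Maps csv input to lbsn objects; select with --origin 992."""
    MAPPING_ID = 992
```
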
93 changes: 50 additions & 43 deletions docs/input-types.md
@@ -1,47 +1,54 @@
# Input type: File, URL, or Database?

lbsntransform can read data from different common types of data sources:
lbsntransform can read data from different common types of data sources.

The following cli arguments are available:
The two main input types to distinguish are input from files and databases.

* file input `--file_input`
    * json files `--file_type json`
        * stacked `--is_stacked_json`
          The typical form for json is `[{json1},{json2}]`. If `--is_stacked_json` is set,
          jsons in the form of `{json1}{json2}` (no comma) can be imported.
        * line separated `--is_line_separated_json`
          If this flag is used, lbsntransform expects one json per line (separated with a line break).
    * csv files `--file_type csv`
        * Set CSV delimiter with `--csv_delimiter`, common types are e.g.:
            * Comma: `','` (default)
            * Semi-colon: `';'`
            * Tab: `$'\t'`
    * Additional flags for file input:
        * `--input_path_url` the folder, path or url to read from, e.g.:
            * `--input_path_url 01_Input` Read from the relative subfolder "01_Input" (default).
            * `--input_path_url ~/data/` Read from the user's home folder "data".
            * `--input_path_url /c/tmp/data` Read from a WSL mounted subdir from Windows.
            * `--input_path_url "/d/03_EvaVGI/01_Daten/02_FlickrCommons/Flickr_Commons_100Million_YFCC100M_dataset/"` Read from an absolute path.
        * `--recursive_load` to recursively process local sub directories (default depth: 2).
        * `--skip_until_file x` to process all files until a file with name `x` is found.
        * `--zip_records` Allows zipping records from multiple sources, separated by semi-colon (`;`), e.g.:
            * `--input_path_url "https://mypage.org/dataset_col1.csv;https://mypage.org/dataset_col2.csv"`
              will process records from both csv files in parallel, by zipping them.
* data base input (Postgres)
    * `--dbuser_input "postgres"` the name of the dbuser
    * `--dbserveraddress_input "127.0.0.1:5432"` the name and (optional) the port to use. The default postgres port is `5432`.
    * `--dbname_input "rawdb"` the name of the database.
    * `--dbpassword_input "mypw"` the password to use when connecting.
    * `--dbformat_input "lbsn"` the format of the database. Currently, only "lbsn" and "json" are supported.
    * Additional flags for db input:
        - `--records_tofetch 1000` If retrieving from a db, limit the
          number of records to fetch per batch. Defaults to 10k.
        - `--startwith_db_rownumber xyz` To resume processing from an arbitrary ID.
          If input db type is "LBSN", provide the primary key to start from (e.g. post_guid, place_guid etc.).
          This flag will only work if processing a single lbsnObject (e.g. lbsnPost).
        - `--endwith_db_rownumber xyz` To stop processing at a particular row-id.
        - `--include_lbsn_objects` If processing from lbsn rawdb, provide a comma separated list of
          [lbsn objects](https://lbsn.vgiscience.org/structure/) to include. May contain:
          origin,country,city,place,user_groups,user,post,post_reaction,event.
          Excluded objects will not be queried, but empty objects may be created due to referenced
          foreign key relationships. Defaults to origin,post.
The following cli arguments are available for the two types.

## File input

* activated by `--file_input`
    * json files `--file_type json`
        * stacked `--is_stacked_json`
          The typical form for json is `[{json1},{json2}]`. If `--is_stacked_json` is set,
          jsons in the form of `{json1}{json2}` (no comma) can be imported.
        * line separated `--is_line_separated_json`
          If this flag is used, lbsntransform expects one json per line (separated by a line break).
    * csv files `--file_type csv`
        * Set the CSV delimiter with `--csv_delimiter`; common types are e.g.:
            * Comma: `','` (default)
            * Semi-colon: `';'`
            * Tab: `$'\t'`
    * Additional flags for file input:
        * `--input_path_url` the folder, path or url to read from, e.g.:
            * `--input_path_url 01_Input` Read from the relative subfolder "01_Input" (default).
            * `--input_path_url ~/data/` Read from the folder "data" in the user's home directory.
            * `--input_path_url /c/tmp/data` Read from a WSL-mounted subdir from Windows.
            * `--input_path_url "/d/03_EvaVGI/01_Daten/02_FlickrCommons/Flickr_Commons_100Million_YFCC100M_dataset/"` Read from an absolute path.
        * `--recursive_load` to recursively process local subdirectories (default depth: 2).
        * `--skip_until_file x` to process all files until a file with name `x` is found.
        * `--zip_records` allows zipping records from multiple sources, separated by semi-colon (`;`), e.g.:
            * `--input_path_url "https://mypage.org/dataset_col1.csv;https://mypage.org/dataset_col2.csv"`
              will process records from both csv files in parallel, by zipping them.
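
For illustration, several of the flags above combined into a single file-input call (origin, mappings path and input path are placeholders to adapt):

```bash
lbsntransform --origin 3 \
    --mappings_path ./resources/mappings/ \
    --file_input \
    --file_type "json" \
    --is_line_separated_json \
    --input_path_url ~/data/ \
    --recursive_load
```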

## Database input (Postgres)

* activated by default
* `--dbuser_input "postgres"` the name of the db user
* `--dbserveraddress_input "127.0.0.1:5432"` the host name and (optionally) the port to use. The default postgres port is `5432`.
* `--dbname_input "rawdb"` the name of the database.
* `--dbpassword_input "mypw"` the password to use when connecting.
* `--dbformat_input "lbsn"` the format of the database. Currently, only "lbsn" and "json" are supported.
* Additional flags for db input:
    - `--records_tofetch 1000` If retrieving from a db, limit the
      number of records to fetch per batch. Defaults to 10k.
    - `--startwith_db_rownumber xyz` To resume processing from an arbitrary ID.
      If the input db type is "LBSN", provide the primary key to start from (e.g. post_guid, place_guid etc.).
      This flag will only work if processing a single lbsn object (e.g. lbsnPost).
    - `--endwith_db_rownumber xyz` To stop processing at a particular row-id.
    - `--include_lbsn_objects` If processing from lbsn rawdb, provide a comma separated list of
      [lbsn objects](https://lbsn.vgiscience.org/structure/) to include. May contain:
      `origin,country,city,place,user_groups,user,post,post_reaction,event`
      Note: Excluded objects will not be queried, but empty objects may be created due to referenced
      foreign key relationships. Defaults to `origin,post`.
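
Combined into a single call, reading posts from a local lbsn rawdb could look like this (credentials are placeholders):

```bash
lbsntransform --origin 0 \
    --dbuser_input "postgres" \
    --dbserveraddress_input "127.0.0.1:5432" \
    --dbname_input "rawdb" \
    --dbpassword_input "mypw" \
    --dbformat_input "lbsn" \
    --records_tofetch 1000 \
    --include_lbsn_objects "origin,post"
```
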
33 changes: 23 additions & 10 deletions docs/output-mappings.md
@@ -1,6 +1,6 @@
**lbsntransform** can output data to a database with the [common lbsn structure](https://lbsn.vgiscience.org/structure/),
called [rawdb](https://gitlab.vgiscience.de/lbsn/databases/rawdb)
or the privacy-aware version, called [hlldb](https://gitlab.vgiscience.de/lbsn/databases/hllb).
or the privacy-aware version, called [hlldb](https://gitlab.vgiscience.de/lbsn/databases/hlldb).

**Examples:**

@@ -51,15 +51,20 @@
It further allows to separate processing into individual components.

If no hll worker is available, hlldb may be used.

??? Why do I need a database connection?
    There's a [python package](https://github.com/AdRoll/python-hll) available that
    allows making hll calculations in python. However, it is not as performant
    as the native Postgres implementation.

Use `--include_lbsn_objects` to specify which input data you want to convert to
the privacy-aware version. For example, `--include_lbsn_objects "origin,post"`
would process [lbsn objects](https://lbsn.vgiscience.org/structure/)
of type origin and post (default).

Use `--include_lbsn_bases` to specify which output structures the input data should be converted to.

We call this "bases", and they are defined in output mappings in
[lbsntransform/input/field_mapping_lbsn.py](/api/output/hll/hll_bases.html),
We refer to the different output structures as "bases"; they are defined
in the output mappings in [lbsntransform/output/hll/hll_bases.py](lbsntransform/docs/api/output/hll/hll_bases.html).

Bases can be separated by comma and may include:

@@ -100,17 +105,25 @@
For example:
lbsntransform --include_lbsn_bases hashtag,place,date,community
```

..would fill/update entries of the hlldb structures:
...would convert and transfer any input data to the hlldb structures:

- topical.hashtag
- spatial.place
- temporal.date
- social.community
- `topical.hashtag`
- `spatial.place`
- `temporal.date`
- `social.community`

This name refers to `schema.table`.
The name refers to `schema.table` in the Postgres implementation.

!!! Upsert (Insert or Update)
    Because lbsntransform cannot know whether output
    records (primary keys) already exist, all data is transferred using the
    [Upsert](https://wiki.postgresql.org/wiki/UPSERT) syntax, i.e.
    `INSERT ... ON CONFLICT UPDATE`: records are inserted if their primary
    keys do not yet exist, or updated otherwise, merging hll sets with `hll_union()`.
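
A minimal sketch of this upsert pattern in Python with psycopg2, assuming a hypothetical `topical.hashtag` table with a unique `hashtag` key and a `post_hll` column (the actual hlldb schema may differ):

```py
# Hypothetical sketch of the upsert pattern described above; the table
# and column names are illustrative, not the actual hlldb definition.
import psycopg2

UPSERT_SQL = """
    INSERT INTO topical.hashtag (hashtag, post_hll)
    VALUES (%(hashtag)s, hll_empty())
    -- real code would insert the hll value computed by the hll worker
    ON CONFLICT (hashtag)
    DO UPDATE SET post_hll = hll_union(
        topical.hashtag.post_hll, EXCLUDED.post_hll);
"""

with psycopg2.connect(
        "dbname=hlldb user=postgres host=127.0.0.1") as conn:
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, {"hashtag": "lbsn"})
```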

It is possible to define your own output hll db mappings. The best place
to start is [lbsntransform/input/field_mapping_lbsn.py](/api/output/hll/hll_bases.html).
to start is [lbsntransform/output/hll/hll_bases.py](lbsntransform/docs/api/output/hll/hll_bases.html).

Have a look at the pre-defined bases and add any additional ones needed. It is
recommended to use inheritance. After adding your own mappings, the hlldb must be
prepared with the respective table structures. Have a look at the
61 changes: 33 additions & 28 deletions docs/quick-guide.md
@@ -1,12 +1,43 @@
# Installation with conda

This is the recommended way for all systems.

This approach is independent of the OS used.

If you have the conda package manager, you can install the lbsntransform dependencies
with the `environment.yml` that is available in the lbsntransform repository:

```yaml
{!../environment.yml!}
```

1. Create a conda env using `environment.yml`

```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
# not necessary, but recommended:
conda config --env --set channel_priority strict
conda env create -f environment.yml
```

2. Install lbsntransform without dependencies

```bash
conda activate lbsntransform
python setup.py install --no-deps
```

# Windows

There are many ways to install python tools; in Windows this can become particularly frustrating.

1. For most Windows users, the recommended way is to install lbsntransform with [conda package manager](#installation-with-conda)
2. If you _need_ to install with pip in Windows, a possible approach is to install all dependencies first (use [Gohlke wheels] if necessary) and then install lbsntransform with

pip install lbsntransform --no-deps

```bash
pip install lbsntransform --no-deps
```

# Linux

!!! note
@@ -39,32 +70,6 @@
cd lbsntransform
python setup.py install
```

# Installation with conda

This approach is independent of the OS used.

If you have conda package manager, you can install lbsntransform dependencies
with the `environment.yml` that is available in the lbsntransform repository:

```yaml
{!../environment.yml!}
```

1. Create a conda env using `environment.yml`

```bash
git clone https://github.com/Sieboldianus/lbsntransform.git
cd lbsntransform
conda env create -f environment.yml
```

2. Install lbsntransform without dependencies

```bash
conda activate lbsntransform
python setup.py install --no-deps
```

[1]: https://stackoverflow.com/q/27734053/4556479#comment43880476_27734053
[psycopg2]: https://www.psycopg.org/install/
[Gohlke wheels]: https://www.lfd.uci.edu/~gohlke/pythonlibs/
4 changes: 2 additions & 2 deletions docs/use-cases.md
@@ -17,5 +17,5 @@
The following two primary use cases exist:

For any conversion,

- the input type must be provided, see [input-types](/input-types)
- a mapping must exist, see [input-mappings](/input-mappings)
- the input type must be provided, see [input-types](lbsntransform/docs/input-types)
- a mapping must exist, see [input-mappings](lbsntransform/docs/input-mappings)
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -2,9 +2,10 @@
site_name: lbsntransform Documentation
site_url: https://lbsn.vgiscience.org/lbsntransform/docs/
site_author: Alexander Dunkel
copyright: CC BY 4.0, Alexander Dunkel, vgiscience.org and contributors
site_dir: site

repo_url: https://github.com/Sieboldianus/lbsntransform
site_dir: site
docs_dir: docs

theme:
name: 'readthedocs'
