# Python package builder example
The package `ckanapi_harvesters.builder` implements functions to manage a CKAN dataset (previously known as a package) with the help of an Excel workbook. This Excel file specifies the package metadata and the information needed to upload and download the resources of the package. This notebook illustrates these tasks.

## Initialisation
The package can be installed with the following command:
```sh
pip install "ckanapi_harvesters[extras]"
```

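When the option `use_git_package` is set to `True`, the following cell changes the working directory so that `ckanapi_harvesters` is imported from the source code present in the Git directory rather than from the installed package.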

In [None]:
# optionally, use the ckanapi_harvesters package present in the Git directory
use_git_package = True
if use_git_package:
    import os
    cwd = os.getcwd()
    if not os.path.isdir(os.path.join(cwd, "ckanapi_harvesters")):
        # we assume we are in the examples directory
        cwd = os.path.join(cwd, r"../../src")  # aim for src directory
        assert os.path.isdir(os.path.join(cwd, "ckanapi_harvesters")), "ckanapi_harvesters not found in src directory"
        os.chdir(cwd)
        print("CWD changed to: " + os.getcwd())

In [None]:
import os
from ckanapi_harvesters import CkanApi, BuilderPackage
from ckanapi_harvesters.builder.example import example_package_xls

### Loading package metadata from Excel file
The Excel workbook given in the example refers to an external code module which provides the DataFrame upload/download functions. To activate this feature, `unlock_external_code_execution` must be called before loading the workbook.

__Warning:__ Only enable this feature for code that comes from trusted sources, because it executes the module referenced in the Excel workbook (`Auxiliary functions file` field).

In [None]:
BuilderPackage.unlock_external_code_execution()
mdl = BuilderPackage.from_excel(example_package_xls)

In [None]:
ckan = CkanApi(None)
# you can specify the CKAN URL, owner organization, API key here or in the Excel workbook
ckan = mdl.init_ckan(ckan)
ckan.input_missing_info(input_args_if_necessary=True, input_owner_org=True, error_not_found=False)  # request user input to configure CKAN
ckan.set_limits(10000)  # reduce if server hangs up
ckan.set_submit_timeout(5)
ckan.set_verbosity(True)  # this displays all the steps performed by the script

### Displaying the package model

In [None]:
df_dict = mdl.get_all_df()
for tab, df in df_dict.items():
    display(f"Tab {tab}:")
    display(df)

#### Auxiliary function to display a progress bar

In [None]:
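from ipywidgets import IntProgress
from IPython.display import display
f = IntProgress(min=0, max=100)  # progress bar widget, value in percent

def progress_callback(index: int, total: int, **kwargs):
    # update the progress bar; avoid a division by zero if there is nothing to process
    f.value = int(index / total * 100) if total > 0 else 100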
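
Outside a notebook, `ipywidgets` may not be available. The following is a minimal text-based sketch, assuming only that the callback is invoked as `progress_callback(index, total, **kwargs)` like the one above. The names `format_progress` and `text_progress_callback` are hypothetical helpers and are not part of `ckanapi_harvesters`.

```python
import sys

def format_progress(index: int, total: int, width: int = 20) -> str:
    """Render a text progress bar such as '[#####-----]  50%'."""
    # treat an empty task list as complete and clamp the ratio to [0, 1]
    ratio = 1.0 if total <= 0 else min(max(index / total, 0.0), 1.0)
    filled = int(ratio * width)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {int(ratio * 100):3d}%"

def text_progress_callback(index: int, total: int, **kwargs):
    """Hypothetical console replacement for the widget-based callback."""
    # the carriage return rewrites the same console line at each update
    sys.stdout.write("\r" + format_progress(index, total))
    sys.stdout.flush()
```

Such a function could be passed as `progress_callback=text_progress_callback` in the calls below, in place of the widget-based callback.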

## Initiating the package
This call creates the package if no package with the same name exists. If the package already exists, it is updated with the information from the Excel workbook. The resources are then initialized. Optionally, the resource data can be fully reuploaded (even if the resources already exist) to ensure that the server side of the package matches what is specified in the Excel workbook. However, this resets each large dataset to its first file, so the remaining files must be uploaded again (see below).

In [None]:
reupload = True  # True: reuploads all documents and resets large datasets to the first document (not recommended if there is a large dataset)
mdl.patch_request_full(ckan, reupload=reupload)

### Uploading large datasets
Large datasets are defined locally by a directory containing multiple CSV files. The first file found (in alphabetical order) is used to initialize the dataset. This function then appends the other files through the `datastore_upsert` API call, using a multi-threaded implementation. It can be executed several times without changing the final result, provided that all the data is eventually transferred.
If too many requests fail, reduce the number of threads.

In [None]:
threads = 3  # > 1: multi-threading mode - reduce if HTTP 502 errors
display(f)
mdl.upload_large_datasets(ckan, threads=threads, progress_callback=progress_callback, only_missing=True)
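
The strategy described above can be illustrated with a short sketch. It is not the implementation used by `upload_large_datasets`, which may differ. The names `split_large_dataset`, `upload_remaining` and `upsert_file` are hypothetical, and in a real setting `upsert_file` would send the file contents to the CKAN `datastore_upsert` action.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

def split_large_dataset(directory: str) -> Tuple[str, List[str]]:
    """Return (first CSV file, remaining CSV files) in alphabetical order."""
    files = sorted(name for name in os.listdir(directory) if name.lower().endswith(".csv"))
    if not files:
        raise FileNotFoundError(f"No CSV file found in {directory}")
    # the first file initializes the dataset, the others are appended afterwards
    return files[0], files[1:]

def upload_remaining(files: List[str], upsert_file: Callable[[str], None], threads: int = 3,
                     progress_callback: Optional[Callable[..., None]] = None) -> None:
    """Send each remaining file through upsert_file (hypothetical) using a thread pool."""
    total = len(files)
    with ThreadPoolExecutor(max_workers=max(threads, 1)) as pool:
        # pool.map yields results in input order as they become available
        for index, _ in enumerate(pool.map(upsert_file, files), start=1):
            if progress_callback is not None:
                progress_callback(index, total)
```

In this sketch, lowering `threads` reduces the number of simultaneous requests, which mirrors the advice given above for HTTP 502 errors.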

## Downloading the package
This function downloads all the resources of a dataset. Multi-threading is only used to download large datasets.

In [None]:
# define the destination directory
example_package_download_dir = os.path.abspath("package_download")
print("Package will be downloaded in: " + example_package_download_dir)

In [None]:
threads = 3  # > 1: number of threads to download large datasets
display(f)
mdl.download_request_full(ckan, example_package_download_dir, full_download=True, threads=threads,
                              skip_existing=False, progress_callback=progress_callback)
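
Once the download has finished, the contents of the destination directory can be inspected. The following is a minimal sketch using only the standard library. The layout of the downloaded files is not specified here, so the helper simply walks the whole directory; `summarize_directory` is a hypothetical name and not part of `ckanapi_harvesters`.

```python
import os
from typing import Dict

def summarize_directory(root: str) -> Dict[str, int]:
    """Map each file path (relative to root) to its size in bytes."""
    summary = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            summary[os.path.relpath(path, root)] = os.path.getsize(path)
    return summary
```

For example, `summarize_directory(example_package_download_dir)` would return one entry per downloaded file.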