# Datalabframework

The datalabframework is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

In [1]:
import datalabframework as dlf

## Load a project

A key design choice is to keep configuration and code in separate files. A project defines the working directories in which your notebooks, Python files, and configuration files are found and run. When a datalabframework project is loaded, it first searches for a `__main__.py` file, following Python module file naming conventions. The directory containing that file becomes the project root path. All module and alias paths are relative to the project root path.
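
The root path lookup described above can be pictured with a minimal sketch. This is not the framework's actual implementation: the helper name `find_rootdir` is hypothetical, and the sketch assumes the search walks upward from a starting directory until it finds a directory containing `__main__.py`.

```python
from pathlib import Path
from typing import Optional


def find_rootdir(start: Path, marker: str = "__main__.py") -> Optional[Path]:
    """Hypothetical helper (not part of datalabframework).

    Walks from `start` up through its parent directories and returns the
    first directory that contains a file named `marker`, or None if no
    such directory exists.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: the search starts from the current working directory.
    print(find_rootdir(Path.cwd()))
```

Anchoring every path to a marker file means notebooks in subdirectories resolve modules and aliases the same way, regardless of where they are started.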

Load a profile with the `datalabframework.project.load` function. It looks for files whose names end with `metadata.yml`. The function can optionally set the current working directory and import the `key=value` pairs of a `.env` file into the Python process environment. If no parameters are specified, the `default` profile is loaded.
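
To illustrate what importing the `key=value` pairs of a `.env` file into the environment amounts to, here is a minimal sketch. It is an assumption about the general technique, not the framework's own code: the function name `load_env_file` is hypothetical, and the parsing rules (skipping blank lines and `#` comments, stripping surrounding quotes) are simplifications.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: Path) -> Dict[str, str]:
    """Hypothetical helper (not part of datalabframework).

    Reads simple KEY=VALUE lines from `path`, copies them into os.environ,
    and returns the pairs that were loaded. A missing file yields an empty dict.
    """
    loaded: Dict[str, str] = {}
    if not path.is_file():
        return loaded
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without a key=value shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding double or single quotes from the value.
        value = value.strip().strip('"').strip("'")
        os.environ[key] = value
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    # Assumption: the .env file sits in the current working directory.
    print(sorted(load_env_file(Path(".env"))))
```

Keeping credentials in `.env` rather than in `metadata.yml` lets the configuration files be committed without exposing secrets.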

In [2]:
help(dlf.project.load)

Help on function load in module datalabframework.project:

load(profile='default', rootpath=None)



### Project Configuration

In [3]:
# Loading default profile
import datalabframework as dlf
project = dlf.project.load()

created SparkEngine
Init engine "spark"
Configuring packages:
  -  com.microsoft.sqlserver:mssql-jdbc:6.4.0.jre8
  -  mysql:mysql-connector-java:8.0.12
  -  org.apache.hadoop:hadoop-aws:3.1.1
  -  org.postgresql:postgresql:42.2.5
Configuring conf:
  -  spark.hadoop.fs.s3a.access.key : ****** (redacted)
  -  spark.hadoop.fs.s3a.endpoint : http://minio:9000
  -  spark.hadoop.fs.s3a.impl : org.apache.hadoop.fs.s3a.S3AFileSystem
  -  spark.hadoop.fs.s3a.path.style.access : true
  -  spark.hadoop.fs.s3a.secret.key : ****** (redacted)
Connecting to spark master: local[*]
Engine context spark:2.4.1 successfully started


## Inspect current project configuration
The following call displays the loaded metadata profile and the project configuration data. The configuration is available as a dictionary object.

In [4]:
dlf.project.info()

version: 0.8.2
username: jovyan
session_name: dlf-tutorial
session_id: '0x2d17a182931911e9'
profile:
rootdir: /home/jovyan/work/tutorial
script_path: project.ipynb
dotenv_path: .env
notebooks_files:
  - main.ipynb
  - install.ipynb
  - resources.ipynb
  - engine.ipynb
  - load_compare.ipynb
  - metadata.ipynb
  - project.ipynb
  - loadsave.ipynb
  - scaffolding.ipynb
  - join.ipynb
  - logging.ipynb
  - events.ipynb
python_files:
  - __main__.py
metadata_files:
  - metadata.yml
repository:
    type: git
    committer: natbusa
    hash: a77df08
    commit: a77df08eeae3eac6fc411df77cec4022a1047dfc
    branch: master
    url: https://github.com/natbusa/dlf-tutorial
    name: dlf-tutorial
    date: '2019-06-19T07:25:05+07:00'
    clean: false

### Loading a specific profile

Explicitly load a different profile.  
In this case, the `prod` profile connects to a cluster in client mode.

In [5]:
# Loading the prod profile
project = dlf.project.load('prod')

Init engine "spark"
Configuring packages:
  -  com.microsoft.sqlserver:mssql-jdbc:6.4.0.jre8
  -  mysql:mysql-connector-java:8.0.12
  -  org.apache.hadoop:hadoop-aws:3.1.1
  -  org.postgresql:postgresql:42.2.5
Configuring conf:
  -  spark.hadoop.fs.s3a.access.key : ****** (redacted)
  -  spark.hadoop.fs.s3a.endpoint : http://minio:9000
  -  spark.hadoop.fs.s3a.impl : org.apache.hadoop.fs.s3a.S3AFileSystem
  -  spark.hadoop.fs.s3a.path.style.access : true
  -  spark.hadoop.fs.s3a.secret.key : ****** (redacted)
Connecting to spark master: spark://spark-master:7077
Engine context spark:2.4.1 successfully started
