# Datalabframework

The datalabframework is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

In [1]:
import datalabframework as dlf

### Load a project

A core principle is keeping configuration and code in separate files. A project defines the working directories from which your notebooks, Python files, and configuration files are found and run. When a datalabframework project is loaded, it starts by searching for a `__main__.py` file, following Python module file naming conventions. When such a file is found, its directory becomes the root path for the project. All module and alias paths are relative to the project root path.
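
The root detection described above can be sketched as a walk up the directory tree. This is a minimal illustration, not the framework's actual code: `find_project_root` and `MARKER_FILES` are hypothetical names, and the `main.ipynb` marker and the fallback to the starting directory are taken from the help text of `dlf.project.load`.

```python
from pathlib import Path
from typing import Optional

# Marker files that identify a project root (as listed in the load() help text).
MARKER_FILES = ("__main__.py", "main.ipynb")


def find_project_root(start: Optional[Path] = None) -> Path:
    """Hypothetical helper: return the nearest directory containing a marker file.

    Checks `start` first, then each parent directory in turn. If no marker
    is found, `start` itself (the current working directory by default)
    is returned.
    """
    start = Path(start or Path.cwd()).resolve()
    for directory in (start, *start.parents):
        if any((directory / name).is_file() for name in MARKER_FILES):
            return directory
    return start
```

Calling `find_project_root()` from a notebook in a subdirectory would return the first enclosing directory that holds one of the marker files.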

A profile is loaded with the `datalabframework.project.load` function, which looks for files ending with `metadata.yml`. The function can optionally set the current working directory and import the `key=value` pairs of a `.env` file into the Python OS environment. If no parameters are specified, the default profile is loaded.
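
To make the `.env` step concrete, here is a rough sketch of importing `key=value` pairs into the process environment. The function `import_dotenv` is a hypothetical name, and the parsing rules (skipping blank lines and `#` comments, no quote handling) are assumptions for illustration; the framework's own parser may behave differently.

```python
import os
from pathlib import Path


def import_dotenv(path, environ=None):
    """Hypothetical helper: copy key=value pairs from a .env file into `environ`.

    `environ` defaults to os.environ. Returns the pairs that were loaded.
    A missing file is not an error and yields an empty dict.
    """
    environ = os.environ if environ is None else environ
    path = Path(path)
    loaded = {}
    if not path.is_file():
        return loaded
    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        # Assumption: blank lines, comments, and lines without '=' are ignored.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        loaded[key.strip()] = value.strip()
    environ.update(loaded)
    return loaded
```

Passing a plain dictionary as `environ` makes it possible to inspect what would be set without touching the real environment.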

In [2]:
help(dlf.project.load)

Help on function load in module datalabframework.project:

load(profile='default', rootpath=None)
    Performs the following steps:
        - set rootdir for the given project
        - import variables from  <rootdir>/.env (if present),
        - load the `profile` from the metadata files
        - setup and start the data engine
    
    :param profile: load the given metadata profile (default: 'default')
    
    :param rootpath: root directory for loaded project 
           default behaviour: search parent dirs to detect rootdir by 
           looking for a '__main__.py' or 'main.ipynb' file. 
           When such a file is found, the corresponding directory is the 
           root path for the project. If nothing is found, the current 
           working directory, will be the rootpath
    
    :return: None
    
    Notes about metadata configuration:
    
    1)  Metadata files are merged up, so you can split the information in 
        multiple files as long as they end with `

### Project Configuration

In [3]:
help(dlf.project.config)

Help on function config in module datalabframework.project:

config()



In [4]:
# Loading default profile
project = dlf.project.load()

Init engine "spark"
Loading detected packages:
  -  org.apache.hadoop:hadoop-aws:3.1.1
  -  com.microsoft.sqlserver:mssql-jdbc:6.4.0.jre8
  -  mysql:mysql-connector-java:8.0.12
  -  org.postgresql:postgresql:42.2.5
Connecting to spark master: local[*]
Engine context spark:2.4.1 successfully started


The following displays the metadata profile and configuration data loaded for the project. The configuration is available as a dictionary object.

In [5]:
dlf.project.config()

version: 0.7.1
username: jovyan
session_id: '0x51de6ac0603b11e9'
profile: default
rootdir: /home/jovyan/work/tutorial
script_path: load.ipynb
dotenv_path:
notebooks_files:
  - main.ipynb
  - install.ipynb
  - versions.ipynb
  - metadata.ipynb
  - project.ipynb
  - scaffolding.ipynb
  - logging.ipynb
  - hello.ipynb
  - load.ipynb
python_files:
  - __main__.py
metadata_files:
  - metadata.yml
repository:
    type: git
    committer: natbusa
    hash: 6b01799
    commit: 6b017991f154d237345349ef7d2d4fa69fa9f8e4
    branch: master
    url: https://github.com/natbusa/dlf-tutorial
    name: dlf-tutorial
    date: '2019-04-15T10:14:10+00:00'
    clean: false
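
Since the configuration is a dictionary, individual values can be read with ordinary key lookups. The sketch below uses a hand-written stand-in for the result of `dlf.project.config()`, with a few values copied from the output above, so that it runs without a loaded project. It assumes that nested sections such as `repository` are plain dictionaries as well; `describe` is a hypothetical helper.

```python
# Stand-in for the dictionary returned by dlf.project.config().
# Values are copied from the output shown above; nested dicts are an assumption.
config = {
    "profile": "default",
    "rootdir": "/home/jovyan/work/tutorial",
    "metadata_files": ["metadata.yml"],
    "repository": {"branch": "master", "clean": False},
}


def describe(config):
    """Hypothetical helper: build a one-line summary of a project configuration."""
    repo = config.get("repository", {})
    state = "clean" if repo.get("clean") else "not clean"
    return f"{config['profile']} @ {config['rootdir']} ({repo.get('branch')}, {state})"


summary = describe(config)
print(summary)
```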

### Loading a specific profile

Explicitly load a different profile.  
In this case, the `prod` profile connects to a cluster in client mode.

In [6]:
# Loading the prod profile
project = dlf.project.load('prod')

Init engine "spark"
Loading detected packages:
  -  org.apache.hadoop:hadoop-aws:3.1.1
  -  com.microsoft.sqlserver:mssql-jdbc:6.4.0.jre8
  -  mysql:mysql-connector-java:8.0.12
  -  org.postgresql:postgresql:42.2.5
Connecting to spark master: spark://spark-master:7077
Engine context spark:2.4.1 successfully started
