# Architecture


## Overview
DGC consists of the following <b>components</b>:
- Services
    - DGC 
    - Repository 
    - Jobserver
    - Monitoring
- Console (used to manage the environments or instances)
- Clients (Web interface, On-the-go, Connect)

![title](ArchitectureEnvCollibra551.png "ShowMyImage")

- A node is a server
- Console can (and should) be installed on a separate node (as shown), but it can also run on the same node as the Repository and DGC services
- All environments managed by one Console must have the same version

### Internal Communication (between components)
![title](InternalCommunication551.png "ShowMyImage")

- Cloud solutions communicate over HTTPS by default (on-premise installations can use HTTPS too, but it is not the default)
- Every node of an environment runs an agent that handles communication between Console and the DGC component on that node (DGC, Repository or Jobserver service)
- Assumption: the agent is installed automatically whenever a component is installed on a node
- The DGC service, Jobserver service, Repository service and agent run as separate processes on the node

<table style="width:100%">
  <tr>
    <th>Communication path</th>
    <th>Protocol and port</th> 
  </tr>
  <tr>
    <td> 
        <ul>
          <li>Agent to Repository service </li>
          <li>DGC service to Repository service</li>
        </ul> 
    </td>
    <td>Send SQL statements over JDBC, via port 4403.</td> 
  </tr>
  <tr>
    <td> 
        <ul>
          <li>Agent to DGC service </li>
        </ul> 
    </td>
    <td>Send management-specific commands with a private REST interface
(JMX REST) over HTTP, via port 4400.</td> 
  </tr>
  <tr>
    <td> 
        <ul>
          <li>Collibra clients to Collibra DGC </li>
        </ul> 
    </td>
    <td>Access Collibra DGC through the public REST interface (REST) over HTTPS.</td> 
  </tr>
   <tr>
    <td> 
        <ul>
          <li>Agent to Jobserver service </li>
          <li>DGC service to Jobserver service </li>
        </ul> 
    </td>
    <td>Send job commands using a REST interface over HTTPS, via port 4404.</td> 
  </tr>
    <tr>
    <td> 
        <ul>
          <li>Jobserver service to Jobserver database</li>
        </ul> 
    </td>
    <td>Send SQL statements over JDBC via port 4414.</td> 
  </tr>
    <tr>
    <td> 
        <ul>
          <li>Agent to Monitoring service</li>
          <li>Monitoring service to DGC service</li>
        </ul> 
    </td>
    <td>Send job commands using a REST interface over HTTPS, via port 4407. The Monitoring service connects to the DGC service via port 4400.</td> 
  </tr>
</table> 
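
The communication paths in the table can also be kept as a small lookup structure, for example when drafting firewall rules. The following is only a sketch: the names `CommPath`, `COMM_PATHS` and `ports_for` are hypothetical, and the ports and protocols are transcribed from the table, not read from a running installation. The client row has no port in the table, so it is stored as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommPath:
    source: str
    target: str
    protocol: str
    port: Optional[int]  # None where the table does not name a port


# Rows transcribed from the communication table above (hypothetical structure).
COMM_PATHS = [
    CommPath("Agent", "Repository service", "JDBC", 4403),
    CommPath("DGC service", "Repository service", "JDBC", 4403),
    CommPath("Agent", "DGC service", "JMX REST over HTTP", 4400),
    CommPath("Collibra clients", "DGC service", "REST over HTTPS", None),
    CommPath("Agent", "Jobserver service", "REST over HTTPS", 4404),
    CommPath("DGC service", "Jobserver service", "REST over HTTPS", 4404),
    CommPath("Jobserver service", "Jobserver database", "JDBC", 4414),
    CommPath("Agent", "Monitoring service", "REST over HTTPS", 4407),
    # The table gives only the port for this path, not the protocol.
    CommPath("Monitoring service", "DGC service", "not stated", 4400),
]


def ports_for(target: str) -> list:
    """Return the sorted, de-duplicated ports on which `target` is contacted."""
    return sorted(
        {p.port for p in COMM_PATHS if p.target == target and p.port is not None}
    )


print(ports_for("Repository service"))
print(ports_for("DGC service"))
```

Keeping the rows as plain data makes it easy to compare them against the actual configuration of an environment before opening ports.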

### DGC service architecture
![title](CollibraServiceArchitecture.png "ShowMyImage")

- The prepackaged DGC installer contains:
    - application server
    - repository database (and service)
    - APIs
    
- Connect and On-the-go are separate products

## DGC service
### APIs
- The DGC service contains the business logic as a web application built in Java
- The public REST API enables custom-built products (the Collibra clients use the same API); all Java API methods are available through the REST API
    - https://yourdgcinstance.yourcompany.com/docs/index.html
- Import/Export, Views, Queries API for manipulating application data 
    - Supported formats are JSON, XML, CSV and Excel
    - Methods are integrated in the public REST API 
    - Can be integrated into ETL or ESB applications
- Search API
    - Searches for data in Collibra (the same API is used by Collibra On-the-go)
- BPMN 2.0 workflow engine
    - The workflow engine, Activiti, supports the execution of BPMN 2.0 (Business Process Model and Notation) processes. The prepackaged workflows are completely configurable.
    - You can add, modify, and deploy workflows.
    - Workflow service tasks can use the available Java API, which enables you to automate various application tasks, like email notification, creating comments, adding assets, and so on.

### Data
- All data is stored in the repository
- Product metadata (temporary files, log files, license file) is stored in the Collibra data directory 
    - /opt/collibra_data (Linux) or C:\collibra_data (Windows).
    - contains subdirectories
        - <b>dgc</b>: Collibra DGC service
            - Configuration files, necessary to run the product:
                - The <b>config</b> directory contains the configuration files used by Collibra DGC.
                - The <b>collibra.license</b> is the license file that has been uploaded in Collibra Console.
            - The <b>logs</b> directory contains the log files produced by the DGC service
            - Directories like <b>cache</b>, <b>uploads</b>, <b>indexes</b> contain temporary files, which can be removed during a server restart 
                - <b>uploads</b> temporarily holds the files attached by users (they are stored permanently in the repository)
            - customization directories
                - <b>email-templates</b> Used to override the built-in email templates to customize the emails that are sent to the users.
                - <b>translations</b> Used to override the built-in user interface labels and to add new languages. You can also do this in the Settings in the web user interface.
                - <b>page-definitions</b> Used to override page definitions. 
                - <b>modules</b> Custom UI modules to extend or override the existing UI.
                - <b>styling</b> Used to override the CSS styling of the web interface.
                - <b>groovy-lib</b> Contains additional Groovy library functions to be used in validation rules.
                - <b>images</b> Contains images that can be referenced directly as a URL, for example to set another logo.
                - <b>security</b> Used for SSL and SAML support.
        - <b>repo</b>: Collibra Repository service
        - <b>console</b>: Collibra Console
        - <b>agent</b>: Collibra DGC Agent
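
As an illustration of the directory layout listed above, the per-component paths can be composed with `pathlib`. This is a minimal sketch under assumptions: `DATA_ROOTS`, `COMPONENT_DIRS` and `component_dir` are hypothetical names, and the roots are the default locations named in these notes, not values detected from an installation.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Default data directory roots as described in these notes (assumed defaults).
DATA_ROOTS = {
    "linux": PurePosixPath("/opt/collibra_data"),
    "windows": PureWindowsPath(r"C:\collibra_data"),
}

# Subdirectories of the data directory, one per component.
COMPONENT_DIRS = ("dgc", "repo", "console", "agent")


def component_dir(platform: str, component: str):
    """Return the data directory of one component on the given platform."""
    if component not in COMPONENT_DIRS:
        raise ValueError(f"unknown component: {component}")
    return DATA_ROOTS[platform] / component


# Pure paths only build strings; nothing on disk is touched.
print(component_dir("linux", "dgc") / "logs")
print(component_dir("windows", "repo"))
```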

## Repository service
- Embedded PostgreSQL, managed by Collibra Console and Agent
- Console performs periodic maintenance of the database
### Data
- The repository is located in /opt/collibra_data/repo (Linux) or C:\collibra_data\repo (Windows) 
- Subdirectories
    - <b>logs</b> contains the log files
    - <b>data</b> contains the database data and the postgresql.conf file that configures the PostgreSQL database (manual changes are not supported!)

Another way to maintain and manipulate the data: 
https://docs.microsoft.com/de-de/sql/integration-services/import-export-data/connect-to-a-postgresql-data-source-sql-server-import-and-export-wizard?view=sql-server-2017

## Console
- Java web app
- Does not need other components to run
- Uses local file-based database
### Data
- Data from Console is stored in <b>/opt/collibra_data/console</b> (Linux) or <b>C:\collibra_data\console</b> (Windows)
- Same directory structure as the DGC service, plus these two additional entries
    - <b>backups</b>: for backups taken by Console
    - <b>data.mv.db</b>: contains information about the embedded local database (not the database itself)

## Jobserver service
- Application that relies on Apache Spark for CPU- and memory-intensive computations
- Jobserver acts as the interface between the DGC service and Spark
    - Jobserver sends Spark job executions via a REST API
    - Jobserver controls the individual Spark jobs and the data used by Spark
- Jobserver is used for <b>data profiling</b> for the <b>Catalog application</b>
- Background: for a profiling operation, the Jobserver starts a Java VM that contains the Spark context. The profiling execution runs within that JVM and returns its results via the Jobserver to the DGC service. 
- <b>Only one profiling operation</b> can run at a time. 
- Jobserver is managed by the Console through an agent

### Data
- <b>/opt/collibra_data/jobserver</b> (Linux) or <b>C:\collibra_data\jobserver</b> (Windows)
- Subdirectories
    - <b>logs</b>: contains the log files
    - <b>data</b>: contains data used during runtime, but nothing essential for the app to maintain
    - <b>config</b>: Jobserver configs
    - <b>security</b>: contains public and private keys needed to use SSL for Jobserver communication via REST API (of the Jobserver)

### Memory and CPU usage

# Installation/Uninstallation
## Installation

### Preparation
Make the installer script executable:

In [None]:
!chmod a+x dgc-linux-5.4.3-FINAL.sh

### Installation of collibra components
Test

### First use
- Console is accessible via the URL: http://urlinstance:4402/contextpath
- The Console admin user is "Admin" and the initial password is "admin"
- Change the initial password
- Rename the environment
- Start the environment 
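
The Console URL pattern above can be assembled with a small helper. This is only a sketch: `console_url` is a hypothetical function, `urlinstance` and `contextpath` are the placeholders from these notes, and 4402 is the port given above.

```python
def console_url(host: str, context_path: str = "", port: int = 4402) -> str:
    """Build the Console URL in the form http://host:port/contextpath."""
    path = context_path.strip("/")
    base = f"http://{host}:{port}"
    # Leave out the trailing segment when no context path is configured.
    return f"{base}/{path}" if path else base


print(console_url("urlinstance", "contextpath"))
```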

### Troubleshooting
- If Console won't start, try a reboot first
- ContextPath does not work for DGC at the moment

## Migration


## Maintenance
- Check the status of the services on the server

In [None]:
!service collibra-console status
!service collibra-agent status

- Other commands available for these two services are
    - stop
    - start
    - restart
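
The service commands above can be generated from Python, for example to check both services in one go. This is a sketch under assumptions: `service_command`, `SERVICES` and `ACTIONS` are hypothetical names, and only the service names and actions mentioned in these notes are accepted.

```python
import shlex

# Service names and actions taken from the notes above.
SERVICES = ("collibra-console", "collibra-agent")
ACTIONS = ("status", "start", "stop", "restart")


def service_command(service: str, action: str) -> list:
    """Return the argument list for `service <name> <action>`."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["service", service, action]


# Print the commands instead of running them, so the sketch has no side effects.
for svc in SERVICES:
    print(shlex.join(service_command(svc, "status")))
```

On the server itself, the returned list could be passed to `subprocess.run`; stopping, starting or restarting a service will typically require root permissions.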

## Uninstallation

In [1]:
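# Switch to a root shell
!sudo -s
# Switch back to a regular user; replace <user> with the actual user name
# (left as-is, the shell treats the angle brackets as redirection and fails)
!sudo -s -u <user>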

[sudo] password for gleuschel: 
/bin/sh: 1: Syntax error: end of file unexpected


1. Go to the installation directory
    - Default location on Linux with root permissions: /opt/collibra
    - Default location on Linux without root permissions: ~/collibra
2. Start the uninstall script
    - Uninstall with root permissions: sudo ./uninstall.sh
    - Uninstall without root permissions: ./uninstall.sh
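
The two uninstall variants differ only in the installation directory and in whether `sudo` is used. A minimal sketch of that decision, assuming the default locations listed above; `uninstall_plan` is a hypothetical helper that only returns the directory and the command, it does not run anything.

```python
from pathlib import PurePosixPath


def uninstall_plan(as_root: bool, home: str = "~"):
    """Return (installation directory, command) for the uninstall steps."""
    if as_root:
        return PurePosixPath("/opt/collibra"), ["sudo", "./uninstall.sh"]
    # Without root permissions the default location is inside the home directory.
    return PurePosixPath(home) / "collibra", ["./uninstall.sh"]


for as_root in (True, False):
    directory, command = uninstall_plan(as_root)
    print(directory, " ".join(command))
```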

# Create Docker Image
1. Create a Dockerfile 
2. Run the build command

In [1]:
!docker build -t collibra:5.4.3 ./

The command "docker" is either misspelled or
could not be found.
