In [1]:
# Make parent folders available to import from
import inspect
import os
import sys

currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

# Introduction to Pipelines and DataTransformers Usage

This short notebook demonstrates how the data transformer class and the classes derived from it (pipelines, predictive models, etc.) are implemented and used.

## Data transformers

The basic structure of the abstract class/interface is:

```python
from abc import ABC, abstractmethod


class BaseDataTransformer(ABC):

    def __init__(self, inputFormat, outputFormat):
        self.parent = None
        self.inputFormat = inputFormat
        self.outputFormat = outputFormat

    @abstractmethod
    def transform(self, data):
        pass
        
    [...]
```
To define a concrete data transformer class, one must implement the `transform` method and set the properties `inputFormat` and `outputFormat`. A simple example could be:

```python
class fromHTMLSourceToJSONPrograms(BaseDataTransformer):
    """
    Transforms raw html files into JSON of the sentences and timestamps.
    """

    def __init__(self):
        inputFormat="HTMLSource"
        outputFormat="JSONPrograms"
        super().__init__(inputFormat,outputFormat)

    def transform(self, HTMLSource):
        files, program_id = self.getRawFilePaths(HTMLSource.data)
        JSONPrograms = self.getCleanedPrograms(files, program_id)
        return JSONPrograms
```

We call `super().__init__(inputFormat, outputFormat)`, which is already defined in the base class, to set the properties.
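
To make the pattern concrete without the project data, here is a minimal, self-contained sketch. It repeats a reduced copy of the base class and adds a toy transformer; `fromTextToTokens` and the format names `"Text"` and `"Tokens"` are hypothetical and not part of the project.

```python
from abc import ABC, abstractmethod


class BaseDataTransformer(ABC):
    """Reduced copy of the base class shown above."""

    def __init__(self, inputFormat, outputFormat):
        self.parent = None
        self.inputFormat = inputFormat
        self.outputFormat = outputFormat

    @abstractmethod
    def transform(self, data):
        pass


class fromTextToTokens(BaseDataTransformer):
    """Hypothetical toy transformer: splits a string into whitespace tokens."""

    def __init__(self):
        # The base class constructor stores the two format names.
        super().__init__("Text", "Tokens")

    def transform(self, text):
        return text.split()


text2tokens = fromTextToTokens()
tokens = text2tokens.transform("pipelines chain data transformers")
print(text2tokens.inputFormat, "->", text2tokens.outputFormat)  # Text -> Tokens
print(tokens)  # ['pipelines', 'chain', 'data', 'transformers']
```

Because `transform` is declared with `@abstractmethod`, instantiating `BaseDataTransformer` directly raises a `TypeError`; only subclasses that implement `transform` can be created.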

**Data:** the input data to transform should be well-defined for a specific transformer, but it can be defined in any way. The same is true for the output.


### Example Usage

An example of an actual implementation is given below. We use two data transformers to illustrate the use.

In [2]:
# Import Data Transformers
from util.dataClass import formatHTMLDownload, opts
from util.dataTransformers import fromHTMLSourceToJSONPrograms, fromJSONProgramsToHTMLInput

In [3]:
# Define root path to data
rootPath='/media/data/Dropbox/'

# Define input data
HTMLSource= formatHTMLDownload(data=rootPath+'DeepFactData/subtitles crawl/')
print(HTMLSource.data)
print(HTMLSource.opts)

/media/data/Dropbox/DeepFactData/subtitles crawl/
[]


In [4]:
# Construct data transformers
HTMLSource2JSONPrograms=fromHTMLSourceToJSONPrograms()
JSONPrograms2HTMLInput=fromJSONProgramsToHTMLInput()

In [5]:
# Transform data
JSONPrograms=HTMLSource2JSONPrograms.transform(HTMLSource)
HTMLInput=JSONPrograms2HTMLInput.transform(JSONPrograms)

# showing first part of the HTMLInput data object
next(iter(HTMLInput.values()))[:50]

'<span id="program 1012335">\n\t<p id="1012335"> Pr.n'

## Pipelines

Pipelines chain other objects together so that multiple data transformers, including (trained) predictive models and other pipelines, can be applied with a single call to `transform`.

The implementation defines the constructor and transform method as:

```python
class BasePipeline(BaseDataTransformer):
    """
    Pipeline of several data transformations.

    A pipeline may consist of both data transformers, predictive models, and
    pipelines or any combination of these
    """

    def __init__(self, dataTransformers):
        self.dataTransformers = []
        for dataTransformer in dataTransformers:
            self.addChild(dataTransformer)

    def transform(self,data):
        for dataTransformer in self.dataTransformers:
            data = dataTransformer.transform(data)
        return data

[...]
```

The pipeline constructor is given a list of data transformers, predictive models, and other pipelines in the order in which the data should be transformed. The `addChild()` method checks that the outputFormat of each link matches the inputFormat of the next link in the pipeline. The inputFormat and outputFormat of the pipeline itself are set automatically based on the given list of dataTransformers.
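
The body of `addChild()` is not shown in the excerpt above, so the following is only a sketch of how such a check could work, not the project's implementation. `Step` and `SketchPipeline` are hypothetical stand-ins, and raising a `ValueError` on a mismatch and setting `child.parent` are assumptions.

```python
class Step:
    """Hypothetical stand-in for a data transformer with declared formats."""

    def __init__(self, inputFormat, outputFormat, func):
        self.parent = None
        self.inputFormat = inputFormat
        self.outputFormat = outputFormat
        self.func = func

    def transform(self, data):
        return self.func(data)


class SketchPipeline:
    """Illustrative pipeline; not the project's BasePipeline."""

    def __init__(self, dataTransformers):
        self.parent = None
        self.inputFormat = None
        self.outputFormat = None
        self.dataTransformers = []
        for dataTransformer in dataTransformers:
            self.addChild(dataTransformer)

    def addChild(self, child):
        # Assumption: a format mismatch between neighbouring links is an error.
        if self.dataTransformers and child.inputFormat != self.outputFormat:
            raise ValueError(
                f"Cannot append a step expecting {child.inputFormat!r} "
                f"after a step producing {self.outputFormat!r}"
            )
        # The first child defines the pipeline's input format,
        # the most recently added child defines its output format.
        if not self.dataTransformers:
            self.inputFormat = child.inputFormat
        self.outputFormat = child.outputFormat
        child.parent = self
        self.dataTransformers.append(child)

    def transform(self, data):
        for dataTransformer in self.dataTransformers:
            data = dataTransformer.transform(data)
        return data


split = Step("Text", "Tokens", str.split)
count = Step("Tokens", "Count", len)

pipeline = SketchPipeline([split, count])
print(pipeline.inputFormat, "->", pipeline.outputFormat)  # Text -> Count
print(pipeline.transform("a b c"))  # 3

# Reversing the order makes the formats incompatible.
try:
    SketchPipeline([count, split])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```

Since the sketch pipeline exposes `inputFormat`, `outputFormat`, and `transform` itself, it can be passed as a child to another pipeline, which is what allows pipelines to be nested.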

### Example Usage

In [6]:
# Import base pipeline class
from util.baseClasses import BasePipeline

First we construct a pipeline from the two previous data transformers:

In [7]:
# Construct pipeline
HTMLSource2HTMLInput = BasePipeline([HTMLSource2JSONPrograms,JSONPrograms2HTMLInput])

# Use pipeline to transform data
HTMLInput=HTMLSource2HTMLInput.transform(HTMLSource)

# showing first part of the HTMLInput data object
next(iter(HTMLInput.values()))[:50]

'<span id="program 1012335">\n\t<p id="1012335"> Pr.n'

We may also combine pipelines (and any combination of pipelines and data transformers), as long as the output format of each link fits the input format of the next:

In [8]:
# Construct pipeline of pipelines
HTMLSource2HTMLInput_part1 = BasePipeline([HTMLSource2JSONPrograms])
HTMLSource2HTMLInput_part2 = BasePipeline([JSONPrograms2HTMLInput])

HTMLSource2HTMLInput = BasePipeline([HTMLSource2HTMLInput_part1,HTMLSource2HTMLInput_part2])

# Use pipeline to transform data
HTMLInput=HTMLSource2HTMLInput.transform(HTMLSource)

# showing first part of the HTMLInput data object
next(iter(HTMLInput.values()))[:50]

'<span id="program 1012335">\n\t<p id="1012335"> Pr.n'

Pipelines that will be used many times can be implemented as classes so that they do not have to be defined repeatedly:

In [9]:
# Import implemented pipeline
from util.dataPipelines import HTMLSource2HTMLInput

# Use pipeline to transform data
HTMLInput = HTMLSource2HTMLInput().transform(HTMLSource)

# showing first part of the HTMLInput data object
next(iter(HTMLInput.values()))[:50]

'<span id="program 1012335">\n\t<p id="1012335"> Pr.n'
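
How `util.dataPipelines` defines `HTMLSource2HTMLInput` is not shown in this notebook. One plausible way to package a reusable pipeline is a subclass whose constructor fixes the list of steps. The sketch below uses hypothetical stand-ins (`Step`, `MiniPipeline`, `Text2TokenCount`) and omits the format checks for brevity.

```python
class Step:
    """Hypothetical stand-in for a data transformer."""

    def __init__(self, func):
        self.func = func

    def transform(self, data):
        return self.func(data)


class MiniPipeline:
    """Reduced pipeline without format checks, for illustration only."""

    def __init__(self, dataTransformers):
        self.dataTransformers = list(dataTransformers)

    def transform(self, data):
        for dataTransformer in self.dataTransformers:
            data = dataTransformer.transform(data)
        return data


class Text2TokenCount(MiniPipeline):
    """Predefined pipeline: the steps are fixed in the constructor."""

    def __init__(self):
        super().__init__([Step(str.split), Step(len)])


# The caller no longer needs to know which steps the pipeline contains.
token_count = Text2TokenCount().transform("a b c")
print(token_count)  # 3
```

With this layout the caller constructs the pipeline without arguments, which matches the `HTMLSource2HTMLInput()` call used above.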

## Predictive Models (pretrained)

*To do*