# Example for CML builder

<p>This Jupyter notebook demonstrates how to build a CatalystML (CML) JSON that can then be used in a CML implementation.</p>

<p>To do this we create a random dataframe in pandas and apply some common transformations to it.  In parallel we construct a CML JSON with the Python builder API.</p>

## Imports
<p>pandas is the external library used to build the Python data transformation example; random and datetime come from the Python standard library.</p>
<p>cmlmaker is the library used to build the CML structure.</p>

In [1]:
import cmlmaker as cml
import pandas as pd
import random
import datetime

## Creating fake structured data

In [2]:
l=20
size=75;msize=25;tsize=msize+size
size2=123.45
df=pd.DataFrame(
    data={
        "one":[random.choice(["red","blue","green","yellow"]) for r in range(l)],
        "two":[25+random.random()*size for r in range(l)],
        "three":[random.choice(['married','not married']) for r in range(l)],
        "four":[random.randint(0,1) for r in range(l)],
        "five":[random.random()*size2 for r in range(l)]
    }
)
df

Unnamed: 0,one,two,three,four,five
0,blue,36.928111,married,0,94.909203
1,green,84.636071,married,1,117.453248
2,red,86.885224,married,1,117.869603
3,yellow,61.518343,not married,0,18.85896
4,blue,85.23112,not married,1,26.283391
5,blue,43.133245,not married,1,91.73731
6,blue,30.303187,married,1,119.388642
7,red,85.57105,not married,0,51.805499
8,blue,39.936072,married,0,96.775865
9,red,66.830682,not married,1,119.273075
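
<p>The values in the table above change on every run because the random module is not seeded in the cell that builds the dataframe.  If repeatable fake data is wanted, seeding the generator first is one option.  The following is a minimal standard-library sketch; the seed value 42 and the names n_rows, first_run and second_run are illustrative choices, not part of the original notebook.</p>

```python
import random

# Same offsets as in the cell above: values fall between msize and msize+size.
msize, size = 25, 75
n_rows = 5

random.seed(42)  # arbitrary seed; any fixed value gives repeatable output
first_run = [msize + random.random() * size for _ in range(n_rows)]

random.seed(42)  # reseeding with the same value restarts the same sequence
second_run = [msize + random.random() * size for _ in range(n_rows)]

print(first_run == second_run)  # True
```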


### Initializing a CML structure

<p>The first step in building a CML structure is to initialize the Python object with the CML structure name and description.  You can also include additional metadata through optional inputs (use help(cml.structure) to explore them).</p>

In [3]:
#Let's call our object cs, for CML structure
cs=cml.structure("structuredClean","Cleaning some basic structured data")

#Update version and date
help(cs.updateVersion)
cs.updateVersion('1.0.0')

help(cs.updateCreatedDate)
cs.updateCreatedDate(datetime.datetime(2001,1,1).strftime("%Y%m%d"))

print(cs.version,cs.createdDate)

Help on method updateVersion in module cmlmaker:

updateVersion(version) method of cmlmaker.structure instance
    Version must be a string of the for 0.0.0

Help on method updateCreatedDate in module cmlmaker:

updateCreatedDate(date) method of cmlmaker.structure instance
    Created date should be a string with a sugested format of %Y%m%d

0.0.0 20010101
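
<p>The help text above asks for a version string of the form 0.0.0 and suggests %Y%m%d for the created date.  As a rough sketch, both formats can be checked with the standard library before they are passed to the structure.  The helper names looks_like_version and looks_like_created_date are hypothetical and are not part of cmlmaker.</p>

```python
import datetime
import re

def looks_like_version(text):
    """Return True if text has the form N.N.N (digits separated by dots)."""
    return re.fullmatch(r"\d+\.\d+\.\d+", text) is not None

def looks_like_created_date(text):
    """Return True if text parses with the suggested %Y%m%d format."""
    try:
        datetime.datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True

created = datetime.datetime(2001, 1, 1).strftime("%Y%m%d")
print(created, looks_like_created_date(created), looks_like_version("1.0.0"))
```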


<p>The next step is to define the incoming data (the fake data created above) within the CML structure object.  Since the input is a dataframe, the type will be "map".  We can label the data however we like; let's call it "df".</p>

<p>To add an input to your initialized structure use the method addInput.  If we use the Python command `help` on cs.addInput we can see that the input is an inobj object.  Using `help` on inobj then shows that the type and label are required to initialize it.</p>

In [4]:
help(cs.addInput)
help(cml.inobj)

Help on method addInput in module cmlmaker:

addInput(inobj) method of cmlmaker.structure instance

Help on class inobj in module cmlmaker:

class inobj(builtins.object)
 |  the structure of an input object has type,label, dim, and shape
 |  type (1st) and label(2nd) are required with dim and shape optional
 |  
 |  Methods defined here:
 |  
 |  __init__(self, typ, label, dim=None, shape=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  make_map(self)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



<p>Therefore we can add the input with the following commands.</p>

In [5]:
typ="map";label="df"
cs=cs.addInput(cml.inobj(typ,label))
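
<p>For orientation, the input declared here appears under the "input" key of the JSON printed at the end of this notebook.  The sketch below rebuilds that fragment with plain dictionaries and the json module, without cmlmaker.  The example_map variable rests on an assumption: that a "map" is a mapping from column name to column values.</p>

```python
import json

# The declaration as it appears under "input" in the printed CML JSON.
input_declaration = {"type": "map", "label": "df"}

# Assumption: a "map" pairs each column name with that column's values.
example_map = {
    "one": ["blue", "green"],
    "four": [0, 1],
}

fragment = json.dumps({"input": [input_declaration]}, indent=4)
print(fragment)
```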

### Operations
<p>Now we need to add the operations that will transform the data.</p>

#### Discovering operations

<p>How do we discover the operation we need?  Since oneHotEncoding is discussed in the next section, let's use it as an example.  There are a few ways to discover operations (assuming you have a general understanding of what you need):</p>
<ul>
    <li>explore operations through cml.ops.CATEGORY.OPERATION, i.e. help(cml.ops); help(cml.ops.CATEGORY); help(cml.ops.CATEGORY.OPERATION)</li>
    <li>use Python's tab completion to expand the start of an operation name: on:tab::tab: -> oneHotEncoding</li>
    <li>list all operations with cml.ops.listAllOps()</li>
</ul>
<p>Once an operation's name is known (possibly with category too) the operation can be called either directly under the cml package (cml.OPERATION) or through the ops class (cml.ops.CATEGORY.OPERATION).</p>

In [6]:
help(cml.ops)

Help on class ops in module cmlmaker:

class ops(builtins.object)
 |  Methods defined here:
 |  
 |  listAllOps()
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  cleaning = <class 'cmlmaker.ops.cleaning'>
 |      Operations that fall best under data cleaning
 |  
 |  image_processing = <class 'cmlmaker.ops.image_processing'>
 |      Image processing related operations
 |  
 |  math = <class 'cmlmaker.ops.math'>
 |      Math based operations
 |  
 |  nlp = <class 'cmlmaker.ops.nlp'>
 |      Natural Language Processing(NLP) related operations
 |  
 |  restructuring = <class 'cmlmaker.ops.restructuring'>
 |      Operations that restructure da

In [7]:
cml.ops.listAllOps()

Category:          Op name:              Description:
restructuring      map2table              convert a map to a matrix
restructuring      groupBy                group by a given column in an axis and aggregate the value of ano
restructuring      reshape                change the dimensionality of a matrix without changing the underl
restructuring      transpose              transpose a matrix
restructuring      dropCol                Remove cols from matrix or map
restructuring      cast                   Convert the base datatype of a data structure or datatype from on
restructuring      pivot                  group by a given column in an axis and aggregate the value of ano
restructuring      addCol2Table           Add a column to a matrix
restructuring      flatten                reduce multidimensional lists to single dimension
restructuring      table2map              convert a matrix to a map by adding a name to each column
restructuring      join                   group by a 
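
<p>When the listing is long, it can help to search it by keyword.  The sketch below does this over a few rows copied by hand from the listing above.  It does not call cmlmaker, and the names listed_ops and find_ops are hypothetical.</p>

```python
# A few rows copied from the listing above: (category, op name, description).
listed_ops = [
    ("restructuring", "map2table", "convert a map to a matrix"),
    ("restructuring", "transpose", "transpose a matrix"),
    ("restructuring", "dropCol", "Remove cols from matrix or map"),
    ("restructuring", "addCol2Table", "Add a column to a matrix"),
    ("restructuring", "flatten", "reduce multidimensional lists to single dimension"),
]

def find_ops(ops, keyword):
    """Return op names whose name or description mentions keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [name for _, name, desc in ops
            if keyword in name.lower() or keyword in desc.lower()]

print(find_ops(listed_ops, "col"))  # ['dropCol', 'addCol2Table']
```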

#### oneHotEncoding

<p>One-hot encoding is a common data science task that converts a categorical column into several 1/0 columns, one per category.  Here we encode five categories even though only four appear in the short input dataset.</p>

In [8]:
#Performing a one-hot encoding for 5 classes (only 4 in the input data) on the data
colors=["red","blue","green","yellow","purple"]
for color in colors:
    df.loc[:,color]=0
    df.loc[df['one']==color,color]=1
    print(color)
df=df.drop('one',axis=1)
df

red
blue
green
yellow
purple


Unnamed: 0,two,three,four,five,red,blue,green,yellow,purple
0,36.928111,married,0,94.909203,0,1,0,0,0
1,84.636071,married,1,117.453248,0,0,1,0,0
2,86.885224,married,1,117.869603,1,0,0,0,0
3,61.518343,not married,0,18.85896,0,0,0,1,0
4,85.23112,not married,1,26.283391,0,1,0,0,0
5,43.133245,not married,1,91.73731,0,1,0,0,0
6,30.303187,married,1,119.388642,0,1,0,0,0
7,85.57105,not married,0,51.805499,1,0,0,0,0
8,39.936072,married,0,96.775865,0,1,0,0,0
9,66.830682,not married,1,119.273075,1,0,0,0,0
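
<p>The same idea can be sketched without pandas.  The following minimal example encodes the first four values of column 'one' from the original table and shows that a category with no occurrences ('purple') still gets its own all-zero column.  The function name one_hot is illustrative only.</p>

```python
def one_hot(values, categories):
    """Return {category: [0/1, ...]} with one indicator list per category."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

colors = ["red", "blue", "green", "yellow", "purple"]
column_one = ["blue", "green", "red", "yellow"]  # first four rows of column 'one'

encoded = one_hot(column_one, colors)
print(encoded["blue"])    # [1, 0, 0, 0]
print(encoded["purple"])  # [0, 0, 0, 0] because 'purple' never occurs
```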


In [9]:
#Show the values needed for oneHotEncoding's inputs and params objects
help(cml.oneHotEncoding)

Help on class oneHotEncoding in module cmlmaker:

class oneHotEncoding(operation)
 |  convert categorical vector into a set of vectors for each category with a 0/1
 |  
 |  Method resolution order:
 |      oneHotEncoding
 |      operation
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, inputs=None, params=None, output=None)
 |      Initialize oneHotEncoding operation and define inputs, parameters, and outputs
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  inputs = <class 'cmlmaker.oneHotEncoding.inputs'>
 |      inputs's possible keys:
 |              data (required)- 2D table to be converted to map
 |  
 |  params = <class 'cmlmaker.oneHotEncoding.params'>
 |      params's possible keys:
 |              inputColumns (required)- the columns to which one Hot Encodding should be applied
 |              outputColumns (optional)- list of keys for map that correspond to 0 to 

In [10]:
#Adding the operation to the CML

#Note:  You might notice that the input is '$df' even though the data is labeled as 'df'.
#       The '$' is used to disambiguate between labeled data and a plain string.
#       Therefore all references to data variables need to start with a '$'.

cs=cs.addOp(
    cml.ops.cleaning.oneHotEncoding(                                                        #operation we are adding
        cml.oneHotEncoding.inputs("$df"),                                                   #OHE input object being included in op
        cml.oneHotEncoding.params(inputColumns="three",outputColumns=colors,keepOrig=True), #OHE params object being included in op
        output="dfout"                                                                         #output label for resulting data
    )
)
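
<p>To illustrate the '$' convention described in the note above, here is one way a consumer of the JSON could tell a data reference from a literal string.  This is only a sketch under that stated convention.  The resolve function and the variables dictionary are hypothetical and do not describe how any particular CML implementation works.</p>

```python
def resolve(value, variables):
    """Return the labeled data for '$name' strings, or the value unchanged."""
    if isinstance(value, str) and value.startswith("$"):
        return variables[value[1:]]  # strip the '$' and look up the label
    return value

variables = {"df": {"one": ["red", "blue"]}}

print(resolve("$df", variables))  # the data labeled 'df'
print(resolve("df", variables))   # the plain string 'df'
```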

#### Replacing values
<p>Column 'three' has two categories, married and not married.  Since this is effectively a boolean categorization, we can substitute 1 for married and 0 for not married.  In python/pandas we can do this:</p>

In [11]:
df['three']=df['three'].apply(lambda x: 1 if x=="married" else 0)

<p>Using cml.ops.listAllOps() I discover that replaceValue is the operation I need.  Next I use help(cml.replaceValue) to explore the data needed to fill out the operation:</p>

In [12]:
help(cml.replaceValue)

Help on class replaceValue in module cmlmaker:

class replaceValue(operation)
 |  Given a map replaces data (key) with map value
 |  
 |  Method resolution order:
 |      replaceValue
 |      operation
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, inputs=None, params=None, output=None)
 |      Initialize replaceValue operation and define inputs, parameters, and outputs
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  inputs = <class 'cmlmaker.replaceValue.inputs'>
 |      inputs's possible keys:
 |              data (required)- data to be operated on
 |              replaceMap (optional)- map gives key to replace with value
 |              replaceKey (optional)- what is to be replaced
 |              replaceValue (optional)- what is to be replaced with
 |  
 |  params = <class 'cmlmaker.replaceValue.params'>
 |      params's possible keys:
 |              Axis (optional)

<p>So we see that we need data and replaceMap in the inputs, and Axis and Col in the params.  We also need to label the output; in this case we overwrite the previous label, "dfout".</p>

In [13]:
cs=cs.addOp(cml.replaceValue(cml.replaceValue.inputs("$dfout",{"married":1,"not married":0}),cml.replaceValue.params(Col='three'),output="dfout"))
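
<p>The effect of a replace map can be sketched in plain Python.  The example below applies the same map to the first four values of column 'three' from the original table.  The function name replace_values is illustrative, and leaving unmapped entries unchanged is an assumption of this sketch (the pandas lambda above maps them to 0 instead).</p>

```python
def replace_values(column, replace_map):
    """Replace each entry found in replace_map; leave other entries as they are."""
    return [replace_map.get(item, item) for item in column]

marital = ["married", "married", "married", "not married"]  # first four rows of 'three'
replaced = replace_values(marital, {"married": 1, "not married": 0})
print(replaced)  # [1, 1, 1, 0]
```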

#### Normalization

<p>Both column two (ranging from msize to tsize) and column five (ranging from 0 to size2) require normalization.  We discover the normalize operation the same way as before.  Its help shows that data and value are required inputs, while minval is optional.  We include minval for column two because its values start at msize rather than 0.</p>

In [14]:
help(cml.normalize)

Help on class normalize in module cmlmaker:

class normalize(operation)
 |  divide all values of array by value (i.e. x/value), if minvalue is given applies (x-minval)/(value-minvalue) where x is the data
 |  
 |  Method resolution order:
 |      normalize
 |      operation
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, inputs=None, params=None, output=None)
 |      Initialize normalize operation and define inputs, parameters, and outputs
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  inputs = <class 'cmlmaker.normalize.inputs'>
 |      inputs's possible keys:
 |              data (required)- data to get mean of
 |              value (required)- value to normalize with (if starting at 0)
 |              minval (optional)- min value to start normalize with (if not starting at 0)
 |  
 |  ----------------------------------------------------------------------
 |  Methods i

In [15]:
df.loc[:,'two']=(df.loc[:,'two']-msize)/(tsize-msize)

In [16]:
cs=cs.addOp(cml.normalize(cml.normalize.inputs("$dfout['two']",tsize,msize),output="dfout['two']"))

In [17]:
df.loc[:,'five']=df.loc[:,'five']/size2
cs=cs.addOp(cml.normalize(cml.normalize.inputs("$dfout['five']",size2),output="dfout['five']"))
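
<p>The formula quoted in the help text, x/value or (x-minval)/(value-minval), can be checked by hand on a single row.  The sketch below applies it to row 0 of the original table (two = 36.928111, five = 94.909203), so the results can be compared with row 0 of the cleaned dataframe.  The standalone normalize function here is an illustration of the formula, not the cmlmaker operation.</p>

```python
def normalize(x, value, minval=None):
    """Apply x/value, or (x - minval)/(value - minval) when minval is given."""
    if minval is None:
        return x / value
    return (x - minval) / (value - minval)

msize, tsize, size2 = 25, 100, 123.45  # same constants as in the notebook

two_scaled = normalize(36.928111, tsize, msize)  # column two uses a minimum
five_scaled = normalize(94.909203, size2)        # column five starts at 0

print(round(two_scaled, 6), round(five_scaled, 6))
```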

#### Cleaned python dataframe
<p>The cleaned pandas dataframe is shown here (column 'four' doesn't require cleaning):</p>

In [18]:
df

Unnamed: 0,two,three,four,five,red,blue,green,yellow,purple
0,0.159041,1,0,0.768807,0,1,0,0,0
1,0.795148,1,1,0.951424,0,0,1,0,0
2,0.825136,1,1,0.954796,1,0,0,0,0
3,0.486911,0,0,0.152766,0,0,0,1,0
4,0.803082,0,1,0.212907,0,1,0,0,0
5,0.241777,0,1,0.743113,0,1,0,0,0
6,0.070709,1,1,0.967101,0,1,0,0,0
7,0.807614,0,0,0.419648,1,0,0,0,0
8,0.199148,1,0,0.783928,0,1,0,0,0
9,0.557742,0,1,0.966165,1,0,0,0,0


### Output of structure
<p>Next we have to declare which data the structure returns.  To do that we add an outobj to the CML structure.</p>

In [19]:
help(cml.outobj)

Help on class outobj in module cmlmaker:

class outobj(builtins.object)
 |  the structure of an input object has type,data
 |  type (1st) and data (2nd) are required
 |  
 |  Methods defined here:
 |  
 |  __init__(self, typ, data)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  make_map(self)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



In [20]:
cs=cs.addOutput(cml.outobj("map","$dfout"))

### Viewing and saving JSON
<p>Viewing the CML JSON is as simple as printing the CML structure object:</p>

In [21]:
print(cs)

{
    "name": "structuredClean",
    "description": "Cleaning some basic structured data",
    "version": "0.0.0",
    "createdDate": "20010101",
    "input": [
        {
            "type": "map",
            "label": "df"
        }
    ],
    "structure": [
        {
            "operation": "oneHotEncoding",
            "input": {
                "data": "$df"
            },
            "params": {
                "inputColumns": "three",
                "outputColumns": [
                    "red",
                    "blue",
                    "green",
                    "yellow",
                    "purple"
                ],
                "keepOrig": true
            },
            "output": "dfout"
        },
        {
            "operation": "replaceValue",
            "input": {
                "data": "$dfout",
                "replaceMap": {
                    "married": 1,
                    "not married": 0
                }
            },
            "params": {


<p>Writing the JSON to a file is slightly more involved.  To write to a file, call the writeToFile method of the CML structure object (comment the call out if you do not want to write a file every time you run the notebook).  If you want a file that would duplicate an existing one to be written with an updated version number instead, include updateLabelVersion=True.</p>
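
<p>The internals of writeToFile are not shown in this notebook, so the following is only a standard-library sketch of the general idea of writing a JSON file and bumping a version number.  The helpers bump_patch and write_json are hypothetical, and the rule used here (bump the last version number when the target file already exists) is an assumption, not documented cmlmaker behavior.</p>

```python
import json
import os
import tempfile

def bump_patch(version):
    """Return version with its last number increased by one, e.g. 0.0.0 -> 0.0.1."""
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"

def write_json(structure, path):
    """Write structure to path; bump the patch version first if path already exists."""
    if os.path.exists(path):
        structure = dict(structure, version=bump_patch(structure["version"]))
    with open(path, "w") as handle:
        json.dump(structure, handle, indent=4)
    return structure

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "test.json")
    first = write_json({"name": "structuredClean", "version": "0.0.0"}, path)
    second = write_json(first, path)  # file exists now, so the version is bumped
    with open(path) as handle:
        on_disk = json.load(handle)

print(first["version"], second["version"], on_disk["version"])
```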

In [22]:
cs.writeToFile('test.json',updateLabelVersion=True)
print(cs)

{
    "name": "structuredClean",
    "description": "Cleaning some basic structured data",
    "version": "0.0.1",
    "createdDate": "20010101",
    "input": [
        {
            "type": "map",
            "label": "df"
        }
    ],
    "structure": [
        {
            "operation": "oneHotEncoding",
            "input": {
                "data": "$df"
            },
            "params": {
                "inputColumns": "three",
                "outputColumns": [
                    "red",
                    "blue",
                    "green",
                    "yellow",
                    "purple"
                ],
                "keepOrig": true
            },
            "output": "dfout"
        },
        {
            "operation": "replaceValue",
            "input": {
                "data": "$dfout",
                "replaceMap": {
                    "married": 1,
                    "not married": 0
                }
            },
            "params": {
