## This notebook demonstrates how to build an ML model from a set of **"Mutually Exclusive and Collectively Exhaustive"** rules.

### The steps are as follows:

1. The rules are expected in a .json format. Consult DemoRuleBasedSegmentation.xlsx and writePickleFile.ipynb/.html to understand how the data is laid out in staticClustering.pickle. This program reads staticClustering.pickle and does the rest of the work.

2. Every variable/feature/attribute must declare the range of values it can take. For a categorical variable this must be a list of all the values the variable can take. For a numerical variable, providing Min and Max is enough.

3. The first step validates all the cluster conditions. It checks that every variable mentioned in any rule of any cluster has the expected data type and lies within the declared range or set of expected values.

4. Based on the cluster conditions we curate a data set for the decision tree to train on. See curatedData.csv for an example.

5. Validate the exclusivity of the cluster conditions through the training accuracy of the "DecisionTreeClassifier". If your cluster conditions are mutually exclusive and collectively exhaustive, the accuracy will be 1.0, i.e. 100%.

6. At this point we construct a list of instances of the [Attribute Class] for the attributes used by the tree returned by "DecisionTreeClassifier".

7. Once the list of [Attribute Class] instances is built, we traverse the tree returned by "DecisionTreeClassifier" to build our simple "ClusterTree", and then assign the default conditions and expected data type for each attribute. At the end we store the "ClusterTree" in ".pickle" format for future use.
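
The steps above can be sketched on a toy rule set. The dictionary layout follows the keys used by the code in this notebook (`Attributes`, `Type`, `Ranges`, `Clusters`), but the attribute names, ranges and clusters are made up for illustration, and `curate` and `training_accuracy` are hypothetical helpers, a simplified stand-in for the notebook's own functions rather than a copy of them.

```python
from itertools import product

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rules; names and ranges are illustrative only.
rules = {
    "Attributes": {
        "Type": {"Adults": "Numerical", "Day": "Categorical"},
        "Ranges": {"Adults": [1, 4], "Day": ["fri", "mon"]},
    },
    "Clusters": {
        "solo": [{"Adults": [1, 1]}],  # no "Day" condition: any day
        "group_weekend": [{"Adults": [2, 4], "Day": ["fri"]}],
        "group_weekday": [{"Adults": [2, 4], "Day": ["mon"]}],
    },
}


def curate(rules):
    """Expand every rule into the corner points of its ranges."""
    ranges = rules["Attributes"]["Ranges"]
    cols = list(ranges)
    rows = []
    for cluster, criteria_list in rules["Clusters"].items():
        for criteria in criteria_list:
            # An attribute missing from a rule falls back to its full range.
            values = [list(criteria.get(c, ranges[c])) for c in cols]
            rows += [dict(zip(cols, combo), cluster=cluster) for combo in product(*values)]
    return pd.DataFrame(rows)


def training_accuracy(rules):
    """Fit a tree on the curated points and score it on the same points."""
    df = curate(rules)
    X = pd.get_dummies(df.drop(columns="cluster"), columns=["Day"], dtype=float)
    y = df["cluster"].values
    model = DecisionTreeClassifier(random_state=0).fit(X.values, y)
    return model.score(X.values, y)


print(training_accuracy(rules))  # exclusive rules: 1.0

# Make two rules overlap: "solo" now also claims Adults == 2 on any day.
overlapping = {**rules, "Clusters": {**rules["Clusters"], "solo": [{"Adults": [1, 2]}]}}
print(training_accuracy(overlapping))  # below 1.0
```

With overlapping rules the same point appears under two cluster labels, so no tree can classify every curated row correctly and the accuracy drops below 1.0.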


GitHub link to DecisionTreeClassifier: https://github.com/scikit-learn/scikit-learn/blob/1495f6924/sklearn/tree/tree.py 

We are simply repurposing the original "DecisionTreeClassifier" algorithm, reusing five attributes of the tree it returns after training on our curated data.

The attributes we use are threshold, feature, value, children_left and children_right, together with one known flag value, -2.

1. -2 is the flag that denotes a "leaf" (terminal) node in the original decision tree.
2. "threshold" is a 1D array. Each index is a proxy Node Id. The entry at each index is the "deciding" value at that node, unless the entry is -2. In that case the node is a leaf and there is no further path on this route.
3. "value" is a 3D array (node, output, class). The first index is a proxy Node Id. Here the tree stores the frequency/count of each class at the current node. Because our conditions have to be "Mutually Exclusive and Collectively Exhaustive", every leaf node contains data from just one class/cluster. Target classes are sorted alphabetically.
4. Like "threshold", "feature" is a 1D array. Each index is a proxy Node Id. Unless the node is a leaf (denoted by -2), the entry at each index holds the index of the feature to which the "deciding" value in "threshold" applies.
5. "children_left" and "children_right" are also two 1D arrays. Each index is a proxy Node Id in both. The entry at each index is the Node Id of the left or right child, unless the current node is a leaf, in which case it is -1 to indicate that the node has no left/right child.
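
The relationships between these arrays can be checked on a small fitted tree. The following is a minimal sketch on a four-row dataset invented for this purpose (not the notebook's curated data); it only uses the `tree_` attributes described above plus `node_count`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset (not the notebook's curated data).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array(["a", "a", "b", "c"])

t = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# Leaves have no children, so both child arrays hold -1 there.
is_leaf = t.children_left == -1
assert np.array_equal(is_leaf, t.children_right == -1)
# At the same nodes, "feature" and "threshold" hold the -2 flag.
assert np.array_equal(is_leaf, t.feature == -2)
assert np.all(t.threshold[is_leaf] == -2)

print(t.value.shape)  # (nodes, outputs, classes)

for node in range(t.node_count):
    if is_leaf[node]:
        print(f"node {node}: leaf, class weights {t.value[node][0]}")
    else:
        print(f"node {node}: x[{t.feature[node]}] <= {t.threshold[node]} "
              f"-> left {t.children_left[node]}, right {t.children_right[node]}")
```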

The final "ClusterTree" has been built by traversing the **tree** returned by "DecisionTreeClassifier". 
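
As a rough illustration of such a traversal, the sketch below copies a fitted tree into nested dictionaries and then looks up a record by walking them. The nested-dict format and the helpers `to_nested` and `lookup` are hypothetical and much simpler than the notebook's `ClusterTree` and `ClusterTreeNode` classes; the data and feature names are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def to_nested(t, classes, feature_names, node=0):
    """Recursively copy a fitted sklearn tree into plain dicts (hypothetical format)."""
    if t.feature[node] == -2:  # leaf: report the dominant class
        return {"cluster": str(classes[int(np.argmax(t.value[node][0]))])}
    return {
        "attribute": feature_names[t.feature[node]],
        "threshold": float(t.threshold[node]),
        # sklearn sends samples with value <= threshold to the left child.
        "left": to_nested(t, classes, feature_names, t.children_left[node]),
        "right": to_nested(t, classes, feature_names, t.children_right[node]),
    }


def lookup(nested, record):
    """Walk the nested dict for one record (a dict of attribute -> numeric value)."""
    while "cluster" not in nested:
        side = "left" if record[nested["attribute"]] <= nested["threshold"] else "right"
        nested = nested[side]
    return nested["cluster"]


# Illustrative data: column 0 is "adults", column 1 is "is_weekend".
X = np.array([[1, 0], [1, 1], [3, 0], [3, 1]])
y = np.array(["solo", "solo", "group_weekday", "group_weekend"])
model = DecisionTreeClassifier(random_state=0).fit(X, y)
nested = to_nested(model.tree_, model.classes_, ["adults", "is_weekend"])

print(lookup(nested, {"adults": 3, "is_weekend": 1}))  # group_weekend
```

Once the tree has been copied out this way, lookups no longer need scikit-learn at all, which is presumably part of the appeal of storing a simplified tree.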

In [1]:
import logging
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import re
import datetime
import math
from ClusterTree import *
from Attribute import *
from BasicDataPrep_V1 import calcWeekEndMidEarlyWeek
from ReadWritePickleFile import *
from ClusterTree_Utility import *
from itertools import product

In [2]:
logging.basicConfig(format="%(asctime)s - %(thread)s - %(levelname)s - %(message)s")
logger = logging.getLogger()
logger.setLevel(logging.INFO)

In [3]:
def validationOfClusterConditions(StaticClustering):
    """Check every rule against the declared attribute types and ranges.

    Attributes that a rule does not mention are filled in with their full range.
    """
    types = StaticClustering['Attributes']['Type']
    ranges = StaticClustering['Attributes']['Ranges']

    for key, cluster in StaticClustering['Clusters'].items():
        for criteria in cluster:
            for k in ranges.keys():
                if types[k] == 'Numerical' and k in criteria:
                    # Numerical rule: an ordered [min, max] pair inside the declared range.
                    # A set is not accepted here because it cannot be indexed.
                    assert isinstance(criteria[k], (list, tuple)) and \
                        criteria[k][0] >= ranges[k][0] and criteria[k][1] <= ranges[k][1], \
                        "Check the range of Attribute {}, of Cluster {}".format(k, key)

                elif types[k] == 'Categorical' and k in criteria:
                    # Categorical rule: every listed value must be one of the declared values.
                    assert isinstance(criteria[k], (list, tuple, set)) and \
                        set(criteria[k]).issubset(ranges[k]), \
                        "Check the values of Attribute {}, of Cluster {}".format(k, key)

                else:
                    # Attribute not constrained by this rule: fall back to its full range.
                    logger.info("Attribute %s added to a rule of Cluster %s", k, key)
                    criteria[k] = ranges[k]

    return StaticClustering

def DataCuration(StaticClustering):
    """Expand every rule into labelled sample points for the decision tree."""
    types = StaticClustering['Attributes']['Type']
    ranges = StaticClustering['Attributes']['Ranges']
    rKeys = list(ranges.keys())
    frames = []
    for key, cluster in StaticClustering['Clusters'].items():
        for criteria in cluster:
            data = []
            for k in rKeys:
                # Use the rule's own condition when present, else the attribute's full range.
                source = criteria[k] if k in criteria else ranges[k]
                if types[k] == 'Numerical':
                    # Only the two end points (min, max) of a numerical range are sampled.
                    d = [source[0], source[1]]
                elif types[k] == 'Categorical':
                    d = list(source)
                else:
                    d = ranges[k]

                data.append(d)
            # The Cartesian product of the per-attribute values gives the rows for this rule.
            tmp = pd.DataFrame(list(product(*data)), columns=rKeys)
            tmp['cluster'] = key
            frames.append(tmp)

    # Concatenate once at the end instead of growing the frame inside the loop.
    curatedData = pd.concat(frames, ignore_index=True) if frames \
        else pd.DataFrame(columns=rKeys + ['cluster'])
    curatedData.to_csv('./curatedData.csv')

    return curatedData

In [4]:
def validateNumericDataType(df):
    columnsSet = list(df.columns)
    notNumericCols  = []
    sample = dict (df.loc[0])
    for key in sample.keys():
        if isinstance(sample[key], str) and not sample[key].isdecimal() and key!='cluster':
            notNumericCols.append(key)
            
    numericCols = [i for i in columnsSet if i not in notNumericCols + ['cluster']]
    df[numericCols] = df[numericCols].apply(pd.to_numeric)
    return df, notNumericCols

def getXandY(df, notNumericCols):
    df = pd.get_dummies(df, columns=notNumericCols)
    xCols = [col for col in df.columns if col!='cluster']
    X = df[xCols].values
    Y = df['cluster'].values
    return df, X, Y, xCols

def createCART(X, Y):
    tree = DecisionTreeClassifier()
    tree.fit(X=X, y=Y)
    # A fully grown tree reaches 100% training accuracy only if no sample point carries two cluster labels.
    assert tree.score(X=X,y=Y) == 1.0, "Your conditions are NOT mutually exclusive!!!"
    return tree

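def getTargetNames(df, tree):
    # Count the rows per cluster, ordered by cluster name (the order sklearn uses for its classes).
    counts = sorted(df['cluster'].value_counts().items(), key=lambda item: item[0])
    expected = np.array([float(n) for _, n in counts])

    # The root node's value row describes the class distribution of the whole training set.
    # Proportions are compared so the check holds whether the installed scikit-learn
    # stores counts or fractions in tree_.value.
    observed = tree.tree_.value[0][0]
    assert np.allclose(observed / observed.sum(), expected / expected.sum()),\
            "Something went wrong!, Try training with one hot encoded target columns!"
    target_names = [name for name, _ in counts]
    
    return target_names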
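
The one-hot encoding done in `getXandY` can be illustrated on a small frame. This is a sketch with invented column names and values; it shows how `pd.get_dummies` turns a categorical column into 0/1 columns and how a tree then splits such a column at 0.5, which is probably why so many entries of `tree.tree_.threshold` further down equal 0.5.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative frame; column names and values are made up.
df = pd.DataFrame({
    "leaddays": [1, 1, 20, 20],
    "day": ["fri", "mon", "fri", "mon"],
    "cluster": ["a", "b", "c", "c"],
})

# Each category of "day" becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["day"], dtype=float)
x_cols = [c for c in encoded.columns if c != "cluster"]
print(x_cols)  # ['leaddays', 'day_fri', 'day_mon']

X = encoded[x_cols].values
y = encoded["cluster"].values
model = DecisionTreeClassifier(random_state=0).fit(X, y)

t = model.tree_
for node in range(t.node_count):
    if t.feature[node] != -2:  # internal node
        print(x_cols[t.feature[node]], "<=", t.threshold[node])
```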

In [5]:
def getSimpleAttributesCART(data, featureNames):
    # data must be a DataFrame; featureNames must be a collection of its column names
    
    assert isinstance(data, pd.DataFrame) and isinstance(featureNames, (list, tuple, set))
    
    minMax = data.describe().loc[['min', 'max']][featureNames]
    uniqueVals = data.nunique()
    attributes = {}
    for att in featureNames:
        attribute = Attribute(att, uniqueVals[att], minMax.loc['max'][att], minMax.loc['min'][att])
        attributes[att] = attribute
        
    
    for key in attributes.keys():
        if attributes[key].type == 'Calculated' and re.findall("_weekend$", attributes[key].name):
            attributes[key].setOriginalAttributeVal(['friday', 'saturday', 'fri', 'sat'])
            
        elif attributes[key].type == 'Calculated' and re.findall("_midweek$", attributes[key].name):
            attributes[key].setOriginalAttributeVal(['tuesday','wednesday', 'thursday','tue', 'wed', 'thu'])
            
        elif attributes[key].type == 'Calculated' and re.findall("_earlyweek$", attributes[key].name):
            attributes[key].setOriginalAttributeVal(['sunday', 'monday', 'sun', 'mon'])
    

    return attributes

def buildSimpleClusterTreefromCART(classificationTree, node, depth, simpleTreeNode, featureNames, target_names, attributes, targetOneHotEncoded=False):
    curNode = ClusterTreeNode()
    if classificationTree.feature[node] == -2:
        curNode.setParent(simpleTreeNode)
        if targetOneHotEncoded:
            # One output per cluster: column 1 of each output holds the weight of the positive class.
            targets = [classificationTree.value[node][i][1] for i in range(len(classificationTree.value[node]))]
            clusterID = target_names[int(np.argmax(targets))]
        else:
            # A pure leaf holds samples of a single class, so the largest entry identifies the cluster.
            clusterID = target_names[int(np.argmax(classificationTree.value[node][0]))]
        clusterID = clusterID.strip().lower()
        curNode.setClusterId(clusterID)
    else:
        i = classificationTree.feature[node]
        attribute = featureNames[i]
        cutValue = classificationTree.threshold[node]
        
        if attributes[attribute].type == 'Categorical' or attributes[attribute].type == 'Calculated':
            curNode.setAttribute(attribute=attributes[attribute].originalAttribute, attributeType=attributes[attribute].type)
            curNode.setValue(attributes[attribute].originalAttributeVal)

        else:
            curNode.setAttribute(attribute=attributes[attribute].originalAttribute, attributeType=attributes[attribute].type)
            curNode.setValue(cutValue)
            
        curNode.setParent(simpleTreeNode)
        
        left = buildSimpleClusterTreefromCART(classificationTree, classificationTree.children_left[node], depth+1, curNode, featureNames, target_names, attributes, targetOneHotEncoded)
        curNode.setLeft(left)
        right = buildSimpleClusterTreefromCART(classificationTree, classificationTree.children_right[node], depth+1, curNode, featureNames, target_names, attributes, targetOneHotEncoded)
        curNode.setRight(right)
        
    return curNode

In [6]:
def getClusterTreeFromRules(staticClusterFilePath, prevVer):
    data = readPicklefile(staticClusterFilePath)
    defaultValues = data['Attributes']['Default']
    defaultValues = dict((k.strip().lower(), v.lower().strip() if isinstance(v, str) else v) for 
                        k,v in defaultValues.items())
    data = validationOfClusterConditions(data)
    df = DataCuration(data)
    newColumnNames =dict()
    for col in df.columns:
        newColumnNames[col] = col.strip().lower()    
    df.rename(columns=newColumnNames, inplace=True)
    df,_,_,_,_,_ = calcWeekEndMidEarlyWeek(df)
    df, notNumericCols = validateNumericDataType(df)
    df, X, Y, xCols = getXandY(df=df, notNumericCols=notNumericCols)
    tree = createCART(X, Y)
    attributes = getSimpleAttributesCART(df, xCols)
    target_names = getTargetNames(df, tree)
    node = buildSimpleClusterTreefromCART(tree.tree_ , 0, 1, None, xCols, target_names, attributes, False)
    # Seconds since the Unix epoch, taken in UTC so the version does not depend on the local time zone.
    utc_now = datetime.datetime.now(datetime.timezone.utc)
    deltaVer = math.ceil(utc_now.timestamp())
    deltaVer = float("." + str(deltaVer))
    if math.floor(prevVer) == 1:
        curVer = 1.0 + deltaVer
    else:
        curVer = 1.0
    clusterTree = ClusterTree(node, defaultValues, curVer)
    return clusterTree, tree

In [7]:
clusterTree, tree = getClusterTreeFromRules(staticClusterFilePath='./staticClustering.pickle', prevVer=-1)
attributes, clusters = getAttributesAndClusters(clusterTree)

In [8]:
info = {'Adults': 2,'Children': 1,'LeadDays': 3,'LengthOfStay':1 , 'ArrivalDate_WeekDay':'Monday'}
clusterTree.getClusterID(info)



('midweekfamilygetaway', 1.0)

In [9]:
info = {'Adults': 2,'Children': 1,'LeadDays': 3,'LengthOfStay':1 , 'ArrivalDate_WeekDay':'friday'}
clusterTree.getClusterID(info)

('suddenleisurefamilygetaway', 1.0)

In [10]:
info = {}
clusterTree.getClusterID(info)

('personaltimeoff', 1.0)

In [11]:
clusterTree.defaultVals

{'adults': 1,
 'children': 0,
 'leaddays': 10,
 'lengthofstay': 2,
 'arrivaldate_weekday': 'friday'}
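
The calls above suggest that `getClusterID` normalises the incoming keys and fills missing attributes from `defaultVals` before walking the tree (an empty `info` still yields a cluster). The `ClusterTree` source is not shown here, so the following is only a guess at that behaviour; `with_defaults` is a hypothetical helper, not part of the notebook's modules.

```python
def with_defaults(info, default_vals):
    """Lower-case keys and string values, then fill gaps from the defaults (assumed behaviour)."""
    cleaned = {
        k.strip().lower(): v.strip().lower() if isinstance(v, str) else v
        for k, v in info.items()
    }
    # Values supplied by the caller win over the defaults.
    return {**default_vals, **cleaned}


default_vals = {"adults": 1, "children": 0, "leaddays": 10,
                "lengthofstay": 2, "arrivaldate_weekday": "friday"}

print(with_defaults({}, default_vals) == default_vals)  # True
print(with_defaults({"Adults": 2, "ArrivalDate_WeekDay": "Monday"}, default_vals))
```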

In [18]:
tree.tree_.threshold

array([ 7.500e+00,  2.500e+00,  5.000e-01,  1.500e+00,  1.450e+01,
        1.500e+00,  5.000e-01, -2.000e+00,  5.000e-01,  5.000e-01,
        5.000e-01, -2.000e+00, -2.000e+00, -2.000e+00, -2.000e+00,
        5.000e-01,  5.000e-01,  5.000e-01, -2.000e+00, -2.000e+00,
       -2.000e+00, -2.000e+00,  5.000e-01,  5.000e-01,  5.000e-01,
       -2.000e+00, -2.000e+00, -2.000e+00, -2.000e+00,  1.001e+03,
        5.000e-01,  5.000e-01, -2.000e+00,  1.500e+00, -2.000e+00,
       -2.000e+00,  1.500e+00, -2.000e+00, -2.000e+00, -2.000e+00,
        5.000e-01,  5.000e-01,  5.000e-01,  1.500e+00, -2.000e+00,
       -2.000e+00, -2.000e+00,  6.500e+00,  1.500e+00,  5.000e-01,
       -2.000e+00, -2.000e+00, -2.000e+00, -2.000e+00,  6.500e+00,
        1.500e+00,  5.000e-01, -2.000e+00, -2.000e+00, -2.000e+00,
       -2.000e+00,  5.000e-01,  1.500e+00,  5.500e+00,  2.150e+01,
        5.000e-01,  5.000e-01, -2.000e+00, -2.000e+00, -2.000e+00,
        5.000e-01,  5.000e-01, -2.000e+00, -2.000e+00, -2.000e

In [19]:
tree.tree_.value

array([[[ 80., 128., 112., ...,  64.,  32., 112.]],

       [[ 80., 128.,   0., ...,  64.,  32., 112.]],

       [[  0., 128.,   0., ...,  64.,   0., 112.]],

       ...,

       [[  0.,   0.,   0., ...,   0.,  16.,   0.]],

       [[  0.,   0.,   0., ...,   0.,   0.,   0.]],

       [[  0.,   0., 112., ...,   0.,   0.,   0.]]])

In [26]:
# 7 is a leaf node
# -2 flag is used to denote leaf

In [28]:
tree.tree_.value[7][0]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0., 28.])
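
A row like the one above has a single non-zero entry, and its position identifies the cluster of that leaf. The sketch below shows the lookup on a small invented dataset, using the classifier's `classes_` attribute for the class order. Depending on the scikit-learn version, `value` may hold per-class counts or fractions, so the sketch relies on `argmax` rather than on the magnitude of the entry.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; cluster names are made up.
X = np.array([[0], [1], [2], [3]])
y = np.array(["b_cluster", "b_cluster", "a_cluster", "c_cluster"])
model = DecisionTreeClassifier(random_state=0).fit(X, y)
t = model.tree_

print(model.classes_)  # sorted: ['a_cluster' 'b_cluster' 'c_cluster']

leaf_class = {}
for node in np.flatnonzero(t.feature == -2):  # -2 marks a leaf
    row = t.value[node][0]
    # A pure leaf has exactly one non-zero entry.
    assert np.count_nonzero(row) == 1
    leaf_class[int(node)] = str(model.classes_[np.argmax(row)])

print(leaf_class)
```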

In [24]:
tree.tree_.feature

array([ 1,  1,  0,  2,  3,  1,  3, -2, 10,  7,  9, -2, -2, -2, -2, 10,  9,
        7, -2, -2, -2, -2,  9, 10,  7, -2, -2, -2, -2,  2,  9,  7, -2,  3,
       -2, -2,  3, -2, -2, -2,  9,  7,  3,  1, -2, -2, -2,  3,  1,  3, -2,
       -2, -2, -2,  3,  1,  3, -2, -2, -2, -2,  0,  2,  1,  3,  7,  9, -2,
       -2, -2,  7,  9, -2, -2, -2, -2,  9,  7, -2,  3, -2, -2,  3, -2, -2,
        9,  7, -2,  3, -2, -2,  3, -2, -2, -2])

In [29]:
tree.tree_.children_left

array([ 1,  2,  3,  4,  5,  6,  7, -1,  9, 10, 11, -1, -1, -1, -1, 16, 17,
       18, -1, -1, -1, -1, 23, 24, 25, -1, -1, -1, -1, 30, 31, 32, -1, 34,
       -1, -1, 37, -1, -1, -1, 41, 42, 43, 44, -1, -1, -1, 48, 49, 50, -1,
       -1, -1, -1, 55, 56, 57, -1, -1, -1, -1, 62, 63, 64, 65, 66, 67, -1,
       -1, -1, 71, 72, -1, -1, -1, -1, 77, 78, -1, 80, -1, -1, 83, -1, -1,
       86, 87, -1, 89, -1, -1, 92, -1, -1, -1])
