# Logistic Regression Model

_Kevin Siswandi_  
**Fundamentals of Machine Learning**  
June 2020

Assume a binary classification problem with $y \in \{1, 0\}$. The output of a logistic regression classifier is

$$ f(x; w) = \sigma(w^T x) $$

where $\sigma(\cdot)$ is the sigmoid function. This is convenient because the output lies in $(0, 1)$ and can be interpreted as the probability that $y = 1$ given $x$.

## Sigmoid function
 
The sigmoid or logistic function is given by

$$ \sigma(z) = \frac{1}{1+e^{-z}} $$
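
As a minimal sketch of the two formulas above, the following evaluates the logistic function with NumPy and uses it to turn $w^T x$ into a probability. The names `sigmoid` and `predict_proba`, the weights `w`, the inputs `X`, and the 0.5 decision threshold are assumptions made for illustration; they do not come from the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """P(y = 1 | x) = sigmoid(w^T x) for every row x of X."""
    return sigmoid(X @ w)

# Hypothetical weights and inputs, chosen only for illustration.
w = np.array([1.0, -2.0])
X = np.array([[0.0, 0.0],
              [2.0, 0.5],
              [-3.0, 1.0]])

p = predict_proba(X, w)          # one probability per row of X
labels = (p >= 0.5).astype(int)  # assumed decision rule: predict 1 if p >= 0.5
```

This naive form can emit an overflow warning for very negative `z`; a numerically hardened version would be needed for that case.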

# Density Tree

In [17]:
class DensityTree(Tree):
    def __init__(self):
        super(DensityTree, self).__init__()
        
    def train(self, data, prior):
        '''
        data: the feature matrix for the digit under consideration
        prior: the prior probability of this digit
        '''
        self.prior = prior
        N, D = data.shape
        D_try = int(np.sqrt(D)) # number of features to consider for each split decision

        # filter features and initialize bounding box
        # (If m[j] == M[j] for some j, the bounding box has zero volume, 
        #  causing divide-by-zero later on. We must ignore these features
        #  and adjust the bounding box accordingly.)
        m, M = np.min(data, axis=0), np.max(data, axis=0)
        valid_features   = np.where(m != M)[0]
        invalid_features = np.where(m == M)[0]
        M[invalid_features] = m[invalid_features] + 1

        # initialize the root node
        self.root.data = data
        self.root.box = m.copy(), M.copy()
        stack = [self.root]

        n_min = 20 # termination criterion: don't split if node contains fewer instances
        while len(stack):
            node = stack.pop()
            n = node.data.shape[0] # number of instances in present node
            if n >= n_min:
                # Call 'make_density_split_node()' with 'D_try' randomly selected 
                # indices from 'valid_features'. This turns 'node' into a split node
                # and returns the two children, which must be placed on the 'stack'.
                ... # your code here
            else:
                # Call 'make_density_leaf_node()' to turn 'node' into a leaf node.
                ... # your code here

    def predict(self, x):
        leaf = self.find_leaf(x)
        # compute p(x | y) * p(y)
        return ... # your code here

In [10]:
def make_density_split_node(node, N, feature_indices):
    '''
    node: the node to be split
    N:    the total number of training instances for the current class
    feature_indices: a numpy array of length 'D_try', containing the feature 
                     indices to be considered in the present split
    '''
    n, D = node.data.shape
    m, M = node.box

    # find best feature j (among 'feature_indices') and best threshold t for the split
    # Hint: For each feature considered, first remove duplicate feature values using 
    # 'np.unique()'. Describe here why this is necessary.
    ... # your code here

    # create children
    left = Node()
    right = Node()
    
    # initialize 'left' and 'right' with the data subsets and bounding boxes
    # according to the optimal split found above
    ... # your code here

    # turn the current 'node' into a split node
    # (store children and split condition)
    ... # your code here

    # return the children (to be placed on the stack)
    return left, right

In [18]:
def make_density_leaf_node(node, N):
    '''
    node: the node to become a leaf
    N:    the total number of training instances for the current class
    '''
    # compute and store leaf response
    ... # your code here

# Decision Tree

- [Medium post](https://medium.com/@hiromi_suenaga/machine-learning-1-lesson-5-df45f0c99618) on implementing random forests and decision trees
- [Fast.ai lecture](https://course.fast.ai/lessonsml1/lesson5.html)

In [45]:
class DecisionTree(Tree):
    def __init__(self):
        super(DecisionTree, self).__init__()
        
    def train(self, data, labels):
        '''
        data: the feature matrix for all digits
        labels: the corresponding ground-truth responses
        '''
        N, D = data.shape
        D_try = int(np.sqrt(D)) # how many features to consider for each split decision

        # initialize the root node
        self.root.data = data
        self.root.labels = labels
        stack = [self.root]

        n_min = 20 # termination criterion: don't split if node contains fewer instances
        while len(stack):
            node = stack.pop()
            n = node.data.shape[0] # number of instances in present node
            if n >= n_min and not node_is_pure(node):
                # Call 'make_decision_split_node()' with 'D_try' randomly selected 
                # feature indices. This turns 'node' into a split node
                # and returns the two children, which must be placed on the 'stack'.
                feature_indices = np.random.permutation(D)[:D_try] # 'D_try' distinct random indices out of all 'D' features
                lchild, rchild = make_decision_split_node(node, feature_indices) # your code here
                stack.append(lchild)
                stack.append(rchild)
            else:
                # Call 'make_decision_leaf_node()' to turn 'node' into a leaf node.
                make_decision_leaf_node(node) # modifies 'node' in place # your code here
                
    def predict(self, x):
        leaf = self.find_leaf(x)
        # compute p(y | x)
        preds = leaf.response
        return preds # your code here

Note that the `predict` method returns an array of probabilities, one for each digit class from 0 to 9.
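
To turn such a probability array into a single predicted digit, one option is to take the index of the largest entry. The sketch below assumes a made-up probability vector `probs`; it is not an actual output of the tree.

```python
import numpy as np

# Hypothetical output of predict() for one image: ten class probabilities.
probs = np.array([0.0, 0.05, 0.0, 0.8, 0.0, 0.1, 0.0, 0.0, 0.05, 0.0])

# The index of the largest probability is the predicted digit.
predicted_digit = int(np.argmax(probs))
```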

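How do we find the best split?
- Iterate over the candidate features (columns) and, for each feature, over the candidate thresholds taken from the rows.
- Choose the split that minimizes the Gini impurity of the two children, computed as the sum of the children's impurities weighted by their sizes.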
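
The search described above can be sketched on a toy problem with a single feature. The helper names `gini_impurity` and `weighted_gini` and the toy arrays are assumptions for illustration. The score here is divided by the node size, which does not change which split is the minimizer.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a 1-D label array (0 for an empty array)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Size-weighted Gini impurity of the two children produced by x < threshold."""
    left, right = y[x < threshold], y[x >= threshold]
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / len(y)

# Toy data (made up for illustration): one feature, two classes.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Candidate thresholds: the unique feature values, skipping the smallest
# so that neither child is empty.
scores = {t: weighted_gini(x, y, t) for t in np.unique(x)[1:]}
best_t = min(scores, key=scores.get)
```

On this toy data the threshold 4.0 separates the two classes perfectly, so its weighted impurity is zero.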

In [22]:
def make_decision_split_node(node, feature_indices):
    '''
    node: the node to be split
    feature_indices: a numpy array of length 'D_try', containing the feature 
                     indices to be considered in the present split
    '''
    n, D = node.data.shape

    # find best feature j (among 'feature_indices') and best threshold t for the split
    j = -1
    t = -1
    
    def gini(lab): #takes in an array of labels to compute gini value
        proportion = 0
        for label in np.unique(lab):
            proportion += (np.sum(lab == label)/len(lab))**2 # '**' is the power operator ('^' is bitwise XOR)
        return (1 - proportion) * len(lab)

    score = float('inf') #start with infinite score for convenience
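    for c in feature_indices: # only consider the randomly selected features
        x, y = node.data[:, c], node.labels
        # candidate thresholds: unique feature values; skipping the smallest
        # one guarantees that both children are non-empty
        for candidate in np.unique(x)[1:]:
            lhs_labels = y[x < candidate]
            rhs_labels = y[x >= candidate]
            current_score = gini(lhs_labels) + gini(rhs_labels)
            if current_score < score:
                score = current_score
                j = c
                t = candidate # store the threshold value, not a row index
    # Note: if every feature in 'feature_indices' is constant within this node,
    # no split is found and j and t remain -1.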
        
    ... # your code here

    # create children
    left = Node()
    right = Node()
    
    # initialize 'left' and 'right' with the data subsets and labels
    # according to the optimal split found above
    left_mask = node.data[:, j] < t # your code here
    left.data = node.data[left_mask, :]
    left.labels = node.labels[left_mask] # labels are 1-D, so a single index is needed
    right.data = node.data[~left_mask, :]
    right.labels = node.labels[~left_mask]
    
    # turn the current 'node' into a split node
    # (store children and split condition)
    node.left = left
    node.right = right
    node.feature = j
    node.threshold = t# your code here

    # return the children (to be placed on the stack)
    return left, right    

A leaf node has no `feature` attribute (only split nodes store a feature and a threshold); it stores a `response` instead.

In [32]:
def make_decision_leaf_node(node):
    '''
    node: the node to become a leaf
    '''
    # compute and store leaf response
    # class frequencies for all ten digits (assumes integer labels 0..9), so that
    # every leaf returns an array of length 10 regardless of which digits it contains
    node.response = np.bincount(node.labels, minlength=10) / len(node.labels)
    return node # your code here

Note that `make_decision_split_node`, `make_decision_leaf_node` and `node_is_pure` are module-level functions, not methods of the class.

In [48]:
def node_is_pure(node):
    '''
    check if 'node' contains only instances of the same digit
    '''
    status = len(np.unique(node.labels)) < 2
    return status # your code here

# Evaluation of Density and Decision Tree

The DIGITS dataset poses a classification problem with numeric features. Notes to self:
* each node has either zero child nodes (a leaf) or two child nodes (a split node)
* each split node represents a threshold on one feature
* each terminal (leaf) node contains the prediction/output.
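
The notes above can be made concrete with a small traversal sketch. The notebook's own `Node` and `Tree` classes are not shown here, so `SketchNode` and `find_leaf_sketch` are hypothetical stand-ins. They assume only that split nodes carry `feature`, `threshold`, `left` and `right`, and that leaves carry `response`.

```python
class SketchNode:
    """Hypothetical stand-in for the notebook's Node class (attributes set ad hoc)."""
    pass

def find_leaf_sketch(root, x):
    """Descend from 'root' until a node without a 'feature' attribute (a leaf) is reached."""
    node = root
    while hasattr(node, "feature"):
        # go left if the feature value is below the threshold, otherwise right
        node = node.left if x[node.feature] < node.threshold else node.right
    return node

# Build a one-split tree by hand: split on feature 0 at threshold 2.5.
root, left, right = SketchNode(), SketchNode(), SketchNode()
root.feature, root.threshold = 0, 2.5
root.left, root.right = left, right
left.response, right.response = "left leaf", "right leaf"

leaf = find_leaf_sketch(root, [1.0, 7.0])
```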


In [12]:
# read and prepare the digits data
from sklearn.datasets import load_digits 
def data_preparation(digits):
    """
    This function splits the digits data into data and labels.
    """
    data = digits["data"]
    target = digits["target"]
    
    return data, target

# Load data
digits = load_digits()

# Split into features and labels
x, y = data_preparation(digits)

print(x.shape, y.shape) # your code here

(1797, 64) (1797,)


array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0., 10., ...,  0.,  0.,  0.],
       [ 0.,  0.,  6., ..., 13., 11.,  1.]])

In [46]:
# train trees, plot training error confusion matrices, and comment on your results
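myTree = DecisionTree()
myTree.train(x, y) # the tree must be trained before it can predict
preds = list()
for row in x:
    preds.append(myTree.predict(row)) # same variable name as above ('myTree', not 'mytree')
preds # your code here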

Ellipsis
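
The cell above asks for training-error confusion matrices. A minimal NumPy sketch of how one could be computed from true and predicted class labels follows. The helper `confusion_matrix` and the toy label arrays are assumptions for illustration; for the trees in these notes, the predicted label would first have to be derived from the probability array (for example with `np.argmax`).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry [i, j] counts the instances of true class i that were predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # unbuffered add, so repeated index pairs accumulate
    return cm

# Made-up labels and predictions for three classes, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
error_rate = 1.0 - np.trace(cm) / cm.sum()  # off-diagonal entries are the errors
```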

# Density and Decision Forest

In [7]:
class DensityForest():
    def __init__(self, n_trees):
        # create ensemble
        self.trees = [DensityTree() for i in range(n_trees)]
    
    def train(self, data, prior):
        for tree in self.trees:
            # train each tree, using a bootstrap sample of the data
            ... # your code here

    def predict(self, x):
        # compute the ensemble prediction
        return ... # your code here

A decision forest is an ensemble of decision trees (i.e. a random forest). Its prediction is the average of the individual tree predictions; we use `np.mean` with `axis=0` to average the probability arrays element-wise across trees.
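
The two ingredients of the forest, bootstrap sampling and averaging, can be sketched in isolation. The per-tree probability arrays below are invented for illustration, and the seeded generator is used only to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so that the sketch is repeatable

# Bootstrap sample: draw N indices with replacement from range(N).
N = 8
idxs = rng.choice(N, size=N, replace=True)

# Hypothetical per-tree outputs for one input: three trees, three classes.
tree_preds = np.array([[0.7, 0.2, 0.1],
                       [0.5, 0.5, 0.0],
                       [0.6, 0.1, 0.3]])

# axis=0 averages over the trees, leaving one probability per class.
ensemble = np.mean(tree_preds, axis=0)
```

Because each row sums to one, the averaged array also sums to one and can still be read as class probabilities.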

In [40]:
class DecisionForest():
    def __init__(self, n_trees):
        # create ensemble
        self.trees = [DecisionTree() for i in range(n_trees)]
    
    def train(self, data, labels):
        for tree in self.trees:
            # train each tree, using a bootstrap sample of the data
            idxs = np.random.choice(len(labels), len(labels))
            tree.train(data[idxs], labels[idxs]) # your code here

    def predict(self, x):
        # compute the ensemble prediction
        return np.mean([t.predict(x) for t in self.trees], axis=0) # your code here

# Evaluation of Density and Decision Forest

In [47]:
# train forests (with 20 trees per forest), plot training error confusion matrices, and comment on your results
m = DecisionForest(20) # 20 trees per forest, as stated in the comment above
... # your code here

Ellipsis

# Experiment Zone

In [1]:
import numpy as np

In [2]:
np.random.permutation(10)

array([3, 0, 1, 4, 9, 6, 2, 5, 7, 8])

In [13]:
np.random.permutation(10) == 1

array([False, False, False, False, False, False, False, False, False,
        True])

In [36]:
a = np.random.rand(2, 2)
a[1]

array([0.54163211, 0.67874932])

In [37]:
np.mean(a, axis=0)

array([0.56189843, 0.63521162])

In [38]:
(0.54163211 + 0.67874932)/2

0.610190715

In [39]:
np.random.choice(10, 10)

array([3, 1, 3, 2, 9, 2, 2, 6, 5, 9])