Navarez, AM P
CMSC 173 Week 6 Assignment

One way to improve the performance of multiclass classification is to use divide-and-conquer. This is done by building a binary tree of classifiers, which operates on the following principles:

    1. Each node is generated by splitting the classes into two groups.
    2. The inner nodes represent binary classifiers, while the leaf nodes represent classes.
    3. The dataset is relabelled according to whether or not each sample's class belongs to the first group, for example by assigning binary values (1 for the first group, 0 for the second).
        ◦ For groups with an odd number of classes, the one-versus-all labelling method may be used.
    4. A binary classifier is trained on this relabelled dataset and becomes a new inner node in the tree.
        ◦ The inner node has two children, each of which is either another classifier (built by a recursive call on the subset of the dataset that belongs to the corresponding group) or a class value.
        ◦ One-versus-all may be used for an odd-sized group because it produces one class-value node and one inner node rather than two inner nodes, which leaves an even number of classes for the next split.
    5. Since each classifier is binary, its prediction is also binary, so the predicted label itself can be used rather than the scores.
    6. The prediction at each inner node determines which branch a feature vector follows until it reaches a leaf node, which gives its class.
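
The grouping and relabelling rules above can be sketched in plain Python, without any classifier. This is only an illustration: it assumes the classes are sorted and split at the midpoint, and the helper names `split_groups`, `build_hierarchy`, and `binary_labels` are hypothetical.

```python
def split_groups(classes):
    """Split class labels into a left and a right group."""
    classes = sorted(classes)
    if len(classes) % 2 == 0:
        # even count: split the sorted classes in half
        half = len(classes) // 2
        return classes[:half], classes[half:]
    # odd count: one-versus-all, the first class against the rest
    return classes[:1], classes[1:]


def build_hierarchy(classes):
    """Return the nested grouping as tuples; a leaf is a bare class label."""
    if len(classes) == 1:
        return classes[0]
    left, right = split_groups(classes)
    return (build_hierarchy(left), build_hierarchy(right))


def binary_labels(labels, left_group):
    """Relabel samples: 1 if the class is in the left group, 0 otherwise."""
    return [int(label in left_group) for label in labels]


classes = [1, 2, 3, 5, 6, 7]
print(split_groups(classes))                   # ([1, 2, 3], [5, 6, 7])
print(build_hierarchy(classes))                # ((1, (2, 3)), (5, (6, 7)))
print(binary_labels([1, 5, 3, 7], [1, 2, 3]))  # [1, 0, 1, 0]
```

Each tuple in the nested result corresponds to one inner node (one binary classifier), and each bare label corresponds to a leaf.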

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split as tts
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler
import numpy as np
import warnings
warnings.filterwarnings('ignore') 

In [2]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'

names = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Class']
df = pd.read_csv(url, names=names)

index_size = df.shape[0]
print("Classes: " + str(df['Class'].unique()))
df.head()

Classes: [1 2 3 5 6 7]


,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Class
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [3]:
X = df.drop('Class', axis=1)
y = df['Class']

scaler = StandardScaler()
X = scaler.fit_transform(X)
scaled_df = pd.DataFrame(X, columns=names[:-1])
scaled_df['Class'] = y
scaled_df.head()

,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Class
0,0.872868,0.284953,1.254639,-0.692442,-1.127082,-0.671705,-0.145766,-0.352877,-0.586451,1
1,-0.249333,0.591817,0.636168,-0.17046,0.102319,-0.026213,-0.793734,-0.352877,-0.586451,1
2,-0.721318,0.149933,0.601422,0.190912,0.438787,-0.164533,-0.828949,-0.352877,-0.586451,1
3,-0.232831,-0.242853,0.69871,-0.310994,-0.052974,0.112107,-0.519052,-0.352877,-0.586451,1
4,-0.312045,-0.169205,0.650066,-0.411375,0.555256,0.081369,-0.624699,-0.352877,-0.586451,1


In [4]:
"""
from https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test
Splitting the dataset into training and testing sets (80/20). The training
set is kept as a single DataFrame, with features and labels together,
because the feature-label split is done inside the training function.
"""
train, test = np.split(scaled_df.sample(frac=1, random_state=100),
                      [int(.8 * len(df))])

# split test dataset into features and labels
X_test = test.drop('Class', axis=1)
y_test = test['Class']

In [13]:
# tree data structure
class TreeNode:
    def __init__(self, model, left=None, right=None, name=""):
        self.model = model
        self.left = left
        self.right = right
        self.name = name

    def __str__(self):
        return self.name if isinstance(self.name, str) else str(self.name)
        
    def isLeaf(self):
        return self.right is None and self.left is None

    def height(self):
        if self.isLeaf():
            return 1
        return 1 + max(self.left.height(), self.right.height())
    
    def predict_inner(self, feature_vector, verbose=False):
        """
        Predicts a single feature vector, returns a prediction value
        """
        
        prediction_value = self.model.predict(feature_vector)

        if verbose:
            print(str(self) + " classifier predicted that the value belongs to the" + 
                    (" left group" if prediction_value == 1 else " right group" if prediction_value == 0 else ""), 
                  end = ' ')
        
        if not self.left.isLeaf() and prediction_value == 1:
            # classified as part of the left group
            if verbose: print(".\n")
            return self.left.predict_inner(feature_vector, verbose)
        elif not self.right.isLeaf() and prediction_value == 0:
            if verbose: print(".\n")
            # classified as part of the right group
            return self.right.predict_inner(feature_vector, verbose)
        else:
            # classifying on values, still follows left vs right
            if prediction_value == 1:
                if verbose: print("with value " + str(self.left))
                return self.left.name
            else:
                if verbose: print("with value " + str(self.right))
                return self.right.name
            
            
    def predict(self, feature_set, verbose=False):
        """
        Predicts an entire feature set, returns a prediction array
        """
        predictions = np.array([])
        if isinstance(feature_set, pd.Series):
            # a single sample: reshape it into one row of features
            feature_vec = feature_set.values.reshape(1, -1)
            # call through self rather than a global classifier variable
            predictions = np.append(predictions, self.predict_inner(feature_vec, verbose))
        else:
            for i in range(len(feature_set)):
                feature_vec = feature_set.iloc[i].to_numpy()
                feature_vec = feature_vec.reshape(1, -1)
                predictions = np.append(predictions, self.predict_inner(feature_vec, verbose))

        return predictions
    
    def score(self, test_features, test_labels, verbose=False):
        """
        Scores the predictions made. Formula is:
        
        Score = Count of correct labels / Count of all labels
        """
        
        predicted_labels = self.predict(test_features, verbose)
        return np.sum(predicted_labels == test_labels) / len(test_labels)
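
The score formula in the docstring above is plain accuracy. A minimal sketch on made-up labels, assuming the predictions and the true labels are equal-length sequences; the function name `accuracy` is hypothetical:

```python
import numpy as np


def accuracy(predicted, actual):
    """Count of correct labels divided by count of all labels."""
    # convert both inputs to arrays so the comparison is element by element
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return np.sum(predicted == actual) / len(actual)


# three of the four made-up predictions match the true labels
print(accuracy([1, 2, 2, 7], [1, 2, 3, 7]))  # 0.75
```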
        

In [14]:
# from https://stackoverflow.com/questions/34012886/print-binary-tree-level-by-level-in-python
def print_tree(root, val="val", left="left", right="right"):
    def display(root, val=val, left=left, right=right):
        """Returns list of strings, width, height, and horizontal coordinate of the root."""
        # No child.
        if getattr(root, right) is None and getattr(root, left) is None:
            line = str(root)
            width = len(line)
            height = 1
            middle = width // 2
            return [line], width, height, middle

        # Only left child.
        if getattr(root, right) is None:
            lines, n, p, x = display(getattr(root, left))
            s = str(root)
            u = len(s)
            first_line = (x + 1) * ' ' + (n - x - 1) * '_' + s
            second_line = x * ' ' + '/' + (n - x - 1 + u) * ' '
            shifted_lines = [line + u * ' ' for line in lines]
            return [first_line, second_line] + shifted_lines, n + u, p + 2, n + u // 2

        # Only right child.
        if getattr(root, left) is None:
            lines, n, p, x = display(getattr(root, right))
            s = str(root)
            u = len(s)
            first_line = s + x * '_' + (n - x) * ' '
            second_line = (u + x) * ' ' + '\\' + (n - x - 1) * ' '
            shifted_lines = [u * ' ' + line for line in lines]
            return [first_line, second_line] + shifted_lines, n + u, p + 2, u // 2

        # Two children.
        left, n, p, x = display(getattr(root, left))
        right, m, q, y = display(getattr(root, right))
        s = str(root)
        u = len(s)
        first_line = (x + 1) * ' ' + (n - x - 1) * '_' + s + y * '_' + (m - y) * ' '
        second_line = x * ' ' + '/' + (n - x - 1 + u + y) * ' ' + '\\' + (m - y - 1) * ' '
        if p < q:
            left += [n * ' '] * (q - p)
        elif q < p:
            right += [m * ' '] * (p - q)
        zipped_lines = zip(left, right)
        lines = [first_line, second_line] + [a + u * ' ' + b for a, b in zipped_lines]
        return lines, n + m + u, max(p, q) + 2, n + u // 2

    lines, *_ = display(root, val, left, right)
    for line in lines:
        print(line)


In [15]:
def train_multiclass(Dataset, model):
    new_dataset = Dataset.copy() # deep copying dataset, may consume much space for larger datasets
    all_classes = new_dataset['Class'].unique() # get all label values
    all_classes.sort()
    
    if len(all_classes)%2 == 0:
        # group dataset evenly
        left_group = all_classes[:len(all_classes)//2]
        right_group = all_classes[len(all_classes)//2:]
    else:
        # one-vs-all grouping
        left_group = [all_classes[0].astype(np.int32)]
        right_group = list(all_classes[1:])
    
    # binary labelling: 1 if a class belongs to the left group, 0 otherwise
    is_inleft = lambda val: int(val in left_group)
    new_dataset['new_Class'] = new_dataset['Class'].apply(is_inleft)
    
    # creating the features and labels
    X = new_dataset.drop(['new_Class','Class'], axis=1)
    y = new_dataset['new_Class']
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=173)
    
    # creating a new binary classifier 
    new_model = clone(model)
    new_model.fit(X_train,y_train)
    
    left = None
    right = None
    name = str(left_group) + " vs. " + str(right_group)
    print(name + " Score: " + str(new_model.score(X_test, y_test)))
    
    if len(left_group) > 1:
        # inner node
        left = train_multiclass(Dataset[Dataset['Class'].isin(left_group)], model)
    else:
        # leaf node, no models
        left = TreeNode(None, name=left_group[0])
    
    if len(right_group) > 1:
        # inner node
        right = train_multiclass(Dataset[Dataset['Class'].isin(right_group)], model)
    else:
        # leaf node, no models
        right = TreeNode(None, name=right_group[0])
        
    return TreeNode(new_model, left, right, name)

In [16]:
"""
Testing multiclass classification using models discussed in class.
No hyperparameter tuning was done since this is only intended as a
proof-of-concept.
"""

# perceptron = Perceptron(max_iter=3)
# Multiclass_classifier = train_multiclass(train, perceptron)

decision_tree = DecisionTreeClassifier()
Multiclass_classifier = train_multiclass(train, decision_tree)

# knn = KNeighborsClassifier(n_neighbors=4)
# Multiclass_classifier = train_multiclass(train, knn)

[1 2 3] vs. [5 6 7] Score: 0.9142857142857143
[1] vs. [2, 3] Score: 0.7692307692307693
[2] vs. [3] Score: 0.7142857142857143
[5] vs. [6, 7] Score: 0.8888888888888888
[6] vs. [7] Score: 0.8571428571428571


In [17]:
print_tree(Multiclass_classifier)

         ___________________[1 2 3] vs. [5 6 7]________                    
        /                                              \                   
 [1] vs. [2, 3]______                           [5] vs. [6, 7]______       
/                    \                         /                    \      
1               [2] vs. [3]                    5               [6] vs. [7] 
               /           \                                  /           \
               2           3                                  6           7


In [18]:
Multiclass_classifier.score(X_test, y_test)

0.6744186046511628

In [19]:
# testing a verbose prediction
Multiclass_classifier.predict(X_test.iloc[0], verbose=True)

[1 2 3] vs. [5 6 7] classifier predicted that the value belongs to the left group .

[1] vs. [2, 3] classifier predicted that the value belongs to the right group .

[2] vs. [3] classifier predicted that the value belongs to the left group with value 2


array([2.])