# Embarrassingly simple linear classification model

#### This is a scratch developmental notebook for a simple linear classification model. Its purpose is to demonstrate using Jupyter notebooks within a Docker container.

***

**First we create and test the model components**

In [1]:
import numpy as np
import warnings
import sys

In [2]:
# Function to adjust weights
def adjust_weights(weights, att, tar, hyp, eta):
    """
    Returns adjusted weight(s) based on the learning rate, corresponding
    attribute value(s), and the difference between the known and hypothesized
    class.
    
    Inputs:
        weights: weight(s) to be adjusted
        att: corresponding attribute(s)
        tar: known class (from data; i.e., the 'target')
        hyp: hypothesized class (from classifier)
        eta: learning rate
    Outputs:
        aw: adjusted weight(s)
    """
    return weights + (eta * (tar - hyp) * att)
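
To make the update rule concrete, here is a small sketch with made-up numbers (the weight and attribute values are illustrative assumptions, not taken from banknote.csv). It covers the three possible cases: a correct prediction leaves the weights alone, a missed positive adds `eta * att`, and a false positive subtracts it.

```python
import numpy as np

def adjust_weights(weights, att, tar, hyp, eta):
    # Same perceptron update as above: w <- w + eta * (tar - hyp) * att
    return weights + (eta * (tar - hyp) * att)

w = np.array([0.5, -0.2, 0.1])   # illustrative weights (bias weight first)
x = np.array([1.0, 0.4, 0.8])    # illustrative example with bias attribute x0 = 1

same = adjust_weights(w, x, tar=1, hyp=1, eta=0.3)   # correct: no change
up = adjust_weights(w, x, tar=1, hyp=0, eta=0.3)     # missed a 1: add eta * x
down = adjust_weights(w, x, tar=0, hyp=1, eta=0.3)   # false 1: subtract eta * x
print(same, up, down)
```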

In [3]:
def run_model(weights, examples):
    """
    Run linear classifier that returns 1 if the weighted sum is greater than
    zero or 0 otherwise.
    
    Inputs:
        weights: array of weight(s) making up the linear classifier
        examples: 2D array of examples to classify, one example per row
    """
    ws = np.sum(weights * examples, axis=1)
    h = ws > 0
    return h.astype(int)
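
As a quick check of what `run_model` computes, the sketch below (with made-up weights and examples) shows that the broadcast multiply-and-sum gives the same result as a matrix-vector product, and that a weighted sum of exactly zero is classified as 0 because the comparison is strict.

```python
import numpy as np

def run_model(weights, examples):
    # Weighted sum per row, then threshold at zero
    ws = np.sum(weights * examples, axis=1)
    return (ws > 0).astype(int)

weights = np.array([-0.5, 1.0, 0.25])      # illustrative values
examples = np.array([[1.0, 0.2, 0.4],      # weighted sum = -0.2 -> class 0
                     [1.0, 0.9, 0.0],      # weighted sum =  0.4 -> class 1
                     [1.0, 0.5, 0.0]])     # weighted sum =  0.0 -> class 0

pred_broadcast = run_model(weights, examples)
pred_matmul = (examples @ weights > 0).astype(int)   # equivalent formulation
print(pred_broadcast, pred_matmul)
```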

In [4]:
def accuracy(weights, examples):
    """
    Return the fraction of examples whose class was correctly identified.
    
    Inputs:
        weights: array of weight(s) making up the linear classifier
        examples: 2D array of examples to classify, one example per row
    """
    h = run_model(weights=weights, examples=examples[:,0:-1])
    num_correct = np.sum(examples[:,-1] == h)
    return num_correct / len(examples)

In [5]:
# Load data
data = np.loadtxt(fname='banknote.csv', delimiter=',')

# Normalize
MAX = data.max(axis=0)
MIN = data.min(axis=0)
norm = (data - MIN) / (MAX - MIN)

# Append artificial "zeroth" attribute x0=1
# (i.e., "bias", used for weight-training)
norm = np.append(arr=np.ones([len(norm),1]), values=norm, axis=1)

# Shuffle examples
np.random.shuffle(norm)
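
The preprocessing above can be traced on a tiny made-up array (the values are assumptions for illustration only). Each column is rescaled to [0, 1], so a label column that already holds 0 and 1 passes through unchanged, and the bias column of ones is prepended. Note that a column with a single constant value would make `MAX - MIN` zero and produce a division-by-zero warning.

```python
import numpy as np

# Two attributes plus a 0/1 class label in the last column (made-up data)
data = np.array([[2.0, 10.0, 0.0],
                 [4.0, 30.0, 1.0],
                 [6.0, 20.0, 1.0]])

col_min = data.min(axis=0)
col_max = data.max(axis=0)
norm = (data - col_min) / (col_max - col_min)   # column-wise min-max scaling

# Prepend the bias attribute x0 = 1
norm = np.append(arr=np.ones([len(norm), 1]), values=norm, axis=1)
print(norm)
```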

In [6]:
# Training fraction 
train = 0.75

# Training/testing subsets
trainInd = round(train * len(norm))
trainSS = norm[:trainInd]
testSS = norm[trainInd:]

In [7]:
# Learning Rate
eta = 0.3

# Training Threshold (stop when obtained)
threshold = 0.9

In [8]:
# Initialize weights
weights = np.random.random(5)
weights_archive = weights.copy()

# Epoch counter
num_epoch = 0

# Record percent error as each training example is presented
pct_error = []

In [14]:
# Initial model
trainAcc = accuracy(weights=weights, examples=trainSS) * 100
pct_error.append(100 - np.round(trainAcc, 4))

# Train
while trainAcc < threshold*100:
    print(f'Epoch {num_epoch}: Percent Error {np.round(100-trainAcc, 4)}%')

    # Update epoch counter
    num_epoch += 1
    
    # Loop through each training example
    for ex in trainSS:
        # Attributes and class label
        attributes = ex[0:-1].reshape(1,-1)
        classLabel = ex[-1]

        # Model hypothesis
        hypothesis = run_model(weights=weights,
                               examples=attributes)

        # Adjust and record weights
        weights = adjust_weights(weights=weights,
                                 att=attributes,
                                 tar=classLabel,
                                 hyp=hypothesis,
                                 eta=eta)
        weights_archive = np.vstack((weights_archive, weights))

    # Accuracy
    trainAcc = accuracy(weights=weights, examples=trainSS) * 100
    pct_error.append(100-np.round(trainAcc, 4))

print(f'Epoch {num_epoch}: Percent Error {np.round(100-trainAcc, 4)}%\n')
    
# Final status
testAcc = accuracy(weights=weights, examples=testSS)*100
print(f'Final accuracy on train data: {np.round(trainAcc, 2)}%')
print(f'Accuracy on test data: {np.round(testAcc, 2)}%')

Epoch 1: Percent Error 5.0534%

Final accuracy on train data: 94.95%
Accuracy on test data: 96.5%


In [19]:
np.mean(pct_error[-3:])

5.053399999999996
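
Averaging the last few epochs, as in the cell above, is the idea behind the stopping test used in the class version below, which compares the mean of the last three training accuracies against the threshold. The sketch here uses made-up accuracies to show how that smoothed test can stop later than a test on the latest epoch alone.

```python
import numpy as np

threshold = 90.0                               # percent
accs = [45.0, 60.0, 88.0, 93.0, 95.0, 96.0]    # made-up per-epoch accuracies

# Smoothed test: stop once the mean of the last three accuracies reaches the threshold
history = []
stop_epoch = None
for epoch, acc in enumerate(accs):
    history.append(acc)
    if np.mean(history[-3:]) >= threshold:
        stop_epoch = epoch
        break

# Unsmoothed test for comparison: stop at the first epoch at or above the threshold
first_single = next(i for i, a in enumerate(accs) if a >= threshold)
print(stop_epoch, first_single)
```

With these numbers the smoothed test stops one epoch after the single-epoch test, since one good epoch is not enough to lift the three-epoch mean over the threshold.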

***

**Now that this works, let's convert it into a Python class.**

This requires some modifications from the code blocks above. We've also taken this opportunity to add enhancements.

In [1]:
import numpy as np
import warnings
import sys

In [2]:
class Linear:
    """
    Embarrassingly simple linear classifier using perceptron learning.
    
    Initialization inputs:
        data: str, filename of CSV data to be trained on, organized with row-wise
            examples and column-wise attributes. Known classes are expected to be in 
            the last column.
        train: float, 0 < train < 1, specifies the fraction of data to be used for
            training. Defaults to 0.75.
        threshold: float, 0 < threshold < 1, the accuracy threshold above which the 
            model is considered adequate and training stops. Defaults to 0.9.
        lr: float, learning rate used to adjust weights during training. Usually <1. 
            Defaults to 0.1.
        seed: int, used to set random seed prior to weight initialization for 
            reproducibility. If not supplied, weights will be initialized 
            differently for every instantiation (default).
        max_epochs: int, maximum number of training epochs before training stops
            for non-convergence. Defaults to 100.
        verbose: Boolean controlling whether model performance status should be 
            printed during training. Defaults to True.
    """
    def __init__(self, data, train=0.75, threshold=0.9, lr=0.1, seed=None,
                 max_epochs=100, verbose=True):
        self.dfile = data
        self.training = train
        self.threshold = threshold * 100
        self.eta = lr
        self.seed = seed
        self.max_epochs = max_epochs
        self.verbose = verbose
    
        # Load data
        self.data = np.loadtxt(fname=self.dfile, delimiter=',')

        # Normalize
        MAX = self.data.max(axis=0)
        MIN = self.data.min(axis=0)
        self.norm = (self.data - MIN) / (MAX - MIN)
        
        # Append artificial "zeroth" bias attribute x0=1
        # (used for weight-training)
        self.norm = np.append(arr=np.ones([len(self.norm),1]),
                              values=self.norm, axis=1)
        
        # Initialize
        self.initialize(shuffle=True)
        
    def __str__(self):
        return 'Embarrassingly simple linear classifier using perceptron learning.'
    
    def __repr__(self):
        return f'Embarrassingly simple linear classifier trained on {self.dfile}'

    def initialize(self, shuffle=True):
        """
        Randomly initialize weights.
        
        Input:
            shuffle: Boolean whether normalized data should be shuffled first. If 
                True (default), the newly shuffled data are also subset according to 
                'train' argument passed at instance initialization or as set by
                set_train_subset method.
        """
        # Shuffle examples
        if shuffle:
            print('Shuffling examples')
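            # Copy so the shuffle does not reorder self.norm in place
            # (a NumPy slice is a view, not a copy)
            self.shuffled = self.norm.copy()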
            np.random.shuffle(self.shuffled)

            # Training/testing subsets
            print('Splitting shuffled data into training and testing subsets')
            trainInd = round(self.training * len(self.shuffled))
            self.trainSS = self.shuffled[:trainInd]
            self.testSS = self.shuffled[trainInd:]        
    
        # Initialize weights
        wgts_rng = np.random.default_rng(seed=self.seed)
        num_weights = self.norm.shape[1] - 1
        self.weights = wgts_rng.random(num_weights)
        
    def reset(self, shuffle=True, seed=None):
        """
        Re-initialize model to random weights for retraining.
        
        Input:
            shuffle: Boolean whether normalized data should be (re)shuffled first.
                If True (default), the newly shuffled data are also subset according
                to 'train' argument passed at instance initialization or as set by
                set_train_subset method.
            seed: int, optionally set random seed prior to weight initialization. 
                Set this for reproducibility. If not supplied, seed will be retained
                from initialization. Use str 'None' to force set to NoneType if a
                seed value was previously set, either upon initiation or with this
                'reset' method. No random seed will cause weights to be initialized
                differently every time.
        """
        if seed is not None:
            self.set_seed(seed)
        self.initialize(shuffle=shuffle)
    
    def adjust_weights(self, att, tar, hyp):
        """
        Adjust weight(s) based on the learning rate, attribute values, and the
        difference between the known and hypothesized classes.

        Inputs:
            att: example attribute(s)
            tar: known class (from data; i.e., the 'target')
            hyp: hypothesized class (from classifier)
        """
        self.weights = self.weights + (self.eta * (tar - hyp) * att)
    
    def set_train_subset(self, fraction):
        """
        Set the fraction of data to be used for training.
        
        Input:
            fraction: float, 0 < fraction < 1
        """
        self.training = fraction
    
    def set_threshold(self, threshold):
        """
        Set the accuracy threshold above which the model is considered trained.
        
        Input:
            threshold: float, 0 < threshold < 1
        """
        self.threshold = threshold * 100
    
    def set_lr(self, lr):
        """
        Set learning rate used for training.
        
        Input:
            lr: float, usually <1
        """
        self.eta = lr
    
    def set_seed(self, seed):
        """
        Set the seed used for random weight initiation.

        Input:
            seed: int or str 'None' to force set to NoneType
        """
        if isinstance(seed, int):
            self.seed = seed
        elif isinstance(seed, str) and seed.lower() == 'none':
            self.seed = None
        else:
            raise ValueError("'seed' must be either an int or the string "
                             "'None' to set to NoneType")
        warnings.warn('Random seed has been set and may be different than '
                      'what was used to initiate this Linear instance.')

    def set_verbose(self, verbose):
        """
        Set whether model performance status should be printed during training.
        
        Input:
            verbose: Boolean
        """
        self.verbose = verbose
    
    def run_model(self, weights, examples):
        """
        Run linear classifier that returns 1 if the weighted sum is greater than
        zero or 0 otherwise.

        Inputs:
            weights: array of weight(s) making up the linear classifier
            examples: 2D array of examples to classify, one example per row
        """
        ws = np.sum(weights * examples, axis=1)
        h = ws > 0
        return h.astype(int)

    def get_weights(self):
        """Return existing model weights (parameters)."""
        return self.weights
    
    def accuracy(self, weights, examples):
        """
        Return the percentage of examples whose class was correctly identified.

        Inputs:
            weights: array of weight(s) making up the linear classifier
            examples: 2D array of examples to classify, one example per row
        """
        h = self.run_model(weights=weights, examples=examples[:,0:-1])
        num_correct = np.sum(examples[:,-1] == h)
        return np.round(((num_correct / len(examples)) * 100), 3)

    def error(self, weights, examples):
        """
        Return the percentage of examples whose class was incorrectly identified.

        Inputs:
            weights: array of weight(s) making up the linear classifier
            examples: 2D array of examples to classify, one example per row
        """
        h = self.run_model(weights=weights, examples=examples[:,0:-1])
        num_incorrect = np.sum(examples[:,-1] != h)
        return np.round(((num_incorrect / len(examples)) * 100), 3)

    def test(self, traindata=False):
        """
        Test the current model. Returns tuple (accuracy, error) as percentages of
        examples classified correctly and incorrectly, respectively.
        
        Input:
            traindata: Boolean indicating whether accuracy and error should be
                calculated on training subset. Defaults to False (testing subset)
        """
        ds = self.trainSS if traindata else self.testSS
        acc = self.accuracy(weights=self.weights, examples=ds)
        err = self.error(weights=self.weights, examples=ds)
        return (acc, err)
    
    def train(self):
        """Train the model"""
        if self.verbose:
            print('Training...')

        self.epoch_num = 0
        self.train_accs = []
        
        # Test initial model
        self.trainAcc, self.trainErr = self.test(traindata=True)
        self.train_accs.append(self.trainAcc)
        
        # Train
        while np.mean(self.train_accs[-3:]) < self.threshold:
            if self.verbose:
                print(f'Epoch {self.epoch_num} of {self.max_epochs} allowed: '\
                      f'Percent Error {self.trainErr}%')

            # Update epoch counter
            self.epoch_num += 1
            if self.epoch_num > self.max_epochs:
                sys.exit(f'Stopping for non-convergence after {self.max_epochs} '\
                          'max_epochs. Try increasing "max_epochs", decreasing '\
                          '"threshold", or adjusting learning rate "lr".')
                
            # Loop through each training example
            for ex in self.trainSS:
                attributes = ex[0:-1].reshape(1,-1)
                classLabel = ex[-1]
                hypothesis = self.run_model(weights=self.weights, 
                                            examples=attributes)
                self.adjust_weights(att=attributes, tar=classLabel, hyp=hypothesis)
                
            # Test current model
            self.trainAcc, self.trainErr = self.test(traindata=True)
            self.train_accs.append(self.trainAcc)

        if self.verbose:
            print(f'Epoch {self.epoch_num} of {self.max_epochs} allowed: '\
                  f'Percent Error {self.trainErr}%')
            print('Done!\n')

        # Final status
        self.testAcc, self.testErr = self.test(traindata=False)
        print(f'Final accuracy on training data: {self.trainAcc}%')
        print(f'Accuracy on testing data: {self.testAcc}%')
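
Because `train` calls `sys.exit` on non-convergence, the failure surfaces as a `SystemExit` exception rather than as a return value. The sketch below uses a hypothetical stand-in function, `train_stub`, which is not part of the class, to show how calling code could intercept that exception and read the message.

```python
import sys

def train_stub(max_epochs):
    """Hypothetical stand-in for a training loop that never converges."""
    epoch_num = 0
    while True:
        epoch_num += 1
        if epoch_num > max_epochs:
            # Same mechanism as Linear.train: sys.exit raises SystemExit
            sys.exit(f'Stopping for non-convergence after {max_epochs} max_epochs.')

try:
    train_stub(max_epochs=3)
    message = None
except SystemExit as exc:
    message = exc.code   # the string passed to sys.exit
print(message)
```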

**Now let's test it out to confirm it works as expected.**

Start by creating an instance of the model and printing the initial weights. These should be small random numbers.

In [3]:
model = Linear(data='banknote.csv', train=0.75, threshold=0.9, lr=0.01,
               max_epochs=10, verbose=True, seed=1)
model.get_weights()

Shuffling examples
Splitting shuffled data into training and testing subsets


array([0.51182162, 0.9504637 , 0.14415961, 0.94864945, 0.31183145])

Train this model and print the final weights. We should see percent error printed for each epoch (because we set verbose=True above), followed by the final model accuracy on both the training data (less informative) and testing data (more informative).

Also, the new weights should differ from what we saw above.

In [4]:
model.train()
model.get_weights()

Training...
Epoch 0 of 10 allowed: Percent Error 55.102%
Epoch 1 of 10 allowed: Percent Error 44.218%
Epoch 2 of 10 allowed: Percent Error 25.267%
Epoch 3 of 10 allowed: Percent Error 2.915%
Epoch 4 of 10 allowed: Percent Error 1.846%
Epoch 5 of 10 allowed: Percent Error 1.555%
Done!

Final accuracy on training data: 98.445%
Accuracy on testing data: 97.959%


array([[ 0.13182162, -0.10823919, -0.09659915, -0.10860191,  0.00930294]])
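
Note that the trained weights print as a 2D array, while the initial weights were 1D. A likely explanation, sketched below with made-up values, is that the training loop reshapes each example to shape (1, 5), and adding that to a (5,) weight vector broadcasts to (1, 5). The weighted sum in `run_model` still works on the 2D weights, and `ravel` recovers a flat vector if one is wanted.

```python
import numpy as np

weights = np.array([0.5, 0.9, 0.1, 0.9, 0.3])       # 1D, shape (5,)
attributes = np.linspace(0, 1, 5).reshape(1, -1)     # 2D, shape (1, 5), as in the loop

# One perceptron update with tar=1, hyp=0, eta=0.01 (illustrative values)
updated = weights + 0.01 * (1 - 0) * attributes      # broadcasting yields shape (1, 5)

examples = np.ones((3, 5))
scores = np.sum(updated * examples, axis=1)          # (1, 5) * (3, 5) -> (3, 5) -> (3,)
flat = updated.ravel()                               # back to shape (5,)
print(weights.shape, updated.shape, scores.shape, flat.shape)
```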

Next, test the reset method. This should re-initialize the model weights. First, we do so without re-shuffling the data and without passing a new value to the 'seed' argument. Since we set seed=1 above and are not changing it here, we should get the same weights as the first time, both at initiation and after training.

In [5]:
model.reset(shuffle=False)
model.get_weights()

array([0.51182162, 0.9504637 , 0.14415961, 0.94864945, 0.31183145])
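
The identical weights follow from how `initialize` builds a fresh generator from `self.seed` on every call. A minimal sketch of that behavior, independent of the class:

```python
import numpy as np

# Two fresh generators with the same seed produce the same draws
first = np.random.default_rng(seed=1).random(5)
second = np.random.default_rng(seed=1).random(5)

# A different seed produces a different sequence
other = np.random.default_rng(seed=2).random(5)
print(first, second, other)
```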

In [6]:
model.train()
model.get_weights()

Training...
Epoch 0 of 10 allowed: Percent Error 55.102%
Epoch 1 of 10 allowed: Percent Error 44.218%
Epoch 2 of 10 allowed: Percent Error 25.267%
Epoch 3 of 10 allowed: Percent Error 2.915%
Epoch 4 of 10 allowed: Percent Error 1.846%
Epoch 5 of 10 allowed: Percent Error 1.555%
Done!

Final accuracy on training data: 98.445%
Accuracy on testing data: 97.959%


array([[ 0.13182162, -0.10823919, -0.09659915, -0.10860191,  0.00930294]])

Good! Now let's reset again, but this time re-shuffle the examples. Our initial weights should still be the same, but we expect different final weights and training progress, since we're training on a different set of examples.

In [7]:
model.reset(shuffle=True)
model.get_weights()

Shuffling examples
Splitting shuffled data into training and testing subsets


array([0.51182162, 0.9504637 , 0.14415961, 0.94864945, 0.31183145])

In [8]:
model.train()
model.get_weights()

Training...
Epoch 0 of 10 allowed: Percent Error 54.616%
Epoch 1 of 10 allowed: Percent Error 44.315%
Epoch 2 of 10 allowed: Percent Error 19.631%
Epoch 3 of 10 allowed: Percent Error 3.11%
Epoch 4 of 10 allowed: Percent Error 1.846%
Done!

Final accuracy on training data: 98.154%
Accuracy on testing data: 98.834%


array([[ 0.14182162, -0.11657762, -0.10245605, -0.11048185,  0.00739985]])
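
The split changes even though the seed does not because the class shuffles with `np.random.shuffle`, which draws from NumPy's global random state, while the weights come from a separate `default_rng(self.seed)` generator. If a reproducible split were wanted, one option (an assumption on our part, not something the class does) is to drive the shuffle from a seeded generator as well:

```python
import numpy as np

rows = np.arange(12).reshape(6, 2)   # made-up data, one example per row

# Hypothetical alternative: a seeded Generator dedicated to shuffling
order_a = np.random.default_rng(seed=7).permutation(len(rows))
order_b = np.random.default_rng(seed=7).permutation(len(rows))

shuffled = rows[order_a]             # rows reordered, each row kept intact
print(order_a, order_b)
```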

So far, so good. Next, reset again and change the seed. Use 'None' (string) to forcefully remove the integer seed set at the original instantiation. Now we should get different initial weights and, of course, different training results.

In [9]:
model.reset(shuffle=False, seed='None')
model.get_weights()



array([0.07047586, 0.64287171, 0.39740859, 0.63053744, 0.19173137])

In [10]:
model.train()

Training...
Epoch 0 of 10 allowed: Percent Error 54.616%
Epoch 1 of 10 allowed: Percent Error 40.233%
Epoch 2 of 10 allowed: Percent Error 23.712%
Epoch 3 of 10 allowed: Percent Error 3.596%
Epoch 4 of 10 allowed: Percent Error 4.276%
Epoch 5 of 10 allowed: Percent Error 2.818%
Done!

Final accuracy on training data: 97.182%
Accuracy on testing data: 98.834%


Make sure the seed took (sanity check):

In [11]:
print(model.seed)

None
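
The string 'None' is needed because the Python value `None` already means "no seed passed, keep the current one". The sketch below illustrates that convention with a hypothetical helper, `resolve_seed`, which is not part of the class. It checks `is None` explicitly so that a seed of 0 is still treated as a real value.

```python
def resolve_seed(current, requested=None):
    """Hypothetical helper illustrating the 'None' string sentinel."""
    if requested is None:
        return current                  # nothing passed: keep the existing seed
    if isinstance(requested, str) and requested.lower() == 'none':
        return None                     # explicit request to clear the seed
    if isinstance(requested, int):
        return requested                # new integer seed (0 included)
    raise ValueError("'seed' must be an int or the string 'None'")

kept = resolve_seed(current=1)
cleared = resolve_seed(current=1, requested='None')
changed = resolve_seed(current=1, requested=0)
print(kept, cleared, changed)
```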


Finally, reset one last time without shuffling and without passing a seed. Because the seed is now None, the initial weights and the training results should both differ from the previous run.

In [12]:
model.reset(shuffle=False)
model.get_weights()

array([0.09445656, 0.91924503, 0.97550096, 0.98492448, 0.4620869 ])

In [13]:
model.train()
model.get_weights()

Training...
Epoch 0 of 10 allowed: Percent Error 54.616%
Epoch 1 of 10 allowed: Percent Error 54.227%
Epoch 2 of 10 allowed: Percent Error 33.722%
Epoch 3 of 10 allowed: Percent Error 24.393%
Epoch 4 of 10 allowed: Percent Error 13.508%
Epoch 5 of 10 allowed: Percent Error 4.762%
Epoch 6 of 10 allowed: Percent Error 2.43%
Done!

Final accuracy on training data: 97.57%
Accuracy on testing data: 97.376%


array([[ 0.12445656, -0.14127662, -0.07662629, -0.11544128,  0.02996477]])

**Everything seems to be working as expected!**

***