### Sample code for COSC522 Project 1 - How to modularize your code
This code shows the opposite extreme to Proj1-Case1: a new class, "mpp", is defined with a training method (fit) and a testing method (predict). It also defines one function to load data (load_data) and one function to evaluate the predictions (accuracy_score). The latter is incomplete, so you need to fill in the blanks. Both the class and the functions can be reused in other projects or with another dataset.

If you go through the main function first, the flow should be very clear:
Step 1: read in the datasets
Step 2: train the model using the training set
Step 3: test the model using the test set
Step 4: evaluate the performance of the model

In this implementation, I also introduce a new data structure, the "dictionary", instead of using an "array" for the covariance matrices and means of the different categories. I'll leave it to you to figure out what a "dictionary" is and what the benefit of using it is.

I'd like to draw your attention to the difference between a native Python array (a list) and a numpy array. In this implementation, I made an effort to always use numpy arrays for their rich features, although this costs a little in efficiency.
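
To make that difference concrete, here is a minimal sketch comparing the two. The variable names are made up for illustration and do not appear in the project code.

```python
import numpy as np

py_list = [1, 2, 3]            # native Python list
np_arr = np.array([1, 2, 3])   # numpy array

# "*" repeats a list but multiplies a numpy array element by element
doubled_list = py_list * 2
doubled_arr = np_arr * 2

# a comparison on a numpy array yields a boolean mask that can be used for indexing
mask = np_arr > 1
selected = np_arr[mask]

print(doubled_list)
print(doubled_arr)
print(selected)
```

The boolean-mask indexing in the last step is the same mechanism that `Tr[y == c]` relies on in the fit method.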

Finally, if this is difficult for you to digest, you can always use Proj1-Case1, but I'd strongly urge you to start from a good programming structure.

In [1]:
import numpy as np
import sys

In [2]:
def load_data(f):
    """ Assume data format:
    feature1 feature2 ... label
    """

    # read the whitespace-delimited text file into a 2-D array
    data = np.genfromtxt(f)
    # features are all columns except the last; the label is the last column
    X = data[:, :-1]
    y = data[:, -1].astype(int)

    return X, y
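
As a quick check of the assumed file format, the sketch below feeds load_data a small in-memory "file" instead of a file on disk. The three sample rows are invented for illustration; load_data is repeated only so that the snippet runs on its own.

```python
import io
import numpy as np

def load_data(f):
    # same logic as the load_data function above
    data = np.genfromtxt(f)
    X = data[:, :-1]
    y = data[:, -1].astype(int)
    return X, y

# three made-up samples: two features followed by an integer label
sample = io.StringIO("0.1 0.2 0\n0.3 0.4 1\n0.5 0.6 1\n")
X, y = load_data(sample)

print(X.shape)   # one row per sample, one column per feature
print(y)         # labels as integers
```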

In [3]:
def accuracy_score(y, y_model):
    """ Return accuracy score.
    You are supposed to return both overall accuracy and classwise accuracy.
    The following code only returns overall accuracy
    """
    assert len(y) == len(y_model)

    classn = len(np.unique(y))    # number of different classes
    correct_all = y == y_model    # all correctly classified samples

    acc_overall = np.sum(correct_all) / len(y)
    acc_i = []        # this list stores classwise accuracy
    
    """calculate classwise accuracy
    you need to fill in this part
    """

    return acc_i, acc_overall
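
The overall accuracy above relies on an elementwise comparison between a numpy array and a list. The toy labels below are invented for illustration; the sketch shows only what that comparison produces and leaves the classwise part to you.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])   # made-up true labels
y_model = [0, 1, 1, 1, 0]       # made-up predictions (a plain list, as returned by predict)

# elementwise comparison: True where the prediction matches the true label
correct_all = y == y_model

# True counts as 1 when summed, so this is the fraction of correct predictions
acc_overall = np.sum(correct_all) / len(y)

print(correct_all)
print(acc_overall)
```

One possible route to the classwise accuracy is to apply the same kind of boolean mask to the samples of a single class at a time.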

In [4]:
class mpp:
    """Maximum Posterior Probability
    Supervised parametric learning assuming Gaussian pdf
    with 3 cases of discriminant functions
    """

    def __init__(self, case, prior):
        self.case_ = case
        self.prior_ = prior        
        
    def fit(self, Tr, y):
        # derive the model 
        self.covs_, self.means_, self.pw_ = {}, {}, {}     # dictionaries
        self.covsum_ = None

        self.classes_ = np.unique(y)     # unique labels, used as dictionary keys
        self.classn_ = len(self.classes_) # the number of classes in the dataset

        assert self.classn_ == len(self.prior_)  
        k = 0       # to convert the prior probability from array to dictionary
        for c in self.classes_:
            arr = Tr[y == c]
            self.covs_[c] = np.cov(np.transpose(arr))
            self.means_[c] = np.mean(arr, axis=0)  # per-feature mean, averaged over the rows (samples)
            self.pw_[c] = self.prior_[k]
            k = k + 1
            if self.covsum_ is None:
                # copy, so that the in-place += below does not also modify self.covs_[c]
                self.covsum_ = self.covs_[c].copy()
            else:
                self.covsum_ += self.covs_[c]

        # used by case II
        self.covavg_ = self.covsum_ / self.classn_

        # used by case I: average variance over the features (trace / number of features)
        self.varavg_ = np.sum(np.diagonal(self.covavg_)) / self.covavg_.shape[0]

    def predict(self, Te):
        # predict labels of all test data 
        y = []      # list to hold the predicted label
        disc = np.zeros(self.classn_)
        nr, _ = Te.shape

        for i in range(nr):         # going through each sample (or each row of the test set)
            for k, c in enumerate(self.classes_):  # k: position in disc, c: class label
                if self.case_ == 1:
                    edist2 = np.dot(Te[i]-self.means_[c], Te[i]-self.means_[c])
                    disc[k] = -edist2 / (2 * self.varavg_) + np.log(self.pw_[c])
                elif self.case_ == 2: 
                    "implement minimum Mahalanobis classifier"
                elif self.case_ == 3:
                    "implement quadratic machine"
                else:
                    print("Can only handle case numbers 1, 2, 3.")
                    sys.exit(1)
            y.append(self.classes_[disc.argmax()])  # map the winning position back to its label
            
        return y
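
To see what the case 1 branch computes, here is a minimal sketch that evaluates the same discriminant, minus the squared Euclidean distance over twice the variance plus the log prior, for a single point. The means, priors, variance, and test point are all made-up values, not estimates from the synth dataset.

```python
import numpy as np

x = np.array([1.0, 1.0])                      # one made-up test sample
means = {0: np.array([0.0, 0.0]),             # hypothetical class means,
         1: np.array([2.0, 1.0])}             # stored in a dictionary keyed by label
priors = {0: 0.5, 1: 0.5}                     # hypothetical prior probabilities
var = 1.0                                     # hypothetical shared variance

# case 1 discriminant for each class, again stored in a dictionary
disc = {}
for c in means:
    edist2 = np.dot(x - means[c], x - means[c])          # squared Euclidean distance
    disc[c] = -edist2 / (2 * var) + np.log(priors[c])

# the predicted label is the class with the largest discriminant value
label = max(disc, key=disc.get)
print(disc, label)
```

With equal priors, the log-prior term is the same for every class, so the decision reduces to picking the nearest class mean.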

In [6]:
def main():
    # read in the datasets
    Xtrain, ytrain = load_data('synth.tr')
    Xtest, ytest = load_data('synth.te')
    print(f"The dimension of the training data is: {Xtrain.shape}")
    print(f"The dimension of the testing data is: {Xtest.shape}")

    # specify the prior probability, and the cases
    prior = np.array([0.5,0.5])
    case = 1
    
    # create an object of the model initialized by the case # and prior probability
    model = mpp(case, prior)
    # train the model using the training set
    model.fit(Xtrain, ytrain)
    # test the model using the test set
    y_model = model.predict(Xtest)
    # evaluate the performance of the model
    acc_classwise, acc_overall = accuracy_score(ytest, y_model)
    print(f"The overall classification accuracy is {acc_overall}")
    
if __name__ == "__main__":
    main()

The dimension of the training data is: (250, 2)
The dimension of the testing data is: (1000, 2)
The overall classification accuracy is 0.713
