# 1. Naive Bayes

We will classify sentences with the Naive Bayes algorithm. The training data consist of abusive and non-abusive sentences, each given as a list of words. The task is to predict whether a new sentence is abusive.


## 1.1 Prepare: transform sentences into vectors

In [1]:
def loadDataSet():
    ### Each sentence is given as a list of words.
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],       
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    ### Labels of sentences. 1 is abusive, 0 not
    classVec = [0,1,0,1,0,1]
    return postingList,classVec

# create a list of the unique words in all sentences.
def createVocabList(dataSet):
    vocabSet = set([])    #Create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #Create the union of two sets
    return list(vocabSet)

In [2]:
postingList, classVec = loadDataSet()
myVocabList = createVocabList(postingList)
print('The vocabulary list is:\n', myVocabList)

The vocabulary list is:
 ['steak', 'is', 'quit', 'to', 'mr', 'problems', 'love', 'ate', 'help', 'cute', 'I', 'licks', 'posting', 'garbage', 'take', 'not', 'park', 'so', 'food', 'worthless', 'stop', 'dog', 'maybe', 'how', 'please', 'flea', 'has', 'my', 'dalmation', 'him', 'stupid', 'buying']


In [3]:
def setOfWords2Vec(vocabList, inputSet):
    '''
    Convert a word list (inputSet) into a vector of 1s and 0s with the same length as the vocabulary list 
    (vocabList). 
    The $i$-th element of the output vector indicates whether the $i$-th word of the vocabulary list is present 
    in the word list.
    
    Args:
        vocabList - a vocabulary list
        inputSet - a word list
    Returns:
        returnVec - a vector of 1s and 0s of the same length as the vocabulary list
    '''
    returnVec = [0] * len(vocabList)                               #Create a vector of all 0s
    for word in inputSet:
        if word in vocabList:                                      #If the word is in the vocabulary list, set its entry to 1 in the output vector.
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

In [4]:
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
print('The 0-1 vector of the first sentence is:', trainMat[0])
print(len(trainMat[0]))

The 0-1 vector of the first sentence is: [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
32
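
`setOfWords2Vec` calls `vocabList.index(word)` for every word, which scans the list each time, and the order of `list(vocabSet)` may differ between Python sessions. One possible alternative, shown here only as a sketch, is to sort the vocabulary and keep a word-to-index dictionary. The names `build_vocab_index` and `set_of_words_vector` and the two toy sentences are hypothetical and not part of the notebook.

```python
def build_vocab_index(documents):
    """Map each unique word to a fixed position (sorted for a reproducible order)."""
    vocab = sorted({word for doc in documents for word in doc})
    return {word: i for i, word in enumerate(vocab)}


def set_of_words_vector(vocab_index, words):
    """Return a 0-1 vector marking which vocabulary words occur in `words`."""
    vec = [0] * len(vocab_index)
    for word in words:
        position = vocab_index.get(word)
        if position is not None:      # words outside the vocabulary are skipped
            vec[position] = 1
    return vec


# Toy data (hypothetical), smaller than the notebook's postingList
docs = [['my', 'dog', 'has', 'flea'], ['stop', 'my', 'dog']]
vocab_index = build_vocab_index(docs)
example_vec = set_of_words_vector(vocab_index, ['my', 'dog', 'my', 'cat'])
print(vocab_index)
print(example_vec)
```

A repeated word ('my') still yields a 1, and an unknown word ('cat') is ignored rather than reported.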


## 1.2 Implement the Naive Bayes class

In [3]:
import numpy as np
class NaiveBayes():
    '''
    This is a class for Naive Bayes classification.
    
    The class contains the arrays of conditional probabilities for abusive class and not abusive class.
    
    It also contains the functions for initializing the class, fitting the Naive Bayes classifier model and using 
    the fitted model to predict test samples.
    
    Attributes:
        p0Vect (vector, num_vocablist) - array of conditional probabilities for the not abusive class (label 0)
        p1Vect (vector, num_vocablist) - array of conditional probabilities for the abusive class (label 1)
        pAbusive (number in [0,1])     - the prior probability that a document belongs to the abusive class
        
    '''
    def __init__(self):
        self.p1Vect = 0
        self.p0Vect = 0
        self.pAbusive = 0
        
    def fit(self, trainMatrix, classVec):
        '''
        Fit the naive Bayes classifier to the training data. Specifically, we estimate the class-conditional 
        probabilities $p(x_j|c)$ and the class prior $p(c)$.

        Args:
            trainMatrix (matrix, num_sentences * num_vocablist) : sentence matrix, whose rows are returned by the function setOfWords2Vec()
            classVec (vector, num_sentences)                    : label vector, returned by the function loadDataSet()
        Returns:
            p0Vect (vector, num_vocablist) - array of conditional probabilities for the not abusive class (label 0)
            p1Vect (vector, num_vocablist) - array of conditional probabilities for the abusive class (label 1)
            pAbusive (number in [0,1])     - the prior probability that a document belongs to the abusive class
        '''
        numTrainDocs = len(trainMatrix)                       
        numWords = len(trainMatrix[0])                        
        self.pAbusive = sum(classVec)/numTrainDocs     
        ### Initialize the word counts to 1 (Laplacian smoothing), using numpy.ones arrays
        p0Num = np.ones(numWords); p1Num = np.ones(numWords)  
        ### Initialize the denominators to 2 (Laplacian smoothing)
        p0Denom = 2.0; p1Denom = 2.0                          
        ### Calculate the probabilities of appearance of all vocabulary words for the abusive and not abusive classes.
        for i in range(numTrainDocs):
            ### Update p1Num, p1Denom, p0Num, p0Denom
            if classVec[i] == 1:   
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:                      
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        self.p1Vect = p1Num/p1Denom
        self.p0Vect = p0Num/p0Denom
        return self.p0Vect, self.p1Vect, self.pAbusive
    

    def predict(self, vec2Classify):
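        '''
        Classify a sentence by comparing the log-posterior scores of the two classes.

        Args:
            vec2Classify - the 0-1 vector of the sentence to be classified, returned by the function setOfWords2Vec()
        Returns:
            0/1 - classified as not abusive/abusive
        '''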
        logp1Vect = np.log(self.p1Vect)                     
        logp0Vect = np.log(self.p0Vect)
        p1 = np.sum(vec2Classify * logp1Vect) + np.log(self.pAbusive)       
        p0 = np.sum(vec2Classify * logp0Vect) + np.log(1.0 - self.pAbusive)
        if p1 > p0:
            return 1
        else:
            return 0

## 1.3 Fit model

In [27]:
NBmodel = NaiveBayes()
p0V, p1V, pAb = NBmodel.fit(trainMat, classVec)
print('p0V:\n', p0V)
print('p1V:\n', p1V)
print('pAbusive:\n', pAb)

p0V:
 [0.07692308 0.07692308 0.07692308 0.03846154 0.07692308 0.07692308
 0.07692308 0.07692308 0.03846154 0.03846154 0.07692308 0.15384615
 0.07692308 0.03846154 0.07692308 0.07692308 0.07692308 0.07692308
 0.07692308 0.03846154 0.11538462 0.03846154 0.07692308 0.07692308
 0.03846154 0.07692308 0.07692308 0.03846154 0.03846154 0.03846154
 0.07692308 0.03846154]
p1V:
 [0.14285714 0.04761905 0.04761905 0.0952381  0.0952381  0.04761905
 0.04761905 0.04761905 0.0952381  0.0952381  0.04761905 0.04761905
 0.04761905 0.0952381  0.04761905 0.04761905 0.0952381  0.04761905
 0.04761905 0.0952381  0.0952381  0.0952381  0.04761905 0.04761905
 0.19047619 0.04761905 0.04761905 0.0952381  0.0952381  0.14285714
 0.04761905 0.0952381 ]
pAbusive:
 0.5
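
The largest entries in the output above can be checked by hand. In `fit`, each entry is (count of the word in the class + 1) / (total words in the class + 2). The sketch below recomputes two of them directly from the word lists. The helper `smoothed_word_prob` is a hypothetical name, and it relies on the fact that no word repeats within a sentence in this dataset, so raw counts equal the 0-1 indicators.

```python
posting_list = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
class_vec = [0, 1, 0, 1, 0, 1]


def smoothed_word_prob(word, label):
    """(count of `word` in class `label` + 1) / (total words in that class + 2)."""
    docs = [doc for doc, c in zip(posting_list, class_vec) if c == label]
    word_count = sum(doc.count(word) for doc in docs)
    total_words = sum(len(doc) for doc in docs)
    return (word_count + 1) / (total_words + 2)


p_stupid_abusive = smoothed_word_prob('stupid', 1)   # (3 + 1) / (19 + 2)
p_my_not_abusive = smoothed_word_prob('my', 0)       # (3 + 1) / (24 + 2)
print(p_stupid_abusive, p_my_not_abusive)
```

These are 4/21 and 4/26, which correspond to the entries 0.19047619 in `p1V` and 0.15384615 in `p0V`.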


## 1.4 Predict the new sentence 

In [28]:
testEntry = ['love', 'my', 'dalmation']
thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
print('{} is classified as: {}'.format(testEntry, NBmodel.predict(thisDoc)))

testEntry = ['stupid', 'garbage']
thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
print('{} is classified as: {}'.format(testEntry, NBmodel.predict(thisDoc)))

['love', 'my', 'dalmation'] is classified as: 0
['stupid', 'garbage'] is classified as: 1
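
`predict` sums log-probabilities instead of multiplying probabilities. The sketch below illustrates why, using 1000 made-up word probabilities of 0.01 each (an assumption for illustration, not data from this notebook): the direct product underflows in double precision, while the log-scale score stays finite and can still be compared across classes.

```python
import numpy as np

word_probs = np.full(1000, 0.01)          # 1000 hypothetical word probabilities
direct_product = np.prod(word_probs)      # 0.01**1000 is below the float64 range
log_score = np.sum(np.log(word_probs))    # the same quantity on the log scale

print(direct_product)
print(log_score)
```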


# 2. Linear Discriminant Model 

In [5]:
def loadDataSet(dataset_path, file_type="txt"):
    ### Only whitespace-separated text files are supported
    if file_type != "txt":
        raise ValueError("Unsupported file type: {}".format(file_type))
    X = []                                                           ### create feature matrix
    y = []                                                           ### create label vector
    with open(dataset_path) as fr:                                   ### open file (closed automatically)
        for line in fr:                                              ### read the data line by line
            lineArr = line.strip().split()                           ### remove the `\n` and split the line into fields
            X.append([float(x) for x in lineArr[:-1]])               ### add to the feature matrix
            y.append(float(lineArr[-1]))                             ### add to the label vector
    return X, y

# read the data
import numpy as np
X_train, y_train = loadDataSet("horseColicTraining.txt")
X_test, y_test = loadDataSet("horseColicTest.txt")

# transform the data from list to np.array
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

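# normalize each feature to [0, 1]
# note: the scaler is fitted on the training and test sets together, so test-set statistics enter the preprocessing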
X = np.vstack([X_train, X_test])
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
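
The cell above fits the scaler on the stacked training and test data. A common alternative is to estimate the scaling statistics from the training set alone and reuse them on the test set. The following NumPy-only sketch shows that idea. The helper names `fit_min_max` and `apply_min_max` and the toy numbers are hypothetical, and this is not the preprocessing used for the accuracy reported later.

```python
import numpy as np


def fit_min_max(X_fit):
    """Return the per-feature minimum and range computed from X_fit only."""
    x_min = X_fit.min(axis=0)
    x_range = X_fit.max(axis=0) - x_min
    x_range[x_range == 0] = 1.0      # constant features: avoid division by zero
    return x_min, x_range


def apply_min_max(X, x_min, x_range):
    """Scale X with statistics that were estimated elsewhere."""
    return (X - x_min) / x_range


toy_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
toy_test = np.array([[2.0, 40.0]])
x_min, x_range = fit_min_max(toy_train)
scaled_train = apply_min_max(toy_train, x_min, x_range)
scaled_test = apply_min_max(toy_test, x_min, x_range)   # may fall outside [0, 1]
print(scaled_train)
print(scaled_test)
```

With training-only statistics, a test value beyond the training range maps outside [0, 1], which is expected.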

In [16]:
class LDA(object):
    '''
    This class implements linear discriminant analysis (LDA) for classification.
    
    The class contains the parameters of LDA, including the number of classes and the prior probability $p(i)$ of 
    each class $i$, where $i=1,2,\ldots,num_classes$. Moreover, the class contains the mean vectors $\mu_i$ 
    and the shared covariance matrix $\Sigma$ of the probability distributions $p(x|i)$ for each class $i$.
    
    It also contains the functions for initializing the class, fitting the LDA classifier model, and using 
    the fitted model to calculate the linear discriminant functions $\delta_i(x)$ and the decision function $h^*(x)$.
    
    Attributes:
        mu (matrix, num_classes*num_features)    : mean vectors of distributions $p(x|i)$. The $i$-th row represents $\mu_i$.
        Sigma (matrix, num_features*num_features): covariance matrix
        num_classes (positive integer)           : the number of classes
        priorProbs (vector, num_classes)         : the prior probability vector and its $i$-th element is $p(i)$
        
    '''
    def __init__(self):
        '''
        Initialize the class by assigning zero to all attributes. 
        '''
        self.mu = 0 
        self.Sigma = 0
        self.num_classes = 0
        self.priorProbs = 0
        
    def fit(self, X, y):
        '''
        Estimate the prior probabilities, the mean vector of each class and the shared covariance matrix of the LDA model.
        
        Args: 
            X (matrix, num_train*num_features): features of training samples
            y (vector, num_train): labels of training samples, assumed to be encoded as 0, 1, ..., num_classes-1
            
        Returns:
            mu (matrix, num_classes*num_features)    : mean vectors of distributions $p(x|i)$. The $i$-th row represents $\mu_i$.
            Sigma (matrix, num_features*num_features): covariance matrix
        ''' 
        num_samples, num_features = X.shape
        values, counts = np.unique(y, return_counts = True)
        num_classes = len(values)
        self.num_classes = num_classes
        ### calculate the prior probability $p(i)$
        self.priorProbs = counts / num_samples
        ### calculate the mean vector of each class $\mu_i$
        self.mu = np.zeros((num_classes, num_features))
        for k in range(num_samples):
            self.mu[int(y[k]),:] += X[k,:]
        self.mu = self.mu / np.expand_dims(counts, 1) 
        ### calculate the covariance matrix $\Sigma$
        Sigma_i = [np.cov(X[y == i].T)*(X[y == i].shape[0]-1) for i in range(num_classes)] 
        self.Sigma = sum(Sigma_i) / (X.shape[0]-num_classes)
        return self.mu, self.Sigma
    
    def linear_discriminant_func(self, X):
        '''
        calculate the linear discriminant functions $\delta_i(X)$
        
        Args: 
            X (matrix, num_samples*num_features): features of samples
            
        Returns:
            value (matrix, num_samples*num_classes): the linear discriminant function values. 
            The $(j,i)$-th entry of value represents $\delta_i(X[j,:])$, which is the linear discriminant function value for the class $i$ of the sample at row $j$.
        '''
        ### calculate the inverse of the covariance matrix via SVD: Sigma = U S V^T, so Sigma^{-1} = V S^{-1} U^T
        ### (np.linalg.svd returns V^T as its third output)
        U, S, Vt = np.linalg.svd(self.Sigma)
        Sn = np.linalg.inv(np.diag(S))
        Sigma_inv = np.dot(np.dot(Vt.T, Sn), U.T)
        ### calculate the linear discriminant function values of X
        value = np.dot(np.dot(X, Sigma_inv), self.mu.T) - \
                0.5 * np.multiply(np.dot(self.mu, Sigma_inv).T, self.mu.T).sum(axis = 0).reshape(1, -1) + \
                np.log(np.expand_dims(self.priorProbs, axis = 0))
        return value
    
    def predict(self, X):
        '''
        Predict the labels of samples by choosing the class with the largest linear discriminant function value.
        
        Args: 
            X (matrix, num_samples*num_features): features of samples
            
        Returns:
            pred_label (vector, num_samples): the predicted labels of samples. The $j$-th entry represents the predicted label of the sample at row $j$.
        '''
        pred_value = self.linear_discriminant_func(X)
        pred_label = np.argmax(pred_value, axis = 1)
        return pred_label

In [17]:
### initiate the LDA model
model = LDA()
### fit the model with training data and get the estimation of mu and Sigma
mu, Sigma = model.fit(X_train, y_train)
### predict the label of test data
y_pred = model.predict(X_test)
### calculate the accuracy of the fitted LDA model on test data
accuracy = np.sum(y_pred == y_test)/len(y_test)
print("Accuracy of LDA on the test dataset is {}.".format(accuracy))

Accuracy of LDA on the test dataset is 0.7313432835820896.
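
As a sanity check of the discriminant formula used in `linear_discriminant_func`, the sketch below evaluates delta_i(x) = x^T Sigma^{-1} mu_i - 0.5 mu_i^T Sigma^{-1} mu_i + log p(i) on two well-separated synthetic Gaussian blobs. The data, the seed and the function name `lda_scores` are assumptions made for illustration, and `np.linalg.solve` is used here in place of the explicit SVD-based inverse.

```python
import numpy as np


def lda_scores(X, mus, sigma, priors):
    """Linear discriminant values; column i holds delta_i for every row of X."""
    sigma_inv_mu = np.linalg.solve(sigma, mus.T)            # Sigma^{-1} mu_i, one column per class
    linear_term = X @ sigma_inv_mu                          # x^T Sigma^{-1} mu_i
    quad_term = 0.5 * np.sum(mus.T * sigma_inv_mu, axis=0)  # 0.5 * mu_i^T Sigma^{-1} mu_i
    return linear_term - quad_term + np.log(priors)


# Synthetic two-class data (hypothetical): unit-variance blobs centred at (0, 0) and (6, 6)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[6.0, 6.0], scale=1.0, size=(200, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.repeat([0, 1], 200)

mus = np.array([X_demo[y_demo == k].mean(axis=0) for k in (0, 1)])
# pooled within-class covariance, the same estimator as in LDA.fit
scatter = sum((X_demo[y_demo == k] - mus[k]).T @ (X_demo[y_demo == k] - mus[k]) for k in (0, 1))
sigma = scatter / (len(X_demo) - 2)
priors = np.array([0.5, 0.5])

scores = lda_scores(X_demo, mus, sigma, priors)
demo_accuracy = np.mean(np.argmax(scores, axis=1) == y_demo)
print(demo_accuracy)
```

Solving the linear system avoids forming the inverse explicitly; both routes should agree when the covariance matrix is well conditioned.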


# 3. EM algorithm 

Generate samples from a mixture-of-Gaussian distribution.

In [6]:
import math
import copy
import numpy as np
 
def generate_data(num_samples, alpha, mu_list, sigma_list):    
    '''
    Generate the synthetic dataset from the mixture-of-Gaussian distribution
    Args:
        num_samples (positive integer)                  : number of samples
        alpha  (vector, num_classes)                    : prior probability vector
        mu_list (list of length num_classes)            : the $i$-th element is the mean vector of $i$-th class
        sigma_list (list of length num_classes)         : the $i$-th element is the covariance matrix of $i$-th class
    Returns:
        X (matrix, num_samples * num_features)        : generated data
      
    '''
    
    num_components = len(mu_list)
    num_features = len(mu_list[0])
    X = np.zeros((num_samples, num_features))       
    # Generate random numbers in [0,1]
    random_numbers = np.random.random(num_samples)
    for i in range(num_samples):
        for j in range(num_components):
            if random_numbers[i] < sum(alpha[:j+1]):  
                X[i,:]  = np.random.multivariate_normal(mu_list[j], sigma_list[j], 1) 
                break
    return X
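
`generate_data` picks a component for each sample by comparing a uniform random number with the cumulative sums of `alpha`. An equivalent way to draw the component labels, sketched below, is `Generator.choice` with `p=alpha`, followed by one batched draw per component. The function name `sample_mixture` and the fixed seed are assumptions for illustration; unlike `generate_data`, this sketch also returns the component labels.

```python
import numpy as np


def sample_mixture(num_samples, alpha, mu_list, sigma_list, seed=0):
    """Draw samples from a Gaussian mixture; returns the data and the component labels."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(alpha), size=num_samples, p=alpha)   # component of each sample
    X = np.zeros((num_samples, len(mu_list[0])))
    for k in range(len(alpha)):
        mask = labels == k
        # one batched draw for all samples assigned to component k
        X[mask] = rng.multivariate_normal(mu_list[k], sigma_list[k], size=int(mask.sum()))
    return X, labels


demo_alpha = [0.1, 0.2, 0.3, 0.4]
demo_mu = [[5, 5], [10, 15], [25, 20], [45, 30]]
demo_sigma = [np.array([[10, 0], [0, 10]])] * 4
X_mix, mix_labels = sample_mixture(10000, demo_alpha, demo_mu, demo_sigma)
proportions = np.bincount(mix_labels, minlength=4) / len(mix_labels)
print(X_mix.shape)
print(proportions)
```

Keeping the labels makes it easy to compare the empirical component proportions with `alpha`.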

In [7]:
class EM_for_MG():
    '''
    This class uses the EM algorithm to estimate the parameters of a mixture-of-Gaussian distribution.
    
    The class contains the parameters of the EM iteration, including the number of classes $N$, the prior probabilities 
    $\alpha_i$, the mean vectors $\mu_i$ and the covariance matrices $\Sigma_i$ for each class $i$.
    
    It also contains the functions for initializing the class, updating the parameters in the E-step and M-step, and iterating over 
    the two steps until convergence.
    
    Attributes:
        num_classes (positive integer)           : the number of Gaussian components
        hat_alpha (list of length num_classes)   : the prior probability of each component
        hat_mu (list of length num_classes)      : the mean vector of each Gaussian component
        hat_sigma (list of length num_classes)   : the covariance matrix of each Gaussian component
        posterior_prob (matrix, num_samples * num_classes) : the posterior probability matrix and the $(j,i)$-th entry
            represents the posterior probability that the sample X[j,:] is from the $i$-th Gaussian component.
                                                   
        
    '''
    def __init__(self, num_classes=2, num_iteration=1000):
        '''
        Initialize the class for using EM algorithm to estimate the parameters in the Mixture-of-Gaussian model. 
        '''
        self.num_classes = num_classes
        self.num_iteration = num_iteration
        self.hat_alpha = []
        self.hat_mu = []
        self.hat_sigma = []
        self.posterior_prob = 0
        
    def fit(self, X):
        
        self.num_samples, self.num_features = X.shape
        ### Initialize parameters
        self.hat_alpha = [1/self.num_classes] * self.num_classes
        self.hat_mu = [np.min(X,axis=0) + (ell+1) / (self.num_classes+1) * (np.max(X, axis=0) - np.min(X, axis=0)) for ell in range(self.num_classes)]
        self.hat_sigma = [np.eye(self.num_features)*np.std(X,axis=0)] * self.num_classes
        ### Iteration begins
        previous_alpha = self.hat_alpha
        previous_mu = self.hat_mu
        for t in range(self.num_iteration):   
            ### E-step: Update posterior probability $\gamma_{ji}$
            self.E_step(X)    
            ### M-step: Update parameters $alpha, mu, sigma$
            self.M_step(X)    
            ### Check whether the parameter estimates have converged
            err_mu = np.mean(np.abs(np.array(previous_mu)-np.array(self.hat_mu)))     
            ### take the absolute value of the difference; the signed differences of the priors always sum to zero
            err_alpha = np.mean(np.abs(np.array(previous_alpha)-np.array(self.hat_alpha)))
            if (err_mu <= 0.001) and (err_alpha < 0.001):     
                print('Converged after {} iterations'.format(t+1))
                break
            else:
                previous_mu = self.hat_mu
                previous_alpha = self.hat_alpha
            ### print the result every 20 iterations
            if (t % 20 == 0):
                print('The number of iterations is:', t+1)
                print("The estimated mean vectors are:",self.hat_mu)
                print("The estimated prior probabilities are:",self.hat_alpha)
        return self.hat_alpha, self.hat_mu, self.hat_sigma
        
        
    def E_step(self, X):
        '''
        Calculate the posterior probability $\gamma_{ji}$ of each class $i$ for every sample $j$.
        '''
        self.posterior_prob = np.zeros((self.num_samples, self.num_classes))
        for j in range(self.num_samples):
            for i in range(self.num_classes):
                diff = (X[j,:]-self.hat_mu[i]).reshape(-1,1)
                ### Gaussian density up to the constant $(2\pi)^{-d/2}$, which cancels in the normalization below
                density = np.exp(-(diff.T@np.linalg.inv(self.hat_sigma[i])@diff)[0,0]/2)/np.sqrt(np.linalg.det(self.hat_sigma[i]))
                self.posterior_prob[j,i] = self.hat_alpha[i]*density
            ### normalize so that the posterior probabilities of sample $j$ sum to 1
            self.posterior_prob[j,:] /= np.sum(self.posterior_prob[j,:])

    
    def M_step(self, X):
        '''
        Update the parameters $\alpha_i$, $\mu_i$ and $\Sigma_i$
        '''
        num_features = np.shape(X)[1]
        self.hat_mu, self.hat_alpha, self.hat_sigma = [], [], []
        for i in range(self.num_classes):
            denom=0   
            numer=0   
            for j in range(self.num_samples):
                numer += self.posterior_prob[j,i]*X[j,:]
                denom += self.posterior_prob[j,i]
            self.hat_mu.append(numer/denom)    
            self.hat_alpha.append(denom/self.num_samples)     
        for i in range(self.num_classes):
            cov_matrix = np.zeros((self.num_features,self.num_features))
            for j in range(self.num_samples):
                cov_matrix += self.posterior_prob[j,i] * np.dot((X[j,:] - self.hat_mu[i]).reshape(-1,1),(X[j,:] - self.hat_mu[i]).reshape(1,-1))
            self.hat_sigma.append(cov_matrix/np.sum(self.posterior_prob[:,i]))


In [8]:
num_samples = 1000         
num_components = 4            
alpha = [0.1,0.2,0.3,0.4]  
mu1 = [5,5]
mu2 = [10,15]
mu3 = [25,20]
mu4 = [45,30]
mu_list = [mu1, mu2, mu3, mu4]
sigma_list = [np.array([[10, 0], [0, 10]])]*4
dataset = generate_data(num_samples, alpha, mu_list, sigma_list) 
num_iteration = 1000
model = EM_for_MG(num_components, num_iteration)
hat_alpha, hat_mu, hat_sigma = model.fit(dataset)
print("The mean vectors converge to:", hat_mu)
print("The prior probabilities converge to:", hat_alpha)

The number of iterations is: 1
The estimated mean vectors are: [array([5.54740373, 6.39989141]), array([14.97425697, 16.1360408 ]), array([26.04041944, 21.09775304]), array([45.14208248, 30.00284033])]
The estimated prior probabilities are: [0.14781162761653793, 0.23755184520641426, 0.2139882530246035, 0.4006482741524441]
The number of iterations is: 21
The estimated mean vectors are: [array([4.86511192, 4.48628506]), array([10.04122609, 14.78472996]), array([24.72561463, 19.92091405]), array([45.08480489, 29.91543767])]
The estimated prior probabilities are: [0.10966891235083012, 0.18915836036094022, 0.29517846619104515, 0.4059942610971841]
Converged after 36 iterations
The mean vectors converge to: [array([4.71381662, 4.2074898 ]), array([ 9.96878205, 14.62347069]), array([24.72846409, 19.9208832 ]), array([45.08480369, 29.91543702])]
The prior probabilities converge to: [0.10374668320673734, 0.19519354885398635, 0.2950654422147265, 0.4059943257245499]
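
The E-step in `EM_for_MG` exponentiates the Gaussian exponent directly, so a sample that lies far from every component can produce 0/0. One common remedy, sketched below under the assumption of non-singular covariance matrices, is to work with log-densities and subtract the row-wise maximum before exponentiating (the log-sum-exp trick). The function names `log_gaussian` and `stable_e_step` and the toy inputs are hypothetical and not part of the class above.

```python
import numpy as np


def log_gaussian(X, mu, sigma):
    """Log-density of N(mu, sigma) evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    solved = np.linalg.solve(sigma, diff.T).T        # Sigma^{-1} (x - mu) for each row
    maha = np.sum(diff * solved, axis=1)             # squared Mahalanobis distances
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def stable_e_step(X, alphas, mus, sigmas):
    """Posterior probabilities gamma[j, i], computed on the log scale."""
    log_w = np.column_stack([np.log(a) + log_gaussian(X, np.asarray(m, dtype=float), s)
                             for a, m, s in zip(alphas, mus, sigmas)])
    log_w -= log_w.max(axis=1, keepdims=True)        # the largest entry in each row becomes 0
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)


X_toy = np.array([[0.0, 0.0], [10.0, 10.0], [1000.0, 1000.0]])
gamma = stable_e_step(X_toy, [0.5, 0.5], [[0, 0], [10, 10]], [np.eye(2), np.eye(2)])
print(gamma)
```

The third toy sample is far from both components, yet its posterior row is still well defined and assigns it to the nearer one.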
