# Library Used in Coding Assignment
Below code is for import all Python Packages
Below Packges have imported & its purpose

(1) numpy - numerical python scripts, to store array, list and other data

(2) scipy.stats - multivariate_normal to calculate the probability density function (pdf)

(2) matplotlib.pyplot - matplotlib's Pyplot packages to plot the data with the predicted mean from the EM's algo


In [86]:
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt 
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Function Name = "log_sub_exp"
Below Function - "generate_center" is to generate the 20 Centers

The function takes the following inpurt and return the 20 2 dimensional array

(1) n = Number of Centers to generate. It will be 20

(2) mu1 = The mean against which 10 centers will be generated. It will be (0,1)

(3) mu2 = The mean against which 10 centers will be generated. It will be (1,0)

(4) sigma = The standard deviation against which the centers will be generated


# Function Name = "loglikelihood"


The function generates the 20 centers, 10 centers from mean = (0,1) and other 10 center would be from the mean = (1,0) with standard devitation of 0.5. Both the center (10 each) needs to be appended and return back to the caller

The numpy function np.random.normal is used to generate the centers.

The function would be called from the caller with seed, so it would be consistently same accross multiple runs

Following formula is used to generate the 20 centers



In [6]:
def log_sum_exp(Z):
    #Compute log(\sum_i exp(Z_i)) for Z (array)
    return np.max(Z) + np.log(np.sum(np.exp(Z - np.max(Z))))

def loglikelihood(data, weights, means, covs):
    #Compute the loglikelihood of the data for a Gaussian mixture model
    num_clusters = len(means)
    num_dim = len(data[0])
    
    ll = 0
    for d in data:
        
        Z = np.zeros(num_clusters)
        for k in range(num_clusters):
            
            # Compute (x-mu)^T * Sigma^{-1} * (x-mu)
            delta = np.array(d) - means[k]
            exponent_term = np.dot(delta.T, np.dot(np.linalg.inv(covs[k]), delta))
            
            # Compute loglikelihood contribution for this data point and this cluster
            Z[k] += np.log(weights[k])
            Z[k] -= 1/2. * (num_dim * np.log(2*np.pi) + np.log(np.linalg.det(covs[k])) + exponent_term)
            
        # Increment loglikelihood contribution of this data point across all clusters
        ll += log_sum_exp(Z)
        
    return ll

# Function Name = "EStep"

Below Function - "generate_data" is to generate data for training & test sets
The function takes the following inpurt and return the training & test data of 2 dimensions

(1) n = Number of Centers to generate. It will be 100 for training and 5000 for tests. As for each class we need to generate the data of this size

(2) center = Pass the entire 20 centers which is generated from the above function

(3) sigma = the standard deviation against which the training & test will be generated


The function takes a random number from the centers (1st 10 for the Class 0 and next 10 for the Class 1) and generates the trainning and test sets. For the training sets, 100 data will be generated with class 0 (from the random centers choose from the 1st 10 centers) and next 100 data will be generated with class 1 (from the random centers choose from the last 10 centers). Both the data vertically stacked and sends to the caller. The same will be done for the test sets, now instead of 100 each, it will be 5000 for the class 0 and next 5000 for the class 1.

Following is the formula to generate the mean however the mean would be used from the centers rather (0,1) or (1,0) 



In [49]:
# E-step
def EStep(data, init_means, init_covariances, init_weights):
    
    # initialize the variable
    means = init_means[:]
    covariances = init_covariances[:]
    weights = init_weights[:]
    
    num_data = len(data)
    num_dim = len(data[0])    
    num_clusters = len(means)

    # Initialize resp
    resp = np.zeros((num_data, num_clusters))
    
    #Loop
    for j in range(num_data):
        for k in range(num_clusters):
            resp[j, k] = weights[k]*multivariate_normal.pdf(data[j],means[k],covariances[k])
        row_sums = resp.sum(axis=1)[:, np.newaxis]
        resp = resp / row_sums # normalize the responsibility
    return resp

# Function Name = "MStep"

Below Function - "generate_data" is to generate data for training & test sets
The function takes the following inpurt and return the training & test data of 2 dimensions

(1) n = Number of Centers to generate. It will be 100 for training and 5000 for tests. As for each class we need to generate the data of this size


In [88]:
# M-step        
def MStep(data, init_means, init_covariances, init_weights, resp):
    
    # Make copies of initial parameters, which we will update during each iteration
    means = init_means[:]
    covariances = init_covariances[:]
    weights = init_weights[:]

    num_data = len(data)
    num_dim = len(data[0])    
    num_clusters = len(means)
    
    # Initialize some useful variables
    ll = loglikelihood(data, weights, means, covariances)
    ll_trace = [ll]
    
    counts = np.sum(resp, axis=0)
    
    for k in range(num_clusters):
        weights[k] = counts[k]/num_data
        weighted_sum = 0
        for j in range(num_data):
            weighted_sum += (resp[j,k]*data[j])
        means[k] = weighted_sum/counts[k]

        weighted_sum = np.zeros((num_dim, num_dim))
        for j in range(num_data):
            weighted_sum += (resp[j,k]*np.outer(data[j]-means[k],data[j]-means[k]))
        covariances[k] = weighted_sum/counts[k]

    # Compute the loglikelihood at this iteration
    ll_latest = loglikelihood(data, weights, means, covariances)
    ll_trace.append(ll_latest)

    ll = ll_latest
    
    out = {'weights': weights, 'means': means, 'covs': covariances, 'loglik': ll_trace, 'resp': resp}
    return out

In [91]:
def myEM(data, init_means, init_covariances, init_weights, maxiter=20):
    
    for i in range(maxiter):
        response = EStep(data, init_means, init_covariances, init_weights)
        out = MStep(data, init_means, init_covariances, init_weights, response)
        print("Iteration : {} - weights:{} means:{} sigma:{}".format(i,out['weights'],out['means'],out['covs']))
        
        #plt.figure(figsize=(12,8))
        #plt.scatter(X[:,0],X[:,1])
        #plt.scatter(means[0][0], means[0][1], color = "red")
        #plt.scatter(means[1][0], means[1][1],color="orange")
    
    return out

# Testing the Function

### Load the Data (Faithful.txt)


In [42]:
#Load the Data
X = np.loadtxt('../data/Faithful.txt')
print ("Data Loaded Successfully...")
print ("First 10 rows from the Faithful Dataset")
print (X[1:11,:])
print ("------------------------------------------")
print ("Size of the Dataset is: {}".format(X.shape))

Data Loaded Successfully...
First 10 rows from the Faithful Dataset
[[ 1.8   54.   ]
 [ 3.333 74.   ]
 [ 2.283 62.   ]
 [ 4.533 85.   ]
 [ 2.883 55.   ]
 [ 4.7   88.   ]
 [ 3.6   85.   ]
 [ 1.95  51.   ]
 [ 4.35  85.   ]
 [ 1.833 54.   ]]
------------------------------------------
Size of the Dataset is: (272, 2)


# Testing the Function myFunc (Two Cluster)

(1) weights / prob = 
          [0.50062804,0.49937196]

(2) means = 
          [3.4459639,69.8433735]
          [3.6217053,72.1578947]
          [3.3893617,70.5531915]

(3) covariances / sigma = 
          [1.2877935,13.842302]
          [13.8423020,183.208932]
          

In [98]:
#Two Cluster Initialization
init_weights = [0.50062804,0.49937196]
init_means = [np.array([3.467750,70.132353]),np.array([3.5078162,71.6617647])]
init_covs = [np.array([[1.2975376,13.9110994],[13.911099,183.559040]])]*2

itmax = 20
out = myEM(data=X, init_means = init_means, init_covariances = init_covs, init_weights = init_weights, maxiter=itmax)
print ("------------------------------------------------------------------------------------------------------------")
print ("weight: {}".format(out['weights']))
print ("means: {}".format(out['means']))
print ("sigma: {}".format(out['covs']))
print ("------------------------------------------------------------------------------------------------------------")

Iteration : 0 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 70.13260977]), array([ 3.5078488 , 71.66350617])] sigma:[array([[  1.32238528,  14.18954649],
       [ 14.18954649, 185.91012994]]), array([[  1.27262438,  13.63188485],
       [ 13.63188485, 181.19953113]])]
Iteration : 1 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 70.13260977]), array([ 3.5078488 , 71.66350617])] sigma:[array([[  1.32238528,  14.18954649],
       [ 14.18954649, 185.91012994]]), array([[  1.27262438,  13.63188485],
       [ 13.63188485, 181.19953113]])]
Iteration : 2 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 70.13260977]), array([ 3.5078488 , 71.66350617])] sigma:[array([[  1.32238528,  14.18954649],
       [ 14.18954649, 185.91012994]]), array([[  1.27262438,  13.63188485],
       [ 13.63188485, 181.19953113]])]
Iteration : 3 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 7

# Testing the Function myFunc (Three Cluster)

##### Below are the Initialization of the Parameters
(1) weights / prob = 
          [0.30514706,0.34926471,0.34558824]

(2) means = 
          [3.4459639,69.8433735]
          [3.6217053,72.1578947]
          [3.3893617,70.5531915]

(3) covariances / sigma = 
          [1.2877935,13.842302]
          [13.8423020,183.208932]

In [93]:
#Three Cluster Initialization
initial_weights = [0.30514706,0.34926471,0.34558824]
initial_means = [np.array([3.4459639,69.8433735]),np.array([3.6217053,72.1578947]),np.array([3.3893617,70.5531915])]
initial_covs = [np.array([[1.2877935,13.842302],[13.8423020,183.208932]])]*3

itmax = 20
out = myEM(data=X, init_means = init_means, init_covariances = init_covs, init_weights = init_weights, maxiter=itmax)
print ("------------------------------------------------------------------------------------------------------------")
print ("weight: {}".format(out['weights']))
print ("means: {}".format(out['means']))
print ("sigma: {}".format(out['covs']))
print ("------------------------------------------------------------------------------------------------------------")

Iteration : 0 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 70.13260977]), array([ 3.5078488 , 71.66350617])] sigma:[array([[  1.32238528,  14.18954649],
       [ 14.18954649, 185.91012994]]), array([[  1.27262438,  13.63188485],
       [ 13.63188485, 181.19953113]])]
Iteration : 1 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 70.13260977]), array([ 3.5078488 , 71.66350617])] sigma:[array([[  1.32238528,  14.18954649],
       [ 14.18954649, 185.91012994]]), array([[  1.27262438,  13.63188485],
       [ 13.63188485, 181.19953113]])]
Iteration : 2 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 70.13260977]), array([ 3.5078488 , 71.66350617])] sigma:[array([[  1.32238528,  14.18954649],
       [ 14.18954649, 185.91012994]]), array([[  1.27262438,  13.63188485],
       [ 13.63188485, 181.19953113]])]
Iteration : 3 - weights:[0.5006526542469939, 0.49934734575300627] means:[array([ 3.46776969, 7