# Gaussian Discriminant Analysis Hyperparameter Tuning

In this subtask, the hyperparameters of a Gaussian Discriminant Analysis (GDA) model are tuned on 'ds2_train.csv', and the tuned model is then tested on 'ds2_test.csv'.

The required libraries for this task are numpy and pandas.

NumPy provides the arrays and the numerical functions used here, such as the logarithm, the dot product and the matrix inverse.

Pandas is used to read the data from the CSV files into DataFrames, which are then converted into NumPy arrays.

In [2]:
# Importing required libraries

import numpy as np
import pandas as pd

The data is loaded with the pandas.read_csv function, which reads a CSV file and returns its contents as a DataFrame.

In [3]:
# Loading the training and test examples
# df_train is the training DataFrame and df_test is the test DataFrame

df_train = pd.read_csv('ds2_train.csv')
df_test = pd.read_csv('ds2_test.csv')

In [4]:
# Printing the DataFrame df_train to inspect the data
# DataFrame.head(n) returns the first n rows of the DataFrame. By default n=5

df_train.head(10)

Unnamed: 0,x_1,x_2,y
0,3.759481,7.50794,0.0
1,3.422057,4.991203,0.0
2,2.778818,4.112071,0.0
3,4.018066,5.653732,0.0
4,1.806062,4.685966,0.0
5,2.882302,5.123573,0.0
6,3.189999,5.424746,0.0
7,2.104426,2.480323,0.0
8,1.771032,3.059402,0.0
9,3.397404,5.148616,0.0


In [5]:
# It's important to check for the null values in both the training and the test dataset

print(df_train.isnull().sum())
print(df_test.isnull().sum())

x_1    0
x_2    0
y      0
dtype: int64
x_1    0
x_2    0
y      0
dtype: int64


In [6]:
# Printing the entries of column y to see the types of labels
# Series.value_counts() returns how many times each distinct value occurs in a column

df_train['y'].value_counts()

0.0    400
1.0    400
Name: y, dtype: int64

In [7]:
# Separating the features X and the labels y of the training and test data
# DataFrame.values converts the data of a DataFrame into a NumPy array

X_train = df_train[['x_1', 'x_2']].values
y_train = df_train['y'].values

X_test = df_test[['x_1', 'x_2']].values
y_test = df_test['y'].values

X_train

array([[3.7594805 , 7.5079397 ],
       [3.42205706, 4.99120267],
       [2.77881751, 4.11207082],
       ...,
       [3.54410545, 2.64987938],
       [2.57546055, 2.51725473],
       [3.5608151 , 3.99184993]])

The output above shows that the dataset has two labels, 0 and 1, so this is a binary classification problem to which Gaussian Discriminant Analysis can be applied.

### Declaring the model parameters

In the GDA model, a prediction is made by comparing the probabilities of the input data under each class.

These probabilities are calculated from three parameters: mu, sigma and phi.

mu is the mean of each feature within a particular class, so it is a vector of order d x 1, where d is the number of features.

sigma is the covariance matrix of the features within a class, so it is a matrix of order d x d.

phi, the class prior, is the probability of occurrence of each class in the dataset.
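As a compact summary of the three parameters, the model can be written as below. This is the standard textbook form of GDA; the class subscript k and the explicit constant term are notation assumed here, not taken from the notebook.

$$
y \sim \mathrm{Bernoulli}(\phi_1), \qquad x \mid y = k \;\sim\; \mathcal{N}(\mu_k, \Sigma_k), \quad k \in \{0, 1\}, \quad \phi_0 = 1 - \phi_1
$$

$$
\log\big(\phi_k \, p(x \mid y = k)\big) = -\tfrac{1}{2}(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + \log \phi_k - \tfrac{d}{2}\log(2\pi)
$$

The last term is identical for both classes and can be dropped when comparing them. If both classes share one covariance matrix, the log-determinant term is also identical and cancels; with separate matrices it generally does not.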

In [8]:
# Calculating the number of examples
# numpy.shape gives the number of rows and columns as a tuple
# Here y_train.shape[0] is the first entry of that tuple, i.e. the number of rows in y_train

n = y_train.shape[0]

n_0 = df_train[df_train['y'] == 0]['y'].count() # n_0=400
n_1 = df_train[df_train['y'] == 1]['y'].count() # n_1=400

The mean of the data is the average of the data entries: the sum of the entries divided by the number of entries.

To calculate the feature means of each class, the examples are first split into groups on the basis of their labels, and the mean of each feature is then calculated within each group.

In [9]:
# Calculating the mean vectors
# mu_0 is the mean of the features belonging to class 0
# First, df_train is filtered by class label:
# df_train['y'] == 0 gives a boolean mask that is True for the rows of class 0
# Then the feature columns are selected
# DataFrame.mean() calculates the mean of each selected column

mu_0 = df_train[df_train['y'] == 0][['x_1', 'x_2']].mean()

# mu_1 is the mean of the features belonging to class 1
# df_train['y'] == 1 gives a boolean mask that is True for the rows of class 1

mu_1 = df_train[df_train['y'] == 1][['x_1', 'x_2']].mean()
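The same per-class means, together with the empirical class priors, can also be obtained in one step with a group-by. The sketch below uses a small made-up DataFrame (the values and the name `toy` are hypothetical, not taken from the dataset) so that the result is easy to check by hand.

```python
import pandas as pd

# Toy data with hypothetical values, not taken from the dataset above
toy = pd.DataFrame({
    "x_1": [1.0, 2.0, 3.0, 5.0, 7.0],
    "x_2": [2.0, 4.0, 6.0, 1.0, 3.0],
    "y":   [0.0, 0.0, 0.0, 1.0, 1.0],
})

# One row of feature means per class label
class_means = toy.groupby("y")[["x_1", "x_2"]].mean()

# Empirical class priors: the fraction of rows carrying each label
class_priors = toy["y"].value_counts(normalize=True).sort_index()

print(class_means)
print(class_priors)
```

Here class 0 has means (2.0, 4.0) and class 1 has means (6.0, 2.0), with priors 0.6 and 0.4.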

For the covariance, the covariance matrix of each class is calculated first, and the pooled covariance is then calculated from the two.

In [10]:
# Calculating the covariance matrix of each class and then the pooled sigma
# df_train['y'] == 0 gives a boolean mask that selects the rows of class 0
# DataFrame.cov() calculates the sample covariance matrix of the selected columns

sigma_0 = df_train[df_train['y'] == 0][['x_1', 'x_2']].cov().values
sigma_1 = df_train[df_train['y'] == 1][['x_1', 'x_2']].cov().values

# The pooled sigma is the average of the class covariances, weighted by their degrees of freedom

pooled_sigma = (sigma_0 * (n_0 - 1) + sigma_1 * (n_1 - 1)) / (n_0 + n_1 - 2)

In [11]:
# Printing sigma_0, sigma_1 and pooled sigma

print("Sigma_0\n", sigma_0)
print("Sigma_1\n", sigma_1)
print('Pooled_sigma\n', pooled_sigma)

Sigma_0
 [[0.93996737 0.65326934]
 [0.65326934 0.9372294 ]]
Sigma_1
 [[1.01781642 0.74399449]
 [0.74399449 0.95598585]]
Pooled_sigma
 [[0.9788919  0.69863192]
 [0.69863192 0.94660762]]


Notice that in each of the above sigmas the two off-diagonal elements are equal.

This is because covariance is symmetric: the covariance between x_1 and x_2 equals the covariance between x_2 and x_1. The entries of each 2 x 2 sigma matrix are as follows:

The first element (row 1, column 1) is the covariance of x_1 with itself, i.e. the variance of x_1.

The second element (row 1, column 2) is the covariance between x_1 and x_2.

The third element (row 2, column 1) is the covariance between x_2 and x_1.

The fourth element (row 2, column 2) is the covariance of x_2 with itself, i.e. the variance of x_2.
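The symmetry and the pooling formula can be checked numerically. The sketch below uses randomly generated data (the arrays `X0` and `X1` are hypothetical stand-ins for the two classes) and verifies that the pooled matrix equals the covariance of the class-centered data with n_0 + n_1 - 2 degrees of freedom.

```python
import numpy as np

# Synthetic stand-ins for the two classes (hypothetical data)
rng = np.random.default_rng(0)
X0 = rng.normal(size=(30, 2))
X1 = rng.normal(size=(50, 2)) + 2.0

# rowvar=False treats columns as features; the default normalisation (n - 1)
# matches what pandas DataFrame.cov() uses
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

n0, n1 = len(X0), len(X1)
pooled = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)

# Covariance matrices are symmetric, so each equals its transpose
print(np.allclose(S0, S0.T), np.allclose(pooled, pooled.T))

# Cross-check: centre each class on its own mean, stack, and divide the
# scatter matrix by the pooled degrees of freedom
centered = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
pooled_check = centered.T @ centered / (n0 + n1 - 2)
print(np.allclose(pooled, pooled_check))
```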

### Defining the Gaussian Analysis

This function calculates a log-likelihood score of an input feature vector for both classes.

Based on these scores, the input is predicted to belong to the class with the larger score.

In [12]:
# Defining the gaussian_analysis function
# Given an input x, it calculates a log-probability score for each of the two classes
# It returns 1 if the score of class 1 is greater than that of class 0
# It returns 0 otherwise

def gaussian_analysis(x, prior_0, prior_1, sigma):
    
    # sigma is a pair: sigma[0] is the covariance matrix for class 0 and sigma[1] for class 1
    # Passing the pooled sigma twice assumes that both Gaussian distributions share one covariance matrix
    # np.dot is used to calculate the product of matrices
    # np.linalg.inv() is used to calculate the inverse of a covariance matrix
    # mu_0 and mu_1 are read from the notebook's global scope
    # Note: the -0.5 * log|sigma| term of the Gaussian log-density is not included here
    # It cancels when both classes share the same sigma, but not when the individual matrices are used
    
    # Calculating the log-likelihood score for class 0
    # mu_0 is the mean of features x_1 and x_2 for class 0
    # prior_0 is the probability of occurrence of class 0 in the whole dataset
    
    prob_0 = -0.5 * np.dot(np.dot((x - mu_0).T, np.linalg.inv(sigma[0])), x - mu_0) + np.log(prior_0)
    
    # Calculating the log-likelihood score for class 1
    # mu_1 is the mean of features x_1 and x_2 for class 1
    # prior_1 is the probability of occurrence of class 1 in the whole dataset
    
    prob_1 = -0.5 * np.dot(np.dot((x - mu_1).T, np.linalg.inv(sigma[1])), x - mu_1) + np.log(prior_1)
    
    # The scores of the two classes are compared and the label of the class with the larger score is returned
    
    if prob_0 < prob_1:
        return 1
    else:
        return 0
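For comparison, here is a minimal self-contained sketch of a two-class Gaussian classifier that takes the means as arguments and keeps the log-determinant term of the Gaussian log-density. It is not the function used in this notebook: the names `log_discriminant` and `predict` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def log_discriminant(x, mu, sigma, prior):
    """Log of prior * Gaussian density at x, without the constant -(d/2)*log(2*pi)."""
    diff = x - mu
    # Solving sigma @ z = diff avoids forming the explicit inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    # slogdet returns (sign, log|det|); the sign is +1 for a positive-definite sigma
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * maha - 0.5 * logdet + np.log(prior)

def predict(x, mus, sigmas, priors):
    """Return the index of the class with the largest score (ties go to class 0)."""
    scores = [log_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Hypothetical parameters, not estimated from the dataset
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]

print(predict(np.array([0.2, -0.1]), mus, sigmas, priors))  # point near the class-0 mean
print(predict(np.array([2.9, 3.2]), mus, sigmas, priors))   # point near the class-1 mean
```

Passing the means explicitly keeps the function independent of notebook globals, which makes it easier to reuse inside a search loop.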


### GDA has a hyperparameter: the class prior

The class prior is the probability of occurrence of a class in the whole dataset.

The value estimated from the dataset may not be the one that gives the maximum accuracy score, so it is treated here as a hyperparameter to be tuned.

### GridSearch

GridSearch is a technique for hyperparameter tuning and model selection in machine learning. It is a systematic way of searching through a predefined set of hyperparameter values to find the combination that yields the best model performance. GridSearch is commonly used to optimize the hyperparameters of a machine learning model and improve its generalization on unseen data.
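The idea can be sketched generically as a loop over the Cartesian product of candidate values. In the sketch below, `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a model's accuracy, so the numbers it produces say nothing about GDA itself.

```python
import itertools
import numpy as np

def grid_search(score_fn, grid):
    """Return (best_score, best_params) over all combinations of the grid values."""
    names = list(grid)
    best_score, best_params = -np.inf, None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        # Strict comparison: on ties the earliest combination is kept
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Made-up scoring function standing in for model accuracy (an assumption)
def toy_score(prior_0, cov_type):
    bonus = 0.05 if cov_type == "separate" else 0.0
    return 1.0 - (prior_0 - 0.5) ** 2 + bonus

grid = {
    "prior_0": np.linspace(0.1, 0.9, 9),
    "cov_type": ["pooled", "separate"],
}
best_score, best_params = grid_search(toy_score, grid)
print(best_score, best_params)
```

The cost grows with the product of the grid sizes, which is why a fine grid over several hyperparameters quickly becomes expensive.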

In [14]:
# Defining the parameters for hyperparameter tuning
# np.linspace creates an array of evenly spaced values over the given range
# Defining the candidate priors of class 0 with np.linspace

prior_0 = np.linspace(0.01, 0.99, 200)

# Defining prior_1 as 1 - prior_0, since the label follows a Bernoulli distribution and the two priors sum to 1

prior_1 = 1 - prior_0

# The covariance matrix may also be tuned
# Either the pooled covariance matrix is used for both classes or the individual covariance matrices are used

sigma = [[pooled_sigma, pooled_sigma], [sigma_0, sigma_1]]

In [16]:
# Defining the GridSearch function
# It takes the parameters prior, sigma, X_train and y_train
# prior is the list of candidate class prior values for class 0
# sigma is the list of candidate covariance matrix pairs
# X_train contains the features of the training set
# y_train contains the training labels that the predictions are compared with

def GridSearch(prior, sigma, X_train, y_train):
    
    # Defining the shapes to have the separate row and column values
    # The row values will be used to get the number of times the for loop is going to run
    # Defining for the training set
    
    m_train, n_train = X_train.shape
    
    # Initialising each calculating term to be zero
    
    max_accur = 0
    pri_0 = 0
    pri_1 = 0
    sigm = [0,0]
    
    # Implementing GridSearch
    # Defining the for loop to run 
    # First loop is the loop running through prior parameter list
    # Second loop is the loop running through covariance matrix list
    
    for pr in prior:
        for sig in sigma:
            
            # Declaring training set accuracy score
            
            train_score = 0
            
            # Defining the class prior for class 0
            
            prior_0 = pr
            
            # Defining the class prior for class 1
            
            prior_1 = 1 - pr
            
            # Defining the for loop which calculates the accuracy score of the training set
            
            for i in range(m_train):
                
                # It checks if the predicted value is equal to label
                
                if gaussian_analysis(X_train[i], prior_0, prior_1, sig) == y_train[i]:
                    train_score = train_score + 1
            
            # Calculates the accuracy score
            
            train_accur = train_score/m_train
            print(train_accur,'  ', prior_0)
            
            # Updates the maximum accuracy
            # and stores the corresponding parameters
            
            if train_accur > max_accur:
                max_accur = train_accur
                pri_0 = pr
                sigm = sig
    
    # Returning the parameters that will be used in calculating the test accuracy
    
    return max_accur, pri_0, sigm

In [17]:
# Calling the GridSearch function
# max_accur stores the maximum accuracy of the train set
# prior_0 stores the class prior of class 0 corresponding to maximum accuracy
# sigma stores the covariance matrix corresponding to the maximum accuracy

max_accur, prior_0, sigma = GridSearch(prior_0, sigma, X_train, y_train)

0.69875    0.01
0.7225    0.01
0.725    0.014924623115577889
0.7575    0.014924623115577889
0.75625    0.01984924623115578
0.77375    0.01984924623115578
0.77    0.024773869346733667
0.78625    0.024773869346733667
0.77625    0.029698492462311557
0.8    0.029698492462311557
0.78875    0.03462311557788945
0.80625    0.03462311557788945
0.80125    0.03954773869346734
0.8125    0.03954773869346734
0.8125    0.04447236180904523
0.82    0.04447236180904523
0.81125    0.04939698492462312
0.82875    0.04939698492462312
0.81875    0.05432160804020101
0.8375    0.05432160804020101
0.82875    0.0592462311557789
0.84375    0.0592462311557789
0.83375    0.06417085427135678
0.845    0.06417085427135678
0.84    0.06909547738693467
0.8475    0.06909547738693467
0.845    0.07402010050251256
0.8525    0.07402010050251256
0.845    0.07894472361809045
0.85375    0.07894472361809045
0.84625    0.08386934673366835
0.85625    0.08386934673366835
0.85125    0.08879396984924623
0.855    0.08879396984924623
0.

0.9075    0.6945226130653267
0.9075    0.6994472361809045
0.9075    0.6994472361809045
0.9075    0.7043718592964825
0.9075    0.7043718592964825
0.905    0.7092964824120603
0.90875    0.7092964824120603
0.90875    0.7142211055276382
0.9075    0.7142211055276382
0.90875    0.7191457286432161
0.90625    0.7191457286432161
0.9075    0.724070351758794
0.90375    0.724070351758794
0.9075    0.7289949748743719
0.905    0.7289949748743719
0.9075    0.7339195979899498
0.905    0.7339195979899498
0.9075    0.7388442211055277
0.9075    0.7388442211055277
0.90875    0.7437688442211056
0.90625    0.7437688442211056
0.9075    0.7486934673366834
0.90375    0.7486934673366834
0.905    0.7536180904522614
0.90375    0.7536180904522614
0.90625    0.7585427135678392
0.90375    0.7585427135678392
0.905    0.7634673366834172
0.9025    0.7634673366834172
0.905    0.768391959798995
0.9025    0.768391959798995
0.905    0.773316582914573
0.90125    0.773316582914573
0.905    0.7782412060301508
0.89625    0.778

In [20]:
# Printing the best parameters

print('Best Parameters')
print('prior_0 ', prior_0)
print('sigma ', sigma)
print('maximum accuracy of the training set is: ', max_accur)

Best Parameters
prior_0  0.4926130653266332
sigma  [array([[0.93996737, 0.65326934],
       [0.65326934, 0.9372294 ]]), array([[1.01781642, 0.74399449],
       [0.74399449, 0.95598585]])]
maximum accuracy of the training set is:  0.9175


In [25]:
# Getting the number of rows and columns of the test set
# The number of rows is the number of times the for loop is going to run
# np.shape gives the number of rows and columns

m_test, n_test = X_test.shape

# Defining the variable that counts the correct test predictions
# The for loop below then calculates the accuracy score of the test set

test_score = 0
for i in range(m_test):
    
    # Calculates the accuracy score of the test dataset
    
    if gaussian_analysis(X_test[i], prior_0, 1-prior_0, sigma) == y_test[i]:
        test_score = test_score + 1

test_accur = test_score/m_test
print('test_accur: ', test_accur)


test_accur:  0.91
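Once the predictions are collected in an array, the accuracy can also be computed without an explicit counter. The sketch below uses small made-up label arrays (`y_true` and `y_pred` are hypothetical) to show the idea.

```python
import numpy as np

# Hypothetical true labels and predictions
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# The comparison gives a boolean array; its mean is the fraction of matches
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 4 of 5 predictions match, so 0.8
```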


# Thank You