# Gaussian Discriminant Analysis

In this subtask, a Gaussian Discriminant Analysis (GDA) model is trained on 'ds2_train.csv' and then tested on 'ds2_test.csv'.

The required libraries for this task are numpy and pandas.

NumPy is used to create arrays and for operations such as the dot product, the matrix inverse and the logarithm.

Pandas is used to read data from the CSV files into DataFrames, which are then converted into NumPy arrays.

In [24]:
# Importing required libraries

import numpy as np
import pandas as pd

The data is loaded with the pandas.read_csv function, which reads a CSV file and returns its contents as a DataFrame.

In [25]:
# Loading the training and test examples
# df_train is the training DataFrame and df_test is the test DataFrame

df_train = pd.read_csv('ds2_train.csv')
df_test = pd.read_csv('ds2_test.csv')

In [26]:
# Printing the DataFrame df_train to see the type of data
# DataFrame.head(n) returns the first n rows of the DataFrame. By default n=5

df_train.head(10)

        x_1       x_2    y
0  3.759481  7.507940  0.0
1  3.422057  4.991203  0.0
2  2.778818  4.112071  0.0
3  4.018066  5.653732  0.0
4  1.806062  4.685966  0.0
5  2.882302  5.123573  0.0
6  3.189999  5.424746  0.0
7  2.104426  2.480323  0.0
8  1.771032  3.059402  0.0
9  3.397404  5.148616  0.0


In [27]:
# It's important to check for the null values in both the training and the test dataset

print(df_train.isnull().sum())
print(df_test.isnull().sum())

x_1    0
x_2    0
y      0
dtype: int64
x_1    0
x_2    0
y      0
dtype: int64


In [28]:
# Printing the value counts of column y to see the types of labels
# Series.value_counts() returns the number of occurrences of each distinct value in a column

df_train['y'].value_counts()

0.0    400
1.0    400
Name: y, dtype: int64

In [29]:
# Separating the X and y of the training and test data
# The .values attribute converts the data of a DataFrame into a NumPy array

X_train = df_train[['x_1', 'x_2']].values
y_train = df_train['y'].values

X_test = df_test[['x_1', 'x_2']].values
y_test = df_test['y'].values

X_train

array([[3.7594805 , 7.5079397 ],
       [3.42205706, 4.99120267],
       [2.77881751, 4.11207082],
       ...,
       [3.54410545, 2.64987938],
       [2.57546055, 2.51725473],
       [3.5608151 , 3.99184993]])

It is clear that this dataset has two labels, 0 and 1, so Gaussian Discriminant Analysis is a suitable model for it.

### Declaring the model parameters

In the GDA model, the prediction is based on the probability of the input data under each class.

These probabilities are calculated from three parameters: mu, sigma and phi.

mu is the mean of the individual features of a particular class, so it is a matrix of order d x 1, where d is the number of features.

sigma is the covariance matrix between the various features of a class, so it is a matrix of order d x d.

phi is the probability of occurrence of each class in the dataset.
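For reference, the three parameters fit together as in the usual textbook formulation of GDA with a shared covariance matrix. The notation below (class index $k \in \{0, 1\}$, feature count $d$, subscripted $\phi_k$ and $\mu_k$) is an assumption chosen to mirror the variable names in this notebook, not something the notebook itself defines:

```latex
P(y = k) = \phi_k, \qquad \phi_0 + \phi_1 = 1

p(x \mid y = k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}}
    \exp\left( -\tfrac{1}{2} (x - \mu_k)^{T} \Sigma^{-1} (x - \mu_k) \right)
```

Here $\mu_k$ is the d x 1 class mean and $\Sigma$ is the d x d covariance matrix described above.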

In [30]:
# Calculating the number of training examples
# The .shape attribute of a NumPy array gives the numbers of rows and columns as a tuple
# Here y_train.shape[0] is the first entry of that tuple, i.e. the number of rows in y_train

n = y_train.shape[0]

The mean of the data is the average of the data inputs: the sum of all entries divided by the total number of entries.

To calculate the mean of the features of each class, the examples are first divided into groups on the basis of their labels and then the mean of each group is calculated.
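The grouping step can also be sketched with NumPy boolean masks instead of pandas. This is only an illustration: the toy arrays and the helper name `class_means` are assumptions and do not come from the notebook's data.

```python
import numpy as np

# Toy data (hypothetical, not taken from the CSV files)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

def class_means(X, y):
    """Map each distinct label to the mean feature vector of its rows."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

means = class_means(X, y)
print(means[0.0])  # [2. 3.]
print(means[1.0])  # [6. 7.]
```

The mask `y == label` plays the same role as `df_train['y'] == 0` in the pandas version.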

In [31]:
# Calculating the mean vectors
# mu_0 represents the mean of the features belonging to class 0
# First, df_train is filtered according to the class labels
# df_train['y'] == 0 gives a boolean mask that selects the rows of class 0
# Then the feature columns are selected to be averaged
# DataFrame.mean() calculates the mean of each selected column

mu_0 = df_train[df_train['y'] == 0][['x_1', 'x_2']].mean()

# mu_1 represents the mean of the features belonging to class 1
# df_train['y'] == 1 gives a boolean mask that selects the rows of class 1

mu_1 = df_train[df_train['y'] == 1][['x_1', 'x_2']].mean()

Phi is the probability of occurrence of each class. It is calculated as the ratio of the number of examples belonging to the class to the total number of examples.

To calculate this probability, the count of each class should be found first.
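As a small illustration of this ratio, the class probabilities can be computed for any label array with `np.unique`. The toy labels below are an assumption chosen to give unequal classes; they are not the notebook's data.

```python
import numpy as np

# Toy labels (hypothetical): three examples of class 0 and one of class 1
y = np.array([0.0, 0.0, 0.0, 1.0])

# Count how often each distinct label occurs, then divide by the total
labels, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()

print(dict(zip(labels.tolist(), priors.tolist())))  # {0.0: 0.75, 1.0: 0.25}
```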

In [32]:
# Calculation of phi
# Initialising phi as an array of zeros with one entry per class
# There are two classes, so phi has length 2

phi = np.zeros(2)

# df_train['y'] == 0 gives a boolean mask that selects the rows of class 0
# .count() returns the number of non-null entries in the selected rows

n_0 = df_train[df_train['y'] == 0]['y'].count() # n_0=400
n_1 = df_train[df_train['y'] == 1]['y'].count() # n_1=400
phi[0] = n_0 / n                                # n_0 = 400; n = 800; phi[0] = 0.5
phi[1] = n_1 / n                                # n_1 = 400; n = 800; phi[1] = 0.5

# As the dataset is balanced, the probability of occurrence of both classes is the same
# Therefore phi will not have any effect on which class is predicted for the test features.

For the covariance, the covariance matrix of each class is calculated first and then the pooled covariance is calculated from them.
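Written out, the pooled estimate used in the code cell that follows is a weighted average of the two class covariance matrices, weighted by their degrees of freedom. The symbols $n_0$, $n_1$, $\Sigma_0$ and $\Sigma_1$ correspond to the notebook's `n_0`, `n_1`, `sigma_0` and `sigma_1`:

```latex
\Sigma_{\text{pooled}} = \frac{(n_0 - 1)\,\Sigma_0 + (n_1 - 1)\,\Sigma_1}{n_0 + n_1 - 2}
```

When both classes have the same number of examples, as they do here, this reduces to the plain average of $\Sigma_0$ and $\Sigma_1$.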

In [33]:
# Calculating the covariance matrix sigma of each class and then the pooled sigma
# df_train['y'] == 0 gives a boolean mask that selects the rows of class 0
# DataFrame.cov() calculates the covariance matrix of the selected columns

sigma_0 = df_train[df_train['y'] == 0][['x_1', 'x_2']].cov().values
sigma_1 = df_train[df_train['y'] == 1][['x_1', 'x_2']].cov().values

# Now the pooled sigma is calculated using the below formula

pooled_sigma = (sigma_0 * (n_0 - 1) + sigma_1 * (n_1 - 1)) / (n_0 + n_1 - 2)

In [34]:
# Printing sigma_0, sigma_1 and pooled sigma

print("Sigma_0\n", sigma_0)
print("Sigma_1\n", sigma_1)
print('Pooled_sigma\n', pooled_sigma)

Sigma_0
 [[0.93996737 0.65326934]
 [0.65326934 0.9372294 ]]
Sigma_1
 [[1.01781642 0.74399449]
 [0.74399449 0.95598585]]
Pooled_sigma
 [[0.9788919  0.69863192]
 [0.69863192 0.94660762]]


Notice that in each of the above sigmas the two off-diagonal elements are equal.

This is because the entries are calculated as follows:

The first element (row 1, column 1) of a sigma matrix is the covariance of the first feature with itself, i.e. its variance.

The second element (row 1, column 2) is the covariance between the first and the second feature.

The third element (row 2, column 1) is the covariance between the second and the first feature.

The fourth element (row 2, column 2) is the covariance of the second feature with itself, i.e. its variance.

Since covariance does not depend on the order of the two features, the second and third elements are always the same.
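This symmetry can be checked numerically. The sketch below uses randomly generated toy data (an assumption, not the notebook's dataset) and `np.cov` with `rowvar=False` so that columns are treated as features:

```python
import numpy as np

# Toy data (hypothetical): 50 examples with 2 features each
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 2))

# rowvar=False tells np.cov that each column is a feature
cov = np.cov(data, rowvar=False)

print(cov.shape)                # (2, 2)
print(np.allclose(cov, cov.T))  # True
```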

### Defining the Gaussian Analysis

This function calculates a log-likelihood score of an input feature vector for both classes.

Based on these scores, the input is predicted to be part of the class with the higher value.
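The score compared in the function below can be derived from the Gaussian density with a shared covariance matrix. Taking the logarithm of $p(x \mid y = k)\,\phi_k$ gives the following, where $\delta_k$ is a symbol introduced here for illustration and $\Sigma$ stands for the pooled covariance:

```latex
\log\big( p(x \mid y = k)\,\phi_k \big)
    = \underbrace{-\tfrac{1}{2} (x - \mu_k)^{T} \Sigma^{-1} (x - \mu_k) + \log \phi_k}_{\delta_k(x)}
      \; - \; \tfrac{d}{2} \log(2\pi) \; - \; \tfrac{1}{2} \log |\Sigma|

\hat{y} = \arg\max_{k \in \{0, 1\}} \delta_k(x)
```

The last two terms are identical for both classes, so they can be dropped without changing which class has the larger value.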

In [35]:
# Defining the gaussian_analysis function
# It takes a feature vector x as input and calculates a score for its probability of belonging to each class
# It returns 1 if the score for class 1 is higher than the score for class 0
# Otherwise it returns 0

def gaussian_analysis(x):

    # np.dot is used to calculate the product of matrices
    # np.linalg.inv() is used to calculate the inverse of the sigma matrix
    # Here pooled sigma is used because it is assumed that both Gaussian distributions share the same covariance matrix

    # Calculating the log-likelihood score for class 0 (terms common to both classes are dropped)
    # mu_0 represents the mean of class 0 of features x_1 and x_2
    # phi[0] represents the probability of occurrence of class 0 in the whole dataset

    prob_0 = -0.5 * np.dot(np.dot((x - mu_0).T, np.linalg.inv(pooled_sigma)), x - mu_0) + np.log(phi[0])

    # Calculating the log-likelihood score for class 1 (terms common to both classes are dropped)
    # mu_1 represents the mean of class 1 of features x_1 and x_2
    # phi[1] represents the probability of occurrence of class 1 in the whole dataset

    prob_1 = -0.5 * np.dot(np.dot((x - mu_1).T, np.linalg.inv(pooled_sigma)), x - mu_1) + np.log(phi[1])

    # The scores of both classes are compared and the label of the class with the higher score is returned

    if prob_0 < prob_1:
        return 1
    else:
        return 0
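The same decision rule could also be applied to all rows at once. The following is a rough sketch under stated assumptions, not the notebook's implementation: the function name `gda_predict`, its parameters and the toy inputs are hypothetical, and the inverse is computed once instead of once per class.

```python
import numpy as np

def gda_predict(X, mu_0, mu_1, sigma, phi_0, phi_1):
    """Classify each row of X with shared-covariance GDA scores (hypothetical helper)."""
    sigma_inv = np.linalg.inv(sigma)  # inverted once and reused for both classes

    def score(mu, phi):
        diff = X - mu  # shape (n, d)
        # Row-wise quadratic form: diff_i^T @ sigma_inv @ diff_i for every row i
        quad = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
        return -0.5 * quad + np.log(phi)

    # Predict 1 only where class 1 scores strictly higher, as in gaussian_analysis
    return (score(mu_1, phi_1) > score(mu_0, phi_0)).astype(int)

# Toy usage (hypothetical values): one point at each class mean
X_demo = np.array([[0.0, 0.0], [4.0, 4.0]])
pred = gda_predict(X_demo, np.array([0.0, 0.0]), np.array([4.0, 4.0]),
                   np.eye(2), 0.5, 0.5)
print(pred)  # [0 1]
```

Ties are resolved in favour of class 0, which matches the `if prob_0 < prob_1` comparison above.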


### Predicting output

After defining the function, it's time to predict the output.

In [40]:
# Initialising the variables that will keep the count of correctly predicted examples
# len() gives the number of entries in the array

correct_train = 0
for i in range(len(y_train)):
    if gaussian_analysis(X_train[i]) == y_train[i]:
        correct_train = correct_train + 1
print('Accuracy of training data: ',correct_train / len(y_train))

correct_test = 0

for i in range(len(y_test)):
    if gaussian_analysis(X_test[i]) == y_test[i]:
        correct_test = correct_test + 1
print('Accuracy of test data: ',correct_test / len(y_test))

Accuracy of training data:  0.91375
Accuracy of test data:  0.91


The accuracy of the above model on the training dataset is 0.91375.

The accuracy of the above model on the test dataset is 0.91.
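Accuracy here is the fraction of examples whose predicted label equals the true label. As a brief sketch with made-up label arrays (assumptions for illustration only), the counting loop can be expressed as the mean of a boolean comparison:

```python
import numpy as np

# Toy labels (hypothetical): 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Comparing the arrays gives booleans; their mean is the share of matches
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.8
```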