# General Instructions to students:

1. There are 4 types of cells in this notebook. The cell type will be indicated within the cell.
    1. Markdown cells containing the problem statement. (DO NOT TOUCH THESE CELLS) (**Cell type: TextRead**)
    2. Python cells with setup code for later evaluation. (DO NOT TOUCH THESE CELLS) (**Cell type: CodeRead**)
    3. Python code cells that contain template code or are empty. (FILL CODE IN THESE CELLS BASED ON INSTRUCTIONS IN CURRENT AND PREVIOUS CELLS) (**Cell type: CodeWrite**)
    4. Markdown cells where written reasoning or a conclusion is expected. (WRITE SENTENCES IN THESE CELLS) (**Cell type: TextWrite**)
    
2. You are not allowed to insert new cells in the submitted notebook.

3. You are not allowed to import any extra packages, unless needed.

4. The code is to be written in Python 3.x syntax. The latest versions of other packages may be assumed.

5. In CodeWrite cells, the only outputs should be the plots asked for in the question. Nothing else should be output or printed.

6. If TextWrite cells ask you to give accuracy/error/other numbers, you can print them in the code cells, but remove the print statements before submitting.

7. Any runtime failures on the submitted notebook will get zero marks.

8. All code must be written by you. Copying from other students/material on the web is strictly prohibited. Any violations will result in zero marks.

9. All plots must be labelled properly, the labels/legends must be readable, and all tables must have properly named rows and columns.

10. Rename the file to your roll number, for example cs15d203.ipynb (for the notebook) and cs15d203.py (for the plain Python script).



In [None]:
# Cell type : CodeRead

import numpy as np
import matplotlib.pyplot as plt


**Cell type : TextRead**

Build Bayesian classifiers that model each class with a multivariate Gaussian density function for the datasets assigned to you, under the assumptions below, using maximum likelihood estimation (MLE) for the class priors and class-conditional densities. This assignment focuses on handling and analyzing data with interpretable classification models, rather than aiming solely for the best classification accuracy.
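
For reference, the standard textbook form of the quantities involved is sketched below. The symbols are introduced here for illustration and are not part of the assignment text: $\mathbf{x}$ is a $d$-dimensional feature vector, $\omega_i$ is class $i$ with $n_i$ of the $N$ training points, and $\boldsymbol{\mu}_i$ and $S_i$ are its mean and covariance matrix.

$$
p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|S_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{\top} S_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right)
$$

$$
\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i}\sum_{k=1}^{n_i} \mathbf{x}_k, \qquad
\hat{S}_i = \frac{1}{n_i}\sum_{k=1}^{n_i} (\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k-\hat{\boldsymbol{\mu}}_i)^{\top}, \qquad
\hat{P}(\omega_i) = \frac{n_i}{N}
$$

$$
g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)
$$

A point $\mathbf{x}$ is assigned to the class with the largest discriminant $g_i(\mathbf{x})$. Note that the ML covariance estimate divides by $n_i$, not $n_i - 1$.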

Build Bayesian models for the following cases (you may refer to Chapter 2 of the book “Pattern Classification” by Richard O. Duda, Peter E. Hart, and David G. Stork):

**Case 1:** Bayes classifier with the same Covariance matrix for all classes.

**Case 2:** Bayes classifier with different Covariance matrix across classes.

**Case 3:** Naive Bayes classifier with covariance matrix $S = \sigma^2 I$, the same for all classes.

**Case 4:** Naive Bayes classifier with $S$ of the above form, but different across classes.
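
The four covariance structures can be sketched as follows. This is only one plausible reading of the cases, not a reference solution: the function name `case_covariances` and the input layout (a list with one `(n_k, d)` array per class) are assumptions made for illustration, the shared matrix in Case 1 is taken to be the size-weighted pooled ML covariance, and $\sigma^2$ is taken to be the average per-dimension variance.

```python
import numpy as np

def case_covariances(class_data):
    """Return {case number: list of per-class covariance matrices}.

    class_data: list of (n_k, d) arrays, one per class (hypothetical layout).
    """
    n_classes = len(class_data)
    n_total = sum(len(x) for x in class_data)
    d = class_data[0].shape[1]

    # ML (biased, divide-by-n) covariance of each class: used directly in Case 2.
    full = [np.cov(x, rowvar=False, bias=True) for x in class_data]

    # Case 1: one pooled covariance, each class weighted by its sample count.
    pooled = sum(len(x) * c for x, c in zip(class_data, full)) / n_total

    # Case 3: sigma^2 * I with a single sigma^2 shared by all classes
    # (assumed here to be the mean diagonal entry of the pooled covariance).
    sigma2_shared = np.trace(pooled) / d

    # Case 4: sigma_k^2 * I with a separate sigma_k^2 per class.
    sigma2_per_class = [np.trace(c) / d for c in full]

    return {
        1: [pooled.copy() for _ in range(n_classes)],
        2: full,
        3: [sigma2_shared * np.eye(d) for _ in range(n_classes)],
        4: [s2 * np.eye(d) for s2 in sigma2_per_class],
    }
```

Cases 1 and 3 give every class the same matrix, so only the class means and priors distinguish the classes. Cases 2 and 4 estimate one matrix per class.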

Refer to the provided dataset for each group, which can be found [here](https://drive.google.com/drive/folders/1NmqA9lkxXayVaCzEfRgSxSxCYSa0LEZu?usp=sharing). Each dataset includes 2D feature vectors and their corresponding class labels. There are two different datasets available:
1) Linearly separable data.
2) Non-linearly separable data.

Each dataset contains 41 folders, but you need to look at only one: **the folder number assigned to you**, which is *RollNo\%41 + 1*.
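
As a small worked example of the formula above: the helper name `assigned_folder` is hypothetical, and the assignment does not say how an alphanumeric roll number maps to an integer, so the value 203 below is purely illustrative.

```python
def assigned_folder(roll_number: int, n_folders: int = 41) -> int:
    """Folder number from the rule RollNo % 41 + 1 (always in 1..n_folders)."""
    return roll_number % n_folders + 1

# 203 % 41 = 39, so the assigned folder would be 40.
example_folder = assigned_folder(203)
```

The `+ 1` shifts the remainder range 0..40 to the folder range 1..41.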

Sample plots: [link](https://drive.google.com/drive/folders/1jhauePXVWVnmUEkmZeutuhlzosTRz1sU)




In [None]:
# Cell type : CodeWrite

def estimateMean(data):
    """ Find the ML estimate of the mean of n-dimensional data points belonging to a class.

    Arguments:
    data: 2d array containing features

    Returns:
    meanData: mean of the n-dimensional data points

    """
    return np.mean(data, axis=0)

def estimateCovariance(data):
    """ Find the ML estimate of the covariance matrix of n-dimensional data points.

    Arguments:
    data: 2d array containing features

    Returns:
    covData: covariance of the n-dimensional data points

    """
    # bias=True divides by n (the ML estimate); the np.cov default divides by n - 1.
    return np.cov(data, rowvar=False, bias=True)

def computeLikelihood(dataPoint, meanData, covData):
    """ Computes the likelihood score of a data point with respect to a given class
    given the class' mean and covariance matrix

    Arguments:
    dataPoint: an n-dimensional feature vector
    meanData: mean of the class
    covData: covariance matrix of the class

    Returns:
    likelihood: likelihood score of the data point wrt the given class

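    """
    meanData = np.asarray(meanData, dtype=float)
    diff = np.asarray(dataPoint, dtype=float) - meanData
    d = meanData.shape[0]
    # Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu)
    mahalanobis = diff @ np.linalg.solve(covData, diff)
    # Normalising constant of the multivariate Gaussian density
    normaliser = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(covData))
    return np.exp(-0.5 * mahalanobis) / normaliser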




In [None]:
# Cell type : CodeWrite
# write your code here as instructed.
# (Use the functions written previously)


# Read the train data



# Compute the mean and the covariance matrices as per the 4 cases mentioned above




In [None]:
# Cell type : CodeWrite
# write your code here as instructed.
# (Use the functions written previously)

# Read the test data (dev.txt)




In [None]:
# Cell type : CodeWrite
# write your code here as instructed.
# (Use the functions written previously)

# The plot of Gaussian pdfs for all classes estimated using the train data (train.txt).
# Refer to sample plots 1 and 3
# (4 Cases x 2 Datasets = 8 plots)




In [None]:
# Cell type : CodeWrite
# write your code here as instructed.
# (Use the functions written previously)

# The classifiers, specifically their decision boundary/surface as a 2D plot
# along with training points marked in the plot
# Linearly separable data — sample plot 4
# Non-linearly separable data — sample plot 2
# (4 Cases x 2 Datasets = 8 plots)





In [None]:
# Cell type : CodeWrite
# write your code here as instructed.
# (Use the functions written previously)

# Report the error rates for the above classifiers
# (four classifiers on the two datasets as a 4 × 2 table
# with appropriately named rows and columns).





**Cell type : TextRead**

#### In the next TextWrite cell, briefly answer whether the most general “Case 2” can be used for all datasets. If not, when is a simpler model like “Case 1” preferable to “Case 2”?



**Cell type : TextWrite**
(Write your answer here)


**Cell type : TextRead**

#### In the next TextWrite cell, summarise your observations.

**Cell type : TextWrite**
(Write your observations here)
