# Short Assignment 2

This is an individual assignment.

**Due: Tuesday, October 4 @ 11:59pm**

## Crab Dataset Description

The Crab Data Set has 200 samples and 7 features (Frontal Lip, Rear Width, Length, Width, Depth, Male and Female), describing 5 morphological measurements on 50 crabs each of two color forms and both sexes, of the species *Leptograpsus variegatus* collected at Fremantle, Western Australia.

* Dataset Source: Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus *Leptograpsus*. *Australian Journal of Zoology* 22, 417–425.

The data set is saved in the file "crab.txt": the first column corresponds to the class label (crab species) and the other 7 columns correspond to the features.

**Use the first 140 samples as your training set and the last 60 samples as your test set.**

In [6]:
import pandas as pd
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

data = pd.read_csv("crab.txt", delimiter="\t").to_numpy()  # shape (200, 8): label + 7 features
train_data = data[:140, :]   # first 140 samples, shape (140, 8)
test_data = data[-60:, :]    # last 60 samples, shape (60, 8)


#print(x_train)


## Problem Set

Answer the following questions:

1. Implement the Naive Bayes classifier, under the assumption that your data likelihood model $p(x|C_j)$ is a multivariate Gaussian and the prior probabilities $p(C_j)$ are dictated by the number of samples $n_j\in\mathbb{R}$ that you have for each class. Build your own code to implement the classifier.

2. Did you encounter any problems when implementing the probabilistic generative model? What is your solution for the problem? Explain why your solution works. (Note: There is more than one solution.)

3. Report your classification results in terms of a confusion matrix in both training and test set. (You can use the function [```confusion_matrix```](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) from the module ```sklearn.metrics```.)
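
For reference, the decision rule asked for in question 1 can be written out as below. This is a sketch in standard notation: the symbols $\mu_j$, $\Sigma_j$, $d$ (number of features) and $N$ (total number of training samples) are assumed here and do not appear in the assignment text.

$$
p(C_j \mid x) = \frac{p(x \mid C_j)\, p(C_j)}{\sum_{k} p(x \mid C_k)\, p(C_k)},
\qquad
p(C_j) = \frac{n_j}{N},
$$

$$
p(x \mid C_j) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}}
\exp\left(-\tfrac{1}{2}\,(x-\mu_j)^\top \Sigma_j^{-1} (x-\mu_j)\right).
$$

A sample $x$ is assigned to the class with the largest posterior $p(C_j \mid x)$. Note that the density needs both $\Sigma_j^{-1}$ and $|\Sigma_j| \neq 0$, which is relevant to question 2.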

In [2]:
# Separate the training data into the two classes (column 0 is the class label)
class0_old = train_data[train_data[:, 0] == 0]
class1_old = train_data[train_data[:, 0] == 1]

# Drop the label column (0) and the last column (Female). Female is redundant with Male,
# and keeping both makes the covariance matrix singular (see the answer to question 2).
class0 = np.delete(class0_old, [0, -1], 1)
class1 = np.delete(class1_old, [0, -1], 1)

# Mean
mu0 = np.mean(class0, axis=0)
mu1 = np.mean(class1, axis=0)

# Covariance matrices (np.cov expects one feature per row, hence the transpose)
cov0 = np.cov(class0.T)
cov1 = np.cov(class1.T)

# Number of training samples in each class
N1 = len(class0)
N2 = len(class1)

# Estimating Prior Probabilities - relative frequency
N = N1+N2
p1 = N1/N
print('Probability of Train Class 1: ',p1)
p2 = N2/N
print('Probability of Train Class 2: ',p2)



Probability of Train Class 1:  0.5142857142857142
Probability of Train Class 2:  0.4857142857142857


Running the code for question 1 initially raised a "singular matrix" error. To answer question 2: the error occurs because the features are linearly dependent. The Male and Female columns carry the same information, since each is one minus the other. As a result, the corresponding rows and columns of each class covariance matrix are exact multiples of one another, so the matrix is rank-deficient and its determinant is 0. A matrix with zero determinant is singular and cannot be inverted, yet the multivariate Gaussian density requires the inverse of the covariance matrix.

The solution used here is to eliminate one of the two sex columns, which removes the redundancy, as shown above. With the redundant column gone, the covariance matrix of the remaining features is no longer singular.
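
The argument above can be checked numerically. The following is a minimal sketch on synthetic data (not the crab data): the array `X`, the seed, and the value of `eps` are assumptions made for illustration. It also shows a second possible fix, adding a small constant to the diagonal of the covariance matrix, which shifts every eigenvalue away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the features (NOT the real crab data):
# two continuous measurements plus complementary Male/Female indicators.
n = 50
measurements = rng.normal(size=(n, 2))
male = rng.integers(0, 2, size=n).astype(float)
male[0], male[1] = 0.0, 1.0          # make sure both sexes are present
female = 1.0 - male                  # exact linear dependence on `male`
X = np.column_stack([measurements, male, female])

# With both indicator columns the 4x4 covariance matrix is rank-deficient.
cov_full = np.cov(X.T)
rank_full = np.linalg.matrix_rank(cov_full)

# Fix 1 (used in this notebook): drop one of the redundant columns.
cov_dropped = np.cov(X[:, :-1].T)
rank_dropped = np.linalg.matrix_rank(cov_dropped)

# Fix 2 (alternative): regularize the diagonal with a small constant.
eps = 1e-6                           # hypothetical value, chosen for illustration
cov_reg = cov_full + eps * np.eye(cov_full.shape[0])
rank_reg = np.linalg.matrix_rank(cov_reg)

print("rank with both columns:", rank_full, "of", cov_full.shape[0])
print("rank after dropping Female:", rank_dropped, "of", cov_dropped.shape[0])
print("rank after regularization:", rank_reg, "of", cov_reg.shape[0])
```

Dropping the column removes the dependence itself, while regularization keeps all features and makes the matrix invertible at the cost of a slightly biased covariance estimate.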

In [3]:
x_train = np.delete(train_data, [0, -1], 1)  # drop the label and Female columns

# Probability density (data likelihood) of every training sample under each class model
y0 = multivariate_normal.pdf(x_train, mean=mu0, cov=cov0)  # p(x|C1): first class (label 0), shape (140,)
y1 = multivariate_normal.pdf(x_train, mean=mu1, cov=cov1)  # p(x|C2): second class (label 1), shape (140,)

# Posterior probabilities: they determine the classification decision
pos1 = y0*p1 / (y0*p1 + y1*p2)  # P(C1|x)
pos2 = y1*p2 / (y0*p1 + y1*p2)  # P(C2|x)
#print('pos1, pos2:',pos1.shape)

y_pred_train=[]
for value in range(len(pos1)):
    if pos1[value] > pos2[value]:
        y_pred_train.append(0)
    else:
        y_pred_train.append(1)
    
print('y_pred for train data:',y_pred_train)


y_pred for train data: [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
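
Gaussian densities in several dimensions can become extremely small, and the ratio used for the posteriors turns into 0/0 if both likelihoods underflow. A common way around this is to work with log-densities. The sketch below is one possible variant, not the method used in this notebook: the helper name `log_posteriors` and the toy data are assumptions, while `multivariate_normal.logpdf` and `scipy.special.logsumexp` are existing SciPy functions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal


def log_posteriors(x, means, covs, priors):
    """Return an (n_samples, n_classes) array of log P(C_j | x).

    Hypothetical helper: works in the log domain so that very small
    likelihoods do not underflow to zero.
    """
    # log p(x|C_j) + log p(C_j), one column per class
    log_joint = np.column_stack([
        multivariate_normal.logpdf(x, mean=m, cov=c) + np.log(p)
        for m, c, p in zip(means, covs, priors)
    ])
    # Subtract log of the evidence: log sum_k p(x|C_k) p(C_k)
    return log_joint - logsumexp(log_joint, axis=1, keepdims=True)


# Toy usage on synthetic two-class data (not the crab data)
rng = np.random.default_rng(0)
x0 = rng.normal(loc=0.0, size=(30, 2))
x1 = rng.normal(loc=5.0, size=(20, 2))
x_all = np.vstack([x0, x1])

means = [x0.mean(axis=0), x1.mean(axis=0)]
covs = [np.cov(x0.T), np.cov(x1.T)]
priors = [30 / 50, 20 / 50]

log_post = log_posteriors(x_all, means, covs, priors)
y_pred = np.argmax(log_post, axis=1)  # predicted label: index of the largest posterior
print(y_pred)
```

Taking `argmax` over the columns gives the same decision as comparing the two posteriors directly, and it extends to more than two classes without extra code.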


In [4]:
x_test=np.delete(test_data, [0,-1], 1) #deleting species and Female column

# Data Likelihoods
y1_newPoint = multivariate_normal.pdf(x_test, mean=mu0, cov=cov0) #P(x|C1)
y2_newPoint = multivariate_normal.pdf(x_test, mean=mu1, cov=cov1) #P(x|C2)

print('Data likelihoods:')
print('P(x|C1) = ', y1_newPoint)
print('\nP(x|C2) = ', y2_newPoint,'\n')

# Posterior Probabilities
y1_pos = y1_newPoint*p1 / (y1_newPoint*p1 + y2_newPoint*p2) #P(C1|x)
y2_pos =  y2_newPoint*p2 / (y1_newPoint*p1 + y2_newPoint*p2) #P(C2|x)
#print(y1_pos.shape, y2_pos.shape)

print('\n\n\nPosterior probabilities:')
print('P(C1|x) = ', y1_pos)
print('\nP(C2|x) = ', y2_pos,'\n')
y_pred_test=[]
for value in range(len(y1_pos)):
    if y1_pos[value] > y2_pos[value]:
        y_pred_test.append(0)
    else:
        y_pred_test.append(1)
    
print('\n\ny_pred for test data:',y_pred_test)

Data likelihoods:
P(x|C1) =  [5.52908646e-03 1.19567639e-02 7.55839312e-14 3.14603377e-13
 1.18435549e-09 8.04200156e-05 6.11910190e-09 1.60327224e-04
 6.34032230e-03 1.36326532e-02 2.17024147e-03 5.35161630e-11
 2.28570347e-04 8.03305011e-11 1.51564563e-04 2.97397912e-04
 7.24011537e-09 1.08433497e-02 8.30968939e-04 1.22027557e-10
 1.06708280e-12 5.04631854e-05 9.40221137e-05 4.40489200e-07
 7.75106489e-07 2.43661916e-09 2.97900474e-04 4.74383061e-05
 9.33591599e-07 1.27856683e-07 4.28482756e-03 3.89366331e-09
 3.68444754e-10 8.14768974e-06 2.38560013e-08 5.80681716e-04
 1.60562908e-06 6.64579218e-03 7.18451019e-03 3.48782583e-11
 5.77443568e-09 1.71501663e-06 3.19907619e-09 5.71188423e-09
 2.05711195e-09 2.07675696e-04 4.75589781e-06 4.61205650e-10
 4.44504585e-03 1.22814772e-08 1.03580878e-02 3.00521041e-04
 5.93096851e-04 1.99332843e-03 1.21554081e-04 1.21824094e-09
 3.35193372e-06 1.49330991e-05 2.49512471e-12 1.85473603e-03]

P(x|C2) =  [1.17171578e-12 1.29075541e-19 2.70739528e-

In [5]:
t_test=test_data[:,0]
t_train=train_data[:,0]
#print(t_train.shape)
#print(t_train.shape,y_pred.shape, y1.shape)
print('Confusion matrix for test set:\n',confusion_matrix(t_test, y_pred_test))
print('\nConfusion matrix for train set:\n',confusion_matrix(t_train, y_pred_train))

print('\nBoth confusion matrices are diagonal: there are no false positives or false negatives in either the training set or the test set, so the classifier reaches 100% accuracy on both.')

Confusion matrix for test set:
 [[28  0]
 [ 0 32]]

Confusion matrix for train set:
 [[72  0]
 [ 0 68]]

Both confusion matrices are diagonal: there are no false positives or false negatives in either the training set or the test set, so the classifier reaches 100% accuracy on both.
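
The accuracy implied by a confusion matrix is the sum of its diagonal divided by the sum of all entries. As a small check, the sketch below applies this to the two matrices reported above; the helper name `accuracy` is an assumption introduced for illustration.

```python
import numpy as np

# Confusion matrices copied from the output above
cm_test = np.array([[28, 0],
                    [0, 32]])
cm_train = np.array([[72, 0],
                     [0, 68]])


def accuracy(cm):
    """Fraction of correctly classified samples: trace / total count."""
    return np.trace(cm) / cm.sum()


print("Test accuracy: ", accuracy(cm_test))
print("Train accuracy:", accuracy(cm_train))
```

Both values come out as 1.0, since every off-diagonal entry is zero.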


---

# Submit Your Solution

Confirm that you've successfully completed the assignment.

Along with the Notebook, include a PDF of the notebook with your solutions.

```add``` and ```commit``` the final version of your work, and ```push``` your code to your GitHub repository.

Submit the URL of your GitHub Repository as your assignment submission on Canvas.

---