# SMAI Assignment - 2

## Question - `4` : Gaussian Naïve Bayes

| | |
|- | -|
| Course | Statistical Methods in AI |
| Release Date | `16.02.2023` |
| Due Date | `24.02.2023` |

This question will have you working and experimenting with the Gaussian Naïve Bayes classifier. Initially, you will calculate the priors and the parameters for the Gaussians. Then, you will use the likelihoods to classify the test data. Please note that use of `sklearn` implementations is only for the final question in the Experiments section.

The dataset is simple and interesting: the [Wireless Indoor Localization Data Set](https://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization). An office has seven Wi-Fi routers, and the signal strengths received from these routers determine the location of the receiver (one of four rooms). There are 7 attributes and a class label column that can take 4 values. The data is present in `wifiLocalization.txt`. It contains 2000 samples.

### Imports

In [26]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# additional imports if necessary

### Load Data

The data has been loaded onto a Pandas DataFrame. Try to get an initial feel for the data by using functions like `describe()`, `info()`, or maybe try to plot the data to check for any patterns.

Note: To obtain the data from the UCI website, use `wget`, then shuffle the samples with `shuf` and add a header row so the file is easier to read with `pandas`. Viewing the data in a DataFrame is not required; it can be loaded directly into NumPy if that is more convenient.

In [27]:
data = pd.read_csv('wifiLocalization.txt', sep='\t')

In [28]:
# your code here
data.head(5)

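| | ws1 | ws2 | ws3 | ws4 | ws5 | ws6 | ws7 | r |
|- | - | - | - | - | - | - | - | -|
| 0 | -47 | -53 | -54 | -49 | -63 | -88 | -85 | 3 |
| 1 | -50 | -57 | -60 | -43 | -66 | -77 | -82 | 3 |
| 2 | -44 | -50 | -57 | -45 | -61 | -72 | -67 | 2 |
| 3 | -48 | -59 | -53 | -45 | -74 | -81 | -81 | 3 |
| 4 | -60 | -54 | -59 | -65 | -66 | -83 | -84 | 1 |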


### Splitting the Data

It is good practice to split the data into training and test sets. The test set plays no part in training; it is used only to evaluate the model on unseen data, which reveals whether the model has overfit the training data. You may use the `train_test_split` function from `sklearn.model_selection` to split the data into training and test sets.

It is a good idea to move your data to NumPy arrays now as it will make computing easier.
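
As a rough illustration of working with NumPy arrays, the sketch below performs a shuffled 80/20 split by hand on synthetic data. The helper `shuffle_split` and the random data are assumptions made for this example; it is a NumPy-only stand-in for `train_test_split`, not a replacement for it. Arrays like `X` and `y` can be obtained from a DataFrame with `DataFrame.to_numpy()`.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, seed=1):
    """Hypothetical helper: shuffle the rows once, then cut off a test fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in for the Wi-Fi table: 2000 rows, 7 signal strengths, labels 1..4
rng = np.random.default_rng(0)
X = rng.integers(-90, -40, size=(2000, 7))
y = rng.integers(1, 5, size=2000)

X_train, X_test, y_train, y_test = shuffle_split(X, y)
print(X_train.shape, X_test.shape)  # (1600, 7) (400, 7)
```

Keeping the labels as a 1-D array of shape `(n,)` rather than `(n, 1)` tends to make later boolean masking such as `X[y == c]` simpler.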

In [29]:
# your code here
from sklearn.model_selection import train_test_split
X = data.iloc[:, 0:7]
y = data.iloc[:,7:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(f'X_train: {X_train.shape}')
print(f'y_train: {y_train.shape}')
print()

print(f'X_test: {X_test.shape}')
print(f'y_test: {y_test.shape}')

X_train: (1600, 7)
y_train: (1600, 1)

X_test: (400, 7)
y_test: (400, 1)


In [30]:
y_train = y_train.reset_index(drop=True)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)



In [31]:
res =  pd.concat([y_train,X_train], axis=1)
res.head(2)

| | r | ws1 | ws2 | ws3 | ws4 | ws5 | ws6 | ws7 |
|- | - | - | - | - | - | - | - | -|
| 0 | 3 | -47 | -53 | -54 | -49 | -63 | -88 | -85 |
| 1 | 4 | -59 | -54 | -49 | -58 | -52 | -83 | -93 |


In [32]:
c1_train = res[res['r'] == 1].reset_index(drop=True).iloc[:, 1:]
c2_train = res[res['r'] == 2].reset_index(drop=True).iloc[:, 1:]
c3_train = res[res['r'] == 3].reset_index(drop=True).iloc[:, 1:]
c4_train = res[res['r'] == 4].reset_index(drop=True).iloc[:, 1:]
c4_train.head()



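| | ws1 | ws2 | ws3 | ws4 | ws5 | ws6 | ws7 |
|- | - | - | - | - | - | - | -|
| 0 | -59 | -54 | -49 | -58 | -52 | -83 | -93 |
| 1 | -59 | -53 | -49 | -56 | -47 | -87 | -82 |
| 2 | -55 | -48 | -49 | -62 | -55 | -86 | -91 |
| 3 | -58 | -51 | -56 | -58 | -50 | -88 | -88 |
| 4 | -57 | -55 | -45 | -59 | -50 | -85 | -85 |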


In [33]:

# Per-class mean and variance of every feature.
# pandas' .var() returns the sample variance (ddof=1) by default.
mean_row1 = c1_train.mean()
var_row1 = c1_train.var()

mean_row2 = c2_train.mean()
var_row2 = c2_train.var()

mean_row3 = c3_train.mean()
var_row3 = c3_train.var()

mean_row4 = c4_train.mean()
var_row4 = c4_train.var()

In [34]:
c1_m_v = pd.concat([mean_row1,var_row1],axis=1)
c1_m_v.columns = ['mean', 'variance']

c2_m_v = pd.concat([mean_row2,var_row2],axis=1)
c2_m_v.columns = ['mean', 'variance']

c3_m_v = pd.concat([mean_row3,var_row3],axis=1)
c3_m_v.columns = ['mean', 'variance']

c4_m_v = pd.concat([mean_row4,var_row4],axis=1)
c4_m_v.columns = ['mean', 'variance']

In [35]:
c1_m_v.head(7)

| | mean | variance |
|- | - | -|
| ws1 | -62.566085 | 11.436247 |
| ws2 | -56.299252 | 9.865224 |
| ws3 | -60.523691 | 13.245062 |
| ws4 | -64.187032 | 12.847431 |
| ws5 | -70.38404 | 21.557145 |
| ws6 | -82.845387 | 14.401035 |
| ws7 | -83.962594 | 15.306097 |
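
The per-class estimation above repeats the same steps for each of the four classes. A more compact way to get the same kind of table might look like the following sketch, which loops over the class labels with NumPy. The helper name `fit_gaussian_params` and the toy data are assumptions for illustration only.

```python
import numpy as np

def fit_gaussian_params(X, y):
    """Hypothetical helper: per-class feature means and sample variances."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # ddof=1 matches the pandas default used by DataFrame.var()
    variances = np.array([X[y == c].var(axis=0, ddof=1) for c in classes])
    return classes, means, variances

# Tiny made-up example: 2 classes, 2 features
X = np.array([[1.0, 10.0], [3.0, 14.0], [10.0, 0.0], [14.0, 4.0]])
y = np.array([1, 1, 2, 2])
classes, means, variances = fit_gaussian_params(X, y)
print(means)      # rows are classes, columns are features
print(variances)
```

Note that `ddof=1` gives the unbiased sample variance; an implementation that uses the biased estimate (`ddof=0`) will report slightly different variances on the same data.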


### Calculate priors

Write a function to calculate the priors for each class.

In [36]:
p1 = len(c1_train) / len( X_train)
p2 = len(c2_train) / len( X_train)
p3 = len(c3_train) / len( X_train)
p4 = len(c4_train) / len( X_train)
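
Since the prompt asks for a function, the same computation could be wrapped as in the sketch below. The name `calculate_priors` and the example labels are assumptions; the logic is simply the relative frequency of each class in the training labels.

```python
import numpy as np

def calculate_priors(y):
    """Hypothetical helper: map each class label to its relative frequency."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

y_example = np.array([1, 1, 2, 3, 3, 3, 4, 4])  # made-up labels
priors = calculate_priors(y_example)
print(priors)  # {1: 0.25, 2: 0.125, 3: 0.375, 4: 0.25}
```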

In [37]:
m = c1_m_v.iloc[0, 0]

### Likelihood + Classification

Given a test sample, write a function to get the likelihoods for each class in the sample. Use the Gaussian parameters and priors calculated above. Then compute the likelihood that the sample belongs to each class and return the class with the highest likelihood.

What is a common problem with the likelihoods? How can you fix it? Redo the classification with the fixed likelihoods. (You can either write another function or modify the existing one after mentioning the reason for the change)
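
The decision rule described above can be written out explicitly. The notation is an assumption of this sketch: $x_j$ is the $j$-th feature of the test sample, $\mu_{cj}$ and $\sigma_{cj}^{2}$ are the mean and variance of feature $j$ in class $c$, and $P(c)$ is the class prior.

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c) \prod_{j=1}^{7}
\frac{1}{\sqrt{2\pi\sigma_{cj}^{2}}}
\exp\!\left(-\frac{(x_j-\mu_{cj})^{2}}{2\sigma_{cj}^{2}}\right)
```

One commonly cited problem with this product is numerical underflow: multiplying many densities that are each much smaller than 1 can round to zero in floating point. Because the logarithm is monotonic, taking logs leaves the arg max unchanged and turns the product into a sum:

```latex
\hat{c} \;=\; \arg\max_{c}\; \left[\, \log P(c) \;-\; \frac{1}{2}\sum_{j=1}^{7}
\left( \log\!\left(2\pi\sigma_{cj}^{2}\right) + \frac{(x_j-\mu_{cj})^{2}}{\sigma_{cj}^{2}} \right) \right]
```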

In [38]:
import math

def gaussian_probability(x, mu, var):
    exponent = math.exp(-((x - mu) ** 2) / (2 * var))
    coefficient = 1 / math.sqrt(2 * math.pi * var)
    return coefficient * exponent


def find_max_index(lst) -> int:
    max_value = max(lst)
    max_index = lst.index(max_value)
    return max_index

def Predict_class(data_test):
    # Start each class score at its prior, then multiply in the per-feature likelihoods
    p1 = len(c1_train) / len(X_train)
    p2 = len(c2_train) / len(X_train)
    p3 = len(c3_train) / len(X_train)
    p4 = len(c4_train) / len(X_train)
    for i in range(len(data_test)):
        # Positional access: the row is labelled ws1..ws7, not 0..6
        x = data_test.iloc[i]
        m = c1_m_v.iloc[i, 0]
        var = c1_m_v.iloc[i, 1]
        p1 *= gaussian_probability(x, m, var)

        m = c2_m_v.iloc[i, 0]
        var = c2_m_v.iloc[i, 1]
        p2 *= gaussian_probability(x, m, var)

        m = c3_m_v.iloc[i, 0]
        var = c3_m_v.iloc[i, 1]
        p3 *= gaussian_probability(x, m, var)

        m = c4_m_v.iloc[i, 0]
        var = c4_m_v.iloc[i, 1]
        p4 *= gaussian_probability(x, m, var)
    p = [p1, p2, p3, p4]
    index = find_max_index(p)
    return index + 1  # class labels are 1-based


In [39]:
res_predict = []

for index, row in X_test.iterrows():
    res_predict.append(Predict_class(row))

In [40]:
res_predict = pd.DataFrame(res_predict)

In [41]:
res_predict.columns = ['predicted']

In [42]:
res_predict.head()

| | predicted |
|- | -|
| 0 | 3 |
| 1 | 3 |
| 2 | 4 |
| 3 | 2 |
| 4 | 3 |


In [43]:
test_data = pd.concat([X_test,y_test,res_predict],axis=1)

In [44]:
correct = 0
total = test_data.shape[0]

for row in test_data.iterrows():
   row = row[1]
   if row['r'] == row['predicted']:
      correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 396
Incorrect: 4
Accuracy: 0.99
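
With only seven features, the product computed in `Predict_class` evidently stays representable (it reaches 0.99 accuracy above). With more features, or samples far from every class mean, the product can underflow to zero. A minimal sketch of the log-space variant follows; the helper names, array layout, and numbers are assumptions for illustration, not the notebook's own implementation.

```python
import numpy as np

def log_gaussian(x, mu, var):
    """Log of the univariate Gaussian density, evaluated element-wise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict_log(X, priors, means, variances):
    """Hypothetical helper: 1-based class labels with the highest log-posterior.

    X: (n_samples, n_features); priors: (n_classes,);
    means, variances: (n_classes, n_features).
    """
    # Broadcast to (n_samples, n_classes, n_features), then sum over features
    log_lik = log_gaussian(X[:, None, :], means[None, :, :], variances[None, :, :]).sum(axis=2)
    scores = np.log(priors)[None, :] + log_lik
    return scores.argmax(axis=1) + 1

# Made-up parameters for 2 classes and 3 features
priors = np.array([0.5, 0.5])
means = np.array([[-60.0, -55.0, -70.0], [-40.0, -50.0, -80.0]])
variances = np.array([[10.0, 12.0, 9.0], [11.0, 8.0, 14.0]])
X = np.array([[-59.0, -56.0, -71.0], [-41.0, -49.0, -79.0]])
predictions = predict_log(X, priors, means, variances)
print(predictions)  # [1 2]

# Underflow demo: 500 made-up features, each 5 standard deviations from the mean
x_far = np.zeros(500)
direct_product = np.exp(log_gaussian(x_far, 5.0, 1.0)).prod()
log_sum = log_gaussian(x_far, 5.0, 1.0).sum()
print(direct_product, log_sum)  # the product collapses to 0.0, the log-sum stays finite
```

Summing logs also makes it easy to score all test rows in one vectorized call instead of looping over rows.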


### Experiments

1. Estimate your model on the training data.
2. Plot the Gaussian probability density functions for each class after estimation.
3. Classify the test data using your model.
4. Pick a few samples from the test set that were misclassified and plot them along with the Gaussian probability density functions for each class. What do you observe?
5. Find if there are any features that are redundant. If so, remove them and repeat the experiments. How does the performance change?
6. Conversely, are there certain features that overpower the likelihood scores independently? Test this hypothesis empirically by only using that/those feature(s) and repeating the experiments. How does the performance change?
7. Compare your results with the `scikit-learn` implementation. You can use the `GaussianNB` class from `sklearn.naive_bayes`. You can use the `score` function to get the accuracy of the model on the test set.
8. (Optional) Try other Naïve Bayes classifiers from [`sklearn.naive_bayes`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes) and compare the results.
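
For experiments 2 and 4, one possible way to draw the per-class densities of a single feature is sketched below. The helper `plot_class_pdfs` and the parameter values are made up for illustration; in practice the means and variances would come from the estimates computed on the training data.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density, evaluated element-wise."""
    return np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

def plot_class_pdfs(means, variances, feature_name, ax=None):
    """Hypothetical helper: draw one Gaussian curve per class for a single feature."""
    if ax is None:
        _, ax = plt.subplots()
    stds = np.sqrt(variances)
    # Cover every class out to four standard deviations
    xs = np.linspace((means - 4 * stds).min(), (means + 4 * stds).max(), 400)
    for k, (mu, var) in enumerate(zip(means, variances), start=1):
        ax.plot(xs, gaussian_pdf(xs, mu, var), label=f"class {k}")
    ax.set_xlabel(feature_name)
    ax.set_ylabel("density")
    ax.legend()
    return ax

# Made-up per-class parameters for one feature
means = np.array([-62.0, -40.0, -52.0, -58.0])
variances = np.array([11.0, 9.0, 8.0, 12.0])
ax = plot_class_pdfs(means, variances, "ws1")
```

A misclassified sample can then be overlaid on the same axes, for example with `ax.axvline(value)`, to see which class curves it falls between. Call `plt.show()` to display the figure outside a notebook.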

#### 7. Comparison with `scikit-learn`

In [45]:
y_train = y_train['r'].to_numpy()

In [46]:
# your code here
from sklearn.naive_bayes import GaussianNB  
classifier = GaussianNB()  
classifier.fit(X_train, y_train) 

In [47]:
y_pred = classifier.predict(X_test)  

In [48]:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred) 

0.99
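
Experiment 7 also mentions the `score` function. The sketch below shows it on synthetic data (the data, the hold-out rule, and the variable names are assumptions for this example); `score` returns the mean accuracy directly, so no separate `predict` call is needed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in: 4 classes, 7 features, class means 10 units apart
labels = np.repeat(np.arange(1, 5), 100)
X = rng.normal(loc=-40.0 - 10.0 * labels[:, None], scale=3.0, size=(400, 7))

# Hold out every fifth row for testing
test_mask = np.arange(400) % 5 == 0
clf = GaussianNB().fit(X[~test_mask], labels[~test_mask])

accuracy = clf.score(X[test_mask], labels[test_mask])  # mean accuracy on held-out rows
print(accuracy)
print(clf.class_prior_)   # estimated priors, one per class
print(clf.theta_.shape)   # per-class feature means, shape (n_classes, n_features)
```

The fitted attributes `class_prior_` and `theta_` can be compared against hand-computed priors and means as a sanity check on a from-scratch implementation.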