# Bayes Network

I am going to use the pomegranate library, which is "a Python package that implements fast and flexible probabilistic models ranging from individual probability distributions to compositional models such as Bayesian networks and hidden Markov models."

It is used here to answer the assignment questions at
https://www.ida.liu.se/ext/caisor/TDDC65/dectree-exercise/page-100930.html

## The Network

![img1](img/img1.png)

In [1]:
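# Note: this notebook uses the pre-1.0 pomegranate API
# (DiscreteDistribution, ConditionalProbabilityTable, Node, bake).
from pomegranate import *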

In [2]:
tableA = DiscreteDistribution({"T": 0.3, "F": 0.7})

tableB = ConditionalProbabilityTable([
    ["T", "T", 0.8], # Given T, probability of T
    ["T", "F", 0.2], # Given T, probability of F
    ["F", "T", 0.4], # Given F, probability of T
    ["F", "F", 0.6], # Given F, probability of F
], [tableA])

tableC = ConditionalProbabilityTable([
    ["F", "F", "T", 0.1],
    ["F", "F", "F", 0.9],
    ["F", "T", "T", 0.7],
    ["F", "T", "F", 0.3],
    ["T", "F", "T", 0.5],
    ["T", "F", "F", 0.5],
    ["T", "T", "T", 0.99],
    ["T", "T", "F", 0.01],
], [tableA, tableB])

tableD = ConditionalProbabilityTable([
    ["F", "T", 0.55],
    ["F", "F", 0.45],
    ["T", "T", 0.2],
    ["T", "F", 0.8],
], [tableB])

In [3]:
nodeA = Node(tableA, name="A")
nodeB = Node(tableB, name="B")
nodeC = Node(tableC, name="C")
nodeD = Node(tableD, name="D")

model = BayesianNetwork()
model.add_states(nodeA, nodeB, nodeC, nodeD)
model.add_edge(nodeA, nodeB)
model.add_edge(nodeA, nodeC)
model.add_edge(nodeB, nodeC)
model.add_edge(nodeB, nodeD)
model.bake()
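
As a reading aid, the parent sets passed to each table (and the edges added above) correspond to the following factorization of the joint distribution:

$$P(A,B,C,D) = P(A)\,P(B|A)\,P(C|A,B)\,P(D|B)$$

Every query below is a ratio of sums of this product over the unobserved variables.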

## Queries

![img2](img/img2.png)

### Query A

In [4]:
model.predict_proba([[None, None, 'T', 'T']])[0][0].parameters[0]['T']

0.5054138717420109

The result for node A is the first array.   
$P(A=T|C=T,D=T) = 0.505$ 

### Query B

In [5]:
model.predict_proba([[None, None, None, 'F']])[0][0].parameters[0]['T']

0.34651898734177244

Again, the first array represents Node A probabilities.   
$P(A=T|D=F) = 0.346$

### Query C

In [6]:
model.predict_proba([[None, None, 'T', None]])[0][1].parameters[0]['T']

0.8100843263425553

Node B probability is in the 2nd array.   
$P(B=T|C=T) = 0.810$

### Query D

In [7]:
model.predict_proba([['T', None, 'T', None]])[0][1].parameters[0]['T']

0.8878923766816139

Node B probability is in the second array (index 1).  
$P(B=T|A=T,C=T) = 0.888$
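
This value can be checked by hand. D is an unobserved leaf, so it sums out, and Bayes' rule over the tables for B and C gives:

$$P(B=T|A=T,C=T) = \frac{P(B=T|A=T)\,P(C=T|A=T,B=T)}{\sum_{b} P(B=b|A=T)\,P(C=T|A=T,B=b)} = \frac{0.8 \cdot 0.99}{0.8 \cdot 0.99 + 0.2 \cdot 0.5} = \frac{0.792}{0.892} \approx 0.888$$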

### Query E

In [8]:
model.predict_proba([['F', 'F', None, 'F']])[0][2].parameters[0]['T']

0.10000000000000016

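Finally, the last one is in the 'True' part of the parameters of the third array of the result. Since C depends only on its parents A and B, the evidence on D does not matter once A and B are known, and the value is simply the table entry $P(C=T|A=F,B=F)$.  
$P(C=T|A=F,B=F,D=F) = 0.10$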
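
As a cross-check, the five queries can also be answered by brute-force enumeration of the 16 joint assignments, using only the tables defined above. This is a minimal sketch in plain Python, not part of pomegranate; the helper names `bern`, `joint` and `query` are made up for illustration.

```python
from itertools import product

# CPTs copied from the tables above, stored as P(variable = "T" | parents).
P_A = 0.3
P_B = {"T": 0.8, "F": 0.4}                    # keyed by A
P_C = {("F", "F"): 0.1, ("F", "T"): 0.7,
       ("T", "F"): 0.5, ("T", "T"): 0.99}     # keyed by (A, B)
P_D = {"F": 0.55, "T": 0.2}                   # keyed by B


def bern(p_true, value):
    """Probability of `value` for a binary variable with P("T") = p_true."""
    return p_true if value == "T" else 1.0 - p_true


def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(a) P(b|a) P(c|a,b) P(d|b)."""
    return (bern(P_A, a) * bern(P_B[a], b)
            * bern(P_C[(a, b)], c) * bern(P_D[b], d))


def query(target, evidence):
    """Exact P(target = "T" | evidence) by summing the joint distribution."""
    names = ("A", "B", "C", "D")
    numerator = denominator = 0.0
    for values in product("TF", repeat=4):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # assignment contradicts the evidence
        p = joint(*values)
        denominator += p
        if world[target] == "T":
            numerator += p
    return numerator / denominator


print(query("A", {"C": "T", "D": "T"}))            # Query A
print(query("A", {"D": "F"}))                      # Query B
print(query("B", {"C": "T"}))                      # Query C
print(query("B", {"A": "T", "C": "T"}))            # Query D
print(query("C", {"A": "F", "B": "F", "D": "F"}))  # Query E
```

Enumeration gives about 0.507, 0.347, 0.858, 0.888 and 0.100. Queries B, D and E agree with the library output, while A and C differ slightly. One possible explanation, which is an assumption about the pomegranate version used here, is that `predict_proba` runs approximate inference (loopy belief propagation), which is not guaranteed to be exact when the graph contains an undirected cycle such as A-B-C.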

# Naive Bayes


![img3](img/img3.png)


In [9]:
from sklearn.datasets import load_wine

from sklearn.model_selection import train_test_split, GridSearchCV


from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.metrics import f1_score, accuracy_score

import numpy as np
import pandas as pd

## Split the data

In [10]:
X, y = load_wine(return_X_y=True)

In [11]:
dfTrain_X, dfTest_X, dfTrain_y, dfTest_y = train_test_split(X, y, test_size=0.1, random_state=1337)

## a) Use Naive Bayes to classify the dataset, exploring different parameter values

* Load and split the dataset.

* Apply the sklearn implementation of the Gaussian Naive Bayes

In [12]:
m_baselineGNB = GaussianNB().fit(dfTrain_X, dfTrain_y)

* Predict with the model

In [13]:
m_baselineGNB_predictions = m_baselineGNB.predict(dfTest_X)

* Evaluate the model results

In [14]:
print(f"Gaussian Naive-Bayes:\n\tF1 Score: {f1_score(dfTest_y, m_baselineGNB_predictions, average='macro')}\n\tAccuracy: {accuracy_score(dfTest_y, m_baselineGNB_predictions)}")

Gaussian Naive-Bayes:
	F1 Score: 0.7231884057971015
	Accuracy: 0.7777777777777778
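
The two metrics printed above can disagree on a multi-class problem, because macro F1 averages the per-class F1 scores with equal weight while accuracy counts every sample equally. The sketch below shows the computation in plain Python; the labels are hypothetical (not the wine data) and the function names are illustrative.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for three classes.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")  # per-class F1: 4/6, 6/7, 4/5
print(f"accuracy: {accuracy(y_true, y_pred):.3f}")  # 7 of 9 correct
```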


### Try different parameters with Gaussian Naive-Bayes

`GaussianNB` exposes only two parameters, `priors` and `var_smoothing`. Here we vary the priors: we bootstrap the training labels and pass the resulting class distribution as the `priors` argument.

In [15]:
dfTrain_y

array([1, 1, 0, 0, 0, 1, 0, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 0, 1, 2, 0, 2,
       0, 0, 1, 2, 0, 0, 1, 0, 1, 2, 1, 0, 1, 0, 0, 1, 0, 2, 0, 1, 1, 2,
       1, 0, 1, 1, 2, 1, 1, 2, 1, 0, 1, 2, 2, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 2, 0, 2, 1, 1, 2, 1, 0, 2, 0, 1, 2, 0, 1, 2, 1, 1, 2, 1, 1,
       2, 2, 1, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 1, 2, 1, 2, 2, 2, 1, 0, 2,
       0, 2, 1, 2, 1, 2, 2, 0, 0, 0, 2, 2, 1, 2, 2, 1, 1, 1, 2, 0, 2, 0,
       1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 2, 2, 0, 1, 1, 1,
       1, 1, 2, 2, 1, 2])

In [16]:
n_bootstrap = 1000
n_samples = 5

# Draw n_bootstrap resamples of n_samples labels each (with replacement)
# and accumulate how often each class appears. Labels are 0..n_classes-1.
n_classes = len(np.unique(dfTrain_y))
counts = np.zeros(n_classes, dtype=int)

for _ in range(n_bootstrap):
    samples = np.random.choice(dfTrain_y, n_samples)
    counts += np.bincount(samples, minlength=n_classes)

# Normalize the counts into class frequencies to use as priors.
freqs = counts / counts.sum()
freqs

array([0.2912, 0.423 , 0.2858])

In [17]:
m_freqsGNB = GaussianNB(priors=freqs).fit(dfTrain_X, dfTrain_y)
m_freqsGNB_prediction = m_freqsGNB.predict(dfTest_X)

print(f"Gaussian Naive-Bayes:\n\tF1 Score: {f1_score(dfTest_y, m_freqsGNB_prediction, average='macro')}\n\tAccuracy: {accuracy_score(dfTest_y, m_freqsGNB_prediction)}")

Gaussian Naive-Bayes:
	F1 Score: 0.7231884057971015
	Accuracy: 0.7777777777777778


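We get the same results. This is expected: the bootstrapped frequencies approximate the empirical class frequencies of the training labels, which `GaussianNB` already uses as priors when `priors` is not given.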
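
Bootstrapped priors barely differ from the empirical class frequencies, which the following plain-Python sketch illustrates. The label list and the seed are hypothetical stand-ins, not the wine training labels.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical labels: 30 of class 0, 45 of class 1, 25 of class 2.
labels = [0] * 30 + [1] * 45 + [2] * 25
empirical = {c: n / len(labels) for c, n in sorted(Counter(labels).items())}

# Same scheme as above: 1000 resamples of 5 labels, drawn with replacement.
counts = Counter()
for _ in range(1000):
    counts.update(random.choices(labels, k=5))

total = sum(counts.values())
bootstrap = {c: counts[c] / total for c in empirical}

for c in empirical:
    print(f"class {c}: empirical {empirical[c]:.3f}, bootstrap {bootstrap[c]:.3f}")
```

Each bootstrap draw is a uniform draw from the labels, so the pooled frequencies converge to the empirical ones as the number of draws grows.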

## b) Use Random Forests to classify the dataset, exploring different parameter values

* Apply the sklearn implementation of a tree-ensemble classifier (here `ExtraTreesClassifier`, i.e. extremely randomized trees, a close relative of random forests)


In [18]:
m_baselineETC = ExtraTreesClassifier(random_state=1337).fit(dfTrain_X, dfTrain_y)

* Predict with the model

In [19]:
m_baselineETC_predictions = m_baselineETC.predict(dfTest_X)

* Evaluate the model results

In [20]:
print(f"Ensemble of Trees:\n\tF1 Score: {f1_score(dfTest_y, m_baselineETC_predictions, average='macro')}\n\tAccuracy: {accuracy_score(dfTest_y, m_baselineETC_predictions)}")

Ensemble of Trees:
	F1 Score: 0.8666666666666667
	Accuracy: 0.9444444444444444


### Try different parameters with Tree Ensemble

In [27]:
m_gridbaseETC = ExtraTreesClassifier(random_state=1337)

gridParams = {"n_estimators": [30, 40, 70, 100, 120, 150], 
              "criterion": ["gini", "entropy"], 
              "max_depth": [2,3,4,5],
              # "auto" is equivalent to "sqrt" for classifiers and was removed in scikit-learn 1.3
              "max_features": ["auto", "sqrt"]}

m_gridETC = GridSearchCV(m_gridbaseETC, gridParams, scoring='precision_macro')
m_gridETC.fit(dfTrain_X, dfTrain_y)

GridSearchCV(estimator=ExtraTreesClassifier(random_state=1337),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 3, 4, 5],
                         'max_features': ['auto', 'sqrt'],
                         'n_estimators': [30, 40, 70, 100, 120, 150]},
             scoring='precision_macro')
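
To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below uses only the standard library; `grid_params` mirrors `gridParams` above, and the fold count of 5 is an assumption (the default `cv` in recent scikit-learn releases), since the cell does not set it.

```python
from itertools import product

# Same grid as gridParams above.
grid_params = {
    "n_estimators": [30, 40, 70, 100, 120, 150],
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5],
    "max_features": ["auto", "sqrt"],
}

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(grid_params, combo))
              for combo in product(*grid_params.values())]

n_folds = 5  # assumption: default number of cross-validation folds
n_fits = len(candidates) * n_folds

print(len(candidates))  # 6 * 2 * 4 * 2 = 96 candidates
print(n_fits)           # 96 * 5 = 480 model fits
```

Because `"auto"` and `"sqrt"` select the same number of features for a classifier, half of these candidates are duplicates of the other half.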

In [28]:
m_gridbestETC = m_gridETC.best_estimator_
m_gridbestETC.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 40,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1337,
 'verbose': 0,
 'warm_start': False}

In [29]:
m_gridbestETC_predictions = m_gridbestETC.predict(dfTest_X)

In [30]:
print(f"Ensemble of Trees:\n\tF1 Score: {f1_score(dfTest_y, m_gridbestETC_predictions, average='macro')}\n\tAccuracy: {accuracy_score(dfTest_y, m_gridbestETC_predictions)}")

Ensemble of Trees:
	F1 Score: 0.8888888888888888
	Accuracy: 0.8888888888888888


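Macro F1 improved over the baseline model (0.889 vs. 0.867), but accuracy dropped (0.889 vs. 0.944). With only 18 test samples, a single prediction changes accuracy by about 0.056, so the tuned model is not clearly better. Note also that the grid search optimized macro precision, not the metrics reported here.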