# Introduction to Bayes Expert

Bayes Expert is software from Rejuve.AI for building Bayesian network models, including the Longevity Bayesian network model, which runs as a SingularityNET service. It lets scientists crowdsource data and theories and contribute to research without data science expertise: they submit data and theories through an online form, and these submissions extend a hand-coded "seed" network. Building a Bayesian network has traditionally required collaboration between data scientists and medical scientists. Bayes Expert aims to make the process accessible to any scientist, so that members can contribute directly to research on aging and longevity.

## Bayesian Network Utilities

1. `var_deps`: Returns a dictionary mapping variable names to their dependencies.
2. `fillcols`: Creates a hierarchical structure of variables based on their dependencies.
3. `make_tree`: Constructs a hierarchical tree of the Bayesian network variables and returns it as a DataFrame.
4. `complexity_check`: Checks the complexity of the Bayesian network against specified limits and returns a tuple (passes, errors).
5. `get_var_positions`: Returns a dictionary mapping variable names to their positions.
6. `get_var_val_positions`: Returns a dictionary mapping variable names to their value positions.
7. `get_internal_var_val_positions`: Returns a dictionary mapping internal variable names to their value positions.
8. `get_var_names`: Returns a dictionary mapping variable indices to their names.
9. `get_var_val_names`: Returns a dictionary mapping variable names to their value names.
10. `parse_net`: Parses a Bayesian network query and returns the parsed components.
11. `detect_anomalies`: Detects anomalies in the given time series data using the specified Bayesian network and anomaly detection parameters.
12. `readable`: Converts a response from the Bayesian network into a human-readable format.
13. `create_query`: Creates a query for the Bayesian network.
14. `get_template_priors`: Returns the template priors for the Bayesian network.
15. `predict_proba_adjusted`: Predicts probabilities for the given evidence using the adjusted Bayesian network.
16. `batch_query`: Executes a batch query on the Bayesian network.
17. `query`: Executes a query on the Bayesian network.
18. `explain_why_bad`: Explains why a result is bad.
19. `explain_why_good`: Explains why a result is good.
20. `internal_query`: Executes an internal query on the Bayesian network.
21. `explain`: Provides an explanation for the given evidence and explanation list.
22. `make_nmap`: Creates a mapping of value ranges.
23. `dictVarsAndValues`: Returns a dictionary of variables and their values from the Bayesian network and CPT.
24. `any_of`: Constructs a CPT where any of the input variables can lead to the output variable.
25. `all_of`: Constructs a CPT where all of the input variables must lead to the output variable.
26. `avg`: Constructs a CPT based on the average value of the input variables.
27. `if_then_else`: Constructs a CPT based on if-then-else conditions.
28. `addCpt`: Adds a CPT to the Bayesian network.
29. `bayesInitialize`: Initializes the Bayesian network for use with Pomegranate.
30. `non_cpt_descriptions`: Returns a description of the non-CPT parts of the Bayesian network.
31. `get_priors`: Returns the priors for the Bayesian network.
32. `get_frequencies`: Returns the frequencies of the keylist in the Bayesian network.
33. `rr_prob_a_and_not_a_given_b_and_not_b`: Calculates the probabilities given relative risks.
34. `ss_prob_a_and_not_a_given_b_and_not_b`: Calculates the probabilities given sensitivity and specificity.
35. `prob_a_and_not_a_given_b_and_not_b`: Calculates the probabilities given invars, priors, and outvars.
36. `get_good_vars`: Returns the good variables for the given variable.
37. `get_rr_vals`: Returns the relative risk values for the given variable.
38. `replace_rr`: Replaces the relative risk value in the invars.
39. `dependency`: Constructs a dependency CPT based on the input variables and output variables.
40. `align_ci`: Aligns the confidence interval.
41. `normalize_ci`: Normalizes the confidence interval.
42. `get_window`: Returns the window for the Bayesian network.
43. `get_stat_info`: Returns the statistical information for the given variable and value.
44. `validation`: Validates the given probability and condition value.
45. `dependency_direct`: Directly constructs a dependency CPT based on the input variables and output variables.

# Imports

In [1]:
import qpsolvers
import os
from os.path import exists
import pickle

import sn_bayes
from sn_bayes import longevity_bayes
from sn_bayes.utils import complexity_check
from sn_bayes.utils import get_var_positions
from sn_bayes.utils import get_var_val_positions
from sn_bayes.utils import make_tree
from sn_bayes.utils import bayesInitialize
from sn_bayes.utils import query
from sn_bayes.utils import internal_query
from sn_bayes.utils import explain_why_bad
from sn_bayes.utils import explain_why_good
from sn_bayes.utils import create_query
from sn_bayes.utils import var_deps
from sn_bayes.utils import get_internal_var_val_positions
from sn_bayes.utils import any_of
from sn_bayes.utils import all_of
from sn_bayes.utils import avg
from sn_bayes.utils import if_then_else
from sn_bayes.utils import addCpt
from sn_bayes.utils import dependency
from sn_bayes.utils import non_cpt_descriptions
from sn_bayes.utils import fillcols

import sn_service.service_spec.bayesian_pb2
from sn_service.service_spec.bayesian_pb2 import Query
import sn_service.service_spec.bayesian_pb2_grpc as grpc_bayes_grpc
from sn_service.service_spec.bayesian_pb2 import BayesianNetworkQuery
from sn_service.service_spec.bayesian_pb2 import QueryId
from sn_service.service_spec.bayesian_pb2 import Id
from sn_service.service_spec.bayesian_pb2 import BayesianNetwork

import grpc
import pandas as pd
import networkx as nx
import time
import re

import matplotlib.pyplot as plt

# Baking bayesianNetwork

## Bayesian Network Compilation with Pomegranate

The following code compiles the Bayesian network and computes probabilities with Pomegranate. It first checks whether a serialized network file (`bayesianNetwork.pkl`) exists. If so, it loads the network from that file; otherwise it generates the network with `longevity_bayes.longevity_bayes()`, prints the build output, and saves the network to the file. In either case it then initializes the network with `bayesInitialize`, bakes it, and predicts the probabilities of all variables with no initial evidence, storing them in `predicted_probabilities`.

In [None]:
# Check if the Bayesian network file exists
if os.path.exists("bayesianNetwork.pkl"):
    with open("bayesianNetwork.pkl", 'rb') as infile:
        bayesianNetwork = pickle.load(infile)
else:
    # If the file does not exist, generate the Bayesian network
    %time bayesianNetwork, outstr = longevity_bayes.longevity_bayes()
    print(outstr)
    with open("bayesianNetwork.pkl", 'wb') as outfile:
        pickle.dump(bayesianNetwork, outfile)

# Initialize the Bayesian network with the bayesInitialize function
longevity = bayesInitialize(bayesianNetwork)
# Bake the network to finalize its structure
longevity.bake()
# Predict probabilities for all variables with no initial evidence
predicted_probabilities = longevity.predict_proba({})

# Utils functions

## var_deps

The `var_deps` function builds a dictionary that maps each variable in a Bayesian network to its dependencies. It takes a `bayesianNetwork` object containing `discreteDistributions` and `conditionalProbabilityTables`, and starts from an empty dictionary `var_deps`. It first iterates over `discreteDistributions`, adding each distribution's name with an empty list to indicate no dependencies. It then processes `conditionalProbabilityTables`, adding each table's name with an empty list and appending the names of the table's `randomVariables` (its dependencies) to that list. Finally, it returns the populated dictionary. For example, if the network has discrete distributions A and B, and CPTs C (dependent on A and B) and D (dependent on B), the function returns:

`{'A': [], 'B': [], 'C': ['A', 'B'], 'D': ['B']}`
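The logic described above can be reproduced on a toy network. This is a minimal sketch, not the `sn_bayes` source: `var_deps_sketch` and `toy_network` are hypothetical names, and `SimpleNamespace` objects stand in for the protobuf messages, assuming each entry of `randomVariables` exposes a `name` attribute.

```python
from types import SimpleNamespace


def var_deps_sketch(network):
    """Map every variable name to the list of variables it depends on."""
    deps = {}
    # Discrete distributions are root nodes: they have no dependencies
    for dist in network.discreteDistributions:
        deps[dist.name] = []
    # Each CPT depends on the random variables it is conditioned on
    for table in network.conditionalProbabilityTables:
        deps[table.name] = []
        for parent in table.randomVariables:
            deps[table.name].append(parent.name)
    return deps


# Hypothetical stand-in for a BayesianNetwork message (assumption for illustration)
toy_network = SimpleNamespace(
    discreteDistributions=[SimpleNamespace(name="A"), SimpleNamespace(name="B")],
    conditionalProbabilityTables=[
        SimpleNamespace(
            name="C",
            randomVariables=[SimpleNamespace(name="A"), SimpleNamespace(name="B")],
        ),
        SimpleNamespace(name="D", randomVariables=[SimpleNamespace(name="B")]),
    ],
)

var_deps_result = var_deps_sketch(toy_network)
print(var_deps_result)
```

The printed dictionary matches the example given in the text.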

In [None]:
variable_dependencies = var_deps(bayesianNetwork)

In [None]:
variable_dependencies

## fillcols

The `fillcols` function organizes the variables of a Bayesian network into levels based on their dependencies and returns the result as a list of lists. It begins by making a deep copy of the input dictionary `var_dict`, so the original is not modified, and initializes an empty list `tree_list` to hold the hierarchy. It then repeatedly processes the copied dictionary, `var_deps`, until every variable has been placed. In each iteration it initializes `next_level` to hold the variables for the current level and `deletes` to track the variables processed in this pass. It checks whether all dependencies of each variable are already in `tree_list`; if they are, the variable is added to `next_level` and marked for deletion. After all variables have been examined, those marked for deletion are removed from `var_deps`. The loop stops when no more variables can be added (`final_len` equals `initial_len`).

In the returned structure, each sublist is a level of variables whose dependencies are satisfied by the previous levels. For example, if `var_dict` is `{'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C']}`, the function returns `[['A'], ['B', 'C'], ['D']]`: A has no dependencies, B and C depend on A, and D depends on both B and C.
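The level-by-level placement can be sketched in a few lines. This is an illustrative reimplementation under the description above, not the library code; `fillcols_sketch`, `remaining`, and `placed` are hypothetical names.

```python
import copy


def fillcols_sketch(var_dict):
    """Group variables into levels so each level only depends on earlier ones."""
    remaining = copy.deepcopy(var_dict)  # leave the caller's dictionary untouched
    tree_list = []
    placed = set()  # variables already assigned to some earlier level
    while remaining:
        # A variable is ready once all of its dependencies sit in earlier levels
        next_level = [
            var for var, deps in remaining.items()
            if all(dep in placed for dep in deps)
        ]
        if not next_level:
            break  # no progress: a cycle or an unknown dependency
        for var in next_level:
            del remaining[var]
        placed.update(next_level)
        tree_list.append(next_level)
    return tree_list


example_deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
example_levels = fillcols_sketch(example_deps)
print(example_levels)
```

The `placed` set is only updated after a whole level has been collected, which is why B and C land in the same level rather than one after the other.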



In [None]:
tree_list = fillcols(var_dict = variable_dependencies)

In [None]:
tree_list

## make_tree

The `make_tree` function constructs a hierarchical tree structure of a Bayesian network's variables and returns it as a DataFrame. It first calls the `var_deps` function to obtain a dictionary of variable dependencies from the `bayesianNetwork`. It then prints this dictionary for verification. Using the `fillcols` function, it organizes these dependencies into a hierarchical list called `tree`, which is also printed for verification. The function proceeds to build a new tree structure, `newtree`, where each variable is appended with its dependencies in parentheses. It iterates through each level (`ply`) of the `tree`, and for each variable (`v`), it constructs a string `newstr` containing the variable name followed by its dependencies, separated by commas. This string is appended to `newl`, which is then added to `newtree`.

The function creates a dictionary `df_dict` to store the hierarchical structure, where each level is a key-value pair with the key formatted as `level{n}` and the value being the list of variables at that level. This dictionary is converted into a DataFrame `df` using `pd.DataFrame.from_dict`. If the `connections` parameter is set to `False`, the function removes the dependency details from the variable names using a regex replace operation. The final DataFrame `df`, which represents the hierarchical structure of the Bayesian network variables, is returned. For instance, if the `bayesianNetwork` has variables `A`, `B` (dependent on `A`), `C` (dependent on `A`), and `D` (dependent on both `B` and `C`), the function would output a DataFrame representing this hierarchy.
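The labelling step can be illustrated without a real network. The sketch below is an assumption-laden approximation: `label_levels` is a hypothetical helper, and the exact label format used by `make_tree` (for example whether a trailing comma is kept) is not confirmed by the text above. It builds the `level{n}` dictionary and optionally strips the parenthesized dependencies with a regex.

```python
import re


def label_levels(levels, deps, connections=True):
    """Build {"level0": [...], "level1": [...]} with labels like "D(B,C)"."""
    table = {}
    for n, level in enumerate(levels):
        # Append each variable's dependencies in parentheses, comma-separated
        labels = [f"{var}({','.join(deps[var])})" for var in level]
        if not connections:
            # Drop the "(...)" suffix so only the variable name remains
            labels = [re.sub(r"\(.*\)$", "", label) for label in labels]
        table[f"level{n}"] = labels
    return table


toy_deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
toy_levels = [["A"], ["B", "C"], ["D"]]

with_connections = label_levels(toy_levels, toy_deps)
names_only = label_levels(toy_levels, toy_deps, connections=False)
print(with_connections)
print(names_only)
```

Because the levels have different lengths, one way to turn such a dictionary into a DataFrame is `pd.DataFrame.from_dict(table, orient="index").T`, which pads the shorter columns with missing values.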

In [None]:
bayestree = make_tree(bayesianNetwork)

## complexity_check

The `complexity_check` function evaluates a Bayesian network to ensure it meets specified complexity constraints and returns whether it passes these checks along with any error messages. The function takes a `bayesianNetwork` object and optional parameters for maximum size in bytes, allowed number of nodes, allowed number of variables, and allowed number of variable values.

First, it initializes `passes` to `True` and an empty list `messages` to collect error messages. The function calculates the size of the Bayesian network using `bayesianNetwork.ByteSize()`. If the size exceeds the `max_size_in_bytes` limit, `passes` is set to `False`, and an error message is added to `messages`.

Next, the function calls `get_var_val_positions` to obtain a dictionary mapping variable names to their value positions and calculates the number of nodes (`num_nodes`). If `num_nodes` exceeds `allowed_number_nodes`, it updates `passes` to `False` and appends an appropriate message to `messages`.

The function then determines the maximum number of variable values (`maxvarval`) from the lengths of the lists in `var_val_positions`. If `maxvarval` exceeds `allowed_number_variable_values`, it sets `passes` to `False` and logs this in `messages`.

Lastly, the function iterates over each conditional probability table in `bayesianNetwork.conditionalProbabilityTables` to check the number of dependencies (`numvars`). If any table has more dependencies than `allowed_number_variables`, `passes` is set to `False`, and a corresponding message is added.

After all checks, the function joins all collected messages into a single string `errors` and returns a tuple `(passes, errors)`, indicating whether the network passed the checks and any error messages generated.
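The four checks can be sketched on plain Python inputs. This is a hedged illustration rather than the `sn_bayes` implementation: the network size and the per-CPT dependency counts are passed in directly instead of being read from a protobuf object, the limit values are hypothetical, the message wording is invented, and joining messages with newlines is an assumption.

```python
def complexity_check_sketch(size_in_bytes, var_val_positions, cpt_parent_counts,
                            max_size_in_bytes, allowed_number_nodes,
                            allowed_number_variables, allowed_number_variable_values):
    """Return (passes, errors) for the four complexity limits."""
    passes = True
    messages = []

    # 1. Total serialized size of the network
    if size_in_bytes > max_size_in_bytes:
        passes = False
        messages.append(f"size {size_in_bytes} exceeds {max_size_in_bytes} bytes")

    # 2. Number of nodes (one entry per variable)
    num_nodes = len(var_val_positions)
    if num_nodes > allowed_number_nodes:
        passes = False
        messages.append(f"{num_nodes} nodes exceed the limit of {allowed_number_nodes}")

    # 3. Largest number of values held by any single variable
    maxvarval = max((len(vals) for vals in var_val_positions.values()), default=0)
    if maxvarval > allowed_number_variable_values:
        passes = False
        messages.append(
            f"{maxvarval} values exceed the limit of {allowed_number_variable_values}"
        )

    # 4. Number of dependencies of each conditional probability table
    for name, numvars in cpt_parent_counts.items():
        if numvars > allowed_number_variables:
            passes = False
            messages.append(
                f"{name} has {numvars} dependencies, limit is {allowed_number_variables}"
            )

    errors = "\n".join(messages)  # separator is an assumption
    return passes, errors


toy_positions = {
    "age": {"elderly": 0, "adult": 1, "young_adult": 2, "teen": 3, "child": 4},
    "angina": {"angina_yes": 0, "angina_no": 1},
}
toy_parent_counts = {"angina": 1}
# Hypothetical limits chosen so that exactly one check fails
hypothetical_limits = dict(
    max_size_in_bytes=10_000,
    allowed_number_nodes=10,
    allowed_number_variables=3,
    allowed_number_variable_values=3,
)

check_passes, check_errors = complexity_check_sketch(
    500, toy_positions, toy_parent_counts, **hypothetical_limits
)
print(check_passes)
print(check_errors)
```

Here only the variable-values check fails, because `age` has five values against a limit of three, so `check_passes` is `False` and `check_errors` holds a single message.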

In [None]:
# Check the network against the default complexity limits
passes, errors = complexity_check(bayesianNetwork)

## get_var_positions

The `get_var_positions` function creates a dictionary that maps each variable in a Bayesian network to its position index and reports any variable name that appears more than once. The function initializes an empty dictionary `var_positions` and an empty set `check_for_repeats` to track the variable names that have been processed.

First, the function iterates over `bayesianNetwork.discreteDistributions` using `enumerate` to get both the index `i` and the distribution `dist`. It assigns the current index `i` to `dist.name` in `var_positions`. If `dist.name` is already in `check_for_repeats`, it prints a message indicating a duplicate instance. Otherwise, it adds `dist.name` to `check_for_repeats`.

Next, it calculates the starting index for conditional probability tables as the current length of `var_positions` (`start = len(var_positions)`). It then iterates over `bayesianNetwork.conditionalProbabilityTables`, again using `enumerate`, and assigns an index starting from `start` to each table's name. If a table's name is already in `check_for_repeats`, it prints a message about the duplicate instance. Otherwise, it adds the table's name to `check_for_repeats`.

Finally, the function returns the populated `var_positions` dictionary, which maps each variable name in the Bayesian network to its index position: discrete distributions come first, followed by conditional probability tables. Duplicate names are only reported, not removed, so a repeated name keeps the last index assigned to it.
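The indexing scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the library source: `get_var_positions_sketch` and `toy_bayes_net` are hypothetical names, `SimpleNamespace` objects stand in for the protobuf messages, and the wording of the duplicate warning is invented.

```python
from types import SimpleNamespace


def get_var_positions_sketch(network):
    """Map each variable name to its index; distributions first, then CPTs."""
    var_positions = {}
    check_for_repeats = set()

    for i, dist in enumerate(network.discreteDistributions):
        var_positions[dist.name] = i
        if dist.name in check_for_repeats:
            print(f"duplicate instance of {dist.name}")
        else:
            check_for_repeats.add(dist.name)

    # CPT indices continue where the discrete distributions stopped
    start = len(var_positions)
    for i, table in enumerate(network.conditionalProbabilityTables):
        var_positions[table.name] = start + i
        if table.name in check_for_repeats:
            print(f"duplicate instance of {table.name}")
        else:
            check_for_repeats.add(table.name)

    return var_positions


# Hypothetical stand-in for a BayesianNetwork message (assumption for illustration)
toy_bayes_net = SimpleNamespace(
    discreteDistributions=[SimpleNamespace(name="age")],
    conditionalProbabilityTables=[SimpleNamespace(name="angina")],
)

toy_var_positions = get_var_positions_sketch(toy_bayes_net)
print(toy_var_positions)
```

Since `start` is taken from the dictionary length, a duplicated distribution name would shift the CPT indices down by one in this sketch, which is one reason the duplicate warning is worth acting on.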

In [None]:
# Map each variable name to its position index
var_positions = get_var_positions(bayesianNetwork)

## get_var_val_positions

In [None]:
# Map each variable name to the positions of its values
var_val_positions = get_var_val_positions(bayesianNetwork)

## get_internal_var_val_positions

In [None]:
internal_var_val_positions = get_internal_var_val_positions(bayesianNetwork)

## get_var_names

## get_var_val_names

## parse_net

## detect_anomalies

## readable

## create_query

## get_template_priors

## predict_proba_adjusted

## batch_query

## query

## explain_why_bad

## explain_why_good

## internal_query

## explain

## make_nmap

## any_of

## dictVarsAndValues

## all_of

## avg

## if_then_else

## addCpt

## bayesInitialize

## non_cpt_descriptions

## get_priors

## get_frequencies

## rr_prob_a_and_not_a_given_b_and_not_b

## ss_prob_a_and_not_a_given_b_and_not_b

## prob_a_and_not_a_given_b_and_not_b

## get_good_vars

## get_rr_vals

## replace_rr

## dependency

## align_ci

## normalize_ci

## get_window

## get_stat_info

## validation

## dependency_direct

In [2]:
bayesianNetwork = BayesianNetwork()

discreteDistribution = bayesianNetwork.discreteDistributions.add()
discreteDistribution.name = "angina"
variable = discreteDistribution.variables.add()
variable.name = "angina_yes"
variable.probability = 0.03
variable = discreteDistribution.variables.add()
variable.name = "angina_no"
variable.probability = 0.97

discreteDistribution = bayesianNetwork.discreteDistributions.add()
discreteDistribution.name = "age"
variable = discreteDistribution.variables.add()
variable.name = "elderly"
variable.probability = 0.05
variable = discreteDistribution.variables.add()
variable.name = "adult"
variable.probability = 0.25
variable = discreteDistribution.variables.add()
variable.name = "young_adult"
variable.probability = 0.3
variable = discreteDistribution.variables.add()
variable.name = "teen"
variable.probability = 0.2
variable = discreteDistribution.variables.add()
variable.name = "child"
variable.probability = 0.25

In [5]:
cpt = {}
outstr = ''
outstr = outstr + addCpt(bayesianNetwork, cpt)

# Elderly people carry a relative risk of 2.3 for angina_yes against the baseline priors
cpt["angina"] = dependency(
    bayesianNetwork,
    cpt,
    [({"age": ["elderly"]}, {"relative_risk": 2.3})],
    {"angina_yes": 0.03, "angina_no": 0.97},
)

start timing...
start timing...
[({'age': ['elderly']}, {'relative_risk': 2.3})] ==> {'angina_yes': 0.03, 'angina_no': 0.97} took 0.01498579999999805 seconds
{'angina_yes': 0.03, 'angina_no': 0.97}  wrapper took 0.015030299999999386 seconds


For best performance, build P as a scipy.sparse.csc_matrix rather than as a numpy.ndarray
For best performance, build G as a scipy.sparse.csc_matrix rather than as a numpy.ndarray
For best performance, build A as a scipy.sparse.csc_matrix rather than as a numpy.ndarray

In [10]:
cpt

{'angina': ([['elderly', 'angina_yes', 0.0649889369371378],
   ['elderly', 'angina_no', 0.9350110630628622],
   ['adult', 'angina_yes', 0.027729624456249904],
   ['adult', 'angina_no', 0.9722703755437501],
   ['young_adult', 'angina_yes', 0.03316272384316883],
   ['young_adult', 'angina_no', 0.9668372761568311],
   ['teen', 'angina_yes', 0.022185290713900237],
   ['teen', 'angina_no', 0.9778147092860998],
   ['child', 'angina_yes', 0.027729624457416144],
   ['child', 'angina_no', 0.9722703755425839]],
  ['age'],
  {'angina_yes': 0.03, 'angina_no': 0.97},
  'Against the baseline risks, the relative risk that {0} will be angina_yes for those in the age category of elderly is 2.3.')}


In [None]:
# Build a one-column table listing each variable with its values, e.g. "age(elderly,adult,...)"
rows_list = []
outname = "varvals.csv"
for var, valdict in var_val_positions.items():
    # valdict maps value names to positions; only the names are needed here
    rows_list.append({"variable": var + "(" + ",".join(valdict) + ")"})

df = pd.DataFrame(rows_list)
# Uncomment to save the table as a spreadsheet for reference
# df.to_csv(outname, index=False)

In [None]:
# Draw the dependency graph from the make_tree output (columns level0, level1, ...)
G = nx.DiGraph()
regex = re.compile(r'^([a-z_]*)\(.*')
for index, row in bayestree.iterrows():
    for col in bayestree.columns:
        # Shorter levels are padded with missing values; skip those cells
        if pd.notna(row[col]):
            node = row[col][:-1] if col == "level0" else row[col]
            G.add_node(node)
# Add an edge node1 -> node2 when node1's name appears among node2's dependencies
for node1 in G.nodes:
    match = regex.match(node1)
    for node2 in G.nodes:
        if (node1 + ',' in node2) or (match is not None and match.group(1) + "," in node2):
            G.add_edge(node1, node2)

plt.figure(3, figsize=(12, 12))
# Alternative layouts: nx.draw, nx.draw_random, nx.draw_spectral, nx.draw_spring
nx.draw_circular(G, with_labels=True)