# Module 1 - Exam - Part 2 - Practical

<img style="float: right; margin: 0px 0px 15px 15px;" src="https://upload.wikimedia.org/wikipedia/commons/1/18/Bayes%27_Theorem_MMB_01.jpg" width="400px" height="300px" />

> In the second part of the exam, you will set up and program some Bayesian networks from a written description. In real life, such descriptions would be obtained from domain experts.
>
> Then, you will query your network using functions that you will define. For this exam, you are not allowed to use the inference algorithms provided by `pgmpy`. Instead, you will write simple inference functions of your own to query the network.
>
> Good luck!

> **References:**
> 
> - Probabilistic Graphical Models Specialization, offered through Coursera. Prof. Daphne Koller.
>   - Simple BN Knowledge Engineering Assignment.


<p style="text-align:right;"> Image retrieved from: https://upload.wikimedia.org/wikipedia/commons/1/18/Bayes%27_Theorem_MMB_01.jpg.</p>

___

# 1. Engineering network for credit-worthiness (50 points).

Your friend at the bank, hearing of your newfound expertise in probabilistic graphical models, asks you to help him develop a predictor for whether a person will make timely payments on his/her debt obligations, like credit card bills and loan payments. In short, your friend wants you to develop a predictor for **credit-worthiness**. He tells you that the bank is able to observe:
- the customer’s income (High, Medium, Low),
- the amount of assets the person has (High, Medium, Low),
- the person’s ratio of debts to income (Low, High),
- the person’s payment history (Excellent, Acceptable, Unacceptable),
- as well as the person’s age (Between16and21, Between22and64, Over65).

He also thinks that the credit-worthiness of a person is ultimately dependent on 
- how reliable a person is (Reliable, Unreliable),
- as well as the person’s future income (Promising, Not_promising).

He hopes that, given the eight variables above, you can help him encode into the network the following observations he has made from his experience in evaluating people’s credit-worthiness:
1. The better a person’s payment history, the more likely the person is to be reliable.
2. The older a person is, the more likely the person is to be reliable.
3. Older people are more likely to have an excellent payment history.
4. People who have a high ratio of debts to income are likely to be in financial hardship and hence less likely to have a good payment history.
5. The higher a person’s income, the more likely it is for the person to have many assets.
6. The more assets a person has and the higher the person’s income, the more likely the person is to have a promising future income.
7. All other things being equal, reliable people are more likely to be credit-worthy than unreliable people. Likewise, people who have promising future incomes, or who have low ratios of debts to income, are more likely to be credit-worthy than people who do not.

1. (20 points) Construct a network using `pgmpy`, adding appropriate edges and defining the CPDs, so that your network captures the behavior that your friend expects. Your network will be evaluated solely on whether it produces marginals that are consistent with the desired behavior, not on the actual values of the CPDs. As an example, here is the condition your network should satisfy to be considered consistent with observation 1: if $R$ denotes the reliability variable and $H$ denotes the payment history variable, then your network should satisfy:

   $$P(R=\text{Reliable} \mid H=\text{Excellent}) > P(R=\text{Reliable} \mid H=\text{Acceptable}) > P(R=\text{Reliable} \mid H=\text{Unacceptable}).$$

In [42]:
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
import pgmpy.factors.discrete as dis

In [43]:
# Defining the model structure. We can define the network by just passing a list of edges.
model = BayesianModel([('Age', 'PaymentHistory'),
                       ('Age', 'Reliability'),
                       ('LowDebtRatio', 'PaymentHistory'),
                       ('PaymentHistory', 'Reliability'),
                       ('CurrentIncome','FutureIncome'),
                       ('CurrentIncome','Assets'),
                       ('Assets','FutureIncome'),
                       ('Reliability','CreditWorthiness'),
                       ('LowDebtRatio','CreditWorthiness'),
                       ('FutureIncome','CreditWorthiness')]) 
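
A Bayesian network must be a directed acyclic graph, and the parents of each node determine which variables its CPD conditions on. The sketch below is a quick standard-library check of both points, independent of `pgmpy`. `EDGES` copies the edge list above; `parents_of` and `topological_order` are hypothetical helper names introduced here for illustration.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edge list copied from the model definition above: (parent, child).
EDGES = [
    ('Age', 'PaymentHistory'),
    ('Age', 'Reliability'),
    ('LowDebtRatio', 'PaymentHistory'),
    ('PaymentHistory', 'Reliability'),
    ('CurrentIncome', 'FutureIncome'),
    ('CurrentIncome', 'Assets'),
    ('Assets', 'FutureIncome'),
    ('Reliability', 'CreditWorthiness'),
    ('LowDebtRatio', 'CreditWorthiness'),
    ('FutureIncome', 'CreditWorthiness'),
]


def parents_of(edges):
    """Map every node to the sorted list of its parents (hypothetical helper)."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        parents[parent]  # touch the key so that root nodes appear too
    return {node: sorted(pars) for node, pars in parents.items()}


def topological_order(edges):
    """Return the nodes with parents before children.

    graphlib raises CycleError if the edges contain a directed cycle.
    """
    return list(TopologicalSorter(parents_of(edges)).static_order())


for node, pars in parents_of(EDGES).items():
    print(node, '<-', pars)
print(topological_order(EDGES))
```

The parent sets printed here should match the `evidence` lists of the corresponding CPDs, up to ordering.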

In [44]:
cpd_Age = TabularCPD(variable='Age', variable_card=3, values=[[0.1], [0.7], [0.2]])

In [45]:
cpd_LowDebtRatio = TabularCPD(variable='LowDebtRatio', variable_card=2, values=[[0.4], [0.6]])

In [46]:
cpd_CurrentIncome = TabularCPD(variable='CurrentIncome', variable_card=3, values=[[0.35], [0.55], [0.1]])

In [47]:
cpd_PaymentHistory = TabularCPD(variable='PaymentHistory', variable_card=3, 
                    values=[[0.99, 0.1, 0.59,  0.05, 0.6, 0.05],
                            [0.009, 0.7, 0.35, 0.35, 0.3, 0.1],
                            [0.001, 0.2,  0.06, 0.6, 0.1, 0.85]],
                   evidence=['Age', 'LowDebtRatio'],
                   evidence_card=[3, 2])

In [48]:
cpd_Reliability = TabularCPD(variable='Reliability', variable_card=2, 
                    values=[[0.9, 0.6, 0.4, 0.6, 0.4, 0.1, 0.4, 0.1, 0.001],
                            [0.1, 0.4, 0.6, 0.4, 0.6, 0.9, 0.6, 0.9, 0.999]],
                   evidence=['Age', 'PaymentHistory'],
                   evidence_card=[3, 3])

In [49]:
cpd_FutureIncome = TabularCPD(variable='FutureIncome', variable_card=2, 
                    values=[[0.95, 0.8, 0.4, 0.4, 0.6, 0.2, 0.2, 0.05, 0.001],
                            [0.05, 0.2, 0.6, 0.6, 0.4, 0.8, 0.8, 0.95, 0.999]],
                   evidence=['CurrentIncome', 'Assets'],
                   evidence_card=[3, 3])

In [50]:
cpd_Assets = TabularCPD(variable='Assets', variable_card=3, 
                    values=[[0.7, 0.3, 0.01],
                            [0.25, 0.55, 0.7],
                            [0.05, 0.15, 0.29]],
                   evidence=['CurrentIncome'],
                   evidence_card=[3])

In [51]:
cpd_CreditWorthiness = TabularCPD(variable='CreditWorthiness', variable_card=2, 
                    values=[[0.999, 0.9, 0.85, 0.35, 0.4, 0.3, 0.2, 0.001],
                            [0.001, 0.1, 0.15, 0.65, 0.6, 0.7, 0.8, 0.999]],
                   evidence=['LowDebtRatio', 'Reliability', 'FutureIncome'],
                   evidence_card=[2, 2, 2])
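
Every column of a CPD table is a distribution over the child variable for one configuration of its parents, so each column has to sum to 1. A minimal standard-library sketch of that check follows, using the `Assets` table above as input. `columns_sum_to_one` is a hypothetical helper, not part of `pgmpy`.

```python
import math


def columns_sum_to_one(values, tol=1e-9):
    """Return True if every column of a row-major CPD table sums to 1.

    `values` has one row per state of the child variable and one
    column per configuration of the parents, as in TabularCPD.
    """
    n_cols = len(values[0])
    if any(len(row) != n_cols for row in values):
        return False  # ragged table: rows disagree on the number of columns
    return all(
        math.isclose(sum(row[j] for row in values), 1.0, abs_tol=tol)
        for j in range(n_cols)
    )


# Values copied from cpd_Assets above (columns: CurrentIncome 0, 1, 2).
assets_values = [[0.7, 0.3, 0.01],
                 [0.25, 0.55, 0.7],
                 [0.05, 0.15, 0.29]]

print(columns_sum_to_one(assets_values))
```

In `pgmpy` itself, attaching the CPDs with `model.add_cpds(...)` and then calling `model.check_model()` performs a comparable validation on the whole network.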

2. (20 points) To define the CPDs above you must have used `DiscreteFactor` or `TabularCPD` objects. Either way, `TabularCPD` inherits from the `DiscreteFactor` class. As we have seen in class, `pgmpy` already provides the factor product, factor marginalization, and evidence observation (reduction) operations.

   Based on these operations you should define two functions:
   - **compute_joint_distribution (10 points):** This function should return a factor representing the joint distribution given a set of factors that define a Bayesian network. You may assume that you will only be given factors defining valid CPDs, so no input validation is required.
   - **compute_marginal (10 points):** This function should return the marginals over input variables (the input variables are those that remain in the marginal), given a set of factors that define a Bayesian network, and, optionally, evidence.
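
Both functions follow from the chain rule and the definition of conditional probability. As a sketch in notation introduced here (not taken from the assignment), let $Y$ be the query variables, $E = e$ the observed evidence, and $W$ all remaining variables:

$$
\begin{aligned}
P(X_1, \dots, X_n) &= \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i}), \\
P(Y, E = e) &= \sum_{W} P(Y, W, E = e), \\
P(Y \mid E = e) &= \frac{P(Y, E = e)}{\sum_{Y} P(Y, E = e)}.
\end{aligned}
$$

The three lines correspond, in order, to the factor product, marginalization combined with evidence reduction, and normalization.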

In [52]:
def compute_joint_distribution(cpds):
    """
    This function takes as an input an iterable of CPDs (factors)
    and returns the joint distribution defined by them, according
    to the chain rule for a Bayesian network:
                      n
    P(X1, ..., Xn) =  𝜫  P(Xi | Pa Xi).
                     i=1
    
    :param list[DiscreteFactor] cpds: list of CPDs, such that cpds[i-1] = P(Xi | Pa Xi).
    :return: DiscreteFactor corresponding to the joint distribution.
    """
    factors = [cpd.to_factor() for cpd in cpds]
    joint = factors[0].identity_factor()
    for f in factors:
        joint = joint * f
    return joint

In [81]:
def compute_marginal(cpds, variables, evidence=None):
    """
    This function takes as an input an iterable of CPDs (factors),
    an iterable of variables that will remain in the marginal, and
    optionally an iterable of tuples defining the evidence.
    
    It returns the marginal conditional distribution P(variables | evidence).
    
    :param list[DiscreteFactor] cpds: list of CPDs, such that cpds[i-1] = P(Xi | Pa Xi).
    :param list[str] variables: list of variables to keep in the marginal.
    :param list[tuples] evidence: optional list of evidence in the form of tuples ('name_of_variable', value).
    :return: DiscreteFactor corresponding to the marginal conditional distribution.
    """
    # Treat missing evidence as an empty list so that the argument is truly optional
    evidence = list(evidence) if evidence else []
    # Step 0: compute the joint distribution
    joint = compute_joint_distribution(cpds)
    # Step 1: get the list of all variables involved
    todasVariables = joint.scope()
    # Step 2: determine the variables to marginalize out (neither query nor evidence)
    variablesMarginalizar = (set(todasVariables) - set(variables)) - set([e[0] for e in evidence])
    # Step 3: marginalize
    marginal = joint.marginalize(variables = variablesMarginalizar, inplace = False)
    # Step 4: reduce according to the evidence
    marginal_reducida = marginal.reduce(values = evidence, inplace = False)
    # Step 5: normalize
    marginal_reducida.values = marginal_reducida.values / marginal_reducida.values.sum()
    return marginal_reducida
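
To cross-check an implementation like the one above, it can help to run the same steps (joint, marginalize, reduce, normalize) on a network small enough to verify on paper. The sketch below does this in plain Python for a hypothetical two-node network `Rain -> WetGrass`. The variable names, the probabilities, and the `brute_force_marginal` helper are all made up for illustration and do not use `pgmpy`.

```python
from itertools import product

# Hypothetical two-node network Rain -> WetGrass, states encoded as 0/1.
CARD = {'Rain': 2, 'WetGrass': 2}

# Illustrative numbers only.
P_RAIN = {0: 0.8, 1: 0.2}
P_WET_GIVEN_RAIN = {(0, 0): 0.9, (1, 0): 0.1,   # key is (wet, rain)
                    (0, 1): 0.2, (1, 1): 0.8}

# Each CPD is a function mapping a full assignment dict to P(child | parents).
CPDS = [
    lambda a: P_RAIN[a['Rain']],
    lambda a: P_WET_GIVEN_RAIN[(a['WetGrass'], a['Rain'])],
]


def brute_force_marginal(cpds, card, variables, evidence=None):
    """Compute P(variables | evidence) by enumerating the full joint."""
    evidence = dict(evidence or [])
    names = list(card)
    totals = {}
    for states in product(*(range(card[n]) for n in names)):
        assignment = dict(zip(names, states))
        # Reduce: skip assignments that contradict the evidence.
        if any(assignment[v] != s for v, s in evidence.items()):
            continue
        # Chain rule: the joint probability is the product of all CPDs.
        p = 1.0
        for cpd in cpds:
            p *= cpd(assignment)
        # Marginalize: accumulate over everything but the query variables.
        key = tuple(assignment[v] for v in variables)
        totals[key] = totals.get(key, 0.0) + p
    # Normalize so that the result sums to 1.
    z = sum(totals.values())
    return {k: v / z for k, v in totals.items()}


posterior = brute_force_marginal(CPDS, CARD, ['Rain'], [('WetGrass', 1)])
print(posterior)
```

By hand, P(Rain=1 | WetGrass=1) = 0.16 / (0.08 + 0.16) = 2/3, which is what the enumeration should return. Enumerating the joint is exponential in the number of variables, so this approach only suits small networks.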

In [60]:
# You may want to check your functions
print(cpd_PaymentHistory)

+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| Age               | Age(0)          | Age(0)          | Age(1)          | Age(1)          | Age(2)          | Age(2)          |
+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| LowDebtRatio      | LowDebtRatio(0) | LowDebtRatio(1) | LowDebtRatio(0) | LowDebtRatio(1) | LowDebtRatio(0) | LowDebtRatio(1) |
+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| PaymentHistory(0) | 0.99            | 0.1             | 0.59            | 0.05            | 0.6             | 0.05            |
+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
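| PaymentHistory(1) | 0.009           | 0.7             | 0.35            | 0.35            | 0.3             | 0.1             |
+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| PaymentHistory(2) | 0.001           | 0.2             | 0.06            | 0.6             | 0.1             | 0.85            |
+-------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+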

In [82]:
print(compute_marginal([cpd_Age,cpd_LowDebtRatio,cpd_CurrentIncome,cpd_PaymentHistory,cpd_Reliability,cpd_FutureIncome,cpd_Assets,cpd_CreditWorthiness],['Age'],[('CreditWorthiness',1)]))

+--------+------------+
| Age    |   phi(Age) |
+========+============+
| Age(0) |     0.0820 |
+--------+------------+
| Age(1) |     0.6984 |
+--------+------------+
| Age(2) |     0.2196 |
+--------+------------+




3. (10 points) With the above functions you have implemented a rudimentary inference engine for Bayesian networks. You can use your implementation to experiment with the credit-worthiness network that you constructed in item 1.

    Please perform the necessary queries to check that each of your friend's observations is correctly encoded by your network:
    1. The better a person’s payment history, the more likely the person is to be reliable.
    2. The older a person is, the more likely the person is to be reliable.
    3. Older people are more likely to have an excellent payment history.
    4. People who have a high ratio of debts to income are likely to be in financial hardship and hence less likely to have a good payment history.
    5. The higher a person’s income, the more likely it is for the person to have many assets.
    6. The more assets a person has and the higher the person’s income, the more likely the person is to have a promising future income.
    7. All other things being equal, reliable people are more likely to be credit-worthy than unreliable people. Likewise, people who have promising future incomes, or who have low ratios of debts to income, are more likely to be credit-worthy than people who do not.
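
Each observation translates into an ordering of conditional probabilities, so the check can be automated once the relevant probabilities have been read off the marginals. A minimal sketch, assuming the probabilities are collected into a list ordered from the most favorable evidence value to the least favorable. `is_strictly_decreasing` is a hypothetical helper, and the numbers in the usage lines are placeholders, not outputs of the network above.

```python
def is_strictly_decreasing(probabilities):
    """Return True if each probability is strictly greater than the next."""
    return all(a > b for a, b in zip(probabilities, probabilities[1:]))


# Placeholder values standing in for
# P(R=Reliable | H=Excellent), P(R=Reliable | H=Acceptable),
# P(R=Reliable | H=Unacceptable).
example = [0.7, 0.5, 0.2]
print(is_strictly_decreasing(example))
```

For observation 1, the list would hold the `Reliable` entry of `compute_marginal(cpds, ['Reliability'], [('PaymentHistory', h)])` for each state `h`, in whatever state order the CPDs use.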

4. Optional. This item is worth extra points. If your grade on one of your homework assignments is below 100, you can recover the missing points by completing this item.
   
   Set up the network in SAMIAM and check that the marginals computed by SAMIAM are the same as those computed by your inference engine.