In [1]:
from IPython.display import Image, Math

### Machine Learning
Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions. We will be considering only classification here.

There are multiple ways to solve this problem:  
1. We could find a function which can directly map an input value to its class label.
2. We can find the probability distributions over the variables and then use this distribution to answer queries about the new data point.

In Probabilistic Graphical Models we try to make predictions using the second method.
The most obvious way to do this would be to compute a Joint Probability Distribution over all the variables, then reduce it to the observed values of our new data point and marginalize out the rest to get the probabilities of the classes.


Let's take the example of a class where we want to predict things about students. We have some previous data on students, and for each student we have 5 features: Difficulty (D), Intelligence (I), SAT (S), Grade (G) and Recommendation Letter (L). For simplicity, we will also assume that each of these features can take only two values. So, Difficulty can take the values easy and hard, Intelligence can take smart and dumb, and so on.


In [2]:
# Let's now generate some random data. 
# Now let's try to compute the Joint Probability Distribution over these variables.

We now have a Joint Distribution $ P(D, I, S, G, L) $. Now let's say we have a new student and we want to predict which recommendation letter he will get, given that we know he is intelligent, the course was easy, his SAT score is good, and he got a good grade. So, basically, we want to compute $ P(L | d^0, i^1, s^1, g^1) $. From the chain rule of probability we know that:
$$ P(D, I, S, G, L) = P(L | D, I, S, G) * P(D, I, S, G) $$
$$ P(D, I, S, G, L) = P(D | I, S, G, L) * P(I | S, G, L) * P(S | G, L) * P(G | L) * P(L) $$
Dividing the first identity by $ P(D, I, S, G) $ and plugging in the evidence gives:
$$ P(L | d^0, i^1, s^1, g^1) = \frac {P(d^0, i^1, s^1, g^1, L)} {P(d^0, i^1, s^1, g^1)} $$

Since we know the joint distribution $ P(D, I, S, G, L) $, we can now easily compute both terms, $ P(d^0, i^1, s^1, g^1, L) $ and $ P(d^0, i^1, s^1, g^1) $, and using these we can compute $ P(L | d^0, i^1, s^1, g^1) $.
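
As a minimal sketch of this computation, assuming randomly generated binary data stored with the columns ordered D, I, S, G, L (the data and the names `samples`, `joint` and `posterior` are illustrative, not the values used in this notebook), we could estimate the joint table by counting and then reduce and normalize it:

```python
import numpy as np

# Hypothetical data: 1000 students, 5 binary features in the order D, I, S, G, L.
rng = np.random.default_rng(0)
samples = rng.integers(low=0, high=2, size=(1000, 5))

# Joint distribution as a 2x2x2x2x2 table of relative frequencies.
joint = np.zeros((2, 2, 2, 2, 2))
np.add.at(joint, tuple(samples.T), 1)  # count every observed (d, i, s, g, l)
joint /= joint.sum()

# Reduce to the evidence d^0, i^1, s^1, g^1: this is P(d^0, i^1, s^1, g^1, L).
numerator = joint[0, 1, 1, 1, :]
# Marginalize L out to get P(d^0, i^1, s^1, g^1).
denominator = numerator.sum()

# P(L | d^0, i^1, s^1, g^1)
posterior = numerator / denominator
print(posterior)
```

Because the data here is random, the resulting probabilities carry no meaning; the point is only the reduce-then-normalize pattern.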

#### TODO: Show the computed values

With the approach above, we had to compute the whole joint distribution to answer queries over the model. The size of that table is exponential in the number of variables involved, so for large models this method becomes infeasible.
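
To get a feel for this growth, here is a small sketch (the helper name `full_joint_size` is ours) that counts the entries of a full joint table when every variable takes the same number of values:

```python
def full_joint_size(num_vars, cardinality=2):
    """Number of entries in a full joint table over `num_vars` discrete variables."""
    return cardinality ** num_vars


for n in (5, 10, 20):
    print(n, full_joint_size(n))
# 5 32
# 10 1024
# 20 1048576
```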

### Probabilistic Graphical Models (PGM)

A Probabilistic Graphical Model is a way of compactly representing a Joint Probability Distribution over random variables by exploiting the independence conditions among the variables. PGMs also provide methods to easily and efficiently compute conditional probabilities over joint distributions.

### How is PGM different than other techniques?

The distinctive thing about PGM is that it provides a very intuitive and natural approach for modelling complex problems along with maintaining control over the computational costs.

There are two main types of models in PGM:
1. Bayesian Models: Bayesian Models are used mainly when we have causal relationships between the variables.
2. Markov Models: Markov Models are used mainly when the relationships between the variables are non-causal.

In the student example that we saw, we mostly have a causal relationship between variables. So, if a student is intelligent it implies that he should get good grades, getting good grades implies that he will get a better recommendation. Let's try to create a graph of dependencies.

<img src="files/student.png">

This network structure implies many independence conditions between the variables. Any variable in the network is independent of its non-descendants given its parents. These are known as the local independencies in Bayesian Networks.
$$ (X \perp NonDesc_X | Pa_X) $$
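
As a rough illustration, assuming the student network has the edges D → G, I → G, I → S and G → L, the local independencies can be listed with a few lines of plain Python. The `parents` dictionary and both helper functions are our own sketch and not part of any library:

```python
# Assumed structure of the student network: node -> list of parents.
parents = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}


def descendants(node):
    """All nodes reachable from `node` by following edges forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child, pas in parents.items():
            if current in pas and child not in found:
                found.add(child)
                stack.append(child)
    return found


def local_independencies():
    """Map each node X to (non-descendants of X excluding its parents, parents of X)."""
    result = {}
    for node in parents:
        # Parents are left out of the left-hand side because we condition on them.
        non_desc = set(parents) - descendants(node) - {node} - set(parents[node])
        result[node] = (sorted(non_desc), sorted(parents[node]))
    return result


for node, (non_desc, pa) in local_independencies().items():
    print(f"({node} _|_ {', '.join(non_desc)} | {', '.join(pa)})")
```

For example, the entry for L says that L is independent of D, I and S given its only parent G.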

Now let's consider global independencies in the network. A connection between three variables in a directed graph can take only one of the following four forms:


<img src="files/connections_directed_graphs.png">

In the case of causal, evidential and common cause connections, A and C are dependent if B is not observed, but if B is observed they become independent. In the case of common evidence, on the other hand, A and C are independent if B is not observed but become dependent if B is observed.

So in the case of our student example we can now read many independence and dependence assertions off the network, like $ L \perp D | G $, $ L \perp I | G $, $ D \perp I $, $ D \not\perp I | G $ etc.
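
We can check the last two assertions numerically. The sketch below builds a joint over D, I and G from made-up CPD values (all numbers are assumptions chosen only for illustration) and tests whether the distribution over D and I factorizes, first marginally and then after observing G:

```python
import numpy as np

# Made-up CPDs, for illustration only.
p_d = np.array([0.6, 0.4])                      # P(D)
p_i = np.array([0.7, 0.3])                      # P(I)
p_g = np.array([[[0.3, 0.7], [0.1, 0.9]],
                [[0.8, 0.2], [0.4, 0.6]]])      # P(G | D, I), indexed [d, i, g]

# Joint P(D, I, G) = P(D) * P(I) * P(G | D, I), indexed [d, i, g].
joint = p_d[:, None, None] * p_i[None, :, None] * p_g

# Marginal check: does P(D, I) equal P(D) * P(I)?
p_di = joint.sum(axis=2)
marginally_independent = np.allclose(p_di, np.outer(p_d, p_i))

# Conditional check: does P(D, I | g^1) equal P(D | g^1) * P(I | g^1)?
p_di_given_g1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_d_given_g1 = p_di_given_g1.sum(axis=1)
p_i_given_g1 = p_di_given_g1.sum(axis=0)
conditionally_independent = np.allclose(
    p_di_given_g1, np.outer(p_d_given_g1, p_i_given_g1)
)

print(marginally_independent)      # True:  D and I are independent
print(conditionally_independent)   # False: observing G couples D and I
```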

#### Active Trail
Two nodes in a Bayesian Network are said to be connected by an active trail if a change in one of the variables can affect the other. A trail is active if the middle node of every causal, evidential or common cause connection along it is unobserved, and the middle node of every common evidence connection (or one of that node's descendants) is observed. The common evidence connection is also commonly known as a V-structure.
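
One way to test for an active trail without enumerating trails is the moralized ancestral graph criterion, which is a different but equivalent formulation of the rules above. The function below is our own sketch (the name `has_active_trail` and the `parents` dictionary format are assumptions, not the pgmpy API), and it assumes that the two endpoints are not themselves observed:

```python
from itertools import combinations


def has_active_trail(parents, start, end, observed=()):
    """Return True if `start` and `end` are connected by an active trail given `observed`."""
    observed = set(observed)

    # 1. Keep only start, end, the observed nodes and all of their ancestors.
    keep, stack = set(), [start, end, *observed]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])

    # 2. Moralize: connect every node to its parents and "marry" co-parents.
    neighbors = {node: set() for node in keep}
    for node in keep:
        for pa in parents[node]:
            neighbors[node].add(pa)
            neighbors[pa].add(node)
        for pa, pb in combinations(parents[node], 2):
            neighbors[pa].add(pb)
            neighbors[pb].add(pa)

    # 3. Remove the observed nodes and test plain undirected reachability.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == end:
            return True
        if node in seen or node in observed:
            continue
        seen.add(node)
        stack.extend(neighbors[node])
    return False


# Assumed structure of the student network: node -> list of parents.
student = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}
print(has_active_trail(student, "D", "I"))                  # False
print(has_active_trail(student, "D", "I", observed=["G"]))  # True
print(has_active_trail(student, "D", "L", observed=["G"]))  # False
```

Observing L instead of G also activates the trail between D and I, because L is a descendant of the V-structure's middle node.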

Now coming back to our original joint distribution:
$$ P(D, I, G, S, L) = P(D | I, G, S, L) * P(I | G, S, L) * P(G | S, L) * P(S | L) * P(L) $$

So using these independence properties in a Bayesian Network we can reduce this to:
$$ P(D, I, G, S, L) = P(D) * P(I) * P(G | D, I) * P(S | I) * P(L | G) $$

We can also notice that every term in this product is the probability of one variable given its parents in the network. These terms are known as the Conditional Probability Distributions (CPDs) of the variables. So, to compute the Joint Distribution, we can simply compute the product of the CPDs of all the variables.
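
The following sketch multiplies five CPDs into a joint table with `numpy.einsum`. All CPD values are made up for illustration, and the parent sets (G depends on D and I, S on I, L on G) are assumed from the factorization of the student network:

```python
import numpy as np

# Made-up CPDs, for illustration only.
p_d = np.array([0.6, 0.4])                      # P(D)
p_i = np.array([0.7, 0.3])                      # P(I)
p_g = np.array([[[0.3, 0.7], [0.1, 0.9]],
                [[0.8, 0.2], [0.4, 0.6]]])      # P(G | D, I), indexed [d, i, g]
p_s = np.array([[0.95, 0.05], [0.2, 0.8]])      # P(S | I), indexed [i, s]
p_l = np.array([[0.9, 0.1], [0.4, 0.6]])        # P(L | G), indexed [g, l]

# Product of all CPDs, giving the joint table indexed [d, i, g, s, l].
joint = np.einsum("d,i,dig,is,gl->digsl", p_d, p_i, p_g, p_s, p_l)
print(joint.sum())  # sums to 1 up to floating point error

# Independent parameters: each binary CPD column has one free value.
factored_params = sum(cpd.size // 2 for cpd in (p_d, p_i, p_g, p_s, p_l))
full_params = joint.size - 1
print(factored_params, full_params)  # 10 31

# P(L | d^0, i^1, g^1, s^1) from the joint; it matches P(L | g^1) directly.
posterior = joint[0, 1, 1, 1, :] / joint[0, 1, 1, 1, :].sum()
print(np.allclose(posterior, p_l[1]))  # True
```

The factored form needs 10 independent numbers instead of 31, and the last check shows that once G is known, the other variables no longer change our belief about L.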

### Another example using pgmpy 

Let's see an example of predicting the price of a house. For simplicity we will assume that the price of the house (P) depends only on Area (A), Location (L), Furnishing (F), Crime Rate (C) and Distance from the airport (D). We will also assume that all of these are discrete variables.

Our raw data would look something like this:

In [3]:
import numpy as np
raw_data = np.random.randint(low=0, high=2, size=(1000, 6))
print(raw_data)

[[1 0 1 0 1 0]
 [0 1 1 0 1 0]
 [1 1 1 1 0 0]
 ..., 
 [1 0 0 1 0 1]
 [1 0 0 0 0 1]
 [0 1 0 1 0 1]]


Probabilistic Graphical Models keep the representation of the Joint Distribution simple by exploiting the independencies among the variables, and the remaining dependencies are represented using a graph. Now let's create a model for the interaction of our variables using intuition:

<img src="files/housing_price_small.png">

<img src="files/housing_price.png">

To each node/variable in the network we assign a conditional probability distribution. The conditional probability distribution represents the probability of the variable given the values of its parents. So now, using our raw data, let's assign CPDs to each of these nodes.

In [4]:
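from pgmpy.models import BayesianModel
import pandas as pd

# Name the six columns of the raw data.
data = pd.DataFrame(raw_data, columns=['A', 'C', 'D', 'L', 'F', 'P'])
# Use the first 75% of the rows for training.
data_train = data[: int(data.shape[0] * 0.75)]
# Each tuple is a directed edge (parent, child).
model = BayesianModel([('F', 'P'), ('A', 'P'), ('L', 'P'), ('C', 'L'), ('D', 'L')])

# A -> P <- L <- D passes through an unobserved V-structure at P.
model.is_active_trail('A', 'D')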

False

In [5]:
model.is_active_trail('D', 'P')

True

In [6]:
model.fit(data_train)

### What does fit do?
The fit method adds a Conditional Probability Distribution (CPD) to each of the nodes in our model.
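
A common way to estimate such CPDs is maximum likelihood, i.e. taking relative frequencies from the data. The sketch below shows that idea with plain pandas on stand-in random data. It is our own illustration and not the pgmpy implementation, and the helper name `estimate_cpd` is hypothetical:

```python
import numpy as np
import pandas as pd

# Stand-in for the housing data: random binary values, for illustration only.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(1000, 6)),
                    columns=['A', 'C', 'D', 'L', 'F', 'P'])


def estimate_cpd(df, child, parents):
    """Estimate P(child | parents) as relative frequencies of the observed rows."""
    if not parents:
        return df[child].value_counts(normalize=True).sort_index()
    # Count each (parent values, child value) combination ...
    counts = df.groupby(parents + [child]).size().unstack(child, fill_value=0)
    # ... and normalize every parent configuration so that its row sums to 1.
    return counts.div(counts.sum(axis=1), axis=0)


cpd_f = estimate_cpd(data, 'F', [])          # P(F)
cpd_l = estimate_cpd(data, 'L', ['C', 'D'])  # P(L | C, D)
print(cpd_l)
```

Each row of `cpd_l` corresponds to one configuration of the parents C and D, and each column to one value of L.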

<img src='files/housing_price_with_CPD.png'>

In [7]:
model.get_cpds()

[<TabularCPD representing P(L:2 | D:2, C:2) at 0x7f0ca43cc940>,
 <TabularCPD representing P(F:2) at 0x7f0c839df5c0>,
 <TabularCPD representing P(C:2) at 0x7f0c839df470>,
 <TabularCPD representing P(P:2 | L:2, F:2, A:2) at 0x7f0ca43cc828>,
 <TabularCPD representing P(D:2) at 0x7f0c839df898>,
 <TabularCPD representing P(A:2) at 0x7f0c839df588>]

In [8]:
model.get_cpds('P')

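| A | A_0 | A_0 | A_0 | A_0 | A_1 | A_1 | A_1 | A_1 |
|---|---|---|---|---|---|---|---|---|
| F | F_0 | F_0 | F_1 | F_1 | F_0 | F_0 | F_1 | F_1 |
| L | L_0 | L_1 | L_0 | L_1 | L_0 | L_1 | L_0 | L_1 |
| P_0 | 0.3878 | 0.3832 | 0.4944 | 0.5823 | 0.5000 | 0.5253 | 0.4773 | 0.4519 |
| P_1 | 0.6122 | 0.6168 | 0.5056 | 0.4177 | 0.5000 | 0.4747 | 0.5227 | 0.5481 |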


But the data that we have for training might be biased, so pgmpy also gives us the option to assign our own Conditional Probability Distributions. Let's say the probability of getting an unfurnished house is equal to that of getting a furnished house. Let's adjust the values accordingly.

In [9]:
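from pgmpy.factors import TabularCPD
# F has 2 states, each with probability 0.5.
f_cpd = TabularCPD('F', 2, [[0.5], [0.5]])

# Replace the CPD learned from the data with our own.
model.remove_cpds('F')
model.add_cpds(f_cpd)

# Verify that the CPDs are consistent with the model structure.
model.check_model()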

True

### Inference

Now let's try to do some reasoning on our model to verify whether our intuition for the model was correct.

In [10]:
from pgmpy.inference import VariableElimination
inference_model = VariableElimination(model)
# Returns a probability distribution over variables A and P.
inference_model.query(variables=['A', 'P'])
print(inference_model.query(variables=['P'], evidence={'L': 0, 'F': 0, 'A': 0}))

{'P': <Factor representing phi(P:2) at 0x7f0c839b4a58>}


```
P          phi(P)  
------------------
P_0        0.96
P_1        0.04
```

Predicting new values from this model is basically the same as what we have been doing here: we ask for the probability of some variable given the observed values of other variables. We can also account for missing values by simply leaving them out of the evidence.
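
To make this concrete, here is a sketch that predicts P by picking its most probable state, using the values of the CPD table for P printed earlier (those values came from random data, so they are only illustrative). The helper names are ours. For the missing-value case we assume F is unobserved and has the uniform distribution we assigned to it:

```python
import numpy as np

# P(P = 0 | A, F, L), copied from the CPD table above. Columns are ordered
# (A_0, F_0, L_0), (A_0, F_0, L_1), ..., (A_1, F_1, L_1), with L changing fastest.
p0 = np.array([0.3878, 0.3832, 0.4944, 0.5823, 0.5000, 0.5253, 0.4773, 0.4519])
cpd_p = np.stack([p0, 1.0 - p0]).reshape(2, 2, 2, 2)  # indexed [p, a, f, l]


def predict_price(a, f, l):
    """Most probable state of P when A, F and L are all observed."""
    return int(np.argmax(cpd_p[:, a, f, l]))


def predict_price_missing_f(a, l, p_f=(0.5, 0.5)):
    """Most probable state of P when F is missing: sum F out, weighted by P(F)."""
    posterior = cpd_p[:, a, :, l] @ np.asarray(p_f)
    return int(np.argmax(posterior))


print(predict_price(0, 0, 0))         # 1
print(predict_price(0, 1, 1))         # 0
print(predict_price_missing_f(0, 1))  # 1
```

Summing F out with its prior is valid here because F has no parents and no active trail to A or L while P is unobserved.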

In [11]:
# Hold out the last 25% of the rows and drop the target column 'P' in one step,
# so that we never modify a slice of `data` in place.
test_data = data[int(0.75 * data.shape[0]):].drop('P', axis=1)
model.predict(test_data)

| | P |
|---|---|
| 750 | 0 |
| 751 | 1 |
| 752 | 1 |
| 753 | 0 |
| 754 | 1 |
| 755 | 1 |
| 756 | 1 |
| 757 | 1 |
| 758 | 1 |
| 759 | 1 |
