Machine learning often involves learning probability distributions over a large number of random variables. For example, in a classification problem the input is mapped to a probability distribution over all the classes. In some cases, using a single function to describe everything might not be feasible at all. In such cases, we tend to use structured probabilistic models. What if we could efficiently capture the whole probability distribution using fewer parameters by assuming or inferring some properties of the underlying data? The underlying distribution that we are trying to capture could be a lot of things, from images of galaxies to audio signals generated by whales. Is there a general principle that could be used to capture a distribution over such a huge space?

### Types of Graphical Models: Directed, Undirected

* Directed graphical models are used when you have enough information about the underlying data distribution. If you roughly know which variables affect which other variables, you can build a directed graph where each node is a random variable and each edge represents a conditional dependence of one variable on another.

$$ P(a,b,c,d,e) = P(a) P(b|a) P(c| a,b) P(d|b) P(e|c) $$

* The conditional dependencies are acquired from the data, and the graph is built from them.

* Undirected models do not assume an underlying conditional dependence among their nodes. Instead, they use functions that score how compatible the values of connected nodes are. These functions need not be probability distributions, but since the whole product is normalized in the end, they at least have to be normalizable.

$$ P(X) = \frac{1}{Z} \prod_i F^i(C^i) $$

* The whole probability is assumed to factor over the individual cliques of the graph, where each clique $C^i$ is scored by a function $F^i$ and $Z$ is the normalization constant that makes the product sum to one.
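The clique factorization can be sketched on a toy example. The three binary variables, the cliques {a, b} and {b, c}, and the scores returned by `f_ab` and `f_bc` are all made up for illustration; the point is only that the clique functions are not probabilities themselves and become a distribution after dividing by their total.

```python
from itertools import product

# Hypothetical clique functions: any non-negative scores will do.
def f_ab(a, b):
    return 3.0 if a == b else 1.0  # rewards a and b agreeing

def f_bc(b, c):
    return 2.0 if b == c else 1.0  # rewards b and c agreeing

def unnormalized(a, b, c):
    # Product over the cliques of the graph a - b - c.
    return f_ab(a, b) * f_bc(b, c)

# Enumerate every joint assignment of the three binary variables.
states = list(product([0, 1], repeat=3))

# Normalization constant: the sum of the unnormalized scores.
Z = sum(unnormalized(*s) for s in states)

# Dividing by Z turns the scores into a probability distribution.
P = {s: unnormalized(*s) / Z for s in states}
```

Enumerating all states only works because the example is tiny; with many variables the number of states grows exponentially.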

### Graphical Models Application Areas:

* Density estimation (What is the true underlying distribution?)
* Denoising data (Finding a mapping from noisy X to real X)
* Missing value imputation (if you estimate the density of the true distribution, that could be used to fill up missing values)
* Generative models for images
* Approximating large graphs with fewer parameters
* Generative Art!
* Generative Music!
* Brainstorming Models!

### Challenges in using standard PGMs:

* Trade-off between memory and statistical efficiency: If we store a probability for every joint assignment of the random variables, the table consumes too much memory. On the other hand, if you approximate too much, you lose statistical efficiency. An example would be storing counts for all possible n-grams vs. using smoothing approaches. Coming up with a good trade-off is tricky.

* Runtime: Using the full model can be computationally expensive. There are two kinds of runtime cost: one during inference and the other during sampling. Standard PGMs are not particularly good at either.
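The n-gram example from the memory trade-off can be made concrete with a small sketch of add-alpha (Laplace) smoothing for bigrams. The function name `bigram_prob`, the parameter `alpha`, and the toy sentence are assumptions chosen for illustration; this is one common smoothing scheme, not necessarily the one the text has in mind.

```python
from collections import Counter

def bigram_prob(tokens, prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev) from a token list."""
    vocab = set(tokens)
    # Only bigrams that actually occur are stored, not all |vocab|^2 pairs.
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each token appears as the first element of a bigram.
    prev_counts = Counter(tokens[:-1])
    return (bigrams[(prev, word)] + alpha) / (prev_counts[prev] + alpha * len(vocab))

tokens = "the cat sat on the mat".split()

seen = bigram_prob(tokens, "the", "cat")    # bigram observed in the data
unseen = bigram_prob(tokens, "cat", "mat")  # bigram never observed
```

The unseen bigram still gets a non-zero probability, which is the price paid for not storing (or not having data for) every possible pair.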

### Using Graphs to describe models:

#### Directed Models (also called Belief Networks, Bayesian Networks):

* Alice (t0) -> Bob (t1) -> Carol (t2)

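* The graph encodes the dependencies: each variable depends directly only on its parent.

* The overall probability distribution factorizes as P(t0, t1, t2) = P(t0) P(t1|t0) P(t2|t1)

* Parameter reduction (if each RV takes 100 values): without the graph, the full joint table needs 100x100x100 - 1 = 999,999 parameters; with the graph, the three factors need 99 + 9900 + 9900 = 19,899 (99 for P(t0) and 100x99 for each conditional table)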
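The parameter counts above, and the chain factorization itself, can be checked with a short sketch. The binary probability tables at the bottom are invented values used only to show that the product of the three factors is a valid joint distribution.

```python
from itertools import product

K = 100  # number of values each variable can take, as in the text

# Free parameters of an unstructured joint table over three variables.
full_table = K**3 - 1

# Free parameters of the chain t0 -> t1 -> t2:
# P(t0) needs K - 1, and each conditional table needs K * (K - 1).
chain = (K - 1) + K * (K - 1) + K * (K - 1)

# Tiny binary instance of the same chain (values are made up).
p_t0 = {0: 0.6, 1: 0.4}
p_t1_given_t0 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_t2_given_t1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(t0, t1, t2):
    # P(t0, t1, t2) = P(t0) P(t1 | t0) P(t2 | t1)
    return p_t0[t0] * p_t1_given_t0[t0][t1] * p_t2_given_t1[t1][t2]

# The factorized joint sums to one without any extra normalization.
total = sum(joint(*s) for s in product([0, 1], repeat=3))
```

Unlike the undirected case, no normalization constant is needed here because every factor is already a proper (conditional) distribution.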

#### Undirected Models (also called Markov Random Fields, Markov Networks):

* No conditional probability distributions (CPDs)
* Used when you don't have a prior preference for choosing a causal relationship between the variables
* The partition function, or normalization constant (Z), is closely related to statistical mechanics, where the term comes from.
* Z is often hard to compute in practice. You also have to make sure that the chosen functions can be normalized, so that the model gives you probabilities in the end.
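To see why the normalization constant is hard to compute, it helps to write it out in the clique notation used earlier. The symbols n (number of variables) and k (number of values per variable) are introduced here only for illustration:

$$ Z = \sum_{x_1} \cdots \sum_{x_n} \prod_i F^i(C^i), \qquad P(X) = \frac{1}{Z} \prod_i F^i(C^i) $$

The sum runs over every joint assignment of the variables, so it has k^n terms if nothing about the graph structure is exploited. If the clique functions are chosen badly, the sum can also diverge, in which case no valid distribution exists.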

### Energy based Models