## 1. How is the entropy of a probability mass function defined?

For the discrete random variable X:
$$
\begin{align*}
H(X) &= - \sum_i P(X = i) \log_2 P(X = i)
\end{align*}
$$

For a continuous random variable X with probability density function $f$:

$$
\begin{align*}
H(X) &= - \int f(x) \ln f(x) \, dx
\end{align*}
$$
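
As a minimal sketch of the discrete definition, the following Python function computes the entropy in bits from a list of probabilities. The function name `entropy` and the convention of skipping outcomes with zero probability (treating $0 \log 0$ as $0$) are assumptions made for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # outcomes with zero probability contribute nothing (convention: 0 * log 0 = 0)
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])   # two equally likely outcomes
fair_die = entropy([1 / 6] * 6)   # six equally likely outcomes
```

A uniform distribution over $k$ outcomes yields $\log_2 k$ bits, which is the largest entropy any distribution over $k$ outcomes can have.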

## 2. Calculate the entropy of the G(n, p) model and explain for which value p it is maximized.

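The entropy is maximized for $p = 0.5$. Assuming an undirected network without self-loops, each of the $\binom{n}{2}$ possible links is generated independently with probability $p$, so the entropy of the model is $\binom{n}{2}$ times the entropy of a single link:

$$
\begin{align*}
H &= - \binom{n}{2} \left( p \log_2 p + (1 - p) \log_2 (1 - p) \right)
\end{align*}
$$

The entropy of a single link is largest when both outcomes are equally likely, i.e. for $p = 0.5$.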
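
The following sketch checks the maximum numerically. It assumes an undirected network without self-loops, so that there are $n(n-1)/2$ independent link candidates; the function name `gnp_entropy` and the grid of $p$ values are chosen only for illustration.

```python
import math


def gnp_entropy(n, p):
    """Entropy in bits of G(n, p), assuming undirected links and no self-loops."""
    possible_links = n * (n - 1) // 2
    if p in (0.0, 1.0):
        # a deterministic model has no uncertainty
        return 0.0
    binary_entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return possible_links * binary_entropy


# scan a grid of p values to locate the maximum numerically
grid = [k / 100 for k in range(1, 100)]
best_p = max(grid, key=lambda p: gnp_entropy(10, p))
```

Because $n$ only enters as a constant factor, the maximizing value of $p$ does not depend on the network size.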

## 3. Investigate the definition of the Kullback-Leibler divergence and explain its interpretation in terms of entropy.

Entropy can also be measured relative to a second distribution. Suppose the outcomes of random events actually follow the probability distribution $P(k)$, while our model for those outcomes is $Q(k)$. The relative entropy captures the additional expected surprise that we incur by using the model $Q(k)$ instead of the true distribution $P(k)$. It equals the cross-entropy of $P$ and $Q$ minus the entropy of $P$, is never negative, and is zero exactly when $P = Q$.

The relative entropy (also called the Kullback-Leibler divergence) **from Q to P** for a discrete random variable with outcomes $i$ is given as: 

$$ D_{\text{KL}}(P \| Q) := - \sum_{i} P(i) \cdot \log \frac{Q(i)}{P(i)}$$
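
The interpretation in terms of entropy can be checked with a small sketch. It assumes base-2 logarithms (bits) and that $Q(i) > 0$ wherever $P(i) > 0$; the helper names and the two example distributions are hypothetical.

```python
import math


def entropy(p):
    """Entropy of P in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected surprise in bits when outcomes follow P but the model is Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) in bits, following the definition above."""
    return -sum(pi * math.log2(qi / pi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.25, 0.25]        # hypothetical true distribution
Q = [1 / 3, 1 / 3, 1 / 3]    # hypothetical model distribution
d_pq = kl_divergence(P, Q)
d_pp = kl_divergence(P, P)
gap = cross_entropy(P, Q) - entropy(P)
```

Here `d_pq` matches `gap` up to floating-point error, which illustrates that the divergence is the extra expected surprise beyond the entropy of $P$ itself.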

## 4. Explain under which conditions the maximization of likelihood corresponds to a minimization of entropy.

This holds in the micro-canonical ensemble, where all microstates (realizations) that are consistent with the model parameters are equiprobable. The likelihood of the model parameters is then simply the inverse of the number of realizations. The smaller that number, the larger the probability of each microstate (and thus the larger the likelihood) and the smaller the entropy. This means that in the micro-canonical ensemble, the parameters with maximum likelihood correspond to those parameters for which the ensemble has minimal entropy.

## 5. Explain how we can construct a Huffman code for a sequence of random variables.

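We assign shorter codewords to more frequent symbols. The Huffman tree is built bottom-up: every symbol starts as a leaf node labeled with its frequency, and we repeatedly merge the two nodes with the lowest frequency into a new internal node whose frequency is their sum, until a single root node remains. Labeling the two outgoing edges of each internal node with 0 and 1 gives the codeword of a symbol as the sequence of edge labels on the path from the root to its leaf.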

```python
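from collections import Counter
import queue

import pathpyG as pp  # assumption: `pp` is pathpyG, which provides Graph.from_edge_list


def huffman_tree(sequence):
    """Build the Huffman tree of `sequence`; return the tree and its root node."""
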
    counts = Counter(sequence).most_common()
    seq_length = len(sequence)

    # symbols with lowest frequency have highest priority
    q = queue.PriorityQueue()
    
    labels = {}
    node_type = {}
    for (symbol, count) in counts:
        # create leaf nodes and add to queue
        labels[symbol]='{0} / {1:.3f}'.format(symbol, count/seq_length)
        q.put((count, symbol))
        node_type[symbol] = 'leaf'        
    # Create huffman tree
    i = 0
    
    edges = []
    edge_symbols = []
    
    while q.qsize()>1:

        # retrieve two symbols with minimal frequency
        left = q.get()
        right = q.get()

        total_frequency = left[0] + right[0]

        # create internal node v with total frequency as label
        v = 'n_' + str(i)
        label = '{:.2f}'.format(total_frequency/seq_length)
        labels[v] = label
        node_type[v] = 'internal'
        edges.append((v,left[1]))
        edge_symbols.append('0')
        edges.append((v,right[1]))
        edge_symbols.append('1')

        q.put((total_frequency, v))
        i += 1

    # the remaining entry corresponds to the root node
    root = q.get()
    tree = pp.Graph.from_edge_list(edges)
    tree.data.node_labels = [labels[v] for v in tree.nodes]
    tree.data.node_type = [node_type[v] for v in tree.nodes]
    tree.data.edge_symbols = edge_symbols
    return tree, root[1]
```
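
To see the resulting codewords without a graph library, the same merging procedure can be sketched with the standard library only. This is an illustrative variant, not the function above: the name `huffman_code`, the use of `heapq`, and the tie-breaking counter are assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(sequence):
    """Return a dict mapping each symbol in `sequence` to its binary codeword."""
    frequencies = Counter(sequence)
    if len(frequencies) <= 1:
        # zero or one distinct symbol: no merging needed, use a one-bit codeword
        return {symbol: '0' for symbol in frequencies}

    tie_breaker = count()  # keeps heap entries comparable when frequencies are equal
    # each heap entry: (frequency, tie breaker, {symbol: codeword built so far})
    heap = [(freq, next(tie_breaker), {symbol: ''})
            for symbol, freq in frequencies.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # merge the two subtrees with minimal frequency
        freq_left, _, left = heapq.heappop(heap)
        freq_right, _, right = heapq.heappop(heap)
        # prepend the edge symbol: 0 for the left subtree, 1 for the right subtree
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (freq_left + freq_right, next(tie_breaker), merged))

    return heap[0][2]


codes = huffman_code('aaaabbc')
encoded = ''.join(codes[s] for s in 'aaaabbc')
```

In this example the most frequent symbol `a` receives a one-bit codeword while `b` and `c` receive two bits each, so the seven symbols are encoded in ten bits.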

## 6. What statement does Shannon’s source coding theorem make?

Let $X_i$ be a sequence of $n$ i.i.d. random variables with entropy $H(X_i)$. For any $\epsilon > 0$ and sufficiently large $n$, there exists a coding scheme that encodes a sequence of $n$ realizations of $X_i$ using approximately $n \cdot H(X_i)$ bits such that the sequence can be recovered with probability larger than $1 - \epsilon$.

## 7. Use Shannon’s source coding theorem to calculate the optimal compression for a sequence of biased coin tosses with $p \neq 0.5$.

Example at slide 14.
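
Independently of the slide example, the calculation can be sketched as follows: by the source coding theorem, a sequence of $n$ tosses can be compressed to about $n \cdot H(p)$ bits, where $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$ is the entropy of a single toss. The values of `n` and `p` below are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a single coin toss with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


n = 1000   # hypothetical number of tosses
p = 0.9    # hypothetical bias, chosen only for illustration
bits_per_toss = binary_entropy(p)
optimal_bits = n * bits_per_toss
```

For a biased coin `bits_per_toss` is strictly below one bit, so the sequence can be stored in fewer than `n` bits; only the fair coin with $p = 0.5$ needs the full one bit per toss.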

## 8. How can entropy be used for community detection based on the stochastic block model?

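In the $G(n,p)$ version of the stochastic block model, entropy minimization is equivalent to likelihood maximization. We can therefore detect communities by searching for the block assignment for which the model has minimal entropy.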

## 9. How is the description length of the stochastic block model defined?

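We first define the function $h(x) = (1 + x) \log(1 + x) - x \log x$. For a network with $n$ nodes and $m$ links that is divided into $B$ blocks by the block assignment vector $\vec{z}$, the amount of information needed to describe the model is:

$$
\begin{align*}
\Delta(\vec{z}) &= m h \left( \frac{B(B+1)}{2m} \right) + n \log B
\end{align*}
$$

The description length is the sum of this model term and the entropy of the ensemble, i.e. the information needed to describe the network once the model is known.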
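
The model term can be evaluated directly from the formula. The sketch below assumes natural logarithms, since the base of $\log$ is not specified here, and uses the hypothetical names `h` and `model_description_length` as well as hypothetical values for $n$ and $m$.

```python
import math


def h(x):
    """h(x) = (1 + x) log(1 + x) - x log x, with h(0) = 0 by continuity."""
    if x == 0:
        return 0.0
    return (1 + x) * math.log(1 + x) - x * math.log(x)


def model_description_length(n, m, B):
    """m * h(B (B + 1) / (2 m)) + n * log(B), natural logarithm assumed."""
    return m * h(B * (B + 1) / (2 * m)) + n * math.log(B)


# hypothetical network with 100 nodes and 500 links
one_block = model_description_length(100, 500, 1)
two_blocks = model_description_length(100, 500, 2)
ten_blocks = model_description_length(100, 500, 10)
```

The term grows with the number of blocks $B$, so a model with more communities is more expensive to describe.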

## 10. Explain how the detection of the optimal number of communities in a network is related to Occam’s razor.

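We look for the model with maximum explanatory power and minimum complexity. Adding more communities lets the model fit the network more closely, but it also adds parameters that must be described. Following Occam’s razor, we prefer the simplest model that still explains the data, which corresponds to the number of communities that minimizes the description length.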