# Week 2 Overview

During this week's lessons, you will learn more about word association mining with a particular focus on mining the other basic form of word association (i.e., syntagmatic relations), and start learning topic analysis with a focus on techniques for mining one topic from text.

## Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

* Explain how to discover syntagmatic relations from text data
* Explain the computation task of mining and analyzing topics in text data, particularly its input and the expected output.
* Explain the problems with defining a topic as just one term when mining and analyzing topics in text data.
* Explain the limitations of using one term to represent a topic and how they can be addressed by representing a topic as a distribution over words.
* Explain basic concepts in statistical language models such as “language model”, “unigram language model”, “likelihood”, and “maximum likelihood estimate”.
* Explain how to mine one topic from a text document, i.e., estimate a unigram language model

## Key Concepts

* Entropy
* Conditional entropy
* Mutual information
* Topic and coverage of topic
* Language model
* Generative model
* Unigram language model
* Word distribution
* Background language model
* Parameters of a probabilistic model
* Likelihood
* Bayes rule
* Maximum likelihood estimation
* Prior and posterior distributions
* Bayesian estimation & inference
* Maximum a posteriori (MAP) estimate
* Prior model
* Posterior mode

## Guiding Questions

Develop your answers to the following guiding questions while watching the video lectures throughout the week.

* What is entropy? For what kind of random variables does the entropy function reach its minimum and maximum, respectively?<br><br>
    Let:<br>
    (\*) **Probabilities** here means frequencies of outcomes in random experiments.<br>
    (\*) An **Ensemble** \\(X\\) is a triple \\((x, A_x, P_x)\\), where the *outcome* \\(x\\) is the value of a random variable, which takes on one of a set of possible values, \\(A_x=(a_1, a_2, ..., a_i, ..., a_I)\\), having probabilities \\(P_x=(p_1, p_2, ..., p_I)\\), with \\(P(x=a_i)=p_i\\), \\(p_i \geq 0\\) and \\(\sum_{a_i \in A_x} P(x=a_i)=1\\).<br>
    (\*) The Shannon **Information Content** of an outcome \\(x\\) is defined to be:<br>
$$\eqalign{
    h(x) &= log_2 \frac{1}{P(x)} \ \text{bits}\\
         &= -log_2 P(x) \ \text{bits}
}$$
    <br>
    Then:<br>
    (\*) The **Entropy of an ensemble** \\(X\\) is defined to be the average Shannon information content of an outcome:<br>
$$H(X) = -\sum\limits_{x \in Ax} P(x) \ log_2 P(x) \ \text{bits}$$<br><br>
    \\(H(X)\\) is **minimum** when one outcome is certain, i.e., \\(P(x=a_i)=1\\) for some \\(a_i\\) and \\(P(x)=0\\) for every other outcome, so that:<br>
$$H(X) = - 1 \cdot log_2 1 - \sum\limits_{x \neq a_i} 0 \cdot log_2 0 \equiv 0$$
    since \\(log_2 1 = 0\\) and \\(lim_{\theta \rightarrow 0^+} \theta log_2 1/\theta = 0\\)<br><br>
    \\(H(X)\\) is **maximum** when the distribution is uniform, \\(P(x)=1/|A_x|\\), so that:<br>
$$H(X) = - \sum\limits_{x \in A_x} \frac{1}{|A_x|} \cdot log_2 \frac{1}{|A_x|} = log_2 |A_x|$$
    <br><br>
    For further explanation, please read [Information Theory, Inference, and Learning Algorithms](https://www.amazon.com/Information-Theory-Inference-Learning-Algorithms/dp/0521642981), page 32.
<br><br>
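The two extremes above can be checked numerically. The following is a minimal sketch, assuming a distribution is given as a plain list of probabilities; the function name `entropy` and the three example distributions are illustrative and not part of the course material.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped, matching the limit p * log2(1/p) -> 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

certain = [1.0, 0.0, 0.0, 0.0]       # one outcome is sure: minimum entropy
uniform = [0.25, 0.25, 0.25, 0.25]   # all outcomes equally likely: maximum entropy
skewed = [0.7, 0.1, 0.1, 0.1]        # somewhere in between

for name, dist in [("certain", certain), ("uniform", uniform), ("skewed", skewed)]:
    print(name, entropy(dist))
```

With four outcomes the uniform case gives \\(log_2 4 = 2\\) bits, and any other distribution over the same outcomes stays at or below that value.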

* What is conditional entropy?<br><br>
    Let:<br>
    (\*) **Joint probability** is the probability that event \\(X\\) and event \\(Y\\) occur together. If \\(f_X(Y)\\) is the number of the \\(N\\) trials in which both occur:<br>
$$P(X, Y) = \frac{f_X(Y)}{N}$$<br>
    (\*) **Conditional probability**, as defined by [Kolmogorov](https://en.wikipedia.org/wiki/Andrey_Kolmogorov), restricts the joint count \\(f_X(Y)\\) to the \\(f(Y)\\) trials in which \\(Y\\) occurs:
$$\eqalign{
    P(X|Y) &= \frac{f_X(Y)}{f(Y)}\\
    &= \frac{f_X(Y)/N}{f(Y)/N}\\
    &= \frac{P(X,Y)}{P(Y)}
}$$<br>
    (\*) **Bayes rule** describes the relationship between joint probability and conditional probability:<br>
$$P(X|Y)P(Y) = P(X,Y) = P(Y|X)P(X)$$<br>
    (\*) **Joint entropy** is the entropy of the joint ensemble \\((X, Y)\\):
$$\eqalign{
    H(X,Y) &= -\sum\limits_{x \in A_x}\sum\limits_{y \in A_y} P(x,y) \ log_2 \ P(x,y)\\
           &= -E \ log_2 \ P(X,Y)
}$$<br>
    Then:<br>
    (\*) **Conditional entropy** is the entropy of \\(X\\) that remains, on average, once \\(Y\\) is known. Following Kolmogorov's definition of conditional probability, each pair \\((x, y)\\) with \\(x \in A_x, y \in A_y\\) costs \\(log_2 \ 1/P(x|y)\\) bits and is weighted by its probability \\(P(y, x)\\):<br>
$$\eqalign{
    H(X|Y) &= \sum\limits_{y \in A_y} P(y) \ H(X|Y=y)\\
           &= - \sum\limits_{y \in A_y} P(y) \sum\limits_{x \in A_x} P(x|y) \ log_2 \ P(x|y)\\
           & \text{apply bayes rule}\\
           &= - \sum\limits_{y \in A_y}\sum\limits_{x \in A_x} P(y,x) \ log_2 \ P(x|y)\\
           &= - E \ log_2 \ P(X|Y)
}$$<br>
    For further explanation, please read [Elements of Information Theory](https://www.amazon.com/Elements-Information-Theory-Telecommunications-Processing/dp/0471241954), pages 16-17.
<br><br>
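As a small illustration of the last form of the formula, the sketch below computes \\(H(X|Y)\\) from a joint distribution stored as a dictionary. The representation `joint[(x, y)] = P(x, y)` and the example numbers are assumptions made for this example only.

```python
import math

def conditional_entropy(joint):
    """H(X|Y) in bits, where joint[(x, y)] = P(x, y)."""
    # Marginal P(y), obtained by summing the joint over x.
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    # H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), with P(x|y) = P(x,y) / P(y).
    return -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

# Hypothetical joint distribution of two binary variables that usually agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(conditional_entropy(joint))

# If X is fully determined by Y, nothing is left to encode.
deterministic = {(0, 0): 0.5, (1, 1): 0.5}
print(conditional_entropy(deterministic))
```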

* What is the relation between conditional entropy H(X|Y) and entropy H(X)? Which is larger?<br><br>
    Let:<br>
    (a) \\(H(X)\\) is the uncertainty of \\(X\\), the average of \\(log_2 1/P(x)\\) bits over its outcomes, so that \\(0 \leq H(X) \leq log_2 |A_x|\\)<br><br>
    (b) \\(H(Y)\\) is the uncertainty of \\(Y\\), the average of \\(log_2 1/P(y)\\) bits over its outcomes, so that \\(0 \leq H(Y) \leq log_2 |A_y|\\)<br><br>
    (c) \\(H(X,Y)\\) is the joint uncertainty of the pair \\((X, Y)\\), the average of \\(log_2 1/P(x,y)\\) bits, so that \\(0 \leq H(X,Y) \leq log_2 (|A_x| \cdot |A_y|)\\)<br><br>
    (d) \\(H(Y,X)\\) is the same quantity with the pair written in the other order, so that \\(H(Y,X) = H(X,Y)\\)<br><br>
    We know that:<br>
    (\*) \\(H(X|Y)\\) is the uncertainty of \\(X\\) that remains once \\(Y\\) is known, the average of \\(log_2 1/P(x|y)\\) bits, so that \\(H(X|Y) \geq 0\\)<br><br>
    (\*) Based on Bayes rule, \\(H(Y,X)\\) decomposes into \\(H(Y)\\) and \\(H(X|Y)\\) (the chain rule):
$$\eqalign{
    H(Y,X) &= -\sum\limits_{y \in A_y} \sum\limits_{x \in A_x} P(y, x) \ log_2 \ P(y,x)\\
           &= -\sum\limits_{y \in A_y} \sum\limits_{x \in A_x} P(y, x) \ log_2 \ P(y) \ P(x|y)\\
           &= -\sum\limits_{y \in A_y} \sum\limits_{x \in A_x} P(y,x) \ log_2 \ P(y) -\sum\limits_{y \in A_y} \sum\limits_{x \in A_x} P(y,x) \ log_2 \ P(x|y)\\
           &= -\sum\limits_{y \in A_y} P(y) \ log_2 \ P(y) - \sum\limits_{y \in A_y} \sum\limits_{x \in A_x} P(y,x) \ log_2 \ P(x|y)\\
           &= H(Y) + H(X|Y)
}$$<br>
    (\*) The same rule also applies to \\(H(X,Y)\\):
$$H(X,Y) = H(X) + H(Y|X)$$<br>
    Then:<br>
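    (\*) \\(H(X|Y) \leq H(X) \leq H(X,Y)\\), so the entropy \\(H(X)\\) is never smaller than the conditional entropy \\(H(X|Y)\\)<br>
    (\*) \\(H(Y|X) \leq H(Y) \leq H(X,Y)\\)<br>
    (\*) The right-hand inequalities follow from the chain rule because entropies are non-negative. The left-hand inequalities are discussed further under mutual information.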
<br><br>
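The chain rule and the ordering of the three quantities can be verified on a small example. This is only a sketch: the joint distribution below is made up, and `entropy` is a helper name chosen for this example.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(x, y); the values are made up for illustration.
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 0): 0.0, ("b", 1): 0.25}

# Marginals P(x) and P(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_x = entropy(p_x.values())
h_y = entropy(p_y.values())
h_xy = entropy(joint.values())
# H(X|Y) computed directly from its definition, not from the chain rule.
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

print(f"H(X)={h_x:.4f}  H(Y)={h_y:.4f}  H(X,Y)={h_xy:.4f}  H(X|Y)={h_x_given_y:.4f}")
print("chain rule holds:", math.isclose(h_xy, h_y + h_x_given_y))
print("H(X|Y) <= H(X) <= H(X,Y):", h_x_given_y <= h_x <= h_xy)
```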

* How can conditional entropy be used for discovering syntagmatic relations?<br><br>
    Let:<br>
    (\*) An **Ensemble W** is a triple \\((w, A_w, P_w)\\) where the *outcome* \\(w\\) is a random word, which takes on one of a set of possible values, \\(A_w = (a_1, a_2, a_3, ..., a_I)\\), having probabilities \\(P_w = (p_1, p_2, p_3, ..., p_I)\\), with \\(P(w=a_i)=p_i, \ p_i \geq 0\\) and \\(\sum_{a_i \in A_w} P(w=a_i) = 1\\)<br><br>
    (\*) \\(WORDS\\) is the set of words \\(\{w_1, w_2, w_3, ..., w_I\}\\)<br><br>
    (\*) \\(WORDS^x\\) is the subset of \\(WORDS\\) that contains the unknown words, \\(\{w_1, w_2, ..., w_H\}\\), with \\(WORDS^x \subset WORDS\\)<br><br>
    (\*) \\(WORDS^{-x}\\) is the rest of \\(WORDS\\), which contains the known words, such that \\(\{w_1, w_2, ..., w_J \ | \ w_j \notin WORDS^x \}\\)<br><br>
    We want to know:<br>
    (\*) How much uncertainty remains about an unknown word \\(w_h \in WORDS^x\\) once a known word \\(w_j \in WORDS^{-x}\\) is given, i.e., \\(H(w_h \ | \ w_j)\\)<br><br>
    Assume:<br>
    (\*) \\(A_w\\) holds boolean values that indicate whether each word \\(\{w_i \ | \ w_i \in WORDS\}\\) is present or not, such that \\(A_w = \{0, 1\}\\)<br><br>
    Then:<br>
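    (\*) If the word \\(w_j\\) is known to be present, the remaining uncertainty about the unknown word \\(w_h\\) is:<br>
$$H(w_h \ | \ w_j=1) = -\sum\limits_{a_i \in A_w} P(w_h = a_i \ | \ w_j = 1) \ log_2 \ P(w_h = a_i \ | \ w_j = 1)$$
    <br>
    (\*) If the word \\(w_j\\) is known to be absent, the remaining uncertainty about the unknown word \\(w_h\\) is:<br>
$$H(w_h \ | \ w_j=0) = -\sum\limits_{a_i \in A_w} P(w_h = a_i \ | \ w_j = 0) \ log_2 \ P(w_h = a_i \ | \ w_j = 0)$$
    <br>
    (\*) Averaging the two cases gives the conditional entropy. The smaller it is, the better \\(w_j\\) predicts \\(w_h\\), so words \\(w_j\\) with a low \\(H(w_h \ | \ w_j)\\) are candidates for a syntagmatic relation with \\(w_h\\):<br>
$$H(w_h \ | \ w_j) = \sum\limits_{u \in \{0, 1\}} P(w_j = u) \ H(w_h \ | \ w_j = u)$$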
<br><br>
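To make the procedure concrete, here is a minimal sketch that estimates \\(H(w_h \ | \ w_j)\\) from presence and absence counts. It assumes the text has already been split into segments represented as sets of words; the toy segments and the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a presence/absence variable with P(present) = p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def conditional_entropy(target, given, segments):
    """H(target | given) in bits, estimated from word presence in each segment."""
    total = 0.0
    for present in (True, False):
        # Segments where the known word is present (or absent).
        subset = [s for s in segments if (given in s) == present]
        if not subset:
            continue
        p_given = len(subset) / len(segments)
        p_target = sum(target in s for s in subset) / len(subset)
        total += p_given * binary_entropy(p_target)
    return total

# Toy segments, made up for illustration.
segments = [
    {"the", "cat", "eats", "meat"},
    {"the", "dog", "eats", "meat"},
    {"the", "cat", "sleeps"},
    {"the", "dog", "runs"},
]

for known in ("eats", "the", "cat"):
    print(known, conditional_entropy("meat", known, segments))
```

In this toy data "eats" leaves no uncertainty about "meat", while "the" and "cat" leave a full bit, so "eats" would be the strongest syntagmatic candidate for "meat".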

* What is mutual information I(X;Y)? How is it related to entropy H(X) and conditional entropy H(X|Y)?<br><br>
    Let:<br>
    (\*) \\(P(x)\\) and \\(Q(x)\\) are two different probability distributions defined on the same set of possible values, \\(A_x=(a_1, a_2, ..., a_i, ..., a_I)\\).<br><br>
    (\*) The **Relative Entropy** or **Kullback-Leibler divergence** between \\(P(x)\\) and \\(Q(x)\\) is:<br>
$$\eqalign{
    D_{KL}(P||Q) &= \sum\limits_{x} P(x) log_2 \frac{P(x)}{Q(x)}\\
                 &= E_p log_2 \frac{P(x)}{Q(x)}
}$$
    which satisfies **Gibbs' inequality**:<br>
    $$D_{KL}(P||Q) \geq 0$$
    with equality if and only if \\(P = Q\\)<br>
    and if \\(P(x) > 0\\) and \\(Q(x) = 0\\), then \\(D_{KL}(P||Q) = \infty\\)<br><br>
    (\*) **Kullback-Leibler divergence** is asymmetric, so it is not a true distance. In general:<br>
$$D_{KL}(P||Q) \neq D_{KL}(Q||P)$$<br>
    (\*) **Mutual Information** \\(I(X; Y)\\) is the relative entropy between the joint distribution \\(P(x,y)\\) and the product distribution \\(P(x)P(y)\\):<br>
$$\eqalign{
    I(X;Y) &= \sum\limits_{x \in Ax}\sum\limits_{y \in Ay} P(x,y) log_2 \frac{P(x,y)}{P(x)P(y)}\\
           &= D_{KL}(P(x,y)||P(x)P(y))\\
           &= E_{P(x,y)} log_2 \frac{P(X,Y)}{P(X)P(Y)}
}$$
    <br>
    Then:<br>
    (\*) **Relationship between Entropy and Mutual Information** can be revealed by decomposing mutual information formula:<br>
$$\eqalign{
    I(X;Y) &= \sum\limits_{x,y} P(x,y) log_2 \frac{P(x,y)}{P(x)P(y)}\\
           &= \sum\limits_{x,y} P(x,y) log_2 \frac{P(x|y)}{P(x)}\\
           &= -\sum\limits_{x,y} P(x,y) log_2 P(x) + \sum\limits_{x,y} P(x,y) log_2 P(x|y)\\
           &= -\sum\limits_{x} P(x) log_2 P(x) - \big(-\sum\limits_{x,y} P(x,y) log_2 P(x|y) \big)\\
           &= H(X) - H(X|Y)
}$$<br>
    For further explanation, please read [Elements of Information Theory](https://www.amazon.com/Elements-Information-Theory-Telecommunications-Processing/dp/0471241954), pages 19-30.
<br><br>
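The definition of mutual information as a relative entropy translates almost directly into code. The sketch below assumes distributions are dictionaries from outcomes to probabilities; `kl_divergence` and `mutual_information` are names chosen for this example, and the numbers are made up.

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in bits for two dicts over the same outcomes."""
    total = 0.0
    for outcome, p_val in p.items():
        if p_val == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if q.get(outcome, 0.0) == 0:
            return math.inf  # P(x) > 0 while Q(x) = 0
        total += p_val * math.log2(p_val / q[outcome])
    return total

def mutual_information(joint):
    """I(X;Y) as the KL divergence between P(x,y) and P(x)P(y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    product = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
    return kl_divergence(joint, product)

# Hypothetical joint distribution of two binary variables that usually agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(mutual_information(joint))

# The divergence itself is not symmetric.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q), kl_divergence(q, p))
```

For this joint distribution \\(H(X) = 1\\) bit, so the printed mutual information equals \\(1 - H(X|Y)\\), in line with the decomposition above.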

* What’s the minimum value of I(X;Y)? Is it symmetric?<br><br>
    (\*) Mutual information is symmetric, such that:<br>
$$I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=I(Y;X)$$<br>
    **Proof**: Expand the previous mutual information equation with respect to \\(Y\\):<br>
$$\eqalign{
    I(X;Y) &= \sum\limits_{x,y}P(x,y) log_2 \frac{P(x,y)}{P(x)P(y)}\\
           &= \sum\limits_{x,y}P(x,y) log_2 \frac{P(x,y)}{P(x)} - \sum\limits_{x,y}P(x,y) log_2 P(y)\\
           &= \sum\limits_{x,y}P(x) P(y|x) log_2 P(y|x) - \sum\limits_{x,y} P(x,y) log_2 P(y)\\
           &= \sum\limits_{x} P(x) \big( \sum\limits_{y} P(y|x) log_2 P(y|x) \big) - \sum\limits_{y} P(y) log_2 P(y)\\
           &= -H(Y|X) + H(Y)\\
           &= H(Y) - H(Y|X)
}$$<br>
    (\*) Since:<br>
$$\eqalign{
    H(X,Y) &= H(X) + H(Y|X)\\
    H(Y,X) &= H(Y) + H(X|Y)
}$$
    thus, we have:<br>
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$<br>
    (\*) By Gibbs' inequality for relative entropy (itself a consequence of Jensen's inequality), mutual information is non-negative:<br>
$$I(X;Y) \geq 0$$<br>
    with equality if and only if \\(X\\) and \\(Y\\) are independent.
<br><br>

* For what kind of X and Y, does mutual information I(X;Y) reach its minimum? For a given X, for what Y does I(X;Y) reach its maximum?<br><br>
    (\*) Since mutual information is non-negative, its minimum value is \\(I(X;Y) = 0\\).<br>
    (\*) Mutual information reaches its minimum if and only if \\(X\\) and \\(Y\\) are independent, such that \\(P(x,y) = P(x)P(y)\\):<br>
$$\eqalign{
    I(X;Y) &= \sum\limits_{x,y}P(x,y) log_2 \frac{P(x,y)}{P(x)P(y)}\\
           &= \sum\limits_{x,y}P(x,y) log_2 1\\
           &= \sum\limits_{x,y}P(x,y) 0\\
           &= 0
}$$
    <br>
    (\*) Expanding the mutual information inequality gives a theorem:<br>
$$0 \leq I(X;Y) = H(X) - H(X|Y)$$
    The theorem above tells us that knowing another random variable \\(Y\\) can only reduce the uncertainty in \\(X\\). Note that this is true only on average. Specifically, \\(H(X|Y=y)\\) may be greater than, less than, or equal to \\(H(X)\\), but on average \\(H(X|Y) = \sum_y P(y) H(X|Y=y) \leq H(X)\\).
    <br><br>
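    (\*) For a given \\(X\\), since \\(H(X|Y) \geq 0\\), mutual information is at most \\(H(X)\\). It reaches this maximum if and only if \\(H(X|Y) = 0\\), i.e., \\(Y\\) completely determines \\(X\\). The simplest case is \\(Y = X\\), such that \\(I(X;X) = H(X)\\)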
<br><br>
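Both extremes can be reproduced with a few lines. This is a sketch under the assumption that a joint distribution is stored as `joint[(x, y)] = P(x, y)`; the weather-themed outcomes and probabilities are invented for illustration.

```python
import math

def entropy(dist):
    """H in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits, where joint[(x, y)] = P(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

p_x = {"rain": 0.3, "sun": 0.7}
p_y = {"umbrella": 0.6, "none": 0.4}

# Minimum: X and Y independent, so P(x, y) = P(x) P(y).
independent = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}
# Maximum for this X: Y is a copy of X.
identical = {(x, x): px for x, px in p_x.items()}

print(mutual_information(independent))              # approximately 0
print(mutual_information(identical), entropy(p_x))  # both equal H(X)
```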

* Why is mutual information sometimes more useful for discovering syntagmatic relations than conditional entropy?<br><br>
    1. Conditional entropies are only comparable when they share the same target word. \\(H(w_h \ | \ w_j)\\) and \\(H(w_i \ | \ w_j)\\) have different upper bounds, so a smaller value for one pair does not mean a stronger relation than for the other:
$$H(w_h \ | \ w_j) \leq H(w_h), \quad H(w_i \ | \ w_j) \leq H(w_i)$$
    <br>
    2. Mutual information measures how much knowing one word reduces the entropy of the other, and it is symmetric, so its values can be compared across different word pairs:
$$I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=I(Y;X)$$
<br><br>
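A small sketch of how mutual information could be estimated for word pairs follows. It assumes, as an illustration only, that each word is modelled by a binary presence variable over text segments; the segments and the function name `presence_mi` are hypothetical.

```python
import math

def presence_mi(w1, w2, segments):
    """I(X_w1; X_w2) in bits for binary presence variables, estimated from segments."""
    n = len(segments)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            # Empirical joint and marginal probabilities of presence/absence.
            p_ab = sum((w1 in s) == a and (w2 in s) == b for s in segments) / n
            p_a = sum((w1 in s) == a for s in segments) / n
            p_b = sum((w2 in s) == b for s in segments) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy segments, made up for illustration.
segments = [
    {"the", "cat", "eats", "meat"},
    {"the", "dog", "eats", "meat"},
    {"the", "cat", "sleeps"},
    {"the", "dog", "runs"},
]

for pair in [("eats", "meat"), ("meat", "eats"), ("the", "meat"), ("cat", "meat")]:
    print(pair, presence_mi(*pair, segments))
```

Because the score is symmetric and zero for independent words, the frequent word "the" gets no credit here, and scores for different pairs sit on the same scale.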

* What is a topic?<br><br>
A topic is the main idea discussed in a document.
<br><br>

* How can we define the task of topic mining and analysis computationally? What’s the input? What’s the output?<br><br>
    Input:<br>
    (\*) \\(N\\) is the number of documents in the collection.<br>
    (\*) \\(C=\{d_1, ..., d_N\}\\) is a collection of documents.<br>
    (\*) \\(k\\) is the desired number of topics.<br><br>
    Output:<br>
    (\*) \\(k\\) topics \\(\{\theta_1, ..., \theta_k\}\\)<br>
    (\*) Coverage of topics in each document \\(d_i\\): \\(\{\Pi_{i1}, ..., \Pi_{ik}\}\\)<br>
    (\*) \\(\Pi_{ij}\\) is the probability of \\(d_i\\) covering topic \\(\theta_j\\), such that:
$$\sum\limits_{j=1}^k \Pi_{ij} = 1$$
<br><br>

* How can we heuristically solve the problem of topic mining and analysis by treating a term as a topic? What are the main problems of such an approach?<br><br>
    Topic mining and analysis can be done step by step by treating a term as a topic:<br>
    \- Parse the text in collection \\(C\\) to obtain candidate terms, e.g., term = word or term = phrase.<br>
    \- Design a term scoring function, e.g., TF-IDF or domain-specific heuristics (favor title words, hashtags, etc.).<br>
    \- Pick the \\(k\\) terms with the highest scores, but try to minimize redundancy, e.g., by using WordNet to find synonyms or latent semantic indexing to find correlations between words.
    <br><br>
    Problems with "term as topic":<br>
    \- Lacks broad vocabulary coverage: a single term cannot cover related words.<br>
    \- Cannot resolve word ambiguity, e.g., "basketball star" vs "star in the sky".
<br><br>
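The first two steps and a plain top-\\(k\\) selection can be sketched as follows. This is one possible reading, not the course's reference implementation: it assumes whitespace tokenization, term = word, a simple TF-IDF variant (collection term frequency times \\(\ln(N/df)\\)), and no redundancy removal. The documents are invented.

```python
import math
from collections import Counter

def top_terms(docs, k):
    """Rank candidate terms by a simple TF-IDF score and return the k best."""
    n = len(docs)
    tf = Counter(w for d in docs for w in d)       # term frequency over the collection
    df = Counter(w for d in docs for w in set(d))  # number of documents containing the term
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy collection, made up for illustration.
docs = [
    "the basketball game was a great basketball show".split(),
    "the basketball star scored in the game".split(),
    "the star in the sky is bright".split(),
    "the travel guide lists cheap travel deals".split(),
]

print(top_terms(docs, 2))
```

Note that the word "star" receives a single score even though it is used in two different senses, which is the ambiguity problem listed above.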

* What are the benefits of representing a topic by a word distribution?<br><br>
    - Uses multiple words to describe a complicated topic.
    - Uses word weights to model subtle semantic variations of a topic.
<br><br>

* What is a statistical language model? What is a unigram language model? How can we compute the probability of a sequence of words given a unigram language model?<br><br>
    - A statistical language model is a probability distribution over word sequences. It can also be regarded as a generative model for text.
    - A unigram language model generates each word independently. Thus, the probability of the text "today is wed" is the product of the word probabilities:<br>
$$P("\text{today is wed}") = P("\text{today}")P("\text{is}")P("\text{wed}")$$
<br><br>
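A minimal sketch of this computation is shown below. The word probabilities in `model` are invented for illustration, and the sum of log-probabilities is used only to avoid numerical underflow on long texts.

```python
import math

def sequence_log_prob(words, model):
    """Log-probability of a word sequence under a unigram language model."""
    # Independence assumption: log P(w1 ... wn) = sum_i log P(wi).
    return sum(math.log(model[w]) for w in words)

# Hypothetical unigram model; the probabilities sum to 1.
model = {"today": 0.2, "is": 0.5, "wed": 0.1, "the": 0.2}

p = math.exp(sequence_log_prob(["today", "is", "wed"], model))
print(p)  # 0.2 * 0.5 * 0.1

# Word order does not matter under a unigram model.
p_shuffled = math.exp(sequence_log_prob(["wed", "today", "is"], model))
print(math.isclose(p, p_shuffled))
```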

* What is Maximum Likelihood estimate of a unigram language model given a text article?<br><br>
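    - The maximum likelihood estimate is the parameter value \\(\hat{\theta}\\) of the unigram language model that gives the observed text \\(X\\) the highest probability, such that:
$$\hat{\theta} = arg \ max_{\theta} P(X|\theta)$$
    For a unigram language model this is the relative frequency of each word in the article, \\(P(w|\hat{\theta}) = c(w, X) / |X|\\), where \\(c(w, X)\\) is the count of \\(w\\) in \\(X\\) and \\(|X|\\) is the total number of words.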
<br><br>
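For a unigram language model, maximizing the likelihood amounts to normalizing word counts. The sketch below assumes lowercasing and whitespace tokenization, which are simplifications chosen for this example; the sample sentence is made up.

```python
from collections import Counter

def mle_unigram(text):
    """Maximum likelihood unigram model: relative frequency of each word."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

model = mle_unigram("text mining is fun and text mining is useful")
print(model["text"])        # 2 of the 9 words
print(sum(model.values()))  # probabilities sum to 1 (up to rounding)
```

Any word that does not occur in the article gets probability zero under this estimate.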

* What is the basic idea of Bayesian estimation? What is a prior distribution? What is a posterior distribution? How are they related with each other? What is Bayes rule?<br><br>
    - The basic idea of Bayesian estimation is to combine prior knowledge about the parameter with the evidence in the data. In other words, compute a posterior belief from the prior belief and the likelihood and, for the maximum a posteriori (MAP) estimate, pick its mode, such that:
$$\eqalign{
    \hat{\theta} &= arg \ max_{\theta} P(\theta|X)\\
                 &= arg \ max_{\theta} P(X|\theta)P(\theta)
}$$
    <br>
    - \\(P(\theta)\\) is the prior distribution.
    - \\(P(\theta|X)\\) is the posterior distribution, which is proportional to \\(P(X|\theta)P(\theta)\\).
    - \\(P(\theta)\\) needs to be defined before \\(P(\theta|X)\\) can be computed.
    - **Bayes rule** describes the relationship between joint probability and conditional probability:<br>
$$P(X|Y)P(Y) = P(X,Y) = P(Y|X)P(X)$$
<br><br>
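The effect of a prior can be seen in a one-parameter case. The sketch below is an illustration under assumptions that are not in the notes above: \\(\theta\\) is the probability that a word is present in a segment, the prior is a Beta(\\(a\\), \\(b\\)) distribution, and the word was observed in \\(k\\) of \\(n\\) segments. The posterior is then Beta(\\(a + k\\), \\(b + n - k\\)), and the MAP estimate is its mode.

```python
def mle(k, n):
    """Maximum likelihood estimate of theta: observed relative frequency."""
    return k / n

def map_estimate(k, n, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior.

    Assumes a + k > 1 and b + n - k > 1 so that the mode is well defined.
    """
    return (k + a - 1) / (n + a + b - 2)

k, n = 3, 4  # hypothetical data: word present in 3 of 4 segments

print(mle(k, n))                 # data only
print(map_estimate(k, n, 1, 1))  # uniform prior: same as the MLE
print(map_estimate(k, n, 5, 5))  # prior centered on 0.5 pulls the estimate toward 0.5
```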

## Additional Readings and Resources

* C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016. Chapters 13, 17.