## NLP Task

1. **Lexical analysis**: Figure out what the basic meaningful units in a language are and determine the meaning of each word.
2. **Syntactic analysis**: Determine how words are related to each other in a sentence, thus revealing the syntactic structure of the sentence.
3. **Semantic analysis**: Determine the meaning of a sentence.
4. **Pragmatic analysis**: Determine meaning in context.
5. **Discourse analysis**: Determine relation between sentences.

## NLP Challenges

1. **Word-level ambiguity**: Multiple syntactic categories and senses.
2. **Syntactic ambiguity**: Multiple syntactic structures.
3. **Anaphora resolution**: A pronoun or other referring expression may have multiple possible antecedents.
4. **Presupposition**: Required inferences to understand the meaning.

## Considerations in NLP Implementation

* While linguistic knowledge is always useful, today, the most advanced natural language processing techniques tend to rely on heavy use of statistical machine learning techniques with linguistic knowledge only playing a somewhat secondary role.
* Only “shallow” analysis of natural language can be done robustly for arbitrary text; “deep” analysis tends not to scale up well or be robust enough for analyzing unrestricted text. In many cases, a significant amount of training data (created by human labeling) must be available in order to achieve reasonable accuracy.

![nlp-difficulty-level](images/nlp-difficulty-level.png)

## Text Representation

1. **String**: Store the text as a sequence of characters.<br>
Pros: General; applies to any text.<br>
Cons: Does not support semantic analysis.<br>
2. **Word**: Segment the string into a sequence of readable words.<br>
Pros: Enables statistical analysis; can be extended with POS tags and syntactic structures.<br>
Cons: Less general, e.g. Chinese text needs more sophisticated word segmentation.<br>
3. **Graph**: Represent entities and relations.<br>
Pros: Semantic analysis.<br>
Cons: Requires deep analysis; not general.<br>
4. **Logic**: Represent rules.<br>
Pros: Inferences.<br>
Cons: May need significant computation time.<br>
5. **Speech act**: Represent the intent behind an utterance.<br>
Pros: Represent human knowledge.<br>
Cons: Requires deep analysis; limited uses.<br>
<br>
![text representation](images/text-representation.png)

## Statistical Language Model

A statistical language model is a probability distribution over word sequences. It thus gives any sequence of words a potentially different probability. For example, a language model may give the following three word sequences different probabilities:

$$\eqalign{
p(\text{Today is Wednesday}) &= 0.001\\
p(\text{Today Wednesday is}) &= 0.0000000001\\
p(\text{The equation has a solution}) &= 0.000001\\
}$$

By calculating the probability of each word sequence, we may reveal the context of a document. In the example above, the sequence *The equation has a solution* would likely receive a higher probability in a mathematical context than in general text. Thus, a language model can be **context dependent**.

One advantage of using a language model is that it can quantify the uncertainties associated with the use of natural language. For example:

* Help a speech recognizer choose between candidate words. For example, given the recognized words *John feels*, suppose the candidates for the next word are *happy* and *habit*. Since *happy* and *habit* may have similar acoustic signals, the language model can break the tie: *happy* has a higher probability than *habit* after *John feels*, so the output should be *John feels happy*.
* Help document categorization by counting the most frequent words.
* Predict queries based on user preferences. For example, knowing that a user is interested in *sport news*, we can suggest words with high probability in the context of sport.

### Unigram language model

Assume that each word is generated independently of the others. Thus, the probability of a word sequence equals the product of the probabilities of its words.

Let:

$$\eqalign{
V &: \text{Set of words in the vocabulary}\\
w_1, ..., w_n &: \text{A word sequence, where} \ w_i \in V
}$$

We have:<br>

$$p(w_1, ..., w_n) = \prod\limits_{i=1}^n p(w_i)$$
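
The product formula above can be sketched in a few lines of Python. The vocabulary and probabilities in `toy_probs` are made up for illustration, and the function name `unigram_sequence_prob` is hypothetical; treating words outside the table as having probability 0 is also an assumption of this sketch.

```python
import math

def unigram_sequence_prob(words, word_probs):
    """Return p(w_1, ..., w_n) as the product of the per-word probabilities."""
    # Assumption: a word missing from the table has probability 0.
    return math.prod(word_probs.get(w, 0.0) for w in words)

# Hypothetical toy distribution over a four-word vocabulary; it sums to 1.
toy_probs = {"today": 0.4, "is": 0.3, "wednesday": 0.2, "friday": 0.1}

p1 = unigram_sequence_prob(["today", "is", "wednesday"], toy_probs)
p2 = unigram_sequence_prob(["today", "wednesday", "is"], toy_probs)
```

Because the factors are simply multiplied, `p1` and `p2` agree up to floating-point rounding: a unigram model ignores word order, so it cannot by itself distinguish *Today is Wednesday* from *Today Wednesday is*.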

Given a unigram language model \\(\theta\\), we have as many parameters as there are words in the vocabulary, and they satisfy the constraint \\(\sum_{w \in V} p(w)=1\\). Such a model essentially specifies a multinomial distribution over all the words. For example, let each \\(\theta\\) be a model that describes the topic of a document \\(D\\):

Let:<br>
$$\eqalign{
\theta_1 &= \text{text mining}\\
\theta_2 &= \text{health}
}$$

Each \\(\theta\\) is trained on the abstracts of papers from its topic (text mining papers and health papers, respectively), with abstract lengths assumed to be uniformly distributed, such that:<br>
$$\eqalign{
p(w|\theta_1) &= [\{\text{text}: 0.2\}, \{\text{mining}: 0.1\}, \{\text{association}: 0.01\}, \{\text{clustering}: 0.02\}, ..., \{\text{food}: 0.00001\}]\\
p(w|\theta_2) &= [\{\text{food}: 0.25\}, \{\text{nutrition}: 0.1\}, \{\text{healthy}: 0.05\}, \{\text{diet}: 0.02\}, ..., \{\text{text}: 0.00001\}]\\
p(w|\theta_1) &\neq p(w|\theta_2)
}$$

Then, we assign a newly observed document \\(D_i\\) to the text mining or health topic according to which model gives it the higher probability:
$$\eqalign{
p(D_i|\theta_1) > p(D_i|\theta_2) \implies D_i &\in \text{text mining}\\
p(D_i|\theta_1) < p(D_i|\theta_2) \implies D_i &\in \text{health}
}$$
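
As a rough illustration of this comparison, the sketch below scores a document under two unigram models and picks the one with the higher likelihood. The two tables contain only the handful of words listed above, so they are truncated and do not sum to 1; the floor probability `UNSEEN` for words missing from a table, the tie-breaking rule, and the names `log_likelihood` and `classify` are all assumptions made for this example. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long documents.

```python
import math

# Truncated word distributions using the example values above (illustrative only).
theta_text_mining = {"text": 0.2, "mining": 0.1, "association": 0.01,
                     "clustering": 0.02, "food": 0.00001}
theta_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05,
                "diet": 0.02, "text": 0.00001}

UNSEEN = 1e-6  # assumed floor probability for words missing from a table

def log_likelihood(doc_words, theta):
    """Return log p(D | theta) under the unigram independence assumption."""
    return sum(math.log(theta.get(w, UNSEEN)) for w in doc_words)

def classify(doc_words):
    """Pick the topic whose model gives the document the higher likelihood."""
    ll_text_mining = log_likelihood(doc_words, theta_text_mining)
    ll_health = log_likelihood(doc_words, theta_health)
    # Assumption: ties are assigned to "health"; the text leaves ties undefined.
    return "text mining" if ll_text_mining > ll_health else "health"

label_a = classify(["text", "mining", "clustering"])
label_b = classify(["food", "diet", "nutrition"])
```

With these tables, `label_a` comes out as `"text mining"` and `label_b` as `"health"`, since each document uses words that are far more probable under one model than the other.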

If we have more than two \\(\theta\\), it becomes harder to tell which topic a document belongs to. One very simple method for this is **Maximum Likelihood Estimation**, which defines the parameter \\(\hat{\theta}\\) as the one that gives the document the highest likelihood:

$$\hat{\theta} = \arg\max_{\theta} \ p(D|\theta)$$

which, in the case of a unigram language model, means that the probability of each word under \\(\hat{\theta}\\) equals its relative frequency in \\(D\\):

$$p(w|\hat{\theta}) = \frac{c(w,D)}{|D|}$$

where:
$$\eqalign{
c(w,D) &= \text{count of word} \ w \ \text{in} \ D\\
|D| &= \text{length of document} \ D \text{, i.e. the total number of words in} \ D
}$$
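
The relative-frequency estimate can be computed directly from word counts. The following is a minimal sketch; the sample sentence and the function name `mle_unigram` are invented for illustration, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter

def mle_unigram(doc_words):
    """Estimate p(w | theta_hat) = c(w, D) / |D| for every word in the document."""
    counts = Counter(doc_words)   # c(w, D)
    total = len(doc_words)        # |D|
    return {w: c / total for w, c in counts.items()}

# Hypothetical eight-word document, split on whitespace.
doc = "text mining is the mining of text data".split()
theta_hat = mle_unigram(doc)
```

Here *text* and *mining* each occur twice among eight words, so both get probability 0.25, while every word that does not occur in the document implicitly gets probability 0.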

**The challenge is to create a model \\(p(w)\\) that serves as a general model across a broad range of document topics**. To do so, we need to penalize common words, such as *the*, *a*, *is*, etc., and emphasize significant words that describe the topic of the document, such as *computer*, *sport*, *government*, etc. This technique is called a **background language model**, as shown in the image below:

![background language model](images/background-language-model.png)
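
One possible way to use a background model is sketched below. It assumes that the normalization is done by dividing each word's topic probability by its background probability. That ratio is a common choice but is an assumption here, not something the text above specifies. All probabilities, the `floor` parameter, and the name `normalize_by_background` are hypothetical.

```python
def normalize_by_background(topic_probs, background_probs, floor=1e-6):
    """Rank words by p(w | topic) / p(w | background), highest ratio first."""
    # Assumption: words missing from the background table get a small floor value.
    ratios = {w: p / background_probs.get(w, floor) for w, p in topic_probs.items()}
    return sorted(ratios, key=ratios.get, reverse=True)

# Made-up probabilities: common words are about as likely in the topic model
# as in the background model, while topical words are much more likely.
topic = {"the": 0.03, "a": 0.02, "text": 0.04, "mining": 0.035}
background = {"the": 0.03, "a": 0.02, "text": 0.0006, "mining": 0.0002}

ranked = normalize_by_background(topic, background)
```

With these numbers, *mining* and *text* rise to the top of `ranked`, while *the* and *a* drop to the bottom because they are no more frequent in the topic than in general text.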