# language model

<table>
  <tr>
    <th>Type</th>
    <th colspan='2'>Subtype</th>
    <th>Model</th>
    <th>Description</th>
    <th>Pros</th>
    <th>Cons</th>
  </tr>
  <tr>
    <td rowspan="2">Deterministic</td>
    <td colspan='2'>Rule-based</td>
    <td>Formal Grammar</td>
    <td>Used for parsing or grammar checking</td>
    <td>Highly interpretable, easy to modify rules</td>
    <td>Requires manual rule creation, not adaptive to new data</td>
  </tr>
  <tr>
    <td colspan='2'>Template-based</td>
    <td>Regular Expression, Slot-filling, Frame semantics</td>
    <td>Pattern matching and semantic analysis based on templates</td>
    <td>Simple, fast, effective for specific tasks</td>
    <td>Not adaptive, limited expressiveness</td>
  </tr>
  <tr>
    <td rowspan="14">Probabilistic</td>
    <td rowspan="8">Statistical</td>
    <td rowspan="5">Frequentist</td>
    <td>N-gram model</td>
    <td>Predicts the next word from the previous n-1 words</td>
    <td>Simple, interpretable, scalable</td>
    <td>Requires large data, suffers from data sparsity</td>
  </tr>
  <tr>
    <td>Hidden Markov Model (HMM)</td>
    <td>Memoryless weighted finite-state transducer whose hidden states follow a Markov chain</td>
    <td>Handles temporal dependencies, interpretable</td>
    <td>Assumes Markov property, limited modeling capacity</td>
  </tr>
  <tr>
    <td>MaxEnt</td>
    <td>Multinomial logistic regression (softmax) for classification; the objective is to maximize entropy subject to feature constraints</td>
    <td>Flexible, can handle overlapping features</td>
    <td>Requires large data, can be computationally expensive</td>
  </tr>
  <tr>
    <td>Conditional Random Field (CRF)</td>
    <td>Discriminative model for structured prediction</td>
    <td>Discriminative, can handle overlapping features</td>
    <td>Computationally expensive, requires large data</td>
  </tr>
  <tr>
    <td>Caching</td>
    <td>Assigns higher probability to recently seen word sequences</td>
    <td>Simple, can adapt to recent data</td>
    <td>Depends on instance similarity, limited capacity</td>
  </tr>
  <tr>
    <td rowspan="3">Bayesian</td>
    <td>Noisy Channel model</td>
    <td>Information-theoretic model: recovers the input from a noisy output by MLE or Bayesian inference</td>
    <td>Handles uncertainty, incorporates prior knowledge</td>
    <td>Computationally expensive, requires prior assumptions</td>
  </tr>
  <tr>
    <td>Latent Dirichlet Allocation (LDA)</td>
    <td>Dirichlet prior for Topic modeling</td>
    <td>Handles uncertainty, incorporates prior knowledge</td>
    <td>Computationally expensive, requires prior assumptions</td>
  </tr>
  <tr>
    <td>Naive Bayes</td>
    <td>Generative classifier; the objective is MAP (maximum a posteriori)</td>
    <td>Simple, handles uncertainty, incorporates prior knowledge</td>
    <td>Naive assumption that features are independent, can be affected by irrelevant features</td>
  </tr>
  <tr>
    <td rowspan="6">Neural Networks</td>
    <td rowspan="3">Recurrent Neural Network (RNN)</td>
    <td>SRN (Simple Recurrent Network)</td>
    <td>The hidden state is determined by both the current input and the previous hidden state. Unidirectional with one hidden layer.</td>
    <td>Handles short-term dependencies, simple architecture</td>
    <td>Vanishing and exploding gradient problem, limited long-range dependency</td>
  </tr>
  <tr>
    <td>LSTM (Long Short-Term Memory)</td>
    <td>Captures long-term dependencies with a gating mechanism that acts like a skip connection through time. 3 gates (input, forget, output)</td>
    <td>Handles long-range dependencies, alleviates vanishing gradient problem</td>
    <td>Computationally expensive, more complex architecture</td>
  </tr>
  <tr>
    <td>GRU (Gated Recurrent Unit)</td>
    <td>2 gates (update and reset)</td>
    <td>Handles long-range dependencies, simpler than LSTM, faster training</td>
    <td>Cannot count as effectively as LSTM (e.g. a^nb^n)</td>
  </tr>
  <tr>
    <td rowspan="2">Transformer</td>
    <td>GPT</td>
    <td>Pretrained decoder-only Transformer</td>
    <td>SOTA, versatile, scalable, parallelizable</td>
    <td>Resource-intensive, requires large data</td>
  </tr>
  <tr>
    <td>BERT</td>
    <td>Pretrained encoder-only Transformer for contextualized word embedding</td>
    <td>Versatile, scalable, parallelizable</td>
    <td>Resource-intensive, requires large data</td>
  </tr>
  <tr>
    <td colspan='2'>Recursive Neural Network</td>
    <td>Input is tree-structured data (Parsed tree)</td>
    <td>Handles hierarchical structures, suitable for parsing, sentiment analysis</td>
    <td>Dependent on accurate tree structures, limited to specific tasks</td>
  </tr>
</table>


# probabilistic language model

## definition

- A language model assigns a probability to the occurrence of a sequence of words. The probabilities over all possible sequences sum to 1.

  $$
  P(seq) = p(w_1,...,w_n)
  $$


- This probability is computed with the chain rule, as the product of the conditional probabilities of each word given its preceding words (history) in the sequence.

  $$
  P(seq) =\prod_{i=1}^n p(w_i|w_1, ..., w_{i-1})=p(w_1)p(w_2|w_1)...p(w_n|w_1,...,w_{n-1})
  $$


- As the sequence length $n$ increases, the number of histories (possible unique sequences of preceding words) grows exponentially as

  $$|V|^{n-1}$$

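- The number of parameters in the model for the last conditional probability is

  $$(|V|-1)|V|^{n-1}$$

  where $|V|$ is the vocabulary size: each of the $|V|^{n-1}$ histories needs $|V|-1$ free probabilities, because the probabilities over the next word sum to 1.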
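
The chain-rule factorization and the two counts above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: the table `COND_PROB`, its words, and its probability values are made up, and the function names are hypothetical.

```python
import math

# Hypothetical toy table of p(w_i | history); words and values are made up.
COND_PROB = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
}


def sequence_prob(words):
    """Multiply p(w_i | w_1..w_{i-1}) over the sequence (summed in log space)."""
    log_p = 0.0
    for i, word in enumerate(words):
        history = tuple(words[:i])
        log_p += math.log(COND_PROB[history][word])
    return math.exp(log_p)


def num_histories(vocab_size, n):
    """Number of possible histories of length n-1: |V|^(n-1)."""
    return vocab_size ** (n - 1)


def num_parameters(vocab_size, n):
    """Free parameters of p(w_n | w_1..w_{n-1}): (|V|-1) * |V|^(n-1)."""
    return (vocab_size - 1) * num_histories(vocab_size, n)


print(sequence_prob(["the", "cat", "sat"]))  # product 0.5 * 0.4 * 0.7
print(num_histories(10_000, 3))   # 100000000
print(num_parameters(10_000, 3))  # 999900000000
```

Even with a modest vocabulary of 10,000 words and a sequence length of 3, the last conditional distribution alone would need roughly $10^{12}$ parameters, which is why the full history is truncated in practice.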

## problem of MLP as language model

n-gram language model: a word depends only on the previous $N-1$ words (context)

$$P(w_{t} | w_{t-1}, w_{t-2}, ..., w_{1}) \approx P(w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1})$$


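To estimate the conditional probability $P(w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1})$, we can apply the definition of conditional probability and approximate each joint probability with the counts of the corresponding n-grams in the training data (the maximum likelihood estimate). Specifically, we can estimate the probability as: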

$$P(w_{t} | w_{t-1}, w_{t-2}, ..., w_{t-N+1}) 
= \frac{P(w_{t-N+1}, w_{t-N+2}, ..., w_{t})}{P(w_{t-N+1}, w_{t-N+2}, ..., w_{t-1})}\\[1em]
\approx \frac{\text{count}(w_{t-N+1}, w_{t-N+2}, ..., w_{t})}{\text{count}(w_{t-N+1}, w_{t-N+2}, ..., w_{t-1})}$$

where numerator is the number of times the n-gram $w_{t-N+1}, w_{t-N+2}, ..., w_{t}$ appears in the training data, 

and denominator is the number of times the (N-1)-gram $w_{t-N+1}, w_{t-N+2}, ..., w_{t-1}$ appears in the training data.

For example, with a 4-gram model ($N = 4$), the probability of the word 'store' given the context 'walked to the' is

$$P(\text{store} | \text{walked to the}) \approx \frac{\text{count}(\text{walked to the store})}{\text{count}(\text{walked to the})}$$
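
The count-based estimate can be sketched as follows. This is a minimal sketch under stated assumptions: the corpus is a made-up, whitespace-tokenized toy example, the helper names `ngram_counts` and `mle_prob` are hypothetical, and returning `0.0` for an unseen context is an arbitrary choice (the ratio is undefined there).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_prob(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    full_counts = ngram_counts(tokens, len(context) + 1)
    context_counts = ngram_counts(tokens, len(context))
    if context_counts[context] == 0:
        return 0.0  # unseen context: estimate is undefined, 0.0 is an arbitrary choice
    return full_counts[context + (word,)] / context_counts[context]


# Made-up toy corpus, whitespace-tokenized.
corpus = ("she walked to the store then he walked to the park "
          "and i walked to the store").split()

print(mle_prob(corpus, ["walked", "to", "the"], "store"))  # 2 / 3
```

In this toy corpus 'walked to the' appears three times and 'walked to the store' twice, so the estimate is 2/3. Any 4-gram that never appears in the corpus gets probability 0, which is the data sparsity issue.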


Problems of the N-gram model:

- Sparsity of input: 

    An input sequence is represented as a feature vector whose dimension is the n-gram vocabulary size and whose entries count the occurrences of each unique n-gram.
    
    As N increases, the n-gram vocabulary size grows exponentially, which leads to sparse representations because most N-grams in the corpus are unique or rare.


- Large model size $O(|V|^N)$: 

    As N increases, the input dimension increases exponentially.

    This increases the model's size and memory requirements, slows training and inference, and requires more computational resources.


- Limited context: the context window is small ($N-1$ words), so the model can't handle longer-range dependencies.
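
The sparsity and size problems can be seen even on a tiny example. The sketch below compares the number of distinct N-grams observed in a made-up toy corpus with the $|V|^N$ N-grams that are possible; the corpus and the helper name `ngram_coverage` are assumptions for illustration only.

```python
def ngram_coverage(tokens, n):
    """Return (distinct n-grams observed, n-grams possible = |V|^n)."""
    vocab = set(tokens)
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(observed), len(vocab) ** n


# Made-up toy corpus, whitespace-tokenized.
tokens = "the cat sat on the mat the dog sat on the rug".split()

for n in (1, 2, 3):
    observed, possible = ngram_coverage(tokens, n)
    # The fraction of the feature space that is ever non-zero shrinks as n grows.
    print(n, observed, possible, observed / possible)
```

The number of possible N-grams grows as $|V|^N$ while the number observed is bounded by the corpus length, so the fraction of non-zero feature dimensions shrinks quickly as N increases.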

# development cost

- create labelled dataset: expensive human labor

- train model: computation (PFLOPs)

- inference: user computation time (ms/example)

# key to robust language model

- Large data size: A large amount of text data is essential to train the NLP system to understand various language structures, patterns, and nuances. Data diversity helps the system to generalize and adapt to different situations and domains.

- Linguistic intuition: Incorporating knowledge of linguistic principles and structures helps the NLP system to better understand and process language. Understanding syntax, semantics, and pragmatics helps the system to make sense of complex language phenomena.

- Appropriate representation: Choosing the right representation of data, such as word embeddings, syntax trees, or more advanced techniques like transformers, is crucial for the system's effectiveness in capturing language patterns and relationships.

- Robust algorithms: Implementing efficient and robust algorithms for different NLP tasks, such as parsing, tokenization, sentiment analysis, or machine translation, is vital for the system's overall performance and accuracy.

- World knowledge: Incorporating general knowledge about the world, including common sense knowledge, facts, and relationships, allows the NLP system to make inferences, resolve ambiguities, and better understand the context of language.

- **Grounding**: connecting language to real-world knowledge or perceptual experiences. For example, grounding a word or phrase in an image, video, or a sensory experience. It often involves tasks like visual question-answering or image captioning, where models need to understand both linguistic and visual information to generate meaningful results.