# 1. What constitutes a probability measure?

A *probability measure* (or *probability distribution*) $P$ on the sample space ($S$, <span style="font-family: 'cursive';">S</span>) is a real-valued function defined on the collection of events <span style="font-family: 'cursive';">S</span> that satisfies the following axioms:
1. $P(A) \ge 0$ for every event $A$.
2. $P(S) = 1$.
3. If $\lbrace A_i : i\in I \rbrace$ is a countable, pairwise disjoint collection of events, then
$$P\left(\bigcup_{i\in I}A_i\right)=\sum_{i\in I}P(A_i)$$

Source: https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/02%3A_Probability_Spaces/2.03%3A_Probability_Measures#:~:text=satisfies%20certain%20axioms.-,Definition,P(S)%3D1.
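
To make the axioms concrete, here is a minimal sketch that checks them on a small finite sample space. The fair six-sided die, the uniform measure, and the names `S`, `P`, and `events` are assumptions chosen for illustration; they are not taken from the source.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical example: the sample space of a fair six-sided die.
S = frozenset(range(1, 7))

def P(event):
    """Uniform probability measure: |event| / |S| for any subset of S."""
    return Fraction(len(event), len(S))

# The collection of events: every subset of S (2**6 = 64 of them).
events = [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

# Axiom 1: non-negativity for every event.
nonnegative = all(P(A) >= 0 for A in events)

# Axiom 2: the whole sample space has probability 1.
total = P(S)

# Axiom 3 (checked here for one disjoint pair only): additivity.
A, B = frozenset({1, 2}), frozenset({5})
additive = P(A | B) == P(A) + P(B)
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.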

# 2. Independence

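Two events $A$ and $B$ are *independent* if their joint probability factors into the product of their individual probabilities:
$$P(A,B) = P(A)P(B)$$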
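
As an illustration, the product rule can be checked by enumeration. This sketch assumes two fair dice; the events `A`, `B`, `C` and the helper `prob` are hypothetical names introduced only for this example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (assumed setup).
outcomes = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability of the event described by the predicate `pred`."""
    return Fraction(sum(1 for o in outcomes if pred(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0       # first die is even
B = lambda o: o[1] > 4            # second die shows 5 or 6
C = lambda o: o[0] + o[1] == 12   # the two dice sum to 12

# A and B concern different dice, so the joint probability factors.
independent_ab = prob(lambda o: A(o) and B(o)) == prob(A) * prob(B)

# C constrains the first die, so A and C are not independent.
independent_ac = prob(lambda o: A(o) and C(o)) == prob(A) * prob(C)
```

Here `independent_ab` is true (1/6 = 1/2 · 1/3), while `independent_ac` is false (1/36 is not 1/2 · 1/36).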

# 3. Conditional Probability

The probability of $A$ given that $B$ has occurred, defined whenever $P(B) > 0$, is
$$P(A|B) = \frac{P(A,B)}{P(B)}$$
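
A small worked example may help. The sketch below again assumes two fair dice; `cond_prob`, `sum_is_8`, and `first_is_6` are hypothetical names used only for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (assumed setup).
outcomes = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability of the event described by the predicate `pred`."""
    return Fraction(sum(1 for o in outcomes if pred(o)), len(outcomes))

def cond_prob(a, b):
    """P(A|B) = P(A,B) / P(B); undefined when P(B) is zero."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda o: a(o) and b(o)) / p_b

sum_is_8 = lambda o: o[0] + o[1] == 8
first_is_6 = lambda o: o[0] == 6

p_prior = prob(sum_is_8)                     # 5/36
p_given = cond_prob(sum_is_8, first_is_6)    # 1/6
```

Learning that the first die shows 6 raises the probability of a total of 8 from 5/36 to 1/6.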

# 4. Random Variables

A *random variable*, usually written *X*, is a variable whose possible values are numerical outcomes of a random phenomenon.

Source: http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm#:~:text=Discrete%20Random%20Variables,then%20it%20must%20be%20discrete.

## 4.1 Discrete Random Variables

A *discrete random variable* is one which may take on only a countable number of distinct values, such as 0, 1, 2, 3, 4, and so on. Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include the number of children in a family, the number of patients in a doctor's surgery, and the number of defective light bulbs in a box of ten.

## 4.2 Continuous Random Variables

A *continuous random variable* is one which can take any value within an interval, so it has uncountably many possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, and the time required to run a mile.

# 5. Language Models

A language model uses machine learning to estimate a probability distribution over words, which it uses to predict the most likely next word in a sentence based on the preceding words. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition, and handwriting recognition.

Source: https://builtin.com/data-science/beginners-guide-language-models
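
The simplest instance of this idea is a bigram model, which predicts the next word from the single preceding word using relative frequencies. The following is a minimal sketch, not the method described in the source; the toy corpus and the function names `next_word_distribution` and `predict_next` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (assumed), already tokenized by whitespace.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# bigram_counts[prev][nxt] = number of times `nxt` follows `prev`.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    counts = bigram_counts.get(prev, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def predict_next(prev):
    """Most likely next word, or None if `prev` was never seen."""
    counts = bigram_counts.get(prev)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

dist = next_word_distribution("the")   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
guess = predict_next("the")            # 'cat'
```

Note that any word pair absent from the corpus gets probability zero under this estimate, which is the problem that smoothing techniques address.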

# 6. Maximum Likelihood Estimation for Binomials

Let $y$ be the number of successes resulting from $n$ independent trials with unknown success probability $p$, such that $y$ follows a binomial distribution:
$$y \sim \mathrm{Bin}(n,p)$$
Then, the maximum likelihood estimator of $p$ is
$$\hat{p} = \frac{y}{n}$$

Source: https://statproofbook.github.io/P/bin-mle.html
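
As a sketch of where this estimator comes from (the linked proof gives the full argument), write $\ell(p)$ for the log-likelihood of $p$ given $y$ successes in $n$ trials. The symbol $\ell$ is notation assumed here.

$$\ell(p) = \log\binom{n}{y} + y\log p + (n-y)\log(1-p)$$

For $0 < y < n$, setting the derivative with respect to $p$ to zero gives

$$\frac{d\ell}{dp} = \frac{y}{p} - \frac{n-y}{1-p} = 0 \quad\Longrightarrow\quad y(1-p) = (n-y)\,p \quad\Longrightarrow\quad \hat{p} = \frac{y}{n}$$

For example, $y = 7$ successes in $n = 10$ trials give $\hat{p} = 0.7$.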

# 7. Markov Chain

A **Markov chain** is a sequence of events or states in which the probability of each future state depends only on the current state, not on the states that came before it. It is a *stochastic model* that predicts the future based on the present state.

In a Markov chain, the process moves from state to state in discrete steps. Each state has its own probability of transitioning to every other state (and of staying where it is). These probabilities may be represented by a weighted directed graph or by a transition matrix.

Source: https://www.educative.io/answers/introduction-to-markov-chains
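
A transition table can be sketched in a few lines. The two weather states and their probabilities below are invented for illustration, and `step`, `simulate`, and `two_step_prob` are hypothetical helper names.

```python
import random

# Hypothetical two-state chain; each row gives the probabilities of the
# next state given the current one, and each row sums to 1.
states = ["sunny", "rainy"]
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state, rng):
    """Sample the next state using only the current state."""
    row = transition[state]
    return rng.choices(list(row), weights=list(row.values()))[0]

def simulate(start, n_steps, seed=0):
    """Return a path of n_steps transitions starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

def two_step_prob(src, dst):
    """P(state is dst two steps after src): sum over the middle state."""
    return sum(transition[src][mid] * transition[mid][dst] for mid in states)

path = simulate("sunny", 10)
p_sunny_twice = two_step_prob("sunny", "sunny")   # 0.8*0.8 + 0.2*0.4
```

`two_step_prob` is the same computation as one entry of the squared transition matrix.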

# 8. Markov Assumption

The **Markov assumption**, named in honor of the Russian mathematician Andrey Markov, is a central idea in probabilistic models, and in Markov processes in particular. At its core, the Markov assumption states that the future state of a process depends solely on the current state, regardless of the path taken to reach it. This property is commonly known as the "memoryless" property, or "absence of memory", of Markov processes.

Source: https://www.educative.io/answers/what-is-the-markov-assumption
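
Written out for a sequence of states $X_1, X_2, \dots$ (notation assumed here, not taken from the source), the assumption says

$$P(X_{t+1} \mid X_1, \dots, X_t) = P(X_{t+1} \mid X_t)$$

Applied to a language model over words $w_1, \dots, w_n$, it turns the exact chain-rule factorization into the bigram approximation

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_1)\prod_{i=2}^{n} P(w_i \mid w_{i-1})$$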

# 9. Why is word sparsity an issue?

Word sparsity can be an issue in natural language processing (NLP) and machine learning tasks for several reasons:

1. **Data Scarcity**: In many NLP applications, you're working with large vocabularies or feature spaces. When building models based on text data, it's common to have a vast number of unique words or tokens in a corpus. However, not all words appear frequently in the data. This leads to data sparsity, where many words occur only a few times or even just once in your dataset. Sparse data can be challenging for statistical models because they lack sufficient examples to learn meaningful patterns.

2. **Reduced Model Generalization**: Sparse data can lead to overfitting. When a model encounters rare words that it has only seen a few times during training, it may fit to the noise in the data rather than capturing the true underlying patterns. This can result in poor generalization to new, unseen data.

3. **Increased Model Complexity**: Dealing with sparse data often requires more complex models. For instance, if you're using a bag-of-words representation where each unique word is a feature, you might end up with a high-dimensional feature space. This can increase model complexity and the computational resources required for training and inference.

4. **Loss of Information**: Rare words or infrequent features may carry valuable information. In tasks like sentiment analysis or topic modeling, uncommon words can be strong indicators of sentiment or topic. When you discard or downweight these features due to their rarity, you lose potentially important information.

5. **Efficiency Challenges**: Sparse data can be computationally inefficient to process and store. In large-scale NLP applications, it can lead to performance bottlenecks and increased memory requirements.

To address issues related to word sparsity, NLP practitioners often use techniques like:

- **Text Preprocessing**: Removing or reducing word sparsity by applying techniques like stemming, lemmatization, or removing stop words.
- **Feature Engineering**: Using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to weight words based on their importance.
- **Dimensionality Reduction**: Applying techniques like Principal Component Analysis (PCA) or Truncated SVD (Singular Value Decomposition) to reduce the dimensionality of sparse feature spaces.
- **Word Embeddings**: Using word embeddings (e.g., Word2Vec, GloVe) to represent words as dense vectors in a lower-dimensional space. This not only reduces sparsity but also captures semantic relationships between words.
- **Data Augmentation**: Expanding the dataset by generating more data through techniques like back-translation or synonym replacement.
- **Transfer Learning**: Leveraging pre-trained models like BERT or GPT, which have learned from large corpora and can handle word sparsity more effectively.

Addressing word sparsity is crucial for improving the performance of NLP models, especially in tasks where capturing subtle linguistic nuances is essential.
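
Sparsity is easy to observe even on a tiny text. The sketch below uses a made-up sentence and counts how many word types occur only once and how many of the possible word pairs are ever observed; all variable names are assumptions for illustration.

```python
from collections import Counter

# Toy text (assumed); real corpora show the same pattern at a larger scale.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

# Word types and how often each occurs.
counts = Counter(tokens)
vocab_size = len(counts)

# Words seen exactly once: a model has a single example to learn from.
singletons = sorted(w for w, c in counts.items() if c == 1)
singleton_fraction = len(singletons) / vocab_size

# Distinct bigrams observed versus all bigrams the vocabulary allows.
observed_bigrams = set(zip(tokens, tokens[1:]))
possible_bigrams = vocab_size ** 2
unseen_fraction = 1 - len(observed_bigrams) / possible_bigrams
```

In this toy text, 5 of the 8 word types are singletons and only 10 of the 64 possible bigrams are observed, so most word pairs would receive zero probability without smoothing.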

# 10. Laplace Smoothing

Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes. Using Laplace smoothing, we can represent $P(w'|positive)$ as
$$P(w'|positive)=\frac{(\text{number of reviews with } w' \text{ and } y = \text{positive})+\alpha}{N+\alpha K}$$
Here,<br>
$\alpha$ represents the smoothing parameter<br>
$K$ represents the number of dimensions (features) in the data, and<br>
$N$ represents the number of reviews with y=positive<br>

If we choose $\alpha > 0$, the probability is no longer zero even if a word does not appear in the training dataset.

Source: https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece
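
The formula translates directly into code. This is a minimal sketch with invented numbers; the function name `laplace_smoothed_prob`, the default `k=2`, and the count of 100 positive reviews are assumptions for illustration.

```python
def laplace_smoothed_prob(count_w_and_class, n_class, alpha=1.0, k=2):
    """Smoothed estimate (count + alpha) / (N + alpha * K).

    count_w_and_class: number of reviews containing the word with y = positive
    n_class:           N, the number of reviews with y = positive
    alpha:             smoothing parameter (alpha = 1 is add-one smoothing)
    k:                 K, the number of dimensions (features); 2 is assumed here
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return (count_w_and_class + alpha) / (n_class + alpha * k)

n_positive = 100  # hypothetical number of positive reviews

p_unseen_word = laplace_smoothed_prob(0, n_positive)             # 1/102
p_seen_word = laplace_smoothed_prob(30, n_positive)              # 31/102
p_unsmoothed = laplace_smoothed_prob(0, n_positive, alpha=0.0)   # 0.0
```

With `alpha=0.0` the estimate falls back to the raw relative frequency, which is exactly the case where an unseen word zeroes out the whole Naïve Bayes product.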

# 11. Good Turing Smoothing

Good-Turing smoothing is a technique used in natural language processing (NLP) and machine learning to estimate the probabilities of unseen or rare events, particularly in the context of language modeling and text classification. It was developed by Alan Turing and I. J. Good. Unlike Laplace (add-one) smoothing, which adds a fixed constant to every count, it re-estimates each count from how many events share that count.

The primary motivation behind Good-Turing smoothing is to address the "zero-frequency problem." In language modeling, many words or n-grams may occur in a corpus only a few times, or not at all, making it difficult to estimate their probabilities accurately. Good-Turing smoothing helps by redistributing some of the probability mass from more frequent events to less frequent or unseen events.

Here's an overview of how Good-Turing smoothing works:

1. **Count the Frequency of Events**: First, count how often each event (e.g., a word or n-gram) occurs in your training data. Then compute the "frequency of frequencies" Nc: the number of distinct events that occur exactly c times. For example, N1 is the number of events that occurred only once, N2 is the number of events that occurred twice, and so on.

2. **Estimate the Probability of Unseen Events (Zero Counts)**: Good-Turing smoothing estimates the total probability mass of all unseen events (events with zero counts) from the number of events that were seen exactly once. The intuition is that events seen once in the sample are a good proxy for events that have not been seen yet. The formula is:

   P_unseen = N1 / N

   Where:
   - P_unseen is the estimated total probability of all unseen events combined.
   - N1 is the number of distinct events that occurred exactly once.
   - N is the total number of observations in the dataset (the sum of c * Nc over all counts c).

3. **Smooth Probabilities for Seen Events**: For events with nonzero counts, you replace the raw count c with an adjusted count c*, based on the ratio of the number of events seen c+1 times to the number of events seen c times. The adjusted counts are generally smaller than the raw counts, which frees up the probability mass assigned to unseen events. The adjusted count and the smoothed probability are calculated as:

   c* = (c + 1) * N(c+1) / Nc

   P_smoothed = c* / N

   Where:
   - P_smoothed is the smoothed probability of an event with count c.
   - c is the raw count of the event, and c* is its adjusted count.
   - Nc is the number of distinct events that occurred exactly c times.
   - N(c+1) is the number of distinct events that occurred exactly c+1 times.
   - N is the total number of observations in the dataset.

Good-Turing smoothing effectively reduces the probability mass assigned to frequent events and reallocates it to rare or unseen events, which can lead to more accurate probability estimates, especially for unseen events. This technique is commonly used in tasks like language modeling with n-grams, where you need to estimate the likelihood of word sequences that may not have been observed in your training data.
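
The steps above can be sketched as follows, using the standard Good-Turing estimates: unseen mass N1 / N and adjusted count c* = (c + 1) * N(c+1) / Nc. The species counts are a made-up toy sample and `good_turing` is a hypothetical function name. Falling back to the raw count when no event has count c+1 is a simplification assumed here; practical implementations usually smooth the Nc values first, and with this fallback the resulting estimates are not guaranteed to sum to exactly one.

```python
from collections import Counter

def good_turing(counts):
    """Return (unseen probability mass, adjusted counts, smoothed probabilities)."""
    n_total = sum(counts.values())              # N: total observations
    freq_of_freq = Counter(counts.values())     # Nc: number of events seen c times

    # Total probability mass reserved for all unseen events: N1 / N.
    p_unseen = freq_of_freq[1] / n_total

    adjusted = {}
    for event, c in counts.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
        if n_c1 > 0:
            adjusted[event] = (c + 1) * n_c1 / n_c   # c* = (c+1) * N(c+1) / Nc
        else:
            adjusted[event] = float(c)               # assumed fallback: keep raw count
    probs = {event: c_star / n_total for event, c_star in adjusted.items()}
    return p_unseen, adjusted, probs

# Toy sample (assumed): 18 fish caught across six species.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
p_unseen, adjusted, probs = good_turing(catch)
```

In this sample N1 = 3 and N = 18, so one sixth of the probability mass is reserved for unseen species, and each species seen once has its count discounted from 1 to 2/3.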

# 12. LMs in topic modeling

In topic modeling, LMs (Language Models) can refer to a specific type of model used for estimating the probability of observing a sequence of words in a document or a set of documents. Language models play a crucial role in various aspects of topic modeling, including document classification, topic assignment, and text generation. Here's how LMs are used in topic modeling:

1. **Document Classification**: Language models can be employed for classifying documents into predefined topics or categories. For instance, you might have a collection of news articles and want to categorize them into topics like "politics," "sports," or "technology." LMs can calculate the probability of observing the words in each document given a specific topic model. The document is then assigned to the topic with the highest probability.

2. **Topic Assignment**: In topic modeling techniques like Latent Dirichlet Allocation (LDA), documents are assumed to be generated based on a mixture of topics. LMs can be used to estimate the likelihood of a document being generated by a particular topic. This information is vital when assigning topics to documents in an unsupervised manner.

3. **Text Generation**: LMs, especially neural language models like GPT-3, can be used to generate text based on a given topic. You can provide a topic or a set of keywords, and the language model will generate coherent text that is contextually relevant to the topic. This is useful for content generation, chatbots, and more.

4. **Word Probability Estimation**: LMs can estimate the probability of observing specific words or phrases in a document or a collection of documents. This information can be used to identify important keywords or phrases associated with particular topics.

5. **Model Evaluation**: LMs can help evaluate the quality of topic models. For instance, you can calculate the likelihood of observing your corpus of documents using an LDA model. The higher the likelihood, the better the model fits the data. LMs can also be used in perplexity calculations to assess how well a language model generalizes to unseen data.

6. **Document Similarity**: LMs can be used to measure the similarity between documents based on the probability distributions of words. Documents with similar word probability distributions are likely to be related in terms of topics or content.

7. **Summarization**: LMs can assist in generating document summaries. By identifying the most probable words or phrases in a document, you can create concise summaries that capture the key points or topics discussed.
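
For the evaluation use case in the list above, perplexity can be computed from the probabilities a model assigns to each token of held-out text. This is a minimal sketch; the function name `perplexity` and the example probabilities are assumptions for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0 for p in token_probs):
        raise ValueError("every token must have positive probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# A model that spreads its belief evenly over 4 choices at every step
# has a perplexity of about 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Assigning higher probability to the observed tokens lowers perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.85])
```

Lower perplexity means the model is less "surprised" by the held-out text. A single zero-probability token makes perplexity undefined, which is one reason smoothing matters for evaluation.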

In the context of modern NLP, pre-trained language models like BERT, GPT-3, and others have been used for various topic modeling tasks due to their ability to capture complex language patterns and semantics. Researchers and practitioners often fine-tune these models on specific topic modeling tasks to achieve state-of-the-art results.

In summary, language models play a multifaceted role in topic modeling, from document classification and topic assignment to text generation and model evaluation. They enable more accurate and sophisticated approaches to understanding and organizing text data into meaningful topics.

# LaTeX Styling Used in this Markdown Document

1. Set notation, union: https://latex-tutorial.com/union-latex/
2. Summation: https://www.physicsread.com/latex-summation/
3. Set notations: https://www.geeksforgeeks.org/set-notations-in-latex/
4. Equations: https://www.fabriziomusacchio.com/blog/2021-08-10-How_to_use_LaTeX_in_Markdown/
5. Similarly Equivalent: https://www.overleaf.com/learn/latex/List_of_Greek_letters_and_math_symbols
6. Line break: https://www.markdownguide.org/basic-syntax/#:~:text=To%20create%20a%20line%20break,spaces%2C%20and%20then%20type%20return.
7. Spacing in math equations: http://www.emerson.emory.edu/services/latex/latex_119.html