# 1. What constitutes a probability measure?

A *probability measure* (or *probability distribution*) $P$ on the sample space ($S$, <span style="font-family: 'cursive';">S</span>) is a real-valued function defined on the collection of events <span style="font-family: 'cursive';">S</span> that satisfies the following axioms:
1. $P(A) \ge 0$ for every event $A$
2. $P(S) = 1$
3. If $\lbrace A_i : i \in I \rbrace$ is a countable, pairwise disjoint collection of events, then
$$P\left(\bigcup_{i\in I}A_i\right)=\sum_{i\in I}P(A_i)$$

Source: https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/02%3A_Probability_Spaces/2.03%3A_Probability_Measures#:~:text=satisfies%20certain%20axioms.-,Definition,P(S)%3D1.
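
The axioms can be checked mechanically on a small finite example. The sketch below assumes a fair six-sided die (a hypothetical choice, not from the source) and takes every subset of the sample space as an event. The names `weights`, `P`, and `all_events` are illustrative only. For a finite space, axiom 3 reduces to additivity over pairs of disjoint events, which is what the code tests.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical example: a fair six-sided die, every outcome equally likely.
S = frozenset(range(1, 7))
weights = {s: Fraction(1, 6) for s in S}

def P(event):
    """Probability of an event (a subset of S) as the sum of its outcome weights."""
    return sum((weights[s] for s in event), Fraction(0))

def all_events(space):
    """Every subset of a finite sample space (its power set)."""
    items = sorted(space)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(sub) for sub in subsets]

events = all_events(S)

# Axiom 1: non-negativity.
axiom1 = all(P(A) >= 0 for A in events)
# Axiom 2: the whole sample space has probability 1.
axiom2 = P(S) == 1
# Axiom 3 (finite case): additivity over pairs of disjoint events.
axiom3 = all(P(A | B) == P(A) + P(B) for A in events for B in events if not (A & B))

print(axiom1, axiom2, axiom3)
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.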

# 2. Independence

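Two events $A$ and $B$ are *independent* if the probability that both occur equals the product of their individual probabilities:
$$P(A,B) = P(A)P(B)$$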
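
As a rough illustration, the product rule can be tested by enumerating outcomes. The sketch below assumes two fair dice (a hypothetical setup) and compares $P(A,B)$ with $P(A)P(B)$ for two pairs of events. The helper names `prob` and `independent` are made up for this example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def independent(a, b):
    """True when P(A, B) equals P(A) * P(B)."""
    joint = prob(lambda o: a(o) and b(o))
    return joint == prob(a) * prob(b)

def first_even(o):
    return o[0] % 2 == 0

def sum_is_7(o):
    return o[0] + o[1] == 7

def sum_is_8(o):
    return o[0] + o[1] == 8

print(independent(first_even, sum_is_7))  # True: 1/12 == 1/2 * 1/6
print(independent(first_even, sum_is_8))  # False: 1/12 != 1/2 * 5/36
```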

# 3. Conditional Probability

For events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A,B)}{P(B)}$$
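
A small worked example may make the formula concrete. The sketch below again assumes two fair dice (hypothetical) and computes the probability that the sum is 8 given that the first die is even. The helper names `prob` and `conditional` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(a, b):
    """P(A | B) = P(A, B) / P(B); undefined when P(B) is zero."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda o: a(o) and b(o)) / p_b

def sum_is_8(o):
    return o[0] + o[1] == 8

def first_even(o):
    return o[0] % 2 == 0

print(prob(sum_is_8))                     # 5/36
print(conditional(sum_is_8, first_even))  # 1/6
```

Conditioning on the first die being even raises the probability of a sum of 8 from 5/36 to 1/6, so the two events are not independent.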

# 4. Random Variables

A *random variable*, usually written *X*, is a variable whose possible values are numerical outcomes of a random phenomenon.

Source: http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm#:~:text=Discrete%20Random%20Variables,then%20it%20must%20be%20discrete.

## 4.1 Discrete Random Variables

A *discrete random variable* is one which may take on only a countable number of distinct values, such as 0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include the number of children in a family, the number of patients in a doctor's surgery, and the number of defective light bulbs in a box of ten.

## 4.2 Continuous Random Variables

A *continuous random variable* is one which can take any value in a continuum, such as an interval of real numbers, so its possible values are uncountably infinite. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, and the time required to run a mile.
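
The difference between the two kinds of random variable can be sketched with simulated draws. The example below is only an illustration: the defect rate, the range of mile times, and the uniform distribution are all assumptions chosen for simplicity, and the function names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def defective_count(box_size=10, defect_rate=0.1):
    """Discrete random variable: number of defective bulbs in one box (an integer count)."""
    return sum(1 for _ in range(box_size) if rng.random() < defect_rate)

def mile_time(low=4.0, high=10.0):
    """Continuous random variable: a mile time in minutes, drawn uniformly from [low, high]."""
    return rng.uniform(low, high)

counts = [defective_count() for _ in range(5)]
times = [mile_time() for _ in range(5)]

print(counts)  # integers between 0 and 10
print(times)   # real numbers between 4.0 and 10.0
```

The count can only land on one of eleven values, while the time can fall anywhere in the interval.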

# 5. Language Models

A language model uses machine learning to estimate a probability distribution over words, which it uses to predict the most likely next word in a sentence based on the preceding words. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition, and handwriting recognition.

Source: https://builtin.com/data-science/beginners-guide-language-models
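
One very simple way to build such a distribution is to count which word follows which in a corpus. The sketch below is a toy bigram model under stated assumptions: the three-sentence corpus is invented, the model looks only at the single previous word, and the function names are hypothetical. Real language models are far larger and usually neural.

```python
from collections import Counter, defaultdict

# Invented toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# bigram_counts[prev][nxt] = number of times `nxt` directly follows `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    counts = bigram_counts.get(prev, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def predict_next(prev):
    """Most likely next word, or None if `prev` was never seen with a successor."""
    dist = next_word_distribution(prev)
    return max(dist, key=dist.get) if dist else None

print(predict_next("the"))  # cat
print(predict_next("sat"))  # on
```

A word that never appeared in the corpus yields an empty distribution.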

# 6. Maximum Likelihood Estimation for Binomials

Let $y$ be the number of successes resulting from $n$ independent trials with unknown success probability $p$, such that $y$ follows a binomial distribution:
$$y \sim \mathrm{Bin}(n,p)$$
Then, the maximum likelihood estimator of $p$ is
$$\hat{p} = \frac{y}{n}$$

Source: https://statproofbook.github.io/P/bin-mle.html
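
A brief sketch of why this holds, using the standard log-likelihood argument (the symbol $\ell$ is introduced here for convenience). With $y$ successes in $n$ trials, the log-likelihood of $p$ is

$$\ell(p) = \log\binom{n}{y} + y\log p + (n-y)\log(1-p)$$

Setting the derivative with respect to $p$ to zero gives

$$\frac{d\ell}{dp} = \frac{y}{p} - \frac{n-y}{1-p} = 0 \quad\Longrightarrow\quad y(1-p) = (n-y)p \quad\Longrightarrow\quad \hat{p} = \frac{y}{n}$$

For $0 < y < n$, the second derivative $-\frac{y}{p^2} - \frac{n-y}{(1-p)^2}$ is negative, so this stationary point is a maximum. For example, $y = 7$ successes in $n = 10$ trials gives $\hat{p} = 0.7$.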

# 7. Markov Chain

A **Markov chain** is a mathematical model of a sequence of events or states in which the probability of each future event depends only on the current state and not on the states that came before it. It is a *stochastic model* that predicts the future based on the present state.

In a Markov chain, the process moves from state to state in discrete steps. Each state has its own probability of transitioning to every other state. These probabilities may be represented by a weighted directed graph or by a transition matrix.

Source: https://www.educative.io/answers/introduction-to-markov-chains
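
A transition matrix can be sketched in a few lines. The two-state weather chain below is invented for illustration, and its probabilities are assumptions. Entry `transition[i][j]` holds the probability of moving from `states[i]` to `states[j]`, so every row sums to 1.

```python
from fractions import Fraction

states = ["sunny", "rainy"]

# Hypothetical transition matrix: transition[i][j] = P(next = states[j] | current = states[i]).
transition = [
    [Fraction(9, 10), Fraction(1, 10)],  # from sunny
    [Fraction(1, 2), Fraction(1, 2)],    # from rainy
]

def step(distribution):
    """Advance a distribution over states by one step of the chain."""
    n = len(states)
    return [sum(distribution[i] * transition[i][j] for i in range(n)) for j in range(n)]

# Start in "sunny" with certainty and advance three steps.
dist = [Fraction(1), Fraction(0)]
for _ in range(3):
    dist = step(dist)

print(dict(zip(states, dist)))  # sunny: 211/250, rainy: 39/250
```

Each step uses only the current distribution, never the earlier ones.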

# 8. Markov Assumption

The **Markov assumption**, named in honor of the Russian mathematician Andrey Markov, is a central idea in probabilistic models, and in Markov processes in particular. At its core, the Markov assumption states that the future state of a process depends solely on the current state, regardless of the path taken to reach it. This property is commonly known as the "memoryless" property, or "absence of memory", of Markov processes.

Source: https://www.educative.io/answers/what-is-the-markov-assumption
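
In symbols, for a sequence of random variables $X_1, X_2, \dots$ (notation introduced here, not taken from the source), the assumption can be written as

$$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_1) = P(X_{t+1} \mid X_t)$$

Applied to a sequence of words $w_1, \dots, w_n$, the chain rule of probability combined with this assumption gives the usual bigram approximation, where the $i = 1$ term is read as $P(w_1)$:

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$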

# 9. Why is word sparsity an issue?

Word sparsity can be an issue in natural language processing (NLP) and machine learning tasks for several reasons:

1. **Data Scarcity**: In many NLP applications, you're working with large vocabularies or feature spaces. When building models based on text data, it's common to have a vast number of unique words or tokens in a corpus. However, not all words appear frequently in the data. This leads to data sparsity, where many words occur only a few times or even just once in your dataset. Sparse data can be challenging for statistical models because they lack sufficient examples to learn meaningful patterns.

2. **Reduced Model Generalization**: Sparse data can lead to overfitting. When a model encounters rare words that it has only seen a few times during training, it may fit to the noise in the data rather than capturing the true underlying patterns. This can result in poor generalization to new, unseen data.

3. **Increased Model Complexity**: Dealing with sparse data often requires more complex models. For instance, if you're using a bag-of-words representation where each unique word is a feature, you might end up with a high-dimensional feature space. This can increase model complexity and the computational resources required for training and inference.

4. **Loss of Information**: Rare words or infrequent features may carry valuable information. In tasks like sentiment analysis or topic modeling, uncommon words can be strong indicators of sentiment or topic. When you discard or downweight these features due to their rarity, you lose potentially important information.

5. **Efficiency Challenges**: Sparse data can be computationally inefficient to process and store. In large-scale NLP applications, it can lead to performance bottlenecks and increased memory requirements.
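
The scale of the problem shows up even in a tiny example. The sketch below uses an invented three-sentence text and whitespace tokenization (both assumptions). It counts how many word types occur only once and how few of the possible word pairs are actually observed.

```python
from collections import Counter

# Invented toy text; punctuation is kept as its own token for simplicity.
text = (
    "the cat sat on the mat . the dog sat on the rug . "
    "a bird sang in the old oak tree ."
)
tokens = text.split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

vocab_size = len(unigram_counts)
# Word types that occur exactly once in the text.
singletons = [w for w, c in unigram_counts.items() if c == 1]
# Every ordered pair of vocabulary words is a possible bigram.
possible_bigrams = vocab_size ** 2
seen_bigrams = len(bigram_counts)

print(f"vocabulary size: {vocab_size}")
print(f"types seen once: {len(singletons)}")
print(f"bigrams seen: {seen_bigrams} of {possible_bigrams} possible")
```

Most word types here appear once, and the large majority of possible bigrams never appear at all, so a count-based model has no evidence for them.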

To address issues related to word sparsity, NLP practitioners often use techniques like:

- **Text Preprocessing**: Reducing word sparsity by applying techniques like stemming, lemmatization, or stop-word removal.
- **Feature Engineering**: Using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to weight words based on their importance.
- **Dimensionality Reduction**: Applying techniques like Principal Component Analysis (PCA) or Truncated SVD (Singular Value Decomposition) to reduce the dimensionality of sparse feature spaces.
- **Word Embeddings**: Using word embeddings (e.g., Word2Vec, GloVe) to represent words as dense vectors in a lower-dimensional space. This not only reduces sparsity but also captures semantic relationships between words.
- **Data Augmentation**: Expanding the dataset by generating more data through techniques like back-translation or synonym replacement.
- **Transfer Learning**: Leveraging pre-trained models like BERT or GPT, which have learned from large corpora and can handle word sparsity more effectively.

Addressing word sparsity is crucial for improving the performance of NLP models, especially in tasks where capturing subtle linguistic nuances is essential.

# LaTeX Styling Used in this Markdown Document

1. Set notation, union: https://latex-tutorial.com/union-latex/
2. Summation: https://www.physicsread.com/latex-summation/
3. Set notations: https://www.geeksforgeeks.org/set-notations-in-latex/
4. Equations: https://www.fabriziomusacchio.com/blog/2021-08-10-How_to_use_LaTeX_in_Markdown/
5. Similarly Equivalent: https://www.overleaf.com/learn/latex/List_of_Greek_letters_and_math_symbols