# 1. What constitutes a probability measure?

A *probability measure* (or *probability distribution*) $P$ on the sample space ($S$, <span style="font-family: 'cursive';">S</span>) is a real-valued function defined on the collection of events <span style="font-family: 'cursive';">S</span> that satisfies the following axioms:
1. $P(A) \geq 0$ for every event $A$
2. $P(S) = 1$
3. If $\{A_i:i\in I\}$ is a countable, pairwise disjoint collection of events then 
$$P(\bigcup_{i\in I}A_i)=\sum_{i\in I}P(A_i)$$
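
The three axioms can be checked mechanically on a small example. The sketch below assumes a hypothetical finite sample space (one roll of a fair die) with the uniform measure; the names `S`, `P`, and `all_events` are illustrative, and the third axiom is only checked for pairs of disjoint events.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical finite sample space: one roll of a fair six-sided die.
S = frozenset(range(1, 7))

def P(event):
    """Uniform probability measure: |event| / |S|."""
    return Fraction(len(event), len(S))

def all_events(space):
    """Every subset of the sample space (the power set)."""
    items = list(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

events = all_events(S)

# Axiom 1: non-negativity.
axiom1 = all(P(A) >= 0 for A in events)
# Axiom 2: the whole sample space has probability 1.
axiom2 = P(S) == 1
# Axiom 3 (checked for pairs): additivity over disjoint events.
axiom3 = all(
    P(A | B) == P(A) + P(B)
    for A in events for B in events if not (A & B)
)

print(axiom1, axiom2, axiom3)
```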

Source: https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/02%3A_Probability_Spaces/2.03%3A_Probability_Measures#:~:text=satisfies%20certain%20axioms.-,Definition,P(S)%3D1.

# 2. Independence

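Two events $A$ and $B$ are **independent** if the probability that both occur equals the product of their individual probabilities:
$$P(A,B) = P(A)P(B)$$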

# 3. Conditional Probability

The probability of $A$ given that $B$ has occurred, defined when $P(B)>0$, is
$$P(A|B) = \frac{P(A,B)}{P(B)}$$
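
A small sketch of both definitions, $P(A,B) = P(A)P(B)$ and $P(A|B) = P(A,B)/P(B)$, assuming two fair dice as the sample space. The event names and helper functions are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice (an illustrative assumption).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def joint(e1, e2):
    """P(e1, e2): both events occur."""
    return prob(lambda o: e1(o) and e2(o))

def conditional(e1, e2):
    """P(e1 | e2) = P(e1, e2) / P(e2), defined only when P(e2) > 0."""
    return joint(e1, e2) / prob(e2)

def independent(e1, e2):
    """True when P(e1, e2) == P(e1) * P(e2)."""
    return joint(e1, e2) == prob(e1) * prob(e2)

def first_is_even(o):
    return o[0] % 2 == 0

def sum_is_seven(o):
    return o[0] + o[1] == 7

def sum_is_eight(o):
    return o[0] + o[1] == 8

print(independent(first_is_even, sum_is_seven))   # True
print(independent(first_is_even, sum_is_eight))   # False
print(conditional(first_is_even, sum_is_eight))   # 3/5
```

Knowing that the sum is eight changes the probability that the first die is even from 1/2 to 3/5, which is exactly what a failure of independence means.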

# 4. Random Variables

A *random variable*, usually written *X*, is a variable whose possible values are numerical outcomes of a random phenomenon.

Source: http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm#:~:text=Discrete%20Random%20Variables,then%20it%20must%20be%20discrete.

## 4.1 Discrete Random Variables

A *discrete random variable* is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include the number of children in a family, the number of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.

## 4.2 Continuous Random Variables

A *continuous random variable* is one which takes an uncountably infinite number of possible values, typically any value within an interval. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.

# 5. Language Models

A language model uses machine learning to estimate a probability distribution over words, which is used to predict the most likely next word in a sentence based on the preceding words. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition and handwriting recognition.

Source: https://builtin.com/data-science/beginners-guide-language-models

# 6. Maximum Likelihood Estimation for Binomials

Let $y$ be the number of successes resulting from $n$ independent trials with unknown success probability $p$, such that $y$ follows a binomial distribution:
$$y\sim \mathrm{Bin}(n,p)$$
Then, the maximum likelihood estimator of $p$ is
$$\hat{p} = \frac{y}{n}$$
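
The closed form $\hat{p} = y/n$ can be sanity-checked numerically. The sketch below uses hypothetical data (`y = 7`, `n = 20`) and a plain grid search over $p$; it illustrates the result and is not how the estimator is derived.

```python
import math

def binomial_log_likelihood(p, y, n):
    """Log of C(n, y) * p^y * (1 - p)^(n - y), valid for 0 < p < 1."""
    return (
        math.log(math.comb(n, y))
        + y * math.log(p)
        + (n - y) * math.log(1 - p)
    )

y, n = 7, 20  # hypothetical data: 7 successes in 20 trials

# Evaluate the log-likelihood on a grid of candidate values for p.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: binomial_log_likelihood(p, y, n))

print(p_hat_grid, y / n)
```

The grid maximizer lands on the grid point equal to $y/n$.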

Source: https://statproofbook.github.io/P/bin-mle.html

# 7. Markov Chain

A **Markov chain** is a sequence of events or states in which the probability of each future event depends only on the current state and not on the states that preceded it. It is a *stochastic model* that predicts the future based on the present state.

In a Markov chain, the process moves between states in discrete steps. Each state has its own probability of transitioning to every other state. These probabilities may be represented by a weighted directed graph or by a transition matrix.
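
As a minimal sketch of the transition-matrix view, the code below assumes a hypothetical two-state weather chain. The states and the probabilities in `T` are made up for illustration.

```python
# Hypothetical two-state chain; each row of T sums to 1.
states = ["sunny", "rainy"]
T = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(dist):
    """One transition: new_dist[s] = sum over r of dist[r] * T[r][s]."""
    return {s: sum(dist[r] * T[r][s] for r in states) for s in states}

# Start in "sunny" with certainty and look one step ahead.
start = {"sunny": 1.0, "rainy": 0.0}
after_one = step(start)

# Repeated steps approach the chain's long-run (stationary) distribution.
dist = start
for _ in range(50):
    dist = step(dist)

print(after_one)
print(dist)
```

For this particular matrix the long-run distribution is 5/6 sunny and 1/6 rainy, regardless of the starting state.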

Source: https://www.educative.io/answers/introduction-to-markov-chains

# 8. Markov Assumption

The **Markov assumption**, named in honor of the Russian mathematician Andrey Markov, is a central idea in probabilistic models, and especially in Markov processes. At its core, the Markov assumption proposes that the future state of a process depends solely on the current state, disregarding the path taken to reach it. This attribute is commonly known as the "memoryless" property or "absence of memory" in Markov processes.
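
In language modeling, the Markov assumption is what makes a bigram model possible: the next word is predicted from the current word alone. The sketch below assumes a tiny made-up corpus and estimates $P(\text{next}\mid\text{current})$ from relative counts; `train_bigram` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate P(next | current) from counts of adjacent word pairs."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        pair_counts[current][nxt] += 1
    model = {}
    for current, counter in pair_counts.items():
        total = sum(counter.values())
        model[current] = {w: c / total for w, c in counter.items()}
    return model

# Toy corpus, made up for illustration.
tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)

print(model["the"])
print(model["cat"])
```

Any word pair that never appears in the corpus gets no entry at all, which is the zero-probability problem that smoothing techniques address.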

Source: https://www.educative.io/answers/what-is-the-markov-assumption

# 9. Why is word sparsity an issue?

Word sparsity can be an issue in natural language processing (NLP) and machine learning tasks for several reasons:

1. **Data Scarcity**: In many NLP applications, you're working with large vocabularies or feature spaces. When building models based on text data, it's common to have a vast number of unique words or tokens in a corpus. However, not all words appear frequently in the data. This leads to data sparsity, where many words occur only a few times or even just once in your dataset. Sparse data can be challenging for statistical models because they lack sufficient examples to learn meaningful patterns.

2. **Reduced Model Generalization**: Sparse data can lead to overfitting. When a model encounters rare words that it has only seen a few times during training, it may fit to the noise in the data rather than capturing the true underlying patterns. This can result in poor generalization to new, unseen data.

3. **Increased Model Complexity**: Dealing with sparse data often requires more complex models. For instance, if you're using a bag-of-words representation where each unique word is a feature, you might end up with a high-dimensional feature space. This can increase model complexity and the computational resources required for training and inference.

4. **Loss of Information**: Rare words or infrequent features may carry valuable information. In tasks like sentiment analysis or topic modeling, uncommon words can be strong indicators of sentiment or topic. When you discard or downweight these features due to their rarity, you lose potentially important information.

5. **Efficiency Challenges**: Sparse data can be computationally inefficient to process and store. In large-scale NLP applications, it can lead to performance bottlenecks and increased memory requirements.

To address issues related to word sparsity, NLP practitioners often use techniques like:

- **Text Preprocessing**: Removing or reducing word sparsity by applying techniques like stemming, lemmatization, or removing stop words.
- **Feature Engineering**: Using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to weight words based on their importance.
- **Dimensionality Reduction**: Applying techniques like Principal Component Analysis (PCA) or Truncated SVD (Singular Value Decomposition) to reduce the dimensionality of sparse feature spaces.
- **Word Embeddings**: Using word embeddings (e.g., Word2Vec, GloVe) to represent words as dense vectors in a lower-dimensional space. This not only reduces sparsity but also captures semantic relationships between words.
- **Data Augmentation**: Expanding the dataset by generating more data through techniques like back-translation or synonym replacement.
- **Transfer Learning**: Leveraging pre-trained models like BERT or GPT, which have learned from large corpora and can handle word sparsity more effectively.

Addressing word sparsity is crucial for improving the performance of NLP models, especially in tasks where capturing subtle linguistic nuances is essential.

# 10. Laplace Smoothing

Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes. Using Laplace smoothing, we can represent $P(w'|\text{positive})$ as
$$P(w'|\text{positive})=\frac{(\text{number of reviews with } w' \text{ and } y=\text{positive})+\alpha}{N+\alpha K}$$
Here,<br>
$\alpha$ represents the smoothing parameter<br>
$K$ represents the number of dimensions (features) in the data, and<br>
$N$ represents the number of reviews with $y=\text{positive}$<br>

If we choose $\alpha > 0$, the probability will no longer be zero even if a word is not present in the training dataset.
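
A minimal sketch of the smoothed estimate, (count + $\alpha$) / ($N$ + $\alpha K$). The function name and the example numbers (100 positive reviews, $K = 2$) are hypothetical.

```python
def laplace_smoothed(count_w_and_positive, n_positive, k, alpha=1.0):
    """Smoothed P(w' | positive) = (count + alpha) / (N + alpha * K)."""
    return (count_w_and_positive + alpha) / (n_positive + alpha * k)

# A word never seen in a positive review (count = 0), hypothetical N and K.
unsmoothed = laplace_smoothed(0, 100, 2, alpha=0.0)
smoothed = laplace_smoothed(0, 100, 2, alpha=1.0)

print(unsmoothed)  # 0.0
print(smoothed)    # 1/102, small but nonzero
```

Larger values of `alpha` pull every estimate toward the uniform value $1/K$.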

Source: https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece

# 11. Good Turing Smoothing

Good-Turing smoothing is a technique used in natural language processing (NLP) and machine learning to estimate the probabilities of unseen or rare events, particularly in the context of language modeling and text classification. It was developed by Alan Turing and I. J. Good. Like Laplace (add-one) smoothing, it reserves probability mass for unseen events, but it does so using the observed counts of rare events rather than a fixed additive constant.

The primary motivation behind Good-Turing smoothing is to address the "zero-frequency problem." In language modeling, many words or n-grams may occur in a corpus only a few times, or not at all, making it difficult to estimate their probabilities accurately. Good-Turing smoothing helps by redistributing some of the probability mass from more frequent events to less frequent or unseen events.

Here's an overview of how Good-Turing smoothing works:

1. **Count the Frequency of Events**: First, you count the frequency of each event (e.g., words, n-grams) in your training data. Then you count how many distinct events occurred exactly "c" times; let's call these counts "Nc." For example, N1 is the count of events that occurred only once, N2 is the count of events that occurred twice, and so on.

2. **Estimate the Probability of Unseen Events (Zero Counts)**: For events with zero counts (unseen events), Good-Turing smoothing estimates their total probability from the events that were seen exactly once, since events seen once are the best available evidence of how often new events turn up. The formula for the total probability mass of unseen events is:

   P_unseen = N1 / N

   Where:
   - P_unseen is the estimated total probability of all unseen events.
   - N1 is the count of events that occurred only once.
   - N is the total number of observations (event occurrences) in the dataset.

3. **Smooth Probabilities for Seen Events**: For events that have nonzero counts, you replace each raw count c with an adjusted count c_adj. The formula is based on the ratio of the count of events with a count of (c+1) to the count of events with a count of (c). This is used to redistribute some probability mass from higher-frequency events to lower-frequency ones. The adjusted count and the smoothed probability are calculated as:

   c_adj = (c+1) * N(c+1) / Nc

   P_smoothed = c_adj / N

   Where:
   - c_adj is the adjusted count and P_smoothed is the smoothed probability of an event with count "c."
   - c is the count of the event.
   - Nc is the count of events with a count of "c."
   - N(c+1) is the count of events with a count of "c+1."
   - N is the total number of observations (event occurrences) in the dataset.

Good-Turing smoothing effectively reduces the probability mass assigned to frequent events and reallocates it to rare or unseen events, which can lead to more accurate probability estimates, especially for unseen events. This technique is commonly used in tasks like language modeling with n-grams, where you need to estimate the likelihood of word sequences that may not have been observed in your training data.
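
The procedure can be sketched in a few lines, using the standard Good-Turing estimates: unseen mass N1 / N and adjusted count (c+1) * N(c+1) / Nc. The toy token list is made up, and falling back to the raw count when N(c+1) is zero is a simplifying assumption of this sketch; practical implementations usually smooth the Nc values first.

```python
from collections import Counter

def good_turing(tokens):
    """Return (unseen probability mass, adjusted count per event)."""
    counts = Counter(tokens)        # c for each distinct event
    n_c = Counter(counts.values())  # Nc: number of events seen exactly c times
    total = sum(counts.values())    # N: total number of observations

    p_unseen = n_c[1] / total       # mass reserved for unseen events

    adjusted = {}
    for event, c in counts.items():
        if n_c[c + 1] > 0:
            adjusted[event] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            # Assumption of this sketch: keep the raw count when N(c+1) = 0.
            adjusted[event] = float(c)
    return p_unseen, adjusted

tokens = "a a a b b c c d e f".split()  # toy data
p_unseen, adjusted = good_turing(tokens)

print(p_unseen)
print(adjusted)
```

Here three of the ten observations are singletons, so 0.3 of the probability mass is reserved for unseen events, and the adjusted counts of the seen events shrink to make room for it.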

# 12. LMs in topic modeling

In topic modeling, LMs (Language Models) can refer to a specific type of model used for estimating the probability of observing a sequence of words in a document or a set of documents. Language models play a crucial role in various aspects of topic modeling, including document classification, topic assignment, and text generation. Here's how LMs are used in topic modeling:

1. **Document Classification**: Language models can be employed for classifying documents into predefined topics or categories. For instance, you might have a collection of news articles and want to categorize them into topics like "politics," "sports," or "technology." LMs can calculate the probability of observing the words in each document given a specific topic model. The document is then assigned to the topic with the highest probability.

2. **Topic Assignment**: In topic modeling techniques like Latent Dirichlet Allocation (LDA), documents are assumed to be generated based on a mixture of topics. LMs can be used to estimate the likelihood of a document being generated by a particular topic. This information is vital when assigning topics to documents in an unsupervised manner.

3. **Text Generation**: LMs, especially neural language models like GPT-3, can be used to generate text based on a given topic. You can provide a topic or a set of keywords, and the language model will generate coherent text that is contextually relevant to the topic. This is useful for content generation, chatbots, and more.

4. **Word Probability Estimation**: LMs can estimate the probability of observing specific words or phrases in a document or a collection of documents. This information can be used to identify important keywords or phrases associated with particular topics.

5. **Model Evaluation**: LMs can help evaluate the quality of topic models. For instance, you can calculate the likelihood of observing your corpus of documents using an LDA model. The higher the likelihood, the better the model fits the data. LMs can also be used in perplexity calculations to assess how well a language model generalizes to unseen data.

6. **Document Similarity**: LMs can be used to measure the similarity between documents based on the probability distributions of words. Documents with similar word probability distributions are likely to be related in terms of topics or content.

7. **Summarization**: LMs can assist in generating document summaries. By identifying the most probable words or phrases in a document, you can create concise summaries that capture the key points or topics discussed.

In the context of modern NLP, pre-trained language models like BERT, GPT-3, and others have been used for various topic modeling tasks due to their ability to capture complex language patterns and semantics. Researchers and practitioners often fine-tune these models on specific topic modeling tasks to achieve state-of-the-art results.

In summary, language models play a multifaceted role in topic modeling, from document classification and topic assignment to text generation and model evaluation. They enable more accurate and sophisticated approaches to understanding and organizing text data into meaningful topics.

# 13. Conditional Independence

Two events $A$ and $B$ are **conditionally independent** given an event $C$ with $P(C)>0$ if
$$P(A\cap B|C) = P(A|C)P(B|C)$$

Source: https://www.probabilitycourse.com/chapter1/1_4_4_conditional_independence.php

# 14. VIDEO Bayes Theorem

https://www.youtube.com/watch?v=HZGCoVF3YvM&t=28s&ab_channel=3Blue1Brown

# 15. Bayes Theorem

Let $E_1$, $E_2$, ... , $E_n$ be a set of events associated with a sample space $S$, where all the events $E_1$, $E_2$, ... , $E_n$ have nonzero probability of occurrence and they form a partition of $S$. Let $A$ be any event associated with $S$ with $P(A)>0$. Then, according to Bayes' theorem,
$$P(E_i|A)=\frac{P(E_i)P(A|E_i)}{\sum_{k=1}^{n}P(E_k)P(A|E_k)}$$
for any $i=1,2,3,...,n$
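
A short sketch of the formula with a two-event partition. The numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical and serve only to exercise the computation.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a partition.

    priors[e] is P(E), likelihoods[e] is P(A | E); returns P(E | A).
    """
    evidence = sum(priors[e] * likelihoods[e] for e in priors)  # P(A)
    return {e: priors[e] * likelihoods[e] / evidence for e in priors}

# Hypothetical diagnostic test; A is the event "test is positive".
priors = {"disease": 0.01, "healthy": 0.99}
likelihoods = {"disease": 0.95, "healthy": 0.05}

post = posterior(priors, likelihoods)
print(post)
```

Even with a fairly accurate test, the posterior probability of disease is only about 0.16, because the prior is so small.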

Source: https://byjus.com/maths/bayes-theorem/

# 16. Bayes Theorem in Practice

**Medical diagnosis:** Bayes’ theorem is widely used in medical diagnosis, where the probability of a particular disease or condition given certain symptoms or test results is calculated. It helps physicians assess the likelihood of a disease based on prior knowledge and test outcomes.

**Spam filtering:** In email spam filtering, Bayes’ theorem is used to classify incoming emails as spam or non-spam. It calculates the probability that an email is spam given the occurrence of certain words or patterns, based on a training dataset of known spam and non-spam emails.

**Document categorization:** Bayes’ theorem is applied in text mining and natural language processing for document categorization tasks. It can help classify documents into predefined categories by calculating the probability of a document belonging to a category given its content or features.

Source: https://medium.com/@evertongomede/applications-of-bayes-theorem-b3e95b4958de

# 17. Probability Density Function (PDF)

The **probability density function (pdf)** of a continuous random variable $X$ with support $S$ is an integrable function $f(x)$ satisfying the following:
1. $f(x)$ is positive everywhere in the support $S$, that is, $f(x)>0$, for all $x$ in $S$
2. The area under the curve $f(x)$ in the support $S$ is 1, that is:
$$\int_{S}f(x)dx=1$$
3. If $f(x)$ is the pdf of $X$, then the probability that $X$ belongs to $A$, where $A$ is some interval, is given by the integral of $f(x)$ over that interval, that is:
$$P(X\in A)=\int_{A}f(x)dx$$
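
Properties 2 and 3 can be checked numerically. The sketch below assumes an exponential pdf with a hypothetical rate of 2 and uses a simple midpoint rule; the upper limit of 40 stands in for infinity.

```python
import math

def exponential_pdf(x, rate=2.0):
    """pdf of an exponential distribution; zero outside the support x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Property 2: total area under the pdf is (approximately) 1.
total_area = integrate(exponential_pdf, 0.0, 40.0)

# Property 3: P(0 <= X <= 1) is the area over the interval [0, 1].
prob_0_to_1 = integrate(exponential_pdf, 0.0, 1.0)

print(total_area, prob_0_to_1, 1 - math.exp(-2.0))
```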

Source: https://online.stat.psu.edu/stat414/lesson/14/14.1

# 18. Common PDFs

## 18.1 Normal Distribution

In a normal distribution, data is symmetrically distributed with no skew. When plotted on a graph, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.

Normal distributions are also called Gaussian distributions or bell curves because of their shape.

Source: scribbr.com/statistics/normal-distribution/

## 18.2 Uniform Distribution

A uniform distribution, sometimes also known as a rectangular distribution, is a distribution that has constant probability density over its support.

Source: https://mathworld.wolfram.com/UniformDistribution.html

## 18.3 Exponential Distribution

The exponential distribution is a continuous probability distribution used to model the time elapsed before a given event occurs.

Sometimes it is also called negative exponential distribution.

Source: https://www.statlect.com/probability-distributions/exponential-distribution

# 19. How does kernel density estimation work?, pp 116

The KDE algorithm takes a parameter, *bandwidth*, that affects how “smooth” the resulting curve is.

The KDE is calculated by weighting the distances of all the data points for each location on the curve (produced by histogram-bins). If there are more points nearby, the estimate is higher, indicating a higher probability of seeing a point near that location.

Changing the bandwidth changes the shape of the kernel: a lower bandwidth means only points very close to the current position are given any weight, which leads to the estimate looking squiggly; a higher bandwidth means a shallow kernel where distant points can contribute.

The concept of weighting the distances of our observations from a particular point, $x$, can be expressed mathematically as follows:
$$\hat{f}(x)=\sum_{\text{observations}}K\left(\frac{x-\text{observation}}{\text{bandwidth}}\right)$$
The variable $K$ represents the kernel function. Using different kernel functions will produce different estimates. Dividing the sum by (number of observations $\times$ bandwidth) normalizes the estimate so that it integrates to 1.
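
A minimal sketch of the sum of kernels, assuming a Gaussian kernel and made-up observations. The division by `n * bandwidth` is the usual normalization that makes the estimate integrate to 1; it is added here as part of the sketch.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, observations, bandwidth):
    """Kernel density estimate at x: normalized sum of kernels."""
    n = len(observations)
    total = sum(gaussian_kernel((x - obs) / bandwidth) for obs in observations)
    return total / (n * bandwidth)

observations = [1.0, 1.2, 0.9, 3.5, 3.7]  # made-up data

# More nearby points give a higher estimate.
print(kde(1.0, observations, bandwidth=0.5))
print(kde(2.3, observations, bandwidth=0.5))

# A smaller bandwidth produces a spikier estimate at the same location.
print(kde(1.0, observations, bandwidth=0.1))
```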

Source: https://mathisonian.github.io/kde/

# 20. Probability Mass Functions (pmf)

Let $X$ be a discrete random variable with range $R_X=\{x_1,x_2,x_3,...\}$ (finite or countably infinite). The function
$$P_X(x_k)=P(X=x_k),\;\;\;\;\text{for}\;k=1,2,3,...$$
is called the *probability mass function (PMF)* of $X$.

Source: https://www.probabilitycourse.com/chapter3/3_1_3_pmf.php

# 21. Common PMFs

## 21.1 Binomial Distribution

A **binomial distribution** can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has **two possible outcomes** (the prefix “bi” means two, or twice).
1. The first variable in the binomial formula, $n$, stands for the number of times the experiment runs.
2. The second variable, $p$, represents the probability of one specific outcome.
***
**Criteria for Binomial Distribution:**
1. The number of observations or trials is fixed.
2. Each observation or trial is independent.
3. The probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.

## 21.2 Bernoulli Distribution

Bernoulli distribution applies to events that have **one trial** and **two possible outcomes**. These are known as Bernoulli trials.

Source: https://careerfoundry.com/en/blog/data-analytics/what-is-bernoulli-distribution/#:~:text=To%20recap%3A-,Bernoulli%20distribution%20is%20a%20discrete%20probability%20distribution,success)%20or%20tails%20(failure)%3F

## 21.3 Discrete Uniform Distribution

A discrete uniform distribution is one in which a random variable takes a finite number of values, each with an equally likely chance of occurring. Examples of experiments that result in discrete uniform distributions are the rolling of a die or the selection of a card from a standard deck. For a fair, six-sided die, there is an equal probability $\frac{1}{6}$ of rolling a 1, 2, 3, 4, 5, or 6. Similarly, a standard deck of cards has 52 different cards, so the probability of selecting any one card is $\frac{1}{52}$. Also, in both cases, there are distinct outcomes (dice roll or cards), indicating the discrete nature of the events.

More specifically, let $x$ be a discrete random variable taking $n$ integer values over the interval $[a,b]$, so that $n=b-a+1$; $x$ has a discrete uniform distribution if its probability mass function (pmf) is defined by:
$$f(x)=\frac{1}{n},\;\;\;\;x=a,a+1,...,b$$

![image.png](attachment:dd5e66a3-7034-45ce-a75b-9781bce02dd4.png)

Source: https://www.math.net/uniform-distribution

## 21.4 Geometric Distribution

The **geometric distribution** is a probability distribution that models the number of trials required to achieve the first success in a sequence of independent Bernoulli trials, where each trial has a constant probability of success.

A geometric distribution can have an indefinite number of trials until the first success is obtained.

Source: https://www.cuemath.com/geometric-distribution-formula/

# 22. Cumulative Distribution Function (CDF)

A cumulative distribution function (CDF) describes the probabilities of a random variable having values less than or equal to $x$. It is a cumulative function because it sums the total likelihood up to that point. Its output always ranges between 0 and 1.
$$\mathrm{CDF}(x) = P(X \leq x)$$
Where $X$ is the random variable, and $x$ is a specific value. The CDF gives us the probability that the random variable $X$ is less than or equal to $x$. These functions are non-decreasing. As $x$ increases, the likelihood can either increase or stay constant, but it can not decrease.

Both probability density functions (PDFs) and cumulative distribution functions describe how a random variable is distributed. However, PDFs give the probability density at $x$, while CDFs give the probability that $X \leq x$.

Cumulative distribution functions are excellent for providing probabilities that the next observation will be less than or equal to the value you specify. This ability can help you make decisions that incorporate uncertainty.

Additionally, these cumulative probabilities are equivalent to percentiles. A cumulative probability of 0.80 is the same as the $80^{th}$ percentile. So, CDFs are great for finding percentiles.

Source: https://statisticsbyjim.com/probability/cumulative-distribution-function-cdf/

## 22.1 How to Transform a PDF/PMF to CDF

### 22.1.1 PDF to CDF

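1. For $x$ below the support of the pdf (for example, $x<0$ when the support starts at 0), $F(x)=0$
2. On each interval where the pdf $f$ is given by a single formula, starting at the point $x_0$ where the formula changes,
$$F(x)=F(x_0)+\int_{x_0}^x f(t)\,dt$$
3. Each x-range therefore has its own expression for $F(x)$, so the CDF is piecewise.
4. At the upper end of the support, $F(x)=1$, and it stays at 1 beyond it.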
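
As a worked illustration of these steps, take a hypothetical triangular pdf (chosen for this example, not taken from the linked video): $f(x)=x$ for $0\leq x<1$, $f(x)=2-x$ for $1\leq x\leq 2$, and $f(x)=0$ otherwise. Integrating piece by piece, and carrying the accumulated value $F(1)=\frac{1}{2}$ into the second piece, gives

$$F(x)=\begin{cases}0, & x<0\\ \int_0^x t\,dt=\frac{x^2}{2}, & 0\leq x<1\\ F(1)+\int_1^x(2-t)\,dt=2x-\frac{x^2}{2}-1, & 1\leq x\leq 2\\ 1, & x>2\end{cases}$$

The pieces agree at $x=1$ (both give $\frac{1}{2}$), and the last piece reaches $F(2)=1$.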

https://youtu.be/Zg_OGcSJYHI?si=XZnkkhzhxzAywcbD

### 22.1.2 PMF to CDF

https://youtu.be/1TRYf4_ZFT8?si=k5pGid37fChnLWnY
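
For a PMF the integral becomes a running sum: $F(x)$ adds up $P(X=x_k)$ over all $x_k\leq x$. A minimal sketch, assuming a fair six-sided die as the example distribution:

```python
from fractions import Fraction
from itertools import accumulate

# PMF of a fair six-sided die (illustrative example).
values = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# CDF at each support point: cumulative sum of the PMF.
cdf = list(accumulate(pmf))

def F(x):
    """P(X <= x): sum of the PMF over all values not exceeding x."""
    return sum((p for v, p in zip(values, pmf) if v <= x), Fraction(0))

print(cdf)
print(F(3), F(3.5), F(0), F(6))
```

Between support points the CDF is flat, so `F(3)` and `F(3.5)` are equal; the CDF of a discrete variable is a step function.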

# 23. How to derive the parameter estimate from the likelihood function?

Read full: 
1. https://blog.paperspace.com/maximum-likelihood-estimation-parametric-classification/
2. https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1

1. **Probability Function:** Find the probability function that makes a prediction.
2. **Likelihood:** Based on the probability function, derive the likelihood of the distribution.
3. **Log-Likelihood:** Based on the likelihood, derive the log-likelihood.
4. **Maximum Likelihood Estimation:** Find the maximum likelihood estimation of the parameters that form the distribution.
5. **Estimated Distribution:** Plug the estimated parameters into the probability function of the distribution.
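
The five steps can be traced on the binomial model from Section 6. This is a sketch; the symbols $L$ and $\ell$ for the likelihood and log-likelihood are notation introduced here.

$$
\begin{aligned}
\text{1. Probability function:}\quad & P(y\mid p)=\binom{n}{y}p^{y}(1-p)^{n-y}\\
\text{2. Likelihood:}\quad & L(p)=\binom{n}{y}p^{y}(1-p)^{n-y}\\
\text{3. Log-likelihood:}\quad & \ell(p)=\log\binom{n}{y}+y\log p+(n-y)\log(1-p)\\
\text{4. Maximization:}\quad & \frac{d\ell}{dp}=\frac{y}{p}-\frac{n-y}{1-p}=0\;\Rightarrow\;\hat{p}=\frac{y}{n}\\
\text{5. Estimated distribution:}\quad & y\sim \mathrm{Bin}\left(n,\tfrac{y}{n}\right)
\end{aligned}
$$

The likelihood in step 2 is the same expression as the probability function in step 1, but read as a function of $p$ with the observed $y$ held fixed.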

# 24. MLE over a Continuous Random Variable

For **Gaussian distribution**:
1. MLE of mean:
$$m=\frac{\sum_{t=1}^{N}x^t}{N}$$
2. MLE of variance:
$$s^2=\frac{\sum_{t=1}^{N}(x^t-m)^2}{N}$$
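
Both estimators translate directly into code. The sketch below uses a small made-up sample; note that the MLE of the variance divides by $N$, not $N-1$, which matches `statistics.pvariance` from the standard library.

```python
import statistics

def gaussian_mle(xs):
    """MLE of the mean and variance of a Gaussian from a sample xs."""
    n = len(xs)
    m = sum(xs) / n                          # sample mean
    s2 = sum((x - m) ** 2 for x in xs) / n   # divides by N, not N - 1
    return m, s2

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
m, s2 = gaussian_mle(xs)

print(m, s2)                      # 5.0 4.0
print(statistics.pvariance(xs))   # population variance, same as s2
```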

# 25. Mean

The **mean** is the average of a data set. It is the center of a probability distribution.

Source: https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/mean-median-mode/#mean

# 26. Variance

The **variance** is the average of the squared differences from the mean. Its square root, the standard deviation, is a measure of how spread out numbers are.

Source: https://www.mathsisfun.com/data/standard-deviation.html

# 27. Expectation

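The expectation of a random variable is the sum (for discrete variables) or the integral (for continuous variables) of its possible values, each weighted by its probability or probability density.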

Source: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/mathematical-expectation/

## 27.1 Expectation of Discrete Variables

$$E(X)=\sum xP(X=x)$$
Source: https://nzmaths.co.nz/category/glossary/expected-value-discrete-random-variable

## 27.2 Expectation of Continuous Variables

$$E(X)=\int_{-\infty}^{\infty}x\,f(x)\,dx$$
Source: https://dlsun.github.io/probability/ev-continuous.html
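
Both formulas can be evaluated on simple examples. The sketch below assumes a fair die for the discrete case and a uniform density on $[0,2]$ for the continuous case, with the integral approximated by a midpoint rule; all names are illustrative.

```python
from fractions import Fraction

# Discrete case: E(X) = sum of x * P(X = x) for a fair six-sided die.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
expected_die = sum(x * p for x, p in die_pmf.items())

# Continuous case: E(X) = integral of x * f(x) dx for a uniform pdf on [0, 2].
def uniform_pdf(x, a=0.0, b=2.0):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def expectation_continuous(pdf, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of x * pdf(x) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += x * pdf(x)
    return total * width

expected_uniform = expectation_continuous(uniform_pdf, 0.0, 2.0)

print(expected_die)      # 7/2
print(expected_uniform)  # approximately 1.0
```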

# 28. The Scientific Method

1. Ask a Question
2. Do Background Research
3. Construct a Hypothesis
4. Test Your Hypothesis by Doing an Experiment
5. Analyze Your Data and Draw a Conclusion
6. Communicate Your Results

Source: https://www.sciencebuddies.org/science-fair-projects/science-fair/steps-of-the-scientific-method

# 29. Null Hypotheses

This can be thought of as the implied hypothesis. “Null” meaning “nothing.”  This hypothesis states that there is no difference between groups or no relationship between variables. The null hypothesis is a presumption of status quo or no change.

Source: https://resources.nu.edu/statsresources/hypothesis

# 30. Alternative Hypotheses

This is also known as the claim. This hypothesis should state what you expect the data to show, based on your research on the topic. This is your answer to your research question.

Source: https://resources.nu.edu/statsresources/hypothesis

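# 31. Defining a rejection region based on hypothesis

For an observed test statistic $z^*$, the tail probability that is compared against the significance level depends on the direction of the alternative hypothesis: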

## 31.1 Left-tailed

$$P(Z\leq z^*)$$

## 31.2 Right-tailed

$$P(Z\geq z^*)$$

## 31.3 Two-tailed

$$2\times P(Z\geq |z^*|)$$

Source: https://online.stat.psu.edu/stat500/lesson/6a/6a.4/6a.4.1
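
The three tail probabilities can be computed with `statistics.NormalDist` from the Python standard library. This is a sketch assuming the test statistic follows a standard normal distribution under the null hypothesis; the function names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def left_tailed(z_star):
    """P(Z <= z*)"""
    return Z.cdf(z_star)

def right_tailed(z_star):
    """P(Z >= z*)"""
    return 1 - Z.cdf(z_star)

def two_tailed(z_star):
    """2 * P(Z >= |z*|)"""
    return 2 * (1 - Z.cdf(abs(z_star)))

print(left_tailed(-1.645))
print(right_tailed(1.645))
print(two_tailed(1.96))
```

All three calls return values close to 0.05, which is why 1.645 and 1.96 are the familiar cut-offs for one-tailed and two-tailed tests at the 5% level.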

# 32. T-tests

A **t test** is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

Source: https://www.scribbr.com/statistics/t-test/#:~:text=A%20t%20test%20is%20a,are%20different%20from%20one%20another.
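
As a sketch of the two-group comparison, the code below computes the statistic for Welch's t test, one common variant that does not assume equal variances, along with its approximate degrees of freedom (the Welch-Satterthwaite formula). The variant is chosen here for illustration, the two samples are made up, and turning the statistic into a p-value would additionally require the t distribution, which is not part of the standard library.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    # Sample variances (dividing by n - 1), scaled by the sample sizes.
    v1 = statistics.variance(a) / n1
    v2 = statistics.variance(b) / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.0, 4.8, 4.3]

t_stat, df = welch_t(group_a, group_b)
print(t_stat, df)
```

A statistic far from zero indicates that the difference between the group means is large relative to the variability within the groups.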

# 33. Degrees of Freedom

In inferential statistics, you estimate a parameter of a population by calculating a statistic of a sample. The number of independent pieces of information used to calculate the statistic is called the **degrees of freedom**.

Source: https://www.scribbr.com/statistics/degrees-of-freedom/

# 34. Error Types

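A **Type I error** is a false positive conclusion (rejecting a null hypothesis that is actually true), while a **Type II error** is a false negative conclusion (failing to reject a null hypothesis that is actually false).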

# Latex Styling Used in this Markdown Document

1. Set notation, union: https://latex-tutorial.com/union-latex/
2. Summation: https://www.physicsread.com/latex-summation/
3. Set notations: https://www.geeksforgeeks.org/set-notations-in-latex/
4. Equations: https://www.fabriziomusacchio.com/blog/2021-08-10-How_to_use_LaTeX_in_Markdown/
5. Similarly Equivalent: https://www.overleaf.com/learn/latex/List_of_Greek_letters_and_math_symbols
6. Line break: https://www.markdownguide.org/basic-syntax/#:~:text=To%20create%20a%20line%20break,spaces%2C%20and%20then%20type%20return.
7. Spacing in math equations: http://www.emerson.emory.edu/services/latex/latex_119.html
8. Integrals: https://www.overleaf.com/learn/latex/Questions/Writing_integrals_in_LaTeX#:~:text=It%27s%20very%20easy%20in%20LaTeX%20to%20write%20an,%24%24int_%20%7B0%7D%5E%20%7Bpi%7Dx%5E2%20%2Cdx%24%24%20Basic%20LaTeX%2015%3A%20Integrals
9. Infinity: https://latex-tutorial.com/infinity-latex/#Infinity-in-LaTeX
10. Multiplication: https://www.math-linux.com/latex-26/faq/latex-faq/article/latex-symbol-multiply