# Data Science Glossary

## Statistics

### Foundation


* Chance - proportion of times event should happen over repeated trials
* P(A) - proportion of times event A happens in n trials
* Law of Large Numbers - as n, the number of trials, grows larger and approaches infinity, P(A) approaches the chance of A
* Outcome space - all possible results of a trial. P(outcome space) = 1 for any trial (e.g. the probability of a flipped quarter landing either heads or tails up is 100%).  The probability of an impossible result is 0.
* Complement of an event A ($A^c$ or $\overline{\mbox{A}}$, depending on notation) - all outcomes not in A, with total probability $1 - P(A)$
* Union of events A and B ($A \cup B$) - any event that includes event A, event B, or both A and B.
* Intersection of events A and B ($A \cap B$) - any event that includes both A and B
* Subset - an event A is a subset of event B if every outcome in A is also in B
* Partition - an event can be partitioned into non-intersecting sub-events (event A partitioned into sub-events $A_1, A_2...A_n$).  If no two sub-events overlap and together they make up all of A, then $A_1, A_2...A_n$ form a partition.
    * $A_1, A_2, A_3... A_n$ form a partition of $A$ if $A = A_1 \cup A_2 \cup A_3 ...\cup A_n$ and $A_i \cap A_j = \emptyset$ for all $i \neq j$; $i,j \leq n$
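
As an illustration of the Law of Large Numbers entry above, here is a minimal simulation sketch. It assumes a fair coin (chance of heads = 0.5); the function name `observed_proportion` and the fixed seed are hypothetical choices made for this example only.

```python
import random

def observed_proportion(n_trials, p=0.5, seed=0):
    """Proportion of successes in n_trials independent trials with chance p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials

# The observed proportion should drift toward 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, observed_proportion(n))
```

For small n the observed proportion can be far from 0.5; for large n it should settle close to it.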
    
    
    
* Rules of Probability:
    * Probability of any event within the outcome space is at least 0.  $(P(A) ≥ 0)$
    * If the sub-events $A_1, A_2...A_n$ form a partition of the event $A$, then $P(A) = P(A_1) + P(A_2) + ... + P(A_n)$
    * Probability of the outcome space is 1; the probabilities of all possible outcomes sum to 1 (impossible outcomes contribute 0).
* Probability Space - defines the universe of a statistical model using 3 parts:
    * Sample space -  the collection of all possible outcomes
    * Event space - the collection of all possible sets of possible outcomes.
    * Probability measure - a function that maps each event to a probability within the $[0,1]$ interval
* Independence - event A is independent of event B if the occurrence of one does not affect the other
* Multiplicative Law of Probability - if A and B are independent events, then $P(A \cap B) = P(A)*P(B)$
* Addition Law of Probability - for any events A and B (independent or not), $P(A \cup B) = P(A)+P(B)-P(A \cap B)$
* Conditional Probability ($P(A|B)$) - reframing the probability of an event A given information about the occurrence of some event B
* Law of Total Probability - for a partition $A_1, A_2...A_n$ of the sample space and for any event $B$ in the sample space, $P(B) = \sum_i P(B \cap A_i)$
    * If each sub-event $A_i$ of the partition has a positive probability, then by the definition of conditional probability ($P(B \cap A_i) = P(B|A_i)P(A_i)$), $P(B) = \sum_i P(B|A_i)P(A_i)$
* Bayes' Rule:
    * $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
    * Alternatively, for a partition $B_1, B_2...B_n$ of the sample space, $P(B_j|A)= \frac{P(A|B_j)P(B_j)}{\sum_i P(A|B_i)P(B_i)}$
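
The partition form of Bayes' Rule, with the Law of Total Probability in the denominator, can be sketched in a few lines. The numbers are hypothetical (a two-part partition with priors 0.01 and 0.99), and the helper names `total_probability` and `bayes_posterior` are assumptions for this example.

```python
def total_probability(priors, likelihoods):
    """P(A) = sum_i P(A|B_i) P(B_i) for a partition B_1..B_n."""
    return sum(l * p for l, p in zip(likelihoods, priors))

def bayes_posterior(priors, likelihoods):
    """Return [P(B_i|A)] for every B_i in the partition."""
    p_a = total_probability(priors, likelihoods)
    return [l * p / p_a for l, p in zip(likelihoods, priors)]

# Hypothetical partition: B_1 = "has condition", B_2 = "does not have condition"
priors = [0.01, 0.99]        # P(B_1), P(B_2)
# A = "test is positive"
likelihoods = [0.95, 0.05]   # P(A|B_1), P(A|B_2)

posterior = bayes_posterior(priors, likelihoods)
print(posterior)  # P(B_1|A) is roughly 0.16 despite the high P(A|B_1)
```

Because the denominator sums over the whole partition, the posteriors always add up to 1.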


* Random Variable - a variable without a fixed value.  Instead, a random variable describes any number of potential outcomes that may come from a random phenomenon.
* Indicator random variable - a binary random variable used to describe failure/success (takes on value 0 or 1).  Often used in problems to simplify calculations.
 
* Distribution - a function that divides the probability of outcome space into subsets in such a way that satisfies the rules of probability. 
    * For example, the distribution of a fair coin toss is $P(Heads) = 0.5, P(Tails) = 0.5$.
    * Discrete distributions - random variables take on countable values, typically integers (e.g. dice rolls, number of tickets bought in an hour, etc.)
    * Continuous distributions - random variables take on continuous (decimal/real) values (e.g. time, distance, etc.)
* Probability Mass Function/Probability Density Function (PMF/PDF) - a function $f(x)$ that describes how probability is distributed over the values a random variable can take
    * Probability Mass Function - used for discrete distributions, where $P(X=x) = f(x)$
    * Probability Density Function - used for continuous distributions.  A little more complicated than PMFs, since the probability of a continuous random variable equaling any exact value is 0 (there are infinitely many possible values, e.g. 0 vs 0.00000000001).  Instead, the PDF describes a relative likelihood, and probabilities come from integrating it over an interval: $P(a \leq X \leq b) = \int_a^b f(x)dx$.
* Cumulative Distribution Function (CDF) - a function that describes the probability of a random variable being less than or equal to a certain value ($P(X \leq x) = F(x)$)
    * For discrete distributions, the CDF can be defined as $P(X \leq x) = F(x) = \sum_{x_i \leq x} f(x_i)$.  Essentially, we are adding all the probabilities of X taking on values that are less than or equal to x.
    * For continuous distributions, the CDF can be defined as  $P(X \leq x) = F(x) = \int_{-\infty}^x f(t)dt$.  Essentially, we are integrating to find the area under the curve up to the point x.
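
For the discrete case, the CDF can be built directly from the PMF by summing, using the "less than or equal to" convention. A small sketch with a fair six-sided die as the assumed example (the names `pmf` and `cdf` are illustrative):

```python
from fractions import Fraction

# PMF of a fair six-sided die: f(x) = 1/6 for x in 1..6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x, pmf):
    """F(x) = P(X <= x): add up the PMF over all values x_i <= x."""
    return sum((p for x_i, p in pmf.items() if x_i <= x), Fraction(0))

print(cdf(3, pmf))  # 1/2
print(cdf(6, pmf))  # 1
```

Exact fractions are used here only to avoid floating-point noise in the sums.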

    
* Joint Distributions
* Marginal Distributions
* Conditional Distribution


* Expected Value - essentially a weighted sum of all outcomes by their probabilities
    * For discrete distributions, $E(X) = \sum_x x P(X=x)$
    * For continuous distributions, $E(X) = \int_{-\infty}^{\infty} x f(x) dx$, where $f(x)$ is the PDF
* Law of the Unconscious Statistician (LOTUS) - used to calculate the expectation of a function of a random variable without first finding the distribution of that function.
    * LOTUS (discrete): $E(g(X)) = \sum g(x)f(x)$, where $f(x)$ is the PMF 
    * LOTUS(continuous): $E(g(X)) =\int_{-\infty}^{\infty} f(x)g(x) dx$, where $f(x)$ is the PDF 
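
A short sketch of the discrete formulas above, again assuming a fair six-sided die. The helper `expectation` is a hypothetical name; passing a function `g` applies LOTUS, and leaving it out gives the plain expected value.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum_x g(x) f(x); with the default g this is just E(X)."""
    return sum((g(x) * p for x, p in pmf.items()), Fraction(0))

mean = expectation(pmf)                            # E(X)   = 7/2
second_moment = expectation(pmf, lambda x: x**2)   # E(X^2) = 91/6
print(mean, second_moment)
```

Note that $E(X^2)$ is computed by weighting $x^2$ with the PMF of $X$ itself, which is the point of LOTUS.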
    
* Variance
* Covariance
* Correlation


* Markov's Inequality
    * Bounds the tail of a non-negative random variable using only its mean: $P(X \geq a) \leq \frac{E(X)}{a}$ for $a > 0$
* Chebyshev's Inequality
    * Chebyshev: sets bounds on both sides of the mean using the variance: $P(|X - E(X)| \geq k) \leq \frac{Var(X)}{k^2}$ for $k > 0$
* Central Limit Theorem
* Law of Averages


* Craps Principle - for independent trials each with probability $p$ of success:
    1. Find event A of your choosing with $P(A) = \frac{1}{2}$
    2. Wait until Success + Failure or Failure + Success; if Success + Success or Failure + Failure, try again
    3. Find event B with p of your choosing
    

### Distributions

#### Discrete

* Bernoulli
* Uniform
* Binomial/Multinomial
* Negative Binomial
* Geometric
* Hypergeometric
* Poisson
    * Approximates binomial
    * Thinning
    * Poisson Process
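
The "approximates binomial" note can be checked numerically. The sketch below assumes the usual setting for the approximation (many trials, small success probability, $\lambda = np$); the specific values of `n` and `p` are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.003   # hypothetical: many trials, rare success
lam = n * p

# Largest pointwise gap between the two PMFs over k = 0..20
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
print(max_gap)
```

The gap shrinks further as `p` gets smaller with `n * p` held fixed.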

* Conditional Expectation for Discrete Variables:
#### Continuous
            
* Uniform
* Exponential
    * Min/Max
    * Competing
* Beta
* Gamma

* Normal
    * Approximates binomial
    * Joint Distribution of 2 Independent Standard Normal Random Variables - 
* Rayleigh


* Change of Variable (Continuous)


## Hypothesis Testing



## Causal Inference

* Observational data - data collected outside the context of an explicitly created experiment, or used outside the context of the original experiment
* The problem with observational data
    * Can't use it to prove causal relations
    * Confounding variables/lurking variables/unobserved heterogeneity - unmeasured and uncontrolled attributes in the data
        * It is impossible to clearly formulate any relationship as causal when unobserved heterogeneity is present - any correlation may be attributed to the existence of these confounding variables
* Field experiments - randomized studies conducted in real-world settings
* Naturally occurring experiments - interventions randomly assigned by some existing institution
* Downstream experiments - intervention affects outcome of interest but also potentially other outcomes as well, which may also be studied
* Quasi-experiments - near-random processes cause subjects to receive different treatments, but assignment is not explicitly random (e.g. close elections, natural phenomena like weather/disasters)




* For any treatment, each subject $i$ has two potential outcomes:
    * $Y_i(1)$ represents the outcome when the treatment is applied to subject $i$
    * $Y_i(0)$ represents the outcome when the treatment is not applied to subject $i$
    * In reality, we can only ever know one of these - once the treatment is or isn't applied, one potential outcome is realized and the other remains a hypothetical
* $Y_i = d_i Y_i(1) +(1-d_i)Y_i(0)$, where $d_i$ is an indicator variable such that $d_i = 1$ when the ith subject has received the treatment and  $d_i = 0$ when the ith subject has not received the treatment 
    * $d_i$ is the treatment that is actually applied/not applied (variable), $D_i$ is the hypothetical treatment that has not yet been applied (random variable)
        * $d_i$ is the realization of $D_i$
    * $E[Y_i(1)|d_i=1]$ represents the expectation of $Y_i(1)$ when subject i is selected at random from the treated subjects
    * $E[Y_i(1)|d_i=0]$ represents the expectation of $Y_i(1)$ when subject i is selected at random from the untreated subjects (which is a hypothetical quantity that is impossible to observe)
* $\tau_i$ represents the causal effect of the treatment, and is defined as the difference between the two potential outcomes 
    * $\tau_i = Y_i(1) - Y_i(0)$
    * $\tau_i$ is a hypothetical quantity given that we can never really know both outcomes
* Average Treatment Effect (ATE) is defined as the average of $\tau_i$ for all $i$s
    * $ATE=\frac{1}{N} \sum_i^N \tau_i$
    * In other terms, $ATE = E[Y_i(1)-Y_i(0)]$
    * While different subjects will have different $\tau_i$, the ATE describes how outcomes change on average when going from untreated to treated
* Problem: we will never know both $Y_i(1)$ and $Y_i(0)$ 
* Solution: if we assign treatments to subjects at random, the treatment and control groups have identical expected potential outcomes.
    * $E[Y_i(1)|D_i=1]=E[Y_i(1)]=E[Y_i(1)|D_i=0]$
    * $E[Y_i(0)|D_i=0]=E[Y_i(0)]=E[Y_i(0)|D_i=1]$
    * Treatment and control groups have same expected potential outcome
    * $ATE = E[Y_i(1)-Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]$
    * $ATE = E[Y_i(1)|D_i=1]-E[Y_i(0)|D_i=0]$
        * Estimate ATE by taking difference between two sample means
        * $E[Y_i(1)|D_i=1]-E[Y_i(0)|D_i=0] = E[Y_i(1)] - E[Y_i(0)] = E[\tau_i] = ATE$
        * When treatments are randomly assigned, comparison of average outcomes in treatment and control groups (difference-in-means estimator) is unbiased estimator of ATE
* Selection problem - receiving treatment may be systematically related to potential outcomes (groups are not truly random)
    * $E[Y_i(1)|D_i=1]-E[Y_i(0)|D_i=0] = $ expected difference between treated and untreated outcomes 
    * $E[Y_i(1)|D_i=1]-E[Y_i(0)|D_i=0] = E[Y_i(1) - Y_i(0)|D_i=1] + E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]$ 
        * $E[Y_i(1) - Y_i(0)|D_i=1] = $ ATE among the treated 
        * $E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0] = $ selection bias term
        * With random assignment/selection, selection bias term is 0 ($E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0] = 0$), so the ATE among the treated is equal to the expected difference between treated and untreated outcomes
        * Without random assignment/selection, the apparent treatment effect will be a mixture of selection bias and the ATE for a subset of subjects
            * Therefore, random assignment is necessary to specifically identify ATE among the treated subjects
* Excludability - the assumption that potential outcomes depend solely on whether or not the subject receives the treatment
    * Treatment assignment $z_i$ should only affect outcomes through $d_i$
* Non-interference - the assumption that a subject's potential outcomes depend only on the treatment that subject itself receives, not on the treatments received by other subjects
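
The potential-outcomes argument above can be illustrated with a simulation in which, unlike in a real study, both $Y_i(0)$ and $Y_i(1)$ are known. Everything here is hypothetical: the population, the distribution of effects, and the "highest $Y_i(0)$ selects into treatment" rule are assumptions chosen to make the selection bias term visible.

```python
import random
from statistics import mean

rng = random.Random(42)
N, m = 200, 100

# Hypothetical potential outcomes for every subject.
y0 = [rng.gauss(10, 2) for _ in range(N)]
y1 = [y + rng.gauss(2, 1) for y in y0]      # individual effects tau_i vary around 2
true_ate = mean(a - b for a, b in zip(y1, y0))

def diff_in_means(treated):
    """Difference-in-means estimator given the set of treated indices."""
    t = [y1[i] for i in range(N) if i in treated]       # observed Y_i(1)
    c = [y0[i] for i in range(N) if i not in treated]   # observed Y_i(0)
    return mean(t) - mean(c)

# Random assignment: average the estimator over many randomizations.
estimates = [diff_in_means(set(rng.sample(range(N), m))) for _ in range(2000)]
avg_estimate = mean(estimates)

# Non-random selection: subjects with the highest Y_i(0) end up treated.
selected = set(sorted(range(N), key=lambda i: y0[i])[-m:])
naive = diff_in_means(selected)
att = mean(y1[i] - y0[i] for i in selected)              # ATE among the treated
selection_bias = (mean(y0[i] for i in selected)
                  - mean(y0[i] for i in range(N) if i not in selected))

print(true_ate, avg_estimate)        # close to each other
print(naive, att + selection_bias)   # equal: the decomposition is an identity
```

Under random assignment the estimates average out near the true ATE, while under selection the naive comparison equals the ATE among the treated plus a non-zero selection bias term.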



* Standard Deviation - $\sqrt{\frac{1}{N} \sum_1^N (X_i - \bar X)^2}$ 
    * When X is a random sample from larger population containing N* subjects with an unknown mean: 
        * Standard Deviation - $\sqrt{\frac{1}{N-1} \sum_1^N (X_i - \bar X)^2}$ 
* Standard Error - $\sqrt{\frac{1}{J} \sum_1^J (\hat\theta_j - \bar{\hat \theta})^2}$ 
    * Standard error - standard deviation of a sampling distribution 
    * $J$ - number of possible ways of randomly assigning subjects
    * $\hat\theta_j$ - estimate we get from the jth randomization
    * $\bar{\hat \theta}$ - average estimate of all j randomizations
* Variance - variance of an observed or potential outcome for a set of N subjects is the average squared deviation from the mean for all N subjects
    * Ex. $Var(Y_i(1)) = \frac{1}{N} \sum_1^N (Y_i(1) - \frac{\sum_1^N Y_i(1)}{N})^2$
    * Higher variance = higher dispersion around the mean
    * Variance of 0 means the variable is a constant
* Covariance - to find the covariance between two variables, subtract each variable's mean from its values and calculate the average cross-product of the results:
    * Ex. $Cov(Y_i(0),Y_i(1)) = \frac{1}{N} \sum_1^N (Y_i(0) - \frac{\sum_1^N Y_i(0)}{N})(Y_i(1) - \frac{\sum_1^N Y_i(1)}{N})$
    * Covariance is measure of association between two variables
    * Negative covariance implies low values of one coincides with higher values of another (inverse relationship)
    * Positive covariance implies high values of one coincides with high values of another
* Standard Error of estimated ATE - $SE(\hat{ATE}) = \sqrt{\frac{1}{N-1} [\frac{m Var(Y_i(0))}{N-m} + \frac{(N-m) Var(Y_i(1))}{m} + 2Cov(Y_i(0),Y_i(1))]}$
    * N observations
    * m treated units
    * Implications for experiment design
        * Larger N = smaller standard error
        * Smaller variance (of either $Y_i(0),Y_i(1)$) = smaller SE
        * Smaller covariance (of $Y_i(0),Y_i(1)$) = smaller SE
        * Similar variances (of $Y_i(0),Y_i(1)$) => we should assign equal number of subjects to control and treatment groups 
            * If different variance, assign more subjects to the group with higher variance
* True SE is unknown
    * We don't know the covariance between treatment/control - if we did, no reason to run the experiment
    * Need to estimate SE
    * Formula for estimating SE of ATE in practice: $\sqrt{\frac{\hat{Var}(Y_i(0))}{N-m} + \frac{\hat{Var}(Y_i(1))}{m}}$
        * $\hat{Var}(Y_i(1)) = \frac{1}{m-1} \sum_{1}^{m} (Y_i(1)|d_i = 1 - \frac{\sum_1^m Y_i(1)|d_i = 1}{m})^2$
        * $\hat{Var}(Y_i(0)) = \frac{1}{N-m-1} \sum_{m+1}^{N} (Y_i(0)|d_i = 0 - \frac{\sum_{m+1}^N Y_i(0)|d_i = 0}{N-m})^2$
        * This is the standard approach, which is to use a conservative estimation formula that is at least as large as the theoretical equation for the SE of the estimate ATE ($SE(\hat{ATE}) = \sqrt{\frac{1}{N-1} [\frac{m Var(Y_i(0))}{N-m} + \frac{(N-m) Var(Y_i(1))}{m} + 2Cov(Y_i(0),Y_i(1))]}$)
            * Conservative formula assumes treatment effect is the same for all subjects (correlation between $Y_i(0),Y_i(1)$ is 1)
        * When calculating sample variances, divide by $n-1$ to account for the fact that 1 observation is expended when we calculate the sample mean
* One-tailed hypothesis - whether the treatment results in a change in one direction (choose either greater or less than)
* Two-tailed hypothesis - whether the treatment results in a change in either direction (either greater or less than)
* For large N, number of random assignments becomes large
    * Number of possible assignments under complete random assignment: $\frac{N!}{m!(N-m)!}$
        * Where N is the total number of participants and m is the number of participants in the treatment group
    * For $N=50$ and treatment/control groups of equal size, number of possible randomizations = $\frac{50!}{25! 25!}$
        * Approximate sampling distribution by randomly sampling from set of all possible random assignments
* Randomization inference - calculation of p-values based on sets of possible randomizations
* Sharp null hypothesis - treatment effect is 0 for all observations
    * If true, $Y_i(1)=Y_i(0)$, assume all $\tau_i = 0$
    * Treatment is no different than control
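
The estimated SE formula and randomization inference under the sharp null can be sketched together. The outcome data below are hypothetical, the function names are assumptions, and the p-value is approximated by sampling random assignments rather than enumerating all of them. `statistics.variance` divides by $n-1$, matching the sample-variance formulas above.

```python
import random
from statistics import mean, variance

def estimated_se(treated, control):
    """Conservative estimate: sqrt(Var_hat(Y(0))/(N-m) + Var_hat(Y(1))/m)."""
    return (variance(control) / len(control)
            + variance(treated) / len(treated)) ** 0.5

def randomization_p_value(treated, control, n_draws=5000, seed=0):
    """Two-tailed p-value under the sharp null, using sampled assignments."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    # Under the sharp null Y_i(1) = Y_i(0), so outcomes stay fixed
    # no matter which subjects are labeled "treated".
    pooled = list(treated) + list(control)
    m = len(treated)
    extreme = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)
        simulated = mean(pooled[:m]) - mean(pooled[m:])
        if abs(simulated) >= abs(observed):
            extreme += 1
    return extreme / n_draws

# Hypothetical observed outcomes
treated = [12.1, 14.3, 11.8, 15.0, 13.2, 14.8]
control = [10.2, 11.1, 9.8, 12.0, 10.5, 11.4]

print(estimated_se(treated, control))
print(randomization_p_value(treated, control))
```

For a one-tailed hypothesis, the comparison would use the signed difference instead of absolute values.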
    
    
* Statistical power ingredients:
    * Sample size
    * Effect size
    * Population variance (with respect to the effect)
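
How these three ingredients move power can be explored with a rough Monte Carlo sketch. It rests on assumptions that are not part of the notes above: normally distributed outcomes, equal group sizes, and a two-tailed test that rejects when the estimate is more than 1.96 estimated standard errors from zero.

```python
import random
from statistics import mean, variance

def simulated_power(n_per_group, effect, sd, n_sims=300, seed=0):
    """Share of simulated experiments in which |estimate / SE| exceeds 1.96."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        estimate = mean(treated) - mean(control)
        se = (variance(treated) / n_per_group
              + variance(control) / n_per_group) ** 0.5
        if abs(estimate / se) > 1.96:
            rejections += 1
    return rejections / n_sims

# Larger samples, larger effects, and smaller variance all raise power.
print(simulated_power(20, effect=0.5, sd=1.0))
print(simulated_power(100, effect=0.5, sd=1.0))
print(simulated_power(100, effect=0.5, sd=3.0))
```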
    
    
* Blocking - participants are stratified, and each stratum is divided into control and treatment groups through random assignment
    * Partition total population to ensure our treatment/control groups have equal representation of subjects
    * Number of possible assignments under blocking assignment with two blocks: $(\frac{N!}{n!(N-n)!})(\frac{M!}{m!(M-m)!})$
        * Where M is the total number of participants in block 1 and m is the number of participants in block 1 to be assigned to the treatment group, N is the total number of participants in block 2 and n is the number of participants in block 2 to be assigned to the treatment group
    * For J blocks, the standard error is $SE(\hat{ATE}) = \sqrt{\sum_1^J (\frac{N_j}{N})^2 SE^2 (\hat{ATE})_j}$
        * From the general rule for variance of a sum of independent random variables: $Var(\alpha A + \beta B) = \alpha^2 Var(A) + \beta^2 Var(B)$ 
        * For 2 blocks, the standard error is $SE(\hat{ATE}) = \sqrt{(SE_1)^2 (\frac{N_1}{N})^2 + (SE_2)^2(\frac{N_2}{N})^2}$
    * To calculate $\hat{SE}(\hat{ATE}_j)$ for each block, every block must contain at least 2 observations in treatment and two observations in control
        * Incompatible with matched pair designs, where every block only contains 2 subjects (1 treatment, 1 control)
            * Matched pair design:  $\hat{SE}(\hat{ATE}) = \sqrt{\frac{1}{\frac{N}{2} (\frac{N}{2} - 1)} \sum_{j=1}^J (\hat{ATE}_j - \hat{ATE})^2} = \sqrt{\frac{1}{J (J - 1)} \sum_{j=1}^J (\hat{ATE}_j - \hat{ATE})^2}$, where $J = N/2$ is the number of pairs
    * Blocking allows randomized experiment to account for variation within population, increases precision
    * No real downsides as long as blocking is reasonably justified
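
The two blocking formulas above (number of possible assignments and the combined standard error) translate directly into code. The block sizes and per-block standard errors below are hypothetical inputs, and the function names are assumptions for this sketch.

```python
import math

def n_blocked_assignments(blocks):
    """blocks: list of (block_size, n_treated); product of binomial coefficients."""
    return math.prod(math.comb(size, treated) for size, treated in blocks)

def blocked_se(block_sizes, block_ses):
    """SE(ATE_hat) = sqrt(sum_j (N_j / N)^2 * SE_j^2)."""
    n_total = sum(block_sizes)
    return math.sqrt(sum((n_j / n_total) ** 2 * se_j ** 2
                         for n_j, se_j in zip(block_sizes, block_ses)))

# Hypothetical design: blocks of 10 and 6 subjects, half of each treated
print(n_blocked_assignments([(10, 5), (6, 3)]))   # 252 * 20 = 5040
print(blocked_se([10, 6], [0.8, 1.2]))
```

Each block's standard error is weighted by the squared share of subjects in that block, so large blocks dominate the overall precision.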
            

* Clustering - all subjects belonging to one cluster are placed into the same group (treatment or control)
    * Like simple random assignment but for groups instead of individuals
    * Conservative SE for clustering with k clusters: $\hat{SE}(\hat{ATE}_j) = \sqrt{\frac{N\hat{Var}(\hat{Y}_j(0))}{k(N-m)} + \frac{N \hat{Var}(\hat{Y}_j(1))}{km}}$
    * If clusters vary in size and this variance covaries with potential outcomes, the usual difference-in-means estimator is biased
    * If bias is suspected, can use alternative estimator that focuses on difference in total outcomes: $\hat{ATE} = \frac{k_C + k_T}{N} (\frac{\sum Y_i(1)|d_i = 1}{k_T} - \frac{\sum Y_i(0)|d_i = 0}{k_C})$, where $k_T$ and $k_C$ are the numbers of treated and control clusters
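
The difference-in-totals estimator can be written out as a small function. This is a sketch on hypothetical data: each inner list holds the observed outcomes of one cluster, and the name `difference_in_totals` is an assumption.

```python
def difference_in_totals(treated_clusters, control_clusters):
    """ATE_hat = (k_C + k_T)/N * (treated total / k_T - control total / k_C)."""
    k_t, k_c = len(treated_clusters), len(control_clusters)
    n = (sum(len(c) for c in treated_clusters)
         + sum(len(c) for c in control_clusters))
    total_t = sum(sum(c) for c in treated_clusters)   # sum of observed Y_i(1)
    total_c = sum(sum(c) for c in control_clusters)   # sum of observed Y_i(0)
    return (k_c + k_t) / n * (total_t / k_t - total_c / k_c)

# Hypothetical outcomes, one inner list per cluster
treated_clusters = [[5.0, 6.0, 7.0], [4.0, 6.0]]
control_clusters = [[3.0, 4.0], [2.0, 3.0, 4.0]]

estimate = difference_in_totals(treated_clusters, control_clusters)
print(estimate)  # approximately 2.4
```

Because it divides totals by the number of clusters rather than the number of subjects in each group, the estimator does not depend on how many subjects happen to land in treatment.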

## Machine Learning 

### Dimensionality
* Dimensionality
    * Principal Component Analysis
    * Canonical Correlation Analysis
    
### Classification versus Regression
* Classification - used to separate data into classes/categories
    * Discriminant Analysis
    * Naive Bayes
    * Logistic Regression
    * K-Nearest Neighbors
* Regression - used to model relationships within the data
    * Linear Regression

### Supervised Learning versus Unsupervised Learning
* Supervised Learning - program is trained on labeled data 
    * Regression
        * Linear Regression
        * Multiple Regression
        * Polynomial Regression
        * Logistic Regression
    * Discriminant Analysis
    * Perceptron
    * Naive Bayes
    * Decision Trees
* Unsupervised Learning - program is trained on unlabeled data, aims to find patterns in data by itself (i.e. clustering)
    * Expectation Maximization
    * K-Means Clustering
    
    
### Parametric Modeling versus Non-Parametric Modeling
* Parametric Modeling - assumes the data follows some known distribution that we can estimate (the number of parameters is fixed regardless of the sample size)
    * Logistic Regression
    * Linear Discriminant Analysis
    * Perceptron
    * Naive Bayes
    * Simple Neural Networks
* Non-Parametric Modeling - does not assume the data follows some known distribution, no reliance on distribution assumptions (the number of parameters can grow with the sample size)
    * K-Nearest Neighbors
    * Decision Trees
    * Support Vector Machines
    
### Generative Models versus Nongenerative/Discriminative Models
* Generative Model - model can be used to generate new data after analyzing existing data
    * Discriminant Analysis
    * Naive Bayes
    * K-Nearest Neighbors
    
* Nongenerative/Discriminative - model cannot be used to generate new data, only used to classify
    * Regression
    * Simple Neural Networks
    * Support Vector Machines
    * Decision Trees


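* Reinforcement Learning - an agent learns by interacting with an environment and receiving rewards rather than labels (not to be confused with semi-supervised learning, where labels are given for only some of the data)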