# Data Science Glossary

## Statistics

### Foundation


* Chance - proportion of times event should happen over repeated trials
* P(A) - proportion of times event A happens in n trials
* Law of Large Numbers - as n, the number of trials, grows larger and approaches infinity, P(A) approaches the chance of event A
* Outcome space - all possible results of a trial. P(outcome space) = 1 for any trial (e.g. the probability of a quarter landing either heads or tails up is 100%).  The probability of an impossible result is 0.
* Complement of an event A ($A^c$ or $\overline{\mbox{A}}$, depending on notation) - all outcomes in the outcome space that are not in A, with total probability $1 - P(A)$
* Union of events A and B ($A \cup B$) - any event that includes event A, event B, or both A and B.
* Intersection of events A and B ($A \cap B$) - any event that includes both A and B
* Subset - an event A is a subset of event B if every outcome in A is also in B
* Partition - an event can be partitioned into non-intersecting sub-events (event A partitioned into sub-events $A_1, A_2...A_n$).  If no two sub-events overlap (every pairwise intersection is empty) and together they make up all of $A$, then $A_1, A_2...A_n$ form a partition.
    * $A_1, A_2, A_3... A_n$ form a partition of $A$ if $A = A_1 \cup A_2 \cup A_3 ...\cup A_n$ and $A_i \cap A_j = \emptyset$ for all $i \neq j$ with $i,j \leq n$
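
The set definitions above can be checked on a small example. The sketch below assumes a single roll of a fair six-sided die as the outcome space; the events `A` and `B` and the helper `prob` are illustrative names, not part of the glossary.

```python
from fractions import Fraction

# Assumed example: outcome space for one roll of a fair six-sided die.
OUTCOMES = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

A = frozenset({2, 4, 6})  # "the roll is even"
B = frozenset({4, 5, 6})  # "the roll is at least 4"

complement_A = OUTCOMES - A  # outcomes not in A
union = A | B                # outcomes in A, in B, or in both
intersection = A & B         # outcomes in both A and B

# A candidate partition of A: pairwise disjoint pieces whose union is A.
parts = [frozenset({2}), frozenset({4, 6})]
is_partition = (
    frozenset().union(*parts) == A
    and all(p.isdisjoint(q) for i, p in enumerate(parts) for q in parts[i + 1:])
)

print(prob(OUTCOMES))                                        # 1
print(prob(complement_A) == 1 - prob(A))                     # True
print(prob(union), prob(intersection))                       # 2/3 1/3
print(is_partition, sum(prob(p) for p in parts) == prob(A))  # True True
```

Using `Fraction` keeps the probabilities exact, so the equalities can be compared directly instead of with a floating-point tolerance.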
    
    
    
* Rules of Probability:
    * Probability of any event within the outcome space is at least 0 ($P(A) \geq 0$).
    * If the sub-events $A_1, A_2...A_n$ form a partition of the event $A$, then $P(A) = P(A_1) + P(A_2) + ... + P(A_n)$
    * Probability of the outcome space is 1; the probabilities of any set of events that partitions the outcome space sum to 1.
* Probability Space - defines the universe of a statistical model using 3 parts:
    * Sample space -  the collection of all possible outcomes
    * Event space - the collection of all possible sets of possible outcomes.
    * Probability measure - a function that maps each event to a probability within the $[0,1]$ interval
* Independence - event A is independent of event B if the occurrence of one does not affect the other
* Multiplicative Law of Probability - if A and B are independent events, then $P(A \cap B) = P(A)*P(B)$
* Addition Law of Probability - for any events A and B, $P(A \cup B) = P(A)+P(B)-P(A \cap B)$.  If A and B are mutually exclusive ($A \cap B = \emptyset$), this reduces to $P(A \cup B) = P(A)+P(B)$
* Conditional Probability ($P(A|B)$) - the probability of an event A given that some event B has occurred: $P(A|B) = \frac{P(A \cap B)}{P(B)}$ when $P(B) > 0$
* Law of Total Probability - for a partition $A_1, A_2...A_n$ of the sample space and for any event $B$ in the sample space, $P(B) = \sum_i P(B \cap A_i)$
    * If each part $A_i$ has a positive probability (i.e. the sub-event $A_i$ has a non-zero probability of occurring), then by the definition of conditional probability, $P(B \cap A_i) = P(B|A_i)P(A_i)$, so $P(B) = \sum_i P(B|A_i)P(A_i)$
* Bayes' Rule:
    * $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
    * Alternatively, for a partition $B_1, B_2...B_n$ of the sample space, $P(B_j|A)= \frac{P(A|B_j)P(B_j)}{\sum_i P(A|B_i)P(B_i)}$
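
As a worked example of the Law of Total Probability and Bayes' Rule, the sketch below uses a two-part partition `B1`, `B2`. All of the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical inputs (assumptions for illustration only).
prior = {"B1": Fraction(1, 100), "B2": Fraction(99, 100)}       # P(B_i); B1, B2 partition the sample space
likelihood = {"B1": Fraction(95, 100), "B2": Fraction(5, 100)}  # P(A | B_i)

# Law of Total Probability: P(A) = sum_i P(A | B_i) P(B_i)
p_a = sum(likelihood[b] * prior[b] for b in prior)

# Bayes' Rule: P(B_i | A) = P(A | B_i) P(B_i) / P(A)
posterior = {b: likelihood[b] * prior[b] / p_a for b in prior}

print(p_a)                      # 59/1000
print(posterior["B1"])          # 19/118
print(sum(posterior.values()))  # 1
```

Even though $P(A|B_1)$ is large, the posterior $P(B_1|A)$ stays small here because the prior $P(B_1)$ is small.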


* Random Variable - a variable without a fixed value.  Instead, a random variable describes any number of potential outcomes that may come from a random phenomenon.
* Indicator random variable - a binary random variable used to describe failure/success (takes on value 0 or 1).  Often used in problems to simplify calculations.
 
* Distribution - a description of how the total probability of the outcome space is divided among the possible values of a random variable, in a way that satisfies the rules of probability.
    * For example, the distribution of a coin toss is $P(Heads) = 0.5, P(Tails) = 0.5$.
    * Discrete distributions - random variables take on countable values, typically integers (e.g. dice rolls, number of tickets bought in an hour, etc.)
    * Continuous distributions - random variables take on continuous (real) values (e.g. time, distance, etc.)
* Probability Mass Function/Probability Density Function (PMF/PDF) - a function $f(x)$ that describes how likely a random variable is to take on a certain value
    * Probability Mass Function - used for discrete distributions: $P(X=x) = f(x)$
    * Probability Density Function - used for continuous distributions.  A little more complicated than PMFs, since the probability of a continuous random variable equaling any exact value is 0 (there are infinitely many values arbitrarily close to it, e.g. 0 vs 0.00000000001).  Instead, the PDF describes the relative likelihood of values, and the probability of the variable falling within an interval is the area under the PDF over that interval: $P(a \leq X \leq b) = \int_a^b f(x)dx$
* Cumulative Distribution Function (CDF) - a function that describes the probability of a random variable being less than or equal to a certain value ($P(X \leq x) = F(x)$)
    * For discrete distributions, the CDF can be defined as $P(X \leq x) = F(x) = \sum_{x_i \leq x} f(x_i)$.  Essentially, we are adding all the probabilities of X taking on the values that are less than or equal to x.
    * For continuous distributions, the CDF can be defined as $P(X \leq x) = F(x) = \int_{-\infty}^x f(t)dt$.  Essentially, we are integrating to find the area under the curve up to the point x.
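
A small sketch of the PMF/CDF relationship, assuming a fair six-sided die for the discrete case and a Uniform(0, 1) variable for the continuous case. The function names are illustrative.

```python
from fractions import Fraction

def pmf(x):
    """PMF of a fair six-sided die: f(x) = P(X = x)."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)

def cdf(x):
    """CDF: F(x) = P(X <= x), the sum of f(x_i) over all x_i <= x."""
    return sum((pmf(k) for k in range(1, 7) if k <= x), Fraction(0))

def uniform_cdf(x):
    """CDF of a continuous Uniform(0, 1) variable, whose PDF is 1 on [0, 1]."""
    return min(max(x, 0.0), 1.0)

print(pmf(3))  # 1/6
print(cdf(3))  # 1/2
print(cdf(6))  # 1
print(cdf(0))  # 0

# For a continuous variable, probabilities come from intervals:
# P(a < X <= b) = F(b) - F(a).
print(uniform_cdf(0.75) - uniform_cdf(0.25))  # 0.5
```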

    
* Joint Distributions
* Marginal Distributions
* Conditional Distribution


* Expected Value - essentially a weighted sum of all outcomes by their probabilities
    * For discrete distributions, $E(X) = \sum_x x f(x)$, where $f(x)$ is the PMF
    * For continuous distributions, $E(X) = \int_{-\infty}^{\infty} x f(x) dx$, where $f(x)$ is the PDF
* Law of the Unconscious Statistician (LOTUS) - used to calculate the expectation of a function of a random variable without first finding the distribution of that function.
    * LOTUS (discrete): $E(g(X)) = \sum_x g(x)f(x)$, where $f(x)$ is the PMF
    * LOTUS (continuous): $E(g(X)) =\int_{-\infty}^{\infty} g(x)f(x) dx$, where $f(x)$ is the PDF
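
The discrete formulas for expectation and LOTUS can be sketched in a few lines. The example assumes a fair six-sided die; `expectation` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed example: PMF of a fair six-sided die, stored as {value: probability}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum over x of g(x) * f(x); with the default g this is E(X)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                             # E(X)
second_moment = expectation(pmf, lambda x: x ** 2)  # E(X^2) via LOTUS

print(mean)           # 7/2
print(second_moment)  # 91/6
```

Note that $E(X^2) = 91/6$ is not the same as $E(X)^2 = 49/4$, which is why LOTUS weights $g(x)$ by the PMF rather than applying $g$ to the mean.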
    
* Variance
* Covariance
* Correlation


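* Markov's Inequality
    * Bounds the upper tail of a non-negative random variable using only its mean: $P(X \geq a) \leq \frac{E(X)}{a}$ for $a > 0$
* Chebyshev's Inequality
    * Bounds both tails, centered at the mean, using the variance: $P(|X - E(X)| \geq c) \leq \frac{Var(X)}{c^2}$ for $c > 0$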
* Central Limit Theorem
* Law of Averages
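
As a sanity check on the two tail bounds, Markov's inequality states $P(X \geq a) \leq E(X)/a$ for a non-negative $X$ and $a > 0$, and Chebyshev's inequality states $P(|X - E(X)| \geq c) \leq Var(X)/c^2$ for $c > 0$. The sketch below compares exact tail probabilities with these bounds for a fair six-sided die; the die and the thresholds `a` and `c` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Assumed example: fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

def tail_prob(predicate):
    """Exact probability of the set of outcomes satisfying `predicate`."""
    return sum((p for x, p in pmf.items() if predicate(x)), Fraction(0))

a = 5
markov_exact = tail_prob(lambda x: x >= a)
markov_bound = mean / a

c = Fraction(5, 2)
chebyshev_exact = tail_prob(lambda x: abs(x - mean) >= c)
chebyshev_bound = variance / c ** 2

print(mean, variance)                    # 7/2 35/12
print(markov_exact, markov_bound)        # 1/3 7/10
print(chebyshev_exact, chebyshev_bound)  # 1/3 7/15
```

Both bounds hold but are loose, which is typical: they use only the mean (Markov) or the mean and variance (Chebyshev), not the full distribution.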


* Craps Principle - for independent trials each with probability $p$ of success:
    1. Find event A of your choosing with $P(A) = \frac{1}{2}$
    2. Wait until Success + Failure or Failure + Success; if Success + Success or Failure + Failure, try again
    3. Find event B with p of your choosing
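
Step 2 above reads like the pairing trick for building an event with probability $\frac{1}{2}$ out of trials with an arbitrary success probability $p$: since $P(\text{Success then Failure}) = p(1-p) = P(\text{Failure then Success})$, the two mixed pairs are equally likely. Under that reading (an assumption, since the steps are terse), a simulation might look like the sketch below. The names `fair_event` and `biased_trial`, the seed, and the value `p = 0.8` are hypothetical.

```python
import random

def fair_event(trial):
    """Run trials in pairs until the two results differ; report whether the first was a success."""
    while True:
        first, second = trial(), trial()
        if first != second:  # Success + Failure or Failure + Success
            return first     # otherwise (both equal) try again

rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.8                 # assumed success probability of each trial

def biased_trial():
    """One independent trial that succeeds with probability p."""
    return rng.random() < p

draws = [fair_event(biased_trial) for _ in range(20_000)]
fraction_success_first = sum(draws) / len(draws)
print(fraction_success_first)  # should be close to 0.5 even though p = 0.8
```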
    

### Distributions

#### Discrete

* Bernoulli
* Uniform
* Binomial/Multinomial
* Negative Binomial
* Geometric
* Hypergeometric
* Poisson
    * Approximates binomial
    * Thinning
    * Poisson Process
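
To illustrate the "approximates binomial" note: when $n$ is large and $p$ is small, a Binomial($n$, $p$) PMF is close to a Poisson PMF with mean $\mu = np$. The sketch below compares the two; the values `n = 1000` and `p = 0.003` are assumptions chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

n, p = 1000, 0.003  # assumed: many trials, small success probability
mu = n * p          # match the Poisson mean to the binomial mean

for k in range(7):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, mu), 5))

# Largest pointwise difference between the two PMFs over k = 0..20.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(21))
print(max_gap)
```

The gap shrinks as $p$ gets smaller with $np$ held fixed, which is the regime where the approximation is usually applied.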

* Conditional Expectation for Discrete Variables:

#### Continuous
            
* Uniform
* Exponential
    * Min/Max
    * Competing
* Beta
* Gamma

* Normal
    * Approximates binomial +    
    * Joint Distribution of 2 Independent Standard Normal Random Variables - 
* Rayleigh


* Change of Variable (Continuous)


## Hypothesis Testing



## Machine Learning 

### Dimensionality
* Dimensionality
    * Principal Component Analysis
    * Canonical Correlation Analysis
    
### Classification versus Regression
* Classification - used to separate data into classes/categories
    * Discriminant Analysis
    * Naive Bayes
    * Logistic Regression
    * K-Nearest Neighbors
* Regression - used to model relationships within the data
    * Linear Regression

### Supervised Learning versus Unsupervised Learning
* Supervised Learning - program is trained on labeled data 
    * Regression
        * Linear Regression
        * Multiple Regression
        * Polynomial Regression
        * Logistic Regression
    * Discriminant Analysis
    * Perceptron
    * Naive Bayes
    * Decision Trees
* Unsupervised Learning - program is trained on unlabeled data, aims to find patterns in data by itself (e.g. clustering)
    * Expectation Maximization
    * K-Means Clustering
    
    
### Parametric Modeling versus Non-Parametric Modeling
* Parametric Modeling - assumes the data follows some known distribution that we can estimate (the number of parameters is fixed regardless of the sample size)
    * Logistic Regression
    * Linear Discriminant Analysis
    * Perceptron
    * Naive Bayes
    * Simple Neural Networks
* Non-Parametric Modeling - does not assume the data follows some known distribution, no reliance on distribution assumptions (the number of parameters is not fixed and can grow with the sample size)
    * K-Nearest Neighbors
    * Decision Trees
    * Support Vector Machines
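
The parametric/non-parametric distinction can be seen in a toy one-dimensional regression. The sketch below is a minimal illustration in plain Python, not a reference implementation: `fit_line` summarizes the data with two numbers however many points there are, while `knn_predict` has to keep the whole training set around. The function names and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept: two parameters, whatever the sample size."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

def knn_predict(xs, ys, query, k=2):
    """Average the targets of the k training points nearest to `query`; the 'model' is the data itself."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)           # 2.0 1.0
print(slope * 1.4 + intercept)    # parametric prediction from the two fitted numbers
print(knn_predict(xs, ys, 1.4))   # 4.0, the average of the targets at x = 1.0 and x = 2.0
```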
    
### Generative Models versus Nongenerative/Discriminative Models
* Generative Model - models how the data itself is distributed (the joint distribution of features and labels), so the model can be used to generate new data after analyzing existing data
    * Discriminant Analysis
    * Naive Bayes
    * K-Nearest Neighbors
    
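* Nongenerative/Discriminative - models only the boundary between classes (or the probability of a label given the features), so the model cannot be used to generate new data, only to classify or predict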
    * Regression
    * Simple Neural Networks
    * Support Vector Machines
    * Decision Trees


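* Reinforcement Learning - an agent learns by trial and error from reward signals rather than from labeled examples.  Not to be confused with semi-supervised learning, where labels are given for only some of the data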