## Stats Cheatsheet

https://drive.google.com/open?id=1e_Nklf_stdNcXx2MXskey-XLH8CyWFsy


## Statistical Modeling

#### Statistical Learning

> Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the problem of finding a predictive function based on data. **The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms.**

#### Types of data in statistical learning 

In the context of Statistical learning, there are two types of data:

* **Data that can be controlled directly OR independent variables** 
* **Data that cannot be controlled directly OR dependent variables**

#### Statistical Model

> A statistical model can be thought of as a kind of transformation that helps us express dependent variables **as a function** of independent variables. 

So a model essentially defines a **relationship** between a dependent and an independent variable. For a plot of height against weight, the relationship, in its simplest form, can be shown using a **straight line** drawn through the individual observations in the data. This line would be our **model**. 

We can define and **fit** such a straight line to our data following the straight-line equation **y = m * x + b**. Such a simple model describes a person's height as having an almost linear relationship with weight, i.e. weight increases with height. 
<img src="https://blogs.sas.com/content/iml/files/2013/02/RegSlopeInt.png" width = 400>

So this is our simple model for the relationship. Of course we can use more sophisticated models like quadratic equations or polynomial equations for a **better fit**, and we shall see that with advanced modeling techniques. Let's get back to our plain old straight line for now. 

Looking at the line above, we can define it as **Weight = -143 + 3.9 * Height**, based on the slope (m) and intercept (b) values for **y = mx + b**.  

This would be our **model**, which can help us work out a weight value for a given height. In some cases you may instead swap the roles of the variables and try to predict height from an individual's weight. It all depends on the question you are trying to ask. 

> A model is expressed as a mathematical equation showing the relationship between dependent and independent variables.
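
As a rough sketch of how such a line could be fitted, the snippet below uses `np.polyfit` on a small made-up set of heights and weights. The numbers are illustrative assumptions, not the data behind the plot, so the fitted slope and intercept will differ from the values quoted above. `predict_weight` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical heights and weights, for illustration only
height = np.array([60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0])
weight = np.array([95.0, 100.0, 108.0, 115.0, 121.0, 130.0, 137.0])

# Fit weight = m * height + b by least squares (degree-1 polynomial)
m, b = np.polyfit(height, weight, 1)

def predict_weight(h):
    """Apply the fitted straight-line model to a height value."""
    return m * h + b

print("slope m = {:.3f}, intercept b = {:.3f}".format(m, b))
print("predicted weight at height 65: {:.1f}".format(predict_weight(65)))
```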

#### Model Parameters

Model parameters are the coefficients of the model equation used for estimating the output. Statistical learning is all about learning these parameters. A statistical learning approach helps us **learn** these parameters so that we have a clear description of the relationship between the variables, which we can replicate and analyze under different circumstances. 

#### Model Validation

> Data is finite. 

The available data needs to be used very efficiently to build and **validate** a model. 

Here is a brief introduction to validation, in its simplest form:

* Split the data into two parts.
* Use one part for training so the model learns from it. This set of data is normally called the **Training Data**

* Use the other part for testing the model. This data is kept away from the model during the learning process and used only for testing the performance of the learned model. This dataset is called the **Testing Data**.

This setup is shown below:
![](https://francisbrochu.github.io/microbiome-summer-school-2017_mass-spec/sections/machine_learning/figures/train_test_sets.png)

In statistical learning, if the model has learned well from the training data, it should also perform well on the test data. Performance on the test data is our measure of accuracy: the model is assessed on how close its estimated outputs are to the actual values.
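
The split described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions: `train_test_split_simple` is a hypothetical helper, the 80/20 split is an arbitrary choice, and the data is made up.

```python
import numpy as np

def train_test_split_simple(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then hold out a fraction of rows for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical data: 10 observations of one independent variable
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

X_train, X_test, y_train, y_test = train_test_split_simple(X, y)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```

Shuffling before splitting avoids holding out only the last rows, which matters when the data is stored in some meaningful order.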

#### Loss Functions

>A loss function is a method of evaluating how well your model represents the relationship between data variables. 

If the model cannot figure out the underlying relationship between the independent and dependent variable(s), the loss function outputs a higher number. If the relationship is well modeled, the loss function outputs a lower value. As you change the parameters of your model to try and improve results, your loss function is your best friend, telling you if you are on the right track. Below is an example loss function which calculates a loss for fitting a straight line to a set of data points (as in our case above). The function measures the distance between the data points and the line to quantify the level of loss. 
![](https://blog.algorithmia.com/wp-content/uploads/2018/04/word-image-5.png)
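
One common choice of loss for a straight-line fit is the mean squared error (MSE). The sketch below assumes MSE for illustration; the figure above may depict a different loss, and the data points are made up.

```python
import numpy as np

def mse_loss(x, y, m, b):
    """Mean squared vertical distance between the points and the line y = m*x + b."""
    predictions = m * x + b
    return np.mean((y - predictions) ** 2)

# Hypothetical points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

print(mse_loss(x, y, m=2.0, b=1.0))  # perfect fit, loss is 0.0
print(mse_loss(x, y, m=1.0, b=1.0))  # worse fit, loss is 3.5
```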



## Variance

    data.var()

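Variance is a measure of the dispersion of a continuous random variable around its expected (mean) value. Let's quickly revisit this, as the variance formula plays a key role when calculating covariance and correlation measures. Note that if `data` is a pandas object, `data.var()` divides by $n - 1$ by default (`ddof=1`), whereas the formula below divides by $n$.

The formula for calculating variance is shown below: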

$$\sigma^2 = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i-\mu)^2$$

- $x_i$ represents an individual data point
- $\mu$ is the mean of the data points
- $n$ is the total number of data points 
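
To make the formula concrete, here is a small sketch that computes it by hand and compares it with NumPy. `population_variance` is a hypothetical helper name and the data values are made up.

```python
import numpy as np

def population_variance(values):
    """Variance with the 1/n divisor, matching the formula above."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    return np.sum((values - mu) ** 2) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values
print(population_variance(data))   # 1/n divisor
print(np.var(data))                # NumPy default (ddof=0) also divides by n
print(np.var(data, ddof=1))        # divides by n - 1 instead
```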

## Covariance

    data.cov()

In some cases, you'll want to look at **two random variables** to get an idea on how they **change together**. In statistics, when trying to figure out how two random variables **vary together**, you can use the so-called **covariance** between these variables.

Covariance calculation plays a major role in a number of advanced machine learning algorithms like dimensionality reduction, predictive analyses, etc.

### Calculating Covariance
If you have $X$ and $Y$, two random variables with $n$ elements each, you can calculate the covariance ($\sigma_{XY}$) between these two variables using the formula:

$$\sigma_{XY} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$

- $\sigma_{XY}$ = Covariance between $X$ and $Y$
- $x_i$ = ith element of variable $X$
- $y_i$ = ith element of variable $Y$
- $n$ = number of data points (__$n$ must be same for $X$ and $Y$__)
- $\mu_x$ = mean of the independent variable $X$
- $\mu_y$ = mean of the dependent variable $Y$
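
The formula translates almost directly into NumPy. The sketch below assumes the $1/n$ divisor shown above; `covariance` is a hypothetical helper and the data is made up.

```python
import numpy as np

def covariance(x, y):
    """Covariance with the 1/n divisor, following the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    return np.mean((x - x.mean()) * (y - y.mean()))

x = [1, 2, 3, 4, 5]            # illustrative values
y = [2, 4, 6, 8, 10]           # y rises with x
print(covariance(x, y))         # 4.0, positive
print(covariance(x, y[::-1]))   # -4.0, negative: y falls as x rises
```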

### Interpreting covariance values 

Covariance values range from positive infinity to negative infinity. 

* A **positive covariance** indicates that two variables are **positively related**

* A **negative covariance** indicates that two variables are **inversely related**

* A **covariance equal or close to 0** indicates that there is **no linear relationship** between two variables

## Correlation

    data.corr()

Covariance depends on the units of the $X$ and $Y$ variables. When doing data science, it is often not appropriate to use covariance as such, because different experiments may contain underlying data measured in different units. 

Because of this, it is important to normalize this degree of variation into a standard unit, with interpretable results, *independent of the units of data*. You can achieve this with a derived normalized measure, called **correlation**. 

The term "correlation" refers to a relationship or association between variables. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example: 
- Sales might increase when the marketing department spends more on TV advertisements
- A customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer, e.g. location, age group, gender, etc.
- Social media activity and website clicks might be associated with the revenue that a digital publisher makes.

Correlation is the first step to understanding these relationships and subsequently building better business and statistical models.


### Pearson's Correlation Coefficient

__Pearson Correlation Coefficient__, $r$, also called the __linear correlation coefficient__, measures the strength and the direction of a __linear relationship__ between two variables. This coefficient quantifies the degree to which a relationship between two variables can be described by a line. 

**Note:** There are a [number of other correlation coefficients](https://math.tutorvista.com/statistics/correlation.html), but for now, we will focus on __Pearson correlation__ as it is the go-to correlation measure for most needs.

### Calculating Correlation Coefficient

Pearson Correlation ($r$) is calculated using the following formula:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

So just like in the case of covariance,  $X$ and $Y$ are two random variables having n elements each. 


- $x_i$ = ith element of variable $X$
- $y_i$ = ith element of variable $Y$
- $n$ = number of data points (__$n$ must be same for $X$ and $Y$__)
- $\mu_x$ = mean of the independent variable $X$
- $\mu_y$ = mean of the dependent variable $Y$
- $r$ = Calculated Pearson Correlation

### Interpreting Correlation values

> _The Pearson Correlation formula always gives values in a range between -1 and 1_
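
As a sketch of the calculation, the function below follows the formula term by term and is checked against `np.corrcoef`. `pearson_r` is a hypothetical helper and the data is made up.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = [1, 2, 3, 4, 5]   # illustrative values
y = [2, 4, 5, 4, 5]
print(pearson_r(x, y))
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in result for comparison
```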

### Correlation is not Causation

You may have come across the saying: **“correlation is not causation”** or **“correlation does not imply causation”**. 

What do we mean by saying this?

Causation takes a step further than correlation.
> Any change in the value of one variable will cause a change in the value of another variable, which means one variable causes the other one to happen. It is also referred to as __cause and effect__.

### Key Takeaways
- Correlation is dimensionless, i.e. it is a unit-free measure of the relationship between variables. 
- Correlation is a normalized form of covariance and lies in the range [-1, 1]
- Correlation is not causation

## Statistical Distribution

A statistical distribution is a representation of the frequencies of potential events or the percentage of time each event occurs.

#### Examples of Discrete Distributions

##### The Bernoulli (Binomial) Distribution 

The Bernoulli distribution represents the probability of success for a certain experiment (the outcome being "success or not", so there are two possible outcomes). A coin toss is a classic example of a Bernoulli experiment with a probability of success 0.5 or 50%, but a Bernoulli experiment can have any probability of success between 0 and 1.

##### The Poisson Distribution

The Poisson distribution represents the probability of $n$ events in a given time period when the overall rate of occurrence is constant. A typical example is pieces of mail. If the overall rate at which you receive mail is constant, the number of items received on a single day (or month) follows a Poisson distribution. Other examples might include visitors arriving on a website, customers arriving at a store, or clients waiting to be served in a queue.

##### The Uniform Distribution

The uniform distribution occurs when all possible outcomes are equally likely. Rolling a fair die follows a uniform distribution, with equal probabilities for throwing values from 1 to 6. This is a discrete uniform distribution, but continuous uniform distributions exist as well.

#### Examples of Continuous Distributions

##### The Normal or Gaussian distribution

The normal distribution is the single most important distribution, and you'll come across it very often. It follows a bell shape and is a foundational distribution for many models and theories in statistics and data science. A normal distribution turns up very often when dealing with real-world data, including the heights and weights of different people, errors in a measurement, or marks on a test.

#### Discrete vs Continuous Distributions

When dealing with **discrete** data you use a **Probability Mass Function (PMF)**. When dealing with **continuous** data, you use a **Probability Density Function (PDF)**.

Based on the variation of their attributes, data distributions can take many shapes and forms. In the next few lessons, you'll learn how to describe data distributions. Very often, distributions are described using their statistical mean (or **expected value**) and variance of the data, but this is not always the case.


## Negative binomial distribution and Geometric distribution

The Binomial Distribution describes the number of successes  k  achieved in  n  trials, where the probability of success is  p .

The Negative Binomial Distribution describes the number of successes  k  until observing  r  failures (or successes--this is arbitrary, and depends on how you phrase the question; it doesn't particularly matter if we define heads or tails as a failure, as long as we pick one). Note that these failures do not need to be consecutive, just cumulative!

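If we know the parameters, we can calculate our Negative Binomial Probability by plugging them into the following formula:

$$b(x, r, P) = {}_{x-1}C_{r-1} \cdot P^{r} \cdot (1-P)^{x-r}$$

- $r$ = number of times the counted outcome must occur (the "failures" described above)
- $x$ = total number of trials
- $P$ = probability of the counted outcome on a single trial (for example, $0.5$ for a fair coin flip)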

#### Characteristics of the Negative Binomial Distribution

The **_mean_** of the Negative Binomial Distribution is: 

$$\mu = \frac{r}{p}$$

The **_variance_** of the Negative Binomial Distribution is:

$$\sigma^2 = \frac{r\ (1-p)}{p^{\ 2}}  $$
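
The formula and the mean can be checked with a few lines of plain Python. This is a sketch under the assumption that $x$ counts the total number of trials up to and including the $r$-th counted outcome; `negative_binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """Probability that the r-th counted outcome occurs exactly on trial x."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

# Illustrative: fair coin, probability that the 2nd tail arrives on the 5th flip
print(negative_binomial_pmf(5, 2, 0.5))

# The mean number of trials should be close to r / p
r, p = 2, 0.5
mean_trials = sum(x * negative_binomial_pmf(x, r, p) for x in range(r, 200))
print(mean_trials, r / p)
```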

#### Calculating Negative Binomial Probability with Numpy

[numpy documentation for the negative binomial sampling function](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.negative_binomial.html):

> A company drills wild-cat oil exploration wells, each with an estimated probability of success of 0.1. What is the probability of having one success for each successive well, that is what is the probability of a single success after drilling 5 wells, after 6 wells, etc.?

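    import numpy as np

    n_samples = 100000
    # number of failed wells before the first success, for each simulated run
    s = np.random.negative_binomial(1, 0.1, n_samples)
    for i in range(1, 11):
        # share of runs in which the success occurred within i wells
        probability = np.sum(s < i) / n_samples
        print("{} wells drilled, probability of success: {:.4f}%".format(i, probability * 100))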
        
## Geometric Distribution

The Geometric Distribution is extremely similar to the Negative Binomial Distribution. Whereas $r$ is a parameter we choose ourselves in the Negative Binomial Distribution, in the Geometric Distribution $r$ is always equal to 1. In this way, any questions that we can answer with the Geometric Distribution are questions that we can also answer with the Negative Binomial Distribution. The Geometric Distribution, then, is just a special case of the Negative Binomial Distribution where $r$ always equals 1.

Whereas a question for the Negative Binomial Distribution might ask how many times we could flip a coin before tails comes up twice, an equivalent question we could solve with the Geometric Distribution would be "What is the probability that I can flip a coin X times before it lands on tails?"

#### Equation for the Geometric Distribution

$$P(X=x) = q^{(x\ -\ 1)}p$$

Where $$q = 1 - p$$


$p$: The probability of success (the outcome we are waiting for) on a given trial.

$q$: $(1 - p)$, which is the probability of failure on a given trial.

Note that in cases where there is an equal chance of both outcomes, such as our coin flip example, $p$ and $q$ are equal. This means that for "fair" trials where there is an equal chance of success or failure, we can further simplify our equation to $$P(X=x) = q^x$$

#### Python function for geo dist

    def geometric_dist(y, p):
        """Return the probability that the first success occurs on the y-th trial.

        y is a discrete random variable; it should be an integer greater than zero.
        p is the probability of success for each Bernoulli trial.
        """
        # all y - 1 previous trials must fail
        probability_failures_all_previous = (1 - p) ** (y - 1)
        # and the y-th trial must succeed
        probability_success_this_trial = p
        return probability_failures_all_previous * probability_success_this_trial

## Poisson Distribution

The Poisson Distribution is yet another statistical distribution we can use to answer questions about how often an event occurs. Specifically, it allows us to calculate the probability of a given number of events happening in a fixed time period, using only the mean number of events that happen in such a period. We are not given the probability of an individual event directly. However, because we know the mean number of events per time period, we can work out the probability of observing any particular count with some basic calculations.

#### Poisson Distribution Formula

$$p(x) = \frac{\lambda^xe^{-\lambda}}{x!}$$

- $\lambda$: the mean number of events per unit of time
- $e$: Euler's number, $e \approx 2.71828$
- $x!$: the factorial of $x$. For example, $3! = 3 * 2 * 1 = 6$

#### Python poisson function

    import numpy as np
    from math import factorial
    def poisson_probability(lambd, x):
        return ((lambd)**x * (np.exp(-lambd))) / factorial(x)

## Exponential Distribution

The Exponential Distribution lets us ask how long we are likely to wait before an event occurs. Because time is continuous, the probability that our **_Random Variable, $X$_** equals one exact value $x$ is zero. Instead, we describe how likely values near $x$ are with a density, and we compute probabilities over intervals. 

In order to figure this out, we need to know the **_Decay Parameter_**, $\lambda$ (although you may also see this denoted by the letter $m$).  To calculate the decay parameter, we just divide 1 by the average length of time it takes for an event to occur (e.g. the average number of minutes a customer interaction takes, or the average number of days before a machine breaks down). The average interval length is usually labeled as $\mu$.

#### Decay Rate Formula

$$\lambda = \frac{1}{\mu}$$

Once we know the decay rate, we can use the **_Probability Density Function_** to tell us the **probability density at any length $x$.**

$$PDF(x) = \lambda e^{-\lambda x}$$

Since we are talking about a continuously-valued function, we'll also often want to make use of the **_Cumulative Distribution Function_**. This allows us to answer questions such as **"what is the probability that it will take less than 4 minutes to ring up this customer?"**

$$CDF(x) = 1 - e^{-\lambda x}$$

#### Python functions for exp dist.

    import numpy as np

    # probability density at an exact time x
    def exp_pdf(mu, x):
        decay_rate = 1 / mu
        return decay_rate * np.exp(-decay_rate * x)

    # probability that the event occurs within time x
    def exp_cdf(mu, x):
        decay_rate = 1 / mu
        return 1 - np.exp(-decay_rate * x)

    print("Probability density at exactly 3 minutes: {:.4f}".format(exp_pdf(4, 3)))
    print("Cumulative probability of 3 minutes or less: {:.4f}%".format(exp_cdf(4, 3) * 100))

## Normal Distributions

The normal distribution is the most important and most widely used distribution in statistics and analytics. It is also called the "bell curve" due to its shape, or the "Gaussian curve" after the mathematician Carl Friedrich Gauss.

#### Measures of Center and Spread
If you remember skewness, you would recognize there is no skew in a perfectly normal distribution. It is centered around its mean.

There are many possible normal distributions. They can differ in their means and in their standard deviations.

#### Normal Characteristics
For now, we will identify normal distributions by the following key characteristics.

- Normal distributions are symmetric around their mean.
- The mean, median, and mode of a normal distribution are equal.
- The area under the bell curve is equal to 1.0.
- Normal distributions are denser in the center and less dense in the tails.
- Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
- Around 68% of the area of a normal distribution is within one standard deviation of the mean (μ - σ to μ + σ)
- Approximately 95% of the area of a normal distribution is within two standard deviations of the mean (μ - 2σ to μ + 2σ).

#### Central limit theorem

When we add a large number of independent random variables, their normalized sum tends towards a Gaussian distribution, irrespective of the original distribution of these variables (under mild conditions, such as each variable having a finite variance).
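
A quick simulation can illustrate the theorem. The sketch below is an assumption-laden example: it sums 50 draws from a Uniform(0, 1) distribution (chosen only because it is clearly non-normal), repeats this 10,000 times, and checks that the normalized sums behave like a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 experiments, each summing 50 draws from a uniform distribution
n_terms, n_experiments = 50, 10000
draws = rng.uniform(0.0, 1.0, size=(n_experiments, n_terms))
sums = draws.sum(axis=1)

# Normalize with the theoretical mean and standard deviation of the sum
mean_sum = n_terms * 0.5
std_sum = np.sqrt(n_terms / 12.0)   # variance of Uniform(0, 1) is 1/12
z = (sums - mean_sum) / std_sum

print("mean of normalized sums: {:.3f}".format(z.mean()))
print("std of normalized sums: {:.3f}".format(z.std()))
print("share within 1 std: {:.3f}".format(np.mean(np.abs(z) < 1)))
```

The mean should come out near 0, the standard deviation near 1, and roughly 68% of the normalized sums should fall within one standard deviation.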



## Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution. The Standard Normal Distribution is a normal distribution with a mean of 0, and a standard deviation of 1.

Thinking back to the standard deviation rule for normal distributions:

* $68\%$ of the area lies in the interval of 1 standard deviation from the mean, or mathematically speaking, $68\%$ is in the interval $[\mu-\sigma, \mu+\sigma]$
* $95\%$ of the area lies in the interval of 2 standard deviations from the mean, or mathematically speaking, $95\%$ is in the interval $[\mu-2\sigma, \mu+2\sigma]$
* $99.7\%$ of the area lies in the interval of 3 standard deviations from the mean, or mathematically speaking, $99.7\%$ is in the interval $[\mu-3\sigma, \mu+3\sigma]$


With $\mu = 0$ and $\sigma=1$, this means that for the standard normal distribution:

* $68\%$ of the area lies between -1 and 1.
* $95\%$ of the area lies between -2 and 2.
* $99.7\%$ of the area lies between -3 and 3.
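
These rounded percentages can be reproduced with the standard library alone. The sketch below uses the identity that the area of a normal distribution within $k$ standard deviations of the mean equals $\operatorname{erf}(k/\sqrt{2})$; `area_within` is a hypothetical helper name.

```python
from math import erf, sqrt

def area_within(k):
    """Area of a normal distribution within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print("within {} sd: {:.2f}%".format(k, area_within(k) * 100))
```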

#### Standard Score (z-score)

The standard score (more commonly referred to as a z-score) is a very useful statistic because it allows us to:
1. Calculate the probability of a certain score occurring within a given normal distribution and 
2. Compare two scores that are from different normal distributions.

Any normal distribution can be converted to a standard normal distribution and vice versa using this equation:

$$\Large z=\dfrac{x-\mu}{\sigma}$$

Here, $x$ is an observation from the original normal distribution, $\mu$ is the mean and $\sigma$ is the standard deviation of the original normal distribution. 


The standard normal distribution is sometimes called the $z$ distribution. A $z$ score always reflects the number of standard deviations above or below the mean. 
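
As a small worked example of comparing scores from different normal distributions, the sketch below uses made-up exam means and standard deviations. `z_score` is a hypothetical helper.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

# Hypothetical exam results from two differently scaled tests
z_math = z_score(85, mu=70, sigma=10)     # 1.5 standard deviations above the mean
z_physics = z_score(80, mu=60, sigma=8)   # 2.5 standard deviations above the mean

print(z_math, z_physics)
```

Although the raw math score is higher, the physics score is further above its own mean once both are expressed in standard deviations.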

#### Data Standardization

Data standardization is a common data preprocessing technique, which is used to compare a number of observations belonging to different normal distributions with distinct means and standard deviations. 

Standardization means applying a $z$ score calculation to each element of the distribution. The output of this process has a mean of 0 and a standard deviation of 1. If the original data is normally distributed, the output is a **z-distribution** or a **standard normal distribution**. 
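
A minimal sketch of standardization with NumPy is shown below. `standardize` is a hypothetical helper, the scores are made up, and the population standard deviation (`ddof=0`) is an assumption.

```python
import numpy as np

def standardize(values):
    """Apply the z-score formula to every element of the input."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

scores = [55, 60, 65, 70, 75, 80, 85]   # illustrative values
z_scores = standardize(scores)

print(np.round(z_scores, 3))
print(z_scores.mean(), z_scores.std())   # approximately 0 and 1
```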

## Probability Mass Function

A probability mass function (pmf), sometimes also called a frequency function, gives us probabilities for discrete random variables. We already know discrete random variables from examples like coin flips and dice rolls. The **discrete** part in discrete distributions means that there is a known number of possible outcomes. For example, you can only roll a 1, 2, 3, 4, 5, or 6 on a die. **Based on our observations** of all the values from 1 to 6 in a number of dice rolls, we can develop a pmf for the die showing the probability of each possible value occurring. 

Here is a more formal understanding:

> A discrete random variable X takes on a particular value x with probability P(X = x), denoted as f(x). The function f(x) is typically called the probability mass function, or pmf. 

#### Function for calculating PMF using given collection data

    import collections
    x = [1,1,1,1,2,2,2,2,3,3,4,5,5]
    counter = collections.Counter(x)
    print(counter)
    print (len(x))
    
    pmf = []
    for key,val in counter.items():
        pmf.append(round(val/len(x), 2))

    print(counter.keys(), pmf)

#### Class Size Paradox

The class size paradox describes apparent contradictory findings where a total allocation of resources is fixed. 

The idea behind this paradox is that there is a difference in how events are actually distributed and how events are perceived to be distributed. These types of divergence can have important consequences for data analysis. 

When students are surveyed at random, students from small classes are less likely to be sampled than their classes' share of all classes would suggest, while students from large classes are more likely to be sampled. The class size perceived by students is therefore larger than the actual average class size. This explains why the paradox takes place!
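
The effect can be seen with a few made-up numbers. The sketch below assumes a hypothetical school with four small classes and one large class, and compares the average class size per class with the average class size experienced per student.

```python
# Hypothetical school: class sizes as seen by the administration
class_sizes = [10, 10, 10, 10, 60]

# Average from the school's point of view: every class counts once
mean_per_class = sum(class_sizes) / len(class_sizes)

# Average from the students' point of view: a class of size s is reported by s students
mean_per_student = sum(s * s for s in class_sizes) / sum(class_sizes)

print("average class size per class:", mean_per_class)      # 20.0
print("average class size per student:", mean_per_student)  # 40.0
```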

## Cumulative Distribution Function

The cdf is a function of x, just like a pmf, where x is any value that can possibly appear in a given discrete distribution. To calculate cdf(x) for any value of x, we compute the fraction of values in the distribution less than or equal to x, following the percentile intuition.

The cdf is a **cumulative** function because it lets you find the probability by adding up the individual probabilities of all the outcomes included. For a die roll, when you want a 2 or less, you have 2 outcomes fulfilling this condition: 1 and 2, each with an individual probability of 1/6. Adding these up as 1/6 + 1/6 equals 2/6 or 1/3, which is the cumulative probability of a 2. 

>That's what cumulative means - it's just adding up probabilities.

#### CDF Function

    def calculate_cdf(lst, X):
        """Return the fraction of values in lst that are less than or equal to X."""
        count = 0
        for value in lst:
            if value <= X:
                count += 1

        cum_prob = count / len(lst) # normalizing cumulative probabilities (as with pmfs)
        return round(cum_prob, 3)

    #test data
    test_lst = [1,2,3]
    test_X = 2

    calculate_cdf(test_lst, test_X)

## Probability Density Function

A Probability Density Function (pdf) helps identify the regions in the distribution where observations are more likely to occur, i.e. where the distribution is more dense. Remember that while dealing with pmfs, we calculated the mass for each class. For a continuous variable, we do not have a fixed number of possible outcomes, so instead we create a density function.

> **The pdf is the function f(x) such that the probability that x falls between two values (a and b) is equal to the integral of f(x) (the area under the curve) from a to b.**

For a normal distribution, the density function is:

$$f(x\ |\ \mu,\ \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Where:

$x$ is the **_point_** at which we want to calculate the density

$\mu$ is the **_mean_** of the sample

$\pi$ is a mathematical constant, the irrational number $\pi \approx 3.14159$

$\sigma^2$ is the **_variance_** (since $\sigma$ is the **_standard deviation_**)

$e$ is **_Euler's Number_**, also known as the **_Base of the Natural Logarithm_**, $e \approx 2.71828$

#### PDF function (provides coordinates for graphing) 

    import numpy as np

    def density(x):
        """Return the bin midpoints and heights of a 10-bin density histogram of x."""
        n, bins = np.histogram(x, 10, density=True)
        # Initialize numpy arrays with zeros to store interpolated values
        pdfx = np.zeros(n.size)
        pdfy = np.zeros(n.size)

        # Interpolate through histogram bins
        # identify middle point between two neighbouring bin edges, in terms of x and y coords
        for k in range(n.size):
            pdfx[k] = 0.5*(bins[k]+bins[k+1])
            pdfy[k] = n[k]

        # return the coordinates of the density curve, ready for plotting
        return pdfx, pdfy


    #Generate test data and test the function
    np.random.seed(5)
    mu, sigma = 0, 0.1 # mean and standard deviation
    s = np.random.normal(mu, sigma, 100)
    x,y = density(s)

## Skewness and kurtosis

### Skewness

Skewness is the degree of distortion or deviation from the symmetrical normal distribution. Skewness can be seen as a measure to calculate the lack of symmetry in the data distribution.

Skewness helps you identify extreme values in one of the tails. A symmetrical distribution has a skewness of 0. 

Distributions can be **positively** or **negatively** skewed.

##### Positive Skewness

A distribution is **positively skewed** when the tail on the right side of the distribution is longer (also often called "fatter"). When there is positive skewness, the mean and median are bigger than the mode.

##### Negative Skewness

Distributions are **negatively skewed** when the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median are smaller than the mode.

##### Measuring Skewness

For univariate data $Y_1, Y_2, ..., Y_n$ the formula for skewness is:

$$\dfrac{\dfrac{\displaystyle\sum^n_{i=1}(Y_i-\bar{Y})^3}{n}}{s^3}$$

where $\bar{Y}$ is the mean, $s$ is the standard deviation, and $n$ is the number of data points. This formula for skewness is referred to as the **Fisher-Pearson coefficient of skewness**. There are also other ways to calculate skewness, but this is the one most commonly used.

##### Using this formula, when is data skewed?

The rule of thumb seems to be:

* A skewness between -0.5 and 0.5 means that the data are pretty symmetrical
* A skewness between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed) means that the data are moderately skewed.
* A skewness smaller than -1 (negatively skewed) or bigger than 1 (positively skewed) means that the data are highly skewed.
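
The skewness formula can be sketched in NumPy as follows. `skewness` is a hypothetical helper, the data is made up, and using the population standard deviation ($1/n$ divisor) for $s$ is an assumption.

```python
import numpy as np

def skewness(values):
    """Fisher-Pearson coefficient of skewness, following the formula above."""
    y = np.asarray(values, dtype=float)
    s = y.std()   # assumption: standard deviation with the 1/n divisor
    return np.mean((y - y.mean()) ** 3) / s ** 3

symmetric = [1, 2, 3, 4, 5]          # illustrative values
right_tailed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print(skewness(symmetric))      # approximately 0: symmetrical
print(skewness(right_tailed))   # greater than 1: highly positively skewed
```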

### Kurtosis

Kurtosis deals with the lengths of tails in the distribution. 

> **Where skewness talks about extreme values in one tail versus the other, kurtosis aims at identifying extreme values in both tails at the same time!**

You can think of Kurtosis as a **measure of outliers** present in the distribution.

##### Measuring Kurtosis

For univariate data $Y_1, Y_2, \dots, Y_n$ the formula for kurtosis is:

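$$\dfrac{\dfrac{\displaystyle\sum^n_{i=1}(Y_i-\bar{Y})^4}{n}}{s^4}$$

where $\bar{Y}$ is the mean, $s$ is the standard deviation, and $n$ is the number of data points.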

If there is a high kurtosis, then you may want to investigate why there are so many outliers. 
Presence of outliers could be indications of errors on the one hand, but they could also be some interesting observations that may need to be explored further. For banking transactions, for example, an outlier may signify a fraudulent activity. How we deal with outliers mainly depends on the domain. 

Low kurtosis in a data set is an indication that the data has light tails or a lack of outliers. If we get low kurtosis, we also need to investigate, and trim the dataset of unwanted results where appropriate.

#### How much kurtosis is bad kurtosis?

##### Mesokurtic ($\text{kurtosis}  \approx 3 $)

A mesokurtic distribution has kurtosis statistics that lie close to the ones of a normal distribution. Mesokurtic distributions have a kurtosis of around 3.
According to this definition, the standard normal distribution has a kurtosis of three.

##### Platykurtic ($\text{kurtosis} < 3 $):

When a distribution is platykurtic, the distribution is shorter and tails are thinner than the normal distribution. The peak is lower and broader than Mesokurtic, which means that the tails are light, and that there are fewer outliers than in a normal distribution. 

##### Leptokurtic ($\text{kurtosis}  > 3 $)

When you have a leptokurtic distribution, you have a distribution with longer and fatter tails. The peak is higher and sharper than the peak of a normal distribution, which means that data have heavy tails, and that there are more outliers. 

Outliers stretch your horizontal axis of the distribution, which means that the majority of the data appear in a narrower vertical range. This is why the leptokurtic distribution looks "skinny".
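
To see these categories on simulated data, the sketch below computes kurtosis for a normal sample (expected to be close to 3) and a uniform sample (whose theoretical kurtosis is 1.8, i.e. platykurtic). `kurtosis` is a hypothetical helper, and the sample sizes and seed are arbitrary choices.

```python
import numpy as np

def kurtosis(values):
    """Kurtosis as the fourth central moment divided by the squared variance."""
    y = np.asarray(values, dtype=float)
    return np.mean((y - y.mean()) ** 4) / y.std() ** 4

rng = np.random.default_rng(0)
normal_sample = rng.normal(0.0, 1.0, 100000)
uniform_sample = rng.uniform(-1.0, 1.0, 100000)

print("normal sample: {:.3f}".format(kurtosis(normal_sample)))    # mesokurtic
print("uniform sample: {:.3f}".format(kurtosis(uniform_sample)))  # platykurtic
```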


## Statistical testing

##### Link to several tests and their descriptions and uses

https://stats.idre.ucla.edu/spss/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-spss/

### Statistical significance
Statistical significance is one of those terms that is often used when someone claims that some data collection and analysis proves a point (or hypothesis). The terminology associated with statistical significance is often poorly understood, and results are frequently accepted without question. However, it is a simple idea that can be understood fairly easily. 

Statistical significance is mainly developed using samples and populations, hypothesis testing, the normal distribution, and p values.

### Population vs sample
The first step of every statistical analysis is to determine whether the data you are dealing with is a population or a sample.

A **population** is the collection of all items of interest to our study, and its size is usually denoted with an uppercase N. Numbers computed from a population are called parameters.

A **sample** is a subset of the population, and its size is denoted with a lowercase n. Numbers computed from a sample are called statistics.
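
To make the distinction concrete, here is a minimal sketch that builds a made-up population and draws a sample from it. The age range, the sizes, and the seed are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: ages of N = 10,000 people
population = rng.integers(18, 90, size=10_000)
N = population.size
mu = population.mean()      # a parameter: computed from the whole population

# Simple random sample of n = 200 people, drawn without replacement
sample = rng.choice(population, size=200, replace=False)
n = sample.size
x_bar = sample.mean()       # a statistic: computed from the sample only

print(f"N = {N}, population mean (parameter) = {mu:.2f}")
print(f"n = {n}, sample mean (statistic) = {x_bar:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean. That gap is what the rest of this section tries to quantify.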

#### One Sample z-Test
The one-sample z-test is used when we want to know whether our sample comes from a particular population. Z-scores let us go from asking "how far is a value from the mean?" to asking "how likely is a value this far from the mean to be from the same group of observations?". Once again, we are moving from statistics to probability (likelihood is measured as a probability).

A one-sample z-statistic is calculated as:

$$ \large \text{z-statistic} = \dfrac{\bar x - \mu_0}{{\sigma}/{\sqrt{n}}} $$

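    # Calculate the z-statistic
    import scipy.stats as stats
    from math import sqrt

    # Illustrative placeholder values only: replace them with your own
    x_bar = 103.0   # sample mean
    n = 50          # sample size
    sigma = 15.0    # standard deviation of the population
    mu = 100.0      # population mean

    z = (x_bar - mu)/(sigma/sqrt(n))
    print(z)

    # Z-table in Python

    # Probability up to the z-statistic (area to the left of z)
    print(stats.norm.cdf(z))
    # One-sided (upper-tail) p-value
    print(1 - stats.norm.cdf(z))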
    


#### P-value

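The p-value is a statistical summary of the compatibility between the observed data and what we would predict or expect to see if the entire statistical model, including the null hypothesis, were correct. Concretely, it is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. **The code above includes solving for the p-value.**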

#### Significance Threshold (Alpha)
The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Thresholds have typically been set at the .05 and .01 levels.
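
The sketch below ties the z-statistic, the p-value, and alpha together in one decision rule. The function name `one_sample_z_test` and all the input numbers are hypothetical, chosen only for illustration, and the test is assumed to be two-sided.

```python
from math import sqrt
from scipy import stats

def one_sample_z_test(x_bar, mu_0, sigma, n, alpha=0.05):
    """Return (z, two-sided p-value, reject_null) for a one-sample z-test."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    # sf(x) is 1 - cdf(x); doubling it covers both tails
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value, p_value < alpha

# Hypothetical numbers, not taken from any real data set
z, p, reject_05 = one_sample_z_test(x_bar=52.3, mu_0=50, sigma=8, n=64, alpha=0.05)
_, _, reject_01 = one_sample_z_test(x_bar=52.3, mu_0=50, sigma=8, n=64, alpha=0.01)

print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0 at alpha = 0.05:", reject_05)
print("reject H0 at alpha = 0.01:", reject_01)
```

With these inputs the p-value falls between the two common thresholds, so the same result is significant at the .05 level but not at the .01 level. This is why alpha should be fixed before looking at the data.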

#### Standard Error

The standard error (SE) is closely related to the standard deviation. Both are measures of spread: the higher the number, the more spread out the values are. The important difference is what they describe. The standard deviation describes the spread of individual observations, while the standard error describes the spread of a statistic (such as the sample mean) across repeated samples. For the sample mean, we obtain the standard error by dividing the standard deviation by the square root of the sample size.

$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$

* $\sigma$ is the population standard deviation
* $s$ is the sample standard deviation, used as an estimate when $\sigma$ is unknown
* $n$ is the sample size
    
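##### python code for standard error
    import numpy as np

    # Calculate the standard error by dividing the standard deviation by the square root of the sample size
    # (assumes `means` holds the observed values and `n` is the sample size)
    err = round(np.std(means)/np.sqrt(n), 2)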
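
One way to convince yourself of the $\sigma / \sqrt{n}$ formula is to simulate it. The sketch below draws many samples from a normal population, records each sample mean, and compares the spread of those means with the formula. The population parameters, sample size, and seed are assumptions for this example. Here `sample_means` is a collection of means from repeated samples, so its standard deviation is already the standard error.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 10.0        # assumed population standard deviation
n = 25              # size of each sample
n_samples = 20_000  # number of repeated samples

# Draw many samples of size n and record the mean of each one
sample_means = rng.normal(loc=50.0, scale=sigma, size=(n_samples, n)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)   # spread of the sample means
theoretical_se = sigma / np.sqrt(n)       # sigma / sqrt(n)

print(f"empirical SE:   {empirical_se:.3f}")
print(f"theoretical SE: {theoretical_se:.3f}")
```

The two numbers should agree closely, and the agreement improves as `n_samples` grows.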
    
#### Confidence Interval

A confidence interval is a range of values above and below the point estimate that captures the true population parameter at some predetermined confidence level. If we want to have a 95% chance of capturing the true population parameter with a point estimate and a corresponding confidence interval, we would set the confidence level to 95%. Higher confidence levels result in wider confidence intervals.

We calculate a confidence interval by taking a point estimate and then adding and subtracting a **margin of error** to create a range. Margin of error is based on your desired confidence level, the spread of the data and the size of your sample. The way you calculate the margin of error depends on whether you know the standard deviation of the population or not.

The margin of error for a known population standard deviation is:

**Margin of Error = z ∗ σ / √n**

Where σ (sigma) is the population standard deviation, n is sample size, and z is a number known as the z-critical value.

The z-critical value is the number of standard deviations you'd have to go from the mean of the normal distribution to capture the proportion of the data associated with the desired confidence level.
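
As a small sketch of how the z-critical value follows from the confidence level, the helper below (a hypothetical function named `z_critical`, not part of SciPy) splits the leftover probability evenly between the two tails and looks up the matching quantile with `stats.norm.ppf`.

```python
from scipy import stats

def z_critical(confidence):
    """Two-sided z-critical value for a confidence level in (0, 1)."""
    alpha = 1 - confidence
    # Half of alpha sits in each tail, so look up the 1 - alpha/2 quantile
    return stats.norm.ppf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

This reproduces the familiar table values of roughly 1.645, 1.960, and 2.576. A higher confidence level pushes z further out, which is why the interval gets wider.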

##### function to calculate z-critical value, confidence interval and margin of error between population and sample 

    import math
    import scipy.stats as stats

    def conf_interval(pop, sample):
        '''
        Function input: population, sample 
        Function output: z-critical, margin of error, confidence interval
        '''
        n = len(sample)
        x_hat = sample.mean()

        # Calculate the z-critical value using stats.norm.ppf()
        # Note that we use stats.norm.ppf(q = 0.975) to get the desired z-critical value 
        # instead of q = 0.95 because the distribution has two tails.
        z = stats.norm.ppf(q = 0.975)  #  z-critical value for 95% confidence

        # Calculate the population std from the population data
        # (ddof=0 divides by N for both NumPy arrays and pandas Series)
        pop_stdev = pop.std(ddof=0)

        # Calculate the margin of error using the formula given above
        moe = z * (pop_stdev/math.sqrt(n))

        # Calculate the confidence interval by applying margin of error to sample mean 
        # (mean - margin of error, mean + margin of error)
        conf = (x_hat - moe, x_hat + moe)

        return z, moe, conf

    # Call the above function with population and sample
    # (both assumed to be NumPy arrays or pandas Series defined earlier)
    z_critical, margin_of_error, confidence_interval = conf_interval(population, sample)

    print("Z-critical value:")
    print(z_critical)
    print("\nMargin of error:")
    print(margin_of_error)
    print("\nConfidence interval:")
    print(confidence_interval)
    
#### Confidence Intervals with T-distribution
T distributions are similar to the normal distribution in shape, but have heavier tails. T distributions also have a parameter known as degrees of freedom. The higher the degrees of freedom, the more closely the distribution resembles the normal distribution.
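
The effect of the degrees of freedom can be seen by comparing critical values. This minimal sketch prints the two-sided 95% t-critical value for a few arbitrarily chosen degrees of freedom next to the corresponding z-critical value.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 95% z-critical value

# The degrees of freedom listed here are arbitrary illustrative choices
for df in (2, 5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:>4}: t = {t_crit:.3f}  (z = {z_crit:.3f})")
```

The t-critical value is always larger than the z-critical value, reflecting the heavier tails, and it shrinks toward the z value as the degrees of freedom increase.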

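##### function to run confidence interval without knowing the population standard deviation, via t-critical value

    # The confidence level is passed positionally because its keyword is
    # `alpha` in older SciPy versions and `confidence` in newer ones
    stats.t.interval(0.95,                      # Confidence level
                     df = 24,                   # Degrees of freedom (sample size - 1)
                     loc = sample_mean,         # Sample mean
                     scale = sigma)             # Standard error: sample standard deviation / sqrt(sample size)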
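
For a fuller picture, here is a sketch of the same calculation starting from raw data. The sample is randomly generated for illustration, and the variable names are assumptions. The key points are that `scale` should be the standard error of the mean rather than the raw standard deviation, and that the result matches an interval built by hand from the t-critical value.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 25 measurements, generated only for illustration
rng = np.random.default_rng(7)
sample = rng.normal(loc=40.0, scale=8.0, size=25)

n = sample.size
sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)      # ddof=1 gives the sample standard deviation
std_err = sample_sd / np.sqrt(n)    # estimated standard error of the mean

# Confidence level passed positionally to work across SciPy versions
low, high = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=std_err)

# The same interval built by hand from the t-critical value
t_crit = stats.t.ppf(0.975, df=n - 1)
manual = (sample_mean - t_crit * std_err, sample_mean + t_crit * std_err)

print(f"95% CI from stats.t.interval: ({low:.2f}, {high:.2f})")
print(f"95% CI built by hand:         ({manual[0]:.2f}, {manual[1]:.2f})")
```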
