# **Statistical Terms**

import scipy as sp
import numpy as np
import matplotlib.pyplot as plt

## **Contents**

- [Coefficient of Determination (R-squared: $R^2$)](#coefficient-of-determination-r-squared-r2)
- [Coefficient of Variation (CV)](#coefficient-of-variation-cv) 
- [Correlation Coefficient (Pearson's r)](#correlation-coefficient-pearsons-r)
- [Covariance](#covariance)
- [Mean](#mean)
- [Mean Normalization](#mean-normalization)
- [Min-Max Normalization](#min-max-normalization)
- [Standard Deviation](#standard-deviation)
- [Variance](#variance)
- [Z-Score (Standard Score)](#z-score-standard-score)

## **Coefficient of Determination (R-squared: $R^2$)**

The **coefficient of determination** measures how well a regression model represents the data: it is the proportion of the variance in the dependent variable that the model explains.
$$ 
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} 
$$
- Where
    - $ SS_{res} = \sum_i (y_i - \hat{y}_i)^2 $ is the sum of squares of residuals
    - $ SS_{tot} = \sum_i (y_i - \bar{y})^2 $ is the total sum of squares

**HINT**: In linear least squares multiple regression (with fitted intercept and slope), $R^2$ equals $\rho^2(y, \hat{y})$, the square of the [Pearson correlation coefficient](#correlation-coefficient-pearsons-r) between the observed values $y$ and the modeled (predicted) values $\hat{y}$ of the dependent variable.
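As a rough illustration of the formula and the hint above, the sketch below fits a straight line to a small made-up data set with `np.polyfit` and computes $R^2$ both from the sums of squares and as the squared correlation between $y$ and $\hat{y}$. The data values and variable names are assumptions chosen for the example.

```python
import numpy as np

# Illustrative data (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)       # sum of squares of residuals
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

# Squared Pearson correlation between observed and predicted values.
rho_squared = np.corrcoef(y, y_hat)[0, 1] ** 2

print(r_squared, rho_squared)  # both are approximately 0.6
```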

## **Coefficient of Variation (CV)**

In probability theory and statistics, the **coefficient of variation (CV)**, also known as **normalized root-mean-square deviation (NRMSD)**, **percent RMS**, and **relative standard deviation (RSD)**, is a standardized measure of dispersion of a probability distribution or frequency distribution.

It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$ (or its absolute value, $|\mu|$) and is often expressed as a percentage (%RSD).

The CV or RSD is widely used in analytical chemistry to express the precision and repeatability of an assay.
For Percent CV, multiply the coefficient of variation by 100 ($CV \times 100$).

### **Population**

$$
CV = \frac{\sigma}{\mu}
$$
- Where
    - $ \sigma $ is the standard deviation of the population
    - $ \mu $ is the mean of the population

### **Sample**

$$
CV = \frac{s}{\bar{x}}
$$
- Where
    - $ s $ is the standard deviation of the sample
    - $ \bar{x} $ is the mean of the sample
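The two definitions above differ only in which standard deviation is used. A minimal sketch with NumPy, assuming a small made-up data set; `ddof=0` gives the population standard deviation and `ddof=1` the sample one:

```python
import numpy as np

# Illustrative data (made up for this sketch); its mean is 5.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population CV: sigma / mu, with ddof=0 (divide by N).
cv_population = np.std(data, ddof=0) / np.mean(data)

# Sample CV: s / x_bar, with ddof=1 (divide by n - 1).
cv_sample = np.std(data, ddof=1) / np.mean(data)

# Percent CV is the same ratio multiplied by 100.
percent_cv = cv_population * 100

print(cv_population, cv_sample, percent_cv)
```

The CV is only meaningful when the mean is not close to zero, since the ratio grows without bound as the mean approaches zero.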

## **Correlation Coefficient (Pearson's r)**

The **correlation coefficient** is a statistical measure that describes the strength and direction of the linear relationship between two variables. It ranges from -1 to 1. The sign gives the direction, and the closer the absolute value is to 1, the stronger the relationship.

It is essentially a normalized measurement of the [covariance](#covariance). As with covariance itself, the measure can only reflect a linear correlation of variables, and it ignores many other types of relationships or correlations.

As a simple example, one would expect the age and height of a sample of children from a school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).

The following table gives a common rule-of-thumb interpretation of the coefficient's strength and direction (the exact cut-offs vary between fields):

| Correlation Coefficient | Strength of Relationship | Correlation Direction / Type |
|-------------------------|--------------------------|-----------------------|
| -0.7 to -1.0 | Very Strong | Negative |
| -0.5 to -0.7 | Strong | Negative |
| -0.3 to -0.5 | Moderate | Negative |
| -0.1 to -0.3 | Weak | Negative |
| 0.0 | None | Zero |
| 0.1 to 0.3 | Weak | Positive |
| 0.3 to 0.5 | Moderate | Positive |
| 0.5 to 0.7 | Strong | Positive |
| 0.7 to 1.0 | Very Strong | Positive |

### **Population**

$$
\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}
$$
- Where
    - $ cov(X, Y) $ is the covariance of $X$ and $Y$
    - $ \sigma_X $ is the standard deviation of $X$
    - $ \sigma_Y $ is the standard deviation of $Y$

### **Sample**

$$
r_{x,y} = \frac{cov(x, y)}{s_x s_y}
$$
- Where
    - $ cov(x, y) $ is the covariance of $x$ and $y$
    - $ s_x $ is the standard deviation of $x$
    - $ s_y $ is the standard deviation of $y$
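The sample formula above can be checked numerically. The sketch below computes $r$ from the sample covariance and sample standard deviations and compares it with `np.corrcoef`; the data values are made up for illustration.

```python
import numpy as np

# Illustrative data (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance: divide by n - 1.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Pearson's r: covariance normalized by the product of the
# sample standard deviations (ddof=1).
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between x and y.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The $n-1$ (or $N$) factors cancel between numerator and denominator, so the population and sample formulas give the same value on the same data. If a p-value is also needed, `scipy.stats.pearsonr` returns one alongside the coefficient.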

## **Covariance**

In probability theory and statistics, **covariance** is a measure of the joint variability of two random variables. The sign of the covariance shows the tendency in the linear relationship between the variables:
- Positive: the variables tend to increase together
- Negative: one variable tends to decrease as the other increases
- Zero: the variables have no linear relationship (independent variables always have zero covariance, but zero covariance alone does not imply independence)

The [correlation coefficient](#correlation-coefficient-pearsons-r) normalizes the covariance by dividing by the geometric mean of the variances of the two random variables, which is the product of their standard deviations.

The covariance of a variable with itself is the variance of that variable.

### **Population**

$$
cov(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)
$$
- Where
    - $ N $ is the number of observations
    - $ X_i $ is the $i^{th}$ observation of $X$
    - $ Y_i $ is the $i^{th}$ observation of $Y$
    - $ \mu_X $ is the mean of $X$
    - $ \mu_Y $ is the mean of $Y$

### **Sample**

$$
cov(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$
- Where
    - $ n $ is the number of observations
    - $ x_i $ is the $i^{th}$ observation of $x$
    - $ y_i $ is the $i^{th}$ observation of $y$
    - $ \bar{x} $ is the mean of $x$
    - $ \bar{y} $ is the mean of $y$
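A minimal sketch of both formulas with NumPy, using made-up data. Note that `np.cov` divides by $n-1$ by default; passing `ddof=0` gives the population version. The last part is an illustrative case where the covariance is zero even though one variable is fully determined by the other.

```python
import numpy as np

# Illustrative data (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Products of the deviations from each mean.
dev_products = (x - x.mean()) * (y - y.mean())

cov_population = dev_products.sum() / n        # divide by N
cov_sample = dev_products.sum() / (n - 1)      # divide by n - 1

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
# and the diagonal holds the variances of x and y.
cov_matrix_sample = np.cov(x, y)               # default uses n - 1
cov_matrix_population = np.cov(x, y, ddof=0)   # uses N

# Zero covariance without independence: y_sq depends entirely on x_sym,
# but the relationship is symmetric rather than linear.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_sq = x_sym ** 2
cov_nonlinear = np.cov(x_sym, y_sq, ddof=0)[0, 1]

print(cov_population, cov_sample, cov_nonlinear)
```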

## **Mean**

The **mean** (arithmetic mean) is the average of a set of numbers. It is calculated by dividing the sum of the numbers by the count of the numbers.
$$
\mu = \frac{1}{N} \sum_{i=1}^{N} X_i
$$
- Where
    - $ N $ is the number of observations
    - $ X_i $ is the $i^{th}$ observation
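For completeness, a short sketch of the formula next to `np.mean`, with made-up data:

```python
import numpy as np

# Illustrative data (made up for this sketch).
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sum of the observations divided by their count.
mean_manual = data.sum() / data.size

# Equivalent NumPy call.
mean_numpy = np.mean(data)

print(mean_manual, mean_numpy)  # both are 5.0
```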

## **Mean Normalization**

$$
X_{norm} = \frac{X - \mu}{\max(X) - \min(X)}
$$
- Where
    - $ X $ is the observation
    - $ \mu $ is the mean of the observations
    - $ \max(X) $ is the maximum value of the observations
    - $ \min(X) $ is the minimum value of the observations
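The formula above centers the data on zero and scales it by the range, so the result has mean 0 and spans a total width of 1. A minimal sketch follows; the helper name `mean_normalize` and the handling of constant input are assumptions made for this example, not an established API.

```python
import numpy as np

def mean_normalize(values):
    """Hypothetical helper: (X - mean) / (max - min)."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # All observations are equal, so the denominator would be zero.
        raise ValueError("mean normalization is undefined for constant data")
    return (values - values.mean()) / value_range

normalized = mean_normalize([10.0, 20.0, 30.0, 40.0, 50.0])
print(normalized)  # values -0.5, -0.25, 0, 0.25, 0.5
```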

## **Min-Max Normalization**

### **Range [0, 1]**

$$
X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)}
$$
- Where
    - $ X $ is the observation
    - $ \min(X) $ is the minimum value of the observations
    - $ \max(X) $ is the maximum value of the observations

### **Range [a, b]**

$$
X_{norm} = a + \frac{(X - \min(X))(b - a)}{\max(X) - \min(X)}
$$
- Where
    - $ X $ is the observation
    - $ \min(X) $ is the minimum value of the observations
    - $ \max(X) $ is the maximum value of the observations
    - $ a $ is the minimum value of the new range
    - $ b $ is the maximum value of the new range
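Both ranges can be handled by one function, since $[0, 1]$ is the special case $a = 0$, $b = 1$. The sketch below is one possible way to write it; the helper name `min_max_scale` and its defaults are assumptions made for this example.

```python
import numpy as np

def min_max_scale(values, a=0.0, b=1.0):
    """Hypothetical helper: rescale values linearly into [a, b]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All observations are equal, so the denominator would be zero.
        raise ValueError("min-max normalization is undefined for constant data")
    return a + (values - lo) * (b - a) / (hi - lo)

data = [10.0, 20.0, 30.0, 40.0, 50.0]
unit_range = min_max_scale(data)                # maps into [0, 1]
symmetric = min_max_scale(data, a=-1.0, b=1.0)  # maps into [-1, 1]

print(unit_range)  # values 0, 0.25, 0.5, 0.75, 1
print(symmetric)   # values -1, -0.5, 0, 0.5, 1
```

Because the minimum and maximum set the scale, a single outlier can compress the remaining values into a narrow part of the target range.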

## **Standard Deviation**

### **Population**

$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}
$$
- Where
    - $ N $ is the number of observations
    - $ X_i $ is the $i^{th}$ observation
    - $ \mu $ is the mean of the population

### **Sample**

$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$
- Where
    - $ n $ is the number of observations
    - $ x_i $ is the $i^{th}$ observation
    - $ \bar{x} $ is the mean of the sample
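In NumPy the two formulas correspond to the `ddof` ("delta degrees of freedom") argument of `np.std`, which defaults to 0 (the population formula). A short sketch with made-up data:

```python
import numpy as np

# Illustrative data (made up for this sketch); its mean is 5.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = data.size

squared_deviations = (data - data.mean()) ** 2

# Population standard deviation: divide by N.
std_population = np.sqrt(squared_deviations.sum() / n)

# Sample standard deviation: divide by n - 1.
std_sample = np.sqrt(squared_deviations.sum() / (n - 1))

# The same results via np.std.
std_population_np = np.std(data)          # ddof=0 is the default
std_sample_np = np.std(data, ddof=1)

print(std_population, std_sample)  # 2.0 and roughly 2.138
```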

## **Variance**

### **Population**

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
$$
- Where
    - $ N $ is the number of observations
    - $ X_i $ is the $i^{th}$ observation
    - $ \mu $ is the mean of the population

### **Sample**

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$
- Where
    - $ n $ is the number of observations
    - $ x_i $ is the $i^{th}$ observation
    - $ \bar{x} $ is the mean of the sample
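The variance is the square of the standard deviation, and it equals the covariance of a variable with itself. The sketch below checks both relations on made-up data, using `np.var` with `ddof=0` for the population formula and `ddof=1` for the sample formula.

```python
import numpy as np

# Illustrative data (made up for this sketch); its mean is 5.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

var_population = np.var(data, ddof=0)  # divide by N
var_sample = np.var(data, ddof=1)      # divide by n - 1

# Variance is the squared standard deviation.
std_squared = np.std(data, ddof=0) ** 2

# Covariance of the data with itself (np.cov uses n - 1 by default).
self_covariance = np.cov(data, data)[0, 1]

print(var_population, var_sample)  # 4.0 and roughly 4.571
```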

## **Z-Score (Standard Score)**

### **Population**

$$
z = \frac{X - \mu}{\sigma}
$$
- Where
    - $ X $ is the observation
    - $ \mu $ is the mean of the population
    - $ \sigma $ is the standard deviation of the population

### **Sample**

$$
z = \frac{x - \bar{x}}{s}
$$
- Where
    - $ x $ is the observation
    - $ \bar{x} $ is the mean of the sample
    - $ s $ is the standard deviation of the sample
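A z-score expresses each observation as a number of standard deviations above or below the mean. A minimal sketch of both formulas, with made-up data:

```python
import numpy as np

# Illustrative data (made up for this sketch); its mean is 5.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population z-scores: standardize with sigma (ddof=0).
z_population = (data - data.mean()) / data.std(ddof=0)

# Sample z-scores: standardize with s (ddof=1).
z_sample = (data - data.mean()) / data.std(ddof=1)

print(z_population)  # values -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
```

After standardization the scores have mean 0 and a standard deviation of 1 (measured with the same `ddof` that was used to compute them). `scipy.stats.zscore(data, ddof=0)` should produce the same population result.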