# Chapter 2
## Numeric Attributes

## 2.1 Univariate Analysis 
* The data matrix D can be an n x 1 matrix (simply a column vector).
$D = \begin{bmatrix}X \\ x_1\\x_2\\ \vdots \\x_n\end{bmatrix}$ <br/>
* X is the numeric attribute of interest, and each point $x_i$ can be treated as an identity random variable.

### Empirical Cumulative Distribution Function (CDF)
$F(x) = \frac{1}{n}\sum_{i=1}^{n}I(x_i \le x)$ <br/>
where <br/>
$I(x_i \le x) = \begin{cases}1, & \text{if } x_i \le x \\ 0, & \text{if } x_i > x \end{cases}$ <br/>
I is a binary indicator variable that indicates whether the given condition is satisfied. The function counts how many values in the data set are $\le x$ and divides that count by the sample size n, giving the fraction of points that are $\le x$.
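
The definition above can be sketched in a few lines of NumPy. The function name `ecdf` and the sample values are made up for illustration.

```python
import numpy as np

def ecdf(data, x):
    """Fraction of sample points that are <= x (the empirical CDF at x)."""
    data = np.asarray(data)
    # (data <= x) is the indicator I(x_i <= x); its mean is the sum divided by n.
    return np.mean(data <= x)

sample = [2, 4, 4, 5, 7]  # hypothetical sample
print(ecdf(sample, 1))  # 0.0, no point is <= 1
print(ecdf(sample, 4))  # 0.6, three of the five points are <= 4
print(ecdf(sample, 7))  # 1.0, every point is <= 7
```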

### Inverse Cumulative Distribution Function

$F^{-1}(q) = \min\{x \mid F(x) \ge q\}$ for $q\in[0,1]$ <br/> 
Gives the smallest value x for which at least a fraction q of the values are less than or equal to x (so at most a fraction 1-q of the values are higher). <br/>
$F^{-1}(0.75)$ returns the value that corresponds to the 75th percentile (3rd quartile).
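
A minimal sketch of this quantile function, assuming the definition $\min\{x \mid F(x) \ge q\}$ applied to the empirical CDF. The helper name `inverse_ecdf` and the sample are hypothetical. Note that `np.quantile` interpolates between points by default, so it can return different values.

```python
import numpy as np

def inverse_ecdf(data, q):
    """Smallest sample value x whose empirical CDF value F(x) is >= q."""
    xs = np.sort(np.asarray(data))
    # After sorting, at least (k + 1) of the n points are <= xs[k].
    steps = np.arange(1, len(xs) + 1) / len(xs)
    # First position whose step value reaches q.
    return xs[np.searchsorted(steps, q, side="left")]

sample = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical sample, sorted: 1 1 2 3 4 5 6 9
print(inverse_ecdf(sample, 0.25))  # 1 (first quartile)
print(inverse_ecdf(sample, 0.50))  # 3
print(inverse_ecdf(sample, 0.75))  # 5 (third quartile)
```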

### Empirical Probability Mass Function (PMF)

$f(x) = \frac{1}{n}\sum_{i=1}^{n}I(x_i = x)$ <br/>
where <br/>
$I(x_i = x) = \begin{cases}1, & \text{if } x_i = x \\ 0, & \text{if } x_i \ne x \end{cases}$ <br/>
The indicator is 1 when the values are equal, so the sum counts how many times x appears in the data set; dividing by n gives its relative frequency.
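
As a rough illustration, the empirical PMF can be built by counting each distinct value and dividing by n. The function name `empirical_pmf` and the data are made up for this sketch.

```python
from collections import Counter

def empirical_pmf(data):
    """Map each distinct value to the fraction of the sample equal to it."""
    n = len(data)
    return {value: count / n for value, count in Counter(data).items()}

sample = [1, 2, 2, 3, 3, 3]  # hypothetical sample
pmf = empirical_pmf(sample)
print(pmf[3])  # 0.5, the value 3 appears in 3 of the 6 points
```

The probabilities over all distinct values sum to 1, as expected of a mass function.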

### 2.1.1 Measures of Central Tendency

#### Mean

* Also known as the expected value. For discrete distributions:
$\mu = E[X] = \sum_{x}xf(x)$ 
* For continuous distributions:
$\mu = E[X] = \int_{-\infty}^{\infty}xf(x)dx$

#### Sample Mean
* Same idea as the mean, but computed from the sample: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i$

#### Sample Mean is Unbiased
* An estimator $\hat{\theta}$ is unbiased for a parameter $\theta$ if $E[\hat{\theta}] = \theta$. The sample mean is denoted $\hat{\mu}$. <br/>
$E[\hat{\mu}] = E[\frac{1}{n}\sum_{i=1}^{n}x_i] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$

#### Robustness 

* A statistic is robust if it is not strongly affected by extreme values.
    * The sample mean is not robust.
    * The trimmed mean, which discards a small fraction of extreme values on one or both ends, is more robust.
    * The median is robust.
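
To make the robustness claims concrete, here is a small sketch comparing the three statistics on a clean sample and on the same sample with one extreme value. The data and the helper `trimmed_mean` (which drops `k` points from each end) are assumptions for illustration.

```python
import numpy as np

def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest points."""
    xs = np.sort(np.asarray(data, dtype=float))
    return xs[k:len(xs) - k].mean()

clean = [10, 11, 12, 13, 14]
dirty = [10, 11, 12, 13, 1000]  # hypothetical sample with one extreme value

print(np.mean(clean), np.mean(dirty))                  # 12.0 vs 209.2: the mean shifts a lot
print(np.median(clean), np.median(dirty))              # 12.0 vs 12.0: the median is unchanged
print(trimmed_mean(clean, 1), trimmed_mean(dirty, 1))  # 12.0 vs 12.0
```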
    
#### Median
* Defined as the value m such that $P(X \le m) \ge \frac{1}{2} \text{ and } P(X \ge m) \ge \frac{1}{2}$ <br/>
* It is the middle-most value: half of the values of X are less than m and half of the values of X are more than m.

#### Mode
* The value that occurs most often, i.e. the value at which the probability mass (or density) function is largest: <br/>
$mode(X) = \arg\max_x f(x)$
* Example 2.1 shows sample mean, median, and mode.
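
The three measures of central tendency can be computed as follows. This is only a sketch on made-up data, not the data from Example 2.1.

```python
import numpy as np
from collections import Counter

sample = [1, 2, 2, 3, 3, 3, 10]  # hypothetical sample

mean = np.mean(sample)                       # 24 / 7, pulled upward by the value 10
median = np.median(sample)                   # middle value of the sorted sample
mode = Counter(sample).most_common(1)[0][0]  # most frequent value

print(mean, median, mode)
```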
### 2.1.2 Measures of Dispersion

Robustness:
* range is not robust (max and min values are directly part of the calculation) 


#### Range
* Difference between max and min values of X <br/>
$r = max(X) - min(X)$

#### Inter-Quartile Range
* Third quartile minus first quartile: <br/>
$IQR = F^{-1}(0.75) - F^{-1}(0.25)$
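
The sketch below computes the range and the IQR for a made-up sample, then repeats the calculation after replacing the largest value with an outlier. The helper `inverse_ecdf` is a hypothetical implementation of $F^{-1}$ as defined earlier in these notes.

```python
import numpy as np

def inverse_ecdf(data, q):
    """Smallest sample value x whose empirical CDF value F(x) is >= q."""
    xs = np.sort(np.asarray(data))
    steps = np.arange(1, len(xs) + 1) / len(xs)
    return xs[np.searchsorted(steps, q, side="left")]

def value_range(data):
    return max(data) - min(data)

def iqr(data):
    return inverse_ecdf(data, 0.75) - inverse_ecdf(data, 0.25)

sample = [1, 1, 2, 3, 4, 5, 6, 9]
with_outlier = [1, 1, 2, 3, 4, 5, 6, 900]  # largest value replaced by an outlier

print(value_range(sample), value_range(with_outlier))  # 8 vs 899
print(iqr(sample), iqr(with_outlier))                  # 4 vs 4
```

The range changes drastically while the IQR stays the same, which is why the IQR is considered the more robust of the two.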

#### Variance and Standard Deviation
* Variance of a random variable X shows how the values deviate from the mean. Formally it is defined as <br/>
$ \sigma^2 = var(X) = E[(X-\mu)^2] = \begin{cases} \sum_x(x-\mu)^2 f(x), & \text{if X is discrete} \\ \int_{-\infty}^{\infty}(x-\mu)^2f(x)dx, & \text{if X is continuous} \end{cases}$
* Standard deviation $\sigma$ is just the square root of variance $\sqrt{var(X)}$
* Variance can also be written as <br/>
$ var(X) = E[X^2] - (E[X])^2$ <br/> 

#### Sample Variance
* It is just the variance for a sample, so the mean used is the sample mean. Statistics are denoted as the parameter with a hat (caret); for example, the sample version of the mean $\mu$ is denoted $\hat{\mu}$. <br/>
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$
* The z-score, also known as the standard score, is the number of sample standard deviations a value lies from the sample mean: <br/>
$z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$
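
A short sketch of the sample variance and z-scores on made-up data. It uses the 1/n form of the variance from these notes, and each z-score divides the centered value by the sample standard deviation $\hat{\sigma}$.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

mu_hat = x.mean()                       # 5.0
var_hat = np.mean((x - mu_hat) ** 2)    # 4.0, the 1/n sample variance
sigma_hat = np.sqrt(var_hat)            # 2.0
z = (x - mu_hat) / sigma_hat            # standard deviations away from the mean

print(var_hat, sigma_hat)
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```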

#### Geometric Interpretation of Sample Variance 
* We can write the centered attribute values as a vector, where 1 is the n-dimensional vector of all ones: <br/> 
$Z = X - 1 \cdot \hat{\mu} = \begin{bmatrix} x_1 - \hat{\mu} \\ x_2 - \hat{\mu} \\ \vdots \\ x_n - \hat{\mu} \end{bmatrix}$ <br/>
* We can write the sample variance in terms of Z: it is the squared length of the centered vector divided by n. <br/>
$\hat{\sigma}^2 = \frac{1}{n} \|Z\|^2 = \frac{1}{n}Z^{T}Z = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$
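
The geometric view can be checked numerically: the squared length of the centered vector divided by n should match the sample variance. The data below is made up for illustration.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample
n = len(x)

Z = x - np.ones(n) * x.mean()              # centered vector X - 1 * mu_hat
var_from_norm = np.linalg.norm(Z) ** 2 / n
var_from_dot = Z @ Z / n

print(var_from_norm, var_from_dot, np.var(x))  # all three agree up to rounding
```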

#### Variance of the Sample Mean
* We can derive an expression for the variance of the sample mean by assuming that the random variables $x_i$ are all independent, each with mean $\mu$ and variance $\sigma^2$. Independence lets the variance of the sum split into a sum of variances: <br/>
$var(\sum_{i=1}^{n}x_i) = \sum_{i=1}^{n}var(x_i) = \sum_{i=1}^{n}\sigma^2 = n\sigma^2$ <br/>
$E[\sum_{i=1}^{n}x_i] = n\mu$ <br/>
* The sample mean is $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i$, and scaling a random variable by a constant c scales its variance by $c^2$, so <br/>
$var(\hat{\mu}) = var(\frac{1}{n}\sum_{i=1}^{n}x_i) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n}$ <br/>
* The variance of the sample mean $\hat{\mu}$ is proportional to the population variance $\sigma^2$, and it shrinks as the sample size $n$ grows.
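
A quick simulation can illustrate the $\sigma^2/n$ result. The sketch assumes normally distributed data with a fixed seed; the distribution, $\sigma^2 = 4$, $n = 25$ and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0     # assumed population variance
n = 25           # sample size
trials = 20000   # number of independent samples drawn

# Each row is one sample of size n; take the mean of every row.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
sample_means = samples.mean(axis=1)

empirical = sample_means.var()   # variance of the sample means across trials
theoretical = sigma2 / n         # 0.16

print(empirical, theoretical)    # the two values should be close
```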

#### Sample Variance is Biased, but is Asymptotically Unbiased
$E[\hat{\sigma}^2] = \frac{1}{n}n\sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$
* Because $\frac{n-1}{n} < 1$, the sample variance underestimates $\sigma^2$ on average. However, as n approaches infinity, $\frac{n-1}{n} \to 1$, so $\hat{\sigma}^2$ is an asymptotically unbiased estimator of the population variance.
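
The bias can be seen in a small simulation. The sketch below assumes standard normal data ($\sigma^2 = 1$) and a deliberately small sample size $n = 5$, so the expected value of the 1/n variance should be near $\frac{n-1}{n} = 0.8$. Passing `ddof=1` to `np.var` switches to the 1/(n-1) form, which is unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5            # small sample size makes the bias visible
trials = 50000   # number of independent samples

samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))

avg_biased = samples.var(axis=1, ddof=0).mean()    # 1/n form used in these notes
avg_unbiased = samples.var(axis=1, ddof=1).mean()  # 1/(n-1) form

print(avg_biased)    # should be close to (n - 1) / n = 0.8
print(avg_unbiased)  # should be close to 1.0
```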

## 2.2 Bivariate Analysis
* We can put two attributes into a data matrix with
$D = \begin{bmatrix}X_1 & X_2 \\ x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{bmatrix}$

#### Empirical Joint Probability Mass Function
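* Counts the number of occurrences of a given pair of values $(x_1, x_2)$ in the data and divides by n, giving the fraction of observations equal to that pair.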

### 2.2.1 Measures of Location and Dispersion
#### Mean
$\mu = E[X] = E[\begin{bmatrix}X_1 \\ X_2\end{bmatrix}] = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}$

#### Variance
* Each attribute has its own variance: $\sigma_1^2$ for $X_1$ and $\sigma_2^2$ for $X_2$.
* The total variance is their sum: <br/>
$var(D) = \sigma_1^2 + \sigma_2^2$
* The same applies to the sample variance, using $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$.

### 2.2.2 Measures of Association
#### Covariance
* The *covariance* between two attributes provides a measure of the linear association between them. The covariance between $X_1$ and $X_2$ is defined as $\sigma_{12} = E[(X_1-\mu_1)(X_2-\mu_2)] = E[X_1X_2] -E[X_1]E[X_2]$
* If $X_1$ and $X_2$ are independent, then $E[X_1X_2] = E[X_1]E[X_2]$, which implies that $\sigma_{12} = 0$.
    * The converse does not hold: a covariance of 0 does not imply that the variables are independent, because covariance only captures linear dependence and there may be a higher-order (nonlinear) relationship between them.
* The covariance of an attribute with itself is its variance: $\hat{\sigma}_{11} = \hat{\sigma}_{1}^2$

#### Correlation
* Standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable. <br/>
$\rho_{12} = \frac{\sigma_{12}}{\sigma_{1} \sigma_{2}}$
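
As an illustration, the sample covariance and correlation can be computed directly from the definitions. The sketch uses the two attributes A1 and A2 from the small DataFrame example in these notes and the 1/n form of the estimators.

```python
import numpy as np

x1 = np.array([1.0, 5.0, 9.0])   # attribute A1
x2 = np.array([0.8, 2.4, 5.5])   # attribute A2

# sigma_12 = E[X1 X2] - E[X1] E[X2], with expectations replaced by sample means
cov12 = np.mean(x1 * x2) - x1.mean() * x2.mean()

# rho_12 = sigma_12 / (sigma_1 * sigma_2); np.std uses the 1/n form by default
rho12 = cov12 / (x1.std() * x2.std())

print(cov12)  # about 6.2667
print(rho12)  # about 0.983, a strong positive linear association
```

The correlation is always between -1 and 1, which makes it easier to compare across attribute pairs than the raw covariance.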

#### Geometric Interpretation of Sample Covariance and Correlation
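$Z_1 = X_1 - 1 \cdot \hat{\mu}_1 = \begin{bmatrix} x_{11} - \hat{\mu}_1 \\ x_{21} - \hat{\mu}_1 \\ \vdots \\ x_{n1}-\hat{\mu}_1\end{bmatrix}$

$Z_2 = X_2 - 1 \cdot \hat{\mu}_2 = \begin{bmatrix} x_{12} - \hat{\mu}_2 \\ x_{22} - \hat{\mu}_2 \\ \vdots \\ x_{n2}-\hat{\mu}_2\end{bmatrix}$ <br/>

The sample covariance can be written as $\hat{\sigma}_{12} = \frac{Z_1^TZ_2}{n}$, and the sample correlation as $\hat{\rho}_{12} = \frac{Z_1^TZ_2}{\|Z_1\| \|Z_2\|}$, which is the cosine of the angle between the two centered attribute vectors.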
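
The geometric reading can be checked numerically: the dot product of the centered vectors divided by n gives the sample covariance, and the cosine of the angle between them gives the sample correlation. The sketch reuses attributes A1 and A2 from the DataFrame example in these notes.

```python
import numpy as np

x1 = np.array([1.0, 5.0, 9.0])   # attribute A1
x2 = np.array([0.8, 2.4, 5.5])   # attribute A2
n = len(x1)

Z1 = x1 - x1.mean()   # centered attribute vectors
Z2 = x2 - x2.mean()

cov12 = Z1 @ Z2 / n
cos_theta = Z1 @ Z2 / (np.linalg.norm(Z1) * np.linalg.norm(Z2))

print(cov12)      # about 6.2667
print(cos_theta)  # equals the sample correlation
```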

#### Covariance Matrix
* Covariance matrix can be summarized as <br/>
$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$ which is a symmetric matrix because $\sigma_{12} = \sigma_{21}$. The total variance is $var(D) = tr(\Sigma) = \sigma_1^2 + \sigma_2^2$ where tr() is the trace of a matrix, which is the sum of the diagonal elements.
* We can say that $tr(\Sigma) \ge 0$, since variance can never be negative. The sample covariance matrix is given as: <br/>
$\hat{\Sigma} = \begin{bmatrix} \hat{\sigma}_1^2 & \hat{\sigma}_{12} \\ \hat{\sigma}_{21} & \hat{\sigma}_2^2 \end{bmatrix}$

## 2.3 Multivariate Analysis
* The data matrix D has d attributes $X_1, ..., X_d$ (columns) and n observations $x_1, ..., x_n$ (rows). 

#### Mean
* The mean is obtained by taking the mean of each attribute
$\mu = E[X] = \begin{bmatrix} E[X_1] \\ E[X_2] \\ ... \\ E[X_d] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ ... \\ \mu_d \end{bmatrix}$

#### Covariance Matrix
$\Sigma = E[(X-\mu)(X-\mu)^T] = \begin{bmatrix}\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$
* The off-diagonal elements $\sigma_{ij} = \sigma_{ji}$ represent the covariance between attribute pairs $X_i$ and $X_j$. 

#### Total and Generalized Variance 
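* The total variance is the trace of the covariance matrix and must be non-negative: <br/>
$var(D) = tr(\Sigma) = \sigma_1^2 + \sigma_2^2 + ... + \sigma_d^2$
* The generalized variance is the determinant of the covariance matrix, $\det(\Sigma)$, which is also non-negative.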
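
A brief sketch of both quantities, using the usual definitions (total variance as the trace of the covariance matrix, generalized variance as its determinant). It reuses the two-attribute data from the DataFrame example in these notes and the 1/n sample covariance matrix.

```python
import numpy as np

D = np.array([[1.0, 0.8],
              [5.0, 2.4],
              [9.0, 5.5]])   # columns are attributes A1 and A2
n = D.shape[0]

Z = D - D.mean(axis=0)       # center each column
Sigma_hat = Z.T @ Z / n      # sample covariance matrix (1/n form)

total_variance = np.trace(Sigma_hat)             # sum of the diagonal variances
generalized_variance = np.linalg.det(Sigma_hat)  # determinant of the matrix

print(total_variance)        # about 14.4733
print(generalized_variance)  # about 1.3333
```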

#### Sample Covariance Matrix
* Let Z represent the centered data matrix, whose columns are the centered attribute vectors $Z_i = X_i - 1 \cdot \hat{\mu}_i$ <br/>
$Z = D - 1 \cdot \hat{\mu}^T = (Z_1 \; Z_2 \; ... \; Z_d)$ <br/> 
$\hat{\Sigma} = \frac{1}{n}Z^TZ = \frac{1}{n} \begin{bmatrix} Z_1^TZ_1 & Z_1^TZ_2 & \cdots & Z_1^TZ_d \\ Z_2^TZ_1 & Z_2^TZ_2 & \cdots & Z_2^TZ_d \\ \vdots & \vdots & \ddots & \vdots \\ Z_d^TZ_1 & Z_d^TZ_2 & \cdots & Z_d^TZ_d \end{bmatrix}$
* Each entry of the sample covariance matrix is the dot product of a pair of centered attribute vectors divided by the sample size. 

In [9]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A1':[1,5,9], 'A2':[0.8, 2.4, 5.5]})
df

|   | A1 | A2 |
|---|----|-----|
| 0 | 1 | 0.8 |
| 1 | 5 | 2.4 |
| 2 | 9 | 5.5 |


In [10]:
mu_sample = df.mean()
mu_sample

A1    5.0
A2    2.9
dtype: float64

In [22]:
Z = df - np.ones((3,1)) * mu_sample.values

In [31]:
cov = 1/3 * np.dot(Z.T.values, Z.values)
cov

array([[10.66666667,  6.26666667],
       [ 6.26666667,  3.80666667]])

## 2.4 Data Normalization 
#### Range Normalization 
Let X be an attribute. Range normalization rescales each value using the minimum and maximum of the attribute: <br/>
$x_i' = \frac{x_i - \min_i\{x_i\}}{\max_i\{x_i\} - \min_i\{x_i\}}$ <br/> 
After the transformation, the attribute takes on values in [0, 1].
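
A minimal sketch of range normalization, applied to attribute A1 from the DataFrame example in these notes. The function name `range_normalize` is made up, and the sketch assumes the attribute is not constant (otherwise the denominator is zero).

```python
import numpy as np

def range_normalize(x):
    """Rescale values to [0, 1] using the attribute's min and max."""
    x = np.asarray(x, dtype=float)
    # Assumes max > min; a constant attribute would divide by zero here.
    return (x - x.min()) / (x.max() - x.min())

a1 = [1, 5, 9]
print(range_normalize(a1))  # [0.  0.5 1. ]
```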

#### Standard Score Normalization 
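* It is just the z-score: each value is replaced by $x_i' = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$, so the normalized attribute has mean 0 and standard deviation 1.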
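
Standard score normalization can be sketched column by column on a data matrix. The example reuses the two-attribute data from the DataFrame example in these notes and the 1/n standard deviation.

```python
import numpy as np

D = np.array([[1.0, 0.8],
              [5.0, 2.4],
              [9.0, 5.5]])   # columns are attributes A1 and A2

mu_hat = D.mean(axis=0)      # per-attribute sample mean
sigma_hat = D.std(axis=0)    # per-attribute standard deviation (1/n form)
D_std = (D - mu_hat) / sigma_hat

print(D_std.mean(axis=0))    # approximately [0, 0]
print(D_std.std(axis=0))     # approximately [1, 1]
```

When working with pandas instead, note that `DataFrame.std` defaults to `ddof=1`, so `ddof=0` would be needed to match the 1/n form used here.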

## 2.5 Normal Distribution
Read the chapter; a lot of it is just review. 