# Norm


* A norm is a function that returns the length of a vector.
* A **normed vector** means a vector that has a length, and that length is never negative, i.e. $||x|| \ge 0$, with equality only for the zero vector.
* Given a vector space $V$ where $\boldsymbol{u}, \boldsymbol{v} \in V$ and a scalar $a$, a norm $p$ has three properties:
    1. $p(\boldsymbol{u} + \boldsymbol{v}) \le p(\boldsymbol{u}) + p(\boldsymbol{v})$ (Triangle inequality)
    2. $p(a\boldsymbol{v}) = |a| p(\boldsymbol{v})$ (Length has no direction: scaling a vector by $a$ scales its length by $|a|$, whatever the sign of $a$)
    3. if $p(\boldsymbol{v}) = 0$ then $\boldsymbol{v}$ is the zero vector (Only the zero vector has zero length)
* The length of a vector can be thought of as the distance between the origin and the vector's head.
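
The three properties can be checked numerically for a concrete norm. The following is a minimal sketch that assumes the Euclidean norm and uses made-up example vectors; the helper name `euclidean_norm` is illustrative, not taken from any library.

```python
import math

def euclidean_norm(v):
    """Length of a vector given as a list of numbers (illustrative helper)."""
    return math.sqrt(sum(x * x for x in v))

# Example vectors and scalar, chosen arbitrarily for the check.
u = [3.0, 4.0]
v = [1.0, -2.0]
a = -2.5

u_plus_v = [ui + vi for ui, vi in zip(u, v)]
scaled_v = [a * vi for vi in v]

# 1. Triangle inequality: p(u + v) <= p(u) + p(v)
triangle_holds = euclidean_norm(u_plus_v) <= euclidean_norm(u) + euclidean_norm(v)

# 2. Absolute homogeneity: p(a v) = |a| p(v), even for negative a
homogeneity_holds = math.isclose(euclidean_norm(scaled_v), abs(a) * euclidean_norm(v))

# 3. The zero vector has zero length
zero_has_zero_length = euclidean_norm([0.0, 0.0]) == 0.0

print(triangle_holds, homogeneity_holds, zero_has_zero_length)
```

A check like this only illustrates the properties for particular inputs; it does not prove them.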


## $p$ norm

Also known as $L_p$ norm:

### $||x||_p := (\sum\limits_{i=1}^{n}{|x_i|^p})^\frac{1}{p}$

## $L_1$ norm

### $||x||_1 := \sum\limits_{i=1}^{n}{|x_i|}$

## $L_2$ norm

### $||x||_2 := (\sum\limits_{i=1}^{n}{|x_i|^2})^\frac{1}{2}$
### $||x||_2 = \sqrt{x_1^2 + x_2^2 + ... + x_n^2}$
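
As a rough illustration of the formulas above, the sketch below computes the $L_p$ norm directly from its definition and evaluates it for $p = 1$ and $p = 2$. The function name `p_norm` and the example vector are assumptions made for this example.

```python
def p_norm(x, p):
    """L_p norm of x, computed directly from the definition (requires p >= 1)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]

l1 = p_norm(x, 1)  # |3| + |-4| = 7
l2 = p_norm(x, 2)  # sqrt(3^2 + 4^2) = 5

print(l1, l2)
```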

# Regularization

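### $L(\theta) = - \sum\limits_{i} y_i \log(\hat{y}_i) + \lambda R(\theta)$

where $\theta$ is a vector of the parameters, the first term is the cross-entropy loss, and $\lambda \ge 0$ controls the strength of the regularization term $R(\theta)$.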

### $L_1$ Regularization: $R(\theta) = ||\theta||_1$ 

- Parameters become sparse
- Less sensitive to outliers
- Can have multiple solutions

### $L_2$ Regularization: $R(\theta) = ||\theta||_2$ 
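Sometimes the squared norm $||\theta||_2^2$ is used instead.

- Has a single global minimum (when the loss itself is convex)
- In practice, $L_2$ regularization often outperforms $L_1$, although this depends on the problem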
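
To make the regularized loss concrete, here is a minimal sketch that adds either penalty to a cross-entropy term. The function names, the example labels and predictions, and the value of `lam` (standing in for $\lambda$) are all illustrative assumptions; the $L_2$ penalty uses the squared form mentioned above.

```python
import math

def cross_entropy(y_true, y_pred):
    """-sum_i y_i * log(y_hat_i); assumes every prediction is strictly positive."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred))

def l1_penalty(theta):
    """R(theta) = ||theta||_1"""
    return sum(abs(t) for t in theta)

def l2_penalty_squared(theta):
    """R(theta) = ||theta||_2^2 (the squared variant)"""
    return sum(t * t for t in theta)

def regularized_loss(y_true, y_pred, theta, lam, penalty):
    """L(theta) = cross-entropy + lam * R(theta)"""
    return cross_entropy(y_true, y_pred) + lam * penalty(theta)

# Hypothetical one-hot label, predicted probabilities, and parameters.
y_true = [0.0, 1.0, 0.0]
y_pred = [0.2, 0.7, 0.1]
theta = [0.5, -1.5, 0.0]
lam = 0.1

loss_l1 = regularized_loss(y_true, y_pred, theta, lam, l1_penalty)
loss_l2 = regularized_loss(y_true, y_pred, theta, lam, l2_penalty_squared)

print(loss_l1, loss_l2)
```

In a real model the predictions would depend on $\theta$; they are held fixed here only to show how the penalty term enters the loss.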

# Distance

## Minkowski

### $D(X, Y) = (\sum\limits_{i=1}^{n} |x_i - y_i|^p)^\frac{1}{p}$

## Manhattan

### $D(X, Y) = \sum\limits_{i=1}^{n} |x_i - y_i|$


## Euclidean

### $D(X, Y) = \sqrt{\sum\limits_{i=1}^{n} (x_i - y_i)^2}$


## Chebyshev

### $D(X, Y) = \lim\limits_{p \rightarrow \infty}(\sum\limits_{i=1}^{n} |x_i - y_i|^p)^\frac{1}{p}$

### $D(X, Y) = \max\limits_{i}{(|x_i - y_i|)}$
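
The four distances are closely related: Manhattan and Euclidean are the Minkowski distance with $p = 1$ and $p = 2$, and Chebyshev is its limit as $p$ grows. The sketch below illustrates this with made-up points; the function names are assumptions for the example, and a large finite $p$ is used only to approximate the limit.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]   # coordinate differences are 3, 4, 0

manhattan = minkowski(x, y, 1)   # 3 + 4 + 0 = 7
euclidean = minkowski(x, y, 2)   # sqrt(9 + 16 + 0) = 5
cheb = chebyshev(x, y)           # max(3, 4, 0) = 4
large_p = minkowski(x, y, 50)    # close to the Chebyshev distance

print(manhattan, euclidean, cheb, large_p)
```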

# Similarity

$-1 \times D(X, Y)$ can also be viewed as a similarity, e.g. $S = - \sqrt{\sum\limits_{i=1}^{n} (x_i - y_i)^2}$ is the negated Euclidean distance, where $0$ is the most similar and more negative values mean less similar. Affinity Propagation uses this kind of similarity.

## Cosine Similarity

$A \cdot B = ||A|| \, ||B|| \cos(\theta)$


### $\cos(\theta) = \frac{A \cdot B}{||A|| \, ||B||}$

Q: Why use cosine similarity?

Consider tf-idf representations used for document retrieval. Documents $A$ and $B$ can use similar words but repeat them a different number of times, e.g. $A = [1, 1, 0, 0, 1]$, $B = [10, 10, 0.1, 0.1, 10]$. In this case the Euclidean distance between them is large, whereas the cosine similarity correctly reflects that the two documents are actually similar.
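
The example vectors above can be checked directly. This is a minimal sketch with illustrative helper names; it computes both measures for $A$ and $B$ so the contrast is visible.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||); assumes neither vector is all zeros."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

A = [1, 1, 0, 0, 1]
B = [10, 10, 0.1, 0.1, 10]

cos_ab = cosine_similarity(A, B)    # very close to 1: nearly the same direction
dist_ab = euclidean_distance(A, B)  # roughly 15.6: far apart in magnitude

print(cos_ab, dist_ab)
```

Cosine similarity ignores vector magnitude, so scaling one document's term counts leaves it almost unchanged, while the Euclidean distance grows with the difference in scale.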