Finding the closest objects in the feature space
=======================================================
Sometimes, the easiest thing to do is to just find the distance between
two objects. We just need to find some distance metric, compute the
pairwise distances, and compare the outcomes to what's expected.

#### Getting ready
- A lower-level utility in scikit-learn is `sklearn.metrics.pairwise`.
This contains several functions to compute the distances between
the vectors in a matrix X or the distances between the vectors
in X and Y easily.
- This can be useful for information retrieval. For example, given
a set of customers with attributes of X, we might want to take
a reference customer and find the closest customers to this customer.
- In this scenario, we might want to rank customers by the notion
of similarity measured by a distance function. The quality of
the similarity depends upon the feature space selection as well
as any transformation we might do on the space.
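
As a rough illustration of the reference-customer scenario, the sketch below ranks a handful of customers by their distance to one reference customer. The `customers` and `reference` arrays are hypothetical values invented for this example; only `pairwise_distances` and `np.argsort` come from the libraries discussed here.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical customer attributes: one row per customer, one column per feature.
customers = np.array([
    [25.0, 1.0],
    [40.0, 3.0],
    [27.0, 1.0],
    [60.0, 5.0],
])

# Hypothetical reference customer, kept 2-D so it is treated as a single row.
reference = np.array([[26.5, 1.0]])

# Distances from the reference to every customer: shape (1, n_customers).
dists = pairwise_distances(reference, customers)

# Customer indices ordered from most similar to least similar.
ranking = np.argsort(dists[0])
print(ranking)
```

Passing two arrays computes distances between the rows of the first and the rows of the second, so the result has one row per reference and one column per customer. In practice the features would usually be scaled first, since an unscaled feature with a large range dominates the distance.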

We'll walk through several different scenarios of measuring distance.

#### Implementation
We will use the `pairwise_distances` function to determine the "closeness"
of objects. Remember that closeness is really just similarity, which
we assess with our distance function.
First, let's import the pairwise distance function from the metrics
module and create a dataset to play with:

In [2]:
import numpy as np 
from sklearn.metrics import pairwise
from sklearn.datasets import make_blobs
points, labels = make_blobs()


The simplest way to check the distances is `pairwise_distances`:


In [7]:
distances = pairwise.pairwise_distances(points)
#distances

`distances` is an N x N matrix with 0s along the diagonal, because
every point is at distance 0 from itself. Let's confirm this for the
first five points:

In [8]:
np.diag(distances)[:5]


array([ 0.,  0.,  0.,  0.,  0.])

Now we can look at how far the other points are from the first point in `points`:


In [9]:
distances[0][:5]


array([  0.        ,  12.24904308,   4.04040824,  12.77570536,   0.9984187 ])

Ranking the points by closeness is very easy with `np.argsort`:


In [10]:
ranks = np.argsort(distances[0])
ranks[:5]


array([ 0, 49, 28, 97, 79])

A useful property of `argsort` is that it returns indices, so we can
use them to reorder our `points` matrix and get the actual points:


In [11]:
points[ranks][:5]

array([[ 4.44851113, -4.98728128],
       [ 4.67189656, -5.14993938],
       [ 4.37803613, -4.46318162],
       [ 4.7284959 , -4.23643072],
       [ 4.3194771 , -4.1474975 ]])

It's useful to see what the closest points look like.
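
The first entry of the ranking is the query point itself, at distance 0. A minimal sketch of a helper that skips it is shown below; `nearest_neighbors` is a hypothetical name introduced for illustration, and `random_state=0` is an assumption added only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# random_state is an assumption here, used so the example is repeatable.
points, labels = make_blobs(random_state=0)
distances = pairwise_distances(points)


def nearest_neighbors(distances, index, k=3):
    """Return the indices of the k points closest to points[index], excluding itself."""
    order = np.argsort(distances[index])
    # Drop the query index explicitly instead of assuming it always sorts first.
    order = order[order != index]
    return order[:k]


neighbors = nearest_neighbors(distances, 0, k=3)
print(neighbors)
print(points[neighbors])
```

Filtering out the query index, rather than slicing off the first element, keeps the helper correct even when another point sits at exactly the same coordinates.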


#### Theory: Euclidean Distance

Given some distance function, the distance is computed for each pair
of points. The default is the Euclidean distance, which is defined as follows:
if $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p, is given by the Pythagorean formula:

$$ d(p, q) = d(q, p)=  \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \ldots + (q_n - p_n)^2 }$$

$$ d(p, q) = \sqrt{\sum_{i=1}^n (q_i - p_i)^2}$$

Verbally, this takes the difference between each component of the
two vectors, squares the difference, sums them, and then takes the
square root. 

This looks very familiar, as we used something very similar when
looking at the mean-squared error. If we take the square root of the
mean-squared error, we get the root-mean-square error (RMSE), also
called the root-mean-square deviation. This is just the Euclidean
distance between the predictions and the targets, divided by $\sqrt{n}$.
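
The link between the two quantities can be checked numerically. The sketch below uses hypothetical `y_true` and `y_pred` values, chosen only for illustration, and compares the root-mean-square error with the Euclidean distance scaled by the square root of the number of samples.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Euclidean distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((y_true - y_pred) ** 2))

# Root-mean-square error: square root of the mean of squared differences.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The two differ only by a factor of sqrt(n).
print(np.isclose(rmse, euclidean / np.sqrt(len(y_true))))
```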


In Python, this looks like the following:



In [12]:
def euclid_distances(x, y):
    return np.power(np.power(x - y, 2).sum(), .5)

euclid_distances(points[0], points[1])


12.249043076597852

#### Other Distance Measures
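There are several other functions available in scikit-learn, and
scikit-learn will also use the distance functions of SciPy. The metric
names accepted by `pairwise_distances` include the following:
1. cityblock
2. cosine
3. euclidean
4. l1
5. l2
6. manhattan

Note that cityblock, l1, and manhattan all name the same distance, as
do euclidean and l2.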
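
To see how these metrics relate, the sketch below evaluates each one on the same pair of points. The array `X` is a hypothetical example; the metric strings are the lowercase names that `pairwise_distances` accepts.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Two hypothetical points, chosen so the distances are easy to check by hand.
X = np.array([[1.0, 0.0], [3.0, 4.0]])

results = {}
for metric in ["cityblock", "manhattan", "l1", "euclidean", "l2", "cosine"]:
    # Entry [0, 1] is the distance between the first and the second row.
    results[metric] = pairwise_distances(X, metric=metric)[0, 1]
    print(f"{metric:>10}: {results[metric]:.4f}")
```

The three grid-style names should agree with each other, as should the two straight-line names, while cosine measures the angle between the vectors rather than the gap between them.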

#### Worked Example
We can now solve problems. For example, if we were standing on
a grid at the origin, and the lines were the streets, how far would we
have to travel to get to point (5, 5)?


In [13]:
pairwise.pairwise_distances([[0, 0], [5, 5]], metric='cityblock')[0]


array([  0.,  10.])
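
The same answer can be worked out by hand, since the city-block distance is just the sum of the absolute coordinate differences. A minimal sketch, with `start` and `end` as illustrative names for the origin and the target point:

```python
import numpy as np

start = np.array([0, 0])  # the origin
end = np.array([5, 5])    # the target point on the grid

# Walk 5 blocks along one axis and 5 along the other.
manhattan = np.abs(end - start).sum()
print(manhattan)
```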