Write some code that prints out the distance matrix for a set of news articles of your choosing. Test the program with different sets of news articles. Can you find or create your own dataset?

## What is a Distance Matrix?

*How similar are these two articles?*

How would you as a human answer this question?

You might begin by reading the articles, then summarise their relevant points, compare the articles by their summaries, and ultimately suggest that the two articles are "*insert whatever adjective*" alike.

But how would a computer do this? A computer, or a program, might take a similar approach. For example a program might:
* Read the articles
* Look for patterns in the words (TF)
* Allocate importance to the words (IDF)
* Compare the patterns in one article to the patterns in the other
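
The first three steps in the list above can be sketched in a few lines of plain Python. This is only a toy illustration: the function name `toy_tf_idf`, the three short "articles" and the particular formulas (TF as word count divided by article length, IDF as the natural log of the number of articles divided by the number of articles containing the word) are assumptions made for this example. Real libraries often use slightly different variants.

```python
import math
from collections import Counter


def toy_tf_idf(documents):
    """Return one {word: score} dict per document (toy TF-IDF variant)."""
    # "Read the articles": lower-case each one and split it into words
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Count how many documents each word appears in (used for IDF)
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            # TF (how often the word occurs here) times IDF (how rare it is overall)
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical mini "articles", invented for this sketch
toy_articles = [
    "the virus spread quickly",
    "the lockdown was long",
    "the virus lockdown",
]
toy_scores = toy_tf_idf(toy_articles)
print(toy_scores[0])
```

A word that appears in every article (here "the") scores zero, while a word unique to one article scores highest. That is the sense in which IDF allocates importance.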

Whatever result the final stage produces is the program's estimate of how similar the two articles are.

This is the essence of a distance matrix - it contains the distances, taken pairwise, between elements in a set.

In this case, each element in our set is one article's collection of TF-IDF values, labelled with the name of the article.

There are many ways to calculate distance, but the simplest and most intuitive is Euclidean distance - *the length of the straight line from A to B*.


#### Euclidean Distance
source: https://sciencing.com/euclidean-distance-7829754.html

This is simple if you have two single points, for example two numbers on a number line.

Subtract one point on the number line from the other; the order of the subtraction doesn't matter. For example, if one number is 8 and the other is -3, subtracting 8 from -3 gives -11.

Calculate the absolute value of the difference. One way to do this is to square the number and then take the square root. For this example, -11 squared equals 121.

Calculate the square root of that number to finish calculating the absolute value. For this example, the square root of 121 is 11. The distance between the two points is 11.
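
The three steps above can be checked in Python. This is a minimal sketch using the same numbers as the worked example; the variable names are chosen only for illustration.

```python
import math

a = 8
b = -3

difference = b - a                # subtract one point from the other
squared = difference ** 2         # squaring removes the sign
distance = math.sqrt(squared)     # the square root gives the absolute difference

print(difference, squared, distance)
```

Swapping `a` and `b` changes the sign of `difference` but leaves `distance` the same, which is why the order of the subtraction doesn't matter.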

--------------------------------------

*...but our data sets contain many TF-IDF values, so how do we measure the distance between those?*

--------------------------------------

The principle is the same: we are still finding the distance between two points, but each point now has one coordinate per variable (word) in the data set. So for each word, take its TF-IDF value in set (article) *a* and subtract it from the same word's TF-IDF value in set (article) *b*.

Square the difference for each pair of variables, sum all the squared differences and take the square root. This is the Euclidean distance between article 1 and article 2.
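
As a rough sketch of that procedure for a single pair of articles, the function below works through it step by step. The name `euclidean_distance` and the two short lists of scores are made up for illustration; real TF-IDF vectors would have one entry per word in the vocabulary.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length lists of numbers."""
    # Squared difference for each pair of corresponding values
    squared_differences = [(b_i - a_i) ** 2 for a_i, b_i in zip(a, b)]
    # Sum the squared differences and take the square root
    return math.sqrt(sum(squared_differences))


# Hypothetical TF-IDF values for the same three words in two articles
article_1 = [0.0, 0.3, 0.4]
article_2 = [0.0, 0.0, 0.0]

distance = euclidean_distance(article_1, article_2)
print(distance)
```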

Repeat the same procedure for each pair of articles and you will be able to produce a square matrix, with the article names running along both the rows and the columns. Each cell holds the distance between the article in its row and the article in its column.
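
To make the "each pair of articles" step concrete, here is a small sketch that fills a square matrix with two nested loops. The array `toy_scores` holds invented values (three articles, three words) and is an assumption of this example rather than real data.

```python
import numpy as np

# Hypothetical TF-IDF values: one row per article, one column per word
toy_scores = np.array([
    [0.0, 0.3, 0.4],
    [0.0, 0.0, 0.0],
    [0.3, 0.0, 0.4],
])

n_articles = len(toy_scores)
matrix = np.zeros((n_articles, n_articles))

for i in range(n_articles):
    for j in range(n_articles):
        # Euclidean distance between article i and article j
        matrix[i, j] = np.sqrt(np.sum((toy_scores[j] - toy_scores[i]) ** 2))

print(matrix)
```

The result is symmetric (the distance from i to j equals the distance from j to i) and has zeros on the diagonal, because every article is at distance zero from itself.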

The Euclidean Distance can be written as:

$$d = \sqrt{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}$$

Where:
* *i* = index of the ith variable (word) in your set of TF-IDF values
* *n* = number of variables (words) in the set
* *a* = article 1's TF-IDF values
* *b* = article 2's TF-IDF values


You can come back to this explanation any time, or look here for some further reading:

* https://en.wikipedia.org/wiki/Distance_matrix
* https://sciencing.com/euclidean-distance-7829754.html
* https://en.wikipedia.org/wiki/Euclidean_distance

Now, let's try calculating a distance matrix in Python.

Full disclosure: I am going to use SciPy's spatial module, which provides functions to run spatial and distance algorithms on data sets. So while we will not be doing the above by hand, it is useful to understand what the result is actually telling you.

https://docs.scipy.org/doc/scipy/reference/spatial.html

## TF-IDF DataFrame

It is important to note that TF-IDF scores are not the only measure of an article's content. But seeing as we have all encountered them before, for the purposes of this exercise we will be comparing the TF-IDF scores of each news article.

This notebook is going to focus solely on implementing the distance matrix, so I will load up my own TF-IDF DataFrame that I created in the 3.14 exercise.

In [1]:
import pandas as pd

# load my pickled file - pickling is a way of serialising a Python object so it can be stored on disk
tfidf = pd.read_pickle('Data/covid-articles-tfidf.pkl')
tfidf

| | spread | important | largely | forward | these | fatally | ones | emotional | started | focusing | ... | mortality | presence | s | tragedy | t | families | Patel | less | against | exercising |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data/Coronavirus lockdown brings anguish for Tashan Daniel's family.txt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.003188 | 0.0 | 0.0 | 0.0 | 0.003188 | ... | 0.0 | 0.003188 | 0.0 | 0.0 | 0.003188 | 0.001188 | 0.0 | 0.0 | 0.0 | 0.003188 |
| Data/Does_20000_hospital_deaths_mean_failure_for_UK.txt | 0.000987 | 0.000987 | 0.000987 | 0.000987 | 0.001973 | 0.0 | 0.000368 | 0.000987 | 0.001973 | 0.0 | ... | 0.000987 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000987 | 0.0 |
| Data/UK_hospital_deaths_pass_20000.txt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000743 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.001993 | 0.0 | 0.001486 | 0.001993 | 0.001993 | 0.0 | 0.0 |


# Distance Matrix

We will use the *pdist()* SciPy function to get the pair-wise distance between the articles in our set.

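In the code cell below, place your cursor inside the pdist() call and press *shift and tab x2* to show the documentation behind the function and to see what arguments you need to pass it. The cell is only a placeholder: if you run it as written you will get a NameError, because *squareform* and *pdist* are not imported until a later cell.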

In [2]:
pd.DataFrame(squareform(pdist(shift-tab-tab-here)))

NameError: name 'squareform' is not defined

By default, *pdist()* uses the Euclidean distance method, but this can be changed by passing the *metric=* keyword argument.
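
As a small illustration of the *metric=* keyword, the sketch below compares the default with `metric='cityblock'` (Manhattan distance), one of the metric names SciPy accepts. The two points are invented so that the results are easy to check by hand.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two hypothetical points, chosen so the arithmetic is easy to follow
toy_points = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
])

euclidean = pdist(toy_points)                      # default metric: straight-line distance
manhattan = pdist(toy_points, metric='cityblock')  # sum of absolute differences

print(euclidean, manhattan)
```

The straight-line distance is 5 (a 3-4-5 triangle), while the city-block distance is 3 + 4 = 7, so the choice of metric changes the numbers in the matrix.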

The pdist function returns a 1-D array of distance measures. To create the standard square matrix, pass the result of pdist directly into the squareform() function.
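
To see what the 1-D array looks like before it is reshaped, here is a minimal sketch with three made-up points on a number line (at 0, 1 and 3). The values are assumptions chosen purely so the distances are obvious.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical one-dimensional points
toy_points = np.array([[0.0], [1.0], [3.0]])

# One distance per pair, in the order (0,1), (0,2), (1,2)
condensed = pdist(toy_points)

# Expand the 1-D array into a symmetric 3x3 matrix with zeros on the diagonal
square = squareform(condensed)

print(condensed)
print(square)
```

The condensed form stores each pair only once, whereas the square form repeats every distance on both sides of the diagonal, which is easier to read as a table.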

Lastly, for ease of display we will put the results of the squareform pdist directly into a Pandas DataFrame.

*run the next two cells*

In [3]:
# import the library
from scipy.spatial.distance import squareform, pdist

In [4]:
# pass the tfidf table into pdist, turn the result into a square-form matrix and make a DataFrame out of it
pd.DataFrame(squareform(pdist(tfidf)))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.0 | 0.056807 | 0.057615 |
| 1 | 0.056807 | 0.0 | 0.037388 |
| 2 | 0.057615 | 0.037388 | 0.0 |


In the above DataFrame, we have the distance measures between each pair of articles in the *tfidf* dataframe.

The articles are labelled '0, 1, 2'. These labels are derived from the index position of each article in the tfidf table, because the array that comes back from squareform() does not carry the article names with it.
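
If you would rather see article names than positions, one option is to pass the original index to the DataFrame constructor as both *index* and *columns*. The sketch below uses a small stand-in table, `toy_tfidf`, with invented file names, words and scores, because it has to run without the pickled file. With the real data you would use *tfidf* in its place.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for the tfidf table: article names as the index
toy_tfidf = pd.DataFrame(
    [[0.0, 0.3, 0.4],
     [0.0, 0.0, 0.0],
     [0.3, 0.0, 0.4]],
    index=["article_a.txt", "article_b.txt", "article_c.txt"],
    columns=["virus", "lockdown", "hospital"],
)

# Reuse the article names for both the rows and the columns of the result
labelled = pd.DataFrame(
    squareform(pdist(toy_tfidf)),
    index=toy_tfidf.index,
    columns=toy_tfidf.index,
)
print(labelled)
```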

#### Interpreting the square-matrix
The higher the number, the further apart the two articles are, and thus the less similar they are.

Say you wanted to know how similar article 0 is to article 2. You first find 0 in the column headers, then move down that column to the row labelled 2. The number shown in that cell (0.057615) is the distance between article 0 and article 2.
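
The same lookup can be done in code with *.loc*, giving the row label first and the column label second. This sketch rebuilds the matrix from the values printed in the output above rather than recomputing it, and the variable names are only illustrative.

```python
import pandas as pd

# Distances copied from the output shown earlier
distances = pd.DataFrame([
    [0.0,      0.056807, 0.057615],
    [0.056807, 0.0,      0.037388],
    [0.057615, 0.037388, 0.0],
])

# Column 0, row 2: .loc takes the row label first, then the column label
article_0_to_2 = distances.loc[2, 0]

# The matrix is symmetric, so column 2, row 0 holds the same value
article_2_to_0 = distances.loc[0, 2]

print(article_0_to_2, article_2_to_0)
```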

Using your knowledge of distance matrices, answer the question below:

Which pair of articles are most alike?

#### End