# Computing Similarities Across Large Document Datasets

## Important question: Which of our newsgroup posts is most similar to the first one?

We can obtain the answer by computing all the cosine similarities between **tfidf_np_matrix** and **tfidf_np_matrix[0]**.

In [1]:
import numpy as np

In [2]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(remove=('headers','footers'))

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)

### Get first post

In [4]:
tfidf_np_matrix = tfidf_matrix.toarray()
tfidf_vector = tfidf_np_matrix[0]

### Compute similarities

As discussed in Section 13 of Data Science Bookcamp, we can obtain the cosine similarities of normalized vectors with a dot product.

In [5]:
cosine_similarities = tfidf_np_matrix @ tfidf_vector
print(cosine_similarities)
print(len(cosine_similarities))

[1.         0.00834093 0.04448717 ... 0.         0.00270615 0.01968562]
11314


The output is a vector of cosine similarities. Each index i of the vector holds the cosine similarity between newsgroups.data[0] and newsgroups.data[i]. From the print-out, we can see that cosine_similarities[0] is equal to 1.0. This is not surprising, since newsgroups.data[0] has a perfect similarity with itself.
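The dot product only equals the cosine similarity when every row has unit length. A minimal sketch on a hypothetical three-post toy corpus (the `toy_posts` strings are made up for illustration) checks that `TfidfVectorizer`, with its default `norm='l2'` setting, returns unit-length rows, so the plain dot product matches the explicit cosine formula:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the normalization.
toy_posts = [
    "I saw a strange sports car yesterday",
    "That sports car is a Bricklin",
    "How do I place a window on the screen",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_posts).toarray()

# Each TF-IDF row is L2-normalized by default, so its magnitude is 1.
row_norms = np.linalg.norm(toy_matrix, axis=1)

# Dot product of every row with the first row.
dot_products = toy_matrix @ toy_matrix[0]

# Explicit cosine formula: dot(a, b) / (|a| * |b|).
explicit_cosines = dot_products / (row_norms * row_norms[0])

print(row_norms)
print(np.allclose(dot_products, explicit_cosines))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and the division by the magnitudes would be required.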

## What is the next-highest similarity in the vector?

In [18]:
most_similar_index = np.argsort(cosine_similarities)[-2]
print(f"Most similar index is...{most_similar_index}")

Most similar index is...958
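The `[-2]` index works because `np.argsort` returns indices in ascending order of value, so the last position holds the post itself (similarity 1.0) and the second-to-last holds its closest neighbor. A small sketch with made-up similarity values shows the ordering:

```python
import numpy as np

# Hypothetical similarities between post 0 and five posts (index 0 is the post itself).
similarities = np.array([1.0, 0.10, 0.64, 0.00, 0.31])

# argsort orders indices from the lowest similarity to the highest.
sorted_indices = np.argsort(similarities)
print(sorted_indices)

# The last entry is the post itself, so the best match is second-to-last.
best_match = sorted_indices[-2]
print(best_match)  # index 2 (similarity 0.64)

# Reversing the order and skipping the first entry gives the top three neighbors.
top_three = sorted_indices[::-1][1:4]
print(top_three)
```

This assumes no other post is an exact duplicate of the query. With exact duplicates, several entries would tie at 1.0 and the post itself might not land in the last position.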


In [20]:
#Get similarity level by index
similarity = cosine_similarities[most_similar_index]
print(f"Highest similarity indicator is ...{similarity}")

Highest similarity indicator is ...0.6410493167298943


In [23]:
# Get most similar post 
most_similar_post = newsgroups.data[most_similar_index]
print(f"The following post has a cosine similarity of {similarity:.2f} "
       "with newsgroups.data[0]:\n")
print(most_similar_post)


The following post has a cosine similarity of 0.64 with newsgroups.data[0]:

In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my  
thing) writes:
> 
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In  
addition,
> the front bumper was separate from the rest of the body. This is 
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather  
odd looking with the encased front bumper. There aren't a lot of them around,  
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a  
performance Ford with new styling slapped on top.

>    ---- brought to you by your neig

**The printed text is a reply to the car post at index 0!** The reply includes the original post, which is a question about a certain car brand.

Their cosine similarity is 0.64. This does not seem like a large number. However, within extensive text collections, a cosine similarity that’s greater than 0.6 is a good indicator of overlapping content.

## Note

As discussed in Section 13, the cosine similarity can easily be converted into the Tanimoto similarity, which has a deeper theoretical basis for text overlap. We can convert cosine_similarities into Tanimoto similarities by running cosine_similarities / (2 - cosine_similarities).

In [25]:
def tanimoto_similarities(documents,text):
    cosine_similarity = documents @ text
    tanimoto_similarity = cosine_similarity / (2 - cosine_similarity)
    return tanimoto_similarity

In [28]:
np.argsort(tanimoto_similarities(tfidf_np_matrix,tfidf_np_matrix[0]))[-2]

958

However, that conversion will not change our final results. Choosing the top index of the Tanimoto array still returns the same reply. Thus, for simplicity’s sake, we will focus on the cosine similarity during our next few text-comparison examples.
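The ranking is preserved because c / (2 - c) increases strictly as c increases whenever c is below 2, which always holds for a cosine similarity. A short sketch with made-up cosine values illustrates this:

```python
import numpy as np

# Hypothetical cosine similarities, loosely modeled on the values printed earlier.
cosines = np.array([1.0, 0.008, 0.044, 0.641, 0.020])

# Conversion used in this section: cosine / (2 - cosine).
tanimotos = cosines / (2 - cosines)
print(tanimotos)

# The values shrink (except at 1.0), but their order does not change.
same_ranking = np.array_equal(
    np.argsort(cosines, kind='stable'),
    np.argsort(tanimotos, kind='stable'),
)
print(same_ranking)
```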

## Exercise

1. Pick a post at random and then choose its most similar neighbor.
2. Output both posts along with their cosine similarity.

To make this exercise more interesting, we'll first compute a matrix of all-by-all cosine similarities. We'll then leverage the matrix to select our random pair of similar posts.

## How do we compute the matrix of all-by-all cosine similarities?

The naive solution is to multiply **tfidf_np_matrix** by its transpose. However, for reasons discussed in Section 13, this **matrix multiplication is not computationally efficient**.

Our TF-IDF matrix has over 100,000 columns. **We need to reduce the matrix size prior to executing the multiplication**.
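A rough back-of-the-envelope estimate makes the cost concrete. The sketch below uses the shapes printed in this section (11314 posts, 114441 columns) and assumes dense 8-byte floats and a reduction to 100 columns. It counts scalar multiplications for a dense product and ignores sparsity, so treat the numbers as an order-of-magnitude guide only:

```python
# Shapes taken from the outputs printed in this section.
num_posts = 11314
num_terms = 114441
num_components = 100   # assumed size of the reduced matrix
bytes_per_float = 8    # assumes float64 values

# Memory needed to hold the dense TF-IDF array.
dense_tfidf_gb = num_posts * num_terms * bytes_per_float / 1e9

# Scalar multiplications in an all-by-all product: one per (row, row, column) triple.
full_multiplications = num_posts ** 2 * num_terms
reduced_multiplications = num_posts ** 2 * num_components
speedup = full_multiplications / reduced_multiplications

print(f"Dense TF-IDF array: about {dense_tfidf_gb:.1f} GB")
print(f"Full product: {full_multiplications:.2e} multiplications")
print(f"Reduced product: {reduced_multiplications:.2e} multiplications")
print(f"Reduction factor: about {speedup:.0f}x")
```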


### TruncatedSVD 

In the previous section, we learned how to reduce the column count using scikit-learn's **TruncatedSVD** class.

The class is able to shrink a matrix down to a specified number of columns. The reduced column count is determined by the n_components parameter.

According to scikit-learn's documentation, an n_components **value of 100 is recommended for processing text data**.

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

In [29]:
# Seed NumPy's pseudo-random number generator so the SVD output is reproducible
np.random.seed(0)

from sklearn.decomposition import TruncatedSVD

shrunk_matrix = TruncatedSVD(n_components=100).fit_transform(tfidf_matrix)
print(f"We've dimensionally-reduced a {tfidf_matrix.shape[1]}-column "
      f"{type(tfidf_matrix)} matrix.")
print(f"Our output is a {shrunk_matrix.shape[1]}-column "
      f"{type(shrunk_matrix)} matrix.")

We've dimensionally-reduced a 114441-column <class 'scipy.sparse.csr.csr_matrix'> matrix.
Our output is a 100-column <class 'numpy.ndarray'> matrix.
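`TruncatedSVD` uses a randomized solver by default, which is why the cell seeds NumPy beforehand. As an alternative, the class also accepts a `random_state` argument. The sketch below, run on a small made-up sparse matrix rather than the newsgroup data, checks that two runs with the same `random_state` agree:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse toy matrix: 50 rows, 200 columns, roughly 5% non-zero.
rng = np.random.default_rng(0)
dense = rng.random((50, 200)) * (rng.random((50, 200)) < 0.05)
toy_matrix = csr_matrix(dense)

# Passing random_state ties the seed to this call instead of NumPy's global state.
first_run = TruncatedSVD(n_components=10, random_state=0).fit_transform(toy_matrix)
second_run = TruncatedSVD(n_components=10, random_state=0).fit_transform(toy_matrix)

print(first_run.shape)
print(np.allclose(first_run, second_run))
```

Either approach gives reproducible results. The explicit argument has the advantage of not depending on other code that draws from NumPy's global generator.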


Our shrunk matrix contains just 100 columns. We can now efficiently compute cosine similarities by running **shrunk_matrix @ shrunk_matrix.T**.

**However, first we’ll need to confirm that the matrix rows remain normalized.** Let's check the magnitude of shrunk_matrix[0].

In [36]:
magnitude = np.linalg.norm(shrunk_matrix[0])
print(f"The magnitude of the first row is {magnitude:.2f}")

The magnitude of the first row is 0.49


The magnitude of the row is less than 1. Scikit-Learn’s SVD output has not been automatically normalized. We’ll need to manually normalize the matrix, prior to computing the similarities.

In [38]:
from sklearn.preprocessing import normalize
shrunk_norm_matrix = normalize(shrunk_matrix)
magnitude = np.linalg.norm(shrunk_norm_matrix[0])
print(f"The magnitude of the first row is {magnitude:.2f}")

The magnitude of the first row is 1.00


The shrunken matrix has been normalized. Now, running shrunk_norm_matrix @ shrunk_norm_matrix.T should produce a matrix of all-by-all cosine similarities.
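By default, `normalize` rescales each row to unit L2 length. The same operation can be sketched by hand with NumPy on a made-up matrix, which makes the step easy to verify:

```python
import numpy as np

# Hypothetical toy matrix standing in for the SVD output.
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(5, 3))

# Magnitude of every row, kept as a column so it broadcasts across each row.
row_norms = np.linalg.norm(toy_matrix, axis=1, keepdims=True)

# Dividing each row by its own magnitude yields unit-length rows.
toy_norm_matrix = toy_matrix / row_norms

print(np.linalg.norm(toy_norm_matrix, axis=1))
```

This manual version assumes that no row is all zeros. An all-zero row would cause a division by zero and needs special handling.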

In [41]:
cosine_similarity_matrix = shrunk_norm_matrix @ shrunk_norm_matrix.T
cosine_similarity_matrix

array([[ 1.00000000e+00,  4.20207330e-02,  2.09103735e-01, ...,
        -3.48442355e-02, -6.79152644e-04,  1.95435814e-01],
       [ 4.20207330e-02,  1.00000000e+00,  3.01777775e-01, ...,
         5.44917724e-01,  4.54122181e-02,  1.67038105e-01],
       [ 2.09103735e-01,  3.01777775e-01,  1.00000000e+00, ...,
         2.40456903e-01,  8.10512396e-02,  1.07226785e-01],
       ...,
       [-3.48442355e-02,  5.44917724e-01,  2.40456903e-01, ...,
         1.00000000e+00,  1.92546180e-02,  5.53444235e-02],
       [-6.79152644e-04,  4.54122181e-02,  8.10512396e-02, ...,
         1.92546180e-02,  1.00000000e+00,  8.38745651e-02],
       [ 1.95435814e-01,  1.67038105e-01,  1.07226785e-01, ...,
         5.53444235e-02,  8.38745651e-02,  1.00000000e+00]])
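Several entries in the matrix are negative, which never happens with raw TF-IDF vectors because their coordinates are all non-negative. SVD output can contain negative coordinates, so cosine similarities range from -1 to 1. A tiny sketch with three hand-picked vectors shows the properties we expect from any all-by-all cosine matrix:

```python
import numpy as np

# Hypothetical 2D vectors with mixed signs, similar in spirit to SVD output.
vectors = np.array([[3.0, 4.0], [-3.0, 4.0], [4.0, -3.0]])
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

similarity_matrix = unit_vectors @ unit_vectors.T
print(similarity_matrix)

# The matrix is symmetric, has ones on the diagonal, and stays within [-1, 1].
print(np.allclose(similarity_matrix, similarity_matrix.T))
print(np.allclose(np.diag(similarity_matrix), 1.0))
print(similarity_matrix.min() >= -1.0, similarity_matrix.max() <= 1.0 + 1e-12)
```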

### We have our similarity matrix!

Let's leverage it to choose a random pair of very similar texts. We’ll start by randomly selecting a post at some index1. Next, we’ll select the index in cosine_similarity_matrix[index1] that has the second-highest cosine similarity. Then we’ll print both indices and their similarity prior to displaying the texts.

In [90]:
np.random.seed(4)
# Number of posts; needed to draw a valid random row index
dataset_size = len(newsgroups.data)
index1 = np.random.randint(dataset_size)
index2 = np.argsort(cosine_similarity_matrix[index1])[-2]
similarity = cosine_similarity_matrix[index1][index2]
print(f"The posts at indices {index1} and {index2} share a cosine "
      f"similarity of {similarity:.2f}")

The posts at indices 1146 and 5558 share a cosine similarity of 0.98
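The same lookup can be done for every post at once. One possible approach, sketched here on made-up unit vectors rather than the newsgroup matrix, is to overwrite the diagonal with negative infinity so that a post can never be selected as its own neighbor, and then take the row-wise maximum:

```python
import numpy as np

# Hypothetical unit-length vectors standing in for the normalized SVD rows.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 4))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

similarity_matrix = vectors @ vectors.T

# Work on a copy so the original similarities stay intact.
masked = similarity_matrix.copy()
np.fill_diagonal(masked, -np.inf)

# For each row, the column with the highest remaining similarity.
nearest_neighbors = masked.argmax(axis=1)
print(nearest_neighbors)
```

Unlike the `[-2]` trick, this variant does not rely on the post itself being ranked last by `argsort`, so it also behaves sensibly when exact duplicates are present.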


In [91]:
print(newsgroups.data[index2].replace('\n\n', '\n'))

Usually when I start up an application, I first get the window outline
on my display. I then have to click on the mouse button to actually
place the window on the screen. Yet when I specify the -geometry 
option the window appears right away, the properties specified by
the -geometry argument. The question now is:
How can I override the intermediary step of the user having to specify
window position with a mouseclick? I've tried explicitly setting window
size and position, but that did alter the normal program behaviour.
Thanks for any hints
---> Robert
PS: I'm working in plain X.



In [92]:
print(newsgroups.data[index1].replace('\n\n', '\n'))

I posted this about tow weeks ago but never saw it make it (Then again
I've had some problems with the mail system). Apologies if this appears
for the second time:
Usually when I start up an application, I first get the window outline
on my display. I then have to click on the mouse button to actually
place the window on the screen. Yet when I specify the -geometry 
option the window appears right away, the properties specified by
the -geometry argument. The question now is:
How can I override the intermediary step of the user having to specify
window position with a mouseclick? I've tried explicitly setting window
size and position, but that did alter the normal program behaviour.
Thanks for any hints
---> Robert
PS: I'm working in plain X, using tvtwm.

******************************************************************************
* Robert Gasch        * Der erste Mai ist der Tag an dem die Stadt ins      *
* Oracle Engineering   * Freihe tritt und den staatlichen Monopolanspruch    

Thus far, we have examined two pairs of similar posts. Each post-pair was composed of a question and a reply, where the question was included in the reply. Such boring pairs of overlapping texts are trivial to extract. Let's challenge ourselves to find something more interesting.

We’ll search for clusters of similar texts, where posts within a cluster share some text without perfectly overlapping.