# NLP: Algorithms to get Document Similarity

### Have you ever read a book and found it similar to another book you had read before? I have. Practically all the self-help books I read are similar to Napoleon Hill’s books.
### So I wondered if Natural Language Processing (NLP) could mimic this human ability and find the similarity between documents.

## Similarity Problem

### To find the similarity between texts you first need to define two aspects:
### 1. The similarity method that will be used to calculate the similarities between the embeddings.
### 2. The algorithm that will be used to transform the text into an embedding, which is a way to represent the text as a vector in a vector space.

## Similarity Methods

### Cosine Similarity: measures the cosine of the angle between two embeddings. When the embeddings point in the same direction, the angle between them is zero and their cosine similarity is 1. When the embeddings are orthogonal, the angle between them is 90 degrees and the cosine similarity is 0. Finally, when they point in opposite directions, the angle between them is 180 degrees and the cosine similarity is -1.
### Cosine similarity therefore ranges from -1 to 1, and the closer it is to 1, the more similar the embeddings are.

![image.png](attachment:image.png)
***Image from https://datascience-enthusiast.com/DL/Operations_on_word_vectors.html showing one situation where the cosine similarity is 1 because France and Italy are related, and another where the similarity is 0 because ball is not similar to crocodile***

### Mathematically, you can calculate the cosine similarity by taking the dot product of the two embeddings and dividing it by the product of their norms, as you can see in the image below.
![image-2.png](attachment:image-2.png)
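
### In symbols, the formula above is $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$. Below is a minimal plain-Python sketch of it, using only the standard library. The helper name cosine_similarity_manual and the sample vectors are made up for illustration; they are not part of sklearn. The three calls reproduce the three reference angles described earlier (0, 90 and 180 degrees).

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


# Same direction (angle 0): the second vector is the first one scaled by 2
same_direction = cosine_similarity_manual([1.0, 2.0], [2.0, 4.0])

# Orthogonal (angle 90 degrees): the dot product is zero
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 3.0])

# Opposite direction (angle 180 degrees): the second vector is the first one negated
opposite = cosine_similarity_manual([1.0, 2.0], [-1.0, -2.0])

# Rounded because floating-point square roots can be off in the last digits
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

### Note that the length of the vectors does not matter here, only their direction: scaling a vector leaves its cosine similarity with any other vector unchanged.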

### In Python, you can use the cosine_similarity function from the sklearn package to calculate the similarity for you.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# convert the text data to TF-IDF feature vectors (numeric values)
vectorizer = TfidfVectorizer()

# fit_transform accepts a list (or any iterable) of documents, one string per document
# another pair to try: "there are a lot of flowers in the wood", "this wood is so amazing and it has a lot of flowers"
feature_vectors = vectorizer.fit_transform(["action film", "movie film"])

# compute the pairwise cosine similarity matrix (one row and one column per document)
similarity = cosine_similarity(feature_vectors)

# similarity between the first and the second document
print(similarity[0][1])
```

0.33609692727625756
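
### To see where the 0.336 comes from, here is a rough plain-Python sketch that rebuilds the TF-IDF vectors by hand. It assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF, L2 normalization) and replaces its tokenizer with a simple lowercase split, which is a simplification that happens to give the same tokens for these two short texts. The helper names idf and tfidf_vector are made up for illustration.

```python
import math

docs = ["action film", "movie film"]

# Simplified tokenizer (assumption): lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(docs)


def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    document_frequency = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + document_frequency)) + 1


def tfidf_vector(doc):
    # Raw term count times IDF, then scaled to unit (L2) length
    weights = [doc.count(term) * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


vec_a, vec_b = (tfidf_vector(doc) for doc in tokenized)

# Both vectors have length 1, so their dot product is the cosine similarity
similarity_by_hand = sum(a * b for a, b in zip(vec_a, vec_b))

print(vocabulary)  # ['action', 'film', 'movie']
print(similarity_by_hand)
```

### The only word the two texts share is "film", and because it appears in every document it gets the lowest weight, which is why the similarity is well below 1 even though half of each text matches.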


### Euclidean Distance: is probably one of the best-known formulas for computing the distance between two points, and it applies the Pythagorean theorem. To get it, you subtract the two vectors component by component, square each difference, add the squares up, and take the square root of the sum. Did it seem complex? Don’t worry, the image below makes it easier to understand.

![image.png](attachment:image.png)
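
### The same steps (subtract, square, sum, square root) can be sketched in a few lines of plain Python. The helper name euclidean_distance_manual and the two sample points are made up for illustration; they form a 3-4-5 right triangle, so the result is easy to check by hand.

```python
import math


def euclidean_distance_manual(a, b):
    """Straight-line distance between two equal-length vectors (illustrative helper)."""
    squared_differences = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared_differences))


point_p = [1.0, 2.0]
point_q = [4.0, 6.0]

# Differences are 3 and 4, so the distance is sqrt(9 + 16)
distance_pq = euclidean_distance_manual(point_p, point_q)
print(distance_pq)  # 5.0
```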

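### In Python, you can use the euclidean_distances function, also from the sklearn package, to calculate it. Unlike cosine similarity, this is a distance: the smaller the value, the more similar the texts.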

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# convert the text data to TF-IDF feature vectors (numeric values)
vectorizer = TfidfVectorizer()

# fit_transform accepts a list (or any iterable) of documents, one string per document
feature_vectors = vectorizer.fit_transform(["action film", "action movie"])

# compute the pairwise Euclidean distance matrix (one row and one column per document)
distances = euclidean_distances(feature_vectors)

# distance between the first and the second document
print(distances[0][1])
```

1.1523047103294706
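
### A side note on how the two metrics relate: TfidfVectorizer scales every vector to unit length by default (norm='l2'), and for unit-length vectors the Euclidean distance is fully determined by the cosine similarity through distance = sqrt(2 - 2 * cosine). The plain-Python sketch below checks this identity on two made-up vectors; the helper name normalize is hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit (L2) length (illustrative helper)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


unit_a = normalize([1.0, 2.0, 2.0])
unit_b = normalize([2.0, 0.0, 1.0])

# For unit vectors the dot product is the cosine similarity
cosine = sum(x * y for x, y in zip(unit_a, unit_b))

# Euclidean distance computed directly from the components
distance_direct = math.sqrt(sum((x - y) ** 2 for x, y in zip(unit_a, unit_b)))

# Euclidean distance derived from the cosine similarity
distance_from_cosine = math.sqrt(2 - 2 * cosine)

print(distance_direct, distance_from_cosine)
```

### So on normalized TF-IDF vectors both metrics rank document pairs in the same order; they only differ in scale and in direction (higher cosine means closer, higher distance means farther).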


### There are also other metrics, such as Jaccard, Manhattan, and Minkowski distance, that you can use.

https://towardsdatascience.com/similarity-metrics-in-nlp-acc0777e234c
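
### To give a feel for these three metrics, here is a small plain-Python sketch. The function names are made up for illustration, and the Jaccard version compares sets of lowercase, whitespace-separated words, which is only one of several possible ways to apply it to text.

```python
def jaccard_similarity(text_a, text_b):
    """Number of shared words divided by the number of distinct words in both texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def minkowski_distance(a, b, p):
    """Generalized distance: p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan_distance(a, b):
    """Sum of the absolute differences between the components."""
    return minkowski_distance(a, b, p=1)


# One shared word ("film") out of three distinct words
jaccard = jaccard_similarity("action film", "movie film")

# Component differences are 3 and 4
manhattan = manhattan_distance([1.0, 2.0], [4.0, 6.0])
minkowski_p2 = minkowski_distance([1.0, 2.0], [4.0, 6.0], p=2)

print(jaccard, manhattan, minkowski_p2)
```

### Jaccard works directly on the words, so it needs no embedding at all, while Manhattan and Minkowski operate on vectors just like the Euclidean distance.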