**OBJECTIVE**
- Understand what soft cosine similarity actually computes by implementing it from scratch
- Use a simple example to compare cosine similarity with soft cosine similarity

In [1]:
import numpy as np

**Soft Cosine Similarity**

![title](images/scs.png)
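
In case the image does not render, the quantity computed by the loop implementation in this notebook can be written out as below. This is a transcription of that code rather than a quotation of the image; here $s_{ij}$ is entry $(i, j)$ of the word-similarity matrix `S` and $N$ is the vocabulary size.

```latex
\[
\operatorname{soft\_cosine}(a, b) =
\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}\, a_i\, b_j}
     {\sqrt{\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}\, a_i\, a_j}\;
      \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{N} s_{ij}\, b_i\, b_j}}
\]
```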

In [2]:
def softCosineSimilarity(a: np.ndarray, b: np.ndarray, S: np.ndarray) -> float:
    """
    Soft Cosine Similarity (SCS) implemented with explicit for loops for easier understanding.

    Computes SCS between the bag-of-words term-frequency vectors a and b, using a word-similarity
    matrix S of shape NxN, where N is the number of unique words in the corpus containing a and b.

    Parameters:
    a (np.ndarray): bag-of-words term-frequency vector of sentence a
    b (np.ndarray): bag-of-words term-frequency vector of sentence b
    S (np.ndarray): word-similarity matrix

    Returns:
    float: soft cosine similarity between a and b, rounded to 2 decimal places.
    """
    numerator = 0.0
    denominator_a = 0.0
    denominator_b = 0.0

    # accumulate all three similarity-weighted sums over every word pair (i, j)
    for i in range(len(a)):
        for j in range(len(a)):
            numerator += S[i][j] * a[i] * b[j]
            denominator_a += S[i][j] * a[i] * a[j]
            denominator_b += S[i][j] * b[i] * b[j]

    # the denominator is the product of the square roots of the two self-similarity sums
    denominator = np.sqrt(denominator_a) * np.sqrt(denominator_b)
    # the result is rounded to 2 decimal places for display
    return round(numerator / denominator, 2)

**Soft Cosine Similarity (Matrix Form)**

![title](images/scs_matrix.png)
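
In case the image does not render, the matrix form can be written as below. This is a transcription of the matrix-multiplication code in this notebook, treating $a$ and $b$ as column vectors. One consequence worth noting: if $S$ is the identity matrix, the expression reduces to ordinary cosine similarity.

```latex
\[
\operatorname{soft\_cosine}(a, b) =
\frac{a^{\top} S\, b}{\sqrt{a^{\top} S\, a}\;\sqrt{b^{\top} S\, b}}
\]
```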

In [3]:
def softCosineSimilarityFast(a: np.ndarray, b: np.ndarray, S: np.ndarray) -> float:
    """
    Soft Cosine Similarity (SCS) implemented via matrix multiplication.

    Computes SCS between the bag-of-words term-frequency vectors a and b, using a word-similarity
    matrix S of shape NxN, where N is the number of unique words in the corpus containing a and b.

    Parameters:
    a (np.ndarray): bag-of-words term-frequency vector of sentence a
    b (np.ndarray): bag-of-words term-frequency vector of sentence b
    S (np.ndarray): word-similarity matrix

    Returns:
    float: soft cosine similarity between a and b, rounded to 2 decimal places.
    """
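    # a and b are 1-D arrays, so a @ S @ b already evaluates a^T S b (.T is a no-op on 1-D arrays)
    numerator = a @ S @ b
    denominator_a = np.sqrt(a @ S @ a)
    denominator_b = np.sqrt(b @ S @ b)
    denominator = denominator_a * denominator_b
    return round(numerator / denominator, 2)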
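
A quick sanity check on the matrix form: with the identity matrix as `S`, every word is similar only to itself, so soft cosine similarity should collapse to ordinary cosine similarity. The sketch below assumes a small unrounded helper named `soft_cosine` and two made-up vectors; both are illustrative and not part of the notebook's functions.

```python
import numpy as np

def soft_cosine(a, b, S):
    # hypothetical helper: the matrix formula, without rounding
    return (a @ S @ b) / (np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b))

# made-up term-frequency vectors, for illustration only
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

identity = np.eye(len(a))  # no cross-word similarity at all
plain_cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
soft_with_identity = soft_cosine(a, b, identity)

print(np.isclose(plain_cosine, soft_with_identity))  # prints True
```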

In [4]:
sentence_a = ['play','video', 'game'] # unique word representation of sentence a after preprocessing
sentence_b = ['best','player'] # unique word representation of sentence b after preprocessing
unique_words = ['play','video','game','best','player'] # all unique words in corpus of a & b

bow_a = np.array([1, 1, 1, 0, 0], float) # term-frequency vector: bow_a[i] = number of times unique_words[i] occurs in sentence a
bow_b = np.array([0, 0, 0, 1, 1], float) # term-frequency vector: bow_b[i] = number of times unique_words[i] occurs in sentence b

# wordSimilarityMatrix[i, j] = similarityFunction(unique_words[i], unique_words[j])
# similarityFunction can be any function; here it is assumed to be the dot product between vector i and vector j.
# vector i and vector j are assumed to be loaded from a word-embedding model such as GloVe.
# A dot product is symmetric, so the matrix must equal its transpose.
wordSimilarityMatrix = np.array([[1,0.6,0.65,0.2,0.8],
                                 [0.6,1,0.75,0.1,0.6],
                                 [0.65,0.75,1,0.2,0.6],
                                 [0.2,0.1,0.2,1,0.5],
                                 [0.8,0.6,0.6,0.5,1]])

print('Similarity Matrix Shape: {}'.format(wordSimilarityMatrix.shape))

Similarity Matrix Shape: (5, 5)
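
The matrix above is typed in by hand. As a rough sketch of how such a matrix might be built from embeddings, the snippet below uses three invented 3-dimensional vectors (not real GloVe values). One assumption added here: the vectors are scaled to unit length before taking dot products, so that every word has similarity 1 with itself, as in the hand-written matrix.

```python
import numpy as np

# hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    'play':   np.array([0.9, 0.1, 0.3]),
    'player': np.array([0.8, 0.3, 0.2]),
    'best':   np.array([0.1, 0.9, 0.2]),
}
words = list(toy_embeddings)

# stack into a (num_words, dim) matrix and scale each row to unit length
E = np.stack([toy_embeddings[w] for w in words])
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)

# entry (i, j) is the dot product of unit vectors i and j, i.e. their cosine
S_toy = E_unit @ E_unit.T
print(S_toy.shape)
```

Built this way, the matrix is symmetric with ones on the diagonal by construction.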


In [5]:
%timeit softCosineSimilarity(bow_a, bow_b, wordSimilarityMatrix)

36.4 µs ± 462 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [6]:
%timeit softCosineSimilarityFast(bow_a, bow_b, wordSimilarityMatrix)

11.4 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
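
The timings above are for a 5-word vocabulary. To see how the two approaches behave on a larger one, a sketch along the following lines could be used. Everything in it is an assumption for illustration: the helper names `scs_loops` and `scs_matrix`, the vocabulary size of 200, and the random data. No timing figures are claimed here; they depend on the machine.

```python
import timeit
import numpy as np

def scs_loops(a, b, S):
    # hypothetical unrounded copy of the double-loop formula
    num = den_a = den_b = 0.0
    for i in range(len(a)):
        for j in range(len(a)):
            num += S[i][j] * a[i] * b[j]
            den_a += S[i][j] * a[i] * a[j]
            den_b += S[i][j] * b[i] * b[j]
    return num / (np.sqrt(den_a) * np.sqrt(den_b))

def scs_matrix(a, b, S):
    # hypothetical unrounded copy of the matrix formula
    return (a @ S @ b) / (np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b))

rng = np.random.default_rng(0)
N = 200  # assumed vocabulary size
a_big = rng.integers(0, 3, size=N).astype(float)  # random term counts 0..2
b_big = rng.integers(0, 3, size=N).astype(float)
R = rng.random((N, N))
S_big = (R + R.T) / 2          # symmetric random similarities in [0, 1)
np.fill_diagonal(S_big, 1.0)   # each word is fully similar to itself

loops_value = scs_loops(a_big, b_big, S_big)
matrix_value = scs_matrix(a_big, b_big, S_big)

# time three calls of each version
t_loops = timeit.timeit(lambda: scs_loops(a_big, b_big, S_big), number=3)
t_matrix = timeit.timeit(lambda: scs_matrix(a_big, b_big, S_big), number=3)
print('loops: {:.4f}s, matrix: {:.6f}s'.format(t_loops, t_matrix))
```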


In [7]:
cosineSimilarity = (bow_a@bow_b)/(np.sqrt(bow_a@bow_a)*np.sqrt(bow_b@bow_b))
# store the results under new names so the functions are not overwritten by their return values
scs = softCosineSimilarity(bow_a, bow_b, wordSimilarityMatrix)
scsFast = softCosineSimilarityFast(bow_a, bow_b, wordSimilarityMatrix)

In [8]:
print('Cosine Similarity: {}'.format(cosineSimilarity))
print('Soft Cosine Similarity: {}'.format(scs))
print('Soft Cosine Similarity Fast: {}'.format(scsFast))

Cosine Similarity: 0.0
Soft Cosine Similarity: 0.55
Soft Cosine Similarity Fast: 0.55
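
The numbers above can be checked by hand. Plain cosine similarity is 0 because the two sentences share no words, so $a^{\top} b = 0$. For the soft version, only the similarity entries linking the words of sentence a (play, video, game) to the words of sentence b (best, player) enter the numerator:

```latex
\[
\begin{aligned}
a^{\top} S\, b &= 0.2 + 0.8 + 0.1 + 0.6 + 0.2 + 0.6 = 2.5 \\
a^{\top} S\, a &= 3 + 2\,(0.6 + 0.65 + 0.75) = 7 \\
b^{\top} S\, b &= 2 + 2\,(0.5) = 3 \\
\operatorname{soft\_cosine}(a, b) &= \frac{2.5}{\sqrt{7}\,\sqrt{3}} = \frac{2.5}{\sqrt{21}} \approx 0.5455 \approx 0.55
\end{aligned}
\]
```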
