# Semantic Similarity

This notebook follows a pipeline that extracts relevant information from a data dump (generated by GPT) by grouping semantically similar items.

The **Pipeline** is as follows:
* Read the Data
* Extract the Topics (we work with topics because they are easier to prototype with; full sentences can also be used)
* Pre-Process the Data
* Generate Semantic Embeddings using a SentenceTransformer Model
* Perform K-Means Clustering

The last step, *Perform K-Means Clustering*, itself has a few parts:
* Optimize K for K-Means (this can be hard-coded or, as in this notebook, estimated from inertia)
* If preferred, apply Dimensionality Reduction using t-SNE or PCA (in practice this has largely defeated the point of using semantic embeddings)
* Retrieve Cluster Assignments in the form of indexes
* Use the indexes to build the actual clusters as a list of lists of strings (lexical cluster refinement is also applied at this step)
* Print the clusters in a readable manner, along with how much data was filtered out at each stage

In [1]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import pylcs

In [2]:
# Global variables for the pipeline

red1 = None
red2 = None

In [3]:
# Read Dataset

with open('../data/I2_1_1000.txt','r') as txtfile:
    biz_ideas = [line.rstrip('\n') for line in txtfile]

In [4]:
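def topic(s1:str):
    
    # Returns the topic: the text before the first period in the output.
    # partition() returns the whole string when there is no period, whereas
    # slicing with find() would drop the last character (find() returns -1).
    
    return s1.partition('.')[0]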

In [5]:
def preprocess(data:list):
    
    # Applies three preprocessing steps: lower-cases every item, keeps only short strings (under 25 characters), and eliminates duplicates
    
    len1 = len(data)
    data = [datum.lower() for datum in data]
    data = [datum for datum in data if len(datum) < 25]
    data = list(set(data))  # note: set() does not preserve the original order
    
    # Record the percentage of items removed (0 for an empty input)
    global red1
    red1 = (len1 - len(data))/len1 * 100 if len1 else 0.0
    
    return data
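
Because `set()` does not preserve insertion order, the order of `topics` (and so the first five topics printed further down) can differ between runs. A minimal sketch of an order-preserving variant, assuming reproducible ordering is wanted; `preprocess_ordered` and `max_len` are hypothetical names that do not appear in the notebook:

```python
def preprocess_ordered(data, max_len=25):
    # Hypothetical variant of preprocess(): same three steps, but duplicates are
    # removed with dict.fromkeys(), which keeps the first occurrence of each item
    # in its original position (dicts preserve insertion order in Python 3.7+).
    lowered = [datum.lower() for datum in data]
    short = [datum for datum in lowered if len(datum) < max_len]
    return list(dict.fromkeys(short))


sample = ['Dog Day Care', 'laundries', 'dog day care',
          'A very long topic that exceeds the length limit']
print(preprocess_ordered(sample))
```

This sketch leaves out the `red1` bookkeeping; it only illustrates the deduplication step.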

In [6]:
def overlap(s1:str, s2:str):
    
    # Returns the lexical overlap: the length of the longest common subsequence
    # divided by the length of the longer string (0.0 if both strings are empty)
    
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0
    return pylcs.lcs(s1,s2)/float(longest)
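
To make the overlap measure concrete, here is a dependency-free sketch of the same ratio. It swaps `pylcs` for a plain dynamic-programming LCS, so it is an illustration of the measure rather than the notebook's implementation; `lcs_length` and `overlap_ratio` are hypothetical names:

```python
def lcs_length(s1, s2):
    # Classic dynamic-programming LCS, keeping only the previous row of the table.
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_ratio(s1, s2):
    # LCS length divided by the length of the longer string (0.0 for two empty strings).
    longest = max(len(s1), len(s2))
    return lcs_length(s1, s2) / longest if longest else 0.0


print(overlap_ratio('beauty salon', 'beauty salons'))  # 12 shared characters out of 13
print(overlap_ratio('beauty salon', 'dog day care'))
```

A pair such as 'beauty salon' and 'beauty salons' scores 12/13, well above the 0.7 threshold used later for cluster refinement, so the two would be treated as near-duplicates.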

In [7]:
# Create a list of topics from the corpus

topics_unprocessed = []

for idea in biz_ideas:

    topics_unprocessed.append(topic(idea))

topics = preprocess(topics_unprocessed)

for topic_ in topics[:5]:
    print(topic_)

laundries
internet shoppes
dog day care
pet food business
restaurant business


In [8]:
# Generate embeddings using a pre-trained SentenceTransformer model

model = SentenceTransformer('paraphrase-mpnet-base-v2')
topics_embeddings_unprocessed = model.encode(topics_unprocessed)
topics_embeddings = model.encode(topics)

In [9]:
print(f'Shape of topic embeddings before pre-processing: {topics_embeddings_unprocessed.shape}')
print(f'Shape of topic embeddings after pre-processing: {topics_embeddings.shape}')

Shape of topic embeddings before pre-processing: (1000, 768)
Shape of topic embeddings after pre-processing: (487, 768)


In [10]:
# Top K similar ideas (Pre-processed but unrefined)

topic_query = 'Beauty Parlor'
query_embedding = model.encode(topic_query)

top_k = 5

cos_scores = util.pytorch_cos_sim(query_embedding, topics_embeddings)[0]
top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

print("Sentence:", topic_query, "\n")
print(f'Top {top_k} most similar items in corpus:\n')

for idx in top_results[0:top_k]:
    print(topics[idx], "(Score: %.4f)" % (cos_scores[idx]))

Sentence: Beauty Parlor 

Top 5 most similar items in corpus:

beauty salon (Score: 0.9091)
a beauty salon (Score: 0.9033)
beauty shop (Score: 0.8997)
beauty salons (Score: 0.8859)
hair care salon (Score: 0.7918)
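
The ranking above relies on cosine similarity between the query embedding and every topic embedding. A minimal NumPy-only sketch of that computation, using toy 2-D vectors in place of the 768-dimensional embeddings; `top_k_cosine` is a hypothetical helper, not a `sentence_transformers` function:

```python
import numpy as np


def top_k_cosine(query, corpus, k):
    # Normalise to unit length so the dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # Sort descending by score and keep the first k indices.
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Toy 2-D "embeddings" standing in for the real model output
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])

indices, scores = top_k_cosine(query, corpus, k=2)
for idx, score in zip(indices, scores):
    print(idx, f'(Score: {score:.4f})')
```

A full `argsort` is used here for clarity; for a large corpus a partial sort such as `np.argpartition`, as in the cell above, avoids sorting every score.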


In [11]:
# Clustering : KMeans

def kmeans(num_clusters, data:list):
    
    from sklearn.cluster import KMeans
        
    clustering_model = KMeans(n_clusters = num_clusters)
    clustering_model.fit(data)
    cluster_assignment = clustering_model.labels_

    return cluster_assignment

In [12]:
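def optimize_k(data:list):
    
    # Automatically estimates K for K-Means with an elbow heuristic: fit K-Means for each
    # candidate K, record the inertia, and pick the K whose (K, inertia) point lies
    # farthest from the straight line joining the first and last points of the curve
    
    from sklearn.cluster import KMeans
    import math
    
    dists = []
    K = range(1, min(70, len(data) + 1))  # K-Means cannot use more clusters than samples
        
    for n in K:
        k_model = KMeans(n_clusters = n)
        k_model.fit(data)
        dists.append(k_model.inertia_)
        
    def calc_dist(x1,y1,a,b,c):
        # Perpendicular distance from the point (x1, y1) to the line a*x + b*y + c = 0
        return abs((a*x1 + b*y1 + c))/(math.sqrt(a**2 + b**2))
        
    # Coefficients of the line through (K[0], dists[0]) and (K[-1], dists[-1])
    a = dists[0] - dists[-1]
    b = K[-1] - K[0]
    c1 = K[0] * dists[-1]
    c2 = K[-1] * dists[0]
    c = c1 - c2
        
    dists_line = []

    for k in range(len(K)):
        dists_line.append(calc_dist(K[k], dists[k], a, b, c))
            
    # Index i corresponds to K = i + 1
    num_clusters = dists_line.index(max(dists_line))+1
        
    return num_clusters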
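
The geometric step in `optimize_k` can be tried in isolation, without fitting any models. The sketch below applies the same point-to-line distance rule to made-up inertia values; `elbow_k` is a hypothetical name and the numbers are illustrative, not results from this dataset:

```python
import math


def elbow_k(inertias):
    # inertias[i] is assumed to be the inertia obtained with K = i + 1.
    ks = list(range(1, len(inertias) + 1))
    # Line through the first and last points, written as a*x + b*y + c = 0.
    a = inertias[0] - inertias[-1]
    b = ks[-1] - ks[0]
    c = ks[0] * inertias[-1] - ks[-1] * inertias[0]
    norm = math.hypot(a, b)
    distances = [abs(a * k + b * y + c) / norm for k, y in zip(ks, inertias)]
    # The elbow is the K whose point lies farthest from the line.
    return ks[distances.index(max(distances))]


# Made-up inertia values for K = 1..8: a steep drop at first, then a plateau
inertias = [100, 60, 30, 20, 17, 15, 14, 13]
print(elbow_k(inertias))
```

For these values the farthest point from the line is at K = 3, where the curve bends from the steep drop into the plateau.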

In [13]:
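# Alternative Approach: DBSCAN

def dbscan(data:list):
    
    # Density-based clustering: the number of clusters is determined by the algorithm,
    # and points that belong to no cluster are labelled -1 (noise)
    
    from sklearn.cluster import DBSCAN

    db_default = DBSCAN(eps = 0.0375, min_samples = 3).fit(data)
    cluster_assignment = db_default.labels_

    return cluster_assignment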
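
Unlike K-Means, DBSCAN chooses the number of clusters itself, and scikit-learn marks noise points with the label `-1`. A small sketch of how the cluster count could be read off the label array instead of being fixed by hand; `summarize_labels` is a hypothetical helper and the labels are made up:

```python
def summarize_labels(labels):
    # Noise points carry the label -1 and do not count as a cluster.
    cluster_ids = sorted(set(labels) - {-1})
    noise_count = sum(1 for label in labels if label == -1)
    return len(cluster_ids), noise_count


# Made-up label array of the kind returned in DBSCAN(...).fit(data).labels_
labels = [0, 0, 1, -1, 1, 2, -1]
num_clusters, num_noise = summarize_labels(labels)
print(f'{num_clusters} clusters, {num_noise} noise points')
```

A large noise count relative to the dataset size is usually a sign that `eps` is too small for the distance scale of the embeddings.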

In [14]:
def reduce_dims(data, alg='tsne', num_components=2):
    
    # Reduces dimensions using the t-SNE or PCA algorithm
    
    topics_red = None
    
    if alg == 'tsne':
        
        from sklearn.manifold import TSNE

        topics_red = TSNE(n_components=num_components).fit_transform(data)
        
    elif alg == 'pca':
        
        from sklearn.decomposition import PCA
        topics_red = PCA(n_components=num_components,svd_solver='full').fit_transform(data)
        
    else:
        
        # Fail loudly instead of silently returning None for an unknown algorithm
        raise ValueError(f"Unknown algorithm {alg!r}; expected 'tsne' or 'pca'")
            
    return topics_red

In [15]:
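def refine_clusters(clusters:list, threshold=0.7):
    
    # Eliminates lexically similar items using Longest Common Subsequence.
    # An item is dropped when a later item in the same cluster overlaps with it
    # above the threshold, so the last item of each group of near-duplicates is kept.
    
    refined_clusters = []
    reductions = []
    
    for cluster in clusters:
        
        refined_cluster = []

        # Iterate over every item, including the last one: it has no later items
        # to compare against, so it is always kept
        for i in range(len(cluster)):
            flag = True
            
            for j in range(i+1,len(cluster)):
                
                if overlap(cluster[i],cluster[j]) > threshold:
                    flag = False
                    break
                    
            if flag:
                refined_cluster.append(cluster[i])
        
        # Empty clusters are skipped entirely
        if len(cluster):
            reductions.append((len(cluster) - len(refined_cluster)) / len(cluster))
            refined_clusters.append(refined_cluster)
    
    # Average percentage reduction across clusters (0 when there are no clusters)
    global red2
    red2 = np.average(np.array(reductions))*100 if reductions else 0.0
    
    return refined_clusters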

In [16]:
def get_clusters(num_clusters, cluster_assignment, topics):
    
    # Prepares and returns clusters as list of lists
    
    clusters = []
    
    for i in range(num_clusters):
        
        clust_sent = np.where(cluster_assignment == i)
        clust_points = []
        
        for k in clust_sent[0]:
            
            clust_points.append(topics[k])
            
        clusters.append(clust_points)
    
    clusters = refine_clusters(clusters)
    
    return clusters
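
The grouping step maps each label back to its topic strings. As an alternative to calling `np.where` once per cluster, the same list of lists can be built in a single pass. This is a sketch under the assumption that labels and topics are aligned by position; `group_by_label` is a hypothetical name and the refinement step is left out:

```python
from collections import defaultdict


def group_by_label(labels, items):
    # Collect items per cluster label in a single pass; negative labels
    # (DBSCAN noise) are skipped.
    groups = defaultdict(list)
    for label, item in zip(labels, items):
        if label >= 0:
            groups[int(label)].append(item)
    # Return clusters ordered by label, as a list of lists of strings
    return [groups[label] for label in sorted(groups)]


labels = [1, 0, 1, -1, 0]
items = ['beauty salon', 'dog day care', 'hair care salon',
         'laundries', 'pet food business']
print(group_by_label(labels, items))
```

Unlike `get_clusters`, this variant does not need `num_clusters` up front and never produces empty clusters, which is convenient when the number of clusters is chosen by the algorithm.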

In [17]:
def print_clusters(clusters, n=10):
    
    # Prints up to n items per cluster and reports how much redundant data has been eliminated

    for i in range(len(clusters)):
        print()
        print(f'Cluster {i + 1} contains:')
        
        for j in range(min(n,len(clusters[i]))):
            print(f'- {clusters[i][j]}')
    
    global red1, red2
    print(f'\nData trimmed by {red1:.2f}% in preprocessing step, and by {red2:.2f}% in cluster refinement step.\n')

In [None]:
# K-Means Pipeline

# Define number of clusters or auto-estimate the optimum using inertia
num_clusters = optimize_k(topics_embeddings)

# Reduce dimensions using t-SNE or PCA (optional; only used if topics_red is passed to kmeans below)
topics_red = reduce_dims(topics_embeddings,alg='tsne',num_components=2)

# Apply K-Means Clustering
cluster_assignment = kmeans(num_clusters=num_clusters, data=topics_embeddings) # data can be topics_embeddings or topics_red

# Get the clusters
clusters = get_clusters(num_clusters, cluster_assignment, topics)

# Print clusters cohesively
print_clusters(clusters, 5)

In [None]:
# DBSCAN Pipeline

# Define the maximum number of clusters to show (DBSCAN determines the actual number of clusters automatically)
num_clusters = 75

# Reduce dimensions using t-SNE (optional; only used if topics_red is passed to dbscan below)
topics_red = reduce_dims(topics_embeddings,alg='tsne',num_components=3)

# Apply DBSCAN Clustering
cluster_assignment = dbscan(data=topics_embeddings) # data can be topics_embeddings or topics_red

# Get the clusters
clusters = get_clusters(num_clusters, cluster_assignment, topics)

# Print clusters cohesively
print_clusters(clusters, 5)