In this file we are developing a deep learning system with Sentence Transformer models and Semantic Similarity approach.

The goal is to create a model which would take a query request from the user about an action in code, and recieves the closest libraries about that action.

First we load the data as a jason file :

And for metrics sake lets create some metrics

In [7]:
import psutil
import time

start_time = time.time()

In [8]:
import pandas as pd
import json


input_file_path = r"D:\Sharif University of Tech\Data\Library Recommender\Pypi data\1\Pypi_data_Feb_19_2024.json"
with open(input_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

data_df = pd.DataFrame(data)

Because of hardware limitations we limit our data

In [9]:
data_df = data_df[100000:]

A simple preprocessing 

In [10]:
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer
import faiss
import hdbscan
from sklearn.cluster import KMeans
import torch


def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

Lets use an ensemble system for the embedding

In [11]:
models = [
    SentenceTransformer('all-MiniLM-L6-v2'),
    SentenceTransformer('paraphrase-MiniLM-L6-v2'),
    SentenceTransformer('distilbert-base-nli-mean-tokens')
]

def generate_ensemble_embeddings(texts):
    all_embeddings = []
    for model in models:
        embeddings = model.encode(texts, show_progress_bar=False, convert_to_tensor=True)
        all_embeddings.append(embeddings)
    concatenated_embeddings = torch.cat(all_embeddings, dim=1)
    return concatenated_embeddings.cpu().numpy()

and lets combine everything about the library into one single column (this would hurt the model in any way shape or form as the embedding system does not care for them to be seperate)

In [12]:
data_df['text'] = data_df['Summary'].str[0] + " " + data_df['Description'].str[0]
data_df['text'] = data_df['text'].fillna('')

lets apply both our functions (this might take a while even for the limited data we have)

In [13]:
data_df['text'] = data_df['text'].apply(preprocess_text)
ensemble_embeddings = generate_ensemble_embeddings(data_df['text'].tolist())

Lets create a clustring system and refine it with k-means (MLM)

lets explain how this would be beneficial

this system as a whole works by clustering the data (the embedded data) into 5 clusters (this number could be different but per my experience it would not make much of a difference)
after the clustering is done the seperation of the meaning of each cluster in the embedded space is more clear and as a result the algorithm will have less trouble finding new instances (improving time and accuracy)

In [14]:
hdbscan_clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
hdbscan_labels = hdbscan_clusterer.fit_predict(ensemble_embeddings)
data_df['hdbscan_cluster'] = hdbscan_labels

def refine_clusters_with_kmeans(embeddings, hdbscan_labels, n_subclusters=5):
    unique_clusters = set(hdbscan_labels) - {-1}
    refined_labels = np.array(hdbscan_labels)

    for cluster_id in unique_clusters:
        mask = hdbscan_labels == cluster_id
        cluster_embeddings = embeddings[mask]

        kmeans = KMeans(n_clusters=n_subclusters, random_state=42)
        kmeans_labels = kmeans.fit_predict(cluster_embeddings)

        refined_labels[mask] = kmeans_labels + cluster_id * n_subclusters

    return refined_labels

lets apply 

In [None]:
refined_labels = refine_clusters_with_kmeans(ensemble_embeddings, hdbscan_labels)
data_df['refined_cluster'] = refined_labels

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_me

In [None]:
dimension = ensemble_embeddings.shape[1]
faiss_indexes = {}

for cluster_id in set(refined_labels):
    cluster_mask = refined_labels == cluster_id
    cluster_embeddings = ensemble_embeddings[cluster_mask].astype('float32')

    index = faiss.IndexFlatL2(dimension)
    index.add(cluster_embeddings)
    faiss_indexes[cluster_id] = index

implementing a simple semantic search function and the clustring system

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def find_nearest_cluster(query_embedding, data_df, embeddings):
    similarities = cosine_similarity(query_embedding.reshape(1, -1), embeddings)
    nearest_index = np.argmax(similarities)
    return data_df.iloc[nearest_index]['refined_cluster']

def semantic_search_refined_cluster(query, data_df, top_n=5):
    query_processed = preprocess_text(query)
    query_embedding = np.hstack([
        model.encode([query_processed], convert_to_tensor=True).cpu().numpy()
        for model in models
    ]).astype('float32')

    refined_cluster_id = find_nearest_cluster(query_embedding, data_df, ensemble_embeddings)

    if refined_cluster_id in faiss_indexes:
        index = faiss_indexes[refined_cluster_id]
        _, top_n_indices = index.search(query_embedding.reshape(1, -1), top_n)

        cluster_mask = data_df['refined_cluster'] == refined_cluster_id
        cluster_libraries = data_df[cluster_mask].iloc[top_n_indices[0]]
        return cluster_libraries[['Package', 'Summary', 'Description']]
    else:
        return pd.DataFrame()


And the metrics would be :

In [None]:
execution_time = time.time() - start_time

cpu_usage = psutil.cpu_percent()
ram_usage = psutil.virtual_memory().percent

with open('metrics.txt', 'w') as f:
    f.write(f'CPU Usage: {cpu_usage}%\n')
    f.write(f'RAM Usage: {ram_usage}%\n')
    f.write(f'Execution Time: {execution_time} seconds\n')

Lets save the model :

In [None]:
from joblib import dump

dump(model, 'model.joblib')

and now lets test :

In [None]:
data_df

Unnamed: 0,Package,Command,Release,Summary,License,Description,Topic,tags,text,hdbscan_cluster,refined_cluster
100000,easywindow 0.0.1,pip install easywindow,"Sep 1, 2022",A NONDOS Project,OSI Approved :: MIT License,This is a very easy library. Change Log 0....,,[tkinter],a,137,685
100001,EasyWidgets 0.4.1,pip install EasyWidgets,"Jun 29, 2022",A minimalistic approach to HTML generation and...,OSI Approved :: MIT License,Not so easy.,,"[TurboGears,]",a,141,706
100002,easy-workflow-manager 0.0.14,pip install easy-workflow-manager,"Jul 10, 2023",Tools to support a straightforward branch/qa/m...,OSI Approved :: MIT License,Install Install with pip % pip3 install ea...,[Software Development :: Libraries],"[git,]",t,494,2471
100003,easy-wrap 0.1.2,pip install easy-wrap,"Feb 14, 2023",No project description provided,,easy_wrap 基于 pillow 的简单文本转图片渲染工具，帮助实现自动换行 安...,,,n,427,2135
100004,easywsgi 0.4,pip install easywsgi,"May 4, 2015","Small module to make wsgi apps easy to write, ...",,# easywsgi Small module to make wsgi apps easy...,,,s,334,1671
...,...,...,...,...,...,...,...,...,...,...,...
396559,zzzeeksphinx 1.4.0,pip install zzzeeksphinx,"Apr 25, 2023",Zzzeek's Sphinx Layout and Utilities.,,"This is zzzeek’s own Sphinx layout, used by S...",[Documentation],[Sphinx],z,74,370
396560,zzzfs 0.1.2,pip install zzzfs,"May 7, 2017",Dataset management à la ZFS,,ZzzFS: dataset management à la ZFS ZzzFS (“s...,[System :: Filesystems],[zfs],d,450,2250
396561,zzzing 0.4.8,pip install zzzing,"Jun 5, 2022",zzzing CLI by xuanzhi33,OSI Approved :: GNU General Public License v3 ...,zzzing CLI by xuanzhi33,,,z,74,370
396562,zzzutils 0.1.7,pip install zzzutils,"Sep 11, 2018",Time utils for Humans.,OSI Approved :: Apache Software License,Requests: Python utils - Time ================...,,,t r,389,1945


                     Package  \
239836           orcid 1.0.3   
247072           pdown 1.0.5   
260786    project-sync 0.2.0   
262551  psychopy_ext 0.6.0.4   
264649          py2sms 1.1.1   
265121          pyaeries 1.0   
265517       pyAniSort 1.0.4   
269688        pycutter 0.1.1   
282086     pyProCT-GUI 0.4.1   
283176  pyramid-openid 0.3.4   

                                                  Summary  \
239836                A python wrapper over the ORCID API   
247072  A command line program to download/manipulate ...   
260786  A simple tool for synchronizing local project ...   
262551  A framework for a rapid reproducible experimen...   
264649                A package to send SMS using Way2Sms   
265121                                     Aeries SIS API   
265517  Automatically sorts anime using information fr...   
269688                          A simple screen shot tool   
282086  A Graphical User Interface for pyProCT cluster...   
283176  A view for pyramid that funct