# **Ranking Software Vendors using Sentence Embeddings & Cosine Similarity**

## **Objective**
We aim to rank software vendors based on their relevance to a user's input criteria by computing similarity scores using **sentence embeddings** and **cosine similarity**.

### Data Exploration

In [1]:
# Imports:
import json
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import torch
import pickle

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Load CSV data into dataframe
csv = pd.read_csv("data/G2 software product overview.csv")
df = pd.DataFrame(csv)
df.head()

Unnamed: 0,url,product_name,rating,description,product_url,seller,ownership,seller_website,headquarters,total_revenue,...,full_pricing_page,badge,what_is_description,main_category,main_subject,Features,region,country_code,software_product_id,overview_provided_by
0,https://www.g2.com/products/newforma-project-c...,Newforma Project Center,4.0,Newforma PIM solution an integrated solution f...,https://www.newforma.com/newforma-project-center/,Newforma,,https://www.newforma.com/,"Manchester, NH",,...,https://www.g2.com/products/newforma-project-c...,https://images.g2crowd.com/uploads/report_meda...,,Construction Software,Home>Construction Software>Construction Projec...,"[{""Category"":""Library"",""features"":[{""descripti...",,US,newforma-project-center,Henry Auger
1,https://www.g2.com/products/nitro-pro/reviews,Nitro Pro,4.3,Nitro deliver trusted PDF & eSign software for...,https://www.gonitro.com/pricing,"Nitro, Inc",,https://www.gonitro.com/,"San Francisco, CA",,...,https://www.g2.com/products/nitro-pro/pricing,https://images.g2crowd.com/uploads/report_meda...,,Document Creation Software,Home>Document Creation Software>Nitro Pro>Nitr...,"[{""Category"":""Platform"",""features"":[{""descript...",,US,nitro-pro,Jaclyn Core
2,https://www.g2.com/products/netmera/reviews,Netmera,4.2,"Netmera enables marketers to create, schedule,...",https://www.netmera.com/mobile-marketing-autom...,Netmera,,https://netmera.com/,"İstanbul, TR",,...,https://www.g2.com/products/netmera/pricing,https://images.g2crowd.com/uploads/report_meda...,,Mobile Marketing Software,Home>Mobile Marketing Software>Netmera>Netmera...,"[{""Category"":""Integration"",""features"":[{""descr...",AS,TR,netmera,Irem BaylanNetmera şirketinde Product Marketin...
3,https://www.g2.com/products/netlify/reviews,Netlify,4.5,Netlify provides a full-featured CDN hosting s...,https://www.netlify.com/features/,Netlify,,https://www.netlify.com/,"San Francisco, CA",,...,https://www.g2.com/products/netlify/pricing,https://images.g2crowd.com/uploads/report_meda...,,WebOps Platforms,Home>WebOps Platforms>Netlify>Netlify Reviews,"[{""Category"":""Content"",""features"":[{""descripti...",,US,netlify,Lisa Kretsch
4,https://www.g2.com/products/openbuildings-desi...,OpenBuildings Designer,4.3,OpenBuildings Designer is a single building in...,https://www.g2.com/products/openbuildings-desi...,Bentley Systems,NASDAQ: BSY,https://www.bentley.com/,"Exton, PA",,...,https://www.g2.com/products/openbuildings-desi...,,,CAD Software,Home>CAD Software>Building Design and Building...,"[{""Category"":""Design"",""features"":[{""descriptio...",,US,openbuildings-designer,Prathamesh Gawde


In [3]:
df.columns

Index(['url', 'product_name', 'rating', 'description', 'product_url', 'seller',
       'ownership', 'seller_website', 'headquarters', 'total_revenue',
       'social_media_profiles', 'seller_description', 'reviews_count',
       'discussions_count', 'pros_list', 'cons_list', 'competitors',
       'highest_rated_features', 'lowest_rated_features', 'rating_split',
       'pricing', 'official_screenshots', 'official_downloads',
       'official_videos', 'categories', 'user_ratings', 'languages_supported',
       'year_founded', 'position_against_competitors', 'overview', 'claimed',
       'logo', 'reviews', 'top_alternatives', 'top_alternatives_url',
       'full_pricing_page', 'badge', 'what_is_description', 'main_category',
       'main_subject', 'Features', 'region', 'country_code',
       'software_product_id', 'overview_provided_by'],
      dtype='object')

### Data Cleaning

I extracted relevant columns from the vendor dataset:
- **`main_category`**: The primary category of the vendor's software.
- **`categories_text`**: Additional category-related keywords that describe the software.
- **`feature_names`**: A list of features provided by the vendor.
- **`rating`**: The vendor's rating score, which influences the final ranking.
- **`seller & product_name`**: The vendor's name and product name for final ranking.


In [4]:
def relevant_attributes(df):
    # Select columns: 'seller', 'product_name', 'Features', 'categories', 'rating', and 'main_category'
    df = df.loc[:, [ 'seller', 'product_name', 'Features', 'categories', 'rating', 'main_category']]
    return df

vendors_data = relevant_attributes(df.copy())
vendors_data.head()

Unnamed: 0,seller,product_name,Features,categories,rating,main_category
0,Newforma,Newforma Project Center,"[{""Category"":""Library"",""features"":[{""descripti...","[""Construction Project Management"",""Jobsite Ma...",4.0,Construction Software
1,"Nitro, Inc",Nitro Pro,"[{""Category"":""Platform"",""features"":[{""descript...","[""Document Creation"",""E-Signature"",""PDF Editor""]",4.3,Document Creation Software
2,Netmera,Netmera,"[{""Category"":""Integration"",""features"":[{""descr...","[""Marketing Automation"",""Customer Journey Mapp...",4.2,Mobile Marketing Software
3,Netlify,Netlify,"[{""Category"":""Content"",""features"":[{""descripti...","[""Continuous Delivery"",""Cloud Platform as a Se...",4.5,WebOps Platforms
4,Bentley Systems,OpenBuildings Designer,"[{""Category"":""Design"",""features"":[{""descriptio...","[""Building Design and Building Information Mod...",4.3,CAD Software


In [None]:
# Drop rows with missing data in column: 'Features'
def clean_data(vendors_data):
    vendors_data = vendors_data.dropna(subset=['Features'])
    return vendors_data

vendors_data_clean = clean_data(vendors_data.copy())
vendors_data_clean.head()

Unnamed: 0,seller,product_name,Features,categories,rating,main_category
0,Newforma,Newforma Project Center,"[{""Category"":""Library"",""features"":[{""descripti...","[""Construction Project Management"",""Jobsite Ma...",4.0,Construction Software
1,"Nitro, Inc",Nitro Pro,"[{""Category"":""Platform"",""features"":[{""descript...","[""Document Creation"",""E-Signature"",""PDF Editor""]",4.3,Document Creation Software
2,Netmera,Netmera,"[{""Category"":""Integration"",""features"":[{""descr...","[""Marketing Automation"",""Customer Journey Mapp...",4.2,Mobile Marketing Software
3,Netlify,Netlify,"[{""Category"":""Content"",""features"":[{""descripti...","[""Continuous Delivery"",""Cloud Platform as a Se...",4.5,WebOps Platforms
4,Bentley Systems,OpenBuildings Designer,"[{""Category"":""Design"",""features"":[{""descriptio...","[""Building Design and Building Information Mod...",4.3,CAD Software


In [6]:
# convert column values to list
vendors_data_clean["Features"] = vendors_data_clean["Features"].apply(lambda x: json.loads(x))
vendors_data_clean["categories"] = vendors_data_clean["categories"].apply(lambda x: json.loads(x))

In [7]:
# Extract Features 
extracted_features = []

for category in vendors_data_clean["Features"]:
    feature_list = []
    for features in category:
        for feature in features['features']:
            feature_list.append(feature['name'])
        
    extracted_features.append(feature_list)

vendors_data_clean["feature_names"] = extracted_features

# Convert the list of categories into a concatenated string
vendors_data_clean["categories_text"] = vendors_data_clean["categories"].apply(lambda x: " ".join(x) if isinstance(x, list) else "")

vendors_data_clean    

Unnamed: 0,seller,product_name,Features,categories,rating,main_category,feature_names,categories_text
0,Newforma,Newforma Project Center,"[{'Category': 'Library', 'features': [{'descri...","[Construction Project Management, Jobsite Mana...",4.0,Construction Software,"[Objects, Materials, Textures, Shading, Lighti...",Construction Project Management Jobsite Manage...
1,"Nitro, Inc",Nitro Pro,"[{'Category': 'Platform', 'features': [{'descr...","[Document Creation, E-Signature, PDF Editor]",4.3,Document Creation Software,"[Custom Branding, User, Role, and Access Manag...",Document Creation E-Signature PDF Editor
2,Netmera,Netmera,"[{'Category': 'Integration', 'features': [{'de...","[Marketing Automation, Customer Journey Mappin...",4.2,Mobile Marketing Software,"[Data Import & Export Tools, Integration APIs,...",Marketing Automation Customer Journey Mapping ...
3,Netlify,Netlify,"[{'Category': 'Content', 'features': [{'descri...","[Continuous Delivery, Cloud Platform as a Serv...",4.5,WebOps Platforms,"[Static Content Caching, Dynamic Content Routi...",Continuous Delivery Cloud Platform as a Servic...
4,Bentley Systems,OpenBuildings Designer,"[{'Category': 'Design', 'features': [{'descrip...",[Building Design and Building Information Mode...,4.3,CAD Software,"[Visualizing, Rendering, Drawing, Editing, Seq...",Building Design and Building Information Model...
...,...,...,...,...,...,...,...,...
995,Securiti,Securiti,"[{'Category': 'Administration', 'features': [{...","[AWS Marketplace, Privacy Impact Assessment (P...",4.8,Data Privacy Management Software,"[Data Modelling, Recommendations, Workflow Man...",AWS Marketplace Privacy Impact Assessment (PIA...
996,SentinelOne,SentinelOne Singularity,"[{'Category': 'Performance', 'features': [{'de...","[Endpoint Management, Endpoint Detection & Res...",4.7,Endpoint Protection Software,"[Issue Tracking, Detection Rate, False Positiv...",Endpoint Management Endpoint Detection & Respo...
997,Semrush,Semrush,"[{'Category': 'Social Management', 'features':...","[AI Writing Assistant, Social Media Management...",4.5,SEO Tools,"[Social Analytics, Social Publishing, Social E...",AI Writing Assistant Social Media Management M...
998,SAP,SAP Business ByDesign,"[{'Category': 'General Ledger', 'features': [{...","[Accounting, ERP Systems, Discrete ERP, Distri...",4.0,ERP Systems,"[Journal Entries, Tags / Dimensions, Audit Tra...",Accounting ERP Systems Discrete ERP Distributi...


### Generate Embeddings

To numerically compare text descriptions, I chose to use a **pre-trained SentenceTransformer model** (e.g., `"all-MiniLM-L6-v2"`) to generate embeddings for:
- **User's `software_category`** → Compared against `"main_category"` and `"categories_text"`.
- **User's `capabilities`** → Compared against `"feature_names"`.

These embeddings will allow us to compute similarity scores between the user input and vendor offerings.


In [None]:
# Load a pre-trained sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text):
    """Generate embedding for a given text using the sentence transformer model."""
    return model.encode(text, convert_to_tensor=True)

vendors_data_clean["main_category_embedding"] = vendors_data_clean["main_category"].apply(get_embedding)
vendors_data_clean["categories_text_embedding"] = vendors_data_clean["categories_text"].apply(get_embedding)
vendors_data_clean["feature_embeddings"] = vendors_data_clean["feature_names"].apply(
    lambda features: [get_embedding(feature) for feature in features]
)

#### **Optimizing Reloading with Pickle**
Embedding the text took around 7mins. Since this process might be time consuming I decied to pickle the dataframe with the preprocessed **dataframe with embedded columns** using Python’s `pickle` module.  


In [None]:
# I pieckled the dataframe with the embedded columns for faster reloding 
with open("data/vendors_data_with_embeddings.pkl", "wb") as f:
    pickle.dump(vendors_data_clean, f)

In [None]:
# Load a pre-trained sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text):
    """Generate embedding for a given text using the sentence transformer model."""
    return model.encode(text, convert_to_tensor=True)


with open("data/vendors_data_with_embeddings.pkl", "rb") as f:
    embedded_vendors_data = pickle.load(f)

#### Similarity Computation
- The `compute_feature_similarities` function calculates pairwise similarities between user capability embeddings and vendor feature embeddings, returning a similarity matrix. 
- The `compute_similarity` function compares a single input embedding (e.g., software category) against vendor category embeddings to assess relevance.

In [9]:
# Similarity computation of vectorized fetures and vectorized input
def compute_feature_similarities(capability_embeddings, vendor_feature_embeddings):
    """Compute pairwise similarity scores between user capabilities and vendor features."""
    if not vendor_feature_embeddings:
        return []  # Return an empty list if no features are available

    vendor_feature_embeddings = torch.stack(vendor_feature_embeddings)  # Convert list to tensor
    similarity_matrix = util.pytorch_cos_sim(torch.stack(capability_embeddings), vendor_feature_embeddings)

    return similarity_matrix.tolist()  # Keeping as list of lists for now

In [10]:
def compute_similarity(input_embedding, vendor_embeddings):
    """ Compute cosine similarity between software_category and (main_category + categories_text) """
    similarity_scores = util.pytorch_cos_sim(input_embedding, vendor_embeddings)
    return similarity_scores.squeeze().tolist()

#### Example User Inputs to test the system

In [11]:
# Example inputs
software_category = "Project Management Software"
capabilities = ["Task Scheduling", "Time Tracking"]

# Generate embeddings input embeddings
software_category_embedding = get_embedding(software_category)  # Already a tensor
capability_embeddings = [get_embedding(feature) for feature in capabilities]  # List of tensors

Computing similarity for each user input

In [12]:
# Compute category similarity
embedded_vendors_data["category_similarity"] = embedded_vendors_data.apply(
    lambda row: max(compute_similarity(software_category_embedding, 
                                       torch.stack([row["main_category_embedding"], row["categories_text_embedding"]]))),
    axis=1
)

# Compute feature similarity (list of scores for each vendor)
embedded_vendors_data["feature_similarities"] = embedded_vendors_data["feature_embeddings"].apply(
    lambda feature_emb: compute_feature_similarities(capability_embeddings, feature_emb)
)


#### Filtering and Weighted Feature Similarity Calculation

The filtering step ensures that only vendors with at least one feature similarity score **≥ 0.6** are retained.

This threshold of 0.6 is chosen to include only vendors with a moderately strong semantic match to user capabilities, eliminating less relevant options.

Next, the weighted feature similarity is computed by averaging all similarity scores for each vendor. This step ensures that vendors with higher overall feature alignment receive a better ranking.

In [13]:
# filtaring any vender under the threshold of 0.6
filtered_vendors = embedded_vendors_data[
    embedded_vendors_data["feature_similarities"].apply(
        lambda scores: any(score >= 0.6 for row in scores for score in row)  # Flatten nested lists 
    )
].copy()

ranked_vendors = filtered_vendors.copy()

# computing weigthed feature similarity for final ranking
ranked_vendors.loc[:, "weighted_feature_similarity"] = ranked_vendors["feature_similarities"].apply(
    lambda scores: sum(score for row in scores for score in row) / sum(len(row) for row in scores) if scores else 0
)

#### Normalizing Ratings and Final Ranking Calculation

To ensure fairness in ranking, vendor ratings are **normalized** between 0 and 1. This ensures that ratings are scaled proportionally, maintaining their relative importance.

The final ranking score is then computed as a weighted combination of feature similarity (70%) and normalized rating (30%)

In [14]:
# Normalize vendor ratings between 0 and 1
if not ranked_vendors.empty:
    min_rating = ranked_vendors["rating"].min()
    max_rating = ranked_vendors["rating"].max()
    if max_rating > min_rating:  # Avoid division by zero
        ranked_vendors.loc[:, "normalized_rating"] = ranked_vendors["rating"].apply(
            lambda r: (r - min_rating) / (max_rating - min_rating)
        )
    else:
        ranked_vendors.loc[:, "normalized_rating"] = 0
else:
    ranked_vendors["normalized_rating"] = []

# final ranking score (70% feature similarity, 30% rating)
ranked_vendors.loc[:, "final_score"] = (
    0.7 * ranked_vendors["weighted_feature_similarity"] + 
    0.3 * ranked_vendors["normalized_rating"]
)

# sort based on final score
ranked_vendors = ranked_vendors.sort_values(by="final_score", ascending=False)

# Results:

In [15]:

# store matched features with similarity scores
def extract_high_scoring_features(features, scores, threshold=0.6):
    """Extracts feature names where similarity score is above the threshold."""
    high_score_features = [
        f"{feature} ({score:.2f})"  # Format as "Feature (Score)"
        for feature, row in zip(features, scores)  # Match features with similarity scores
        for score in row if score >= threshold
    ]
    return ", ".join(high_score_features) if high_score_features else "No strong matches"

# Store matched features
ranked_vendors.loc[:, "matched_features"] = ranked_vendors.apply(
    lambda row: extract_high_scoring_features(row["feature_names"], row["feature_similarities"], 0.6), axis=1
)

# relevant columns for output
top_vendors = ranked_vendors[["seller", "final_score", "weighted_feature_similarity", "category_similarity", "rating", "matched_features"]]

# table styling
def style_table(df):
    return df.style.set_table_styles(
        [
            {"selector": "th", "props": [("background-color", "#343a40"), ("color", "white"), ("font-weight", "bold"), ("text-align", "center")]},
            {"selector": "td", "props": [("border", "1px solid #dee2e6"), ("text-align", "center"), ("background-color", "#f5f5f5"), ("color", "#212529")]},
        ]
    ).set_caption("Top Ranked Software Vendors")

# Display results
if not top_vendors.empty:
    display(style_table(top_vendors.head(10)))  # Show top 10 vendors
else:
    print("\n⚠️ No vendors met the similarity threshold.")


Unnamed: 0,seller,final_score,weighted_feature_similarity,category_similarity,rating,matched_features
916,QAD,0.525779,0.331113,0.288011,4.9,Templates (0.85)
174,CAST AI,0.472296,0.263279,0.413441,4.8,Scheduling (0.85)
988,Fullbay,0.468945,0.249921,0.382038,4.9,"Expense reporting (0.85), Expense reporting (0.74)"
363,Intuit,0.46533,0.279043,0.393246,4.5,"Tracking Time to Project/Task (0.65), Tracking Time to Project/Task (1.00)"
894,"Take 44, Inc.",0.460226,0.246037,0.382306,4.8,"Contact Management (0.74), Contact Management (0.74)"
765,Willo Technologies Ltd,0.458308,0.243297,0.46452,4.8,Reporting (0.74)
683,AlignOps,0.458302,0.243288,0.393246,4.8,"Ease of Completing Timesheets (0.74), Ease of Completing Timesheets (0.71), Tracking Time to Project/Task (0.65), Tracking Time to Project/Task (1.00)"
334,Pocketstop,0.457726,0.242465,0.324263,4.8,SMS Messaging (0.85)
535,Deputy,0.456907,0.258438,0.568492,4.6,"Customization (0.75), Customization (0.74), Integration APIs (0.65), Integration APIs (1.00), Integration APIs (0.63)"
873,Contractor Foreman,0.456373,0.266247,0.578517,4.5,"Construction Accounting Tool Integrations (0.71), Construction Estimating Tool Integration (1.00)"
