# DFF grants similarity: A Bag-of-Words Approach
* Inspired by [SNSF Grant Similarity](https://github.com/snsf-data/snsf-grant-similarity/blob/main/notebooks/grant_similarity_tfidf.ipynb)



This notebook is based on SNSF's grant similarity notebook. The code has been refactored into reusable functions, which makes it easier to apply to new datasets and new purposes.

The algorithm for detecting similarity is the same:

1. pre-process the texts for the TF-IDF model: keep English texts only, lowercase, remove stop words and punctuation, stem, and build n-grams
2. apply the TF-IDF weighting model and extract the TF-IDF vectors
3. compute the cosine similarity between the TF-IDF vectors
4. rank the texts by similarity score

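The input is a set of baseline texts and a set of comparison texts. Each set is a list of dictionaries, and each dictionary holds an identifier and a text. In the reviewer example at the end of the notebook, the identifier key is `id` for reviewers and `case_number` for applications, and the text key is `text` in both.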
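
As an illustration of this input shape, the sketch below builds two small sets and splits them into parallel lists of identifiers and texts. The records are invented for the example, and the key names follow the reviewer example at the end of the notebook.

```python
# Hypothetical example records; the identifiers and texts are invented.
baseline = [
    {"id": "reviewer_a", "text": "Deep learning for protein structure prediction."},
    {"id": "reviewer_b", "text": "Medieval trade routes in the Baltic Sea region."},
]
comparison = [
    {"case_number": "case_001", "text": "Predicting protein folding with neural networks."},
]

# Split each set into parallel lists of identifiers and texts, which is the
# shape that compute_similarity and rank_matches work with further down.
baseline_ids = [item["id"] for item in baseline]
baseline_texts = [item["text"] for item in baseline]
comparison_ids = [item["case_number"] for item in comparison]
comparison_texts = [item["text"] for item in comparison]
```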

## Library Imports
First, we import the necessary libraries for data wrangling and natural language processing.

In [None]:
# import standard libraries
import numpy as np
import pandas as pd
import os
from tqdm import tqdm 

# import NLP/text libraries
import nltk
import string

# import tfidf vectorizer and similarity metrics from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import language detection
from langdetect import detect

# import OpenAlex libraries
import pyalex
from pyalex import Works, Authors, Sources, Institutions, Topics, Publishers, Funders

## Setup stopwords dictionary and load the Porter stemmer
To filter stop words out of the texts, we download the stop word list available in the 'nltk' package (Bird et al., 2009). We also initialize the Porter stemmer, which is used for stemming later on.

In [None]:
# Download stopwords if not already done
nltk.download('stopwords')

# Initialize stopwords and stemmer
stop_words = set(nltk.corpus.stopwords.words('english'))
ps = nltk.stem.PorterStemmer()

## Data pre-processing
We first do some data wrangling: we concatenate the titles and abstracts and remove non-English texts.

In [None]:
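def text_preprocessing(data):
    """Return the English title-plus-abstract texts of a DataFrame with Title and Abstract columns."""
    # work on a copy so the caller's DataFrame is not modified
    data = data.copy()
    # concatenate titles and abstracts
    data['TitleAbstract'] = data.Title + '. ' + data.Abstract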
    # detect language of titles and abstracts
    data['Lang'] = data.TitleAbstract.apply(detect)
    # keep only english texts
    data = data[data.Lang == 'en']
    # extract texts as a list
    texts = data.TitleAbstract.tolist()

    return texts

## Text Processing and Tokenizer
We begin the text processing by lowercasing the text and removing punctuation. We then create unigrams by splitting the text into separate words (tokens), removing stop words, and stemming the remaining words.

In addition to the unigrams, we create short word combinations called n-grams, up to n=3, i.e. combinations of two or three consecutive words. As with unigrams, we apply stemming, but we keep the stop words.
Finally, we concatenate the unigrams with the n-grams to complete the tokenization process.
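
The combination of unigrams and n-grams can be illustrated with a dependency-free sketch. It assumes a tiny made-up stop word list and skips stemming, so it only shows how the token list is assembled. `demo_tokens` and `DEMO_STOP_WORDS` are hypothetical names and not part of the notebook, which uses NLTK for stop words, stemming and n-grams.

```python
import string

# A tiny stop word list for illustration only; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "of", "in", "a", "and"}

def demo_tokens(text, max_n=3):
    """Return unigrams (stop words removed) plus 2..max_n-grams (stop words kept).

    Stemming is left out to keep the sketch free of dependencies.
    """
    # Lowercase and strip punctuation, as in the notebook.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()

    # Unigrams: drop stop words and single-character tokens.
    unigrams = [w for w in words if w not in DEMO_STOP_WORDS and len(w) > 1]

    # n-grams: slide a window of each size n over the full word sequence.
    ngrams = [
        " ".join(words[i:i + n])
        for n in range(2, max_n + 1)
        for i in range(len(words) - n + 1)
    ]
    return unigrams + ngrams

tokens = demo_tokens("The role of gut bacteria.")
print(tokens)
```

Keeping the stop words inside the n-grams preserves phrases such as "role of gut", which would otherwise be broken up.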





In [None]:
def preprocess_texts(texts, use_stopwords=True, use_ngrams=True, n_grams=3):
    """
    Preprocess a list of texts:
      - Lowercases text.
      - Removes punctuation.
      - Optionally removes stopwords and applies stemming for unigrams.
      - Optionally creates n-grams (with stemming) without stop word removal.
      - Returns a list of token lists for each text.
    """
    
    processed_tokens = []
    
    for text in texts:
        # Lowercase and remove punctuation (string.punctuation: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
        text = text.lower().translate(str.maketrans('', '', string.punctuation))
        
        # Unigrams: tokenize, remove stopwords and stem (if desired)
        words = text.split()
        if use_stopwords:
            words = [word for word in words if word not in stop_words]
        unigrams = [ps.stem(word) for word in words if len(word) > 1]
        
        tokens = unigrams.copy()
        
        # Create n-grams if desired
        if use_ngrams and n_grams > 1:
            # First, stem the words (without removing stopwords this time)
            words_stemmed = [ps.stem(word) for word in text.split()]
            # Create n-grams from 2 up to n_grams
            ngrams = nltk.everygrams(words_stemmed, 2, n_grams)
            ngrams = [' '.join(gram) for gram in ngrams]
            tokens.extend(ngrams)
        
        processed_tokens.append(tokens)
    return processed_tokens

## TF-IDF Model and similarity metric
To create a numerical representation of the tokens, we apply TF-IDF (Term Frequency – Inverse Document Frequency) weighting (Sparck Jones, 1972). TF-IDF is a bag-of-words approach: the numerical representation of a text in vector space is based on a token decomposition of the text and ignores the order in which the tokens appear. This corresponds to the tokenization procedure above. TF-IDF then applies a weighting scheme that puts a higher weight on tokens that appear frequently in one document but rarely across documents. The resulting vectors are high-dimensional and sparse. Despite its simplicity, TF-IDF vectorization has proven very effective in text similarity tasks (compare e.g. Shahmirzadi et al., 2019).

To compare the grants represented by the TF-IDF vectors, we compute the cosine similarity between the vectors.
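
The weighting and the similarity measure can be sketched without any library. This is a minimal illustration, assuming raw term counts and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default. The helper names `tfidf_vectors` and `cosine` and the example tokens are made up here; the notebook itself relies on `TfidfVectorizer` and `cosine_similarity`.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one L2-normalised TF-IDF dict per document (smoothed IDF)."""
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for tokens in token_lists for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in token_lists:
        # Term frequency (raw count) times inverse document frequency.
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())

docs = [
    ["protein", "fold", "network"],
    ["protein", "structur", "network"],
    ["mediev", "trade", "rout"],
]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))  # the two documents share two tokens
print(cosine(vecs[0], vecs[2]))            # the two documents share no tokens
```

Because the vectors are normalised to unit length, the cosine similarity reduces to a dot product, and documents without any shared token score zero.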




In [None]:
def compute_similarity(baseline_texts, comparison_texts):
    """
    Computes the cosine similarity between the baseline set and the comparison set.
    Returns a similarity matrix (rows: baseline, columns: comparison) along with the fitted vectorizer.
    """
    # Preprocess both sets of texts
    baseline_tokens = preprocess_texts(baseline_texts)
    comparison_tokens = preprocess_texts(comparison_texts)
    
    # Combine tokens to fit a common vocabulary
    combined_tokens = baseline_tokens + comparison_tokens
    
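    # Initialize the TF-IDF vectorizer (using identity functions since tokens are precomputed).
    # token_pattern=None avoids the warning that the default pattern is unused when a tokenizer is given.
    tfidf = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, token_pattern=None, use_idf=True, norm='l2')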
    tfidf_vector = tfidf.fit_transform(combined_tokens)
    
    # Split the vectors back into baseline and comparison sets
    baseline_vector = tfidf_vector[:len(baseline_tokens)]
    comparison_vector = tfidf_vector[len(baseline_tokens):]
    
    # Compute cosine similarity (each row corresponds to a baseline text and each column to a comparison text)
    similarity_matrix = cosine_similarity(baseline_vector, comparison_vector)
    return similarity_matrix, tfidf

## Ranking
To retrieve the most similar grants relative to a target grant of interest, we rank-order the grants according to their cosine similarity.
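
When inspecting the results it can help to see the scores next to the identifiers. The following is a sketch under that assumption, not the notebook's ranking function: `top_matches_with_scores` is a hypothetical helper, the matrix values are invented, and the rows and columns follow the same convention as above (rows: baseline, columns: comparison).

```python
def top_matches_with_scores(similarity_matrix, baseline_ids, comparison_ids, top_n=1):
    """Map each comparison id to its top_n (baseline_id, score) pairs, best first."""
    matches = {}
    for j, comp_id in enumerate(comparison_ids):
        # Pair every baseline id with its score in column j.
        scored = [
            (baseline_ids[i], similarity_matrix[i][j])
            for i in range(len(baseline_ids))
        ]
        # Sort by score, highest first, and keep the top_n pairs.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        matches[comp_id] = scored[:top_n]
    return matches

# Made-up scores: 3 baseline texts (rows) x 2 comparison texts (columns).
demo_matrix = [
    [0.10, 0.80],
    [0.70, 0.20],
    [0.40, 0.05],
]
result = top_matches_with_scores(demo_matrix, ["r1", "r2", "r3"], ["g1", "g2"], top_n=2)
print(result)
```

The indexing `similarity_matrix[i][j]` works for nested lists as well as for the NumPy array returned by `cosine_similarity`.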

In [None]:
def rank_matches(similarity_matrix, baseline_ids, comparison_ids, top_n=1):
    """
    For each comparison case, find the top_n matching baseline items.
    Returns a dictionary mapping each comparison ID (e.g. case number) to a list of best matching baseline IDs.
    """
    matches = {}
    # For each column (comparison case) in the similarity matrix...
    for j, comp_id in enumerate(comparison_ids):
        col_sim = similarity_matrix[:, j]
        # Get indices of top_n highest similarity scores
        top_indices = col_sim.argsort()[::-1][:top_n]
        best_match_ids = [baseline_ids[i] for i in top_indices]
        matches[comp_id] = best_match_ids
    return matches

# Reviewer suggestions
Below is an implementation of the above algorithm that fetches data on grants and reviewers from our database. For each reviewer, it downloads publication titles and abstracts from OpenAlex, which are then used for the similarity ranking.

For the notebook to produce any output, you need to complete the following functions:

* def establish_database_conn()
* def fetch_members()
* def fetch_applications()



## SQL connection

In [None]:
# import SQL libraries
from sqlalchemy import URL, create_engine

def establish_database_conn():
    # Enter the connection information for your SQL database here and create `engine`
    return engine

engine = establish_database_conn()

### Fetch data about reviewers

In [None]:
def fetch_members():
    # Define `query` so that the result has a column named "Navn" holding the name of each member.
    return pd.read_sql_query(query, engine)

In [None]:
def fetch_member_publications(name):
    author = Authors().search_filter(display_name=name).get()

    if author and "id" in author[0]:  # If the author exists
        author_id = author[0]["id"]  # OpenAlex ID

        works = Works().filter(author={"id": author_id}).get(per_page=200)
        
        data = []
        
        for work in works:
            abstract = work["abstract"]
            title = work["title"]
            data.append([name, title, abstract])
        
        df = pd.DataFrame(data, columns=["Name", "Title", "Abstract"])
        df = df.dropna()
        return df
    return pd.DataFrame(columns=["Name", "Title", "Abstract"])  # Return empty DF if no publications

In [None]:
def load_publicationdata(file_path):
    if os.path.exists(file_path):
        print("Using existing publications data")
        return pd.read_excel(file_path)
        
    else:
        print("No existing publications data found. Fetching records for all members from OpenAlex")
        # Fetch all members
        members_df = fetch_members()
        
        # Loop through each member and get their publications
        all_publications = []
        
        for _, row in tqdm(members_df.iterrows(), total=len(members_df), desc="Fetching Publications", unit="member"):
            name = row["Navn"]
            publications_df = fetch_member_publications(name)
            all_publications.append(publications_df)
        
        # Combine all publication data into a single DataFrame
        final_df = pd.concat(all_publications, ignore_index=True)
        final_df.to_excel(file_path, index=False)

        return final_df

### Fetch data about applications

In [1]:
def fetch_applications():
    # Return a DataFrame with the grants or applications to be processed.
    # The script expects the following columns:
    # case_number
    # text
    
    return df

## Compare applications to reviewers

In [None]:
# Fetch reviewers
publications = load_publicationdata("reviewer_publications.xlsx")
reviewer_data = publications.apply(lambda row: {'id': row['Name'], 'text': f"{row['Title']} {row['Abstract']}"}, axis=1).tolist()

In [None]:
# Fetch applications
applications = fetch_applications()
grant_application_data = applications.apply(lambda row: {'case_number': row['case_number'], 'text': row['text']}, axis=1).tolist()

In [None]:
# Extract texts and IDs for baseline (reviewers) and comparison (grant applications)
reviewer_texts = [item['text'] for item in reviewer_data]
reviewer_ids = [item['id'] for item in reviewer_data]
grant_texts = [item['text'] for item in grant_application_data]
grant_ids = [item['case_number'] for item in grant_application_data]

# Compute similarity: rows correspond to reviewers, columns to grant applications.
similarity_matrix, _ = compute_similarity(reviewer_texts, grant_texts)

# Rank matches: for each grant application, identify the best matching reviewer(s)
matches = rank_matches(similarity_matrix, reviewer_ids, grant_ids, top_n=3)

print("Best reviewer matches for each grant application:")
print(matches)
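
Because `reviewer_data` holds one entry per publication, the same reviewer name can appear more than once among the top matches for an application. One possible way to handle this, sketched here as an assumption rather than as part of the original workflow, is to keep each reviewer's best-scoring publication before ranking. `best_score_per_reviewer` is a hypothetical helper and the scores are invented.

```python
def best_score_per_reviewer(column_scores, reviewer_ids, top_n=3):
    """Collapse publication-level scores to one score per reviewer (their best paper)."""
    best = {}
    for reviewer, score in zip(reviewer_ids, column_scores):
        # Keep the highest score seen so far for this reviewer.
        if score > best.get(reviewer, float("-inf")):
            best[reviewer] = score
    # Rank reviewers by their best score, highest first.
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Invented scores of four publications against one application.
demo_ids = ["Ann", "Ann", "Bo", "Cy"]
demo_scores = [0.9, 0.8, 0.5, 0.1]
top_reviewers = best_score_per_reviewer(demo_scores, demo_ids, top_n=2)
print(top_reviewers)
```

In the notebook, `column_scores` would correspond to one column of `similarity_matrix`, for example `similarity_matrix[:, j]` for application `j`.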

In [None]:
matches

In [None]:
df = pd.DataFrame(matches).transpose()

In [None]:
df.to_excel("matches.xlsx")