## Import packages

In [7]:
from ETLPipelines.scholartomongodb import SemanticScholarToMongoDBPipeline
from ETLPipelines.arxivtomongodb import ArxivToMongoDBPipeline
from ETLPipelines.mongodbtoabstractchunks import MongoDBPapersToAbstractChunksPipeline
from ETLPipelines.mongodbtotitlechunks import MongoDBPapersToTitleChunksPipeline
from sentence_transformers import SentenceTransformer, SimilarityFunction
from Services.papersservice import PapersService
import os
import warnings
import pandas as pd
from Services.transfomerembeddingservice import TransformerEmbeddingService
from Services.basicenglishpreprocessingservice import BasicEnglishPreprocessingService
from Services.embeddingservice import EmbeddingService
from Services.preprocessingservice import PreprocessingService
from Services.helper import Helper
from abc import ABC, abstractmethod
import numpy as np
from dotenv import load_dotenv
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.metrics import silhouette_score, pairwise_distances
warnings.simplefilter("ignore")

## Download and import nltk packages

In [18]:
import nltk
nltk.download("punkt", download_dir="./tokenizers")
nltk.download("stopwords", download_dir="./tokenizers")
nltk.download('punkt_tab', download_dir="./tokenizers")
from nltk.corpus import stopwords
nltk.data.path.append("./tokenizers")

[nltk_data] Downloading package punkt to ./tokenizers...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to ./tokenizers...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to ./tokenizers...
[nltk_data]   Package punkt_tab is already up-to-date!


## ETL: Semantic Scholar API -> Transform -> Load to MongoDB

The data is fetched by a dedicated API handler that queries the Semantic Scholar API over HTTP and returns a dictionary with selected fields.
Afterwards, the dictionary is cleaned and prepared: papers with missing abstracts or titles are removed, as are duplicate papers.
The following attributes are used:
* id
* title
* abstract
* authors
* publicationDate

The cleaned data is loaded to a MongoDB cluster.
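
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration and not the code inside `SemanticScholarToMongoDBPipeline`: the function name `clean_papers` is hypothetical, the field names simply follow the attribute list above, and duplicates are assumed to be detected by `id`.

```python
from typing import Any


def clean_papers(raw_papers: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop papers without a title or abstract and remove duplicates by id.

    Hypothetical helper: the real pipeline may detect duplicates differently.
    """
    kept_fields = ("id", "title", "abstract", "authors", "publicationDate")
    seen_ids = set()
    cleaned = []

    for paper in raw_papers:
        title = (paper.get("title") or "").strip()
        abstract = (paper.get("abstract") or "").strip()

        # Skip papers with a missing or empty title or abstract.
        if not title or not abstract:
            continue

        # Skip papers whose id has already been seen (duplicates).
        paper_id = paper.get("id")
        if paper_id in seen_ids:
            continue
        seen_ids.add(paper_id)

        # Keep only the attributes that are stored in MongoDB.
        cleaned.append({field: paper.get(field) for field in kept_fields})

    return cleaned


# Made-up records, only for illustration.
raw = [
    {"id": "a1", "title": "Care homes", "abstract": "A study.",
     "authors": ["X"], "publicationDate": "2020-01-01", "venue": "J"},
    {"id": "a1", "title": "Care homes", "abstract": "A study.",
     "authors": ["X"], "publicationDate": "2020-01-01"},
    {"id": "b2", "title": "No abstract", "abstract": None,
     "authors": [], "publicationDate": None},
]
cleaned = clean_papers(raw)
print(len(cleaned), sorted(cleaned[0].keys()))
```

Only the first record survives here: the second is a duplicate and the third has no abstract.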

In [3]:
pipeline = SemanticScholarToMongoDBPipeline("https://api.semanticscholar.org/graph/v1/paper/search/bulk", "long term care", 3000, 1500)
pipeline.execute()

Extraction started: 2024-12-10 19:35:37.449318
Extraction ended: 2024-12-10 19:35:44.112930
Transformation started: 2024-12-10 19:35:44.112930
Transformation ended: 2024-12-10 19:35:44.173905
Loading started: 2024-12-10 19:35:44.173905
Loading ended: 2024-12-10 19:35:50.469286


## ETL: arXiv API -> Transform -> Load to MongoDB

The data is fetched by a dedicated API handler that queries the arXiv API over HTTP and returns a dictionary with selected fields.
Afterwards, the dictionary is cleaned and prepared: papers with missing abstracts or titles are removed, as are duplicate papers.
The following attributes are used:
* id
* title
* abstract
* authors
* publicationDate

The cleaned data is loaded to a MongoDB cluster.
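
The arXiv API responds with an Atom XML feed, so the handler presumably has to map each feed entry onto the attributes listed above. The sketch below shows one possible mapping using only the standard library. The function names and the sample feed (including its placeholder id and authors) are made up for illustration, and the actual handler inside `ArxivToMongoDBPipeline` may work differently.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up feed in the shape of an arXiv API response (placeholder values).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An illustrative
      title</title>
    <summary>An illustrative abstract.</summary>
    <published>2024-01-01T00:00:00Z</published>
    <author><name>Jane Doe</name></author>
    <author><name>John Doe</name></author>
  </entry>
</feed>"""


def entry_text(entry: ET.Element, tag: str) -> str:
    """Return the text of a child element with whitespace collapsed."""
    raw = entry.findtext(f"atom:{tag}", default="", namespaces=ATOM_NS)
    return " ".join(raw.split())


def parse_atom_feed(xml_text: str) -> list[dict]:
    """Map every Atom entry onto the attributes used in this notebook."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        authors = [
            author.findtext("atom:name", default="", namespaces=ATOM_NS)
            for author in entry.findall("atom:author", ATOM_NS)
        ]
        papers.append({
            "id": entry_text(entry, "id"),
            "title": entry_text(entry, "title"),
            # arXiv stores the abstract in the <summary> element.
            "abstract": entry_text(entry, "summary"),
            "authors": authors,
            "publicationDate": entry_text(entry, "published"),
        })
    return papers


papers_from_feed = parse_atom_feed(SAMPLE_FEED)
print(papers_from_feed[0]["title"])
```

Collapsing whitespace matters because titles and abstracts in the feed can contain line breaks.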

In [4]:
pipeline = ArxivToMongoDBPipeline("http://export.arxiv.org/api/query", "long+term+care", 6000, 100)
pipeline.execute()

Extraction started: 2024-12-10 19:35:54.759683
Extraction ended: 2024-12-10 19:37:32.197517
Transformation started: 2024-12-10 19:37:32.197517
Transformation ended: 2024-12-10 19:37:32.492564
Loading started: 2024-12-10 19:37:32.492564
Loading ended: 2024-12-10 19:37:38.883819


## Model comparison

In this section, several sentence transformers that generate embeddings of different lengths are compared.
For this purpose, embeddings are generated with each model and clustered using different clustering algorithms.
Afterwards, the silhouette score of the resulting clusters is calculated.
The silhouette score is the average of the following value over all points: (b - a) / max(a, b). <br>
b ... average distance from the point to the points of the nearest other cluster<br>
a ... average distance from the point to the other points within its own cluster<br>
Silhouette score near 0 = the point lies between clusters, i.e. the clusters overlap or are close to each other <br>
Silhouette score near 1 = the point is in a cluster that is clearly separated from the other clusters <br>
Silhouette score near -1 = the point is closer to another cluster than to its own, i.e. it has probably been assigned to the wrong cluster <br>
The model with the best silhouette score will be chosen.
The abstract sentences are used to generate the embeddings.
The following models are compared (detailed explanation in slides): 
* all-MiniLM-L6-v2
* paraphrase-MiniLM-L3-v2
* bert-base-nli-mean-tokens
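
To make the formula (b - a) / max(a, b) concrete, the following sketch computes the silhouette score by hand for four toy points, once with a sensible labelling and once with a deliberately bad one. It is only an illustration with made-up data and Euclidean distance. The notebook itself uses `silhouette_score` from scikit-learn.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette(points, labels):
    """Mean silhouette coefficient computed directly from the definition.

    Requires at least two distinct cluster labels.
    """
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(euclidean(p, q))

        own = by_cluster.pop(label, [])
        if not own:
            # Singleton cluster: the coefficient is defined as 0.
            scores.append(0.0)
            continue

        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette(points, [0, 1, 0, 1])   # pairs split across clusters
print(round(good, 3), round(bad, 3))
```

The well-separated labelling scores close to 1, while the labelling that splits each pair scores below 0.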

In [5]:
# Helper functions
class ClusteringMetric(ABC):
    """Represents an abstract clustering metric.

    Args:
        ABC (_type_): The abstract base class.
    """
    @abstractmethod
    def get_labels(self, vectors_list: list) -> np.array:
        """Calculates the labels using a clustering method
        for the given list of vectors.

        Args:
            vectors_list (list): A list of vectors.

        Returns:
            np.array: Array of the labels for the given vectors.
        """
        pass

class KMeansClusteringMetric(ClusteringMetric):
    """Represents the K-Means clustering metric.

    Args:
        ClusteringMetric (_type_): Base abstract class.
    """
    def __init__(self, k: int, random_state: int):
        """Initializes a new instance of KMeansClusteringMetric.

        Args:
            k (int): The amount of clusters.
            random_state (int): The random state.

        Raises:
            ValueError: Is thrown if the amount of clusters is 0 or less.
            ValueError: Is thrown if the random state is negative.
        """
        Helper.ensure_type(k, int, "k must be of type int!")
        Helper.ensure_type(random_state, int, "random_state must be an int!")

        if k <= 0:
            raise ValueError("k must be greater than 0!")

        if random_state < 0:
            raise ValueError("random_state cannot be negative!")
        
        self.k = k
        self.random_state = random_state
        
    def get_labels(self, vectors_list: list) -> np.array:
        """Calculates the labels for the given vectors 
        using K-Means.

        Args:
            vectors_list (list): A list of vectors. 

        Returns:
            np.array: Array of the labels for the given vectors.
        """
        Helper.ensure_type(vectors_list, list, "vectors_list must be a list!")

        kmeans = KMeans(n_clusters=self.k, random_state=self.random_state)
        labels = kmeans.fit_predict(vectors_list)
        return labels

class AffinityClusteringMetric(ClusteringMetric):
    """Represents the clustering metric that uses affinity propagation.

    Args:
        ClusteringMetric (_type_):  Base abstract class.
    """
    def __init__(self, metric: str, affinity: str, random_state: int):
        """Initializes a new instance of AffinityClusteringMetric.

        Args:
            metric (str): The metric used to calculate distances between vectors.
            affinity (str): The metric used for affinity calculation.
            random_state (int): The random state.

        Raises:
            ValueError: Is thrown if the random state is negative.
        """
        Helper.ensure_type(metric, str, "metric must be of type str!")
        Helper.ensure_type(affinity, str, "affinity must be of type str!")
        Helper.ensure_type(random_state, int, "random_state must be an int!")

        if random_state < 0:
            raise ValueError("random_state cannot be negative!")
        
        self.metric = metric
        self.affinity = affinity
        self.random_state = random_state
        
    def get_labels(self, vectors_list: list) -> np.array:
        """Calculates the labels for the given vectors 
        using the affinity propagation.

        Args:
            vectors_list (list): A list of vectors. 

        Returns:
            np.array: Array of the labels for the given vectors.
        """
        Helper.ensure_type(vectors_list, list, "vectors_list must be a list!")

        similarity_matrix = 1 - pairwise_distances(vectors_list, metric=self.metric)
        affinity_propagation = AffinityPropagation(affinity=self.affinity, random_state=self.random_state)
        affinity_propagation.fit(similarity_matrix)
        labels = affinity_propagation.labels_
        return labels

def generate_embeddings(input_list: list[str], embedding_service: EmbeddingService, preprocessing_service: PreprocessingService = None) -> list:
    """Generates embeddings for strings and 
    also includes possible preprocessing.

    Args:
        input_list (list[str]): The input list of strings.
        embedding_service (EmbeddingService): The embedding service.
        preprocessing_service (PreprocessingService, optional): The preprocessing service. Defaults to None.

    Returns:
        list: List of vector embeddings.
    """
    Helper.ensure_list_of_type(input_list, str, "input_list must be a list!", "input_list must contain only string!")

    if preprocessing_service is not None:
        Helper.ensure_instance(preprocessing_service, PreprocessingService, "preprocessing_service must be an instance of PreprocessingService!")

    embeddings = []

    for el in input_list:
        to_embed = el

        if preprocessing_service is not None:
            to_embed = preprocessing_service.preprocess(el)

        embedding = embedding_service.create_embedding(to_embed)
        embeddings.append(embedding)

    return embeddings

def get_silhouette_scores(input_list: list[str], transformer_name_dict: dict, clustering_metric: ClusteringMetric, preprocessing_service: PreprocessingService = None) -> pd.DataFrame:
    """Calculates silhouette scores for different sentence
    transformers.

    Args:
        input_list (list[str]): A list of strings.
        transformer_name_dict (dict): A dictionary containing transformer names as keys
        and its embedding services as values.
        clustering_metric (ClusteringMetric): Clustering metric used to cluster.
        preprocessing_service (PreprocessingService, optional): Preprocessing service. Defaults to None.

    Returns:
        pd.DataFrame: A dataframe with sentence transformer model names as attributes
        and silhouette scores as values.
    """
    Helper.ensure_list_of_type(input_list, str, "input_list must be a list!", "input_list must contain only string!")
    Helper.ensure_type(transformer_name_dict, dict, "transformer_name_dict must be a dict!")

    if preprocessing_service is not None:
        Helper.ensure_instance(preprocessing_service, PreprocessingService, "preprocessing_service must be an instance of PreprocessingService!")

    result = dict()

    for model_name in transformer_name_dict.keys():
        result[model_name] = []
        embedder = transformer_name_dict[model_name]
        embeddings = generate_embeddings(input_list, embedder, preprocessing_service)
        labels = clustering_metric.get_labels(embeddings)
        result[model_name].append(silhouette_score(embeddings, labels))

    return pd.DataFrame(result)

Calculate the data frame of silhouette scores for the different clustering methods.

In [29]:
models_to_services = {
    "all-MiniLM-L6-v2" : TransformerEmbeddingService(SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.COSINE)),
    "paraphrase-MiniLM-L3-v2": TransformerEmbeddingService(SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L3-v2", similarity_fn_name=SimilarityFunction.COSINE)),
    "bert-base-nli-mean-tokens": TransformerEmbeddingService(SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens", similarity_fn_name=SimilarityFunction.COSINE))
}

load_dotenv()
url = os.getenv("MONGODB_URL")
db_service = PapersService(url)
papers = db_service.get_papers({})
abstract_sentences = Helper.concat_lists([nltk.sent_tokenize(el["abstract"], language="english") for el in papers])
sws = list(set(stopwords.words('english')))
preprocessing_service = BasicEnglishPreprocessingService(sws)

Take only the first 20,000 sentences to avoid memory problems.

In [30]:
abstract_sentences = abstract_sentences[:20000]

Get the scores for affinity propagation clustering.

In [32]:
metric = AffinityClusteringMetric("cosine", "precomputed", 42)
scores_frame = get_silhouette_scores(abstract_sentences, models_to_services, metric, preprocessing_service)

In [33]:
scores_frame

|   | all-MiniLM-L6-v2 | paraphrase-MiniLM-L3-v2 | bert-base-nli-mean-tokens |
|---|---|---|---|
| 0 | 0.045169 | 0.017071 | 0.024869 |


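Get the scores for K-Means clustering. The number of clusters k is chosen so that each cluster contains about 10 sentences on average, with the aim of grouping sentences of similar meaning.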

In [34]:
k = len(abstract_sentences) // 10
metric_2 = KMeansClusteringMetric(k, 42)
scores_frame_2 = get_silhouette_scores(abstract_sentences, models_to_services,  metric_2, preprocessing_service)

In [35]:
scores_frame_2

|   | all-MiniLM-L6-v2 | paraphrase-MiniLM-L3-v2 | bert-base-nli-mean-tokens |
|---|---|---|---|
| 0 | 0.048464 | 0.034965 | 0.036298 |


**Conclusion:** Both clustering methods, affinity propagation and K-Means, yield similar silhouette scores.
All of them are near 0, which means that the clusters overlap or that points are about equally close to several clusters.
The silhouette score is therefore not a reliable estimator here of whether the clustering is suitable.
However, in both cases (affinity propagation and K-Means), the sentence transformer all-MiniLM-L6-v2
achieves the highest score. Therefore, all-MiniLM-L6-v2 is chosen as the model for this project.

## ETL: MongoDB -> Transform to abstract chunks with embeddings -> Load to MongoDB

The paper abstracts are split into chunks (single sentences, in this case).
The sentences are lowercased and stopwords are removed from them.
Afterwards, an embedding is generated for each sentence.
The embeddings are stored in MongoDB again (collection abstractChunks).
abstractChunks has the following attributes:
* paperId
* chunk
* chunkVector
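
The transformation described above can be sketched as follows. This is a simplified stand-in for `MongoDBPapersToAbstractChunksPipeline`, not its actual code: the regex sentence splitter replaces `nltk.sent_tokenize`, the short stopword set replaces NLTK's English list, `fake_embedding` is a placeholder for the sentence transformer, and storing the preprocessed text in `chunk` is an assumption.

```python
import re

# Assumption: a tiny hard-coded stopword set stands in for NLTK's English list.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "for"}


def split_sentences(text):
    """Naive splitter for illustration; the notebook uses nltk.sent_tokenize."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def preprocess(sentence):
    """Lowercase the sentence and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def fake_embedding(text, dim=4):
    """Placeholder vector, NOT a real sentence embedding."""
    vector = [0.0] * dim
    for i, ch in enumerate(text):
        vector[i % dim] += ord(ch)
    return vector


def paper_to_chunks(paper):
    """Turn one paper into documents shaped like the abstractChunks collection."""
    chunks = []
    for sentence in split_sentences(paper["abstract"]):
        cleaned = preprocess(sentence)
        if cleaned:
            chunks.append({
                "paperId": paper["id"],
                "chunk": cleaned,
                "chunkVector": fake_embedding(cleaned),
            })
    return chunks


# Made-up paper, only for illustration.
paper = {"id": "p1", "abstract": "Long-term care is expensive. The costs are rising in Europe!"}
chunks = paper_to_chunks(paper)
print([c["chunk"] for c in chunks])
```

Each sentence becomes its own document, so a single paper usually yields several entries in the collection, all sharing the same `paperId`.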

In [27]:
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.COSINE)
embedding_service = TransformerEmbeddingService(embedder)
sws = list(set(stopwords.words('english')))
preprocessing_service = BasicEnglishPreprocessingService(sws)
pipeline = MongoDBPapersToAbstractChunksPipeline(embedding_service, preprocessing_service, 10000, "../tokenizers")
pipeline.execute()

Extraction started: 2024-12-10 20:18:09.586992
Extraction ended: 2024-12-10 20:18:11.931790
Transformation started: 2024-12-10 20:18:11.931790
Transformation ended: 2024-12-10 20:25:51.090132
Loading started: 2024-12-10 20:25:51.090132
Loading ended: 2024-12-10 20:29:12.903062


## ETL: MongoDB -> Transform to title chunks with embeddings -> Load to MongoDB

The paper titles are split into chunks (single sentences, in this case).
The sentences are lowercased and stopwords are removed from them.
Afterwards, an embedding is generated for each sentence.
The embeddings are stored in MongoDB again (collection titleChunks).
titleChunks has the following attributes:
* paperId
* chunk
* chunkVector

In [28]:
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.COSINE)
embedding_service = TransformerEmbeddingService(embedder)
sws = list(set(stopwords.words('english')))
preprocessing_service = BasicEnglishPreprocessingService(sws)
pipeline = MongoDBPapersToTitleChunksPipeline(embedding_service, preprocessing_service, 10000, "../tokenizers")
pipeline.execute()

Extraction started: 2024-12-10 20:29:14.679990
Extraction ended: 2024-12-10 20:29:16.637376
Transformation started: 2024-12-10 20:29:16.637376
Transformation ended: 2024-12-10 20:30:14.958559
Loading started: 2024-12-10 20:30:14.958559
Loading ended: 2024-12-10 20:30:41.446858
