# Pipeline Framework 
This notebook illustrates the pipeline framework of our project. The pipeline has five steps:
1. Split the text and the candidate summary into two lists of sentences.
2. Convert those lists of sentences into embedding matrices.
3. Calculate the cosine similarity between summary sentences and text sentences from their embeddings, then find the indices of the top k related text sentences for each summary sentence.
4. Check, with the help of LLMs, whether each summary sentence can be obtained from its related text sentences.
5. Combine the steps above to score the summary.

The pipeline is a toy model with room for improvement. For example, we could check whether the dependency arcs or named entities in a summary sentence can be obtained from the related sentences in the original text with the help of LLMs.

In [None]:
import stanza
import numpy as np
from sentence_transformers import SentenceTransformer

We need to import some packages and initialize some tools in advance. 
1. `nlp` is a tool for splitting text into sentences.
2. `model` is a tool for converting sentences to embeddings.

In [None]:
model = SentenceTransformer('bert-base-nli-mean-tokens')
nlp = stanza.Pipeline(lang='en')   

## Step one: Split text and candidate summary into two lists of sentences.
We use `nlp` to split text and summaries into sentences. This will help us to check if the sentence from the summary can be obtained from specific sentences from the text.
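
To make the expected output of this step concrete, here is a dependency-free stand-in that can be used for quick experiments before the Stanza models are downloaded. It is only a sketch: `naive_split` is a hypothetical helper, not part of the project, and its regular expression is far cruder than a real sentence splitter (it will break on abbreviations such as "e.g.").

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical stand-in: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # drop the empty string that re.split returns for empty input
    return [p for p in parts if p]


sentences = naive_split("The cat sat on the mat. It was warm! Was it asleep?")
print(sentences)
```

Like the Stanza-based splitter, it takes one string and returns a list of sentence strings.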

In [None]:
def split_text(text:str)->list:
    """
    Split text into sentences
    Args:
        text: the text to be split

    Returns:
        a list of sentences
    """
    doc = nlp(text)
    return [sentence.text for sentence in doc.sentences]

## Step two: Convert those lists of sentences to embedding matrix.
We use `model` to convert sentences to embeddings. The output is an `np.ndarray` matrix in which each row is the embedding of one sentence.

In [None]:
def sentence2embedding(sentences:list[str])->np.ndarray:
    """
    Convert sentences to embeddings
    Args:
        sentences: a list of sentences

    Returns:
        a matrix of embeddings, each row is an embedding
    """
    embeddings = model.encode(sentences)
    return embeddings

## Step three: Get the most related sentences from the original text for each sentence in the summary.
- We use cosine similarity to measure how similar each summary sentence is to each text sentence. The output is a matrix of type `np.ndarray`.
- If the original text has $M$ sentences and the summary has $N$ sentences, the output matrix has shape $N\times M$.
- The `[i,j]` element of the matrix is the cosine similarity between the $i$-th sentence in the summary and the $j$-th sentence in the original text.
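
The shape and indexing convention can be checked by hand on a tiny example. The sketch below uses plain Python lists and made-up 2-dimensional vectors instead of real embeddings; the helper name `cosine` and the toy vectors are assumptions for illustration only.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the two norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# toy "embeddings": N = 2 summary sentences, M = 3 text sentences
summary_vecs = [[1.0, 0.0], [1.0, 1.0]]
text_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# sim[i][j] compares summary sentence i with text sentence j, giving N rows and M columns
sim = [[cosine(s, t) for t in text_vecs] for s in summary_vecs]
print(len(sim), len(sim[0]))
```

Identical directions score about 1, orthogonal directions score 0, and the matrix has one row per summary sentence.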

In [1]:
def cosine_similarity(embed_text:np.ndarray, embed_summary: np.ndarray)->np.ndarray:
    """
    Calculate the cosine similarities between sentences of summary and sentences of text
    Args:
        embed_text: embedding matrix of text sentences
                    each row is an embedding
        embed_summary: embedding matrix of summary sentences
                    each row is an embedding

    Returns:
        a matrix of cosine similarities
    """
    
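    dot_prod = embed_summary @ embed_text.T # [i,j] is the dot product of summary sentence i and text sentence j
    # outer product of the two norm vectors gives an N x M matrix; `@` on two 1-D vectors would give a scalar or fail when N != M
    norm = np.outer(np.linalg.norm(embed_summary, axis=1), np.linalg.norm(embed_text, axis=1)) # [i,j] is the product of the norms of summary sentence i and text sentence j
    return dot_prod / norm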


Next, we find the indices of the top k most related text sentences for each summary sentence. The selected sentences are used in the LLM prompt to check whether the summary sentence can be obtained from the text.
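
As a small worked example of top-k selection, the sketch below sorts the column indices of each row by similarity and keeps the last k, which is the same idea as an ascending argsort followed by slicing. The helper `topk_indices` and the similarity values are made up for illustration.

```python
def topk_indices(row: list[float], k: int) -> list[int]:
    """Return the indices of the k largest values, ordered from least to most similar."""
    order = sorted(range(len(row)), key=lambda j: row[j])  # ascending similarity
    return order[-k:]


# toy similarity matrix: 2 summary sentences (rows) x 4 text sentences (columns)
sim_matrix = [
    [0.10, 0.90, 0.30, 0.70],
    [0.80, 0.20, 0.60, 0.40],
]

topk = [topk_indices(row, k=2) for row in sim_matrix]
print(topk)
```

Note that the most similar sentence comes last in each row, so reverse the row if the prompt should list the best match first.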

In [2]:
def topk_related(sim_matrix:np.ndarray, k:int)->np.ndarray:
    """
    Find the indices of top k related sentences in text for each sentence in summary
    Args:
        sim_matrix: cosine similarity matrix
        k: number of sentences to be selected

    Returns:
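        a matrix of indices with one row per summary sentence;
        each row lists k text-sentence indices in ascending order of similarity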
    """
    return sim_matrix.argsort(axis=1)[:, -k:]


## Step four: Check if the sentence from the summary can be obtained from the sentence from the text with the help of LLMs.
For each sentence in the summary, check if it can be obtained from the top k related sentences in the text.
1. If yes, return True
2. Otherwise, return False.

Meanwhile, we can also return the probability that the sentence from the summary can be obtained from the sentence from the text.

Currently we consider factuality only at the sentence level.

This part will employ LLMs and [Guidance](https://github.com/guidance-ai/guidance) to check if the sentence from the summary can be obtained from the sentence from the text.
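
The LLM call itself is left open in this notebook, but the surrounding plumbing can be sketched without it. The example below assumes a hypothetical prompt template (`build_prompt`) and a hypothetical helper (`decide`) that turns a probability, however it is obtained from the model, into the `(bool, float)` pair described above. Neither the wording of the prompt nor these function names come from the project or from Guidance, and treating a probability of exactly 0.5 as False is an assumption.

```python
def build_prompt(sens_text: list[str], sen_summary: str) -> str:
    """Hypothetical template: number the retrieved sentences and ask a yes/no question."""
    evidence = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sens_text))
    return (
        "Source sentences:\n"
        f"{evidence}\n\n"
        f"Claim: {sen_summary}\n"
        "Can the claim be inferred from the source sentences? Answer Yes or No."
    )


def decide(prob: float, threshold: float = 0.5) -> tuple[bool, float]:
    """Map the probability of 'Yes' to a verdict; exactly 0.5 counts as False (assumption)."""
    return (prob > threshold, prob)


prompt = build_prompt(["Paris is the capital of France."], "France's capital is Paris.")
print(prompt)
print(decide(0.8))
```

A real checker would send `prompt` to the model, read off the probability of a "Yes" answer, and pass it to `decide`.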

In [None]:
def checker(sens_text:list[str], sen_summary:str)->tuple[bool, float]:
    """
    Check if the sentence from the summary can be obtained from the sentences from the text.
    Args:
        sens_text: list of sentences from the text
        sen_summary: the sentence from the summary

    Returns:
        a tuple of (bool, float)
        bool: True if the sentence from the summary can be obtained from the sentence from the text
        float: the probability that the sentence from the summary can be obtained from the sentence from the text
            True: >0.5
            False: <0.5
    """
    
    # to be completed
    
    
    
    res = ____
    prob = ____
    
    return (res, prob)

## Step five: Evaluate the quality of the summary (Combine the above steps).
We combine the above steps to evaluate the quality of the summary. 

The score is the fraction of summary sentences that the checker judges to be obtainable from the text, so it lies between 0 and 1; higher is better.
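
The aggregation can be illustrated in isolation from the rest of the pipeline. In the sketch below, `support_ratio` is a hypothetical helper and the verdict list is made up; returning 0.0 for an empty summary is an assumption, since the notebook does not say how that case should be scored.

```python
def support_ratio(verdicts: list[bool]) -> float:
    """Fraction of summary sentences whose checker verdict is True."""
    if not verdicts:
        return 0.0  # assumption: an empty summary gets the lowest score
    return sum(verdicts) / len(verdicts)


# made-up checker verdicts for a four-sentence summary
score = support_ratio([True, False, True, True])
print(score)
```

With three of four sentences supported, the score is 0.75.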

In [None]:
def evaluate(text:str, summary:str, k:int)->float:
    """
    evaluate the quality of the summary according to the given text
    Args:
        text: original text
        summary: summary to be evaluated
        k: number of sentences to be selected from the text

    Returns:
        a float number between 0 and 1, the higher the better
    """
    
    # split the text into sentences
    sens_text = split_text(text)
    # split the summary into sentences
    sens_summary = split_text(summary)
    
    # convert sentences to embeddings
    embed_text = sentence2embedding(sens_text)
    embed_summary = sentence2embedding(sens_summary)
    
    # calculate cosine similarity
    sim_matrix = cosine_similarity(embed_text, embed_summary)
    
    # find top k related sentences
    topk = topk_related(sim_matrix, k)
    
    # check if the sentence from the summary can be obtained from the sentence from the text
    denominator = 0
    numerator = 0
    for idx, sen in enumerate(sens_summary):
        sens_text_selected = [sens_text[i] for i in topk[idx]]
        res, _ = checker(sens_text_selected, sen)
        if res:
            numerator += 1
        denominator += 1
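    # assumption: an empty summary scores 0.0 instead of raising ZeroDivisionError
    return numerator / denominator if denominator else 0.0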