# Pipeline Framework
This notebook illustrates the pipeline framework of our project. The pipeline can be divided into 5 steps:
1. Split the text and the candidate summary into two lists of sentences.
2. Convert each list of sentences into an embedding matrix.
3. Calculate the cosine similarity between summary sentences and text sentences based on their embeddings, then find the indices of the top k related text sentences for each summary sentence.
4. Use an LLM to check whether each summary sentence can be obtained from its related text sentences.
5. Combine the above steps into a single score for the summary.

The pipeline is only a toy model and leaves room for improvement. For example, we could use LLMs to check whether the dependency arcs or named entities in a summary sentence can be obtained from the related sentences in the original text.

In [6]:
import os

import nltk
import numpy as np
import pandas as pd
import stanza
from dotenv import load_dotenv
from nltk.tokenize import sent_tokenize
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# sentence tokenizer data required by sent_tokenize
nltk.download('punkt')


We need to import some packages and initialize some tools in advance.
1. `sent_tokenize` (from NLTK) splits text into sentences.
2. `model` converts sentences to embeddings.
3. `text_data` is the dataset loaded from `final_version_cropped_first1000.csv`.

In [9]:
text_data = pd.read_csv('final_version_cropped_first1000.csv')

In [8]:
model = SentenceTransformer('all-mpnet-base-v2')


## Step one: Split text and candidate summary into two lists of sentences.
We use `sent_tokenize` from NLTK to split the text and the summary into sentences. This lets us check whether each summary sentence can be obtained from specific sentences in the text.

In [10]:
def split_text(text:str)->list:
    """
    Split text into sentences
    Args:
        text: the text to be split

    Returns:
        a list of sentences
    """
    sentence_list = sent_tokenize(text)
    return sentence_list

## Step two: Convert those lists of sentences to embedding matrices.
We use `model` to convert sentences to embeddings. The output is a matrix of type `np.ndarray` in which each row is the embedding of one sentence (768-dimensional for `all-mpnet-base-v2`).

In [11]:
def sentence2embedding(sentences:list[str])->np.ndarray:
    """
    Convert sentences to embeddings
    Args:
        sentences: a list of sentences

    Returns:
        a matrix of embeddings, each row is an embedding
    """
    embeddings = model.encode(sentences)
    return embeddings

## Step three: Get the most related sentences from the original text for each sentence in the summary.
- We use cosine similarity to measure how related each summary sentence is to each text sentence. The output is a matrix of type `np.ndarray`.
- If there are $M$ sentences in the original text and $N$ sentences in the summary, the output matrix has shape $N\times M$.
- The `[i,j]` element of the matrix is the cosine similarity between the $i$-th sentence in the summary and the $j$-th sentence in the original text.

In [12]:
def cosine_similarity(embed_text:np.ndarray, embed_summary: np.ndarray)->np.ndarray:
    """
    Calculate the cosine similarities between sentences of summary and sentences of text
    Args:
        embed_text: embedding matrix of text sentences
                    each row is an embedding
        embed_summary: embedding matrix of summary sentences
                    each row is an embedding

    Returns:
        a matrix of cosine similarities of shape (N, M), where N is the number
        of summary sentences and M is the number of text sentences
    """

    # [i,j] is the dot product of summary sentence i and text sentence j
    dot_prod = embed_summary @ embed_text.T
    # [i,j] is the product of the norms of summary sentence i and text sentence j
    summary_norms = np.linalg.norm(embed_summary, axis=1, keepdims=True)
    text_norms = np.linalg.norm(embed_text, axis=1, keepdims=True)
    norm = summary_norms @ text_norms.T
    return dot_prod / norm
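
To make the shape convention concrete, the sketch below recomputes the same quantity on hand-made 2-D vectors. It normalizes each row to unit length first, which is an equivalent way to write the formula. The function name `cosine_similarity_normalized` and the toy vectors are assumptions made for illustration and are not part of the pipeline.

```python
import numpy as np

def cosine_similarity_normalized(embed_text: np.ndarray, embed_summary: np.ndarray) -> np.ndarray:
    """Cosine similarities computed by normalizing rows first (illustrative variant)."""
    # Scale every row to unit length; the dot product of unit vectors is their cosine.
    text_unit = embed_text / np.linalg.norm(embed_text, axis=1, keepdims=True)
    summary_unit = embed_summary / np.linalg.norm(embed_summary, axis=1, keepdims=True)
    return summary_unit @ text_unit.T  # shape: (N summary sentences, M text sentences)

# Toy embeddings: M = 3 text sentences, N = 2 summary sentences
toy_text = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_summary = np.array([[3.0, 0.0], [0.0, 1.0]])

toy_sim = cosine_similarity_normalized(toy_text, toy_summary)
print(toy_sim.shape)  # (2, 3)
print(np.round(toy_sim, 3))
```

Vector length does not matter here: the first summary vector `[3, 0]` and the first text vector `[1, 0]` point in the same direction, so their similarity is 1.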

Next, we find the indices of the top k related text sentences for each summary sentence. The selected sentences from the original text are placed in the LLM prompt that checks whether the summary sentence can be obtained from the text.

In [13]:
def topk_related(sim_matrix:np.ndarray, k:int)->np.ndarray:
    """
    Find the indices of top k related sentences in text for each sentence in summary
    Args:
        sim_matrix: cosine similarity matrix
        k: number of sentences to be selected

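    Returns:
        a matrix of indices of shape (N, k); each row is sorted by increasing
        similarity, so the most related sentence comes last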
    """
    return sim_matrix.argsort(axis=1)[:, -k:]
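
`argsort` sorts in ascending order, so each row of the result lists the selected indices from least to most similar, and the best match sits in the last column. The small example below uses a made-up similarity matrix (the values are invented for illustration) and also shows an optional reversal for callers who prefer best-first order.

```python
import numpy as np

# Made-up similarity matrix: 2 summary sentences x 4 text sentences
toy_sim = np.array([
    [0.10, 0.90, 0.30, 0.50],
    [0.80, 0.20, 0.60, 0.40],
])

k = 2
top_ascending = toy_sim.argsort(axis=1)[:, -k:]  # same expression as in topk_related
top_descending = top_ascending[:, ::-1]          # optional: best match first

print(top_ascending.tolist())   # [[3, 1], [2, 0]]
print(top_descending.tolist())  # [[1, 3], [0, 2]]
```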

## Step four: Check if the sentence from the summary can be obtained from the sentence from the text with the help of LLMs.
For each sentence in the summary, check whether it can be obtained from the top k related sentences in the text.
1. If yes, return True.
2. Otherwise, return False.

We can also return the probability that the summary sentence can be obtained from the selected text sentences.

Currently, we only consider factuality at the sentence level.

This part will employ LLMs and [Guidance](https://github.com/guidance-ai/guidance) to check if the sentence from the summary can be obtained from the sentence from the text.

In [None]:
def checker(sens_text:list[str], sen_summary:str)->tuple[bool, float]:
    """
    Check if the sentence from the summary can be obtained from the sentences from the text.
    Args:
        sens_text: list of sentences from the text
        sen_summary: the sentence from the summary

    Returns:
        a tuple of (bool, float)
        bool: True if the sentence from the summary can be obtained from the sentences from the text
        float: the probability that the sentence from the summary can be obtained from the sentences from the text
            True: >0.5
            False: <0.5
    """
    load_dotenv()  # read OPENAI_API_KEY from a local .env file, if one exists
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # join with a space so that adjacent sentences do not run together
    source_text = ' '.join(sens_text)

    prompt = f"""
As a compliance officer at a financial institution, you're tasked with evaluating the accuracy of a summary sentence based on its alignment with source sentences from a financial document. Consider the following criteria carefully:

1. The summary accurately reflects the content of the source sentences, especially numerical information.
2. All named entities in the summary are present in the source sentences.
3. Relationships between entities in the summary are consistent with those in the source sentences.
4. The directional flow of relationships among named entities matches between the summary and source sentences.
5. There are no factual discrepancies between the summary and source sentences.
6. The summary does not introduce any entities not found in the source sentences.

Your job is to determine if the summary adheres to these criteria. Answer "Yes" if it does, or "No" if it doesn't.

Summary sentence: ```{sen_summary}```

Source sentences: ```{source_text}```

Final Answer (Yes/No only):
"""

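    response = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': "user", 'content': prompt}],
        max_tokens=1,
        temperature=0,  # always take the most likely answer token
        logprobs=True,  # request token log-probabilities for the confidence value
    )

    choice = response.choices[0]
    # chat completions return the text in `message.content`, not in `text`
    answer = (choice.message.content or '').strip().lower()
    is_supported = answer.startswith('y')

    # probability of the generated token; for a "No" answer, 1 - p approximates P("Yes")
    token_prob = float(np.exp(choice.logprobs.content[0].logprob))
    return is_supported, (token_prob if is_supported else 1.0 - token_prob)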
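
The docstring promises a `(bool, float)` pair. One possible way to derive it from a one-token answer and that token's log-probability is sketched below. The helper `verdict_from_answer` is hypothetical and not part of the notebook. It assumes the model replied with "Yes" or "No", and it treats `1 - p` as the probability of "Yes" when the answer is "No", which ignores any probability mass on other tokens.

```python
import math

def verdict_from_answer(answer: str, logprob: float) -> tuple[bool, float]:
    """Map a Yes/No answer and its token log-probability to (verdict, P(yes)).

    Hypothetical helper for illustration only.
    """
    normalized = answer.strip().lower()
    if not normalized.startswith(("y", "n")):
        raise ValueError(f"Expected a Yes/No answer, got {answer!r}")

    token_prob = math.exp(logprob)  # probability of the token that was generated
    is_supported = normalized.startswith("y")
    # Approximation: all remaining probability mass is assigned to the other answer.
    prob_yes = token_prob if is_supported else 1.0 - token_prob
    return is_supported, prob_yes

print(verdict_from_answer("Yes", math.log(0.9)))
print(verdict_from_answer(" no", math.log(0.8)))
```

Raising on an unexpected answer, rather than silently returning False, keeps malformed model output from being counted as a failed fact check.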

## Step five: Evaluate the quality of the summary (Combine the above steps).
We combine the above steps into a single function that evaluates the quality of the summary.

The score is the fraction of summary sentences that pass the check, so it lies between 0 and 1, and higher is better.
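
Written out, with $N$ sentences in the summary and $c_i \in \{0, 1\}$ denoting the checker's verdict for the $i$-th summary sentence (the symbol $c_i$ is introduced here only for illustration):

$$
\text{score} = \frac{1}{N} \sum_{i=1}^{N} c_i
$$

For example, if 3 of 4 summary sentences pass the check, the score is $3/4 = 0.75$.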

In [None]:
def evaluate(text:str, summary:str, k:int)->float:
    """
    evaluate the quality of the summary according to the given text
    Args:
        text: original text
        summary: summary to be evaluated
        k: number of sentences to be selected from the text

    Returns:
        a float number between 0 and 1, the higher the better
    """

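    # split the text and the summary into sentences
    sens_text = split_text(text)
    sens_summary = split_text(summary)
    if not sens_text or not sens_summary:
        # nothing to compare; returning 0.0 here is a design choice that avoids
        # empty embedding matrices and a division by zero below
        return 0.0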

    # convert sentences to embeddings
    embed_text = sentence2embedding(sens_text)
    embed_summary = sentence2embedding(sens_summary)

    # calculate cosine similarity
    sim_matrix = cosine_similarity(embed_text, embed_summary)

    # find top k related sentences
    topk = topk_related(sim_matrix, k)

    # count the summary sentences that the checker judges to be supported by the text
    supported = 0
    for idx, sen in enumerate(sens_summary):
        sens_text_selected = [sens_text[i] for i in topk[idx]]
        res, _ = checker(sens_text_selected, sen)
        if res:
            supported += 1
    return supported / len(sens_summary)
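
Because `evaluate` calls an embedding model and a paid API, it can be handy to exercise the control flow offline first. The sketch below mirrors the same steps with stand-ins: bag-of-words count vectors replace the sentence embeddings, and a word-overlap rule replaces the LLM check. Every name here (`bow_embed`, `overlap_checker`, `evaluate_offline`) and both stand-in techniques are assumptions made for illustration. The scores they produce say nothing about what the real pipeline would return.

```python
import numpy as np

def bow_embed(sentences: list[str], vocab: list[str]) -> np.ndarray:
    """Stand-in for sentence embeddings: one word-count vector per sentence."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(sentences), len(vocab)))
    for row, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            matrix[row, index[word]] += 1
    return matrix

def overlap_checker(sens_text: list[str], sen_summary: str) -> bool:
    """Stand-in for the LLM check: every summary word must occur in the selected text."""
    source_words = set(" ".join(sens_text).lower().split())
    return set(sen_summary.lower().split()) <= source_words

def evaluate_offline(sens_text: list[str], sens_summary: list[str], k: int) -> float:
    """Same flow as the pipeline, using the stand-ins above on pre-split sentences."""
    if not sens_text or not sens_summary:
        return 0.0
    vocab = sorted({w for s in sens_text + sens_summary for w in s.lower().split()})
    embed_text = bow_embed(sens_text, vocab)
    embed_summary = bow_embed(sens_summary, vocab)

    # Cosine similarity, shape (N summary sentences, M text sentences)
    summary_norms = np.linalg.norm(embed_summary, axis=1, keepdims=True)
    text_norms = np.linalg.norm(embed_text, axis=1, keepdims=True)
    sim_matrix = (embed_summary @ embed_text.T) / (summary_norms @ text_norms.T)

    # Indices of the k most similar text sentences for each summary sentence
    topk = sim_matrix.argsort(axis=1)[:, -k:]
    supported = sum(
        overlap_checker([sens_text[i] for i in topk[idx]], sen)
        for idx, sen in enumerate(sens_summary)
    )
    return supported / len(sens_summary)

toy_text = ["revenue rose ten percent", "costs fell slightly", "the board approved a dividend"]
toy_summary = ["revenue rose ten percent", "profit doubled"]
print(evaluate_offline(toy_text, toy_summary, k=1))  # 0.5
```

The first toy summary sentence appears verbatim in the text and passes, while the second shares no words with any text sentence and fails, which gives one supported sentence out of two.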