# Pipeline Framework
This notebook illustrates the pipeline framework of our project. The pipeline consists of five steps:
1. Split the text and the candidate summary into two lists of sentences.
2. Convert each list of sentences into an embedding matrix.
3. Compute the cosine similarity between summary sentences and text sentences from their embeddings, then find the indices of the top k related text sentences for each summary sentence.
4. Use an LLM to check whether each summary sentence can be obtained from its related text sentences.
5. Combine the above steps to score the summary.

The pipeline is only a toy model and leaves room for improvement. For example, we could use LLMs to check whether the dependency arcs or named entities in a summary sentence can be obtained from the related sentences in the original text.

In [None]:
import os
import time

import numpy as np
import pandas as pd
import spacy
import stanza
from dotenv import load_dotenv
from openai import OpenAI
from sentence_transformers import SentenceTransformer  # needed to build `model` below

# Read OPENAI_API_KEY from the environment or a local .env file instead of hard-coding it.
load_dotenv()

We need to import some packages and initialize some tools in advance.
1. `model` is a tool for converting sentences to embeddings.

In [None]:
model = SentenceTransformer('all-mpnet-base-v2')

2023-11-16 13:33:35 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 48.8MB/s]                    
2023-11-16 13:33:37 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus             |
| ner          | ontonotes_charlm    |

2023-11-16 13:33:37 INFO: Using device: cuda
2023-11-16 13:33:37 INFO: Loading: tokenize
2023-11-16 13:33:40 INFO: Loading: pos
2023-11-16 13:33:40 INFO: Loading: lemma
2023-11-16 13:33:40 INFO: Loading: constitue

## Step one: Split text and candidate summary into two lists of sentences.
We use `sent_tokenize` to split the text and the summary into sentences. Working at the sentence level lets us check whether each summary sentence can be obtained from specific sentences in the text.

In [None]:
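# assumes sent_tokenize is NLTK's sentence tokenizer and that its tokenizer data is installed
from nltk.tokenize import sent_tokenize

def split_text(text:str)->list[str]:
    """
    Split text into sentences
    Args:
        text: the text to be split

    Returns:
        a list of sentences
    """
    return sent_tokenize(text)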

## Step two: Convert those lists of sentences to embedding matrix.
We use `model` to convert sentences to embeddings. The output is a matrix of type `np.ndarray` in which each row is the embedding of one sentence.

In [None]:
def sentence2embedding(sentences:list[str])->np.ndarray:
    """
    Convert sentences to embeddings
    Args:
        sentences: a list of sentences

    Returns:
        a matrix of embeddings, each row is an embedding
    """
    embeddings = model.encode(sentences)
    return embeddings

## Step three: Get the most related sentences from the original text for each sentence in the summary.
- We use cosine similarity to measure how closely each summary sentence relates to each text sentence. The output is a matrix of type `np.ndarray`.
- If the original text has $M$ sentences and the summary has $N$ sentences, the output matrix has shape $N\times M$.
- The `[i,j]` element of the matrix is the cosine similarity between the $i$-th sentence in the summary and the $j$-th sentence in the original text.

In [None]:
def cosine_similarity(embed_text:np.ndarray, embed_summary: np.ndarray)->np.ndarray:
    """
    Calculate the cosine similarities between sentences of summary and sentences of text
    Args:
        embed_text: embedding matrix of text sentences
                    each row is an embedding
        embed_summary: embedding matrix of summary sentences
                    each row is an embedding

    Returns:
        a matrix of cosine similarities
    """

    # [i,j] is the dot product of summary sentence i and text sentence j
    dot_prod = embed_summary @ embed_text.T
    # [i,j] is the product of the norms of summary sentence i and text sentence j
    norm = np.linalg.norm(embed_summary, axis=1, keepdims=True) @ np.linalg.norm(embed_text, axis=1, keepdims=True).T
    return dot_prod / norm
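
As a quick sanity check of the shape convention, the sketch below applies the same formula to toy 2-D vectors. The helper name `cosine_similarity_matrix` and the toy arrays are illustrative assumptions, not part of the pipeline; the helper normalizes each row first, which gives the same result as dividing the dot products by the product of norms.

```python
import numpy as np

def cosine_similarity_matrix(embed_text: np.ndarray, embed_summary: np.ndarray) -> np.ndarray:
    """Illustrative restatement of the formula above using unit-length rows."""
    text_unit = embed_text / np.linalg.norm(embed_text, axis=1, keepdims=True)
    summary_unit = embed_summary / np.linalg.norm(embed_summary, axis=1, keepdims=True)
    # Rows index summary sentences, columns index text sentences.
    return summary_unit @ text_unit.T

# Toy "embeddings": M = 3 text sentences, N = 2 summary sentences.
toy_text = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_summary = np.array([[2.0, 0.0], [0.0, 1.0]])

toy_sim = cosine_similarity_matrix(toy_text, toy_summary)
print(toy_sim.shape)  # (2, 3), i.e. N x M
print(toy_sim)
```

Vector length does not affect the result: `[2, 0]` and `[1, 0]` point in the same direction, so their similarity is 1.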

Next, we find the indices of the top k related text sentences for each summary sentence. The selected sentences are passed to the LLM prompt, which checks whether the summary sentence can be obtained from them.

In [None]:
def topk_related(sim_matrix:np.ndarray, k:int)->np.ndarray:
    """
    Find the indices of top k related sentences in text for each sentence in summary
    Args:
        sim_matrix: cosine similarity matrix
        k: number of sentences to be selected

    Returns:
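        a matrix of indices with one row per summary sentence;
        each row lists the selected text indices in ascending order of similarity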
    """
    return sim_matrix.argsort(axis=1)[:, -k:]
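
The ordering of the returned indices is easy to misread, so here is a small sketch on a made-up similarity matrix. It uses the same `argsort` expression as `topk_related`; the variable names and values are illustrative only.

```python
import numpy as np

# Toy similarity matrix: 2 summary sentences (rows) x 4 text sentences (columns).
toy_sim = np.array([
    [0.1, 0.9, 0.3, 0.5],
    [0.8, 0.2, 0.7, 0.4],
])

k = 2
# Same expression as in topk_related: argsort is ascending, so the last k
# columns hold the k most similar text sentences, least similar of them first.
ascending = toy_sim.argsort(axis=1)[:, -k:]
# Reverse each row if the most similar sentence should come first.
descending = ascending[:, ::-1]

print(ascending)
print(descending)
```

Two edge cases follow from Python slicing: if `k` exceeds the number of text sentences, all indices are returned, and `k = 0` also returns all indices because `-0:` is the same slice as `0:`.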

## Step four: Check if the sentence from the summary can be obtained from the sentence from the text with the help of LLMs.
For each sentence in the summary, check whether it can be obtained from the top k related sentences in the text:
1. If yes, return `True`.
2. Otherwise, return `False`.

The checker could also return the probability that the summary sentence can be obtained from the text sentences; the current implementation returns only the Boolean answer.

Currently we consider factuality only at the sentence level.

This step uses an LLM to make the judgment. [Guidance](https://github.com/guidance-ai/guidance) can be used to constrain the model's output; the implementation below calls the OpenAI chat completions API directly.

In [None]:

def checker(sens_text:list[str], sen_summary:str)->bool:
    """
    Check if the sentence from the summary can be obtained from the sentences from the text.
    Args:
        sens_text: list of sentences from the text
        sen_summary: the sentence from the summary

    Returns:
        True if the model answers "Yes", i.e. the sentence from the summary
        can be obtained from the sentences from the text; False otherwise
    """
    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # join with spaces so that adjacent sentences do not run together
    source_text = ' '.join(sens_text)

    prompt = f"""
As a compliance officer at a financial institution, you're tasked with evaluating the accuracy of a summary sentence based on its alignment with source sentences from a financial document. Consider the following criteria carefully:

1. The summary accurately reflects the content of the source sentences, especially numerical information.
2. All named entities in the summary are present in the source sentences.
3. Relationships between entities in the summary are consistent with those in the source sentences.
4. The directional flow of relationships among named entities matches between the summary and source sentences.
5. There are no factual discrepancies between the summary and source sentences.
6. The summary does not introduce any entities not found in the source sentences.

Your job is to determine if the summary adheres to these criteria. Answer "Yes" if it does, or "No" if it doesn't.

Summary sentence: ```{sen_summary}```

Source sentences: ```{source_text}```

Final Answer (Yes/No only):
"""

    response = client.chat.completions.create(
        model = 'gpt-4',
        messages=[{'role':"user",'content':prompt}],
        max_tokens=1
    )

    # normalize the one-token answer, e.g. " yes" or "YES" -> "Yes"
    res = response.choices[0].message.content.strip().lower().capitalize()

    return res == 'Yes'
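
The probability mentioned in this step is not computed by `checker`. One possible approach, sketched below, assumes the request is made with token log probabilities enabled and that the candidate first tokens and their log probabilities have been collected into a dictionary. The helper `yes_probability` and the example values are hypothetical; the sketch covers only the arithmetic, not the API call.

```python
import math

def yes_probability(token_logprobs: dict[str, float]) -> float:
    """Hypothetical helper: P("yes") renormalized over the yes/no candidates."""
    p_yes = 0.0
    p_no = 0.0
    for token, logprob in token_logprobs.items():
        label = token.strip().lower()  # treat "Yes", " yes", "YES" alike
        if label == "yes":
            p_yes += math.exp(logprob)
        elif label == "no":
            p_no += math.exp(logprob)
    total = p_yes + p_no
    if total == 0.0:
        return 0.5  # neither answer among the candidates: no evidence either way
    return p_yes / total

# Made-up candidate tokens for the first generated token.
example = {
    "Yes": math.log(0.6),
    "No": math.log(0.2),
    "yes": math.log(0.1),
    "The": math.log(0.1),
}
prob = yes_probability(example)  # (0.6 + 0.1) / (0.6 + 0.1 + 0.2)
print(prob)
```

Renormalizing over the two answers ignores probability mass on unrelated tokens such as `"The"`, so the result can be thresholded at 0.5 as the original docstring suggests.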

## Step five: Evaluate the quality of the summary (Combine the above steps).
We combine the above steps to evaluate the quality of the summary.

The score is the fraction of summary sentences that the checker judges to be supported by the text. It lies between 0 and 1; the higher the better.

In [None]:
# split the text into sentences
sens_text = split_text(text)
# split the summary into sentences
sens_summary = split_text(summary)

# convert sentences to embeddings
embed_text = sentence2embedding(sens_text)
embed_summary = sentence2embedding(sens_summary)

# calculate cosine similarity
sim_matrix = cosine_similarity(embed_text, embed_summary)

# find top k related sentences
topk = topk_related(sim_matrix, 10)

In [None]:
def evaluate(text:str, summary:str, k:int)->float:
    """
    evaluate the quality of the summary according to the given text
    Args:
        text: original text
        summary: summary to be evaluated
        k: number of sentences to be selected from the text

    Returns:
        a float number between 0 and 1, the higher the better
    """

    # split the text into sentences
    sens_text = split_text(text)
    # split the summary into sentences
    sens_summary = split_text(summary)

    # convert sentences to embeddings
    embed_text = sentence2embedding(sens_text)
    embed_summary = sentence2embedding(sens_summary)

    # calculate cosine similarity
    sim_matrix = cosine_similarity(embed_text, embed_summary)

    # find top k related sentences
    topk = topk_related(sim_matrix, k)

    # check if each sentence from the summary can be obtained from its related text sentences
    numerator = 0
    for idx, sen in enumerate(sens_summary):
        time.sleep(5)  # pause between API calls
        sens_text_selected = [sens_text[i] for i in topk[idx]]
        if checker(sens_text_selected, sen):
            numerator += 1
    return numerator / len(sens_summary)
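
To exercise the scoring logic without calling the API, the aggregation can be separated from the check. The sketch below is an illustration under assumptions: `support_ratio` and `substring_check` are hypothetical names, the sentences are made up, and the substring test is only a stand-in for the LLM call.

```python
from typing import Callable

def support_ratio(
    summary_sentences: list[str],
    evidence_per_sentence: list[list[str]],
    check: Callable[[list[str], str], bool],
) -> float:
    """Fraction of summary sentences that `check` judges supported by their evidence."""
    if not summary_sentences:
        return 0.0  # assumption: an empty summary scores 0
    supported = sum(
        1
        for sentence, evidence in zip(summary_sentences, evidence_per_sentence)
        if check(evidence, sentence)
    )
    return supported / len(summary_sentences)

def substring_check(evidence: list[str], sentence: str) -> bool:
    """Stand-in for the LLM call: supported only if the sentence appears verbatim."""
    return any(sentence in source for source in evidence)

score = support_ratio(
    ["Revenue rose 5%.", "The CEO resigned.", "Profits fell."],
    [["Revenue rose 5%. Costs were flat."], ["The CFO resigned."], ["Profits fell. Debt grew."]],
    substring_check,
)
print(score)  # two of the three sentences are supported
```

Passing the checker as an argument also makes it straightforward to compare different prompts or models on the same selected sentences.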

In [None]:
df_summary = pd.read_csv('final_version_cropped_first1000.csv', index_col = 0)
summary = df_summary.iloc[1]['summary']
text = df_summary.iloc[1]['text_extracted']

In [None]:
summary

'The United States Securities and Exchange Commission announced that on December 29, 2008, it filed an emergency action to halt a Ponzi scheme and affinity fraud conducted by Creative Capital Consortium, LLC and A Creative Capital Concept$, LLC (collectively, Creative Capital), and its principal, George L. Theodule. According to the Commission\'s complaint, the defendants raised at least $23.4 million from thousands of investors in the Haitian-American community nationwide through a network of purported investment clubs Theodule directs investors to form. Also on December 29, 2008 Judge Donald M. Middlebrooks, U.S. District Judge for the Southern District of Florida, issued an order placing Creative Capital under the control of a receiver to safeguard assets, as well as other emergency orders, including temporary restraining orders and asset freezes.The Commission\'s complaint alleges that starting in at least November 2007, Theodule, directly and through Creative Capital, raised at le

In [None]:
falsi_summary = "The Canadian Financial Conduct Authority announced that on July 15, 2015, it initiated a routine audit against a legitimate financial endeavor run by Innovative Growth Enterprises, Ltd. and Growth Innovation Concepts, Ltd. (collectively, Innovative Growth), led by its chief, Edward M. Harrison. Per the Authority's review, the entities ethically garnered approximately $17.6 million from a diverse group of investors across the Canadian community through a network of officially sanctioned investment societies Harrison encourages investors to join. Simultaneously on July 15, 2015, Judge Emily R. Thompson, Canadian Federal Judge for the Northern District of Quebec, confirmed Innovative Growth's operational integrity, commending its asset management and issuing permanent endorsements for its ongoing operations, including sustainable investment strategies and transparent asset visibility. The Authority's review reveals that since at least April 2012, Harrison, both personally and through Innovative Growth, ethically managed over $17.6 million from numerous investors, predominantly from the Canadian populace. Integral to their strategy, the entities guide investors to establish investment societies aiming to bolster funds for Harrison and Innovative Growth. Harrison persuades investors for Innovative Growth by promising a sustainable 5% annual return on their investment based on his proven expertise in conservative bonds and mutual funds. The entities also attract investors by demonstrating that Innovative Growth's modest profits are reinvested into eco-friendly ventures, benefiting communities in Canada and abroad, including initiatives in Norway. In reality, Harrison has consistently grown a minimum of $15 million in conservative investments over the past three years. Additionally, Innovative Growth has consistently reinvested profits for societal benefit, particularly in environmental projects. 
Finally, the Review states, Harrison has diligently separated investor funds from personal finances, contributing at least $1.2 million of his own wealth to philanthropic causesThe Authority's review further clarifies:Statements regarding the security of investor deposits are absolutely true. Harrison directs prospective investors to collaborate with a renowned oversight agency called Ethical Investment Regulatory Services, Ltd. (EIRS). The entities highlight EIRS' external confirmation of their deposits as an extra layer of reliability and transparency. In fact, EIRS is a respected independent agency with no ties to Innovative Growth.Claims of success in conservative investments are accurate. Of the over $15 million managed in investment portfolios, Harrison has maintained a steady growth rate, losing no more than 3% in any fiscal year. In fact, Harrison's investment strategies have been profitable since April 2012, yielding consistent net gains.Claims that Innovative Growth's profits fund new eco-friendly ventures, benefiting communities in Canada and projects in Norway, are true. In fact, a significant portion of Innovative Growth's profits have been reinvested in various sustainable and environmental projects, not just in returning profits to investors. Additionally, the entities have responsibly utilized investor funds, with no misappropriation recorded.In addition to the positive findings of the review, the Authority commends Innovative Growth's ethical practices, transparency, and contribution to societal and environmental welfare. Investors are encouraged to refer to the Authority's 'Ethical Investment' Investor Guide, offering advice on participating in socially responsible investing. This and other investor guides are available on the CFC's website, at www.cfc.gov.ca/investor/guides.shtml. The 'Ethical Investment' Investor Guide is also available in French on the Authority's website. 
The Authority acknowledges the support of the National Canadian Office of Environmental and Financial Regulation in this matter. The CFC's evaluation of Innovative Growth's practices is ongoing.CFC Review in this matter."

In [None]:
evaluate(text, summary, 10)

0.6666666666666666

In [None]:
evaluate(text, falsi_summary, 10)

0.0

In [None]:
df_temp = pd.DataFrame(columns=['original_txt', 'summary'])
df_temp['summary'] = [summary, falsi_summary]
df_temp['original_txt'] = [text, text]

In [None]:
df_temp.to_csv('for_baseline.csv')

### NER Test

In [None]:
class NER_comparison:
    def __init__(self):
        self._nlp = spacy.load('en_core_web_sm')
        self._NER_cat = ["PERSON", "ORG", "DATE", "GPE", "MONEY"]

    def extraction(self, text:str)->set[str]:
        """Extract the named entities in the text

        Args:
            text (str): original text

        Returns:
            set[str]: set of named entities
        """

        doc = self._nlp(text)
        entities = set()
        for ent in doc.ents:
            # keep only the entity categories of interest
            if ent.label_ in self._NER_cat:
                entities.add(ent.text)
        return entities

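    def comparison_summary(self, original:set[str], summary:set[str])->tuple[float, set[str]]:
        """compare the named entities in the summary with those in the original text

        Args:
            original (set[str]): named entities of the original text
            summary (set[str]): named entities of the summary

        Returns:
            tuple[float, set[str]]: the ratio of named entities in the summary that also appear
                in the original text, and the summary entities missing from the original text
        """
        res = summary-original
        if not summary:
            return (1.0, res)  # convention: with no summary entities, none is unsupported
        return (1-len(res)/len(summary), res)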

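    def comparison_original(self, original:set[str], summary:set[str])->tuple[float, set[str]]:
        """compare the named entities in the original text with those in the summary

        Args:
            original (set[str]): named entities of the original text
            summary (set[str]): named entities of the summary

        Returns:
            tuple[float, set[str]]: the ratio of named entities in the original text that also
                appear in the summary, and the original entities missing from the summary
        """
        res = original-summary
        if not original:
            return (1.0, res)  # convention: with no original entities, none is missing
        return (1-len(res)/len(original), res)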

    def comparison_display(self, text:str, ents:set[str])->str:
        """highlight entities which are present in the text

        Args:
            text (str): text
            ents (set[str]): named entities

        Returns:
            str: text with highlighted named entities
        """
        for entity in ents:
            text = text.replace(entity, f"**{entity}**")
        return text

    def process(self, original:str, summary:str)->tuple[float, float]:
        """Get the two entity-overlap ratios
        Args:
            original (str): original text
            summary (str): summary
        Returns:
            tuple[float, float]: the ratio of named entities in the summary that appear in the original text,
                            the ratio of named entities in the original text that appear in the summary
        """
        original_ents = self.extraction(original)
        summary_ents = self.extraction(summary)
        summary_ratio = self.comparison_summary(original_ents, summary_ents)
        original_ratio = self.comparison_original(original_ents, summary_ents)
        return (summary_ratio[0], original_ratio[0])



In [None]:
NER_sample = NER_comparison()
NER_sample.process(text, summary)

(0.7272727272727273, 0.13953488372093026)
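
The two ratios are plain set arithmetic, which the following sketch reproduces on made-up entity sets without loading spaCy. The entity strings are illustrative only and are not taken from the dataset.

```python
# Hypothetical entity sets, as extraction() might return them.
original_ents = {"Acme Corp", "Jane Doe", "2020", "Paris", "$5 million"}
summary_ents = {"Acme Corp", "Jane Doe", "Berlin", "2021"}

# Entities the summary mentions that the original text does not contain.
unsupported = summary_ents - original_ents
summary_ratio = 1 - len(unsupported) / len(summary_ents)    # 1 - 2/4

# Entities of the original text that the summary leaves out.
omitted = original_ents - summary_ents
original_ratio = 1 - len(omitted) / len(original_ents)      # 1 - 3/5

print(summary_ratio, sorted(unsupported))
print(original_ratio, sorted(omitted))
```

The first ratio behaves like a precision over summary entities and the second like a recall over original entities. A low second ratio is expected for any summary that is much shorter than its source.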

In [None]:
original_text = "My name is Bob"
entities_to_replace = {"Bob"}

for entity in entities_to_replace:
    original_text = original_text.replace(entity, f"**{entity}**")

original_text

'My name is **Bob**'
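
Chained `str.replace` calls work for a single entity, but when one entity is contained in another (for example `Bob` and `Bob Smith`) the result depends on iteration order and the markers can end up nested or split. A possible alternative, sketched here with a hypothetical helper `highlight_entities`, makes a single pass with a regular expression that tries longer entities first.

```python
import re

def highlight_entities(text: str, entities: set[str]) -> str:
    """Wrap each entity occurrence in ** **, matching longer entities first."""
    if not entities:
        return text
    # Longest first, so "Bob Smith" wins over "Bob" at the same position.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(entity) for entity in ordered))
    # A single pass means already-highlighted text is never matched again.
    return pattern.sub(lambda match: f"**{match.group(0)}**", text)

highlighted = highlight_entities("Bob Smith met Bob.", {"Bob", "Bob Smith"})
print(highlighted)  # **Bob Smith** met **Bob**.
```

`re.escape` keeps entities such as `$23.4 million` from being interpreted as regular expression syntax.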