# Pipeline Framework 
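Sketch a rough framework first. A notebook makes it easier to walk through the logic and to debug; once the framework is final, I will rewrite it as a class so that it is easy to call.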

In [None]:
import stanza
import numpy as np

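Load a stanza English model as a global variable. It is used for sentence splitting and for later processing. The loading step will move into the class later; for now it lives here so that the model is not loaded repeatedly.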

In [None]:
nlp = stanza.Pipeline(lang='en')   

## 1. Split the text into sentences
Split the text into sentences so that fact checking can later run sentence by sentence.

In [None]:
def split_text(text:str)->list:
    """
    Split text into sentences
    Args:
        text: the text to be split

    Returns:
        a list of sentences
    """
    doc = nlp(text)
    return [sentence.text for sentence in doc.sentences]

## 2. Convert sentences to embeddings
Get an embedding for each sentence. The embeddings are used later to select related sentences by cosine similarity.

@ CC&Lauren

In [None]:
def sentence2embedding(sentences:list[str])->list[list[float]]:
    """
    Convert sentences to embeddings
    Args:
        sentences: a list of sentences

    Returns:
        a list of embeddings
    """
    embeddings = ____ # to be completed
    return embeddings
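
Until the real embedding model is chosen, a deterministic stand-in can keep the rest of the pipeline debuggable. The sketch below is a hashed bag-of-words vector, not the embedding this step is meant to use; the name `toy_sentence2embedding` and the `dim` parameter are hypothetical, and the vectors carry no semantic meaning beyond shared tokens.

```python
import hashlib
import math


def toy_sentence2embedding(sentences: list[str], dim: int = 64) -> list[list[float]]:
    """Hypothetical placeholder: hashed bag-of-words vectors, L2-normalised."""
    embeddings = []
    for sentence in sentences:
        vec = [0.0] * dim
        for token in sentence.lower().split():
            # md5 gives a bucket index that is stable across runs (unlike hash())
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # leave an all-zero vector untouched to avoid dividing by zero
        embeddings.append([v / norm for v in vec] if norm > 0 else vec)
    return embeddings


vectors = toy_sentence2embedding(["The cat sat.", "A dog barked loudly."])
print(len(vectors), len(vectors[0]))  # 2 64
```

The return type matches the stub's `list[list[float]]`, so it can be swapped out later without touching the downstream steps.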

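It is not clear whether the embeddings come back as a matrix; most likely they are a list. A function is needed here to convert the list into a matrix for the later computation.

If the matrices turn out to be large, PyTorch with a GPU is an option. For now, a simple numpy version will do.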

In [None]:
def embeddinesmatric(embeddings:list[list[float]])->np.ndarray:
    """
    Convert a list of embeddings to a matrix
    Args:
        embeddings: a list of embeddings, one per sentence

    Returns:
        a matrix of embeddings, each row is an embedding
    """
    
    return ____ # to be completed
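
One possible way to fill in the conversion, assuming the embeddings arrive as a list of equal-length lists of floats. The name `embeddings_to_matrix` is hypothetical and is used here only to keep the sketch separate from the stub above.

```python
import numpy as np


def embeddings_to_matrix(embeddings: list[list[float]]) -> np.ndarray:
    """Stack per-sentence embeddings into a (n_sentences, dim) float matrix."""
    # ragged input (rows of different lengths) makes np.asarray raise ValueError
    matrix = np.asarray(embeddings, dtype=np.float64)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {matrix.shape}")
    return matrix


mat = embeddings_to_matrix([[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]])
print(mat.shape)  # (2, 3)
```

The explicit `ndim` check catches an empty list or a single flat vector early, before the shapes cause confusing errors in the similarity step.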

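## 3. Select related sentences
Compute the cosine similarity between the summary sentences and the text sentences. It is used later to pick, for each summary sentence, the most related sentences in the original text.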

In [None]:
def cosine_similarity(embed_text:np.ndarray, embed_summary: np.ndarray)->np.ndarray:
    """
    Calculate the cosine similarities between sentences of summary and sentences of text
    Args:
        embed_text: embedding matrix of text sentences
                    each row is an embedding
        embed_summary: embedding matrix of summary sentences
                    each row is an embedding

    Returns:
        a matrix of cosine similarities
    """
    
    dot_prod = embed_summary @ embed_text.T # [i,j] is the dot product of summary sentence i and text sentence j
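    # np.outer pairs every summary norm with every text norm; `@` on two 1-D vectors would give a scalar or a shape error
    norm = np.outer(np.linalg.norm(embed_summary, axis=1), np.linalg.norm(embed_text, axis=1)) # [i,j] is the product of the norms of summary sentence i and text sentence j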
    return dot_prod / norm
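
A small worked example helps to check the shapes. The sketch below uses a hypothetical helper, `cosine_similarity_sketch`, that builds the denominator with `np.outer` so that entry `[i, j]` is the product of the norm of summary sentence `i` and the norm of text sentence `j`. The input vectors are made up for illustration.

```python
import numpy as np


def cosine_similarity_sketch(embed_text: np.ndarray, embed_summary: np.ndarray) -> np.ndarray:
    dot_prod = embed_summary @ embed_text.T               # shape (n_summary, n_text)
    norm_summary = np.linalg.norm(embed_summary, axis=1)  # shape (n_summary,)
    norm_text = np.linalg.norm(embed_text, axis=1)        # shape (n_text,)
    return dot_prod / np.outer(norm_summary, norm_text)


embed_text = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
embed_summary = np.array([[1.0, 1.0], [2.0, 0.0]])

sim = cosine_similarity_sketch(embed_text, embed_summary)
print(sim.shape)         # (2, 3)
print(np.round(sim, 3))  # rows are about [0.707, 0.707, 1.0] and [1.0, 0.0, 0.707]
```

An all-zero embedding row has norm 0 and would produce `nan` entries, so such rows may need to be filtered out or handled before this step.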

For each summary sentence, find the k most related sentences in the text. They are used in the fact check that follows.

In [None]:
def topk_related(sim_matrix:np.ndarray, k:int)->np.ndarray:
    """
    Find the indices of top k related sentences in text for each sentence in summary
    Args:
        sim_matrix: cosine similarity matrix
        k: number of sentences to be selected

    Returns:
        a matrix of indices, one row per summary sentence,
        with each row ordered from least to most similar
    """
    return sim_matrix.argsort(axis=1)[:, -k:]
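
Because `argsort` sorts in ascending order, the last column of the result holds the most similar sentence. The sketch below, on a made-up similarity matrix, shows the returned indices and how to reverse each row if a most-similar-first order is preferred.

```python
import numpy as np

sim_matrix = np.array([[0.1, 0.9, 0.5],
                       [0.8, 0.2, 0.3]])
k = 2

# same expression as in topk_related: indices of the k largest values per row
ascending = sim_matrix.argsort(axis=1)[:, -k:]
print(ascending.tolist())   # [[2, 1], [2, 0]]

# reverse each row to list the most similar text sentence first
descending = ascending[:, ::-1]
print(descending.tolist())  # [[1, 2], [0, 2]]
```

Note that `k` should be at least 1: with `k = 0` the slice `[:, -0:]` selects every column rather than none.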

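## 4. Fact checking
For each summary sentence, check whether it can be inferred from the most related sentences in the text:
1. If it can, return True.
2. Otherwise, return False.

Also return a probability that the summary sentence can be derived from the text sentences.

Start with a simple method here; methods based on dependency arcs or named entities can be added later.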

@ hyc

**PS: this part needs an LLM (guidance).**

In [None]:
def checker(sens_text:list[str], sen_summary:str)->tuple[bool, float]:
    """
    Check if the sentence from the summary can be obtained from the sentences from the text.
    Args:
        sens_text: list of sentences from the text
        sen_summary: the sentence from the summary

    Returns:
        a tuple of (bool, float)
        bool: True if the sentence from the summary can be obtained from the sentence from the text
        float: the probability that the sentence from the summary can be obtained from the sentence from the text
            True: >0.5
            False: <0.5
    """
    
    # to be completed
    
    
    
    res = ____
    prob = ____
    
    return (res, prob)
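
While the LLM-based check is still open, a trivial stand-in with the same return contract can be used to exercise the pipeline. The sketch below scores token overlap, which is not entailment and is not the method planned for this step; `toy_checker` and its tokenizer are hypothetical, and only the `(bool, float)` contract with the 0.5 threshold comes from the docstring above.

```python
import re


def toy_checker(sens_text: list[str], sen_summary: str) -> tuple[bool, float]:
    """Hypothetical placeholder: share of summary tokens that appear in the text."""

    def tokenize(sentence: str) -> set[str]:
        return set(re.findall(r"\w+", sentence.lower()))

    summary_tokens = tokenize(sen_summary)
    if not summary_tokens:
        return (False, 0.0)  # nothing to verify

    text_tokens: set[str] = set()
    for sen in sens_text:
        text_tokens |= tokenize(sen)

    prob = len(summary_tokens & text_tokens) / len(summary_tokens)
    return (prob > 0.5, prob)  # True only when the score exceeds 0.5


print(toy_checker(["The cat sat on the mat."], "The cat sat."))       # (True, 1.0)
print(toy_checker(["The cat sat on the mat."], "Dogs bark loudly."))  # (False, 0.0)
```

Keeping the boolean derived from the probability in one place ensures that the two return values cannot disagree.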

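## 5. Evaluation
Combine the steps above to evaluate the summary and return a score between 0 and 1; a higher score means a better summary. The score is the fraction of summary sentences that the checker judges to be supported by the text.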

In [None]:
def evaluate(text:str, summary:str, k:int)->float:
    """
    evaluate the quality of the summary according to the given text
    Args:
        text: original text
        summary: summary to be evaluated
        k: number of sentences to be selected from the text

    Returns:
        a float number between 0 and 1, the higher the better
    """
    
    # split the text into sentences
    sens_text = split_text(text)
    # split the summary into sentences
    sens_summary = split_text(summary)
    
    # convert sentences to embeddings
    embed_text = sentence2embedding(sens_text)
    embed_summary = sentence2embedding(sens_summary)
    
    # convert embeddings to matrix
    embed_text_mat = embeddinesmatric(embed_text)
    embed_summary_mat = embeddinesmatric(embed_summary)
    
    # calculate cosine similarity
    sim_matrix = cosine_similarity(embed_text_mat, embed_summary_mat)
    
    # find top k related sentences
    topk = topk_related(sim_matrix, k)
    
    # check if the sentence from the summary can be obtained from the sentence from the text
    denominator = 0
    numerator = 0
    for idx, sen in enumerate(sens_summary):
        sens_text_selected = [sens_text[i] for i in topk[idx]]
        res, _ = checker(sens_text_selected, sen)
        if res:
            numerator += 1
        denominator += 1
    return numerator / denominator
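
The final score is a plain ratio, which can be illustrated in isolation. In the sketch below, `support_ratio` is a hypothetical helper and the list of booleans stands in for the `res` values returned by `checker`; returning 0.0 for an empty summary is an assumption, since the loop above would otherwise divide by zero in that case.

```python
def support_ratio(results: list[bool]) -> float:
    """Fraction of summary sentences judged as supported by the text."""
    if not results:
        return 0.0  # assumption: an empty summary scores 0 instead of raising
    return sum(results) / len(results)


# three of four summary sentences are supported
print(support_ratio([True, False, True, True]))  # 0.75
print(support_ratio([]))                         # 0.0
```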