# Automatic Text Summarizer Using Embeddings
This work develops a proof of concept of an automatic text summarizer that generates extractive summaries of documents, using embeddings to capture the semantics of each sentence. Testing and evaluation use the CNN/DailyMail dataset, which contains news articles paired with summaries and is widely used for summarization tasks.

The summarizer's pipeline consists of the following steps:
1. Preprocess the text;
2. Transform each sentence of the text into a vector (embedding);
3. Cluster the embeddings to group contextually similar sentences;
4. Select the most representative sentence of each cluster;
5. Compose the extractive summary from the selected sentences.
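
The steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions, not the notebook's implementation: the 2-D vectors stand in for real sentence embeddings, and a tiny NumPy k-means with fixed initial centroids (the hypothetical helper `toy_kmeans`) replaces the NLTK clusterer used later.

```python
import numpy as np

# Toy "sentences" and hand-made 2-D embeddings forming two obvious groups
sentences = ["money A", "money B", "money C", "career A", "career B", "career C"]
embeddings = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
    [0.0, 1.0], [0.1, 0.9], [0.0, 0.8],
])

def toy_kmeans(X, init_centroids, n_iter=10):
    """Plain k-means with Euclidean distance and fixed initial centroids (illustrative only)."""
    centroids = init_centroids.copy()
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Step 3: cluster the embeddings
labels, centroids = toy_kmeans(embeddings, embeddings[[0, 3]])

# Step 4: in each cluster, pick the sentence closest to the centroid
picked = []
for k in range(len(centroids)):
    members = np.where(labels == k)[0]
    dists = np.linalg.norm(embeddings[members] - centroids[k], axis=1)
    picked.append(int(members[dists.argmin()]))

# Step 5: compose the summary, keeping the original sentence order
summary = [sentences[i] for i in sorted(picked)]
print(summary)
```

Each cluster contributes exactly one sentence, so the summary length equals the number of clusters.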

This work uses the OpenAI embedding models ``text-embedding-3-small``, ``text-embedding-3-large`` and ``text-embedding-ada-002``, so an OpenAI API key is required to reproduce it.

---

## Installing Dependencies

In [1]:
# Uncomment to install necessary libraries and packages
# ! pip install numpy pandas nltk datasets spacy openai scikit-learn

In [83]:
# The PyTorch package on PyPI is named "torch"; installing "pytorch" fails
! pip install transformers torch

In [86]:
# Optional experiment with a local BERT model (not used in the rest of this notebook)
# from transformers import BertModel, BertTokenizer
# model = BertModel.from_pretrained('bert-base-uncased')
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [2]:
from datasets import load_dataset
from nltk.cluster import KMeansClusterer
from dotenv import load_dotenv
from openai import OpenAI
import pandas as pd
import numpy as np
import nltk
import os
import re

In [2]:
# Uncomment to download relevant tools if not already downloaded
# spacy.cli.download("en_core_web_sm")
# nltk.download('punkt')

## Connecting to OpenAI

In [3]:
# Load environment variables
load_dotenv()

# Configure client with API key
client_openai = OpenAI(
    api_key=os.getenv('API_KEY'),
)

## Data Preparation

In [4]:
# Load CNN/DailyMail Dataset
dataset = load_dataset('cnn_dailymail', '3.0.0')

In [19]:
# Get data that will be used for evaluation
train = dataset['train']
data = train.select(range(20))

In [6]:
# Content example from dataset
print(data[0]['article'])  # Text
print()
print(data[0]['highlights'])  # Summary

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how

## Helper Functions

In [7]:
# Function to get embedding from text
def get_embedding(text, model):
    text = text.replace("\n", " ")
    return client_openai.embeddings.create(input=[text], model=model).data[0].embedding
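
The helper above sends one request per sentence. The embeddings endpoint also accepts a list of inputs, so a batched variant could reduce the number of requests. The sketch below is one possible shape for it; `get_embeddings_batch` and its `client` parameter are hypothetical names that do not appear in the notebook, and request size limits are not handled.

```python
def get_embeddings_batch(texts, model, client):
    """Embed several texts with a single request (illustrative sketch).

    `client` is assumed to be an OpenAI client such as `client_openai`.
    """
    cleaned = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries the index of its input; sort so output order matches input order
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With this variant, the per-sentence `apply` call in the summarizer could be replaced by a single call over the whole sentence list.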

In [8]:
# Calculate similarity between embeddings
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
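
A small worked example may help read the scores reported later. The function is repeated here so the snippet runs on its own; the 2-D vectors are made up for illustration. Cosine similarity depends only on direction, not on vector length.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 0.0])

same_direction = cosine_similarity(a, np.array([3.0, 0.0]))  # 1.0: same direction, different length
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))      # 0.0: unrelated directions
opposite = cosine_similarity(a, np.array([-1.0, 0.0]))       # -1.0: opposite directions

print(same_direction, orthogonal, opposite)
```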

In [9]:
# Split text into sentences
def tokenize_sentences(text):
    sentences = nltk.sent_tokenize(text)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

In [10]:
# Assign clusters to embeddings and find their centroids
def kmeans_clustering(data, n_clusters, iterations=25):
    embeddings = np.array(data["embeddings"].tolist())
    kclusterer = KMeansClusterer(
            n_clusters, 
            distance=nltk.cluster.util.cosine_distance,
            repeats=iterations, 
            avoid_empty_clusters=True)
    
    assigned_clusters = kclusterer.cluster(embeddings, assign_clusters=True)
    data['cluster'] = pd.Series(assigned_clusters, index=data.index)
    data['centroid'] = data['cluster'].apply(lambda x: kclusterer.means()[x])
    return data

In [11]:
# Calculate distance of each embedding from its cluster centroid
def distance_from_centroid(data):
    def euclidean_distance(embedding, centroid):
        return np.linalg.norm(np.array(embedding) - np.array(centroid))

    # Apply to each row
    data['distance_from_centroid'] = data.apply(
        lambda row: euclidean_distance(row["embeddings"], row["centroid"]), axis=1)
    return data
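
Note that the clustering step uses cosine distance while the function above ranks sentences by Euclidean distance. The two can disagree when vector norms differ, as the toy example below shows. It is a sketch with made-up 2-D vectors; `cosine_distance` here is a local helper written for this illustration, defined as one minus the cosine similarity.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

centroid = np.array([1.0, 1.0])
candidates = {
    "same direction, far away": np.array([3.0, 3.0]),
    "nearby, different direction": np.array([1.0, 0.5]),
}

euclid = {name: float(np.linalg.norm(vec - centroid)) for name, vec in candidates.items()}
cosine = {name: float(cosine_distance(vec, centroid)) for name, vec in candidates.items()}

closest_euclid = min(euclid, key=euclid.get)
closest_cosine = min(cosine, key=cosine.get)
print(closest_euclid)  # nearby, different direction
print(closest_cosine)  # same direction, far away
```

If consistency with the clustering metric is desired, the Euclidean helper inside `distance_from_centroid` could be swapped for a cosine distance like this one; whether that changes the selected sentences on real embeddings was not tested here.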

## Implementation

### Text Preprocessing
To obtain better results on the CNN/DailyMail dataset, the articles are preprocessed before summarization. This consists of:
- Removing polluting phrases at the beginning and end of the articles, which usually carry information about the news outlet the text was taken from and have no direct relation to the article itself.
- Cleaning the text of unnecessary whitespace and stray special characters.
- Removing very short sentences, with 3 words or fewer.

Because our solution uses embedding models based on LLMs and trained on massive corpora, which capture the nuances and intrinsic semantics of the text rather than pure syntax, we considered more drastic preprocessing techniques such as stopword removal and stemming unnecessary. Those techniques could also discard part of the meaning and context, which would reduce performance.

In [67]:
# Remove sentences at the beginning and end of text that are not useful
def remove_polluting_phrases(text):
    text = re.sub(r"Editor's note:", '', text, flags=re.IGNORECASE).strip()

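    # Drop everything up to the first "--", the dateline separator
    # (e.g. "LONDON, England (Reuters) -- "); the + 3 skips "--" and the following space.
    # Note: this matches the first "--" anywhere in the text, so an article
    # without a dateline loses everything before its first "--".
    start_phrase = text.find("--")
    if start_phrase != -1:
        text = text[start_phrase + 3:]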

    end_phrase = text.find("E-mail to a friend")
    if end_phrase != -1:
        text = text[:end_phrase]
    
    return text
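
Because `text.find("--")` searches the whole article, a text without a dateline can lose its opening sentences. One possible guard, sketched below, is to look for the separator only near the start. The function name `remove_dateline` and the 100-character window are assumptions made for this illustration, not values from the notebook.

```python
def remove_dateline(text, max_prefix=100):
    """Strip a leading 'CITY (Source) -- ' dateline only if '--' occurs near the start."""
    # str.find with start/end arguments restricts the search to text[0:max_prefix]
    idx = text.find("--", 0, max_prefix)
    if idx != -1:
        return text[idx + 2:].lstrip()
    return text

print(remove_dateline("LONDON, England (Reuters) -- Harry Potter star"))
# Harry Potter star
```

A "--" that first appears later than the window (for example inside a quotation in the body) leaves the text unchanged.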

In [13]:
# Remove unnecessary spaces and stray special characters
def clean_text(text):
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\s+\.', '.', text).strip()
    text = re.sub(r'\s[^\w\s]\s', ' ', text)
    return text
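
To make the effect of the three substitutions concrete, here is a short usage example. The function is copied from the cell above so the snippet runs on its own; the input strings are made up.

```python
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    text = re.sub(r'\s+\.', '.', text).strip() # remove whitespace before a period
    text = re.sub(r'\s[^\w\s]\s', ' ', text)   # drop a lone symbol surrounded by whitespace
    return text

cleaned = clean_text("Hello   world .\nThis - is a  test .")
print(cleaned)  # Hello world. This is a test.

# A lone "&" is treated as a stray character and removed as well
print(clean_text("salt & pepper"))  # salt pepper
```

Multi-character sequences such as " -- " are not matched by the last pattern, since it targets a single symbol between two whitespace characters.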

In [14]:
# Remove short sentences (those with min_length words or fewer)
def remove_short_sentences(text, min_length=3):
    sentences = tokenize_sentences(text)
    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) > min_length]
    return ' '.join(filtered_sentences)

In [15]:
def preprocess_text(text):
    text = remove_polluting_phrases(text)
    text = remove_short_sentences(text)
    text = clean_text(text)
    return text

In [16]:
# Sample data for testing
text = data[0]['article']
print("No Preprocessing:")
print(text)
print("\nWith Preprocessing:")
print(preprocess_text(text))

No Preprocessing:
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office cha

### Summarization

In [63]:
# Summarize text
def summarize_text(text, model, n_clusters=3, output_as_list=False):
    # Preprocess text
    text = preprocess_text(text)

    # Create a dataframe with the text sentences
    sentences = tokenize_sentences(text)
    df_sentences = pd.DataFrame({"sentences" : sentences})

    # Transform sentences into embeddings using an NLP model
    df_sentences['embeddings'] = df_sentences['sentences'].apply(lambda x: get_embedding(x, model))

    # Cluster sentence embeddings with KMeans
    df_sentences = kmeans_clustering(df_sentences, n_clusters)

    # Find distance of each embedding to its cluster's centroid
    df_sentences = distance_from_centroid(df_sentences)

    # Compose summary with each cluster's most meaningful sentence - embedding with the least distance to centroid
    summary = df_sentences.sort_values('distance_from_centroid', ascending = True) \
                            .groupby('cluster').head(1) \
                            .sort_index()['sentences'] \
                            .tolist()
    if output_as_list: 
        return summary
    else:
        # Format summary as a string
        summary_str = ' '.join(summary)
        return summary_str

In [18]:
# Sample data for testing
text = data[0]['article']
model = "text-embedding-3-large"
summary = summarize_text(text, model, 3, True)

print("Reference Summary:")
print(data[0]['highlights'])

print("\nGenerated Summary:")
for sentence in summary:
    print(sentence)

Reference Summary:
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

Generated Summary:
Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him.
"People are always looking to say 'kid star goes off the rails,'" he told reporters last month.
There is life beyond Potter, however.


## Evaluating the Solution
The reference summaries ('highlights') were not produced the same way as ours, which are extractive, so evaluating this solution with a ROUGE metric is problematic. We therefore tried a different approach and used the embeddings themselves for evaluation: the similarity between the embedding of the generated summary and the embedding of the reference summary serves as the evaluation score. As long as the same embedding model is used for all summaries, the cosine similarity (or cosine distance) between them is a valid comparison.

This form of evaluation resembles ROUGE-WE, discussed in the paper "Better Summarization Evaluation with Word Embeddings for ROUGE" by Jun-Ping Ng and Viktoria Abrecht. ROUGE-WE is a ROUGE-based metric that uses word embeddings to soften the score loss caused by paraphrasing. In this work, however, we use embeddings of the full text.

We evaluate the solution by the mean of the similarity scores over the first 20 articles of the CNN/DailyMail training set. We compare the results of three embedding models for summarization: text-embedding-3-small, text-embedding-3-large and text-embedding-ada-002. For the evaluation itself, only text-embedding-3-small is used, so that all evaluation embeddings lie in the same vector space and their distances are comparable, which keeps the evaluation consistent and fair.
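
In symbols, the score described above can be written as follows. The notation is introduced here for illustration only: $E$ is the evaluation embedding model (text-embedding-3-small), $g_i$ and $r_i$ are the generated and reference summaries of article $i$, and $N = 20$.

```latex
\text{score} = \frac{1}{N} \sum_{i=1}^{N} \cos\bigl(E(g_i),\, E(r_i)\bigr),
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```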

In [72]:
data_df = data.to_pandas()  # contains 20 articles
data_df.head()

| | article | highlights | id |
|---|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... | 42c027e4ff9730fbb3de84c1af0d2c506e41c3e4 |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... | ee8871b15c50d0db17b0179a6d2beab35065f1e9 |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... | 06352019a19ae31e527f37f7571c6dd7f0c5da37 |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... | 24521a2abb2e1f5e34e6824e0f9e56904a2b0e88 |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... | 7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a |


In [61]:
# Get articles and validation summaries we will use for evaluation
articles = data_df["article"].tolist()
ref_summaries = data_df["highlights"].tolist()

### Generating the Summaries
Generating summaries of the news articles with our summarizer, using multiple embedding models.

In [87]:
model_tel = "text-embedding-3-large"
gen_summaries_tel = [summarize_text(text=article, model=model_tel, n_clusters=3) for article in articles]

In [88]:
model_tes = "text-embedding-3-small"
gen_summaries_tes = [summarize_text(text=article, model=model_tes, n_clusters=3) for article in articles]

In [89]:
model_ada = "text-embedding-ada-002"
gen_summaries_ada = [summarize_text(text=article, model=model_ada, n_clusters=3) for article in articles]
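
The three cells above differ only in the model name, so they could also be expressed as one loop that stores the results in a dictionary. The sketch below is one way to do that; `summarize_with_models` is a hypothetical helper, and its `summarize` argument stands for the notebook's `summarize_text`.

```python
def summarize_with_models(articles, models, summarize, n_clusters=3):
    """Return {model_name: [summary for each article]} (illustrative sketch)."""
    return {
        model: [summarize(text=article, model=model, n_clusters=n_clusters) for article in articles]
        for model in models
    }
```

For example, `summarize_with_models(articles, ["text-embedding-3-large", "text-embedding-3-small", "text-embedding-ada-002"], summarize_text)` would produce the same three lists keyed by model name.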

### Comparing with the Reference Summaries

In [90]:
# Evaluate the solution using similarity between embeddings
def evaluate_summaries(generated_summaries, reference_summaries, embedding_model="text-embedding-3-small"):
    similarities = []
    
    for gen_summary, ref_summary in zip(generated_summaries, reference_summaries):
        gen_embedding = get_embedding(gen_summary, embedding_model)
        ref_embedding = get_embedding(ref_summary, embedding_model)
        
        similarity = cosine_similarity(gen_embedding, ref_embedding)
        similarities.append(similarity)
    
    mean_similarity = np.mean(similarities)
    return mean_similarity
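
With only 20 articles, a mean alone can hide how much the scores vary from article to article. The sketch below returns the individual similarities so that a spread (for example the standard deviation) can be reported next to the mean. `similarity_scores` and its `embed` parameter are hypothetical; in the notebook, `embed` could be something like `lambda t: get_embedding(t, "text-embedding-3-small")`. The toy embedder is used only to keep the example runnable without an API key.

```python
import numpy as np

def similarity_scores(generated, references, embed):
    """Return one cosine similarity per (generated, reference) pair.

    `embed` is any callable that maps a string to a vector.
    """
    scores = []
    for gen, ref in zip(generated, references):
        g, r = np.asarray(embed(gen), dtype=float), np.asarray(embed(ref), dtype=float)
        scores.append(float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r))))
    return np.array(scores)

# Toy embedder for illustration: counts of the letters "a" and "b"
def toy_embed(text):
    return [text.count("a"), text.count("b")]

scores = similarity_scores(["aa", "ab"], ["a", "b"], toy_embed)
print(scores.mean(), scores.std())
```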

In [99]:
score_tel = evaluate_summaries(gen_summaries_tel, ref_summaries)
print(f"Using text-embedding-3-large -- Score: {score_tel}")

Using text-embedding-3-large -- Score: 0.7088255374317931


In [96]:
score_tes = evaluate_summaries(gen_summaries_tes, ref_summaries)
print(f"Using text-embedding-3-small -- Score: {score_tes}")

Using text-embedding-3-small -- Score: 0.676615599976754


In [95]:
score_ada = evaluate_summaries(gen_summaries_ada, ref_summaries)
print(f"Using text-embedding-ada-002 -- Score: {score_ada}")

Using text-embedding-ada-002 -- Score: 0.6761565650747232
