# Automatic Text Summarizer Using Embeddings
This project develops a proof of concept for an automatic text summarizer that generates extractive summaries of documents, using embeddings to capture the semantics of each sentence. Testing and evaluation use the CNN/DailyMail dataset, which contains news articles paired with their summaries and is widely used for summarization tasks.

The summarizer pipeline consists of the following steps:
1. Transform each sentence of the text into a vector (embedding);
2. Cluster these embeddings to group contextually similar sentences;
3. Select the most representative sentence of each cluster;
4. Compose the extractive summary from the selected sentences.
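
The steps above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the notebook's implementation: the 2-D vectors are made up and stand in for real embeddings, and the helper names (`normalize`, `spherical_kmeans`, `extractive_summary`) are hypothetical. It uses a small NumPy k-means on unit vectors, so the nearest centroid by dot product is also the nearest by cosine distance.

```python
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def spherical_kmeans(unit, n_clusters, n_iter=20, seed=0):
    """Tiny k-means on unit vectors using cosine similarity (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct points chosen at random
    centroids = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(unit @ centroids.T, axis=1)
        for k in range(n_clusters):
            members = unit[labels == k]
            if len(members):  # keep the old centroid if a cluster empties out
                centroids[k] = normalize(members.mean(axis=0, keepdims=True))[0]
    labels = np.argmax(unit @ centroids.T, axis=1)
    return labels, centroids

def extractive_summary(sentences, vectors, n_clusters):
    """Pick, per cluster, the sentence closest to the centroid; keep text order."""
    unit = normalize(np.asarray(vectors, dtype=float))
    labels, centroids = spherical_kmeans(unit, n_clusters)
    chosen = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        similarities = unit[idx] @ centroids[k]
        chosen.append(int(idx[np.argmax(similarities)]))
    return [sentences[i] for i in sorted(chosen)]

# Toy 2-D "embeddings" (made up): two groups pointing in different directions
sentences = ["A1", "A2", "A3", "B1", "B2", "B3"]
vectors = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]
print(extractive_summary(sentences, vectors, n_clusters=2))
```

Sorting the chosen indices keeps the summary sentences in the order in which they appear in the source text.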

This project uses the ``Milvus Standalone`` vector database and OpenAI's ``text-embedding-3-large`` embedding model.

```Notes:```

The original idea was to run the similarity search in Milvus, which can search embeddings at large scale. However, Milvus operates on disk, and since the amount of data in this particular project is small, a vector database was not needed; processing everything in memory is more efficient and practical. Still, since Milvus was already configured, we decided to use it only to store the sentence embeddings of the first 100 texts of the CNN/DailyMail training split, which are used for evaluation. This avoids repeating API calls during testing and incurring unnecessary cost and processing.
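
For comparison only, the same goal of avoiding repeated API calls can be met without a vector database. The sketch below is an assumption-laden alternative, not what this notebook does: `EmbeddingCache` is a hypothetical helper that stores embeddings in a local JSON file, and `embed_fn` stands for any function with the signature `(text, model)` that returns a list of floats.

```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    """Persist sentence embeddings in a JSON file keyed by model and text hash."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously stored embeddings, if the file already exists
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(text, model):
        # The model name is part of the key: different models give different vectors
        return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()

    def get_or_compute(self, text, model, embed_fn):
        """Return the cached embedding, calling embed_fn only on a cache miss."""
        key = self._key(text, model)
        if key not in self.store:
            self.store[key] = list(embed_fn(text, model))
            self.path.write_text(json.dumps(self.store))
        return self.store[key]
```

Rewriting the whole file on every miss is simple but slow for large caches; it is acceptable here only because the working set is around 100 articles.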


```Usage Instructions:```

- To run the notebook, a Milvus server must be running on the local machine. This requires Docker to be installed and running. Then run the following commands in a terminal:
    1. Download the Milvus startup script:
        ```sh
        wget https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh
        ```
    2. Start Milvus:
        ```sh
        bash standalone_embed.sh start
        ```
- To stop Milvus after use:
    ```sh
    bash standalone_embed.sh stop
    ```
- To delete the Milvus data:
    ```sh
    bash standalone_embed.sh delete
    ```
- On Windows, WSL2 is required.

---

## Installing Dependencies

In [78]:
# Uncomment to install necessary libraries and packages
# ! pip install pymilvus==2.4.1 numpy pandas nltk datasets spacy openai scikit-learn



In [91]:
from pymilvus import MilvusClient, DataType, Collection, connections, utility
from datasets import load_dataset, load_from_disk
from nltk.cluster import KMeansClusterer
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
from dotenv import load_dotenv
from openai import OpenAI
from ast import literal_eval
import pandas as pd
import numpy as np
import nltk
import spacy
import os

In [4]:
# Uncomment to download relevant tools if not already downloaded
# spacy.cli.download("en_core_web_sm")
# nltk.download('punkt')


## Milvus Setup

In [7]:
# Set up Milvus client
client = MilvusClient(
    uri="http://localhost:19530"
)

In [71]:
# Create schema
summarizer_schema = MilvusClient.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

# Add fields to schema
summarizer_schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
summarizer_schema.add_field(field_name="text_id", datatype=DataType.INT64)
summarizer_schema.add_field(field_name="sentence_content", datatype=DataType.VARCHAR, max_length=1024)
summarizer_schema.add_field(field_name="sentence_vector", datatype=DataType.FLOAT_VECTOR, dim=3072)

# Prepare index parameters
index_params = client.prepare_index_params()

# Add indexes
index_params.add_index(field_name="id")
index_params.add_index(field_name="text_id")
index_params.add_index(field_name="sentence_vector", index_type="AUTOINDEX", metric_type="COSINE")

# Create a collection
client.create_collection(
    collection_name="news_articles",
    schema=summarizer_schema,
    index_params=index_params
)
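
With the collection in place, storing and reading back the sentence embeddings of one article might look like the sketch below. The helper names (`build_rows`, `store_article`, `load_article`) are hypothetical; the client is passed in as an argument and is assumed to be the `MilvusClient` created above, using its `insert` and `query` methods. The `id` field is omitted from the rows because the schema sets `auto_id=True`.

```python
def build_rows(text_id, sentences, vectors):
    """Pair each sentence with its embedding in the field layout of the schema."""
    if len(sentences) != len(vectors):
        raise ValueError("sentences and vectors must have the same length")
    return [
        {"text_id": text_id, "sentence_content": s, "sentence_vector": list(v)}
        for s, v in zip(sentences, vectors)
    ]

def store_article(client, text_id, sentences, vectors, collection="news_articles"):
    """Insert all sentence rows of one article in a single call."""
    rows = build_rows(text_id, sentences, vectors)
    return client.insert(collection_name=collection, data=rows)

def load_article(client, text_id, collection="news_articles"):
    """Fetch the stored sentences and vectors of one article by its text_id."""
    return client.query(
        collection_name=collection,
        filter=f"text_id == {text_id}",
        output_fields=["sentence_content", "sentence_vector"],
    )
```

This sketch does not assume that `query` returns rows in insertion order. If sentence order matters when rebuilding a summary, one option is to store an extra position field alongside each row.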

In [70]:
# client.drop_collection("news_articles")
# print("Collection news_articles was dropped successfully.")

Collection news_articles was dropped successfully.


## Connecting to OpenAI

In [182]:
# Load environment variables
load_dotenv()

# Configure client with API key
client_openai = OpenAI(
    api_key=os.getenv('API_KEY'),
)

## Data Preparation

In [186]:
# Load CNN/DailyMail Dataset
dataset = load_dataset('cnn_dailymail', '3.0.0')

In [187]:
# Get data that will be used for evaluation
train = dataset['train']
data = train.select(range(100))

In [206]:
# Content example from dataset
print(data[0]['article'])  # Text
print()
print(data[0]['highlights'])  # Summary

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how

## Helper Functions

In [None]:
# Function to get embedding from text
def get_embedding(text, model):
    text = text.replace("\n", " ")
    return client_openai.embeddings.create(input=[text], model=model).data[0].embedding
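
Because `get_embedding` issues one request per sentence, a long article triggers many API calls. The embeddings endpoint also accepts a list of inputs, so a batched variant could look like the sketch below. `get_embeddings_batch` and its `batch_size` default are hypothetical; the client is passed in explicitly and is assumed to be an `OpenAI` client like `client_openai`.

```python
def get_embeddings_batch(client, texts, model, batch_size=100):
    """Embed many texts with one request per chunk instead of one per text."""
    cleaned = [t.replace("\n", " ") for t in texts]
    vectors = []
    for start in range(0, len(cleaned), batch_size):
        chunk = cleaned[start:start + batch_size]
        response = client.embeddings.create(input=chunk, model=model)
        # Each returned item carries the index of its input; sort to keep input order
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

In `summarize_text`, the per-row `apply` could then be replaced by a single call over `df_sentences['sentences'].tolist()`.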

In [102]:
# Split text in sentences
def tokenize_sentences(text):
    sentences = nltk.sent_tokenize(text)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

In [191]:
# Assign clusters to embeddings and find their centroids
def kmeans_clustering(data, n_clusters, iterations=25):
    embeddings = np.array(data["embeddings"].tolist())
    # `repeats` sets the number of randomized clustering trials NLTK runs
    kclusterer = KMeansClusterer(
            n_clusters,
            distance=nltk.cluster.util.cosine_distance,
            repeats=iterations,
            avoid_empty_clusters=True)

    assigned_clusters = kclusterer.cluster(embeddings, assign_clusters=True)
    means = kclusterer.means()  # fetch the centroids once instead of once per row
    data['cluster'] = pd.Series(assigned_clusters, index=data.index)
    data['centroid'] = data['cluster'].apply(lambda x: means[x])
    return data

In [192]:
# Calculate distance of each embedding from its cluster centroid
def distance_from_centroid(data):
    def euclidean_distance(embedding, centroid):
        return np.linalg.norm(np.array(embedding) - np.array(centroid))

    # Apply to each row
    data['distance_from_centroid'] = data.apply(
        lambda row: euclidean_distance(row["embeddings"], row["centroid"]), axis=1)
    return data
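
Note that the clustering step uses cosine distance, while `distance_from_centroid` ranks sentences by Euclidean distance. The two can disagree when vectors differ in length, which can be the case for centroids, since they are averages of embeddings. The sketch below is an illustration on made-up 2-D vectors, not a change to the notebook; `cosine_distance` is a hypothetical helper that could be swapped in for `euclidean_distance` if a consistent metric is preferred.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(np.dot(u, v) / denom)

def euclidean_distance(u, v):
    """Straight-line distance, as used in distance_from_centroid."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))

# Made-up example: `a` has the centroid's direction but a different length,
# while `b` is nearby in space but points in a slightly different direction
centroid = [1.0, 0.0]
a = [3.0, 0.0]
b = [1.0, 0.5]
print("euclidean:", euclidean_distance(a, centroid), euclidean_distance(b, centroid))
print("cosine:   ", cosine_distance(a, centroid), cosine_distance(b, centroid))
```

Here the Euclidean ranking prefers `b`, while the cosine ranking prefers `a`.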

In [194]:
# Summarize text
def summarize_text(text, model, n_clusters, iterations=25):
    # Create a dataframe with the text sentences
    sentences = tokenize_sentences(text)
    df_sentences = pd.DataFrame({"sentences": sentences})

    # Transform sentences into embeddings using an NLP model
    df_sentences['embeddings'] = df_sentences['sentences'].apply(lambda x: get_embedding(x, model))

    # Cluster sentence embeddings with KMeans
    df_sentences = kmeans_clustering(df_sentences, n_clusters, iterations)

    # Find distance of each embedding to its cluster's centroid
    df_sentences = distance_from_centroid(df_sentences)

    # Compose the summary from each cluster's most representative sentence:
    # the one whose embedding is closest to the centroid, kept in original text order
    summary = df_sentences.sort_values('distance_from_centroid', ascending=True) \
                            .groupby('cluster').head(1) \
                            .sort_index()['sentences'] \
                            .tolist()

    # To return a single string instead of a list, join the sentences:
    # summary_str = ' '.join(summary)

    return summary

## Tests

In [183]:
model_small = "text-embedding-3-small"
model = "text-embedding-3-large"

In [184]:
a = get_embedding('texto de teste', model)
print(a)

[0.0025332425720989704, 0.013405518606305122, -0.01072100643068552, 0.039066050201654434, 0.013431085273623466, -0.0039948103949427605, -0.05474701151251793, 0.07574586570262909, -0.032589130103588104, 0.01511849369853735, 0.0007084130775183439, 0.0603376179933548, -0.01924326829612255, -0.024885006248950958, 0.008223983459174633, -0.008812019601464272, 0.013115761801600456, 0.0057013933546841145, -0.007529418915510178, -0.011453920975327492, 0.0005528817418962717, 0.012127179652452469, -0.026759903877973557, 0.05505381524562836, -0.016456488519906998, 0.0012410545023158193, -0.008564873598515987, 0.03275957703590393, 0.01842513121664524, -0.014632724225521088, -0.005305108148604631, 0.014206611551344395, 0.006131767760962248, -0.009630156680941582, -0.03991827741265297, -0.016746245324611664, -0.003845670958980918, 0.002731385175138712, 0.01961824856698513, -0.006038022693246603, 0.032657310366630554, -0.03381633758544922, -0.04659973084926605, -0.0038243653252720833, 0.01224649138748

In [199]:
summary = summarize_text(text=data[0]['article'], model=model, n_clusters=4)
for sentence in summary:
    print(sentence)

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him.
There is life beyond Potter, however.
Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's "Equus."
Copyright 2007 Reuters.


## Implementation

### Preprocessing

## Solution Evaluation
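
This section does not yet state which metric is used. One common choice for CNN/DailyMail is ROUGE, which measures word overlap between a generated summary and the reference `highlights`. The sketch below is a simplified, dependency-free approximation of ROUGE-1, offered as an assumption about how evaluation could proceed; `tokenize` and `rouge1` are hypothetical helpers and skip the stemming that standard ROUGE implementations may apply.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps, per word, the smaller of the two counts
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat", "the cat ran away")
print(scores)
```

For an article `data[i]`, the candidate would be the joined output of `summarize_text` and the reference would be `data[i]['highlights']`.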