# Automatic Text Summarizer Using Embeddings
This work develops a proof of concept of an automatic text summarizer that generates extractive summaries of documents, using embeddings to capture the semantics of each sentence. Testing and evaluation use the CNN/DailyMail dataset, which contains news articles paired with their summaries and is widely used for summarization tasks.

The summarizer's execution pipeline consists of the following steps:
1. Transform each sentence of the text into a vector (embedding);
2. Cluster those embeddings;
3. Rank the most important sentences in each cluster;
4. Compose the extractive summary from the selected sentences.
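
The steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the notebook's implementation: the 2-D vectors are made-up stand-ins for real embeddings, the helper names (`kmeans`, `summarize`) are hypothetical, clustering is a plain k-means with Euclidean distance, and "most important" is taken to mean "closest to the cluster centroid".

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    """Plain k-means. Centroids start at the first k vectors, so runs are deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


def summarize(sentences, vectors, k):
    """Pick the sentence nearest to each centroid, kept in document order."""
    labels, centroids = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            chosen.append(
                min(members, key=lambda i: squared_distance(vectors[i], centroids[c]))
            )
    return [sentences[i] for i in sorted(chosen)]


# Toy input: six sentences with hand-made 2-D "embeddings" (assumed values).
sentences = [
    "The team won the final.",
    "Fans celebrated in the streets.",
    "The coach praised the players.",
    "The central bank raised rates.",
    "Markets fell after the decision.",
    "Analysts expect further hikes.",
]
vectors = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.3], [0.1, 0.9], [0.0, 1.0], [0.3, 0.8]]

summary = summarize(sentences, vectors, k=2)
print(summary)
# ['The team won the final.', 'The central bank raised rates.']
```

Sorting the chosen indices keeps the summary in the order the sentences appear in the article, which usually reads more naturally than ordering by cluster.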

This project uses the ``Milvus Standalone`` vector database and OpenAI's ``text-embedding-3-small`` embedding model.
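
As an illustration of how sentence embeddings might be requested from this model, here is a small sketch. The helper name `embed_sentences` is hypothetical, and it assumes the official `openai` Python package (version 1.x) with an API key available in the `OPENAI_API_KEY` environment variable; the client is passed in as a parameter so the function itself has no import-time dependency.

```python
def embed_sentences(client, sentences, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per sentence, in input order.

    `client` is expected to behave like `openai.OpenAI()`.
    """
    response = client.embeddings.create(model=model, input=sentences)
    # Each returned item carries an `index`; sort by it so the output
    # order is guaranteed to match the input order.
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]


# Usage (needs network access and a valid API key, so it is left commented out):
# from openai import OpenAI
# vectors = embed_sentences(OpenAI(), ["First sentence.", "Second sentence."])
# len(vectors[0])  # 1536 is the default dimension of text-embedding-3-small
```

Sending all sentences of an article in a single request keeps the number of API calls low; very long articles may need to be split into batches.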


```Notes:```

- Running the notebook requires a Milvus server on the local machine. With Docker installed and running, execute the following commands in a terminal:
    1. Download the Milvus startup script:
        ```sh
        wget https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh
        ```
    2. Start Milvus:
        ```sh
        bash standalone_embed.sh start
        ```
- To stop Milvus after use:
    ```sh
    bash standalone_embed.sh stop
    ```
- To delete the Milvus data:
    ```sh
    bash standalone_embed.sh delete
    ```
- On Windows, WSL2 is required.

---

## Installing Dependencies

In [1]:
# Uncomment the line below to install the dependencies
# ! pip install pymilvus==2.4.1 scikit-learn nltk datasets


In [1]:
from pymilvus import MilvusClient, DataType, Collection
from datasets import load_dataset, load_from_disk

## Milvus Setup

In [13]:
# Set up Milvus client
client = MilvusClient(
    uri="http://localhost:19530"
)

In [14]:
# Create schema
summarizer_schema = MilvusClient.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

# Add fields to schema
summarizer_schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
summarizer_schema.add_field(field_name="sentence_content", datatype=DataType.VARCHAR, max_length=1024)
summarizer_schema.add_field(field_name="sentence_vector", datatype=DataType.FLOAT_VECTOR, dim=1536)

# Prepare index parameters
index_params = client.prepare_index_params()

# Add indexes
index_params.add_index(field_name="id")
index_params.add_index(field_name="sentence_vector", index_type="AUTOINDEX", metric_type="COSINE")

# Create a collection
client.create_collection(
    collection_name="news_articles",
    schema=summarizer_schema,
    index_params=index_params
)
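
With the collection in place, rows can be shaped to match the schema before insertion. The sketch below is an assumption about how that step could look, not code from the notebook: `build_rows` and `insert_sentences` are hypothetical helpers, the `id` field is omitted because the schema uses `auto_id=True`, and sentences are truncated on the assumption that the `VARCHAR` `max_length` limit is counted in bytes.

```python
COLLECTION_NAME = "news_articles"
MAX_SENTENCE_BYTES = 1024  # mirrors max_length of the sentence_content field
VECTOR_DIM = 1536          # mirrors dim of the sentence_vector field


def truncate_utf8(text, max_bytes):
    """Cut `text` so its UTF-8 encoding fits in `max_bytes`, never splitting a character."""
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def build_rows(sentences, vectors):
    """Pair sentences with their vectors as dicts keyed by the schema's field names."""
    if len(sentences) != len(vectors):
        raise ValueError("sentences and vectors must have the same length")
    rows = []
    for sentence, vector in zip(sentences, vectors):
        if len(vector) != VECTOR_DIM:
            raise ValueError(f"expected {VECTOR_DIM} dimensions, got {len(vector)}")
        rows.append(
            {
                # No "id" key: the primary key is generated by Milvus (auto_id=True).
                "sentence_content": truncate_utf8(sentence, MAX_SENTENCE_BYTES),
                "sentence_vector": list(vector),
            }
        )
    return rows


def insert_sentences(client, sentences, vectors):
    """Insert the rows through a MilvusClient-like object and return its result."""
    return client.insert(collection_name=COLLECTION_NAME, data=build_rows(sentences, vectors))
```

Validating the vector length before the call gives a clearer error than a server-side rejection when the embedding model and the schema dimension drift apart.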

## Data Preparation

In [2]:
# Load CNN/DailyMail Dataset
dataset = load_dataset('cnn_dailymail', '3.0.0')

In [3]:
train_data = dataset['train']
test_data = dataset['test']

In [4]:
# Content example from dataset
print(train_data[0]['article'])  # Text
print()
print(train_data[0]['highlights'])  # Summary

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how

In [5]:
print(train_data[0])

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char
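
Each record stores the whole article as a single string, so it has to be split into sentences before the embedding step. The sketch below uses a simple regular-expression heuristic as a stand-in; it is an assumption for illustration only (the function name `split_sentences` is hypothetical), and a tokenizer such as the one in `nltk`, which the notebook installs, would handle abbreviations and quotations more reliably.

```python
import re

# Split on whitespace that follows sentence-ending punctuation,
# optionally followed by a closing quote (e.g. `DVDs." At 18`).
_SENTENCE_BOUNDARY = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]["\']))\s+')


def split_sentences(text):
    """Heuristic sentence splitter; abbreviations like "Mr." will be split incorrectly."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


example = 'He turns 18 on Monday. "I like books and DVDs." At 18, he can vote. Really?'
for sentence in split_sentences(example):
    print(sentence)
```

For a dataset record, the call would be `split_sentences(train_data[0]['article'])`.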