# Embedding Analysis and Dimensionality Reduction

**Objective.** Given a set of texts, generate embeddings with BERT and investigate the structure of the data with PCA, t-SNE, and UMAP. Then identify clusters and relate them to semantic categories.

In [None]:
sentences = [
    'I swap butter for olive oil in many recipes.',
    'Canberra is the capital of Australia.',
    'Ottawa is the capital city of Canada.',
    'Paris is the most populated city in France.',
    'Tokyo is among the most populous metropolitan areas worldwide.',
    'I prefer my coffee with no sugar and a splash of milk.',
    'The recipe for pasta carbonara is simple.',
    'A pinch of salt enhances sweetness in desserts.',
    'Alignment techniques reduce harmful outputs.',
    'Explainable AI highlights salient features for decisions.',
    'Transformer models enable long-range language dependencies.',
    'Black swan events stress-test portfolio resilience.',
    'The Sahara Desert spans much of North Africa.',
    'Inflation erodes real purchasing power of cash.',
    'Aromatics like garlic and onion build flavor early.',
    'Value stocks trade at lower multiples relative to fundamentals.',
    'Quantization reduces memory with minimal accuracy loss.',
    'Tax-loss harvesting offsets capital gains.',
    'Investing in technology can be risky.',
    'Fermented foods add acidity and complexity.',
    'Marinating tofu improves texture and taste.',
    'Vector databases power semantic search at scale.',
    'Distillation transfers knowledge from large to small models.',
    'The Great Barrier Reef lies off Australia’s northeast coast.',
    'Retrieval-augmented generation grounds answers in sources.',
    'Iceland lies on the Mid-Atlantic Ridge.',
    'The Baltic states border the eastern Baltic Sea.',
    'Multimodal learning aligns text with images and audio.',
    'Risk tolerance should guide position sizing.',
    'Time in the market beats timing the market.',
    'Behavioral biases can derail investment plans.',
    'Reinforcement learning fine-tunes policies from human feedback.',
    'Edge AI runs models under strict latency constraints.',
    'Deglazing lifts browned bits to make pan sauces.',
    'Tempering chocolate stabilizes cocoa butter crystals.',
    'What is the capital of France?',
    'Johannesburg is a major city but not South Africa’s capital.',
    'The Danube passes through multiple European capitals.',
    'The Amazon River carries one of the largest water volumes on Earth.',
    'A healthy emergency fund reduces forced selling.',
    'I batch-cook grains for quick lunches.',
    'Resting steak helps redistribute the juices.',
    'The Atacama is one of the driest deserts on the planet.',
    'Liquidity risk rises when trading volumes are thin.',
    'Mount Everest is the highest peak above sea level.',
    'Graph neural networks capture relational structure.',
    'Sourdough starter needs regular feedings to stay active.',
    'The stock market experienced a drop today.',
    'Umami-rich ingredients deepen savory dishes.',
    'Al dente pasta retains a slight bite after cooking.',
    'Rebalancing restores target asset allocation.',
    'Continual learning mitigates catastrophic forgetting.',
    'Bond duration measures sensitivity to interest-rate changes.',
    'Diffusion models synthesize high-fidelity images.',
    'Expense ratios compound against long-term returns.',
    'Self-supervised pretraining reduces labeled data needs.',
    'What country contains the city of Kyoto?',
    'Stir-frying requires high heat and constant movement.',
    'Covered calls generate income with capped upside.',
    'The Nile flows northward into the Mediterranean Sea.',
    'Causal inference distinguishes correlation from effect.',
    'Prompt engineering steers generative behavior reliably.',
    'Few-shot prompting improves generalization on new tasks.',
    'Growth investing prioritizes earnings expansion.',
    'The Alps stretch across several central European countries.',
    'The Andes form a continuous mountain range along South America.',
    'I cook vegetarian meals on weekdays to simplify planning.',
    'Natural language processing has advanced greatly.',
    'Sous-vide delivers precise temperature control.',
    'Diversification reduces idiosyncratic risk across holdings.',
    'Sharpe ratio evaluates risk-adjusted performance.',
    'Artificial intelligence is transforming the world.',
    'Credit spreads widen during economic uncertainty.',
    'Emerging markets add diversification but higher volatility.',
    'Mise en place speeds up weeknight cooking.',
    'The Caspian Sea is a landlocked body of water.',
    'Evaluation with benchmarks must avoid data leakage.',
    'Cairo sits along the Nile River delta.',
    'Federated learning trains models without centralizing data.',
    'Lagos is Nigeria’s largest city by population.',
    'Dollar-cost averaging smooths entry price over time.',
    'LoRA adapters enable efficient fine-tuning.',
    'I keep a jar of homemade pesto for pasta.',
    'New Delhi serves as the seat of India’s government.',
    'I like to cook Italian dishes on Sundays.',
    'Roasting vegetables caramelizes natural sugars.',
    'ETFs provide broad market exposure with intraday liquidity.',
    'Proofing time affects a bread’s crumb structure.'
]

## Generating the Embeddings

Use the pre-trained BERT model to generate embeddings for all the texts provided.  
The goal is to obtain a matrix `X` of shape **(N, dim)**, where **N** is the number of texts and **dim** is the dimensionality of the embedding vectors.

In [None]:
%pip install sentence-transformers torch scikit-learn matplotlib seaborn pandas umap-learn -q

In [None]:
from sentence_transformers import SentenceTransformer
import torch

# BERT-based sentence-embedding checkpoint; use the GPU when one is available.
nome_do_modelo = 'sentence-transformers/bert-base-nli-mean-tokens'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

modelo = SentenceTransformer(nome_do_modelo, device=device)

# One row per sentence: X has shape (N, dim).
X = modelo.encode(sentences, convert_to_numpy=True)

print(f'Number of sentences: {len(sentences)}')
print(f'Matrix shape: {X.shape}')
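
The checkpoint name ends in `mean-tokens`, which refers to mean pooling: the token vectors of each sentence are averaged into a single vector, ignoring padding. The sketch below illustrates that idea with NumPy on made-up numbers; `mean_pool` and the toy arrays are hypothetical and do not reproduce the internals of `sentence-transformers`.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, 4 dimensions (values are arbitrary).
toy_tokens = np.arange(24, dtype=float).reshape(2, 3, 4)
toy_mask = np.array([[1, 1, 0],   # third slot is padding
                     [1, 1, 1]])

toy_X = mean_pool(toy_tokens, toy_mask)
print(toy_X.shape)  # one vector per sentence, mirroring the (N, dim) layout of X
```

Each row of the result plays the role of one row of `X`: a fixed-size vector per sentence, regardless of how many tokens the sentence had.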

## PCA

Apply **PCA (Principal Component Analysis)** to project the embeddings onto two dimensions and visualize the global structure of the data.  
PCA finds the orthogonal directions of greatest variance, so groups that are linearly separable tend to show up as distinct regions in the plot.

**Tasks:**
- Reduce the embeddings to 2 principal components.  
- Plot the resulting points with `matplotlib` and look for possible groupings.  
- Assess qualitatively whether texts on different topics are separated.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Keep only the two directions of greatest variance.
pca = PCA(n_components=2, random_state=314)

X_pca = pca.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], s=30, alpha=0.8)

plt.title("PCA of the embeddings (2D)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True, linestyle="--", alpha=0.3)

plt.tight_layout()
plt.show()
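
A 2-D PCA plot is only as informative as the share of variance its two components retain. The sketch below computes that share from the singular values of centered data, using NumPy and synthetic data; `explained_variance_ratio` and `toy_data` are hypothetical names, and the scaling factors are chosen only so that two directions dominate.

```python
import numpy as np

def explained_variance_ratio(data, n_components=2):
    """Fraction of total variance captured by each of the leading components."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to the per-component variance
    return variances[:n_components] / variances.sum()

rng = np.random.default_rng(0)
# Six synthetic features; the first two are given a much larger spread.
toy_data = rng.normal(size=(100, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0])

ratios = explained_variance_ratio(toy_data)
print(f"PC1: {ratios[0]:.2f}, PC2: {ratios[1]:.2f}, total: {ratios.sum():.2f}")
```

On the fitted `pca` object from the cell above, scikit-learn exposes the same quantity as `pca.explained_variance_ratio_`. A low total suggests that the 2-D scatter hides much of the structure in the embeddings.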

## t-SNE

Use **t-SNE (t-distributed Stochastic Neighbor Embedding)** to investigate the local structure of the data.  
Unlike PCA, t-SNE tries to preserve local neighborhoods and can reveal subtler groups.

**Tasks:**
- Reduce the embeddings to 2D using `TSNE` from `scikit-learn`.  
- Vary parameters such as `perplexity` and `learning_rate` and compare the results.  
- Visualize the map and check whether similar texts end up close together.

In [None]:
from sklearn.manifold import TSNE

configuracoes = {
    "per_5_lr_50": {"perplexity": 5, "learning_rate": 50},
    "per_10_lr_70": {"perplexity": 10, "learning_rate": 70},
    "per_15_lr_90": {"perplexity": 15, "learning_rate": 90},
}

n_plots = len(configuracoes)
fig, axes = plt.subplots(1, n_plots, figsize=(6 * n_plots, 5), squeeze=False)

for (name, params), ax in zip(configuracoes.items(), axes[0]):
    tsne = TSNE(
        n_components=2,
        init="random",
        random_state=42,
        **params,
    )

    X_tsne = tsne.fit_transform(X)

    ax.scatter(
        X_tsne[:, 0],
        X_tsne[:, 1],
        s=40,
        alpha=0.8,
    )

    ax.set_title(
        f"{name}\nperp={params['perplexity']}, lr={params['learning_rate']}"
    )
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    ax.grid(True, linestyle="--", alpha=0.3)

plt.tight_layout()
plt.show()
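
`perplexity` can be read as the effective number of neighbors each point attends to. In t-SNE it is defined as 2 raised to the entropy of the Gaussian neighbor distribution around a point, and the algorithm tunes the Gaussian width per point to reach the requested value. The sketch below only evaluates that formula for a few fixed widths on synthetic data; it does not perform the search that t-SNE does, and `neighbour_perplexity` is a hypothetical helper.

```python
import numpy as np

def neighbour_perplexity(sq_dists, sigma):
    """Perplexity 2**H of the Gaussian neighbour distribution around one point."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits = logits - logits.max()   # stabilise the exponentials
    p = np.exp(logits)
    p = p / p.sum()
    p = p[p > 0]                     # skip underflowed entries before taking logs
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy

rng = np.random.default_rng(0)
toy_points = rng.normal(size=(50, 5))
# Squared distances from the first point to the other 49 points.
sq_dists = np.sum((toy_points[1:] - toy_points[0]) ** 2, axis=1)

sigmas = (0.5, 1.0, 2.0, 4.0)
perplexities = [neighbour_perplexity(sq_dists, s) for s in sigmas]
for s, value in zip(sigmas, perplexities):
    print(f"sigma={s}: perplexity={value:.1f}")
```

Wider Gaussians spread probability over more neighbors, so the perplexity grows toward the number of other points. This is why a small `perplexity` emphasizes very local structure and a large one smooths it out.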

## UMAP

Apply **UMAP (Uniform Manifold Approximation and Projection)** as an alternative to t-SNE.  
UMAP is typically faster than t-SNE, preserves some of the global structure, and is useful for both visualization and preprocessing.

**Tasks:**
- Generate a 2D projection of the embeddings with `umap.UMAP`.  
- Vary `n_neighbors` and `min_dist` and observe how the distribution of clusters changes.  
- Compare the result visually with the PCA and t-SNE plots.

In [None]:
from umap import UMAP

configuracoes = {
    "nn_5_md_0.5":  {"n_neighbors": 5,  "min_dist": 0.5},
    "nn_15_md_0.10": {"n_neighbors": 15, "min_dist": 0.10},
    "nn_30_md_0.1": {"n_neighbors": 30, "min_dist": 0.1},
}

n_plots = len(configuracoes)
fig, axes = plt.subplots(1, n_plots, figsize=(6 * n_plots, 5), squeeze=False)

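for (name, params), ax in zip(configuracoes.items(), axes[0]):
    # Named `reducer` so the instance is not confused with the `umap` package.
    reducer = UMAP(
        n_components=2,
        random_state=42,
        **params,
    )

    # X_umap is overwritten on each pass; after the loop it holds the last configuration.
    X_umap = reducer.fit_transform(X)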

    ax.scatter(
        X_umap[:, 0],
        X_umap[:, 1],
        s=40,
        alpha=0.8,
    )

    ax.set_title(
        f"{name}\nn_neighbors={params['n_neighbors']}, min_dist={params['min_dist']}"
    )
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.grid(True, linestyle="--", alpha=0.3)

plt.tight_layout()
plt.show()
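
The visual comparison across PCA, t-SNE, and UMAP can be complemented with a number. One simple option, sketched here with NumPy, is the average fraction of each point's k nearest neighbors in the original space that remain among its k nearest neighbors in the 2-D projection. `knn_overlap` is a hypothetical helper that assumes Euclidean distance, and the data is synthetic: the stand-in "projection" simply keeps the first two coordinates.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every row (Euclidean, self excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbour
    return np.argsort(dists, axis=1)[:, :k]

def knn_overlap(high_dim, low_dim, k=5):
    """Mean share of high-dimensional neighbours preserved in the projection (0 to 1)."""
    high = knn_indices(high_dim, k)
    low = knn_indices(low_dim, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high, low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
toy_high = rng.normal(size=(60, 10))
toy_low = toy_high[:, :2]  # crude stand-in for a 2-D projection

print(f"identity: {knn_overlap(toy_high, toy_high):.2f}")
print(f"first two coordinates: {knn_overlap(toy_high, toy_low):.2f}")
```

With the notebook's variables the calls would be `knn_overlap(X, X_pca)`, `knn_overlap(X, X_tsne)`, and `knn_overlap(X, X_umap)`. Note that `X_tsne` and `X_umap` hold only the last configuration of their respective loops.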

## Classification

Based on the categories observed in the previous plots, write a simple function that receives a text and assigns it to the most likely category.

**Tasks:**
- Use the existing embeddings and the clusters you identified to label each text automatically.  
- Write a function `classificar_texto(texto: str)` that:
  1. Generates the embedding of the text.
  2. Computes its distance to each identified cluster.
  3. Returns the name of the closest cluster.

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

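# Cluster the 2-D UMAP projection. X_umap comes from the last configuration
# of the previous cell (n_neighbors=30, min_dist=0.1).
dbscan = DBSCAN(
    eps=0.5,
    min_samples=5
)

# DBSCAN labels noise points as -1.
labels = dbscan.fit_predict(X_umap)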

plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    X_umap[:, 0],
    X_umap[:, 1],
    c=labels,
    s=40,
    alpha=0.8,
)
plt.title("DBSCAN on UMAP (2D)")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.grid(True, linestyle="--", alpha=0.3)
plt.colorbar(scatter, label="Cluster (label)")
plt.tight_layout()
plt.show()

df_clusters = pd.DataFrame({
    "frase": sentences,
    "cluster_id": labels
})

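# Drop the noise label (-1) and keep only real clusters.
cluster_ids_validos = np.unique(labels)
cluster_ids_validos = cluster_ids_validos[cluster_ids_validos != -1]

# Centroids are computed in the original embedding space, not in the 2-D projection.
centroides = {}
for c in cluster_ids_validos:
    centroides[int(c)] = X[labels == c].mean(axis=0)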

def classificar_texto(texto: str):
    """Return the cluster whose centroid is most cosine-similar to the text."""
    if not centroides:
        raise ValueError("DBSCAN found no clusters; adjust eps or min_samples.")

    emb = modelo.encode([texto], convert_to_numpy=True)

    ids = list(centroides.keys())
    matriz_centroides = np.stack([centroides[c] for c in ids])

    # One similarity per centroid, in the same order as `ids`.
    sims = cosine_similarity(emb, matriz_centroides)[0]

    idx_melhor = int(sims.argmax())
    cluster_escolhido = ids[idx_melhor]
    similaridade = sims[idx_melhor]

    nome_cluster = f"Cluster {cluster_escolhido}"

    return {
        "cluster_id": cluster_escolhido,
        "nome_cluster": nome_cluster,
        "similaridade": float(similaridade),
        "todas_similaridades": {c: float(s) for c, s in zip(ids, sims)}
    }

testes = [
    "I like to cook Italian pasta with tomato sauce.",
    "The Nile is one of the longest rivers in the world.",
    "Stock market volatility affects portfolio risk.",
    "Transformers are powerful models for NLP tasks.",
]

for t in testes:
    resultado = classificar_texto(t)
    print(f"\nText: {t}")
    print(f" → {resultado['nome_cluster']} (id = {resultado['cluster_id']}, sim = {resultado['similaridade']:.3f})")
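
The `eps=0.5` and `min_samples=5` values above are set by hand. A common heuristic for choosing `eps` is to sort every point's distance to its k-th nearest neighbor and pick a value near the point where that curve bends sharply upward. The sketch below computes the curve with NumPy on synthetic blobs; `k_distance_curve` is a hypothetical helper, and `k=4` assumes the point itself counts toward `min_samples=5`, as in scikit-learn's DBSCAN.

```python
import numpy as np

def k_distance_curve(points, k=4):
    """Sorted Euclidean distance from each point to its k-th nearest other point."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)            # column 0 is each point's distance to itself (0.0)
    return np.sort(dists[:, k])

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs standing in for a UMAP projection.
toy_points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(40, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(40, 2)),
])

curve = k_distance_curve(toy_points, k=4)
for q in (50, 90, 100):
    print(f"{q}th percentile of 4-NN distance: {np.percentile(curve, q):.3f}")
```

On the notebook's data the call would be `k_distance_curve(X_umap, k=4)`; plotting the result with `plt.plot` makes the bend easier to spot than reading percentiles.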