
# Fine-tuning a Qwen model
This notebook fine-tunes the **Qwen/Qwen2.5-0.5B** model with **LoRA** (Low-Rank Adaptation) for the fashion domain.  
The goal is to train the model to answer fashion-specific questions, using a supervised set of **question-answer pairs** stored in a CSV file.

The CSV was generated in two stages:
1. Transcription of the Curadobia videos
    - We used speech-to-text models to transcribe a range of fashion videos produced by the project partners.
2. Structuring the data with an LLM
    - Starting from the `.txt` produced by the transcription, we used an LLM to generate questions and answers grounded in that text. The process ran in stages under human supervision.

## Structure of the fine-tuning process
1. **Environment setup**  
   - Install the required dependencies.
   - Mount Google Drive to read the data and save the final model.

2. **Data processing**  
   - Read the CSV, which has the columns `input` (question) and `output` (answer).
   - Clean the data and format it as chat turns for training.

3. **Training with LoRA**  
   - Apply LoRA so that only specific parts of the model (the attention projections) are trained.
   - Train silently to avoid cluttering the notebook output.

4. **Evaluation and saving**  
   - Final evaluation of the model.
   - Merge the backbone with the LoRA adapters into a single model and save it as `.pkl` and `.safetensors`.

5. **Inference and comparison**  
   - Define functions to compare the behavior of the **base** model and the **fine-tuned** model.
   - Run hand-picked questions to observe the effect of fine-tuning.


# Initial setup

This cell prepares the environment for training:

- Sets seeds for reproducibility.
- Configures absolute paths to the files on Google Drive:
  - **CSV** with questions and answers.
  - **Final model** saved as `.pkl`.
- Mounts Google Drive (Colab) to access the data.
- Selects the GPU automatically when one is available.
- Loads the Qwen tokenizer and sets the padding token if it is missing.
- Defines the system text (`SYSTEM_PREFIX`), which frames the model as a fashion consultant.


In [None]:
!pip install -U -q transformers peft accelerate datasets


In [None]:
# ===== Imports & Config =====
import os, re, json, random, pickle, sys, platform
from pathlib import Path
from typing import Dict

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments
)
from peft import LoraConfig, get_peft_model
from safetensors.torch import save_file

# Seeds / base model
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
BASE_MODEL = "Qwen/Qwen2.5-0.5B"

# Absolute paths
DATA_CSV    = "/content/drive/Shareddrives/nsync_m11/sprint3/perguntas_respostas_1000.csv"
PICKLE_PATH = "/content/drive/Shareddrives/nsync_m11/sprint3/qwen_fine_tuned.pkl"
SAFETENSORS_PATH  = "/content/drive/Shareddrives/nsync_m11/sprint3/qwen_fine_tuned.safetensors"

# Mount Google Drive (the model was developed on Google Colab)
from google.colab import drive
drive.mount('/content/drive')

# Device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tokenizer
tok = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# System tone (kept in Portuguese because the training data is in Portuguese)
SYSTEM_PREFIX = "Você é uma consultora de moda brasileira. Responda de forma clara, objetiva e elegante."


# Reading and cleaning the CSV

- Reads the CSV, which has the columns:
  - **input**: question.
  - **output**: answer.
- Validates that both columns are present.
- Removes rows that are empty or contain invalid data.
- Renames the `input` column to `instruction` for consistency inside the pipeline.
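
The cleaning steps listed above can be tried on a toy frame. This is a minimal sketch, not the notebook's own cell: the helper name `clean_qa_frame` and the sample rows are made up for illustration. It drops missing values before casting to `str`, because in some pandas versions `astype(str)` turns `NaN` into the literal text `"nan"`, which an empty-string filter would not catch.

```python
import pandas as pd

def clean_qa_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning steps to an input/output frame."""
    df = df.rename(columns={"input": "instruction"})
    # Drop rows with missing values before any string conversion
    df = df.dropna(subset=["instruction", "output"])
    # Normalize both columns to stripped strings
    df = df.assign(
        instruction=df["instruction"].astype(str).str.strip(),
        output=df["output"].astype(str).str.strip(),
    )
    # Keep only rows where both fields are non-empty
    keep = (df["instruction"] != "") & (df["output"] != "")
    return df[keep].reset_index(drop=True)

# Made-up rows: one valid pair, one missing question, one blank question, one missing answer
toy = pd.DataFrame({
    "input": ["Q1", None, "   ", "Q4"],
    "output": ["A1", "A2", "A3", None],
})
clean = clean_qa_frame(toy)
print(len(clean))
```

Only the first row survives here, since every other row has a missing or blank field.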


In [None]:
csv_path = Path(DATA_CSV)
assert csv_path.exists(), f"CSV not found: {DATA_CSV}"

df = pd.read_csv(csv_path)
assert "input" in df.columns and "output" in df.columns, f"CSV must contain the columns 'input' and 'output'. Found: {list(df.columns)}"

df = df.rename(columns={"input": "instruction"})
# Drop missing values first: astype(str) can turn NaN into the literal string "nan",
# which would slip past the empty-string filter below
df = df.dropna(subset=["instruction", "output"])
df["instruction"] = df["instruction"].astype(str).str.strip()
df["output"] = df["output"].astype(str).str.strip()
df = df[(df["instruction"] != "") & (df["output"] != "")].reset_index(drop=True)


# Conversion to chat format

Each CSV example is turned into a **structured conversation**, used for both training and inference.  
The final format is:

```
<|system|>
{SYSTEM_PREFIX}
<|user|>
{instruction}
<|assistant|>
{output}
```

Using the same layout everywhere keeps the process consistent and helps the model learn to tell questions from answers.
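
The relationship between the training text and the inference prompt can be sketched as follows. This is an illustrative sketch only: `SYSTEM_TEXT` is a shortened stand-in for the notebook's `SYSTEM_PREFIX`, and `build_training_text` is a hypothetical name.

```python
# Shortened stand-in for the notebook's SYSTEM_PREFIX (illustrative value)
SYSTEM_TEXT = "Você é uma consultora de moda brasileira."

def build_prompt(user_msg: str, system: str = SYSTEM_TEXT) -> str:
    """Prompt used at inference time: everything up to the assistant tag."""
    return f"<|system|>\n{system}\n<|user|>\n{user_msg}\n<|assistant|>\n"

def build_training_text(user_msg: str, answer: str, system: str = SYSTEM_TEXT) -> str:
    """Training example: the inference prompt followed by the reference answer."""
    return build_prompt(user_msg, system) + f"{answer}\n"

question = "Como combinar tênis em looks elegantes?"
prompt = build_prompt(question)
example = build_training_text(question, "Resposta de exemplo.")

# The training text extends the inference prompt, so both share the same layout
print(example.startswith(prompt))
```

Because the prompt is a strict prefix of the training text, the model sees at inference exactly the context it was trained to continue.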



In [None]:
# Optional: cap for quick tests (None = use everything)
MAX_EXAMPLES = None
df_trainable = df.iloc[:MAX_EXAMPLES] if MAX_EXAMPLES else df

def to_text_row(instr: str, output: str, input_text: str = "") -> str:
    if input_text:
        return (
            f"<|system|>\n{SYSTEM_PREFIX}\n"
            f"<|user|>\n{instr}\n\nContexto:\n{input_text}\n"
            f"<|assistant|>\n{output}\n"
        )
    else:
        return (
            f"<|system|>\n{SYSTEM_PREFIX}\n"
            f"<|user|>\n{instr}\n"
            f"<|assistant|>\n{output}\n"
        )

texts = [to_text_row(r["instruction"], r["output"]) for _, r in df_trainable.iterrows()]


# Dataset creation and tokenization

1. Creates a `Dataset` object with all examples.
2. Splits it into **training (90%)** and held-out **validation (10%)** sets, so the model can be evaluated on data it was not trained on.
3. Tokenizes the texts:
   - Maximum of 1024 tokens per sample.
   - Longer texts are truncated automatically.
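
The idea behind the seeded split and the truncation can be sketched with the standard library alone. This is not the `datasets` implementation; `split_90_10` and `truncate` are hypothetical helpers that only illustrate the two steps.

```python
import random

def split_90_10(items, seed=42, test_fraction=0.1):
    """Shuffle with a fixed seed, then hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def truncate(token_ids, max_length=1024):
    """Keep at most max_length tokens, which is the effect of truncating long samples."""
    return token_ids[:max_length]

train, test = split_90_10(range(1000))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so repeated runs evaluate on the same held-out examples.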


In [None]:
ds = Dataset.from_dict({"text": texts})
ds = ds.train_test_split(test_size=0.1, seed=SEED)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

ds_tok = ds.map(tokenize, batched=True, remove_columns=["text"])

# Loading the model and applying LoRA

- Loads the base model **Qwen/Qwen2.5-0.5B**.
- Applies **LoRA** (Low-Rank Adaptation), which trains only small low-rank adapter matrices attached to specific parts of the model:
  - The attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`).
- This reduces the computational cost, since the original weights stay frozen.
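
To get a feel for how few parameters LoRA trains, the adapter size can be estimated by hand. This is a rough sketch under stated assumptions: the layer shapes below (hidden size 896, key/value width 128, 24 layers) are assumed for Qwen2.5-0.5B and are not taken from the notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two matrices per adapted layer: A (r x in) and B (out x r)."""
    return r * in_features + out_features * r

# Assumed shapes for Qwen2.5-0.5B (not taken from the notebook)
HIDDEN, KV_DIM, N_LAYERS, RANK = 896, 128, 24, 8
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}
per_layer = sum(lora_param_count(i, o, RANK) for i, o in shapes.values())
total = per_layer * N_LAYERS
print(total)  # 1081344 under the assumed shapes
```

In the notebook itself, calling `model.print_trainable_parameters()` on the PEFT model reports the actual count, which is the number to trust.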


In [None]:
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)

# Training configuration

- Sets the main hyperparameters:
  - `batch_size`, `epochs`, `learning_rate`.
  - Logging and checkpoint saving are turned off to keep the notebook clean.
- Uses `DataCollatorForLanguageModeling` configured for **causal models** (`mlm=False`).
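
What a causal-LM collator does to a batch can be illustrated in plain Python. This is a simplified sketch, not the library code: the real collator works on tensors and masks by pad token id, right-padding is assumed here, and `collate_causal` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def collate_causal(batch, pad_token_id):
    """Pad to the longest sequence and build labels for next-token prediction."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_causal([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])
```

The labels are not shifted here because the model shifts them internally when it computes the next-token loss. With the settings in this notebook, each optimizer step covers 2 × 4 = 8 examples per device (batch size times gradient accumulation).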


In [None]:
os.environ["WANDB_DISABLED"] = "true"

args = TrainingArguments(
    output_dir="./_tmp_session_only",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
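    logging_strategy="no",
    save_strategy="no",   # no intermediate checkpoints, so the session stays clean
    report_to="none",     # disable external loggers such as Weights & Biases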
    bf16=torch.cuda.is_available(),
)

collator = DataCollatorForLanguageModeling(tok, mlm=False)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tok["train"],
    eval_dataset=ds_tok["test"],
    data_collator=collator
)


# Training the model

Starts training with LoRA applied to the Qwen backbone.


In [None]:
train_result = trainer.train()
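
One way to run the final evaluation listed in the outline is to compute the loss on the validation split and report it as perplexity. The sketch below is an assumption about how that could look, not a cell from the original notebook; `perplexity_from_loss` is a hypothetical helper.

```python
import math

def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)

# In the notebook this would be used roughly as follows (not run here):
#   metrics = trainer.evaluate()
#   print(perplexity_from_loss(metrics["eval_loss"]))
print(perplexity_from_loss(2.0))
```

Lower perplexity on the held-out split means the model assigns higher probability to the reference answers.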

# Merging and saving the model

- After training, the LoRA weights are **merged** into the base model.  
- The result is saved in **two formats**:
  - `.pkl`: loadable with `pickle`, handy for simple Python scripts. Loading a pickle can execute arbitrary code, so only open files from trusted sources.  
  - `.safetensors`: a safer and more efficient format, recommended for production use or sharing.  
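
The effect of merging can be checked on a toy layer. This is a minimal NumPy sketch of the standard LoRA update with scaling `alpha / r`, under toy sizes chosen for illustration; it is not the PEFT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 6, 4, 2, 4   # toy sizes; the notebook uses r=8, lora_alpha=16

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # LoRA down-projection
B = rng.normal(size=(d_out, r))      # LoRA up-projection
scale = alpha / r

x = rng.normal(size=(d_in,))

# Adapter kept separate: base path plus scaled low-rank path
y_adapter = W @ x + scale * (B @ (A @ x))

# Adapter merged into a single dense weight
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
```

Both paths give the same output, which is why the merged model can be saved and served without the PEFT wrapper.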




In [None]:
try:
    merged = model.merge_and_unload()
except AttributeError:
    merged = model.base_model.merge_and_unload()

# Create the output directory if it does not exist
out_dir = Path(PICKLE_PATH).parent
out_dir.mkdir(parents=True, exist_ok=True)

# --- Save as .pkl ---
with open(PICKLE_PATH, "wb") as f:
    pickle.dump(merged.state_dict(), f)

# --- Save as .safetensors ---
merged.save_pretrained(out_dir, safe_serialization=True)
tok.save_pretrained(out_dir)

print(f"Model saved to:\n- {PICKLE_PATH}\n- {out_dir}/model.safetensors")


# Inference functions

Defines helpers to:
- Load the base model for comparison.
- Build input prompts in the expected format.
- Generate a reply and return only the newly generated text.


In [None]:
# Base model for comparison
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)
base_model.eval()
merged.eval()

def build_prompt(user_msg: str) -> str:
    return f"<|system|>\n{SYSTEM_PREFIX}\n<|user|>\n{user_msg}\n<|assistant|>\n"

def generate_only_new(model, prompt: str, max_new_tokens=200, temperature=0.7, top_p=0.9, seed=123):
    torch.manual_seed(seed)
    inputs = tok(prompt, return_tensors="pt").to(next(model.parameters()).device)
    input_len = inputs["input_ids"].shape[-1]
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        pad_token_id=tok.eos_token_id,
        eos_token_id=tok.eos_token_id
    )
    new_tokens = out[0][input_len:]
    text = tok.decode(new_tokens, skip_special_tokens=True).strip()
    if "<|user|>" in text:
        text = text.split("<|user|>")[0].strip()
    return text
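
The two post-processing steps in `generate_only_new` can be isolated in plain Python. The helper names below are hypothetical and the token ids are made up; the sketch only illustrates the slicing and the cut at the next `<|user|>` tag.

```python
def new_tokens_only(output_ids, prompt_len):
    """Drop the prompt tokens, keeping only what the model generated."""
    return output_ids[prompt_len:]

def keep_first_turn(generated: str, stop_marker: str = "<|user|>") -> str:
    """Cut the text at the first stop marker, in case the model starts a new turn."""
    return generated.split(stop_marker)[0].strip()

# Made-up ids: the first three belong to the prompt
print(new_tokens_only([11, 12, 13, 40, 41], prompt_len=3))
print(keep_first_turn("Use tons neutros.\n<|user|>\nOutra pergunta"))
```

Cutting at the marker matters because the training texts contain no explicit end-of-turn signal other than the tags themselves.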


# Comparing the base and fine-tuned models

Compares the base model and the fine-tuned model on **3 hand-picked questions**:

1. `Q1`, `Q2` and `Q3` are questions that appear in the CSV.
2. For each question, the cell prints:
   - The answer from base Qwen.
   - The answer from Qwen after fine-tuning.


In [None]:
Q1 = "Quais marcas oferecem boas regatas básicas?"
Q2 = "Como combinar tênis em looks elegantes?"
Q3 = "Que cor de bolsa combina com roupas escuras além do preto?"

tests = [q for q in [Q1, Q2, Q3] if isinstance(q, str) and q.strip()]

# Run the comparison for each non-empty question
results = []
for idx, q in enumerate(tests, 1):
    prompt = build_prompt(q)
    base_ans = generate_only_new(base_model, prompt, max_new_tokens=220, temperature=0.1, top_p=0.9, seed=idx*11)
    ft_ans   = generate_only_new(merged,     prompt, max_new_tokens=220, temperature=0.1, top_p=0.9, seed=idx*11)
    results.append((q, base_ans, ft_ans))

# Print the results in a simple layout
for i, (q, base_ans, ft_ans) in enumerate(results, 1):
    print("="*80)
    print(f"[Question {i}]")
    print(q)
    print("-"*80)
    print("[QWEN BASE]")
    print(base_ans)
    print("-"*80)
    print("[QWEN FINE-TUNED]")
    print(ft_ans)
print("="*80 if results else "")


# Advanced semantic cluster analysis

This section implements a semantic analysis to visualize and quantify clusters of similar questions.
The goal is to understand how the model organizes the questions semantically and to identify classification patterns.

## Features:
1. **Embedding extraction**: extracts semantic representations of the questions using the fine-tuned model
2. **Multiple dimensionality-reduction techniques**: t-SNE, UMAP and PCA
3. **Clustering algorithms**: K-means, DBSCAN and hierarchical clustering
4. **Interactive visualizations**: 2D and 3D plots with Plotly
5. **Evaluation metrics**: Silhouette Score, Calinski-Harabasz Index
6. **Analysis of linguistic variation**: identifies similar questions that are phrased differently

In [None]:
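# Imports used by the analysis cells below (umap requires `pip install umap-learn`)
import plotly.express as px
import plotly.graph_objects as go
import umap
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import StandardScaler

CLUSTERING_CONFIG = {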
    "max_samples": 2000,
    "embedding_strategy": "last_hidden_state",
    "tsne_perplexity": 30,
    "tsne_learning_rate": 200,
    "umap_n_neighbors": 15,
    "umap_min_dist": 0.1,
    "n_clusters_range": range(2, 21),
    "random_state": SEED
}

In [None]:
# Extracts semantic representations of the questions using the fine-tuned model
# Converts text into numeric vectors that capture semantic meaning
def extract_question_embeddings(model, tokenizer, questions, strategy="last_hidden_state", max_length=512):

    model.eval()
    embeddings = []

    with torch.no_grad():
        for question in questions:
            # Tokenize the question
            inputs = tokenizer(
                question,
                return_tensors="pt",
                truncation=True,
                max_length=max_length,
                padding=True
            ).to(next(model.parameters()).device)

            # Run the model and collect the hidden states
            outputs = model(**inputs, output_hidden_states=True)

            if strategy == "last_hidden_state":
                # Use the hidden state of the last token (a causal LM has no CLS token;
                # the final position has attended to the whole question)
                embedding = outputs.hidden_states[-1][:, -1, :].float().cpu().numpy()
            elif strategy == "mean":
                # Mean over tokens (ignoring padding)
                attention_mask = inputs["attention_mask"]
                hidden_states = outputs.hidden_states[-1].float()
                mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
                masked_embeddings = hidden_states * mask
                sum_embeddings = torch.sum(masked_embeddings, dim=1)
                sum_mask = torch.sum(mask, dim=1)
                embedding = (sum_embeddings / sum_mask).cpu().numpy()
            elif strategy == "pooler":
                # Use the pooler output if available; otherwise fall back to the last token
                if hasattr(outputs, 'pooler_output') and outputs.pooler_output is not None:
                    embedding = outputs.pooler_output.float().cpu().numpy()
                else:
                    embedding = outputs.hidden_states[-1][:, -1, :].float().cpu().numpy()
            else:
                raise ValueError(f"Strategy '{strategy}' is not supported")

            embeddings.append(embedding[0])

    return np.array(embeddings)
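
The difference between the `mean` and last-token strategies can be seen on a toy tensor. This is a NumPy sketch with made-up values; `mean_pool` and `last_token_pool` are hypothetical helpers, and right-padding is assumed.

```python
import numpy as np

# Toy batch: 1 sequence, 4 positions, hidden size 3; the last position is padding
hidden = np.array([[[1.0, 0.0, 2.0],
                    [3.0, 0.0, 2.0],
                    [5.0, 3.0, 2.0],
                    [9.0, 9.0, 9.0]]])
mask = np.array([[1, 1, 1, 0]])

def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of real tokens only."""
    m = attention_mask[..., None].astype(hidden_states.dtype)
    return (hidden_states * m).sum(axis=1) / m.sum(axis=1)

def last_token_pool(hidden_states, attention_mask):
    """Take the hidden state of the last real (non-padding) token."""
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

print(mean_pool(hidden, mask))        # [[3. 1. 2.]]
print(last_token_pool(hidden, mask))  # [[5. 3. 2.]]
```

The notebook's function encodes one question at a time, so no padding is added and indexing position `-1` already picks the last real token.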

In [None]:
# Reduces the dimensionality of the embeddings for 2D visualization
# Applies PCA, t-SNE and UMAP to build visual representations of the data
def reduce_dimensions(embeddings, config):

    # Standardization
    scaler = StandardScaler()
    embeddings_scaled = scaler.fit_transform(embeddings)

    reductions = {}

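    # PCA (cap the component count: PCA cannot return more components than samples or features)
    print("Applying PCA...")
    n_pca = min(50, *embeddings_scaled.shape)
    pca = PCA(n_components=n_pca, random_state=config["random_state"])
    reductions["pca_50d"] = pca.fit_transform(embeddings_scaled)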

    pca_2d = PCA(n_components=2, random_state=config["random_state"])
    reductions["pca_2d"] = pca_2d.fit_transform(embeddings_scaled)

    # t-SNE
    print("Applying t-SNE...")
    tsne = TSNE(
        n_components=2,
        perplexity=config["tsne_perplexity"],
        learning_rate=config["tsne_learning_rate"],
        random_state=config["random_state"],
        max_iter=1000
    )
    reductions["tsne_2d"] = tsne.fit_transform(embeddings_scaled)

    # UMAP
    print("Applying UMAP...")
    umap_reducer = umap.UMAP(
        n_components=2,
        n_neighbors=config["umap_n_neighbors"],
        min_dist=config["umap_min_dist"],
        random_state=config["random_state"]
    )
    reductions["umap_2d"] = umap_reducer.fit_transform(embeddings_scaled)

    return reductions

In [None]:
# Finds the best number of clusters using quality metrics
# Tries several cluster counts and reports the best one under each metric
def find_optimal_clusters(embeddings, config):

    # Standardization
    scaler = StandardScaler()
    embeddings_scaled = scaler.fit_transform(embeddings)

    silhouette_scores = []
    calinski_scores = []
    inertias = []

    for n_clusters in config["n_clusters_range"]:
        kmeans = KMeans(n_clusters=n_clusters, random_state=config["random_state"], n_init=10)
        cluster_labels = kmeans.fit_predict(embeddings_scaled)

        # Silhouette Score
        sil_score = silhouette_score(embeddings_scaled, cluster_labels)
        silhouette_scores.append(sil_score)

        # Calinski-Harabasz Index
        cal_score = calinski_harabasz_score(embeddings_scaled, cluster_labels)
        calinski_scores.append(cal_score)

        # Inertia
        inertias.append(kmeans.inertia_)

    # Pick the best cluster count under each metric
    optimal_sil = config["n_clusters_range"][np.argmax(silhouette_scores)]
    optimal_cal = config["n_clusters_range"][np.argmax(calinski_scores)]

    return {
        "n_clusters_range": list(config["n_clusters_range"]),
        "silhouette_scores": silhouette_scores,
        "calinski_scores": calinski_scores,
        "inertias": inertias,
        "optimal_silhouette": optimal_sil,
        "optimal_calinski": optimal_cal
    }

In [None]:
# Applies clustering algorithms to group similar questions
# Uses K-Means, DBSCAN and hierarchical clustering to cross-check the results
def apply_clustering_algorithms(embeddings, n_clusters, config):

    # Standardization
    scaler = StandardScaler()
    embeddings_scaled = scaler.fit_transform(embeddings)

    results = {}

    # K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=config["random_state"], n_init=10)
    results["kmeans"] = kmeans.fit_predict(embeddings_scaled)

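    # DBSCAN (eps=0.5 is small for standardized high-dimensional embeddings;
    # if every point comes back as noise (-1), eps needs tuning)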
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    results["dbscan"] = dbscan.fit_predict(embeddings_scaled)

    # Hierarchical clustering
    hierarchical = AgglomerativeClustering(n_clusters=n_clusters)
    results["hierarchical"] = hierarchical.fit_predict(embeddings_scaled)

    return results

In [None]:
# Builds the 3 main visualizations of the semantic analysis
# Produces interactive plots showing clusters, density and similarity
def create_main_visualizations(questions, embeddings, reductions, cluster_results, optimal_n_clusters):

    print("Creating main visualizations...")

    print("Generating Plot 1: 2D clusters with t-SNE...")
    fig1 = go.Figure()
    colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel1 + px.colors.qualitative.Pastel2

    # Add the points of each cluster, sized by cluster size
    for cluster_id in range(optimal_n_clusters):
        mask = cluster_results["kmeans"] == cluster_id
        if np.any(mask):
            cluster_questions = [questions[i] for i in range(len(questions)) if mask[i]]
            cluster_size = len(cluster_questions)

            # Marker size grows with the cluster size, clamped to [8, 20]
            point_size = max(8, min(20, 8 + (cluster_size / 50)))

            fig1.add_trace(
                go.Scatter(
                    x=reductions["tsne_2d"][mask, 0],
                    y=reductions["tsne_2d"][mask, 1],
                    mode='markers',
                    marker=dict(
                        size=point_size,
                        color=colors[cluster_id % len(colors)],
                        opacity=0.7,
                        line=dict(width=1.5, color='white'),
                        symbol='circle'
                    ),
                    text=[f"Cluster {cluster_id}: {q[:50]}..." for q in cluster_questions],
                    hovertemplate='<b>%{text}</b><br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>',
                    name=f'Cluster {cluster_id} ({cluster_size} questions)',
                    legendgroup=f'cluster_{cluster_id}',
                    showlegend=True
                )
            )

    fig1.update_layout(
        title=dict(
            text="Plot 1: Semantic Clusters (t-SNE 2D)",
            font=dict(size=18, family="Arial Black")
        ),
        xaxis_title="t-SNE Dimension 1",
        yaxis_title="t-SNE Dimension 2",
        width=1000,
        height=700,
        showlegend=True,
        template="plotly_white",
        margin=dict(l=80, r=80, t=100, b=80),
        legend=dict(
            x=1.02,
            y=1,
            bgcolor="rgba(255,255,255,0.8)",
            bordercolor="Black",
            borderwidth=1,
            font=dict(size=12)
        ),
        xaxis=dict(
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            title_font=dict(size=14),
            tickfont=dict(size=12)
        ),
        yaxis=dict(
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            title_font=dict(size=14),
            tickfont=dict(size=12)
        )
    )

    fig1.show()

    print("Generating Plot 2: Density analysis...")
    cluster_stats = []
    for cluster_id in range(optimal_n_clusters):
        mask = cluster_results["kmeans"] == cluster_id
        if np.any(mask):
            cluster_embeddings = embeddings[mask]

            from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
            # Average over distinct pairs only (upper triangle, diagonal excluded);
            # a single-member cluster has no pairs, so its averages are NaN
            pair_idx = np.triu_indices(len(cluster_embeddings), k=1)
            has_pairs = pair_idx[0].size > 0
            distances = euclidean_distances(cluster_embeddings)
            similarities = cosine_similarity(cluster_embeddings)
            avg_distance = distances[pair_idx].mean() if has_pairs else np.nan
            avg_similarity = similarities[pair_idx].mean() if has_pairs else np.nan

            cluster_stats.append({
                'cluster_id': cluster_id,
                'size': np.sum(mask),
                'avg_distance': avg_distance,
                'avg_similarity': avg_similarity,
                'density_score': 1 / (1 + avg_distance)
            })

    cluster_stats = pd.DataFrame(cluster_stats)

    fig2 = go.Figure()
    # Give each cluster its own color
    colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel1 + px.colors.qualitative.Pastel2

    # Single trace: one circle per cluster, without text, colored by cluster id
    fig2.add_trace(
        go.Scatter(
            x=cluster_stats['avg_distance'],
            y=cluster_stats['avg_similarity'],
            mode='markers',
            marker=dict(
                size=cluster_stats['size'] * 2.5,
                color=[colors[int(cid) % len(colors)] for cid in cluster_stats['cluster_id']],
                opacity=0.8,
                line=dict(width=3, color='white'),
                sizemode='diameter',
                sizemin=20
            ),
            # Show the real cluster size on hover, not the scaled marker size
            hovertemplate='<b>Cluster %{customdata[0]}</b><br>Size: %{customdata[1]}<br>Avg Distance: %{x:.3f}<br>Avg Similarity: %{y:.3f}<extra></extra>',
            customdata=cluster_stats[['cluster_id', 'size']].to_numpy(),
            name='Clusters',
            showlegend=False
        )
    )

    # Add a legend table in the top-right corner
    # (paper coordinates, so its position does not depend on the data range)
    legend_x = 0.98
    legend_y = 0.95

    # Legend title
    fig2.add_annotation(
        x=legend_x - 0.04,
        y=legend_y + 0.03,
        xref="paper",
        yref="paper",
        text="<b>Cluster legend</b>",
        showarrow=False,
        font=dict(size=14, color='black'),
        xanchor="center",
        yanchor="middle"
    )

    for i, (cluster_id, size, avg_dist, avg_sim) in enumerate(zip(
        cluster_stats['cluster_id'],
        cluster_stats['size'],
        cluster_stats['avg_distance'],
        cluster_stats['avg_similarity']
    )):
        y_offset = 0.02 * (len(cluster_stats) - i - 1)

        # Colored square for this legend entry
        fig2.add_shape(
            type="rect",
            xref="paper",
            yref="paper",
            x0=legend_x - 0.08, y0=legend_y - y_offset - 0.008,
            x1=legend_x - 0.06, y1=legend_y - y_offset + 0.008,
            fillcolor=colors[int(cluster_id) % len(colors)],
            line=dict(width=1, color='white'),
            layer="above"
        )

        # Legend entry text
        fig2.add_annotation(
            x=legend_x - 0.05,
            y=legend_y - y_offset,
            xref="paper",
            yref="paper",
            text=f"Cluster {int(cluster_id)}: {int(size)} questions",
            showarrow=False,
            font=dict(size=12, color='black'),
            xanchor="left",
            yanchor="middle"
        )

    fig2.update_layout(
        title=dict(
            text="Plot 2: Cluster Density and Cohesion Analysis",
            font=dict(size=18, family="Arial Black")
        ),
        xaxis_title="Mean Internal Distance (lower = denser)",
        yaxis_title="Mean Internal Similarity (higher = more cohesive)",
        width=1200,
        height=700,
        template="plotly_white",
        margin=dict(l=80, r=200, t=100, b=80),
        xaxis=dict(
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            title_font=dict(size=14),
            tickfont=dict(size=12)
        ),
        yaxis=dict(
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            title_font=dict(size=14),
            tickfont=dict(size=12)
        )
    )

    fig2.show()

    print("Generating Plot 3: Similarity heatmap...")
    cluster_centroids = []
    for cluster_id in range(optimal_n_clusters):
        mask = cluster_results["kmeans"] == cluster_id
        if np.any(mask):
            centroid = np.mean(embeddings[mask], axis=0)
            cluster_centroids.append(centroid)
        else:
            cluster_centroids.append(np.zeros(embeddings.shape[1]))

    cluster_centroids = np.array(cluster_centroids)
    from sklearn.metrics.pairwise import cosine_similarity
    similarity_matrix = cosine_similarity(cluster_centroids)

    # Build the cell-label matrix with consistent formatting
    text_matrix = []
    for i in range(optimal_n_clusters):
        row = []
        for j in range(optimal_n_clusters):
            if i == j:
                row.append("1.000")  # The diagonal is always 1.000
            else:
                row.append(f"{similarity_matrix[i][j]:.3f}")
        text_matrix.append(row)

    fig3 = go.Figure(data=go.Heatmap(
        z=similarity_matrix,
        x=[f'C{i}' for i in range(optimal_n_clusters)],  # Short labels
        y=[f'C{i}' for i in range(optimal_n_clusters)],
        colorscale='RdYlBu_r',
        zmin=0,
        zmax=1,
        text=text_matrix,
        texttemplate="%{text}",
        textfont={"size": 12, "color": "black", "family": "Arial Bold"},
        hovertemplate='<b>Similarity between %{y} and %{x}</b><br>Value: %{z:.3f}<extra></extra>',
        showscale=True,
        colorbar=dict(
            title=dict(
                text="Similarity",
                side="right",
                font=dict(size=16)
            ),
            tickmode="linear",
            tick0=0,
            dtick=0.2,
            len=0.8,
            thickness=25,
            tickfont=dict(size=14)
        ),
        xgap=2,  # Gap between cells
        ygap=2
    ))

    fig3.update_layout(
        title=dict(
            text="Plot 3: Similarity Matrix Between Clusters",
            font=dict(size=18, family="Arial Black")
        ),
        xaxis_title="Clusters",
        yaxis_title="Clusters",
        width=1000,
        height=700,
        template="plotly_white",
        margin=dict(l=80, r=80, t=100, b=80),
        xaxis=dict(
            tickfont=dict(size=14),
            title_font=dict(size=16),
            side="bottom",
            tickangle=0
        ),
        yaxis=dict(
            tickfont=dict(size=14),
            title_font=dict(size=16),
            autorange="reversed",
            tickangle=0
        )
    )

    fig3.show()

    return {
        'cluster_stats': cluster_stats,
        'similarity_matrix': similarity_matrix,
        'cluster_centroids': cluster_centroids
    }


In [None]:
# Builds a confidence visualization inspired by the professor's code
# Shows hits vs. misses, with marker size proportional to the prediction confidence
def create_confidence_analysis_plot(questions, embeddings, reductions, cluster_results, confidence_data):

    y_true = cluster_results["kmeans"]  # K-means labels, treated as the reference clusters
    y_pred = confidence_data['predictions']  # Similarity-based predictions
    y_conf = confidence_data['confidence']  # Confidence

    # A prediction counts as correct when it matches the reference cluster
    is_correct = (y_true == y_pred).astype(int)

    # Marker sizes derived from the confidence
    sizes = 20 + 80 * y_conf

    fig4 = go.Figure()

    # Correct points (green)
    correct_mask = is_correct == 1
    fig4.add_trace(
        go.Scatter(
            x=reductions["tsne_2d"][correct_mask, 0],
            y=reductions["tsne_2d"][correct_mask, 1],
            mode='markers',
            marker=dict(
                size=sizes[correct_mask],
                color='green',
                opacity=0.7,
                line=dict(width=1, color='darkgreen')
            ),
            name='Correct predictions',
            # Show the actual confidence on hover, not the marker size derived from it
            customdata=y_conf[correct_mask],
            hovertemplate='<b>Correct</b><br>Confidence: %{customdata:.3f}<br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>'
        )
    )

    # Incorrect points (red), only if there are errors
    incorrect_mask = is_correct == 0
    if incorrect_mask.any():
        fig4.add_trace(
            go.Scatter(
                x=reductions["tsne_2d"][incorrect_mask, 0],
                y=reductions["tsne_2d"][incorrect_mask, 1],
                mode='markers',
                marker=dict(
                    size=sizes[incorrect_mask],
                    color='red',
                    opacity=0.7,
                    line=dict(width=1, color='darkred')
                ),
                name='Incorrect predictions',
                # Show the actual confidence on hover, not the marker size derived from it
                customdata=y_conf[incorrect_mask],
                hovertemplate='<b>Error</b><br>Confidence: %{customdata:.3f}<br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>'
            )
        )
    else:
        # With no errors, add an informative annotation instead
        fig4.add_annotation(
            x=0.5,
            y=0.95,
            xref="paper",
            yref="paper",
            text="All predictions are correct",
            showarrow=False,
            font=dict(size=16, color='green'),
            bgcolor='rgba(255,255,255,0.8)',
            bordercolor='green',
            borderwidth=2
        )

    fig4.update_layout(
        title=dict(
            text="Plot 4: Confidence and Quality Analysis",
            font=dict(size=18, family="Arial Black")
        ),
        xaxis_title="t-SNE Dimension 1",
        yaxis_title="t-SNE Dimension 2",
        width=1000,
        height=700,
        showlegend=True,
        template="plotly_white",
        margin=dict(l=80, r=80, t=100, b=80),
        xaxis=dict(
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            title_font=dict(size=14),
            tickfont=dict(size=12)
        ),
        yaxis=dict(
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            title_font=dict(size=14),
            tickfont=dict(size=12)
        )
    )

    fig4.show()

    # Analysis of highly confident errors
    high_conf_threshold = np.nanpercentile(y_conf, 85)
    high_conf_errors = (is_correct == 0) & (y_conf >= high_conf_threshold)

    print(f"\n📊 QUALITY ANALYSIS:")
    print(f"   • Overall accuracy: {(is_correct.mean() * 100):.1f}%")
    print(f"   • Mean confidence: {y_conf.mean():.3f}")
    print(f"   • Max confidence: {y_conf.max():.3f}")
    print(f"   • Min confidence: {y_conf.min():.3f}")

    if incorrect_mask.any():
        print(f"   • Highly confident errors (≥{high_conf_threshold:.2f}): {high_conf_errors.sum()}")
        max_error_conf = y_conf[incorrect_mask].max()
        print(f"   • Highest confidence on an error: {max_error_conf:.3f}")
    else:
        print(f"   • All predictions are correct, with a mean confidence of {y_conf.mean():.3f}")
        max_error_conf = 0.0

    return {
        'accuracy': is_correct.mean(),
        'avg_confidence': y_conf.mean(),
        'high_confidence_errors': high_conf_errors.sum(),
        'max_error_confidence': max_error_conf
    }
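
The highly-confident-error check above combines a percentile cut-off with the correctness mask. A minimal sketch on made-up values shows which points end up flagged; the arrays `toy_conf` and `toy_correct` are hypothetical and not taken from the notebook's data.

```python
import numpy as np

# Hypothetical confidences and correctness flags (1 = correct, 0 = error).
toy_conf = np.linspace(0.1, 1.0, 10)
toy_correct = np.array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0])

# Same rule as in the function above: the 85th percentile of the confidences
# is the cut-off, and an error at or above it counts as "highly confident".
toy_threshold = np.nanpercentile(toy_conf, 85)
toy_flagged = (toy_correct == 0) & (toy_conf >= toy_threshold)

print(f"threshold = {toy_threshold:.3f}")
print("flagged indices:", np.flatnonzero(toy_flagged).tolist())
```

Only the error at index 9 is flagged; the error at index 2 sits well below the cut-off. Because the threshold is relative to the data, the top 15% or so of points are always eligible, however low their absolute confidence is.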


In [None]:
# Analyzes model confidence and quality based on similarity to cluster centroids.
# Simulates predictions and computes confidence metrics to identify errors.
def analyze_model_confidence_and_quality(questions, embeddings, cluster_results, model, tokenizer):
    # Simulate predictions based on similarity to the centroids
    # (questions, model and tokenizer are not used in this function)
    from sklearn.metrics.pairwise import cosine_similarity

    cluster_labels = cluster_results["kmeans"]
    cluster_centroids = []

    for cluster_id in range(len(np.unique(cluster_labels))):
        mask = cluster_labels == cluster_id
        if np.any(mask):
            centroid = np.mean(embeddings[mask], axis=0)
            cluster_centroids.append(centroid)
        else:
            cluster_centroids.append(np.zeros(embeddings.shape[1]))

    cluster_centroids = np.array(cluster_centroids)

    # Compute similarities to all centroids
    similarities = cosine_similarity(embeddings, cluster_centroids)

    # Predictions based on the nearest centroid
    y_pred = np.argmax(similarities, axis=1)

    # Confidence: margin between the best and second-best similarity
    # (requires at least two clusters)
    sorted_similarities = np.sort(similarities, axis=1)
    confidence = sorted_similarities[:, -1] - sorted_similarities[:, -2]

    # Min-max normalize confidence to the 0-1 range
    confidence = (confidence - confidence.min()) / (confidence.max() - confidence.min() + 1e-8)

    return {
        'predictions': y_pred,
        'confidence': confidence,
        'similarities': similarities,
        'cluster_centroids': cluster_centroids
    }
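
The confidence score above is a margin: the gap between a question's best and second-best centroid similarity, min-max scaled over the batch. The sketch below reproduces that calculation on a small hand-written similarity matrix. The helper name `margin_confidence` and the values in `toy_similarities` are assumptions made for illustration.

```python
import numpy as np

def margin_confidence(similarities):
    """Top-1 minus top-2 similarity per row, min-max scaled to roughly [0, 1]."""
    ordered = np.sort(similarities, axis=1)
    margin = ordered[:, -1] - ordered[:, -2]
    span = margin.max() - margin.min()
    # The small epsilon mirrors the function above and avoids division by zero.
    return (margin - margin.min()) / (span + 1e-8)

# Hypothetical similarities of three questions to three centroids.
toy_similarities = np.array([
    [0.90, 0.10, 0.05],  # clear winner: large margin
    [0.50, 0.48, 0.10],  # near tie: small margin
    [0.20, 0.70, 0.30],  # moderate margin
])

toy_pred = np.argmax(toy_similarities, axis=1)
toy_margin_conf = margin_confidence(toy_similarities)

print("predicted clusters:", toy_pred.tolist())
print("confidence:", np.round(toy_margin_conf, 3).tolist())
```

The near-tie row gets the lowest confidence in the batch. Since the scaling is relative to the batch, a value of 0 means "smallest margin observed", not "no separation at all".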


In [None]:
# Identifies linguistic variations within clusters.
# Finds questions with similar meanings that are phrased differently.
def analyze_linguistic_variations(questions, embeddings, cluster_labels, threshold=0.8):
    from sklearn.metrics.pairwise import cosine_similarity

    # Compute the cosine similarity matrix
    similarity_matrix = cosine_similarity(embeddings)

    variations = {}

    for cluster_id in np.unique(cluster_labels):
        if cluster_id == -1:  # Skip DBSCAN outliers
            continue

        cluster_mask = cluster_labels == cluster_id
        cluster_questions = [questions[i] for i in range(len(questions)) if cluster_mask[i]]
        cluster_indices = np.where(cluster_mask)[0]

        if len(cluster_indices) < 2:
            continue

        # Find similar pairs within the cluster
        cluster_similarities = similarity_matrix[np.ix_(cluster_indices, cluster_indices)]

        similar_pairs = []
        for i in range(len(cluster_indices)):
            # j starts at i + 1, so a question is never compared with itself
            for j in range(i+1, len(cluster_indices)):
                if cluster_similarities[i, j] >= threshold:
                    similar_pairs.append({
                        "question1": cluster_questions[i],
                        "question2": cluster_questions[j],
                        "similarity": cluster_similarities[i, j]
                    })

        if similar_pairs:
            variations[f"cluster_{cluster_id}"] = {
                "total_questions": len(cluster_questions),
                "similar_pairs": similar_pairs,
                "variation_rate": len(similar_pairs) / (len(cluster_questions) * (len(cluster_questions) - 1) / 2)
            }

    return variations
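
The pair search above walks the upper triangle of the cluster's similarity matrix and reports the share of pairs at or above the threshold. A vectorized sketch of the same idea for a single cluster follows. It uses plain NumPy instead of scikit-learn, assumes no embedding is the zero vector, and the names `cosine_matrix`, `similar_pair_rate` and `toy_vectors` are hypothetical.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity; assumes every row has a non-zero norm."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def similar_pair_rate(vectors, threshold):
    """Return the (i, j) pairs with i < j above the threshold and their share."""
    sims = cosine_matrix(vectors)
    rows, cols = np.triu_indices(len(vectors), k=1)  # each unordered pair once
    hits = sims[rows, cols] >= threshold
    pairs = [(int(i), int(j)) for i, j in zip(rows[hits], cols[hits])]
    return pairs, float(hits.mean())

# Hypothetical 2-D embeddings: the first two point in almost the same direction.
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_pairs, toy_rate = similar_pair_rate(toy_vectors, threshold=0.8)

print("similar pairs:", toy_pairs)
print(f"variation rate: {toy_rate:.3f}")
```

With three vectors there are three possible pairs and only the first two vectors are close, so one pair out of three is reported. The denominator matches `n * (n - 1) / 2` used for `variation_rate` above.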

In [None]:
# Finds representative examples of each cluster.
# Selects the questions closest to each group's centroid.
def analyze_cluster_examples(questions, cluster_labels, embeddings, n_examples=5):
    from sklearn.metrics.pairwise import cosine_similarity

    cluster_examples = {}

    for cluster_id in np.unique(cluster_labels):
        if cluster_id == -1:  # Skip DBSCAN outliers
            continue

        cluster_mask = cluster_labels == cluster_id
        cluster_questions = [questions[i] for i in range(len(questions)) if cluster_mask[i]]
        cluster_indices = np.where(cluster_mask)[0]
        cluster_embeddings = embeddings[cluster_indices]

        if len(cluster_questions) < 2:
            continue

        # Find the cluster centroid
        centroid = np.mean(cluster_embeddings, axis=0)

        # Compute each question's similarity to the centroid
        similarities = cosine_similarity([centroid], cluster_embeddings)[0]

        # Take the examples closest to the centroid
        top_indices = np.argsort(similarities)[-n_examples:][::-1]

        cluster_examples[f"cluster_{cluster_id}"] = {
            "total_questions": len(cluster_questions),
            "representative_examples": [
                {
                    "question": cluster_questions[idx],
                    "similarity_to_centroid": similarities[idx]
                }
                for idx in top_indices
            ]
        }

    return cluster_examples
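
Representative examples are picked by ranking a cluster's members by cosine similarity to the cluster mean and keeping the top `n_examples`. The sketch below isolates that ranking step with NumPy only. The helper `closest_to_centroid` and the toy vectors are assumptions for illustration and assume non-zero embeddings.

```python
import numpy as np

def closest_to_centroid(vectors, n_examples):
    """Indices of the rows most similar to the mean vector, best first."""
    centroid = vectors.mean(axis=0)
    sims = (vectors @ centroid) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    )
    # argsort is ascending, so take the tail and reverse it.
    order = np.argsort(sims)[-n_examples:][::-1]
    return order, sims[order]

# Hypothetical cluster of three 2-D embeddings.
toy_cluster = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

top_idx, top_sims = closest_to_centroid(toy_cluster, n_examples=2)
print("most representative rows:", top_idx.tolist())

# Asking for more examples than the cluster holds returns every member.
all_idx, _ = closest_to_centroid(toy_cluster, n_examples=5)
print("rows returned when n_examples > cluster size:", len(all_idx))
```

The middle vector lies between the other two, so it ranks first. The slicing also explains why the function above never fails on small clusters: a negative slice longer than the array yields the whole array.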

In [None]:
# Prepare data
questions = df_trainable["instruction"].tolist()
if CLUSTERING_CONFIG["max_samples"] and len(questions) > CLUSTERING_CONFIG["max_samples"]:
    # Random sampling for large datasets
    np.random.seed(CLUSTERING_CONFIG["random_state"])
    indices = np.random.choice(len(questions), CLUSTERING_CONFIG["max_samples"], replace=False)
    questions = [questions[i] for i in indices]
    print(f"Using a sample of {len(questions)} questions out of {len(df_trainable)} total")

print(f"Extracting embeddings for {len(questions)} questions...")

# Extract embeddings using the fine-tuned model
embeddings = extract_question_embeddings(
    merged, tok, questions,
    strategy=CLUSTERING_CONFIG["embedding_strategy"]
)

print(f"Embeddings extracted: {embeddings.shape}")

# Reduce dimensionality
reductions = reduce_dimensions(embeddings, CLUSTERING_CONFIG)

# Find the optimal number of clusters
optimization_results = find_optimal_clusters(embeddings, CLUSTERING_CONFIG)

print(f"Optimal number of clusters (Silhouette): {optimization_results['optimal_silhouette']}")
print(f"Optimal number of clusters (Calinski-Harabasz): {optimization_results['optimal_calinski']}")

# Use the optimal number according to the Silhouette Score
optimal_n_clusters = optimization_results['optimal_silhouette']

# Apply the clustering algorithms
cluster_results = apply_clustering_algorithms(embeddings, optimal_n_clusters, CLUSTERING_CONFIG)

# Create the main visualizations
visualization_data = create_main_visualizations(questions, embeddings, reductions, cluster_results, optimal_n_clusters)

# Additional analysis
confidence_data = analyze_model_confidence_and_quality(questions, embeddings, cluster_results, merged, tok)
quality_metrics = create_confidence_analysis_plot(questions, embeddings, reductions, cluster_results, confidence_data)

# Analyze linguistic variations
linguistic_variations = analyze_linguistic_variations(
    questions, embeddings, cluster_results["kmeans"], threshold=0.7
)

# Analyze representative examples of each cluster
cluster_examples = analyze_cluster_examples(
    questions, cluster_results["kmeans"], embeddings, n_examples=3
)


print("\n📊 GENERAL STATISTICS:")
print(f"   • Total questions analyzed: {len(questions)}")
print(f"   • Embedding dimension: {embeddings.shape[1]}")
print(f"   • Optimal number of clusters: {optimal_n_clusters}")

print(f"   • Best Silhouette Score: {max(optimization_results['silhouette_scores']):.3f}")
print(f"   • Best Calinski-Harabasz Score: {max(optimization_results['calinski_scores']):.3f}")

print("\n🔍 CLUSTER DISTRIBUTION (K-Means):")
cluster_counts = np.bincount(cluster_results["kmeans"])
for i, count in enumerate(cluster_counts):
    percentage = (count / len(questions)) * 100
    print(f"   • Cluster {i}: {count} questions ({percentage:.1f}%)")

print("\n📈 CLUSTER QUALITY:")
cluster_stats = visualization_data['cluster_stats']
avg_similarity = cluster_stats['avg_similarity'].mean()
avg_density = cluster_stats['density_score'].mean()
print(f"   • Mean intra-cluster similarity: {avg_similarity:.3f}")
print(f"   • Mean density: {avg_density:.3f}")

print("\n🎯 QUALITY METRICS:")
print(f"   • Accuracy: {quality_metrics['accuracy']:.1%}")
print(f"   • Mean confidence: {quality_metrics['avg_confidence']:.3f}")
print(f"   • Highly confident errors: {quality_metrics['high_confidence_errors']}")
print(f"   • Highest confidence on an error: {quality_metrics['max_error_confidence']:.3f}")

print("\n🔍 LINGUISTIC VARIATIONS IDENTIFIED:")
total_variations = 0
for cluster_name, variation_data in linguistic_variations.items():
    cluster_id = cluster_name.split("_")[1]
    similar_pairs = len(variation_data["similar_pairs"])
    total_variations += similar_pairs
    if similar_pairs > 0:
        print(f"   • Cluster {cluster_id}: {similar_pairs} similar pairs")

print("\n✅ FINAL SUMMARY:")
print(f"   • Similar question pairs identified: {total_variations}")
# This counts pairs, not questions, so the ratio can exceed 100
print(f"   • Similar pairs per 100 questions: {(total_variations / len(questions)) * 100:.1f}")
print(f"   • Silhouette Score of the selected clustering: {max(optimization_results['silhouette_scores']):.3f}")
print(f"   • Accuracy: {quality_metrics['accuracy']:.1%} with a mean confidence of {quality_metrics['avg_confidence']:.3f}")
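
The sampling step at the top of the cell above seeds NumPy's global random state, which also affects any later code that draws from `np.random`. One possible alternative, sketched here under the assumption that only the subsample needs to be reproducible, is a local `Generator`. The helper `sample_questions` is hypothetical and not part of the notebook. It also returns the chosen indices so the sample can be traced back to rows of the original data.

```python
import numpy as np

def sample_questions(questions, max_samples, seed):
    """Reproducible subsample without touching the global NumPy RNG."""
    if not max_samples or len(questions) <= max_samples:
        return list(questions), np.arange(len(questions))
    rng = np.random.default_rng(seed)  # local generator, isolated state
    indices = rng.choice(len(questions), size=max_samples, replace=False)
    return [questions[i] for i in indices], indices

# Hypothetical question list standing in for the real dataset.
toy_questions = [f"question {i}" for i in range(20)]

sampled, sampled_idx = sample_questions(toy_questions, max_samples=5, seed=42)
repeat, repeat_idx = sample_questions(toy_questions, max_samples=5, seed=42)

print("sample size:", len(sampled))
print("same sample on repeat:", sampled == repeat)
```

Calling the helper twice with the same seed yields the same sample, and passing `max_samples=None` returns the full list unchanged, matching the guard in the cell above.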