# 06 - Text Similarity (Cosine Similarity)

Objective:
- Compute the similarity between messages using TF-IDF.
- Find the messages most similar to a given message.

In [1]:
import sys  # Needed to import the helper function from src
import os

# Add src to the Python path
sys.path.append(os.path.abspath("../src"))

In [2]:
import pandas as pd
import pickle
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentiment_utils import sentimiento_auto

In [3]:
def preparar_similitud(slug):
    df = pd.read_csv(f"../data/{slug}/messages_clean_ready.csv")  # Path to the file generated in file 3
    with open(f"../data/{slug}/X_tfidf.pkl", "rb") as f:  # Load the numeric TF-IDF matrix
        X = pickle.load(f)
    with open(f"../data/{slug}/tfidf_vectorizer.pkl", "rb") as f:  # Load the vectorizer to interpret terms
        vectorizer = pickle.load(f)

    S = cosine_similarity(X)  # Pairwise similarity: each document against all the others
    return df, X, vectorizer, S
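
`preparar_similitud` returns the pairwise matrix `S`, where `S[i, j]` is the similarity between message `i` and message `j`. As a rough sketch of how that matrix could be used to look up the nearest neighbours of a message already in the dataset, assuming `S` is a dense square array (the helper name `top_neighbors` and the toy matrix are hypothetical, not part of the notebook):

```python
import numpy as np

def top_neighbors(S, idx, top_n=5):
    """Return (row, score) pairs for the top_n rows most similar to row idx."""
    row = np.asarray(S[idx], dtype=float)
    # Sort by descending similarity and skip the document itself,
    # which always has similarity 1.0 with its own vector.
    order = [int(j) for j in np.argsort(row)[::-1] if j != idx][:top_n]
    return [(j, float(row[j])) for j in order]

# Toy 3x3 similarity matrix, made up for illustration.
S_demo = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

print(top_neighbors(S_demo, 0, top_n=2))
```

Because the full matrix has one entry per pair of messages, its size grows quadratically with the number of rows; for larger datasets it may be cheaper to compare a single row against `X` on demand.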

This function takes a new text from the user and finds the most similar messages in the repository using TF-IDF and cosine similarity. It also computes an automatic sentiment label for the user's text and reports a sentiment label for each result.

In [4]:
def similares_por_texto(df, X, vectorizer, texto_usuario, top_n=5, max_chars=240):
    X_user = vectorizer.transform([texto_usuario])  # Vectorize the user's text
    sims = cosine_similarity(X_user, X).flatten()  # Similarity of the user's text against every message
    best = np.argsort(sims)[::-1][:top_n]  # Indices of the top_n highest scores

    # Sentiment of the input text
    sent_user = sentimiento_auto(texto_usuario)  # Computed with the lexicon-based function
    print("\nTEXTO INGRESADO:")
    print(texto_usuario[:max_chars])
    print("Sentimiento (auto):", sent_user)

    print("\nMAS SIMILARES:")
    for idx in best:
        sim = float(sims[idx])  # Similarity score against the user's text
        issue = df.loc[idx, "issue_number"]  # Issue number associated with that document

        # If the sentiment column exists, use it; otherwise compute it on the fly
        if "sentiment" in df.columns:
            sent = df.loc[idx, "sentiment"]
        else:
            sent = sentimiento_auto(df.loc[idx, "text_clean"])  # Fall back to text_clean

        print(f"\nSim={sim:.3f} | issue={issue} | sentimiento={sent}")
        print(df.loc[idx, "text_clean"][:max_chars])
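
To make the `Sim` score concrete, here is a minimal pure-Python sketch of cosine similarity between two sparse term-weight vectors. The helper name `cosine_sparse` and the toy weights are illustrative assumptions; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    # The dot product only needs the terms that both vectors share.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy weights, made up for illustration (not real TF-IDF values).
query = {"rocm": 1.0, "error": 1.0}
doc_a = {"rocm": 1.0}
doc_b = {"cookies": 2.0, "debug": 1.0}

print(round(cosine_sparse(query, doc_a), 3))  # 0.707
print(cosine_sparse(query, doc_b))            # 0.0
```

A score of 0 means the two texts share no weighted terms, and a score of 1 means their vectors point in the same direction, which is what happens when a message is compared with itself.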

In [5]:
def buscar_por_keyword(df, keyword, n=5):  # DataFrame with the messages, keyword, and number of results to show
    kw = keyword.lower().strip()  # Lowercase the keyword in case it was typed in uppercase
    # Literal substring match (regex=False), so characters like "+" or "." are not read as regex syntax
    indices = df[df["text_clean"].str.contains(kw, na=False, regex=False)].index.tolist()
    print(f"\nKeyword='{kw}' | encontrados={len(indices)}")
    for idx in indices[:n]:  # Show the first results
        print(f"- idx={idx} | issue={df.loc[idx,'issue_number']} | {df.loc[idx,'text_clean'][:180]}")
    return indices
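
One detail worth keeping in mind for keyword searches: `Series.str.contains` treats its pattern as a regular expression unless `regex=False` is passed. The sketch below, on a made-up Series (the sample strings are assumptions for illustration), shows how the two modes can differ for keywords that contain regex metacharacters.

```python
import pandas as pd

# Made-up sample messages, including a missing value.
texts = pd.Series(["crash with c++ build", "error in v1x2 loader", "fixed in v1.2", None])

# Default (regex=True): "." matches any character, so "v1x2" also matches.
as_regex = texts.str.contains("v1.2", na=False)

# regex=False: the keyword is matched as a literal substring.
as_literal = texts.str.contains("v1.2", na=False, regex=False)

# A keyword such as "c++" is not a valid regular expression,
# so it is only safe to search for it literally.
cpp_hits = texts.str.contains("c++", na=False, regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
print(cpp_hits.tolist())
```

For plain words such as "error" or "cuda" both modes return the same rows, so the difference only shows up with keywords that contain punctuation.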

Random test texts

Positive: Thank you so much, very helpful. Would it be possible to add important information to the Repos homepage? I think many people would benefit.

Negative: ROCM version error, it becomes unresponsive, very low quality, checklist steps are skipped

# TEST WITH REPO 1

In [38]:
slug = "auto1111_webui"  # Available slugs: ytdlp, auto1111_webui
df, X, vectorizer, S = preparar_similitud(slug)

# 1) Try with a random text
mi_mensaje = "ROCM version error, it becomes unresponsive, very low quality, checklist steps are skipped"

similares_por_texto(df, X, vectorizer, mi_mensaje, top_n=5)

# 2) Try a keyword search
idxs = buscar_por_keyword(df, "error", n=5)


TEXTO INGRESADO:
ROCM version error, it becomes unresponsive, very low quality, checklist steps are skipped
Sentimiento (auto): negativo

MAS SIMILARES:

Sim=0.145 | issue=16954 | sentimiento=negativo
bug rocm version turns stupid super low quality skipping steps checklist x issue exists disabling extensions x issue exists clean installation webui issue caused extension believe caused bug webui x issue exists current version webui x issu

Sim=0.136 | issue=17031 | sentimiento=negativo
error practically specs steps also reproduced error stable version v cloning master branch

Sim=0.134 | issue=16807 | sentimiento=positivo
might better using fork made support amd gpu rocm using rx user compiled rocm libs etc works well

Sim=0.122 | issue=16807 | sentimiento=neutral
update using guess rocm must even earlier version python

Sim=0.120 | issue=17031 | sentimiento=negativo
bug error trying install pytorch checklist issue exists disabling extensions issue exists clean installation webui issue

In [39]:
# If there are results, find the messages most similar to the FIRST match
if idxs:
    base = df.loc[idxs[0], "text_clean"]
    similares_por_texto(df, X, vectorizer, base, top_n=5)


TEXTO INGRESADO:
bug runtimeerror clone stable diffusion checklist x issue exists disabling extensions x issue exists clean installation webui issue caused extension believe caused bug webui x issue exists current version webui x issue reported recently iss
Sentimiento (auto): negativo

MAS SIMILARES:

Sim=1.000 | issue=17255 | sentimiento=negativo
bug runtimeerror clone stable diffusion checklist x issue exists disabling extensions x issue exists clean installation webui issue caused extension believe caused bug webui x issue exists current version webui x issue reported recently iss

Sim=0.951 | issue=17216 | sentimiento=negativo
bug repository found checklist issue exists disabling extensions issue exists clean installation webui issue caused extension believe caused bug webui issue exists current version webui issue reported recently issue reported fixed yet happe

Sim=0.903 | issue=16861 | sentimiento=negativo
bug torch able use gpu add skip torch cuda test commandline args varia

# TEST WITH REPO 2

Random test texts

Positive: Thanks a lot, this works perfectly now after the update. Great job!

Negative: Caution: This comment may contain links with malicious content. Everything is still failing. A ytdlp dump file is being generated.

In [40]:
slug = "ytdlp"
df, X, vectorizer, S = preparar_similitud(slug)

# 1) Try with your own text
mi_mensaje = "Caution: The comment may contain links with malicious content. Everything is still failing. A ytdlp dump file is generated"
similares_por_texto(df, X, vectorizer, mi_mensaje, top_n=5)

# 2) Try a keyword search
idxs = buscar_por_keyword(df, "cuda", n=5)


TEXTO INGRESADO:
Caution: The comment may contain links with malicious content. Everything is still failing. A ytdlp dump file is generated
Sentimiento (auto): negativo

MAS SIMILARES:

Sim=0.679 | issue=15287 | sentimiento=negativo
caution comment may contain links malicious content follow links since html contains dash url everything still fails dump file generated ytdlp

Sim=0.179 | issue=15740 | sentimiento=neutral
sure coincidence happen links sites

Sim=0.139 | issue=15774 | sentimiento=neutral
yep content targeting protected drm yt dlp support downloading drm content extractor still work non drm content like stuff national geographic everything could find abc drm thus downloadable

Sim=0.113 | issue=15520 | sentimiento=negativo
bashonly assuming maintainer commands failing morning work without changes done side sure test reddit side bug right longer issues

Sim=0.113 | issue=15660 | sentimiento=negativo
updates still issue

Keyword='cuda' | encontrados=1
- idx=150 | issue=15784

In [41]:
# If there are results, find the messages most similar to the FIRST match
if idxs:
    base = df.loc[idxs[0], "text_clean"]
    similares_por_texto(df, X, vectorizer, base, top_n=5)


TEXTO INGRESADO:
stripchat preview checklist x reporting yt dlp broken supported site x verified updated yt dlp nightly master update instructions x checked provided urls playable browser ip login details x checked urls arguments special characters properly
Sentimiento (auto): negativo

MAS SIMILARES:

Sim=1.000 | issue=15784 | sentimiento=negativo
stripchat preview checklist x reporting yt dlp broken supported site x verified updated yt dlp nightly master update instructions x checked provided urls playable browser ip login details x checked urls arguments special characters properly

Sim=0.649 | issue=15891 | sentimiento=negativo
please add f command cookies show output debug command line config vu f cookies mp cookies cookies txt debug system config etc yt dlp conf debug encodings locale utf fs utf pref utf utf ansi error utf ansi screen utf ansi debug yt dlp versi

Sim=0.645 | issue=15891 | sentimiento=negativo

Sim=0.253 | issue=15886 | sentimiento=negativo
add support redzidzird