In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Similitud_Libros").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/07 17:51:38 WARN Utils: Your hostname, Antonio, resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
25/12/07 17:51:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/07 17:51:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
df_tfidf = spark.read.parquet("notebooks/tfidf_vectors.parquet")

### Spark stores vectors inside DataFrames. The next cell selects the "filename" and "tfidf" columns and converts them to an RDD, mapping each row to a Python tuple.

In [6]:
from pyspark.ml.linalg import VectorUDT

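# Keep only the document name and its TF-IDF vector, as (filename, vector) tuples.
rdd_vectors = df_tfidf.select("filename", "tfidf").rdd.map(lambda x: (x[0], x[1]))
# Cache the RDD because the cartesian product further down reads it twice.
rdd_vectors = rdd_vectors.cache()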

### First, the dot product of the two vectors is computed; then their magnitudes (norms) are calculated; finally, the dot product is divided by the product of the magnitudes to obtain the cosine similarity, which indicates how similar the documents are.

In [7]:
from pyspark.ml.linalg import Vectors
import math

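def coseno(v1, v2):
    """Cosine similarity between two pyspark.ml.linalg vectors."""
    dot = float(v1.dot(v2))
    norm1 = math.sqrt(v1.dot(v1))
    norm2 = math.sqrt(v2.dot(v2))
    # Guard against all-zero vectors (e.g. an empty document) to avoid ZeroDivisionError.
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)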
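
To make the formula concrete, here is a minimal pure-Python sketch of the same computation on plain lists, cos(a, b) = a·b / (‖a‖ ‖b‖). The helper name `cosine_plain` and the toy vectors are illustrative assumptions, not part of the notebook; it mirrors `coseno` but does not use Spark vectors.

```python
import math

def cosine_plain(a, b):
    """Cosine similarity of two equal-length lists of numbers (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Toy TF-IDF-like vectors (made-up values).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

sim_ab = cosine_plain(doc_a, doc_b)   # very close to 1: same direction
sim_ac = cosine_plain(doc_a, doc_c)   # exactly 0: no overlap
print(sim_ab, sim_ac)
```

With non-negative weights such as TF-IDF, the result falls between 0 and 1, and it depends only on the direction of the vectors, not on document length.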

### This cell generates every possible pair of vectors with cartesian, computes the cosine similarity of each pair, and then removes the comparisons of a vector with itself. Finally, it shows the first ten similarities. In short, it measures how similar all the vectors are to each other using the cosine metric.

In [8]:
pairs = rdd_vectors.cartesian(rdd_vectors)

similitudes = pairs.map(lambda x: (x[0][0], x[1][0], coseno(x[0][1], x[1][1]))) \
                   .filter(lambda x: x[0] != x[1])

similitudes.take(10)


                                                                                

[('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/Dracula.txt',
  0.046271933474976934),
 ('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/Little%20Women.txt',
  0.057500250563714855),
 ('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/The%20Odyssey.txt',
  0.09529164955938274),
 ('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/The%20Works%20of%20Edgar%20Allan%20Poe%20—%20Volume%202.txt',
  0.20152729398636623),
 ('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/The%20Adventures%20of%20Sherlock%20Holmes.txt',
  0.11463865487679491),
 ('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/The%20Aeneid.txt',
  0.08010926591648669),
 ('file:///home/robc/SistDist/books/The%20Republic.txt',
  'file:///home/robc/SistDist/books/Thus%20Spake%20Zara
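
Note that `cartesian` produces ordered pairs, so after the self-comparison filter each pair of books still appears twice, as (a, b) and (b, a). Below is a minimal pure-Python sketch of the difference, using made-up titles. In the notebook, one way to keep a single copy would be to change the filter to `x[0] < x[1]`; that is a suggestion, not what the original cell does.

```python
from itertools import combinations, product

titles = ["Dracula", "Little Women", "The Odyssey", "The Republic"]  # made-up sample

# What cartesian + filter(a != b) keeps: every ordered pair of distinct items.
ordered = [(a, b) for a, b in product(titles, repeat=2) if a != b]

# Keeping only a < b leaves one copy of each unordered pair.
unique = [(a, b) for a, b in ordered if a < b]

print(len(ordered))  # n * (n - 1) ordered pairs
print(len(unique))   # n * (n - 1) / 2 unordered pairs
assert sorted(unique) == sorted(combinations(sorted(titles), 2))
```

Halving the number of pairs also halves the number of `coseno` calls, which matters because the pair count grows quadratically with the number of books.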

### This cell converts the computed similarities into a Spark DataFrame. First, each (libro1, libro2, similitud) tuple is turned into a Row object with named fields. Then toDF() converts those records into a DataFrame called df_sim. Finally, df_sim.show(10, truncate=False) displays the first ten rows, showing how similar the books are to each other by cosine similarity.

In [9]:
from pyspark.sql import Row

df_sim = similitudes.map(lambda x: Row(libro1=x[0], libro2=x[1], similitud=float(x[2]))).toDF()
df_sim.show(10, truncate=False)



+---------------------------------------------------+---------------------------------------------------------------------------------------------------+--------------------+
|libro1                                             |libro2                                                                                             |similitud           |
+---------------------------------------------------+---------------------------------------------------------------------------------------------------+--------------------+
|file:///home/robc/SistDist/books/The%20Republic.txt|file:///home/robc/SistDist/books/Dracula.txt                                                       |0.046271933474976934|
|file:///home/robc/SistDist/books/The%20Republic.txt|file:///home/robc/SistDist/books/Little%20Women.txt                                                |0.057500250563714855|
|file:///home/robc/SistDist/books/The%20Republic.txt|file:///home/robc/SistDist/books/The%20Odyssey.txt                      

Traceback (most recent call last):                                              
  File "/home/Antonio/SistDist/venv/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 200, in manager
    code = worker(sock, authenticated)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Antonio/SistDist/venv/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 81, in worker
    worker_main(infile, outfile)
  File "/home/Antonio/SistDist/venv/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 2068, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
       ^^^^^^^^^^^^^^^^
  File "/home/Antonio/SistDist/venv/lib/python3.12/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 597, in read_int
    raise EOFError
EOFError
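
The `libro1` and `libro2` columns hold percent-encoded file URIs, which makes the table wide and hard to read. As a possible cleanup, the sketch below turns such a URI into a plain title using only the standard library. The helper name `title_from_uri` is hypothetical and not part of the notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def title_from_uri(uri):
    """Turn a percent-encoded file URI into a readable title (hypothetical helper)."""
    path = unquote(urlparse(uri).path)   # decode %20 and similar escapes
    return PurePosixPath(path).stem      # drop directories and the .txt suffix

print(title_from_uri("file:///home/robc/SistDist/books/The%20Republic.txt"))  # The Republic
print(title_from_uri("file:///home/robc/SistDist/books/Dracula.txt"))         # Dracula
```

In the notebook this could be applied inside the `map` that builds each `Row`; that wiring is left out here.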


### This command sorts the book similarities from highest to lowest and shows the top 20 rows.

In [10]:
df_sim.orderBy(df_sim.similitud.desc()).show(20, truncate=False)




+---------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+-------------------+
|libro1                                                                                                               |libro2                                                                                                               |similitud          |
+---------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+-------------------+
|file:///home/robc/SistDist/books/A%20Christmas%20Carol.txt                                                           |file:///home/robc/SistDist/books/A%20Christmas%20Carol%20in%20Prose;%20Being%20a%20Ghost%20Story%20of%20Chr
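
For intuition, the same "sort descending and keep the first k" step can be sketched on plain Python tuples. The rows and scores below are made up; `heapq.nlargest` stands in for `orderBy(...).show(k)` here and is not what Spark runs internally.

```python
import heapq

# (libro1, libro2, similitud) tuples with made-up values.
rows = [
    ("A", "B", 0.05),
    ("A", "C", 0.20),
    ("B", "C", 0.11),
    ("A", "D", 0.08),
]

# Keep the 2 rows with the highest similarity, highest first.
top2 = heapq.nlargest(2, rows, key=lambda r: r[2])
print(top2)
```

Because the similarity table contains both (a, b) and (b, a), each book pair shows up twice in a ranking like this unless the mirrored rows are filtered out first.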

                                                                                

### This command saves the similarity results to a Parquet file, overwriting any previous output.

In [11]:
df_sim.write.mode("overwrite").parquet("matriz_similitud_libros.parquet")


                                                                                