# TF-IDF Vectors:

In [2]:
import pandas as pd

peliculas = pd.read_csv("peliculas.csv")

peliculas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1610 entries, 0 to 1609
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        1610 non-null   int64  
 1   title                     1610 non-null   object 
 2   year                      1610 non-null   int64  
 3   synopsis                  1602 non-null   object 
 4   critic_score              1610 non-null   int64  
 5   people_score              1609 non-null   float64
 6   consensus                 1593 non-null   object 
 7   total_reviews             1610 non-null   int64  
 8   total_ratings             1610 non-null   object 
 9   type                      1610 non-null   object 
 10  rating                    1139 non-null   object 
 11  genre                     1603 non-null   object 
 12  original_language         1570 non-null   object 
 13  director                  1609 non-null   object 
 14  producer

In [3]:
# Drop movies whose synopsis is null

peliculas = peliculas.dropna(subset=["synopsis"])

peliculas.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1602 entries, 0 to 1609
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        1602 non-null   int64  
 1   title                     1602 non-null   object 
 2   year                      1602 non-null   int64  
 3   synopsis                  1602 non-null   object 
 4   critic_score              1602 non-null   int64  
 5   people_score              1601 non-null   float64
 6   consensus                 1585 non-null   object 
 7   total_reviews             1602 non-null   int64  
 8   total_ratings             1602 non-null   object 
 9   type                      1602 non-null   object 
 10  rating                    1139 non-null   object 
 11  genre                     1602 non-null   object 
 12  original_language         1569 non-null   object 
 13  director                  1601 non-null   object 
 14  producer     

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer


############# TF-IDF vectorization of the synopses ############

# Initialize scikit-learn's TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    stop_words="english",    # Remove English stop words.
    lowercase=True,          # Convert everything to lowercase.
    token_pattern=r'\b\w+\b' # Treat each run of word characters as a token (drops punctuation, keeps one-character tokens).
)

# Learn the vocabulary and convert the synopses to TF-IDF vectors
# fit -> Fits the model to the text to identify the unique terms.
# transform -> Converts the synopses into a matrix of TF-IDF vectors.
tfidf_matrix = tfidf_vectorizer.fit_transform(peliculas["synopsis"])


In [None]:
# Convert the TF-IDF matrix to a DataFrame
tfidf_dataframe = pd.DataFrame(
    tfidf_matrix.toarray(),  # Convert the sparse matrix to a dense array.
    columns=tfidf_vectorizer.get_feature_names_out(),  # Unique terms become the column names.
    index=peliculas["title"]  # Use the movie titles as the index.
)

### Transforming Synopses into TF-IDF Vectors

Each synopsis is transformed into a TF-IDF vector, where:

- **TF (Term Frequency):** 
  Counts how many times a term appears in the document. `TfidfVectorizer` uses the raw count; document length is handled afterwards, because each document's vector is L2-normalized by default, so longer documents do not get larger weights simply for being longer.
  
- **IDF (Inverse Document Frequency):** 
  Reduces the importance of words that are very common across all the texts (such as "hero" or "fight"), assigning them a lower weight.

- **TF-IDF:** 
  Multiplies TF by IDF, so the terms that are frequent in a document but rare in the rest of the corpus receive the highest weights (here, each document is a synopsis).
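
To make the three definitions concrete, the following sketch recomputes the TF-IDF vector of one document by hand and compares it with the output of `TfidfVectorizer`. The toy corpus is hypothetical (it is not taken from `peliculas.csv`), stop-word removal is left out to keep the arithmetic short, and the IDF formula used is the smoothed variant that scikit-learn applies by default (`smooth_idf=True`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from peliculas.csv.
docs = [
    "a hero fights a villain",
    "a hero saves the city",
    "the city sleeps",
]

vectorizer = TfidfVectorizer(token_pattern=r"\b\w+\b")
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # maps term -> column index

# Manual computation for the first document.
n_docs = len(docs)
tokens = [doc.split() for doc in docs]
manual = np.zeros(len(vocab))
for term, col in vocab.items():
    tf = tokens[0].count(term)                         # raw term count
    df = sum(term in doc_tokens for doc_tokens in tokens)  # documents containing the term
    idf = np.log((1 + n_docs) / (1 + df)) + 1          # smoothed IDF (scikit-learn default)
    manual[col] = tf * idf
manual /= np.linalg.norm(manual)                       # L2 normalization of the row

# True when the manual vector matches scikit-learn's row.
print(np.allclose(manual, tfidf[0]))
```

Terms that appear in a single document ("fights", "villain") end up with a higher IDF than terms shared by two documents ("hero"), which is the effect described in the IDF bullet.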

---

### Result of the Transformation

The result is a **sparse matrix** where:

- **Rows** correspond to the synopses (one per movie).
- **Columns** correspond to the unique terms found across all synopses.
- **Each cell** holds the TF-IDF weight of a term in a specific synopsis.

---


# Cosine Similarity:

In [8]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the TF-IDF matrix (assumes it was saved beforehand as tfidf_matrix.csv,
# with the movie titles in the first column)
tfidf_matrix = pd.read_csv("tfidf_matrix.csv", index_col=0)

# Compute the cosine similarity matrix
cos_sim_matrix = cosine_similarity(tfidf_matrix)

# Remove redundant pairs: the matrix is symmetric, so keep only the strictly
# lower triangle. k=-1 zeroes the upper triangle and the diagonal (self-similarity).
cos_sim_matrix = np.tril(cos_sim_matrix, k=-1)

# Convert the matrix to a DataFrame
cos_sim_df = pd.DataFrame(cos_sim_matrix, index=tfidf_matrix.index, columns=tfidf_matrix.index)

# Save the similarity matrix to a CSV file
cos_sim_df.to_csv("cosine_similarity_matrix.csv")

In [9]:

# Preview the DataFrame
cos_sim_df.iloc[0:5]


| title | Black Panther | Avengers: Endgame | Mission: Impossible -- Fallout | Mad Max: Fury Road | Spider-Man: Into the Spider-Verse | Wonder Woman | Dunkirk | Coco | Thor: Ragnarok | Logan | ... | The Lone Ranger | The Alamo | The Warrior's Way | The Dark Tower | Wild Wild West | Priest | September Dawn | American Outlaws | Jonah Hex | Texas Rangers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Black Panther | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Avengers: Endgame | 0.051737 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Mission: Impossible -- Fallout | 0.0 | 0.014902 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Mad Max: Fury Road | 0.003114 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Spider-Man: Into the Spider-Verse | 0.021198 | 0.012468 | 0.015316 | 0.009936 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
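
Because only the strictly lower triangle is kept, the similarities of a given movie are split between its row and its column. A minimal sketch of how they could be recombined to list the most similar titles follows. The four-title matrix is made up for illustration, `most_similar` is a hypothetical helper rather than part of the notebook, and the code assumes titles are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical strictly lower-triangular similarity matrix for four titles.
titles = ["A", "B", "C", "D"]
lower = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.5, 0.1, 0.0, 0.0],
    [0.0, 0.3, 0.4, 0.0],
])
cos_sim_df = pd.DataFrame(lower, index=titles, columns=titles)


def most_similar(sim_df, title, top_n=2):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    # Adding the transpose restores the symmetric matrix. The diagonal stays 0,
    # so a title is never reported as similar to itself.
    full = sim_df + sim_df.T
    return full.loc[title].sort_values(ascending=False).head(top_n)


result = most_similar(cos_sim_df, "B")
print(result)
```

Storing the triangle halves the number of non-zero values written to the CSV, at the cost of this extra step at query time.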
