## FINAL INTEGRATIVE PROJECT - WINE ANALYSIS AND RECOMMENDATION
---------------------

<img src="https://raw.githubusercontent.com/RodrigoVelasco19/Imagenes/main/Vino2.jpg" width="70%">

#### *Objective: Apply data exploration and transformation techniques (EDA and ETL), Machine Learning, and Natural Language Processing (NLP) to extract valuable information about wines and build a review-based recommendation system.*
---

### Importing libraries
---

In [11]:
from google.colab import drive
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

### 5. Wine recommendation models
---

Two models are proposed: a purely content-based one, which takes the wines' characteristics as input, and a collaborative one, which takes user scores into account.

#### 5.1. Content-based recommendation model
----

This model starts by concatenating the columns that describe each wine into a single text column, which then serves as the source of a unified feature vector for each wine.

Next, a TF-IDF vectorizer is applied to that column to obtain a sparse matrix in which each row represents a wine and each column represents a term (a word or a two-word phrase) found in the concatenated text.
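
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a hypothetical three-wine corpus (the strings are illustrative, not taken from the dataset). It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is intended to mirror the defaults documented for scikit-learn's `TfidfVectorizer`; it only handles single words and is not a replacement for the library.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string stands in for one wine's concatenated features.
docs = [
    "italy sicily frappato",
    "italy sicily grillo",
    "us oregon pinot noir",
]

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in sorted(row.items())})
```

In the first wine, `frappato` receives a higher weight than `italy` because it appears in fewer documents, which is the behavior that lets rare attributes dominate the comparison.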

Cosine similarity is then computed over a subset of wines whose size is chosen by the person running the code, which measures how similar the wines are to one another based on their characteristics.
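
As an illustration of the measure itself, the following sketch computes cosine similarity between sparse vectors stored as dictionaries. The term weights are hypothetical values chosen for readability, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical term weights for three wines (not taken from the dataset).
wine_a = {"italy": 0.5, "sicily": 0.5, "frappato": 0.7}
wine_b = {"italy": 0.5, "sicily": 0.5, "grillo": 0.7}
wine_c = {"us": 0.4, "oregon": 0.6, "pinot": 0.7}

print(round(cosine(wine_a, wine_b), 3))  # shared country and province
print(cosine(wine_a, wine_c))            # no terms in common
print(round(cosine(wine_a, wine_a), 3))  # a vector compared with itself
```

Two wines that share country and province but differ in variety score about 0.5, wines with no terms in common score 0, and any non-zero vector compared with itself scores 1.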

Finally, the user can enter one wine or a set of wines and receives the 5 most similar wines as recommendations.

In [12]:
# Load the df_clean DataFrame produced by the ETL and EDA process

drive.mount('/content/drive')
# Path to the file in Google Drive
file_path = "/content/drive/My Drive/Proyecto-Final-Integrador-Analisis-y-Recomendaciones-de-Vinos/data/df_clean.pkl"

# Load the DataFrame
df_clean = pd.read_pickle(file_path)

# Check that it loaded correctly
df_clean.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,pais,descripcion,puntuacion,precio,provincia,variedad,bodega
1,Portugal,"This is ripe and fruity, a wine that is smooth...",87,15.0,Douro,Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Oregon,Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",87,13.0,Michigan,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",87,65.0,Oregon,Pinot Noir,Sweet Cheeks
5,Spain,Blackberry and raspberry aromas show a typical...,87,15.0,Northern Spain,Tempranillo-Merlot,Tandem


In [13]:
# Create a working copy of the DataFrame for the recommendation models
df_clean_mr = df_clean.copy()

# Concatenate the relevant columns into a single text string per wine
df_clean_mr['info_completa'] = df_clean_mr[['pais', 'provincia', 'variedad', 'bodega']].agg(' '.join, axis=1)

# Initialize the TF-IDF vectorizer with the same optimized parameters
tfidf = TfidfVectorizer(max_df=0.90, min_df=0.01, ngram_range=(1,2))

# Apply TF-IDF to the combined column
feature_matrix = tfidf.fit_transform(df_clean_mr['info_completa'])

# Check the size of the resulting matrix
feature_matrix.shape

(147228, 125)

In [14]:
# Display a portion of the resulting matrix
# Convert only a 20 x 20 slice of the sparse matrix to dense form to avoid excessive memory use
dense_matrix = feature_matrix[:20, :20].toarray()

# Wrap the slice in a pandas DataFrame for easier viewing
df_dense = pd.DataFrame(dense_matrix)

# Show the slice (the first 20 rows and the first 20 columns)
df_dense

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.651707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.447683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.425387,0.0,0.444211,0.276233,0.518676


In [15]:
# Ask the user how many wines to include in the subset
num_vinos = int(input("How many wines do you want to include in the comparison subset? (up to 147,228 wines) "))

# Never request more rows than the matrix actually has
num_vinos = min(num_vinos, feature_matrix.shape[0])

# Build a subset with the first 'num_vinos' wines
subset_feature_matrix = feature_matrix[:num_vinos, :]

# Compute cosine similarity for this subset (the result is a dense num_vinos x num_vinos matrix)
cosine_similarities = cosine_similarity(subset_feature_matrix)

How many wines do you want to include in the comparison subset? (up to 147,228 wines) 5000
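
The subset size matters because `cosine_similarity` returns a dense square matrix by default, so memory grows with the square of the number of wines. The sketch below estimates that cost, assuming 8 bytes per value (float64); the helper name `dense_similarity_bytes` is hypothetical.

```python
def dense_similarity_bytes(n_rows, bytes_per_value=8):
    """Bytes needed for an n_rows x n_rows dense matrix (8 bytes per float64 value)."""
    return n_rows * n_rows * bytes_per_value

for n in (5_000, 20_000, 147_228):
    size_gb = dense_similarity_bytes(n) / 1e9
    print(f"{n:>7} wines -> {size_gb:,.1f} GB")
```

Under that assumption, 5,000 wines need about 0.2 GB, while the full 147,228 wines would need roughly 173 GB, far beyond the memory of a typical Colab session.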


In [17]:
# Compute DCG (discounted cumulative gain) over the first k positions
def dcg_at_k(relevancias, k):
    return np.sum([rel / np.log2(i + 2) for i, rel in enumerate(relevancias[:k])])

# Compute IDCG: the DCG of an ideal list in which all k positions are relevant
def idcg_at_k(k):
    return np.sum([1 / np.log2(i + 2) for i in range(k)])

# Compute NDCG as DCG normalized by IDCG
def ndcg_at_k(relevancias, k):
    dcg = dcg_at_k(relevancias, k)
    idcg = idcg_at_k(k)
    return dcg / idcg if idcg > 0 else 0

# Ask the user how many wines to find the most similar ones for
num_vinos_similares = int(input("How many wines do you want to compare to get the 5 most similar to each? "))

# List that accumulates one result row per wine
resultados_similares = []

for i in range(num_vinos_similares):
    # Get the similarities for wine i
    similitudes = cosine_similarities[i]

    # Take the positions of the 6 highest similarities and drop the last one,
    # which is the wine itself only if no other wine shares its maximum similarity
    indices_similares = similitudes.argsort()[-6:-1][::-1]

    # Store the wine (identified by its winery) and the wineries of its 5 neighbors
    vinos_similares = df_clean_mr.iloc[indices_similares]["bodega"].values
    resultados_similares.append([df_clean_mr.iloc[i]["bodega"], vinos_similares])

# Build a DataFrame from the results
df_resultados_similares = pd.DataFrame(resultados_similares, columns=["Vino", "Vinos Similares"])

# Add an NDCG column to the DataFrame
def agregar_ndcg(df_resultados, k=5):
    relevancias = []
    for idx, row in df_resultados.iterrows():
        # No relevance labels are available, so the first k recommendations are assumed relevant;
        # under that assumption NDCG equals 1.0 for every row by construction
        relevancia = [1 if i < k else 0 for i in range(len(row['Vinos Similares']))]
        relevancias.append(relevancia)

    # Compute NDCG and add it to the DataFrame
    df_resultados['ndcg'] = [ndcg_at_k(rel, k) for rel in relevancias]
    return df_resultados

# Add the NDCG column to the DataFrame
df_resultados_similares = agregar_ndcg(df_resultados_similares)

# Display the DataFrame with the results and the NDCG metric
df_resultados_similares

How many wines do you want to compare to get the 5 most similar to each? 10


Unnamed: 0,Vino,Vinos Similares,ndcg
0,Quinta dos Avidagos,"[Fiuza, Muxagat, Quinta do Sagrado, Monte da P...",1.0
1,Rainstorm,"[Lange, Erath, Emerson, Raptor Ridge, Firesteed]",1.0
2,St. Julian,"[Good Harbor, St. Julian, Ste. Chapelle, Black...",1.0
3,Sweet Cheeks,"[Melrose, Björnson, Schmidt, Reustle, Roco]",1.0
4,Tandem,"[Solar de Urbezo, Bodegas Peñafiel, Baigorri, ...",1.0
5,Terre di Giurfo,"[Terre di Giurfo, Cantine di Dolianova, Baglio...",1.0
6,Trimbach,"[Paul Blanck, Pierre Sparr, Domaines Schlumber...",1.0
7,Heinz Eifel,"[Schlink Haus, Wittmann, Fitz-Ritter, Grafen N...",1.0
8,Jean-Baptiste Adam,"[Rieflé, Jean-Marc Bernhard, Lucien Albrecht, ...",1.0
9,Kirkland Signature,"[Vigilance, Hindsight, Rutherford Hill, Stewar...",1.0
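
Every row in the table above has an NDCG of 1.0 because all recommendations are assumed relevant, so the metric cannot yet distinguish good rankings from poor ones. The sketch below shows how NDCG behaves once relevance labels exist. The labels are hypothetical, and this variant normalizes by the best possible ordering of the labels in the list, which is a common convention but differs from the fixed all-relevant IDCG used in the cell above.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_from_labels(relevances, k):
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for one ranked list of 5 recommendations
# (1 = judged relevant, 0 = not relevant); these are not from the dataset.
all_relevant = [1, 1, 1, 1, 1]
good_first = [1, 1, 0, 0, 0]
good_last = [0, 0, 0, 1, 1]

for name, labels in [("all_relevant", all_relevant),
                     ("good_first", good_first),
                     ("good_last", good_last)]:
    print(name, round(ndcg_from_labels(labels, 5), 3))
```

With real labels, a list that places its relevant wines last scores well below 1.0, while a list that places them first keeps a perfect score.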


In [18]:
# Ask the user for the index of a specific wine
indice_vino = int(input(f"Enter the wine index (0 to {num_vinos - 1}): "))

# Check that the index is within the valid range
if 0 <= indice_vino < num_vinos:
    # Get the similarities for the selected wine
    similitudes_vino = cosine_similarities[indice_vino]

    # Take the positions of the 6 highest similarities and drop the last one,
    # which is the wine itself only if no other wine shares its maximum similarity
    indices_similares = similitudes_vino.argsort()[-6:-1][::-1]

    # Select the most similar wines; .copy() makes the selection an independent DataFrame
    # before a column is added to it
    vinos_similares = df_clean_mr.iloc[indices_similares][["bodega", "pais", "provincia", "variedad", "puntuacion", "precio"]].copy()

    # Compute NDCG for these 5 recommendations
    relevancia = [1] * 5  # All 5 are assumed relevant, so NDCG is 1.0 by construction
    vinos_similares['ndcg'] = ndcg_at_k(relevancia, 5)

    # Show the results with NDCG
    print("\nThe 5 most similar wines with their NDCG metric:")
    print(vinos_similares)

else:
    print("Invalid index. Please enter an index within the range.")

Enter the wine index (0 to 4999): 5

The 5 most similar wines with their NDCG metric:
                    bodega   pais          provincia      variedad  \
6          Terre di Giurfo  Italy  Sicily & Sardinia      Frappato   
52    Cantine di Dolianova  Italy  Sicily & Sardinia        Monica   
1013    Baglio di Pianetto  Italy  Sicily & Sardinia  Nero d'Avola   
1017        Casa di Grazia  Italy  Sicily & Sardinia        Grillo   
39    Feudo di Santa Tresa  Italy  Sicily & Sardinia  Nero d'Avola   

      puntuacion  precio  ndcg  
6             87    16.0   1.0  
52            85    14.0   1.0  
1013          88    45.0   1.0  
1017          88    22.0   1.0  
39            86    12.0   1.0  
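
In the output above, the first result (label 6, Terre di Giurfo) appears to be the queried wine itself, which can happen when several wines tie at the maximum similarity and the `[-6:-1]` slice drops a different one. A minimal sketch of excluding the query explicitly, using a hypothetical similarity row and a hypothetical helper name:

```python
def top_k_excluding_self(similarities, query_pos, k=5):
    """Positions of the k highest similarities, never returning query_pos itself."""
    candidates = [pos for pos in range(len(similarities)) if pos != query_pos]
    # sorted() is stable, so tied similarities keep their original positional order.
    ranked = sorted(candidates, key=lambda pos: similarities[pos], reverse=True)
    return ranked[:k]

# Hypothetical similarity row for the wine at position 2; positions 0 and 4
# tie with it at 1.0, as happens when the feature text is identical.
row = [1.0, 0.3, 1.0, 0.8, 1.0, 0.1]
print(top_k_excluding_self(row, query_pos=2, k=3))  # prints [0, 4, 3]
```

On a NumPy similarity row, the same idea amounts to masking out the query's own position before sorting.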
