<a href="https://colab.research.google.com/github/JCaballerot/Recommender_Systems/blob/main/K_Nearest_Neighbors_Recommender/Book_Crossing_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 align=center><font size = 5> Most-popular-item Recommender</font></h1>

---

<center>
  <img src="https://storage.googleapis.com/kaggle-datasets-images/1661575/2726067/684ac0c4c14cb46d1047ccb620b45cac/dataset-cover.jpg?t=2021-10-21-03-18-09" width="800" height="300">
</center>


## Objectives of this Notebook

1. Load and preprocess a dataset.
2. Build a recommender system based on MPIR (Most-Popular-Item Recommender).
3. Evaluate the system's performance.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">Context</a>  
2. <a href="#item32">Download and prepare the dataset</a>  
3. <a href="#item33">Variable pre-selection</a>  
4. <a href="#item34">Handling categorical variables</a>  
5. <a href="#item34">Handling numerical variables</a>  
6. <a href="#item34">Model training</a>  

</font>
</div>

## 1. Context


The "Book-Crossing" dataset (also known as BX) is a collection of data about books and book reviews. It focuses on how users interact with books and the ratings they give them, and it is widely used in recommender-system applications.



<b>Data description</b>

---

The Book-Crossing dataset contains information about:

* <b>Books:</b> Information about each book, including its title, author, and year of publication.

* <b>Users:</b> Profiles of the users who interact with the books, including their ID and location.

* <b>Ratings:</b> Numerical ratings that users assign to the books they have read.

The dataset can be used for several purposes, such as building book recommender systems, analyzing users' reading patterns and preferences, and research in data mining and artificial intelligence.

---



<strong>See this [link](https://www.kaggle.com/datasets/syedjaferk/book-crossing-dataset) to read more about the Book-Crossing data source.</strong>


## 2. Download and prepare the dataset

In [3]:
# Download Book-Crossing Dataset
!curl -o dataset.zip "http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip"
!unzip -o dataset.zip  # -o overwrites existing files without prompting
!ls -la

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.8M  100 24.8M    0     0  73.0M      0 --:--:-- --:--:-- --:--:-- 73.1M
Archive:  dataset.zip
replace BX-Book-Ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: total 143424
drwxr-xr-x 1 root root     4096 Sep  9 08:21 .
drwxr-xr-x 1 root root     4096 Sep  9 07:52 ..
-rw-rw-rw- 1 root root 30682276 Oct 11  2004 BX-Book-Ratings.csv
-rw-rw-rw- 1 root root 77787439 Oct 11  2004 BX-Books.csv
-rw-rw-rw- 1 root root 12284157 Oct 11  2004 BX-Users.csv
drwxr-xr-x 4 root root     4096 Sep  7 13:23 .config
-rw-r--r-- 1 root root 26085508 Sep  9 08:21 dataset.zip
drwxr-xr-x 1 root root     4096 Sep  7 13:24 sample_data


In [6]:
# Main libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore") # Turn off warnings


In [None]:
ratings = pd.read_csv("BX-Book-Ratings.csv", sep=";", encoding="ISO-8859-1")
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
books   = pd.read_csv("BX-Books.csv",        sep=";", encoding="ISO-8859-1", on_bad_lines="skip")
users   = pd.read_csv("BX-Users.csv",        sep=";", encoding="ISO-8859-1")

In [None]:
users.head()

In [None]:
books.head()

In [None]:
ratings.head()

In [None]:
print("  Users: {} \n  Books: {}\n  Ratings: {}".format(len(users), len(books), len(ratings)))


In [9]:
users.columns = users.columns.str.lower().str.replace('-', '_')
books.columns = books.columns.str.lower().str.replace('-', '_')
ratings.columns = ratings.columns.str.lower().str.replace('-', '_')

### 2.1. User data

In [None]:
users.head()

In [None]:
users["age"].describe()

In [10]:
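# Example of outlier removal
# Note: the thresholds are centered on the median (50th percentile), not on Q1 and Q3 as in the usual IQR rule.
# Rows with a missing age are also dropped, because comparisons against NaN evaluate to False.
IQR = np.nanpercentile(users['age'], 75) - np.nanpercentile(users['age'], 25)
lower_threshold = np.nanpercentile(users['age'], 50) - 1.5*IQR
upper_threshold = np.nanpercentile(users['age'], 50) + 1.5*IQR

users = users[(users['age'] > lower_threshold) & (users['age'] < upper_threshold)]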
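
For comparison, the conventional IQR (Tukey) rule places the fences at Q1 - 1.5·IQR and Q3 + 1.5·IQR rather than around the median. The sketch below is a minimal illustration on made-up ages; `tukey_bounds` and `toy_users` are hypothetical names that are not part of this notebook.

```python
import numpy as np
import pandas as pd

def tukey_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR, ignoring NaN."""
    q1 = np.nanpercentile(values, 25)
    q3 = np.nanpercentile(values, 75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy ages, not taken from BX-Users.csv
toy_users = pd.DataFrame({"age": [18, 22, 25, 30, 35, 40, 45, 244, np.nan]})
lower, upper = tukey_bounds(toy_users["age"])

# between() is False for NaN, so rows with a missing age are dropped as well
toy_clean = toy_users[toy_users["age"].between(lower, upper)]
print(lower, upper)
print(toy_clean["age"].tolist())
```

Because Q3 is above the median, these fences are usually wider on the upper side than the median-centered ones, so fewer older users would be discarded.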

In [None]:
# Set the Seaborn style (optional)
sns.set(style="whitegrid")

# Create the bar chart
plt.figure(figsize=(12, 5))  # Adjust the figure size if needed
ax = sns.countplot(data=users, x='age', color='lightblue')

# Customize the x axis
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')

# Adjust the font size of the x-axis labels
ax.tick_params(axis='x', labelsize=8)

# Add labels and title
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Age distribution of users', fontsize=14)

# Show the chart
plt.tight_layout()
plt.show()


### 2.2. Book data

In [12]:
#dropping the image columns
books.drop(columns=['image_url_s', 'image_url_m', 'image_url_l'], inplace=True) # drop image-url columns

In [None]:
books.head()

In [None]:
books[books.book_title == 'The Lovely Bones: A Novel']

In [13]:
#converting years of publication to integer
books.year_of_publication = pd.to_numeric(books.year_of_publication, errors='coerce')

In [14]:
#replacing all years of publication that are 0 with NaN
#(assigning the result avoids relying on a chained in-place replace)
books['year_of_publication'] = books['year_of_publication'].replace(0, np.nan)

In [None]:
books.year_of_publication.describe()

In [15]:
# Example of outlier removal
lower_threshold = 1964
upper_threshold = 2004

# .copy() avoids a SettingWithCopyWarning on the assignment below
books = books[(books['year_of_publication'] >= lower_threshold) & (books['year_of_publication'] <= upper_threshold)].copy()
books['year_of_publication'] = books['year_of_publication'].astype(int)

In [None]:
# Set the Seaborn style (optional)
sns.set(style="whitegrid")

# Create the bar chart
plt.figure(figsize=(12, 5))  # Adjust the figure size if needed
ax = sns.countplot(data=books, x='year_of_publication', color='lightblue')

# Customize the x axis
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')

# Adjust the font size of the x-axis labels
ax.tick_params(axis='x', labelsize=8)

# Add labels and title
plt.xlabel('Year of publication', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of book publication years', fontsize=14)

# Show the chart
plt.tight_layout()
plt.show()


In [16]:
#correcting publisher names and assigning the name 'Other' to those with missing publisher names
books.publisher= books.publisher.str.replace('&amp;', '&', regex=False)

In [17]:
books['publisher'] = books['publisher'].fillna('Other')

In [18]:
#replacing the NaN values in book_author with Unknown
books['book_author'] = books['book_author'].fillna('Unknown')

In [19]:
#dropping any remaining rows that still contain a NaN in any column
books = books.dropna(how='any', axis = 0)

### 2.3. Ratings data

In [None]:
ratings.head()

In [20]:
#removing the rows with an implicit book_rating of 0
ratings = ratings[ratings.book_rating!=0]

In [None]:
ratings.book_rating.hist()

In [None]:
# Set the Seaborn style (optional)
sns.set(style="whitegrid")

# Create the bar chart
plt.figure(figsize=(12, 5))  # Adjust the figure size if needed
ax = sns.countplot(data=ratings, x='book_rating', color='lightblue')

# Customize the x axis
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')

# Adjust the font size of the x-axis labels
ax.tick_params(axis='x', labelsize=8)

# Add labels and title
plt.xlabel('Book rating', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of book ratings', fontsize=14)

# Show the chart
plt.tight_layout()
plt.show()


### 2.4. Merging the data

In [21]:
df_unified = pd.merge(users[['user_id', 'age']], ratings, on = 'user_id', how = 'inner')
df_unified = pd.merge(df_unified, books[['isbn', 'book_title']], on = 'isbn', how = 'inner')

df_unified.head()

   user_id   age       isbn  book_rating  book_title
0       19  14.0  375759778            7  Prague : A Novel
1     8720  31.0  375759778            3  Prague : A Novel
2    24525  28.0  375759778            5  Prague : A Novel
3    38502  34.0  375759778            8  Prague : A Novel
4   108789  27.0  375759778            7  Prague : A Novel


In [22]:
df_unified.loc[df_unified.user_id ==  387]

      user_id   age        isbn  book_rating  book_title
2786      387  17.0   198320264            2  Julius Caesar (Oxford School Shakespeare)
2789      387  17.0   373196989            6  Santa Brought A Son : Marrying The Boss's Daug...
2791      387  17.0   451527747           10  Alice's Adventures in Wonderland and Through t...
2800      387  17.0   812504208            9  The Adventures of Tom Sawyer
2803      387  17.0  1590071212            9  Jane Eyre


In [30]:
most_popular = df_unified.groupby('book_title')[['isbn']].count().reset_index()
most_popular.rename(columns = {'isbn' : 'popularity'}, inplace = True)
most_popular.sort_values(by = 'popularity', ascending = False, inplace = True)


In [37]:
print(len(most_popular), 'distinct items in the system')

104019 distinct items in the system


In [63]:
# Keep only the items with a meaningful level of popularity (more than 150 ratings)
print(len(most_popular[most_popular.popularity > 150]), 'distinct items used in the system')
items = most_popular[most_popular.popularity > 150].book_title.tolist()

32 distinct items used in the system


In [210]:
df_unified_filtered = df_unified[df_unified.book_title.isin(items)]
df_unified_filtered.head()

     user_id   age       isbn  book_rating  book_title
470      114  57.0  671027360           10  Angels &amp; Demons
471     1031  52.0  671027360            7  Angels &amp; Demons
472     3344  61.0  671027360            8  Angels &amp; Demons
473     3373  30.0  671027360            5  Angels &amp; Demons
474     4092  27.0  671027360            7  Angels &amp; Demons


## 3. Most-popular-item Recommender

This recommender relies on item popularity to make recommendations. Instead of using complex algorithms to analyze each user's individual preferences, it simply recommends the items that are most popular.

The logic behind a Most-Popular-Item Recommender is simple: if an item is popular and many people have liked it, new users are more likely to like it too. This approach is especially useful when there is not enough information about users and their preferences.



Advantages of a Most-Popular-Item Recommender:

* <b>Simplicity:</b> It is easy to implement and does not require complex algorithms.

* <b>Initial effectiveness:</b> It can work well when little is known about the users.

* <b>Scalability:</b> Popularity is a single count per item, so it handles large datasets easily.



Disadvantages of a Most-Popular-Item Recommender:

* <b>Lack of personalization:</b> It ignores users' individual preferences, which can lead to irrelevant recommendations.

* <b>Filter bubble:</b> It can over-represent popular items and fail to surface new ones.

* <b>No awareness of change over time:</b> It does not account for shifting trends or changes in users' tastes.

---
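
The idea described above can be sketched in a few lines: count how many ratings each title has, rank titles by that count, and return the top ones that the user has not rated yet. This is a minimal illustration on made-up data, not necessarily the implementation this notebook intends; `recommend_most_popular` and the `toy` frame are hypothetical, and only the column names follow `df_unified`.

```python
import pandas as pd

def recommend_most_popular(interactions, user_id, n=3):
    """Return the n most-rated titles that `user_id` has not rated yet."""
    # Popularity = number of ratings per title
    popularity = interactions.groupby("book_title")["user_id"].count()
    popularity = popularity.sort_values(ascending=False, kind="mergesort")
    # Titles the user has already rated are excluded from the result
    seen = set(interactions.loc[interactions["user_id"] == user_id, "book_title"])
    return [title for title in popularity.index if title not in seen][:n]

# Hypothetical toy interactions (not taken from the Book-Crossing data)
toy = pd.DataFrame({
    "user_id":     [1, 2, 3, 1, 2, 3, 4],
    "book_title":  ["A", "A", "A", "B", "B", "C", "D"],
    "book_rating": [8, 7, 9, 6, 10, 5, 7],
})

print(recommend_most_popular(toy, user_id=3, n=2))   # user 3 already rated A and C
print(recommend_most_popular(toy, user_id=99, n=2))  # unknown user: overall top titles
```

A user with no history (`user_id=99` here) simply receives the overall top titles, which is why this approach is often used as a cold-start baseline.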

### 3.1. Data sampling


In machine learning, the dataset is typically split into two parts: the training set (train) and the test set (test). These splits are used to train and evaluate models.



<b>Train:</b> The training set is used to fit the machine learning model. This is where the model "learns" the patterns and relationships in the data so that it can make predictions or classifications.

<b>Test:</b> The test set is used to evaluate the model on data it did not see during training. It gives an objective measure of how well the model generalizes and predicts on new data.

In [211]:
# Sampling
# scikit-learn's train_test_split splits a dataset into train and test subsets.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_unified_filtered, # Dataset
                               train_size = 0.7, # Share of rows assigned to train
                               random_state = 123) # Random seed



In [212]:
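# Build a user x ISBN pivot matrix for the training set (0 = not rated)
pivot_table_train = train.pivot(index='user_id', columns='isbn', values='book_rating').fillna(0)

# Build the pivot matrix for the test set and align it to the training columns,
# so both matrices have the same dimensionality for the k-NN model
pivot_table_test = test.pivot(index='user_id', columns='isbn', values='book_rating').fillna(0)
pivot_table_test = pivot_table_test.reindex(columns=pivot_table_train.columns, fill_value=0)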


### 3.2. KNN recommender


In [171]:
from sklearn.neighbors import NearestNeighbors

In [215]:
# Create the k-NN model
k = 30  # Number of nearest neighbors
model = NearestNeighbors(n_neighbors = k,
                         metric='euclidean',
                         algorithm='brute',
                         n_jobs=-1)

model.fit(pivot_table_train.values) # Fit the model to the training data
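
To see what `kneighbors` returns, here is a small sketch on a made-up 4 x 3 rating matrix. The names `toy_matrix` and `toy_model` are hypothetical; the metric and algorithm are the same as in the cell above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical matrix: 4 users (rows) x 3 books (columns); 0 means "not rated"
toy_matrix = np.array([
    [10, 0, 0],
    [9, 0, 1],
    [0, 8, 0],
    [0, 7, 9],
])

toy_model = NearestNeighbors(n_neighbors=2, metric="euclidean", algorithm="brute")
toy_model.fit(toy_matrix)

# Query with user 0's own row: returns the 2 closest rows and their distances
distances, indices = toy_model.kneighbors(toy_matrix[[0]])
print(indices[0])    # row positions of the nearest users
print(distances[0])  # matching Euclidean distances
```

When the query is a row that was part of the fitted data, that row comes back as its own nearest neighbor at distance 0, which is why such a hit is usually discarded before averaging the neighbors' ratings.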




In [None]:

# Function to get recommendations for a specific user
def get_recommendations(user_ratings):
    # Ask for k+1 neighbors and drop the closest one, which is meant to be the user itself.
    # Note: the query vector comes from the test split, so the closest hit is not guaranteed to be that user.
    distances, indices = model.kneighbors([user_ratings], n_neighbors=k+1)

    # Positions of the nearest users (excluding the closest hit)
    neighbor_indices = indices[0][1:]

    # Ratings given by the nearest neighbors
    neighbor_ratings = pivot_table_train.iloc[neighbor_indices]

    # Average rating per book among the neighbors (unrated books count as 0)
    book_scores = neighbor_ratings.mean()

    # Keep only the books the user has not rated yet
    user_unrated_books = book_scores.index[book_scores.notna().values & (user_ratings == 0)]

    # Sort the books by average score, highest first, to get the recommendations
    recommendations = book_scores[user_unrated_books].sort_values(ascending=False)

    return recommendations

# Collect the recommendations as rows (DataFrame.append was removed in pandas 2.0)
rows = []

# For each user in the test set, get their recommendations
for user_id in pivot_table_test.index:
    if user_id in pivot_table_train.index:  # The user must also exist in the training set
        # Align the test vector with the training columns so the dimensions match
        user_ratings = pivot_table_test.loc[user_id].reindex(pivot_table_train.columns, fill_value=0).values
        recommendations = get_recommendations(user_ratings)
        # Keep the top 10 recommendations
        for isbn, score in recommendations.head(10).items():
            rows.append({'user_id': user_id, 'isbn': isbn, 'puntuacion': score})

recomendaciones_tabla = pd.DataFrame(rows, columns=['user_id', 'isbn', 'puntuacion'])




In [203]:
# Show the first rows of the recommendations table
recomendaciones_tabla.head()

  user_id        isbn  puntuacion
0     114  006017322X         0.0
1     114  044023722X         0.0
2     114  0670880728         0.0
3     114  0670032379         0.0
4     114  0618346252         0.0


In [205]:
recomendaciones_tabla[recomendaciones_tabla.puntuacion > 5]

       user_id        isbn  puntuacion
890       6251  0439064864         7.4
1460      9908  0439064872         8.8
4630     30711  0439136350         7.4
4631     30711  0439139600         5.8
11600    74056  0439064864         9.4
13600    86140  0439064864         5.2
13820    87938  0439064872         5.6
16120   102647  0439139597         9.4
16140   102702  043935806X         6.0
18050   114581  0439136369         5.4
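
One way to check the performance of a recommendations table like the one above is precision@k: the share of each user's top-k recommended ISBNs that also appear among that user's held-out ratings. The sketch below is only an illustration under stated assumptions: `precision_at_k`, `toy_recs`, and `toy_held_out` are hypothetical, the column names mirror `recomendaciones_tabla` and `test`, and the held-out ratings are assumed not to have been used when generating the recommendations.

```python
import pandas as pd

def precision_at_k(recs, held_out, k=10, min_rating=1):
    """Mean share of each user's top-k recommended ISBNs found in the held-out ratings."""
    relevant = held_out[held_out["book_rating"] >= min_rating]
    relevant_pairs = set(zip(relevant["user_id"], relevant["isbn"]))
    scores = []
    for user_id, group in recs.groupby("user_id"):
        top_k = group.sort_values("puntuacion", ascending=False).head(k)
        hits = sum((user_id, isbn) in relevant_pairs for isbn in top_k["isbn"])
        scores.append(hits / len(top_k))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical recommendations and held-out ratings
toy_recs = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "isbn": ["a", "b", "a", "c"],
    "puntuacion": [9.0, 7.5, 8.0, 6.0],
})
toy_held_out = pd.DataFrame({
    "user_id": [1, 2, 2],
    "isbn": ["a", "a", "c"],
    "book_rating": [8, 9, 7],
})

# User 1 has 1 hit out of 2 and user 2 has 2 out of 2, so the mean is 0.75
print(precision_at_k(toy_recs, toy_held_out, k=2))
```

Raising `min_rating` counts only highly rated held-out books as relevant, which makes the metric stricter.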


In [206]:
test.head()

book_title \ user_id,114,254,424,595,638,709,805,882,899,900,...,278254,278350,278356,278390,278422,278543,278550,278552,278554,278843
Angels &amp; Demons,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Secret Life of Bees,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Catcher in the Rye,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Harry Potter and the Order of the Phoenix (Book 5),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Nanny Diaries: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



---
## Thank you for completing this lab!