###Sistema de recomendacion avanzado

Este notebook es un sistema de recomendación avanzado que utiliza el algoritmo de filtrado colaborativo basado en el modelo de factorización matricial SVD (Descomposición de valores singulares) para recomendar productos a los usuarios.

In [1]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.4/154.4 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357251 sha256=c58a92e6a270b4df5c73e1da90f344616743ae8dc8138ddb5cacf0d9403b2114
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully inst

In [2]:
#Librerias
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from surprise import SVD
from surprise import Dataset
from surprise import Reader




In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:

"""
chunksize = 1000

list_dataframes = []

for chunk in pd.read_json('./datos/Electronics.jsonl', lines=True, chunksize=chunksize):
    list_dataframes.append(chunk)

ratings_df = pd.concat(list_dataframes)

#Sacamos un sample de los datos para que sea mas facil de manejar

ratings_df = ratings_df.sample(frac=0.1, random_state=42)
"""

"\nchunksize = 1000\n\nlist_dataframes = []\n\nfor chunk in pd.read_json('./datos/Electronics.jsonl', lines=True, chunksize=chunksize):\n    list_dataframes.append(chunk)\n\nratings_df = pd.concat(list_dataframes)\n\n#Sacamos un sample de los datos para que sea mas facil de manejar\n\nratings_df = ratings_df.sample(frac=0.1, random_state=42)\n"

In [5]:
#Guardar datos en un csv
#ratings_df.to_csv('./datos/Electronics.csv', index=False, escapechar='\\')

#Cargamos los datos
ratings_df = pd.read_csv('/content/drive/MyDrive/Electronics.csv', escapechar='\\')

### Paso 2. Descripción del dataset

In [6]:
ratings_df.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5,Five Stars,All OK and as described.,[],B005FAPYXS,B005FAPYXS,AES2QJS66UB66K2C7MNRDSEOOGAQ,2015-02-04 20:13:26.000,0,True
1,5,Seems durable and compatible with Fitbit inspire,"Fits well, seems durable, clips are easy to in...",[],B082HGVN48,B082HGVN48,AEPP2SC5F7LFNKNR7ODMPGH2TOMQ,2020-06-23 19:56:40.098,0,True
2,4,Good Value,Windows 8 takes a little bit of getting use to...,[],B00GPH6T8E,B00GPH6T8E,AEDAKOXCEIOGFPF53OJYGAY7JMDA,2014-05-06 22:06:09.000,1,True
3,5,Great unit for the money.,Sounds great and the plus side is you can conn...,[],B07YFXRNHF,B07YFXRNHF,AHLZT2U7JPWQR4532Q46DMWG6NPA,2021-03-28 19:23:29.795,0,True
4,1,plug and play it's not,Instructions and diagrams suck. Too many optio...,[],B075CZGFJZ,B075CZGFJZ,AG4LZFRGTPXYPNBLMVNUEOV4KAGQ,2018-04-02 19:41:17.220,0,True


In [7]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4388694 entries, 0 to 4388693
Data columns (total 10 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   rating             int64 
 1   title              object
 2   text               object
 3   images             object
 4   asin               object
 5   parent_asin        object
 6   user_id            object
 7   timestamp          object
 8   helpful_vote       int64 
 9   verified_purchase  bool  
dtypes: bool(1), int64(2), object(7)
memory usage: 305.5+ MB


In [8]:
ratings_df.describe()

Unnamed: 0,rating,helpful_vote
count,4388694.0,4388694.0
mean,4.098699,1.099181
std,1.412332,20.52545
min,1.0,-1.0
25%,4.0,0.0
50%,5.0,0.0
75%,5.0,0.0
max,5.0,12928.0


### Paso 3: Analisis Exploratorio

Se realiza un analisis exploratorio de los datos para entender mejor la distribución de los datos y la relación entre las variables.

In [9]:
#Eliminamos los nulos en rating, user_id y parent_asin
ratings_df = ratings_df.dropna(subset=['rating', 'user_id', 'parent_asin'])

In [10]:
n_ratings = len(ratings_df["rating"])
n_users = ratings_df["user_id"].nunique()
n_items = ratings_df["parent_asin"].nunique()

print("Numbero de ratings: ", n_ratings)
print("Number de usuarios: ", n_users)
print("Number de items: ", n_items)


Numbero de ratings:  4388694
Number de usuarios:  3518282
Number de items:  630816


In [11]:
user_freq = ratings_df[['user_id', 'rating']].groupby('user_id').count().reset_index()
user_freq.columns = ['user_id', 'n_ratings']
user_freq = user_freq.sort_values('n_ratings', ascending=False)
user_freq.head()

Unnamed: 0,user_id,n_ratings
3150743,AHMNA5UK3V66O2V3DZSBJA4FYMOA,103
398629,AEIIRIHLIYKQGI7ZOCIJTRDF5NPQ,97
3322684,AHSV5AUFONH7QMMUPF7M6FUJRJ6Q_1,90
1594668,AFTZWAK3ZHAPCNSOT5GCKQDECBTQ,84
2495834,AGUTZC4GHLTGYHA3KBEDRF6MHB6A,76


In [12]:
print(f"Numero de ratings por usuario: {user_freq['n_ratings'].mean():.2f}.")

Numero de ratings por usuario: 1.25.


In [13]:
#Producto con el rating promedio mas alto
item_freq = ratings_df[['parent_asin', 'rating']].groupby('parent_asin').mean().reset_index()
item_freq.columns = ['parent_asin', 'mean_rating']
item_freq = item_freq.sort_values('mean_rating', ascending=False)
item_freq.head()

Unnamed: 0,parent_asin,mean_rating
630815,BT008G3W52,5.0
414907,B07RQ8QG2P,5.0
414934,B07RQGT9XR,5.0
414933,B07RQGHSXW,5.0
190125,B00PVVRMA4,5.0


In [14]:
#Producto conb el rating promedio mas bajo
item_freq = ratings_df[['parent_asin', 'rating']].groupby('parent_asin').mean().reset_index()
item_freq.columns = ['parent_asin', 'mean_rating']
item_freq = item_freq.sort_values('mean_rating', ascending=True)
item_freq.head()

Unnamed: 0,parent_asin,mean_rating
236476,B018U1X9L2,1.0
291804,B01NBRCA85,1.0
566317,B09NTW1ZWG,1.0
291815,B01NBS89KP,1.0
173782,B00LETJFNE,1.0


### Paso 3: sistema de recomendación avanzado

Para el sistema de recomendación avanzado, se utiliza el algoritmo de filtrado colaborativo basado en el modelo de factorización matricial SVD (Descomposición de valores singulares) para recomendar productos a los usuarios.

In [15]:
#Obtenemos los features de los productos SVD
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['user_id', 'parent_asin', 'rating']], reader)

In [16]:
#trainset = data.build_full_trainset()


In [34]:
from surprise.model_selection import train_test_split

#Dividimos los datos en train, test y validation
x_train, x_test = train_test_split(data, test_size=0.2, random_state=42)

In [38]:
model = SVD(n_factors=25, n_epochs=200, lr_all=0.001, reg_all=0.01)

model.fit(x_train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7aaaadb107f0>

In [19]:
#Metricas de evaluacion RSME, MSE Y MAE
"""
from surprise import accuracy

testset = trainset.build_anti_testset()
predictions = model.test(testset)

accuracy.rmse(predictions)
accuracy.mae(predictions)
accuracy.mse(predictions)
"""


'\nfrom surprise import accuracy\n\ntestset = trainset.build_anti_testset()\npredictions = model.test(testset)\n\naccuracy.rmse(predictions)\naccuracy.mae(predictions)\naccuracy.mse(predictions)\n'

In [39]:
from surprise import accuracy

predictions = model.test(x_test)

accuracy.rmse(predictions)
accuracy.mae(predictions)
accuracy.mse(predictions)

RMSE: 1.3640
MAE:  1.0670
MSE: 1.8606


1.8606215913480824

In [37]:
def obtener_recomendaciones_producto(model, user_id, n=10):
    # Obtener una lista de todos los ids de los productos
    item_ids = model.trainset.all_items()

    # Obtener una lista de las predicciones de calificación para todos los productos que el usuario no ha calificado aún
    predictions = [(model.trainset.to_raw_iid(item_id), model.predict(user_id, item_id).est) for item_id in item_ids]

    # Ordenar las predicciones por calificación de mayor a menor y obtener los primeros n productos
    predictions.sort(key=lambda x: x[1], reverse=True)
    top_n = predictions[:n]

    # Devolver los ids de los productos
    return [item_id for (item_id, _) in top_n]

print("Recomendaciones para el producto B082HGVN48")
for item in obtener_recomendaciones_producto(model, ''):
    print(item)



Recomendaciones para el producto B082HGVN48
B07ZGGSR19
B019BLYRQG
B07K82SSPT
B071RCVK6W
B0BJ7QMYBF
B0748NSD2M
B08HMVYL3R
B07284MZPT
B00840KRDI
B07Q58BK7L
