# Steam Game Recommendation

Lucas Ciziks - 125599472 

luciziks@usp.br

Pedro Maçonetto - 12675419

pedromaconetto@usp.br

### ABSTRACT
Development, evaluation, and comparison of different recommender
system approaches for suggesting and ranking games from the catalog
of Steam, the online platform for buying and selling games.

## Datasets
Two datasets are used. The main one, about 8 GB in size, contains reviews written by users from all over the world for a set of games. A second dataset with game metadata supports the recommendation algorithms.

In [60]:
# Libraries used in this work
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from caserec.recommenders.item_recommendation.item_attribute_knn import ItemAttributeKNN
# import caserecommender as caserec

Because the main dataset is very large (over 8 GB), we sample 100,000 users so that the methods can be applied within a reasonable processing time.

In [175]:
# df = pd.read_csv('steam_reviews.csv')

# # Collect the unique IDs of all users in the dataset
# steam_id = df['author.steamid'].unique()

# # Randomly pick 100000 of those users
# users = np.random.choice(steam_id, 100000)

# # Build a separate DataFrame with only these users and their data
# data_reduzido = df[df['author.steamid'].isin(users)]

# # Finally, save this DataFrame as a .csv file so the data is easier to handle later on
# data_reduzido.to_csv('reviews_reduzido.csv')
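
One detail of the commented-out cell above is worth noting: `np.random.choice` draws with replacement by default, so asking for 100000 IDs can return fewer distinct users. The sketch below, which assumes the same `author.steamid` column and uses a tiny temporary CSV as a stand-in for `steam_reviews.csv`, samples distinct users and filters the file in chunks so the full file never has to be held in memory at once. The helper name `sample_user_reviews`, the chunk size, and the toy rows are illustrative choices, not part of the original notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def sample_user_reviews(csv_path, user_col, n_users, chunksize=2, seed=0):
    """Return the reviews of `n_users` distinct users, reading the CSV in chunks."""
    # First pass: load only the user column to collect the unique user IDs
    ids = pd.read_csv(csv_path, usecols=[user_col])[user_col].unique()
    rng = np.random.default_rng(seed)
    # replace=False guarantees that the sampled IDs are all distinct
    chosen = rng.choice(ids, size=min(n_users, len(ids)), replace=False)

    # Second pass: keep only the rows of the chosen users, chunk by chunk
    parts = [chunk[chunk[user_col].isin(chosen)]
             for chunk in pd.read_csv(csv_path, chunksize=chunksize)]
    return pd.concat(parts, ignore_index=True)


# Toy stand-in for steam_reviews.csv (hypothetical rows)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reviews.csv")
    with open(path, "w") as f:
        f.write("author.steamid,app_id,recommended\n"
                "1,10,True\n1,20,False\n2,10,True\n"
                "3,30,True\n3,10,False\n4,20,True\n")
    sample = sample_user_reviews(path, "author.steamid", n_users=2)

print(sample["author.steamid"].nunique())
```

For the real file, a much larger `chunksize` (for example, hundreds of thousands of rows) would be a more sensible choice.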

We therefore work with the reduced dataset, starting with data cleaning.

In [176]:
review = pd.read_csv('reviews_reduzido.csv')
metadados = pd.read_csv('steam_metadados.csv')

In [177]:
metadados.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [178]:
review.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,app_id,app_name,review_id,language,review,timestamp_created,timestamp_updated,recommended,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
0,260,260,292030,The Witcher 3: Wild Hunt,85133150,ukrainian,мало багів)))),1611307610,1611307610,True,...,True,False,False,76561198049715362,15,1,5452.0,2571.0,5293.0,1611317000.0
1,272,272,292030,The Witcher 3: Wild Hunt,85131132,polish,~POZIOM TRUDNOŚCI~\n\n🔲Moja 70-letnia babcia m...,1611304281,1611304281,True,...,True,False,False,76561198206132281,48,9,974.0,715.0,844.0,1611330000.0
2,291,291,292030,The Witcher 3: Wild Hunt,85128129,english,"terribly bugs, keeps kicking me out to home sc...",1611299056,1611299056,False,...,True,False,False,76561198332696736,23,1,8565.0,4973.0,8442.0,1611364000.0
3,379,379,292030,The Witcher 3: Wild Hunt,85108654,russian,"Эту игру хвалили уже столько, что я не смогу п...",1611263439,1611263439,True,...,True,False,False,76561198105446265,189,10,10210.0,257.0,10009.0,1611355000.0
4,380,380,292030,The Witcher 3: Wild Hunt,85108570,turkish,Ölmeden önce oynanması gereken oyunlardan. Hik...,1611263305,1611263305,True,...,True,False,False,76561198398754574,13,4,5394.0,3641.0,4926.0,1611349000.0


In [179]:
# Drop the columns that will not be used and rename the ones that will
review = review.drop(['Unnamed: 0.1', 'Unnamed: 0', 'timestamp_created', 'received_for_free', 'written_during_early_access', 'votes_helpful', 'votes_funny', 'weighted_vote_score', 'timestamp_updated'], axis=1)
review = review.rename(columns = {'app_id': 'itemId', 'app_name': 'itemName', 'review_id': 'reviewId',
       'comment_count': 'commentCount', 'steam_purchase': 'steamPurchase',
       'author.steamid': 'userId', 'author.num_games_owned': 'userGames', 'author.num_reviews': 'userReviews',
       'author.playtime_forever': 'userPlaytimeForever', 'author.playtime_last_two_weeks': 'userPlaytimeLastTwoWeeks',
       'author.playtime_at_review': 'userPlaytimeAtReview', 'author.last_played': 'userLastPlayed'})
metadados.rename(columns = {'appid': 'itemId', 'name': 'itemName'}, inplace= True)

In [180]:
# Since we work with a sample of the original dataset, keep only the games
# that appear both in the sample and in the metadata
commonItems = list(set(metadados['itemName']) & set(review['itemName']))
# .copy() avoids SettingWithCopyWarning when new columns are assigned later
review = review.loc[review['itemName'].isin(commonItems)].copy()
metadados = metadados.loc[metadados['itemName'].isin(commonItems)].copy()

In [181]:
# Map items and users to contiguous integer indices
itemId = {item: idx for idx, item in enumerate(commonItems)}
# Enumerate the unique users so that indices run from 0 to n_users-1
userId = {user: idx for idx, user in enumerate(review['userId'].unique())}

metadados['itemId'] = metadados['itemName'].map(itemId)
review['itemId'] = review['itemName'].map(itemId)
review['userId'] = review['userId'].map(userId)

# Binarize the user's rating (1 = recommended, 0 = not recommended)
# The helper is named `binarize` to avoid shadowing the built-in `bin`
def binarize(x):
    return 1 if x else 0
review['rating_bin'] = review['recommended'].apply(binarize)

# Split each item's genres from the metadata file into one row per (item, genre)
metadados = metadados.drop('genres', axis=1).join(metadados.genres.str.split(';', expand=True)
             .stack().reset_index(drop=True, level=1).rename('genre'))
metadados.dropna(inplace=True)
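
The ID mapping and the genre split in the cell above can also be expressed with pandas built-ins. Below is a minimal sketch on hypothetical toy data (the IDs, game names, and genres are made up for illustration), assuming pandas 0.25 or later for `explode`: `pd.factorize` assigns contiguous integer codes in order of first appearance, and `str.split` followed by `explode` yields one row per (item, genre) pair.

```python
import pandas as pd

# Hypothetical toy data standing in for `review` and `metadados`
toy_review = pd.DataFrame({
    "userId": [76561198000000001, 76561198000000002, 76561198000000001],
    "itemName": ["Game A", "Game B", "Game B"],
})
toy_meta = pd.DataFrame({
    "itemName": ["Game A", "Game B"],
    "genres": ["Action;RPG", "Indie"],
})

# factorize returns (codes, uniques); codes are contiguous and start at 0
toy_review["userId"], _ = pd.factorize(toy_review["userId"])

# Build the item mapping once and reuse it in both frames so the IDs agree
item_codes = {name: idx for idx, name in enumerate(toy_meta["itemName"].unique())}
toy_review["itemId"] = toy_review["itemName"].map(item_codes)
toy_meta["itemId"] = toy_meta["itemName"].map(item_codes)

# One row per (item, genre) pair
toy_genres = (
    toy_meta.assign(genre=toy_meta["genres"].str.split(";"))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)
print(toy_genres)
```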

In [182]:
train, test = train_test_split(review, test_size=.2, random_state=2)
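
A random row-wise split can place users or items in the test set that never occur in the training set, which matters for models that learn one parameter vector per user and per item. The sketch below uses hypothetical toy interactions, not the notebook's data, to show one way of measuring that overlap after `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy interactions: users 0-4, items 0-2
toy = pd.DataFrame({
    "userId": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "itemId": [0, 1, 1, 2, 0, 2, 0, 1, 1, 2],
    "rating_bin": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=2)

# Share of test rows whose user (or item) was also seen during training
seen_user = toy_test["userId"].isin(toy_train["userId"]).mean()
seen_item = toy_test["itemId"].isin(toy_train["itemId"]).mean()
print(f"test rows with a known user: {seen_user:.0%}")
print(f"test rows with a known item: {seen_item:.0%}")
```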

In [187]:
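def rmse_user(preds, ratings):
    """Root mean squared error between predicted and observed ratings."""
    preds = np.asarray(preds, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    if preds.shape != ratings.shape:
        raise ValueError('preds and ratings must have the same length')
    return np.sqrt(np.mean((preds - ratings) ** 2))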


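def AP(rec, gt, limiar):
    """Average precision of the ranked list `rec` against the relevant items `gt`.

    The scan stops once `limiar` hits have been found, and the sum of the
    precisions at each hit is divided by the number of hits.
    """
    # A set gives constant-time membership tests
    relevant = set(rec) & set(gt)
    hit = 0
    i = 0
    score = 0
    while i < len(rec) and hit < limiar:
        if rec[i] in relevant:
            hit += 1
            score += hit/(i+1)
        i += 1
    return score/hit if hit > 0 else 0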

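def MAP(rec, gt, limiar=np.inf):
    """Mean of AP over the users present in both `rec` and `gt`."""
    common_users = list(set(rec['userId']) & set(gt['userId']))
    # Avoid a division by zero when the two frames share no users
    if not common_users:
        return 0
    score = 0

    for user in common_users:
        score += AP(rec.loc[rec.userId == user, 'itemId'].tolist(), gt.loc[gt.userId == user, 'itemId'].tolist(), limiar)

    return score/len(common_users)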
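
To make the metric concrete, here is a small worked example. The function `average_precision` is a compact restatement of `AP` for illustration (the name, the `max_hits` parameter, and the toy lists are assumptions, not part of the notebook). With recommendations `[10, 20, 30, 40]` and relevant items `[20, 40]`, the hits occur at ranks 2 and 4, giving precisions 1/2 and 2/4, whose mean is 0.5.

```python
def average_precision(rec, gt, max_hits=float("inf")):
    """Mean of precision@rank over the ranks where a relevant item appears."""
    relevant = set(rec) & set(gt)
    hits, score = 0, 0.0
    for rank, item in enumerate(rec, start=1):
        # Stop scanning once the allowed number of hits has been reached
        if hits >= max_hits:
            break
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


print(average_precision([10, 20, 30, 40], [20, 40]))              # 0.5
print(average_precision([10, 20, 30, 40], [20, 40], max_hits=1))  # 0.5
print(average_precision([10, 30], [20, 40]))                      # 0.0
```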


# Pointwise

## Collaborative Filtering


### SVD Optimized
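
The model trained in the next cell predicts a rating as a global mean plus user and item biases plus a dot product of latent factors, and it is fitted by stochastic gradient descent. Written out to match the code, with $\gamma$ standing for `lr` and $\lambda$ for `reg` (the symbols are our own notation, not the notebook's):

$$
\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui}
$$

$$
\begin{aligned}
b_u &\leftarrow b_u + \gamma\, e_{ui} \\
b_i &\leftarrow b_i + \gamma\, e_{ui} \\
p_u &\leftarrow p_u + \gamma\,(e_{ui}\, q_i - \lambda\, p_u) \\
q_i &\leftarrow q_i + \gamma\,(e_{ui}\, p_u - \lambda\, q_i)
\end{aligned}
$$

As implemented, the bias updates carry no regularization term, and the $q_i$ update uses the value of $p_u$ from before its own update.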


In [171]:
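from math import sqrt

def train_svdopt(train, n_factors, lr=0.05, reg=0.02, miter=10):
    """Train a biased matrix factorization model with SGD.

    Returns the global mean, the user and item biases, the user and item
    factors, and the training RMSE of each iteration.
    """
    global_mean = train['rating_bin'].mean()
    # Sizes come from the full `review` DataFrame (a notebook global) so that
    # users and items that only appear in the test set still have a slot
    n_users = review['userId'].max()+1
    n_items = review['itemId'].max()+1
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    p = np.random.normal(0.1, 0.1, (n_users, n_factors))
    q = np.random.normal(0.1, 0.1, (n_items, n_factors))
    error = []
    for t in range(miter):
        sq_error = 0
        for index, row in train.iterrows():
            u = row['userId']
            i = row['itemId']
            r_ui = row['rating_bin']
            # Prediction and error for the current (user, item) pair
            pred = global_mean + bu[u] + bi[i] + np.dot(p[u], q[i])
            e_ui = r_ui - pred
            sq_error = sq_error + pow(e_ui, 2)
            # Bias updates (no regularization term is applied to the biases)
            bu[u] = bu[u] + lr * e_ui
            bi[i] = bi[i] + lr * e_ui
            # Factor updates; temp_uf keeps the old p[u][f] for the q update
            for f in range(n_factors):
                temp_uf = p[u][f]
                p[u][f] = p[u][f] + lr * (e_ui * q[i][f] - reg * p[u][f])
                q[i][f] = q[i][f] + lr * (e_ui * temp_uf - reg * q[i][f])
        # Training RMSE of this iteration
        error.append(sqrt(sq_error/len(train)))

    return global_mean, bu, bi, p, q, error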

In [188]:
gl, bu, bi, p, q, error = train_svdopt(train, 4, miter=30)

In [204]:
preds = []
for i, row in test.iterrows():
    preds.append(gl + bu[row['userId']] + bi[row['itemId']] + np.dot(p[row['userId']], q[row['itemId']]))
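
The prediction loop above can also be vectorized with NumPy fancy indexing. The sketch below uses small random stand-ins for `gl`, `bu`, `bi`, `p`, and `q` (the toy shapes and values are assumptions) and checks that the vectorized form agrees with the pair-by-pair computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 5, 4, 3

# Toy stand-ins for the trained parameters
gl = 0.8
bu = rng.normal(size=n_users)
bi = rng.normal(size=n_items)
p = rng.normal(size=(n_users, n_factors))
q = rng.normal(size=(n_items, n_factors))

# Toy test pairs (user index, item index)
users = np.array([0, 2, 4, 2])
items = np.array([1, 3, 0, 1])

# Row-wise dot products p[u] . q[i] for all pairs at once
preds_vec = gl + bu[users] + bi[items] + np.sum(p[users] * q[items], axis=1)

# Reference: the same computation one pair at a time
preds_loop = np.array([gl + bu[u] + bi[i] + np.dot(p[u], q[i])
                       for u, i in zip(users, items)])

print(np.allclose(preds_vec, preds_loop))  # True
```

With the notebook's data, `users` and `items` would be `test['userId'].to_numpy()` and `test['itemId'].to_numpy()`.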

In [206]:
rmse_user(preds, test['rating_bin'].tolist())

0.32304989648018356

In [207]:
px.line(error)

## Content-Based

### ItemAttributeKNN

In [185]:
metadados[['itemId', 'genre']].to_csv('meta_genres.dat', index=False, sep='\t', header=False)
train[['userId', 'itemId', 'rating_bin']].to_csv('train.dat', index=False, header=False, sep='\t')
test[['userId', 'itemId', 'rating_bin']].to_csv('test.dat', index=False, header=False, sep='\t')

In [186]:
ItemAttributeKNN('train.dat', 'test.dat', output_file='recs_iaknn.dat', metadata_file='meta_genres.dat', as_similar_first=True).compute()

[Case Recommender: Item Recommendation > Item Attribute KNN Algorithm]

train data:: 69783 users and 225 items (104005 interactions) | sparsity:: 99.34%
test data:: 22423 users and 222 items (26002 interactions) | sparsity:: 99.48%

training_time:: 0.180996 sec
>> metadata:: 225 items and 19 metadata (590 interactions) | sparsity:: 86.20%
prediction_time:: 122.730199 sec


Eval:: PREC@1: 0.005218 PREC@3: 0.00553 PREC@5: 0.005084 PREC@10: 0.004428 RECALL@1: 0.004338 RECALL@3: 0.013408 RECALL@5: 0.020127 RECALL@10: 0.035161 MAP@1: 0.005218 MAP@3: 0.009923 MAP@5: 0.01194 MAP@10: 0.014198 NDCG@1: 0.005218 NDCG@3: 0.014453 NDCG@5: 0.018528 NDCG@10: 0.024515 
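
For intuition, the general idea behind an attribute-based item KNN recommender can be sketched as follows: items are described by binary attribute vectors (here, genres), item-item similarity is computed from those vectors, and an unseen item is scored by its similarity to the items the user already interacted with. The code below is an illustrative sketch on hypothetical toy data using cosine similarity. It is not Case Recommender's implementation, whose similarity metric, neighborhood size, and scoring details may differ.

```python
import numpy as np

# Hypothetical item-by-genre matrix: rows are items, columns are genres
# (Action, RPG, Indie)
item_genres = np.array([
    [1, 0, 0],  # item 0: Action
    [1, 1, 0],  # item 1: Action, RPG
    [0, 0, 1],  # item 2: Indie
    [0, 1, 0],  # item 3: RPG
], dtype=float)

# Cosine similarity between every pair of items
norms = np.linalg.norm(item_genres, axis=1, keepdims=True)
unit = item_genres / norms
sim = unit @ unit.T
np.fill_diagonal(sim, 0.0)  # an item should not count as its own neighbor


def recommend(seen_items, sim, k=2, top_n=2):
    """Rank unseen items by summed similarity to the user's seen items,
    counting a seen item only if it is among the candidate's k nearest neighbors."""
    n_items = sim.shape[0]
    scores = np.full(n_items, -np.inf)  # seen items keep -inf and are never returned
    for item in range(n_items):
        if item in seen_items:
            continue
        neighbors = np.argsort(-sim[item], kind="stable")[:k]
        scores[item] = sum(sim[item, j] for j in neighbors if j in seen_items)
    ranked = np.argsort(-scores, kind="stable")
    return [int(i) for i in ranked[:top_n] if np.isfinite(scores[i])]


# A user who interacted only with item 0 (Action)
print(recommend({0}, sim))
```

In this toy setup, item 1 ranks first because it is the only unseen item that shares a genre with item 0.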


In [189]:
iaknn = pd.read_csv('recs_iaknn.dat', sep='\t', names=['userId', 'itemId', 'rating_bin'])
iaknn.head()

Unnamed: 0,userId,itemId,rating_bin
0,1,44,1.0
1,1,58,1.0
2,1,67,1.0
3,1,79,1.0
4,1,101,1.0


In [190]:
MAP(iaknn, test, 100)

0.031611993339354084