<img src="https://www.escoladnc.com.br/wp-content/uploads/2022/06/dnc_formacao_dados_logo_principal_preto-1.svg" alt="drawing" width="300"/>

# Content-similarity-based recommendation - Parts 1 and 2

This notebook implements an item-item recommender based on content similarity.

**The notebook is divided into 2 parts**:

1. Vector representation with _one-hot encoding_
2. Vector representation with _PCA_

Both parts share the same preprocessing.
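
As a rough sketch of what the two parts compare, the snippet below builds a toy one-hot matrix for three hypothetical games, computes item-item cosine similarity on it, and then repeats the computation on a 2-component PCA projection obtained with NumPy's SVD. The toy matrix, the `cosine_similarity_matrix` and `pca_project` helpers, and the number of components are illustrative assumptions, not necessarily what the notebook itself uses.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    X_unit = X / norms
    return X_unit @ X_unit.T

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# Hypothetical one-hot features: columns = [Action, Indie, Strategy, Casual]
X_onehot = np.array([
    [1, 1, 0, 0],   # game A
    [1, 1, 0, 1],   # game B
    [0, 0, 1, 1],   # game C
], dtype=float)

sim_onehot = cosine_similarity_matrix(X_onehot)
sim_pca = cosine_similarity_matrix(pca_project(X_onehot, n_components=2))

# For each game, find the most similar *other* game (mask out self-similarity)
sim_no_self = sim_onehot.copy()
np.fill_diagonal(sim_no_self, -np.inf)
nearest = sim_no_self.argmax(axis=1)
```

Here `nearest[i]` holds the row position of the game recommended for game `i`. The same lookup could be applied to `sim_pca` to compare the two representations.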

In [1]:
import os
import json
import numpy as np
import pandas as pd
from google.colab import files
import matplotlib.pyplot as plt
import matplotlib
from cycler import cycler

matplotlib.rcParams['axes.prop_cycle'] = cycler(color=['#007efd', '#FFC000', '#303030'])

# Loading the dataset

The dataset used here (`steam_games.parquet`) contains metadata for 32k games from [_Steam_](https://store.steampowered.com/), such as:

- `id`: game identifier
- `title`: game title
- `genres`: list of genres associated with the game
- `tags`: list of tags associated with the game
- `specs`: game specifications
- `release_date`: release date of the game
- `price`: price of the game
- `sentiment`: qualitative rating of the game according to users


Upload the file `steam_games.parquet`:

In [2]:
%%time
_ = files.upload()

Saving steam_games.parquet to steam_games.parquet
CPU times: user 497 ms, sys: 60.3 ms, total: 557 ms
Wall time: 38.8 s


In [5]:
filepath = './steam_games.parquet'
df = pd.read_parquet(filepath)
df.set_index('id', inplace=True)
df.tail()

| id | publisher | genres | app_name | title | url | release_date | tags | discount_price | reviews_url | specs | price | early_access | developer | sentiment | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 773640 | Ghost_RUS Games | [Casual, Indie, Simulation, Strategy] | Colony On Mars | Colony On Mars | http://store.steampowered.com/app/773640/Colon... | 2018-01-04 | [Strategy, Indie, Casual, Simulation] | 1.49 | http://steamcommunity.com/app/773640/reviews/?... | [Single-player, Steam Achievements] | 1.99 | False | Nikita "Ghost_RUS" | | |
| 733530 | Sacada | [Casual, Indie, Strategy] | LOGistICAL: South Africa | LOGistICAL: South Africa | http://store.steampowered.com/app/733530/LOGis... | 2018-01-04 | [Strategy, Indie, Casual] | 4.24 | http://steamcommunity.com/app/733530/reviews/?... | [Single-player, Steam Achievements, Steam Clou... | 4.99 | False | Sacada | | |
| 610660 | Laush Studio | [Indie, Racing, Simulation] | Russian Roads | Russian Roads | http://store.steampowered.com/app/610660/Russi... | 2018-01-04 | [Indie, Simulation, Racing] | 1.39 | http://steamcommunity.com/app/610660/reviews/?... | [Single-player, Steam Achievements, Steam Trad... | 1.99 | False | Laush Dmitriy Sergeevich | | |
| 658870 | SIXNAILS | [Casual, Indie] | EXIT 2 - Directions | EXIT 2 - Directions | http://store.steampowered.com/app/658870/EXIT_... | 2017-09-02 | [Indie, Casual, Puzzle, Singleplayer, Atmosphe... | | http://steamcommunity.com/app/658870/reviews/?... | [Single-player, Steam Achievements, Steam Cloud] | 4.99 | False | xropi,stev3ns | 1 user reviews | |
| 681550 | | | Maze Run VR | | http://store.steampowered.com/app/681550/Maze_... | | [Early Access, Adventure, Indie, Action, Simul... | | http://steamcommunity.com/app/681550/reviews/?... | [Single-player, Stats, Steam Leaderboards, HTC... | 4.99 | True | | Positive | |


# Preprocessing

To build a vector representation of each game, we first need to process the _features_ of interest.

In [6]:
df_features = df.copy()
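
As one illustration of this kind of processing, list-valued columns such as `genres` or `tags` can be expanded into one binary column per distinct value. The sketch below uses a small hypothetical frame (`toy`) in place of `df_features` and assumes each cell holds a Python list or a missing value. The notebook's own encoding step may differ.

```python
import pandas as pd

# Hypothetical toy frame mimicking the list-valued `genres` column
toy = pd.DataFrame(
    {'genres': [['Casual', 'Indie'], ['Indie', 'Racing'], None]},
    index=pd.Index([1, 2, 3], name='id'),
)

# One row per (game, genre) pair; a missing list stays as a single missing row
exploded = toy['genres'].explode()

# One indicator column per genre, then collapse back to one row per game.
# Missing genres produce an all-zero row because get_dummies ignores NaN/None.
genre_matrix = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
```

Each row of `genre_matrix` is then a binary vector that can be compared with the vectors of other games.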

## Release Year
Extracting the release year from the `release_date` field.

In [7]:
import re

def extract_year(release_date):
    # Return the year (as a string) for 'YYYY-MM-DD' dates, otherwise None.
    if isinstance(release_date, str) and re.match(r'^\d{4}-\d{2}-\d{2}$', release_date):
        return release_date.split('-')[0]
    return None

df_features['release_year'] = df_features['release_date'].apply(extract_year)
df_features[['release_date', 'release_year']].head()

| id | release_date | release_year |
|---|---|---|
| 761140 | 2018-01-04 | 2018.0 |
| 643980 | 2018-01-04 | 2018.0 |
| 670290 | 2017-07-24 | 2017.0 |
| 767400 | 2017-12-07 | 2017.0 |
| 773570 | | |
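
A vectorized alternative, shown here only as a sketch, is to let pandas parse the dates and coerce anything that does not fit the `YYYY-MM-DD` format to a missing value. The sample series below is hypothetical (the `'Soon..'` entry is made up to stand for a free-text date). Unlike `extract_year`, this variant returns integer years rather than strings, and it also rejects strings that match the pattern but are not real calendar dates.

```python
import pandas as pd

# Hypothetical sample mixing valid dates, a free-text date, and a missing value
release_dates = pd.Series(['2018-01-04', '2017-07-24', 'Soon..', None])

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(release_dates, format='%Y-%m-%d', errors='coerce')

# Nullable integer dtype keeps years as integers while allowing missing values
release_year = parsed.dt.year.astype('Int64')
```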


## Price
Converting the `price` field to _float_. Values that cannot be parsed as a number, such as "Free To Play", become 0.0.

In [8]:
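def convert_price(price):
    # Non-numeric values such as 'Free To Play' (or None) are treated as a price of 0.0.
    try:
        return float(price)
    except (TypeError, ValueError):
        return 0.0

df_features['price_'] = df_features['price'].apply(convert_price)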
df_features[['price', 'price_']].head()

| id | price | price_ |
|---|---|---|
| 761140 | 4.99 | 4.99 |
| 643980 | Free To Play | 0.0 |
| 670290 | Free to Play | 0.0 |
| 767400 | 0.99 | 0.99 |
| 773570 | 2.99 | 2.99 |
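
The same conversion can also be sketched without a Python-level loop by using `pd.to_numeric`. The sample series below reuses price values from the table above and adds a hypothetical missing entry. Note one behavioral difference: this version also maps `NaN` prices to 0.0, whereas `convert_price` leaves a float `NaN` unchanged. Which behavior is preferable depends on how missing prices should be treated.

```python
import pandas as pd

# Sample prices: numeric values, free-to-play labels, and a hypothetical missing entry
prices = pd.Series([4.99, 'Free To Play', 'Free to Play', 0.99, None])

# errors='coerce' maps unparseable strings to NaN; fillna then sets them to 0.0
price_numeric = pd.to_numeric(prices, errors='coerce').fillna(0.0)
```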
