# Video games recommender system
based on Steam sales datasets:
- https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data, gathered by [Insert teacher names]
- https://www.kaggle.com/nikdavis/steam-store-games

In this notebook we'll try to build a video game recommender system based on a Steam sales dataset. To do so, we'll use different techniques to recommend games to a user as accurately as possible.

_______

## Imports

In [1]:
#Data manipulation
import pandas as pd
import numpy as np
import string
import re

#File manipulation
import os
import gzip
import ast

#DataViz
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

#Helper functions
%load_ext autoreload
%autoreload 2
import helper

## Dataframe Construction

Let's begin with the Australian dataset, which contains user activity for Steam users in Australia.

In [2]:
#data = list(helper.parseDataFromFile('data/australian_users_items.json'))

This dataset contains information on **88,310** different Steam users.
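`helper.parseDataFromFile` is not shown in this notebook. As a rough sketch of what such a parser might do, assuming each line of the file holds one record written as a Python literal (the `ast` and `gzip` imports hint at this, but it is an assumption), a line-by-line generator could look like the following. The function name `parse_records` and the sample records are hypothetical.

```python
import ast
import gzip
import os
import tempfile


def parse_records(path):
    """Yield one dict per non-empty line; each line is assumed to be a Python literal."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield ast.literal_eval(line)


# Tiny made-up file to exercise the parser (not real dataset content).
sample_lines = [
    "{'user_id': 'u1', 'items': [{'item_id': '10', 'item_name': 'Counter-Strike'}]}",
    "{'user_id': 'u2', 'items': []}",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_users.json')
    with open(sample_path, 'w', encoding='utf-8') as handle:
        handle.write('\n'.join(sample_lines) + '\n')
    records = list(parse_records(sample_path))

print(len(records))
```

A generator keeps memory use low while reading; wrapping it in `list(...)`, as the cell above does, materializes every record at once.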

In [3]:
#full_user = pd.concat([helper.user_dataframe(elt) for elt in data],axis = 0).reset_index(drop = True)

In [4]:
#full_user.head()

Unnamed: 0,user_id,item_id,item_name,playtime_forever,playtime_2weeks
0,76561197970982479,10,Counter-Strike,6,0
1,76561197970982479,20,Team Fortress Classic,0,0
2,76561197970982479,30,Day of Defeat,7,0
3,76561197970982479,40,Deathmatch Classic,0,0
4,76561197970982479,50,Half-Life: Opposing Force,0,0
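The rows above suggest that each user record is flattened into one row per owned game. `helper.user_dataframe` is not shown, so the following is only a sketch of one way to do this with `pd.json_normalize`, assuming each record has a `user_id` key and an `items` list of dicts. The record and the function name `user_rows` are hypothetical.

```python
import pandas as pd

# Made-up record in the shape the user dataset is assumed to have.
record = {
    'user_id': 'u1',
    'items': [
        {'item_id': '10', 'item_name': 'Counter-Strike',
         'playtime_forever': 6, 'playtime_2weeks': 0},
        {'item_id': '20', 'item_name': 'Team Fortress Classic',
         'playtime_forever': 0, 'playtime_2weeks': 0},
    ],
}

COLUMNS = ['user_id', 'item_id', 'item_name', 'playtime_forever', 'playtime_2weeks']
INT_COLUMNS = ['item_id', 'playtime_forever', 'playtime_2weeks']


def user_rows(record):
    """Return one row per owned game, with the user id repeated on every row."""
    rows = pd.json_normalize(record, record_path='items', meta=['user_id'])
    # reindex also copes with users who own no games (empty frame).
    rows = rows.reindex(columns=COLUMNS)
    return rows.astype({column: 'int32' for column in INT_COLUMNS})


flat = user_rows(record)
print(flat)
```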


In [5]:
#full_user[['item_id','playtime_forever','playtime_2weeks']] = full_user[['item_id','playtime_forever','playtime_2weeks']].astype('int32')


Now, it could be useful to add features to our dataset. We might find some in the Steam dataset from Kaggle and in a similar dataset from https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data, both of which gather valuable information about games, such as the price.

## Adding columns

In [7]:
games_data = list(helper.parseDataFromFile('data/steam_games.json'))
games = pd.DataFrame(games_data)
steam_df = pd.read_csv('data/steam.csv')

In [8]:
#games.head()

Unnamed: 0,app_name,developer,discount_price,early_access,genres,id,metascore,price,publisher,release_date,reviews_url,sentiment,specs,tags,title,url
0,Lost Summoner Kitty,Kotoshiro,4.49,False,"[Action, Casual, Indie, Simulation, Strategy]",761140,,4.99,Kotoshiro,2018-01-04,http://steamcommunity.com/app/761140/reviews/?...,,[Single-player],"[Strategy, Action, Indie, Casual, Simulation]",Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...
1,Ironbound,Secret Level SRL,,False,"[Free to Play, Indie, RPG, Strategy]",643980,,Free To Play,"Making Fun, Inc.",2018-01-04,http://steamcommunity.com/app/643980/reviews/?...,Mostly Positive,"[Single-player, Multi-player, Online Multi-Pla...","[Free to Play, Strategy, Indie, RPG, Card Game...",Ironbound,http://store.steampowered.com/app/643980/Ironb...
2,Real Pool 3D - Poolians,Poolians.com,,False,"[Casual, Free to Play, Indie, Simulation, Sports]",670290,,Free to Play,Poolians.com,2017-07-24,http://steamcommunity.com/app/670290/reviews/?...,Mostly Positive,"[Single-player, Multi-player, Online Multi-Pla...","[Free to Play, Simulation, Sports, Casual, Ind...",Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...
3,弹炸人2222,彼岸领域,0.83,False,"[Action, Adventure, Casual]",767400,,0.99,彼岸领域,2017-12-07,http://steamcommunity.com/app/767400/reviews/?...,,[Single-player],"[Action, Adventure, Casual]",弹炸人2222,http://store.steampowered.com/app/767400/2222/
4,Log Challenge,,1.79,False,,773570,,2.99,,,http://steamcommunity.com/app/773570/reviews/?...,,"[Single-player, Full controller support, HTC V...","[Action, Indie, Casual, Sports]",,http://store.steampowered.com/app/773570/Log_C...


In [9]:
#steam_df.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


Merging these two dataframes before adding them to the **full_user** dataframe could be useful.

In [10]:
#steam_df_ = steam_df.drop(['appid','english','developer','publisher','platforms','required_age',
#               'achievements','positive_ratings','negative_ratings','average_playtime',
#              'median_playtime','owners'], axis = 1)

In [11]:
#games_ = games.drop(['id','developer','discount_price','early_access','metascore',
#                     'publisher','reviews_url','sentiment','title','url'],axis = 1)

In [12]:
#games_ = games_[['app_name','release_date','specs','genres','tags','price']]

In [13]:
#games_.columns = steam_df_.columns
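Assigning `steam_df_.columns` to `games_` relies on both frames having their columns in exactly the same order. As an alternative sketch, the same renaming can be written as an explicit mapping, which does not depend on column position. The mapping below is read off the two column lists shown earlier; the one-row frame is made up for illustration.

```python
import pandas as pd

# One-row stand-in for games_ after the column selection above.
games_ = pd.DataFrame({
    'app_name': ['Lost Summoner Kitty'],
    'release_date': ['2018-01-04'],
    'specs': [['Single-player']],
    'genres': [['Action', 'Casual']],
    'tags': [['Strategy', 'Action']],
    'price': [4.99],
})

# Old name -> new name; columns not listed keep their name.
column_map = {
    'app_name': 'name',
    'specs': 'categories',
    'tags': 'steamspy_tags',
}
games_ = games_.rename(columns=column_map)
print(list(games_.columns))
```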

In [14]:
#games_ = games_.dropna(axis = 0, subset = ['name'])

Many of the NaN values in **games_** correspond to data that is present in **steam_df_**, so to make most of them disappear we just need to fill **games_** with **steam_df_**'s values.

In [15]:
#games_ = games_.fillna(steam_df_)
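One point worth keeping in mind: when `fillna` receives a dataframe, pandas aligns it on index and column labels, not on the game name. If the two frames do not list the games in the same row order, indexing both by `name` first may be closer to the intent. The following is a minimal sketch of that idea on made-up data; it assumes names are unique in the frame used as the fill source.

```python
import pandas as pd

nan = float('nan')

# Made-up frames: the same two games, listed in a different order.
games_ = pd.DataFrame({'name': ['b', 'a'], 'price': [nan, 4.99]})
steam_df_ = pd.DataFrame({'name': ['a', 'b'], 'price': [3.99, 7.19]})

# Index both frames by name so that fillna matches rows game by game.
filled = (
    games_.set_index('name')
          .fillna(steam_df_.set_index('name'))
          .reset_index()
)
print(filled)
```

Here game `b` receives the price 7.19 from `steam_df_`, while game `a` keeps its own 4.99 because the value was not missing.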

In [16]:
#final_games = pd.concat([games_,steam_df_],axis = 0).reset_index(drop = True)
#final_games.shape

(59208, 6)

There might be a few duplicate values in this new dataframe.
Before using the **drop_duplicates** function on the games' names, we might want to apply the **text_cleaning** function to be sure we eliminate every duplicate.

In [17]:
#final_games['name'] = final_games['name'].astype(str)
#helper.text_cleaning(final_games,'name')
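`helper.text_cleaning` is not shown here. Judging from the cleaned names that appear later in the notebook (for example `counterstrike` and `halflifeopposingforce`), it seems to lowercase the name and strip punctuation and whitespace. A minimal sketch of such a normalization, with the hypothetical name `clean_name`, might be:

```python
import re


def clean_name(name):
    """Lowercase a title and drop every character that is not a letter or a digit."""
    # \W matches non-word characters; the underscore is removed explicitly.
    return re.sub(r'[\W_]+', '', str(name).lower())


examples = ['Counter-Strike', 'Half-Life: Opposing Force', '弹炸人2222']
cleaned = [clean_name(title) for title in examples]
print(cleaned)
```

Because `\W` is Unicode-aware for Python strings, non-Latin titles keep their characters instead of collapsing to an empty or digits-only key. On a dataframe, the same function could be applied with `Series.map`.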

In [18]:
#final_games = final_games.drop_duplicates(subset = 'name')
#final_games.shape

(43167, 6)

Now that the games' names have been cleaned in the **final_games** dataset, we need to do the same in **full_user**.

In [19]:
#helper.text_cleaning(full_user,'item_name')

In [20]:
#user_games = full_user.merge(final_games,how = 'left',left_on = 'item_name',
#                             right_on = 'name').drop(['name'],axis = 1)

In [21]:
#user_games.isna().sum()/len(user_games),len(user_games),len(full_user)

(user_id             0.000000
 item_id             0.000000
 item_name           0.000000
 playtime_forever    0.000000
 playtime_2weeks     0.000000
 release_date        0.209084
 categories          0.203795
 genres              0.211113
 steamspy_tags       0.202893
 price               0.211020
 dtype: float64, 5153209, 5153209)
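The NaN shares above mostly reflect games in **full_user** that found no match in **final_games**. As a sketch of how this could be checked directly, `merge` accepts `indicator=True`, which adds a `_merge` column telling whether each row matched. The frames below are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for full_user and final_games.
full_user = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'item_name': ['counterstrike', 'dayofdefeat', 'unknowngame'],
})
final_games = pd.DataFrame({
    'name': ['counterstrike', 'dayofdefeat'],
    'price': [7.19, 3.99],
})

merged = full_user.merge(final_games, how='left', left_on='item_name',
                         right_on='name', indicator=True)

# Share of user rows without a matching game, and which names are affected.
unmatched = merged['_merge'] == 'left_only'
unmatched_share = unmatched.mean()
unmatched_names = merged.loc[unmatched, 'item_name'].unique()
print(unmatched_share, unmatched_names)
```

The row count stays equal to `len(full_user)` only when names are unique in the right-hand frame, which is what dropping duplicates on `name` ensures.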

It looks like we'll have to drop about 20% of our data to get rid of the NaN values.

In [22]:
#final_df = user_games.dropna(axis = 0)

After dropping the NaN values we still have more than 4 million rows, which is more than enough.

In [23]:
#final_df.shape

(4020731, 10)

Time to save the resulting dataframe to an easy-to-load file.

In [2]:
#final_df.to_pickle('data/final_df.pckl')

In [3]:
final_df = pd.read_pickle('data/final_df.pckl')

In [4]:
final_df.head()

Unnamed: 0,user_id,item_id,item_name,playtime_forever,playtime_2weeks,release_date,categories,genres,steamspy_tags,price
0,76561197970982479,10,counterstrike,6,0,2000-11-01,"[Multi-player, Valve Anti-Cheat enabled]",[Action],"[Action, FPS, Multiplayer, Shooter, Classic, T...",9.99
1,76561197970982479,20,teamfortressclassic,0,0,1999-04-01,"[Multi-player, Valve Anti-Cheat enabled]",[Action],"[Action, FPS, Multiplayer, Classic, Shooter, C...",4.99
2,76561197970982479,30,dayofdefeat,7,0,2003-05-01,"[Multi-player, Valve Anti-Cheat enabled]",[Action],"[FPS, World War II, Multiplayer, Action, Shoot...",4.99
3,76561197970982479,40,deathmatchclassic,0,0,2001-06-01,"[Multi-player, Valve Anti-Cheat enabled]",[Action],"[Action, FPS, Multiplayer, Classic, Shooter, F...",4.99
4,76561197970982479,50,halflifeopposingforce,0,0,1999-11-01,"[Single-player, Multi-player, Valve Anti-Cheat...",[Action],"[FPS, Action, Sci-fi, Singleplayer, Classic, S...",4.99


## Feature engineering 

### Price

**price** should be a float column.

In [5]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4020731 entries, 0 to 5153208
Data columns (total 10 columns):
user_id             object
item_id             int32
item_name           object
playtime_forever    int32
playtime_2weeks     int32
release_date        object
categories          object
genres              object
steamspy_tags       object
price               object
dtypes: int32(3), object(7)
memory usage: 291.4+ MB


Lots of free games are not priced as **0** but as **'Free to play'** (for example); we need to change that.

In [35]:
final_df.loc[:,'price'] = final_df['price'].replace('.*Free.*|.*Demo.*|.*Install.*',0,regex = True)

In [36]:
final_df.loc[:,'price'] = final_df['price'].replace('^Starting at [$]{1}','',regex = True)

After looking up the **'Third-party'** priced game on the Steam store, we can see that it's also a free game.

In [41]:
final_df[final_df['price'] == 'Third-party']['item_name'].unique()

array(['peggleextreme'], dtype=object)

In [42]:
final_df.loc[:,'price'] = final_df['price'].replace('Third-party',0,regex = True)

Now we can convert the column to a float column.

In [47]:
# Plain column assignment replaces the column, so the new dtype is kept
final_df['price'] = final_df['price'].astype('float32')
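The price cleaning steps can also be sketched end to end on a small series. The labels below are illustrative stand-ins for the kinds of values described in this section, not an exhaustive list from the dataset, and `pd.to_numeric` with `errors='coerce'` is used here as one possible way to surface any label the patterns miss as NaN rather than raising an error.

```python
import pandas as pd

# Illustrative raw prices: numbers mixed with text labels.
raw = pd.Series([4.99, 'Free To Play', 'Free to Play',
                 'Starting at $499.00', 'Third-party', 'Play the Demo', '9.99'])

as_text = raw.astype(str)

# Labels treated as free, following the patterns used in this section.
is_free = as_text.str.contains(r'Free|Demo|Install|Third-party')

# Strip the 'Starting at $' prefix, then set the free labels to zero.
cleaned = as_text.str.replace(r'^Starting at \$', '', regex=True)
cleaned = cleaned.mask(is_free, '0')

# Anything still non-numeric becomes NaN instead of raising.
price = pd.to_numeric(cleaned, errors='coerce').astype('float32')
print(price)
```

Checking `price.isna().sum()` afterwards gives a quick count of labels that were not covered.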

### Release Date

In [79]:
#final_df['release_date'] = pd.to_datetime(final_df['release_date'],errors = 'coerce',format = '%Y-%m-%d')
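The release dates shown in `final_df.head()` use the `YYYY-MM-DD` layout. As a small sketch of the conversion under that assumption, `errors='coerce'` turns anything that does not match the format into `NaT`, which makes unparseable entries easy to count. The non-date value below is hypothetical.

```python
import pandas as pd

# Two dates in the layout shown above, one hypothetical non-date, one missing value.
dates = pd.Series(['2000-11-01', '1999-04-01', 'not a date', None])

parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# Entries that could not be parsed, and a year feature derived from the rest.
unparsed_count = parsed.isna().sum()
release_year = parsed.dt.year
print(parsed, unparsed_count)
```

A format string that does not match the data (for example `'%Y/%m'` on these values) would coerce every entry to `NaT`, so it is worth checking `unparsed_count` after the conversion.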