# Playful: Find your new favorite computer game
Here is the basic outline of how I built [an app that recommends computer games on Steam](http://www.playful.live/) using a combination of Python and PostgreSQL.

## Import stuff
My config.py file is not on GitHub. You need your own Steam API key and database information.

In [31]:
import json
import pandas as pd
import numpy as np
from app.config import api_key, db_username, db_password, db_host, db_port
from urllib.request import Request, urlopen
from urllib.error import HTTPError
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pickle
from lightfm import LightFM
from lightfm.evaluation import recall_at_k
from lightfm.cross_validation import random_train_test_split
from scipy import sparse
import math
import random

## Scrape reviews for user IDs
Scrapinghub has [a detailed example of how to scrape reviews from the Steam store using Scrapy](https://blog.scrapinghub.com/2017/07/07/scraping-the-steam-game-store-with-scrapy/), complete with code in a GitHub repo.

I scraped all of the reviews, which took about 4 days, in case later on I want to incorporate some of that information into the recommendations. For now the only thing I'm using from that exercise is a list of ~400,000 unique Steam user IDs of the review writers. I did not include any other Steam users, so my recommendations are biased toward games owned by people who have written reviews.

Due to space limitations on GitHub, I am sharing only a small part of 1 of the 3 scrapy output files.

In [2]:
def load_reviews():
    reviews = []
    path_to_scraped_data = 'example_data//'
    files = ['scraped_reviews.jl']
    
    for file in files:
        with open(''.join((path_to_scraped_data, file)), 'r') as f:
            for line in f:
                reviews.append(json.loads(line))
                
    return reviews

scraped_reviews = load_reviews()

user_ids = []
for review in scraped_reviews:
    try:
        user_ids.append(review['user_id'])
    except KeyError:
        pass
    
unique_users = list(set(user_ids))
print('There are', len(unique_users), 'unique steam user IDs in the sample data.')

There are 1094 unique steam user IDs in the sample data.


## API calls for game ownership
This took about 5 minutes, and you have to be online for the API call to work.

In the real app, I'm using a pickled version of the results to avoid complications in case a user deletes their account.
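
A minimal sketch of that caching pattern: fetch once, pickle the result, and load from disk on later runs. The helper name `load_or_fetch`, the cache file name, and the toy data are assumptions for illustration, not the app's actual code.

```python
import os
import pickle
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return the cached result if it exists; otherwise call fetch() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = fetch()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Toy stand-in for the real API loop (hypothetical data, not real Steam IDs).
def fake_fetch():
    return {'user_a': [{'appid': 620}], 'user_b': []}


cache_file = os.path.join(tempfile.mkdtemp(), 'users_and_their_games.pkl')
first = load_or_fetch(cache_file, fake_fetch)   # calls fake_fetch and writes the cache
second = load_or_fetch(cache_file, lambda: {})  # served from disk; the lambda is never called
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.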

In [3]:
def getGamesOwned(player_id):
    req = Request('http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=%s&steamid=%s&format=json&include_played_free_games=True&include_appinfo=True'%(api_key, player_id))
    try:
        data_raw = urlopen(req).read()
        data_json = json.loads(data_raw)
        return data_json['response']['games']
    except Exception:  # network error, bad JSON, or a private profile with no 'games' key
        return []

def get_all_games_owned_by_players(user_ids):
    users_and_their_games = {}
    for idx, gamer_id in enumerate(user_ids):
        users_and_their_games[gamer_id] = getGamesOwned(gamer_id)
    return users_and_their_games

users_and_their_games = get_all_games_owned_by_players(unique_users)

## Put the ownership data into pandas and PostgreSQL
Every user-game pair gets its own row in the database. For example, say I have data for only 2 unique Steam users, Katie and Minchun. If Katie owns 20 games and Minchun owns 3 games, I'll end up with 23 rows. 
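
The Katie and Minchun example can be sketched with made-up app IDs. The dictionary is shaped like the API results (user mapped to a list of game dicts), and the names are illustrative only.

```python
# Hypothetical ownership data shaped like the API results: user -> list of game dicts.
toy_users_and_games = {
    'Katie': [{'appid': 1000 + i} for i in range(20)],
    'Minchun': [{'appid': 2000 + i} for i in range(3)],
}

# One (user_id, app_id) row per user-game pair.
rows = [
    (user, game['appid'])
    for user, games in toy_users_and_games.items()
    for game in games
]

print(len(rows))  # 20 + 3 = 23
```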

You have to have a SQL server installed and running with appropriate password information for this section to work. Also, I used Windows, so the syntax may be different on a Mac or Linux.

In [4]:
user_column = []
app_column = []

for user in unique_users:
    for game in users_and_their_games[user]:
        user_column.append(user)
        app_column.append(game['appid'])

user_game_df = pd.DataFrame({'user_id':user_column, 'app_id':app_column})

db_name  = 'playful'
engine = create_engine('postgresql+psycopg2://%s:%s@%s:%s/%s'%(db_username,db_password,db_host,db_port,db_name))

if not database_exists(engine.url):
    create_database(engine.url)
    
user_game_df.to_sql('user_games_table', engine, if_exists='replace')

user_game_df.head()

| | app_id | user_id |
|---|---|---|
| 0 | 211500 | 76561198382802605 |
| 1 | 427100 | 76561198382802605 |
| 2 | 324310 | 76561198382802605 |
| 3 | 12520 | 76561198172905937 |
| 4 | 11480 | 76561198172905937 |


## SQL query for most popular games
This is how I came up with the list of the 12 most popular games on the app homepage. I'll convert the game IDs into actual names shortly.

At scale, this SQL query was much faster than a similar analysis in pandas.
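
The same kind of `GROUP BY` query can be tried on a toy table with Python's built-in `sqlite3` module, with no PostgreSQL server needed. This is only a sketch: the table contents are made up, and SQLite stands in for PostgreSQL.

```python
import sqlite3

# Toy ownership rows (hypothetical IDs): (user_id, app_id).
toy_rows = [
    (1, 730), (2, 730), (3, 730),
    (1, 620), (2, 620),
    (3, 4000),
]

toy_con = sqlite3.connect(':memory:')
toy_con.execute('CREATE TABLE user_games_table (user_id INTEGER, app_id INTEGER)')
toy_con.executemany('INSERT INTO user_games_table VALUES (?, ?)', toy_rows)

toy_query = """ SELECT app_id, COUNT(user_id) AS "n_owners"
                FROM user_games_table
                GROUP BY app_id
                ORDER BY n_owners DESC
                LIMIT 2
            """
top_games = toy_con.execute(toy_query).fetchall()
toy_con.close()
print(top_games)  # [(730, 3), (620, 2)]
```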

In [5]:
sql_query = """ SELECT app_id, COUNT(user_id) AS "n_owners"
                FROM user_games_table
                GROUP BY app_id
                ORDER BY n_owners DESC
                LIMIT 12
            """

con = None
con = psycopg2.connect(database=db_name, user=db_username, password=db_password, host=db_host, port=db_port)
most_popular_game_ids = pd.read_sql_query(sql_query, con).app_id.values

print('Here are the IDs of the most-owned games')
for game in most_popular_game_ids:
    print(game)

Here are the IDs of the most-owned games
223530
550
72850
218620
730
620
230410
205790
304930
4000
238960
49520


## Determine unique users and games

In [6]:
sql_query = """ SELECT *
                FROM user_games_table
            """

df = pd.read_sql_query(sql_query, con)
df.drop_duplicates(inplace=True)

unique_users = df.user_id.unique()
unique_games = df.app_id.unique()

n_users = len(unique_users)
n_games = len(unique_games)
n_datapoints = len(df)
sparsity = 100* n_datapoints / (n_users*n_games)

print('number of data points', n_datapoints)
print('number of users:', n_users)
print('number of games:', n_games)
print('Sparsity of data in the example interactions matrix: {:4.3f}%'.format(sparsity))

number of data points 234568
number of users: 911
number of games: 19946
Sparsity of data in the example interactions matrix: 1.291%


## Mappers
Each game has 3 different ways we can refer to it:
* the game's name (gamename)
* the game's Steam ID (gameid)
* the game's location in the interactions matrix (idx)

I made 6 different mapper dictionaries to convert from one representation of a game to another. The game name to Steam ID mapping comes from the API, but here and in the app I'm using a stored copy of that data to build 2 of the mapper dictionaries.

The users also get mapped to indexes in the matrix.

In [7]:
## Game name and game ID information from API
# req = Request('http://api.steampowered.com/ISteamApps/GetAppList/v2/?key=%s'%(api_key))
# data_raw = urlopen(req).read()
# data_json = json.loads(data_raw)['applist']['apps']

## Saved game name and game ID info
with open('app//playful//static//data//all_game_info.txt', 'r') as f:
    all_game_info = json.load(f)
    
gameid_to_name = {}
gamename_to_gameid = {}
for app in all_game_info:
    gameid_to_name[app['appid']] = app['name']
    gamename_to_gameid[app['name']] = app['appid']

idx_to_name = {}
idx_to_gameid = {}
name_to_idx = {}
gameid_to_idx = {}

for idx, gameid in enumerate(unique_games):
    idx_to_gameid[idx] = gameid
    gameid_to_idx[gameid] = idx
    
    try:
        idx_to_name[idx] = gameid_to_name[gameid]
    except KeyError:
        idx_to_name[idx] = "Could not identify this game. Maybe it's new?"
        
    try:
        name_to_idx[gameid_to_name[gameid]] = idx
    except KeyError:
        pass
      
userid_to_idx = {}
idx_to_userid = {}
for (idx, userid) in enumerate(unique_users):
    userid_to_idx[userid] = idx
    idx_to_userid[idx] = userid
    
# examples
game_idx = 2000
game_id = idx_to_gameid[game_idx]
game_name = gameid_to_name[game_id]
print(game_name, 'will be game number', game_idx, 'in the interactions matrix and has Steam game ID', game_id)

print('\nThe most-owned games in this sample of data by name instead of game ID:')
for gameid in most_popular_game_ids:
    print(gameid_to_name[gameid])

Killing Floor: Incursion will be game number 2000 in the interactions matrix and has Steam game ID 690810

The most-owned games in this sample of data by name instead of game ID:
Left 4 Dead 2 Beta
Left 4 Dead 2
The Elder Scrolls V: Skyrim
PAYDAY 2
Counter-Strike: Global Offensive
Portal 2
Warframe
Dota 2 Test
Unturned
Garry's Mod
Path of Exile
Borderlands 2


## Build the sparse interactions matrix
I and J specify the locations in the sparse matrix where the data V will go.

### Ownership data
The data in this case are all 1's that we put in the matrix to indicate which owner owns which game. All of the remaining entries in the matrix are zeroes, meaning we don't have any information about whether a given user is interested in a particular game.

### Hours played data
The API calls also give me each user's total playtime for each game (the `playtime_forever` field, which the Steam API reports in minutes rather than hours), so I could use some function of that number instead of just the binary owns/doesn't own. I played around with this a little bit, and LightFM can do that, but it's not as simple as just swapping the ones in the data for the playtime. The values need to go in as sample weights instead, in a sparse matrix whose form matches the training data. If only I had another two weeks...

Here are some additional considerations if I were to use hours played data. 
* **What does it mean when a user owns a game but hasn't played it?**   
Maybe they just bought the game and are really super excited about it, but I would assume that means they weren't that interested in the game, and so ideally I would put a -1 in the matrix. I don't think LightFM can handle that.
* **Sometimes people leave a game on even when they aren't playing it.**    
I could either apply a time cutoff or use the log of the hours played.
* **Some games end quickly while others lend themselves to much longer playtimes.**   
I could normalize the times by average time played or perhaps based on genre.
* **Older games have an advantage.**  
This is true, and my model also totally fails to account for changes in user preferences over time. However! The API call also tells me how long a user has spent playing each game in the last two weeks, so I could train on just that data.
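
A rough sketch of the log-scaling idea: turn raw playtimes into interaction weights. The toy records and the weighting function `1 + log1p(playtime)` are assumptions, not something I tested in the app.

```python
import math

# Hypothetical (user_idx, game_idx, playtime) records; playtime as reported by the API.
toy_playtimes = [
    (0, 0, 39603),
    (0, 1, 0),      # owned but never played
    (1, 0, 120),
    (1, 2, 4893),
]

rows, cols, weights = [], [], []
for user_idx, game_idx, playtime in toy_playtimes:
    rows.append(user_idx)
    cols.append(game_idx)
    # log1p compresses very long playtimes; the leading 1.0 keeps
    # owned-but-unplayed games at weight 1 instead of 0.
    weights.append(1.0 + math.log1p(playtime))
```

These three lists could then be packed with `sparse.coo_matrix((weights, (rows, cols)))` and passed to `model.fit` through LightFM's `sample_weight` argument, which expects the same row and column layout as the interactions matrix.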

In [8]:
def map_id(idx_to_switch_out, mapper):
    return mapper[idx_to_switch_out]

I = df.user_id.apply(map_id, args=[userid_to_idx]).values
J = df.app_id.apply(map_id, args=[gameid_to_idx]).values 
V = np.ones_like(I)

interaction_matrix = sparse.coo_matrix((V, (I, J)), dtype=np.float64)

## Split the data into training and test sets
This split is not as straightforward as in some other machine learning algorithms because I need *some* information about what a user owns to make recommendations, so I can't just hold a group of users out entirely. Instead, I split the data into two sets with the same users, but my training data contains 80% of the users' games, and the test data contains the other 20%. The Python package LightFM includes a handy function that does that for me.
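
Conceptually, the split shuffles individual user-game pairs rather than users. Here is a simplified illustration on made-up data. It is not LightFM's implementation, and the function name `split_interactions` is hypothetical.

```python
import random


def split_interactions(pairs, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of (user, game) pairs as test data."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical interactions: 5 users who each own 10 games.
toy_pairs = [(user, game) for user in range(5) for game in range(10)]
train_pairs, test_pairs = split_interactions(toy_pairs)
print(len(train_pairs), len(test_pairs))  # 40 10
```

Because the pairs are sampled across the whole matrix, the 80/20 ratio holds overall but not necessarily for every individual user.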

In [9]:
traindata, testdata = random_train_test_split(interaction_matrix)

## Implement matrix factorization
LightFM uses stochastic gradient descent to solve for the latent vectors, or embeddings, that characterize each game and user in the interactions matrix.

Hyperparameters that must be chosen for the model include:
* the length of the latent vectors (no_components)
* the learning rate to use during gradient descent
* the number of iterations, or epochs, to use when trying to fit the data
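* the exact form of the loss function (LightFM's default is `logistic`, which is what the fit below uses since it does not set `loss`; WARP is a common alternative for implicit feedback like ownership data)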

Ideally one would use a grid search, or start with random points within the grid, to decide what values to use for the various hyperparameters. That takes a while, so here I'm showing the fit with the hyperparameters I used. Note that I did not do a proper grid search, but there is a graph in the backup slides at playful.live showing that the number of components in particular is certainly improved from the default value of 10.
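
For reference, a random search could look roughly like the sketch below. The candidate ranges and the `evaluate` callable are assumptions. A stand-in scoring function is used so the example runs on its own; in practice `evaluate` would fit a model with the given parameters and return recall@k on held-out data.

```python
import random


def random_search(evaluate, n_trials=20, seed=0):
    """Try random hyperparameter combinations and return the best (score, params)."""
    rng = random.Random(seed)
    best_score, best_params = float('-inf'), None
    for _ in range(n_trials):
        params = {
            'no_components': rng.choice([10, 25, 50, 100]),
            'learning_rate': 10 ** rng.uniform(-3, -1),  # log-uniform in [0.001, 0.1]
            'epochs': rng.choice([10, 20, 40]),
        }
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Stand-in scoring function (hypothetical): higher is better, peaks near 25 components.
def toy_evaluate(params):
    return -abs(params['no_components'] - 25) - abs(params['learning_rate'] - 0.045)


best_score, best_params = random_search(toy_evaluate)
print(best_params)
```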

In [10]:
model = LightFM(no_components=25, learning_rate=0.045)
model.fit(traindata, epochs=40) 

<lightfm.lightfm.LightFM at 0x1990fc197b8>

## Recall@k
There are a lot of different validation metrics one can use to evaluate recommender systems. The one I used when optimizing my hyperparameters is called recall@k.  

Recall refers to the number of true positives / (the number of true positives + the number of false negatives), and I like it better than precision (true positives / (true positives + false positives)) here because recall, unlike precision, does not assume that a zero in the matrix (lack of ownership) means that person wouldn't like the game if we recommended it.

Recall@k tells us this: if I recommend only k games (12 games in this example) out of my list of ~20,000 games to users based on their games in the training data, how likely am I to recommend the games they own that I held out when training the model? 
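
A worked example for a single user, using made-up game IDs and a hypothetical helper (this is not LightFM's code):

```python
def recall_at_k_for_user(ranked_recommendations, held_out_games, k):
    """Fraction of a user's held-out games that appear in the top-k recommendations."""
    if not held_out_games:
        return None  # recall is undefined for a user with no held-out games
    top_k = set(ranked_recommendations[:k])
    hits = len(top_k & set(held_out_games))
    return hits / len(held_out_games)


# Hypothetical game IDs: the model ranks games best-first; the user has 4 held-out games.
ranked = [730, 620, 550, 4000, 8930, 72850]
held_out = [620, 8930, 49520, 218620]

print(recall_at_k_for_user(ranked, held_out, k=3))  # 1 hit (620) out of 4 -> 0.25
print(recall_at_k_for_user(ranked, held_out, k=5))  # 2 hits (620, 8930) out of 4 -> 0.5
```

The number reported for a model is the mean of this per-user value over all users who have held-out games.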

And again LightFM has a handy function.

In [78]:
example_recall = recall_at_k(model, testdata, k=12).mean()
true_model_recall = 0.083

print('recall@12 for this example:', example_recall)
print('recall@12 for my actual model:', true_model_recall)

recall@12 for this example: 0.0476204036598
recall@12 for my actual model: 0.083


### Comparison with just recommending the most popular games
This is a super relevant and important comparison to make, but the math is not straightforward. I tried simulating it with a for loop, but that approach hadn't found a single hit (a randomly dropped game that was one of the 12 most popular games) even after running all night. In contrast, LightFM's recall_at_k function is incredibly fast, I think because they're making good use of things like cython and sparse matrices. If I had another two weeks, this comparison is definitely something I would want to sort out. Just qualitatively though, I will note that there is a lot of diversity in the genres of those 12 most-owned games (e.g., a physics sandbox vs a first-person shooter vs a strategy game), and the recommendations my model produces have a lot more game features that are obviously in common with each other.
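
One possible way to set up this baseline, sketched on toy data: rank games once by how many training users own them, then score every user against that fixed ranking. The function name and data layout (dictionaries of sets) are assumptions, and skipping games the user already owns is a design choice rather than something the recall number above does.

```python
from collections import Counter


def popularity_recall_at_k(train, test, k):
    """Mean recall@k when every user is recommended the k most-owned training games.

    train and test map each user to the set of games they own in that split.
    """
    counts = Counter(game for games in train.values() for game in games)
    ranked = [game for game, _ in counts.most_common()]
    recalls = []
    for user, held_out in test.items():
        if not held_out:
            continue  # recall is undefined without held-out games
        owned = train.get(user, set())
        # Skip games the user already owns in the training data, then take the top k.
        recs = [game for game in ranked if game not in owned][:k]
        recalls.append(len(set(recs) & held_out) / len(held_out))
    return sum(recalls) / len(recalls)


# Hypothetical toy data.
toy_train = {'a': {1, 2}, 'b': {1, 3}, 'c': {1, 2, 3}}
toy_test = {'a': {3}, 'b': {2, 4}, 'c': {5}}
baseline_recall = popularity_recall_at_k(toy_train, toy_test, k=1)
print(baseline_recall)
```

Counting owners once up front avoids re-scanning the data for every user, which is what makes a plain loop over users feasible.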

###  Comparison with random guessing
If we randomly pick 12 games out of 20K and don't care about the order within that list of 12, the chance that any one dropped game lands among those 12 follows from the [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution) and works out to 12 / 20K, which is also the expected recall@12 for random guessing. Note the exact number of unique games in the Steam store changed between when I first created my model and when I created this example.
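
To check that claim on a small scale, the expected recall can be computed directly from the hypergeometric probabilities. The catalogue size and number of held-out games below are made up to keep the sums small.

```python
from math import comb


def expected_random_recall(n_games, n_held_out, k):
    """Expected recall@k when k of n_games are recommended uniformly at random."""
    expected_hits = sum(
        hits * comb(n_held_out, hits) * comb(n_games - n_held_out, k - hits)
        for hits in range(0, min(k, n_held_out) + 1)
    ) / comb(n_games, k)
    return expected_hits / n_held_out


# Small hypothetical catalogue: 50 games, 5 held out, 12 recommended.
toy_recall = expected_random_recall(n_games=50, n_held_out=5, k=12)
print(toy_recall)  # equals k / n_games = 0.24 up to floating-point rounding
```

The result does not depend on how many games were held out, only on k and the catalogue size.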

In [79]:
print('Chance of picking the 12 dropped games by random guessing:', 12./len(unique_games))
print('which is', round(true_model_recall/(12./len(unique_games))), 'times worse than my model')

Chance of picking the 12 dropped games by random guessing: 0.0006016243858417727
which is 138 times worse than my model


## The item similarity matrix
The model's item embeddings are vectors that represent each game. (These are what the matrix factorization fit figured out.) We take the dot product of the embedding matrix with its transpose, normalize by the vector lengths, and voila, there is a matrix of cosine similarities between games.
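
A tiny example with made-up 2-component embeddings shows what the normalization does: games whose vectors point the same way get a similarity of 1 regardless of length, and orthogonal games get 0.

```python
import numpy as np

# Hypothetical 2-component embeddings for 3 games.
toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as game 0, different length
    [0.0, 3.0],   # orthogonal to game 0
])

gram = toy_embeddings.dot(toy_embeddings.T)   # raw dot products
norms = np.sqrt(np.diagonal(gram))            # vector lengths
toy_similarity = gram / norms[:, None] / norms[None, :]

print(toy_similarity)
```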

In [14]:
game_similarity_matrix = model.item_embeddings.dot(model.item_embeddings.T)
normalizeto = np.array([np.sqrt(np.diagonal(game_similarity_matrix))])
game_similarity_matrix = game_similarity_matrix / normalizeto / normalizeto.T

## The cold start problem
One major drawback of collaborative filtering is that if a user or game isn't in the interactions matrix, the model has no way to make recommendations. That's why recommenders still need things like game features (developer studio, genre, tags, etc.) and user features (games owned, demographics, etc.). 

### New games
My model never recommends any bright, shiny, brand new games. If I were to retrain the model every week (which I would definitely set up if I had another 2 weeks to work on this), then I would start to pick up the new games, but they won't show up right away. If that's the kind of recommendations you want (i.e., of the games that came out in the last, say, month, which ones are most relevant to me as a user?), you are in luck because that is exactly what the Steam store already does, or at least, is trying to do. 

### New users
For a brand new user, I show them the most popular games by number of owners (see list above), but 'new user' in this context doesn't only mean brand new users who don't own any games. It means any user who isn't in the interactions matrix. My app works for any Steam user who owns games, which means I need some information about the user. Specifically, I use the games they own and how many hours they have played each game.

## API call for user information

This example uses my Steam vanityurl (which has to be set by the user in their Steam settings - just having a Steam account name is not enough!), but the app can also use the 17-digit Steam user ID.
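
One way to tell the two input types apart before calling the API is a simple format check. This is a heuristic sketch with a hypothetical helper name; the code in the next cell instead sends every input to the vanity URL endpoint and falls back to the raw input.

```python
def looks_like_steam_id(user_input):
    """Heuristic: treat a 17-digit, all-numeric string as a Steam user ID."""
    text = str(user_input).strip()
    return len(text) == 17 and text.isdigit()


print(looks_like_steam_id('76561198382802605'))  # True
print(looks_like_steam_id('elizabethferriss'))   # False
```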

In [15]:
def convert_input_to_userid(input_id):
    """ 
    Take user input from app (Steam user ID or vanity URL) and output Steam user ID for further API calls
    """
    req = Request('http://api.steampowered.com/ISteamUser/ResolveVanityURL/v0001/?key=%s&vanityurl=%s'%(api_key, input_id))

    try:
        data_raw = urlopen(req).read()
    except HTTPError:
        return input_id

    data_json = json.loads(data_raw)

    try:
        return int(data_json['response']['steamid'])
    except KeyError:
        return input_id


def get_user_games(user_id):
    """ 
    Take Steam ID and make an API call to return user's owned games and hours played 
    """
    req = Request('http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=%s&steamid=%s&format=json&include_played_free_games=True&include_appinfo=True'%(api_key, user_id))
    try:
        data_raw = urlopen(req).read()
        data_json = json.loads(data_raw)
        return data_json['response']['games']
    except Exception:  # network error, bad JSON, or a private profile with no 'games' key
        return []
    
example_steam_urlname = 'elizabethferriss' 
user_id = convert_input_to_userid(example_steam_urlname)
user_game_info = get_user_games(user_id)

print('My games')
print(user_game_info)

My games
[{'appid': 400, 'playtime_forever': 310}, {'appid': 15130, 'playtime_forever': 529}, {'appid': 22330, 'playtime_forever': 4893}, {'appid': 22320, 'playtime_forever': 101}, {'appid': 40700, 'playtime_forever': 551}, {'appid': 3900, 'playtime_forever': 1}, {'appid': 3990, 'playtime_forever': 137}, {'appid': 8800, 'playtime_forever': 30471}, {'appid': 16810, 'playtime_forever': 0}, {'appid': 34440, 'playtime_forever': 0}, {'appid': 34450, 'playtime_forever': 0}, {'appid': 34460, 'playtime_forever': 0}, {'appid': 8930, 'playtime_forever': 39603}, {'appid': 32360, 'playtime_forever': 581}, {'appid': 32460, 'playtime_forever': 675}, {'appid': 61510, 'playtime_forever': 129}, {'appid': 620, 'playtime_forever': 43}, {'appid': 203770, 'playtime_forever': 1473}, {'appid': 39140, 'playtime_forever': 1129}]


## Rank user's games based on hours played

In [16]:
user_game_ids = [app['appid'] for app in user_game_info]
user_hours_played = [app['playtime_forever'] for app in user_game_info]
userdf = pd.DataFrame({'appid': user_game_ids, 'hours_played' : user_hours_played})
userdf = userdf.sort_values(by='hours_played', ascending=False)
userdf['game_name'] = [gameid_to_name[gameid] for gameid in userdf.appid]
user_game_ids = userdf.appid.values
user_hours_played = userdf.hours_played.values
userdf.head()

| | appid | hours_played | game_name |
|---|---|---|---|
| 12 | 8930 | 39603 | Sid Meier's Civilization V |
| 7 | 8800 | 30471 | Sid Meier's Civilization IV: Beyond the Sword |
| 2 | 22330 | 4893 | The Elder Scrolls IV: Oblivion |
| 17 | 203770 | 1473 | Crusader Kings II |
| 18 | 39140 | 1129 | FINAL FANTASY VII |


## Make recommendations based on the user's most-played games
For each of the user's most-played games, take that game's row in the game similarity matrix and sort it to find the most similar games.

The recommendations here are much different from the ones on the actual app because here I'm only using a very small selection of users to train my model. 

In [17]:
def idx_to_recs(game_idx):
    game_recs_scores = game_similarity_matrix[game_idx]
    df = pd.DataFrame({'game_idx':list(idx_to_name.keys()), 'scores':game_recs_scores})
    df = df.sort_values(by='scores', ascending=False)
    df['gameID'] = [idx_to_gameid[idx] for idx in df.game_idx]
    df['games'] = [idx_to_name[idx] for idx in df.game_idx]
    df = df[~df.gameID.isin(user_game_ids)] # filter out games already owned
    return df['games'].values

nrecgroups =  10
nrecs_per_group = 8
games_already_recommended = []
for n in range(nrecgroups):
    user_gameid= user_game_ids[n]
    print('  People who own', gameid_to_name[user_gameid], 'also own:')
    recs = idx_to_recs(gameid_to_idx[user_gameid])
    recs = [rec for rec in recs if rec not in games_already_recommended] # don't recommend anything twice
    for rec in recs[0:nrecs_per_group]:
        games_already_recommended.append(rec)
        print(rec)
    print()

  People who own Sid Meier's Civilization V also own:
Warhammer 40,000: Dawn of War II - Retribution
Call of Duty: Black Ops
Star Conflict
Fallout Shelter
Magicka
Call of Duty: Black Ops - Multiplayer
Torchlight
This War of Mine

  People who own Sid Meier's Civilization IV: Beyond the Sword also own:
Saints Row 2
STAR WARS™: Knights of the Old Republic™
Pillars of Eternity
Torchlight II
Divinity: Original Sin (Classic)
Titan Quest Anniversary Edition
Rogue Legacy
Darksiders

  People who own The Elder Scrolls IV: Oblivion  also own:
Need for Speed: Hot Pursuit
Brütal Legend
Amnesia: The Dark Descent
Grand Theft Auto: Vice City
The Elder Scrolls V: Skyrim
Arma 2: DayZ Mod
Hotline Miami
Metro 2033

  People who own Crusader Kings II also own:
Worms Reloaded
The Witcher 3: Wild Hunt
Company of Heroes: Opposing Fronts
Rising Storm/Red Orchestra 2 Multiplayer
RimWorld
Castle Crashers
Don't Starve Together
Natural Selection 2

  People who own FINAL FANTASY VII also own:
STAR WARS™ Empire a