# Building the Recommendation System

In this notebook, I build the recommendation system.  
* I clean the data using the same process as in the EDA notebook.
* I then build a cosine distance matrix and implement several functions to help search it.
* Finally, I use a custom Python class, `CustomSearch`, that can run a search on one or more games in a single call.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances
from sklearn.feature_selection import VarianceThreshold
from scipy import sparse
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import csv
from collections import Counter

# Content: 

Import the game metadata

In [2]:
from os import listdir
from os.path import isfile, join

meta_dir = './GB_API_Scrape//game_meta_data/'
# get the list of files in the metadata folder
onlyfiles = [f for f in listdir(meta_dir) if isfile(join(meta_dir, f))]
list_of_meta = []  # fill this list with dictionaries
# list of features I care about
feature_list = ['guid', 'name', 'concepts', 'themes', 'deck', 'developers', 'publishers', 'genres']
for file in onlyfiles:
    # open each file, keep only the selected keys, and add the resulting dict to the list
    with open(join(meta_dir, file), 'r') as f:
        game = json.load(f)
        game_dict = {key: value for key, value in game.items() if key in feature_list}

        # convert lists to comma-separated strings
        for key, value in game_dict.items():
            if isinstance(value, list):
                game_dict[key] = ', '.join(value)
        list_of_meta.append(game_dict)
# make pandas dataframe
df = pd.DataFrame(list_of_meta)
df.fillna("", inplace=True)

## Here we have the raw dataframe. We need to:
* One-hot encode (dummy) the columns
* Drop features that are too sparse
* Drop games that have too few features

In [3]:
df.head()

Unnamed: 0,concepts,deck,developers,genres,guid,name,publishers,themes
0,"Achievements, PlayStation Trophies, Steam, Dig...",Kill The Bad Guy is a puzzle-game where physic...,Exkee,"Strategy, Simulation",3030-46539,Kill the Bad Guy,,
1,,Zeal is an indie online ARPG developed by Lyca...,Lycanic Studios,"Action, Role-Playing, MOBA",3030-68714,Zeal,,"Fantasy, Medieval"
2,,Vertical Drop Heroes HD is an action platformer.,Nerdook Productions,"Action, Role-Playing, Platformer",3030-48249,Vertical Drop Heroes HD,,Fantasy
3,"Unreal Engine 4, PlayStation VR Support",A puzzle mystery game for PS VR.,Tarsier Studios,Puzzle,3030-57976,Statik,,
4,,A compilation of all three Banner Saga titles.,Stoic,"Strategy, Role-Playing, Compilation",3030-68731,The Banner Saga Trilogy,,Fantasy


In [4]:
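def split_features_from_col(df, col):
    '''
    Returns a dataframe of one-hot encoded features from the selected col
    '''
    df[col] = ['' if entry is None else entry for entry in df[col]]
    # each comma-separated entry becomes one token
    cvec = CountVectorizer(stop_words='english', tokenizer=lambda x: x.split(', '))
    bow = cvec.fit_transform(df[col])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    ret_df = pd.DataFrame(bow.toarray(),
                          columns=[col + "_" + name for name in cvec.get_feature_names_out()])
    # drop the column produced by empty entries, if there is one
    ret_df.drop(columns=col + '_', errors='ignore', inplace=True)
    return ret_df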

def split_features(df, list_of_cols):
    '''
    Returns a dataframe of 1 hot encoded features from a list of cols
    '''
    ret_df = df.loc[:, ['name', 'guid']]
    for col in list_of_cols:
        ret_df = pd.merge(ret_df, split_features_from_col(df, col), left_index=True, right_index=True)
        
    return ret_df

In [5]:
# We have 4 features that we want to split
dummied_df = split_features(df, ['concepts', 'genres', 'themes', 'developers'])

In [6]:
dummied_df.set_index("name", inplace=True)

dummied_df.drop(columns="guid", inplace=True)

dummied_df.shape

(1746, 5070)

##  Manual Feature Reduction

In [7]:
dummied_df.var().sort_values(ascending=False)[0:10]

concepts_digital distribution    0.236484
concepts_steam                   0.193403
themes_fantasy                   0.171660
genres_action                    0.169722
themes_sci-fi                    0.154435
concepts_indie                   0.128014
concepts_steam achievements      0.127613
genres_adventure                 0.114020
genres_role-playing              0.111472
concepts_achievements            0.108037
dtype: float64

I don't want features like `concepts_digital distribution`, or any feature whose name contains `concepts_steam`, `concepts_pax`, or `concepts_e3`.

In [8]:
drop_cols = ["concepts_digital distribution", "concepts_wasd movement", 
             "concepts_achievements", "concepts_playstation trophies",
              "concepts_subtitles"]
for col in dummied_df.columns:
    if ("concepts_steam" in col) or ("concepts_pax" in col) or ("concepts_e3" in col):
        drop_cols.append(col)
drop_cols[0:10]
print(f"Dropping {len(drop_cols)} columns")

Dropping 58 columns


In [9]:
dummied_df.drop(columns=drop_cols, inplace=True)

### Dropping games with too few (5 or fewer) features

In [10]:
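thresh = 6
# note: name is the index and guid was dropped, so iloc[:, 2:] skips the first two feature columns
(dummied_df.iloc[:, 2:].sum(axis=1) >= thresh).sum()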

1073

In [11]:
dummied_df.drop(index=dummied_df[dummied_df.iloc[:,2:].T.sum()<thresh].index, inplace=True)

In [12]:
dummied_df.shape

(1073, 5012)

## Using sklearn's VarianceThreshold to drop features with a variance of 0.005 or less

In [13]:
vt = VarianceThreshold(.005)
thresh_df = vt.fit_transform(dummied_df)

In [14]:
thresh_df.shape

(1073, 1208)
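
To get a feel for what a 0.005 threshold means, here is a small sketch. It assumes every feature is strictly 0/1 (the real columns are counts, so this is an approximation). A 0/1 column where a fraction p of rows is 1 has variance p(1 - p), so with 1073 games the cutoff falls between features present in 5 games and features present in 6.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n_games = 1073  # row count reported above

def binary_column(n_rows, n_ones):
    """Return a 0/1 column with exactly n_ones ones."""
    col = np.zeros(n_rows, dtype=float)
    col[:n_ones] = 1.0
    return col

# One feature present in 5 games, one present in 6 games.
X = np.column_stack([binary_column(n_games, 5), binary_column(n_games, 6)])

selector = VarianceThreshold(threshold=0.005).fit(X)
print(selector.variances_)     # p * (1 - p) for each column
print(selector.get_support())  # True where the feature is kept
```

Under that assumption, the step above removes any feature that appears in 5 or fewer of the remaining games.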

## Using TruncatedSVD to reduce feature space
* The goal was to capture at least 90% of the variance
* I found that 300 components do this

In [15]:
svd = TruncatedSVD(n_components=300)
content = svd.fit_transform(thresh_df)
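
The notebook does not show how the 300 figure was found. One way to check a claim like this is to sum `explained_variance_ratio_` cumulatively and look for the first point where it reaches the target. The sketch below does that on synthetic low-rank data (the array `X`, the name `svd_check`, and the sizes are all assumptions for illustration, not the real feature matrix).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic low-rank data standing in for thresh_df (not the real features).
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 40))
X = latent @ loadings + 0.01 * rng.normal(size=(200, 40))

svd_check = TruncatedSVD(n_components=10, random_state=0).fit(X)
cumulative = np.cumsum(svd_check.explained_variance_ratio_)

target = 0.90
reached = cumulative >= target
# Smallest number of components whose cumulative ratio reaches the target, if any.
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print(n_needed, cumulative[-1])
```

On the real data, the same check would be `np.cumsum(svd.explained_variance_ratio_)` after fitting with a generous `n_components`.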

# Now to make the similarity matrix
* uses cosine distance, so smaller values mean more similar games
* the matrix is stored in a pandas dataframe called `distance_df`


In [16]:
sparse_content = sparse.csr_matrix(content)

In [17]:
distances = pairwise_distances(content, metric='cosine') 
distance_df = pd.DataFrame(distances, index=dummied_df.index, columns=dummied_df.index)
distance_df.head()

name,Zeal,Giana Sisters: Twisted Dreams,Warriors All-Stars,Ultra Street Fighter IV,Arcania: Gothic 4,Lichdom: Battlemage,The King of Fighters Collection: The Orochi Saga,Alien Shooter,JumpJet Rex,Anthem,...,Alienation,World Heroes 2 JET,Way of Redemption,Mystereet F: Tantei-tachi no Curtain Call,Hotline Miami 2: Wrong Number,The Magic Circle,Gran Turismo Sport,Guts and Glory,Mighty No. 9,Sonic Forces
Zeal,0.0,0.938141,0.639236,1.00557,0.781601,0.58813,1.002414,0.845765,0.97744,0.53277,...,0.977181,1.001847,0.51722,1.017724,0.899563,0.837721,0.996237,1.009289,0.911808,0.871629
Giana Sisters: Twisted Dreams,0.938141,0.0,0.952108,0.913315,0.904198,0.929111,1.003138,0.959817,0.836322,0.999555,...,0.898772,1.004193,0.942976,1.002356,0.916096,0.962112,1.002712,0.92663,0.780074,0.821885
Warriors All-Stars,0.639236,0.952108,0.0,0.951291,0.963152,0.814962,0.927697,0.902171,0.999816,0.834512,...,0.992497,1.006173,0.715051,1.000826,0.931356,0.873099,1.015287,1.004599,0.941035,0.889685
Ultra Street Fighter IV,1.00557,0.913315,0.951291,0.0,0.928852,0.88648,0.843954,0.999383,0.918656,1.001679,...,0.939886,0.780165,1.004629,0.95073,1.003902,0.960288,1.003845,0.998059,0.935235,1.000497
Arcania: Gothic 4,0.781601,0.904198,0.963152,0.928852,0.0,0.800334,1.001393,0.961998,0.917262,0.934299,...,0.971011,0.967038,0.947819,0.946803,0.979807,0.865154,0.995119,0.99623,0.977566,0.958615
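
A few entries in the table are slightly above 1 (for example 1.00557). Cosine distance is 1 minus cosine similarity, and SVD components can be negative, so the similarity can dip below zero. The toy vectors below are made up to show the range of values; `manual_cosine_distance` is a helper written only for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

a = np.array([[1.0, 0.0]])
b = np.array([[1.0, 1.0]])
c = np.array([[-1.0, 0.0]])  # negative component, as SVD output can have

def manual_cosine_distance(u, v):
    """1 - cos(angle between u and v), written out by hand."""
    u, v = u.ravel(), v.ravel()
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_ab = cosine_distances(a, b)[0, 0]  # 45 degrees apart
d_ac = cosine_distances(a, c)[0, 0]  # pointing in opposite directions
print(d_ab, manual_cosine_distance(a, b))
print(d_ac, manual_cosine_distance(a, c))
```

Identical directions give 0, orthogonal vectors give 1, and opposite directions give 2.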


In [18]:
# Functions that interact with the content filter

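def getSimilarGames(game, num=10):
    '''
    This function searches for a game and returns the similar games
    '''
    matches = search_game(game)
    if not matches:
        raise ValueError(f"No game found matching '{game}'")
    # use the first matching title
    return get_simular_games_from_title(matches[0], num)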

def search_game(search): # Added to class
    '''
    This helper function looks for games that match the search and returns them as a list
    '''
    return [game for game in distance_df.columns if search.lower() in game.lower() ]

def get_simular_games_from_title(title, num):
    '''
    This helper function returns the top num similar games given a title
    '''
    # position 0 is the title itself (distance 0), so skip it
    return distance_df[title].sort_values(ascending=True).index[1:num+1]

# requires the dummied_df, and needs to be run through vt and svd (maybe use a pipe)
def get_feature_vec(game):
    '''
    This function returns the binary vector associated with the feature space of a single game entry in the dummied dataframe
    '''
    title = search_game(game)[0]
    return dummied_df.loc[title, :].values

def combine_vec(v1, v2, method='or'):
    '''
    combines 2 feature vectors in the specified method
    method = {'union', 'and', 'or', 'intersect', 'add'}
    '''
    # add = v1+v2
    # XOR = (v1+v2) %2 (not implemented)
    # or = (v1+v2)>0
    # and/intersect = (v1*v2)

    if method in ('or', 'union'):
        return ((v1+v2)>0).astype(int)
    if method in ('and', 'intersect'):
        return v1*v2
    if method == 'add':
        return v1+v2
    # fail loudly instead of silently returning None
    raise ValueError(f"Unknown method: {method}")

def transform_vector(vector):
    '''
    Given a binary vector of features, returns the transformed vector after feature reduction
    '''
    return svd.transform(vt.transform(vector.reshape(1, -1)))

# Requires content
def getCosineToVector(vector):
    '''
    returns a vector of cosine distances from a custom transformed vector to every game
    '''
    return cosine_distances(vector, content)
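
As a quick illustration of the three combination modes in `combine_vec`, here they are applied to two made-up 0/1 vectors (`fez_like` and `npp_like` are hypothetical and do not come from the dataset).

```python
import numpy as np

# Hypothetical binary feature vectors for two games.
fez_like = np.array([1, 0, 1, 0])
npp_like = np.array([1, 1, 0, 0])

union = ((fez_like + npp_like) > 0).astype(int)  # feature in either game
intersection = fez_like * npp_like               # feature in both games
added = fez_like + npp_like                      # shared features count double

print(union)         # [1 1 1 0]
print(intersection)  # [1 0 0 0]
print(added)         # [2 1 1 0]
```

The union keeps the query broad, the intersection keeps only what the games share, and addition gives extra weight to shared features.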


In [19]:
getSimilarGames("Fez")

Index(['VVVVVV', 'Escape Goat 2', 'Celeste', 'Badland', 'Machinarium',
       'Jak II', 'Mutant Mudds', 'Unravel', 'Mutant Mudds Super Challenge',
       'Pneuma: Breath of Life'],
      dtype='object', name='name')

# I made a class to run custom searches with multiple games
Here is an example of it being used.

In [20]:
from model_assets.CustomSearch import CustomSearch

In [21]:
testClass = CustomSearch(["uncharted 4", "tomb raider"])

In [22]:
testClass.SearchSimilarGames()

['Uncharted: The Nathan Drake Collection',
 'Hitman',
 'The Last of Us',
 'Metal Gear Solid V: The Phantom Pain',
 'Star Wars Battlefront',
 'Resident Evil: Revelations',
 "Tom Clancy's The Division",
 'Far Cry 4',
 'inFamous: First Light',
 'Bloodborne']

In [23]:
CustomSearch(["Fez", "N++"]).SearchSimilarGames()

['Celeste',
 'Badland',
 'Ninja Senki',
 'VVVVVV',
 'Red Goddess: Inner World',
 'Mutant Mudds Super Challenge',
 'Magician Lord',
 'Escape Goat 2',
 'Kero Blaster',
 'The Bridge']

### For comparison, here is the same search done without the custom class.

In [24]:
# get the combined feature vectors
combo = combine_vec(get_feature_vec("Fez"), get_feature_vec("N++"))

In [25]:
# find the cosine distances to all the games
dists = getCosineToVector(transform_vector(combo))

In [26]:
# put in a series to identify the games
# [2:12] skips the two closest entries, which should be the query games themselves
pd.Series(dists[0], index=distance_df.index).sort_values()[2:12]

name
Celeste                         0.666249
VVVVVV                          0.668538
Badland                         0.670113
Ninja Senki                     0.679771
Escape Goat 2                   0.694125
Red Goddess: Inner World        0.709668
Machinarium                     0.712792
Mutant Mudds Super Challenge    0.713111
Magician Lord                   0.722529
The Bridge                      0.723623
dtype: float64
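
The manual steps above can be folded into one function. The sketch below is one possible version on made-up data, not the code inside `CustomSearch` (which is not shown here). The `features` table, the `reducer` object, and the `recommend` function are all hypothetical, and the variance-threshold step is left out for brevity. It removes the query games by name rather than by slicing off the first rows.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_distances

# Made-up binary features for five hypothetical games.
features = pd.DataFrame(
    [[1, 1, 0, 0, 1, 0],
     [1, 0, 1, 0, 1, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1],
     [0, 1, 0, 1, 0, 1]],
    index=["Game A", "Game B", "Game C", "Game D", "Game E"],
    columns=["f1", "f2", "f3", "f4", "f5", "f6"],
)

reducer = TruncatedSVD(n_components=3, random_state=0)
reduced = reducer.fit_transform(features.to_numpy())

def recommend(seed_titles, num=2):
    """Rank games by cosine distance to the union of the seed games' features."""
    seed_rows = features.loc[seed_titles].to_numpy()
    query = (seed_rows.sum(axis=0) > 0).astype(int).reshape(1, -1)
    dists = cosine_distances(reducer.transform(query), reduced)[0]
    ranked = pd.Series(dists, index=features.index).sort_values()
    # Exclude the seeds by name instead of assuming they occupy the first positions.
    return ranked.drop(index=seed_titles).head(num)

result = recommend(["Game A", "Game B"])
print(result)
```

Here the union of Game A and Game B has exactly the features of Game C, so Game C comes out on top.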

## These exports are the requirements for the `CustomSearch` class

In [27]:
import pickle as pkl


# with open("./model_assets/features_df.csv", "w+") as f:
#     dummied_df.to_csv(f)

# with open("./model_assets/content.pkl", "wb+") as f:
#     pkl.dump(content, f)

# with open("./model_assets/svd.pkl", "wb+") as f:
#     pkl.dump(svd, f)

# with open("./model_assets/vt.pkl", "wb+") as f:
#     pkl.dump(vt, f)
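
How `CustomSearch` loads these files is not shown in this notebook. As a rough sketch of the pickle round trip it presumably relies on, the example below saves a fitted transformer and reads it back. It uses a temporary folder and toy objects (`toy_X`, `toy_svd`), not the real `./model_assets` files.

```python
import pickle as pkl
import tempfile
from pathlib import Path

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Stand-in objects; in the notebook these would be `svd` and its input data.
toy_X = np.random.default_rng(0).random((20, 8))
toy_svd = TruncatedSVD(n_components=3, random_state=0).fit(toy_X)

with tempfile.TemporaryDirectory() as tmp:  # temporary folder, not ./model_assets
    svd_path = Path(tmp) / "svd.pkl"
    with open(svd_path, "wb") as f:
        pkl.dump(toy_svd, f)
    with open(svd_path, "rb") as f:
        restored_svd = pkl.load(f)

# The restored transformer should project data exactly like the original.
round_trip_ok = np.allclose(restored_svd.transform(toy_X), toy_svd.transform(toy_X))
print(round_trip_ok)
```

Pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that saved them.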