# Final project: Content Based Game Recomendation
The goal of my project is to recommend games given a user profile and content of game descriptions.

## Importing the data

Here game data is from a json file generated with the script 'igdb_data.py'. The keywords and genres information was obtained in a similiar way (ie. replacing "games" with "genres" in the script and running). Then the genres and keywords are converted from id to word form.

In [1]:
import pandas as pd

# Load to DF
data = pd.read_json('./data/game_data.txt')
df_main = data.loc[data['category']==0]

# Categories we want
df_features = df_main[['name', 'id', 'total_rating_count', 'genres', 'summary', 'keywords']]
df_features = df_features.set_index('id')

# Drop games with missing data
df_features = df_features.dropna()

# Save copy for later
df_info = df_features.copy()

df_features.shape

(17630, 5)

In [2]:
df_features.sample(5)

Unnamed: 0_level_0,name,total_rating_count,genres,summary,keywords
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
26945,Guts and Glory,3.0,"[10, 32]",Guts and Glory is a game about racing to the f...,"[129, 350, 426, 558, 652, 1783, 1800, 4157, 54..."
23651,Echo of Soul,4.0,[12],The Wrath of the Goddess expansion is now live...,"[296, 900]"
6630,Sword of Mana,12.0,[12],The Mana Tree Needs Defenders! \n \nThe locati...,"[226, 227, 296, 301, 558, 729, 760, 939, 1026,..."
3871,Dave Mirra Freestyle BMX 2,5.0,"[10, 14]","Ride around big open levels on your bmx bike, ...","[661, 662, 1166, 1503, 1800, 4227, 4330, 4345,..."
4185,Strike Force Bowling,2.0,[14],Strike Force Bowling is a video game of the sp...,"[299, 1166, 12866]"


### Convert ids

In [3]:
# Index keywords by id
keywords = pd.read_json('./data/keyword_data.txt')
keywords = keywords.set_index('id')

# Index genres by id
genres = pd.read_json('./data/genre_data.txt')
genres = genres.set_index('id')

# Convert id's to words
df_features['genres'] = df_features['genres'].apply(lambda x: [genres.loc[id_]['name'] for id_ in x])
df_features['keywords'] = df_features['keywords'].apply(lambda x: [keywords.loc[id_]['name'] for id_ in x])

df_features.sample(5)

Unnamed: 0_level_0,name,total_rating_count,genres,summary,keywords
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1405,Hawken,24.0,[Shooter],War is a Machine Pilot hulking death machines ...,"[modern warfare, helicopter, vehicular combat,..."
29723,Autumn Night 3D Shooter,0.0,"[Shooter, Indie]",A 3D shooter made in the spirit of the the 90s...,"[shooter, action-adventure]"
7716,Alone in the Dark: Illumination,1.0,"[Shooter, Adventure]",A darkness has fallen over the town of Lorwich...,"[action-adventure, horror, survival horror, un..."
18841,Gibbous - A Cthulhu Adventure,10.0,"[Point-and-click, Adventure, Indie]",Gibbous - A Cthulhu Adventure is a comedy poin...,"[cartoon, comedy, parody, adventure, classic, ..."
13167,JumpJet Rex,10.0,"[Platform, Racing, Adventure, Indie]","JumpJet Rex is a punishing, old school 2-D pla...","[dinosaurs, retro, gravity, character customiz..."


## Extracting Summary Information with TF-IDF
In this section I use TF-IDF to convert the game summaries into a list of most imporant words.

### Text Pre-Processing

In [4]:
import contractions
import regex as re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Remove special characters and normalize casing
df_features['summary'] = df_features['summary'].str.lower()
df_features['summary'] = df_features['summary'].str.replace("-", " ")
df_features['summary'] = df_features['summary'].apply(lambda x: re.sub(r'[^\w\d\s\']+', '', x))

# Expand contractions
df_features['summary'] = df_features['summary'].apply(lambda x: x.split())
df_features['summary'] = df_features['summary'].apply(lambda x: [contractions.fix(word) for word in x])

# Remove Stopwords
stop_words = set(stopwords.words('english'))
df_features['summary'] = df_features['summary'].apply(lambda x: [word for word in x if word not in stop_words])
df_features['summary'] = [' '.join(map(str, l)) for l in df_features['summary']]

# Tokenize
df_features['summary'] = df_features['summary'].apply(word_tokenize)

df_features.sample(5)

Unnamed: 0_level_0,name,total_rating_count,genres,summary,keywords
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
7637,Sniper Elite: Nazi Zombie Army,13.0,"[Shooter, Tactical]","[featuring, co, op, campaign, 1, 4, players, n...","[zombies, sniper, horror, steam, undead, steam..."
116417,Creature in the Well,13.0,"[Puzzle, Hack and slash/Beat 'em up, Adventure...","[creature, well, top, pinball, inspired, hack,...",[pinball]
32919,Pitfall Planet,0.0,"[Puzzle, Adventure, Indie]","[pitfall, planet, couch, co, op, puzzle, solvi...","[kid friendly, puzzle, steam achievements, ste..."
24297,Cake Mania: In the Mix,1.0,[Simulator],"[following, hectic, time, management, style, g...","[simulation, cooperative play, nintendo wi-fi ..."
35359,Vektor Wars,0.0,"[Shooter, Indie]","[first, person, cyber, shooter, tribute, class...","[first person shooter, robots, sci-fi, digital..."


### Use gensim to create TF-IDF model

In [5]:
from gensim import models, corpora, similarities

# Create dictionary and corpus
documents = df_features['summary']
mydict = corpora.Dictionary(documents)
corpus = [mydict.doc2bow(summary) for summary in documents]

# TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')
tfidf_corpus = tfidf[corpus]

### Get top 25 words for all games

In [6]:
import random

def getTopN(item_n, n):
    df = pd.DataFrame(tfidf_corpus[item_n])
    df = df.sort_values(by=[1], ascending=False)[:n]
    return [mydict[word] for word in df[0]]

# Example
rand = random.randint(0, len(corpus))
print(df_features.iloc[rand]['name'])
print(getTopN(rand, 10))

Robotrek
['rococo', 'hackers', 'inventor', "'lnvention", 'father', 'half', 'downstairs', 'welcome', 'panicked', 'someday']


In [7]:
# Get 25 most important words for each summary
topwords = [getTopN(x, 25) for x in range(len(corpus))]
df_features['summary_top'] = topwords
df_features.sample(5)

Unnamed: 0_level_0,name,total_rating_count,genres,summary,keywords,summary_top
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
144133,Gloam,10.0,[Arcade],"[gloam, player, embodies, small, candle, must,...","[reflexion, logic puzzle]","[gloam, candle, light, embodies, participants,..."
11283,Agar.io,26.0,"[Strategy, Indie]","[agario, massively, multiplayer, top, strategy...","[simple, online, growing, minimalist, free-to-...","[agario, cells, browser, cell, political, turk..."
2153,FIFA 13,136.0,"[Simulator, Sport]","[fifa, 13, captures, drama, unpredictability, ...","[soccer, football, sports, achievements, psp, ...","[fifa, 13, pitch, football, pausing, unpredict..."
26544,Tamagotchi Connection: Corner Shop 3,2.0,[Arcade],"[business, booming, tamatown, good, dozen, new...","[shopkeeping, colorful]","[violetchi, shops, memetchi, kuchipatchi, mame..."
22713,Gunmetal Arcadia Zero,1.0,"[Shooter, Platform, Adventure, Indie]","[unrest, brews, city, arcadia, monstrous, enem...","[cyberpunk, retro, side scroller, first person...","[arcadia, vireo, gunmetal, brews, crt, evokes,..."


## Using BoW to Find Similiar Games
Here the set of summary, keywords, and genre words are combined to make a BoW that can be used with cosine-similarity to find similar games.

### Create BoW

In [8]:
df_features['bow'] = df_features['summary_top'] + df_features['genres'] + df_features['keywords']
df_features['bow'].sample(5)

id
7595     [imagine, hope, blanket, pound, oblivion, orga...
54780    [subsurface, circular, thomas, volume, buried,...
23442    [caravan, elaborate, scoring, joy, dreamcast, ...
51268    [bonecraft, wranglers, porno, lubbock, hod, ha...
11642    [spiky, fly, jump, passages, geometry, flip, i...
Name: bow, dtype: object

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Get BoW vectorizer
bows = df_features['bow'].apply(lambda x: " ".join(x)) # Expects a string and not tokens
vectorizer_bow = CountVectorizer()
vector_bow = vectorizer_bow.fit_transform(bows)

In [10]:
def getSimiliar(bow, n):
# Returns top n similiar games given a BoW
    
    #Create BoW based on our model
    bow = vectorizer_bow.transform([" ".join(bow)])
    
    # Create cosine-similarity matrix
    cosine_compare = cosine_similarity(bow, vector_bow)
    
    # Get top n games
    topn = pd.DataFrame(cosine_compare[0]).sort_values(by=[0], ascending=False)[:n]
    df_topn = df_features.iloc[topn.index]
    
    return df_topn

In [11]:
import random

# Test it out!

rand = random.randint(0, len(bows))
print(df_features.iloc[rand]['name'])
bow = " ".join(df_features.iloc[rand]['bow'])
getSimiliar(bow, 5)

Code Lyoko


Unnamed: 0_level_0,name,total_rating_count,genres,summary,keywords,summary_top,bow
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Thief II: The Metal Age,95.0,"[Shooter, Simulator, Adventure]","[ultimate, thief, back, tread, softly, make, w...","[ghosts, stealth, sword, fantasy, thief, death...","[thief, softly, religions, tread, fanatical, r...","[thief, softly, religions, tread, fanatical, r..."
25921,Atelier Firis: The Alchemist and the Mysteriou...,12.0,"[Role-playing (RPG), Strategy, Adventure]","[second, entry, 'mysterious, ', saga, follows,...","[anime, fantasy, jrpg, role playing, roaming e...","[mistlud, firis, liane, ertona, 'mysterious, e...","[mistlud, firis, liane, ertona, 'mysterious, e..."
25947,ICARUS.1,1.0,[Indie],"[icarus1, abandoned, decades, crew, mia]","[puzzle solving, spaceship]","[icarus1, mia, decades, abandoned, crew]","[icarus1, mia, decades, abandoned, crew, Indie..."
25944,Doom & Destiny Advanced,1.0,"[Role-playing (RPG), Adventure, Indie]","[venture, nerdy, anti, heroes, times, face, cr...","[fantasy, comedy, role playing]","[nerdy, 100, fight, fetch, asynchronous, embod...","[nerdy, 100, fight, fetch, asynchronous, embod..."
25943,The Sandbox Evolution,0.0,"[Platform, Puzzle, Simulator, Indie]","[sandbox, evolution, unique, pixel, world, bui...","[action-adventure, simulation, pixel graphics,...","[evolution, sandbox, pixel, worlds, 10x, playe...","[evolution, sandbox, pixel, worlds, 10x, playe..."


## Making a User Profile
Now that we can find similiar games given a BoW we need to create a user profile based on games the user likes. For this project I'm going to simulate a user by grabbing some random games. Then I'll create a BoW based on those games. Instead of simply combining all the terms for all games I'll instead randomly select a few words from each. This will allow for new suggestions each time the program is run and combined elements from games that the user might have never thought of!

### Generate a User

In [12]:
import nltk

def generateUser(n):
# Returns a user with a list of n games they like    
    name = random.choice(nltk.corpus.names.words())
    randomlist = random.sample(range(0, df_features.shape[0]), n)
    
    print("Hello! My name is " + name + " and I like:")
    for game in randomlist:
        print(df_features.iloc[game]['name'])

    return randomlist

generateUser(5)

Hello! My name is Amalee and I like:
World War II GI
Hidden Expedition: Amazon
NightSky
35MM
Dick Wilde 2


[1728, 8147, 7849, 9487, 16969]

### Create a User Profile

In [20]:
import math

def getProfile(gamelist):
# Creates a randomized BoW based on game list
    
    # Collect terms for each category
    genres = []
    keywords = []
    summs = []
    
    for game in gamelist:
        genres += df_features.iloc[game]['genres']
        keywords += df_features.iloc[game]['keywords']
        summs += df_features.iloc[game]['summary_top']
    
    # Want a little of each category
    total = len(genres) + len(keywords) + len(summs)
    
    gn = math.ceil(len(genres)/len(gamelist))    
    kn = math.ceil(len(keywords)/len(gamelist))
    sn = math.ceil(len(summs)/len(gamelist))
    
    genre_pick = random.choices(genres, k = gn)
    keyword_pick = random.choices(keywords, k = kn)
    summ_pick = random.choices(summs, k = sn)
    
    bow = genre_pick + keyword_pick + summ_pick
    
    return bow


print(getProfile(generateUser(5)))

Hello! My name is Geoffry and I like:
Othello
Major League Baseball 2K7
Tower 3D
Pale Moon Crisis
Hay Day
['Quiz/Trivia', 'Simulator', 'oakland athletics', 'voice chat', 'tree', 'houston astros', 'farming', 'oakland athletics', 'horror', 'cover athlete', 'goat', 'sheep', 'steam achievements', 'commentary', 'day/night cycle', 'major league baseball', 'farm simulator', 'mud', 'bucket', 'collectibles', 'difficulty level', 'xbox live', 'games that ask you to "press start" but will accept other buttons', 'horror', 'reversi', 'controls', 'selling', 'sports', 'ios', 'controls', 'users', 'jeanne', 'lovingly', 'tower', 'handcrafted', 'handcrafted', 'version', 'handcrafted']


## Putting It All Together
Now that we can generate a list of games given a BoW, and a way to create a BoW given a users prefernces its time to put it all together!

In [14]:
# Make some users
user1 = generateUser(5)
print()
user2 = generateUser(5)

Hello! My name is Leoline and I like:
Invasion From Beyond
Super Chibi Knight
OGame
Wing Commander: Prophecy
Persona 4 Arena

Hello! My name is Ulises and I like:
Dark Shadows - Army of Evil
Elebits: The Adventure of Kai and Zero
The Stillness of the Wind
Dungeon Hunter
Beavis and Butt-head


In [15]:
def getRecommendations(games):
    bow = getProfile(games)
    similiar = getSimiliar(bow, 5)
    
    print("I recommend these games: ")
    
    for game in similiar['name']:
        print(game)
        
getRecommendations(user1)
print()
getRecommendations(user2)

I recommend these games: 
Invasion From Beyond
Wing Commander: Prophecy
Red Wake Carnage
Persona 4 Arena
Days of War

I recommend these games: 
Elebits: The Adventure of Kai and Zero
Captain Tsubasa: Ōgon Sedai no Chōsen
Fossil Fighters: Champions
Hasbro Family Game Night 3
Phase 10


In [16]:
# Can run again for different results!
getRecommendations(user1)
print()
getRecommendations(user2)

I recommend these games: 
Space Rangers
Space Rangers 2: Dominators
OGame
Space Rangers HD: A War Apart
Shop Heroes

I recommend these games: 
Yi and the Thousand Moons
Shalnor Legends: Sacred Lands
Elebits: The Adventure of Kai and Zero
The Wizards
Heroes of Loot 2


## Analysis

### Purpose
The purpose of my project was to create a game recommendation engine. Rather than using the ever so popular collabrative recommender system I instead try a content based approach. The goal is to be able to take a list of games the user enjoys and recommend games that have similiar elements to those given. The program chooses random elements from each game instead of finding an overall theme which provides some unique games.

## Functionality
My program simply takes a list of games (albeit they must be in the database of games) and returns five recommended games. For the purpose of this project I create sample user profiles by selecting random games in the database. How well does my program work? Since recomendations are subjective it's hard to say but from the results I've seen I say not very well. For one, the random selection of keywords is a neat idea but can cause some really weird combination and then the recomendations become poor (for example we could get a BoW like "pinball horror survival cute"). In the future I'd keep the stochastic nature of the recomendations but try to maintain themes identified in the user profile. I'd also prevent game recommendations of games in the user profile and sequals/spin offs for those games which appeared a lot.


## Challenges
When I orginally attempted this project I started out by creating a TF-IDF matrix for all game summaries and useing cosine similarity and ran into two issues:

The first was that the data set was huge and so naturally there were a lot of features for TF-IDF. Computing the cosine similiarity for all of that was difficult! To overcome this I used gensims LSI model to reduce the feature set. That allowed me to find similiar games based on summaries, which worked well.

The second problem was it wasn't good for recommendations. To overcome this I adapted to the BoW approach seen here. This allows for generes, a very important feature of game suggestion, to appear. I also included user generated keywords as they are good descripters of games. Combining all three led to better results but there is definitally room for improvement!