## Objective
* Gather the data using the Giant Bomb API.
* Complete exploratory data analysis.
* Analyze recommendation methods.

## Background Information
* As the number of available products grows, consumers face an increasingly difficult choice about what to purchase. Recommender systems (engines) address this by recommending relevant products to consumers based on their preferences. Applications include playlist generators for video and music services such as Netflix, YouTube, and Spotify, as well as product recommendations for services such as Amazon. In this project, we'll explore techniques for recommending video games using the Giant Bomb video game database.

## Process:
* Preprocessing (NLP packages)
* Exploratory Data Analysis conducted using various Python packages (NumPy, Matplotlib, Pandas, and Plotly).
* Recommendation Methods.
    * TF-IDF
        * Cosine Similarity
        * Cosine Similarity + Singular Value Decomposition
        * K-Nearest Neighbors
        * K-Nearest Neighbors + Singular Value Decomposition
* PostgreSQL database.



## Table of Contents:
* Part I: Data Exploration
    * Gathering
    * Preprocessing
    * Exploration
* Part II: Recommendation Methods
    * TF-IDF
    * Cosine Similarity
    * KNN
    * SVD
        * Cosine Similarity
        * KNN
    * Results
* Part III: PostgreSQL database for application deployment.
    

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import pybomb
import re
import seaborn as sns
import time

from PIL import Image
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from wordcloud import WordCloud

params = {'text.usetex': False, 'mathtext.fontset': 'stixsans'}
plt.rcParams.update(params)

# Part I: Data Exploration

### Gathering the data.

Let's begin by fetching the identifiers (IDs) of PC video games and storing them in a CSV file.

In [None]:
"""Pulls video game ID's from giantbomb."""

# Store various video game features into a list
## Despite only needing the video game ID's, we'll take other features to compare.
name_list = []
image_url_list = []
id_list = []
original_game_rating_list = []
original_release_date_list = []
platforms_list = []


# Request video game contents from the giantbomb api
## Loop over offsets, requesting 100 games at a time up to 50000 (start at offset 0 so the first page is included)
for x in range(0, 50000, 100):
    ## API Key
    my_key = 'f3e0c1a5f79182d471034230dd277db19eb873ef'
    ## API fields
    games_client = pybomb.GamesClient(my_key)
    return_fields = ('id', 'name', 'image', 'platforms', 'original_release_date',  'original_game_rating')
    limit = 100
    offset = x
    sort_by = 'name'
    filter_by = {'platforms': pybomb.PC}
    
    ## Now we pull the games and store it in the response
    response = games_client.search(
        filter_by = filter_by,
        return_fields = return_fields,
        sort_by = sort_by,
        desc = False,
        limit = limit,
        offset = offset
    )
    ## Iterate through the response.results and store the features into lists
    for x in response.results:
        ### Features
        name_list.append(x['name'])
        image_url_list.append(x['image']['super_url'])
        id_list.append(x['id'])
        original_release_date_list.append(x['original_release_date'])
        platforms_list.append('PC')
        ### Append the original game rating if it exists; otherwise store 'None'
        if x['original_game_rating'] is None:
            original_game_rating_list.append('None')
        else:
            original_game_rating_list.append(x['original_game_rating'][0]['name'])

# Export id list as csv
data =  {'id': id_list}

tmp_df = pd.DataFrame(data)

tmp_df.to_csv('data/id_list.csv')

Next, we'll fetch the contents of the video games using the IDs and store them in our final CSV file for data analysis.
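Because the Giant Bomb API limits the number of requests per hour, the fetch loop that follows sleeps between requests and retries failed calls. The sketch below isolates that retry pattern. The helper name `fetch_with_retries` and its parameters are hypothetical (they are not part of pybomb), and `fetch` stands in for any callable such as `game_client.fetch`.

```python
import time


def fetch_with_retries(fetch, game_id, tries=10, delay=2.0):
    """Call fetch(game_id) up to `tries` times, pausing `delay` seconds between attempts.

    Returns the first successful result, or None if every attempt fails.
    `fetch` is any callable taking a game ID (a hypothetical stand-in for an API client method).
    """
    for attempt in range(tries):
        try:
            return fetch(game_id)
        except Exception:
            # Pause before the next attempt; no pause is needed after the final failure
            if attempt < tries - 1:
                time.sleep(delay)
    return None
```

Returning `None` on failure lets the caller append a placeholder for that game, which keeps every feature list the same length as the ID list.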

In [None]:
id_list = pd.read_csv('data/id_list.csv', index_col = 0)

In [None]:
id_list

In [None]:
"""Pulls video game contents using the video game id's collected previously."""
# Load in ID CSV
id_list = pd.read_csv('data/id_list.csv', index_col = 0)

# Store various video game features into a list
name_list = []
image_url_list = []
original_game_rating_list = []
original_release_date_list = []
platforms_list = []
developers_list = []
genres_list = []
themes_list = []
concepts_list = []
franchises_list = []



# Request video game contents from the giantbomb api
## Giantbomb api has a maximum number of requests per hour, we'll utilize the time commands to limit our requests per hour while being automated
counter = 0
for i in id_list['id']:
    counter += 1
    print(counter)
    ## If the counter is not a multiple of 1000
    if counter % 1000 != 0:
        ## Sleep for two seconds
        time.sleep(2)
        ## Take a maximum of 10 tries before returning an error.
        incomplete = True
        tries = 10
        while incomplete and tries > 0:
            try:
                ## API Key
                my_key = 'f3e0c1a5f79182d471034230dd277db19eb873ef'
                game_client = pybomb.GameClient(my_key)
                ## API fields
                game_id = i
                return_fields = ('id', 'name',
                                 'genres', 'themes',
                                 'franchises', 'developers',
                                 'platforms', 'original_release_date',
                                 'original_game_rating', 'image')
                ## Now we pull the games and store it in the response
                response = game_client.fetch(game_id)
            
                
                ## Store the response.results contents into their respective feature list
                ### Name
                if ('name' in response.results) and (response.results['name'] is not None):
                    name_list.append(response.results['name'])
                else:
                    name_list.append('None')

                ### Image URL
                if ('image' in response.results) and ('super_url' in response.results['image']) and (response.results['image']['super_url'] is not None):
                    image_url_list.append(response.results['image']['super_url'])
                else:
                    image_url_list.append('None')

                ### Original Release Date
                if ('original_release_date' in response.results) and (response.results['original_release_date'] is not None):
                    original_release_date_list.append(response.results['original_release_date'])
                else:
                    original_release_date_list.append('None')

                ### Original Game Rating 
                if ('original_game_rating' in response.results) and (response.results['original_game_rating'] is not None):
                    original_game_rating_list.append(response.results['original_game_rating'][0]['name'])
                else:
                    original_game_rating_list.append('None')

                ### Platform
                if ('platforms' in response.results) and (response.results['platforms'] is not None):
                    platforms_list.append(response.results['platforms'][0]['name'])
                else:
                    platforms_list.append('None')

                ### Developers
                if ('developers' in response.results) and (response.results['developers'] is not None):
                    developers_list.append(response.results['developers'][0]['name'])
                else:
                    developers_list.append('None')

                ### Genres
                tmp_list = []
                if ('genres' in response.results):
                    if (response.results['genres'] != None):
                        for x in response.results['genres']:
                            tmp_list.append(x['name'])
                        genres_list.append(tmp_list)
                    else:
                        genres_list.append('None')
                else:
                    genres_list.append('None')



                ### Concepts
                tmp_list = []
                if ('concepts' in response.results):
                    if (response.results['concepts'] != None):
                        for x in response.results['concepts']:
                            tmp_list.append(x['name'])
                        concepts_list.append(tmp_list)
                    else:
                        concepts_list.append('None')
                else:
                    concepts_list.append('None')


                ### Themes
                tmp_list = []
                if ('themes' in response.results):
                    if (response.results['themes'] != None):
                        for x in response.results['themes']:
                            tmp_list.append(x['name'])
                        themes_list.append(tmp_list)
                    else:
                        themes_list.append('None')
                else:
                    themes_list.append('None')



                ### Franchises
                tmp_list = []
                if 'franchises' in response.results:
                    if (response.results['franchises'] != None):
                        for x in response.results['franchises']:
                            tmp_list.append(x['name'])
                        franchises_list.append(tmp_list)
                    else:
                        franchises_list.append('None')

                else:
                    franchises_list.append('None')
                    
                incomplete = False
            except Exception:
                tries -= 1
                
        if incomplete:
            print('It is failing a lot')
        
        else:
            print('Success')

    else:
        # On every 1000th request, take a 10 minute break.
        time.sleep(600)
        
        
# Export video game dataframe as a csv
data =  {'id': id_list['id'],
         'name': name_list,
         'original_game_rating': original_game_rating_list, 
         'original_release_date_': original_release_date_list,
         'platform': platforms_list,
         'developer': developers_list,
         'genre': genres_list,
         'theme': themes_list,
         'concept': concepts_list,
         'franchise': franchises_list,
         'image_url': image_url_list}

tmp_df = pd.DataFrame(data)

tmp_df.to_csv('data/fetched_video_games.csv')


Let us begin by reading in the CSV file containing the data and examining its contents, such as the number of features and the number of samples. There are 11 columns (features) and 33403 rows (samples).


In [None]:
df = pd.read_csv('data/fetched_video_games.csv', index_col = 0)

### Preprocessing the data.

Preprocessing includes a few steps:

(1) Filter out entries that are missing a genre, theme, or concept value.

(2) Drop entries with anime or adult themes.

(3) Text processing:
    * Convert the fields from objects into strings.
    * Replace the placeholder `None` (used for fields with no value) with an empty string.
    * Strip the brackets and quotation marks.
    * Collapse double spaces between strings.
    * Replace commas and hyphens with spaces in entries with multiple descriptors.

(4) Create the feature fed into our algorithms: the concatenation of the game rating (age eligibility), developer, genre, theme, concept, franchise, and platform.

Those features were selected for the following reasons:

* If a game has a game rating (age rating), we do not want to recommend games that are unsuitable for the user.

* If a user enjoys a game, they may enjoy a game from the same developer and franchise.

* Genre, theme, and concept features all relate to the atmosphere and environment of the game.
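To make the text-processing step concrete, here is a minimal sketch applied to a single stringified list, the form in which multi-valued fields come back from the CSV. The function name `clean_field` is hypothetical, and the sketch is a simplification: it collapses all repeated whitespace with a regular expression, whereas the notebook cell applies its replacements column by column.

```python
import re


def clean_field(value):
    """Turn a stringified list such as "['Action', 'Role-Playing']" into plain words."""
    text = str(value)
    if text == 'None':
        return ''                              # placeholder for a missing value
    text = text.strip('[]')                    # drop the list brackets
    text = re.sub(r"[\"']", '', text)          # drop quotation marks
    text = text.replace(',', ' ')              # commas become separators
    text = text.replace('-', ' ')              # hyphens become separators
    return re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace


print(clean_field("['Action', 'Role-Playing']"))  # Action Role Playing
```

Splitting hyphenated descriptors into separate words means that "Role-Playing" contributes the tokens "Role" and "Playing" to the later TF-IDF vocabulary.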

In [None]:
def NONE(x):
    """Function to label a column as None if there are no contents."""
    if x == 'None':
        x = ''
        return x 
    else:
        return x

In [None]:
# Preprocessing

## Drop entries that don't have a genre, theme, and concept value.
df = df.loc[(df['genre'] != 'None') & (df['theme'] != 'None') & (df['concept'] != 'None')].copy()

## Drop entries that are related to anime and adult.
df = df[(~df['theme'].str.contains("Anime")) & (~df['theme'].str.contains("Adult"))]
df = df.reset_index(drop = True)

## Apply text processing across various columns.
column_list = ['original_game_rating', 'original_release_date_', 'platform',
               'developer', 'genre', 'theme', 'concept', 'franchise']
#for x in df.columns[2:]:
for x in column_list:
    df[x] = df[x].apply(str)
    
    
    ## Replace the 'None' placeholder with an empty string.
    df[x] = df[x].apply(NONE)
    
    
    ## Remove Brackets
    df[x] = df[x].apply(lambda i: i.strip('[]'))
    
    ## Strip quotation marks
    df[x] = df[x].apply(lambda i: i.strip("''"))
    df[x] = df[x].apply(lambda i: i.strip('""'))
    df[x] = df[x].apply(lambda i: re.sub('"', '', i))
    df[x] = df[x].apply(lambda i: re.sub("'", '', i))
        
    ## Remove Whitespace between multiple string values
    df[x] = df[x].apply(lambda i: i.replace('  ', ' '))  

    # Fix ','
    df[x] = df[x].apply(lambda i: i.replace("','", ' '))

    # Fix ","
    df[x] = df[x].apply(lambda i: i.replace('","', ' '))
    df[x] = df[x].apply(lambda i: i.replace(',', ' '))
    
    # Fix Genre Action Adventure
    
    # Fix '-'
    df[x] = df[x].apply(lambda i: i.replace('-', ' '))
    
    

# Total Content Description
df['total_contents'] = df['original_game_rating'] + " " + df['developer'] + " " + df['genre'] + " " + df['theme'] + " " + df['concept'] + " " + df['franchise'] + " " + df['platform']

### Feature exploration

Our dataframe has 37026 entries now.

Let's explore the developer, genre, theme, and concept features by examining their distributions using bar plots.

In [None]:
def Bar_Plot(df, x, y, title, x_title, y_title):
    """Function which returns a bar plot of a feature."""    
    # Plot
    bar = px.bar(df, x = x,
                 y = y, orientation = 'h',
                 color = y, color_discrete_sequence = px.colors.qualitative.Pastel
                )
    
    bar.update_xaxes(linewidth = 1, linecolor = 'black', 
                     gridcolor = 'LightPink',  
                     ticks = "outside", tickwidth = 2,
                     tickcolor = 'black', ticklen = 12,
                     title = x_title, title_font = dict(size = 22),
                    ) 
    bar.update_yaxes(linewidth = 1, linecolor = 'black', 
                     gridcolor = 'LightPink', ticks = "outside",
                     tickwidth = 2, tickcolor = 'black',
                     ticklen = 12, title = y_title,
                     title_font = dict(size = 22),
                    )
    
    
    bar.update_layout(
        title = title,
        title_font = dict(size = 26),
        font = dict(size = 14),
        legend = dict(
            x = 1,
            y = 1,
            traceorder = "normal",
            font = dict(
                family = "sans-serif",
                size = 18,
                color = "black"
            ),
            bgcolor = "#f7f7f7",
            bordercolor = "#f7f7f7",
            borderwidth = 1
        ),
        plot_bgcolor = "#f7f7f7", paper_bgcolor = "#f7f7f7",
        width = 1000, height = 600, 
        hoverlabel = dict(
            font_size = 24, 
            font_family = "Rockwell")
    )

    return bar

The majority of games do not have a developer listed.

In [None]:
# Distribution of developers.
Bar_Plot(df = df, x = df['developer'].value_counts().values[0:10],
         y = df['developer'].value_counts().index[0:10], title = 'Distribution of Developers',
         x_title = 'Count', y_title = 'Developer')

Most video games are in the adventure genre.

In [None]:
# Distribution of genres.
Bar_Plot(df = df, x = df['genre'].value_counts().values[0:10],
         y = df['genre'].value_counts().index[0:10], title = 'Distribution of Genres',
         x_title = 'Count', y_title = 'Genre')

The majority of games have a fantasy theme.

In [None]:
# Distribution of themes.
Bar_Plot(df = df, x = df['theme'].value_counts().values[0:10],
         y = df['theme'].value_counts().index[0:10], title = 'Distribution of Themes',
         x_title = 'Count', y_title = 'Theme')

The majority of games have single-word titles.

In [None]:
# Distribution of concepts.
Bar_Plot(df = df, x = df['concept'].value_counts().values[0:10],
         y = df['concept'].value_counts().index[0:10], title = 'Distribution of Concepts',
         x_title = 'Count', y_title = 'Concept')

Next, we'll explore those previous features with word clouds.

In [None]:
def plot_word_cloud(df):
    """Function to create and plot a worcloud"""

    ## Collect all strings 
    tmp_contents = ''
    for x in df:
        tmp_contents += x
    
    # The regex expression is used to eliminate all non english letters
    regex_expression = r"[a-zA-Z]+"
    
    # Word Cloud
    wc = WordCloud(width = 2500, height = 1000, max_words = 10000,
                      relative_scaling = 0, background_color = 'black', contour_color = "black",
                      regexp = regex_expression, random_state = 2, colormap = 'rainbow',
                      collocations = False,
             ).generate(tmp_contents)
    
    # Set figure size
    plt.figure(figsize = (20, 15))
    
    # Display image
    plt.imshow(wc) 
    
    # No axis details
    plt.axis("off");
    
    return plt.show()

In [None]:
# Wordcloud of total contents.
plot_word_cloud(df = df['total_contents'])

In [None]:
# Wordcloud of genres.
plot_word_cloud(df = df['genre'])

In [None]:
# Wordcloud of themes.
plot_word_cloud(df = df['theme'])

In [None]:
# Wordcloud of concepts.
plot_word_cloud(df = df['concept'])

# Part II: Recommendation Methods

#### TF-IDF Vectorization

TF-IDF is a numerical statistic that reflects how relevant a keyword is to a specific document; in other words, it surfaces the keywords by which a document can be identified or categorized [1]. The name combines two terms: Term Frequency and Inverse Document Frequency.

Term Frequency (TF) measures how often a term appears in a document, as presented in Equation (1).

$TF = \frac{\text{Occurrences of the term in the document}}{\text{Total words in the document}}$ (Equation 1)

Inverse Document Frequency (IDF) assigns lower weight to frequent words and greater weight to infrequent words, as presented in Equation (2).

$IDF = \log_e\left(\frac{\text{Total Documents}}{\text{Document Frequency}}\right)$ (Equation 2)

TF-IDF is the product of the term frequency and the inverse document frequency, as presented in Equation (3).

$\text{TF-IDF} = TF \times IDF$ (Equation 3)
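Equations (1) to (3) can be checked by hand on a toy corpus. The sketch below is illustrative only: the three documents and the helper names `tf`, `idf`, and `tf_idf` are made up for this example. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from this textbook form.

```python
import math

# Toy corpus: each document is a short bag of content words (illustrative data)
docs = [
    "action adventure fantasy",
    "action shooter",
    "puzzle fantasy fantasy",
]
tokenized = [d.split() for d in docs]


def tf(term, tokens):
    """Equation (1): occurrences of the term divided by total words in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Equation (2): natural log of total documents over documents containing the term."""
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / doc_freq)


def tf_idf(term, tokens, corpus):
    """Equation (3): product of TF and IDF."""
    return tf(term, tokens) * idf(term, corpus)


# "fantasy" appears twice in the third document and in 2 of the 3 documents overall
score = tf_idf("fantasy", tokenized[2], tokenized)
print(round(score, 4))  # 0.2703
```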

Next, we'll use the TF-IDF vectorizer from scikit-learn to transform our dataframe. The main parameters of interest are the maximum number of features and the stop word list.

* Max number of features - keeps only the top n terms, ordered by term frequency across the corpus (1250 in our case, to balance running time and recommendation fidelity).
* Stop words list - words excluded from the TF-IDF calculations (here, words that have nothing to do with video game content).
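The effect of the two parameters can be seen on a toy corpus. This is a sketch with made-up documents, not the notebook's data: "steam" plays the role of a storefront tag that says nothing about game content, and `max_features = 2` keeps only the two most frequent remaining terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus; "steam" appears everywhere but carries no content information
corpus = [
    "action adventure fantasy steam",
    "action shooter steam",
    "puzzle fantasy steam",
]

# Remove the stop word, then keep the 2 terms with the highest corpus-wide term frequency
vectorizer = TfidfVectorizer(stop_words = ["steam"], max_features = 2)
matrix = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # ['action', 'fantasy']
print(matrix.shape)                    # (3, 2)
```

After "steam" is removed, "action" and "fantasy" each occur twice while every other term occurs once, so they are the two terms retained.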

In [None]:
def tf_idf_vectorizer(df, max_features):
    "Function to return the td-idf matrix and parameters."
    # Stop words
    stop_words_list = ['000', '007', '07th', '09',
                       '10', '100', '101', '1047', 
                        '11', '12', '120', '13', '130cm',
                       '13am', '13th', '14', '141', '15',
                       '1500', '16', '17', '18', '180', '1939',
                       '1942', '1960s', '1980s', '1990s',
                       '1995', '1996', '1997', '1998', '1999',
                       '19th', '1c', '1soft', '1st', '20', '2000',
                       '2001','2002', '2003', '2004', '2005', '2006',
                       '2007', '2008', '2009', '2010', '2011', '2012',
                       '2013', '2014', '2015', '2016', '2017', '2018',
                       '2019', '2020', '2020venture',  '20th', '21',
                       '21st', '22', '221b', '227',  '22nd', '23rd', '24',
                       '258', '285',  '2darray',  '2nd', '2x2',
                       '2xl', '3000', '3000ad', '32x', '343',
                       '35', '358', '360', '369', '3a', '3d6',
                       '3ds', '3g', '3lv', '3rd', '3x3', '40',
                       '400', '44', '45', '46', '4a', '4bit', 
                       '4head', '4j', '4sdk', '4x', '50th',
                       '51', '5200', '562', '5656', '59',  '5d', '5pb',
                       '5th', '60', '6010', '64', '6e6e6e', '76', '777',
                       '777next', '7dfps', '7th', '800', '82', '8888888',
                       '88mm', '8floor', '8monkey', '8th', '935', '98',
                       '98demake', '9heads', '9th', 'a2a', 'a2z',
                       '0verflow', '10kbit', '10tacle', '10tons', '2049er', '20xx',
                       '22cans', '2awesome', '2bad', '2d', '2dengine', '2dogs', '2k',
                       '34bigthings', '3d', '3dclouds', '3division', '3do', '3drunkmen',
                       '3rdeye', '3vision', '3vr', '49ers', '49games', '4d', '4fufelz',
                       '4gency', '5bit', 'aaa', 'aaaaaaaaaaaaaaaaaaaaaaaaa', 'ab',
                       'e10', 'e3', 'e404', 'pax',
                       'achievements', 'com', 'comachievements', 'companynameintitle',
                       'crowdfunded', 'declarativetitle', 'digitaldistribution',
                       'e32005', 'e32007', 'e32008', 'e32009', 'e32010',
                       'e32011', 'e32012', 'e32013', 'e32014', 'e32015',
                       'e32016', 'e32017', 'e32018', 'e32019', 'e32020',
                       'easyanticheat', 'epicgamesstore', 'gametitlesthatarealsoquestions', 'gog',
                       'humblebundle', 'kickstarterfunded', 'licensedgame', 'onlive',
                       'paxeast2005', 'paxeast2007', 'paxeast2008', 'paxeast2009',
                       'paxeast2010', 'paxeast2011', 'paxeast2012', 'paxeast2013',
                       'paxeast2014', 'paxeast2015', 'paxeast2016', 'paxeast2017',
                       'paxeast2018', 'paxeast2019', 'paxeast2020',
                       'paxprime', 'paxprime2005', 'paxprime2007', 'paxprime2008',
                       'paxprime2009', 'paxprime2010', 'paxprime2011', 'paxprime2012', 'paxprime2013',
                       'paxprime2014', 'paxprime2015', 'paxprime2016', 'paxprime2017', 'paxprime2018',
                       'paxprime2019', 'paxprime2020', 'paxsouth2005', 'paxsouth2007', 'paxsouth2008',
                       'paxsouth2009', 'paxsouth2010', 'paxsouth2011', 'paxsouth2012', 'paxsouth2013',
                       'paxsouth2014', 'paxsouth2015', 'paxsouth2016', 'paxsouth2017', 'paxsouth2018',
                       'paxsouth2019', 'paxsouth2020',
                       'paxwest2005', 'paxwest2007', 'paxwest2008', 'paxwest2009',
                       'paxwest2010', 'paxwest2011', 'paxwest2012', 'paxwest2013',
                       'paxwest2014', 'paxwest2015', 'paxwest2016', 'paxwest2017',
                       'paxwest2018', 'paxwest2019', 'paxwest2020', 'playstation',
                       'playstationplus', 'playstationtrophies', 'realphotosoncoverart',
                       'secretachievements', 'smartdelivery', 'steam', 'steamapplearcade',
                       'steamcloud', 'steamgreenlight', 'steamremoteplaytogether', 'steamtradingcards',
                       'steamturnnotifications', 'threewordgametitlewithconjunctionorpreposition',
                       'trophies', 'valveindexsupport', 'xboxonexenhanced', 'xboxplayanywhere']
    
    tf = TfidfVectorizer(stop_words = stop_words_list, max_features = max_features)
    
    # Fit and Transform using TF-IDF Vectorizer
    tfidf_matrix = tf.fit_transform(df['total_contents'].values.astype('U'))
    
    # Build a dataframe of TF-IDF weights, one column per vocabulary term
    ## get_feature_names_out() requires scikit-learn >= 1.0 (older versions use get_feature_names())
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns = tf.get_feature_names_out())
    
    return df_tfidf, tf

Next, we plot the terms with the highest weights (the vectorizer's `idf_` values, labeled TF-IDF in the cells that follow) as a bar plot and a word cloud.

In [None]:
# TF-IDF distribution.
## Initialize TF-IDF vectors
tf_idf_predictions, tf = tf_idf_vectorizer(df = df, max_features = 1250)
tmp_features = tf.get_feature_names_out()

## Create dataframe
data =  {'Words': tmp_features,
         'TF-IDF': tf.idf_
        }

tmp_df = pd.DataFrame(data = data)

sorted_df = tmp_df.sort_values(by = 'TF-IDF', ascending = False)

In [None]:
# TF-IDF distribution.
## Bar Plot
Bar_Plot(df = sorted_df[0:20], x = 'TF-IDF',
         y = 'Words', title = 'Distribution of TF-IDF Words',
         x_title = 'TF-IDF', y_title = 'Words')

In [None]:
# Word Cloud
## Create word frequencies
tmp_dict = dict(zip(sorted_df['Words'].values, sorted_df['TF-IDF'].values))

## The regex expression is used to eliminate all non english letters
regex_expression = r"[a-zA-Z]+"

## Word Cloud
wc_freq = WordCloud(width = 2500, height = 1000, max_words = 10000,
                  relative_scaling = 0, background_color = 'black', contour_color = "black",
                  regexp = regex_expression, random_state = 2, colormap = 'rainbow',
                  collocations = False,
         ).generate_from_frequencies(tmp_dict)

## Set figure size
plt.figure(figsize = (20, 15))

## Display image
plt.imshow(wc_freq) 

## No axis details
plt.axis("off");

plt.show()

### Algorithm Analysis

Algorithms were selected based on their application to information retrieval, since our problem involves content descriptions rather than collaborative filtering over user data. Such methods identify similar pairs of points, so cosine similarity and k-nearest neighbors are suitable choices.

The algorithms will be critiqued based on:

(1) Effective running time (how long the algorithm takes to run). Running time is a crucial metric for application deployment, as the platform constrains how long the application can spend on calculations.

(2) Relevance - my opinion on whether the recommendations are viable. The recommendations should be consistently relevant to the games the user inputs. Six test cases are used to judge this.
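Running time can be measured with a small wrapper rather than repeating the timing code in every function. This is a minimal sketch: the helper name `timed` is hypothetical, and `time.perf_counter` is used here as an assumption because it is intended for measuring short durations (the recommendation functions in this notebook use `time.time`).

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a simple computation
result, elapsed = timed(sum, range(1_000_000))
print(f'Time Elapsed: {elapsed:.4f} seconds')
```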

#### Cosine Similarity

https://www.sciencedirect.com/topics/computer-science/cosine-similarity [2]

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

$sim(x, y) = \frac{x \cdot y}{\|x\| \|y\|}$
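The formula can be verified directly with NumPy on small vectors. This is a sketch with made-up vectors; the helper name `cosine_sim` is hypothetical, and the recommendation function in this notebook relies on scikit-learn's `cosine_similarity` instead.

```python
import numpy as np


def cosine_sim(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    x = np.asarray(x, dtype = float)
    y = np.asarray(y, dtype = float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


a = [1, 2, 0]
b = [2, 4, 0]   # same direction as a, twice the length
c = [0, 0, 3]   # orthogonal to a

print(cosine_sim(a, b))  # approximately 1.0: same direction regardless of length
print(cosine_sim(a, c))  # 0.0: no shared components
```

Because the measure ignores vector length, a game described by many tags and a game described by few tags can still score as highly similar when their tags overlap in the same proportions.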

In [None]:
def cs_game_recommendations(df, game_1, game_2, game_3, game_4, game_5):
    # Recording elapsed time
    start = time.time()
    
    # Input IDs and descriptions
    ## Skip empty slots (None); stop early if a title is not in the dataset
    input_ids = []
    game_text_list = []
    for x in game_1, game_2, game_3, game_4, game_5:
        if x is None:
            continue
        if not df['name'].isin([x]).any():
            return 'Game inputted is not in dataset'
        input_ids.append(df[df['name'] == x].index[0])
        game_text_list.append(df[df['name'] == x]['total_contents'].values)
    
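    # Concatenate the descriptions, separated by spaces so adjacent words do not merge
    game_text_strings = ' '.join(str(x[0]) for x in game_text_list)
    
    # Insert a new row with the concatenated string (DataFrame.append was removed in pandas 2.0)
    user_row = pd.DataFrame([{'name': 'User Input', 'total_contents': game_text_strings}])
    df = pd.concat([df, user_row], ignore_index = True)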
   
    # TF-IDF Vectorizer
    tf_idf_predictions, tf = tf_idf_vectorizer(df = df, max_features = 1250)

    # Cosine Similarity
    cosine_sim = cosine_similarity(tf_idf_predictions)

    ## Labeling the name and genre of the results in the cosine similarity 
    titles = df[['name', 'genre']]
    indices = pd.Series(df.index, index = df['name'])
    
    ## Creation of the cosine similarity list
    idx = indices['User Input']
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    sim_scores = sim_scores[1:16]
    game_indices = [i[0] for i in sim_scores if i[0] not in input_ids]
    
    # After, the game recommendations have completed, drop the entry used to concatenate the strings.
    df.drop(df.tail(1).index,inplace = True)
    
    end = time.time()
    
    print(f'Time Elapsed: {end - start} seconds ')
    
    return titles.iloc[game_indices][0:10]

##### Test Cases

The test cases give adequate results, but test case #2 has an issue: the number of Batman games is overwhelming.

In [None]:
# 1 Action Platformer and 1 Action Adventure 
test_case_1 = cs_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Batman: Arkham City', game_3 = None,
                     game_4 = None, game_5 = None)

test_case_1

In [None]:
# 4 Action Platformers and 1 Action Adventure 
test_case_2 = cs_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Castlevania', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Batman: Arkham Asylum')

test_case_2

In [None]:
# 5 Action Platformers 
test_case_3 = cs_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'A.R.E.S. Extinction Agenda EX', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Zack Zero')

test_case_3

In [None]:
# 5 MMORPGs
test_case_4 = cs_game_recommendations(df = df, game_1 = 'Albion Online',
                     game_2 = 'ArcheAge', game_3 = 'World of Warcraft',
                     game_4 = 'City of Heroes', game_5 = 'City of Villains')

test_case_4

In [None]:
# 4 Strategy games and 1 Action Adventure
test_case_5 = cs_game_recommendations(df = df, game_1 = '8-Bit Hordes',
                     game_2 = '8-Bit Invaders!', game_3 = '9th Company: Roots of Terror',
                     game_4 = 'A Game of Thrones: Genesis', game_5 = 'Batman: Arkham Asylum')

test_case_5

In [None]:
# 5 Shooters
test_case_6 = cs_game_recommendations(df = df, game_1 = '8bit Killer',
                     game_2 = 'Alien Swarm', game_3 = 'Doom VFR',
                     game_4 = 'Earth Defense Force 5', game_5 = 'Fortnite')

test_case_6

#### K-Nearest Neighbors

K-Nearest Neighbors (KNN) is one of the simplest lazy machine learning algorithms. Its objective is to classify objects into one of the predefined classes of a sample group created by machine learning. The algorithm has no separate training phase: the training data is stored and consulted only when a query is made. KNN is based on finding the most similar objects (documents) in the sample group according to a mutual distance metric [3].
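For recommendations, only the neighbor-search part of KNN is needed: given a query vector, return the closest stored vectors. The sketch below shows this with scikit-learn's `NearestNeighbors` on made-up 2-D points that stand in for TF-IDF rows; the data and the query are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D feature vectors standing in for TF-IDF rows (one row per game)
items = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
])

# "Fitting" only stores and indexes the points; no model parameters are learned
nn = NearestNeighbors(n_neighbors = 2, metric = 'minkowski')
nn.fit(items)

# Find the 2 stored points closest to the query
query = np.array([[0.9, 1.1]])
distances, indices = nn.kneighbors(query)

print(indices[0])  # [2 1]: row 2 is closest, then row 1
```

The returned indices are row positions in the fitted data, which is why the recommendation function maps them back to game names through the dataframe.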

In [None]:
def knn_game_recommendations(df, game_1, game_2, game_3, game_4, game_5):
    # Recording elapsed time
    start = time.time()
    
    # Dataframe
    df = df

    # Iterate through each game selected and append the game's description into a list  
    game_text_list = []
    for x in game_1, game_2, game_3, game_4, game_5:
        if (x != None) & (df['name'].isin([x]).any() == True):
                          game_text_list.append(((df[df['name'] == x]['total_contents'].values)))
        elif (x != None) & (df['name'].isin([x]).any() == False):
                            return( 'Game inputted is not in dataset')
    
    # Concatenate the descriptions, separated by spaces so adjacent words do not merge
    game_text_strings = ' '.join(str(x[0]) for x in game_text_list)
    
    # TF-IDF Vectorizer
    tf_idf_inputs, tf = tf_idf_vectorizer(df = df, max_features = 1250)
    
    # Nearest Neighbors
    nn = NearestNeighbors(n_neighbors = 15, algorithm='ball_tree', metric = 'minkowski')
    nn.fit(tf_idf_inputs)
    
    # Transforming the predictions
    tf_idf_predictions = tf.transform([str(game_text_strings)])
    results = nn.kneighbors(tf_idf_predictions.toarray())
    
    # Input IDS
    ## Checks for the datatype of the inputted games either None or the title of the game
    input_ids = []
    for x in game_1, game_2, game_3, game_4, game_5:
        if x != None:
            input_ids.append(df[df['name'] == x].index[0])
    
    # Recommended Game ID's
    ## Checks to see if any of the recommended titles are not the inputted games - do not get recommended games you selected
    tmp_ids = [x for x in results[1][0]]
    top_10_ids = []
    for x in tmp_ids:
        if x not in input_ids:
            top_10_ids.append(x)
    
    # The TOP 10 games selected
    ## Returns the titles of the recommended games
    top_10_games_list = []
    for x in top_10_ids:
        top_10_games_list.append(df[df.index == x]['name'].values[0])
    
    
    # Labeling the name and genre of the top 10_games
    titles = df[['name', 'genre']]
    indices = pd.Series(df.index, index = df['name'])

    end = time.time()
    print(f'Time Elapsed: {end - start} seconds ')
    
    return titles.iloc[top_10_ids][0:10]

##### Test Cases

The test cases return adequate results, but test case #2 is a problem: its recommendations are dominated by Batman games.

In [None]:
# 1 Action Platformer and 1 Action Adventure 
test_case_1 = knn_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Batman: Arkham City', game_3 = None,
                     game_4 = None, game_5 = None)

test_case_1

In [None]:
# 4 Action Platformers and 1 Action Adventure
test_case_2 = knn_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Castlevania', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Batman: Arkham Asylum')

test_case_2

In [None]:
# 5 Action Platformers 
test_case_3 = knn_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'A.R.E.S. Extinction Agenda EX', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Zack Zero')

test_case_3

In [None]:
# 5 MMORPGs
test_case_4 = knn_game_recommendations(df = df, game_1 = 'Albion Online',
                     game_2 = 'ArcheAge', game_3 = 'World of Warcraft',
                     game_4 = 'City of Heroes', game_5 = 'City of Villains')

test_case_4

In [None]:
# 4 Strategy games and 1 Action Adventure
test_case_5 = knn_game_recommendations(df = df, game_1 = '8-Bit Hordes',
                     game_2 = '8-Bit Invaders!', game_3 = '9th Company: Roots of Terror',
                     game_4 = 'A Game of Thrones: Genesis', game_5 = 'Batman: Arkham Asylum')

test_case_5

In [None]:
# 5 Shooters
test_case_6 = knn_game_recommendations(df = df, game_1 = '8bit Killer',
                     game_2 = 'Alien Swarm', game_3 = 'Doom VFR',
                     game_4 = 'Earth Defense Force 5', game_5 = 'Fortnite')

test_case_6

#### Singular Value Decomposition

Singular value decomposition (SVD) factorizes a rectangular matrix A of size n x p. In the tutorial cited here, the n rows represent genes and the p columns represent experimental conditions [4]; in this notebook, the rows are games and the columns are TF-IDF features.

The SVD theorem states:

$ A_{n \times p} = U_{n \times n} S_{n \times p} V^T_{p \times p} $ (Equation 1)

$ U^T U = I_{n \times n} $

$ V^T V = I_{p \times p} $
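The factorization and the two orthogonality conditions can be checked numerically. The sketch below uses `numpy.linalg.svd` on a small random matrix; the matrix `A` and its 5 x 3 shape are arbitrary stand-ins chosen for illustration, not data from this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
A = rng.normal(size=(n, p))  # hypothetical stand-in for a small n x p matrix

# full_matrices=True returns U (n x n) and V^T (p x p), matching Equation 1
U, singular_values, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of an n x p matrix S
S = np.zeros((n, p))
S[:p, :p] = np.diag(singular_values)

# Rebuild A from its three factors
reconstruction = U @ S @ Vt

print(np.allclose(reconstruction, A))    # A = U S V^T
print(np.allclose(U.T @ U, np.eye(n)))   # U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(p))) # V^T V = I
```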

SVD is useful as a dimensionality reduction technique: it reduces the number of features while retaining most of the variance.

Plotting the cumulative explained variance against the number of components shows that roughly 450 components are sufficient to capture about 85% of the variance.
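As an alternative to refitting the model for each candidate size, the component count for a given cutoff can also be read from a single fit. This is a sketch under stated assumptions: the matrix `X` is synthetic low-rank data rather than the project's TF-IDF matrix, and the names `threshold` and `n_needed` are illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic, roughly rank-5 data standing in for the TF-IDF matrix
rng = np.random.default_rng(0)
latent = rng.random((60, 5))
mixing = rng.random((5, 30))
X = latent @ mixing + 0.01 * rng.random((60, 30))

# Fit once with the largest number of components under consideration
svd = TruncatedSVD(n_components=20, algorithm="randomized", n_iter=5, random_state=7)
svd.fit(X)

# Running total of the explained variance ratio, component by component
cumulative = np.cumsum(svd.explained_variance_ratio_)

# First component count whose cumulative ratio reaches the cutoff (None if never reached)
threshold = 0.85
reached = cumulative >= threshold
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print(n_needed)
```

If the cutoff is never reached, `n_needed` is `None`, which signals that more components would have to be fitted.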

In [None]:
# Plotting the Cumulative Summation of the Explained Variance
## SVD
tf_idf_predictions, tf = tf_idf_vectorizer(df = df, max_features = 1250)

SVD_components_variance = []
Number_of_components = []
for x in range(0, 800, 50):
    if x == 0:
        # Recent scikit-learn versions require n_components >= 1; zero components explain no variance
        tsv_variance_components = 0.0
    else:
        tsv = TruncatedSVD(n_components = x, algorithm = 'randomized', n_iter = 5).fit(tf_idf_predictions)
        tsv_variance_components = tsv.explained_variance_ratio_.sum()
    Number_of_components.append(x)
    SVD_components_variance.append(tsv_variance_components)

## Plot Parameters
plt.figure(figsize = (20, 10))
plt.plot(Number_of_components, SVD_components_variance,  '-o')
plt.xlabel('Number of Components', fontsize = 24)
plt.xticks(fontsize = 18)
plt.yticks([x / 10.0 for x in range(0, 10, 1)], fontsize = 18)
plt.ylabel('Explained Variance (fraction)', fontsize = 24)
plt.title('SVD Variance', fontsize = 24)

## Annotate plot 
plt.text(200, SVD_components_variance[9] + 0.015,
         '85% cutoff', size = 20, color = 'red', weight = 'semibold')

plt.hlines(y = 0.85, color = 'red', linestyle = '-', xmin = 0.0, xmax = 450)
plt.vlines(x = 450, color = 'red', linestyle = '-', ymin = 0.0, ymax = 0.85)

plt.text(440, SVD_components_variance[9] + 0.015,
         str(round(SVD_components_variance[9], 3)), size = 20, color = 'blue', weight = 'semibold')

plt.text(455, 0.4,
         '450 components are sufficient', size = 20, color = 'red', weight = 'semibold')

plt.tight_layout()
plt.show()

#### Cosine Similarity + SVD

In [None]:
def cs_svd_game_recommendations(df, game_1, game_2, game_3, game_4, game_5):
    # Recording elapsed time
    start = time.time()
    
    # Validate each selected game, then record its row index and description
    input_ids = []
    game_text_list = []
    for x in game_1, game_2, game_3, game_4, game_5:
        if x is None:
            continue
        if not df['name'].isin([x]).any():
            return 'Game inputted is not in dataset'
        input_ids.append(df[df['name'] == x].index[0])
        game_text_list.append(df[df['name'] == x]['total_contents'].values)
    
    # Concatenate the strings
    game_text_strings = ''
    for x in game_text_list:
        game_text_strings += x 
    
    # Insert a new row with the concatenated string
    df = pd.concat([df, pd.DataFrame([{'name' : 'User Input', 'total_contents' : game_text_strings[0]}])],
                   ignore_index = True)
   
    # TF-IDF Vectorizer
    tf_idf_predictions, tf = tf_idf_vectorizer(df = df, max_features = 1250)
    
    # SVD
    tsv = TruncatedSVD(n_components = 450, algorithm = 'randomized', n_iter = 5, random_state = 7)

    svd_predictions = tsv.fit_transform(tf_idf_predictions)
    
    # Cosine Similarity
    cosine_sim = cosine_similarity(svd_predictions)
    
    ## Labeling the name and genre of the results in the cosine similarity 
    titles = df[['name', 'genre']]
    indices = pd.Series(df.index, index = df['name'])
    
    ## Creation of the cosine similarity list
    idx = indices['User Input']
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    sim_scores = sim_scores[1:16]
    game_indices = [i[0] for i in sim_scores if i[0] not in input_ids]
    
    # After the recommendations are computed, drop the entry used to hold the concatenated strings.
    df.drop(df.tail(1).index,inplace = True)
    
    end = time.time()
    
    print(f'Time Elapsed: {end - start} seconds ')
    
    return titles.iloc[game_indices][0:10]

##### Test Cases

Cosine similarity + SVD performed better on test case #2, but its running time will be an issue.

In [None]:
# 1 Action Platformer and 1 Action Adventure 
test_case_1 = cs_svd_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Batman: Arkham City', game_3 = None,
                     game_4 = None, game_5 = None)

test_case_1

In [None]:
# 4 Action Platformers and 1 Action Adventure
test_case_2 = cs_svd_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Castlevania', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Batman: Arkham Asylum')

test_case_2

In [None]:
# 5 Action Platformers 
test_case_3 = cs_svd_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'A.R.E.S. Extinction Agenda EX', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Zack Zero')

test_case_3

In [None]:
# 5 MMORPGs
test_case_4 = cs_svd_game_recommendations(df = df, game_1 = 'Albion Online',
                     game_2 = 'ArcheAge', game_3 = 'World of Warcraft',
                     game_4 = 'City of Heroes', game_5 = 'City of Villains')

test_case_4

In [None]:
# 4 Strategy games and 1 Action Adventure
test_case_5 = cs_svd_game_recommendations(df = df, game_1 = '8-Bit Hordes',
                     game_2 = '8-Bit Invaders!', game_3 = '9th Company: Roots of Terror',
                     game_4 = 'A Game of Thrones: Genesis', game_5 = 'Batman: Arkham Asylum')

test_case_5

In [None]:
# 5 Shooters
test_case_6 = cs_svd_game_recommendations(df = df, game_1 = '8bit Killer',
                     game_2 = 'Alien Swarm', game_3 = 'Doom VFR',
                     game_4 = 'Earth Defense Force 5', game_5 = 'Fortnite')

test_case_6

#### K-Nearest Neighbors + SVD

In [None]:
def knn_svd_game_recommendations(df, game_1, game_2, game_3, game_4, game_5):
    # Recording elapsed time
    start = time.time()
    
    # Iterate through each game selected and append the game's description into a list
    game_text_list = []
    for x in game_1, game_2, game_3, game_4, game_5:
        if x is None:
            continue
        if not df['name'].isin([x]).any():
            return 'Game inputted is not in dataset'
        game_text_list.append(df[df['name'] == x]['total_contents'].values)
    
    # Concatenate the strings
    game_text_strings = ''
    for x in game_text_list:
        game_text_strings += x 
    
    # TF-IDF Vectorizer
    tf_idf_inputs, tf = tf_idf_vectorizer(df = df, max_features = 1250)
    
    # SVD
    tsv = TruncatedSVD(n_components = 450, algorithm='randomized',n_iter=5, random_state = 7)
    svd_inputs = tsv.fit_transform(tf_idf_inputs)
    
    # Nearest Neighbors
    nn = NearestNeighbors(n_neighbors = 15, algorithm='ball_tree', metric = 'minkowski')
    nn.fit(svd_inputs)
    
    # Transforming the predictions
    tf_idf_predictions = tf.transform([str(game_text_strings)])
    
    svd_predictions = tsv.transform(tf_idf_predictions)
    results = nn.kneighbors(svd_predictions)
    
    # Input IDS
    ## Checks for the datatype of the inputted games either None or the title of the game
    input_ids = []
    for x in game_1, game_2, game_3, game_4, game_5:
        if x != None:
            input_ids.append(df[df['name'] == x].index[0])
    
    # Recommended Game ID's
    ## Checks to see if any of the recommended titles are not the inputted games - do not get recommended games you selected
    tmp_ids = [x for x in results[1][0]]
    top_10_ids = []
    for x in tmp_ids:
        if x not in input_ids:
            top_10_ids.append(x)
    
    # The TOP 10 games selected
    ## Returns the titles of the recommended games
    top_10_games_list = []
    for x in top_10_ids:
        top_10_games_list.append(df[df.index == x]['name'].values[0])
    
    
    # Labeling the name and genre of the top 10_games
    titles = df[['name', 'genre']]
    indices = pd.Series(df.index, index = df['name'])

    end = time.time()
    print(f'Time Elapsed: {end - start} seconds ')
    
    return titles.iloc[top_10_ids][0:10]

##### Test Cases

KNN + SVD performed better than plain KNN, but not as well as cosine similarity + SVD.

In [None]:
# 1 Action Platformer and 1 Action Adventure 
test_case_1 = knn_svd_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Batman: Arkham City', game_3 = None,
                     game_4 = None, game_5 = None)

test_case_1

In [None]:
# 4 Action Platformers and 1 Action Adventure
test_case_2 = knn_svd_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'Castlevania', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Batman: Arkham Asylum')

test_case_2

In [None]:
# 5 Action Platformers 
test_case_3 = knn_svd_game_recommendations(df = df, game_1 = '30XX',
                     game_2 = 'A.R.E.S. Extinction Agenda EX', game_3 = 'Fumiko!',
                     game_4 = '99 Levels To Hell', game_5 = 'Zack Zero')

test_case_3

In [None]:
# 5 MMORPGs
test_case_4 = knn_svd_game_recommendations(df = df, game_1 = 'Albion Online',
                     game_2 = 'ArcheAge', game_3 = 'World of Warcraft',
                     game_4 = 'City of Heroes', game_5 = 'City of Villains')

test_case_4

In [None]:
# 4 Strategy games and 1 Action Adventure
test_case_5 = knn_svd_game_recommendations(df = df, game_1 = '8-Bit Hordes',
                     game_2 = '8-Bit Invaders!', game_3 = '9th Company: Roots of Terror',
                     game_4 = 'A Game of Thrones: Genesis', game_5 = 'Batman: Arkham Asylum')

test_case_5

In [None]:
# 5 Shooters
test_case_6 = knn_svd_game_recommendations(df = df, game_1 = '8bit Killer',
                     game_2 = 'Alien Swarm', game_3 = 'Doom VFR',
                     game_4 = 'Earth Defense Force 5', game_5 = 'Fortnite')

test_case_6

#### Comparison of Algorithms

In terms of relevance, CS + SVD > KNN + SVD > CS = KNN. However, running time is an issue for the cosine similarity methods, so KNN will be used when deploying the application.
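One possible contributor to the cosine similarity running time is that `cosine_similarity(matrix)` scores every pair of rows, while `cs_svd_game_recommendations` only reads the row belonging to the user input. The sketch below, which is an untested suggestion rather than what the notebook does, compares the query row alone against the matrix. The random `features` array is a hypothetical stand-in for the SVD-reduced matrix, and no timing claim is made here.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
features = rng.random((200, 20))  # hypothetical stand-in for the SVD-reduced matrix
query = features[-1:]             # last row plays the role of the 'User Input' entry

# All pairs: a 200 x 200 matrix, of which only one row is needed
full = cosine_similarity(features)

# Query against every row: a 1 x 200 matrix holding the same scores as full[-1]
row_only = cosine_similarity(query, features)

# Rank by similarity, skipping the query itself at rank 0, and keep the top 10
top_ids = np.argsort(row_only[0])[::-1][1:11]
print(top_ids)
```

Both calls give the same scores for the query row, so the ranking of recommendations would be unchanged.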

In [None]:
values  =  [['Cosine Similarity',
           'Cosine Similarity + SVD',
           'K-Nearest Neighbors',
           'K-Nearest Neighbors + SVD',
          ], #1st col
 
          ['18.409',
           '26.427',
           '4.094',
           '12.552',

          ], #2nd col
          
          ['3rd',
           '1st',
            '3rd',
           '2nd',
          ], ]


fig  =  go.Figure(data = [go.Table(
    columnorder  =  [1, 2, 3],
    columnwidth  =  [150, 100, 100],
    header  =  dict(
        values  =  [['Method'],
                    ['Elapsed Time (seconds)'],
                    ['Relevance']],
        line_color = 'darkslategray',
        fill_color = '#90ee90',
        align = ['left','center'],
        font = dict(color = 'Light Gray', size = 24),
        height = 40
    ),
    cells = dict(
        values = values,
        line_color = 'darkslategray',
        fill = dict(color = ['#ffb6c1', 'White']),
        font_size =  16,
        height = 40)
)
                         ]
                 )
#width  =  4600, height  =  950,
fig.update_layout(width  =  900, height  =  500,
                  title  =  'Table 1: Algorithms for Game Recommendations',
                  title_font  =  {'size': 24})
fig.update_xaxes(automargin = True)

# PART III: PostgreSQL Database

Before deploying our application, our data needs to be stored in a safe and reliable database. 

(1) Store our CSV in a PostgreSQL database hosted on the Heroku platform.

(2) Reliably pull our data from (1) into a dataframe.
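The two steps can be rehearsed locally before pointing at the hosted database. The sketch below swaps in an in-memory SQLite connection as a stand-in for the Heroku PostgreSQL URL, and the two-row dataframe (including its genre labels) is hypothetical. It shows why the read step passes `index_col = 'index'`: `to_sql` stores the dataframe index as a column named `index` by default.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the games dataframe
games = pd.DataFrame({
    "name": ["30XX", "Castlevania"],
    "genre": ["Action Platformer", "Action Platformer"],
})

# In-memory SQLite database, used here only as a local stand-in for PostgreSQL
conn = sqlite3.connect(":memory:")

# Step (1): write the dataframe; the index is stored as a column named 'index'
games.to_sql("video_games", con=conn)

# Step (2): read it back, restoring that column as the dataframe index
restored = pd.read_sql("select * from video_games", con=conn, index_col="index")
conn.close()

print(restored)
```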

In [None]:
import psycopg2
from sqlalchemy import create_engine

In [None]:
# Removing data entries from dataframe for maximum performance (SQL database only supports up to 10k entries)
# Drop empty developer entries
df = df.drop(df.loc[df['developer'] == ''].index).reset_index(drop = True)

# Drop entries that have a franchise listed
df = df.drop(df.loc[df['franchise'] != ''].index).reset_index(drop = True)

In [None]:
# Write to SQL database
DATABASE_URL = 'Heroku URL'
engine = create_engine(DATABASE_URL)
df.to_sql('video_games', con=engine)

In [None]:
# Read in SQL
DATABASE_URL = 'Heroku URL'
conn = psycopg2.connect(DATABASE_URL, sslmode = 'require')
df = pd.read_sql('select * from video_games', con = conn, index_col = 'index') 

For more information on deploying the application, see the app_deployment folder.

# References

[1] S. Qaiser and R. Ali, "Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents", International Journal of Computer Applications, vol. 181, no. 1, pp. 25-29, 2018. Available: 10.5120/ijca2018917395.

[2] "Recommendation system Based On Cosine Similarity Algorithm", International Journal of Recent Trends in Engineering and Research, vol. 3, no. 9, pp. 6-10, 2017. Available: 10.23883/ijrter.2017.3423.iss9x.

[3] B. Trstenjak, S. Mikac and D. Donko, "KNN with TF-IDF based Framework for Text Categorization", Procedia Engineering, vol. 69, pp. 1356-1364, 2014. Available: 10.1016/j.proeng.2014.03.129.

[4] "Singular Value Decomposition", Iridl.ldeo.columbia.edu, 2020. [Online]. Available: http://iridl.ldeo.columbia.edu/dochelp/StatTutorial/SVD/index.html. [Accessed: 20- Jul- 2020].
