# Introduction

The purpose of this notebook is to walk through how you can quickly build a recommendation engine from a Steam dataset of video game ratings. I won't add [a big introduction into what recommendation engines are](https://datacamp.pxf.io/ORz0Or). I assume you know what they are and why you want to build one. 

This notebook is set up to deliver a Gradio interface that takes a manually entered game title and recommends a list of similar games (10, for example) based on that input.

Dataset: [Steam Games on Kaggle](https://www.kaggle.com/nikdavis/steam-store-games?select=steam.csv)

Model: Cosine Similarity

Sources & References

* Weighted Rating methodology from [Anime Recommendation System](https://www.kaggle.com/seyi92coding/anime-recommendation-system#Weighted-Rating).
* Recommender Engine adapted from [geekculture on medium](https://medium.com/geekculture/creating-content-based-movie-recommender-with-python-7f7d1b739c63)
* Analysis based on Nik's [Steam Data Exploration](https://nik-davis.github.io/posts/2019/steam-data-exploration/)






# Import Modules

In [1]:
import pandas as pd
import numpy as np
import re
import itertools
import matplotlib.pyplot as plt

# Upload your Dataset

We will be using the Steam dataset from Kaggle, which contains important information such as the number of ratings each game received, which platforms the games are on and their release dates.

The dataset includes a lot of features we won't use in this notebook but the most important ones are:

* appid - Unique Identifier of each item
* name - User-friendly Identifier
* release_date
* platforms
* genres
* steamspy_tags
* positive_ratings / negative_ratings (used later to derive total_ratings)


In [2]:
filename = "/content/steam-clean-games.csv"

#Skip malformed rows; pandas versions before 1.3 used error_bad_lines=False for this
df = pd.read_csv(filename, on_bad_lines='skip', encoding='utf-8')

#These are just quick checks to make sure the dataset looks correct
print(df.shape)
df.head()

(27075, 18)


Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


# Cleaning and Missing Values

Normally this section would be much bigger, depending on how much noise is in the data, but this data has already been cleaned.
The aim here is to ensure the data is free of null values and has the correct columns, making it suitable for further manipulation.

This includes:     
* Reviewing column names
* Checking for null values



In [3]:
df.columns

Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'platforms', 'required_age', 'categories', 'genres', 'steamspy_tags',
       'achievements', 'positive_ratings', 'negative_ratings',
       'average_playtime', 'median_playtime', 'owners', 'price'],
      dtype='object')

In [4]:
#Which columns have null values?
print(df.columns[df.isna().any()].tolist())

#How many null values per column? - Count the missing values in each column
df.isnull().sum()

[]


appid               0
name                0
release_date        0
english             0
developer           0
publisher           0
platforms           0
required_age        0
categories          0
genres              0
steamspy_tags       0
achievements        0
positive_ratings    0
negative_ratings    0
average_playtime    0
median_playtime     0
owners              0
price               0
dtype: int64

# Engineering New Data

There's new data we would like to add to our dataset, which will hopefully improve our recommendations.

You can enhance your datasets by creating new categories and values based on the features within the dataset. This is how you can get more value out of your data and tailor it to the objective at hand.

Firstly, we need a column for the release year. Then we need some sort of scoring system, which we can derive from the ratings columns.

The equation we're building here is based on the Weighted Rating formula from the [Anime Recommendation Kaggle tutorial](https://www.kaggle.com/seyi92coding/anime-recommendation-system#Weighted-Rating); you might have also seen it used [in a DataCamp tutorial](https://datacamp.pxf.io/ORz0Or). 
This is why we calculate the mean of the score column and the 90th percentile of the total_ratings column.

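I would say the weighted_score column is more reliable than the score column, since it takes into account how all the other games in the dataset are rated: a game with only a few ratings is pulled towards the average score, while a game with many ratings keeps a value close to its own score.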
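To make the weighting concrete, here is a minimal sketch of the same formula applied to plain numbers. The values of `m` and `C` below are illustrative assumptions, not the ones computed from the dataset, and the function is a standalone stand-in rather than the notebook's own implementation.

```python
def weighted_rating(v, R, m, C):
    """Blend a game's own score R with the dataset-wide mean score C.

    v: number of ratings the game has received
    m: number of ratings needed to count as well established (assumed value)
    C: mean score across all games (assumed value)
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only; the notebook derives m and C from the data.
m, C = 900, 0.70

many_votes = weighted_rating(v=100_000, R=0.97, m=m, C=C)
few_votes = weighted_rating(v=10, R=0.97, m=m, C=C)

print(round(many_votes, 2))  # stays close to the game's own score
print(round(few_votes, 2))   # pulled towards the dataset mean
```

Both games have the same raw score, but the one with only 10 ratings ends up near the assumed mean, which is the behaviour that makes the weighted score harder to inflate with a handful of reviews.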

And finally, the genres column. 
The genres column will hold the data we use to compare games based on keywords; in the code below it is rebuilt from the steamspy_tags column. First we have to format the text correctly so it can be interpreted by the [TF-IDF vectorizer](https://datacamp.pxf.io/5bKEy3) further down.



In [5]:
# the function to extract years
def extract_year(date):
   year = date[:4]
   # some rows may not start with a four-digit year in release_date, so we handle that case as well.
   if year.isnumeric():
      return int(year)
   else:
      return np.nan

df['year'] = df['release_date'].apply(extract_year)
df.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price,year
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19,2000
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99,1999
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99,2003
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99,2001
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99,1999


In [6]:
#Create score column
def create_score(row):
  pos_count = row['positive_ratings']  
  neg_count = row['negative_ratings']
  total_count = pos_count + neg_count
  average = pos_count / total_count
  return round(average, 2)

def total_ratings(row):
  pos_count = row['positive_ratings']  
  neg_count = row['negative_ratings']
  total_count = pos_count + neg_count
  return total_count

df['total_ratings'] = df.apply(total_ratings, axis=1)
df['score'] = df.apply(create_score, axis=1)


In [7]:
# Calculate the mean of the score column, C
C = df['score'].mean()

# Calculate the minimum number of votes required to be in the chart, m
m = df['total_ratings'].quantile(0.90)
print(m)

908.6000000000022


In [8]:
# calculate the weighted rating for each qualified game
# Function that computes the weighted rating of each game
def weighted_rating(x, m=m, C=C):
    v = x['total_ratings']
    R = x['score']
    # Calculation based on the IMDB formula
    return round((v/(v+m) * R) + (m/(m+v) * C), 2)

# Define a new feature 'weighted_score' and calculate its value with `weighted_rating()`
df['weighted_score'] = df.apply(weighted_rating, axis=1)

#Show the first 15 rows (in dataset order, not sorted by score)
df[['name', 'total_ratings', 'score', 'weighted_score']].head(15)

Unnamed: 0,name,total_ratings,score,weighted_score
0,Counter-Strike,127873,0.97,0.97
1,Team Fortress Classic,3951,0.84,0.82
2,Day of Defeat,3814,0.9,0.86
3,Deathmatch Classic,1540,0.83,0.79
4,Half-Life: Opposing Force,5538,0.95,0.92
5,Ricochet,3442,0.8,0.78
6,Half-Life,28855,0.96,0.95
7,Counter-Strike: Condition Zero,13559,0.89,0.88
8,Half-Life: Blue Shift,4242,0.9,0.87
9,Half-Life 2,70321,0.97,0.97


In [9]:
#Tags made of multiple words are joined with '-' first, because the tags themselves will be separated by ' '
df['steamspy_tags'] = df['steamspy_tags'].str.replace(' ','-')
#Replace the ';' separators with spaces so each row reads like a short document for TF-IDF
df['genres'] = df['steamspy_tags'].str.replace(';',' ')
#Count the number of occurrences of each genre in the dataset
counts = dict()
for i in df.index:
    #Split each row of the genres column by ' ' to get the individual tags
    for g in df.loc[i,'genres'].split(' '):
        if g not in counts:
            #First time we see this genre: start its dictionary entry at 1
            counts[g] = 1
        else:
            #Otherwise increase its dictionary entry by 1
            counts[g] = counts[g] + 1

#Test Genre Counts
counts.keys()
print(counts['Action'])

10322
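As an aside, the same tally can be written more compactly with `collections.Counter` from the standard library. This is only a sketch: `genre_rows` is a hypothetical stand-in for the values of `df['genres']` after the replacements above.

```python
from collections import Counter

# Hypothetical rows standing in for df['genres'] after the replacements above.
genre_rows = [
    "Action FPS Multiplayer",
    "FPS World-War-II Multiplayer",
    "FPS Action Sci-fi",
]

# Split each row on spaces and count every tag across all rows.
counts = Counter(tag for row in genre_rows for tag in row.split(" "))

print(counts["Action"])       # how many rows carry the Action tag
print(counts.most_common(2))  # the two most frequent tags with their counts
```

`Counter` also returns 0 for a tag it has never seen, so a lookup for a missing genre does not raise a `KeyError` the way a plain dictionary would.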


# Import the Recommender Modules

These are the modules we need to build our recommendation engine.

[TfidfVectorizer](https://datacamp.pxf.io/9W2Dn0) will turn each game's tag text into a vector of TF-IDF weights, with one weight per unique word. `linear_kernel` will be used to compute the cosine similarity between every pair of games. And fuzzywuzzy is needed to match the user input to similar game titles.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
!pip install fuzzywuzzy
#https://pypi.org/project/fuzzywuzzy/
from fuzzywuzzy import fuzz

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0




# Transform word vectors and calculate cosine similarity

We need to convert the data into a format that can be interpreted by our similarity and vectorizer modules.

Depending on how much text data you have, this may take some time. Here we apply the vectorizer with the stop_words parameter, which is optional but recommended in order to remove unnecessary noise from your text data. TF-IDF is about identifying the most important words within your text data.

Then we can apply cosine similarity, which is the most important part: we compare how similar each pair of games is based on their word vectors.
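For intuition, here is a small pure-Python sketch of cosine similarity on toy vectors. The three-tag vectors are hypothetical. It also illustrates why a plain dot product (which is what `linear_kernel` computes) is enough here: `TfidfVectorizer` scales each row to unit length by default (`norm='l2'`), and for unit-length vectors the dot product equals the cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy weights over three hypothetical tags: action, fps, strategy
game_a = [0.5, 0.8, 0.0]
game_b = [0.4, 0.9, 0.0]
game_c = [0.0, 0.0, 1.0]

# For unit-length vectors the plain dot product already is the cosine similarity.
unit_a, unit_b = l2_normalize(game_a), l2_normalize(game_b)
dot_of_units = sum(x * y for x, y in zip(unit_a, unit_b))

print(cosine_similarity(game_a, game_b))  # close to 1: the tags overlap heavily
print(dot_of_units)                       # same value, up to rounding
print(cosine_similarity(game_a, game_c))  # 0.0: no tags in common
```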

In [11]:
# create an object for TfidfVectorizer
tfidf_vector = TfidfVectorizer(stop_words='english')
# apply the object to the genres column
# convert the list of documents (rows of genre tags) into a matrix 
tfidf_matrix = tfidf_vector.fit_transform(df['genres'])

In [12]:
tfidf_matrix.shape
#The tfidf_matrix is the matrix with 27075 rows (games) and 370 columns (unique words found in the tags)

(27075, 370)

In [13]:
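#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
print(list(enumerate(tfidf_vector.get_feature_names_out().tolist())))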

[(0, '1980s'), (1, '1990'), (2, '2d'), (3, '360'), (4, '3d'), (5, '40k'), (6, '4x'), (7, '5d'), (8, '6dof'), (9, 'abstract'), (10, 'access'), (11, 'action'), (12, 'adventure'), (13, 'agriculture'), (14, 'aliens'), (15, 'alternate'), (16, 'america'), (17, 'animation'), (18, 'anime'), (19, 'apocalyptic'), (20, 'arcade'), (21, 'arena'), (22, 'arts'), (23, 'assassin'), (24, 'atmospheric'), (25, 'attack'), (26, 'audio'), (27, 'awkward'), (28, 'base'), (29, 'baseball'), (30, 'based'), (31, 'basketball'), (32, 'batman'), (33, 'battle'), (34, 'beat'), (35, 'beautiful'), (36, 'benchmark'), (37, 'bikes'), (38, 'blood'), (39, 'bmx'), (40, 'board'), (41, 'book'), (42, 'bowling'), (43, 'builder'), (44, 'building'), (45, 'bullet'), (46, 'capitalism'), (47, 'card'), (48, 'cartoon'), (49, 'cartoony'), (50, 'casual'), (51, 'cats'), (52, 'character'), (53, 'chess'), (54, 'choices'), (55, 'choose'), (56, 'cinematic'), (57, 'city'), (58, 'class'), (59, 'classic'), (60, 'click'), (61, 'clicker'), (62, 'col
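The feature names above include fragments such as 'access', 'apocalyptic' and 'based', which suggests that multi-word tags are still being split even though they were joined with '-'. The default `token_pattern` of scikit-learn's text vectorizers is `(?u)\b\w\w+\b`, which treats a hyphen as a separator. The sketch below reproduces that behaviour with the standard `re` module; the alternative pattern is an assumption about one way to keep tags whole, not something this notebook uses.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers:
# a token is a run of two or more word characters, so '-' acts as a separator.
default_pattern = re.compile(r"(?u)\b\w\w+\b")

# A possible alternative (an assumption, not what the notebook uses):
# treat every run of non-space characters as a single token.
whole_tag_pattern = re.compile(r"(?u)[^ ]+")

# The vectorizer lowercases text by default, so we do the same here.
row = "Early-Access Post-apocalyptic FPS".lower()

print(default_pattern.findall(row))    # hyphenated tags are split into parts
print(whole_tag_pattern.findall(row))  # each tag is kept as one token
```

Passing something like `token_pattern=r"[^ ]+"` to `TfidfVectorizer` would keep each tag as one feature, but it would also change the number of columns, so the outputs shown in this notebook would no longer match.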



In [14]:
# create the cosine similarity matrix
sim_matrix = linear_kernel(tfidf_matrix,tfidf_matrix) 
print(sim_matrix)

[[1.         1.         0.53785954 ... 0.16328518 0.         0.        ]
 [1.         1.         0.53785954 ... 0.16328518 0.         0.        ]
 [0.53785954 0.53785954 1.         ... 0.         0.         0.        ]
 ...
 [0.16328518 0.16328518 0.         ... 1.         0.61573349 0.61573349]
 [0.         0.         0.         ... 0.61573349 1.         1.        ]
 [0.         0.         0.         ... 0.61573349 1.         1.        ]]


In [15]:
# create a function to score how closely two titles match
def matching_score(a,b):
   #fuzz.ratio(a,b) returns a similarity score from 0 to 100 based on the Levenshtein distance between a and b
   #if a and b are exactly the same, the score is 100
   return fuzz.ratio(a,b)
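To get a feel for what a 0 to 100 similarity score looks like without installing anything, here is a rough stand-in built on `difflib.SequenceMatcher` from the standard library. It is an approximation for illustration only: `similarity_score` is a hypothetical helper, and its scores may differ from what `fuzz.ratio` returns.

```python
from difflib import SequenceMatcher

def similarity_score(a, b):
    """Return a 0-100 similarity score; 100 means the strings are identical."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(similarity_score("Half-Life", "Half-Life"))    # identical strings
print(similarity_score("Half-Life", "Half-Life 2"))  # near match
print(similarity_score("Half-Life", "Ricochet"))     # unrelated titles
print(similarity_score("half-life", "Half-Life"))    # comparison is case-sensitive
```

Note that this stand-in is case-sensitive, so lower-casing both strings before comparing is worth considering when the input comes from a user.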

# Making our Recommendation Engine User-friendly

These functions fine-tune our inputs so we can return recommendations that the end-user wants.

The Recommendation Engine is made up of multiple functions which I will work through one by one. The complexity is dependent on the number of outputs you want, how sophisticated you want your results to be and what parameters you want. This includes:

* Functions to return relevant values for each game title
* Function to find closest game title 
* Function to return top 10 games


In [16]:
##These functions are needed to return different attributes of the recommended game titles

#Convert index to title_year
def get_title_year_from_index(index):
   return df[df.index == index]['year'].values[0]
#Convert index to title
def get_title_from_index(index):
   return df[df.index == index]['name'].values[0]
#Convert title to index
def get_index_from_title(title):
   return df[df.name == title].index.values[0]
#Convert index to score
def get_score_from_index(index):
   return df[df.index == index]['score'].values[0]
#Convert index to weighted score
def get_weighted_score_from_index(index):
   return df[df.index == index]['weighted_score'].values[0]
#Convert index to total_ratings
def get_total_ratings_from_index(index):
   return df[df.index == index]['total_ratings'].values[0]
#Convert index to platform
def get_platform_from_index(index):
  return df[df.index == index]['platforms'].values[0]
   
# A function to return the title most similar to the words a user types
# Without this, the recommender only works when a user enters the exact title which the data has.
def find_closest_title(title):
   #matching_score(a,b): a is the name in the current row, b is the title we're trying to match
   leven_scores = list(enumerate(df['name'].apply(matching_score, b=title))) #[(0, 30), (1,95), (2, 19)~~] A list of (index, score) tuples
   sorted_leven_scores = sorted(leven_scores, key=lambda x: x[1], reverse=True) #Sorts the tuples by score, highest first [(1, 95), (3, 49), (0, 30)~~]
   closest_title = get_title_from_index(sorted_leven_scores[0][0])
   distance_score = sorted_leven_scores[0][1]
   return closest_title, distance_score
   # Bejeweled Twist, 100

In [17]:
#find_closest_title returns only one title but I want a dropdown of the 10 closest game titles
def closest_names(title):
   leven_scores = list(enumerate(df['name'].apply(matching_score, b=title)))
   sorted_leven_scores = sorted(leven_scores, key=lambda x: x[1], reverse=True)
   top_closest_names = [get_title_from_index(i[0]) for i in sorted_leven_scores[:10]]  #[['Team Fortress Classic', 'Deathmatch Classic', 'Counter-Strike',~~]
   return top_closest_names

closest_names('Valhalla')


['Walhall',
 'Brawlhalla',
 'Valhalla Hills',
 'Valhall 2000',
 'Zahalia',
 'Die for Valhalla!',
 'Parallax',
 'Taking Valhalla VR',
 'Earthfall',
 'Valley']
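The ranking step can be sketched on its own with the standard library. This is a minimal illustration under stated assumptions: `top_matches` is a hypothetical helper, `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and the title list mixes a few names from the output above with an exact match ('Valhalla') added purely for illustration.

```python
import heapq
from difflib import SequenceMatcher

def top_matches(query, titles, n=3):
    """Return the n titles most similar to query, best match first."""
    def score(title):
        # Lower-casing both sides makes the comparison case-insensitive.
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return heapq.nlargest(n, titles, key=score)

# A few titles from the output above, plus a hypothetical exact match.
titles = ["Brawlhalla", "Valhalla Hills", "Parallax", "Valhalla", "Earthfall"]

print(top_matches("valhalla", titles, n=3))
```

`heapq.nlargest` avoids sorting the whole list when only a handful of matches is needed. fuzzywuzzy also ships a `process.extract` helper that returns the best matches from a list of choices, which may be worth trying in place of the manual sort.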

# Game Recommendations based on Inputs

We want to return recommended games for a game chosen by the end user. With the Gradio Interface we can fine-tune it so that it returns:
* **N** game results
* Released in or after a minimum year **N**
* Available on a selected platform
* Above a minimum score
* Sorted from top to bottom based on a chosen value **N**

In each case, **N** is the user input.

There's a lot going on in this recommender function, so even with the comments within the code I'll break down its main actions as bullet points:
* Return closest game title match
* Create an empty Dataframe to store results
* Update closest game variable to the chosen dropdown option 
* Return the most similar game indexes as a list
* Only return the games that meet the user's preferences, including selected platform, minimum score and minimum release year, sorted by the chosen value
* Append the first n results to the dataframe
* Dataframe will contain attributes based on chosen games
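The filtering steps in this list can be sketched in plain Python. This is a simplified illustration, not the notebook's function: `recommend`, the `catalog` layout and every attribute value are hypothetical. It shows one possible ordering, where every filter is applied before the list is cut down to the requested number of games.

```python
def recommend(similar_games, catalog, platform, min_score, min_year, how_many):
    """Filter similarity-ranked candidates, then keep the first how_many.

    similar_games: list of (game_index, similarity) tuples, best match first
    catalog: dict mapping game_index to a dict of attributes (hypothetical layout)
    """
    results = []
    for index, similarity in similar_games:
        game = catalog[index]
        if platform not in game["platforms"].split(";"):
            continue  # not available on the selected platform
        if game["score"] <= min_score:
            continue  # not above the minimum score
        if game["year"] < min_year:
            continue  # released before the minimum year
        results.append(game["name"])
        if len(results) == how_many:
            break  # stop once enough games have passed every filter
    return results

# Hypothetical catalog with made-up attribute values, for illustration only.
catalog = {
    0: {"name": "Game A", "platforms": "windows;mac", "score": 0.90, "year": 2015},
    1: {"name": "Game B", "platforms": "windows", "score": 0.60, "year": 2018},
    2: {"name": "Game C", "platforms": "linux", "score": 0.95, "year": 2019},
    3: {"name": "Game D", "platforms": "windows;linux", "score": 0.85, "year": 2012},
}
similar_games = [(1, 0.9), (2, 0.8), (0, 0.7), (3, 0.6)]

print(recommend(similar_games, catalog, "windows", 0.8, 2010, how_many=2))
```

Filtering before truncating means the user gets the requested number of games whenever enough candidates qualify, instead of losing results to filters applied after the cut.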



In [None]:
!pip install gradio
import gradio as gr

In [None]:
#https://gradio.app/docs/#i_slider

def gradio_contents_based_recommender_v2(game, how_many, dropdown_option, sort_option, min_year, platform, min_score):
  #Return closest game title match
  closest_title, distance_score = find_closest_title(dropdown_option)
  #Create a Dataframe with these column headers
  recomm_df = pd.DataFrame(columns=['Game Title', 'Year', 'Score', 'Weighted Score', 'Total Ratings'])
  #Make the closest title whichever dropdown option the user has chosen
  closest_title = dropdown_option
  #find the corresponding index of the game title
  games_index = get_index_from_title(closest_title)
  #return a list of the most similar game indexes as a list
  games_list = list(enumerate(sim_matrix[int(games_index)]))
  #Sort list of similar games from top to bottom
  similar_games = list(filter(lambda x:x[0] != int(games_index), sorted(games_list,key=lambda x:x[1], reverse=True)))
  #Print the game title the similarity matrix is based on
  print('Here\'s the list of games similar to '+'\033[1m'+str(closest_title)+'\033[0m'+'.\n')
  #Only return the games that are on selected platform
  n_games = []
  for i,s in similar_games:
    if platform in get_platform_from_index(i):
      n_games.append((i,s))
  #Only return the games that are above the minimum score
  high_scores = []
  for i,s in n_games:
    if get_score_from_index(i) > min_score:
      high_scores.append((i,s))
    
  #Take the game tuples (game index, similarity score) that passed both filters and store them in the dataframe
  for i,s in high_scores[:int(how_many)]: 
    #Row values follow the same order as the column headers of recomm_df
    row = [get_title_from_index(i), get_title_year_from_index(i), get_score_from_index(i), 
           get_weighted_score_from_index(i), 
           get_total_ratings_from_index(i)]
    #Add each row to the end of the dataframe (DataFrame.append was removed in pandas 2.0)
    recomm_df.loc[len(recomm_df)] = row
  #Sort dataframe by Sort_Option provided by user
  recomm_df = recomm_df.sort_values(sort_option, ascending=False)
  #Only include games released same or after minimum year selected
  recomm_df = recomm_df[recomm_df['Year'] >= min_year]

  return recomm_df

#Create list of unique calendar years based on main df column, ignoring missing years
years_sorted = sorted(df['year'].dropna().unique())
#Ask user for input
print("Which game do you want recommendations similar to?")

names = closest_names(input())
#Interface will include these inputs based on the parameters in the function, with a dataframe output
#Note: gr.inputs.* is the older Gradio API this notebook was written against; newer releases expose gr.Slider, gr.Dropdown and gr.Radio directly
dropdown = gr.Interface(gradio_contents_based_recommender_v2, ["text", gr.inputs.Slider(1, 20, step=int(1)), 
                                                            gr.inputs.Dropdown(names), 
                                                            gr.inputs.Radio(['Year','Score','Weighted Score','Total Ratings']),
                                                            gr.inputs.Slider(int(years_sorted[0]), int(years_sorted[-1]), step=int(1)),
                                                            gr.inputs.Radio(['windows','xbox','playstation','linux','mac']),
                                                            #The score column ranges from 0 to 1, so the slider uses the same range
                                                            gr.inputs.Slider(0, 1, step=0.05)],
                        "dataframe")

dropdown.launch(debug=True)

# How can we improve this recommender engine?

This is the most basic version of a recommender engine you can [do with Gradio](https://gradio.app/working_with_ml/). There's loads more you can do, but I wanted the first interaction to be simple and to build off of that. Some things we can do to improve this include:

* We can replace the cosine similarity algorithm with something more nuanced and sophisticated.
* The games were compared based on Steam tags, but there are other features within the dataset that could help enhance the predictions.
* The name matcher has a long way to go before it can compete with regular search bars. Improving it would make for a better user experience when entering the game you want similar options to.

If you want to build your own recommendation engine with Gradio, feel free to borrow this as a starting point. Also, if you want some help walking through the steps of building one, I can recommend these DataCamp courses.

* [Building Recommendation Engines in Python](https://datacamp.pxf.io/4e20gM)
* [Building Recommendation Engines with PySpark](https://datacamp.pxf.io/yRWA4v)
