# Which Video Game Critic Has The Most Similar Taste To Yours?

Many [recommendation engines](hhttps://github.com/SeyiAgboola/Recommender-Engines-with-Gradio/blob/main/Recommend_Video_Games_Based_on_Metacritic_Reviews_With_Gradio.ipynb) focus on the items and compare their similar attributes. Many also use collaborative filtering to match users with other users with similar tastes. But in this notebook we will match users to the very critics who provide the input data (review scores).

We will use [Metacritic's Review Dataset from 2011-2019 via Kaggle](https://www.kaggle.com/datasets/skateddu/metacritic-critic-games-reviews-20112019) and build our recommendation engine on a cosine similarity matrix.

The main steps of this notebook will be to:

* Install the Dependencies
* Check and clean the data
* Identify the Top 100 Most Popular Games
* Build the User Profile
* Update the Main DataFrame with the New Profile
* Build the Similarity Matrix
* Match user with the most simiar video game critic
* Recommend video games based on the highest rated games by the most similar video game critic

Full code can be accessed on my GitHub here or it is also viewable on my Kaggle page.

If you wish to learn how to use Python for similar use cases, I can recommend [DataCamp's Recommendation Engine course](https://datacamp.pxf.io/4e20gM) which only takes 4 hours and you can literally copy their code and make adjustments.

# Install Dependencies

These are the modules needed to execute the code.

In [None]:
#DataFrame Manipulation
import pandas as pd
import numpy as np

#Cosine Similarity Matrix
import scipy as sp
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

#Text matching
!pip install python-Levenshtein
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from datetime import datetime #Formatting date variables
from random import randint #Generate random entries

# Check and Clean Dataset

Once we've uploaded the dataset, we should check that the data is in a usable condition. And make changes accordingly including:

* Checking the first few rows
* Checking the data types for each column
* Identifying and addressing the null values
* Formatting any inapprorpriate data values such as date values presented as string values



In [2]:
#Convert uploaded CSV into a dataframe
df = pd.read_csv("metacritic_critic_reviews.csv",  error_bad_lines=False, encoding='utf-8') 
#Show first 5 rows in DataFrame
df.head()



  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,name,review,game,platform,score,date
0,LEVEL (Czech Republic),"Portal 2 is a masterpiece, a work of art that ...",Portal 2,PC,100.0,"May 25, 2011"
1,GameCritics,So do we need Portal 2? Do I need it? Maybe no...,Portal 2,PC,100.0,"May 8, 2011"
2,PC Games (Russia),Portal 2 exceeds every expectation. It has a s...,Portal 2,PC,100.0,"May 6, 2011"
3,Adventure Gamers,"Like its predecessor, Portal 2 is not an adven...",Portal 2,PC,100.0,"Apr 29, 2011"
4,Armchair Empire,"Pile on the ""Oh, yes!"" moments of solving some...",Portal 2,PC,100.0,"Apr 28, 2011"


In [3]:
#Show an overview of the dataset including the data types for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125876 entries, 0 to 125875
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   name      125876 non-null  object 
 1   review    125876 non-null  object 
 2   game      125876 non-null  object 
 3   platform  125876 non-null  object 
 4   score     124311 non-null  float64
 5   date      125832 non-null  object 
dtypes: float64(1), object(5)
memory usage: 5.8+ MB


In [4]:
#Which columns have null values?
print(df.columns[df.isna().any()].tolist())

#How many null values per column? - Count the missing values in each column
df.isnull().sum()

['score', 'date']


name           0
review         0
game           0
platform       0
score       1565
date          44
dtype: int64

In [5]:
#This will drop any null values in the dataset
df.dropna(inplace=True)

In [6]:
#Create date column by converting the date into a datetime object then returning only the year
def add_year(full_date):
  datetime_object = datetime.strptime(full_date, '%b %d, %Y') #Converts to datetime based on structure of string
  return datetime_object.year #returns only the year

df['year'] = df['date'].apply(add_year)

#Add the year in brackets to the name of the game to avoid confusion between remakes/remasters and reboots
def year_game(row):
  calendar_year = str(row['year'])
  year_game_combined = str(row['game']) + " (" + calendar_year + ")"
  return year_game_combined

#Updates game column to include year of release
df['game'] = df.apply(year_game, axis=1)
df.head()

Unnamed: 0,name,review,game,platform,score,date,year
0,LEVEL (Czech Republic),"Portal 2 is a masterpiece, a work of art that ...",Portal 2 (2011),PC,100.0,"May 25, 2011",2011
1,GameCritics,So do we need Portal 2? Do I need it? Maybe no...,Portal 2 (2011),PC,100.0,"May 8, 2011",2011
2,PC Games (Russia),Portal 2 exceeds every expectation. It has a s...,Portal 2 (2011),PC,100.0,"May 6, 2011",2011
3,Adventure Gamers,"Like its predecessor, Portal 2 is not an adven...",Portal 2 (2011),PC,100.0,"Apr 29, 2011",2011
4,Armchair Empire,"Pile on the ""Oh, yes!"" moments of solving some...",Portal 2 (2011),PC,100.0,"Apr 28, 2011",2011


# Top 100 Most Popular Games

We need to generate the most relevant list of games for an user possible so they don't have to input games which wouldn't be an efficient user experience. (A search input function is possible but time-consuming to build)
To do that we will rank games based on aggregated sum scores (popularity) and hope users have at least played enough of those games to provide relevant input data.

To do this we need to:
* Calculate the popularity score
* Create a separate DataFrame of top 100 sorted


In [7]:
#Create list of game titles
unique_titles = df['game'].unique()
#Create dictionary to store sum score (total score of reviews combined per game)
title_sum_score = {}
#Loop through each title and assign a float value
for i in unique_titles:
  title_sum_score[i] = 0.0

#For each row, add the score to the matching dictionary entry
def total_score_per_title(row):
  title_sum_score[row['game']]+=float(row['score'])

df.apply(total_score_per_title, axis=1)

0         None
1         None
2         None
3         None
4         None
          ... 
125871    None
125872    None
125873    None
125874    None
125875    None
Length: 124267, dtype: object

In [8]:
#Create DataFrame from dictionary with a score columnn
sum_df = pd.DataFrame.from_dict(title_sum_score, orient='index', columns=['score'])
#Reset index to turn game title into an accessible column, rename to "game", sort values by score
sum_df = sum_df.reset_index().rename(columns={'index':'game'}).sort_values('score', ascending=False)

sum_df.head()

Unnamed: 0,game,score
3134,XCOM 2 (2016),13052.0
4999,Monster Hunter: World (2018),12808.0
4983,Red Dead Redemption 2 (2018),12683.0
4060,Destiny 2 (2017),12633.0
3119,Dark Souls III (2016),12431.0


In [9]:
#Locate the first 100 games
top100_sum_df = sum_df.iloc[:100]
#Print unique game titles in the top 100 games
top100_sum_df['game'].unique()

array(['XCOM 2 (2016)', 'Monster Hunter: World (2018)',
       'Red Dead Redemption 2 (2018)', 'Destiny 2 (2017)',
       'Dark Souls III (2016)', 'Far Cry 5 (2018)',
       'The Legend of Zelda: Breath of the Wild (2017)',
       'Resident Evil 7: biohazard (2017)', 'INSIDE (2016)',
       'Dragon Ball FighterZ (2018)', 'The Witcher 3: Wild Hunt (2015)',
       'DOOM (2016)', 'God of War (2018)', 'Fallout 4 (2015)',
       "Assassin's Creed Odyssey (2018)",
       'Deus Ex: Mankind Divided (2016)',
       'Wolfenstein II: The New Colossus (2017)', 'Injustice 2 (2017)',
       'Shadow of the Tomb Raider (2018)',
       "Assassin's Creed Origins (2017)", 'Battlefield 1 (2016)',
       "Uncharted 4: A Thief's End (2016)", 'Prey (2017)',
       'Super Mario Odyssey (2017)', 'Dead Cells (2018)',
       'Far Cry Primal (2016)', "Tom Clancy's The Division (2016)",
       'Horizon Zero Dawn (2017)', 'Yooka-Laylee (2017)',
       'Metal Gear Solid V: The Phantom Pain (2015)',
       'Batman: A

In [10]:
#------IGNORE----#
#Just double checking we don't get our indexes mixed up

random = randint(0,int(top100_sum_df.shape[0])-1)
print(random) #index within the top100 fdf
game_name = top100_sum_df['game'].iloc[random]
print(game_name) #game title
current_game_reviews = df.loc[df['game'] == game_name] #all the rows (reviews by different critics for that game title)
#----------
#Store index of the current subset of reviews
current_game_indexes = current_game_reviews.index
#Select random review from that subset
random = randint(0,int(current_game_indexes.shape[0])-1)
#Assign index for that review
current_review_index = current_game_indexes[random]
current_review_index

89
Unravel (2016)


69480

# Create Function to Loop Through At Least 10 Games

This function randomly loops through atleast 10 games from the top 100 and presents a random review for the user assess how similar it is to their own sentiment of the game

Based on their input, each game is given a score adjusted based on the critic's original score with a positive/negative adjustment based on the user's input. I purposely don't reveal the critics score as I feel they are rarely a true reflection of a player's sentiment quantified.

There's a lot going on in the function below but overall we're taking the score of the reviewer and updating it based on how close the user feels it is to their own perception. The maths is basic and arbitary and I haven't been able to find a way to make the inputs more user friendly. Still it does the job for now with comments explaining most of it.

In [20]:
#Create new user profile to store new entries relating to current user
profile = {'game':[],
        'name':[],
        'score': []}

def loop_10_games():
  #Create counter to track no. of games reviewed
  counter = 0
  #Create inputs tracker to track which games have been added (avoid duplicates)
  inputs = []
  #Keep running until user returns no more than 11 entries
  print("Based on the sentiment of each review quote, please score them on how relevant it is to your own perspective of that game title.")
  print("These are the scores with the corresponding responses:")
  print("""
  2 - I would rate the game MUCH better /n 
  1 - I would rate the game a little better /n 
  0 - This is spot on /n 
  -1 - I would rate the game worst /n
  -2 - I would rate the game MUCH worst /n
  N - No comment
   """)
  while counter < 10:
    #Generate random number between 0 and the total df row length
      random_top_100 = randint(0,int(top100_sum_df.shape[0])-1)
      #If current random integer already exists in inputs...
      while top100_sum_df['game'].iloc[random_top_100] in inputs:
        #Rerun random algorithm 
        random_top_100 = randint(0,int(top100_sum_df.shape[0])-1)  
      current_game = top100_sum_df['game'].iloc[random_top_100]
      current_game_reviews = df.loc[df['game'] == current_game]
      #----------
      #Store index of subset, and select at random
      current_game_indexes = current_game_reviews.index
      random = randint(0,int(current_game_indexes.shape[0])-1)
      current_loc = current_game_indexes[random]
#Communicate with user the quote and game it is referring to
      print("Do you agree with the sentiment of this quote? Rate the relevancy between -2 to +2")
      print(df['game'].iloc[current_loc]) 
      # print(df['score'].iloc[current_loc])
      print(df['review'].iloc[current_loc])
      # print(random_top_100)
      # print(current_loc)
      #Define user inputs and mathematical significance   
      user_response = input()
      user_responses = ["I would rate the game MUCH better", "I would rate the game a little better", "This is spot on", "No comment", "I would rate the game worst", "I would rate the game much worst"]
      user_increments = [20,10,0,'NaN',-20,-30]
      #Append current review to inputs
      inputs.append(top100_sum_df['game'].iloc[random_top_100])

#If response = X, apply +/- calculation
      if user_response == "1": #"I would rate the game a little better"
        user_score = int(df['score'].iloc[current_loc]) + 10
        #Ensure score sticks to 100 max
        if user_score > 100:
          user_score == 100
          #Append result to profile
        profile['game'].append(df['game'].iloc[current_loc])
        profile['name'].append(1001)
        profile['score'].append(user_score)
        counter+=1
      elif user_response == "2": #"I would rate the game MUCH better"
        user_score = int(df['score'].iloc[current_loc]) + 20
        if user_score > 100:
          user_score == 100
        profile['game'].append(df['game'].iloc[current_loc])
        profile['name'].append(1001)
        profile['score'].append(user_score)
        counter+=1
      elif user_response == "0": #"This is spot on"
        user_score = int(df['score'].iloc[current_loc])
        profile['game'].append(df['game'].iloc[current_loc])
        profile['name'].append(1001)
        profile['score'].append(user_score)
        counter+=1
      elif user_response == "-1": #"I would rate the game worst"
        user_score = int(df['score'].iloc[current_loc]) - 20
        if user_score < 20:
          user_score == 20
        profile['game'].append(df['game'].iloc[current_loc])
        profile['name'].append(1001)
        profile['score'].append(user_score)
        counter+=1
      elif user_response == "-2": #"I would rate the game much worst"
        user_score = int(df['score'].iloc[current_loc]) - 30
        if user_score < 20:
          user_score == 20
        profile['game'].append(df['game'].iloc[current_loc])
        profile['name'].append(1001)
        profile['score'].append(user_score)
        counter+=1
        #If user can't score, skip to the next iteration
      elif user_response == "N": #"No comment"
        continue
      elif user_response == "End":
        break
      else:
        continue

loop_10_games()
my_ratings = pd.DataFrame(profile)
my_ratings.head()


In [12]:
my_ratings

Unnamed: 0,game,name,score
0,DOOM (2016),1001,86
1,Tomb Raider: Definitive Edition (2014),1001,67
2,Okami HD (2017),1001,83
3,Far Cry 5 (2018),1001,100
4,Football Manager 2018 (2017),1001,58
5,Rocket League (2017),1001,110
6,Uncharted: The Lost Legacy (2017),1001,105
7,Nioh (2017),1001,87
8,Pillars of Eternity: Complete Edition (2017),1001,70
9,Call of Duty: Infinite Warfare (2016),1001,61


# Add Profile to Main DataFrame

By adding the new user profile to the main dataframe, we can compare the profiel equally with all the other reviewers based how they scored the same group of games.

In [13]:
main_df = df[['game','name','score']]

In [14]:
def add_profile(current_df, profile_df):
  #Make user profile a part of the dataset
  complete_df = pd.concat([current_df, profile_df], axis=0)
  # rename the columns to userID, itemID and rating
  complete_df.columns = ['itemID', 'userID', 'rating']
  # use the transform method group by userID and count to keep the games with more than 25 reviews
  complete_df['reviews'] = complete_df.groupby(['itemID'])['rating'].transform('count')
  return complete_df

updated_df = add_profile(main_df, my_ratings)

# Pivot Data into Similarity Matrix

Here we convert our dataframe into a similarity matrix to measure how similar each critic including the new user profile is to each other.

In [15]:
def pivot_data_similarity(full_df):
  pivot = full_df.pivot_table(index=['itemID'], columns=['userID'], values='rating')
  #Applying lambda function to multiple rows using Dataframe.apply()
  #(x-np.mean(x))/(np.max(x)-np.min(x)) = Formula
  pivot_n = pivot.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)

  # step 2 - Fill NaNs with Zeros
  pivot_n.fillna(0, inplace=True)

  # step 3 - Transpose the pivot table
  pivot_n = pivot_n.T

  # step 4 - Locate the columns that are not zero (unrated)
  pivot_n = pivot_n.loc[:, (pivot_n != 0).any(axis=0)]

  # step 5 - Create a sparse matrix based on our pivot table
  piv_sparse = sp.sparse.csr_matrix(pivot_n.values)

  #Compute cosine similarity between samples in X and Y.
  game_similarity = cosine_similarity(piv_sparse)

  #Turn our similarity kernel matrix into a dataframe
  sim_matrix_df = pd.DataFrame(game_similarity, index = pivot_n.index, columns = pivot_n.index)

  return sim_matrix_df

new_sim_df = pivot_data_similarity(updated_df)

In [16]:
new_sim_df.head()

userID,1001,1UP,3DJuegos,4Players.de,Absolute Games,AceGamez,ActionTrip,Adventure Gamers,App Trigger,Arcade Sushi,...,X-ONE Magazine UK,XBLA Fans,XGN,Xbox Achievements,Xbox Tavern,XboxAddict,Yahoo!,YouGamers,ZTGD,games(TM)
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1001,1.0,0.0,-0.01626,-0.005577,0.0,0.0,-2e-05,0.0,0.0,0.01586,...,0.033157,0.0,0.004446,0.086324,0.018113,-0.008428,-0.000349,0.0,-0.010809,-0.002354
1UP,0.0,1.0,-0.023054,0.001183,-0.029014,0.0,-0.024743,0.0,0.0,0.0,...,0.0,0.0,-0.000375,0.0,0.0,0.0,0.0,0.005301,0.0244,0.015251
3DJuegos,-0.01626,-0.023054,1.0,0.010908,-0.078993,0.0,0.00742,0.006608,-0.044282,0.00531,...,-0.07008,0.008202,0.052381,0.021039,0.036036,0.024795,-0.058755,0.025963,0.014181,-0.134861
4Players.de,-0.005577,0.001183,0.010908,1.0,0.006609,0.0,-0.014663,-0.023619,0.012125,0.009012,...,-0.002902,-0.021947,0.00494,-0.006968,-0.030265,-0.002292,0.006233,0.002292,-0.005861,0.006301
Absolute Games,0.0,-0.029014,-0.078993,0.006609,1.0,0.0,0.037939,-0.019202,0.0,0.0,...,0.0,0.0,-0.009964,0.0,0.0,0.0,0.0,-0.031578,-0.026115,0.008426


# Return most similar profiles to current user

This function answers the question - "which critic's taste is most similar to the current user" and returns the top 5 highest matches.

Using our similarity DataFrame we sort by critic similar to the inputted user and print out the top 5 highest entries.

In [32]:
most_similar_critics = []

def game_recommendation(reviewer):
    """
    This function will return the top 5 reviewers with the highest cosine similarity value and show their match percentage.
         
    """
    top_5_most_similar = []
    #Counter for Ranking
    number = 1
    print('Recommended critics based on how similar your tastes are:')
    #for row in sorted similary df
    for n in new_sim_df.sort_values(by = reviewer, ascending = False).index[1:6]:
      #append result to our pre-assigned list
        top_5_most_similar.append(n)
        #print out the result and match percentage
        print("#" + str(number) + ": " + n + ", " + str(round(new_sim_df[reviewer][n]*100,2)) + "% " + "match")
        #track the count
        number +=1 
    return top_5_most_similar

most_similar_critics = game_recommendation(1001)

Recommended critics based on how similar your tastes are:
#1: Xbox Achievements, 8.63% match
#2: Time, 4.64% match
#3: PC Invasion, 3.78% match
#4: SomosXbox, 3.73% match
#5: The Escapist, 3.59% match


In [18]:
#DataFrame containing the highest matching critic's reviews sorted by Score
critic_titles = df[df['name'] == most_similar_critics[0]].sort_values('score', ascending=False)

#This function looks at the Critic's DataFrame, pulls out their highest ranked games
def top_critic(critic):
    #Counter for Ranking
    number = 1
    print("These are your most similar critic\'s ({}) highest scored games:\n".format(critic))
    #critic_titles = df[df['name'] == most_similar_critics[0]].sort_values('score', ascending=False)
    for n in range(len(critic_titles['game'][:10])):
      print("#" + str(number) + ": " + str(critic_titles.iloc[n]['game']) + ", " + str(critic_titles.iloc[n]['score']))
      number +=1  

top_critic(critic_titles.iloc[0]['name'])

These are your most similar critic's (Xbox Achievements) highest scored games:

#1: Grand Theft Auto V (2014), 100.0
#2: Red Dead Redemption 2 (2018), 95.0
#3: Overwatch (2016), 93.0
#4: Sunset Overdrive (2014), 92.0
#5: Prey (2017), 92.0
#6: The Witcher 3: Wild Hunt (2015), 92.0
#7: Batman: Arkham Knight (2015), 92.0
#8: Assassin's Creed Odyssey (2018), 90.0
#9: Alien: Isolation (2014), 90.0
#10: Forza Horizon 4 (2018), 90.0


# Build Recommender Function

This puts all that code into one function.

Now that we've created our main and top 100 dataframes, we can pull out random reviews and let the user input their relevancy scores on the sentiment of each review.

They will be returned the top 5 reviewers that are most similar to in terms of taste and the top reviewers favourite 10 games.

In [None]:
def recommend_games():
  #Create profile
  profile = {'game':[], 'name':[], 'score': []}
  #Receive inputs based on df and top100_df
  loop_10_games()
  profile_df = pd.DataFrame(profile)

  #Add profile to main df
  full_df = add_profile(main_df, profile_df)
  #Create the similarity matrix based on updated df
  new_sim_df = pivot_data_similarity(full_df)

  # Find most similar critic profiles
  most_similar_critics = []

  most_similar_critics = game_recommendation(1001)

  # Inspect the most similar to the user preferences
  critic_titles = df[df['name'] == most_similar_critics[0]].sort_values('score', ascending=False)
  #Return most similar values
  top_critic(critic_titles.iloc[0]['name'])

recommend_games()

# Next Steps and Areas for Improvement

This is a minimum viable product where I wanted to see if the concept would work. This meant that I missed out on making this a better user experience and the results were not as good as I could make them. If I get round to improving on this concept, I would:

* Create Input options rather than user typing in numbers to reference their choices
* Allow users to choose the games to inform on instead of responding to random games within the top 100 games within the dataset.
* Increase match percentages by applying more sophisticated math rigour to the calculations