# Text-based Recommendation with Metacritic

This notebook shows how you can build a recommendation engine from the text data in a dataset. In this case, I will be using the Metacritic reviews of video game releases.

You can access the original dataset [here on Kaggle](https://www.kaggle.com/skateddu/metacritic-critic-games-reviews-20112019).

The particular dataset I used in this notebook [can be accessed here](https://www.kaggle.com/seyi92coding/metacritic-reviews-text-only-per-game).

# Import Dependencies

In [2]:
import pandas as pd
import numpy as np
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0




In [3]:
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas 1.3+)
df = pd.read_csv("/content/Metacritic_Reviews_Only.csv", on_bad_lines='skip', encoding='utf-8')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Game Title,Reviews
0,0,Portal 2,So do we need Portal 2? Do I need it? Maybe no...
1,1,The Elder Scrolls V: Skyrim,Perfect games do not exist and Skyrim is no ex...
2,2,The Legend of Zelda: Ocarina of Time 3D,Even though the game is just a remake of a twe...
3,3,Batman: Arkham City,The diehard fans of Bruce Wayne can set up a g...
4,4,Super Mario 3D Land,Super Mario 3D Land is a perfect blend between...


# Clean and Format Data

There's not much cleaning involved this time round, but depending on how structured you need your text data to be, you might want to do additional preprocessing.

Below I looked to:

* Remove the game title from each review text
* Remove redundant columns - To speed up processing
* Drop missing values - NaN values tend to lead to a lot of errors in the code

In [5]:
#Remove title from review
def remove_title(row):
  game_title = row['Game Title']
  body_text = row['Reviews']
  new_doc = body_text.replace(game_title, "")
  return new_doc

df['Reviews'] = df.apply(remove_title, axis=1)
#drop redundant column
df = df.drop(['Unnamed: 0'], axis=1)

In [6]:
df.info()
# df['Game Title'].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3908 entries, 0 to 3907
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Game Title  3908 non-null   object
 1   Reviews     3908 non-null   object
dtypes: object(2)
memory usage: 61.2+ KB


In [7]:
#Which columns have null values?
print(df.columns[df.isna().any()].tolist())

#How many null values per column? - Count the missing values in each column
df.isnull().sum()

[]


Game Title    0
Reviews       0
dtype: int64

In [8]:
df.dropna(inplace=True) #Drop Null Reviews
print(df.isnull().sum())

Game Title    0
Reviews       0
dtype: int64


# Text-based similarities

We will be using the TF-IDF model to vectorize our text data. By default, TF-IDF generates a column for every word across all of your documents (the reviews). This will include both very common words that appear in every document and words that appear so rarely that they provide no value in finding similarities between items.

To offset this, we will only keep words that appear in at least 2 reviews and exclude words that appear in over 70% of the reviews. Hopefully this allows us to establish clearer distinctions between the reviews.
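To make the two thresholds concrete, here is a minimal plain-Python sketch of the document-frequency filter that `min_df` and `max_df` describe. The toy reviews and the `filter_vocabulary` helper are made up for illustration, and the whitespace tokenisation is a simplification of what `TfidfVectorizer` actually does.

```python
from collections import Counter

# Toy reviews, invented for illustration only
reviews = [
    "great puzzle game with clever portals",
    "great open world game with dragons",
    "great racing game with karts",
    "clever puzzle platformer",
]

def filter_vocabulary(docs, min_df=2, max_df=0.7):
    """Keep words found in at least min_df documents (a count)
    and in at most max_df of all documents (a proportion)."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that a word is counted once per document
        doc_freq.update(set(doc.split()))
    max_count = max_df * len(docs)
    return sorted(w for w, n in doc_freq.items() if min_df <= n <= max_count)

kept = filter_vocabulary(reviews)
print(kept)
```

Words such as "great" and "game" appear in 3 of the 4 toy reviews (75%), so they are dropped as too common, while words that appear in only one review are dropped as too rare.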



In [9]:
# Instantiate the vectorizer object to the vectorizer variable
# A word must appear in at least 2 reviews to be included; words that appear in over 70% of reviews are excluded
vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)

# Fit and transform the Reviews column
vectorized_data = vectorizer.fit_transform(df['Reviews'])

# Look at the features generated
# (get_feature_names_out replaces get_feature_names in scikit-learn 1.0+)
print(vectorizer.get_feature_names_out())






# Creating the TF-IDF DataFrame


Now that we have generated our TF-IDF features as vectors, we need to get them into a format that we can use to make recommendations. So we will wrap the array in a DataFrame and assign the video game titles to the DataFrame's index.

This will leave us with a DataFrame where each row represents a game and each column represents a word extracted from the reviews.

In [11]:
# Create DataFrame from the TF-IDF array
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names_out())

# Assign the game titles to the index
tfidf_df.index = df['Game Title']
print(tfidf_df.head())

                                          00  000  007  ...   zx  être  τhere
Game Title                                              ...                  
Portal 2                                 0.0  0.0  0.0  ...  0.0   0.0    0.0
The Elder Scrolls V: Skyrim              0.0  0.0  0.0  ...  0.0   0.0    0.0
The Legend of Zelda: Ocarina of Time 3D  0.0  0.0  0.0  ...  0.0   0.0    0.0
Batman: Arkham City                      0.0  0.0  0.0  ...  0.0   0.0    0.0
Super Mario 3D Land                      0.0  0.0  0.0  ...  0.0   0.0    0.0

[5 rows x 25563 columns]




# Comparing all the games with TF-IDF

We will compare all the video games (the rows) using the cosine similarity metric to find the similarities between them. We do this by generating a matrix of the cosine similarities between all of the game reviews and storing it as a DataFrame.

Each pair of games gets a similarity score based on how much weighted vocabulary their reviews share. In the output you will see that a game compared with itself gets a score of 1.0, which makes sense since it is 100% similar to itself.
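As a small worked example of the metric itself, the sketch below computes cosine similarity by hand for three made-up TF-IDF vectors. The vectors and the `cosine` helper are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights over a three-word vocabulary
game_a = [0.5, 0.5, 0.0]
game_b = [0.5, 0.0, 0.5]  # shares one word with game_a
game_c = [0.0, 0.0, 1.0]  # shares no words with game_a

self_sim = cosine(game_a, game_a)  # a game compared with itself
ab = cosine(game_a, game_b)
ac = cosine(game_a, game_c)
print(self_sim, ab, ac)
```

A game compared with itself scores 1, games that share part of their vocabulary land somewhere in between, and games with no words in common score 0.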

In [12]:
# Find the cosine similarity measures between all games and assign the results to cosine_similarity_array.
cosine_similarity_array = cosine_similarity(tfidf_df)

# Create a DataFrame from the cosine_similarity_array with tfidf_df.index as its rows and columns.
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_df.index, columns=tfidf_df.index)

# Print the top 5 rows of the DataFrame
cosine_similarity_df.head()

Game Title,Portal 2,The Elder Scrolls V: Skyrim,The Legend of Zelda: Ocarina of Time 3D,Batman: Arkham City,Super Mario 3D Land,Deus Ex: Human Revolution,Pushmo,Total War: Shogun 2,FIFA Soccer 12,Battlefield 3,LIMBO,Assassin's Creed: Brotherhood,The Witcher 2: Assassins of Kings,Dead Space 2,Bastion,Crysis 2,DiRT 3,Battlefield 3: Back to Karkand,Super Street Fighter IV: 3D Edition,Frozen Synapse,Mario Kart 7,Star Wars: The Old Republic,Saints Row: The Third,SpaceChem,Super Street Fighter IV: Arcade Edition,Football Manager 2012,Unity of Command,The Binding of Isaac,Trine 2,Rift,Shift 2: Unleashed,Anno 2070,Orcs Must Die!,Terraria,L.A. Noire: The Complete Edition,VVVVVV,F1 2011,The Book of Unwritten Tales,Cave Story 3D,Gemini Rue,...,Jumping Joe & Friends,Time Carnage,Kingdom Come: Deliverance - From The Ashes,Legendary Eleven,Castle of Heart,Black Clover: Quartet Knights,Fallout 76,Extinction,OVERKILL's The Walking Dead,Immortal: Unchained,Bullet Witch,Baseball Riot,Lust for Darkness,Fear Effect Sedna,Milanoir,Crisis on the Planet of the Apes VR,Carnival Games for Nintendo Switch,Nickelodeon Kart Racers,Morphies Law,Past Cure,Out of Ammo,Desert Child,Agony,ARK Park,Tennis World Tour,Yet Another Zombie Defense HD,Bravo Team,KURSK,New Gundam Breaker,Gungrave VR,Senran Kagura Reflexions,Underworld Ascendant,Heavy Fire: Red Shadow,Hollow,One Piece: Grand Cruise,Super Seducer: How to Talk to Girls,Fantasy Hero: Unsigned Legacy,Gene Rain,The Quiet Man,Wild West Online
Portal 2,1.0,0.076978,0.119635,0.099916,0.099387,0.142591,0.128299,0.084359,0.074228,0.197972,0.132618,0.087776,0.12212,0.164433,0.172019,0.143876,0.098574,0.067234,0.071421,0.135737,0.09793,0.093416,0.097695,0.142187,0.043201,0.096326,0.046733,0.130828,0.245271,0.096459,0.088289,0.097993,0.086599,0.107879,0.080828,0.137098,0.099809,0.199834,0.106137,0.171135,...,0.056196,0.067611,0.031763,0.04748,0.113229,0.049188,0.140018,0.164447,0.054361,0.082428,0.095823,0.099595,0.041537,0.170513,0.090779,0.052346,0.034858,0.08027,0.089265,0.145185,0.105383,0.059543,0.131854,0.048444,0.058148,0.093478,0.104366,0.062081,0.057856,0.083302,0.066135,0.064428,0.098138,0.077267,0.046197,0.075991,0.05905,0.084386,0.09987,0.037423
The Elder Scrolls V: Skyrim,0.076978,1.0,0.090908,0.060537,0.06538,0.062413,0.057312,0.044211,0.045294,0.085709,0.117415,0.051207,0.094034,0.056393,0.173283,0.067244,0.040022,0.041559,0.066252,0.062147,0.076424,0.055055,0.046813,0.036114,0.037881,0.048226,0.02426,0.06243,0.090289,0.060215,0.042785,0.053046,0.034923,0.082322,0.05124,0.151212,0.067612,0.072808,0.059041,0.061483,...,0.072399,0.032384,0.024812,0.056755,0.08426,0.025474,0.153111,0.082293,0.027866,0.041552,0.063742,0.039358,0.025529,0.052369,0.034759,0.028721,0.014793,0.042287,0.064025,0.06076,0.058358,0.030585,0.060743,0.037512,0.035075,0.043001,0.044215,0.026993,0.029258,0.036274,0.066196,0.049196,0.046459,0.125415,0.035747,0.02794,0.052998,0.030242,0.049319,0.027255
The Legend of Zelda: Ocarina of Time 3D,0.119635,0.090908,1.0,0.06971,0.298485,0.076733,0.229433,0.057838,0.119264,0.109632,0.098468,0.067832,0.077429,0.083751,0.135027,0.084378,0.060749,0.050252,0.239153,0.090924,0.200346,0.061165,0.049843,0.05628,0.052534,0.088504,0.031631,0.083703,0.11796,0.068614,0.058864,0.069773,0.049024,0.092727,0.067311,0.224429,0.144846,0.127649,0.192684,0.122352,...,0.041102,0.042495,0.027855,0.032994,0.095398,0.031826,0.08324,0.086107,0.028297,0.045594,0.087432,0.03337,0.024566,0.071775,0.053289,0.029509,0.018177,0.047099,0.053945,0.064272,0.070193,0.036594,0.068566,0.038016,0.04333,0.046462,0.059519,0.033309,0.035646,0.055951,0.05415,0.035914,0.066421,0.052684,0.035731,0.029892,0.04623,0.035381,0.054201,0.027532
Batman: Arkham City,0.099916,0.060537,0.06971,1.0,0.047933,0.081773,0.042146,0.034527,0.051795,0.103954,0.058263,0.064522,0.073548,0.069071,0.084522,0.081744,0.059189,0.040303,0.045113,0.073886,0.043712,0.048371,0.080846,0.039571,0.051662,0.056682,0.025724,0.070027,0.099688,0.056156,0.046323,0.074443,0.040281,0.069428,0.07987,0.074677,0.066807,0.131385,0.062547,0.093122,...,0.022232,0.027848,0.012477,0.024015,0.075887,0.023248,0.077037,0.08577,0.023194,0.063474,0.080049,0.033699,0.025628,0.06929,0.047726,0.027321,0.012412,0.038265,0.03063,0.084595,0.056099,0.032548,0.062889,0.041527,0.030906,0.029005,0.046985,0.031734,0.030693,0.030285,0.038291,0.029772,0.043385,0.035933,0.027564,0.025698,0.026108,0.043439,0.059832,0.023298
Super Mario 3D Land,0.099387,0.06538,0.298485,0.047933,1.0,0.063703,0.241666,0.055694,0.101044,0.092144,0.07749,0.040892,0.062952,0.061374,0.111424,0.074244,0.053492,0.046113,0.25465,0.084194,0.533772,0.05441,0.046119,0.050386,0.048479,0.066943,0.03149,0.078488,0.107064,0.066953,0.046409,0.064585,0.051206,0.101116,0.046687,0.245824,0.146921,0.083242,0.163263,0.071114,...,0.047134,0.044823,0.023838,0.034566,0.101142,0.028023,0.07521,0.110568,0.029687,0.04954,0.060208,0.051479,0.022783,0.067851,0.054777,0.028718,0.014611,0.116657,0.043734,0.071596,0.059049,0.031895,0.061482,0.034486,0.037576,0.046584,0.047836,0.030393,0.031131,0.054796,0.046223,0.031689,0.05608,0.050798,0.031955,0.060861,0.038233,0.032078,0.058543,0.025356


In [62]:
# Find the values for the game Batman: Arkham City
cosine_similarity_series = cosine_similarity_df.loc['Batman: Arkham City']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

# Print the results
print(ordered_similarities)

Game Title
Batman: Arkham City                                     1.000000
Batman: Arkham Knight                                   0.671353
Batman: Return to Arkham                                0.630896
Batman: Arkham City - Armored Edition                   0.592272
Batman: Arkham Origins                                  0.555570
                                                          ...   
Total War: WARHAMMER II - Curse of the Vampire Coast    0.009232
FLY'N                                                   0.008768
Table Mini Golf                                         0.007746
ZEN Pinball 2: Portal Pinball                           0.005996
Aggressors: Ancient Rome                                0.003856
Name: Batman: Arkham City, Length: 3908, dtype: float64


# Build the user preference profile

People tend to like a bunch of games instead of just one. When it comes to recommendations, the more the merrier (unless we're talking about backlogs). Here we will generate a profile for a user by listing all of the games they have previously enjoyed playing.

First, we create a subset of the data containing only the games we've previously enjoyed and store it separately.

Next, we average the TF-IDF scores of those games, word by word. The result is a single vector that summarises the user's preferences, which we will use to recommend new games.
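The averaging step can be sketched in plain Python on a tiny example. The vocabulary, game names and weights below are made up for illustration, and `build_profile` is a hypothetical helper that mirrors what `DataFrame.mean()` does column by column.

```python
# Hypothetical TF-IDF rows for two enjoyed games over a four-word vocabulary
vocabulary = ["kart", "race", "sword", "dungeon"]
enjoyed = {
    "Racing Game": [0.8, 0.6, 0.0, 0.0],
    "Adventure Game": [0.0, 0.2, 0.6, 0.4],
}

def build_profile(rows):
    """Average the vectors column by column to get one profile vector."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

profile = build_profile(list(enjoyed.values()))

# Pair each word with its averaged weight
print(dict(zip(vocabulary, profile)))
```

The profile keeps a non-zero weight for every word that matters to at least one enjoyed game, so it can match games that resemble any of them.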



In [13]:
list_of_games_enjoyed = ['The Legend of Zelda: Ocarina of Time 3D', 'Mario Kart 7', 'Carnival Games for Nintendo Switch']

# Create a subset of only the games the user has enjoyed
#A DataFrame's .reindex(index_list) method can be used to take a subset of the rows in a DataFrame, when slicing by a list containing indices.
games_enjoyed_df = tfidf_df.reindex(list_of_games_enjoyed)

# Inspect the DataFrame
games_enjoyed_df.head()


Game Title,00,000,007,00s,01,02,03,04,05,054,058,06,060,061,062,063,064,065,066,067,068,069,07,070,08,081,082,09,10,100,1000,100h,100th,101,102,103,104,105,106,107,...,zeldas,zellner,zen,zenimax,zenith,zeno,zer0,zerg,zero,zeroes,zeros,zest,zestiria,zeus,zhao,ziggler,ziggurat,zip,zipline,zipping,zippy,zips,zodiac,zoe,zoink,zombi,zombie,zombies,zombified,zone,zones,zoning,zoo,zoom,zoomed,zooming,zuma,zx,être,τhere
The Legend of Zelda: Ocarina of Time 3D,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mario Kart 7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Carnival Games for Nintendo Switch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Generate the user profile by finding the average scores of games they enjoyed
user_prof = games_enjoyed_df.mean()
user_prof[1000:1100]

aisle         0.0
aj            0.0
ak            0.0
aka           0.0
akane         0.0
             ... 
aloof         0.0
aloy          0.0
alpha         0.0
alphadream    0.0
alps          0.0
Length: 100, dtype: float64

# User profile based recommendations

Now that we have built the user profile from the aggregate of the games they have enjoyed, we can compare it to the larger TF-IDF DataFrame.
But first we will remove the enjoyed games, since we don't want to recommend them.

Then we will calculate the user profile's cosine similarity against the original TF-IDF DataFrame minus the enjoyed games.

In [15]:
# Drop the games enjoyed as you would not want to suggest games that the user has already played
tfidf_subset_df = tfidf_df.drop(list_of_games_enjoyed, axis=0)

# user_prof is a Series holding one averaged weight per word
# reshape(1, -1) turns it into a single-row 2D array, the shape cosine_similarity expects
# Calculate the cosine_similarity between user_prof and all the game profiles in tfidf_subset_df.
similarity_array = cosine_similarity(user_prof.values.reshape(1, -1), tfidf_subset_df)
similarity_df = pd.DataFrame(similarity_array.T, index=tfidf_subset_df.index, columns=["similarity_score"])

# Sort the values from high to low by the values in the similarity_score
sorted_similarity_df = similarity_df.sort_values(by="similarity_score", ascending=False)

# Inspect the most similar to the user preferences
print(sorted_similarity_df.head())

                      similarity_score
Game Title                            
Mario Kart 8                  0.494152
Mario Kart 8 Deluxe           0.478379
Star Fox 64 3D                0.458390
Super Mario 3D Land           0.454152
Super Mario 3D World          0.365928


In [26]:
sorted_similarity_df.iloc[0]['similarity_score']

0.5548091310182418

# Text Matching

We can't trust people to spell anything correctly, least of all complicated, poorly worded video game titles. So we need a way to match a text input to a game in the dataset.

To do that, we create a function that finds the closest title in the dataset to whatever the user typed.
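The idea can be sketched with Python's standard-library `difflib` as a stand-in for fuzzywuzzy, so the scores here may differ from what `fuzz.ratio` returns. The short `titles` list and the `similarity` and `closest` helpers are assumptions for illustration only.

```python
from difflib import SequenceMatcher

# A few titles taken from the dataset
titles = ["Batman: Arkham City", "Batman: Arkham Knight", "Mario Kart 7", "Portal 2"]

def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def closest(query, candidates):
    """Pick the candidate with the highest similarity to the query."""
    return max(candidates, key=lambda t: similarity(t, query))

# The query is missing the colon, but the right title still wins
match = closest("Batman Arkham Knight", titles)
print(match, similarity(match, "Batman Arkham Knight"))
```

Scoring every title and taking the highest one means a small typo or missing punctuation mark still resolves to the intended game.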

In [17]:
# Create a function to score how closely two titles match
def matching_score(a, b):
  # fuzz.ratio(a, b) returns a 0-100 similarity score based on the Levenshtein distance between a and b
  # When the two strings are exactly the same, the score is 100
  return fuzz.ratio(a, b)

# Convert a row position into its game title
def get_title_from_index(index):
  # iloc looks up by position, which matches the positions produced by enumerate below
  return df['Game Title'].iloc[index]

# A function to return the most similar title to the words a user types
# Without this, the recommender only works when a user enters the exact title which the data has.
def find_closest_title(title):
  # matching_score(a, b): a is the current row's title, b is the title we're trying to match
  leven_scores = list(enumerate(df['Game Title'].apply(matching_score, b=title))) # [(0, 30), (1, 95), (2, 19), ...] one (position, score) tuple per row
  sorted_leven_scores = sorted(leven_scores, key=lambda x: x[1], reverse=True) # Sorts the tuples by score, highest first: [(1, 95), (3, 49), (0, 30), ...]
  closest_title = get_title_from_index(sorted_leven_scores[0][0])
  distance_score = sorted_leven_scores[0][1]
  # e.g. ('Bejeweled Twist', 100)
  return closest_title, distance_score

find_closest_title('Batman Arkham Knight')

('Batman: Arkham Knight', 98)

# Build Recommender Function

Our recommender function takes two inputs: the game title and an exclusion keyword. I added the exclusion keyword after realising that the recommendations were returning a lot of DLCs and sequels, which doesn't make for a very useful recommender.


By combining everything we've done from building the user profile onwards, we will pull out the top 5 games we want to recommend.


1. Text match the closest title in the dataset
2. Set up a counter for the final ranking
3. Create the user profile based on previous games
4. Create a TF-IDF subset without the previously mentioned titles
5. Calculate the cosine similarity based on the selected titles and convert the result back into a DataFrame
6. Sort the DataFrame by similarity
7. Return the most similar game titles that don't contain the keyword
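The last two steps, ranking and keyword exclusion, can be sketched on their own as follows. The `top_matches` helper and the ranked list are hypothetical and only illustrate the filtering logic; the titles and scores are made up.

```python
def top_matches(ranked, keyword, limit=5):
    """Return up to `limit` (title, score) pairs whose title lacks the keyword.

    `ranked` must already be sorted from most to least similar.
    """
    results = []
    for title, score in ranked:
        if keyword.lower() in title.lower():
            continue  # skip sequels and DLC that share the keyword
        results.append((title, score))
        if len(results) == limit:
            break
    return results

# Hypothetical ranked output, invented for illustration
ranked = [
    ("Game X 2", 0.9),
    ("Game X: DLC Pack", 0.8),
    ("Other Fighter", 0.7),
    ("Block Puzzler", 0.6),
]
picks = top_matches(ranked, "game x", limit=2)
print(picks)
```

Keeping each title paired with its own score while filtering means a skipped sequel can never lend its score to the next game in the list.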

In [69]:
def recommend_games(title, keyword):
  # Match the input to the closest title in the dataset
  title, distance_score = find_closest_title(title)
  print('Recommended because you played {}:\n'.format(title))

  # Build the user profile from the matched game
  list_of_games_enjoyed = [title]
  games_enjoyed_df = tfidf_df.reindex(list_of_games_enjoyed)
  user_prof = games_enjoyed_df.mean()

  # Compare the profile against every other game
  tfidf_subset_df = tfidf_df.drop([title], axis=0)
  similarity_array = cosine_similarity(user_prof.values.reshape(1, -1), tfidf_subset_df)
  similarity_df = pd.DataFrame(similarity_array.T, index=tfidf_subset_df.index, columns=["similarity_score"])

  # Sort the values from high to low by the values in the similarity_score
  sorted_similarity_df = similarity_df.sort_values(by="similarity_score", ascending=False)

  # Inspect the most similar to the user preferences
  print(sorted_similarity_df.head())

  # Counter for ranking
  rank = 1

  # Iterate over (title, score) pairs so each printed title keeps its own score,
  # even when earlier titles are skipped because they contain the keyword
  for game, score in sorted_similarity_df['similarity_score'].items():
    if rank > 5:
      break
    if keyword.lower() in game.lower():
      continue
    print("#" + str(rank) + ": " + game + ", " + str(round(score*100,2)) + "% " + "match")
    rank += 1


recommend_games('Mortal Kombat', 'Kombat')

Recommended because you played Mortal Kombat:

                         similarity_score
Game Title                               
Street Fighter X Tekken          0.566978
Gravity Rush                     0.506782
Escape Plan                      0.505798
Tearaway                         0.487806
Rayman Origins                   0.485630
#1: Street Fighter X Tekken, 56.7% match
#2: Gravity Rush, 50.68% match
#3: Escape Plan, 50.58% match
#4: Tearaway, 48.78% match
#5: Rayman Origins, 48.56% match


If you want to build your own recommendation engine, feel free to borrow this as a starting point. Also, if you want some help walking through the steps of building one, I can recommend these DataCamp courses.

* [Building Recommendation Engines in Python](https://datacamp.pxf.io/MXkjON)
* [Building Recommendation Engines with PySpark](https://datacamp.pxf.io/KeP9bz)
