In [1]:
from src.clean import clean_data
from src.nlp import apply_preprocess_text
from src.recsys import GameRecommender
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

In [2]:
# Run if first time running notebook and nltk
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')

## The Dataset
This dataset is from [Kaggle](https://www.kaggle.com/datasets/deepcontractor/top-video-games-19952021-metacritic?resource=download) and contains reviews from Metacritic.com for video games released from 1995 to 2021. Each game has a `meta_score`, the average of its critic reviews, and a `user_review`, the average of its user ratings.

In [3]:
df = pd.read_csv('data/all_games.csv')
df.head(10)

Unnamed: 0,name,platform,release_date,summary,meta_score,user_review
0,The Legend of Zelda: Ocarina of Time,Nintendo 64,"November 23, 1998","As a young boy, Link is tricked by Ganondorf, ...",99,9.1
1,Tony Hawk's Pro Skater 2,PlayStation,"September 20, 2000",As most major publishers' development efforts ...,98,7.4
2,Grand Theft Auto IV,PlayStation 3,"April 29, 2008",[Metacritic's 2008 PS3 Game of the Year; Also ...,98,7.7
3,SoulCalibur,Dreamcast,"September 8, 1999","This is a tale of souls and swords, transcendi...",98,8.4
4,Grand Theft Auto IV,Xbox 360,"April 29, 2008",[Metacritic's 2008 Xbox 360 Game of the Year; ...,98,7.9
5,Super Mario Galaxy,Wii,"November 12, 2007",[Metacritic's 2007 Wii Game of the Year] The u...,97,9.1
6,Super Mario Galaxy 2,Wii,"May 23, 2010","Super Mario Galaxy 2, the sequel to the galaxy...",97,9.1
7,Red Dead Redemption 2,Xbox One,"October 26, 2018",Developed by the creators of Grand Theft Auto ...,97,8.0
8,Grand Theft Auto V,Xbox One,"November 18, 2014",Grand Theft Auto 5 melds storytelling and game...,97,7.9
9,Grand Theft Auto V,PlayStation 3,"September 17, 2013","Los Santos is a vast, sun-soaked metropolis fu...",97,8.3


## Data Cleaning

In this section we take a few basic steps to clean the data and derive some features that may help in building a good recommender system. The `clean_data` function does the following:
- Cleans up whitespace
- Converts `release_date` to a datetime column and creates the `year` and `decade` columns
- Creates the `platform_type` column which generalizes video game console platforms to a particular brand (e.g. Nintendo Switch = Nintendo)
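
The source of `clean_data` is not shown in this notebook, so the following is only a rough sketch of the three steps listed above. The function name `clean_data_sketch`, the `PLATFORM_BRANDS` mapping, and the `"Other"` fallback are assumptions made for illustration, not the contents of `src/clean.py`.

```python
import pandas as pd

# Hypothetical platform-to-brand mapping; the real clean_data may cover more platforms.
PLATFORM_BRANDS = {
    "Nintendo 64": "Nintendo",
    "Wii": "Nintendo",
    "PlayStation": "PlayStation",
    "PlayStation 3": "PlayStation",
    "Xbox 360": "Xbox",
    "Xbox One": "Xbox",
    "Dreamcast": "Sega",
}


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Strip leading and trailing whitespace from the text columns
    for col in ("name", "platform", "summary"):
        out[col] = out[col].str.strip()
    # Parse dates written like "November 23, 1998"
    out["release_date"] = pd.to_datetime(out["release_date"], format="%B %d, %Y")
    out["year"] = out["release_date"].dt.year
    # Round the year down to the start of its decade (1998 becomes 1990)
    out["decade"] = out["year"] // 10 * 10
    # Generalize each platform to a brand; unmapped platforms fall back to "Other" (assumption)
    out["platform_type"] = out["platform"].map(PLATFORM_BRANDS).fillna("Other")
    return out


raw = pd.DataFrame(
    {
        "name": [" SoulCalibur "],
        "platform": [" Dreamcast"],
        "release_date": ["September 8, 1999"],
        "summary": ["This is a tale of souls and swords "],
    }
)
cleaned = clean_data_sketch(raw)
print(cleaned[["platform", "year", "decade", "platform_type"]])
```

Working on a copy keeps the raw frame intact, which is convenient when re-running cells out of order.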

In [4]:
df = clean_data(df)
df.head()

Unnamed: 0,name,platform,release_date,summary,meta_score,user_review,year,decade,platform_type
0,The Legend of Zelda: Ocarina of Time,Nintendo 64,1998-11-23,"As a young boy, Link is tricked by Ganondorf, ...",99,9.1,1998,1990,Nintendo
1,Tony Hawk's Pro Skater 2,PlayStation,2000-09-20,As most major publishers' development efforts ...,98,7.4,2000,2000,PlayStation
2,Grand Theft Auto IV,PlayStation 3,2008-04-29,[Metacritic's 2008 PS3 Game of the Year; Also ...,98,7.7,2008,2000,PlayStation
3,SoulCalibur,Dreamcast,1999-09-08,"This is a tale of souls and swords, transcendi...",98,8.4,1999,1990,Sega
4,Grand Theft Auto IV,Xbox 360,2008-04-29,[Metacritic's 2008 Xbox 360 Game of the Year; ...,98,7.9,2008,2000,Xbox


### NLP tasks

Since the dataset contains only averaged critic and user scores rather than individual user reviews, we have to rely on item-to-item comparisons based on the written summary of each video game. That requires some natural language processing. I use the function `apply_preprocess_text`, which performs tokenization, stopword removal, and stemming on each summary.
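
As a rough illustration of those three steps on a single string, here is a minimal sketch. It is not the code in `src/nlp.py`: the function name `preprocess_text_sketch`, the regex tokenizer, the tiny inline stopword set, and the choice of NLTK's `PorterStemmer` are all assumptions. The notebook itself downloads NLTK's `punkt` and `stopwords` resources, which suggests the real function uses NLTK's own tokenizer and stopword list.

```python
import re

from nltk.stem import PorterStemmer

# Tiny stand-in stopword list (assumption); NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "is", "of", "the", "to"}
_stemmer = PorterStemmer()


def preprocess_text_sketch(text: str) -> str:
    # Tokenize: lowercase, then split into words and standalone punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Drop stopwords
    kept = [t for t in tokens if t not in STOPWORDS]
    # Reduce each remaining token to its stem and rejoin into one string
    return " ".join(_stemmer.stem(t) for t in kept)


example = preprocess_text_sketch("The boys are running")
print(example)
```

Returning a single space-joined string keeps the column compatible with `CountVectorizer`, which expects raw text rather than token lists.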

In [5]:
df = apply_preprocess_text(df, "summary")
df.head(10)


Unnamed: 0,name,platform,release_date,summary,meta_score,user_review,year,decade,platform_type
0,The Legend of Zelda: Ocarina of Time,Nintendo 64,1998-11-23,"young boy , link trick ganondorf , king gerudo...",99,9.1,1998,1990,Nintendo
1,Tony Hawk's Pro Skater 2,PlayStation,2000-09-20,major publish ' develop effort shift number ne...,98,7.4,2000,2000,PlayStation
2,Grand Theft Auto IV,PlayStation 3,2008-04-29,[ metacrit 's 2008 ps3 game year ; also known ...,98,7.7,2008,2000,PlayStation
3,SoulCalibur,Dreamcast,1999-09-08,"tale soul sword , transcend world histori , to...",98,8.4,1999,1990,Sega
4,Grand Theft Auto IV,Xbox 360,2008-04-29,[ metacrit 's 2008 xbox 360 game year ; also k...,98,7.9,2008,2000,Xbox
5,Super Mario Galaxy,Wii,2007-11-12,[ metacrit 's 2007 wii game year ] ultim ninte...,97,9.1,2007,2000,Nintendo
6,Super Mario Galaxy 2,Wii,2010-05-23,"super mario galaxi 2 , sequel galaxy-hop origi...",97,9.1,2010,2010,Nintendo
7,Red Dead Redemption 2,Xbox One,2018-10-26,develop creator grand theft auto v red dead re...,97,8.0,2018,2010,Xbox
8,Grand Theft Auto V,Xbox One,2014-11-18,grand theft auto 5 meld storytel gameplay uniq...,97,7.9,2014,2010,Xbox
9,Grand Theft Auto V,PlayStation 3,2013-09-17,"lo santo vast , sun-soak metropoli full self-h...",97,8.3,2013,2010,PlayStation


### Building a Basic Recommender

The code below shows a very basic content-based recommender that relies solely on the summary of each video game. The game we will base our recommendations on is **The Legend of Zelda: Ocarina of Time**.

In [6]:

# Create a bag-of-words representation
summary_tokens = df['summary']
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(summary_tokens)

# Compute the cosine similarity between each pair of items in the DTM
cosine_sim = cosine_similarity(dtm)

# Get the index of the item you want to recommend similar items for (Legend of Zelda Game)
item_index = 0 

# Get the cosine similarities for the item with the specified index
similarity_scores = list(enumerate(cosine_sim[item_index]))

# Sort the similarity scores in descending order
sorted_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

# Get the top 10 most similar items, skipping position 0 (the query item itself)
top_items = [i[0] for i in sorted_scores[1:11]]

# Free up memory from notebook
del dtm
del cosine_sim

# Show these columns
cols = ["name", "meta_score", "user_review", "platform"]
rec_df = df.iloc[top_items][cols]

The results below make it clear that this is a purely content-based recommender. Deus Ex: Human Revolution - The Missing Link appears three times, once per platform, because the word "Link" also happens to be the name of the protagonist in The Legend of Zelda.
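
To see why a single shared token is enough to link two otherwise unrelated games, the sketch below spells out bag-of-words cosine similarity in plain Python. The three summaries are made-up toy strings, not rows from the dataset, and the helper `cosine` is written here only for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    # Dot product over the tokens the two bags share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy, already-preprocessed summaries (hypothetical)
docs = {
    "zelda": "young boy link trick ganondorf king",
    "missing_link": "miss link adam jensen conspiracy",
    "racing": "fast car race track championship",
}
bags = {name: Counter(text.split()) for name, text in docs.items()}

sim_link = cosine(bags["zelda"], bags["missing_link"])
sim_racing = cosine(bags["zelda"], bags["racing"])
print(sim_link, sim_racing)
```

The only overlap between the first two summaries is the token `link`, yet that alone gives a non-zero score, while the racing summary shares nothing and scores zero.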

In [7]:
rec_df

Unnamed: 0,name,meta_score,user_review,platform
617,The Legend of Zelda: The Minish Cap,89,8.9,Game Boy Advance
36,The Legend of Zelda: Twilight Princess,96,8.9,GameCube
56,The Legend of Zelda: Twilight Princess,95,9.0,Wii
1121,The Legend of Zelda: Spirit Tracks,87,7.9,DS
8939,The Legend of Zelda: Tri Force Heroes,73,7.4,3DS
40,The Legend of Zelda: The Wind Waker,96,9.0,GameCube
53,The Legend of Zelda: A Link to the Past,95,9.0,Game Boy Advance
5498,Deus Ex: Human Revolution - The Missing Link,78,7.3,PlayStation 3
6963,Deus Ex: Human Revolution - The Missing Link,76,7.0,Xbox 360
8845,Deus Ex: Human Revolution - The Missing Link,73,7.3,PC


### An Intermediate Recommender

In content-based recommendation systems, the most similar items are often recommended to the user. However, this can lead to a lack of diversity in recommendations, as similar items tend to have similar features. To overcome this limitation, the `GameRecommender` class includes a method to incorporate diversity into the recommendation process.

The class calculates the diversity by computing the dissimilarity between each recommended item and all the previously recommended items. This dissimilarity matrix is then subtracted from 1 to obtain a similarity matrix that represents the diversity of each item with respect to the previously recommended items. This diversity matrix is then multiplied by the similarity matrix that represents the similarity of each item to the target item.
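
The exact arithmetic inside `GameRecommender` is not shown in this notebook, so the following is one possible reading of the idea rather than the class's code. It is a greedy re-ranking in which each candidate's similarity to the target is multiplied by a diversity term, taken here as 1 minus its highest similarity to any item already picked. The function name `diverse_top_k` and the toy matrix are assumptions, and similarities are assumed to lie in [0, 1].

```python
import numpy as np


def diverse_top_k(sim: np.ndarray, target: int, k: int = 10) -> list[int]:
    """Greedily pick k items, trading similarity to the target against redundancy."""
    candidates = [i for i in range(sim.shape[0]) if i != target]
    selected: list[int] = []
    while candidates and len(selected) < k:
        best, best_score = candidates[0], -1.0
        for i in candidates:
            # Diversity is 1.0 while nothing has been picked, and shrinks
            # as the candidate gets closer to an already selected item
            diversity = 1.0 - max((sim[i, j] for j in selected), default=0.0)
            score = sim[target, i] * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy similarity matrix: items 1 and 2 are near-duplicates of each other
sim = np.array(
    [
        [1.00, 0.90, 0.80, 0.50],
        [0.90, 1.00, 0.95, 0.20],
        [0.80, 0.95, 1.00, 0.30],
        [0.50, 0.20, 0.30, 1.00],
    ]
)
picks = diverse_top_k(sim, target=0, k=2)
print(picks)
```

A plain similarity ranking would return items 1 and 2 here. With the diversity term, item 2 is penalized for being nearly identical to item 1, so item 3 takes the second slot.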

Additionally, `GameRecommender` can incorporate rating scores into the recommendation process. Two options are available: `meta_score` and `user_review`. With `meta_score`, the Metacritic critic score is used to weight the similarity and diversity scores of each item, while `user_review` uses the user ratings instead. This lets the user tailor the recommendations to the kind of rating they trust more.
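
A minimal sketch of such a weighting, assuming the ratings are min-max scaled to [0, 1] and multiplied into the existing scores. The notebook imports `MinMaxScaler`, which hints at a scaling step but does not confirm it; the function name `rating_weighted` and the sample numbers are hypothetical.

```python
import numpy as np


def rating_weighted(scores: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    ratings = ratings.astype(float)
    lo, hi = ratings.min(), ratings.max()
    # Min-max scale to [0, 1]; if every rating is equal, leave the scores unchanged
    scaled = (ratings - lo) / (hi - lo) if hi > lo else np.ones_like(ratings)
    return scores * scaled


# Hypothetical combined similarity/diversity scores and critic scores for three games
scores = np.array([0.40, 0.38, 0.35])
meta_score = np.array([73, 96, 87])

weighted = rating_weighted(scores, meta_score)
ranking = [int(i) for i in np.argsort(-weighted)]
print(ranking)
```

Note that plain min-max scaling pushes the lowest-rated item to a weight of zero, which removes it from contention entirely. A softer scheme, such as scaling into a range like [0.5, 1], would keep it in play.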

Under the hood, `GameRecommender` uses cosine similarity, Euclidean distance, or Manhattan distance to compare the items in the dataset. It then combines similarity, diversity, and rating scores (if specified) to generate a list of recommended items for a given target item. By default, the `predict` method returns a list of 10 recommended items.


In [8]:
recsys = GameRecommender(df, metric = 'cosine')
recsys.fit()

Below is an example that uses dissimilarity to add some diversity; passing `None` as the second argument leaves out rating scores. The list now includes a couple of non-Zelda titles, such as LostWinds and D4: Dark Dreams Don't Die, although most recommendations are still Legend of Zelda games.

In [9]:
recs = recsys.predict(0, None)
cols = ["name", "meta_score", "user_review", "platform"]
recs = df.iloc[recs][cols]
recs

Unnamed: 0,name,meta_score,user_review,platform
617,The Legend of Zelda: The Minish Cap,89,8.9,Game Boy Advance
56,The Legend of Zelda: Twilight Princess,95,9.0,Wii
1121,The Legend of Zelda: Spirit Tracks,87,7.9,DS
36,The Legend of Zelda: Twilight Princess,96,8.9,GameCube
3778,LostWinds,81,3.3,Wii
5498,Deus Ex: Human Revolution - The Missing Link,78,7.3,PlayStation 3
8939,The Legend of Zelda: Tri Force Heroes,73,7.4,3DS
40,The Legend of Zelda: The Wind Waker,96,9.0,GameCube
12775,D4: Dark Dreams Don't Die,67,7.1,PC
53,The Legend of Zelda: A Link to the Past,95,9.0,Game Boy Advance


These results incorporate diversity as well as the `meta_score`:

In [10]:
recs = recsys.predict(0, "meta_score")
recs = df.iloc[recs][cols]
recs

Unnamed: 0,name,meta_score,user_review,platform
36,The Legend of Zelda: Twilight Princess,96,8.9,GameCube
56,The Legend of Zelda: Twilight Princess,95,9.0,Wii
40,The Legend of Zelda: The Wind Waker,96,9.0,GameCube
617,The Legend of Zelda: The Minish Cap,89,8.9,Game Boy Advance
54,The Legend of Zelda: Majora's Mask,95,9.1,Nintendo 64
1121,The Legend of Zelda: Spirit Tracks,87,7.9,DS
53,The Legend of Zelda: A Link to the Past,95,9.0,Game Boy Advance
242,Prince of Persia: The Sands of Time,92,8.1,GameCube
3778,LostWinds,81,3.3,Wii
434,The Legend of Zelda: Phantom Hourglass,90,8.0,DS


These results incorporate diversity as well as the `user_review`:

In [11]:
recs = recsys.predict(0, "user_review")
recs = df.iloc[recs][cols]
recs

Unnamed: 0,name,meta_score,user_review,platform
617,The Legend of Zelda: The Minish Cap,89,8.9,Game Boy Advance
56,The Legend of Zelda: Twilight Princess,95,9.0,Wii
40,The Legend of Zelda: The Wind Waker,96,9.0,GameCube
36,The Legend of Zelda: Twilight Princess,96,8.9,GameCube
54,The Legend of Zelda: Majora's Mask,95,9.1,Nintendo 64
1121,The Legend of Zelda: Spirit Tracks,87,7.9,DS
53,The Legend of Zelda: A Link to the Past,95,9.0,Game Boy Advance
502,TimeSplitters 2,90,8.7,PlayStation 2
8939,The Legend of Zelda: Tri Force Heroes,73,7.4,3DS
5498,Deus Ex: Human Revolution - The Missing Link,78,7.3,PlayStation 3


### Summary

The current implementation of the `GameRecommender` class provides a solid foundation for generating recommendations based on similarity and diversity, with optional weighting by rating score. Further improvements could come from considering additional factors such as platform and release decade, and from applying explicit weights to the similarity, dissimilarity, and rating components. With these additions, `GameRecommender` could provide more personalized and accurate recommendations that reflect an individual user's preferences and tastes.