# Recommender Systems

## Recommending Video Games using Content-Based Recommendation

### Student Info
- **Student name:** Ihsan Hepsen
- **Student ID:** 0145029-14
- **Student email:** ihsan.hepsen@student.kdg.be

## Objective
- The objective of this notebook is to build a content-based recommender system using Steam's video game dataset, which is available on kaggle.com. After building the recommender system, I will make recommendations for a couple of games.

### What are Recommender Systems?
- A recommender system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item.
- Recommender systems are utilized in a variety of areas, but are most commonly recognized as playlist generators for video and music services, product recommenders for online stores, or content recommenders for social media platforms.
- There are different approaches to building recommender systems, including collaborative filtering, content-based filtering, and hybrid methods (combination of collaborative filtering and content-based filtering).

### What does "collaborative filtering" mean?
- Collaborative filtering relies on the past behavior and similarities among users when making the recommendations.

### What does "content-based" mean?
- Content-based recommendation is based on the descriptive features of the content when making the recommendations. Here are some descriptive features:
    * Content descriptions.
    * Content overviews.
    * Content tags or labels.
- These are just a few examples that can be used for content-based recommendation systems.
- For example, a recommendation system for books might use the title, author, and description of each book as features, and use this information to recommend books to users based on their reading preferences.
- In this notebook I will use each game's popular tags and its short description snippet as descriptive features.

### Imports
- The cell below contains all the package and library imports needed for this notebook.

In [19]:
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

## Importing the dataset
- As mentioned earlier, I will be using the Steam game dataset to build my recommender system. I will apply a content-based approach because I expect that recommending games with similar descriptive features makes it more likely that the user will actually play the recommended games.
- I will extract the rating percentage from the `all_reviews` column and filter out the games rated lower than 55%.
    * **Please note:** This filtering is a personal choice. The recommender system works the same way if you skip this part. The main reason for the filter is to make sure that only reasonably well-rated video games are recommended to a user.

In [1]:
df = pd.read_csv('data/steam_games.csv')

##### Filtering games rated lower than 55%.

In [53]:
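# Capture the percentage in strings such as "- 92% of the 42,550 user reviews ...".
# A raw string keeps the backslash in \d from being treated as a string escape.
# Note: \d{1,2} reads a "100%" rating as "00", so such games get 0.0 and are filtered out below.
pattern = r'(\d{1,2})%'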

df['rating_percentage'] = None

for index, row in df.iterrows():
    rating_str = str(row['all_reviews'])
    rating_match = re.search(pattern, rating_str)
    if rating_match:
        rating_percentage = float(rating_match.group(1))
        df.at[index, 'rating_percentage'] = rating_percentage

# Filter the dataframe to only include rows where the rating is at least 55%
filtered = df[df['rating_percentage'] >= 55]
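The same extraction can also be written without an explicit loop. The following is a minimal sketch, not the method used in this notebook: it assumes pandas' `Series.str.extract`, uses a made-up three-row stand-in for the `all_reviews` column, and widens the pattern to `\d{1,3}` so that a 100% rating is captured as 100.

```python
import pandas as pd

# Toy stand-in for the `all_reviews` column (illustrative values, not taken from the dataset)
reviews = pd.Series([
    "Very Positive,(42,550),- 92% of the 42,550 user reviews for this game are positive.",
    "Positive,(12),- 100% of the 12 user reviews for this game are positive.",
    None,  # a game without any review text
])

# \d{1,3} also captures "100"; expand=False returns a Series instead of a DataFrame
rating = reviews.str.extract(r"(\d{1,3})%", expand=False).astype(float)

# Rows without a match become NaN and are dropped by the comparison
kept = reviews[rating >= 55]
print(rating.tolist())
print(len(kept))
```

Because the wider pattern keeps games rated 100%, applying it to the real dataset would change the number of rows that survive the filter.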

- Let's have a look at the data

In [55]:
filtered.head()

Unnamed: 0,url,types,name,desc_snippet,recent_reviews,all_reviews,release_date,developer,publisher,popular_tags,...,languages,achievements,genre,game_description,mature_content,minimum_requirements,recommended_requirements,original_price,discount_price,rating_percentage
0,https://store.steampowered.com/app/379720/DOOM/,app,DOOM,Now includes all three premium DLC packs (Unto...,"Very Positive,(554),- 89% of the 554 user revi...","Very Positive,(42,550),- 92% of the 42,550 use...","May 12, 2016",id Software,"Bethesda Softworks,Bethesda Softworks","FPS,Gore,Action,Demons,Shooter,First-Person,Gr...",...,"English,French,Italian,German,Spanish - Spain,...",54.0,Action,"About This Game Developed by id software, the...",,"Minimum:,OS:,Windows 7/8.1/10 (64-bit versions...","Recommended:,OS:,Windows 7/8.1/10 (64-bit vers...",$19.99,$14.99,92.0
2,https://store.steampowered.com/app/637090/BATT...,app,BATTLETECH,Take command of your own mercenary outfit of '...,"Mixed,(166),- 54% of the 166 user reviews in t...","Mostly Positive,(7,030),- 71% of the 7,030 use...","Apr 24, 2018",Harebrained Schemes,"Paradox Interactive,Paradox Interactive","Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...",...,"English,French,German,Russian",128.0,"Action,Adventure,Strategy",About This Game From original BATTLETECH/Mec...,,"Minimum:,Requires a 64-bit processor and opera...","Recommended:,Requires a 64-bit processor and o...",$39.99,,71.0
3,https://store.steampowered.com/app/221100/DayZ/,app,DayZ,The post-soviet country of Chernarus is struck...,"Mixed,(932),- 57% of the 932 user reviews in t...","Mixed,(167,115),- 61% of the 167,115 user revi...","Dec 13, 2018",Bohemia Interactive,"Bohemia Interactive,Bohemia Interactive","Survival,Zombies,Open World,Multiplayer,PvP,Ma...",...,"English,French,Italian,German,Spanish - Spain,...",,"Action,Adventure,Massively Multiplayer",About This Game The post-soviet country of Ch...,,"Minimum:,OS:,Windows 7/8.1 64-bit,Processor:,I...","Recommended:,OS:,Windows 10 64-bit,Processor:,...",$44.99,,61.0
4,https://store.steampowered.com/app/8500/EVE_On...,app,EVE Online,EVE Online is a community-driven spaceship MMO...,"Mixed,(287),- 54% of the 287 user reviews in t...","Mostly Positive,(11,481),- 74% of the 11,481 u...","May 6, 2003",CCP,"CCP,CCP","Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...",...,"English,German,Russian,French",,"Action,Free to Play,Massively Multiplayer,RPG,...",About This Game,,"Minimum:,OS:,Windows 7,Processor:,Intel Dual C...","Recommended:,OS:,Windows 10,Processor:,Intel i...",Free,,74.0
6,https://store.steampowered.com/app/601150/Devi...,app,Devil May Cry 5,"The ultimate Devil Hunter is back in style, in...","Very Positive,(408),- 87% of the 408 user revi...","Very Positive,(9,645),- 92% of the 9,645 user ...","Mar 7, 2019","CAPCOM Co., Ltd.","CAPCOM Co., Ltd.,CAPCOM Co., Ltd.","Action,Hack and Slash,Great Soundtrack,Demons,...",...,"English,French,Italian,German,Spanish - Spain,...",51.0,Action,About This Game The Devil you know returns in...,Mature Content Description The developers de...,"Minimum:,OS:,WINDOWS® 7, 8.1, 10 (64-BIT Requi...","Recommended:,OS:,WINDOWS® 7, 8.1, 10 (64-BIT R...",$59.99,$70.42,92.0


- A quick look at the dataset columns:

In [4]:
filtered.columns

Index(['url', 'types', 'name', 'desc_snippet', 'recent_reviews', 'all_reviews',
       'release_date', 'developer', 'publisher', 'popular_tags',
       'game_details', 'languages', 'achievements', 'genre',
       'game_description', 'mature_content', 'minimum_requirements',
       'recommended_requirements', 'original_price', 'discount_price',
       'rating_percentage'],
      dtype='object')

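- Checking the data frame size. Note that `size` returns the total number of cells (rows × columns), here 14101 × 21 = 296121; `shape` would give the row and column counts separately.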

In [3]:
filtered.size

296121

- Missing value count:

In [6]:
filtered.isna().sum()

url                             0
types                           0
name                            0
desc_snippet                 2132
recent_reviews              11550
all_reviews                     0
release_date                   39
developer                      54
publisher                     537
popular_tags                   13
game_details                  161
languages                       1
achievements                 6453
genre                          88
game_description               21
mature_content              12847
minimum_requirements         6226
recommended_requirements     6222
original_price                305
discount_price               8868
rating_percentage               0
dtype: int64

## Preprocessing

- Earlier in this notebook, I had a quick look at the NaN values. Some columns are unnecessary for my case.
- Here are the unnecessary columns:
    - `url`: Website link.
    - `types`: The only type is 'app'.
    - `developer`: Developer of the game.
    - `publisher`: Publisher of the game.
- I call these columns unnecessary because they do not provide useful information for content-based recommendation.
- I will go ahead and drop all these columns.
- I will not drop rows with missing values, as doing so would drastically reduce the number of video games I can use in my recommender system.

In [10]:
cols_to_drop = ['url', 'types', 'developer', 'publisher']
filtered = filtered.drop(cols_to_drop, axis=1)

## TF-IDF Model
- In this part I will build a TF-IDF model to use when calculating the similarities between games.
- What is TF-IDF?
    - It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is often used as a weighting factor in information retrieval and text mining.
    - It is a simple yet very effective measure of relevance.
    - It is the product of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.
    - TF-IDF can be used to calculate the weights of the terms in the documents of a dataset. These weights can then be used to determine the similarity between the documents and to make content-based recommendations.
- As stated earlier, I am going to use the `popular_tags` and `desc_snippet` (short game description) columns for my content-based recommendations. I chose these two columns because the tags cover a game's genres and themes, while the snippet summarizes what the game is about. Together they should be useful when calculating how similar video games are to each other.
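To make the TF and IDF definitions above concrete, here is a small worked sketch in plain Python. The three "documents" are made up for illustration, and the helper `tf_idf` is hypothetical. It follows the textbook formula described above; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Three tiny made-up "documents" (think: tags of three games)
docs = [
    "fps shooter demons fps",
    "strategy mechs tactics",
    "fps multiplayer strategy",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears at least once
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

score_fps = tf_idf("fps", tokenized[0])        # appears twice, but 2 of 3 documents contain it
score_demons = tf_idf("demons", tokenized[0])  # appears once, but only this document contains it
print(round(score_fps, 4), round(score_demons, 4))
```

Although "fps" occurs more often in the first document, "demons" receives the higher weight because it is rarer across the corpus.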

In [11]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=3)

filtered.popular_tags = filtered.popular_tags.fillna('')
filtered.desc_snippet = filtered.desc_snippet.fillna('')

# Concatenate the 'popular_tags' and 'desc_snippet' columns
game_data = filtered.popular_tags + ' ' + filtered.desc_snippet

# Create the tf-idf vector
tfidf_model = vectorizer.fit_transform(game_data)
print(f'Matrix contains {tfidf_model.shape[0]} games and {tfidf_model.shape[1]} words')

Matrix contains 14101 games and 8843 words


- Let's see what the TF-IDF matrix looks like for a subset of the vocabulary terms.

In [12]:
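# Note: vocabulary_ maps each term to its column index in the matrix (not to a count),
# so this keeps the terms whose column index is greater than 1000.
popular_terms = [term for term, count in vectorizer.vocabulary_.items() if count > 1000]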
columns = vectorizer.get_feature_names_out()
tfidf_model_df = pd.DataFrame.sparse.from_spmatrix(tfidf_model, columns=columns)
tfidf_model_df[popular_terms].head(10)

Unnamed: 0,fps,gore,demons,shooter,person,great,soundtrack,multiplayer,singleplayer,fast,...,shelters,conservation,fireflies,cavern,example,carriers,newspapers,ho,predict,reboot
0,0.098742,0.096923,0.15011,0.085884,0.079059,0.071825,0.072913,0.206851,0.055184,0.103062,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.08118,0.08241,0.077931,0.062372,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.097774,0.0,0.0,0.085042,0.0,0.0,0.0,0.136549,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12568,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.143897,0.22286,0.0,0.117374,0.106635,0.10825,0.102367,0.081928,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139139,0.05568,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048787,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064851,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.071775,0.072863,0.0,0.055146,0.10299,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.071745,0.0,0.0,0.125144,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Recommending New Video Games

- In this section I will build the recommender system using _cosine_ similarity to find similar games.
- I will follow the steps below:
    1. Calculate cosine scores for the video games.
    2. Use the cosine scores to find similar games.
    3. Write a function that takes a video game name and prints out the recommended games.
- After these steps we will have a working content-based recommender system.

### Cosine Similarity
- Cosine similarity measures how similar two non-zero vectors are by the cosine of the angle between them. In general it ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction). TF-IDF weights are never negative, so the scores in this notebook fall between 0 and 1. It is often used to compare documents or strings of text by representing the text as vectors.
- I am going to calculate the cosine similarity between different video games using sklearn's `linear_kernel` function, applied to the TF-IDF matrix.

##### About `linear_kernel` function
- The `linear_kernel` function computes the dot product between every pair of rows of its two inputs.
- The dot product on its own is not the cosine similarity: cosine similarity is the dot product divided by the product of the two vectors' norms.
- `linear_kernel` does not perform that division. It still yields cosine similarities here because `TfidfVectorizer` L2-normalizes every row by default (`norm='l2'`), so each vector already has unit length and the dot product equals the cosine of the angle between the vectors.
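The relationship between the dot product and cosine similarity can be checked on a small example. This is a sketch on a made-up three-document corpus, assuming scikit-learn's default `TfidfVectorizer` settings; `sk_cosine` is just a local alias for `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine
from sklearn.metrics.pairwise import linear_kernel

docs = ["fps shooter demons", "fps shooter multiplayer", "puzzle casual relaxing"]
tfidf = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

# Length of every row vector; with the default settings each should be 1
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

dot = linear_kernel(tfidf, tfidf)  # plain dot products
cos = sk_cosine(tfidf, tfidf)      # dot products divided by the norms
same = np.allclose(dot, cos)
print(row_norms, same)
```

On unit-length rows the two functions agree, and `linear_kernel` skips the normalization step. If the vectorizer were created with `norm=None`, the two results would generally differ.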

In [22]:
cosine_similarity = linear_kernel(tfidf_model, tfidf_model)
cosine_similarity

array([[1.        , 0.06198649, 0.08933478, ..., 0.01940889, 0.01074578,
        0.        ],
       [0.06198649, 1.        , 0.01564544, ..., 0.11333223, 0.00779801,
        0.02948727],
       [0.08933478, 0.01564544, 1.        , ..., 0.03110616, 0.03023731,
        0.0016292 ],
       ...,
       [0.01940889, 0.11333223, 0.03110616, ..., 1.        , 0.00254007,
        0.0297193 ],
       [0.01074578, 0.00779801, 0.03023731, ..., 0.00254007, 1.        ,
        0.00239216],
       [0.        , 0.02948727, 0.0016292 , ..., 0.0297193 , 0.00239216,
        1.        ]])

- Next, I create an `indices` lookup that maps each game's name to its index in `filtered`.


In [57]:
indices = pd.Series(filtered.index, index=filtered.name).drop_duplicates()
indices['TERA']

12

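- The lookup returns 12 for "TERA". Note that this is the row label inherited from the original `df`, not the row position within `filtered`: filtering removed rows (labels 1 and 5 are already missing from the preview above) without resetting the index, whereas the rows of the cosine similarity matrix are addressed by position. Calling `filtered.reset_index(drop=True)` before building `indices` would make labels and positions agree.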
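A name-to-position lookup can also be built without relying on the data frame's index labels. The sketch below is an illustration on a made-up frame whose index has gaps, like a frame after boolean filtering; the names `games` and `positions` are hypothetical.

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after boolean filtering
games = pd.DataFrame(
    {"name": ["DOOM", "BATTLETECH", "DayZ", "DOOM"]},
    index=[0, 2, 3, 6],
)

# Map each name to its row *position* (0..n-1), which is what a NumPy matrix expects
positions = pd.Series(range(len(games)), index=games["name"])

# Keep the first occurrence of each name.
# Series.drop_duplicates() compares the values, not the names in the index.
positions = positions[~positions.index.duplicated(keep="first")]

print(positions["DayZ"])  # position 2, although its original label is 3
```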

### Building the Recommender Function
- Building the `get_recommendation` function. This function accepts 4 parameters:
    1. `name`: Name of the video game to find recommendations for.
    2. `top_n`: Number of recommended games to return.
    3. `cosine_sim`: Cosine similarity matrix for the games.
    4. `show_genres`: Boolean; when `True`, each recommended game's `popular_tags` are shown alongside its name.

In [50]:
def get_recommendation(name, top_n=10, cosine_sim=cosine_similarity, show_genres=False) -> None:
    idx = indices[name]
    cosine_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x : x[1], reverse=True)
    cosine_scores = cosine_scores[1:top_n+1]  # skipping the first as it's the same game as the passed parameter.
    game_indexes = [i[0] for i in cosine_scores]
    print(f"If you liked \"{name}\" you may also like:")
    if show_genres:
        display(filtered[['name', 'popular_tags']].iloc[game_indexes])
    else:
        display(filtered.name.iloc[game_indexes])
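The ranking step can be sketched separately on a small, made-up similarity matrix. The helper `top_n_similar` is hypothetical and not part of this notebook. It removes the query row explicitly instead of slicing off the first sorted entry, which also works when another row ties with the query at a score of 1.0.

```python
import numpy as np

def top_n_similar(sim_matrix, position, top_n=3):
    """Return the row positions of the top_n most similar items, excluding the query row."""
    scores = sim_matrix[position]
    order = np.argsort(scores)[::-1]   # highest score first
    order = order[order != position]   # drop the query row explicitly
    return order[:top_n].tolist()

# Hypothetical 4x4 similarity matrix (symmetric, ones on the diagonal)
sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

result = top_n_similar(sim, position=0, top_n=2)
print(result)
```

The returned values are row positions, so they would be used with `.iloc` on a data frame whose row order matches the matrix.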

In [51]:
get_recommendation('Life is Strange 2', top_n=5, show_genres=True)

If you liked "Life is Strange 2" you may also like:


Unnamed: 0,name,popular_tags
799,The Jackbox Party Pack,"Casual,Local Multiplayer,Funny,Multiplayer,Com..."
548,Quiplash,"Casual,Multiplayer,Comedy,Co-op,Indie,Strategy..."
456,Drawful 2,"Casual,Indie,Local Multiplayer,Funny,Strategy,..."
263,The Jackbox Party Pack 3,"Local Multiplayer,Funny,Casual,Multiplayer,Boa..."
3990,Ruckus Ridge VR Party,"Action,Indie,VR,4 Player Local,Local Co-Op,Loc..."


### Trying the Recommendation Function

#### Finding Similar Games to "Life is Strange 2"
- Life is Strange 2, categories: Story Rich, Adventure, Single player, Third Person, Simulator...

In [58]:
get_recommendation('Life is Strange 2', show_genres=True)

If you liked "Life is Strange 2" you may also like:


Unnamed: 0,name,popular_tags
799,The Jackbox Party Pack,"Casual,Local Multiplayer,Funny,Multiplayer,Com..."
548,Quiplash,"Casual,Multiplayer,Comedy,Co-op,Indie,Strategy..."
456,Drawful 2,"Casual,Indie,Local Multiplayer,Funny,Strategy,..."
263,The Jackbox Party Pack 3,"Local Multiplayer,Funny,Casual,Multiplayer,Boa..."
3990,Ruckus Ridge VR Party,"Action,Indie,VR,4 Player Local,Local Co-Op,Loc..."
176,Pummel Party,"Multiplayer,Funny,Board Game,Casual,4 Player L..."
876,Among Us,"Casual,Multiplayer,Local Multiplayer,Space,Onl..."
5377,Muddledash,"Indie,Racing,Casual,Platformer,Local Multiplay..."
143,The Jackbox Party Pack 4,"Casual,Local Multiplayer,Indie,Funny,Multiplay..."
27379,Party Saboteurs: After Party,"Strategy,Indie,Local Multiplayer,Pixel Graphic..."


#### Finding Similar Games to "Yakuza 0"
- Yakuza 0, categories: Story Rich, Action, Beat 'em up, Great Soundtrack, Open World...

In [59]:
get_recommendation('Yakuza 0', top_n=5)

If you liked "Yakuza 0" you may also like:


6441                Outlast: Whistleblower DLC
1626     Resident Evil / biohazard HD REMASTER
3114             Remothered: Tormented Fathers
8457      Dead by Daylight - The 80's Suitcase
14450    Alien: Isolation - Corporate Lockdown
Name: name, dtype: object

## Conclusion
- In this notebook we built a content-based recommender system for video games. It ranks games by the cosine similarity between their TF-IDF vectors.
- We covered the following:
    * what recommender systems are
    * what collaborative and content-based recommendation are
    * what TF-IDF is and how it is used in recommender systems
    * what cosine similarity is and how it is used in recommender systems
    * how to recommend video games based on a video game name