## Content Filtering

- In this part, we build a simple NLP-based content-filtering model that uses cosine similarity to interpret the game titles bought by different users. The final pre-processed data is then written to a dataframe file and deployed in a Streamlit app

- We will use two well-known sklearn vectorizers, CountVectorizer and TfidfVectorizer, to turn the word features of each game title into vectors, which are later compared using sklearn's cosine_similarity function
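As a rough illustration of the similarity measure used throughout this part, the sketch below computes cosine similarity between small word-count vectors in plain Python. The vectors and the `cosine` helper are made up for the example; the notebook itself relies on sklearn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical word counts over the vocabulary ["action", "puzzle", "space"]
game_a = [2, 0, 1]
game_b = [1, 0, 1]
game_c = [0, 3, 0]

sim_ab = cosine(game_a, game_b)  # shares "action" and "space", so the score is high
sim_ac = cosine(game_a, game_c)  # no words in common, so the score is 0.0
```

Because the score depends only on the angle between the vectors, a long description and a short one can still score highly if they use the same words in similar proportions.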

In [2]:
pip install rake-nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.6


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_curve, auc
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity
import nltk
from rake_nltk import Rake
#from transformers import pipeline

In [4]:
games = pd.read_csv('/content/drive/MyDrive/Steam games dataset/games.csv')
meta = pd.read_json('/content/drive/MyDrive/Steam games dataset/games_metadata.json', lines=True)
app_list = pd.read_parquet('/content/drive/MyDrive/Steam games dataset/app_details.pq')
recco = pd.read_csv('/content/drive/MyDrive/Steam games dataset/recommendations.csv')

In [5]:
#merge the original games dataframe and meta dataframe, then merge the developer, publisher details from the app_list dataframe
games = games.merge(meta, on='app_id', how='left')
games = games.merge(app_list, on='app_id', how='left')

In [6]:
games.isnull().sum()

app_id                0
title                 0
date_release          0
win                   0
mac                   0
linux                 0
rating                0
positive_ratio        0
user_reviews          0
price_final           0
price_original        0
discount              0
steam_deck            0
description           0
tags                  0
developer         12201
publisher         12164
dtype: int64

In [8]:
games = games.fillna('')

In [9]:
games.isnull().sum()

app_id            0
title             0
date_release      0
win               0
mac               0
linux             0
rating            0
positive_ratio    0
user_reviews      0
price_final       0
price_original    0
discount          0
steam_deck        0
description       0
tags              0
developer         0
publisher         0
dtype: int64

In [10]:
games.reset_index(drop=True, inplace=True)

In [12]:
games = games.sort_values(by=['user_reviews', 'positive_ratio'], ascending=[False, False])
recco = recco.merge(games, on='app_id', how='left')

In [14]:
len(games), len(recco)

(46068, 10072270)

**Building the Game Catalogue: Cosine Similarity**

We will use RAKE (**Rapid Automatic Keyword Extraction**), a keyword-extraction algorithm, through the rake-nltk library. It splits text into candidate phrases at stopwords and punctuation, using NLTK's stopword list, and scores the words in those phrases to pick out the essential keywords: https://pypi.org/project/rake-nltk/
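To give a feel for how this works, here is a simplified plain-Python sketch of the two ideas behind RAKE: splitting text into candidate phrases at stopwords and punctuation, then computing a degree for each word. It is not the rake-nltk implementation; the tiny stopword list, the helper names and the sample sentence are all assumptions made for the illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); rake-nltk uses NLTK's full list.
STOPWORDS = {"a", "an", "and", "from", "in", "of", "the", "to"}

def candidate_phrases(text):
    """Split text into word lists, breaking at every stopword or punctuation mark."""
    phrases, current = [], []
    for token in re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower()):
        if token in STOPWORDS or not token.isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases

def word_degrees(phrases):
    """Degree of a word = summed length of the phrases it appears in."""
    degrees = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degrees[word] += len(phrase)
    return dict(degrees)

text = "Survive the dark forest and escape from the haunted forest castle."
phrases = candidate_phrases(text)
degrees = word_degrees(phrases)
keywords = list(degrees.keys())
# phrases  -> [['survive'], ['dark', 'forest'], ['escape'], ['haunted', 'forest', 'castle']]
# keywords -> ['survive', 'dark', 'forest', 'escape', 'haunted', 'castle']
```

Words that appear in longer or more numerous phrases, such as "forest" here, end up with a higher degree. The notebook keeps only the dictionary keys, so every non-stopword becomes a keyword regardless of its score.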

In [15]:
#Use rake and nltk library to extract key words from description
nltk.download('stopwords')
nltk.download('punkt')

r = Rake()

def extract_keywords(text):
    # Rake holds the results of the most recent text only, so extract and read them back per description
    r.extract_keywords_from_text(str(text))
    return list(r.get_word_degrees().keys())

# apply() builds the whole column at once and avoids chained assignment (SettingWithCopyWarning)
games['Key_words'] = games['description'].apply(extract_keywords)
games.tail()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck,description,tags,developer,publisher,Key_words
36317,501060,C.S.S. CITADEL VR,2016-07-15,True,False,False,Negative,0,10,7.99,7.99,0.0,True,Are you ready to survive and escape from the C...,"[Action, Adventure, VR]",Winged Minds,Winged Minds,"[ready, survive, escape, c, citadel]"
36641,1686500,Discord Bot Workshop [EARLY ACCESS],2021-09-22,True,False,False,Negative,0,10,7.99,7.99,0.0,True,"Make Discord bots faster, smarter, easier with...","[Design & Illustration, Game Development, Util...",Discord Bot Workshop,Discord Bot Workshop,"[make, discord, bots, faster, smarter, easier,..."
39528,1374740,Isolated Life,2020-08-26,True,False,False,Negative,0,10,9.99,9.99,0.0,True,"In this world of the future, you've fallen int...","[Simulation, Adventure, Building, Open World S...",,,"[world, future, fallen, island, middle, ocean,..."
39899,1589890,便利商店‪6,2021-06-14,True,True,False,Negative,0,10,12.99,12.99,0.0,True,打造最輕鬆易玩的經營遊戲 ! 建設最有特色的【 便利商店 】! 玩家可以輕輕鬆鬆來經營一家或...,"[Exploration, Simulation, City Builder, Automa...",West Dos (HK) LTD.,West Dos (HK) LTD.,"[打造最輕鬆易玩的經營遊戲, 建設最有特色的, 【, 便利商店, 】!, 玩家可以輕輕鬆鬆來..."
44524,1007970,This Side (Early Access Game),2019-05-15,True,False,False,Negative,0,10,9.99,9.99,0.0,True,Fight your natural fears and help your friend ...,"[Indie, Adventure, Horror, Co-op, Early Access...","Rising, Crystal Realms Entertainment","Rising, Crystal Realms Entertainment","[fight, natural, fears, help, friend, find, wa..."


- Combine the developer, publisher, tags and description keywords into a single column

In [16]:
games['developer_str'] = games['developer'].apply(lambda x: x.split(','))
games['publisher_str'] = games['publisher'].apply(lambda x: x.split(','))

In [17]:
games['tags_str'] = games['tags'].apply(lambda x: [tag.lower() for tag in x])
games['developer_str'] = games['developer_str'].apply(lambda x: [tag.lower() for tag in x])
games['publisher_str'] = games['publisher_str'].apply(lambda x: [tag.lower() for tag in x])

In [18]:
games['tags_description'] = games['tags_str'].str.join(' ') + ' ' + games['Key_words'].str.join(' ') + ' ' + games['developer_str'].str.join(' ') + ' ' + games['publisher_str'].str.join(' ')
games.tail()

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,...,steam_deck,description,tags,developer,publisher,Key_words,developer_str,publisher_str,tags_str,tags_description
36317,501060,C.S.S. CITADEL VR,2016-07-15,True,False,False,Negative,0,10,7.99,...,True,Are you ready to survive and escape from the C...,"[Action, Adventure, VR]",Winged Minds,Winged Minds,"[ready, survive, escape, c, citadel]",[winged minds],[winged minds],"[action, adventure, vr]",action adventure vr ready survive escape c cit...
36641,1686500,Discord Bot Workshop [EARLY ACCESS],2021-09-22,True,False,False,Negative,0,10,7.99,...,True,"Make Discord bots faster, smarter, easier with...","[Design & Illustration, Game Development, Util...",Discord Bot Workshop,Discord Bot Workshop,"[make, discord, bots, faster, smarter, easier,...",[discord bot workshop],[discord bot workshop],"[design & illustration, game development, util...",design & illustration game development utiliti...
39528,1374740,Isolated Life,2020-08-26,True,False,False,Negative,0,10,9.99,...,True,"In this world of the future, you've fallen int...","[Simulation, Adventure, Building, Open World S...",,,"[world, future, fallen, island, middle, ocean,...",[],[],"[simulation, adventure, building, open world s...",simulation adventure building open world survi...
39899,1589890,便利商店‪6,2021-06-14,True,True,False,Negative,0,10,12.99,...,True,打造最輕鬆易玩的經營遊戲 ! 建設最有特色的【 便利商店 】! 玩家可以輕輕鬆鬆來經營一家或...,"[Exploration, Simulation, City Builder, Automa...",West Dos (HK) LTD.,West Dos (HK) LTD.,"[打造最輕鬆易玩的經營遊戲, 建設最有特色的, 【, 便利商店, 】!, 玩家可以輕輕鬆鬆來...",[west dos (hk) ltd.],[west dos (hk) ltd.],"[exploration, simulation, city builder, automa...",exploration simulation city builder automation...
44524,1007970,This Side (Early Access Game),2019-05-15,True,False,False,Negative,0,10,9.99,...,True,Fight your natural fears and help your friend ...,"[Indie, Adventure, Horror, Co-op, Early Access...","Rising, Crystal Realms Entertainment","Rising, Crystal Realms Entertainment","[fight, natural, fears, help, friend, find, wa...","[rising, crystal realms entertainment]","[rising, crystal realms entertainment]","[indie, adventure, horror, co-op, early access...",indie adventure horror co-op early access dark...
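The combination step above can be sketched for a single record in plain Python. The `build_profile` helper is hypothetical, and unlike the cells above it strips the whitespace left around comma-separated names and skips empty values. That tidying is an assumption for readability and should not change what the vectorizers see, since they tokenize on word boundaries.

```python
def build_profile(tags, key_words, developer, publisher):
    """Join tags, keywords, developers and publishers into one lowercase string."""
    def split_names(value):
        # Split on commas and strip the whitespace left around each name
        return [name.strip().lower() for name in value.split(",") if name.strip()]

    parts = (
        [tag.lower() for tag in tags]
        + list(key_words)
        + split_names(developer)
        + split_names(publisher)
    )
    return " ".join(parts)

# A shortened version of the last row shown in the table above
profile = build_profile(
    tags=["Indie", "Adventure", "Horror"],
    key_words=["fight", "natural", "fears"],
    developer="Rising, Crystal Realms Entertainment",
    publisher="Rising, Crystal Realms Entertainment",
)
```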


In [None]:
recco['user_id'].value_counts()[recco['user_id'].value_counts() > 0].sort_values(ascending=False).head(10)

3632140    151
5136197    135
155275     135
4461402    125
4647856    121
4844875    120
5298698    114
1360533    113
4526095    108
3566780     99
Name: user_id, dtype: int64

In [None]:
recco['user_id'].value_counts()[recco['user_id'].value_counts() == 2].sort_values(ascending=False).head()

3069520    2
2297111    2
2383325    2
4605741    2
4522436    2
Name: user_id, dtype: int64

**Content Filtering: CountVectorizer**

- CountVectorizer is sklearn's bag-of-words vectorizer: it counts how often each word occurs in a document and outputs one count vector per document, with one entry for every word in the corpus vocabulary
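A minimal plain-Python sketch of the bag-of-words idea follows, using two invented documents. It only splits on whitespace; sklearn's CountVectorizer additionally lowercases the text and, by default, ignores single-character tokens.

```python
from collections import Counter

# Toy documents (invented for illustration)
docs = [
    "action adventure vr",
    "adventure horror adventure",
]

# Vocabulary: every distinct word in the corpus, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """One count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [count_vector(doc) for doc in docs]
# vocabulary -> ['action', 'adventure', 'horror', 'vr']
# matrix     -> [[1, 1, 0, 1], [0, 2, 1, 0]]
```

Each row of `matrix` is one document and each column is one vocabulary word, which is the shape of the matrix that the cosine similarity step compares row by row.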

In [19]:
count = CountVectorizer()

# Create a matrix of feature vectors


# compute cosine similarity matrix
#cosine_sim_matrix = cosine_similarity(sparse_matrix)

In [20]:
# Define a function to get recommendations for a game
def get_recommendations(game_profile):
    # Compute the cosine similarity between the game profile and the game features
    # (fit_transform already returns a sparse matrix, so no extra csr_matrix wrapper is needed)
    sparse_matrix = count.fit_transform(games['tags_description'])
    game_profile_vec = count.transform([game_profile])
    game_similarities = cosine_similarity(sparse_matrix, game_profile_vec)

    # Sort the games by similarity score and drop games with no word overlap
    # (ravel() flattens the (n, 1) result into a 1-D column)
    games['similarity_score'] = game_similarities.ravel()
    sorted_games = games.sort_values('similarity_score', ascending=False)
    sorted_games = sorted_games[sorted_games['similarity_score'] != 0]

    # Exclude the titles that are already part of the profile
    game_games = game_profile.split(', ')
    sorted_games = sorted_games[~sorted_games['title'].isin(game_games)]

    top_10 = sorted_games[['title', 'similarity_score']].head(10)

    if top_10['similarity_score'].sum() == 0:
        return 'No similar games found'
    else:
        return top_10



In [22]:
#function to retrieve game list from user
def user_profile(userid):
  user = recco[recco['user_id'] == userid]
  game_list = [row for row in user['title']]
  return ', '.join(game_list)

In [23]:
user_profile(2297111)

'Lost Ark, Hero Siege'

In [24]:
get_recommendations(user_profile(3632140))

Unnamed: 0,title,similarity_score
45742,Angela's Odyssey,0.34193
44276,Hold My Beer,0.31926
26536,International Space Station Tour VR,0.30474
24826,Bane of Asphodel,0.288467
41194,Wrath of Loki VR Adventure,0.283487
27886,Squirt's Adventure,0.27992
32358,Trains VR,0.273867
36825,Franchise Hockey Manager 4,0.271434
26100,A Butterfly's Dream,0.267091
31320,Franchise Hockey Manager 5,0.259819
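The filtering and ranking steps inside `get_recommendations` can be sketched without pandas as follows. The `top_matches` helper and the scores are hypothetical and serve only to show the order of operations: drop zero scores, drop titles already in the profile, then keep the best `k`.

```python
def top_matches(scores, owned_titles, k=10):
    """Return up to k (title, score) pairs, best first.

    Titles with a zero score and titles the user already owns are skipped.
    """
    candidates = [
        (title, score)
        for title, score in scores.items()
        if score != 0 and title not in owned_titles
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Hypothetical similarity scores, not taken from the dataset
scores = {"Game A": 0.42, "Game B": 0.0, "Game C": 0.31, "Game D": 0.57}
owned = {"Game D"}

result = top_matches(scores, owned, k=2)
# result -> [('Game A', 0.42), ('Game C', 0.31)]
```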


**Content Filtering: TfidfVectorizer**

- Unlike CountVectorizer, the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer weights each word's frequency in a document by its inverse document frequency, so words that appear in many documents count for less and words that are distinctive to a document count for more
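The weighting can be sketched in plain Python on a toy corpus. The documents are invented, and the sketch follows the smoothed idf formula that sklearn documents for its default settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; treat it as an approximation of the idea rather than a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration)
docs = [
    "strategy game turn based strategy",
    "racing game arcade",
    "chess strategy game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

def tfidf_vector(tokens):
    """Raw term count times idf, then scaled to unit (L2) length."""
    counts = Counter(tokens)
    weights = {word: counts[word] * idf[word] for word in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

vec = tfidf_vector(tokenized[2])
# "game" occurs in every document, so it gets the lowest idf and the
# smallest weight in vec; "chess" occurs only here and gets the largest.
```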

In [26]:
tfidf = TfidfVectorizer()

# Create a matrix of feature vectors


# Compute the cosine similarity matrix
#cosine_sim_matrix = cosine_similarity(features_matrix)

In [27]:
# Define a function to get recommendations for a game
def get_recommendations_2(game_profile):
    # Compute the cosine similarity between the game profile and the game features
    # (fit_transform already returns a sparse matrix, so no extra csr_matrix wrapper is needed)
    sparse_matrix = tfidf.fit_transform(games['tags_description'])
    game_profile_vec = tfidf.transform([game_profile])
    game_similarities = cosine_similarity(sparse_matrix, game_profile_vec)

    # Sort the games by similarity score, breaking ties by rating and review count
    # (ravel() flattens the (n, 1) result into a 1-D column)
    games['similarity_score'] = game_similarities.ravel().round(3)
    sorted_games = games.sort_values(['similarity_score', 'positive_ratio', 'user_reviews'], ascending=False)

    # Exclude the titles that are already part of the profile
    game_games = game_profile.split(', ')
    sorted_games = sorted_games[~sorted_games['title'].isin(game_games)]

    top_10 = sorted_games[['title', 'similarity_score']].head(10).reset_index(drop=True)

    if top_10['similarity_score'].sum() == 0:
        return 'No similar games found'
    else:
        return top_10


In [28]:
get_recommendations_2("Assassin's Creed® Odyssey")

Unnamed: 0,title,similarity_score
0,Assassin's Creed® III Remastered,0.313
1,Assassin's Creed® Origins,0.297
2,Discovery Tour by Assassin’s Creed®: Ancient E...,0.283
3,Assassin’s Creed® Liberation HD,0.277
4,Assassin’s Creed® Chronicles: China,0.264
5,Assassin's Creed® Unity,0.262
6,Assassin's Creed™: Director's Cut Edition,0.249
7,Assassin’s Creed® Rogue,0.24
8,Dishonored: The Brigmore Witches,0.239
9,Dishonored - The Knife of Dunwall,0.239


In [29]:
get_recommendations_2("Sid Meier's Civilization® III Complete")

Unnamed: 0,title,similarity_score
0,Civilization IV: Beyond the Sword,0.339
1,Sid Meier's Pirates! Gold Plus (Classic),0.322
2,Sid Meier's Covert Action (Classic),0.318
3,Silent Service,0.318
4,Sid Meier's Civilization IV: Colonization,0.314
5,Civilization IV®: Warlords,0.288
6,Sid Meier's Civilization® IV,0.251
7,Hero Generations,0.232
8,Sid Meier's Railroads!,0.229
9,Sid Meier's Civilization®: Beyond Earth™,0.213


Prepare the dataframe for export to the output file used by the Streamlit app deployment

In [30]:
def bool_convert(text):
  if text == True:
    return 'Yes'
  else:
    return 'No'

In [31]:
games['win'] = games['win'].apply(bool_convert)
games['mac'] = games['mac'].apply(bool_convert)
games['linux'] = games['linux'].apply(bool_convert)

In [32]:
#games['price_final'] = games['price_final'].apply(lambda x: 'Free' if x == 0 else x)
games.rename(columns={"price_final": "Price", "win": "Windows", "mac": "Mac", "linux": "Linux", "rating": "Rating"}, inplace=True)

In [33]:
games.to_parquet('/content/drive/MyDrive/Steam games dataset/Bigclip.pq')

**NOTE**: Please allow your browser to show pop-ups from this site, as shown below