# <span style="color:#6042f5"><b>Recommendation</b>
I will now use the previously processed datasets to generate recommendations.

## <span style="color:darkgrey"><b>Imports</b>

In [154]:
import numpy as np
import pandas as pd
import ast
import scipy.sparse as sp

from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer, Normalizer

## <span style="color:#a8eb34"><b>Preparation</b>

In [155]:
games = pd.read_csv('./datasets/processed_data/games.csv')
games.dropna(subset='title',inplace=True)
users = pd.read_csv('./datasets/processed_data/users.csv',
                    dtype={'user_id': int, 'name': str, 'hours_played': str})

users_view = users.copy()

users.name = users.name.apply(ast.literal_eval)
users.hours_played = users.hours_played.apply(ast.literal_eval)
# Shift hours by 1 so that owned-but-unplayed games (0 hours) keep a non-zero weight
users.hours_played = users.hours_played.apply(lambda x: list(np.array(x) + 1))

games.supported_languages =  games.supported_languages.apply(ast.literal_eval)
games.supported_languages =  games.supported_languages.apply(lambda x: list(x))

games.tags = games.tags.apply(ast.literal_eval)
games.tags = games.tags.apply(lambda x: list(x))

games.game_features = games.game_features.apply(ast.literal_eval)
games.game_features = games.game_features.apply(lambda x: list(x))

In [156]:
games.head(3)

| | title | win | mac | linux | steam_deck | desc | supported_languages | tags | game_features |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -circle triangle square- | 0 | 0 | 0 | 0 | Puzzle game using three types of objects ○ △ a... | [English, Japanese] | [Casual, Puzzle, Physics, Relaxing, 2D, Single... | [Single-player] |
| 2 | Circles | 1 | 1 | 0 | 1 | | [] | [] | [] |
| 3 | Fallalypse | 1 | 1 | 1 | 1 | A group of terrorists has arranged a nuclear h... | [English, Japanese, Russian, Traditional Chine... | [Early Access, Action, Adventure, Indie, Casua... | [Single-player, Online PvP, Steam Achievements... |


In [157]:
users.head(3)

| | user_id | name | hours_played |
|---|---|---|---|
| 0 | 5250 | [Alien Swarm, Cities Skylines, Counter-Strike,... | [5.9, 145.0, 1.0, 1.0, 1.0, 1.0, 63.0, 1.2, 1.... |
| 1 | 76767 | [Age of Empires II HD Edition, Alien Swarm, Ar... | [14.1, 1.8, 1.0, 1.0, 1.0, 25.0, 23.0, 13.5, 6... |
| 2 | 86540 | [Age of Empires II HD Edition, Age of Empires ... | [1.7, 1.0, 1.2, 1.0, 1.0, 1.0, 1.0, 58.0, 1.0,... |


## <span style="color:#a8eb34"><b>Encoding data</b>

### <span style="color:#6e174c"><b>Users</b>

In [158]:
encoder = MultiLabelBinarizer(sparse_output=True)
data_encoded: sp.csr_matrix = encoder.fit_transform(users.name.values)
data_encoded = data_encoded.astype(np.float64)

# Replace the 1s of the multi-hot matrix with (hours played + 1).
# Note: this assumes the non-zero entries come back in the same order as each user's game list.
row_indices, col_indices = data_encoded.nonzero()
values = [hours for user_hours in users.hours_played for hours in user_hours]
for i, (row, col) in enumerate(zip(row_indices, col_indices)):
    data_encoded[row, col] = values[i]
    
tfidf_transformer = TfidfTransformer(norm='l2')
tfidf_transformer.fit(data_encoded)
data_encoded_tfidf = tfidf_transformer.transform(data_encoded)

oc_matrix = data_encoded.transpose().dot(data_encoded)
octfidf_matrix = data_encoded_tfidf.transpose().dot(data_encoded_tfidf)
games_names_vec = np.array(encoder.classes_).flatten()

oc_matrix.setdiag(0)
octfidf_matrix.setdiag(0)

> 📝 <span style="color:lightblue">Comment:</span> Let's see what I have done here. I created a co-occurrence matrix and a second version of it normalized with tf-idf. The co-occurrence matrix is $A^{T}A$, where $A$ has one row per user and one column per product (in our case, games). Some games are played far more than others: for example, many users have a lot of hours in CS2, so that game would be recommended to almost everyone. This is the same problem we discussed in the lecture. So I first one-hot encoded the matrix and then filled it so that an entry is 0 when the user does not own the game, 1 when they own it but have not played it, and greater than 1 when they have played it. The raw hours can be recovered as $\text{hours} = \text{value} - 1$.
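
To make the $A^{T}A$ construction concrete, here is a minimal sketch on a toy matrix. The three users, three games and all values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy data: 3 users (rows) x 3 games (columns).
# 0 = not owned, 1 = owned but never played, >1 = hours played + 1.
A = sp.csr_matrix(np.array([
    [101.0, 1.0, 0.0],   # user 0: 100 h in game 0, owns game 1 unplayed
    [ 51.0, 0.0, 3.0],   # user 1: 50 h in game 0, 2 h in game 2
    [  0.0, 2.0, 4.0],   # user 2: 1 h in game 1, 3 h in game 2
]))

# Game-by-game co-occurrence matrix A^T A.
co = (A.T @ A).tolil()   # LIL format makes changing the diagonal cheap
co.setdiag(0)            # a game should not recommend itself
co = co.tocsr()

# Recover raw hours for owned games only (value - 1); unowned entries stay 0.
dense = A.toarray()
hours = np.where(dense > 0, dense - 1, 0)

print(co.toarray())
```

Every pair that involves game 0 gets a much larger weight than the pair (game 1, game 2), simply because game 0 has many hours. This is the popularity effect described above.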

### <span style="color:#6e174c"><b>Games</b>

In [159]:
platform_vec = games[['win', 'mac', 'linux', 'steam_deck']]  # already encoded
desc_vec = games.desc.values  # needs tf-idf encoding
tags_vec = games.tags.values  # needs multi-label encoding
game_features_vec = games.game_features.values  # needs multi-label encoding

In [160]:
tfidf_encoder = TfidfVectorizer(stop_words='english')
tfidf_encoder.fit(desc_vec)
desc_vec = tfidf_encoder.transform(desc_vec)

multi_vectorizer = MultiLabelBinarizer(sparse_output=True)
game_features_vec = multi_vectorizer.fit_transform(game_features_vec)
tags_vec = multi_vectorizer.fit_transform(tags_vec)

combined_games = sp.hstack([platform_vec,desc_vec,game_features_vec,tags_vec]) 

> 📝 <span style="color:lightblue">Comment:</span> First I vectorize the descriptions with tf-idf, then one-hot encode the tags and game features. I will compute the distance for each of them separately and then look at the mean distance.
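
One way the "separate distances, then the mean" idea could look is sketched below. It assumes cosine similarity as the measure (similarity is 1 minus cosine distance); the helper `mean_block_similarity` and the toy blocks are hypothetical and only illustrate the idea.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

def mean_block_similarity(blocks, query_idx):
    """Average the cosine similarity of one game to all games over several feature blocks."""
    # b[[query_idx]] keeps the selected row two-dimensional (1 x n_features).
    sims = [cosine_similarity(b[[query_idx]], b).ravel() for b in blocks]
    return np.mean(sims, axis=0)

# Hypothetical toy blocks for 3 games (values invented for illustration).
tags = sp.csr_matrix(np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float))
features = sp.csr_matrix(np.array([[1, 0], [1, 0], [0, 1]], dtype=float))

# Similarity of game 0 to every game, averaged over both blocks.
scores = mean_block_similarity([tags, features], query_idx=0)
print(scores)
```

Averaging per block gives each feature group the same influence, whereas stacking everything into one wide matrix lets the block with the most columns (here the description vocabulary) dominate.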

## <span style="color:#a8eb34"><b>Predicting</b>

### <span style="color:#6e174c"><b>Co-occurrence matrix</b>

In [161]:
user415 = pd.DataFrame({
    'hours_played': ast.literal_eval(users_view.loc[415].hours_played)
},index=pd.Series(ast.literal_eval(users_view.loc[415]['name']))).sort_values(by='hours_played', ascending=False)
user415

| | hours_played |
|---|---|
| Half-Life 2 | 31.0 |
| Half-Life 2 Lost Coast | 2.0 |
| Eternal Silence | 1.9 |
| Half-Life Source | 0.9 |
| Counter-Strike Source | 0.1 |
| Half-Life 2 Deathmatch | 0.0 |
| Half-Life Deathmatch Source | 0.0 |


> 📝 <span style="color:lightblue">Comment:</span> Let's look at our user. He mostly plays Half-Life 2, plus a few other games. We will try to recommend him some new games based on the co-occurrence matrix. Here is his game list: `['Counter-Strike Source', 'Eternal Silence', 'Half-Life 2', 'Half-Life 2 Deathmatch', 'Half-Life 2 Lost Coast', 'Half-Life Deathmatch Source', 'Half-Life Source']`. He is a real Half-Life lover, so let's see what we can do for this gentleman :)

In [162]:
recommendation:sp.csr_matrix = data_encoded.dot(oc_matrix)
recommendation[data_encoded.nonzero()] = 0
recommendation_tfidf:sp.csr_matrix = data_encoded_tfidf.dot(octfidf_matrix)
recommendation_tfidf[data_encoded_tfidf.nonzero()] = 0


In [163]:
non_normalized = pd.DataFrame({
    'games': games_names_vec[recommendation[415].nonzero()[1]],
    'weight': np.array(recommendation[415][recommendation[415].nonzero()])[0]
}).sort_values(by='weight', ascending=False)
non_normalized.head(20)

| | games | weight |
|---|---|---|
| 3714 | Counter-Strike Global Offensive | 67855200.0 |
| 3419 | Dota 2 | 33721240.0 |
| 1172 | Team Fortress 2 | 23905000.0 |
| 3009 | Garrys Mod | 15046330.0 |
| 1087 | The Elder Scrolls V Skyrim | 7337526.0 |
| 2564 | Left 4 Dead 2 | 7257411.0 |
| 3717 | Counter-Strike | 6450919.0 |
| 1532 | Sid Meiers Civilization V | 5103981.0 |
| 4111 | Battle Nations | 5080268.0 |
| 2298 | Mount Blade Warband | 4968153.0 |


> 📝 <span style="color:lightblue">Comment:</span> I can assure you that if I were this user, I wouldn't be very happy with these recommendations. We are mostly getting the most popular games on Steam. Of course that doesn't mean he wouldn't be satisfied, but the only reasonable recommendations here are Portal 2, Left 4 Dead 1 and 2, Team Fortress 2 and Garry's Mod. Let's see whether tf-idf changes anything.
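
The scoring step used above (user row times co-occurrence matrix, with owned games masked out) can be sketched for a single user as follows. The helper `recommend`, the titles and the numbers are hypothetical, and dense arrays are used only to keep the example short.

```python
import numpy as np

def recommend(user_vec, co_matrix, names, k=3):
    """Score games for one user as user_vec @ co_matrix and skip games already owned."""
    scores = user_vec @ co_matrix
    scores = np.where(user_vec > 0, -np.inf, scores)   # never recommend owned games
    top = np.argsort(scores)[::-1][:k]                 # indices of the k best scores
    return [(names[i], float(scores[i])) for i in top if np.isfinite(scores[i])]

names = ["A", "B", "C", "D"]          # hypothetical game titles
co = np.array([                       # hypothetical co-occurrence matrix, zero diagonal
    [0.0, 5.0, 1.0, 2.0],
    [5.0, 0.0, 3.0, 0.0],
    [1.0, 3.0, 0.0, 4.0],
    [2.0, 0.0, 4.0, 0.0],
])
user = np.array([2.0, 0.0, 0.0, 1.0])  # 1 h in A (stored as 2), owns D unplayed

result = recommend(user, co, names)
print(result)
```

Because the scores are plain sums of co-occurrence weights, any game that co-occurs with everything ends up near the top, which matches the list of Steam best-sellers in the table above.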

In [164]:
normalized = pd.DataFrame({
    'games': games_names_vec[recommendation_tfidf[415].nonzero()[1]],
    'weight': np.array(recommendation_tfidf[415][recommendation_tfidf[415].nonzero()])[0]
}).sort_values(by='weight', ascending=False)
normalized.head(20)

| | games | weight |
|---|---|---|
| 2869 | Half-Life 2 Episode One | 10.244847 |
| 2868 | Half-Life 2 Episode Two | 7.955495 |
| 1993 | Portal | 5.203963 |
| 1172 | Team Fortress 2 | 4.304037 |
| 3717 | Counter-Strike | 3.120978 |
| 3596 | Day of Defeat Source | 2.85707 |
| 2870 | Half-Life | 2.599879 |
| 1992 | Portal 2 | 1.947506 |
| 2865 | Half-Life Opposing Force | 1.881915 |
| 3597 | Day of Defeat | 1.839014 |


> 📝 <span style="color:lightblue">Comment:</span> We now get a lot of good recommendations: many sequels to games he already owns or plays, mostly Half-Life and other Counter-Strike games. But why? Tf-idf boosts 'words' that appear often in specific texts but rarely overall, like 'Obama' in articles about Obama and politicians. The same thing happens here: we do not want to focus on popular games that happen to be on every account, but on games owned by a specific group of users. After the transformation, `hours_played` carries much more weight for games that are characteristic of a specific group of users, so it is probably better to recommend those group-specific games than general ones.
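
The effect can be checked on a toy matrix. This is a minimal sketch with invented values: game 0 is owned by every user and game 1 by only one, and `TfidfTransformer` is used with its default smoothed idf, $\mathrm{idf} = \ln\frac{1+n}{1+\mathrm{df}} + 1$.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical toy matrix: 4 users x 2 games.
# Game 0 is on every account, game 1 is owned by a single user.
A = sp.csr_matrix(np.array([
    [3.0, 0.0],
    [2.0, 0.0],
    [5.0, 4.0],
    [1.0, 0.0],
]))

tfidf = TfidfTransformer(norm='l2').fit(A)
idf = tfidf.idf_                       # one weight per game (column)
weighted = tfidf.transform(A).toarray()

print(idf)          # the rare game gets the larger weight
print(weighted[2])  # user 2: the rare game now outweighs the popular one
```

For user 2 the popular game has more raw hours (5 vs 4), but after the transformation the rare game has the larger weight, which is why niche titles such as the Half-Life episodes move ahead of the global best-sellers.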

### <span style="color:#6e174c"><b>Product similarity</b>

Shapes of `combined_games`, `platform_vec`, `desc_vec`, `game_features_vec` and `tags_vec`:

(86343, 101061) (86343, 4) (86343, 100579) (86343, 36) (86343, 442)
