# Recommender 

## Motivation
We have 3 main files:
1. games.csv (app_id, title, rating, positive_ratio, user_reviews)
2. users.csv (user_id, products, reviews)
3. recommendations.csv (app_id, helpful, is_recommended, hours, user_id)

We use the LightFM model for our recommender, so we need to adjust the rating signal: `is_recommended` is only 0 or 1, which carries too little information on its own. We **transform** it into a **score** (0-10) as follows:
- If user **recommends** the game and has **a lot** of hours -> **very high score**
- If user **recommends** the game and has **somewhat few** hours -> **high score**
- If user **recommends** the game and has **very few** hours -> **medium score**
- If user **does not recommend** the game and has **very few** hours -> **very low score**
- If user **does not recommend** the game and has **somewhat few** hours -> **low score**
- If user **does not recommend** the game and has **a lot** of hours -> **close to average**

We then define a function that takes the hours a user played a game and tells us whether that is **a lot** / **somewhat few** / **very few**. We express this as an **intensity** between 0 and 1, where $avg(game_{id})$ is the average number of hours played for that game and $\alpha$ controls how steeply the intensity changes around that average.

$$
\mathrm{intens}(game_{id}, h) = \frac{1}{1 + e^{-\alpha \cdot (h - \mathrm{avg}(game_{id})) / \mathrm{avg}(game_{id})}}
$$

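Next, we define the function score, which replaces the binary recommendation with the **score** the user is taken to have given the game. Here $recommends$ is 1 if the user recommends the game and 0 otherwise:

$$
\mathrm{score}(game_{id}, h, recommends) = recommends \cdot (5 + 5 \cdot \mathrm{intens}(game_{id}, h)) + (1 - recommends) \cdot (5 \cdot \mathrm{intens}(game_{id}, h))
$$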
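
The two formulas above can be sketched in a few lines of NumPy. This is only an illustration: the text does not fix a value for $\alpha$, so `ALPHA = 1.0` is an assumption, and the 20-hour average belongs to a made-up game.

```python
import numpy as np

ALPHA = 1.0  # assumed steepness; the text does not give a value for alpha


def intens(hours, avg_hours, alpha=ALPHA):
    """Intensity in (0, 1): logistic of the relative deviation from the game's average hours."""
    hours = np.asarray(hours, dtype=float)
    return 1.0 / (1.0 + np.exp(-alpha * (hours - avg_hours) / avg_hours))


def score(hours, avg_hours, recommends, alpha=ALPHA):
    """Score in (0, 10): recommended reviews land in (5, 10), the others in (0, 5)."""
    intensity = intens(hours, avg_hours, alpha)
    recommends = np.asarray(recommends, dtype=float)
    return recommends * (5 + 5 * intensity) + (1 - recommends) * (5 * intensity)


# A hypothetical game whose players average 20 hours
avg = 20.0
for h in (1, 20, 200):
    print(h, float(score(h, avg, 1)), float(score(h, avg, 0)))
```

At exactly the average number of hours the intensity is 0.5, so a recommending user scores 7.5 and a non-recommending user 2.5. With $\alpha = 1$ the intensity at zero hours is $1 / (1 + e) \approx 0.27$, so a larger $\alpha$ would be needed to push "very few hours" closer to 0.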

### Preprocessing
We have about 14 million users, which is a lot. The average number of reviews per user is ~2.88, which is very few considering we have about 50k games! We therefore drop users whose review count is below that average, which means removing them from users.csv and recommendations.csv. As for games, we drop those with fewer than 5 recommendations among the remaining users.


### Training



### Evaluation



### Metrics




### Dependencies

In [1]:
import lightfm
from lightfm import LightFM
from tqdm import tqdm
import numpy as np
import hyperopt as hyp
import pandas as pd
import joblib as jb
from lightfm.evaluation import recall_at_k
from scipy.sparse import coo_matrix, load_npz, save_npz
import matplotlib.pyplot as plt
import pickle

### Load datasets

In [2]:
games = pd.read_csv('./data/games.csv')
users = pd.read_csv('./data/users.csv')
recommendations = pd.read_csv('./data/recommendations.csv')

In [3]:
games.head()

| | app_id | title | date_release | win | mac | linux | rating | positive_ratio | user_reviews | price_final | price_original | discount | steam_deck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13500 | Prince of Persia: Warrior Within™ | 2008-11-21 | True | False | False | Very Positive | 84 | 2199 | 9.99 | 9.99 | 0.0 | True |
| 1 | 22364 | BRINK: Agents of Change | 2011-08-03 | True | False | False | Positive | 85 | 21 | 2.99 | 2.99 | 0.0 | True |
| 2 | 113020 | Monaco: What's Yours Is Mine | 2013-04-24 | True | True | True | Very Positive | 92 | 3722 | 14.99 | 14.99 | 0.0 | True |
| 3 | 226560 | Escape Dead Island | 2014-11-18 | True | False | False | Mixed | 61 | 873 | 14.99 | 14.99 | 0.0 | True |
| 4 | 249050 | Dungeon of the ENDLESS™ | 2014-10-27 | True | True | False | Very Positive | 88 | 8784 | 11.99 | 11.99 | 0.0 | True |


In [4]:
users.head()

| | user_id | products | reviews |
|---|---|---|---|
| 0 | 7360263 | 359 | 0 |
| 1 | 14020781 | 156 | 1 |
| 2 | 8762579 | 329 | 4 |
| 3 | 4820647 | 176 | 4 |
| 4 | 5167327 | 98 | 2 |


In [5]:
recommendations.head()

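| | app_id | helpful | funny | date | is_recommended | hours | user_id | review_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 975370 | 0 | 0 | 2022-12-12 | True | 36.3 | 51580 | 0 |
| 1 | 304390 | 4 | 0 | 2017-02-17 | False | 11.5 | 2586 | 1 |
| 2 | 1085660 | 2 | 0 | 2019-11-17 | True | 336.5 | 253880 | 2 |
| 3 | 703080 | 0 | 0 | 2022-09-23 | True | 27.4 | 259432 | 3 |
| 4 | 526870 | 0 | 0 | 2021-01-10 | True | 7.9 | 23869 | 4 |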


### Data analysis

In [6]:
print(f'Average number of recommendations per user: {users["reviews"].mean()}')
avg_app_recommendations = recommendations.groupby('app_id')['review_id'].count().mean()
print(f'Average number of recommendations per app: {avg_app_recommendations}')
print(f'Number of users: {users.shape[0]}')
print(f'Number of games: {games.shape[0]}')
print(f'Number of recommendations: {recommendations.shape[0]}')

Average number of recommendations per user: 2.8767377246459964
Average number of recommendations per app: 1094.2513693166711
Number of users: 14306064
Number of games: 50872
Number of recommendations: 41154794


### Preprocessing

In [7]:
# Drop users with fewer reviews than the average user
avg_users_reviews = users['reviews'].mean()
users = users[users['reviews'] >= avg_users_reviews]
recommendations = recommendations[recommendations['user_id'].isin(users['user_id'])]

# Drop games with fewer than 5 recommendations among the remaining users
rec_counts = recommendations.groupby('app_id')['review_id'].count()
games = games[games['app_id'].isin(rec_counts[rec_counts >= 5].index)]
recommendations = recommendations[recommendations['app_id'].isin(games['app_id'])]

# Drop users and games that are left without any recommendations
games = games[games['app_id'].isin(recommendations['app_id'])]
users = users[users['user_id'].isin(recommendations['user_id'])]

In [8]:
print(f'Average number of recommendations per user: {users["reviews"].mean()}')
avg_app_recommendations = recommendations.groupby('app_id')['review_id'].count().mean()
print(f'Average number of recommendations per app: {avg_app_recommendations}')
print(f'Number of users: {users.shape[0]}')
print(f'Number of games: {games.shape[0]}')
print(f'Number of recommendations: {recommendations.shape[0]}')

Average number of recommendations per user: 7.61178278813486
Average number of recommendations per app: 821.3645547161173
Number of users: 3771654
Number of games: 34944
Number of recommendations: 28701763


In [9]:
recommendations.to_csv('./data/recommendations2.csv', index=False)

In [9]:
SKIP_REST = False

### Data preparation for training

In [10]:
unique_user_ids = users['user_id'].unique()
unique_game_ids = games['app_id'].unique()

user_index = {user_id: idx for idx, user_id in enumerate(unique_user_ids)}
app_index = {game_id: idx for idx, game_id in enumerate(unique_game_ids)}
reverse_user_index = {idx: user_id for user_id, idx in user_index.items()}
reverse_app_index = {idx: game_id for game_id, idx in app_index.items()}

In [11]:
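if not SKIP_REST:
  # Average hours played per game, indexed by app_id
  avg_hours_of_game = recommendations.groupby('app_id')['hours'].mean()

  # Transform interactions into scores

  def intens(game_id, hours):
    # NOTE: unlike the formula in Motivation, this omits alpha and the centering on
    # the game's average, so the result is at least 0.5 for non-negative hours.
    avg_hours = avg_hours_of_game[game_id]
    if avg_hours == 0:
      return 0.9  # fixed fallback for games whose average is zero

    return 1 / (1 + np.exp(-hours / avg_hours))

  def score(game_id, hours, recommends):
    intensity = intens(game_id, hours)
    return recommends * (5 + 5 * intensity) + (1 - recommends) * (5 * intensity)

  scores = recommendations.apply(lambda x: score(x['app_id'], x['hours'], x['is_recommended']), axis=1)

  # Rows are users, columns are games; the explicit shape keeps the matrix aligned with the index maps
  score_matrix = coo_matrix(
    (scores, (recommendations['user_id'].map(user_index), recommendations['app_id'].map(app_index))),
    shape=(len(user_index), len(app_index))
  )

  save_npz('./data/score_matrix.npz', score_matrix)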
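
Row-wise `apply` over roughly 28.7 million recommendations can be slow. As a sketch, the same arithmetic as `intens` and `score` in the cell above can be computed column-wise. The toy table and the helper name `vectorized_scores` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the recommendations table (values are made up)
toy = pd.DataFrame({
    'app_id': [10, 10, 20, 20],
    'user_id': [1, 2, 1, 3],
    'hours': [5.0, 15.0, 0.0, 0.0],
    'is_recommended': [True, False, True, False],
})


def vectorized_scores(df):
    """Column-wise version of the notebook's intens/score (hypothetical helper)."""
    # Average hours of each row's game, aligned with the rows of df
    avg = df.groupby('app_id')['hours'].transform('mean').to_numpy()
    hours = df['hours'].to_numpy(dtype=float)
    rec = df['is_recommended'].to_numpy(dtype=float)

    # Games with zero average hours get the fixed fallback intensity 0.9;
    # safe_avg only avoids a division by zero in the branch that is discarded
    safe_avg = np.where(avg == 0, 1.0, avg)
    intensity = np.where(avg == 0, 0.9, 1.0 / (1.0 + np.exp(-hours / safe_avg)))

    return rec * (5 + 5 * intensity) + (1 - rec) * (5 * intensity)


toy['score'] = vectorized_scores(toy)
print(toy)
```

The result is a plain NumPy array in row order, so it could be passed to `coo_matrix` in place of `scores`.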

### Then we feed the score_matrix into the "preprocessing.ipynb" notebook...

In [12]:
# Training interactions: 100% of the train users' history plus 60% of the test users' history
interactions = load_npz('./data/train_and_test.npz').tocsr()

# Held-out interactions: the remaining 40% of the test users' history (used for testing)
rest_test = load_npz('./data/rest_test.npz').tocsr()
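
The split itself lives in preprocessing.ipynb and is not shown here. Purely as an illustration of the idea, a split of this kind could look like the sketch below. Everything in it is an assumption: the function name `split_history`, the share of test users (`test_user_frac`), and the per-interaction coin flip, which holds out 40% only in expectation rather than exactly per user.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def split_history(scores, test_user_frac=0.2, kept_frac=0.6, seed=0):
    """Return (train_and_test, rest_test), both with the shape of `scores` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    coo = scores.tocoo()

    # Randomly mark a share of the users as test users
    is_test_user = rng.random(coo.shape[0]) < test_user_frac

    # Hold out an interaction if it belongs to a test user and is not among the kept share
    held_out = is_test_user[coo.row] & (rng.random(coo.nnz) >= kept_frac)

    def subset(mask):
        return coo_matrix(
            (coo.data[mask], (coo.row[mask], coo.col[mask])), shape=coo.shape
        ).tocsr()

    return subset(~held_out), subset(held_out)


# Tiny dense example: 4 users x 5 games, every cell observed
demo = csr_matrix(np.arange(1, 21, dtype=float).reshape(4, 5))
demo_train, demo_rest = split_history(demo, test_user_frac=0.5, seed=1)
print(demo_train.nnz, demo_rest.nnz)
```

Every interaction ends up in exactly one of the two matrices, so no held-out score leaks into training.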

### Training & hyperopt

In [13]:
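def fit(model, name, epochs=10):
  # Each iteration of this loop runs 5 LightFM epochs, so `epochs` counts rounds of 5
  for epoch in range(1, epochs + 1):
    model.fit_partial(interactions, epochs=5, num_threads=20)

    # Recall@10 on the held-out part of the test users' history
    val_recall = recall_at_k(
      model,
      rest_test,
      k=10,
      num_threads=20
    ).mean()

    print(f"Epoch {epoch}: [TEST]Recall@10 = {val_recall:.4f}")

    # Save a checkpoint after every round so an interrupted run can be reloaded with loadModel
    with open(f'./data/model/lightfm_{name}.pkl', 'wb') as f:
      pickle.dump(model, f)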


def loadModel(name) -> LightFM:
  with open(f'./data/model/lightfm_{name}.pkl', 'rb') as f:
    model = pickle.load(f)
    return model
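
To make the metric used in `fit` concrete, here is a small dense NumPy sketch of recall@k, assuming the definition in LightFM's documentation: per user, the number of held-out positives among the top k ranked items divided by the number of held-out positives, averaged over users that have at least one. The function name `recall_at_k_dense` and the toy arrays are made up, and a dense matrix like this would not fit the real data.

```python
import numpy as np


def recall_at_k_dense(scores, test_positives, k=10):
    """Mean over users of (held-out positives in the top k) / (held-out positives).

    scores: (n_users, n_items) array of predicted scores.
    test_positives: (n_users, n_items) boolean array of held-out interactions.
    Users without held-out interactions are skipped. Requires k <= n_items.
    """
    scores = np.asarray(scores, dtype=float)
    test_positives = np.asarray(test_positives, dtype=bool)

    # Indices of the k highest-scoring items per user (order inside the top k does not matter)
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = np.take_along_axis(test_positives, top_k, axis=1).sum(axis=1)

    n_pos = test_positives.sum(axis=1)
    has_pos = n_pos > 0
    return float((hits[has_pos] / n_pos[has_pos]).mean())


toy_scores = np.array([[0.9, 0.8, 0.1, 0.0],
                       [0.1, 0.2, 0.3, 0.4],
                       [1.0, 2.0, 3.0, 4.0]])
toy_positives = np.array([[1, 0, 1, 0],
                          [0, 0, 0, 1],
                          [0, 0, 0, 0]], dtype=bool)
print(recall_at_k_dense(toy_scores, toy_positives, k=2))
```

In the toy data the first user has one of two held-out games in the top 2, the second user has their only held-out game there, and the third user is skipped.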

In [14]:
LOAD = False
name = 'warp64'

if LOAD:
  model = loadModel(name)
else:
  model = LightFM(no_components=64, loss='warp')
  fit(model, name, epochs=10)

Epoch 1: [TEST]Recall@10 = 0.0822
Epoch 2: [TEST]Recall@10 = 0.0832
Epoch 3: [TEST]Recall@10 = 0.0839
Epoch 4: [TEST]Recall@10 = 0.0832


KeyboardInterrupt: 

In [15]:
val_recall = recall_at_k(
  model,
  rest_test,
  k=20,
  num_threads=20
).mean()

print(f"[TEST]Recall@20 = {val_recall:.4f}")

[TEST]Recall@20 = 0.1369
