# 3. User-Game Matrix - Exploration

## Introduction

Recommender systems typically start from a user-item matrix, with the users as rows and the items (here, games) as columns.

This notebook explores the user-game matrix, which was already created in a separate notebook (see details below).

In [1]:
import pandas as pd
import numpy as np
import pickle
from scipy import sparse
import gzip

## Building the User-Game Matrix

Because of its large memory requirements, the user-game matrix was created in the notebook "User-Games_Matrix_Building", available in the `help` directory.
The matrix, in sparse format, was saved to a gzip-compressed pickle file, which you can read using the code below.
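
The building notebook is not reproduced here, so the following is only a minimal sketch of how a sparse matrix could be written to (and read back from) a gzip-compressed pickle. The helper name `compress_pickle`, the toy matrix, and the temporary file path are assumptions made for illustration, not the code used to build the real matrix.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse


def compress_pickle(obj, output_file):
    """Hypothetical counterpart of decompress_pickle: pickle + gzip."""
    with gzip.open(output_file, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def decompress_pickle(input_file):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(input_file, "rb") as f:
        return pickle.load(f)


# Toy 3 x 4 CSR matrix standing in for the real user-game matrix.
toy = sparse.csr_matrix(np.array([[0, 1, 0, 0],
                                  [1, 0, 0, 1],
                                  [2, 0, 0, 0]], dtype=np.int32))

# Round trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_matrix.pkl.gz")
    compress_pickle(toy, path)
    restored = decompress_pickle(path)

print(restored.shape, restored.nnz)
```

SciPy also provides `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which store a sparse matrix without pickle and may be worth considering when the file is shared across environments.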

## Exploring the User-Game Matrix

In [2]:
def decompress_pickle(input_file):
    with gzip.open(input_file, 'rb') as f:
        data = pickle.load(f)
    return data

user_game_matrix = decompress_pickle('matrix/user_game_matrix.pkl.gz')
user_game_matrix


<12663134x37420 sparse matrix of type '<class 'numpy.int32'>'
	with 47967516 stored elements in Compressed Sparse Row format>

In [3]:
print("There are", '{0:,.0f}'.format(user_game_matrix.shape[0]) , "users")
print("and", '{0:,.0f}'.format(user_game_matrix.shape[1]-1) , "games")
# the first column of the matrix contains the user_id!

There are 12,663,134 users
and 37,419 games


In [5]:
user_game_matrix_final = user_game_matrix[:, 1:]
user_game_matrix_final.shape

(12663134, 37419)

In [6]:
user_game_matrix[:, 0].toarray()


array([[       0],
       [       1],
       [       2],
       ...,
       [12663131],
       [12663132],
       [12663133]], dtype=int32)

In [16]:
user_game_matrix[2590]

<1x37420 sparse matrix of type '<class 'numpy.intc'>'
	with 2 stored elements in Compressed Sparse Row format>

## Sparse Matrix

A sparse matrix is a matrix in which most of the values are zero.

"Very large matrices require a lot of memory, and some very large matrices that we wish to work with are sparse." https://machinelearningmastery.com/sparse-matrices-for-machine-learning/
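
To make the memory argument concrete, here is a rough back-of-the-envelope estimate based on the shape and number of stored elements reported above. It assumes 32-bit values and 32-bit index arrays; the real index dtype chosen by SciPy may differ, so treat the result as an order-of-magnitude sketch.

```python
import numpy as np

n_rows, n_cols = 12_663_134, 37_420   # shape reported for user_game_matrix
nnz = 47_967_516                      # stored elements reported above

value_bytes = np.dtype(np.int32).itemsize   # matrix values are int32
index_bytes = np.dtype(np.int32).itemsize   # assumption: 32-bit indices

# A dense array stores every cell, zero or not.
dense_bytes = n_rows * n_cols * value_bytes

# CSR stores one value and one column index per stored element,
# plus one row pointer per row (and one extra at the end).
csr_bytes = nnz * value_bytes + nnz * index_bytes + (n_rows + 1) * index_bytes

print(f"dense: {dense_bytes / 1e9:,.1f} GB")
print(f"CSR:   {csr_bytes / 1e9:,.2f} GB")
```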

In [7]:
type(user_game_matrix)

scipy.sparse._csr.csr_matrix

`user_game_matrix` is a Compressed Sparse Row (CSR) matrix. It supports indexing and can be used in machine learning pipelines.

From the scipy documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

Advantages of the CSR format:

- efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
- efficient row slicing
- fast matrix vector products

Disadvantages of the CSR format:

- slow column slicing operations (consider CSC)
- changes to the sparsity structure are expensive (consider LIL or DOK)
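
The trade-offs listed above follow from how CSR lays out its data. The sketch below uses a made-up 3 x 4 matrix to show the three underlying arrays, a cheap row slice, and a conversion to CSC before slicing a column. The toy values are illustrative only.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 0, 0]], dtype=np.int32)
csr = sparse.csr_matrix(dense)

# CSR keeps three arrays: the non-zero values, their column indices,
# and row pointers saying where each row starts in the first two.
print("data:   ", csr.data)
print("indices:", csr.indices)
print("indptr: ", csr.indptr)

# Row slicing only needs the indptr range of that row, so it is cheap.
row = csr[1]

# Column slicing has to scan every row in CSR; converting to CSC once
# is usually preferable when many column slices are needed.
csc = csr.tocsc()
col = csc[:, 3]

print(row.toarray())
print(col.toarray().ravel())
```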

An example of indexing a sparse matrix is provided below.

In [9]:
# The first column contains the user id
user_game_matrix[:, 0].toarray() # you can also use todense()

array([[       0],
       [       1],
       [       2],
       ...,
       [12663131],
       [12663132],
       [12663133]])

In [10]:
print('{0:,.0f}'.format(len(user_game_matrix[:, 0].toarray())))

12,663,134


In [11]:
# Note: user_game_matrix still includes the user-id column (column 0)
sparsity = 1 - (user_game_matrix.count_nonzero() / (user_game_matrix.shape[0]*user_game_matrix.shape[1]))

print("The sparsity of the matrix is '{0:.4%}'".format(sparsity))
print("The density of the matrix is '{0:.4%}'".format(1-sparsity))

The sparsity of the matrix is '99.9899%'
The density of the matrix is '0.0101%'
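
Because the first column of `user_game_matrix` holds the user index rather than a recommendation, it contributes non-zero cells to the figure above. The toy example below (made-up values, same layout: ID column first) sketches how density can be computed with and without that column. It is an illustration, not a recomputation of the real numbers.

```python
import numpy as np
from scipy import sparse

# Column 0 is the user index; columns 1..4 are games.
toy = sparse.csr_matrix(np.array([[0, 0, 1, 0, 0],
                                  [1, 0, 0, 0, 1],
                                  [2, 0, 0, 0, 0]], dtype=np.int32))


def density(matrix):
    """Share of cells holding a non-zero value."""
    return matrix.count_nonzero() / (matrix.shape[0] * matrix.shape[1])


density_all = density(toy)            # ID column included
density_games = density(toy[:, 1:])   # interactions only

print(f"with ID column:    sparsity = {1 - density_all:.4%}")
print(f"without ID column: sparsity = {1 - density_games:.4%}")
```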


## Mapping of User ID and Games ID

The mapping between the original user id (in the recommendation dataset) and the new zero-based id is provided below.

In [12]:
def decompress_csv(input_file):
    # Read the compressed CSV file
    with gzip.open(input_file, 'rt', encoding='utf-8') as f:
        df = pd.read_csv(f)
    return df

# Example usage
users_idx = decompress_csv('matrix/users_idx.csv.gz')
# The remaining matrix columns (from the second onward) hold the games, ordered by their index (app_id_categorical)
games_idx = decompress_csv('matrix/games_idx.csv.gz')

In [13]:
users_idx.head()

Unnamed: 0,user_id,user_id_categorical
0,0,0
1,2,1
2,3,2
3,4,3
4,5,4


In [14]:
games_idx.head()

Unnamed: 0,app_id,app_id_categorical
0,10,1
1,20,2
2,30,3
3,40,4
4,50,5


Indexing a user's row to find the games she has recommended:

In [15]:
# Example: User_id = 3 -> User_id_categorical = 2
USER_ID = 2
sample = user_game_matrix[USER_ID,:].toarray()
games_indices = np.where(sample == 1)
games_indices

(array([0, 0], dtype=int64), array([10393, 27909], dtype=int64))

The matching `app_id_categorical` values are 10393 and 27909.

In [16]:
games_idx[games_idx["app_id_categorical"]==10393]

# corresponds to this app_id in the recommendation dataset

Unnamed: 0,app_id,app_id_categorical
10392,552990,10393


In [17]:
games_idx[games_idx["app_id_categorical"]==27909]

Unnamed: 0,app_id,app_id_categorical
27908,1407200,27909
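
The lookups above can be combined into one step. The sketch below reads the non-zero column positions directly from the CSR row instead of converting it to a dense array, drops the user-id column, and maps the remaining positions to `app_id`. The function name `recommended_app_ids` and the toy data are hypothetical; only the column names mirror `games_idx`.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy lookup table with the same columns as games_idx (values are made up).
games_idx_toy = pd.DataFrame({"app_id": [10, 20, 30, 40],
                              "app_id_categorical": [1, 2, 3, 4]})

# Toy matrix: column 0 is the user index, columns 1..4 are games.
toy = sparse.csr_matrix(np.array([[0, 1, 0, 0, 0],
                                  [1, 0, 0, 1, 1],
                                  [2, 0, 1, 0, 1]], dtype=np.int32))


def recommended_app_ids(matrix, games_lookup, user_idx):
    """Return the app_ids recommended by one user (hypothetical helper)."""
    row = matrix[user_idx]                 # 1 x n_cols CSR row
    cols = row.indices[row.data != 0]      # columns with non-zero values
    cols = cols[cols != 0]                 # drop the user-id column
    mask = games_lookup["app_id_categorical"].isin(cols)
    return games_lookup.loc[mask, "app_id"].tolist()


for user_idx in range(toy.shape[0]):
    print(user_idx, recommended_app_ids(toy, games_idx_toy, user_idx))
```

Filtering on column position rather than on `sample == 1` also avoids a subtle pitfall: for the user whose index is 1, the ID column itself equals 1 and would otherwise be reported as a game.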


### Reducing the dataset to only users who perform at least K recommendations

The code below may be needed if we do not include side / query features (e.g. games metadata). Such additional features mitigate the cold-start problem, i.e. the difficulty the model has in predicting relevance for users with only a few recommendations.

Some papers filter for users who perform at least k recommendations

e.g. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T. S. (2017, April). Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web (pp. 173-182).

In [18]:
# removing the user id
user_game_matrix_final = user_game_matrix[:,1:]
user_game_matrix_final.shape

(12663134, 37419)

In [19]:
sum_of_recs = np.sum(user_game_matrix_final, axis=1)
sum_of_recs.shape

(12663134, 1)

In [20]:
# Adding this column to the user game matrix, so that I can filter it later
sparse_matrix = sparse.hstack((user_game_matrix_final, sum_of_recs), format="csr")
sparse_matrix.shape

(12663134, 37420)

In [21]:
condition_column = -1  # index of the column to filter on (the appended row sums)
threshold_value = 5  # k, the cutoff on the number of recommendations

# Extract the values in the condition column
condition_values = sparse_matrix[:, condition_column].toarray().flatten()

# Create a boolean mask for rows based on the condition.
# Note the strict inequality: this keeps users with more than
# threshold_value recommendations (at least 6 when threshold_value = 5)
row_mask = condition_values > threshold_value

# Filter the sparse matrix based on the condition for rows
user_game_matrix_k_recs = sparse_matrix[row_mask, :]

In [22]:
# removing the last column
user_game_matrix_k_recs = user_game_matrix_k_recs[:,:-1]
print("There are", '{0:,.0f}'.format(user_game_matrix_k_recs.shape[0]) , "users")
# the user-id column was already removed, so every remaining column is a game
print("and", '{0:,.0f}'.format(user_game_matrix_k_recs.shape[1]) , "games")

There are 1,236,797 users
and 37,419 games
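
The same filter can be expressed without stacking a helper column onto the matrix. The sketch below, on a made-up 4 x 4 matrix, computes the row sums once and applies the boolean mask directly. It uses `>=` to implement "at least k" literally, whereas the notebook cell above uses a strict `>`, so the two are not interchangeable for the same threshold.

```python
import numpy as np
from scipy import sparse

# Toy interaction matrix (no ID column): rows are users, columns are games.
toy = sparse.csr_matrix(np.array([[1, 1, 0, 0],
                                  [1, 0, 0, 0],
                                  [1, 1, 1, 0],
                                  [0, 0, 0, 0]], dtype=np.int32))
k = 2  # illustrative threshold

# sum(axis=1) returns an (n_users, 1) matrix; flatten it to a 1-D array.
recs_per_user = np.asarray(toy.sum(axis=1)).ravel()

# Keep users with at least k recommendations.
row_mask = recs_per_user >= k
filtered = toy[row_mask]

# Original row positions of the kept users, useful for mapping back to IDs.
kept_users = np.flatnonzero(row_mask)

print(filtered.shape, kept_users)
```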


In [23]:
sparsity = 1 - (user_game_matrix_k_recs.count_nonzero() 
/ (user_game_matrix_k_recs.shape[0]*user_game_matrix_k_recs.shape[1]))

print("The sparsity of the matrix is '{0:.4%}'".format(sparsity))
print("The density of the matrix is '{0:.4%}'".format(1-sparsity))

The sparsity of the matrix is '99.9652%'
The density of the matrix is '0.0348%'

