### Pandas & SciPy for RecSys datasets

In [1]:
import numpy as np 
import pandas as pd 
import scipy.sparse as sp

In [3]:
data = pd.read_csv('interactions.csv', parse_dates=['start_date'])
data.head()

Unnamed: 0,user_id,item_id,progress,rating,start_date
0,126706,14433,80,,2018-01-01
1,127290,140952,58,,2018-01-01
2,66991,198453,89,,2018-01-01
3,46791,83486,23,5.0,2018-01-01
4,79313,188770,88,5.0,2018-01-01


### Feature Description
- `progress` - reading progress as a percentage
- `rating` - rating the user gave the book (from 1 to 5; many values are missing)
- `start_date` - date when the user started reading the book

In [4]:
# Check duplicates
is_duplicate = data.duplicated(subset=['user_id', 'item_id'],  keep=False)

print('N duplicates: ', is_duplicate.sum())

N duplicates:  160


In [5]:
dup_user_ids = data[is_duplicate]['user_id'].values

# Let's look at one duplicated user-item pair
data[is_duplicate & (data['user_id'] == 142896)]

Unnamed: 0,user_id,item_id,progress,rating,start_date
18393,142896,219838,100,,2018-01-09
293393,142896,219838,30,5.0,2018-05-22


In [6]:
df_duplicates = data[is_duplicate].sort_values(by=['user_id', 'start_date'])
df = data[~is_duplicate]

In [7]:
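# Collapse each duplicated (user_id, item_id) pair into a single row:
# keep the max progress, the max rating and the earliest start date
df_duplicates = df_duplicates.groupby(['user_id', 'item_id']).agg({
    'progress': 'max',
    'rating': 'max',
    'start_date': 'min'
})

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, df_duplicates.reset_index()], ignore_index=True)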

In [8]:
# Look at the number of unique values per column
df.nunique()

user_id       151600
item_id        59599
progress         101
rating             5
start_date       730
dtype: int64

We have:
- Roughly 150k unique users
- Roughly 60k unique items

The data has about `1.5m` rows, far more than the number of unique users or items, so each ID repeats many times. Columns like this are said to have `low cardinality`.
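The ratio of unique values to rows gives a quick measure of cardinality. A minimal sketch on a toy frame (the values below are made up for illustration and are not taken from `interactions.csv`):

```python
import pandas as pd

# Toy interaction log (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3],
    'item_id': [10, 11, 10, 12, 13, 10],
})

# Share of distinct values per column; values close to 0 mean low cardinality
cardinality_ratio = toy.nunique() / len(toy)
print(cardinality_ratio)
```

The lower the ratio, the more a column is likely to benefit from an encoding that stores each distinct value only once.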

### Pandas 
Pandas dtypes that help reduce memory usage

### CategoricalDtype

In [9]:
def num_bytes_format(num_bytes, float_prec=4):
    """
    Format a byte count as a human-readable string.

    Uses decimal steps (1000 bytes per unit) and `float_prec` decimals.
    """
    units = ['bytes', 'Kb', 'Mb', 'Gb', 'Tb', 'Pb', 'Eb']
    for unit in units[:-1]:
        if abs(num_bytes) < 1000:
            return f'{num_bytes:.{float_prec}f} {unit}'
        num_bytes /= 1000
    return f'{num_bytes:.{float_prec}f} {units[-1]}'

In [10]:
# Select only user-item logs 
user_item_df = df[['user_id', 'item_id']].copy()

# Check the current column dtypes
user_item_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1532998 entries, 0 to 1532997
Data columns (total 2 columns):
 #   Column   Non-Null Count    Dtype
---  ------   --------------    -----
 0   user_id  1532998 non-null  int64
 1   item_id  1532998 non-null  int64
dtypes: int64(2)
memory usage: 23.4 MB


Let's compare memory usage for different column types:
- Int
- String
- Category

In [11]:
num_bytes_ints = user_item_df.memory_usage(deep=True).sum()
num_bytes_string = user_item_df.astype('string').memory_usage(deep=True).sum()
num_bytes_cat = user_item_df.astype('category').memory_usage(deep=True).sum()

In [12]:
print('Int Column Types: ', num_bytes_format(num_bytes_ints))
print('String Column Types: ', num_bytes_format(num_bytes_string))
print('Category Column Types: ', num_bytes_format(num_bytes_cat))

Int Column Types:  24.5281 Mb
String Column Types:  191.5619 Mb
Category Column Types:  21.8180 Mb
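To see where the saving comes from, it helps to look inside a categorical column: the distinct values are stored once, and each row holds only a small integer code. A minimal sketch on a toy Series (the IDs are copied from `data.head()`, the repetition pattern is made up):

```python
import pandas as pd

ids = pd.Series([126706, 127290, 126706, 126706, 127290])  # toy IDs
cat = ids.astype('category')

print(cat.cat.categories)   # distinct values, stored once
print(cat.cat.codes)        # one integer code per row
print(cat.cat.codes.dtype)  # pandas picks the smallest integer type that fits
```

With only two categories the codes fit into `int8`. Columns with many distinct values need wider codes, so the saving shrinks as cardinality grows.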


### Nullable Integer Dtypes
The nullable integer dtypes (`pd.Int8Dtype`, `pd.Int32Dtype`, ...) allow integer columns to hold missing values. Choosing a narrow width also reduces memory.

In [13]:
ratings = df['rating'].copy()

ratings_float32 = ratings.astype(np.float32).memory_usage(deep=True)
ratings_int32 = ratings.astype(pd.Int32Dtype()).memory_usage(deep=True)
ratings_int8 = ratings.astype(pd.Int8Dtype()).memory_usage(deep=True)

In [14]:
print('Float_64  Type: ', num_bytes_format(ratings.memory_usage(deep=True)))
print('Float_32 Type: ', num_bytes_format(ratings_float32))
print('Int_32 Type: ', num_bytes_format(ratings_int32))
print('Int_8 Type:', num_bytes_format(ratings_int8))

Float_64  Type:  12.2641 Mb
Float_32 Type:  6.1321 Mb
Int_32 Type:  7.6651 Mb
Int_8 Type: 3.0661 Mb
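The `Int_32` and `Int_8` figures are consistent with how nullable integer arrays are laid out: a values array plus a boolean mask of one byte per row, so roughly 5 and 2 bytes per row respectively. A minimal sketch on a toy Series (values made up for illustration):

```python
import numpy as np
import pandas as pd

ratings_toy = pd.Series([5.0, np.nan, 3.0, np.nan])  # float64 because of NaN

# Nullable integer dtype: values stay integers, gaps become pd.NA
ratings_toy_int8 = ratings_toy.astype('Int8')

print(ratings_toy_int8.dtype)
print(ratings_toy_int8.isna().sum())

# 4 rows * (1 byte value + 1 byte mask)
print(ratings_toy_int8.array.nbytes)
```

This is also why `Int32` ends up larger than `float32` here: the float column encodes missing values as NaN and needs no separate mask.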


### Sparse Type
`pd.SparseDtype` is a data type for sparse data.

The main idea is to store only the "known" values. All other positions are assumed to equal a single constant, the fill value, which is stored once.

The type is defined by:
- `dtype` - type of the stored values
- `fill_value` - constant that stands in for the omitted values

In [22]:
# Sparse Type Creation
sparse_type = pd.SparseDtype(np.float32, np.nan)

The `ratings` Series has many missing values. Let's convert it to `pd.SparseDtype`

In [19]:
print('N Missing: ', ratings.isna().sum())

N Missing:  1247643


In [26]:
ratings_sparse = ratings.astype(sparse_type)
print('Sparse Type:', num_bytes_format(ratings_sparse.memory_usage(deep=True)))

Sparse Type: 2.2830 Mb
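The `.sparse` accessor shows what is actually kept in a sparse column. A minimal sketch on a toy Series (values made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.Series([np.nan, 5.0, np.nan, np.nan, 4.0], dtype=np.float32)
toy_sparse = toy.astype(pd.SparseDtype(np.float32, np.nan))

density = toy_sparse.sparse.density      # share of explicitly stored values
sp_values = toy_sparse.sparse.sp_values  # the stored values only

print(density)
print(sp_values)
print(toy_sparse.sparse.fill_value)
```

The lower the density, the bigger the saving. The 2.28 Mb reported above appears consistent with storing each of the roughly 285k known ratings as a `float32` value plus its position.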


### Sparse Matrix 
`scipy.sparse` matrices store only the known (explicitly set) values. Available formats:

- `coo_matrix` - A sparse matrix in COOrdinate format
- `csc_matrix` - Compressed Sparse Column matrix
- `csr_matrix` - Compressed Sparse Row matrix
- `bsr_matrix` - Block Sparse Row matrix
- `dia_matrix` - Sparse matrix with DIAgonal storage
- `dok_matrix` - Dictionary Of Keys based sparse matrix
- `lil_matrix` - Row-based list of lists sparse matrix

**Classes for Sparse Matrix Creation**
- `coo_matrix`: (row, column, value)
- `dok_matrix`: (dict - key:(row, column), value:value)
- `lil_matrix`: list of lists

**Optimized Classes for Matrix Storage and Manipulation**
- `csr_matrix`
- `csc_matrix`
- `bsr_matrix`
- `dia_matrix`

**Most Commonly Used**

`coo_matrix`, `csr_matrix` and `csc_matrix`
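The split between "creation" and "storage and manipulation" classes suggests a common workflow: build the matrix in a format that is cheap to fill, then convert once. A minimal sketch with `dok_matrix` (the shape and values are made up for illustration):

```python
import scipy.sparse as sp

# DOK supports cheap element-by-element assignment
dok = sp.dok_matrix((3, 4), dtype=float)
dok[0, 1] = 2.0
dok[2, 3] = 5.0

# Convert once to CSR for arithmetic and fast row slicing
csr_from_dok = dok.tocsr()
print(csr_from_dok.toarray())
```

Assigning single elements directly into a CSR or CSC matrix also works, but SciPy warns that it is expensive because it changes the sparsity structure.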

In [30]:
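# coo_matrix creation from (values, (rows, columns))
# Note: position (2, 3) appears twice; duplicates are summed on conversion
rows_indxs = [1,  1, 0,  4,   2, 2]
cols_indxs = [0,  1, 0,  5,   3, 3]
values = [-2, 7, 19, 1.0, 6, 8]

coo = sp.coo_matrix((values, (rows_indxs, cols_indxs)))
coo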

<5x6 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in COOrdinate format>

In [31]:
coo.todense()

matrix([[19.,  0.,  0.,  0.,  0.,  0.],
        [-2.,  7.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 14.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  1.]])

In [32]:
# Accessing rows, columns and values
coo.row, coo.col, coo.data

(array([1, 1, 0, 4, 2, 2], dtype=int32),
 array([0, 1, 0, 5, 3, 3], dtype=int32),
 array([-2.,  7., 19.,  1.,  6.,  8.]))

In [33]:
# csr_matrix creation (coo.tocsc() gives a csc_matrix); duplicate entries are summed
csr = coo.tocsr()
csr

<5x6 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [34]:
csr.todense()

matrix([[19.,  0.,  0.,  0.,  0.,  0.],
        [-2.,  7.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., 14.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  1.]])

In [35]:
csr.indptr, csr.indices, csr.data

(array([0, 1, 3, 4, 4, 5], dtype=int32),
 array([0, 0, 1, 3, 5], dtype=int32),
 array([19., -2.,  7., 14.,  1.]))
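The three CSR arrays can be read as follows: row `i` owns the slice `indptr[i]:indptr[i + 1]` of both `indices` (column numbers) and `data` (values). A minimal sketch that rebuilds the same toy matrix and walks its rows:

```python
import scipy.sparse as sp

toy_rows = [1, 1, 0, 4, 2, 2]
toy_cols = [0, 1, 0, 5, 3, 3]
toy_vals = [-2, 7, 19, 1.0, 6, 8]
csr_toy = sp.coo_matrix((toy_vals, (toy_rows, toy_cols))).tocsr()

for i in range(csr_toy.shape[0]):
    start, end = csr_toy.indptr[i], csr_toy.indptr[i + 1]
    # Columns and values stored for row i (empty slices for an empty row)
    print(i, csr_toy.indices[start:end], csr_toy.data[start:end])
```

Row 3 has no stored values, which is why `indptr` repeats the value 4 at positions 3 and 4.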

### Matrix Creation from the Original DataFrame

In [37]:
df.head()

Unnamed: 0,user_id,item_id,progress,rating,start_date
0,126706,14433,80,,2018-01-01
1,127290,140952,58,,2018-01-01
2,66991,198453,89,,2018-01-01
3,46791,83486,23,5.0,2018-01-01
4,79313,188770,88,5.0,2018-01-01


Let's map each unique user ID and item ID to a consecutive integer index

In [44]:
# Users mapping
unique_users_indxs = dict(enumerate(df['user_id'].unique()))
unique_users_mapping = {v: k for k, v in unique_users_indxs.items()}

In [46]:
# Items mapping
unique_items_indxs = dict(enumerate(df['item_id'].unique()))
unique_items_mapping = {v: k for k, v in unique_items_indxs.items()}

In [51]:
# Row and column indices for the main sparse matrix
rows = df['user_id'].map(unique_users_mapping.get)
cols = df['item_id'].map(unique_items_mapping.get)

print('Rows N Missing: ', rows.isna().sum())
print('Columns N Missing: ', cols.isna().sum())

Rows N Missing:  0
Columns N Missing:  0
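As an alternative to building the dictionaries by hand, `pd.factorize` returns the integer codes and the array of unique values in one call, in order of first appearance. A minimal sketch on a toy frame (IDs copied from `df.head()`, the repetition is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [126706, 127290, 126706],
    'item_id': [14433, 140952, 198453],
})

# codes: integer index per row; index: position -> original ID
user_codes, user_index = pd.factorize(toy['user_id'])
item_codes, item_index = pd.factorize(toy['item_id'])

print(user_codes, list(user_index))
print(item_codes, list(item_index))
```

Here `user_index[k]` recovers the original ID for matrix row `k`, which plays the same role as `unique_users_indxs` above.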


In [52]:
matrix = sp.coo_matrix((
    np.ones(df.shape[0], dtype=np.int8),
    (rows, cols)
))

The resulting matrix has:
- Rows - Users
- Columns - Items
- Values - Binary

In [59]:
print('Matrix Shape: ', matrix.shape)
print('Memory Usage: ', num_bytes_format(matrix.data.nbytes + matrix.row.nbytes + matrix.col.nbytes))

Matrix Shape:  (151600, 59599)
Memory Usage:  13.7970 Mb


In [60]:
# Now let's combine progress and rating into interaction weights
df['weight'] = ((df['progress'] + 1) / 101) * (2 ** df['rating'])
df['weight'] = df['weight'].astype(np.float32)
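One caveat with this formula: `2 ** NaN` is NaN, so every row without a rating gets a NaN weight. A minimal sketch of the effect and of one possible workaround, assuming a missing rating should count as 0 (that choice is an assumption made for illustration, not something the notebook specifies):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'progress': [80, 23], 'rating': [np.nan, 5.0]})

# Original formula: NaN rating propagates into the weight
naive = ((toy['progress'] + 1) / 101) * (2 ** toy['rating'])

# Assumed workaround: treat a missing rating as 0, so 2 ** 0 == 1
filled = ((toy['progress'] + 1) / 101) * (2 ** toy['rating'].fillna(0))

print(naive.tolist())
print(filled.tolist())
```

Any other default (for example the mean rating) would work the same way; which one is appropriate depends on the model.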

In [62]:
# Redefine the matrix
matrix = sp.coo_matrix((
    df['weight'],
    (rows, cols)
))

print('Memory Usage: ', num_bytes_format(matrix.data.nbytes + matrix.row.nbytes + matrix.col.nbytes))

Memory Usage:  18.3960 Mb
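For per-user lookups the COO matrix is usually converted to CSR, where a row slice gives a user's items directly. A minimal sketch on toy data (the IDs come from `df.head()`, the weights are made up, and `pd.factorize` stands in for the mapping dictionaries):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

toy = pd.DataFrame({
    'user_id': [126706, 127290, 126706],
    'item_id': [14433, 140952, 198453],
    'weight': [1.0, 0.5, 2.0],
})
user_codes, user_index = pd.factorize(toy['user_id'])
item_codes, item_index = pd.factorize(toy['item_id'])

toy_csr = sp.coo_matrix(
    (toy['weight'].to_numpy(dtype=np.float32), (user_codes, item_codes))
).tocsr()

# Items of the user in matrix row 0, mapped back to original item IDs
row = toy_csr[0]
items_of_user = item_index[row.indices]
print(user_index[0], list(items_of_user), row.data)
```

The same pattern applies to the full `matrix` via `matrix.tocsr()`, with `unique_items_indxs` used to translate column indices back to item IDs.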
