# Batch architecture in Recommendation Systems

In this notebook we explain how to use the batch architecture to deploy a recommendation system solution.

The batch architecture is the most common architecture used in recommendation systems. It is based on the idea of creating a batch process that will run periodically to update the recommendations. It is very common in the industry to run the process every night. 

The batch process reads the data from the data sources, trains a machine learning algorithm, scores the model to produce the top k recommendations for every user, and finally stores the recommendations in a database.

Once the recommendations are in the database, they can be queried from the front end of the website or by an internal backend process. Showing recommendations to a user then only requires a `SELECT` against the database. This lookup is very fast and can typically be served in single-digit milliseconds, and the short query time benefits the user experience.
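
To make the lookup step concrete, here is a minimal sketch that times a primary-key `SELECT` on an in-memory SQLite table. The table name `recs`, its two-column layout, and the synthetic rows are assumptions made for illustration only. The measured time depends on the machine and the database engine, so treat it as an indication rather than a benchmark.

```python
import sqlite3
import time

# Hypothetical in-memory table standing in for the recommendations database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recs (user_id INTEGER PRIMARY KEY, items TEXT)")
conn.executemany(
    "INSERT INTO recs (user_id, items) VALUES (?, ?)",
    [(u, f"{u + 1},{u + 2},{u + 3}") for u in range(10_000)],
)
conn.commit()


def lookup(user_id):
    """Fetch the precomputed recommendations for one user by primary key."""
    row = conn.execute(
        "SELECT items FROM recs WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0].split(",") if row else []


start = time.perf_counter()
items = lookup(54)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"user 54 -> {items} in {elapsed_ms:.3f} ms")
```

Because the recommendations are precomputed, serving a request involves no model inference at all, only an indexed read.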


## 0 Global Settings and Imports

In [1]:
import numpy as np
import logging
import sqlite3

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.models.sar import SAR


In [2]:
# Top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = "100k"

# Other data settings
USER_COL = "userID"
ITEM_COL = "itemID"
RATING_COL = "rating"
TIMESTAMP_COL = "timestamp"
PREDICTION_COL = "prediction"

# Model settings
SIMILARITY_TYPE = "jaccard"
TIME_DECAY = 30  # number of days until the weight of a rating decays by 1/2
SEED = 42

# Database parameters
DATABASE = "recodb"
TABLE_NAME = "recommendations"

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)-8s %(message)s")

## 1 Data Preparation

We are going to use the MovieLens dataset for this example.

In [3]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE
)

# Convert the float precision to 32-bit in order to reduce memory consumption 
data[RATING_COL] = data[RATING_COL].astype(np.float32)

data.head()

2023-11-17 12:28:33,324 DEBUG    Starting new HTTPS connection (1): files.grouplens.org:443
2023-11-17 12:28:34,196 DEBUG    https://files.grouplens.org:443 "GET /datasets/movielens/ml-100k.zip HTTP/1.1" 200 4924029
2023-11-17 12:28:34,197 INFO     Downloading https://files.grouplens.org/datasets/movielens/ml-100k.zip
100%|██████████| 4.81k/4.81k [00:01<00:00, 3.11kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


## 2 Model scoring

We are going to score all the data using SAR. Notice first that we are not doing any train/test split. In production, we want to use all the available data. 

The model can be trained offline and then deployed into production. Depending on the behavior of your model or your data, there might be a degradation of the performance over time. It is a good practice to retrain the model periodically.
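
One simple way to decide when to retrain is to compare the age of the current model with a maximum allowed age. The sketch below assumes a nightly schedule. The name `needs_retraining` and the one-day threshold are hypothetical choices, not part of the notebook or of any library.

```python
from datetime import datetime, timedelta

# Assumed policy: retrain once the model is at least one day old (nightly batch).
MAX_MODEL_AGE = timedelta(days=1)


def needs_retraining(last_trained, now, max_age=MAX_MODEL_AGE):
    """Return True when the model is at least as old as the allowed age."""
    return now - last_trained >= max_age


now = datetime(2023, 11, 18, 2, 0)
print(needs_retraining(datetime(2023, 11, 17, 2, 0), now))   # trained 24 hours ago
print(needs_retraining(datetime(2023, 11, 17, 20, 0), now))  # trained 6 hours ago
```

In practice the trigger could also be based on a monitored quality metric instead of elapsed time.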


In [4]:
model = SAR(
    col_user=USER_COL,
    col_item=ITEM_COL,
    col_rating=RATING_COL,
    col_timestamp=TIMESTAMP_COL,
    similarity_type=SIMILARITY_TYPE, 
    time_decay_coefficient=TIME_DECAY,
    timedecay_formula=True,
    normalize=True
)
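
The `time_decay_coefficient` passed above is a half-life expressed in days: a rating that is 30 days older than the reference time counts half as much. The sketch below illustrates that idea with a plain function. It is our reading of the half-life formula, not the library's internal code, and `decayed_weight` and the reference timestamp are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def decayed_weight(rating, timestamp, reference_time, half_life_days=30):
    """Scale a rating by 0.5 ** (age / half-life), with age measured in days."""
    age_days = (reference_time - timestamp) / SECONDS_PER_DAY
    return rating * 0.5 ** (age_days / half_life_days)


reference_time = 893_286_638  # hypothetical "now" in Unix seconds
for days_ago in (0, 30, 60):
    ts = reference_time - days_ago * SECONDS_PER_DAY
    print(days_ago, decayed_weight(4.0, ts, reference_time))
```

A rating of 4.0 therefore contributes 4.0 when it is fresh, 2.0 after one half-life, and 1.0 after two.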

In [5]:
with Timer() as train_time:
    model.fit(data)

print(f"Took {train_time.interval} seconds for training.")

2023-11-17 12:28:36,276 INFO     Collecting user affinity matrix
2023-11-17 12:28:36,282 INFO     Calculating time-decayed affinities
2023-11-17 12:28:36,328 INFO     Creating index columns
2023-11-17 12:28:36,409 INFO     Calculating normalization factors
2023-11-17 12:28:36,447 INFO     Building user affinity sparse matrix
2023-11-17 12:28:36,453 INFO     Calculating item co-occurrence
2023-11-17 12:28:36,806 INFO     Calculating item similarity
2023-11-17 12:28:36,807 INFO     Using jaccard based similarity
2023-11-17 12:28:36,891 INFO     Done training


Took 0.6349668000000293 seconds for training.


In [6]:
with Timer() as scoring_time:
    top_k = model.recommend_k_items(data, top_k=TOP_K, remove_seen=True)

print(f"Took {scoring_time.interval} seconds for scoring.")

2023-11-17 12:28:36,903 INFO     Calculating recommendation scores
2023-11-17 12:28:37,254 INFO     Removing seen items


Took 0.39262639999992643 seconds for scoring.


In [7]:
top_k.sort_values(by=PREDICTION_COL, ascending=False, inplace=True)

top_k.head(50)

Unnamed: 0,userID,itemID,prediction
5230,532,69,4.665657
5231,532,172,4.645321
5232,532,423,4.643408
8460,849,204,4.628795
5233,532,174,4.619922
2140,118,195,4.619201
9190,928,174,4.594107
2141,118,183,4.58369
5234,532,385,4.578142
2142,118,89,4.577481


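SAR computes a score for every item in the dataset, and `recommend_k_items` already keeps the `TOP_K` highest-scoring items per user. The grouping below makes that limit explicit: after the sort above, it keeps at most `TOP_K` rows for every user.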
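
The "sort by score, then keep the first k rows per user" selection can be sketched in plain Python, without pandas. The rows and the helper `top_k_per_user` are hypothetical and only illustrate the logic.

```python
from collections import defaultdict

# Hypothetical (user, item, score) rows, standing in for the scored DataFrame.
scored = [
    (1, 10, 0.9), (1, 11, 0.4), (1, 12, 0.7),
    (2, 10, 0.2), (2, 13, 0.8),
]


def top_k_per_user(rows, k):
    """Sort by score descending, then keep the first k items seen per user."""
    kept = defaultdict(list)
    for user, item, score in sorted(rows, key=lambda r: r[2], reverse=True):
        if len(kept[user]) < k:
            kept[user].append(item)
    return dict(kept)


print(top_k_per_user(scored, k=2))
```

Each user ends up with at most `k` items, ordered from the highest to the lowest score.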

In [8]:
top_k_recommendations = top_k.groupby(USER_COL).head(TOP_K)
top_k_recommendations

Unnamed: 0,userID,itemID,prediction
5230,532,69,4.665657
5231,532,172,4.645321
5232,532,423,4.643408
8460,849,204,4.628795
5233,532,174,4.619922
...,...,...,...
6775,685,245,1.496370
6776,685,307,1.482737
6777,685,313,1.475710
6778,685,294,1.466042


Now let's look at the results for a specific user.


In [10]:
user_id = 54

In [11]:
items_seen = data[data[USER_COL] == user_id]
items_seen

Unnamed: 0,userID,itemID,rating,timestamp
232,54,106,3.0,880937882
336,54,595,3.0,880937813
512,54,742,5.0,880934806
806,54,302,4.0,880928519
1352,54,676,5.0,880935294
...,...,...,...,...
68542,54,634,1.0,892681013
70980,54,250,4.0,880933834
74116,54,823,2.0,880938088
78663,54,405,4.0,880934806


In [12]:
items_predicted = top_k[top_k[USER_COL] == user_id].sort_values(
    by=PREDICTION_COL, ascending=False
)
items_predicted

Unnamed: 0,userID,itemID,prediction
1300,54,300,2.784323
1301,54,294,2.601673
1302,54,248,2.548543
1303,54,286,2.458506
1304,54,282,2.436808
1305,54,271,2.433754
1306,54,293,2.3683
1307,54,315,2.367518
1308,54,222,2.357715
1309,54,301,2.354047


## 3 Batch deployment

Batch deployment consists of storing the results in a database so that an external process can access them.

This notebook uses [SQLite](https://docs.python.org/3/library/sqlite3.html) as the database, but any other relational or non-relational database can be used.

In [13]:
# Establish a connection to the database
conn = sqlite3.connect(database=DATABASE)

# Create a cursor object to execute SQL queries
cur = conn.cursor()

# Drop table if it already exists
query = f"DROP TABLE IF EXISTS {TABLE_NAME};"
cur.execute(query)

# Create a table to store your data
create_table_query = f"""
CREATE TABLE {TABLE_NAME} (
    user_id INT PRIMARY KEY,
    item1 TEXT,
    item2 TEXT,
    item3 TEXT,
    item4 TEXT,
    item5 TEXT,
    item6 TEXT,
    item7 TEXT,
    item8 TEXT,
    item9 TEXT,
    item10 TEXT
);
"""
cur.execute(create_table_query)

# Commit the changes (the connection stays open for the following cells)
conn.commit()


In [14]:
# Create a function to prepare and return the data for insertion
def prepare_data(user_group):
    user_id = user_group.name
    recommendations = user_group[ITEM_COL].tolist()
    # Pad with None (stored as NULL) when a user has fewer than TOP_K items
    recommendations.extend([None] * (TOP_K - len(recommendations)))
    return (user_id, *recommendations)

# Use groupby and apply to build one (user_id, item1, ..., item10) tuple per user
insert_data = top_k_recommendations.groupby(USER_COL).apply(prepare_data).tolist()

In [15]:
# Define the SQL statement for the bulk insert
insert_sql = f"""
    INSERT INTO {TABLE_NAME} (user_id, item1, item2, item3, item4, item5, item6, item7, item8, item9, item10)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""

# Use executemany to insert the data in a single transaction
cur.executemany(insert_sql, insert_data)


# Commit the changes 
conn.commit()


### 3.1 Query the database

Once the data is stored in the database, we can query it. The typical query is to get the top k recommendations for a specific user.
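
A serving process would usually wrap this query in a small helper. The sketch below is one possible shape, not part of the notebook: `get_recommendations` is a hypothetical name, and the three-slot table exists only to exercise it. It binds the user ID as a parameter, drops the `NULL` padding, and returns an empty list for unknown users.

```python
import sqlite3


def get_recommendations(conn, table, user_id):
    """Return the stored item IDs for a user, or [] if the user has no row."""
    # The table name cannot be bound as a parameter, so it must come from
    # trusted configuration; the user ID is bound with a "?" placeholder.
    row = conn.execute(
        f"SELECT * FROM {table} WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return []
    # Skip the user_id column and drop NULL padding.
    return [item for item in row[1:] if item is not None]


# Hypothetical three-slot table, used only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recommendations "
    "(user_id INT PRIMARY KEY, item1 TEXT, item2 TEXT, item3 TEXT)"
)
conn.execute(
    "INSERT INTO recommendations VALUES (?, ?, ?, ?)", (54, "300", "294", None)
)
print(get_recommendations(conn, "recommendations", 54))
print(get_recommendations(conn, "recommendations", 999))
```

Returning an empty list for users without a row lets the caller fall back to a default, such as the most popular items.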

In [16]:
# Bind user_id as a parameter instead of formatting it into the SQL string
query = f"SELECT * FROM {TABLE_NAME} WHERE user_id = ?"
cur.execute(query, (user_id,))
rows = cur.fetchall()
print(rows)


[(54, '300', '294', '248', '286', '282', '271', '293', '315', '222', '301')]


In [17]:
# Close the database connection
cur.close()
conn.close()

The batch architecture is the simplest and most common architecture in recommendation systems. It is easy to implement, offers very low serving latency, and works well when the data does not change very often.

Industries that use this architecture include retail, media and entertainment, ads, gaming, and travel.