# Ratings Predictor with TensorFlow

Ratings predictor for the MovieLens dataset using an Alternating Least Squares (ALS) model.

The ALS algorithm considers a matrix R of users' movie ratings, with one row per user and one column per item (m users x n items). R is factorized into two matrices: U (m users x k latent factors) and P (n items x k latent factors), so that the product of U and the transpose of P approximates R.

This notebook imports the dataset and pre-processes it into the matrix R, which is then factorized to obtain the U and P matrices.
Note that U and P, learned from R, constitute the machine learning model for the recommendation system.
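
To make the factorization concrete, here is a minimal sketch of plain (unweighted) ALS in NumPy on a toy matrix. It is not the WALS implementation used later in this notebook; the function name `als_factorize`, the toy ratings, and the values of `k` and `reg` are illustrative assumptions.

```python
import numpy as np

def als_factorize(R, mask, k=2, reg=0.1, num_iters=20, seed=0):
    """Factorize R (m x n) into U (m x k) and P (n x k) using observed entries only."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    P = rng.normal(scale=0.1, size=(n, k))
    ridge = reg * np.eye(k)
    for _ in range(num_iters):
        # Fix P and solve a small regularized least-squares problem per user.
        for u in range(m):
            obs = mask[u]
            if obs.any():
                Pu = P[obs]
                U[u] = np.linalg.solve(Pu.T @ Pu + ridge, Pu.T @ R[u, obs])
        # Fix U and solve the same kind of problem per item.
        for i in range(n):
            obs = mask[:, i]
            if obs.any():
                Ui = U[obs]
                P[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[obs, i])
    return U, P

# Toy ratings matrix: 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U, P = als_factorize(R, mask)
approx = U @ P.T  # approximates R on the observed entries
rmse = np.sqrt(np.mean((R[mask] - approx[mask]) ** 2))
print(approx.round(2))
print(rmse)
```

The entries of `approx` at positions where `R` is 0 are the predicted ratings for movies the user has not rated yet.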

#### References

[1] Takács, G. and Tikk, D. (2012). Alternating least squares for personalized ranking. Proceedings of the sixth ACM conference on Recommender systems - RecSys '12. [online] Available at: https://www.researchgate.net/publication/254464370_Alternating_least_squares_for_personalized_ranking [Accessed 18 Dec. 2018].

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from IPython.display import display
from scipy.sparse import coo_matrix

DATA = "data/movielens_1m/"

def open_dataset(dataset_name, fields, given_encoding):
    dataset_path = DATA + dataset_name + ".dat"
    dataframe = pd.read_csv(dataset_path, sep='::', names=fields, header=None, encoding=given_encoding, engine='python')
    return dataframe
    
# Importing the ratings dataset into a Pandas DataFrame.
ratings_df = open_dataset("ratings", ["UserID","MovieID","Rating","TimeStamp"], "utf-8")

# Removing the Timestamp column since we would not need it.
ratings_df = ratings_df.drop(['TimeStamp'], axis=1)
display(ratings_df)


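|   | UserID | MovieID | Rating |
|---|--------|---------|--------|
| 0 | 1 | 1193 | 5 |
| 1 | 1 | 661 | 3 |
| 2 | 1 | 914 | 3 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |
| 5 | 1 | 1197 | 3 |
| 6 | 1 | 1287 | 5 |
| 7 | 1 | 2804 | 5 |
| 8 | 1 | 594 | 4 |
| 9 | 1 | 919 | 4 |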


### Pre-processing: Converting data frame to Matrix.

Here we'll convert the ratings DataFrame to a NumPy matrix with 0-based indexing so that TensorFlow can process it easily. The only normalization applied is shifting the user and movie IDs down by one. We'll also keep the user index column as users_map (it indexes the rows of the U matrix) and the movie index column as movies_map (it indexes the rows of the P matrix); U and P are the matrix factors that constitute the model.

In [3]:
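#---Mapping item ratings matrix.
# DataFrame.as_matrix() was removed in pandas 1.0; select the columns and use .values instead.
ratings = ratings_df[["UserID", "MovieID", "Rating"]].values
# Shift user and movie IDs so that they start at 0.
ratings[:,0] -= 1
ratings[:,1] -= 1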

users_map = ratings[:,0]
movies_map = ratings[:,1]
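
Subtracting 1 yields 0-based indexes, but it leaves gaps whenever some IDs never appear in the ratings. As an alternative, the sketch below maps raw IDs to contiguous indexes with `np.unique`; the sample IDs and the variable names are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

# A few raw MovieIDs as they might appear in the ratings column.
raw_movie_ids = np.array([1193, 661, 914, 3408, 661, 1193])

# unique_ids[j] is the raw ID stored at contiguous index j;
# movie_idx[i] is the contiguous index of the i-th rating's movie.
unique_ids, movie_idx = np.unique(raw_movie_ids, return_inverse=True)

print(unique_ids)  # [ 661  914 1193 3408]
print(movie_idx)   # [2 0 1 3 0 2]

# Mapping back from a contiguous index to the raw ID is a plain lookup.
assert (unique_ids[movie_idx] == raw_movie_ids).all()
```

With this mapping the factor matrices need only as many rows as there are distinct users or movies.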

  


### Defining Training and test sets.


Using the newly mapped ratings, we'll define the training and test sets: 90% of the dataset is used for training and the remaining 10% for testing.

In [5]:
# Use 10% of the dataset as the test set.
test_set_size = int(len(ratings) / 10)

test_set_idx = np.random.choice(range(len(ratings)), size=test_set_size, replace=False)

test_set_idx = sorted(test_set_idx)

ts_ratings = ratings[test_set_idx]

tr_ratings = np.delete(ratings, test_set_idx, axis=0)

# Training sets.
u_tr, m_tr, r_tr = zip(*tr_ratings)

# Test sets.
u_ts, m_ts, r_ts = zip(*ts_ratings)

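# Training sparse coordinate matrix.
# NOTE: the shape below uses the number of ratings on each axis, which is larger than
# the number of distinct users and movies; (max user index + 1, max movie index + 1)
# would be a tighter shape and would be identical for the training and test matrices.
tr_sparse = coo_matrix((r_tr, (u_tr, m_tr)), shape=(len(u_tr), len(m_tr)))

# Test sparse coordinate matrix.
ts_sparse = coo_matrix((r_ts, (u_ts, m_ts)), shape=(len(u_ts), len(m_ts)))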
print(ts_sparse)

  (0, 3407)	4
  (0, 594)	5
  (0, 1034)	5
  (0, 2790)	4
  (0, 2017)	4
  (0, 2320)	3
  (0, 3185)	4
  (0, 1028)	5
  (1, 1356)	5
  (1, 1791)	3
  (1, 3029)	4
  (1, 367)	4
  (1, 2851)	3
  (1, 2027)	4
  (1, 1197)	4
  (1, 1123)	5
  (1, 162)	4
  (1, 20)	1
  (1, 2500)	5
  (1, 3677)	3
  (1, 1243)	3
  (1, 355)	5
  (1, 1244)	2
  (2, 2996)	3
  (2, 1290)	4
  :	:
  (6039, 212)	5
  (6039, 1833)	4
  (6039, 231)	5
  (6039, 259)	4
  (6039, 2857)	4
  (6039, 3818)	5
  (6039, 1135)	4
  (6039, 1187)	4
  (6039, 1188)	5
  (6039, 2145)	1
  (6039, 1899)	5
  (6039, 1911)	3
  (6039, 1920)	4
  (6039, 317)	4
  (6039, 1944)	5
  (6039, 1951)	5
  (6039, 1234)	4
  (6039, 1258)	3
  (6039, 1284)	4
  (6039, 3288)	5
  (6039, 447)	4
  (6039, 1718)	5
  (6039, 2749)	2
  (6039, 3702)	4
  (6039, 2018)	5
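
The split logic can be checked on a small example. The sketch below uses a hypothetical toy array of (user, movie, rating) triples and a boolean mask for the hold-out rows; it also derives the matrix dimensions from the largest index on each axis, which is an assumption about how the shape could be chosen rather than what the cell above does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user_idx, movie_idx, rating) triples with 0-based indexes.
toy = np.array([[0, 0, 5], [0, 2, 3], [1, 1, 4], [2, 0, 1], [2, 3, 2],
                [3, 1, 5], [3, 2, 4], [4, 3, 3], [4, 0, 2], [5, 1, 1]])

# Matrix dimensions derived from the largest index on each axis.
n_users = int(toy[:, 0].max()) + 1
n_items = int(toy[:, 1].max()) + 1

# Hold out 10% of the rows as the test set.
test_idx = rng.choice(len(toy), size=len(toy) // 10, replace=False)
is_test = np.zeros(len(toy), dtype=bool)
is_test[test_idx] = True

train, test = toy[~is_test], toy[is_test]
print(n_users, n_items, len(train), len(test))
```

The pair `(n_users, n_items)` could then be passed as `shape=` to `coo_matrix` for both the training and the test matrix, so that both share the same dimensions.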


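## Implementing WALS (Weighted Alternating Least Squares) in TensorFlow

Weighted Alternating Least Squares (WALS) is a weighted variant of the ALS algorithm. TensorFlow 1.x provides an implementation in `tf.contrib.factorization.WALSModel`, which is used here.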

Implementation based on this tutorial: https://cloud.google.com/solutions/machine-learning/recommendation-system-tensorflow-create-model
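
For reference, the objective that WALS minimizes can be sketched as follows. This is based on my reading of the TensorFlow 1.x documentation for `WALSModel`, written with the U and P notation from the introduction; the symbols $w_0$, $a_i$, $b_j$ and $\lambda$ are labels chosen here for `unobserved_weight`, `row_weights`, `col_weights` and `regularization`, and unobserved entries of R are treated as 0.

$$
L(U, P) = \sum_{i,j} W_{ij}\,\bigl(R_{ij} - u_i^\top p_j\bigr)^2 + \lambda\,\bigl(\lVert U \rVert_F^2 + \lVert P \rVert_F^2\bigr),
\qquad
W_{ij} =
\begin{cases}
w_0 + a_i\, b_j & \text{if } R_{ij} \text{ is observed} \\
w_0 & \text{otherwise}
\end{cases}
$$

Here $u_i$ is row $i$ of U and $p_j$ is row $j$ of P. Each iteration fixes one factor matrix and solves a weighted least-squares problem for the other, which corresponds to the alternating row and column updates in the training loop.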

In [7]:
def train_model(args, tr_sparse):
    '''
    Based on: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/10_recommend/endtoend/wals_ml_engine/trainer/wals.py#L46
    '''
    
    # Getting parameters.
    dim = args['latent_factors']
    num_iters = args['num_iters']
    reg = args['regularization']
    unobs = args['unobs_weight']
    wt_type = args['wt_type']
    feature_wt_exp = args['feature_wt_exp']
    obs_wt = args['feature_wt_factor']
    col_wts = args['column_weights']
    row_wts = args['row_weights']
    
    # Generating the input tensor.
    input_tensor = tf.SparseTensor(indices=list(zip(tr_sparse.row, tr_sparse.col)), values=(tr_sparse.data).astype(np.float32), dense_shape=tr_sparse.shape)
    
    # Generating the model.
    model = tf.contrib.factorization.WALSModel(tr_sparse.shape[0], tr_sparse.shape[1], dim, unobserved_weight=unobs, regularization=reg, row_weights=row_wts, col_weights=col_wts)
    
    row_factor = model.row_factors[0]
    col_factor = model.col_factors[0]
    
    # Training the model.
    sess = tf.Session(graph=input_tensor.graph)
    
    with input_tensor.graph.as_default():
        row_update_op = model.update_row_factors(sp_input=input_tensor)[1]
        col_update_op = model.update_col_factors(sp_input=input_tensor)[1]
    
        sess.run(model.initialize_op)
        sess.run(model.worker_init)

        for _ in range(num_iters):
            sess.run(model.row_update_prep_gramian_op)
            sess.run(model.initialize_row_update_op)
            sess.run(row_update_op)
            sess.run(model.col_update_prep_gramian_op)
            sess.run(model.initialize_col_update_op)
            sess.run(col_update_op)
            
    
    out_row = row_factor.eval(session=sess)
    out_col = col_factor.eval(session=sess)
    
    # After evaluating the output we close the training session.
    sess.close()
    
    # out_row: trained user factors, out_col: trained item factors.
    # row_factor and col_factor are the corresponding TensorFlow variables.
    return out_row, out_col, row_factor, col_factor
        
    

def generate_recommendations(user_ids, user_rated, row_factor, col_factor, num_recommendations):
    '''
    user_rated: indices.
    Code based on: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/10_recommend/endtoend/wals_ml_engine/trainer/model.py#L325
    '''
    
    assert(row_factor.shape[0] - len(user_rated)) >= num_recommendations
    
    # Retrieve the user factor.
    user_factor = row_factor[user_ids]
    
    # Dot product of item factors with user factor gives predicted ratings.
    predicted_ratings = col_factor.dot(user_factor)
    
    # Find candidate recommended item indexes sorted by predicted ratings.
    num_recomm = num_recommendations + len(user_rated)
    candidate_items = np.argsort(predicted_ratings)[-num_recomm:]
    
    # Remove previously rated items and take top k
    recommended_items = [i for i in candidate_items if i not in user_rated]
    recommended_items = recommended_items[-num_recommendations:]
    
    # Flip to sort the highest rated first.
    recommended_items.reverse()
    
    return recommended_items



def make_weights(data, weight_type, obs_weight, feature_wt_exp, axis):
    '''
    Obtained from: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/10_recommend/endtoend/wals_ml_engine/trainer/task.py
    '''
    
    # Reciprocal of the number of rated entries, counted across rows (if axis is 0)
    frac = np.array(1.0/(data > 0.0).sum(axis))
    
    # Filter any invalid entries
    frac[np.ma.masked_invalid(frac).mask] = 0.0
    
    # Normalize weights according to assumed distribution of ratings
    if weight_type == 0:
        wts = np.array(np.power(frac, feature_wt_exp)).flatten()
    else:
        wts = np.array(obs_weight * frac).flatten()

    # check again for any numerically unstable entries
    assert np.isfinite(wts).sum() == wts.shape[0]
    return wts
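
To see what these weights look like, here is a small sketch of the same reciprocal-count idea on a dense toy array. The helper name `reciprocal_count_weights` and the toy ratings are assumptions made for illustration; the notebook itself passes a sparse matrix to make_weights.

```python
import numpy as np

def reciprocal_count_weights(dense, axis, weight_type, obs_weight, exponent):
    """Weights from 1 / (number of ratings along `axis`), with 0 where nothing is rated."""
    counts = (dense > 0).sum(axis=axis).astype(float)
    frac = np.divide(1.0, counts, out=np.zeros_like(counts), where=counts > 0)
    if weight_type == 0:
        return np.power(frac, exponent)  # exponential weighting
    return obs_weight * frac             # linear weighting

# Rows are users, columns are movies; 0 means "not rated".
toy = np.array([[5, 0, 3, 0],
                [0, 0, 4, 0],
                [1, 2, 5, 0]], dtype=float)

# Ratings per movie are [2, 1, 3, 0]; ratings per user are [2, 1, 3].
col_linear = reciprocal_count_weights(toy, 0, 1, 100.0, 0.08)
row_exp = reciprocal_count_weights(toy, 1, 0, 100.0, 0.5)
print(col_linear)
print(row_exp)
```

Movies and users with fewer ratings receive larger weights, and an unrated movie receives weight 0.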

### Training WALS model

In [8]:
# Defining initial parameters.

# Defining weights for WALS model.
col_wts = make_weights(tr_sparse, 0, 100.0, 0.08, 0)
row_wts = make_weights(tr_sparse, 0, 100.0, 0.08, 1)

INITIAL_PARAMS = {
    "weights": True,
    "latent_factors": 34,
    "num_iters": 20,
    "regularization": 9.83,
    "unobs_weight": 0.001,
    "wt_type": 0,
    "feature_wt_factor": 189.8,
    "feature_wt_exp": 0.08,
    "column_weights": col_wts,
    "row_weights": row_wts
}

output_row, output_col, r_factor, c_factor = train_model(INITIAL_PARAMS, tr_sparse)




### Calculate RMSE

Calculate the Root Mean Squared Error (RMSE) to measure the model's prediction performance on the training and test sets.

In [9]:
def calculate_rmse(out_row, out_col, actual):
    '''
    Obtained from: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/10_recommend/endtoend/wals_ml_engine/trainer/wals.py#L24
    '''
    mse = 0
    
    for i in range(actual.data.shape[0]):
        row_pred = out_row[actual.row[i]]
        col_pred = out_col[actual.col[i]]
        err = actual.data[i] - np.dot(row_pred, col_pred)
        mse += err * err
        
    mse /= actual.data.shape[0]
    rmse = np.sqrt(mse)
    
    return rmse

# Evaluating training performance.
train_rmse = calculate_rmse(output_row, output_col, tr_sparse)
print(train_rmse)

# Evaluating test performance.
test_rmse = calculate_rmse(output_row, output_col, ts_sparse)
print(test_rmse)

3.7520542928348384


3.748865055205765
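
The per-entry loop in calculate_rmse can also be written in vectorized form. The sketch below is one way to do it, checked on hand-made toy factors; the function name `rmse_from_factors` and all values are illustrative assumptions.

```python
import numpy as np

def rmse_from_factors(row_factors, col_factors, rows, cols, values):
    """RMSE between observed values and the dot products of the matching factor rows."""
    # Row-wise dot product between each rating's user factor and item factor.
    preds = np.einsum("ij,ij->i", row_factors[rows], col_factors[cols])
    return float(np.sqrt(np.mean((values - preds) ** 2)))

# Toy factors: 2 users and 2 items with 2 latent factors each.
row_factors = np.array([[1.0, 0.0], [0.0, 1.0]])
col_factors = np.array([[2.0, 0.0], [0.0, 3.0]])

# Observed ratings in coordinate form: (row, col, value).
rows = np.array([0, 1, 0])
cols = np.array([0, 1, 1])
values = np.array([2.0, 3.0, 1.0])

# Predictions are [2, 3, 0], so the squared errors are [0, 0, 1].
rmse = rmse_from_factors(row_factors, col_factors, rows, cols, values)
print(rmse)
```

With a `coo_matrix`, the `row`, `col` and `data` attributes could be passed as `rows`, `cols` and `values`.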


### Generating recommendations with WALS model.

In [10]:
n_recommendations = 5
# User id: 3
user_idx = np.searchsorted(users_map, 3)

# Taken from training set, but we can use the test set as well.
already_rated_by_user = [2355, 1197, 1287]

# Indices where already_rated_by_user movies are in the movies_map
indices_user_rated_movies = [np.searchsorted(movies_map, i) for i in already_rated_by_user]

recommendations = generate_recommendations(user_idx, indices_user_rated_movies, output_row, output_col, n_recommendations)

# Movie indices recommended for user_id 3.
print(recommendations)

[23145, 441293, 727418, 641859, 628451]
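
The ranking step behind generate_recommendations can be checked on toy factors where the expected answer is easy to verify by hand. This is a simplified sketch, not the function above; the name `top_k_unrated` and the factor values are hypothetical.

```python
import numpy as np

def top_k_unrated(user_factor, item_factors, rated, k):
    """Indexes of the k highest-scoring items that the user has not rated yet."""
    scores = item_factors @ user_factor  # predicted rating per item
    order = np.argsort(scores)[::-1]     # item indexes, best first
    rated = set(rated)
    return [int(i) for i in order if int(i) not in rated][:k]

# One user and five items with 2 latent factors each.
user_factor = np.array([1.0, 0.0])
item_factors = np.array([[0.9, 0.1],
                         [0.2, 0.8],
                         [0.5, 0.5],
                         [0.7, 0.3],
                         [0.1, 0.9]])

# Scores are [0.9, 0.2, 0.5, 0.7, 0.1]; item 0 is already rated.
recommended = top_k_unrated(user_factor, item_factors, rated=[0], k=2)
print(recommended)  # [3, 2]
```

The returned values are row indexes into the item factor matrix, so they need to be mapped back to MovieIDs before being shown to a user.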
