<img src="Instacart_logo_small.png" alt="Instacart" style="width: 100px;"/>

# Recommend products to Instacart Customers

## Introduction

As the dictionary defines it, a **recommendation** is a suggestion or piece of advice that something is good.
<br><br>In today's world of abundant choice, customers are often overwhelmed by options. Browsing through hundreds of products makes shopping a challenging and time-consuming experience.
<br>So, how about we log in to a shopping site and see 10 recommended products tailored to our taste? Just **Add** those to the cart, **checkout**, and done! This is what a recommendation engine does. A ***Recommendation Engine*** is a machine learning technique that lets us predict what a user may or may not like among a list of given items.
<br><br>Here, let's use the dataset provided by **Instacart** to build a recommendation engine and evaluate how well our recommendations work.

## Prepare Data

#### Load Libraries

In [1]:
# Imports
from implicit.als import AlternatingLeastSquares
from datetime import datetime
from pathlib import Path

import scipy.sparse as sparse
import implicit
import pandas as pd
import numpy as np
import pickle
import time
from joblib import dump, load
from sklearn import metrics
import random

#### Load Data

In [2]:
# Order datasets
df_order_products_prior = pd.read_csv("instacart_2017_05_01/order_products_prior.csv")
df_order_products_train = pd.read_csv("instacart_2017_05_01/order_products_train.csv")
df_orders = pd.read_csv("instacart_2017_05_01/orders.csv") 
# Products
df_products = pd.read_csv("instacart_2017_05_01/products.csv")
# Departments
df_departments = pd.read_csv("instacart_2017_05_01/departments.csv")
# Merge prior orders and products
df_merged_order_products_prior = pd.merge(df_order_products_prior, df_products, on="product_id", how="left")
df_merged_order_products_prior = pd.merge(df_merged_order_products_prior, df_departments, on="department_id", how="left")

In [3]:
df_merged_order_products_prior.head()

| | order_id | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | department |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | Organic Egg Whites | 86 | 16 | dairy eggs |
| 1 | 2 | 28985 | 2 | 1 | Michigan Organic Kale | 83 | 4 | produce |
| 2 | 2 | 9327 | 3 | 0 | Garlic Powder | 104 | 13 | pantry |
| 3 | 2 | 45918 | 4 | 1 | Coconut Butter | 19 | 13 | pantry |
| 4 | 2 | 30035 | 5 | 0 | Natural Sweetener | 17 | 13 | pantry |


## Part 1 - Basic Recommendations

### 1. Trending Products at Instacart

New to Instacart and not sure what to order? Let's check out what other customers are buying.

In [4]:
def get_k_popular(k, df_items):
    """
    Returns the `k` most popular products based on purchase count in the dataset
    """
    popular_products = list(df_items["product_name"].value_counts().head(k).index)
    return popular_products

In [5]:
get_k_popular(10,df_merged_order_products_prior)

['Banana',
 'Bag of Organic Bananas',
 'Organic Strawberries',
 'Organic Baby Spinach',
 'Organic Hass Avocado',
 'Organic Avocado',
 'Large Lemon',
 'Strawberries',
 'Limes',
 'Organic Whole Milk']

### 2. Recommendations from the Departments that the Customers are interested in

In [6]:
def get_k_popular_dept_items(k, dept_id, df_items):
    """
    Returns the `k` most popular products from the Dept that is passed as a parameter
    
    k       : No. Of Recommendations
    dept_id : Pass in the Department Id that you are looking details for
    df_items: Pass in the Dataframe with Details
    
    """
    dept_popular_products = list(df_items[df_items.department_id == dept_id]["product_name"].value_counts().head(k).index)
    return dept_popular_products

#### Let's see the 10 most popular products from department_id = 6 (International)

In [7]:
get_k_popular_dept_items(10,6,df_merged_order_products_prior)

['Organic Sea Salt Roasted Seaweed Snacks',
 'Taco Seasoning',
 'New Mexico Taco Skillet Sauce For Chicken',
 'Sriracha Chili Sauce',
 'Original Roasted Seaweed Snacks',
 'Coconut Milk',
 'Organic Spicy Taco Seasoning',
 'Sriracha Hot Chili Sauce',
 'Roasted Sesame Seaweed Snacks',
 'Sliced Water Chestnuts']

### 3. Recommended Items that other Customers Often Buy Again

In [8]:
def get_reordered_prods(k, df_items):
    """
    Returns the `k` most frequently reordered products
    
    k       : No. Of Recommendations
    df_items: Pass in the Dataframe with Details, already filtered to reordered == 1
    
    """
    reordered_products = list(df_items["product_name"].value_counts().head(k).index)
    return reordered_products

In [9]:
df = df_merged_order_products_prior[df_merged_order_products_prior.reordered ==1]
get_reordered_prods(10, df)

['Banana',
 'Bag of Organic Bananas',
 'Organic Strawberries',
 'Organic Baby Spinach',
 'Organic Hass Avocado',
 'Organic Avocado',
 'Organic Whole Milk',
 'Large Lemon',
 'Organic Raspberries',
 'Strawberries']

### 4. Shop Unique Items

In [10]:
def get_k_unique(k, df_items):
    """
    Returns the `k` least-purchased ("unique") products based on purchase count in the dataset
    """
    unique_products = list(df_items["product_name"].value_counts().tail(k).index)
    return unique_products

In [11]:
get_k_unique(10,df_merged_order_products_prior)

['Max White With Polishing Star Soft Toothbrush',
 'Salsa, Black Bean',
 'Flame Roasted Red Peppers Spreadable Cheese',
 'Original Salted Caramel Protein Energy Bar',
 'Seasoned Southern Style Red Beans And Rice',
 "Frittata, Farmer's Market",
 'Drink Distinct All Natural Soda Pineapple Coconut & Nutmeg',
 'Chelated Magnesium 250 Mg Gluten Free',
 'Hot Oatmeal Multigrain Raisin',
 'Zingz Queso Fundido Baked Snack Crackers']

## Part 2 - Build Recommendation Engine using Collaborative Filtering

**Collaborative Filtering:** A recommendation technique that draws on a user's past behavior (items previously purchased or reordered) as well as similar decisions made by other users.
<br><br> **Assumption:** If person A has the same opinion as person B on one product, A is more likely to share B's opinion on a different product than the opinion of a randomly chosen person. Hence, these predictions are specific to the user but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on the number of times it was bought or reordered.
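
To make the assumption concrete, here is a minimal neighbor-based sketch. It is not the ALS model built later in this notebook; it only illustrates the "similar users like similar items" idea. The purchase histories and the helper names `jaccard` and `recommend_from_nearest` are invented for illustration.

```python
def jaccard(a, b):
    """Overlap between two purchase sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_from_nearest(target, baskets):
    """Suggest items bought by the most similar other user but not by `target`."""
    others = {u: items for u, items in baskets.items() if u != target}
    nearest = max(others, key=lambda u: jaccard(baskets[target], others[u]))
    return nearest, sorted(others[nearest] - baskets[target])


# Hypothetical purchase histories, invented for illustration
baskets = {
    "A": {"banana", "spinach", "avocado"},
    "B": {"banana", "spinach", "avocado", "limes"},
    "C": {"dog food", "soda"},
}

nearest, suggestions = recommend_from_nearest("A", baskets)
print(nearest, suggestions)
```

User A shares three of four items with B and nothing with C, so the suggestion for A comes from B's basket.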

#### Prepare the DataSet

In this project, we will use the Prior Data for training a Model. 

In [12]:
# Training Dataset Based on Reordered Quantity
data = pd.merge(df_orders.loc[df_orders.eval_set == "prior"], df_order_products_prior[["order_id", "product_id","reordered"]], on="order_id")
data = data.dropna()
data = data.copy()
data = data[["user_id", "product_id","reordered"]]
data = data.groupby(["user_id", "product_id"])['reordered'].sum().reset_index()

In [14]:
# Map user and product IDs to contiguous integer codes (0..n-1) for the sparse matrices
c = data.product_id.astype('category')
d = dict(enumerate(c.cat.categories))  # category code -> original product_id
data["user_id"] = data["user_id"].astype("category").cat.codes
data["product_id"] = data["product_id"].astype("category").cat.codes
# Keep the original product_id next to its code so product names can be looked up later
data['prev_product_id'] = data['product_id'].map(d)
data[:10]

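| | user_id | product_id | reordered | prev_product_id |
|---|---|---|---|---|
| 0 | 0 | 195 | 9 | 195 |
| 1 | 0 | 10249 | 8 | 10249 |
| 2 | 0 | 10317 | 0 | 10317 |
| 3 | 0 | 12416 | 9 | 12416 |
| 4 | 0 | 13021 | 2 | 13021 |
| 5 | 0 | 13165 | 1 | 13165 |
| 6 | 0 | 17110 | 0 | 17110 |
| 7 | 0 | 25119 | 7 | 25119 |
| 8 | 0 | 26070 | 1 | 26070 |
| 9 | 0 | 26387 | 1 | 26387 |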
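
As a small illustration of how category codes relate to the original IDs, the sketch below encodes a few made-up product IDs and maps the codes back. The IDs are hypothetical and chosen only to show that codes follow the sorted order of the distinct values.

```python
import pandas as pd

# Hypothetical raw product IDs, invented for illustration
raw_ids = pd.Series([205, 101, 205, 990])

cat = raw_ids.astype("category")
codes = cat.cat.codes                              # contiguous codes 0..n-1
code_to_raw = dict(enumerate(cat.cat.categories))  # code -> original ID

print(codes.tolist())
print(code_to_raw)
print(codes.map(code_to_raw).tolist())
```

Building the `code_to_raw` lookup before overwriting the ID column is what makes the round trip possible; once the column holds codes, the original IDs can only be recovered through that dictionary.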


The implicit library expects data as an item-user matrix. So, we create two matrices: one for fitting the model (item-user) and one for recommendations (user-item).

In [15]:
# item_user matrix
sparse_item_user = sparse.csr_matrix((data['reordered'].astype(float), (data['product_id'], data['user_id'])))
# user_item matrix
sparse_user_item = sparse.csr_matrix((data['reordered'].astype(float), (data['user_id'], data['product_id'])))

In [16]:
## Prep the Data
item_user_matrix = sparse.coo_matrix((data["reordered"],
                                            (data["product_id"],
                                             data["user_id"])))
# Construct a sparse matrix for our users and items containing the number of reorders
item_user_matrix = item_user_matrix.tocsr()
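
The same construction can be seen on a toy scale. The sketch below builds an item-user matrix from a handful of invented `(item, user, count)` triplets and derives the user-item view by transposing; the values are hypothetical.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical (item code, user code, reorder count) triplets, invented for illustration
item_codes = np.array([0, 0, 1, 2])
user_codes = np.array([0, 1, 1, 2])
counts = np.array([3.0, 1.0, 2.0, 5.0])

# Rows are items, columns are users
item_user = sparse.csr_matrix((counts, (item_codes, user_codes)), shape=(3, 3))
# Transposing gives the user-item orientation (rows are users)
user_item = item_user.T.tocsr()

print(item_user.toarray())
print(user_item.toarray())
```

Only the four listed cells are stored; every other cell is an implicit zero, which is what keeps a matrix of this kind manageable at the scale of the full dataset.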

Let's check the sparsity of the matrix!

In [17]:
matrix_size = item_user_matrix.shape[0]*item_user_matrix.shape[1] # Number of possible interactions in the matrix
num_purchases = len(item_user_matrix.nonzero()[0]) # Number of items interacted with
sparsity = 100*(1 - (num_purchases/matrix_size))
sparsity

99.94798258401877
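
The sparsity figure is simply the share of empty cells. A minimal worked example, using a hypothetical helper `sparsity_pct` and made-up matrix dimensions:

```python
def sparsity_pct(n_rows, n_cols, n_nonzero):
    """Percentage of cells in an n_rows x n_cols matrix that are empty."""
    return 100 * (1 - n_nonzero / (n_rows * n_cols))


# A 4 x 5 matrix has 20 cells; with 5 interactions, 15 cells are empty
toy = sparsity_pct(n_rows=4, n_cols=5, n_nonzero=5)
print(toy)  # 75.0
```

A value near 99.95% therefore means that only about 5 in every 10,000 possible item-user pairs were observed.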

For testing purposes, we will hide a randomly chosen percentage of the user/item interactions from the model during the training phase. Then, during the test phase, we check how many of the recommended items the user actually ended up purchasing.

Our test set is an exact copy of the original data. The training set, however, will mask a random percentage of user/item interactions and act as if the user never purchased the item (making it a sparse entry with a zero). We then check in the test set which items were recommended to the user that they ended up actually purchasing. If the users frequently ended up purchasing the items most recommended to them by the system, we can conclude the system seems to be working.

As an additional check, we can compare our system to simply recommending the most popular items to every user (beating popularity is a bit difficult). This will be our baseline.

This method of testing isn’t necessarily the “correct” answer, because it depends on how you want to use the recommender system. However, it is a practical way of testing performance I will use for this example.

Now that we have a plan on how to separate our training and testing sets, let’s create a function that can do this for us. We will also use the random library (imported earlier) and set a seed so that you will see the same results as I did.

### Split Train and Test Data

In [18]:
def make_train_test(data, pct_test = 0.2):
    '''
    This function will take in the original user-item matrix and "mask" a percentage of the original ratings where a
    user-item interaction has taken place for use as a test set. The test set will contain all of the original ratings, 
    while the training set replaces the specified percentage of them with a zero in the original ratings matrix. 
    
    '''
    test_set = data.copy() # Make a copy of the original set to be the test set. 
    test_set[test_set != 0] = 1 # Store the test set as a binary preference matrix
    training_set = data.copy() # Make a copy of the original data we can alter as our training set. 
    nonzero_inds = training_set.nonzero() # Find the indices in the ratings data where an interaction exists
    nonzero_pairs = list(zip(nonzero_inds[0], nonzero_inds[1])) # Zip these pairs together of user,item index into list
    random.seed(0) # Set the random seed to zero for reproducibility
    num_samples = int(np.ceil(pct_test*len(nonzero_pairs))) # Round the number of samples needed up to the next integer
    samples = random.sample(nonzero_pairs, num_samples) # Sample a random number of user-item pairs without replacement
    user_inds = [index[0] for index in samples] # Get the user row indices
    item_inds = [index[1] for index in samples] # Get the item column indices
    training_set[user_inds, item_inds] = 0 # Assign all of the randomly chosen user-item pairs to zero
    training_set.eliminate_zeros() # Get rid of zeros in sparse array storage after update to save space
    return training_set, test_set, list(set(user_inds)) # Output the unique list of user rows that were altered  

In [19]:
df_train, df_test, item_users_altered = make_train_test(item_user_matrix, pct_test = 0.2)
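
To see what the masking does on a small scale, here is a simplified dense-array sketch of the same idea. It is not a drop-in replacement for `make_train_test`; the helper name `mask_interactions` and the toy matrix are invented for illustration.

```python
import random

import numpy as np


def mask_interactions(matrix, pct_test=0.2, seed=0):
    """Return (train, test, hidden): test is the binarized original,
    train has a share of its nonzero cells set to zero."""
    test = (matrix != 0).astype(int)
    train = matrix.copy()
    rows, cols = np.nonzero(matrix)
    pairs = list(zip(rows.tolist(), cols.tolist()))
    rng = random.Random(seed)                        # local RNG for reproducibility
    n_hidden = int(np.ceil(pct_test * len(pairs)))   # round up, as in the notebook
    hidden = rng.sample(pairs, n_hidden)             # sample without replacement
    for r, c in hidden:
        train[r, c] = 0
    return train, test, hidden


# Hypothetical 5 x 4 interaction counts with 10 nonzero cells
toy = np.array([[2, 0, 1, 0],
                [0, 3, 0, 1],
                [1, 0, 0, 4],
                [0, 1, 2, 0],
                [5, 0, 1, 0]])

train, test, hidden = mask_interactions(toy, pct_test=0.2)
print(np.count_nonzero(toy), np.count_nonzero(train), len(hidden))
```

Every hidden pair is still marked as an interaction in `test` but is zero in `train`, which is exactly the gap the model is later asked to fill.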

###  Implicit Matrix Factorization using ALS (Alternating Least Squares)

**Implicit Data:** Data gathered from users' behaviour, without explicit ratings or feedback, is implicit data. For example, with star ratings we know that a 1 means the user did not like the item and a 5 that they really loved it. But here, we do not have a rating for any item. So, we need to build a recommendation engine based on which items a customer purchased and how many times they were reordered.

**Alternating Least Squares (ALS):** 
Alternating Least Squares (ALS) is the model we’ll use to fit our data and find similarities. ALS uses matrix factorization to produce recommendations.

**Matrix Factorization:** The idea is to take a large matrix and factor it into some smaller representation of the original matrix.  
<br>We have an original matrix R of size **MxN**, where M is the number of users and N is the number of items. This matrix is quite sparse, since most users only interact with a few items each. We can factorize this matrix into two separate smaller matrices: one with dimensions **MxK**, which holds the latent user feature vectors for each user (U), and a second with dimensions **KxN**, which holds the latent item feature vectors for each item (V). Multiplying these two feature matrices together approximates the original matrix, but now we have two dense matrices with K latent features for each of our items and users. We calculate U and V so that their product approximates R as closely as possible: **R ≈ U x V.**
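
The factorization idea can be sketched in a few lines of NumPy. This is a plain regularized ALS on a small dense matrix, written only to illustrate R ≈ U x V; it treats every cell equally and is not the confidence-weighted implicit-feedback variant that the `implicit` library fits. The function name `als_factorize` and the toy matrix are invented for illustration.

```python
import numpy as np


def als_factorize(R, k=2, reg=0.1, iterations=20, seed=0):
    """Alternate between solving for U (with V fixed) and V (with U fixed)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    V = rng.normal(scale=0.1, size=(k, n))   # random start for the item factors
    ridge = reg * np.eye(k)
    for _ in range(iterations):
        # With V fixed, U solves a ridge regression: U (V V^T + reg I) = R V^T
        U = np.linalg.solve(V @ V.T + ridge, V @ R.T).T
        # With U fixed, V solves a ridge regression: (U^T U + reg I) V = U^T R
        V = np.linalg.solve(U.T @ U + ridge, U.T @ R)
    return U, V


# Hypothetical 5-user x 4-item interaction counts, invented for illustration
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

U, V = als_factorize(R, k=2)
print(U.shape, V.shape)            # (5, 2) (2, 4)
print(np.linalg.norm(R - U @ V))   # reconstruction error with K = 2
```

Each half-step is an ordinary least-squares problem with a closed-form solution, which is why the method alternates rather than optimizing U and V jointly.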

In [20]:
def confidence_matrix(input_matrix, alpha):
    """
    Given a utility matrix,
    Returns the given matrix converted to a confidence matrix
    """
    return (input_matrix * alpha).astype("double")

In [21]:
def implicit_als(input_matrix, **kwargs):
    """
    Given the utility matrix and model parameters,
    builds the model and writes it to disk.
    Args:
    input_matrix (csr_matrix): Our sparse item-by-user matrix
    alpha (int, keyword)     : The rate at which we'll increase our confidence in a preference with more interactions.
    
    """
    start = time.time()
    
    # Build model
    print("Building ALS model with alpha: {} ".format(kwargs["alpha"]))
    model = AlternatingLeastSquares(factors=20, regularization=0.1, iterations=40)
    #model.approximate_similar_items = True
    
    # Calculate the confidence by multiplying it by alpha value.
    data_conf = confidence_matrix(input_matrix, kwargs["alpha"])
    
    model.fit(data_conf)

    # Save model to disk
    filename = 'baseline_model.sav'
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    
    print("Completed in {:.2f}s".format(time.time() - start))


#### Call the Function to build the Model

In [22]:
# Specify model params and build it
## Alpha values in the range [10, 50] with a step size of 5 were tried; alpha = 15 gave the best overall
## recall value.
model_params = {"alpha": 15} 

# Build the Model
implicit_als(df_train, **model_params)

with open('baseline_model.sav', 'rb') as f:
    als_model = pickle.load(f)



Building ALS model with alpha: 15 




Completed in 277.98s


### Find Similar Items 

In [23]:
def find_similar_items(model,item_id,n_similar):
    """
    Given an item, prints similar items
    """
    product_id = []
    product_name = []
    scores = []
    
    similar =  model.similar_items(item_id, n_similar)
    for item in similar:
        idx, score = item

        product_id.append(idx)
        scores.append(score)
        product_name.append(df_products.product_name.loc[df_products.product_id==idx].iloc[0])
    print("Similar Items to Item: ",df_products.product_name.loc[df_products.product_id==item_id].iloc[0])
    print("----------------------------------------------------------------------")
    print(list(product_name))

In [24]:
find_similar_items(als_model,33120,10)

Similar Items to Item:  Organic Egg Whites
----------------------------------------------------------------------
['Organic Egg Whites', 'Organic Grass fed Creamline Yogurt Vanilla', 'Gluten FreeBread & Pizza Crust Mix', 'Pumpkin Pie', 'T/Gel Original Formula Therapeutic Shampoo', 'Original', 'Just Real Fruits & Veggies Snack Apple, Green Pea, Pineapple', 'Boneless Fried Chicken & Waffles', 'Salmon Creations Lemon Dill Salmon', 'Mocha Iced Coffee']


The similar items include some healthy options. Let's evaluate the performance of the recommender system.

## Evaluating the Recommender System

In [25]:
def auc_score(predictions, test):
    '''
    This simple function will output the area under the curve using sklearn's metrics. 
    
    parameters:
    
    - predictions: your prediction output
    
    - test: the actual target result you are comparing to
    
    returns:
    
    - AUC (area under the Receiver Operating Characteristic curve)
    '''
    fpr, tpr, thresholds = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)   
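
For intuition about what this number means: ROC AUC equals the probability that a randomly chosen positive (purchased) item is scored above a randomly chosen negative (unpurchased) item, with ties counted as half. The sketch below computes it directly from that definition in plain Python. The helper `pairwise_auc` and the scores are invented for illustration; it is meant as a sanity check, not a replacement for `sklearn.metrics`.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted scores and true purchase flags
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
toy_auc = pairwise_auc(scores, labels)
print(toy_auc)  # 0.75
```

Of the four positive/negative pairs, three are ranked correctly (0.9 vs 0.8, 0.9 vs 0.1, 0.3 vs 0.1) and one is not (0.3 vs 0.8), giving 0.75. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one.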

Now, using the above function inside a second function, we will calculate the AUC for each user in our training set that had at least one item masked. It will also calculate the AUC of the most-popular-items baseline for the same users, for comparison.

In [26]:
def calc_mean_auc(training_set, altered_users, predictions, test_set):
    '''
    This function will calculate the mean AUC by user for any user that had their user-item matrix altered. 
    
    parameters:
    
    training_set - The training set resulting from make_train_test, where a certain percentage of the original
    user/item interactions are reset to zero to hide them from the model 
    
    predictions - The matrix of your predicted ratings for each user/item pair as output from the implicit MF.
    These should be stored in a list, with user vectors as item zero and item vectors as item one. 
    
    altered_users - The indices of the users where at least one user/item pair was altered by make_train_test
    
    test_set - The test set constructed earlier by make_train_test
    
    returns:
    
    The mean AUC (area under the Receiver Operating Characteristic curve) of the test set, computed only on
    user-item interactions that were originally zero, to test ranking ability, along with the mean AUC of the
    most popular items as a benchmark.
    '''
    
    
    store_auc = [] # An empty list to store the AUC for each user that had an item removed from the training set
    popularity_auc = [] # To store popular AUC scores
    pop_items = np.array(test_set.sum(axis = 0)).reshape(-1) # Get sum of item interactions to find most popular
    item_vecs = predictions[1]
    for user in altered_users: # Iterate through each user that had an item altered
        training_row = training_set[user,:].toarray().reshape(-1) # Get the training set row
        zero_inds = np.where(training_row == 0) # Find where the interaction had not yet occurred
        # Get the predicted values based on our user/item vectors
        user_vec = predictions[0][user,:]
        pred = user_vec.dot(item_vecs).toarray()[0,zero_inds].reshape(-1)
        # Get only the items that were originally zero
        # Select all ratings from the MF prediction for this user that originally had no interaction
        actual = test_set[user,:].toarray()[0,zero_inds].reshape(-1) 
        # Select the binarized yes/no interaction pairs from the original full data
        # that align with the same pairs in training 
        pop = pop_items[zero_inds] # Get the item popularity for our chosen items
        store_auc.append(auc_score(pred, actual)) # Calculate AUC for the given user and store
        popularity_auc.append(auc_score(pop, actual)) # Calculate AUC using most popular and store
    # End users iteration
    
    return float('%.3f'%np.mean(store_auc)), float('%.3f'%np.mean(popularity_auc))  
   # Return the mean AUC rounded to three decimal places for both test and popularity benchmark

Now, let's see how our recommender system is doing. To use this function, we will need to transform our output from the ALS function to csr_matrix format and transpose the item vectors.

In [27]:
alpha = 15
user_vecs, item_vecs = implicit.alternating_least_squares((df_train*alpha).astype('double'), 
                                                          factors=20, 
                                                          regularization = 0.1, 
                                                          iterations = 50)







In [30]:
calc_mean_auc(df_train, item_users_altered, 
              [sparse.csr_matrix(user_vecs), sparse.csr_matrix(item_vecs.T)], df_test)
# AUC for our recommender system

(0.851, 0.769)

#### Conclusion:

Thus, we can see that our recommender system beats popularity. The Recommender system had a mean AUC of **0.85**, while the popular item benchmark had a lower AUC of **0.77**.

### Get Recommendations for Users

Let’s examine the recommendations given to a particular user and see if the user has purchased any of the recommended products!

In [31]:
def get_recommendations_items(model,user_id,n_count):
    """
    Get Recommendations for users
    """
    # Recommend items for the given user
    recommendations = model.recommend(user_id, sparse_user_item, N = n_count)
    product_id = []
    product_name = []
    scores = []
    print("Recommended Prodcucts for user_id ",user_id)
    print("----------------------------------------------------------------------")
    for item in recommendations:
        idx, score = item
        #print(item)
        pid = data.prev_product_id.loc[data.product_id == idx].iloc[0]
        ##product_id.append(df_train.product_id.loc[df_train.product_id == idx].iloc[0])
        product_name.append(df_products.product_name.loc[df_products.product_id==pid].iloc[0])
        #scores.append(score)
    print (product_name)

#### Get Recommendations for User#1

In [32]:
# Let's test for user# 1
get_recommendations_items(als_model,1,10)

Recommended Prodcucts for user_id  1
----------------------------------------------------------------------
['Light Strawberry Blueberry Yogurt', 'All-Seasons Salt', 'Indian Hemp & Haitian Vetiver Deodorant', 'Diet Mountain Dew Mini Cans', 'Moisturizing Conditioner Aloe Vera + Macadamia Oil', 'Turkey Stew With Barley & Carrots Natural Dog Food', "Chocolate Builder's Protein Bar", 'Franks, Beef, Deli Style, Bun Size', 'SmartBlend Chicken & Rice Formula Adult Dry Dog Food', 'Milk, Low Fat, 1% Milkfat']
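
Conceptually, recommendations from a factorization model come from scoring every item against the user's latent vector, dropping items the user already has, and keeping the top N. The sketch below shows that idea with NumPy. It is an illustration under stated assumptions, not the `implicit` library's implementation; the function `top_n_unseen` and all vectors are invented.

```python
import numpy as np


def top_n_unseen(user_vec, item_vecs, purchased, n=3):
    """Score all items for one user and return the best n not yet purchased."""
    scores = (item_vecs @ user_vec).astype(float)   # one dot product per item
    scores[list(purchased)] = -np.inf               # exclude items already bought
    best = np.argsort(-scores)[:n]                  # indices of the highest scores
    return [(int(i), float(scores[i])) for i in best]


# Hypothetical latent vectors with K = 2 factors
user_vec = np.array([1.0, 0.5])
item_vecs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.5, 0.5],
                      [0.0, 0.1]])

recs = top_n_unseen(user_vec, item_vecs, purchased={0}, n=2)
print(recs)
```

Item 0 has the highest raw score but is excluded because the user already bought it, so the list starts with item 2, followed by item 1.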


#### Check the products the User Actually Bought

In [33]:
customers_arr = np.array(data.user_id.unique()) # Array of customer IDs from the ratings matrix
products_arr = np.array(data.product_id.unique()) # Array of product IDs from the ratings matrix

In [34]:
def get_actual_buy(user_id):
    """
    Get the product List the Customer actually bought
    """
    act_products = []
    purchased_ind = list(data[data.user_id ==user_id]['prev_product_id'].unique())
    print("Actual Prodcucts for user_id ",user_id)
    print("----------------------------------------------------------------------")
    for pid in purchased_ind:
        #id = data.prev_product_id.loc[data.product_id == purchased_ind].iloc[0]
        act_products.extend(df_products[df_products.product_id == pid]['product_name'])

    print( list(act_products))

#### Check the products actually bought by User#1

In [35]:
get_actual_buy(1)

Actual Prodcucts for user_id  1
----------------------------------------------------------------------
['Fresh Breath Oral Rinse Mild Mint', 'Nutter Butter Cookie Bites Go-Pak', 'Beef Empanadas', 'Casareccia', 'Hearth Baked Style Twin French Bread', 'Frizz Ease Original Formula Medium To Coarse Frizzy Hair Serum', "Children's Allergy Relief Chewable Grape Tablets", 'Chocolate Almond Fudge Ice Cream', '2 in 1  Cavity and Enamel Protection Strawberry Flavor Kids Toothpaste', 'Organic Ginger Beet Kraut', 'Apple Pie Spice', 'Classic Cheddar Popcorn', 'Whoppers Robin Eggs', 'Hazelnut Toffee Dark Chocolate Bar', 'Karamel Sutra Core Ice Cream', 'Decaffeinated Dark Italian Roast Ground Coffee', 'Mexican-Style Ground Spiked Eggnog Dark Chocolate', '100% Grape Juice Concentrate', 'Stock-In-A-Box', 'Salt and Vinegar Chips', "Lori's Lemon Tea", 'Natural Dog Food Turkey & Sweet Potato Formula', 'Cranberry Zero Soda', 'Extra Strength Glacier Mint Drops With Natural Menthol', 'Ultra Soft & Strong Bat

So, it turns out that user #1 bought some food items, and our model recommends some different food items. Interestingly, user #1 bought a deodorant ('Fresh Collection Denali Scent Deodorant') and a different one ('Indian Hemp & Haitian Vetiver Deodorant') is recommended.

Dog food ('Natural Dog Food Turkey & Sweet Potato Formula') was also bought by the user, and 'SmartBlend Chicken & Rice Formula Adult Dry Dog Food' is recommended.

While shopping, I **do not** want to see the items that I usually buy as "recommendations". I would like to see **similar** items that other customers buy and that catch my attention. So, it looks like our **Recommendation Engine** is working just fine!