**Importing Libraries**

A few commonly used Python libraries are imported here. Swifter is used for parallel processing.

(Note that this notebook can run on a regular CPU in less than 20 minutes; no external GPU is required. However, if a GPU is available, it runs even faster.)


In [2]:
from itertools import combinations
# !pip install swifter
import swifter
import pandas as pd
import json
import numpy as np
import os

**Preprocessing**

In this part, the data in the 'transactions.txt' file is parsed so that, for the items in each transaction, their occurrence with each other is tracked. In other words, the copurchase frequency of every pair of items is counted as the 'transactions.txt' file is preprocessed.
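To make the idea of copurchase frequency concrete, here is a minimal sketch on a hypothetical toy dataset. The names `toy_transactions` and `count_copurchases` are illustrative assumptions and are not part of the notebook; the sketch only mirrors the symmetric nested-dictionary layout that the cell below builds.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transactions; the real ones come from 'transactions.txt'.
toy_transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
]

def count_copurchases(transactions):
    """Return a symmetric nested dict: counts[a][b] = times a and b were paired."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in transactions:
        # Every unordered pair of items in one transaction is one copurchase.
        for first_item, second_item in combinations(items, 2):
            counts[first_item][second_item] += 1
            counts[second_item][first_item] += 1
    return counts

toy_counts = count_copurchases(toy_transactions)
print(dict(toy_counts["milk"]))  # {'bread': 2, 'eggs': 2}
```

Storing each pair in both directions doubles the memory used, but it means the copurchases of any single item can be read with one dictionary lookup.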

However, due to memory limitations, the file is parsed in three parts, and the copurchase frequency is calculated separately for each part.

The value of *buffer* can be adjusted to suit the memory available on the machine.
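As an illustration of how a line buffer splits a file into parts, the sketch below reads a file in chunks of at most *buffer* lines using `itertools.islice`. This is an assumed alternative, not the approach used in the cell below (which re-reads the file and slices it); `iter_line_chunks` and the temporary demo file are hypothetical.

```python
import os
import tempfile
from itertools import islice

def iter_line_chunks(path, buffer=500000):
    """Yield (start, lines) for consecutive chunks of at most `buffer` lines."""
    start = 0
    with open(path, "r") as file:
        while True:
            # islice consumes only the next `buffer` lines of the open file.
            lines = list(islice(file, buffer))
            if not lines:
                break
            yield start, lines
            start += len(lines)

# Demo on a hypothetical 5-line temporary file with a buffer of 2 lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(f"line {i}" for i in range(5)) + "\n")

demo_chunks = [(start, len(lines)) for start, lines in iter_line_chunks(tmp.name, buffer=2)]
os.remove(tmp.name)
print(demo_chunks)  # [(0, 2), (2, 2), (4, 1)]
```

Because the file stays open between chunks, each line is read from disk only once, and at most *buffer* lines are held in memory at a time.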

This whole cell does all the work: reading the file, parsing it, keeping track of the copurchase frequency, and saving the copurchase frequency files for later use.

Additionally, the use of functions helps considerably with memory management in the cell below, because their variables are temporary. The mutability of dictionaries also helps keep the data flow efficient.

Note that if you want to run all cells continuously, you can comment out the part that saves the files, although the reloading cell below reads from these files, so it would then need to be adapted as well. The files are saved only so that, if anything goes wrong (for instance, the session crashes or an experiment fails), we can resume from where we left off by loading the saved files, without redoing this 15-minute cell. Saving also allows us to work with a buffer of any given size.

The cell takes around 14 minutes to run on a regular CPU.

In [81]:
copurchase_data = {}

all_transactions_path = 'transactions.txt'

def update_copurchase_data(rec):
  first_item, second_item = rec

  if first_item not in copurchase_data:
    copurchase_data[first_item] = {}
  if second_item not in copurchase_data[first_item]:
    copurchase_data[first_item][second_item] = 0

  if second_item not in copurchase_data:
    copurchase_data[second_item] = {}
  if first_item not in copurchase_data[second_item]:
    copurchase_data[second_item][first_item] = 0
  
  copurchase_data[first_item][second_item] += 1
  copurchase_data[second_item][first_item] += 1


def count_copurchase_frequency(record):
  try:
    record = [item['item'] for item in eval(record)['itemList']]
    purchase_couples = combinations(record,2)

    # A swifter apply cannot be nested inside another one, so plain apply is used here.
    pd.Series(purchase_couples).apply(update_copurchase_data)
  except Exception:
    # Skip records that cannot be parsed.
    pass


def reset_copurchase_data():
  global copurchase_data #The global copurchase_data is reset
  copurchase_data = {}


#alter the value of buffer based on computational requirements.
def get_copurchase_frequency(all_transactions_path, buffer = 500000):
    start = 0
    end = buffer
    to_break = False

    while not to_break:
      
      # The with block closes the file automatically, so no explicit close() is needed.
      with open(all_transactions_path, 'r') as file:
        lines = file.readlines()[start:end]
        if len(lines) != buffer:
            to_break = True

      pd.Series(lines).swifter.apply(count_copurchase_frequency) 

      with open(f'copurchase_data_{start}_to_{end}.json', 'w') as fp:
        json.dump(copurchase_data, fp)
      
      reset_copurchase_data()
        
      start = end
      end = start + buffer

get_copurchase_frequency(all_transactions_path) #Takes around 14 minutes to compute on a regular CPU.

Pandas Apply: 100%|██████████| 500000/500000 [09:59<00:00, 833.45it/s]  
Pandas Apply: 100%|██████████| 500000/500000 [07:12<00:00, 1155.42it/s]
Pandas Apply: 100%|██████████| 377443/377443 [05:37<00:00, 1118.04it/s]


**Reloading Data**

This is especially helpful during experimentation, as we do not need to re-run the preprocessing part.

In [8]:
full_data = []
for file in os.listdir():
    if file.startswith('copurchase_data') and file.endswith('.json'):
        with open(file) as json_file:
            full_data.append(json.load(json_file))

**Product Table**

The product table originally came in the file 'product.txt'. However, it was parsed into a TSV file using the script in the 'products.txt' file.

Note that 'Product ID' is used as the index of the table for faster lookups.

In [9]:
all_products = pd.read_csv('all_products.tsv', sep='\t', index_col= 'Product ID')
all_products_idx = all_products.index.values.tolist()

**Recommendation Matrix**

This cell generates 5 recommendations for each product in the product table.

The product IDs of these 5 recommendations are stored in a JSON file, keyed by the ID from the product table.

A JSON file is used because it gives fast lookups when it is loaded as a dictionary.

The cell takes around 5 minutes on a regular CPU to perform this operation on all the 70k products.
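The core of this step is summing an item's copurchase counts across the partial dictionaries and keeping the 5 most frequent partners. A minimal sketch of that idea using `collections.Counter` follows; `toy_full_data` and `top_copurchases` are hypothetical names, and this is not the implementation used in the cell below.

```python
from collections import Counter

# Hypothetical partial copurchase dictionaries, shaped like the saved JSON files.
toy_full_data = [
    {"A": {"B": 3, "C": 1}},
    {"A": {"B": 2, "D": 4}, "B": {"A": 2}},
]

def top_copurchases(product_id, parts, k=5):
    """Sum the counts for product_id over all parts and return the top-k partner IDs."""
    total = Counter()
    for part in parts:
        # Counter.update adds counts rather than overwriting them.
        total.update(part.get(product_id, {}))
    return [item for item, _ in total.most_common(k)]

print(top_copurchases("A", toy_full_data))  # ['B', 'D', 'C']
```

One difference to keep in mind: this sketch returns an empty list for a product with no copurchases, whereas the cell below leaves the entry as None.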

In [14]:
copurchase_recommendations = dict(zip(all_products_idx, [None]*len(all_products_idx)))

def get_recommendation_matrix(id):
    id_lst = [data.get(id, {}) for data in full_data] 
    id_copurchase = {}
    unique_id_lst = []

    for data in id_lst:
        unique_id_lst = list(set(unique_id_lst + list(data.keys()))) 

    if len(unique_id_lst) != 0:
        def counting(rec):
            id_count = 0
            for data in id_lst:
                id_count += data.get(rec, 0)                
            id_copurchase[rec] = id_count
        
        pd.Series(unique_id_lst).apply(counting)
        
        id_copurchase = np.array(sorted(id_copurchase.items(), key=lambda x: x[1], reverse=True))[:, 0][:5]

        copurchase_recommendations[id] = id_copurchase.tolist()     

pd.Series(all_products_idx).swifter.apply(get_recommendation_matrix) #Takes around 5 minutes on a regular CPU
with open('recommendation_matrix.json', 'w') as fp:
    json.dump(copurchase_recommendations, fp)

Pandas Apply: 100%|██████████| 70771/70771 [03:43<00:00, 316.45it/s] 


**User Input Function**

This is the function that takes a product ID from the user as its input query.

It first imports the recommendation matrix that was saved in the previous cell. Since the matrix is saved to a file, the previous cells do not need to be run when querying.

The function then imports the product table with the product ID as its index and converts it into a dictionary for faster lookups.

Finally, the function prints all the recommendations for that product.

In [53]:
query_ID = '20592676_EA'

def get_user_recommendations(query_ID):
   with open('recommendation_matrix.json') as json_file:
      copurchase_recommendations = json.load(json_file)

   all_products_dict = pd.read_csv('all_products.tsv', sep='\t', index_col= 'Product ID').drop(columns='MCH Category').to_dict()['Item Name']

   try:
      recommendations = {recommend_id:all_products_dict[recommend_id] for recommend_id in copurchase_recommendations[query_ID]}
      print(f"Recommendations for '{all_products_dict[query_ID]}'")
      print(recommendations)
   
   except (KeyError, TypeError):
      # KeyError: unknown product ID; TypeError: the stored recommendations are None.
      print('No recommendations for the product.')


get_user_recommendations(query_ID)

Recommendations for 'Celebration Cupcakes, Chocolate'
{'20189092_EA': 'Plastic Bags', '20379763_EA': 'Celebration Cupcakes, White', '20175355001_KG': 'Bananas, Bunch', '20668578_EA': 'PENNY ROUNDING - DO NOT TOUCH', '20812144001_EA': 'Grade A White Eggs, Large'}


In [54]:
get_user_recommendations('20801754003_C15')

Recommendations for '7 Up'
{'20175355001_KG': 'Bananas, Bunch', '20801754001_C15': 'Pepsi', '20189092_EA': 'Plastic Bags', '20962518_EA': 'Milk, 2%', '20668578_EA': 'PENNY ROUNDING - DO NOT TOUCH'}


In [55]:
get_user_recommendations('20000053_EA')

Recommendations for 'French Dijon Mustard'
{'20175355001_KG': 'Bananas, Bunch', '20127708001_KG': 'Sweet Potatoes', '20130301_EA': 'Montreal Steak Spice', '20143381001_KG': 'Roma Tomatoes', '20055266001_EA': 'Hass Avocado'}


**Deployment**

If this model is put into production, both the recommendation matrix and the product table can be kept in memory as variables and hence do not need to be imported on every query. This would allow even faster querying.

The function would then only need to look up the dictionaries to produce results quickly.
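As a rough illustration of such a pure dictionary lookup, the sketch below returns the recommendations instead of printing them and uses `dict.get` rather than exception handling. The names `toy_recommendations`, `toy_product_names` and `lookup_recommendations` are hypothetical stand-ins for the in-memory data and are not part of the notebook.

```python
# Hypothetical in-memory stand-ins for the two files, loaded once at start-up.
toy_recommendations = {"P1": ["P2", "P3"], "P4": None}
toy_product_names = {
    "P1": "Item one",
    "P2": "Item two",
    "P3": "Item three",
    "P4": "Item four",
}

def lookup_recommendations(query_id, recommendations, product_names):
    """Return {recommended_id: name}, or an empty dict when nothing is available."""
    # `or []` covers both unknown IDs and IDs whose stored value is None.
    recommended_ids = recommendations.get(query_id) or []
    return {rid: product_names[rid] for rid in recommended_ids if rid in product_names}

print(lookup_recommendations("P1", toy_recommendations, toy_product_names))
# {'P2': 'Item two', 'P3': 'Item three'}
```

Returning a dictionary leaves the decision of how to present the result to the caller, which is usually more convenient in a service than printing.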

To make recommendations, only the function below and the two files it takes its data from are required.

Note that a query takes roughly 0.1 seconds to evaluate.

The production function might look like this:

In [56]:
query_ID = '20145949_EA'

with open('recommendation_matrix.json') as json_file:
      copurchase_recommendations = json.load(json_file)

all_products_dict = all_products.drop(columns='MCH Category').to_dict()['Item Name']


def get_user_recommendations(query_ID): 
   try:
      recommendations = {recommend_id:all_products_dict[recommend_id] for recommend_id in copurchase_recommendations[query_ID]}
      print(f"Recommendations for '{all_products_dict[query_ID]}'")
      print(recommendations)
   
   except (KeyError, TypeError):
      # KeyError: unknown product ID; TypeError: the stored recommendations are None.
      print('No recommendations for the product.')

   
get_user_recommendations(query_ID)

Recommendations for 'Parisienne Bread'
{'20189092_EA': 'Plastic Bags', '20175355001_KG': 'Bananas, Bunch', '20107500001_EA': 'Green Onion', '20668578_EA': 'PENNY ROUNDING - DO NOT TOUCH', '20143381001_KG': 'Roma Tomatoes'}


**Evaluation Metrics**

One quick and easy way to evaluate this model is to check how relevant the recommended items are to the MCH category of the item for which they are recommended.

The metric would therefore be: given a list of recommendations for an item, what percentage of them belong to the same MCH category as that item.

There are many outliers, like 'Plastic Bags', 'Bananas, Bunch' and 'PENNY ROUNDING - DO NOT TOUCH', which are recommended for many items even though they do not necessarily belong to the same MCH category as the item they are recommended for.

(Since this coding test only asks to propose an evaluation metric, there is no code for the proposed metric.)
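Although the notebook itself does not implement the metric, a minimal sketch of how it could be computed is given here purely as an illustration. The function `same_category_share` and the toy category mapping are hypothetical; in the notebook, a mapping from product ID to category could presumably be built from the 'MCH Category' column of the product table.

```python
def same_category_share(query_id, recommended_ids, category_of):
    """Fraction of recommendations in the same category as the query (0.0 to 1.0)."""
    if not recommended_ids or query_id not in category_of:
        return 0.0
    query_category = category_of[query_id]
    matches = sum(1 for rid in recommended_ids if category_of.get(rid) == query_category)
    return matches / len(recommended_ids)

# Hypothetical mapping from product ID to MCH category.
toy_categories = {
    "P1": "Bakery",
    "P2": "Bakery",
    "P3": "Produce",
    "P4": "Bakery",
    "P5": "Bags",
    "P6": "Dairy",
}

score = same_category_share("P1", ["P2", "P3", "P4", "P5", "P6"], toy_categories)
print(f"{score:.0%}")  # 40%
```

Averaging this share over all products would give a single score for the model, under the assumption that same-category recommendations are the relevant ones.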

**A Better Model (Naive Approach)**

A very naive model can be made that recommends items from the same MCH category as the given item.

Since this is a naive and very simple model, there are many ways to improve it.

The recommendations of this model will be very relevant to the item being queried.

In [49]:
from random import sample
mch_categories = pd.read_csv('mch_categories.tsv', sep='\t', index_col='code')
all_products = pd.read_csv('all_products.tsv', sep='\t', index_col= 'Product ID')

def naive_model_user_recommendations(query_ID):
    try:
        MCH_Category, Item_name = all_products.loc[query_ID, :]
        recommendations = sample(all_products[all_products['MCH Category'] == MCH_Category]['Item Name'].tolist(), 5)
        print(f"recommendations for '{Item_name}':")
        print(recommendations)
    except (KeyError, ValueError):
        # KeyError: unknown product ID; ValueError: fewer than 5 items in the category.
        print("No recommendations for this product.")


query_ID = '20000053_EA'
naive_model_user_recommendations(query_ID)

recommendations for 'French Dijon Mustard':
['Panda Brand Oyster Sauce', 'Hot Pepper Sauce', 'Sweet Chili Sauce for Chicken', 'Cherry Shiraz Wine Jelly', 'French Garlic Ketchup']


In [50]:
naive_model_user_recommendations('20801754003_C15')

recommendations for '7 Up':
['Mountain Dew Code Red', 'Mandarin Drink', 'Lime', 'Strawberry Malt Beverage', 'Ginger Ale']


In [51]:
naive_model_user_recommendations('20592676_EA')

recommendations for 'Celebration Cupcakes, Chocolate':
['Pumpkin Pie', 'Rice Cake, Low Sugar', 'Two-Bite Brownies', 'Lemon Snaps', 'Lemon Rollat']
