# Basket Analysis and Recommender System

In this notebook, we will perform basket analysis using association rules and build a recommender system using the implicit library. Explaination on the individual steps, how the libraries work and evaluation of the algorithm and model will be made in their respective sections.

### Contents:
- [Import Libraries and Load Dataset](#Import-Libraries-and-Load-Dataset)
- [Apriori - Basket Analysis](#Apriori---Basket-Analysis)
- [Recommender System](#Recommender-System)

# Import Libraries and Load Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sparse
import random
import implicit
import pickle

from apyori import apriori
from collections import defaultdict
from pandas.api.types import CategoricalDtype

In [2]:
full_df = pd.read_csv('../dataset/cleaned/combined_cleansed.csv')
validation_set = pd.read_csv('../dataset/cleaned/validation.csv')

# Apriori - Basket Analysis

Association rules captures pattern of items appearing frequently together. For example, we have a total of 10 orders containing varying items, out of the 10 orders, diapers and beer are bought together in 4 of them. The algorithm will detect such patterns and calculate the lift and confidence of beer and diaper. More detailed explaination on confidence and lift is provided below:
    
- Confidence = The likelihood that item B is bought when item A is bought
    - I have 100 transactions where bread is bought, 20 of them contains both bread and jam, Confidence of bread→jam = 20/100 = 0.2 = 20%
    - The likelihood of buying jam when bread is purchased is 20%

- Lift = the increase in ratio of the sale of B when A is sold, it can be calculated by (Confidence A→B) / (Support B).
    - A higher lift means that the likelihood of the products being bought together is higher
    - A lift lesser than 1 means that the items are not likely to be bought together
    - A lift equals to 1 means that there are no association between both products
    - Lift(bread → jam) = (20/100) / (10/100) = 2
    - The likelihood of buying jam and bread together is 2 times more likely than just buying bread alone

Some use cases on how basket analysis can be assist business:
- X can be recommended if Y is present in cart when customer is checking out 
- X and Y could be combined into a new product, such as having Y in flavors of X.
- Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.
- Promotional discounts could be applied to just one out of the two items.
- Advertisements on X could be targeted at buyers who purchase Y

## Prep Dataset for Basket Analysis

Association rules utilises a nested list, we will first prepare our dataset by incorporating all items purchased in one order in a list, which will be nested into a master list containing all orders.

In [3]:
def add_to_dict(x):
    prod_dict[x[0]].append(x[1])

In [4]:
prod_dict = defaultdict(list)

In [5]:
full_df[['order_id','product_name']].apply(add_to_dict, axis = 1);

In [6]:
purchase_list = list(prod_dict.values())

In [7]:
# saving prepared data

with open('../pickles/purchase_list.data', 'wb') as filehandle:
    pickle.dump(purchase_list, filehandle)

In [8]:
purchase_list = [x for x in purchase_list if len(x) > 1]

- We will only keep orders with more than 1 item
- Transaction with only 1 item will increase our total transaction count and will not be purposeful to capture patterns

## Applying the Algorithm

The algorithm takes in 4 parameters, an explaination of the parameters chosen for the algorithm below:
- min_support
    - To only include items that appeared in at least 3000 times out of our total transaction
- min_confidence
    - To only include items which are bought together for at least 20% out of the total transactions where only the second item is bought
- min_lift 
    - To only include rules where it is minimally 2 times more to purchse both 2 items compared to purchasing only the first item
- min length
    - I want at least 2 products in all the rules

In [9]:
association_rules = apriori(purchase_list, min_support=3000/len(purchase_list), min_confidence=0.2, min_lift=2, min_length=2)
association_results = list(association_rules)

In [10]:
# saving the results

with open('../pickles/association_rules_01_percent_min_support.data', 'wb') as filehandle:
    pickle.dump(association_results, filehandle)

In [11]:
len(association_results)

421

We have a total of 421 associations of purchases after filtering. We will convert them into a dataframe and explore the rules which the algorithm had determined.

In [12]:
result = []
for item in association_results:

    # first index of the inner list
    # Contains base item and add item
    pair = item[0] 
    items = [x for x in pair]
    value0 = str(items[0])
    value1 = str(items[1])
    
    # second index for the inner listing
    value2 = str(item[1])[:7]
    
    value3 = str(item[2][0][2])[:7]
    value4 = str(item[2][0][3])[:7]
    
    rows = (value0, value1, value2, value3, value4)
    result.append(rows)

labels = ['Antecedent', 'Consequents', 'Support', 'Confidence', 'Lift']
product_suggestions = pd.DataFrame(result, columns = labels)

In [13]:
product_suggestions.head()

Unnamed: 0,Antecedent,Consequents,Support,Confidence,Lift
0,Apple Honeycrisp Organic,Bag of Organic Bananas,0.00774,0.27941,2.26811
1,Bag of Organic Bananas,Apples,0.00101,0.25411,2.06274
2,Clementines,Apples,0.00131,0.3282,33.614
3,Asparation/Broccolini/Baby Broccoli,Banana,0.00198,0.36813,2.3923
4,Baby Cucumbers,Hass Avocados,0.00103,0.22707,14.0823


- Clementines appeared in 0.00131 of the total transactions (0.13%)
- The likelihood of someone buying Apples when they purchase Clementines is 0.32 (32%)
- Past transanctions shows that people are 33 times more likely to buy Apples and Clementines compared to just buying Apples

# Recommender System

#### Implicit Data
In this project, we are purely leveraging on implicit data which is gathered through customer's behaviour. In this case we got these data from the purchasing behavour, what items the user had purchased and how many times they had purchased the item. Using implicit data yield some advantage over explicit data.
- Getting explicit ratings and reviews may not always be easy as it requires additional actions by the customer
- Explicit ratings and reviews provided by customer can be skewed to certain level of biasness based on the situation the rating is provided and cultural habits
    - Ratings and reviews can be dependent on the user's mood
    - A user's 4 star rating can be equivalent another user's 5 star rating due to cultural differences

#### Recommender System
Interaction between customer and item is a basis of how our recommender system works. An absence of interaction could mean that the customer do not like the item or more often, the customer do not know about the item yet.
A good recommender system is able to identify hidden features a user like based on their past bahaviour and behaviour of similar users and matching them with products that has these hidden features. 

## Prepping the dataset

We will be utilising the Alternating Least Squares model to fit our data and find similarities. For ALS, we need to utilise matrix factorization, matrix factorization is taking a large matrix and factor it into some smaller representation of the original. 

A problem in collaborative filtering for recommender systems is that our original matrix have millions of different dimensions, but our tastes may not be so complex. For example, I could have bought hundreds of different products, but these products may only represent a few different tastes.

Using matrix factorization, we can mathematically reduce the dimensionality of our original all users by all items matrix into a smaller all items and what tastes they represent vector and each user and their taste value vector. These tastes are latent or hidden features which we learn them from our data. This reduction makes it much more computationally efficient and also gives us better results as we can reason items in a more compact taste space.

We convert our data into a user by product sparse matrix and their taste values will be the number of purchase a user made for the product

In [14]:
collaborative_df = full_df.groupby(['user_id', 'product_name', 'product_id'])['product_id'].agg('count').to_frame('purchase_count').reset_index()

In [15]:
# get a list of unique users
users = list(np.sort(collaborative_df['user_id'].unique()))
# get a list of unique products
products = list(collaborative_df['product_id'].unique())
# get a list of purchase count
purchase_count = list(collaborative_df['purchase_count'])

# get the row indices
cols = collaborative_df['user_id'].astype('category', CategoricalDtype(categories = users)).cat.codes
# get the column indices
rows = collaborative_df['product_id'].astype('category', CategoricalDtype(categories = products)).cat.codes

collaborative_sparse = sparse.csr_matrix((purchase_count, (rows, cols)), shape = (len(products), len(users)))

## Alternating Least Squares

ALS is an iterative optimization process where we try to arrive closer to a factorized representation of our original data with each iteration.

We have our original matrix of size users * products and the feedback data is the purchase count of the product. The original matrix is then turned into one matrix consisting of users and hidden features and another with items and hidden features. In these 2 matrices we have weights for how each user and product relates to each hidden feature. We calculate these 2 matrices so that their product is as close as the original matrix as possible.

The model merge the preference a user have for an item with the confidence level we have for that preference. We start out with representing missing values as a negative preference with a low confidence value while existing values have a positive preference with a high confidence value. 

Preference is a binary representation derived from feedback data, purchase count. If the user had purchased the item, it is set to 1, else it will be set to 0

Confidence is calculated using the magnitude of the feedback data, we will have a larger confidence the more times a the customer purchased the item. The rate of which our confidence increases is set through a linear scaling factor alpha. This means that there is only one interaction between a user and the item the confidence will be higher than that of the user with an item which was not purchased before given the alpha value.

In [180]:
# closest to the power point

model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1, iterations=20)
alpha_val = 40
data_conf = (collaborative_sparse * alpha_val).astype('double')
model.fit(data_conf)
user_items = data_conf.T.tocsr()

HBox(children=(IntProgress(value=0, max=20), HTML(value='')))




In [181]:
def get_recommendations_and_validation_orders(df, model, fitted, user):
    recommendations = model.recommend(user, fitted, N = 4, filter_already_liked_items = True)
    product_dict = dict(zip(full_df.product_id, full_df.product_name))
    
    print('Recommended items for User {} are: \n'.format(user))
    for i in recommendations:
        print(i[0], product_dict.get(i[0]), i[1])
    print('===========================================================')
    print('User {} validation transactions are:\n'.format(user))
    print(validation_set[validation_set['user_id'] == user][['product_name', 'user_id']].to_string(index = False))

In [182]:
get_recommendations_and_validation_orders(collaborative_sparse, model, user_items, 1)

Recommended items for User 1 are: 

13517 Whole Wheat Bread 1.1968588
20063 Hazelnuts in Milk Chocolate, 33% Cocoa 1.1774435
26853 Complete Wheat 100% Whole Wheat Bread 1.1398611
15487 Raspberry English Tea Scones 1.1145719
User 1 validation transactions are:

                     product_name  user_id
                             Soda        1
            Organic String Cheese        1
         0% Greek Strained Yogurt        1
 XL Pick-A-Size Paper Towel Rolls        1
           Milk Chocolate Almonds        1
                       Pistachios        1
            Cinnamon Toast Crunch        1
       Aged White Cheddar Popcorn        1
               Organic Whole Milk        1
              Organic Half & Half        1
                Zero Calorie Cola        1


__Relevant Recommendations__
- Recommended Hazelnut in Milk Chocolate
- Purchased Milk Chocolate Almonds

__Room for Improvement__
- Recommended bread twice

In [183]:
get_recommendations_and_validation_orders(collaborative_sparse, model, user_items, 58144)

Recommended items for User 58144 are: 

34172 Top Ramen Shrimp Flavor Instant Noodle Soup 1.1561155
39322 Caramel Almond and Sea Salt Nut Bar 1.1357532
35175 Mini Stuffers Hamburger Dill Chips 1.1336474
19604 Medium Scarlet Raspberries 1.0747831
User 58144 validation transactions are:

                                      product_name  user_id
                        Electrolyte Enhanced Water    58144
                                            Banana    58144
 Air Chilled Organic Boneless Skinless Chicken ...    58144
                              Lime Sparkling Water    58144
                          Non Fat Raspberry Yogurt    58144
                                   Farfalle No. 93    58144
                Total 0% Nonfat Plain Greek Yogurt    58144
                             Original Orange Juice    58144
                     Best Sloppy Joe Skillet Sauce    58144
                       Organic Cauliflower Florets    58144
                        Grated Parmigiano Reggiano   

__Relevant Recommendations__<br>
- Recommended Medium Scarlet Raspberries
- Purchased Banana
- Purchased Raspberry Yoghurt

In [184]:
get_recommendations_and_validation_orders(collaborative_sparse, model, user_items, 114401)

Recommended items for User 114401 are: 

44898 Organic Mac And Trees Fun Shape Macaroni & Cheese 1.1370347
35488 Organic Dry Roasted Premium Flaxseed 1.1338583
2190 Spicy Red Lentil Sauce 1.1187676
21702 Puna Coconut Pineapple 1.1060191
User 114401 validation transactions are:

                                  product_name  user_id
                                    Whole Milk   114401
 No Pulp Calcium & Vitamin D Pure Orange Juice   114401
                 Original Fresh Stack Crackers   114401
                         Cheddar Broccoli Rice   114401
                              Corn Pops Cereal   114401
                       Eggo Strawberry Waffles   114401
       Original 100% Pure No Pulp Orange Juice   114401
                            Orange Juice To-Go   114401
                 All Natural Peach Tea Bottles   114401
                          Hickory Smoked Bacon   114401


__Relevant Recommendations__<br>
- Recommended Puna Chocolate Pineapple (Fuit Juice)
- Purchased fruit juices

In [185]:
get_recommendations_and_validation_orders(collaborative_sparse, model, user_items, 3754)

Recommended items for User 3754 are: 

33502 Double Cheese Baked Snack Mix 1.1659267
45339 Men's Refresh Dandruff Shampoo 1.060647
29642 Ultra Soft Bath Tissue 1.0497061
13810 Reclosable Gallon Freezer Bags 1.0313741
User 3754 validation transactions are:

                                      product_name  user_id
                              Twice Baked Potatoes     3754
                            Whipped Sweet Potatoes     3754
 100% Natural Skin & Hair Revitalizing Coconut Oil     3754


__Relevant Recommendations__<br>
- Recommended Shampoo
- Purchased Skin and Hair products

In [186]:
get_recommendations_and_validation_orders(collaborative_sparse, model, user_items, 200372)

Recommended items for User 200372 are: 

30890 MCT Oil 1.26377
8651 Shipping Packaging Tape Heavy Duty 1.2522316
17419 Sprouted Whole Wheat Bread 1.209813
17018 Ghee Vanilla Bean 1.1393975
User 200372 validation transactions are:

                product_name  user_id
                   Diet Cola   200372
       Original Potato Chips   200372
  Salsa Con Queso Medium Dip   200372
        Pure Sport Body Wash   200372
          Snickers Ice Cream   200372
 Raspberry Cheesecake Gelato   200372
                    Rosemary   200372
                Red Potatoes   200372
   2% Low Fat Cottage Cheese   200372


__Relevant Recommendations__<br>
- None

__Possible Reasons for Recommendation__<br>
- Recommended Ghee Vanilla Bean (a kind of bread spread)
- Previous orders includes a variety of spreads
<br><br>
- Recommended MCT Oil (Suppliment for weight loss or energy)
- Previous orders includes supplements and energy drinks

# Next Steps and Future Improvements
- Deploy the recommender system and evaluate the effectiveness based on the take up rate of recommended items and evaluate the results

- Add features tags to the products like organic, natural, convenient, fresh, price range, manufacturer for the next iteration of improvement

- Use neural networks, Sequential to predict customer's next purchase and perform association rules for new recommendations

- Additional analysis can be performed on the customer with more data to gain insights on additional segments based on their recency of their last purchase, frequency of order, amount spent in the store.