# Market Basket Suggestions Based on Past Item Activity

This notebook processes the retail event data for apparel and uses it in a function that suggests additional items to purchase. The suggestions are based on the items commonly interacted with together in other user sessions.

### Data Frames

In [1]:
import pandas as pd

Reading the October Apparel Data

In [2]:
octApparelData = pd.read_csv('data/2019-Oct-apparel.csv')
print(octApparelData.shape)
octApparelData.head()

(1542924, 10)


Unnamed: 0.1,Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,0,2019-10-01 00:00:10 UTC,view,28719074,2053013565480109009,apparel.shoes.keds,baden,102.71,520571932,ac1cd4e5-a3ce-4224-a2d7-ff660a105880
1,1,2019-10-01 00:00:26 UTC,view,28719071,2053013565480109009,apparel.shoes.keds,baden,102.71,520571932,ac1cd4e5-a3ce-4224-a2d7-ff660a105880
2,2,2019-10-01 00:00:28 UTC,view,28714755,2053013565228450757,apparel.shoes,respect,51.22,555447570,99877fbe-d5a8-475e-a662-66bc9d29b6f8
3,3,2019-10-01 00:00:31 UTC,view,28718079,2053013565362668491,apparel.shoes.keds,respect,66.67,545323115,75fb5d0c-e907-4293-9c87-2419c2a7709d
4,4,2019-10-01 00:00:33 UTC,view,28717908,2053013565782098913,apparel.shoes,burgerschuhe,102.45,513798668,2034798f-43f2-8bcb-b169-c5f04a7a5a4f


Reading the November Apparel Data

In [3]:
novApparelData = pd.read_csv('data/2019-Nov-apparel.csv')
print(novApparelData.shape)
novApparelData.head()

(3011101, 10)


Unnamed: 0.1,Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,0,2019-11-01 00:00:17 UTC,view,43200121,2146660887346282824,apparel.tshirt,goodloot,8.73,566175330,680fb144-6940-4931-85e6-16dda8d4e2d5
1,1,2019-11-01 00:00:18 UTC,view,44300043,2100825583029060150,apparel.jeans,,40.16,545220871,f278cca0-e0f6-49a3-819a-d961998282d5
2,2,2019-11-01 00:00:35 UTC,view,44300009,2100825583029060150,apparel.jeans,,50.45,545220871,f278cca0-e0f6-49a3-819a-d961998282d5
3,3,2019-11-01 00:00:44 UTC,view,44300026,2100825583029060150,apparel.jeans,,46.08,545220871,f278cca0-e0f6-49a3-819a-d961998282d5
4,4,2019-11-01 00:01:12 UTC,view,45601414,2116907524572577889,apparel.shoes,pablosky,65.38,526996709,bd8f1103-1001-4b69-9106-da32f2c62653


Concatenation of the two months into a single data frame:

In [4]:
fullApparelDF = pd.concat([octApparelData, novApparelData], ignore_index=True)

In [5]:
fullApparelDF.shape

(4554025, 10)

The combined data set is too large to process efficiently on a personal computer, so we will take a very small sample of the data to work with. In a full implementation, this data set and analysis would run in a server environment, likely on a cluster of computers.

In [6]:
sampleApparelDF = fullApparelDF.sample(frac=0.008, replace=True, random_state = 12)
sampleApparelDF.head()

Unnamed: 0.1,Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
3905179,2362255,2019-11-21 09:52:52 UTC,view,28718096,2053013565069067197,apparel.shoes.keds,respect,44.79,525647713,eace5cc4-364c-4042-88b8-f85bc5a39d86
1461501,1461501,2019-10-30 03:10:38 UTC,view,28720599,2053013565639492569,apparel.shoes,,86.23,518925748,54682b8d-06e9-4d78-b429-524511471552
2133634,590710,2019-11-08 12:37:48 UTC,view,54900006,2146660887203676486,apparel.costume,,51.48,512640780,540b6089-5c57-459e-a787-dfe4c43a923d
2303235,760311,2019-11-10 09:16:22 UTC,view,28719465,2053013565639492569,apparel.shoes,baden,52.0,515444497,fcdd2dfe-a781-40df-9b67-f2b5f49b42b5
4061123,2518199,2019-11-23 14:47:05 UTC,view,54900011,2146660887203676486,apparel.costume,,64.35,564303461,20581bd4-c331-46a8-b50b-d36b92d1b234


In [7]:
sampleApparelDF.shape

(36432, 10)
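
Note that `sample` draws individual event rows, so the events of one session can be split up and a multi-item session may keep only one of its items. A possible alternative, sketched below on a small made-up frame, samples whole sessions instead. The toy data and the `sample_sessions` helper are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy event log (made-up values) with the two columns used to build baskets.
events = pd.DataFrame({
    "user_session": ["a", "a", "a", "b", "c", "c", "d"],
    "product_id":   [1, 2, 3, 4, 5, 6, 7],
})

def sample_sessions(df, frac, random_state=None):
    """Keep every event that belongs to a random fraction of sessions (hypothetical helper)."""
    sessions = df["user_session"].drop_duplicates()
    kept = sessions.sample(frac=frac, random_state=random_state)
    return df[df["user_session"].isin(kept)]

sampled = sample_sessions(events, frac=0.5, random_state=12)
print(sampled)
```

Sampling this way keeps each selected basket complete, at the cost of a less predictable row count.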

### User Session Analysis

The data needs to be reformatted to show all the items interacted with in each user session. This will group the commonly used items together. 

In [8]:
apparelBasket = sampleApparelDF.groupby('user_session')['product_id'].apply(list)

In [9]:
apparelBasket

user_session
0008e55a-2eee-4d83-a125-a95e75720a1a              [28717160]
00096c1b-6b6b-4822-b8a2-26571d64bd8c              [35200189]
000bedbb-ed86-4a87-8832-b9dc808bef6c              [28718936]
001184d3-70c3-46a4-94f5-f8af302220a7    [48200449, 48200446]
0013c638-0d24-49bd-9ea7-6cce6813ae14              [54900012]
                                                ...         
fff11c0a-5beb-457d-b638-849d18688bb0              [44300067]
fff17fc9-f54c-48f0-85e5-4b5e89329a17              [28721779]
fffbca59-8eaf-41cd-ac40-3b2ecb64ed90              [28719480]
fffd4d6a-4175-48f6-abde-298d0834ae39              [28718244]
fffd5189-bf07-4fe3-a6e3-f6596f9e2185              [28718601]
Name: product_id, Length: 34461, dtype: object

In [10]:
apparelBasket = apparelBasket.values
apparelBasket

array([list([28717160]), list([35200189]), list([28718936]), ...,
       list([28719480]), list([28718244]), list([28718601])], dtype=object)

Now that the data is in the correct orientation and format, we can fit it to an encoder. This transforms the data so that each user session is a row and each product id is a column. The values are binary: 1 if the product was interacted with in the session and 0 if it was not.

In [11]:
from mlxtend.preprocessing import TransactionEncoder

In [12]:
# Instantiate
te = TransactionEncoder()

# Fit & Transform - sparse
apparelMatrix = te.fit_transform(apparelBasket, sparse=True)

# Put in a dataframe
apparelDF = pd.DataFrame.sparse.from_spmatrix(apparelMatrix, columns=te.columns_)
print(apparelDF.shape)

(34461, 7572)


In [13]:
apparelDF.head()

Unnamed: 0,8400068,8400069,8400070,8400071,8400072,8400085,8400090,8400094,8400178,8400187,...,100025442,100025466,100025474,100025669,100026368,100026512,100027313,100027375,100027627,100028280
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
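
To make the transformation concrete, here is a minimal plain-Python sketch of the same idea on three made-up baskets. It is not mlxtend's implementation, and `one_hot_baskets` is a hypothetical name used only for this illustration.

```python
def one_hot_baskets(baskets):
    """Return (sorted item columns, rows of 0/1 flags), one row per basket."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = []
    for basket in baskets:
        present = set(basket)
        rows.append([1 if item in present else 0 for item in columns])
    return columns, rows

# Made-up baskets: each inner list is one user session.
toy_baskets = [[101, 102], [102], [103, 101]]
columns, rows = one_hot_baskets(toy_baskets)

print(columns)
for row in rows:
    print(row)
```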


### Algorithm Application and Association Rules

To dial in the parameters for the algorithm, we can choose the minimum support threshold by inspecting the supports of individual products. The support of a product id is the fraction of sessions in which it occurs. This is a quick calculation for single items only; we will apply the full algorithm, which also covers combinations of product ids, afterwards.

In [14]:
apparelSupport = apparelDF.sum(axis=0)/apparelDF.shape[0]
apparelSupport

8400068      0.000058
8400069      0.000058
8400070      0.000116
8400071      0.000058
8400072      0.000174
               ...   
100026512    0.000029
100027313    0.000029
100027375    0.000029
100027627    0.000029
100028280    0.000029
Length: 7572, dtype: float64
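
The same calculation can be checked by hand on a tiny example. The sketch below computes, for a few made-up sessions, the fraction of sessions containing each product. The data and the `item_support` name are assumptions for illustration.

```python
from collections import Counter

def item_support(baskets):
    """Fraction of baskets that contain each item (an item counts once per basket)."""
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item: count / len(baskets) for item, count in counts.items()}

# Made-up sessions; the repeated 102 in the last basket is counted only once.
toy_baskets = [[101, 102], [102], [103, 101], [102, 102]]
support = item_support(toy_baskets)
print(support)
```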

In [16]:
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules

The FP-Growth algorithm is used to extract frequent item sets that can be applied to association rule learning. It formalizes the rough calculation we did earlier to find the support of each itemset.

In [17]:
apparelDF.columns = [str(i) for i in apparelDF.columns]

In [27]:
freqApparel = fpgrowth(apparelDF, min_support=0.00001, use_colnames=True)
freqApparel

Unnamed: 0,support,itemsets
0,0.002002,(28717160)
1,0.000174,(35200189)
2,0.000290,(28718936)
3,0.000783,(48200446)
4,0.000464,(48200449)
...,...,...
9665,0.000029,"(28718978, 28717896)"
9666,0.000029,"(28717813, 28719636)"
9667,0.000029,"(28720542, 44300028)"
9668,0.000029,"(28720778, 28711255)"
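
A quick check of what this threshold means in practice: with 34,461 sessions, an itemset that appears in exactly one session already has a support of about 0.000029, which is above `min_support=0.00001`. Assuming the threshold is applied as support ≥ `min_support`, every itemset that occurs at least once is kept. The arithmetic below uses only numbers shown in this notebook.

```python
import math

n_sessions = 34461      # rows of the encoded session matrix
min_support = 0.00001   # threshold passed to fpgrowth

# Support of an itemset that occurs in exactly one session
single_session_support = 1 / n_sessions
print(round(single_session_support, 6))

# Smallest number of sessions an itemset needs to reach the threshold
min_sessions = math.ceil(min_support * n_sessions)
print(min_sessions)  # 1
```

A higher threshold, for example one that requires several sessions, would shrink the list of itemsets considerably.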


We can now apply association rules to calculate different metrics for evaluating the item sets. We are using lift as our main metric. 

Lift is used to measure how much more the antecedent and consequent occur together than we would expect if they were statistically independent. 

(http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)

In [28]:
apparelRules = association_rules(freqApparel, metric="lift", min_threshold=10)
apparelRules.sort_values(by="lift", ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2661,"(28722208, 28721230)",(28718690),2.9e-05,2.9e-05,2.9e-05,1.0,34461.0,2.9e-05,inf
1433,"(28719162, 28719168, 28720827)","(28719390, 28719094)",2.9e-05,2.9e-05,2.9e-05,1.0,34461.0,2.9e-05,inf
1438,"(28719390, 28719162, 28720827)","(28719094, 28719168)",2.9e-05,2.9e-05,2.9e-05,1.0,34461.0,2.9e-05,inf
1437,"(28719094, 28719390, 28719162)","(28719168, 28720827)",2.9e-05,2.9e-05,2.9e-05,1.0,34461.0,2.9e-05,inf
1436,"(28719094, 28719168, 28720827)","(28719390, 28719162)",2.9e-05,2.9e-05,2.9e-05,1.0,34461.0,2.9e-05,inf
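
The lift of 34461 in the top rows follows directly from the definition, lift = support(A and C together) / (support(A) × support(C)). When the antecedent, the consequent, and their union each occur in exactly one of the 34,461 sessions, the lift equals the number of sessions. The sketch below reproduces that arithmetic with exact fractions. It suggests that these top rules each rest on a single session, so they should be read with caution.

```python
from fractions import Fraction

n_sessions = 34461

# Each of these itemsets occurs in exactly one session (support shown as 2.9e-05)
antecedent_support = Fraction(1, n_sessions)
consequent_support = Fraction(1, n_sessions)
joint_support = Fraction(1, n_sessions)

confidence = joint_support / antecedent_support
lift = joint_support / (antecedent_support * consequent_support)

print(confidence)  # 1
print(lift)        # 34461
```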


In [20]:
apparelRules['antecedents'] = apparelRules['antecedents'].apply(list)
apparelRules['consequents'] = apparelRules['consequents'].apply(list)

In [21]:
import numpy as np

The function below picks a random item from the basket and suggests other items the shopper might also like, based on the itemset analysis.

In [22]:
# Input basket
mybasket = ['28722071', '28716977', '28714273']

# Metric used to rank the rules
metric = 'lift'

def product_recs(basket, apparelRules, metric):

    # Randomly select an item from the basket
    random_item = np.random.choice(basket, 1)[0]
    print(f"Based on: {random_item}")

    # Find rules where the item appears anywhere in the antecedent
    rule_filter = apparelRules['antecedents'].apply(lambda x: random_item in x)

    # Filter the dataframe using rule_filter and sort by the selected metric, best first
    filtered_df = apparelRules[rule_filter].sort_values(by=metric, ascending=False)

    # Randomly return the consequents of two of the top 20 rules
    reco1 = filtered_df.head(20).sample(1, replace = True)['consequents']
    reco2 = filtered_df.head(20).sample(1, replace = True)['consequents']
    
    return reco1, reco2

result = product_recs(mybasket, apparelRules, metric )
cleanResult = np.array(result)
print(f"You would also enjoy: {cleanResult[0][0]} And {cleanResult[1][0]}")

Based on: 28716977
You would also enjoy: ['28720953'] And ['28720550']
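
A deterministic variant can be easier to test than random sampling. The sketch below is a plain-Python illustration with made-up rules. `top_recs` and `toy_rules` are hypothetical names, and the function returns an empty list when no rule mentions any basket item.

```python
def top_recs(basket, rules, metric="lift", k=2):
    """Return up to k consequent lists from the best-scoring rules whose
    antecedents mention any basket item (hypothetical helper)."""
    basket_items = set(basket)
    matches = [r for r in rules if basket_items & set(r["antecedents"])]
    matches.sort(key=lambda r: r[metric], reverse=True)
    return [r["consequents"] for r in matches[:k]]

# Made-up rules with the same fields as the rows of apparelRules
toy_rules = [
    {"antecedents": ["A"], "consequents": ["B"], "lift": 12.0},
    {"antecedents": ["A", "C"], "consequents": ["D"], "lift": 30.0},
    {"antecedents": ["E"], "consequents": ["F"], "lift": 50.0},
]

print(top_recs(["A"], toy_rules))   # [['D'], ['B']]
print(top_recs(["Z"], toy_rules))   # []
```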
