# Simple conditional probability analysis on the RetailRocket dataset

## Introduction
In this notebook, we use a simple one-item conditional probability analysis to get some insights from the RetailRocket dataset.

To start, we import the necessary libraries and read the transactional data from the dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
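df = pd.read_csv('./data/events.csv')
# Keep only rows that carry a transaction id, i.e. purchase events
df = df.dropna(subset=['itemid', 'transactionid'])
# transactionid is read as float because of the NaN values; cast it back to int
df['transactionid'] = df['transactionid'].astype(int)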

In [3]:
df.sample(10)

index,timestamp,visitorid,event,itemid,transactionid
1177886,1441147730523,543365,transaction,100821,8202
1087745,1440710346003,837675,transaction,280187,961
2344898,1436896231627,1150086,transaction,247601,15713
2661384,1438037491887,215049,transaction,421548,11379
1045104,1440525397351,363577,transaction,413902,5548
2744471,1438375193522,1099749,transaction,194678,1089
2538671,1437590571004,475479,transaction,196398,13502
1947788,1432658454424,1162129,transaction,464073,4378
2202774,1436375319316,1097400,transaction,254324,9300
784869,1439247874028,429035,transaction,170262,14984


## Probabilities
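Next, we consider items that appear together in transactions and calculate the conditional probabilities for each pair in both directions: the probability of item A given item B, and vice versa. The probability of A given B is estimated as the number of transactions containing both A and B divided by the number of transactions containing B.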
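To make the definition concrete, here is a minimal sketch on a handful of made-up transactions. The item labels and the helper name `conditional_probability` are illustrative assumptions and are not part of the RetailRocket data or of the notebook's own code.

```python
# Toy transactions (hypothetical data, not taken from RetailRocket)
toy_transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A"},
]


def conditional_probability(of, given, transactions):
    """Estimate P(of | given) as the share of transactions containing
    `given` that also contain `of`."""
    n_given = sum(1 for t in transactions if given in t)
    if n_given == 0:
        return float("nan")  # undefined when `given` never occurs
    n_both = sum(1 for t in transactions if given in t and of in t)
    return n_both / n_given


# A occurs in 3 transactions, 1 of which also contains C
p_c_given_a = conditional_probability("C", "A", toy_transactions)
# C occurs in 2 transactions, 1 of which also contains A
p_a_given_c = conditional_probability("A", "C", toy_transactions)
```

The two directions differ here (1/3 versus 1/2), which is why both orderings of each pair are computed.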

In [4]:
transactionItems = df.groupby('transactionid')['itemid'].agg(list)

In [5]:
def cartesian_set_product(items):
    # Return every ordered pair (a, b) of distinct items from one transaction
    results = []
    for a in items:
        for b in items:
            if a != b:
                results.append((a, b))
    return results

In [6]:
pairs = df.groupby('transactionid')['itemid'].agg(cartesian_set_product).sum()

In [7]:
pairs = list(set(pairs))

In [8]:
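def calculate_probability(item_pair):
    # Estimate P(item_pair[0] | item_pair[1]): transactions containing both items
    # divided by transactions containing the given item (item_pair[1])
    num_both = transactionItems.apply(lambda t: item_pair[0] in t and item_pair[1] in t).sum()
    num_given = transactionItems.apply(lambda t: item_pair[1] in t).sum()
    return num_both / num_given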

In [9]:
probs = pd.Series(index=pd.MultiIndex.from_tuples(pairs, names=['itemid_of', 'itemid_given']), dtype=float)

In [10]:
for p in pairs:
    probs[p[0], p[1]] = calculate_probability(p)
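The loop above rescans every transaction for each pair, which can become slow as the number of pairs grows. One possible alternative, sketched here on toy data, is to count items and ordered pairs in a single pass and then divide. The names `toy_transaction_items`, `item_counts`, `pair_counts` and `fast_probs` are assumptions introduced for this illustration only.

```python
from collections import Counter
from itertools import permutations

# Toy stand-in for the per-transaction item lists (hypothetical data)
toy_transaction_items = {1: [10, 20], 2: [10, 20, 30], 3: [20, 30], 4: [10]}

item_counts = Counter()  # number of transactions containing each item
pair_counts = Counter()  # number of transactions containing both items of an ordered pair

for items in toy_transaction_items.values():
    unique_items = set(items)  # presence only, ignore repeats within a transaction
    item_counts.update(unique_items)
    pair_counts.update(permutations(unique_items, 2))

# Keys are (itemid_of, itemid_given), matching the index order used for `probs`
fast_probs = {
    (of, given): n_both / item_counts[given]
    for (of, given), n_both in pair_counts.items()
}
```

If a pandas object is preferred, a dictionary like this could be turned into a Series with a `pd.MultiIndex.from_tuples` index, as is done for `probs`.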

In [11]:
probs.dropna(inplace=True)

In [12]:
probs.describe()

count    19376.000000
mean         0.658675
std          0.349326
min          0.007519
25%          0.333333
50%          0.600000
75%          1.000000
max          1.000000
dtype: float64

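The distribution looks plausible: all values lie between 0 and 1, the median is 0.6, and at least a quarter of the pairs have a probability of exactly 1.0. Now let's take a closer look at a single item, for example the one with itemid 186702.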

In [21]:
probs.xs(186702, level='itemid_given').sort_values(ascending=False)

itemid_of
119736    0.333333
416992    0.166667
392655    0.166667
62275     0.166667
319723    0.083333
210137    0.083333
77733     0.083333
31768     0.083333
200537    0.083333
352091    0.083333
377603    0.083333
337361    0.083333
466109    0.083333
396853    0.083333
94437     0.083333
249804    0.083333
235559    0.083333
71754     0.083333
257245    0.083333
85334     0.083333
167126    0.083333
8259      0.083333
8523      0.083333
dtype: float64

As we can see here, a customer who buys item 186702 also buys item 119736 in about one third of the observed transactions. No other item reaches more than 1/6.

## Trigger items
Our next hypothesis is that there are some trigger items: if a user buys such an item, they are likely to buy a lot of other things as well.
To check this idea, we look for items that frequently appear in transactions. We do not consider the quantity of each item, only whether it is present.
Please also note that the items we find may not be trigger items as we would like them to be, but simply popular items.

In [15]:
from collections import Counter

In [16]:
counter = Counter(df.groupby('transactionid')['itemid'].agg(list).sum())

In [17]:
counter.most_common(5)

[(461686, 133), (119736, 97), (213834, 92), (312728, 46), (7943, 46)]

Item 461686 appears far more often than any other item, in 133 cases. If the trigger hypothesis holds, convincing a customer to buy it may lead them to buy other items as well, increasing the total bill.
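Counting over flattened item lists counts an item once per row, so an item that were listed more than once within a single transaction would be counted several times. A minimal sketch of presence-only counting is shown below, assuming the data is available as `(transactionid, itemid)` rows. The rows and the name `presence_counter` are made up for illustration.

```python
from collections import Counter

# Hypothetical (transactionid, itemid) rows; item 7 is listed twice in transaction 1
rows = [(1, 7), (1, 7), (1, 8), (2, 7), (3, 9)]

# Counting raw rows counts item 7 three times
raw_counter = Counter(item for _, item in rows)

# Deduplicating (transaction, item) pairs first counts each item once per transaction
presence_counter = Counter(item for _, item in set(rows))
```

Here `raw_counter[7]` is 3 while `presence_counter[7]` is 2, the number of distinct transactions that contain item 7.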

## Recommending a trigger item
Next we want the list of items for which recommending our trigger item is most effective, that is, the items B with the highest probability of 461686 given B.

In [19]:
probs.xs(461686, level='itemid_of').sort_values(ascending=False)

itemid_given
437607    1.000000
40630     1.000000
417673    1.000000
144706    1.000000
384170    1.000000
348272    1.000000
276864    1.000000
293733    1.000000
63406     1.000000
223027    1.000000
187697    1.000000
286718    1.000000
113712    1.000000
280845    1.000000
13188     1.000000
335088    1.000000
289006    1.000000
36972     1.000000
266965    1.000000
256952    1.000000
425159    1.000000
29114     1.000000
426588    1.000000
24150     1.000000
372728    1.000000
192043    1.000000
93092     1.000000
415928    1.000000
124081    1.000000
65215     1.000000
            ...   
132853    1.000000
459319    1.000000
447067    0.750000
108924    0.750000
75392     0.750000
218794    0.727273
171878    0.692308
125275    0.666667
442300    0.666667
35608     0.666667
312929    0.500000
17724     0.500000
258600    0.500000
213172    0.500000
431940    0.500000
215355    0.500000
431632    0.500000
67423     0.500000
270938    0.500000
10572     0.421053
357529    0.400000

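The list above shows that, for example, if a customer buys 437607, we can recommend our trigger item 461686: in every observed transaction that contained 437607, 461686 was bought as well. Note that a probability of 1.0 may rest on very few transactions, so it is not a guarantee for future purchases.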
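One way to judge how much weight a probability deserves is to report it together with the number of transactions it is based on. The sketch below does this on toy data. The helper `probability_with_support` and the item labels are hypothetical, and any minimum-support threshold applied afterwards would be an assumption to tune rather than something taken from this analysis.

```python
# Toy transactions (hypothetical data, not taken from RetailRocket)
toy_transactions = [
    {"trigger", "x"},
    {"trigger", "y"},
    {"trigger", "y"},
    {"y"},
    {"trigger"},
]


def probability_with_support(of, given, transactions):
    """Return (P(of | given), number of transactions containing `given`)."""
    n_given = sum(1 for t in transactions if given in t)
    n_both = sum(1 for t in transactions if given in t and of in t)
    probability = n_both / n_given if n_given else float("nan")
    return probability, n_given


# "x" appears once, always with the trigger: probability 1.0 backed by 1 transaction
p_x, support_x = probability_with_support("trigger", "x", toy_transactions)
# "y" appears three times, twice with the trigger: probability 2/3 backed by 3 transactions
p_y, support_y = probability_with_support("trigger", "y", toy_transactions)
```

In this toy case the pair with the higher probability has the weaker evidence behind it, which is the situation to watch for in the list above.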

## Final remarks
All the conclusions made here are preliminary. They demonstrate empirical patterns but do not prove causality.
Also, due to the anonymity and limited size of the available data, the patterns may be biased and lead to incorrect conclusions. Additional research is required to validate them.