In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
%matplotlib inline

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
#load the dataset
retail_df = pd.read_excel("Online Retail.xlsx")
retail_df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


## Cleanup

In [3]:
# Strip leading and trailing spaces from the descriptions
retail_df['Description'] = retail_df['Description'].str.strip()
# Drop rows with a missing invoice number
retail_df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
# Remove cancelled orders (invoice numbers containing 'C')
retail_df['InvoiceNo'] = retail_df['InvoiceNo'].astype('str')
retail_df = retail_df[~retail_df['InvoiceNo'].str.contains('C')]

In [12]:
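def encode_units(x):
    # Flag a product as purchased (1) when the summed quantity is at least 1;
    # zero, negative, and any other values map to 0 instead of returning None.
    return 1 if x >= 1 else 0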

def create_basket(country_filter):
    basket = (retail_df[retail_df['Country'] == country_filter]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
    return basket
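
To make the reshaping concrete, here is a minimal sketch on a made-up three-invoice table. The `toy` data and the names `toy_basket` and `toy_sets` are illustrative assumptions, and the `>= 1` comparison stands in for `encode_units`.

```python
import pandas as pd

# Hypothetical toy transactions; the column names mirror the retail data.
toy = pd.DataFrame({
    "InvoiceNo":   ["1", "1", "2", "2", "3"],
    "Description": ["CLOCK", "POSTAGE", "CLOCK", "TEACUP", "TEACUP"],
    "Quantity":    [2, 1, 1, 6, 3],
})

# Same reshaping idea as create_basket, minus the country filter:
# one row per invoice, one column per product, summed quantities as values.
toy_basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
                 .sum().unstack().fillna(0))

# Convert quantities to 0/1 purchase flags.
toy_sets = (toy_basket >= 1).astype(int)
print(toy_sets)
```

Each row is one invoice and each column one product, which is the one-hot layout that `apriori` expects.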
    

In [15]:
country_filter = "France"
basket_french = create_basket(country_filter)
basket_sets = basket_french.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)

In [16]:
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)

In [18]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules.head()

| | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE PINK) | 0.096939 | 0.763158 | 7.478947 |
| 1 | (ALARM CLOCK BAKELIKE PINK) | (ALARM CLOCK BAKELIKE GREEN) | 0.102041 | 0.725 | 7.478947 |
| 2 | (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE RED) | 0.096939 | 0.815789 | 8.642959 |
| 3 | (ALARM CLOCK BAKELIKE RED) | (ALARM CLOCK BAKELIKE GREEN) | 0.094388 | 0.837838 | 8.642959 |
| 4 | (ALARM CLOCK BAKELIKE PINK) | (ALARM CLOCK BAKELIKE RED) | 0.102041 | 0.725 | 7.681081 |


According to the results, a customer who buys ALARM CLOCK BAKELIKE GREEN is about 7.5 times more likely than the average customer to buy ALARM CLOCK BAKELIKE PINK (lift). The rule holds in 76% of the baskets that contain the green clock (confidence). This suggests recommending **ALARM CLOCK BAKELIKE PINK** to those who bought the green one.
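
For readers who want to see where confidence and lift come from, here is a minimal sketch that computes them by hand. The `baskets` list and the helpers `support` and `rule_metrics` are made up for illustration and are not part of mlxtend.

```python
# Hypothetical toy baskets, each a set of item names.
baskets = [
    {"GREEN", "PINK"},
    {"GREEN", "PINK", "RED"},
    {"GREEN"},
    {"PINK"},
    {"RED"},
    {"RED", "TEACUP"},
    {"TEACUP"},
    {"TEACUP"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent))
    confidence = joint / support(antecedent)   # P(consequent | antecedent)
    lift = confidence / support(consequent)    # relative to the baseline rate
    return confidence, lift

conf, lift = rule_metrics({"GREEN"}, {"PINK"})
print(f"confidence={conf:.3f} lift={lift:.3f}")  # confidence=0.667 lift=1.778
```

The same relation can be checked against the French rules: taking the 0.102041 listed for the pink clock as its support, 0.763158 / 0.102041 is approximately 7.48, which matches the lift reported for the green-to-pink rule.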

In [19]:
basket_germany = create_basket("Germany")
basket2_sets = basket_germany.applymap(encode_units)
basket2_sets.drop('POSTAGE', inplace=True, axis=1)

In [21]:
frequent_itemsets_germany = apriori(basket2_sets, min_support=0.05, use_colnames=True)

In [22]:
rules = association_rules(frequent_itemsets_germany, metric="lift", min_threshold=1.2)
rules.head()

| | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (PLASTERS IN TIN WOODLAND ANIMALS) | (PLASTERS IN TIN CIRCUS PARADE) | 0.137856 | 0.492063 | 4.242887 |
| 1 | (PLASTERS IN TIN CIRCUS PARADE) | (PLASTERS IN TIN WOODLAND ANIMALS) | 0.115974 | 0.584906 | 4.242887 |
| 2 | (ROUND SNACK BOXES SET OF 4 FRUITS) | (PLASTERS IN TIN CIRCUS PARADE) | 0.157549 | 0.319444 | 2.754455 |
| 3 | (PLASTERS IN TIN CIRCUS PARADE) | (ROUND SNACK BOXES SET OF 4 FRUITS) | 0.115974 | 0.433962 | 2.754455 |
| 4 | (ROUND SNACK BOXES SET OF4 WOODLAND) | (PLASTERS IN TIN CIRCUS PARADE) | 0.245077 | 0.232143 | 2.001685 |


According to the results, a customer who buys PLASTERS IN TIN CIRCUS PARADE is about 4 times more likely than the average customer to buy PLASTERS IN TIN WOODLAND ANIMALS (lift). The rule holds in 58% of the baskets that contain PLASTERS IN TIN CIRCUS PARADE (confidence). This suggests recommending PLASTERS IN TIN WOODLAND ANIMALS to those who bought PLASTERS IN TIN CIRCUS PARADE.

At this point, you may want to look at how much opportunity there is to use the popularity of one product to drive sales of another.
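
One way to size that opportunity is to count how many invoices contain the antecedent but not the consequent. The sketch below does this on a made-up one-hot basket; the `toy_sets` frame and the shortened product names are assumptions. On the real data, the same idea would apply to `basket2_sets` with the full product descriptions.

```python
import pandas as pd

# Hypothetical one-hot basket: rows are invoices, columns are products.
toy_sets = pd.DataFrame(
    {"CIRCUS PARADE":    [1, 1, 1, 1, 0, 1],
     "WOODLAND ANIMALS": [1, 0, 1, 0, 0, 0]},
    index=["a", "b", "c", "d", "e", "f"],
)

# Number of invoices containing each product.
counts = toy_sets.sum()

# Invoices with the antecedent but without the consequent:
# these customers are the pool a recommendation could target.
gap = ((toy_sets["CIRCUS PARADE"] == 1)
       & (toy_sets["WOODLAND ANIMALS"] == 0)).sum()

print(counts.to_dict(), "cross-sell pool:", int(gap))
```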

In [23]:
basket_uk = create_basket("United Kingdom")
basket3_sets = basket_uk.applymap(encode_units)
basket3_sets.drop('POSTAGE', inplace=True, axis=1)

In [24]:
frequent_itemsets_uk = apriori(basket3_sets, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets_uk, metric="lift", min_threshold=0.8)
rules.head()

As we didn't find any rules for United Kingdom customers, we will try other filters.
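
A plausible reason (an assumption, not something verified here) is that UK invoices are spread over many more products, so few itemsets reach the 5% support threshold. The sketch below uses made-up baskets and a hypothetical `frequent_items` helper to show how the support threshold alone decides which items survive.

```python
from collections import Counter

# Hypothetical baskets: many invoices spread over many products.
baskets = [
    {"A", "B"}, {"A", "C"}, {"A", "B"}, {"D"}, {"E"},
    {"F"}, {"A"}, {"G"}, {"B"}, {"H"},
]

def frequent_items(baskets, min_support):
    """Return single items whose support is at least min_support."""
    counts = Counter(item for b in baskets for item in b)
    n = len(baskets)
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

# Lowering the threshold lets progressively rarer items through.
for threshold in (0.5, 0.3, 0.1):
    print(threshold, sorted(frequent_items(baskets, threshold)))
```

With no frequent itemsets of size two or more, there is nothing for `association_rules` to build rules from, which is why lowering `min_support` is a natural next thing to try.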

In [29]:
retail_filtered = retail_df[retail_df['Country']=="United Kingdom"]

In [40]:
basket = (retail_filtered[retail_filtered['Quantity']<10]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [41]:
basket4_sets = basket.applymap(encode_units)
basket4_sets.drop('POSTAGE', inplace=True, axis=1)

In [44]:
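# Note: as written, this mines basket3_sets (all UK invoices), not the
# Quantity < 10 basket4_sets built above; only min_support is lowered here.
frequent_itemsets_uk = apriori(basket3_sets, min_support=0.03, use_colnames=True)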
rules = association_rules(frequent_itemsets_uk, metric="lift", min_threshold=0.5)
rules.head()

| | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE RED) | 0.046928 | 0.642694 | 12.900183 |
| 1 | (ALARM CLOCK BAKELIKE RED) | (ALARM CLOCK BAKELIKE GREEN) | 0.049821 | 0.605376 | 12.900183 |
| 2 | (GREEN REGENCY TEACUP AND SAUCER) | (PINK REGENCY TEACUP AND SAUCER) | 0.050035 | 0.617773 | 16.403939 |
| 3 | (PINK REGENCY TEACUP AND SAUCER) | (GREEN REGENCY TEACUP AND SAUCER) | 0.03766 | 0.820768 | 16.403939 |
| 4 | (GREEN REGENCY TEACUP AND SAUCER) | (ROSES REGENCY TEACUP AND SAUCER) | 0.050035 | 0.750535 | 14.639752 |


## Conclusion - Perspective

We were able to identify some rules for the French and German markets that can be used to make recommendations to our customers. To better understand their preferences, with more time I would categorize the products by their root words: for ALARM CLOCK BAKELIKE GREEN, for example, the root word could be Alarm (a more general category). These categories could then be used to cluster the product purchases. Extracting categories could be done with the [word2vec](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/) method. I worked on a [similar example](https://github.com/SarahMestiri/MachineLearning/tree/master/Email-Spam-Filter) when doing spam filtering, where I had to categorize the emails.
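
As a rough illustration of the root-word idea, the sketch below groups descriptions by their leading word. This is a deliberately crude stand-in for the word2vec approach, not an implementation of it, and the `crude_category` helper is hypothetical.

```python
from collections import defaultdict

# Descriptions taken from the rule tables in this notebook.
descriptions = [
    "ALARM CLOCK BAKELIKE GREEN",
    "ALARM CLOCK BAKELIKE PINK",
    "ALARM CLOCK BAKELIKE RED",
    "PLASTERS IN TIN CIRCUS PARADE",
    "PLASTERS IN TIN WOODLAND ANIMALS",
    "GREEN REGENCY TEACUP AND SAUCER",
]

def crude_category(description, n_words=1):
    """Hypothetical helper: use the first n_words as a category label."""
    return " ".join(description.split()[:n_words])

# Group the descriptions under their crude category.
groups = defaultdict(list)
for d in descriptions:
    groups[crude_category(d)].append(d)

for category, members in groups.items():
    print(category, len(members))
```

The three alarm clocks land in one ALARM group, but GREEN REGENCY TEACUP AND SAUCER ends up under GREEN, which shows why a learned representation of the words would likely be needed in practice.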

It would also be interesting to apply collaborative filtering recommendations based on preferred items and items bought together.
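
A minimal sketch of what item-based collaborative filtering on co-purchases could look like is given below, using cosine similarity between binary purchase vectors. The `purchases` data and the `cosine` and `recommend` helpers are assumptions made for illustration; a real system would work from the full invoice table.

```python
from math import sqrt

# Hypothetical invoices mapped to the set of items they contain.
purchases = {
    "inv1": {"GREEN CLOCK", "PINK CLOCK"},
    "inv2": {"GREEN CLOCK", "PINK CLOCK", "RED CLOCK"},
    "inv3": {"GREEN CLOCK", "RED CLOCK"},
    "inv4": {"TEACUP"},
}

def cosine(item_a, item_b):
    """Cosine similarity between two items' binary purchase vectors."""
    a = {inv for inv, items in purchases.items() if item_a in items}
    b = {inv for inv, items in purchases.items() if item_b in items}
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(item, top_n=2):
    """Rank the other items by similarity to `item` (ties broken by name)."""
    others = {i for items in purchases.values() for i in items} - {item}
    ranked = sorted(others, key=lambda o: (-cosine(item, o), o))
    return ranked[:top_n]

print(recommend("GREEN CLOCK"))
```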