## Association Rules
### Apriori Algorithm

**Dataset**

We will work with a [Kaggle dataset](https://www.kaggle.com/roshansharma/market-basket-optimization) on market basket optimization.

The dataset contains purchases made in a grocery store. Each row corresponds to one purchase. In other words, the dataset is a sparse table whose rows list the items of each transaction; transactions shorter than the longest one leave the remaining cells empty (NaN).


In [1]:
import pandas as pd
# load the data
dataset = pd.read_csv('Market_Basket.csv', header = None)
# take a look at the dataset
print('Transaction number: ',dataset.shape[0])
dataset.head()

Transaction number:  7501


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


To simplify processing, replace each NaN with the last value within its transaction (a forward fill along the row).

In [2]:
# forward-fill along each row; fillna(method='ffill') is deprecated in recent pandas
dataset = dataset.ffill(axis = 1)
dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs,eggs
2,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney,chutney
3,turkey,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado,avocado
4,mineral water,milk,energy bar,whole wheat rice,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea,green tea


In [3]:
# build the list of transactions: one list of item names per row
transactions = []
for i in range(dataset.shape[0]):
    transactions.append([str(dataset.values[i, j]) for j in range(dataset.shape[1])])
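The padding keeps every row at 20 cells, so each short transaction carries repeated copies of its last item. An alternative, shown here only as a sketch on a hypothetical toy frame (`toy` and `toy_transactions` are not part of the original notebook), is to drop the missing cells from the unpadded frame, so that each transaction keeps its true length:

```python
import pandas as pd

# hypothetical toy frame with the same ragged layout as the real data
toy = pd.DataFrame([
    ['burgers', 'meatballs', 'eggs'],
    ['chutney', None, None],
    ['turkey', 'avocado', None],
])

# keep only the cells that actually hold an item
toy_transactions = [row.dropna().astype(str).tolist() for _, row in toy.iterrows()]
print(toy_transactions)
```

Applied to the real data, the same comprehension would run on `dataset` before any forward fill.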


### Apriori

We will use the ready-made implementation of the apriori algorithm from the `efficient_apriori` library.

In [4]:
!pip install efficient_apriori
!pip install dataclasses

Collecting efficient_apriori
  Downloading efficient_apriori-2.0.3-py3-none-any.whl (14 kB)
Installing collected packages: efficient_apriori
Successfully installed efficient_apriori-2.0.3
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Installing collected packages: dataclasses
Successfully installed dataclasses-0.6


Note that we choose the thresholds ourselves, depending on how "strong" we want the resulting rules to be:
* `min_support` -- minimum support of a rule `(dtype = float)`

* `min_confidence` -- minimum confidence of a rule `(dtype = float)`

* `max_length` -- maximum itemset length `(dtype = integer)`
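To make the role of `min_support` and `max_length` concrete, here is a minimal sketch of the level-wise search behind Apriori. It is an illustration under simplifying assumptions (brute-force support counting, a hypothetical helper `frequent_itemsets`, toy data), not the code that `efficient_apriori` actually runs.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_length):
    """Level-wise (Apriori-style) search; returns {size: {itemset: support}}."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # share of transactions that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # level 1: frequent single items
    items = sorted({item for basket in baskets for item in basket})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    result = {}
    k = 1
    while current and k <= max_length:
        result[k] = {tuple(sorted(s)): support(s) for s in current}
        # candidates of size k + 1 are unions of frequent k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result

toy = [('eggs', 'bacon', 'soup'), ('eggs', 'bacon', 'apple'), ('soup', 'bacon', 'banana')]
result = frequent_itemsets(toy, min_support=0.5, max_length=3)
for size, sets in result.items():
    print(size, sets)
```

Raising `min_support` shrinks every level, and `max_length` simply stops the loop early; both limit how many candidates have to be counted.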

In [5]:
# import apriori
from efficient_apriori import apriori

# compute the result; apriori already returns an (itemsets, rules) pair
itemsets, rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, max_length = 8)

The function returns the frequent itemsets and a list of rules.

### Let's look at the rules:

In [6]:
rules[-10:]

[{milk, mineral water, spaghetti} -> {olive oil},
 {milk, mineral water, olive oil} -> {spaghetti},
 {mineral water, shrimp, spaghetti} -> {milk},
 {milk, shrimp, spaghetti} -> {mineral water},
 {milk, mineral water, shrimp} -> {spaghetti},
 {mineral water, spaghetti, tomatoes} -> {milk},
 {milk, spaghetti, tomatoes} -> {mineral water},
 {milk, mineral water, tomatoes} -> {spaghetti},
 {milk, mineral water, spaghetti} -> {tomatoes},
 {milk, tomatoes} -> {mineral water, spaghetti}]

In [7]:
print(type(rules[0]))
rules[0].lhs, rules[0].rhs

<class 'efficient_apriori.rules.Rule'>


(('almonds',), ('burgers',))
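Besides `lhs` and `rhs`, rule objects expose their metrics as attributes. The sketch below flattens rules into plain dictionaries that are easy to sort or inspect. `ToyRule` is a hypothetical stand-in that only mimics the attribute names `lhs`, `rhs`, `support`, `confidence` and `lift` of `efficient_apriori` rules, and its numbers are made up for illustration; with the real library, `rules` would be passed in place of `toy_rules`.

```python
from collections import namedtuple

# hypothetical stand-in for efficient_apriori's Rule; values are illustrative only
ToyRule = namedtuple('ToyRule', ['lhs', 'rhs', 'support', 'confidence', 'lift'])
toy_rules = [
    ToyRule(('eggs',), ('bacon',), 0.4, 0.8, 1.6),
    ToyRule(('soup',), ('bread',), 0.2, 0.5, 2.5),
]

def rules_to_rows(rules):
    """Turn rule objects into dictionaries, highest lift first."""
    rows = [
        {
            'antecedent': ', '.join(rule.lhs),
            'consequent': ', '.join(rule.rhs),
            'support': rule.support,
            'confidence': rule.confidence,
            'lift': rule.lift,
        }
        for rule in rules
    ]
    return sorted(rows, key=lambda row: row['lift'], reverse=True)

rows = rules_to_rows(toy_rules)
for row in rows:
    print(row)
```

The resulting list can also be handed to `pd.DataFrame(rows)` for tabular display.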

### Another apriori implementation: mlxtend

**Drawbacks of the previous implementation:**

 * data format: the data must be passed as lists of purchases, which can be computationally expensive;

 * inconvenient output format;

 * by contrast, mlxtend has community support.



In [8]:
!pip install mlxtend
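mlxtend's `apriori` expects a one-hot table: one row per transaction and one column per item. A minimal sketch of building such a table from lists of purchases in plain Python follows; the helper `one_hot_rows` and the toy baskets are hypothetical.

```python
def one_hot_rows(transactions):
    """Return (sorted item names, boolean rows) for a list of transactions."""
    items = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # True where the item occurs in this transaction
        rows.append([item in present for item in items])
    return items, rows

toy_baskets = [['banana', 'grapes'], ['apple', 'grapes'], ['apple'], ['banana']]
items, rows = one_hot_rows(toy_baskets)
print(items)
for row in rows:
    print(row)
```

`pd.DataFrame(rows, columns=items)` then gives a frame in the layout described above.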



## Online Retail Dataset

The data for this example come from the UCI Machine Learning Repository. The dataset is called "Online Retail" and is available [here](http://archive.ics.uci.edu/ml/datasets/Online+Retail). As its description states, it contains all purchases made at a UK-based online retailer between 01/12/2010 and 09/12/2011.


In [9]:
'''
Load apriori and association_rules from mlxtend.
A different dataset is used because mlxtend needs the data in the format below.

             itemname  apple banana grapes
transaction  1            0    1     1
             2            1    0     1
             3            1    0     0
             4            0    1     0

The previous data could be used as well, but it would first have to be
converted to this format, so a separate dataset is used instead.
'''

from mlxtend.frequent_patterns import apriori as apriori_mlx
from mlxtend.frequent_patterns import association_rules
website_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
df1 = pd.read_excel(website_url)
#df1 = pd.read_excel('Online Retail.xlsx')
print(df1.shape)
df1.head()



(541909, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


There are a lot of transactions. To speed up the algorithm, we will mine association rules for a single country.

In [10]:
df1.Country.value_counts().head(10)

United Kingdom    495478
Germany             9495
France              8557
EIRE                8196
Spain               2533
Netherlands         2371
Belgium             2069
Switzerland         2002
Portugal            1519
Australia           1259
Name: Country, dtype: int64

In [11]:
# use the France data; .copy() avoids SettingWithCopyWarning when columns are modified later
df1 = df1[df1.Country == 'France'].copy()

df1.shape

(8557, 8)

**Preprocessing**

Strip extra whitespace and drop erroneous transactions with a negative quantity.

In [12]:
# remove extra spaces
df1['Description'] = df1['Description'].str.strip()


# some transactions have a negative quantity, which is not possible for a purchase
# keep only the rows with a positive quantity
df1 = df1[df1.Quantity > 0]

In [13]:
df1.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
26,536370,22728,ALARM CLOCK BAKELIKE PINK,24,2010-12-01 08:45:00,3.75,12583.0,France
27,536370,22727,ALARM CLOCK BAKELIKE RED,24,2010-12-01 08:45:00,3.75,12583.0,France
28,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.75,12583.0,France
29,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,0.85,12583.0,France
30,536370,21883,STARS GIFT TAPE,24,2010-12-01 08:45:00,0.65,12583.0,France


Using `pivot_table`, convert the data into the format the algorithm needs: an invoice-item table in which each cell holds the purchased quantity.

In [14]:
basket = pd.pivot_table(data=df1,index='InvoiceNo',columns='Description',values='Quantity', \
                        aggfunc='sum',fill_value=0)
basket.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
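What the pivot step does is easier to see on a tiny table. The sketch below uses hypothetical toy invoice lines (`toy` and `toy_basket` are illustrative names) with the same column names as the dataset; repeated lines of one product within an invoice are summed, and absent products are filled with 0.

```python
import pandas as pd

# hypothetical toy invoice lines; the column names follow the dataset
toy = pd.DataFrame({
    'InvoiceNo': ['A', 'A', 'A', 'B'],
    'Description': ['PEN', 'PEN', 'TAPE', 'TAPE'],
    'Quantity': [2, 3, 1, 4],
})

# one row per invoice, one column per product, summed quantities in the cells
toy_basket = pd.pivot_table(data=toy, index='InvoiceNo', columns='Description',
                            values='Quantity', aggfunc='sum', fill_value=0)
print(toy_basket)
```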


Since the Apriori algorithm does not care about the quantity purchased, convert the data to a binary format.

In [15]:
# 1 if any positive quantity of the item was bought in the invoice, otherwise 0
# (vectorized replacement for applymap, which is deprecated in recent pandas)
basket_sets = (basket > 0).astype(int)


Drop the technical item POSTAGE (postage fee), which appears in most transactions.

In [16]:
basket_sets['POSTAGE'].head()

InvoiceNo
536370    1
536852    1
536974    1
537065    1
537463    1
Name: POSTAGE, dtype: int64

In [17]:
#remove postage item
basket_sets.drop(columns=['POSTAGE'],inplace=True)

In [18]:
from mlxtend.frequent_patterns import apriori as apriori_mlx
#call apriori function and pass minimum support
frequent_itemsets = apriori_mlx(basket_sets, min_support=0.07, use_colnames=True)

Let's look at the resulting rules.

In [19]:
# derive association rules from the frequent itemsets
# the metric is lift; min_threshold=0.1 keeps practically all rules (lift > 1 means a positive association)
rules_mlxtend = association_rules(frequent_itemsets, metric="lift", min_threshold=0.1)
rules_mlxtend.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,0.064088,3.283859
1,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,0.064088,3.791383
2,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181
4,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,0.064348,4.153061
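The metric columns are tied together by simple formulas, which can be checked by hand. The sketch below recomputes confidence, lift, leverage and conviction for row 0 of the table above from its three support values, using the standard definitions of these metrics; the variable names are illustrative.

```python
# values copied from row 0 of the table above
antecedent_support = 0.102041   # support of ALARM CLOCK BAKELIKE PINK
consequent_support = 0.096939   # support of ALARM CLOCK BAKELIKE GREEN
support = 0.07398               # support of both items together

# confidence: share of antecedent transactions that also contain the consequent
confidence = support / antecedent_support
# lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail if the items were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 2), round(leverage, 3), round(conviction, 2))
# prints: 0.725 7.48 0.064 3.28
```

The results agree with the `confidence`, `lift`, `leverage` and `conviction` columns of that row up to rounding of the inputs.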


In [20]:
rules_mlxtend[ (rules_mlxtend['lift'] >= 4) & (rules_mlxtend['confidence'] >= 0.8) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181
16,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS),0.127551,0.132653,0.102041,0.8,6.030769,0.085121,4.336735
18,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.137755,0.127551,0.122449,0.888889,6.968889,0.104878,7.852041
19,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.127551,0.137755,0.122449,0.96,6.968889,0.104878,21.556122
20,"(SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...",(SET/20 RED RETROSPOT PAPER NAPKINS),0.122449,0.132653,0.09949,0.8125,6.125,0.083247,4.62585
21,"(SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...",(SET/6 RED SPOTTY PAPER PLATES),0.102041,0.127551,0.09949,0.975,7.644,0.086474,34.897959
22,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.102041,0.137755,0.09949,0.975,7.077778,0.085433,34.489796
