Мышковец С.А., v.1 10.01.2023

Solution:

Building association rules for the dataset:

1. Applying the apriori and fpgrowth algorithms.
2. Applying collaborative filters.


Conclusions:

1. apriori from mlxtend (Wall time: 8.82 ms) ran faster than apriori from apyori (Wall time: 226 ms, probably a fluke, since some runs took 13.6 ms) and FP-Growth (Wall time: 17.7 ms), although the documentation claims the opposite. These algorithms are better suited to extracting frequently occurring items for further study of association rules than to producing recommendations that are usable in practice. In my view, all of them lose to more complex algorithms in practicality and clarity.

2. apriori from apyori ran about three orders of magnitude slower (1min 59s) than apriori from mlxtend (37.3 ms) with the same support and lift thresholds. Its output is hard to read; extra code is needed to extract results suitable for further use.

Results of the collaborative filters:


| Method based on | Recommendations
| :----------- | :-----------
| Keywords (cosine_similarity, new feature engineering) | sandwich loaves, paper towels, eggs, dinner rolls
| Preferences of other customers, for a new customer (popularity) | poultry, bagels, lunch meat, ice cream, soda
| Customer similarity (cosine_similarity) | poultry, bagels, lunch meat, cereals, flour
| Product similarity (cosine_similarity) | bagels, poultry, flour, cereals, lunch meat
| Surprise.SVD | vegetables, lunch meat, waffles, soap, toilet paper
| LightFM | fruits, pasta, ice cream, paper towels, spaghetti sauce
                                                                                             
3. Algorithms based on collaborative filters give more informative recommendations.

4. An obvious drawback of the Surprise.SVD model is the absence of zero predictions.

5. The LightFM model's result clearly stands out from those of the other models.
                                                                                                                                                        

---

# apriori algorithm from mlxtend.frequent_patterns

The dataset contains information about products bought in a store. Each row is a separate receipt.

In [13]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import re
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
from mpl_toolkits.mplot3d import Axes3D
import networkx as nx
import collections
import warnings
warnings.simplefilter(action='ignore', category=Warning)

**Load the data.**

In [2]:
basket = pd.read_csv("dataset.csv", skipinitialspace=True, header=None)
display(basket.head())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pork,sandwich bags,lunch meat,all- purpose,flour,soda,butter,vegetables,beef,aluminum foil,all- purpose,dinner rolls,shampoo,all- purpose
1,shampoo,hand soap,waffles,vegetables,cheeses,mixes,milk,sandwich bags,laundry detergent,dishwashing liquid/detergent,waffles,individual meals,hand soap,vegetables
2,pork,soap,ice cream,toilet paper,dinner rolls,hand soap,spaghetti sauce,milk,ketchup,sandwich loaves,poultry,toilet paper,ice cream,ketchup
3,juice,lunch meat,soda,toilet paper,all- purpose,,,,,,,,,
4,pasta,tortillas,mixes,hand soap,toilet paper,vegetables,vegetables,paper towels,vegetables,flour,vegetables,pork,poultry,eggs


Fill the missing values with a single-space placeholder.

In [3]:
basket = basket.fillna(' ')

Merge all purchases of a receipt into a single cell.

In [4]:
basket.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')

In [5]:
basket_columns = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

In [6]:
basket1 = basket[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]].T.agg(', '.join)

In [7]:
basket_new = basket1.to_frame()

In [8]:
basket_new = basket_new.rename(columns={0: 'itemDescription'})

In [9]:
basket_new.itemDescription

0       pork, sandwich bags, lunch meat, all- purpose,...
1       shampoo, hand soap, waffles, vegetables, chees...
2       pork, soap, ice cream, toilet paper, dinner ro...
3       juice, lunch meat, soda, toilet paper, all- pu...
4       pasta, tortillas, mixes, hand soap, toilet pap...
                              ...                        
1494    beef, sandwich bags, hand soap, paper towels, ...
1495    dinner rolls, lunch meat, spaghetti sauce, pas...
1496    lunch meat, eggs, poultry, vegetables, tortill...
1497    ketchup, milk, poultry, cheeses, soap, toilet ...
1498    laundry detergent, vegetables, shampoo, vegeta...
Name: itemDescription, Length: 1499, dtype: object

In [10]:
basket_new

Unnamed: 0,itemDescription
0,"pork, sandwich bags, lunch meat, all- purpose,..."
1,"shampoo, hand soap, waffles, vegetables, chees..."
2,"pork, soap, ice cream, toilet paper, dinner ro..."
3,"juice, lunch meat, soda, toilet paper, all- pu..."
4,"pasta, tortillas, mixes, hand soap, toilet pap..."
...,...
1494,"beef, sandwich bags, hand soap, paper towels, ..."
1495,"dinner rolls, lunch meat, spaghetti sauce, pas..."
1496,"lunch meat, eggs, poultry, vegetables, tortill..."
1497,"ketchup, milk, poultry, cheeses, soap, toilet ..."


Use TransactionEncoder to encode the transactions into a format suitable for the apriori function.

In [11]:
basket_new.itemDescription = basket_new.itemDescription.transform(lambda x: x.split(", "))

In [12]:
basket = basket_new.itemDescription

In [13]:
basket[0]

['pork',
 'sandwich bags',
 'lunch meat',
 'all- purpose',
 'flour',
 'soda',
 'butter',
 'vegetables',
 'beef',
 'aluminum foil',
 'all- purpose',
 'dinner rolls',
 'shampoo',
 'all- purpose']

In [14]:
encoder = TransactionEncoder()
transactions = pd.DataFrame(encoder.fit(basket).transform(basket), columns=encoder.columns_)
display(transactions.head())

Unnamed: 0,Unnamed: 1,all- purpose,aluminum foil,bagels,beef,butter,cereals,cheeses,coffee/tea,dinner rolls,...,shampoo,soap,soda,spaghetti sauce,sugar,toilet paper,tortillas,vegetables,waffles,yogurt
0,False,True,True,False,True,True,False,False,False,True,...,True,False,True,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,False,True,True,False
2,False,False,False,False,False,False,False,False,False,True,...,False,True,False,True,False,True,False,False,False,False
3,True,True,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False


In [15]:
transactions.columns

Index([' ', 'all- purpose', 'aluminum foil', 'bagels', 'beef', 'butter',
       'cereals', 'cheeses', 'coffee/tea', 'dinner rolls',
       'dishwashing liquid/detergent', 'eggs', 'flour', 'fruits', 'hand soap',
       'ice cream', 'individual meals', 'juice', 'ketchup',
       'laundry detergent', 'lunch meat', 'milk', 'mixes', 'paper towels',
       'pasta', 'pork', 'poultry', 'sandwich bags', 'sandwich loaves',
       'shampoo', 'soap', 'soda', 'spaghetti sauce', 'sugar', 'toilet paper',
       'tortillas', 'vegetables', 'waffles', 'yogurt'],
      dtype='object')

Note: each row of the dataframe is one transaction, and the products bought in that transaction are marked True.
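
To make the encoding step concrete, here is a minimal pure-Python sketch of roughly what such a one-hot encoding amounts to. The helper `one_hot_encode` and the toy receipts are illustrative assumptions, not mlxtend's implementation and not rows of dataset.csv.

```python
def one_hot_encode(transactions):
    """Encode item lists as boolean rows, one column per distinct item."""
    # Sorted column order keeps the layout deterministic.
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)  # duplicates within a receipt collapse here
        rows.append([item in present for item in columns])
    return columns, rows


# Toy receipts (hypothetical, for illustration only)
receipts = [["pork", "soda", "pork"], ["soda", "milk"]]
columns, rows = one_hot_encode(receipts)
print(columns)  # ['milk', 'pork', 'soda']
print(rows)     # [[False, True, True], [True, False, True]]
```

A repeated item in one receipt still yields a single True, so purchase counts are not preserved by this representation.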

Drop the column that corresponds to the empty placeholder.

In [16]:
transactions = transactions.drop(' ', axis=1)

The Apriori algorithm is used to generate frequent itemsets. We set the minimum support to 4 transactions out of the total number of transactions, generate association rules, and keep the rules with lift > 1.15.
These parameters were chosen empirically. Higher thresholds yield no results. Lift can be raised by increasing the maximum itemset length, but the results immediately become less practical. The dataset is poorly suited to recommendations by this method.

In [17]:
%%time

frequent_itemsets = apriori(transactions, min_support=4 / len(basket), use_colnames=True, max_len=2)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.15)
display(rules.head(20))
print("Rules identified: ", len(rules))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(all- purpose),(fruits),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
1,(fruits),(all- purpose),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
2,(sandwich loaves),(butter),0.248833,0.261508,0.078719,0.316354,1.209731,0.013648,1.080226
3,(butter),(sandwich loaves),0.261508,0.248833,0.078719,0.30102,1.209731,0.013648,1.074663
4,(cheeses),(sandwich bags),0.260173,0.250167,0.075384,0.289744,1.158202,0.010297,1.055722
5,(sandwich bags),(cheeses),0.250167,0.260173,0.075384,0.301333,1.158202,0.010297,1.058912
6,(dishwashing liquid/detergent),(fruits),0.268179,0.263509,0.084056,0.313433,1.189458,0.013389,1.072715
7,(fruits),(dishwashing liquid/detergent),0.263509,0.268179,0.084056,0.318987,1.189458,0.013389,1.074607
8,(dishwashing liquid/detergent),(individual meals),0.268179,0.271514,0.084056,0.313433,1.154388,0.011242,1.061055
9,(individual meals),(dishwashing liquid/detergent),0.271514,0.268179,0.084056,0.309582,1.154388,0.011242,1.059969


Rules identified:  16
CPU times: user 31.4 ms, sys: 3.74 ms, total: 35.2 ms
Wall time: 33.6 ms
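
The columns of the rules table follow from simple counting. As a check against row 0: confidence = 0.08072 / 0.263509 ≈ 0.3063 and lift = 0.3063 / 0.263509 ≈ 1.1625. The sketch below recomputes the same metrics on a toy set of receipts; the function `rule_metrics` and the toy data are illustrative assumptions, not part of mlxtend.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift and leverage of the rule antecedent -> consequent.

    `transactions` is a list of sets; the antecedent must occur at least once.
    """
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    support_a = sum(a <= t for t in transactions) / n          # P(A)
    support_c = sum(c <= t for t in transactions) / n          # P(C)
    support_ac = sum((a | c) <= t for t in transactions) / n   # P(A and C)
    confidence = support_ac / support_a                        # P(C | A)
    return {
        "support": support_ac,
        "confidence": confidence,
        "lift": confidence / support_c,
        "leverage": support_ac - support_a * support_c,
    }


# Toy receipts (hypothetical, for illustration only)
toy = [
    {"butter", "sandwich loaves"},
    {"butter", "milk"},
    {"butter", "sandwich loaves", "eggs"},
    {"milk", "eggs"},
]
metrics = rule_metrics(toy, ["butter"], ["sandwich loaves"])
print(metrics)
```

A lift above 1 means the two itemsets occur together more often than they would if they were independent, which is why the filter keeps only rules above a lift threshold.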


Measure the run time at a minimum support of 0.5. Equivalent measurements for the other implementations of the algorithm are given below.

In [18]:
%%time

apriori(transactions, min_support=0.5, low_memory=True)

CPU times: user 7.59 ms, sys: 644 µs, total: 8.24 ms
Wall time: 7.87 ms


Unnamed: 0,support,itemsets
0,0.597065,(35)
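
A single `%%time` reading can vary a lot between runs (the summary quotes both 226 ms and 13.6 ms for the same apyori call). A more stable figure can be obtained by repeating the call and keeping the best time. The sketch below uses the standard-library `timeit` module; `best_time` and `workload` are hypothetical names, and `workload` merely stands in for a call such as `apriori(...)`.

```python
import timeit


def best_time(func, repeat=5, number=10):
    """Best per-call time in seconds over `repeat` batches of `number` calls."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def workload():
    # Placeholder for the call being measured (assumption for illustration)
    return sum(i * i for i in range(1000))


print(f"best per-call time: {best_time(workload) * 1e3:.3f} ms")
```

Taking the minimum rather than the mean reduces the influence of one-off delays such as a cold cache or a busy machine.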


Measure the run time at a minimum support of 0.27.

In [19]:
%%time

apriori(transactions, min_support=0.27, low_memory=True)

CPU times: user 9.13 ms, sys: 3.31 ms, total: 12.4 ms
Wall time: 10.6 ms


Unnamed: 0,support,itemsets
0,0.278185,(2)
1,0.273516,(5)
2,0.27485,(14)
3,0.271514,(15)
4,0.276184,(19)
5,0.270847,(20)
6,0.273516,(21)
7,0.271514,(23)
8,0.287525,(25)
9,0.274183,(30)


---

# apriori algorithm from apyori

In [72]:
# Note: this rebinds the name apriori, shadowing the mlxtend import above
from apyori import apriori

In [73]:
all_cheques = []
for i in basket_new['itemDescription_no_dupl']:
    i = [a for a in set(i)]
    all_cheques.append(i)

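Compare the run time of apriori from apyori with the parameters used above. Note that the mlxtend run was capped with `max_len=2`, whereas apyori's documented keyword for limiting itemset size is `max_length`; `min_length=2` does not appear to cap it, which would explain the 90792 records returned below and part of the time gap.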

In [74]:
%%time

assosiation_rules = apriori(all_cheques, min_support=4/len(basket), min_lift=1.15, min_length=2)
assosiation_results = list(assosiation_rules)

CPU times: user 1min 57s, sys: 527 ms, total: 1min 57s
Wall time: 1min 57s


In [75]:
len(assosiation_results)

90792

In [76]:
assosiation_results[0]

RelationRecord(items=frozenset({'all- purpose', 'fruits'}), support=0.08072048032021348, ordered_statistics=[OrderedStatistic(items_base=frozenset({'all- purpose'}), items_add=frozenset({'fruits'}), confidence=0.30632911392405066, lift=1.1624995994231695), OrderedStatistic(items_base=frozenset({'fruits'}), items_add=frozenset({'all- purpose'}), confidence=0.30632911392405066, lift=1.1624995994231695)])

Measure the run time at a minimum support of 0.5.

In [77]:
%%time

assosiation_rules = apriori(all_cheques, min_support=0.5)
assosiation_results = list(assosiation_rules)

CPU times: user 219 ms, sys: 8.56 ms, total: 228 ms
Wall time: 226 ms


In [78]:
assosiation_results

[RelationRecord(items=frozenset({'vegetables'}), support=0.5970647098065377, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'vegetables'}), confidence=0.5970647098065377, lift=1.0)])]

---

# FP-Growth algorithm

In [79]:
all_cheques = []
for i in basket_new['itemDescription_no_dupl']:
    i = [a for a in set(i)]
    all_cheques.append(i)

In [80]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(all_cheques).transform(all_cheques)
df = pd.DataFrame(te_ary, columns=te.columns_)

In [81]:
df.head()

Unnamed: 0,all- purpose,aluminum foil,bagels,beef,butter,cereals,cheeses,coffee/tea,dinner rolls,dishwashing liquid/detergent,...,shampoo,soap,soda,spaghetti sauce,sugar,toilet paper,tortillas,vegetables,waffles,yogurt
0,True,True,False,True,True,False,False,False,True,False,...,True,False,True,False,False,False,False,True,False,False
1,False,False,False,False,False,False,True,False,False,True,...,True,False,False,False,False,False,False,True,True,False
2,False,False,False,False,False,False,False,False,True,False,...,False,True,False,True,False,True,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False


In [82]:
from mlxtend.frequent_patterns import fpgrowth

Return the items and itemsets. Measure the run time at a minimum support of 0.5.

In [83]:
%%time

fpgrowth(df, min_support=0.5)

CPU times: user 15.2 ms, sys: 2.18 ms, total: 17.3 ms
Wall time: 17.7 ms


Unnamed: 0,support,itemsets
0,0.597065,(35)


By default fpgrowth returns the column indices of the items, which can be useful for further study of association rules. For readability, set use_colnames=True to get the corresponding product names.

In [84]:
%%time

fpgrowth(df, min_support=0.27, use_colnames=True)

CPU times: user 24.1 ms, sys: 978 µs, total: 25.1 ms
Wall time: 24.7 ms


Unnamed: 0,support,itemsets
0,0.597065,(vegetables)
1,0.276184,(lunch meat)
2,0.274183,(soda)
3,0.278853,(waffles)
4,0.273516,(mixes)
5,0.271514,(individual meals)
6,0.270847,(milk)
7,0.287525,(poultry)
8,0.27485,(ice cream)
9,0.27018,(toilet paper)


---

# **Keyword-based recommendations.**

**Load the data.**

In [20]:
basket = pd.read_csv("dataset.csv", skipinitialspace=True, header=None)
display(basket.head(5))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pork,sandwich bags,lunch meat,all- purpose,flour,soda,butter,vegetables,beef,aluminum foil,all- purpose,dinner rolls,shampoo,all- purpose
1,shampoo,hand soap,waffles,vegetables,cheeses,mixes,milk,sandwich bags,laundry detergent,dishwashing liquid/detergent,waffles,individual meals,hand soap,vegetables
2,pork,soap,ice cream,toilet paper,dinner rolls,hand soap,spaghetti sauce,milk,ketchup,sandwich loaves,poultry,toilet paper,ice cream,ketchup
3,juice,lunch meat,soda,toilet paper,all- purpose,,,,,,,,,
4,pasta,tortillas,mixes,hand soap,toilet paper,vegetables,vegetables,paper towels,vegetables,flour,vegetables,pork,poultry,eggs


Fill the missing values with a single-space placeholder.

In [21]:
basket = basket.fillna(' ')

Merge all products of a receipt into one cell.

In [22]:
basket1 = basket[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]].T.agg(', '.join)

In [23]:
basket_new = basket1.to_frame()

In [24]:
basket_new = basket_new.rename(columns={0: 'itemDescription'})

In [25]:
basket_new

Unnamed: 0,itemDescription
0,"pork, sandwich bags, lunch meat, all- purpose,..."
1,"shampoo, hand soap, waffles, vegetables, chees..."
2,"pork, soap, ice cream, toilet paper, dinner ro..."
3,"juice, lunch meat, soda, toilet paper, all- pu..."
4,"pasta, tortillas, mixes, hand soap, toilet pap..."
...,...
1494,"beef, sandwich bags, hand soap, paper towels, ..."
1495,"dinner rolls, lunch meat, spaghetti sauce, pas..."
1496,"lunch meat, eggs, poultry, vegetables, tortill..."
1497,"ketchup, milk, poultry, cheeses, soap, toilet ..."


In [26]:
basket_new.itemDescription = basket_new.itemDescription.transform(lambda x: x.split(", "))

In [27]:
basket_new.itemDescription

0       [pork, sandwich bags, lunch meat, all- purpose...
1       [shampoo, hand soap, waffles, vegetables, chee...
2       [pork, soap, ice cream, toilet paper, dinner r...
3       [juice, lunch meat, soda, toilet paper, all- p...
4       [pasta, tortillas, mixes, hand soap, toilet pa...
                              ...                        
1494    [beef, sandwich bags, hand soap, paper towels,...
1495    [dinner rolls, lunch meat, spaghetti sauce, pa...
1496    [lunch meat, eggs, poultry, vegetables, tortil...
1497    [ketchup, milk, poultry, cheeses, soap, toilet...
1498    [laundry detergent, vegetables, shampoo, veget...
Name: itemDescription, Length: 1499, dtype: object

In [28]:
basket_new.iloc[0:1]

Unnamed: 0,itemDescription
0,"[pork, sandwich bags, lunch meat, all- purpose..."


**Generate new features from the existing ones.**

All products in the sample:

In [266]:
all_pr = ['all- purpose',
 'aluminum foil',
 'bagels',
 'beef',
 'butter',
 'cereals',
 'cheeses',
 'coffee/tea',
 'dinner rolls',
 'dishwashing liquid/detergent',
 'eggs',
 'flour',
 'fruits',
 'hand soap',
 'ice cream',
 'individual meals',
 'juice',
 'ketchup',
 'laundry detergent',
 'lunch meat',
 'milk',
 'mixes',
 'paper towels',
 'pasta',
 'pork',
 'poultry',
 'sandwich bags',
 'sandwich loaves',
 'shampoo',
 'soap',
 'soda',
 'spaghetti sauce',
 'sugar',
 'toilet paper',
 'tortillas',
 'vegetables',
 'waffles',
 'yogurt']

In [30]:
# Every value is a list, so `a in value` below is an exact-match test;
# with a bare string it would be a substring test (' ' in 'all- purpose' is True).
departments = {"fruits_vegetables": ['fruits', 'vegetables'],
"beverage": ['juice'],
"dairy": ['ice cream', 'butter', 'cheeses', 'milk', 'yogurt', 'eggs'],
"dry_goods": ['pasta', 'coffee/tea', 'sugar', 'cereals', 'flour', 'soda', 'mixes'],
"cleaning_products": ['dishwashing liquid/detergent', 'aluminum foil', 'laundry detergent', 'soap'],
"paper_products": ['toilet paper', 'paper towels', 'sandwich bags'],
"canned_goods": ['spaghetti sauce', 'ketchup'],
"meat": ['pork', 'poultry', 'lunch meat', 'beef'],
"health_beauty": ['shampoo', 'hand soap'],
"deli": ['dinner rolls', 'individual meals'],
"bread": ['tortillas', 'sandwich loaves', 'bagels', 'waffles'],
'all- purpose': ['all- purpose']}

In [31]:
department = []
for i in basket_new['itemDescription']:
    dep = []
    for a in i:
        for key, value in departments.items():
            if a in value:
                dep.append(key)
    dep = list(set(dep))
    department.append(dep)
basket_new['department'] = department

In [32]:
keywords = {
    "lunch":['sandwich loaves', 'bagels', 'fruits', 'juice', 'butter', 'cheeses', 'yogurt', 'paper towels', 'dinner rolls', 'individual meals', 'sandwich bags', 'lunch meat'], 
    "baking":['ice cream', 'all- purpose', 'fruits', 'butter', 'eggs', 'sugar', 'flour', 'soda', 'mixes', 'paper towels'],
    "cooking":['all- purpose', 'tortillas', 'pork', 'poultry', 'beef', 'vegetables', 'cheeses', 'butter', 'eggs', 'flour', 'paper towels', 'spaghetti sauce', 'ketchup'],
    "cleaning":['all- purpose', 'shampoo', 'hand soap', 'dishwashing liquid/detergent', 'aluminum foil', 'laundry detergent', 'soap', 'toilet paper', 'paper towels'],
    "breakfast":['milk', 'juice', 'sandwich loaves', 'waffles', 'fruits', 
                 'butter', 'cheeses', 'yogurt', 'eggs', 'coffee/tea', 'cereals']
}

In [33]:
key_words = []
for i in basket_new['itemDescription']:
    kw = []
    for a in i:
        for key, value in keywords.items():
            if a in value:
                kw.append(key)
    kw = list(set(kw))
    key_words.append(kw)
basket_new['key_words'] = key_words

Create the column 'itemDescription_no_dupl' from 'itemDescription', removing duplicate products within each receipt.

In [34]:
itemDescription_no_dupl = []
for i in basket_new['itemDescription']:
    # drop duplicates and the ' ' placeholder left by fillna
    items = [a for a in set(i) if a != ' ']
    itemDescription_no_dupl.append(items)
basket_new['itemDescription_no_dupl'] = itemDescription_no_dupl
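
Deduplicating through `set` does not preserve the order in which products appear on the receipt. If that order matters, an order-preserving variant is possible with `dict.fromkeys`. This is a sketch of an alternative, not what the notebook runs; `unique_in_order` is a hypothetical helper.

```python
def unique_in_order(items, placeholder=" "):
    """Drop duplicates and the placeholder while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return [item for item in dict.fromkeys(items) if item != placeholder]


receipt = ["pork", "soap", "ice cream", "toilet paper", "ice cream", " ", "pork"]
print(unique_in_order(receipt))  # ['pork', 'soap', 'ice cream', 'toilet paper']
```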

Build the 'soup': one string per receipt that joins its departments and keywords.

In [35]:
def create_soup(x):
    return ' '.join(x['department']) + ' ' + ' '.join(x['key_words'])
basket_new['soup'] = basket_new.apply(create_soup, axis=1)

In [36]:
basket_new['soup'][0]

'dry_goods health_beauty cleaning_products deli meat fruits_vegetables paper_products dairy all- purpose breakfast lunch cooking baking cleaning'

**Use cosine similarity to compute a numeric measure of how similar two receipts are.**

In [37]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(basket_new['soup'])

In [38]:
count_matrix.data

array([1, 1, 1, ..., 1, 1, 1])

In [39]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [40]:
# Reset index of our main DataFrame and construct reverse mapping as before
basket_new = basket_new.reset_index()
indices = pd.Series(basket_new.index, index=basket_new['itemDescription'])

In [41]:
cosine_sim

array([[1.        , 0.88949918, 0.81537425, ..., 0.88949918, 0.81537425,
        0.65465367],
       [0.88949918, 1.        , 0.84615385, ..., 0.92307692, 0.76923077,
        0.56613852],
       [0.81537425, 0.84615385, 1.        , ..., 0.84615385, 0.84615385,
        0.45291081],
       ...,
       [0.88949918, 0.92307692, 0.84615385, ..., 1.        , 0.76923077,
        0.56613852],
       [0.81537425, 0.76923077, 0.84615385, ..., 0.76923077, 1.        ,
        0.45291081],
       [0.65465367, 0.56613852, 0.45291081, ..., 0.56613852, 0.45291081,
        1.        ]])
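
For intuition, the cosine similarity of two token-count vectors can be computed by hand. The sketch below assumes the input is already tokenized (CountVectorizer applies its own tokenization and stop-word rules, which are not reproduced here); `cosine_from_tokens` and the sample tokens are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_from_tokens(tokens_a, tokens_b):
    """Cosine similarity between the token-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Counter returns 0 for missing tokens, so only shared tokens contribute
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = "dairy meat lunch cooking".split()
b = "dairy meat lunch baking".split()
print(round(cosine_from_tokens(a, b), 4))  # 0.75
```

Because the soup holds only a handful of department and keyword tokens, many receipts end up with high pairwise similarity, as the matrix above shows.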

**Define a function that takes a receipt index as input and returns the difference between the 10 most frequently bought products in the 10 most similar receipts and the products already in this receipt.**

In [42]:
# Function that takes in shopping list index as input and outputs most often bought products in similar shopping lists

def get_recommendations(index, cosine_sim=cosine_sim):
    # Get the index of the shopping list that matches the index column
    idx = indices[index]

    # Get the pairwise similarity scores of all shopping lists with that shopping list
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the shopping lists based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar shopping lists
    sim_scores = sim_scores[1:11]

    # Get the shopping lists indices
    shopping_lists_indices = [i[0] for i in sim_scores]

    # Return the top 10 most often bought products in 10 most similar shopping lists
    result = basket_new['itemDescription_no_dupl'].iloc[shopping_lists_indices]
    result_list = []
    for i in result:
        result_list.extend(i)
    result_s = collections.Counter(result_list)
    res_most_common = [key for key, value in result_s.most_common(10)]
    res_most_common = set(res_most_common)
    print('your cheque: ', basket_new['itemDescription_no_dupl'][index])
    print('most commom cheques: ', res_most_common)
    
    # Return the difference between 10 most often bought products in similar shopping lists
    final_recommendation = res_most_common.difference(set(basket_new['itemDescription_no_dupl'][index]))
    
    return print('recommendation:', final_recommendation)
    

In [43]:
%%time

get_recommendations(1, cosine_sim)

your cheque:  ['dishwashing liquid/detergent', 'mixes', 'waffles', 'cheeses', 'individual meals', 'vegetables', 'shampoo', 'sandwich bags', 'hand soap', 'laundry detergent', 'milk']
most commom cheques:  {'dishwashing liquid/detergent', 'sandwich loaves', 'individual meals', 'cheeses', 'eggs', 'vegetables', 'dinner rolls', 'shampoo', 'paper towels', 'waffles'}
recommendation: {'sandwich loaves', 'paper towels', 'eggs', 'dinner rolls'}
CPU times: user 1.58 ms, sys: 22 µs, total: 1.6 ms
Wall time: 1.59 ms
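
The final step of the function (count products in the neighbouring receipts, keep the most common, drop what the customer already has) can be isolated as below. This is a sketch that returns a ranked list instead of printing a set; `recommend_from_neighbours` and the toy receipts are hypothetical.

```python
from collections import Counter


def recommend_from_neighbours(own_items, neighbour_receipts, top_n=10):
    """Most common products in similar receipts that are not in the own receipt."""
    counts = Counter(item for receipt in neighbour_receipts for item in receipt)
    own = set(own_items)
    # most_common keeps the ranking, so the result is ordered by frequency
    return [item for item, _ in counts.most_common(top_n) if item not in own]


own = ["milk", "waffles"]
neighbours = [
    ["milk", "eggs", "waffles"],
    ["eggs", "paper towels"],
    ["eggs", "milk", "paper towels"],
]
print(recommend_from_neighbours(own, neighbours))  # ['eggs', 'paper towels']
```

Returning a list keeps the frequency ranking, which is lost when the top products are converted to a set.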


---

#  Recommendations for a new customer based on the preferences of other customers

In [44]:
import math, random
import numpy as np
from collections import defaultdict, Counter
from numpy import dot

In [45]:
all_cheques = []
for i in basket_new['itemDescription_no_dupl']:
    i = [a for a in set(i)]
    all_cheques.append(i)

The simplest approach is to recommend products by popularity:

In [46]:
popular_products = Counter(product
                           for cheque in all_cheques
                           for product in cheque).most_common()

Having computed this, we can suggest to a customer the most popular products that are not already in their receipt:

In [47]:
def most_popular_new_products(cheque, max_results=5):
    suggestions = [(product, frequency)
                   for product, frequency in popular_products
                   if product not in cheque]
    return suggestions[:max_results]

For example, for customer 1, who bought:

In [48]:
all_cheques[1]

['dishwashing liquid/detergent',
 'mixes',
 'cheeses',
 'individual meals',
 'vegetables',
 'shampoo',
 'sandwich bags',
 'waffles',
 'laundry detergent',
 'hand soap',
 'milk']

the recommendation of five other products looks like this:

In [49]:
most_popular_new_products(all_cheques[1], max_results=5)

[('poultry', 431),
 ('bagels', 417),
 ('lunch meat', 414),
 ('ice cream', 412),
 ('soda', 411)]

# Collaborative filtering. Recommendations based on customer similarity.

In [50]:
# cosine similarity

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

In [51]:
# sorted list of distinct products
unique_products = sorted({product
                          for cheque in all_cheques
                          for product in cheque})

Next, build for each customer a vector of 0s and 1s over the products. To do this, walk through the list unique_products and assign 1 if the customer bought the product and 0 otherwise:

In [52]:
# build the product vector for one customer
def make_user_product_vector(cheque):
    """Given the list of products a customer bought, build a vector
    whose i-th element is 1 if unique_products[i] is in the list
    and 0 otherwise."""
    return [1 if product in cheque else 0
            for product in unique_products]

The customer-product matrix is then obtained by applying this function to every element of the list of product lists with map:

In [53]:
# customer-product matrix in (customer, products) form,
# where each customer's product list is converted to 0s and 1s

In [54]:
user_product_matrix = list(map(make_user_product_vector, all_cheques))

In [55]:
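# Note: this wraps the function object itself, as the output shows;
# np.array(user_product_matrix) was probably intended.
np.array(make_user_product_vector )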

array(<function make_user_product_vector at 0x7f8a201bb820>, dtype=object)

In [56]:
# matrix of similarities between customers

In [57]:
user_similarities = [[cosine_similarity(product_vector_i, product_vector_j)
                      for product_vector_j in user_product_matrix]
                     for product_vector_i in user_product_matrix]

In [58]:
# customers most similar to customer user_id

In [59]:
def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)                      # find other
             for other_user_id, similarity in                 # users with
                enumerate(user_similarities[user_id])         # nonzero
             if user_id != other_user_id and similarity > 0]  # similarity

    return sorted(pairs,                                      # sort them
                  key=lambda pair: pair[1],                   # most similar
                  reverse=True)                               # first

In [60]:
# customer-based recommendations

In [61]:
def user_based_suggestions(user_id, include_current_products=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for product in all_cheques[other_user_id]:
            suggestions[product] += similarity

    # convert them to a sorted list
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],
                         reverse=True)

    # and (maybe) exclude products already bought
    if include_current_products:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in all_cheques[user_id]]

In [62]:
user_based_suggestions(1)[:5]

[('poultry', 125.19797960634497),
 ('bagels', 124.22008227556728),
 ('lunch meat', 121.84506224844571),
 ('cereals', 120.94284831075132),
 ('flour', 118.01524044581322)]

# Collaborative filtering. Recommendations based on product similarity.

An alternative approach is to compute similarity directly between the purchased products.

In [63]:
# item-based collaborative filtering

product_user_matrix = [[user_product_vector[j]
                         for user_product_vector in user_product_matrix]
                        for j, _ in enumerate(unique_products)]

In [64]:
# product similarities

In [65]:
product_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in product_user_matrix]
                         for user_vector_i in product_user_matrix]

In [66]:
# products most similar to the given product

In [67]:
def most_similar_products_to(product_id):
    similarities = product_similarities[product_id]
    pairs = [(unique_products[other_products_id], similarity)
             for other_products_id, similarity in enumerate(similarities)
             if product_id != other_products_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[1],
                  reverse=True)

In [68]:
most_similar_products_to(0)[:5]

[('vegetables', 0.4019646258405418),
 ('fruits', 0.30632911392405066),
 ('waffles', 0.2903991351274259),
 ('laundry detergent', 0.28318606701032967),
 ('soap', 0.279951561959288)]
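
For 0/1 vectors the cosine formula reduces to counting: `dot(v, w)` is the number of receipts containing both products and `dot(v, v)` is the number containing the first one. The sketch below checks this against the 'fruits' score above. The counts 121 and 395 are back-calculated from the supports reported earlier (0.08072 × 1499 ≈ 121 and 0.263509 × 1499 ≈ 395), so treat them as an assumption; the function names are hypothetical.

```python
import math


def binary_cosine(n_both, n_a, n_b):
    """Cosine similarity of two 0/1 vectors from co-occurrence counts."""
    return n_both / math.sqrt(n_a * n_b)


def cosine_from_sets(receipts_with_a, receipts_with_b):
    """Same measure, starting from the sets of receipt ids containing each product."""
    a, b = set(receipts_with_a), set(receipts_with_b)
    return binary_cosine(len(a & b), len(a), len(b))


# 'all- purpose' vs 'fruits': 121 shared receipts, 395 receipts each (assumed counts)
print(binary_cosine(121, 395, 395))
print(cosine_from_sets({1, 2, 3}, {2, 3, 4}))
```

When both products occur in the same number of receipts, this similarity equals the confidence of the corresponding association rule, which is why 0.3063 appears both here and in the rules table.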

In [69]:
# product-based recommendations

In [70]:
def product_based_suggestions(user_id, include_current_products=False):
    # accumulate the similarity weights of candidate products
    suggestions = defaultdict(float)
    # product vector of customer user_id
    user_product_vector = user_product_matrix[user_id]
    for product_id, is_interested in enumerate(user_product_vector):
        if is_interested == 1:
            # products similar to this purchased product
            similar_products = most_similar_products_to(product_id)
            for product, similarity in similar_products:
                suggestions[product] += similarity
    # sort by weight
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],
                         reverse=True)

    if include_current_products:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in all_cheques[user_id]]

In [71]:
product_based_suggestions(1)[:5]

[('bagels', 3.2005396457094326),
 ('poultry', 3.1553868130056553),
 ('flour', 3.14933674333341),
 ('cereals', 3.1255225872574686),
 ('lunch meat', 3.116662043304796)]

---

# Surprise

In [1]:
import pandas as pd
import numpy as np
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
from collections import Counter
reader = Reader()

Load the data.

In [2]:
basket = pd.read_csv("dataset.csv", skipinitialspace=True, header=None)
display(basket.head())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pork,sandwich bags,lunch meat,all- purpose,flour,soda,butter,vegetables,beef,aluminum foil,all- purpose,dinner rolls,shampoo,all- purpose
1,shampoo,hand soap,waffles,vegetables,cheeses,mixes,milk,sandwich bags,laundry detergent,dishwashing liquid/detergent,waffles,individual meals,hand soap,vegetables
2,pork,soap,ice cream,toilet paper,dinner rolls,hand soap,spaghetti sauce,milk,ketchup,sandwich loaves,poultry,toilet paper,ice cream,ketchup
3,juice,lunch meat,soda,toilet paper,all- purpose,,,,,,,,,
4,pasta,tortillas,mixes,hand soap,toilet paper,vegetables,vegetables,paper towels,vegetables,flour,vegetables,pork,poultry,eggs


Fill in the missing values (with a single-space placeholder).

In [3]:
basket = basket.fillna(' ')

Join all purchases of each receipt into a single cell.

In [4]:
basket1 = basket[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]].T.agg(', '.join)

In [5]:
basket_new = basket1.to_frame()

In [6]:
basket_new = basket_new.rename(columns={0: 'itemDescription'})

In [7]:
basket_new.itemDescription = basket_new.itemDescription.transform(lambda x: x.split(", "))

In [8]:
basket_new.itemDescription

0       [pork, sandwich bags, lunch meat, all- purpose...
1       [shampoo, hand soap, waffles, vegetables, chee...
2       [pork, soap, ice cream, toilet paper, dinner r...
3       [juice, lunch meat, soda, toilet paper, all- p...
4       [pasta, tortillas, mixes, hand soap, toilet pa...
                              ...                        
1494    [beef, sandwich bags, hand soap, paper towels,...
1495    [dinner rolls, lunch meat, spaghetti sauce, pa...
1496    [lunch meat, eggs, poultry, vegetables, tortil...
1497    [ketchup, milk, poultry, cheeses, soap, toilet...
1498    [laundry detergent, vegetables, shampoo, veget...
Name: itemDescription, Length: 1499, dtype: object

In [9]:
basket_new['id'] = list(range(len(basket_new)))

In [10]:
basket_new

Unnamed: 0,itemDescription,id
0,"[pork, sandwich bags, lunch meat, all- purpose...",0
1,"[shampoo, hand soap, waffles, vegetables, chee...",1
2,"[pork, soap, ice cream, toilet paper, dinner r...",2
3,"[juice, lunch meat, soda, toilet paper, all- p...",3
4,"[pasta, tortillas, mixes, hand soap, toilet pa...",4
...,...,...
1494,"[beef, sandwich bags, hand soap, paper towels,...",1494
1495,"[dinner rolls, lunch meat, spaghetti sauce, pa...",1495
1496,"[lunch meat, eggs, poultry, vegetables, tortil...",1496
1497,"[ketchup, milk, poultry, cheeses, soap, toilet...",1497


Build a list of dictionaries for a new dataframe in which the number of times an item was bought serves as the customer's rating of that item.

In [84]:
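lst = []
for ind, i in enumerate(basket_new['itemDescription']):
    # count how many times each item occurs in the receipt
    i = Counter(i)
    # elements() repeats each item as many times as it was bought,
    # so the resulting duplicate rows are dropped in a later cell
    for el in i.elements():
        if el != ' ':  # skip the placeholder used for missing values
            dct = {}
            dct['id'] = ind
            dct['item'] = el
            dct['rating'] = i[el]
            lst.append(dct)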

In [43]:
basket_from_lst = pd.DataFrame(lst)

In [65]:
basket_from_lst = basket_from_lst.drop_duplicates()

In [68]:
basket_from_lst.head(12)

Unnamed: 0,id,item,rating
0,0,pork,1
1,0,sandwich bags,1
2,0,lunch meat,1
3,0,all- purpose,3
6,0,flour,1
7,0,soda,1
8,0,butter,1
9,0,vegetables,1
10,0,beef,1
11,0,aluminum foil,1
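
The gaps in the index above (3 is followed by 6) come from duplicate rows that were dropped afterwards. A minimal sketch of an alternative, assuming hypothetical toy receipts, iterates over `Counter.items()` so that each distinct item appears exactly once and no deduplication step is needed. The names `receipts` and `rows` are illustrative only.

```python
from collections import Counter

# Hypothetical toy receipts; ' ' stands for a filled-in missing value.
receipts = [
    ["pork", "flour", "all- purpose", "all- purpose", " "],
    ["soap", "soap", "milk"],
]

rows = []
for receipt_id, receipt in enumerate(receipts):
    # Counter.items() yields each distinct item once, together with its count.
    for item, count in Counter(receipt).items():
        if item != " ":
            rows.append({"id": receipt_id, "item": item, "rating": count})

print(rows)
```

A list built this way can be passed to `pd.DataFrame` just like `lst`, and the resulting index has no gaps.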


In [129]:
basket_from_lst['rating'].mean()

1.1769900096680632

In [69]:
data = Dataset.load_from_df(basket_from_lst[['id', 'item', 'rating']], reader)

In [70]:
svd = SVD()
result = cross_validate(svd, data, measures=['RMSE', 'MAE','FCP'],cv=5)

In [71]:
result

{'test_rmse': array([0.45658001, 0.44126099, 0.44953024, 0.44843765, 0.46309999]),
 'test_mae': array([0.29385654, 0.28785092, 0.29111226, 0.29191363, 0.29587958]),
 'test_fcp': array([0.51181812, 0.49858624, 0.51108557, 0.50223576, 0.50029944]),
 'fit_time': (0.07422995567321777,
  0.05759286880493164,
  0.057917118072509766,
  0.0574803352355957,
  0.05740094184875488),
 'test_time': (0.01223897933959961,
  0.011510133743286133,
  0.011508941650390625,
  0.011366844177246094,
  0.01137089729309082)}

The RMSE is fairly high, considering that the mean rating is only about 1.18.

In [72]:
result['test_rmse'].mean()

0.4517817762752697

In [73]:
result['test_mae'].mean()

0.2921225873218526

In [74]:
result['test_fcp'].mean()

0.5048050250496938
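
For reference, the two error metrics can be computed by hand. The sketch below uses made-up ratings and estimates (not the notebook's data) to show that RMSE squares the errors before averaging, so a few large misses raise it above MAE, which matches the pattern in the results above (0.45 versus 0.29).

```python
import math

# Hypothetical true ratings and model estimates, for illustration only.
true_ratings = [1, 1, 2, 1, 3]
estimates = [1.2, 1.1, 1.4, 1.3, 1.5]

errors = [est - true for est, true in zip(estimates, true_ratings)]

# Root mean squared error: large errors are penalized quadratically.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
# Mean absolute error: every error counts in proportion to its size.
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```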

Train the model on the full training set.

In [75]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9c301d1d90>

Look at the actual ratings of customer 1.

In [76]:
basket_from_lst[basket_from_lst['id'] == 1]

Unnamed: 0,id,item,rating
14,1,shampoo,1
15,1,hand soap,2
17,1,waffles,2
19,1,vegetables,2
21,1,cheeses,1
22,1,mixes,1
23,1,milk,1
24,1,sandwich bags,1
25,1,laundry detergent,1
26,1,dishwashing liquid/detergent,1


In [123]:
# result = svd.predict(uid=0, iid='vegetables')  # uid = user id; iid = item; r_ui = true rating

In [124]:
# result

Prediction(uid=0, iid='vegetables', r_ui=None, est=1.131529340274289, details={'was_impossible': False})

Generate recommendations for customer 1.

In [127]:
# estimate a rating for customer 1 for every known item
fg = []
for i in basket_from_lst['item'].unique():
    result = svd.predict(1, i, None)
    fg.append([i, result.est])

In [128]:
# the five highest estimated ratings
best_5 = sorted((est for _, est in fg), reverse=True)[:5]
# keep the items with those estimates, in their original order
recommendation = [pair for pair in fg if pair[1] in best_5]
print(recommendation)

[['lunch meat', 1.3048651455326725], ['vegetables', 1.6520979284688864], ['waffles', 1.2896178205511286], ['soap', 1.2924538193829127], ['toilet paper', 1.2854850281232133]]
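
Two of the five items above (vegetables and waffles) were already bought by customer 1. A minimal sketch of filtering them out and ranking the rest is shown below. The estimates are copied from the output above (rounded), the purchased items are those listed in the ratings table for customer 1, and the variable names are illustrative.

```python
import heapq

# Estimates copied from the output above, rounded to four decimals.
estimates = {
    "lunch meat": 1.3049,
    "vegetables": 1.6521,
    "waffles": 1.2896,
    "soap": 1.2925,
    "toilet paper": 1.2855,
}
# Items listed for customer 1 in the ratings table above.
already_bought = {
    "shampoo", "hand soap", "waffles", "vegetables", "cheeses", "mixes",
    "milk", "sandwich bags", "laundry detergent", "dishwashing liquid/detergent",
}

# Drop items the customer already has, then take the best remaining estimates.
candidates = {item: est for item, est in estimates.items() if item not in already_bought}
top = heapq.nlargest(3, candidates.items(), key=lambda pair: pair[1])
print(top)
```

`heapq.nlargest` returns the pairs already sorted from the highest estimate down, so no separate sort is needed.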


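**A clear drawback of the model is that it never predicts a rating of 0.** With the default `Reader()` rating scale of 1 to 5, Surprise clips its estimates to that range.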

---

# lightfm

In [199]:
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=Warning)
from lightfm.datasets import fetch_movielens
from lightfm import LightFM
from recsys import *
from sklearn import preprocessing
from scipy.sparse import coo_matrix
from scipy import sparse

In [193]:
def create_interaction_matrix(df,user_col, item_col, rating_col, norm= False, threshold = None):
    '''
    Function to create an interaction matrix dataframe from transactional type interactions
    Required Input -
        - df = Pandas DataFrame containing user-item interactions
        - user_col = column name containing user's identifier
        - item_col = column name containing item's identifier
        - rating col = column name containing user feedback on interaction with a given item
        - norm (optional) = True if a normalization of ratings is needed
        - threshold (required if norm = True) = value above which the rating is favorable
    Expected output - 
        - Pandas dataframe with user-item interactions ready to be fed in a recommendation algorithm
    '''
    interactions = df.groupby([user_col, item_col])[rating_col] \
            .sum().unstack().reset_index(). \
            fillna(0).set_index(user_col)
    if norm:
        interactions = interactions.applymap(lambda x: 1 if x > threshold else 0)
    return interactions
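
To make the reshaping concrete, here is a plain-Python sketch of the same pivot on hypothetical toy rows. It does not use pandas and is not a replacement for the function above. It only shows the intended result: one row per user, one column per item, summed ratings, and zeros for pairs that never occurred.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) rows, for illustration only.
rows = [(0, "pork", 1), (0, "flour", 3), (1, "soap", 2), (1, "pork", 1)]

# Sum repeated interactions per (user, item) pair, like groupby(...).sum().
totals = defaultdict(float)
for user, item, rating in rows:
    totals[(user, item)] += rating

items = sorted({item for _, item, _ in rows})
users = sorted({user for user, _, _ in rows})

# Missing user-item pairs become 0, like fillna(0) after unstack().
matrix = {user: [totals.get((user, item), 0.0) for item in items] for user in users}

print(items)
print(matrix)
```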

In [264]:
basket_from_lst

Unnamed: 0,id,item,rating
0,0,pork,1
1,0,sandwich bags,1
2,0,lunch meat,1
3,0,all- purpose,3
6,0,flour,1
...,...,...,...
18255,1497,all- purpose,1
18256,1497,sandwich bags,1
18257,1498,laundry detergent,1
18258,1498,vegetables,2


The product list must be converted to a NumPy array so that it can later be indexed with arrays of positions (NumPy fancy indexing).

In [292]:
all_pr = np.array(all_pr)

Encode the product names as integers.

In [293]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(all_pr)

items = [i for i in basket_from_lst.item]
items_le = le.transform(items)
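
As a rough illustration of what the encoding does, the sketch below reproduces the idea in plain Python on hypothetical product names: the sorted distinct names act as the class list, and each name is replaced by its position in that list. The names `all_products`, `classes` and `to_code` are illustrative only.

```python
# Hypothetical product names, for illustration only.
all_products = ["pork", "flour", "soap", "flour", "all- purpose"]

# Sorted distinct names play the role of LabelEncoder.classes_.
classes = sorted(set(all_products))
to_code = {name: code for code, name in enumerate(classes)}

# Encode every name, then decode again to check the mapping is reversible.
codes = [to_code[name] for name in all_products]
decoded = [classes[code] for code in codes]

print(classes)
print(codes)
```

Because the classes are sorted, `all- purpose` receives code 0, which agrees with the encoded table in the notebook.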

In [294]:
basket_from_lst['items_le'] = items_le

In [295]:
basket_from_lst_le = basket_from_lst.drop('item', axis=1 )

In [296]:
basket_from_lst_le

Unnamed: 0,id,rating,items_le
0,0,1,24
1,0,1,26
2,0,1,19
3,0,3,0
6,0,1,11
...,...,...,...
18255,1497,1,0
18256,1497,1,26
18257,1498,1,18
18258,1498,2,35


Reshape the dataframe so that it can then be converted to a sparse matrix.

In [297]:
interactions = create_interaction_matrix(df = basket_from_lst_le,
                                         user_col = 'id',
                                         item_col = 'items_le',
                                         rating_col = 'rating')
interactions.head()

id (rows) / items_le (columns),0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
0,3.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.0,0.0,0.0


Convert the dataframe to a sparse matrix.

In [298]:
from scipy.sparse import csr_matrix

# Note: this assignment rebinds the name csr_matrix from the SciPy class to the matrix itself.
csr_matrix = csr_matrix(interactions.astype(pd.SparseDtype("float64",0)).sparse.to_coo())

Split into train and test.

In [299]:
interactions_train = csr_matrix[:749]

In [300]:
interactions_train

<749x38 sparse matrix of type '<class 'numpy.float64'>'
	with 7783 stored elements in Compressed Sparse Row format>

In [301]:
interactions_test = csr_matrix[749:1498]

In [302]:
interactions_test

<749x38 sparse matrix of type '<class 'numpy.float64'>'
	with 7729 stored elements in Compressed Sparse Row format>

**Why must train and test have the same dimensions (from stackoverflow.com)?**

Often in ML, the rows are the samples and in this interpretation your single row-sample is a tuple or something similar (user, item, value). The number of rows in test and train surely has no limitation! But there is one in terms of dimensions. So only in terms of dimensions (which users and which items are observed) those datasets need a 1:1 correspondence. That's maybe a strong assumption, but you can imagine the problem predicting user 11, where there was no user 11 in training (there is no latent-vector which was build).
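
The point of the quote is that the split should be made over interactions, not over users, so that both matrices keep the full users x items shape. A minimal plain-Python sketch on hypothetical triples follows. For real sparse matrices LightFM provides `lightfm.cross_validation.random_train_test_split`, which is not used here so that the example stays self-contained.

```python
import random

# Hypothetical (user, item, rating) interactions, for illustration only.
interactions = [(0, 0, 1), (0, 2, 3), (1, 1, 2), (1, 2, 1), (2, 0, 1), (2, 1, 1)]
n_users, n_items = 3, 3

rng = random.Random(0)            # fixed seed so the split is reproducible
shuffled = interactions[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.67)   # roughly two thirds for training
train_rows, test_rows = shuffled[:cut], shuffled[cut:]

def to_matrix(rows):
    """Dense users x items matrix; unobserved pairs stay 0."""
    matrix = [[0] * n_items for _ in range(n_users)]
    for user, item, rating in rows:
        matrix[user][item] = rating
    return matrix

train, test = to_matrix(train_rows), to_matrix(test_rows)
print(len(train), len(train[0]), len(test), len(test[0]))
```

Both matrices have one row per user and one column per item, and every interaction lands in exactly one of them.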

Train the model on the training set.

In [303]:
model = LightFM(loss='warp')
model.fit(interactions_train, epochs=30, num_threads=8)

<lightfm.lightfm.LightFM at 0x7f9c0871c220>

Compute the precision_at_k and auc_score metrics.

In [304]:
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score

In [305]:
print("Train precision: %.2f" % precision_at_k(model, interactions_train, k=5).mean())
print("Test precision: %.2f" % precision_at_k(model, interactions_test, k=5).mean())

Train precision: 0.79
Test precision: 0.30


In [306]:
print("Train AUC: %.2f" % auc_score(model, interactions_train).mean())
print("Test AUC: %.2f" % auc_score(model, interactions_test).mean())

Train AUC: 0.86
Test AUC: 0.52


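The test metrics are rather weak: a test AUC of 0.52 is close to the 0.5 expected from a random ranking. One likely reason is that the test rows belong to customers 749 to 1497, who were not in the training matrix, while LightFM identifies users by row index.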
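
For intuition about the first metric, precision at k is the share of the k highest-scored items that the customer actually interacted with. The sketch below computes it by hand for one customer. The scores, the set of positives and the local function are hypothetical, and the function is a plain-Python illustration, not LightFM's implementation.

```python
def precision_at_k(scores, positives, k):
    """Fraction of the k highest-scored items that are known positives."""
    ranked = sorted(range(len(scores)), key=lambda item: scores[item], reverse=True)
    return sum(1 for item in ranked[:k] if item in positives) / k

# Hypothetical scores for six items and the items the customer really bought.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
positives = {0, 3, 4}

print(precision_at_k(scores, positives, k=3))
```

Here the three best-scored items are 0, 2 and 4, of which two are positives, so the value is 2/3.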

In [319]:
def sample_recommendation(model, data, user_ids):
    

    n_users, n_items = interactions_train.shape

    for user_id in user_ids:
        known_positives = all_pr[interactions_train.tocsr()[user_id].indices]
        
        scores = model.predict(user_id, np.arange(n_items))
        top_items = all_pr[np.argsort(-scores)]
        
        # NOTE: a set is unordered, so the score ranking of top_items is not preserved here.
        top_items_to_reccomend = set(top_items).difference(set(known_positives))
        reccom = [i for i in top_items_to_reccomend]

        
        print("User %s" % user_id)
        print("     Known positives:")
        
        for x in known_positives[:3]:
            print("        %s" % x)

        print("     Recommended:")
        
        for x in reccom[:5]:
            print("        %s" % x)

Generate predictions for customer 1.

In [320]:
sample_recommendation(model, interactions, [1]) 

User 1
     Known positives:
        cheeses
        dishwashing liquid/detergent
        hand soap
     Recommended:
        fruits
        pasta
        ice cream
        paper towels
        spaghetti sauce
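
Because `sample_recommendation` removes known positives through a set difference, the five printed items are not necessarily the five best-scored ones. A minimal sketch of an order-preserving filter, on hypothetical ranked items, looks like this:

```python
# Hypothetical items already ranked by descending score.
top_items = ["vegetables", "fruits", "pasta", "soap", "milk"]
known_positives = {"vegetables", "soap"}

# A list comprehension keeps the ranking; a set difference would not.
recommended = [item for item in top_items if item not in known_positives]
print(recommended[:2])
```

The same comprehension could be applied to `top_items` and `known_positives` inside the function, which would change which items are printed.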
