
In [1]:
import pandas as pd

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
df = pd.read_excel('sum_for_male_40_less.xlsx')
df.dropna(axis=0, subset=['customer'], inplace=True)
df['customer'] = df['customer'].astype('str')

In [None]:
# Group purchase quantities by customer and product (they are summed in the next step).
market_basket = df.groupby(['customer', 'product'])['count']

I want to one-hot encode the data so that each row holds one transaction (here, one customer) and each column one product, ready for the mlxtend analysis.

In [None]:
market_basket = market_basket.sum().unstack().reset_index().fillna(0).set_index('customer')

In [None]:
market_basket
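
To make the reshaping step concrete, here is a minimal sketch on a made-up purchase log. The data and the names `toy` and `toy_basket` are hypothetical; only the column names `customer`, `product` and `count` come from the cells above.

```python
import pandas as pd

# Hypothetical toy purchase log; the column names mirror the ones used above.
toy = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c3"],
    "product":  ["shampoo", "soap", "shampoo", "shampoo", "soap"],
    "count":    [2, 1, 1, 3, 5],
})

# Sum quantities per (customer, product), then pivot products into columns.
# Pairs that never occur become NaN after unstack(), so fill them with 0.
toy_basket = (
    toy.groupby(["customer", "product"])["count"]
       .sum()
       .unstack()
       .fillna(0)
)
print(toy_basket)
```

Each row is one customer and each cell holds the total quantity of that product, which is the same shape `market_basket` has at this point.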

The table now has one row per customer and one column per product, with the total quantity in each cell. A zero means that the customer did not buy that product. Before continuing, I want to convert every number to either a `1` or a `0` (zero and negative numbers become 0, positive numbers become 1). I can do this encoding step with the following function:

In [None]:
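def encode_data(datapoint):
    # Any positive quantity counts as "bought" (1); zero or negative counts as 0.
    # A single comparison avoids returning None for values between 0 and 1.
    return 1 if datapoint > 0 else 0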

And now I apply the final encoding step:

In [None]:
market_basket = market_basket.applymap(encode_data)
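
As an aside, the same encoding can be sketched without a per-cell function by comparing the whole table to zero. The table and the names `toy_basket` and `toy_encoded` below are made up for illustration. The result is boolean rather than 0/1, which, as far as I know, `apriori` accepts directly.

```python
import pandas as pd

# Hypothetical quantity table shaped like market_basket (customers x products).
toy_basket = pd.DataFrame(
    {"shampoo": [2, 4, 0], "soap": [1, 0, -1]},
    index=["c1", "c2", "c3"],
)

# True where the customer bought the product at least once, otherwise False.
toy_encoded = toy_basket > 0
print(toy_encoded)
```

The comparison is applied to every cell at once, so it tends to be faster than `applymap` on large tables.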

There is one thing I need to think about first: the `apriori` function requires a minimum level of 'support'. Support is the fraction of transactions in the dataset that contain a given itemset. If I set support = 50%, I'll only get itemsets that appear in at least 50% of transactions. Setting the support level too high could lead to very few (or no) results, and setting it too low could require an enormous amount of memory to process the data.
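
To illustrate what support measures, here is a small pure-Python sketch on invented baskets. The `transactions` list and the `support` helper are hypothetical and are not part of mlxtend.

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"shampoo", "conditioner"},
    {"shampoo"},
    {"soap", "shampoo"},
    {"soap"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

print(support({"shampoo"}, transactions))                 # 3 of 4 baskets
print(support({"shampoo", "conditioner"}, transactions))  # 1 of 4 baskets
```

With a minimum support of 0.5, only `{"shampoo"}` and `{"soap"}` would survive in this toy data. The value of 0.004 used in the next cell keeps itemsets that appear in at least 0.4% of the rows.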

In [None]:
itemsets = apriori(market_basket, min_support=0.004, use_colnames=True)
itemsets

The final step is to build the association rules using the mlxtend `association_rules` function. I can choose the metric I am most interested in (for example `lift` or `confidence`) and set a minimum value for that metric (called `min_threshold`). The threshold applies to whichever metric is selected: with `metric="confidence"` and `min_threshold=1`, only rules with 100% confidence are returned, while with `metric="lift"` and `min_threshold=0.6`, as below, every rule with a lift of at least 0.6 is returned.
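
The two metrics mentioned here can be sketched by hand on invented baskets. The data and the helpers `support`, `confidence` and `lift` below are hypothetical; they follow the standard definitions and do not use mlxtend.

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"shampoo", "conditioner"},
    {"shampoo", "conditioner"},
    {"shampoo"},
    {"soap"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"shampoo"}, {"conditioner"}
print(confidence(a, c), confidence(c, a))
print(lift(a, c), lift(c, a))
```

Confidence depends on the direction of the rule, while lift is the same in both directions. A lift above 1 suggests the items are bought together more often than chance, and a lift below 1 suggests the opposite.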

In [None]:
rules = association_rules(itemsets, metric="lift", min_threshold=0.6)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Shampoos),(Balms and conditioners),0.1476,0.01292,0.004792,0.032464,2.51261,0.002885,1.020199
1,(Balms and conditioners),(Shampoos),0.01292,0.1476,0.004792,0.370861,2.51261,0.002885,1.354868
2,(Cleaning products),(Liquid laundry detergents),0.234876,0.026696,0.004278,0.018215,0.682301,-0.001992,0.991361
3,(Liquid laundry detergents),(Cleaning products),0.026696,0.234876,0.004278,0.160256,0.682301,-0.001992,0.91114
4,(Sanitary pads),(Panty liners),0.155557,0.044152,0.004877,0.031353,0.710124,-0.001991,0.986787
5,(Panty liners),(Sanitary pads),0.044152,0.155557,0.004877,0.110465,0.710124,-0.001991,0.949308
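
As a sanity check on how the columns relate, the derived metrics of row 0 can be recomputed from its three support values. This is a sketch using the standard definitions of confidence, lift, leverage and conviction. The inputs are copied from the rounded table, so small differences in the last digits are expected.

```python
# Support values copied from row 0 of the table above (already rounded there).
antecedent_support = 0.1476
consequent_support = 0.01292
joint_support = 0.004792

# Standard definitions of the derived rule metrics.
confidence = joint_support / antecedent_support
lift = confidence / consequent_support
leverage = joint_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.6f} lift={lift:.5f}")
print(f"leverage={leverage:.6f} conviction={conviction:.6f}")
```

Rows 2 to 5 have a lift below 1 and a negative leverage, meaning those pairs were bought together less often than independent purchases would predict. They appear in the output only because the lift threshold was set to 0.6.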


In [None]:
rules.to_excel("cross_sales_male_40Less.xlsx") 