<a href="https://colab.research.google.com/github/Lukas-Swc/machine-learning-bootcamp/blob/main/unsupervised/03_association_rules/02_apriori.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course was built against version `0.22.1`.

### Table of contents:
1. [Import libraries](#0)
2. [Loading data](#1)
3. [Data preparation](#2)
4. [Encoding transactions](#3)
5. [Apriori algorithm](#4)




### <a name='0'></a> Import libraries

In [1]:
import pandas as pd

pd.set_option('display.float_format', lambda x: f'{x:.2f}')

### <a name='1'></a> Loading data

In [2]:
!wget https://storage.googleapis.com/esmartdata-courses-files/ml-course/products.csv
!wget https://storage.googleapis.com/esmartdata-courses-files/ml-course/orders.csv

--2025-05-13 16:47:13--  https://storage.googleapis.com/esmartdata-courses-files/ml-course/products.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.180.207, 142.251.16.207, 142.251.167.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.180.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2166953 (2.1M) [application/octet-stream]
Saving to: ‘products.csv’


2025-05-13 16:47:15 (3.23 MB/s) - ‘products.csv’ saved [2166953/2166953]

--2025-05-13 16:47:15--  https://storage.googleapis.com/esmartdata-courses-files/ml-course/orders.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.180.207, 142.251.16.207, 142.251.167.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.180.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24680147 (24M) [application/octet-stream]
Saving to: ‘orders.csv’


2025-05-13 16:47:17 (19.6 MB/s) - ‘orders.csv’ saved

In [5]:
products = pd.read_csv('products.csv', usecols=['product_id', 'product_name'])
products.head()

| | product_id | product_name |
|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies |
| 1 | 2 | All-Seasons Salt |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... |
| 4 | 5 | Green Chile Anytime Sauce |


In [7]:
orders = pd.read_csv('orders.csv', usecols=['order_id', 'product_id'])
orders.head()

| | order_id | product_id |
|---|---|---|
| 0 | 1 | 49302 |
| 1 | 1 | 11109 |
| 2 | 1 | 10246 |
| 3 | 1 | 49683 |
| 4 | 1 | 43633 |


### <a name='2'></a> Data preparation

In [8]:
data = pd.merge(orders, products, how='inner', on='product_id', sort=True)
data = data.sort_values(by='order_id')
data.head()

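| | order_id | product_id | product_name |
|---|---|---|---|
| 325660 | 1 | 13176 | Bag of Organic Bananas |
| 1382192 | 1 | 49683 | Cucumber Kirby |
| 1375139 | 1 | 49302 | Bulgarian Yogurt |
| 1302365 | 1 | 47209 | Organic Hass Avocado |
| 588447 | 1 | 22035 | Organic Whole String Cheese |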
588447,1,22035,Organic Whole String Cheese


In [9]:
data.describe()

| | order_id | product_id |
|---|---|---|
| count | 1384617.0 | 1384617.0 |
| mean | 1706297.62 | 25556.24 |
| std | 989732.65 | 14121.27 |
| min | 1.0 | 1.0 |
| 25% | 843370.0 | 13380.0 |
| 50% | 1701880.0 | 25298.0 |
| 75% | 2568023.0 | 37940.0 |
| max | 3421070.0 | 49688.0 |


In [10]:
data['product_name'].value_counts()

| product_name | count |
|---|---|
| Banana | 18726 |
| Bag of Organic Bananas | 15480 |
| Organic Strawberries | 10894 |
| Organic Baby Spinach | 9784 |
| Large Lemon | 8135 |
| ... | ... |
| Radiant Infinity Overnight Light Clean Scent Pads With Wings | 1 |
| Classic Original Lip Balm SPF 12 | 1 |
| Sweet & Thick Original BBQ Sauce | 1 |
| Birthday Candles Neon Crazy Curl | 1 |


In [12]:
data['order_id'].nunique()

131209

In [14]:
transactions = data.groupby(by='order_id')['product_name'].apply(lambda name: ','.join(name))
transactions

| order_id | product_name |
|---|---|
| 1 | Bag of Organic Bananas,Cucumber Kirby,Bulgaria... |
| 36 | Grated Pecorino Romano Cheese,Organic Garnet S... |
| 38 | Green Peas,Bunched Cilantro,Flat Parsley, Bunc... |
| 96 | Organic Pomegranate Kernels,Organic Blueberrie... |
| 98 | Organic Ketchup,Queso Fresco,Aluminum Foil,Org... |
| ... | ... |
| 3421049 | Organic Baby Broccoli,Organic Whole Grain Whea... |
| 3421056 | Tartar Sauce,Homestyle Classics Meatloaf,Spark... |
| 3421058 | Club Soda Lower Sodium,Classic Britannia Crisp... |
| 3421063 | Natural Artesian Water,Twice Baked Potatoes,No... |


In [15]:
transactions = transactions.str.split(',')
transactions

| order_id | product_name |
|---|---|
| 1 | [Bag of Organic Bananas, Cucumber Kirby, Bulga... |
| 36 | [Grated Pecorino Romano Cheese, Organic Garnet... |
| 38 | [Green Peas, Bunched Cilantro, Flat Parsley, ... |
| 96 | [Organic Pomegranate Kernels, Organic Blueberr... |
| 98 | [Organic Ketchup, Queso Fresco, Aluminum Foil,... |
| ... | ... |
| 3421049 | [Organic Baby Broccoli, Organic Whole Grain Wh... |
| 3421056 | [Tartar Sauce, Homestyle Classics Meatloaf, Sp... |
| 3421058 | [Club Soda Lower Sodium, Classic Britannia Cri... |
| 3421063 | [Natural Artesian Water, Twice Baked Potatoes,... |
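
One caveat with joining names on `,` and then splitting on `,`: any product name that itself contains a comma is cut into fragments. Items with a leading space, such as ` Bag` in the rules further down, look like such fragments. The sketch below contrasts the join-then-split approach with collecting names directly into lists; the frame `demo` and the comma-containing name in it are hypothetical.

```python
import pandas as pd

# Hypothetical order lines; the first product name contains a comma on purpose.
demo = pd.DataFrame({
    'order_id': [1, 1, 2],
    'product_name': ['Clementines, Bag', 'Banana', 'Limes'],
})

# Join-then-split, as in the cells above: the comma inside the name acts as a separator.
joined = demo.groupby('order_id')['product_name'].apply(','.join).str.split(',')

# Collecting names straight into lists keeps every name whole.
as_lists = demo.groupby('order_id')['product_name'].apply(list)

print(joined.loc[1])    # three items, two of them fragments of one name
print(as_lists.loc[1])  # two items, names intact
```

Switching the notebook to the list-based variant would change the encoded column count and the rules shown later, so the recorded outputs here still reflect the join-then-split version.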


### <a name='3'></a> Encoding transactions

In [16]:
from mlxtend.preprocessing import TransactionEncoder

encoder = TransactionEncoder()
# fit_transform learns the item vocabulary and encodes in one step, so a separate fit call is not needed
transactions_encoded = encoder.fit_transform(transactions, sparse=True)
transactions_encoded

<Compressed Sparse Row sparse matrix of dtype 'bool'
	with 1442410 stored elements and shape (131209, 40434)>
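
Conceptually, the encoder turns each transaction into a row of booleans with one column per distinct item. The following plain-Python sketch illustrates that idea on made-up transactions; it is not mlxtend's implementation, and the names `transactions_demo`, `columns` and `encoded` are illustrative only.

```python
# Hypothetical transactions, each a list of product names.
transactions_demo = [
    ['Banana', 'Limes'],
    ['Banana'],
    ['Large Lemon', 'Limes'],
]

# Step 1 ("fit"): collect the sorted vocabulary of distinct items.
columns = sorted({item for transaction in transactions_demo for item in transaction})

# Step 2 ("transform"): one boolean per vocabulary item and transaction.
encoded = [
    [item in set(transaction) for item in columns]
    for transaction in transactions_demo
]

print(columns)
for row in encoded:
    print(row)
```

Most entries in such a matrix are `False` when the vocabulary is large, which is why the notebook asks for a sparse result.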

In [17]:
transactions_encoded_df = pd.DataFrame(transactions_encoded.toarray(), columns=encoder.columns_)
transactions_encoded_df

Unnamed: 0,Unnamed: 1,Apricot & Banana Stage 2 Baby Food,Broad Spectrum SPF 30,Instant,Livermore Valley,Low Sodium Marinara,Premium,Vetiver scent,Whole,#2,...,with Xylitol Cinnamon 18 Sticks Sugar Free Gum,with Xylitol Island Berry Lime 18 Sticks Sugar Free Gum,with Xylitol Minty Sweet Twist 18 Sticks Sugar Free Gum,with Xylitol Original Flavor 18 Sticks Sugar Free Gum,with Xylitol Unwrapped Original Flavor 50 Sticks Sugar Free Gum,with Xylitol Unwrapped Spearmint 50 Sticks Sugar Free Gum,with Xylitol Watermelon Twist 18 Sticks Sugar Free Gum,with a Splash of Mango Coconut Water,with a Splash of Pineapple Coconut Water,Lightly Seasoned with Rosemary and Roasted Garlic Family Size Herb Chicken Tortellini
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131204,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
131205,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
131206,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
131207,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
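
Calling `.toarray()` materializes every cell of the matrix. A rough back-of-the-envelope estimate of what that costs, using the shape and stored-element count reported above and assuming one byte per boolean cell (the usual NumPy `bool` size), can be sketched as:

```python
# Figures taken from the sparse matrix summary printed above.
n_rows, n_cols = 131_209, 40_434
stored_elements = 1_442_410

cells = n_rows * n_cols
dense_bytes = cells                 # assumption: 1 byte per boolean cell
density = stored_elements / cells   # share of cells that are actually stored

print(f'{cells:,} cells, roughly {dense_bytes / 1e9:.1f} GB as a dense boolean array')
print(f'density: {density:.4%}')
```

If memory is tight, pandas offers `pd.DataFrame.sparse.from_spmatrix` for wrapping a SciPy sparse matrix without densifying it; whether the installed mlxtend version accepts such a frame in `apriori` is worth checking against its documentation.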


### <a name='4'></a> Apriori algorithm

In [19]:
from mlxtend.frequent_patterns import apriori, association_rules

supports = apriori(transactions_encoded_df, min_support=0.01, use_colnames=True)
supports = supports.sort_values(by='support', ascending=False)
supports.head(10)

| | support | itemsets |
|---|---|---|
| 8 | 0.14 | (Banana) |
| 7 | 0.12 | (Bag of Organic Bananas) |
| 76 | 0.08 | (Organic Strawberries) |
| 41 | 0.07 | (Organic Baby Spinach) |
| 31 | 0.06 | (Large Lemon) |
| 37 | 0.06 | (Organic Avocado) |
| 61 | 0.06 | (Organic Hass Avocado) |
| 100 | 0.05 | (Strawberries) |
| 33 | 0.05 | (Limes) |
| 69 | 0.04 | (Organic Raspberries) |
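
The `support` column is the fraction of transactions that contain every item of the itemset, so `min_support=0.01` keeps only itemsets present in at least 1% of the 131209 orders. A minimal sketch of that definition on made-up transactions follows; the helper `support` and the data are illustrative and are not part of mlxtend.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for transaction in transactions if wanted <= set(transaction))
    return hits / len(transactions)


# Hypothetical transactions for illustration.
transactions_demo = [
    ['Banana', 'Limes'],
    ['Banana'],
    ['Large Lemon', 'Limes'],
    ['Banana', 'Large Lemon', 'Limes'],
]

print(support(['Banana'], transactions_demo))
print(support(['Limes', 'Large Lemon'], transactions_demo))
```

Adding an item to an itemset can never raise its support, which is the property Apriori relies on to discard candidates early.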


In [20]:
rules = association_rules(supports, metric='confidence', min_threshold=0)
# Select the key columns by name rather than by position
rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
rules = rules.sort_values(by='lift', ascending=False)
rules.head(15)

| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 27 | (Clementines) | ( Bag) | 0.01 | 0.52 | 36.84 |
| 26 | ( Bag) | (Clementines) | 0.01 | 0.79 | 36.84 |
| 22 | (Limes) | (Large Lemon) | 0.01 | 0.26 | 4.26 |
| 23 | (Large Lemon) | (Limes) | 0.01 | 0.2 | 4.26 |
| 19 | (Organic Strawberries) | (Organic Raspberries) | 0.01 | 0.15 | 3.63 |
| 18 | (Organic Raspberries) | (Organic Strawberries) | 0.01 | 0.3 | 3.63 |
| 31 | (Large Lemon) | (Organic Avocado) | 0.01 | 0.17 | 2.94 |
| 30 | (Organic Avocado) | (Large Lemon) | 0.01 | 0.18 | 2.94 |
| 3 | (Organic Hass Avocado) | (Bag of Organic Bananas) | 0.02 | 0.33 | 2.81 |
| 2 | (Bag of Organic Bananas) | (Organic Hass Avocado) | 0.02 | 0.16 | 2.81 |
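
To read this table: confidence of a rule A → C is support(A ∪ C) / support(A), and lift is that confidence divided by support(C). The sketch below spells out these standard definitions on made-up transactions; the helper functions and data are illustrative and are not mlxtend's code.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to how often the consequent occurs overall."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Hypothetical transactions for illustration.
transactions_demo = [
    ['Banana', 'Limes'],
    ['Banana'],
    ['Large Lemon', 'Limes'],
    ['Banana', 'Large Lemon', 'Limes'],
]

forward = lift(['Limes'], ['Large Lemon'], transactions_demo)
backward = lift(['Large Lemon'], ['Limes'], transactions_demo)
print(confidence(['Limes'], ['Large Lemon'], transactions_demo), forward, backward)
```

Lift is symmetric in antecedent and consequent, which is why each pair of mirrored rules in the table shares one lift value while their confidences differ. A lift above 1 suggests the items appear together more often than they would if purchased independently.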
