# Installing and importing the necessary libraries
We will use the apriori function from the apyori library to find the association rules, along with pandas and numpy for any data frame manipulation we need.

https://zaxrosenberg.com/unofficial-apyori-documentation/

In [None]:
!pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5955 sha256=ef7ff4a35c796142a4213188b04b8246a0026213d00c9ad4167e1987df50592d
  Stored in directory: /root/.cache/pip/wheels/c4/1a/79/20f55c470a50bb3702a8cb7c94d8ada15573538c7f4baebe2d
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


In [None]:
from apyori import apriori
import pandas as pd
import numpy as np


# 1. Grocery Dataset


## Prepping the dataset
Because we are required to have two datasets (one grouped by member, another grouped by month), we start by grouping by month and then group by member.


### By Month


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/efvaldez1/data-repository/main/groceries.csv')
df.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


We start by preparing the data for grouping by month. We make a copy of the original dataset and then drop the Member_number column, as it is not needed for this grouping.

In [None]:
# Group by month: work on a copy so the original df stays untouched
df1 = df.copy()
df1 = df1.drop("Member_number", axis=1)

We drop duplicate rows, as they may affect the association rules that are found.

In [None]:
df1 = df1.drop_duplicates(keep='first')
df1

Unnamed: 0,Date,itemDescription
0,21-07-2015,tropical fruit
1,05-01-2015,whole milk
2,19-09-2015,pip fruit
3,12-12-2015,other vegetables
4,01-02-2015,whole milk
...,...,...
38760,08-10-2014,sliced cheese
38761,23-02-2014,candy
38762,16-04-2014,cake bar
38763,03-12-2014,fruit/vegetable juice


We convert the values in the Date column to datetime objects and then extract only the month of each date. This lets us group by month regardless of year.

In [None]:
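# Convert the Date column to datetime; the source strings are day-month-year (e.g. 21-07-2015)
df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%Y')

# Keep only the month number (1-12)
df1['Date'] = df1['Date'].dt.month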

df1


Unnamed: 0,Date,itemDescription
0,7,tropical fruit
1,1,whole milk
2,9,pip fruit
3,12,other vegetables
4,2,whole milk
...,...,...
38760,10,sliced cheese
38761,2,candy
38762,4,cake bar
38763,12,fruit/vegetable juice
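
The date handling above can be checked on a few of the date strings from the sample rows. This is a minimal sketch that assumes the whole column is in day-month-year order, as the sample suggests; the explicit `format` removes the day/month ambiguity of a string like "05-01-2015".

```python
import pandas as pd

# Date strings copied from the sample rows shown earlier (day-month-year).
dates = pd.Series(["21-07-2015", "05-01-2015", "01-02-2015", "08-10-2014"])

# Parse with an explicit format instead of letting pandas infer one.
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

# Keep only the month number, as in the notebook cell.
months = parsed.dt.month.tolist()
print(months)  # [7, 1, 2, 10]
```

The resulting months agree with the Date column in the table above.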


In [None]:
month_grouped = df1.groupby('Date')['itemDescription'].apply(list).reset_index()

In [None]:
month_grouped_list = month_grouped["itemDescription"].to_list()
#month_grouped_list[0]

### By Member


Next we group by member. We drop the Date column, which is not needed for this grouping, and then drop the duplicate rows.

In [None]:
# Group By Member
df2 = df.drop("Date", axis=1)
df2 = df2.drop_duplicates(keep='first')
df2


Unnamed: 0,Member_number,itemDescription
0,1808,tropical fruit
1,2552,whole milk
2,2300,pip fruit
3,1187,other vegetables
4,3037,whole milk
...,...,...
38760,4471,sliced cheese
38761,2022,candy
38762,1097,cake bar
38763,1510,fruit/vegetable juice


In [None]:
member_grouped = df2.groupby('Member_number')['itemDescription'].apply(list).reset_index()

In [None]:
member_grouped_list = member_grouped["itemDescription"].to_list()
#member_grouped_list
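
To see what the grouping step produces, the same chain of calls can be run on a tiny frame. The rows here are made up for illustration and are not taken from the grocery data; only the column names match.

```python
import pandas as pd

# Made-up rows: member 2552 bought "whole milk" twice.
toy = pd.DataFrame({
    "Member_number": [1808, 1808, 2552, 2552, 2552],
    "itemDescription": ["tropical fruit", "whole milk",
                        "whole milk", "whole milk", "pip fruit"],
})

# Drop repeated (member, item) pairs, then collect each member's items into one list.
baskets = (
    toy.drop_duplicates(keep="first")
       .groupby("Member_number")["itemDescription"]
       .apply(list)
       .reset_index()
)

# One inner list per member: the shape that is passed to apriori later on.
basket_list = baskets["itemDescription"].to_list()
print(basket_list)
```

Each inner list is one transaction, so the number of inner lists equals the number of distinct members.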

## Creating the rules
Now that we have the lists month_grouped_list and member_grouped_list, which hold the dataset grouped by month and by member respectively, we can find the rules for each grouping.
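
Before running apriori, it may help to see how support, confidence, and lift are computed for a single rule. The sketch below uses plain Python and a handful of made-up baskets; the helper names `support` and `rule_metrics` are hypothetical and are not part of apyori.

```python
# Made-up baskets for illustration only.
transactions = [
    ["whole milk", "yogurt", "rolls/buns"],
    ["whole milk", "yogurt"],
    ["whole milk", "soda"],
    ["yogurt", "soda"],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift

# Rule: yogurt -> whole milk
metrics = rule_metrics({"yogurt"}, {"whole milk"}, transactions)
print(metrics)
```

Here the pair appears in 2 of 4 baskets (support 0.5), yogurt appears in 3 of 4, so confidence is 0.5 / 0.75, and dividing that by the support of whole milk (0.75) gives a lift below 1.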

We start by finding the rules for the dataset grouped by month, month_grouped_list.


In [None]:
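# min_length is passed here, but single-item records still appear in the output below,
# so it does not seem to take effect; they are filtered out later in create_dataframe.
rules_month = apriori(month_grouped_list, min_support=1, min_confidence=1, min_lift=1, min_length=2, max_length=2)
results_month = list(rules_month)
results_month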


[RelationRecord(items=frozenset({'Instant food products'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'Instant food products'}), confidence=1.0, lift=1.0)]),
 RelationRecord(items=frozenset({'UHT-milk'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'UHT-milk'}), confidence=1.0, lift=1.0)]),
 RelationRecord(items=frozenset({'artif. sweetener'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'artif. sweetener'}), confidence=1.0, lift=1.0)]),
 RelationRecord(items=frozenset({'baking powder'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'baking powder'}), confidence=1.0, lift=1.0)]),
 RelationRecord(items=frozenset({'beef'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'beef'}), confidence=1.0, lift=1.0)]),
 RelationRecord(items=frozenset

In [None]:
len(results_month)

7021

The following function builds a dataframe so that we can easily see the rules and their respective confidence, support, and lift values.

In [None]:
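def create_dataframe(results):
    """Build a table of item pairs with their support, confidence, and lift."""
    data = []

    for record in results:
        # Item order follows frozenset iteration, not antecedent -> consequent.
        items = list(record.items)

        if len(items) > 1:  # Skip single-item records
            # Only the first ordered statistic of each record is kept.
            first_stat = record.ordered_statistics[0]
            data.append({
                'Item 1': items[0],
                'Item 2': items[1],
                'Support': record.support,
                # Stored as strings; converted to float in a later cell.
                'Confidence': str(first_stat.confidence),
                'Lift': str(first_stat.lift)
            })

    return pd.DataFrame(data)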
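
The function above reports the two items of each pair but not which one is the antecedent. As a possible alternative, the sketch below keeps the rule direction by reading `items_base` and `items_add`. The two namedtuples are stand-ins that mirror the field names printed in the apriori output, so the block runs without apyori; `rules_to_dataframe` is a hypothetical helper, and the sample values are copied from the outputs in this notebook.

```python
from collections import namedtuple

import pandas as pd

# Stand-ins mirroring the field names shown in the printed apriori results.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

sample_results = [
    RelationRecord(
        items=frozenset({"bottled water", "whisky"}),
        support=0.0012827090815802976,
        ordered_statistics=[
            OrderedStatistic(frozenset({"whisky"}), frozenset({"bottled water"}),
                             0.625, 2.9246698679471788),
        ],
    ),
    # A single-item record with an empty antecedent, as in the month results.
    RelationRecord(
        items=frozenset({"beef"}),
        support=1.0,
        ordered_statistics=[
            OrderedStatistic(frozenset(), frozenset({"beef"}), 1.0, 1.0),
        ],
    ),
]

def rules_to_dataframe(results):
    """One row per rule, keeping antecedent -> consequent order."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            if not stat.items_base:  # skip entries with an empty antecedent
                continue
            rows.append({
                "Antecedent": ", ".join(sorted(stat.items_base)),
                "Consequent": ", ".join(sorted(stat.items_add)),
                "Support": record.support,
                "Confidence": stat.confidence,  # kept numeric
                "Lift": stat.lift,
            })
    columns = ["Antecedent", "Consequent", "Support", "Confidence", "Lift"]
    return pd.DataFrame(rows, columns=columns)

rules_df = rules_to_dataframe(sample_results)
print(rules_df)
```

Keeping Confidence and Lift numeric also means no later type conversion is needed before sorting or calling `describe()`.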



In [None]:
DF_monthresults = create_dataframe(results_month)

In [None]:
DF_monthresults

Unnamed: 0,Item 1,Item 2,Support,Confidence,Lift
0,Instant food products,UHT-milk,1.0,1.0,1.0
1,Instant food products,artif. sweetener,1.0,1.0,1.0
2,Instant food products,baking powder,1.0,1.0,1.0
3,Instant food products,beef,1.0,1.0,1.0
4,Instant food products,berries,1.0,1.0,1.0
...,...,...,...,...,...
6898,yogurt,white wine,1.0,1.0,1.0
6899,white wine,zwieback,1.0,1.0,1.0
6900,yogurt,whole milk,1.0,1.0,1.0
6901,zwieback,whole milk,1.0,1.0,1.0


In [None]:
DF_monthresults.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6903 entries, 0 to 6902
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Item 1      6903 non-null   object 
 1   Item 2      6903 non-null   object 
 2   Support     6903 non-null   float64
 3   Confidence  6903 non-null   object 
 4   Lift        6903 non-null   object 
dtypes: float64(1), object(4)
memory usage: 269.8+ KB


In [None]:
DF_monthresults[["Confidence","Lift"]] = DF_monthresults[["Confidence","Lift"]].astype(float)

In [None]:
DF_monthresults.describe()

Unnamed: 0,Support,Confidence,Lift
count,6903.0,6903.0,6903.0
mean,1.0,1.0,1.0
std,0.0,0.0,0.0
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,1.0,1.0,1.0
max,1.0,1.0,1.0


We now proceed to finding the rules for the dataset grouped by member, member_grouped_list.


In [None]:
rules_member = apriori(member_grouped_list, min_support=0.001, min_confidence=0.6, min_lift=0, min_length=2, max_length=2)
results_member = list(rules_member)
results_member


[RelationRecord(items=frozenset({'artif. sweetener', 'whole milk'}), support=0.004617752693689072, ordered_statistics=[OrderedStatistic(items_base=frozenset({'artif. sweetener'}), items_add=frozenset({'whole milk'}), confidence=0.6206896551724138, lift=1.354674286596903)]),
 RelationRecord(items=frozenset({'bathroom cleaner', 'whole milk'}), support=0.0030785017957927143, ordered_statistics=[OrderedStatistic(items_base=frozenset({'bathroom cleaner'}), items_add=frozenset({'whole milk'}), confidence=0.7058823529411765, lift=1.5406099729925566)]),
 RelationRecord(items=frozenset({'bottled water', 'whisky'}), support=0.0012827090815802976, ordered_statistics=[OrderedStatistic(items_base=frozenset({'whisky'}), items_add=frozenset({'bottled water'}), confidence=0.625, lift=2.9246698679471788)]),
 RelationRecord(items=frozenset({'brandy', 'whole milk'}), support=0.0061570035915854285, ordered_statistics=[OrderedStatistic(items_base=frozenset({'brandy'}), items_add=frozenset({'whole milk'}), 

In [None]:
DF_memberresults = create_dataframe(results_member)

In [None]:
DF_memberresults.sort_values(by=['Support',"Confidence"], ascending=False)

Unnamed: 0,Item 1,Item 2,Support,Confidence,Lift
13,liquor,whole milk,0.016675,0.6310679611650485,1.3773252590265168
15,mustard,whole milk,0.01411,0.6043956043956044,1.3191120190000367
27,zwieback,whole milk,0.009236,0.6000000000000001,1.3095184770436732
8,whole milk,curd cheese,0.008722,0.7391304347826088,1.6131749354885827
11,house keeping products,whole milk,0.007696,0.6666666666666666,1.4550205300485255
14,whole milk,meat spreads,0.007183,0.8,1.7460246360582308
3,brandy,whole milk,0.006157,0.6315789473684211,1.378440502151235
9,dental care,whole milk,0.005131,0.606060606060606,1.3227459364077503
0,artif. sweetener,whole milk,0.004618,0.6206896551724138,1.354674286596903
25,whole milk,snack products,0.004361,0.6296296296296297,1.3741860561569408
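
The table above is ordered by support. Ranking by lift instead is another option, sketched here on three rows copied from that table. Because create_dataframe stores Confidence and Lift as strings, the sketch converts them to float first so that sorting is numeric rather than alphabetical.

```python
import pandas as pd

# Three rows copied from the member results table; Confidence and Lift are strings there.
rules = pd.DataFrame({
    "Item 1": ["liquor", "whole milk", "whole milk"],
    "Item 2": ["whole milk", "meat spreads", "curd cheese"],
    "Support": [0.016675, 0.007183, 0.008722],
    "Confidence": ["0.6310679611650485", "0.8", "0.7391304347826088"],
    "Lift": ["1.3773252590265168", "1.7460246360582308", "1.6131749354885827"],
})

# Convert the string columns to float before sorting.
rules[["Confidence", "Lift"]] = rules[["Confidence", "Lift"]].astype(float)

# Highest lift first.
by_lift = rules.sort_values(by="Lift", ascending=False).reset_index(drop=True)
top_pair = (by_lift.loc[0, "Item 1"], by_lift.loc[0, "Item 2"])
print(by_lift)
```

On these rows the pair with the lowest support of the three comes out on top by lift, which shows that the two orderings can disagree.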


# 2. Own Dataset
We will use e-commerce data taken from https://www.kaggle.com/datasets/carrie1/ecommerce-data. I downloaded this data and uploaded it to my Google Drive.


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Data Cleaning


In [None]:
ecommerce_data = pd.read_csv('/content/gdrive/MyDrive/datasets/ecommerceData.csv',  encoding='unicode_escape')
ecommerce_data


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


We use .info() to find out whether there are null values.


In [None]:
ecommerce_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [None]:
ecommerce_data.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


Description and CustomerID have fewer non-null values than the other columns, so we drop the rows that contain null values.

In [None]:
ecommerce_data.dropna(inplace=True)
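
As a sketch of the same cleaning step on a toy frame (the rows are made up; only the column names follow the dataset), `isna().sum()` gives the null count per column directly, and `dropna(subset=...)` restricts the check to the two columns that are used later. On the real data, where only Description and CustomerID contain nulls, this should remove the same rows as a plain `dropna()`.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing Description and one missing CustomerID.
toy = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER"],
    "CustomerID": [17850.0, 17850.0, np.nan],
    "Quantity": [6, 6, 8],
})

# Null count per column.
null_counts = toy.isna().sum()
print(null_counts)

# Drop rows that are missing either of the two columns needed for the baskets.
cleaned = toy.dropna(subset=["Description", "CustomerID"])
print(cleaned)
```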

We are only interested in the Description and CustomerID columns, so we drop the rest.

In [None]:
ecommerce_data_new = ecommerce_data[["Description","CustomerID"]]

In [None]:
df = ecommerce_data_new.groupby('CustomerID')['Description'].apply(list).reset_index()

We convert the per-customer item lists into a list of lists so that it can be passed to the apriori function.

In [None]:
ecommerce_item_list = df['Description'].tolist()
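
Some descriptions in the results further down carry a trailing space (for example 'ALARM CLOCK BAKELIKE RED '). Whether the dataset also contains the same description without the space is an assumption here, but if it does, the two spellings would be counted as different items. A minimal sketch of an optional normalisation step, on made-up rows:

```python
import pandas as pd

# Made-up rows: the same product appears with and without a trailing space.
toy = pd.DataFrame({
    "CustomerID": [12680.0, 12680.0, 12680.0],
    "Description": ["ALARM CLOCK BAKELIKE RED ",
                    "ALARM CLOCK BAKELIKE RED",
                    "ALARM CLOCK BAKELIKE GREEN"],
})

# Strip surrounding whitespace so both spellings map to one item.
toy["Description"] = toy["Description"].str.strip()

# Remove repeated (customer, item) pairs, then build one list per customer.
baskets = (
    toy.drop_duplicates()
       .groupby("CustomerID")["Description"]
       .apply(list)
       .tolist()
)
print(baskets)
```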

## Creating Rules

In [None]:
rules_ecommerce = apriori(ecommerce_item_list, min_support=0.05, min_confidence=0.8, min_lift=0, min_length=2, max_length=2)
results_ecommerce = list(rules_ecommerce)
results_ecommerce


[RelationRecord(items=frozenset({'ALARM CLOCK BAKELIKE GREEN', 'ALARM CLOCK BAKELIKE RED '}), support=0.06061299176578225, ordered_statistics=[OrderedStatistic(items_base=frozenset({'ALARM CLOCK BAKELIKE GREEN'}), items_add=frozenset({'ALARM CLOCK BAKELIKE RED '}), confidence=0.8204334365325077, lift=9.173746763478576)]),
 RelationRecord(items=frozenset({'BAKING SET SPACEBOY DESIGN', 'BAKING SET 9 PIECE RETROSPOT '}), support=0.055809698078682524, ordered_statistics=[OrderedStatistic(items_base=frozenset({'BAKING SET SPACEBOY DESIGN'}), items_add=frozenset({'BAKING SET 9 PIECE RETROSPOT '}), confidence=0.8133333333333332, lift=6.078450142450142)]),
 RelationRecord(items=frozenset({'PINK REGENCY TEACUP AND SAUCER', 'GREEN REGENCY TEACUP AND SAUCER'}), support=0.06816102470265324, ordered_statistics=[OrderedStatistic(items_base=frozenset({'PINK REGENCY TEACUP AND SAUCER'}), items_add=frozenset({'GREEN REGENCY TEACUP AND SAUCER'}), confidence=0.9283489096573208, lift=10.407029315440528)])

In [None]:
DF_ecommerce = create_dataframe(results_ecommerce)

In [None]:
DF_ecommerce.sort_values(by=['Support',"Confidence"], ascending=False)

Unnamed: 0,Item 1,Item 2,Support,Confidence,Lift
3,ROSES REGENCY TEACUP AND SAUCER,GREEN REGENCY TEACUP AND SAUCER,0.074108,0.8307692307692307,8.506143037290578
6,WHITE HANGING HEART T-LIGHT HOLDER,RED HANGING HEART T-LIGHT HOLDER,0.071363,0.8103896103896104,4.129397874852421
2,PINK REGENCY TEACUP AND SAUCER,GREEN REGENCY TEACUP AND SAUCER,0.068161,0.9283489096573208,10.407029315440528
5,ROSES REGENCY TEACUP AND SAUCER,PINK REGENCY TEACUP AND SAUCER,0.062443,0.8504672897196262,8.70782901792554
0,ALARM CLOCK BAKELIKE GREEN,ALARM CLOCK BAKELIKE RED,0.060613,0.8204334365325077,9.173746763478576
4,PINK REGENCY TEACUP AND SAUCER,REGENCY CAKESTAND 3 TIER,0.059698,0.8130841121495327,4.007670505431518
1,BAKING SET SPACEBOY DESIGN,BAKING SET 9 PIECE RETROSPOT,0.05581,0.8133333333333332,6.078450142450142
7,SET OF TEA COFFEE SUGAR TINS PANTRY,SET OF 3 CAKE TINS PANTRY DESIGN,0.050778,0.8072727272727274,5.514681818181819
