<a href="https://colab.research.google.com/github/AnnaK8090/CIND-820_Big-Data-Analytics-Project/blob/main/CIND_820_Big_Data_Analytics_Project_3_association_rule.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Association Rule Mining (ARM) can be used to provide session-based recommendations, and Apriori is one widely used ARM algorithm.

**General Idea behind the Apriori Algorithm**
Suppose User A bought Product1 in a single transaction, and User B bought Product1 and Product2 together in a single transaction. If Product1 and Product2 appear together often enough across all transactions, a rule is generated that suggests Product2 to User A alongside Product1.

There are 3 main metrics behind the rules that the Apriori algorithm produces:

**Support**: Probability that a transaction contains both Product1 and Product2.

**Confidence**: Conditional probability that a transaction contains Product2 given that it contains Product1, P(Product2 | Product1).

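**Lift**: Ratio of the Confidence of the rule to the Support of Product2 on its own, i.e. P(Product1 and Product2) / (P(Product1) × P(Product2)). If the lift is < 1, Product1 and Product2 are negatively correlated (they do not belong together in recommendations); if it is > 1, they are positively correlated; a lift of exactly 1 means they occur independently of each other.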
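
To make the three metrics concrete, here is a minimal sketch that computes them by hand for the rule Product1 → Product2. The `orders` list and the `support` helper are hypothetical and exist only for this illustration; lift is taken as confidence divided by the support of Product2 alone, which is the standard definition.

```python
# Hypothetical toy transactions: one set of items per order.
orders = [
    {"Product1", "Product2"},
    {"Product1", "Product2"},
    {"Product1"},
    {"Product2"},
    {"Product3"},
]

def support(items):
    """Fraction of orders that contain every item in `items`."""
    items = set(items)
    return sum(items <= order for order in orders) / len(orders)

# Support of the pair: 2 of the 5 orders contain both products.
support_both = support({"Product1", "Product2"})

# Confidence of Product1 -> Product2: P(Product2 | Product1).
confidence = support_both / support({"Product1"})

# Lift: confidence relative to how common Product2 is overall.
lift = confidence / support({"Product2"})

print(f"support={support_both:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

In this toy data the lift comes out above 1, so the two products appear together more often than they would if they were bought independently.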

In [2]:
# 1. Importing libraries:
import numpy as np
import pandas as pd
# Installing mlxtend, the package whose Apriori implementation is used below to identify frequent itemsets and derive association rules:
!pip install mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules



In [3]:
# 2. Loading csv file and saving into a dataframe:
masterDF = pd.read_csv("MasterDF.csv")

In [4]:
# 3. Since an order_id might have more than one review (given on the same or a different date), we aggregate by order and product and keep the maximum review score:
masterDF_grouped = masterDF.groupby(['order_id','customer_unique_id','product_id','product_category_name_english'])['review_score'].max()
masterDF_grouped = masterDF_grouped.reset_index()
masterDF_grouped.head()

Unnamed: 0,order_id,customer_unique_id,product_id,product_category_name_english,review_score
0,00010242fe8c5a6d1ba2dd792cb16214,871766c5855e863f6eccc05f988b23cb,4244733e06e7ecb4970a6e2683c13e61,cool_stuff,5
1,00018f77f2f0320c557190d7a144bdd3,eb28e67c4c0b83846050ddfb8a35d051,e5f2d52b802189ee658865ca93d83a8f,pet_shop,4
2,000229ec398224ef6ca0657da4fc703e,3818d81c6709e39d06b2738a8d3a2474,c777355d18b72b67abbeef9df44fd0fd,furniture_decor,5
3,00024acbcdf0a6daa1e931b038114c75,af861d436cfc08b2c2ddefd0ba074622,7634da152a4610f1595efa32f14722fc,perfumery,4
4,00042b26cf59d7ce69dfabb4e55b4fd9,64b576fb70d441e8f1b2d7d446e483c5,ac6c3623068f30de03045865e4e10089,garden_tools,5


In [5]:
# 4. Copying the product_id column as ItemID_NEW_StringType and concatenating it with product_category_name_english into ItemID_Category, so the algorithm's results can be checked visually.
# Also adding a "Quantity" column:
masterDF["ItemID_NEW_StringType"] = masterDF["product_id"]
masterDF["ItemID_NEW_StringType"] = masterDF["ItemID_NEW_StringType"].astype(str)
masterDF['ItemID_Category'] = [''.join(i) for i in zip(masterDF['ItemID_NEW_StringType'], masterDF['product_category_name_english'])]
masterDF["Quantity"]=1

In [6]:
# 5. Stripping extra spaces in the description
masterDF['ItemID_Category'] = masterDF['ItemID_Category'].str.strip()
  
# Dropping the rows without any order_id:
masterDF.dropna(axis = 0, subset =['order_id'], inplace = True)
masterDF['order_id'] = masterDF['order_id'].astype('str')
  

In [7]:
# 6. Subsetting the dataframe to a few product categories to test the Apriori algorithm (otherwise the session crashes):
options = ["bed_bath_table","baby","furniture_decor","pet_shop","sports_leisure","auto","fashion_bags_accessories"]
masterDF = masterDF[masterDF['product_category_name_english'].isin(options)]

In [8]:
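# Pivoting to one row per order and one column per item; each cell holds the quantity ordered (0 if the item is not in the order):
masterDF_subsetted = (masterDF
          .groupby(['order_id', 'ItemID_Category'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))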


In [9]:
masterDF_subsetted.head()

ItemID_Category,00088930e925c41fd95ebfe695fd2655auto,0009406fd7479715e4bef61dd91f2462bed_bath_table,0011c512eb256aa0dbbb544d8dffcf6eauto,001b237c0e9bb435f2e54071129237e9bed_bath_table,001b72dfd63e9833e8c02742adf472e3furniture_decor,001c5d71ac6ad696d22315953758fa04bed_bath_table,002959d7a0b0990fe2d69988affcbc80furniture_decor,002d4ea7c04739c130bb74d7e7cd1694pet_shop,0030026a6ddb3b2d1d4bc225b4b4c4dasports_leisure,003128f981470c3e5a2e7445e4a771cdsports_leisure,...,ffbe169d395060d7fb975c990581a329furniture_decor,ffc9d90bae2127e6a6ce6d6654267ebdsports_leisure,ffccf0ce5eff1a158891296990107d08sports_leisure,ffd246249e3225c13f40b5b91dcaa65asports_leisure,ffd4bf4306745865e5692f69bd237893fashion_bags_accessories,ffd9ac56db9194a413298faaa03cd176pet_shop,ffe013e1b4603e3b0b02fbb159d5b400sports_leisure,ffe8083298f95571b4a66bfbc1c05524bed_bath_table,fff1059cd247279f3726b7696c66e44esports_leisure,fff9553ac224cec9d15d49f5a263411ffashion_bags_accessories
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00018f77f2f0320c557190d7a144bdd3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000229ec398224ef6ca0657da4fc703e,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00063b381e2406b52ad429470734ebd5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0006ec9db01a64e59a68b2c340bf65a7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009792311464db532ff765bf7b182ae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
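# 7. One-hot encoding to make the data suitable for the Apriori library
def hot_encode(x):
    # Any quantity of 1 or more means the item is present in the order;
    # everything else (including values between 0 and 1) counts as absent.
    return 1 if x >= 1 else 0

# Encoding the dataset
masterDF_encoded = masterDF_subsetted.applymap(hot_encode)
masterDF_subsetted = masterDF_encoded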
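
The pivot and encoding steps above can be seen end to end on a tiny table. This is only a sketch, assuming pandas is available: the `toy` dataframe and its values are made up, only the column names mirror the notebook, and a vectorised comparison stands in for `hot_encode`.

```python
import pandas as pd

# Hypothetical order lines; item "B" appears twice in order "o3".
toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3", "o3", "o3"],
    "ItemID_Category": ["A", "B", "A", "A", "B", "B"],
    "Quantity": [1, 1, 1, 1, 1, 1],
})

# One row per order, one column per item, cells hold summed quantities.
basket = (toy
          .groupby(["order_id", "ItemID_Category"])["Quantity"]
          .sum().unstack().fillna(0))

# Presence/absence encoding: True if the item occurs in the order at all.
encoded = basket > 0

print(basket)
print(encoded)
```

A boolean matrix carries the same information as the 0/1 encoding; the quantity of 2 for item "B" in order "o3" collapses to a single "present" flag, which is all that association rule mining looks at.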

In [None]:
# 8. Building the model
frq_items = apriori(masterDF_subsetted, min_support=0.001, use_colnames=True)

# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())
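
For intuition about what the frequent-itemset step does, the following is a minimal pure-Python sketch of the level-wise Apriori search. It is not mlxtend's implementation; `frequent_itemsets` and `toy_orders` are hypothetical names used only here. The key idea is that an itemset can be frequent only if all of its smaller subsets are frequent, so candidates that fail this check are pruned before their support is counted.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {frozenset(items): support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count support only for the surviving candidates.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result

# Hypothetical toy orders.
toy_orders = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = frequent_itemsets(toy_orders, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Here the pair milk and butter falls below the threshold, so the three-item candidate is pruned without a pass over the data. With thousands of item columns, as in the notebook, this pruning is what keeps the search tractable at all.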

In [None]:
import os  

In [None]:
# 9. Saving the rules to a CSV file (creating the output folder first if it does not exist):
os.makedirs('folder/subfolder', exist_ok=True)
rules.to_csv('folder/subfolder/out.csv')