# ASSOCIATION ANALYSIS WITH APRIORI

**File:** Apriori.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# INSTALL AND IMPORT LIBRARIES

This notebook uses the `apriori()` and `association_rules()` functions from the `mlxtend` library, which can be installed with Python's `pip` command. Installation only needs to be done once per environment.

The standard, shorter approach may work:

In [1]:
# pip install mlxtend

If the above command doesn't work, it may be necessary to be more explicit, in which case you can run the code below.

In [2]:
# import sys
# !{sys.executable} -m pip install mlxtend

Once `mlxtend` is installed, load the libraries below.

In [3]:
import pandas as pd              # For dataframes
import matplotlib.pyplot as plt  # For plotting data
from mlxtend.frequent_patterns import apriori, association_rules  # For Apriori and rule mining

# LOAD AND PREPARE DATA

For this demonstration, we'll use the dataset `Product_Details_apriori.csv`, in which each purchased item is identified by an `order_id` and a `Product_Name`. The Apriori workflow needs the data in transactional format (as opposed to tabular format), which means that each row is a list of items purchased together and that the items may appear in any order.

The code below opens the dataset, groups the products of each order into a list, and keeps only the orders with more than one product, since a single-item order cannot contribute to an association between items.

In [4]:
file_path = '/Users/sumedhajauhari/Desktop/My Study Material/Product_Details_apriori.csv'
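df = pd.read_csv(file_path)
# Collect the products of each order into one list per order_id
df_grouped = df.groupby('order_id')['Product_Name'].apply(list)
# Keep only orders with more than one product
df_grouped = df_grouped[df_grouped.apply(len) > 1]
# Convert the Series back to a DataFrame with named columns
df_grouped = df_grouped.reset_index()
df_grouped.columns = ['order_id', 'products']
df_grouped.head(10)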

Unnamed: 0,order_id,products
0,2017-10-01 00:02:32-28705-123.2.191.66,"[The Perfect Tee, Hotter Than Ever Tee, All-Ov..."
1,2017-10-01 00:10:50-11645-49.191.2.51,"[The Diana One-Piece, Womens Havana Dress, Wom..."
2,2017-10-01 00:19:18-40598-110.150.216.111,"[Brynn Flats, Georgie Slides]"
3,2017-10-01 00:20:42-39355-60.242.62.12,"[Go Step Lite - Origin, Lorne Sandals]"
4,2017-10-01 00:21:47-43069-146.23.201.247,"[Everyday Train Graphic Tights, All Eyes On Me..."
5,2017-10-01 00:22:21-35489-49.199.120.131,"[Haven Sheer Shift Ruffle Dress, Nikita Shift ..."
6,2017-10-01 00:22:34-11273-122.58.117.133,[Inlay Case for iPhone 6/6S Plus - Beverly Hil...
7,2017-10-01 00:22:48-7208-175.33.160.56,"[Rally Roll Neck Knit, Vanessa Embroidered Kni..."
8,2017-10-01 00:25:10-26702-124.190.27.210,"[Dimension Frill Wrap Dress , Anahi Embroider..."
9,2017-10-01 00:26:05-29606-49.195.203.89,"[Satin Twill Midi Skirt, Desi Floral Full Skir..."
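
To see what the grouping step does on a small scale, here is a minimal sketch that applies the same operations to a toy table. The order IDs and product names are invented for illustration, and `toy` and `baskets` are hypothetical variable names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per (order, product) line item.
toy = pd.DataFrame({
    "order_id": ["A", "A", "B", "C", "C", "C"],
    "Product_Name": ["Tee", "Flats", "Tee", "Dress", "Skirt", "Tee"],
})

# Collect each order's products into a list (one basket per order).
baskets = toy.groupby("order_id")["Product_Name"].apply(list)

# Drop single-item orders: they cannot contribute to any item pair.
baskets = baskets[baskets.apply(len) > 1]

# Turn the Series into a two-column DataFrame.
baskets = baskets.reset_index()
baskets.columns = ["order_id", "products"]
print(baskets)
```

Order `B` contains a single product, so it is dropped; orders `A` and `C` remain, each with its list of products.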


In [5]:
df_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23597 entries, 0 to 23596
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   order_id  23597 non-null  object
 1   products  23597 non-null  object
dtypes: object(2)
memory usage: 368.8+ KB


# APPLY APRIORI

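Call `apriori()` to find the frequent itemsets, then call `association_rules()` to derive rules from them. `apriori()` takes the one-hot encoded transactions and a minimum support (`min_support`); `association_rules()` takes a metric (here, lift) and a minimum threshold for that metric (`min_threshold`). Only the itemsets and rules that satisfy these criteria are returned.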
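
The metrics named above (support, confidence, and lift) can be computed by hand on a few baskets, which may help when reading the rule tables later. The sketch below is plain Python, not the library's implementation; the baskets and the helper names `support` and `rule_metrics` are assumptions made for illustration.

```python
# Hypothetical baskets, invented for illustration only.
transactions = [
    {"Tee", "Flats"},
    {"Tee", "Flats", "Dress"},
    {"Tee", "Dress"},
    {"Flats"},
    {"Dress", "Skirt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a   # P(consequent | antecedent)
    lift = confidence / s_c   # > 1 means bought together more often than chance
    return {"support": s_ac, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"Tee"}, {"Flats"}, transactions)
print(metrics)
```

Here `Tee` and `Flats` each appear in 3 of 5 baskets and together in 2 of 5, so the rule `Tee -> Flats` has support 0.4, confidence of about 0.67, and lift of about 1.11.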

In [None]:
# Step 1: Ensure product lists are unique
df_grouped['products'] = df_grouped['products'].apply(lambda x: list(set(x)))

# Step 2: Prepare the data for the Apriori algorithm
# One-hot encode the product lists as booleans, which mlxtend's apriori() handles most efficiently
encoded_data = df_grouped['products'].apply(lambda x: pd.Series(1, index=x)).fillna(0).astype(bool)

# Ensure the columns are unique to avoid reindexing issues
encoded_data = encoded_data.loc[:, ~encoded_data.columns.duplicated()]

# Display the encoded data to ensure it's correct
#print("Encoded Data:")
#display(encoded_data)

# Step 3: Apply the Apriori algorithm to find frequent itemsets
# min_support=0.0005 keeps itemsets found in at least 0.05% of orders (about 12 of the 23,597 orders)
frequent_itemsets = apriori(encoded_data, min_support=0.0005, use_colnames=True)

# Display the frequent itemsets
print("Frequent Itemsets:")
display(frequent_itemsets)

# Check if frequent itemsets were found
if frequent_itemsets.empty:
    print("No frequent itemsets found.")
else:
    # Step 4: Derive association rules from the frequent itemsets
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
    
    # Display the initial set of rules
    print("Association Rules:")
    display(rules)
    
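    # Filter rules by confidence and lift. Every rule above already has
    # lift >= 1, so the lift condition here removes nothing; raise both
    # thresholds to narrow the results
    filtered_rules = rules[(rules['confidence'] >= 0.001) & (rules['lift'] >= 0.2)]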
    
    # Display the filtered rules
    print("Filtered Association Rules:")
    display(filtered_rules)
    
    # Extract product affinities
    product_affinities = []
    for _, row in filtered_rules.iterrows():
        for product in row['antecedents']:
            for related_product in row['consequents']:
                product_affinities.append({
                    'product': product, 
                    'related_product': related_product, 
                    'support': row['support'], 
                    'confidence': row['confidence'], 
                    'lift': row['lift']
                })
    
    # Convert to DataFrame and display
    affinities_df = pd.DataFrame(product_affinities)
    print("Product Affinities:")
    display(affinities_df)

    """
    # Prepare the final output at order_id, product level
    # Create a DataFrame with order_id and the products involved in frequent itemsets
    order_product_affinity = []
    for index, row in df_grouped.iterrows():
        for product in row['products']:
            if any(set([product]) <= set(itemset) for itemset in frequent_itemsets['itemsets']):
                order_product_affinity.append({'order_id': row['order_id'], 'product': product})

    # Convert to DataFrame and set the column names
    final_df = pd.DataFrame(order_product_affinity, columns=['order_id', 'product'])

    # Display the final DataFrame
    print(final_df)
    """
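
The one-hot encoding in Step 2 builds a new `pd.Series` for every order, which can be slow on tens of thousands of orders. One possible alternative, sketched here on invented data, is to explode the product lists into long form and cross-tabulate them. The names `baskets`, `long_form`, and `encoded` are hypothetical, and this is an illustration under those assumptions, not the notebook's method.

```python
import pandas as pd

# Hypothetical baskets in the same shape as df_grouped (order_id, products).
baskets = pd.DataFrame({
    "order_id": ["A", "B", "C"],
    "products": [["Tee", "Flats"], ["Tee", "Dress"], ["Dress", "Skirt", "Tee"]],
})

# One row per (order, product) pair.
long_form = baskets.explode("products")

# Count each (order, product) pair, then convert the counts to booleans.
# Rows are orders, columns are products, True means the order has the product.
encoded = pd.crosstab(long_form["order_id"], long_form["products"]) > 0
print(encoded)
```

The result is a boolean DataFrame with one row per order and one column per product, which is the input shape that `apriori()` from `mlxtend` accepts. Because the counts are compared with zero, repeated products within an order collapse to a single `True`.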

