### Getting Started

Using company data, this demo briefly covers the concepts and steps involved in pattern search/association rule analysis. Before running the notebook, please make sure the following packages are installed:

- pandas
- NumPy
- Matplotlib
- mlxtend
- pyECLAT

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# import the dataset

df = pd.read_csv("C:/")
df.head()

In [None]:
# get some general info on dataset
df.info()

### Exploratory Data Analysis

In [None]:
# Check the summary stats to better understand data
df.describe(include='all')

In [None]:
# Count null/blank cells across the whole dataset
df.isnull().values.sum()

In [None]:
# Which product has the most shipments?
col_sum = df.sum()
highest_col_sum = col_sum.idxmax()
highest_sum=col_sum[highest_col_sum]

print("Most shipped product:", highest_col_sum)
print("Amount shipped:", highest_sum)

In [None]:
# Bar plot
plt.figure(figsize=(10,10))
plt.bar(col_sum.index,col_sum.values)
plt.xlabel('Products')
plt.ylabel('Count of obs')
plt.title('Total Count')

plt.show()

### Preprocessing/Formatting Dataset

Before calculating any metrics, we need to format the dataset so that the algorithms can use it. The steps below add a transaction ID as the index and convert the values to True/False.

In [None]:
df1 = df.copy() # make a copy of the dataset
df1.head()

In [None]:
# Create a transaction ID column and use it as the index of the DF
df1 = df1.assign(Transaction_ID = range(1, len(df1) + 1))
df1 = df1.set_index("Transaction_ID")

# The values need to be True/False instead of 0 or 1 (needed for the algorithms later)

df1 = df1.fillna(0).astype(bool)
df1.head()

### Evaluation Metrics & Calculations

The metrics used to analyze pattern search/association rules are listed below. Each one measures how interesting or important a given rule is:

 - Support: The fraction of transactions in the dataset that contain every item in the rule
 - Confidence: The conditional probability of the consequent (then) given the antecedent (if), telling us how reliable the rule is
 - Lift: Measures the strength of association between the antecedent and consequent as the ratio of the observed support to the support expected if the items were independent
 - Leverage: The difference between the observed support and the support expected if the items were independent, indicating how much more (or less) often the antecedent and consequent co-occur than chance would predict
 - Conviction: Measures the degree of dependency between the antecedent and consequent by comparing how often the rule would be wrong if the items were independent with how often it is actually wrong
 
Note that support and confidence are the most commonly used metrics for evaluating rules. Lift is a good metric for deciding whether a rule should be pruned (removed): a lift close to or below 1 means the items co-occur no more often than expected under independence.

For more info, see the Wikipedia page on association rule learning.
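
As a rough illustration of the definitions above, the sketch below computes all five metrics for a single rule on a tiny made-up set of transactions. The data, the `support` helper, and the `rule_metrics` helper are hypothetical and exist only for this example; they are not part of the company dataset or of any library.

```python
# Toy transactions (hypothetical data, not the company dataset).
transactions = [
    {"p1", "p2"},
    {"p1", "p2", "p3"},
    {"p1"},
    {"p2"},
    {"p3"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return the five metrics for the rule antecedent -> consequent."""
    s_a = support(antecedent)
    s_c = support(consequent)
    s_ac = support(set(antecedent) | set(consequent))
    confidence = s_ac / s_a
    return {
        "support": s_ac,
        "confidence": confidence,
        "lift": s_ac / (s_a * s_c),
        "leverage": s_ac - s_a * s_c,
        # Conviction is infinite when the rule is never wrong (confidence == 1).
        "conviction": float("inf") if confidence == 1 else (1 - s_c) / (1 - confidence),
    }

metrics = rule_metrics({"p1"}, {"p2"})
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here p1 and p2 each appear in 3 of 5 transactions and together in 2 of 5, so the lift is slightly above 1 and the leverage is slightly above 0: a weak positive association.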

#### Support

In [None]:
# Calculating support for individual products
support = df1.mean().sort_values(ascending = False)
support.head(10) # give me the top 10 products with highest support

In [None]:
# Can also calculate support for bundles

df_bundles = df1.copy()
df_bundles['p1 and p2'] = np.logical_and(df_bundles['p1'],df_bundles['p2'])
df_bundles['p1 and p3'] = np.logical_and(df_bundles['p1'],df_bundles['p3'])
df_bundles['p2 and p4'] = np.logical_and(df_bundles['p2'],df_bundles['p4'])

# show the support for the new bundles
new_bundles = ['p1 and p2', 'p1 and p3', 'p2 and p4']
support = df_bundles.mean().sort_values(ascending = False) # support for all products and bundles

print(support[new_bundles].head())


#### Confidence

In [None]:
# confidence of the rule p1 -> p2: support(p1 and p2) / support(p1)
print(support['p1 and p2']/support['p1'])

# for p1 and p3
print(support['p1 and p3']/support['p1'])

# for p2 and p4
print(support['p2 and p4']/support['p2'])

#### Lift

In [None]:
# lift of p1 and p2
print(support['p1 and p2']/(support['p1']*support['p2']))

# for p1 and p3
print(support['p1 and p3']/(support['p1']*support['p3']))

# for p2 and p4
print(support['p2 and p4']/(support['p2']*support['p4']))

#### Leverage

In [None]:
# leverage of p1 and p2
print(support['p1 and p2'] - (support['p1']*support['p2']))

# for p1 and p3
print(support['p1 and p3'] - (support['p1']*support['p3']))

# for p2 and p4
print(support['p2 and p4'] - (support['p2']*support['p4']))

#### Conviction

In [None]:
# Conviction of the rule p1 -> p2: (1 - support(p2)) / (1 - confidence(p1 -> p2))
print(support['p1']*(1-support['p2']) / (support['p1'] - support['p1 and p2']))

#for p1 and p3
print(support['p1']*(1-support['p3']) / (support['p1'] - support['p1 and p3']))

#for p2 and p4
print(support['p2']*(1-support['p4']) / (support['p2'] - support['p2 and p4']))

### Algorithms

The section below goes over examples of the different algorithms used in pattern search. Each one approaches the dataset differently. A brief explanation of each:

 - Apriori: Builds frequent itemsets level by level, starting from single items and only extending itemsets that meet the support threshold, then generates rules from them using a confidence threshold.
 - ECLAT: Uses a depth-first search strategy to discover frequent itemsets by exploiting the vertical data format (each item is stored with the set of transactions that contain it).
 - F-P Growth: Finds the same frequent itemsets as Apriori, but avoids generating candidates by compressing the transactions into a structure called an F-P Tree and mining that tree.
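
To make the "vertical data format" idea more concrete, here is a minimal pure-Python sketch of an ECLAT-style search. It is an illustration under simplifying assumptions, not the pyECLAT implementation: the toy `transactions`, the `to_vertical` helper, and the `eclat` function are all hypothetical names made up for this example.

```python
# Hypothetical toy transactions keyed by transaction ID.
transactions = {
    1: {"p1", "p2"},
    2: {"p1", "p2", "p3"},
    3: {"p1", "p3"},
    4: {"p2"},
}

def to_vertical(transactions):
    """Map each item to the set of transaction IDs that contain it."""
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def eclat(prefix, candidates, min_count, found):
    """Depth-first search: extend `prefix` with each candidate item in turn."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < min_count:
            continue  # infrequent, so no superset can be frequent either
        itemset = prefix + (item,)
        found[itemset] = len(tids)
        # Intersecting TID sets gives the transactions shared by the longer itemset.
        extensions = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
        eclat(itemset, extensions, min_count, found)
    return found

vertical = sorted(to_vertical(transactions).items())
frequent = eclat((), vertical, min_count=2, found={})
for itemset, count in frequent.items():
    print(itemset, count / len(transactions))
```

The support of a longer itemset comes from a set intersection rather than a fresh scan of the transactions, which is the main appeal of the vertical layout.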

#### Apriori

In [None]:
# import the packages needed

from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
# Applying the apriori method
# (store the result under a new name so the imported apriori function is not overwritten)
frequent_itemsets = apriori(df_bundles, min_support = 0.01, use_colnames=True)

In [None]:
# Make rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.sort_values(by = 'support', ascending = False)

# INSERT SOME INFO ON THE RESULTS ABOVE

#### ECLAT

In [None]:
# import needed packages
from pyECLAT import ECLAT

In [None]:
# To use this, must reformat the df so that it fits specific criteria
# you can get more info on this by creating a cell containing "help(ECLAT)" and running it

# will use our original df to start
df2 = df.copy()

product_names = list(df2.columns[1:]) #creating list of product names
new_df = pd.DataFrame(columns = product_names) #create new df using cols above

for column in df1.columns:
    transaction_data = [] # one entry per transaction (row) for this product
    
    for index,value in df1[column].items():
        if value == 1:
            transaction_data.append(column)
        else:
            transaction_data.append('')
    
    new_df[column] = transaction_data
    
new_df.reset_index(drop=True, inplace=True) #reset index

new_df.columns = range(len(new_df.columns)) # change col names from product name to numbers instead

new_df = new_df.replace('', np.nan) #replace blanks with NaNs instead

#This is the new df
new_df


In [None]:
# Create the ECLAT algorithm
eclat = ECLAT(data = new_df, verbose = True)

In [None]:
# Fit the algorithm on our data
rule_indices, rule_supports = eclat.fit(min_support = 0.03, min_combination = 2, max_combination = 4)

# Limiting this to itemsets of 2 to 4 products with support >= 0.03; otherwise this takes a very long time
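
To give a sense of why capping the itemset size matters, the sketch below counts how many distinct itemsets exist for a given number of products. The catalogue size of 50 is a hypothetical figure chosen for illustration, not taken from the dataset, and `candidate_itemsets` is a helper made up for this example.

```python
from math import comb

def candidate_itemsets(n_items, min_size, max_size):
    """Number of distinct itemsets with between min_size and max_size items."""
    return sum(comb(n_items, k) for k in range(min_size, max_size + 1))

# n_items=50 is a hypothetical catalogue size, not the real product count.
for max_size in (2, 3, 4, 5):
    print(max_size, candidate_itemsets(50, 2, max_size))
```

Each extra item allowed in a combination multiplies the number of possible itemsets, so a small increase in the maximum size can add a lot of work.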

In [None]:
# Get the rules

eclat_result = pd.DataFrame(rule_supports.items(), columns=['Item','Support'])
eclat_result.sort_values(by=['Support'], ascending=False)

#### F-P Growth

In [None]:
# Creating F-P growth method
from mlxtend.frequent_patterns import fpmax, fpgrowth, association_rules

# store the result under a new name so the imported fpgrowth function is not overwritten
frequent_itemsets_fp = fpgrowth(df_bundles, min_support = 0.01, use_colnames = True)
#frequent_itemsets_fp.sort_values(by = 'support', ascending=False).head(10)

In [None]:
# Make rules
rulesfp = association_rules(frequent_itemsets_fp, metric="lift", min_threshold = 0.5)
rulesfp = rulesfp.sort_values(['confidence','lift'], ascending = [False, False])
rulesfp

# put info on interpreting the results