<a href="https://colab.research.google.com/github/OpenCV13/dm/blob/main/module_aa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module AA: Association Analysis

An example from this link.

https://www.kaggle.com/code/sangwookchn/association-rule-learning-with-scikit-learn/notebook

In [63]:
# from google.colab import drive
# drive.mount('/content/drive')

In [64]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [65]:
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None) #To make sure the first row is not thought of as the heading
print(dataset.shape)
dataset.head()

(7501, 20)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [66]:
# Convert the DataFrame into a list of lists so that each transaction can be indexed more easily.
# Note: str() turns every empty cell (NaN) into the literal string 'nan'.
transactions = []
for i in range(0, dataset.shape[0]):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

print(transactions[0])

['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil']
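
Transaction 0 happens to fill all 20 columns, but shorter baskets are padded with NaN, and the loop above turns those cells into the string `'nan'`, which the algorithm then treats like any other item. Below is a minimal sketch of one way to drop the missing cells while building the list of lists. The helper name `drop_missing` and the two sample rows are assumptions made for illustration (the sample rows mimic rows 1 and 2 of the table above, shortened to four columns).

```python
import math

def drop_missing(rows):
    """Return one list of item strings per row, skipping NaN/None cells."""
    cleaned = []
    for row in rows:
        items = [
            str(cell)
            for cell in row
            # pandas represents empty CSV cells as float NaN
            if cell is not None
            and not (isinstance(cell, float) and math.isnan(cell))
        ]
        cleaned.append(items)
    return cleaned

nan = float("nan")
# Hypothetical stand-in for dataset.values.tolist()
sample_rows = [
    ["burgers", "meatballs", "eggs", nan],
    ["chutney", nan, nan, nan],
]
sample_transactions = drop_missing(sample_rows)
print(sample_transactions)
```

With the notebook's data, the equivalent call would be `transactions = drop_missing(dataset.values.tolist())`.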


In [67]:
!pip install apyori



In [68]:
from apyori import apriori
# apyori is a third-party package; the `!pip install apyori` cell above installs it.

rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
# Support: number of transactions containing a set of items / total number of transactions
#          --> products that are bought at least 3 times a day, i.e. 3 * 7 = 21 times over a week
#          --> 21 / 7501 is about 0.0028, rounded to 0.003
# Confidence: should not be too high, as this will lead to obvious rules

# Try many combinations of values to experiment with the model.

# Viewing the rules
results = list(rules)

In [69]:
#Transferring the list to a table

results = pd.DataFrame(results)
results
results.to_csv('results.csv')

"The first item in the list is a list itself containing three items. The first item of the list shows the grocery items in the rule.

For instance from the first item, we can see that light cream and chicken are commonly bought together. This makes sense since people who purchase light cream are careful about what they eat hence they are more likely to buy chicken i.e. white meat instead of red meat i.e. beef. Or this could mean that light cream is commonly used in recipes for chicken.

The support value for the first rule is 0.0045. This number is calculated by dividing the number of transactions containing light cream [and chicken] by [the] total number of transactions. The confidence level for the rule is 0.2905 which shows that out of all the transactions that contain light cream, 29.05% of the transactions also contain chicken. Finally, the lift of 4.84 tells us that chicken is 4.84 times more likely to be bought by the customers who buy light cream compared to the default likelihood of the sale of chicken."

From https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
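
To make the three measures in the quoted explanation concrete, here is a minimal sketch that computes them by hand. Note that the support of a rule counts the transactions that contain both sides of the rule. The function names and the eight toy baskets are assumptions for illustration only; they are not taken from the grocery dataset and will not reproduce the figures quoted above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, base, add):
    """Share of the transactions containing `base` that also contain `add`."""
    both = set(base) | set(add)
    return support(transactions, both) / support(transactions, base)

def lift(transactions, base, add):
    """Confidence of base -> add relative to the overall support of `add`."""
    return confidence(transactions, base, add) / support(transactions, add)

# Toy baskets (hypothetical data)
toy = [
    ["light cream", "chicken"],
    ["light cream", "chicken", "eggs"],
    ["light cream", "eggs"],
    ["chicken"],
    ["eggs"],
    ["eggs", "milk"],
    ["milk"],
    ["chicken", "milk"],
]

s = support(toy, ["light cream", "chicken"])       # 2 of 8 baskets
c = confidence(toy, ["light cream"], ["chicken"])  # 2 of the 3 light-cream baskets
l = lift(toy, ["light cream"], ["chicken"])        # (2/3) / (4/8)
print(round(s, 4), round(c, 4), round(l, 4))
```

A lift above 1 means the two items appear together more often than they would if they were bought independently; a lift of exactly 1 means no association.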

## In-Class: Association Analysis ##
Customer Computer Configuration

In [70]:
pc_purchase = pd.read_csv('PC-Purchase-Data.csv') # The first row holds the option names, so the default header is kept
print(pc_purchase.shape)
pc_purchase.head()

(67, 12)


Unnamed: 0,Intel Core i3,Intel Core i5,Intel Core i7,10 inch screen,12 inch screen,15 inch screen,2 GB,4 GB,8 GB,320 GB,500 GB,750 GB
0,0,1,0,0,1,0,0,1,0,0,1,0
1,0,1,0,0,0,1,0,0,1,0,0,1
2,0,1,0,0,1,0,0,1,0,1,0,0
3,1,0,0,0,1,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,1,0,0,1


The data represent the configurations for a small number of orders of laptops placed over the web. The main options from which customers can choose are the type of processor, screen size, memory, and hard drive. A '1' signifies that a customer selected a particular option. If the manufacturer can better understand what types of components are often ordered together, it can speed up final assembly by having partially completed laptops with the most popular combinations of components configured before ordering. Your task is to find the most popular configurations.

Reflection:
Use support, confidence, and lift correctly to explain your findings.


In [71]:
columns = pc_purchase.columns

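# Replace each 1 with the column's option name and each 0 with NaN.
# Assigning the result back avoids chained in-place replacement, which newer pandas versions warn about.
for column_label in columns:
    pc_purchase[column_label] = pc_purchase[column_label].replace({1: column_label, 0: np.nan})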


In [72]:
# Convert the DataFrame into a list of lists so that each transaction can be indexed more easily.
# Note: str() turns every unselected option (NaN) into the literal string 'nan'.
transactions_pc = []
for i in range(0, pc_purchase.shape[0]):
    transactions_pc.append([str(pc_purchase.values[i, j]) for j in range(0, pc_purchase.shape[1])])

print(transactions_pc[1])

['nan', 'Intel Core i5', 'nan', 'nan', 'nan', '15 inch screen', 'nan', 'nan', '8 GB', 'nan', 'nan', '750 GB']
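
Because every order keeps the `'nan'` placeholders, `'nan'` shows up as an item in every transaction. A minimal sketch of an alternative that goes straight from the 0/1 flags to lists of selected options is given below. The helper name `onehot_to_transactions` is an assumption, and the two sample orders are rows 0 and 1 of the table shown earlier. The `strip()` call is also an assumption: it is there because at least one column label appears with a trailing space (`'2 GB '`) in the results further down.

```python
def onehot_to_transactions(records):
    """Turn 0/1 option flags into lists of the selected option names."""
    return [
        [name.strip() for name, flag in record.items() if flag == 1]
        for record in records
    ]

columns = [
    "Intel Core i3", "Intel Core i5", "Intel Core i7",
    "10 inch screen", "12 inch screen", "15 inch screen",
    "2 GB", "4 GB", "8 GB",
    "320 GB", "500 GB", "750 GB",
]
# Rows 0 and 1 of the PC purchase table
flags = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
]
sample_records = [dict(zip(columns, row)) for row in flags]
sample_orders = onehot_to_transactions(sample_records)
print(sample_orders[1])
```

On the real data this would be called as `onehot_to_transactions(pc_purchase.to_dict("records"))`, applied to the original 0/1 frame, that is, before the replacement cell above has run.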


In [73]:
rules_pc = apriori(transactions_pc, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
results_pc = list(rules_pc)
results_pc_df = pd.DataFrame(results_pc)
# results_pc_df.to_csv('results_pc_df.csv')
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    display(results_pc_df.head(10))
results_pc_df['ordered_statistics'][0]

Unnamed: 0,items,support,ordered_statistics
0,"(8 GB, 10 inch screen, 750 GB)",0.014925,"[((10 inch screen, 750 GB), (8 GB), 1.0, 5.153..."
1,"(Intel Core i3, 10 inch screen, 750 GB)",0.014925,"[((10 inch screen, 750 GB), (Intel Core i3), 1..."
2,"(8 GB, Intel Core i3, 10 inch screen)",0.014925,"[((8 GB, 10 inch screen), (Intel Core i3), 1.0..."
3,"(320 GB, Intel Core i3, 15 inch screen)",0.029851,"[((Intel Core i3, 15 inch screen), (320 GB), 1..."
4,"(8 GB, 750 GB, 15 inch screen)",0.074627,"[((750 GB), (8 GB, 15 inch screen), 0.29411764..."
5,"(Intel Core i7, 750 GB, 15 inch screen)",0.074627,"[((750 GB), (Intel Core i7, 15 inch screen), 0..."
6,"(Intel Core i7, 8 GB, 15 inch screen)",0.059701,"[((8 GB), (Intel Core i7, 15 inch screen), 0.3..."
7,"(320 GB, 4 GB, 2 GB )",0.014925,"[((4 GB, 2 GB ), (320 GB), 1.0, 3.526315789473..."
8,"(Intel Core i3, 4 GB, 2 GB )",0.014925,"[((4 GB, 2 GB ), (Intel Core i3), 1.0, 3.04545..."
9,"(Intel Core i7, 4 GB, 750 GB)",0.059701,"[((Intel Core i7), (4 GB, 750 GB), 0.333333333..."


[OrderedStatistic(items_base=frozenset({'10 inch screen', '750 GB'}), items_add=frozenset({'8 GB'}), confidence=1.0, lift=5.153846153846153),
 OrderedStatistic(items_base=frozenset({'8 GB', '10 inch screen'}), items_add=frozenset({'750 GB'}), confidence=1.0, lift=3.941176470588235)]

The first four itemsets all have rules with a confidence of 1.0, but their supports are only 0.014925, 0.014925, 0.014925, and 0.029851 respectively, which corresponds to 1, 1, 1, and 2 of the 67 orders. All four lifts are above 3 (the minimum lift required by the search), which indicates a strong positive association. However, these four itemsets occur rarely compared with the itemsets that have the highest support.

The itemsets with the highest support are {'Intel Core i7', '750 GB', '15 inch screen'} and {'8 GB', '750 GB', '15 inch screen'}, each with a support of 0.074627 (5 of the 67 orders). Support is the number of transactions containing the itemset divided by the total number of transactions, so every rule derived from one of these itemsets shares that support of 0.074627. Their rules had confidences of 0.8333 and 1.0 respectively, which indicates that the rules are reliable. Looking at the lift provides further insight into how interesting the rules are. The lifts of these rules are 3.2843 and 3.9411 respectively. Since the lifts are greater than 1, this indicates a positive association.

My ordered_statistics column had multiple values in each row. I think it calculated the statistics after adding the items to the itemset. Please let me know if this was a correct assumption.
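
On the question of multiple values per row: judging from the `OrderedStatistic` output above, each entry in `ordered_statistics` appears to be one rule of the form `items_base -> items_add` derived from the same itemset, with its own confidence and lift. The sketch below flattens that structure into one row per rule and converts support back into an order count. The two namedtuples are stand-ins that mirror the field names visible in the output, so the example runs without apyori installed; `flatten_rules` is a hypothetical helper, and the sample record copies the first row of the results.

```python
from collections import namedtuple

# Stand-ins mirroring the field names shown in the notebook output
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)

def flatten_rules(records, n_transactions):
    """Return one dict per rule, sorted by lift (highest first)."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "base": sorted(stat.items_base),
                "add": sorted(stat.items_add),
                "support": record.support,
                # Support is a fraction; multiply back to get the order count
                "count": round(record.support * n_transactions),
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return sorted(rows, key=lambda r: r["lift"], reverse=True)

sample = [
    RelationRecord(
        items=frozenset({"8 GB", "10 inch screen", "750 GB"}),
        support=0.014925,
        ordered_statistics=[
            OrderedStatistic(
                frozenset({"10 inch screen", "750 GB"}),
                frozenset({"8 GB"}), 1.0, 5.153846153846153,
            ),
            OrderedStatistic(
                frozenset({"8 GB", "10 inch screen"}),
                frozenset({"750 GB"}), 1.0, 3.941176470588235,
            ),
        ],
    )
]

flat_rules = flatten_rules(sample, n_transactions=67)
for rule in flat_rules:
    print(rule["base"], "->", rule["add"], rule["count"], round(rule["lift"], 2))
```

In the notebook, `flatten_rules(results_pc, n_transactions=pc_purchase.shape[0])` followed by `pd.DataFrame(...)` would give a table with one rule per row, which makes it easier to compare confidence and lift across rules.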