# Fall 2024 Data Science Track: Week 5 - Unsupervised Learning

## Packages, Packages, Packages!

Import *all* the things here! You need: `matplotlib`, `networkx`, `numpy`, and `pandas`―and also `ast.literal_eval` to correctly deserialize two columns in the `rules.tsv.xz` file.

If you got more stuff you want to use, add them here too. 🙂

In [325]:
# Import stuff.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import ast 


# Instacart Association Rules

## Introduction

With the packages out of the way, now you will be working with the Instacart association rules data set, mined from the [Instacart Market Basket Analysis data set](https://www.kaggle.com/c/instacart-market-basket-analysis/data) on Kaggle. [The script](https://github.com/LiKenun/shopping-assistant/blob/main/api/preprocess_instacart_market_basket_analysis_data.py) that does it and the instructions to run it can be found in my [Shopping Assistant Project](https://github.com/LiKenun/shopping-assistant) repository.

## Load the Data

This code has already been pre-written, simply because there are a few quirks which require converters to ensure the correct deserialization of some columns.

In [326]:
rules_data_path = 'rules.tsv'

df_rules = pd.read_csv(rules_data_path,
                       sep='\t',
                       quoting=3,
                       converters={
                           'consequent_item': ast.literal_eval,
                           'antecedent_items': ast.literal_eval
                       },
                       low_memory=True)

Note to self

ast. literal_eval() can only evaluate Python literals (such as strings, numbers, tuples, lists, dicts, booleans, and None)

But just *how* many rules were just loaded‽

In [327]:
# Show the list of column names and the number of rules.


print("The numnber of rules in the given df is: ",len(df_rules))

df_rules.head()

The numnber of rules in the given df is:  1048575


Unnamed: 0,consequent_item,transaction_count,item_set_count,antecedent_count,consequent_count,antecedent_items
0,Total 2% with Raspberry Pomegranate Lowfat Gre...,3346083,101,123,128,"[Fat Free Blueberry Yogurt, Pineapple Yogurt 2..."
1,Total 2% Lowfat Greek Strained Yogurt With Blu...,3346083,101,128,123,"[Fat Free Strawberry Yogurt, Total 0% Raspberr..."
2,Total 0% with Honey Nonfat Greek Strained Yogurt,3346083,101,123,128,"[Fat Free Blueberry Yogurt, Pineapple Yogurt 2..."
3,Total 0% Raspberry Yogurt,3346083,101,123,128,"[Fat Free Blueberry Yogurt, Pineapple Yogurt 2..."
4,Pineapple Yogurt 2%,3346083,101,128,123,"[Fat Free Strawberry Yogurt, Total 0% Raspberr..."


## Metrics

Compute the support, confidence, and lift of each rule.

* The rule’s *support* tells you how frequently the set of items appears in the dataset. It’s important to prune infrequent sets from further consideration.
    * The simple definition: $$P(A \cap B)$$
    * `= item_set_count / transaction_count`
* The rule’s *confidence* tells you how often a the rule is true. Divide the support for the set of items by the support for just the antecedents. Rules which are not true very often are also pruned.
    * The simple definition: $$\frac{P(A \cap B)}{P(A)}$$
    * `= item_set_count / transaction_count / (antecedent_count / transaction_count)`
    * `= item_set_count / antecedent_count`
* The rule’s *lift* tells you how much more likely the consequent is, given the antecedents, compared to its baseline probability. Divide the support for the set of items by both the support of the antecedents and consequent. Equivalently, divide the confidence by the support of the consequent.
    * The simple definition: $$\frac{P(A \cap B)}{P(A) \cdot P(B)}$$
    * `= item_set_count / transaction_count / (antecedent_count / transaction_count * (consequent_count / transaction_count))`
    * `= item_set_count / antecedent_count / (consequent_count / transaction_count)`
    * `= item_set_count * transaction_count / (antecedent_count * consequent_count)`

In [328]:
# Add new columns support, confidence, and lift to df_rules. And show the first 50 rules.

#df_rules = # Something goes here.

df_rules['support'] = df_rules['item_set_count'] / df_rules['transaction_count']
df_rules['confidence'] = df_rules['item_set_count'] / df_rules['transaction_count']
df_rules['lift'] = (df_rules['item_set_count'] * df_rules['transaction_count']) / ((df_rules['antecedent_count'] * df_rules['consequent_count']))
print ( len (df_rules))
df_rules.head(1000)

1048575


Unnamed: 0,consequent_item,transaction_count,item_set_count,antecedent_count,consequent_count,antecedent_items,support,confidence,lift
0,Total 2% with Raspberry Pomegranate Lowfat Gre...,3346083,101,123,128,"[Fat Free Blueberry Yogurt, Pineapple Yogurt 2...",0.00003,0.00003,21465.598514
1,Total 2% Lowfat Greek Strained Yogurt With Blu...,3346083,101,128,123,"[Fat Free Strawberry Yogurt, Total 0% Raspberr...",0.00003,0.00003,21465.598514
2,Total 0% with Honey Nonfat Greek Strained Yogurt,3346083,101,123,128,"[Fat Free Blueberry Yogurt, Pineapple Yogurt 2...",0.00003,0.00003,21465.598514
3,Total 0% Raspberry Yogurt,3346083,101,123,128,"[Fat Free Blueberry Yogurt, Pineapple Yogurt 2...",0.00003,0.00003,21465.598514
4,Pineapple Yogurt 2%,3346083,101,128,123,"[Fat Free Strawberry Yogurt, Total 0% Raspberr...",0.00003,0.00003,21465.598514
...,...,...,...,...,...,...,...,...,...
995,Total 2% with Strawberry Lowfat Greek Strained...,3346083,100,147,145,"[Blackberry Yogurt, Fat Free Strawberry Yogurt...",0.00003,0.00003,15698.254750
996,Total 2% Lowfat Greek Strained Yogurt with Peach,3346083,100,147,145,"[Blackberry Yogurt, Fat Free Strawberry Yogurt...",0.00003,0.00003,15698.254750
997,Total 2% Greek Strained Yogurt with Cherry 5.3 oz,3346083,100,145,147,"[Pineapple Yogurt 2%, Total 0% Raspberry Yogur...",0.00003,0.00003,15698.254750
998,Total 2% All Natural Greek Strained Yogurt wit...,3346083,100,145,147,"[Pineapple Yogurt 2%, Total 0% Raspberry Yogur...",0.00003,0.00003,15698.254750


The yogurts have got some insane lift (*over 9,000*). Why do you think that might be?

*(Write your answer here.)*

This could be the case, because people tend buy multiple flavors of yogurt. Thus, a given consequent (yogurt purchase) -> has a high probability of leading to an additional antecedent yogurt purchase. 

In [191]:
# Query the rule set if you have to to find out more.



## Network Visualization for Consequents with Single Antecedents

Let’s now visualize a small subset of 1,000,000+ rules. First, filter the rule set for the following to whittle it down to something more manageable:

1. The rule must have exactly `1` antecedent item. (There should be 38,684 such rules.)
2. The lift must be between `5` and `20`. (There should be 1,596 such rules, including the prior criterion.)
3. Either the antecedent or consequent of the rule must contain `'Hummus'`, but not both. (This should get you down to 26 rules.)
    * Convert the antecedents `list`-typed column to a `str`-typed column (`antecedent_item`) since there will only be a single antecedent in the subset.
    * Replace any item containing `'Hummus'` to just `'Hummus'`. This will make the visualization more readable later.

Hint: your code may run more efficiently if you re-order certain processing steps.

Assign the subset to `df_rules_subset`.

In [285]:
# Define df_rules_subset.

df_count = 0

list_to_filter = ['Hummus']

df_rules_subset = df_rules

df_rules_subset_condition_1 = df_rules_subset['antecedent_items'].str.len() == 1

df_rules_subset = df_rules_subset[df_rules_subset_condition_1]

df_rules_subset

print( 'The count of the following df is now',len (df_rules_subset))


df_rules_subset_condition_2 = df_rules_subset['lift'] > 5
df_rules_subset_condition_3 = df_rules_subset['lift'] < 20


df_rules_subset = df_rules_subset[df_rules_subset_condition_2 & df_rules_subset_condition_3]
print( 'The count of the following df is now',len (df_rules_subset))

df_rules_subset['antecedent_items_new'] = [','.join(map(str, l)) for l in df_rules_subset['antecedent_items']]


hummus_truth_value_ant = None
hummus_truth_value_cons = None
df_count = 0
df_rules_subset = df_rules_subset.reset_index(drop=True)



#iterates through a given df to reduce to list where consequent_item and antecedent_items_new meet
#the desired hummus conditions

for x in df_rules_subset['antecedent_items_new']:
        if "Hummus" in x:
            hummus_truth_value_ant = True
        else:
            hummus_truth_value_ant = False
        
        
        if "Hummus" in df_rules_subset['consequent_item'][df_count]:
            hummus_truth_value_cons = True
        else:
            hummus_truth_value_cons = False
            
            
            
        
        if (hummus_truth_value_ant == True) and (hummus_truth_value_cons == True):
            df_rules_subset = df_rules_subset.drop(index = df_count)
            df_count += 1
    
               
        elif (hummus_truth_value_ant == False) and (hummus_truth_value_cons == False):
            df_rules_subset = df_rules_subset.drop(index = df_count)
            df_count += 1
            
                
        else:
            df_count += 1
            print(df_count)

    

print( 'The count of the following df is now',len (df_rules_subset))


#removes the old antecedent_items column and replaces its values with non list values from antecedent_items_new
df_rules_subset =df_rules_subset.drop(axis='column', columns = 'antecedent_items')
df_rules_subset = df_rules_subset.rename(columns={'antecedent_items_new': 'antecedent_items'})
df_rules_subset = df_rules_subset.reset_index(drop=True)
df_count = 0




#iterates through a given df, to reduce values to the word Hummus, if found in a given
#consequent_item and antecedent_items_new value

print(df_rules_subset['antecedent_items'][df_count])

for x in df_rules_subset['antecedent_items']:
        if "Hummus" in x:
            df_rules_subset['antecedent_items'][df_count] = 'Hummus'

        
        if "Hummus" in df_rules_subset['consequent_item'][df_count]:
            df_rules_subset['consequent_item'][df_count] = 'Hummus'
     
     
        df_count += 1
        print(df_count)
        
df_rules_subset

The count of the following df is now 38684
The count of the following df is now 1596
92
288
647
671
701
796
809
905
911
978
1123
1193
1215
1260
1350
1356
1372
1394
1464
1482
1483
1492
1495
1501
1586
The count of the following df is now 25
Roasted Red Pepper Hummus With Chips
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,consequent_item,transaction_count,item_set_count,antecedent_count,consequent_count,support,confidence,lift,antecedent_items
0,Total 2% Lowfat Greek Strained Yogurt With Blu...,3346083,100,887,21405,3e-05,3e-05,17.623731,Hummus
1,Clementines,3346083,174,1334,32194,5.2e-05,5.2e-05,13.556738,Hummus
2,Hummus,3346083,345,1656,74172,0.000103,0.000103,9.398434,Organic White Pita Bread
3,Hummus,3346083,293,1434,74172,8.8e-05,8.8e-05,9.217543,Organic Whole Wheat Pita
4,Hummus,3346083,331,1655,74172,9.9e-05,9.9e-05,9.022496,Mini Whole Wheat Pita Bread
5,Hummus,3346083,3119,17333,74172,0.000932,0.000932,8.1178,Sea Salt Pita Chips
6,Organic Baby Carrots,3346083,258,1334,80493,7.7e-05,7.7e-05,8.039749,Hummus
7,Hummus,3346083,151,933,74172,4.5e-05,4.5e-05,7.301163,"Lentil Chips, Himalayan Pink Salt"
8,Hummus,3346083,311,1929,74172,9.3e-05,9.3e-05,7.273189,Garbanzo Beans No Salt Added
9,Hummus,3346083,512,3333,74172,0.000153,0.000153,6.92997,Organic Whole Peeled Baby Carrots


In [286]:

print ('number of rows is:',len(df_rules_subset))



number of rows is: 25


Build a network `graph_rules_subset` from the association rules subset.

In [338]:
# Define graph_rules_subset, add the graph’s edges, and plot it. You may need a large figure size, smaller node size, and smaller font size.

#was un-able to complete this one :(

What can you tell about people who buy hummus?

*People that buy Hummus also tend to buy vegitables/starchy based products to eat with it*

## Make a Prediction

Given that the basket of items contains the following items, use the full set of association rules to predict the next 20 most likely items (consequents) that the person will add to the basket in descending order of lift:

* `'Orange Bell Pepper'`
* `'Organic Red Bell Pepper'`

Hint: a single item in the basket may be a better predictor of some consequents than both items considered together. You must consider both or either, but not neither.

In [340]:
basket = {'Orange Bell Pepper', 'Organic Red Bell Pepper'}

df_rules_new = df_rules



df_count = 0
df_rules_new = df_rules_new.reset_index(drop=True)
df_rules_new['antecedent_items_new'] = [','.join(map(str, l)) for l in df_rules_new['antecedent_items']]
df_rules_new = df_rules_new[df_rules_new['antecedent_items_new'].isin(basket)]


df_rules_new =df_rules_new.drop(axis='column', columns = 'antecedent_items')
df_rules_new = df_rules_new.rename(columns={'antecedent_items_new': 'antecedent_items'})
df_rules_new = df_rules_new.reset_index(drop=True)
df_rules_new = df_rules_new.sort_values(by=['lift'], ascending=False)

df_rules_new.head(20)


Unnamed: 0,consequent_item,transaction_count,item_set_count,antecedent_count,consequent_count,support,confidence,lift,antecedent_items
0,Yellow Bell Pepper,3346083,7520,41052,26625,0.002247,0.002247,23.021341,Orange Bell Pepper
1,Organic Bell Pepper,3346083,6024,59878,24331,0.0018,0.0018,13.835486,Organic Red Bell Pepper
2,Red Peppers,3346083,5529,41052,58185,0.001652,0.001652,7.745295,Orange Bell Pepper
3,Green Bell Pepper,3346083,7086,59878,58005,0.002118,0.002118,6.826611,Organic Red Bell Pepper
4,Green Bell Pepper,3346083,4144,41052,58005,0.001238,0.001238,5.823133,Orange Bell Pepper
5,Organic Cucumber,3346083,6480,59878,85005,0.001937,0.001937,4.259905,Organic Red Bell Pepper
6,Organic Yellow Onion,3346083,7919,59878,117716,0.002367,0.002367,3.759277,Organic Red Bell Pepper
7,Cucumber Kirby,3346083,4374,41052,99728,0.001307,0.001307,3.574901,Orange Bell Pepper
8,Organic Zucchini,3346083,6727,59878,109412,0.00201,0.00201,3.435784,Organic Red Bell Pepper
9,Organic Garlic,3346083,6222,59878,113936,0.001859,0.001859,3.051676,Organic Red Bell Pepper


## Bonus: Other Interesting Findings

Find and share something else interesting about these association rules. It can be a graph, table, or some other format that illustrates your point.

*MY FINDINGS*

Seems that after the filtering, it could be safe to say, tgat when people pick up a red bell pepper or organic bell pepper, they have a high likelyhood of grabbing another pepper, or perhaps another vegatable/fruit. 