<a href="https://colab.research.google.com/github/FionaNalianya/Market-basket-analysis/blob/main/Market_Basket_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practice Notebook: Market Basket Analysis

## Pre-requisites

In [1]:
# Import the required libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Examples

### Example 1

In [None]:
# Question
# ---
# Given the following one-hot encoded dataset from a Bakery,
# should the Bakery owner consider bundling toast and coffee together
# while selling them?
# ---
# Dataset URL (CSV) = https://bit.ly/3HByzqc
#

In [None]:
# Step 1: Loading and Performing Data processing
# ---
#
basket_df = pd.read_csv("https://bit.ly/3HByzqc")
basket_df.head()

Unnamed: 0,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,Bowl Nic Pitt,Bread,Bread Pudding,Brioche and salami,Brownie,Cake,Caramel bites,Cherry me Dried fruit,Chicken Stew,Chicken sand,Chimichurri Oil,Chocolates,Christmas common,Coffee,Coffee granules,Coke,Cookies,Crepes,Crisps,Drinking chocolate spoons,Duck egg,Dulce de Leche,Eggs,Ella's Kitchen Pouches,Empanadas,Extra Salami or Feta,Fairy Doors,Farm House,Focaccia,Frittata,...,Lemon and coconut,Medialuna,Mighty Protein,Mineral water,Mortimer,Muesli,Muffin,My-5 Fruit Shoot,Nomad bag,Olum & polenta,Panatone,Pastry,Pick and Mix Bowls,Pintxos,Polenta,Postcard,Raspberry shortbread sandwich,Raw bars,Salad,Sandwich,Scandinavian,Scone,Siblings,Smoothies,Soup,Spanish Brunch,Spread,Tacos/Fajita,Tartine,Tea,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Observation**

We can see that the dataset is in the form of a **one-hot encoded** pandas DataFrame. We can use the apriori function to generate the frequent itemsets.

In [None]:
# Step 2: Generating frequent itemsets
# ---
# We'll generate the frequent itemsets using the apriori() function,
# passing the following parameters:
# ---
# basket_df - our transactional dataset
# min_support = 0.01 - We set minimum-support threshold at 1%
# use_colnames = True to display the column names in our itemset columns.
# If you set use_colnames = False the itemsets will be shown in indices.
# ---
#
bs_frequent_itemsets = apriori(basket_df, min_support=0.01, use_colnames=True)
bs_frequent_itemsets.head()
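
To make the role of `min_support` more concrete, the following is a minimal plain-Python sketch of level-wise frequent itemset mining. It is an illustration of the idea only, not mlxtend's implementation; the `transactions` list and the `frequent_itemsets` helper are hypothetical and are not part of the bakery dataset.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the bakery dataset).
transactions = [
    {"Bread", "Coffee"},
    {"Coffee", "Toast"},
    {"Bread", "Coffee", "Toast"},
    {"Bread"},
    {"Coffee", "Toast"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    k = 1
    candidates = [frozenset([item]) for item in items]
    while candidates:
        # Keep only the candidates whose support reaches the threshold.
        level = {}
        for cand in candidates:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Build (k+1)-item candidates from the surviving items, and keep a
        # candidate only if all of its k-item subsets were frequent.
        k += 1
        survivors = sorted(set().union(*level))
        candidates = [
            frozenset(c) for c in combinations(survivors, k)
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result

itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in itemsets.items():
    print(sorted(itemset), round(support, 2))
```

Lowering `min_support` lets more itemsets through and increases the number of candidates that must be counted at each level.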

In [None]:
# Step 3: Generating association rules
# ---
# The final step is to generate the rules with their
# corresponding support, confidence and lift using the
# association_rules() function.
# We will set the minimum threshold for lift at 1
# and then sort the result by descending confidence value.
# Don't worry about the leverage and conviction metrics.
# You can consider them for your further reading
# ---
#
rules = association_rules(bs_frequent_itemsets, metric="lift", min_threshold=1)

# Sorting
rules.sort_values("confidence", ascending = False, inplace = True)

# Previewing the association rules
rules.head()

**Observation**

* The output above shows the top 5 rules sorted by confidence. All rules have a support above 1% and a lift of at least 1.

* The first rule, "if Toast then Coffee", has a support of 0.023666, meaning that about 2.4% of all transactions contain both Toast and Coffee.

* Its confidence is about 70%: roughly 70% of the transactions that contain Toast also contain Coffee.

* The lift of 1.47 (greater than 1) indicates that Toast and Coffee are bought together more often than would be expected if the two purchases were independent.

* Put differently, a customer who buys Toast is about 1.47 times as likely to buy Coffee as a customer picked at random.

* We can therefore conclude that there is evidence of a positive association between Toast and Coffee purchases. The bakery owner should consider bundling Toast and Coffee together as a Breakfast Set or Lunch Set. Staff could also be trained to cross-sell Coffee to customers who purchase Toast, since these customers are more likely to buy both, which could increase the store's revenue.
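
The three metrics discussed above can be computed by hand on a small example. The sketch below uses a hypothetical list of ten transactions (not the bakery data) and a hypothetical `support` helper to show how support, confidence and lift relate to one another for the rule "if Toast then Coffee".

```python
# Hypothetical toy transactions (not the bakery data).
transactions = [
    {"Toast", "Coffee"},
    {"Toast", "Coffee"},
    {"Toast", "Coffee"},
    {"Toast"},
    {"Coffee"},
    {"Coffee", "Bread"},
    {"Bread"},
    {"Bread"},
    {"Cake"},
    {"Cake", "Bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Toast"}, {"Coffee"}

rule_support = support(antecedent | consequent)  # share with Toast and Coffee
confidence = rule_support / support(antecedent)  # share of Toast baskets with Coffee
lift = confidence / support(consequent)          # confidence relative to Coffee's share

print(f"support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

On this toy data the rule has a support of 0.30, a confidence of 0.75 and a lift of 1.50: Coffee appears in half of all baskets, but in three quarters of the baskets that contain Toast.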


### Example 2

In [None]:
# Question: Using the following shop dataset determine whether there
# is a strong association between {chips, milk} and {juice}.
# ---
# Dataset URL (CSV) = https://bit.ly/3kJbqs8
# ---
#
shop_df = pd.read_csv('https://bit.ly/3kJbqs8')
shop_df.head()

In [None]:
# Step 1: Data processing
# ---
# We group the shop dataframe by transaction (TID)
# and item, and display the count of items
# ---
shop_df2 = shop_df.groupby(["TID","item"]).size().reset_index(name="Count")
shop_df2.head()

In [None]:
# Step 1: Data processing
# ---
# Then we consolidate the items into one transaction per row
# with each item one-hot encoded.
# ---
#
shop_df3 = (shop_df2.groupby(['TID', 'item'])['Count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('TID'))

shop_df3.head()

In [None]:
# Step 1: Data processing
# ---
# We then use our custom encoding function to convert
# all the values to 0 or 1.
# The Apriori algorithm will only take 0's or 1's.
# ---
#
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

shop_df4 = shop_df3.applymap(encode_units)

shop_df4.head()
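
As a possible alternative to the groupby, unstack and `encode_units` steps above, the long-format data can be one-hot encoded in a single step with `pd.crosstab`. This is a sketch on a hypothetical `long_df` that only borrows the `TID` and `item` column names from `shop_df`; it assumes an mlxtend version whose `apriori()` accepts boolean DataFrames.

```python
import pandas as pd

# Hypothetical long-format data with the same column names as shop_df.
long_df = pd.DataFrame({
    "TID": [1, 1, 2, 2, 2, 3],
    "item": ["chips", "milk", "chips", "juice", "juice", "milk"],
})

# crosstab counts each (TID, item) pair; comparing with 0 turns the
# counts into booleans, so repeated purchases collapse to True.
one_hot = pd.crosstab(long_df["TID"], long_df["item"]) > 0

print(one_hot)
```

Each row of `one_hot` is one transaction and each column one item, which is the shape that `apriori()` expects.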

In [None]:
# Step 2: We generate the frequent itemsets
shop_frequent_itemsets = apriori(shop_df4, min_support=0.2, use_colnames=True)
shop_frequent_itemsets.head()

In [None]:
# Step 3: Finding the association rules
shop_rules = association_rules(shop_frequent_itemsets, metric="lift", min_threshold=1)

# Sorting
shop_rules.sort_values("confidence", ascending = False, inplace = True)

# Previewing the association rules
shop_rules.head()

**Observation**

We can conclude that there is a strong association between {chips, milk} and {cookies, juice}, since the lift of this rule is 2.5.


### Example 3

In [None]:
# Example
# ---
#
# Given the following dataset, determine which products should be sold together.
# ---
# Dataset URL (CSV) = https://bit.ly/3HyoTwU
# ---
# NB:
# 1. Each row corresponds to a transaction and each column corresponds
#    to an item purchased in that specific transaction.
# 2. The NaN tells us that the item represented by the column was not
#    purchased in that specific transaction.
# ---
#

# Step 1: Loading and Data Processing
# ---
#
store_df = pd.read_csv("https://bit.ly/3HyoTwU")
store_df.head()

In [None]:
# Step 1: Data processing
# ----
# Our data processing techniques here will be a bit
# different from the previous example.
#

# ---
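# We will convert the pandas dataframe into a list of lists.
# Note: range(1, ...) starts at row index 1, so the first row of
# store_df (index 0) is not included; use range(store_df.shape[0])
# to include every row.
#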
records = []
for i in range(1, store_df.shape[0]):
    records.append([str(store_df.values[i, j]) for j in range(0,  store_df.shape[1])])

# Then later transform the list of lists into a one-hot encoded
# pandas DataFrame via TransactionEncoder().
# The resulting dataframe will be used for the generation of
# frequent itemsets using the apriori() function.
# ---
#
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()
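
Because the loop above converts every cell to a string, missing values end up as the item "nan". The sketch below shows one way to filter them out while building the records, followed by a plain-Python version of the one-hot encoding idea. It is an illustration only, not mlxtend's implementation; the `rows` data, the item names and the `is_missing` helper are hypothetical.

```python
import math

# Hypothetical rows as they might come out of a wide CSV: one transaction
# per row, padded with NaN where no item was bought.
nan = float("nan")
rows = [
    ["Shorts", "Undershirt", nan],
    ["Socks", nan, nan],
    ["Shorts", "Undershirt", "Socks"],
]

def is_missing(value):
    """True for float NaN, which is how pandas marks empty cells."""
    return isinstance(value, float) and math.isnan(value)

# Keep only the real items of each transaction.
records = [[item for item in row if not is_missing(item)] for row in rows]

# One column per distinct item (sorted), one boolean row per transaction.
columns = sorted({item for record in records for item in record})
one_hot = [[col in record for col in columns] for record in records]

print(columns)
for row in one_hot:
    print(row)
```

With the missing values removed up front, no "nan" column is created, so there is nothing to drop afterwards.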

In [None]:
# Step 1: We drop the "nan" column, which was created when the missing
# values (NaN) were converted to the string "nan"
# ---
#
df.drop('nan', inplace=True, axis=1)

In [None]:
# Step 2: We generate the frequent itemsets with a min_support of 0.2
# ---
#
df_frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

# Step 3: Then find the association rules
df_rules = association_rules(df_frequent_itemsets, metric="lift", min_threshold=1)

# We sort them
df_rules.sort_values("confidence", ascending = False, inplace = True)

# And preview them
df_rules.head()

**Observation**

Undershirts have a strong association with Shorts (lift = 3.166667), so the two should be sold together.

## <font color="green">Challenges</font>


### Challenge 1

In [None]:
# Question: Determine which items in the given groceries store dataset
# have a strong association relationship.
# ---
# Dataset URL (CSV) = https://bit.ly/3oEvE7t
#

In [None]:
# Step 1: Reading the dataset
groceries_df = pd.read_csv("https://bit.ly/3oEvE7t")
groceries_df.head()

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,berries,beverages,bottled beer,bottled water,brandy,brown bread,butter,butter milk,cake bar,candles,candy,canned beer,canned fish,canned fruit,canned vegetables,cat food,cereals,chewing gum,chicken,chocolate,chocolate marshmallow,citrus fruit,cleaner,cling film/bags,cocoa drinks,coffee,condensed milk,cooking chocolate,cookware,cream,...,salty snack,sauces,sausage,seasonal products,semi-finished bread,shopping bags,skin care,sliced cheese,snack products,soap,soda,soft cheese,softener,sound storage medium,soups,sparkling wine,specialty bar,specialty cheese,specialty chocolate,specialty fat,specialty vegetables,spices,spread cheese,sugar,sweet spreads,syrup,tea,tidbits,toilet cleaner,tropical fruit,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [None]:
# Step 2: Generating frequent itemsets
groceries_df_frequent_itemsets = apriori(groceries_df, min_support=0.01, use_colnames=True)
groceries_df_frequent_itemsets.head()

| | support | itemsets |
|---|---|---|
| 0 | 0.033452 | (UHT-milk) |
| 1 | 0.017692 | (baking powder) |
| 2 | 0.052466 | (beef) |
| 3 | 0.033249 | (berries) |
| 4 | 0.026029 | (beverages) |


In [None]:
# Step 3: Generating association rules
# ---
rules = association_rules(groceries_df_frequent_itemsets, metric="lift", min_threshold=1)

# Sorting
rules.sort_values("confidence", ascending = False, inplace = True)

# Previewing the association rules
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 419 | (citrus fruit, root vegetables) | (other vegetables) | 0.017692 | 0.193493 | 0.010371 | 0.586207 | 3.029608 | 0.006948 | 1.949059 |
| 491 | (tropical fruit, root vegetables) | (other vegetables) | 0.021047 | 0.193493 | 0.012303 | 0.584541 | 3.020999 | 0.008231 | 1.941244 |
| 438 | (curd, yogurt) | (whole milk) | 0.017285 | 0.255516 | 0.010066 | 0.582353 | 2.279125 | 0.005649 | 1.782567 |
| 414 | (butter, other vegetables) | (whole milk) | 0.020031 | 0.255516 | 0.01149 | 0.573604 | 2.244885 | 0.006371 | 1.745992 |
| 570 | (tropical fruit, root vegetables) | (whole milk) | 0.021047 | 0.255516 | 0.011998 | 0.570048 | 2.230969 | 0.00662 | 1.731553 |


### Challenge 2

In [None]:
# Question: Determine which movies should be promoted together
# given the following dataset.
# ---
# Dataset URL = https://bit.ly/3Hxz2ts
# ---
#

In [None]:
# Step 1: Reading the dataset
# ---
#
movies_df = pd.read_csv("https://bit.ly/3Hxz2ts")
movies_df.head()

| | TID | item |
|---|---|---|
| 0 | 145755 | The Fault in Our Stars |
| 1 | 145755 | Boyhood |
| 2 | 145755 | Big Hero 6 |
| 3 | 145755 | The Imitation Game |
| 4 | 145755 | Inside Out |


In [None]:
# Step 1: Data processing
# ---
# We group the movies dataframe by transaction (TID)
# and item, and display the count of items
# ---
movies_df2 = movies_df.groupby(["TID","item"]).size().reset_index(name="Count")
movies_df2.head()

| | TID | item | Count |
|---|---|---|---|
| 0 | 22 | Inside Out | 1 |
| 1 | 46 | The Imitation Game | 1 |
| 2 | 123 | Big Hero 6 | 1 |
| 3 | 123 | The Imitation Game | 1 |
| 4 | 128 | Big Hero 6 | 1 |


In [None]:
# Step 1: Data processing
# ---
# Then we consolidate the items into one transaction per row
# with each item one-hot encoded.
# ---
#
movies_df3 = (movies_df2.groupby(['TID', 'item'])['Count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('TID'))

movies_df3.head()

TID,400 Days,A Walk in the Woods,About Alex,Action Jackson,American Ultra,Annie,Anti-Social,Appropriate Behaviour,Ascension,Aziz Ansari: Live at Madison Square Garden,Bad Asses on the Bayou,Bad Country,Batkid Begins,Bears,Beautiful Girl,Before I Disappear,Best of Enemies,Big Eyes,Big Hero 6,Blind,Boyhood,Break Point,Breathe,Butterfly,Calvary,Camp Belvidere,Carmina and Amen (Carmina y amén),Catch Hell,Contracted: Phase II,Copenhagen,Court,Crimes Against Humanity,Cruel & Unusual,Daawat-e-Ishq,Dark Summer,Deadly Virtues: Love.Honour.Obey.,Dear White People,Demonic,Dum Laga Ke Haisha,Eat,...,Sinister 2,Sisters,Skin Trade,Stage Fright,Steve Jobs: The Man in the Machine,Strangerland,Tanu Weds Manu Returns,Teenage Mutant Ninja Turtles,The Best of Me,The Big Short,The Christmas Secret,The Crow's Nest,The Divine Move,The Fault in Our Stars,The Green Prince,The Humbling,The Hunger Games: Mockingjay - Part 2,The Imitation Game,The Little Death,The Little Rascals Save the Day,The Mend,The Price of Gold,The Treasure,The Unexpected Love,The Vatican Tapes,The Voices,The Wonders,Time Out of Mind,Transformers: Age of Extinction,Turks & Caicos,Unexpected,V/H/S: Viral,Victor Frankenstein,Viy,Wanderers,Web Junkie,Welcome to Leith,Whitey: United States of America v. James J. Bulger,Wild Card,Wild Tales
22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
123,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
176,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Step 1: Data processing
# ---
# We then use our custom encoding function to convert
# all the values to 0 or 1.
# The Apriori algorithm will only take 0's or 1's.
# ---
#
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

movies_df4 = movies_df3.applymap(encode_units)

movies_df4.head()

TID,400 Days,A Walk in the Woods,About Alex,Action Jackson,American Ultra,Annie,Anti-Social,Appropriate Behaviour,Ascension,Aziz Ansari: Live at Madison Square Garden,Bad Asses on the Bayou,Bad Country,Batkid Begins,Bears,Beautiful Girl,Before I Disappear,Best of Enemies,Big Eyes,Big Hero 6,Blind,Boyhood,Break Point,Breathe,Butterfly,Calvary,Camp Belvidere,Carmina and Amen (Carmina y amén),Catch Hell,Contracted: Phase II,Copenhagen,Court,Crimes Against Humanity,Cruel & Unusual,Daawat-e-Ishq,Dark Summer,Deadly Virtues: Love.Honour.Obey.,Dear White People,Demonic,Dum Laga Ke Haisha,Eat,...,Sinister 2,Sisters,Skin Trade,Stage Fright,Steve Jobs: The Man in the Machine,Strangerland,Tanu Weds Manu Returns,Teenage Mutant Ninja Turtles,The Best of Me,The Big Short,The Christmas Secret,The Crow's Nest,The Divine Move,The Fault in Our Stars,The Green Prince,The Humbling,The Hunger Games: Mockingjay - Part 2,The Imitation Game,The Little Death,The Little Rascals Save the Day,The Mend,The Price of Gold,The Treasure,The Unexpected Love,The Vatican Tapes,The Voices,The Wonders,Time Out of Mind,Transformers: Age of Extinction,Turks & Caicos,Unexpected,V/H/S: Viral,Victor Frankenstein,Viy,Wanderers,Web Junkie,Welcome to Leith,Whitey: United States of America v. James J. Bulger,Wild Card,Wild Tales
22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
123,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
176,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Step 2: We generate the frequent itemsets
movies_frequent_itemsets = apriori(movies_df4, min_support=0.2, use_colnames=True)
movies_frequent_itemsets.head()

| | support | itemsets |
|---|---|---|
| 0 | 0.280029 | (Big Hero 6) |
| 1 | 0.421852 | (Gone Girl) |
| 2 | 0.325615 | (Inside Out) |
| 3 | 0.48987 | (The Imitation Game) |
| 4 | 0.230825 | (The Imitation Game, Gone Girl) |


In [None]:
# Step 3: Finding the association rules
movies_rules = association_rules(movies_frequent_itemsets, metric="lift", min_threshold=1)

# Sorting
movies_rules.sort_values("confidence", ascending = False, inplace = True)

# Previewing the association rules
movies_rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 1 | (Gone Girl) | (The Imitation Game) | 0.421852 | 0.48987 | 0.230825 | 0.54717 | 1.11697 | 0.024172 | 1.126538 |
| 0 | (The Imitation Game) | (Gone Girl) | 0.48987 | 0.421852 | 0.230825 | 0.471196 | 1.11697 | 0.024172 | 1.093313 |
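
The columns in the rules table are related by simple formulas. As a quick check, the sketch below recomputes confidence, lift, leverage and conviction for the first rule, (Gone Girl) to (The Imitation Game), from the three support values shown above. The formulas are the standard definitions of these metrics; the variable names are chosen for illustration.

```python
import math

# Values copied from the first rule above: (Gone Girl) -> (The Imitation Game)
antecedent_support = 0.421852
consequent_support = 0.48987
support = 0.230825

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
# Conviction divides by (1 - confidence), so it is infinite when
# confidence is exactly 1.
conviction = (
    math.inf if confidence == 1
    else (1 - consequent_support) / (1 - confidence)
)

print(f"confidence={confidence:.5f} lift={lift:.5f}")
print(f"leverage={leverage:.6f} conviction={conviction:.6f}")
```

The recomputed values agree with the table up to rounding. A lift of about 1.12 is only slightly above 1, so the association between the two movies is positive but weak.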


### Challenge 3

In [None]:
# Question
# ---
# Given the following dataset, determine which items should be promoted together.
# ---
# Dataset URL (CSV) = https://bit.ly/323fA7E
#
#

In [None]:
# Loading and Data Processing
# ---
#
df = pd.read_csv("https://bit.ly/323fA7E")
df.head()

Unnamed: 0,Tissue,Towels,Plates,Cutlery,Mop,Broom,Kleenex,Bag,Detergent,Foil,Scrubber,Oxiclean,Handwash,Lotion,Shampoo,Conditioner,Soap,Disinfecting wipes,Diapers,Fabric softener
0,Tissue,Towels,Plates,Cutlery,Mop,Broom,Kleenex,,,,,,,,,,,,,
1,Shampoo,Conditioner,Soap,Disinfecting wipes,Diapers,Fabric softener,,,,,,,,,,,,,,
2,Kleenex,Bag,Detergent,Foil,Scrubber,Disinfecting wipes,Diapers,Fabric softener,,,,,,,,,,,,
3,Tissue,Towels,Cutlery,Broom,Kleenex,Bag,Detergent,Foil,Oxiclean,Handwash,Lotion,Shampoo,Conditioner,Soap,Disinfecting wipes,Diapers,Fabric softener,,,
4,Tissue,Towels,Plates,Cutlery,Kleenex,Bag,Detergent,Foil,Scrubber,Oxiclean,Handwash,Lotion,Shampoo,Conditioner,Soap,Disinfecting wipes,Diapers,,,


In [None]:
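# ---
# We will convert the pandas dataframe into a list of lists.
# Note: range(1, ...) starts at row index 1, so the first row of df
# (index 0) is not included; use range(df.shape[0]) to include every row.
#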
records = []
for i in range(1, df.shape[0]):
    records.append([str(df.values[i, j]) for j in range(0,  df.shape[1])])

# Then later transform the list of lists into a one-hot encoded
# pandas DataFrame via TransactionEncoder().
# The resulting dataframe will be used for the generation of
# frequent itemsets using the apriori() function.
# ---
#
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

Unnamed: 0,Bag,Broom,Conditioner,Cutlery,Detergent,Diapers,Disinfecting wipes,Fabric softener,Foil,Handwash,Kleenex,Lotion,Mop,Oxiclean,Plates,Scrubber,Shampoo,Soap,Tissue,Towels,nan
0,False,False,True,False,False,True,True,True,False,False,False,False,False,False,False,False,True,True,False,False,True
1,True,False,False,False,True,True,True,True,True,False,True,False,False,False,False,True,False,False,False,False,True
2,True,True,True,True,True,True,True,True,True,True,True,True,False,True,False,False,True,True,True,True,True
3,True,False,True,True,True,True,True,False,True,True,True,True,False,True,True,True,True,True,True,True,True
4,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True


In [None]:
# Step 1: We drop the "nan" column, which was created when the missing
# values (NaN) were converted to the string "nan"
# ---
#
df.drop('nan', inplace=True, axis=1)

In [None]:
# Step 2: We generate the frequent itemsets with a min_support of 0.2
# ---
#
df_frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

# Step 3: Then find the association rules
df_rules = association_rules(df_frequent_itemsets, metric="lift", min_threshold=1)

# We sort them
df_rules.sort_values("confidence", ascending = False, inplace = True)

# And preview them
df_rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 196172 | (Detergent, Conditioner, Disinfecting wipes, L... | (Handwash, Bag, Soap) | 0.210526 | 0.210526 | 0.210526 | 1.0 | 4.75 | 0.166205 | inf |
| 197286 | (Shampoo, Bag, Disinfecting wipes, Oxiclean, F... | (Handwash, Detergent, Soap, Conditioner) | 0.210526 | 0.210526 | 0.210526 | 1.0 | 4.75 | 0.166205 | inf |
| 81601 | (Diapers, Shampoo, Bag) | (Handwash, Lotion, Detergent, Conditioner) | 0.210526 | 0.210526 | 0.210526 | 1.0 | 4.75 | 0.166205 | inf |
| 81600 | (Handwash, Shampoo, Bag) | (Diapers, Lotion, Detergent, Conditioner) | 0.210526 | 0.210526 | 0.210526 | 1.0 | 4.75 | 0.166205 | inf |
| 81599 | (Lotion, Shampoo, Bag) | (Handwash, Diapers, Detergent, Conditioner) | 0.210526 | 0.210526 | 0.210526 | 1.0 | 4.75 | 0.166205 | inf |
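
The large row indices in this output (for example 196172) suggest that a very large number of rules was generated. One reason is that a single frequent itemset with k items can be split into antecedent and consequent in 2^k - 2 ways. The sketch below counts these splits with a hypothetical `count_rules` helper; it is a plain-Python illustration and does not call mlxtend.

```python
from itertools import combinations

def count_rules(itemset):
    """Count the antecedent -> consequent splits of one itemset."""
    items = list(itemset)
    total = 0
    # Every non-empty proper subset can serve as an antecedent;
    # the remaining items form the consequent.
    for size in range(1, len(items)):
        total += sum(1 for _ in combinations(items, size))
    return total

for k in (2, 3, 5, 10):
    print(k, count_rules(range(k)))
```

When many items co-occur in the same transactions, as they do here, one way to keep the output manageable is to cap the itemset size, for instance with the `max_len` argument of mlxtend's `apriori()`, or to raise `min_support`.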
