# Part III: Machine learning

**Apriori Algorithm**:
- A machine learning algorithm used to uncover structured relationships between the items in a dataset
- Performs frequent item set mining and association rule learning over transactional data, such as orders stored in a relational database
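
To make the idea concrete, here is a minimal pure-Python sketch of the level-wise search that Apriori is built on. It is an illustration only and not how mlxtend implements it; the function name `apriori_sketch` and the toy `orders` list are assumptions made up for this example.

```python
from itertools import combinations


def apriori_sketch(orders, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Illustrative only: a plain level-wise search, not mlxtend's implementation.
    """
    n_orders = len(orders)
    order_sets = [frozenset(order) for order in orders]

    def support(itemset):
        # Fraction of orders that contain every item of the itemset
        return sum(itemset <= order for order in order_sets) / n_orders

    # Level 1: frequent single items
    items = {item for order in order_sets for item in order}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Candidates of size k are unions of two frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data: 5 orders
orders = [
    ["apple", "egg", "milk"],
    ["apple", "egg"],
    ["apple", "egg", "bread"],
    ["milk", "bread"],
    ["milk"],
]
result = apriori_sketch(orders, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what keeps the search tractable: an item set can only be frequent if all of its subsets are frequent, so most larger candidates are never counted.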

**Project**: understand which items are frequently bought together
- Practical application: customers who purchased item A are recommended item B

**Metrics**

**1. Support**: percentage of orders that contain the item set.
- Example: for the item set {apple, egg}, if there are 5 orders in total and {apple, egg} occurs in 3 of them, then support{apple,egg} = 3/5 = 60%

**2. Confidence**: Given two items, A and B, confidence measures the percentage of orders containing item A that also contain item B.
- Formula: confidence{A->B} = support{A,B} / support{A}


**3. Lift**: Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items occur together in the same orders simply by chance.
- Formula: lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})
- lift = 1 implies no relationship between A and B (i.e. A and B occur together only as often as chance would predict)
- lift > 1 implies a positive relationship between A and B (i.e. A and B occur together more often than chance would predict)
- lift < 1 implies a negative relationship between A and B (i.e. A and B occur together less often than chance would predict)
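
The three metrics can be checked by hand on a small example. The sketch below uses plain Python and a made-up list of 5 orders; the helper names `support`, `confidence` and `lift` are introduced here for illustration and are not part of mlxtend.

```python
def support(orders, items):
    """Fraction of orders that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(order) for order in orders) / len(orders)


def confidence(orders, a, b):
    """confidence{A->B} = support{A,B} / support{A}."""
    return support(orders, [a, b]) / support(orders, [a])


def lift(orders, a, b):
    """lift{A,B} = support{A,B} / (support{A} * support{B})."""
    return support(orders, [a, b]) / (support(orders, [a]) * support(orders, [b]))


# Hypothetical toy data: 5 orders, {apple, egg} occurs in 3 of them
orders = [
    ["apple", "egg", "milk"],
    ["apple", "egg"],
    ["apple", "egg", "bread"],
    ["apple", "milk"],
    ["milk", "bread"],
]

print(support(orders, ["apple", "egg"]))    # 3 of the 5 orders
print(confidence(orders, "apple", "egg"))   # 3 of the 4 orders with apple
print(confidence(orders, "egg", "apple"))   # 3 of the 3 orders with egg
print(lift(orders, "apple", "egg"))         # same value as lift(orders, "egg", "apple")
```

Note that confidence depends on the direction of the rule while lift does not. Here the lift is above 1, so apple and egg appear together more often than chance would predict.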

# Loading and exploring the data

In [1]:
#importing the necessary libraries
import pandas as pd 
import numpy as np 
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
# Importing the datasets

productDf  = pd.read_csv('Data/products.csv')
orderDf  = pd.read_csv('Data/orders.csv')
trainDf  = pd.read_csv('Data/order_products__train.csv')

In [3]:
# Overwrite the 'reordered' column with 1 so that summing it later counts how often each product appears in an order

trainDf['reordered'] = 1

# Counting the frequency of each product
productCountDf = trainDf.groupby("product_id",as_index = False)["order_id"].count()

In [5]:
# creating a new dataframe with the unique product_id and product_name

newproductCountDf=productCountDf.merge(productDf, left_on='product_id', right_on='product_id', how='inner')
newDf = newproductCountDf[['product_id','product_name']]
newDf

Unnamed: 0,product_id,product_name
0,1,Chocolate Sandwich Cookies
1,2,All-Seasons Salt
2,3,Robust Golden Unsweetened Oolong Tea
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...
4,5,Green Chile Anytime Sauce
...,...,...
39118,49682,California Limeade
39119,49683,Cucumber Kirby
39120,49686,Artisan Baguette
39121,49687,Smartblend Healthy Metabolism Dry Cat Food


In [6]:
# creating a dataframe with the Top 100 most frequently purchased products

topLev = 100
productCountDf = productCountDf.sort_values("order_id",ascending = False)
topProdFrame = productCountDf.iloc[0:topLev,:]
productId= topProdFrame.loc[:,["product_id"]]

In [7]:
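# I'm going to filter the order lines that contain the most frequently purchased products.
# Note: range(0, 99) covers only the first 99 of the top 100 products, which is why the
# basket below has 99 columns; use range(topLev) to include all 100.

frames = []
for i in range(0, 99):
    pId = productId.iloc[i]['product_id']
    frames.append(trainDf[trainDf.product_id == pId])
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
df = pd.concat(frames, ignore_index=False)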
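
As a side note, the same selection can be written as a single vectorized filter. The sketch below is an alternative to the loop above, not the code used in this notebook; `toy_train` and `top_ids` are hypothetical stand-ins for `trainDf` and the list of top product ids.

```python
import pandas as pd

# Hypothetical toy stand-in for trainDf
toy_train = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 3],
    "product_id": [10, 20, 10, 30, 20],
})

# Hypothetical list of the most frequent product ids
top_ids = [10, 20]

# One vectorized filter instead of one pass over the data per product
filtered = toy_train[toy_train["product_id"].isin(top_ids)]
print(filtered)
```

Unlike the loop, this keeps the rows in their original order instead of grouping them by product, which makes no difference once the data is pivoted into a basket.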

In [8]:
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
115,226,24852,2,1
156,473,24852,2,1
196,878,24852,2,1
272,1042,24852,1,1
297,1139,24852,1,1


# One-hot encoding the data

Consolidate the items into one transaction per row, with each product one-hot encoded. Each row represents an order and each column represents a product (labelled by product_name). A value of 1 in cell (i, j) means that the i-th order contains the j-th product; a value of 0 means it does not.
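
Before running this on the real data, the target shape can be seen on a tiny example. This sketch uses `pd.crosstab` instead of the `groupby`/`unstack` chain used in this notebook, and the `toy` dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (order, product) pair
toy = pd.DataFrame({
    "order_id":     [1, 1, 2, 3, 3, 3],
    "product_name": ["Banana", "Limes", "Banana", "Limes", "Large Lemon", "Limes"],
})

# Count each product per order, then cap the counts at 1 to get 0/1 flags
toy_basket = pd.crosstab(toy["order_id"], toy["product_name"]).clip(upper=1)
print(toy_basket)
```

Order 3 lists Limes twice, but the encoded cell is still 1, because the matrix records presence rather than quantity.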

In [9]:
df=df.merge(newDf, left_on='product_id', right_on='product_id', how='inner')
df=df[['order_id','add_to_cart_order','reordered','product_name']]
df

Unnamed: 0,order_id,add_to_cart_order,reordered,product_name
0,226,2,1,Banana
1,473,2,1,Banana
2,878,2,1,Banana
3,1042,1,1,Banana
4,1139,1,1,Banana
...,...,...,...,...
312871,3405263,7,1,Organic Broccoli Florets
312872,3410603,1,1,Organic Broccoli Florets
312873,3411504,4,1,Organic Broccoli Florets
312874,3412303,1,1,Organic Broccoli Florets


In [10]:
basket = df.groupby(['order_id', 'product_name'])['reordered'].sum().unstack().reset_index().fillna(0).set_index('order_id')

In [11]:
def encode_units(x):
    # Map any count of 1 or more to 1 and everything else to 0
    return 1 if x >= 1 else 0

In [12]:
basket_sets = basket.applymap(encode_units)

In [13]:
basket_sets.head()

order_id,100% Whole Wheat Bread,2% Reduced Fat Milk,Apple Honeycrisp Organic,Asparagus,Bag of Organic Bananas,Banana,Blueberries,Boneless Skinless Chicken Breasts,Broccoli Crown,Bunched Cilantro,...,Sparkling Lemon Water,Sparkling Natural Mineral Water,Sparkling Water Grapefruit,Spring Water,Strawberries,Uncured Genoa Salami,Unsalted Butter,Unsweetened Almondmilk,Unsweetened Original Almond Breeze Almond Milk,Yellow Onions
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
38,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [14]:
basket_sets.size

9283032

In [15]:
basket_sets.shape

(93768, 99)

# Building the models and analyzing the results

In [16]:
# Find the frequent item sets: those that appear in at least 1% of the orders
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True)
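
To put the `min_support=0.01` threshold into perspective, it can be translated into a number of orders. This is a small arithmetic sketch that takes the row count from the `basket_sets.shape` output above; the variable names are chosen here for illustration.

```python
import math

n_orders = 93768      # number of rows in basket_sets, from basket_sets.shape
min_support = 0.01    # threshold passed to apriori

# Smallest number of orders an item set must appear in to be kept
min_orders = math.ceil(min_support * n_orders)
print(min_orders)  # 938
```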

In [17]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.024507,(100% Whole Wheat Bread)
1,0.016424,(2% Reduced Fat Milk)
2,0.024017,(Apple Honeycrisp Organic)
3,0.041251,(Asparagus)
4,0.165088,(Bag of Organic Bananas)
...,...,...
138,0.011006,"(Organic Strawberries, Organic Cucumber)"
139,0.010867,"(Organic Hass Avocado, Organic Raspberries)"
140,0.016413,"(Organic Strawberries, Organic Hass Avocado)"
141,0.017810,"(Organic Strawberries, Organic Raspberries)"


In [19]:
# Create the rules
rules = association_rules(frequent_itemsets, metric ="lift", min_threshold = 1) 
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False]) 

In [20]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
32,(Organic Fuji Apple),(Banana),0.034735,0.199706,0.012904,0.371508,1.860275,0.005967,1.273355
22,(Honeycrisp Apple),(Banana),0.037870,0.199706,0.013128,0.346663,1.735869,0.005565,1.224933
9,(Organic Large Extra Fancy Fuji Apple),(Bag of Organic Bananas),0.030831,0.165088,0.010377,0.336562,2.038677,0.005287,1.258462
6,(Organic Hass Avocado),(Bag of Organic Bananas),0.077777,0.165088,0.025808,0.331825,2.009985,0.012968,1.249541
13,(Organic Raspberries),(Bag of Organic Bananas),0.059146,0.165088,0.018983,0.320952,1.944123,0.009219,1.229533
...,...,...,...,...,...,...,...,...,...
8,(Bag of Organic Bananas),(Organic Large Extra Fancy Fuji Apple),0.165088,0.030831,0.010377,0.062855,2.038677,0.005287,1.034172
38,(Banana),(Seedless Red Grapes),0.199706,0.043288,0.012392,0.062053,1.433497,0.003747,1.020007
43,(Banana),(Yellow Onions),0.199706,0.040120,0.011422,0.057193,1.425543,0.003410,1.018109
34,(Banana),(Organic Whole Milk),0.199706,0.052342,0.011134,0.055751,1.065137,0.000681,1.003611
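
The metric columns can be reproduced from the three support columns. The sketch below recomputes them for the first rule in the table, (Organic Fuji Apple) -> (Banana), using the standard definitions of confidence, lift, leverage and conviction. The inputs are the rounded values printed above, so the results agree with the table only up to rounding.

```python
# Values copied from the first rule above: (Organic Fuji Apple) -> (Banana)
antecedent_support = 0.034735   # support{Organic Fuji Apple}
consequent_support = 0.199706   # support{Banana}
joint_support = 0.012904        # support{Organic Fuji Apple, Banana}

# confidence{A->B} = support{A,B} / support{A}
confidence = joint_support / antecedent_support
# lift{A,B} = confidence{A->B} / support{B}
lift = confidence / consequent_support
# leverage = support{A,B} - support{A} * support{B}
leverage = joint_support - antecedent_support * consequent_support
# conviction = (1 - support{B}) / (1 - confidence{A->B})
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```

A leverage above 0 and a conviction above 1 tell the same story as a lift above 1: the two products are bought together more often than independence would predict.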


In [21]:
# Keep only the strongest rules: lift of at least 2 and confidence of at least 0.1

rules=rules[ (rules['lift'] >= 2) &
       (rules['confidence'] >= 0.1) ]
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
9,(Organic Large Extra Fancy Fuji Apple),(Bag of Organic Bananas),0.030831,0.165088,0.010377,0.336562,2.038677,0.005287,1.258462
6,(Organic Hass Avocado),(Bag of Organic Bananas),0.077777,0.165088,0.025808,0.331825,2.009985,0.012968,1.249541
77,(Organic Raspberries),(Organic Strawberries),0.059146,0.11618,0.01781,0.301118,2.591814,0.010938,1.264619
57,(Organic Cilantro),(Limes),0.037603,0.06434,0.010739,0.285593,4.43883,0.00832,1.309702
45,(Limes),(Large Lemon),0.06434,0.086757,0.01701,0.264379,3.047365,0.011428,1.241459
69,(Organic Blueberries),(Organic Strawberries),0.05296,0.11618,0.013533,0.255538,2.199491,0.00738,1.187192
44,(Large Lemon),(Limes),0.086757,0.06434,0.01701,0.196066,3.047365,0.011428,1.163853
73,(Organic Raspberries),(Organic Hass Avocado),0.059146,0.077777,0.010867,0.183736,2.362342,0.006267,1.12981
47,(Organic Avocado),(Large Lemon),0.079014,0.086757,0.014387,0.182076,2.098696,0.007532,1.116538
52,(Limes),(Organic Avocado),0.06434,0.079014,0.011059,0.171888,2.175407,0.005975,1.112151
