### Dependencies Installation
Before we get started, let's make sure we have all dependencies installed.

In [1]:
%%capture
! pip3 install pymongo dateparser scikit-learn pandas numpy scipy matplotlib seaborn mlxtend
%matplotlib inline


# Association Rules


## Importing Necessary Dependencies

In [2]:
# dependencies
import dateparser
import pymongo
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import one_hot
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", palette="muted")

### The Initial Setup

We'll create a dataframe with some made-up transactions to illustrate the apriori algorithm and association rules. Each dictionary key is a product bought, and its value is the quantity bought.

In [3]:
transactions = [
    {
        "beer": 1,
        "chips": 2,
        "salsa": 1,
    },
    {
        "chips": 1,
        "salsa": 1,
        "chocolate": 3
    },
    {
        "chocolate": 2,
        "diapers": 1,
        "beer": 2
    },
    {
        "chips": 2,
        "salsa": 1,
        "chocolate": 2
    },
    {
        "diapers": 3,
        "chips": 1,
        "salsa": 2,
        "beer": 2
    },
    {
        "diapers": 2,
        "chips": 1,
        "salsa": 1,
        "chocolate": 4,
        "beer": 3
    }
]

In [4]:
transactions = pd.DataFrame.from_dict(transactions)
transactions

Unnamed: 0,beer,chips,salsa,chocolate,diapers
0,1.0,2.0,1.0,,
1,,1.0,1.0,3.0,
2,2.0,,,2.0,1.0
3,,2.0,1.0,2.0,
4,2.0,1.0,2.0,,3.0
5,3.0,1.0,1.0,4.0,2.0


### Getting rid of NaN Values

We need to get rid of NaN values, so we'll use a utility method from Pandas to replace them with 0.

In [5]:
transactions.fillna(0, inplace=True)
transactions

Unnamed: 0,beer,chips,salsa,chocolate,diapers
0,1.0,2.0,1.0,0.0,0.0
1,0.0,1.0,1.0,3.0,0.0
2,2.0,0.0,0.0,2.0,1.0
3,0.0,2.0,1.0,2.0,0.0
4,2.0,1.0,2.0,0.0,3.0
5,3.0,1.0,1.0,4.0,2.0


### One-hot Encoding

We need to one-hot encode the data, so that 1 means the customer bought the item and 0 means they didn't. We'll go through each column of the dataframe and replace every value greater than 0 with 1.

In [6]:
oh = transactions.copy()  # work on a copy so the quantities in `transactions` are preserved
for column in oh.columns:
    oh.loc[oh[column] > 0, column] = 1
oh

Unnamed: 0,beer,chips,salsa,chocolate,diapers
0,1.0,1.0,1.0,0.0,0.0
1,0.0,1.0,1.0,1.0,0.0
2,1.0,0.0,0.0,1.0,1.0
3,0.0,1.0,1.0,1.0,0.0
4,1.0,1.0,1.0,0.0,1.0
5,1.0,1.0,1.0,1.0,1.0
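
The same binarization can also be written as a single vectorized comparison. The snippet below is a minimal sketch on a hypothetical three-row miniature of the quantity table (the name `quantities` is an assumption, not part of the notebook); it produces a boolean frame rather than 0.0/1.0 floats.

```python
import pandas as pd

# Hypothetical miniature of the quantity table above (NaN already replaced by 0).
quantities = pd.DataFrame(
    {"beer": [1.0, 0.0, 2.0], "chips": [2.0, 1.0, 0.0], "salsa": [1.0, 1.0, 0.0]}
)

# Comparing against 0 returns a new boolean frame: True where the item was bought.
# `quantities` itself is not modified.
encoded = quantities > 0

# Cast to int only for display, to mirror the 0/1 table shown above.
print(encoded.astype(int))
```

Because the comparison returns a new object, the original quantities stay available for later analysis.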


### Apriori

The first step is to use the apriori algorithm. This will give us our frequent itemsets and their support.

The support of an itemset is the proportion of transactions in the collection in which the itemset appears. It signifies the popularity of an itemset.

Given the above information, we have 6 transactions. Of those, beer appears in 4 of them. So, we'd expect the itemset `[beer]` to have a support value of `4/6` or `.666666667`.

Going through all of them, we can build itemsets that are just one item and calculate their support.
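
As a rough illustration, the single-item supports can be computed by hand with the standard library. The `baskets` list below restates the six example transactions as sets of item names (quantities dropped); the variable names are chosen here for illustration only.

```python
from collections import Counter

# The six example baskets, reduced to the set of items bought in each.
baskets = [
    {"beer", "chips", "salsa"},
    {"chips", "salsa", "chocolate"},
    {"chocolate", "diapers", "beer"},
    {"chips", "salsa", "chocolate"},
    {"diapers", "chips", "salsa", "beer"},
    {"diapers", "chips", "salsa", "chocolate", "beer"},
]

# Count the baskets each item appears in, then divide by the number of baskets.
item_counts = Counter(item for basket in baskets for item in basket)
support = {item: count / len(baskets) for item, count in item_counts.items()}

# Print items from most to least popular.
for item, value in sorted(support.items(), key=lambda pair: -pair[1]):
    print(f"{item:10s} {value:.6f}")
```

Beer appears in 4 of the 6 baskets, so `support["beer"]` evaluates to `4/6`, in line with the expectation stated above.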

Now that we have our 1-item itemsets, let's build up our 2-item itemsets. So, if an itemset is `[a, b]` where a is chips and b is salsa, the support is the proportion of all transactions in which the itemset `[a, b]` appears. We would continue this way until we have exhausted all possible itemsets.

It is also important to define a minimum support threshold: itemsets whose support falls below it are discarded as not interesting.
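
To make the thresholding concrete, here is a brute-force sketch that enumerates every 1-item and 2-item itemset and keeps those meeting a minimum support. It is an exhaustive enumeration, not the Apriori algorithm itself, which avoids checking candidates whose subsets are already infrequent. The helper `support` and the constant `MIN_SUPPORT` are names assumed for this example.

```python
from itertools import combinations

# The six example baskets as sets of item names.
baskets = [
    {"beer", "chips", "salsa"},
    {"chips", "salsa", "chocolate"},
    {"chocolate", "diapers", "beer"},
    {"chips", "salsa", "chocolate"},
    {"diapers", "chips", "salsa", "beer"},
    {"diapers", "chips", "salsa", "chocolate", "beer"},
]

MIN_SUPPORT = 0.5  # illustrative threshold


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


# All distinct items, sorted so that itemset tuples have a stable order.
items = sorted(set().union(*baskets))

# Exhaustively test every itemset of size 1 and 2 against the threshold.
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        value = support(set(combo), baskets)
        if value >= MIN_SUPPORT:
            frequent[combo] = value

for combo, value in sorted(frequent.items(), key=lambda pair: -pair[1]):
    print(combo, round(value, 6))
```

Exhaustive enumeration is fine for five products but grows combinatorially with the catalogue size, which is why a dedicated implementation is preferable on real data.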

For this, we'll use the `apriori` algorithm from `mlxtend`.

In [7]:
assocs = apriori(oh, min_support=0.5, use_colnames=True)

assocs = assocs.sort_values(by='support', ascending=False)
assocs

Unnamed: 0,support,itemsets
1,0.833333,(chips)
2,0.833333,(salsa)
8,0.833333,"(chips, salsa)"
0,0.666667,(beer)
3,0.666667,(chocolate)
4,0.5,(diapers)
5,0.5,"(beer, chips)"
6,0.5,"(beer, salsa)"
7,0.5,"(beer, diapers)"
9,0.5,"(chocolate, chips)"


In [8]:
rules = association_rules(assocs, min_threshold=0.5)
with pd.option_context('display.max_rows', None, 'display.max_columns', 5):
    display(rules.sort_values(by='lift', ascending=False))

Unnamed: 0,antecedents,consequents,...,leverage,conviction
6,(beer),(diapers),...,0.166667,2.0
7,(diapers),(beer),...,0.166667,inf
0,(chips),(salsa),...,0.138889,inf
1,(salsa),(chips),...,0.138889,inf
22,(chips),"(chocolate, salsa)",...,0.083333,1.25
19,"(chocolate, salsa)",(chips),...,0.083333,inf
18,"(chocolate, chips)",(salsa),...,0.083333,inf
17,(salsa),"(beer, chips)",...,0.083333,1.25
16,(chips),"(beer, salsa)",...,0.083333,1.25
13,"(beer, salsa)",(chips),...,0.083333,inf


## Pymongo Setup

In [9]:
# pymongo driver configuration
course_cluster_uri = "mongodb://agg-student:agg-password@cluster0-shard-00-00-jxeqq.mongodb.net:27017,cluster0-shard-00-01-jxeqq.mongodb.net:27017,cluster0-shard-00-02-jxeqq.mongodb.net:27017/test?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin"
course_client = pymongo.MongoClient(course_cluster_uri)
orders = course_client['coursera-agg']['orders']

# Getting our data from MongoDB

We'll need to construct a one-hot encoded dataframe. This means that for every document, we convert the information in the `purchases` array from something like:

```
{
    ...,
    "purchases": [
        {
          "description": "WHITE WIRE EGG HOLDER",
          "quantity": 36,
          "stock_code": "84880",
          "unit_price": 4.95
        },
        {
          "description": "JUMBO  BAG BAROQUE BLACK WHITE",
          "quantity": 100,
          "stock_code": "85099C",
          "unit_price": 1.65
        },
        {
          "description": "JUMBO BAG RED RETROSPOT",
          "quantity": 100,
          "stock_code": "85099B",
          "unit_price": 1.65
        }
      ],
}
```
into
```
{
    "84880": 1,
    "85099C": 1,
    "85099B": 1,
}
```
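
The intended transformation can be sketched in plain Python on a single document before it is expressed in the aggregation framework. The function name `one_hot_order` is hypothetical, and the sample document contains only the `purchases` field shown above.

```python
# A single order document, restricted to the fields shown above.
order = {
    "purchases": [
        {"description": "WHITE WIRE EGG HOLDER", "quantity": 36,
         "stock_code": "84880", "unit_price": 4.95},
        {"description": "JUMBO  BAG BAROQUE BLACK WHITE", "quantity": 100,
         "stock_code": "85099C", "unit_price": 1.65},
        {"description": "JUMBO BAG RED RETROSPOT", "quantity": 100,
         "stock_code": "85099B", "unit_price": 1.65},
    ]
}


def one_hot_order(document):
    """Map each purchased stock code to 1, discarding quantity and price."""
    return {purchase["stock_code"]: 1 for purchase in document["purchases"]}


encoded = one_hot_order(order)
print(encoded)  # {'84880': 1, '85099C': 1, '85099B': 1}
```

Doing this on the server instead of in Python means only the compact one-hot documents travel over the network.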

## The Pipeline

In [10]:
order_projection = {
    "$replaceRoot": {
            "newRoot":  {
                "$arrayToObject": {
                    "$map": {
                        "input": "$purchases",
                        "in": {
                            "k": "$$this.stock_code",
                            "v": 1
                        }
                    }
                }
            }
    }
            
}

print(json.dumps(order_projection, indent=2))

{
  "$replaceRoot": {
    "newRoot": {
      "$arrayToObject": {
        "$map": {
          "input": "$purchases",
          "in": {
            "k": "$$this.stock_code",
            "v": 1
          }
        }
      }
    }
  }
}


# Constructing the Pipeline

That's it! The pipeline consists of that single stage.

In [11]:
pipeline = [
    order_projection
]

# Constructing the pandas Dataframe from MongoDB

Here you will need to construct the DataFrame. Assign it to the variable `df` below.

In [12]:
df = pd.DataFrame.from_dict(list(orders.aggregate(pipeline)))
df.head(n=10)

Unnamed: 0,21756,84879,22745,22749,22748,84969,22623,22622,21755,21754,...,23561,90214F,90214O,90214U,90214T,90214W,90214Z,90089,72783,23843
0,1.0,,,,,,,,,,...,,,,,,,,,,
1,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


## Fixing the NaN values

We will use the Pandas DataFrame [fillna](http://github.com/pandas-dev/pandas/blob/v0.21.0/pandas/core/frame.py#L3029-L3035) method to replace the NaN values with 0.

In [13]:
df.fillna(0, inplace=True)
df.head(10)

Unnamed: 0,21756,84879,22745,22749,22748,84969,22623,22622,21755,21754,...,23561,90214F,90214O,90214U,90214T,90214W,90214Z,90089,72783,23843
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
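
Recent mlxtend releases emit a warning when `apriori` receives a non-boolean DataFrame, so it may be worth converting the filled frame to booleans. The snippet below is a sketch on a hypothetical three-order sample (`records` and `df_small` are illustrative names); whether the conversion is needed depends on the installed mlxtend version.

```python
import pandas as pd

# Hypothetical miniature of the pipeline output: one dict per order.
records = [
    {"84880": 1, "85099C": 1},
    {"85099B": 1},
    {"84880": 1, "85099B": 1},
]

df_small = pd.DataFrame(records)

# Fill NaN first: casting NaN straight to bool would yield True, not False.
encoded = df_small.fillna(0).astype(bool)

print(encoded)
```

The order of the two calls matters, since a missing purchase must become `False` before the cast.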


## Association

### Apriori
First, we'll use the `apriori` algorithm from `mlxtend` to extract frequent itemsets. 

In [14]:
assocs = apriori(df, min_support=0.02, use_colnames=True)

In [15]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 5):
    assocs = assocs.sort_values(by='support', ascending=False)
    display(assocs)

Unnamed: 0,support,itemsets
9,0.11358,(85123A)
37,0.086912,(85099B)
94,0.08469,(22423)
0,0.078083,(84879)
165,0.077542,(47566)
18,0.067271,(20725)
171,0.060484,(22720)
74,0.059823,(20727)
185,0.058983,(23203)
72,0.057601,(22383)


## Association Rules

Now we form the association rules. Try adjusting `min_threshold` together with `metric` to find interesting associations. For example, which class appears to be highly associated with `parents_children`? Go back and add a one-hot encoding function for `parents_children` and see whether the results become clearer.

In [16]:
rules = association_rules(assocs, metric="lift", min_threshold=3)

In [17]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 5):
    display(rules.sort_values(by='lift', ascending=False))

Unnamed: 0,antecedents,consequents,...,leverage,conviction
77,"(22699, 22698)",(22697),...,0.019636,8.783841
80,(22697),"(22699, 22698)",...,0.019636,2.206352
76,"(22699, 22697)",(22698),...,0.019635,3.421518
81,(22698),"(22699, 22697)",...,0.019635,3.150691
36,(22697),(22698),...,0.023177,2.855182
37,(22698),(22697),...,0.023177,5.335706
79,(22699),"(22697, 22698)",...,0.019494,1.96305
78,"(22697, 22698)",(22699),...,0.019494,6.151553
4,(22699),(22697),...,0.027093,3.233057
5,(22697),(22699),...,0.027093,4.316746


In [18]:
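# Keep only the three stock codes that dominate the rules above.
# This stage runs after $group, so the stock code lives under `_id`.
query = {
    "$match": {
        "_id.stock_code": { "$in": ["22697", "22698", "22699"]}
    }
}

# Projection kept for reference; it is not included in the pipeline below.
project = {
    "$project": { "_id": 0, "purchases.stock_code": 1, "purchases.description": 1}
}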

pipeline = [
    {
        "$unwind": "$purchases"
    },
    {
        "$group": {
            "_id": {
                "stock_code": "$purchases.stock_code",
                "description": "$purchases.description"
            }
            
        }
    },
    query
]
display(list(orders.aggregate(pipeline)))

[{'_id': {'stock_code': '22699',
   'description': 'ROSES REGENCY TEACUP AND SAUCER'}},
 {'_id': {'stock_code': '22697',
   'description': 'GREEN REGENCY TEACUP AND SAUCER'}},
 {'_id': {'stock_code': '22698',
   'description': 'PINK REGENCY TEACUP AND SAUCER'}}]