# Association Rules: Market Basket Analysis

## Import Libraries

In [100]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load Dataset

Dataset obtained from [here](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II).

<font color="red">Load in your data. Make sure to also set `encoding="ISO-8859-1"` as one of the `read_csv` parameters.</font>
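As a small illustration of why the encoding matters, the sketch below builds a hypothetical two-row CSV in memory (the rows and the column subset are made up for the example, not taken from the dataset). The pound sign is stored as the single byte `0xA3` in ISO-8859-1, which is not valid UTF-8 on its own.

```python
import io
import pandas as pd

# Hypothetical sample, encoded as ISO-8859-1 like the file described above.
raw = "Description,UnitPrice\nTEACUP £,2.55\nSAUCER,1.25\n".encode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails on the pound sign (byte 0xA3).
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the real encoding reads the text correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok)
print(sample)
```

The same `encoding` argument applies when reading from a file path instead of an in-memory buffer.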

In [101]:
# TODO write code here

In [None]:
df.head()

In [103]:
df.shape

(541910, 8)

We have 8 features and 541,910 rows.  
Looking at the source, this data comes from a UK-based online retail store selling gift-ware.  
  
Feature information:
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.

## Pre-processing

### Drop nulls

<font color="red">Drop nulls</font>

In [104]:
# TODO write code here

In [None]:
df.shape

We have 406,840 rows left after dropping any row with null values.
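For reference, here is a minimal sketch of what dropping rows with nulls does, using a hypothetical three-row frame (the values are invented for illustration). `DataFrame.dropna()` with default arguments removes every row that has at least one missing value.

```python
import pandas as pd

# Hypothetical toy frame: two of the three rows have a missing value.
toy = pd.DataFrame({
    "Description": ["TEACUP", None, "SAUCER"],
    "CustomerID": [17850.0, 13047.0, None],
})

# Default dropna() drops any row containing at least one null.
cleaned = toy.dropna()
print(toy.isna().sum())   # nulls per column before cleaning
print(cleaned)
```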

### Filter to only include positive quantities

In [None]:
df['Quantity'].unique()

It is possible for a customer to cancel their order, resulting in a negative quantity. We only want to look at items the customers have actually purchased, so we will remove any negatives (or cancelled orders). 

In [107]:
df = df[df['Quantity'] > 0] # keep only rows with a positive quantity (drops cancellations)

In [108]:
df.shape

(397925, 8)

After filtering, we have 397,925 rows remaining.

### Filter to only UK transactions

In [None]:
df['Country'].value_counts()

<font color="red">Filter the dataframe to include only transactions from the United Kingdom</font>

In [110]:
# TODO write code here

Now we should have 282,944 rows left.

However, each row is an individual item in a given transaction. To work with the Apriori algorithm, we want each row to be a transaction along with all of its purchased items.
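To make the difference between line items and transactions concrete, here is a small sketch in plain Python. The invoice numbers and item names are hypothetical; the idea is only to group the items that share an invoice into one basket.

```python
from collections import defaultdict

# Hypothetical line items: one (invoice, description) pair per purchased product.
line_items = [
    ("536365", "TEACUP"),
    ("536365", "SAUCER"),
    ("536366", "TEACUP"),
    ("536367", "SAUCER"),
    ("536367", "CANDLE"),
    ("536367", "TEACUP"),
]

# Collect the set of distinct items bought on each invoice.
baskets = defaultdict(set)
for invoice, item in line_items:
    baskets[invoice].add(item)

for invoice, items in baskets.items():
    print(invoice, sorted(items))
```

Six line items collapse into three transactions, which is the shape the Apriori algorithm works with.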

### One Hot Encoding of our items

In [None]:
df.head()

We want a table in which each row is an individual invoice and each column is an item (1 if purchased, 0 if not). The exact quantity does not matter, since we have already kept only rows with a quantity greater than 0 (meaning the item was purchased).
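One way to picture the target table is the sketch below, which uses `pd.crosstab` on a hypothetical toy frame. This is an alternative illustration, not the `pivot_table` call used in this notebook, and the column names and values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical line items; invoice "A" lists TEACUP twice.
toy = pd.DataFrame({
    "Invoice": ["A", "A", "A", "B", "C", "C"],
    "Description": ["TEACUP", "SAUCER", "TEACUP", "TEACUP", "SAUCER", "CANDLE"],
})

# Count how many lines each (invoice, item) pair has.
counts = pd.crosstab(toy["Invoice"], toy["Description"])

# Collapse counts to a 0/1 indicator: purchased or not.
basket = (counts > 0).astype(int)
print(counts)
print(basket)
```

The raw counts can exceed 1 when an item appears on several lines of the same invoice, which is why a final step that maps every positive count to 1 is needed.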

In [113]:
df_enc = df[['Invoice','Description']].pivot_table(index=['Invoice'], columns=['Description'], aggfunc=[len], fill_value=0)

Some cells contain counts of 2, 3, and so on instead of just 1, because the same item can appear on several lines of one invoice. We will clean that up as well.

In [114]:
def encode(x):
    if x >= 1:
        return 1
    # else
    return 0

In [None]:
df_enc = df_enc.applymap(encode)

In [None]:
df_enc

Let's briefly check this with our first invoice: 536365  
The below shows the sum of different items purchased for each invoice.

In [None]:
df_enc[df_enc == 1].sum(axis=1)

When checking below, we see that invoice 536365 had 7 different items.

In [None]:
df.head(10)

### Filter out transactions with only one purchase

Before we run the Apriori algorithm, we want to keep only transactions that had more than one purchase. When looking for association rules, we are looking for which purchased items are related to each other. If a transaction has only one purchase, it doesn't help us much. *So we will simply remove those transactions.* :)
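The filter can be pictured on a hypothetical 0/1 basket table (the invoices and items below are invented for illustration): sum each row to get the number of distinct items, then keep the rows where that sum is greater than 1.

```python
import pandas as pd

# Hypothetical one-hot basket table: rows are invoices, columns are items.
basket = pd.DataFrame(
    {"CANDLE": [0, 0, 1], "SAUCER": [1, 0, 1], "TEACUP": [1, 1, 0]},
    index=["A", "B", "C"],
)

# Number of distinct items on each invoice.
items_per_invoice = basket.sum(axis=1)

# Keep only invoices with more than one distinct item.
multi_item = basket[items_per_invoice > 1]
print(items_per_invoice)
print(multi_item)
```

Invoice "B" contains a single item, so it cannot support any rule of the form "X is bought with Y" and is dropped.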

In [None]:
df_enc = df_enc[(df_enc[df_enc == 1]).sum(axis=1) > 1]
df_enc

Originally, we had 16,649 rows of transactions. After filtering, we now have 15,371 rows.

## Apriori Algorithm

We can use the `mlxtend` library to apply the Apriori Algorithm. We will first install the package and then load it in.  
  
If you are interested in seeing how this works, you can also view their GitHub repository [here](https://github.com/rasbt/mlxtend). The code specific to Apriori is [here](https://github.com/rasbt/mlxtend/tree/master/mlxtend/frequent_patterns).

### Load our library

In [None]:
!pip install mlxtend

In [121]:
from mlxtend.frequent_patterns import apriori, association_rules

### Find frequent itemsets

<font color="red">Before continuing, take a look at the documentation for our `apriori` method.</font>

In [None]:
apriori?

<font color="red">Create a variable called `freq_itemsets` to save our results from the `apriori` method call. In this call, set `df_enc` as our DataFrame, `min_support` to be `0.03`, and `use_colnames` to `True`.</font>
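To build intuition for what `min_support` controls, here is a plain-Python sketch on four hypothetical transactions. It enumerates itemsets by brute force rather than using Apriori's candidate pruning, so it is an illustration of the support threshold only, not of how `mlxtend` is implemented. The support of an itemset is the fraction of transactions that contain all of its items.

```python
from itertools import combinations

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"TEACUP", "SAUCER"},
    {"TEACUP", "SAUCER", "CANDLE"},
    {"TEACUP"},
    {"CANDLE", "SAUCER"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.5  # illustrative threshold, much higher than the 0.03 used here
items = sorted(set().union(*transactions))

# Brute-force check of every 1-item and 2-item combination.
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        s = support(combo, transactions)
        if s >= min_support:
            frequent[frozenset(combo)] = s

for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

Lowering the threshold admits rarer itemsets, which is why a value as small as 0.03 is reasonable for a catalogue with thousands of products.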

In [123]:
# TODO write code here

In [None]:
freq_itemsets.sort_values('support', ascending=False)

### Find the association rules

<font color="red">Again, take a look at the documentation for our `association_rules` method before continuing.</font>

In [None]:
association_rules?

In [None]:
association_rules(freq_itemsets, metric="confidence", min_threshold=0.60, num_itemsets = len(df_enc)).sort_values("confidence", ascending=False)

With a minimum confidence threshold of 60%, this is what we have (above).  
If someone buys a *green regency teacup and saucer* they are likely to also buy a *roses regency teacup and saucer*.
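The confidence of a rule "A implies B" is support(A and B) divided by support(A), and lift divides that confidence by support(B). The sketch below computes both by hand on five hypothetical transactions; the item names echo the rule above, but the baskets themselves are invented for illustration.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"GREEN TEACUP", "ROSES TEACUP"},
    {"GREEN TEACUP", "ROSES TEACUP"},
    {"GREEN TEACUP", "ROSES TEACUP", "CANDLE"},
    {"CANDLE"},
    {"CANDLE", "ROSES TEACUP"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"GREEN TEACUP"}
consequent = {"ROSES TEACUP"}

# Confidence: how often the consequent appears when the antecedent does.
confidence = support(antecedent | consequent) / support(antecedent)

# Lift: confidence relative to how common the consequent is overall.
lift = confidence / support(consequent)

print(f"confidence = {confidence:.2f}, lift = {lift:.2f}")
```

A lift above 1 suggests the two items are bought together more often than their individual popularity alone would predict.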

<font color="red">What if we look at a confidence threshold of 20%? What are some other rules?</font>

In [127]:
# TODO write code here

## Conclusions

<font color="red">Discuss, based on the association rules found above, what are some strategies this online store could employ to increase sales?</font>