# Example Code

Make your way through this notebook and make sure you understand what the code is doing. You can use this understanding to complete your assignment.

In [None]:
# import the packages we'll use in this notebook
import pandas as pd
import collections
from ast import literal_eval
from itertools import permutations
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

### Read in the example data

Because our data is stored in a CSV file, pandas will read the values in our Colours column as plain strings.
We already know that column holds lists, so we can pass a converter that evaluates each value literally and turns it back into a Python list.

In [None]:
# read data into a DataFrame
df = pd.read_csv('data/example_data.csv', converters={'Colours': literal_eval}) 
df.head(5)
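
To see why the converter matters, here is a minimal sketch using only the standard library. The sample string is made up for illustration and is not taken from example_data.csv.

```python
from ast import literal_eval

# a cell as it would arrive from the CSV: text that merely looks like a list
raw_cell = "['Red', 'White', 'Blue']"  # hypothetical value, not from the real file

# safely evaluate the text as a Python literal
parsed = literal_eval(raw_cell)

print(type(raw_cell).__name__, len(raw_cell))  # a str, so len() counts characters
print(type(parsed).__name__, len(parsed))      # a list, so len() counts items
```

Without the converter, later steps such as `df.Colours.str.len()` would count characters rather than items.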

In [None]:
print(df.shape) # dataframe shape 
print(df.shape[0]) # number of rows 
print(df.shape[1]) # number of columns

### Drop unwanted columns

In [None]:
# view column names
df.columns

In [None]:
# drop unwanted column
df.drop('Price', axis=1, inplace=True)

In [None]:
# view column names
df.columns

In [None]:
# change pandas default index to the transaction ID
df.set_index('Transaction_ID', inplace=True)
df.head(2)

#### Explore the data

In [None]:
df.describe()

### Format Data

We need to re-format the data so we can (later) create our binary indicator matrix (one-hot encoding; see more: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).

First, let's split out that column of lists and put each item in its own column.

In [None]:
# get the length of the longest list so we know how many columns to create
max_len = df.Colours.str.len().max()
print(max_len)

# generate column names, one per position in the longest list
cols = list(range(max_len))
print(cols)

# create a new dataframe with split out columns
df_split = pd.DataFrame(df["Colours"].tolist(), columns=cols)
df_split.head()

Now that we have our data in a structure we're happy with, let's do some exploring with our association rules in mind. 

Let's create a list of lists, where each row in the df_split is a list in our list, i.e. our list will contain all transactions, each of which is also a list.

In [None]:
# instantiate an empty list
transactions = []
# iterate through each row and column to create the list
for i in range(0, len(df_split)): 
    transactions.append([str(df_split.values[i,j]) for j in range(0, len(df_split.columns))])

# look at first two lists
transactions[:2]
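
The string 'None' that shows up in these lists comes from padding: shorter lists are filled out to the width of the longest one, and `str()` turns each missing cell into the text 'None'. The sketch below mimics that in plain Python. The baskets are hypothetical, and the assumption that pandas pads with `None` is inferred from the 'None' value this notebook removes later.

```python
# hypothetical baskets of unequal length
baskets = [['Red', 'White'], ['Green'], ['Blue', 'Red', 'White']]

# width of the longest basket
width = max(len(basket) for basket in baskets)

# pad each basket with None so every row has the same number of columns
padded = [basket + [None] * (width - len(basket)) for basket in baskets]

# converting every cell to a string turns None into the text 'None'
as_strings = [[str(cell) for cell in row] for row in padded]
print(as_strings)
```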

Now let's create one master list by flattening our list of lists.

In [None]:
# flatten the list of lists to get a list containing all items in the dataset
flattened = [item for transaction in transactions for item in transaction]
print(len(flattened))

Let's view only the unique items in the list.

In [None]:
# create a list of unique items from the flattened list
# use the set() method because a set by definition contains only unique items
items = list(set(flattened))

# print the count of unique items which is the length of the list
print('# of items:',len(items))

# sort items alphabetically
items.sort() 
print(items)

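Let's drop the None value from our lists.

In [None]:
# note: you can filter out any string here, e.g. "nan" or punctuation
# list.remove() would delete only the first match, so rebuild the lists instead
flattened = [item for item in flattened if item != 'None']
items = [item for item in items if item != 'None']
len(flattened)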

Let's count how many rules we could generate for this dataset if we looked at every ordered arrangement of 3 items.

In [None]:
# we'll call these arrangements rules and use the permutations function we imported
# we pass the items list to permutations
# and set the itemset size limit to 3
rules = list(permutations(items, 3))
print('# of rules:',len(rules))
print(rules[:5])
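
Because `permutations` treats each ordering as distinct, the count grows as n × (n−1) × (n−2) for n items. The sketch below compares that with the number of unordered 3-itemsets; the item names are hypothetical stand-ins for the notebook's `items` list.

```python
from itertools import combinations, permutations
from math import comb, perm

sample_items = ['Blue', 'Green', 'Orange', 'Red', 'White']  # hypothetical items
n = len(sample_items)

ordered = list(permutations(sample_items, 3))    # order matters
unordered = list(combinations(sample_items, 3))  # order ignored

print(len(ordered), perm(n, 3))    # both count ordered triples
print(len(unordered), comb(n, 3))  # both count unordered 3-itemsets
```

Each unordered 3-itemset appears 3! = 6 times among the ordered triples, which is why the first count is six times the second.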

We can look at the frequency of each item to see how popular it is in the dataset.

In [None]:
# let's use our list that contains all items and count them with the Counter class
item_freq = collections.Counter(flattened)
item_freq.most_common()

### One-hot encoding

We'll use TransactionEncoder to convert the data to one-hot encoding (our binary indicator matrix).

In [None]:
# create an encoder object that is fit to our list of lists we created earlier
encoder = TransactionEncoder().fit(transactions)

In [None]:
# create encoded array
onehot = encoder.transform(transactions)

In [None]:
# create a dataframe with the encoded values and the item names as columns
# We need to drop the "None" value or it will become its own column
df_onehot = pd.DataFrame(onehot, columns=encoder.columns_).drop('None', axis=1)
df_onehot.head()
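
To make the encoding step less of a black box, here is a rough pure-Python sketch of what a one-hot (binary indicator) matrix contains. It is not how TransactionEncoder is implemented, and the sample transactions are hypothetical.

```python
sample_transactions = [['Red', 'White'], ['Green', 'Red', 'White'], ['White']]

# one column per distinct item, sorted alphabetically
columns = sorted({item for t in sample_transactions for item in t})

# one row per transaction, one True/False flag per item
rows = [[item in t for item in columns] for t in sample_transactions]

print(columns)
for row in rows:
    print(row)
```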

Now we have our data encoded and we're ready to apply the apriori algorithm. Note how the values in our df are True and False, where previously we've seen 0s and 1s. Conceptually, they are the same thing.

The mlxtend apriori function accepts True/False or 1/0.

In [None]:
# Generate frequent itemsets with a minimum support of 20%
df_itemsets = apriori(df_onehot, min_support=0.2, use_colnames=True)

# df_itemsets is a DataFrame, let's see how many itemsets it contains
df_itemsets.shape

In [None]:
df_itemsets.sort_values(by=['support'], ascending=False)
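
The support column is the fraction of transactions that contain every item in an itemset, so `min_support=0.2` keeps only itemsets found in at least 20% of transactions. A minimal sketch of that calculation, using a hypothetical helper named `support` and made-up transactions:

```python
sample_transactions = [
    {'Red', 'White'},
    {'Green', 'Red', 'White'},
    {'White'},
    {'Orange', 'White'},
    {'Green', 'Red'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({'White'}, sample_transactions))
print(support({'Red', 'White'}, sample_transactions))
print(support({'Orange', 'Green'}, sample_transactions))
```

In this toy data, {Orange, Green} never occurs together, so it would be filtered out by any positive support threshold.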

Now we can use the association_rules() function to generate a dataframe with our rules and metrics.

In [None]:
# keep rules with a minimum confidence of 50%
rules_df = association_rules(df_itemsets, metric='confidence', min_threshold=0.5)
rules_df.shape

In [None]:
rules_df
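
Two of the metrics in this table follow directly from support: confidence is support(antecedent and consequent together) divided by support(antecedent), and lift is confidence divided by support(consequent). The sketch below works these out by hand on made-up transactions; the helper names are hypothetical and this is not mlxtend's implementation.

```python
sample_transactions = [
    {'Red', 'White'},
    {'Green', 'Red', 'White'},
    {'White'},
    {'Orange', 'White'},
    {'Green', 'Red'},
]

def support(itemset):
    """Fraction of sample transactions containing every item in itemset."""
    return sum(1 for t in sample_transactions if itemset <= t) / len(sample_transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)

# rule {Green} => {Red} on the toy data
print(confidence({'Green'}, {'Red'}))
print(lift({'Green'}, {'Red'}))
```

A lift above 1 means the consequent shows up more often alongside the antecedent than it does across all transactions.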

Now let's have a look at the 8 strongest rules, sorted by lift, and ignoring some of the fields we're not interested in.

In [None]:
# let's just have a look at our strongest rules
print(rules_df.sort_values(by=['lift'], ascending=False)
      .drop(columns=['antecedent support', 'consequent support', 'conviction'])
      .head(8))

#### Rule {Green}⇒{White, Red}

If green is purchased, then white and red will also be purchased with a confidence of 95%. This rule has a lift ratio of 2.27.

This rule has support of 0.21, which means it applies to 21% of transactions. A lift ratio well above 1 suggests the association is not occurring by chance, and the confidence is high, so this rule could be useful for a marketing campaign or for cross-selling on the website.

{Orange}⇒{White} may also be promising, with almost as much support (20%), the same high confidence (95%), and a lift of 1.32 that suggests the relationship is not by chance. However, beyond that rule, there don't seem to be any other rules meaningful enough to warrant spending money on our marketing campaign.
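
As a rough sanity check, the quoted figures can be turned back into the item supports behind them, using the standard definitions confidence = support / antecedent support and lift = confidence / consequent support. The inputs below are the rounded values quoted above, so the results are approximations, not values read from `rules_df`.

```python
# figures quoted in the text above (already rounded)
rules_quoted = {
    'Green -> White, Red': {'support': 0.21, 'confidence': 0.95, 'lift': 2.27},
    'Orange -> White': {'support': 0.20, 'confidence': 0.95, 'lift': 1.32},
}

derived = {}
for name, m in rules_quoted.items():
    derived[name] = {
        # confidence = support / antecedent support
        'antecedent_support': m['support'] / m['confidence'],
        # lift = confidence / consequent support
        'consequent_support': m['confidence'] / m['lift'],
    }
    print(name, {k: round(v, 2) for k, v in derived[name].items()})
```

If these approximations hold, the lower lift of the Orange rule reflects how common White is on its own: a high confidence is less surprising when the consequent already appears in most transactions.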