In [46]:
#pip install setuptools==58.2.0
#pip install scikit-surprise==1.1.3
#pip install mlxtend


In [47]:
%matplotlib inline

from pathlib import Path

import heapq
from collections import defaultdict

import pandas as pd
import matplotlib.pylab as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

from surprise import dataset, Reader
from surprise.prediction_algorithms import KNNBasic
from surprise.model_selection import train_test_split


In [48]:
DATA = Path('dmba')

### 2. Identifying Course Combinations. The Institute for Statistics Education at Statistics.com offers online courses in statistics and analytics, and is seeking information that will help in packaging and sequencing courses. Consider the data in the file CourseTopics.csv, the first few rows of which are shown in Table 14.14. These data are for purchases of online statistics courses at Statistics.com. Each row represents the courses attended by a single customer. The firm wishes to assess alternative sequencings and bundling of courses. Use association rules to analyze these data, and interpret several of the resulting rules.

In [49]:
ct_df = pd.read_csv(DATA / 'CourseTopics.csv')
print(ct_df.head(5))

   Intro  DataMining  Survey  Cat Data  Regression  Forecast  DOE  SW
0      1           1       0         0           0         0    0   0
1      0           0       1         0           0         0    0   0
2      0           1       0         1           1         0    0   1
3      1           0       0         0           0         0    0   0
4      1           1       0         0           0         0    0   0


In [50]:
ct_bool = pd.get_dummies(ct_df, prefix_sep='_', drop_first=True)

In [51]:
# apriori expects a boolean (True/False) purchase matrix
ct_df = ct_df.astype(bool)

ct_df.dtypes

Intro         bool
DataMining    bool
Survey        bool
Cat Data      bool
Regression    bool
Forecast      bool
DOE           bool
SW            bool
dtype: object

In [52]:
# create frequent itemsets
itemsets = apriori(ct_df, min_support=0.05, use_colnames=True)
itemsets


Unnamed: 0,support,itemsets
0,0.394521,(Intro)
1,0.178082,(DataMining)
2,0.186301,(Survey)
3,0.208219,(Cat Data)
4,0.208219,(Regression)
5,0.139726,(Forecast)
6,0.172603,(DOE)
7,0.221918,(SW)
8,0.054795,"(Intro, DataMining)"
9,0.060274,"(Survey, Intro)"


In [53]:
# and convert into rules
rules = association_rules(itemsets, metric='confidence', min_threshold=0.3)
rules.sort_values(by=['lift'], ascending=False).head(6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
7,(Survey),(Cat Data),0.186301,0.208219,0.063014,0.338235,1.62442,0.024222,1.196469,0.472405
6,(Cat Data),(Survey),0.208219,0.186301,0.063014,0.302632,1.62442,0.024222,1.166813,0.485482
9,(DOE),(SW),0.172603,0.221918,0.057534,0.333333,1.502058,0.019231,1.167123,0.403974
8,(Cat Data),(SW),0.208219,0.221918,0.063014,0.302632,1.36371,0.016806,1.115741,0.336844
5,(SW),(Intro),0.221918,0.394521,0.09589,0.432099,1.09525,0.008339,1.06617,0.111771
4,(Forecast),(Intro),0.139726,0.394521,0.052055,0.372549,0.944308,-0.00307,0.964983,-0.064157


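Lift is symmetric, so Survey and Cat Data go together in both directions: a customer who buys either one is about 1.6 times as likely to buy the other as a randomly chosen customer (lift 1.62). A customer who buys DOE is about 1.5 times as likely to buy SW (lift 1.50), and a customer who buys Cat Data is about 1.4 times as likely to buy SW (lift 1.36). A customer who buys SW is only about 1.1 times as likely to buy Intro (lift 1.10), which is close to independence. The Forecast rule has a lift below 1 (0.94), so buying Forecast makes buying Intro slightly less likely than average.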
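As a quick sanity check, the confidence and lift of the Survey and Cat Data rule can be recomputed by hand from the three support values in the table above. This is only a sketch; the variable names are illustrative and the numbers are copied from the printed output, so small rounding differences are expected.

```python
# Values copied from the rules table above (rule: Survey -> Cat Data).
support_both = 0.063014        # support of {Survey, Cat Data}
support_antecedent = 0.186301  # support of {Survey}
support_consequent = 0.208219  # support of {Cat Data}

# confidence = P(consequent | antecedent)
confidence = support_both / support_antecedent

# lift = confidence relative to the baseline rate of the consequent
lift = confidence / support_consequent

print(f"confidence = {confidence:.3f}, lift = {lift:.3f}")
```

Swapping antecedent and consequent changes the confidence but not the lift, which is why both directions of this rule share the same lift value.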

### 4. Cosmetics Purchases. The data shown in Table 14.15 and the output in Table 14.16 are based on a subset of a dataset on cosmetic purchases (Cosmetics.csv) at a large chain drugstore. The store wants to analyze associations among purchases of these items for purposes of point-of-sale display, guidance to sales personnel in promoting cross-sales, and guidance for piloting an eventual time-of-purchase electronic recommender system to boost cross-sales. Consider first only the data shown in Table 14.15, given in binary matrix form.

#### a. Select several values in the matrix and explain their meaning.

Each row is one transaction and each column is one item. A 1 in a cell means the customer bought that column's item in that transaction; a 0 means they did not. To read a value, look up the column name to see which item was or was not bought.
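To make the reading of the matrix concrete, here is a small sketch that turns one binary row into lists of items bought and not bought. The values are those of transaction 1 as printed in the Cosmetics.csv preview later in this notebook, restricted to six columns for brevity; the variable names are illustrative.

```python
# Values for transaction 1, copied from the Cosmetics.csv preview printed
# later in this notebook (only six of the columns are shown here).
row = {
    "Bag": 0,
    "Blush": 1,
    "Nail Polish": 1,
    "Brushes": 1,
    "Concealer": 1,
    "Eyebrow Pencils": 0,
}

# A 1 means the item was bought in this transaction, a 0 means it was not.
bought = [item for item, flag in row.items() if flag == 1]
not_bought = [item for item, flag in row.items() if flag == 0]

print("bought:", bought)
print("not bought:", not_bought)
```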

#### b. Consider the results of the association rules analysis shown in Table 14.16.

##### i. For the first row, explain the “confidence” output and how it is calculated.

The confidence output is the ratio of the number of transactions that include all of the antecedent and consequent items to the number of transactions that include all of the antecedent items. It estimates the conditional probability that the consequent is bought, given that the antecedent was bought.

It is calculated as follows:


confidence = (number of transactions that include all antecedent and consequent itemsets) / (number of transactions that include all the antecedent itemsets)

#####  ii. For the first row, explain the “support” output and how it is calculated.

Support is the proportion of all transactions that include both the antecedent and consequent itemsets.

It is calculated as follows:

support = (number of transactions that include both the antecedent and consequent itemsets) / (the total number of records in the database)

#####  iii. For the first row, explain the “lift” and how it is calculated.

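Lift is the confidence of the rule divided by the benchmark confidence, where the benchmark is the rate at which the consequent is bought across all transactions. A lift above 1 means the antecedent makes the consequent more likely than average.

It is calculated as follows:

Lift = (confidence) / (benchmark confidence)

with

benchmark confidence = (number of transactions with consequent itemsets) / (the total number of transactions in the database)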
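The three definitions above can be sketched in a few lines of plain Python. The transactions below are hypothetical toy data (not the Cosmetics data), and `count_containing` is an illustrative helper, not part of mlxtend.

```python
# Hypothetical toy transactions, each a set of items bought together.
transactions = [
    {"Blush", "Mascara"},
    {"Blush", "Mascara", "Lipstick"},
    {"Blush"},
    {"Lipstick"},
    {"Blush", "Mascara"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

antecedent = {"Blush"}
consequent = {"Mascara"}
n_total = len(transactions)

n_both = count_containing(antecedent | consequent, transactions)
n_antecedent = count_containing(antecedent, transactions)
n_consequent = count_containing(consequent, transactions)

support = n_both / n_total           # share of all transactions with both
confidence = n_both / n_antecedent   # share of antecedent transactions with both
benchmark = n_consequent / n_total   # baseline rate of the consequent
lift = confidence / benchmark

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here Blush appears in 4 of 5 transactions and Mascara in 3, always together with Blush, so the rule has support 0.60, confidence 0.75, and a lift of about 1.25.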

#####  iv. For the first row, explain the rule that is represented there in words.

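If a customer buys blush, concealer, mascara, eye shadow, and lipstick, then they also buy eyebrow pencils about 30% of the time (the confidence). That is about 7.2 times the rate at which eyebrow pencils are bought across all transactions (the lift, 7.19823). Note that confidence here is a conditional proportion, not a statistical confidence level.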

#### c. Now, use the complete dataset on the cosmetics purchases (in the file Cosmetics.csv). Using Python, apply association rules to these data (for apriori use min_support=0.1 and use_colnames=True, for association_rules use default parameters).

In [54]:
cm_df = pd.read_csv(DATA / 'Cosmetics.csv')
print(cm_df.head(5))

   Trans.   Bag  Blush  Nail Polish  Brushes  Concealer  Eyebrow Pencils  \
0        1    0      1            1        1          1                0   
1        2    0      0            1        0          1                0   
2        3    0      1            0        0          1                1   
3        4    0      0            1        1          1                0   
4        5    0      1            0        0          1                0   

   Bronzer  Lip liner  Mascara  Eye shadow  Foundation  Lip Gloss  Lipstick  \
0        1          1        1           0           0          0         0   
1        1          1        0           0           1          1         0   
2        1          1        1           1           1          1         1   
3        1          0        0           0           1          0         0   
4        1          1        1           1           0          1         1   

   Eyeliner  
0         1  
1         0  
2         0  
3         1 

In [55]:
cm_df.set_index('Trans. ', inplace=True)
print(cm_df.head(3))

         Bag  Blush  Nail Polish  Brushes  Concealer  Eyebrow Pencils  \
Trans.                                                                  
1          0      1            1        1          1                0   
2          0      0            1        0          1                0   
3          0      1            0        0          1                1   

         Bronzer  Lip liner  Mascara  Eye shadow  Foundation  Lip Gloss  \
Trans.                                                                    
1              1          1        1           0           0          0   
2              1          1        0           0           1          1   
3              1          1        1           1           1          1   

         Lipstick  Eyeliner  
Trans.                       
1               0         1  
2               0         0  
3               1         0  


In [56]:
# create frequent itemsets
itemsets = apriori(cm_df, min_support=0.1, use_colnames=True)



In [57]:
# and convert into rules
rules = association_rules(itemsets)

##### i. Interpret the first three rules in the output in words.

In [58]:
rules.sort_values(by=['lift'], ascending=False).head(3)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Brushes),(Nail Polish),0.149,0.28,0.149,1.0,3.571429,0.10728,inf,0.846063
22,"(Blush, Concealer, Eye shadow)",(Mascara),0.124,0.357,0.119,0.959677,2.688172,0.074732,15.9464,0.716895
5,"(Blush, Eye shadow)",(Mascara),0.182,0.357,0.169,0.928571,2.60104,0.104026,9.002,0.752492


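If someone buys brushes, they also buy nail polish every time in this dataset (confidence 1.0), which makes them 3.57 times as likely to buy nail polish as a randomly chosen customer.
If someone buys blush, concealer, and eye shadow, they also buy mascara 96% of the time, which is 2.69 times the baseline rate. If the antecedent is only blush and eye shadow, without concealer, they buy mascara 93% of the time, which is 2.60 times the baseline rate, so adding concealer changes the rule very little.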

##### ii. Reviewing the first couple of dozen rules, comment on their redundancy and how you would assess their utility.

In [59]:
rules.sort_values(by=['lift'], ascending=False).head(12)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Brushes),(Nail Polish),0.149,0.28,0.149,1.0,3.571429,0.10728,inf,0.846063
22,"(Blush, Concealer, Eye shadow)",(Mascara),0.124,0.357,0.119,0.959677,2.688172,0.074732,15.9464,0.716895
5,"(Blush, Eye shadow)",(Mascara),0.182,0.357,0.169,0.928571,2.60104,0.104026,9.002,0.752492
7,"(Nail Polish, Eye shadow)",(Mascara),0.131,0.357,0.119,0.908397,2.544529,0.072233,7.019417,0.698504
12,"(Concealer, Eye shadow)",(Mascara),0.201,0.357,0.179,0.890547,2.49453,0.107243,5.874682,0.749841
14,"(Bronzer, Eye shadow)",(Mascara),0.141,0.357,0.124,0.879433,2.463397,0.073663,5.333118,0.691567
24,"(Concealer, Eye shadow, Eyeliner)",(Mascara),0.13,0.357,0.114,0.876923,2.456367,0.06759,5.224375,0.681488
4,"(Blush, Mascara)",(Eye shadow),0.184,0.381,0.169,0.918478,2.410704,0.098896,7.593067,0.717137
18,"(Lipstick, Eye shadow)",(Mascara),0.129,0.357,0.11,0.852713,2.388552,0.063947,4.365632,0.667436
17,"(Lipstick, Mascara)",(Eye shadow),0.121,0.381,0.11,0.909091,2.386065,0.063899,6.809,0.660865


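Eye shadow and mascara lead to each other a lot: nine of the ten rules shown have one of them as the consequent and the other in the antecedent, with lifts all between roughly 2.4 and 2.7. These rules are largely redundant, because adding a further item (blush, concealer, bronzer, and so on) to the antecedent barely changes the lift. If we measured how strongly eye shadow alone leads to mascara, and vice versa, we could use that as a baseline, drop the rules that add little beyond it, and clean up the ruleset to observe the relationships among the other products more clearly.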
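One possible way to act on this is to drop a rule when a simpler rule with the same consequent has nearly the same lift. The sketch below is an assumption about how such pruning could be done, not a standard mlxtend feature: `prune_redundant` and its `min_gain` threshold are hypothetical names, and the three sample rules are copied from the output above.

```python
def prune_redundant(rules, min_gain=0.1):
    """Drop a rule when a simpler rule with the same consequent does almost as well.

    `rules` is a list of dicts with keys 'antecedents' and 'consequents'
    (frozensets) and 'lift'. A rule is dropped if another rule has the same
    consequent, a strict subset of its antecedent, and the extra items raise
    the lift by less than `min_gain`.
    """
    kept = []
    for rule in rules:
        redundant = any(
            other["consequents"] == rule["consequents"]
            and other["antecedents"] < rule["antecedents"]  # strict subset
            and rule["lift"] - other["lift"] < min_gain
            for other in rules
        )
        if not redundant:
            kept.append(rule)
    return kept

# Three rules copied from the output above (antecedents, consequents, lift).
sample = [
    {"antecedents": frozenset({"Blush", "Concealer", "Eye shadow"}),
     "consequents": frozenset({"Mascara"}), "lift": 2.688172},
    {"antecedents": frozenset({"Blush", "Eye shadow"}),
     "consequents": frozenset({"Mascara"}), "lift": 2.60104},
    {"antecedents": frozenset({"Concealer", "Eye shadow"}),
     "consequents": frozenset({"Mascara"}), "lift": 2.49453},
]

kept = prune_redundant(sample)
for rule in kept:
    print(sorted(rule["antecedents"]), "->", sorted(rule["consequents"]), rule["lift"])
```

With these inputs the three-item rule is dropped, because the simpler blush and eye shadow rule already reaches a lift within 0.1 of it. Since mlxtend stores antecedents and consequents as frozensets, `rules.to_dict('records')` should give records in this shape for the notebook's full `rules` table; the threshold of 0.1 is arbitrary and would need tuning.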