In [None]:
'''
Association rules and collaborative filtering are popular methods in marketing for cross-selling 
products associated with an item that a consumer is considering. 

Association rule discovery in marketing is termed “market basket analysis” and is aimed at discovering 
which groups of products tend to be purchased together.

In collaborative filtering, the goal is to provide personalized recommendations that leverage 
user-level information. User-based collaborative filtering starts with a user, then finds users 
who have purchased a similar set of items or ranked items in a similar fashion, and makes a 
recommendation to the initial user based on what the similar users purchased or liked. 
'''

In [1]:
import heapq
from collections import defaultdict
import pandas as pd
import matplotlib.pylab as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split

In [None]:
# %pip install surprise

In [None]:
'''
Put simply, association rules, or affinity analysis, constitute a study of “what goes with what.”
This method is also called market basket analysis because it originated with the study of customer
transaction databases to determine dependencies between purchases of different items.

The same idea applies outside retail. For example, a medical researcher might want to learn which
symptoms appear together. In law, word combinations that appear too often might indicate plagiarism.
'''

In [None]:
'''
The Apriori Algorithm


The key idea of the algorithm is to begin by generating frequent itemsets with just one item 
(one-itemsets) and to recursively generate frequent itemsets with two items, then with three 
items, and so on, until we have generated frequent itemsets of all sizes.

It is easy to generate frequent one-itemsets. All we need to do is to count, for each item, 
how many transactions in the database include the item. These transaction counts are the supports 
for the one-itemsets. We drop one-itemsets that have support below the desired minimum support to
create a list of the frequent one-itemsets.

To generate frequent two-itemsets, we use the frequent one-itemsets. The reasoning is that if a 
certain one-itemset did not exceed the minimum support, any larger size itemset that includes 
it will not exceed the minimum support. In general, generating k-itemsets uses the frequent 
(k − 1)-itemsets that were generated in the preceding step. Each step requires a single run 
through the database, and therefore the Apriori algorithm is very fast even for a large number 
of unique items in a database.

Confidence and Support
In addition to support, which we described earlier, there is another measure that expresses the 
degree of uncertainty about the if–then rule. This is known as the confidence of the rule. This 
measure compares the co-occurrence of the antecedent and consequent itemsets in the database to
the occurrence of the antecedent itemsets. Confidence is defined as the ratio of the number of 
transactions that include all antecedent and consequent itemsets (namely, the support) to the 
number of transactions that include all the antecedent itemsets.

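Lift Ratio
A better way to judge the strength of an association rule is to compare the confidence of the 
rule with a benchmark value, where we assume that the occurrence of the consequent itemset in a 
transaction is independent of the occurrence of the antecedent for each rule. Under independence, 
the benchmark confidence is simply the support of the consequent. The lift ratio is the confidence 
of the rule divided by this benchmark; a lift ratio greater than 1 suggests that the antecedent 
and consequent occur together more often than independence would predict.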

Leverage 
Leverage measures the deviation from independence. It ranges from −1 to 1 and is 0 if the antecedent 
and consequent are independent. In a sales setting, leverage tells us how much more frequently the 
items are bought together compared to their independent sales. 
'''
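
The level-wise search described above can be sketched in plain Python. This is a minimal illustration, not the mlxtend implementation: the function name `frequent_itemsets` and the variable `itemsets_by_hand` are made up for this example, and the transactions are transcribed by hand from the Faceplate incidence matrix printed in the next cell.

```python
from itertools import combinations

# Faceplate transactions, transcribed from the incidence matrix in this notebook
transactions = [
    {'Red', 'White', 'Green'},
    {'White', 'Orange'},
    {'White', 'Blue'},
    {'Red', 'White', 'Orange'},
    {'Red', 'Blue'},
    {'White', 'Blue'},
    {'Red', 'Blue'},
    {'Red', 'White', 'Blue', 'Green'},
    {'Red', 'White', 'Blue'},
    {'Yellow'},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support (illustrative sketch)."""
    n = len(transactions)

    def support(itemset):
        # fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent one-itemsets
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Candidate k-itemsets are unions of two frequent (k-1)-itemsets
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        # One pass over the data to count the surviving candidates
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return result


itemsets_by_hand = frequent_itemsets(transactions, min_support=0.2)
for itemset, s in sorted(itemsets_by_hand.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is where the Apriori reasoning pays off: a candidate such as (Red, White, Orange) is discarded without counting because (Red, Orange) is already known to be infrequent.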

In [2]:
# Apriori algorithm

# Load and preprocess data set 
fp_df = pd.read_csv('DataMining/Faceplate.csv')
fp_df.set_index('Transaction', inplace=True)
print(fp_df)
# create frequent itemsets
itemsets = apriori(fp_df, min_support=0.2, use_colnames=True)
# convert into rules
rules = association_rules(itemsets, metric='confidence', min_threshold=0.5)
# show the six rules with the highest lift, dropping the columns not discussed above
print(rules.sort_values(by=['lift'], ascending=False)
      .drop(columns=['antecedent support', 'consequent support', 'conviction'])
      .head(6))

             Red  White  Blue  Orange  Green  Yellow
Transaction                                         
1              1      1     0       0      1       0
2              0      1     0       1      0       0
3              0      1     1       0      0       0
4              1      1     0       1      0       0
5              1      0     1       0      0       0
6              0      1     1       0      0       0
7              1      0     1       0      0       0
8              1      1     1       0      1       0
9              1      1     1       0      0       0
10             0      0     0       0      0       1
       antecedents   consequents  support  confidence      lift  leverage
14    (Red, White)       (Green)      0.2         0.5  2.500000      0.12
15         (Green)  (Red, White)      0.2         1.0  2.500000      0.12
4          (Green)         (Red)      0.2         1.0  1.666667      0.08
13  (White, Green)         (Red)      0.2         1.0  1.666667     
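
The measures in the output above can be checked by hand. The sketch below recomputes support, confidence, lift, and leverage for the rule (Red, White) → (Green) directly from the ten transactions, using the standard definitions: confidence is the rule support divided by the antecedent support, lift is the confidence divided by the consequent support, and leverage is the rule support minus the product of the antecedent and consequent supports. The helper `support` and the variable names are assumptions made for this illustration.

```python
# Faceplate transactions, transcribed from the incidence matrix printed above
transactions = [
    {'Red', 'White', 'Green'},
    {'White', 'Orange'},
    {'White', 'Blue'},
    {'Red', 'White', 'Orange'},
    {'Red', 'Blue'},
    {'White', 'Blue'},
    {'Red', 'Blue'},
    {'Red', 'White', 'Blue', 'Green'},
    {'Red', 'White', 'Blue'},
    {'Yellow'},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent = {'Red', 'White'}
consequent = {'Green'}

rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
lift = confidence / support(consequent)
leverage = rule_support - support(antecedent) * support(consequent)

print(f'support={rule_support:.2f} confidence={confidence:.2f} '
      f'lift={lift:.2f} leverage={leverage:.2f}')
```

The printed values can be compared with rule 14 in the table above.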

In [5]:
# Association rules for the Faceplate data with a minimum confidence of 0.7

# create frequent itemsets
itemsets = apriori(fp_df, min_support=2/len(fp_df), use_colnames=True)
# and convert into rules
rules = association_rules(itemsets, metric='confidence', min_threshold=0.7)
print(rules.sort_values(by=['lift'], ascending=False)
      .drop(columns=['antecedent support', 'consequent support', 'conviction'])
      .head(6))

      antecedents   consequents  support  confidence      lift  leverage
5         (Green)  (Red, White)      0.2         1.0  2.500000      0.12
0         (Green)         (Red)      0.2         1.0  1.666667      0.08
4  (White, Green)         (Red)      0.2         1.0  1.666667      0.08
1        (Orange)       (White)      0.2         1.0  1.428571      0.06
2         (Green)       (White)      0.2         1.0  1.428571      0.06
3    (Red, Green)       (White)      0.2         1.0  1.428571      0.06


In [6]:
all_books_df = pd.read_csv('DataMining/CharlesBookClub.csv')
# create the binary incidence matrix
ignore = ['Seq#', 'ID#', 'Gender', 'M', 'R', 'F', 'FirstPurch', 'Related Purchase',
          'Mcode', 'Rcode', 'Fcode', 'Yes_Florence', 'No_Florence']
count_books = all_books_df.drop(columns=ignore)
count_books[count_books > 0] = 1
# create frequent itemsets and rules
itemsets = apriori(count_books, min_support=200/4000, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.5)
# Display 25 rules with highest lift
rules.sort_values(by=['lift'], ascending=False).head(25)

                     antecedents          consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction
64            (RefBks, YouthBks)  (ChildBks, CookBks)             0.08125               0.242  0.05525        0.68  2.809917  0.035588     2.36875
73            (DoItYBks, RefBks)  (ChildBks, CookBks)              0.0925               0.242  0.06125    0.662162  2.736207  0.038865     2.24368
60          (DoItYBks, YouthBks)  (ChildBks, CookBks)             0.10325               0.242    0.067     0.64891  2.681448  0.042014    2.158993
80             (RefBks, GeogBks)  (ChildBks, CookBks)             0.08175               0.242  0.05025    0.614679  2.539995  0.030467     1.96719
69           (GeogBks, YouthBks)  (ChildBks, CookBks)              0.1045               0.242  0.06325    0.605263  2.501087  0.037961    1.920267
77           (DoItYBks, GeogBks)  (ChildBks, CookBks)               0.101               0.242   0.0605     0.59901  2.475248  0.036058    1.890321
66  (ChildBks, GeogBks, CookBks)           (YouthBks)              0.1095             0.23825  0.06325    0.577626  2.424452  0.037162    1.803495
71   (ChildBks, RefBks, CookBks)           (DoItYBks)              0.1035             0.25475  0.06125    0.591787  2.323013  0.034883    1.825642
47           (DoItYBks, GeogBks)           (YouthBks)               0.101             0.23825   0.0545    0.539604  2.264864  0.030437    1.654554
62   (ChildBks, RefBks, CookBks)           (YouthBks)              0.1035             0.23825  0.05525    0.533816  2.240573  0.030591    1.634013


In [None]:
'''
Collaborative filtering is a popular technique used by recommendation systems. The term 
collaborative filtering is based on the notions of identifying relevant items for a specific user
from the very large set of items (“filtering”) by considering preferences of many users 
(“collaboration”).
'''
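
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is a simplification for illustration, not the surprise implementation used below: the ratings, user names, and the helpers `cosine_similarity` and `predict_rating` are all hypothetical. Similarity between two users is the cosine of their rating vectors over the items both have rated, and a missing rating is predicted as the similarity-weighted average of the ratings given by the other users.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    'ann': {'A': 5, 'B': 3, 'C': 4},
    'bob': {'A': 4, 'B': 2, 'C': 5, 'D': 4},
    'cat': {'A': 1, 'B': 5, 'D': 2},
}


def cosine_similarity(u, v):
    """Cosine similarity of two users, computed over the items both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_rating(user, item):
    """Similarity-weighted average of the ratings other users gave to the item."""
    neighbors = [(cosine_similarity(user, other), other_ratings[item])
                 for other, other_ratings in ratings.items()
                 if other != user and item in other_ratings]
    total_similarity = sum(sim for sim, _ in neighbors)
    if total_similarity == 0:
        return None  # no comparable user has rated this item
    return sum(sim * rating for sim, rating in neighbors) / total_similarity


# ann has not rated item D; bob and cat have
predicted = predict_rating('ann', 'D')
print(f'Predicted rating of item D for ann: {predicted:.2f}')
```

Because ann's ratings resemble bob's more than cat's, the prediction is pulled toward bob's rating of item D.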

In [7]:
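# Collaborative filtering: simulate random ratings and define a helper
# that keeps the n highest predicted ratings for each user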

import random
random.seed(0)
nratings = 5000
randomData = pd.DataFrame({
    'itemID': [random.randint(0,99) for _ in range(nratings)],
    'userID': [random.randint(0,999) for _ in range(nratings)],
    'rating': [random.randint(1,5) for _ in range(nratings)],
})
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    byUser = defaultdict(list)
    for p in predictions:
        byUser[p.uid].append(p)
    
    # For each user, reduce predictions to top-n
    for uid, userPredictions in byUser.items():
        byUser[uid] = heapq.nlargest(n, userPredictions, key=lambda p: p.est)
    return byUser

In [8]:
# Convert the data set into the format required by the surprise package
# The columns must correspond to user id, item id, and ratings (in that order)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(randomData[['userID', 'itemID', 'rating']], reader)
# Split into training and test set
trainset, testset = train_test_split(data, test_size=.25, random_state=1)
## User-based filtering
# compute cosine similarity between users 
sim_options = {'name': 'cosine', 'user_based': True}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
# predict ratings for all pairs (u, i) that are NOT in the training set.
predictions = algo.test(testset) 
# Print the recommended items for each user
top_n = get_top_n(predictions, n=4)
print('Top-4 recommended items for each user')
for uid, user_ratings in list(top_n.items())[:5]:
    print('User {}'.format(uid))
    for prediction in user_ratings:
        print('  Item {0.iid} ({0.est:.2f})'.format(prediction), end='')
    print()

Computing the cosine similarity matrix...
Done computing similarity matrix.
Top-4 recommended items for each user
User 6
  Item 6 (5.00)  Item 77 (2.50)  Item 60 (1.00)
User 222
  Item 77 (3.50)  Item 75 (2.78)
User 424
  Item 14 (3.50)  Item 45 (3.10)  Item 54 (2.34)
User 87
  Item 27 (3.00)  Item 54 (3.00)  Item 82 (3.00)  Item 32 (1.00)
User 121
  Item 98 (3.48)  Item 32 (2.83)


In [9]:
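## Item-based filtering
# retrain on the full data set, this time computing cosine similarity between items
trainset = data.build_full_trainset()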
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)
# Predict rating for user 383 and item 7
algo.predict(383, 7)

Computing the cosine similarity matrix...
Done computing similarity matrix.


Prediction(uid=383, iid=7, r_ui=None, est=2.3661840936304324, details={'actual_k': 4, 'was_impossible': False})

In [None]:
'''
Association Rules vs. Collaborative Filtering 

1. Frequent itemsets vs. personalized recommendations: Association rules look for frequent item 
combinations and will provide recommendations only for those items. In contrast, collaborative 
filtering provides personalized recommendations for every item, thereby catering to users with
unusual taste. In this sense, collaborative filtering is useful for capturing the “long tail” of
user preferences, while association rules look for the “head.” 

2. Transactional data vs. user data: Association rules provide recommendations of items based on
their co-purchase with other items in many transactions/baskets. In contrast, collaborative 
filtering provides recommendations of items based on their co-purchase or co-rating by even a 
small number of other users.
 
3. Binary data and ratings data: Association rules treat items as binary data (1 = purchase, 
0 = nonpurchase), whereas collaborative filtering can operate on either binary data or on 
numerical ratings.

4. Two or more items: In association rules, the antecedent and consequent can each include one or
more items (e.g., IF milk THEN cookies and cornflakes). Hence, a recommendation might be a bundle
of the item of interest with multiple items (“buy milk, cookies, and cornflakes and receive 10% 
discount”). In contrast, in collaborative filtering, similarity is measured between pairs of items
or pairs of users. A recommendation will therefore be either for a single item (the most popular 
item purchased by people like you, which you haven’t purchased), or for multiple single items 
which do not necessarily relate to each other (the top two most popular items purchased by people
like you, which you haven’t purchased).
'''