In this notebook, we work through solutions for both the A-priori algorithm and collaborative filtering.

In [None]:
import pandas as pd
import numpy as np

# A-priori algorithm

The theoretical description can be found on slide 26 of the lecture notes. Here is a brief introduction.
The rationale is:
- Candidate generation: find the support of all itemsets of size X (start with X=1)
- Retain all itemsets that meet the minimum support level (minSup)
- Repeat for size X+1 until no more itemsets meet the criteria or X = |I|, where I is the set of all items
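
The three steps above can be sketched as a single loop. This is a minimal illustration rather than the notebook's reference solution: the function name `frequent_itemsets` and its return format (a dict mapping each frequent itemset to its support) are assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Hypothetical level-wise search; returns {itemset: support}."""
    baskets = [frozenset(t) for t in transactions]
    # Level 1 candidates: every distinct item as a singleton itemset.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates whose support reaches min_sup.
        retained = {}
        for candidate in candidates:
            sup = sum(1 for b in baskets if candidate <= b) / len(baskets)
            if sup >= min_sup:
                retained[candidate] = sup
        frequent.update(retained)
        # Join retained itemsets pairwise to build candidates one item larger.
        size += 1
        candidates = {a | b for a, b in combinations(retained, 2)
                      if len(a | b) == size}
    return frequent


result = frequent_itemsets([["a", "b", "c"], ["a", "b", "d"], ["b", "c"]], 0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```

The loop stops on its own once no itemset of the current size meets `min_sup`, because an empty retained set produces no new candidates.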

In [None]:
from itertools import combinations,chain

## Write a function that combines itemsets of size level-1 into new candidate itemsets of size level

PLEASE REVIEW the lecture notes, and write the code below!
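
As a hint, one possible way to generate candidates is to join the smaller itemsets pairwise and keep only the unions that have exactly the target size. The sketch below is an assumption about how this could be done, not the expected answer; the helper name `join_candidates` is hypothetical, and treating a plain string as a single-item itemset is a choice made for this example.

```python
from itertools import combinations


def join_candidates(itemsets, level):
    """Hypothetical helper: join itemsets pairwise into itemsets of size `level`."""
    # Treat a bare string as a one-item itemset, anything else as a collection.
    normalized = [frozenset([s]) if isinstance(s, str) else frozenset(s)
                  for s in itemsets]
    outcome = set()
    for left, right in combinations(normalized, 2):
        union = left | right
        # Keep the union only if the pair differs by exactly enough items.
        if len(union) == level:
            outcome.add(union)
    return outcome


print(join_candidates(["a", "b", "c"], 2))
```

Storing each result as a `frozenset` is what allows itemsets to be members of the outer `set`.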

In [None]:
def mingle(items, level):
    # mingle builds all candidate itemsets of size `level` from the given items or itemsets
    # (level is the target itemset size, most often 2)
    # Note: to store sets in sets, use a frozenset to add to a set.
    
    outcome = set()
    """
    Write your code here
    """        
    return outcome

In [None]:
# test the written function
assert  mingle(["a","b","c"], 2) == {frozenset({'a', 'c'}), 
                                     frozenset({'b', 'c'}), 
                                     frozenset({'a', 'b'})}

assert mingle([["a","b"],["a","c"],["a","d"]], 3) == {frozenset({'a', 'c', 'd'}), 
                                               frozenset({'a', 'b', 'd'}),
                                               frozenset({'a', 'b', 'c'})}
print("tests passed")

## Write a function that calculates the support of an itemset in a transactions database.

PLEASE REVIEW the lecture notes, and write the code below!
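
Support is the fraction of transactions that contain every item of the itemset. The sketch below illustrates that definition under a few assumptions: the name `support_fraction` is hypothetical, it omits the `level` argument used by the stub in the next cell, and it treats a plain string as a single item.

```python
def support_fraction(itemset, transactions):
    """Hypothetical helper: share of transactions containing all items of `itemset`."""
    # A bare string is one item; any other input is taken as a collection of items.
    wanted = {itemset} if isinstance(itemset, str) else set(itemset)
    hits = sum(1 for transaction in transactions if wanted.issubset(transaction))
    return hits / len(transactions)


baskets = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"], ["a", "c"]]
print(support_fraction("a", baskets))
print(support_fraction(["a", "b"], baskets))
```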

In [None]:
def support(itemset,transactions,level):
    
    count = 0
    """
    Write your code here
    """     
    return count/len(transactions)

In [None]:
# test the written function
assert support("a", [["a","b","c"], ["a","b","d"], ["b","c"], ["a","c"]], 1) == 0.75
assert support("d", [["a","b","c"], ["a","b","d"], ["b","c"], ["a","c"]], 1) == 0.25
assert support(["a","b"], [["a","b","c"], ["a","b","d"], ["b","c"], ["a","c"]], 2) == 0.5
print("tests passed")

## Write the Apriori function

PLEASE REVIEW the lecture notes, and write the code below!

In [None]:
# for now this function will just print some results for us to observe, 
# rather than return them in a data structure

def apriori(level,transactions,items,minsup):
    
    print("\nLevel: "+str(level))
    
    # set for items with support value that is high enough
    retain = set()
    
    # find items with support value that is high enough
    """
    Write your code here
    """     
    print("Retain: "+str(retain))
    
    level = level+1
        
    # generate new candidates
    newsets=mingle(retain,level)
    print("New itemsets: "+str(newsets))    
    
    # recurse only while candidates remain and the itemset size does not exceed the number of items passed in
    if len(newsets)!=0 and level<len(items)+1:
        apriori(level,transactions,newsets,minsup)

In [None]:
apriori(1, [["a","b","c"], ["a","b","d"], ["b","c"]], {"a","b","c", "d"}, 0.6)

## Use this to run the complete algorithm.

In [None]:
transactions = []
items = set()

# open the data and save all transactions and items;
# the with-block closes the file once reading is done
with open('data/baskets.csv', 'r') as file:
    for line in file:
        line = line.replace('\n', '')
        litems = line.split(',')
        transactions.append(litems)
        for item in litems:
            items.add(item)

# apply Apriori algorithm
apriori(1, transactions, items, 0.6)

# Collaborative filtering

Here we use cosine similarity to find similar users, whose ratings we can then use to recommend products.
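
To make the measure concrete, the sketch below computes the cosine similarity of two rating vectors by hand: the dot product divided by the product of the vector lengths. The function name `cosine_sim` and the choice to return 0.0 when a user has no ratings at all are assumptions made for this illustration.

```python
import math


def cosine_sim(u, v):
    """Hypothetical helper: cosine of the angle between rating vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: an all-zero vector (no ratings) is similar to nobody.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_sim([5, 3, 0], [10, 6, 0]))  # same direction
print(cosine_sim([1, 0], [0, 1]))         # no movie in common
```

Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, which is 1 minus the similarity, so the two imported functions are not interchangeable.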

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine

## Load data

In [None]:
# load data
ratings = pd.read_csv('data/ratings.csv')

# sample dataset
# be careful, large dataset!
ratings = ratings[:10000]

print(ratings.head())

# print some information
noMovies = len(ratings['movieId'].unique())
noUsers = len(ratings['userId'].unique())
print(str(noMovies)+" movies from "+str(noUsers)+' users')

## Create an empty perons_ratings matrix

In [None]:
perons_ratings = np.zeros(shape=(noUsers,noMovies))
perons_ratings

Map each movieId to a sequential column index for the perons_ratings matrix, because the raw movieId values do not match the 0-based positions that a matrix uses.

In [None]:
movieIds = {}
midi = 0
for value in ratings['movieId'].unique():
    movieIds[value]=midi
    midi = midi + 1

Populate the perons_ratings matrix by looping over all the rows in the ratings dataframe.

In [None]:
for index, line in ratings.iterrows():
    uid = int(line['userId'])-1
    mid = movieIds[line['movieId']]
    rating = line['rating']
    # store the rating in the perons_ratings matrix at row uid (the user) and column mid (the movie)
    perons_ratings[uid,mid]=rating
    
perons_ratings

Next we need to write two functions: one to `find similar users` and one to `find new products`.

**Note: the best solution is to write a class and keep all shared state inside it instead of in global variables. However, to keep the code easy to follow, we write two functions and use global variables here (this is not a good habit).**

If you are interested and familiar with Python classes, you can likely do better by writing these functions as methods of a class.
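
For readers who want to try the class-based variant, here is one possible shape. Everything in it is an assumption for illustration: the class name `UserRecommender`, its method names, the pure-Python cosine helper, and in particular the scoring rule (the mean rating given by similar users who rated the movie), which the exercise does not prescribe.

```python
import math


class UserRecommender:
    """Hypothetical wrapper that keeps the matrix and thresholds as attributes."""

    def __init__(self, ratings_matrix, min_cos, min_score):
        self.ratings = [list(row) for row in ratings_matrix]
        self.min_cos = min_cos
        self.min_score = min_score

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def similar_users(self, person):
        """Row indices of all other users with cosine similarity >= min_cos."""
        target = self.ratings[person]
        return [other for other, row in enumerate(self.ratings)
                if other != person and self._cosine(target, row) >= self.min_cos]

    def new_products(self, person):
        """Map unrated movie index -> assumed score (mean rating of similar users)."""
        neighbours = self.similar_users(person)
        recommendations = {}
        for movie, own_rating in enumerate(self.ratings[person]):
            if own_rating != 0:
                continue  # the user already rated this movie
            seen = [self.ratings[n][movie] for n in neighbours
                    if self.ratings[n][movie] != 0]
            if seen and sum(seen) / len(seen) >= self.min_score:
                recommendations[movie] = sum(seen) / len(seen)
        return recommendations


recommender = UserRecommender([[5, 4, 0], [5, 4, 5], [0, 1, 5]],
                              min_cos=0.7, min_score=4.0)
print(recommender.new_products(0))
```

Holding the matrix and thresholds on the instance removes the need for the global variables used in the function-based version.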

## Write a function to find similar users

In [None]:
def findSimilarUsers(person_number, minCos):
    # list for similar users
    similar_users = []
    
    # for all other users
    """
    Write your code here; perons_ratings can be treated as a global variable, while minCos is passed in as a parameter
    """   
    print("#similar users: "+str(len(similar_users)))
    return similar_users

## Write a function to find new products

In [None]:
def findNewProducts(similar_users,person_number, minScore):
    if len(similar_users)>0:
        # movie_number is the column index of the perons_ratings matrix, i.e., a movie
        for movie_number in range(len(perons_ratings[person_number])):
            # if there is no rating for our current user, calculate new score
                """
                Write your code here, we can consider perons_ratings as global variable
                """ 

## Complete the experiment

In [None]:
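# minimum cosine similarity
minCos = 0.8
# minimum score for recommending a movie (assumed threshold, adjust as needed)
minScore = 3.5

# loop over every user (row) in the matrix, including the last one
for row_number in range(len(perons_ratings)):
    print("\nFinding recommendations for user "+str(row_number))
    similarUsers = findSimilarUsers(row_number, minCos)
    findNewProducts(similarUsers, row_number, minScore)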