## Homework 2: Discovery of Frequent Itemsets and Association Rules

### Description
You are to solve the first sub-problem: to implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Remind that support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori algorithm implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions.

The implementation can be done using any big data processing framework, such as Apache Spark, Apache Flink, or no framework, e.g., in Java, Python, etc.  

The optional task for an extra bonus
Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered using the A-Priori algorithm in a dataset of sales transactions. The rules must have the support of at least s and confidence of at least c, where s and c are given as input parameters.

### Program Introduction
* Create function "apriori" to create frequent set.
* Create function to get associate member by using frequent set

### Libararies

Here are some libraries that I used in the program.
* Using "numpy" as our data format. 
* Using "collections" to computes and merges the same elements in one list in the function "LSH".

In [27]:
import numpy as np
from collections import Counter
import pandas as pd

### Data Sets:

I used data sets which is given by Assignment2. I transform the data sets into list for further use.

In [28]:
# Import sales data
dataPath = "T10I4D100K.dat"
baskets = []

f = open(dataPath, 'r')

for doc in f:
    docStr = doc.rstrip()
    item = list(docStr.split(' '))
    for i in range(len(item)):
        item[i] = int(item[i])
    #print(item)
    baskets.append(item)

print(baskets[0:2])

[[25, 52, 164, 240, 274, 328, 368, 448, 538, 561, 630, 687, 730, 775, 825, 834], [39, 120, 124, 205, 401, 581, 704, 814, 825, 834]]


### Function for Apriori Augorithm

* Function "getAllItems": create a list with all characters in baskets.
* Function "getFirstStage": create a list that only contains frequent character.
* Function "getRelevanItems": create a list that only contains frequent character sets in each iteration.
* Function "createItems": create a list that contains next stage items by using relevant items that created by function "getRelevantItems".
* Function "getItemsCount": create a list that save the computation result of the frequency of items for next stage.
* Function "apriori": use these functions above to get the final results


In [29]:
def getAllItems(baskets):
    docAllItems = list()
    for basket in baskets:
        docAllItems.extend(basket)
    return Counter(docAllItems)

def getFirstStage(preDictionary, t=12):
    perStageSets = list()
    frequencyMap = list()
    for key in preDictionary.keys():
        if int(preDictionary[key]) >= t :
            if type(key) == int:
                perStageSets.append([key])
                frequencyMap.append([[key], preDictionary[key]])

            else:
                perStageSets.append(key.sort())
    return perStageSets, frequencyMap

def getRelevantItems(itemsCount, t=12):
    perStageSets = list()
    frequencyMap = list()
    for item in itemsCount:
        if item[-1] >= t :
            perStageSets.append(item[0])
            frequencyMap.append(item)
    return perStageSets, frequencyMap

def createItems(preStageSets, docAllItems):
    nextStageLists = list()
    for preStageSet in preStageSets:
        for key in docAllItems:
            if preStageSet[-1] < key[0]:
                nextStageLists.append(preStageSet + key)
    return nextStageLists

def getItemsCount(baskets, stageLists):
    allStageItem = list()
    relevantItemList = list()
    for backet in baskets:
        for index, item in enumerate(stageLists):
            if list(set(backet) & set(item)) == item:
                allStageItem.append(index)
    firstCounter = Counter(allStageItem)
    for index, item in enumerate(stageLists):
        for key in firstCounter.keys():
            if index == key :
                relevantItemList.append([item, firstCounter[key]])

    return relevantItemList

def apriori(baskets, t):
    frequencyMap = list()
    aprioriResult = list()
    preDic = getAllItems(baskets)
    
    DicItems, frequencyMaptmp = getFirstStage(preDic, t)
    stageItems, frequencyMaptmp = getFirstStage(preDic, t)
    while stageItems != []:
        aprioriResult.append(stageItems)
        frequencyMap.append(frequencyMaptmp)

        tmp = createItems(stageItems, DicItems)
        tmp = getItemsCount(baskets, tmp)
        stageItems,frequencyMaptmp = getRelevantItems(tmp, t)

    return aprioriResult, frequencyMap

    

basketstmp = baskets[0:1000]
aprioriResult, frequencyMap = apriori(basketstmp, 10)
print(frequencyMap)



[[[[25], 13], [[52], 19], [[164], 11], [[240], 15], [[274], 33], [[368], 80], [[448], 13], [[538], 42], [[561], 33], [[687], 11], [[775], 33], [[825], 25], [[834], 10], [[39], 48], [[120], 56], [[205], 28], [[401], 31], [[581], 28], [[704], 18], [[814], 19], [[35], 19], [[674], 24], [[733], 16], [[854], 36], [[950], 19], [[422], 11], [[449], 20], [[857], 10], [[895], 46], [[937], 45], [[964], 19], [[229], 20], [[283], 57], [[294], 12], [[352], 10], [[381], 31], [[708], 10], [[738], 26], [[766], 54], [[853], 19], [[883], 45], [[966], 36], [[978], 12], [[104], 10], [[143], 14], [[569], 38], [[620], 17], [[798], 34], [[214], 14], [[350], 32], [[529], 56], [[658], 22], [[682], 35], [[782], 37], [[809], 20], [[947], 28], [[970], 26], [[227], 16], [[390], 27], [[71], 49], [[192], 21], [[272], 11], [[279], 32], [[280], 21], [[496], 15], [[530], 12], [[597], 30], [[618], 10], [[675], 32], [[720], 40], [[914], 36], [[932], 19], [[183], 38], [[193], 12], [[217], 67], [[256], 13], [[276], 27], [[

### Get Association results
Develop and implement an algorithm for generating association rules between frequent itemsets discovered using the A-Priori algorithm in a dataset of sales transactions.

In [30]:
def getAssociationRules(basketstmp, confidence=0.7, support=10):
    aprioriResult, frequencyMap = apriori(basketstmp, support)
    assuciationResult = list()
    for itemList in aprioriResult:
        if len(itemList[0]) == 1: continue
        for item in itemList:
            for num in item:
                itemtmp = list(item)
                itemtmp.remove(num)
                for i in frequencyMap[0]:
                    if set(set(i[0]) & set([num])) == set([num]):
                        numSupport = i[-1]
                        #print(numSupport)
                for j in frequencyMap[len(item)-1]:
                    if set(set(j[0]) & set(item)) == set(item):
                        itemSupport = j[-1]
                        #print(itemSupport)
                perConfidence = itemSupport/numSupport
                #print(perConfidence)
                if (perConfidence >= confidence):
                    assuciationResult.append([itemtmp,num,perConfidence])
                    print(itemtmp, '------->', num, ': ', perConfidence)
    return assuciationResult

associationResult = getAssociationRules(aprioriResult, frequencyMap)

    

[283] -------> 515 :  0.7647058823529411
[569] -------> 801 :  0.8333333333333334
[217] -------> 515 :  0.8235294117647058
[217] -------> 394 :  0.7692307692307693
[392] -------> 801 :  0.9166666666666666
[862] -------> 801 :  1.0
[346] -------> 515 :  0.7647058823529411
[354] -------> 58 :  0.7368421052631579
[33] -------> 515 :  0.7058823529411765
[583] -------> 158 :  0.7333333333333333
[217, 346] -------> 515 :  0.7647058823529411
[392, 862] -------> 801 :  0.9166666666666666
[33, 346] -------> 515 :  0.7058823529411765
