# Problem Statement
In this assignment you are to implement the Apriori algorithm for identifying frequent item sets using the MapReduce framework. A sample data set from Kaggle containing supermarket transactions is provided to test your implementation. Identify the top 5 association rules that you find interesting, and explain why.

### *Approach Overview*

**Step1**: Calculate the support of each individual item  
**Step2**: Calculate the support of every pair of items, and from it the confidence of each pair  
**Step3**: Calculate the lift of every pair of items  
**Step4**: Eliminate the data points where the support of either item is below a threshold value  
**Step5**: Eliminate the data points where the pair confidence is below a threshold value  
**Step6**: Sort the rules by lift 
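
The three measures named in the steps above can be illustrated on a handful of transactions. The sketch below is only an illustration: the `transactions` list and the helper names `support`, `confidence`, and `lift` are invented here and are not part of the assignment code.

```python
# Toy transactions, invented purely for illustration (not taken from groceries.csv)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

def support(*items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)

def confidence(a, b):
    """Estimate of P(b | a): support of the pair divided by support of a."""
    return support(a, b) / support(a)

def lift(a, b):
    """How much more often a and b co-occur than if they were independent."""
    return support(a, b) / (support(a) * support(b))

print(support("milk"))              # 3 of the 5 transactions contain milk
print(confidence("milk", "bread"))  # 2 of the 3 milk transactions also contain bread
print(lift("milk", "bread"))        # values above 1 suggest a positive association
```

Note that lift is symmetric in its two items, while confidence depends on the direction of the rule.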

### Step 1: Getting the support of individual items

In [5]:
%%file map1.py
import sys

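header = sys.stdin.readline()                        # skip the header row


for line in sys.stdin:
    data = line.strip().split(",")
    counter = int(data[0])                           # first field holds the number of items in the transaction
    
    for i in range(1, counter+1):
        print("%s\t%s" % (data[i], 1))               # emit (item, 1) for every item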

Overwriting map1.py


In [6]:
!type groceries.csv | python map1.py | sort > A2mapoutput1.txt

In [7]:
%%file reduce1.py
import sys

FREQUENCY = 9835                                     # number of transactions in our data set


# Input is sorted by item, so equal keys arrive on consecutive lines
initial_data = sys.stdin.readline().strip().split("\t")
curr_name = initial_data[0]
curr_count = 1

for line in sys.stdin:
    data = line.strip().split("\t")
    name = data[0]
    
    if curr_name == name:
        curr_count += 1
        
    else:
        print("%s\t%s" % (round(curr_count/FREQUENCY,5) ,curr_name))
        curr_name = name
        curr_count = 1
        
print("%s\t%s" % (round(curr_count/FREQUENCY,5) ,curr_name))

Overwriting reduce1.py


In [8]:
!type A2mapoutput1.txt | python reduce1.py | sort /r > A2reduceoutput1.txt

###### Sample output taken from A2reduceoutput1.txt

**Format: [support , item]**

0.25552	whole milk  
0.19349	other vegetables  
0.18393	rolls/buns  
0.17438	soda  
0.1395	yogurt  
0.11052	bottled water  
0.109	root vegetables  
0.10493	tropical fruit  
0.09853	shopping bags  
0.09395	sausage  
0.08897	pastry  
0.08277	citrus fruit  
0.08053	bottled beer  
0.07982	newspapers  
0.07768	canned beer 
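
The Step 1 pipeline (map, sort, reduce) can also be simulated in a single process, which is handy for checking the logic on a few lines before running it on the full file. The following is a minimal sketch, not the notebook's own code: it uses `itertools.groupby` in place of the hand-written key-change loop, and the input lines, `TOTAL_TRANSACTIONS`, `map_items`, and `reduce_support` are all hypothetical names and values chosen for the example.

```python
from itertools import groupby

TOTAL_TRANSACTIONS = 4  # hypothetical; the notebook uses FREQUENCY = 9835

# Hypothetical input lines in the same "count,item,item,..." layout that map1.py expects
lines = [
    "2,whole milk,yogurt",
    "1,whole milk",
    "3,soda,yogurt,whole milk",
    "1,soda",
]

def map_items(line):
    """Yield (item, 1) for every item in one transaction line."""
    fields = line.strip().split(",")
    count = int(fields[0])
    for item in fields[1:count + 1]:
        yield item, 1

def reduce_support(sorted_pairs, total):
    """Sum the counts for each run of equal keys and convert them to support."""
    for item, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield item, round(sum(n for _, n in group) / total, 5)

# sorted() plays the role of the shuffle/sort step between mapper and reducer
mapped = sorted(kv for line in lines for kv in map_items(line))
supports = dict(reduce_support(mapped, TOTAL_TRANSACTIONS))
print(supports)
```

Like the reducer above, `groupby` only merges adjacent equal keys, so it depends on the input being sorted first.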

### Step 2: Getting the support of pairs of items

In [9]:
%%file map2.py
import sys

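header = sys.stdin.readline()                        # skip the header row
for line in sys.stdin:
    data = line.strip().split(",")
    counter = int(data[0])                           # first field holds the number of items
    item_data = data[1:]
    
    # Emit every pair once, in the order the items appear on the line.
    # This assumes items are listed in a consistent order across transactions;
    # otherwise (A, B) and (B, A) would be counted under separate keys.
    for i in range(counter):
        for j in range(i+1, counter):
            print("%s\t%s\t%s" % (item_data[i], item_data[j],1))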

Overwriting map2.py


In [10]:
!type groceries.csv | python map2.py | sort  > A2mapoutput2.txt

In [13]:
%%file reduce2.py
import sys

FREQUENCY = 9835                                     # number of transactions in our data set

initial_data = sys.stdin.readline().strip().split("\t")
curr_name1 = initial_data[0]
curr_name2 = initial_data[1]
curr_count = 1

for line in sys.stdin:
    data = line.strip().split("\t")
    name1 = data[0]
    name2 = data[1]
    
    if curr_name1 == name1 and curr_name2 == name2:
        curr_count +=1
    
    else:
        print("%s\t%s\t%s" % (round(curr_count/FREQUENCY,5) ,curr_name1, curr_name2))
        curr_name1 = name1
        curr_name2 = name2
        curr_count = 1

print("%s\t%s\t%s" % (round(curr_count/FREQUENCY,5) ,curr_name1, curr_name2))

Overwriting reduce2.py


In [14]:
!type A2mapoutput2.txt | python reduce2.py | sort /r  > A2reduceoutput2.txt

##### Sample output taken from A2reduceoutput2.txt

**Format: [support of the pair, item1, item2]**

0.07483	other vegetables	whole milk  
0.05663	whole milk	rolls/buns  
0.05602	whole milk	yogurt  
0.04891	root vegetables	whole milk  
0.04738	root vegetables	other vegetables  
0.04342	other vegetables	yogurt  
0.0426	other vegetables	rolls/buns  
0.0423	tropical fruit	whole milk  
0.04006	whole milk	soda  
0.03833	rolls/buns	soda  
0.03589	tropical fruit	other vegetables  
0.03437	yogurt	rolls/buns  
0.03437	whole milk	bottled water  
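
Because map2.py emits each pair in the order the items appear on the line, the pair counts are only complete if every transaction lists its items in a consistent order. One way to remove that dependency, sketched here under the assumption that a canonical (alphabetical) key order is acceptable, is to sort the items before forming pairs. The transactions below are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; the same two items appear in a different order
transactions = [
    ["whole milk", "yogurt"],
    ["yogurt", "whole milk", "soda"],
]

pair_counts = Counter()
for items in transactions:
    # sorted(set(...)) gives every pair one canonical order and ignores repeated items
    for a, b in combinations(sorted(set(items)), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("whole milk", "yogurt")])
```

With this keying, both transactions contribute to the same `("whole milk", "yogurt")` entry regardless of how the line was written.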

### Step 3: Calculating the confidence and lift for our dataset

In [15]:
!type A2reduceoutput1.txt A2reduceoutput2.txt > A2jointoutput.txt


A2reduceoutput1.txt
A2reduceoutput2.txt

In [16]:
%%file reduce3.py
import sys

my_dict1 = {}
my_dict2 = {}

for i in range(169):                                 # the first 169 lines are the single-item supports
    data =  sys.stdin.readline().strip().split("\t")
    support_1 = float(data[0])
    key_val = data[1]
    my_dict1[key_val] = support_1                    # store item -> support
    
    
for j in range(9636):                                # the next 9636 lines are the pair supports
    data2 = sys.stdin.readline().strip().split("\t")
    support_2 = float(data2[0])
    key_val1 = data2[1]
    key_val2 = data2[2]
    
    my_dict2[(key_val1, key_val2)] = support_2       # store the pair in both orders, (A,B) and (B,A)
    my_dict2[(key_val2, key_val1)] = support_2
    
    
for key_pair in my_dict2.keys():
    key1,key2 = key_pair                                                         # tuple unpacking
    
    conf_score_12 = round(my_dict2[(key1,key2)]/my_dict1[key1],5)                # confidence of key1 => key2
    lift = round(my_dict2[(key1,key2)]/(my_dict1[key1]*my_dict1[key2]),5)        # lift of the pair
    # print lift, confidence, support_key1, support_key2, key_1, key_2
    print("%s\t%s\t%s\t%s\t%s\t%s" % (lift,conf_score_12,my_dict1[key1],my_dict1[key2],key1, key2))

Overwriting reduce3.py


In [17]:
!type A2jointoutput.txt | python reduce3.py | sort /r > A2finaloutput.txt
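
reduce3.py reads exactly 169 single-item lines followed by 9636 pair lines, so it is tied to this particular data set. A possible alternative, sketched below as an assumption rather than as the notebook's method, is to tell the two record types apart by their number of tab-separated fields. The function name `load_supports` is hypothetical, and the three input lines are copied from the sample outputs above.

```python
import io

def load_supports(stream):
    """Split a joined support file into single-item and pair dictionaries."""
    item_support, pair_support = {}, {}
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:                      # support, item
            item_support[fields[1].strip()] = float(fields[0])
        elif len(fields) == 3:                    # support, item1, item2
            a, b = fields[1].strip(), fields[2].strip()
            # store the pair in both orders, as reduce3.py does
            pair_support[(a, b)] = pair_support[(b, a)] = float(fields[0])
    return item_support, pair_support

# Three lines from the sample outputs, used here as a stand-in for A2jointoutput.txt
sample = io.StringIO(
    "0.25552\twhole milk\n"
    "0.19349\tother vegetables\n"
    "0.07483\tother vegetables\twhole milk\n"
)
items, pairs = load_supports(sample)
print(len(items), len(pairs))
```

Lines with any other number of fields (for example blank lines) are skipped, so the line counts no longer need to be known in advance.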

##### Sample output taken from A2finaloutput.txt

**Format: [lift, confidence, support_key1, support_key2, key_1, key_2] for key_1 => key_2**  


96.33911	0.5	0.0002	0.00519	preservation products	spices  
96.33911	0.01927	0.00519	0.0002	spices	preservation products  
9.99295	0.10363	0.00193	0.01037	flower soil/fertilizer	flower (seeds)  
9.99295	0.01929	0.01037	0.00193	flower (seeds)	flower soil/fertilizer  
9.97606	0.08929	0.00224	0.00895	organic sausage	candles  
9.97606	0.02235	0.00895	0.00224	candles	organic sausage  
9.96612	0.0578	0.00173	0.0058	specialty vegetables	dental care  
9.96612	0.01724	0.0058	0.00173	dental care	specialty vegetables  
9.92891	0.07576	0.00132	0.00763	cream	rice  
9.92891	0.01311	0.00763	0.00132	rice	cream  
9.89751	0.10363	0.00193	0.01047	flower soil/fertilizer	dish cleaner  
9.89751	0.0191	0.01047	0.00193	dish cleaner	flower soil/fertilizer  
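
The very large lift in the first row is worth checking by hand, since it shows why the filtering step matters. The sketch below only re-does the arithmetic from the printed (already rounded) values of that row; `TOTAL_TRANSACTIONS` mirrors `FREQUENCY` from the reducers, and the implied transaction count is approximate because of the rounding.

```python
# Values copied from the first sample row: preservation products => spices
confidence = 0.5
support_key1 = 0.0002       # preservation products
support_key2 = 0.00519      # spices
TOTAL_TRANSACTIONS = 9835   # FREQUENCY in the reducers

# confidence = pair_support / support_key1, so the pair support can be recovered
pair_support = confidence * support_key1
lift = pair_support / (support_key1 * support_key2)

print(round(pair_support, 5))
print(round(lift, 3))
print(round(pair_support * TOTAL_TRANSACTIONS))  # approximate number of transactions behind the rule
```

A rule that rests on roughly one transaction out of 9835 can still score a very high lift, which is the motivation for the minimum-support threshold.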

### Step 4: Filtering the results

###### 1. We now take A2finaloutput.txt as input, in the format [lift, confidence, support_key1, support_key2, key_1, key_2], and filter the data points
###### 2. First, we eliminate the data points where the support of either item is at or below MIN_SUPPORT
###### 3. Second, we eliminate the data points where the confidence of the pair is at or below MIN_CONFIDENCE
###### 4. Finally, we sort in decreasing order of lift so that the strongest associations come first  

In [18]:
%%file filter.py
import sys

MIN_CONFIDENCE = 0.2                              # chosen based on our observation of the data
MIN_SUPPORT = 0.03                                # chosen based on observation; roughly 300 of the 9835 transactions
MIN_LIFT = 3                                      # defined but not applied by the filter below

for line in sys.stdin:
    my_data = line.strip().split("\t")
    lift_value = float(my_data[0])
    conf_value = float(my_data[1])
    supp_value_key1 = float(my_data[2])
    supp_value_key2 = float(my_data[3])
    key_1 = my_data[4]
    key_2 = my_data[5]
    
    if supp_value_key1 <= MIN_SUPPORT or supp_value_key2 <= MIN_SUPPORT :
        continue
        
    if conf_value <= MIN_CONFIDENCE:
        continue
        
    print("%s\t%s\t%s\t%s\t%s\t%s" % (lift_value,conf_value,supp_value_key1,supp_value_key2,key_1, key_2)) 

Overwriting filter.py


In [20]:
!type A2finaloutput.txt | python filter.py | sort /r

3.79716	0.27218	0.03325	0.07168	berries	whipped/sour cream
3.04062	0.33143	0.05247	0.109	beef	root vegetables
2.79874	0.30506	0.03101	0.109	onions	root vegetables
2.57497	0.27019	0.07565	0.10493	pip fruit	tropical fruit
2.37162	0.45888	0.03101	0.19349	onions	other vegetables
2.32625	0.32451	0.05328	0.1395	curd	yogurt
2.32618	0.25355	0.04291	0.109	chicken	root vegetables
2.29475	0.24079	0.08277	0.10493	citrus fruit	tropical fruit
2.27882	0.31789	0.03325	0.1395	berries	yogurt
2.24652	0.43468	0.109	0.19349	root vegetables	other vegetables
2.24652	0.24487	0.19349	0.109	other vegetables	root vegetables
2.24184	0.31274	0.03965	0.1395	cream cheese	yogurt
2.21107	0.24101	0.04809	0.109	frozen vegetables	root vegetables
2.18607	0.23828	0.07168	0.109	whipped/sour cream	root vegetables
2.16746	0.23625	0.05765	0.109	pork	root vegetables
2.15594	0.41715	0.04291	0.19349	chicken	other vegetables
2.14967	0.41594	0.03325	0.19349	hamburger meat	other vegetables
2.13753	0.23299	0.05541	0.109	butter	root v

In [22]:
!type A2finaloutput.txt | python filter.py | sort /r > A2finalfilteredoutout.txt
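
One caveat on the final ordering: assuming `sort /r` compares lines as text rather than as numbers, a lift of 10 or more can land below a lift of 9.x, because the comparison goes character by character. A minimal sketch of a numeric sort in Python follows; `sort_by_lift` is a hypothetical helper, and the row with lift 12.5 is invented to make the difference visible.

```python
def sort_by_lift(lines):
    """Return rule lines ordered by their first tab-separated field, highest lift first."""
    return sorted(lines, key=lambda line: float(line.split("\t")[0]), reverse=True)

# Two rows from the sample output plus one invented row (lift 12.5)
rows = [
    "9.99295\t0.10363\t0.00193\t0.01037\tflower soil/fertilizer\tflower (seeds)",
    "12.5\t0.3\t0.05\t0.024\titem A\titem B",   # hypothetical row
    "96.33911\t0.5\t0.0002\t0.00519\tpreservation products\tspices",
]

text_order = sorted(rows, reverse=True)   # character-by-character comparison, like a text sort
numeric_order = sort_by_lift(rows)

print([r.split("\t")[0] for r in text_order])
print([r.split("\t")[0] for r in numeric_order])
```

In the text ordering the 12.5 row falls to the bottom, while the numeric ordering places it between 96.33911 and 9.99295.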