# Exercise 3 (MapReduce in Practice)   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;     [4 points]
---

For this exercise, you are tasked with writing your own Hadoop MapReduce program in Python and
running it on the cluster with the provided datasets.  
See the exercise sheet for full details on the datasets and this task.


In [1]:
# Saving variables to access the file locations
articles='/home/adbs23/adbs23_shared/hm/articles.csv'

customers='/home/adbs23/adbs23_shared/hm/customers.csv'

transactions='/home/adbs23/adbs23_shared/hm/transactions_small.csv'

- ### **a) Write a MapReduce job with “articles.csv” as input and the following output:**  

For each garment group, show the most frequent product, the second most frequent section, and the most frequent department that appear with it in the articles.csv file. Make sure the output has the following schema:

            garment_group_name, prod_name, section_name,  department_name

The product names are stored in "prod_name", the department name in "department_name", the garment group in "garment_group_name" and the section in "section_name". If multiple products, departments or sections have the same number of occurrences, you may resolve these conflicts randomly, i.e. pick one of them arbitrarily. If there is only one section, or all sections appear with the same frequency, just pick the most frequent one, and resolve conflicts randomly.
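The selection rules above can be sketched in plain Python, independent of mrjob. This is a minimal sketch, not the reference solution: the helper names `most_frequent` and `second_most_frequent` are hypothetical, and "second most frequent" is assumed to mean the value with the second-highest distinct count, falling back to the most frequent value when no such value exists.

```python
from collections import Counter


def most_frequent(values):
    """Return one value with the highest count; ties are broken arbitrarily."""
    counts = Counter(v for v in values if v)  # ignore empty (sparse) values
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def second_most_frequent(values):
    """Return a value with the second-highest count.

    Falls back to the most frequent value if there is only one distinct
    value or if all values occur equally often.
    """
    counts = Counter(v for v in values if v)
    if not counts:
        return None
    ranked = counts.most_common()  # sorted by count, descending
    top_count = ranked[0][1]
    for value, count in ranked:
        if count < top_count:
            return value  # first value with a strictly lower count
    return ranked[0][0]  # fallback: only one frequency level exists


sections = ["A", "A", "A", "B", "B", "C", ""]
print(most_frequent(sections), second_most_frequent(sections))
```

Inside a reducer, the same logic could be applied to the values collected for one garment group, provided that group's values fit in memory.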

Make sure that your program correctly deals with the header, and possible sparse values.
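One way to handle the header and sparse values is to parse each line into named fields before emitting anything. The sketch below is an illustration under assumptions: `parse_row` and the `columns` mapping are hypothetical names, and the column positions in the toy example are not those of articles.csv. It uses the default quoting of `csv.reader`, so a quoted field that contains commas stays in one column.

```python
import csv


def parse_row(line, columns):
    """Parse one CSV line into a dict of the requested columns.

    `columns` maps a column name to its position in the row.
    Returns None for the header line and for rows that are too short.
    Empty (sparse) fields are returned as "" so the caller can decide
    whether to skip them.
    """
    row = next(csv.reader([line]))
    if len(row) <= max(columns.values()):
        return None  # malformed or truncated line
    record = {name: row[idx].strip() for name, idx in columns.items()}
    if any(record[name] == name for name in columns):
        return None  # header line: a field equals its own column name
    return record


# Toy layout for illustration only: id, prod_name, group
columns = {"prod_name": 1, "group": 2}
for line in ['id,prod_name,group', '1,"Shirt, long",Jersey', '2,,Jersey']:
    print(parse_row(line, columns))
```

A mapper could call such a helper first and return early whenever the result is None or a required field is empty.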

In [2]:
%%file mymrjob1.py

# This will create a local file to run your MapReduce program  

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.util import log_to_stream, log_to_null
from mr3px.csvprotocol import CsvProtocol
import csv 
import operator
import logging


log = logging.getLogger(__name__)
# 
#  Below is the skeleton for a MapReduce program in mrjob.
#  Write your own solution here. Be sure that it actually runs successfully.

class MyMRJob1(MRJob):
    
    
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV
    
    @classmethod
    def set_up_logging(cls, quiet=False, verbose=False, stream=None):  
        log_to_stream(name='mrjob', debug=verbose, stream=stream)
        log_to_stream(name='__main__', debug=verbose, stream=stream)

    def mapper_prodcount(self, _, line):
        result = next(csv.reader([line],quotechar=None)) # extract columns from line
        if len(result) < 24:  # skip malformed or truncated lines
            return

        garment_group_name = result[23]
        prod_name = result[2]
        # skip the header and sparse entries (missing product or garment group)
        if prod_name == "prod_name" or prod_name == "" or garment_group_name == "":
            return
        yield (garment_group_name,prod_name), 1
       


    # The reducer sums the counts for each (garment group, product) pair.
    # Selecting the most frequent product per garment group, and handling
    # sections and departments, would require further steps.
    def reducer_prodcount(self, garment_prod, valuelist):
        garment, prod = garment_prod
        output = sum(valuelist)
        yield None, (garment, prod, output)


    def steps(self):
        return [
            MRStep(mapper=self.mapper_prodcount,
                   reducer=self.reducer_prodcount)            
        ]

if __name__ == '__main__':
    MyMRJob1.run()


Writing mymrjob1.py


Running a local MRjob 

In [3]:
!python3 mymrjob1.py $articles

  File "c:\Users\Lenovo\OneDrive\Documents\ADMS_navya\notebooks\mymrjob1.py", line 33
    def reducer_prodcount(self,key,pairs):
    ^
IndentationError: expected an indented block after function definition on line 27


Running a Hadoop job

---

- ### **b) Write a MapReduce job with all three datasets as input and the following output:**  
For all customers with 'fashion\_news\_frequency' = 'Regularly', show the number of transactions they appear in where the article has a 'graphical\_appearance\_name' equal to 'Solid' and a 'colour\_group\_name' equal to 'Light Beige'.


Make sure to have the following format in your final output:

            customer_id,count_transactions
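A task of this shape is commonly solved as a two-step reduce-side join. The sketch below simulates that idea in plain Python rather than mrjob, so it is not the expected solution. All names (`shuffle`, `reduce_by_article`, `reduce_by_customer`, `count_transactions`) and the record tags "A", "T" and "C" are hypothetical. It assumes the mappers have already applied the filters, emitting an "A" record only for Solid / Light Beige articles and a "C" record only for 'Regularly' customers, and that customers without a matching transaction are omitted from the output.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def reduce_by_article(article_id, values):
    """Step 1: keep transactions whose article passed the filter."""
    values = list(values)
    if any(tag == "A" for tag, _ in values):
        for tag, customer_id in values:
            if tag == "T":
                yield customer_id, ("T", 1)  # re-key by customer for step 2


def reduce_by_customer(customer_id, values):
    """Step 2: count transactions for customers that passed the filter."""
    values = list(values)
    is_regular = any(tag == "C" for tag, _ in values)
    count = sum(n for tag, n in values if tag == "T")
    if is_regular and count > 0:
        yield customer_id, count


def count_transactions(article_keyed, customer_records):
    step1 = [out for key, vals in shuffle(article_keyed)
             for out in reduce_by_article(key, vals)]
    step2 = [out for key, vals in shuffle(step1 + customer_records)
             for out in reduce_by_customer(key, vals)]
    return dict(step2)


# Toy data: article a1 passes the filter, a2 does not; c1 and c3 are regular.
article_keyed = [("a1", ("A", None)), ("a1", ("T", "c1")), ("a1", ("T", "c2")),
                 ("a2", ("T", "c1")), ("a1", ("T", "c1"))]
customer_records = [("c1", ("C", None)), ("c3", ("C", None))]
print(count_transactions(article_keyed, customer_records))
```

In an mrjob version, each reducer would become the reducer of one `MRStep`, and the mapper would need a way to tell the three input files apart.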


In [None]:
%%file mymrjob2.py
# This will create a local file to run your MapReduce program  

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.util import log_to_stream, log_to_null
from mr3px.csvprotocol import CsvProtocol
import csv 
import logging

    
log = logging.getLogger(__name__)
# 
#  Below is the skeleton for a MapReduce program in mrjob.
#  Write your own solution here. Be sure that it actually runs successfully.
class MyMRJob2(MRJob):
    
    
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV
    
    @classmethod
    def set_up_logging(cls, quiet=False, verbose=False, stream=None):  
        log_to_stream(name='mrjob', debug=verbose, stream=stream)
        log_to_stream(name='__main__', debug=verbose, stream=stream)

#   Feel free to rename the functions
    def mapper_mrjob2(self, _, line):
        # TODO: parse the line and emit (key, value) pairs
        pass

    # Using a combiner is optional. It may speed up your job, but make sure
    # that using it preserves correctness.
    # def combiner_mrjob2(self, key, valuelist):
    #     # TODO
    #     pass

    def reducer_mrjob2(self, key, valuelist):
        # TODO: aggregate the values for each key
        pass

    def steps(self):
        first_step = MRStep(
            mapper=self.mapper_mrjob2, 
#             combiner=self.combiner_mrjob2, 
            reducer=self.reducer_mrjob2
        )
        # just generate more steps to run a multi-step MR job
        
        return [ first_step ]

if __name__ == '__main__':
    MyMRJob2.run()


In [None]:
! python3 mymrjob2.py  $articles $transactions $customers

---
## **Your solution for Exercise 3 will consist of:**  
*  This notebook, filled with your solution.
