This hands-on activity focuses on using Python to implement MapReduce jobs. You may use either of two approaches: Hadoop Streaming or the mrjob framework. You will analyze a simple dataset, similar to previous activities, but implement the Mapper and Reducer in Python.
By completing this hands-on activity, students will:
- Gain Experience with Python for MapReduce: Learn how to implement mappers and reducers in Python using Hadoop Streaming or the mrjob framework.
- Understand Basic Python MapReduce Concepts: Develop an understanding of how to write Python code for MapReduce and run it on a Hadoop cluster.
- Deploy and Run Python MapReduce on Hadoop: Learn how to deploy and execute MapReduce jobs written in Python using Hadoop Streaming or mrjob.
- Submit and Manage Code Using GitHub: Develop skills in managing code and submitting assignments via GitHub.
- First, accept the GitHub Classroom invitation and fork the assignment repository to your own GitHub account.
- Once you’ve forked the repo, open the repository in GitHub Codespaces to begin working on the assignment.
- Download VirtualBox from this link
- Download the Ubuntu ISO file from this link
- After installing and setting up the Ubuntu VM, follow the instructions given in the lecture slides to set up Hadoop on the VM
- For Mac users, installing Hadoop does not require an additional VM on top of macOS
- You can follow the instructions given in this article to install Hadoop natively on a MacBook
- Click here to install Hadoop on macOS
- You can follow the instructions in this Gist to install Hadoop on the Codespaces instance.
- Task 1: Total Sales per Product Category
- Implement the Python Mapper and Reducer to calculate the total quantity sold and total revenue for each product category.
- Refer to the provided mapper_task1.py and reducer_task1.py files in the repository.
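To illustrate the shape of a streaming mapper, here is a minimal sketch of what mapper_task1.py could look like. It is only a starting point and assumes product_sales.csv has the columns product,category,quantity,revenue; adjust the field indices to the actual header in the repository.

```python
#!/usr/bin/env python3
# Sketch of mapper_task1.py (assumed CSV layout: product,category,quantity,revenue).
import sys

def map_line(line):
    """Parse one CSV line; return (category, quantity, revenue), or None for the header/bad rows."""
    fields = line.strip().split(",")
    if len(fields) < 4:
        return None
    try:
        quantity = int(fields[2])
        revenue = float(fields[3])
    except ValueError:  # header row or malformed record
        return None
    return fields[1], quantity, revenue

if __name__ == "__main__":
    for line in sys.stdin:
        parsed = map_line(line)
        if parsed:
            category, quantity, revenue = parsed
            # Hadoop Streaming treats the text before the first tab as the key.
            print(f"{category}\t{quantity}\t{revenue}")
```

The shuffle phase then groups and sorts these lines by category before they reach the reducer, which only has to sum consecutive lines that share the same key.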
- Task 2: Average Revenue per Product Category
- Implement the Python Mapper and Reducer to calculate the average revenue per product for each category.
- Refer to the provided mapper_task2.py and reducer_task2.py files in the repository.
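For the reducer side, a minimal sketch of reducer_task2.py could look like the following. It assumes the matching mapper emits category and revenue separated by a tab; adapt it to the provided skeleton files.

```python
#!/usr/bin/env python3
# Sketch of reducer_task2.py (hypothetical). Assumes the mapper emits
# "category<TAB>revenue" lines, which the shuffle phase delivers sorted by category.
import sys

def average_by_key(lines):
    """Yield (category, average_revenue) from tab-separated, key-sorted lines."""
    current, total, count = None, 0.0, 0
    for line in lines:
        category, revenue = line.strip().split("\t")
        if category != current:
            if current is not None:
                yield current, total / count
            current, total, count = category, 0.0, 0
        total += float(revenue)
        count += 1
    if current is not None:  # flush the last group
        yield current, total / count

if __name__ == "__main__":
    for category, avg in average_by_key(sys.stdin):
        print(f"{category}\t{avg:.2f}")
```

Because the input is already sorted by key, the reducer only needs to detect when the category changes, which is the standard pattern for streaming reducers.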
- After accepting the assignment in GitHub Classroom, a repository will be created in your GitHub profile
- Clone the repository into your VM (or Codespaces instance)
- Finish the code for both tasks
To run the MapReduce job, the input data file needs to be stored in Hadoop’s distributed file system (HDFS).
Create a directory in HDFS for the input file:
hadoop fs -mkdir -p /input/sales_data
Upload the sales dataset (product_sales.csv) to HDFS:
hadoop fs -put product_sales.csv /input/sales_data/
Now, you are ready to run the MapReduce job using Python and Hadoop Streaming. Below are the commands for each task.
Run the Task 1 job using Hadoop Streaming:
mapred streaming \
  -files mapper_task1.py,reducer_task1.py \
  -mapper mapper_task1.py \
  -reducer reducer_task1.py \
  -input /input/sales_data/product_sales.csv \
  -output /output/task1_total_sales
Alternatively, you can run the streaming job by invoking the streaming jar directly:
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -files mappers/mapper_task1.py,reducers/reducer_task1.py \
  -mapper mapper_task1.py \
  -reducer reducer_task1.py \
  -input /input/sales_data/product_sales.csv \
  -output /output/task1_total_sales
Run the Task 2 job using Hadoop Streaming:
mapred streaming \
  -files mapper_task2.py,reducer_task2.py \
  -mapper mapper_task2.py \
  -reducer reducer_task2.py \
  -input /input/sales_data/product_sales.csv \
  -output /output/task2_avg_revenue
Alternatively, you can run the streaming job by invoking the streaming jar directly:
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -files mappers/mapper_task2.py,reducers/reducer_task2.py \
  -mapper mapper_task2.py \
  -reducer reducer_task2.py \
  -input /input/sales_data/product_sales.csv \
  -output /output/task2_avg_revenue
After running the jobs, you can view the output stored in HDFS.
hadoop fs -cat /output/task1_total_sales/part-00000
hadoop fs -cat /output/task2_avg_revenue/part-00000
If you prefer to use the mrjob framework, follow the steps below. First, install mrjob:
pip install mrjob
You can test the mrjob MapReduce job locally by running:
python3 task1_mrjob.py /path/to/product_sales.csv
To run the mrjob MapReduce jobs on a Hadoop cluster, use the following command:
python3 task1_mrjob.py -r hadoop hdfs:///input/sales_data/product_sales.csv -o hdfs:///output/task1_total_sales
Similarly, for Task 2:
python3 task2_mrjob.py -r hadoop hdfs:///input/sales_data/product_sales.csv -o hdfs:///output/task2_avg_revenue
Once you have verified the results, copy the output from HDFS to your local file system.
Use the following command to copy the output directory from HDFS (adjust the destination path to match your local Hadoop installation):
hadoop fs -get /output /opt/hadoop-3.2.1/share/hadoop/mapreduce/
Commit your changes, including the output from the MapReduce job, and push them to your GitHub repository:
git add .
git commit -m "Completed Python MapReduce with Hadoop Streaming"
git push origin main
Once you've pushed your code, go to GitHub Classroom and ensure your repository is submitted for the assignment. Make sure that the following are included:
- Python scripts for Mapper and Reducer.
- The input file (product_sales.csv).
- The output files from your MapReduce job.
- A one-page report documenting the steps you followed, any challenges faced, and your observations.
- Correct Implementation of Python MapReduce Jobs: The MapReduce jobs must correctly process the dataset and produce the expected results.
- Proper Use of Hadoop Streaming or mrjob: Your solution must show correct implementation using Hadoop Streaming or mrjob.
- Submission of Code and Output: The correct output should be produced and submitted via GitHub, along with the Python code and a brief report.
- Report: A clear and concise one-page report summarizing your setup, challenges, and observations.
This concludes the instructions for the hands-on activity. If you encounter any issues, feel free to reach out during office hours or post your queries in the course discussion forum.
Good luck, and happy coding!