<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Introduction to </b> <span style="font-weight:bold; color:green">MapReduce</span></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru, papulin_hse@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">Word Count</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#1a">Java</a></li>
                <li><a href="#1b">Python</a></li>
                <li><a href="#1c">Scala</a></li>
            </ol>
        </li>
        <li><a href="#2">Average Rating Calculation</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#2a">Average ratings for each product</a></li>
                <li><a href="#2b">Average rating of all products</a></li>
                <li><a href="#2c">Filter items by their ratings</a></li>
                <li><a href="#2d">Average rating of product</a></li>
            </ol>
        </li>
        <li><a href="#3">MapReduce on AWS EMR</a></li>
        <li><a href="#4">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. MapReduce Word Count</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="1a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Java
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Run and debug the MapReduce code in IntelliJ IDEA</b></p>

<p>MapReduce with Java</p>

<div class="msg-block msg-info">
      <div class="msg-text-info">
          <p>The official MapReduce tutorial, which walks through the word count example, is available <a href="https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">here</a></p>
     </div>
</div>

In [None]:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
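
<p>The data flow of the Java job above (map, shuffle, reduce) can be traced without Hadoop. The following is a minimal in-memory sketch in Python, not the Hadoop implementation: the function names and the sample lines are made up for illustration, and <span class="code-font">str.split()</span> is assumed to be a close enough stand-in for <span class="code-font">StringTokenizer</span> with its default delimiters.</p>

```python
from collections import defaultdict


def map_words(line):
    """Mimic TokenizerMapper: emit (word, 1) for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def shuffle(pairs):
    """Group values by key and sort by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_counts(key, values):
    """Mimic IntSumReducer: sum all counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["to be or not to be", "to map and to reduce"]  # made-up input
    for word, count in word_count(sample).items():
        print(f"{word}\t{count}")
```

<p>Because the reduce step here is a plain sum, it is associative and commutative, which is why the Java job can reuse <span class="code-font">IntSumReducer</span> as its combiner.</p>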

<p><span class="cmd-no-code"></span>Run the code</p>

<p><b>Run on the Local Cloudera VM</b></p>

<p><span class="cmd-no-code"></span>Create a jar file</p>

<p>Copy the jar file</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/WordCount.jar cloudera@127.0.0.1:/home/cloudera/classes/mapreduce

<p>Copy the <span class="code-font">"/data/samples.json"</span> file to your Local Cloudera VM</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/samples.json cloudera@127.0.0.1:/home/cloudera/classes/mapreduce

<p>Use port forwarding from your local host to the local VM to access the YARN ResourceManager dashboard (port 8088 on the VM)</p>

In [None]:
sudo ssh -N -f -L 9962:quickstart.cloudera:8088 cloudera@127.0.0.1 -p 2222

<p>Open a web browser to see a Hadoop dashboard</p>

<div class="code-block code-font"><a href="http://localhost:9962">http://localhost:9962</a></div>

<p>Connect to the VM via SSH</p>

In [None]:
sudo ssh -p 2222 cloudera@127.0.0.1

<p>Create an HDFS directory for the input data</p>

In [None]:
hdfs dfs -mkdir -p /mapreduce_data/input

<p>Move the data to the HDFS directory</p>

In [None]:
hdfs dfs -moveFromLocal \
            /home/cloudera/classes/mapreduce/samples.json \
            hdfs:///mapreduce_data/input

<p>Run the jar file</p>

In [None]:
hadoop jar /home/cloudera/classes/mapreduce/WordCount.jar \
            hdfs:///mapreduce_data/input \
            hdfs:///mapreduce_data/output

<p>Display the contents of the output file</p>

In [None]:
hdfs dfs -cat /mapreduce_data/output/part-r-00000

<p>Remove the output directory if needed</p>

In [None]:
hdfs dfs -rm -r /mapreduce_data/output

<a name="1b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Python
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1c">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Run and debug the MapReduce code</b></p>

<p>Assign the paths of the Python files that contain the map and reduce functions, and the path of the data source</p>

In [None]:
map_python_file = "/YOUR_PATH/code_py/wordcount_mapper.py"
reduce_python_file = "/YOUR_PATH/code_py/wordcount_reduce.py"

data = "/YOUR_PATH/data/samples.json"

<p>Map function</p>

In [None]:
%%writefile $map_python_file
import sys

for line in sys.stdin:
    line = line.split()
    for key in line:
        value = 1
        print('%s\t%i' % (key, value))

<p>Reduce function</p>

In [None]:
%%writefile $reduce_python_file
import sys

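last_key = None
running_total = 0

# Input must be sorted by key, so equal keys arrive on consecutive lines
for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

    if last_key == this_key:
        running_total += value
    else:
        # A new key starts: emit the total of the previous key
        if last_key is not None:
            print("%s\t%i" % (last_key, running_total))

        running_total = value
        last_key = this_key

# Emit the total of the final key, which the loop itself never flushes
if last_key is not None:
    print("%s\t%i" % (last_key, running_total))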

<p>Load the Python code of the map function into the notebook</p>

In [None]:
%load $map_python_file

<p>Load the Python code of the reduce function into the notebook</p>

In [None]:
%load $reduce_python_file

<p><b>Examining results of map and reduce functions without Hadoop</b></p>

<p>Test the map function</p>

In [None]:
!cat $data | python $map_python_file

<p>Test map and reduce functions together</p>

In [None]:
!cat $data | python $map_python_file | sort | python $reduce_python_file
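
<p>The <span class="code-font">sort</span> step in the pipeline above stands in for Hadoop's shuffle: the reducer only works if equal keys arrive on consecutive lines. The sketch below illustrates this with <span class="code-font">itertools.groupby</span>; it is an alternative formulation for illustration, not the reducer file used in this notebook, and the sample lines are made up.</p>

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum counts per key; assumes 'key<TAB>count' lines are already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges *adjacent* equal keys, like the streaming reducer
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


if __name__ == "__main__":
    mapped = ["b\t1", "a\t1", "b\t1"]            # made-up mapper output
    print(list(reduce_sorted(mapped)))           # unsorted: 'b' is reported twice
    print(list(reduce_sorted(sorted(mapped))))   # sorted: one total per key
```

<p>Without sorting, the key <span class="code-font">b</span> is emitted twice with partial counts, which is the same failure the pipeline would show if <span class="code-font">sort</span> were omitted.</p>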

<p><b>Examining results of map and reduce functions on the Local Cloudera VM</b></p>

In [None]:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.reduces=2 \
    -mapper "python $map_python_file" \
    -reducer "python $reduce_python_file" \
    -input "/mapreduce_data/input" \
    -output "/mapreduce_data/output2"
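
<p>With <span class="code-font">mapreduce.job.reduces=2</span> the job writes one output file per reducer, and every key is routed to exactly one of them. The sketch below shows the idea of hash partitioning. It is an illustration under a stated assumption: <span class="code-font">zlib.crc32</span> stands in for the hash that Hadoop applies to the key, so the actual assignment of words to output files on the cluster will differ.</p>

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key.

    Assumption: crc32 is only a stand-in for Hadoop's own key hash;
    what matters is that the same key always maps to the same reducer.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_by_reducer(pairs, num_reducers=2):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    mapped = [("hadoop", 1), ("spark", 1), ("hadoop", 1), ("hdfs", 1)]  # made-up pairs
    for index, bucket in enumerate(split_by_reducer(mapped)):
        print(index, bucket)
```

<p>Since each reducer sees only its own keys, the complete word count is the union of all output files in the output directory.</p>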

<a name="1c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Scala
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
# TODO

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Average Rating Calculation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<a name="2a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Average ratings for each product
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Run and debug the source code</b></p>

<p><span class="code-font">ProdAvgDriver.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ProdAvgDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(getConf(), "ProdAverageRating");
        job.setJarByClass(ProdAvgDriver.class);
        job.setMapperClass(ProdAvgMapper.class);
        job.setCombinerClass(ProdAvgCombiner.class);
        job.setReducerClass(ProdAvgReducer.class);
        // Intermediate (mapper and combiner) output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(SumCountWritable.class);
        // Final (reducer) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.exit(ToolRunner.run(conf, new ProdAvgDriver(), args));
    }
}

<p><span class="code-font">SumCountWritable.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class SumCountWritable implements Writable {

    // Hadoop instantiates Writable types reflectively through a no-argument constructor
    public SumCountWritable() {
        this.sum = 0d;
        this.count = 0;
    }

    public SumCountWritable(double sum, int count){
        this.sum = sum;
        this.count = count;
    }

    private double sum;
    private int count;

    public void set(double sum, int count) {
        this.sum = sum;
        this.count = count;
    }

    public double getSum() {
        return this.sum;
    }

    public int getCount() {
        return this.count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(this.sum);
        out.writeInt(this.count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.sum = in.readDouble();
        this.count = in.readInt();
    }

    @Override
    public String toString() {
        return this.sum + ":" + this.count;
    }
}
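
<p>The <span class="code-font">write</span> and <span class="code-font">readFields</span> methods above define the binary layout of a record: an 8-byte double followed by a 4-byte int, both big-endian as written by Java's <span class="code-font">DataOutput</span>. The following Python sketch reproduces that layout with the <span class="code-font">struct</span> module to make the round trip visible; the function names are chosen for illustration only.</p>

```python
import struct

# Big-endian 8-byte double followed by a 4-byte signed int, in the same
# order as writeDouble and writeInt in SumCountWritable.write
RECORD = struct.Struct(">di")


def write(total, count):
    """Serialize a (sum, count) pair into 12 bytes."""
    return RECORD.pack(total, count)


def read_fields(data):
    """Deserialize 12 bytes back into a (sum, count) pair."""
    return RECORD.unpack(data)


if __name__ == "__main__":
    payload = write(9.5, 2)
    print(len(payload), read_fields(payload))
```

<p>The fields must be read in the order they were written; swapping the two reads would misinterpret the bytes.</p>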

<p><span class="code-font">ProdAvgMapper.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

import java.io.IOException;

public class ProdAvgMapper extends Mapper<Object, Text, Text, SumCountWritable> {

    // Reusable output key
    private Text product = new Text();

    @Override
    public void map(Object key, Text value, Context context
    ) throws IOException, InterruptedException {

        // Each input line is one JSON-encoded review
        JSONObject json = new JSONObject(value.toString());

        String prod = json.getString("asin");
        double rating = json.getDouble("overall");

        // Emit (product ID, (rating, 1)) so that sums and counts can be merged later
        product.set(prod);
        context.write(product, new SumCountWritable(rating, 1));
    }
}
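
<p>The mapper's logic can be checked on a single line before running the job. The sketch below mirrors it in Python: it reads the same two fields, <span class="code-font">asin</span> and <span class="code-font">overall</span>, and returns the pair that the mapper would emit. The sample review is made up, and the function name is hypothetical.</p>

```python
import json


def map_review(line):
    """Mimic ProdAvgMapper: return (asin, (overall, 1)) for one JSON line."""
    review = json.loads(line)
    return review["asin"], (float(review["overall"]), 1)


if __name__ == "__main__":
    # Made-up review line; only the two fields the mapper reads are needed
    sample = '{"asin": "B000TEST01", "overall": 4.0}'
    print(map_review(sample))
```

<p>A line that lacks either field raises an error here, just as <span class="code-font">getString</span> and <span class="code-font">getDouble</span> throw in the Java mapper.</p>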

<p><span class="code-font">ProdAvgCombiner.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ProdAvgCombiner extends Reducer<Text,SumCountWritable,Text,SumCountWritable> {

    private SumCountWritable result = new SumCountWritable();

    public void reduce(Text key, Iterable<SumCountWritable> values, Context context)
            throws IOException, InterruptedException {

        double sum = 0.0;
        int count = 0;

        for (SumCountWritable val : values) {
            sum += val.getSum();
            count += val.getCount();
        }

        result.set(sum, count);

        //System.out.println("Combiner");
        //System.out.println(result.toString());

        context.write(key, result);
    }
}
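
<p>The combiner forwards a (sum, count) pair rather than a partial average because averages of unequally sized groups cannot be merged correctly. A small worked example in Python, with made-up ratings and illustrative function names, shows the difference:</p>

```python
def combine(pairs):
    """Merge (sum, count) pairs the way ProdAvgCombiner does."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total, count


def average(pair):
    """Turn a (sum, count) pair into a mean, as the reducer does at the end."""
    total, count = pair
    return total / count


if __name__ == "__main__":
    split_a = [(5.0, 1), (5.0, 1), (5.0, 1)]   # made-up: three ratings in one split
    split_b = [(1.0, 1)]                       # made-up: one rating in another split
    partials = [combine(split_a), combine(split_b)]

    print(average(combine(partials)))          # (15 + 1) / (3 + 1) = 4.0
    naive = [average(p) for p in partials]
    print(sum(naive) / len(naive))             # (5.0 + 1.0) / 2 = 3.0, which is wrong
```

<p>Merging sums and counts gives the true mean of 4.0, while averaging the partial averages gives 3.0. This is also why the reducer class cannot simply be reused as the combiner in this job.</p>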

<p><span class="code-font">ProdAvgReducer.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class ProdAvgReducer extends Reducer<Text,SumCountWritable,Text,DoubleWritable> {

    private SumCountWritable result = new SumCountWritable();

    public void reduce(Text key, Iterable<SumCountWritable> values, Context context)
            throws IOException, InterruptedException {

        double sum = 0.0;
        int count = 0;

        for (SumCountWritable val : values) {
            sum += val.getSum();
            count += val.getCount();
        }

        result.set(sum, count);

        //System.out.println("Reducer");
        //System.out.println(result.toString());

        double average = sum / count;

        System.out.println(key + " : " + average);

        context.write(key, new DoubleWritable(average));
    }
}

<p><span class="cmd-no-code"></span>Run the code. Use <span class="code-font">"samples_100.json"</span> as the input</p>

<p><span class="cmd-no-code"></span>Create a jar file</p>

<p><b>Run on the Local Cloudera VM</b></p>

<p><span class="cmd-no-code"></span>Download the data archive to your Local Cloudera VM using <span class="code-font">wget</span>, extract it, and upload the extracted file to HDFS.<br>
Link to the dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
</p>

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>For more information about this dataset click on the link below<br><a href="http://jmcauley.ucsd.edu/data/amazon/">Amazon product data</a></p></div>
</div>

<p>Copy the jar file (if needed)</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/ProdAvgRatingApp.jar cloudera@127.0.0.1:/home/cloudera/classes/mapreduce

<p>Copy the data archive to the VM (or use <span class="code-font">wget</span> on the VM to download the file from the Internet)</p>

In [None]:
sudo scp -P 2222 /YOUR_PATH/reviews_Electronics_5.json.gz cloudera@127.0.0.1:/home/cloudera/classes/mapreduce

<p>Use port forwarding from your local host to the local VM to access the YARN ResourceManager dashboard (port 8088 on the VM)</p>

In [None]:
sudo ssh -N -f -L 9962:quickstart.cloudera:8088 cloudera@127.0.0.1 -p 2222

<p>Open a web browser to see a Hadoop dashboard</p>

<div class="code-block code-font"><a href="http://localhost:9962">http://localhost:9962</a></div>

<p>Connect to the VM via SSH</p>

In [None]:
sudo ssh -p 2222 cloudera@127.0.0.1

<p>Extract data</p>

In [None]:
gzip -d /home/cloudera/classes/mapreduce/reviews_Electronics_5.json.gz

<p>Create an HDFS directory for the extracted data</p>

In [None]:
hdfs dfs -mkdir -p /mapreduce_data/input_ratings

<p>Move the data to the HDFS directory</p>

In [None]:
hdfs dfs -moveFromLocal \
            /home/cloudera/classes/mapreduce/reviews_Electronics_5.json \
            hdfs:///mapreduce_data/input_ratings

<p>Run the jar file</p>

In [None]:
hadoop jar /home/cloudera/classes/mapreduce/ProdAvgRatingApp.jar \
            hdfs:///mapreduce_data/input_ratings \
            hdfs:///mapreduce_data/output_ratings

<p>Display the contents of the output file</p>

In [None]:
hdfs dfs -cat /mapreduce_data/output_ratings/part-r-00000

<p>Remove the output directory (if needed)</p>

In [None]:
hdfs dfs -rm -r /mapreduce_data/output_ratings

<a name="2b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Average rating of all products
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2c">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Run and debug the source code</b></p>

<p><span class="code-font">AverageDriver.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AverageDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(getConf(), "AverageRating");
        job.setJarByClass(AverageDriver.class);
        job.setMapperClass(AverageMapper.class);
        job.setCombinerClass(AverageCombiner.class);
        job.setReducerClass(AverageReducer.class);
        // Intermediate (mapper and combiner) output types
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(SumCountWritable.class);
        // Final (reducer) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.exit(ToolRunner.run(conf, new AverageDriver(), args));
    }
}


<p><span class="code-font">AverageMapper.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

import java.io.IOException;

public class AverageMapper extends Mapper<Object, Text, IntWritable, SumCountWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
    ) throws IOException, InterruptedException {

        JSONObject json = new JSONObject(value.toString());

        double rating = json.getDouble("overall");

        //System.out.println(rating);

        context.write(one, new SumCountWritable(rating, 1));
    }
}

<p><span class="code-font">AverageCombiner.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class AverageCombiner extends Reducer<IntWritable,SumCountWritable,IntWritable,SumCountWritable> {

    private SumCountWritable result = new SumCountWritable();

    public void reduce(IntWritable key, Iterable<SumCountWritable> values,Context context)
            throws IOException, InterruptedException {

        double sum = 0.0;
        int count = 0;
        for (SumCountWritable val : values) {
            sum += val.getSum();
            count += val.getCount();
        }

        result.set(sum, count);

        //System.out.println("Combiner");
        //System.out.println(result.toString());

        context.write(key, result);
    }
}

<p><span class="code-font">AverageReducer.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class AverageReducer extends Reducer<IntWritable,SumCountWritable,Text,DoubleWritable> {

    private SumCountWritable result = new SumCountWritable();

    public void reduce(IntWritable key, Iterable<SumCountWritable> values, Context context)
            throws IOException, InterruptedException {

        double sum = 0.0;
        int count = 0;

        for (SumCountWritable val : values) {
            sum += val.getSum();
            count += val.getCount();
        }

        result.set(sum, count);

        //System.out.println("Reducer");
        //System.out.println(result.toString());

        double average = sum / count;

        System.out.println(average);

        context.write(new Text("Average"), new DoubleWritable(average));
    }
}

<div class="msg-block msg-task">
  <div class="msg-text-task"><p>Run on the Local Cloudera VM</p></div>
</div>

In [None]:
# YOUR INSTRUCTIONS

<a name="2c"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            c. Filter items by their ratings
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2b">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2d">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Run and debug the source code</b></p>

<p><span class="code-font">FilterDriver.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FilterDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(getConf(), "FilterByRating");

        job.setJarByClass(FilterDriver.class);
        job.setMapperClass(FilterMapper.class);

        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The third command-line argument is the minimum rating to keep
        conf.setInt("threshold", Integer.parseInt(args[2]));

        System.exit(ToolRunner.run(conf, new FilterDriver(), args));
    }
}

<p><span class="code-font">FilterMapper.java</span></p>

In [None]:
package edu.classes.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

import java.io.IOException;

public class FilterMapper extends Mapper<Object, Text, IntWritable, Text> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private int thld = 0;

    protected void setup(Context context) throws IOException, InterruptedException {

        Configuration conf = context.getConfiguration();

        thld = conf.getInt("threshold", 0);

    }

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        JSONObject json = new JSONObject(value.toString());

        double rating = json.getDouble("overall");

        if (rating >= thld) {

            //System.out.println(rating);
            context.write(one, new Text(value.toString()));
        }
    }
}
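
<p>The filter is a map-only job: every record is kept or dropped on its own, so no reducer is needed. The predicate can be sketched in Python as follows. The function name and the sample lines are made up; unlike the Java mapper, the sketch yields only the JSON line and omits the constant key <span class="code-font">1</span>.</p>

```python
import json


def filter_reviews(lines, threshold=0):
    """Yield the JSON lines whose 'overall' rating is at least the threshold."""
    for line in lines:
        if float(json.loads(line)["overall"]) >= threshold:
            yield line


if __name__ == "__main__":
    reviews = [                                  # made-up review lines
        '{"asin": "A1", "overall": 5.0}',
        '{"asin": "A2", "overall": 2.0}',
        '{"asin": "A3", "overall": 4.0}',
    ]
    for kept in filter_reviews(reviews, threshold=4):
        print(kept)
```

<p>As in the Java mapper, the comparison is inclusive, so a rating equal to the threshold is kept.</p>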

<div class="msg-block msg-task">
  <div class="msg-text-task"><p>Run on the Local Cloudera VM</p></div>
</div>

In [None]:
# YOUR INSTRUCTIONS

<a name="2d"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            d. Average rating of product
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2c">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#3">Next</a>
            </div>
        </div>
    </div>
</div>

<div class="msg-block msg-task">
  <div class="msg-text-task"><p>Write code that computes the average rating per year for a given product ID</p></div>
</div>

In [None]:
# YOUR CODE AND INSTRUCTIONS

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. MapReduce on AWS EMR
</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<div class="msg-block msg-task">
  <div class="msg-text-task"><p>Deploy an EMR cluster with 3 instances and run the apps from the previous section. Use <span class="code-font">reviews_Electronics_5.json</span> as input after uploading it to HDFS</p></div>
</div>

<div class="msg-block msg-info">
  <div class="msg-text-info"><p>By default, EMR sets the HDFS replication factor to 1 for this configuration. You can use the <span class="code-font">--configurations</span> option of the <span class="code-font">create-cluster</span> command to supply your own configuration. To set the replication factor to 3, specify the configuration from the <span class="code-font">hdfs_config.json</span> file in the <span class="code-font">config</span> directory</p></div>
</div>
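
<p>The contents of <span class="code-font">hdfs_config.json</span> are not shown in this notebook. As an assumption about its likely shape, the Python sketch below builds a configuration in the EMR classification format that sets <span class="code-font">dfs.replication</span> in <span class="code-font">hdfs-site</span>; compare it with the actual file before use. The function name is hypothetical.</p>

```python
import json


def hdfs_replication_config(factor=3):
    """Build an EMR configuration list that overrides the HDFS replication factor.

    Assumption: this mirrors what config/hdfs_config.json is expected to
    contain; the real file may hold additional settings.
    """
    return [
        {
            "Classification": "hdfs-site",
            "Properties": {"dfs.replication": str(factor)},
        }
    ]


if __name__ == "__main__":
    print(json.dumps(hdfs_replication_config(), indent=2))
```

<p>Property values are passed as strings in this format, which is why the factor is converted with <span class="code-font">str</span>.</p>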

In [None]:
# YOUR INSTRUCTIONS

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>