### MapReduce in Dataflow

✓ Open the SSH terminal and connect to the training VM <br>
Compute Engine > VM instances > training-vm > Connect <br>
✓ In the training-vm SSH terminal, clone the training GitHub repository <br>
git clone https://github.com/GoogleCloudPlatform/training-data-analyst <br>
✓ Identify Map and Reduce operations <br>
In the training-vm SSH terminal, navigate to the directory /training-data-analyst/courses/data_analysis/lab2/python, then open is_popular.py with Nano. Press Ctrl+X to exit when you are done. <br>
Can you answer these questions about the file is_popular.py? <br>
• What custom arguments are defined? <br>
• What is the default output prefix? <br>
• How is the variable output_prefix in the `__main__` block set? <br>
• How are the pipeline arguments such as --runner set? <br>
• What are the key steps in the pipeline? <br>
• Which of these steps happen in parallel? <br>
• Which of these steps are aggregations? <br>
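
The argument-handling questions above depend on how `argparse.ArgumentParser.parse_known_args` behaves. The sketch below is a minimal illustration, not part of is_popular.py: the helper name `parse_args` is hypothetical, and the two arguments are copied from the script. It shows that arguments the parser knows are returned as options, while anything else (such as `--runner`) is returned untouched in a separate list that can be handed to the pipeline.

```python
import argparse

def parse_args(argv):
    """Split the script's own arguments from everything else on the command line.

    `parse_args` is a hypothetical helper name used only for this illustration.
    """
    parser = argparse.ArgumentParser(description='Find the most used Java packages')
    parser.add_argument('--output_prefix', default='/tmp/output', help='Output prefix')
    parser.add_argument('--input', default='../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/', help='Input directory')
    # parse_known_args returns (namespace of known options, list of unrecognized arguments)
    return parser.parse_known_args(argv)

options, pipeline_args = parse_args(['--output_prefix=/tmp/myoutput', '--runner=DirectRunner'])
print(options.output_prefix)  # /tmp/myoutput
print(pipeline_args)          # ['--runner=DirectRunner']
```

Calling `parse_args([])` leaves `output_prefix` at its default and returns an empty list of pipeline arguments.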


```python
import apache_beam as beam
import argparse

def startsWith(line, term):
   # Emit the line only if it begins with the search term
   if line.startswith(term):
      yield line

def splitPackageName(packageName):
   """e.g. given com.example.appname.library.widgetname
           returns com
                   com.example
                   com.example.appname
      etc.
   """
   result = []
   end = packageName.find('.')
   while end > 0:
      result.append(packageName[0:end])
      end = packageName.find('.', end+1)
   result.append(packageName)
   return result

def getPackages(line, keyword):
   # Extract the text between the keyword and the terminating ';'
   start = line.find(keyword) + len(keyword)
   end = line.find(';', start)
   if start < end:
      packageName = line[start:end].strip()
      return splitPackageName(packageName)
   return []

def packageUse(line, keyword):
   # Emit a (package, 1) pair for every prefix of the imported package name
   packages = getPackages(line, keyword)
   for p in packages:
      yield (p, 1)

if __name__ == '__main__':
   parser = argparse.ArgumentParser(description='Find the most used Java packages')
   parser.add_argument('--output_prefix', default='/tmp/output', help='Output prefix')
   parser.add_argument('--input', default='../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/', help='Input directory')

   # Arguments not defined above are collected in pipeline_args
   options, pipeline_args = parser.parse_known_args()
   p = beam.Pipeline(argv=pipeline_args)

   input = '{0}*.java'.format(options.input)
   output_prefix = options.output_prefix
   keyword = 'import'

   # find most used packages
   (p
      | 'GetJava' >> beam.io.ReadFromText(input)
      | 'GetImports' >> beam.FlatMap(lambda line: startsWith(line, keyword))
      | 'PackageUse' >> beam.FlatMap(lambda line: packageUse(line, keyword))
      | 'TotalUse' >> beam.CombinePerKey(sum)
      | 'Top_5' >> beam.transforms.combiners.Top.Of(5, key=lambda kv: kv[1])
      | 'write' >> beam.io.WriteToText(output_prefix)
   )

   p.run().wait_until_finish()
```
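
To make the Map and Reduce roles in the pipeline above concrete, here is a plain-Python sketch that does not use Beam at all. It is a simplified re-implementation under stated assumptions: the helper names `split_package_name` and `package_use` and the sample `lines` are hypothetical, and the parsing is slightly stricter than in the script. The map phase looks at one line at a time, so each line could be processed independently; the reduce phase needs all pairs that share a key, and the final ranking needs all totals.

```python
from collections import Counter

def split_package_name(package_name):
    """'com.example.app' -> ['com', 'com.example', 'com.example.app']"""
    parts = package_name.split('.')
    return ['.'.join(parts[:i]) for i in range(1, len(parts) + 1)]

def package_use(line, keyword='import'):
    """Map step: turn ONE line into zero or more (package, 1) pairs."""
    if not line.startswith(keyword):
        return []
    rest = line[len(keyword):]
    if ';' not in rest:
        return []
    name = rest.split(';')[0].strip()
    if not name:
        return []
    return [(p, 1) for p in split_package_name(name)]

# Hypothetical input lines, standing in for the contents of the .java files
lines = [
    'package com.example;',
    'import java.util.List;',
    'import java.util.Map;',
    'import org.apache.beam.sdk.Pipeline;',
    'public class Demo {}',
]

# Map phase: each line is handled on its own, so this work can be split up freely
pairs = [pair for line in lines for pair in package_use(line)]

# Reduce phase: summing requires every pair that shares a key
totals = Counter()
for package, count in pairs:
    totals[package] += count

# Final aggregation: ranking requires all totals (top 2 here to keep the output short)
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)  # [('java', 2), ('java.util', 2)]
```

In Beam the same division of labor is expressed with element-wise transforms for the map phase and combining transforms for the aggregations, which is what lets a runner distribute the work.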

1. In the training-vm SSH terminal, run the pipeline locally: <br>
python3 ./is_popular.py <br>
2. Identify the output file. Its name should be output followed by a suffix, and the output may be sharded across several files. <br>
ls -al /tmp <br>
3. Examine the output file, replacing '-*' with the appropriate suffix. <br>
cat /tmp/output-* <br>
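
Because the output may be split across several shard files, it can be convenient to read them all at once. The sketch below is a small illustration of what `cat /tmp/output-*` does, written in Python; the helper name `read_sharded_output` is hypothetical, and it assumes every shard is a text file whose name starts with the prefix followed by a hyphen.

```python
import glob

def read_sharded_output(prefix):
    """Return the lines of every file named '<prefix>-*', in sorted file order."""
    lines = []
    for path in sorted(glob.glob(prefix + '-*')):
        with open(path) as f:
            lines.extend(line.rstrip('\n') for line in f)
    return lines

if __name__ == '__main__':
    # Prints nothing if no matching shard files exist yet
    for line in read_sharded_output('/tmp/output'):
        print(line)
```

If no matching files exist, the function returns an empty list rather than raising an error.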
✓ Use command line parameters <br>
1. In the training-vm SSH terminal, change the output prefix from the default value: <br>
python3 ./is_popular.py --output_prefix=/tmp/myoutput <br>
2. What will be the name of the new file that is written out? <br>
3. Note that there is now a new file in the /tmp directory: <br>
ls -lrt /tmp/myoutput* <br>
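
One way to reason about the file name in question 2 is to reproduce the naming pattern by hand. The sketch below assumes that `WriteToText` is using its default shard name template (`-SSSSS-of-NNNNN`) and no file name suffix, as in the pipeline above. The helper `shard_file_name` is hypothetical, and the number of shards actually produced depends on the runner.

```python
def shard_file_name(prefix, shard, num_shards, suffix=''):
    """Build '<prefix>-SSSSS-of-NNNNN<suffix>' with zero-padded shard numbers."""
    return '{}-{:05d}-of-{:05d}{}'.format(prefix, shard, num_shards, suffix)

# Assuming the local run writes a single shard:
print(shard_file_name('/tmp/myoutput', 0, 1))  # /tmp/myoutput-00000-of-00001
```

Compare the printed name with what `ls -lrt /tmp/myoutput*` reports on the VM; if the run produced more than one shard, the count after `-of-` will differ.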