### Simple Dataflow Pipeline

✓ Check the project permissions and enable the Dataflow API (see appendix). <br>
✓ Open an SSH terminal and connect to the training VM: <br>
Compute Engine > VM instances > training-vm > Connect <br>
✓ In the training-vm SSH terminal, download the code repository: <br>
`git clone https://github.com/GoogleCloudPlatform/training-data-analyst` <br>
✓ Create a Cloud Storage bucket: <br>
Cloud Storage > Browser > Create Bucket <br>
Name: `<your unique bucket name (Project ID)>` <br>
Location type: Multi-Region <br>
Location: `<your location>` <br>
✓ In the training-vm SSH terminal, set the bucket variable: <br>
`BUCKET="<your unique bucket name (Project ID)>"` <br>
`echo $BUCKET` <br>


✓ In the training-vm SSH terminal, change to the lab directory and open the source code in Nano. When you are done reading, press Ctrl+X to exit Nano. <br>
`cd ~/training-data-analyst/courses/data_analysis/lab2/python` <br>
`nano grep.py` <br>
✓ Can you answer these questions about the file grep.py? <br>
• What files are being read? <br>
• What is the search term? <br>
• Where does the output go? <br>
There are three transforms in the pipeline: <br>
• What does the first transform do? <br>
• What does the second transform do? <br>
• Where does its input come from? <br>
• What does it do with this input? <br>
• What does it write to its output? <br>
• Where does the output go to? <br>
• What does the third transform do? <br>

```python
import apache_beam as beam
import sys

def my_grep(line, term):
   # Emit the line only if it starts with the search term.
   if line.startswith(term):
      yield line

if __name__ == '__main__':
   p = beam.Pipeline(argv=sys.argv)
   input = '../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java'
   output_prefix = '/tmp/output'
   searchTerm = 'import'

   # find all lines that start with the searchTerm
   (p
      | 'GetJava' >> beam.io.ReadFromText(input)
      | 'Grep' >> beam.FlatMap(lambda line: my_grep(line, searchTerm) )
      | 'write' >> beam.io.WriteToText(output_prefix)
   )

   p.run().wait_until_finish()
```
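
To see why the `Grep` step pairs a generator with `beam.FlatMap`, it can help to run `my_grep` outside Beam. The sketch below is plain Python with no Beam dependency. `flat_map` and `sample_lines` are hypothetical names made up for illustration; `flat_map` only imitates the flattening behaviour on an in-memory list and is not part of the Beam API.

```python
def my_grep(line, term):
    # Same logic as grep.py: yield the line only if it starts with the term.
    if line.startswith(term):
        yield line


def flat_map(fn, items):
    # Hypothetical stand-in for beam.FlatMap: call fn on every item and
    # flatten the iterables it returns into a single list.
    return [out for item in items for out in fn(item)]


# Hypothetical sample input standing in for lines read from the .java files.
sample_lines = [
    "import java.util.List;",
    "public class Demo {",
    "  // import inside a comment is not matched",
    "import java.io.File;",
]

matches = flat_map(lambda line: my_grep(line, "import"), sample_lines)
print(matches)
```

Because `my_grep` yields zero or one element per input line, non-matching lines simply drop out of the result, and a line that mentions the term anywhere other than at its start is not matched.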

1. In the training-vm SSH terminal, run grep.py locally: <br>
`python3 grep.py` <br>
The output is written under the prefix `/tmp/output`. Beam shards its output, so the file name carries a shard suffix such as `output-00000-of-00001`; a larger output is split across several such parts. <br>
2. Locate the correct file by checking the file timestamps: <br>
`ls -al /tmp` <br>
3. Examine the output file(s). <br>
4. Print the output. You can replace `-*` below with the appropriate suffix: <br>
`cat /tmp/output-*` <br>
Does the output seem logical? <br>
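
One way to check whether the output is logical is to read every shard and confirm that each line starts with the search term. The following is a minimal sketch, assuming shard files named `<prefix>-…` as described above. `read_shards` is a hypothetical helper, and the demo writes two fake shards to a temporary directory so it does not depend on a previous pipeline run.

```python
import glob
import os
import tempfile


def read_shards(prefix):
    # Collect lines from every file whose name starts with "<prefix>-",
    # in sorted order so shard 00000 comes before shard 00001.
    lines = []
    for path in sorted(glob.glob(prefix + "-*")):
        with open(path, encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines


# Self-contained demo: write two fake shards to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_prefix = os.path.join(demo_dir, "output")
with open(demo_prefix + "-00000-of-00002", "w", encoding="utf-8") as f:
    f.write("import java.util.List;\n")
with open(demo_prefix + "-00001-of-00002", "w", encoding="utf-8") as f:
    f.write("import java.io.File;\n")

shard_lines = read_shards(demo_prefix)
all_match = all(line.startswith("import") for line in shard_lines)
print(len(shard_lines), "lines read; all start with 'import':", all_match)
```

To run the same check against the real output on the training VM, call `read_shards("/tmp/output")` instead of using the demo prefix.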

### Execute the pipeline on the cloud


1. Copy some Java files to the cloud. In the training-vm SSH terminal, enter the following command: <br>
```shell
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://$BUCKET/javahelp
```
2. Using Nano, edit the Dataflow pipeline in grepc.py. <br>
```shell
nano grepc.py
```
3. Replace the values of PROJECT and BUCKET with your Project ID and bucket name. <br>
```python
PROJECT='qwiklabs-gcp-your-value'
BUCKET='qwiklabs-gcp-your-value'
```
Save the file and close Nano by pressing Ctrl+X, then Y, then Enter. <br>
4. Submit the Dataflow job to the cloud: <br>
```shell
python3 grepc.py
```
Note: You may ignore the message `WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.` Your Dataflow job will start successfully. Because this is such a small job, running it on the cloud takes significantly longer than running it locally (on the order of 7-10 minutes). <br>
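5. Monitor the job in Dataflow. <br>
6. When the job has finished, examine the output: Cloud Storage > Browser > javahelp folder, then open the file whose name starts with `output`. <br>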

```python
import apache_beam as beam

def my_grep(line, term):
   # Emit the line only if it starts with the search term.
   if line.startswith(term):
      yield line

PROJECT='cloud-training-demos'
BUCKET='cloud-training-demos'

def run():
   # Options that send the job to the Dataflow service instead of running it locally.
   argv = [
      '--project={0}'.format(PROJECT),
      '--job_name=examplejob2',
      '--save_main_session',
      '--staging_location=gs://{0}/staging/'.format(BUCKET),
      '--temp_location=gs://{0}/staging/'.format(BUCKET),
      '--region=us-central1',
      '--runner=DataflowRunner'
   ]

   p = beam.Pipeline(argv=argv)
   input = 'gs://{0}/javahelp/*.java'.format(BUCKET)
   output_prefix = 'gs://{0}/javahelp/output'.format(BUCKET)
   searchTerm = 'import'

   # find all lines that start with the searchTerm
   (p
      | 'GetJava' >> beam.io.ReadFromText(input)
      | 'Grep' >> beam.FlatMap(lambda line: my_grep(line, searchTerm) )
      | 'write' >> beam.io.WriteToText(output_prefix)
   )

   p.run()

if __name__ == '__main__':
   run()
```
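
The option list in `run()` hard-codes the job name, so every submission uses `examplejob2`. The sketch below builds the same list from parameters and appends a timestamp to the job name. `dataflow_args`, `my-project-id`, and `my-bucket-name` are hypothetical names used only for illustration, and the idea that a fixed name can clash with a job that is still running is an assumption to verify for your project. The flags themselves are the ones already used in grepc.py.

```python
import time


def dataflow_args(project, bucket, region="us-central1", job_prefix="examplejob2"):
    # Append a timestamp so repeated submissions get distinct job names
    # (assumption: reusing the name of a still-running job may be rejected).
    job_name = "{0}-{1}".format(job_prefix, time.strftime("%Y%m%d-%H%M%S"))
    return [
        "--project={0}".format(project),
        "--job_name={0}".format(job_name),
        "--save_main_session",
        "--staging_location=gs://{0}/staging/".format(bucket),
        "--temp_location=gs://{0}/staging/".format(bucket),
        "--region={0}".format(region),
        "--runner=DataflowRunner",
    ]


# Placeholder values; substitute your own Project ID and bucket name.
args = dataflow_args("my-project-id", "my-bucket-name")
for arg in args:
    print(arg)
```

The returned list could be passed as `beam.Pipeline(argv=args)` in place of the literal list, which keeps the project, bucket, and region in one place.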

### Monitor the job in Dataflow
![monitor](Media/monitor.png)

## Check permissions (IAM)
![checkPermissions](./Media/checkPermissions.png)

If the Compute Engine default service account is not listed in IAM or does not have the Editor role, follow the steps below to assign the required role. <br>
• In the Google Cloud console, on the Navigation menu, click Home. <br>
• Copy the project number (e.g. 729328892908). <br>
• On the Navigation menu, click IAM & Admin > IAM. <br>
• At the top of the IAM page, click Add. <br>
• For New principals, type `{project-number}-compute@developer.gserviceaccount.com`, replacing `{project-number}` with your project number. <br>
• For Role, select Project (or Basic) > Editor, then click Save. <br>
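
The principal name is built from the numeric project number, not from the Project ID, which is an easy pair to mix up. As a small sketch, the hypothetical helper `default_compute_service_account` below assembles the address from the pattern given in the steps above and rejects anything that is not purely digits. The number used is the example from the steps, not a real project.

```python
def default_compute_service_account(project_number):
    # Accept an int or a str, but reject values that are not purely digits,
    # such as a Project ID pasted in by mistake.
    number = str(project_number).strip()
    if not number.isdigit():
        raise ValueError(
            "project number must contain only digits: {0!r}".format(project_number)
        )
    return "{0}-compute@developer.gserviceaccount.com".format(number)


principal = default_compute_service_account(729328892908)
print(principal)
```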
