### Running Apache Spark jobs on Dataproc

You migrate an existing Spark workload to Cloud Dataproc and then progressively modify the Spark code to use GCP-native features and services. <br>
Preparation: <br>
1. Activate Google Cloud Shell. <br>
2. Run `gcloud auth list` to confirm the active account. <br>
3. Run `gcloud config list project` to confirm the active project. <br>
4. Check project permissions (see the appendix). <br>

Configure and start a Cloud Dataproc cluster: <br>
1. In the GCP Console, on the Navigation menu, in the Big Data section, click Dataproc. <br>
2. Click Create Cluster. <br>
3. Enter sparktodp for Cluster Name. <br>
4. In the Versioning section, click Change and select 2.0 (Debian 10, Hadoop 3.2, Spark 3.1). <br>
This version includes Python 3, which is required for the sample code used in this lab. <br>

![imageVersion_dataproc](Media/imageVersion_dataproc.png)

5. Click Select. <br>
6. In the Components > Component gateway section, select Enable component gateway. <br>
7. Under Optional components, select Jupyter Notebook. <br>
8. Click Create. <br>

![createCluster_dataproc](Media/createCluster_dataproc.png)

**The cluster should start in a couple of minutes. You can proceed to the next step without waiting for the Cloud Dataproc cluster to fully deploy.**

### Setting (get repository, specify Dataproc storage, copy notebooks to Jupyter working folder)
Clone the source repository for the lab. <br>
In Cloud Shell you clone the Git repository for the lab and copy the required notebook files to the Cloud Storage bucket that Cloud Dataproc uses as the home directory for Jupyter notebooks. <br>
1. To clone the Git repository for the lab, enter the following command in Cloud Shell:
```shell
git -C ~ clone https://github.com/GoogleCloudPlatform/training-data-analyst 
```
2. To locate the default Cloud Storage bucket used by Cloud Dataproc, enter the following command in Cloud Shell: <br>
```shell
export DP_STORAGE="gs://$(gcloud dataproc clusters describe sparktodp --region=us-central1 --format=json | jq -r '.config.configBucket')"
```
3. To copy the sample notebooks into the Jupyter working folder enter the following command in Cloud Shell: <br>
```shell
gsutil -m cp ~/training-data-analyst/quests/sparktobq/*.ipynb $DP_STORAGE/notebooks/jupyter
```
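The bucket lookup in step 2 can be sketched in Python to show what the `jq` filter extracts. This is a minimal illustration, not part of the lab: `SAMPLE_DESCRIBE_JSON` is a hypothetical, heavily trimmed stand-in for the real `describe` output, and `notebook_folder` is a helper name invented here.

```python
import json

# Hypothetical, heavily trimmed stand-in for the JSON printed by
# `gcloud dataproc clusters describe sparktodp --format=json`.
SAMPLE_DESCRIBE_JSON = """
{
  "clusterName": "sparktodp",
  "config": {
    "configBucket": "dataproc-staging-example-bucket"
  }
}
"""


def notebook_folder(describe_json):
    """Return the gs:// folder that the notebook copy command targets."""
    cluster = json.loads(describe_json)
    # Same lookup as the jq filter '.config.configBucket'.
    bucket = cluster["config"]["configBucket"]
    return "gs://{}/notebooks/jupyter".format(bucket)


print(notebook_folder(SAMPLE_DESCRIBE_JSON))
```

The real output contains many more fields; only the `config.configBucket` value matters for locating the Jupyter working folder.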

### Log in to the Jupyter Notebook
As soon as the cluster has fully started you can connect to the web interfaces. Click the refresh button to check, as the cluster may already be fully deployed by the time you reach this stage. <br>
1. On the Dataproc Clusters page, wait for the cluster to finish starting and then click the name of your cluster to open the Cluster details page. <br>
2. Click Web Interfaces. <br>
3. Click the Jupyter link to open a new Jupyter tab in your browser. <br>
This opens the Jupyter home page. Here you can see the contents of the /notebooks/jupyter directory in Cloud Storage, which now includes the sample Jupyter notebooks used in this lab. <br>
4. Under the Files tab, click the GCS folder and then click the 01_spark.ipynb notebook to open it. <br>
5. Click Cell and then Run All to run all of the cells in the notebook. <br>
6. Page back up to the top of the notebook and follow along as the notebook runs each cell and outputs the results below it. <br>

The first code cell fetches the source data file, which is an extract from the KDD Cup competition held at the Knowledge Discovery and Data Mining (KDD) conference in 1999. The data relates to computer intrusion detection events.
```shell
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz
```
In the second code cell, the source data is copied to the default (local) Hadoop file system.
```shell
!hadoop fs -put kddcup* /
```
In the third code cell, the command lists the contents of the root directory of the cluster's HDFS file system.
```shell
!hadoop fs -ls /
```


### Jupyter Notebook (reading in data)
The data are gzipped CSV files. In Spark, these can be read directly using the textFile method and then parsed by splitting each row on commas.
The Python Spark code starts in cell In[4]. In this cell, Spark SQL is initialized, and Spark reads the source data as text and returns the first 5 rows.

```Python
from pyspark.sql import SparkSession, SQLContext, Row
spark = SparkSession.builder.appName("kdd").getOrCreate()
sc = spark.sparkContext
data_file = "hdfs:///kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(data_file).cache()
raw_rdd.take(5)
```
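As a rough plain-Python analogue of reading a gzipped text file and taking the first few lines, the sketch below uses only the standard library. It is not how Spark works internally; the file contents are made-up records and `first_lines` is a helper name invented for this illustration.

```python
import gzip
import os
import tempfile


def first_lines(path, n):
    """Return the first n lines of a gzipped text file, newlines stripped."""
    lines = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) == n:
                break
    return lines


# Build a tiny gzipped file with made-up records so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample_path, "wt") as handle:
    handle.write("0,tcp,http,SF,181,5450,normal.\n")
    handle.write("0,icmp,ecr_i,SF,1032,0,smurf.\n")
    handle.write("0,udp,private,SF,105,146,normal.\n")

sample_lines = first_lines(sample_path, 2)
print(sample_lines)
```

Like `take(5)` in the notebook, the helper stops reading as soon as it has the requested number of lines.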


```Python
# clean data
csv_rdd = raw_rdd.map(lambda row: row.split(","))
parsed_rdd = csv_rdd.map(lambda r: Row(
    duration=int(r[0]),
    protocol_type=r[1],
    service=r[2],
    flag=r[3],
    src_bytes=int(r[4]),
    dst_bytes=int(r[5]),
    wrong_fragment=int(r[7]),
    urgent=int(r[8]),
    hot=int(r[9]),
    num_failed_logins=int(r[10]),
    num_compromised=int(r[12]),
    su_attempted=r[14],
    num_root=int(r[15]),
    num_file_creations=int(r[16]),
    label=r[-1]
    )
)
parsed_rdd.take(5)
```
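The row-parsing step can be tried without a cluster. The sketch below applies the same split-on-commas logic and a subset of the same column indices to a single record, using a `namedtuple` in place of Spark's `Row`. The sample record is made up (shortened to 18 fields) and `parse_record` is a hypothetical helper name.

```python
from collections import namedtuple

# Plain-Python stand-in for the Spark Row used in the notebook (subset of fields).
Connection = namedtuple(
    "Connection",
    ["duration", "protocol_type", "service", "flag",
     "src_bytes", "dst_bytes", "num_failed_logins", "label"],
)


def parse_record(line):
    """Split one CSV record on commas and convert the numeric columns."""
    r = line.split(",")
    return Connection(
        duration=int(r[0]),
        protocol_type=r[1],
        service=r[2],
        flag=r[3],
        src_bytes=int(r[4]),
        dst_bytes=int(r[5]),
        num_failed_logins=int(r[10]),
        label=r[-1],  # the label is always the last column
    )


# Made-up record that follows the column layout used in the notebook.
sample = "0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,normal."
record = parse_record(sample)
print(record)
```

Note that the label keeps its trailing period (`normal.`), which is why the SQL later compares against `'normal.'`.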

```Python
# query data
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(parsed_rdd)
connections_by_protocol = df.groupBy('protocol_type').count().orderBy('count', ascending=False)
connections_by_protocol.show()
```
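The group, count, and sort-descending pattern in this cell has a simple standard-library analogue. The sketch below uses `collections.Counter` on a made-up list of protocol values; it only illustrates the shape of the result, not Spark's distributed execution.

```python
from collections import Counter

# Made-up protocol_type values standing in for the DataFrame column.
protocols = ["tcp", "icmp", "icmp", "udp", "icmp", "tcp"]

# Count occurrences per protocol, then list them from most to least frequent.
counts = Counter(protocols)
ranked = counts.most_common()
print(ranked)
```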


```Python
df.registerTempTable("connections")
attack_stats = sqlContext.sql("""
    SELECT
      protocol_type,
      CASE label
        WHEN 'normal.' THEN 'no attack'
        ELSE 'attack'
      END AS state,
      COUNT(*) as total_freq,
      ROUND(AVG(src_bytes), 2) as mean_src_bytes,
      ROUND(AVG(dst_bytes), 2) as mean_dst_bytes,
      ROUND(AVG(duration), 2) as mean_duration,
      SUM(num_failed_logins) as total_failed_logins,
      SUM(num_compromised) as total_compromised,
      SUM(num_file_creations) as total_file_creations,
      SUM(su_attempted) as total_root_attempts,
      SUM(num_root) as total_root_acceses
    FROM connections
    GROUP BY protocol_type, state
    ORDER BY 3 DESC
""")
attack_stats.show()
```
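To see what the `CASE`, `GROUP BY`, and `ORDER BY 3 DESC` clauses do, a cut-down version of the query can be run locally. This sketch swaps Spark SQL for the standard-library `sqlite3` module, which happens to accept the same syntax for this reduced query; the table contents are made-up rows, not lab data.

```python
import sqlite3

# Made-up rows: (protocol_type, label, src_bytes).
rows = [
    ("tcp", "normal.", 100),
    ("tcp", "normal.", 200),
    ("tcp", "smurf.", 50),
    ("icmp", "smurf.", 1000),
    ("icmp", "smurf.", 1032),
    ("icmp", "neptune.", 10),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE connections (protocol_type TEXT, label TEXT, src_bytes INTEGER)"
)
conn.executemany("INSERT INTO connections VALUES (?, ?, ?)", rows)

# Reduced form of the notebook query: any label other than 'normal.' is an attack.
query = """
SELECT
  protocol_type,
  CASE label
    WHEN 'normal.' THEN 'no attack'
    ELSE 'attack'
  END AS state,
  COUNT(*) AS total_freq,
  ROUND(AVG(src_bytes), 2) AS mean_src_bytes
FROM connections
GROUP BY protocol_type, state
ORDER BY 3 DESC
"""
attack_stats = conn.execute(query).fetchall()
for row in attack_stats:
    print(row)
```

`ORDER BY 3` sorts on the third selected column, `total_freq`, so the most frequent protocol and state combination comes first.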

### Jupyter Notebook (Spark SQL to DataFrame, visualization)

```Python
%matplotlib inline
ax = attack_stats.toPandas().plot.bar(x='protocol_type', subplots=True, figsize=(10,25))
```


![viz](Media/viz.png)

### Cloud Storage instead of HDFS (copy data to new storage bucket)

Modify the Spark jobs to use Cloud Storage instead of HDFS. <br>
Starting from this original 'Lift & Shift' sample notebook, you now create a copy that decouples the job's storage requirements from its compute requirements. In this case, all you have to do is replace the Hadoop file system calls with Cloud Storage calls: replace the hdfs:// storage references in the code with gs:// references and adjust folder names as necessary. <br>
You start by using Cloud Shell to place a copy of the source data in a new Cloud Storage bucket. <br>
1. In Cloud Shell, create a new storage bucket for your source data.
```shell
export PROJECT_ID=$(gcloud info --format='value(config.project)')
gsutil mb gs://$PROJECT_ID
```
2. In Cloud Shell, copy the source data into the bucket.
```shell
wget https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz
gsutil cp kddcup.data_10_percent.gz gs://$PROJECT_ID/
```
3. Switch back to the 01_spark Jupyter Notebook tab, make a copy named De-couple-storage, and close the original notebook. In the copy, delete the first three code cells, because the data now comes from Cloud Storage rather than HDFS. <br>
4. Replace the code in cell 4 with the following code, setting gcs_bucket to the name of your bucket:



```Python
from pyspark.sql import SparkSession, SQLContext, Row
gcs_bucket = '[Your-Bucket-Name]'  # replace with the name of your bucket
spark = SparkSession.builder.appName("kdd").getOrCreate()
sc = spark.sparkContext
data_file = "gs://" + gcs_bucket + "/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(data_file).cache()
raw_rdd.take(5)
```
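When building the `gs://` path by string concatenation it is easy to end up with a doubled slash. A small helper can guard against that; `gcs_uri` and the bucket name below are hypothetical names used only for this sketch.

```python
def gcs_uri(bucket, object_name):
    """Build a gs:// URI with exactly one slash between bucket and object."""
    return "gs://{}/{}".format(bucket.strip("/"), object_name.lstrip("/"))


# Hypothetical bucket name; a leading slash on the object name is tolerated.
data_file = gcs_uri("my-example-bucket", "/kddcup.data_10_percent.gz")
print(data_file)
```

The resulting string can be passed to `sc.textFile` in the same way as the `data_file` value in the cell above.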

### Deploy Spark jobs (optimize Spark jobs to run on job-specific clusters)

3. Switch back to the 01_spark Jupyter Notebook tab, make a copy named PySpark-analysis-file, and close the original notebook. <br>
4. In the copy, insert a new cell above the first code cell with the following code:




```Python
%%writefile spark_analysis.py
import matplotlib
matplotlib.use('agg')
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--bucket", help="bucket for input and output")
args = parser.parse_args()
BUCKET = args.bucket
```
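The `--bucket` handling can be checked in isolation. The sketch below wraps the same `argparse` setup in a hypothetical `parse_bucket` helper that accepts an explicit argument list, so it runs without command-line input; the bucket name is made up.

```python
import argparse


def parse_bucket(argv):
    """Parse the --bucket option from an explicit argument list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", help="bucket for input and output")
    args = parser.parse_args(argv)
    return args.bucket


# Both spellings are accepted by argparse.
bucket_eq = parse_bucket(["--bucket=my-example-bucket"])
bucket_sp = parse_bucket(["--bucket", "my-example-bucket"])
# The option is not marked as required, so omitting it yields None.
bucket_missing = parse_bucket([])
print(bucket_eq, bucket_sp, bucket_missing)
```

Because a missing `--bucket` silently becomes `None`, the standalone script would fail later when it builds the `gs://` path; always pass the option when running it.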

The **%%writefile spark_analysis.py** Jupyter magic command creates a new output file to contain your standalone Python script. You will add a variation of this to the remaining cells to append the contents of each cell to the standalone script file.

The %%writefile spark_analysis.py cell also imports the matplotlib module and explicitly sets the default plotting backend via matplotlib.use('agg') so that the plotting code runs outside of a Jupyter notebook. <br>
10. For the remaining cells, insert %%writefile -a spark_analysis.py at the start of each Python code cell. These are the five cells labelled In [x]. <br>
For example, the next cell should now look as follows.
```Python
%%writefile -a spark_analysis.py
from pyspark.sql import SparkSession, SQLContext, Row
spark = SparkSession.builder.appName("kdd").getOrCreate()
sc = spark.sparkContext
data_file = "gs://{}/kddcup.data_10_percent.gz".format(BUCKET)
raw_rdd = sc.textFile(data_file).cache()
#raw_rdd.take(5)
```

11. Repeat this step, inserting %%writefile -a spark_analysis.py at the start of each code cell until you reach the end. <br>
12. In the last cell, where the Pandas bar chart is plotted, remove the %matplotlib inline magic command. <br>
Note: You must remove this inline matplotlib Jupyter magic directive or your script will fail when you run it. <br>
13. Make sure you have selected the last code cell in the notebook, then, in the menu bar, click Insert and select Insert Cell Below. <br>
14. Paste the following code into the new cell.
```Python
%%writefile -a spark_analysis.py
ax[0].get_figure().savefig('report.png');
```
15. Add another new cell at the end of the notebook and paste in the following:
```Python
%%writefile -a spark_analysis.py
import google.cloud.storage as gcs
bucket = gcs.Client().get_bucket(BUCKET)
for blob in bucket.list_blobs(prefix='sparktodp/'):
    blob.delete()
bucket.blob('sparktodp/report.png').upload_from_filename('report.png')
```
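The steps above build `spark_analysis.py` one cell at a time: the first cell creates the file and every later cell appends to it. The sketch below mimics that behaviour with ordinary file modes. It is only an approximation of the magic command, and `writefile` is a helper name invented here.

```python
import os
import tempfile


def writefile(path, cell_source, append=False):
    """Mimic %%writefile (overwrite) and %%writefile -a (append)."""
    mode = "a" if append else "w"
    with open(path, mode) as handle:
        handle.write(cell_source)


script_path = os.path.join(tempfile.mkdtemp(), "spark_analysis.py")

# First cell: creates (or overwrites) the script.
writefile(script_path, "import argparse\n")
# Later cells: appended in notebook order.
writefile(script_path, "BUCKET = 'example'\n", append=True)
writefile(script_path, "print(BUCKET)\n", append=True)

with open(script_path) as handle:
    script_text = handle.read()
print(script_text)
```

Because the first cell overwrites the file, running all cells from the top rebuilds the script cleanly, whereas re-running a single appending cell on its own would duplicate its contents in the file.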

### Test Automation

You now test that the PySpark code runs successfully as a file by calling the local copy from inside the notebook, passing in a parameter that identifies the storage bucket you created earlier, which stores the input data for this job. The same bucket is used to store the report data files produced by the script. <br>
1. In the PySpark-analysis-file notebook, add a new cell at the end of the notebook and paste in the following: <br>
```shell
BUCKET_list = !gcloud info --format='value(config.project)'
BUCKET=BUCKET_list[0]
print('Writing to {}'.format(BUCKET))
!/opt/conda/miniconda3/bin/python spark_analysis.py --bucket=$BUCKET
```
This code assumes that you have followed the earlier instructions and created a Cloud Storage Bucket 
using your lab Project ID as the Storage Bucket name. If you used a different name modify this code to 
set the BUCKET variable to the name you used.
2. Add a new cell at the end of the notebook and paste in the following:

```shell
!gsutil ls gs://$BUCKET/sparktodp/**
```
This lists the script output files that have been saved to your Cloud Storage bucket.

3. To save a copy of the Python file to persistent storage, add a new cell and paste in the following:
```shell
!gsutil cp spark_analysis.py gs://$BUCKET/sparktodp/spark_analysis.py
```
4. Click Cell and then Run All to run all of the cells in the notebook. <br>
![test_automation](Media/test_automation.png)

### Test Automation (run the analysis job from Cloud Shell)
1. Switch back to your Cloud Shell and copy the Python script from Cloud Storage so you can run it as a Cloud Dataproc job.
```shell
gsutil cp gs://$PROJECT_ID/sparktodp/spark_analysis.py spark_analysis.py
```
2. Create a launch script.

```shell
nano submit_onejob.sh
```
3. Paste the following into the script:

```shell
#!/bin/bash
gcloud dataproc jobs submit pyspark \
--cluster sparktodp \
--region us-central1 \
spark_analysis.py \
-- --bucket=$1
```
4. Press CTRL+X, then Y, and then Enter to save and exit. <br>
5. Make the script executable by running `chmod +x submit_onejob.sh`. <br>
6. Launch the PySpark analysis job:
```shell
./submit_onejob.sh $PROJECT_ID
```
7. In the Cloud Console tab, navigate to the Dataproc > Clusters page if it is not already open. <br>
8. Click Jobs and wait until the job has finished. <br>
10. Navigate to your storage bucket and note that the output report, /sparktodp/report.png, has an updated timestamp indicating that the stand-alone job has completed successfully. <br>
The storage bucket used by this job for input and output data storage is the bucket that uses just the Project ID as its name. <br>
11. Navigate back to the Dataproc > Clusters page. <br>
12. Select the sparktodp cluster and click Delete. You don't need it any more. <br>
13. Click CONFIRM. <br>
14. Close the Jupyter tabs in your browser. <br>
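The structure of the launch script from step 3 can be made explicit by assembling the same command as a Python argument list. The sketch below only builds and prints the command and does not invoke `gcloud`. `build_submit_command` and the bucket name are hypothetical, introduced for illustration.

```python
def build_submit_command(script, cluster, region, bucket):
    """Assemble the argument list used by submit_onejob.sh."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster", cluster,
        "--region", region,
        script,
        "--",  # arguments after this separator are passed to the script itself
        "--bucket={}".format(bucket),
    ]


command = build_submit_command(
    "spark_analysis.py", "sparktodp", "us-central1", "my-example-project"
)
print(" ".join(command))
```

The bare `--` is the important detail: without it, `--bucket` would be interpreted as an option to `gcloud` rather than forwarded to `spark_analysis.py`.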


### Appendix
#### Check project permissions
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM). <br>
1. In the Google Cloud console, on the Navigation menu, click IAM & Admin > IAM. <br>
2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Home. <br>
![permissions](Media/permissions.png)