# Part 2

Partially based on Google's tutorial: https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial#python (see it for additional links and for documentation of the gcloud command-line parameters and usage).

## Loading data

In [1]:
import urllib.request
url = 'https://drive.google.com/uc?export=download&confirm=t&id=1Ijyh14a0Lh9sjwQUR6PE1TB2phjAZP4P'
filename = "browsing.txt"
urllib.request.urlretrieve(url, filename)

('browsing.txt', <http.client.HTTPMessage at 0x7f41153143a0>)
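
Before uploading the file anywhere, it can help to confirm that the download is a real basket file and not an error page. The following is a minimal sketch, assuming each line of `browsing.txt` is one whitespace-separated basket of item IDs (the same format that `str.split` expects in the Spark script below). The helper name `summarize_baskets` is hypothetical and not part of the lab.

```python
from collections import Counter
from pathlib import Path


def summarize_baskets(path):
    """Return (non-empty line count, distinct item count, 3 most common items)."""
    item_counts = Counter()
    n_lines = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            items = line.split()
            if not items:
                continue  # skip blank lines
            n_lines += 1
            item_counts.update(items)
    return n_lines, len(item_counts), item_counts.most_common(3)


# Only runs the check if the download from the cell above is present.
if Path("browsing.txt").exists():
    print(summarize_baskets("browsing.txt"))
```

If the first number is tiny or the most common "items" look like HTML tags, the download probably returned a web page instead of the data.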

## Step 3.1:
Copy your working a_priori function code into the cell below.

In [2]:
%%writefile pyspark_apriori.py
import pyspark, time
import sys
from operator import add
from itertools import combinations

if len(sys.argv) < 2:
  raise Exception("Input URI required")

def a_priori_step1(text_file_rdd):
  item_lines = text_file_rdd.map(str.split)
  # A-Priori step 1: count how often each item occurs (the support filter is applied by the caller)
  item_counts = item_lines.flatMap(lambda line: ((item, 1) for item in line)).reduceByKey(add)
  return item_counts

def a_priori(text_file_rdd, support=100):
  frequent_items = (a_priori_step1(text_file_rdd)
          .filter(lambda kv: kv[1] >= support)     # Filter out uncommon items
          .map(lambda kv: kv[0])                  # Strip out the count
  )
  frequent_items = set(frequent_items.collect())  # Turn into local variable
  frequent_items = sc.broadcast(frequent_items)   # Broadcast to all nodes
  # Do the naive algorithm, but filter out uncommon items first.
  item_lines = text_file_rdd.map(lambda line: [item for item in line.split() if item in frequent_items.value])
  all_combinations = item_lines.flatMap(lambda items: combinations(items, r=2)).map(lambda pair: tuple(sorted(pair)))
  pair_counts = all_combinations.map(lambda pair: (pair, 1)).reduceByKey(add)
  filtered_counts = pair_counts.filter(lambda kv: kv[1] >= support)
  return filtered_counts
  
support_threshold = 1000
if len(sys.argv) == 3:
  support_threshold = int(sys.argv[2])

sc = pyspark.SparkContext()

time_start = time.time()
rdd = sc.textFile(sys.argv[1])

pair_counts = a_priori(rdd, support=support_threshold)
print(pair_counts.takeOrdered(5, key=lambda kv: -kv[1]))

time_end = time.time()
print(f"elapsed time is {time_end-time_start}")

Writing pyspark_apriori.py
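
Since the cluster takes several minutes to start, it may be worth checking the logic on a tiny input first. The sketch below is a plain-Python version of the same two passes (count items, keep the frequent ones, then count pairs of frequent items). It is not the Spark code itself; the name `a_priori_local` and the four sample baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def a_priori_local(lines, support=100):
    """Count frequent item pairs in an iterable of whitespace-separated lines."""
    baskets = [line.split() for line in lines]

    # Pass 1: count individual items and keep those meeting the support threshold.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent = {item for item, n in item_counts.items() if n >= support}

    # Pass 2: count only pairs whose members are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = [item for item in basket if item in frequent]
        for pair in combinations(kept, 2):
            pair_counts[tuple(sorted(pair))] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= support}


# Hypothetical sample data: 'd' is infrequent, so the pair ('b', 'd') is never counted.
tiny = ["a b c", "a b", "a c", "b d"]
print(a_priori_local(tiny, support=2))
```

Running the Spark version on the same few lines with the same support value should give the same pairs and counts, which makes differences easy to spot.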


## Step 3.2:
Edit the cell to add your USERNAME

In [3]:
USERNAME="amh284"
%env REGION=australia-southeast1
%env ZONE=australia-southeast1-a
%env PROJECT=data301-2023-$USERNAME
%env CLUSTER=data301-2023-$USERNAME-lab3-cluster
%env BUCKET=data301-2023-$USERNAME-lab3-bucket


env: REGION=australia-southeast1
env: ZONE=australia-southeast1-a
env: PROJECT=data301-2023-amh284
env: CLUSTER=data301-2023-amh284-lab3-cluster
env: BUCKET=data301-2023-amh284-lab3-bucket
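
The resource names above all follow one pattern built from the username. As a rough sketch, the same names can be derived in Python and the bucket name checked against a simplified version of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit). The function names are hypothetical, and the check deliberately ignores the finer rules, such as those for dotted names and reserved prefixes.

```python
import re


def lab_resource_names(username, course="data301-2023", lab="lab3"):
    """Build the project, cluster and bucket names used in this lab."""
    project = f"{course}-{username}"
    return {
        "PROJECT": project,
        "CLUSTER": f"{project}-{lab}-cluster",
        "BUCKET": f"{project}-{lab}-bucket",
    }


# Simplified pattern: an assumption for illustration, not the full rule set.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the simplified bucket-name check."""
    return _BUCKET_RE.fullmatch(name) is not None


names = lab_resource_names("amh284")
print(names)
print(looks_like_valid_bucket_name(names["BUCKET"]))
```

A username containing uppercase letters would fail this check, which is worth catching before any `gsutil` or `gcloud` command runs.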


## Step 3.3:
Run the code to set up the Google Cloud project and storage bucket.

In [4]:
!python3 -m pip install google-cloud-dataproc[libcst]


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-cloud-dataproc[libcst]
  Downloading google_cloud_dataproc-5.4.0-py2.py3-none-any.whl (307 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.3/307.3 KB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
Collecting grpc-google-iam-v1<1.0.0dev,>=0.12.4
  Downloading grpc_google_iam_v1-0.12.6-py2.py3-none-any.whl (26 kB)
Installing collected packages: grpc-google-iam-v1, google-cloud-dataproc
Successfully installed google-cloud-dataproc-5.4.0 grpc-google-iam-v1-0.12.6


In [7]:
!gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=https%3A%2F%2Fsdk.cloud.google.com%2Fauthcode.html&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=wUgeFY89KhLBI6UCn4OxTCIhmpQ1i6&prompt=consent&access_type=offline&code_challenge=dMOPL6GmJPnfe4RZYoh-dSv3Iye6zEpxtiFWFDn7SkM&code_challenge_method=S256

Enter authorization code: 4/0AWtgzh4pqs-SLrtpJ1JDciytNO1EA8jMxgTj_Sr4NIIuUu9fc8-Ilk51yzVM0GzAVUhPtg

You are now logged in as [64andyuni@gmail.com].
Your current project is [data301-2023-amh284].  You can change this setting by running:
  $ gcloud config set project P

In [8]:
!gcloud config set project $PROJECT


Updated property [core/project].


In [9]:
!gcloud services enable dataproc.googleapis.com cloudresourcemanager.googleapis.com


Operation "operations/acat.p2-782458903224-e9755497-bc29-457f-88d4-62ea1717cc32" finished successfully.


In [10]:
!gsutil mb -c regional -l $REGION -p $PROJECT gs://$BUCKET

Creating gs://data301-2023-amh284-lab3-bucket/...
ServiceException: 409 A Cloud Storage bucket named 'data301-2023-amh284-lab3-bucket' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


In [11]:
!gcloud storage cp ./browsing.txt gs://$BUCKET

Copying file://./browsing.txt to gs://data301-2023-amh284-lab3-bucket/browsing.txt


## Step 3.4
Run the cluster create/execute/delete code.

**NOTE**: This may take 5-10 minutes.

In [12]:
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
--master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 \
--image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2

ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Multiple validation errors:
 - Insufficient 'CPUS' quota. Requested 34.0, available 24.0.
 - Insufficient 'CPUS_ALL_REGIONS' quota. Requested 34.0, available 32.0.


In [13]:
!gcloud dataproc jobs submit pyspark --cluster=$CLUSTER --region=$REGION pyspark_apriori.py -- gs://$BUCKET/browsing.txt

ERROR: (gcloud.dataproc.jobs.submit.pyspark) NOT_FOUND: Not found: Cluster projects/data301-2023-amh284/regions/australia-southeast1/clusters/data301-2023-amh284-lab3-cluster


In [14]:
!gcloud dataproc clusters delete $CLUSTER --region=$REGION --quiet

ERROR: (gcloud.dataproc.clusters.delete) NOT_FOUND: Not found: Cluster projects/data301-2023-amh284/regions/australia-southeast1/clusters/data301-2023-amh284-lab3-cluster
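
In the run above, cluster creation failed, so the submit and delete cells could only report `NOT_FOUND`. One way to make the three steps depend on each other is to drive them from Python: stop if creation fails, and always attempt deletion once a cluster exists. This is a sketch only; it reuses the exact `gcloud` flags from the cells above, while the function names `build_commands` and `run_lab` and the injectable `run` parameter are hypothetical.

```python
import subprocess


def build_commands(cluster, region, zone, bucket,
                   script="pyspark_apriori.py", data="browsing.txt"):
    """Return the create, submit and delete commands as argument lists."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--bucket={bucket}", f"--zone={zone}",
              "--master-machine-type=n1-standard-2",
              "--worker-machine-type=n1-standard-2",
              "--image-version=1.5", "--max-age=30m",
              "--num-masters=1", "--num-workers=2"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark",
              f"--cluster={cluster}", f"--region={region}",
              script, "--", f"gs://{bucket}/{data}"]
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return create, submit, delete


def run_lab(cluster, region, zone, bucket, run=subprocess.run):
    """Create the cluster, run the job, and delete the cluster afterwards."""
    create, submit, delete = build_commands(cluster, region, zone, bucket)
    # check=True raises if creation fails (e.g. on a quota error),
    # so the job is never submitted to a cluster that does not exist.
    run(create, check=True)
    try:
        run(submit, check=True)
    finally:
        # Delete even if the job failed, so the cluster stops costing money.
        run(delete, check=True)
```

Calling `run_lab(CLUSTER, REGION, ZONE, BUCKET)` with the values from Step 3.2 would replace the three separate cells; passing a different `run` callable makes it possible to dry-run the sequence without touching Google Cloud.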


## Steps 3.5, 3.6, and 4.1, 4.2
Refer to the lab document for these steps.