## Academic exercise for study

### Install pyspark and customize Colab configuration

In [1]:
# Python interface to Spark
!pip install pyspark --quiet
# Install or upgrade the PyDrive library, used to interact with Google Drive from Python
!pip install -U --quiet PyDrive
# Install OpenJDK 8 (Spark runs on the JVM); -y answers the confirmation prompt automatically
!apt install -y openjdk-8-jdk-headless &> /dev/null
# Download the ngrok zip file to access the local server over the internet
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip &> /dev/null
# Unzip the ngrok zip file
!unzip ngrok-stable-linux-amd64.zip &> /dev/null
# Start ngrok in the background, tunneling HTTP traffic to port 4050 (the port used for the Spark UI)
get_ipython().system_raw('./ngrok http 4050 &')
# Import the Python os module
import os
# Sets the JAVA_HOME environment variable to the location of Java
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"


### Initialize Spark

In [2]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from random import random
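# Run Spark locally with 2 worker threads and serve the Spark UI on port 4050
conf = SparkConf().set('spark.ui.port', '4050').setAppName("pi").setMaster("local[2]")
# Note: despite its name, sc is a SparkSession; the SparkContext is available as sc.sparkContext
sc = SparkSession.builder.config(conf=conf).getOrCreate()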

**The code below uses PySpark to run a distributed computation that estimates the value of pi with the Monte Carlo method.**
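
As a minimal, Spark-free sketch of the same idea, the snippet below draws points uniformly in the unit square and counts how many fall inside the quarter circle of radius 1. That region has area pi/4, so four times the hit fraction approximates pi. The function name `estimate_pi` and the fixed seed are illustrative assumptions, not part of the notebook.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # Count the point if it lies inside the quarter circle x^2 + y^2 < 1
        if x * x + y * y < 1:
            hits += 1
    # hits / num_samples approximates pi/4, so scale by 4
    return 4.0 * hits / num_samples


print("Pi is roughly %.4f" % estimate_pi(100_000))
```

The Spark version in the next cell performs the same counting, with the loop replaced by `map` and the running total replaced by `reduce`.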

In [3]:
"""
sample(p) draws one random point for the Monte Carlo estimate.
The parameter p is the item index supplied by map; it is not used.
The function draws x and y uniformly from [0, 1), so the point lies in the unit square.
It returns 1 if the point falls inside the quarter of the unit circle in that square (x^2 + y^2 < 1), otherwise 0.
"""
def sample(p):
    x,y = random(), random()
    return 1 if x*x + y*y < 1 else 0
# This line defines the number of samples (points) to be generated for the Monte Carlo estimation of pi.
NUM_SAMPLES = 1000*1000*100
"""
This line parallelizes the integers 0 to NUM_SAMPLES - 1 using Spark's parallelize method.
The second argument, 50, distributes the numbers across 50 partitions for parallel processing.
The resulting RDD (Resilient Distributed Dataset) is assigned to the variable items.
"""
items = sc.sparkContext.parallelize(range(0, NUM_SAMPLES), 50)
"""
This line applies the sample function to each item in the items RDD using the map transformation.
The resulting RDD contains 1s and 0s indicating whether each point falls within the unit circle.
The reduce action is then used to sum up all the 1s and 0s to get the total count.
The final count is assigned to the variable count.
"""
count = items.map(sample).reduce(lambda a, b: a + b)
# Print the estimate: the fraction of points inside the quarter circle approximates pi/4,
# so multiplying count / NUM_SAMPLES by 4 gives an estimate of pi.
print("Pi is roughly %.9f" % (4.0 * count / NUM_SAMPLES))
# This line prints the default parallelism level, which indicates the number of partitions used by default when parallelizing data.
print("Default parallelism: {}".format(sc.sparkContext.defaultParallelism))
# This line prints the number of partitions in the items RDD.
print("Number of partitions: {}".format(items.getNumPartitions()))
# This line prints the partitioner of the items RDD, which determines how keyed data is assigned to partitions.
# It is None here because the RDD was not partitioned by key.
print("Partitioner: {}".format(items.partitioner))

Pi is roughly 3.139906000
Default parallelism: 2
Number of partitions: 50
Partitioner: None
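
To judge how close such an estimate should be, it helps to know the statistical scale of the error. Each sample is a Bernoulli trial with success probability pi/4, so for independent samples the standard error of the estimate is 4 * sqrt(p * (1 - p) / N). The sketch below computes that value; the helper name `pi_estimate_std_error` is hypothetical and the formula assumes independent, uniformly distributed samples.

```python
import math


def pi_estimate_std_error(num_samples):
    """Standard error of the Monte Carlo pi estimate, assuming independent samples."""
    p = math.pi / 4  # probability that a point lands inside the quarter circle
    return 4.0 * math.sqrt(p * (1.0 - p) / num_samples)


NUM_SAMPLES = 1000 * 1000 * 100
print("Standard error for %d samples: %.2e" % (NUM_SAMPLES, pi_estimate_std_error(NUM_SAMPLES)))
```

The run above printed 3.139906, which is about 1.7e-3 away from `math.pi`, so comparing that gap with this scale is a quick sanity check on the sampling.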

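The partition count printed above comes from the second argument to `parallelize`. To get a feel for what 50 partitions means for 100 million indices, the following sketch cuts a range into near-equal contiguous chunks. `split_range` is a hypothetical helper written for illustration; it mirrors the general idea of even slicing and is not Spark's internal code.

```python
def split_range(n, num_partitions):
    """Return (start, stop) bounds that cut range(n) into near-equal contiguous chunks."""
    bounds = []
    for i in range(num_partitions):
        start = i * n // num_partitions
        stop = (i + 1) * n // num_partitions
        bounds.append((start, stop))
    return bounds


chunks = split_range(1000 * 1000 * 100, 50)
# Each chunk covers 2,000,000 consecutive indices in this case
print(len(chunks), chunks[0], chunks[-1])
```

With only 2 local worker threads, as configured by `local[2]`, at most two of these chunks would be processed at the same time; the remaining ones wait their turn.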