<img src="http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png" align=left>
<img src="http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png" align=left>

# **Spark Tutorial: Learning Apache Spark**

*Spark, like other big data tools, is powerful, capable, and well-suited to tackling a range of data challenges. Spark, like other big data technologies, is not necessarily the best choice for every data processing task.*

## 0. Overview
 
**0.1** Apache Spark is a fast and general engine for large-scale data processing:

* Supports cyclic data flow and in-memory computing.
* Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
* Offers over 80 high-level operators that make it easy to build parallel apps.

![spark](http://spark.apache.org/images/logistic-regression.png)

**0.2** Spark libraries include:

* [SQL and DataFrames](http://spark.apache.org/sql/) 
* [ML and MLlib](http://spark.apache.org/mllib/) (machine learning)
* [GraphX](http://spark.apache.org/graphx/) (graph) 
* [Spark Streaming](http://spark.apache.org/streaming/)
* [SparkR (R on Spark)](http://spark.apache.org/docs/latest/sparkr.html)
* and many third party libraries

<img src="https://www.mapr.com/ebooks/spark/images/spark-stack-diagram.png" width="60%">

You can combine these libraries seamlessly in the same application.

**0.3** Spark runs everywhere:

* Standalone cluster mode
* AWS EC2
* Hadoop YARN
* Apache Mesos

It can also access diverse data sources, including HDFS, Cassandra, HBase, and S3.

**0.4** This lecture gives you a brief introduction to using Spark (with Jupyter Notebook). During this lecture we will cover:

1. Initializing Spark
2. RDDs, Transformations and Actions
3. Working with Key-Value Pairs
4. Performance & Optimization

In [1]:
# Run this cell to setup data path
import os

datapath = os.getcwd()
if datapath.find('databricks') != -1:
    # Read AWS credentials from the environment instead of hard-coding them
    ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
    SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
    AWS_BUCKET_NAME = "nycdsabootcamp"
    datapath = "s3a://%s:%s@%s/" %(ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME)

## 1. Initializing Spark

Every Spark application consists of a **driver program** that launches various parallel operations on executor Java Virtual Machines (JVMs) running either in a cluster or locally on the same machine. When running locally, `PySparkShell` is the driver program. 

Driver programs access Spark through a `SparkContext` object, which represents a connection to a computing cluster. A `SparkContext` object is the main entry point for Spark functionality.


<img src="https://www.mapr.com/ebooks/spark/images/streaming-driver.png" width="60%">


In the PySpark shell, a special `SparkContext` is already created for you, in the variable called `sc`.

Now run the following cell to make sure that Spark runs correctly with your notebook:

In [2]:
# Run the following command and check the output
print "sc type: ", type(sc)
print "Driver Program name: ", sc.appName
print "Spark version: ", sc.version

sc type:  <class 'pyspark.context.SparkContext'>
Driver Program name:  PySparkShell
Spark version:  2.0.0


## 2. RDDs, Transformations and Actions

###  2.1 Resilient Distributed Datasets (RDDs)

The most fundamental Spark data structure is the *resilient distributed dataset* (RDD), a fault-tolerant collection of elements that can be operated on in parallel.

**Creating an RDD**

There are two basic ways to create an RDD:

* `sc.parallelize(c) ` - parallelizing an existing collection in your driver program.
* `sc.textFile(path)` - referencing a dataset in a storage source supported by Hadoop, including your local file system, HDFS, HBase, Amazon S3, etc.

**count()** and **take(n)**

Once an RDD has been created, we can use `RDD.count()` to check the number of elements and `RDD.take(n)` to return the first n elements as a regular Python list.

In [3]:
# xrange(100) creates a lazy Python sequence that generates the numbers from 0 to 99

numRDD = sc.parallelize(xrange(100))
print "The length of numRDD is: ", numRDD.count()
print "The first 5 elements are: ", numRDD.take(5)

The length of numRDD is:  100
The first 5 elements are:  [0, 1, 2, 3, 4]


In [4]:
# The README.md file contains the summary of spark

filepath = os.path.join(datapath, "pyspark_1/README.md")

textRDD = sc.textFile(filepath)
print "The length of textRDD is: ", textRDD.count()
print "The first 5 elements are: ", textRDD.take(5)

The length of textRDD is:  99
The first 5 elements are:  [u'# Apache Spark', u'', u'Spark is a fast and general cluster computing system for Big Data. It provides', u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', u'supports general computation graphs for data analysis. It also supports a']


**collect()**

To fetch the entire RDD to the driver node as a python list, we can use `RDD.collect()`. 

*Note*: this can cause the driver to run out of memory if the RDD is too big to fit in the memory of the driver node.

In [5]:
numList = numRDD.collect()
print numList

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


If the data source is a folder containing many small files and you want to keep each file as a separate record, use another method called `wholeTextFiles`.

`sc.wholeTextFiles(path)` reads a directory of text files. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.

In [6]:
# `sent_items_lay_k/` is a directory that contains emails sent by 
# Kenneth Lay, the CEO and chairman of Enron Corporation and one of the key 
# figures in the Enron scandal.

dirpath = os.path.join(datapath, "./pyspark_1/sent_items_lay_k/")

fileRDD = sc.wholeTextFiles(dirpath)
print "Total number of files:\t", fileRDD.count()

# The first element in fileRDD contains one entire file 
fileName, content = fileRDD.take(1)[0]
print "File path:\n\t", fileName
print "Content:\n", content

Total number of files:	13
File path:
	file:/Users/shuyan/Workspace/Supstat_NYC_office/bootcamp_slides/pyspark/pyspark_1/sent_items_lay_k/1.
Content:
Message-ID: <30661811.1075845189364.JavaMail.evans@thyme>
Date: Wed, 30 May 2001 13:00:47 -0700 (PDT)
From: kenneth.lay@enron.com
To: tom.acton@enron.com, janie.aguayo@enron.com, amelia.alland@enron.com, 
	lauri.allen@enron.com, aric.archie@enron.com, karl.atkins@enron.com, 
	a..austin@enron.com, henry.batiste@enron.com, 
	rusty.belflower@enron.com, daxa.bhavsar@enron.com, 
	michael.bilberry@enron.com, brad.blevins@enron.com, 
	debbie.boudar@enron.com, greg.brazaitis@enron.com, 
	willie.brooks@enron.com, rosa.brown@enron.com, 
	jerry.bubert@enron.com, esther.buckley@enron.com, 
	candy.bywaters@enron.com, bob.camp@enron.com, howard.camp@enron.com, 
	molly.carriere@enron.com, clem.cernosek@enron.com, 
	nick.cocavessis@enron.com, jane.coleman@enron.com, 
	mary.comello@enron.com, robert.cook@enron.com, 
	william.cosby@enron.com, paul.couvillon

### Exercise 1

*1.1* Create an RDD of 1000 (or more) random numbers drawn from the uniform distribution on the interval [0, 1), and call it `unifRDD`. Check your RDD by applying `.take(5)` to it. You may want to use Python's `random` module to generate the numbers.

*1.2* (Optional) Create a histogram with any plotting function you like to confirm that the numbers in `unifRDD` follow a uniform distribution. *Note*: PySpark doesn't support plotting (yet), so you'll need to `collect()` the data into a Python list.

In [7]:
# Your code goes here
from random import random
import matplotlib.pyplot as plt

# 1.1
unifRDD = sc.parallelize([random() for i in xrange(1000)])
print unifRDD.take(5)

# 1.2
fig = plt.figure(figsize=(8,4))
plt.hist(unifRDD.collect(), 20, alpha = .4)
plt.show()
# uncomment the following line if you're using databricks
# display(fig) 

[0.1420716224881987, 0.9912027201017164, 0.018663027347752825, 0.19744660021743, 0.7923456800763063]


### 2.2 RDD Operations

RDDs support two types of operations:

* **transformation** - create a new dataset from an existing one. For example:
    * `RDD.map(f)` - Return a new RDD by applying *f* to each element of this RDD.
    * `RDD.filter(f)` - Return a new RDD containing only the elements that satisfy a predicate.
* **action** - return a value to the driver program after running a computation on the dataset. For example:
    * `RDD.take(n)` - Take the first n elements of the RDD.
    * `RDD.count()` - Return the number of elements in this RDD.
    * `RDD.collect()` - Return a list that contains all of the elements in this RDD. 
    * `RDD.reduce(f)` - Reduces the elements of this RDD using the specified commutative and associative binary operator.

**.map(f)** 

`RDD.map(f)` returns a new RDD by applying a function to each element of this RDD.

For example, we can apply the `math.sqrt()` function to each number in `numRDD` using the `RDD.map()` transformation:

In [8]:
# Create a new RDD with the square root of each element in numRDD

import math 

numRDDSqrt = numRDD.map(math.sqrt)
print numRDDSqrt.take(10)

[0.0, 1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.23606797749979, 2.449489742783178, 2.6457513110645907, 2.8284271247461903, 3.0]


In [9]:
# Create a new RDD that contains the string length of each element in textRDD

textRDDlen = textRDD.map(len)
print textRDDlen.collect()

[14, 0, 78, 75, 73, 74, 56, 42, 0, 26, 0, 0, 23, 0, 68, 76, 70, 56, 0, 17, 0, 62, 45, 0, 39, 0, 67, 0, 195, 66, 76, 151, 119, 0, 26, 0, 64, 0, 21, 0, 52, 0, 44, 0, 27, 0, 66, 0, 17, 0, 61, 0, 43, 0, 19, 0, 74, 74, 0, 29, 0, 32, 0, 75, 62, 41, 73, 72, 22, 0, 54, 0, 69, 0, 16, 0, 84, 17, 0, 19, 0, 33, 120, 0, 31, 0, 77, 76, 77, 0, 42, 120, 84, 65, 0, 16, 0, 97, 70]


**.reduce(f)**

`RDD.reduce(f)` reduces the elements of this RDD using the specified *commutative and associative binary operator*.

In [10]:
# Find the sum of the numbers in numRDDSqrt

from operator import add

sqrtSum = numRDDSqrt.reduce(add)
print sqrtSum

661.462947103


In [11]:
# Find the largest number in textRDDlen

maxLine = textRDDlen.reduce(max)
print maxLine

195
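
The requirement that the operator be commutative and associative matters because each partition is reduced on its own before the partial results are combined. The following is a minimal plain-Python sketch of that idea, not Spark's actual implementation; the helper `partitioned_reduce` and its chunking scheme are hypothetical and exist only for illustration.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(data, num_partitions, op):
    """Reduce each contiguous chunk locally, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(op, chunk) for chunk in chunks if chunk]
    return reduce(op, partials)

data = list(range(1, 9))  # the numbers 1..8

# Addition is associative and commutative: the partitioning does not matter.
sum_2 = partitioned_reduce(data, 2, add)
sum_4 = partitioned_reduce(data, 4, add)

# Subtraction is neither: the answer changes with the number of partitions.
diff_2 = partitioned_reduce(data, 2, sub)
diff_4 = partitioned_reduce(data, 4, sub)
```

Here `sum_2` and `sum_4` agree, while `diff_2` and `diff_4` differ, which is why an operator such as subtraction should not be passed to `RDD.reduce(f)`.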


### 2.3 Passing Functions to Spark

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. We can define functions using python basic syntax:

* Lambda expressions, for simple functions that can be written as an expression.
* Functions defined using `def` for complex functions.

For example, to find the number of words in each line, we can define a function with `def` and pass it to `map`:



In [12]:
def numOfWords(line):
    words = line.split()
    return len(words)

textRDDwordlen = textRDD.map(numOfWords)
print textRDDwordlen.collect()

[3, 0, 14, 13, 11, 12, 8, 6, 0, 1, 0, 0, 3, 0, 10, 6, 3, 8, 0, 3, 0, 6, 8, 0, 4, 0, 13, 0, 22, 10, 2, 8, 2, 0, 4, 0, 12, 0, 1, 0, 8, 0, 4, 0, 4, 0, 11, 0, 1, 0, 10, 0, 2, 0, 3, 0, 11, 11, 0, 2, 0, 6, 0, 12, 12, 9, 13, 14, 3, 0, 3, 0, 13, 0, 3, 0, 10, 4, 0, 1, 0, 7, 8, 0, 6, 0, 13, 11, 13, 0, 7, 4, 12, 8, 0, 2, 0, 6, 12]


**.filter(f)**

`RDD.filter(f)` returns a new RDD containing only the elements that satisfy a condition.

In [13]:
# Jeff Skilling is the former CEO of Enron Corporation and another key figure in the Enron scandal.
# Find all the emails that mention Jeff Skilling,
# i.e., elements of fileRDD whose content contains the string "Jeff Skilling"

fileRDDjs = fileRDD.filter(lambda kv: kv[1].find("Jeff Skilling") > -1)

print "Total number of files: ", fileRDDjs.count()

# Print the path of each matching file
for sentfile in fileRDDjs.collect():
    print sentfile[0]

Total number of files:  2
file:/Users/shuyan/Workspace/Supstat_NYC_office/bootcamp_slides/pyspark/pyspark_1/sent_items_lay_k/10.
file:/Users/shuyan/Workspace/Supstat_NYC_office/bootcamp_slides/pyspark/pyspark_1/sent_items_lay_k/11.


### 2.4 Chaining Together Transformations and Actions

Here are some key points about Spark that you need to remember:

* An RDD is immutable, so once it is created, it cannot be changed. 
* Regular RDDs don't support random access (some third-party packages allow users to create indexed RDDs that support random access).
* Each transformation creates a new RDD. 
* Spark uses lazy evaluation, so transformations are not actually executed until an action occurs.
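
Lazy evaluation can be illustrated without a cluster. The sketch below uses plain Python generators as a rough analogy: building the pipeline does no work, and values are computed only when something consumes them. It is an analogy only; Spark records a lineage of transformations rather than using generators, and the names `traced_square` and `log` are made up for this example.

```python
log = []

def traced_square(x):
    log.append(x)  # record every call so we can see when work happens
    return x * x

# Building the pipeline runs nothing, much like chaining transformations.
squares = (traced_square(x) for x in range(1000))
evens = (s for s in squares if s % 2 == 0)
calls_before = len(log)

# Consuming values plays the role of an action such as take(3).
first_three = [next(evens) for _ in range(3)]
calls_after = len(log)
```

No call to `traced_square` happens until the final step, and only as many inputs are processed as are needed to produce three results.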


To perform multiple transformations we can chain them together using dot notation.

Now let's calculate the sum of all the multiples of 3 or 5 below 1000 by chaining an RDD with a `filter` and a `reduce`.

In [14]:
sc.parallelize(xrange(1000)).filter(lambda x: x % 3 == 0 or x % 5 == 0).reduce(add)

233168

### Exercise 2

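*2.1* Calculate the variance of `unifRDD` using suitable transformations and actions, and compare your result with the built-in method `RDD.variance()`. The formula below divides by $N$ (the population variance), which is what `RDD.variance()` computes; `RDD.sampleVariance()` divides by $N - 1$ instead. The formula is: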

$s^2 = \frac{1}{N}\sum\limits_{i=1}^N(x_i - \bar{x})^2$

*2.2* Use Monte Carlo method to estimate Pi. Here are the suggested steps:

* Create an RDD with each element a pair of uniformly distributed random numbers within [0, 1), call it `pointRDD`.
* Create another RDD by filtering `pointRDD` so that only the points that fall inside the circle of radius 1 centered at the origin remain, call it `pointInCircle`.
* $\pi$ can be estimated by `4 * pointInCircle.count() / pointRDD.count()`. Don't forget to convert `int` to `float` before dividing.

You can chain multiple transformations and actions together if you feel comfortable.
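
For reference, the factor of 4 in the estimate can be justified as follows. The points are uniform on the unit square $[0, 1)^2$, and the part of the unit circle that lies inside that square is a quarter disc of area $\pi/4$. Writing $N_{\text{in}}$ for the number of points inside the circle (a symbol introduced here for illustration):

$$
P(x^2 + y^2 < 1) = \frac{\text{area of quarter disc}}{\text{area of unit square}} = \frac{\pi/4}{1}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{N_{\text{in}}}{N}
$$

The approximation improves as $N$ grows, since the observed fraction $N_{\text{in}}/N$ converges to the probability on the left.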

In [15]:
# Your code goes here

from operator import add
from random import random

# 2.1
m = sc.broadcast(unifRDD.mean())
print "Variance (UDF)", unifRDD.map(lambda x: (x - m.value)**2).reduce(add) / (unifRDD.count())
print "Variance (built-in function)", unifRDD.variance()

# 2.2
N = 10000000

def inCircle(point):
    x, y = point
    return 1 if x ** 2 + y ** 2 < 1 else 0

pi = 4.0 * sc.parallelize([(random(), random()) for x in range(N)]).map(inCircle).reduce(add) / N

print "Pi is approximately: ", pi

Variance (UDF) 0.0855947561078
Variance (built-in function) 0.0855947561078
Pi is approximately:  3.14193


### 2.5 More Operations 

Let's investigate some common operations:
* Actions:
    * `takeOrdered(num, key=None)` - get the first `num` elements of an RDD in ascending order, or in the order specified by the optional key function.
    * `top(num, key=None)` - get the `num` largest elements of an RDD, in descending order.
    * `takeSample(withReplacement, num, seed=None)` - return a fixed-size sampled subset of this RDD.
    * `countByValue()` - return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
* Transformations:
    * `flatMap(f)` - return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
    * `sample(withReplacement, fraction, seed=None)` - return a sampled subset of this RDD.

In [16]:
from random import randint

rintRDD = sc.parallelize([randint(0, 10) for x in range(10)])

print "takeOrdered - ", rintRDD.takeOrdered(5)
print "top - ", rintRDD.top(5)
print "takeSample - ", rintRDD.takeSample(True, 5, 1)
print "countByValue - ", rintRDD.countByValue()

takeOrdered -  [0, 0, 2, 4, 5]
top -  [10, 8, 6, 5, 5]
takeSample -  [5, 4, 8, 0, 5]
countByValue -  defaultdict(<type 'int'>, {0: 2, 2: 1, 4: 1, 5: 3, 6: 1, 8: 1, 10: 1})


In [17]:
wordRDD = textRDD.flatMap(lambda line: line.split())
print "flatMap - ", wordRDD.take(10)
print "sample - ", wordRDD.sample(False, 0.01, 1).collect()

flatMap -  [u'#', u'Apache', u'Spark', u'Spark', u'is', u'a', u'fast', u'and', u'general', u'cluster']
sample -  [u'Spark', u'guide,', u'using', u'of', u'Hadoop-supported', u'on']
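
To make the difference between `map` and `flatMap` concrete, here is a small plain-Python analogy that runs without Spark. The three sample lines are invented for the example; list comprehensions and `itertools.chain` stand in for the two transformations.

```python
from itertools import chain

lines = ["# Apache Spark", "", "Spark is fast"]

# map: exactly one output element per input element (here, one list per line)
mapped = [line.split() for line in lines]

# flatMap: apply the same function, then flatten the results by one level
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
```

`mapped` keeps one (possibly empty) list per line, while `flat_mapped` is a single flat list of words, which is the shape `wordRDD` has above.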


### Exercise 3

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.

Find the largest palindrome made from the product of two 3-digit numbers.

In [18]:
# Your code goes here

(sc.parallelize(xrange(100, 1000))
 .flatMap(lambda x: [x * y for y in range(x, 1000)])
 .filter(lambda x: str(x) == str(x)[::-1])
 .reduce(max))

906609

## 3. Working with Key-Value Pairs

### 3.1 Pair RDDs

Pair RDDs are a special type of RDD in which each element consists of a key/value pair.

Pair RDDs are commonly used to perform aggregations, and often we will do some initial ETL (extract, transform, and load) to get our data into a key/value format. 

To create a pair RDD, we can either apply `parallelize` on a list of tuples or use `map` to generate key-value pairs. 

In [19]:
# parallelize on a list of tuples
pairRDD1 = sc.parallelize([('a', 1), ('a', 2), ('b', 1)])
print pairRDD1.collect()

# generate key-value pairs using map
pairRDD2 = numRDD.map(lambda x: (x % 2, x))
print pairRDD2.take(5)

[('a', 1), ('a', 2), ('b', 1)]
[(0, 0), (1, 1), (0, 2), (1, 3), (0, 4)]


### 3.2 Operations on pair RDDs

Besides common operations, pair RDDs also expose many new operations. Two commonly used transformations on pair RDDs are `.groupByKey()` and `.reduceByKey()`.

* `.groupByKey()` gathers all the values that share a key into a single iterable. It performs no aggregation itself, so every pair is shuffled across the network before any function can be applied to the grouped values.
* `.reduceByKey(f)` merges the values for each key two at a time using an associative and commutative function, applying it first within each partition on a per-key basis and then across the partitions.

While both the `.groupByKey()` and `.reduceByKey()` transformations can often be used to solve the same problem and will produce the same answer, the `.reduceByKey()` transformation works much better for large distributed datasets.
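
The difference in shuffle volume can be sketched in plain Python. The code below is a simplified simulation under stated assumptions, not how Spark is implemented: `partitions` is a made-up two-partition layout keyed by parity, and the helpers `group_by_key` and `reduce_by_key` are hypothetical stand-ins that also report how many records would cross the network.

```python
from collections import defaultdict
from operator import add

# Two hypothetical partitions of (key, value) pairs, keyed by parity
partitions = [
    [(0, 0), (1, 1), (0, 2), (1, 3)],
    [(0, 4), (1, 5), (0, 6), (1, 7)],
]

def group_by_key(parts):
    """Ship every pair, then aggregate: the shuffle carries all records."""
    shuffled = [pair for part in parts for pair in part]
    groups = defaultdict(list)
    for k, v in shuffled:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)

def reduce_by_key(parts, op):
    """Combine within each partition first, then merge the partial results."""
    shuffled = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = op(local[k], v) if k in local else v
        shuffled.extend(local.items())
    merged = {}
    for k, v in shuffled:
        merged[k] = op(merged[k], v) if k in merged else v
    return merged, len(shuffled)

grouped_sums, grouped_records = group_by_key(partitions)
reduced_sums, reduced_records = reduce_by_key(partitions, add)
```

Both approaches produce the same sums, but the simulated shuffle moves 8 records in the first case and only 4 (one per key per partition) in the second. With many values per key, that gap grows accordingly.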

As an example, let's see how we can sum values by key using the two transformations:

In [20]:
# mapValues is used to convert iterable object to list for printing
print pairRDD2.groupByKey().mapValues(list).collect()
 
# mapValues sums the values grouped under each key
print "\ngroupByKey - ", pairRDD2.groupByKey().mapValues(sum).collect()
# pass add function into reduceByKey to perform summation
print "reduceByKey -", pairRDD2.reduceByKey(add).collect()

[(0, [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]), (1, [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99])]

groupByKey -  [(0, 2450), (1, 2500)]
reduceByKey - [(0, 2450), (1, 2500)]


### Exercise 4

*4.1* Create a pair RDD using `unifRDD` by mapping the numbers into 5 evenly spaced intervals.

*4.2* Count the number of items for each key. You may find the action [`countByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.countByKey) useful.

*4.3* Calculate the mean and standard deviation of the elements in each interval.

In [21]:
# Your code goes here

unifRDD = sc.parallelize([random() for i in xrange(1000)])

# 4.1
def getInterval(num):
    return (int(num / .2), num)

unifPairRDD = unifRDD.map(getInterval)

# 4.2
print unifPairRDD.countByKey().items()

# 4.3
def stat(v):
    n = len(v)
    mean = sum(v)/n
    sd = (sum([(x - mean)**2 for x in v])/n)**.5
    return (mean, sd)

print unifPairRDD.groupByKey().mapValues(stat).collect()

[(0, 205), (1, 212), (2, 183), (3, 208), (4, 192)]
[(0, (0.10524500888264254, 0.05537537601805199)), (1, (0.3001178866330926, 0.057997358554036385)), (2, (0.5017401346072036, 0.05663129890821499)), (3, (0.7065108759867413, 0.0591974422875698)), (4, (0.904742886595292, 0.058376883471220664))]


## 4. Performance & Optimization

### 4.1 RDD Persistence

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm. You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it.

See the [RDD Persistence section of the programming guide](http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence) for more details about persistence and storage levels.
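
As a minimal sketch of the pattern, the helper below caches an RDD before running two actions on it, so the second action can reuse the stored partitions instead of recomputing them. The function name `count_and_sum` is hypothetical; it relies only on the `cache()`, `count()`, `sum()`, and `unpersist()` methods of an RDD and assumes an RDD of numbers.

```python
def count_and_sum(rdd):
    """Run two actions on the same RDD, caching it so it is computed once."""
    rdd.cache()          # mark for in-memory storage; nothing runs yet
    n = rdd.count()      # first action computes the RDD and fills the cache
    total = rdd.sum()    # second action can reuse the cached partitions
    rdd.unpersist()      # release the cached blocks when finished
    return n, total
```

In this notebook it could be called as `count_and_sum(numRDDSqrt)`. For a dataset this small the benefit is negligible; caching pays off when the RDD is expensive to compute and is used by several actions.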

###  4.2 Partitions

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the dataset. 

When creating an RDD using `parallelize`, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize.

The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (HDFS blocks are 128 MB by default in Hadoop 2 and later, 64 MB in Hadoop 1), but you can ask for more partitions by passing a larger value. You cannot have fewer partitions than blocks.

Here are some methods that give partition information about an RDD:
* `RDD.getNumPartitions()` - returns the number of partitions in RDD
* `RDD.glom()` - transforms an RDD to a new RDD with elements within each partition coalesced into a list.

In [22]:
print "numRDD num of partitions: ", numRDD.getNumPartitions()
print "textRDD num of partitions: ", textRDD.getNumPartitions() 

numRDD num of partitions:  8
textRDD num of partitions:  2
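
To see how the elements are spread across partitions, `glom()` can be combined with `map` and `collect`. The helper below is a small sketch; the name `partition_sizes` is made up for this example, and it uses only the `glom()`, `map()`, and `collect()` methods of an RDD.

```python
def partition_sizes(rdd):
    """Return the number of elements held in each partition of an RDD."""
    # glom() turns each partition into one list; len then counts its elements
    return rdd.glom().map(len).collect()
```

Calling `partition_sizes(numRDD)` in this notebook would return one count per partition. A very uneven list of counts is a hint that the data is skewed and that some tasks will do much more work than others.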


### 4.3 Shared Variables 

**Broadcast Variables**

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Broadcast variables must fit in memory on a single machine, so they should not be extremely large, such as a massive table or vector. They are also read-only: once broadcast, the value should not be modified, which ensures that every node sees the same data. This may seem inconvenient, but it suits their use case. 

The following code shows how to use a broadcast variable to send values to an RDD.

In [23]:
broadcastVar = sc.broadcast([1, 2, 3])
bcRDD = sc.parallelize(range(5)).map(lambda x: [x * b for b in broadcastVar.value])
bcRDD.collect()

[[0, 0, 0], [1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]

**Accumulators**

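Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Tasks running on the cluster can add to an accumulator but cannot read its value; only the driver program can read it through `.value`. Updates are most reliable inside an action such as `.foreach(f)`, which applies a function to every element of the RDD: Spark guarantees that each task's update is applied exactly once for updates performed inside actions, whereas updates made inside transformations may be re-applied if a task is re-executed.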

The following example shows how to calculate $\pi$ using an accumulator.

In [24]:
from random import random

N = 1000000
accum = sc.accumulator(0)

def inCircle(point):
    x, y = point
    return 1 if x ** 2 + y ** 2 < 1 else 0

sc.parallelize([(random(), random()) for x in range(N)]).foreach(lambda x: accum.add(inCircle(x)))
print "Pi is approximately: ", accum.value * 4.0 / N

Pi is approximately:  3.140732
