# Spark Lab Assignment

## Instructions 
The purpose of this assignment is to develop basic Spark development skills. The assignment consists of two mandatory tasks and one challenge. Only PhD students are required to complete the challenge. However, completing it will also improve the final grade of Master's students.

**Warning:** all of the tasks **must** be solved using the Spark RDD API, so that the computations are distributed across the Spark workers.

## Task 1: DNA G+C percentage
DNA is a molecule that carries the genetic information used in the growth, development, functioning and reproduction of all known living organisms. This information is encoded in a language of 4 bases: cytosine (C), guanine (G), adenine (A) and thymine (T). The percentage of G+C bases in a DNA sequence has important biological meaning (Wikipedia: https://goo.gl/kCLvDp), hence it is important to be able to compute it for long DNA sequences.

### Task
Given an input DNA sequence stored as a text file, `data/dna.txt`, compute the percentage of `g` + `c` occurrences in it. An example follows:

**Input file:**
```
atcg
ccgg
ttat
```
**Result:**
$$\frac{C_{count} + G_{count}}{C_{count} + G_{count} + A_{count} + T_{count}} = \frac{6}{12} = 0.5$$
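
The arithmetic above can be checked locally with a minimal plain-Scala sketch. It deliberately does not use Spark, so it does not satisfy the RDD requirement of this assignment; the names `GcContent` and `gcFraction` are hypothetical and chosen only for illustration, and ignoring characters other than `a`, `c`, `g`, `t` is an assumption rather than something the task states.

```scala
object GcContent {
  // Bases that count towards the denominator (assumption: anything else is ignored).
  private val validBases = Set('a', 'c', 'g', 't')

  // Fraction of g/c bases among all valid bases in the given lines.
  def gcFraction(lines: Seq[String]): Double = {
    // One pair per line: (number of g/c bases, number of valid bases).
    val perLine = lines.map { line =>
      val bases = line.toLowerCase.filter(validBases)
      (bases.count(ch => ch == 'g' || ch == 'c'), bases.length)
    }
    // Sum the per-line pairs component-wise.
    val (gc, total) = perLine.foldLeft((0L, 0L)) {
      case ((g, t), (lineGc, lineTotal)) => (g + lineGc, t + lineTotal)
    }
    if (total == 0) 0.0 else gc.toDouble / total
  }
}
```

For the example input, `GcContent.gcFraction(Seq("atcg", "ccgg", "ttat"))` evaluates to `0.5`, matching the formula above.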

**Tip 1:** when you load an input file as an RDD, each line is loaded into a distinct string record of the RDD. In Scala, you can count the occurrences of a given character in a string as follows:

In [3]:
"atttccgg".count(c => c == 'g')

2

**Question 1:** Is the previous operation executed in parallel, or is it computed locally? Why?

*Answer goes here*

**Tip 2:** sums from different RDD records can be aggregated using the RDD *reduce* method. An example follows:

In [4]:
val sumsRDD = sc.parallelize(Array(3,5,2))
sumsRDD.reduce(_+_)

10

### Your solution

In [None]:
// Implementation goes here

**Question 2:** What is an RDD in Spark?

*Answer goes here*

**Question 3:** Is *reduce* a transformation or an action? What are RDD transformations and RDD actions? How do they differ from each other?

*Answer goes here*

**Question 4:** What are some of the advantages of Spark over Hadoop (and MapReduce)?

*Answer goes here*

## Task 2: Monte Carlo integration
Large dataset analysis is the main use case of Spark. However, Spark can be used to perform compute-intensive tasks as well. Numerical integration is a good example of a problem in this group of use cases.

<img src="https://pymc-devs.github.io/pymc/_images/reject.png" width="550"/>

The **Monte Carlo integration** method is a way to approximate the definite integral of a function $f(x)$ over an interval $[A,B]$. Given a value $Max_{f(x)}$, which $f(x)$ never exceeds, we first randomly draw $N$ uniformly distributed points $(x_1,y_1) … (x_N,y_N)$ s.t. $x_1 … x_N \in [A,B], y_1 … y_N \in [0,Max_{f(x)}]$. Then, assuming that $f(x)$ is positive over $[A,B]$, the fraction of points that fall under $f(x)$ will be roughly equal to the area under the curve divided by the total area of the rectangle in which we randomly drew the points. Hence, the definite integral of $f(x)$ over $[A,B]$ is roughly equal to:

$$(B-A) Max_{f(x)}\frac{n_{P}}{tot_{P}},$$
where $n_{P}$ is the number of points that fell under $f(x)$, and $tot_{P}$ is the total number of randomly drawn points.
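
The estimator above can be sketched locally in plain Scala. This is only an illustration under stated assumptions: it runs on a single machine without Spark (so it is not a solution to the task below), the names `HitOrMiss` and `integrate` are hypothetical, and the fixed `seed` parameter is added only to make runs reproducible.

```scala
import scala.util.Random

object HitOrMiss {
  // Estimates the integral of f over [a, b], assuming 0 <= f(x) <= maxF on that interval.
  def integrate(f: Double => Double, a: Double, b: Double,
                maxF: Double, n: Int, seed: Long): Double = {
    val rng = new Random(seed)
    // Draw n uniform points in the rectangle [a, b] x [0, maxF]
    // and count how many of them fall under the curve.
    val hits = (1 to n).count { _ =>
      val x = a + (b - a) * rng.nextDouble()
      val y = maxF * rng.nextDouble()
      y <= f(x)
    }
    // Rectangle area times the fraction of points under the curve.
    (b - a) * maxF * hits.toDouble / n
  }
}
```

As a usage example on a different function than the one in the task, `HitOrMiss.integrate(x => x * x, 0.0, 1.0, 1.0, 100000, 42L)` should return a value close to the exact integral $1/3$; the accuracy improves as the number of drawn points grows.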

### Task 
Write a program in Spark to approximate the definite integral of $f(x) = \frac{1 + \sin(x)}{\cos(x)}$ over $[0,1]$. This function is positive and lower than $4$ over $[0,1]$.

![integral](http://www5a.wolframalpha.com/Calculate/MSP/MSP25901g5d7g5f348eic6600001f2091710666623b?MSPStoreType=image/gif&s=32&w=200.&h=135.&cdf=Coordinates&cdf=Tooltips)

For the purpose of this assignment, drawing 1000 points is good enough.

**Question:** What does Spark's *parallelize* function do? What is it good for?

*Answer goes here*

### Your solution

In [None]:
// Implementation goes here