# Apache Spark development basics

## Word Count
Word count is considered the "hello world" program of Big Data analytics. Given a text file, the goal is to count how many times each word occurs. An example follows.

**Input file:**
```
foo bar bar 
baz bar
foo baz bar
```
**Result:** 
```
(foo,2)
(bar,4)
(baz,2)
```

**Task 1:** given the input file `data/la_divina_commedia.txt`, count how many times each word occurs in it.

In [3]:
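{
// Read the file as an RDD of lines, split it into words, and sum a 1 per occurrence.
sc.textFile("data/la_divina_commedia.txt")
  .flatMap(_.split("\\s+"))   // split each line on any run of whitespace
  .filter(_.nonEmpty)         // drop empty tokens produced by leading blanks
  .map((_,1))                 // pair each word with a count of 1
  .reduceByKey(_+_)           // sum the counts per word
  .saveAsTextFile("data/commedia_counts.txt") // written as a directory of part files
}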
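
To check the logic of this pipeline without a cluster, the same steps can be sketched on the toy input from the start of this section using plain Scala collections. This is only an illustrative sketch: the names `WordCountSketch` and `wordCount` are hypothetical, and `groupBy` stands in for Spark's `reduceByKey`.

```scala
object WordCountSketch {
  // Count occurrences of each whitespace-separated word in the given lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens produced by leading blanks
      .groupBy(identity)          // local stand-in for reduceByKey
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // The toy input file from the example above, one string per line.
    val lines = Seq("foo bar bar ", "baz bar", "foo baz bar")
    wordCount(lines).foreach(println)
  }
}
```

The printed pairs should match the expected result above, although a `Map` does not guarantee any particular order.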

## Monte Carlo $\pi$ estimation
Large dataset analysis is the main use case of Spark. However, Spark can be used for compute-intensive tasks as well. Monte Carlo $\pi$ estimation is a good example problem.

### Monte Carlo method
![circle](https://learntofish.files.wordpress.com/2010/10/circle_dots.png?w=300)

**Idea:** the ratio $\frac{A_{circle}}{A_{square}}$ is roughly equal to the fraction of *darts* that fall in the circle.

#### Algorithm
1. Throw $N$ uniformly distributed darts at the square
2. Count how many darts fall in the circle
3. $\pi$ is roughly $4\frac{count}{N}$

\begin{equation*}
\frac{count}{N} \simeq \frac{A_{circle}}{A_{square}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4}
\end{equation*}
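
Before distributing the computation, the three steps of the algorithm can be sketched in plain Scala. This is a minimal single-machine sketch under stated assumptions: the names `MonteCarloPiSketch` and `estimatePi` are hypothetical, the square is taken to be $[-1, 1] \times [-1, 1]$ with the unit circle inscribed in it, and a seeded `scala.util.Random` is used so that runs are repeatable.

```scala
import scala.util.Random

object MonteCarloPiSketch {
  // Estimate pi by throwing n darts uniformly into the square [-1, 1] x [-1, 1]
  // and counting those that land inside the inscribed unit circle.
  def estimatePi(n: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val count = (1 to n).count { _ =>
      val x = 2 * rng.nextDouble() - 1   // uniform in [-1, 1)
      val y = 2 * rng.nextDouble() - 1   // uniform in [-1, 1)
      x * x + y * y < 1                  // true if the dart is inside the circle
    }
    4.0 * count / n                      // step 3: pi is roughly 4 * count / N
  }

  def main(args: Array[String]): Unit =
    println(estimatePi(n = 1000000, seed = 42L))
}
```

Each dart is independent of the others, which is why the loop parallelizes naturally.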


In [11]:
{
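val n = 10000000                      // number of darts
val count = sc.parallelize(1 to n)    // distribute the n trials
  .map { _ =>
      val x = math.random
      val y = math.random
      if(x*x + y*y < 1) 1 else 0      // 1 if the dart falls inside the circle
  }.reduce(_+_)                       // total number of hits
val pi = 4.0 * count / n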
println(pi)
}

3.141272
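
Note that the cell above draws $x$ and $y$ from $[0, 1)$, so the darts land in the unit square and are tested against the quarter of the unit circle lying in the first quadrant, rather than against the full circle used in the derivation. The ratio of the areas is unchanged:

\begin{equation*}
\frac{A_{quarter\ circle}}{A_{unit\ square}} = \frac{\pi \cdot 1^2 / 4}{1^2} = \frac{\pi}{4}
\end{equation*}

As a rough sanity check (a standard binomial estimate, not something stated in the original notebook), each dart is a hit with probability $p = \frac{\pi}{4}$, so the standard error of the estimate is

\begin{equation*}
4\sqrt{\frac{p(1-p)}{N}} \approx 4\sqrt{\frac{0.785 \cdot 0.215}{10^7}} \approx 5 \times 10^{-4}
\end{equation*}

for $N = 10^7$. A result such as 3.141272, which differs from $\pi$ by about $3 \times 10^{-4}$, is therefore within the expected noise.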


## K-Nearest Neighbour Classifier

When Spark was first implemented, the motivation for a new framework was the lack of dataset caching in MapReduce (and Hadoop). This penalizes applications that need to access a hot dataset iteratively. Building a K-Nearest Neighbour (KNN) classifier is a good example of this class of problems.