# RDD creation

#### Introduction to Spark with Python

In this notebook we will introduce two different ways of getting data into the basic Spark data structure, the **Resilient Distributed Dataset** or **RDD**. An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
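
The split between transforming an RDD and calling an action on it can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code: the hypothetical `ToyRDD` class only records transformations and runs them when an action such as `count()` or `take()` is called. It is a teaching model under those assumptions and ignores distribution and fault tolerance entirely.

```python
from itertools import islice


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (illustration only)."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Apply the recorded steps, in order, to each source element.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger evaluation and return a plain Python value.
    def count(self):
        return sum(1 for _ in self._iterate())

    def take(self, n):
        return list(islice(self._iterate(), n))


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


even_squares = ToyRDD(range(10)).map(square).filter(lambda x: x % 2 == 0)
print(len(calls))             # 0: building the pipeline ran nothing
print(even_squares.count())   # 5
print(even_squares.take(3))   # [0, 4, 16]
```

Because each action re-runs the recorded steps, calling `count()` and then `take()` evaluates `square` twice for the elements involved, which loosely mirrors why Spark recomputes an RDD unless it is cached.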

#### References

The reference book for these and other Spark related topics is *Learning Spark* by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

More notebooks are available at https://github.com/jadianes/spark-py-notebooks.

## Getting the data files

The KDD Cup 1999 competition dataset is described in detail [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99).

## Creating an RDD from a file
#### KDDCUP99
This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

This is the original task description given to competition participants: http://kdd.ics.uci.edu/databases/kddcup99/task.html

In [None]:
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

In this notebook we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half a million network interactions. The file is provided as a *Gzip* file that must be downloaded locally before running the cell above; `sc.textFile` reads Gzip-compressed text files directly.

Now we have our data file loaded into the `raw_data` RDD.

Without getting into Spark *transformations* and *actions*, the most basic thing we can do to check that we got our RDD contents right is to `count()` the number of lines loaded from the file into the RDD.  

In [None]:
raw_data.count()
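
One way to sanity-check the value returned by `count()` is to count the lines of the Gzip file locally with Python's standard `gzip` module. The sketch below is self-contained, so it writes a tiny hypothetical sample file (`sample.gz`, with made-up rows) instead of reading the real dataset. Pointing the illustrative helper `count_lines` at `./kddcup.data_10_percent.gz` should give the same number as `raw_data.count()`, assuming the file uses ordinary line terminators.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count lines in a gzip-compressed text file, streaming line by line."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


# Hypothetical sample rows, invented for this example only.
sample_rows = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
    "0,tcp,http,SF,normal.",
]

# Write the sample rows to a temporary gzip file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample_path, "wt") as handle:
    handle.write("\n".join(sample_rows) + "\n")

sample_count = count_lines(sample_path)
print(sample_count)
```

Streaming the file keeps memory use flat, which matters for the full dataset even though the sample here is only three lines.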

We can also check the first few entries in our data.  

In [None]:
raw_data.take(5)

## Assignment 1

Count the number of 'normal.' interactions we have in our dataset, using the `filter()` transformation and the `count()` action.

Remember that the function passed to `filter()` is evaluated on every element in the original RDD. The resulting RDD will contain just those elements that make the function return `True`.
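
The predicate contract is the same one used by Python's built-in `filter`, so it can be tried locally before writing the Spark version. The sketch below uses a made-up list of words and a hypothetical predicate `is_long`; it is an analogy for `RDD.filter`, not a solution to the assignment.

```python
words = ["spark", "rdd", "cluster", "map", "action"]  # made-up sample data


def is_long(word):
    """Predicate: return True for the elements that should be kept."""
    return len(word) > 4


# Keep only the elements for which the predicate returns True.
long_words = list(filter(is_long, words))
print(long_words)       # ['spark', 'cluster', 'action']
print(len(long_words))  # 3
```

In Spark, the equivalent of `len(...)` on the filtered result is the `count()` action called on the filtered RDD.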

In [None]:
# Replace <FILL IN> with the proper code
raw_data.first()
goodLines = raw_data.filter(<FILL IN>)
<FILL IN>




### Map()

By using the `map` transformation in Spark, we can apply a function to every element in our RDD. Python's lambdas are especially expressive for this purpose. In this case we want to read our data file as CSV. We can do this by applying a lambda function to each element in the RDD as follows.

In [None]:
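from time import time

# map is a transformation: this line only records the split, nothing runs yet
csv_data = raw_data.map(lambda x: x.split(","))

# take(5) is an action, so the parsing actually happens while it is timed
t0 = time()
head_rows = csv_data.take(5)
tt = time() - t0
print("Parse completed in {} seconds".format(round(tt, 3)))
print(head_rows[0])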
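
To see what the lambda does to a single element, the same `split` call can be applied to one string in plain Python. The line below is a shortened, made-up record in the spirit of the dataset, not an actual row. Note that `split(",")` is a simple approach that assumes no field contains a quoted comma; for such data Python's `csv` module would be the safer choice, as the second half of the sketch shows.

```python
import csv

line = "0,tcp,http,SF,181,5450,normal."  # hypothetical record, for illustration

# What the lambda in the map call does to one element.
fields = line.split(",")
print(fields)
print(len(fields))  # 7

# A made-up record with a quoted comma shows where plain split falls short.
quoted = '1,"a,b",c'
naive_fields = quoted.split(",")
csv_fields = next(csv.reader([quoted]))
print(naive_fields)  # ['1', '"a', 'b"', 'c']
print(csv_fields)    # ['1', 'a,b', 'c']
```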

## Creating an RDD using `parallelize`

Another way of creating an RDD is to parallelize an already existing list.  

In [None]:
a = range(100)

data = sc.parallelize(a)

As we did before, we can `count()` the number of elements in the RDD.

In [None]:
data.count()

As before, we can access the first few elements in our RDD.

In [None]:
data.take(5)
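
When a list is parallelized, its elements end up spread across several partitions. The sketch below is a simplified plain-Python model of that idea, using a hypothetical helper `split_into_slices` that cuts a list into contiguous chunks of near-equal size. It is an assumption for illustration; Spark's actual partitioning logic may differ, and in PySpark the number of partitions can be requested through the `numSlices` argument of `sc.parallelize`.

```python
def split_into_slices(items, num_slices):
    """Split a sequence into num_slices contiguous chunks of near-equal size."""
    items = list(items)
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


slices = split_into_slices(range(10), 3)
print(slices)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# A count over the whole collection is the sum of the per-slice counts.
total = sum(len(chunk) for chunk in slices)
print(total)   # 10
```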