# Programming with RDDs

Spark automatically distributes the data in an RDD (Resilient Distributed Dataset) across a cluster and parallelizes the operations performed on it.

## RDD Basics

### Creating an RDD from Files


#### Creating an RDD from Text Files (e.g. CSV files)
We can create an RDD from a text file, where each line becomes one record. We can then transform this RDD into the dataset we want using RDD operations.

```Python
lines = sc.textFile("file:/usr/local/spark/1.4.0/README.md")
lines.take(5)
```

[u'# Apache Spark',
 u'',
 u'Spark is a fast and general cluster computing system for Big Data. It provides',
 u'high-level APIs in Scala, Java, and Python, and an optimized engine that',
 u'supports general computation graphs for data analysis. It also supports a']
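
As a small illustration of the kind of per-record transformation mentioned above, the sketch below parses one CSV line into a list of fields using Python's standard `csv` module. The function name `parse_csv_line` and the sample record are made up for this example; in Spark, such a function would typically be passed to `map` on an RDD created by `sc.textFile`.

```python
import csv

def parse_csv_line(line):
    """Split one line of CSV text into a list of string fields."""
    # csv.reader handles quoted fields that contain commas,
    # which a plain line.split(",") would break apart.
    return next(csv.reader([line]))

# Hypothetical sample record, not taken from a real dataset.
sample = '1,"Smith, John",42'
fields = parse_csv_line(sample)
print(fields)

# With Spark, the same function could be applied to every record, e.g.:
# records = sc.textFile("someFile.csv").map(parse_csv_line)  # path is a placeholder
```

This assumes that no quoted field contains a line break, since `textFile` treats every line as a separate record.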

#### Creating an RDD from a JSON File

Spark also provides the DataFrame, a distributed collection of data organized into named columns (much like data frames in R or pandas in Python) that is built on top of RDDs. JSON files can be read directly into a DataFrame.

More on DataFrames and Spark SQL later.

```Python
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.json("jsonFilePath")
```
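
One detail worth knowing: Spark's JSON reader expects each line of the file to be a separate, self-contained JSON object (often called JSON Lines), not one large JSON array. A minimal sketch of writing such a file with the standard library follows; the records and the temporary file are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records, used only to illustrate the file layout.
records = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# Write one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Each line can be parsed on its own, independently of the others.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
print(parsed)

# The file could then be loaded as in the block above:
# df = sqlContext.read.json(path)
```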

#### Creating an RDD from a Parquet File

Parquet is a columnar storage format developed for the Hadoop ecosystem, and it is the default data source format for DataFrames in Spark.

```Python
# `read` is a property of a SQLContext instance, not of the class itself
df = sqlContext.read.parquet("parquetFilePath")
```

#### Generic RDD (DataFrame) Creation

DataFrames can be loaded from and saved to a variety of formats. The default is the [Parquet file format](https://parquet.apache.org/).

```Python
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load("filePathToParquetFile")
```

`sqlContext.read.load` can also read from other sources, such as JSON, Parquet, or JDBC, when the `format` argument is given:

```Python
df = sqlContext.read.load("filePathToJsonFile", format="json")
```

```Python
df = sqlContext.read.load("filePathToParquetFile", format="parquet")
```

```Python
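# JDBC sources take connection options, not a file path; "jdbcUrl" and "tableName" are placeholders
df = sqlContext.read.load(format="jdbc", url="jdbcUrl", dbtable="tableName")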
```

DataFrames can also be saved in a specified format; the default is `parquet`.


```Python
df.write.save("FileName.parquet")
```

```Python
df.write.save("FileName.parquet", format="parquet")
```

## The Two Types of RDD Operations: Transformations and Actions

There are two types of RDD operations: transformations and actions.

__Transformations__ construct a new RDD from a previous one (for example, mapping or filtering).

__Actions__ compute a result based on an RDD and either return it to the driver program or save it to a file.

Spark performs these operations lazily. It builds a Directed Acyclic Graph (DAG) from the given transformations, and only groups them into tasks and executes them when an action is called.
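
To make the lazy behaviour concrete, here is a toy model in plain Python. It is not Spark's implementation: the `LazyDataset` class and the `shout` helper are invented for this sketch, and it records a simple list of steps rather than a real DAG. It only illustrates that transformations are recorded and nothing runs until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded steps over the data and return the result.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every time shout() actually runs

def shout(line):
    calls.append(line)
    return line.upper()

toy_lines = LazyDataset(["# Apache Spark", "", "Spark is fast"])
result = toy_lines.filter(lambda l: "Spark" in l).map(shout)

calls_before_action = len(calls)  # still 0: only the plan has been built
collected = result.collect()      # the action triggers the work
print(calls_before_action, collected)
```

In PySpark, the analogous calls on the `lines` RDD would be `lines.filter(...)` and `lines.map(...)` as transformations, with `collect()`, `count()`, or `take(n)` as actions.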
