# RDDs

RDD stands for Resilient Distributed Dataset. An RDD is an immutable, partitioned collection of records that can be processed in parallel. With a DataFrame, each record is a structured row whose fields have a known schema. With an RDD, the records are just Java, Scala, or Python objects, so you have complete control over them.

That control comes at a cost. Spark does not understand the inner structure of your records the way it does with DataFrames, so the optimizations you would get automatically with DataFrames have to be recreated by hand.

The RDD API is available in Python as well as Scala and Java. RDDs perform well in Scala and Java. Running RDDs in Python, however, is like running a Python user-defined function row by row: the data is serialized to a Python process, processed there, and then serialized back to the Java Virtual Machine. For this reason, it is recommended to stick with Spark's high-level APIs and use RDDs only when absolutely necessary.

## Import

In [None]:
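import os

import findspark

# Call findspark.init() before importing pyspark so that the Spark
# installation is added to sys.path first.
findspark.init()

import pyspark
from pyspark.sql import SparkSession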

## SparkSession and SparkContext

In [None]:
spark = SparkSession.builder.getOrCreate()

sc = spark.sparkContext

## RDDs setup

In [None]:
data_path = "file:///" + os.getcwd() + "/data"

file_path = data_path + "/police-stations.csv"
ps_rdd = sc.textFile(file_path)
ps_rdd.first()

In [None]:
ps_header = ps_rdd.first()

In [None]:
ps_rest = ps_rdd.filter(lambda line: line != ps_header)
ps_rest.first()

**How many police stations are there?**

In [None]:
ps_rest.map(lambda line: line.split(",")).count()
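The count above splits each line on every comma, which is fine as long as no field contains a comma inside quotes. If that assumption does not hold for your file, a minimal sketch of a safer parser, using Python's standard `csv` module, could look like the following. The helper name `parse_csv_line` and the sample line are made up for illustration and are not taken from the dataset.

```python
import csv


def parse_csv_line(line):
    """Parse a single CSV line into a list of fields, honoring quoted commas."""
    # csv.reader accepts any iterable of lines, so wrap the line in a list.
    return next(csv.reader([line]))


# Hypothetical sample line (not from police-stations.csv) with a quoted comma.
sample = '99,Example,"123 Main St, Suite 4",Chicago,IL,60600'

fields = parse_csv_line(sample)
naive_fields = sample.split(",")

print(len(fields), len(naive_fields))
```

With an RDD, the helper would be passed to `map` in place of the lambda, for example `ps_rest.map(parse_csv_line).count()`.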

**Display the District ID, District name, Address and Zip for the police station with District ID 7**

In [None]:
(
    ps_rest.filter(lambda line: line.split(",")[0] == "7")
    .map(
        lambda line: (
            line.split(",")[0],
            line.split(",")[1],
            line.split(",")[2],
            line.split(",")[5],
        )
    )
    .collect()
)
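The cell above calls `line.split(",")` once in the filter and four more times in the map. One way to avoid the repeated work is to move the logic into small named functions that split each line a single time. This is only a sketch: `district_is` and `pick_fields` are hypothetical helper names, the column positions 0, 1, 2 and 5 are the ones used in the cell above, and the sample line is invented.

```python
def district_is(line, district_id):
    """Return True if the first comma-separated field equals district_id."""
    # maxsplit=1 stops after the first comma, since only field 0 is needed.
    return line.split(",", 1)[0] == district_id


def pick_fields(line, indices=(0, 1, 2, 5)):
    """Split the line once and return the selected fields as a tuple."""
    parts = line.split(",")
    return tuple(parts[i] for i in indices)


# Invented sample line with the same column order the notebook assumes.
sample = "7,Name,Address,City,ST,00000"

matches = district_is(sample, "7")
selected = pick_fields(sample)
print(matches, selected)
```

Applied to the RDD, this would read `ps_rest.filter(lambda line: district_is(line, "7")).map(pick_fields).collect()`.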

**Police stations 10 and 11 are geographically close to each other. Display the District ID, District name, address and zip code for both**

In [None]:
(
    ps_rest.filter(lambda line: line.split(",")[0] in ["10", "11"])
    .map(
        lambda line: (
            line.split(",")[0],
            line.split(",")[1],
            line.split(",")[2],
            line.split(",")[5],
        )
    )
    .collect()
)