# Introduction to PySpark

## Import Module

In [1]:
import pyspark

## Spark Context

In **PySpark**, ```SparkContext``` serves as the entry point to Spark's core functionality. It represents the connection to a Spark cluster and coordinates the execution of a Spark application, communicating with the cluster manager to allocate resources and schedule tasks.

In this case, we run Spark in local mode instead of connecting to a remote cluster. The ```local[*]``` master URL tells Spark to run on this machine with as many worker threads as there are logical CPU cores, so the tasks of a Spark job can run in parallel.

In [2]:
sc = pyspark.SparkContext('local[*]')

25/02/13 11:52:47 WARN Utils: Your hostname, Cesars-MBP.local resolves to a loopback address: 127.0.0.1; using 192.168.7.230 instead (on interface en0)
25/02/13 11:52:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/02/13 11:52:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Resilient Distributed Dataset (RDD)

In **PySpark**, RDD is a fundamental data structure and abstraction that represents an immutable, distributed collection of data elements.

RDDs provide a powerful abstraction for distributed data processing. They handle the complexities of distributing data, fault tolerance, and parallel execution, allowing developers to focus on the logic of their data processing tasks. Newer abstractions such as DataFrames (and, in Scala and Java, typed Datasets) are built on top of RDDs and offer more structured data handling and optimizations, but understanding RDDs is still crucial for a deeper understanding of Spark's internals and for certain advanced use cases.

Let's break down what each part of that name means and why it's important:

- **Resilient**: RDDs are fault-tolerant. If a node in your cluster fails, the partitions it held can be rebuilt. Spark achieves this by tracking the lineage of each RDD, which is essentially a graph of the transformations applied to create it. This lineage allows Spark to recompute only the lost partitions of the RDD without having to reload the entire dataset.

- **Distributed**: RDDs are partitioned and distributed across multiple nodes in a cluster. This parallelization is key to Spark's performance, as it allows computations to be performed on different parts of the data simultaneously.  The data is split into chunks (partitions), and each partition can be processed on a different machine.

- **Dataset**: RDDs represent a collection of data. This data can come from various sources, such as files (text files, CSV, Parquet, etc.), databases, or even other RDDs.  The data within an RDD can be of any data type (e.g., integers, strings, Python objects).

Key Characteristics and Concepts related to RDDs:

- **Immutability**: Once an RDD is created, it cannot be changed.  Transformations on an RDD create a new RDD. This immutability simplifies debugging and makes it easier to reason about the code.

- **Lazy Evaluation**: Computations on RDDs are not performed immediately. Instead, Spark builds a plan of operations (the DAG, or Directed Acyclic Graph) and executes it only when an action is triggered (e.g., ```collect```, ```count```, ```saveAsTextFile```). This lazy evaluation allows Spark to optimize the execution plan and avoid unnecessary computations.

- **Transformations**: Operations that create new RDDs from existing ones are called transformations. Examples include ```map```, ```filter```, ```reduceByKey```, ```groupBy```, ```join```, etc. These are lazy operations. Note that ```reduce``` is not a transformation, because it returns a value instead of an RDD.

- **Actions**: Operations that trigger the execution of the RDD computations and return a result to the driver program or write data to an external system are called actions. Examples include ```collect```, ```count```, ```first```, ```take```, ```reduce```, ```saveAsTextFile```, etc. These are the operations that actually produce a result.

- **Partitioning**: RDDs are divided into partitions, which are the basic units of parallelism. The number of partitions can be configured and significantly impacts performance.  Good partitioning ensures balanced workload distribution across the cluster.

## Exercise

- Create a list of numbers
- Convert the list into an RDD and divide into 2 partitions
- Apply a filter to retain *only* the odd numbers from the RDD
- Retrieve the first 5 odd numbers from the filtered RDD

In [3]:
big_list = list(range(1000))

In [4]:
rdd = sc.parallelize(big_list, 2)

In [5]:
odds = rdd.filter(lambda x: x % 2 != 0)

In [6]:
odds.take(5)

                                                                                

[1, 3, 5, 7, 9]