# PySpark Lab Introduction

Welcome to the PySpark Lab! In this session, we'll explore big data processing with Apache Spark and its Python API, PySpark.

## What is Apache Spark?

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general computation graphs. Spark handles both batch and streaming workloads, which makes it suitable for a wide range of data processing tasks.

## What is PySpark?

PySpark is the Python API for Spark, allowing data scientists and analysts familiar with Python to leverage Spark's powerful data processing capabilities. PySpark provides a way to scale up your data processing tasks from a single machine to a cluster, enabling analysis of large datasets that wouldn't fit into the memory of a single machine.

## Using Google Colab for PySpark

For this lab, we'll be using Google Colab, a cloud-based Python notebook environment that provides free access to computing resources, including CPUs and GPUs. Google Colab allows us to run PySpark without any setup on our local machines, making it an excellent platform for learning and experimentation.

### Setting Up PySpark in Google Colab

To get started with PySpark in Google Colab, we'll first need to install PySpark. Don't worry, we'll guide you through this process in the lab. Here's a sneak peek of the commands you'll run:

```python
!pip install pyspark
!pip install findspark
```

## Running PySpark Code

Once PySpark is installed, you can create a `SparkContext` (the entry point used in this lab) and begin processing data.

### Setting Up Spark on Your Own Machine

Google Colab is convenient, but you may prefer to set up Spark on your own machine when you want more control over the development environment or a project needs a specific configuration. A local installation also requires a Java runtime, since Spark runs on the JVM. Setting it up yourself is a useful learning exercise and gives you more flexibility when developing larger data processing applications.

In [6]:
!pip install pyspark
!pip install findspark



# Spark Basics

### Creating RDDs

First, initialize `findspark` so Python can locate Spark, then import `pyspark` and create a `SparkContext`.

In [7]:
import findspark
findspark.init()

In [10]:
import pyspark

# Reuse the running SparkContext if there is one; otherwise start a new one
sc = pyspark.SparkContext.getOrCreate()

In [11]:
data = list(range(1, 10))
print(data)

[1, 2, 3, 4, 5, 6, 7, 8, 9]


In [12]:
# Distribute the local Python list across the cluster as an RDD
myRDD = sc.parallelize(data)

In [13]:
# collect() returns every element to the driver, so use it only on small RDDs
print(myRDD.collect())

[1, 2, 3, 4, 5, 6, 7, 8, 9]


In [14]:
print(myRDD.count())

9
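
To build intuition for what `parallelize`, `collect`, and `count` do, here is a plain-Python model that needs no Spark installation. It treats an RDD as a list of partitions. The helper names (`split_into_partitions`, `collect`, `count`) and the even-split rule are illustrative assumptions, not Spark's actual partitioning logic.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous, near-equal chunks."""
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def collect(partitions):
    """Bring every element back to one place, in partition order."""
    return [item for part in partitions for item in part]


def count(partitions):
    """Count each partition separately, then add the partial counts."""
    return sum(len(part) for part in partitions)


data = list(range(1, 10))
partitions = split_into_partitions(data, 3)
print(partitions)           # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(collect(partitions))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(count(partitions))    # 9
```

In a live session, `myRDD.getNumPartitions()` and `myRDD.glom().collect()` show how Spark actually split the data.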


#### Example 2: Creating RDDs from key-value pairs (tuples)

In [15]:
kv = [('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
print(kv)

[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]


In [16]:
rdd2 = sc.parallelize(kv)
print(rdd2.collect())

[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]


In [17]:
# Sum the values that share a key
rdd3 = rdd2.reduceByKey(lambda x, y: x + y)
print(rdd3.collect())

[('b', 6), ('c', 10), ('a', 9)]
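
`reduceByKey` merges all values for a key with the supplied function. Spark combines values inside each partition first and then merges the partial results, so the function should be associative and commutative. The sketch below models that two-step idea in plain Python. `combine_locally`, `reduce_by_key`, and the hand-made two-partition split are hypothetical, chosen only for illustration.

```python
def combine_locally(partition, func):
    """Merge values that share a key within a single partition."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(partitions, func):
    """Combine inside each partition first, then merge the partial results."""
    result = {}
    for partition in partitions:
        for key, partial in combine_locally(partition, func).items():
            result[key] = func(result[key], partial) if key in result else partial
    return result


kv = [('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
# Assumption: the data happens to be split into two partitions of four pairs each.
partitions = [kv[:4], kv[4:]]

totals = reduce_by_key(partitions, lambda x, y: x + y)
print(sorted(totals.items()))  # [('a', 9), ('b', 6), ('c', 10)]
```

The Spark output above is not sorted by key because the result order depends on how keys are distributed across partitions. If order matters, sort explicitly, for example with `sorted(rdd3.collect())` or `rdd3.sortByKey()`.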


In [18]:
rdd4 = rdd2.groupByKey()
print(rdd4.collect())

[('b', <pyspark.resultiterable.ResultIterable object at 0x7ca81e28e440>), ('c', <pyspark.resultiterable.ResultIterable object at 0x7ca81e28e290>), ('a', <pyspark.resultiterable.ResultIterable object at 0x7ca81e28ee60>)]

`groupByKey` returns each group's values as a `ResultIterable`, so printing the result shows object references rather than the values themselves. Converting each group to a list makes the contents visible.


In [19]:
rdd4.map(lambda x: (x[0], list(x[1]))).collect()

[('b', [2, 4]), ('c', [1, 2, 3, 4]), ('a', [7, 2])]
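
Unlike `reduceByKey`, `groupByKey` gathers every value for a key without reducing them. A plain-Python model of that behavior, with the hypothetical helper `group_by_key`, might look like this:

```python
from collections import defaultdict


def group_by_key(pairs):
    """Gather all values for each key, preserving their original order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


kv = [('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]

grouped = group_by_key(kv)
print(grouped)  # {'a': [7, 2], 'b': [2, 4], 'c': [1, 2, 3, 4]}

# Summing each group afterwards gives the same totals as reducing by key.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'a': 9, 'b': 6, 'c': 10}
```

In PySpark, `rdd4.mapValues(list).collect()` is a shorter equivalent of the `map` call above. For simple aggregations such as sums, `reduceByKey` is usually the better choice, because it combines values within each partition before data moves across the cluster.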