# PySpark exercise 1

Author: ISTD, SUTD

Title: Lab 12 Spark Part 2

Date: March 5, 2025




## Installing PySpark in Google Colab

To install PySpark in Google Colab, run the cell below. It downloads Spark and installs the libraries needed for this lab.

In [None]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F




# Exercise 1

In this exercise, we perform a data transformation using PySpark.

For parts marked with **[CODE CHANGE REQUIRED]**, modify or complete the code before running it.
For parts without **[CODE CHANGE REQUIRED]**, just run the given code.

## Task:

Transform the input below into the specified output.

### Input



The input data is stored in the text file `https://raw.githubusercontent.com/sutd50043/cohortclass/main/cc12/data/ex1/input.txt`, which is a list of labelled 2D coordinates in the following format:

```text
<label> 0:<x-value> 1:<y-value>
...
<label> 0:<x-value> 1:<y-value>
```

### Output

The expected output of the transformation is two separate TSV outputs, `ones` and `zeros`. Both are in the following format:

```text
<x-value> <y-value>
...
<x-value> <y-value>
```


For example, given the following input file:

```text
1 0:102 1:230
0 0:123 1:56
0 0:22 1:2
1 0:74 1:102
```
The output in `ones` would be

```text
102 230
74 102
```

The output in `zeros` would be

```text
123 56
22 2
```
where the gaps between the numbers are tabs.
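
The per-line transformation can be sketched in plain Python before moving to Spark. This is only an illustration of the format conversion described above: the helper name `parse_line` and the `sample` list are hypothetical and are not part of the lab code.

```python
from typing import Tuple


def parse_line(line: str) -> Tuple[str, str]:
    """Turn '<label> 0:<x> 1:<y>' into (label, '<x>\\t<y>')."""
    # split() with no argument tolerates repeated whitespace between fields
    label, x_field, y_field = line.split()
    x = x_field.split(":")[1]  # drop the '0:' index prefix
    y = y_field.split(":")[1]  # drop the '1:' index prefix
    return label, "\t".join([x, y])


# Hypothetical sample mirroring the example input above
sample = ["1 0:102 1:230", "0 0:123 1:56", "0 0:22 1:2", "1 0:74 1:102"]
parsed = [parse_line(line) for line in sample]

# Route each converted row to the output matching its label
ones = [row for label, row in parsed if label == "1"]
zeros = [row for label, row in parsed if label == "0"]

print(ones)
print(zeros)
```

The Spark version below follows the same two steps (convert each line, then select by label), only expressed as RDD operations.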

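Run the following bash cell to download the input data into a local `input` folder (this notebook reads from and writes to the local file system rather than HDFS).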

In [None]:
!wget https://raw.githubusercontent.com/sutd50043/cohortclass/main/cc12/data/ex1/input.txt
!mkdir input
!mkdir output
!mv input.txt ./input/

Complete the following PySpark script so that it performs the above-mentioned transformation.

**[CODE CHANGE REQUIRED]**

```text
Let r be an RDD. r.map(f) applies f to every element of r and returns a new RDD.
r.filter(p) returns a new RDD containing only the elements e of r that satisfy p(e).
r.saveAsTextFile("hdfs://...") saves an RDD to HDFS.
```
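
As a rough analogy, `map` and `filter` on an RDD behave like Python's built-in `map` and `filter` on a list, except that Spark distributes the work across partitions. The snippet below is plain Python, not Spark code, and the `records` list is a made-up example.

```python
# Plain-Python analogy only; no Spark is involved here.
records = [["1", "0:102", "1:230"], ["0", "0:123", "1:56"]]

# map applies the function to every element
labels = list(map(lambda rec: rec[0], records))

# filter KEEPS the elements for which the predicate returns True
kept = list(filter(lambda rec: rec[0] == "1", records))

print(labels)
print(kept)
```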

In [None]:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("Transformation notebook").getOrCreate()
sc = sparkSession.sparkContext


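def join(tokenized):
    # tokenized is [label, "0:<x>", "1:<y>"]; keep only the values after ":"
    x = (tokenized[1].split(":"))[1]
    y = (tokenized[2].split(":"))[1]
    return "\t".join([x,y])

sc.appName = "ETL (Transform) Example"

input_file_name = 'input/input.txt'
# named `lines` rather than `input` to avoid shadowing the Python built-in
lines = sc.textFile(input_file_name)

# split each line into tokens; persist because the RDD is reused for ones and zeros
tokenizeds = lines.map(lambda line : line.split(" "))
tokenizeds.persist()

# keep the rows labelled "1" and convert them to tab-separated x, y
ones = tokenizeds.filter(lambda tokenized : tokenized[0] == "1").map(join)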

output_folder_ones = "./output/ones"
ones.saveAsTextFile(output_folder_ones)


zeros = None
# TODO: fix me

output_folder_zeros = "./output/zeros"
zeros.saveAsTextFile(output_folder_zeros)

sc.stop()

### Sample answer





```python
zeros = tokenizeds.filter(lambda tokenized : tokenized[0] == "0").map(join)
```


## Test case

Run the bash cells below to check the results.

The output should look something like the following

```text
20/11/12 18:51:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
124	253
145	255
5	63
1	168
121	254
166	222
178	255
7	176
68	45
...
```

and

```text
124	253
145	255
5	63
1	168
121	254
166	222
178	255
7	176
68	45
64	191
...
```


In [None]:
!head ./output/ones/*

In [None]:
!head ./output/zeros/*
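
Beyond eyeballing the first few lines, the output format can also be checked programmatically. The sketch below is one possible check, not part of the lab: the helper names `read_output_rows` and `is_valid` are hypothetical, and it assumes the coordinates are non-negative integers (as in the sample) and that `saveAsTextFile` wrote its usual `part-*` files into each output folder.

```python
import glob
import os
from typing import List


def read_output_rows(folder: str) -> List[List[str]]:
    """Read every part-* file in an output folder and split rows on tabs."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "part-*"))):
        with open(path) as f:
            for line in f:
                rows.append(line.rstrip("\n").split("\t"))
    return rows


def is_valid(rows: List[List[str]]) -> bool:
    """True if there is at least one row and each row is two integer fields."""
    return bool(rows) and all(
        len(row) == 2 and all(value.isdigit() for value in row) for row in rows
    )


# Reports False for a folder that is missing or empty
for name in ("ones", "zeros"):
    rows = read_output_rows(os.path.join("output", name))
    print(name, len(rows), is_valid(rows))
```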