# Practice 1: Processing a Collection of Objects

So far, we have seen how to process and analyze a dataset of text files with Spark. In the next exercises, we will see how to use it to process a dataset made of an arbitrary type of file, in this case images.

We will work on the [Oxford Flowers](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html) dataset.

In [None]:
! pip install pillow

Collecting pillow


In [1]:
import os, glob

In [2]:
PATH = '/project/datasets/flowers/jpg/'
image_list = glob.glob(PATH+'*.jpg')
image_list[:5]

['/project/datasets/flowers/jpg/image_0064.jpg',
 '/project/datasets/flowers/jpg/image_0716.jpg',
 '/project/datasets/flowers/jpg/image_1172.jpg',
 '/project/datasets/flowers/jpg/image_0914.jpg',
 '/project/datasets/flowers/jpg/image_0963.jpg']

The Python library [Pillow](https://python-pillow.github.io/) can help us read images. For example:

In [3]:
from PIL import Image


In [None]:
flower_image = Image.open(image_list[0])
flower_image.filename

To reduce the dimensions of an image, we can use the `resize` method, which returns a new `PIL.Image.Image` object and leaves the original unchanged.

In [None]:
image_resized = flower_image.resize((flower_image.width//10, flower_image.height//10))
image_resized

The resized image can then be saved to disk with the `save()` method.
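
As a minimal sketch of the resize-then-save sequence described above: the helper names `scaled_size` and `save_resized`, and the output directory passed to them, are illustrative assumptions and not part of the notebook.

```python
import os


def scaled_size(width, height, factor):
    """Return (width, height) integer-divided by `factor`, never below 1 pixel."""
    return max(1, width // factor), max(1, height // factor)


def save_resized(src_path, out_dir, factor=10):
    """Resize the image at `src_path` and save it in `out_dir` under the same file name."""
    # Imported here so that scaled_size() stays usable without Pillow installed.
    from PIL import Image

    os.makedirs(out_dir, exist_ok=True)
    dst_path = os.path.join(out_dir, os.path.basename(src_path))
    with Image.open(src_path) as image:
        resized = image.resize(scaled_size(image.width, image.height, factor))
        # Pillow infers the output format from the file extension (.jpg here).
        resized.save(dst_path)
    return dst_path
```

For example, `save_resized(image_list[0], '/tmp/flowers_small/')` would write a reduced copy of the first image, assuming that (hypothetical) directory is writable.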

## Instructions

### 0. Import the necessary module to process data with Spark and create a Spark context if required

### 1. Create an RDD from the list of filenames `image_list`

### 2. Create a second RDD that contains `PIL.Image` objects

### 3. Keep only the landscape images (`width > height`)

### 4. Count the number of landscape images
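
One possible way to chain steps 1 to 4 is sketched below. It assumes `sc` is an existing `pyspark.SparkContext`; the function names `open_image`, `is_landscape` and `count_landscape` are hypothetical and only serve the illustration.

```python
def is_landscape(size):
    """True when a (width, height) tuple is strictly wider than it is tall."""
    width, height = size
    return width > height


def open_image(path):
    """Open one image file; runs on the Spark workers, so Pillow is imported there."""
    from PIL import Image

    return Image.open(path)


def count_landscape(sc, paths):
    """Count the landscape images among `paths` using an existing SparkContext `sc`."""
    rdd_paths = sc.parallelize(paths)            # step 1: RDD of filenames
    rdd_images = rdd_paths.map(open_image)       # step 2: RDD of PIL images (lazy)
    rdd_landscape = rdd_images.filter(           # step 3: keep width > height
        lambda image: is_landscape(image.size)
    )
    return rdd_landscape.count()                 # step 4: action, triggers the work
```

`map` and `filter` are transformations and only describe the computation; nothing is read from disk until `count()` is called, for instance with `count_landscape(sc, image_list)`.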

### 5. Reduce the dimensions of landscape images by a factor of 5

### 6. Verify the transformation by retrieving the first element

### 7. Save the new images as files

**Watch out**
* How to tell PIL where to save the files?
* How can we specify the name and the paths of our new files?
* Could a dataset of key-value pairs be useful?
* Is this an action or a transformation?
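
The questions above point toward keeping each image together with its original path. A hedged sketch, assuming an RDD of `(path, image)` pairs and a hypothetical output directory; `output_path`, `save_pair` and `save_all` are illustrative names, not part of the notebook:

```python
import os


def output_path(src_path, out_dir):
    """Build the destination path: same file name as the source, inside `out_dir`."""
    return os.path.join(out_dir, os.path.basename(src_path))


def save_pair(pair, out_dir):
    """Save one (source_path, image) pair; the key tells us how to name the file."""
    src_path, image = pair
    os.makedirs(out_dir, exist_ok=True)
    image.save(output_path(src_path, out_dir))


def save_all(rdd_pairs, out_dir):
    """Save every image of an RDD of (path, image) pairs.

    foreach() is an action: it runs save_pair on the workers and returns nothing.
    """
    rdd_pairs.foreach(lambda pair: save_pair(pair, out_dir))
```

Carrying the path as the key avoids relying on the image object to remember where it came from. On a real cluster, `out_dir` would also need to be a location that every worker can write to, such as a shared filesystem.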

### 8. End the application

## Recap

In this notebook, we practiced the following parts of the
**[Python Spark API](http://spark.apache.org/docs/latest/api/python/)**:
1. Import the Spark Python module:
**[`import pyspark`](http://spark.apache.org/docs/latest/api/python/pyspark.html)**
2. Create a SparkContext:
**[`pyspark.SparkContext()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext)**
3. Create an RDD from a list of objects:
**[`SparkContext.parallelize(list)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.parallelize)**
4. Count the number of elements in an RDD:
**[`RDD.count()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.count)**
5. Retrieve the first element of an RDD:
**[`RDD.first()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.first)**
6. Apply a transformation to each element of an RDD:
**[`RDD.map(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map)**
7. Filter an RDD:
**[`RDD.filter(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.filter)**
8. Apply a function to all elements of an RDD:
**[`RDD.foreach(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.foreach)**
9. End the SparkContext:
**[`SparkContext.stop()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.stop)**