# Project 8: Deploy a model with a big data architecture in AWS

*Pierre-Eloi Ragetly*

This notebook performs dimensionality reduction on an image dataset with PySpark.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Load-images" data-toc-modified-id="Load-images-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load images</a></span><ul class="toc-item"><li><span><a href="#Get-all-subfolders" data-toc-modified-id="Get-all-subfolders-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Get all subfolders</a></span></li><li><span><a href="#Create-a-DataFrame-with-all-images" data-toc-modified-id="Create-a-DataFrame-with-all-images-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Create a DataFrame with all images</a></span></li></ul></li><li><span><a href="#Create-a-Category-column" data-toc-modified-id="Create-a-Category-column-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create a Category column</a></span></li><li><span><a href="#Feature-extraction" data-toc-modified-id="Feature-extraction-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature extraction</a></span><ul class="toc-item"><li><span><a href="#Prepare-model" data-toc-modified-id="Prepare-model-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Prepare model</a></span></li><li><span><a href="#Prepare-data" data-toc-modified-id="Prepare-data-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Prepare data</a></span></li><li><span><a href="#Define-featurization-in-a-Pandas-UDF" data-toc-modified-id="Define-featurization-in-a-Pandas-UDF-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Define featurization in a Pandas UDF</a></span></li><li><span><a href="#Apply-featurization-to-the-DataFrame-of-images" data-toc-modified-id="Apply-featurization-to-the-DataFrame-of-images-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Apply featurization to the DataFrame of images</a></span></li></ul></li><li><span><a href="#Save-results" data-toc-modified-id="Save-results-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Save 
results</a></span></li></ul></div>

## Setup

First, let's import the common modules.

In [1]:
# Standard libraries
import os
import io
import time
from functools import reduce

# Import numpy and pandas for data manipulation
import numpy as np
import pandas as pd

# image preprocessing
from PIL import Image
from tensorflow.keras.preprocessing.image import img_to_array

# Import deep learning models with tensorflow
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Import pyspark library
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, pandas_udf
from pyspark.sql.types import ArrayType, StringType, FloatType

2023-01-06 08:51:07.346007: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Let's create a spark session. 

In [2]:
sc = SparkContext()
sc.setLogLevel("WARN")
spark = SparkSession.builder.appName('ImageReduction').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/06 08:51:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Load images

### Get all subfolders

In [3]:
root_path = 'dataset/sample_dataset'
subfolders = [d.path for d in os.scandir(root_path) if d.is_dir()]
subfolders

['dataset/sample_dataset/Kiwi',
 'dataset/sample_dataset/Apple_Golden_3',
 'dataset/sample_dataset/Cherry_1',
 'dataset/sample_dataset/Eggplant',
 'dataset/sample_dataset/Banana']

In [4]:
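# Replace all spaces (' ') with '_' in directory names
# to avoid loading issues with PySpark
for d in subfolders:
    os.rename(d, d.replace(' ', '_'))
# Refresh the list so it points to the renamed directories
subfolders = [d.replace(' ', '_') for d in subfolders]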
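The renaming step can be tried safely on a throwaway directory tree. The sketch below is only an illustration: the helper `rename_spaces` and the folder names are hypothetical, and it rewrites only each folder's own name, so a space elsewhere in the path is left untouched.

```python
import os
import tempfile

def rename_spaces(root):
    """Rename the direct subfolders of `root`, replacing spaces with underscores.

    Returns the sorted list of subfolder paths after renaming.
    """
    # Materialize the listing first so renaming does not disturb the iteration
    for entry in list(os.scandir(root)):
        if entry.is_dir() and ' ' in entry.name:
            new_path = os.path.join(root, entry.name.replace(' ', '_'))
            os.rename(entry.path, new_path)
    return sorted(d.path for d in os.scandir(root) if d.is_dir())

# Hypothetical folder names, created in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'Apple Golden 3'))
    os.mkdir(os.path.join(tmp, 'Kiwi'))
    renamed = [os.path.basename(p) for p in rename_spaces(tmp)]

print(renamed)
```

Returning the refreshed listing matters because any list of paths built before the renaming still holds the old names.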

### Create a DataFrame with all images

Since Spark 2.4, images in compressed formats (jpg, png, etc.) can be read with `spark.read.format('image').load('path')`.  
The image is decoded with the ImageIO *Java library* and exposed through a special DataFrame schema: a single StructType column `image` that holds all the information about the decoded data.

However, data manipulation is much easier with the **binaryFile** format. Instead of a single *image* column with six subcolumns, it creates four columns that hold the raw content and metadata of each file:
- path: `StringType` *image file path*
- modificationTime: `TimestampType` *last modification time of the image*
- length: `LongType` *size of the image file in bytes*
- content: `BinaryType` *raw bytes of the file (here, the JPEG-encoded image)*

The binaryFile format is used here to avoid working with nested columns.
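Assuming the `content` column holds the file bytes exactly as stored on disk, the bytes of a `.jpg` file should start with the JPEG start-of-image marker `FF D8 FF`. The small check below illustrates this on hand-written byte strings; the helper name `looks_like_jpeg` is made up for this sketch.

```python
# First three bytes of any JPEG file (start-of-image marker + next marker prefix)
JPEG_MAGIC = bytes([0xFF, 0xD8, 0xFF])

def looks_like_jpeg(content: bytes) -> bool:
    """Return True if the raw bytes begin with the JPEG magic number."""
    return content[:3] == JPEG_MAGIC

jpeg_header = bytes.fromhex('FFD8FFE0')   # typical JFIF header prefix
png_header = b'\x89PNG\r\n'               # start of a PNG file, for contrast

print(looks_like_jpeg(jpeg_header))
print(looks_like_jpeg(png_header))
```

Because the bytes are still encoded, they have to be decoded (for example with PIL) before they can be used as a pixel array.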

In [5]:
def load_img(dir_path):
    """
    Load all .jpg images saved in a directory to a binary Spark DataFrame.
    """
    images = spark.read.format("binaryFile") \
        .option("pathGlobFilter", "*.jpg") \
        .option("recursiveFileLookup", "true") \
        .load(dir_path)
    return images

In [6]:
# Load images
dataframes = [load_img(dir_path) for dir_path in subfolders]

# merge DataFrames
image_df = reduce(lambda first, second: first.union(second), dataframes)

In [7]:
image_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)



## Create a Category column

In [8]:
regex = r'(.*)/(.*[a-zA-Z])(.*)/'
df = image_df.withColumn('category', regexp_extract('path', regex, 2))
df.show(10)

+--------------------+-------------------+------+--------------------+--------+
|                path|   modificationTime|length|             content|category|
+--------------------+-------------------+------+--------------------+--------+
|file:/Users/pierr...|2021-09-12 19:26:18|  5380|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5378|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5344|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5336|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5330|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5328|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5327|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5326|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:18|  5312|[FF D8 FF E0 00 1...|    Kiwi|
|file:/Users/pierr...|2021-09-12 19:26:1
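To see what the second capture group keeps, the same pattern can be tried with Python's `re` module. This is only an approximation of `regexp_extract`, which relies on Java regular expressions; the two engines should agree for a pattern this simple. The paths below are made up for illustration.

```python
import re

CATEGORY_REGEX = r'(.*)/(.*[a-zA-Z])(.*)/'

def extract_category(path: str) -> str:
    """Return capture group 2 of the pattern, or '' when nothing matches."""
    match = re.search(CATEGORY_REGEX, path)
    return match.group(2) if match else ''

# Hypothetical file paths following the <root>/<category folder>/<file> layout
paths = [
    'file:/tmp/sample_dataset/Kiwi/img_0.jpg',
    'file:/tmp/sample_dataset/Apple_Golden_3/img_0.jpg',
    'file:/tmp/sample_dataset/Cherry_1/img_0.jpg',
]
categories = [extract_category(p) for p in paths]
print(categories)
```

Group 2 must end with a letter, so a trailing suffix such as `_3` or `_1` falls into group 3 and is dropped from the category name.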

## Feature extraction

### Prepare model

The **InceptionV3** model will be used to extract features from images.

The model weights will be stored in a **broadcast variable** to reduce communication costs.  
Broadcast variables are read-only shared variables: they are cached on every node of the cluster, where tasks can access them. Instead of sending this data along with every task, Spark distributes broadcast variables to each machine using efficient broadcast algorithms.
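A back-of-the-envelope comparison gives a feel for the saving. The sketch below is a deliberately simplified model, not a measurement: the payload size, task count and executor count are hypothetical, and compression and Spark's actual distribution strategy are ignored.

```python
def bytes_shipped(payload_bytes: int, n_tasks: int, n_executors: int) -> dict:
    """Compare shipping a payload with every task vs. once per executor."""
    return {
        'per_task': payload_bytes * n_tasks,       # payload sent with each task
        'broadcast': payload_bytes * n_executors,  # payload cached once per executor
    }

# Hypothetical setup: 90 MiB of weights, 200 tasks, 4 executors
costs = bytes_shipped(payload_bytes=90 * 1024**2, n_tasks=200, n_executors=4)
ratio = costs['per_task'] // costs['broadcast']
print(costs, ratio)
```

Under these assumptions the saving is simply the ratio of tasks to executors, which is why broadcasting pays off most when many small tasks share one large read-only object.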

In [9]:
model = InceptionV3(include_top=False)
bc_model_weights = sc.broadcast(model.get_weights())

def model_fn():
    """
    Returns an InceptionV3 model with top layer removed
    and broadcasted pretrained weights.
    """
    model = InceptionV3(weights=None, include_top=False)
    model.set_weights(bc_model_weights.value)
    return model



### Prepare data

In [10]:
def preprocess_img(content):
    """
    Preprocesses raw image bytes for prediction.
    """
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)
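For InceptionV3, `preprocess_input` rescales pixel values from the [0, 255] range to [-1, 1]. The NumPy function below is a stand-in written for illustration (the name `scale_like_inception` is hypothetical); it mimics that scaling but is not the Keras implementation.

```python
import numpy as np

def scale_like_inception(arr: np.ndarray) -> np.ndarray:
    """Map pixel values from [0, 255] to [-1, 1]."""
    return arr.astype('float32') / 127.5 - 1.0

# Darkest, middle and brightest possible pixel values
pixels = np.array([0.0, 127.5, 255.0], dtype='float32')
scaled = scale_like_inception(pixels)
print(scaled)
```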

### Define featurization in a Pandas UDF

In [11]:
@pandas_udf('array<float>')
def featurize_udf(s: pd.Series) -> pd.Series:
    """
    Featurize a pd.Series of raw images using the InceptionV3 model.
    For some layers, output features will be multi-dimensional tensors.
    Feature tensors are flattened to vectors for easier storage in Spark DataFrames.
    -----------    
    Return: a pd.Series of image features
    """
    model = model_fn()
    arr = np.stack(s.map(preprocess_img))
    preds = model.predict(arr)
    output = [p.flatten() for p in preds]
    return pd.Series(output)

As explained in the [Spark documentation](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html), it is preferable to declare the pandas UDF type through Python type hints rather than through the *functionType* argument, which will be deprecated in future releases.
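Since a Series-to-Series UDF body is ordinary pandas code, its flattening step can be checked locally without a Spark session. The sketch below uses a zero-filled array in place of real model predictions; the 5x5x2048 feature-map shape is an assumption made for illustration, and `flatten_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def flatten_features(preds: np.ndarray) -> pd.Series:
    """Flatten each per-image feature tensor into a 1-D vector."""
    return pd.Series([p.flatten() for p in preds])

# Stand-in for model.predict output: 2 images, assumed 5x5x2048 feature maps
fake_preds = np.zeros((2, 5, 5, 2048), dtype='float32')
features = flatten_features(fake_preds)

print(len(features), features.iloc[0].shape)
```

Each element of the resulting Series is a flat vector, which matches the `array<float>` return type declared for the UDF.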

### Apply featurization to the DataFrame of images

Pandas UDFs on large records (e.g., very large images) can run into Out Of Memory (OOM) errors. These errors can be avoided by reducing the Arrow batch size via `maxRecordsPerBatch`.

In [12]:
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1024")
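A rough estimate shows why the batch size matters. The sketch below counts only the stacked input array that one batch would produce, assuming 224x224 RGB images stored as 4-byte floats; model activations and the output features are ignored, and `batch_bytes` is a hypothetical helper.

```python
def batch_bytes(n_records: int, height: int = 224, width: int = 224,
                channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold one batch of images as a dense float array."""
    return n_records * height * width * channels * bytes_per_value

size_1024 = batch_bytes(1024)       # batch size set in this notebook
size_default = batch_bytes(10_000)  # Spark's default maxRecordsPerBatch

print(f"{size_1024 / 1024**2:.0f} MiB vs {size_default / 1024**2:.0f} MiB")
```

Under these assumptions a batch of 1024 images needs well under 1 GiB for its input array, whereas the default batch of 10,000 would need several GiB.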

In [13]:
features_df = df.withColumn('features', featurize_udf('content')) \
                .select('path', 'category', 'content', 'features')

features_df.show(5)



+--------------------+--------+--------------------+--------------------+
|                path|category|             content|            features|
+--------------------+--------+--------------------+--------------------+
|file:/Users/pierr...|    Kiwi|[FF D8 FF E0 00 1...|[0.0, 0.0, 0.0, 0...|
|file:/Users/pierr...|    Kiwi|[FF D8 FF E0 00 1...|[0.0, 0.01969701,...|
|file:/Users/pierr...|    Kiwi|[FF D8 FF E0 00 1...|[0.0, 0.0, 0.0, 0...|
|file:/Users/pierr...|    Kiwi|[FF D8 FF E0 00 1...|[0.0, 0.0, 0.0, 0...|
|file:/Users/pierr...|    Kiwi|[FF D8 FF E0 00 1...|[0.0, 0.0, 0.0, 0...|
+--------------------+--------+--------------------+--------------------+
only showing top 5 rows



## Save results

Results are saved in the Parquet format for performance reasons.

In [14]:
features_df.write.mode("overwrite").parquet("dataset/prep_data.parquet")
