# Project 8: Deploy a model with a big data architecture in AWS

*Pierre-Eloi Ragetly*

This notebook performs a dimension reduction on an image dataset with PySpark.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Load-images" data-toc-modified-id="Load-images-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load images</a></span><ul class="toc-item"><li><span><a href="#Get-all-subfolders" data-toc-modified-id="Get-all-subfolders-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Get all subfolders</a></span></li><li><span><a href="#Create-a-DataFrame-with-all-images" data-toc-modified-id="Create-a-DataFrame-with-all-images-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Create a DataFrame with all images</a></span></li></ul></li><li><span><a href="#Create-a-Category-column" data-toc-modified-id="Create-a-Category-column-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create a Category column</a></span></li><li><span><a href="#Feature-extraction" data-toc-modified-id="Feature-extraction-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature extraction</a></span><ul class="toc-item"><li><span><a href="#Prepare-model" data-toc-modified-id="Prepare-model-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Prepare model</a></span></li><li><span><a href="#Prepare-data" data-toc-modified-id="Prepare-data-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Prepare data</a></span></li><li><span><a href="#Define-featurization-in-a-Pandas-UDF" data-toc-modified-id="Define-featurization-in-a-Pandas-UDF-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Define featurization in a Pandas UDF</a></span></li><li><span><a href="#Apply-featurization-to-the-DataFrame-of-images" data-toc-modified-id="Apply-featurization-to-the-DataFrame-of-images-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Apply featurization to the DataFrame of images</a></span></li></ul></li><li><span><a href="#Save-results" data-toc-modified-id="Save-results-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Save 
results</a></span></li></ul></div>

## Setup

First, let's import the common modules.

In [39]:
# Standard libraries
import os
import io
import time
from functools import reduce

# Import numpy and pandas for data manipulation
import numpy as np
import pandas as pd

# image preprocessing
from PIL import Image
from tensorflow.keras.preprocessing.image import img_to_array

# Import deep learning models with tensorflow
#from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Import pyspark library
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, StringType, FloatType

Let's create a spark session. 

In [40]:
sc = SparkContext.getOrCreate()
sc.setLogLevel("WARN")
spark = SparkSession.builder.appName('ImageReduction').getOrCreate()

## Load images

### Get all subfolders

In [41]:
root_path = 'dataset/sample_dataset'
subfolders = [d.path for d in os.scandir(root_path) if d.is_dir()]
subfolders

['dataset/sample_dataset/Apple_Golden_3', 'dataset/sample_dataset/Cherry_1']

In [42]:
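# replace every ' ' with '_' in directory names
# to avoid loading issues with pyspark
for d in subfolders:
    os.rename(d, d.replace(' ', '_'))
# refresh the list so that it points to the renamed directories
subfolders = [d.replace(' ', '_') for d in subfolders]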

### Create a DataFrame with all images

**TO BE RE-WRITTEN**


Since Spark 2.4, images in compressed formats (jpg, png, etc.) can be read with `spark.read.format('image').load('path')`.

Images are decoded with the Java *ImageIO* library and loaded into a DataFrame with a dedicated schema: a single `StructType` column named `image`, whose fields describe the decoded image.
- origin: `StringType` *image file path* 
- height: `IntegerType` *image height*
- width: `IntegerType` *image width*
- nChannels: `IntegerType` *number of image channels*
- mode: `IntegerType` *OpenCV-compatible type* 
- data: `BinaryType` *Image bytes in OpenCV-compatible order (BGR)*
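
To make the `data` field more concrete, here is a minimal sketch of how its bytes could be turned back into RGB pixels. It assumes a 3-channel image stored row by row in BGR order, as described in the list above. The helper name `bgr_bytes_to_rgb_pixels` and the sample bytes are hypothetical and only serve the illustration.

```python
def bgr_bytes_to_rgb_pixels(height, width, n_channels, data):
    """Return a height x width nested list of (R, G, B) tuples.

    Assumption: `data` holds height * width * n_channels bytes,
    stored row by row, with each pixel in B, G, R order.
    """
    if n_channels != 3:
        raise ValueError("this sketch only handles 3-channel images")
    if len(data) != height * width * n_channels:
        raise ValueError("data length does not match the declared shape")
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            i = (y * width + x) * n_channels
            b, g, r = data[i], data[i + 1], data[i + 2]
            row.append((r, g, b))  # swap BGR -> RGB
        rows.append(row)
    return rows


# Hypothetical 1x2 image: a pure blue pixel followed by a pure red pixel (BGR bytes).
sample = bytes([255, 0, 0, 0, 0, 255])
pixels = bgr_bytes_to_rgb_pixels(1, 2, 3, sample)
print(pixels)  # [[(0, 0, 255), (255, 0, 0)]]
```

In practice a NumPy reshape to `(height, width, nChannels)` followed by reversing the last axis would do the same job much faster. The pure-Python loop is only meant to show the byte layout.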

In [43]:
def load_img(dir_path):
    """
    Load all .jpg images saved in a directory to a binary Spark DataFrame.
    """
    images = spark.read.format("binaryFile") \
        .option("pathGlobFilter", "*.jpg") \
        .option("recursiveFileLookup", "true") \
        .load(dir_path)
    return images

In [44]:
# Load images
dataframes = [load_img(dir_path) for dir_path in subfolders]

# merge DataFrames
image_df = reduce(lambda first, second: first.union(second), dataframes)

In [45]:
image_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)



In [5]:
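# Alternative, for comparison only: load images with the 'image' data source
dataframes = [spark.read.format('image').load(p) for p in subfolders]

# merge DataFrames under a separate name so that image_df (binaryFile) is kept
image_struct_df = reduce(lambda first, second: first.union(second), dataframes)

In [6]:
image_struct_df.printSchema()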

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)



## Create a Category column

In [46]:
regex = r'(.*)/(.*[a-zA-Z])(.*)/'
df = image_df.withColumn('category', regexp_extract('path', regex, 2))
df.show(10)

+--------------------+-------------------+------+--------------------+------------+
|                path|   modificationTime|length|             content|    category|
+--------------------+-------------------+------+--------------------+------------+
|file:/Users/pierr...|2021-09-12 19:25:46|  4869|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4865|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4857|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4847|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4847|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4842|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4834|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4824|[FF D8 FF E0 00 1...|Apple_Golden|
|file:/Users/pierr...|2021-09-12 19:25:46|  4820|[FF D8 FF E0 00 1...|Apple_
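
To see why the second capture group yields the category, the same pattern can be tried outside Spark. This is a small sketch using Python's `re` module for illustration only (Spark evaluates `regexp_extract` with Java regular expressions, which behave the same way for this pattern). The two paths are hypothetical examples modelled on the subfolder names above.

```python
import re

# Same pattern as in the notebook cell.
regex = r'(.*)/(.*[a-zA-Z])(.*)/'


def extract_category(path):
    """Return capture group 2, or '' when nothing matches (like regexp_extract)."""
    match = re.search(regex, path)
    return match.group(2) if match else ''


# Hypothetical file paths.
apple = extract_category('file:/data/sample_dataset/Apple_Golden_3/r_0_100.jpg')
cherry = extract_category('file:/data/sample_dataset/Cherry_1/r_0_100.jpg')
print(apple)   # Apple_Golden
print(cherry)  # Cherry
```

The first group greedily consumes everything up to the parent directory, the second group stops at the last letter of the folder name, and the third group absorbs the trailing `_3` or `_1` suffix before the final `/`.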

## Feature extraction

### Prepare model

We will use the **InceptionV3** model to extract features from images.

In [9]:
from sparkdl import DeepImageFeaturizer

# model: InceptionV3
# extracting feature from images
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")

ModuleNotFoundError: No module named 'sparkdl'

We will use **broadcast variables** to reduce communication costs.  
Broadcast variables are read-only shared variables that are cached on every node of the cluster, where tasks can access them. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms.

In [49]:
model = InceptionV3(include_top=False)
#model = ResNet50(include_top=False)
bc_model_weights = sc.broadcast(model.get_weights())

def model_fn():
    """
    Returns an InceptionV3 model with top layer removed
    and broadcasted pretrained weights.
    """
    model = InceptionV3(weights=None, include_top=False)
    model.set_weights(bc_model_weights.value)
    return model

### Prepare data

In [50]:
def preprocess_img(content):
    """
    Preprocesses raw image bytes for prediction.
    """
    # force 3 channels so that grayscale or RGBA files match the model input
    img = Image.open(io.BytesIO(content)).convert('RGB').resize((224, 224))
    arr = img_to_array(img)
    return preprocess_input(arr)

### Define featurization in a Pandas UDF

In [51]:
def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    For some layers, output features will be multi-dimensional tensors.
    Feature tensors are flattened to vectors for easier storage in Spark DataFrames.
    -----------    
    Return: a pd.Series of image features
    """
    arr = np.stack(content_series.map(preprocess_img))
    preds = model.predict(arr)
    output = [p.flatten() for p in preds]
    return pd.Series(output)

In [52]:
@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    """
    This function is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns
    a Spark DataFrame column of type ArrayType(FloatType).
    With Scalar Iterator pandas UDFs,
    the model can be loaded once and then re-used for multiple data batches.
    This amortizes the overhead of loading big models.
    -----------
    Parameters:
    content_series_iter: This argument is an iterator over batches of data,
                         where each batch is a pandas Series of image data.
    -----------
    Return: a Spark DataFrame column of type ArrayType(FloatType)
    """
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)
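
The load-once behaviour described in the docstring can be illustrated without Spark. The following is a minimal sketch in plain Python, where `load_expensive_model` is a hypothetical stand-in for `model_fn` and the "model" simply doubles its inputs. It only shows the generator pattern, not the actual UDF machinery.

```python
from typing import Callable, Iterator, List

load_count = 0  # how many times the (hypothetical) expensive model was built


def load_expensive_model() -> Callable[[List[int]], List[int]]:
    """Stand-in for model_fn(): counts each call and returns a toy model."""
    global load_count
    load_count += 1
    return lambda batch: [x * 2 for x in batch]


def featurize_batches(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    """Same shape as featurize_udf: build the model once, then stream batches."""
    model = load_expensive_model()  # executed once per iterator, not once per batch
    for batch in batches:
        yield model(batch)


results = list(featurize_batches(iter([[1, 2], [3], [4, 5, 6]])))
print(results)     # [[2, 4], [6], [8, 10, 12]]
print(load_count)  # 1
```

Three batches are processed while the model is built a single time. With a plain scalar UDF the loading code would instead sit inside the function body and run for every batch.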

### Apply featurization to the DataFrame of images

Pandas UDFs on large records (e.g., very large images) can run into Out Of Memory (OOM) errors. The risk can be reduced by lowering the Arrow batch size via `maxRecordsPerBatch`.

In [53]:
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1024")
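
As a rough order of magnitude for the batch size set above, the memory taken by one stacked input batch can be estimated by hand. The sketch below assumes 224 x 224 RGB inputs stored as 4-byte floats (what `preprocess_img` produces with Keras defaults) and ignores the raw JPEG bytes, intermediate copies and model activations, so real usage will be higher. The helper `batch_mebibytes` is hypothetical.

```python
def batch_mebibytes(records, height=224, width=224, channels=3, bytes_per_value=4):
    """Approximate size in MiB of one stacked float32 input batch."""
    return records * height * width * channels * bytes_per_value / 2**20


# Compare a few candidate values for maxRecordsPerBatch.
for records in (1024, 256, 64):
    print(records, "records ->", batch_mebibytes(records), "MiB")

size_1024 = batch_mebibytes(1024)  # 588.0 MiB for the setting used above
```

If executors have little memory, a smaller value such as 256 or 64 keeps each batch well below 200 MiB under the same assumptions.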

In [54]:
features_df = df.withColumn('features', featurize_udf('content')) \
                .select('path', 'category', 'content', 'features')

features_df.show(5)

2022-11-18 15:54:06.665196: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

+--------------------+------------+--------------------+--------------------+
|                path|    category|             content|            features|
+--------------------+------------+--------------------+--------------------+
|file:/Users/pierr...|Apple_Golden|[FF D8 FF E0 00 1...|[0.0, 0.29480368,...|
|file:/Users/pierr...|Apple_Golden|[FF D8 FF E0 00 1...|[0.0, 0.4715738, ...|
|file:/Users/pierr...|Apple_Golden|[FF D8 FF E0 00 1...|[0.0, 0.30836552,...|
|file:/Users/pierr...|Apple_Golden|[FF D8 FF E0 00 1...|[0.0, 0.1550447, ...|
|file:/Users/pierr...|Apple_Golden|[FF D8 FF E0 00 1...|[0.0, 0.28061968,...|
+--------------------+------------+--------------------+--------------------+
only showing top 5 rows



## Save results

Results are saved in the Parquet format for performance reasons.

In [None]:
features_df.write.mode("overwrite").parquet("path_s3_bucket")