# Distributed image inference workflow

The following workflow describes how to run distributed model inference with TensorFlow/Keras on the Analytics cluster for deep learning image applications.

#### Step 0: Configure the necessary settings

* Configure TensorFlow: prevent TensorFlow tasks from taking all available resources (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Configure_your_Tensorflow_script)

```python
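import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of reserving all of it.
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
if gpu_devices:  # the list is empty on CPU-only hosts
    tf.config.experimental.set_memory_growth(gpu_devices[0], True)
# Cap the CPU thread pools so the job does not use every core.
tf.config.threading.set_intra_op_parallelism_threads(4)  # or a lower value
tf.config.threading.set_inter_op_parallelism_threads(4)  # or a lower value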
```

* Configure a custom PySpark SparkSession
    1. Ship the local conda environment to the Hadoop workers: create a compressed environment file, then pass it to the workers via Spark's arguments. Note that a ***non-ROCm TensorFlow*** needs to be installed in the conda environment that is shipped to the Hadoop workers, since the inference jobs run on CPU there.
    
    `os.environ['PYSPARK_SUBMIT_ARGS'] = '--archives tf-env-2.4.zip#venv pyspark-shell'`
    
    2. Decrease the batch size of the Apache Arrow reader to avoid OOM errors on smaller instance types:

    `.config('spark.sql.execution.arrow.maxRecordsPerBatch', 1024) `
    
    3. Depending on the data type, import external packages:
        * TFRecords: use LinkedIn's [spark-tfrecord](https://github.com/linkedin/spark-tfrecord)
        
        `.config('spark.jars.packages', 'com.linkedin.sparktfrecord:spark-tfrecord_2.11:0.2.4')`
        * Avro: use [spark-avro](https://spark.apache.org/docs/2.4.0/sql-data-sources-avro.html)
        
        `.config('spark.jars.packages', 'org.apache.spark:spark-avro_2.11:2.4.4')`
    
    For more detail on how to launch a custom PySpark SparkSession, see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter/Tips
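
The settings above can be collected in one place before the session is created. The following is a minimal sketch, not the method used by the linked wiki page: `build_session_options` and `create_session` are hypothetical helper names, and the archive name and package coordinates are simply the ones quoted in this step. One detail it illustrates is that `spark.jars.packages` takes a single comma-separated list, so TFRecord and Avro support need to be combined into one value if both are required.

```python
import os


def build_session_options(archive, packages, max_records_per_batch=1024):
    """Collect the Step 0 settings in one place (hypothetical helper)."""
    # --archives ships the zipped conda environment to the workers.
    submit_args = f"--archives {archive} pyspark-shell"
    conf = {
        "spark.sql.execution.arrow.maxRecordsPerBatch": str(max_records_per_batch),
        # spark.jars.packages is one comma-separated list; a second
        # .config() call with the same key would replace the first.
        "spark.jars.packages": ",".join(packages),
    }
    return submit_args, conf


def create_session(submit_args, conf):
    """Build the SparkSession; pyspark is imported only when this is called."""
    from pyspark.sql import SparkSession

    os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


submit_args, conf = build_session_options(
    archive="tf-env-2.4.zip#venv",
    packages=[
        "com.linkedin.sparktfrecord:spark-tfrecord_2.11:0.2.4",
        "org.apache.spark:spark-avro_2.11:2.4.4",
    ],
)
```

On a host with PySpark installed, `spark = create_session(submit_args, conf)` would then start the session. The environment variable has to be set before the first SparkSession is created in the process.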

#### Step 1: Prepare trained model

The idea is to broadcast the model weights from the driver, then, inside a pandas UDF, rebuild the model graph and load the weights from the broadcast variable.

##### Example 1: Load a saved trained model:

```python
from tensorflow import keras
from tensorflow.keras.models import model_from_json

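# Load the trained model on the driver.
model = keras.models.load_model('/path/to/model/')
model_json = model.to_json()  # architecture only, used to rebuild the graph on workers
# sc is the SparkContext, e.g. spark.sparkContext
bc_model_weights = sc.broadcast(model.get_weights())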
```

* Be sure to install numpy<1.20, since `load_model` might not work with numpy==1.20 (see https://stackoverflow.com/questions/58479556)

##### Example 2: Load ResNet50:
```python
from tensorflow.keras.applications.resnet50 import ResNet50

model = ResNet50()
bc_model_weights = sc.broadcast(model.get_weights())
```

##### Example 3: Load ResNet50 without the top layer for feature extraction:

```python
model = ResNet50(weights='imagenet', pooling='max', include_top=False)
model_json = model.to_json()
bc_model_weights = sc.broadcast(model.get_weights())
```
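
Because the broadcast sends the full weight list to every executor, it can help to know roughly how large that list is. The sketch below is only a sizing aid under stated assumptions: `weights_nbytes` is a hypothetical helper, the shapes are made up for illustration (a single dense layer), and float32 weights are assumed.

```python
from math import prod


def weights_nbytes(shapes, bytes_per_param=4):
    """Total bytes for weight arrays with the given shapes (float32 by default)."""
    return sum(prod(shape) * bytes_per_param for shape in shapes)


# Hypothetical shapes: one dense layer with a 2048x1000 kernel and a bias.
example_shapes = [(2048, 1000), (1000,)]
total = weights_nbytes(example_shapes)
print(f"{total / 1e6:.1f} MB")  # 8.2 MB
```

For a real model, the shapes could be taken from `[w.shape for w in model.get_weights()]` on the driver.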

#### Step 2: Prepare data for inference

If your image data is on a local/stat machine, we recommend first saving the images to a TFRecord file.

* An example script to save image data to TFRecord file:

```python
import pathlib
from io import BytesIO

import tensorflow as tf
from PIL import Image

data_dir = pathlib.Path('/home/aikochou/VisualGap/models/quality_model/Pixels/')
quality_img = list(data_dir.glob('Quality/Quality/*.jpg'))[:50]
random_img = list(data_dir.glob('Random/Random/*.jpg'))[:50]

def _bytes_feature(value):
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def image_example(image_file_name, image_bytes, label):
    feature = {
      'image_file_name': _bytes_feature(image_file_name),
      'image_bytes': _bytes_feature(image_bytes),
      'label': _int64_feature(label),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

file_name = 'output.tfrecords'
with tf.io.TFRecordWriter(file_name) as writer:
    # Label quality images 1 and random images 0.
    for img, label in [(img, 1) for img in quality_img] + [(img, 0) for img in random_img]:
        try:
            # Re-encode every image as an RGB JPEG so all records share one format.
            image = Image.open(img).convert('RGB')
            img_byte_arr = BytesIO()
            image.save(img_byte_arr, format='JPEG')
            tf_example = image_example(img.name.encode('utf-8'), img_byte_arr.getvalue(), label)
            writer.write(tf_example.SerializeToString())
        except Exception:
            # Skip images that cannot be read or encoded.
            continue
```

* Move the TFRecord file to HDFS:

    `hadoop fs -moveFromLocal <file_name> <hdfs_dir>`

#### Step 3: Load the data into Spark DataFrames

##### Example 1: Load TFRecords into Spark DataFrames
    
When loading TFRecords with LinkedIn's spark-tfrecord, you need to specify the schema explicitly (see https://github.com/tensorflow/ecosystem/issues/123)

```python
df = (spark.read.schema('image_file_name string, image_bytes binary, label int').format("tfrecord")
        .option('recordType', 'Example').load('/path/to/file/in/hdfs'))
```
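
Since the schema is written by hand, one way to keep it in step with the feature keys written in Step 2 is to build the DDL string from a list of fields. This is a small sketch only: `ddl_schema` is a hypothetical helper, and the column names and types are the ones used in the example above.

```python
def ddl_schema(fields):
    """Join (name, spark_type) pairs into a DDL schema string."""
    names = [name for name, _ in fields]
    if len(set(names)) != len(names):
        raise ValueError("duplicate column names in schema")
    return ", ".join(f"{name} {spark_type}" for name, spark_type in fields)


# Column names mirror the feature keys used when writing the TFRecord file.
fields = [
    ("image_file_name", "string"),
    ("image_bytes", "binary"),
    ("label", "int"),
]
schema = ddl_schema(fields)
print(schema)
```

The resulting string can be passed to `spark.read.schema(schema)` in place of the literal.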

##### Example 2: Load Avro into Spark DataFrames

```python
pixels = spark.read.format('avro').load('/path/to/file/in/hdfs')
pixels.printSchema()
```

#### Step 4: Run model inference via pandas UDF and save prediction results

##### Preprocess input data
* Convert the bytes to numpy.array

```python
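from io import BytesIO

import numpy as np
import pandas as pd
from PIL import Image
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

image_size = 180

@F.pandas_udf(ArrayType(FloatType()))
def process_image(image_bytes):
    """Decode each image, resize it, and flatten it to a list of floats."""
    ret = []
    for image in image_bytes:
        # Force 3 channels so parse_image can reshape to (size, size, 3).
        im = Image.open(BytesIO(image)).convert('RGB')
        im = im.resize((image_size, image_size))
        ret.append(np.asarray(im, dtype='float32').flatten().tolist())
    return pd.Series(ret)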
```

* Define the function to parse the input data.

```python
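def parse_image(image_data):
    # Scale pixel values from [0, 255] to [-1, 1].
    image = tf.image.convert_image_dtype(image_data, dtype=tf.float32) * (2. / 255) - 1
    # Restore the (height, width, channels) shape flattened by process_image.
    image = tf.reshape(image, [image_size, image_size, 3])
    return image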
```
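
The normalization in `parse_image` is a linear map from the pixel range [0, 255] to [-1, 1]. As a quick plain-Python check of that arithmetic (with `normalize_pixel` as a hypothetical stand-in for the TensorFlow expression), 0 maps to -1, 255 maps to 1, and the midpoint 127.5 maps to approximately 0:

```python
def normalize_pixel(value):
    """Map a pixel value in [0, 255] to [-1, 1], mirroring parse_image."""
    return value * (2.0 / 255) - 1


for v in (0, 127.5, 255):
    print(v, normalize_pixel(v))
```

Whether this range is the right one depends on how the model was trained, so it is worth confirming against the preprocessing used at training time.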

##### Define the function for model inference

To load data in batches, we recommend the `tf.data` API, which supports prefetching and multi-threaded loading to hide I/O-bound latency.

```python
@F.pandas_udf(ArrayType(FloatType()))
def predict_batch_udf(image_batch):
    batch_size = 64
    model = model_from_json(model_json) # load the model graph 
    model.set_weights(bc_model_weights.value) # set the weights from the broadcasted variables
    images = np.vstack(image_batch)
    dataset = tf.data.Dataset.from_tensor_slices(images)
    dataset = dataset.map(parse_image, num_parallel_calls=8).prefetch(5000).batch(batch_size)
    preds = model.predict(dataset)
    return pd.Series(list(preds))
```
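
To get a feel for the numbers involved, the following sketch estimates what one invocation of `predict_batch_udf` handles. It assumes that each invocation receives at most `spark.sql.execution.arrow.maxRecordsPerBatch` rows (1024 in Step 0), that images are 180x180 RGB, and that values are stored as float32. `udf_batch_stats` is a hypothetical helper, and the byte count covers only the stacked array, not any intermediate copies.

```python
import math


def udf_batch_stats(records, image_size, batch_size, channels=3, bytes_per_value=4):
    """Rough per-invocation numbers for a pandas UDF that stacks flattened images."""
    values_per_image = image_size * image_size * channels
    array_bytes = records * values_per_image * bytes_per_value
    predict_steps = math.ceil(records / batch_size)
    return values_per_image, array_bytes, predict_steps


values, nbytes, steps = udf_batch_stats(records=1024, image_size=180, batch_size=64)
print(values, nbytes, steps)  # 97200 398131200 16
```

Under these assumptions the stacked array alone is close to 400 MB per invocation, which is why lowering `maxRecordsPerBatch` is the first thing to try if workers run out of memory.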

For a complete pretrained model from `keras.applications`, build the model graph directly in the pandas UDF:
```python
model = ResNet50(weights=None)
```

For part of a pretrained model, or a model you trained yourself, load the model graph from JSON:
```python
model = model_from_json(model_json)
```

##### Run model prediction and save prediction results back to Hive

```python
(df
 .withColumn('image_arr', process_image(F.col('image_bytes')))
 .withColumn('prediction', predict_batch_udf(F.col('image_arr')))
 .write
 .mode('overwrite')
 .option('path', '/path/to/save/table')
 .saveAsTable('aikochou.testTable')
)
```

If you want to save to parquet only:

```python
(df
 .withColumn('image_arr', process_image(F.col('image_bytes')))
 .withColumn('prediction', predict_batch_udf(F.col('image_arr')))
 .write
 .mode('overwrite')
 .parquet('/path/to/output/file')
)
```

#### Step 5: Make the result Hive table public

To make the table public, you have to change the permissions of its files in HDFS and make them readable by the analytics group `analytics-privatedata-users`.

* To change the group ownership of an HDFS directory, use this command:

    `hadoop fs -chown :analytics-privatedata-users -R /path/to/your/table`
    
* Be sure to make the data group readable with:

    `hadoop fs -chmod g+r /path/to/your/table`
    
* Also make your user directory readable and executable:

    `hadoop fs -chmod o=rx /user/aikochou`
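
If these three commands need to be repeated for several tables, they can be generated from Python. The sketch below only builds and prints the commands exactly as written above and does not run them. `make_public_commands` is a hypothetical helper, and the paths are the placeholders from this step.

```python
import shlex


def make_public_commands(table_path, user_dir, group="analytics-privatedata-users"):
    """Return the three Step 5 commands as argument lists (not executed)."""
    return [
        ["hadoop", "fs", "-chown", f":{group}", "-R", table_path],
        ["hadoop", "fs", "-chmod", "g+r", table_path],
        ["hadoop", "fs", "-chmod", "o=rx", user_dir],
    ]


commands = make_public_commands("/path/to/your/table", "/user/aikochou")
for cmd in commands:
    print(shlex.join(cmd))
```

On a host with the Hadoop client available, each list could be passed to `subprocess.run(cmd, check=True)`.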