# **Pneumonia identification from X-Ray images**
---
**Authors: Sandra Alonso Paz and Lobna Ramadane Morchadi**
 
Pneumonia is an inflammation of the lung tissue caused by an infection that can lead to serious health problems and even death [1]. Although bacteria are the most common causes of pneumonia, it can also be caused by viruses, fungi, and other agents.
 
The principal symptoms of this disease can vary among children, adults and older people. However, the most common ones include shaking chills, fever, chest pain, cough, night sweats, nausea, vomiting, muscle aches, rapid breathing and heartbeat, shortness of breath, confusion, and weight loss.

Pneumonia treatment generally involves determining the need for hospitalization, antibiotics, supportive care, and follow-up care. Although most adults do not need to be hospitalized for pneumonia, they must follow a home-care treatment that involves drinking fluids, monitoring body temperature, allowing the cough reflex to clear the lungs (no cough suppression), pain relief (if needed), finishing the entire course of antibiotics (if applicable), and not smoking. When hospitalization is needed, the standard treatment is intravenous antibiotics.

Fortunately, the earlier the disease is diagnosed, the sooner a treatment can be chosen and, consequently, the less aggressive the pneumonia will be. The principal diagnostic procedures are reviewing the patient's history, performing a physical examination and laboratory tests, and imaging. Our work fits into the last of these.

Nowadays, one of the applications of ML models is clinical-decision support for medical imaging diagnosis. Even though it is highly challenging to obtain algorithms that are both reliable and interpretable, the aim of this project is to build a deep-learning-based diagnostic tool using PySpark and Analytics Zoo.

We will apply this approach to a dataset of chest X-ray images for the diagnosis of pediatric pneumonia. Such a tool may aid diagnosis by facilitating early treatment, which can improve clinical outcomes, and by supporting human experts in interpreting the images.

As mentioned, a key element for diagnosis is radiographic data: chest X-rays. These are routinely obtained and can help to differentiate between viral and bacterial pneumonia. However, a fast radiologic interpretation is not always possible, especially in low-resource settings. For that purpose, **we have developed an effective supervised machine learning framework to classify pediatric chest X-rays to detect pneumonia and facilitate rapid diagnosis and treatment.**

To do so, we have loosely followed the steps proposed in the Analytics Zoo documentation [2].
 
 



# Step 0: Prepare Environment

Requirements for the execution:

* Install jdk8
* Install latest pre-release version of Analytics Zoo with RayOnSpark

In [1]:
# Install jdk8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
# Set environment variable JAVA_HOME.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version

update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)


In [2]:
# Install latest pre-release version of Analytics Zoo with RayOnSpark
# Installing Analytics Zoo from pip will automatically install pyspark, bigdl, and their dependencies.
!pip install --pre --upgrade analytics-zoo[ray]

Collecting analytics-zoo[ray]
  Downloading analytics_zoo-0.12.0b2022022201-py2.py3-none-manylinux1_x86_64.whl (194.7 MB)
Collecting bigdl==0.13.1.dev0
  Downloading BigDL-0.13.1.dev0-py2.py3-none-manylinux1_x86_64.whl (114.0 MB)
Collecting pyspark==2.4.6
  Downloading pyspark-2.4.6.tar.gz (218.4 MB)
Collecting conda-pack==0.3.1
  Downloading conda_pack-0.3.1-py2.py3-none-any.whl (27 kB)
Collecting aioredis==1.1.0
  Downloading aioredis-1.1.0-py3-none-any.whl (65 kB)
Collecting aiohttp==3.7.0
  Downloading aiohttp-3.7.0-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
Collecting setproctitle
  Downloading setproctitle-1.2.2-cp37-cp37m-manylinux1_x86_64.whl (36 kB)
Coll

# Step 1: Init Orca Context
 
Most Artificial Intelligence projects start with a Python notebook running on a single laptop; however, scaling it to handle larger datasets in a distributed fashion usually takes considerable effort. The Orca library seamlessly scales out a single-node TensorFlow or PyTorch notebook across large clusters (so as to process distributed Big Data).

Ray is an open source distributed framework for emerging AI applications. RayOnSpark allows users to directly run Ray programs on existing Big Data clusters, and to write Ray code inline with their Spark code (so as to process the in-memory Spark RDDs or DataFrames).
 
## First, we should import necessary libraries and modules
 
 



In [3]:
# Orca libraries:
from zoo.orca import init_orca_context, stop_orca_context
from zoo.orca import OrcaContext

# Model creation libraries:
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.keras.optimizers import Adam

# Image pre-processing libraries:
import os
from PIL import Image
import torchvision
import torchvision.transforms as T

# Dataset creation libraries:
from tensorflow.keras.preprocessing import image_dataset_from_directory
from tensorflow.keras.applications.resnet50 import preprocess_input

# Result evaluation libraries:
import matplotlib.pyplot as plt

## Then, let's initialize Orca context

In [4]:
# It is recommended to set it to True when running Analytics Zoo in Jupyter notebook 
OrcaContext.log_output = True # (this will display terminal's stdout and stderr in the Jupyter notebook).

init_orca_context(cluster_mode="local", cores=6)

Initializing orca context
Current pyspark location is : /usr/local/lib/python3.7/dist-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is:  --driver-class-path /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.13.1-SNAPSHOT-spark_2.4.6-0.12.0-SNAPSHOT-jar-with-dependencies.jar:/usr/local/lib/python3.7/dist-packages/bigdl/share/lib/bigdl-0.13.1-SNAPSHOT-jar-with-dependencies.jar pyspark-shell 
2022-02-22 17:10:24 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


cls.getname: com.intel.analytics.bigdl.python.api.Sample
BigDLBasePickler registering: bigdl.util.common  Sample
cls.getname: com.intel.analytics.bigdl.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.util.common  EvaluatedResult
cls.getname: com.intel.analytics.bigdl.python.api.JTensor
BigDLBasePickler registering: bigdl.util.common  JTensor
cls.getname: com.intel.analytics.bigdl.python.api.JActivity
BigDLBasePickler registering: bigdl.util.common  JActivity
Successfully got a SparkContext



User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=0
   KMP_SETTINGS=1
   OMP_NUM_THREADS=1

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=128
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=0
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DEVICE_THREAD_LIMIT=2147483647
   KMP_DISP_HAND_THREAD=false
   KMP_DISP_NUM_BUFFERS=7
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=3
   KMP_GTID_MODE=3
   KMP_HANDLE_SIGNALS=false
   KMP_HOT_TEAMS_MAX_LEVEL=1
   KMP_HOT_TEAMS_MODE=0
   KMP_INIT_AT_FORK=true
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_MWAIT_HINTS=0
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_

## Now it is time to perform data-parallel processing in Orca.

In [5]:
# This supports standard Spark Dataframes, TensorFlow Dataset, PyTorch DataLoader, Pandas, Pillow, etc.
spark = OrcaContext.get_spark_session()

# Step 2: Define the Model


 
After consulting several resources, we concluded that transfer learning was the best choice for solving the proposed problem.

## 2.1. What is Transfer Learning?
**Transfer learning** is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.
It is a popular approach in deep learning, where pre-trained models are used as the starting point for computer vision tasks, given the vast compute and time resources required to develop neural network models for these problems and the large gains in performance they provide on related problems. [3]
 
## 2.2. What is ResNet?
 
ResNet stands for Residual Network. It is an innovative neural network that was first introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 computer vision research paper titled ‘Deep Residual Learning for Image Recognition’ [4]. 
 
## 2.3. What is ResNet-50?
 
ResNet has many variants that share the same concept but have different numbers of layers. ResNet-50 denotes the variant that is 50 layers deep. [5]
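
As a quick check of where the number 50 comes from, the sketch below counts the weighted layers of the standard ResNet-50 architecture described in [4]: one initial convolution, four stages of bottleneck blocks with three convolutions each, and one final fully connected layer. The variable names are ours and purely illustrative.

```python
# Number of bottleneck blocks in each of the four stages of ResNet-50
blocks_per_stage = [3, 4, 6, 3]
convs_per_block = 3   # 1x1, 3x3 and 1x1 convolutions in every bottleneck block

initial_conv = 1      # the 7x7 convolution at the input
final_fc = 1          # the fully connected classification layer

total_layers = initial_conv + sum(blocks_per_stage) * convs_per_block + final_fc
print(total_layers)   # prints 50
```

Note that the model defined later in this notebook loads the network with `include_top=False`, so that final fully connected layer is replaced by our own Dense layer.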
 
## 2.4. Deep Neural Networks in Computer Vision
 
When working with deep convolutional neural networks to solve a problem related to computer vision, machine learning experts engage in stacking more layers. These additional layers help solve complex problems more efficiently as the different layers could be trained for varying tasks to get highly accurate results.
 
While the number of stacked layers can enrich the features of the model, a deeper network can show the issue of degradation. In other words, **as the number of layers of the neural network increases, the accuracy levels may get saturated and slowly degrade after a point**. As a result, the performance of the model deteriorates both on the training and testing data.
 
This degradation is not a result of overfitting. Instead, it may result from the initialization of the network, optimization function, or, more importantly, the problem of vanishing or exploding gradients.
 
**ResNet was created with the aim of tackling this exact problem**. Deep residual nets make use of residual blocks to improve the accuracy of the models. The concept of “skip connections,” which lies at the core of the residual blocks, is the strength of this type of neural network.
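
The idea of a skip connection can be sketched in a few lines of plain Python. This is only a toy illustration, not the Keras implementation: `make_scaling_layer` is a hypothetical stand-in for the stacked convolutions F(x), and the block simply returns F(x) + x.

```python
def residual_block(x, layer_fn):
    """Return layer_fn(x) + x: the learned residual plus the skip connection."""
    fx = layer_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def make_scaling_layer(weight):
    # Hypothetical stand-in for the stacked convolutions F(x): scales each element
    return lambda x: [weight * xi for xi in x]

x = [1.0, -2.0, 0.5]

# If the layers learn nothing (all-zero weights), the block is the identity
identity_out = residual_block(x, make_scaling_layer(0.0))
# With non-zero weights, the block outputs the input plus a learned correction
scaled_out = residual_block(x, make_scaling_layer(0.5))
print(identity_out, scaled_out)
```

Because the input is always added back, an extra block can fall back to the identity mapping, which is one intuition for why adding residual blocks should not make the network worse.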
 
## 2.5. ResNet50 with Keras

Keras is a deep learning API that is popular for the simplicity with which models can be built. Keras comes with several pre-trained models, including ResNet50, that anyone can use for their experiments.
 
Therefore, building a residual network in Keras for computer vision tasks like image classification is relatively simple.
 



In [6]:
# References used while creating the model:
#   Create a model based on Zoo Keras:
#        https://github.com/intel-analytics/analytics-zoo/blob/master/docs/docs/ProgrammingGuide/workingwithimages.md
#
#   Net type and parameters:
#        https://keras.io/api/applications/resnet/#resnet50-function
#        https://www.delltechnologies.com/asset/en-us/solutions/industry-solutions/industry-market/h17686_hornet_wp.pdf
#
#   Transfer learning:
#        https://medium.com/@kenneth.ca95/a-guide-to-transfer-learning-with-keras-using-resnet50-a81a4a28084b
#        https://chroniclesofai.com/transfer-learning-with-keras-resnet-50/

def model_creator(config):
    # Pre-trained ResNet50 without its classification head (include_top=False)
    res_model = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", input_tensor=None,
                                               input_shape=(224, 224, 3), pooling=None)
    # Create a model based on the pre-trained model
    model = K.models.Sequential()
    model.add(res_model)
    # Flatten reshapes the feature maps of each image into a 1-D array of elements
    model.add(K.layers.Flatten())
    # Dense implements the operation: output = activation(dot(input, kernel) + bias), where the activation is set to sigmoid
    model.add(K.layers.Dense(1, activation='sigmoid'))
    # Compiling the model means configuring it for training. We selected binary_crossentropy as the loss and the Adam optimizer (see below).
    # To study our results we track accuracy, AUC, precision and recall, plus the confusion-matrix counts.
    # Note: the lowercase 'accuracy' string makes Keras select binary accuracy here;
    # the capitalized 'Accuracy' would instead resolve to the exact-match Accuracy metric.
    model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  metrics=['accuracy',
                           tf.keras.metrics.AUC(),
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall(),
                           tf.keras.metrics.TruePositives(),
                           tf.keras.metrics.TrueNegatives(),
                           tf.keras.metrics.FalsePositives(),
                           tf.keras.metrics.FalseNegatives()])
    # Model has been created
    return model
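
To make the relation between the confusion-matrix counts and the derived metrics tracked by the model explicit, here is a minimal sketch in plain Python. The function name and the counts in the usage example are hypothetical; they are not results of this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical counts, for illustration only
example = classification_metrics(tp=8, tn=6, fp=2, fn=4)
print(example)
```

In this problem PNEUMONIA is the positive class (label 1), so recall measures the share of pneumonia cases that the model detects, while precision measures how many of the predicted pneumonia cases are correct.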

## What is the Adam optimization algorithm?
Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data.

When introducing the algorithm, the authors list the attractive benefits of using Adam on non-convex optimization problems, as follows:

* Straightforward to implement.
* Computationally efficient.
* Low memory requirements.
* Invariant to diagonal rescaling of the gradients.
* Well suited for problems that are large in terms of data and/or parameters.

For these reasons, we used the Adam optimizer instead of the RMSprop optimizer proposed in the guide we followed [6].
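
As an illustration of how Adam updates a weight, the following is a minimal sketch of the update rule for a single scalar parameter. It is not the Keras implementation (which, among other details, may place epsilon slightly differently); the function name and the toy problem are our own, and the default values of `beta1`, `beta2` and `eps` are assumed to match the Keras defaults.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.0001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one Adam update to a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = w**2, whose gradient is 2*w
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
print(w)
```

The division by the square root of the second-moment estimate is what makes the step size largely independent of the scale of the gradients.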

# Step 3: Define Train Dataset



Our dataset is made up of three image folders (train, test and validation). Each one contains two subfolders, "NORMAL" and "PNEUMONIA", holding the X-ray images of the corresponding class. Together, the train and validation folders contain 5232 images, 1349 of which are labeled "NORMAL"; the rest are labeled "PNEUMONIA".


This dataset has been uploaded to our personal Drive to make data handling easier, but it can also be downloaded from [Kaggle](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia).

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
# To keep the code user-friendly, we store the different paths we will use
# during the development process.
train_images= '/content/drive/MyDrive/UPM/chest_xray/train'
val_images = '/content/drive/MyDrive/UPM/chest_xray/val/'
test_images = '/content/drive/MyDrive/UPM/chest_xray/test/'

## 3.1. Data visualization

### 3.1.1 Train data

In [9]:
normal_df_train = spark.read.format("image").load(train_images+'/NORMAL')
pneumonia_df_train = spark.read.format("image").load(train_images+'/PNEUMONIA')

In [10]:
print("Schema of normal dataframe of training set")
normal_df_train.printSchema()
print("Image labeled sample ")
normal_df_train.show(5)
print("Number of images in the normal training dataframe: ")
normal_df_train.select("image").count()

Schema of normal dataframe of training set
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

Image labeled sample 




+--------------------+
|               image|
+--------------------+
|[file:///content/...|
|[file:///content/...|
|[file:///content/...|
|[file:///content/...|
|[file:///content/...|
+--------------------+
only showing top 5 rows

Number of images in the normal training dataframe: 


1341

In [11]:
print("Schema of pneumonia dataframe of training set")
pneumonia_df_train.printSchema()
print("Image labeled sample ")
pneumonia_df_train.show(5)
print("Number of images in the pneumonia training dataframe: ")
pneumonia_df_train.select("image").count()

Schema of pneumonia dataframe of training set
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

Image labeled sample 




+--------------------+
|               image|
+--------------------+
|[file:///content/...|
|[file:///content/...|
|[file:///content/...|
|[file:///content/...|
|[file:///content/...|
+--------------------+
only showing top 5 rows

Number of images in the pneumonia training dataframe: 




3875

In [12]:
train_df = normal_df_train.unionAll(pneumonia_df_train)

# Count all the stored images (each count triggers a Spark job, so we compute it only once)
n_train = train_df.count()
n_normal_train = normal_df_train.count()
n_pneumonia_train = pneumonia_df_train.count()
print('Number of training images: ', n_train)
print('Number of training NORMAL images: ', n_normal_train, " or", f"{n_normal_train / n_train:.2%}")
print('Number of training PNEUMONIA images: ', n_pneumonia_train, " or", f"{n_pneumonia_train / n_train:.2%}")



Number of training images:  5216
Number of training NORMAL images:  1341  or 25.71%
Number of training PNEUMONIA images:  3875  or 74.29%


In [13]:
# First, we need to inspect the image format
train_df.select("image.origin", "image.width", "image.height").show(truncate=False)
# As width and height are not standardized, we will start with pre-processing

+----------------------------------------------------------------------------------------+-----+------+
|origin                                                                                  |width|height|
+----------------------------------------------------------------------------------------+-----+------+
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-1257-0001.jpeg     |2916 |2583  |
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-1018-0001.jpeg     |2694 |2625  |
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-1050-0001.jpeg     |2564 |2519  |
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-0602-0001.jpeg     |2619 |2628  |
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-1302-0001.jpeg     |2721 |2438  |
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-0774-0001.jpeg     |2510 |2543  |
|file:///content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMA

                                                                                

### 3.1.2. Test data

In [14]:
# For later steps, we also load the rest of the provided data
normal_df_test = spark.read.format("image").load(test_images+'/NORMAL')
pneumonia_df_test = spark.read.format("image").load(test_images+'/PNEUMONIA')
test_df = normal_df_test.unionAll(pneumonia_df_test)

# Count all the stored images (each count triggers a Spark job, so we compute it only once)
n_test = test_df.count()
n_normal_test = normal_df_test.count()
n_pneumonia_test = pneumonia_df_test.count()
print('Number of testing images: ', n_test)
print('Number of testing NORMAL images: ', n_normal_test, " or", f"{n_normal_test / n_test:.2%}")
print('Number of testing PNEUMONIA images: ', n_pneumonia_test, " or", f"{n_pneumonia_test / n_test:.2%}")

Number of testing images:  624
Number of testing NORMAL images:  234  or 37.50%
Number of testing PNEUMONIA images:  390  or 62.50%


### 3.1.3. Validation data

In [15]:
normal_df_val = spark.read.format("image").load(val_images+'/NORMAL')
pneumonia_df_val = spark.read.format("image").load(val_images+'/PNEUMONIA')
val_df = normal_df_val.unionAll(pneumonia_df_val)

# Count all the stored images (each count triggers a Spark job, so we compute it only once)
n_val = val_df.count()
n_normal_val = normal_df_val.count()
n_pneumonia_val = pneumonia_df_val.count()
print('Number of validation images: ', n_val)
print('Number of validation NORMAL images: ', n_normal_val, " or", f"{n_normal_val / n_val:.2%}")
print('Number of validation PNEUMONIA images: ', n_pneumonia_val, " or", f"{n_pneumonia_val / n_val:.2%}")

Number of validation images:  16
Number of validation NORMAL images:  8  or 50.00%
Number of validation PNEUMONIA images:  8  or 50.00%


## 3.2. Preprocessing

Image pre-processing consists of the steps that format images before they are used for model training. It usually includes resizing, orienting and color correction.

Because our convolutional neural network ends in a fully connected layer, all input images must be arrays of the same size. Most architectures require a square input image, but our images rarely have that format. We have therefore used the resize() function to resize all images to 1024 x 1024. Moreover, we have used the CenterCrop() function to focus on the center of the chest X-ray image, where the anomalies we want to find are usually located.

On the other hand, color changes are a type of image transformation that is generally applied to the whole dataset (train and test). Processing is faster when images are grayscale: color images are stored as red, green and blue values, while grayscale images are stored only as a range from white to black, so we only need to work with one matrix per image instead of three [7]. We therefore tried to use grayscale. However, ResNet50 with ImageNet weights only accepts 3 input channels, so we finally used the RGB scale.

Finally, we have included the RandomHorizontalFlip function, which flips the image with a probability of 0.5, and the ColorJitter function, which adjusts brightness and contrast.
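
To see what the combination of resizing to 1024 x 1024 and center-cropping to 224 x 224 (the crop size used in the code below) keeps of each image, here is a small worked example. `center_crop_box` is a hypothetical helper that only sketches the geometry; torchvision's own implementation may round differently for odd sizes.

```python
def center_crop_box(width, height, size):
    """Return the (left, top, right, bottom) box of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)

box = center_crop_box(1024, 1024, 224)
kept_fraction = (224 * 224) / (1024 * 1024)
print(box, f"{kept_fraction:.1%}")
```

With these sizes the crop spans pixels 400 to 624 in each direction, that is, slightly under 5% of the resized image area, which is worth keeping in mind when choosing the resize and crop sizes.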
 
 



In [16]:
# First, we define a function (called later) that bundles all the pre-processing transforms we want to apply.
# In our case:
# 1. CenterCrop() to focus on the center of the image
# 2. RandomHorizontalFlip() to randomly flip the image (p=0.5 means 50% of the time)
# 3. ColorJitter() to randomly vary the brightness and contrast of the image
def define_transformer(sample):
  transformer = torchvision.transforms.Compose([T.CenterCrop(224), T.RandomHorizontalFlip(p=0.5), T.ColorJitter(brightness=0.5, contrast=0.1,hue=0)])
  return transformer(sample)

In [17]:
# Here we resize each image and call define_transformer on it, for both class folders of the given path

def processing(path):
    path_types = ['/NORMAL/', '/PNEUMONIA/']
    for t in path_types:
        # Make sure the output folder exists before saving into it
        os.makedirs(path + '/processed/' + t, exist_ok=True)
        for img in os.listdir(path + t):  # For all the images contained in the folder
            image = Image.open(path + t + img)  # We use PIL functions to manage the images
            new_image = image.resize((1024, 1024))  # Resize to a proper and common size
            # Save the result in a new folder to use it in training and testing
            define_transformer(new_image).save(path + '/processed/' + t + img)

In [18]:
# Perform the pre-processing process for all the paths
paths = [train_images, val_images, test_images]
for p in paths:
    processing(p)

In [19]:
# New paths for processed images
train_images= '/content/drive/MyDrive/UPM/chest_xray/train/processed'
val_images = '/content/drive/MyDrive/UPM/chest_xray/val/processed'
test_images = '/content/drive/MyDrive/UPM/chest_xray/test/processed'

Here is an example of an image before and after pre-processing:

In [20]:
print('Image NORMAL2-IM-1308-0001.jpeg before pre-processing')
image = Image.open('/content/drive/MyDrive/UPM/chest_xray/train/NORMAL/NORMAL2-IM-1308-0001.jpeg')
image.show()

Image NORMAL2-IM-1308-0001.jpeg before pre-processing


In [21]:
print('Image NORMAL2-IM-1308-0001.jpeg after pre-processing')
image = Image.open(train_images+'/NORMAL/NORMAL2-IM-1308-0001.jpeg')
image.show()

Image NORMAL2-IM-1308-0001.jpeg after pre-processing


## 3.3. Create Dataset for future training and testing

To use the processed images in our model, we needed to create a Dataset for each processed image folder. For this task we used the image_dataset_from_directory() function, which generates a tf.data.Dataset from image files in a directory. In addition, it lets us specify some characteristics of the images, such as their class names, size and label mode.
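
The way labels can be inferred from the directory structure is illustrated below in plain Python. This is only a sketch of the idea, not Keras's implementation: `infer_labels` and the file names are hypothetical, and the label of each image is simply the index of its class folder in `class_names`.

```python
import os
import tempfile

def infer_labels(directory, class_names):
    """Return (file_path, label) pairs, where label is the index of the class folder."""
    samples = []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(directory, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, file_name), label))
    return samples

# Build a tiny throwaway directory tree with hypothetical file names
with tempfile.TemporaryDirectory() as root:
    tree = {"NORMAL": ["a.jpeg"], "PNEUMONIA": ["b.jpeg", "c.jpeg"]}
    for class_name, file_names in tree.items():
        os.makedirs(os.path.join(root, class_name))
        for file_name in file_names:
            open(os.path.join(root, class_name, file_name), "w").close()
    samples = infer_labels(root, ["NORMAL", "PNEUMONIA"])
    labels = [label for _, label in samples]

print(labels)
```

With `class_names = ['NORMAL', 'PNEUMONIA']`, as used in the cells below, NORMAL images get label 0 and PNEUMONIA images get label 1.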

In [22]:
# Functions adapted from
#       https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html and
#       https://keras.io/api/preprocessing/image/

#-------------------------------------- TRAIN DATASET ------------------------------------------------------------------------------------------

def train_data_creator(config, batch_size):

    train_ds = image_dataset_from_directory(train_images,
                                            labels='inferred',                      # labels are generated from the directory structure
                                            label_mode='binary',                    # there can be only 2 labels
                                            # !! important !! other versions allowed choosing float32 as label mode, which is no longer possible,
                                            # so the type has to be converted later
                                            class_names=['NORMAL', 'PNEUMONIA'],
                                            image_size=(224, 224),                  # images were already cropped to 224x224, so this does not resize them again
                                            batch_size=1,                           # one image per batch (the batch_size argument is not used here)
                                            shuffle=False)

    # Repeat the train dataset twice
    train_ds = train_ds.repeat(2)

    # map applies a function to every element of the dataset. In our case we apply two:
    # 1. The pre-processing specific to ResNet50
    train_ds = train_ds.map(lambda x, y: (K.applications.resnet50.preprocess_input(x), y))
    # 2. A cast of the images to float32 (see the label_mode note above) to avoid type conversion problems
    train_ds = train_ds.map(lambda x, y: (tf.cast(x, dtype=tf.float32), y))
    return train_ds

#-------------------------------------- VALIDATION DATASET ---------------------------------------------------------------------------------------

def val_data_creator(config, batch_size):
    val_ds = image_dataset_from_directory(val_images, 
                                          labels = 'inferred',
                                          label_mode='binary',
                                          class_names = ['NORMAL', 'PNEUMONIA'],
                                          image_size=(224, 224), 
                                          batch_size = 1,
                                          shuffle=False)                                        
                                            
    val_ds = val_ds.repeat(2)
    val_ds = val_ds.map(lambda x, y: (K.applications.resnet50.preprocess_input(x), y))
    val_ds = val_ds.map(lambda x, y: (tf.cast(x, dtype = tf.float32), y))
    return val_ds

#-------------------------------------- TEST DATASET ---------------------------------------------------------------------------------------------


def test_data_creator(config, batch_size):
    test_ds = image_dataset_from_directory(test_images, 
                                            label_mode='binary',
                                            class_names = ['NORMAL', 'PNEUMONIA'],
                                            image_size=(224, 224), 
                                            batch_size = batch_size,
                                            shuffle=True)                                          
                                           
   
    test_ds = test_ds.repeat(2)
    test_ds = test_ds.map(lambda x, y: (K.applications.resnet50.preprocess_input(x), y))
    test_ds = test_ds.map(lambda x, y: (tf.cast(x, dtype = tf.float32), y))
    return test_ds



# Step 4: Fit with Orca Estimator

## 4.1. First we created an Estimator

In [23]:
from zoo.orca.learn.tf2.estimator import Estimator
from tensorflow import keras

est = Estimator.from_keras(model_creator = model_creator)

2022-02-22 18:09:55,753	INFO services.py:1174 -- View the Ray dashboard at http://172.28.0.2:8265


{'node_ip_address': '172.28.0.2', 'raylet_ip_address': '172.28.0.2', 'redis_address': '172.28.0.2:6379', 'object_store_address': '/tmp/ray/session_2022-02-22_18-09-54_643664_72/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-02-22_18-09-54_643664_72/sockets/raylet', 'webui_url': '172.28.0.2:8265', 'session_dir': '/tmp/ray/session_2022-02-22_18-09-54_643664_72', 'metrics_export_port': 63615, 'node_id': 'c64c913983015a6d4d789faf23666445c6de02099e2e5d9220cabb99'}


[2m[36m(pid=5233)[0m Instructions for updating:
[2m[36m(pid=5233)[0m use distribute.MultiWorkerMirroredStrategy instead
[2m[36m(pid=5233)[0m 2022-02-22 18:10:08.865923: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


[2m[36m(pid=5233)[0m Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
[2m[36m(pid=5233)[0m    16384/94765736 [..............................] - ETA: 0s
 4628480/94765736 [>.............................] - ETA: 0s
10338304/94765736 [==>...........................] - ETA: 0s
15687680/94765736 [===>..........................] - ETA: 0s
20971520/94765736 [=====>........................] - ETA: 0s


## 4.2. Then, we train the model using the Estimator

In [24]:
batch_size = 320
stats = est.fit(train_data_creator,
                epochs=1,
                batch_size=batch_size,
                validation_data= val_data_creator)

[2m[36m(pid=5233)[0m Found 5216 files belonging to 2 classes.


[2m[36m(pid=5233)[0m Cause: could not parse the source code of <function train_data_creator.<locals>.<lambda> at 0x7f7205bd3ef0>: no matching AST found among candidates:
[2m[36m(pid=5233)[0m 
[2m[36m(pid=5233)[0m Cause: could not parse the source code of <function train_data_creator.<locals>.<lambda> at 0x7f7205bd3dd0>: no matching AST found among candidates:
[2m[36m(pid=5233)[0m 
[2m[36m(pid=5233)[0m Cause: could not parse the source code of <function val_data_creator.<locals>.<lambda> at 0x7f7203b7ff80>: no matching AST found among candidates:
[2m[36m(pid=5233)[0m 
[2m[36m(pid=5233)[0m Cause: could not parse the source code of <function val_data_creator.<locals>.<lambda> at 0x7f7203b7ad40>: no matching AST found among candidates:
[2m[36m(pid=5233)[0m 


[2m[36m(pid=5233)[0m Found 16 files belonging to 2 classes.


[2m[36m(pid=5233)[0m 2022-02-22 18:10:15.563777: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
[2m[36m(pid=5233)[0m op: "TensorSliceDataset"
[2m[36m(pid=5233)[0m input: "Placeholder/_0"
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "Toutput_types"
[2m[36m(pid=5233)[0m   value {
[2m[36m(pid=5233)[0m     list {
[2m[36m(pid=5233)[0m       type: DT_STRING
[2m[36m(pid=5233)[0m     }
[2m[36m(pid=5233)[0m   }
[2m[36m(pid=5233)[0m }
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "_cardinality"
[2m[36m(pid=5233)[0m   value {
[2m[36m(pid=5233)[0m     i: 5216
[2m[36m(pid=5233)[0m   }
[2m[36m(pid=5233)[0m }
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "is_files"
[2m[36m(pid=5233)[0m   value {
[2m[36

The last 5000 lines of the output stream have been truncated.


[2m[36m(pid=5233)[0m 2022-02-22 21:30:33.306642: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
[2m[36m(pid=5233)[0m op: "TensorSliceDataset"
[2m[36m(pid=5233)[0m input: "Placeholder/_0"
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "Toutput_types"
[2m[36m(pid=5233)[0m   value {
[2m[36m(pid=5233)[0m     list {
[2m[36m(pid=5233)[0m       type: DT_STRING
[2m[36m(pid=5233)[0m     }
[2m[36m(pid=5233)[0m   }
[2m[36m(pid=5233)[0m }
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "_cardinality"
[2m[36m(pid=5233)[0m   value {
[2m[36m(pid=5233)[0m     i: 16
[2m[36m(pid=5233)[0m   }
[2m[36m(pid=5233)[0m }
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "is_files"
[2m[36m(pid=5233)[0m   value {
[2m[36m(



# Step 5: Evaluate and save the model

In [25]:
stats = est.evaluate(test_data_creator, num_steps=40)
print(stats)


[2m[36m(pid=5233)[0m Found 624 files belonging to 2 classes.


[2m[36m(pid=5233)[0m Cause: could not parse the source code of <function test_data_creator.<locals>.<lambda> at 0x7f71f6f0add0>: no matching AST found among candidates:
[2m[36m(pid=5233)[0m 
[2m[36m(pid=5233)[0m Cause: could not parse the source code of <function test_data_creator.<locals>.<lambda> at 0x7f71f6f0acb0>: no matching AST found among candidates:
[2m[36m(pid=5233)[0m 
[2m[36m(pid=5233)[0m 2022-02-22 21:30:47.843452: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
[2m[36m(pid=5233)[0m op: "TensorSliceDataset"
[2m[36m(pid=5233)[0m input: "Placeholder/_0"
[2m[36m(pid=5233)[0m attr {
[2m[36m(pid=5233)[0m   key: "Toutput_types"
[2m[36m(pid=5233)[0m   value {
[2m[36m(pid=5233)[0m     list {
[2m[36m(pid=5233)[0m       type: DT_STRING
[2m[

 1/40 [..............................] - ETA: 5:31 - loss: 79.1359 - Accuracy: 0.6250 - auc: 0.5979 - precision: 0.6250 - recall: 1.0000 - true_positives: 20.0000 - true_negatives: 0.0000e+00 - false_positives: 12.0000 - false_negatives: 0.0000e+00
 2/40 [>.............................] - ETA: 3:36 - loss: 91.5013 - Accuracy: 0.6094 - auc: 0.4923 - precision: 0.6094 - recall: 1.0000 - true_positives: 39.0000 - true_negatives: 0.0000e+00 - false_positives: 25.0000 - false_negatives: 0.0000e+00
 3/40 [=>............................] - ETA: 3:32 - loss: 104.1209 - Accuracy: 0.6354 - auc: 0.4787 - precision: 0.6354 - recall: 1.0000 - true_positives: 61.0000 - true_negatives: 0.0000e+00 - false_positives: 35.0000 - false_negatives: 0.0000e+00
 4/40 [==>...........................] - ETA: 3:28 - loss: 101.9827 - Accuracy: 0.6016 - auc: 0.5027 - precision: 0.5984 - recall: 1.0000 - true_positives: 76.0000 - true_negatives: 1.0000 - false_positives: 51.0000 - false_negatives: 0.0000e+00    
 5
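
The accuracy, precision and recall printed in the progress lines above follow directly from the cumulative confusion-matrix counts. As a minimal sketch (the helper name `metrics_from_counts` is ours, not part of the notebook or of Orca), the values logged at step 4/40 can be recomputed from its true/false positive and negative counts:

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Cumulative counts reported at step 4/40 of the evaluation log
acc, prec, rec = metrics_from_counts(tp=76, tn=1, fp=51, fn=0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
```

With a recall of 1.0 and only one true negative, nearly every image seen so far has been labelled as pneumonia, so accuracy and precision end up close to the share of pneumonia images in those batches.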

In [26]:
est.save("chest_x_ray.model")

'chest_x_ray.model'

In [27]:
est.shutdown()

In [28]:
# Note: You should call stop_orca_context() when your program finishes.
stop_orca_context()

Stopping orca context


# Conclusion:
Our model, based on transfer learning from a ResNet50, reaches an accuracy of **99.8%** and a precision of 99.9% on the training set. On the test set, however, both accuracy and precision drop to 62.5%. This gap suggests that the model is heavily overfitted, which might be due to two main reasons:
 
1. The data split is unbalanced. The training set consists of 5216 images, which represents 89% of the total data, while the test set consists of 624 images (11%) and the validation set of only 16 images, about 0.3% of the total. By contrast, a partition commonly suggested in the machine learning literature is 70% for training, 15% for testing and 15% for validation [8].
2. The classes are also unbalanced. The training set contains 1341 NORMAL images and 3875 PNEUMONIA images, almost three times as many. Consequently, it is reasonable to expect the model to learn to recognize pneumonia images better than normal ones. The first evaluation steps logged in Step 5 are consistent with this: recall stays at 1.0 while the true-negative count remains close to zero, meaning the model labels almost every test image as pneumonia.
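
The figures in the two points above can be checked with a few lines of plain Python. The sketch below uses only the image counts quoted in the text. The last part computes inverse-frequency class weights, which is one common heuristic for class imbalance and an assumption on our side: the notebook does not apply it.

```python
# Image counts quoted in the text above
split_sizes = {"train": 5216, "test": 624, "val": 16}
train_classes = {"NORMAL": 1341, "PNEUMONIA": 3875}

# Share of each split in the whole dataset
total = sum(split_sizes.values())
split_share = {name: n / total for name, n in split_sizes.items()}
for name, share in split_share.items():
    print(f"{name:5s} {split_sizes[name]:5d} images  {share:6.2%}")

# How many PNEUMONIA images there are per NORMAL image in the training set
imbalance_ratio = train_classes["PNEUMONIA"] / train_classes["NORMAL"]
print(f"PNEUMONIA : NORMAL = {imbalance_ratio:.2f} : 1")

# Assumption (not in the notebook): inverse-frequency class weights,
# weight = n_samples / (n_classes * n_samples_in_class)
n_train = sum(train_classes.values())
class_weight = {
    label: n_train / (len(train_classes) * n)
    for label, n in train_classes.items()
}
print(class_weight)
```

Weights of this kind give the minority NORMAL class a larger contribution to the loss. Whether that would help here would have to be verified experimentally.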
 
To sum up, to reduce this overfitting while keeping the presented model, we would need more data or more diverse data.
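
Independently of collecting more data, the existing images could be re-partitioned closer to the 70/15/15 split mentioned in point 1. The following is a minimal sketch, not something the notebook does: `stratified_split` is a hypothetical helper, the file names are made up, and only the training-set class counts from the text are used as stand-in data. It shuffles each class separately so that every split keeps roughly the same NORMAL/PNEUMONIA ratio.

```python
import random
from collections import defaultdict


def stratified_split(items, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split (path, label) pairs into train/val/test, class by class."""
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, paths in by_label.items():
        rng.shuffle(paths)
        n_train = int(len(paths) * fractions[0])
        n_val = int(len(paths) * fractions[1])
        splits["train"] += [(p, label) for p in paths[:n_train]]
        splits["val"] += [(p, label) for p in paths[n_train:n_train + n_val]]
        # The remainder goes to the test split
        splits["test"] += [(p, label) for p in paths[n_train + n_val:]]
    return splits


# Hypothetical file names; the class counts are those of the training set
items = [(f"normal_{i}.jpeg", "NORMAL") for i in range(1341)]
items += [(f"pneumonia_{i}.jpeg", "PNEUMONIA") for i in range(3875)]

splits = stratified_split(items)
for name, subset in splits.items():
    n_normal = sum(1 for _, label in subset if label == "NORMAL")
    print(f"{name:5s} {len(subset):5d} images, {n_normal / len(subset):.1%} NORMAL")
```

A split like this fixes only the size of the validation and test sets. The class imbalance itself is preserved in every subset and would still need to be addressed separately.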
 
 
 



# Global references:

[1] https://www.mountsinai.org/health-library/report/pneumonia

[2] https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html

[3] https://machinelearningmastery.com/transfer-learning-for-deep-learning/

[4] https://arxiv.org/abs/1512.03385

[5] https://viso.ai/deep-learning/resnet-residual-neural-network/

[6] https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html

[7] https://blog.roboflow.com/why-preprocess-augment/#:~:text=Preprocessing%20is%20required%20to%20clean%20image%20data%20for%20model%20input.&text=Adjusting%20existing%20training%20data%20to,collected%20datasets%20may%20be%20small.

[8] https://www.researchgate.net/post/Is-there-an-ideal-ratio-between-a-training-set-and-validation-set-Which-trade-off-would-you-suggest