# 1. Introduction + Set-up

This repository is related to [TensorFlow Pneumonia Classification on X-rays](https://www.kaggle.com/amyjang/tensorflow-pneumonia-classification-on-x-rays).

Machine learning has a wide range of applications, including in health and diagnostics. This tutorial walks through the complete pipeline, from loading data to predicting results, and explains how to build an X-ray image classification model from scratch that predicts whether an X-ray scan shows the presence of pneumonia. This is especially relevant now, as COVID-19 is known to cause pneumonia.

This tutorial will explain how to use TPUs efficiently, load image data, build and train a convolutional neural network, fine-tune and regularize the model, and predict results. Data augmentation is not included in the model because X-ray scans are only taken in a specific orientation, and variations such as flips and rotations will not exist in real X-ray images. For a tutorial on image data augmentation, check out this [tutorial](https://www.kaggle.com/amyjang/tensorflow-data-augmentation-efficientnet).

Run the following cell to load the necessary packages. Make sure to change the Accelerator on the right to `TPU`.

In [4]:
import re
import os
import numpy as np
import pandas as pd
import tensorflow as tf
#from kaggle_datasets import KaggleDatasets
import kaggledatasets as kd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Device:', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except Exception:  # no TPU could be reached; fall back to the default strategy
    strategy = tf.distribute.get_strategy()
print('Number of replicas:', strategy.num_replicas_in_sync)
    
print(tf.__version__)

Number of replicas: 1
2.3.1


In [8]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
#GCS_PATH = kd().get_gcs_path()
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
IMAGE_SIZE = [180, 180]
EPOCHS = 25
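
The batch size above is a global batch size: 16 images per replica, multiplied by the number of replicas the strategy reports. As a rough illustration, the sketch below uses a hypothetical helper, `global_batch_size`, and assumes an 8-replica TPU for the second call; neither is part of the notebook.

```python
PER_REPLICA_BATCH = 16  # per-replica batch size used in the cell above


def global_batch_size(num_replicas, per_replica=PER_REPLICA_BATCH):
    """Hypothetical helper: images consumed per training step across all replicas."""
    return per_replica * num_replicas


# One replica, as reported in the output shown earlier.
print(global_batch_size(1))
# Assumed 8-replica TPU, for comparison.
print(global_batch_size(8))
```

With a single replica the global batch size stays at 16; on the assumed 8-replica device each step would process 128 images.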

# 2. Load the data

The chest X-ray data we are using, from [*Cell*](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia), is divided into train, val, and test folders. There are only 16 files in the validation folder, and we would prefer a less extreme division between the training and validation sets. We will combine the training and validation files and create a new split that resembles the standard 80:20 division instead.

In [9]:
filenames = tf.io.gfile.glob(str('./chest-x-ray-images-pneumonia/chest_xray/train/*/*'))
filenames.extend(tf.io.gfile.glob(str('./chest-x-ray-images-pneumonia/chest_xray/val/*/*')))

train_filenames, val_filenames = train_test_split(filenames, test_size=0.2)

In [13]:
print(len(train_filenames))
print(len(val_filenames))

4188
1048
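
Because the split above is random, the exact files in each set can differ between runs. The following is a minimal standard-library sketch of a seeded 80:20 split, intended only to illustrate the idea; `split_filenames`, the seed value, and the placeholder file names are assumptions, and this is not how `train_test_split` is implemented.

```python
import math
import random


def split_filenames(filenames, val_fraction=0.2, seed=42):
    """Hypothetical helper: shuffle a copy of the list and split off a validation share."""
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split on every run
    n_val = math.ceil(len(shuffled) * val_fraction)  # round the validation share up
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder names standing in for the 5236 combined train and val paths.
dummy_files = [f"img_{i}.jpeg" for i in range(5236)]
train_part, val_part = split_filenames(dummy_files)
print(len(train_part), len(val_part))
```

Rounding the validation share up gives 4188 training and 1048 validation files, the same counts as in the output above.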


Run the following cell to see how many healthy/normal chest X-rays we have and how many pneumonia chest X-rays we have.

In [14]:
COUNT_NORMAL = len([filename for filename in train_filenames if "NORMAL" in filename])
print("Normal images count in training set: " + str(COUNT_NORMAL))

COUNT_PNEUMONIA = len([filename for filename in train_filenames if "PNEUMONIA" in filename])
print("Pneumonia images count in training set: " + str(COUNT_PNEUMONIA))

Normal images count in training set: 1090
Pneumonia images count in training set: 3098


Notice that there are far more images classified as pneumonia than as normal. This shows that we have an imbalance in our data. We will correct for this imbalance later in the notebook.
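
One common way to correct for such an imbalance is to weight each class inversely to its frequency, so that both classes contribute roughly equally to the loss. The sketch below is an assumption about what that correction could look like, not necessarily the approach taken later in this notebook. The helper name `inverse_frequency_weights` is hypothetical, and the counts are taken from the output above.

```python
def inverse_frequency_weights(counts):
    """Hypothetical helper: weight_c = total / (num_classes * count_c)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Assumed encoding: 0 = normal, 1 = pneumonia; counts from the training split above.
class_weight = inverse_frequency_weights({0: 1090, 1: 3098})
for label, weight in class_weight.items():
    print(f"Weight for class {label}: {weight:.2f}")
```

A dictionary of this form can be passed as the `class_weight` argument of Keras `Model.fit`, so that the rarer normal images count for more during training.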

In [19]:
train_list_ds = tf.data.Dataset.from_tensor_slices(train_filenames)
val_list_ds = tf.data.Dataset.from_tensor_slices(val_filenames)

for f in train_list_ds.take(5):
    print(f)
    print(f.numpy())

tf.Tensor(b'./chest-x-ray-images-pneumonia/chest_xray/train/NORMAL/NORMAL2-IM-0797-0001.jpeg', shape=(), dtype=string)
b'./chest-x-ray-images-pneumonia/chest_xray/train/NORMAL/NORMAL2-IM-0797-0001.jpeg'
tf.Tensor(b'./chest-x-ray-images-pneumonia/chest_xray/train/PNEUMONIA/person1608_bacteria_4235.jpeg', shape=(), dtype=string)
b'./chest-x-ray-images-pneumonia/chest_xray/train/PNEUMONIA/person1608_bacteria_4235.jpeg'
tf.Tensor(b'./chest-x-ray-images-pneumonia/chest_xray/train/PNEUMONIA/person545_bacteria_2289.jpeg', shape=(), dtype=string)
b'./chest-x-ray-images-pneumonia/chest_xray/train/PNEUMONIA/person545_bacteria_2289.jpeg'
tf.Tensor(b'./chest-x-ray-images-pneumonia/chest_xray/train/NORMAL/NORMAL2-IM-0820-0001.jpeg', shape=(), dtype=string)
b'./chest-x-ray-images-pneumonia/chest_xray/train/NORMAL/NORMAL2-IM-0820-0001.jpeg'
tf.Tensor(b'./chest-x-ray-images-pneumonia/chest_xray/train/PNEUMONIA/person300_bacteria_1422.jpeg', shape=(), dtype=string)
b'./chest-x-ray-images-pneumonia/ches

Run the following cell to see how many images we have in our training set and how many we have in our validation set. Verify that the ratio of images is 80:20.

In [20]:
TRAIN_IMG_COUNT = tf.data.experimental.cardinality(train_list_ds).numpy()
print("Training images count: " + str(TRAIN_IMG_COUNT))

VAL_IMG_COUNT = tf.data.experimental.cardinality(val_list_ds).numpy()
print("Validating images count: " + str(VAL_IMG_COUNT))

Training images count: 4188
Validating images count: 1048
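
The 80:20 ratio can be checked with a little arithmetic. This is a small sketch using the two counts printed above; the variable names are illustrative only.

```python
train_count = 4188  # training images, from the output above
val_count = 1048    # validation images, from the output above

total = train_count + val_count
train_share = train_count / total
val_share = val_count / total
print(f"train: {train_share:.1%}, val: {val_share:.1%}")
```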


As expected, we have two labels for our images.

In [28]:
CLASS_NAMES = np.array([tf.strings.split(item, os.path.sep)[-1].numpy().decode("utf-8")
                        for item in tf.io.gfile.glob("./chest-x-ray-images-pneumonia/chest_xray/train/*")])
CLASS_NAMES

array(['PNEUMONIA', 'NORMAL'], dtype='<U9')

Currently our dataset is just a list of filenames. We want to map each filename to the corresponding (image, label) pair. The following methods will help us do that.

As we only have two labels, we will rewrite the label so that `1` or `True` indicates pneumonia and `0` or `False` indicates normal.
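
As a plain-Python illustration of that labeling rule, the sketch below reads the class from the parent directory name of a file path. The helper `label_from_path` is hypothetical and works on ordinary strings rather than tensors; it assumes `/`-separated paths in the directory layout shown in the output above.

```python
def label_from_path(file_path):
    """Hypothetical helper: True for pneumonia, False for normal."""
    # The second-to-last path component is the class directory.
    class_dir = file_path.split("/")[-2]
    return class_dir == "PNEUMONIA"


paths = [
    "./chest-x-ray-images-pneumonia/chest_xray/train/NORMAL/NORMAL2-IM-0797-0001.jpeg",
    "./chest-x-ray-images-pneumonia/chest_xray/train/PNEUMONIA/person1608_bacteria_4235.jpeg",
]
for p in paths:
    label = label_from_path(p)
    print(label, int(label))
```

The first path yields `False` (`0`) and the second `True` (`1`), matching the encoding described above.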