## 🧠💡 Intelligent Systems  for Smart Health 👨‍⚕👩‍⚕️

# CNN (Convolutional Neural Networks)

Convolutional Neural Networks, or CNNs for short, are a powerful type of neural network commonly used in computer vision tasks. They are particularly well-suited to tasks like image classification and object detection because they are able to automatically learn and extract relevant features from input images. CNNs consist of multiple layers, each of which performs a different type of processing on the input data. These layers typically include convolutional layers, which extract features from the input images, and pooling layers, which downsample the output of the convolutional layers. By stacking these layers on top of one another, a CNN is able to learn increasingly complex representations of the input data.
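
To make the layer stacking a little more concrete, here is a minimal NumPy sketch (not how Keras implements it) of one convolution followed by 2×2 max pooling. The helper names `conv2d_valid` and `max_pool_2x2`, the toy 6×6 input, and the vertical-edge kernel are assumptions chosen for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding.

    Like most CNN libraries, this does not flip the kernel (cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Downsample by keeping the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy input and a simple vertical-edge kernel (both made up for this sketch)
toy_image = np.arange(36, dtype=float).reshape(6, 6)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)

features = conv2d_valid(toy_image, vertical_edge)  # 6x6 input, 3x3 kernel -> 4x4
pooled = max_pool_2x2(features)                    # 4x4 -> 2x2
print(features.shape, pooled.shape)
```

Each stage shrinks the spatial size while summarizing a larger patch of the input, which is roughly why deeper layers can respond to larger structures.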

## The dataset

For this session, we will use the [ChestX-ray8 dataset](https://arxiv.org/abs/1705.02315), which contains 108,948 frontal-view X-ray images of about 30,000 unique patients.
- Each image in the data set contains multiple text-mined labels identifying 14 different pathological conditions. 
- These in turn can be used by physicians to diagnose 8 different diseases. 
- We will use this data to develop a single model that will provide binary classification predictions for each of the 14 labeled pathologies. 
- In other words it will predict 'positive' or 'negative' for each of the pathologies.
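
The "one binary prediction per pathology" idea can be sketched as follows. This is only an illustration with made-up numbers: the three label names are a subset of the columns, and the logits, the `sigmoid` helper, and the 0.5 threshold are assumptions, not outputs of any trained model.

```python
import numpy as np

# Hypothetical subset of the label names, for illustration only
labels = ["atelectasis", "cardiomegaly", "edema"]

def sigmoid(z):
    """Map raw scores to values between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw model outputs (logits) for one image, one value per label
logits = np.array([2.0, -1.0, 0.3])
probabilities = sigmoid(logits)

# Each label gets its own independent positive/negative decision,
# so several pathologies can be positive for the same image
threshold = 0.5
predictions = {
    name: ("positive" if p > threshold else "negative")
    for name, p in zip(labels, probabilities)
}
print(predictions)
```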
 
The full dataset is available for free [here](https://nihcc.app.box.com/v/ChestXray-NIHCC).

**However, we will work with a smaller subset containing about 10% of the original data!!**

In [1]:
# Import necessary packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb

## 1. Exploration

Read the data from `csv` files.

In [2]:
path_data = "../data/ChestX_subset/"

metadata_df = pd.read_csv(os.path.join(path_data, "metadata.csv"))
metadata_df.head()

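| Unnamed: 0 | image | follow_up_no | patient_id | patient_age | gender | view_position | atelectasis | cardiomegaly | consolidation | edema | ... | emphysema | fibrosis | hernia | infiltration | mass | no_finding | nodule | pleural_thickening | pneumonia | pneumothorax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00013244_008.png | 8 | 13244 | 58 | F | PA | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 00016807_023.png | 23 | 16807 | 60 | M | PA | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 00019988_002.png | 2 | 19988 | 55 | M | AP | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 00018840_022.png | 22 | 18840 | 21 | F | AP | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 00002072_007.png | 7 | 2072 | 9 | M | AP | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |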


### 1.1 Data Types and Null Values Check

Run the next cell to explore the data types present in each column and whether any null values exist in the data.

In [None]:
# Look at the data type of each column and whether null values are present
metadata_df.info()

### 1.2 Unique IDs Check

The `patient_id` column holds an identification number for each patient. One thing you'd like to know about a medical dataset like this is whether you're looking at repeated data for certain patients or whether each image represents a different person.

In [None]:
n_total = metadata_df['patient_id'].count()
n_unique = metadata_df['patient_id'].nunique()
print(f"The total patient ids are {n_total}, from those the unique ids are {n_unique}")

As you can see, the number of unique patients in the dataset is less than the total number so there must be some overlap. For patients with multiple records, you'll want to make sure they do not show up in both training and test sets in order to avoid data leakage (covered later in this week's lectures).
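
To see how much repetition there is, one option is to count images per patient. The sketch below uses a made-up `toy_df` in place of `metadata_df`. The variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for metadata_df: patients 1 and 3 have more than one image
toy_df = pd.DataFrame({"patient_id": [1, 1, 2, 3, 3, 3, 4, 5]})

# Number of images recorded for each patient
images_per_patient = toy_df["patient_id"].value_counts()

# Keep only the patients that occur more than once
repeat_patients = images_per_patient[images_per_patient > 1]
n_images_from_repeats = int(repeat_patients.sum())

print(f"{len(repeat_patients)} patients have more than one image, "
      f"accounting for {n_images_from_repeats} of {len(toy_df)} images")
```

On the real data, the same calls would presumably be applied to `metadata_df["patient_id"]`.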

### 1.3 Data Labels

Run the next two code cells to create a list of the names of each patient condition or disease. 

In [None]:
columns = metadata_df.columns
print(columns.values)

In [None]:
# Define actual labels (or classes)
non_class_columns = ['image',
                     'follow_up_no',
                     'patient_id',
                     'patient_age',
                     'gender',
                     'view_position']
classes = [c for c in metadata_df.columns if c not in non_class_columns]

# Get the total classes
print(f"There are {len(classes)} classes (or: labels)")
print(f"This includes: {classes}")

In [None]:
# check how often each label occurs...

Have a look at the counts for the labels in each class above. Does this look like a balanced dataset?
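
One possible way to count label occurrences is to sum the 0/1 columns. The sketch below uses a small made-up table (`toy_labels`) rather than the real data. For the notebook's dataframe, the analogous call would presumably be `metadata_df[classes].sum()`.

```python
import pandas as pd

# Made-up 0/1 label columns for eight images (illustration only)
toy_labels = pd.DataFrame({
    "hernia":       [0, 0, 0, 0, 0, 0, 0, 1],
    "infiltration": [1, 0, 1, 0, 0, 1, 0, 0],
    "no_finding":   [0, 1, 0, 1, 1, 0, 1, 0],
})

# Summing a 0/1 column gives the number of positive cases for that label
label_counts = toy_labels.sum().sort_values(ascending=False)

# The column mean is the fraction of images that are positive for that label
label_fraction = toy_labels.mean()

print(label_counts)
print(label_fraction)
```

Large differences between the counts are a sign of class imbalance, which matters later when interpreting accuracy.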

### 1.4 Data Visualization

Using the image names listed in the csv file, you can retrieve the image associated with each row of data in your dataframe. 

Run the cell below to visualize a random selection of images from the dataset.

In [None]:
path_images = "../data/ChestX_subset/images/"

# Pick 9 distinct random images (replace=False avoids drawing the same image twice)
np.random.seed(1)
random_images = np.random.choice(metadata_df.image, 9, replace=False)


# Adjust the size of your images
plt.figure(figsize=(10,10))

# Iterate and plot random images
for i, filename in enumerate(random_images):
    plt.subplot(3, 3, i + 1)
    img = plt.imread(os.path.join(path_images, filename))
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    
# Adjust subplot parameters to give specified padding
plt.tight_layout()    

### 1.5 Investigating a Single Image

Run the cell below to look at the first image in the dataset and print out some details of the image contents.

In [None]:
# Get the first image that is listed in the metadata_df dataframe
sample_img = metadata_df.image[0]
raw_image = plt.imread(os.path.join(path_images, sample_img))

plt.imshow(raw_image, cmap='gray')
plt.colorbar()
plt.title('Raw Chest X Ray Image')

print(f"The dimensions of the image are {raw_image.shape}.")
print(f"The maximum pixel value is {raw_image.max():.4f} and the minimum is {raw_image.min():.4f}")
print(f"The mean value of the pixels is {raw_image.mean():.4f} and the standard deviation is {raw_image.std():.4f}")

### 1.6 Investigating Pixel Value Distribution

Run the cell below to plot the distribution of pixel values in the image shown above.

In [None]:
# Plot a histogram of the distribution of the pixels
# histplot replaces distplot, which is deprecated in recent seaborn versions
sb.histplot(raw_image.ravel(),
            label=f'Pixel Mean {np.mean(raw_image):.4f} & Standard Deviation {np.std(raw_image):.4f}', kde=False)
plt.legend(loc='upper center')
plt.title('Distribution of Pixel Intensities in the Image')
plt.xlabel('Pixel Intensity')
plt.ylabel('# Pixels in Image')

## Split the data!

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(#...
                                    )
print(f"Training set size: {train_df.shape}")
print(f"Test set size: {test_df.shape}")

In [None]:
# second split

train_df, val_df = train_test_split(#...
                                    )
print(f"Training set size: {train_df.shape}")
print(f"Validation set size: {val_df.shape}")
print(f"Test set size: {test_df.shape}")

## Everything good?
- Is there any leakage between the training, validation, and test sets?
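
One way to think about the leakage question: if the split is done per image, the same patient can land in several sets. A minimal sketch of a patient-level split plus an overlap check is shown below. The toy data, the 40% hold-out fraction, and the seed are assumptions for illustration, not the settings intended for this session.

```python
import numpy as np
import pandas as pd

# Toy stand-in for metadata_df: patients 1 and 3 have several images
toy_df = pd.DataFrame({
    "image": [f"img_{i}.png" for i in range(8)],
    "patient_id": [1, 1, 2, 3, 3, 3, 4, 5],
})

# Shuffle the unique patient ids and hold out a fraction of *patients*, not images
rng = np.random.default_rng(0)
shuffled_ids = rng.permutation(toy_df["patient_id"].unique())
n_test = max(1, int(0.4 * len(shuffled_ids)))  # assumed hold-out fraction
test_ids = shuffled_ids[:n_test]

# All images of a held-out patient go to the test set
is_test = toy_df["patient_id"].isin(test_ids)
toy_train, toy_test = toy_df[~is_test], toy_df[is_test]

# Leakage check: no patient id should appear in both sets
shared = set(toy_train["patient_id"]) & set(toy_test["patient_id"])
print(f"Patients appearing in both sets: {len(shared)}")
```

The same set-intersection check can be run on `train_df`, `val_df`, and `test_df`. scikit-learn's `GroupShuffleSplit` offers a ready-made group-aware split if you prefer not to write it by hand.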


## 2. Image Preprocessing in Keras

Before training, you'll first modify your images to be better suited for training a convolutional neural network. For this task you'll use the Keras [ImageDataGenerator](https://keras.io/preprocessing/image/) class to perform data preprocessing and data augmentation.

Run the next two cells to import this class and create an image generator for preprocessing.

In [None]:
# Import data generator from keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
# Normalize images
image_generator = ImageDataGenerator(
    # what should the base generator do....?
)
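
One common choice for the base generator (an assumption here, not necessarily the intended answer) is per-image standardization: subtract each image's mean and divide by its standard deviation. In `ImageDataGenerator` this corresponds to the arguments `samplewise_center=True` and `samplewise_std_normalization=True`. The NumPy sketch below shows the arithmetic on a random stand-in image.

```python
import numpy as np

# Random stand-in for one grayscale X-ray with values in [0, 1]
rng = np.random.default_rng(0)
fake_image = rng.uniform(0.0, 1.0, size=(320, 320))

# Per-image standardization; the small constant guards against division by zero
standardized = (fake_image - fake_image.mean()) / (fake_image.std() + 1e-6)

print(f"before: mean={fake_image.mean():.4f}, std={fake_image.std():.4f}")
print(f"after:  mean={standardized.mean():.4f}, std={standardized.std():.4f}")
```

After this step every image has roughly zero mean and unit standard deviation, regardless of how bright or dark the original exposure was.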

In [None]:
label_column = "no_finding"  # alternative: "mass"

# Define the data generators
train_generator = image_generator.flow_from_dataframe(
        dataframe=train_df,
        directory=path_images,
        x_col="image",
        y_col=label_column,
        target_size=(320, 320),
        batch_size=32,
        class_mode='raw',
        color_mode="grayscale")

val_generator = image_generator.flow_from_dataframe(
        dataframe=val_df,
        directory=path_images,
        x_col="image",
        y_col=label_column,
        target_size=(320, 320),
        batch_size=32,
        class_mode='raw',
        color_mode="grayscale")

## Inspect generator results

## Try 2D convolutions
- Simply try out different convolution kernels and see what they do to the image.
- See also: https://en.wikipedia.org/wiki/Kernel_(image_processing)
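
Before moving to a real photo, it can help to see what a few kernels do to a single bright pixel. The sketch below uses `scipy.signal.convolve2d` on a made-up 5×5 array. The kernel selection and the zero-padding settings are assumptions for illustration.

```python
import numpy as np
from scipy import signal

# Toy image: all zeros except one bright pixel in the centre
toy = np.zeros((5, 5))
toy[2, 2] = 1.0

kernels = {
    "identity": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "box_blur": np.ones((3, 3)) / 9.0,
    "edge": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
}

# mode="same" keeps the output the same size as the input; borders are zero-padded
results = {
    name: signal.convolve2d(toy, k, mode="same", boundary="fill", fillvalue=0)
    for name, k in kernels.items()
}

for name, out in results.items():
    print(name)
    print(np.round(out, 2))
```

The identity kernel leaves the image unchanged, the box blur spreads the bright pixel over its neighbourhood, and the edge kernel produces a positive centre surrounded by negative values.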

In [None]:
import numpy as np
from scipy import signal
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

# Download the image from a URL
response = requests.get('https://cdn.pixabay.com/photo/2023/04/20/03/18/lizard-7938887_1280.jpg', timeout=30)
response.raise_for_status()  # stop early if the download failed

# Open the image file
img = Image.open(BytesIO(response.content)).convert('L') # Convert image to grayscale
img = np.array(img)

plt.imshow(img, cmap="gray")

In [None]:
# Define the 3x3 sharpening filter

kernel = np.array([[0, -1, 0],
                   [-1, 5,-1],
                   [0, -1, 0]])

# Apply the kernel to the image
result = signal.convolve2d(img, kernel, mode="same", boundary="symm")  # one option: SciPy's 2D convolution

# Display the original and the resulting image
plt.figure(figsize=(10,5))

plt.subplot(121)
plt.imshow(img, cmap=plt.cm.gray)
plt.title('Original image')
plt.axis('off')

plt.subplot(122)
plt.imshow(result, cmap=plt.cm.gray)
plt.title('Image after applying the convolution')
plt.axis('off')

plt.show()

In [None]:
import tensorflow as tf

In [None]:
tf.__version__

In [None]:
import keras
keras.__version__

In [None]:

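# Use the Keras API bundled with TensorFlow, consistent with the earlier import;
# ImageDataGenerator was already imported in section 2
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model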
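
The plotting cell that follows expects a `history` object, which `model.fit` returns. As a rough sketch only, a small CNN for the single-label setup (`class_mode='raw'`, 320×320 grayscale input) might look like the code below. The layer sizes, optimizer, and the commented-out epoch count are assumptions, not a recommended architecture.

```python
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

# Input matches target_size=(320, 320) and color_mode="grayscale" (one channel)
inputs = Input(shape=(320, 320, 1))

# Two convolution + pooling stages (filter counts are arbitrary choices)
x = Conv2D(16, (3, 3), activation="relu")(inputs)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation="relu")(x)
x = MaxPooling2D((2, 2))(x)

# Flatten the feature maps and predict one probability for the chosen label
x = Flatten()(x)
outputs = Dense(1, activation="sigmoid")(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would then look roughly like this (needs the generators defined earlier):
# history = model.fit(train_generator, validation_data=val_generator, epochs=5)
```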


In [None]:
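# `history` is assumed to be the object returned by `model.fit(...)`
fig, axes = plt.subplots(2, 1, figsize=(7, 9))

axes[0].plot(history.history["loss"], "o-", label="training")
axes[0].plot(history.history["val_loss"], "o-", label="validation")
axes[0].set_ylabel("Loss")
axes[0].legend()
axes[1].plot(history.history["accuracy"], "o-", label="training")
axes[1].plot(history.history["val_accuracy"], "o-", label="validation")
axes[1].set_ylabel("Accuracy")
axes[1].set_xlabel("Epoch")
axes[1].legend()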