**DeepCryoPicker - PreProcessing**

**Imports**

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import cv2
import shutil
from PIL import Image
from skimage import exposure, restoration, color
from skimage.morphology import disk, closing
from skimage.filters import gaussian
from scipy.signal import convolve2d


**Importing Datasets and Directory Set Up**

This section imports the raw dataset into the notebook and sets up the output directories.

The notebook was tested on Google Drive. If you are using Google Drive, run the mount cell below and connect your account.

Copy the path to the raw dataset.

Set the path for the output directory.

In [None]:
# mount drive
# if you are not running this through Google Drive, then skip this cell
from google.colab import drive
drive.mount('/content/drive')

**Set the required directory paths**

path - directory containing the raw datasets

path_output - where this notebook will save the processed images

In [1]:
#drive path to directory containing datasets
path = "/content/drive/MyDrive/DeepCryoPicker/Data Sets"

#drive path to output directory for preprocessed_data
path_output = "/content/drive/MyDrive/DeepCryoPicker/preprocessed_data"

**Import raw data into a dict**

dict is saved as dataset[folder_name]->data

In [None]:
#empty dictionary to store image data
dataset = {}
# Loop through all folders in the directory and import images
# images will be stored as dataset['folder'][data]
for folder in os.listdir(path):
  folder_path = os.path.join(path, folder)
  data = []
  for image_path in os.listdir(folder_path):
    # import data (cv2.imread returns None if the file cannot be read)
    image = cv2.imread(os.path.join(folder_path, image_path))
    # skip unreadable files before any conversion is attempted
    if image is None:
      continue
    # convert data from OpenCV's BGR order to RGB values
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    data.append(image)
  # Convert data to numpy arrays
  data = np.array(data)
  # Add the data to the dataset dictionary
  dataset[folder] = data
# remove 'Test' folder from our dataset
# test was used for various testing files (removing it breaks the dictionary structure for some reason)
# pop() is used so the cell does not fail when there is no 'Test' folder
dataset.pop('Test', None)
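
Images in a folder may differ in size, in which case `np.array(data)` cannot build a regular array (recent NumPy versions raise an error for such ragged input). A minimal sketch of one way around this, assuming it is acceptable to hold the images in a 1-D object array; the helper name `to_object_array` is hypothetical and not part of the original notebook:

```python
import numpy as np

def to_object_array(images):
    """Pack a list of images into a 1-D object array, one image per slot.

    Hypothetical helper: works whether or not the images share a shape.
    """
    packed = np.empty(len(images), dtype=object)
    for i, img in enumerate(images):
        packed[i] = img
    return packed

# two dummy RGB images of different sizes stand in for real micrographs
images = [np.zeros((2, 3, 3), dtype=np.uint8), np.zeros((4, 5, 3), dtype=np.uint8)]
ragged = to_object_array(images)
```

Iterating over `ragged` yields the individual images, so a loop such as `for data in dataset[label]` would keep working unchanged.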





**Set up output directory**

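This step expects `path_output` to point to an empty directory. If the directory already exists, it is deleted together with everything inside it and then recreated.

One subdirectory is created for each folder_name found in dataset{}.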

In [None]:
# Remove the output directory and everything inside it, if it exists
if os.path.exists(path_output):
    shutil.rmtree(path_output)
# Create the directory and label subdirectories
# store the output of the preprocessing in separate folders
os.makedirs(path_output)
# get all labels for directory creation
labels = dataset.keys()
# create and label subdirectories
for label in labels:
    # create path
    label_dir = os.path.join(path_output, label)
    # make dir
    if not os.path.exists(label_dir):
      os.makedirs(label_dir)



**PreProcessing step**

Preprocessing is a critical step in image analysis that involves preparing the image data for further processing. This typically involves a series of operations to correct for various artifacts and distortions in the image data, and to extract relevant features from the images. 


The goal of preprocessing is to improve the quality and relevance of the image data, and to prepare it for further processing and analysis. Proper preprocessing can have a significant impact on the accuracy and reliability of downstream analysis, so it is important to carefully choose and optimize preprocessing steps for each particular application. 

For this model the following are the preprocessing steps taken:


>   **Image Normalization**
>> Scales the pixel values of an image to a consistent range or distribution. This helps to remove biases or inconsistencies in the data and improves its suitability for subsequent processing and analysis.

> **Image Adjustment**
>> Modifies the pixel values of an image to improve its quality or contrast. Here we use a contrast stretch (5th to 95th percentile) and a gamma adjustment.

> **Image Restoration**
>> Aims to improve the quality of an image by removing noise, blurring, or other distortions. The practices used here are:
>>>**Histogram Equalization**
>>>>Enhances the contrast of an image by redistributing the pixel intensities. This involves adjusting the image's histogram so that it is more evenly distributed across the available range of pixel values.

>>>**Wiener Deconvolution**
>>>>A technique used to restore images that have been degraded by blurring or noise. It applies a deconvolution filter to the image, which estimates the original, unblurred image by removing the effects of the blurring process.

>**Adaptive Histogram Equalization**
>>Enhances the contrast of an image, particularly in areas with low contrast or uneven illumination. Unlike traditional histogram equalization, which operates on the entire image, AHE applies the equalization to local patches of the image. This means that the contrast enhancement is localized to specific regions of the image, preserving the contrast in other regions.

>**Gaussian Filtering**
>>Smooths an image by applying a Gaussian filter to the image data. The Gaussian filter is a low-pass filter that effectively removes high-frequency noise from the image, while preserving the overall spatial structure of the image.

>**Morphological Operations (closing)**
>> A technique that modifies the shape or structure of objects in an image. These operations can be used to remove noise or to enhance specific features in an image, such as edges or boundaries.
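
The first intensity steps in the list above (normalization, percentile stretch, gamma adjustment, global histogram equalization) can be sketched in plain NumPy. This is an illustration under stated assumptions, not the notebook's implementation: the notebook calls scikit-image functions, whose details (binning, range handling) differ, and every function name below is hypothetical.

```python
import numpy as np

def normalize(img):
    """Scale pixel values to [0, 1] by dividing by the maximum."""
    img = img.astype(np.float64)
    peak = img.max()
    return img / peak if peak > 0 else img  # leave an all-zero image as is

def stretch(img, low_pct=5, high_pct=95):
    """Map the [low_pct, high_pct] percentile range onto [0, 1], clipping the tails."""
    lo, hi = np.percentile(img, (low_pct, high_pct))
    if hi <= lo:
        return np.zeros_like(img)
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

def adjust_gamma(img, gamma=0.5):
    """Power-law adjustment; gamma < 1 brightens dark regions."""
    return img ** gamma

def equalize_hist(img, nbins=256):
    """Global histogram equalization: map each pixel through the image's CDF."""
    hist, bin_edges = np.histogram(img.ravel(), bins=nbins, range=(0.0, 1.0))
    cdf = hist.cumsum() / hist.sum()
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    return np.interp(img.ravel(), centers, cdf).reshape(img.shape)

# synthetic 8-bit image standing in for a greyscale micrograph
rng = np.random.default_rng(0)
raw = rng.integers(0, 200, size=(64, 64)).astype(np.uint8)

stretched = stretch(normalize(raw))
out = equalize_hist(adjust_gamma(stretched))
```

Each step keeps the values inside [0, 1], which is what allows the steps to be chained without rescaling in between.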








In [None]:
# create new dictionary to hold the preprocessed data
processed_dataset = {}
print("Starting data processing:\n")
# Loop over each image in the dataset
for label in dataset.keys():
  data_pre = []
  print("Current Dataset:"+ str(label))
  i=0
  # Loop over each image in the label's 'data' list
  for data in dataset[label]:
    print("image number: " + str(i))
    i=i+1
    print("start...")
    # preprocessing
    # image normalization: pixel values are in the range [0,1]
    normalized_data = data/np.max(data)
    # convert image to greyscale
    gray_image_data = color.rgb2gray(normalized_data)
    # image adjustment (stretch and adjust)
    # stretch the contrast of the image between the 5th and 95th percentiles of the pixel intensity distribution
    p1, p2 = np.percentile(gray_image_data, (5,95))
    stretched_data = exposure.rescale_intensity(gray_image_data, in_range=(p1, p2))
    # Adjust the contrast and brightness of the image
    adjusted_data = exposure.adjust_gamma(stretched_data, gamma=0.5)
    # image restoration (histogram equalization and Wiener deconvolution)
    # histogram equalization
    histeq_data = exposure.equalize_hist(adjusted_data)
    # Wiener deconvolution
    # the psf used is a general approximation; a better estimate could be created for our data set (possible improvement, though probably small)
    psf = np.ones((3, 3)) / 9
    blurred_data = convolve2d(histeq_data, psf, mode='same', boundary='symm')
    wiener_deconvolved_data = restoration.wiener(blurred_data, psf, balance=0.1)
    # histogram equalization
    histeq_data = exposure.equalize_hist(wiener_deconvolved_data)
    # adaptive histogram equalization
    adapeq_data = exposure.equalize_adapthist(histeq_data, clip_limit=0.02, kernel_size=None)
    # adaptive histogram equalization again
    adapeq_data = exposure.equalize_adapthist(adapeq_data, clip_limit=0.99, kernel_size=None)
    # Gaussian filtering followed by total-variation denoising, repeated 4 times
    filtered_image_data = adapeq_data.copy()
    for j in range(4):
      filtered_image_data = gaussian(filtered_image_data, sigma=1)
      filtered_image_data = restoration.denoise_tv_chambolle(filtered_image_data, weight=0.1)
    # morphological closing operation
    selem = disk(5)
    closed_image_data = closing(filtered_image_data, selem)
    # store the preprocessed image in the list (it is written to disk in a later cell)
    data_pre.append(closed_image_data)
    print("completed")
  # Convert the processed images to a numpy array
  data_pre = np.array(data_pre)
  # Add the processed images to the processed_dataset dictionary
  processed_dataset[label] = data_pre
  print("Finished processing for Dataset: " + str(label))
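
The idea behind the Wiener deconvolution step can be sketched with NumPy FFTs. This is a simplified textbook form that uses a constant regularization term `k`; it is not scikit-image's `restoration.wiener`, which regularizes differently through its `balance` parameter. The helper names `pad_psf` and `wiener_deconvolve` and the value of `k` are assumptions made for illustration.

```python
import numpy as np

def pad_psf(psf, shape):
    """Zero-pad the PSF to `shape` and shift its centre to index (0, 0)."""
    padded = np.zeros(shape)
    kh, kw = psf.shape
    padded[:kh, :kw] = psf
    return np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))

def wiener_deconvolve(blurred, psf, k=1e-3):
    """Frequency-domain Wiener filter: F_hat = conj(H) / (|H|^2 + k) * G."""
    H = np.fft.fft2(pad_psf(psf, blurred.shape))
    G = np.fft.fft2(blurred)
    F_hat = np.conj(H) / (np.abs(H) ** 2 + k) * G
    return np.real(np.fft.ifft2(F_hat))

# synthetic test: blur a random image with the same 3x3 box PSF, then restore it
rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
psf = np.ones((3, 3)) / 9
H = np.fft.fft2(pad_psf(psf, sharp.shape))
blurred = np.real(np.fft.ifft2(np.fft.fft2(sharp) * H))  # circular convolution

restored = wiener_deconvolve(blurred, psf)
err_blurred = np.mean((blurred - sharp) ** 2)
err_restored = np.mean((restored - sharp) ** 2)
```

A larger `k` suppresses more noise at the cost of a softer result; frequencies where the PSF response is close to zero cannot be recovered either way.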



**Save processed images to output directory**

In [None]:
for folder in processed_dataset.keys():
  folder_path = os.path.join(path_output, folder)
  # Loop through all images in the folder
  for i, image in enumerate(processed_dataset[folder]):
    # Construct the file name
    file_name = f"{folder}_{i+1:03}.png"
    file_path = os.path.join(folder_path, file_name)
    # the processed images are floats in [0, 1]; scale to 8-bit so the PNG is not saved almost black
    image = np.clip(image * 255, 0, 255).astype(np.uint8)
    cv2.imwrite(file_path, image)

**Information Gathering**

Gather data on the images before and after preprocessing:
MSE, PSNR, SNR, histograms
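
A minimal NumPy sketch of how MSE, PSNR and SNR could be computed between two images of the same shape. The definitions below are common ones (SNR in particular has several conventions), and the function names and the `data_range` parameter are assumptions rather than part of the original notebook.

```python
import numpy as np

def mse(reference, test):
    """Mean squared error between two images of identical shape."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    return float(np.mean((reference - test) ** 2))

def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the maximum possible pixel value."""
    err = mse(reference, test)
    if err == 0:
        return float("inf")
    return float(10 * np.log10(data_range ** 2 / err))

def snr(reference, test):
    """Signal-to-noise ratio in dB: signal power over the power of the difference."""
    reference = np.asarray(reference, dtype=np.float64)
    noise = reference - np.asarray(test, dtype=np.float64)
    noise_power = np.sum(noise ** 2)
    if noise_power == 0:
        return float("inf")
    return float(10 * np.log10(np.sum(reference ** 2) / noise_power))

# toy example: a flat image and a copy offset by 0.1
reference = np.full((4, 4), 0.5)
degraded = reference + 0.1

mse_value = mse(reference, degraded)
psnr_value = psnr(reference, degraded)
snr_value = snr(reference, degraded)
```

To compare a raw image with its processed counterpart, both would first need the same shape and value range, for example by converting the raw RGB image to greyscale and scaling it to [0, 1].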

In [None]:
# save example images (first 5 of each set) for the report (values scaled from [0, 1] to the 0-255 integer range)
path_save = "/content/drive/MyDrive/DeepCryoPicker/Report/Processed Images"
for folder in processed_dataset.keys():
  # Create a new directory with the same name as the folder
  folder_path = os.path.join(path_save, folder)
  os.makedirs(folder_path, exist_ok=True)
  # Loop through all images in the folder
  for i, image in enumerate(processed_dataset[folder]):
    # only the first 5 images of each set are saved
    if i < 5:
      # Construct the file name
      file_name = f"{folder}_{i+1:03}.png"
      file_path = os.path.join(folder_path, file_name)
      # scale from [0, 1] to [0, 255] before writing
      image = image*255
      cv2.imwrite(file_path,image)