# Exercise Project

(TKO_2027-3001)

</br>

**Oona Leppänen**
</br>
1800509
</br>
oklepp@utu.fi

</br>

This file creates a Covid-19 dataset from two source datasets: *Covid19-dataset* and *COVID_IEEE*. *Covid19-dataset* has 317 images divided into a training set of 251 images and a test set of 66 images. The training set consists of 70 healthy cases, 70 pneumonia cases and 111 covid cases, and the test set consists of 20 healthy cases, 20 pneumonia cases and 26 covid cases. There is no built-in validation set, so one must be created in order to train the CNN model properly in the main exercise project file.

*COVID_IEEE* has 1823 images divided into covid, normal and virus sets with 536, 668 and 619 images, respectively. The name 'normal' refers to healthy patients and the name 'virus' refers to patients with viral pneumonia. Only the covid images from this dataset are used in this project, because its pneumonia and normal images come from the dataset that was used to train the base model and therefore cannot be used to train, validate or test the fine-tuned model.

The covid images from *COVID_IEEE* are divided into training, validation and test sets and added to the corresponding sets of *Covid19-dataset*. Together they form the Covid-19 dataset used to fine-tune the CNN model in the main exercise project file.

## Imports

In [None]:
import numpy as np

import os
import glob
import cv2
from PIL import Image

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

## Loading the data

Uncomment only when using Google Drive.

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

Mounted at /content/drive


Defining a function that loads the images matching a path pattern and labels each image with the name of its parent folder.

In [None]:
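def get_dataset(file_path):
  # Sort the matches so the image order, and with it the seeded splits, is reproducible
  open_files = sorted(glob.glob(file_path))
  x = []
  y = []

  for file in open_files:
    image = cv2.imread(file)
    if image is None:
      # cv2.imread returns None instead of raising when a file cannot be decoded
      raise ValueError(f"Could not read image: {file}")
    # OpenCV loads images as BGR; convert to RGB because they are saved with PIL later
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    x.append(cv2.resize(image, (256, 256)))

    # The name of the parent folder is the class label
    folder = os.path.basename(os.path.dirname(file))
    y.append(folder)

  images = np.array(x)
  labels = np.array(y)

  return images, labels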
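
The labelling step depends on the folder layout: the class name is the last directory component of each file path. The following is a minimal sketch of that extraction using only the standard library. The helper name `label_from_path` and the file name `image_001.jpeg` are made up for illustration and are not part of the notebook or the dataset.

```python
import os

def label_from_path(file_path):
    """Return the name of the folder that directly contains the file."""
    return os.path.basename(os.path.dirname(file_path))

# Hypothetical file name, used only to illustrate the folder layout
example = 'X-ray_image_classification/Covid19-dataset/train/Viral Pneumonia/image_001.jpeg'
print(label_from_path(example))  # prints "Viral Pneumonia"
```

Because the labels follow the folder names exactly, images loaded from `train/Covid` and from `COVID_IEEE/covid` receive the differently capitalised labels `Covid` and `covid`.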

Loading the data from the first dataset (*Covid19-dataset*).

In [None]:
# Training set data
x_train_normal, y_train_normal = get_dataset('X-ray_image_classification/Covid19-dataset/train/Normal/*')
x_train_pneu, y_train_pneu = get_dataset('X-ray_image_classification/Covid19-dataset/train/Viral Pneumonia/*')
x_train_covid, y_train_covid = get_dataset('X-ray_image_classification/Covid19-dataset/train/Covid/*')

# Test set data
x_test_normal, y_test_normal = get_dataset('X-ray_image_classification/Covid19-dataset/test/Normal/*')
x_test_pneu, y_test_pneu = get_dataset('X-ray_image_classification/Covid19-dataset/test/Viral Pneumonia/*')
x_test_covid, y_test_covid = get_dataset('X-ray_image_classification/Covid19-dataset/test/Covid/*')

Loading only the covid images from the second dataset (*COVID_IEEE*).

In [None]:
x_covid_C_IEEE, y_covid_C_IEEE = get_dataset('X-ray_image_classification/COVID_IEEE/covid/*')

## Splitting and combining the data

This section splits the COVID_IEEE covid images into training, validation and test sets, splits the training set of Covid19-dataset into training and validation sets, and then combines the corresponding sets into a single training set, validation set and test set.

Splitting the covid cases from the COVID_IEEE dataset into training, validation and test sets. Two successive 80/20 splits leave roughly 64% of the images for training, 16% for validation and 20% for testing.

In [None]:
x_train_C_IEEE, x_test_C_IEEE, y_train_C_IEEE, y_test_C_IEEE = train_test_split(x_covid_C_IEEE, y_covid_C_IEEE, test_size = 0.2, random_state = 2)
x_train_C_IEEE, x_valid_C_IEEE, y_train_C_IEEE, y_valid_C_IEEE = train_test_split(x_train_C_IEEE, y_train_C_IEEE, test_size = 0.2, random_state = 2)

Splitting the training set images of Covid19-dataset into a training set and a validation set.

In [None]:
# Healthy cases
x_train_normal, x_valid_normal, y_train_normal, y_valid_normal = train_test_split(x_train_normal, y_train_normal,
                                                                                  test_size = 0.2, random_state = 2)

# Pneumonia cases
x_train_pneu, x_valid_pneu, y_train_pneu, y_valid_pneu = train_test_split(x_train_pneu, y_train_pneu,
                                                                          test_size = 0.2, random_state = 2)

# Covid-19 cases
x_train_covid, x_valid_covid, y_train_covid, y_valid_covid = train_test_split(x_train_covid, y_train_covid,
                                                                              test_size = 0.2, random_state = 2)

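Adding the covid images from the COVID_IEEE dataset to the training, validation and test image arrays. Only the image arrays are combined; the label arrays are not used later, because the images are saved into folders named after their class.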

In [None]:
x_train_covid = np.concatenate((x_train_covid, x_train_C_IEEE), axis = 0)

x_valid_covid = np.concatenate((x_valid_covid, x_valid_C_IEEE), axis = 0)

x_test_covid = np.concatenate((x_test_covid, x_test_C_IEEE), axis = 0)

Checking the sizes of training set, validation set and test set with respect to patient condition.

In [None]:
print('       train valid test')
print('normal', len(x_train_normal), '  ', len(x_valid_normal), '  ', len(x_test_normal))
print('pneum ', len(x_train_pneu), '  ', len(x_valid_pneu), '  ', len(x_test_pneu))
print('covid ', len(x_train_covid), ' ', len(x_valid_covid), ' ', len(x_test_covid))

       train valid test
normal 56    14    20
pneum  56    14    20
covid  430   109   134
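
The counts above can be checked by hand. The sketch below assumes that `train_test_split` rounds the test share up to a whole number of samples and gives the remainder to the training side, which is consistent with the printed table. `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up to a whole image."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# COVID_IEEE covid images: 536 -> (train + valid) / test, then train / valid
ieee_rest, ieee_test = split_sizes(536)
ieee_train, ieee_valid = split_sizes(ieee_rest)

# Covid19-dataset covid training images: 111 -> train / valid (its test set has 26 images)
c19_train, c19_valid = split_sizes(111)

covid_counts = (ieee_train + c19_train, ieee_valid + c19_valid, ieee_test + 26)
normal_counts = split_sizes(70) + (20,)

print('covid ', covid_counts)
print('normal', normal_counts)
```

Under that assumption the covid row comes out as 430, 109 and 134 and the normal row as 56, 14 and 20, matching the output of the cell above.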


Checking the shapes of the training set, validation set and test set image arrays with respect to patient condition.

In [None]:
print('Shape of the image arrays')
print('             train                valid                test')
print('normal', x_train_normal.shape, '  ', x_valid_normal.shape, '  ', x_test_normal.shape)
print('pneum ', x_train_pneu.shape, '  ', x_valid_pneu.shape, '  ', x_test_pneu.shape)
print('covid ', x_train_covid.shape, ' ', x_valid_covid.shape, ' ', x_test_covid.shape)

Shape of the image arrays
             train                valid                test
normal (56, 256, 256, 3)    (14, 256, 256, 3)    (20, 256, 256, 3)
pneum  (56, 256, 256, 3)    (14, 256, 256, 3)    (20, 256, 256, 3)
covid  (430, 256, 256, 3)   (109, 256, 256, 3)   (134, 256, 256, 3)


## Saving the images

Creating a folder where the images can be saved.

In [None]:
image_folder_path = "X-ray_image_classification/Covid-19_dataset_new_split"
os.mkdir(image_folder_path)

Defining a function that saves a set of images into a folder as numbered PNG files.

In [None]:
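def saveImages(image_path, image_set):
  # Create the target folder (and any missing parents) without failing if it already exists
  os.makedirs(image_path, exist_ok=True)

  for index, image in enumerate(image_set):
    # Image.fromarray interprets a (height, width, 3) uint8 array as RGB
    Image.fromarray(image).save(f"{image_path}/image{index}.png")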
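
The sections that follow create one folder per split and one folder per class inside it. As a sketch of an alternative, the whole tree could be created in one pass. `make_split_folders` and its default arguments are illustrative names chosen here, not part of the notebook.

```python
import os
import tempfile

def make_split_folders(root, splits=('train', 'valid', 'test'),
                       classes=('normal', 'pneumonia', 'covid')):
    """Create root/<split>/<class> for every combination and return the paths."""
    paths = []
    for split in splits:
        for cls in classes:
            path = os.path.join(root, split, cls)
            # exist_ok=True makes the call safe to repeat
            os.makedirs(path, exist_ok=True)
            paths.append(path)
    return paths

# Demonstrated in a temporary directory so nothing is written to the project folder
with tempfile.TemporaryDirectory() as tmp:
    created = make_split_folders(os.path.join(tmp, 'Covid-19_dataset_new_split'))
    print(len(created))  # prints 9
```

This is an alternative to separate `os.mkdir` calls, not an addition to them: `os.mkdir` raises `FileExistsError` when the folder already exists.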

### Saving training set images

Creating a folder for the training set images.

In [None]:
image_folder_path = "X-ray_image_classification/Covid-19_dataset_new_split/train"
os.mkdir(image_folder_path)

Saving the training set images to their folder.

In [None]:
image_path = "X-ray_image_classification/Covid-19_dataset_new_split/train/normal"
saveImages(image_path, x_train_normal)

image_path = "X-ray_image_classification/Covid-19_dataset_new_split/train/pneumonia"
saveImages(image_path, x_train_pneu)

image_path = "X-ray_image_classification/Covid-19_dataset_new_split/train/covid"
saveImages(image_path, x_train_covid)

### Saving validation set images

Creating a folder for the validation set images.

In [None]:
image_folder_path = "X-ray_image_classification/Covid-19_dataset_new_split/valid"
os.mkdir(image_folder_path)

Saving the validation set images to their folder.

In [None]:
image_path = "X-ray_image_classification/Covid-19_dataset_new_split/valid/normal"
saveImages(image_path, x_valid_normal)

image_path = "X-ray_image_classification/Covid-19_dataset_new_split/valid/pneumonia"
saveImages(image_path, x_valid_pneu)

image_path = "X-ray_image_classification/Covid-19_dataset_new_split/valid/covid"
saveImages(image_path, x_valid_covid)

### Saving test set images

Creating a folder for the test set images.

In [None]:
image_folder_path = "X-ray_image_classification/Covid-19_dataset_new_split/test"
os.mkdir(image_folder_path)

Saving the test set images to their folder.

In [None]:
image_path = "X-ray_image_classification/Covid-19_dataset_new_split/test/normal"
saveImages(image_path, x_test_normal)

image_path = "X-ray_image_classification/Covid-19_dataset_new_split/test/pneumonia"
saveImages(image_path, x_test_pneu)

image_path = "X-ray_image_classification/Covid-19_dataset_new_split/test/covid"
saveImages(image_path, x_test_covid)
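
After saving, it can be useful to confirm that the number of files on disk matches the array sizes checked earlier. The following is a minimal sketch using only the standard library. `count_files` is a hypothetical helper, and the demonstration runs on a small temporary tree with empty placeholder files rather than on the real images.

```python
import os
import tempfile

def count_files(root):
    """Return {split: {class: number of files}} for a root/<split>/<class> tree."""
    counts = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        counts[split] = {}
        for cls in sorted(os.listdir(split_dir)):
            cls_dir = os.path.join(split_dir, cls)
            if os.path.isdir(cls_dir):
                counts[split][cls] = len(os.listdir(cls_dir))
    return counts

# Self-contained demonstration on a tiny temporary tree
with tempfile.TemporaryDirectory() as tmp:
    for split, cls, n in [('train', 'covid', 3), ('test', 'covid', 1)]:
        folder = os.path.join(tmp, split, cls)
        os.makedirs(folder)
        for i in range(n):
            # Empty placeholder file standing in for a saved image
            open(os.path.join(folder, f'image{i}.png'), 'w').close()
    demo_counts = count_files(tmp)

print(demo_counts)
```

Called on `"X-ray_image_classification/Covid-19_dataset_new_split"`, the result should agree with the size table printed earlier, for example 430 files in `train/covid`.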