In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
    #for filename in filenames:
        #print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from glob import glob

my_glob = glob('../input/data/*/images/*.png')
img_path = {os.path.basename(x): x for x in my_glob}

print(len(my_glob))

# How did we get the balanced dataset csv file?
Our group first used the provided sample dataset, which contains $5606$ images and $244$ classes in total. However, when we counted the unique labels, we could see in this table that many classes have only one image, so the dataset is highly imbalanced. Because images with the ‘No finding’ label include not only X-rays of healthy chests but also diseases that could not be detected, we ignored this class in our project and selected the next 15 classes for training.

To deal with the imbalance, we tried a method called **class_weights**: each class is mapped to a weight, and the weights are used to adjust the loss function during training. 
However, when we added class_weights to the training process, the training loss was lower but the test loss still fluctuated strongly. The confusion matrix showed that the model predicted almost all X-ray images as the one disease with the most data, and it performed very poorly on the classes with few images.
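
As a rough illustration of the class-weight idea, the sketch below derives one weight per class from the label counts using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy label list are assumptions made for illustration; the notebook does not show how the weights were actually computed.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy, made-up label list: one frequent class and two rare ones
toy_labels = ["Infiltration"] * 8 + ["Nodule"] * 1 + ["Pneumothorax"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on rare classes contribute more to the loss. This only rescales the loss and does not add information about the rare classes.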

We then went back to the full dataset, which contains $112120$ images and $836$ classes in total, so there is plenty of room to choose a subset. 
For this 4-class classification, we chose the four diseases 'Pneumothorax', 'Atelectasis', 'Nodule' and 'Infiltration', took 1000 images per class to balance the dataset, and generated a new CSV file.
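
A minimal sketch of how such a balanced subset could be drawn with pandas is shown below. It assumes that only rows whose 'Finding Labels' value is exactly one of the chosen diseases are kept, which is our assumption and may differ from how the CSV was really built. The helper name `sample_balanced` and the toy table are hypothetical.

```python
import pandas as pd

def sample_balanced(df, classes, n_per_class, seed=0):
    """Keep single-label rows for `classes` and draw n_per_class rows from each."""
    single = df[df["Finding Labels"].isin(classes)]
    return (single.groupby("Finding Labels")
                  .sample(n=n_per_class, random_state=seed)
                  .reset_index(drop=True))

# Toy metadata table standing in for the full CSV (values are made up)
toy = pd.DataFrame({
    "Image Index": [f"img_{i}.png" for i in range(10)],
    "Finding Labels": ["Nodule"] * 3 + ["Atelectasis"] * 4
                      + ["Nodule|Atelectasis", "No Finding", "Infiltration"],
})
balanced = sample_balanced(toy, ["Nodule", "Atelectasis"], n_per_class=2)
print(balanced["Finding Labels"].value_counts())
```

With the real metadata, `n_per_class` would be 1000 and `classes` the four chosen diseases. `sample` raises an error if a class has fewer rows than requested.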

In [None]:
xray_data = pd.read_csv('../input/4diseases/4diseases(1).csv')
xray_data

# Assign labels to images

To handle this multi-label classification problem, we considered two methods. The first classifies each disease one by one (a binary classification per disease); the second uses a long vector to represent the category.
 
Note: The first method, classifying diseases one by one, relies on an assumption: the features of A|B (disease A combined with disease B) are simply the features of disease A plus those of disease B. If, from a medical point of view, the features of A|B differ substantially from that sum, this method cannot be used.
 
Our results with the first method were very poor, and we think this is mainly due to:

(1)    The imbalanced ratio between class 0 (diseases other than disease A, the one we want to detect) and class 1 (disease A).

(2)    Class 0 contains many diseases whose features are complex and vary a lot, so it is hard for the model to learn features that distinguish class 0 from class 1.
 
For the second method, representing the labels with a long vector, we think that using one entry per category works well, especially when there are only a few categories, because it prevents the model from creating new combinations of diseases on its own. If one entry per disease were used instead, the model would be confused by the similar features shared by different diseases and could not predict correctly.
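
One way to read the two vector encodings is sketched below: "disease per entry" as a multi-hot vector with one slot per disease, and "category per entry" as a one-hot vector with one slot per whole label string. The function names and this interpretation are assumptions made for illustration.

```python
DISEASES = ["Pneumothorax", "Atelectasis", "Nodule", "Infiltration"]

def disease_per_entry(finding, diseases=DISEASES):
    """Multi-hot vector: entry i is 1.0 if diseases[i] appears in the '|'-separated label."""
    present = set(finding.split("|"))
    return [1.0 if d in present else 0.0 for d in diseases]

def category_per_entry(finding, categories):
    """One-hot vector: exactly one entry is 1.0, the one matching the whole label string."""
    if finding not in categories:
        raise ValueError(f"unknown category: {finding!r}")
    return [1.0 if c == finding else 0.0 for c in categories]

multi_hot = disease_per_entry("Atelectasis|Nodule")
one_hot = category_per_entry("Nodule", DISEASES)
print(multi_hot)  # [0.0, 1.0, 1.0, 0.0]
print(one_hot)    # [0.0, 0.0, 1.0, 0.0]
```

A multi-hot output can light up any subset of diseases, including combinations never seen in training, while a one-hot output can only select one of the listed categories.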

In [None]:
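labels = ['Pneumothorax', 'Atelectasis', 'Nodule', 'Infiltration']
# One indicator column per disease: 1.0 if the disease name appears in 'Finding Labels', else 0.0
for label in labels:
    xray_data[label] = xray_data['Finding Labels'].map(lambda result: 1.0 if label in result else 0.0)
# Collect the four indicator columns into one label vector per image
xray_data['target_vector'] = xray_data.apply(lambda target: [target[labels].values], 1).map(lambda target: target[0])
xray_data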

# RGB2Gray, Image dimension adjustment & Image selection

### RGB2GRAY
After visualizing our data, we found that none of the X-ray images are in color. Yet each image still consists of 3 matrices; for example, one original image has shape (1024, 1024, 3). Our professor reminded us that we can use RGB2GRAY to reduce the 3 matrices to 1, since the 3 matrices are identical for each image. This saves memory and reduces the amount of computation later. We include one example in the photo file to show that there is no difference between the images before and after applying RGB2GRAY.
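
The claim that the three matrices are identical can be checked directly before dropping two of them. The sketch below does this with NumPy on a synthetic stand-in image; the helper name `to_single_channel` is hypothetical, and a real check would run on arrays returned by `cv2.imread`.

```python
import numpy as np

def to_single_channel(img):
    """Return one channel of an (H, W, 3) image after checking all three channels match."""
    if img.ndim != 3 or img.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) array")
    if not (np.array_equal(img[..., 0], img[..., 1]) and
            np.array_equal(img[..., 1], img[..., 2])):
        raise ValueError("channels differ; a weighted conversion would be needed")
    return img[..., 0].copy()

# Synthetic stand-in for an X-ray: one random plane repeated three times
rng = np.random.default_rng(0)
plane = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
fake_xray = np.stack([plane] * 3, axis=-1)
gray = to_single_channel(fake_xray)
print(fake_xray.nbytes, gray.nbytes)
```

The single-channel array uses one third of the memory of the 3-channel original.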

### Image dimension adjustment
At the very beginning, we resized our images directly to $128 \times 128$ pixels without further consideration, simply because we thought the original images were too large. However, our professor reminded us that the resized images may lose many features, since X-ray images contain a lot of detail. After discussion, we made a plan for adjusting the image dimensions.

Since the original image dimension of $1024 \times 1024$ is too large, we think resizing is a good idea as long as it does not lose too many image features. Determining the pixel size requires a rigorous process.

First, we chose an example image and visualized it at different sizes. The three images are in our photo file, named 128, 512 and 1024. By eye, the $128 \times 128$ image does not look very clear, while the $512 \times 512$ image does not seem to differ much from the $1024 \times 1024$ image.

Having made a preliminary judgement by eye, we then need to train our models on images of different dimensions to see how the image dimension affects model performance. Only then can we decide which pixel size to use.

To do so, we need to prepare datasets with different image sizes. The code below is one example of preparing the npz files (X and Y) used to train the models later. This dataset contains 4 disease categories: 'Pneumothorax', 'Atelectasis', 'Nodule' and 'Infiltration'. Each category has 1000 images, and the image size is $128 \times 128$.


In [None]:
import cv2
def proc_images():
    """Load the selected images as 128x128 grayscale arrays with their label vectors."""
    x = [] # images as arrays
    y = [] # labels of 'Pneumothorax', 'Atelectasis', 'Nodule' and 'Infiltration'
    WIDTH = 128
    HEIGHT = 128
    # Build the lookup once instead of scanning the column for every file
    selected = set(xray_data['Image Index'].values)

    for img in my_glob:
        base = os.path.basename(img)
        if base in selected:
            full_size_image = cv2.imread(img)  # loaded as a 3-channel BGR array
            image_grey = cv2.cvtColor(full_size_image, cv2.COLOR_BGR2GRAY)
            x.append(cv2.resize(image_grey, (WIDTH,HEIGHT), interpolation=cv2.INTER_CUBIC))
            #x.append(full_size_image)
            ylabel = xray_data["target_vector"][xray_data["Image Index"] == base].values
            y.append(ylabel)
    return x,y

We successfully prepared datasets with image sizes $128 \times 128$ and $512 \times 512$. However, when we tried to prepare a dataset at $1024 \times 1024$, Kaggle warned us that our notebook tried to allocate more memory than is available, even though we only intended to include 1000 images in total. Since we also need enough data in each category, we decided to give up on the original image size due to equipment constraints.
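
A back-of-the-envelope estimate shows how quickly the array size grows with image dimension. The sketch below assumes single-channel images, 4000 images for the two smaller sizes (4 classes with 1000 images each) and 1000 images at full size; the dtypes and the helper name `dataset_bytes` are illustrative assumptions.

```python
def dataset_bytes(n_images, side, channels=1, bytes_per_value=1):
    """Size in bytes of n_images square images stored as one dense array."""
    return n_images * side * side * channels * bytes_per_value

MIB = 1024 ** 2
configs = [(4000, 128), (4000, 512), (1000, 1024)]
for n_images, side in configs:
    u8 = dataset_bytes(n_images, side)                      # uint8 grayscale
    f32 = dataset_bytes(n_images, side, bytes_per_value=4)  # float32 grayscale
    print(f"{n_images} images at {side}x{side}: "
          f"{u8 / MIB:.1f} MiB as uint8, {f32 / MIB:.1f} MiB as float32")
```

These figures are only a lower bound for the final array. Peak usage is higher, because each full-size image is first read with 3 channels and `np.array(x)` copies the list of images into a new array while the list is still alive.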

We then used the two datasets to train our own model. For 2-category classification, as the confusion matrices on PPT page 20 show, the dataset with image size $512 \times 512$ performs better than the one with $128 \times 128$. 
However, when we moved to 3-category classification and wanted to use the $512 \times 512$ dataset to train more complex models, a Resource Exhausted Error occurred: the OOM killer killed our training process. We tried three ways to work around the OOM killer, but considering memory cost, time cost and equipment constraints, we finally decided to use the $128 \times 128$ dataset and tried our best to make our model perform well.

# Convert PNG to NPZ

In [None]:
x,y = proc_images()
x=np.array(x)
y=np.array(y)

In [None]:
#np.savez("x4000128", x)
#np.savez("y4000128", y)
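
For reference, a small round trip through an `.npz` file might look like the sketch below. The array shapes, the file name `xy_demo.npz` and the keys `x` and `y` are illustrative assumptions, not the files produced by this notebook.

```python
import os
import tempfile
import numpy as np

# Small stand-ins for the image and label arrays (shapes are illustrative)
x_demo = np.zeros((5, 128, 128), dtype=np.uint8)
y_demo = np.eye(4, dtype=np.float32)[[0, 1, 2, 3, 0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xy_demo.npz")
    # Keyword arguments become the names of the arrays inside the archive
    np.savez_compressed(path, x=x_demo, y=y_demo)
    with np.load(path) as data:
        x_back, y_back = data["x"], data["y"]

print(x_back.shape, y_back.shape)
```

When an array is passed positionally, as in `np.savez("x4000128", x)`, NumPy appends the `.npz` extension and stores the array under the key `arr_0`.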