# In this notebook, we show an additional idea that we tried but did not apply in our training; it can serve as a supplement to our work.

### Preparation: get the data and create the dataset

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [2]:
import torch
device = 'cpu'
if torch.cuda.device_count() > 0 and torch.cuda.is_available():
    print("Cuda installed! Running on GPU!")
    device = 'cuda'
else:
    print("No GPU available!")

Cuda installed! Running on GPU!


In [2]:
import os
os.chdir("/content/gdrive/MyDrive/acse4")
!unzip acse4-ml-2020.zip -d "/content/sample_data/"

Streaming output truncated to the last 5000 lines.
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6485.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6486.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6487.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6488.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6489.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-649.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6490.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6491.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6492.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6493.png  
  inflating: /content/sample_data/xray-data/xray-data/train/normal/Normal-6494.png  
 

In [29]:
from torchvision.datasets import ImageFolder 
import torchvision.transforms as tt
import torch
import progressbar
from torch.utils.data import WeightedRandomSampler, DataLoader

In [7]:
# define the augmentation used to initialize the training set
train_transform = tt.Compose([
    tt.RandomHorizontalFlip(0.5),
    tt.CenterCrop((224, 224)), # crop the central 224x224 region (this does not rescale the image)
    tt.ToTensor(),
])

In [9]:
os.chdir("/content/sample_data/")
# define the different folder paths
data_directory = os.listdir('./xray-data/')
data_dir = "./xray-data/xray-data/"
train_dir = os.path.join(data_dir, "train/")

# load the data from the train folder
train_data = ImageFolder(train_dir, train_transform)

# **Weighted Sampling for training** 

[**Note:** In practice, we found that weighted sampling is time-consuming and does not greatly improve performance, so we turned it off for most training runs. To turn it on, paste the following cells into the main notebook and pass `shuffle=False, sampler=train_sampler` to the train `DataLoader`.]

For an unbalanced dataset we can create a weighted sampler. First, count how many samples of each class are present in the training set.

For convenience, I didn't swap the keys generated by `ImageFolder` in this notebook, which means the keys are:<br>

keys = {
"covid":0,
"lung_opacity":1,
"normal":2,
"pneumonia":3
}



In [11]:
label_dict = {}
cnt_progress = 0
total = len(train_data)
bar=progressbar.ProgressBar(maxval=total)
for (img, label) in train_data:
    if label in label_dict:
        label_dict[label] += 1
    else:
        label_dict[label] = 1
    cnt_progress+=1
    bar.update(cnt_progress)
bar.finish()
label_dict

100% (20215 of 20215) |##################| Elapsed Time: 0:00:46 Time:  0:00:46


{0: 3454, 1: 5742, 2: 9734, 3: 1285}
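The counting loop above decodes every image only to read its label. A minimal sketch of a cheaper count, assuming the labels are available as a plain list of integers (torchvision's `ImageFolder` keeps such a list in `train_data.targets`); `targets` below is a stand-in rebuilt from the counts printed above, so the snippet runs without the image files:

```python
from collections import Counter

# Stand-in for `train_data.targets` (assumption: one integer class id per image).
# It is rebuilt here from the counts reported above so no images are needed.
targets = [0] * 3454 + [1] * 5742 + [2] * 9734 + [3] * 1285

# Count the labels directly; no image is opened or transformed.
label_dict = dict(sorted(Counter(targets).items()))
print(label_dict)
```

In the notebook itself, replacing the stand-in list with `train_data.targets` should give the same dictionary without the 46-second pass over the dataset.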

In [16]:
# calculate the relative frequency of each class; returns a list with one entry per class, indexed by label
def calculateWeights(label_dict, train_data):
    arr = [0] * len(label_dict)
    for label, count in label_dict.items():
        # fraction of the training set that belongs to this class
        arr[label] = count / len(train_data)
    return arr

In [19]:
print("labels_count_dict:", label_dict)
weights = calculateWeights(label_dict, train_data) 
weights = torch.DoubleTensor(weights)
print("corresponding weights for labels:", weights)
weights_inv = 1. / weights.float()
print("the reciprocal of weights for labels:", weights_inv)

labels_count_dict: {0: 3454, 1: 5742, 2: 9734, 3: 1285}
corresponding weights for labels: tensor([0.1709, 0.2840, 0.4815, 0.0636], dtype=torch.float64)
the reciprocal of weights for labels: tensor([ 5.8526,  3.5206,  2.0767, 15.7315])
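To see why the reciprocal is the right weight, note that each class contributes `count * (total / count) = total` to the overall weight, so every class ends up with the same share of the sampling probability. A small worked check in plain Python, using only the counts printed above (the names `class_mass` and `class_prob` are illustrative, not part of the notebook):

```python
# Class counts reported by the counting cell above.
label_dict = {0: 3454, 1: 5742, 2: 9734, 3: 1285}
total = sum(label_dict.values())

# Inverse-frequency weight of one sample of each class: total / count.
weights_inv = {label: total / count for label, count in label_dict.items()}

# Total weight carried by each class = (number of samples) * (weight per sample).
class_mass = {label: count * weights_inv[label] for label, count in label_dict.items()}

# Normalize to get the probability that a single draw lands in each class.
norm = sum(class_mass.values())
class_prob = {label: mass / norm for label, mass in class_mass.items()}

for label in label_dict:
    print(label, round(weights_inv[label], 4), round(class_prob[label], 4))
```

Each class should come out with probability 0.25, regardless of how many images it contains.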


In [21]:
train_samples_weight = torch.zeros(len(train_data)).float()

The following cell, which assigns a weight to every single sample, takes about 1 minute to run; maybe a better approach exists?

In [22]:
cnt_progress = 0
bar=progressbar.ProgressBar(maxval=total)
for i, (_, label) in enumerate(train_data):
    train_samples_weight[i] = weights_inv[int(label)].clone()
    cnt_progress+=1
    bar.update(cnt_progress)
bar.finish()

100% (20215 of 20215) |##################| Elapsed Time: 0:00:46 Time:  0:00:46
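One possibly faster route is to look the weights up from the label list instead of iterating over the dataset, since the loop spends its time decoding images it never uses. The sketch below is written in plain Python under the assumption that the labels are available as a list (as in `train_data.targets` for `ImageFolder`); `targets` is a stand-in rebuilt from the counts above:

```python
from collections import Counter

# Stand-in for `train_data.targets` (assumption: one integer class id per image).
targets = [0] * 3454 + [1] * 5742 + [2] * 9734 + [3] * 1285

# Inverse class frequency, indexed by label, as computed earlier in the notebook.
counts = Counter(targets)
weights_inv = [len(targets) / counts[c] for c in sorted(counts)]

# Per-sample weight: a plain lookup by label, with no image loading involved.
train_samples_weight = [weights_inv[t] for t in targets]

# With tensors the same lookup is a single indexing operation, e.g.
# weights_inv[torch.as_tensor(train_data.targets)] when weights_inv is a tensor.
print(len(train_samples_weight), train_samples_weight[0], train_samples_weight[-1])
```

The resulting list can be passed to `WeightedRandomSampler` in place of the tensor filled by the loop.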


In [23]:
train_samples_weight

tensor([ 5.8526,  5.8526,  5.8526,  ..., 15.7315, 15.7315, 15.7315])

In [27]:
train_sampler = WeightedRandomSampler(train_samples_weight, len(train_data), replacement = True)

### Passing the sampler to a DataLoader to observe its effect
<br> To see how our `train_sampler` is used, we can pass it to a `DataLoader` and observe how many samples of each class (label) it produces in each batch.


In [34]:
train_loader = DataLoader(train_data, batch_size=64, shuffle=False, sampler = train_sampler, num_workers=0)

With a successful run, you should see that the 4 labels appear in roughly equal numbers in every batch.

In [35]:
cnt = 0
for i, (x, y) in enumerate(train_loader):
    print(r"batch number {}, label 0/1/2/3: {}/{}/{}/{}".format(i, (y == 0).sum(), (y == 1).sum(), (y == 2).sum(), (y == 3).sum()))
    cnt += 1
    if (cnt == 25): break

batch number 0, label 0/1/2/3: 10/16/22/16
batch number 1, label 0/1/2/3: 11/19/19/15
batch number 2, label 0/1/2/3: 17/15/11/21
batch number 3, label 0/1/2/3: 13/19/19/13
batch number 4, label 0/1/2/3: 13/22/11/18
batch number 5, label 0/1/2/3: 21/20/10/13
batch number 6, label 0/1/2/3: 14/16/10/24
batch number 7, label 0/1/2/3: 13/18/13/20
batch number 8, label 0/1/2/3: 14/19/17/14
batch number 9, label 0/1/2/3: 16/16/19/13
batch number 10, label 0/1/2/3: 20/18/10/16
batch number 11, label 0/1/2/3: 15/24/11/14
batch number 12, label 0/1/2/3: 20/14/18/12
batch number 13, label 0/1/2/3: 15/20/19/10
batch number 14, label 0/1/2/3: 12/14/21/17
batch number 15, label 0/1/2/3: 20/14/13/17
batch number 16, label 0/1/2/3: 15/12/14/23
batch number 17, label 0/1/2/3: 18/21/12/13
batch number 18, label 0/1/2/3: 19/18/14/13
batch number 19, label 0/1/2/3: 20/23/7/14
batch number 20, label 0/1/2/3: 16/23/11/14
batch number 21, label 0/1/2/3: 16/18/18/12
batch number 22, label 0/1/2/3: 17/19/16/12
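The spread in these counts is what one would expect from sampling noise: with four equally likely classes and a batch of 64, each class has an expected count of 64 / 4 = 16 and a standard deviation of about sqrt(64 * 0.25 * 0.75) ≈ 3.5. The sketch below reproduces the behaviour in plain Python, using `random.choices` as a stand-in for weighted sampling with replacement; it is not how `WeightedRandomSampler` is implemented, and `targets` is again rebuilt from the counts above rather than read from the dataset:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Stand-in for the label list of the training set (assumption).
targets = [0] * 3454 + [1] * 5742 + [2] * 9734 + [3] * 1285
counts = Counter(targets)

# One weight per sample, inversely proportional to the size of its class.
sample_weights = [1.0 / counts[t] for t in targets]

# Draw one epoch of indices with replacement, proportionally to the weights.
indices = random.choices(range(len(targets)), weights=sample_weights, k=len(targets))

# Class composition of the first few batches of 64.
batch_size = 64
for b in range(3):
    batch = indices[b * batch_size:(b + 1) * batch_size]
    per_class = Counter(targets[i] for i in batch)
    print("batch", b, [per_class.get(c, 0) for c in range(4)])

# Over the whole epoch every class should be drawn about len(targets) / 4 times.
epoch_counts = Counter(targets[i] for i in indices)
print(dict(sorted(epoch_counts.items())))
```

Because sampling is with replacement, images from the smallest class are repeated many times per epoch while some images from the largest class are never drawn.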

Compare this with the case without weighted sampling:

In [36]:
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=0)

In [37]:
cnt = 0
for i, (x, y) in enumerate(train_loader):
    print(r"batch number {}, label 0/1/2/3: {}/{}/{}/{}".format(i, (y == 0).sum(), (y == 1).sum(), (y == 2).sum(), (y == 3).sum()))
    cnt += 1
    if (cnt == 25): break

batch number 0, label 0/1/2/3: 10/20/27/7
batch number 1, label 0/1/2/3: 15/19/29/1
batch number 2, label 0/1/2/3: 11/15/33/5
batch number 3, label 0/1/2/3: 10/17/32/5
batch number 4, label 0/1/2/3: 10/20/30/4
batch number 5, label 0/1/2/3: 11/18/33/2
batch number 6, label 0/1/2/3: 13/13/35/3
batch number 7, label 0/1/2/3: 15/18/29/2
batch number 8, label 0/1/2/3: 10/19/34/1
batch number 9, label 0/1/2/3: 10/18/35/1
batch number 10, label 0/1/2/3: 13/12/35/4
batch number 11, label 0/1/2/3: 13/18/28/5
batch number 12, label 0/1/2/3: 11/24/28/1
batch number 13, label 0/1/2/3: 11/20/29/4
batch number 14, label 0/1/2/3: 15/18/28/3
batch number 15, label 0/1/2/3: 15/23/24/2
batch number 16, label 0/1/2/3: 10/14/39/1
batch number 17, label 0/1/2/3: 17/16/26/5
batch number 18, label 0/1/2/3: 12/20/27/5
batch number 19, label 0/1/2/3: 12/12/32/8
batch number 20, label 0/1/2/3: 5/27/28/4
batch number 21, label 0/1/2/3: 9/20/32/3
batch number 22, label 0/1/2/3: 14/19/26/5
batch number 23, label 