## Dataset Creator
### __Some Background__
- The data comes from two different datasets.
- The COVID +ve X-rays are from https://github.com/ieee8023/covid-chestxray-dataset
- The COVID -ve X-rays are the normal lung samples from Kaggle: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

- We need to extract the relevant images from each source.

### DATA PREP :: 1 - For COVID +ve Samples

In [1]:
import pandas as pd

In [2]:
file_path = "./covid-chestxray-dataset-master/covid-chestxray-dataset-master/metadata.csv"
image_path = "./covid-chestxray-dataset-master/covid-chestxray-dataset-master/images/"

In [3]:
## We need to select particular images using the csv file
df = pd.read_csv(file_path)

In [4]:
df.head()

Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,...,date,location,folder,filename,doi,url,license,clinical_notes,other_notes,Unnamed: 29
0,2,0.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 22, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
1,2,3.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 25, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
2,2,5.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 27, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
3,2,6.0,M,65.0,Pneumonia/Viral/COVID-19,Y,Y,N,N,N,...,"January 28, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
4,4,0.0,F,52.0,Pneumonia/Viral/COVID-19,Y,,N,N,N,...,"January 25, 2020","Changhua Christian Hospital, Changhua City, Ta...",images,nejmc2001573_f1a.jpeg,10.1056/NEJMc2001573,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,diffuse infiltrates in the bilateral lower lungs,,


In [5]:
## We will copy the COVID +ve images into the Covid dir inside the Dataset folder
import os

# Creating the Covid folder (and the parent Dataset folder, if it is missing)
target_dir = "Dataset/Covid"

if not os.path.exists(target_dir):
    os.makedirs(target_dir)
    print("Covid folder created")


Covid folder created


In [6]:
## Total Covid images available
cnt = (df["finding"] == "Pneumonia/Viral/COVID-19").sum()

print(cnt)

584


- Not all X-rays have the same orientation.
- We only want the X-rays taken in the frontal, posteroanterior (PA) view.
- The orientation of each image is stored in another column called "view".
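
As a quick illustration of the filtering idea described above, here is a minimal sketch using a boolean mask on a toy DataFrame. The rows in `toy_df` are made up for illustration and are not taken from `metadata.csv`; only the column names `finding`, `view` and `filename` come from the notebook.

```python
import pandas as pd

# Toy stand-in for metadata.csv; these rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "finding": [
            "Pneumonia/Viral/COVID-19",
            "Pneumonia/Viral/COVID-19",
            "Other",
            "Pneumonia/Viral/COVID-19",
        ],
        "view": ["PA", "AP", "PA", "L"],
        "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    }
)

# Mask selecting the COVID-19 rows
is_covid = toy_df["finding"] == "Pneumonia/Viral/COVID-19"

# How many COVID-19 images exist for each view
view_counts = toy_df.loc[is_covid, "view"].value_counts()

# Keep only the COVID-19 rows taken in the PA view
covid_pa = toy_df[is_covid & (toy_df["view"] == "PA")]

print(view_counts.to_dict())
print(covid_pa["filename"].tolist())
```

Running `value_counts()` on the real `view` column before filtering is a cheap way to see how many images each orientation would contribute.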

In [7]:
## Total Covid images available with PA view
cnt = ((df["finding"] == "Pneumonia/Viral/COVID-19") & (df["view"] == "PA")).sum()

print(cnt)

196


In [8]:
## Transferring the images to the Covid folder

import shutil

## Counter for the copied images
cnt = 0

for (i,row) in df.iterrows():
    
    if row["finding"] == "Pneumonia/Viral/COVID-19" and row["view"] == "PA":
        
        file_name = row["filename"]
        
        # src
        img_src = os.path.join(image_path, file_name)
        
        # dst
        img_dst = os.path.join(target_dir, file_name)
        
        # Transferring a copy of the image
        shutil.copy2(img_src, img_dst)
        
        # Logs
        print("Image Moved Successfully", cnt)
        
        cnt += 1


Image Moved Successfully 0
Image Moved Successfully 1
Image Moved Successfully 2
Image Moved Successfully 3
Image Moved Successfully 4
Image Moved Successfully 5
Image Moved Successfully 6
Image Moved Successfully 7
Image Moved Successfully 8
Image Moved Successfully 9
Image Moved Successfully 10
Image Moved Successfully 11
Image Moved Successfully 12
Image Moved Successfully 13
Image Moved Successfully 14
Image Moved Successfully 15
Image Moved Successfully 16
Image Moved Successfully 17
Image Moved Successfully 18
Image Moved Successfully 19
Image Moved Successfully 20
Image Moved Successfully 21
Image Moved Successfully 22
Image Moved Successfully 23
Image Moved Successfully 24
Image Moved Successfully 25
Image Moved Successfully 26
Image Moved Successfully 27
Image Moved Successfully 28
Image Moved Successfully 29
Image Moved Successfully 30
Image Moved Successfully 31
Image Moved Successfully 32
Image Moved Successfully 33
Image Moved Successfully 34
Image Moved Successfully 35
...
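
The copy loop above assumes every file listed in the metadata is present on disk. A more defensive variant might skip missing files and report them instead of raising an error. The sketch below is one way to do that; the helper name `copy_images` and the temporary demo files are hypothetical and not part of the original notebook.

```python
import os
import shutil
import tempfile


def copy_images(file_names, src_dir, dst_dir):
    """Copy each named file from src_dir to dst_dir; return (copied, missing)."""
    os.makedirs(dst_dir, exist_ok=True)
    copied, missing = [], []
    for name in file_names:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            # copy2 copies the file contents and metadata; the source stays in place
            shutil.copy2(src, os.path.join(dst_dir, name))
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing


# Demo on a throwaway directory with two fake image files
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "images")
    dst_dir = os.path.join(tmp, "Dataset", "Covid")
    os.makedirs(src_dir)
    for name in ("x1.jpg", "x2.jpg"):
        with open(os.path.join(src_dir, name), "wb") as fh:
            fh.write(b"fake image bytes")

    copied, missing = copy_images(["x1.jpg", "x2.jpg", "absent.jpg"], src_dir, dst_dir)
    copied_on_disk = sorted(os.listdir(dst_dir))

print(copied, missing)
```

In the notebook, the same helper could be called with the filtered `filename` values, `image_path` and `target_dir`.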

### DATA PREP :: 2 - For COVID -ve / Normal Samples

In [9]:
# Creating the Normal folder
target_dir_normal = "Dataset/Normal"

if not os.path.exists(target_dir_normal):
    os.makedirs(target_dir_normal)
    print("Normal folder created")

Normal folder created


In [10]:
## The Kaggle dataset has many more images than we have COVID +ve samples
## Therefore, we will select a random subset so that both classes stay balanced

import random

kaggle_file_path = "./archive/chest_xray/train/NORMAL/"


In [11]:
## Image names in Kaggle's NORMAL folder
image_names = os.listdir(kaggle_file_path)

In [12]:
len(image_names)

1341

In [13]:
## Randomly shuffling the images
random.shuffle(image_names)
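
Because `random.shuffle` is called without a seed, each run picks a different subset. If a reproducible selection is wanted, one possible approach is a seeded `random.Random` instance combined with `sample`. The sketch below uses made-up file names in `toy_names` and an arbitrary seed of 42; neither comes from the original notebook.

```python
import random

# Hypothetical file names standing in for os.listdir(kaggle_file_path)
toy_names = [f"normal_{i:04d}.jpeg" for i in range(1341)]

# os.listdir returns names in arbitrary order, so sort first to make
# the seeded sample independent of the filesystem ordering
rng = random.Random(42)  # seed chosen arbitrarily for illustration
selected = rng.sample(sorted(toy_names), k=196)

print(len(selected), len(set(selected)))
```

`sample` draws without replacement, so the 196 names are guaranteed to be distinct.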

In [14]:
## Copying only 196 images, to match the number of COVID +ve PA images
## Note: image_names was shuffled above, so the first 196 names form a random sample

for i in range(196):
    
    img_name = image_names[i]
    
    # src
    img_src = os.path.join(kaggle_file_path, img_name)
    
    # dst
    img_dst = os.path.join(target_dir_normal, img_name)
    
    # Transferring a copy of the image
    shutil.copy2(img_src, img_dst)

    # Logs
    print("Image Moved Successfully", i)

Image Moved Successfully 0
Image Moved Successfully 1
Image Moved Successfully 2
Image Moved Successfully 3
Image Moved Successfully 4
Image Moved Successfully 5
Image Moved Successfully 6
Image Moved Successfully 7
Image Moved Successfully 8
Image Moved Successfully 9
Image Moved Successfully 10
Image Moved Successfully 11
Image Moved Successfully 12
Image Moved Successfully 13
Image Moved Successfully 14
Image Moved Successfully 15
Image Moved Successfully 16
Image Moved Successfully 17
Image Moved Successfully 18
Image Moved Successfully 19
Image Moved Successfully 20
Image Moved Successfully 21
Image Moved Successfully 22
Image Moved Successfully 23
Image Moved Successfully 24
Image Moved Successfully 25
Image Moved Successfully 26
Image Moved Successfully 27
Image Moved Successfully 28
Image Moved Successfully 29
Image Moved Successfully 30
Image Moved Successfully 31
Image Moved Successfully 32
Image Moved Successfully 33
Image Moved Successfully 34
Image Moved Successfully 35
...
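
Once both copy steps have run, it can be worth confirming that the two class folders hold the same number of images. The sketch below shows one way to check this; `count_images` is a hypothetical helper, and the demo builds temporary folders rather than reading the real `Dataset/Covid` and `Dataset/Normal` directories.

```python
import os
import tempfile


def count_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Count files in folder whose names end with one of the given extensions."""
    return sum(
        1 for name in os.listdir(folder) if name.lower().endswith(extensions)
    )


# Demo: two temporary class folders with three fake images each
with tempfile.TemporaryDirectory() as tmp:
    counts = {}
    for class_name, n_files in (("Covid", 3), ("Normal", 3)):
        class_dir = os.path.join(tmp, class_name)
        os.makedirs(class_dir)
        for i in range(n_files):
            open(os.path.join(class_dir, f"img_{i}.jpeg"), "wb").close()
        # A non-image file that the count should ignore
        open(os.path.join(class_dir, "notes.txt"), "w").close()
        counts[class_name] = count_images(class_dir)

balanced = counts["Covid"] == counts["Normal"]
print(counts, balanced)  # {'Covid': 3, 'Normal': 3} True
```

In the notebook, calling `count_images(target_dir)` and `count_images(target_dir_normal)` should both return 196 if every copy succeeded and the folders were empty beforehand.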