# Google Colab Setup

This notebook explains and provides the prep code needed to run the project properly in Google Colab. You will need your Kaggle username and your Kaggle API key. When running the project itself, you might also need to decrease the patience for the more complex models; note that doing so will affect the end results.
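
To illustrate why lowering the patience changes the results, here is a minimal sketch. It assumes "patience" refers to early-stopping patience, meaning the number of consecutive epochs without improvement in validation loss that are tolerated before training stops. The function name and the loss values are hypothetical, and the logic shows the general idea rather than the exact behavior of any particular library.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value seen so far. If that
    never happens, every epoch runs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, for illustration only
losses = [0.90, 0.70, 0.72, 0.71, 0.65, 0.66, 0.67, 0.68]

print(epochs_until_stop(losses, patience=2))  # stops before the 0.65 epoch is reached
print(epochs_until_stop(losses, patience=3))  # trains long enough to reach 0.65
```

With the smaller patience, training ends before the model reaches its best validation loss, which is the trade-off to keep in mind when shortening runs in Colab.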

## Download the Data

This section prepares the Google Colab environment and downloads the dataset from Kaggle. As noted above, you will need to have your Kaggle username and API key on hand.

In [None]:
# Imports 

import json
import os
import shutil

import numpy as np
np.random.seed(88)

import tensorflow as tf
tf.random.set_seed(88)

In [None]:
# Prep Google Colab environment to download data from Kaggle
# -p avoids an error if the folder already exists (e.g. when re-running this cell)
!mkdir -p ~/.kaggle
!touch ~/.kaggle/kaggle.json

username = ''  ## Your Kaggle username
api_key = ''  ## Your Kaggle API key

api_token = {"username": username,
             "key": api_key}

# Write the credentials to the same file created above (~ is /root in Colab)
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(api_token, file)

!chmod 600 ~/.kaggle/kaggle.json
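
The same setup can also be done without shell commands. The sketch below is one possible pure-Python version of the cell above; `write_kaggle_credentials` and its `config_dir` parameter are hypothetical names introduced here for illustration, not part of the Kaggle tooling.

```python
import json
import os
import stat


def write_kaggle_credentials(username, api_key, config_dir=None):
    """Write kaggle.json with owner-only permissions and return its path."""
    if config_dir is None:
        # Default location used by the cell above: ~/.kaggle
        config_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
    os.makedirs(config_dir, exist_ok=True)  # safe to re-run
    path = os.path.join(config_dir, "kaggle.json")
    with open(path, "w") as file:
        json.dump({"username": username, "key": api_key}, file)
    # Equivalent of `chmod 600`: read/write for the owner only
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


# Example call (fill in your own values before running):
# write_kaggle_credentials(username, api_key)
```

Keeping the function free of side effects until it is called makes it safe to define in a shared notebook without overwriting an existing credentials file.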

In [None]:
# Download the dataset from Kaggle
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

In [1]:
# This cell unzips the downloaded data
shutil.unpack_archive('chest-xray-pneumonia.zip', '/content')

## Preview the Data

The original dataset comes presorted into train/validation/test folders. However, less than 1% of the data was assigned to the validation folder. For the purposes of this project, approximately 10% of the images from the train set are moved to the validation set to get closer to an 80/10/10 train/validation/test split.

The section below investigates the file organization of the downloaded data. Running this code will give you a better understanding of the project. If you are already comfortable with the layout, feel free to skip to the next section, where the images are moved from the train folder to the validation folder.

In [None]:
# View structure of the raw downloaded data
print(os.listdir('./chest_xray'))

# View structure of train folder
print(os.listdir('./chest_xray/train'))

In [None]:
# Count the files in the train/NORMAL folder
print(len(os.listdir('./chest_xray/train/NORMAL')))

In [None]:
len_normal_train = len(os.listdir('./chest_xray/train/NORMAL'))
len_pneu_train = len(os.listdir('./chest_xray/train/PNEUMONIA'))
len_total_train = len_normal_train + len_pneu_train

print("There are", len_normal_train, "normal xrays in the training set.")
print("There are", len_pneu_train, "pneumonia xrays in the training set.")
print("There are", len_total_train, "images total in the training set.\n")

print('Target Distribution:')
print('{}% normal'.format(round(len_normal_train/len_total_train * 100, 2)))
print('{}% pneumonia'.format(round(len_pneu_train/len_total_train * 100, 2)))

In [None]:
len_normal_val = len(os.listdir('./chest_xray/val/NORMAL'))
len_pneu_val = len(os.listdir('./chest_xray/val/PNEUMONIA'))
len_total_val = len_normal_val + len_pneu_val

len_normal_test = len(os.listdir('./chest_xray/test/NORMAL'))
len_pneu_test = len(os.listdir('./chest_xray/test/PNEUMONIA'))
len_total_test = len_normal_test + len_pneu_test

print("There are", len_total_val, "images total in the validation set.")
print("There are", len_total_test, "images total in the test set.")

In [2]:
num_images_total = len_total_train + len_total_val + len_total_test
print('Using {}% of data to train'.format(round(len_total_train / num_images_total *100,2)))
print('Using {}% of data to validate'.format(round(len_total_val / num_images_total *100,2)))
print('Using {}% of data to test'.format(round(len_total_test / num_images_total *100,2)))
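
The counting cells above repeat the same pattern for each folder. As a hedged alternative, the sketch below collects the counts in one pass; `count_images` and `split_percentages` are hypothetical helper names, and the folder and class names are taken from the paths used above. Like the cells above, it counts every file in each folder, not only image files.

```python
import os


def count_images(root, splits=("train", "val", "test"),
                 classes=("NORMAL", "PNEUMONIA")):
    """Return {split: {class_name: file_count}} for the dataset layout."""
    return {
        split: {
            name: len(os.listdir(os.path.join(root, split, name)))
            for name in classes
        }
        for split in splits
    }


def split_percentages(counts):
    """Return {split: percent of all files}, rounded to 2 decimals."""
    totals = {split: sum(per_class.values())
              for split, per_class in counts.items()}
    grand_total = sum(totals.values())
    return {split: round(total / grand_total * 100, 2)
            for split, total in totals.items()}


# Only runs if the dataset has been unzipped into the working directory
if os.path.isdir('./chest_xray'):
    counts = count_images('./chest_xray')
    print(counts)
    print(split_percentages(counts))
```

Re-running this after the images are moved is a quick way to confirm that the split is closer to 80/10/10.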

## Move Pictures

As noted above, less than 1% of all the data was assigned to the validation folder. Because an 80/10/10 split is more desirable for this project, run the cells below to move about 10% of the data from the train folder to the validation folder.

In [None]:
# About how many images to move from each class:
# 10% of the total train data, split evenly between normal and pneumonia
len_total_train * .1  / 2
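
The arithmetic in the cell above can be wrapped in a small function that returns a whole number of images. This is only a sketch: `images_to_move_per_class` is a hypothetical name, and the total of 5216 is an illustrative training-set size chosen because it yields the 260 images per class used in the sampling step; substitute `len_total_train` in the notebook.

```python
def images_to_move_per_class(total_train, fraction=0.10, num_classes=2):
    """Number of images to move out of each class folder.

    Takes `fraction` of the train set, splits it evenly across the
    classes, and rounds down to a whole image.
    """
    return int(total_train * fraction / num_classes)


# Illustrative total only; use len_total_train in the notebook
print(images_to_move_per_class(5216))
```

Note that moving the same number of images from each class means the validation set is roughly balanced even though the train set is not.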

In [None]:
# Create sorted lists of file names for train normal + train pneumonia
# (os.listdir returns names in arbitrary order, so sorting keeps the
# seeded random selection below reproducible)

normal_train_images = sorted(os.listdir('./chest_xray/train/NORMAL'))
pneu_train_images = sorted(os.listdir('./chest_xray/train/PNEUMONIA'))

# Randomly choose indices for 260 images per class
# (each about 5% of the total train data)
normal_inds = np.random.choice(range(len_normal_train), size=260, replace=False)
pneu_inds = np.random.choice(range(len_pneu_train), size=260, replace=False)

In [None]:
# Move chosen images to validation folders

for i in normal_inds:
    image = normal_train_images[i]
    origin = './chest_xray/train/NORMAL/' + image
    destination = './chest_xray/val/NORMAL/' + image
    shutil.move(origin, destination)
    
for i in pneu_inds:
    image = pneu_train_images[i]
    origin = './chest_xray/train/PNEUMONIA/' + image
    destination = './chest_xray/val/PNEUMONIA/' + image
    shutil.move(origin, destination)
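
The two loops above differ only in their folder names, so they could be folded into one helper. The sketch below is an assumption-laden alternative rather than the notebook's own method: `move_random_sample` is a hypothetical name, and it swaps `np.random.choice` for the standard library's `random.Random.sample`, so it will not select the same files as the cells above even with the same seed.

```python
import os
import random
import shutil


def move_random_sample(src_dir, dst_dir, n, seed=88):
    """Move n randomly chosen files from src_dir to dst_dir.

    Returns the list of file names that were moved.
    """
    # Sort first so a fixed seed always picks the same files
    files = sorted(os.listdir(src_dir))
    if n > len(files):
        raise ValueError(
            "Cannot move {} files; only {} in {}".format(n, len(files), src_dir)
        )
    chosen = random.Random(seed).sample(files, n)
    os.makedirs(dst_dir, exist_ok=True)
    for name in chosen:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen


# Example calls (paths follow the layout used above):
# move_random_sample('./chest_xray/train/NORMAL', './chest_xray/val/NORMAL', 260)
# move_random_sample('./chest_xray/train/PNEUMONIA', './chest_xray/val/PNEUMONIA', 260)
```

Because the helper lists the folder at call time, running it a second time moves a further `n` files instead of failing on names that are no longer in the train folder, so it should still be run only once per class.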

## Next Steps

Now your Google Colab environment is set up to run the project! The full project can be found on [GitHub](https://github.com/Bella3s/xray_image_classification/blob/main/index.ipynb).