# PP5 - ML Brain Tumor Detector

## Notebook 1 - Data Collection

### Objectives

* Fetch the dataset from Kaggle and prepare it for further processing.


### Inputs

* `kaggle.json`: the Kaggle authentication token, placed in the project root.


### Outputs

| **output**      |          |       |
|-----------------|----------|-------|
| **train/**      | no_tumor | tumor |
| **test/**       | no_tumor | tumor |
| **validation/** | no_tumor | tumor |


### Additional Comments

* Dataset: [Kaggle](https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri?select=Training)
* License: [MIT](https://www.mit.edu/~amini/LICENSE.md)

---

### Import packages

In [1]:
%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory has to be changed.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\tobis\\Documents\\GitHub\\ml-brain-tumor-detection\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\tobis\\Documents\\GitHub\\ml-brain-tumor-detection'
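
Re-running the `os.chdir` cell above moves one level further up each time. The sketch below is one way to make the step safe to repeat: it only changes directory while the current folder is still named `jupyter_notebooks` (the folder name shown in the output above). The helper name `move_to_parent` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_parent(expected_name="jupyter_notebooks"):
    """Change to the parent folder only if we are still inside the notebooks folder."""
    current = Path.cwd()
    if current.name == expected_name:
        os.chdir(current.parent)
    # Return the working directory after the (possible) change
    return Path.cwd()
```

Calling `move_to_parent()` a second time leaves the working directory unchanged, because the project root is no longer named `jupyter_notebooks`.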

## Get data from Kaggle

**Install Kaggle**

In [5]:
%pip install kaggle==1.6.8

Note: you may need to restart the kernel to use updated packages.


**Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON**

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
os.chmod("kaggle.json", 0o600)
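
Before calling the Kaggle CLI it can help to confirm that the token file is actually present and readable. A minimal sketch, assuming `kaggle.json` sits in the current working directory and contains the `username` and `key` fields of a Kaggle API token; the helper name `kaggle_token_ok` is hypothetical.

```python
import json
import os


def kaggle_token_ok(config_dir="."):
    """Return True if kaggle.json exists in config_dir and holds both credential fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, "r", encoding="utf-8") as handle:
            token = json.load(handle)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file counts as "not OK"
        return False
    return isinstance(token, dict) and bool(token.get("username")) and bool(token.get("key"))
```

Note that on Windows `os.chmod` can only toggle the read-only flag, so the `0o600` call mainly has an effect on Linux and macOS.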

**Set the Kaggle dataset and download it**

In [7]:
KaggleDatasetPath = "sartajbhuvaji/brain-tumor-classification-mri"
DestinationFolder = "input/"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading brain-tumor-classification-mri.zip to input




 

**Unzip the file and delete the zip archive.**

In [8]:
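import zipfile

# Build the path with os.path.join; DestinationFolder already ends with a slash
zip_path = os.path.join(DestinationFolder, 'brain-tumor-classification-mri.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

# Remove the archive once its contents are extracted
os.remove(zip_path)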
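
If the download was interrupted, the archive may be incomplete. The sketch below is one way to guard the extraction step: it checks the archive with `zipfile.is_zipfile()` and `ZipFile.testzip()` before extracting, and deletes it only after extraction succeeded. The helper name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Raises zipfile.BadZipFile if the archive is missing, truncated or corrupt.
    """
    if not zipfile.is_zipfile(zip_path):
        raise zipfile.BadZipFile(f"{zip_path} is not a readable zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        zip_ref.extractall(destination)
    # Only reached when extraction finished without an exception
    os.remove(zip_path)
```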

---

## Prepare the Data

**Create folder structure**

In [9]:
import shutil

In [10]:
testing_folder = 'input/Testing'
training_folder = 'input/Training'
parent_folder = 'input'

In [11]:
def create_folder_structure(parent_folder):
    brain_mri_scans_folder = os.path.join(parent_folder, 'brain-mri-scans')
    os.makedirs(brain_mri_scans_folder, exist_ok=True)
    
    tumor_folder = os.path.join(brain_mri_scans_folder, 'tumor')
    no_tumor_folder = os.path.join(brain_mri_scans_folder, 'no_tumor')
    os.makedirs(tumor_folder, exist_ok=True)
    os.makedirs(no_tumor_folder, exist_ok=True)

create_folder_structure(parent_folder)

**Merge pre-split folders (no_tumor)**

In [12]:
def move_and_rename_no_tumor_images(testing_folder, training_folder, parent_folder):
    no_tumor_destination = os.path.join(parent_folder, 'brain-mri-scans', 'no_tumor')
    
    counter = 1
    
    def rename_file(file_path):
        nonlocal counter
        file_name, file_ext = os.path.splitext(file_path)
        new_file_name = f'no_tumor_{counter}{file_ext}'
        os.rename(file_path, os.path.join(os.path.dirname(file_path), new_file_name))
        counter += 1
    
    testing_no_tumor_folder = os.path.join(testing_folder, 'no_tumor')
    if os.path.exists(testing_no_tumor_folder):
        for file in os.listdir(testing_no_tumor_folder):
            source_file_path = os.path.join(testing_no_tumor_folder, file)
            shutil.move(source_file_path, no_tumor_destination)
            rename_file(os.path.join(no_tumor_destination, file))
    
    training_no_tumor_folder = os.path.join(training_folder, 'no_tumor')
    if os.path.exists(training_no_tumor_folder):
        for file in os.listdir(training_no_tumor_folder):
            source_file_path = os.path.join(training_no_tumor_folder, file)
            shutil.move(source_file_path, no_tumor_destination)
            rename_file(os.path.join(no_tumor_destination, file))

move_and_rename_no_tumor_images(testing_folder, training_folder, parent_folder)

**Merge pre-split folders (glioma_tumor)**

In [13]:
def move_and_rename_glioma_tumor_images(testing_folder, training_folder, parent_folder):
    # Glioma scans are merged into the shared 'tumor' class
    tumor_destination = os.path.join(parent_folder, 'brain-mri-scans', 'tumor')

    counter = 1

    def rename_file_glioma(file_path):
        # Rename the moved file to glioma_<n>.<ext> using a running counter
        nonlocal counter
        _, file_ext = os.path.splitext(file_path)
        new_file_name = f'glioma_{counter}{file_ext}'
        os.rename(file_path, os.path.join(os.path.dirname(file_path), new_file_name))
        counter += 1

    testing_glioma_tumor_folder = os.path.join(testing_folder, 'glioma_tumor')
    if os.path.exists(testing_glioma_tumor_folder):
        for file in os.listdir(testing_glioma_tumor_folder):
            source_file_path = os.path.join(testing_glioma_tumor_folder, file)
            shutil.move(source_file_path, tumor_destination)
            rename_file_glioma(os.path.join(tumor_destination, file))

    training_glioma_tumor_folder = os.path.join(training_folder, 'glioma_tumor')
    if os.path.exists(training_glioma_tumor_folder):
        for file in os.listdir(training_glioma_tumor_folder):
            source_file_path = os.path.join(training_glioma_tumor_folder, file)
            shutil.move(source_file_path, tumor_destination)
            rename_file_glioma(os.path.join(tumor_destination, file))

move_and_rename_glioma_tumor_images(testing_folder, training_folder, parent_folder)

**Merge pre-split folders (meningioma_tumor)**

In [14]:
def move_and_rename_meningioma_tumor_images(testing_folder, training_folder, parent_folder):
    # Meningioma scans are merged into the shared 'tumor' class
    tumor_destination = os.path.join(parent_folder, 'brain-mri-scans', 'tumor')

    counter = 1

    def rename_file_meningioma(file_path):
        # Rename the moved file to meningioma_<n>.<ext> using a running counter
        nonlocal counter
        _, file_ext = os.path.splitext(file_path)
        new_file_name = f'meningioma_{counter}{file_ext}'
        os.rename(file_path, os.path.join(os.path.dirname(file_path), new_file_name))
        counter += 1

    testing_meningioma_tumor_folder = os.path.join(testing_folder, 'meningioma_tumor')
    if os.path.exists(testing_meningioma_tumor_folder):
        for file in os.listdir(testing_meningioma_tumor_folder):
            source_file_path = os.path.join(testing_meningioma_tumor_folder, file)
            shutil.move(source_file_path, tumor_destination)
            rename_file_meningioma(os.path.join(tumor_destination, file))

    training_meningioma_tumor_folder = os.path.join(training_folder, 'meningioma_tumor')
    if os.path.exists(training_meningioma_tumor_folder):
        for file in os.listdir(training_meningioma_tumor_folder):
            source_file_path = os.path.join(training_meningioma_tumor_folder, file)
            shutil.move(source_file_path, tumor_destination)
            rename_file_meningioma(os.path.join(tumor_destination, file))

move_and_rename_meningioma_tumor_images(testing_folder, training_folder, parent_folder)

**Merge pre-split folders (pituitary_tumor)**

In [15]:
def move_and_rename_pituitary_tumor_images(testing_folder, training_folder, parent_folder):
    # Pituitary scans are merged into the shared 'tumor' class
    tumor_destination = os.path.join(parent_folder, 'brain-mri-scans', 'tumor')

    counter = 1

    def rename_file_pituitary(file_path):
        # Rename the moved file to pituitary_<n>.<ext> using a running counter
        nonlocal counter
        _, file_ext = os.path.splitext(file_path)
        new_file_name = f'pituitary_{counter}{file_ext}'
        os.rename(file_path, os.path.join(os.path.dirname(file_path), new_file_name))
        counter += 1

    testing_pituitary_tumor_folder = os.path.join(testing_folder, 'pituitary_tumor')
    if os.path.exists(testing_pituitary_tumor_folder):
        for file in os.listdir(testing_pituitary_tumor_folder):
            source_file_path = os.path.join(testing_pituitary_tumor_folder, file)
            shutil.move(source_file_path, tumor_destination)
            rename_file_pituitary(os.path.join(tumor_destination, file))

    training_pituitary_tumor_folder = os.path.join(training_folder, 'pituitary_tumor')
    if os.path.exists(training_pituitary_tumor_folder):
        for file in os.listdir(training_pituitary_tumor_folder):
            source_file_path = os.path.join(training_pituitary_tumor_folder, file)
            shutil.move(source_file_path, tumor_destination)
            rename_file_pituitary(os.path.join(tumor_destination, file))

move_and_rename_pituitary_tumor_images(testing_folder, training_folder, parent_folder)
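
The four merge cells differ only in the source label, the destination class and the filename prefix. As a sketch of how they could be folded into a single helper (the name `merge_label_folders` and its parameters are hypothetical and not part of the original notebook):

```python
import os
import shutil


def merge_label_folders(source_roots, label, destination, prefix):
    """Move every file from <root>/<label> into destination as <prefix>_<n><ext>.

    Returns the number of files moved.
    """
    os.makedirs(destination, exist_ok=True)
    counter = 0
    for root in source_roots:
        label_folder = os.path.join(root, label)
        if not os.path.isdir(label_folder):
            continue  # this split has no folder for the label
        for file in sorted(os.listdir(label_folder)):
            counter += 1
            file_ext = os.path.splitext(file)[1]
            target = os.path.join(destination, f"{prefix}_{counter}{file_ext}")
            # Move and rename in one step, so no intermediate name is needed
            shutil.move(os.path.join(label_folder, file), target)
    return counter
```

It could then be called once per label, for example `merge_label_folders([testing_folder, training_folder], 'glioma_tumor', 'input/brain-mri-scans/tumor', 'glioma')`.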

**Delete the existing "Testing" and "Training" folders from input**

In [16]:
def delete_testing_and_training_folders(testing_folder, training_folder):
    if os.path.exists(testing_folder):
        shutil.rmtree(testing_folder)
    if os.path.exists(training_folder):
        shutil.rmtree(training_folder)

delete_testing_and_training_folders(testing_folder, training_folder)

In [17]:
os.listdir('input/brain-mri-scans')

['no_tumor', 'tumor']
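
Three tumor types were merged into one `tumor` class while `no_tumor` stays on its own, so the two classes may differ considerably in size. A quick count makes this visible before splitting. A minimal sketch; the helper name `count_files_per_label` is hypothetical:

```python
import os


def count_files_per_label(data_dir):
    """Return {label: number of files} for each sub-folder of data_dir."""
    counts = {}
    for label in sorted(os.listdir(data_dir)):
        label_folder = os.path.join(data_dir, label)
        if os.path.isdir(label_folder):
            counts[label] = len(os.listdir(label_folder))
    return counts
```

In this notebook it would be called as `count_files_per_label('input/brain-mri-scans')`.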

---

## Split train/validation/test set

In [18]:
import os
import shutil
import random

In [19]:
# function taken from the CI walkthrough project (malaria detector) and adapted for this project
def split_data_images(input_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    '''
    Description:
    Splits dataset in train, validation and test sets

    Parameters:
    input_dir: input directory containing the images
    train_set_ratio: ratio for images included in train set
    validation_set_ratio: ratio for images included in validation set
    test_set_ratio: ratio for images included in test set

    Returns:
    None

    '''
    # Compare with a tolerance: exact float equality is fragile for sums of ratios
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    labels = os.listdir(input_dir)
    if 'test' in labels:
        pass
    else:
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=input_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(input_dir + '/' + label)
            random.shuffle(files)

            train_set_size = int(len(files) * train_set_ratio)
            validation_set_size = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_size:
                    shutil.move(input_dir + '/' + label + '/' + file_name,
                                input_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_size + validation_set_size):
                    shutil.move(input_dir + '/' + label + '/' + file_name,
                                input_dir + '/validation/' + label + '/' + file_name)

                else:
                    shutil.move(input_dir + '/' + label + '/' + file_name,
                                input_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(input_dir + '/' + label)


The image dataset is split per label using the following ratios:
+ Training set 0.7
+ Validation set 0.1
+ Test set 0.2
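
In `split_data_images` the train and validation sizes are computed with `int(...)`, which rounds down, and the test set receives every remaining file. The sketch below reproduces just that arithmetic so the resulting sizes can be checked for a given folder size; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) sizes as split_data_images would assign them."""
    train_set_size = int(n_files * train_set_ratio)            # rounded down
    validation_set_size = int(n_files * validation_set_ratio)  # rounded down
    # The test set is simply whatever is left over
    test_set_size = n_files - train_set_size - validation_set_size
    return train_set_size, validation_set_size, test_set_size
```

The three sizes always add up to `n_files`, so no image is dropped; any rounding surplus lands in the test set, which can therefore be slightly larger than its nominal ratio.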

In [20]:
split_data_images(input_dir="input/brain-mri-scans/",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )

In [21]:
os.listdir('input/brain-mri-scans/')

['test', 'train', 'validation']
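
`random.shuffle` is called without a seed, so each fresh run of the notebook assigns different images to each set. If a reproducible split is wanted, one option is to shuffle with a dedicated, seeded `random.Random` instance. A sketch under that assumption; the helper name `shuffled_copy` and the seed value are arbitrary choices:

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a shuffled copy of file_names that is identical for a given seed."""
    rng = random.Random(seed)      # private generator, leaves the global state alone
    shuffled = sorted(file_names)  # sort first: os.listdir order is not guaranteed
    rng.shuffle(shuffled)
    return shuffled
```

Inside `split_data_images`, the line `random.shuffle(files)` could then be replaced by `files = shuffled_copy(files)`.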