# **Data Collection**

## Objectives

* Fetch the breast ultrasound image dataset from Kaggle and prepare the data for analysis and modelling.

## Inputs

* Kaggle JSON file - authentication token.

## Outputs

* Generate dataset: inputs/breast_cancer_dataset

## Additional Comments

* Clean the data and split the dataset into train, validation and test sets.


---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir("/workspace/breast-cancer-detection")
print("You set a new current directory.")

In [None]:
current_dir = os.getcwd()
current_dir

# Install Kaggle

In [None]:
# install kaggle package
%pip install kaggle

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
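
Before downloading, it can help to confirm that the token is in place and readable only by its owner. The sketch below is one way to do that on a POSIX system; `check_kaggle_token` is a hypothetical helper name and is not part of the Kaggle package.

```python
import os
import stat


def check_kaggle_token(config_dir):
    """Return True if kaggle.json exists in config_dir with owner-only permissions."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    # True only when group and others have no read, write or execute bits
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

Calling `check_kaggle_token(os.getcwd())` after the `chmod 600` step should return `True` if the file was copied into the working directory.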

In [None]:
KaggleDatasetPath = "aryashah2k/breast-ultrasound-images-dataset"
DestinationFolder = "inputs/breast_cancer_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
import zipfile
with zipfile.ZipFile(DestinationFolder + "/breast-ultrasound-images-dataset.zip", "r") as zip_ref:
    zip_ref.extractall(DestinationFolder)
os.remove(DestinationFolder + "/breast-ultrasound-images-dataset.zip")
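
The unzip step above could also be wrapped in a small function that checks the archive before extracting it. This is a minimal sketch using only the standard library; `extract_and_remove` is a hypothetical name, and the integrity check with `testzip()` is an addition that the cell above does not perform.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the first corrupt member name, or None if all are fine
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # The archive is only deleted after a successful extraction
    os.remove(zip_path)
    return names
```

Returning the member names makes it easy to inspect which folders the archive created before moving on to cleaning.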

---

# Data Cleaning
### Check and remove non-image files

In [None]:
def remove_non_image(data_dir):
    """Delete non-image files and mask images from each class folder in data_dir."""
    image_extension = (".png", ".jpg", ".jpeg", ".tiff")
    for folder in os.listdir(data_dir):
        folder_path = os.path.join(data_dir, folder)
        # Skip stray files at the top level; only class folders are scanned
        if not os.path.isdir(folder_path):
            continue
        image = []
        non_image = []
        masks = []
        for file in os.listdir(folder_path):
            file_path = os.path.join(folder_path, file)
            if not file.lower().endswith(image_extension):
                non_image.append(file)
                os.remove(file_path)
            # Remove the masks (ground truth)
            elif "_mask" in file.lower():
                masks.append(file)
                os.remove(file_path)
            else:
                image.append(file)
        print(f"Folder: {folder} - image files kept: {len(image)}")
        print(f"Folder: {folder} - mask files removed: {len(masks)}")
        print(f"Folder: {folder} - non-image files removed: {len(non_image)}")



In [None]:
remove_non_image(data_dir="inputs/breast_cancer_dataset/ultrasound_images")
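
Because the cleaning step deletes files, a dry run can be useful first. The sketch below applies the same rules (extension check, then the `_mask` check) to a list of file names without touching the disk. `classify_files` and the example file names are hypothetical, chosen only for illustration.

```python
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tiff")


def classify_files(file_names):
    """Sort file names into images, masks and non-images without deleting anything."""
    summary = {"image": [], "mask": [], "non_image": []}
    for name in file_names:
        lowered = name.lower()
        if not lowered.endswith(IMAGE_EXTENSIONS):
            summary["non_image"].append(name)
        elif "_mask" in lowered:
            # Mask images are ground-truth segmentations, not model inputs
            summary["mask"].append(name)
        else:
            summary["image"].append(name)
    return summary


# Hypothetical file names, for illustration only
example = classify_files(["benign (1).png", "benign (1)_mask.png", "notes.txt"])
print({kind: len(names) for kind, names in example.items()})
```

Passing `os.listdir(...)` for a class folder to `classify_files` shows what would be kept and removed before the destructive call is made.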

---

# Data Preparation

### Split data into train, validation and test sets

In [None]:
import math
import os
import random
import shutil

def split_image_dataset(data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # Compare with a tolerance: floating-point ratios may not sum to exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("The set ratios should sum up to 1!")
        return
    
    # To get the folder name
    labels = os.listdir(data_dir) 
    if "test" in labels:
        pass
    else:
        for folder in ["train", "validation", "test"]:
            # To create a sub folder for classes within train/valid/test folders
            for label in labels:
                os.makedirs(name=data_dir + "/" + folder + "/" + label)

        for label in labels:
            files = os.listdir(data_dir + "/" + label)
            random.shuffle(files)

            # Number of files for each set, rounded down to whole files
            train_files_qty = int(len(files) * train_set_ratio)
            validation_files_qty = int(len(files) * validation_set_ratio)

            count = 1

            for file in files:
                if count <= train_files_qty:
                    shutil.move(data_dir + "/" + label + "/" + file,
                                data_dir + "/train/" + label + "/" + file)
                elif count <= (train_files_qty + validation_files_qty):
                    shutil.move(data_dir + "/" + label + "/" + file,
                                data_dir + "/validation/" + label + "/" + file)
                else:
                    shutil.move(data_dir + "/" + label + "/" + file,
                                data_dir + "/test/" + label + "/" + file)
                count += 1
            
            os.rmdir(data_dir + "/" + label)


In [None]:
my_data_dir = "inputs/breast_cancer_dataset/ultrasound_images"
split_image_dataset(data_dir=my_data_dir, 
                    train_set_ratio=0.7, 
                    validation_set_ratio=0.15, 
                    test_set_ratio=0.15)
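
After the split, the number of files in each set can be checked per label. The following is a minimal sketch that assumes the `train/validation/test` folder layout created above; `count_split_files` is a hypothetical helper name.

```python
import os


def count_split_files(data_dir):
    """Return {split: {label: file_count}} for the train, validation and test folders."""
    counts = {}
    for split in ("train", "validation", "test"):
        split_dir = os.path.join(data_dir, split)
        # Skip splits that do not exist, for example before the split has run
        if not os.path.isdir(split_dir):
            continue
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return counts
```

Calling `count_split_files(my_data_dir)` should show roughly 70%, 15% and 15% of each label's files in the three sets.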

---

# Conclusion

The data was cleaned by removing non-image files (if there were any) and the mask images (files with "_mask" in their names). It was then split into three sets (train, validation and test) in a 0.7 : 0.15 : 0.15 ratio.
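
To make the ratio concrete, the sketch below works out how many files of one label would land in each set. It assumes the train and validation counts are rounded down to whole files and that the test set receives the remainder. `expected_split_sizes` and the file count of 100 are hypothetical.

```python
def expected_split_sizes(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) file counts for one label."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    # Whatever is left after train and validation goes to the test set
    test = n_files - train - validation
    return train, validation, test


# Hypothetical label with 100 files, split 0.7 / 0.15 / 0.15
print(expected_split_sizes(100, 0.7, 0.15))
```

Because the counts are rounded down, the test set can end up slightly larger than its nominal share when the file count does not divide evenly.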

### Next Steps

Next, the three classes will be compared by calculating the mean and standard deviation of the images in each class, which give the average image and the image variability. An image montage will also be created to display the images on the dashboard.
