# Dataset Pre-Processing
### This notebook pre-processes the dataset before it is used in the binary and multiclass tasks of the assessment
### It converts the original raw dataset and label CSV files into appropriate formats for both tasks
#### To Do:
#### Add a description of the destination folders and files created for the binary and multiclass notebooks to use

In [1]:
#Importing required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
#Debug to check tensorflow version
print(tf.__version__)

Using TensorFlow backend.


2.1.0


### Exploring the dataset

In [2]:
#Loading the CSV Label file
mri_scan_labels = pd.read_csv('./dataset/label.csv')
print(mri_scan_labels.head())

# 3000 labels for 3000 images
mri_scan_labels.shape

# The MRI images are stored as IMAGE_0000.jpg to IMAGE_2999.jpg: 3000 images, each with its own target label

        file_name             label
0  IMAGE_0000.jpg  meningioma_tumor
1  IMAGE_0001.jpg          no_tumor
2  IMAGE_0002.jpg  meningioma_tumor
3  IMAGE_0003.jpg      glioma_tumor
4  IMAGE_0004.jpg  meningioma_tumor


(3000, 2)

In [3]:
#.value_counts counts how many rows carry each unique value, here for the label column
mri_scan_labels.value_counts(['label'])

#We can see that there are 4 unique labels: 3 types of tumor and 1 for no tumor.
#Thus there are 4 distinct target classes present in the label.csv file.
#These classes are used later to build the binary and multiclass label sets for model training and testing

label           
glioma_tumor        860
meningioma_tumor    855
pituitary_tumor     831
no_tumor            454
dtype: int64
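The counts above also show how unbalanced the binary task will be once the three tumor types are merged into one class. A minimal sketch of that arithmetic follows; the counts are copied from the output above, and the names `class_counts`, `class_share` and `tumor_share` are illustrative only, not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts output above.
class_counts = pd.Series(
    {
        "glioma_tumor": 860,
        "meningioma_tumor": 855,
        "pituitary_tumor": 831,
        "no_tumor": 454,
    }
)

total = class_counts.sum()                   # total number of labelled images
class_share = class_counts / total           # fraction of images per class
tumor_share = 1 - class_share["no_tumor"]    # fraction labelled as any tumor type

print(class_share.round(3))
print(f"tumor: {tumor_share:.3f}, no_tumor: {class_share['no_tumor']:.3f}")
```

Roughly 85% of the images fall into the merged tumor class, so accuracy alone may be a misleading metric for the binary classifier.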

# Binary Task Dataset Re-labelling

In [4]:
#Taking just the label portion for editing into our Target Y array
Y = mri_scan_labels[['label']]
Y

                 label
0     meningioma_tumor
1             no_tumor
2     meningioma_tumor
3         glioma_tumor
4     meningioma_tumor
...                ...
2995          no_tumor
2996  meningioma_tumor
2997      glioma_tumor
2998      glioma_tumor


### Exploring Dataframe parameter code

In [10]:
#Getting dataframe length parameters; both expressions return the number of rows (3000)
len(Y.index)
Y.shape[0]
#Learning Dataframe indexing
Y.loc[2].at['label']

#Referencing dataframe using integer indexes
Y.iat[2,0]

# The labels are manipulated as a numpy array, so the dataframe is converted to one first
Y_np = Y.to_numpy()
len(Y_np)

#Checks indexing
Y_np[2,0]
#Testing string compare on np array elements
print(Y_np[2] == 'meningioma_tumor')

#Initialises empty array for Y data for binary task
Y_binary = np.zeros(len(Y_np))
print(Y_binary)
print(Y_binary[2])

[ True]
[0. 0. 0. ... 0. 0. 0.]
0.0
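The `[ True]` printed above is a one-element array rather than a plain boolean because `Y` was selected with double brackets, which keeps it two-dimensional. The sketch below illustrates the difference on a small made-up frame; the three rows and the names `Y_2d` and `Y_1d` are assumptions for illustration, not data or variables from the original notebook.

```python
import numpy as np
import pandas as pd

# Small stand-in for label.csv (hypothetical rows, same column names).
labels = pd.DataFrame(
    {
        "file_name": ["IMAGE_0000.jpg", "IMAGE_0001.jpg", "IMAGE_0002.jpg"],
        "label": ["meningioma_tumor", "no_tumor", "meningioma_tumor"],
    }
)

Y_2d = labels[["label"]].to_numpy()   # double brackets -> DataFrame -> shape (3, 1)
Y_1d = labels["label"].to_numpy()     # single brackets -> Series -> shape (3,)

print(Y_2d.shape, Y_1d.shape)
print(Y_2d[2] == "meningioma_tumor")     # one-element boolean array
print(Y_2d[2, 0] == "meningioma_tumor")  # single boolean
print(Y_1d[2] == "meningioma_tumor")     # single boolean
```

Indexing with `[row, 0]`, or selecting the column with single brackets, gives a scalar comparison and avoids relying on the truth value of a one-element array.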


#### The binary task is to build a classifier that identifies whether there is a tumor in an MRI image.
#### Therefore the target labels should be binary: 0 for no tumor and 1 for a tumor in the MRI image
#### The type of tumor is not needed for this task, only whether a tumor is present

In [11]:
#Loops over every element of the label array, in this case 3000
#Each label string is compared with 'no_tumor'.
#If they match, the corresponding element of Y_binary is set to 0
#Otherwise, regardless of the type of tumor, the element is set to 1
#So 0 means no tumor and 1 means a tumor is present in the MRI image.

for x in range(len(Y_np)):
    #Y_np has shape (3000, 1), so [x, 0] picks out the label string itself
    if Y_np[x, 0] == 'no_tumor':
        Y_binary[x] = 0
    else:
        Y_binary[x] = 1
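The same relabelling can also be written without an explicit loop. The following is a minimal sketch of a vectorised alternative, not the method used in this notebook; the five sample labels and the names `loop_binary` and `vector_binary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'label' column of label.csv.
labels = pd.Series(
    ["meningioma_tumor", "no_tumor", "meningioma_tumor", "glioma_tumor", "pituitary_tumor"],
    name="label",
)

# Loop version, mirroring the cell above.
loop_binary = np.zeros(len(labels))
for i, value in enumerate(labels.to_numpy()):
    loop_binary[i] = 0 if value == "no_tumor" else 1

# Vectorised version: the comparison yields booleans, cast to float to match np.zeros.
vector_binary = (labels != "no_tumor").to_numpy().astype(float)

print(vector_binary)
```

Both versions produce the same array; the vectorised form is shorter and avoids per-element Python overhead, which matters little for 3000 labels but more for larger datasets.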

In [12]:
#Shows the resulting numpy array populated with the labels in binary form (compared to string form originally)
Y_binary

array([1., 0., 1., ..., 1., 1., 1.])

In [13]:
#Converts it into a DataFrame for CSV file storage, this is so the subsequent notebook code can access the created label file
#Also shows successful dataset manipulation for Target classes
Y_Binary_Label = pd.DataFrame(Y_binary, columns = ['MRI_Binary_Label'])
Y_Binary_Label

      MRI_Binary_Label
0                  1.0
1                  0.0
2                  1.0
3                  1.0
4                  1.0
...                ...
2995               0.0
2996               1.0
2997               1.0
2998               1.0


In [14]:
#Saves the binary labels locally as Y_Binary_Label.csv; the rows are in the same order as the image file names in label.csv
Y_Binary_Label.to_csv('./dataset/Y_Binary_Label.csv')
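The saved file holds only the row index and the binary label, so the link to each image depends on row order. One possible variation, sketched below under stated assumptions, stores the file name alongside the label and reads the file back to check it. The three sample rows, the integer 0/1 encoding and the temporary directory are assumptions for illustration; the original notebook writes float labels to `./dataset/`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the first rows of label.csv.
labels = pd.DataFrame(
    {
        "file_name": ["IMAGE_0000.jpg", "IMAGE_0001.jpg", "IMAGE_0002.jpg"],
        "label": ["meningioma_tumor", "no_tumor", "meningioma_tumor"],
    }
)

# Keep the file name next to the binary label (integer 0/1 here, an assumption).
binary_labels = pd.DataFrame(
    {
        "file_name": labels["file_name"],
        "MRI_Binary_Label": (labels["label"] != "no_tumor").astype(int),
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Y_Binary_Label.csv")
    # index=False avoids an extra 'Unnamed: 0' column when the file is read back.
    binary_labels.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

If the default index is written instead, as in the cell above, the reader can pass `index_col=0` to `pd.read_csv` to recover it rather than getting an `Unnamed: 0` column.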