# **EDA Simplified: UWM GI Tract Image Segmentation w/ WandB**
![](https://storage.googleapis.com/kaggle-competitions/kaggle/27923/logos/header.png?t=2021-06-02-20-30-25)

## Introduction
Did you know that in 2019 approximately 5 million people worldwide were diagnosed with a cancer of the gastro-intestinal tract? To treat these cancers, radiation oncologists try to deliver high doses of radiation using X-ray beams pointed at tumors while avoiding the stomach and intestines. The good news is that with newer technology such as integrated magnetic resonance imaging and linear accelerator systems, also known as MR-Linacs, oncologists are able to visualize the daily position of the tumor and intestines, which can vary from day to day. In these scans, radiation oncologists must manually outline the position of the stomach and intestines in order to adjust the direction of the X-ray beams, increasing the dose delivered to the tumor while avoiding the stomach and intestines. However, this is a time-consuming and labor-intensive process that can prolong treatments from 15 minutes a day to an hour a day, which can be difficult for patients to tolerate, unless deep learning can help automate the segmentation process. A method to segment the stomach and intestines would make treatments much faster and would allow more patients to get more effective treatment. The purpose of this competition is to track healthy organs in medical scans with image segmentation in order to improve cancer treatment. But before any modeling, we start with EDA!

## Imports
To get started, we import the modules we need for our EDA: the os module for file management, cv2 (OpenCV) for reading and processing images, the glob module for finding files by pattern, numpy as np for linear algebra, pandas as pd for data handling, and tqdm from the tqdm module for progress bars. We also import modules for plotting and image analysis: matplotlib's pyplot submodule as plt, seaborn as sns, plotly's express submodule as px, and finally the Image module from PIL.

In [None]:
import os
import cv2
import glob
import numpy as np
import pandas as pd
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from PIL import Image

## Setup 
To start our EDA, we look at the structure of the competition data. We set the variable ROOT_DIR to the directory that holds the competition data, then list its contents by passing ROOT_DIR to the os.listdir function.

In [None]:
ROOT_DIR = '../input/uw-madison-gi-tract-image-segmentation/'
os.listdir(ROOT_DIR)

As you can see, unlike most previous competitions (like [Feedback Prize](https://www.kaggle.com/code/dinowun/eda-simplified-feedback-prize/notebook)), this competition contains no test samples (oh, and [JPX](https://www.kaggle.com/code/dinowun/eda-simplified-jpx-stock-exchange-prediction-b) is one of them too). Still, there should be some overlap at the case level between the train and test datasets. This is evident from the following line:
>"The goal of this competition is to be able to generalize to both partially and wholly unseen cases."

Next, let's find out the number of case directories! We print the len of the list returned by os.listdir for the path formed by concatenating the ROOT_DIR variable and the 'train' string.

In [None]:
print("# of entities in case dirs:", len(os.listdir(ROOT_DIR + "train")))

As the output shows, there are 85 case directories in the train folder of the competition data. Now let's create dataframes, analyze them one by one, and plot them with our plotting modules!

## Chapter 1: A Basic Start to the train_df DataFrame From Loading the train.csv
Every EDA starts by creating a dataframe from a csv file. Simple, right? We define a variable called train_df and assign it the result of pd.read_csv, called on the concatenation of the ROOT_DIR variable and the "train.csv" string. Next, we print the number of rows in train_df with the len function, and on the following line we display its first 5 rows with the head function.

In [None]:
train_df = pd.read_csv(ROOT_DIR + "train.csv")
print('Length of train_df:', len(train_df))
train_df.head()

After we run the code cell above, we see that there are 115,488 rows in the train_df dataframe! Not only that, when we display the first 5 rows, we clearly see that every entry in the segmentation column is, of course, NaN. Let's find out more about that!

To find out how many NaNs there are in the segmentation column of train_df, we are going to plot with Matplotlib. But first we need the counts of NaN and non-NaN entries, which we get by calling the isna function (and then the notna function) on train_df, followed by the sum function.

In [None]:
train_df.isna().sum()

In [None]:
train_df.notna().sum()

After running the previous two code cells, we see that 81,575 entries in the segmentation column are empty and 33,913 are filled. With that, let's plot them! First, we define a variable called nan_or_no as a list of two numbers, 81575 and 33913 (the counts from above). Next, we define another variable called labels with two strings: "NaN" and "Filled". We create our figure by assigning fig and ax from the plt.subplots function. We then draw a pie chart by calling the pie function on ax, passing nan_or_no, the labels parameter set to our labels variable, the autopct parameter set to the '%1.1f%%' format, the shadow parameter set to True, and the startangle parameter set to 90. We call the axis function on ax with 'equal', so that the equal aspect ratio ensures the pie is drawn as a circle. Finally, we show the figure with the plt.show function.

In [None]:
nan_or_no = [81575, 33913]
labels = "NaN", "Filled"

fig, ax = plt.subplots()
ax.pie(nan_or_no, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
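
The two counts do not have to be typed in by hand; they can be read off the column itself. Below is a minimal sketch of that idea on a toy dataframe. The name `toy_df` and its ids and run-length strings are made up for illustration and are not competition data.

```python
import pandas as pd

# Toy stand-in for train_df; the ids and RLE strings are hypothetical.
toy_df = pd.DataFrame({
    "id": ["case1_day0_slice_0001", "case1_day0_slice_0002", "case1_day0_slice_0003"],
    "segmentation": [None, "5 3 12 2", None],
})

# Count missing and present segmentation entries straight from the column.
n_nan = int(toy_df["segmentation"].isna().sum())
n_filled = int(toy_df["segmentation"].notna().sum())

# These two lists could be passed to ax.pie in place of hard-coded numbers.
pie_values = [n_nan, n_filled]
pie_labels = ["NaN", "Filled"]
print(dict(zip(pie_labels, pie_values)))
```

Deriving the values this way keeps the chart in sync with the dataframe if the data ever changes.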

In this pie chart, we see that 29.4% of the segmentation data is "Filled" and 70.6% is "NaN". Since there are so many NaNs, we redefine train_df to keep only the rows whose segmentation entry is not NaN, by indexing train_df with train_df.segmentation.notna(). We then call the reset_index function with the drop parameter set to False, which keeps the old row labels in a new column called index. Finally, we display the first 5 rows of the train_df dataframe with the head function.

In [None]:
train_df = train_df[train_df.segmentation.notna()].reset_index(drop=False)
train_df.head()

Now let's analyze the number of samples per class using Seaborn!

Plotting the number of samples per class with Seaborn is easy. We call the sns.displot function with our train_df dataframe and set the x parameter to "class".

In [None]:
sns.displot(train_df, x="class")

As you can see, this histogram from seaborn shows three classes to segment: large_bowel (aka large intestine), small_bowel (aka small intestine), and stomach, with 14085 (large_bowel), 11201 (small_bowel), and 8627 (stomach) rows respectively.

Next, let's find the number of cases! First, we define a new function, get_case_str, that takes a row as input. Inside it, we define a variable, case_num, by taking the id attribute of the row, splitting it on underscores with the split function, and keeping the element at index 0. Finally, we return case_num. We then create another function called get_case_id, with the same row input. It works like get_case_str, but it returns case_num[4:] converted with the int function, where the slice 4: cuts the string "case" away. Now we apply both functions to our train_df dataframe: we assign the case_str column (case_id for the second one) to the result of train_df.apply, passing a lambda that calls get_case_str (get_case_id for the second one) on each row, with the axis parameter set to 1. Finally, let's display the first 5 rows of the train_df dataframe!

In [None]:
def get_case_str(row):
    case_num = row.id.split('_')[0]
    return case_num

def get_case_id(row):
    case_num = row.id.split('_')[0]
    return int(case_num[4:])

train_df['case_str'] = train_df.apply(lambda row: get_case_str(row), axis=1)
train_df['case_id'] = train_df.apply(lambda row: get_case_id(row), axis=1)
train_df.head()
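
To make the split and the slice concrete, here is a small worked example on a single id. The id string is hypothetical; it only mimics the `case<N>_day<N>_slice_<NNNN>` shape that the functions above parse.

```python
# Hypothetical id in the same shape the notebook parses.
example_id = "case123_day20_slice_0001"

parts = example_id.split("_")   # split on every underscore
case_str = parts[0]             # first piece is the case string
case_id = int(case_str[4:])     # drop the 4-character prefix "case", keep the number

print(parts)
print(case_str, case_id)
```

Note that splitting on underscores yields four pieces, because `slice_0001` itself contains an underscore.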

After that, we print the number of cases in our train_df dataframe by applying the len function to the unique values of the case_str column, which we get with the unique function.

In [None]:
print("No. of cases: ", len(train_df.case_str.unique()))

When we run this, we see that there are 85 cases in our train_df dataframe! Now, let's count the rows for every case id in sorted order: we take the case_id column of train_df, count its values with the value_counts function, and order the result by case id with the sort_index function.

In [None]:
train_df.case_id.value_counts().sort_index()

After this, we see that some cases have far more rows than others. This is probably because those cases have more slices per day or more days per case. So, let's add the day number to our train_df dataframe!

To get the day number into train_df, we define two functions, get_day_str and get_day_id, each taking a row as input. The get_day_str function returns the element at index 1 of the row's id split on underscores, since that element is the day string. The get_day_id function takes that same element, cuts the "day" prefix away with the slice 3:, and returns the rest converted with the int function. We then add the new day_str and day_id columns to train_df in the same way we added the case columns, this time calling get_day_str and get_day_id on each row. Finally, let's display the first five rows of our updated train_df dataframe with the head function!

In [None]:
def get_day_str(row):
    return row.id.split('_')[1]

def get_day_id(row):
    return int(row.id.split('_')[1][3:])

train_df['day_str'] = train_df.apply(lambda row: get_day_str(row), axis=1)
train_df['day_id'] = train_df.apply(lambda row: get_day_id(row), axis=1)

train_df.head()

After we run the cell above, we see that the day_str and day_id columns have been added to our train_df dataframe! But with all of these day_str and day_id values, how can we find the number of unique days on which a scan was taken?

Well, we print the number of unique days by applying the len function to the unique values of the day_str column of train_df, which we get with the unique function.

In [None]:
print("No. of unique days a scan was taken: ", len(train_df.day_str.unique()))
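
As a side note, pandas also has a dedicated method for this count. The sketch below compares the two approaches on a toy column; `toy_df` and its values are invented for illustration.

```python
import pandas as pd

# Toy day_str column with repeated values (hypothetical data).
toy_df = pd.DataFrame({"day_str": ["day0", "day0", "day20", "day22", "day20"]})

# Approach used in the notebook: length of the array of unique values.
n_days_len_unique = len(toy_df.day_str.unique())

# Equivalent one-step alternative.
n_days_nunique = toy_df["day_str"].nunique()

print(n_days_len_unique, n_days_nunique)
```

One difference worth knowing: `nunique` skips NaN by default, while `unique` includes it, so the two only agree when the column has no missing values.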

When we run the code cell above, we see that scans were taken on 35 unique days. Now that the case and day data are in our train_df dataframe, it's time to plot them, with seaborn again!

For our seaborn plot of days against cases, we create a catplot with the sns.catplot function, setting x to day_id, y to case_id, and data to our train_df dataframe. The day_id tick labels were cramped together, so we added one more parameter, aspect (the width-to-height ratio of the figure), set to 2 so that the labels no longer touch.

In [None]:
sns.catplot(x="day_id", y="case_id", data=train_df, aspect=2)

Or, we could plot cases against days as a displot with the sns.displot function, passing our train_df dataframe and three parameters: x set to day_id, y set to case_id, and kind set to kde.

In [None]:
sns.displot(train_df, x="day_id", y="case_id", kind="kde")

In our catplot figure, we observed that a large chunk of the points sits at day_id 0, while the rest are scattered over day_id values from 6 to 39.

Meanwhile, in our kde displot figure, most of the data is concentrated around day_id 0, with some spread over day_id values from 10 to 40. With all that being said, let's get the slice ids!

To get the slice ids, we define a new function, get_slice_str, that takes a row as input. Inside the function, we define a variable, slice_id, as the last element (index -1) of the row's id split on underscores. We then return the formatted string made of "slice_" followed by slice_id.

Outside of the get_slice_str function, we add the slice_str column to train_df by calling the apply function with a lambda that calls get_slice_str on each row, and the axis parameter set to 1. Finally, we display the first rows of the updated train_df dataframe with the head function.

In [None]:
def get_slice_str(row):
    slice_id = row.id.split("_")[-1]
    return f"slice_{slice_id}"

train_df['slice_str'] = train_df.apply(lambda row: get_slice_str(row), axis=1)
train_df.head()
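
The case, day, and slice parts can also be pulled out of the id in a single vectorized step instead of one row-wise apply per column. This is only a sketch of an alternative, not what the notebook runs: it assumes every id matches the `case<N>_day<N>_slice_<NNNN>` pattern, and `toy_df` with its ids is hypothetical.

```python
import pandas as pd

# Toy ids in the assumed format (hypothetical values).
toy_df = pd.DataFrame({"id": ["case123_day20_slice_0001", "case7_day0_slice_0049"]})

# One regular expression with named groups; each group becomes a column.
pattern = r"^case(?P<case_id>\d+)_day(?P<day_id>\d+)_slice_(?P<slice_no>\d+)$"
parts = toy_df["id"].str.extract(pattern)

toy_df["case_id"] = parts["case_id"].astype(int)
toy_df["day_id"] = parts["day_id"].astype(int)
toy_df["slice_str"] = "slice_" + parts["slice_no"]   # keep the zero padding

print(toy_df)
```

On a dataframe with tens of thousands of rows, string methods like this usually run faster than apply with axis=1, and an id that does not match the pattern shows up as NaN rather than raising inside a helper function.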

After running the code cell above, we see that the slice_str column has been added to our train_df dataframe. With those parts done, let's add the paths of the GI scans!

To add the path of each GI scan, we define a variable, filepaths, with the glob.glob function, which lists all paths matching a pattern, here the ROOT_DIR variable concatenated with the string `train/*/*/*/*`. We then display filepaths[:5], the first five paths.

In [None]:
filepaths = glob.glob(ROOT_DIR + "train/*/*/*/*")
filepaths[:5]

Since this code cell displayed the first five GI scan paths, let's bring them into our train_df dataframe! First, we create a new, empty dataframe, file_df, with the pd.DataFrame function, setting the columns parameter to a list with case_str, day_str, slice_str, filename, and filepath.

Next, to fill file_df, we write a for loop over idx and filepath, taken from the enumerate function applied to the filepaths list and wrapped in the tqdm function (to show a progress bar). Inside the loop, case_day_str is the element at index 5 of filepath split on slashes, and case_str and day_str come from splitting case_day_str on the underscore. The filename variable is the last element (index -1) of filepath split on slashes. The slice_id variable is the element at index 1 of filename split on underscores, and slice_str is the formatted string of "slice_" followed by slice_id. Still inside the loop, we assign to file_df.loc[idx] (loc accesses a row by its label) a list of the five values case_str, day_str, slice_str, filename, and filepath, in the same order as the columns of file_df.

Finally, outside of that for loop, we display the first five rows of the file_df dataframe with the head function.

In [None]:
file_df = pd.DataFrame(columns=['case_str', 'day_str', 'slice_str', 'filename', 'filepath'])

for idx, filepath in tqdm(enumerate(filepaths), total=len(filepaths)):  # total lets tqdm show progress toward the end
    case_day_str = filepath.split('/')[5]
    case_str, day_str = case_day_str.split('_')

    filename = filepath.split('/')[-1]
    slice_id = filename.split('_')[1]
    slice_str = f'slice_{slice_id}'
    
    file_df.loc[idx] = [case_str, day_str, slice_str, filename, filepath]

file_df.head()
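
Growing a dataframe one row at a time with loc is one reason a loop like this can take minutes. A possible alternative, sketched below, collects plain dictionaries first and builds the dataframe once at the end. The helper name `parse_scan_path` is hypothetical, and the example path (including the `scans` folder name) is an assumption that only mirrors the four directory levels of the `train/*/*/*/*` pattern.

```python
from pathlib import PurePosixPath

import pandas as pd


def parse_scan_path(filepath):
    """Split one scan path into the five fields used for file_df."""
    path = PurePosixPath(filepath)
    # The case/day folder sits two levels above the image file.
    case_day_str = path.parent.parent.name
    case_str, day_str = case_day_str.split("_")
    slice_id = path.name.split("_")[1]
    return {
        "case_str": case_str,
        "day_str": day_str,
        "slice_str": f"slice_{slice_id}",
        "filename": path.name,
        "filepath": filepath,
    }


# Hypothetical path; the real list would come from glob.glob.
example_paths = [
    "../input/uw-madison-gi-tract-image-segmentation/train/"
    "case123/case123_day20/scans/slice_0075_266_266_1.50_1.50.png",
]

rows = [parse_scan_path(fp) for fp in example_paths]
file_df_sketch = pd.DataFrame(
    rows, columns=["case_str", "day_str", "slice_str", "filename", "filepath"]
)
print(file_df_sketch.iloc[0].to_dict())
```

Counting folders from the end of the path, rather than using a fixed index from the start, also means the parsing keeps working if ROOT_DIR is changed to a path with a different depth.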

After a few minutes of running, we can see that file_df now holds the filepath column along with case_str, day_str, slice_str, and filename. The case_str, day_str, and slice_str columns also exist in our train_df dataframe. But since we have two dataframes, file_df and train_df, what are we going to do about it? The answer is to merge them.

To merge the two dataframes into one, we redefine train_df as the result of the pd.merge function, passing train_df itself and file_df, and setting the on parameter to a list of three column names: 'case_str', 'day_str', and 'slice_str'. Finally, we show our newly merged train_df dataframe with the head function.

In [None]:
train_df = pd.merge(train_df, file_df, on=['case_str', 'day_str', 'slice_str'])
train_df.head()
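
A merge on several keys can silently drop or duplicate rows, so it may be worth checking. The sketch below uses two optional parameters of pd.merge, `validate` and `indicator`, on toy dataframes; the names `left` and `right` and all values are made up for illustration.

```python
import pandas as pd

keys = ["case_str", "day_str", "slice_str"]

# Toy stand-in for train_df: several rows (one per class) can share a slice.
left = pd.DataFrame({
    "case_str": ["case1", "case1", "case2"],
    "day_str": ["day0", "day0", "day5"],
    "slice_str": ["slice_0001", "slice_0001", "slice_0002"],
    "class": ["stomach", "large_bowel", "stomach"],
})

# Toy stand-in for file_df: exactly one row per slice.
right = pd.DataFrame({
    "case_str": ["case1", "case2"],
    "day_str": ["day0", "day5"],
    "slice_str": ["slice_0001", "slice_0002"],
    "filepath": ["a.png", "b.png"],
})

merged = pd.merge(
    left, right, on=keys,
    how="left",              # keep every row of the left frame
    validate="many_to_one",  # raise if a slice appears twice in the right frame
    indicator=True,          # add a _merge column saying where each row matched
)
print(merged)
```

With `how="left"`, any row that found no file would show `left_only` in the `_merge` column instead of disappearing, which makes missing files easy to spot.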

After merging the two dataframes, we see that the filepath column has also been added to our train_df dataframe. But what about the other information in filename? The quick solution is to extract it.

Before we extract the other info from filename, we need to know that every image filename includes 4 numbers after the slice number, separated by underscores. They represent the slice height/width (integers, in pixels) and the height/width pixel spacing (floating-point numbers, in mm). Keep in mind that the first two numbers give the resolution of the slice, while the last two record the physical size of each pixel.

With all of that being said, let's go on to the extraction! First, we define four functions, each taking a row as input: get_image_height, get_image_width, get_pixel_width, and get_pixel_height. Each one takes the row's filename, removes the last 4 characters (the ".png" extension) with the slice :-4, splits the rest on underscores with the split function, and picks the element at index 2, 3, 4, or 5 respectively. The first two functions convert the result with the int function, and the last two with the float function.

Next, we add four new columns to train_df: img_height, img_width, pixel_height (mm), and pixel_width (mm). Each one is the result of train_df.apply with a lambda that calls the matching function on each row, with the axis parameter set to 1. Finally, we display our updated train_df dataframe with the head function.

In [None]:
def get_image_height(row):
    return int(row.filename[:-4].split('_')[2])

def get_image_width(row):
    return int(row.filename[:-4].split('_')[3])

def get_pixel_width(row):
    return float(row.filename[:-4].split('_')[4])

def get_pixel_height(row):
    return float(row.filename[:-4].split('_')[5])

train_df["img_height"] = train_df.apply(lambda row: get_image_height(row), axis=1)
train_df["img_width"] = train_df.apply(lambda row: get_image_width(row), axis=1)
train_df["pixel_height (mm)"] = train_df.apply(lambda row: get_pixel_height(row), axis=1)
train_df["pixel_width (mm)"] = train_df.apply(lambda row: get_pixel_width(row), axis=1)

train_df.head()
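
Here is the same filename parsing traced on one example. The filename below is hypothetical (its values are borrowed from numbers that appear elsewhere in this notebook), the helper name `parse_slice_filename` is invented, and the mapping of positions to column names simply mirrors the four functions above.

```python
import os


def parse_slice_filename(filename):
    """Return the four numbers encoded in a slice filename as a dict."""
    stem, _ext = os.path.splitext(filename)   # strip the extension, whatever its length
    parts = stem.split("_")                   # ['slice', '<id>', n, n, x, x]
    return {
        "img_height": int(parts[2]),
        "img_width": int(parts[3]),
        "pixel_width (mm)": float(parts[4]),
        "pixel_height (mm)": float(parts[5]),
    }


info = parse_slice_filename("slice_0075_266_266_1.50_1.50.png")
print(info)
```

Using `os.path.splitext` instead of the slice `[:-4]` is a small design choice: it cuts at the last dot, so the decimal points inside the pixel spacings are left alone and an extension of a different length would still be removed correctly.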

As you can see, the img_height, img_width, pixel_height (mm), and pixel_width (mm) columns are now part of the train_df dataframe! Furthermore, we don't need the index column anymore, so we call the drop function on train_df, passing 'index', the axis parameter set to 1, and the inplace parameter set to True.

In [None]:
train_df.drop('index', axis=1, inplace=True)

After running this code cell, we wave goodbye to the index column in the train_df dataframe! But what about analyzing the img_height, img_width, pixel_height (mm), and pixel_width (mm) columns? Let's plot them as histograms with Plotly.

To plot each of these columns as a histogram with Plotly, we define a figure variable, fig, with the px.histogram function, setting the data_frame parameter to the train_df dataframe, the x parameter to one of the four columns mentioned above, the marginal parameter to violin, and the nbins parameter to 400.

We then call the update_layout function to update the layout of our figure, setting the template parameter to any built-in template Plotly offers (e.g. seaborn, presentation). Finally, we show the figure with the show function.

In [None]:
fig = px.histogram(data_frame=train_df, x="img_height", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()

In [None]:
fig = px.histogram(data_frame=train_df, x="img_width", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()

In [None]:
fig = px.histogram(data_frame=train_df, x="pixel_height (mm)", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()

In [None]:
fig = px.histogram(data_frame=train_df, x="pixel_width (mm)", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()
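
When a column only takes a handful of distinct values, a plain count table can complement the histograms. The sketch below shows the idea on a toy column; `toy_df` and the numbers in it are invented and do not reflect the real frequencies.

```python
import pandas as pd

# Toy img_height column with two distinct values (hypothetical data).
toy_df = pd.DataFrame({"img_height": [266, 266, 266, 276, 266, 276]})

# value_counts lists each distinct value with its count, most frequent first.
height_counts = toy_df["img_height"].value_counts()
print(height_counts)
```

The same call could be repeated for each of the four columns to read off exactly which values occur and how often.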

When we observe the four graphs above, we clearly see that each column takes only two or three distinct values. For img_width and img_height, the most common value is 266 and the least common is 276. Likewise, for pixel_width (mm) and pixel_height (mm), the most and least common values are 1.5 and 1.63. And with that, we have finished the chapter "A Basic Start to the train_df DataFrame From Loading the train.csv". Now let's head on to the next chapter!

## Chapter 2: The Analysis over Case, Day, and the Slice Level
Before we begin this chapter, here's what we need to know about the data. Each image (aka slice) can be reached by traversing three levels: Case, Day, and Slice. The Case tells us which case the scan belongs to, the Day tells us when the scan was produced, and each Slice can have several rows in the provided csv file, each with its own segmentation mask. With all that being said, let's hop into this chapter.

To analyze the case, day, and slice levels, we define a variable, by_case, by grouping the train_df dataframe on the case_str column with the groupby function. Then, we create a new dataframe, case_df, by calling the get_group function on by_case to fetch all rows of case123. Finally, we display the first rows of our case_df dataframe with the head function.

In [None]:
by_case = train_df.groupby('case_str')
case_df = by_case.get_group('case123')
case_df.head()

Looking at the case_df dataframe, we see that it only holds rows belonging to case123. But how many rows does each day_str have in case_df?

To answer that, we take the day_str column of case_df and count its values with the value_counts function. Let's see how it goes.

In [None]:
case_df.day_str.value_counts()

After running this code cell, we see that day0 has 132 rows, day22 has 130 rows, and day20 has 113 rows. With the value counts of case_df known, let's plot them with Plotly.

First, we define a variable, fig, with the px.bar function to create a bar chart of the row counts per day_str. The x parameter is np.unique(case_df["day_str"]), the unique day strings in case_df. The y parameter is a list comprehension that converts the day_str column to a list and, for each unique day i, counts its occurrences with the count function. The color parameter is again set to the unique day strings, and the color_continuous_scale parameter to any built-in color scale Plotly provides.

Next, we update the x-axis and y-axis of fig with the update_xaxes and update_yaxes functions, setting the title parameter to day_str and count. We then update the layout of the figure with the update_layout function, setting the showlegend parameter to True, the title parameter to a dictionary (see below), and the template parameter to any built-in template Plotly offers. Finally, we show the figure with the show function.

In [None]:
fig = px.bar(
    x=np.unique(case_df["day_str"]),
    y=[list(case_df["day_str"]).count(i) for i in np.unique(case_df["day_str"])],
    color=np.unique(case_df["day_str"]),
    color_continuous_scale="solar",
)

fig.update_xaxes(title="day_str")
fig.update_yaxes(title="count")
fig.update_layout(showlegend=True, 
                  title={
                      "text": "Number of Data in day_str",
                      "y": 0.95, # y position
                      "x": 0.5, # x position
                      "xanchor": "center", # x position anchored
                      "yanchor": "top" # y position anchored
                  }, template="seaborn")

fig.show()
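
The x and y values for a bar chart like this can also be taken from a single value_counts call rather than a list comprehension. A minimal sketch on a toy frame follows; `toy_case_df` and its counts are hypothetical.

```python
import pandas as pd

# Toy stand-in for case_df (hypothetical rows).
toy_case_df = pd.DataFrame({"day_str": ["day0"] * 3 + ["day22"] * 2 + ["day20"]})

# Count rows per day, then order by label to match np.unique's sorted output.
counts = toy_case_df["day_str"].value_counts().sort_index()

x_values = counts.index.tolist()   # bar positions
y_values = counts.tolist()         # bar heights
print(x_values, y_values)
```

Because the labels are strings, the ordering is alphabetical, so `day20` comes before `day22` but a label such as `day3` would come after both.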

Looking at this graph, we see that it shows the same counts we got for the day_str column of case_df with the value_counts function.

Now let's look at the data "by_day"! First, we define a variable, by_day, by grouping the case_df dataframe on the day_str column with the groupby function. Then, we define a new dataframe called day_df by calling the get_group function on by_day to fetch the day0 group (or any other day present in this case). Finally, we show our newly created day_df dataframe with the head function.

In [None]:
by_day = case_df.groupby('day_str')
day_df = by_day.get_group('day0')
day_df.head()

After running this cell, our new dataframe, day_df, has been created, and we see that it only holds rows for the chosen day_str. Now let's count the rows per slice. All we need to do is take the slice_str column of day_df and count its values with the value_counts function.

In [None]:
day_df.slice_str.value_counts()

After running the code cell above, we see that slice_0081 has 3 rows, while slice_0049 has only 1. Reading the output from slice_0081 down to slice_0049, the number of rows per slice drops from 3 to 2 and then to 1, so some slices have a mask for every class and others for only one or two.

Lastly, let's create one more dataframe. We first define a variable called by_slice by grouping the day_df dataframe on the slice_str column with the groupby function. Then, we create a new dataframe called slice_df by calling the get_group function on by_slice to fetch slice_0075 (or any other slice). Finally, we display the rows of our slice_df dataframe with the head function.

In [None]:
by_slice = day_df.groupby('slice_str')
slice_df = by_slice.get_group('slice_0075')
slice_df.head()
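
The three groupby steps narrow the data one level at a time. When only one particular slice is needed, the same rows can be selected in one go. The sketch below shows two ways to do that on a toy frame; `toy_df` and its rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical rows).
toy_df = pd.DataFrame({
    "case_str": ["case123", "case123", "case123", "case123", "case7"],
    "day_str": ["day0", "day0", "day0", "day20", "day0"],
    "slice_str": ["slice_0075", "slice_0075", "slice_0049", "slice_0075", "slice_0075"],
    "class": ["large_bowel", "stomach", "stomach", "stomach", "stomach"],
})

# Option 1: a boolean mask combining the three conditions.
wanted = (
    (toy_df["case_str"] == "case123")
    & (toy_df["day_str"] == "day0")
    & (toy_df["slice_str"] == "slice_0075")
)
slice_rows = toy_df[wanted]

# Option 2: group on all three columns at once and fetch the group by a tuple key.
grouped = toy_df.groupby(["case_str", "day_str", "slice_str"])
slice_group = grouped.get_group(("case123", "day0", "slice_0075"))

print(slice_rows)
```

The step-by-step version in the notebook is still handy for exploring, since each intermediate dataframe can be inspected on its own.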

As you can see, the slice_df dataframe has been created, and it holds three rows, with row labels 287 to 289. And that's all for chapter 2, our analysis of the case, day, and slice levels!

## Chapter 3: Mask Segmentation Visualization (feat. W&B)
Now, here's the fun part. We are going to visualize segmentation masks! We use the cv2 module that we imported at the start. To get started, we define the filepath variable from the filepath column of slice_df: the values attribute gives the column as an array, and we pick one entry with an index from 0 to 2 (slice_df has three rows). Then, we define another variable called image with the cv2.imread function, passing the filepath variable and the cv2.IMREAD_UNCHANGED flag so that the image is loaded as stored, without converting its channels or bit depth. Finally, we display the shape of image with the shape attribute.

In [None]:
filepath = slice_df.filepath.values[1]
image = cv2.imread(filepath, cv2.IMREAD_UNCHANGED)
image.shape

As the output shows, the shape of image is (266, 266). These are the height and width, in pixels, of this particular GI tract image.

Now, let's find out what a GI tract image looks like! First, we create our figure with the plt.figure function, setting the figsize parameter to 10 by 10. Then, we show the image with the plt.imshow function, passing the image variable and setting the cmap parameter to gray.

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(image, cmap='gray');

After running the code cell above, we just displayed a picture of a GI tract! But there's still more to find out, since we have yet to look at the masks over the GI tract.

To build the mask over the GI tract, we define a new function called rle2mask, with the inputs rles, class_names, height, width, and class_dict. Inside this function, we define a variable, img, with the np.zeros function, which returns a new array filled with zeros; its length is height multiplied by width, and the dtype parameter is set to np.uint16.

We then create a for loop over the rle and class_name variables, taken from zipping the rles and class_names inputs. Inside the loop, s is the rle string split on spaces (" ") with the split function. The starts and lengths variables come from a list comprehension that converts every second element of s, beginning at index 0 for starts (s[0:][::2]) and at index 1 for lengths (s[1:][::2]), into an integer array with np.asarray and the dtype parameter set to int. The starts array is then reduced by one, because the run-length encoding counts pixels from 1 while array indices start at 0, and the ends variable is defined as starts plus lengths. A second, nested for loop goes over lo and hi, taken from zipping starts and ends, and sets img[lo:hi] to class_dict[class_name], the integer id of the current class.

Now, outside of that for loop in the rle2mask function, we define one more variable, mask, by reshaping img with the reshape function, passing a tuple of the width and height variables. Finally, we return the mask variable. (For square slices, such as the 266 × 266 one used here, the order of the two dimensions makes no difference.)

In [None]:
def rle2mask(rles, class_names, height, width, class_dict):
    img = np.zeros(height*width, dtype=np.uint16)
    for rle, class_name in zip(rles, class_names):
        s = rle.split(' ')
        starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
        starts -= 1
        ends = starts + lengths
        
        for lo, hi in zip(starts, ends):
            img[lo:hi] = class_dict[class_name]
            
    mask = img.reshape((width, height))
    return mask
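
To make the decoding steps concrete, here is a small worked example on a made-up 3 by 4 image. The RLE string, the image size, and the names toy_mask and flat are all invented for illustration. This sketch also assumes the runs are laid out row by row, so it reshapes to (height, width); for square images that gives the same result as reshaping to (width, height).

```python
import numpy as np

# Toy 3x4 image (12 pixels). The RLE string is "start length start length ..."
# with 1-based start positions, which is what the `starts -= 1` step assumes.
rle = "2 3 9 2"          # pixels 2-4 and pixels 9-10 (1-based)
height, width = 3, 4

tokens = rle.split(" ")
starts = np.asarray(tokens[0::2], dtype=int) - 1   # convert to 0-based indexes
lengths = np.asarray(tokens[1::2], dtype=int)
ends = starts + lengths

flat = np.zeros(height * width, dtype=np.uint16)
for lo, hi in zip(starts, ends):
    flat[lo:hi] = 1       # 1 stands in for a class id

toy_mask = flat.reshape((height, width))   # assumption: runs go row by row
print(toy_mask)
```

The first run fills the last three pixels of the top row, and the second run fills the first two pixels of the bottom row.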

After creating the rle2mask function, we define a dictionary called class2id that maps each class_name to idx + 1, where idx and class_name come from enumerating the unique values of the class column of train_df (found with the unique function). We also define another dictionary, id2class, which reverses the mapping by swapping the k and v of every pair returned by the items function of class2id. Finally, we display id2class by putting it on the last line of the cell (no print call is needed in a notebook).

In [None]:
class2id = {class_name: idx+1 for idx, class_name in enumerate(train_df['class'].unique())} # Note that 0 is reserved for background.
id2class = {v:k for k, v in class2id.items()}
id2class
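
The same mapping can be tried out without the dataframe. The sketch below uses a made-up list of class names in place of train_df['class'], and the names ending in _demo are only for illustration.

```python
# Hypothetical class column; in the notebook this comes from train_df['class'].
classes = ["large_bowel", "small_bowel", "stomach", "large_bowel", "stomach"]

# dict.fromkeys keeps first-appearance order, like pandas' unique().
unique_classes = list(dict.fromkeys(classes))

# Ids start at 1 because 0 is reserved for the background.
class2id_demo = {name: idx + 1 for idx, name in enumerate(unique_classes)}
id2class_demo = {v: k for k, v in class2id_demo.items()}

print(class2id_demo)
print(id2class_demo)
```

Note that the ids depend on the order in which the classes first appear, so a differently sorted dataframe would give a different numbering.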

As you can see, a dictionary is printed out in which the keys 1 to 3 map to the GI tract organs used for image segmentation: stomach, large_bowel, and small_bowel.

Next, it's time to create a mask layout! We define the variable mask by calling the rle2mask function with the values of slice_df's segmentation column, the values of its class column, the first entries (index 0) of its img_height and img_width columns, and the class2id dictionary.

In [None]:
mask = rle2mask(slice_df.segmentation.values,
                slice_df['class'].values,
                slice_df.img_height.values[0],
                slice_df.img_width.values[0],
                class2id)

Time to display the GI tract mask! We create a new figure with plt.figure, setting the figsize parameter to the tuple (10, 10). Then, we show the mask variable with plt.imshow.

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(mask)

And with that, we have displayed the mask of the GI tract! Comparing the two plots, the mask outline closely matches the organs in the actual image.
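
To compare the two more directly, the mask can be blended on top of the scan. The sketch below is one possible way to do it with NumPy only. The function overlay_mask, the colours in demo_colors, and the tiny demo arrays are all made up for illustration, and the sketch assumes the image and the mask are 2D arrays of the same shape.

```python
import numpy as np

def overlay_mask(image, mask, colors, alpha=0.4):
    """Blend a class-id mask onto a grayscale image; returns float RGB in [0, 1]."""
    img = image.astype(np.float32)
    span = img.max() - img.min()
    # Stretch the image to [0, 1] so it is visible regardless of bit depth.
    img = (img - img.min()) / span if span > 0 else np.zeros_like(img)
    rgb = np.stack([img, img, img], axis=-1)
    for class_id, color in colors.items():
        sel = mask == class_id
        rgb[sel] = (1 - alpha) * rgb[sel] + alpha * np.asarray(color, dtype=np.float32)
    return rgb

# Hypothetical colours for class ids 1-3; background (0) is left untouched.
demo_colors = {1: (1.0, 0.0, 0.0), 2: (0.0, 1.0, 0.0), 3: (0.0, 0.0, 1.0)}

demo_image = np.arange(16, dtype=np.uint16).reshape(4, 4)
demo_mask = np.zeros((4, 4), dtype=np.uint16)
demo_mask[1:3, 1:3] = 2
blended = overlay_mask(demo_image, demo_mask, demo_colors)
print(blended.shape)
```

In the notebook, the result of overlay_mask(image, mask, demo_colors) could be passed straight to plt.imshow.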

It's time to use Weights & Biases (W&B) for the segmentation mask visualization! Before we start, we import the wandb module. Next, we import UserSecretsClient from Kaggle's kaggle_secrets module so that the secret attached to our notebook is available, and we import the tensorflow module as tf. We then define the user_secrets variable by calling UserSecretsClient, and define the wandb_key variable by calling the get_secret function of user_secrets with the secret's name, "wandb". Finally, we log in to Weights & Biases with wandb.login, setting the key parameter to wandb_key.

In [None]:
import wandb
from kaggle_secrets import UserSecretsClient
import tensorflow as tf

user_secrets = UserSecretsClient()
wandb_key = user_secrets.get_secret("wandb")
wandb.login(key=wandb_key)
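
If you run the notebook outside Kaggle, the kaggle_secrets module is not available. The sketch below shows one possible fallback that reads the key from the WANDB_API_KEY environment variable instead. The helper name resolve_wandb_key is made up for illustration, and the secret name "wandb" matches the one used above.

```python
import os

def resolve_wandb_key(secret_name="wandb", env_var="WANDB_API_KEY"):
    """Return the W&B API key from Kaggle secrets, else from an environment variable."""
    try:
        from kaggle_secrets import UserSecretsClient  # only importable on Kaggle
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Not on Kaggle, or the secret is not attached: fall back to the environment.
        return os.environ.get(env_var)

key = resolve_wandb_key()
print("key found" if key else "no key found")
```

The returned key can then be passed to wandb.login(key=key) when it is not None.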

Now that our wandb login and setup are done, let's move on to visualization! We build the mask data to log by defining wandb_mask as a dictionary whose gt_mask key maps to another dictionary, in which the mask_data key holds the mask variable and the class_labels key holds the id2class dictionary.

In [None]:
wandb_mask = {
    'gt_mask': {
        'mask_data': mask,
        'class_labels': id2class
    }
}
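
Before logging, it can help to check that the mask and the label dictionary agree with each other. The helper below is a hypothetical sketch (check_mask_payload is not part of wandb) that assumes the mask should be a 2D array of integer class ids, with 0 treated as unlabeled background.

```python
import numpy as np

def check_mask_payload(mask, class_labels):
    """Sanity-check a mask and its label dict before wrapping them for logging."""
    mask = np.asarray(mask)
    if mask.ndim != 2:
        raise ValueError(f"expected a 2D mask, got shape {mask.shape}")
    if not np.issubdtype(mask.dtype, np.integer):
        raise ValueError(f"expected integer class ids, got dtype {mask.dtype}")
    # Every id in the mask (other than background 0) should have a label.
    unknown = set(np.unique(mask).tolist()) - set(class_labels) - {0}
    if unknown:
        raise ValueError(f"mask contains ids without a label: {sorted(unknown)}")
    return {"mask_data": mask, "class_labels": class_labels}

demo_labels = {1: "large_bowel", 2: "small_bowel", 3: "stomach"}
demo_payload = check_mask_payload(np.array([[0, 1], [2, 3]]), demo_labels)
print(sorted(demo_payload))
```

In the notebook, the checked dictionary could be used as the value of the gt_mask key.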

Time to run Weights & Biases! We define the variable run by initializing our project, "UW-Madison-Viz" (or any name you like), with wandb.init. Next, we log to the run with wandb.log, passing a dictionary whose "Ground Truth Segmentation" key maps to a wandb.Image built from the image variable, with the masks parameter set to wandb_mask. We then close the run with wandb.finish. Finally, we put the run variable on the last line of the cell so that the notebook displays it.

In [None]:
run = wandb.init(project='UW-Madison-Viz')
wandb.log({'Ground Truth Segmentation': wandb.Image(image, masks=wandb_mask)})
wandb.finish()
run

If you click the "Display W&B run" button, you'll see that the GI tract image appears mostly white, while the mask overlay looks the same as the mask we displayed with Matplotlib.

Time to visualize a case day-wise by using W&B Tables! To log segmentation masks in a W&B Table, we first define the wandb_class_set variable with wandb.Classes, passing a list of dictionaries whose id and name keys are filled by looping over the items of id2class with the items function.

In [None]:
wandb_class_set = wandb.Classes([{
                      'id': id,
                      'name': name
                  } for id, name in id2class.items()])

Next, we loop over by_day, which yields each day together with its day_df dataframe. The comments inside the code cell explain each step of the loop.

In [None]:
for day, day_df in by_day:
    # Before we start, we print the day on which the scan was taken.
    print("The day the scan was taken is: ", day)
    
    # Initialize a W&B run with wandb.init, setting the project parameter to
    # "UW-Madison-Viz" (or the same custom project name as before) and the group
    # parameter to 'case123-viz' so that the runs for every day are grouped together.
    run = wandb.init(project="UW-Madison-Viz", group="case123-viz")
    
    # Initialize a W&B Table with wandb.Table, setting the columns parameter to a
    # list with the two column names 'slice' and 'image'.
    data_at = wandb.Table(columns=['slice', 'image'])
    
    # Group this day's rows by slice with the groupby function on the slice_str
    # column, and store the result in by_slice.
    by_slice = day_df.groupby('slice_str')
    
    # Iterate through each slice, open its image, and build its mask. The loop
    # yields slice_num and slice_df from by_slice, and tqdm shows the progress.
    for slice_num, slice_df in tqdm(by_slice):
        # Open the image: take the first filepath in slice_df, then read the file
        # with cv2.imread. The cv2.IMREAD_UNCHANGED flag loads the file as stored
        # (it keeps the original bit depth rather than converting to 8-bit).
        # Next, convert the array to a uint16 tensor with tf.convert_to_tensor,
        # and finally convert it to float16 with tf.image.convert_image_dtype.
        filepath = slice_df.filepath.values[0]
        image = cv2.imread(filepath, cv2.IMREAD_UNCHANGED)
        image = tf.convert_to_tensor(image, dtype=tf.uint16)
        image = tf.image.convert_image_dtype(image, dtype=tf.float16)
        
        # Build the mask by calling rle2mask with the segmentation values and the
        # class values of slice_df, the first img_height and img_width values
        # (index 0), and the class2id dictionary.
        mask = rle2mask(slice_df.segmentation.values,
                        slice_df['class'].values,
                        slice_df.img_height.values[0],
                        slice_df.img_width.values[0],
                        class2id)
        
        # Generate the dictionary of mask data to log, in the same way as for the
        # single mask we visualized earlier.
        wandb_mask = {
            'gt_mask': {
                'mask_data': mask,
                'class_labels': id2class
            }
        }
        
        # Add the data as a new row with the add_data function of the data_at
        # table, passing slice_num and a wandb.Image built from the image variable,
        # with the masks parameter set to wandb_mask and the classes parameter set
        # to wandb_class_set.
        data_at.add_data(
            slice_num,
            wandb.Image(image, masks=wandb_mask, classes=wandb_class_set)
        )
        
    # After the inner loop, log the table to the W&B dashboard with wandb.log,
    # passing a dictionary that maps the f-string key "Segmentation Viz {day}" to
    # the data_at table.
    wandb.log({f"Segmentation Viz {day}": data_at})
    
    # Finally, close the W&B run with wandb.finish.
    wandb.finish()
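
A note on the image conversion used in the loop: when tf.image.convert_image_dtype converts an integer image to a float type, it scales the values into the range 0 to 1 based on the largest value the integer type can hold. The NumPy sketch below imitates that scaling and compares it with a min-max stretch, which is an alternative not used in this notebook. Both function names and the demo array are made up for illustration.

```python
import numpy as np

def scale_by_dtype_max(image_u16):
    """Divide by the largest uint16 value (65535), giving floats in [0, 1]."""
    return image_u16.astype(np.float32) / np.iinfo(np.uint16).max

def scale_min_max(image_u16):
    """Stretch the image's own intensity range to [0, 1] (assumed alternative)."""
    img = image_u16.astype(np.float32)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

# Made-up 2x2 scan whose values use only a small part of the uint16 range.
demo = np.array([[0, 1000], [2000, 4000]], dtype=np.uint16)
print(scale_by_dtype_max(demo).max())
print(scale_min_max(demo).max())
```

With the first approach, an image that uses only a small part of the 16-bit range stays within a narrow band of values, while the min-max stretch spreads it over the full range.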

As you can see, if you click the link after "Synced (adjective)-(thing)-(number):", you'll be redirected to our W&B project workspace, where you can see the mask visualizations ranging from slice 1 to slice 62. Opening one of them, you can see how the GI tract masks outline the patient's stomach, large bowel, and small bowel on that slice.

## Conclusion
As we finish all three chapters of our EDA on the GI tract segmentation data, it's time to wrap up! In summary, we have gone from loading the train.csv file into our train_df dataframe, to visualizing the cases, days, and slices, to closely examining the mask layouts of the GI tract with Weights & Biases. Models built on this understanding could help radiation oncologists deliver high doses of radiation to the tumor while avoiding the stomach and intestines, so that daily treatments for GI tract cancer patients become faster, with fewer side effects and better long-term cancer control. With all of that said, let's help doctors plan these treatments more effectively someday!