# SIIM: Step-by-Step Image Detection for Beginners 
## Mini Part. Preprocessing for Multi-Output Regression That Detects Opacities

👉 Part 1. [EDA to Preprocessing](https://www.kaggle.com/songseungwon/siim-covid-19-detection-10-step-tutorial-1)

👉 Part 2. [Basic Modeling - Simplest Image Classification Models using Keras](https://www.kaggle.com/songseungwon/siim-covid-19-detection-10-step-tutorial-2)

> Index
```
Step 1. Import Dataset
Step 2. Test on Sample Data (1 row) Before Building the Preprocessing Function
     2-a. The image with the most opacities detected is taken as a sample
     2-b. visualize resized image without boxes
     2-c. extract position information
     2-d. Extract all box's information for sample image.
     2-e. Compute corrected positions by applying the resizing ratios
     2-f. visualize resized image with boxes
Step 3. Build Function for reuse
     3-a. Test the pieces that go into the function
     3-b. Build Function and Create New DataFrame with loop
     3-c. concat dataframe and save
```

Next, we are going to create a neural network that detects opacities by drawing boxes around them. The plan is a simple model that returns four continuous dependent variables y (a multi-output regression).

To train it, we need a dataset that pairs X matrices (the images) with 4-value y vectors.

Let's create a small training dataset in this mini part.

## Step 1. Import Dataset

In [None]:
import pandas as pd

In [None]:
train_df = pd.read_csv('/kaggle/input/siim-covid19-preprocessed-datasettrain/custom_train.csv')

In [None]:
train_df.head()

## Step 2. Test on Sample Data (1 row) Before Building the Preprocessing Function

### 2-a. The image with the most opacities detected is taken as a sample.

In [None]:
sorted(train_df.OpacityCount.unique())

In [None]:
train_df[train_df.OpacityCount == 8]

In [None]:
sample_outlier = train_df[train_df.OpacityCount == 8]
sample_outlier

### 2-b. visualize resized image without boxes

In [None]:
import matplotlib.pyplot as plt

In [None]:
img = plt.imread(sample_outlier.path.values[0])
img

In [None]:
plt.imshow(img, cmap='gray');

### 2-c. extract position information

In [None]:
sample_box_position = sample_outlier.boxes.values[0]
sample_box_position

In [None]:
print('count of x : ',sample_box_position.count('x'))
print('count of y : ',sample_box_position.count('y'))
print('count of height : ',sample_box_position.count('height'))
print('count of width : ',sample_box_position.count('width'))

In [None]:
import re
p = re.compile(r"[-+]?\d*\.\d+|\d+") # extract numbers from a string
p_list = p.findall(sample_box_position) # returns a list of matched strings
print(p_list)

# [-+]?    : optional sign
# \d*\.\d+ : decimal number (digits before the point are optional)
# |        : or
# \d+      : plain integer
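
To see what the pattern does in isolation, here is a minimal sketch on a made-up string. The string `demo_boxes` and its values are hypothetical; they only imitate the general shape of an entry in the `boxes` column.

```python
import re

# Hypothetical string that imitates one entry of the `boxes` column.
demo_boxes = "[{'x': 10.5, 'y': 20, 'width': 30.25, 'height': 40}]"

# Optional sign, then either a decimal number or a plain integer.
number_pattern = re.compile(r"[-+]?\d*\.\d+|\d+")

demo_numbers = number_pattern.findall(demo_boxes)
print(demo_numbers)                      # ['10.5', '20', '30.25', '40']
print([float(n) for n in demo_numbers])  # [10.5, 20.0, 30.25, 40.0]
```

The matches come back as strings in the order they appear in the text, so they still need a `float()` conversion before any arithmetic.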

In [None]:
count_box = len(p_list) // 4
count_box

### 2-d. Extract all box's information for sample image.

In [None]:
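# Each box contributes four consecutive numbers to p_list.
# The order x, y, height, width is assumed here; compare it with the key order
# printed in sample_box_position, because a width/height swap would go unnoticed.
x_idx = []
y_idx = []
height_idx = []
width_idx = []
for i in range(count_box):
    base = i * 4  # position of the first number of the i-th box
    x_idx.append(base)
    y_idx.append(base+1)
    height_idx.append(base+2)
    width_idx.append(base+3)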
print('x_idx : ',x_idx)
print('y_idx : ',y_idx)
print('height_idx : ',height_idx)
print('width_idx : ',width_idx)

In [None]:
[p_list[x] for x in x_idx]

In [None]:
x_list = [float(p_list[idx]) for idx in x_idx]
y_list = [float(p_list[idx]) for idx in y_idx]
height_list = [float(p_list[idx]) for idx in height_idx]
width_list = [float(p_list[idx]) for idx in width_idx]
print('x_list : ',x_list)
print('y_list : ',y_list)
print('height_list : ',height_list)
print('width_list : ',width_list)
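
The same grouping can be sketched with stride slicing instead of explicit index lists. This is only an illustration: `demo_values` is a made-up flat list, and the field order is assumed to be the one used in the cells above.

```python
# Hypothetical flat list for two boxes, in the order x, y, height, width
# that the cells above assume.
demo_values = ['1.0', '2.0', '3.0', '4.0', '5.0', '6.0', '7.0', '8.0']

# Every 4th element starting at offset 0, 1, 2, 3 gives one field per box.
demo_x = [float(v) for v in demo_values[0::4]]
demo_y = [float(v) for v in demo_values[1::4]]
demo_height = [float(v) for v in demo_values[2::4]]
demo_width = [float(v) for v in demo_values[3::4]]

print(demo_x, demo_y, demo_height, demo_width)
# [1.0, 5.0] [2.0, 6.0] [3.0, 7.0] [4.0, 8.0]
```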

### 2-e. Compute corrected positions by applying the resizing ratios

In [None]:
train_df[train_df.OpacityCount == 8]

In [None]:
sample_height_ratio = train_df[train_df.OpacityCount == 8].height_ratio.values
sample_width_ratio = train_df[train_df.OpacityCount == 8].width_ratio.values

In [None]:
x_list

In [None]:
sample_height_ratio

In [None]:
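import numpy as np

# Convert the lists to arrays so the multiplication is explicitly element-wise.
# Horizontal values use the width ratio, vertical values use the height ratio.
resized_x_list = np.array(x_list)*sample_width_ratio
resized_y_list = np.array(y_list)*sample_height_ratio
resized_width_list = np.array(width_list)*sample_width_ratio
resized_height_list = np.array(height_list)*sample_height_ratio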

In [None]:
print('resized_x_list : \n',resized_x_list)
print('resized_y_list : \n',resized_y_list)
print('resized_width_list : \n',resized_width_list)
print('resized_height_list : \n',resized_height_list)
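
The arithmetic behind the correction can be checked by hand on one box. The sketch below assumes each ratio is the resized size divided by the original size; `scale_box` and all numbers are hypothetical and not taken from the dataset.

```python
def scale_box(x, y, width, height, width_ratio, height_ratio):
    """Scale one box: horizontal values by width_ratio, vertical by height_ratio."""
    return (x * width_ratio, y * height_ratio,
            width * width_ratio, height * height_ratio)

# Hypothetical example: a 2000x1000 (width x height) image resized to 500x500.
demo_width_ratio = 500 / 2000    # 0.25
demo_height_ratio = 500 / 1000   # 0.5

demo_scaled = scale_box(400, 200, 800, 100, demo_width_ratio, demo_height_ratio)
print(demo_scaled)  # (100.0, 100.0, 200.0, 50.0)
```

Because the two ratios differ whenever the aspect ratio changes, x and width must share one ratio while y and height share the other.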

### 2-f. visualize resized image with boxes

In [None]:
import matplotlib.patches  # import the submodule explicitly instead of relying on pyplot
import matplotlib.pyplot as plt

In [None]:
resized_x_list

In [None]:
count_box

In [None]:
fig, ax = plt.subplots(1,1, figsize=(4,4))
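for i in range(count_box):
    # Rectangle takes the (x, y) anchor corner, then the width and the height.
    # It is named `rect` so that the compiled regex `p` is not overwritten.
    rect = matplotlib.patches.Rectangle((resized_x_list[i], resized_y_list[i]),
                                        resized_width_list[i], resized_height_list[i],
                                        ec='r', fc='none', lw=2.)
    ax.add_patch(rect)
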
ax.imshow(img, cmap='gray')
plt.show()

## Step 3. Build Function for reuse

In [None]:
train_df.head()

### 3-a. Test the pieces that go into the function

In [None]:
p = re.compile(r"[-+]?\d*\.\d+|\d+")
box_positions = train_df.boxes.apply(lambda x : p.findall(str(x)))
box_positions

In [None]:
train_df.OpacityCount

### 3-b. Build Function and Create New DataFrame with loop

In [None]:
import numpy as np

def resize_box_position(df, c):
    """Return the resized box positions for row `c` of train_df as a Series.

    `df` only supplies the output column names; the data is read from train_df.
    """
    # Rows without boxes hold NaN; return zeros for them.
    if pd.isna(train_df.boxes[c]):
        return pd.Series([0,0,0,0], index=df.columns)

    count_box = train_df.OpacityCount[c]
    p_list = [float(v) for v in p.findall(train_df.boxes[c])]

    # Four consecutive numbers per box; the order x, y, height, width is assumed.
    starts = range(0, 4 * count_box, 4)
    x_list = np.array([p_list[i] for i in starts])
    y_list = np.array([p_list[i+1] for i in starts])
    height_list = np.array([p_list[i+2] for i in starts])
    width_list = np.array([p_list[i+3] for i in starts])

    x_ratio = train_df.width_ratio[c]
    y_ratio = train_df.height_ratio[c]

    # Horizontal values use the width ratio, vertical values use the height ratio.
    return pd.Series([x_list*x_ratio, y_list*y_ratio,
                      width_list*x_ratio, height_list*y_ratio], index=df.columns)
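
If the `boxes` strings turn out to be Python-literal lists of dicts (an assumption; check a printed sample first), reading the values by key avoids depending on the order of the numbers. The helper `parse_boxes` and the string `demo_raw` below are hypothetical and not part of this notebook.

```python
import ast
import math

def parse_boxes(raw):
    """Return a list of (x, y, width, height) tuples, or [] for missing values."""
    # Missing entries typically arrive as float NaN when read with pandas.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return []
    boxes = ast.literal_eval(raw)  # evaluates literals only, not arbitrary code
    return [(b['x'], b['y'], b['width'], b['height']) for b in boxes]

demo_raw = "[{'x': 10.5, 'y': 20, 'width': 30.25, 'height': 40}]"
print(parse_boxes(demo_raw))      # [(10.5, 20, 30.25, 40)]
print(parse_boxes(float('nan')))  # []
```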

In [None]:
columns = ['resized_box_x', 'resized_box_y', 'resized_box_width', 'resized_box_height']
df = pd.DataFrame(columns=columns)  # empty frame that supplies the column names

# DataFrame.append was removed in pandas 2.0, and appending row by row is slow,
# so collect the rows in a list and build the DataFrame once at the end.
rows = []
last_idx = max(train_df.index)
for idx in train_df.index:
    rows.append(resize_box_position(df, idx).tolist())
    if idx % 500 == 0:
        print('saved - {}/{}'.format(idx, last_idx))
    elif idx == last_idx:
        print('complete - {}/{}'.format(idx, last_idx))

df = pd.DataFrame(rows, columns=columns)

In [None]:
df

### 3-c. Concat DataFrame and Save

In [None]:
train_df = pd.concat([train_df,df], axis=1)
train_df

In [None]:
train_df.to_csv('train_full_info.csv')
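
One thing to keep in mind: the new columns hold arrays, and a CSV file stores them as text, so they come back as strings when the file is read again. One option, sketched here with the standard library only, is to serialize each list as JSON before writing. The row contents and field values below are hypothetical, and `io.StringIO` stands in for a real file.

```python
import csv
import io
import json

# Hypothetical row: one image id and a list of resized x positions.
demo_row = {'id': 'img_001', 'resized_box_x': json.dumps([100.0, 250.5])}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'resized_box_x'])
writer.writeheader()
writer.writerow(demo_row)

buffer.seek(0)
loaded = next(csv.DictReader(buffer))
restored_x = json.loads(loaded['resized_box_x'])  # back to a list of floats
print(restored_x)  # [100.0, 250.5]
```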