# 👨‍⚕️ OSIC Pulmonary Fibrosis Progression 👩‍⚕️
![](https://medicaldialogues.in/h-upload/2020/05/18/128958-idiopathic-pulmonary-fibrosis.jpg)

### In this notebook, I will show you how to preprocess and prepare both our Tabular and Image datasets.

#### **<font color='red'>Disclaimer</font>** : This notebook is a continuation of my [previous notebook](https://www.kaggle.com/sarthak97/osic-starter-eda-dicom-viz-analysis/notebook) on EDA and DICOM data viz. and analysis. Make sure to check it out to get a better understanding.

In [None]:
# Importing relevant packages
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# _<font color='red'>1. Tabular Data Preprocessing + Preparation</font>_ 🛠️

## _1.1 Train Data_

In [None]:
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.describe()

#### We have seen in the [previous notebook](https://www.kaggle.com/sarthak97/osic-starter-eda-dicom-viz-analysis/notebook) that there are no missing values in the train set, so there is no need to check again.

### Since the 'Patient' feature contains only the patient's ID, we can remove it from our dataset.

In [None]:
# Create checkpoint before deleting 'patient' feature
original_df = train_df.copy()

# Delete the feature now
train_df = train_df.drop(['Patient'], axis=1)
train_df.head()

### Let's handle the categorical variables now...

<font color='red'>**To do this, we will use the pandas 'get_dummies' method, which one-hot encodes the categorical variables. Passing drop_first=True drops one dummy column per variable to avoid the dummy variable trap.**</font>
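As a rough illustration of what 'get_dummies' does with drop_first=True, here is a minimal sketch on a toy dataframe. The column names are assumed to mirror the competition data and the values are made up.

```python
import pandas as pd

# Toy frame: column names are assumptions, values are invented for illustration
toy = pd.DataFrame({
    "Age": [70, 65, 72],
    "Sex": ["Male", "Female", "Male"],
    "SmokingStatus": ["Ex-smoker", "Never smoked", "Currently smokes"],
})

# One-hot encode the text columns; the first category (in sorted order)
# of each variable is dropped and becomes the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

'Sex' has two categories, so it becomes a single 'Sex_Male' column. 'SmokingStatus' has three, so it becomes two columns. The numeric 'Age' column is left untouched.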

In [None]:
# Create a checkpoint
df_without_patient = train_df.copy()

# Convert the categorical variables now...
train_df = pd.get_dummies(train_df, drop_first=True)
train_df.head()

### From the dataframe head above, all values are now numerical. It is clear that some values are very large while others are small.

<font color='red'>**This is a problem because many Machine Learning algorithms are sensitive to feature magnitude: they may treat features with larger values as more important than features with smaller values, which is not the case. So we need to handle this.**</font>

We can do so by **Scaling** our dataset. 'StandardScaler' standardizes each feature to zero mean and unit variance, so that all features end up on a comparable scale.
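To make the scaling step concrete, here is a small NumPy-only sketch of standardization, z = (x - mean) / std, on made-up numbers. It uses the population standard deviation (ddof=0), which is also what 'StandardScaler' uses.

```python
import numpy as np

# Two invented features on very different scales
X = np.array([[2000.0, 60.0],
              [3000.0, 70.0],
              [4000.0, 80.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0)
X_scaled = (X - mean) / std  # z-score per column

print(X_scaled)
```

After scaling, both columns hold the same values even though the raw magnitudes differed by a factor of about 50.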

In [None]:
# Create a checkpoint
df_with_dummies = train_df.copy()

# Now scale each feature in dataframe
scaler = StandardScaler()
scaler.fit(train_df)
train_df = scaler.transform(train_df)
train_df

'train_df' now contains all scaled values and is a 2D numpy array.

In [None]:
# Convert numpy array into dataframe
final_train_df = pd.DataFrame(train_df)

# get column headers from checkpoint df
col_names = df_with_dummies.columns

# Set column headers in final dataframe
final_train_df.columns = col_names

final_train_df.head()

## _1.2 Test Data_
### We will repeat the same steps that we performed above for the test data

In [None]:
test_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')
test_df.head()

In [None]:
test_df.shape

In [None]:
test_df.describe()

#### We have seen in the [previous notebook](https://www.kaggle.com/sarthak97/osic-starter-eda-dicom-viz-analysis/notebook) that there are no missing values in the test set either, so there is no need to check again.

### Since the 'Patient' feature contains only the patient's ID, we can remove it from our dataset.

In [None]:
# Create checkpoint before deleting 'patient' feature
original_test_df = test_df.copy()

# Delete the feature now
test_df = test_df.drop(['Patient'], axis=1)
test_df.head()

In [None]:
# Create a checkpoint
test_df_without_patient = test_df.copy()

# Convert the categorical variables now...
test_df = pd.get_dummies(test_df, drop_first=True)
test_df.head()

In [None]:
# Create a checkpoint
test_df_with_dummies = test_df.copy()

# Now scale each feature in dataframe
scaler = StandardScaler()
scaler.fit(test_df)
test_df = scaler.transform(test_df)
test_df

In [None]:
# Convert numpy array into dataframe
final_test_df = pd.DataFrame(test_df)

# get column headers from checkpoint df
col_names = test_df_with_dummies.columns

# Set column headers in final dataframe
final_test_df.columns = col_names

final_test_df.head()
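One caveat with encoding and scaling the test set on its own: if a category is missing from the test data, the dummy columns can differ from the train set, and a scaler fitted on the test set uses different statistics than the one fitted on the train set. The following is only a sketch of one possible alternative, on made-up frames with hypothetical names. It fixes the category list from the train data and reuses the train mean and standard deviation.

```python
import pandas as pd

# Invented stand-ins for the raw frames; the test frame contains only one category
train_raw = pd.DataFrame({"Age": [60.0, 70.0, 80.0], "Sex": ["Male", "Female", "Male"]})
test_raw = pd.DataFrame({"Age": [70.0, 75.0], "Sex": ["Male", "Male"]})

# Fix the category list from the train set so both frames encode identically
sex_categories = sorted(train_raw["Sex"].unique())
train_cat = train_raw.assign(Sex=pd.Categorical(train_raw["Sex"], categories=sex_categories))
test_cat = test_raw.assign(Sex=pd.Categorical(test_raw["Sex"], categories=sex_categories))

train_enc = pd.get_dummies(train_cat, drop_first=True).astype(float)
test_enc = pd.get_dummies(test_cat, drop_first=True).astype(float)

# Scale both frames with statistics computed on the train set only
train_mean = train_enc.mean()
train_std = train_enc.std(ddof=0)  # ddof=0 mirrors StandardScaler
train_scaled = (train_enc - train_mean) / train_std
test_scaled = (test_enc - train_mean) / train_std

print(test_scaled)
```

With scikit-learn, the scaling part corresponds to calling fit on the train data once and then transform on both sets.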

### _(Optional) Run the cell below to save the final data as csv files. You can then download them as ready-to-use inputs for when we create our baseline model._

In [None]:
final_train_df.to_csv('final_train.csv', index=False)
final_test_df.to_csv('final_test.csv', index=False)
'''
1. Hit "Commit and Run" at the top right-hand corner of the kernel.
2. Wait until the kernel has run from top to bottom.
3. Check the 'Output' tab from the Version tab, or open the snapshot of your kernel and check its 'Output' tab.
   Your csv files will be there!
4. Download them.
'''


# _<font color='red'>2. Image Data Preprocessing</font>_ 📸

**Background**: When we work on image-based problems and deploy a deep learning solution, it helps to have a fast library for reading and transforming images. I will convert the DICOM images to numpy arrays and use OpenCV, which makes preprocessing a lot simpler.

### Below are the steps that can be performed in general to preprocess your image dataset.

#### _<font color='blue'>NOTE : I will be showing this on one image just to demonstrate how you can implement this in your notebooks.</font>_

In [None]:
!pip install imutils

I am using the imutils package here because its methods are simple and easy to use.

In [None]:
# Importing relevant packages

import os
import cv2
import glob
import imutils
import pydicom
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Convert .dcm image to .png image

path = '../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/1.dcm'
file = path.split('/')[-1]
outdir = './'

# read dcm file
dcm_img = pydicom.dcmread(path)

# get pixel arrays, replace .dcm extension with .png and place image in output directory
img = dcm_img.pixel_array
cv2.imwrite(outdir + file.replace('.dcm','.png'),img)
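CT pixel data in DICOM files is usually stored with more than 8 bits per pixel, so depending on the array dtype, writing it straight to a PNG can clip or distort intensities. As a hedged sketch (the helper name 'to_uint8' is hypothetical, and simple min-max scaling is only one of several possible choices), the array could be rescaled to 0-255 before saving:

```python
import numpy as np

def to_uint8(pixels):
    """Min-max scale an integer array of any depth into the 0-255 uint8 range."""
    pixels = pixels.astype(np.float32)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:
        # Constant image: avoid dividing by zero
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return scaled.round().astype(np.uint8)

# Invented 16-bit values standing in for dcm_img.pixel_array
fake_pixels = np.array([[-2000, -1000], [0, 1200]], dtype=np.int16)
img8 = to_uint8(fake_pixels)
print(img8)
```

The result of 'to_uint8(dcm_img.pixel_array)' could then be passed to cv2.imwrite in place of the raw array.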

In [None]:
image = cv2.imread("./1.png")
plt.imshow(image)

## _2.1 Blur Image_

In [None]:
# Gaussian Blur
blur = cv2.GaussianBlur(image, (7,7), 0)
plt.imshow(blur)

In [None]:
# Median Blur
median_blur = cv2.medianBlur(image, 5)
plt.imshow(median_blur)

## _2.2 Flip Image_
The image is flipped according to the value of flipCode as follows:

- flipCode = 0: flip vertically
- flipCode > 0: flip horizontally
- flipCode < 0: flip vertically and horizontally
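The three cases are easier to see on a tiny array than on an image. The sketch below uses plain NumPy slicing as a stand-in for the corresponding flip codes, so it runs without any image file. The variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

flip_code_zero = a[::-1, :]         # like flipCode = 0: rows reversed (top <-> bottom)
flip_code_positive = a[:, ::-1]     # like flipCode > 0: columns reversed (left <-> right)
flip_code_negative = a[::-1, ::-1]  # like flipCode < 0: both axes reversed

print(flip_code_zero)
print(flip_code_positive)
print(flip_code_negative)
```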

In [None]:
# Flip vertically 
flip_vertical = cv2.flip(image, flipCode=0)
plt.imshow(flip_vertical)

In [None]:
# Flip horizontally 
flip_horizontal = cv2.flip(image, flipCode=1)
plt.imshow(flip_horizontal)

In [None]:
# Flip both horizontally and vertically
flip_both = cv2.flip(image, flipCode=-1)
plt.imshow(flip_both)

## _2.3 Edge Detection_

In [None]:
edged = cv2.Canny(image, 100, 200)
plt.imshow(edged)

## _2.4 Rotating an image_

The imutils package uses the following convention when rotating an image:

- Angle > 0 -> Counter clockwise
- Angle < 0 -> Clockwise

In [None]:
# clockwise rotation
rotate_clock = imutils.rotate(image, -45)
plt.imshow(rotate_clock)

In [None]:
# counter clockwise rotation
rotate_counter = imutils.rotate(image, 90)
plt.imshow(rotate_counter)

## _2.5 Thresholding an image_

In [None]:
_, thresh1 = cv2.threshold(image, 200, 255, cv2.THRESH_BINARY)
plt.imshow(thresh1)

In [None]:
_, thresh2 = cv2.threshold(image, 200, 255, cv2.THRESH_BINARY_INV)
plt.imshow(thresh2)

## _2.6 Erosion and Dilation_
Erosion and dilation are morphological transformations.

While **<font color='blue'>Erosion</font>** is helpful for removing white noise (always try to keep the foreground white), **<font color='blue'>Dilation</font>** is useful for joining broken parts of an object.

For noise removal, erosion is normally followed by dilation: erosion removes the white noise but also shrinks the object, so we dilate afterwards. The noise is already gone and does not come back, while the object grows back towards its original size.

In [None]:
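# Erosion with a 5x5 structuring element
# (the kernel must be an array, not a shape tuple; np is imported at the top of the notebook)
kernel = np.ones((5, 5), np.uint8)
erode = cv2.erode(thresh1, kernel, iterations=1)
plt.imshow(erode)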

In [None]:
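# Dilation with a 5x5 structuring element (an array, not a shape tuple)
kernel = np.ones((5, 5), np.uint8)
dilate = cv2.dilate(thresh1, kernel, iterations=1)
plt.imshow(dilate)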

### <font color='orange'>If you find this notebook useful, please **UPVOTE** it 😊. It keeps me motivated to work hard and produce more quality content for everyone.</font>