<h1 style='background:#acd6fa; border:0; color:black'><center>Google Brain - Ventilator Pressure Prediction EDA Starter</center></h1>

<center><img width=700px; src="https://cdn.diabetesselfmanagement.com/2020/11/dsm-diabetes-and-lung-health-shutterstock_1452313949.jpg"></center>

### This is a simple EDA notebook that I'll keep updating. Please upvote if you like it; it keeps me motivated to update the notebook.
## **Upvote is Free 🤗**

#### References
- ANDREW LUKYANENKO's [Ventilator Pressure Prediction: EDA, FE and models](https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models)
- CARL MCBRIDE ELLIS's [Ventilator Pressure: EDA and simple submission](https://www.kaggle.com/carlmcbrideellis/ventilator-pressure-eda-and-simple-submission)

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import missingno as msno
import os
import time
import lightgbm as lgb
from sklearn.model_selection import GroupKFold
from sklearn import metrics

plt.style.use('ggplot')
%matplotlib inline

<h1 style='background:#acd6fa; border:0; color:black'><center>1. Data Overview</center></h1>

# I. Read Data

In [None]:
data_path = '/kaggle/input/ventilator-pressure-prediction'

train = pd.read_csv(os.path.join(data_path, 'train.csv'))
test = pd.read_csv(os.path.join(data_path, 'test.csv'))
submission = pd.read_csv(os.path.join(data_path, 'sample_submission.csv'))

# II. Downcast train, test data to save memory

This is a tip to save memory and speed up processing time by downcasting. 

In [None]:
def downcast(df, verbose=True):
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        dtype_name = df[col].dtype.name
        if dtype_name == 'object':
            pass
        elif dtype_name == 'bool':
            df[col] = df[col].astype('int8')
        elif dtype_name.startswith('int') or (df[col].round() == df[col]).all():
            df[col] = pd.to_numeric(df[col], downcast='integer')
        else:
            df[col] = pd.to_numeric(df[col], downcast='float')
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('{:.1f}% Compressed'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

all_df = [train, test]
for df in all_df:
    df = downcast(df)
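
To make the effect of downcasting concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and `toy` is a hypothetical name, not part of the competition data. It only uses `pd.to_numeric` with the same `downcast` options as the function above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with pandas' default 64-bit dtypes.
toy = pd.DataFrame({
    'R': np.array([5, 20, 50, 5], dtype='int64'),
    'u_in': np.array([0.08, 18.38, 22.51, 22.81], dtype='float64'),
})
before = toy.memory_usage(index=False).sum()

# Same calls that downcast() applies column by column.
toy['R'] = pd.to_numeric(toy['R'], downcast='integer')      # int64 -> int8
toy['u_in'] = pd.to_numeric(toy['u_in'], downcast='float')  # float64 -> float32
after = toy.memory_usage(index=False).sum()

print(toy.dtypes)
print(f'{before} bytes -> {after} bytes')
```

Note that `float32` keeps roughly seven significant digits, so downcasting trades a little precision for memory.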

# III. Take a look at train, test, submission data

In [None]:
print(train.shape)
train.head()

`pressure` is the target value that we need to predict for the test data.

In [None]:
print(test.shape)
test.head()

In [None]:
print(submission.shape)
submission.head()

# IV. Create Feature Summary

## Function to create feature summary

The feature summary shows each feature's data type, number of null values, number of unique values, and first three values.

In [None]:
def resumetable(df):
    summary = pd.DataFrame(df.dtypes, columns=['Data Type'])
    summary = summary.reset_index()
    summary = summary.rename(columns={'index': 'Feature'})
    summary['Num of null'] = df.isnull().sum().values
    summary['Num of unique'] = df.nunique().values
    # Use positional indexing so this also works when the index does not start at 0
    summary['First value'] = df.iloc[0].values
    summary['Second value'] = df.iloc[1].values
    summary['Third value'] = df.iloc[2].values
    return summary

In [None]:
resumetable(train)

We can get useful information from the feature summary.

- **`id`** : The number of unique `id` values equals the number of training rows (6,036,000), so `id` uniquely identifies each row.
- **`breath_id`** : Globally-unique time step for breaths. The number of rows divided by the number of unique `breath_id` values is 6,036,000 / 75,450 = 80. **Therefore each block of 80 rows shares the same `breath_id`.** For example, the first 80 rows have `breath_id` 1, the next 80 rows have `breath_id` 2, and the 80 rows after that have `breath_id` 3.
- **`R`** : Lung attribute indicating how restricted the airway is. `R` takes only three values, so **`R` is a categorical feature.**
- **`C`** : Lung attribute indicating how compliant the lung is. `C` takes only three values, so **`C` is also a categorical feature.**
- **`time_step`** : The actual time stamp. **Continuous feature.**
- **`u_in`** : The control input for the inspiratory solenoid valve. Ranges from 0 to 100 (0 is completely closed, so no air is let in; 100 is completely open). **Continuous feature.**
- **`u_out`** : The control input for the expiratory solenoid valve. Either 0 (closed) or 1 (open). **Categorical feature.**
- **`pressure`** : The airway pressure measured in the respiratory circuit. **Target value.**
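
The "80 rows per breath" observation comes from dividing two totals, so it is worth checking directly. A minimal sketch, assuming a frame with a `breath_id` column; `rows_per_breath` and the toy frame are hypothetical names introduced here for illustration:

```python
import pandas as pd

def rows_per_breath(df):
    """Return the number of rows recorded for each breath_id."""
    return df.groupby('breath_id').size()

# Toy frame (made-up data): two breaths of three rows each.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'breath_id': [1, 1, 1, 2, 2, 2],
})
counts = rows_per_breath(toy)

# A single unique count means every breath has the same length.
print(counts.unique())
```

On the competition data the same check would be `rows_per_breath(train).unique()`; a single value there would confirm that every breath has the same number of rows.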

# V. Null Value Check

In [None]:
msno.bar(df=train, figsize=(7, 4));

None of the features contain null values.

<h1 style='background:#acd6fa; border:0; color:black'><center>2. Data Visualizations</center></h1>

# I. Distributions

## Target distribution

In [None]:
mpl.rc('font', size=15)
plt.figure(figsize=(8, 5))

ax = sns.histplot(x='pressure', data=train)
ax.set_title('Target Distribution');

There are lots of values around 7~8.

## `R` and `C` distributions

### Function for writing percent at the top of the bar graph

In [None]:
def write_percent(ax, total_size):
    '''Traverse the figure object and display the ratio at the top of the bar graph.'''
    for patch in ax.patches:
        height = patch.get_height() # Figure height (number of data)
        width = patch.get_width() # Figure width
        left_coord = patch.get_x() # The x-axis position on the left edge of the figure
        percent = height/total_size*100 # percent
        
        # Type text in the (x, y) coordinates
        ax.text(x=left_coord + width/2.0, # x-axis position
                y=height + total_size*0.001, # y-axis position
                s=f'{percent:1.1f}%', # Text
                ha='center') # in the middle

In [None]:
# Create only the figure; plt.subplot() below adds the individual axes
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
ax1 = sns.countplot(x='R', data=train)
plt.title('Counts of R in train')
write_percent(ax1, len(train))

plt.subplot(2, 2, 2)
ax2 = sns.countplot(x='R', data=test)
plt.title('Counts of R in test')
write_percent(ax2, len(test))

plt.subplot(2, 2, 3)
ax3 = sns.countplot(x='C', data=train)
plt.title('Counts of C in train')
write_percent(ax3, len(train))

plt.subplot(2, 2, 4)
ax4 = sns.countplot(x='C', data=test)
plt.title('Counts of C in test')
write_percent(ax4, len(test))

plt.tight_layout() # Space between the graphs

The categorical features `R` and `C` have similar distributions in the train and test data, which suggests that the two datasets follow similar patterns.
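
The visual comparison can be backed by numbers. Below is a minimal sketch that puts the category shares of train and test side by side; `compare_shares` and the toy frames are hypothetical helpers with made-up data, not part of the original notebook.

```python
import pandas as pd

def compare_shares(train_df, test_df, col):
    """Percentage share of each category of `col` in train vs. test."""
    shares = pd.DataFrame({
        'train_%': train_df[col].value_counts(normalize=True) * 100,
        'test_%': test_df[col].value_counts(normalize=True) * 100,
    }).fillna(0).sort_index()  # a category missing on one side counts as 0%
    shares['abs_diff'] = (shares['train_%'] - shares['test_%']).abs()
    return shares

# Toy frames (made-up data) standing in for train and test.
toy_train = pd.DataFrame({'R': [5, 5, 20, 50]})
toy_test = pd.DataFrame({'R': [5, 20, 20, 50]})
shares = compare_shares(toy_train, toy_test, 'R')
print(shares)
```

With the real data, `compare_shares(train, test, 'R')` and `compare_shares(train, test, 'C')` would show how small the differences actually are.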

## `u_in` distribution

In [None]:
mpl.rc('font', size=15)
plt.figure(figsize=(8, 5))

ax = sns.histplot(x='u_in', data=train)
ax.set_title('U_in Distribution');

In [None]:
print(f'u_in min : {train["u_in"].min()}')
print(f'u_in median : {train["u_in"].median()}')
print(f'u_in mean : {train["u_in"].mean()}')
print(f'u_in max : {train["u_in"].max()}')

Most `u_in` values are concentrated between 0 and 10.
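
How concentrated the values are can be quantified as the fraction falling inside an interval. A minimal sketch, where `share_in_range` is a hypothetical helper and the toy series holds made-up values:

```python
import pandas as pd

def share_in_range(values, low, high):
    """Fraction of `values` inside the closed interval [low, high]."""
    return values.between(low, high).mean()

# Toy series (made-up values) standing in for train['u_in'].
toy_u_in = pd.Series([0.0, 0.5, 4.2, 9.9, 35.0, 100.0])
share = share_in_range(toy_u_in, 0, 10)
print(f'{share:.1%} of the toy values lie in [0, 10]')
```

On the competition data the call would be `share_in_range(train['u_in'], 0, 10)`.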

## `u_out` distribution

In [None]:
mpl.rc('font', size=12)
# Create only the figure; plt.subplot() below adds the individual axes
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
ax1 = sns.countplot(x='u_out', data=train)
plt.title('Counts of u_out in train')
write_percent(ax1, len(train))

plt.subplot(1, 2, 2)
ax2 = sns.countplot(x='u_out', data=test)
plt.title('Counts of u_out in test')
write_percent(ax2, len(test))

plt.tight_layout() # Space between the graphs

The `u_out` distribution is the same in the train and test data.

# II. Plot heatmap of features representing correlations

In [None]:
fig, ax= plt.subplots() 
fig.set_size_inches(10, 8)

features = ['R', 'C', 'time_step', 'u_in', 'u_out']

corrMatt = train[features].corr() # correlation matrix by features

sns.heatmap(corrMatt, annot=True) # Plot correlation matrix heatmap
ax.set(title='Heatmap of features');

# III. Analysis for single breath_id

This graph is taken from ANDREW LUKYANENKO's [notebook](https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models).

In [None]:
fig, ax1 = plt.subplots(figsize = (10, 6))

breath_1 = train.loc[train['breath_id'] == 1]
ax2 = ax1.twinx()

ax1.plot(breath_1['time_step'], breath_1['pressure'], 'r-', label='pressure')
ax1.plot(breath_1['time_step'], breath_1['u_in'], 'g-', label='u_in')
ax2.plot(breath_1['time_step'], breath_1['u_out'], 'b-', label='u_out')

ax1.set_xlabel('Timestep')

ax1.legend(loc=(1.1, 0.8))
ax2.legend(loc=(1.1, 0.7));

`u_in` and `u_out` show opposite patterns. Also, while `u_out` is 0 the pressure gradually increases, and when `u_out` switches to 1 the pressure drops abruptly.
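
The same observation can be read off numerically for any breath. The sketch below finds the moment `u_out` first becomes 1 and the pressure change on that step. `switch_summary` and the toy breath are hypothetical and use made-up values, and the function assumes the breath contains at least one row with `u_out == 1`.

```python
import pandas as pd

def switch_summary(breath):
    """Time at which u_out first becomes 1 and the pressure change on that step.

    Assumes `breath` holds a single breath with at least one u_out == 1 row.
    """
    breath = breath.sort_values('time_step')
    is_open = breath['u_out'] == 1
    switch_time = breath.loc[is_open, 'time_step'].iloc[0]
    # diff() gives the pressure change relative to the previous time step
    pressure_change = breath['pressure'].diff()[is_open].iloc[0]
    return switch_time, pressure_change

# Toy breath (made-up values): pressure rises while u_out is 0, then drops.
toy_breath = pd.DataFrame({
    'time_step': [0.0, 0.5, 1.0, 1.5, 2.0],
    'u_out':     [0, 0, 1, 1, 1],
    'pressure':  [6.0, 20.0, 8.0, 6.5, 6.0],
})
switch_time, pressure_change = switch_summary(toy_breath)
print(switch_time, pressure_change)
```

For the breath plotted above, the call would be `switch_summary(breath_1)`.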

# To Be Continued...