<a id="top"></a>

<div class="list-group" id="list-tab">
<h1 class="list-group-item" style='color:white; background:deeppink; border:0'><center>Quick Navigation</center></h1>

* [1. Data loading and overview](#1)

    
* [2. Feature Exploration](#2)
  * [a. breath_id](#3)
  * [b. R](#4)
  * [c. C](#5)
  * [d. R and C](#6)
  * [e. time_step](#7)
  * [f. u_in](#8)
  * [g. u_out](#9)
  * [h. pressure](#10)
    
    
* [3. Visualize for each breath_id and R_C](#200)


<a id="1"></a>
<h1 style='background:deeppink; border:0; color:white'><center>1. Data loading and overview</center></h1>

In [None]:
import os
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from matplotlib_venn import venn2

# List the input files available in the Kaggle environment.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv',index_col=0)
test  = pd.read_csv('../input/ventilator-pressure-prediction/test.csv', index_col=0)


In [None]:
display(train.head())

display(test.head())

In [None]:
train['train_test'] = 'train'
test['train_test'] = 'test'
full = pd.concat([train, test], axis=0)

<div class="list-group">
<a id="2"></a>
<h1 style='background:deeppink; border:0; color:white'><center>2. Feature Exploration</center></h1>

<a id="3"></a>
<h2>breath_id</h2>

> breath_id - globally-unique time step for breaths

In [None]:
column = 'breath_id'
plt.figure(figsize=(3,3))
venn2(subsets=(set(train[column].unique()), set(test[column].unique())),
      set_labels=('Train', 'Test'))
plt.title(column)

In [None]:
display(train['breath_id'].value_counts())
print('***')
display(train['breath_id'].value_counts().unique())

In [None]:
display(test['breath_id'].value_counts())
print('***')
display(test['breath_id'].value_counts().unique())

No breath_id appears in both train and test, and each breath_id has exactly 80 rows.
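
Both observations can also be checked programmatically instead of being read off the displays. The following is a minimal sketch, assuming only a `breath_id` column; `shared_breath_ids` and `breath_lengths_ok` are hypothetical helper names, and the toy frames stand in for `train` and `test`.

```python
import pandas as pd

def shared_breath_ids(train_df, test_df):
    """Return the set of breath_id values present in both frames."""
    return set(train_df['breath_id']) & set(test_df['breath_id'])

def breath_lengths_ok(df, expected_len=80):
    """Return True if every breath_id in df has exactly expected_len rows."""
    return bool((df.groupby('breath_id').size() == expected_len).all())

# Toy frames standing in for train/test: two breaths each, 80 rows per breath.
toy_train = pd.DataFrame({'breath_id': [1] * 80 + [2] * 80})
toy_test = pd.DataFrame({'breath_id': [3] * 80 + [4] * 80})

shared_ids = shared_breath_ids(toy_train, toy_test)
train_ok = breath_lengths_ok(toy_train)
test_ok = breath_lengths_ok(toy_test)
print(shared_ids, train_ok, test_ok)
```

On the real data, the same two helpers could be called with `train` and `test` directly.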

<a id="4"></a>
<h2>R</h2>

> R - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.

In [None]:
print('train')
display(train['R'].value_counts().sort_index())

print('test')
display(test['R'].value_counts().sort_index())

In [None]:
sns.countplot(data=full, x='R', hue='train_test')

<a id="5"></a>
<h2>C</h2>

> C - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.

In [None]:
print('train')
display(train['C'].value_counts().sort_index())

print('test')
display(test['C'].value_counts().sort_index())

In [None]:
sns.countplot(data=full, x='C', hue='train_test')

<a id="6"></a>
<h2>R and C</h2>

In [None]:
train['R_C'] = [f'{r}_{c}' for r, c in zip(train['R'], train['C'])]
test['R_C'] = [f'{r}_{c}' for r, c in zip(test['R'], test['C'])]
full['R_C'] = [f'{r}_{c}' for r, c in zip(full['R'], full['C'])]

In [None]:
sns.countplot(data=full, x='R_C', hue='train_test')

<a id="7"></a>
<h2>time_step</h2>

> time_step - the actual time stamp.

In [None]:
plt.figure(figsize=(16, 4))
plt.subplot(121)
plt.hist(train['time_step'], bins=100) 
plt.title('train')

plt.subplot(122)
plt.hist(test['time_step'], bins=100);
plt.title('test')

plt.tight_layout()

<a id="8"></a>
<h2>u_in</h2>

> u_in - the control input for the inspiratory solenoid valve. Ranges from 0 to 100.

In [None]:
plt.figure(figsize=(16, 4))
plt.subplot(121)
plt.hist(train['u_in'], bins=100) 
plt.title('train')

plt.subplot(122)
plt.hist(test['u_in'], bins=100);
plt.title('test')

plt.tight_layout()

<a id="9"></a>
<h2>u_out</h2>

> u_out - the control input for the expiratory solenoid valve. Either 0 or 1.

In [None]:
plt.figure(figsize=(16, 4))
plt.subplot(121)
plt.hist(train['u_out'], bins=100) 
plt.title('train')

plt.subplot(122)
plt.hist(test['u_out'], bins=100);
plt.title('test')

plt.tight_layout()

In [None]:
train.groupby('breath_id')['u_out'].count()

In [None]:
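# Count u_out values per breath; unstack turns the u_out level (0/1) into columns.
tmp = train.groupby('breath_id')['u_out'].value_counts().unstack()
# Name the columns explicitly so the result does not depend on the pandas version.
tmp.columns = [f'u_out_{v}' for v in tmp.columns]
display(tmp.head())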

In [None]:
display(tmp['u_out_0'].value_counts().sort_index())
display(tmp['u_out_1'].value_counts().sort_index(ascending=False))

In [None]:
tmp['u_out_0'].mean()

Each breath_id has 80 rows, but on average only about 30.3 of them (the rows with u_out = 0) are included in the score calculation.
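
The per-breath count of scored rows can also be computed without reshaping. This is a minimal sketch under the assumption that scored rows are exactly those with `u_out == 0`; `scored_rows_per_breath` is a hypothetical helper name and the toy frame is illustrative only.

```python
import pandas as pd

def scored_rows_per_breath(df):
    """Count the rows with u_out == 0 in each breath_id."""
    # Summing a boolean Series counts its True values within each group.
    return df['u_out'].eq(0).groupby(df['breath_id']).sum()

# Toy frame: breath 1 has two scored rows, breath 2 has one.
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'u_out':     [0, 0, 1, 1, 0, 1, 1, 1],
})
counts = scored_rows_per_breath(toy)
print(counts.mean())  # 1.5
```

Unlike the unstacked table, this version yields 0 rather than NaN for a breath that has no `u_out == 0` rows.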

<a id="10"></a>
<h2>pressure</h2>

> pressure - the airway pressure measured in the respiratory circuit, measured in cmH2O.


In [None]:
plt.hist(train['pressure']);

In [None]:
plt.figure(figsize=(16, 16))
i = 1
for r_c in train['R_C'].unique():
    tmp = train.query('R_C == @r_c')
    plt.subplot(5, 2, i)
    plt.hist(tmp['pressure'], bins=50, range=(0, 70))
    r, c = r_c.split('_')
    plt.title(f'R: {r}, C: {c}')
    i += 1
    
plt.tight_layout()
plt.show()

> The competition will be scored as the mean absolute error between the predicted and actual pressures during the inspiratory phase of each breath. The expiratory phase is not scored. The score is given by $|X - Y|$, where $X$ is the vector of predicted pressures and $Y$ is the vector of actual pressures.

Only rows where u_out is 0 are used to calculate the score. [[ref: What is the "expiratory phase" ?]](https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/273906#1522267) Let's split the pressure distribution by u_out and visualize each part.
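
For reference, the metric described above can be sketched as a masked mean absolute error. This is an illustration, not the official scoring code: `inspiratory_mae` is a hypothetical function name, and it assumes the scored rows are exactly those with `u_out == 0`.

```python
import numpy as np

def inspiratory_mae(y_true, y_pred, u_out):
    """Mean absolute error over the rows where u_out == 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = np.asarray(u_out) == 0  # keep only the inspiratory phase
    return float(np.abs(y_true[mask] - y_pred[mask]).mean())

# The third row has u_out == 1, so its large error is ignored.
score = inspiratory_mae([10.0, 20.0, 30.0], [12.0, 19.0, 0.0], [0, 0, 1])
print(score)  # 1.5
```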


In [None]:
plt.figure(figsize=(16, 16))
i = 1
for r_c in train['R_C'].unique():
    tmp = train.query('R_C == @r_c')
    plt.subplot(5, 2, i)
    # Plot the scored (u_out == 0) and unscored (u_out == 1) rows separately.
    tmp_u_out_0 = tmp.query('u_out == 0')
    plt.hist(tmp_u_out_0['pressure'], bins=50, range=(0, 70), label='u_out = 0', alpha=0.5)
    tmp_u_out_1 = tmp.query('u_out == 1')
    plt.hist(tmp_u_out_1['pressure'], bins=50, range=(0, 70), label='u_out = 1', alpha=0.5)
    r, c = r_c.split('_')
    plt.title(f'R: {r}, C: {c}')
    i += 1
    plt.legend()
    
    
plt.tight_layout()
plt.show()

<a id="200"></a>

<h1 style='background:deeppink; border:0; color:white'><center>3. Visualize for each breath_id and R_C</center></h1>

In [None]:
!pip install ipyplot

In [None]:
tmp_dir = Path('../tmp')
tmp_dir.mkdir(exist_ok=True)

In [None]:
image_paths = []
labels = []

for r_c in train['R_C'].unique():
    df = train.query('R_C == @r_c')
    for breath_id in df['breath_id'].unique()[:12]:
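        # Filter the already-selected R_C subset instead of scanning all of train.
        tmp_df = df.query('breath_id == @breath_id')
        tmp_u_out_0_df = tmp_df.query('u_out == 0')
        tmp_u_out_1_df = tmp_df.query('u_out == 1')
        # Look up R and C by column name rather than by position.
        R = tmp_df['R'].iloc[0]
        C = tmp_df['C'].iloc[0]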
        plt.scatter(tmp_u_out_0_df['time_step'], tmp_u_out_0_df['pressure'], label='pressure: u_out=0', s=3, color='r')
        plt.scatter(tmp_u_out_1_df['time_step'], tmp_u_out_1_df['pressure'], label='pressure: u_out=1', s=3)
        plt.scatter(tmp_u_out_0_df['time_step'], tmp_u_out_0_df['u_in'], label='u_in: u_out=0', s=3, color='y')
        plt.scatter(tmp_u_out_1_df['time_step'], tmp_u_out_1_df['u_in'], label='u_in: u_out=1', s=3)
        # plt.scatter(tmp_df['time_step'], tmp_df['u_out'], label='u_out')
        plt.title(f'R: {R}, C: {C}, breath_id: {breath_id}')
        plt.legend()
        plt.savefig(tmp_dir / f'{breath_id}.jpg')
        image_paths.append(str(tmp_dir / f'{breath_id}.jpg'))
        labels.append(f'R: {R}, C: {C}')
        plt.close()

In [None]:
import ipyplot
ipyplot.plot_class_tabs(image_paths, labels, force_b64=True, img_width=350)