# Overview

#### ※ This notebook is written mainly in English, with Japanese as a secondary language.

## References (with thanks)
* [Ventilator Pressure: Preliminary EDA (EN/JPN)](https://www.kaggle.com/kaitohonda/ventilator-pressure-preliminary-eda-en-jpn/edit)
* [Google-Brain_Starter](https://www.kaggle.com/drcapa/google-brain-starter) 
* [Ventilator Pressure Prediction: EDA, FE and models](https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models#Model-training)
* [Ventilator Pressure Prediction [EDA]](https://www.kaggle.com/manojkumars00/ventilator-pressure-prediction-eda)
* [LGBM on CPU+Optuna Tuning](https://www.kaggle.com/towhidultonmoy/lgbm-on-cpu-optuna-tuning)

### Problem
* Developing new methods for controlling mechanical ventilators is prohibitively expensive, even before reaching clinical trials. High-quality simulators could reduce this barrier.

### Goal
* To simulate a ventilator connected to a sedated patient's lung, taking the lung attributes compliance and resistance into account.  
(鎮静状態の患者の肺に接続された人工呼吸器のシミュレーションを行う。シミュレーションは肺の特性である追従性や抵抗を考慮する)

# Data

* The ventilator data used in this competition was produced using a modified open-source ventilator connected to an artificial bellows test lung via a respiratory circuit  
(このコンペティションで使用した人工呼吸器のデータは、オープンソースの人工呼吸器を改造し、呼吸回路を介して人工的な試験肺に接続して作成された)
* The diagram below illustrates the setup, with the two control inputs highlighted in green and the state variable (airway pressure) to predict in blue.  
(下図では2つの制御入力を緑で、予測する状態変数（気道圧）を青で示す)
* The first control input is a continuous variable from 0 to 100 representing the percentage the inspiratory solenoid valve is open to let air into the lung (i.e., 0 is completely closed and no air is let in, and 100 is completely open).  
(1つ目の制御入力は0〜100の連続変数で、空気を肺に入れるために吸気電磁弁を開く割合を表します（すなわち、0は完全に閉じて空気を入れず、100は完全に開く）)  
* The second control input is a binary variable representing whether the expiratory valve is open (1) or closed (0) to let air out.  
(2つ目の制御入力は、空気を出すための排気電磁弁が開いている（1）か閉じている（0）かを表す二値変数)

![](https://raw.githubusercontent.com/google/deluca-lung/main/assets/2020-10-02%20Ventilator%20diagram.svg)


Each time series represents an approximately 3-second breath. The files are organized such that each row is a time step in a breath and gives the two control signals, the resulting airway pressure, and relevant attributes of the lung, described below.  
各時系列データは約3秒の呼吸を表しています。ファイルは各行が呼吸の時間ステップとなるように構成されており、2つの制御信号、その結果としての気道圧、および以下に述べる肺の関連属性が与えられています。

### Columns

- id - globally-unique time step identifier across an entire file / ファイル全体で一意のタイムステップ識別子
- breath_id - globally unique identifier for each breath / 呼吸ごとの一意な識別子
- R - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow. / 気道がどの程度制限されているかを示す肺属性（単位：cmH2O/L/S）。物理的には、流量（時間当たりの空気量）の変化に対する圧力の変化です。直感的には、ストローで風船を膨らませるようなイメージです。ストローの直径を変えることでRを変化させることができ、Rが大きいほど吹きにくくなります。
- C - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow. / 肺の適合性を示す肺属性（単位：mL/cmH2O）。物理的には、圧力の変化に対する体積の変化を表します。直感的には、同じ風船の例を想像してください。風船のラテックスの厚さを変えることでCを変化させることができます。Cが大きいほどラテックスが薄く、吹きやすくなります。
- time_step - the actual time stamp. / 実際のタイムスタンプ
- ★u_in - the control input for the inspiratory solenoid valve. Ranges from 0 to 100. / 吸気ソレノイドバルブの制御入力です。0～100の範囲で設定できます。
- ★u_out - the control input for the expiratory solenoid valve. Either 0 or 1. / 排気ソレノイドバルブの制御入力です。0または1のいずれかです。
- pressure - the airway pressure measured in the respiratory circuit, measured in cmH2O. / 呼吸回路で測定された気道の圧力で、単位はcmH2Oです。

#### Terminology


- PEEP (positive end-expiratory pressure) ・・・ positive pressure kept applied to the airway at the end of exhalation (PEEPは息を吐いた時に陽圧をかけておくこと)  
    - ⇨ improves oxygenation (酸素化の改善効果)
    - ⇄ risk of circulatory failure (循環不全のリスク)
- positive pressure ・・・ a state in which the pressure is higher than the outside pressure (陽圧..外よりも気圧が高い状態)


- R / C  
A ventilator needs to take into account the structure of the lung to determine the optimal pressure to induce. Such
structural factors include compliance (C), or the change in lung volume per unit pressure, and resistance (R), or the
change in pressure per unit flow.
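
Written as formulas, the two definitions above look as follows. The symbols $P$ (airway pressure), $V$ (lung volume) and $Q$ (flow) are chosen here for illustration and do not appear in the competition description:

$$
R = \frac{\Delta P}{\Delta Q}, \qquad C = \frac{\Delta V}{\Delta P}
$$

With the units used in this dataset, $R$ is expressed in cmH2O/L/s and $C$ in mL/cmH2O.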

### Metrics
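**Mean absolute error (MAE)**: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |X_i - Y_i|$
  
where $X$ is the vector of predicted pressures, $Y$ is the vector of actual pressures across all breaths in the test set, and $n$ is the number of rows being scored.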
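
As a minimal sketch of the metric, the snippet below computes the mean absolute error on made-up numbers. The arrays `predicted` and `actual` are illustrative assumptions, not competition data.

```python
import numpy as np

# Hypothetical toy values, not taken from the competition data
predicted = np.array([10.0, 12.0, 15.0, 20.0])  # X: predicted pressures
actual = np.array([11.0, 12.0, 13.0, 24.0])     # Y: measured pressures

# Mean of the element-wise absolute differences
mae = np.mean(np.abs(predicted - actual))
print(mae)
```

`sklearn.metrics.mean_absolute_error(actual, predicted)`, which this notebook imports in the Library section, returns the same value.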

In [None]:
!pip install optuna


### Library

In [None]:
import numpy as np
import pandas as pd
import os
import time
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GroupKFold
from sklearn import metrics 
import lightgbm as lgb
import optuna

In [None]:
path = "../input/ventilator-pressure-prediction/"
os.listdir(path)

In [None]:
train = pd.read_csv(path + 'train.csv')

In [None]:
test = pd.read_csv(path + 'test.csv')

In [None]:
train

The objective (target) variable is **pressure**.

In [None]:
test

In [None]:
submission = pd.read_csv(path + 'sample_submission.csv')

In [None]:
submission

### EDA

#### pressure (objective variable)

In [None]:
# Histogram of pressure
plt.figure(figsize= (10,5))
train['pressure'].hist(bins=50)
print("mean: {}, std: {}".format(train['pressure'].mean(), train['pressure'].std()))
plt.show()

#### Visualize time_step

In [None]:
plt.figure(figsize = (10,5))
sns.histplot(data=train,x='time_step', bins=20)
plt.show()

#### u_in  
(The control input for the inspiratory solenoid valve. Ranges from 0 to 100)

In [None]:
plt.figure(figsize = (10,5))
sns.histplot(data=train,x='u_in', bins=30)
plt.show()

#### u_out  
(The control input for the expiratory solenoid valve. Either 0 or 1)

In [None]:
sns.countplot(x='u_out', data=train)
plt.title('Count of u_out in train')
plt.show()

Check whether the breath_id values in train and test overlap

In [None]:
print(set(test['breath_id'].unique()).intersection(set(train['breath_id'].unique())))
print("breath_id in train: {0},breath_id in test: {1}".format(train['breath_id'].nunique(), test['breath_id'].nunique()))

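The breath_id values in train and test do not overlap, so the model has to generalize to breaths it has never seen. We should therefore guard against overfitting to individual breath_id values in train, for example by keeping all rows of a breath in the same validation fold.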
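
One way to respect this during validation is a grouped split. The following is a minimal sketch on invented data (six breaths of four rows each), assuming scikit-learn's `GroupKFold`; it only checks that no breath ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 6 breaths with 4 rows each
breath_id = np.repeat(np.arange(6), 4)
X = np.arange(len(breath_id)).reshape(-1, 1)
y = np.zeros(len(breath_id))

folds = GroupKFold(n_splits=3)
shared = []
for train_idx, valid_idx in folds.split(X, y, groups=breath_id):
    # Breaths that appear on both sides of the split (should be none)
    overlap = set(breath_id[train_idx]) & set(breath_id[valid_idx])
    shared.append(len(overlap))

print(shared)
```

A plain random row split would not give this guarantee, because rows of one breath could land in both the training and the validation part.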

In [None]:
# Compare the distributions of R and C between train and test
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
sns.countplot(x='R', data=train, ax=axes[0, 0])
axes[0, 0].set_title('Count of R in train')
sns.countplot(x='R', data=test, ax=axes[0, 1])
axes[0, 1].set_title('Count of R in test')
sns.countplot(x='C', data=train, ax=axes[1, 0])
axes[1, 0].set_title('Count of C in train')
sns.countplot(x='C', data=test, ax=axes[1, 1])
axes[1, 1].set_title('Count of C in test')
plt.tight_layout()
plt.show()

#### Check 1 ventilation cycle

In [None]:
ventilation_cycle = train[train['breath_id']==2]
print(f"Number of unique values per column in one breath\n{ventilation_cycle.nunique()}\n")

##### Each breath consistently contains 80 time steps
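
The cell above inspects a single breath. To confirm the row count for every breath at once, the rows can be counted per `breath_id`. This is a minimal sketch on an invented frame (`toy` is a hypothetical stand-in for `train`):

```python
import pandas as pd

# Hypothetical toy frame: 3 breaths with 80 rows each
toy = pd.DataFrame({
    'breath_id': [b for b in (1, 2, 3) for _ in range(80)],
    'time_step': [0.03 * t for _ in range(3) for t in range(80)],
})

# Number of rows recorded for each breath
rows_per_breath = toy.groupby('breath_id').size()
print(rows_per_breath.unique())
```

On the real data, `train.groupby('breath_id').size().unique()` would show whether all breaths share the same length.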

### Time series data (pressure / u_in / u_out)

In [None]:
fig, ax1 = plt.subplots(figsize = (12, 8))

breath_1 = train.loc[train['breath_id'] == 928]
ax2 = ax1.twinx()

ax1.plot(breath_1['time_step'], breath_1['pressure'], 'r-', label='pressure')
ax1.plot(breath_1['time_step'], breath_1['u_in'], 'g-', label='u_in')
ax2.plot(breath_1['time_step'], breath_1['u_out'], 'b-', label='u_out')

ax1.set_xlabel('Timestep')

ax1.legend(loc=(1.1, 0.8))
ax2.legend(loc=(1.1, 0.7))
plt.show()

##### Our target (pressure) rises at first; once u_in becomes 0 and u_out becomes 1 at the same time, the pressure suddenly drops
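
The moment at which the expiratory valve opens can also be located programmatically. The following is a minimal sketch on an invented six-row breath; the values in `breath` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy breath: the valve states are invented for illustration
breath = pd.DataFrame({
    'time_step': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    'u_in':      [5.0, 20.0, 30.0, 0.0, 0.0, 0.0],
    'u_out':     [0, 0, 0, 1, 1, 1],
})

# First row where the expiratory valve is open (u_out switches from 0 to 1)
switch_row = breath.loc[breath['u_out'] == 1].iloc[0]
print(switch_row['time_step'])
```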

### Feature Engineering
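This section adds two per-breath features: a cumulative sum of `u_in` and `u_in` lagged by two time steps. The following minimal sketch shows what both produce on an invented two-breath frame (`toy` is hypothetical); the grouping by `breath_id` keeps values from leaking across breaths.

```python
import pandas as pd

# Hypothetical toy data: two breaths with three rows each
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 2, 2, 2],
    'u_in':      [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Running total of u_in, restarted at the beginning of every breath
toy['u_in_cumsum'] = toy.groupby('breath_id')['u_in'].cumsum()

# u_in from two rows earlier in the same breath; rows without history get 0
toy['u_in_lag'] = toy.groupby('breath_id')['u_in'].shift(2).fillna(0)

print(toy)
```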

In [None]:
# cumulative sum of the u_in feature
train["u_in_cumsum"] = (train['u_in']).groupby(train['breath_id']).cumsum()
test['u_in_cumsum'] = (test['u_in']).groupby(test['breath_id']).cumsum()

In [None]:
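# u_in from two time steps earlier within the same breath;
# the first two rows of each breath have no lag value and are filled with 0
train['u_in_lag'] = train.groupby('breath_id')['u_in'].shift(2)
train = train.fillna(0)
test['u_in_lag'] = test.groupby('breath_id')['u_in'].shift(2)
test = test.fillna(0)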

In [None]:
fig, ax1 = plt.subplots(figsize = (12, 8))

breath_1 = train.loc[train['breath_id'] == 928]
ax2 = ax1.twinx()

ax1.plot(breath_1['time_step'], breath_1['pressure'], 'r-', label='pressure')
ax1.plot(breath_1['time_step'], breath_1['u_in'], 'g-', label='u_in')
ax2.plot(breath_1['time_step'], breath_1['u_in_lag'], 'b-', label='u_in_lag')


ax1.set_xlabel('Timestep')

ax1.legend(loc=(1.1, 0.8))
ax2.legend(loc=(1.1, 0.7))
plt.show()

In [None]:
train

In [None]:
test

### Hyperparameter Tuning

In [None]:
# import pickle

In [None]:
# def objective(trial, data=train, target='pressure'):
#     # Use the same feature columns as in model training
#     features = [col for col in data.columns if col not in ['id', 'breath_id', target]]
#     train_x, valid_x, train_y, valid_y = train_test_split(
#         data[features], data[target], test_size=0.2, random_state=42)
#     param = {
#         "metric": "mae",
#         'random_state': 7014,
#         'n_estimators': 100,
#         'learning_rate' : trial.suggest_categorical('learning_rate', [0.25,0.3,0.35,0.4,]),
#         'max_depth': trial.suggest_categorical('max_depth', [2,3,4,5]),       
#         "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
#         "num_leaves": trial.suggest_int("num_leaves", 2, 256),
#         'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
#         "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
#         "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),      
#     }
#     model = lgb.LGBMRegressor(**param)

#     # Early stopping via callback (fit() no longer accepts early_stopping_rounds in LightGBM 4.0+)
#     model.fit(train_x, train_y, eval_set=[(valid_x, valid_y)],
#               callbacks=[lgb.early_stopping(100, verbose=False)])

#     preds = model.predict(valid_x)

#     # The tuning objective is the validation MAE
#     mae = mean_absolute_error(valid_y, preds)

#     with open("{}.pickle".format(trial.number), "wb") as fout:
#         pickle.dump(model, fout)

#     return mae

In [None]:
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=100)
# print('Number of finished trials:', len(study.trials))
# print('Best trial:', study.best_trial.params)

In [None]:
# Best trial: {'max_depth': 5, 'lambda_l1': 2.7070003149190957e-06, 'num_leaves': 134, 'colsample_bytree': 0.4, 'feature_fraction': 0.7441879399513947, 'min_child_samples': 98

In [None]:
#study.trials_dataframe()

In [None]:
#optuna.visualization.plot_optimization_history(study)

In [None]:
#optuna.visualization.plot_parallel_coordinate(study)
#optuna.visualization.plot_slice(study)
#optuna.visualization.plot_contour(study, params=['num_leaves', 'max_depth', 'subsample'])
#optuna.visualization.plot_param_importances(study)

### Model Training

In [None]:
scores = []
feature_importance = pd.DataFrame()
models = []
columns = [col for col in train.columns if col not in ['id', 'breath_id', 'pressure']]
X = train[columns]
y = train['pressure']

In [None]:
#Best trial: {'learning_rate': 0.4, 'max_depth': 5, 'lambda_l1': 4.492527545624383, 'num_leaves': 103, 'colsample_bytree': 0.3, 'feature_fraction': 0.9988579060918809, 'min_child_samples': 34}

In [None]:
# Note: in LightGBM 'colsample_bytree' is an alias of 'feature_fraction',
# so these two entries set the same parameter.
params = {'objective': 'regression',
          'boosting_type': 'gbdt',
          'metric': 'mae',
          'n_jobs': -1,
          'learning_rate': 0.4,
          'max_depth': 5,
          'lambda_l1': 4.5,
          'num_leaves': 103,
          'colsample_bytree': 0.3,
          'feature_fraction': 1,
          'min_child_samples': 34}

### Split data (Version2)

In [None]:
# Group by breath_id so that all rows of a breath stay in the same fold
folds = GroupKFold(n_splits=5)
for fold_n, (train_index, valid_index) in enumerate(folds.split(train, y, groups=train['breath_id'])):
    print(f'fold {fold_n} started at {time.ctime()}')
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

    model = lgb.LGBMRegressor(**params, n_estimators=100)
    # Early stopping and logging via callbacks (the fit() keyword arguments
    # early_stopping_rounds and verbose were removed in LightGBM 4.0)
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_valid, y_valid)],
              callbacks=[lgb.early_stopping(100), lgb.log_evaluation(1000)])
    score = metrics.mean_absolute_error(y_valid, model.predict(X_valid))

    models.append(model)
    scores.append(score)

In [None]:
print('CV mean score: {0:.4f}, std: {1:.4f}'.format(np.mean(scores), np.std(scores)))

In [None]:
#feature_importance["importance"] /= 5

### Prediction
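The final prediction is the average of the fold models' predictions. The following is a minimal sketch of that averaging step with invented numbers; `fold_predictions` is a hypothetical stand-in for the outputs of `model.predict` across folds.

```python
import numpy as np

# Hypothetical predictions from three fold models for four test rows
fold_predictions = [
    np.array([10.0, 20.0, 30.0, 40.0]),
    np.array([12.0, 18.0, 33.0, 40.0]),
    np.array([14.0, 22.0, 27.0, 40.0]),
]

# Element-wise mean across folds; dividing by len() avoids hard-coding the fold count
averaged = np.sum(fold_predictions, axis=0) / len(fold_predictions)
print(averaged)
```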

In [None]:
test[columns]

In [None]:
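# Average the predictions of all fold models
submission['pressure'] = 0.0  # reset so that re-running the cell does not accumulate
for model in models:
    submission['pressure'] += model.predict(test[columns])
submission['pressure'] /= len(models)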

### Submission

In [None]:
submission

In [None]:
submission.to_csv('submission.csv', index=False)