## <mark style="background: #d9ead3">Motivation</mark>

When a patient has trouble breathing, doctors use a ventilator to pump oxygen into the patient's lungs. However, operating a ventilator requires a clinician, which limits how widely the treatment can be offered. Current simulators are not dynamic: each one is modeled on a single lung setting. In reality, lungs and their attributes vary from patient to patient.
## <mark style="background: #d9ead3">Goal</mark>

In this competition, our goal is to simulate a ventilator connected to a sedated patient's lung. By taking lung attributes and other constraints into account, we should be able to simulate a mechanical ventilator that takes some of the burden off clinicians.

As a result, ventilator treatments may become more widely available to help patients breathe.

# <mark style="background: #FFBF00">Contents</mark>
- [Imports](#import)
- [Explaining the Attributes](#explain)
- [Exploratory Data Analysis](#eda)
- [Data Processing](#process)
- [Model Creation](#model)
- [Prediction](#prediction)
- [Submission](#sub)

# <a name="import"></a><mark style="background: #FFBF00">Imports</mark>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import tqdm.notebook as tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GroupKFold, KFold
from sklearn import metrics
import tensorflow as tf
from tensorflow import keras

plt.rcParams.update({'font.size': 18})
plt.style.use('ggplot')

pd.set_option('display.max_colwidth',None)

Reading the given `train` and `test` files:

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')

print(train.shape, test.shape)

### <a name="explain"></a><mark style="background: #FFBF00">Explaining the attributes</mark>

**The columns**

- id - globally-unique time step identifier across an entire file
- breath_id - globally-unique time step for breaths
- R - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow. 
> (Basically, the diameter of the airway of the lung)

- C - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow. 
> (Basically, how stretchy the lung is, i.e. how thick or thin the balloon's wall is)

- time_step - the actual time stamp.

**The two columns below are control inputs.**

- u_in - the control input for the inspiratory solenoid valve. Ranges from 0 to 100. 
> (Basically, represents the opening state of the inspiratory valve. 0 being completely closed, no air can get in. 100 is when the valve is completely open).

- u_out - the control input for the expiratory solenoid valve. Either 0 or 1.
> (Basically, represents the opening state of the expiratory valve: 1 --> valve open and 0 --> valve closed.)

- pressure - the airway pressure measured in the respiratory circuit, measured in cmH2O.

Here `R` and `C` are categorical features
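
Since `R` and `C` only take a few distinct values, one option is to one-hot encode them. This is just a minimal sketch on a toy frame: the values below are illustrative assumptions, and the rest of this notebook simply feeds `R` and `C` to the model as numeric columns.

```python
import pandas as pd

# Toy frame; the R and C values here are illustrative, not read from train.csv
toy = pd.DataFrame({"R": [5, 20, 50, 20], "C": [10, 50, 20, 10]})

# Treat both lung attributes as categories and expand them into indicator columns
encoded = pd.get_dummies(toy, columns=["R", "C"], prefix=["R", "C"])
print(encoded.columns.tolist())
```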

In [None]:
train.head(2)

In [None]:
test.head(2)

# <a name="eda"></a><mark style="background: #FFBF00">Exploratory Data Analysis</mark>

There are no null values in either file.

In [None]:
train.isnull().sum()
test.isnull().sum()

#### Checking for unique `breath_id`

In [None]:
unique_breath_id = train['breath_id'].nunique()
print('Number of unique breath IDs in train are: ', unique_breath_id)

unique_breath_id_test = test['breath_id'].nunique()
print('Number of unique breath IDs in test are: ', unique_breath_id_test)

#### Checking whether any `breath_id` from `test` is present in `train`

No `breath_id` from `test` appears in `train`.

In [None]:
train_breath_id = [x for x in (np.unique(train['breath_id']))]
test_breath_id = [x for x in (np.unique(test['breath_id']))]

In [None]:
print(len(list(set(test_breath_id) - set(train_breath_id))))

In [None]:
set(test_breath_id).intersection(train_breath_id)
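
The same overlap check can be written more directly with Python sets. A small sketch with made-up ID lists standing in for the two `breath_id` columns:

```python
# Toy ID lists standing in for train['breath_id'] and test['breath_id']
train_ids = [1, 2, 3, 5, 8]
test_ids = [0, 4, 6, 7]

# isdisjoint is True when the two collections share no element
no_overlap = set(test_ids).isdisjoint(train_ids)
# the intersection lists any IDs that appear in both
shared = set(test_ids) & set(train_ids)
print(no_overlap, shared)
```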

#### `pressure` histogram

In [None]:
train.pressure.hist(figsize=(16, 4))

A KDE plot (kernel density estimate plot) shows a smoothed estimate of the probability density function of a continuous variable. It is a non-parametric method, so it does not assume any particular distribution.

KDE plot for `pressure`:

In [None]:
sns.kdeplot(train['pressure'])
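
To get a feel for what a kernel density estimate does, here is a rough sketch with NumPy: it puts a Gaussian bump on every sample and averages them. The fixed `bandwidth` and the toy values are my assumptions (seaborn chooses a bandwidth automatically), so this is not a reproduction of what `sns.kdeplot` draws.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Toy values standing in for a few pressure readings (illustrative only)
toy_pressure = [5.0, 6.0, 6.5, 7.0, 20.0]
grid = np.linspace(-20, 50, 2001)
density = gaussian_kde_1d(toy_pressure, grid, bandwidth=1.5)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```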

In [None]:
sns.kdeplot(train['R'].to_numpy(), color = 'red')
sns.kdeplot(test['R'].to_numpy(), color = 'green')

In [None]:
sns.kdeplot(train['C'].to_numpy(), color = 'red')
sns.kdeplot(test['C'].to_numpy(), color = 'green')

#### Countplot on columns

In [None]:
sns.countplot(x=train['R'])

`R` for `test`

In [None]:
sns.countplot(x=test['R'])

In [None]:
sns.countplot(x=train['u_out'])

`u_out` for `test`

In [None]:
sns.countplot(x=test['u_out'])

The correlation plot below shows that `u_out` and `time_step` are fairly strongly correlated.

In [None]:
corr = train.corr().abs()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(train.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(train.columns)
ax.set_yticklabels(train.columns)
plt.show()

In [None]:
corr

Max `pressure`

In [None]:
train.pressure.max()

### Breath Plot

Let's plot the breath with `breath_id == 1`.

I took help from the kernel [here](https://www.kaggle.com/marutama/eda-about-time-step-and-u-out)

In [None]:
breath_id_1 = train[train['breath_id'] == 1]
breath_id_1.head()

In [None]:
breath_id_1.shape

In [None]:
fig, ax1 = plt.subplots(figsize = (6, 4))
ax2 = ax1.twinx()
ax1.plot(breath_id_1['time_step'], breath_id_1['pressure'], 'm-', label='pressure')
ax1.plot(breath_id_1['time_step'], breath_id_1['u_in'], 'g-', label='u_in')
ax2.plot(breath_id_1['time_step'], breath_id_1['u_out'], 'b-', label='u_out')

ax1.set_xlabel('Timestep')

# .iloc[0] takes the first row by position, whatever the index labels are
R = breath_id_1['R'].iloc[0]
C = breath_id_1['C'].iloc[0]
ax1.set_title(f'breath_id:{1}, R:{R}, C:{C}')

ax1.legend(loc=(1.1, 0.8))
ax2.legend(loc=(1.1, 0.7))
plt.show()

Reference kernel [here](https://www.kaggle.com/ahmedaffan789/google-brain-eda-lstm-bi-directional)

As described above, `u_in` is the opening of the inspiratory valve and `u_out` is the state of the expiratory valve.

So, when `u_out == 0` the expiratory valve is closed and the recorded pressure is the inhale pressure (the patient inhales),

and when `u_out == 1` the expiratory valve is open and the recorded pressure is the exhale pressure (the patient exhales).

In [None]:
sns.lineplot(x = 'id',y='pressure',data=breath_id_1[breath_id_1['u_out']==0],color='green',label='inhale pressure');
sns.lineplot(x = 'id',y='pressure',data=breath_id_1[breath_id_1['u_out']==1],color='orange',label='exhale pressure');
sns.lineplot(x = 'id',y='u_in',data=breath_id_1,color='blue',label='valve pressure')
plt.title(f"Variation of Pressure and Input valve position during breath Id 1");
plt.show()

We can see from the plot above that the pressure is higher while the expiratory valve is closed (inhalation) and drops once the valve opens (exhalation).
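
The inhale/exhale split used in the plot can also be summarised numerically by grouping on `u_out`. This is only a sketch on a made-up six-row breath, not values taken from `train.csv`:

```python
import pandas as pd

# Toy breath; values are made up to mimic one inhale followed by an exhale
toy_breath = pd.DataFrame({
    "time_step": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "u_out":     [0, 0, 0, 1, 1, 1],
    "pressure":  [6.0, 12.0, 18.0, 9.0, 7.0, 6.5],
})

# u_out == 0: expiratory valve closed (inhale); u_out == 1: valve open (exhale)
phase_mean = toy_breath.groupby("u_out")["pressure"].mean()
print(phase_mean)
```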

Time step plot for `breath_id_1`

In [None]:
plt.title(f'breath_id:{1}, Time Step Plot')
plt.ylabel('Timestep')
plt.xlabel('Row No.')
plt.plot(breath_id_1['time_step'])
plt.show()

`time_step` values range from 0 to about 3.

In [None]:
plt.figure(figsize = (10,5))
sns.histplot(data=train,x='time_step', bins=20)
plt.show()

We can see that each breath has 80 `time_step` values.

In [None]:
train.groupby("breath_id")["time_step"].count()
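
Rather than reading the counts by eye, the "same length for every breath" claim can be checked directly. A sketch on a toy frame (3 breaths of 4 rows each, where the real data has 80 rows per breath):

```python
import pandas as pd

# Toy frame with 3 breaths of 4 rows each (the real data has 80 rows per breath)
toy = pd.DataFrame({
    "breath_id": [1] * 4 + [2] * 4 + [3] * 4,
    "time_step": [0.0, 0.03, 0.07, 0.1] * 3,
})

rows_per_breath = toy.groupby("breath_id")["time_step"].count()
# nunique() == 1 means every breath has the same number of rows
same_length = rows_per_breath.nunique() == 1
print(rows_per_breath.unique(), same_length)
```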

Maximum `time_step`: how long does a breath last?

In [None]:
print("For train max time_step: ",train.time_step.max())
print("For test max time_step: ",test.time_step.max())

# <a name="process"></a><mark style="background: #FFBF00">Data Processing</mark>

In [None]:
print(train.nunique().to_frame())
print('------------------------------')
print(test.nunique().to_frame())

In [None]:
train.columns.values

### Feature engineering:

[Really helpful kernel](https://www.kaggle.com/yasuosuzuki/various-feature-and-lightgbm)

In [None]:
def feature_engineering(df):
    # adding feature last_value_u_in
    
    # for each breath, find the row index of the largest time_step (the last step)
    idxmax_time_step = df.groupby('breath_id')['time_step'].idxmax()
    # from those rows, take breath_id and u_in, i.e. the last u_in of each breath
    last_value_u_in = df.loc[idxmax_time_step, ['breath_id','u_in']]
    last_value_u_in.columns = ['breath_id','last_value_u_in']
    df = df.merge(last_value_u_in, on='breath_id')
    
    
    # adding feature mean_value_u_in
    
    mean_u_in = df.groupby('breath_id')['u_in'].mean().to_frame()
    mean_u_in.columns = ['mean_value_u_in']
    df = df.merge(mean_u_in,on='breath_id')
    
    
    # adding feature 'diff_u_in'
    # this is basically, u_in[1] = u_in[1] - u_in[0], u_in[2] = u_in[2] - u_in[1] ...
    df['diff_u_in'] = df.groupby('breath_id')['u_in'].diff()
    df = df.fillna(0)
    
    
    # adding feature 'diff_diff_u_in'
    # i.e. the second difference of u_in: diff_u_in[t] - diff_u_in[t-1]
    df['diff_diff_u_in'] = df.groupby('breath_id')['diff_u_in'].diff()
    df = df.fillna(0)
    
    
    # adding feature cumulative sum 'u_in_cumsum'
    df['u_in_cumsum'] = df.groupby(['breath_id'])['u_in'].cumsum()
    
    
    # adding feature sum_value_u_in
    # sum of all u_in values for a particular breath_id
    sum_u_in = df.groupby('breath_id')['u_in'].sum().to_frame()
    sum_u_in.columns = ['sum_value_u_in']
    df = df.merge(sum_u_in,on='breath_id')
    
    
    # adding feature u_in_cumsum_rate
    df['u_in_cumsum_rate'] = df['u_in_cumsum'] / df['sum_value_u_in']
    
    
    df = df.fillna(0)
    
    
    #adding feature lag_u_in
    df['lag_u_in'] = df.groupby('breath_id')['u_in'].shift(1)
    df = df.fillna(0)
    
    
    #adding feature lag2_u_in
    df['lag2_u_in'] = df.groupby('breath_id')['u_in'].shift(2)
    df = df.fillna(0)
    
    
    #adding feature lag_-1_u_in
    df['lag_-1_u_in'] = df.groupby('breath_id')['u_in'].shift(-1)
    df = df.fillna(0)
    
    
    #adding feature lag_-2_u_in
    df['lag_-2_u_in'] = df.groupby('breath_id')['u_in'].shift(-2)
    df = df.fillna(0)
    
    
    #adding feature lag3_u_in
    df['lag3_u_in'] = df.groupby('breath_id')['u_in'].shift(3)
    df = df.fillna(0)
    
    
    #adding feature lag_-3_u_in
    df['lag_-3_u_in'] = df.groupby('breath_id')['u_in'].shift(-3)
    df = df.fillna(0)
    
    
    #adding feature max_u_in_breath_id
    # maximum u_in value for a particular breath_id
    df['max_u_in_breath_id'] = df.groupby('breath_id')['u_in'].transform('max')
    
    
    #adding feature R*C
    df['R*C'] = df['R'] * df['C']
    
    
    #adding feature min_u_in_breath_id
    # minimum u_in value for a particular breath_id
    df['min_u_in_breath_id'] = df.groupby('breath_id')['u_in'].transform('min')
    
    
    #adding max_u_in_breath_id_diff
    df['max_u_in_breath_id_diff'] = df.groupby('breath_id')['u_in'].transform('max') - df['u_in']
    
    
    #adding mean_u_in_breath_id_diff
    df['mean_u_in_breath_id_diff'] = df.groupby('breath_id')['u_in'].transform('mean') - df['u_in']
    
    
    #adding u_in_partition_out_sum
    df['u_in_partition_out_sum'] = df.groupby(['breath_id', 'u_out'])['u_in'].transform('sum')
    
    
    #adding area
    df['area'] = df['time_step'] * df['u_in']
    df['area'] = df.groupby('breath_id')['area'].cumsum()
    
    return df


Applying the above features to `train` and `test` data

In [None]:
df_train = feature_engineering(train)
df_test = feature_engineering(test)
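
To see what the per-breath `diff`, `shift` and `cumsum` features look like, here is a small sketch on a toy frame with two three-row breaths. The numbers are made up; the point is that every operation restarts at each new `breath_id`.

```python
import pandas as pd

toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "u_in":      [0.0, 2.0, 5.0, 1.0, 1.0, 4.0],
})

g = toy.groupby("breath_id")["u_in"]
# change from the previous step; the first step of each breath has none, so fill 0
toy["diff_u_in"] = g.diff().fillna(0)
# previous step's u_in; again nothing before the first step of a breath
toy["lag_u_in"] = g.shift(1).fillna(0)
# running total of u_in within the breath
toy["u_in_cumsum"] = g.cumsum()
print(toy)
```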

In [None]:
df_train

In [None]:
df_train.shape

In [None]:
df_train.columns

In [None]:
df_test

# <a name="model"></a><mark style="background: #FFBF00">Model Creation</mark>

Defining `X` and `y` for model creation. 

Not including columns like `id`, `breath_id`, `pressure` for `X`

`y` has the `pressure` column only from `df_train`

In [None]:
models = []
columns = [col for col in df_train.columns if col not in ['id', 'breath_id', 'pressure']]
X = df_train[columns]
y = df_train['pressure']

In [None]:
print(X.shape, y.shape)

Creating an empty list to store each fold's predictions.

In [None]:
prediction = []

Defining `X_test` to make prediction on `df_test`

In [None]:
X_test = df_test[columns]

In [None]:
X_test

In [None]:
X_test.shape

Reshaping the `X` and `y` variables

In [None]:
X = np.array(X)
X = X.reshape(-1, 80, X.shape[-1])
y = y.to_numpy().reshape(-1, 80)
X_test = np.array(X_test)
X_test = X_test.reshape(-1, 80, X_test.shape[-1])

print(X.shape, y.shape, X_test.shape)
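
The reshape turns the flat table (one row per time step) into one sequence per breath. A tiny sketch with 2 breaths of 3 steps and 2 features instead of 80 steps; it relies on the rows already being ordered by breath and then by time, which I am assuming holds for the data here.

```python
import numpy as np

n_breaths, steps, n_features = 2, 3, 2  # the notebook uses steps = 80
# flat table: one row per (breath, time step), filled with 0..11 for readability
flat = np.arange(n_breaths * steps * n_features).reshape(-1, n_features)

# group consecutive rows into sequences: (n_breaths, steps, n_features)
seq = flat.reshape(-1, steps, n_features)
print(flat.shape, seq.shape)

# row k of breath b in the flat table ends up at seq[b, k]
first_row_of_second_breath = seq[1, 0]
```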

I have used the below kernels as inspiration to make the model:

- [kernel 1](https://www.kaggle.com/kaitohonda/beginner-lgbm)
- [kernel 2](https://www.kaggle.com/coder247/simple-xgboost-solution-for-beginner-s)


In [None]:

folds = KFold(n_splits=5, shuffle=True, random_state=2021)

for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
    print(f'------------------ Fold {fold_n} ------------------')
    X_train, X_valid =  X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]

    scheduler = tf.keras.optimizers.schedules.ExponentialDecay(1e-3, 200*((len(X_train)*0.8)/1024), 1e-5)
    model = keras.models.Sequential([
        keras.layers.Input(shape=(80, 25)),
        keras.layers.Bidirectional(keras.layers.LSTM(200, return_sequences=True)),
        keras.layers.Bidirectional(keras.layers.LSTM(150, return_sequences=True)),
        keras.layers.Bidirectional(keras.layers.LSTM(100, return_sequences=True)),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")
    model.fit(X_train, y_train,
              validation_data=(X_valid, y_valid),
              epochs = 180,
              batch_size = 256,
              callbacks=[tf.keras.callbacks.LearningRateScheduler(scheduler)])
    model.save(f'Fold {fold_n+1} Weights')


    prediction.append(model.predict(X_test).squeeze().reshape(-1, 1).squeeze())
    models.append(model)
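
For reference, with its default non-staircase setting `ExponentialDecay` follows `lr = initial_lr * decay_rate ** (step / decay_steps)`. The plain-Python sketch below mirrors that formula with illustrative numbers; `exponential_decay` is a hypothetical helper of mine, not a Keras function. As far as I can tell, `LearningRateScheduler` evaluates the schedule once per epoch, so in the loop above `step` is effectively the epoch index.

```python
def exponential_decay(step, initial_lr, decay_steps, decay_rate):
    """Continuous (non-staircase) exponential decay of a learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Illustrative numbers only; decay_steps in the notebook depends on len(X_train)
initial_lr, decay_steps, decay_rate = 1e-3, 100, 1e-5
for step in (0, 50, 100):
    print(step, exponential_decay(step, initial_lr, decay_steps, decay_rate))
```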


In [None]:
model.summary()

# <a name="prediction"></a><mark style="background: #FFBF00">Prediction</mark>

In [None]:
submission = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')
submission.head()

In [None]:
prediction

In [None]:
len(prediction[0])

In [None]:
len(prediction)

To combine the fold predictions, I first compute the element-wise `mean`, `median` and `standard deviation` across the folds. Each fold's prediction is then clipped to within one standard deviation of the mean, and the `clipped data` is averaged to give the final prediction.

[Reference Kernel](https://github.com/angliu-bu/Kaggle-Google-Brain/blob/main/google_brain.ipynb)

In [None]:
mean = np.mean(prediction, axis=0)
med = np.median(prediction, axis=0)
std = np.std(prediction, axis=0)



# clip each fold's predictions to [mean - std, mean + std], then average the clipped values
clipped_pres = np.clip(np.vstack(prediction), mean-std, mean+std)
clipped_mean = np.mean(clipped_pres, axis=0)
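
Here is the same clipping idea on made-up numbers: three "fold" predictions for four rows, where one fold disagrees strongly on the third row. Clipping to mean ± std pulls that outlier in before averaging.

```python
import numpy as np

# Three toy "fold" predictions for four rows (made-up numbers)
preds = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 20.0, 30.0, 40.0],
    [11.0, 20.0, 60.0, 40.0],
])

mean = preds.mean(axis=0)
std = preds.std(axis=0)
# limit every fold's value to one standard deviation around the per-row mean
clipped = np.clip(preds, mean - std, mean + std)
clipped_mean = clipped.mean(axis=0)
print(mean)
print(clipped_mean)
```

On the third row the plain mean is dragged up by the 60.0, while the clipped mean sits closer to the two folds that agree.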

In [None]:
print('clipped mean is: ', clipped_mean)

In [None]:
clipped_pres

In [None]:
prediction_ = clipped_pres.ravel()
prediction_

# <a name="sub"></a><mark style="background: #FFBF00">Submission</mark>

Using `clipped_mean` as the `pressure` prediction:

In [None]:
submission_ = pd.DataFrame({"id":submission["id"],"pressure":clipped_mean})

In [None]:
submission_

In [None]:
submission_.shape

In [None]:
submission_.to_csv('submission.csv', index = False)
print('submission saved -- !')