# Overview:
A [ventilator](https://en.wikipedia.org/wiki/Ventilator) is a machine that provides mechanical ventilation by moving breathable air into and out of the lungs, delivering breaths to a patient who is physically unable to breathe or is breathing insufficiently. A ventilator is needed when a patient suffers [respiratory failure](https://www.nhlbi.nih.gov/health-topics/respiratory-failure). Mechanical ventilators are mainly used in hospitals and in transport settings such as ambulances and MEDEVAC air transport.

![image.png](attachment:4ec2e3e4-f1fd-49c1-995c-d4195b72e2fe.png)

Ventilators have been a key component of Covid-19 treatment. This competition aims to simulate a ventilator connected to a sedated patient's lung. Currently, ventilators are simulated using PID controllers, and it is believed that better performance can be obtained with machine learning. As a **Mechatronics Engineering** student, I find this an exciting task.

# Table of Contents:
* [Let's Know our Data](#Let's-know-our-data)
* [Basic Trainset Info](#Basic-Trainset-info)
* [Basic Testset Info](#Basic-Testset-info)
* [Basic EDA](#EDA)
* [Feature Engineering](#Feature-Engineering)
* [Modeling](#Modeling)

# Let's know our data

The ventilator data used in this competition was produced using a modified open-source ventilator connected to an artificial bellows test lung via a respiratory circuit. 
The first control input is a continuous variable from 0 to 100 representing the percentage by which the inspiratory solenoid valve is open to let air into the lung (i.e., 0 is completely closed and no air is let in, and 100 is completely open). The second control input is a binary variable representing whether the expiratory valve is open (1) or closed (0) to let air out.

In this competition, we are given numerous time series of breaths and need to learn to predict the **airway pressure** in the respiratory circuit during the breath, given the time series of control inputs.

**Evaluation:**
The competition will be scored as the mean absolute error between the predicted and actual pressures during the inspiratory phase of each breath. 
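
The metric described above can be sketched as a masked mean absolute error. This is only a minimal illustration: the helper name `inspiratory_mae` is hypothetical, the numbers are toy values rather than competition data, and it assumes that rows with `u_out == 0` are the ones belonging to the inspiratory phase.

```python
import numpy as np

def inspiratory_mae(y_true, y_pred, u_out):
    """Mean absolute error over rows where u_out == 0.

    Assumption: u_out == 0 marks the inspiratory phase of a breath.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = np.asarray(u_out) == 0
    return float(np.mean(np.abs(y_true[mask] - y_pred[mask])))

# Toy values, not competition data.
pressure_true = [5.0, 10.0, 20.0, 6.0]
pressure_pred = [6.0, 12.0, 17.0, 0.0]
u_out = [0, 0, 0, 1]

# The last row has u_out == 1, so its large error is ignored.
score = inspiratory_mae(pressure_true, pressure_pred, u_out)
print(score)
```

A local validation score computed this way should track the leaderboard metric more closely than an MAE taken over every row.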

### Data Description:
* **id** - globally-unique time step identifier across an entire file
* **breath_id** - globally-unique identifier for each breath
* **R** - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.
* **C** - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.
* **time_step** - the actual time stamp.
* **u_in** - the control input for the inspiratory solenoid valve. Ranges from 0 to 100.
* **u_out** - the control input for the expiratory solenoid valve. Either 0 or 1.
* **pressure** - the airway pressure measured in the respiratory circuit, measured in cmH2O.
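
To build intuition for R and C, the two definitions above can be combined in a simplified linear picture: a resistive term (R times flow) plus an elastic term (volume divided by C). This is only an illustrative assumption, not the model that produced the competition data, and the function name `approx_airway_pressure` and the input values are hypothetical.

```python
def approx_airway_pressure(flow_l_per_s, volume_ml, R, C):
    """Rough pressure estimate in cmH2O under a linear one-compartment assumption."""
    # Resistive term: R (cmH2O per L/s) times flow (L/s).
    resistive = R * flow_l_per_s
    # Elastic term: volume (mL) divided by C (mL per cmH2O).
    elastic = volume_ml / C
    return resistive + elastic

# Same flow and volume, different lung attributes (toy values).
stiff_lung = approx_airway_pressure(0.5, 200.0, R=20, C=10)
compliant_lung = approx_airway_pressure(0.5, 200.0, R=20, C=50)
narrow_airway = approx_airway_pressure(0.5, 200.0, R=50, C=50)

print(stiff_lung, compliant_lung, narrow_airway)
```

In this sketch a lower C or a higher R both raise the pressure needed for the same flow and volume, which matches the straw-and-balloon intuition.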

In [None]:
# import basic libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# datasets
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')

# Basic Trainset info

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
train.shape

In [None]:
# a concise summary of a DataFrame.
train.info()

In [None]:
# Generate descriptive statistics of the dataset
train.describe()

In [None]:
# checking null values
train.isna().sum()

In [None]:
train.nunique().to_frame()

# Basic Testset info

In [None]:
test.head()

In [None]:
test.tail()

In [None]:
test.shape

In [None]:
test.info()

In [None]:
test.describe()

In [None]:
test.isna().sum()

In [None]:
test.nunique().to_frame()

# EDA

In [None]:
# R (how restricted the airway is) counts
fig, axes = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(x='R', data=train, ax=axes[0])
axes[0].set_title('R Counts in train')
sns.countplot(x='R', data=test, ax=axes[1])
axes[1].set_title('R Counts in test');

In [None]:
# C (how compliant the lung is) counts
fig, axes = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(x='C', data=train, ax=axes[0])
axes[0].set_title('C Counts in train')
sns.countplot(x='C', data=test, ax=axes[1])
axes[1].set_title('C Counts in test');

In [None]:
# u_out is the control input for the expiratory solenoid valve
fig, axes = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(x='u_out', data=train, ax=axes[0])
axes[0].set_title('u_out Counts in train')
sns.countplot(x='u_out', data=test, ax=axes[1])
axes[1].set_title('u_out Counts in test');

In [None]:
plt.figure(figsize=(16, 4))
plt.subplot(1,2,1)
plt.hist(train['time_step'], bins=500, color = 'lightblue') 
plt.title('Train time steps')

plt.subplot(1,2,2)
plt.hist(test['time_step'], bins=500, color = 'salmon');
plt.title('Test time steps');

In [None]:
plt.figure(figsize=(16, 4))
plt.subplot(1,2,1)
plt.hist(train['u_in'], bins=100, color = 'lightblue') 
plt.title('Train control input for the inspiratory solenoid valve')

plt.subplot(1,2,2)
plt.hist(test['u_in'], bins=100, color = 'tomato');
plt.title('Test control input for the inspiratory solenoid valve');

In [None]:
# pressure distribution w.r.t u out
fig, ax = plt.subplots(figsize=(14, 7))
ax = sns.distplot(train.loc[train["u_out"] == 0, "pressure"], ax=ax, label="u_out=0", bins=500)
ax = sns.distplot(train.loc[train["u_out"] == 1, "pressure"], ax=ax, label="u_out=1", bins=500)
ax.legend(loc='upper right');

In [None]:
# pressure distribution w.r.t R
fig, ax = plt.subplots(figsize=(14, 7))
ax = sns.distplot(train.loc[train["R"] == 5, "pressure"], ax=ax, label="R=5", bins=500)
ax = sns.distplot(train.loc[train["R"] == 20, "pressure"], ax=ax, label="R=20", bins=500)
ax = sns.distplot(train.loc[train["R"] == 50, "pressure"], ax=ax, label="R=50", bins=500)
ax.legend(loc='upper right');

In [None]:
# pressure distribution w.r.t C
fig, ax = plt.subplots(figsize=(14, 7))
ax = sns.distplot(train.loc[train["C"] == 10, "pressure"], ax=ax, label="C=10", bins=500)
ax = sns.distplot(train.loc[train["C"] == 20, "pressure"], ax=ax, label="C=20", bins=500)
ax = sns.distplot(train.loc[train["C"] == 50, "pressure"], ax=ax, label="C=50", bins=500)
ax.legend(loc='upper right');

# Feature Engineering

In [None]:
# credit: https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models

train['last_value_u_in'] = train.groupby('breath_id')['u_in'].transform('last')
# lags and leads are computed within each breath so values never cross breath boundaries
train['u_in_lag'] = train.groupby('breath_id')['u_in'].shift(1)
train['u_out_lag'] = train.groupby('breath_id')['u_out'].shift(1)
train['u_in_lag_back'] = train.groupby('breath_id')['u_in'].shift(-1)
train['u_out_lag_back'] = train.groupby('breath_id')['u_out'].shift(-1)
# the first lag and the last lead of each breath have no neighbour; fill them with 0
train = train.fillna(0)

# max value of u_in and u_out for each breath
train['breath_id__u_in__max'] = train.groupby(['breath_id'])['u_in'].transform('max')
train['breath_id__u_out__max'] = train.groupby(['breath_id'])['u_out'].transform('max')

# difference between consecutive values within a breath
# (the first row of each breath is compared against the fill value 0)
train['u_in_diff'] = train['u_in'] - train['u_in_lag']
train['u_out_diff'] = train['u_out'] - train['u_out_lag']

# difference between the max / mean value of u_in within the breath and the current value
train['breath_id__u_in__diffmax'] = train.groupby(['breath_id'])['u_in'].transform('max') - train['u_in']
train['breath_id__u_in__diffmean'] = train.groupby(['breath_id'])['u_in'].transform('mean') - train['u_in']

# OHE
train = train.merge(pd.get_dummies(train['R'], prefix='R'), left_index=True, right_index=True).drop(['R'], axis=1)
train = train.merge(pd.get_dummies(train['C'], prefix='C'), left_index=True, right_index=True).drop(['C'], axis=1)

# https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/273974
train['u_in_cumsum'] = train.groupby(['breath_id'])['u_in'].cumsum()

In [None]:
# all the same for the test data
test['last_value_u_in'] = test.groupby('breath_id')['u_in'].transform('last')
test['u_in_lag'] = test.groupby('breath_id')['u_in'].shift(1)
test['u_out_lag'] = test.groupby('breath_id')['u_out'].shift(1)
test['u_in_lag_back'] = test.groupby('breath_id')['u_in'].shift(-1)
test['u_out_lag_back'] = test.groupby('breath_id')['u_out'].shift(-1)
test = test.fillna(0)

test['breath_id__u_in__max'] = test.groupby(['breath_id'])['u_in'].transform('max')
test['breath_id__u_out__max'] = test.groupby(['breath_id'])['u_out'].transform('max')

test['u_in_diff'] = test['u_in'] - test['u_in_lag']
test['u_out_diff'] = test['u_out'] - test['u_out_lag']

test['breath_id__u_in__diffmax'] = test.groupby(['breath_id'])['u_in'].transform('max') - test['u_in']
test['breath_id__u_in__diffmean'] = test.groupby(['breath_id'])['u_in'].transform('mean') - test['u_in']

test = test.merge(pd.get_dummies(test['R'], prefix='R'), left_index=True, right_index=True).drop(['R'], axis=1)
test = test.merge(pd.get_dummies(test['C'], prefix='C'), left_index=True, right_index=True).drop(['C'], axis=1)

test['u_in_cumsum'] = test.groupby(['breath_id'])['u_in'].cumsum()

In [None]:
train.shape, test.shape
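
Lag features are sensitive to where one breath ends and the next begins. The toy example below (made-up values, hypothetical column names `lag_global` and `lag_per_breath`) is a small sketch of the difference between shifting the whole column and shifting inside each `breath_id` group.

```python
import pandas as pd

# Two toy breaths of three time steps each (not competition data).
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "u_in": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Shifting the whole column: the first row of breath 2 inherits
# the last u_in of breath 1.
toy["lag_global"] = toy["u_in"].shift(1).fillna(0)

# Shifting inside each breath: the first row of every breath has no
# predecessor, so it becomes NaN and is filled with 0.
toy["lag_per_breath"] = toy.groupby("breath_id")["u_in"].shift(1).fillna(0)

print(toy)
```

The two columns differ only on the first row of the second breath, which is exactly where values from a different breath would otherwise leak in.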

# Modeling

In [None]:
train.drop(['id', 'breath_id'], axis = 1, inplace = True)

In [None]:
# dependent and independent features
x = train.drop(['pressure'], axis = 1)
y = train['pressure']

In [None]:
#splitting the dataset into train and test set.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 31)

In [None]:
len(x_train), len(x_test), len(y_train), len(y_test)
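
A random row-level split can place time steps of the same breath on both sides of the split, which may make the hold-out score look better than it is when lag features are used. One possible alternative, sketched here on toy data, is to split by `breath_id` instead. The helper `split_by_breath` is hypothetical, and it would have to run before the `breath_id` column is dropped.

```python
import numpy as np
import pandas as pd

def split_by_breath(df, test_fraction=0.25, seed=31):
    """Hold out whole breaths so no breath_id appears in both parts."""
    breath_ids = df["breath_id"].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(breath_ids)
    n_valid = max(1, int(round(len(shuffled) * test_fraction)))
    is_valid = df["breath_id"].isin(shuffled[:n_valid])
    return df[~is_valid], df[is_valid]

# Toy frame: 8 breaths with 3 time steps each (not competition data).
toy_df = pd.DataFrame({
    "breath_id": np.repeat(np.arange(8), 3),
    "u_in": np.arange(24, dtype=float),
})

fit_part, valid_part = split_by_breath(toy_df)
print(len(fit_part), len(valid_part))
```

The default `test_fraction=0.25` mirrors the default hold-out share of `train_test_split`.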

In [None]:
#model evaluation function
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score

def model_evaluate(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    return r2, mae
    

In [None]:
%%time
from lightgbm import LGBMRegressor

# `lgbm` is the fitted model (not the module); later cells call lgbm.predict
lgbm = LGBMRegressor(objective="mae",
                     n_estimators=5000,
                     num_leaves=31,
                     random_state=2021,
                     importance_type="gain",
                     colsample_bytree=0.3,
                     learning_rate=0.5).fit(x_train, y_train)
# cross_val_score refits a clone of the model on each fold; the default score for a regressor is R^2
cv_r2 = cross_val_score(lgbm, x_train, y_train, cv=3)
print(cv_r2)
y_preds = lgbm.predict(x_test)
cv_r2 = np.mean(cv_r2)
print("Cross val score: " + str(cv_r2))
r2, mae = model_evaluate(y_test, y_preds)
print("R^2 score: " + str(r2))
print("Mean Absolute Error: " + str(mae))

# Preparing to predict

In [None]:
test_df = test.drop(['id', 'breath_id'], axis = 1)

In [None]:
test_df.head()

In [None]:
preds = lgbm.predict(test_df)

In [None]:
preds[:10]

In [None]:
sub = pd.DataFrame()
sub['id'] = test.id
sub['pressure'] = preds
sub.head()

In [None]:
sub.to_csv('submission.csv', index=False)

### If you find this helpful, don't forget to upvote.