# Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
# import sklearn
# from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, GroupKFold

# Looking at Sample Submission

In [None]:
sample_submission = pd.read_csv(r"../input/ventilator-pressure-prediction/sample_submission.csv")

In [None]:
sample_submission.head()

In [None]:
sample_submission.shape, sample_submission.id.dtype, sample_submission.pressure.dtype

In [None]:
sample_submission.id.nunique(), sample_submission.pressure.nunique()

# Train CSV EDA

```
Files

train.csv - the training set
test.csv - the test set
sample_submission.csv - a sample submission file in the correct format

Columns

id - globally-unique time step identifier across an entire file
breath_id - globally-unique time step for breaths
R - lung attribute indicating how restricted the airway is (in cmH2O/L/S). Physically, this is the change in pressure per change in flow (air volume per time). Intuitively, one can imagine blowing up a balloon through a straw. We can change R by changing the diameter of the straw, with higher R being harder to blow.
C - lung attribute indicating how compliant the lung is (in mL/cmH2O). Physically, this is the change in volume per change in pressure. Intuitively, one can imagine the same balloon example. We can change C by changing the thickness of the balloon’s latex, with higher C having thinner latex and easier to blow.
time_step - the actual time stamp.
u_in - the control input for the inspiratory solenoid valve. Ranges from 0 to 100.
u_out - the control input for the expiratory solenoid valve. Either 0 or 1.
pressure - the airway pressure measured in the respiratory circuit, measured in cmH2O.
```

In [None]:
train_csv = pd.read_csv(r"../input/ventilator-pressure-prediction/train.csv")

In [None]:
train_csv.head()

In [None]:
train_csv.columns, train_csv.index, train_csv.isna().sum().sum()

## ID and Breath ID

In [None]:
train_csv.shape, train_csv.id.nunique(), train_csv.breath_id.nunique()

In [None]:
(train_csv.breath_id.value_counts().sort_index().values == 80).sum()

Every globally-unique Breath ID has exactly 80 rows.

In [None]:
train_csv.breath_id.value_counts().sort_index().index

There are 75450 unique breath_id values, but the IDs themselves do not run from 1 to 75450: as the index above shows, they end at 125749. Why is there a gap like that?
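One hypothesis worth checking is that the missing IDs are the breaths held out for the test set. Below is a minimal sketch of that check. The helper `missing_breath_ids` is a hypothetical name, and the two small frames are made-up stand-ins for `train_csv` and `test_csv`, so the output says nothing about the real data.

```python
import numpy as np
import pandas as pd


def missing_breath_ids(df: pd.DataFrame) -> np.ndarray:
    """Return the integers in 1..max(breath_id) that never occur in df.breath_id."""
    present = df["breath_id"].unique()
    full_range = np.arange(1, present.max() + 1)
    return np.setdiff1d(full_range, present)


# Toy stand-ins for train_csv and test_csv (made-up values, not the real data).
toy_train = pd.DataFrame({"breath_id": [1, 1, 2, 2, 5, 5]})
toy_test = pd.DataFrame({"breath_id": [3, 3, 4, 4]})

gaps = missing_breath_ids(toy_train)
# Fraction of the gaps that show up as breath IDs in the other frame.
share_in_test = np.isin(gaps, toy_test["breath_id"].unique()).mean()
print(gaps, share_in_test)
```

On the real data, the same two calls could be run with `train_csv` and `test_csv` once the test file has been loaded.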

## Looking at a Single Breath ID

In [None]:
train_csv_breath_id_5 = train_csv[train_csv.breath_id == 5]  # Taking a look at Breath ID 5.
train_csv_breath_id_5

In [None]:
(train_csv_breath_id_5.index == range(320, 400)).sum()

It looks like, within train_csv, all observations for a Breath ID sit in one contiguous block in top-down chronological order. Consequently, the "id" column would be in order too.
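The ordering above was only seen for one breath. A rough way to check it for every breath at once is sketched below. `breaths_are_ordered` is a hypothetical helper, and the toy frame is made up for illustration.

```python
import pandas as pd


def breaths_are_ordered(df: pd.DataFrame) -> bool:
    """True if each breath's rows are contiguous and time_step never decreases within it."""
    # Count runs of identical breath_id going down the file; the first row always starts a run.
    n_runs = int((df["breath_id"] != df["breath_id"].shift()).sum())
    contiguous = n_runs == df["breath_id"].nunique()
    # Differences between consecutive rows of the same breath; first row per breath is NaN.
    deltas = df.groupby("breath_id")["time_step"].diff().dropna()
    chronological = bool((deltas >= 0).all())
    return bool(contiguous and chronological)


# Made-up example: two breaths of three rows each.
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "time_step": [0.00, 0.03, 0.07, 0.00, 0.03, 0.06],
})
ordered = breaths_are_ordered(toy)
reversed_ordered = breaths_are_ordered(toy.iloc[::-1].reset_index(drop=True))
print(ordered, reversed_ordered)
```

Calling `breaths_are_ordered(train_csv)` would apply the same check to the full training file.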

In [None]:
train_csv_breath_id_5[["R", "C"]]

In [None]:
train_csv_breath_id_5.R.value_counts(), train_csv_breath_id_5.C.value_counts()

In [None]:
train_csv_breath_id_5.time_step

What is time step measured in and is this even a useful variable? 
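One way to get a feel for time_step is to look at the gap between consecutive rows of the same breath. The sketch below does this on a made-up toy frame; the column name `dt` is an assumption introduced here, not part of the dataset.

```python
import pandas as pd

# Made-up values standing in for two short breaths.
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "time_step": [0.00, 0.03, 0.07, 0.00, 0.04, 0.07],
})

# Gap between consecutive rows of the same breath; the first row of each breath is NaN.
toy["dt"] = toy.groupby("breath_id")["time_step"].diff()
dt_summary = toy["dt"].describe()
print(dt_summary)
```

If the gaps turn out to be irregular on the real data, time_step would carry information beyond the row position, which would be one argument for keeping it.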

In [None]:
train_csv_breath_id_5.u_in.values

In [None]:
train_csv_breath_id_5.u_out.values

These two are clearly related. u_out is 0 while u_in is open and gradually closing; once u_in reaches 0, u_out switches to 1. u_out then stays at 1 both while u_in is 0 and after u_in opens again. This pattern (u_out closed while u_in falls, then u_out open while u_in gradually rises to a plateau) could be worth taking into account.

In [None]:
x = range(80)
y1 = train_csv_breath_id_5.u_in
y2 = train_csv_breath_id_5.u_out

plt.xlabel("Time")
plt.ylabel("u_in/u_out range")
plt.plot(x, y1, label="u_in")
plt.plot(x, y2, label="u_out")
plt.legend()

Notice how the orange line (u_out) goes to 1 right when "u_in" goes to 0, meaning the inspiratory solenoid valve is closed. The expiratory solenoid valve then stays open for the rest of the breath, including when the inspiratory solenoid valve is opened again (to only roughly 5).
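The switch point read off the plot can also be located numerically. This is a small sketch with a hypothetical helper, `u_out_switch`, run on made-up values rather than on the real breath.

```python
import pandas as pd


def u_out_switch(breath: pd.DataFrame) -> dict:
    """Find the first row where u_out is 1 and report u_in just before and at that row."""
    values_out = breath["u_out"].to_numpy()
    values_in = breath["u_in"].to_numpy()
    opened = values_out == 1
    if not opened.any():
        return {"position": None, "u_in_before": None, "u_in_at": None}
    pos = int(opened.argmax())  # argmax on a boolean array gives the first True
    before = float(values_in[pos - 1]) if pos > 0 else None
    return {"position": pos, "u_in_before": before, "u_in_at": float(values_in[pos])}


# Made-up breath: u_in falls to 0, u_out opens, u_in reopens slightly.
toy_breath = pd.DataFrame({
    "u_in": [10.0, 25.0, 12.0, 0.0, 0.0, 4.8],
    "u_out": [0, 0, 0, 1, 1, 1],
})
switch = u_out_switch(toy_breath)
print(switch)
```

Passing `train_csv_breath_id_5` instead of the toy frame would give the position within that breath (0 to 79) at which u_out opens.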

In [None]:
plt.xlabel("Time")
plt.ylabel("pressure/u_in/u_out")
plt.plot(x, train_csv_breath_id_5.pressure.values, label="pressure")
plt.plot(x, y1, label="u_in")
plt.plot(x, y2, label="u_out")
plt.legend()

One interesting thing here is that the pressure drops during the steepest drop in u_in and slightly after u_out opens. The pressure then oscillates a little and begins to plateau, much like the u_in value.

These findings were checked against Breath ID 10, and the general pattern seems to hold. u_in rises steeply, curves downward slowly at first, then plummets. It stays flat near the bottom for a bit before gradually curving back up (concave down) and plateauing. u_out switches from 0 to 1 when u_in plummets to 0 (the closing of the inspiratory solenoid valve) and stays open for the rest of the breath (is it a breath? I'm still a bit unsure!). Pressure climbs rapidly and steeply, then oscillates (or curves downward) before dropping nearly vertically at around the opening of u_out (which is also the closing of u_in). It then oscillates for a bit and stabilizes at a plateau.

# Moving back to Exploring the Entire Train CSV

## Looking at R and C Values

In [None]:
display(train_csv.R.value_counts())
print("")
display(train_csv.C.value_counts())

## Sanity Checking a Few Things

In [None]:
# for idx, group in train_csv.groupby("breath_id"):
#     assert group.R.nunique() == 1
#     assert group.C.nunique() == 1
#     assert np.all(group[["R", "C"]].values == group[["R","C"]].iloc[0].values)
#     assert group.time_step.max() < 3.0
#     assert group.time_step.min() == 0

# All true.

These assertions verify the following for every Breath ID:

1. R takes exactly one value, and so does C.
2. Every row carries the same (R, C) pair as the first row of the breath.
3. The max time step is always below 3 and the min time step is always 0.
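Looping over every breath in Python is slow on a file this size. The same checks can be sketched with a single grouped aggregation. `check_breath_invariants` is a hypothetical helper, and the toy frame holds made-up values.

```python
import pandas as pd


def check_breath_invariants(df: pd.DataFrame) -> pd.Series:
    """Grouped version of the per-breath checks; each entry is True if it holds for all breaths."""
    per_breath = df.groupby("breath_id").agg(
        n_R=("R", "nunique"),
        n_C=("C", "nunique"),
        t_min=("time_step", "min"),
        t_max=("time_step", "max"),
    )
    return pd.Series({
        "single_R": bool((per_breath["n_R"] == 1).all()),
        "single_C": bool((per_breath["n_C"] == 1).all()),
        "starts_at_zero": bool((per_breath["t_min"] == 0).all()),
        "ends_before_3": bool((per_breath["t_max"] < 3.0).all()),
    })


# Made-up example with two breaths.
toy = pd.DataFrame({
    "breath_id": [1, 1, 2, 2],
    "R": [20, 20, 50, 50],
    "C": [10, 10, 20, 20],
    "time_step": [0.0, 2.9, 0.0, 2.7],
})
checks = check_breath_invariants(toy)
print(checks)
```

Because R and C each take a single value per breath, the (R, C) pair is constant within a breath as well, so the second condition follows from the first.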

In [None]:
train_csv.time_step.min(), train_csv.time_step.max()

Some Breath IDs do approach 3 seconds (slight variations on a per Breath ID basis). 

In [None]:
train_csv.u_in.min(), train_csv.u_in.max()

u_in does reach 100.0 at times.

In [None]:
train_csv.info()

R and C should be categorical variables, not int64.
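A minimal sketch of that conversion, shown on a made-up toy frame rather than on train_csv itself:

```python
import pandas as pd

# Made-up values standing in for the R and C columns.
toy = pd.DataFrame({"R": [5, 20, 50, 20], "C": [10, 50, 20, 10]})

# Convert each column from int64 to pandas' categorical dtype.
for col in ["R", "C"]:
    toy[col] = toy[col].astype("category")

r_levels = list(toy["R"].cat.categories)
print(toy.dtypes)
print(r_levels)
```

The same loop could be applied to `train_csv`. Whether a downstream model prefers the categorical dtype, one-hot columns, or the raw integers is a separate choice.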

In [None]:
train_csv.describe()

In [None]:
train_csv.u_in.hist()

The distribution is right-skewed, with most u_in values between 0 and 10.

In [None]:
train_csv.u_out.hist()

In [None]:
pd.plotting.scatter_matrix(train_csv.drop(columns=["id", "breath_id", "R", "C"])[:320])

In [None]:
train_csv.pressure.hist()

Most of the pressure values are around 5-10 and it's skewed right. There are some cases where the pressure reaches 50. 

# Looking at Test CSV

In [None]:
test_csv = pd.read_csv(r"../input/ventilator-pressure-prediction/test.csv")

In [None]:
test_csv