# Simple Linear Regression between *u_in* and *pressure*

Vladimir Simões da Luz Junior

[LinkedIn](https://www.linkedin.com/in/vladimir-simoes-da-luz-junior/)

[GitHub](https://www.linkedin.com/in/vladimir-simoes-da-luz-junior/)

This is a baseline solution meant to help understand the RC relationship between "*u_in*" and the target vector "*pressure*". It consists of a simple EDA followed by a simple linear regression fit. Each individual breath is used as one model input, so each input/output signal is an array of 80 values.
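
To make the "one breath = one sample" idea concrete, here is a minimal sketch on synthetic data. The arrays `u_in_flat` and `pressure_flat` and the linear relation between them are made up for illustration; only the breath length of 80 comes from the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

BREATH_LENGTH = 80  # samples per breath, as stated above
rng = np.random.default_rng(0)

# Synthetic stand-ins for the flat `u_in` and `pressure` columns (200 breaths).
n_breaths = 200
u_in_flat = rng.random(n_breaths * BREATH_LENGTH)
pressure_flat = 2.0 * u_in_flat + 1.0  # made-up linear relation

# One row per breath: shape (n_breaths, 80).
X_demo = u_in_flat.reshape(-1, BREATH_LENGTH)
y_demo = pressure_flat.reshape(-1, BREATH_LENGTH)

# Multi-output regression: each of the 80 outputs is a linear
# combination of all 80 inputs plus an intercept.
model = LinearRegression().fit(X_demo, y_demo)
pred_demo = model.predict(X_demo)
print(X_demo.shape, pred_demo.shape, model.coef_.shape)
```

With this layout the fitted coefficient matrix has one row per output time step and one column per input time step.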


## Import libraries

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
 

## Load training data and EDA

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
print(train.shape)
train.head()

In [None]:
train.describe()

In [None]:
# Count missing values per column (columns.isnull() only inspects the column names)
train.isnull().sum()

In [None]:
train.time_step.value_counts()

In [None]:
unique_breaths = train['breath_id'].unique()
num_breaths = len(unique_breaths)
print(num_breaths)

In [None]:
train['breath_id'][:500].plot();

In [None]:
breath_lengths = train[['id','breath_id']].groupby('breath_id').count()['id']
breath_lengths.unique()

In [None]:
# Each breath consists of 80 values
BREATH_LENGTH = breath_lengths.unique()[0]

## R and C
R and C values are constant within each breath (their standard deviation is zero).

In [None]:
r_c_std_in_breaths = train[['breath_id','R','C']].groupby('breath_id').std()
print(r_c_std_in_breaths['R'].unique())
print(r_c_std_in_breaths['C'].unique())
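
An alternative way to run the same check is to count distinct values per breath instead of computing a standard deviation. The sketch below uses a toy frame with made-up values; only the column names match `train.csv`.

```python
import pandas as pd

# Toy frame with the same column names as train.csv (values are made up).
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 2, 2, 2],
    'R': [20, 20, 20, 50, 50, 50],
    'C': [50, 50, 50, 10, 10, 10],
})

# Number of distinct R and C values inside each breath; 1 means constant.
distinct_per_breath = toy.groupby('breath_id')[['R', 'C']].nunique()
is_constant = bool((distinct_per_breath == 1).all().all())
print(distinct_per_breath)
print('R and C constant within every breath:', is_constant)
```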

R has only three distinct values:

In [None]:
r_values = train[['breath_id', 'R']].groupby('breath_id').mean()['R']
print(r_values)
print()
print('Unique values:')
print(r_values.value_counts())

r_unique = np.sort(r_values.unique()).astype(int)

So does C:

In [None]:
c_values = train[['breath_id', 'C']].groupby('breath_id').mean()['C']
print(c_values)
print()
print('Unique values:')
print(c_values.value_counts())

c_unique = np.sort(c_values.unique()).astype(int)

The number of breaths varies by about a factor of two across the various R/C combinations.

For R = 20, C = 50 occurs most often; for R = 5 and R = 50, C = 10 occurs most often.



In [None]:
rc_values = np.array([
    [r, c, len(train[(train['R'] == r) & (train['C'] == c)])//BREATH_LENGTH] 
    for r in r_unique 
    for c in c_unique
])

x = range(len(rc_values))
plt.bar(x, rc_values[:,2])
plt.xticks(x, [str(r) + '_' + str(c) for r, c in rc_values[:,:2] ])
plt.xlabel('R_C')
plt.ylabel('Number counts')
plt.show()
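
The same breath counts can also be laid out as an R-by-C table, which may be easier to read than the bar labels. This is a sketch on a toy frame with made-up values; the column names are the only thing taken from the data.

```python
import pandas as pd

# Toy frame: three breaths of two rows each (values are made up).
toy = pd.DataFrame({
    'breath_id': [1, 1, 2, 2, 3, 3],
    'R': [20, 20, 20, 20, 5, 5],
    'C': [50, 50, 50, 50, 10, 10],
})

# Keep one row per breath, then count breaths for each (R, C) pair.
per_breath = toy.drop_duplicates('breath_id')
rc_counts = per_breath.groupby(['R', 'C']).size().unstack(fill_value=0)
print(rc_counts)
```

Rows are R values and columns are C values; combinations that never occur show up as 0.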

### Time steps in individual breaths
Take a look at the time sampling for the first two breaths. The sampling looks fairly uniform in time.

In [None]:
first_breath  = train[train['breath_id'] == 1]
second_breath = train[train['breath_id'] == 2]

x = range(BREATH_LENGTH)
t1 = first_breath['time_step']
t2 = second_breath['time_step']
plt.plot(x, t1)
plt.plot(x, t2, ls = '--')

One time step seems to correspond to about `breath_timestep`, which is computed in the next cell.

In [None]:
breath_timestep = (max(t1) - min(t1)) / (BREATH_LENGTH - 1)  # 80 samples span 79 intervals
print(breath_timestep)
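
The plot only suggests uniform sampling. A numeric check could look at the differences between consecutive time stamps. The sketch below uses hypothetical time stamps (`t_demo`, about 0.033 apart with small jitter), not values from the data.

```python
import numpy as np

# Hypothetical time stamps for one breath: 80 samples, roughly 0.033 apart
# with a little jitter (made-up values, not taken from the data).
rng = np.random.default_rng(0)
t_demo = np.cumsum(np.r_[0.0, 0.033 + rng.normal(0, 0.0005, size=79)])

steps = np.diff(t_demo)  # 79 intervals between 80 samples
mean_step = steps.mean()
relative_spread = steps.std() / mean_step
print(f'mean step: {mean_step:.4f}, relative spread: {relative_spread:.2%}')
```

A small relative spread would support treating the sampling as uniform; applied to the real data, `t_demo` would be replaced by the `time_step` values of one breath.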

## What about the target vector *pressure*?

In [None]:
plt.plot(train.pressure[:1000])

It seems to have a strong relation with the *u_in* time series.

In [None]:
plt.plot(train.pressure[:80])
plt.plot(train.u_in[:80])
plt.show()

From this we hypothesized that it would be possible to fit a linear regression between u_in and pressure.
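
One way to put a number on this visual impression is the Pearson correlation between the two signals within a breath. The sketch below runs on synthetic signals (`u_in_demo`, `pressure_demo`) with a made-up linear relation, so the value it prints says nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic breath: pressure is a noisy linear function of u_in (made up).
u_in_demo = rng.random(80)
pressure_demo = 3.0 * u_in_demo + 5.0 + rng.normal(0, 0.1, size=80)

# Pearson correlation coefficient; values near +1 or -1 indicate a
# strong linear relation, values near 0 a weak one.
r = np.corrcoef(u_in_demo, pressure_demo)[0, 1]
print(f'correlation: {r:.3f}')
```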

## Select input feature and target vector

In [None]:
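# Copy the columns into float arrays so the in-place arithmetic leaves `train` untouched
X = train['u_in'].to_numpy(dtype=float, copy=True)
train_y = train['pressure'].to_numpy(dtype=float, copy=True)

# Scale u_in as (x - min) / max, with both statistics taken before shifting
minimo = np.min(X)
maximo = np.max(X)
X -= minimo
X /= maximo

# Same scaling for pressure; it is inverted later as y * max + min
minimo = np.min(train_y)
maximo = np.max(train_y)
train_y -= minimo
train_y /= maximo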




In [None]:
n = 80
X = [X[i:i + n] for i in range(0, len(X), n)]
train_y = [train_y[i:i + n] for i in range(0, len(train_y), n)]

## Linear Regression model training


In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X, train_y)
train_predictions = lin_reg.predict(X)
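
The notebook does not report an error figure for the fit. One simple option is the mean absolute error over all breaths and time steps, sketched below with a hypothetical helper `mean_absolute_error_2d` and made-up numbers.

```python
import numpy as np

def mean_absolute_error_2d(y_true, y_pred):
    """Mean absolute error over every breath and every time step."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Tiny made-up example: 2 "breaths" of 2 samples each.
y_true_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred_demo = np.array([[1.5, 2.0], [2.0, 4.0]])
mae = mean_absolute_error_2d(y_true_demo, y_pred_demo)
print(mae)
```

Applied to `train_y` and `train_predictions`, the result would be in normalized pressure units, since both arrays are scaled at this point.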

In [None]:
plt.plot(train_predictions[1])

## Predict over test set

In [None]:
test = pd.read_csv("../input/ventilator-pressure-prediction/test.csv")

In [None]:
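X_test = test['u_in'].to_numpy(dtype=float, copy=True)

# Reuse the training-set statistics so the test inputs are scaled like the training inputs
tst_minimo = np.min(train['u_in'])
tst_maximo = np.max(train['u_in'])
X_test -= tst_minimo
X_test /= tst_maximo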



In [None]:
n = 80
X_test = [X_test[i:i + n] for i in range(0, len(X_test), n)]

In [None]:
test_predictions = lin_reg.predict(X_test)

In [None]:
plt.plot(test_predictions[0])

Denormalize the output using the training-set pressure statistics:

In [None]:
minimo = np.min(train.pressure.tolist())
maximo = np.max(train.pressure.tolist())
test_predictions = test_predictions * maximo + minimo
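
The scaling used in this notebook is `(y - min) / max`, so its inverse is `y * max + min`. The sketch below checks that round trip on made-up pressure values; `normalize` and `denormalize` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np

def normalize(values, minimo, maximo):
    """Shift by the minimum, then divide by the (unshifted) maximum."""
    return (values - minimo) / maximo

def denormalize(values, minimo, maximo):
    """Inverse of normalize: multiply by the maximum, then add the minimum."""
    return values * maximo + minimo

# Made-up pressure values for illustration.
pressure_demo = np.array([-1.9, 5.0, 20.0, 64.8])
minimo, maximo = pressure_demo.min(), pressure_demo.max()

scaled = normalize(pressure_demo, minimo, maximo)
restored = denormalize(scaled, minimo, maximo)
print(scaled)
print(restored)
```

Because the divisor is the maximum rather than the range, the scaled values slightly exceed 1 when the minimum is negative. The round trip is still exact as long as the same two statistics are used in both directions.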

In [None]:
plt.plot(test_predictions[0])

In [None]:
test_predictions.shape

In [None]:
test_predictions[0]

In [None]:
test_predictions = np.array(test_predictions)
samples, _ = test_predictions.shape

# Concatenate the per-breath predictions in breath order.
# (Indexing with signal-1 would start at the last breath and shift every prediction by one breath.)
pressure = []
for signal in range(samples):
    pressure.extend(test_predictions[signal])
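
Assuming the submission rows follow the same order as the test rows, the per-breath predictions have to be concatenated breath by breath. A row-major flatten of the prediction array gives that ordering, as this small sketch with made-up numbers shows.

```python
import numpy as np

# 3 "breaths" of 2 samples each; values encode their original row position.
preds_demo = np.arange(6).reshape(3, 2)

# Row-major flatten: breath 0 first, then breath 1, then breath 2.
flat = preds_demo.reshape(-1).tolist()
print(flat)
```

The flattened list has `n_breaths * breath_length` entries, which should equal the number of rows in the submission file.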

In [None]:
len(pressure)

## Assigning predicted values to submission csv

In [None]:
sub = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')

In [None]:
sub.pressure = pressure

In [None]:
sub.head()

In [None]:
sub.to_csv('submission_regression.csv', index=False)