# Consolidated Data Preparation:

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Dictionary:" data-toc-modified-id="Data-Dictionary:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Dictionary:</a></span></li><li><span><a href="#Import-Modules:" data-toc-modified-id="Import-Modules:-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Modules:</a></span></li><li><span><a href="#Import-the-Data:" data-toc-modified-id="Import-the-Data:-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Import the Data:</a></span></li><li><span><a href="#Compute-EDA-Relevent-Quantities:" data-toc-modified-id="Compute-EDA-Relevent-Quantities:-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Compute EDA Relevant Quantities:</a></span><ul class="toc-item"><li><span><a href="#Set-up-New-Data-Frame:" data-toc-modified-id="Set-up-New-Data-Frame:-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Set up New Data Frame:</a></span></li><li><span><a href="#Compute-the-Change-in-Number-of-Open-Ion-Channels:" data-toc-modified-id="Compute-the-Change-in-Number-of-Open-Ion-Channels:-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Compute the Change in Number of Open Ion Channels:</a></span></li><li><span><a href="#Compute-the-Lenght-of-Time-Between-a-Change-in-the-Number-of-Ion-Channels-Open:" data-toc-modified-id="Compute-the-Lenght-of-Time-Between-a-Change-in-the-Number-of-Ion-Channels-Open:-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Compute the Length of Time Between a Change in the Number of Ion Channels Open:</a></span></li><li><span><a href="#Save-New-Data-Frame:" data-toc-modified-id="Save-New-Data-Frame:-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Save New Data Frame:</a></span></li></ul></li><li><span><a href="#Compute-Data-for-Model:" data-toc-modified-id="Compute-Data-for-Model:-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Compute Data for Model:</a></span><ul class="toc-item"><li><span><a href="#Set-up-New-Data-Frames:" 
data-toc-modified-id="Set-up-New-Data-Frames:-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Set up New Data Frames:</a></span></li><li><span><a href="#Compute-Exponential-Weighted-Average:" data-toc-modified-id="Compute-Exponential-Weighted-Average:-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Compute Exponential Weighted Average:</a></span></li><li><span><a href="#Compute-the-Gradient-of-the-Signal:" data-toc-modified-id="Compute-the-Gradient-of-the-Signal:-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Compute the Gradient of the Signal:</a></span></li><li><span><a href="#Save-New-Data-Frame:" data-toc-modified-id="Save-New-Data-Frame:-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Save New Data Frame:</a></span></li></ul></li></ul></div>

### Data Dictionary:

- "time" : Time elapsed since the start of the recording, in seconds.
- "signal" : The small electrical current measured as ion channels open and let charged ions pass.
- "open_channels" : The number of ion channels open at that time (present only in the training data).

NB: The data were recorded in discrete 50-second batches sampled at 10 kHz (500,000 rows per batch).
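
Because the batches are discrete, it can help to label each row with the batch it belongs to. The sketch below is one way to do that, assuming the rows are in time order and every batch has exactly 500,000 contiguous rows; the `batch` column and the `add_batch_index` helper are hypothetical names, not part of the original data or notebook.

```python
import numpy as np
import pandas as pd

ROWS_PER_BATCH = 500_000  # 50 s at 10 kHz, as stated in the note above


def add_batch_index(df: pd.DataFrame, rows_per_batch: int = ROWS_PER_BATCH) -> pd.DataFrame:
    """Return a copy of df with a hypothetical 'batch' column (0, 1, 2, ...).

    Assumes rows are ordered in time and batches are contiguous.
    """
    out = df.copy()
    out['batch'] = np.arange(len(out)) // rows_per_batch
    return out


# Tiny stand-in frame: 6 rows split into batches of 3 rows for illustration.
demo = pd.DataFrame({'time': np.arange(1, 7) * 1e-4, 'signal': np.zeros(6)})
demo_batched = add_batch_index(demo, rows_per_batch=3)
print(demo_batched['batch'].tolist())
```

A batch label of this kind makes it possible to compute per-batch quantities with `groupby('batch')` rather than across batch boundaries.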

### Import Modules:

In [1]:
import pandas as pd
import numpy as np

### Import the Data:

In [2]:
# Filepaths / names:
train_file = '../Data/train'
test_file = '../Data/test'
clean_train_file = '../Data/train_clean'
clean_test_file = '../Data/test_clean'

In [3]:
train_df = pd.read_csv(f'{train_file}.csv')
train_df.head(2)

Unnamed: 0,time,signal,open_channels
0,0.0001,-2.76,0
1,0.0002,-2.8557,0


In [4]:
test_df = pd.read_csv(f'{test_file}.csv')
test_df.head(2)

Unnamed: 0,time,signal
0,500.0001,-2.6498
1,500.0002,-2.8494


In [5]:
train_clean_df = pd.read_csv(f'{clean_train_file}.csv')
train_clean_df.head(2)

Unnamed: 0,time,signal,open_channels
0,0.0001,-2.76,0
1,0.0002,-2.8557,0


In [6]:
test_clean_df = pd.read_csv(f'{clean_test_file}.csv')
test_clean_df.head(2)

Unnamed: 0,time,signal
0,500.0001,-2.649831
1,500.0002,-2.849463


### Compute EDA Relevant Quantities:

#### Set up New Data Frame:

In [7]:
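# NB: plain assignment does not copy. df_eda and train_df are the same object,
# so columns added to df_eda below also appear in train_df.
df_eda = train_df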
df_eda.head(2)

Unnamed: 0,time,signal,open_channels
0,0.0001,-2.76,0
1,0.0002,-2.8557,0


#### Compute the Change in Number of Open Ion Channels:

In [8]:
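# Change in open_channels (column 2) relative to the previous row.
# The first row has no predecessor, so it keeps its own value.
delta_n = []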
length = df_eda.shape[0]

for i in range(length):
    if i == 0:
        delta_n.append(df_eda.iloc[i,2])
    else:
        delta = df_eda.iloc[i,2] - df_eda.iloc[i-1,2]
        delta_n.append(delta)

df_eda['diff_open_channels'] = delta_n

df_eda['diff_open_channels'].value_counts()

 0    3643337
 1     579401
-1     579059
-2      87172
 2      87141
-3      10847
 3      10838
-4       1050
 4        978
 5         84
-5         83
 6          6
-6          2
 8          1
 7          1
Name: diff_open_channels, dtype: int64
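
The row-by-row loop above is slow on millions of rows. A vectorized equivalent can be sketched with `Series.diff`, assuming the column contains no missing values; `diff_open_channels` is a hypothetical helper name. Like the loop, it keeps the first value as-is and does not treat batch boundaries specially.

```python
import pandas as pd


def diff_open_channels(open_channels: pd.Series) -> pd.Series:
    """Difference to the previous row; the first row keeps its own value.

    Assumes the input has no NaNs, so the only NaN produced by diff()
    is the one in the first position.
    """
    first = open_channels.iloc[0]
    return open_channels.diff().fillna(first).astype('int64')


# Small stand-in for df_eda['open_channels'].
demo = pd.Series([0, 0, 1, 3, 2, 2], name='open_channels')
demo_diff = diff_open_channels(demo)
print(demo_diff.tolist())
```

On the real frame this would be used as `df_eda['diff_open_channels'] = diff_open_channels(df_eda['open_channels'])`.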

#### Compute the Length of Time Between a Change in the Number of Ion Channels Open:

In [9]:
# For each row, count how many consecutive preceding rows had the same
# open_channels value (0 on the first row of each run).
length_same = []
df_eda_length = df_eda.shape[0]

for i in range(df_eda_length):
    if i > 0 and df_eda.iloc[i,2] == df_eda.iloc[i-1,2]:
        count = length_same[i-1] + 1   # the run continues
    else:
        count = 0                      # first row, or the value changed
    length_same.append(count)

df_eda['len_same_open_channels'] = length_same

df_eda['len_same_open_channels'].value_counts()

0        1356664
1         693638
2         387894
3         234419
4         152022
          ...   
22910          1
23876          1
22909          1
22908          1
22778          1
Name: len_same_open_channels, Length: 24024, dtype: int64
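
The same run-length counter can be sketched without a Python loop by labelling each run of identical values and counting positions within it. This is a minimal sketch, not the notebook's original method; `len_same` is a hypothetical helper name.

```python
import pandas as pd


def len_same(open_channels: pd.Series) -> pd.Series:
    """Position of each row within its run of identical consecutive values."""
    # True wherever the value differs from the previous row (always True on row 0).
    changed = open_channels.ne(open_channels.shift())
    # Each run of identical values gets its own integer label.
    run_id = changed.cumsum()
    # cumcount() numbers the rows inside each run starting at 0.
    return open_channels.groupby(run_id).cumcount()


# Small stand-in for df_eda['open_channels'].
demo = pd.Series([0, 0, 0, 1, 1, 0])
demo_len = len_same(demo)
print(demo_len.tolist())
```

The counts are in samples; at 10 kHz, multiplying by 0.0001 converts them to seconds.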

#### Save New Data Frame:

In [10]:
df_eda.to_csv(f'{train_file}_eda.csv')

### Compute Data for Model:

#### Set up New Data Frames:

In [11]:
df_data_train = train_df
df_data_train.head(2)

Unnamed: 0,time,signal,open_channels,diff_open_channels,len_same_open_channels
0,0.0001,-2.76,0,0,0
1,0.0002,-2.8557,0,0,1
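
The output above already contains `diff_open_channels` and `len_same_open_channels` because `df_eda`, `train_df` and `df_data_train` are three names for one DataFrame. The sketch below illustrates the difference between plain assignment and `.copy()` on a toy frame; the variable names are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({'signal': [-2.76, -2.8557]})

alias = original              # second name for the same object
alias['new_col'] = 0          # also visible through `original`

independent = original.copy()   # separate object with its own data
independent['other_col'] = 1    # does not touch `original`

print(alias is original)
print(list(original.columns))
print(list(independent.columns))
```

If the EDA columns are not meant to end up in the model data, `train_df.copy()` in the set-up cells would keep the frames separate.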


In [12]:
df_data_test = test_df
df_data_test.head(2)

Unnamed: 0,time,signal
0,500.0001,-2.6498
1,500.0002,-2.8494


In [13]:
df_data_train_clean = train_clean_df
df_data_train_clean.head(2)

Unnamed: 0,time,signal,open_channels
0,0.0001,-2.76,0
1,0.0002,-2.8557,0


In [14]:
df_data_test_clean = test_clean_df
df_data_test_clean.head(2)

Unnamed: 0,time,signal
0,500.0001,-2.649831
1,500.0002,-2.849463


#### Compute Exponential Weighted Average:

In [15]:
df_data_train['ewm_signal'] = df_data_train['signal'].ewm(alpha=0.5).mean()
df_data_train.head(2)

Unnamed: 0,time,signal,open_channels,diff_open_channels,len_same_open_channels,ewm_signal
0,0.0001,-2.76,0,0,0,-2.76
1,0.0002,-2.8557,0,0,1,-2.8238


In [16]:
df_data_test['ewm_signal'] = df_data_test['signal'].ewm(alpha=0.5).mean()
df_data_test.head(2)

Unnamed: 0,time,signal,ewm_signal
0,500.0001,-2.6498,-2.6498
1,500.0002,-2.8494,-2.782867


In [17]:
df_data_train_clean['ewm_signal'] = df_data_train_clean['signal'].ewm(alpha=0.5).mean()
df_data_train_clean.head(2)

Unnamed: 0,time,signal,open_channels,ewm_signal
0,0.0001,-2.76,0,-2.76
1,0.0002,-2.8557,0,-2.8238


In [18]:
df_data_test_clean['ewm_signal'] = df_data_test_clean['signal'].ewm(alpha=0.5).mean()
df_data_test_clean.head(2)

Unnamed: 0,time,signal,ewm_signal
0,500.0001,-2.649831,-2.649831
1,500.0002,-2.849463,-2.782919


#### Compute the Gradient of the Signal:

In [19]:
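# np.gradient uses unit spacing by default, so these are changes per sample
# (not per second); the same applies to the three cells that follow.
dsdt = np.gradient(df_data_train['signal'])
d2sdt2 = np.gradient(dsdt)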

df_data_train['dsdt'] = dsdt
df_data_train['d2sdt2'] = d2sdt2

In [20]:
dsdt = np.gradient(df_data_test['signal'])
d2sdt2 = np.gradient(dsdt)

df_data_test['dsdt'] = dsdt
df_data_test['d2sdt2'] = d2sdt2

In [21]:
dsdt = np.gradient(df_data_train_clean['signal'])
d2sdt2 = np.gradient(dsdt)

df_data_train_clean['dsdt'] = dsdt
df_data_train_clean['d2sdt2'] = d2sdt2

In [22]:
dsdt = np.gradient(df_data_test_clean['signal'])
d2sdt2 = np.gradient(dsdt)

df_data_test_clean['dsdt'] = dsdt
df_data_test_clean['d2sdt2'] = d2sdt2
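
The four cells above repeat the same two steps, and `np.gradient` is called with its default unit spacing. The sketch below wraps the steps in a hypothetical helper, `add_signal_derivatives`, with an optional sample spacing `dt`; the value `DT = 1e-4` assumes uniform 10 kHz sampling. Like the cells above, it differentiates across the whole column, batch boundaries included.

```python
import numpy as np
import pandas as pd

DT = 1e-4  # seconds between samples at 10 kHz (assumed uniform)


def add_signal_derivatives(df: pd.DataFrame, dt: float = 1.0) -> pd.DataFrame:
    """Return a copy of df with 'dsdt' and 'd2sdt2' columns.

    dt=1.0 reproduces the per-sample gradients used above;
    dt=DT expresses them per second instead.
    """
    out = df.copy()
    out['dsdt'] = np.gradient(out['signal'].to_numpy(), dt)
    out['d2sdt2'] = np.gradient(out['dsdt'].to_numpy(), dt)
    return out


# Toy signal (squares) so the gradients are easy to verify by eye.
demo = pd.DataFrame({'time': np.arange(1, 6) * DT,
                     'signal': [0.0, 1.0, 4.0, 9.0, 16.0]})
per_sample = add_signal_derivatives(demo)
per_second = add_signal_derivatives(demo, dt=DT)
print(per_sample[['dsdt', 'd2sdt2']])
```

With the default `dt`, looping this helper over the four frames would give the same columns as the cells above; choosing `dt=DT` only rescales them by 1/DT and 1/DT².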

#### Save New Data Frame:

In [23]:
df_data_train.to_csv(f'{train_file}_processed.csv',float_format='%.4f')

In [24]:
df_data_test.to_csv(f'{test_file}_processed.csv',float_format='%.4f')

In [25]:
df_data_train_clean.to_csv(f'{clean_train_file}_processed.csv',float_format='%.4f')

In [26]:
df_data_test_clean.to_csv(f'{clean_test_file}_processed.csv',float_format='%.4f')