### Competition Objective

The University of Liverpool’s Institute of Ageing and Chronic Disease is working to advance ion channel research.  Ion channels are pore-forming proteins present in animals and plants. They encode learning and memory, help fight infections, enable pain signals, and stimulate muscle contraction. 

When ion channels open, they pass electric currents. Existing methods of detecting these state changes are slow and laborious. In this competition, we will use ion channel data to better model automatic identification methods.

![Pic](https://storage.googleapis.com/kaggle-media/competitions/Liverpool/ion%20image.jpg)

### Notebook Objective

The objective of this notebook is to explore the given data and note any useful insights along the way.

### Dataset Summary

First let us look at the files that are given to us.

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.patches as patches

from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
pd.set_option('display.max_columns', 100)

# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

We have been given three files: train, test and sample submission. Very straightforward.

Now let us read the train and test data and look at the top few rows and total number of rows.

In [None]:
data_path = "/kaggle/input/liverpool-ion-switching/"
train_df = pd.read_csv(data_path + "train.csv")
test_df = pd.read_csv(data_path + "test.csv")
train_df.head()

In [None]:
print(f"Number of rows in the train data : {train_df.shape[0]}")
print(f"Number of rows in the test data : {test_df.shape[0]}")

In [None]:
train_df.shape[0] / 500000

The column descriptions are:

* time - time of observation
* signal - the input signal to use
* open_channels - target variable

The dataset page also notes that, while the time series appears continuous, the data comes from discrete batches, each 50 seconds long and sampled at 10 kHz (500,000 rows per batch). In other words, the data from 0.0001 to 50.0000 is a different batch than the data from 50.0001 to 100.0000, so the series is discontinuous between 50.0000 and 50.0001.

**This means we have 10 batches in our training data and 4 batches in our test data.**
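
The batch structure can be made explicit with a batch index derived from the row position. The following is a minimal sketch, assuming the rows are ordered by time and every batch holds exactly 500,000 rows as the dataset page states. The helper `add_batch_column`, the `batch` column name, and the toy frame are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

ROWS_PER_BATCH = 500_000  # 50 seconds sampled at 10 kHz, per the dataset page


def add_batch_column(df, rows_per_batch=ROWS_PER_BATCH):
    """Return a copy of df with a 1-based `batch` column (hypothetical name).

    Assumes rows are sorted by time and every batch has the same length.
    """
    out = df.copy()
    out["batch"] = np.arange(len(out)) // rows_per_batch + 1
    return out


# Toy frame standing in for train_df: 12 rows split into batches of 4.
toy_df = pd.DataFrame({"time": np.arange(1, 13) / 10_000, "signal": np.zeros(12)})
toy_batched = add_batch_column(toy_df, rows_per_batch=4)
print(toy_batched["batch"].value_counts().sort_index())
```

On the real data, `add_batch_column(train_df)["batch"].nunique()` should agree with the batch count derived from the row totals.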

### Target Distribution

Let us start by exploring the target variable, `open_channels`. We can get the count of each class in the target.

In [None]:
cnt_srs = train_df["open_channels"].value_counts()
plt.figure(figsize=(16,12))
sns.countplot(data=train_df, x="open_channels", color="green")
plt.xlabel('Open Channel', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Count of open channels in train dataset", fontsize=20)
plt.show()

**Inference:**
* `open_channels` can take 11 possible values, 0 to 10, which means the number of open ion channels ranges from 0 to 10.
* We have more than 1.2 million records in which all the ion channels are closed.
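
The count plot shows absolute counts; class shares make the imbalance easier to compare. Below is a minimal sketch, assuming the target is a pandas Series of integer class labels. The helper `class_shares` and the toy Series are hypothetical and only stand in for `train_df["open_channels"]`.

```python
import pandas as pd


def class_shares(target):
    """Return per-class counts and their share of all rows, ordered by class."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy target with made-up labels, not taken from the dataset.
toy_target = pd.Series([0, 0, 0, 1, 1, 3], name="open_channels")
summary = class_shares(toy_target)
print(summary)
```

On the real data, the same call would be `class_shares(train_df["open_channels"])`.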

To explore further, let us take the first five seconds of the train data and look at how `signal` and `open_channels` change over time.

In [None]:
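# Note: this notebook targets Bokeh 2.x. In Bokeh 3.x, `Panel` was renamed `TabPanel`
# and `plot_width`/`plot_height` were replaced by `width`/`height`.
from bokeh.models import Panel, Tabs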
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import gridplot

output_notebook()

def get_plots(df, col="signal", title="Signal change over time for first 5 seconds in the train data"):
    p = figure(plot_width=1000, plot_height=350, x_axis_type="datetime", title=title)
    p.title.align = "center"
    p.line(df['time']*1000, df[col], color='navy', alpha=0.5)
    return p

temp_df = train_df.iloc[:50000]
p1 = get_plots(temp_df)
p2 = get_plots(temp_df, "open_channels", "Open channels change over time for first 5 seconds in train data")
show(gridplot([p1,p2], ncols=1, plot_width=800, plot_height=400, toolbar_location=None))

**Inference:**
* We can clearly see that whenever the signal value surges, a channel is open at the same time.
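
One way to put a number on this visual impression is the Pearson correlation between the two columns over the same window. This is only a sketch under the assumption that a linear correlation is a reasonable first check. The helper `signal_target_correlation` is hypothetical, and the toy values are made up for illustration, not taken from the dataset.

```python
import pandas as pd


def signal_target_correlation(df):
    """Pearson correlation between the `signal` and `open_channels` columns."""
    return df["signal"].corr(df["open_channels"])


# Toy window: low signal when the channel is closed, higher signal when it is open.
toy_df = pd.DataFrame({
    "signal": [-2.7, -2.6, -1.3, -1.2, -2.8, -1.25],
    "open_channels": [0, 0, 1, 1, 0, 1],
})
toy_corr = signal_target_correlation(toy_df)
print(f"Correlation on toy data: {toy_corr:.3f}")
```

For the window plotted above, the equivalent call would be `signal_target_correlation(train_df.iloc[:50000])`.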

As a next step, let us plot the first second of each of the 10 discrete batches in the train dataset.

Each tab represents a separate batch in the training dataset.

In [None]:
from bokeh.models import Panel, Tabs

def get_plots(df, col="signal", title="Signal change over time for first 5 seconds in the train data"):
    p = figure(plot_width=1000, plot_height=350, title=title)
    p.title.align = "center"
    p.line(df['time'], df[col], color='navy', alpha=0.5)
    return p

tab_list = []
for i in range(10):
    temp_df = train_df.iloc[(i*500000):(i*500000+10000)]
    p1 = get_plots(temp_df, "signal", "Signal change over time for the first 1 second of the batch")
    p2 = get_plots(temp_df, "open_channels", "Open channels change over time for the first 1 second of the batch")
    q = gridplot([p1,p2], ncols=1, plot_width=800, plot_height=400, toolbar_location=None)
    tab = Panel(child=q, title=f"Batch:{i+1}")
    tab_list.append(tab)
    
tabs = Tabs(tabs=tab_list)
show(tabs)

**Inference:**

* The signal appears to rise roughly linearly with the number of open channels
* When the signal value is below -1.5, the number of open channels is 0 most of the time
* When the signal value is between -1.5 and -0.5, the number of open channels is 1 most of the time
* The number of open channels peaks at 10, mostly when the signal value is near 8
* Batch 5 and batch 10 seem to have higher numbers of open channels than the other batches
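
The batch-level observation can be checked numerically by summarising the target per batch. The following is a sketch, assuming rows are ordered by time and batches have equal length. The helper `per_batch_target_summary` is hypothetical, and the toy frame uses made-up values with batches of 4 rows.

```python
import numpy as np
import pandas as pd


def per_batch_target_summary(df, rows_per_batch=500_000):
    """Mean and max of `open_channels` for each equally sized, 1-based batch."""
    batch = np.arange(len(df)) // rows_per_batch + 1
    return (
        df.groupby(batch)["open_channels"]
        .agg(["mean", "max"])
        .rename_axis("batch")
    )


# Toy frame: the second batch has far more open channels than the first.
toy_df = pd.DataFrame({"open_channels": [0, 1, 0, 1, 5, 7, 6, 6]})
toy_summary = per_batch_target_summary(toy_df, rows_per_batch=4)
print(toy_summary)
```

Calling `per_batch_target_summary(train_df)` with the default batch size would give one row per training batch.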



### Signal Value Vs Number of Open Channels

To validate the above points, let us look at the distribution of signal values for each target class.

In [None]:
output_notebook()
def make_plot(title, hist, edges, xlabel):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="#1E90FF", line_color="white", alpha=0.5)

    p.y_range.start = 0
    p.xaxis.axis_label = 'Signal value'
    p.yaxis.axis_label = 'Density'  # np.histogram(..., density=True) returns a density, not a probability
    p.grid.grid_line_color="white"
    p.title.align = "center"
    return p

all_plots = []
for target_value in range(11):
    temp_df = train_df[train_df["open_channels"]==target_value]
    hist, edges = np.histogram(temp_df["signal"].values, density=True, bins=50)
    all_plots.append(make_plot(f"Signal distribution when number of open channels is {target_value}", hist, edges, target_value))
    
show(gridplot(all_plots, ncols=1, plot_width=800, plot_height=400, toolbar_location=None))

**Inference:**
* Target classes 0 to 3 exhibit similar signal distributions
* Target classes 4 and 5 are similar to each other
* Target classes 6 to 10 exhibit similar signal distributions. Very interesting patterns indeed!
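
A per-class table of summary statistics can complement the histograms when comparing classes. This is a minimal sketch; the helper `signal_summary_by_class` is hypothetical, and the toy values are made up for illustration rather than drawn from the dataset.

```python
import pandas as pd


def signal_summary_by_class(df):
    """Summary statistics of `signal` for each value of `open_channels`."""
    return df.groupby("open_channels")["signal"].agg(
        ["count", "mean", "std", "min", "max"]
    )


# Toy frame with three classes and two observations each.
toy_df = pd.DataFrame({
    "signal": [-2.7, -2.5, -1.3, -1.1, 0.2, 0.4],
    "open_channels": [0, 0, 1, 1, 2, 2],
})
toy_stats = signal_summary_by_class(toy_df)
print(toy_stats)
```

On the real data, `signal_summary_by_class(train_df)` returns one row per target class, which makes it easier to see which classes have overlapping signal ranges.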


**Work in progress. Please stay tuned!**