<a href="https://colab.research.google.com/github/KristynaPijackova/Tutorials_NNs_and_signals/blob/main/Classification_ECG_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification of ECG signals

This notebook shows how to use recurrent layers and 1D convolutional layers to classify signals. It also shows how to prepare a dataset for training.

---



We are going to use the ECG5000 dataset, which contains 5000 ECG signals in 5 classes:

1 - Normal

2 - R-on-T Premature Ventricular Contraction

3 - Premature Ventricular Contraction 

4 - Supraventricular Premature beat

5 - Unclassified Beat




## Download data

We use data from the following page:

http://www.timeseriesclassification.com/description.php?Dataset=ECG5000

Options you have to download the data:


1.   Go to the website, download the data locally, and upload it into your Colab workspace. However, uploading data this way is slow, and you lose the data and have to repeat the process every time you switch workspaces (**not recommended**).

2.   Use the `!wget` command with the download link to fetch the data, then the `!unzip` command with the name of the zip file to extract its contents.
```
! wget http://www.timeseriesclassification.com/Downloads/ECG5000.zip
! unzip ECG5000.zip
```

3. Another option is to store your data on Google Drive and use the `!gdown` command to download it into Colab.
```
!gdown --id 1jo255jnoJniagZZd3IKbixb1i0kUIQCr
!unzip /content/data.zip
```
To get the file ID, open the shareable link to your data and copy the highlighted part (between `d/` and `/view?...`):

    https:// drive.google.com/file/d/__1jo255jnoJniagZZd3IKbixb1i0kUIQCr__/view?usp=sharing

Note that you also lose the data with the latter two options when you switch workspaces, but getting the data back this way is much faster.


### Download the ECG dataset 

Here we use the direct link to the website with `!wget` and then unzip the file. If you don't know the name of the downloaded file, you can find it in the left panel by clicking on the folder icon 📁

Or you can use another Unix command, `!ls /content/` (`/content/` is the name of the folder you work in here).

This shows the files stored there.

Now that we know the name of our file, we can unzip it.

We see there are quite a few files. We are going to use the text files `ECG5000_TEST.txt` and `ECG5000_TRAIN.txt`.

And we can once again use another Unix command, `!head`, to get a peek at what our data looks like.

## Import libraries

Now that we have a basic idea what we are going to work with, we can import libraries we are about to use.

## Let's prepare our data

We start by creating a dataframe: we simply load the content of the text files into a pandas dataframe so that the data is easier to work with.

See? It's basically the same thing we saw just a few moments ago 😉

However, the columns are labelled with plain integers, which is awkward to work with in pandas, so we add a prefix to the column names.
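A minimal sketch of this loading step, assuming the files are whitespace-separated text with one sample per row and the label in the first column. The helper name `load_ecg` and the in-memory sample are illustrative stand-ins; in the notebook you would pass the path to `ECG5000_TRAIN.txt` or `ECG5000_TEST.txt` instead.

```python
import io

import pandas as pd


def load_ecg(source):
    """Read a whitespace-separated ECG file into a DataFrame with columns c0, c1, ..."""
    df = pd.read_csv(source, sep=r"\s+", header=None)
    # Integer column labels 0..n become the strings 'c0'..'cn'
    return df.add_prefix("c")


# Tiny stand-in for the real file: label first, then the signal values
sample = io.StringIO("1.0 -0.11 -2.82 -3.77\n2.0 -1.10 -3.99 -4.28\n")
df_demo = load_ecg(sample)
print(df_demo.columns.tolist())
```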

## Analyze the dataset

In machine learning and deep learning it is important to know what data you are working with: not just what the data is meant to represent, but also how it is structured, what values to expect, how the classes are distributed, what the signals actually look like, and so on.

So the following part is meant to get us even more familiar with the data we are working with.

Here we can see basic info about the data - we have 141 columns c0 to c140. 
The data is stored as float64 and the memory usage is 4.8 MB. 

Next we can use `.describe()`, which gives us a statistical summary of the data.

What we can see is that the maximum and minimum values are around ±7 and the mean values lie roughly in the range (-1, 1), so we might be okay without normalization.

We can also have a peek at how the classes are distributed in the dataset.

This dataset stores the class labels in the first column, so we index the df and use `.value_counts()` to see how many samples represent each class.
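As a small illustration of the class count, here is a sketch on a made-up frame. The values are invented; only the column name `c0` follows the notebook.

```python
import pandas as pd

# Stand-in frame: 'c0' holds the class label, 'c1' is one column of signal data
df_demo = pd.DataFrame({
    "c0": [1.0, 1.0, 1.0, 2.0, 2.0, 4.0],
    "c1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Number of samples per class, sorted by class label instead of by frequency
class_counts = df_demo["c0"].value_counts().sort_index()
print(class_counts)
```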



And we can even visualize it... 

## Split the dataframe into data-points and labels

The first column of our dataframe 'c0' holds the labels from 1 to 5 which represent the classes we want to classify. 

At this point we will separate the data from the labels and create `x_train, x_test, y_train, y_test` arrays.
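One possible way to do the split, sketched on a tiny made-up frame. The helper name `split_features_labels` is hypothetical; the same call would be applied once to the training frame and once to the test frame.

```python
import numpy as np
import pandas as pd


def split_features_labels(df):
    """Return (x, y): every column except 'c0' as float data, 'c0' as integer labels."""
    x = df.drop(columns="c0").to_numpy(dtype=np.float32)
    y = df["c0"].to_numpy().astype(int)
    return x, y


df_demo = pd.DataFrame({"c0": [1.0, 3.0], "c1": [0.5, -0.5], "c2": [1.5, -1.5]})
x_demo, y_demo = split_features_labels(df_demo)
print(x_demo.shape, y_demo)
```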

## Handling imbalanced datasets

As we can see, our dataset is quite imbalanced - **which is not good!**


---


So what can we actually do about it? 

*   Undersampling
*   Oversampling
*   Combine under- and oversampling
*   Generate new data - not ideal, not always possible
*   Generate new data with generative networks
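To make the oversampling idea concrete, here is a minimal NumPy sketch of plain random oversampling, which duplicates existing minority samples. It is the simplest variant and not the method used later in this notebook, which relies on SMOTETomek. The function name `random_oversample` is illustrative.

```python
import numpy as np


def random_oversample(x, y, seed=0):
    """Duplicate randomly chosen minority samples until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    x_parts, y_parts = [x], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # Draw with replacement, because a class may have fewer samples than we need
            extra = rng.choice(idx, size=target - count, replace=True)
            x_parts.append(x[extra])
            y_parts.append(y[extra])
    return np.concatenate(x_parts), np.concatenate(y_parts)


x_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.array([1, 1, 1, 1, 2, 3])
x_bal, y_bal = random_oversample(x_demo, y_demo)
print(np.unique(y_bal, return_counts=True))
```

Because the added rows are exact copies, this variant adds no new information. That is one reason to prefer interpolating methods such as SMOTE.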

---

A few blog posts about over- and undersampling:

*   [5 Techniques to Handle Imbalanced Data For a Classification Problem](https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/)


* [Stop using SMOTE to handle all your Imbalanced Data](https://towardsdatascience.com/stop-using-smote-to-handle-all-your-imbalanced-data-34403399d3be)
 

---

A library that helps us deal with imbalanced datasets:

[imbalanced-learn](https://imbalanced-learn.org/stable/auto_examples/index.html)




Here is a simple function which will apply under/oversampling method to our data. It returns arrays of new samples and their labels.

```python
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek


def over_under_sampling(dataframe):
    """
    Use the SMOTETomek technique to oversample our dataset.

    This function is written to be applied to our datasets,
    where the first column holds the labels, and the rest is the
    time sequence.

    It passes the under-represented data - classes 2-5 along
    with the dominant class 1 into the SMOTETomek over- & undersampler
    to balance the dataset.
    """
    # lists to store the created values in
    x_res = []
    y_res = []

    for i in range(2, 6):

        # create copy of the dataframe
        df_copy = dataframe.copy()
        # choose samples of the i-th class and add samples from the 1st class
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        df = pd.concat([df_copy[df_copy['c0'] == i],
                        df_copy[df_copy['c0'] == 1]])
        # split the dataframe into x - data and y - labels
        x = df.values[:, 1:]
        y = df.values[:, 0]

        # define the resampler
        smtomek = SMOTETomek(random_state=42)
        # fit it on our data
        x_r, y_r = smtomek.fit_resample(x, y)

        # we want to skip the data we fit it on - only want the new data;
        # this assumes the original samples come first in the output, and
        # the Tomek-link cleaning may drop a few rows, so the cut is approximate
        skip = y.shape[0]
        # append the data into our above lists
        x_res.append(x_r[skip:, :])
        y_res.append(y_r[skip:])

    # return the data as concatenated arrays -> only one array of all samples
    # instead of a list of arrays
    return np.concatenate(x_res), np.concatenate(y_res)
```

Here we call the above function

And now we combine it with our original data.
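Combining the arrays could look like the sketch below. The names `x_new` and `y_new` for the returned synthetic samples are assumptions, and random stand-in arrays are used here so that the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins: 6 original and 4 synthetic signals with 140 timesteps each
x_train = rng.normal(size=(6, 140))
y_train = np.array([1, 1, 1, 1, 2, 3])
x_new = rng.normal(size=(4, 140))
y_new = np.array([2, 2, 3, 3])

# Stack along the sample axis so that rows and labels stay aligned
x_train_bal = np.concatenate([x_train, x_new], axis=0)
y_train_bal = np.concatenate([y_train, y_new], axis=0)

# Per-class counts after combining
labels, counts = np.unique(y_train_bal, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```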

Let's see what the previously imbalanced data looks like now...

That's better, isn't it? But we should check what the new synthetic data looks like and whether it resembles the original data.

## Let's visualize the signals

Let's write a for loop in which we get a few indices of our target class with the help of `np.where()` and then draw two subplots, one with the original data and one with the synthetic data, for comparison.
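One possible shape for this comparison, written as a helper for a single class that you could call inside a loop over the classes. The function name `plot_class_comparison` and the random stand-in arrays are illustrative, not taken from the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_class_comparison(x_orig, y_orig, x_syn, y_syn, target, n=3):
    """Plot up to n original and n synthetic signals of one class side by side."""
    idx_orig = np.where(y_orig == target)[0][:n]
    idx_syn = np.where(y_syn == target)[0][:n]
    fig, (ax_o, ax_s) = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
    for i in idx_orig:
        ax_o.plot(x_orig[i])
    for i in idx_syn:
        ax_s.plot(x_syn[i])
    ax_o.set_title(f"Class {target}: original")
    ax_s.set_title(f"Class {target}: synthetic")
    return fig


rng = np.random.default_rng(0)
x_o, y_o = rng.normal(size=(8, 140)), np.array([1, 2, 2, 1, 2, 3, 2, 1])
x_s, y_s = rng.normal(size=(5, 140)), np.array([2, 3, 2, 2, 2])
fig = plot_class_comparison(x_o, y_o, x_s, y_s, target=2)
```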

## And now we can get to training

Just a few last things we need to do so that our models can train on the data.



We normalize the data with the help of `MinMaxScaler`, which was imported from sklearn.

Here `data_scaler` holds the `MinMaxScaler`. In the second line of the code we fit it on our training data; the fitted scaler is then applied to whatever data we want to scale with `data_scaler.transform(data_we_want_to_scale)`.

Right now we have data with dimensions (n, 140), however we need (n, 140, 1).
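A minimal sketch of the scaling step with random stand-in data. The key point is that the scaler is fitted on the training data only and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
x_train = rng.uniform(-7, 7, size=(20, 140))  # stand-in for the training signals
x_test = rng.uniform(-7, 7, size=(5, 140))    # stand-in for the test signals

data_scaler = MinMaxScaler()
data_scaler.fit(x_train)  # learn the per-column min and max from the training data only
x_train_scaled = data_scaler.transform(x_train)
x_test_scaled = data_scaler.transform(x_test)  # values may fall slightly outside [0, 1]
```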
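Adding the trailing feature axis can be done with NumPy indexing, sketched here on a stand-in array:

```python
import numpy as np

x = np.zeros((8, 140))     # stand-in for the scaled signals, shape (n, 140)
x_3d = x[..., np.newaxis]  # append a feature axis -> shape (n, 140, 1)
print(x_3d.shape)
```

`np.expand_dims(x, axis=-1)` gives the same result.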

## Define functions to plot loss, accuracy and confusion matrix

## Model with Conv1D

A comment on the input shape:

We have data with 140 timesteps and only one feature (a 140x1 array). However, if your data has more features, for example an I/Q representation with shape (t, 2), where t is the number of timesteps, you can simply change the 1 to match the number of features.

In `.summary()` you can see that the output shape starts with `None`. This stands for the batch size, which can vary; you set its value in the `.fit()` function.
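As a rough illustration of the input shape, here is a small Conv1D classifier in Keras. The layer sizes and the helper name `build_conv1d_model` are assumptions chosen for the sketch, not the original notebook's architecture.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_conv1d_model(timesteps=140, features=1, n_classes=5):
    """Small Conv1D classifier; layer sizes are illustrative, not tuned."""
    return keras.Sequential([
        keras.Input(shape=(timesteps, features)),
        layers.Conv1D(32, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.Flatten(),  # needs a fixed number of timesteps
        layers.Dense(n_classes, activation="softmax"),
    ])


model = build_conv1d_model()
model.compile(
    optimizer="adam",
    # expects integer labels 0..4, so the labels 1..5 must be shifted by -1
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()

# The batch dimension is left open: any number of (140, 1) signals works
probs = model.predict(np.zeros((3, 140, 1), dtype="float32"), verbose=0)
```

For I/Q data you would call `build_conv1d_model(features=2)` in this sketch.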

## LSTM

A comment on the input shape:

We have data with 140 timesteps and only one feature (a 140x1 array). However, if your data has more features, for example an I/Q representation with shape (t, 2), where t is the number of timesteps, you can simply change the 1 to match the number of features.

However, unlike with the Conv1D-only model, here you can set the number of timesteps to `None`. This allows you to train and test on sequences of variable length.

The only requirement when training the model is that all sequences within a single batch have the same length.
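A minimal sketch of an LSTM classifier with an open timestep dimension. The number of units and the helper name `build_lstm_model` are assumptions for illustration, not the original notebook's model.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_lstm_model(features=1, n_classes=5):
    """LSTM classifier that accepts sequences of any length (timesteps=None)."""
    return keras.Sequential([
        keras.Input(shape=(None, features)),
        layers.LSTM(32),  # returns only the last hidden state
        layers.Dense(n_classes, activation="softmax"),
    ])


model = build_lstm_model()

# Two batches with different sequence lengths; inside each batch the lengths match
out_140 = model.predict(np.zeros((4, 140, 1), dtype="float32"), verbose=0)
out_70 = model.predict(np.zeros((4, 70, 1), dtype="float32"), verbose=0)
print(out_140.shape, out_70.shape)
```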

## GRU

## Conv1d + LSTM