In [None]:
import os
from collections import Counter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mode

SEED = 420

# OpenVaccine: Quick start EDA

<img src="https://images.squarespace-cdn.com/content/v1/5eb6fae5122cea533e0c3c3f/1589051238766-K3676NNLELHFO72M9OBQ/ke17ZwdGBToddI8pDm48kDFgITcRoterXoQdllT5ciUUqsxRUqqbr1mOJYKfIPR7LoDQ9mXPOjoJoqy81S2I8N_N4V1vUb5AoIIIbLZhVYxCRW4BPu10St3TBAUQYVKcV7ZyRJyI8bwZiMJRrgPaAKqUaXS0tb9q_dTyNVba_kClt3J5x-w6oTQbPni4jzRa/eterna_covid-19_web_1200_v2.jpg?format=1500w" alt="Fight Covid-19 with eterna" width="400"/>

## Table of contents

* [<font size=4>Competition overview</font>](#1)

  * [Goal of competition](#1a)
  * [About the organisers](#1b)


* [<font size=4>Quick start guide to data</font>](#2)

  * [Load data](#2a)
  * [Understanding inputs](#2b)
  * [Understanding labels](#2c)


* [<font size=4>In-depth EDA</font>](#3)

  * [`sequence` column](#3a)
  * [`structure` column](#3b)
  * [`loop_type_predictions` column](#3c)
  * [Exploring labels](#3d)
  * [Exploring Bpps data](#3e)


* [<font size=4>Modelling approaches (WIP)</font>](#4)

## Competition Overview <a id="1"></a>

[mRNA vaccines](https://www.nature.com/articles/nrd.2017.243) are a promising improvement on conventional vaccines because they are cheaper and faster to make and safer to administer.

They have a major downside: RNA molecules tend to degrade spontaneously, which makes transportation difficult.

In this competition, we seek to predict which parts of the RNA are most prone to degradation. If we succeed, we may be able to improve the stability of mRNA vaccines and therefore accelerate the development and mass production of a COVID-19 vaccine.

### Problem goal <a id="1a"></a>

Given an RNA sequence, can you predict the likely degradation rates at each position, as a sequence of values?

Can you also predict the rates under a variety of conditions?


### About the organisers <a id="1b"></a>

"Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles.

The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules.

The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public."

<iframe width="560" height="315" src="https://www.youtube.com/embed/pGMu569jkEc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Quick start guide to data <a id=2></a>

### Load data <a id=2a></a>

In [None]:
train_df = pd.read_json('../input/stanford-covid-vaccine/train.json',lines=True)
test_df = pd.read_json('../input/stanford-covid-vaccine/test.json', lines=True)
sample_submission_df = pd.read_csv('../input/stanford-covid-vaccine/sample_submission.csv')

### Model inputs <a id=2b></a>

You are given a JSON object for each sample.

Each sample contains 

1. an RNA sequence represented as `A`, `G`, `U` or `C` characters.
2. metadata for each sequence element, including the structure property, which indicates whether the element is "paired" or "unpaired", and the predicted loop type, which provides an estimate of the "loop type".

We'll look at each of these fields in some detail here, then in greater detail in section 3.
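
As a rough sketch of what one record looks like once parsed, the snippet below reads a single JSON line with the standard library. The record is a made-up, shortened toy example (the `id` value and the 8-character fields are my own), not a real sample from `train.json`; only the field names `sequence`, `structure` and `predicted_loop_type` come from the data described above.

```python
import json

# A made-up, shortened record for illustration only (real sequences are much longer).
toy_line = (
    '{"id": "toy_0", "sequence": "GGAAAACC", '
    '"structure": "((....))", "predicted_loop_type": "SSHHHHSS"}'
)

record = json.loads(toy_line)

# The three per-position fields are strings of equal length,
# so position i in one lines up with position i in the others.
fields = ["sequence", "structure", "predicted_loop_type"]
lengths = {name: len(record[name]) for name in fields}
print(lengths)

for base, struct, loop in zip(*(record[name] for name in fields)):
    print(base, struct, loop)
```

Because the fields are aligned position by position, most of the exploration below boils down to walking these strings in lockstep.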

#### `sequence`

In the public train/test set the sequences are 107 characters long; in the private test set they are 130 characters long.

This is an important consideration when creating a model: *you need to build a solution that can handle sequence lengths longer than anything in the training set*.

Let's take the sequence from the first sample in the training set and explore it:

In [None]:
one_example = train_df.iloc[0]
one_example.sequence

Each character in the sequence is one of `A`, `G`, `U` or `C`. We can split the string into one row per character to see the data in a more conventional tabular format.

In [None]:
rotated_df = pd.DataFrame(list(one_example.sequence), columns=['sequence'])
rotated_df

#### `structure`

Describes whether each sequence element is "paired" or "unpaired". Each position is represented as `(` (paired, opening), `.` (unpaired) or `)` (paired, closing).

In [None]:
one_example.structure

Here are the first 10 sequence characters of the sample and their corresponding structure.

In [None]:
rotated_df_structure = pd.concat([
    rotated_df,
    pd.DataFrame(list(one_example.structure), columns=['structure'])
], axis=1)
rotated_df_structure.iloc[:10]

#### `predicted_loop_type`

The loop type describes the structural context of each character. It is one of `S` (paired "Stem"), `M` (Multiloop), `I` (Internal loop), `B` (Bulge), `H` (Hairpin loop), `E` (dangling End) or `X` (eXternal loop).

Let's join those to our dataframe.

In [None]:
rotated_df_pred_loop = pd.concat([
    rotated_df_structure,
    pd.DataFrame(list(one_example.predicted_loop_type), columns=['predicted_loop_type'])
], axis=1)
rotated_df_pred_loop.iloc[:10]

### Labels <a id=2c></a>

As labels, we are provided 5 vectors that need to be predicted for each sample in the test set.

They are float arrays with 68 elements in the public train/test set and 91 elements in the private test set.

* `reactivity`: "used to determine the likely secondary structure of the RNA sample."
* `deg_Mg_pH10`:  "used to determine the likelihood of degradation at the base/linkage after incubating with magnesium in high pH (pH 10)."
* `deg_pH10`: "used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high pH (pH 10)."
* `deg_Mg_50C`: "used to determine the likelihood of degradation at the base/linkage after incubating with magnesium at high temperature (50 degrees Celsius)."
* `deg_50C`: "used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high temperature (50 degrees Celsius)."

Let's take a look at a label for one sample:

In [None]:
pd.DataFrame(
    {
        'reactivity': list(one_example.reactivity),
        'deg_Mg_pH10': list(one_example.deg_Mg_pH10),
        'deg_pH10': list(one_example.deg_pH10),
        'deg_Mg_50C': list(one_example.deg_Mg_50C),
        'deg_50C': list(one_example.deg_50C)
    }
)

There are also some error values for each of the elements of the label sequence. We'll ignore those for now.
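
Note that the label arrays are shorter than the sequences. The sketch below shows one way to line them up, assuming the labels cover the first positions of the sequence (that is my understanding of the data description, so treat it as an assumption). The helper name `align_labels` and the toy values are illustrative only.

```python
def align_labels(sequence, labels):
    """Pair each base with its label, using None where no label exists.

    Assumes labels cover the first len(labels) positions of the sequence.
    """
    if len(labels) > len(sequence):
        raise ValueError("more labels than sequence positions")
    padded = list(labels) + [None] * (len(sequence) - len(labels))
    return list(zip(sequence, padded))


# Toy values for illustration (the real data has far more positions).
toy_sequence = "GGAAAACC"
toy_reactivity = [0.5, 1.2, 0.9, 0.3, 0.7]

aligned = align_labels(toy_sequence, toy_reactivity)
for position, (base, value) in enumerate(aligned):
    print(position, base, value)
```

Positions without a label come out as `None`, which makes it explicit that they should be left out of any loss or metric.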

## In-depth EDA <a id=3></a>

Let's take another look at the sequence values across the train and test datasets, exploring their frequency.

### Sequence <a id=3a></a>

#### Frequencies

Let's start by exploring the frequency of each sequence character across all samples.

In [None]:
train_sequence_breakdown = Counter(''.join(list(train_df.sequence)))
test_sequence_breakdown = Counter(''.join(list(test_df.sequence)))

x_labels = ['G', 'A', 'C', 'U']

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
plt.title('Sequence character counts (train)')
plt.bar(x_labels, [train_sequence_breakdown[l] for l in x_labels])

plt.subplot(1, 2, 2)
plt.title('Sequence character counts (test)')
plt.bar(x_labels, [test_sequence_breakdown[l] for l in x_labels])

plt.show()

I'm interested in learning how frequently certain characters appear side by side. To do that, I'm going to break the sequences into bigrams and explore their frequency.

#### Sequence n-gram frequency

In [None]:
# Takes an iterable of sequences and returns a python `Counter` of n-gram tokens.

def get_ngrams_counters(sequences, n=2):
    output = Counter()
    for sequence in sequences:
        # A sequence of length L has L - n + 1 full windows of size n.
        output.update(sequence[i:i+n] for i in range(len(sequence) - n + 1))

    return output
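
To make the sliding window concrete: a sequence of length L contains L - n + 1 overlapping n-grams. Here is a minimal standalone sketch on a toy sequence. The helper name `count_ngrams` is illustrative and works on a single string rather than a whole column.

```python
from collections import Counter


def count_ngrams(sequence, n=2):
    """Count every overlapping window of length n in one sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


toy_sequence = "GGAAAACC"

bigrams = count_ngrams(toy_sequence, n=2)
trigrams = count_ngrams(toy_sequence, n=3)

print(bigrams.most_common())
# 8 characters give 7 bigrams and 6 trigrams.
print(sum(bigrams.values()), sum(trigrams.values()))
```

Stopping the range at `len(sequence) - n + 1` matters for `n > 2`: running further would add truncated tokens from the tail of the sequence.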

In [None]:
train_ngram_sequence = get_ngrams_counters(train_df.sequence)
test_ngram_sequence = get_ngrams_counters(test_df.sequence)

# Used to sort by frequency.
train_ngram_sequence = dict(train_ngram_sequence.most_common(10000))
test_ngram_sequence = dict(test_ngram_sequence.most_common(10000))

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
plt.title(f'Sequence character bigram counts (train) ({len(train_ngram_sequence)} unique bigrams)')
plt.bar(dict(train_ngram_sequence).keys(), dict(train_ngram_sequence).values())

plt.subplot(1, 2, 2)
plt.title(f'Sequence character bigram counts (test) ({len(test_ngram_sequence)} unique bigrams)')
plt.bar(dict(test_ngram_sequence).keys(), dict(test_ngram_sequence).values())

plt.show()

Trigrams may be another interesting approach to learning about how sequences are typically ordered. 

In [None]:
train_ngram_sequence = get_ngrams_counters(train_df.sequence, 3)
test_ngram_sequence = get_ngrams_counters(test_df.sequence, 3)

# Used to sort by frequency.
train_ngram_sequence = dict(train_ngram_sequence.most_common(10000))
test_ngram_sequence = dict(test_ngram_sequence.most_common(10000))

plt.figure(figsize=(25, 10))
plt.title(f'Sequence character trigram (train) ({len(train_ngram_sequence)} unique trigrams)')
plt.bar(dict(train_ngram_sequence).keys(), dict(train_ngram_sequence).values())
plt.xticks(rotation=45)

plt.figure(figsize=(25, 10))
plt.title(f'Sequence character trigram (test) ({len(test_ngram_sequence)} unique trigrams)')
plt.bar(dict(test_ngram_sequence).keys(), dict(test_ngram_sequence).values())
plt.xticks(rotation=45)

plt.show()

So, it seems that there are some characters that commonly appear together.
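
One way to firm up "commonly appear together" is to compare each bigram's observed frequency with the frequency you would expect if neighbouring characters were independent. A ratio above 1 suggests the pair occurs more often than chance. This is a sketch on toy sequences, and `bigram_enrichment` is a hypothetical helper, not part of the analysis above.

```python
from collections import Counter


def bigram_enrichment(sequences):
    """Return observed / expected frequency for each bigram.

    Expected frequency assumes neighbouring characters are independent,
    i.e. P(xy) = P(x) * P(y).
    """
    chars = Counter()
    bigrams = Counter()
    for seq in sequences:
        chars.update(seq)
        bigrams.update(seq[i:i + 2] for i in range(len(seq) - 1))

    total_chars = sum(chars.values())
    total_bigrams = sum(bigrams.values())

    ratios = {}
    for bigram, count in bigrams.items():
        observed = count / total_bigrams
        expected = (chars[bigram[0]] / total_chars) * (chars[bigram[1]] / total_chars)
        ratios[bigram] = observed / expected
    return ratios


# Toy sequences for illustration only.
toy_sequences = ["GGAAAACC", "ACGUACGU"]
toy_ratios = bigram_enrichment(toy_sequences)
for bigram, ratio in sorted(toy_ratios.items(), key=lambda kv: -kv[1]):
    print(bigram, round(ratio, 2))
```

On the real data the same function could be called with `train_df.sequence`; raw counts alone can be misleading when some characters are simply more common overall.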

### Structure <a id=3b></a>

#### Frequency

In [None]:
train_structure_breakdown = Counter(''.join(list(train_df.structure)))
test_structure_breakdown = Counter(''.join(list(test_df.structure)))

x_labels = ['(', ')', '.']

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
plt.title('Structure character counts (train)')
plt.bar(x_labels, [train_structure_breakdown[l] for l in x_labels])

plt.subplot(1, 2, 2)
plt.title('Structure character counts (test)')
plt.bar(x_labels, [test_structure_breakdown[l] for l in x_labels])

plt.show()

The counts of `(` and `)` appear to be equal, which suggests the brackets are balanced and therefore parseable: `(` opens a pair and `)` closes it.
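
If the brackets are balanced, a simple stack is enough to recover which positions pair with each other. The sketch below assumes standard dot-bracket notation without pseudoknots; `parse_dot_bracket` is an illustrative helper of my own.

```python
def parse_dot_bracket(structure):
    """Return a list where entry i is the partner index of position i,
    or None if position i is unpaired."""
    partners = [None] * len(structure)
    stack = []
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
        elif char != ".":
            raise ValueError(f"unexpected character {char!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partners


print(parse_dot_bracket("((....))"))
# → [7, 6, None, None, None, None, 1, 0]
```

The partner index could be a useful feature in its own right, since it tells a model which distant position each paired base is bound to.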

#### Relationship between structure and sequence

Let's combine the sequence and structure to understand how they relate to each other.

The function below pairs up two or more sequences position by position (e.g. element 0 of `sequence` is paired with element 0 of `structure`, and so on) and can be used to find which sequence characters co-occur with which structure characters.

In [None]:
def get_paired_tokens(*sequences):
    output = Counter()
    for seq_chars in zip(*sequences):
        # Walk every position, including the last one.
        for i in range(len(seq_chars[0])):
            new_token = ''
            for seq_char in seq_chars:
                new_token += ' '+seq_char[i]
            output[new_token] += 1

    return output
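
The pairing idea in isolation: `zip` walks two equal-length strings in lockstep, so each (sequence character, annotation character) pair can be counted directly. This is a toy example with an illustrative helper name, `count_position_pairs`, that handles a single sample.

```python
from collections import Counter


def count_position_pairs(sequence, annotation):
    """Count (sequence char, annotation char) pairs at matching positions."""
    if len(sequence) != len(annotation):
        raise ValueError("inputs must have the same length")
    return Counter(zip(sequence, annotation))


pairs = count_position_pairs("GGAAAACC", "((....))")
for (base, struct), count in pairs.most_common():
    print(base, struct, count)
```

Every position contributes exactly one pair, so the counts always sum to the sequence length.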

In [None]:
train_sequence_structure_pairs = get_paired_tokens(train_df.sequence, train_df.structure)
# Use the test dataframe here so the second plot really shows test data.
test_sequence_structure_pairs = get_paired_tokens(test_df.sequence, test_df.structure)

train_sequence_structure_pairs = dict(train_sequence_structure_pairs.most_common(1000))
test_sequence_structure_pairs = dict(test_sequence_structure_pairs.most_common(1000))

plt.figure(figsize=(25, 10))
plt.title(f'Train sequence and structure character pairs dist')
plt.bar(train_sequence_structure_pairs.keys(), train_sequence_structure_pairs.values())
plt.xticks(rotation=45)

plt.figure(figsize=(25, 10))
plt.title(f'Test sequence and structure character pairs dist')
plt.bar(test_sequence_structure_pairs.keys(), test_sequence_structure_pairs.values())
plt.xticks(rotation=45)

plt.show()


Let's turn our attention to `predicted_loop_type`

### Predicted loop type <a id=3c></a>

#### Frequency

In [None]:
train_predicted_loop_type_breakdown = Counter(''.join(list(train_df.predicted_loop_type)))
test_predicted_loop_type_breakdown = Counter(''.join(list(test_df.predicted_loop_type)))

train_predicted_loop_type_breakdown = dict(train_predicted_loop_type_breakdown.most_common(1000))
test_predicted_loop_type_breakdown = dict(test_predicted_loop_type_breakdown.most_common(1000))

x_labels = train_predicted_loop_type_breakdown.keys()

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
plt.title('Predicted loop type counts (train)')
plt.bar(x_labels, [train_predicted_loop_type_breakdown[l] for l in x_labels])

plt.subplot(1, 2, 2)
plt.title('Predicted loop type counts (test)')
plt.bar(x_labels, [test_predicted_loop_type_breakdown[l] for l in x_labels])

plt.show()

Now let's see how `predicted_loop_type` relates to the sequence character.

In [None]:
train_sequence_predicted_loop_pairs = get_paired_tokens(train_df.sequence, train_df.predicted_loop_type)
# Use the test dataframe for the test counts.
test_sequence_predicted_loop_pairs = get_paired_tokens(test_df.sequence, test_df.predicted_loop_type)

train_sequence_predicted_loop_pairs = dict(train_sequence_predicted_loop_pairs.most_common(1000))
test_sequence_predicted_loop_pairs = dict(test_sequence_predicted_loop_pairs.most_common(1000))

# Plot the predicted-loop pairs (not the structure pairs from the previous section).
plt.figure(figsize=(25, 10))
plt.title('Train sequence and predicted loop character pairs dist')
plt.bar(train_sequence_predicted_loop_pairs.keys(), train_sequence_predicted_loop_pairs.values())
plt.xticks(rotation=45)

plt.figure(figsize=(25, 10))
plt.title('Test sequence and predicted loop character pairs dist')
plt.bar(test_sequence_predicted_loop_pairs.keys(), test_sequence_predicted_loop_pairs.values())
plt.xticks(rotation=45)

plt.show()

### Labels <a id=3d></a>

In [None]:
# This function performs some basic statistical analysis on the various labels.

def do_analysis(df, column_name):
    all_vals = [y for x in df[column_name] for y in x]
    print(f"Analysis across all samples for {column_name}")
    print(f'Mean: {np.mean(all_vals)}')
    print(f'Max: {np.max(all_vals)}')
    print(f'Min: {np.min(all_vals)}')
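    # np.atleast_1d handles both return shapes: older SciPy gives an array, newer gives a scalar.
    print(f'Mode: {np.atleast_1d(mode(all_vals).mode)[0]}')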
    print(f'STD: {np.std(all_vals)}')
    print()
    
    plt.hist(all_vals)
    plt.title(f'Histogram for {column_name} across all samples')
    plt.show()
    
    print("Statistics aggregated per sample")
    fig, axes = plt.subplots(1, 4, figsize=(15, 5), squeeze=False)

    df[column_name].apply(
        lambda x: np.mean(x)).plot(
            kind='hist',
            bins=50, ax=axes[0,0],
            title=f'Mean dist {column_name}')

    df[column_name].apply(
        lambda x: np.max(x)).plot(
            kind='hist',
            bins=50, ax=axes[0,1],
            title=f'Max dist {column_name}')

    df[column_name].apply(
        lambda x: np.min(x)).plot(
            kind='hist',
            bins=50, ax=axes[0,2],
            title=f'Min dist {column_name}')
    df[column_name].apply(
        lambda x: np.std(x)).plot(
            kind='hist',
            bins=50, ax=axes[0,3],
            title=f'Std {column_name}')
    plt.show()
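
The function reports two different views: statistics pooled over every position in every sample, and the distribution of statistics computed per sample. The toy example below uses the standard-library `statistics` module to show how the two differ. The values are made up, and `pstdev` is used because it is the population standard deviation, which is what `np.std` computes by default.

```python
import statistics

# Toy label arrays for three samples (illustrative values only).
toy_labels = [
    [0.1, 0.2, 0.3],
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 4.0],
]

# Pooled view: flatten everything, then summarise once.
pooled = [value for sample in toy_labels for value in sample]
print("pooled mean:", statistics.mean(pooled))
print("pooled std:", statistics.pstdev(pooled))

# Per-sample view: summarise each sample, then look at the spread of summaries.
per_sample_means = [statistics.mean(sample) for sample in toy_labels]
per_sample_stds = [statistics.pstdev(sample) for sample in toy_labels]
print("per-sample means:", per_sample_means)
print("per-sample stds:", per_sample_stds)
```

A sample with constant values has zero spread of its own but still contributes to the pooled spread, which is why the two sets of plots can look quite different.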

#### reactivity

In [None]:
do_analysis(train_df, 'reactivity')

Let's run the same analysis for the other labels:

#### deg_Mg_pH10

In [None]:
do_analysis(train_df, 'deg_Mg_pH10')

#### deg_50C

In [None]:
do_analysis(train_df, 'deg_50C')

#### deg_Mg_50C

In [None]:
do_analysis(train_df, 'deg_Mg_50C')

### Bpps files <a id=3e></a>

We are also provided a series of `bpps` files, though they aren't described anywhere on the competition home page. [This](https://www.kaggle.com/c/stanford-covid-vaccine/discussion/182021) is a solid discussion on them.

The approach I've used for plotting the first 25 of them is based on [this](https://www.kaggle.com/isaienkov/openvaccine-eda-feature-engineering-modeling) helpful notebook.

In [None]:
bpps_files = os.listdir('../input/stanford-covid-vaccine/bpps/')
example_bpps = np.load(f'../input/stanford-covid-vaccine/bpps/{bpps_files[0]}')
print('bpps file shape:', example_bpps.shape)

In [None]:
plt.style.use('default')
fig, axs = plt.subplots(5, 5, figsize=(15, 15))
axs = axs.flatten()
for i, f in enumerate(bpps_files):
    if i == 25:
        break
    example_bpps = np.load(f'../input/stanford-covid-vaccine/bpps/{f}')
    axs[i].imshow(example_bpps)
    axs[i].set_title(f)
plt.tight_layout()
plt.show()
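
As far as I understand, each `bpps` file holds a base-pairing probability matrix, where entry (i, j) estimates the probability that positions i and j are paired. Treat that reading as an assumption, since the files are not documented on the competition page. Under that assumption, summing a row gives the probability that a position is paired at all, which could serve as a per-position feature. The sketch uses a made-up 4x4 matrix in plain lists so it runs without the competition files; with a real NumPy array, `example_bpps.sum(axis=1)` would compute the same row sums.

```python
# Toy symmetric base-pairing probability matrix (made-up numbers).
toy_bpps = [
    [0.0, 0.0, 0.1, 0.8],
    [0.0, 0.0, 0.6, 0.1],
    [0.1, 0.6, 0.0, 0.0],
    [0.8, 0.1, 0.0, 0.0],
]

# Probability that each position is paired with any partner.
paired_prob = [sum(row) for row in toy_bpps]

# Most likely partner for each position, and how confident that pairing is.
best_partner = [max(range(len(row)), key=row.__getitem__) for row in toy_bpps]
best_prob = [max(row) for row in toy_bpps]

for i, (p, j, q) in enumerate(zip(paired_prob, best_partner, best_prob)):
    print(f"position {i}: paired prob {p:.2f}, best partner {j} ({q:.2f})")
```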

## Modelling approaches (WIP) <a id=4></a>

In progress.

#### Treat as a sequence-to-sequence problem.

A sequence-to-sequence model such as a Transformer or an LSTM could be a good starting point. However, care should be taken that the model can output variable-length sequences, as the private test set has longer sequences than anything in the training set.
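
One common way to feed variable-length sequences to such a model is to map characters to integer ids, pad each batch to its longest sequence, and carry a mask so padded positions are ignored. Here is a framework-agnostic sketch; the vocabulary, `PAD_ID` and `encode_batch` are my own illustrative choices.

```python
PAD_ID = 0
VOCAB = {"A": 1, "C": 2, "G": 3, "U": 4}


def encode_batch(sequences):
    """Convert strings to padded id lists plus a 1/0 mask of real positions."""
    max_len = max(len(seq) for seq in sequences)
    ids, mask = [], []
    for seq in sequences:
        tokens = [VOCAB[char] for char in seq]
        padding = max_len - len(tokens)
        ids.append(tokens + [PAD_ID] * padding)
        mask.append([1] * len(tokens) + [0] * padding)
    return ids, mask


ids, mask = encode_batch(["GGAAAACC", "ACGU"])
print(ids)
print(mask)
```

Because nothing here hard-codes a length, the same encoding works for both the shorter public sequences and the longer private ones.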

#### Treat each row in isolation

Each sequence character could be treated as a single isolated row and used to predict each output label. A GBM tool like XGBoost could then be used, as is standard in structured-data competitions. However, this approach disregards the surrounding context of each character within its sequence.
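
A possible middle ground is to keep the one-row-per-position layout but add neighbouring characters as extra columns, so a tabular model sees some local context. This is only a sketch: `position_rows`, the `window` size and the pad character `N` are illustrative choices.

```python
def position_rows(sequence, structure, window=1, pad="N"):
    """Build one feature dict per position, including neighbours within `window`."""
    rows = []
    for i in range(len(sequence)):
        row = {"position": i, "base": sequence[i], "structure": structure[i]}
        for offset in range(-window, window + 1):
            if offset == 0:
                continue
            j = i + offset
            inside = 0 <= j < len(sequence)
            # Column names look like base_-1 and base_+1.
            row[f"base_{offset:+d}"] = sequence[j] if inside else pad
            row[f"structure_{offset:+d}"] = structure[j] if inside else pad
        rows.append(row)
    return rows


rows = position_rows("GGAAAACC", "((....))", window=1)
for row in rows[:3]:
    print(row)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` and one-hot encoded before training a tree-based model.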

I suspect a combination of the 2 approaches will be the best bet for this competition.