# Week 6

## 1. Progress
We trained the insertion, deletion and indel models using the training data available on FigShare. This
resulted in six models: 2 regularizer types * 3 models. The paper itself is not very clear on this, but the
supplementary material mentions that for the final model the authors simply pick the best-performing model, whether L1
or L2, at the regularization strength where it performs best.
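
The selection rule described above can be sketched as follows. This is a minimal illustration, assuming the validation MSE for each regularizer and strength has already been collected. The function name `select_best_model` and the numbers in the usage lines are hypothetical and are not taken from our runs or from the paper's code.

```python
import numpy as np


def select_best_model(val_mse, strengths):
    """Return (regularizer, strength, mse) with the lowest validation MSE.

    val_mse: dict mapping a regularizer name ("l1", "l2") to a sequence of
             validation MSE values, one per entry in `strengths`.
    """
    best = None
    for reg, errors in val_mse.items():
        errors = np.asarray(errors, dtype=float)
        idx = int(np.argmin(errors))  # best strength for this regularizer
        candidate = (reg, float(strengths[idx]), float(errors[idx]))
        if best is None or candidate[2] < best[2]:
            best = candidate
    return best


# Made-up numbers, only to show the call
strengths = [1e-4, 1e-3, 1e-2]
val_mse = {"l1": [0.5, 0.2, 0.4], "l2": [0.45, 0.3, 0.35]}
best_reg, best_strength, best_error = select_best_model(val_mse, strengths)
```

Comparing the per-regularizer minima this way mirrors how we read the MSE plots by eye: find the dip for each regularizer, then take the lower of the two.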

We also rewrote some of the training code to support better Jupyter widgets and to make the differences between
the model types clearer during training.

### Training the models and reproducing regularization strength images
![Insertion MSE](../../assets/InsertionModelMSE.png)
Since there is a slight dip in L1 around 10e-3, we opted for the L1 model.

![Deletion MSE](../../assets/DeletionModelMSE.png)
Since there is a clear dip in L1 around 10e-3, we opted for the L1 model.

![Indel MSE](../../assets/IndelModelMSE.png)
Since there is a slightly lower dip in L2 after 10e-3, we opted for the L2 model.

### Running our trained model on the test set and generating the MSE count figure (6.B)
After training these three models, we wrote a script that runs the test set through them. This allowed us
to plot the MSE count histogram, estimating the bin size that was used in Figure 6.B:
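```python
import matplotlib.pyplot as plt
import numpy as np

# run_test_set, file and file_dir are defined elsewhere in our script
errors = run_test_set(file, file_dir)

# Tick labels in steps of 0.2; tick positions are the same values scaled by 1e-3
x_range_label = np.arange(0, 1.6, .2)
x_range_label_str = [str(round(x, 1)) for x in x_range_label]
x_range = (10 ** -3) * x_range_label

fig, ax = plt.subplots(nrows=1)
# The bin count is our estimate of the binning used in Figure 6.B
ax.hist(errors, bins=6 * len(x_range_label), color='aqua', alpha=.4, edgecolor='grey')
ax.xaxis.set_ticks(x_range)
ax.xaxis.set_ticklabels(x_range_label_str)
ax.xaxis.set_ticks_position('bottom')
ax.set_xlim(x_range[0] - x_range[1] / 9, x_range[-1])
ax.set_xlabel("MSE (1e-3)")  # axis values are shown in units of 1e-3
ax.set_ylabel("Count")
```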

Running this produces a plot that differs little from Figure 6.B:
<p float="left">
  <img src="../../assets/6BMseCount.png" style="float:left;"/>
  <img src="../../assets/MseCount.png" style="float:left;"/>
</p>

To generate this image, we adapted a piece of Predictor.py, using Figure 6.A as a reference:

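```python
# Inputs for the three sub-models: full sequence for the indel ratio,
# last 6 bases for insertions, and x_in (defined outside this excerpt) for deletions
input_indel = onehotencoder(seq)
input_ins = onehotencoder(seq[-6:])
input_del = x_in

# w1..w3 and b1..b3 are the trained weights and biases of the three models
dratio, insratio = softmax(np.dot(input_indel, w1) + b1)
ds = softmax(np.dot(input_del, w2) + b2)
ins = softmax(np.dot(input_ins, w3) + b3)
# Scale each class distribution by its ratio and join them into one profile
y_hat = np.concatenate((ds * dratio, ins * insratio), axis=None)
errors.append(mse(y_hat, y_out))
```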
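
The excerpt above relies on `softmax` and `mse` helpers that are not shown. The sketch below gives one plausible definition of each and demonstrates that the combined prediction is a valid probability distribution. These definitions are our assumption and may differ from the ones in Predictor.py. `combine_profile` is a hypothetical name, and the vector sizes in the toy example are arbitrary rather than the real class counts.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow
    return e / e.sum()


def mse(y_hat, y):
    """Mean squared error between two equally shaped vectors."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((y_hat - y) ** 2))


def combine_profile(del_probs, ins_probs, del_ratio, ins_ratio):
    """Scale each class distribution by its predicted ratio and concatenate."""
    return np.concatenate((del_probs * del_ratio, ins_probs * ins_ratio), axis=None)


# Toy example with random logits (sizes are arbitrary, not the real class counts)
rng = np.random.default_rng(0)
del_ratio, ins_ratio = softmax(rng.normal(size=2))
del_probs = softmax(rng.normal(size=5))
ins_probs = softmax(rng.normal(size=3))
y_hat = combine_profile(del_probs, ins_probs, del_ratio, ins_ratio)
```

Because the two ratios sum to 1 and each class distribution sums to 1, the concatenated `y_hat` also sums to 1, which is what makes it comparable to a normalized profile.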

### Issues with profiles and basepairs
We quickly noticed that to create a model similar to the one in Predictor.py we would need the 65bp sequence, not
the 20bp sequence found in the target set. We then contacted the TAs, who explained that
we were supposed to generate our own targets, features and Lindel profiles from the algient file and the rep matrices.

This made sense, because it had been hard to see what the challenge was beyond running the models on the training/test set
provided on FigShare. From that moment we split the tasks:

1. One group needs to figure out how to use the matrices, feature_index_all and algient files to construct the binary features
required to train the model. The hard part here is generating the microhomology binary features; the sequence features are simply
one-hot encoded from the 20bp target and are easy to obtain.


2. The other group needs to figure out how to use the same data to generate the Lindel profiles. Progress here is already
good: we now understand how the matrix files correspond to the labels, and that we can compute a total frequency for each target and event
and normalize it to get the Lindel profile.
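
The two easier building blocks from the list above can be sketched as follows: one-hot encoding a target sequence and normalizing per-event counts into a profile. This is a minimal sketch under our own assumptions. The function names are hypothetical, the encoder covers single nucleotides only (the real feature set also needs the microhomology features and may include more), and the counts are made up.

```python
import numpy as np

NUCLEOTIDES = "ACGT"


def one_hot_encode(seq):
    """Single-nucleotide one-hot encoding: 4 binary features per position.

    Raises ValueError if the sequence contains a base outside A/C/G/T.
    """
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)), dtype=int)
    for pos, base in enumerate(seq.upper()):
        encoded[pos, NUCLEOTIDES.index(base)] = 1
    return encoded.flatten()


def counts_to_profile(counts):
    """Normalize per-event counts for one target into a probability profile.

    Returns None when the target has no observed events, since a profile
    cannot be defined in that case.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return None
    return counts / total


# Hypothetical inputs: a 2bp sequence and counts for four event classes
features = one_hot_encode("AC")
profile = counts_to_profile([30, 10, 0, 60])
```

For a 20bp target this encoder yields 80 binary features. Targets without any observed events are returned as `None` so that they can be filtered out before training rather than producing a division by zero.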

## 2. Plan for Thursday/this week

1. We want to be able to generate our own training/test set files based on the files mentioned above
    a. If this gets done, we want to train our model on them and see how it performs compared to the provided sets

2. We aim to understand how redundant deletion events can be combined in order to end up with ~450 classes as opposed to
the current 558

## 3. Questions for Thursday

1. We were able to train the Lindel model on the Lindel dataset in roughly 1 hour. Assuming that generating this data
does not take longer (which so far seems to be the case), do we need to train a model on the combination of FORECasT/Lindel data?
Why else would we need the Google credits?


2. If we reproduce the model to this extent (test/training set generation from matrices/algient files, training models based on that
and feeding weights to Predictor.py in order to make it functional), what is the next step? As far as we understand it, will we
then technically have reproduced the results of the paper?
