<center>
<h1>Exploring some derived features to improve regression via ML Models</h1>
</center>

---

In this notebook we explore feature engineering on the raw features: we derive additional features from the peptide sequences and their modifications.

In [None]:
import os
import sys 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sns.set_style("darkgrid")

sys.path.append("..")
from src.data import replace_mod, load_data

In [None]:
data = load_data("../data/Peptides_and_iRT.tsv")
data.fillna(-1).head(15)

## Construct more features:

For the sequences, we will look at the counts of every type of amino acid in the sequence. 

To incorporate more information about the modifications, we also record the position in the peptide at which each modification occurs, in addition to whether and how many modifications occur.


In [None]:
# Rows that carry at least one modification
mod_index = data.query("is_mod == 1").index
# Capture the label written inside square brackets in the raw sequence
re_mod = re.compile(r"\[([\+A-Za-z0-9]+)\]")
data.loc[mod_index, "modification"] = data.query("is_mod == 1")["sequence_raw"].str.findall(re_mod)

# Number of modifications and their start offsets in the raw sequence string
data.loc[mod_index, "modification_num"] = data.loc[mod_index, "modification"].apply(len)
data.loc[mod_index, "modification_loc"] = data.query("is_mod == 1")["sequence_raw"].apply(lambda s: [match.span()[0] for match in re.finditer(re_mod, s)])

# One location column and one type column per modification slot
max_mod = data.loc[mod_index, "modification_loc"].apply(len).max()
for m in range(max_mod):
    data.loc[mod_index, f"modification_loc_{m + 1}"] = data.loc[mod_index, "modification_loc"].apply(lambda locs: locs[m] if len(locs) > m else -1)
    data.loc[mod_index, f"modification_type_{m + 1}"] = data.loc[mod_index, "modification"].apply(lambda mods: mods[m] if len(mods) > m else "")

    # Encode the modification type as integer category codes ("" covers rows without this slot)
    data.loc[:, f"modification_type_{m+1}"] = data.loc[:, f"modification_type_{m+1}"].fillna("").astype("category").cat.codes

data.fillna(-1).head(15)
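
To make the position feature concrete, here is a minimal sketch that applies the same regular expression to a single made-up peptide string, using only the standard library. The example sequence and its modification labels are assumptions for illustration and are not taken from the dataset. Note that `match.span()[0]` is an offset into the raw string, so every bracketed tag to the left shifts the offsets of later modifications; the sketch also shows one possible way to express the position as a residue count instead.

```python
import re

# Same pattern as in the cell above: captures the label inside [...]
re_mod = re.compile(r"\[([\+A-Za-z0-9]+)\]")

# Hypothetical raw sequence; the real format of `sequence_raw` may differ.
toy_sequence = "AC[+57]DM[Oxi]K"

# Labels and raw-string start offsets, as computed in the notebook
mod_labels = re_mod.findall(toy_sequence)
mod_offsets = [m.span()[0] for m in re_mod.finditer(toy_sequence)]

# Alternative: number of residues preceding each modification, obtained by
# subtracting the characters taken up by earlier tags.
residue_positions = []
removed = 0
for m in re_mod.finditer(toy_sequence):
    residue_positions.append(m.start() - removed)
    removed += m.end() - m.start()

print(mod_labels)         # labels found in the toy sequence
print(mod_offsets)        # offsets into the raw string
print(residue_positions)  # offsets with earlier tags removed
```

Whether raw-string offsets or residue counts make the better feature is a design choice; the notebook uses the raw-string offsets.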

### Split the sequences

In [None]:
mod_types = list(f"[{s}]" for s in data.query("is_mod == 1")["modification"].explode().unique() if len(s) > 0)
data.loc[:, "sequence_proc"] =  data.loc[:, "sequence_raw"].apply(lambda s: replace_mod(s, mod_types))

### Using Tfidf and count vectorizers, inspect AA frequencies and occurrences in documents

In [None]:
vocabulary = data["sequence_proc"].explode().unique()
len(vocabulary)

In [None]:
vec = CountVectorizer(token_pattern=r"(?u)\b\w\d?\b", vocabulary=list(vocabulary), lowercase=False)
vec.fit_transform(data["sequence_proc"].apply(lambda s: " ".join(s)).to_numpy()).toarray()[0]
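
The following sketch illustrates what the `token_pattern` used above matches. It mimics, with plain `re` and `collections.Counter`, the tokenize-then-count step that `CountVectorizer` performs when it is given a fixed vocabulary; it is not the library implementation. The toy vocabulary and token list are made up, and `count_vector` is a hypothetical helper.

```python
import re
from collections import Counter

# Same pattern as passed to the vectorizers: one word character,
# optionally followed by one digit, delimited by word boundaries.
token_pattern = re.compile(r"(?u)\b\w\d?\b")


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in a token list."""
    document = " ".join(tokens)              # same joining as in the notebook
    found = token_pattern.findall(document)  # tokens the pattern accepts
    counts = Counter(found)
    return [counts[v] for v in vocabulary]   # out-of-vocabulary tokens are ignored


# Hypothetical tokens: plain residues, a residue with a digit suffix,
# and a two-letter token that the pattern does not match.
toy_vocabulary = ["A", "C", "C1", "K"]
toy_tokens = ["A", "C1", "A", "K", "C", "Ox"]

toy_counts = count_vector(toy_tokens, toy_vocabulary)
print(dict(zip(toy_vocabulary, toy_counts)))
```

Tokens longer than one word character plus one digit, such as `"Ox"` here, are dropped silently by this pattern, so it is worth checking that every entry of `vocabulary` has that shape.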

In [None]:
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w\d?\b", vocabulary=list(vocabulary), lowercase=False)
tfidf.fit_transform(data["sequence_proc"].apply(lambda s: " ".join(s)).to_numpy()).toarray()[0]

In [None]:
plt.figure()
# Note: 1 / idf is a monotone proxy for document frequency, not the raw fraction of sequences
pd.DataFrame(zip(tfidf.vocabulary, 1 / tfidf.idf_), columns=["Amino Acid", "doc frequency"]). \
    sort_values("doc frequency", ascending=False).plot.bar(x="Amino Acid", figsize=(8,8), color="darkred", rot=45, legend=False)


plt.title("Amino Acid document frequencies")
plt.tight_layout()
plt.show()
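
The quantity plotted above, `1 / tfidf.idf_`, orders the amino acids by document frequency but is not the frequency itself. Assuming the `TfidfVectorizer` defaults (`smooth_idf=True`), scikit-learn computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing token `t`. The sketch below, with made-up toy documents and hypothetical helper names, shows the formula and how it could be inverted to recover the actual document frequency.

```python
import math


def smoothed_idf(doc_freq, n_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def doc_freq_from_idf(idf, n_docs):
    """Invert the smoothed idf to get back the document count df."""
    return (1 + n_docs) / math.exp(idf - 1.0) - 1


# Made-up tokenized sequences, for illustration only
toy_docs = [["A", "C", "A"], ["C", "K"], ["C"], ["A", "K"]]
toy_vocab = ["A", "C", "K"]
n_docs = len(toy_docs)

# Number of documents in which each token appears at least once
df = {v: sum(v in doc for doc in toy_docs) for v in toy_vocab}
idf = {v: smoothed_idf(df[v], n_docs) for v in toy_vocab}

for v in toy_vocab:
    recovered = doc_freq_from_idf(idf[v], n_docs)
    print(v, df[v], round(idf[v], 3), round(recovered / n_docs, 3))
```

In the notebook, the same inversion could be applied to `tfidf.idf_` with `n_docs = len(data)` if the true fraction of sequences containing each amino acid is wanted on the y-axis.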

## Assessment of vectorizers: Which vectorization produces stronger correlations?

### Count vectorizer:

In [None]:
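# One count column per vocabulary entry
data.loc[:, [f"AAcount_{v}" for v in vocabulary]] = vec.fit_transform(data["sequence_proc"].apply(lambda s: " ".join(s)).to_numpy()).toarray()

# Restrict to numeric columns: the string- and list-valued columns cannot be correlated
feat_correlations = data.select_dtypes(include="number").fillna(-1).corr(method="pearson")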

plt.figure(figsize=(20,20))
sns.heatmap(feat_correlations,
           square=True,
           center=0,
           annot=np.round(feat_correlations,2),
           fmt="",
           linewidths=.5,
           cmap="vlag",
           cbar_kws={"shrink": 0.8})

plt.tight_layout()
plt.show()

### Tfidf:

In [None]:
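# Overwrite the AAcount_ columns with the Tfidf weights so both heatmaps share the same layout
data.loc[:, [f"AAcount_{v}" for v in vocabulary]] = tfidf.fit_transform(data["sequence_proc"].apply(lambda s: " ".join(s)).to_numpy()).toarray()

# Restrict to numeric columns: the string- and list-valued columns cannot be correlated
feat_correlations = data.select_dtypes(include="number").fillna(-1).corr(method="pearson")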

plt.figure(figsize=(20,20))
sns.heatmap(feat_correlations,
           square=True,
           center=0,
           annot=np.round(feat_correlations,2),
           fmt="",
           linewidths=.5,
           cmap="vlag",
           cbar_kws={"shrink": 0.8})

plt.tight_layout()
plt.show()

## Conclusions
The count vectorizer produces stronger correlation patterns between iRT and the amino acid types than Tfidf does. This does not rule out Tfidf features for the regression task, but it is a first hint that count vectorization may yield more expressive features.


### Code export:

The results of this notebook have been written to `src/data/preprocess.py`.