# Preprocessing 

**In this notebook:**
* Loading raw text and variation data 
* Adding features such as physiochemical distance (Grantham, 1974) 
* Vectorization of the data
* Dimension reduction
* Saving the features

## Imports

In [1]:
import os
import pandas as pd
import sys

sys.path.append("../utils/")
from our_preprocessor import *
from preprocessing import get_phsysiochem_distance

In [10]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import LabelEncoder

## Load Data

In [3]:
data_path = "../../data/msk-redefining-cancer-treatment"

### Training Data - Text and Genetic Variants Information

In [4]:
training_variants_df = pd.read_csv(os.path.join(data_path, "raw/training_variants"))
training_text_df = pd.read_csv(
    os.path.join(data_path, "raw/training_text"),
    sep="\|\|",
    engine="python",
    skiprows=1,
    names=["ID", "Text"],
)

training_merge_df = training_variants_df.merge(
    training_text_df, left_on="ID", right_on="ID"
)

# Delete Samples with Empty Text
training_merge_df = training_merge_df.loc[~training_merge_df.Text.isnull()]

training_size = training_merge_df.shape[0]
print("Number of Training Samples", training_size)
training_merge_df.head()

Number of Training Samples 3316


Unnamed: 0,ID,Gene,Variation,Class,Text
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...


### Validation Data - Text and Genetic Variants Information

In [5]:
validation_variants_df = pd.read_csv(os.path.join(data_path, "raw/test_variants"))
validation_text_df = pd.read_csv(
    os.path.join(data_path, "raw/test_text"),
    sep="\|\|",
    engine="python",
    skiprows=1,
    names=["ID", "Text"],
)

valdiation_merge_df = validation_variants_df.merge(
    validation_text_df, left_on="ID", right_on="ID"
)

# Validation Solution
validation_class_df = pd.read_csv(
    os.path.join(data_path, "raw/stage1_solution_filtered.csv")
)
validation_class_df.columns = ["ID", 1, 2, 3, 4, 5, 6, 7, 8, 9]
validation_class_df = (
    validation_class_df.melt("ID", var_name="Class")
    .query("value== 1")
    .sort_values(["ID", "Class"])
    .drop("value", 1)
)
valdiation_merge_df = valdiation_merge_df.merge(
    validation_class_df, left_on="ID", right_on="ID"
)

# Delete Samples with Empty Text
valdiation_merge_df = valdiation_merge_df.loc[~valdiation_merge_df.Text.isnull()]

validation_size = valdiation_merge_df.shape[0]
print("Number of Validation Samples:", validation_size)
valdiation_merge_df.head()

Number of Validation Samples: 367


Unnamed: 0,ID,Gene,Variation,Text,Class
0,12,TET2,Y1902A,TET proteins oxidize 5-methylcytosine (5mC) on...,1
1,19,MTOR,D2512H,Genes encoding components of the PI3K-Akt-mTOR...,2
2,21,KIT,D52N,Myeloproliferative disorders (MPD) constitute ...,2
3,55,SPOP,F125V,"In the largest E3 ligase subfamily, Cul3 binds...",4
4,64,KEAP1,C23Y,Keap1 is the substrate recognition module of a...,4


### Test Data - Text and Genetic Variants Information

In [6]:
# Test Data - Text and Genetic Variants Information
test_variants_df = pd.read_csv(os.path.join(data_path, "raw/stage2_test_variants.csv"))
test_text_df = pd.read_csv(
    os.path.join(data_path, "raw/stage2_test_text.csv"),
    sep="\|\|",
    engine="python",
    skiprows=1,
    names=["ID", "Text"],
)

test_merge_df = test_variants_df.merge(test_text_df, left_on="ID", right_on="ID")
print("Number of Test Samples:", test_merge_df.shape[0])

# Test Solution
test_class_df = pd.read_csv(os.path.join(data_path, "raw/stage_2_private_solution.csv"))
test_class_df.columns = ["ID", 1, 2, 3, 4, 5, 6, 7, 8, 9]
test_class_df = (
    test_class_df.melt("ID", var_name="Class")
    .query("value== 1")
    .sort_values(["ID", "Class"])
    .drop("value", 1)
)
test_merge_df = test_merge_df.merge(test_class_df, left_on="ID", right_on="ID")
test_size = test_merge_df.shape[0]
print("Number of Test Samples:", test_size)
test_merge_df.head()

Number of Test Samples: 986
Number of Test Samples: 125


Unnamed: 0,ID,Gene,Variation,Text,Class
0,8,RNF6,G244D,Human ESCCs 2 occur frequently worldwide (1) ....,4
1,15,ERBB2,G746S,The protein-kinase family is the most frequent...,9
2,16,TP53,Y234S,Among the best-studied therapeutic targets in ...,8
3,18,EGFR,P546S,Head and neck squamous cell carcinoma (HNSCC) ...,2
4,19,ERBB2,G279E,Functional characterization of cancer-associat...,2


## Transform Data

In [6]:
df = pd.concat(
    [training_merge_df, valdiation_merge_df, test_merge_df],
    axis=0,
    ignore_index=True,
    sort=False,
)

# free memory
del training_merge_df, valdiation_merge_df, test_merge_df

NameError: name 'test_merge_df' is not defined

In [7]:
df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...


### Including Additional Features

Adding information about the variation, such as original amino acid, replaced amino acid and location of replacement. Along with this, information regarding whether the original and replaced amino acid had somewhat similar properties is included. 

#### Physiochemical Distance

In [8]:
df["PhsysiochemDistance"] = df.apply(
    lambda row: get_phsysiochem_distance(row.Variation), axis=1
)
df["PhsysiochemDistance"].fillna((df["PhsysiochemDistance"].mean()), inplace=True)
df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text,PhsysiochemDistance
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...,50.592068
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...,50.592068
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...,29.0
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...,23.0
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...,32.0


#### Features from Literature parsing for mutation classification

These additional features are taken from the blog written by McAteer, Matthew contain information about text length, gene and variation description. https://matthewmcateer.me/blog/literature-parsing-for-mutation-classification/

In [17]:
for columns in df.columns:
    if df[columns].dtype == "object":
        if columns in ["Gene", "Variation"]:
            lbl = LabelEncoder()
            df[columns + "_lbl_enc"] = lbl.fit_transform(df[columns].values)
            df[columns + "_len"] = df[columns].map(lambda x: len(str(x)))
            df[columns + "_words"] = df[columns].map(lambda x: len(str(x).split(" ")))
        elif columns != "Text" and columns != "PC_distance":
            lbl = LabelEncoder()
            df[columns] = lbl.fit_transform(df[columns].values)
        if columns == "Text":
            df[columns + "_len"] = df[columns].map(lambda x: len(str(x)))
            df[columns + "_words"] = df[columns].map(lambda x: len(str(x).split(" ")))

df["Gene_Share"] = df.apply(
    lambda r: sum([1 for w in r["Gene"].split(" ") if w in r["Text"].split(" ")]),
    axis=1,
)
df["Variation_Share"] = df.apply(
    lambda r: sum([1 for w in r["Variation"].split(" ") if w in r["Text"].split(" ")]),
    axis=1,
)

df.head()

KeyboardInterrupt: 

#### (Optional) Save Data with Additional Features to Interim

In [None]:
df.iloc[:-test_size].to_csv(os.path.join(data_path, "interim/training_data_additional_features"))
df.iloc[-test_size:].to_csv(os.path.join(data_path, "interim/test_data_additional_features"))

### Verctorize and Reduce Dimension

In [17]:
x_train = df.iloc[:-test_size]; print(x_train.shape)  
x_test = df.iloc[-test_size:]; print(x_test.shape)

y_train = train['Class'].values
x_train = train.drop(['Class'], axis=1)

y_test = test['Class'].values
x_test = test.drop(['Class'], axis=1)

del df 
del test

(3683, 5)
(125, 5)


In [None]:
# Feature Union in Pipeline macht das Sinn? 

In [19]:
class cust_regression_vals(BaseEstimator, TransformerMixin):
    #
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        x = x.drop(['Gene', 'Variation','ID','Text'],axis=1).values
        return x
    
class cust_txt_col(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        print(x[self.key].apply(str))
        return x[self.key].apply(str)    

In [22]:

combined_features = Pipeline(
    [
        (
            "union",
            FeatureUnion(
                n_jobs=1,
                transformer_list=[
                    # customized features as physiochemical distance 
                    ("additional_features", cust_regression_vals()),
                    
                    # vectorizing gene name
                    (
                        "pi1",
                        Pipeline(
                            [
                               # ("Gene", cust_txt_col("Gene")),
                                (
                                    "count_Gene",
                                    CountVectorizer(
                                        analyzer=u"char", ngram_range=(1, 8)
                                    ),
                                ),
                                (
                                    "tsvd1",
                                    TruncatedSVD(
                                        n_components=25, n_iter=25, random_state=12
                                    ),
                                ),
                            ]
                        ),
                    ),
                    (
                        "pi2",
                        Pipeline(
                            [
                                ("Variation", cust_txt_col("Variation")),
                                (
                                    "count_Variation",
                                    CountVectorizer(
                                        analyzer=u"char", ngram_range=(1, 8)
                                    ),
                                ),
                                (
                                    "tsvd2",
                                    TruncatedSVD(
                                        n_components=25, n_iter=25, random_state=12
                                    ),
                                ),
                            ]
                        ),
                    ),
                    (
                        "pi3",
                        Pipeline(
                            [
                                ("Text", cust_txt_col("Text")),
                                (
                                    "tfidf_Text",
                                    TfidfVectorizer(
                                        preprocessor=my_preprocessor,
                                        tokenizer=tokenizeratleastthree,
                                        ngram_range=(1, 3),
                                        stop_words=stopWords,
                                        min_df=1,
                                    ),
                                ),
                                (
                                    "tsvd3",
                                    TruncatedSVD(
                                        n_components=50, n_iter=50, random_state=12
                                    ),
                                ),
                            ]
                        ),
                    ),
                ],
            ),
        )
    ]
)

In [23]:
%%time
train_transformed = combined_features.fit_transform(train[:50])
print(train_transformed.shape)

0          Truncating Mutations
1                         W802*
2                         Q249E
3                         N454D
4                         L399V
5                         V391I
6                         V430M
7                      Deletion
8                         Y371H
9                         C384R
10                        P395A
11                        K382E
12                        R420Q
13                        C381A
14                        P428L
15                        D390Y
16         Truncating Mutations
17                        Q367P
18                        M374V
19                        Y371S
20                         H94Y
21                        C396R
22                        G375P
23                        S376F
24                        P417A
25                        H398Y
26                          S2G
27                        Y846C
28                        C228T
29                        H412Y
30                        H876Q
31      

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 50 and the array at index 1 has size 5

## Save Data