# Preprocessing 

**In this notebook:**
* Loading raw text and variation data 
* Adding features such as physiochemical distance (Grantham, 1974) 
* Vectorization of the data
* Dimensionality reduction
* Saving the features

## Imports

In [1]:
import os
import pandas as pd
import sys

sys.path.append("../utils/")
from preprocessing import *
from numpy import save

In [2]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [3]:
data_path = "../../data/msk-redefining-cancer-treatment"

## Load Data

### Training Data - Text and Genetic Variants Information

In [4]:
training_merge_df = get_data(
    text_file_path="raw/training_text", variants_file_path="raw/training_variants"
)
training_size = training_merge_df.shape[0]
print("Number of Training Samples", training_size)

Number of Training Samples 3316


### Validation Data - Text and Genetic Variants Information

In [5]:
validation_merge_df = get_data(
    text_file_path="raw/test_text",
    variants_file_path="raw/test_variants",
    solution_file_path="raw/stage1_solution_filtered.csv",
)
validation_size = validation_merge_df.shape[0]
print("Number of Validation Samples:", validation_size)

Number of Validation Samples: 367


### Test Data - Text and Genetic Variants Information

In [6]:
test_merge_df = get_data(
    text_file_path="raw/stage2_test_text.csv",
    variants_file_path="raw/stage2_test_variants.csv",
    solution_file_path="raw/stage_2_private_solution.csv",
)
test_size = test_merge_df.shape[0]
print("Number of Test Samples:", test_size)

Number of Test Samples: 125


### Concatenate Data

In [7]:
df = pd.concat(
    [training_merge_df, validation_merge_df, test_merge_df],
    axis=0,
    ignore_index=True,
    sort=False,
)

# free memory
del training_merge_df, validation_merge_df, test_merge_df

df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...


In [8]:
shuffled = True
if shuffled:
    df.sample(frac=1)
    print(
        "Data got shuffled. Therefore results are not comparable anymore to the Kaggle competition"
    )

Data got shuffled. Therefore results are not comparable anymore to the Kaggle competition


## Transform Data

### Extract Text Sections


Extracting text sections of the entire text information given. Thus reducing the size of text input. 

In [9]:
# extract section
df['Text'] = df[['Text', 'Gene','Variation', 'ID']].apply(lambda x: extract_text_sections(x[0].lower(),x[1].lower(),x[2].lower(), 1000), axis=1)
df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text
0,0,FAM58A,Truncating Mutations,1,cyclin-dependent kinases (cdks) regulate a var...
1,1,CBL,W802*,2,ncer (nsclc) is a heterogeneous group of disor...
2,2,CBL,Q249E,2,ll lung cancer (nsclc) is a heterogeneous grou...
3,3,CBL,N454D,3,alysis but failed to detect any further sequen...
4,4,CBL,L399V,4,d) compared to either a549 or hek293t cells (m...


### Include Additional Features

Adding information about the variation, such as original amino acid, replaced amino acid and location of replacement. Along with this, information regarding whether the original and replaced amino acid had somewhat similar properties is included. 

####  (Optional) Physiochemical Distance

In [10]:
df["PhsysiochemDistance"] = df.apply(
    lambda row: get_phsysiochem_distance(row.Variation), axis=1
)
df["PhsysiochemDistance"].fillna((df["PhsysiochemDistance"].mean()), inplace=True)
df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text,PhsysiochemDistance
0,0,FAM58A,Truncating Mutations,1,cyclin-dependent kinases (cdks) regulate a var...,50.801628
1,1,CBL,W802*,2,ncer (nsclc) is a heterogeneous group of disor...,50.801628
2,2,CBL,Q249E,2,ll lung cancer (nsclc) is a heterogeneous grou...,29.0
3,3,CBL,N454D,3,alysis but failed to detect any further sequen...,23.0
4,4,CBL,L399V,4,d) compared to either a549 or hek293t cells (m...,32.0


#### Features based on blog: Literature parsing for mutation classification

These additional features are taken from the blog written by McAteer, Matthew contain information about text length, gene and variation description. https://matthewmcateer.me/blog/literature-parsing-for-mutation-classification/

In [11]:
for columns in df.columns:
    if df[columns].dtype == "object":
        if columns in ["Gene", "Variation"]:
            lbl = LabelEncoder()
            df[columns + "_lbl_enc"] = lbl.fit_transform(df[columns].values)
            df[columns + "_len"] = df[columns].map(lambda x: len(str(x)))
            df[columns + "_words"] = df[columns].map(lambda x: len(str(x).split(" ")))
        elif columns != "Text" and columns != "PC_distance":
            lbl = LabelEncoder()
            df[columns] = lbl.fit_transform(df[columns].values)
        if columns == "Text":
            df[columns + "_len"] = df[columns].map(lambda x: len(str(x)))
            df[columns + "_words"] = df[columns].map(lambda x: len(str(x).split(" ")))

df["Gene_Share"] = df.apply(
    lambda r: sum([1 for w in r["Gene"].split(" ") if w in r["Text"].split(" ")]),
    axis=1,
)
df["Variation_Share"] = df.apply(
    lambda r: sum([1 for w in r["Variation"].split(" ") if w in r["Text"].split(" ")]),
    axis=1,
)

df.head()

Unnamed: 0,ID,Gene,Variation,Class,Text,PhsysiochemDistance,Gene_lbl_enc,Gene_len,Gene_words,Variation_lbl_enc,Variation_len,Variation_words,Text_len,Text_words,Gene_Share,Variation_Share
0,0,FAM58A,Truncating Mutations,0,cyclin-dependent kinases (cdks) regulate a var...,50.801628,91,6,1,3010,20,2,17779,2769,0,0
1,1,CBL,W802*,1,ncer (nsclc) is a heterogeneous group of disor...,50.801628,39,3,1,3265,5,1,19066,2937,0,0
2,2,CBL,Q249E,1,ll lung cancer (nsclc) is a heterogeneous grou...,29.0,39,3,1,2158,5,1,19049,2942,0,0
3,3,CBL,N454D,2,alysis but failed to detect any further sequen...,23.0,39,3,1,1896,5,1,9788,1507,0,0
4,4,CBL,L399V,3,d) compared to either a549 or hek293t cells (m...,32.0,39,3,1,1648,5,1,2810,460,0,0


In [12]:
train = df.iloc[:-test_size]
print(train.shape)
test = df.iloc[-test_size:]
print(test.shape)

(3683, 16)
(125, 16)


#### (Optional) Save Data with Additional Features to Interim

In [None]:
train.to_csv(
    os.path.join(data_path, "interim/training_data_additional_features"), index=False
)
test.to_csv(
    os.path.join(data_path, "interim/test_data_additional_features"), index=False
)

### Verctorize, Reduce Dimension and Normalize

#### (Optional) Load saved training and test data

In [None]:
train = pd.read_csv(
    os.path.join(data_path, "interim/training_data_additional_features")
)
test = pd.read_csv(os.path.join(data_path, "interim/test_data_additional_features"))

In [13]:
y_train = train["Class"].values
x_train = train.drop(["Class"], axis=1)
del train

y_test = test["Class"].values
x_test = test.drop(["Class"], axis=1)

del test

In [14]:
x_train.head()

Unnamed: 0,ID,Gene,Variation,Text,PhsysiochemDistance,Gene_lbl_enc,Gene_len,Gene_words,Variation_lbl_enc,Variation_len,Variation_words,Text_len,Text_words,Gene_Share,Variation_Share
0,0,FAM58A,Truncating Mutations,cyclin-dependent kinases (cdks) regulate a var...,50.801628,91,6,1,3010,20,2,17779,2769,0,0
1,1,CBL,W802*,ncer (nsclc) is a heterogeneous group of disor...,50.801628,39,3,1,3265,5,1,19066,2937,0,0
2,2,CBL,Q249E,ll lung cancer (nsclc) is a heterogeneous grou...,29.0,39,3,1,2158,5,1,19049,2942,0,0
3,3,CBL,N454D,alysis but failed to detect any further sequen...,23.0,39,3,1,1896,5,1,9788,1507,0,0
4,4,CBL,L399V,d) compared to either a549 or hek293t cells (m...,32.0,39,3,1,1648,5,1,2810,460,0,0


#### Pipeline

In [15]:
gene_pipeline = Pipeline(
    [
        ("Gene", CustTxtCol("Gene")),
        ("count_Gene", CountVectorizer(analyzer=u"char", ngram_range=(1, 8)),),
        ("tsvd1", TruncatedSVD(n_components=25, n_iter=25, random_state=12),),
    ]
)

variation_pipeline = Pipeline(
    [
        ("Variation", CustTxtCol("Variation")),
        ("count_Variation", CountVectorizer(analyzer=u"char", ngram_range=(1, 8)),),
        ("tsvd2", TruncatedSVD(n_components=25, n_iter=25, random_state=12),),
    ]
)

text_pipeline = Pipeline(
    [
        ("Text", CustTxtCol("Text")),
        (
            "tfidf_Text",
            TfidfVectorizer(
                # Clearning
                preprocessor=custom_preprocessor,
                # Splitting text into tokens(words, senteces, etc.) with at least three characters
                tokenizer=tokenizer_at_least_three,
                ngram_range=(1, 3),
                # Removal of most common words in a language, which usualy does not bring additional meaning
                stop_words=stop_words,
                min_df=1,
            ),
        ),
        ("tsvd3", TruncatedSVD(n_components=50, n_iter=50, random_state=12),),
    ]
)

In [16]:
feature_pipeline = Pipeline(
    [
        (
            "union",
            FeatureUnion(
                n_jobs=1,
                transformer_list=[
                    ("standard", CustRegressionVals()),
                    ("gene_pipeline", gene_pipeline),
                    ("variation_pipeline", variation_pipeline),
                    ("text_pipeline", text_pipeline),
                ],
                verbose=True,
            ),
        ),
        ("scaler", MinMaxScaler()),
    ],
    verbose=True,
)

In [17]:
%%time
train_processed = feature_pipeline.fit_transform(x_train)

[FeatureUnion] ...... (step 1 of 4) Processing standard, total=   0.0s
[FeatureUnion] . (step 2 of 4) Processing gene_pipeline, total=   0.3s
[FeatureUnion]  (step 3 of 4) Processing variation_pipeline, total=   1.5s
[FeatureUnion] . (step 4 of 4) Processing text_pipeline, total=20.3min
[Pipeline] ............. (step 1 of 2) Processing union, total=20.3min
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.2s
CPU times: user 15min 50s, sys: 5min 20s, total: 21min 10s
Wall time: 20min 19s


In [18]:
%%time
test_processed = feature_pipeline.transform(x_test)

CPU times: user 38.6 s, sys: 5.32 s, total: 44 s
Wall time: 44.6 s


In [19]:
print("Shape of processed training data:".ljust(35), train_processed.shape)
print("Shape of processed test data:".ljust(35), test_processed.shape)

Shape of processed training data:   (3683, 111)
Shape of processed test data:       (125, 111)


## Save Data

In [21]:
# save to npy file
save(os.path.join(data_path, "processed/x_train_shuffled"), train_processed)
save(os.path.join(data_path, "processed/x_test_shuffled"), test_processed)
save(os.path.join(data_path, "processed/y_train_shuffled"), y_train)
save(os.path.join(data_path, "processed/y_test_shuffled"), y_test)