# Data Overview

**In this notebook:**

* Training data is loaded
* Data is explored with pandas profiling
* Samples without text are deleted
* Text sections surrounding the gene variation name are extracted
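
The deletion step listed above can be sketched roughly as follows. This is a minimal illustration on toy data, not the implementation behind `get_data`; the helper name `drop_samples_without_text` is hypothetical, and it assumes the text is stored in a `Text` column.

```python
import pandas as pd


def drop_samples_without_text(df: pd.DataFrame, text_col: str = "Text") -> pd.DataFrame:
    """Return a copy of df without rows whose text is missing or blank."""
    # Treat NaN/None and whitespace-only strings as "no text".
    has_text = df[text_col].fillna("").astype(str).str.strip().ne("")
    return df[has_text].reset_index(drop=True)


# Toy data for illustration only (not taken from the real dataset).
toy_df = pd.DataFrame(
    {
        "ID": [0, 1, 2, 3],
        "Text": ["Some abstract ...", None, "   ", "Another abstract ..."],
    }
)
cleaned_df = drop_samples_without_text(toy_df)
print(len(toy_df) - len(cleaned_df), "samples without text removed")
```

Whether whitespace-only entries should count as missing is a design choice; the sketch treats them as missing so that empty strings do not slip through a plain `dropna`.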

**Key insights:**

* 3321 training samples
* 5 samples without text (3316 remain after removal)
* Unbalanced dataset
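
One way to quantify the imbalance is to look at the relative class frequencies. The following is a minimal sketch on toy labels; the values are invented for illustration, and in the notebook the same calls would presumably be applied to the `Class` column of the loaded data.

```python
import pandas as pd

# Toy labels for illustration only (not the real class distribution).
toy_labels = pd.Series([1, 1, 1, 1, 2, 2, 4, 7, 7, 7], name="Class")

# Relative frequency of each class, sorted by class label.
class_share = toy_labels.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the most and least frequent class as a rough imbalance indicator.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```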

## Imports

In [1]:
import os
import sys

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

sys.path.append("../utils/")

from preprocessing import extract_text_sections
from preprocessing import get_data

## Load Raw Data
Load, explore, and prepare all required data.

In [2]:
data_path = "../../data/msk-redefining-cancer-treatment"

In [3]:
# Training Data - Text and Genetic Variants Information
training_merge_df = get_data(
    text_file_path="raw/training_text", variants_file_path="raw/training_variants"
)
training_size = training_merge_df.shape[0]
print("Number of Training Samples", training_size)
training_merge_df.head()

# Validation Data - Text and Genetic Variants Information
validation_merge_df = get_data(
    text_file_path="raw/test_text",
    variants_file_path="raw/test_variants",
    solution_file_path="raw/stage1_solution_filtered.csv",
)
validation_size = validation_merge_df.shape[0]
print("Number of Validation Samples:", validation_size)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
raw_data_df = pd.concat([training_merge_df, validation_merge_df], sort=False)

Number of Training Samples 3316
Number of Validation Samples: 367


**Class Definitions:**
* 1: Likely Loss-of-function
* 2: Likely Gain-of-function
* 3: Neutral
* 4: Loss-of-function
* 5: ...

### Classification Example

In [4]:
raw_data_df[raw_data_df["Variation"] == "V391I"]

| | ID | Gene | Variation | Class | Text |
|---|---|---|---|---|---|
| 5 | 5 | CBL | V391I | 4 | Oncogenic mutations in the monomeric Casitas B... |


In [5]:
raw_data_df[raw_data_df["Variation"] == "V391I"]["Text"].tolist()[0][
    31228 - 35 : 31228 + 43
]

'mutations (L399V, G375P, P395A and V391I) which attenuated the CBL E3 activity'

In the text belonging to the **CBL V391I** genetic variation, we find the passage *'mutations (L399V, G375P, P395A and V391I) which attenuated the CBL E3 activity'*. Attenuated E3 activity is consistent with label 4, which indicates a loss of function.
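
The hard-coded character offset in the cell above can also be located programmatically. Below is a minimal sketch, assuming a plain substring search for the variation name is sufficient. `extract_window` is a hypothetical helper for illustration; it is not the `extract_text_sections` function imported from `preprocessing`, whose implementation is not shown here.

```python
def extract_window(text: str, term: str, before: int = 35, after: int = 43) -> list:
    """Return the text windows around every occurrence of term."""
    windows = []
    start = text.find(term)
    while start != -1:
        # Clamp the left edge so the slice never wraps around with a negative index.
        left = max(0, start - before)
        windows.append(text[left : start + after])
        start = text.find(term, start + len(term))
    return windows


# Sentence quoted from the CBL V391I example above.
sample = "mutations (L399V, G375P, P395A and V391I) which attenuated the CBL E3 activity"
print(extract_window(sample, "V391I"))
```

The default window sizes mirror the `- 35` and `+ 43` offsets used in the slice above. A plain substring search would also match the term inside longer tokens, so a real implementation might need word-boundary handling.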

### Explore Raw Data

In [9]:
ProfileReport(raw_data_df).to_notebook_iframe()

## Load Data with Additional Features

In [10]:
train_processed = pd.read_csv(
    os.path.join(data_path, "interim/training_data_additional_features")
)

In [11]:
ProfileReport(train_processed).to_notebook_iframe()