<p style="font-size:36px;text-align:center"> <b>Personalized Cancer Diagnosis</b> </p>

<h2>1. Understanding Business Problem</h2>

<h3>1.1 Description</h3>

A lot has been said during the past several years about how precision medicine and, more concretely, how genetic testing is going to disrupt the way diseases like cancer are treated. But this is only partially happening due to the huge amount of manual work still required.

Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).

Currently this interpretation of genetic mutations is being done manually. This is a very time-consuming task where a clinical pathologist has to manually review and classify every single genetic mutation based on evidence from text-based clinical literature.

MSKCC is making available an expert-annotated knowledge base where world-class researchers and oncologists have manually annotated thousands of mutations.

The goal is therefore to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

Data provided by: Memorial Sloan Kettering Cancer Center (MSKCC)

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/


<h3>1.2 Real world/Business objectives and constraints</h3>

* Predict the class label for a given gene mutation of interest
* No low-latency requirement
* Model interpretability is important
* Errors can be very costly
* The probability of a data point belonging to each class is needed

<h2>2. Machine Learning Problem Formulation</h2>

<h3>2.1 Machine Learning objective and constraints</h3>

Objective: Given a data point, predict the probability that it belongs to each of the nine classes<br>

Constraints:
* Interpretability
* Class probabilities are needed
* High cost of error
* No latency constraints

<h3>2.2 Understanding Data</h3>

In [1]:
# IMPORT BLOCK
import pandas as pd

In [8]:
# READ TEXT DATA
text_data = pd.read_csv(
    "training/training_text",
    sep=r"\|\|",
    engine="python",
    names=["ID", "Text"],
    skiprows=1
)
print(f"Text data consists of: \n {text_data}")
print(f"Length of text data: {len(text_data)}")
print(f"Headers of text data : {text_data.head()}")

Text data consists of: 
         ID                                               Text
0        0  Cyclin-dependent kinases (CDKs) regulate a var...
1        1   Abstract Background  Non-small cell lung canc...
2        2   Abstract Background  Non-small cell lung canc...
3        3  Recent evidence has demonstrated that acquired...
4        4  Oncogenic mutations in the monomeric Casitas B...
...    ...                                                ...
3316  3316  Introduction  Myelodysplastic syndromes (MDS) ...
3317  3317  Introduction  Myelodysplastic syndromes (MDS) ...
3318  3318  The Runt-related transcription factor 1 gene (...
3319  3319  The RUNX1/AML1 gene is the most frequent targe...
3320  3320  The most frequent mutations associated with le...

[3321 rows x 2 columns]
Length of text data: 3321
Headers of text data :    ID                                               Text
0   0  Cyclin-dependent kinases (CDKs) regulate a var...
1   1   Abstract Background  Non-small cell
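
The `sep=r"\|\|"` argument in the read above is a regular expression: `|` is the alternation operator in regex syntax, so each pipe has to be escaped to match a literal `||`. The sketch below reproduces that split on a single line using only the standard `re` module. The sample line and the `parse_line` helper are illustrative assumptions, not part of the dataset or of pandas.

```python
import re

# Hypothetical sample line in the "ID||Text" layout of training_text.
sample_line = "7||Some abstract text with a single | pipe inside"

def parse_line(line):
    """Split one raw line into (ID, Text) on the literal '||' delimiter."""
    # maxsplit=1 keeps any later '||' occurrences inside the text field
    # (a choice made for this helper only).
    raw_id, text = re.split(r"\|\|", line, maxsplit=1)
    return int(raw_id), text

record_id, text = parse_line(sample_line)
print(record_id, repr(text))
```

A single `|` inside the text is left untouched because the pattern only matches two consecutive pipes.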

In [9]:
# READ CLASS DATA
class_data = pd.read_csv("training/training_variants")
print(f"Class data consists of: \n {class_data}")
print(f"Length of class data: {len(class_data)}")
print(f"Headers of class data : {class_data.head()}")

Class data consists of: 
         ID    Gene             Variation  Class
0        0  FAM58A  Truncating Mutations      1
1        1     CBL                 W802*      2
2        2     CBL                 Q249E      2
3        3     CBL                 N454D      3
4        4     CBL                 L399V      4
...    ...     ...                   ...    ...
3316  3316   RUNX1                 D171N      4
3317  3317   RUNX1                 A122*      1
3318  3318   RUNX1               Fusions      1
3319  3319   RUNX1                  R80C      4
3320  3320   RUNX1                  K83E      4

[3321 rows x 4 columns]
Length of class data: 3321
Headers of class data :    ID    Gene             Variation  Class
0   0  FAM58A  Truncating Mutations      1
1   1     CBL                 W802*      2
2   2     CBL                 Q249E      2
3   3     CBL                 N454D      3
4   4     CBL                 L399V      4


<h3>2.3 Mapping the business problem to ML problem</h3>

<h4>2.3.1 Type of Machine Learning Model</h4>

A given gene mutation can be classified into one of 9 unique classes: {1, 2, 3, 4, 5, 6, 7, 8, 9}.<br>
So the ML model will be a multiclass classification model.

<h4>2.3.2 Performance Metric (Key Performance Indicator)</h4>

Metric(s):
* Multiclass log-loss
* Confusion matrix
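
To make the two metrics concrete: multiclass log-loss is the mean of $-\log p_{i,y_i}$ over all points, where $p_{i,y_i}$ is the probability the model assigns to the true class of point $i$. The following is a minimal, dependency-free sketch; the function names, the clipping constant `eps`, and the toy labels are assumptions made for illustration, not the notebook's own implementation.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class labels in 1..9
    y_prob: list of 9-element probability rows (index 0 is class 1)
    """
    total = 0.0
    for label, row in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model outputs an exact 0 or 1.
        p = min(max(row[label - 1], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes=9):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix

# Toy labels and a uniform "know-nothing" model (illustrative only).
y_true = [1, 2, 2, 4]
uniform = [[1 / 9] * 9 for _ in y_true]
print(multiclass_log_loss(y_true, uniform))  # log(9), about 2.197

# Toy hard predictions: one class-2 point is mistaken for class 3.
y_pred = [1, 2, 3, 4]
cm = confusion_matrix(y_true, y_pred)
print(cm[1])  # row for true class 2
```

The uniform model scores log(9), which gives a reference value that any useful nine-class model should beat. Log-loss also has no upper bound, which is why the confusion matrix is a helpful companion: it shows which classes are being confused with which.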

<h4>2.3.3 Train, Cross Validation and Test Dataset</h4>

* The data has no temporal nature
* Split the dataset randomly into three parts: Train (64%), CV (16%) and Test (20%)
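
A random 64/16/20 split can be sketched with the standard library alone, as below. The `split_indices` helper and the seed value are assumptions for illustration; 3321 is the row count printed in section 2.2. A stratified variant that preserves class proportions in each part is a common alternative, but the plan above specifies only a random split.

```python
import random

def split_indices(n_rows, seed=42):
    """Shuffle row positions and cut them into 64% / 16% / 20% parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_rows * 0.20)
    n_cv = round(n_rows * 0.16)
    test = indices[:n_test]
    cv = indices[n_test:n_test + n_cv]
    train = indices[n_test + n_cv:]  # the remainder, about 64%
    return train, cv, test

train_idx, cv_idx, test_idx = split_indices(3321)
print(len(train_idx), len(cv_idx), len(test_idx))
```

The three index lists are disjoint and together cover every row, so they could be passed to `DataFrame.iloc` to materialize the three parts.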

<h2>3. Data Filtering</h2>

In [16]:
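# CHECK FOR ANY REPETITION IN ID
# Rows of class_data whose ID appears more than once
class_duplicate_rows = class_data[class_data["ID"].duplicated(keep=False)]
class_duplicate_rows.sort_values("ID")

# Rows of text_data whose ID appears more than once
# (the mask must index text_data itself, not class_data)
text_duplicate_rows = text_data[text_data["ID"].duplicated(keep=False)]
text_duplicate_rows.sort_values("ID")


Unnamed: 0,ID,Text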


In [None]:
# MERGING THE DATA
combined_data = text_data.merge(
    class_data,
    on="ID",
    how="left"
)
print(f"Length of combined data: {len(combined_data)}")
print('Features : ', combined_data.columns.values)

Length of combined data: 3321
Features :  ['ID' 'Text' 'Gene' 'Variation' 'Class']


* Check for empty or null values
* Check the class distribution
* Check for missing class values

In [20]:
# count of missing values per column
combined_data.isna().sum()

ID           0
Text         5
Gene         0
Variation    0
Class        0
dtype: int64

In [21]:
# Check for Empty Strings
empty_text = (combined_data["Text"].str.strip() == "").sum()
empty_gene = (combined_data["Gene"].str.strip() == "").sum()
empty_variation = (combined_data["Variation"].str.strip() == "").sum()

empty_text, empty_gene, empty_variation


(np.int64(0), np.int64(0), np.int64(0))

In [24]:
problem_rows = combined_data[
    combined_data.isna().any(axis=1) |
    (combined_data[["Text", "Gene", "Variation"]].apply(
        lambda col: col.astype(str).str.strip() == ""
    ).any(axis=1))
]

problem_rows


Unnamed: 0,ID,Text,Gene,Variation,Class
1109,1109,,FANCA,S1088F,1
1277,1277,,ARID5B,Truncating Mutations,1
1407,1407,,FGFR3,K508M,6
1639,1639,,FLT1,Amplification,6
2755,2755,,BRAF,G596C,7


In [26]:
# drop the rows with null text
combined_data = combined_data.dropna(subset=["Text"])
combined_data.isna().sum()

ID           0
Text         0
Gene         0
Variation    0
Class        0
dtype: int64

In [None]:
# Check Missing Target Values
combined_data["Class"].isna().sum()

np.int64(0)

In [None]:
# data integrity check: fail fast if any null or empty text remains
assert not combined_data["Text"].isna().any()
assert (combined_data["Text"].str.strip() != "").all()
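
The class-distribution check listed among the checks above can be sketched as follows. The labels used here are only the ten `Class` values visible in the printed output of section 2.2 (IDs 0 to 4 and 3316 to 3320), so the resulting proportions are illustrative and say nothing about the full dataset; the `class_distribution` helper is a hypothetical name.

```python
from collections import Counter

# Class labels of the ten rows displayed in section 2.2.
# This is NOT the full dataset, so the proportions are illustrative only.
displayed_classes = [1, 2, 2, 3, 4, 4, 1, 1, 4, 4]

def class_distribution(labels):
    """Return {class: (count, fraction)} sorted by class label."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (n, n / total) for c, n in sorted(counts.items())}

dist = class_distribution(displayed_classes)
for cls, (count, frac) in dist.items():
    print(f"Class {cls}: {count} rows ({frac:.0%})")
```

On the full frame, the same tally could be obtained with `combined_data["Class"].value_counts()`. Classes absent from the result would indicate labels with no examples, which matters for a nine-class model.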
