
Predicting the Cancerous Gene Variations via Machine Learning Techniques

Personalized Cancer Diagnosis

Software

Jupyter Notebook  

Packages

1) Pandas
2) SciPy
3) Scikit-Learn
4) Seaborn
5) NLTK
6) NumPy
7) Plotly
8) Matplotlib

Installation of Packages

  • Open cmd and type the following commands:
  pip3 install pandas
  pip3 install matplotlib
  pip3 install nltk
  pip3 install numpy
  pip3 install scipy
  pip3 install scikit-learn
  pip3 install seaborn
  pip3 install plotly

Concepts Used

  • Hyperparameter Tuning
  • K-Nearest Neighbours
  • Logistic Regression
  • Exploratory Data Analysis
  • Support Vector Machine
  • Random Forest Classifier
  • One Hot Encoding
  • Response Encoding / Mean Value Replacement
  • Naive Bayes
  • Laplace Smoothing

Problem Overview

  • Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
    Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

  • We have two data files:

    1) One contains information about the genetic mutations.

    2) The other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.

  • Both data files share a common column called ID.

  • Data files' columns:

    1) training_variants (ID, Gene, Variation, Class)

    2) training_text (ID, Text)

  • A genetic mutation can be classified into one of 9 different classes => a multi-class classification problem.

  • Performance Metric(s) to be used:

    1) Multi-class log-loss

    2) Confusion matrix

Objective & Constraints

  • Objective: Predict the probability of each data-point belonging to each of the nine classes.
  • Constraints:
    1. Interpretability
    2. Class probabilities are needed.
    3. Penalize errors in the predicted class probabilities => the metric is log-loss (see the sketch below).
    4. No Latency constraints.
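
Since log-loss is the primary metric, here is a minimal sketch of how multi-class log-loss can be computed with scikit-learn; the labels and probability rows below are toy values, not taken from the dataset.

  import numpy as np
  from sklearn.metrics import log_loss

  # Toy example: 3 data points, 9 possible classes (1..9).
  y_true = [1, 4, 7]

  # Each row is a predicted probability distribution over the 9 classes.
  y_pred = np.full((3, 9), 0.05)
  y_pred[0, 0] = 0.6   # fairly confident in class 1 for the first point
  y_pred[1, 3] = 0.6   # class 4 for the second point
  y_pred[2, 6] = 0.6   # class 7 for the third point

  # Rows already sum to 1 (8 * 0.05 + 0.6); log_loss penalizes putting
  # low probability on the true class.
  print(log_loss(y_true, y_pred, labels=list(range(1, 10))))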

Workflow Analysis

Step 1

  1. Reading Gene and Variation Data

  2. Reading Text Data

  3. Preprocessing of Text

  4. Splitting the Data into Train, Test and Cross Validation (64:20:16); see the split sketch after this list
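
A minimal sketch of how the two files could be read, merged on ID and split in a stratified 64:20:16 ratio. The file names and the '||' separator follow the Kaggle competition data; variable names such as data, X_train, X_cv and X_test are illustrative and are reused in the later sketches.

  import pandas as pd
  from sklearn.model_selection import train_test_split

  # training_text uses '||' as its field separator (per the Kaggle data description).
  variants = pd.read_csv('training_variants')
  text = pd.read_csv('training_text', sep=r'\|\|', engine='python',
                     skiprows=1, names=['ID', 'Text'])
  data = pd.merge(variants, text, on='ID', how='left')

  # 20% held out as the test set, stratified on the class label.
  X_train, X_test, y_train, y_test = train_test_split(
      data, data['Class'], test_size=0.2, stratify=data['Class'], random_state=42)

  # 20% of the remaining 80% becomes the cross-validation set (0.8 * 0.2 = 16%).
  X_train, X_cv, y_train, y_cv = train_test_split(
      X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

  print(X_train.shape[0], X_cv.shape[0], X_test.shape[0])   # roughly 64% / 16% / 20%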


Clearly, Classes 1, 2, 4 and 7 have noticeably more data points than the others. This class distribution will influence how the models are likely to perform.

Step 2

  1. Prediction using a Random Model


This random model acts as a baseline for the other models, i.e. we should aim to keep every other model's log-loss below that of the random model.
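
A sketch of such a baseline, reusing y_cv from the split sketch above: every point gets a random probability distribution over the 9 classes, and the resulting log-loss is the number the real models have to beat.

  import numpy as np
  from sklearn.metrics import log_loss

  rng = np.random.default_rng(0)

  # One random probability row per CV point, normalized so each row sums to 1.
  random_probs = rng.random((len(y_cv), 9))
  random_probs /= random_probs.sum(axis=1, keepdims=True)

  baseline = log_loss(y_cv, random_probs, labels=list(range(1, 10)))
  print(f"Random-model CV log-loss: {baseline:.3f}")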

Step 3

  1. Univariate Analysis

    Uni means one and variate means variable, so univariate analysis looks at a single variable at a time. Its objective is to describe and summarize the data and to analyze the patterns present in it; each variable in the dataset is explored separately. It applies to both kinds of variables, categorical and numerical.

We do the Univariate Analysis on 3 Features (a small sketch for the Gene feature follows this list):

i) Gene (Categorical Variable)


ii) Variation (Categorical Variable)


iii) Text Feature (Words)
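
A small sketch of what the univariate analysis of the Gene feature can look like, reusing X_train from the split sketch above; Variation and the text feature are explored in the same spirit.

  import numpy as np

  gene_counts = X_train['Gene'].value_counts()
  print("Unique genes in train:", gene_counts.shape[0])
  print(gene_counts.head(10))                     # most frequent genes

  # Cumulative share of training points covered by the most frequent genes.
  cumulative = np.cumsum(gene_counts.values) / gene_counts.sum()
  print("Share covered by the 50 most frequent genes:", round(cumulative[49], 3))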



  2. Stacking The Three Types of Features
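
A minimal sketch of stacking the three feature blocks with one-hot style encodings: CountVectorizer turns Gene, Variation and the clinical text into sparse matrices, and scipy.sparse.hstack glues them together (the vectorizer settings here are illustrative, not the notebook's exact ones).

  from scipy.sparse import hstack
  from sklearn.feature_extraction.text import CountVectorizer

  gene_vec = CountVectorizer()
  var_vec = CountVectorizer()
  text_vec = CountVectorizer(min_df=3)      # drop very rare words

  # Fit on the training split only, then reuse the fitted vectorizers elsewhere.
  train_x = hstack((gene_vec.fit_transform(X_train['Gene']),
                    var_vec.fit_transform(X_train['Variation']),
                    text_vec.fit_transform(X_train['Text'].astype(str)))).tocsr()

  cv_x = hstack((gene_vec.transform(X_cv['Gene']),
                 var_vec.transform(X_cv['Variation']),
                 text_vec.transform(X_cv['Text'].astype(str)))).tocsr()

  print(train_x.shape, cv_x.shape)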

Step 4

  1. Machine Learning Models

I) Naive Bayes
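
A sketch (not the notebook's exact code) of a Multinomial Naive Bayes model on the stacked features: the Laplace smoothing parameter alpha is tuned on the CV set, and CalibratedClassifierCV is used so the predicted probabilities behave well under log-loss.

  from sklearn.naive_bayes import MultinomialNB
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  best_alpha, best_loss = None, float('inf')
  for alpha in [1e-5, 1e-3, 0.1, 1, 10, 100]:        # Laplace smoothing strengths
      clf = CalibratedClassifierCV(MultinomialNB(alpha=alpha), method='sigmoid')
      clf.fit(train_x, y_train)
      loss = log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_)
      if loss < best_loss:
          best_alpha, best_loss = alpha, loss

  print(f"Best alpha = {best_alpha}, CV log-loss = {best_loss:.3f}")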

My Output:

II) K Nearest Neighbour
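
A sketch of k-nearest neighbours with the number of neighbours tuned on the CV set. For a self-contained example it reuses the stacked one-hot features; in practice a lower-dimensional representation such as response coding can work better for KNN.

  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.metrics import log_loss

  best_k, best_loss = None, float('inf')
  for k in [5, 11, 15, 21, 31, 41, 51]:
      knn = KNeighborsClassifier(n_neighbors=k)
      knn.fit(train_x, y_train)
      loss = log_loss(y_cv, knn.predict_proba(cv_x), labels=knn.classes_)
      if loss < best_loss:
          best_k, best_loss = k, loss

  print(f"Best k = {best_k}, CV log-loss = {best_loss:.3f}")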

My Output:

III) Logistic Regression with Class Balancing
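
A sketch of logistic regression with class balancing, implemented here as an SGD classifier with logistic loss, class_weight='balanced' to counter the skewed class distribution, and sigmoid calibration on top; the regularization strength alpha would normally be tuned the same way as above.

  from sklearn.linear_model import SGDClassifier
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  # loss='log_loss' on scikit-learn >= 1.1 (loss='log' on older versions).
  lr = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4,
                     class_weight='balanced', random_state=42)
  clf = CalibratedClassifierCV(lr, method='sigmoid')
  clf.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_))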

My Output:

IV) Logistic Regression without Class Balancing

My Output:

V) Linear Support Vector Machine
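
A sketch of a linear SVM, again via SGD but with hinge loss; since hinge loss does not produce probabilities directly, the calibration step is what makes log-loss computable.

  from sklearn.linear_model import SGDClassifier
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  svm = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4,
                      class_weight='balanced', random_state=42)
  clf = CalibratedClassifierCV(svm, method='sigmoid')
  clf.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_))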

My Output:

VI) Random Forest Classifier
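
A sketch of a random forest on the stacked features with calibrated probabilities; the hyperparameters shown are illustrative and would normally be tuned on the CV set.

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  rf = RandomForestClassifier(n_estimators=1000, max_depth=10,
                              n_jobs=-1, random_state=42)
  clf = CalibratedClassifierCV(rf, method='sigmoid')
  clf.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, clf.predict_proba(cv_x), labels=clf.classes_))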

My Output:

Summarized Log-Losses and Misclassified Points


Logistic Regression with Class Balancing using One-Hot Encoding gave the lowest misclassified-points percentage of all the individual models.


The Voting Classifier, which combines Naive Bayes, Logistic Regression and a Support Vector Machine, gave the lowest misclassified-points percentage of all the models. However, its interpretability is almost negligible, so it is not recommended here.
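
A sketch of the idea behind that ensemble: a soft-voting classifier that averages the calibrated probabilities of the logistic regression, linear SVM and Naive Bayes models (hyperparameters are illustrative, not the notebook's exact configuration).

  from sklearn.ensemble import VotingClassifier
  from sklearn.linear_model import SGDClassifier
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.metrics import log_loss

  lr = CalibratedClassifierCV(
      SGDClassifier(loss='log_loss', class_weight='balanced', random_state=42),
      method='sigmoid')
  svm = CalibratedClassifierCV(
      SGDClassifier(loss='hinge', class_weight='balanced', random_state=42),
      method='sigmoid')
  nb = CalibratedClassifierCV(MultinomialNB(alpha=1.0), method='sigmoid')

  vote = VotingClassifier(estimators=[('lr', lr), ('svm', svm), ('nb', nb)],
                          voting='soft')
  vote.fit(train_x, y_train)

  print("CV log-loss:", log_loss(y_cv, vote.predict_proba(cv_x), labels=vote.classes_))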

References & Resources
